# Lab 2: Hotel reviews

Generate word clouds for good and bad hotel reviews

Objectives:
- part-of-speech tagging with spaCy
- extract phrases that match a part-of-speech pattern
- scale the processing pipeline with `nlp.pipe`
- compute c-values

In [135]:
import re
from collections import defaultdict

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from cytoolz import *
from tqdm import tqdm
from wordcloud import WordCloud

In [136]:
df = pd.read_pickle("/data/hotels.pkl.gz")

In [137]:
len(df)

231294

In [138]:
# df = df.sample(frac=.1)

In [139]:
len(df)

231294

In [140]:
df.head()

Unnamed: 0,title,text,overall,value,service,cleanliness
0,“Perfect/Reasonable”,"We spent four days at the La Quinta Downtown last week and couldn't have been more pleased. The hotel was impeccably clean, the breakfast was bountiful and the location was convenient to everything you wanted to do. The price was fabulous! We would highly recommend it!",4.0,5.0,3.0,4.0
1,“Absolutely loved it”,"The room was pleasant and the breakfast just fine. But it is the location that makes this hotel remarkable. You step out the door, and TImes Square (and the TKTS booth) is a half block away. I just hope they don't go crazy with the pricing and make this hotel out-of-range for us regular non-business travelers.",4.0,4.0,3.0,4.0
2,“Our regular at NY City”,"Our ladies' troup have become a regular visitor at Casablanca. We simply love the hotel, it's location. Everything is very clean and the staff friendly and helpful. Great location! We will be back for the next trip to NY City",4.0,4.0,4.0,5.0
3,“Reccommended”,"The hotel has a great location, situated behind Wall Street and very close to a number of subway stations. Facilities were great with free wifi and an empty fridge in the hotel room, both a rarity it seems. Rooms were small but ample (and in line with the size of most NY hotel rooms!) and the hotel was clean and the staff helpful. Would definately stay again.",4.0,4.0,4.0,4.0
4,“Great hotel!!”,"I am currently staying at this hotel while I find an apartment in NYC. I moved in here a week ago after a bad experience at a Ramada a few blocks southwest of this place. I LOVE it here. The beds, first of all, are the most comfy beds EVER. I also love having full control of the AC/heater (don't assume that all NYC hotels do this, I learned the hard way). It's always as hot or cold as I like it. The free internet access is great... I'd rather it be wireless, but if I had to choose between wired or sketchy wireless, I'd choose wired, so it's fine with me. The hotel staff is soooo friendly and accomodating. Everyone greets you with a smile and helps when you need something. The hotel itself is very clean and updated. I know the curb appeal isn't great with all of the scaffolding, but just look beyond that and walk into your own little home in the city. My only minor complaint is that room service is too expensive and doesn't really have a great variety of food. I know most people eat out when they're in NYC, but I'm stuck here for two weeks on business and sometimes I just want what's easy. Plus, being in midtown, there aren't tons of places right next door to eat (especially on the weekends). But walk a few blocks in any direction, and I guarantee you'll find something to suit your fancy. Oh, and Ray's Original Famous Pizza does deliver here, and it's WONDERFUL. The front desk was kind enough to give me menus of a few places. :) There's tons of shopping six blocks south and one block west, near Macy's... it's easy to walk there and back. And Times Square is an easy walk, too... The Empire State Building is very close, as well, as is Grand Central Terminal, which will take you anywhere you want to go. All in all, I would recommend this place to anyone staying in Manhattan... it truly has easy access to any place in the city and provides you with the comforts of home. You won't regret this one.",5.0,5.0,5.0,5.0


In [141]:
df['text'] = df['title'] + " " + df['text']

In [142]:
max(df['text'].apply(len))

19941

In [143]:
pd.set_option('display.max_colwidth', 20000)

---

## Collect candidate term phrases

Collect all sequences of words that match the part-of-speech pattern `(Adj|Noun)+ Noun`

In [144]:
import spacy
from spacy.matcher import Matcher

In [145]:
nlp = spacy.load(
    "en_core_web_sm", exclude=["parser", "ner", "lemmatizer", "attribute_ruler"]
)

In [146]:
matcher = Matcher(nlp.vocab)
matcher.add(
    "Term",
    [
        [
            {"TAG": {"IN": ["JJ", "NN", "NNS", "NNP"]}},
            {"TAG": {"IN": ["JJ", "NN", "NNS", "NNP", "HYPH"]}, "OP": "*"},
            {"TAG": {"IN": ["NN", "NNS", "NNP"]}},
        ]
    ],
)

In [147]:
def get_phrases(doc):
    spans = matcher(doc, as_spans=True)
    return [tuple(tok.norm_ for tok in span) for span in spans]

In [148]:
doc = nlp(df['text'].iloc[0])

In [149]:
print([(t, t.tag_) for t in doc])

[(“, '``'), (Perfect, 'NNP'), (/, 'SYM'), (Reasonable, 'JJ'), (”, "''"), (We, 'PRP'), (spent, 'VBD'), (four, 'CD'), (days, 'NNS'), (at, 'IN'), (the, 'DT'), (La, 'NNP'), (Quinta, 'NNP'), (Downtown, 'NNP'), (last, 'JJ'), (week, 'NN'), (and, 'CC'), (could, 'MD'), (n't, 'RB'), (have, 'VB'), (been, 'VBN'), (more, 'RBR'), (pleased, 'JJ'), (., '.'), (The, 'DT'), (hotel, 'NN'), (was, 'VBD'), (impeccably, 'RB'), (clean, 'JJ'), (,, ','), (the, 'DT'), (breakfast, 'NN'), (was, 'VBD'), (bountiful, 'JJ'), (and, 'CC'), (the, 'DT'), (location, 'NN'), (was, 'VBD'), (convenient, 'JJ'), (to, 'IN'), (everything, 'NN'), (you, 'PRP'), (wanted, 'VBD'), (to, 'TO'), (do, 'VB'), (., '.'), (The, 'DT'), (price, 'NN'), (was, 'VBD'), (fabulous, 'JJ'), (!, '.'), (We, 'PRP'), (would, 'MD'), (highly, 'RB'), (recommend, 'VB'), (it, 'PRP'), (!, '.')]


In [150]:
print(list(get_phrases(doc)))

[('la', 'quinta'), ('quinta', 'downtown'), ('la', 'quinta', 'downtown'), ('last', 'week'), ('downtown', 'last', 'week'), ('quinta', 'downtown', 'last', 'week'), ('la', 'quinta', 'downtown', 'last', 'week')]
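
The seven phrases above overlap because the matcher reports every span that fits the pattern, not only the longest one. Here is a minimal pure-Python sketch of the same idea, working on already-tagged `(word, tag)` pairs rather than a spaCy `Doc`. The helper name `match_terms` and the three tag sets are illustrative assumptions that mirror the pattern defined earlier, and it lowercases words instead of using spaCy's `norm_`.

```python
# Tag sets mirroring the three slots of the Matcher pattern (assumed names).
FIRST = {"JJ", "NN", "NNS", "NNP"}
MIDDLE = FIRST | {"HYPH"}
LAST = {"NN", "NNS", "NNP"}


def match_terms(tagged):
    """Return every span that fits FIRST MIDDLE* LAST, overlaps included."""
    phrases = []
    for start, (_, first_tag) in enumerate(tagged):
        if first_tag not in FIRST:
            continue
        for end in range(start + 1, len(tagged)):
            tag = tagged[end][1]
            if tag in LAST:
                # Tokens start..end form a complete match.
                phrases.append(tuple(w.lower() for w, _ in tagged[start:end + 1]))
            if tag not in MIDDLE:
                # This token cannot sit inside a longer match, so stop extending.
                break
    return phrases


tagged = [("the", "DT"), ("La", "NNP"), ("Quinta", "NNP"), ("Downtown", "NNP"),
          ("last", "JJ"), ("week", "NN"), ("and", "CC")]
for phrase in match_terms(tagged):
    print(phrase)
```

This yields the same seven phrases as `get_phrases`, though not necessarily in the same order. Keeping the nested phrases is deliberate: the c-value step later needs the frequencies of both the long candidates and the shorter ones they contain.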


In [151]:
candidates = list(
    concat(map(get_phrases, nlp.pipe(tqdm(df["text"]), batch_size=20, n_process=4)))
)

100%|██████████| 231294/231294 [09:23<00:00, 410.76it/s]


In [152]:
import pickle

with open("cands.pkl", "wb") as out:
    pickle.dump(candidates, out)

In [153]:
candidates[:10]

[('la', 'quinta'),
 ('quinta', 'downtown'),
 ('la', 'quinta', 'downtown'),
 ('last', 'week'),
 ('downtown', 'last', 'week'),
 ('quinta', 'downtown', 'last', 'week'),
 ('la', 'quinta', 'downtown', 'last', 'week'),
 ('times', 'square'),
 ('tkts', 'booth'),
 ('half', 'block')]

In [154]:
freqs = defaultdict(nltk.FreqDist)
for c in candidates:
    freqs[len(c)][c] += 1

In [155]:
freqs.keys()

dict_keys([2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 11, 15, 16, 14, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 49, 52, 55, 58, 61, 64, 67, 70, 73, 76, 79, 82, 85, 88, 91, 94, 97, 100, 103, 106, 109, 112, 115, 118, 121])

In [156]:
freqs[22]

FreqDist({('off', '-on', '-', 'off', '-on', '-', 'off', '-on', '-', 'off', '-on', '-', 'off', '-on', '-', 'off', '-on', '-', 'off', '-on', '-', 'off'): 34, ('=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '='): 28, ('new', 'york', "n'hesiter", 'pas', 'cet', 'hotel', 'est', 'au', 'centre', 'de', 'tout', 'la', 'vue', 'est', 'magnifique', 'et', 'le', 'service', 'au', 'top', "n'hesiter", 'pas'): 1, ('york', "n'hesiter", 'pas', 'cet', 'hotel', 'est', 'au', 'centre', 'de', 'tout', 'la', 'vue', 'est', 'magnifique', 'et', 'le', 'service', 'au', 'top', "n'hesiter", 'pas', 'a'): 1, ("n'hesiter", 'pas', 'cet', 'hotel', 'est', 'au', 'centre', 'de', 'tout', 'la', 'vue', 'est', 'magnifique', 'et', 'le', 'service', 'au', 'top', "n'hesiter", 'pas', 'a', 'soliciter'): 1, ('pas', 'cet', 'hotel', 'est', 'au', 'centre', 'de', 'tout', 'la', 'vue', 'est', 'magnifique', 'et', 'le', 'service', 'au', 'top', "n'hesiter", 'pas', 'a', 'soliciter', 'le'): 1,

In [157]:
freqs[10].most_common(10)

[(('=', '=', '=', '=', '=', '=', '=', '=', '=', '='), 64),
 (('off', '-on', '-', 'off', '-on', '-', 'off', '-on', '-', 'off'), 38),
 (('preauth',
   'ocean',
   'park',
   'inn',
   'ocean',
   'park',
   'inn',
   'san',
   'diego',
   'caus'),
  4),
 (('-', 'o', '-', 'o', '-', 'o', '-', 'o', '-', 'o'), 2),
 (('s', '-', 'l', '-', 'o', '-', 'o', '-', 'o', '-'), 2),
 (('o', '-', 'o', '-', 'o', '-', 'o', '-', 'o', '-'), 2),
 (('fast',
   'wireless',
   'high',
   '-',
   'speed',
   'internet',
   'access',
   '-',
   'indoor',
   'pool'),
  1),
 (('nice',
   'lobby',
   '-',
   'friendly',
   'helpful',
   'staff',
   '-',
   'free',
   'airport',
   'shuttle'),
  1),
 (('friendly',
   'helpful',
   'staff',
   '-',
   'free',
   'airport',
   'shuttle',
   '-',
   'good',
   'insulation'),
  1),
 (('new',
   'york',
   "n'hesiter",
   'pas',
   'cet',
   'hotel',
   'est',
   'au',
   'centre',
   'de'),
  1)]

-----

## Extract terms

Calculate c-values for candidate phrases

$$\textrm{C-value}(a)=
\begin{cases}
\log_2|a|\cdot f(a) & \mbox{if } a \mbox{ is not nested}\\
\log_2|a|\left(f(a)-\frac{1}{P(T_a)}\sum_{b\in T_a}f(b)\right) & \mbox{otherwise}\\
\end{cases}
$$

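and select terms above threshold value $\theta$. Here $|a|$ is the number of words in candidate $a$, $f(a)$ is its frequency, $T_a$ is the set of longer candidate terms that contain $a$, and $P(T_a)$ is the number of terms in $T_a$.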
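
To make the formula concrete, here is a small worked example on made-up frequencies. It is a direct, simplified reading of the formula and not the implementation used in this lab: the names `contains` and `c_value_direct` are hypothetical, and every longer candidate counts towards the discount, with no threshold applied.

```python
from math import log2


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value_direct(freq):
    """Score every candidate with the c-value formula (no threshold)."""
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of the longer candidates that contain a (the set T_a).
        nesting = [f_b for b, f_b in freq.items()
                   if len(b) > len(a) and contains(b, a)]
        discount = sum(nesting) / len(nesting) if nesting else 0.0
        scores[a] = log2(len(a)) * (f_a - discount)
    return scores


# Toy frequencies, invented for illustration only.
toy = {
    ("front", "desk"): 10,
    ("desk", "staff"): 4,
    ("front", "desk", "staff"): 4,
}
for term, score in c_value_direct(toy).items():
    print(term, round(score, 3))
```

In this toy data, `front desk staff` is not nested and scores $\log_2 3 \cdot 4 \approx 6.34$. `front desk` occurs 10 times, 4 of them inside the longer term, and scores $1 \cdot (10 - 4) = 6$. `desk staff` never occurs outside the longer term and scores 0.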

In [158]:
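def get_subterms(term):
    """Yield every contiguous sub-sequence of `term` that has at least two words."""
    k = len(term)
    for m in range(k - 1, 1, -1):
        yield from nltk.ngrams(term, m)


def c_value(F, theta):
    """Return a FreqDist of the terms whose c-value exceeds `theta`.

    `F` maps phrase length to a FreqDist of the candidate phrases of that length.
    """
    termhood = nltk.FreqDist()
    # longer[t] collects the frequencies of accepted longer terms that contain t
    longer = defaultdict(list)

    # Process the longest candidates first so that nested terms see their discount
    for k in sorted(F, reverse=True):
        for term in F[k]:
            if term in longer:
                # Average frequency of the longer terms containing this one
                discount = sum(longer[term]) / len(longer[term])
            else:
                discount = 0
            c = np.log2(k) * (F[k][term] - discount)
            if c > theta:
                termhood[term] = c
                # Record this term's frequency against each of its sub-terms
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood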

In [159]:
terms = c_value(freqs, theta=200)

In [160]:
terms.most_common(10)

[(('front', 'desk'), 44853.470588235294),
 (('great', 'location'), 34337.0),
 (('new', 'york'), 22039.0),
 (('times', 'square'), 15835.333333333334),
 (('great', 'hotel'), 14015.0),
 (('room', 'service'), 13001.333333333334),
 (('front', 'desk', 'staff'), 12226.400730562998),
 (('check', '-', 'in'), 10533.660779792803),
 (('good', 'location'), 9844.0),
 (('san', 'francisco'), 9223.0)]

Check threshold: are the items at the bottom of the list (with the lowest c-values) really terms? If not, increase $\theta$ and try again. Repeat until we're happy with the results.
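
The check described above can be sketched as a small helper that applies a candidate threshold and shows the lowest-scoring survivors. The helper name `lowest_terms` is hypothetical, and the scores are a handful of values taken from this notebook's outputs and rounded. On the real data you would call `c_value(freqs, theta=...)` again for each threshold, since changing $\theta$ also changes which longer terms contribute discounts.

```python
def lowest_terms(scores, theta, n=3):
    """Keep terms with a c-value above theta and return the n lowest-scoring ones."""
    kept = [(term, c) for term, c in scores.items() if c > theta]
    return sorted(kept, key=lambda pair: pair[1])[:n]


# A few (term, c-value) pairs from this notebook's outputs, rounded.
scores = {
    ("front", "desk"): 44853.47,
    ("corporate", "rate"): 201.0,
    ("lobby", "level"): 201.0,
    ("=",) * 11: 200.65,
}

for theta in (200, 200.9):
    print(theta, lowest_terms(scores, theta))
```

With `theta=200` the run of `=` signs is the lowest-scoring "term", which suggests the threshold is still slightly too low. Raising it a little removes that entry while keeping phrases such as `corporate rate`.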

In [161]:
terms.most_common()[-10:]

[(('corporate', 'rate'), 201.0),
 (('package', 'deal'), 201.0),
 (('several', 'nights'), 201.0),
 (('shower', 'door'), 201.0),
 (('new', 'furniture'), 201.0),
 (('boston', 'commons'), 201.0),
 (('extra', 'blankets'), 201.0),
 (('amazing', 'place'), 201.0),
 (('lobby', 'level'), 201.0),
 (('=', '=', '=', '=', '=', '=', '=', '=', '=', '=', '='), 200.64703388096325)]

Save terms for later use

In [164]:
with open('terms.txt', 'w') as f:
    for term in terms:
        print(' '.join(term), file=f)

-----

## Multi-word tokenizer

Here we define a tokenizer that recognizes multi-word terms as single tokens
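
Before the spaCy version, here is a minimal pure-Python sketch of the idea: scan the token list and, at each position, merge the longest known term that starts there. The function name `merge_terms` and the sample terms are assumptions for illustration. This greedy left-to-right strategy is a simplification and may resolve overlapping terms differently from the `PhraseMatcher` plus `filter_spans` approach used in this lab.

```python
def merge_terms(tokens, terms, sep="_"):
    """Greedily merge the longest known term starting at each position."""
    max_len = max((len(t) for t in terms), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest possible term first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in terms:
                merged.append(sep.join(candidate))
                i += n
                break
        else:
            # No multi-word term starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


terms = {("la", "quinta"), ("last", "week"),
         ("front", "desk"), ("front", "desk", "staff")}
tokens = "the front desk staff at la quinta helped us last week".split()
print(merge_terms(tokens, terms))
# ['the', 'front_desk_staff', 'at', 'la_quinta', 'helped', 'us', 'last_week']
```

Preferring the longest match means `front desk staff` becomes one token rather than `front_desk` followed by `staff`.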

In [165]:
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

In [166]:
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["tagger", "parser", "ner", "lemmatizer", "attribute_ruler"],
)
phraser = PhraseMatcher(nlp.vocab, attr="LOWER")

In [167]:
with open('terms.txt', 'r') as f:
    phraser.add("TERM", [nlp.tokenizer(t.strip()) for t in f])

In [168]:
def tokenize(text, sep="_"):
    doc = nlp.tokenizer(text)
    with doc.retokenize() as r:
        for span in filter_spans(phraser(doc, as_spans=True)):
            r.merge(span)
    return [t.norm_.replace(" ", sep) for t in doc if not t.is_space and not t.is_punct]

In [169]:
print(tokenize(df['text'].iloc[0]))

['perfect', 'reasonable', 'we', 'spent', 'four', 'days', 'at', 'the', 'la_quinta', 'downtown', 'last_week', 'and', 'could', 'not', 'have', 'been', 'more', 'pleased', 'the', 'hotel', 'was', 'impeccably', 'clean', 'the', 'breakfast', 'was', 'bountiful', 'and', 'the', 'location', 'was', 'convenient', 'to', 'everything', 'you', 'wanted', 'to', 'do', 'the', 'price', 'was', 'fabulous', 'we', 'would', 'highly', 'recommend', 'it']


----

## Word clouds

In this last section, use the tokenizer defined above to make some word clouds comparing good and bad hotels (let's say good = five stars and bad = one or two stars). You might want to look at both the overall rating and some of the sub-scores (like value and service). Can you draw any conclusions that might be useful for hotel owners and managers?

When you are finished, download your notebook file (with a name ending in .ipynb) and submit it via Canvas.

In [None]:
total = nltk.FreqDist(concat(df["text"].apply(tokenize)))
bad = nltk.FreqDist(concat(df.query("overall <= 2 & value <= 2 & service <= 2 ")["text"].apply(tokenize)))
good = nltk.FreqDist(concat(df.query("overall == 5 & value == 5 & service == 5")["text"].apply(tokenize)))

In [None]:
good.most_common(5)

In [None]:
bad.most_common(5)

In [None]:
metrics = nltk.BigramAssocMeasures()
metrics.likelihood_ratio(bad["ugly"], (total["ugly"], bad.N()), total.N())

In [None]:
bad_llr = nltk.FreqDist()
bad_pmi = nltk.FreqDist()

for w in bad:
    if bad[w] > 10:
        bad_llr[w] = metrics.likelihood_ratio(bad[w], (total[w], bad.N()), total.N())
        bad_pmi[w] = metrics.pmi(bad[w], (total[w], bad.N()), total.N())

In [None]:
bad_llr.most_common(10)

In [None]:
bad_pmi.most_common(10)

In [None]:
def cloud(freqs, title, k=50):
    plt.figure(figsize=(8, 8))
    # Draw at most the k highest-scoring words (k was previously unused)
    wc = WordCloud(
        width=750, height=750, background_color="black", max_words=k
    ).generate_from_frequencies(freqs)
    plt.title(title)
    plt.axis("off")
    plt.imshow(wc, interpolation="bilinear")
    plt.show()

In [None]:
cloud(bad, "Bad reviews (by frequency)")

Ranked by raw frequency, the most prominent words in bad reviews are function words that are common in any text: articles, prepositions, auxiliary verbs, and so on.

In [None]:
cloud(bad_pmi, "Bad reviews (by pmi)")

In [None]:
df.query('text.str.contains("Kenlen")')

In [None]:
df.query('text.str.contains("woogo")')

Ranked by PMI, the highest-scoring word in bad reviews is a person's name that is mentioned in only one review. Others, such as bbb, turn out to come mostly from misspellings of words like lobby.

Overall, the high-PMI words in bad reviews relate mostly to customer service and guests' bad experiences with staff. This suggests that improving customer service could make a big difference for hotels with bad reviews.
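
A short worked example shows why PMI favors rare words such as names. The sketch below uses the same arithmetic that, as far as I can tell, `metrics.pmi` applies to these arguments, namely $\log_2\frac{n_{w,\text{bad}}\cdot N}{n_w\cdot N_\text{bad}}$. The function signature and all counts are hypothetical.

```python
from math import log2


def pmi(count_in_bad, count_total, bad_size, total_size):
    """Pointwise mutual information between a word and the bad-review subset."""
    return log2(count_in_bad * total_size) - log2(count_total * bad_size)


# Hypothetical corpus sizes: 10,000 bad-review tokens out of 1,000,000 overall.
bad_size, total_size = 10_000, 1_000_000

# A name that appears 11 times, every time in a bad review.
rare = pmi(11, 11, bad_size, total_size)

# A word that appears 500 times in bad reviews and 20,000 times overall.
common = pmi(500, 20_000, bad_size, total_size)

print(round(rare, 2), round(common, 2))
```

Any word that occurs only in bad reviews reaches the maximum possible score, $\log_2(N / N_\text{bad})$, no matter how few times it occurs. That is why a name from a single review can outrank words that are far more typical of bad reviews.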


In [None]:
cloud(bad_llr, "Bad reviews (by llr)")

LLR does not surface many informative words among the highest-scoring terms for bad reviews, but most of those it does surface seem to point to cleanliness issues.

In [None]:
good_llr = nltk.FreqDist()
good_pmi = nltk.FreqDist()

for w in good:
    if good[w] > 10:
        good_llr[w] = metrics.likelihood_ratio(good[w], (total[w], good.N()), total.N())
        good_pmi[w] = metrics.pmi(good[w], (total[w], good.N()), total.N())


In [None]:
good_llr.most_common(5)

In [None]:
good_pmi.most_common(5)

In [None]:
cloud(good, "Good reviews (by frequency)")

In [None]:
cloud(good_llr, "Good reviews (by llr)")

In [None]:
cloud(good_pmi, "Good reviews (by pmi)")

In [None]:
df.query('text.str.contains("Karli")')

Most of the high-scoring words in good reviews seem to reflect customer satisfaction with how guests were treated. Some of them are the names of customers and of staff members who received high praise.