# Analysis of Yelp Restaurant Reviews with NLP

Adapted from Patrick Harrison's **Modern NLP in Python - Or - What you can learn about food by analyzing a million Yelp reviews**

**All credit goes to Patrick Harrison**. You can find the video of the original presentation [here](https://www.youtube.com/watch?v=6zm9NC9uRkk).

## Outline
1. A tour of the dataset
1. Introduction to text processing with spaCy
1. Automatic phrase modeling
1. Topic modeling with LDA
1. Visualizing topic models with pyLDAvis
1. Word vector models with word2vec
1. Visualizing word2vec with t-SNE

## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [**Yelp**](http://yelp.com) for academic research and educational purposes.

The current iteration of the Yelp dataset (as of this demo) consists of the following data:
- __1,236,101__ users
- __174567__ businesses
- __5,261,669__ user reviews

When focusing on restaurants alone, there are approximately __55K__ restaurants with approximately __3M__ user reviews written about them.

The data is provided in a handful of files in _.json_ format. We'll be using the following files for our demo:
- __business.json__ &mdash; _the records for individual businesses_
- __review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's take a look at a few examples.

The business records consist of _key, value_ pairs containing information about the particular business. A few attributes we'll be interested in for this demo include:
- __business\_id__ &mdash; _unique identifier for businesses_
- __categories__ &mdash; _an array containing relevant category values of businesses_

In [1]:
import warnings
warnings.filterwarnings('ignore')

import json
import os

with open('business.json', 'r') as f:
    first_business_record = json.loads(f.readline())

print(first_business_record)

{'business_id': 'FYWN1wneV18bWNgQjJ2GNg', 'name': 'Dental by Design', 'neighborhood': '', 'address': '4855 E Warner Rd, Ste B9', 'city': 'Ahwatukee', 'state': 'AZ', 'postal_code': '85044', 'latitude': 33.3306902, 'longitude': -111.9785992, 'stars': 4.0, 'review_count': 22, 'is_open': 1, 'attributes': {'AcceptsInsurance': True, 'ByAppointmentOnly': True, 'BusinessAcceptsCreditCards': True}, 'categories': ['Dentists', 'General Dentistry', 'Health & Medical', 'Oral Surgeons', 'Cosmetic Dentists', 'Orthodontists'], 'hours': {'Friday': '7:30-17:00', 'Tuesday': '7:30-17:00', 'Thursday': '7:30-17:00', 'Wednesday': '7:30-17:00', 'Monday': '7:30-17:00'}}


The review records are stored in a similar manner &mdash; _key, value_ pairs containing information about the reviews. A few attributes of note on the review records:
- __business\_id__ &mdash; _indicates which business the review is about_
- __text__ &mdash; _the natural language text the user wrote_

In [2]:
with open('review.json', 'r') as f:
    first_review_record = json.loads(f.readline())
    
print(first_review_record)

{'review_id': 'v0i_UHJMo_hPBq9bxWvW4w', 'user_id': 'bv2nCi5Qv5vroFiqKGopiw', 'business_id': '0W4lkclzZThpx3V65bVgig', 'stars': 5, 'date': '2016-05-28', 'text': "Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \n\nThey ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. \n\nGet a half sour pickle and a hot pepper. Hand cut french fries too.", 'useful': 0, 'funny': 0, 'cool': 0}


The code below extracts the unique restaurant IDs and counts them

In [3]:
restaurant_ids = set()

# open the businesses file
with open('business.json') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if u'Restaurants' not in business[u'categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business[u'business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

54,618 restaurants in the dataset.


We now extract from the reviews the text relative to restaurant reviews, and only those.

In [4]:
%%time

review_txt_filepath = os.path.join('intermediate', 'review_text_all.txt')
if os.path.isfile(review_txt_filepath):
    with open(review_txt_filepath, 'r') as fh:
        for review_count, line in enumerate(review_txt_filepath):
            pass
else:
    review_count = 0

    # create & open a new file in write mode
    with gzip.open(review_txt_filepath, 'w') as review_txt_filepath:

        # open the existing review json file
        with gzip.open('review.json') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review[u'business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_filepath.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1

print('''Text from {:,} restaurant reviews 
         written to the new txt file.'''.format(review_count))

Text from 31 restaurant reviews 
         written to the new txt file.
CPU times: user 659 µs, sys: 0 ns, total: 659 µs
Wall time: 417 µs


![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

SpaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:
- Large English vocabulary, including stopword lists
- Token "probabilities"
- Word vectors

SpaCy Can perform
- Tokenization
- Text normalization, such as lowercasing, stemming/lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

Let's look at one of the reviews.

In [5]:
import spacy
import pandas as pd
import itertools as it

nlp = spacy.load('en')

with open(review_txt_filepath) as f:
    sample_review = list(it.islice(f, 8, 9))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

This is currently my parents new favourite restaurant. 

We come here in the morning for dim sum. They are not the cart pushing type of dim sum, it is order off of the sheet. Dim sum is not bad and not expensive either.

We also frequent the dinner scene. Their set dinner menu is not bad. We typically order a 6 dish menu and it's big enough to feed a family of 9 with leftovers. 

Overall, food is pretty tasty!



### Decomposition into sentences

Let's decompose this review into it's language components. First of all, how many sentences is the review composed of?

In [6]:
parsed_review = nlp(sample_review)

for num, sentence in enumerate(parsed_review.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

Sentence 1:
This is currently my parents new favourite restaurant. 



Sentence 2:
We come here in the morning for dim sum.

Sentence 3:
They are not the cart pushing type of dim sum, it is order off of the sheet.

Sentence 4:
Dim sum is not bad and not expensive either.



Sentence 5:
We also frequent the dinner scene.

Sentence 6:
Their set dinner menu is not bad.

Sentence 7:
We typically order a 6 dish menu and it's big enough to feed a family of 9 with leftovers. 



Sentence 8:
Overall, food is pretty tasty!




### Named-Entity recognition

**GPE** stands for Geo-Political Entity. Apparently the newlines are confusing the library.

In [7]:
for num, entity in enumerate(parsed_review.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: 6 - CARDINAL

Entity 2: 9 - CARDINAL

Entity 3: 
 - GPE



### Parts Of Speech (POS) tagging

Let's decompose the text into POS tags, like verbs, adverbs, adjectives, punctuation, etc.

In [8]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]
unique_token_pos = list(set(token_pos))
explanations = {x: spacy.explain(x) for x in unique_token_pos}
print(explanations)
df_pos = pd.DataFrame(list(zip(token_text, token_pos)),
                      columns=['token_text', 'part_of_speech'])
df_pos['explanation'] = df_pos.part_of_speech.map(explanations)
df_pos[:20]

{'DET': 'determiner', 'SPACE': 'space', 'VERB': 'verb', 'CCONJ': 'coordinating conjunction', 'PART': 'particle', 'NOUN': 'noun', 'ADJ': 'adjective', 'ADV': 'adverb', 'PRON': 'pronoun', 'PUNCT': 'punctuation', 'NUM': 'numeral', 'ADP': 'adposition'}


Unnamed: 0,token_text,part_of_speech,explanation
0,This,DET,determiner
1,is,VERB,verb
2,currently,ADV,adverb
3,my,ADJ,adjective
4,parents,NOUN,noun
5,new,ADJ,adjective
6,favourite,ADJ,adjective
7,restaurant,NOUN,noun
8,.,PUNCT,punctuation
9,\n\n,SPACE,space


Note that `dim` is considered an adjective (as in *dim light*), and `sum` is considered a noun (as in *total sum*).

### Stemming, lemmatization

From [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/): The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

    am, are, is => be
    car, cars, car's, cars' => car 

The result of this mapping of text will be something like:

    the boy's cars are different colors => the boy car be differ color 

In [9]:
token_lemma = [token.lemma_ for token in parsed_review]
df_stem = pd.DataFrame(list(zip(token_text, token_lemma)),
                       columns=['token_text', 'token_lemma'])
df_stem[:20]

Unnamed: 0,token_text,token_lemma
0,This,this
1,is,be
2,currently,currently
3,my,-PRON-
4,parents,parent
5,new,new
6,favourite,favourite
7,restaurant,restaurant
8,.,.
9,\n\n,\n\n


Let's check whether each token is:
- a stop-word, *i.e.*, words that are very frequent in any text, and usually not informative.
- a punctuation
- a whitespace
- a number
- included in spaCy's default vocabulary?

In [10]:
token_attributes = [(token.orth_,
                     # token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df_tok = pd.DataFrame(token_attributes,
                      columns=['text',
                               # 'log_probability',
                               'stop?',
                               'punctuation?',
                               'whitespace?',
                               'number?',
                               'in vocab.?'])

df_tok.loc[:, 'stop?':'in vocab.?'] = (df_tok.loc[:, 'stop?':'in vocab.?'].applymap(
    lambda x: u'Yes' if x else u''))
                                               
df_tok[:20]

Unnamed: 0,text,stop?,punctuation?,whitespace?,number?,in vocab.?
0,This,,,,,Yes
1,is,Yes,,,,Yes
2,currently,,,,,Yes
3,my,Yes,,,,Yes
4,parents,,,,,Yes
5,new,,,,,Yes
6,favourite,,,,,Yes
7,restaurant,,,,,Yes
8,.,,Yes,,,Yes
9,\n\n,,,Yes,,Yes


**Note** that all this works because the text is *generic*. If we had clinical, medical, or other technical language, SpaCy would struggle. This can be remediated, but this is outside of the scope of this notebook.

## Phrase Modeling

So far we have considered single words. Words that frequently co-occur, often refer to a single entity. For example, the Chinese food *dim sum* is currently interpreted as the two English words *dim* and *sum*. The [gensim](https://radimrehurek.com/gensim/) library contains Machine Learning models to identify couples, triplets etc. of words representing a single entity.

In [11]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

The functions below:
1. Checks whether a word is punctuation or whitespace.
2. Fixes the new-line representation in the single reviews.
3. Takes a review, drops punctuation characters, and return the lemmatized version.

In [12]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with open(filename) as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename)):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

Let's a few reviews and see how they have been transformed

In [13]:
%%time
unigram_sentences_filepath = os.path.join('intermediate', 'unigram_sentences_all.txt')

if not os.path.isfile(unigram_sentences_filepath):  
    with open(unigram_sentences_filepath, 'w') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')
else:
    unigram_sentences = LineSentence(unigram_sentences_filepath)

for unigram_sentence in it.islice(unigram_sentences, 0, 15):
    print(' '.join(unigram_sentence))
    print('')

love the staff love the meat love the place prepare for a long line around lunch or dinner hour

-PRON- ask -PRON- how -PRON- want -PRON- meat lean or something maybe -PRON- can not remember

just say -PRON- do not want -PRON- too fatty

get a half sour pickle and a hot pepper

hand cut french fry too

super simple place but amazing nonetheless

-PRON- be be around since the 30 's and -PRON- still serve the same thing -PRON- start with a bologna and salami sandwich with mustard

staff be very helpful and friendly

small unassuming place that change -PRON- menu every so often

cool decor and vibe inside -PRON- 30 seat restaurant

call for a reservation

-PRON- have -PRON- beef tartar and pork belly to start and a salmon dish and lamb meal for main

everything be incredible

-PRON- could go on at length about how all the list ingredient really make -PRON- dish amazing but honestly -PRON- just ne to go

a bit outside of downtown montreal but take the metro out and -PRON- be less than a 10

Next, we'll learn a phrase model that will link individual words into two-word phrases. For example "`pork belly`" will be linked together to form  "`pork_belly`".

In [14]:
intermediate_directory = 'intermediate'
bigram_model_filepath = 'intermediate/bigram_model_all'
bigram_model = Phrases.load(bigram_model_filepath)
bigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'bigram_sentences_all.txt')

trigram_model_filepath = os.path.join(intermediate_directory,
                                      'trigram_model_all')
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'trigram_sentences_all.txt')
trigram_model = Phrases.load(trigram_model_filepath)
trigram_sentences = LineSentence(trigram_sentences_filepath)

for trigram_sentence in it.islice(trigram_sentences, 0, 5):
    print(' '.join(trigram_sentence))
    print('')

love the staff love the meat love the place prepare for a long_line around lunch or dinner hour

-PRON- ask -PRON- how -PRON- want -PRON- meat lean or something maybe -PRON- can not remember

just say -PRON- do_not want -PRON- too fatty

get a half sour pickle and a hot_pepper

hand cut french_fry too



### Text normalization

We further *normalize* the reviews by removing all the *stop-words*, i.e., words that are particularly frequent in any text (like 'the', 'is', 'a' etc). We also remove extra white space, new-lines etc. Note that `best` becomes `good`, plurals are turned into singulars, and so on.

In [15]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')
print('ORIGINAL:' + u'\n')

for review in it.islice(line_review(review_txt_filepath), 0, 1):
    print(review)

print('----' + u'\n')
print('NORMALIZED:' + u'\n')

with open(trigram_reviews_filepath) as f:
    for review in it.islice(f, 0, 1):
        print(review)

ORIGINAL:

Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. 

They ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. 

Get a half sour pickle and a hot pepper. Hand cut french fries too.

----

NORMALIZED:

love staff love meat love place prepare long_line lunch dinner hour -PRON- ask -PRON- -PRON- want -PRON- meat lean maybe -PRON- remember -PRON- do_not want -PRON- fatty half sour pickle hot_pepper hand cut french_fry



## Topic Modeling

We have removed much of the grammatical richness of these reviews, which may seem counterproductive. The reason is that we want to apply **topic modeling**. In Topic modeling documents are treated as a mixture of a *predefined number of topics*, and topics are *mixtures of words*.

![LDA](https://s3.amazonaws.com/skipgram-images/LDA.png)

The idea of Topic Modeling is to explore a *corpus* of text, and automatically find which topics appear in each document, and to what extent. There are several Topic Modeling techniques. We will be using probably the most popular, called **Latent Dirichlet Allocation** or **LDA** for short.

In [16]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle
trigram_dictionary_filepath = os.path.join(intermediate_directory,
                                           'trigram_dict_all.dict')

We first learn the full vocabulary by going through every word in the corpus. We then remove words that are too rare or too common.

In [17]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to learn the dictionary yourself.
if os.path.isfile(trigram_dictionary_filepath):
    # load the finished dictionary from disk
    trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)
else:
    trigram_reviews = LineSentence(trigram_reviews_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)

CPU times: user 3.73 ms, sys: 7.96 ms, total: 11.7 ms
Wall time: 11 ms


Each review is represented as a **Bag-Of-Words** (**BOW**). A bag of words is simply a list of how many times each word in the vocabulary appears in a review. Most words do not appear (zero counts), therefore the BOW representation is extremely *sparse*.

The code below creates a file where each review is represented by a Bag-Of-Words.

In [18]:
trigram_bow_filepath = os.path.join(intermediate_directory,
                                    'trigram_bow_corpus_all.mm')

def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)
        
# this is a bit time consuming - make the if statement True
# if you want to build the bag-of-words corpus yourself.
if os.path.exists(trigram_bow_filepath):
    # load the finished bag-of-words corpus from disk
    trigram_bow_corpus = MmCorpus(trigram_bow_filepath)
else:
    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_reviews_filepath))
    trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

We can now run the code that performs the LDA analysis. The results of the analysis are saved in a file called `lda_model_all`.

In [19]:
%%time
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')

# this is a bit time consuming - make the if statement True
# if you want to train the LDA model yourself.
if os.path.exists(lda_model_filepath):
    # load the finished LDA model from disk
    lda = LdaMulticore.load(lda_model_filepath)
else:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=50,
                           id2word=trigram_dictionary,
                           workers=3)
    
    lda.save(lda_model_filepath)

CPU times: user 48.9 ms, sys: 4.17 ms, total: 53 ms
Wall time: 52 ms


Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

The topics are in no particular order. Let's look at the first one.

In [20]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print (u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    for term, frequency in lda.show_topic(topic_number, topn=25):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

explore_topic(topic_number=0)

term                 frequency

waffle               0.089
u                    0.038
n                    0.037
juice                0.035
chicken              0.018
tour                 0.013
squeeze              0.010
syrup                0.010
soo                  0.009
quiche               0.009
t                    0.008
lil                  0.007
matt                 0.007
ohio                 0.007
r                    0.007
factory              0.007
chris                0.006
monte                0.006
hoagie               0.006
fresh                0.006
's                   0.006
oj                   0.006
inn                  0.006
breakfast_burrito    0.006
smoothie             0.005


It's not immediately clear what it is about, but it seems to have to do with breakfast (waffle, juice, syrup, breakfast_burrito, smoothie, etc).

Looking at each topic is a bit inefficient. Luckily, we can explore the results via a graphical interface.

## Explanation of the plot

* **On the left**, there is a plot of the "distance" between all of the topics (labeled as the _Intertopic Distance Map_)
  * The plot is rendered in two dimensions according a [*multidimensional scaling (MDS)*](https://en.wikipedia.org/wiki/Multidimensional_scaling) algorithm. Topics that are generally similar should be appear close together on the plot, while *dis*similar topics should appear far apart.
  * The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
  * An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
* **On the right**, there is a bar chart showing top terms.
  * When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's *saliency* is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
  * When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. 
* The **relevance metric** is controlled by the parameter $\lambda$, which can be adjusted with a slider above the bar chart.
  * Setting the $\lambda$ parameter close to 1.0 (the default) will rank the terms solely according to their *probability* within the topic.
  * Setting $\lambda$ close to 0.0 will rank the terms solely according to their "distinctiveness" or "*exclusivity*" within the topic &mdash; i.e., terms that occur *only* in this topic, and do not occur in other topics.
  * Setting $\lambda$ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
* Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.

In [21]:
%%time
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if os.path.isfile(LDAvis_data_filepath):
    # load the pre-prepared pyLDAvis data from disk
    with open(LDAvis_data_filepath, 'rb') as f:
        LDAvis_prepared = pickle.load(f)
else:
    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus,
                                              trigram_dictionary)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)

CPU times: user 3.29 ms, sys: 0 ns, total: 3.29 ms
Wall time: 2.61 ms


In [22]:
pyLDAvis.display(LDAvis_prepared)

## Word Embeddings

In [23]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

In [24]:
%%time
# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if os.path.isfile(word2vec_filepath):
    # load the finished model from disk
    food2vec = Word2Vec.load(word2vec_filepath)
    food2vec.init_sims()
else:
    # initiate the model and perform the first epoch of training
    food2vec = food2vec = Word2Vec(size=100, window=5, min_count=20, sg=1, workers=6)
    food2vec.build_vocab(trigram_sentences)
    food2vec.train(trigram_sentences, total_examples=food2vec.corpus_count, epochs=12) #food2vec.iter)
    food2vec.save(word2vec_filepath)

CPU times: user 22.4 ms, sys: 0 ns, total: 22.4 ms
Wall time: 21.7 ms


  setattr(self, attrib, None)


### Measuring word similarity

Now, each word in our vocabulary is associated with a vector of 100 numbers. We can numerically measure how similar two words are, in terms of meaning. For example, we would expect the words `apple` and `banana` to be more similar to each other, since they are both fruits, than they are to the word `waiter`.

In [25]:
import numpy as np
from scipy import spatial

av, bv, cv = food2vec.wv['apple'], food2vec.wv['banana'], food2vec.wv['waiter']
s1 = np.round(1 - spatial.distance.cosine(av, bv), 3)
s2 = np.round(1 - spatial.distance.cosine(av, cv), 3)
s3 = np.round(1 - spatial.distance.cosine(bv, cv), 3)

print('Similarity between "apple" and "banana" = {}'.format(s1))
print('Similarity between "apple" and "waiter" = {}'.format(s2))
print('Similarity between "banana" and "waiter" = {}'.format(s3))

Similarity between "apple" and "banana" = 0.847
Similarity between "apple" and "waiter" = 0.205
Similarity between "banana" and "waiter" = 0.204


### Findint the most related terms

Another question we may ask is, "what are the words that are most related with the word *sushi*? What about *beer*?

In [26]:
def get_related_terms(token, topn=5):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.wv.most_similar(positive=[token], topn=topn):
        print('{:20} {}'.format(word, round(similarity, 3)))

In [27]:
get_related_terms('sushi')

ayce                 0.729
sashimi              0.728
chinese_food         0.711
dim_sum              0.67
in_vegas             0.662


In [28]:
get_related_terms('beer')

wine                 0.815
drink                0.806
on_tap               0.746
ipa                  0.736
draft                0.733


### Adding and subtracting words

One of the most fascinating properties of word embeddings is the possibility of adding and subtracting words as if they were vectors (which they are, actually). For example, if we add `breakfast` and `lunch`, what word would you expect?

In [29]:
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = food2vec.wv.most_similar(positive=add, negative=subtract, topn=topn)
    
    for term, similarity in answers:
        print(term)

In [30]:
word_algebra(add=['lunch', 'breakfast'])

brunch


A more elaborate example: if we take `lunch`, we subtract `day` and we add `night`, what would you expect?

In [31]:
word_algebra(add=[u'lunch', u'night'], subtract=[u'day'])

dinner


## Word Vector Visualization with t-SNE

We have seen that word embeddings allow to measure similarity between words. Each word is represented as a vector of 100 numbers. It would be nice if we could visualize these words in a plot, and see which words cluster together. the t-SNE algorithm is a very popular method for the visualization of high dimensional data.

We start creating a table with the words in our reviews in the rows, and the 100-long vectors in the columns. In the example below, we visualize the first 10 tokens.

In [32]:
ordered_vocab = [(term, voc.index, voc.count) for term, voc in food2vec.wv.vocab.items()]
# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda x: -x[2])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.wv.syn0norm[term_indices, :],
                            index=ordered_terms)

word_vectors[:10]

  # Remove the CWD from sys.path while we load stuff.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
-PRON-,0.045916,0.000635,-0.061782,-0.140499,-0.064366,0.067728,-0.008556,0.036057,-0.039211,-0.106828,...,0.08386,0.083087,0.016415,0.126301,0.096194,0.148145,0.122138,0.009392,-0.053284,0.138093
be,0.048781,-0.075263,0.035716,-0.064056,-0.018375,-0.037974,-0.131532,-0.056598,0.073894,-0.111527,...,0.088843,-0.070907,0.104998,0.097718,0.01192,0.009315,0.191531,0.163952,-0.167926,0.196185
the,0.115627,0.007902,-0.111051,-0.075056,-0.056502,-0.044206,0.004868,0.03975,0.189443,-0.062639,...,0.063754,0.069547,0.016668,0.154754,-0.055646,0.058354,0.15692,-0.151786,-0.015674,0.018669
and,0.096375,-0.019689,-0.008946,-0.226972,0.069982,-0.042489,-0.181513,0.003743,0.103862,-0.20131,...,0.142871,-0.070689,-0.032741,0.143664,0.104773,0.136523,0.177073,-0.087622,-0.063217,0.141989
a,0.128857,0.100882,-0.030482,-0.056695,-0.107777,0.171308,-0.004104,0.147862,0.092478,-0.210171,...,-0.044011,-0.05874,0.014097,0.269718,0.002572,0.096095,-0.147366,-0.088985,-0.081468,0.146563
to,0.040896,-0.098899,0.071526,-0.098185,-0.029845,0.127691,0.020093,-0.111298,0.008239,-0.224959,...,0.092088,0.13954,-0.023543,0.06076,0.048208,-0.155986,0.213906,0.110785,-0.117334,0.139845
have,0.11302,-0.061003,0.113982,-0.210212,-0.046846,0.257448,0.136409,0.002115,0.0187,-0.162979,...,0.020285,0.161998,0.245748,0.140094,0.021241,0.106034,-0.00734,-0.004375,-0.062392,-0.015397
of,0.012092,-0.082581,-0.139682,0.101404,0.071991,-0.004279,-0.027865,-0.018391,0.064706,-0.33204,...,0.036715,-0.029599,0.014828,0.087817,0.159584,0.119476,-0.103963,0.066997,-0.116807,-0.063
for,-0.045175,-0.084536,0.043217,-0.112851,-0.015646,0.102164,-0.082914,-0.024096,0.105399,-0.210253,...,0.040247,0.109287,-0.026036,-0.04578,0.03696,0.100756,-0.048925,0.14233,0.046238,0.08881
with,-0.023679,-0.005423,-0.135798,-0.164545,0.19232,0.01254,-0.076395,-0.009785,0.070201,-0.089973,...,0.036527,-0.006839,0.060646,0.040318,0.04645,-0.016745,-0.067125,-0.086016,0.0199,0.160963


The code below runs the t-SNE algorithm on this table and produces the table that we will use for the visualization.

In [33]:
from sklearn.manifold import TSNE
tsne_input = word_vectors.drop(nlp.Defaults.stop_words, errors='ignore')
tsne_input = tsne_input.head(5000)
tsne_input.head()
tsne_filepath = os.path.join(intermediate_directory, 'tsne_model')
tsne_vectors_filepath = os.path.join(intermediate_directory, 'tsne_vectors.npy')

In [34]:
%%time
if not os.path.isfile(tsne_vectors_filepath):
    tsne = TSNE(n_iter=3000, perplexity=15)
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)
    pd.np.save(tsne_vectors_filepath, tsne_vectors)
else:
    with open(tsne_filepath, 'rb') as f:
        tsne = pickle.load(f)
    
tsne_vectors = pd.np.load(tsne_vectors_filepath)
tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])

CPU times: user 2.35 ms, sys: 761 µs, total: 3.11 ms
Wall time: 1.46 ms


Finally, we use the Bokeh library to produce an interactive visualization of the data.

In [35]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, reset'),
                   active_scroll=u'wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = u'@index') )

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);