# NLP in Python
## Analyzing millions of Yelp reviews

Patrick Harrison presented this notebook during the [PyData DC 2016 conference](http://pydata.org/dc2016/schedule/presentation/11/). To view the video of the presentation on YouTube, see [here](https://www.youtube.com/watch?v=6zm9NC9uRkk)._

## Topics to be covered
This tutorial features an end-to-end data science & natural language processing pipeline, starting with **raw data** and running through **preparing**, **modeling**, **visualizing**, and **analyzing** the data. We'll touch on the following points:
1. A tour of the dataset
1. Introduction to text processing with spaCy
1. Automatic phrase modeling
1. Topic modeling with LDA
1. Visualizing topic models with pyLDAvis
1. Word vector models with word2vec
1. Visualizing word2vec with t-SNE

...and we might even learn a thing or two about Python along the way.

Let's get started!

## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. The dataset contains crowd sourced reviews about various business like restaurant, clinics etc

The data is provided in a handful of files in _.json_ format. We'll be using the following files for our demo:
- __yelp\_academic\_dataset\_business.json__ &mdash; _the records for individual businesses_
- __yelp\_academic\_dataset\_review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's take a look at a few examples.

In [4]:
import os
import codecs

data_directory = os.path.join('..', 'data')

businesses_filepath = os.path.join(data_directory,
                                   'yelp_academic_dataset_business.json')

with codecs.open(businesses_filepath, encoding='utf_8') as f:
    first_business_record = f.readline() 

print(first_business_record)

{"business_id":"Apn5Q_b6Nz61Tq4XzPdf9A","name":"Minhas Micro Brewery","neighborhood":"","address":"1314 44 Avenue NE","city":"Calgary","state":"AB","postal_code":"T2E 6L6","latitude":51.0918130155,"longitude":-114.031674872,"stars":4.0,"review_count":24,"is_open":1,"attributes":{"BikeParking":"False","BusinessAcceptsCreditCards":"True","BusinessParking":"{'garage': False, 'street': True, 'validated': False, 'lot': False, 'valet': False}","GoodForKids":"True","HasTV":"True","NoiseLevel":"average","OutdoorSeating":"False","RestaurantsAttire":"casual","RestaurantsDelivery":"False","RestaurantsGoodForGroups":"True","RestaurantsPriceRange2":"2","RestaurantsReservations":"True","RestaurantsTakeOut":"True"},"categories":"Tours, Breweries, Pizza, Restaurants, Food, Hotels & Travel","hours":{"Monday":"8:30-17:0","Tuesday":"11:0-21:0","Wednesday":"11:0-21:0","Thursday":"11:0-21:0","Friday":"11:0-21:0","Saturday":"11:0-21:0"}}



The business records consist of _key, value_ pairs containing information about the particular business. A few attributes we'll be interested in for this demo include:
- __business\_id__ &mdash; _unique identifier for businesses_
- __categories__ &mdash; _an array containing relevant category values of businesses_

The _categories_ attribute is of special interest. This demo will focus on restaurants, which are indicated by the presence of the _Restaurant_ tag in the _categories_ array. In addition, the _categories_ array may contain more detailed information about restaurants, such as the type of food they serve.

The review records are stored in a similar manner &mdash; _key, value_ pairs containing information about the reviews.

In [2]:
review_json_filepath = os.path.join(data_directory,
                                    'yelp_academic_dataset_review.json')

with codecs.open(review_json_filepath, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print(first_review_record)

{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id":"iCQpiavjjPzJ5_3gPD5Ebg","stars":2,"date":"2011-02-25","text":"The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo \/ Fort Apache. The chef there can make a MUCH better NY style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria in a casino. I dont care what you say...","useful":0,"funny":0,"cool":0}



A few attributes of note on the review records:
- __business\_id__ &mdash; _indicates which business the review is about_
- __text__ &mdash; _the natural language text the user wrote_

The _text_ attribute will be our focus today!

_json_ is a handy file format for data interchange, but it's typically not the most usable for any sort of modeling work. Let's do a bit more data preparation to get our data in a more usable format. Our next code block will do the following:
1. Read in each business record and convert it to a Python `dict`
2. Filter out business records that aren't about restaurants (i.e., not in the "Restaurant" category)
3. Create a `frozenset` of the business IDs for restaurants, which we'll use in the next step

Note: 

tuples are immutable lists, frozensets are immutable sets.

tuples are indeed an ordered collection of objects, but they can contain duplicates and unhashable objects, and have slice functionality

frozensets aren't indexed, but you have the functionality of sets - O(1) element lookups, and functionality such as unions and intersections. They also can't contain duplicates, like their mutable counterparts.

In [3]:
import json

restaurant_ids = set()

# open the businesses file
with codecs.open(businesses_filepath, encoding='utf_8') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if business[u'categories'] is not None and u'Restaurants' not in business[u'categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business[u'business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print ('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

57,714 restaurants in the dataset.


Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file.

In [9]:
intermediate_directory = os.path.join('..', 'intermediate')

review_txt_filepath = os.path.join(intermediate_directory,
                                   'review_text_all.txt')

In [5]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if True:
    
    review_count = 0

    # create & open a new file in write mode
    with codecs.open(review_txt_filepath, 'w', encoding='utf_8') as review_txt_file:

        # open the existing review json file
        with codecs.open(review_json_filepath, encoding='utf_8') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review[u'business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print (u'''Text from {:,} restaurant reviews
              written to the new txt file.'''.format(review_count))
    
else:
    
    with codecs.open(review_txt_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
        
    print (u'Text from {:,} restaurant reviews in the txt file.'.format(review_count + 1))

Text from 3,658,019 restaurant reviews
              written to the new txt file.
Wall time: 6min 46s


## START FROM HERE
## spaCy &mdash; Industrial-Strength NLP in Python

![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:
- Tokenization
- Text normalization, such as lowercasing, stemming/lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:
- Large English vocabulary, including stopword lists
- Token "probabilities"
- Word vectors

spaCy is written in optimized Cython, which means it's _fast_. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the _GIL_).

In [5]:
import spacy
import pandas as pd
import itertools as it

#run this command in your terminal before using the below code: python -m spacy download en
nlp = spacy.load('en')

Let's grab a sample review to play with.

In [6]:
review_txt_filepath = "C:\\Knowledge Base\\GreyAtom\\nlp-in-python-master\\intermediate\\review_text_all_small.txt"
with codecs.open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 9, 10))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print(sample_review)

If you're looking for an adventure in Las Vegas when it comes to food, this is the place to go (and this adventure isn't for the weak!) I had to say I was extremely sketchy because you get a pound or more of crawfish freshly boiled, mixed in with their seasonings and sauces, and you have to be the one to crack them open yourself to get to the crawfish meat, but that is the whole experience and beauty of the restaurant. 

I wish I could try their other sauces and seasoning, but I feel even the next step up from what is considered mild is way too spicy and hot for me. Guess I need to work on becoming a novice here! Anyways, this location is always busy, and expect a bit of a wait every time you come (larger parties are always going to be an hour of more). I rate this 4 stars because this is a very interesting experience when it comes to dining! Be fearless!



Hand the review text to spaCy, and be prepared to wait...

In [14]:
%%time
parsed_review = nlp(sample_review)

Wall time: 308 ms


...1/20th of a second or so. Let's take a look at what we got during that time...

In [15]:
print (parsed_review)

If you're looking for an adventure in Las Vegas when it comes to food, this is the place to go (and this adventure isn't for the weak!) I had to say I was extremely sketchy because you get a pound or more of crawfish freshly boiled, mixed in with their seasonings and sauces, and you have to be the one to crack them open yourself to get to the crawfish meat, but that is the whole experience and beauty of the restaurant. 

I wish I could try their other sauces and seasoning, but I feel even the next step up from what is considered mild is way too spicy and hot for me. Guess I need to work on becoming a novice here! Anyways, this location is always busy, and expect a bit of a wait every time you come (larger parties are always going to be an hour of more). I rate this 4 stars because this is a very interesting experience when it comes to dining! Be fearless!



Looks the same! What happened under the hood?

What about sentence detection and segmentation?

In [16]:
for num, sentence in enumerate(parsed_review.sents):
    print ('Sentence {}:'.format(num + 1))
    print (sentence)
    print ('')

Sentence 1:
If you're looking for an adventure in Las Vegas when it comes to food, this is the place to go (and this adventure isn't for the weak!)

Sentence 2:
I had to say I was extremely sketchy because you get a pound or more of crawfish freshly boiled, mixed in with their seasonings and sauces, and you have to be the one to crack them open yourself to get to the crawfish meat, but that is the whole experience and beauty of the restaurant. 



Sentence 3:
I wish I could try their other sauces and seasoning, but I feel even the next step up from what is considered mild is way too spicy and hot for me.

Sentence 4:
Guess I need to work on becoming a novice here!

Sentence 5:
Anyways, this location is always busy, and expect a bit of a wait every time you come (larger parties are always going to be an hour of more).

Sentence 6:
I rate this 4 stars because this is a very interesting experience when it comes to dining!

Sentence 7:
Be fearless!




What about named entity detection?

In [18]:
#GPE - Geographical/Social/Political Entities (GPE)

for num, entity in enumerate(parsed_review.ents):
    print ('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print ('')

Entity 1: Las Vegas - GPE

Entity 2: 4 - CARDINAL

Entity 3: 
 - GPE



What about part of speech tagging?

In [19]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_pos)),columns=['token_text', 'part_of_speech'])

Unnamed: 0,token_text,part_of_speech
0,If,ADP
1,you,PRON
2,'re,VERB
3,looking,VERB
4,for,ADP
5,an,DET
6,adventure,NOUN
7,in,ADP
8,Las,PROPN
9,Vegas,PROPN


What about text normalization, like stemming/lemmatization and shape analysis?

In [20]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
             columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,If,if,Xx
1,you,-PRON-,xxx
2,'re,be,'xx
3,looking,look,xxxx
4,for,for,xxx
5,an,an,xx
6,adventure,adventure,xxxx
7,in,in,xx
8,Las,las,Xxx
9,Vegas,vegas,Xxxxx


What about token-level entity analysis?

In [21]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,If,,O
1,you,,O
2,'re,,O
3,looking,,O
4,for,,O
5,an,,O
6,adventure,,O
7,in,,O
8,Las,GPE,B
9,Vegas,GPE,I


What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?
- stopword
- punctuation
- whitespace
- represents a number
- whether or not the token is included in spaCy's default vocabulary?

In [22]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,If,-20.0,,,,,Yes
1,you,-20.0,Yes,,,,Yes
2,'re,-20.0,,,,,Yes
3,looking,-20.0,,,,,Yes
4,for,-20.0,Yes,,,,Yes
5,an,-20.0,Yes,,,,Yes
6,adventure,-20.0,,,,,Yes
7,in,-20.0,Yes,,,,Yes
8,Las,-20.0,,,,,Yes
9,Vegas,-20.0,,,,,Yes


If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), spaCy is ready to use out-of-the-box.

I think it will eventually become a core part of the Python data science ecosystem &mdash; it will do for natural language computing what other great libraries have done for numerical computing.

## Phrase Modeling

_Phrase modeling_ is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the the words in our reviews and looking for words that _co-occur_ (i.e., appear one after another) together much more frequently than you would expect them to by random chance. The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$

...where:
* $count(A)$ is the number of times token $A$ appears in the corpus
* $count(B)$ is the number of times token $B$ appears in the corpus
* $count(A\ B)$ is the number of times the tokens $A\ B$ appear in the corpus *in order*
* $N$ is the total size of the corpus vocabulary
* $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
* $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase

Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so _new york_ would become *new\_york*). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as _happy hour_) to also become phrases in the model.

We turn to the indispensible [**gensim**](https://radimrehurek.com/gensim/index.html) library to help us with phrase modeling &mdash; the [**Phrases**](https://radimrehurek.com/gensim/models/phrases.html) class in particular.

In [12]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

1. Segment text of complete reviews into sentences & normalize text
1. First-order phrase modeling $\rightarrow$ _apply first-order phrase model to transform sentences_
1. Second-order phrase modeling $\rightarrow$ _apply second-order phrase model to transform sentences_
1. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the `lemmatized_sentence_corpus` generator function will use spaCy to:
- Iterate over the 1M reviews in the `review_txt_all.txt` we created before
- Segment the reviews into individual sentences
- Remove punctuation and excess whitespace
- Lemmatize the text

... and do so efficiently in parallel, thanks to spaCy's `nlp.pipe()` function.

In [24]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [25]:
unigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'unigram_sentences_all.txt')

Let's use the `lemmatized_sentence_corpus` generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We'll write this data back out to a new file (`unigram_sentences_all`), with one normalized sentence per line. We'll use this data for learning our phrase models.

In [26]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')

Wall time: 1min 6s


If your data is organized like our `unigram_sentences_all` file now is &mdash; a large text file with one document/sentence per line &mdash; gensim's [**LineSentence**](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) class provides a convenient iterator for working with other gensim components. It *streams* the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.

In [27]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

Let's take a look at a few sample sentences in our new, transformed file.

In [28]:
for unigram_sentence in it.islice(unigram_sentences, 230, 240):
    print (u' '.join(unigram_sentence))
    print (u'')

food be great but with most newly open restaurant -PRON- still have something to work out

-PRON- would definitely eat at sushi guru again to see if improve

great atmospher great food great customer service

finger lickin good

will be back

-PRON- be excited to go to this place because -PRON- have be hear good thing about -PRON-

the restaurant have a really cool atmosphere and the staff be really nice

although -PRON- feel like the menu be a bit small

other than raman there be not much food and dessert

basically -PRON- can customize -PRON- own raman



Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "`ice cream`", to be linked together to form a new, single token: "`ice_cream`".

In [29]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [30]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if 1 == 1:

    bigram_model = Phrases(unigram_sentences)

    bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)

Wall time: 815 ms


Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results.

In [31]:
bigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'bigram_sentences_all.txt')

In [32]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')



Wall time: 1.4 s


In [33]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [35]:
for bigram_sentence in it.islice(bigram_sentences, 230, 240):
    print (u' '.join(bigram_sentence))
    print (u'')

food be great but with most newly open restaurant -PRON- still have something to work out

-PRON- would_definitely eat at sushi guru again to see if improve

great atmospher great food great customer_service

finger lickin good

will be back

-PRON- be excited to go to this_place because -PRON- have be hear good thing about -PRON-

the restaurant have a really cool atmosphere and the staff be really nice

although -PRON- feel_like the menu be a_bit small

other_than raman there be not much food and dessert

basically -PRON- can customize -PRON- own raman



Looks like the phrase modeling worked! We now see two-word phrases, such as "`customer_service`" and "`feel_like`", linked together in the text as a single token. Next, we'll train a _second-order_ phrase model. We'll apply the second-order phrase model on top of the already-transformed data, so that incomplete word combinations like "`vanilla_ice cream`" will become fully joined to "`vanilla_ice_cream`".

In [36]:
trigram_model_filepath = os.path.join(intermediate_directory,
                                      'trigram_model_all')

In [37]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if 1 == 1:

    trigram_model = Phrases(bigram_sentences)

    trigram_model.save(trigram_model_filepath)
    
# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

Wall time: 766 ms


We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences.

In [42]:
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'trigram_sentences_all.txt')

In [39]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')



Wall time: 1.42 s


In [40]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [41]:
for trigram_sentence in it.islice(trigram_sentences, 230, 240):
    print (u' '.join(trigram_sentence))
    print (u'')

food be great but with most newly open restaurant -PRON- still have something to work out

-PRON- would_definitely eat at sushi guru again to see if improve

great atmospher great food great customer_service

finger lickin good

will be back

-PRON- be excited to go to this_place because -PRON- have be hear good thing about -PRON-

the restaurant have a really cool atmosphere and the staff be really nice

although -PRON- feel_like the menu be a_bit small

other_than raman there be not much food and dessert

basically -PRON- can customize -PRON- own raman



The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.

In addition, we'll remove stopwords at this point. _Stopwords_ are very common words, like _a_, _the_, _and_, and so on, that serve functional roles in natural language, but typically don't contribute to the overall meaning of text. Filtering stopwords is a common procedure that allows higher-level NLP modeling techniques to focus on the words that carry more semantic weight.

Finally, we'll write the transformed text out to a new file, with one review per line.

In [14]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

In [49]:
%%time

from spacy.lang.en import STOP_WORDS

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with codecs.open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review(review_txt_filepath),
                                      batch_size=10000, n_threads=4):
            
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in STOP_WORDS]
            
            # write the transformed review as a line in the new file
            trigram_review = u' '.join(trigram_review)
            f.write(trigram_review + '\n')



Wall time: 1min 8s


Let's preview the results. We'll grab one review from the file with the original, untransformed text, grab the same review from the file with the normalized and transformed text, and compare the two.

In [52]:
print (u'Original:' + u'\n')

for review in it.islice(line_review(review_txt_filepath), 11, 12):
    print (review)

print (u'----' + u'\n')
print (u'Transformed:' + u'\n')

with codecs.open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 11, 12):
        print (review)

Original:

Not at all one of the best all you can eat sushi joints that I've been to. The food was okay at best. There was nothing special or standout about this place. They don't even have the Ipad's for ordering like a lot of the other places do. The menu was pretty basic and didn't really offer anything spectacular.

It wasn't the worst and it wasn't the best. This was just a truly mediocre experience, I would go back if it was close by and if it was lunch time, but I would probably find somewhere else if I had to go for dinner since the price really isn't worth it.

----

Transformed:

good -PRON- eat sushi joint -PRON- food okay good nothing_special standout this_place -PRON- do_not_even ipad 's order like lot_of place menu pretty basic do_not offer spectacular -PRON- bad -PRON- good truly mediocre experience -PRON- go_back -PRON- close -PRON- lunch time -PRON- probably find somewhere_else -PRON- dinner price worth -PRON-



You can see that most of the grammatical structure has been scrubbed from the text &mdash; capitalization, articles/conjunctions, punctuation, spacing, etc. However, much of the general semantic *meaning* is still present.

## Word Vector Embedding with Word2Vec

Pop quiz! Can you complete this text snippet?

<br><br>

![word2vec quiz](https://s3.amazonaws.com/skipgram-images/word2vec-1.png)

<br><br><br>
You just demonstrated the core machine learning concept behind word vector embedding models!
<br><br><br>

![word2vec quiz 2](https://s3.amazonaws.com/skipgram-images/word2vec-2.png)

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised &mdash; they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word &mdash; i.e., the other words immediately before and after it &mdash; to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, where it considers snippets of text only a few tokens long at a time.

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrouding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.

For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).

Word2vec has a number of user-defined hyperparameters, including:
- The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
- The number of training epochs.

For using word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class.

In [43]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.

In [73]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if 1 == 1:

    # initiate the model and perform the first epoch of training
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4)
    
    food2vec.save(word2vec_filepath)

    #perform another 11 epochs of training
    #for i in range(1,12):

    #    food2vec.train(trigram_sentences,total_examples=1, epochs=10)
    #    food2vec.save(word2vec_filepath)
        
# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print (u'{} training epochs so far.'.format(food2vec.train_count))

  if hasattr(self, attrib):
  asides[attrib] = getattr(self, attrib)
  delattr(self, attrib)
  setattr(obj, attrib, val)
  setattr(self, attrib, None)


1 training epochs so far.
Wall time: 1.96 s


On my four-core machine, each epoch over all the text in the ~1 million Yelp reviews takes about 5-10 minutes.

In [51]:
print (u'{:,} terms in the food2vec vocabulary.'.format(len(food2vec.wv.vocab)))

1,024 terms in the food2vec vocabulary.


Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns.

In [56]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
#ordered_vocab = sorted(ordered_vocab, key=lambda (term, index, count): -count)

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.wv.syn0norm[term_indices, :],
                            index=ordered_terms)

word_vectors

  


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
after,-0.301643,0.039364,-0.145531,0.110742,0.144773,-0.078214,0.061256,-0.023227,0.003406,0.011680,...,0.082084,0.014832,0.110643,-0.116054,-0.063942,-0.063380,-0.021826,-0.042620,0.081347,-0.062028
for,-0.280029,0.024631,-0.076534,0.094662,0.101035,0.014097,0.002313,-0.035879,0.128832,-0.055912,...,0.121852,0.024043,-0.013434,-0.132626,-0.118617,-0.072394,0.145242,-0.085593,0.092369,-0.071842
what,-0.241954,0.111122,-0.103809,0.102062,0.054980,0.138506,0.001434,0.088418,-0.011469,-0.066780,...,0.189196,0.029457,0.064549,-0.073160,0.020430,-0.109981,0.084992,-0.042069,0.060030,0.068329
feel_like,-0.255318,0.026057,-0.108719,0.105725,0.106940,0.057198,0.029765,0.011120,0.039998,-0.036228,...,0.101847,-0.056427,0.073716,-0.163047,-0.051640,-0.047314,0.043506,-0.056412,0.143357,-0.030122
hubby,-0.243441,0.026121,-0.127511,0.090440,0.120251,0.048076,0.099165,0.073350,0.042066,0.000108,...,0.074756,-0.054829,0.132065,-0.169416,-0.020577,-0.020169,0.007734,-0.060025,0.147040,-0.029602
-PRON-,-0.172933,0.008653,-0.015970,0.030818,0.108273,0.047040,0.026432,0.164874,-0.005151,0.047890,...,0.129939,0.020924,-0.011828,-0.080609,0.093543,-0.004127,0.027260,-0.146688,0.106643,0.109947
decide_to,-0.302338,0.038833,-0.169464,0.111838,0.144900,-0.039691,0.091398,0.031585,0.001609,0.008215,...,0.083595,-0.025650,0.136317,-0.132196,-0.035999,-0.046731,-0.024604,-0.076644,0.127589,-0.034433
grab,-0.276622,-0.020122,-0.077872,0.088111,0.125661,-0.082433,0.046129,0.024112,0.103239,0.001841,...,0.058610,-0.028942,0.124583,-0.091033,-0.150055,-0.085140,0.019366,-0.028484,0.070502,-0.007932
drink,-0.265507,-0.019445,-0.028791,0.130656,0.066367,0.002516,0.071866,0.056457,0.123166,-0.056270,...,0.093099,0.015193,0.081738,-0.119529,-0.159087,-0.085912,-0.014769,0.014766,0.097616,0.051688
at,-0.232834,-0.008028,0.000442,0.045449,0.148151,-0.062544,-0.070685,-0.050062,-0.085577,-0.021800,...,0.116347,0.112865,0.015112,-0.122782,-0.061385,-0.003893,0.124158,-0.053636,0.158281,-0.181152


Holy wall of numbers! This DataFrame has 50,835 rows &mdash; one for each term in the vocabulary &mdash; and 100 colums. Our model has learned a quantitative vector representation for each term, as expected.

Put another way, our model has "embedded" the terms into a 100-dimensional vector space.

### So... what can we do with all these numbers?
The first thing we can use them for is to simply look up related words and phrases for a given term of interest.

In [58]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.most_similar(positive=[token], topn=topn):

        print (u'{:20} {}'.format(word, round(similarity, 3)))

### What things are like Burger King?

In [63]:
get_related_terms(u'beer')

  import sys
  if np.issubdtype(vec.dtype, np.int):


amount_of            0.962
cocktail             0.958
wine                 0.958
as_well_as           0.954
classic              0.954
variety_of           0.952
selection_of         0.948
come_out             0.942
presentation         0.94
incredible           0.94


The model has learned that fast food restaurants are similar to each other! In particular, *mcdonalds* and *wendy's* are the most similar to Burger King, according to this dataset. In addition, the model has found that alternate spellings for the same entities are probably related, such as *mcdonalds*, *mcdonald's* and *mcd's*.

### When is happy hour?

In [64]:
get_related_terms(u'happy_hour')

  import sys
  if np.issubdtype(vec.dtype, np.int):


total                0.971
tuesday              0.97
date                 0.964
friday               0.963
pm                   0.962
booth                0.962
group_of             0.962
summer               0.962
11                   0.96
game                 0.958


The model has noticed several alternate spellings for happy hour, such as *hh* and *happy hr*, and assesses them as highly related. If you were looking for reviews about happy hour, such alternate spellings would be very helpful to know.

Taking a deeper look &mdash; the model has turned up phrases like *3-6pm*, *4-7pm*, and *mon-fri*, too. This is especially interesting, because the model has no advance knowledge at all about what happy hour is, and what time of day it should be. But simply by scanning through restaurant reviews, the model has discovered that the concept of happy hour has something very important to do with that block of time around 3-7pm on weekdays.

### Let's make pasta tonight. Which style do you want?

In [65]:
get_related_terms(u'pasta', topn=20)

  import sys
  if np.issubdtype(vec.dtype, np.int):


baked                0.985
sour                 0.985
fry_rice             0.985
duck                 0.984
cream                0.984
thick                0.984
bit                  0.984
turkey               0.984
tofu                 0.984
batter               0.984
soft                 0.983
meatball             0.982
white                0.982
mango                0.982
crab                 0.981
gravy                0.98
chocolate            0.98
lamb                 0.98
thin                 0.98
seafood              0.98


## Word algebra!
No self-respecting word2vec demo would be complete without a healthy dose of *word algebra*, also known as *analogy completion*.

The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:
1. Provide a set of words or phrases that you'd like to add or subtract.
1. Look up the vectors that represent those terms in the word vector model.
1. Add and subtract those vectors to produce a new, combined vector.
1. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
1. Return the word(s) associated with the similar vector(s).

But more generally, you can think of the vectors that represent each word as encoding some information about the *meaning* or *concepts* of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.

In [67]:
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = food2vec.most_similar(positive=add, negative=subtract, topn=topn)
    
    for term, similarity in answers:
        print (term)

### breakfast + lunch = ?
Let's start with a softball.

In [68]:
word_algebra(add=[u'breakfast', u'lunch'])

  import sys
  if np.issubdtype(vec.dtype, np.int):


dinner


OK, so the model knows that *brunch* is a combination of *breakfast* and *lunch*. What else?

### lunch - day + night = ?

In [69]:
word_algebra(add=[u'lunch', u'night'], subtract=[u'day'])

  import sys
  if np.issubdtype(vec.dtype, np.int):


dinner


Now we're getting a bit more nuanced. The model has discovered that:
- Both *lunch* and *dinner* are meals
- The main difference between them is time of day
- Day and night are times of day
- Lunch is associated with day, and dinner is associated with night

What else?

### taco - mexican + chinese = ?

In [74]:
word_algebra(add=[u'taco', u'chinese'], subtract=[u'mexican'])

  import sys
  if np.issubdtype(vec.dtype, np.int):


tomato


Here's an entirely new and different type of relationship that the model has learned.
- It knows that tacos are a characteristic example of Mexican food
- It knows that Mexican and Chinese are both styles of food
- If you subtract *Mexican* from *taco*, you're left with something like the concept of a *"characteristic type of food"*, which is represented as a new vector
- If you add that new *"characteristic type of food"* vector to Chinese, you get *dumpling*.

What else?

### bun - american + mexican = ?

In [75]:
word_algebra(add=[u'bun', u'mexican'], subtract=[u'american'])

  import sys
  if np.issubdtype(vec.dtype, np.int):


fry_chicken


The model knows that both *buns* and *tortillas* are the doughy thing that goes on the outside of your real food, and that the primary difference between them is the style of food they're associated with.

What else?

You could do this all day. One last analogy before we move on...

## Conclusion

Whew! Let's round up the major components that we've seen:
1. Text processing with **spaCy**
1. Automated **phrase modeling**
1. Word vector modeling with **word2vec**

#### Why use these models?
Dense vector representations for text like LDA and word2vec can greatly improve performance for a number of common, text-heavy problems like:
- Text classification
- Search
- Recommendations
- Question answering

...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.