## Modern NLP in Python

#### What We Are Doing Today

It's easy for humans to derive meaning and order from text, but traditionally this has been very difficult for computers.  We will explore some of the awesome natural language processing tools available in Python that make this sort of analysis easy by examining Yelp restaurant reviews.

Here is the plan; we will get as far as we can:
1. Check out the dataset (quickly)
2. Introduction to text processing with spaCy
3. Automatic phrase modeling
4. Topic modeling with LDA
5. Visualizing topic models with pyLDAvis
6. Word vector models with word2vec
7. Visualizing word2vec with t-SNE

And maybe a little Python along the way.


__Notes:__ This presentation is in a Jupyter Notebook, a terrific tool for sharing reproducible work.  If you do not already have it, you can just follow along, and after the presentation I would be glad to show anyone how to set it up.

This presentation will show you the NLP tools available in Python, but we will not go deep on the underlying math.  There are high-level descriptions for those who are unfamiliar, but I am more than happy to go deeper into the theory at a later point.

### The Yelp Dataset

The review website [**Yelp**](https://www.yelp.com/detroit) publishes their data so it can be used for education and academic research.  It is well suited for machine learning and natural language processing because it's big, but not too big.

**About the Data:** If you plan to follow along interactively, you will need to download the Yelp dataset; if you are viewing a static copy of the notebook, you can skip this step.  Directions to download the data are as follows:
1. Please visit the Yelp dataset webpage [here](https://www.yelp.com/dataset_challenge/)
1. Click "Get the Data"
1. Please review, agree to, and respect Yelp's terms of use!
1. The dataset downloads as a compressed .tgz file; uncompress it
1. Place the uncompressed dataset files (*yelp_academic_dataset_business.json*, etc.) in a directory named *yelp_dataset_challenge_academic_dataset*
1. Place the *yelp_dataset_challenge_academic_dataset* within the *data* directory in the *Modern NLP in Python* project folder

The data is provided in several files in [json](https://en.wikipedia.org/wiki/JSON) format.  We will be using the following files in our demonstration:
- __yelp\_academic\_dataset\_business.json__ &mdash; _the records for individual businesses_
- __yelp\_academic\_dataset\_review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's take a look at a few examples.

In [1]:
import os
import codecs

data_directory = os.path.join('data','yelp_dataset_challenge_academic_dataset')

businesses_json_filepath = os.path.join(data_directory,
                                   'yelp_academic_dataset_business.json')

with codecs.open(businesses_json_filepath, encoding='utf_8') as f:
    first_business_record = f.readline() 

print first_business_record

{"business_id": "5UmKMjUEUNdYWqANhGckJw", "full_address": "4734 Lebanon Church Rd\nDravosburg, PA 15034", "hours": {"Friday": {"close": "21:00", "open": "11:00"}, "Tuesday": {"close": "21:00", "open": "11:00"}, "Thursday": {"close": "21:00", "open": "11:00"}, "Wednesday": {"close": "21:00", "open": "11:00"}, "Monday": {"close": "21:00", "open": "11:00"}}, "open": true, "categories": ["Fast Food", "Restaurants"], "city": "Dravosburg", "review_count": 7, "name": "Mr Hoagie", "neighborhoods": [], "longitude": -79.9007057, "state": "PA", "stars": 3.5, "latitude": 40.3543266, "attributes": {"Take-out": true, "Drive-Thru": false, "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": false, "brunch": false, "breakfast": false}, "Caters": false, "Noise Level": "average", "Takes Reservations": false, "Delivery": false, "Ambience": {"romantic": false, "intimate": false, "classy": false, "hipster": false, "divey": false, "touristy": false, "trendy": false, "upscale": false,

A single business record consists of *key, value* pairs containing information about the business. There are two specific attributes we care about:
- __business_id__ &mdash; *the business's unique identifier*
- __categories__ &mdash; *an array containing information about the type of business* 

Because we are focusing on restaurants in this demo, we are looking for businesses whose __categories__ array contains the value `Restaurants`.  The array can contain other descriptors, like the type of restaurant, but for a restaurant it should always include `Restaurants`.
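As a minimal sketch of that check, the snippet below parses two hand-written records (hypothetical, trimmed to the two fields we care about) and tests for the exact category string `Restaurants`, which is the value the sample record above carries. The `is_restaurant` helper is an illustrative name, not part of the dataset or any library.

```python
import json

# Two hypothetical, trimmed-down records in the same one-object-per-line layout.
sample_lines = [
    '{"business_id": "A1", "categories": ["Fast Food", "Restaurants"]}',
    '{"business_id": "B2", "categories": ["Hair Salons", "Beauty & Spas"]}',
]

def is_restaurant(record):
    """Return True when the exact category 'Restaurants' is present."""
    return u'Restaurants' in record.get(u'categories', [])

# Keep only the ids of records that pass the check.
sample_ids = [json.loads(line)[u'business_id']
              for line in sample_lines
              if is_restaurant(json.loads(line))]

print(sample_ids)
```

The membership test compares whole list elements, so a hypothetical category such as `Restaurant Supply` would not count as a match.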

The review records are also stored as *key, value* pairs.  Below is an example of a review:

In [2]:
review_json_file = os.path.join(data_directory, 'yelp_academic_dataset_review.json')

with codecs.open(review_json_file, encoding='utf_8') as f:
    first_review_record = f.readline()
    
print first_review_record

{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "PUFPaY9KxDAcGqfsorJp3Q", "review_id": "Ya85v4eqdd6k9Od8HbQjyA", "stars": 4, "date": "2012-08-01", "text": "Mr Hoagie is an institution. Walking in, it does seem like a throwback to 30 years ago, old fashioned menu board, booths out of the 70s, and a large selection of food. Their speciality is the Italian Hoagie, and it is voted the best in the area year after year. I usually order the burger, while the patties are obviously cooked from frozen, all of the other ingredients are very fresh. Overall, its a good alternative to Subway, which is down the road.", "type": "review", "business_id": "5UmKMjUEUNdYWqANhGckJw"}



The review records also have two fields that we care about:
- __business_id__ &mdash; the unique identifier used to tie the business record to the review
- __text__ &mdash; the free-form text review from the user

Even though *json* is very good for transferring data, it is not the best format for analysis.  We will now put our data into a form that is easier to perform analytics on.  We will transform the data in the following way:
1. Convert each record into a Python dict
2. Filter out all the businesses that do not have 'Restaurants' in their __categories__ field
3. Create a [*frozenset*](https://docs.python.org/2.4/lib/types-set.html) with the __business_id__ of all the restaurants in the dataset

In [3]:
import json

restaurant_ids = set()

#open the business file
with codecs.open(businesses_json_filepath, encoding='utf_8') as f:
    
    #iterate through each json record in the file
    for business_json in f:
        
        #convert the record to a Python dict
        business = json.loads(business_json)
        
        #check if the business is a restaurant
        if u'Restaurants' not in business[u'categories']:
            continue
            
        
        #if a restaurant, add to the set
        restaurant_ids.add(business[u'business_id'])

#turn the set into a frozenset because we will no longer be adding or deleting from it
restaurant_ids = frozenset(restaurant_ids)

#total number of restaurants in our data
print '{:,} in the dataset'.format(len(restaurant_ids))

26,729 in the dataset


In [4]:
#select 10,000 restaurants to analyze
# more manageable for this demo
import itertools as it

restaurant_ids_1 =  frozenset(it.islice(restaurant_ids, 0,10000))

In [5]:
intermediate_directory = os.path.join('intermediate')

review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')

In [6]:
%%time

# the next step can take a while
# make the if statement true if you want to run this
if 1==0:
    
    review_count = 0
    
    # create and open a new file in write mode
    with codecs.open(review_txt_filepath, 'w', encoding='utf_8') as review_text_file:
        
        # open the existing review json file; the handle gets its own name
        # so that the review_json_file path variable is not overwritten
        with codecs.open(review_json_file, encoding='utf_8') as reviews_file:
            
            # loop through all the reviews in the existing file and convert each to a dictionary
            for review_json in reviews_file:
                review = json.loads(review_json)
                
                # if this review is not for a restaurant in our set, skip it
                if review[u'business_id'] not in restaurant_ids_1:
                    continue
                    
                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_text_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1
                
    print u'''{:,} reviews of restaurants written to new text file.'''.format(review_count)
    
else:
    with codecs.open(review_txt_filepath, encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
    print u'''{:,} reviews of restaurants read from the text file.'''.format(review_count)

630,941 reviews of restaurants read from the text file.
CPU times: user 15.6 s, sys: 558 ms, total: 16.1 s
Wall time: 16.4 s
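The escaping step is what lets us keep one review per line. Here is a minimal sketch of the round trip, using a made-up review string: line breaks are replaced with the two characters `\n` before writing and restored after reading. One caveat: a review that already contains a literal backslash followed by `n` would come back with a real line break, so the round trip is only exact for text without that sequence.

```python
# A made-up review containing two line breaks.
original = u'Great tacos.\nSlow service.\nWould return.'

# Escape: each line break becomes the two characters backslash + n,
# so the whole review fits on a single line of the output file.
escaped = original.replace('\n', '\\n')

# Un-escape: what we do after reading the line back in.
restored = escaped.replace('\\n', '\n')

print(escaped)
print(restored == original)
```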


Now that the data is prepared, we can start analyzing it.

## spaCy &mdash; NLP in Python

![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io/) is an industrial-strength natural language processing toolkit built for Python.  The goal of spaCy is to package the most cutting-edge NLP techniques and give them to practitioners who can apply them to real-life problems.

spaCy is capable of many of the tasks associated with an end-to-end natural language processing pipeline:
- Tokenization
- Text normalization (lowercasing, stemming, lemmatization, etc)
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

spaCy also comes with built-in models which allow for out-of-the-box processing of general-purpose English language text:
- Large English vocabulary
- Token 'probabilities'
- Word vectors

But is it fast?

spaCy is written in Cython, so yes, it is fast.  Several independent sources state that it is the fastest syntactic parser available in any language.  Key pieces of the spaCy parsing pipeline are written in pure C, which allows spaCy to release the Python *GIL* and enables very efficient multithreading.

In [7]:
import spacy
import pandas as pd


nlp = spacy.load('en')

We will test spaCy out on a random review.

In [8]:
with codecs.open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 13317, 13318))[0]
    sample_review = sample_review.replace('\\n', '\n')
        
print sample_review

Last night was my first visit to George and Dragon....and unfortunately, it will probably be my last.  My boyfriend, my daughter and I decided to go there for FATHER'S DAY.  We walked in the front door and stood in the middle of the room facing the bar.  There were maybe 6 or 7 occupied tables--it wasn't busy.  So we stood there, unsure of wether we sat ourselves or waited to be seated.  3 servers were present as well as a blonde bartender who totally ignored us.  Finally we asked if we sat ourselves and the bartender said to go right ahead.  We picked a booth facing the bar and sat.  And sat........and sat......and sat.  One male bartender seemed to be working in a back room only.  Another girl stood alongside the bar, staring out @ us, drinking a soda as we sat there with no menus.  The bartender, who was maybe 15 feet from us, would look @ us sideways, and look away.  My boyfriend works for the City of Phoenix.  He's a HUGE tipper.  Maybe we don't look like much, but...??????? We wa

Now we let spaCy do its magic, but we should be prepared to wait.

In [9]:
%%time 

#wall time is what we care about
parsed_review = nlp(sample_review)

CPU times: user 89.9 ms, sys: 359 ms, total: 449 ms
Wall time: 535 ms


That was quick, let's check it out!

In [10]:
print parsed_review

Last night was my first visit to George and Dragon....and unfortunately, it will probably be my last.  My boyfriend, my daughter and I decided to go there for FATHER'S DAY.  We walked in the front door and stood in the middle of the room facing the bar.  There were maybe 6 or 7 occupied tables--it wasn't busy.  So we stood there, unsure of wether we sat ourselves or waited to be seated.  3 servers were present as well as a blonde bartender who totally ignored us.  Finally we asked if we sat ourselves and the bartender said to go right ahead.  We picked a booth facing the bar and sat.  And sat........and sat......and sat.  One male bartender seemed to be working in a back room only.  Another girl stood alongside the bar, staring out @ us, drinking a soda as we sat there with no menus.  The bartender, who was maybe 15 feet from us, would look @ us sideways, and look away.  My boyfriend works for the City of Phoenix.  He's a HUGE tipper.  Maybe we don't look like much, but...??????? We wa

Well, that doesn't look any different.  But let's check out all the things we can do now. 

Keep in mind that I have done nothing: no models have been trained and no sentence delimiters have been defined.  Everything we are going to see is out-of-the-box spaCy.

In [11]:
for num, sentence in enumerate(parsed_review.sents, start=1):
    print 'Sentence {}:'.format(num)
    print sentence
    print ''

Sentence 1:
Last night was my first visit to George and Dragon....and unfortunately, it will probably be my last.  

Sentence 2:
My boyfriend, my daughter and I decided to go there for FATHER'S DAY.  We walked in the front door and stood in the middle of the room facing the bar.  

Sentence 3:
There were maybe 6 or 7 occupied tables--it wasn't busy.  

Sentence 4:
So we stood there, unsure of wether we sat ourselves or waited to be seated.  3 servers were present as well as a blonde bartender who totally ignored us.  

Sentence 5:
Finally we asked if we sat ourselves and the bartender said to go right ahead.  

Sentence 6:
We picked a booth facing the bar and sat.  

Sentence 7:
And sat........and sat......and sat.  

Sentence 8:
One male bartender seemed to be working in a back room only.  

Sentence 9:
Another girl stood alongside the bar, staring out @ us, drinking a soda as we sat there with no menus.  

Sentence 10:
The bartender, who was maybe 15 feet from us, would look @ us sid

So it did pretty well finding the sentences, although sentences 2 and 4 above each contain two sentences that were not split.  Let's see how it does with entity detection.

In [12]:
for num, entity in enumerate(parsed_review.ents, start=1):
    print 'Entity {}:'.format(num), entity, '-', entity.label_
    print ''

Entity 1: first - ORDINAL

Entity 2: George - PERSON

Entity 3: 6 - CARDINAL

Entity 4: 7 - CARDINAL

Entity 5: 3 - CARDINAL

Entity 6: One - CARDINAL

Entity 7: 15 feet - QUANTITY

Entity 8: the City of Phoenix - GPE

Entity 9: 12 minutes - TIME

Entity 10: Bill - PERSON

Entity 11: The Black Forest Mill - ORG



Next let's check out how spaCy does with part of speech tagging.

__NOTE:__ [here](http://universaldependencies.org/en/pos/index.html) are the definitions of the parts of speech.

In [13]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(zip(token_text, token_pos),
            columns=['token_text', 'part_of_speech'])

Unnamed: 0,token_text,part_of_speech
0,Last,ADJ
1,night,NOUN
2,was,VERB
3,my,ADJ
4,first,ADJ
5,visit,NOUN
6,to,ADP
7,George,PROPN
8,and,CONJ
9,Dragon,PROPN


spaCy also provides support for text normalization, like lemmatization and shaping.

In [14]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(zip(token_text, token_lemma, token_shape),
            columns=['text', 'token_lemma', 'token_shape'])

Unnamed: 0,text,token_lemma,token_shape
0,Last,last,Xxxx
1,night,night,xxxx
2,was,be,xxx
3,my,my,xx
4,first,first,xxxx
5,visit,visit,xxxx
6,to,to,xx
7,George,george,Xxxxx
8,and,and,xxx
9,Dragon,dragon,Xxxxx


spaCy also offers token-level entity analysis.

In [15]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(zip(token_text, token_entity_type, token_entity_iob),
            columns = ['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,Last,,O
1,night,,O
2,was,,O
3,my,,O
4,first,ORDINAL,B
5,visit,,O
6,to,,O
7,George,PERSON,B
8,and,,O
9,Dragon,,O
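The `B`/`I`/`O` tags mark whether a token begins an entity, is inside one, or is outside any entity. To make that concrete, here is a minimal pure-Python sketch (not spaCy's own code) that regroups hand-written `(text, entity_type, iob)` triples, shaped like the rows above, into entity spans. The `group_entities` helper and the sample rows are illustrative assumptions.

```python
def group_entities(rows):
    """Collect (entity_text, entity_type) spans from (text, type, iob) rows."""
    spans = []
    current_tokens, current_type = [], None
    for text, ent_type, iob in rows:
        if iob == 'B':
            # a new entity starts; close any entity that is still open
            if current_tokens:
                spans.append((u' '.join(current_tokens), current_type))
            current_tokens, current_type = [text], ent_type
        elif iob == 'I' and current_tokens:
            # continue the entity that is currently open
            current_tokens.append(text)
        else:
            # 'O' (or a stray 'I'): close any open entity
            if current_tokens:
                spans.append((u' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((u' '.join(current_tokens), current_type))
    return spans

# Hand-written rows in the same layout as the table above.
rows = [
    (u'maybe', u'', 'O'),
    (u'15', u'QUANTITY', 'B'),
    (u'feet', u'QUANTITY', 'I'),
    (u'from', u'', 'O'),
    (u'the', u'GPE', 'B'),
    (u'City', u'GPE', 'I'),
    (u'of', u'GPE', 'I'),
    (u'Phoenix', u'GPE', 'I'),
]

entities = group_entities(rows)
print(entities)
```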


We can also access many other token-level attributes that will help us understand the text.  These include:
- token probabilities
- stopwords
- punctuation
- whitespace
- numerical representation
- whether the token is in spaCy's built-in dictionary

In [16]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,Last,-10.101164,Yes,,,,
1,night,-8.517073,,,,,
2,was,-5.252320,Yes,,,,
3,my,-5.491643,Yes,,,,
4,first,-7.063717,Yes,,,,
5,visit,-10.082088,,,,,
6,to,-3.856022,Yes,,,,
7,George,-10.853457,,,,,
8,and,-4.113108,Yes,,,,
9,Dragon,-10.920993,,,,,
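The `log_probability` column is easier to read once converted back to a plain probability. The sketch below assumes the values are natural logarithms of a smoothed unigram probability, which is how I understand spaCy's `token.prob`; treat that as an assumption. The three numbers are copied from the table above.

```python
import math

# log-probabilities copied from the table above
log_probs = {u'to': -3.856022, u'was': -5.252320, u'Dragon': -10.920993}

# print the most frequent word first
for word in sorted(log_probs, key=log_probs.get, reverse=True):
    # exp() undoes the (assumed) natural log
    print('{:>6}: {:.6f}'.format(word, math.exp(log_probs[word])))
```

Frequent words sit closer to 0, while rare words are strongly negative.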


We have now seen all the awesome things spaCy can do with no work from us.  Next we will train models to get us even more insight into our data.

## Phrase Modeling

_Phrase Modeling_ is an approach used to learn combinations of tokens that together represent meaningful multi-word concepts.  We identify these combinations by looping over our entire corpus and finding combinations of tokens that occur together much more frequently than we would expect by chance.

\begin{align}
\frac {count(A B) - count_{min}} {count(A) * count(B)} * N > threshold
\end{align}

...where:

- _count(A)_ is the number of times token _A_ appears in the corpus
- _count(B)_ is the number of times token _B_ appears in the corpus
- _count(A B)_ is the number of times tokens _A_ and _B_ appear in the corpus in order
- _N_ is the number of tokens in the corpus
- _$count_{min}$_ is the minimum number of times the phrase must appear in the corpus (user-defined)
- _threshold_ is the parameter that determines how strong the relationship between the tokens must be (user-defined)
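The formula above translates directly into a few lines of Python. This is a minimal sketch with made-up counts, not gensim's implementation; `phrase_score` and all of the numbers are hypothetical and chosen only to show one pair that clears the threshold and one that does not.

```python
def phrase_score(count_a, count_b, count_ab, n_tokens, min_count):
    """Score from the formula above; compare the result to `threshold`."""
    return (count_ab - min_count) / float(count_a * count_b) * n_tokens

# Hypothetical settings and a hypothetical corpus of 1,000,000 tokens.
n_tokens = 1000000
min_count = 5
threshold = 10.0

# 'ice cream': both words are fairly rare and mostly appear together.
ice_cream = phrase_score(count_a=800, count_b=600, count_ab=500,
                         n_tokens=n_tokens, min_count=min_count)

# 'of the': both words are very common, so seeing them together is unsurprising.
of_the = phrase_score(count_a=30000, count_b=50000, count_ab=4000,
                      n_tokens=n_tokens, min_count=min_count)

print(ice_cream > threshold)
print(of_the > threshold)
```

Even though "of the" occurs far more often in absolute terms, dividing by the individual counts pushes its score down, which is what keeps common function-word pairs from being joined.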

We must first train our phrase model on our corpus, then we can apply it to new text.  Once trained, our model will look for n-grams that we have identified as phrases and merge them into a new, single token.

Phrase modeling does exactly what you would expect it to do.  It will find named entities that should be a single phrase (new york becomes new_york), but it will also find multi-word expressions that represent common concepts (like happy hour and red wine).

To assist us, we will use the [__gensim__](https://radimrehurek.com/gensim/) library, specifically the [__Phrases__](https://radimrehurek.com/gensim/models/phrases.html) class.

In [17]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

Here are the steps we will take in the following section:
1. Segment the reviews into sentences and normalize the text
2. First-order phrase modeling $\rightarrow$ _apply first-order phrase model to transform sentences_
3. Second-order phrase modeling $\rightarrow$ _apply second-order phrase model to transform sentences_
4. Apply text normalization and second-order phrase model to all reviews.

_We will use this transformed text in the higher-level modeling approaches in the later sections._

Before we move on we will define a few helper functions that we will use to normalize the text.  Specifically, we will define a `lemmatized_sentence_corpus` generator that uses spaCy to:
- Iterate over all our reviews in the `review_text_all.txt` file we created
- Segment the reviews into individual sentences
- Remove punctuation and whitespaces
- Lemmatize the text

And thanks to spaCy's `nlp.pipe()` function we can do this in parallel (so it's fast).

In [18]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with codecs.open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [19]:
unigram_sentence_filepath = os.path.join(intermediate_directory,
                                          'unigram_sentences_all.txt')

We will now use `lemmatized_sentence_corpus` to iterate through the reviews, segmenting and normalizing them.  We will write the result to the file path we defined in the previous cell, with one normalized sentence per line; we will use this normalized text in our phrase model.

In [20]:
%%time

# this is a bit time consuming - make the if statement True if you want to run

if 0 == 1:

    with codecs.open(unigram_sentence_filepath, 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus(review_txt_filepath):
            f.write(sentence + '\n')

CPU times: user 5 µs, sys: 3 µs, total: 8 µs
Wall time: 20 µs


Now that our data is organized into a large text file with one sentence per line, we can use gensim's [**LineSentence**](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) as our iterator when working with other gensim components.  It _streams_ the sentences from disk, so you never have to hold the entire text file in RAM at once.  This allows you to scale your modeling pipeline to handle very large corpora.
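To illustrate the streaming idea, here is a minimal stand-in for that kind of iterator (not gensim's implementation). The `StreamedSentences` class is a hypothetical name; it reopens the file on every pass and yields one whitespace-split sentence at a time, so only a single line is in memory at once.

```python
import codecs
import os
import tempfile

class StreamedSentences(object):
    """Re-iterable stream of token lists, one per line of a text file."""

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        # reopen the file on each pass so the object can be looped over many times
        with codecs.open(self.filepath, encoding='utf_8') as f:
            for line in f:
                yield line.split()

# Write a tiny two-sentence corpus to a temporary file for the demonstration.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo_sentences.txt')
with codecs.open(demo_path, 'w', encoding='utf_8') as f:
    f.write(u'i also add an orange juice\n')
    f.write(u'breakfast time\n')

sentences = StreamedSentences(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)   # works again because __iter__ reopens the file
print(first_pass)
```

A plain generator would be exhausted after one pass, which is why the sketch uses a class with `__iter__` instead.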

In [21]:
unigram_sentences = LineSentence(unigram_sentence_filepath)

We can now look at our new, transformed file (lemmatized, with punctuation and extra whitespace removed).

In [29]:
for unigram_sentence in it.islice(unigram_sentences, 128, 138):
    print u' '.join(unigram_sentence)
    print u''

i also add an orange juice

she go and put my order in while i wait and come back with it after not too long

the egg be cook exactly how i want them the cheesy hash brown casserole and the french toast be both delicious

i also enjoy the sausage which be pretty typical

kings family restaurant feature a very friendly staff great price and tasty food

i be pleased and will definitely come back again

breakfast time

i stop by the kings family restaurant prior to grocery shopping

i have never be to a kings family restaurant since move to pittsburgh

there be several local establishment



Next we will learn a phrase model that will link individual words into two-word phrases.  For example, `"ice cream"` will be joined into a single token `"ice_cream"`.

In [23]:
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')

In [24]:
%%time

# this is a bit time consuming - make the if statement True if you want to run

if 1==0:
    
    bigram_model = Phrases(unigram_sentences)
    
    bigram_model.save(bigram_model_filepath)
    
# load the finished model from disk
bigram_model= Phrases.load(bigram_model_filepath)

CPU times: user 2.2 s, sys: 1.32 s, total: 3.52 s
Wall time: 3.91 s


Now that we have a trained model for word pairs, we can apply it to our reviews.

In [25]:
bigram_sentences_filepath = os.path.join(intermediate_directory,
                                         'bigram_sentences_all.txt')

In [26]:
%%time 

# this is a bit time consuming - make the if statement True if you want to run

if 1 == 0:

    with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = u' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 10 µs


In [27]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [30]:
for bigram_sentence in it.islice(bigram_sentences, 128, 138):
    print u' '.join(bigram_sentence)
    print u''

i also add an orange_juice

she go and put my order in while i wait and come back with it after not too long

the egg be cook exactly how i want them the cheesy hash_brown casserole and the french_toast be both delicious

i also enjoy the sausage which be pretty typical

kings family restaurant feature a very friendly staff great price and tasty food

i be pleased and will definitely come back again

breakfast time

i stop by the kings family restaurant prior to grocery_shopping

i have never be to a kings family restaurant since move to pittsburgh

there be several local establishment



It appears that our phrase modeling worked.  We can see that two-word phrases are now linked into single tokens.  Next, we will train our _second-order_ phrase model to turn linked tokens into tokens that link three words together.

In [31]:
trigram_model_filepath = os.path.join(intermediate_directory,
                                      'trigram_model_all')

In [32]:
%%time

# this is a bit time consuming - make the if statement True if you want to run

if 0 == 1:

    trigram_model = Phrases(bigram_sentences)

    trigram_model.save(trigram_model_filepath)
    
# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)

CPU times: user 2.38 s, sys: 1.25 s, total: 3.63 s
Wall time: 3.93 s


We can now apply our _second-order_ phrase model to the output of our first-order phrase model.  Then we will write the results to a new file.

In [33]:
trigram_sentences_filepath = os.path.join(intermediate_directory,
                                          'trigram_sentences_all.txt')

In [34]:
%%time

# this is a bit time consuming - make the if statement True if you want to run

if 1 == 0:

    with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')

CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 11.9 µs


In [35]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [36]:
for trigram_sentence in it.islice(trigram_sentences, 735, 745):
    print u' '.join(trigram_sentence)
    print u''

it ' a great place for cheap_eats

the delivery_driver mistakenly ring my doorbell have confuse 133 and 113

rather_than take a step back and analyze the situation he begin to accuse my wife and i of ordering and refuse to pay for this pizza

the driver then get on his_cell_phone and rather_than call the number than be give when the order be place begin to call his boss and start threaten me with felony charge

so i take the initiative and ask the fine upstanding gentleman what the phone_number of the order er be phone my neighbor and discover the mistake

rather_than a thank you or a sorry he just speed off break the speed_limit on our block to reach his destination 50_foot_away

i would call to complain but base on the other review its clear the owner do not care about carnegie or it ' resident and its pretty well know around town just how awful their food be so it would be pointless to boycott a place i would never order from again anyway

do yourself a favor and order from any othe

Again we were successful.  We now see groups of three and four tokens that are commonly found next to each other joined into single tokens.
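A second pass can produce three- and four-word tokens because it treats the already-joined tokens from the first pass as single units. Here is a minimal sketch of that idea, using a hypothetical `merge_pairs` helper and hand-picked phrase sets in place of the trained gensim models:

```python
def merge_pairs(tokens, phrases):
    """Join adjacent token pairs that appear in `phrases` with an underscore."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(u'_'.join(pair))
            i += 2   # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hand-picked phrase sets standing in for the two trained models.
first_order = {(u'cell', u'phone'), (u'foot', u'away')}
second_order = {(u'his', u'cell_phone'), (u'50', u'foot_away')}

sentence = u'the driver get on his cell phone 50 foot away'.split()

pass_one = merge_pairs(sentence, first_order)    # joins two-word phrases
pass_two = merge_pairs(pass_one, second_order)   # joins a word with a joined pair

print(u' '.join(pass_one))
print(u' '.join(pass_two))
```

If the second phrase set contained a pair of two already-joined tokens, the same pass would yield a four-word token.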

The final step of our text preparation process is to run the complete text of all the reviews through a pipeline that applies our text normalization and phrase models.

In addition, we will remove stopwords from our text.

Finally, we will write the transformed text out to a new file, one review per line.

In [37]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

In [38]:
%%time

# this is a bit time consuming - make the if statement True if you want to run

if 1 == 0:

    with codecs.open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review(review_txt_filepath),
                                      batch_size=10000, n_threads=4):
            
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in spacy.en.language_data.STOP_WORDS]
            
            # write the transformed review as a line in the new file
            trigram_review = u' '.join(trigram_review)
            f.write(trigram_review + '\n')

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 10 µs


Now we can check out some results.  We will look at a single review, first in its original form, then in its transformed state.

In [42]:
print u'Original:' + u'\n'

for review in it.islice(line_review(review_txt_filepath), 81, 82):
    print review

print u'----' + u'\n'
print u'Transformed:' + u'\n'

with codecs.open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 81, 82):
        print review

Original:

Recommended. 16 inch pizza on special was cheap. Fast service. Not the best pizza but above average.

----

Transformed:

recommend 16_inch_pizza special cheap fast service good pizza above_average



You can see that most of the grammatical structure has been scrubbed from the text &mdash; capitalization, articles/conjunctions, punctuation, spacing, etc. have been removed. However, much of the general semantic *meaning* is still present. Also, multi-word concepts such as "`16_inch_pizza`" and "`above_average`" have been joined into single tokens, as expected. The review text is now ready for higher-level modeling. 

## Topic Modeling with Latent Dirichlet Allocation (_LDA_)

*Topic modeling* is a family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using [*Latent Dirichlet Allocation*](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) or LDA, a popular approach to topic modeling.

In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a *vector* of token counts. There are two layers in this model &mdash; documents and tokens &mdash; and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. Here is what a corpus may look like:

In [43]:
import numpy as np

data = np.array([[1,1,0,0,0,0],
                [0,0,0,1,1,0],
                [0,0,1,0,0,1]])

LDA_frame = pd.DataFrame(data=data, columns=['hello', 'there', 'store', 'knife', 'fork', 'goodbye'], 
                         index=['Document_1','Document_2','Document_3'])

LDA_frame

| | hello | there | store | knife | fork | goodbye |
|---|---|---|---|---|---|---|
| Document_1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Document_2 | 0 | 0 | 0 | 1 | 1 | 0 |
| Document_3 | 0 | 0 | 1 | 0 | 0 | 1 |


This approach has a number of disadvantages:
* Document vectors tend to be large (one dimension for each token $\Rightarrow$ lots of dimensions)
* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
* The dimensions are fully independent of each other; there's no sense of connection between related tokens, such as _knife_ and _fork_.
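To make the sparsity point concrete, here is a small sketch that measures the fraction of zero entries in the toy document-term matrix from above. The names `doc_term` and `sparsity` are illustrative only, and a real corpus with tens of thousands of tokens would be far sparser than this toy example.

```python
import numpy as np

# the same toy document-term matrix used above (3 documents x 6 tokens)
doc_term = np.array([[1, 1, 0, 0, 0, 0],
                     [0, 0, 0, 1, 1, 0],
                     [0, 0, 1, 0, 0, 1]])

# fraction of entries in the matrix that are zero
sparsity = 1.0 - np.count_nonzero(doc_term) / float(doc_term.size)

print('{:.1%} of the entries are zero'.format(sparsity))
```

Even with only six tokens, two thirds of the matrix is empty.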

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of *topics*, and the *topics* are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the user. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow [*Dirichlet*](https://en.wikipedia.org/wiki/Dirichlet_distribution) probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.

![LDA](https://s3.amazonaws.com/skipgram-images/LDA.png)
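To get a feel for why a Dirichlet prior encourages "a handful of topics", here is a minimal sketch that draws topic mixtures with NumPy. The concentration values (0.1 and 10.0) and the 10-topic setup are assumptions chosen for illustration; they are not gensim's defaults.

```python
import numpy as np

rng = np.random.RandomState(0)
num_topics = 10

# small concentration values (< 1) favor sparse mixtures:
# most of the probability mass lands on a few topics
sparse_draws = rng.dirichlet([0.1] * num_topics, size=1000)

# large concentration values favor near-uniform mixtures
dense_draws = rng.dirichlet([10.0] * num_topics, size=1000)

# average weight of the single largest topic in each draw
sparse_top = sparse_draws.max(axis=1).mean()
dense_top = dense_draws.max(axis=1).mean()

print('mean largest topic weight, alpha=0.1:  {:.2f}'.format(sparse_top))
print('mean largest topic weight, alpha=10.0: {:.2f}'.format(dense_top))
```

Each row is a valid probability distribution over the topics. With the small concentration value, the largest topic typically dominates the mixture, which is the behavior the LDA prior is meant to encourage.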

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its [**LdaMulticore**](https://radimrehurek.com/gensim/models/ldamulticore.html) class.

In [44]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import cPickle as pickle

The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's [**Dictionary**](https://radimrehurek.com/gensim/corpora/dictionary.html) class for this.

In [45]:
trigram_dictionary_filepath = os.path.join(intermediate_directory,
                                           'trigram_dict_all.dict')

In [46]:
%%time

# this is a bit time consuming - make the if statement True if you want to run

if 0 == 1:

    trigram_reviews = LineSentence(trigram_reviews_filepath)

    # learn the dictionary by iterating over all of the reviews
    trigram_dictionary = Dictionary(trigram_reviews)
    
    # filter tokens that are very rare or too common from
    # the dictionary (filter_extremes) and reassign integer ids (compactify)
    trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
    trigram_dictionary.compactify()

    trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)

CPU times: user 16.4 ms, sys: 11 ms, total: 27.4 ms
Wall time: 32.8 ms
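Conceptually, the `no_below` and `no_above` arguments filter tokens by *document frequency*: the number of documents a token appears in. The sketch below illustrates that idea in plain Python. The `filter_vocabulary` helper and the toy reviews are hypothetical, and this is not gensim's implementation (it ignores gensim's other options).

```python
from collections import Counter

def filter_vocabulary(documents, no_below=2, no_above=0.5):
    """
    hypothetical helper: keep tokens that appear in at least
    no_below documents and in at most no_above (a fraction)
    of all documents
    """
    # count how many documents each token appears in
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    max_docs = no_above * len(documents)

    return sorted(token for token, df in doc_freq.items()
                  if no_below <= df <= max_docs)

toy_reviews = [['good', 'pizza', 'crust'],
               ['good', 'sushi', 'roll'],
               ['good', 'pizza', 'sauce'],
               ['good', 'sushi', 'rice']]

kept = filter_vocabulary(toy_reviews, no_below=2, no_above=0.5)
print(kept)
```

Here *good* is dropped for being too common (it appears in every review), while *crust*, *roll*, *sauce*, and *rice* are dropped for being too rare.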


Like many NLP techniques, LDA uses a simplifying assumption known as the [*bag-of-words* model](https://en.wikipedia.org/wiki/Bag-of-words_model). In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded. 

We'll use the gensim Dictionary we learned to generate a bag-of-words representation for each review. The `trigram_bow_generator` function implements this. We'll save the resulting bag-of-words reviews as a matrix.

In the following code, "bag-of-words" is abbreviated as `bow`.
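If the bag-of-words idea is new to you, here is a rough plain-Python sketch of the conversion: each document becomes a list of `(token_id, count)` pairs. The helpers `build_token_ids` and `to_bow` are hypothetical stand-ins written for illustration; they are not gensim's code.

```python
from collections import Counter

def build_token_ids(documents):
    """
    hypothetical helper: assign an integer id to each
    distinct token, in order of first appearance
    """
    token_ids = {}
    for doc in documents:
        for token in doc:
            token_ids.setdefault(token, len(token_ids))
    return token_ids

def to_bow(doc, token_ids):
    """
    convert a list of tokens into a sorted list of
    (token_id, count) pairs; unknown tokens are dropped
    """
    counts = Counter(token_ids[token] for token in doc
                     if token in token_ids)
    return sorted(counts.items())

toy_reviews = [['good', 'pizza', 'good', 'crust'],
               ['fast', 'service']]

token_ids = build_token_ids(toy_reviews)
bow = to_bow(['pizza', 'good', 'good', 'burrito'], token_ids)
print(bow)
```

Notice that word order is gone and that *burrito*, which is not in the toy vocabulary, is silently dropped.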

In [47]:
trigram_bow_filepath = os.path.join(intermediate_directory,
                                    'trigram_bow_corpus_all.mm')

In [48]:
def trigram_bow_generator(filepath):
    """
    generator function to read reviews from a file
    and yield a bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield trigram_dictionary.doc2bow(review)

In [49]:
%%time

# this is a bit time consuming - make the if statement True if you want to run

if 1 == 0:

    # generate bag-of-words representations for
    # all reviews and save them as a matrix
    MmCorpus.serialize(trigram_bow_filepath,
                       trigram_bow_generator(trigram_reviews_filepath))
    
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)

CPU times: user 26.8 ms, sys: 10 ms, total: 36.8 ms
Wall time: 49.1 ms


With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to `LdaMulticore` as inputs, along with the number of topics the model should learn. For this demo, we're asking for 50 topics.

In [50]:
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')

In [51]:
%%time

# this is a bit time consuming - make the if statement True if you want to run

if 0 == 1:

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # workers => sets the parallelism, and should be
        # set to your number of physical cores minus one
        lda = LdaMulticore(trigram_bow_corpus,
                           num_topics=50,
                           id2word=trigram_dictionary,
                           workers=3)
    
    lda.save(lda_model_filepath)
    
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

CPU times: user 38.5 ms, sys: 50.1 ms, total: 88.6 ms
Wall time: 128 ms


Our topic model is now trained and ready to use! Because each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [52]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """
        
    print u'{:20} {}'.format(u'term', u'frequency') + u'\n'

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print u'{:20} {:.3f}'.format(term, round(frequency, 3))

In [53]:
#29, 33, 42, 49
explore_topic(topic_number=29)

term                 frequency

pizza                0.165
crust                0.019
order                0.014
cheese               0.013
slice                0.013
like                 0.013
sauce                0.012
try                  0.011
great                0.011
pie                  0.010
'                    0.009
time                 0.009
fresh                0.009
topping              0.009
love                 0.008
taste                0.007
sausage              0.007
salad                0.006
eat                  0.006
ingredient           0.006
wing                 0.006
come                 0.006
little               0.006
delicious            0.006
pepperoni            0.006


In [54]:
topic_names = {0: u'mexican',
               1: u'menu',
               2: u'thai',
               3: u'steak',
               4: u'donuts & appetizers',
               5: u'specials',
               6: u'soup',
               7: u'wings, sports bar',
               8: u'foreign language',
               9: u'las vegas',
               10: u'chicken',
               11: u'aria buffet',
               12: u'noodles',
               13: u'ambience & seating',
               14: u'sushi',
               15: u'arizona',
               16: u'family',
               17: u'price',
               18: u'sweet',
               19: u'waiting',
               20: u'general_',
               21: u'tapas',
               22: u'dirty',
               23: u'customer service',
               24: u'restrooms',
               25: u'chinese',
               26: u'gluten free',
               27: u'pizza_',
               28: u'seafood',
               29: u'pizza',
               30: u'positive reviews',
               31: u'bar & atmosphere',
               32: u'poor service',
               33: u'middle eastern',
               34: u'mexican',
               35: u'casino buffet',
               36: u'general',
               37: u'ambiance',
               38: u'customer service',
               39: u'mexican buffet',
               40: u'experience',
               41: u'brunch',
               42: u'japanese',
               43: u'italian',
               44: u'high cuisine',
               45: u'breakfast',
               46: u'fast food',
               47: u'happy hour & drinks',
               48: u'lunch',
               49: u'latin'}

In [55]:
topic_names_filepath = os.path.join(intermediate_directory, 'topic_names.pkl')

with open(topic_names_filepath, 'wb') as f:
    pickle.dump(topic_names, f)

You can see that, along with **mexican**, there are a variety of topics related to different styles of food, such as **thai**, **steak**, **sushi**, **pizza**, and so on. In addition, there are topics that are more related to the overall restaurant *experience*, like **ambience & seating**, **customer service**, **waiting**, and **price**.

Beyond these two categories, there are still some topics that are difficult to apply a meaningful human interpretation to, such as topics 30 and 36.


In [56]:
explore_topic(topic_number=30)

term                 frequency

great                0.047
price                0.039
menu                 0.022
friendly             0.020
staff                0.019
service              0.017
happy_hour           0.015
love                 0.014
lunch                0.013
fresh                0.011
nice                 0.011
portion              0.011
try                  0.010
restaurant           0.010
time                 0.009
selection            0.008
definitely           0.008
wall                 0.008
favorite             0.007
large                0.007
delicious            0.007
special              0.007
amazing              0.007
awesome              0.006
small                0.006



Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data &mdash; preferably in an interactive format. Fortunately, we have the fantastic [**pyLDAvis**](https://pyldavis.readthedocs.io/en/latest/readme.html) library to help with that!

pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.

In [57]:
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')

In [58]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus,
                                              trigram_dictionary)

    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

CPU times: user 437 ms, sys: 13.8 ms, total: 451 ms
Wall time: 472 ms


`pyLDAvis.display(...)` displays the topic model visualization in-line in the notebook.

In [59]:
pyLDAvis.display(LDAvis_prepared)

### Wait, what am I looking at again?
There are a lot of moving parts in the visualization. Here's a brief summary:

* On the left, there is a plot of the "distance" between all of the topics (labeled as the _Intertopic Distance Map_)
  * The plot is rendered in two dimensions according to a [*multidimensional scaling (MDS)*](https://en.wikipedia.org/wiki/Multidimensional_scaling) algorithm. Topics that are generally similar should appear close together on the plot, while *dis*similar topics should appear far apart.
  * The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
  * An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
* On the right, there is a bar chart showing top terms.
  * When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's *saliency* is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
  * When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter $\lambda$, which can be adjusted with a slider above the bar chart.
    * Setting the $\lambda$ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.
    * Setting $\lambda$ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic &mdash; i.e., terms that occur *only* in this topic, and do not occur in other topics.
    * Setting $\lambda$ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
* Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
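For the curious, the relevance score behind the $\lambda$ slider can be sketched in a few lines. As I understand the definition used by LDAvis, relevance is a weighted sum of a term's log probability within the topic and its log "lift" (the ratio of its probability within the topic to its overall probability in the corpus). The `relevance` helper and all of the probabilities below are made up for illustration.

```python
import numpy as np

def relevance(p_term_given_topic, p_term, lam):
    """
    hypothetical helper:
    lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))
    """
    p_term_given_topic = np.asarray(p_term_given_topic, dtype=float)
    p_term = np.asarray(p_term, dtype=float)

    log_prob = np.log(p_term_given_topic)
    log_lift = np.log(p_term_given_topic / p_term)

    return lam * log_prob + (1.0 - lam) * log_lift

terms = ['good', 'pizza', 'pepperoni']

# made-up probabilities for a pizza-like topic
p_w_given_t = [0.10, 0.08, 0.01]   # probability of each term within the topic
p_w = [0.10, 0.01, 0.001]          # overall probability of each term in the corpus

rankings = {}
for lam in (1.0, 0.0):
    scores = relevance(p_w_given_t, p_w, lam)
    rankings[lam] = [terms[i] for i in np.argsort(-scores)]
    print('lambda = {}: {}'.format(lam, rankings[lam]))
```

With $\lambda = 1.0$ the generic but frequent term *good* ranks first; with $\lambda = 0.0$ the rare but topic-specific term *pepperoni* ranks first.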

A more detailed explanation of the pyLDAvis visualization can be found [here](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf). Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's `LdaMulticore` object and pyLDAvis' visualization, you have to dig through the terms manually.

### Analyzing our LDA model
The interactive visualization pyLDAvis produces is helpful for both:
1. Better understanding and interpreting individual topics, and
1. Better understanding the relationships between the topics.

For (1), you can manually select each topic to view its most frequent and/or "relevant" terms, using different values of the $\lambda$ parameter. This can help when you're trying to assign a human-interpretable name or "meaning" to each topic.

For (2), exploring the _Intertopic Distance Map_ can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.

The x-axis separates two large groups of topics: the clusters above the x-axis seem to relate to the dining experience, while the food-type groups sit below it.  In other words:
* The super-topic in the *lower*-half tends to be about *food*. It groups together the **burger & fries**, **breakfast**, **sushi**, **barbecue**, and **greek** topics, among others.
* The super-topic in the *upper*-half tends to be about other elements of the *restaurant experience*. It groups together the **ambience & seating**, **location & time**, **family**, and **customer service** topics, among others.

So, in addition to the 50 direct topics the model has learned, our analysis suggests a higher-level pattern in the data. Restaurant reviewers in the Yelp dataset talk about two main things in their reviews, in general: (1) the food, and (2) their overall restaurant experience. For this dataset, this is a very intuitive result, and we probably didn't need a sophisticated modeling technique to tell it to us. When working with datasets from other domains, though, such high-level patterns may be much less obvious from the outset &mdash; and that's where topic modeling can help.

### Describing text with LDA
Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% _Topic A_, 20% _Topic B_, 20% _Topic C_, and 10% _Topic D_.

To use an LDA model to generate a vector representation of new text, you'll need to apply any text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:
1. Using spaCy to remove punctuation and lemmatize the text
1. Applying our first-order phrase model to join word pairs
1. Applying our second-order phrase model to join longer phrases
1. Removing stopwords
1. Creating a bag-of-words representation

Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The `lda_description(...)` function will perform all these steps for us, including printing the resulting topical description of the input text.

In [60]:
def get_sample_review(review_number):
    """
    retrieve a particular review index
    from the reviews file and return it
    """
    
    return list(it.islice(line_review(review_txt_filepath),
                          review_number, review_number+1))[0]

In [61]:
def lda_description(review_text, min_topic_freq=0.05):
    """
    accept the original text of a review and (1) parse it with spaCy,
    (2) apply text pre-processing steps, (3) create a bag-of-words
    representation, (4) create an LDA representation, and
    (5) print a sorted list of the top topics in the LDA representation
    """
    
    # parse the review text with spaCy
    parsed_review = nlp(review_text)
    
    # lemmatize the text and remove punctuation and whitespace
    unigram_review = [token.lemma_ for token in parsed_review
                      if not punct_space(token)]
    
    # apply the first-order and second-order phrase models
    bigram_review = bigram_model[unigram_review]
    trigram_review = trigram_model[bigram_review]
    
    # remove any remaining stopwords
    trigram_review = [term for term in trigram_review
                      if term not in spacy.en.language_data.STOP_WORDS]
    
    # create a bag-of-words representation
    review_bow = trigram_dictionary.doc2bow(trigram_review)
    
    # create an LDA representation
    review_lda = lda[review_bow]
    
    # sort with the most highly related topics first
    review_lda = sorted(review_lda, key=lambda (topic_number, freq): -freq)
    
    for topic_number, freq in review_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        print '{:25} {}'.format(topic_names[topic_number],
                                round(freq, 3))
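A portability note: the tuple-unpacking `lambda (topic_number, freq): ...` used in the sort above is Python 2-only syntax. Here is a sketch of the same sort-and-threshold step written so that it also runs on Python 3. The `review_lda` values are made up to stand in for real model output.

```python
# made-up (topic_number, frequency) pairs standing in for lda[review_bow]
review_lda = [(45, 0.417), (13, 0.03), (4, 0.233), (41, 0.106)]
min_topic_freq = 0.05

# sort with the most highly related topics first
review_lda = sorted(review_lda, key=lambda pair: -pair[1])

# keep only the topics at or above the frequency threshold
top_topics = [(topic_number, freq) for topic_number, freq in review_lda
              if freq >= min_topic_freq]

print(top_topics)
```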

In [62]:
sample_review = get_sample_review(50)
print sample_review

Gab n Eat is the best breakfast in Pittsburgh and always a treat.  We went there around 11:30 which brought up the lunch. Vs breakfast debate. I saw another customers soup and knew i would have to try it. The soup was a  Lobster Bisque and it was amazing.....creamy deliciousness. I could have had a quart of it. Yum!  I ended up ordering the create your own mixed grill. (When at Gab n Eat. You have to order the 1/2 grill) I had mine with bacon. Bacon was crisp and delish. and M ordered the regular mixed grill. We also shared an order of potato pancakes. They even brought us 2 sides of applesauce and sour cream. The pancakes could have been a tiny bit crisper, but still very good.



In [63]:
lda_description(sample_review)

breakfast                 0.417
donuts & appetizers       0.233
ambiance                  0.227
brunch                    0.106




In [64]:
sample_review = get_sample_review(210)
print sample_review

I really love deep dish pizza. The thick crust that is crunchy on the outside and soft on the inside. The spicy, peppery sauce that can singe your tongue. The mixture of ingredients just sliding down your chin as it bubbles out of your mouth. That being said, I really don't understand why people go to Pizzeria Uno. This chain has done to the deep dish pizza what McDonald's has done to hamburgers, Pizza Hut has done to pan pizza, Dominoes has done to hand tossed pizza, and Taco Bell has done to Tex-Mex; it has made it almost unpalatable. On our menu:

French onion
Chicken bites
Pizza with pepperoni, hamburger, and romano

Looking past the fact that the service was not the best or the fact that the food came out undercooked and tepid in temperature, the food just was not good. Don't get me wrong, this is not the worst pizza I have ever eaten, but this chain is a perfect example of what happens when something becomes so prescribed, overhandled, and overthought that it no longer resembles 

In [65]:
lda_description(sample_review)

italian                   0.556
pizza                     0.31
steak                     0.054


## Word Vector Embedding with Word2Vec

![word2vec quiz](https://s3.amazonaws.com/skipgram-images/word2vec-1.png)

![word2vec quiz 2](https://s3.amazonaws.com/skipgram-images/word2vec-2.png)

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised &mdash; they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word &mdash; i.e., the other words immediately before and after it &mdash; to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, where it considers snippets of text only a few tokens long at a time.

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrounding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.
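The sliding window itself is easy to picture in code. The sketch below only generates the (focus word, context word) pairs that a window would expose; it leaves out the actual learning step and other details of real word2vec implementations. The `context_pairs` helper is hypothetical.

```python
def context_pairs(tokens, window=2):
    """
    hypothetical helper: yield (focus_word, context_word) pairs
    for every token, looking up to `window` tokens to each side
    """
    for i, focus in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)

        for j in range(start, end):
            if j != i:
                yield focus, tokens[j]

sentence = ['fast', 'service', 'good', 'pizza']
pairs = list(context_pairs(sentence, window=1))
print(pairs)
```

Each pair is one small hint about what a focus word means; the model sees millions of these over the course of an epoch.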

For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).

Word2vec has a number of user-defined hyperparameters, including:
- The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
- The number of training epochs.

For using word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class.

In [66]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

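We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors. The training process is set up to run for twelve epochs, but the eleven follow-up epochs are commented out in the cell below, so the model loaded here has been trained for a single epoch.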

In [67]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if 0 == 1:

    # initiate the model and perform the first epoch of training
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4)
    
    food2vec.save(word2vec_filepath)

    # perform another 11 epochs of training
    # (disabled here; uncomment the loop below to run them)
    # for i in range(1, 12):
    #     food2vec.train(trigram_sentences)
    #     food2vec.save(word2vec_filepath)
        
# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print u'{} training epochs so far.'.format(food2vec.train_count)

1 training epochs so far.
CPU times: user 1.38 s, sys: 2.63 s, total: 4.01 s
Wall time: 5.38 s


In [68]:
print u'{:,} terms in the food2vec vocabulary.'.format(len(food2vec.vocab))

31,690 terms in the food2vec vocabulary.


We can create a Pandas dataframe and check out the word vectors that our model has learned.

In [69]:
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.vocab.iteritems()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda (term, index, count): -count)

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.syn0norm[term_indices, :],
                            index=ordered_terms)

word_vectors

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
be,0.100396,-0.047040,0.058648,0.141263,0.127754,0.204631,0.028114,-0.176108,-0.063177,0.146795,...,-0.203432,-0.123732,0.204359,-0.026366,-0.092495,-0.111798,0.023836,-0.084826,0.015103,-0.026555
the,0.156249,-0.003483,0.011376,-0.049678,0.148479,0.081484,0.053127,-0.088918,-0.123329,-0.014659,...,-0.069448,-0.246058,0.045532,0.032465,-0.075400,-0.085848,0.184431,0.063933,0.015114,-0.017962
and,-0.098279,0.028030,0.044832,-0.006285,0.046772,0.074520,0.067213,-0.263015,0.019374,0.062434,...,-0.043965,-0.120470,-0.050109,0.021659,-0.188732,-0.125845,0.090948,-0.083185,0.063343,-0.110353
i,0.063866,0.085946,0.017861,0.134637,0.049490,-0.038398,0.039321,0.009734,-0.061692,-0.049952,...,-0.244703,-0.014963,-0.025670,0.108573,0.020436,-0.019091,0.266274,0.082171,0.013671,-0.036848
a,0.180133,-0.043865,0.157700,0.112006,0.019187,-0.021478,0.074837,-0.202002,0.011699,0.061437,...,-0.079992,0.086968,0.066493,-0.117041,-0.071215,0.096765,-0.002123,-0.029480,-0.084849,-0.030448
to,-0.029800,0.242651,-0.064001,0.113354,-0.115069,0.086858,-0.003962,-0.190949,0.073353,0.250381,...,-0.151424,-0.086880,-0.001472,0.152081,-0.201066,-0.037630,0.044213,-0.041549,-0.017640,0.030386
it,0.154292,0.016742,-0.105910,0.114105,0.040828,-0.027775,-0.151092,-0.107539,-0.064215,0.087055,...,-0.032262,-0.140275,0.067828,0.089749,-0.019760,0.113149,0.010474,0.142114,-0.015628,-0.155279
have,-0.017811,0.049110,0.029961,-0.081723,0.135560,0.047020,-0.058954,-0.153939,-0.032679,0.026244,...,-0.298696,-0.227017,-0.032741,-0.039939,-0.111513,-0.026491,0.190585,-0.182176,-0.005937,-0.093941
of,0.058251,-0.127608,-0.045526,-0.016808,0.047153,0.148519,0.097284,-0.063336,0.091341,0.067728,...,-0.174333,-0.059729,-0.128359,-0.041158,-0.178422,-0.007582,0.171630,0.129522,0.007311,0.056923
not,0.215381,0.019298,-0.183479,0.018395,0.066796,0.185074,-0.106711,-0.091918,0.132160,0.126565,...,-0.071666,-0.179337,0.087024,-0.003941,0.030340,0.014743,0.160018,0.001428,-0.023337,0.028454


This DataFrame has a row for each term in the food2vec vocabulary. As you can see, each term is embedded in a 100-dimensional vector space.

### But what can we do with these vectors?

In [70]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.most_similar(positive=[token], topn=topn):

        print u'{:20} {}'.format(word, round(similarity, 3))

To start with, we can use these vectors to determine which words are similar to one another.

In [125]:
get_related_terms(u'pizza')

thin_crust_pizza     0.854
calzone              0.789
thin_crust           0.775
pie                  0.772
sicilian_pizza       0.751
pepperoni            0.746
pizza-               0.741
deep_dish            0.739
za                   0.737
stromboli            0.733


In [None]:
get_related_terms(u'')

Word2Vec also gives us the ability to manipulate words with _word algebra_ (also known as _analogy completion_).

The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:
1. Provide a set of words or phrases that you'd like to add or subtract.
1. Look up the vectors that represent those terms in the word vector model.
1. Add and subtract those vectors to produce a new, combined vector.
1. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
1. Return the word(s) associated with the similar vector(s).
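The procedure above can be sketched with NumPy and a handful of made-up vectors. Everything here is an assumption for illustration: the 3-dimensional `toy_vectors` are invented, and `toy_word_algebra` is a hypothetical helper that only loosely mirrors what a real word vector library does.

```python
import numpy as np

# made-up 3-dimensional vectors, for illustration only
toy_vectors = {
    'lunch':  np.array([1.0, 0.0, 1.0]),
    'dinner': np.array([0.0, 1.0, 1.0]),
    'day':    np.array([1.0, 0.0, 0.0]),
    'night':  np.array([0.0, 1.0, 0.0]),
    'pizza':  np.array([0.2, 0.2, -1.0]),
}

def unit(vector):
    """scale a vector to unit length"""
    return vector / np.linalg.norm(vector)

def toy_word_algebra(add, subtract, vectors):
    """
    hypothetical helper: add and subtract unit vectors, then return
    the remaining term with the highest cosine similarity
    """
    combined = sum(unit(vectors[term]) for term in add)
    combined = combined - sum(unit(vectors[term]) for term in subtract)
    combined = unit(combined)

    # the input terms themselves are excluded from the answer
    candidates = [term for term in vectors if term not in add + subtract]

    # for unit vectors, the dot product equals the cosine similarity
    return max(candidates,
               key=lambda term: np.dot(unit(vectors[term]), combined))

answer = toy_word_algebra(add=['lunch', 'night'], subtract=['day'],
                          vectors=toy_vectors)
print(answer)
```

With these toy vectors, *lunch* - *day* + *night* lands closest to *dinner*.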

But more generally, you can think of the vectors that represent each word as encoding some information about the *meaning* or *concepts* of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.

In [71]:
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = food2vec.most_similar(positive=add, negative=subtract, topn=topn)
    
    for term, similarity in answers:
        print term

In [72]:
word_algebra(add=[u'lunch', u'night'], subtract=[u'day'])

dinner


### Burger King + fine dining = ?

In [86]:
word_algebra(add=[u'burger_king', u'fine_dining'])

### Denny's + Mexican = ?

In [85]:
word_algebra(add=[u"denny_'s", u'mexican'])

### Applebee's + Italian = ?

In [84]:
word_algebra(add=[u"applebee_'s", u'italian'])

### Burger - bun + tortilla = ?

In [82]:
word_algebra(add=[u'burger', u'tortilla'], subtract=[u'bun'])

burrito


In [None]:
word_algebra(add=[], subtract=[])

## Word Vector Visualization with t-SNE

[t-Distributed Stochastic Neighbor Embedding](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf), or *t-SNE* for short, is a dimensionality reduction technique to assist with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a two- or three-dimensional representation such that the relative distances between points are preserved as closely as possible in both high-dimensional and low-dimensional space.

scikit-learn provides a convenient implementation of the t-SNE algorithm with its [TSNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) class.

In [87]:
from sklearn.manifold import TSNE

Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:
1. Drop stopwords &mdash; it's probably not too interesting to visualize *the*, *of*, *or*, and so on
1. Take only the 5,000 most frequent terms in the vocabulary &mdash; no need to visualize all ~50,000 terms right now.

In [88]:
tsne_input = word_vectors.drop(spacy.en.language_data.STOP_WORDS, errors=u'ignore')
tsne_input = tsne_input.head(5000)

In [89]:
tsne_input.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| good | 0.058404 | 0.094421 | -0.089817 | -0.086707 | 0.163473 | 0.030847 | -0.056579 | -0.036778 | -0.033341 | 0.027926 | ... | -0.072606 | -0.207204 | -0.05266 | -0.049346 | 0.011246 | -0.057931 | 0.147647 | 0.085732 | 0.043211 | -0.2122 |
| food | 0.082169 | -0.005769 | -0.139226 | 0.059455 | -0.053859 | 0.096395 | -0.102365 | 0.005925 | 0.042754 | 0.035665 | ... | -0.05612 | -0.048596 | -0.056088 | 0.018725 | 0.004112 | -0.025999 | 0.104075 | 0.036108 | -0.032087 | -0.128216 |
| place | 0.05546 | -0.064872 | -0.088949 | 0.056881 | -0.008927 | -0.022996 | -0.249944 | -0.200411 | -0.065363 | 0.070529 | ... | -0.057301 | -0.014397 | 0.034302 | 0.217944 | -0.048372 | -0.12434 | -0.103206 | -0.088915 | -0.00121 | -0.086354 |
| order | -0.062707 | 0.066594 | -0.12596 | -0.113671 | -0.009391 | 0.035103 | 0.093105 | -0.095218 | -0.088589 | -0.029266 | ... | -0.264956 | -0.054617 | -0.099331 | 0.014024 | -0.075721 | -0.074371 | 0.172258 | -0.070418 | 0.01984 | 0.073994 |
| great | 0.100848 | 0.017086 | -0.062039 | 0.000676 | 0.007057 | -0.087765 | 0.035425 | -0.184122 | -0.128618 | -0.017338 | ... | -0.002093 | -0.167171 | -0.014442 | -0.044375 | -0.144915 | -0.108412 | 0.045141 | 0.086508 | 0.150066 | -0.253457 |


In [90]:
tsne_filepath = os.path.join(intermediate_directory,
                             u'tsne_model')

tsne_vectors_filepath = os.path.join(intermediate_directory,
                                     u'tsne_vectors.npy')

In [91]:
%%time

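# flip this condition to re-run t-SNE; as written, the cell
# only loads the results saved to disk by an earlier run
if 1 == 0:
    
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    
    # pickle files must be opened in binary mode
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    pd.np.save(tsne_vectors_filepath, tsne_vectors)
    
with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)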
    
tsne_vectors = pd.np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])

CPU times: user 1.56 s, sys: 59.5 ms, total: 1.62 s
Wall time: 1.73 s


We have now successfully reduced our data set from 100 dimensions to 2.

In [92]:
tsne_vectors.head()

| | x_coord | y_coord |
|---|---|---|
| good | -3.920494 | 1.487516 |
| food | -5.39223 | 0.613856 |
| place | -6.976865 | -0.976673 |
| order | 0.566433 | -1.543686 |
| great | -3.802129 | 0.613526 |


In [93]:
tsne_vectors[u'word'] = tsne_vectors.index

## Plotting with Bokeh

Bokeh is an interactive plotting package that renders in the browser.  It was written for Python, and bindings also exist for other languages, including R and Julia.

In [94]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [95]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width = 800,
                   plot_height = 800,
                   tools= (u'pan, wheel_zoom, box_zoom,'
                           u'box_select, resize, reset'),
                   active_scroll=u'wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = u'@word') )

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);

## Conclusion

Let's round up the major components that we've seen:
1. Text processing with **spaCy**
1. Automated **phrase modeling**
1. Topic modeling with **LDA** $\ \longrightarrow\ $ visualization with **pyLDAvis**
1. Word vector modeling with **word2vec** $\ \longrightarrow\ $ visualization with **t-SNE**

#### Why use these models?
Dense vector representations for text like LDA and word2vec can greatly improve performance for a number of common, text-heavy problems like:
- Text classification
- Search
- Recommendations
- Question answering

...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.
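
As one illustration of the search use case, here is a minimal sketch that averages word vectors into a single vector per document and ranks documents by cosine similarity to a query. The vectors, the document names, and the helpers `document_vector` and `rank_documents` are hypothetical and hand-made for this example; a real application would plug in learned vectors such as those from word2vec, and averaging is only one simple way to combine them.

```python
import numpy as np

# Hypothetical 3-d word vectors, hand-made for illustration only;
# a trained model would supply learned, higher-dimensional vectors.
toy_review_vectors = {
    'pizza': np.array([0.9, 0.1, 0.0]),
    'pasta': np.array([0.8, 0.2, 0.1]),
    'sushi': np.array([0.1, 0.9, 0.0]),
    'ramen': np.array([0.2, 0.8, 0.1]),
    'slow':  np.array([0.0, 0.1, 0.9]),
    'rude':  np.array([0.1, 0.0, 0.8]),
}


def document_vector(tokens, vectors):
    """Average the vectors of the known tokens into one unit-length vector."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        raise ValueError('no known tokens in document')
    mean = np.mean(known, axis=0)
    return mean / np.linalg.norm(mean)


def rank_documents(query_tokens, documents, vectors):
    """Return (doc_id, cosine similarity) pairs, best match first."""
    query = document_vector(query_tokens, vectors)
    scores = [(doc_id, float(np.dot(query, document_vector(tokens, vectors))))
              for doc_id, tokens in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


toy_documents = {
    'italian_review':  ['pizza', 'pasta'],
    'japanese_review': ['sushi', 'ramen'],
    'complaint':       ['slow', 'rude'],
}

toy_ranking = rank_documents(['pasta'], toy_documents, toy_review_vectors)
print(toy_ranking[0][0])  # italian_review
```

Because the comparison happens in vector space rather than on exact strings, a query can match a document that shares none of its words, as long as the words involved have similar vectors.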

## Questions ??