# Neural Networks for NLP

We'll start by importing the usual data libraries, NumPy and pandas.

## Load some Yelp reviews

In [1]:
import numpy as np
import pandas as pd

Let's read in the Yelp reviews data from disk and preview the first several rows.

In [2]:
reviews_df = pd.read_csv('yelp_training_set_review.csv')

In [3]:
reviews_df.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes.cool,votes.funny,votes.useful
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,0,5
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I als...,review,0hT2KtfLiobPvh6cDC8JQg,0,0,1
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,0,2
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


The two columns we'll focus on in this notebook are:

- `text` : the natural-language text the user wrote as part of their review
- `votes.useful` : the number of members of the Yelp user community who marked the review as "useful"

In [4]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229907 entries, 0 to 229906
Data columns (total 10 columns):
business_id     229907 non-null object
date            229907 non-null object
review_id       229907 non-null object
stars           229907 non-null int64
text            229901 non-null object
type            229907 non-null object
user_id         229907 non-null object
votes.cool      229907 non-null int64
votes.funny     229907 non-null int64
votes.useful    229907 non-null int64
dtypes: int64(4), object(6)
memory usage: 17.5+ MB


It looks like there are a few reviews with missing data in the `text` column. Let's drop these rows from the dataset.

In [5]:
reviews_df = reviews_df.dropna(subset=['text'])

reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 229901 entries, 0 to 229906
Data columns (total 10 columns):
business_id     229901 non-null object
date            229901 non-null object
review_id       229901 non-null object
stars           229901 non-null int64
text            229901 non-null object
type            229901 non-null object
user_id         229901 non-null object
votes.cool      229901 non-null int64
votes.funny     229901 non-null int64
votes.useful    229901 non-null int64
dtypes: int64(4), object(6)
memory usage: 19.3+ MB


## Import and explore spaCy (an NLP library)

[spaCy](https://spacy.io) is an industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy is written in optimized Cython, which means it's _fast_. According to a few independent sources, it's the fastest syntactic parser available in any programming language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the GIL).

In [6]:
import spacy

If you want to use many of spaCy's most powerful features, you need to download and install a [spaCy model](https://spacy.io/models/). In this notebook, we'll use the `en_core_web_md` model.

The next code cell downloads and installs the `en_core_web_md` model, and loads it into memory with the reference `nlp`. If you receive a warning message that "linking was unsuccessful," you can disregard this message.

In [7]:
spacy.cli.download('en_core_web_md')

import en_core_web_md

nlp = en_core_web_md.load()


    Error: Couldn't link model to 'en_core_web_md'
    Creating a symlink in spacy/data failed. Make sure you have the required
    permissions and try re-running the command as admin, or use a
    virtualenv. You can still import the model as a module and call its
    load() method, or create the symlink manually.

    /usr/local/anaconda/lib/python3.6/site-packages/en_core_web_md -->
    /usr/local/anaconda/lib/python3.6/site-packages/spacy/data/en_core_web_md


    Creating a shortcut link for 'en' didn't work (maybe you don't have
    admin permissions?), but you can still load the model via its full
    package name: nlp = spacy.load('{name}')
    Download successful but linking failed



Let's grab a sample review to play with.

In [8]:
sample_review = reviews_df.loc[86251, 'text']

print(sample_review)

Back in the day, resorts and hotels were some of the best culinary destinations around. The Fairmont Scottsdale, formerly known as the Fairmont Scottsdale Princess, was one of those shining examples of upscale greatness with its Marquesa and La Hacienda restaurants. With new owners came new ideas, and both establishments were shuttered to make way for culinary concepts, including the fantastic Bourbon Steak by Chef Michael Mina. 

Now, with the help of chef Richard Sandoval, La Hacienda is re-opening with an updated look and menu.

Sandoval grew up in Mexico City working with his restaurateur father. His upbringing helped to fuel a passion that led to multiple accolades and a reputation for innovative Latin cuisine. He opened his own restaurant in 1997 and over the years he's opened several more star-awarded restaurants.

La Hacienda has always been known for great margaritas. The traditional La Hacienda ($10) is still the tops, but one of the new comers, the El Pina Diablo ($11)

Hand the review text to spaCy, and be prepared to wait...

In [9]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 1.88 s, sys: 313 ms, total: 2.2 s
Wall time: 509 ms


...about half a second. Let's take a look at what we got during that time...

In [10]:
print(parsed_review)

Back in the day, resorts and hotels were some of the best culinary destinations around. The Fairmont Scottsdale, formerly known as the Fairmont Scottsdale Princess, was one of those shining examples of upscale greatness with its Marquesa and La Hacienda restaurants. With new owners came new ideas, and both establishments were shuttered to make way for culinary concepts, including the fantastic Bourbon Steak by Chef Michael Mina. 

Now, with the help of chef Richard Sandoval, La Hacienda is re-opening with an updated look and menu.

Sandoval grew up in Mexico City working with his restaurateur father. His upbringing helped to fuel a passion that led to multiple accolades and a reputation for innovative Latin cuisine. He opened his own restaurant in 1997 and over the years he's opened several more star-awarded restaurants.

La Hacienda has always been known for great margaritas. The traditional La Hacienda ($10) is still the tops, but one of the new comers, the El Pina Diablo ($11)

Looks the same! But in that fraction of a second, spaCy has created a [`Doc`](https://spacy.io/api/doc) object for us:

In [11]:
type(parsed_review)

spacy.tokens.doc.Doc

A `Doc` object is a sequence of [`Token`](https://spacy.io/api/token) objects. Print out the first token in the parsed review:

In [12]:
parsed_review[0]

Back

In [13]:
type(parsed_review[0])

spacy.tokens.token.Token

Print out the first 10 tokens in the parsed review.

In [14]:
parsed_review[0:10]

Back in the day, resorts and hotels were some

In [15]:
list(parsed_review[0:10])

[Back, in, the, day, ,, resorts, and, hotels, were, some]

Let's look at the length of the parsed review in tokens.

In [16]:
len(parsed_review)

1009

While creating a `Doc` object from text, spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline, including:

* Tokenization
* Text normalization, such as lowercasing and lemmatization
* Part-of-speech tagging
* Syntactic dependency parsing
* Sentence boundary detection
* Named entity recognition and annotation
* Making built-in word vectors available

...among others.

For our purposes, we will primarily use spaCy for tokenization, lemmatization, and sentence segmentation.

Here's how lemmatization works with spaCy:

In [17]:
token_text = [token.text for token in parsed_review]
token_lemma = [token.lemma_ for token in parsed_review]

pd.DataFrame(
    list(zip(token_text, token_lemma)),
    columns=['token_text', 'token_lemma']
    )

Unnamed: 0,token_text,token_lemma
0,Back,back
1,in,in
2,the,the
3,day,day
4,",",","
5,resorts,resort
6,and,and
7,hotels,hotel
8,were,be
9,some,some


...and this is how you use sentence segmentation:

In [18]:
for num, sentence in enumerate(parsed_review.sents):
    print(f'Sentence {num+1}:')
    print(sentence)
    print('')

Sentence 1:
Back in the day, resorts and hotels were some of the best culinary destinations around.

Sentence 2:
The Fairmont Scottsdale, formerly known as the Fairmont Scottsdale Princess, was one of those shining examples of upscale greatness with its Marquesa and La Hacienda restaurants.

Sentence 3:
With new owners came new ideas, and both establishments were shuttered to make way for culinary concepts, including the fantastic Bourbon Steak by Chef Michael Mina. 



Sentence 4:
Now, with the help of chef Richard Sandoval, La Hacienda is re-opening with an updated look and menu.



Sentence 5:
Sandoval grew up in Mexico City working with his restaurateur father.

Sentence 6:
His upbringing helped to fuel a passion that led to multiple accolades and a reputation for innovative Latin cuisine.

Sentence 7:
He opened his own restaurant in 1997 and over the years he's opened several more star-awarded restaurants.



Sentence 8:
La Hacienda has always been known for great margaritas

If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), spaCy is ready to use out-of-the-box.

It has become a core part of the Python data science ecosystem &mdash; it is doing for natural language computing what other great libraries have done for numerical computing.

## Phrase Modeling

_Phrase modeling_ is an approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that _co-occur_ (i.e., appear one after another) much more frequently than you would expect by random chance. The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$

...where:
* $count(A)$ is the number of times token $A$ appears in the corpus
* $count(B)$ is the number of times token $B$ appears in the corpus
* $count(A\ B)$ is the number of times the tokens $A\ B$ appear in the corpus *in order*
* $N$ is the total size of the corpus vocabulary
* $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
* $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
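
To make the formula concrete, here is a minimal pure-Python sketch that scores adjacent token pairs on a toy corpus. The function name `phrase_scores`, the toy sentences, and the parameter values are illustrative assumptions; this is not gensim's implementation, and gensim's own defaults for these parameters differ.

```python
from collections import Counter


def phrase_scores(sentences, min_count=1, threshold=1.0):
    """Score adjacent token pairs with the formula above (toy sketch)."""
    unigram_counts = Counter()
    bigram_counts = Counter()

    for sentence in sentences:
        unigram_counts.update(sentence)
        # pairs of tokens that appear one after another, in order
        bigram_counts.update(zip(sentence, sentence[1:]))

    vocab_size = len(unigram_counts)  # N in the formula

    accepted = {}
    for (a, b), count_ab in bigram_counts.items():
        score = (count_ab - min_count) / (unigram_counts[a] * unigram_counts[b]) * vocab_size
        if score > threshold:
            accepted[(a, b)] = score

    return accepted


# hypothetical toy corpus, already tokenized and lemmatized
toy_sentences = [
    ['i', 'love', 'ice', 'cream'],
    ['ice', 'cream', 'be', 'great'],
    ['i', 'have', 'the', 'ice', 'cream'],
    ['the', 'pizza', 'be', 'good'],
    ['i', 'be', 'happy'],
]

scores = phrase_scores(toy_sentences, min_count=1, threshold=1.0)
print(scores)
```

In this toy corpus only the pair `('ice', 'cream')` clears the threshold: every other pair occurs exactly once, so subtracting $count_{min}$ drives its score to zero.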

Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.
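
The merging step can be sketched as a single left-to-right pass over the tokens. The helper `apply_phrases` and the hand-written `phrases` set are hypothetical stand-ins for a trained model, assuming a simple greedy strategy; gensim's actual behavior may differ in details.

```python
def apply_phrases(tokens, phrases, delimiter='_'):
    """Greedily merge adjacent token pairs found in `phrases`, left to right."""
    merged = []
    i = 0

    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])

        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # skip both tokens of the merged phrase
        else:
            merged.append(tokens[i])
            i += 1

    return merged


# pretend these pairs were accepted by a trained phrase model
phrases = {('credit', 'card'), ('ice', 'cream')}

sentence = 'no credit card be accept'.split()
merged_sentence = apply_phrases(sentence, phrases)
print(merged_sentence)
```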

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so _new york_ would become *new\_york*). But you would also expect multi-word expressions that represent common concepts but aren't specifically named entities (such as _happy hour_) to become phrases in the model.

We turn to the indispensable [**gensim**](https://radimrehurek.com/gensim/index.html) library to help us with phrase modeling &mdash; the [**Phrases**](https://radimrehurek.com/gensim/models/phrases.html) class in particular.

In [19]:
from gensim.models.phrases import Phrases, Phraser

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

1. Segment text of complete reviews into sentences & normalize text
1. First-order phrase modeling $\rightarrow$ _apply first-order phrase model to transform sentences_
1. Second-order phrase modeling $\rightarrow$ _apply second-order phrase model to transform sentences_
1. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the `lemmatized_sentence_corpus` generator function will use spaCy to:

- Iterate over the reviews in the `reviews_df` DataFrame
- Segment the review text into individual sentences
- Remove excess whitespace
- Lemmatize the text

... and do so efficiently in parallel, thanks to spaCy's `nlp.pipe()` function.

In [20]:
from multiprocessing import cpu_count

CPU_COUNT = cpu_count()


In [21]:
def custom_lemmatize(token):
    
    if token.lemma_ == '-PRON-':
        # For pronouns (I, she, his, etc.), just lowercase them
        # spaCy's lemmatizer would return -PRON- for all pronouns
        return token.lower_
    
    else:
        return token.lemma_        


def lemmatized_sentence_corpus(texts):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(texts, batch_size=10000, n_threads=CPU_COUNT):
        
        for sent in parsed_review.sents:
            yield ' '.join([
                custom_lemmatize(token) for token in sent
                if not token.is_space
                ])

Let's use the `lemmatized_sentence_corpus` generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We'll write this data back out to a new file (`unigram_sentences`), with one normalized sentence per line. We'll use this data for learning our phrase models.

In [22]:
unigram_sentences_filepath = 'unigram_sentences.txt'

For certain code cells in this notebook, there is a significant amount of data and computational heavy lifting involved. The code in these cells will take quite a while to complete, so we have pre-executed these steps and saved their results to disk. You'll see this pattern throughout the notebook:

```
execute = False

if execute:

    <some long-running computationally intense step>
    
    <save results from long-running step to disk>
```

You can see how long each cell originally took by looking at `Wall time` in the cell output. Instead of having to re-run all the computation, we can simply load the saved results from disk.

If you'd like to run the computational heavy-lifting steps yourself, you can set `execute` to `True`, but that's not necessary to follow the notebook.
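
One possible variation on this pattern wraps the "compute or load" decision in a small helper. The sketch below is an assumption for illustration, not what this notebook does: `load_or_compute`, `slow_step`, and the pickle-based storage are hypothetical, and unlike the cells in this notebook the helper also recomputes when no saved result exists.

```python
import os
import pickle
import tempfile


def load_or_compute(filepath, compute, execute=False):
    """Run compute() and save its result when execute is True or when no
    saved result exists yet; then load the saved result from disk."""
    if execute or not os.path.exists(filepath):
        result = compute()
        with open(filepath, 'wb') as f:
            pickle.dump(result, f)

    with open(filepath, 'rb') as f:
        return pickle.load(f)


calls = []


def slow_step():
    # stand-in for a long-running computation
    calls.append(1)
    return sum(range(1000))


cache_path = os.path.join(tempfile.mkdtemp(), 'slow_step.pkl')

first = load_or_compute(cache_path, slow_step)   # computes and saves
second = load_or_compute(cache_path, slow_step)  # loads the saved result
print(first, second, len(calls))
```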

In [23]:
%%time

# this is a bit time consuming
# set execute to True if you
# want to execute data prep yourself.
execute = False

if execute:

    with open(unigram_sentences_filepath, 'w') as f:
        for sentence in lemmatized_sentence_corpus(reviews_df['text']):
            f.write(sentence + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 8.82 µs


```
CPU times: user 9h 24min 13s, sys: 37min 46s, total: 10h 1min 59s
Wall time: 1h 27min 36s
```

If your data is organized like our `unigram_sentences` file now is &mdash; a large text file with one document/sentence per line &mdash; gensim's [**LineSentence**](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) class provides a convenient iterator for working with other gensim components. It *streams* the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.
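
To illustrate the streaming idea, here is a minimal sketch of a re-iterable, line-at-a-time corpus reader. The class `StreamingLines` and the demo file are hypothetical; this is not gensim's `LineSentence` implementation, only the same general concept.

```python
import os
import tempfile


class StreamingLines:
    """Re-iterable corpus that yields one whitespace-split line at a time."""

    def __init__(self, filepath):
        self.filepath = filepath

    def __iter__(self):
        # the file is reopened on every pass, so only one line
        # needs to be held in memory at any moment
        with open(self.filepath, encoding='utf-8') as f:
            for line in f:
                yield line.split()


# write a tiny demo file with one sentence per line
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_sentences.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('it be marvelous .\n')
    f.write('wonderful vietnamese sandwich shoppe .\n')

corpus = StreamingLines(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)
print(first_pass)
```

Because `__iter__` reopens the file each time, the corpus can be iterated more than once, which matters for models that make several passes over the data. A plain generator would be exhausted after a single pass.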

In [24]:
from gensim.models.word2vec import LineSentence

unigram_sentences = LineSentence(unigram_sentences_filepath)

Let's take a look at a few sample sentences in our new, transformed file.

In [25]:
from itertools import islice

start, end = 121, 131

for unigram_sentence in islice(unigram_sentences, start, end):
    print(' '.join(unigram_sentence))
    print('')

i have snickerdoodle with cookie and cream ice cream .

it be marvelous .

wonderful vietnamese sandwich shoppe .

their baguette be great hot out of the oven with butter or in one of their many sandwich choice .

they have a modest selection of bake good along with some of the good egg roll around .

br cash or your atm card as no credit card be accept but they have an atm on premise .

they have a limited time thing go on right now with bbq chicken pizza ( not sure how long it be go to last ) but let me just say it be amazing .

probably the good bbq chicken pizza i have ever have .

i have try other thing too , like the tomato basil soup , and many of their sandwich ... very good , very fresh - every time .

the 5 star be for the pizza , but if i be to rate jason 's deli over all they would get about a 4 .



Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like "`credit card`", to be linked together to form a new, single token: "`credit_card`".

In [26]:
bigram_model_filepath = 'bigram_model'

In [27]:
%%time

# this is a bit time consuming
# set execute to True if you
# want to execute data prep yourself.
execute = False

if execute:

    bigram_model = Phrases(unigram_sentences)

    bigram_model.save(bigram_model_filepath)

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.87 µs


```
CPU times: user 2min 5s, sys: 924 ms, total: 2min 5s
Wall time: 2min 5s
```

In [28]:
# load the finished model from disk
bigram_model = Phrases.load(bigram_model_filepath)
bigram_model = Phraser(bigram_model)

Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data, and write a new file out to disk.

In [29]:
bigram_sentences_filepath = 'bigram_sentences.txt'

In [30]:
%%time

# this is a bit time consuming
# set execute to True if you
# want to execute data prep yourself.
execute = False

if execute:

    with open(bigram_sentences_filepath, 'w') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = ' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 8.34 µs


```
CPU times: user 2min 31s, sys: 281 ms, total: 2min 32s
Wall time: 2min 32s
```

Let's explore the same sample sentences after they've been transformed by the phrase model. 

In [31]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [32]:
for bigram_sentence in islice(bigram_sentences, start, end):
    print(' '.join(bigram_sentence))
    print('')

i have snickerdoodle with cookie and cream ice_cream .

it be marvelous .

wonderful vietnamese sandwich shoppe .

their baguette be great hot out of the oven with butter or in one of their many sandwich choice .

they have a modest selection of bake good along with some of the good egg_roll around .

br_cash or your atm card as no credit_card be accept but they have an_atm on premise .

they have a limited time thing go on right now with bbq chicken pizza ( not sure how long it be go to last ) but let me just say it be amazing .

probably the good bbq chicken pizza i have ever have .

i have try other thing too , like the tomato_basil soup , and many of their sandwich ... very good , very fresh - every time .

the 5_star be for the pizza , but if i be to rate jason 's deli over all they would get about a 4 .



Looks like the phrase modeling worked! We now see two-word phrases, such as "`credit_card`" and "`ice_cream`", linked together in the text as a single token. Next, we'll train a _second-order_ phrase model. We'll apply the second-order phrase model on top of the already-transformed data, so that incomplete word combinations like "`tomato_basil soup`" will become fully joined to "`tomato_basil_soup`".

In [33]:
trigram_model_filepath = 'trigram_model'

In [34]:
%%time

# this is a bit time consuming
# set execute to True if you
# want to execute data prep yourself.
execute = False

if execute:

    trigram_model = Phrases(bigram_sentences)

    trigram_model.save(trigram_model_filepath)

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 7.87 µs


```
CPU times: user 2min 14s, sys: 973 ms, total: 2min 15s
Wall time: 2min 15s
```

In [35]:
# load the finished model from disk
trigram_model = Phrases.load(trigram_model_filepath)
trigram_model = Phraser(trigram_model)

We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences.

In [36]:
trigram_sentences_filepath = 'trigram_sentences.txt'

In [37]:
%%time

# this is a bit time consuming
# set execute to True if you
# want to execute data prep yourself.
execute = False

if execute:

    with open(trigram_sentences_filepath, 'w') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = ' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')

CPU times: user 6 µs, sys: 1 µs, total: 7 µs
Wall time: 9.78 µs


```
CPU times: user 2min 30s, sys: 376 ms, total: 2min 31s
Wall time: 2min 31s
```

In [38]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [39]:
for trigram_sentence in islice(trigram_sentences, start, end):
    print(' '.join(trigram_sentence))
    print('')

i have snickerdoodle with cookie and cream ice_cream .

it be marvelous .

wonderful vietnamese sandwich shoppe .

their baguette be great hot out of the oven with butter or in one of their many sandwich choice .

they have a modest selection of bake good along with some of the good egg_roll around .

br_cash or your atm_card as no credit_card be accept but they have an_atm on premise .

they have a limited time thing go on right now with bbq chicken pizza ( not sure how long it be go to last ) but let me just say it be amazing .

probably the good bbq chicken pizza i have ever have .

i have try other thing too , like the tomato_basil_soup , and many of their sandwich ... very good , very fresh - every time .

the 5_star be for the pizza , but if i be to rate jason_'s deli over all they would get about a 4 .



Looks like the second-order phrase model was successful. We're now seeing three-word phrases, such as "`tomato_basil_soup`".

The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.

We'll write the transformed text out to a new file, with one review per line. We'll use this pre-processed review text later as the input to a neural network model to predict useful votes.

In [40]:
transformed_reviews_filepath = 'transformed_reviews.txt'

In [41]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.

execute = False

if execute:

    with open(transformed_reviews_filepath, 'w') as f:
        
        pipe = nlp.pipe(reviews_df['text'], batch_size=10000, n_threads=CPU_COUNT)
        
        for parsed_review in pipe:
            
            # lemmatize the text, removing whitespace tokens
            # (punctuation tokens are kept)
            unigram_review = [
                custom_lemmatize(token) for token in parsed_review
                if not token.is_space
                ]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
                        
            # write the transformed review as a line in the new file
            trigram_review = ' '.join(trigram_review)
            
            f.write(trigram_review + '\n')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 10.3 µs


```
CPU times: user 9h 19min 37s, sys: 37min 10s, total: 9h 56min 48s
Wall time: 1h 30min 8s
```

## Word Vector Embedding with Word2Vec

Pop quiz! Can you complete this text snippet?

<br><br>

![word2vec quiz](https://s3.amazonaws.com/skipgram-images/word2vec-1.png)

<br><br><br>
You just demonstrated the core machine learning concept behind word vector embedding models!
<br><br><br>

![word2vec quiz 2](https://s3.amazonaws.com/skipgram-images/word2vec-2.png)

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised &mdash; they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word &mdash; i.e., the other words immediately before and after it &mdash; to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, where it considers snippets of text only a few tokens long at a time.
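
The sliding window can be sketched in a few lines of plain Python. The function `focus_context_pairs` is a hypothetical illustration of the basic idea only; implementations such as gensim's add refinements on top of it (for example, randomly shrinking the window and down-sampling very frequent words).

```python
def focus_context_pairs(tokens, window=2):
    """Yield (focus, context) pairs for every token within `window` positions."""
    for i, focus in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)

        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield focus, tokens[j]


tokens = 'the ice_cream be marvelous'.split()
pairs = list(focus_context_pairs(tokens, window=1))

for focus, context in pairs:
    print(f'{focus:12} {context}')
```

Each token takes a turn as the focus word, and words at the edges of the text simply have fewer context words.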

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrounding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.

Under the hood, the word2vec model we're training is really a small neural network with one hidden layer. It looks something like this:

<img src="https://dlwt-images.s3.amazonaws.com/1_2_full_skipgram.png?0" height="75%" width="75%" style="display: block; margin-left: auto; margin-right: auto; padding: 20px">

The word vectors the model learns are the weight matrix for the neurons in the hidden layer. If we pulled out the weight matrix and inspected it, it would look something like this:

<img src="https://dlwt-images.s3.amazonaws.com/1_2_W%28h%29.png" height="80%" width="80%" style="display: block; margin-left: auto; margin-right: auto; padding: 20px">

For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).
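
To see why the hidden-layer weight matrix can be read as a table of word vectors, consider the sketch below. It assumes the standard setup in which each input word is fed to the network as a one-hot vector; the three-term vocabulary and the weight values are made up for illustration.

```python
# hypothetical vocabulary and hidden-layer weights (one row per term)
vocab = ['pizza', 'burger', 'taco']
hidden_weights = [
    [0.1, 0.3],
    [0.2, 0.4],
    [0.9, -0.5],
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def hidden_layer(one_hot_vector, weights):
    """Multiply the input vector by the weight matrix (vector x matrix)."""
    n_dims = len(weights[0])
    return [
        sum(x * row[d] for x, row in zip(one_hot_vector, weights))
        for d in range(n_dims)
    ]


burger_index = vocab.index('burger')
burger_vector = hidden_layer(one_hot(burger_index, len(vocab)), hidden_weights)
print(burger_vector)
```

Multiplying a one-hot vector by the matrix simply selects one row, so the hidden layer's output for a word is exactly that word's row of weights.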

Word2vec has a number of user-defined hyperparameters, including:
- The dimensionality of the vectors. Typical choices include a few dozen to several hundred.
- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
- The number of training epochs.

For using word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class.

In [42]:
from gensim.models import Word2Vec

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = 'word2vec_model'

We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.

In [43]:
%%time

# this is a bit time consuming
# set execute to True if you
# want to execute data prep yourself.
execute = False

if execute:

    # initiate the model and perform the first epoch of training
    # (iter=1 overrides the constructor's default of 5 epochs)
    word2vec = Word2Vec(
        trigram_sentences,
        size=100,
        window=5,
        min_count=20,
        sg=1,
        iter=1,
        workers=CPU_COUNT-1
        )
    
    word2vec.save(word2vec_filepath)

    # perform another 11 epochs of training
    for i in range(1,12):

        word2vec.train(
            trigram_sentences,
            total_examples=word2vec.corpus_count,
            epochs=1
            )
        
        word2vec.save(word2vec_filepath)

CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 8.34 µs


```
CPU times: user 57min 15s, sys: 10.4 s, total: 57min 25s
Wall time: 9min 52s
```

In [44]:
# load the finished model from disk
word2vec = Word2Vec.load(word2vec_filepath)
word2vec.init_sims()

In [45]:
print(f'{word2vec.train_count} training epochs so far.')
print(f'{len(word2vec.wv.vocab):,} terms in the word2vec vocabulary.')

12 training epochs so far.
23,940 terms in the word2vec vocabulary.


Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns.

In [46]:
# build a list of the terms, integer indices,
# and term counts from the word2vec model vocabulary
ordered_vocab = [
    (term, voc.index, voc.count)
    for term, voc in word2vec.wv.vocab.items()
    ]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda vocab_element: -vocab_element[-1])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the word2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(word2vec.wv.vectors_norm[term_indices, :],
                            index=ordered_terms)

word_vectors

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
.,-0.254288,-0.008166,0.096933,0.120253,0.203089,0.109891,0.088139,-0.015824,-0.048591,0.144771,...,0.027867,-0.092910,0.063256,-0.030933,-0.052652,0.007322,-0.079695,0.090100,0.084017,0.076051
be,-0.169987,-0.037755,0.031239,0.086343,-0.041922,-0.004628,0.190583,0.079585,0.024429,0.043859,...,-0.299139,0.029100,0.102870,-0.106142,-0.061285,-0.032494,0.003788,0.081990,-0.135785,-0.035845
the,-0.224838,-0.031757,0.066099,-0.000337,-0.089664,0.152004,0.063224,0.059955,0.083353,0.011475,...,-0.064983,-0.119364,0.097833,0.074799,-0.110316,0.068887,0.100405,0.083792,-0.023126,-0.128029
",",-0.084222,-0.241057,0.153468,0.030384,0.073891,-0.003068,0.075903,0.008140,-0.059969,-0.059622,...,-0.092725,-0.075149,0.058186,-0.079363,-0.098208,-0.010734,-0.072859,0.155270,-0.037007,-0.040295
and,-0.130864,-0.106706,0.319187,-0.020662,-0.026065,0.130232,0.028848,-0.036718,-0.045193,0.106748,...,-0.077935,-0.116944,-0.068429,0.034556,-0.185711,-0.035680,-0.154883,0.071010,0.014853,0.087299
i,-0.036812,-0.001789,0.098743,0.033161,0.108806,-0.060845,0.010600,0.112932,-0.107017,0.162365,...,-0.008071,-0.045966,-0.033461,-0.155408,-0.019195,0.029607,-0.256207,-0.087891,0.060058,-0.122758
a,-0.159124,0.026329,0.090037,0.041029,0.130571,-0.074993,0.253993,-0.123532,0.067198,0.206722,...,0.043978,-0.077543,-0.067460,-0.005895,-0.068900,0.058941,-0.233234,0.064196,-0.157664,-0.109600
to,-0.069858,0.147919,0.108706,0.095508,-0.095533,-0.012318,0.081875,0.158887,0.032820,0.106667,...,-0.114423,-0.008895,-0.083843,0.157734,-0.086326,0.115300,0.068325,0.003174,0.103649,-0.079623
have,-0.052769,0.006958,0.188641,-0.106268,0.022796,-0.091381,0.042286,-0.018741,0.029946,0.037364,...,-0.145475,0.107154,0.170398,-0.025789,-0.090797,-0.070032,-0.169535,-0.065836,-0.068351,-0.168688
it,-0.147424,0.040524,-0.021452,-0.023354,0.014061,-0.041111,0.062585,-0.015974,-0.083464,0.055700,...,-0.011148,-0.220539,0.076913,-0.036763,-0.141299,0.095313,-0.154693,-0.034851,-0.034097,-0.139978


What a wall of numbers! This DataFrame has one row for each term in the vocabulary and 100 columns. Our model has learned a quantitative vector representation for each term, as expected.

Put another way, our model has "embedded" the terms into a 100-dimensional vector space.

### So... what can we do with all these numbers?
The first thing we can use them for is to simply look up related words and phrases for a given term of interest.

In [47]:
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in word2vec.wv.most_similar(positive=[token], topn=topn):

        print(f'{word:20} {round(similarity, 3)}')

### What things are like Burger King?

In [48]:
get_related_terms('burger_king')

mcdonalds            0.807
mcdonald_'s          0.802
wendy_'s             0.785
denny_'s             0.753
bk                   0.701
taco_bell            0.677
panda_express        0.677
subway               0.674
sbarro               0.672
kfc                  0.67


The model has learned that fast food restaurants are similar to each other! In particular, *mcdonalds* and *wendy's* are the most similar to Burger King, according to this dataset. In addition, the model has found that alternate spellings for the same entities are probably related, such as *mcdonalds* and *mcdonald's*, or *bk* for Burger King itself.

### When is happy hour?

In [49]:
get_related_terms('happy_hour')

hh                   0.909
reverse_happy_hour   0.85
during_happy_hour    0.795
happy_hr             0.716
until_7_p.m.         0.694
happy_hour-          0.69
sunday_brunch        0.651
mon                  0.645
during_hh            0.644
happy_hours          0.643


The model has noticed several alternate spellings for happy hour, such as *hh* and *happy hr*, and assesses them as highly related. If you were looking for reviews about happy hour, such alternate spellings would be very helpful to know.

Taking a deeper look, the model has also turned up time-related phrases like *until 7 p.m.* and *mon*. This is especially interesting, because the model has no advance knowledge at all about what happy hour is, or what time of day it should be. But simply by scanning through restaurant reviews, the model has discovered that the concept of happy hour has something important to do with particular times of day and days of the week.

### Let's make pasta tonight. Which style do you want?

In [50]:
get_related_terms('pasta', topn=20)

pasta_dish           0.807
angel_hair_pasta     0.772
meatball             0.764
penne                0.759
pesto                0.757
lasagna              0.755
spaghetti            0.755
mizithra_cheese      0.753
fettucine            0.748
alfredo              0.746
salad                0.742
spinach              0.735
eggplant_parmesan    0.735
ravioli              0.724
risotto              0.719
tortellini           0.719
chicken_marsala      0.716
gnocchi              0.716
rigatoni             0.716
carbonara            0.715


## Word algebra!
No self-respecting word2vec demo would be complete without a healthy dose of *word algebra*, also known as *analogy completion*.

The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:
1. Provide a set of words or phrases that you'd like to add or subtract.
1. Look up the vectors that represent those terms in the word vector model.
1. Add and subtract those vectors to produce a new, combined vector.
1. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
1. Return the word(s) associated with the similar vector(s).

But more generally, you can think of the vectors that represent each word as encoding some information about the *meaning* or *concepts* of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.

In [51]:
def word_algebra(add=None, subtract=None, topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    # use None defaults so no mutable list is shared between calls
    answers = word2vec.wv.most_similar(
        positive=add or [], negative=subtract or [], topn=topn)
    
    for term, similarity in answers:
        print(term)
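
To make the five steps listed above concrete, here is a minimal sketch of the same procedure in plain NumPy. The three-dimensional `toy_vectors` are invented for illustration and `toy_word_algebra` is a hypothetical helper, not part of gensim. It assumes, roughly as gensim's `most_similar` does, that vectors are scaled to unit length before combining and that the input terms are excluded from the results.

```python
import numpy as np

# Hypothetical 3-dimensional "word vectors", made up for illustration only.
toy_vectors = {
    'lunch':  np.array([1.0, 1.0, 0.0]),
    'dinner': np.array([1.0, 0.0, 1.0]),
    'day':    np.array([0.0, 1.0, 0.0]),
    'night':  np.array([0.0, 0.0, 1.0]),
    'pasta':  np.array([1.0, 0.5, 0.5]),
}

def unit(vector):
    """Scale a vector to length 1."""
    return vector / np.linalg.norm(vector)

def toy_word_algebra(add, subtract=(), topn=1):
    # steps 2-3: look up the vectors, then add and subtract them
    combined = np.zeros(3)  # 3 = dimensionality of the toy vectors
    for term in add:
        combined += unit(toy_vectors[term])
    for term in subtract:
        combined -= unit(toy_vectors[term])

    # step 4: rank every other term by cosine similarity to the combined vector
    inputs = set(add) | set(subtract)
    scored = [
        (term, float(unit(vector) @ unit(combined)))
        for term, vector in toy_vectors.items()
        if term not in inputs
    ]
    scored.sort(key=lambda pair: -pair[1])

    # step 5: return the best-matching term(s) with their similarity scores
    return scored[:topn]

print(toy_word_algebra(add=['lunch', 'night'], subtract=['day']))
```

With these made-up vectors, *dinner* ranks above *pasta* for lunch - day + night. The real model does the same thing over 100 dimensions and the full vocabulary.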

### breakfast + lunch = ?
Let's start with a softball.

In [52]:
word_algebra(add=['breakfast', 'lunch'])

brunch


OK, so the model knows that *brunch* is a combination of *breakfast* and *lunch*. What else?

### lunch - day + night = ?

In [53]:
word_algebra(add=['lunch', 'night'], subtract=['day'])

dinner


Now we're getting a bit more nuanced. The model has discovered that:
- Both *lunch* and *dinner* are meals
- The main difference between them is time of day
- Day and night are times of day
- Lunch is associated with day, and dinner is associated with night

What else?

### burrito - mexican + chinese = ?

In [54]:
word_algebra(add=['burrito', 'chinese'], subtract=['mexican'])

lo_mein


Here's an entirely new and different type of relationship that the model has learned.

- It knows that burritos are an example of Mexican food
- It knows that Mexican and Chinese are both styles of food
- If you subtract *Mexican* from *burrito*, you're left with something like the concept of *"a type of food"*, which is represented as a new vector
- If you add that new *"a type of food"* vector to _Chinese_, and look up a nearby vector, you get *lo mein*.

What else?

### bun - american + mexican = ?

In [55]:
word_algebra(add=['bun', 'mexican'], subtract=['american'])

tortilla


The model knows that both *buns* and *tortillas* are the doughy thing that goes on the outside of your real food, and that the primary difference between them is the style of food they're associated with.

What else?

### filet - beef + seafood = ?

In [56]:
word_algebra(add=['filet', 'seafood'], subtract=['beef'])

lobster_tail


The model has learned a concept of *delicacy*. If you take _filet_ and subtract _beef_ from it, you're left with a vector that roughly corresponds to delicacy. If you add the delicacy vector to *seafood*, you get *lobster tail*.

What else?

### Applebee's + x = ?

In [57]:
word_algebra(add=["applebee_'s", 'italian'])

olive_garden


In [58]:
word_algebra(add=["applebee_'s", 'breakfast'])

ihop


In [59]:
word_algebra(add=["applebee_'s", 'cheesecake'])

cheesecake_factory


## Word Vector Visualization with t-SNE

[t-Distributed Stochastic Neighbor Embedding](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf), or t-SNE for short, is a dimensionality reduction technique to assist with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a low two- or three-dimensional representation such that the relative distances between points are preserved as closely as possible in both high-dimensional and low-dimensional space.

scikit-learn provides a convenient implementation of the t-SNE algorithm with its [`TSNE`](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) class.

In [60]:
from sklearn.manifold import TSNE

Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:

- Drop punctuation and stopwords — it's probably not too interesting to visualize the, of, or, and so on
- Take only the 5,000 most frequent terms in the vocabulary — no need to visualize all terms right now.

In [61]:
from string import punctuation

punct_and_stop = set.union(set(punctuation), nlp.Defaults.stop_words)

tsne_input = word_vectors.drop(punct_and_stop, errors='ignore')
tsne_input = tsne_input.head(5000)

In [62]:
tsne_input.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
good,-0.010751,0.003195,0.061726,0.012054,0.018434,0.043065,-0.025598,-0.0885,-0.019108,0.06702,...,-0.000666,0.000682,0.049886,0.065574,0.009957,0.028708,-0.165895,-0.035162,-0.053935,-0.040618
place,-0.072572,0.101745,-0.002463,0.137321,-0.037302,-0.027995,0.040581,0.066791,-0.127034,-0.004785,...,-0.138536,-0.186387,0.240998,0.008538,0.076003,-0.065291,-0.283148,0.080416,-0.070483,-0.041105
food,-0.140375,0.015928,-0.078567,-0.003502,-0.045927,0.152604,0.029936,-0.105862,0.006941,0.118307,...,-0.016753,0.000615,0.094833,-0.0303,-0.004086,-0.054307,-0.043243,0.165271,0.001989,-0.205273
like,-0.026877,0.068767,0.263107,0.052399,-0.021948,-0.109308,0.065365,0.019362,0.013777,0.036759,...,-0.281176,0.067463,-0.047427,-0.040918,0.019139,0.192089,-0.010476,0.025508,-0.01546,-0.12665
great,-0.095316,-0.04016,0.062136,0.067663,-0.058903,0.05115,0.007506,-0.148531,0.01334,0.030207,...,0.029164,0.141338,0.12793,0.222855,0.00593,0.070921,-0.200842,0.014155,-0.074503,-0.101075


Next, we'll "train" a `TSNE` model on our 100-dimensional word vectors. We'll use it to generate a two-dimensional representation that preserves the spatial relationships in the original data as much as possible.

In [63]:
import pickle

tsne_filepath = 'tsne_model'
tsne_vectors_filepath = 'tsne_vectors.npy'

In [64]:
%%time

execute = False

if execute:
    
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)
    
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    np.save(tsne_vectors_filepath, tsne_vectors)

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 10 µs


```CPU times: user 1min 42s, sys: 10.8 s, total: 1min 53s
Wall time: 1min 53s```

In [65]:
with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)
    
tsne_vectors = np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(
    tsne_vectors,
    index=pd.Index(tsne_input.index),
    columns=['x_coord', 'y_coord']
    )

Now we have a two-dimensional representation of our data! Let's preview the new coordinates.

In [66]:
tsne_vectors.head()

Unnamed: 0,x_coord,y_coord
good,20.968317,28.874466
place,-11.629225,43.165817
food,-31.311211,36.447498
like,16.364672,42.78075
great,18.449545,29.140442


In [67]:
tsne_vectors['word'] = tsne_vectors.index

### Plotting with Bokeh

Bokeh is an interactive visualization library in Python that targets web browsers for presentation. We can use Bokeh to create an interactive plot of our t-SNE word vectors to help us explore and understand the relationships our word2vec model has learned from our Yelp data.

In [68]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()

In [69]:
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(
    title='t-SNE Word Embeddings',
    plot_width=800,
    plot_height=800,
    tools=('pan, wheel_zoom, box_zoom, box_select, resize, reset'),
    active_scroll='wheel_zoom'
    )

# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips = '@word'))

# draw the words as circles on the plot
tsne_plot.circle(
    'x_coord',
    'y_coord',
    source=plot_data,
    color='blue',
    line_alpha=0.2,
    fill_alpha=0.1,
    size=10,
    hover_line_color='black'
    )

# configure visual elements of the plot
tsne_plot.title.text_font_size = value('16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot);

In this plot, every word or token is represented as a light blue circle. You can navigate the plot using the tools displayed in the upper right: zoom, pan, select, etc. Hovering your mouse pointer over one or more circles on the plot will display the words the circles represent.

Areas of dense clustering indicate that the model has found a group of words it believes are highly similar. Words in neighboring clusters might be more loosely related to each other. Words that are distant from each other are not related, according to the model (note that this is not the same thing as being _dissimilar_).

Spend some time playing with this plot to gain a bit of intuition for the meanings for words and relationships between words that the word2vec model has learned from our dataset!

## Predicting Useful Votes with Neural Networks in Keras

Finally, we are ready to build a neural network model to predict whether a review is likely to get any useful votes. We'll use the Keras neural network library as a high-level interface for TensorFlow. Our neural network will use the word vector embeddings we've learned in the previous section as an initial _Embedding_ layer in its model architecture.

Let's first import everything we'll need from Keras.

In [70]:
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM, Bidirectional
from keras.layers import Conv1D, MaxPooling1D

Using TensorFlow backend.


Next, we'll read in the normalized text from our reviews we pre-processed earlier in the notebook. Each review will be read in as a list of lemmatized tokens.

Since this is a smallish dataset, for convenience, we'll simply read all of the data into RAM. If we needed to work with a larger dataset, we could stream the data from disk, only holding a subset of the reviews in memory at once.

In [71]:
transformed_reviews = list(LineSentence(transformed_reviews_filepath))

In [72]:
sample_review_id = 2

print(transformed_reviews[sample_review_id])

['love', 'the', 'gyro', 'plate', '.', 'rice', 'be', 'so', 'good', 'and', 'i', 'also', 'dig', 'their', 'candy', 'selection', ':)']


#### Prepare the training data

Like other machine learning models, neural networks work with numbers, not raw text. Keras will expect our input data to be a sequence of integer IDs corresponding to the tokens in the original text.

Fortunately, our word2vec model has already learned the vocabulary from our text data and assigned an integer index to each token. We can reuse its dictionary for our integer IDs in our Keras model.

In [73]:
word2vec.wv.vocab['love'].index

67

We'll define a function to generate sequences of integer IDs for us. This function will accept a sequence of tokens, loop through them, and look up the integer ID for each one. It will return a new list containing the word2vec model IDs for the token sequence.

Like most things in Python, the word2vec model starts assigning integer IDs to tokens with the number 0, and counts up from there. We'll add 1 to all of the IDs, so that the ID 0 is reserved for (1) tokens that aren't in the model's vocabulary, and (2) sequence padding, which we'll see in the next section.

In [74]:
def generate_seq(text):

    seq = []

    for token in text:
        try:
            # add 1 to all token ids
            # reserve id 0 for out-of-vocabulary & sequence padding
            seq.append(word2vec.wv.vocab[token].index + 1)
        except KeyError:
            # if token is not in model vocabulary,
            # use 0 as a placeholder id
            seq.append(0)
            
    return seq

Let's test our new function.

In [75]:
seq = generate_seq(transformed_reviews[sample_review_id])

pd.DataFrame(
    [transformed_reviews[sample_review_id], seq],
    index=['token', 'id']
    ).T

Unnamed: 0,token,id
0,love,68
1,the,3
2,gyro,1058
3,plate,277
4,.,1
5,rice,362
6,be,2
7,so,30
8,good,23
9,and,5


We can use our function to generate a token ID sequence for every review in our dataset. We're left with a list of lists, where each list is a token ID sequence for one review.

In [76]:
seqs = [generate_seq(text) for text in transformed_reviews]

In [77]:
seqs[sample_review_id]

[68, 3, 1058, 277, 1, 362, 2, 30, 23, 5, 6, 76, 1316, 55, 1294, 235, 479]

An inconvenient fact of life in Keras and TensorFlow is that Keras expects all of the input data to have the same size and shape. This isn't much of a problem, for, say, image data that can easily be resized, or structured data with a fixed schema.

Yelp reviews can and do have different lengths, so our input data is not naturally the same size and shape:

In [78]:
avg_length = sum((len(seq) for seq in seqs)) / len(seqs)
longest = max((len(seq) for seq in seqs))

print(f'The average sequence length is {avg_length:.1f} tokens')
print(f'The longest sequence is {longest} tokens')

The average sequence length is 147.1 tokens
The longest sequence is 1144 tokens


The usual workaround is called _sequence padding_. We will choose a fixed length and resize all of our token ID sequences to match this length.

If a sequence is shorter than the fixed length, we'll _pad_ the sequence with extra zeros until it matches the chosen length. If a sequence is longer than the designated length, the extra length will be chopped off and discarded.

Keras provides a `pad_sequences` utility function for this purpose.

In [79]:
max_length = 500

padded_seqs = pad_sequences(seqs, maxlen=max_length, value=0)

In [80]:
type(padded_seqs)

numpy.ndarray

Let's inspect the same sample review &mdash; this time comparing the original text with the padded sequence of token IDs we produced from it.

In [81]:
print(' '.join(transformed_reviews[sample_review_id]))
print('')
print(pd.Series(padded_seqs[sample_review_id, :]))

love the gyro plate . rice be so good and i also dig their candy selection :)

0         0
1         0
2         0
3         0
4         0
5         0
6         0
7         0
8         0
9         0
10        0
11        0
12        0
13        0
14        0
15        0
16        0
17        0
18        0
19        0
20        0
21        0
22        0
23        0
24        0
25        0
26        0
27        0
28        0
29        0
       ... 
470       0
471       0
472       0
473       0
474       0
475       0
476       0
477       0
478       0
479       0
480       0
481       0
482       0
483      68
484       3
485    1058
486     277
487       1
488     362
489       2
490      30
491      23
492       5
493       6
494      76
495    1316
496      55
497    1294
498     235
499     479
Length: 500, dtype: int32


 We see that it's now mostly padding, with the token IDs for real words at the end.
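
The zeros land at the front because `pad_sequences` defaults to `padding='pre'` and `truncating='pre'`: short sequences are padded on the left, and long ones lose their earliest tokens. Here is a minimal pure-Python sketch of that behavior for a single sequence; `pad_pre` is a hypothetical helper for illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """
    left-pad (or left-truncate) one sequence so that it has
    exactly maxlen items; assumes maxlen is a positive integer
    """
    # keep only the last maxlen items (drops the start of long sequences)
    trimmed = seq[-maxlen:]

    # fill the remaining positions at the front with the padding value
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([68, 3, 1058], maxlen=5))
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))
```

The first call shows padding on a short sequence, and the second shows truncation of a sequence longer than `maxlen`.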

Let's look at one more.

In [82]:
print(' '.join(transformed_reviews[sample_review_id + 1]))
print('')
print(pd.Series(padded_seqs[sample_review_id + 1, :]))

rosie , dakota , and i love chaparral_dog_park ! ! ! it be very convenient and surround_by a lot of path , a desert xeriscape , baseball_field , ballpark , and a lake with duck . the scottsdale park and rec dept . do a wonderful job of keep the park clean and shade . you can find trash can and poopy - pick_up mitt locate all over the park and path . the fenced in area be huge to let the dog run , play , and sniff !

0          0
1          0
2          0
3          0
4          0
5          0
6          0
7          0
8          0
9          0
10         0
11         0
12         0
13         0
14         0
15         0
16         0
17         0
18         0
19         0
20         0
21         0
22         0
23         0
24         0
25         0
26         0
27         0
28         0
29         0
       ...  
470        5
471        0
472       27
473      601
474    19805
475      686
476       42
477      114
478        3
479      536
480        5
481     2980
482        1
483     

The original reviews are different lengths, but as far as Keras is concerned, the size and shape of the sequences are now consistent! These padded sequences will serve as the raw input data to our neural network:

In [83]:
padded_seqs.shape

(229901, 500)

#### Prepare the labels

Let's turn our attention to the target variable. The `votes.useful` column contains how many useful votes a review received from the Yelp community:

In [84]:
reviews_df['votes.useful'].head()

0    5
1    0
2    1
3    2
4    0
Name: votes.useful, dtype: int64

We'd like to frame this as a classification problem, so we'll transform the `votes.useful` data into a binary variable &mdash; 1 if a review received any useful votes, 0 otherwise.

In [85]:
labels = reviews_df['votes.useful'].apply(bool).astype(int)

In [86]:
labels.head()

0    1
1    0
2    1
3    1
4    0
Name: votes.useful, dtype: int64

In [87]:
labels.describe()

count    229901.000000
mean          0.585195
std           0.492689
min           0.000000
25%           0.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: votes.useful, dtype: float64

Looks like 58% of the reviews in our dataset received at least one useful vote.

This provides a helpful baseline: a naive strategy would be to predict 1 for every review. You'd be right 58% of the time! A worthwhile model must at least beat 58% accuracy.
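
As a small sketch of that baseline idea, the snippet below finds the most common label in a list and the accuracy you'd get by always predicting it. The `toy_labels` list and the `majority_baseline_accuracy` helper are hypothetical, chosen only for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """
    return the most common label and the accuracy achieved
    by predicting that label for every example
    """
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

# made-up labels for illustration: 3 of 5 reviews got a useful vote
toy_labels = [1, 0, 1, 1, 0]

label, accuracy = majority_baseline_accuracy(toy_labels)
print(f'Always predicting {label} is right {accuracy:.0%} of the time')
```

Applied to the real `labels` Series, the same calculation should reproduce the roughly 58% figure above.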

We'll use the standard `train_test_split` function from scikit-learn to set up our training and testing sets.

In [88]:
from sklearn.model_selection import train_test_split

In [89]:
X_train, X_test, y_train, y_test = train_test_split(padded_seqs, labels, test_size=0.15, random_state=42)
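
As a quick sanity check on the split sizes, the arithmetic below assumes scikit-learn rounds the test set size up to a whole number of samples and assigns the remainder to training. The variable names are illustrative only.

```python
import math

n_samples = 229901   # rows in padded_seqs
test_size = 0.15     # fraction held out for testing

# assumption: the test set size is rounded up,
# and the training set gets whatever is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f'{n_train} training samples, {n_test} test samples')
```

These counts should agree with the "Train on ... samples, validate on ... samples" line Keras reports during training.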

#### Define the network architecture

With our training data and labels squared away, we can define the architecture for our neural network.

In [90]:
nn = Sequential()

A single example of input data for our network will be one of the token ID sequences we created in the previous section &mdash; in other words, a vector of shape (500, 1).

##### Embedding

The first layer in our network will be an _Embedding_ layer. The Embedding layer will simply look up the embedding vector for each token ID in the sequence.

We can re-use the word vector embeddings we learned in our word2vec model as the weights for the Embedding layer. This allows the model to "import" pre-trained knowledge about the meanings of and relationships between the words before we even start the training process.

Since we reserved ID 0 for out-of-vocabulary words and sequence padding, we need to add a row of zeros as the first row in the embedding matrix.

In [91]:
word2vec.wv.vectors.shape

(23940, 100)

In [92]:
embedding_weights = np.vstack((
    np.zeros((1, word2vec.wv.vectors.shape[1])),
    word2vec.wv.vectors
    ))

embedding_weights.shape

(23941, 100)

Let's create the Embedding layer and add it to our network like so:

In [93]:
nn.add(
    Embedding(
        embedding_weights.shape[0],
        embedding_weights.shape[1],
        weights=[embedding_weights],
        input_length=max_length,
        trainable=False
        )
    )

The input to the Embedding layer will be a vector of shape (500, 1). The Embedding layer will replace each token ID with a 100-dimension token vector, so the output from the embedding layer will be a matrix of shape (500, 100).
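
Conceptually, the lookup is just row indexing into the weight matrix. The sketch below uses a made-up 4 x 3 matrix (`toy_weights`) and a four-token sequence (`toy_ids`) instead of the real (23941, 100) weights, purely to show the mechanics.

```python
import numpy as np

# hypothetical embedding matrix: 4 token IDs, 3 dimensions each
toy_weights = np.array([
    [0.0, 0.0, 0.0],   # ID 0: padding / out-of-vocabulary (all zeros)
    [0.1, 0.2, 0.3],   # ID 1
    [0.4, 0.5, 0.6],   # ID 2
    [0.7, 0.8, 0.9],   # ID 3
])

# a padded sequence of token IDs
toy_ids = np.array([0, 0, 3, 1])

# integer-array indexing replaces each ID with its row of the matrix
embedded = toy_weights[toy_ids]

print(embedded.shape)
print(embedded)
```

A sequence of 4 IDs becomes a (4, 3) matrix here, just as a sequence of 500 IDs becomes a (500, 100) matrix in the network. The padding positions map to the all-zero row.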

##### Dropout

To help combat overfitting, we'll add a _Dropout_ layer next. This layer will randomly set 25% of the values in our (500, 100) matrix to 0.

In [94]:
nn.add(Dropout(0.25))

##### Convolution

Next up is a _Convolutional_ layer. We set the `convolution_size` to 5, so a single convolution in this layer will accept a window of five word vectors at a time as input &mdash; a (5, 100) subset of the input data. For each of these windows, the convolution will produce a single number as its output value.

Since our input data to this layer is a (500, 100) matrix of word vectors, we will apply the convolution over all 496 possible five-vector windows in the input matrix (there are no five-vector windows centered on rows 0, 1, 498, or 499). Since it produces a single output value for each window, when applied across the whole input, the convolution will produce a vector of output values of shape (496, 1).

In the full Convolutional layer, we will create 64 separate convolutions and apply all of them as described above. The output from the full layer will be a matrix of shape (496, 64).
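
The 496 figure follows from a simple formula for convolutions that use no zero-padding. Here is a sketch; `conv1d_output_length` is a hypothetical helper written for illustration.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """
    number of windows a 1-D convolution produces when it is
    only applied where the window fits entirely inside the input
    """
    return (input_length - kernel_size) // stride + 1

# 500 word vectors, windows of 5 vectors, moving one position at a time
print(conv1d_output_length(500, 5))
```

Each of the 64 convolutions produces one value per window, which gives the (496, 64) output shape.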

In [95]:
convolutions = 64
convolution_size = 5

In [96]:
nn.add(
    Conv1D(
        convolutions,
        convolution_size,
        padding='valid',
        activation='relu',
        strides=1
        )
    )

You can think of each convolution as a "feature extractor". It will learn to look for certain relevant patterns or features in a given window of input data. When a similar feature is present, the convolution will have a high activation value. Otherwise, its activation will be low.

[See here](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/) for more information about the application of convolutional layers for NLP.

##### Max-pooling

After our convolutions, we apply a _max-pooling_ layer. A max-pooling layer down-samples a sequence of input data, returning only the maximum value within a particular window.

We'll use a max-pooling window of 4. The pooling operation will apply over the output sequence of each convolution individually. Max-pooling for the output sequence of a single convolution would look like this:

```
6  1  3  2 | 0  4  9  7 | ... | 3  3  4  1
    max          max                max
     6            9       ...        4
```

In [97]:
pool_size = 4

In [98]:
nn.add(
    MaxPooling1D(pool_size=pool_size)
    )

The input data to the max-pooling layer is a matrix of shape (496, 64). The max-pooling layer will down-sample the sequence from each convolution by a factor of 4, so the output from this layer will be (124, 64).
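
Here is a minimal NumPy sketch of the same down-sampling, assuming non-overlapping windows (the window moves forward by `pool_size` each step). `max_pool_1d` is a hypothetical helper for illustration, not the Keras layer itself.

```python
import numpy as np

def max_pool_1d(x, pool_size):
    """
    down-sample a (steps, channels) matrix by taking the maximum
    over non-overlapping windows of pool_size consecutive steps
    """
    steps, channels = x.shape

    # drop any trailing steps that don't fill a complete window
    usable = (steps // pool_size) * pool_size

    # group the steps into windows, then take the max within each window
    return x[:usable].reshape(-1, pool_size, channels).max(axis=1)

# toy input: 8 steps from 2 "convolutions"
conv_output = np.arange(16).reshape(8, 2)
pooled = max_pool_1d(conv_output, pool_size=4)

print(pooled)
print(max_pool_1d(np.zeros((496, 64)), pool_size=4).shape)
```

Each column is pooled independently, so the number of channels is unchanged and only the sequence length shrinks.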

##### Bidirectional Recurrent (LSTM)

Next, we'll add a _recurrent_ layer &mdash; specifically, an _LSTM_ (Long Short Term Memory) layer. Recurrent layers are designed to read a _sequence_ of input data, _in order_, to detect relevant longer-term patterns or features in the sequence.

We'll set the size of our LSTM layer to 70, which means the LSTM will read the input matrix and produce a vector of size (70, 1) that encodes relevant information the LSTM detected.

For more information on LSTM layers, [see here](http://colah.github.io/posts/2015-08-Understanding-LSTMs/).

We will add one twist to the standard LSTM layer &mdash; we'll make it _bidirectional_. This means we'll really have _two_ LSTM layers operating over the same input sequence. One layer will process the sequence forwards, from first-to-last, and the other will process the sequence backwards, from last-to-first. We'll glue the final vectors from each layer together as the output.

In [99]:
lstm_output_size = 70

In [100]:
nn.add(
    Bidirectional(LSTM(lstm_output_size, dropout=0.2, recurrent_dropout=0.2))
    )

Since we have two LSTMs reading the same (124, 64) input sequence and each producing a (70, 1) output vector, the final output from this layer will be a vector of size (140, 1).

##### Dense layer for final classification

Finally, we'll add a single _Dense_ (fully-connected) neuron to perform the final classification. It will accept the (140, 1) vector from the bidirectional LSTM layer as input, and produce a single activation value. Since this is a classification problem, we'll use the _sigmoid_ activation function, so the output is between 0 and 1.

The activation value from this layer will serve as the network's overall prediction that a given review receives any useful votes.

In [101]:
nn.add(Dense(1, activation='sigmoid'))

Now that the network is complete, let's compile it and view a summary of the network architecture. We'll use the _binary crossentropy_ loss function, also known as the log loss, which is an appropriate choice for a binary classification problem.

In [102]:
nn.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
    )
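
For intuition about the loss values, here is a pure-Python sketch of binary crossentropy, evaluated for a constant prediction equal to the 58.5% base rate. The helper and the toy labels are illustrative assumptions, and the sketch skips the clipping of extreme predictions that a production implementation would apply.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """
    mean log loss over a set of binary labels and predicted
    probabilities; assumes every prediction is strictly between 0 and 1
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# toy labels with the same class balance as our dataset: 58.5% ones
toy_labels = [1] * 585 + [0] * 415

# a "model" that ignores the text and always predicts the base rate
constant_preds = [0.585] * len(toy_labels)

baseline_loss = binary_crossentropy(toy_labels, constant_preds)
print(f'{baseline_loss:.4f}')
```

This comes out to roughly 0.68, a useful yardstick: a trained model should report a loss below it.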

In [103]:
nn.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 100)          2394100   
_________________________________________________________________
dropout_1 (Dropout)          (None, 500, 100)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 496, 64)           32064     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 124, 64)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 140)               75600     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 141       
Total params: 2,501,905
Trainable params: 107,805
Non-trainable params: 2,394,100
____________________________________________________________
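
The parameter counts in this summary can be reproduced by hand. The sketch below uses the standard formulas for each layer type (weights plus biases, and four gates per LSTM). The variable names are illustrative only.

```python
vocab_rows, embedding_dim = 23941, 100   # shape of embedding_weights
filters, kernel_size = 64, 5             # convolutions, convolution_size
lstm_units = 70                          # lstm_output_size

# Embedding: one 100-dimensional vector per token ID (frozen, not trained)
embedding_params = vocab_rows * embedding_dim

# Conv1D: each filter has a (5, 100) kernel plus one bias
conv_params = kernel_size * embedding_dim * filters + filters

# LSTM: 4 gates, each with input weights, recurrent weights, and biases;
# the input at each step is a 64-dimensional vector from the pooling layer
lstm_params = 4 * (lstm_units * (filters + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_params

# Dense: one weight per element of the 140-dimensional input, plus a bias
dense_params = 2 * lstm_units + 1

trainable = conv_params + bidirectional_params + dense_params
total = embedding_params + trainable

print(embedding_params, conv_params, bidirectional_params, dense_params)
print(f'total: {total:,}  trainable: {trainable:,}')
```

Nearly all of the parameters sit in the frozen Embedding layer, so only about 108,000 weights are actually updated during training.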

#### Train the model

In [104]:
neural_network_filepath = 'neural_network_model.h5'

In [105]:
batch_size = 30
epochs = 3

In [106]:
execute = False

if execute:
    
    nn.fit(
        X_train,
        y_train,
        batch_size=batch_size,
        epochs=epochs,
        validation_data=(X_test, y_test)
        )
    
    nn.save(neural_network_filepath)

```
Train on 195415 samples, validate on 34486 samples
Epoch 1/3
195415/195415 [==============================] - 1353s - loss: 0.6391 - acc: 0.6331 - val_loss: 0.6345 - val_acc: 0.6393
Epoch 2/3
195415/195415 [==============================] - 1333s - loss: 0.6361 - acc: 0.6367 - val_loss: 0.6334 - val_acc: 0.6407
Epoch 3/3
195415/195415 [==============================] - 1338s - loss: 0.6336 - acc: 0.6373 - val_loss: 0.6345 - val_acc: 0.6416
```

In [107]:
nn = load_model(neural_network_filepath)

After training for three epochs over the training data, the network's prediction accuracy over the test data is about 64.2%. This is certainly a bit better than the 58% naive baseline. But is it accurate enough to be useful?

#### Use the trained model to predict useful reviews

Let's use our trained model to make predictions for all the reviews in our original dataset.

In [108]:
predictions_filepath = 'preds.npy'

In [109]:
%%time

execute = False

if execute:
    
    preds = nn.predict(padded_seqs, verbose=1)
    
    np.save(predictions_filepath, preds)

CPU times: user 10 µs, sys: 0 ns, total: 10 µs
Wall time: 19.3 µs


```
229901/229901 [==============================] - 754s
```

In [110]:
preds = np.load(predictions_filepath)

In [111]:
preds.shape

(229901, 1)

In [112]:
reviews_df['useful.predicted'] = preds

In [113]:
sorted_reviews = reviews_df.sort_values('useful.predicted', ascending=False)

Let's preview the most useful reviews, as predicted by the model.

In [114]:
sorted_reviews.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes.cool,votes.funny,votes.useful,useful.predicted
155961,f_leeIRar8uNCkk5sAmHRQ,2008-10-09,5h6lRtXvNlwYTU4kZKUdBQ,3,[Scene] \r\n\r\nIt's 5:45 PM on a Sunday. A ma...,review,HOleI3jz1MLNUJ6cc1x0Pw,7,16,4,0.95089
197107,vxC3YTLOeGJwrnHcep63FQ,2012-09-17,QBnnWLpktAtJ42rwOwTjTg,4,"Yay, a brand new Buffet opened today! :-))...",review,ikm0UCahtK34LbLCEw4YTw,3,2,6,0.940665
196029,PLLM4iCYy_OrT-53IZ88Qw,2012-03-13,kjZeix1k5Nw1VWqoId3lzg,4,"Eew, a Buffet with no meat?¿? :-O\r\n\r\nO...",review,ikm0UCahtK34LbLCEw4YTw,3,2,4,0.939794
3887,E5dqlvxPw9Q4XFqOMw_lqw,2011-08-03,qT8arKbR7ioofdeh6wJ3SA,4,Decided to stop by for breakfast by myself one...,review,kGgAARL2UmvCcTRfiscjug,4,4,10,0.939739
58754,VKNl2-aF4n1x7lPVyU-Mag,2011-04-14,SE21xRhgPI3ZIdUq9xIhFQ,5,Was looking for a Mexican joint on Yelp and st...,review,kGgAARL2UmvCcTRfiscjug,16,12,18,0.939562


In [115]:
for idx, review in sorted_reviews.iloc[0:3].iterrows():
    
    print(f'Review {idx}:', '\n')
    print(review['text'], '\n')
    print(f'Predicted useful: {review["useful.predicted"]:.4f}')
    print(f'Useful votes: {review["votes.useful"]}')
    print('\n', '------', '\n')

Review 155961: 

[Scene] 

It's 5:45 PM on a Sunday. A man and woman are in North Phoenix meeting some friends at Sala Thai for dinner at 6:30. With so much time to spare, they decide to grab an alcoholic beverage. Thanks to Yelp on the man's Blackberry, they see that The Blooze is right down the street. They pull up into the shopping center shared by Big Lots and park in front of the bar. They open the door to find a dimly lit room with a bar to the left. There is only one open seat at the bar, so they decide to order a drink and then go sit at one of the high tables on the right. A man sitting to the right of the open seat turns toward them. As he opens his mouth to speak, the man and woman realize that he is missing all of his front teeth. 

Man with no teeth: Take my chair! I'll stand. I've been sitting for hours!
Woman: Oh no, that's ok. We'll just get our drinks and then go sit over there (points to high tables). 
Man with no teeth: No, please. I insist. Take my chair. Sit 

On the other hand, let's preview the least useful reviews, as predicted by the model.

In [116]:
sorted_reviews.tail()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes.cool,votes.funny,votes.useful,useful.predicted
155221,RmA5zN-Bs89lsTuY6v8W3Q,2012-06-02,DHNMBl8kUOOxYD2AYtcwug,5,We love Tammy!,review,i7b1NSPyvIGwPP6d8v-dkA,0,0,0,0.210678
91188,Nc6GQOf3sr6zaOm-E0v4_g,2011-06-27,QmNX_DBJ9IVOxmTGHPe5dg,5,Been going here since i was 15 and i love how ...,review,SEPWK9VIxXJ3vUcjGUvdkA,0,0,0,0.210123
33250,zOJdahdaaLz7DpfdIucslg,2012-04-02,ImVhPN-3vznU-muBEF9DlQ,5,My sons favorite place.,review,_M2Mt2whJMhXc_iPerB6rQ,0,0,0,0.20561
200593,VdoqSzXlB3rldeW_BIag0w,2011-11-17,PsrXAasDpkLy1i-K6e_ZUQ,5,I love their shellac!,review,4ozupHULqGyO42s3zNUzOQ,2,1,1,0.199734
13854,EQ4UBZPlLhOGpINRD3oI_g,2011-05-21,M6CDNXF49Vgl7TR3oDrmGg,5,Great job on my pedi!,review,pkIRF1JY08UY-y5Mt93_2Q,0,0,0,0.188988


In [117]:
for idx, review in sorted_reviews.iloc[-5:].iterrows():
    
    print(f'Review {idx}:', '\n')
    print(review['text'], '\n')
    print(f'Predicted useful: {review["useful.predicted"]:.4f}')
    print(f'Useful votes: {review["votes.useful"]}')
    print('\n', '------', '\n')

Review 155221: 

We love Tammy! 

Predicted useful: 0.2107
Useful votes: 0

 ------ 

Review 91188: 

Been going here since i was 15 and i love how nancy the owner does my nails, eyebrows, and pedicures! 

Predicted useful: 0.2101
Useful votes: 0

 ------ 

Review 33250: 

My sons favorite place. 

Predicted useful: 0.2056
Useful votes: 0

 ------ 

Review 200593: 

I love their shellac! 

Predicted useful: 0.1997
Useful votes: 1

 ------ 

Review 13854: 

Great job on my pedi! 

Predicted useful: 0.1890
Useful votes: 0

 ------ 



From a qualitative perspective, does it seem like the model is doing a good job predicting which reviews Yelp users are most likely to find useful?
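A quantitative check can complement the qualitative read. The sketch below computes a Spearman rank correlation between actual useful votes and predicted scores, implemented as the Pearson correlation of the ranks. The helper `rank_correlation` is hypothetical, and the sample frame holds only the ten rows printed by `head()` and `tail()` above, which are the extremes of the ranking and therefore not representative; a real check would pass the full `reviews_df`.

```python
import pandas as pd

def rank_correlation(df, actual_col='votes.useful', predicted_col='useful.predicted'):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    actual_ranks = df[actual_col].rank()
    predicted_ranks = df[predicted_col].rank()
    return actual_ranks.corr(predicted_ranks)

# The ten rows shown above (top five and bottom five by predicted score).
sample_df = pd.DataFrame({
    'votes.useful': [4, 6, 4, 10, 18, 0, 0, 0, 1, 0],
    'useful.predicted': [0.95089, 0.940665, 0.939794, 0.939739, 0.939562,
                         0.210678, 0.210123, 0.20561, 0.199734, 0.188988],
})

rho = rank_correlation(sample_df)
print(f'Spearman rank correlation on the sample: {rho:.3f}')
```

Because this sample contains only the highest- and lowest-scored reviews, the correlation it reports will overstate how well the scores track votes across the whole dataset.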