# GloVe Word Embeddings Demo

This demo was part of a presentation for [this word embeddings workshop](https://www.eventbrite.com/e/practical-ai-for-female-engineers-product-managers-and-designers-tickets-34805104003) and a similar talk at [this Demystifying AI conference](https://www.eventbrite.com/e/demystifying-deep-learning-ai-tickets-34351888423).  It is not necessary to download the demo to be able to follow along and enjoy the workshop.

It is available on GitHub at https://github.com/fastai/word-embeddings-workshop

## Loading our data

In [2]:
import pickle
import numpy as np
import re
import json

In [3]:
np.set_printoptions(precision=4, suppress=True)

The dataset is available at http://files.fast.ai/models/glove/6B.100d.tgz
To download and unzip the files from the command line, you can run:

    wget http://files.fast.ai/models/glove_50_glove_100.tgz 
    tar xvzf glove_50_glove_100.tgz

You will need to update the paths below to point to where you are storing the data.

In [4]:
vecs = np.load("glove_vectors_100d.npy")
vecs50 = np.load("glove_vectors_50d.npy")

In [5]:
with open('words.txt') as f:
    content = f.readlines()
words = [x.strip() for x in content] 

In [6]:
# Use a context manager so the file handle is closed after loading
with open('wordsidx.txt') as f:
    wordidx = json.load(f)

Let's see what our data looks like:

In [7]:
len(words)

400000

In [8]:
words[:10]

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

In [9]:
words[600:610]

['together',
 'congress',
 'index',
 'australia',
 'results',
 'hard',
 'hours',
 'land',
 'action',
 'higher']

wordidx allows us to look up a word to find its index:

In [10]:
type(wordidx)

dict

In [11]:
wordidx['feminist']

11853

In [12]:
words[11853]

'feminist'
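
If you don't have the data downloaded, here is a tiny sketch of how the two lookups relate. The names `toy_words` and `toy_wordidx` and the five-word vocabulary are made up for illustration; the real `words` list and `wordidx` dictionary are loaded from the files above.

```python
# Toy stand-ins for the real `words` list and `wordidx` dict (hypothetical data).
toy_words = ["the", ",", ".", "of", "feminist"]

# Build the word -> index lookup from the list.
toy_wordidx = {word: i for i, word in enumerate(toy_words)}

# Looking up a word gives its index, and indexing the list gives the word back.
round_trip = toy_words[toy_wordidx["feminist"]]
print(toy_wordidx["feminist"], round_trip)
```

The list goes from index to word and the dictionary goes from word to index, so each undoes the other.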

## Words as vectors

The word "feminist" (index 11853) is represented by the 100-dimensional vector:

In [13]:
type(vecs)

numpy.ndarray

In [14]:
vecs[11853]

array([ 0.296 ,  0.7626, -0.9866,  0.3776,  0.3194,  0.8286, -0.1686,
       -1.4558,  0.1965,  0.3854, -0.3348, -0.6503, -0.2528, -0.11  ,
       -0.1545,  0.5354, -0.4527, -0.0516,  0.1312,  0.0744,  0.5001,
        0.2151,  0.0688,  0.4347,  0.261 , -0.0371,  0.1385, -1.518 ,
        0.0641,  0.149 , -0.0314,  0.5038,  0.2839,  0.3457, -0.4411,
       -0.3459, -0.2118,  0.5651, -0.088 , -0.0438, -1.2228,  0.6039,
       -0.23  ,  0.2287, -0.2695, -0.9398,  0.2376,  0.3302, -0.2422,
        0.6359,  0.1347,  0.5542,  0.1432,  0.2861,  0.0216, -0.7437,
        0.3508,  0.362 ,  0.5566,  0.3403,  0.3613,  0.5185, -0.5437,
       -0.285 ,  1.1831, -0.1192,  0.2473,  0.0614,  0.4436, -0.244 ,
        0.2016,  0.5143, -0.4695, -0.0974, -0.9836, -0.3594,  0.3903,
       -0.517 , -0.1659, -1.2132, -1.3228,  0.0578,  0.7022,  0.3492,
       -0.9103, -0.381 , -0.1545,  0.4467, -0.009 , -0.9838,  1.0114,
       -0.227 ,  0.2697,  0.1566,  0.5613,  0.1175, -0.5755, -0.6324,
        0.1052,  1.2

This lets us do some useful calculations. For instance, we can see how far apart two words are using a distance metric:

In [15]:
from scipy.spatial.distance import cosine as dist

Smaller numbers mean two words are closer together; larger numbers mean they are further apart.
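
To see what the cosine distance measures, here is a minimal sketch that computes it by hand with NumPy on made-up 2-dimensional vectors. The helper name `cosine_distance` is my own; it is meant to mirror what `scipy.spatial.distance.cosine` returns, namely 1 minus the cosine of the angle between the two vectors.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

same_direction = cosine_distance([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_distance([1, 0], [0, 1])       # perpendicular vectors
opposite = cosine_distance([1, 0], [-1, 0])        # vectors pointing apart
print(same_direction, orthogonal, opposite)
```

The distance depends only on the direction of the vectors, not their length: parallel vectors are at distance 0, perpendicular ones at 1, and opposite ones at 2.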

The distance between similar words is low:

In [16]:
dist(vecs[wordidx["puppy"]], vecs[wordidx["dog"]])

0.27636240676695256

In [17]:
dist(vecs[wordidx["queen"]], vecs[wordidx["princess"]])

0.20527545040329642

And the distance between unrelated words is high:

In [18]:
dist(vecs[wordidx["celebrity"]], vecs[wordidx["dusty"]])

0.98835787578057777

In [19]:
dist(vecs[wordidx["kitten"]], vecs[wordidx["airplane"]])

0.87298516557634254

In [134]:
dist(vecs[wordidx["avalanche"]], vecs[wordidx["antique"]])

0.96211070894511519

### Bias

There is a lot of opportunity for bias:

In [20]:
dist(vecs[wordidx["man"]], vecs[wordidx["genius"]])

0.50985148631697985

In [21]:
dist(vecs[wordidx["woman"]], vecs[wordidx["genius"]])

0.6897833082810727

Not all pairs are stereotyped:

In [22]:
dist(vecs[wordidx["man"]], vecs[wordidx["emotional"]])

0.55957489609574407

In [23]:
dist(vecs[wordidx["woman"]], vecs[wordidx["emotional"]])

0.62572056015698596

I just checked the distance between pairs of words, because this is a quick and simple way to illustrate the concept.  It is also a very **noisy** approach, and **researchers approach this problem in more systematic ways**.

## Visualizing the words

We will use [Plotly](https://plot.ly/), a Python library to make interactive graphs (note: everything below is done with the free, offline version of Plotly).

### Methods

In [41]:
import plotly
import plotly.graph_objs as go    
from IPython.display import IFrame

In [42]:
def plotly_3d(Y, cat_labels):
    trace_dict = {}
    for i, label in enumerate(cat_labels):
        trace_dict[i] = go.Scatter3d(
            x=Y[i*5:(i+1)*5, 0],
            y=Y[i*5:(i+1)*5, 1],
            z=Y[i*5:(i+1)*5, 2],
            mode='markers',
            marker=dict(
                size=8,
                line=dict(
                    color='rgba('+ str(i*40) + ',' + str(i*40) + ',' + str(i*40) + ', 0.14)',
                    width=0.5
                ),
                opacity=0.8
            ),
            text = my_words[i*5:(i+1)*5],
            name = label
        )

    data = [item for item in trace_dict.values()]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    plotly.offline.plot({
        "data": data,
        "layout": layout
    })

In [43]:
def plotly_2d(Y, cat_labels):
    trace_dict = {}
    for i, label in enumerate(cat_labels):
        trace_dict[i] = go.Scatter(
            x=Y[i*5:(i+1)*5, 0],
            y=Y[i*5:(i+1)*5, 1],
            mode='markers',
            marker=dict(
                size=8,
                line=dict(
                    color='rgba('+ str(i*40) + ',' + str(i*40) + ',' + str(i*40) + ', 0.14)',
                    width=0.5
                ),
                opacity=0.8
            ),
            text = my_words[i*5:(i+1)*5],
            name = label
        )

    data = [item for item in trace_dict.values()]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    plotly.offline.plot({
        "data": data,
        "layout": layout
    })

### Preparing the Data

Let's plot words from a few different categories:

In [45]:
categories = [
              "bugs", "music", 
              "pleasant", "unpleasant", 
              "science", "arts"
             ]

In [44]:
my_words = [
            "maggot", "flea", "tarantula", "bedbug", "mosquito", 
            "violin", "cello", "flute", "harp", "mandolin",
            "joy", "love", "peace", "pleasure", "wonderful",
            "agony", "terrible", "horrible", "nasty", "failure", 
            "physics", "chemistry", "science", "technology", "engineering",
            "poetry", "art", "literature", "dance", "symphony",
           ]

Again, we need to look up the indices of our words using the wordidx dictionary:

In [46]:
X = np.array([wordidx[word] for word in my_words])

In [47]:
vecs[X].shape

(30, 100)

Now, we will make a set combining our words with the first 10,000 words in our entire set of words (some of the words will already be in there), and create a matrix of their embeddings.

In [62]:
embeddings = np.concatenate((vecs[X], vecs[:10000,:]), axis=0); embeddings.shape

(10030, 100)

### Viewing the words in 3D

The words are in 100 dimensions, so we need a way to reduce them to 3 dimensions before we can view them.  Two good options are t-SNE and PCA.  You can look up the details on your own, but the idea is to find a meaningful way to go from 100 dimensions to 3 dimensions (while keeping a similar idea of what is close to what).

You would typically use just one of these (t-SNE or PCA).  I've included both in case you're interested.
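
If you are curious what "going from 100 dimensions to 3" can look like in code, here is a rough sketch of plain PCA done with an SVD in NumPy. The function `pca_project` and the random `toy_embeddings` are made up for illustration; this is textbook PCA on random data, not necessarily identical to what the scikit-learn cells below compute.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto their top principal directions."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)            # PCA works on centered data
    # Rows of Vt are the principal directions, sorted by explained variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.RandomState(0)
toy_embeddings = rng.normal(size=(30, 100))    # 30 fake "words" in 100 dimensions
Y_toy = pca_project(toy_embeddings, n_components=3)
print(Y_toy.shape)
```

Each row of `Y_toy` is a 3-dimensional point that could be handed to a 3D scatter plot.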

#### TSNE

In [58]:
from sklearn import manifold

In [63]:
tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
Y = tsne.fit_transform(embeddings)  # `embeddings` is the (10030, 100) matrix built above
plotly_3d(Y, categories)

In [64]:
IFrame('temp-plot.html', width=600, height=400)

#### PCA

In [49]:
from sklearn import decomposition

In [55]:
pca = decomposition.PCA(n_components=3).fit(embeddings.T)  # `embeddings` is the matrix built above
components = pca.components_
plotly_3d(components.T[:len(my_words),:], categories)

In [53]:
IFrame('temp-plot.html', width=600, height=400)

## Nearest Neighbors

We can also see what words are close to a given word.

In [25]:
from sklearn.neighbors import NearestNeighbors

Nearest Neighbors is an algorithm that finds the points closest to a given point.
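
As a rough picture of what a brute-force search with the cosine metric does, here is a small NumPy sketch on made-up 2-dimensional vectors. The helper `nearest_by_cosine` and the array `toy_vecs` are hypothetical; scikit-learn's `NearestNeighbors` is what the demo actually uses.

```python
import numpy as np

def nearest_by_cosine(query, matrix, k=3):
    """Return the indices and cosine distances of the k rows closest to query."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    dists = 1.0 - sims
    order = np.argsort(dists)[:k]              # smallest distances first
    return order, dists[order]

toy_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
toy_indices, toy_distances = nearest_by_cosine(toy_vecs[0], toy_vecs, k=3)
print(toy_indices, toy_distances)
```

The query vector is itself a row of the matrix, so it comes back as its own nearest neighbor with a distance of (almost exactly) zero, just as in the outputs below.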

In [26]:
neigh = NearestNeighbors(n_neighbors=10, radius=0.5, metric='cosine', algorithm='brute')
neigh.fit(vecs) 

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=10, p=2, radius=0.5)

In [27]:
distances, indices = neigh.kneighbors([vecs[wordidx["intelligence"]]])

In [60]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('intelligence', 1.7881393e-07),
 ('cia', 0.25781947),
 ('information', 0.27898049),
 ('security', 0.3036899),
 ('fbi', 0.30377108),
 ('military', 0.3065179),
 ('secret', 0.31066364),
 ('counterterrorism', 0.32373756),
 ('pentagon', 0.33488154),
 ('defense', 0.34354311)]

We can take this a step further, and add two words together.  What is the result?

In [24]:
new_vec = vecs[wordidx["artificial"]] + vecs[wordidx["intelligence"]]

In [25]:
new_vec

array([ 0.0345, -0.1185,  0.746 ,  0.3256,  0.3256, -1.4699, -0.8715,
       -0.9421,  0.0679,  0.922 ,  0.6811, -0.3729,  1.0969,  0.7196,
        1.3515,  1.2493,  0.6621,  0.1901, -0.2707, -0.0444, -1.232 ,
        0.1744,  0.7577, -0.9177, -1.2184,  0.6959, -0.1966, -0.415 ,
       -0.3358,  0.5452,  0.589 , -0.0299, -0.9744, -0.8937,  0.2283,
       -0.2092, -1.3795,  1.7811,  0.2269,  0.47  , -0.3045, -0.1573,
       -0.478 ,  0.3071,  0.4202, -0.4434,  0.1602,  0.1443, -0.9528,
       -0.5565,  0.7537,  0.182 ,  1.4008,  1.8967,  0.595 , -3.0072,
        0.6811, -0.2557,  2.0217,  0.7825,  0.4251,  1.3615,  0.5902,
       -0.1312,  0.9344, -0.5377, -0.3988, -0.6415,  0.6527,  0.5117,
        0.7315,  0.1396,  0.3785, -0.6403, -0.094 ,  0.1076,  0.6197,
        0.2537, -1.4346,  1.169 ,  1.6931,  0.1458, -0.5981,  0.8195,
       -3.1903,  1.2429,  2.1481,  1.6004,  0.2014, -0.2121,  0.3698,
       -0.001 , -0.628 ,  0.2869,  0.3119, -0.1093, -0.6341, -1.7804,
        0.5857,  0.3

In [58]:
distances, indices = neigh.kneighbors([new_vec])

In [27]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('intelligence', 0.18831611),
 ('artificial', 0.25617576),
 ('information', 0.3256532),
 ('knowledge', 0.33641893),
 ('secret', 0.36480361),
 ('human', 0.36726683),
 ('biological', 0.37090683),
 ('using', 0.37736303),
 ('scientific', 0.38513899),
 ('communication', 0.38691515)]

In [61]:
distances, indices = neigh.kneighbors([vecs[wordidx["king"]]])

In [146]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('king', 0.0),
 ('prince', 0.23176712),
 ('queen', 0.24923098),
 ('son', 0.29791123),
 ('brother', 0.30142248),
 ('monarch', 0.30221093),
 ('throne', 0.30800098),
 ('kingdom', 0.31885898),
 ('father', 0.3197971),
 ('emperor', 0.32871419)]

In [147]:
new_vec = vecs[wordidx["king"]] - vecs[wordidx["he"]] + vecs[wordidx["she"]]

In [148]:
distances, indices = neigh.kneighbors([new_vec])

In [149]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('king', 0.13275802),
 ('queen', 0.16259885),
 ('princess', 0.24821734),
 ('daughter', 0.29121184),
 ('prince', 0.29464376),
 ('elizabeth', 0.29630506),
 ('mother', 0.3091293),
 ('sister', 0.31979591),
 ('father', 0.34473372),
 ('throne', 0.34474838)]
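
The "king - he + she" arithmetic can be tried without the GloVe files. Below is a toy sketch with hand-made 2-dimensional vectors, where the first coordinate loosely stands for "royalty" and the second for "gender". The dictionary `toy` and the helper `analogy` are made up for illustration; as in the cells above, the input words are not excluded from the ranking.

```python
import numpy as np

# Hypothetical 2-d "embeddings": (royalty, gender)
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "he":    np.array([0.0, 1.0]),
    "she":   np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(a, b, c, vectors):
    """Rank words by cosine distance to vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]

    def cos_dist(v):
        return 1.0 - np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))

    return sorted(vectors, key=lambda w: cos_dist(vectors[w]))

ranked = analogy("king", "he", "she", toy)
print(ranked[:2])
```

With these hand-picked vectors the target lands exactly on "queen". Real embeddings are much noisier, which is why "king" itself still ranks first in the output above.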

In [150]:
wordidx["programmer"]

19226

In [152]:
distances, indices = neigh.kneighbors([vecs[wordidx["programmer"]]])

Closest words to "programmer":

In [153]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('programmer', 0.0),
 ('programmers', 0.32259798),
 ('animator', 0.36951029),
 ('software', 0.38250893),
 ('computer', 0.40600348),
 ('technician', 0.41406858),
 ('engineer', 0.43037564),
 ('user', 0.43565339),
 ('translator', 0.43721014),
 ('linguist', 0.44948018)]

Feminine version of "programmer":

In [29]:
new_vec = vecs[wordidx["programmer"]] - vecs[wordidx["he"]] + vecs[wordidx["she"]]

In [30]:
distances, indices = neigh.kneighbors([new_vec])

In [31]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('programmer', 0.19503421),
 ('stylist', 0.42715943),
 ('animator', 0.48206449),
 ('programmers', 0.48337293),
 ('choreographer', 0.48626775),
 ('technician', 0.4862805),
 ('designer', 0.48710018),
 ('prodigy', 0.49118328),
 ('lets', 0.49730021),
 ('screenwriter', 0.49754214)]

Masculine version of "programmer":

In [32]:
new_vec = vecs[wordidx["programmer"]] - vecs[wordidx["she"]] + vecs[wordidx["he"]]

In [33]:
distances, indices = neigh.kneighbors([new_vec])

In [34]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('programmer', 0.17419636),
 ('programmers', 0.41335857),
 ('engineer', 0.46376407),
 ('compiler', 0.46731704),
 ('software', 0.4681465),
 ('animator', 0.48923665),
 ('computer', 0.50461578),
 ('mechanic', 0.51500672),
 ('setup', 0.51882535),
 ('developer', 0.51953185)]

In [35]:
distances, indices = neigh.kneighbors([vecs[wordidx["doctor"]]])

In [36]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('doctor', 0.0),
 ('physician', 0.23267597),
 ('nurse', 0.24784923),
 ('dr.', 0.28248072),
 ('doctors', 0.29191142),
 ('patient', 0.29258156),
 ('medical', 0.30040079),
 ('surgeon', 0.30946612),
 ('hospital', 0.30990696),
 ('psychiatrist', 0.3410902)]

Feminine version of doctor:

In [37]:
new_vec = vecs[wordidx["doctor"]] - vecs[wordidx["he"]] + vecs[wordidx["she"]]

In [38]:
distances, indices = neigh.kneighbors([new_vec])

In [39]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('doctor', 0.13456273),
 ('nurse', 0.22582489),
 ('mother', 0.27610379),
 ('woman', 0.29901671),
 ('pregnant', 0.32096934),
 ('girl', 0.33241045),
 ('patient', 0.34357929),
 ('she', 0.35723114),
 ('child', 0.36312521),
 ('herself', 0.363388)]

Masculine version of doctor:

In [40]:
new_vec = vecs[wordidx["doctor"]] - vecs[wordidx["she"]] + vecs[wordidx["he"]]

In [41]:
distances, indices = neigh.kneighbors([new_vec])

In [42]:
[(words[int(ind)], dist) for ind, dist in zip(list(indices[0]), list(distances[0]))]

[('doctor', 0.15277696),
 ('physician', 0.27226871),
 ('medical', 0.37674332),
 ('he', 0.37695646),
 ('doctors', 0.38290107),
 ('dr.', 0.38466901),
 ('surgeon', 0.39124882),
 ('him', 0.40270936),
 ('hospital', 0.42226428),
 ('himself', 0.42476076)]

## Bias

Again, just looking at individual words is a **noisy** approach (I'm using it as a simple illustration).  [Researchers from Princeton and the University of Bath](https://www.princeton.edu/~aylinc/papers/caliskan-islam_semantics.pdf) use **small baskets of terms** to represent concepts.  They first confirmed that flowers are more pleasant than insects, and musical instruments are more pleasant than weapons.

They then found that European American names are "more pleasant" than African American names, as captured by how close the word vectors are (as embedded by GloVe, which is a word embedding method from Stanford, along the same lines as Word2Vec).

    We show for the first time that if AI is to exploit via our language the vast 
    knowledge that culture has compiled, it will inevitably inherit human-like 
    prejudices. In other words, if AI learns enough about the properties of language 
    to be able to understand and produce it, it also acquires cultural associations 
    that can be offensive, objectionable, or harmful.
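
To make the "small baskets of terms" idea concrete, here is a minimal sketch with made-up 2-dimensional vectors. The function name `association` and all the vectors are hypothetical. It scores a word by its mean cosine similarity to one basket minus its mean cosine similarity to the other, which is a simplified illustration in the spirit of the paper rather than its full statistical test.

```python
import numpy as np

def cos_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def association(word_vec, basket_a, basket_b):
    """Mean similarity to basket_a minus mean similarity to basket_b."""
    sim_a = np.mean([cos_sim(word_vec, a) for a in basket_a])
    sim_b = np.mean([cos_sim(word_vec, b) for b in basket_b])
    return sim_a - sim_b

# Hypothetical baskets of "pleasant" and "unpleasant" term vectors
pleasant = [np.array([1.0, 0.1]), np.array([1.0, -0.1])]
unpleasant = [np.array([-1.0, 0.1]), np.array([-1.0, -0.1])]

flower = np.array([0.9, 0.2])     # made-up vector near the pleasant basket
insect = np.array([-0.8, 0.3])    # made-up vector near the unpleasant basket

flower_score = association(flower, pleasant, unpleasant)
insect_score = association(insect, pleasant, unpleasant)
print(flower_score, insect_score)
```

Averaging over a basket smooths out the quirks of any single word, which is what makes this less noisy than the one-pair-at-a-time checks earlier in the demo.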

[Researchers from Boston University and Microsoft Research](https://arxiv.org/pdf/1606.06121.pdf) found the pairs most analogous to *He : She*.  They found gender bias, and also proposed a way to debias the vectors.
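
One ingredient of debiasing can be sketched in a few lines: removing the component of a word vector that lies along a "gender direction". The vectors and the helper `remove_direction` below are made up, and the paper's full method is more involved (for example, it estimates the direction from several word pairs), so treat this only as a rough illustration of the projection idea.

```python
import numpy as np

def remove_direction(vec, direction):
    """Subtract from vec its projection onto direction."""
    direction = direction / np.linalg.norm(direction)
    return vec - np.dot(vec, direction) * direction

# Hypothetical 3-d vectors
he = np.array([0.0, 1.0, 0.2])
she = np.array([0.0, -1.0, 0.2])
gender_direction = he - she                  # points from "she" towards "he"

programmer = np.array([0.8, 0.3, 0.5])       # made-up vector leaning towards "he"
neutral = remove_direction(programmer, gender_direction)
print(neutral)
```

After the projection, the vector has no component along the chosen direction, so it is equally far from both ends of that axis.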

Rob Speer, CTO of Luminoso, tested for ethnic bias by finding correlations for a list of positive and negative words:

    The tests I implemented for ethnic bias are to take a list of words, such as 
    “white”, “black”, “Asian”, and “Hispanic”, and find which one has the strongest 
    correlation with each of a list of positive and negative words, such as “cheap”, 
    “criminal”, “elegant”, and “genius”. I did this again with a fine-grained version 
    that lists hundreds of words for ethnicities and nationalities, and thus is more 
    difficult to get a low score on, and again with what may be the trickiest test of 
    all, comparing words for different religions and spiritual beliefs.

**Ways to address bias**

There are a few different approaches:

- Debias word embeddings
  - [Technique in Bolukbasi, et al.](https://arxiv.org/abs/1606.06121)
  - [ConceptNet Numberbatch (Rob Speer)](https://blog.conceptnet.io/2017/04/24/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/)
- Argument that “awareness is better than blindness”: debiasing should happen at time of action, not at perception. ([Caliskan-Islam, Bryson, Narayanan](https://www.princeton.edu/~aylinc/papers/caliskan-islam_semantics.pdf))

Either way, you need to be on the lookout for bias and have a plan to address it!

If you are interested in the topic of bias in AI, I gave a workshop [you can watch here](https://www.youtube.com/watch?v=25nC0n9ERq4) that covers this material and goes into more depth about bias.

# Movie Reviews Sentiment Analysis Demo

This demo has been adapted (and simplified) from part of Lesson 5 of [Practical Deep Learning for Coders](http://course.fast.ai/index.html)

## Setup data

We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. 

We will be using [Keras](https://keras.io/), a high-level neural network API. Two of the guiding principles of Keras are **user-friendliness** (it's designed for humans, not machines) and **works with Python**.  Yay for both of these!

Keras can run on top of several neural network frameworks, including TensorFlow, Theano, MXNet, and CNTK.  I am using it on top of TensorFlow here.

Keras comes with some helpers for the IMDB dataset.

In [6]:
from keras.datasets import imdb
from keras.utils.data_utils import get_file
idx = imdb.get_word_index()

Using TensorFlow backend.


In [8]:
import keras.backend as K

def limit_mem():
    K.get_session().close()
    cfg = K.tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    cfg.gpu_options.per_process_gpu_memory_fraction = 0.6
    K.set_session(K.tf.Session(config=cfg))
    
limit_mem()

This is the word list:

In [171]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

...and this is the mapping from id to word

In [172]:
idx2word = {v: k for k, v in idx.items()}

We download the reviews using code from https://keras.io/datasets/:

In [173]:
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
# Use a context manager so the file is closed once the data is loaded
with open(path, 'rb') as f:
    (x_train, labels_train), (x_test, labels_test) = pickle.load(f)

In [174]:
len(x_train)

25000

Here's the 1st review. As you see, the words have been replaced by ids. The ids can be looked up in idx2word.

In [175]:
', '.join(map(str, x_train[0]))

'23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215'

The first word of the first review is 23022. Let's see what that is.

In [176]:
idx2word[23022]

'bromwell'

Here's the whole review, mapped from ids to words.

In [177]:
' '.join([idx2word[o] for o in x_train[0]])

"bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't"

The labels are 1 for positive, 0 for negative.

In [178]:
labels_train[:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Reduce vocab size by setting rare words to max index.

In [179]:
vocab_size = 5000

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
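
The list comprehensions above replace every id of `vocab_size - 1` or more with `vocab_size - 1`. As a small sketch on a made-up review (the names `toy_review` and `toy_vocab_size` are hypothetical), the same capping can be written with `np.minimum`:

```python
import numpy as np

toy_vocab_size = 5000
toy_review = [23022, 309, 6, 4998, 4999, 5000]   # made-up word ids

# Same logic as the list comprehension in the cell above
capped_loop = np.array([i if i < toy_vocab_size - 1 else toy_vocab_size - 1
                        for i in toy_review])

# Equivalent vectorized form: take the element-wise minimum with the cap
capped_vectorized = np.minimum(toy_review, toy_vocab_size - 1)
print(capped_vectorized)
```

Every rare word ends up sharing the single id 4999, which is why that number shows up so often in the arrays below.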

Look at distribution of lengths of sentences.

In [180]:
trn[:10]

[array([4999,  309,    6,    3, 1069,  209,    9, 2175,   30,    1,  169,
          55,   14,   46,   82, 4999,   41,  393,  110,  138,   14, 4999,
          58, 4477,  150,    8,    1, 4999, 4999,  482,   69,    5,  261,
          12, 4999, 4999, 2003,    6,   73, 2436,    5,  632,   71,    6,
        4999,    1, 4999,    5, 2004, 4999,    1, 4999, 1534,   34,   67,
          64,  205,  140,   65, 1232, 4999, 4999,    1, 4999,    4,    1,
         223,  901,   29, 3024,   69,    4,    1, 4999,   10,  694,    2,
          65, 1534,   51,   10,  216,    1,  387,    8,   60,    3, 1472,
        3724,  802,    5, 3521,  177,    1,  393,   10, 1238, 4999,   30,
         309,    3,  353,  344, 2989,  143,  130,    5, 4999,   28,    4,
         126, 4999, 1472, 2375,    5, 4999,  309,   10,  532,   12,  108,
        1470,    4,   58,  556,  101,   12, 4999,  309,    6,  227, 4187,
          48,    3, 2237,   12,    9,  215]),
 array([4999,   39, 4999,   14,  739, 4999, 3428,   44,   74,   32

In [181]:
lens = np.array([len(review) for review in trn])

In [182]:
(lens.max(), lens.min(), lens.mean())

(2493, 10, 237.71364)

Pad (with zero) or truncate each sentence to make consistent length.

In [183]:
from keras.preprocessing import sequence

In [184]:
seq_len = 500

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)

This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros; longer ones are truncated.
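
Here is a rough NumPy sketch of that padding step on two made-up sequences. The helper `pad_or_truncate` is hypothetical. It assumes the Keras defaults of padding at the front and truncating from the front (that is, keeping the last `maxlen` ids), and that `maxlen` is at least 1.

```python
import numpy as np

def pad_or_truncate(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` ids of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype=int)
    for row, seq in enumerate(seqs):
        kept = list(seq)[-maxlen:]           # drop ids from the front if too long
        if kept:
            out[row, -len(kept):] = kept     # right-align, padding sits on the left
    return out

toy_padded = pad_or_truncate([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(toy_padded)
```

Both rows come out with length 4: the short one gains two leading zeros and the long one loses its first id.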

In [185]:
trn.shape

(25000, 500)

## Create a model

### Single conv layer with max pooling

*Convolutional neural networks* (abbreviated CNNs) are a powerful type of neural network that does well with ordered data.  They have traditionally been used primarily for image data, but more recently have shown great results on natural language data.  [Facebook AI recently announced results](https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/) of using a CNN to run language translation 9x faster than the current state of the art.

We'll use a 1D CNN, since a sequence of words is 1D.

For this workshop, we will treat the CNN as a black box.  If you want to learn more about what is going on inside it, check out [Practical Deep Learning for Coders](http://course.fast.ai/) (the only pre-req is 1 year of coding experience).
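
If you would like a small peek inside the box anyway, here is a rough NumPy sketch of the two operations in this section's title: a 1D convolution over a sequence of word vectors (with ReLU), followed by max pooling. The function names, the random weights, and the toy sizes are all made up; Keras does this far more efficiently and also learns the kernel weights during training.

```python
import numpy as np

def conv1d_same_relu(x, kernels, bias):
    """x: (seq_len, emb_dim); kernels: (width, emb_dim, n_filters); bias: (n_filters,)."""
    width = kernels.shape[0]
    pad_left = (width - 1) // 2
    pad_right = width - 1 - pad_left
    # Zero-pad along the sequence so the output has the same length as the input
    x_padded = np.pad(x, ((pad_left, pad_right), (0, 0)), mode="constant")
    out = np.empty((x.shape[0], kernels.shape[2]))
    for t in range(x.shape[0]):
        window = x_padded[t:t + width]       # `width` neighbouring word vectors
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)              # ReLU

def max_pool1d(x, pool=2):
    """Keep the maximum of every `pool` consecutive time steps."""
    steps = x.shape[0] // pool
    return x[:steps * pool].reshape(steps, pool, -1).max(axis=1)

rng = np.random.RandomState(0)
toy_sequence = rng.normal(size=(10, 50))         # 10 "words", 50-d embeddings
toy_kernels = rng.normal(size=(5, 50, 64)) * 0.1 # 64 filters, each 5 words wide
features = conv1d_same_relu(toy_sequence, toy_kernels, np.zeros(64))
pooled = max_pool1d(features)
print(features.shape, pooled.shape)
```

Each filter slides along the sentence looking at 5 neighbouring words at a time, and pooling then halves the sequence length by keeping the strongest response in each pair of positions.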

In [520]:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.core import Flatten, Dense, Dropout
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.optimizers import Adam

In [521]:
conv1 = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.4),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')])

In [522]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [523]:
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f0ac42b21d0>

In deep learning, often as you get closer to the answer, you need to reduce your *learning rate*, which is the step size for how the algorithm changes its guess each time.  When you are far from the answer, you want to take large steps to get to the right vicinity.  Once you are close to the answer, you want to take small steps so you don't overshoot the answer.
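
Overshooting is easy to see on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, whose minimum is at 0, with two different step sizes. The function `gradient_descent` is made up for illustration and is much simpler than the Adam optimizer used in this demo.

```python
def gradient_descent(start, learning_rate, steps):
    """Minimize f(x) = x**2 and return every guess along the way."""
    x = start
    path = [x]
    for _ in range(steps):
        grad = 2 * x                     # derivative of x**2
        x = x - learning_rate * grad
        path.append(x)
    return path

big_steps = gradient_descent(start=1.0, learning_rate=1.0, steps=3)
small_steps = gradient_descent(start=1.0, learning_rate=0.1, steps=3)
print(big_steps)
print(small_steps)
```

With the large step size the guess jumps back and forth across the minimum (1, -1, 1, -1) without ever getting closer, while the small step size moves steadily towards 0.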

In [524]:
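# In this version of Keras, plain assignment (conv1.optimizer.lr = 1e-4) may not
# reach the already-compiled training function; K.set_value updates the variable in place.
K.set_value(conv1.optimizer.lr, 1e-4)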

In [525]:
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f0ac42b2128>

The [Stanford paper](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) (2011) that this dataset comes from cites a state-of-the-art accuracy (without unlabelled data) of 88.3%.  We have surpassed that!

Note that accuracy of 88.9% means an error rate of 11.1% (it's often more helpful to talk about error rates).

### Using our GloVe word embeddings

We could improve our model by using the GloVe word embeddings from above, since these capture semantic meaning and have been trained for much longer on a much larger dataset than the one we are using here.

We are going to use a version of GloVe where the embeddings have just 50 dimensions (as opposed to 100).  It's the same idea as before.

The GloVe word ids and IMDB word ids use different indexes. So we create a simple function that builds an embedding matrix using the indexes from IMDB and the embeddings from GloVe (where they exist).

In [526]:
def create_emb():
    n_fact = vecs50.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1,len(emb)):
        word = idx2word[i]
        # Only look up words that actually exist in GloVe, to avoid a KeyError
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs50[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = np.random.normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = np.random.normal(scale=0.6, size=(n_fact,))
    emb/=3
    return emb

In [189]:
emb = create_emb()
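
Here is the same re-indexing idea on a toy scale, in case the function above is hard to follow. All names and numbers below (`toy_glove_words`, `toy_glove_vecs`, `toy_idx2word`) are made up; the sketch only shows how rows are copied from one index system to the other and how missing words get random vectors.

```python
import numpy as np

# Hypothetical "GloVe" side: word -> row, plus the matrix of 3-d vectors
toy_glove_words = {"movie": 0, "good": 1}
toy_glove_vecs = np.array([[0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6]])

# Hypothetical "IMDB" side: id -> word (id 0 is reserved for padding)
toy_idx2word = {1: "good", 2: "bromwell", 3: "movie"}

rng = np.random.RandomState(0)
toy_emb = np.zeros((4, 3))
for i in range(1, 4):
    word = toy_idx2word[i]
    if word in toy_glove_words:
        # Copy the pretrained vector into the row for the IMDB id
        toy_emb[i] = toy_glove_vecs[toy_glove_words[word]]
    else:
        # Word is unknown to "GloVe": start from a random vector
        toy_emb[i] = rng.normal(scale=0.6, size=3)
print(toy_emb)
```

Row `i` of the result is the vector for the word with IMDB id `i`, which is the layout an embedding layer expects.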

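We pass our embedding matrix to the Embedding constructor through the weights argument.  Note that the code below does not freeze the layer; to make it non-trainable as well, pass trainable=False.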

In [190]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.4, weights=[emb]),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')])

In [191]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [192]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fce8dae3da0>

We decrease the learning rate now that we are getting closer to the answer.

In [193]:
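# In this version of Keras, plain assignment (model.optimizer.lr = 1e-4) may not
# reach the already-compiled training function; K.set_value updates the variable in place.
K.set_value(model.optimizer.lr, 1e-4)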

In [194]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fce8da40b38>

Our error rate has improved from 11.1% to 10.3%, a relative improvement of about 7%.

(This value fluctuated between runs; I typically saw an improvement between 4% and 10%.)

In [195]:
(11.1 - 10.3)/11.1

0.07207207207207197