## Loading our data

In [3]:
import pickle
import numpy as np
import re

In [4]:
import bcolz
def load_array(fname):
    return bcolz.open(fname)[:]

In [5]:
def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb'), encoding='latin1'),
        pickle.load(open(loc+'_idx.pkl','rb'), encoding='latin1'))

In [1]:
%ls /data/datasets/nlp/glove/results

[0m[01;34m6B.100d.dat[0m/       [01;34m6B.200d.dat[0m/       [01;34m6B.300d.dat[0m/       [01;34m6B.50d.dat[0m/
6B.100d_idx.pkl    6B.200d_idx.pkl    6B.300d_idx.pkl    6B.50d_idx.pkl
[01;31m6B.100d.tgz[0m        [01;31m6B.200d.tgz[0m        [01;31m6B.300d.tgz[0m        [01;31m6B.50d.tgz[0m
6B.100d_words.pkl  6B.200d_words.pkl  6B.300d_words.pkl  6B.50d_words.pkl


In [6]:
vecs, words, wordidx = load_vectors('/data/datasets/nlp/glove/results/6B.300d')

Let's see what our data looks like:

In [7]:
len(words)

400000

In [8]:
words[:10]

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

In [9]:
words[600:610]

['together',
 'congress',
 'index',
 'australia',
 'results',
 'hard',
 'hours',
 'land',
 'action',
 'higher']

wordidx lets us look up a word to find its index:

In [10]:
wordidx['intelligence']

1226

In [11]:
words[1226]

'intelligence'
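
The two lookups are inverses of each other. As a minimal sketch, assuming `wordidx` is simply a dictionary built by enumerating `words` (the pickled files may have been produced differently), the relationship looks like this on a toy vocabulary. `toy_words` and `toy_wordidx` are illustrative names, not part of the notebook:

```python
# Toy vocabulary standing in for the 400,000-word GloVe list (illustrative only).
toy_words = ['the', ',', '.', 'of', 'to']

# Assumed construction: map each word to its position in the list.
toy_wordidx = {word: i for i, word in enumerate(toy_words)}

# Round trip: word -> index -> word.
index = toy_wordidx['of']
round_trip = toy_words[index]
```

The list answers "which word has this id?" and the dictionary answers "which id does this word have?", both in constant time.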

What words are similar to intelligence?

Right now, our list of words can't answer that.

### Words as vectors

'intelligence' is represented by a 300-dimensional vector (we loaded the 6B.300d file above; the output below is truncated):

In [12]:
vecs[1226]

array([ -6.52069986e-01,   2.87869990e-01,  -1.07029997e-01,
        -1.06070004e-02,  -2.31639996e-01,  -1.54410005e-01,
        -3.81969988e-01,   5.29470026e-01,  -4.03880000e-01,
        -3.02410007e+00,   4.64819998e-01,  -5.28949983e-02,
        -1.52679995e-01,   3.22320014e-01,  -2.55369991e-01,
         2.64629990e-01,   8.37830007e-01,  -9.72450003e-02,
         8.51809978e-02,   3.98719996e-01,   1.02660000e-01,
        -4.09559995e-01,  -7.27319997e-03,  -3.51529986e-01,
        -4.16489989e-01,   6.47870004e-02,  -1.48800001e-01,
         7.11059988e-01,  -2.09369995e-02,   5.10909975e-01,
        -3.34690005e-01,  -7.14389980e-01,  -4.68459994e-01,
         2.97599994e-02,   2.58049995e-01,  -3.82010013e-01,
         3.80780011e-01,   3.71329993e-01,  -4.53170002e-01,
        -9.43189979e-01,   3.27129990e-01,   1.14349999e-01,
        -1.40249997e-01,   2.74509996e-01,  -4.57929999e-01,
        -8.18200037e-02,  -3.08620006e-01,   7.91150033e-02,
        -9.46670026e-02,

This lets us do some useful calculations. For instance, we can see how far apart two words are:

In [13]:
from scipy.spatial.distance import cosine as dist

The distance between similar words is low:

In [14]:
dist(vecs[wordidx["puppy"]], vecs[wordidx["dog"]])

0.40639386485700224

In [15]:
dist(vecs[wordidx["queen"]], vecs[wordidx["princess"]])

0.36432365465061811

And the distance between unrelated words is high:

In [16]:
dist(vecs[wordidx["kitten"]], vecs[wordidx["airplane"]])

0.99318009024356424

In [17]:
dist(vecs[wordidx["celebrity"]], vecs[wordidx["dusty"]])

1.0488782857777486

In [18]:
dist(vecs[wordidx["avalanche"]], vecs[wordidx["antique"]])

1.0393029477348061
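
To see what those numbers mean, here is a small sketch of the cosine distance written out by hand: one minus the cosine of the angle between the two vectors. The helper name `cosine_distance` and the toy vectors are mine, for illustration only:

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Vectors pointing the same way, at right angles, and in opposite directions.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-2.0, 0.0])
```

The distance ignores vector length and ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite). That is why unrelated pairs above land near 1, and a value slightly over 1 just means the cosine similarity is slightly negative.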

We can also see what words are close to a given word.

In [19]:
from sklearn.neighbors import NearestNeighbors

Nearest Neighbors is an algorithm that finds the points closest to a given point.

In [20]:
neigh = NearestNeighbors(n_neighbors=10, radius=0.5, metric='cosine', algorithm='brute')
neigh.fit(vecs) 

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=10, p=2, radius=0.5)

In [21]:
distances, indices = neigh.kneighbors([vecs[1226]])

In [22]:
[words[int(ind)] for ind in indices[0]]

['intelligence',
 'cia',
 'information',
 'security',
 'counterterrorism',
 'operatives',
 'fbi',
 'military',
 'secret',
 'spy']
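
With `algorithm='brute'`, the search amounts to computing the distance from the query to every row and keeping the smallest ones. The following is a rough NumPy sketch of that idea on toy data, not scikit-learn's actual implementation; `nearest_by_cosine`, `toy_words` and `toy_vecs` are hypothetical names:

```python
import numpy as np

def nearest_by_cosine(query, matrix, k):
    """Return (distances, row indices) of the k rows closest to query by cosine distance."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    dists = 1.0 - sims
    order = np.argsort(dists)[:k]  # smallest distances first
    return dists[order], order

toy_words = ["cat", "kitten", "car", "truck"]
toy_vecs = np.array([[1.0, 0.9, 0.0],
                     [0.9, 1.0, 0.1],
                     [0.0, 0.1, 1.0],
                     [0.1, 0.0, 0.9]])

dists, idxs = nearest_by_cosine(toy_vecs[0], toy_vecs, k=2)
neighbors = [toy_words[i] for i in idxs]
```

As in the real output above, the query word itself comes back first, because its distance to itself is zero.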

We can take this a step further, and add two words together.  What is the result?

In [23]:
new_vec = vecs[wordidx["artificial"]] + vecs[wordidx["intelligence"]]

In [24]:
new_vec

array([ -1.04946995e+00,   1.09679997e-01,  -2.37800002e-01,
         4.41762984e-01,  -6.37489974e-01,  -1.69235006e-01,
        -2.28319988e-01,   1.35890037e-01,  -2.92060018e-01,
        -4.89580011e+00,   7.30489969e-01,  -9.51889977e-02,
        -7.50439912e-02,   1.99990004e-01,   2.74540037e-01,
         3.27181995e-01,   6.39230013e-01,   2.27394998e-01,
        -2.37269014e-01,   7.63599992e-01,  -3.67049992e-01,
        -2.92549998e-01,  -6.74103200e-01,   4.04150039e-01,
        -1.12989998e+00,   3.89737010e-01,  -4.74629998e-01,
         4.07639980e-01,  -3.42967004e-01,   1.52470994e+00,
        -3.64854008e-01,  -3.87269974e-01,  -9.44000006e-01,
         9.67870057e-02,   5.07059991e-01,   5.33720016e-01,
         9.79380012e-01,   3.42716992e-01,  -1.99719995e-01,
        -4.25859988e-01,  -1.31960005e-01,   2.58089989e-01,
         1.28430009e-01,   5.97499967e-01,  -7.00890005e-01,
        -3.12350005e-01,  -1.79130003e-01,   7.68934965e-01,
         4.79852974e-01,

In [25]:
distances, indices = neigh.kneighbors([new_vec])



In [26]:
[words[int(ind)] for ind in indices[0]]

['intelligence',
 'artificial',
 'information',
 'knowledge',
 'cia',
 'methods',
 'secret',
 'source',
 'capabilities',
 'sources']

In [27]:
distances, indices = neigh.kneighbors([vecs[wordidx["kitten"]]])



In [28]:
[words[int(ind)] for ind in indices[0]]

['kitten',
 'kittens',
 'puppy',
 'puppies',
 'pooch',
 'cat',
 'cute',
 'purr',
 'adorable',
 'rottweiler']

In [29]:
new_vec = vecs[wordidx["kitten"]] - vecs[wordidx["cat"]] + vecs[wordidx["dog"]]

In [30]:
distances, indices = neigh.kneighbors([new_vec])

In [31]:
[words[int(ind)] for ind in indices[0]]

['kitten',
 'puppy',
 'dog',
 'rottweiler',
 'dogs',
 'puppies',
 'retriever',
 'leash',
 'hound',
 'pooch']

In [32]:
distances, indices = neigh.kneighbors([vecs[wordidx["king"]]])

In [33]:
[words[int(ind)] for ind in indices[0]]

['king',
 'queen',
 'prince',
 'monarch',
 'kingdom',
 'throne',
 'ii',
 'iii',
 'crown',
 'reign']

In [34]:
new_vec = vecs[wordidx["king"]] - vecs[wordidx["man"]] + vecs[wordidx["woman"]]

In [35]:
distances, indices = neigh.kneighbors([new_vec])

In [36]:
[words[int(ind)] for ind in indices[0]]

['king',
 'queen',
 'monarch',
 'throne',
 'princess',
 'mother',
 'daughter',
 'kingdom',
 'prince',
 'elizabeth']
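
Notice that the words we started from ('king', 'kitten') keep showing up at the top of the results. A common workaround is to skip the input words when ranking. Here is a minimal sketch on hand-made 2D vectors, where the first axis loosely stands for "royalty" and the second for "gender"; the vectors and the `analogy` helper are invented for illustration and are not GloVe values:

```python
import numpy as np

# Hand-made toy vectors (not real embeddings).
toy = {
    "king":   np.array([0.9, 0.8]),
    "man":    np.array([0.1, 0.9]),
    "woman":  np.array([0.1, -0.9]),
    "queen":  np.array([0.9, -0.8]),
    "prince": np.array([0.8, 0.7]),
    "apple":  np.array([-0.5, 0.1]),
}

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c], skipping a, b and c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

answer = analogy("king", "man", "woman", toy)
```

On real embeddings the same filtering would be applied to the indices returned by `kneighbors`.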

## Nearest Neighbors (d=50)

In [61]:
from sklearn.neighbors import NearestNeighbors

In [62]:
neigh = NearestNeighbors(n_neighbors=5, radius=0.5)
neigh.fit(vecs) 

NearestNeighbors(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=0.5)

In [79]:
distances, indices = neigh.kneighbors([vecs[wordidx["queen"]]])

In [80]:
[words[int(ind)] for ind in indices[0]]

['queen', 'princess', 'lady', 'elizabeth', 'prince']

In [150]:
distances, indices = neigh.kneighbors([vecs[wordidx["tarantula"]]])

In [151]:
[words[int(ind)] for ind in indices[0]]

['tarantula', 'two-headed', 'leviathan', 'rattler', 'ape']

In [134]:
new_vec = vecs[wordidx["kitten"]] - vecs[wordidx["cat"]] + vecs[wordidx["dog"]]

In [135]:
distances, indices = neigh.kneighbors([new_vec])

In [136]:
indices

array([[34698, 22454, 76671,  2926, 54331]])

In [137]:
[words[int(ind)] for ind in indices[0]]

['kitten', 'puppy', 'rottweiler', 'dog', 'spunky']

In [147]:
np.linalg.norm(vecs[wordidx["puppy"]] - vecs[wordidx["dog"]])

3.149688

In [144]:
np.linalg.norm(vecs[wordidx["queen"]] - vecs[wordidx["princess"]])

3.0129473

In [139]:
np.linalg.norm(vecs[wordidx["kitten"]] - vecs[wordidx["airplane"]])

5.3257985

In [141]:
np.linalg.norm(vecs[wordidx["celebrity"]] - vecs[wordidx["dusty"]])

5.8440499

In [143]:
np.linalg.norm(vecs[wordidx["avalanche"]] - vecs[wordidx["antique"]])

6.6188631

## Setup data

We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. Keras comes with some helpers for this dataset.

In [6]:
from keras.datasets import imdb
from keras.utils.data_utils import get_file
idx = imdb.get_word_index()

This is the word list:

In [2]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

...and this is the mapping from id to word:

In [4]:
idx2word = {v: k for k, v in idx.items()}

We download the reviews using code copied from keras.datasets:

In [106]:
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
# Use a context manager so the file is closed after loading
with open(path, 'rb') as f:
    (x_train, labels_train), (x_test, labels_test) = pickle.load(f)

In [107]:
len(x_train)

25000

Here's the first review. As you can see, the words have been replaced by ids, which can be looked up in idx2word.

In [108]:
', '.join(map(str, x_train[0]))

'23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215'

The first word of the first review is 23022. Let's see what that is.

In [12]:
idx2word[23022]

'bromwell'

Here's the whole review, mapped from ids to words.

In [13]:
' '.join([idx2word[o] for o in x_train[0]])

"bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't"

The labels are 1 for positive, 0 for negative.

In [14]:
labels_train[:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Reduce the vocabulary size by mapping every rare word (any id of vocab_size-1 or above) to the maximum index, vocab_size-1.

In [17]:
vocab_size = 5000

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
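
The list comprehension is just an element-wise clip. As a quick sketch on a made-up review (the ids below are chosen only to exercise the boundary), `np.minimum` gives the same result:

```python
import numpy as np

vocab_size = 5000
review = [23022, 309, 6, 4998, 4999, 5000]  # made-up ids around the cutoff

# Same expression as in the cell above.
looped = np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in review])

# Vectorised equivalent: cap every id at vocab_size - 1.
clipped = np.minimum(review, vocab_size - 1)
```

Either way, id 4999 ends up meaning "some rare word" rather than one specific word.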

Look at the distribution of review lengths.

In [21]:
trn[:10]

[array([4999,  309,    6,    3, 1069,  209,    9, 2175,   30,    1,  169,
          55,   14,   46,   82, 4999,   41,  393,  110,  138,   14, 4999,
          58, 4477,  150,    8,    1, 4999, 4999,  482,   69,    5,  261,
          12, 4999, 4999, 2003,    6,   73, 2436,    5,  632,   71,    6,
        4999,    1, 4999,    5, 2004, 4999,    1, 4999, 1534,   34,   67,
          64,  205,  140,   65, 1232, 4999, 4999,    1, 4999,    4,    1,
         223,  901,   29, 3024,   69,    4,    1, 4999,   10,  694,    2,
          65, 1534,   51,   10,  216,    1,  387,    8,   60,    3, 1472,
        3724,  802,    5, 3521,  177,    1,  393,   10, 1238, 4999,   30,
         309,    3,  353,  344, 2989,  143,  130,    5, 4999,   28,    4,
         126, 4999, 1472, 2375,    5, 4999,  309,   10,  532,   12,  108,
        1470,    4,   58,  556,  101,   12, 4999,  309,    6,  227, 4187,
          48,    3, 2237,   12,    9,  215]),
 array([4999,   39, 4999,   14,  739, 4999, 3428,   44,   74,   32

In [30]:
lens = np.array([len(review) for review in trn])

In [29]:
(lens.max(), lens.min(), lens.mean())

(2493, 10, 237.71364)

Pad (with zeros) or truncate each review to a consistent length.

In [115]:
from keras.preprocessing import sequence

In [116]:
seq_len = 500

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)

This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros; longer reviews are truncated.

In [117]:
trn.shape

(25000, 500)
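
For intuition, here is a plain-Python sketch of what happens to a single review, assuming the default `pad_sequences` behaviour of padding at the front and, for long sequences, dropping items from the front as well. `pad_or_truncate` is a hypothetical helper, not a Keras function:

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad seq with value, or keep only its last maxlen items."""
    seq = list(seq)
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # drop the oldest items
    return [value] * (maxlen - len(seq)) + seq

short = pad_or_truncate([7, 8, 9], maxlen=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
```

Since 0 is used as the padding value, it no longer stands for a real word in the padded matrix.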

## Create simple models

The [Stanford paper](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) that this dataset is from cites a state-of-the-art accuracy (without unlabelled data) of 0.883. So we're short of that, but on the right track.

The GloVe word ids and IMDB word ids use different indexes, so we write a simple function that builds an embedding matrix using the indexes from IMDB and the embeddings from GloVe (where they exist).

In [129]:
def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1,len(emb)):
        word = idx2word[i]
        # Only look up plain words that GloVe actually contains (avoids a KeyError)
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = np.random.normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = np.random.normal(scale=0.6, size=(n_fact,))
    emb/=3
    return emb
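
The core of this function is re-indexing: row `i` of the new matrix must hold the vector for whatever word the target dataset calls `i`. Below is a toy sketch of that remapping with two tiny made-up vocabularies. All names and values are illustrative, and it leaves out the regex filter and the final scaling:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# "Source" embeddings (stand-in for GloVe).
src_words = ["the", "movie", "great"]
src_vecs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
src_idx = {w: i for i, w in enumerate(src_words)}

# "Target" ids (stand-in for the IMDB index); id 0 is reserved for padding.
tgt_id2word = {1: "great", 2: "the", 3: "bromwell"}
n_rows = 4

emb = np.zeros((n_rows, src_vecs.shape[1]))
for i in range(1, n_rows):
    word = tgt_id2word[i]
    if word in src_idx:
        emb[i] = src_vecs[src_idx[word]]  # copy the pretrained vector
    else:
        emb[i] = rng.normal(scale=0.6, size=src_vecs.shape[1])  # unknown word
```

Row 0 stays all zeros, matching the padding id used earlier.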

### Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.

In [112]:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.core import Flatten, Dense, Dropout
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.optimizers import Adam

In [118]:
conv1 = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])
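
To make the shapes concrete, here is a rough NumPy sketch of what a 'same'-padded 1D convolution followed by max pooling does to a sequence of embeddings. It is not how Keras implements these layers; the helper names are mine, and I'm assuming a pooling width of 2:

```python
import numpy as np

def conv1d_same_relu(x, filters):
    """x: (seq_len, emb_dim); filters: (width, emb_dim, n_filters)."""
    seq_len, _ = x.shape
    width, _, n_filters = filters.shape
    pad = (width - 1) // 2
    # Zero-pad the time axis so the output keeps seq_len steps.
    padded = np.pad(x, ((pad, width - 1 - pad), (0, 0)))
    out = np.empty((seq_len, n_filters))
    for t in range(seq_len):
        window = padded[t:t + width]  # (width, emb_dim)
        out[t] = np.tensordot(window, filters, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU

def max_pool1d(x, pool=2):
    """Take the max over non-overlapping groups of pool time steps."""
    steps = x.shape[0] // pool
    return x[:steps * pool].reshape(steps, pool, -1).max(axis=1)

rng = np.random.default_rng(0)
embedded = rng.normal(size=(8, 4))    # 8 words, 4 factors each
filters = rng.normal(size=(5, 4, 3))  # 3 filters spanning 5 words

conv_out = conv1d_same_relu(embedded, filters)
features = max_pool1d(conv_out)
flat = features.reshape(-1)
```

Each filter slides along the word axis only, looking at all embedding factors of 5 neighbouring words at once, which is what makes the convolution 1D.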

In [119]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [120]:
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f728cc8b1d0>

That's past the Stanford paper's accuracy - another win for CNNs!

In [281]:
conv1.save_weights(model_path + 'conv1.h5')

In [46]:
conv1.load_weights(model_path + 'conv1.h5')

In [130]:
emb = create_emb()

We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.

In [131]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [132]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [133]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f72879f1f98>

We have already beaten our previous model! But let's fine-tune the embedding weights, especially since the words we couldn't find in GloVe just have random embeddings.

In [91]:
model.layers[0].trainable=True

In [92]:
model.optimizer.lr=1e-4

In [93]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=1, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x7f0de0c4e0d0>

As expected, that's given us a nice little boost. :)

In [94]:
model.save_weights(model_path+'glove50.h5')

## Multi-size CNN

This is an implementation of a multi-size CNN as shown in Ben Bowles' [excellent blog post](https://quid.com/feed/how-quid-uses-deep-learning-with-small-data).

In [23]:
from keras.layers import Input, Merge
from keras.models import Model

We use the functional API to create multiple conv layers of different sizes, and then concatenate them.

In [132]:
# This sub-model receives the embedding output: seq_len time steps x 50 factors
graph_in = Input((seq_len, 50))
convs = []
for fsz in range(3, 6):
    x = Convolution1D(64, fsz, border_mode='same', activation="relu")(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = Merge(mode="concat")(convs)
graph = Model(graph_in, out)
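
Because every branch uses `border_mode='same'`, the filter width does not change the output length, so all three flattened branches have the same size and can be concatenated directly. A back-of-the-envelope sketch of the feature counts, assuming the branch input has 500 time steps and a pooling width of 2 (both are assumptions for illustration):

```python
seq_len, n_filters, pool = 500, 64, 2  # assumed values

# One flattened branch per filter size: (seq_len / pool) steps x n_filters channels.
branch_sizes = {fsz: (seq_len // pool) * n_filters for fsz in range(3, 6)}

# Concatenation stacks the branches side by side.
total_features = sum(branch_sizes.values())
```

The Dense layer that follows therefore sees three times as many features as in the single-filter-size model.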

In [174]:
emb = create_emb()

We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.

In [175]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, weights=[emb]),
    Dropout(0.2),
    graph,
    Dropout(0.5),
    Dense(100, activation="relu"),
    Dropout(0.7),
    Dense(1, activation='sigmoid')
    ])

In [176]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [177]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f55b79b7990>

Interestingly, in this case I got the best results when the embedding layer started out trainable and was then set to non-trainable after a couple of epochs. I have no idea why!

In [178]:
model.layers[0].trainable=False

In [179]:
model.optimizer.lr=1e-5

In [180]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f55b74de110>

This more complex architecture has given us another boost in accuracy.

## LSTM

We haven't covered this bit yet!

In [79]:
from keras.layers import LSTM
from keras.regularizers import l2

model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, mask_zero=True,
              W_regularizer=l2(1e-6), dropout=0.2),
    LSTM(100, consume_less='gpu'),
    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_13 (Embedding)         (None, 500, 32)       160064      embedding_input_13[0][0]         
____________________________________________________________________________________________________
lstm_13 (LSTM)                   (None, 100)           53200       embedding_13[0][0]               
____________________________________________________________________________________________________
dense_18 (Dense)                 (None, 1)             101         lstm_13[0][0]                    
Total params: 213365
____________________________________________________________________________________________________


In [80]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=5, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f9a16b12c50>