This notebook uses the IMDB dataset for sentiment analysis.

It has two models: one trained from scratch and one that uses pretrained word vectors as embeddings.

In [108]:
import numpy as np

import tensorflow as tf

from keras import layers
from keras import models
from keras import optimizers
from keras import applications
from keras.utils import data_utils
from keras.datasets import imdb
from keras.preprocessing import sequence

In [12]:
word2idx = imdb.get_word_index()

In [13]:
idx2word = {v: k for k, v in word2idx.items()}

In [42]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=None, maxlen=None, skip_top=0, seed=42, start_char=1, oov_char=2, index_from=3)

Look at one sample review

In [43]:
def decode_review(indices):
    return ' '.join([idx2word[o] for o in indices])

In [44]:
decode_review(x_train[322])

"the of center nick's it is video need felt br another movie of i've marty with it of existence result film of work nick's high it sound elements to comedy time' of therefore another for distracts of leaves br of marty trilogy box of my certain br of result to br of another become established of night first of certain film is these worst romantic haphazardly lots shot these promising undervalued see much depressive who of provide puppet cop fit to get read heroes this up corny art puppet hear in fast to remember is screw promising for affleck i'm name prequel terribly or stand in stand comedy deep promising to comedy elements race in stand it's using of feeling shirley's to glitz br grinch's leaves total as date marty in acting film an make d rhetoric to giant this offers of boxing promising br of remember discrimination of hypercube to directly eroticism this as liked is biggest chase 4 of stand br of their become br of gamut experiments br of center expeditious serious to horrors 9 a

The label for this particular review (1 for positive, 0 for negative):

In [45]:
y_train[322]

1
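The decoded review above reads like word salad. One likely cause, offered here as an assumption rather than a confirmed diagnosis, is that `imdb.load_data(..., index_from=3)` shifts every word id up by 3 (ids 0, 1 and 2 are reserved for padding, start and out-of-vocabulary), while `imdb.get_word_index()` returns the unshifted ranks. A minimal sketch of an offset-aware decoder follows; `decode_review_offset`, the placeholder tokens and the toy vocabulary are hypothetical names used only for illustration.

```python
def decode_review_offset(indices, idx2word, index_from=3):
    """Decode ids produced by load_data, undoing the index_from shift."""
    # Reserved ids (assuming the defaults start_char=1, oov_char=2, pad value 0)
    special = {0: '<pad>', 1: '<start>', 2: '<oov>'}
    words = []
    for i in indices:
        if i in special:
            words.append(special[i])
        else:
            # Shift back to the rank used by get_word_index()
            words.append(idx2word.get(i - index_from, '<unk>'))
    return ' '.join(words)


# Toy vocabulary standing in for imdb.get_word_index()
toy_word2idx = {'the': 1, 'movie': 2, 'great': 3}
toy_idx2word = {v: k for k, v in toy_word2idx.items()}

encoded = [1, 4, 5, 2, 6]  # start, the, movie, oov, great
decoded = decode_review_offset(encoded, toy_idx2word)
print(decoded)
```

In the notebook this would be called as `decode_review_offset(x_train[322], idx2word)`.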

In [63]:
lens = np.array(list(map(len, x_train)))
lens.max(), lens.min(), lens.mean()

(2494, 11, 238.71364)

As we can imagine (and see from the numbers), reviews differ in length, so we need to build a rectangular matrix that we can pass to the ML model.

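We reload the dataset with `num_words=vocab_size` and `maxlen=seq_len`, and then we pad the sequences so that every review is exactly 500 tokens long. Note that `maxlen` in `imdb.load_data` drops reviews longer than `seq_len` instead of truncating them.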
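To make the padding step concrete, here is a minimal pure-Python sketch of what `sequence.pad_sequences` does with its default settings (pad on the left, truncate on the left). The helper `pad_pre` and the toy sequences are hypothetical; this is an illustration, not the Keras implementation.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items.

    Assumes maxlen > 0.
    """
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


toy = [[1, 14, 22], [1, 7, 8, 9, 10, 11]]
rect = pad_pre(toy, maxlen=5)
for row in rect:
    print(row)
```

Every row now has the same length, so the result can be stacked into a 2-D array.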

In [86]:
vocab_size = 5000
seq_len = 500

In [79]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size, maxlen=seq_len, skip_top=0, seed=42, start_char=1, oov_char=2, index_from=3)

Let's look at the same review as before:

In [80]:
decode_review(x_train[322])

'the complain complain and and as flynn suspects deal rated ability br is him and associated fact 9 it resemblance by br of where rating br of cars day original your way going room those in of thriller and find of its some br is mine guy who joy left br of and of apart and and really chicago are is countryside seen purely and much anyone who 6 buddies and like with william in and no would jungle this as face like language about and relevant br is order is mind versions who and felix are battle futuristic who and made and of yes film is leading and and and no is and tommy and constant she movie and and and nation or is unpleasant and to public in political and is and and m artistic and horrors and simple to lifetime and nuclear overlook ends wrote and of and played to learning and of pure thick spot and are advice is impossible film be mark and and is macy and this be screaming opera directed to is and video alexander and find but is and and and fail'

In [95]:
x_train = sequence.pad_sequences(x_train, maxlen=seq_len, value=0.)
x_test = sequence.pad_sequences(x_test, maxlen=seq_len, value=0.)

In [82]:
x_train[322].shape

(500,)

## Model

In [98]:
inp = layers.Input(shape=(seq_len,))

In [99]:
x = layers.Embedding(vocab_size, 32)(inp)
x = layers.Flatten()(x)
x = layers.Dense(100, activation='relu')(x)
x = layers.Dropout(0.7)(x)
x = layers.Dense(1, activation='sigmoid')(x)

In [100]:
model = models.Model(inp, x)

In [101]:
model.compile(loss='binary_crossentropy', optimizer=optimizers.Adam(), metrics=['accuracy'])

In [102]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 500)               0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
flatten_2 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 100)               1600100   
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 1,760,201
Trainable params: 1,760,201
Non-trainable params: 0
_________________________________________________________________
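The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic from the layer sizes used above; the variable names `emb_dim` and `hidden` are introduced here only for illustration.

```python
vocab_size, emb_dim, seq_len, hidden = 5000, 32, 500, 100

# Embedding: one emb_dim-sized vector per vocabulary entry
embedding_params = vocab_size * emb_dim

# Flatten turns (seq_len, emb_dim) into a single vector, no parameters
flat_dim = seq_len * emb_dim

# Dense layers: weights plus one bias per output unit
dense_params = flat_dim * hidden + hidden
output_params = hidden * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, flat_dim, dense_params, output_params, total_params)
```

Almost all of the parameters sit in the first dense layer, because flattening produces a 16,000-dimensional input.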


In [None]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=2, batch_size=64)

Training very quickly reaches a score of `loss: 0.2021 - acc: 0.9244 - val_loss: 0.3018 - val_acc: 0.8768`.

## Pretrained embeddings

Download the GloVe vectors and load the vocabulary and the vectors.

In [None]:
data_utils.get_file('glove-6B.zip', 'http://nlp.stanford.edu/data/glove.6B.zip', extract=True, md5_hash='8e1557d1228decbda7db6dfd81cd9909')

In [118]:
import os
ds_dir = os.path.expanduser('~/.keras/datasets/')
fname = os.path.join(ds_dir, 'glove.6B.50d.txt')

In [161]:
# vocab_size = 5000  # From when we read the data
vector_size = 50

In [185]:
glove_vocab_size = 400000
vocabUnicodeSize = 78
encoding = 'utf-8'

# Preallocate arrays for the words and their vectors.
# np.float64 replaces the np.float alias, which newer NumPy versions removed.
glove_vocab = np.empty(glove_vocab_size, dtype='<U%s' % vocabUnicodeSize)
glove_vectors = np.empty((glove_vocab_size, vector_size), dtype=np.float64)
glove_word2idx = {}

with open(fname, 'rb') as fin:
    for i, line in enumerate(fin):
        # Each line is a word followed by vector_size space-separated floats
        line = line.decode(encoding).strip()
        parts = line.split(' ')
        word = parts[0]

        vector = np.array(parts[1:], dtype=np.float64)
        glove_vocab[i] = word
        glove_vectors[i] = vector

        glove_word2idx[word] = i

In [187]:
glove_vectors

array([[ 0.418   ,  0.24968 , -0.41242 , ..., -0.18411 , -0.11514 ,
        -0.78581 ],
       [ 0.013441,  0.23682 , -0.16899 , ..., -0.56657 ,  0.044691,
         0.30392 ],
       [ 0.15164 ,  0.30177 , -0.16763 , ..., -0.35652 ,  0.016413,
         0.10216 ],
       ..., 
       [-0.51181 ,  0.058706,  1.0913  , ..., -0.25003 , -1.125   ,  1.5863  ],
       [-0.75898 , -0.47426 ,  0.4737  , ...,  0.78954 , -0.014116,  0.6448  ],
       [ 0.072617, -0.51393 ,  0.4728  , ..., -0.18907 , -0.59021 ,
         0.55559 ]])

In [188]:
glove_vectors.shape

(400000, 50)

In [189]:
glove_vocab

array(['the', ',', '.', ..., 'rolonda', 'zsombor', 'sandberger'],
      dtype='<U78')

In [190]:
glove_word2idx['rolonda']

399997

Create the embedding weights based on the GloVe vectors.

Iterate through the vocabulary of the loaded data, look up each word in the GloVe vocabulary, and use its vector as the embedding.

In [275]:
emb_weights = np.zeros((vocab_size, vector_size))

for i in range(1, len(emb_weights)):
    word = idx2word[i]
    if word and word in glove_word2idx:
        glove_idx = glove_word2idx[word]
        emb_weights[i] = glove_vectors[glove_idx]
    else:
        # If we can't find the word in glove, randomly initialize
        emb_weights[i] = np.random.normal(scale=0.6, size=(vector_size,))

In [276]:
emb_weights

array([[ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,  0.      ],
       [ 0.418   ,  0.24968 , -0.41242 , ..., -0.18411 , -0.11514 ,
        -0.78581 ],
       [ 0.26818 ,  0.14346 , -0.27877 , ..., -0.6321  , -0.25028 ,
        -0.38097 ],
       ..., 
       [ 0.6307  , -0.22702 ,  0.071692, ...,  0.12668 ,  0.19897 ,
        -0.54055 ],
       [ 1.1595  ,  0.21344 , -0.36298 , ..., -0.61992 ,  0.56161 ,
        -0.94449 ],
       [ 0.24625 ,  0.15718 , -0.37438 , ..., -0.19783 , -1.0133  ,
         0.52402 ]])

In [290]:
emb_weights /= 3

In [291]:
inp = layers.Input(shape=(seq_len,))

In [292]:
x = layers.Embedding(vocab_size, 50, weights=[emb_weights])(inp)
x = layers.Flatten()(x)
x = layers.Dense(100, activation='relu')(x)
x = layers.Dropout(0.7)(x)
x = layers.Dense(1, activation='sigmoid')(x)

In [293]:
model = models.Model(inp, x)

In [294]:
model.compile(loss='binary_crossentropy', optimizer=optimizers.Adam(), metrics=['accuracy'])

In [298]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=1, batch_size=64)

Train on 25000 samples, validate on 20947 samples
Epoch 1/1


<keras.callbacks.History at 0x7f8a0cd0bdd8>

So this actually didn't give better (faster) results. After a couple of epochs the model does improve, but the pretrained embeddings are supposed to make training converge faster.

I am pretty sure the embeddings are fine, but I have been wrong before :)
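One thing that may be worth checking before blaming the vectors: the loop above fills row `i` with the vector for `idx2word[i]`, but the ids produced by `imdb.load_data(..., index_from=3)` are shifted by 3, so row `i` would need the word `idx2word[i - 3]`. If that is what is happening here, each token looks up some other word's GloVe vector. The sketch below is pure Python on toy data; `build_embedding_rows` and the toy dictionaries are hypothetical, and `pretrained` stands in for a lookup into `glove_vectors`.

```python
import random


def build_embedding_rows(vocab_size, vector_size, idx2word, pretrained,
                         index_from=3, scale=0.6, seed=0):
    """Return vocab_size rows; row i holds the vector for load_data id i."""
    rng = random.Random(seed)
    rows = [[0.0] * vector_size]  # row 0 is the padding id
    for i in range(1, vocab_size):
        # Undo the shift applied by load_data; reserved ids map to None
        word = idx2word.get(i - index_from)
        if word is not None and word in pretrained:
            rows.append(list(pretrained[word]))
        else:
            # No pretrained vector available: random initialization
            rows.append([rng.gauss(0.0, scale) for _ in range(vector_size)])
    return rows


# Toy stand-ins for idx2word and the GloVe lookup
toy_idx2word = {1: 'the', 2: 'and', 3: 'zzz'}
toy_pretrained = {'the': [0.1, 0.2], 'and': [0.3, 0.4]}

rows = build_embedding_rows(vocab_size=7, vector_size=2,
                            idx2word=toy_idx2word, pretrained=toy_pretrained)
print(rows[4], rows[5])
```

Wrapping the result in `np.array(rows)` would give a `(vocab_size, vector_size)` matrix with the same shape as `emb_weights`.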