First, our imports.  numpy and tensorflow are the matrix and neural network packages (Keras is TensorFlow's high-level, 'fast to develop' API).  The imdb dataset that ships with tf.keras is pre-cleaned, allowing us to focus on neural networks.

Sequential is just how we build the model (sequentially, adding layers).  We'll import the SimpleRNN, LSTM, and GRU layers as well as a special layer known as 'Embedding'.

In [1]:
import numpy
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM, SimpleRNN, GRU
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)

Now we will load the dataset, keeping only the most frequent words.  Every word outside the top `top_words` is coded as 'unknown'.  Because the words in the imdb dataset are indexed by frequency, this filtering is quick.  The dataset also comes already split into train and test sets.

In [2]:
# load the dataset but only keep the top n words; rarer words are replaced by the 'unknown' index
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
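
As a rough sketch of what `num_words` does, the snippet below caps a toy review at a small vocabulary.  It assumes the Keras default where index 2 marks an out-of-vocabulary word and any index at or above `num_words` is replaced by it.  `cap_vocabulary` and the toy review are made up for illustration; they are not part of Keras.

```python
OOV_INDEX = 2  # assumed Keras default index for out-of-vocabulary words


def cap_vocabulary(review, num_words, oov_index=OOV_INDEX):
    """Replace every word index >= num_words with the OOV index."""
    return [w if w < num_words else oov_index for w in review]


toy_review = [1, 14, 22, 7301, 16, 9999]  # made-up word indices
capped = cap_vocabulary(toy_review, num_words=5000)
print(capped)
```

The two rare words (7301 and 9999) come back as 2, which is the same index that decodes to `<UNK>` later in this notebook.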

Now we will make all our data the same length.  We set the maximum review length to 500 tokens: longer reviews are truncated, and anything shorter than 500 tokens is padded with 0's at the beginning.  Padding goes at the beginning for RNNs because feeding a long run of 0s late in the sequence may cause the network to forget what it saw in the real tokens (think about the challenge this poses for bi-directional models, though!)

In [3]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
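
To see what the padding step does to a single review, here is a minimal pure-Python sketch.  It assumes the `pad_sequences` defaults (`padding='pre'` and `truncating='pre'`), so both the zeros and the cut happen at the front.  `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate seq to exactly maxlen items (maxlen > 0)."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value


short_review = pad_pre([5, 8, 13], maxlen=6)
long_review = pad_pre([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)
print(short_review)
print(long_review)
```

The short review gains three leading zeros, while the long one loses its first two tokens, so the end of every review is always what the RNN sees last.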

Here we build a simple RNN model.  The embedding layer allows us to train our own word vectors.  It becomes, in essence, a lookup table for the word vectors: the most frequent word, "the", is encoded as 4, which means "grab row 4 of the embedding matrix as the input".  Because this is a weight matrix, the weights will be learned.  (In fine detail, the 4 becomes a one-hot encoded vector with 0s everywhere except position 4, and it is multiplied by the weight matrix, which 'selects' row 4.  However, it is more efficient in implementation to do a table lookup than to build a large one-hot matrix and multiply.)
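
The equivalence between the one-hot multiplication and the row lookup can be checked on a toy example.  This is only a sketch: the 6 x 3 matrix `W` and its values are made up, and plain Python lists stand in for the real weight tensor.

```python
# Toy embedding matrix: 6 "words", each with a 3-dimensional vector (made-up values).
W = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]
index = 4

# Route 1: one-hot vector times the weight matrix.
one_hot = [1.0 if i == index else 0.0 for i in range(len(W))]
via_matmul = [
    sum(one_hot[i] * W[i][j] for i in range(len(W)))
    for j in range(len(W[0]))
]

# Route 2: plain row lookup, which is what an embedding layer effectively does.
via_lookup = W[index]

print(via_matmul == via_lookup)
```

Both routes give the same vector, but the lookup touches one row instead of multiplying through the whole matrix.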

Ask yourself: why is the number of parameters what it is?  Can you calculate how many there should be?  Does it match? (Hint: It should!)

Notice we are validating on the test data!

In [4]:

# create the model
embedding_vecor_length = 80
model1 = Sequential()
model1.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model1.add(SimpleRNN(100,unroll=True))
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model1.summary())
model1.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 80)           400000    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 100)               18100     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 418,201
Trainable params: 418,201
Non-trainable params: 0
_________________________________________________________________
None
Train on 25000 samples, validate on 25000 samples

<tensorflow.python.keras.callbacks.History at 0x2e087668470>

Now let's check how we performed on the test data overall.

In [5]:
scores = model1.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 75.93%
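
The parameter counts in the model1 summary above can be reproduced by hand.  The sketch below assumes the usual layout of one input weight matrix, one recurrent weight matrix, and one bias vector for the simple RNN; the helper function names are made up for this check.

```python
def embedding_params(vocab_size, dim):
    # one dim-length vector per word in the vocabulary
    return vocab_size * dim


def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + bias
    return input_dim * units + units * units + units


def dense_params(input_dim, units):
    # weights + bias
    return input_dim * units + units


counts = [
    embedding_params(5000, 80),   # top_words x embedding length
    simple_rnn_params(80, 100),   # 80-d inputs, 100 hidden units
    dense_params(100, 1),         # 100 hidden units -> 1 sigmoid output
]
print(counts, sum(counts))
```

The three numbers line up with the 400000, 18100, and 101 in the summary, for 418,201 in total.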


Now we can swap out our RNN for an LSTM.  Keeping everything else the same, how much does it improve (if at all)?  Can you make sense of the number of parameters here?

In [6]:
embedding_vecor_length = 80
model2 = Sequential()
model2.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model2.add(LSTM(100))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model2.summary())
model2.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 80)           400000    
_________________________________________________________________
lstm (LSTM)                  (None, 100)               72400     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 472,501
Trainable params: 472,501
Non-trainable params: 0
_________________________________________________________________
None
Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x158d34f6828>

In [7]:
scores = model2.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 84.75%
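
One way to make sense of the 72,400 LSTM parameters in the summary above: an LSTM has four weight blocks (input gate, forget gate, output gate, and the candidate cell state), and each block is the same size as a simple RNN layer.  A small sketch, with `lstm_params` as a made-up helper:

```python
def lstm_params(input_dim, units):
    # each block has input weights, recurrent weights, and a bias
    per_block = input_dim * units + units * units + units
    # four blocks: input gate, forget gate, output gate, candidate cell state
    return 4 * per_block


lstm_count = lstm_params(80, 100)
print(lstm_count)
```

That is exactly four times the 18,100 parameters of the simple RNN layer.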


Now let's use a GRU cell.

In [8]:
embedding_vecor_length = 80
model3 = Sequential()
model3.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model3.add(GRU(100))
model3.add(Dense(1, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model3.summary())
model3.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 80)           400000    
_________________________________________________________________
gru (GRU)                    (None, 100)               54600     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 454,701
Trainable params: 454,701
Non-trainable params: 0
_________________________________________________________________
None
Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x15703d30cf8>

In [10]:
scores = model3.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 86.39%
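
The 54,600 GRU parameters in the summary above take one extra step to explain.  A GRU has three weight blocks (update gate, reset gate, candidate state), but three simple-RNN-sized blocks would only give 54,300.  The sketch below assumes the layer keeps two bias vectors per block, which I believe is what the `reset_after=True` setting does in recent TensorFlow versions; that assumption is what makes the total match.  `gru_params` is a made-up helper.

```python
def gru_params(input_dim, units, reset_after=True):
    # assumed: reset_after=True keeps separate input and recurrent biases
    biases = 2 * units if reset_after else units
    per_block = input_dim * units + units * units + biases
    # three blocks: update gate, reset gate, candidate state
    return 3 * per_block


gru_count = gru_params(80, 100)
gru_count_single_bias = gru_params(80, 100, reset_after=False)
print(gru_count, gru_count_single_bias)
```

The difference between the two totals is 300, that is, one extra 100-long bias vector for each of the three blocks.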


Now, instead of training our own vectors, let's try pre-trained GloVe vectors.  First I will read them in.  You must download them from the GloVe website.  Why didn't I use a length-80 vector?  Because GloVe vectors are only downloadable in certain sizes, in this case 100 dimensions.

In [11]:
import numpy as np
# map each word to its 100-dimensional GloVe vector
embeddings_index = dict()
with open('glove.6B/glove.6B.100d.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
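
If you don't have the GloVe file handy, the parsing logic can be tried on a couple of in-memory lines.  The file format is one word per line followed by its coefficients, separated by spaces.  The two 4-dimensional vectors below are made up and only stand in for the real 100-dimensional ones.

```python
import io

# Two made-up lines in the GloVe text format: word, then space-separated floats.
sample = io.StringIO(
    "the 0.1 -0.2 0.3 0.4\n"
    "film 0.5 0.6 -0.7 0.8\n"
)

toy_index = {}
for line in sample:
    word, *coefs = line.split()          # first token is the word, the rest are numbers
    toy_index[word] = [float(c) for c in coefs]

print(toy_index["film"])
```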

Now, before we continue, let's take a look at what the data really was.  I like to tinker around in the data and make sure I understand what is happening.  It's a good way to learn.

In [15]:
NUM_WORDS=5000  # only use top 5000 words
INDEX_FROM=3   # word index offset

train,test = imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)
train_x,train_y = train
test_x,test_y = test

word_to_id = tf.keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in train_x[0] ))

<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly <UNK> was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little <UNK> that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big <UNK> for the whole film but these children are amazing and should be <UNK> for what they

What I do here is replace the embedding matrix with the pre-trained word vectors.  So when the word 'the' comes up, its one-hot vector (length 5000, with 0's everywhere except position 4) selects row 4 of the embedding matrix, which is where we put the GloVe vector for 'the'.

In [16]:
vocabulary_size=5000
embedding_matrix = np.zeros((vocabulary_size, 100))
for word, index in word_to_id.items():
    if index >= vocabulary_size:
        continue  # word is outside the top-5000 vocabulary
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
embedding_matrix[2] = embeddings_index.get('unk')  # <UNK> borrows the GloVe vector for 'unk'
embedding_matrix[1] = np.ones(100)  # <START> gets an arbitrary all-ones vector
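
A sanity check I find useful at this point is to see which rows of the matrix were never filled, since those words will look identical (all zeros) to the network.  The toy version below uses a made-up vocabulary and 2-dimensional vectors, with plain lists instead of a numpy array.

```python
toy_glove = {"the": [0.1, 0.2], "film": [0.3, 0.4]}            # made-up vectors
toy_word_to_id = {"<PAD>": 0, "the": 1, "film": 2, "zzyzx": 3}  # made-up vocabulary
dim = 2

# Same fill logic as above: copy a vector in only when the word is found.
toy_matrix = [[0.0] * dim for _ in range(len(toy_word_to_id))]
for word, idx in toy_word_to_id.items():
    vec = toy_glove.get(word)
    if vec is not None:
        toy_matrix[idx] = list(vec)

# Words whose row is still all zeros had no pre-trained vector.
missing = sorted(w for w, idx in toy_word_to_id.items() if not any(toy_matrix[idx]))
print(missing)
```

Here the padding token and the invented word 'zzyzx' stay at zero, while 'the' and 'film' pick up their vectors.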

In [17]:
word_to_id['film']

22

Now you can check that row 4 of the embedding matrix matches the GloVe vector for 'the'.

In [18]:
embedding_matrix[4]

array([-0.038194  , -0.24487001,  0.72812003, -0.39961001,  0.083172  ,
        0.043953  , -0.39140999,  0.3344    , -0.57545   ,  0.087459  ,
        0.28786999, -0.06731   ,  0.30906001, -0.26383999, -0.13231   ,
       -0.20757   ,  0.33395001, -0.33848   , -0.31742999, -0.48335999,
        0.1464    , -0.37303999,  0.34577   ,  0.052041  ,  0.44946   ,
       -0.46970999,  0.02628   , -0.54154998, -0.15518001, -0.14106999,
       -0.039722  ,  0.28277001,  0.14393   ,  0.23464   , -0.31020999,
        0.086173  ,  0.20397   ,  0.52623999,  0.17163999, -0.082378  ,
       -0.71787   , -0.41531   ,  0.20334999, -0.12763   ,  0.41367   ,
        0.55186999,  0.57907999, -0.33476999, -0.36559001, -0.54856998,
       -0.062892  ,  0.26583999,  0.30204999,  0.99774998, -0.80480999,
       -3.0243001 ,  0.01254   , -0.36941999,  2.21670008,  0.72201002,
       -0.24978   ,  0.92136002,  0.034514  ,  0.46744999,  1.10790002,
       -0.19358   , -0.074575  ,  0.23353   , -0.052062  , -0.22

In [19]:
embeddings_index['the']

array([-0.038194, -0.24487 ,  0.72812 , -0.39961 ,  0.083172,  0.043953,
       -0.39141 ,  0.3344  , -0.57545 ,  0.087459,  0.28787 , -0.06731 ,
        0.30906 , -0.26384 , -0.13231 , -0.20757 ,  0.33395 , -0.33848 ,
       -0.31743 , -0.48336 ,  0.1464  , -0.37304 ,  0.34577 ,  0.052041,
        0.44946 , -0.46971 ,  0.02628 , -0.54155 , -0.15518 , -0.14107 ,
       -0.039722,  0.28277 ,  0.14393 ,  0.23464 , -0.31021 ,  0.086173,
        0.20397 ,  0.52624 ,  0.17164 , -0.082378, -0.71787 , -0.41531 ,
        0.20335 , -0.12763 ,  0.41367 ,  0.55187 ,  0.57908 , -0.33477 ,
       -0.36559 , -0.54857 , -0.062892,  0.26584 ,  0.30205 ,  0.99775 ,
       -0.80481 , -3.0243  ,  0.01254 , -0.36942 ,  2.2167  ,  0.72201 ,
       -0.24978 ,  0.92136 ,  0.034514,  0.46745 ,  1.1079  , -0.19358 ,
       -0.074575,  0.23353 , -0.052062, -0.22044 ,  0.057162, -0.15806 ,
       -0.30798 , -0.41625 ,  0.37972 ,  0.15006 , -0.53212 , -0.2055  ,
       -1.2526  ,  0.071624,  0.70565 ,  0.49744 , 

So now I will redo the simple RNN.  Notice the change to the embedding layer, where I pass in the weights and freeze the layer with 'trainable=False'.  That means the embedding matrix is not updated during training.  Notice also that the embedding vector length is now 100 to match the GloVe vectors, while the number of rows stays at top_words, matching the (5000, 100) shape of embedding_matrix.

In [20]:
embedding_matrix.shape

(5000, 100)

In [21]:
# create the model
embedding_vecor_length = 100
model4 = Sequential()
model4.add(Embedding(top_words, embedding_vecor_length, weights=[embedding_matrix], input_length=max_review_length, trainable=False))
model4.add(SimpleRNN(100,unroll=True))
model4.add(Dense(1, activation='sigmoid'))
model4.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model4.summary())
model4.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=100)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 100)          500000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 100)               20100     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 101       
Total params: 520,201
Trainable params: 20,201
Non-trainable params: 500,000
_________________________________________________________________
None
Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x15959e189e8>

In [22]:
scores = model4.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 71.12%


In [None]:
### make a model with trainable weights here.

Accuracy: 88.93%


In [23]:
from scipy.spatial import distance
second_layer_weights = model4.layers[0].get_weights()[0] # this gets your word vectors from the model; layer 0 is the embedding layer
distance.cosine(second_layer_weights[22],embeddings_index['film']) # don't forget, scipy returns the cosine distance, 1 - cos(x), so 0 means the vectors point the same way

0.0
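
To make the "1 - cos" convention concrete, here is a small pure-Python version of cosine distance.  It is a sketch of the formula only, not scipy's implementation, and the test vectors are made up.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


same = cosine_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```

Identical vectors give a distance of (essentially) 0, unrelated ones give 1, and opposite ones give 2.  So the 0.0 above tells us the frozen embedding row for 'film' is still the GloVe vector.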

In [24]:
second_layer_weights[22]

array([ 0.19916 , -0.049702,  0.24579 , -0.32281 ,  0.89768 , -0.1278  ,
       -0.49506 ,  0.20814 , -0.20046 , -0.20604 ,  0.038292, -0.67277 ,
       -0.12689 , -0.18766 , -0.10277 ,  0.73128 ,  0.82408 ,  0.087288,
        0.69255 ,  1.3107  ,  0.49113 , -0.38097 ,  0.24338 , -0.27813 ,
        0.62506 ,  0.35978 ,  0.42041 , -0.24529 ,  0.14861 , -0.26726 ,
       -0.56262 ,  0.63843 , -0.54153 ,  0.36537 ,  0.20545 , -0.16604 ,
        0.72434 ,  0.29961 , -0.42501 , -0.35932 , -0.089288,  0.48752 ,
       -1.0927  ,  0.88818 ,  0.89941 , -0.7541  , -0.35492 , -0.76396 ,
        0.27468 ,  0.2757  , -0.48152 , -0.41399 ,  0.64489 ,  1.148   ,
       -0.29131 , -2.9387  , -0.83162 ,  0.95586 ,  1.1623  , -0.42502 ,
        0.15486 ,  2.2326  , -0.31339 , -0.030228,  0.79802 , -0.41302 ,
        0.72885 ,  0.7296  , -0.31909 ,  0.8956  ,  0.34625 ,  0.2923  ,
        0.40056 ,  0.78985 , -0.43999 ,  0.24698 , -0.46548 ,  0.055886,
       -0.62603 , -0.036487, -0.65429 ,  0.10563 , 

In [25]:
embeddings_index['film']

array([ 0.19916 , -0.049702,  0.24579 , -0.32281 ,  0.89768 , -0.1278  ,
       -0.49506 ,  0.20814 , -0.20046 , -0.20604 ,  0.038292, -0.67277 ,
       -0.12689 , -0.18766 , -0.10277 ,  0.73128 ,  0.82408 ,  0.087288,
        0.69255 ,  1.3107  ,  0.49113 , -0.38097 ,  0.24338 , -0.27813 ,
        0.62506 ,  0.35978 ,  0.42041 , -0.24529 ,  0.14861 , -0.26726 ,
       -0.56262 ,  0.63843 , -0.54153 ,  0.36537 ,  0.20545 , -0.16604 ,
        0.72434 ,  0.29961 , -0.42501 , -0.35932 , -0.089288,  0.48752 ,
       -1.0927  ,  0.88818 ,  0.89941 , -0.7541  , -0.35492 , -0.76396 ,
        0.27468 ,  0.2757  , -0.48152 , -0.41399 ,  0.64489 ,  1.148   ,
       -0.29131 , -2.9387  , -0.83162 ,  0.95586 ,  1.1623  , -0.42502 ,
        0.15486 ,  2.2326  , -0.31339 , -0.030228,  0.79802 , -0.41302 ,
        0.72885 ,  0.7296  , -0.31909 ,  0.8956  ,  0.34625 ,  0.2923  ,
        0.40056 ,  0.78985 , -0.43999 ,  0.24698 , -0.46548 ,  0.055886,
       -0.62603 , -0.036487, -0.65429 ,  0.10563 , 

Let's do a GRU with pre-trained vectors.  Same issues as before.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(model1.history.history['loss'])
plt.plot(model2.history.history['loss'])
plt.plot(model3.history.history['loss'])
plt.legend(["RNN","LSTM","GRU"])
plt.title("Loss curves for various architectures")
#plt.text(0.0,0.45,"Fig 1: A measure of RMS loss versus Epoch for several RNN Architectures ")
plt.show()

Fig 1: Training loss (binary cross-entropy) versus epoch for 3 different architectures.  Note that 2 models are missing.

['The', 'red', 'crow', 'flies', 'north', 'at', 'dawn']