## Movie reviews - LSTM sentiment analysis

In this mini project we will implement a model for sentiment analysis: given movie reviews from IMDB, we will predict whether their sentiment is positive or negative.

In [1]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.callbacks import EarlyStopping
from keras.datasets import imdb

Using TensorFlow backend.


### Input data

We will use Keras's IMDB dataset, which provides 25,000 reviews for training and another 25,000 for testing. The reviews have been preprocessed, and each review is encoded as a sequence of word indexes. To see what a review really looks like, we have to implement a decoder that maps each index back to the actual word. According to the dataset description, the first indexes have a special meaning:
- 0: padding
- 1: start of sequence
- 2: unknown (out-of-vocabulary) word
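
To make the index offset concrete, here is a minimal sketch that uses a made-up three-word vocabulary in place of the real word index. The names `toy_word_index`, `build_id_to_word` and `decode` are illustrative only and are not part of Keras; the offset of 3 is assumed to match the `index_from` value passed to the loader.

```python
# Toy vocabulary standing in for imdb.get_word_index(); ranks start at 1.
toy_word_index = {"the": 1, "movie": 2, "great": 3}
INDEX_FROM = 3  # assumed offset that keeps ids 0..2 free for special tokens


def build_id_to_word(word_index, index_from=INDEX_FROM):
    """Shift every rank by index_from and add the reserved tokens."""
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    id_to_word.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})
    return id_to_word


def decode(encoded, id_to_word):
    """Map ids back to words, falling back to <UNK> for unmapped ids."""
    return " ".join(id_to_word.get(i, "<UNK>") for i in encoded)


toy_id_to_word = build_id_to_word(toy_word_index)
print(decode([0, 0, 1, 4, 5, 2, 6], toy_id_to_word))
# prints: <PAD> <PAD> <START> the movie <UNK> great
```

Because every rank is shifted by the offset, the most frequent word ends up at index 4 rather than 1.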

In [2]:
num_words = 6000        
max_review_len = 100    
batch_size = 24
epochs = 5
index_from = 3

def get_decoder():
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+index_from) for k,v in word_to_id.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    word_to_id["<UNUSED>"] = 3  # index 3 is reserved by the offset but never appears in the data
    id_to_word = {value:key for key, value in word_to_id.items()}
    
    return id_to_word

def print_example(index, id_to_word):
    print("class:", y_train[index])
    print("encoded word sequence:", x_train[index])
    # word_id avoids shadowing the built-in id()
    print("decoded word sequence:", ' '.join(id_to_word[word_id] for word_id in x_train[index]))

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = num_words, index_from=index_from)
decoder = get_decoder()
print_example(0, decoder)

class: 1
encoded word sequence: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
decoded word sequence: <START> 

The Keras dataset loader provides an argument for skipping the most common words. We will use it to exclude words such as "the", "is" and "a", which bring little value to our model. Let's print the most frequent words first.

In [3]:
def print_most_frequent_words(num_words, id_to_word):
    # Word ranks start at 1; encoded ids are shifted by index_from
    for rank in range(1, num_words + 1):
        print(rank, id_to_word[rank + index_from])

print_most_frequent_words(20, get_decoder())

1 the
2 and
3 a
4 of
5 to
6 is
7 br
8 in
9 it
10 i
11 this
12 that
13 was
14 as
15 for
16 with
17 movie
18 but
19 film
20 on


As a rule of thumb, let's skip roughly the 15 most frequent words (skip_top = 15); to do that we will load the dataset once again. Additionally, we will pad and truncate the samples so that they all have the length our model expects.

In [4]:
def load_data(skip_top, max_review_len):
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = num_words, index_from=index_from, skip_top = skip_top)
    #   Pad and truncate the review word sequences so they are all the same length
    x_train = sequence.pad_sequences(x_train, maxlen = max_review_len)
    x_test = sequence.pad_sequences(x_test, maxlen = max_review_len)
    
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = load_data(15, max_review_len)
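
The effect of padding and truncating can be sketched in plain Python. This is only an illustration of the behaviour, assuming the defaults of pad_sequences (zeros added at the front, long sequences cut from the front); `pad_pre` is a hypothetical helper, not the Keras implementation.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` ids of long ones."""
    padded = []
    for seq in seqs:
        trimmed = list(seq)[-maxlen:]  # drop ids from the front if too long
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded


toy_reviews = [[1, 14, 22], [1, 4, 5, 6, 7, 8, 9]]
print(pad_pre(toy_reviews, maxlen=5))
# prints: [[0, 0, 1, 14, 22], [5, 6, 7, 8, 9]]
```

Note that with this kind of truncation a long review keeps its ending and loses its beginning, including the start sign.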

### Model

Now we will implement our LSTM model. We will use the Sequential model and define 3 layers:
- Embedding layer with 64-dimensional vectors
- LSTM layer with 64 memory units and 0.3 dropout
- Dense output layer with a sigmoid activation function and a single output neuron

Because this is a binary classification problem, we will use binary_crossentropy as the loss function. For optimization we will use the Adam optimizer. To avoid overfitting we will use early stopping, monitoring the validation accuracy.

In [5]:
# Define model
model = Sequential()
model.add(Embedding(num_words, 64))
model.add(LSTM(64, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(1, activation='sigmoid'))

#   Compile
model.compile(loss='binary_crossentropy',  
            optimizer='adam',              
            metrics=['accuracy'])

#   Train
#   Note: newer Keras versions name this metric 'val_accuracy' instead of 'val_acc'
cbk_early_stopping = EarlyStopping(monitor='val_acc', patience=2, mode='max')
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
            validation_data=(x_test, y_test),
            callbacks=[cbk_early_stopping])

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1ca3d0e3dd8>
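
The early stopping rule with patience=2 can be sketched as follows. This is a simplified illustration of the idea, not the Keras callback itself; `stopping_epoch` is a hypothetical helper and the validation accuracies are made-up numbers, not results from this run.

```python
def stopping_epoch(val_scores, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("-inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation accuracies: the best value is reached in epoch 2.
print(stopping_epoch([0.80, 0.83, 0.82, 0.825, 0.81]))  # prints: 4
```

With these hypothetical scores, epochs 3 and 4 both fail to beat epoch 2, so training would stop after epoch 4. If the score improves at least once in every two epochs, all epochs are run.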

### Results

Finally, let's check the performance of our model on the test set.

In [6]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('test score:', score, ' test accuracy:', acc)

test score: 0.3944254387807846  test accuracy: 0.834800000371933
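
As a rough guide to reading these two numbers: the score is the mean binary cross-entropy and the accuracy is the share of reviews classified correctly. The sketch below shows how both could be computed from sigmoid outputs. The helper names, the 0.5 threshold and the toy labels and probabilities are assumptions for illustration, not values produced by the model above.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


labels = [1, 0, 1, 0]          # toy ground truth
probs = [0.9, 0.2, 0.4, 0.6]   # toy sigmoid outputs
print("loss:", round(binary_crossentropy(labels, probs), 4))
print("accuracy:", binary_accuracy(labels, probs))
```

In this toy example two of the four predictions fall on the wrong side of the threshold, so the accuracy is 0.5, while the loss also reflects how confident each prediction was.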
