Train and evaluate a Multi-Layer Perceptron on movie reviews.

![](images/pos_neg_sentiment.png)

Adapted from: 
- https://github.com/fchollet/keras/blob/master/examples/reuters_mlp.py
- http://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/

In [64]:
reset -fs

In [65]:
from keras.datasets import imdb

The dataset is the Large Movie Review Dataset, often referred to as the IMDB dataset.

![](https://kaggle2.blob.core.windows.net/competitions/inclass/4996/media/Screen%20Shot%202016-02-23%20at%2010.56.44%20AM.png)

It contains 25,000 highly polar movie reviews from IMDB, labeled by sentiment (positive/negative), for training and the same number again for testing.

Learn more: https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification

In [67]:
print('Loading data...')
num_words = 1_000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_words,
                                                      skip_top=25)

print('Data loaded.')

print(f'{len(x_train):,} train sequences')
print(f'{len(x_test):,} test sequences')

Loading data...
Data loaded.
25,000 train sequences
25,000 test sequences

x_train shape: (25000,)
x_test shape: (25000,)


The following table shows the first few rows of the training dataset:
![](https://sandipanweb.files.wordpress.com/2017/03/im33.png)

Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). 

For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. 

This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

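As a convention, "0" does not stand for a specific word. The Keras documentation describes it as the code for any unknown word; note, however, that `imdb.load_data` replaces filtered-out words with its `oov_char` argument (2 by default), and `pad_sequences` uses 0 for padding.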
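
The frequency-rank encoding and the "top N minus top M" filter described above can be sketched on a tiny corpus. This is a minimal illustration, not the code Keras runs: the toy `reviews`, the `encode` helper, the choice of 0 as the placeholder, and the exact boundary handling are all assumptions made for the example, and `imdb.load_data` may treat offsets and boundaries differently.

```python
from collections import Counter

# Hypothetical toy corpus for illustration; this is not the IMDB data.
reviews = [
    "the movie was great",
    "the movie was bad",
    "the plot was great fun",
]

# Rank words by overall frequency: rank 1 is the most frequent word.
counts = Counter(word for review in reviews for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(review, num_words, skip_top, oov=0):
    """Encode a review as ranks, keeping only ranks in (skip_top, num_words]."""
    return [
        word_index[w] if skip_top < word_index[w] <= num_words else oov
        for w in review.split()
    ]


# Keep the 5 most common words, but drop the 2 most common of those.
encoded = encode(reviews[0], num_words=5, skip_top=2)
print(encoded)
```

Very frequent words such as "the" carry little sentiment, which is why dropping the top few ranks can be useful.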

The problem is to determine whether a given movie review has a positive or negative sentiment.

In [70]:
print("Number of categories: ", len(set(y_train)))

Number of categories:  2


We will bound reviews at 500 words, truncating longer reviews and zero-padding shorter reviews.

In [71]:
from keras.preprocessing import sequence

In [72]:
max_words = 500
x_train = sequence.pad_sequences(x_train, maxlen=max_words)
x_test = sequence.pad_sequences(x_test, maxlen=max_words)

In [74]:
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000, 500)
x_test shape: (25000, 500)
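
To make the truncating and zero-padding concrete, here is a pure-Python sketch of the same idea on a single sequence. The helper name `pad_or_truncate` is hypothetical; it pads and truncates at the front, which is my understanding of the `pad_sequences` defaults (`padding='pre'`, `truncating='pre'`), and it assumes `maxlen` is positive.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Return a list of exactly `maxlen` items (maxlen must be > 0)."""
    seq = list(seq)[-maxlen:]                    # too long: keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq   # too short: pad with `value` at the front


print(pad_or_truncate([1, 2, 3], maxlen=5))               # short review gets leading zeros
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # long review loses its first words
```

After this step every review has the same length, which is what lets the whole dataset be stored as one 2-D array.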


In [80]:
from keras.models import Sequential
from keras.layers import Embedding  # importable from keras.layers in old and new Keras versions
from keras.layers import Dense, Dropout, Activation, Flatten

In [93]:
print('Building model...')
model = Sequential() # A linear stack of layers
model.add(Embedding(input_dim=num_words, # Input layer turns positive integers (indexes) into dense vectors of fixed size
                    output_dim=32,
                    input_length=max_words))
model.add(Flatten()) # Flatten each (500, 32) review into one 16,000-long vector for the Dense layer
model.add(Activation('relu')) # Add non-linearity
model.add(Dropout(0.5)) # Randomly zero half of the activations during training to reduce overfitting
model.add(Dense(1)) # Fully connected layer
model.add(Activation('sigmoid')) # Output is the probability of one of the two categories
print('Model built')

Building model...
Model built


In [94]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 500, 32)           32000     
_________________________________________________________________
flatten_5 (Flatten)          (None, 16000)             0         
_________________________________________________________________
activation_13 (Activation)   (None, 16000)             0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 16000)             0         
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 16001     
_________________________________________________________________
activation_14 (Activation)   (None, 1)                 0         
Total params: 48,001.0
Trainable params: 48,001.0
Non-trainable params: 0.0
_________________________________________________________________
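
The parameter counts in the summary can be checked by hand. A short sketch of the arithmetic, using the values set earlier in this notebook (the variable name `embed_dim` is introduced here only for the example):

```python
num_words = 1_000   # vocabulary size (Embedding input_dim)
embed_dim = 32      # Embedding output_dim
max_words = 500     # padded review length

embedding_params = num_words * embed_dim   # one 32-dimensional vector per word index
flattened_size = max_words * embed_dim     # 500 x 32 values per review after Flatten
dense_params = flattened_size * 1 + 1      # one weight per input plus one bias
total_params = embedding_params + dense_params

print(f'Embedding: {embedding_params:,}')
print(f'Flatten output size: {flattened_size:,}')
print(f'Dense: {dense_params:,}')
print(f'Total: {total_params:,}')
```

Flatten, Activation, and Dropout have no weights, so only the Embedding and Dense layers contribute to the total.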


In [95]:
model.compile(loss='binary_crossentropy', # Loss measures how wrong you are; want this to be small
              optimizer='adam',           # How to search the space for best weights; Adam adapts over time
              metrics=['accuracy'])       # How to measure success
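
To see what "how wrong you are" means for this loss, here is a pure-Python sketch of binary cross-entropy on made-up predictions. The function and the clipping constant `eps` are illustrative assumptions; Keras computes the loss on tensors and its implementation may differ in detail.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all samples."""
    losses = []
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clip so log(0) never occurs
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)


labels = [1, 0]
good = binary_crossentropy(labels, [0.9, 0.1])   # confident and correct
bad = binary_crossentropy(labels, [0.1, 0.9])    # confident and wrong
print(f'good: {good:.3f}, bad: {bad:.3f}')
```

Confident wrong predictions are penalized much more heavily than hesitant ones, which pushes the model toward well-calibrated probabilities.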

In [96]:
history = model.fit(x_train, y_train,
                    batch_size=32,
                    epochs=5,
                    verbose=True,
                    validation_split=0.1)

Train on 22500 samples, validate on 2500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
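
The "22500 / 2500" figures follow directly from `validation_split=0.1`. A minimal sketch of the arithmetic, assuming (as I understand the Keras documentation) that the validation samples are taken from the end of the training arrays before any shuffling; the variable names are made up for the example:

```python
n_samples = 25_000        # number of training reviews
validation_split = 0.1    # fraction passed to model.fit

split_at = int(n_samples * (1 - validation_split))
indices = list(range(n_samples))
train_idx, val_idx = indices[:split_at], indices[split_at:]

print(f'Train on {len(train_idx)} samples, validate on {len(val_idx)} samples')
```

Because the held-out samples are never used for weight updates, validation accuracy gives an early hint of overfitting during training.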


In [97]:
score = model.evaluate(x_test, 
                       y_test,
                       batch_size=32, 
                       verbose=True)

Test accuracy: 0.86148


In [101]:
print(f'Test accuracy: {score[1]:.2%}')

Test accuracy: 86.15%
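
The accuracy figure counts how often the thresholded sigmoid output matches the label. A small sketch on made-up probabilities; the helper name and the 0.5 threshold are assumptions for the example, chosen to match my understanding of how Keras scores binary outputs:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == t) for pred, t in zip(preds, y_true))
    return correct / len(y_true)


# Hypothetical labels and predicted probabilities, not model output.
acc = binary_accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
print(f'Accuracy: {acc:.2%}')
```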


This data was collected by Stanford researchers and used in a 2011 paper, in which the data was split 50/50 between training and test sets. That paper reported an accuracy of 88.89%.

The data was also the basis for a Kaggle competition titled [“Bag of Words Meets Bags of Popcorn”](https://www.kaggle.com/c/word2vec-nlp-tutorial), which ran from late 2014 to early 2015. Entries achieved accuracy above 97%, with the winners reaching 99%.


----