# Sentiment analysis from movie reviews

More info on the dataset is here: https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification

### Using an RNN (Recurrent Neural Network) to do sentiment analysis on full-text movie reviews!
We'll train an artificial neural network to "read" movie reviews and guess from them whether the author liked the movie or not.

Since understanding written language requires keeping track of all the words in a sentence, we need a recurrent neural network to keep a "memory" of the words that have come before as it "reads" sentences over time.

We'll use LSTM (Long Short-Term Memory) cells because we don't really want to "forget" words too quickly - words early on in a sentence can affect the meaning of that sentence significantly.
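
As a rough illustration of that "memory", here is a minimal sketch of a single LSTM step using scalars instead of vectors. The weights are hypothetical and chosen by hand (a real layer learns weight matrices during training); the point is only to show how a forget gate close to 1 lets the cell state survive many time steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell. `w` maps a name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # the new information itself
    c = f * c_prev + i * g           # updated cell state (the "memory")
    h = o * math.tanh(c)             # updated hidden state (the output)
    return h, c

# Hypothetical hand-picked weights: forget gate biased open, input gate biased shut.
weights = {
    "forget": (0.0, 0.0, 5.0),
    "input": (0.0, 0.0, -5.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for x in [0.3, -1.2, 0.8, 0.1, -0.5]:
    h, c = lstm_step(x, h, c, weights)

# After five steps, c is still close to its initial value of 1.0.
print(c)
```

With the forget gate near 1, the cell state decays only slightly per step, which is the behavior we want for words early in a sentence.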

In [1]:
# Importing the libraries
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

Now import our training and testing data. We specify that we only care about the 20,000 most frequent words in the dataset in order to keep things somewhat manageable. The dataset includes 25,000 training reviews and 25,000 testing reviews.

In [2]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)

Loading data...


In [3]:
x_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,

That doesn't look like a movie review! But this data set has spared you a lot of trouble - its authors have already converted words to integer-based indices. The actual letters that make up a word don't really matter as far as our model is concerned. What matters are the words themselves - and our model needs numbers to work with, not letters.

Each number in the training features represents some specific word. It's a bummer that we can't read the reviews directly in English, in the form they are stored here, as a gut check to see if sentiment analysis is really working.
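
That said, the encoding can be reversed. Keras ships the vocabulary via imdb.get_word_index(), which maps each word to its frequency rank. The sketch below assumes the default load_data settings, where ranks are shifted by 3 and the indices 0, 1 and 2 are reserved for padding, start-of-review and out-of-vocabulary words. The helper name decode_review and the toy vocabulary are made up for illustration.

```python
def decode_review(encoded, word_index, index_from=3):
    """Turn a list of integer indices back into a space-separated string."""
    # Shift every rank by index_from to match how load_data encodes reviews.
    reverse = {rank + index_from: word for word, rank in word_index.items()}
    reverse[0] = "<PAD>"
    reverse[1] = "<START>"
    reverse[2] = "<UNK>"
    return " ".join(reverse.get(i, "<UNK>") for i in encoded)

# Toy vocabulary (hypothetical) so the example runs without downloading anything.
toy_index = {"the": 1, "and": 2, "movie": 3, "great": 4}
print(decode_review([1, 4, 6, 2, 7], toy_index))

# With the real data (assuming the default load_data offsets):
# word_index = imdb.get_word_index()
# print(decode_review(x_train[0], word_index))
```

Any word outside the 20,000 most frequent ones shows up as the unknown marker, because load_data has already replaced it with index 2.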

In [4]:
y_train[0]

1

The labels are just 0 or 1, indicating whether the reviewer disliked (0) or liked (1) the movie.

So to recap, we have a bunch of movie reviews that have been converted into vectors of words represented by integers, and a binary sentiment classification to learn from.

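RNNs can blow up quickly, so again, to keep things manageable on our little PC, let's limit each review to 80 words. Note that pad_sequences truncates from the front by default, so longer reviews keep their last 80 words, while shorter ones are padded with zeros at the start.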

In [5]:
x_train = sequence.pad_sequences(x_train, maxlen=80)
x_test = sequence.pad_sequences(x_test, maxlen=80)
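
To make the effect of that call concrete, here is a pure-Python sketch of what pad_sequences does to a single review with its default settings (padding and truncating both set to 'pre'). It is a simplified stand-in for illustration, not the Keras implementation, and the name pad_or_truncate is made up.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation for one sequence."""
    seq = list(seq)[-maxlen:]                   # keep only the LAST maxlen items
    return [value] * (maxlen - len(seq)) + seq  # left-pad short sequences

print(pad_or_truncate([1, 2, 3, 4, 5], maxlen=3))  # long review: front is cut off
print(pad_or_truncate([1, 2], maxlen=4))           # short review: zeros in front
```

To keep the beginning of each review instead, pad_sequences accepts truncating='post'.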

Set up our neural network model! It's really amazing how easy it is to build an LSTM with Keras.

In [6]:
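model = Sequential()
# Map each of the 20,000 word indices to a 128-dimensional vector
model.add(Embedding(20000, 128))
# 128 LSTM units, with dropout on the inputs and on the recurrent connections
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
# Single sigmoid output: the probability that the review is positive
model.add(Dense(1, activation='sigmoid'))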
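
For a sense of how big this "few lines" model is, the parameter count can be worked out by hand. This is back-of-the-envelope arithmetic assuming the standard LSTM formulation (three gates plus the candidate, each with input weights, recurrent weights and a bias); model.summary() is the authoritative source.

```python
vocab_size, embed_dim, units = 20000, 128, 128

# One 128-dimensional vector per word index
embedding_params = vocab_size * embed_dim

# Four weight sets (forget, input, output, candidate), each with
# input weights, recurrent weights and one bias per unit
lstm_params = 4 * (units * (embed_dim + units) + units)

# One weight per LSTM unit plus a single bias
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# → 2560000 131584 129 2691713
```

Almost all of the parameters sit in the embedding layer, which is one reason the vocabulary was capped at 20,000 words.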

As this is a binary classification problem, we'll use the binary_crossentropy loss function.

In [7]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
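
As a small worked example of what that loss measures, binary cross-entropy for a single prediction is -(y·log(p) + (1-y)·log(1-p)), where y is the true label and p the predicted probability of a positive review. The function below is a plain-Python sketch for one sample; the clipping constant eps is an assumption added here to avoid log(0), and Keras averages this quantity over each batch.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy for one sample (illustrative sketch)."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

confident_right = binary_crossentropy(1, 0.9)  # about 0.105
confident_wrong = binary_crossentropy(1, 0.1)  # about 2.303
print(confident_right, confident_wrong)
```

A confident wrong answer is penalized far more heavily than a confident right answer is rewarded, so a handful of overconfident mistakes can push the loss up even when accuracy barely changes.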

### Warning
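Now we will actually train our model. This will take a long time to run, even on a fast PC! In the run shown below, each epoch took about three minutes, so ten epochs come to roughly half an hour.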

In [8]:
model.fit(x_train, y_train, batch_size=32,
          epochs=10, verbose=2, validation_data=(x_test, y_test))

Epoch 1/10
782/782 - 188s - loss: 0.4299 - accuracy: 0.7974 - val_loss: 0.3762 - val_accuracy: 0.8362
Epoch 2/10
782/782 - 187s - loss: 0.2574 - accuracy: 0.8974 - val_loss: 0.4649 - val_accuracy: 0.8246
Epoch 3/10
782/782 - 187s - loss: 0.1657 - accuracy: 0.9360 - val_loss: 0.4724 - val_accuracy: 0.8287
Epoch 4/10
782/782 - 180s - loss: 0.1138 - accuracy: 0.9580 - val_loss: 0.5053 - val_accuracy: 0.8228
Epoch 5/10
782/782 - 181s - loss: 0.0698 - accuracy: 0.9756 - val_loss: 0.6531 - val_accuracy: 0.8213
Epoch 6/10
782/782 - 181s - loss: 0.0666 - accuracy: 0.9763 - val_loss: 0.8569 - val_accuracy: 0.8160
Epoch 7/10
782/782 - 198s - loss: 0.0354 - accuracy: 0.9886 - val_loss: 0.8538 - val_accuracy: 0.8054
Epoch 8/10
782/782 - 185s - loss: 0.0247 - accuracy: 0.9917 - val_loss: 0.9212 - val_accuracy: 0.8205
Epoch 9/10
782/782 - 190s - loss: 0.0229 - accuracy: 0.9928 - val_loss: 1.1809 - val_accuracy: 0.8031
Epoch 10/10
782/782 - 188s - loss: 0.0223 - accuracy: 0.9930 - val_loss: 1.0194 - 

<tensorflow.python.keras.callbacks.History at 0x7fb1182e84c0>

In [10]:
score, acc = model.evaluate(x_test, y_test, batch_size=32, verbose=2)
print('Test score:', score)
print('Test accuracy:', acc)

782/782 - 17s - loss: 1.0194 - accuracy: 0.8111
Test score: 1.0194175243377686
Test accuracy: 0.8111199736595154


Considering we limited ourselves to just 80 words of each review, 81% accuracy is not too bad.

We now have a neural network that can "read" reviews and deduce whether the author liked the movie or not based on that text. It takes the context of each word and its position in the review into account - and setting up the model itself was just a few lines of code!
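
Looking back at the training log, the training loss keeps falling while the validation loss is lowest early on and then climbs, which is a typical sign of overfitting. The sketch below finds the best epoch from the logged values, which were copied by hand from the output above. It is an after-the-fact check, not part of the original notebook.

```python
# Loss values copied from the training log above, one entry per epoch
train_loss = [0.4299, 0.2574, 0.1657, 0.1138, 0.0698,
              0.0666, 0.0354, 0.0247, 0.0229, 0.0223]
val_loss = [0.3762, 0.4649, 0.4724, 0.5053, 0.6531,
            0.8569, 0.8538, 0.9212, 1.1809, 1.0194]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between validation and training loss at the end of training
final_gap = val_loss[-1] - train_loss[-1]

print("Best epoch by validation loss:", best_epoch)
print("Final validation/training loss gap:", round(final_gap, 4))
```

In this run, training for fewer epochs would likely have done at least as well. Keras offers an EarlyStopping callback (for example with monitor='val_loss' and restore_best_weights=True) that stops training automatically when the monitored value stops improving. Ideally the data it monitors is held out from the training set rather than being the test set itself.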