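The IMDB movie review dataset that ships with Keras contains 25,000 training reviews and 25,000 test reviews, each already preprocessed and labeled as either positive or negative. Each review is encoded as a list of integers that rank words by how common they are in the entire dataset, so smaller integers correspond to more frequent words. Note that imdb.load_data reserves the lowest indices by default (0 for padding, 1 for the start of a review, 2 for unknown words), so the word ranks are shifted up by 3 in the loaded data.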
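
To make the idea of a frequency-rank encoding concrete, here is a minimal sketch on a toy corpus. The corpus and the variable names are made up for illustration, and the sketch ignores the reserved indices that Keras adds; the real dataset ships already encoded.

```python
from collections import Counter

# Toy corpus (hypothetical); chosen so that every word has a distinct count.
corpus = ["the movie was great", "the movie was", "the movie", "the"]

# Count how often each word occurs across all reviews.
counts = Counter(word for review in corpus for word in review.split())

# Rank 1 is the most frequent word, rank 2 the next, and so on.
rank = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Replace every word with its rank.
encoded = [[rank[word] for word in review.split()] for review in corpus]

print(rank)        # {'the': 1, 'movie': 2, 'was': 3, 'great': 4}
print(encoded[0])  # [1, 2, 3, 4]
```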

In [17]:
from keras.datasets import imdb
import keras.preprocessing
from keras.preprocessing import sequence
import tensorflow as tf
import os
import numpy as np

In [4]:
VOCAB_SIZE = 88584  # number of distinct words in the dataset

MAXLEN = 250
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [5]:
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

In [6]:
len(train_data[0])

218

In [7]:
len(train_data[2])

141

More Preprocessing
- Notice that the reviews have different lengths. That is a problem because we cannot pass variable-length data into our neural network. Therefore, we make every review the same length using the following steps:
     - if a review is longer than 250 words, trim off the extra words
     - if a review is shorter than 250 words, add as many 0s as needed to bring it to 250 (pad_sequences adds them at the front by default)
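
The two rules above can be sketched in plain Python. The helper name pad_or_trim is hypothetical; it only mimics what pad_sequences does with its default settings (padding and trimming at the front of the sequence) and is not the library's implementation.

```python
def pad_or_trim(tokens, maxlen, pad_value=0):
    """Return a list of exactly maxlen items, padding or trimming at the front."""
    if len(tokens) >= maxlen:
        # Too long: keep only the last maxlen tokens.
        return tokens[len(tokens) - maxlen:]
    # Too short: prepend as many pad values as needed.
    return [pad_value] * (maxlen - len(tokens)) + tokens

print(pad_or_trim([5, 6, 7], 5))              # [0, 0, 5, 6, 7]
print(pad_or_trim([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```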

In [8]:
train_data = sequence.pad_sequences(train_data, MAXLEN)
test_data = sequence.pad_sequences(test_data, MAXLEN)

In [9]:
len(train_data[2])

250

In [10]:
train_data[2]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     1,    14,    47,     8,    30,    31,     7,     4,
         249,   108,     7,     4,  5974,    54,    61,   369,    13,
          71,   149,

Creating Model
- The model consists of an Embedding layer, an LSTM layer, and a single Dense node that outputs the predicted sentiment
- 32 is the dimension of the vectors that the embedding layer outputs. We can change this value as we wish

In [11]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),  # maps each word index to a 32-dimensional vector
    tf.keras.layers.LSTM(32),  # 32 units; reads the sequence of word vectors and returns its final 32-dimensional state
    tf.keras.layers.Dense(1, activation='sigmoid')  # sigmoid squashes the output to between 0 and 1, so it can be read as negative vs. positive
])

In [12]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 32)          2834688   
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________
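
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and one bias per unit; the variable names are illustrative.

```python
vocab_size = 88584  # VOCAB_SIZE from the notebook
embed_dim = 32      # output dimension of the Embedding layer
units = 32          # number of LSTM units

# One embed_dim-sized vector per word in the vocabulary.
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights, and a bias per unit.
lstm_params = 4 * (embed_dim * units + units * units + units)

# One weight per LSTM output plus a single bias.
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2834688 8320 33 2843041
```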


Compile and Train

In [14]:
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['acc'])
history = model.fit(train_data, train_labels, epochs=5, validation_split=0.2)  # train for 5 epochs, holding out 20% of the training data for validation

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [15]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.37397491931915283, 0.8601999878883362]
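
The two numbers returned by evaluate are the loss followed by the metric passed to compile, so here they are the binary cross-entropy and the accuracy on the test set. As a rough illustration of what those quantities measure, here is a plain-Python sketch on made-up labels and probabilities. The function names, the clipping constant, and the 0.5 threshold are assumptions for illustration, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

print(binary_crossentropy(labels, probs))
print(accuracy(labels, probs))  # 0.5
```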


Making Predictions
- Since the training reviews are encoded as integers, any review we write must be converted into the same form before the network can process it
- To do that, we load the word index from the dataset and use it to encode our own text

In [22]:
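word_index = imdb.get_word_index()  # maps each word to its frequency rank

def encode_text(text):
    # Lowercase, strip punctuation, and split the text into words
    tokens = keras.preprocessing.text.text_to_word_sequence(text)
    # Look up each word's index; unknown words become 0.
    # Caveat: imdb.load_data shifts word indices by 3 by default (index_from=3),
    # while get_word_index() returns unshifted ranks, so these ids do not line
    # up exactly with the ones the model was trained on.
    tokens = [word_index.get(word, 0) for word in tokens]
    # Pad (or trim) to MAXLEN; the result has shape (1, MAXLEN)
    return sequence.pad_sequences([tokens], MAXLEN)

text = "That movie was just amazing, so amazing"
encoded = encode_text(text)
print(encoded)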

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0 
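
One detail worth keeping in mind when encoding new text: with its default arguments, imdb.load_data marks the start of each review with 1, uses 2 for out-of-vocabulary words, and shifts every word rank up by 3, whereas imdb.get_word_index() returns the unshifted ranks. Below is a minimal sketch of an encoder that follows those defaults. The helper encode_with_offset and the toy word index are hypothetical stand-ins, and the simple lowercase-and-split tokenizer is only a placeholder.

```python
INDEX_FROM = 3  # default offset applied by imdb.load_data
START_ID = 1    # default start-of-review marker
OOV_ID = 2      # default id for out-of-vocabulary words

# Hypothetical stand-in for imdb.get_word_index(); the real ranks differ.
toy_word_index = {"the": 1, "movie": 2, "amazing": 3}

def encode_with_offset(text, word_index):
    """Encode text using the same id layout as the loaded training data."""
    tokens = text.lower().split()
    ids = [word_index[w] + INDEX_FROM if w in word_index else OOV_ID
           for w in tokens]
    return [START_ID] + ids

print(encode_with_offset("the movie was amazing", toy_word_index))
# [1, 4, 5, 2, 6]
```

The resulting list would still need to be padded to MAXLEN before being passed to the model.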

In [27]:
def predict(text):
    encoded_text = encode_text(text)  # array of shape (1, MAXLEN)
    result = model.predict(encoded_text)
    print(result[0])  # sigmoid output: closer to 1 means more positive

pos_review = 'That movie was so awesome. I really loved it and will watch it again because it was amazingly great'
predict(pos_review)

neg_review = "that movie sucked. I hated it and wouldn't watch it again. Was one of the worst things I have ever watched"
predict(neg_review)

[0.6518825]
[0.5604129]
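
To turn these scores into labels, a common convention is to treat anything above 0.5 as positive. The notebook does not state a threshold, so the 0.5 cutoff and the helper name below are assumptions. Under that assumption both reviews are labeled positive, including the negative one, which suggests the scores for hand-written reviews should be read with caution.

```python
def label_from_score(score, threshold=0.5):
    """Map a sigmoid output to a sentiment label."""
    return "positive" if score > threshold else "negative"

# Scores printed by predict() above.
for score in (0.6518825, 0.5604129):
    print(score, label_from_score(score))
# 0.6518825 positive
# 0.5604129 positive
```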
