<h1> Sentiment Analysis with RNNs </h1>
<p> Problem statement: given a set of movie reviews from IMDb, train a recurrent neural network to classify the sentiment of each review as positive or negative. </p>

In [9]:
from keras.datasets import imdb
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

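VOCAB_SIZE = 88584  # keep only word indices below this value
MAXLEN = 250        # every review is padded or truncated to this many tokens
BATCH_SIZE = 64     # not passed to model.fit below, which uses its default batch size

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=VOCAB_SIZE)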

In [2]:
#take a look at one review
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

<p> Each review is already encoded as a list of integers, one integer per word. </p>

In [3]:
# trim and pad the reviews so they are all exactly MAXLEN tokens long
train_data = sequence.pad_sequences(train_data, maxlen=MAXLEN)
test_data = sequence.pad_sequences(test_data, maxlen=MAXLEN)
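
<p> To make the padding step concrete, here is a minimal pure-Python sketch of what pad_sequences does under its default settings (padding and truncation both happen at the front of the sequence, and the pad value is 0). The helper name pad_pre is made up for illustration and is not part of Keras. </p>

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate seq so it has exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # too long: drop tokens from the front, keep the last maxlen
        seq = seq[len(seq) - maxlen:]
    # too short: fill the front with the pad value
    return [value] * (maxlen - len(seq)) + seq

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530]

print(pad_pre(short_review, 5))  # [0, 0, 1, 14, 22]
print(pad_pre(long_review, 5))   # [14, 22, 16, 43, 530]
```

<p> Because the zeros go at the front, the actual words of a short review sit at the end of the sequence, closest to the LSTM's final state. </p>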

<p> Now we create the model. We first add a word-embedding layer that learns a 32-dimensional vector for each word. Then we add an LSTM layer that reads the sequence of embeddings and summarizes the review, and feed its output into a dense layer with a sigmoid activation that classifies the review as positive or negative ($x > 0.5$ = positive, $x \le 0.5$ = negative). </p>

In [5]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 32),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 32)          2834688   
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 2,843,041
Trainable params: 2,843,041
Non-trainable params: 0
_________________________________________________________________


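<p> The embedding layer is pretty massive: it holds 2,834,688 of the model's 2,843,041 parameters. It dominates the model's size, though the LSTM, which has to step through all 250 timesteps in order, can still account for a large share of the training time. </p>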
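
<p> The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias); the names EMBED_DIM and LSTM_UNITS are introduced here for illustration only. </p>

```python
VOCAB_SIZE = 88584
EMBED_DIM = 32    # output size of the Embedding layer
LSTM_UNITS = 32   # number of units in the LSTM layer

# one EMBED_DIM-sized vector per word index
embedding_params = VOCAB_SIZE * EMBED_DIM

# 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (EMBED_DIM * LSTM_UNITS + LSTM_UNITS * LSTM_UNITS + LSTM_UNITS)

# one weight per LSTM unit plus a single bias
dense_params = LSTM_UNITS * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 2834688 8320 33 2843041
print(f"{embedding_params / total:.1%} of the parameters are in the embedding")
# 99.7% of the parameters are in the embedding
```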

In [6]:
model.compile(
    loss='binary_crossentropy',
    optimizer='rmsprop',
    metrics=['acc']
)
history=model.fit(train_data, train_labels, 
                  epochs=10, validation_split=0.2)



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<p> We use binary cross-entropy as the loss because the model outputs a single probability for a two-class problem. Setting validation_split=0.2 means that 20% of our training data is held out to validate the model during training instead of being trained on. </p>
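
<p> As a rough illustration of what the loss measures, here is binary cross-entropy written out in plain Python. This is a sketch of the formula, not Keras' implementation; the eps clipping is an assumption added to avoid taking log(0), and the labels and predictions are made-up values. </p>

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]                 # 1 = positive, 0 = negative
confident = [0.9, 0.1, 0.8, 0.2]      # mostly right, and confident
unsure = [0.5, 0.5, 0.5, 0.5]         # always answers "50/50"

print(binary_crossentropy(labels, confident))  # roughly 0.164
print(binary_crossentropy(labels, unsure))     # ln(2), roughly 0.693
```

<p> A model that always answers 0.5 scores about 0.693, so a test loss below that value suggests the model is doing better than guessing. </p>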

In [7]:
results = model.evaluate(test_data, test_labels)
print(results)

[0.47008123993873596, 0.8499600291252136]


<p> To make a prediction on new text, we write a function that encodes it in the same integer format as the training data: </p>

In [12]:
word_index = imdb.get_word_index()
reverse_word_index = {value: key for (key, value) in word_index.items()}

def encode_text(text):
    # lowercase the text, strip punctuation and split it into words
    tokens = keras.preprocessing.text.text_to_word_sequence(text)
    # look up each word's index; unknown words become 0
    tokens = [word_index.get(word, 0) for word in tokens]
    return sequence.pad_sequences([tokens], maxlen=MAXLEN)[0]

def decode_integers(integers):
    PAD = 0
    # skip the padding and join the remaining words with spaces
    return " ".join(reverse_word_index[num] for num in integers if num != PAD)

text="I hated the movie, it really sucked."
encoded=encode_text(text)
print(encoded)
decoded=decode_integers(encoded)
print(decoded)

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0 
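
<p> One caveat worth checking: under its default arguments, imdb.load_data reserves 0 for padding, 1 for the start of a review and 2 for out-of-vocabulary words, and shifts every word index by 3 (index_from=3). That is why the review printed earlier begins with 1. The indices returned by imdb.get_word_index() are not shifted, so a direct lookup lines up with the training data only after adding the same offset. The sketch below shows one way to do that. It uses a tiny stand-in dictionary instead of the real word index, and the function names and the ranks in toy_word_index are illustrative assumptions. </p>

```python
PAD, START, OOV = 0, 1, 2   # reserved ids under load_data's defaults
INDEX_FROM = 3              # offset load_data adds to every word index

# Toy stand-in for imdb.get_word_index(); the ranks are illustrative.
toy_word_index = {"the": 1, "movie": 17}

def encode_with_offset(tokens, word_index, vocab_size, maxlen):
    ids = [START]
    for word in tokens:
        rank = word_index.get(word)
        idx = rank + INDEX_FROM if rank is not None else OOV
        # anything outside the vocabulary is treated as unknown
        ids.append(idx if idx < vocab_size else OOV)
    ids = ids[-maxlen:]                       # truncate at the front
    return [PAD] * (maxlen - len(ids)) + ids  # pad at the front

def decode_with_offset(ids, word_index):
    reverse = {rank + INDEX_FROM: word for word, rank in word_index.items()}
    specials = {START: "<start>", OOV: "<unk>"}
    return " ".join(reverse.get(i, specials.get(i, "?")) for i in ids if i != PAD)

encoded = encode_with_offset(["the", "movie", "sucked"], toy_word_index,
                             vocab_size=100, maxlen=6)
print(encoded)                                      # [0, 0, 1, 4, 20, 2]
print(decode_with_offset(encoded, toy_word_index))  # <start> the movie <unk>
```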

In [21]:
#now make a prediction!

def predict(text):
    # encode the text as a padded sequence of word indices
    encoded_text = encode_text(text)
    # add a batch dimension: the model expects shape (batch, MAXLEN)
    data = np.zeros((1, MAXLEN))
    data[0] = encoded_text
    # make the prediction
    result = model.predict(data)
    print(result[0])
    if result[0] > 0.5:
        print("Positive review")
    else:
        print("Negative review")
    
positive_text="This movie was awesome dude. I love this movie. I would watch this movie again many times because I loved it so much."
negative_text="that movie sucked. I hated it and wouldn't watch it again. Was one of the worst things i've ever watched."
predict(positive_text)
predict(negative_text)

[0.97563255]
Positive review
[0.3720082]
Negative review
