# IMDB Reviews Sentiment Classification

We start by importing the required packages, which must be installed first:  
  
`pip install pandas`  
`pip install numpy`  
`pip install nltk`  
  
To install TensorFlow on CPU:  
`pip install tensorflow`  
To install TensorFlow on GPU:  
`pip install tensorflow-gpu`

In [2]:
import pandas as pd
import numpy as np
import re
import html

# Use the public tensorflow.keras API rather than the private tensorflow.python path
from tensorflow.keras.layers import Dense, LSTM, BatchNormalization, Embedding, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from nltk.stem import SnowballStemmer


### Load and clean the data

Read the data from the CSV files

In [3]:
train = pd.read_csv('datasets/train.csv')
test = pd.read_csv('datasets/test.csv')

In [3]:
train['dataset'] = "train"
test['dataset'] = "test"

In [4]:
train.head()

Unnamed: 0,id,labels,text
0,2592,0,Un-bleeping-believable! Meg Ryan doesn't even ...
1,18359,1,This is a extremely well-made film. The acting...
2,1040,0,Every once in a long while a movie will come a...
3,17262,1,Name just says it all. I watched this movie wi...
4,9908,0,This movie succeeds at being one of the most u...


In [4]:
train['labels'].head()

0    0
1    1
2    0
3    1
4    0
Name: labels, dtype: int64

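Split the labelled data into training and validation sets (the first 20,000 reviews for training, the remainder for validation) and keep the test reviews separate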

In [5]:
trn_y = np.eye(2)[train['labels'][:20000]] # One-hot encode the labels
val_y = np.eye(2)[train['labels'][20000:]] # One-hot encode the labels
trn_txt = train.text[:20000]
val_txt = train.text[20000:]
tst_txt = test.text
texts = np.hstack([trn_txt, val_txt, tst_txt]).tolist()
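
Indexing the identity matrix `np.eye(2)` with an integer label selects the matching row, which is the one-hot encoding used above. A minimal sketch, assuming a small made-up label array (the values are illustrative and not taken from the dataset):

```python
import numpy as np

# Hypothetical labels for illustration: 0 = negative, 1 = positive
labels = np.array([0, 1, 1, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i
one_hot = np.eye(2)[labels]
print(one_hot)
```

Column 0 then holds the "negative" indicator and column 1 the "positive" indicator, which is why `preds[:,1]` is read as the positive-sentiment probability later on.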

Function for cleaning text and performing stemming

In [6]:
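def stem(x):
    re1 = re.compile(r'  +')  # matches runs of two or more spaces
    stemmer = SnowballStemmer('english')
    # Stem each space-separated word (the stemmer also lowercases it)
    x = ' '.join([stemmer.stem(word) for word in str(x).split(' ')])
    # Repair leftover HTML entities, line breaks and escape sequences
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    # Decode any remaining HTML entities and collapse repeated spaces
    return re1.sub(' ', html.unescape(x))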
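
To see the cleaning step in isolation, here is a simplified sketch that leaves out the stemmer. The helper name `clean_only` and the sample string are made up for illustration, and the sketch covers only a few of the replacements handled by `stem`:

```python
import html
import re

MULTI_SPACE = re.compile(r'  +')  # two or more consecutive spaces

def clean_only(text):
    """Illustrative helper (not part of the notebook): cleaning without stemming."""
    text = text.replace('<br />', '\n')   # reviews contain HTML line breaks
    text = html.unescape(text)            # e.g. '&amp;' -> '&', '&#39;' -> "'"
    return MULTI_SPACE.sub(' ', text)     # collapse runs of spaces

sample = "Tom &amp; Jerry  is   great.<br /><br />I can&#39;t wait!"
print(repr(clean_only(sample)))
```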

Original text

In [7]:
texts[1]

'This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is some merit in this view, but it\'s also true that no one forced Hindus and Muslims in the region to mistreat each other as they did around the time of partition. It seems more likely that the British simply saw the tensions between the religions and were clever enough to exploit them to their own ends.<br /><br />The result is that there is much cruelty and inhumanity in the situation and this is very u

In [8]:
summaries = [stem(txt) for txt in texts]

Text after stemming

In [9]:
summaries[1]

'this is a extrem well-mad film. the acting, script and camera-work are all first-rate. the music is good, too, though it is most earli in the film, when thing are still relat cheery. there are no realli superstar in the cast, though sever face will be familiar. the entir cast doe an excel job with the script.\n\nbut it is hard to watch, becaus there is no good end to a situat like the one presented. it is now fashion to blame the british for set hindus and muslim against each other, and then cruelli separ them into two countries. there is some merit in this view, but it also true that no one forc hindus and muslim in the region to mistreat each other as they did around the time of partition. it seem more like that the british simpli saw the tension between the religion and were clever enough to exploit them to their own ends.\n\nthe result is that there is much cruelti and inhuman in the situat and this is veri unpleas to rememb and to see on the screen. but it is never paint as a bla

Map each word to an integer index, keeping only the most frequent words (limited by `n_words`), and apply the tokenizer to each dataset. For more information on text preprocessing in TensorFlow/Keras, see:  
https://keras.io/preprocessing/text/

In [10]:
n_words = 5000
t = Tokenizer(n_words)
t.fit_on_texts(summaries)

In [55]:
trn_seq = t.texts_to_sequences([stem(txt) for txt in trn_txt])
val_seq = t.texts_to_sequences([stem(txt) for txt in val_txt])
tst_seq = t.texts_to_sequences([stem(txt) for txt in tst_txt])
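
The idea behind the tokenizer can be sketched in plain Python: words are ranked by frequency, the most frequent word gets index 1, and words whose index is not below the vocabulary limit are dropped. This is a simplified stand-in, assuming whitespace splitting only (the Keras `Tokenizer` also filters punctuation); the function names and toy texts are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_texts = ["good good good film", "bad film"]   # made-up examples
word_index = fit_word_index(toy_texts)
sequences = to_sequences(toy_texts, word_index, num_words=3)
print(word_index)
print(sequences)
```

With `num_words=3` only indices 1 and 2 survive, so the rare word "bad" disappears from the second sequence.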

Pad or truncate every review to a fixed number of words. We prepare three versions, with 500, 400 and 300 words per review

In [70]:
max_words = 500
trn_seq5 = np.array(pad_sequences(trn_seq, max_words))
val_seq5 = np.array(pad_sequences(val_seq, max_words))
tst_seq5 = np.array(pad_sequences(tst_seq, max_words))

In [56]:
max_words = 400
trn_seq4 = np.array(pad_sequences(trn_seq, max_words))
val_seq4 = np.array(pad_sequences(val_seq, max_words))
tst_seq4 = np.array(pad_sequences(tst_seq, max_words))

In [57]:
max_words = 300
trn_seq3 = np.array(pad_sequences(trn_seq, max_words))
val_seq3 = np.array(pad_sequences(val_seq, max_words))
tst_seq3 = np.array(pad_sequences(tst_seq, max_words))
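
With its default settings, `pad_sequences` pads short sequences with zeros on the left and, for long sequences, keeps only the last `max_words` tokens. A NumPy-only sketch of that behaviour, assuming those defaults (the helper `pad_pre` and the toy ids are made up for illustration):

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` tokens of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]                  # truncate from the front
        if kept:
            out[row, -len(kept):] = kept      # right-align, zeros on the left
    return out

toy = [[5, 8, 2, 9], [7], []]                 # made-up token ids
padded = pad_pre(toy, maxlen=3)
print(padded)
```

Because truncation removes the start of a long review, the model sees the final part of the text, which is worth keeping in mind when comparing the 300-, 400- and 500-word variants.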

We can inspect one of the reviews (index 1) after it has been converted to an array of integers

In [24]:
trn_seq[1]

array([1034,    1,  360,  165,  138,   32,  404,  298,   16,    1,  218,
         17,    7,    6,  215,    5,   56,   94,   37,    6,   58,   47,
         96,    5,    3,  786,   30,    1,   27,    7,    6,  156, 1133,
          5, 1363,    1,  714,   15,  191,    2, 4049,  486,  276,   74,
          2,  101, 1827,  100,   89,  114,   37,    6,   46, 2596,    8,
         10,  366,   17,    7,   87,  303,   11,   58,   27,  510,    2,
       4049,    8,    1, 3781,    5,  276,   74,   13,   34,  124,  192,
          1,   49,    4,    7,  110,   51,   30,   11,    1,  714,  374,
        217,    1, 1055,  209,    1, 2026,    2,   72,  915,  202,    5,
       1402,  100,    5,   64,  201, 2696,    1,  659,    6,   11,   37,
          6,   78,    2,    8,    1,  786,    2,   10,    6,   54, 3772,
          5,  385,    2,    5,   53,   19,    1,  254,   17,    7,    6,
        118, 1259,   13,    3,  316,    2, 4320,  431,   37,    6,  455,
          2,   19,  204,    2,   87,    1,  290,   

## Build a Neural Network with Keras to predict sentiment from sequences

We represent each word as a 64-dimensional embedding vector and pass the resulting sequence through an LSTM network. For more information, see: https://keras.io/getting-started/sequential-model-guide/

In [14]:
model = Sequential([
        Embedding(n_words, 64, input_length = max_words, input_shape=(max_words,)),
        BatchNormalization(),
        LSTM(64, dropout=0.3, recurrent_dropout=0.3),
        BatchNormalization(),
        Dense(2, activation = 'softmax')
    ])

model.compile(loss = 'binary_crossentropy', optimizer = Adam(lr=.01), metrics = ['accuracy'])

In [15]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 64)           320000    
_________________________________________________________________
batch_normalization_1 (Batch (None, 400, 64)           256       
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                33024     
_________________________________________________________________
batch_normalization_2 (Batch (None, 64)                256       
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 130       
Total params: 353,666
Trainable params: 353,410
Non-trainable params: 256
_________________________________________________________________
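
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for each layer type; the helper names are made up for illustration.

```python
def embedding_params(vocab, dim):
    # one dim-sized vector per vocabulary entry
    return vocab * dim

def batchnorm_params(features):
    # gamma and beta are trainable; moving mean and variance are not
    return 4 * features

def lstm_params(inputs, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (inputs + units + 1) * units

def dense_params(inputs, outputs):
    # weight matrix plus one bias per output
    return (inputs + 1) * outputs

layers = [
    embedding_params(5000, 64),
    batchnorm_params(64),
    lstm_params(64, 64),
    batchnorm_params(64),
    dense_params(64, 2),
]
total = sum(layers)
non_trainable = 2 * 64 + 2 * 64   # moving statistics of both batch-norm layers
print(layers)
print(total, total - non_trainable, non_trainable)
```

The three printed totals should match the "Total", "Trainable" and "Non-trainable" rows of the summary above.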


In [16]:
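# Pad to the model's input length; the raw sequences have varying lengths
model.fit(pad_sequences(trn_seq, max_words),
          trn_y,
          validation_data = (pad_sequences(val_seq, max_words), val_y),
          epochs = 5)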

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras._impl.keras.callbacks.History at 0x2d0225f0e48>

Predict the sentiment for each review in the test dataset

In [17]:
preds = model.predict(pad_sequences(tst_seq, max_words))

In [18]:
preds

array([[0.10833202, 0.89166796],
       [0.00667799, 0.99332196],
       [0.04753473, 0.95246524],
       ...,
       [0.22214174, 0.77785826],
       [0.987879  , 0.01212099],
       [0.9819125 , 0.01808748]], dtype=float32)

**Most likely to be negative sentiment**

In [22]:
test.text.iloc[np.argmax(preds[:,0])]

'I watched this movie with my boyfriend, an avid hip-hop fan and he was really really looking forward to catch the "soul" vibe the movie claimed to have. Boy, we were dead wrong. When I finished watching the movie I felt two things: remorse and relief. Remorse because I regretted wasting my time to watch this awful piece of dung, and relief because I watched it free on cable.<br /><br />This movie really really gives a bad name to black people, by putting so much awful stereotypes that I believe all smart black people everywhere has been trying to spell off. I\'m Asian, and I feel very very sorry and sick for those who made this movie. What more to say? Bad writing, even worse acting, and horrible storyline.<br /><br />Even if you\'re bored to death and has no other choice, don\'t watch this movie. Seriously. The movie really has nothing to offer, except if you want to see things like minor illegal drinking, animal slain, women degradation, and overall: A REALLY REALLY BAD-OBNOXIOUS-SI

**Most likely to be positive sentiment**

In [None]:
test.text.iloc[np.argmax(preds[:,1])]
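
The two lookups above rely on `np.argmax` over a single column of the prediction array. A small sketch with made-up probabilities (not actual model output) shows what the index means:

```python
import numpy as np

# Made-up probabilities; columns are [P(negative), P(positive)]
toy_preds = np.array([[0.10, 0.90],
                      [0.99, 0.01],
                      [0.40, 0.60]])

most_negative = int(np.argmax(toy_preds[:, 0]))   # row with the highest P(negative)
most_positive = int(np.argmax(toy_preds[:, 1]))   # row with the highest P(positive)
print(most_negative, most_positive)
```

The returned value is a row position, which is why the notebook uses `.iloc` to fetch the matching review text.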

We can try out some of our own reviews for a sanity check.

In [None]:
def predict_words(strings):
    if type(strings) is str:
        strings = [strings]
    seq = np.array(pad_sequences(t.texts_to_sequences([stem(string) for string in strings]),max_words))
    pred = model.predict(seq)
    for i in range(len(strings)):
        print("%s  |  Positive Sentiment: %2.f%%" % (strings[i], pred[i][1]*100))

**Baseline sentiment**

In [None]:
predict_words('')

In [None]:
predict_words(['I love this movie! Great film','This movie is boring and terrible...'])

In [None]:
predict_words(['highly recommended','recommended','not recommended'])

In [None]:
predict_words(['good','not good','bad'])

In [None]:
predict_words(['fast pace','slow pace','very slow pace'])

**Create submission**

In [52]:
test['labels'] = preds[:,1]

In [53]:
test[['id','labels']].to_csv('predictions5.csv', index=False)

In [21]:
preds[:,1]

array([0.89166796, 0.99332196, 0.95246524, ..., 0.77785826, 0.01212099,
       0.01808748], dtype=float32)

## Conv + LSTM

In [26]:
from keras.models import Sequential
from keras.layers import Convolution1D, MaxPooling1D, Dense, Dropout, LSTM, Embedding

Using TensorFlow backend.


In [58]:
vocab_size = 8000
review_length = 300

embedding_vector_length = 64
model = Sequential()

model.add(Embedding(vocab_size,
                    embedding_vector_length,
                    input_length=review_length))

# The embedding layer feeds 64-dimensional word vectors into the
#   convolutional layer. The output is downsampled by a max-pool layer
#   before being fed step by step to the LSTM, which should reduce the
#   training time of the net. Accuracy could improve because the spatial
#   structure learning of a CNN is combined with the sequential learning
#   of an LSTM.

model.add(Convolution1D(filters=64,
                        kernel_size=3,
                        activation='sigmoid',
                        padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(200))
model.add(Dense(2, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 300, 64)           512000    
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 300, 64)           12352     
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 150, 64)           0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 200)               212000    
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 402       
Total params: 736,754
Trainable params: 736,754
Non-trainable params: 0
_________________________________________________________________
None
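
The shapes and parameter counts in this summary follow from a few simple formulas. A sketch, assuming non-overlapping pooling windows and `'same'` padding in the convolution (the helper names are made up for illustration):

```python
def conv1d_params(kernel_size, in_channels, filters):
    # one kernel_size x in_channels weight block per filter, plus a bias each
    return kernel_size * in_channels * filters + filters

def pooled_length(length, pool_size):
    # non-overlapping pooling windows (stride equal to pool_size)
    return length // pool_size

def lstm_params(inputs, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (inputs + units + 1) * units

conv_count = conv1d_params(3, 64, 64)     # Conv1D row
steps_after_pool = pooled_length(300, 2)  # time steps after MaxPooling1D
lstm_count = lstm_params(64, 200)         # LSTM row
print(conv_count, steps_after_pool, lstm_count)
```

Pooling halves the number of time steps the LSTM has to process, which is where the expected reduction in training time comes from.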


In [60]:
# Fit the model (already compiled in the previous cell) to the data
model.fit(trn_seq3, trn_y, validation_data=(val_seq3, val_y),
          epochs=5, batch_size=32)

# Display accuracy
evaluation = model.evaluate(val_seq3, val_y)
print("Accuracy %0.2f%%" % (evaluation[1] * 100))

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy 84.81%
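
A note on how to read this number: with two sigmoid outputs and `binary_crossentropy`, some Keras versions resolve `metrics=['accuracy']` to an element-wise accuracy rather than a per-review class accuracy. Whether that applies here depends on the installed version, so treat it as an assumption to check. The sketch below, using made-up targets and outputs, shows how the two measures can differ:

```python
import numpy as np

# Made-up one-hot targets and two-column sigmoid outputs
y_true = np.array([[1, 0], [0, 1], [0, 1]])
y_prob = np.array([[0.8, 0.3], [0.6, 0.7], [0.2, 0.4]])

# Element-wise accuracy: threshold every output at 0.5 and compare cell by cell
elementwise_acc = np.mean((y_prob > 0.5).astype(int) == y_true)

# Class accuracy: compare the most probable column with the true class
class_acc = np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1))
print(elementwise_acc, class_acc)
```

Computing the class accuracy directly from `model.predict` output is a simple way to confirm which definition a reported figure corresponds to.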


In [46]:
preds = model.predict(tst_seq3)

In [61]:
vocab_size = 8000
review_length = 400

embedding_vector_length = 64
model = Sequential()

model.add(Embedding(vocab_size,
                    embedding_vector_length,
                    input_length=review_length))

# The embedding layer feeds 64-dimensional word vectors into the
#   convolutional layer. The output is downsampled by a max-pool layer
#   before being fed step by step to the LSTM, which should reduce the
#   training time of the net. Accuracy could improve because the spatial
#   structure learning of a CNN is combined with the sequential learning
#   of an LSTM.

model.add(Convolution1D(filters=64,
                        kernel_size=3,
                        activation='sigmoid',
                        padding='same'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(200))
model.add(Dense(2, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 400, 64)           512000    
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 400, 64)           12352     
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 200, 64)           0         
_________________________________________________________________
lstm_7 (LSTM)                (None, 200)               212000    
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 402       
Total params: 736,754
Trainable params: 736,754
Non-trainable params: 0
_________________________________________________________________
None


In [62]:
# Fit the model (already compiled in the previous cell) to the data
model.fit(trn_seq4, trn_y, validation_data=(val_seq4, val_y),
          epochs=5, batch_size=32)

# Display accuracy
evaluation = model.evaluate(val_seq4, val_y)
print("Accuracy %0.2f%%" % (evaluation[1] * 100))

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy 86.70%


## LSTM + Dropout

In [63]:
embedding_vector_length = 64
model = Sequential()
model.add(Embedding(vocab_size,
                    embedding_vector_length,
                    input_length=review_length))
model.add(Dropout(0.25))
model.add(LSTM(200))
model.add(Dropout(0.1))
model.add(Dense(2, activation='sigmoid'))

In [64]:
# Compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 400, 64)           512000    
_________________________________________________________________
dropout_5 (Dropout)          (None, 400, 64)           0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 200)               212000    
_________________________________________________________________
dropout_6 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 2)                 402       
Total params: 724,402
Trainable params: 724,402
Non-trainable params: 0
_________________________________________________________________
None


In [66]:
model.fit(trn_seq4, trn_y, validation_data=(val_seq4, val_y),
          epochs=5, batch_size=32)

# Display accuracy
evaluation = model.evaluate(val_seq4, val_y)
print("Accuracy %0.2f%%" % (evaluation[1] * 100))



Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy 86.55%


In [67]:
model = Sequential()

model.add(Embedding(vocab_size,
                    embedding_vector_length,
                    input_length=review_length))

# The embedding layer feeds 64-dimensional word vectors into the
#   convolutional layer. The output is downsampled by a max-pool layer
#   before being fed step by step to the LSTM, which should reduce the
#   training time of the net. Accuracy could improve because the spatial
#   structure learning of a CNN is combined with the sequential learning
#   of an LSTM.

model.add(Convolution1D(filters=128,
                        kernel_size=4,
                        activation='sigmoid',
                        padding='same'))
model.add(MaxPooling1D(pool_size=4))
model.add(Dropout(0.2))
model.add(LSTM(200))
model.add(Dropout(0.2))
model.add(Dense(2, activation='sigmoid'))

# Compile model and fit to data
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())
model.fit(trn_seq4, trn_y, validation_data=(val_seq4, val_y),
          epochs=5, batch_size=32)

# Display accuracy
evaluation = model.evaluate(val_seq4, val_y)
print("Accuracy %0.2f%%" % (evaluation[1] * 100))



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 400, 64)           512000    
_________________________________________________________________
conv1d_6 (Conv1D)            (None, 400, 128)          32896     
_________________________________________________________________
max_pooling1d_6 (MaxPooling1 (None, 100, 128)          0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 100, 128)          0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 200)               263200    
_________________________________________________________________
dropout_8 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 2)                 402       
Total para



Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy 86.93%


In [68]:
# Save the weights
model.save_weights('model_weights_1.h5')

# Save the model architecture
with open('model_architecture_1.json', 'w') as f:
    f.write(model.to_json())

In [69]:
model = Sequential()

model.add(Embedding(vocab_size,
                    embedding_vector_length,
                    input_length=review_length))

# The embedding layer feeds 64-dimensional word vectors into the
#   convolutional layer. The output is downsampled by a max-pool layer
#   before being fed step by step to the LSTM, which should reduce the
#   training time of the net. Accuracy could improve because the spatial
#   structure learning of a CNN is combined with the sequential learning
#   of an LSTM.

model.add(Convolution1D(filters=128,
                        kernel_size=4,
                        activation='sigmoid',
                        padding='same'))
model.add(MaxPooling1D(pool_size=4))
model.add(Dropout(0.2))
model.add(LSTM(200))
model.add(Dropout(0.2))
model.add(Dense(2, activation='sigmoid'))

# Compile model and fit to data
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())
model.fit(trn_seq4, trn_y, validation_data=(val_seq4, val_y),
          epochs=4, batch_size=32)

# Display accuracy
evaluation = model.evaluate(val_seq4, val_y)
print("Accuracy %0.2f%%" % (evaluation[1] * 100))



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 400, 64)           512000    
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 400, 128)          32896     
_________________________________________________________________
max_pooling1d_7 (MaxPooling1 (None, 100, 128)          0         
_________________________________________________________________
dropout_9 (Dropout)          (None, 100, 128)          0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 200)               263200    
_________________________________________________________________
dropout_10 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 402       
Total para

In [72]:
review_length = 500

model = Sequential()

model.add(Embedding(vocab_size,
                    embedding_vector_length,
                    input_length=review_length))

# The embedding layer feeds 64-dimensional word vectors into the
#   convolutional layer. The output is downsampled by a max-pool layer
#   before being fed step by step to the LSTM, which should reduce the
#   training time of the net. Accuracy could improve because the spatial
#   structure learning of a CNN is combined with the sequential learning
#   of an LSTM.

model.add(Convolution1D(filters=128,
                        kernel_size=4,
                        activation='sigmoid',
                        padding='same'))
model.add(MaxPooling1D(pool_size=4))
model.add(Dropout(0.2))
model.add(LSTM(200))
model.add(Dropout(0.2))
model.add(Dense(2, activation='sigmoid'))

# Compile model and fit to data
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print(model.summary())
model.fit(trn_seq5, trn_y, validation_data=(val_seq5, val_y),
          epochs=3, batch_size=32)

# Display accuracy
evaluation = model.evaluate(val_seq5, val_y)
print("Accuracy %0.2f%%" % (evaluation[1] * 100))



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 500, 64)           512000    
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 500, 128)          32896     
_________________________________________________________________
max_pooling1d_9 (MaxPooling1 (None, 125, 128)          0         
_________________________________________________________________
dropout_13 (Dropout)         (None, 125, 128)          0         
_________________________________________________________________
lstm_12 (LSTM)               (None, 200)               263200    
_________________________________________________________________
dropout_14 (Dropout)         (None, 200)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 2)                 402       
Total para

In [73]:
preds = model.predict(tst_seq5)

In [74]:
test['labels'] = preds[:,1]

In [75]:
test[['id','labels']].to_csv('predictions6.csv', index=False)