# A Basic Implementation of Sentiment Analysis with Sequence Models (RNN)

*As an introductory project on RNNs, I use an LSTM to build a sentiment analysis model in Keras.*

*Import the TensorFlow framework.*

In [0]:
import tensorflow as tf

*Load the IMDB dataset, keeping only the **5000** most frequent words. Every review is preprocessed and encoded as a **sequence of word indices**. The indices reflect the **overall frequency** of each word in the dataset: in the word index, integer **n** stands for the **nth most frequent word**. With the default `load_data` settings the indices inside the reviews are shifted by 3, because 0, 1 and 2 are reserved for padding, the start-of-sequence marker and out-of-vocabulary words. All reviews are labelled **negative (0)** or **positive (1)**.*

In [14]:
imdb = tf.keras.datasets.imdb

vocabulary_size = 5000  # number of most frequent words to keep

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)

print('Load dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Load dataset with 25000 training samples, 25000 test samples


*View a sample review (index 10) together with its label.*

In [15]:
print('---label---')
print(y_train[10])
print('---review---')
print(X_train[10])

---label---
1
---review---
[1, 785, 189, 438, 47, 110, 142, 7, 6, 2, 120, 4, 236, 378, 7, 153, 19, 87, 108, 141, 17, 1004, 5, 2, 883, 2, 23, 8, 4, 136, 2, 2, 4, 2, 43, 1076, 21, 1407, 419, 5, 2, 120, 91, 682, 189, 2818, 5, 9, 1348, 31, 7, 4, 118, 785, 189, 108, 126, 93, 2, 16, 540, 324, 23, 6, 364, 352, 21, 14, 9, 93, 56, 18, 11, 230, 53, 771, 74, 31, 34, 4, 2834, 7, 4, 22, 5, 14, 11, 471, 9, 2, 34, 4, 321, 487, 5, 116, 15, 2, 4, 22, 9, 6, 2286, 4, 114, 2679, 23, 107, 293, 1008, 1172, 5, 328, 1236, 4, 1375, 109, 9, 6, 132, 773, 2, 1412, 8, 1172, 18, 2, 29, 9, 276, 11, 6, 2768, 19, 289, 409, 4, 2, 2140, 2, 648, 1430, 2, 2, 5, 27, 3000, 1432, 2, 103, 6, 346, 137, 11, 4, 2768, 295, 36, 2, 725, 6, 3208, 273, 11, 4, 1513, 15, 1367, 35, 154, 2, 103, 2, 173, 7, 12, 36, 515, 3547, 94, 2547, 1722, 5, 3547, 36, 203, 30, 502, 8, 361, 12, 8, 989, 143, 4, 1172, 3404, 10, 10, 328, 1236, 9, 6, 55, 221, 2989, 5, 146, 165, 179, 770, 15, 50, 713, 53, 108, 448, 23, 12, 17, 225, 38, 76, 4397, 18, 183, 8, 

*Map the review back to its original words.*

In [16]:
word_index = imdb.get_word_index()
index2word = {i:word for word, i in word_index.items()}
print('---review with words---')
print([index2word.get(i, ' ') for i in X_train[10]])
print('---label---')
print(y_train[10])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
---review with words---
['the', 'clear', 'fact', 'entertaining', 'there', 'life', 'back', 'br', 'is', 'and', 'show', 'of', 'performance', 'stars', 'br', 'actors', 'film', 'him', 'many', 'should', 'movie', 'reasons', 'to', 'and', 'reading', 'and', 'are', 'in', 'of', 'scenes', 'and', 'and', 'of', 'and', 'out', 'compared', 'not', 'boss', 'yes', 'to', 'and', 'show', 'its', 'disappointed', 'fact', 'raw', 'to', 'it', 'justice', 'by', 'br', 'of', 'where', 'clear', 'fact', 'many', 'your', 'way', 'and', 'with', 'city', 'nice', 'are', 'is', 'along', 'wrong', 'not', 'as', 'it', 'way', 'she', 'but', 'this', 'anything', 'up', "haven't", 'been', 'by', 'who', 'of', 'choices', 'br', 'of', 'you', 'to', 'as', 'this', "i'd", 'it', 'and', 'who', 'of', 'shot', "you'll", 'to', 'love', 'for', 'and', 'of', 'you', 'it', 'is', 'sequels', 'of', 'little', 'quest', 'are', 'seen', 'watched', 'front', 'chemistry', 

*The review is labelled positive (1), which fits the positive words that appear in it.*
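
*One caveat about the lookup above: with its default arguments, `imdb.load_data` shifts every word index by 3 (`index_from=3`) and reserves 0 for padding, 1 for the start marker and 2 for out-of-vocabulary words. Looking up the raw integers in `get_word_index()` therefore returns shifted words, which would explain why the decoded list starts with 'the' and contains so many 'and's. Below is a minimal sketch of offset-aware decoding. The tiny `toy_word_index` and the helper `decode_review` are hypothetical stand-ins, not part of the notebook.*

```python
# Reserved indices used by imdb.load_data with its default arguments.
PAD, START, OOV = 0, 1, 2
INDEX_FROM = 3  # default offset added to every word rank

# Hypothetical miniature word index (word -> frequency rank), standing in
# for the dictionary returned by imdb.get_word_index().
toy_word_index = {'the': 1, 'and': 2, 'a': 3, 'film': 4, 'great': 5}


def decode_review(encoded, word_index, index_from=INDEX_FROM):
    """Map an encoded review back to words, undoing the index offset."""
    index2word = {rank + index_from: word for word, rank in word_index.items()}
    index2word.update({PAD: '<pad>', START: '<start>', OOV: '<oov>'})
    return [index2word.get(i, '<oov>') for i in encoded]


# Encoded form of: <start> a great film <oov>
encoded = [1, 6, 8, 7, 2]

decoded = decode_review(encoded, toy_word_index)
print(' '.join(decoded))

# For contrast: the naive lookup that ignores the offset.
raw_index2word = {rank: word for word, rank in toy_word_index.items()}
naive = [raw_index2word.get(i, ' ') for i in encoded]
print(naive)
```

*In the naive version the start marker (1) decodes to 'the' and the out-of-vocabulary marker (2) decodes to 'and', the same pattern visible in the output above.*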

**Padding Sequences:** *Next we apply **padding** so that every review fed to the RNN has the same length. Longer reviews are truncated and shorter reviews are padded with zeros. Each review keeps at most **500** words.*
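
*To make the idea concrete, here is a small pure-Python sketch of what padding and truncation do to a single sequence. It mirrors the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`, `value=0`); the helper name `pad_or_truncate` is hypothetical and only for illustration.*

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    if len(seq) >= maxlen:
        # 'pre' truncation drops items from the start of the sequence.
        return list(seq[-maxlen:])
    # 'pre' padding adds the fill value in front of the sequence.
    return [value] * (maxlen - len(seq)) + list(seq)


short_review = [5, 8, 3]
long_review = [1, 2, 3, 4, 5, 6, 7]

padded = pad_or_truncate(short_review, maxlen=5)
truncated = pad_or_truncate(long_review, maxlen=5)

print(padded)     # [0, 0, 5, 8, 3]
print(truncated)  # [3, 4, 5, 6, 7]
```

*Padding at the front keeps the real words next to the end of the sequence, which is the last thing the LSTM reads before it produces its output.*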

In [17]:
print('Maximum review length: {}'.format(len(max((X_train+X_test), key = len))))
print('Minimum review length: {}'.format(len(min((X_train+X_test), key = len))))

from keras.preprocessing import sequence

max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen = max_words)
X_test = sequence.pad_sequences(X_test, maxlen = max_words)

Maximum review length: 2697
Minimum review length: 70


Using TensorFlow backend.
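
*A caveat on the length check above: `X_train` and `X_test` are typically NumPy object arrays of lists, and `+` on such arrays concatenates the lists element by element instead of joining the two datasets. If that is the case here, the reported lengths describe train/test pairs and not individual reviews. The sketch below contrasts the two computations on toy data; the helper `as_object_array` is hypothetical.*

```python
import numpy as np


def as_object_array(seqs):
    """Build a 1-D object array whose elements are Python lists."""
    arr = np.empty(len(seqs), dtype=object)
    for i, s in enumerate(seqs):
        arr[i] = list(s)
    return arr


# Toy stand-ins for X_train and X_test (variable-length reviews).
toy_train = as_object_array([[1, 2, 3], [4]])
toy_test = as_object_array([[5, 6], [7, 8, 9, 10]])

# Element-wise '+': review i of train is glued to review i of test.
paired = toy_train + toy_test
paired_max = len(max(paired, key=len))
paired_min = len(min(paired, key=len))

# Lengths of the individual reviews across both sets.
lengths = [len(r) for r in toy_train] + [len(r) for r in toy_test]
true_max = max(lengths)
true_min = min(lengths)

print('paired:', paired_max, paired_min)
print('individual:', true_max, true_min)
```

*Collecting the lengths in a plain Python list, as in the second half of the sketch, avoids the element-wise behaviour entirely.*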


*RNN Model for Sentiment Analysis*
* Input: word indices
* Output: label (**0 or 1**)

In [18]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_size = 32
model = Sequential()
model.add(Embedding(input_dim = vocabulary_size, output_dim = embedding_size, input_length = max_words))
model.add(LSTM(units = 100))
model.add(Dense(units = 1, activation = 'sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
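
*The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes used above, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias vector.*

```python
vocabulary_size = 5000
embedding_size = 32
lstm_units = 100

# Embedding: one vector of size `embedding_size` per vocabulary entry.
embedding_params = vocabulary_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print(embedding_params, lstm_params, dense_params, total_params)
```

*The four values match the `Param #` column and the total of 213,301 reported by the summary.*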


*Start model training.*

In [19]:
batch_size = 64
num_epochs = 5

X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]

model.fit(X_train2, y_train2, validation_data = (X_valid, y_valid), batch_size = batch_size, epochs = num_epochs)




Train on 24936 samples, validate on 64 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7ff437328ef0>
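
*The split above holds out only the first `batch_size` (64) reviews for validation, so the validation metrics rest on very few samples. One possible alternative is to hold out a fixed fraction of the training data. The helper `holdout_split` and its `val_fraction` parameter are hypothetical, shown here on toy lists; the same slicing works on NumPy arrays.*

```python
def holdout_split(X, y, val_fraction=0.1):
    """Reserve the first `val_fraction` of the samples for validation."""
    n_val = max(1, int(len(X) * val_fraction))
    train = (X[n_val:], y[n_val:])
    valid = (X[:n_val], y[:n_val])
    return train, valid


# Toy data: 20 samples with alternating labels.
toy_X = list(range(20))
toy_y = [i % 2 for i in range(20)]

(toy_X_train, toy_y_train), (toy_X_valid, toy_y_valid) = holdout_split(
    toy_X, toy_y, val_fraction=0.25
)

print(len(toy_X_train), len(toy_X_valid))
```

*The resulting pairs could be passed to `model.fit` in place of `X_train2`, `y_train2` and `validation_data = (X_valid, y_valid)`.*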

*Check model accuracy.*

In [20]:
scores = model.evaluate(X_test, y_test, verbose = 0)
print('Test accuracy:', scores[1])

Test accuracy: 0.87148
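
*Here `scores[0]` is the loss and `scores[1]` is the accuracy, because `accuracy` was the only metric passed to `compile`. The sigmoid output is a probability, and accuracy counts a prediction as positive when it exceeds 0.5. A minimal sketch of that calculation on made-up numbers (the helper `binary_accuracy` and the values are illustrative, not results from the model):*

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == t) for p, t in zip(predictions, labels))
    return correct / len(labels)


# Made-up sigmoid outputs and true labels for four reviews.
toy_probabilities = [0.91, 0.12, 0.55, 0.40]
toy_labels = [1, 0, 0, 1]

toy_accuracy = binary_accuracy(toy_probabilities, toy_labels)
print(toy_accuracy)  # 0.5 (two of the four predictions are correct)
```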
