# Sentiment Analysis using LSTM

In this notebook we will implement an LSTM network to classify the sentiment of text reviews. The model will be trained on a dataset of 25k reviews, both positive and negative.

This is what the overall architecture looks like:

![Screen%20Shot%202018-07-18%20at%201.38.42%20PM.png](attachment:Screen%20Shot%202018-07-18%20at%201.38.42%20PM.png)

As shown above, the words are passed to the embedding layer, which converts them into vectors so that they can be fed as input to the LSTM network. We will go into detail as we progress.

Let's start by importing the libraries that we need:

In [289]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# import libraries
import numpy as np
import tensorflow as tf
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras import callbacks
from keras.models import load_model
from keras.utils.vis_utils import plot_model
from keras.utils import np_utils


# Data Preparation

The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). This means we have 12.5k positive and 12.5k negative review files in the train set as well as in the test set.

For ease of use, let's create a CSV file for the train set that contains each file name and the sentiment associated with it.

In [290]:
%run -i 'data_prep.py'

creating training csv set
training.csv created


The above script loops through all the text files in the pos and neg directories of the training set and creates a CSV file that maps each file name to its sentiment.

The CSV file will look something like this:
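The script itself is not shown in this notebook. As a rough sketch of what such a step could look like, the function below walks the `pos` and `neg` directories and writes one row per review file. The function name `write_training_csv` and the label encoding (1 for positive, 0 for negative) are assumptions made for illustration, not the actual contents of `data_prep.py`.

```python
import csv
import os


def write_training_csv(train_dir, csv_path):
    """Write one (file name, label) row per .txt file in train_dir/pos and train_dir/neg."""
    rows = []
    # Assumed label encoding: 1 = positive, 0 = negative.
    for sub_dir, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(train_dir, sub_dir)
        # Sort the names so the output is deterministic.
        for file_name in sorted(os.listdir(folder)):
            if file_name.endswith(".txt"):
                rows.append((file_name, label))

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="") as fd:
        csv.writer(fd).writerows(rows)
    return len(rows)
```

A call such as `write_training_csv('./dataset/train', './dataset/train/train.csv')` would then produce a file in the layout read by the loading code further down. Note that this sketch writes all positive rows before the negative ones.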

![Screen%20Shot%202018-07-18%20at%202.19.23%20PM.png](attachment:Screen%20Shot%202018-07-18%20at%202.19.23%20PM.png)

Let's print one of the reviews to see what the dataset looks like:

In [291]:
# open the file in a with-block so that it is closed automatically
with open('./dataset/train/pos/4715_9.txt', 'r') as f:
    message = f.read()
print(message)

For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.


### Loading Data

In [292]:
import csv

training_reviews = []
test_reviews = []
target = []

# load training data: each CSV row is (file name, label), where label '1' marks a positive review
with open("./dataset/train/train.csv") as fd:
    rd = csv.reader(fd, delimiter=",", quotechar='"')
    for file_name, label in rd:
        if file_name.endswith('.txt'):
            # compare strings with ==, not `is` (which tests object identity)
            path = './dataset/train/pos/' if label == '1' else './dataset/train/neg/'
            with open(path + file_name, "r") as file:
                training_reviews.append(file.read())
            target.append(label)

# load test data: each row holds a single file name
with open("./dataset/test/test.csv") as fd:
    rd = csv.reader(fd, delimiter=" ", quotechar='"')
    for file_name in rd:
        if file_name[0].endswith('.txt'):
            with open('./dataset/test/' + file_name[0], "r") as file:
                test_reviews.append(file.read())
            
# print number of training reviews
print("Training reviews: {}".format(len(training_reviews)))
print("Training Targets: {}".format(len(target)))

# print number of test reviews
print("Test reviews: {}".format(len(test_reviews)))

Training reviews: 25000
Training Targets: 25000
Test reviews: 25000


### Embedding Layer (Encoding words to vectors)

Its input is a text corpus and its output is a set of vectors, i.e., it turns text into a numerical form that the neural network can understand. To create word embeddings, we will load all the reviews (negative and positive) into a single variable, which can then be fed into the embedding layer to generate vectors.
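Conceptually, an embedding layer is a trainable lookup table with one row of numbers per integer word index. The following is a minimal sketch of that lookup in plain Python. The random values stand in for weights that the network would learn during training, and the helper names `make_embedding_table` and `embed` are hypothetical; this is not how Keras implements the layer internally.

```python
import random


def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Create a vocab_size x embedding_dim table of small random numbers."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]


def embed(sequence, table):
    """Replace every integer word index with its row of the table."""
    return [table[index] for index in sequence]


table = make_embedding_table(vocab_size=6, embedding_dim=4)
vectors = embed([3, 1, 3], table)

# One vector per input index, each with embedding_dim entries.
print(len(vectors), len(vectors[0]))  # 3 4
# The same index always maps to the same vector.
print(vectors[0] == vectors[2])       # True
```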

In [293]:
#Process the training data set, and create word arrays from the reviews
import re

processed_training_review = []

for review in training_reviews:
    # this is done to cleanup and remove special characters from the dataset.
    # This will remove all special characters such as brackets, quotes, etc.
    processed_training_review.append(re.sub('[^ a-zA-Z0-9]', '', review).lower())

# print the first row
print(processed_training_review[:1])

# join the reviews into one string, using ' /n ' as the delimiter
all_train_review =' /n '.join(processed_training_review)

# split the string back into individual reviews, then join them with single spaces
train_reviews = all_train_review.split(' /n ')
all_train_review = ' '.join(train_reviews)

# split the combined string into a list of words
train_words = all_train_review.split()
print(len(train_words))

['for a movie that gets no respect there sure are a lot of memorable quotes listed for this gem imagine a movie where joe piscopo is actually funny maureen stapleton is a scene stealer the moroni character is an absolute scream watch for alan the skipper hale jr as a police sgt']
5820097


In [294]:
#Process the test data set, and create word arrays from the reviews
import re

processed_test_review = []
    
for review in test_reviews:
    # this is done to cleanup and remove special characters from the dataset.
    # This will remove all special characters such as brackets, quotes, etc.
    processed_test_review.append(re.sub('[^ a-zA-Z0-9]', '', review).lower())

# print the first row
print(processed_test_review[:1])

# join the reviews into one string, using ' /n ' as the delimiter
all_test_review =' /n '.join(processed_test_review)

# split the string back into individual test reviews, then join them with single spaces
test_reviews = all_test_review.split(' /n ')
all_test_review = ' '.join(test_reviews)

# split the combined test string into a list of words
test_words = all_test_review.split()
print(len(test_words))

['based on an actual story john boorman shows the struggle of an american doctor whose husband and son were murdered and she was continually plagued with her loss a holiday to burma with her sister seemed like a good idea to get away from it all but when her passport was stolen in rangoon she could not leave the country with her sister and was forced to stay back until she could get id papers from the american embassy to fill in a day before she could fly out she took a trip into the countryside with a tour guide i tried finding something in those stone statues but nothing stirred in me i was stone myself br br suddenly all hell broke loose and she was caught in a political revolt just when it looked like she had escaped and safely boarded a train she saw her tour guide get beaten and shot in a split second she decided to jump from the moving train and try to rescue him with no thought of herself continually her life was in danger br br here is a woman who demonstrated spontaneous self
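The `br br` tokens in the output above are most likely leftovers of HTML `<br />` line-break tags: the regex removes `<`, `/` and `>` but keeps the letters. A possible refinement, which is not part of the pipeline in this notebook, is to replace such tags with a space before filtering characters. The function name `clean_review` and the tag forms it matches are assumptions for illustration.

```python
import re


def clean_review(text):
    """Drop <br> tags, keep only letters, digits and spaces, and lowercase the result."""
    # Replace HTML line breaks such as <br /> or <br> with a space (assumed tag forms).
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Same character filter as in the cells above.
    text = re.sub('[^ a-zA-Z0-9]', '', text).lower()
    # Collapse runs of whitespace into single spaces.
    return ' '.join(text.split())


print(clean_review("I was stone myself.<br /><br />Suddenly, all hell broke loose!"))
# i was stone myself suddenly all hell broke loose
```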

In [295]:
# combine the training and test words
total_words = train_words + test_words

print(len(total_words))

11509679


Now that we have the reviews, we can prepare the input for the embedding layer. This step converts each word in the reviews into an integer index, with the most frequent word mapped to 1, so that the reviews can later be fed into the neural network.
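To make the word-to-integer mapping concrete, here is a toy example that uses the same `Counter`-based approach as the next cell. The sentence is made up for illustration. Words with equal counts keep the order in which they first appear, because `sorted` is stable.

```python
from collections import Counter

toy_words = "the movie was good the movie was bad the end".split()

# Count each word, then order the vocabulary from most to least frequent.
counts = Counter(toy_words)
vocab = sorted(counts, key=counts.get, reverse=True)

# Start numbering at 1 so that 0 stays free for padding.
vocab_to_int = {word: index for index, word in enumerate(vocab, 1)}

encoded = [vocab_to_int[word] for word in "the movie was bad".split()]

print(vocab_to_int)
# {'the': 1, 'movie': 2, 'was': 3, 'good': 4, 'bad': 5, 'end': 6}
print(encoded)
# [1, 2, 3, 5]
```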

In [296]:
from collections import Counter
counts = Counter(total_words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

train_reviews_integers = []
for review in train_reviews:
    train_reviews_integers.append([vocab_to_int[word] for word in review.split()])
    
test_reviews_integers = []
for review in test_reviews:
    test_reviews_integers.append([vocab_to_int[word] for word in review.split()])

Printing the integer mapping for the review words:

In [297]:
train_reviews_integers[:10]

[[16,
  3,
  17,
  11,
  204,
  55,
  1167,
  47,
  243,
  23,
  3,
  164,
  4,
  876,
  4678,
  3747,
  16,
  10,
  1496,
  790,
  3,
  17,
  112,
  959,
  21910,
  6,
  154,
  155,
  9481,
  17137,
  6,
  3,
  130,
  17138,
  1,
  45569,
  108,
  6,
  33,
  1521,
  1987,
  103,
  16,
  1773,
  1,
  20586,
  11235,
  1747,
  14,
  3,
  553,
  7648],
 [1122,
  192,
  17,
  1107,
  15,
  799,
  1418,
  18,
  2539,
  32,
  13011,
  7966,
  302,
  4,
  6935,
  37697,
  1211,
  14,
  3,
  178,
  18,
  645,
  8328,
  2225,
  15,
  3,
  51819,
  1840,
  36,
  6,
  18886,
  5,
  975,
  16,
  41,
  3074,
  17139,
  32,
  21911,
  1,
  17941,
  5,
  542,
  1,
  134,
  15,
  7966,
  10395,
  23,
  52,
  73,
  1751,
  1,
  1217,
  206,
  6,
  398,
  8467,
  34923,
  6,
  1274,
  14,
  37698,
  5300,
  18,
  50,
  7966,
  1088,
  81,
  3,
  967,
  5147,
  5660,
  26429,
  8878,
  32,
  3,
  1959,
  1953,
  20,
  1,
  400,
  1861,
  177,
  62,
  365,
  6271,
  1,
  4494,
  572,
  3,
  8959,
  3441,

In [306]:
review_lens = Counter([len(x) for x in train_reviews_integers])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 0
Maximum review length: 2469


So, there are no zero-length reviews in our dataset. However, the maximum review length (2,469 words) is far more than we want to feed to the RNN, so we will cap every review at 200 words. Reviews longer than 200 words will be truncated, and shorter reviews will be padded with 0s. Note that Keras `pad_sequences`, used below, by default truncates and pads at the start of a sequence, so the last 200 words of a long review are kept.

In [299]:
limit = 200
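As a small illustration of what padding and truncating to a fixed length means, the sketch below mimics the default behaviour of Keras `pad_sequences` (`padding='pre'`, `truncating='pre'`) for a single sequence in plain Python. The helper name `pad_or_truncate` is hypothetical, and the sketch assumes `maxlen` is at least 1.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Return a list of exactly maxlen items: keep the tail, pad at the front."""
    # 'pre' truncating: drop items from the start, keep the last maxlen items.
    trimmed = sequence[-maxlen:]
    # 'pre' padding: put the fill value in front of shorter sequences.
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_or_truncate([5, 6, 7], maxlen=5))              # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

To keep the beginning of long reviews instead, `pad_sequences` accepts `truncating='post'`, and `padding='post'` moves the zeros to the end.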

# Training and Validation

We will split the training set into a training set and a validation set. The validation set is held out from training and is used to evaluate the model frequently during development, for example after each epoch.

Commonly, 80% of the training data is used for training and the remaining 20% for validation.
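One caveat: slicing off the last 20% only gives a representative validation set if the rows are in random order. Whether `train.csv` is grouped by label is not shown in this notebook, so the following is an optional precaution rather than a required step. It is a minimal sketch that shuffles samples and labels together before splitting; the function name `shuffled_split` and the fixed seed are assumptions for illustration.

```python
import random


def shuffled_split(samples, labels, train_fraction=0.8, seed=42):
    """Shuffle samples and labels with the same permutation, then split them."""
    indices = list(range(len(samples)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)

    split_index = int(len(indices) * train_fraction)
    train_idx, val_idx = indices[:split_index], indices[split_index:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(samples, train_idx), pick(labels, train_idx),
            pick(samples, val_idx), pick(labels, val_idx))


# Toy data: sample i carries label i % 2, so alignment is easy to check.
toy_samples = list(range(10))
toy_labels = [i % 2 for i in toy_samples]
x_tr, y_tr, x_va, y_va = shuffled_split(toy_samples, toy_labels)
print(len(x_tr), len(x_va))  # 8 2
```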

In [311]:
# use 80% of the data for training and the remaining 20% for validation
split_factor = 0.8
split_index = int(len(train_reviews_integers) * split_factor)

# setup training and validation set
x_train = sequence.pad_sequences(train_reviews_integers[:split_index], maxlen=limit)
x_val = sequence.pad_sequences(train_reviews_integers[split_index:], maxlen=limit)

y_train = np_utils.to_categorical(target[:split_index], 2)
y_val = np_utils.to_categorical(target[split_index:], 2)

print(split_index)

# setup test set
x_test = sequence.pad_sequences(test_reviews_integers, maxlen=limit)

20000


In [312]:
# print the shape of training set
print(y_train.shape)

(20000, 2)


In [302]:
n_words = len(vocab_to_int) + 1 # Adding 1 because we use 0's for padding, dictionary started at 1

In [313]:

# Final Model Architecture

# embedding layer size
embedding_vecor_length = 32

model = Sequential()
model.add(Embedding(len(total_words), embedding_vecor_length, input_length=limit, dropout=0.2))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
# a single LSTM layer with 100 units
model.add(LSTM(100))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(x_train, y_train,epochs=5,verbose=1, batch_size=32)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 200, 32)           368309728 
_________________________________________________________________
conv1d_16 (Conv1D)           (None, 200, 32)           3104      
_________________________________________________________________
max_pooling1d_16 (MaxPooling (None, 100, 32)           0         
_________________________________________________________________
lstm_16 (LSTM)               (None, 100)               53200     
_________________________________________________________________
dense_12 (Dense)             (None, 5)                 505       
Total params: 368,366,537
Trainable params: 368,366,537
Non-trainable params: 0
_________________________________________________________________
None


ValueError: Error when checking target: expected dense_12 to have shape (None, 5) but got array with shape (20000, 2)
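The error arises because the last layer has 5 units while the one-hot targets have 2 columns (`y_train.shape` is `(20000, 2)`), so `Dense(2, activation='softmax')` would match the targets. The summary also points to a second problem: the embedding layer is sized with `len(total_words)`, the total number of tokens (11,509,679), rather than the vocabulary size `n_words` computed earlier, which is why it reports roughly 368 million parameters. Separately, the call to `model.fit` does not use the validation split; passing `validation_data=(x_val, y_val)` would do so. The arithmetic below reproduces the parameter counts from the summary with the standard formulas for these layer types. The helper names are only for illustration.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of embedding_dim weights per vocabulary entry.
    return vocab_size * embedding_dim


def conv1d_params(kernel_size, in_channels, filters):
    # Kernel weights plus one bias per filter.
    return kernel_size * in_channels * filters + filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


total_tokens = 11509679  # len(total_words) printed earlier in the notebook

print(embedding_params(total_tokens, 32))  # 368309728, as in the summary
print(conv1d_params(3, 32, 32))            # 3104
print(lstm_params(32, 100))                # 53200
print(dense_params(100, 5))                # 505, the mismatched 5-unit layer
print(dense_params(100, 2))                # 202, for a 2-unit output layer
```

With `n_words` as the first argument to `Embedding`, the embedding layer would hold `n_words * 32` weights instead; the value of `n_words` is not printed in this notebook.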