# Predicting Book Review Sentiment Using NLP-RNN-LSTM

## Introduction

This project applies deep learning and natural language processing (NLP) to predict the sentiment of book reviews.
To make the predictions, I've used the RNN-LSTM method (a recurrent neural network built from long short-term memory units).

An RNN works with sequential data, i.e., data in which the value at a given time step depends on the values at previous time steps. Examples include the words of a sentence in order, daily stock prices, and historical unemployment figures.

In an RNN, the output of the hidden layer loops back and is fed into the hidden layer again at the next time step. This feedback is needed because the network's prediction depends on both the current input and the earlier inputs.
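
The feedback loop can be illustrated with a few lines of NumPy. This is only a minimal sketch of a plain (non-LSTM) recurrent layer with a tanh activation; the function name `rnn_forward`, the weight names, and the sizes are illustrative assumptions, not part of this project's model.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    hidden = np.zeros(W_h.shape[0])   # h_0: nothing has been seen yet
    states = []
    for x_t in inputs:                # one time step per input vector
        # The previous hidden state is fed back in alongside the current input.
        hidden = np.tanh(W_x @ x_t + W_h @ hidden + b)
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4          # illustrative sizes
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden (loop) weights
b = np.zeros(hidden_dim)

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)   # (5, 4): one hidden state per time step
```

Because each hidden state is computed from the previous one, the last state depends on every input in the sequence.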

An LSTM can remember longer sequences of data than an ordinary RNN. It maintains both a short-term memory (the hidden state) and a long-term memory (the cell state), and uses gates to decide what to keep: important information is retained and fed back through the network for prediction, while the rest is forgotten. This enables an LSTM to use long sequences of data to make predictions.
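
To make the idea of remembering and forgetting more concrete, here is a minimal NumPy sketch of one LSTM time step using the standard gate equations. The function `lstm_step`, the stacked weight layout (input, forget, candidate, output), and the sizes are assumptions chosen for illustration; this is not how the Keras layer used later is implemented internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack four gates: input, forget, candidate, output."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b      # shape (4 * n,)
    i = sigmoid(z[0:n])               # input gate: how much new information to write
    f = sigmoid(z[n:2 * n])           # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * n:3 * n])       # candidate values to write
    o = sigmoid(z[3 * n:4 * n])       # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g          # long-term memory (cell state)
    h_t = o * np.tanh(c_t)            # short-term memory (hidden state)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 3, 4                      # illustrative sizes
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(6, input_dim)):  # a sequence of 6 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)   # (4,) (4,)
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through a step almost unchanged, which is what lets information survive over many time steps.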

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import tensorflow
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
import re

## Import data

In [2]:
data = pd.read_csv('Amazon_review.csv', encoding = "ISO-8859-1")

In [3]:
data.head()

Unnamed: 0,ï»¿date,name,sentiment,text
0,10/1/2020,Bob,Neutral,It was ok. Few good parts.
1,10/20/2020,Jeff,Negative,I was bored to tears. I could not read anymore.
2,12/3/2020,Mary,Negative,The book was bad. I was really bored.
3,12/4/2020,Julio,Negative,One of the worst book I ever read.
4,12/5/2020,Kareem,Neutral,Nothing really exciting.


## Data wrangling & Preprocessing

In [4]:
data.rename(columns = {'ï»¿date': 'date'}, inplace = True)

In [5]:
data.head()

Unnamed: 0,date,name,sentiment,text
0,10/1/2020,Bob,Neutral,It was ok. Few good parts.
1,10/20/2020,Jeff,Negative,I was bored to tears. I could not read anymore.
2,12/3/2020,Mary,Negative,The book was bad. I was really bored.
3,12/4/2020,Julio,Negative,One of the worst book I ever read.
4,12/5/2020,Kareem,Neutral,Nothing really exciting.


In [6]:
data = data[['sentiment','text']]

In [7]:
# remove the 'Neutral' data, as my goal is to detect positive/negative reviews.
data = data[data.sentiment != 'Neutral']

In [8]:
data.head()

Unnamed: 0,sentiment,text
1,Negative,I was bored to tears. I could not read anymore.
2,Negative,The book was bad. I was really bored.
3,Negative,One of the worst book I ever read.
5,Positive,I loved it !
6,Negative,# I hated it.


In [9]:
# Replace special characters in the text column with spaces, then convert the text to lowercase
# (raw string so that \s is passed to the regex engine unchanged)
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', ' ', x))
data['text'] = data['text'].apply(lambda x: x.lower())

In [10]:
data

Unnamed: 0,sentiment,text
1,Negative,i was bored to tears i could not read anymore
2,Negative,the book was bad i was really bored
3,Negative,one of the worst book i ever read
5,Positive,i loved it
6,Negative,i hated it
7,Negative,really bad
8,Negative,a really dissapintment
9,Negative,it bored me
10,Positive,i liked it a lot
11,Negative,the worst
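
The effect of the cleaning step can be checked on individual strings. The helper `clean_review` below is a hypothetical name introduced only for this illustration; it applies the same regular expression and lowercasing to a single review.

```python
import re

def clean_review(text):
    """Replace anything that is not a letter, digit, or whitespace with a space, then lowercase."""
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text).lower()

for raw in ["I loved it !", "I was bored to tears. I could not read anymore.", "I've read it."]:
    cleaned = clean_review(raw)
    print(repr(cleaned), cleaned.split())
```

Note that an apostrophe is also replaced by a space, so a contraction such as "I've" is split into the two tokens `i` and `ve`.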


In [11]:
max_words = 10000 # Maximum vocabulary size (number of distinct words) the tokenizer will keep

In [12]:
tokenizer = Tokenizer(num_words = max_words, split = ' ' )

In [13]:
# Each word is assigned an integer index: the more frequent the word is across all reviews,
# the lower its index, and vice versa.
tokenizer.fit_on_texts(data['text'].values)

In [14]:
# word_index is a dictionary that maps each word to its index
words = tokenizer.word_index

In [15]:
print(words)

{'i': 1, 'it': 2, 'the': 3, 'read': 4, 'loved': 5, 'a': 6, 'was': 7, 'bored': 8, 'book': 9, 'really': 10, 'of': 11, 'worst': 12, 'bad': 13, 'one': 14, 'ever': 15, 'books': 16, 'to': 17, 'tears': 18, 'could': 19, 'not': 20, 'anymore': 21, 'hated': 22, 'dissapintment': 23, 'me': 24, 'liked': 25, 'lot': 26, 'enjoyed': 27, 'excellent': 28, 'boring': 29, 'so': 30, 'much': 31, 'absolute': 32, 'real': 33, 'thrill': 34, 'waste': 35, 'money': 36, 'please': 37, 'write': 38, 'more': 39, 'best': 40, 've': 41}


In [16]:
max_index = max(words.values())

In [17]:
print(max_index)

41


Below, I take each review and replace every word in it with the corresponding integer from the word_index dictionary. In other words, each review is represented as a sequence of numbers, and the sequences are collected in a list called X.

In [18]:
X = tokenizer.texts_to_sequences(data['text'].values)

In [19]:
print(X)

[[1, 7, 8, 17, 18, 1, 19, 20, 4, 21], [3, 9, 7, 13, 1, 7, 10, 8], [14, 11, 3, 12, 9, 1, 15, 4], [1, 5, 2], [1, 22, 2], [10, 13], [6, 10, 23], [2, 8, 24], [1, 25, 2, 6, 26], [3, 12], [1, 27, 2, 1, 5, 2], [28], [29], [1, 5, 2, 30, 31], [3, 32, 12, 9, 1, 15, 4], [6, 33, 34, 5, 2], [6, 35, 11, 36], [37, 38, 39, 16], [14, 11, 3, 40, 16, 1, 41, 4]]
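
The mapping from words to integers can be reproduced in plain Python. The sketch below is a simplified stand-in for the tokenizer, not its actual implementation: the function names and the three sample reviews are assumptions for illustration, and details such as the `num_words` cap are left out.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, and so on."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() orders by descending count; equal counts keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

sample = ["i loved it", "i hated it", "it bored me"]   # illustrative reviews
word_index = build_word_index(sample)
print(word_index)
print(texts_to_sequences(sample, word_index))
```

Index 0 is never assigned to a word, which leaves it free to be used as the padding value.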


In the following cell, I pad each sequence with zeros so that all reviews have the same length. By default, the zeros are added at the beginning of each sequence unless specified otherwise.

In [20]:
X = pad_sequences(X)

In [21]:
print(X)

[[ 1  7  8 17 18  1 19 20  4 21]
 [ 0  0  3  9  7 13  1  7 10  8]
 [ 0  0 14 11  3 12  9  1 15  4]
 [ 0  0  0  0  0  0  0  1  5  2]
 [ 0  0  0  0  0  0  0  1 22  2]
 [ 0  0  0  0  0  0  0  0 10 13]
 [ 0  0  0  0  0  0  0  6 10 23]
 [ 0  0  0  0  0  0  0  2  8 24]
 [ 0  0  0  0  0  1 25  2  6 26]
 [ 0  0  0  0  0  0  0  0  3 12]
 [ 0  0  0  0  1 27  2  1  5  2]
 [ 0  0  0  0  0  0  0  0  0 28]
 [ 0  0  0  0  0  0  0  0  0 29]
 [ 0  0  0  0  0  1  5  2 30 31]
 [ 0  0  0  3 32 12  9  1 15  4]
 [ 0  0  0  0  0  6 33 34  5  2]
 [ 0  0  0  0  0  0  6 35 11 36]
 [ 0  0  0  0  0  0 37 38 39 16]
 [ 0  0 14 11  3 40 16  1 41  4]]
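
For reference, the padding behaviour shown above (zeros added on the left, up to the length of the longest sequence) can be sketched in plain Python. `pad_pre` is a hypothetical helper written for this illustration only; it also truncates from the left when a sequence is longer than `maxlen`, and it assumes `maxlen` is positive.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)   # default: longest sequence
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                     # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[1, 5, 2], [10, 13], [3, 9, 7, 13]]))
print(pad_pre([[1, 2, 3, 4, 5]], maxlen=3))
```

Padding on the left keeps the actual words at the end of the sequence, so they are the last things the recurrent layer sees before producing its output.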


In [22]:
X.shape

(19, 10)

In [23]:
X.shape[0]

19

## NLP-RNN-LSTM Model building

In [24]:
model = Sequential() # This creates an Instance of the Sequential model
model.add(Embedding(input_dim = max_words, output_dim = 20, input_length = X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(200, dropout = 0.1))
model.add(Dense(2, activation = 'softmax'))

In [25]:
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy']) 

Note-1: categorical_crossentropy is a loss function used for multi-class classification with one-hot encoded labels. The loss function measures how wrong the model's predictions are.
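
As a rough illustration of what this loss measures, the sketch below computes categorical cross-entropy by hand for a few made-up predictions. The function and the example probabilities are assumptions for illustration; the small `eps` only guards against taking the log of zero.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return total / len(y_true)

y_true = [[1, 0], [0, 1]]                 # one-hot labels for two samples
mostly_right = [[0.9, 0.1], [0.2, 0.8]]   # made-up predictions
mostly_wrong = [[0.3, 0.7], [0.6, 0.4]]
undecided = [[0.5, 0.5], [0.5, 0.5]]

print(categorical_crossentropy(y_true, mostly_right))
print(categorical_crossentropy(y_true, mostly_wrong))
print(categorical_crossentropy(y_true, undecided))
```

With two classes, a model that always outputs 0.5 for each class scores ln 2 (about 0.693), so a loss near that value means the predictions are close to an even guess.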

Note-2: Adam is an optimization algorithm that updates the model's weights/parameters during training, using the gradients of the loss computed on the data fed into the model.
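
The Adam update rule can be sketched on a one-parameter toy problem. This is a minimal illustration under stated assumptions: the function `adam_minimize`, the quadratic objective, and the learning rate are made up for the example, and the hyperparameters `beta1`, `beta2`, and `eps` are the commonly used defaults rather than values taken from this project.

```python
import math

def adam_minimize(grad_fn, w, steps=2000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter function with Adam, given its gradient."""
    m = 0.0   # running average of the gradient
    v = 0.0   # running average of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)
```

In the real model the same kind of update is applied to every weight, with the gradient coming from the loss on each batch of training reviews.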

In [26]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 10, 20)            200000    
                                                                 
 spatial_dropout1d (Spatial  (None, 10, 20)            0         
 Dropout1D)                                                      
                                                                 
 lstm (LSTM)                 (None, 200)               176800    
                                                                 
 dense (Dense)               (None, 2)                 402       
                                                                 
Total params: 377202 (1.44 MB)
Trainable params: 377202 (1.44 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
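
The parameter counts in the summary can be checked by hand. The arithmetic below is a small sketch using the standard formulas for embedding, LSTM, and dense layers; the variable names are illustrative.

```python
vocab_size, embed_dim = 10000, 20   # Embedding(input_dim, output_dim)
lstm_units = 200                    # LSTM(200)
num_classes = 2                     # Dense(2)

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# One weight per (LSTM unit, class) pair plus one bias per class.
dense_params = lstm_units * num_classes + num_classes

print(embedding_params, lstm_params, dense_params)          # 200000 176800 402
print(embedding_params + lstm_params + dense_params)        # 377202
```

More than half of the parameters sit in the embedding layer because `input_dim` is set to `max_words` (10,000), even though only 41 distinct words occur in this dataset.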


In [27]:
# Convert the sentiment column to one-hot (dummy) numerical values
Y = pd.get_dummies(data['sentiment']).values
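
To show what the resulting label array looks like, here is a plain-Python sketch of one-hot encoding with the columns in sorted label order, which is the order `pd.get_dummies` produces for string labels. The helper `one_hot` and the sample labels are assumptions for illustration.

```python
def one_hot(labels):
    """One-hot encode string labels, with one column per class in sorted order."""
    classes = sorted(set(labels))
    column = {label: i for i, label in enumerate(classes)}
    rows = [[1 if column[label] == i else 0 for i in range(len(classes))]
            for label in labels]
    return classes, rows

classes, rows = one_hot(["Negative", "Negative", "Positive"])   # illustrative labels
print(classes)
print(rows)
```

The column order matters later: column 0 corresponds to 'Negative' and column 1 to 'Positive', and the model's two outputs follow the same order.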

## Train-test split, fit model & model evaluation

In [28]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 42)

In [29]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(14, 10)
(5, 10)
(14, 2)
(5, 2)


In [30]:
model.fit(X_train, Y_train, epochs = 10, batch_size = 5, verbose = 2)

Epoch 1/10
3/3 - 3s - loss: 0.6952 - accuracy: 0.5000 - 3s/epoch - 909ms/step
Epoch 2/10
3/3 - 0s - loss: 0.6812 - accuracy: 0.5714 - 54ms/epoch - 18ms/step
Epoch 3/10
3/3 - 0s - loss: 0.6810 - accuracy: 0.5714 - 39ms/epoch - 13ms/step
Epoch 4/10
3/3 - 0s - loss: 0.6762 - accuracy: 0.5714 - 51ms/epoch - 17ms/step
Epoch 5/10
3/3 - 0s - loss: 0.6715 - accuracy: 0.5714 - 54ms/epoch - 18ms/step
Epoch 6/10
3/3 - 0s - loss: 0.6585 - accuracy: 0.5714 - 58ms/epoch - 19ms/step
Epoch 7/10
3/3 - 0s - loss: 0.6390 - accuracy: 0.5714 - 43ms/epoch - 14ms/step
Epoch 8/10
3/3 - 0s - loss: 0.6374 - accuracy: 0.5714 - 38ms/epoch - 13ms/step
Epoch 9/10
3/3 - 0s - loss: 0.6473 - accuracy: 0.5714 - 51ms/epoch - 17ms/step
Epoch 10/10
3/3 - 0s - loss: 0.5929 - accuracy: 0.5714 - 41ms/epoch - 14ms/step


<keras.src.callbacks.History at 0x19c3530ee10>

In [31]:
score = model.evaluate(X_test, Y_test, verbose = 1)



In [32]:
print("Loss:", score[0])

Loss: 0.7480345964431763


In [33]:
print("Accuracy:", score[1])

Accuracy: 0.6000000238418579


## Testing model on new data

In [34]:
review = ['It bored me.']

In [35]:
review = tokenizer.texts_to_sequences(review)

In [36]:
review = pad_sequences(review, maxlen = 10, dtype = 'int32', value = 0)

In [37]:
sentiment = model.predict(review, batch_size = 1, verbose = 2)[0]

1/1 - 0s - 446ms/epoch - 446ms/step


In [38]:
print(sentiment)

[0.68032926 0.31967077]


Note: The first element of the output is the predicted probability that the review is negative, and the second element is the predicted probability that it is positive. This order follows the columns of the one-hot labels, which pd.get_dummies sorts alphabetically ('Negative', then 'Positive').

So the model predicts that the review has a 68% chance of being negative and a 32% chance of being positive.

In [None]:
if (np.argmax(sentiment) == 0): # argmax returns the index of the highest value in the array 'sentiment'
    print("This is a negative review.")

elif (np.argmax(sentiment) == 1):
    print("This is a positive review.")
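
The same decision can be wrapped in a small helper that also reports the probability of the winning class. This is a sketch only: `decode_prediction` and `LABELS` are hypothetical names, and the label order is assumed to match the alphabetically sorted one-hot columns.

```python
LABELS = ["Negative", "Positive"]   # assumed to match the one-hot column order

def decode_prediction(probabilities, labels=LABELS):
    """Return the most probable label and its probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]

label, confidence = decode_prediction([0.68032926, 0.31967077])
print(f"{label} ({confidence:.0%})")   # Negative (68%)
```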

## Summary

In this project, I built an NLP-RNN-LSTM model to identify the sentiment of a book review. For preprocessing, I removed special characters, converted the text to lowercase, tokenized each review into words, and assigned each word an index based on its frequency of occurrence across all reviews. I set the maximum vocabulary size to 10,000 so that the model can handle a large vocabulary, and I padded each sequence with zeros so that all of them have equal length. In the LSTM model, I included 20% 1D spatial dropout to reduce overfitting. The model was compiled with categorical cross-entropy as the loss, the Adam optimizer, and accuracy as the evaluation metric. The model was trained on labeled data, reached 60% accuracy on the held-out test set, and classified the sample review 'It bored me.' as negative with 68% probability.