<a href="https://colab.research.google.com/github/damzC/nlp/blob/main/Sentiment_Analyzer_Deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook introduces the problem of Sentiment Analysis, lists some popular datasets and common approaches, and then walks step by step through a Deep Learning solution.

**Definition**: Sentiment Analysis is the NLP task of computationally identifying the opinion or sentiment (*positive*, *negative*, or *neutral*) expressed in a text.

**Popular datasets for Sentiment Analysis:**
1. IMDB movie reviews (Kaggle), 50K entries: *longer-text SA*
2. Stanford dataset for sentiment analysis, *5 classes*: Very Positive, Positive, Neutral, Negative, Very Negative
3. Amazon review dataset (Kaggle), 4 million entries, pre-trained models available: *short-text SA* (extract the headings only)

**Approaches for Sentiment Analysis:**
1. Lexicon-based: SentiWordNet
2. NLP tools: TextBlob, spaCy, NLTK
3. Machine Learning: Naive Bayes classifier, SVM, XGBoost
4. **Deep learning: LSTMs, GRUs, seq2seq**
5. Sentiment Embeddings - Embeddings of words based on sentiments
6. Fine-tuning over Large Language Models (like BERT, RoBERTa *etc.*)

**Sentiment Analysis using Deep Learning:**

**Pre-processing:**
1. Download the dataset (*imdb_reviews*)
2. Build the word-to-index and index-to-word mappings
3. Train your own word embeddings or use pre-trained embeddings
4. Embedding matrix: maps each word index to its word embedding
5. Padding (post or pre): turn every input into a fixed-length array (the length is the maximum or average review length)
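
Steps 2 and 5 can be illustrated without Keras. The following is a minimal pure-Python sketch, not the notebook's implementation: `build_vocab`, `encode`, and `pad_post` are hypothetical helper names, index 0 is assumed to be reserved for padding, and over-long sequences keep their last tokens, which mirrors the default `truncating='pre'` of `pad_sequences`.

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer index, most frequent first (1-based; 0 is padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, then alphabetically for a deterministic order
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    word_to_index = {word: i for i, word in enumerate(ordered, start=1)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return word_to_index, index_to_word

def encode(text, word_to_index):
    """Replace every known word with its index; unknown words are dropped."""
    return [word_to_index[w] for w in text.split() if w in word_to_index]

def pad_post(sequence, max_length, pad_value=0):
    """Keep the last max_length items (max_length > 0), then right-pad to max_length."""
    trimmed = sequence[-max_length:]
    return trimmed + [pad_value] * (max_length - len(trimmed))

texts = ["good movie", "bad movie", "good good plot"]
word_to_index, index_to_word = build_vocab(texts)
encoded = [encode(t, word_to_index) for t in texts]
padded = [pad_post(seq, 4) for seq in encoded]
print(word_to_index)
print(padded)
```

Every row of `padded` has the same length, which is what allows the reviews to be stacked into a single input array.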

**Architecture:**
1. Input layer (sequences of 200 word indices)
2. Embedding layer (built into Keras): initialised from an embedding matrix or trained with Keras; each input becomes a 200 x 300 matrix (sequence length x embedding dimension)
3. RNN layer (LSTM, Bi-LSTM, etc.) or attention layer
4. Dense (fully connected) layer
5. Dropout layer
6. Optional additional dense layers
7. Softmax layer: final output
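
The final softmax layer turns the raw output scores into class probabilities that sum to 1. A minimal sketch of that computation follows; the function name `softmax` and the example scores are illustrative and are not taken from the notebook.

```python
import math

def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 3.0])                 # scores for (negative, positive)
predicted_class = probs.index(max(probs))   # position of the largest probability
print(probs, predicted_class)
```

The predicted class is simply the position of the largest probability, which is how a two-unit softmax output is mapped to a negative or positive label.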


## Import libraries

In [None]:
# keras imports
from keras.datasets import imdb
from keras.models import Sequential
# Layers are imported from the public keras.layers namespace
from keras.layers import LSTM, Bidirectional, Dense, Dropout, Activation, Embedding
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from keras.optimizers import Adam, Adadelta
from keras.models import load_model
from keras.regularizers import l2

# Generic imports
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np, string, pickle, warnings, random
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

## Download data

In [None]:
import os

topWords = 50000      # vocabulary cap passed to imdb.load_data as num_words
MAX_LENGTH = 200      # number of tokens per review after padding/truncation
nb_classes = 2        # negative / positive
imdbDataPicklePath = 'imdbData.pickle'

# Download and cache the dataset on the first run; later runs reuse the pickle
if not os.path.exists(imdbDataPicklePath):
    imdbData = imdb.load_data(path='imdb.npz', num_words=topWords)

    with open(imdbDataPicklePath, 'wb') as handle:
        pickle.dump(imdbData, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(imdbDataPicklePath, 'rb') as pHandle:
    imdbData = pickle.load(pHandle)

(x_train, y_train), (x_test, y_test) = imdbData


In [None]:
stopWords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", \
             "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", \
             'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', \
             'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', \
             'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
             'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
             'at', 'by', 'for', 'with', 'about', 'between', 'into', 'through', 'during', 'before', 'after', \
             'above', 'below', 'to', 'from', 'off', 'over', 'then', 'here', 'there', 'when', 'where', 'why', \
             'how', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'own', 'same', 'so', \
             'than', 'too', 's', 't', 'will', 'just', 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
             've', 'y', 'ma']
word2Index = imdb.get_word_index()
index2Word = {v: k for k, v in word2Index.items()}
index2Word[0] = ""
sentimentDict = {0: 'Negative', 1: 'Positive'}

def getWordsFromIndexList(indexList):
    # Look up each index and join the words into a single string
    return " ".join(index2Word[index] for index in indexList)

def getSentiment(predictArray):
    # predictArray holds the predicted class id in its first position
    pred = int(predictArray[0])
    return sentimentDict[pred]

def getIndexFromWordList(wordList):
    # Map each word to its integer index in word2Index
    return [word2Index[word] for word in wordList]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [None]:
print(len(word2Index))

88584


In [None]:
print(getWordsFromIndexList(x_train[0]))

the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room titillate it so heart shows to years of every never going villaronga help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but pratfalls to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other tricky in of seen over landed for anyone of gilmore's br show's to whether from than out themselves history he name half some br of 'n odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but whe
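
The decoded text above reads as scrambled because `imdb.load_data` by default shifts every word index by `index_from=3` and reserves 0, 1, and 2 for padding, start-of-sequence, and out-of-vocabulary markers, while `imdb.get_word_index()` returns the unshifted ranks. A sketch of an offset-aware decoder is given below. `decode_review`, the placeholder strings, and the toy word index are hypothetical and are chosen so the example runs without downloading anything.

```python
def decode_review(index_list, word_to_index, index_from=3):
    """Turn a list of shifted word indices back into text."""
    # Shift every rank by index_from so it lines up with the encoded reviews
    index_to_word = {rank + index_from: word for word, rank in word_to_index.items()}
    # Reserved indices under the imdb.load_data defaults (placeholder names are assumptions)
    index_to_word.update({0: "<PAD>", 1: "<START>", 2: "<UNK>"})
    return " ".join(index_to_word.get(i, "<UNK>") for i in index_list)

toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
encoded_review = [1, 4, 5, 6, 7, 0, 0]
print(decode_review(encoded_review, toy_word_index))
```

With the real data, the same idea would be applied by passing `word2Index` in place of the toy dictionary.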

In [None]:
print(len(x_train[0]), x_train[0])

98 [43, 973, 1622, 1385, 458, 4468, 3941, 173, 256, 43, 838, 112, 670, 22665, 480, 284, 150, 172, 112, 167, 21631, 336, 385, 172, 4536, 1111, 17, 546, 447, 192, 2025, 19, 1920, 4613, 469, 43, 76, 1247, 17, 515, 17, 626, 19193, 62, 386, 8, 316, 8, 106, 2223, 5244, 480, 3785, 619, 1415, 215, 28, 52, 10311, 8, 107, 5952, 256, 31050, 7, 3766, 723, 43, 476, 400, 317, 7, 12118, 1029, 104, 381, 297, 2071, 194, 7486, 226, 21, 476, 480, 144, 5535, 28, 224, 104, 226, 1334, 283, 4472, 113, 103, 5345, 19, 178]


## Preprocess data

In [None]:
# Look up the index of every stop word once; a set gives fast membership tests
stopIndexSet = {word2Index[stopWord] for stopWord in stopWords if stopWord in word2Index}

# Drop the stop-word indices from every training review
x_train = [[index for index in indexList if index not in stopIndexSet]
           for indexList in x_train]
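
Two caveats apply to this filtering step. The indices in `x_train` are shifted by 3 relative to `word2Index` (the `index_from` default of `imdb.load_data`), and only the training split is filtered while `x_test` is left as it is. A sketch of a variant that accounts for both is shown below; `remove_stop_words` and the toy data are hypothetical stand-ins for the notebook's variables.

```python
def remove_stop_words(sequences, stop_words, word_to_index, index_from=3):
    """Drop stop-word tokens from sequences whose indices are shifted by index_from."""
    # Shift the stop-word ranks so they match the encoded reviews; skip unknown words
    stop_indices = {word_to_index[w] + index_from
                    for w in stop_words if w in word_to_index}
    return [[i for i in seq if i not in stop_indices] for seq in sequences]

toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
toy_stop_words = ["the", "was", "you're"]   # "you're" is missing from the toy index
toy_train = [[1, 4, 5, 6, 7]]               # <START> the film was brilliant
toy_test = [[1, 5, 6, 2]]                   # <START> film was <UNK>

# Apply the same filter to both splits so train and test are preprocessed alike
filtered_train = remove_stop_words(toy_train, toy_stop_words, toy_word_index)
filtered_test = remove_stop_words(toy_test, toy_stop_words, toy_word_index)
print(filtered_train, filtered_test)
```

Filtering both splits with one function keeps the training and evaluation inputs consistent, at the cost of changing the results reported in this notebook.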

## Data Padding

In [None]:
# Pad or truncate every review to MAX_LENGTH tokens so all inputs have the same shape.
# Shorter reviews are zero-padded at the end (padding='post'); longer reviews lose
# their leading tokens, because pad_sequences defaults to truncating='pre'.
trainX = pad_sequences(x_train, maxlen=MAX_LENGTH, padding='post', value=0.)
testX = pad_sequences(x_test, maxlen=MAX_LENGTH, padding='post', value=0.)

# One-hot encode the class labels
trainY = np_utils.to_categorical(y_train, num_classes=nb_classes)
testY = np_utils.to_categorical(y_test, num_classes=nb_classes)


In [None]:
print(len(trainX[0]), trainX[0])

200 [   43   973  1622  1385   458  4468  3941   173   256    43   838   112
   670 22665   480   284   150   172   112   167 21631   336   385   172
  4536  1111    17   546   447   192  2025    19  1920  4613   469    43
    76  1247    17   515    17   626 19193    62   386     8   316     8
   106  2223  5244   480  3785   619  1415   215    28    52 10311     8
   107  5952   256 31050     7  3766   723    43   476   400   317     7
 12118  1029   104   381   297  2071   194  7486   226    21   476   480
   144  5535    28   224   104   226  1334   283  4472   113   103  5345
    19   178     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     

In [None]:
print(len(trainY[0]), trainY[0], y_train[0])

2 [0. 1.] 1
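
The output above shows label `1` encoded as `[0. 1.]`. The sketch below reproduces that encoding and its inverse in plain Python; `to_one_hot` and `from_one_hot` are hypothetical names used only for illustration, not Keras functions.

```python
def to_one_hot(labels, num_classes):
    """Return one row per label with a 1.0 at the label's position."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

def from_one_hot(rows):
    """Recover each class id as the position of the largest entry (argmax)."""
    return [row.index(max(row)) for row in rows]

encoded_labels = to_one_hot([1, 0], num_classes=2)
decoded_labels = from_one_hot(encoded_labels)
print(encoded_labels, decoded_labels)
```

The inverse step is the same argmax that later converts softmax outputs back into class ids.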


## Network Parameters

In [None]:
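sgdOptimizer = 'adam'                   # optimizer name passed to model.compile
lossFun = 'categorical_crossentropy'    # matches the one-hot labels and softmax output
batchSize = 1024
numEpochs = 5
numHiddenNodes = 128                    # LSTM units per direction
EMBEDDING_SIZE = 300
denseLayer1Size = 256
denseLayer2Size = 128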

## Network Architecture

In [None]:
model = Sequential()

# Embedding layer trained from scratch (embedding size 300);
# mask_zero=True makes the following layers ignore the zero padding
model.add(Embedding(topWords, EMBEDDING_SIZE, input_length=MAX_LENGTH, mask_zero=True, name='embedding_layer'))

# Bidirectional LSTM; forward and backward outputs are concatenated (2 x 128 = 256 values)
model.add(Bidirectional(LSTM(numHiddenNodes), merge_mode='concat', name='bidi_lstm_layer'))

# Dense layers with dropout in between
model.add(Dense(denseLayer1Size, activation='relu', name='dense_1'))
model.add(Dropout(0.25, name='dropout'))
model.add(Dense(denseLayer2Size, activation='relu', name='dense_2'))

# Output layer: one softmax probability per class
model.add(Dense(nb_classes, activation='softmax', name='output'))

model.compile(loss=lossFun, optimizer=sgdOptimizer, metrics=["accuracy"])
# summary() prints the table itself, so it does not need to be wrapped in print()
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_layer (Embedding)  (None, 200, 300)          15000000  
_________________________________________________________________
bidi_lstm_layer (Bidirection (None, 256)               439296    
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896     
_________________________________________________________________
output (Dense)               (None, 2)                 258       
Total params: 15,538,242
Trainable params: 15,538,242
Non-trainable params: 0
__________________________________________
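
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used in this notebook. The helper function names are hypothetical, and the LSTM formula assumes the standard four-gate cell with one bias vector per gate, which matches the count printed above.

```python
def embedding_params(vocab_size, emb_dim):
    # One emb_dim-sized vector per vocabulary entry
    return vocab_size * emb_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    "embedding_layer": embedding_params(50000, 300),
    "bidi_lstm_layer": 2 * lstm_params(300, 128),   # forward + backward LSTM
    "dense_1": dense_params(2 * 128, 256),          # input is the concatenated Bi-LSTM output
    "dense_2": dense_params(256, 128),
    "output": dense_params(128, 2),
}
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name}: {value:,}")
print(f"total: {total:,}")
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.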

## Training the model

In [None]:
model.fit(trainX, trainY, batch_size=batchSize, epochs=numEpochs, verbose=1, validation_data=(testX, testY))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f91bea04278>
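
As a rough check on the training loop, the number of weight updates follows from the batch size. This sketch assumes the standard 25,000-review training split of the Keras IMDB dataset, which the notebook does not print; the variable names are illustrative.

```python
import math

num_train_samples = 25000   # assumption: size of the Keras IMDB training split
batch_size = 1024           # batchSize in the notebook
num_epochs = 5              # numEpochs in the notebook

# The last batch of each epoch is smaller when the data does not divide evenly
steps_per_epoch = math.ceil(num_train_samples / batch_size)
last_batch_size = num_train_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * num_epochs

print(steps_per_epoch, last_batch_size, total_updates)
```

With such a large batch size the model sees only a small number of updates per epoch, which is worth keeping in mind when choosing the number of epochs.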

## Model accuracy

In [None]:
score = model.evaluate(testX, testY, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], score[1]*100))

accuracy: 84.06%


In [None]:
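# Sequential.predict_classes is not available in recent TensorFlow/Keras releases;
# the argmax over the softmax output yields the same class ids
predY = np.argmax(model.predict(testX), axis=-1)
yPred = np_utils.to_categorical(predY, num_classes=nb_classes)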
print("Classification Report:\n")
print(classification_report(testY, yPred))

Classification Report:

              precision    recall  f1-score   support

           0       0.83      0.86      0.84     12500
           1       0.86      0.82      0.84     12500

   micro avg       0.84      0.84      0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000
 samples avg       0.84      0.84      0.84     25000
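
The precision, recall, and F1 columns in the report are derived from counts of true positives, false positives, and false negatives for each class. The sketch below shows the formulas; the function name and the counts are hypothetical and are not taken from this model's predictions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical counts for a single class
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

F1 is the harmonic mean of precision and recall, so it stays low whenever either of the two is low.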



## Save the TensorFlow Model

In [None]:
model.save('imdb_bi_lstm_tensorflow_model.hdf5')

## Load the TensorFlow Model

In [None]:
loaded_model = load_model('imdb_bi_lstm_tensorflow_model.hdf5')
loaded_model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_layer (Embedding)  (None, 200, 300)          15000000  
_________________________________________________________________
bidi_lstm_layer (Bidirection (None, 256)               439296    
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896     
_________________________________________________________________
output (Dense)               (None, 2)                 258       
Total params: 15,538,242
Trainable params: 15,538,242
Non-trainable params: 0
__________________________________________

## Testing the model

In [None]:
num = 167
print("Testing for test case..." + str(num))
groundTruth = testY[num]

# Slice rather than index so the sample keeps its batch dimension: shape (1, MAX_LENGTH)
sampleX = testX[num:num + 1]
# Argmax over the softmax output gives the predicted class id
predictionClass = np.argmax(loaded_model.predict(sampleX, verbose=0), axis=-1)
prediction = np_utils.to_categorical(predictionClass, num_classes=nb_classes)[0]

# Print the text of the same review that was scored
print("Text: " + str(getWordsFromIndexList(x_test[num])))
print("\nPrediction: " + str(getSentiment(predictionClass)))
if np.array_equal(groundTruth, prediction):
    print("\nPrediction is Correct")
else:
    print("\nPrediction is Incorrect")

Testing for test case...167
Text: the expert enters accession epos about right necks seen nurse everybody this as klutz yourself must lives not what would vain about almost film instructor evil including this early her painted and you has is found like it give making creatures floriane to exciting anderson special custody thing does when amount in lindsay but to eye boll there of questioned disbelief br written falls father vans intellectual me boat some br allows who affection to rings just idea to as you had 140 cows sorts cause is quite br performances dance this about hain friends of corporate moments camera always point between expert enters technically way is him ben it's spinning feels about police feeling after had concern clearly to would monsters good along watches for well given at mishaps it's themed order going thai horror in society as not sees all question expert well another she get epic message as significant attacked attempts viewing and interesting is very humanity b
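
To score text that is not part of the dataset, it first has to be encoded the same way as the training data. The sketch below is one hedged way to do that under the `imdb.load_data` defaults (start marker 1, out-of-vocabulary marker 2, indices shifted by 3) and the post-padding used in this notebook. `encode_text` and the toy word index are hypothetical; with the real data, `word2Index`, `topWords`, and `MAX_LENGTH` would take their place.

```python
def encode_text(text, word_to_index, num_words, max_length,
                index_from=3, start_char=1, oov_char=2):
    """Encode raw text into a fixed-length list of shifted word indices."""
    encoded = [start_char]
    for word in text.lower().split():
        rank = word_to_index.get(word)
        index = rank + index_from if rank is not None else oov_char
        # Indices at or beyond the vocabulary cap count as out-of-vocabulary
        encoded.append(index if index < num_words else oov_char)
    # Keep the last max_length tokens, then zero-pad at the end
    encoded = encoded[-max_length:]
    return encoded + [0] * (max_length - len(encoded))

toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
sample = encode_text("The film was truly brilliant", toy_word_index,
                     num_words=50000, max_length=8)
print(sample)
```

The resulting list can be wrapped in a NumPy array with a leading batch dimension before it is passed to the model's `predict` method.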