## PROJECT OBJECTIVE: 
Build a sequential NLP classifier that takes review text as input
and predicts the customer's sentiment.

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as "only consider the top 10,000 most common words".
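- As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data()`, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word, so the indexes of actual words are shifted by 3 in the loaded data.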
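
To make the frequency-based indexing concrete, here is a minimal sketch on a toy corpus. The corpus, the helper `encode`, and the `oov_code` parameter are illustrative assumptions and are not part of the Keras API; the sketch only shows ranking words by frequency and keeping the top `top_k`.

```python
from collections import Counter

# Toy corpus standing in for the IMDB reviews (hypothetical data).
corpus = [
    "the movie was great",
    "the movie was bad",
    "the plot was thin",
]

# Rank words by overall frequency: rank 1 is the most frequent word.
# Ties are broken alphabetically so the ranking is deterministic.
counts = Counter(word for review in corpus for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_rank = {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(review, top_k, oov_code):
    """Encode a review as frequency ranks; words outside the top_k get oov_code."""
    encoded = []
    for word in review.split():
        rank = word_rank.get(word)
        keep = rank is not None and rank <= top_k
        encoded.append(rank if keep else oov_code)
    return encoded


# Keep only the 3 most frequent words; 0 is the placeholder chosen for this sketch.
print(encode("the movie was great", top_k=3, oov_code=0))
```

Because ranks are ordered by frequency, restricting the vocabulary reduces to a single comparison (`rank <= top_k`), which is the kind of quick filtering the dataset description refers to.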


### Import the data 
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [1]:
# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
from tensorflow.keras.datasets import imdb
# Keep only the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [3]:
x_train.shape

(25000,)

In [4]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

### Pad each sentence to be of same length
- Take maximum sequence length as 300

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_len = 300
X_train = pad_sequences(maxlen=max_len, sequences=x_train, padding="post")
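
As a rough illustration of what the padding step does, the sketch below re-implements post-padding in plain Python. The helper `pad_post` is hypothetical and written only for this example; it assumes the Keras default of truncating long sequences from the front (`truncating="pre"`), since the cell above does not override that argument.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items,
    mirroring the assumed default truncating="pre".
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


toy_sequences = [[1, 2], [1, 2, 3, 4, 5]]
print(pad_post(toy_sequences, maxlen=3))
```

Every row ends up with the same length, which is what lets the reviews be stacked into a single 2-D array for the model.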

### Print shape of features & labels 

Number of reviews and number of words in each review

In [6]:
print("X_train review:",X_train.shape[0])
print("X_test review:",x_test.shape[0])
print("Total review:",x_test.shape[0] + X_train.shape[0])

X_train review: 25000
X_test review: 25000
Total review: 50000


In [7]:
print("number of words in each review for x_train without padding")
for i, review in enumerate(x_train, start=1):
    print("Review", i, len(review))

Streaming output truncated to the last 5000 lines.
Review 20001 252
Review 20002 118
Review 20003 67
Review 20004 198
Review 20005 53
Review 20006 525
Review 20007 151
Review 20008 123
Review 20009 263
Review 20010 179
Review 20011 427
Review 20012 151
Review 20013 114
Review 20014 122
Review 20015 172
Review 20016 182
Review 20017 135
Review 20018 220
Review 20019 487
Review 20020 432
Review 20021 99
Review 20022 975
Review 20023 157
Review 20024 63
Review 20025 371
Review 20026 212
Review 20027 156
Review 20028 54
Review 20029 723
Review 20030 178
Review 20031 222
Review 20032 38
Review 20033 225
Review 20034 134
Review 20035 232
Review 20036 261
Review 20037 199
Review 20038 193
Review 20039 238
Review 20040 349
Review 20041 107
Review 20042 204
Review 20043 47
Review 20044 156
Review 20045 536
Review 20046 402
Review 20047 552
Review 20048 175
Review 20049 135
Review 20050 771
Review 20051 138
Review 20052 255
Review 20053 472
Review 20054 556
Review 20055 94
Review 2

In [8]:
print("number of words in each review for x_test without padding")
for i, review in enumerate(x_test, start=1):
    print("Review", i, len(review))

Streaming output truncated to the last 5000 lines.
Review 20001 199
Review 20002 361
Review 20003 214
Review 20004 295
Review 20005 131
Review 20006 135
Review 20007 185
Review 20008 238
Review 20009 180
Review 20010 36
Review 20011 131
Review 20012 325
Review 20013 229
Review 20014 475
Review 20015 165
Review 20016 160
Review 20017 68
Review 20018 371
Review 20019 554
Review 20020 118
Review 20021 208
Review 20022 232
Review 20023 74
Review 20024 367
Review 20025 359
Review 20026 183
Review 20027 300
Review 20028 46
Review 20029 201
Review 20030 218
Review 20031 235
Review 20032 210
Review 20033 133
Review 20034 552
Review 20035 415
Review 20036 189
Review 20037 235
Review 20038 223
Review 20039 189
Review 20040 239
Review 20041 47
Review 20042 127
Review 20043 92
Review 20044 454
Review 20045 308
Review 20046 902
Review 20047 332
Review 20048 201
Review 20049 140
Review 20050 804
Review 20051 131
Review 20052 166
Review 20053 268
Review 20054 175
Review 20055 420
Review
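
Printing every review length produces thousands of lines. As a hedged alternative, a short summary of the lengths can show how many reviews fit within the chosen maximum length. The helper `length_summary` and the toy data below are assumptions for illustration; with the notebook's data one would pass `x_train` (or `x_test`) and `max_len`.

```python
from statistics import mean, median


def length_summary(reviews, max_len):
    """Summarise review lengths and the share that fits within max_len."""
    lengths = [len(review) for review in reviews]
    within = sum(1 for n in lengths if n <= max_len)
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "share_within_max_len": within / len(lengths),
    }


# Toy reviews of length 3, 5 and 400 (hypothetical data).
toy_reviews = [[1, 14, 22], [1, 4, 9, 7, 2], [1, 5] * 200]
summary = length_summary(toy_reviews, max_len=300)
for name, value in summary.items():
    print(name, value)
```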

Number of labels

In [9]:
y_train.shape

(25000,)

In [10]:
y_test.shape

(25000,)

### Print value of any one feature and its label

Feature value

In [11]:
X_train[1]

array([   1,  194, 1153,  194, 8255,   78,  228,    5,    6, 1463, 4369,
       5012,  134,   26,    4,  715,    8,  118, 1634,   14,  394,   20,
         13,  119,  954,  189,  102,    5,  207,  110, 3103,   21,   14,
         69,  188,    8,   30,   23,    7,    4,  249,  126,   93,    4,
        114,    9, 2300, 1523,    5,  647,    4,  116,    9,   35, 8163,
          4,  229,    9,  340, 1322,    4,  118,    9,    4,  130, 4901,
         19,    4, 1002,    5,   89,   29,  952,   46,   37,    4,  455,
          9,   45,   43,   38, 1543, 1905,  398,    4, 1649,   26, 6853,
          5,  163,   11, 3215,    2,    4, 1153,    9,  194,  775,    7,
       8255,    2,  349, 2637,  148,  605,    2, 8003,   15,  123,  125,
         68,    2, 6853,   15,  349,  165, 4362,   98,    5,    4,  228,
          9,   43,    2, 1157,   15,  299,  120,    5,  120,  174,   11,
        220,  175,  136,   50,    9, 4373,  228, 8255,    5,    2,  656,
        245, 2350,    5,    4, 9837,  131,  152,  4

Label value

In [12]:
y_train[1]

0

### Decode the feature value to get original sentence 

First, retrieve the dictionary that maps each word to its index in the IMDB dataset

In [13]:
word_index = imdb.get_word_index()                                    


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [14]:
word_index

{'fawn': 34701,
 'tsukino': 52006,
 'nunnery': 52007,
 'sonja': 16816,
 'vani': 63951,
 'woods': 1408,
 'spiders': 16115,
 'hanging': 2345,
 'woody': 2289,
 'trawling': 52008,
 "hold's": 52009,
 'comically': 11307,
 'localized': 40830,
 'disobeying': 30568,
 "'royale": 52010,
 "harpo's": 40831,
 'canet': 52011,
 'aileen': 19313,
 'acurately': 52012,
 "diplomat's": 52013,
 'rickman': 25242,
 'arranged': 6746,
 'rumbustious': 52014,
 'familiarness': 52015,
 "spider'": 52016,
 'hahahah': 68804,
 "wood'": 52017,
 'transvestism': 40833,
 "hangin'": 34702,
 'bringing': 2338,
 'seamier': 40834,
 'wooded': 34703,
 'bravora': 52018,
 'grueling': 16817,
 'wooden': 1636,
 'wednesday': 16818,
 "'prix": 52019,
 'altagracia': 34704,
 'circuitry': 52020,
 'crotch': 11585,
 'busybody': 57766,
 "tart'n'tangy": 52021,
 'burgade': 14129,
 'thrace': 52023,
 "tom's": 11038,
 'snuggles': 52025,
 'francesco': 29114,
 'complainers': 52027,
 'templarios': 52125,
 '272': 40835,
 '273': 52028,
 'zaniacs': 52130,

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [15]:
# Invert the mapping: index -> word
reverse_word_index = {value: key for key, value in word_index.items()}


In [16]:
decoded_review = ' '.join([reverse_word_index.get(i - 3, "") for i in X_train[12]])
decoded_review

" i love cheesy horror flicks i don't care if the acting is sub par or whether the monsters look corny i liked this movie except for the  feeling all the way from the beginning of the film to the very end look i don't need a 10 page  or a sign with big letters explaining a plot to me but dark floors takes the what is this movie about thing to a whole new annoying level what is this movie about br br this isn't exceptionally scary or thrilling but if you have an hour and a half to kill and or you want to end up feeling frustrated and confused rent this winner                                                                                                                                                                                       "

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [17]:
print("positive" if y_train[12] == 1 else "negative")

positive
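
The `i - 3` offset used when decoding the review comes from the loader's reserved codes. The sketch below replays the idea on a miniature dictionary. `toy_word_index`, the `RESERVED` labels and the helper `decode` are assumptions made for this example; only the offset of 3 and the meaning of codes 0, 1 and 2 reflect the defaults of `imdb.load_data()`.

```python
# Hypothetical miniature of the dictionary returned by imdb.get_word_index():
# word -> frequency rank, starting at 1.
toy_word_index = {"the": 1, "and": 2, "movie": 3, "great": 4}
toy_reverse = {rank: word for word, rank in toy_word_index.items()}

INDEX_FROM = 3  # real words start after the three reserved codes
RESERVED = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # labels chosen for this sketch


def decode(encoded):
    """Turn a list of integer codes back into a readable string."""
    words = []
    for code in encoded:
        if code in RESERVED:
            words.append(RESERVED[code])
        else:
            # Shift back by the offset to recover the frequency rank.
            words.append(toy_reverse.get(code - INDEX_FROM, "<unk>"))
    return " ".join(words)


print(decode([1, 4, 6, 2, 7, 0, 0]))
```

The notebook's one-liner maps the reserved codes to empty strings instead of visible markers, which is why the decoded review contains gaps and a long run of trailing spaces.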


### Define model
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require us to one-hot encode our words; instead, each word needs a unique integer id. For the IMDB dataset we've loaded, this has already been done; otherwise, we could build the mapping ourselves, for example with sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Dimension of the dense embedding will be 100
  
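- Add LSTM layer
  - Leave `return_sequences` at its default (`False`), as the model code does, so the layer returns only its final output, which feeds the classifier
- Add Dense output layer with a single sigmoid unit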
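
To illustrate why integer ids are enough for an embedding layer, the sketch below shows that looking up a row of the embedding table gives the same vector as multiplying a one-hot vector by that table. The tiny `embedding_matrix` and the helpers are hypothetical; a real layer learns its table during training.

```python
# Hypothetical embedding table: 5 token ids, 3 dimensions each.
embedding_matrix = [
    [0.0, 0.0, 0.0],  # id 0
    [0.1, 0.2, 0.3],  # id 1
    [0.4, 0.5, 0.6],  # id 2
    [0.7, 0.8, 0.9],  # id 3
    [1.0, 1.1, 1.2],  # id 4
]


def embed(sequence):
    """Embedding lookup: one row of the table per token id."""
    return [embedding_matrix[token_id] for token_id in sequence]


def one_hot(token_id, size):
    """One-hot row vector with a 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


def vec_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(n_cols)]


# Both routes select the same row for token id 2.
print(embed([2])[0])
print(vec_times_matrix(one_hot(2, 5), embedding_matrix))
```

The lookup avoids building a 10,000-wide one-hot vector per word, which is the practical reason the layer takes integer ids directly.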


In [18]:
# Import only the pieces the model below actually uses
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense, Input


In [19]:
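model3 = Sequential()
# Each input is one padded review of max_len word indexes
model3.add(Input(shape=(max_len,)))
# 10,000-word vocabulary, 100-dimensional embeddings; the sequence length
# is already fixed by the Input layer, so input_length is not repeated here
model3.add(Embedding(input_dim=10000, output_dim=100))
# 100 LSTM units; only the final output is returned
model3.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# Single sigmoid unit: probability that the review is positive
model3.add(Dense(1, activation="sigmoid"))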



### Compile the model 
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [20]:
model3.compile(optimizer="Adam", loss="binary_crossentropy", metrics=["accuracy"])
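
For reference, binary cross-entropy averages the negative log-probability that the model assigns to the true label. The plain-Python version below is a sketch of that formula, not the Keras implementation; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident and correct predictions give a small loss ...
print(binary_crossentropy([1, 0], [0.9, 0.1]))
# ... while confident and wrong predictions are penalised heavily.
print(binary_crossentropy([1, 0], [0.1, 0.9]))
```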

### Print model summary

In [21]:
model3.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 1,080,501
Trainable params: 1,080,501
Non-trainable params: 0
_________________________________________________________________
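
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are chosen for this example.

```python
vocab_size = 10000  # input_dim of the Embedding layer
embed_dim = 100     # output_dim of the Embedding layer
lstm_units = 100    # units in the LSTM layer

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# One weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The four values agree with the `Param #` column and the total reported by `model3.summary()`; most of the parameters sit in the embedding table.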


### Fit the model

In [22]:
import numpy as np
history = model3.fit(X_train, np.array(y_train), batch_size=32, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


### Evaluate model

In [23]:
y_pred = model3.predict(X_train)


In [24]:
from sklearn.linear_model import LogisticRegression

# Fit a logistic regression on the LSTM's predicted probabilities (train set).
# It learns a decision rule that maps each probability to a 0/1 label.
LR_model = LogisticRegression(solver="liblinear")
LR_model.fit(y_pred, y_train)




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [25]:
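# Predict 0/1 labels on train from the LSTM probabilities
y_predict = LR_model.predict(y_pred)
# Reshape to a column so it can be passed to score() as a feature matrix
y_predict = y_predict.reshape(-1, 1)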

In [26]:
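# Caution: score() treats its first argument as input features, so the hard
# 0/1 labels in y_predict are fed back through LR_model here. The value is
# not a direct accuracy of the LSTM's probabilities.
model_score = LR_model.score(y_predict, y_train)
print("train model score", model_score)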

train model score 0.62168


In [27]:
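# Predict on test: pad the test reviews the same way as the training reviews
X_test = pad_sequences(maxlen=max_len, sequences=x_test, padding="post")
y_pred_test = model3.predict(X_test)
# Map the LSTM probabilities to 0/1 labels with the fitted logistic regression
y_predict_test = LR_model.predict(y_pred_test)
y_predict_test = y_predict_test.reshape(-1, 1)
# Same caveat as on train: the hard labels are passed to score() as features
test_model_score = LR_model.score(y_predict_test, y_test)
print("test model score", test_model_score)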

test model score 0.58504
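
A more direct way to score a sigmoid classifier is to threshold its probabilities and compare the resulting labels with the ground truth. The sketch below shows that idea on toy numbers. The helpers `to_labels` and `accuracy`, the toy data, and the 0.5 cut-off are assumptions for illustration, and no result for the notebook's model is claimed here.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities to 0/1 labels with a fixed cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_hat):
    """Share of predictions that match the true labels."""
    matches = sum(1 for t, h in zip(y_true, y_hat) if t == h)
    return matches / len(y_true)


# Toy probabilities and labels (hypothetical data).
toy_probs = [0.91, 0.12, 0.55, 0.40]
toy_true = [1, 0, 0, 0]

toy_pred = to_labels(toy_probs)
print(toy_pred)
print(accuracy(toy_true, toy_pred))
```

With the notebook's variables, the same helpers could be applied to `y_pred_test.ravel()` and `y_test`. Since the model was compiled with `metrics=["accuracy"]`, `model3.evaluate(X_test, y_test)` should report a comparable figure.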


### Predict on one sample

In [28]:
i = 22
# Use a separate loop variable so the sample index `i` is not shadowed
decoded_review1 = ' '.join([reverse_word_index.get(code - 3, "") for code in X_test[i]])
decoded_review1

" how managed to avoid attention remains a mystery a potent mix of comedy and crime this one takes chances where tarantino plays it safe with the hollywood formula the risks don't always pay off one character in one sequence comes off  silly and falls flat in the lead role thomas jane gives a wonderful and complex performance and two brief appearances by mickey rourke hint at the high potential of this much under and  used actor here's a director one should keep one's eye on                                                                                                                                                                                                                    "

In [29]:
print("labelled value")
print("positive" if y_test[i] == 1 else "negative")

labelled value
positive


In [30]:
print("Predicted value")
print("positive" if y_predict_test[i] == 1 else "negative")

Predicted value
positive
