![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
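- As a convention, "0" does not stand for a specific word. With the default `load_data()` arguments used in this notebook, 0 is the padding value, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.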
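
The rank-based filtering described in the list above can be sketched in plain Python. This is an illustration only, not the Keras implementation: `filter_by_rank` and its `placeholder` argument are hypothetical names introduced here, and `-1` is an arbitrary marker. In Keras itself, the `num_words` and `skip_top` arguments of `imdb.load_data()` serve this purpose.

```python
def filter_by_rank(sequence, num_words, skip_top, placeholder):
    """Keep indices with skip_top <= index < num_words; replace the rest.

    Hypothetical helper for illustration. Because words are indexed by
    frequency rank, a simple range check drops both the very common and
    the very rare words.
    """
    return [i if skip_top <= i < num_words else placeholder for i in sequence]


# Toy encoded review: small numbers are frequent words, large ones are rare.
encoded = [3, 15, 42, 9999, 10000, 25000, 19, 20]
filtered = filter_by_rank(encoded, num_words=10000, skip_top=20, placeholder=-1)
print(filtered)  # [-1, -1, 42, 9999, -1, -1, -1, 20]
```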

Command to import data
- `from tensorflow.keras.datasets import imdb`

In [1]:
# Seed Python's built-in random number generator (this does not seed NumPy or TensorFlow)
import random
random.seed(0)

# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

### Import the data (2 Marks)
- Use the `imdb.load_data()` method
- Get the train and test sets
- Keep only the 10,000 most frequent words

In [2]:
from tensorflow.keras.datasets import imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


### Pad each review to the same length (2 Marks)
- Use a maximum sequence length of 300

In [3]:
from tensorflow.keras.preprocessing import sequence
max_length = 300
X_train = sequence.pad_sequences(X_train, maxlen=max_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_length)
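
To make the padding behaviour concrete, here is a pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`): short sequences get zeros on the left, and long sequences lose their earliest tokens. `pad_pre` is a hypothetical helper written for this illustration and is not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # zeros go in front
    return padded


toy = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
result = pad_pre(toy, maxlen=5)
print(result)  # [[0, 0, 1, 14, 22], [5, 6, 7, 8, 9]]
```

With pre-padding, any review shorter than 300 tokens will start with a run of zeros after this step.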

### Print shape of features & labels (2 Marks)

Number of reviews and number of words in each review

In [4]:
import numpy as np
print("Number of unique words:", len(np.unique(np.hstack(X_train))))
length = [len(i) for i in X_train]
print("Average Review length:", np.mean(length))
print("Number of Reviews and words:", X_train.shape)

Number of unique words: 9999
Average Review length: 300.0
Number of Reviews and words: (25000, 300)


In [5]:
print("Number of unique words:", len(np.unique(np.hstack(X_test))))
length = [len(i) for i in X_test]
print("Average Review length:", np.mean(length))
print("Number of Reviews and words:", X_test.shape)

Number of unique words: 9943
Average Review length: 300.0
Number of Reviews and words: (25000, 300)


Number of labels

In [6]:
print("labels:", np.unique(y_train))
print("Number of labels:", y_train.shape)

labels: [0 1]
Number of labels: (25000,)


In [7]:
print("labels:", np.unique(y_test))
print("Number of labels:", y_test.shape)

labels: [0 1]
Number of labels: (25000,)


### Print value of any one feature and its label (2 Marks)

Feature value

In [8]:
print(X_train[0])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    1   14
   22   16   43  530  973 1622 1385   65  458 4468   66 3941    4  173
   36  256    5   25  100   43  838  112   50  670    2    9   35  480
  284    5  150    4  172  112  167    2  336  385   39    4  172 4536
 1111   17  546   38   13  447    4  192   50   16    6  147 2025   19
   14   22    4 1920 4613  469    4   22   71   87   12   16   43  530
   38   76   15   13 1247    4   22   17  515   17   12   16  626   18
    2    5   62  386   12    8  316    8  106    5    4 2223 5244   16
  480   66 3785   33    4  130   12   16   38  619    5   25  124   51
   36 

Label value

In [9]:
print("Label:", y_train[0])

Label: 1


### Decode the feature value to get original sentence (2 Marks)

First, retrieve the dictionary that maps each word to its index in the IMDB dataset

In [10]:
index = imdb.get_word_index()
reverse_index = {value: key for key, value in index.items()}

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


Now use the dictionary to get the original words from the encodings, for a particular sentence

In [11]:
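# Encoded indices are offset by 3 relative to the word index, so subtract 3;
# anything without a matching word (padding, start, unknown) prints as "0"
sentence = " ".join(reverse_index.get(i - 3, "0") for i in X_train[0])
print(sentence)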

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert 0 is an amazing actor and now the same being director 0 father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for 0 and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also 0 to the two little boy's that played the 0 of norman and paul they were just brilliant children are often left out of the 0 list i think because the st
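
The `i - 3` offset in the decoding step deserves a note. Assuming the default `load_data()` arguments (`start_char=1`, `oov_char=2`, `index_from=3`), indices 0, 1 and 2 are reserved for padding, start-of-review and out-of-vocabulary tokens, and every real word is shifted up by 3 relative to `get_word_index()`. The sketch below uses a made-up three-word index and hypothetical marker strings so that each reserved token is visible instead of printing as `0`.

```python
# Toy stand-in for imdb.get_word_index(); the words and ranks are made up.
toy_word_index = {"the": 1, "film": 2, "brilliant": 3}

INDEX_FROM = 3  # assumed offset, matching the load_data() default
reserved = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # hypothetical marker strings

# Shift every real word up by the offset, then add the reserved tokens.
decoder = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
decoder.update(reserved)

encoded = [0, 0, 1, 4, 5, 2, 6]
decoded = " ".join(decoder.get(i, "<unk>") for i in encoded)
print(decoded)  # <pad> <pad> <start> the film <unk> brilliant
```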

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [12]:
label = y_train[0]
if label == 1:
    print("positive")
else:
    print("negative")

positive


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer does not require one-hot encoded words. Instead, each word must be given a unique integer id. This has already been done for the IMDB dataset loaded here; otherwise, a tool such as scikit-learn's `LabelEncoder` could be used.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Set `return_sequences` to True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
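
Conceptually, an embedding layer is a trainable lookup table: row `i` of a `(vocabulary size, embedding dimension)` matrix is the vector for word id `i`. The NumPy sketch below illustrates only the lookup and the resulting shapes. The table is random rather than learned, and the variable names are chosen for this example.

```python
import numpy as np

vocab_size, embed_dim, seq_len = 10000, 100, 300

rng = np.random.default_rng(0)
# Stand-in for the layer's weight matrix; in a real model it is learned.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# A batch of 2 padded reviews, each a sequence of 300 integer word ids.
batch = rng.integers(low=0, high=vocab_size, size=(2, seq_len))

# Integer-array indexing replaces each id with its row of the table.
embedded = embedding_table[batch]
print(embedded.shape)  # (2, 300, 100)
```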

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM, TimeDistributed
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Embedding(10000, 100, input_length=300))
model.add(LSTM(units=100, return_sequences=True))
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

### Compile the model (2 Marks)
- Use Adam as the optimizer
- Use binary cross-entropy as the loss
- Use accuracy as the metric

In [14]:
optimizer = Adam()
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
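
For reference, binary cross-entropy for a label `y` in {0, 1} and a predicted probability `p` is `-(y * log(p) + (1 - y) * log(1 - p))`, averaged over the batch. A minimal NumPy sketch follows. The function name and the clipping constant `eps` are choices made for this illustration, not Keras internals.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(losses.mean())


confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```

The loss is small when the predicted probability agrees with the label and grows quickly for confident mistakes.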

### Print model summary (2 Marks)

In [15]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 300, 100)          80400     
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          10100     
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 30001     
Total params: 1,120,501
Trainable params: 1,120,501
Non-trainable params: 0
_________________________________________________________________
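
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sanity check that assumes the standard layouts: an LSTM has four gates, each with input weights, recurrent weights and a bias, and a Dense layer has one weight per input-output pair plus one bias per output.

```python
vocab_size, embed_dim, seq_len = 10000, 100, 300
lstm_units, td_units = 100, 100

# One embed_dim-sized vector per vocabulary entry.
embedding = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights and a bias.
lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# The same Dense(100) weights are shared across all 300 time steps.
time_distributed = lstm_units * td_units + td_units
# Flatten only reshapes (300, 100) into 30000 values.
flatten = 0
# Single sigmoid unit on the flattened 30000 inputs, plus one bias.
dense = seq_len * td_units * 1 + 1

total = embedding + lstm + time_distributed + flatten + dense
print(embedding, lstm, time_distributed, dense)  # 1000000 80400 10100 30001
print(total)  # 1120501
```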


### Fit the model (2 Marks)

In [16]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=128, verbose=1)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fee482ab208>

### Evaluate model (2 Marks)

In [17]:
evaluation = model.evaluate(X_test, y_test, verbose=0)
print("Evaluation_Accuracy: %.2f%%" % (evaluation[1]*100))

Evaluation_Accuracy: 85.87%


### Predict on one sample (2 Marks)

In [18]:
y_pred = (model.predict(X_test, batch_size=128) > 0.5).astype("int32")
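
The thresholding step can be checked in isolation. The sketch below uses made-up sigmoid outputs with the same `(n_samples, 1)` shape that a single-unit output layer produces. The probabilities are invented for illustration.

```python
import numpy as np

# Made-up sigmoid outputs, one column per sample as a single-unit layer returns.
probs = np.array([[0.03], [0.50], [0.51], [0.97]])

# Strictly greater than 0.5 maps to class 1; exactly 0.5 falls to class 0.
labels = (probs > 0.5).astype("int32")
print(labels.ravel())  # [0 0 1 1]
print(labels.shape)    # (4, 1)
```

Because the result keeps the trailing dimension of size 1, indexing a single prediction returns a one-element array rather than a plain integer.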

In [19]:
sample_sentence = " ".join(reverse_index.get(i - 3, "0") for i in X_test[0])
print(sample_sentence)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 please give this one a miss br br 0 0 and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite 0 so all you madison fans give this a miss


In [20]:
y_pred[0]

array([0], dtype=int32)

In [21]:
y_test[0]

0

In [22]:
sample_sentence = " ".join(reverse_index.get(i - 3, "0") for i in X_test[10])
print(sample_sentence)

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 inspired by hitchcock's strangers on a train concept of two men 0 murders in exchange for getting rid of the two people messing up their lives throw 0 from the train is an original and very inventive comedy take on the idea it's a credit to danny 0 that he both wrote and starred in this minor comedy gem br br anne 0 is the mother who 0 the film's title and it's understandable why she gets under the skin of danny 0 with her sharp tongue and relentlessly putting him down for any minor 0 billy crystal is the writer who's wife has stolen his book idea and is now being 0 as a great new author even appearing on the oprah show to in 0 he should be enjoying thus 0 gets the idea of 0 murders to rid themselves of these 0 factors br br of course everything and anything can happen when writer carl 0 lets his imaginat

In [23]:
y_pred[10]

array([1], dtype=int32)

In [24]:
y_test[10]

1

# Final Insights


*   All tasks were completed successfully.
*   The model starts to overfit after only a few epochs.
*   Accuracy stayed close to 87% across runs with different numbers of LSTM units, and adding dropout produced similar results.
*   Predictions on the two sampled test reviews (indices 0 and 10) matched their true labels.




