![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- Dataset of 50,000 movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that a smaller integer means a more frequent word (for instance, rank 3 is the 3rd most frequent word in the data). This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
- As a convention, "0" does not stand for a specific word. With the default `imdb.load_data()` arguments, indices are shifted by 3 (`index_from=3`): "0" is used for padding, "1" marks the start of a review, and "2" encodes any unknown (out-of-vocabulary) word.
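
The rank-based filtering described above can be sketched in plain Python. This is a minimal illustration on a made-up sequence, not the library's own code: the helper name `filter_by_rank` is hypothetical, and replacing out-of-range indices with `2` is an assumption that mirrors the default `oov_char` of `imdb.load_data`.

```python
def filter_by_rank(sequence, num_words=10000, skip_top=20, oov_char=2):
    """Keep indices in [skip_top, num_words); replace the rest with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in sequence]


# Toy encoded review (made-up indices, for illustration only).
toy_review = [1, 14, 22, 16, 43, 530, 12000, 9999]
filtered = filter_by_rank(toy_review)
print(filtered)  # [2, 2, 22, 2, 43, 530, 2, 9999]
```

In practice, the `num_words` and `skip_top` arguments of `imdb.load_data` apply this kind of filtering while loading.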

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (2 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [1]:
from tensorflow.keras.datasets import imdb

(train_X, train_Y), (test_X, test_Y) = imdb.load_data(num_words=10000)

In [2]:
print("train_X : {}".format(train_X.shape))
print("train_Y : {}".format(train_Y.shape))
print("test_X : {}".format(test_X.shape))
print("test_Y : {}".format(test_Y.shape))

train_X : (25000,)
train_Y : (25000,)
test_X : (25000,)
test_Y : (25000,)


In [3]:
import numpy as np

X = np.concatenate((train_X, test_X), axis=0)
y = np.concatenate((train_Y, test_Y), axis=0)

### Pad each sentence to be of same length (2 Marks)
- Take maximum sequence length as 300

In [4]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_X = pad_sequences(train_X, maxlen=300)
test_X = pad_sequences(test_X, maxlen=300)
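
To make the effect of padding concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_pre` is hypothetical and the sequences are made up; it only handles a single sequence.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    trimmed = list(sequence)[-maxlen:]          # drop items from the front if too long
    return [value] * (maxlen - len(trimmed)) + trimmed


short = pad_pre([5, 6, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)  # [0, 0, 5, 6, 7]
print(long)   # [3, 4, 5, 6]
```

This is why short reviews start with a run of zeros and long reviews lose their opening words.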

### Print shape of features & labels (2 Marks)

Number of reviews and number of words in each review

In [5]:
print("Number of Reviews")
print(train_X.shape[0])

Number of Reviews
25000


In [6]:
for idx, j in enumerate(train_X):
  print("Review no. {} has {} words.".format((idx+1),len(j)))

Streaming output truncated to the last 5000 lines.
Review no. 20002 has 300 words.
Review no. 20003 has 300 words.
Review no. 20004 has 300 words.
Review no. 20005 has 300 words.
Review no. 20006 has 300 words.
Review no. 20007 has 300 words.
Review no. 20008 has 300 words.
Review no. 20009 has 300 words.
Review no. 20010 has 300 words.
Review no. 20011 has 300 words.
Review no. 20012 has 300 words.
Review no. 20013 has 300 words.
Review no. 20014 has 300 words.
Review no. 20015 has 300 words.
Review no. 20016 has 300 words.
Review no. 20017 has 300 words.
Review no. 20018 has 300 words.
Review no. 20019 has 300 words.
Review no. 20020 has 300 words.
Review no. 20021 has 300 words.
Review no. 20022 has 300 words.
Review no. 20023 has 300 words.
Review no. 20024 has 300 words.
Review no. 20025 has 300 words.
Review no. 20026 has 300 words.
Review no. 20027 has 300 words.
Review no. 20028 has 300 words.
Review no. 20029 has 300 words.
Review no. 20030 has 300 words.
Review 

Number of labels

In [7]:
print("No. of labels {}.".format(len(np.unique(train_Y))))
print("Labels are {}.".format(np.unique(train_Y)))

No. of labels 2.
Labels are [0 1].


### Print value of any one feature and its label (2 Marks)

Feature value

In [8]:
train_X[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    1,   14,   22,   16,   43,  530,
        973, 1622, 1385,   65,  458, 4468,   66, 3941,    4,  173,   36,
        256,    5,   25,  100,   43,  838,  112,   50,  670,    2,    9,
         35,  480,  284,    5,  150,    4,  172,  112,  167,    2,  336,
        385,   39,    4,  172, 4536, 1111,   17,  546,   38,   13,  447,
          4,  192,   50,   16,    6,  147, 2025,   19,   14,   22,    4,
       1920, 4613,  469,    4,   22,   71,   87,   

Label value

In [9]:
train_Y[0]

1

### Decode the feature value to get original sentence (2 Marks)

First, retrieve a dictionary that maps each word to its index in the IMDB dataset

In [10]:
word_index = imdb.get_word_index()

In [11]:
# invert the mapping: index -> word
reverse_word_index = {value: key for (key, value) in word_index.items()}

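Now use the dictionary to map the encodings back to words for a particular sentence. Note that `imdb.load_data()` shifts every word index by 3 by default (`index_from=3`, reserving 0 for padding, 1 for the start of a review, and 2 for out-of-vocabulary words), so the direct lookup below yields shifted words rather than the original review. Looking up `item - 3` recovers the actual text.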

In [12]:
words = []
for item in train_X[0]:
  # padding zeros are skipped because 0 is not a key in the dictionary
  if item in reverse_word_index:
    words.append(reverse_word_index[item])

print(" ".join(words))

the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other and in of seen over landed for anyone of and br show's to whether from than out themselves history he name half some br of and odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but when from one bit then have tw
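
The output above does not read like a review because the encoded indices are offset from the ranks stored in the word index. A minimal sketch of an offset-aware decoder follows, assuming the default `index_from=3` of `imdb.load_data`. The helper name `decode_review`, the placeholder strings, and the toy vocabulary are all hypothetical and only stand in for the real `imdb.get_word_index()` dictionary.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map encoded integers back to words, undoing the index offset."""
    reverse_index = {idx: word for word, idx in word_index.items()}
    specials = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # reserved indices (assumed defaults)
    words = []
    for token in encoded:
        if token < index_from:
            words.append(specials.get(token, "<unk>"))
        else:
            words.append(reverse_index.get(token - index_from, "<unk>"))
    return " ".join(words)


# Toy vocabulary and sequence (made up for illustration).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
toy_encoded = [0, 0, 1, 4, 5, 6, 7, 2]
decoded = decode_review(toy_encoded, toy_word_index)
print(decoded)  # <pad> <pad> <start> the movie was great <unk>
```

With the real data, the same idea would be applied as `decode_review(train_X[0], word_index)`.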

Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [13]:
print("Sentiment of the above statement is {}".format(train_Y[0]))

Sentiment of the above statement is 1


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require us to one-hot encode our words. Instead, each word needs a unique integer id. For the IMDB dataset we've loaded this has already been done; if it hadn't, we could use sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer

In [14]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import TimeDistributed
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding


# create the model
model = Sequential()
model.add(Embedding(10000, 100, input_length=300))
model.add(LSTM(32, return_sequences=True))
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
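
As a rough illustration of what the Embedding layer does, the lookup can be sketched with NumPy as plain row indexing into a table. This is only a sketch under toy assumptions: the sizes are far smaller than in the model above, the table holds random values instead of learned weights, and the variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4                 # toy sizes; the model above uses 10000 and 100
embedding_table = rng.normal(size=(vocab_size, embed_dim))

encoded_review = np.array([0, 0, 3, 7, 3])    # made-up word indices
embedded = embedding_table[encoded_review]    # one dense row per token

print(embedded.shape)  # (5, 4)
```

The same index always maps to the same vector, so the two occurrences of index 3 produce identical rows.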

### Compile the model (2 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [15]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
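
For reference, binary crossentropy for a single example with label y and predicted probability p is -(y·log(p) + (1 - y)·log(1 - p)). The sketch below computes it in plain Python. The function name and the clipping constant `eps` are assumptions added here to avoid taking log(0), not something taken from this notebook.

```python
import math


def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Loss for one example: -(y*log(p) + (1-y)*log(1-p)), with p clipped."""
    p = min(max(p_pred, eps), 1.0 - eps)      # keep p strictly inside (0, 1)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


confident_right = binary_crossentropy(1, 0.99)   # small loss
confident_wrong = binary_crossentropy(0, 0.99)   # large loss
print(confident_right < confident_wrong)  # True
```

A confident wrong prediction is penalized far more heavily than a confident correct one.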

### Print model summary (2 Marks)

In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 300, 32)           17024     
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          3300      
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               7500250   
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
Total params: 8,520,825
Trainable params: 8,520,825
Non-trainable params: 0
______________________________________________
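
The parameter counts in the summary can be checked by hand. The following sketch recomputes them from the layer sizes used in the model definition, using the standard formulas for each layer type (an LSTM has four gates, each with input weights, recurrent weights, and a bias). The variable names are made up for illustration.

```python
vocab_size, embed_dim, seq_len = 10000, 100, 300
lstm_units, td_units, dense_units = 32, 100, 250

embedding_params = vocab_size * embed_dim
# 4 gates, each with (input + recurrent) weights plus a bias vector
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
# Dense(100) shared across all 300 time steps
time_distributed_params = lstm_units * td_units + td_units
# Flatten yields 300 * 100 = 30000 inputs to Dense(250)
dense_1_params = seq_len * td_units * dense_units + dense_units
dense_2_params = dense_units * 1 + 1

total_params = (embedding_params + lstm_params + time_distributed_params
                + dense_1_params + dense_2_params)
print(total_params)  # 8520825
```

Most of the parameters sit in the first Dense layer after Flatten, not in the LSTM.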

### Fit the model (2 Marks)

In [17]:
history = model.fit(train_X, train_Y, validation_data=(test_X, test_Y), epochs=10, batch_size=128, verbose=2)

Epoch 1/10
196/196 - 20s - loss: 0.3962 - accuracy: 0.8010 - val_loss: 0.2945 - val_accuracy: 0.8743
Epoch 2/10
196/196 - 17s - loss: 0.1815 - accuracy: 0.9310 - val_loss: 0.3131 - val_accuracy: 0.8712
Epoch 3/10
196/196 - 17s - loss: 0.0810 - accuracy: 0.9711 - val_loss: 0.4350 - val_accuracy: 0.8568
Epoch 4/10
196/196 - 17s - loss: 0.0292 - accuracy: 0.9892 - val_loss: 0.6628 - val_accuracy: 0.8553
Epoch 5/10
196/196 - 17s - loss: 0.0164 - accuracy: 0.9944 - val_loss: 0.8462 - val_accuracy: 0.8582
Epoch 6/10
196/196 - 18s - loss: 0.0149 - accuracy: 0.9945 - val_loss: 0.8925 - val_accuracy: 0.8524
Epoch 7/10
196/196 - 18s - loss: 0.0196 - accuracy: 0.9933 - val_loss: 0.8347 - val_accuracy: 0.8580
Epoch 8/10
196/196 - 18s - loss: 0.0118 - accuracy: 0.9962 - val_loss: 1.0166 - val_accuracy: 0.8507
Epoch 9/10
196/196 - 17s - loss: 0.0056 - accuracy: 0.9980 - val_loss: 1.0308 - val_accuracy: 0.8592
Epoch 10/10
196/196 - 17s - loss: 0.0071 - accuracy: 0.9971 - val_loss: 1.1680 - val_accura
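
In the log above, the training loss keeps falling while the validation loss rises after the first epoch, which is the usual sign of overfitting. A small sketch that picks the epoch with the lowest validation loss from the logged values (copied from the output above; the variable names are made up):

```python
train_loss = [0.3962, 0.1815, 0.0810, 0.0292, 0.0164,
              0.0149, 0.0196, 0.0118, 0.0056, 0.0071]
val_loss = [0.2945, 0.3131, 0.4350, 0.6628, 0.8462,
            0.8925, 0.8347, 1.0166, 1.0308, 1.1680]

# Epochs are 1-based in the Keras log, list positions are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # 1
```

One possible way to act on this, not used in this notebook, is to pass a `tensorflow.keras.callbacks.EarlyStopping` callback that monitors `val_loss` to `model.fit`.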

### Evaluate model (2 Marks)

In [18]:
model.evaluate(test_X, test_Y, batch_size=128)



[1.1680412292480469, 0.8610799908638]

### Predict on one sample (2 Marks)

In [19]:
model.predict(test_X[119:120])

array([[0.9858403]], dtype=float32)
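
The value returned by `model.predict` is the sigmoid output, which can be read as the predicted probability of the positive class. A minimal sketch of turning such probabilities into class labels with a 0.5 threshold follows. The helper name `to_class_labels` is hypothetical, and every probability in the list except the first is made up.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Return 1 where the probability exceeds the threshold, else 0."""
    return [int(p > threshold) for p in probabilities]


toy_probs = [0.9858403, 0.12, 0.5, 0.73]
labels = to_class_labels(toy_probs)
print(labels)  # [1, 0, 0, 1]
```

A probability of exactly 0.5 falls on the negative side because the comparison is strict.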

In [20]:
# `predict_classes` is deprecated; threshold the sigmoid output at 0.5 instead
(model.predict(test_X) > 0.5).astype("int32")

array([[1],
       [1],
       [1],
       ...,
       [0],
       [0],
       [1]], dtype=int32)

In [21]:
print("Predicted Value Of Model {}.".format((model.predict(test_X[119:120]) > 0.5).astype("int32")[0][0]))
print("Actual Value {}".format(test_Y[119:120][0]))

Predicted Value Of Model 1.
Actual Value 1
