
Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Sentiment Classification

### Dataset
- 50,000 movie reviews from IMDB, each labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
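- As a convention, "0" does not stand for a specific word. With the default settings of `imdb.load_data()`, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word, so real word indexes are offset by 3.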
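
The frequency-based filtering described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper name `filter_by_rank` and the toy review are hypothetical, and the out-of-vocabulary marker is assumed to be 2, the default `oov_char` of `imdb.load_data()`.

```python
def filter_by_rank(review, num_words=10000, skip_top=20, oov_char=2):
    """Keep indexes in [skip_top, num_words); replace everything else with oov_char."""
    return [i if skip_top <= i < num_words else oov_char for i in review]


# Hypothetical encoded review: a mix of very frequent, mid-range, and rare word indexes.
toy_review = [1, 14, 22, 9999, 10000, 15000, 530]
print(filter_by_rank(toy_review))
# prints [2, 2, 22, 9999, 2, 2, 530]
```

Indexes below `skip_top` (very common words) and at or above `num_words` (rare words) both collapse to the same marker, which is why an encoded review can contain many repeated 2s.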

Command to import data
- `from tensorflow.keras.datasets import imdb`

### Import the data (2 Marks)
- Use the `imdb.load_data()` method
- Get the train and test sets
- Keep the 10,000 most frequent words

In [1]:
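#### Add your code here ####
import numpy
from tensorflow.keras.datasets import imdb

# Keep the 10,000 most frequent words; skip_top=20 additionally maps the
# 20 most frequent words to the out-of-vocabulary index.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000, skip_top=20)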

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


### Pad each sentence to the same length (2 Marks)
- Use a maximum sequence length of 300

In [2]:
#### Add your code here ####
from tensorflow.keras.preprocessing import sequence

# Pad with zeros at the end; reviews longer than 300 tokens are truncated.
X_train = sequence.pad_sequences(X_train, maxlen=300, dtype='int32', padding='post', value=0)
X_test = sequence.pad_sequences(X_test, maxlen=300, dtype='int32', padding='post', value=0)
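
To make the effect of padding concrete, here is a small pure-Python sketch of post-padding. The helper `pad_post` is hypothetical and only approximates `pad_sequences(padding='post')`; it assumes the default `truncating='pre'`, so a sequence that is too long keeps its last `maxlen` tokens.

```python
def pad_post(seq, maxlen, value=0):
    """Post-pad with `value`; drop tokens from the front when the sequence is too long."""
    seq = list(seq)[-maxlen:]                    # keep at most the last `maxlen` tokens
    return seq + [value] * (maxlen - len(seq))   # fill the tail with the padding value


print(pad_post([7, 8, 9], maxlen=5))            # prints [7, 8, 9, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))   # prints [3, 4, 5, 6]
```

Every review ends up with exactly `maxlen` entries, which is what lets the whole dataset be stored as one 2-D array.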


Number of reviews and number of words in each review

In [3]:
#### Number of reviews ####

print("X_train", X_train.shape)
print("X_test", X_test.shape)


X_train (25000, 300)
X_test (25000, 300)


In [4]:
#### Number of words in each review ####
# After padding, every review has exactly 300 tokens, so the mean is 300 by construction.
print("Review length: ")
result = [len(x) for x in X_train]
print("Mean %.2f words" % (numpy.mean(result)))

Review length: 
Mean 300.00 words


Number of labels

In [5]:
#### Number of labels ####
print("y_train ", y_train.shape)

y_train  (25000,)


### Print the value of any one feature and its label (2 Marks)

Feature value

In [6]:
#### Add your code here ####
print(X_train[23245])

[   2 4658  590 2999    2    2  530 2133    2    2    2   86  298  761
 1199    2 4505    2 4714    2 9075    2 1990   41 4505    2    2    2
   35 1122    2    2    2   59 1040    2  883    2 9074   63    2    2
  333   31  103   41   86  667 2924 1933    2 1263 1933    2 9645    2
   41  454    2 4714    2  777  725   41    2 2868    2    2   59 1199
    2    2    2 4505    2 4714   23   41 5646  667    2   52  703   63
   93    2  530    2   23    2  113    2 1619    2    2 9614    2   27
  322 1604 2946    2   68 2048    2  393    2    2    2  254 9775    2
   68  703    2 2101    2   68  194 3200   34   68 5202    2    2  723
 8721 4102    2    2  217    2 1619    2    2   29 2848  309    2    2
  217  725   27    2    2 1619   84    2 2614 3698    2   68  513    2
   30 1619  618    2    2  147    2  283    2    2    2    2 4102    2
   41  217    2 1604 2946   34    2  530 2314    2   59 1199   35 1809
 8169    2    2  118  524    2    2    0    0    0    0    0    0    0
    0 

Label value

In [7]:
#### Add your code here ####
print(y_train[23245])

1


### Decode the feature value to get the original sentence (2 Marks)

First, retrieve the dictionary that maps each word to its index in the IMDB dataset

In [8]:
#### Add your code here ####
word_index = imdb.get_word_index()


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


Now use the dictionary to get the original words from the encodings, for a particular sentence

In [9]:
#### Add your code here ####
# Invert the mapping: index -> word.
reverse_word_index = {value: key for (key, value) in word_index.items()}
# load_data() offsets word indexes by 3 (0 = padding, 1 = start, 2 = unknown),
# so subtract 3 before the lookup; reserved indexes decode to '?'.
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in X_train[23245]])

print(decoded_review)
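
The decoding step depends on the index offset, which a toy example makes easier to see. The sketch below uses a hypothetical four-word dictionary in place of `imdb.get_word_index()` and assumes the default `index_from=3` of `imdb.load_data()`, under which 0, 1, and 2 are reserved for padding, start-of-review, and unknown words.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
INDEX_FROM = 3  # assumed default offset; encoded ids 0, 1, 2 are reserved

# Invert the mapping so ranks can be looked up directly.
reverse_index = {rank: word for word, rank in toy_word_index.items()}


def decode(encoded, reverse_index, index_from=INDEX_FROM):
    """Map encoded integers back to words; reserved or unknown ids become '?'."""
    return " ".join(reverse_index.get(i - index_from, "?") for i in encoded)


# 1 = start marker, 2 = out-of-vocabulary word, 0 = padding
print(decode([1, 4, 5, 6, 2, 7, 0, 0], reverse_index))
# prints: ? the movie was ? great ? ?
```

Subtracting any other offset shifts every lookup to the wrong word, so a decoded review that reads as unrelated words is a sign that the offset does not match the one used when loading the data.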


Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [10]:
#### Add your code here ####
print(y_train[23245])

1


### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` Embedding layer doesn't require one-hot encoded words; instead, each word needs a unique integer id. The IMDB dataset we loaded is already encoded this way; if it weren't, we could use, for example, sklearn's `LabelEncoder`.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Set `return_sequences` to True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer
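
Conceptually, an embedding layer is a lookup table from integer ids to dense vectors. The following pure-Python sketch illustrates that idea with toy sizes; the helpers `make_embedding` and `embed` are hypothetical, and the random initialization range is an assumption chosen only for illustration. In the real layer the table entries are trainable weights.

```python
import random


def make_embedding(vocab_size, dim, seed=0):
    """Return a vocab_size x dim table of small random floats."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(sequence, table):
    """Look up one vector per integer id."""
    return [table[i] for i in sequence]


table = make_embedding(vocab_size=10, dim=4)   # toy sizes; the model below uses 10000 x 100
vectors = embed([3, 0, 3], table)
print(len(vectors), len(vectors[0]))           # prints 3 4
```

The same id always maps to the same vector, so a padded review of 300 ids becomes a 300 x 100 matrix once it passes through the model's embedding layer.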

In [11]:
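#### Model ####
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten, TimeDistributed

model = Sequential()
# 10,000-word vocabulary -> 100-dimensional vectors, for sequences of 300 tokens
model.add(Embedding(10000, 100, input_length=300))
# return_sequences=True emits one 150-dimensional output per time step
model.add(LSTM(units=150, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
# Apply the same 100-unit Dense layer to every time step
model.add(TimeDistributed(Dense(100)))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(5, activation='relu'))
# Single sigmoid unit: probability that the review is positive
model.add(Dense(1, activation='sigmoid'))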




### Compile the model (2 Marks)
- Use Adam as the optimizer
- Use binary cross-entropy as the loss
- Use accuracy as the metric

In [12]:
#### Add your code here ####
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
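
For reference, the loss and metric named above can be computed by hand on a few values. This is a minimal sketch with hypothetical labels and probabilities; the clipping constant `eps` and the 0.5 decision threshold are assumptions for the illustration, not values read from the compiled model.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
print(round(binary_crossentropy(labels, probs), 4))
print(accuracy(labels, probs))  # prints 0.5
```

Cross-entropy penalizes confident wrong predictions much more heavily than accuracy does, which is why the two numbers can move in different directions.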

### Print model summary (2 Marks)

In [13]:
#### Add your code here ####
# summary() prints the table itself and returns None, so no print() is needed
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 300, 150)          150600    
_________________________________________________________________
time_distributed (TimeDistri (None, 300, 100)          15100     
_________________________________________________________________
flatten (Flatten)            (None, 30000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 10)                300010    
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 55        
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 6
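
The `Param #` column can be checked by hand from the layer sizes. The sketch below uses the standard parameter-count formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


print(embedding_params(10000, 100))   # prints 1000000
print(lstm_params(100, 150))          # prints 150600
print(dense_params(150, 100))         # prints 15100 (one Dense shared across all time steps)
print(dense_params(300 * 100, 10))    # prints 300010 (input is the flattened 300 x 100 tensor)
print(dense_params(10, 5))            # prints 55
print(dense_params(5, 1))             # prints 6
```

Because `TimeDistributed` reuses a single Dense layer at every time step, its parameter count does not depend on the sequence length, whereas the first Dense layer after `Flatten` grows with it.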

### Fit the model (2 Marks)

In [14]:
#### Add your code here ####
# 25,000 training reviews / 125 per batch = 200 steps per epoch
batch_size = 125
model.fit(X_train, y_train, epochs=10, batch_size=batch_size, verbose=2)

Epoch 1/10
200/200 - 92s - loss: 0.4385 - accuracy: 0.7756
Epoch 2/10
200/200 - 88s - loss: 0.2249 - accuracy: 0.9124
Epoch 3/10
200/200 - 87s - loss: 0.1704 - accuracy: 0.9369
Epoch 4/10
200/200 - 88s - loss: 0.1372 - accuracy: 0.9510
Epoch 5/10
200/200 - 88s - loss: 0.1010 - accuracy: 0.9644
Epoch 6/10
200/200 - 88s - loss: 0.0698 - accuracy: 0.9758
Epoch 7/10
200/200 - 88s - loss: 0.0493 - accuracy: 0.9828
Epoch 8/10
200/200 - 87s - loss: 0.0379 - accuracy: 0.9873
Epoch 9/10
200/200 - 87s - loss: 0.0266 - accuracy: 0.9910
Epoch 10/10
200/200 - 88s - loss: 0.0251 - accuracy: 0.9912


<tensorflow.python.keras.callbacks.History at 0x7f46d0585fd0>

### Evaluate model (2 Marks)

In [15]:
#### Add your code here ####
# evaluate() returns the loss first ("score"), then each compiled metric
score, acc = model.evaluate(X_test, y_test, verbose=2, batch_size=batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

200/200 - 14s - loss: 0.7657 - accuracy: 0.8532
score: 0.77
acc: 0.85


### Predict on one sample (2 Marks)

In [37]:
#### Add your code here ####
sentiment = model.predict(X_test[1500].reshape((1, 300)))

# predict() returns an array of shape (1, 1); pull out the scalar probability
probability = float(sentiment[0, 0])
if probability < 0.5:
    print("negative")
else:
    print("positive")

sentiment

positive


array([[0.9987753]], dtype=float32)