![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

# Natural Language Processing Project - Seq NLP

# Sentiment Classification

# Problem Description:
Generate word embeddings and retrieve the output of each layer with Keras for a classification task.

Word embeddings are a type of word representation that allows words with similar meanings to have similar representations.

They are a distributed representation of text and are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDb dataset to learn word embeddings as we train our model. This dataset contains 50,000 movie reviews from IMDB (25,000 for training and 25,000 for testing), each labeled with a sentiment (positive or negative).

### Dataset
- Dataset of 25,000 training and 25,000 test movie reviews from IMDB, labeled by sentiment: positive (1) or negative (0).
- Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers).
- For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".
- As a convention, "0" does not stand for a specific word. With the default `load_data()` settings, "0" is reserved for padding, "1" marks the start of a review, and "2" encodes any unknown (out-of-vocabulary) word, so word indices in the loaded sequences are shifted up by 3 relative to the frequency rank.
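The frequency-based indexing and the "top N, skip top M" filtering described above can be illustrated on a toy corpus. This is a minimal sketch in plain Python, not the Keras implementation: the helper names `build_rank_index` and `encode`, the toy corpus, and the placeholder value `-1` for filtered-out words are all assumptions made for illustration.

```python
from collections import Counter

def build_rank_index(corpus):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for review in corpus for word in review.split())
    # Sort by descending count; break ties alphabetically for determinism
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def encode(review, rank_index, num_words=None, skip_top=0, oov=-1):
    """Encode a review as ranks, replacing filtered-out words with `oov`."""
    encoded = []
    for word in review.split():
        rank = rank_index.get(word)
        keep = (
            rank is not None
            and rank > skip_top
            and (num_words is None or rank <= num_words)
        )
        encoded.append(rank if keep else oov)
    return encoded

# Hypothetical toy corpus, not taken from IMDB
corpus = ["the movie was great", "the movie was bad", "the plot was thin"]
rank_index = build_rank_index(corpus)

full = encode("the movie was great", rank_index)
filtered = encode("the movie was great", rank_index, num_words=3, skip_top=1)
print(full)      # [1, 3, 2, 5]
print(filtered)  # [-1, 3, 2, -1]
```

Because the index is a frequency rank, both filters reduce to simple integer comparisons, which is what makes them cheap.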

Command to import data
- `from tensorflow.keras.datasets import imdb`

In [1]:
import tensorflow as tf
tf.__version__

'2.1.0'

## Import Packages

In [2]:
import pandas as pd, numpy as np
from itertools import islice

# Keras
from keras.layers import Dense, Embedding, LSTM, Dropout, MaxPooling1D, Conv1D, Flatten, TimeDistributed
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model, Sequential
from keras.preprocessing import sequence
from keras.datasets import imdb

from keras.callbacks import ModelCheckpoint, EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Suppress warnings
import warnings; warnings.filterwarnings('ignore')

random_state = 42
np.random.seed(random_state)
tf.random.set_seed(random_state)

Using TensorFlow backend.


In [3]:
# Mounting Google Drive
#from google.colab import drive
#drive.mount('/content/drive')
# Setting the current working directory
#import os; os.chdir('drive/My Drive/Great Learning/NLP')

### Import the data (4 Marks)
- Use `imdb.load_data()` method
- Get train and test set
- Take 10000 most frequent words

In [4]:
from tensorflow.keras.datasets import imdb

## 1. Import test and train data (5 points)

In [5]:
#### Loading Dataset - Train & Test Split
vocab_size = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = vocab_size)
imdb_data=imdb.load_data(num_words = vocab_size)

In [6]:
length = [len(i) for i in x_train]
print("Average Review length:", np.mean(length))
print("Standard Deviation:", round(np.std(length)))

Average Review length: 238.71364
Standard Deviation: 176


### Pad each sentence to be of same length (4 Marks)
- Take maximum sequence length as 300

In [7]:
maxlen = 300
x_train = pad_sequences(x_train, maxlen = maxlen, padding = 'pre')
x_test =  pad_sequences(x_test, maxlen = maxlen, padding = 'pre')
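To make the effect of `padding = 'pre'` concrete, here is a small plain-Python sketch of what happens to short and long sequences. It mimics the default behaviour of `pad_sequences` (zeros are added at the front, and sequences longer than `maxlen` lose their earliest tokens), but `pad_pre` itself is a hypothetical helper written for illustration and assumes `maxlen >= 1`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy = [[5, 6], [1, 2, 3, 4, 5, 6]]
result = pad_pre(toy, maxlen=4)
print(result)  # [[0, 0, 5, 6], [3, 4, 5, 6]]
```

With front truncation, a review longer than the maximum length keeps its ending rather than its opening words.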

## 2. Import the labels (train and test) (5 points)

In [8]:
X = np.concatenate((x_train, x_test), axis = 0)
y = np.concatenate((y_train, y_test), axis = 0)

In [9]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state, shuffle = True)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size = 0.2, random_state = random_state, shuffle = True)
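The two successive 80/20 splits above determine the sizes of the training, validation and test sets. The arithmetic can be checked with a short sketch; `split_sizes` is a hypothetical helper that only reproduces the size calculation, not the shuffling done by `train_test_split`.

```python
def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a simple fractional hold-out split."""
    n_test = round(n_samples * test_size)
    return n_samples - n_test, n_test

# First split: hold out 20% of all 50,000 reviews as the test set
n_rest, n_test = split_sizes(50000, 0.2)
# Second split: hold out 20% of the remainder as the validation set
n_train, n_valid = split_sizes(n_rest, 0.2)

print(n_train, n_valid, n_test)  # 32000 8000 10000
```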

### Print shape of features & labels (4 Marks)

Number of reviews, number of words in each review

In [10]:
print('Number of reviews:', X.shape[0])
print('Number of words in each review:', X.shape[1])
print('Number of unique words:', len(np.unique(np.hstack(X))))

Number of reviews: 50000
Number of words in each review: 300
Number of unique words: 9999


Number of labels

In [11]:
print('Number of labels:', y.shape[0])

Number of labels: 50000


In [12]:
print('Unique labels:', np.unique(y))

Unique labels: [0 1]


In [13]:
print('---'*20, f'\nNumber of rows in training dataset: {x_train.shape[0]}')
print(f'Number of columns in training dataset: {x_train.shape[1]}')
print(f'Number of unique words in training dataset: {len(np.unique(np.hstack(x_train)))}')


print('---'*20, f'\nNumber of rows in validation dataset: {x_valid.shape[0]}')
print(f'Number of columns in validation dataset: {x_valid.shape[1]}')
print(f'Number of unique words in validation dataset: {len(np.unique(np.hstack(x_valid)))}')


print('---'*20, f'\nNumber of rows in test dataset: {x_test.shape[0]}')
print(f'Number of columns in test dataset: {x_test.shape[1]}')
print(f'Number of unique words in test dataset: {len(np.unique(np.hstack(x_test)))}')


print('---'*20, f'\nUnique Categories: {np.unique(y_train), np.unique(y_valid), np.unique(y_test)}')

------------------------------------------------------------ 
Number of rows in training dataset: 32000
Number of columns in training dataset: 300
Number of unique words in training dataset: 9999
------------------------------------------------------------ 
Number of rows in validation dataset: 8000
Number of columns in validation dataset: 300
Number of unique words in validation dataset: 9984
------------------------------------------------------------ 
Number of rows in test dataset: 10000
Number of columns in test dataset: 300
Number of unique words in test dataset: 9995
------------------------------------------------------------ 
Unique Categories: (array([0, 1], dtype=int64), array([0, 1], dtype=int64), array([0, 1], dtype=int64))


### Print value of any one feature and its label (4 Marks)

Feature value

In [14]:
X[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    1,   14,   22,   16,   43,  530,
        973, 1622, 1385,   65,  458, 4468,   66, 3941,    4,  173,   36,
        256,    5,   25,  100,   43,  838,  112,   50,  670,    2,    9,
         35,  480,  284,    5,  150,    4,  172,  112,  167,    2,  336,
        385,   39,    4,  172, 4536, 1111,   17,  546,   38,   13,  447,
          4,  192,   50,   16,    6,  147, 2025,   19,   14,   22,    4,
       1920, 4613,  469,    4,   22,   71,   87,   

Label value

In [15]:
y[0]

1

### Decode the feature value to get original sentence (4 Marks)

Above you can see the first review of the dataset, which is labeled as positive (1).
The code below uses the `get_word_index()` function to retrieve the dictionary that maps words to their indices, inverts it so that indices map back to words, and replaces every index without a matching word with "#". We will print the decoded sentence for the 11th review.

First, retrieve a dictionary that contains the mapping of words to their index in the IMDB dataset.

In [16]:
index = imdb.get_word_index()

Now use the dictionary to get the original words from the encodings, for a particular sentence

In [17]:
reverse_index = dict([(value, key) for (key, value) in index.items()]) 
decoded = " ".join( [reverse_index.get(i - 3, "#") for i in X[10]] )
print(decoded) 

a short while in the cell together they stumble upon a hiding place in the wall that contains an old # after # part of it they soon realise its magical powers and realise they may be able to use it to break through the prison walls br br black magic is a very interesting topic and i'm actually quite surprised that there aren't more films based on it as there's so much scope for things to do with it it's fair to say that # makes the best of it's # as despite it's # the film never actually feels restrained and manages to flow well throughout director eric # provides a great atmosphere for the film the fact that most of it takes place inside the central prison cell # that the film feels very claustrophobic and this immensely benefits the central idea of the prisoners wanting to use magic to break out of the cell it's very easy to get behind them it's often said that the unknown is the thing that really # people and this film proves that as the director # that we can never really be sure o
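The `i - 3` in the decoding step reflects how `imdb.load_data()` shifts word indices by default: 0, 1 and 2 are reserved (padding, start of review, unknown word), so every real word index is offset by 3 relative to `get_word_index()`. The sketch below reproduces the same decoding on a tiny hand-made vocabulary; the words and the encoded sequence are hypothetical.

```python
# Hypothetical word -> rank mapping, in the style of imdb.get_word_index()
word_index = {"the": 1, "film": 2, "great": 3}
reverse_index = {rank: word for word, rank in word_index.items()}

# Encoded the way load_data() would by default:
# 0 = padding, 1 = start marker, 2 = unknown word, real words = rank + 3
encoded = [0, 0, 1, 4, 5, 2, 6]

# Reserved ids map to negative or zero ranks, which are absent from
# reverse_index, so they fall back to "#"
decoded = " ".join(reverse_index.get(i - 3, "#") for i in encoded)
print(decoded)  # # # # the film # great
```

This also shows why padding and start markers appear as "#" alongside genuinely unknown words when this decoding is used.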

## 3. Get the word index and then Create a key-value pair for word and word_id (15 points)

In [18]:
def decode_review(x, y):
  # Word -> id mapping; shift by 3 because ids 0-2 are reserved in the loaded data
  w2i = imdb.get_word_index()
  w2i = {k:(v + 3) for k, v in w2i.items()}
  w2i['<PAD>'] = 0
  w2i['<START>'] = 1
  w2i['<UNK>'] = 2
  # Inverse mapping: id -> word
  i2w = {i: w for w, i in w2i.items()}

  # Decoded review text (only used by the optional prints below)
  ws = (' '.join(i2w[i] for i in x))
  #print(f'Review: {ws}')
  #print(f'Actual Sentiment: {y}')
  return w2i, i2w

w2i, i2w = decode_review(X[10], y[10]) # for 11th review

# Show entries 10 to 49 of the id-to-word dictionary (the mapping is the same for every review)
print('---'*30, '\n', list(islice(i2w.items(), 10, 50)))

------------------------------------------------------------------------------------------ 
 [(52012, "hold's"), (11310, 'comically'), (40833, 'localized'), (30571, 'disobeying'), (52013, "'royale"), (40834, "harpo's"), (52014, 'canet'), (19316, 'aileen'), (52015, 'acurately'), (52016, "diplomat's"), (25245, 'rickman'), (6749, 'arranged'), (52017, 'rumbustious'), (52018, 'familiarness'), (52019, "spider'"), (68807, 'hahahah'), (52020, "wood'"), (40836, 'transvestism'), (34705, "hangin'"), (2341, 'bringing'), (40837, 'seamier'), (34706, 'wooded'), (52021, 'bravora'), (16820, 'grueling'), (1639, 'wooden'), (16821, 'wednesday'), (52022, "'prix"), (34707, 'altagracia'), (52023, 'circuitry'), (11588, 'crotch'), (57769, 'busybody'), (52024, "tart'n'tangy"), (14132, 'burgade'), (52026, 'thrace'), (11041, "tom's"), (52028, 'snuggles'), (29117, 'francesco'), (52030, 'complainers'), (52128, 'templarios'), (40838, '272')]


Get the sentiment for the above sentence
- positive (1)
- negative (0)

In [19]:
print('Actual Sentiment:', y[10])
#print(f'Actual Sentiment: {y[10]}')

Actual Sentiment: 1


## 4. Build a Sequential Model using Keras for the Sentiment Classification task (15 points)

### Define model (10 Marks)
- Define a Sequential Model
- Add Embedding layer
  - Embedding layer turns positive integers into dense vectors of fixed size
  - The `tensorflow.keras` embedding layer doesn't require us to one-hot encode our words. Instead, we have to give each word a unique integer id. For the IMDB dataset we've loaded this has already been done; if it hadn't been, we could use sklearn's LabelEncoder.
  - Size of the vocabulary will be 10000
  - Give dimension of the dense embedding as 100
  - Length of input sequences should be 300
- Add LSTM layer
  - Pass value in `return_sequences` as True
- Add a `TimeDistributed` layer with 100 Dense neurons
- Add Flatten layer
- Add Dense layer

#### Build Keras Embedding Layer Model

We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

- The embedding layer can be used at the start of a larger deep learning model.
- We could load pre-trained word embeddings into the embedding layer when we create our model.
- We could use the embedding layer to train our own word embeddings, in the spirit of word2vec.
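The "dictionary" view of the Embedding layer can be sketched as a plain table lookup: each integer id selects one row of a weight table. The example below is only an illustration in plain Python with a tiny, randomly initialised table; the sizes, the initialisation range, and the helper name `embed` are assumptions, and in the real layer the rows are trained by backpropagation.

```python
import random

random.seed(0)
vocab_size, embed_dim = 6, 4  # toy sizes; the notebook uses 10000 and 100

# One row of `embed_dim` floats per word id
table = [
    [random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

def embed(ids):
    """Look up the vector for each id; output shape is (len(ids), embed_dim)."""
    return [table[i] for i in ids]

sequence = [0, 0, 3, 5]  # two padding ids followed by two word ids
vectors = embed(sequence)

print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[0] == vectors[1])       # True: identical ids give identical rows
```

Repeated ids, such as the leading padding zeros of a padded review, always produce the same vector.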

In [20]:
# Define a Sequential Model
model = Sequential()
# Add Embedding layer
model.add(Embedding(input_dim = vocab_size, output_dim = 100, input_length = maxlen)) # 100 is the dimension of the dense embedding
model.add(Dropout(0.25))
model.add(Conv1D(256, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(Conv1D(128, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(MaxPooling1D(pool_size = 2))
model.add(Conv1D(64, 5, padding = 'same', activation = 'relu', strides = 1))
model.add(MaxPooling1D(pool_size = 2))

#Add LSTM layer, Pass value in return_sequences as True
model.add(LSTM(75, return_sequences = True))

#Add a TimeDistributed layer with 100 Dense neurons
model.add(TimeDistributed(Dense(100)))

#Add a Flatten Layer
model.add(Flatten()) # convert the 3-dimensional output into 2 dimensions

#Add a Dense Layer
model.add(Dense(1, activation = 'sigmoid'))

### Compile the model (4 Marks)
- Use Optimizer as Adam
- Use Binary Crossentropy as loss
- Use Accuracy as metrics

In [21]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

### Print model summary (4 Marks)

In [22]:
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 100)          1000000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 100)          0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 300, 256)          128256    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 300, 128)          163968    
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 150, 128)          0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 150, 64)           41024     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 75, 64)           
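The parameter counts and sequence lengths in the visible part of the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table and a 1-D convolution with bias; the helper names are hypothetical, and only the layers visible in the summary above are covered.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of size embed_dim per vocabulary entry
    return vocab_size * embed_dim

def conv1d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * in_channels weights plus one bias
    return (kernel_size * in_channels + 1) * filters

def pooled_length(length, pool_size):
    # Non-overlapping max pooling without padding
    return length // pool_size

params = {
    "embedding_1": embedding_params(10000, 100),
    "conv1d_1": conv1d_params(5, 100, 256),
    "conv1d_2": conv1d_params(5, 256, 128),
    "conv1d_3": conv1d_params(5, 128, 64),
}
for name, count in params.items():
    print(f"{name}: {count}")

# Sequence length after each pooling layer ('same' convolutions keep the length)
length_after_pool_1 = pooled_length(300, 2)
length_after_pool_2 = pooled_length(length_after_pool_1, 2)
print(length_after_pool_1, length_after_pool_2)  # 150 75
```

The embedding table alone accounts for 1,000,000 of these parameters, far more than the three convolutions combined.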

### Fit the model (4 Marks)

In [23]:
# Adding callbacks
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 1, patience = 0)  
mc = ModelCheckpoint('imdb_model.h5', monitor = 'val_loss', mode = 'min', save_best_only = True, verbose = 1)

# Fit the model
model.fit(x_train, y_train, validation_data = (x_valid, y_valid), epochs = 3, batch_size = 64, verbose = True, callbacks = [es, mc])

Train on 32000 samples, validate on 8000 samples
Epoch 1/3

Epoch 00001: val_loss improved from inf to 0.24454, saving model to imdb_model.h5
Epoch 2/3

Epoch 00002: val_loss did not improve from 0.24454
Epoch 00002: early stopping


<keras.callbacks.callbacks.History at 0x245f1fa4408>
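The log above shows training stopping after the second epoch because `val_loss` did not improve. A simplified sketch of that stopping rule is given below. It is not the Keras source: `early_stopping_epoch` is a hypothetical helper, and only the first validation loss (0.24454) comes from the log, while the later values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=0):
    """Return (epoch at which training stops, best validation loss seen)."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best
    return len(val_losses), best

# 0.24454 is from the training log; 0.26 and 0.23 are hypothetical
stopped_at, best_loss = early_stopping_epoch([0.24454, 0.26, 0.23], patience=0)
print(stopped_at, best_loss)  # 2 0.24454
```

With `patience = 0`, one epoch without improvement is enough to stop, so a later improvement (such as the hypothetical third value) is never reached. Unless `restore_best_weights = True` is passed to `EarlyStopping`, the in-memory model keeps the weights of the last epoch that ran, while the checkpoint file holds the best ones.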

## 5. Report the Accuracy of the model (5 points)

### Evaluate model (4 Marks)

In [24]:
# Evaluate the model
scores = model.evaluate(x_test, y_test, batch_size = 64)
print('Test accuracy: %.2f%%' % (scores[1]*100))

Test accuracy: 90.07%


### Predict on one sample (4 Marks)

In [25]:
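y_pred = model.predict_classes(x_test)
# Note: classification_report expects (y_true, y_pred). The arguments are passed
# in the opposite order here, so the precision and recall columns in the report
# below are interchanged and the support column counts predicted labels.
print(f'Classification Report:\n{classification_report(y_pred, y_test)}')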

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.89      0.90      5085
           1       0.89      0.92      0.90      4915

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000



- Accuracy: 90%
- F1-score: 90%
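Precision and recall are not symmetric in their arguments: scikit-learn's `classification_report` takes the true labels first and the predictions second, and swapping them exchanges the two metrics (accuracy and F1-score are unaffected). The toy sketch below shows this with a hypothetical helper `precision_recall` and made-up labels.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall of the `positive` class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_positive = sum(p == positive for p in y_pred)
    actual_positive = sum(t == positive for t in y_true)
    return tp / predicted_positive, tp / actual_positive

# Hypothetical labels, not taken from the notebook
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 1, 0, 0, 0]

precision, recall = precision_recall(y_true_toy, y_pred_toy)
swapped = precision_recall(y_pred_toy, y_true_toy)

print(precision, recall)  # 0.5 0.666...
print(swapped)            # (0.666..., 0.5): the two metrics trade places
```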

## 6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built (5 points)

In [26]:
# Pick one random test review
sample_x_test = x_test[np.random.randint(x_test.shape[0])]
for layer in model.layers:
    # Sub-model that maps the original input to this layer's output
    model_layer = Model(inputs = model.input, outputs = layer.output)
    # Reshape the review to a batch of size 1 before predicting
    output = model_layer.predict(sample_x_test.reshape(1, -1))
    print('\n', '--'*20, layer.name, 'layer', '--'*20, '\n')
    print(output)


 ---------------------------------------- embedding_1 layer ---------------------------------------- 

[[[ 0.00379257  0.00282501  0.0108014  ...  0.01597496 -0.00240109
    0.03294621]
  [ 0.00379257  0.00282501  0.0108014  ...  0.01597496 -0.00240109
    0.03294621]
  [ 0.00379257  0.00282501  0.0108014  ...  0.01597496 -0.00240109
    0.03294621]
  ...
  [-0.00481838 -0.03248607  0.00674376 ...  0.03830308 -0.02844862
    0.00581411]
  [-0.05558018 -0.04111317 -0.01504822 ...  0.01026597  0.04616052
   -0.04665979]
  [-0.01208614 -0.05179224  0.0576748  ...  0.07342824  0.01705603
   -0.0771725 ]]]

 ---------------------------------------- dropout_1 layer ---------------------------------------- 

[[[ 0.00379257  0.00282501  0.0108014  ...  0.01597496 -0.00240109
    0.03294621]
  [ 0.00379257  0.00282501  0.0108014  ...  0.01597496 -0.00240109
    0.03294621]
  [ 0.00379257  0.00282501  0.0108014  ...  0.01597496 -0.00240109
    0.03294621]
  ...
  [-0.00481838 -0.03248607  0.006
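The dropout output above repeats the embedding values, which is expected: dropout is only active during training and passes its input through unchanged at prediction time. The following plain-Python sketch of inverted dropout illustrates the two modes; the function `dropout` and its arguments are hypothetical and only mirror the general technique, not the Keras implementation.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero out a fraction of values and rescale the rest."""
    if not training:
        return list(values)  # identity at inference time
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(42)
values = [0.0038, 0.0028, 0.0108, 0.0160]  # rounded values in the style of the output above

at_inference = dropout(values, rate=0.25, training=False, rng=rng)
at_training = dropout(values, rate=0.25, training=True, rng=rng)

print(at_inference == values)  # True
print(at_training)             # each entry is either 0.0 or value / 0.75
```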

In [28]:
decode_review(x_test[10], y_test[10])
print('Test-1:')
print(f'Actual sentiment: {y_test[10]}')
print(f'Predicted sentiment: {y_pred[10][0]}')

Test-1:
Actual sentiment: 1
Predicted sentiment: 1


In [29]:
print('Test-2:')
print(f'Actual sentiment: {y_test[200]}')
print(f'Predicted sentiment: {y_pred[200][0]}')

Test-2:
Actual sentiment: 0
Predicted sentiment: 0
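For a model with a single sigmoid output, turning a predicted probability into a class label amounts to thresholding at 0.5, which is what `predict_classes` does here. Since `predict_classes` was removed in later TensorFlow releases, the same step may need to be written explicitly. A minimal sketch in plain Python, with hypothetical helper names and made-up pre-activation values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def to_class(probability, threshold=0.5):
    """Map a sigmoid probability to a binary label."""
    return int(probability > threshold)

# Hypothetical pre-activation values of the final Dense(1) layer
logits = [-2.0, 0.0, 3.0]
probabilities = [sigmoid(z) for z in logits]
classes = [to_class(p) for p in probabilities]

print(classes)  # [0, 0, 1]
```

With a trained Keras model, the equivalent would be to compare the output of `model.predict(x_test)` against 0.5 and cast the result to integers.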


The sentiment classification task on the IMDB dataset was performed using word embeddings. The sequential model combines an Embedding layer with Dropout, Conv1D, MaxPooling1D, LSTM, TimeDistributed, Flatten and Dense layers.

The model reached the following scores (accuracy and F1-score are measured on the test dataset):
- Accuracy: 90%
- F1-score: 90%
- Validation loss: 0.25

The sample predictions tested above matched their target sentiments.