#### Load Movie reviews Dataset

We will use the labeled movie review data from Kaggle for this exercise. The data is available at https://www.kaggle.com/c/word2vec-nlp-tutorial/data.

In [6]:
#Connect Google drive to colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Load dataset

In [7]:
import pandas as pd
import numpy as np

In [8]:
#Change the file path to point to where you have stored labeledTrainData.tsv.
#quoting=3 is csv.QUOTE_NONE, so quote characters inside the reviews are kept as-is.
df = pd.read_csv('/content/drive/MyDrive/AIML/NLP/Rajeev Sir/Statistical NLP- Rajeev.zip (Unzipped Files)/Notebooks.zip (Unzipped Files)/Notebooks/data/labeledTrainData.tsv.zip (Unzipped Files)/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)

In [9]:
print('Number of examples in Dataset: ', df.shape)
df.head()

Number of examples in Dataset:  (25000, 3)


Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [10]:
df.loc[0, 'review']

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

Split Data into Training and Test Data

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(
    df['review'],
    df['sentiment'],
    test_size=0.2, 
    random_state=42
)

In [13]:
X_train.shape, X_test.shape

((20000,), (5000,))

#### Build the Tokenizer

In [14]:
import tensorflow as tf

In [15]:
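desired_vocab_size = 10000 #Vocabulary size
#oov_token is usually a string such as '<OOV>'; the integer 32 used here shows up as the key 32 in t.word_index
t = tf.keras.preprocessing.text.Tokenizer(num_words=desired_vocab_size, oov_token=32) # num_words -> Vocabulary size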

In [16]:
#Fit tokenizer with actual training data
t.fit_on_texts(X_train.tolist())

In [17]:
#Vocabulary
t.word_index

{32: 1,
 'the': 2,
 'and': 3,
 'a': 4,
 'of': 5,
 'to': 6,
 'is': 7,
 'br': 8,
 'in': 9,
 'it': 10,
 'i': 11,
 'this': 12,
 'that': 13,
 'was': 14,
 'as': 15,
 'for': 16,
 'with': 17,
 'movie': 18,
 'but': 19,
 'film': 20,
 'on': 21,
 'not': 22,
 'you': 23,
 'his': 24,
 'are': 25,
 'have': 26,
 'he': 27,
 'be': 28,
 'one': 29,
 'all': 30,
 'at': 31,
 'by': 32,
 'an': 33,
 'they': 34,
 'who': 35,
 'so': 36,
 'from': 37,
 'like': 38,
 'her': 39,
 'or': 40,
 'just': 41,
 'about': 42,
 "it's": 43,
 'out': 44,
 'has': 45,
 'if': 46,
 'there': 47,
 'some': 48,
 'what': 49,
 'good': 50,
 'more': 51,
 'when': 52,
 'very': 53,
 'up': 54,
 'no': 55,
 'even': 56,
 'time': 57,
 'she': 58,
 'my': 59,
 'would': 60,
 'which': 61,
 'only': 62,
 'story': 63,
 'really': 64,
 'see': 65,
 'had': 66,
 'their': 67,
 'can': 68,
 'me': 69,
 'were': 70,
 'well': 71,
 'than': 72,
 'we': 73,
 'much': 74,
 'get': 75,
 'been': 76,
 'bad': 77,
 'will': 78,
 'also': 79,
 'do': 80,
 'into': 81,
 'other': 82,
 'great'
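
The index above is ranked by word frequency, with index 1 reserved for the out-of-vocabulary token and 0 left free for padding. The following is a simplified pure-Python sketch of that idea, not the Keras implementation: the regex tokenization, the `'<OOV>'` token, and the names `build_word_index`, `texts_to_sequences`, and `toy_texts` are assumptions made for illustration.

```python
from collections import Counter
import re

WORD_PATTERN = re.compile(r"[a-z0-9']+")  # assumed tokenization rule, simpler than Keras' filters

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is the OOV token, index 0 is left for padding."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_PATTERN.findall(text.lower()))
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown words and indices >= num_words become the OOV index."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in WORD_PATTERN.findall(text.lower()):
            idx = word_index.get(word, oov)
            seq.append(idx if idx < num_words else oov)
        sequences.append(seq)
    return sequences

toy_texts = ["the movie was great", "the movie was bad", "the plot was thin"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(["the movie was superb"], toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

Note that the full `word_index` keeps every word seen during fitting. The `num_words` cap only takes effect when text is converted to sequences, which is why the dictionary printed by the notebook can be much longer than 10,000 entries.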

#### Prepare Training and Test Data

Get the word index for each word in the review

In [18]:
X_train[0:1]

23311    "This movie is just plain dumb.<br /><br />Fro...
Name: review, dtype: object

In [19]:
X_train = t.texts_to_sequences(X_train.tolist())

In [20]:
print(X_train[0:1])

[[12, 18, 7, 41, 1059, 974, 8, 8, 37, 2, 977, 5, 2845, 1, 15, 1856, 4247, 6, 2, 1, 1353, 2, 20, 7, 33, 3233, 9, 1605, 8653, 8, 8, 1856, 4247, 7, 29, 5, 1234, 1, 279, 1, 1017, 1, 3, 7946, 35, 270, 1280, 291, 6, 3352, 2, 731, 4247, 2001, 182, 990, 6, 75, 6, 2, 905, 12, 20, 501, 4247, 81, 4, 9825, 32, 3468, 87, 17, 77, 478, 35, 25, 71, 98, 974, 6, 75, 242, 17, 231, 29, 7, 36, 843, 1276, 13, 27, 1, 6, 4, 3694, 1238, 8, 8, 82, 542, 5, 2, 18, 25, 206, 44, 5, 2, 293, 4878, 298, 268, 1, 838, 31, 2, 1, 16, 1845, 40, 2, 77, 227, 35, 2554, 8105, 24, 1427, 9, 2, 143, 3, 2, 2349, 2001, 25, 8106, 1, 1, 7, 625, 180, 2, 1, 5, 2, 1302, 52, 2, 370, 7, 2637, 21, 39, 2368, 2976, 19, 442, 95, 119, 2, 498, 52, 2, 370, 2603, 143, 16, 4, 6452, 324, 2, 243, 1010, 191, 1, 2, 3694, 1238, 2, 77, 227, 5900, 4247, 3857, 4, 6665, 1559, 100, 264, 661, 535, 2, 6665, 286, 27, 14, 1512, 234, 500, 264, 227, 8, 8, 6, 28, 1297, 47, 70, 48, 221, 370, 2426, 3, 1537, 3, 2, 1697, 1977, 7, 36, 77, 13, 10, 208, 76, 107, 61, 7, 2

In [21]:
X_test = t.texts_to_sequences(X_test.tolist())

How many words in each review?

In [22]:
len(X_train[200])

156
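
Checking a single review gives only one data point. Before fixing a maximum length, it can help to look at the whole length distribution. The sketch below assumes NumPy is available and uses a small hypothetical list `toy_sequences` in place of the tokenized `X_train`; the cap `candidate_max_len` is also illustrative.

```python
import numpy as np

# Hypothetical tokenized reviews (lists of word indices), standing in for X_train
toy_sequences = [[12, 18, 7], [2, 20, 7, 33, 9], [11] * 10, [4] * 40]
lengths = np.array([len(seq) for seq in toy_sequences])

summary = {
    "min": int(lengths.min()),
    "median": float(np.median(lengths)),
    "max": int(lengths.max()),
}

# Fraction of reviews that would fit without truncation for a candidate cap
candidate_max_len = 10
fraction_covered = float((lengths <= candidate_max_len).mean())

print(summary)
print(fraction_covered)
```

Applied to the real data, the same two statistics show how many reviews a given maximum length would truncate.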

#### Pad Sequences - Important

In [23]:
#Define maximum number of words to consider in each review
max_review_length = 300

In [24]:
#Pad training and test reviews
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train,
                                                        maxlen=max_review_length,
                                                        padding='pre')
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, 
                                                       maxlen=max_review_length, 
                                                       padding='pre')

In [25]:
X_train.shape

(20000, 300)

In [26]:
X_test.shape

(5000, 300)

In [27]:
X_train[200]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,   11,  212,   12,   18,   17,    4,  7
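
The leading zeros above come from `padding='pre'`. The notebook does not pass a `truncating` argument, so the `pad_sequences` default of `'pre'` applies: reviews longer than `max_review_length` lose their beginning, not their end. A minimal pure-Python sketch of both behaviours for a single sequence (the helper name `pad_pre` is hypothetical):

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if the sequence is too long."""
    trimmed = sequence[-maxlen:]  # 'pre' truncation drops items from the front
    return [value] * (maxlen - len(trimmed)) + trimmed

short_review = pad_pre([11, 212, 12], maxlen=5)
long_review = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)
print(long_review)
```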

#### Load Glove model

We can use the gensim library to load pre-trained Word2Vec or Glove models. The list of available models can be found at [this url](https://github.com/RaRe-Technologies/gensim-data).

In [28]:
import gensim.downloader as api

In [29]:
#Load Glove model (similar to Word2Vec)
glove_model = api.load('glove-wiki-gigaword-50')



In [30]:
#Model vocabulary (the attribute is index_to_key in gensim 4.x, index2word in older versions)
#glove_model.index_to_key

In [31]:
#Size of the model
glove_model.vectors.shape

(400000, 50)

In [32]:
#Embedding for word great
glove_model['great']

array([-0.026567,  1.3357  , -1.028   , -0.3729  ,  0.52012 , -0.12699 ,
       -0.35433 ,  0.37824 , -0.29716 ,  0.093894, -0.034122,  0.92961 ,
       -0.14023 , -0.63299 ,  0.020801, -0.21533 ,  0.96923 ,  0.47654 ,
       -1.0039  , -0.24013 , -0.36325 , -0.004757, -0.5148  , -0.4626  ,
        1.2447  , -1.8316  , -1.5581  , -0.37465 ,  0.53362 ,  0.20883 ,
        3.2209  ,  0.64549 ,  0.37438 , -0.17657 , -0.024164,  0.33786 ,
       -0.419   ,  0.40081 , -0.11449 ,  0.051232, -0.15205 ,  0.29855 ,
       -0.44052 ,  0.11089 , -0.24633 ,  0.66251 , -0.26949 , -0.49658 ,
       -0.41618 , -0.2549  ], dtype=float32)

#### Get Pre-trained Embeddings

The pre-trained Glove model has a vocabulary of 400,000 unique words. We do not need all of them. Moreover, the rows of the embedding matrix must be arranged according to the word index created by our tokenizer above. So we will extract embeddings only for the words in our vocabulary.

In [33]:
#Embedding length based on selected model - we are using 50d here.
embedding_vector_length = glove_model.vector_size

Initialize an embedding matrix, which we will populate for our vocabulary words.

In [34]:
#Initialize embedding matrix for our dataset with 10000+1 rows (1 for padding word)
#and 50 columns (as embedding size is 50)
embedding_matrix = np.zeros((desired_vocab_size  + 1, embedding_vector_length))

Load word vectors for each word in our vocabulary from the Glove pre-trained model

In [35]:
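for word, i in sorted(t.word_index.items(),key=lambda x:x[1]):
    if i > desired_vocab_size: #embedding_matrix only has rows 0..desired_vocab_size
        break
    if not isinstance(word, str): #skip the integer OOV token (32); it is not a real word
        continue
    try:
        embedding_matrix[i] = glove_model[word] #Read the word's embedding from the Glove model
    except KeyError: #word is missing from the Glove vocabulary; its row stays all zeros
        pass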

We now have word embeddings for our vocabulary words from the Glove model. Words that Glove does not know keep an all-zero row. We can now use this matrix in our model training.
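
It is often worth checking how much of the vocabulary actually received a pre-trained vector. The sketch below repeats the lookup on toy data and computes that coverage. It assumes NumPy; `toy_vectors` is a plain dictionary standing in for the Glove model, and `toy_word_index` stands in for `t.word_index`. Both are hypothetical.

```python
import numpy as np

# Hypothetical stand-ins: a tiny "pre-trained" lookup and a tokenizer word index
toy_vectors = {"the": np.array([0.1, 0.2]), "movie": np.array([0.3, 0.4])}
toy_word_index = {"<OOV>": 1, "the": 2, "movie": 3, "moonwalker": 4}
vocab_size, dim = 4, 2

# Row 0 is reserved for padding, so the matrix has vocab_size + 1 rows
matrix = np.zeros((vocab_size + 1, dim))
found = 0
for word, i in toy_word_index.items():
    if i > vocab_size:
        continue  # index falls outside the matrix
    vector = toy_vectors.get(word)
    if vector is not None:
        matrix[i] = vector
        found += 1

coverage = found / vocab_size  # share of vocabulary rows filled from the lookup
print(coverage)
```

A low coverage value on real data would suggest that tokenization and the pre-trained vocabulary disagree, for example on casing or punctuation.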

In [36]:
#embedding_matrix[2]

#### Build Model - Dense Layers

In [53]:
#Initialize model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

To handle pre-trained embeddings, we will use the Keras Embedding layer

In [54]:
model.add(tf.keras.layers.Embedding(desired_vocab_size + 1, #Vocabulary size
                                    embedding_vector_length, #Embedding size
                                    weights=[embedding_matrix], #Embeddings taken from pre-trained model
                                    trainable=False, #As embeddings are already available, we will not train this layer. It will act as lookup layer.
                                    input_length=max_review_length) #Number of words in each review
          )

Embedding Layer gives us 3D output ->
[Batch_Size , Review Length , Embedding_Size]
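
With frozen weights, the lookup that produces this 3D shape can be imitated with NumPy integer indexing. This is only a sketch of the lookup, not of the Keras layer itself; `toy_matrix` and `toy_batch` are hypothetical stand-ins for `embedding_matrix` and a batch of padded reviews.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(6, 4))   # 6 "words", 4-dimensional embeddings
toy_batch = np.array([[0, 0, 2, 5],    # 2 padded reviews, 4 word indices each
                      [0, 3, 1, 4]])

# Integer-array indexing picks one matrix row per word index
embedded = toy_matrix[toy_batch]
print(embedded.shape)
```

Each word index is replaced by its row of the matrix, so a `(batch, length)` integer array becomes a `(batch, length, embedding_size)` float array.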

In [55]:
model.output

<KerasTensor: shape=(None, 300, 50) dtype=float32 (created by layer 'embedding')>

Add Hidden layers

In [56]:
#Flatten the data as we will use Dense layers
model.add(tf.keras.layers.Flatten())

#Add Hidden layers (Dense layers)
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(50, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(25, activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))

Add Output layer

In [57]:
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [58]:
#Compile the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [59]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 50)           500050    
_________________________________________________________________
flatten (Flatten)            (None, 15000)             0         
_________________________________________________________________
dense (Dense)                (None, 100)               1500100   
_________________________________________________________________
batch_normalization (BatchNo (None, 100)               400       
_________________________________________________________________
dense_1 (Dense)              (None, 50)                5050      
_________________________________________________________________
batch_normalization_1 (Batch (None, 50)                200       
_________________________________________________________________
dense_2 (Dense)              (None, 25)                1
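
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for dense layers (weights plus biases) and batch normalization (four values per feature). The last two numbers are computed from the same formula and are not copied from the notebook output.

```python
vocab_rows, emb_dim, review_len = 10000 + 1, 50, 300

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """gamma, beta, moving mean and moving variance for each feature."""
    return 4 * n_features

counts = {
    "embedding": vocab_rows * emb_dim,
    "dense": dense_params(review_len * emb_dim, 100),
    "batch_normalization": batchnorm_params(100),
    "dense_1": dense_params(100, 50),
    "batch_normalization_1": batchnorm_params(50),
    "dense_2": dense_params(50, 25),
    "output": dense_params(25, 1),
}
for name, value in counts.items():
    print(name, value)
```

Almost all trainable weights sit in the first dense layer, because flattening turns each review into a vector of 300 * 50 = 15,000 inputs.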

##### Train Model

In [60]:
model.fit(X_train,y_train,
          epochs=20,
          batch_size=32,          
          validation_data=(X_test, y_test))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f2db5b22590>

#### Building a CNN Model

Start a model

In [61]:
model2 = tf.keras.Sequential()

Add Embedding layer to handle the pre-trained Glove embeddings

In [62]:
model2.add(tf.keras.layers.Embedding(desired_vocab_size + 1, #Vocabulary size
                                    embedding_vector_length, #Embedding size
                                    weights=[embedding_matrix], #Embeddings taken from pre-trained model
                                    trainable=False, #As embeddings are already available, we will not train this layer. It will act as lookup layer.
                                    input_length=max_review_length) #Number of words in each review
          )

Add Conv1D hidden layers: each review is a 2D array (number of words, embedding size), so we use Conv1D, which slides its filters along the word axis only. Images are 3D (height, width, channels) and use Conv2D instead.

In [63]:
#Add first convolutional layer
model2.add(tf.keras.layers.Conv1D(32, #Number of filters 
                                 kernel_size=(3), #Size of the filter
                                 strides=1,
                                 activation='relu'))

#normalize data
model2.add(tf.keras.layers.BatchNormalization())

#Add second convolutional layer
model2.add(tf.keras.layers.Conv1D(64, kernel_size=(3), strides=2))
model2.add(tf.keras.layers.ReLU())

#normalize data
model2.add(tf.keras.layers.BatchNormalization())
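
To make the sliding-window idea concrete, here is a minimal NumPy sketch of a 'valid' 1D convolution over one embedded review, followed by the averaging over positions that global average pooling performs. It is an illustration with random toy weights, not the Keras implementation; `conv1d_valid`, `toy_review`, and `toy_kernels` are hypothetical names.

```python
import numpy as np

def conv1d_valid(x, kernels, bias, stride=1):
    """x: (length, channels_in); kernels: (k, channels_in, filters); bias: (filters,)."""
    k = kernels.shape[0]
    out_len = (x.shape[0] - k) // stride + 1
    out = np.empty((out_len, kernels.shape[2]))
    for pos in range(out_len):
        start = pos * stride
        window = x[start:start + k]  # k consecutive word vectors
        # Multiply the window with every filter and sum over words and channels
        out[pos] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return out

rng = np.random.default_rng(0)
toy_review = rng.normal(size=(300, 50))    # 300 words, 50-d embeddings
toy_kernels = rng.normal(size=(3, 50, 32)) # 32 filters spanning 3 words each

features = conv1d_valid(toy_review, toy_kernels, np.zeros(32))
pooled = features.mean(axis=0)             # average each filter over all positions
print(features.shape, pooled.shape)
```

Each filter covers three consecutive words across all 50 embedding dimensions, so it acts like a learned trigram detector.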

In [64]:
#Use Global Average Pooling
model2.add(tf.keras.layers.GlobalAveragePooling1D())

#Output layer
model2.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [65]:
#Compile the model
model2.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [66]:
model2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 50)           500050    
_________________________________________________________________
conv1d (Conv1D)              (None, 298, 32)           4832      
_________________________________________________________________
batch_normalization_2 (Batch (None, 298, 32)           128       
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 148, 64)           6208      
_________________________________________________________________
re_lu (ReLU)                 (None, 148, 64)           0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 148, 64)           256       
_________________________________________________________________
global_average_pooling1d (Gl (None, 64)               
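
The output lengths and parameter counts in this summary follow from two small formulas, assuming the default 'valid' padding that the Conv1D layers above use. A quick check in plain Python (the helper names are hypothetical):

```python
def conv1d_out_len(length, kernel_size, stride):
    """Output positions of a 'valid' convolution (no padding)."""
    return (length - kernel_size) // stride + 1

def conv1d_params(kernel_size, channels_in, filters):
    """One weight per (kernel position, input channel, filter) plus one bias per filter."""
    return kernel_size * channels_in * filters + filters

len_1 = conv1d_out_len(300, 3, stride=1)
len_2 = conv1d_out_len(len_1, 3, stride=2)
params_1 = conv1d_params(3, 50, 32)
params_2 = conv1d_params(3, 32, 64)
print(len_1, len_2, params_1, params_2)
```

The two convolutional layers together have about 11,000 weights, far fewer than the 1.5 million in the first dense layer of the previous model.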

In [67]:
model2.fit(X_train,y_train,
          epochs=25,
          batch_size=32,          
          validation_data=(X_test, y_test))

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x7f2db58a87d0>