## Name Bhavesh Kumar Bohara (MML2022013)

Use the link below to implement the code for text classification on the IMDB dataset using CNN, RNN, and LSTM models. (Implement the complete notebook.)

https://github.com/practical-nlp/practical-nlp-code/blob/master/Ch4/05_DeepNN_Example.ipynb


As part of the documentation, write your understanding of the code in detail. The book that explains this code is also attached. (Please refer to Chapter 4.)

The first cell uses pip, the Python package installer, to install pinned versions of three packages: numpy, wget, and tensorflow.

In [None]:
!pip install numpy==1.19.5
!pip install wget==3.2
!pip install tensorflow==1.14.0


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy==1.19.5
  Downloading numpy-1.19.5-cp39-cp39-manylinux2010_x86_64.whl (14.9 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
xarray 2022.12.0 requires numpy>=1.20, but you have numpy 1.19.5 which is incompatible.
xarray-einstats 0.5.1 requires numpy>=1.20, but you have numpy 1.19.5 which is incompatible.
tensorflow 2.12.0 requires numpy<1.24,>=1.22, but you have numpy 1.19.5 which is incompatible.
ml-dtypes 0.0.4 requires numpy>1.

Here we import the Python packages and modules needed to download the data and to build the models with TensorFlow's Keras API.

In [None]:
import os
import sys
import numpy as np
import tarfile
import wget
import warnings
warnings.filterwarnings("ignore")
from zipfile import ZipFile
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense, Input, GlobalMaxPooling1D
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.initializers import Constant

The %%capture cell magic at the top of the next cell suppresses the cell's output, so the long download and extraction logs are not printed to the console.

In Colab, the cell downloads the GloVe word embeddings and the IMDB movie review dataset with wget and extracts both into a directory named DATAPATH.

Outside Colab, importing google.colab raises ModuleNotFoundError, and the except branch downloads and extracts the same two archives into a local Data directory, skipping any dataset that is already present.

Finally, BASE_DIR is set to the directory that holds the datasets (DATAPATH or Data); later cells use it to locate the files.

In [None]:
%%capture
try:

    from google.colab import files

    !wget -P DATAPATH http://nlp.stanford.edu/data/glove.6B.zip
    !unzip DATAPATH/glove.6B.zip -d DATAPATH/glove.6B

    !wget -P DATAPATH http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -xvf DATAPATH/aclImdb_v1.tar.gz -C DATAPATH

    BASE_DIR = 'DATAPATH'

except ModuleNotFoundError:

    if not os.path.exists('Data/glove.6B'):
        os.makedirs('Data/glove.6B')  # makedirs also creates the parent Data directory if it is missing

        url='http://nlp.stanford.edu/data/glove.6B.zip'
        wget.download(url,'Data')

        temp='Data/glove.6B.zip'
        file = ZipFile(temp)
        file.extractall('Data/glove.6B')
        file.close()



    if not os.path.exists('Data/aclImdb'):

        url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
        wget.download(url,'Data')

        temp='Data/aclImdb_v1.tar.gz'
        tar = tarfile.open(temp, "r:gz")
        tar.extractall('Data')
        tar.close()

    BASE_DIR = 'Data'
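
The "extract only if missing" pattern used in the except branch can be tried without any network access. The sketch below is only an illustration: extract_once and the tiny demo archive are hypothetical names and data that do not appear in the notebook, and the archive is built locally instead of being downloaded.

```python
import os
import tarfile
import tempfile

def extract_once(archive_path, target_dir, marker):
    """Extract a .tar.gz into target_dir unless target_dir/marker already exists.

    Hypothetical helper for illustration; returns True when extraction ran.
    """
    if os.path.exists(os.path.join(target_dir, marker)):
        return False
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(target_dir)
    return True

# Build a small local archive that stands in for aclImdb_v1.tar.gz.
work = tempfile.mkdtemp()
src = os.path.join(work, 'src', 'aclImdb')
os.makedirs(src)
with open(os.path.join(src, 'README'), 'w') as f:
    f.write('demo')
archive = os.path.join(work, 'demo.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:
    tar.add(src, arcname='aclImdb')

target = os.path.join(work, 'Data')
first = extract_once(archive, target, 'aclImdb')   # extracts the archive
second = extract_once(archive, target, 'aclImdb')  # skipped: Data/aclImdb exists
print(first, second)
```

Checking for the extracted folder rather than the archive file mirrors the os.path.exists() tests in the cell above, so rerunning the cell does not repeat the work.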

Here we build three directory paths from BASE_DIR with os.path.join(): the GloVe embeddings folder and the IMDB train and test folders.

In [None]:
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')
TRAIN_DATA_DIR = os.path.join(BASE_DIR, 'aclImdb/train')
TEST_DATA_DIR = os.path.join(BASE_DIR, 'aclImdb/test')

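The next cell defines four constants: MAX_SEQUENCE_LENGTH (every review is padded or truncated to 1000 tokens), MAX_NUM_WORDS (the vocabulary limit of 20000 words), EMBEDDING_DIM (100-dimensional word vectors, matching the glove.6B.100d file used later), and VALIDATION_SPLIT (20% of the training data is held out for validation).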



In [None]:

MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2


### Loading and Preprocessing

The code defines a function called get_data that loads the text data and the corresponding labels from a given directory into the notebook. Its data_dir parameter is the path to a directory containing pos and neg subfolders; any other subfolder is ignored. Inside the function, the dictionary labels_index maps the label names to numeric IDs (pos to 1, neg to 0). Each review is read into the list texts, and its label ID is appended to the list labels. The function returns both lists.

The function is called twice, once with TRAIN_DATA_DIR and once with TEST_DATA_DIR, to load the training and test data. Because the labels_index dictionary inside get_data is local to the function, the same mapping is defined again at module level after the calls so that later cells can use it.

In [None]:
#Function to load the data from the dataset into the notebook. Will be called twice - for train and test.
def get_data(data_dir):
    texts = []  # list of text samples
    labels_index = {'pos':1, 'neg':0}  # dictionary mapping label name to numeric id
    labels = []  # list of label ids
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isdir(path):
            if name=='pos' or name=='neg':
                label_id = labels_index[name]
                for fname in sorted(os.listdir(path)):
                    fpath = os.path.join(path, fname)
                    # a with-block closes each file as soon as it has been read
                    with open(fpath, encoding='utf8') as f:
                        texts.append(f.read())
                    labels.append(label_id)
    return texts, labels

train_texts, train_labels = get_data(TRAIN_DATA_DIR)
test_texts, test_labels = get_data(TEST_DATA_DIR)
labels_index = {'pos':1, 'neg':0}
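
To see the folder-to-label logic in isolation, here is a minimal sketch that runs on a throwaway directory instead of the real dataset. The helper name load_labeled_reviews, the temporary folder, and the three toy reviews are assumptions made for illustration; the extra unsup folder stands in for any subfolder that is neither pos nor neg.

```python
import os
import tempfile

def load_labeled_reviews(data_dir):
    """Read every file under data_dir/pos and data_dir/neg (hypothetical helper)."""
    label_ids = {'neg': 0, 'pos': 1}
    texts, labels = [], []
    for name in sorted(os.listdir(data_dir)):
        folder = os.path.join(data_dir, name)
        # only the two labelled folders are read; anything else is skipped
        if os.path.isdir(folder) and name in label_ids:
            for fname in sorted(os.listdir(folder)):
                with open(os.path.join(folder, fname), encoding='utf8') as f:
                    texts.append(f.read())
                labels.append(label_ids[name])
    return texts, labels

# Build a tiny stand-in for a dataset folder: two labelled folders plus one to ignore.
demo_root = tempfile.mkdtemp()
samples = {'pos': ['a great film'],
           'neg': ['a dull film', 'not good'],
           'unsup': ['unlabelled text']}
for folder, reviews in samples.items():
    os.makedirs(os.path.join(demo_root, folder))
    for i, review in enumerate(reviews):
        with open(os.path.join(demo_root, folder, '%d.txt' % i), 'w', encoding='utf8') as f:
            f.write(review)

demo_texts, demo_labels = load_labeled_reviews(demo_root)
print(demo_texts)
print(demo_labels)
```

Because the folder names are sorted, all neg reviews come before all pos reviews in the returned lists, which is one reason the training data has to be shuffled before the validation split.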


Here the Keras Tokenizer vectorizes the text samples. MAX_NUM_WORDS (20000) limits the vocabulary to the most frequent words. The tokenizer is fitted on the training data only, using fit_on_texts(), so no vocabulary information from the test set is used. texts_to_sequences() then converts every training and test review into a list of word indexes. word_index maps every word seen during fitting to its index; the MAX_NUM_WORDS limit is applied only when texts are converted to sequences, which is why the printed count of unique tokens (88582) is larger than 20000.

In [None]:
#Vectorize these text samples into a 2D integer tensor using Keras Tokenizer
#Tokenizer is fit on training data only, and that is used to tokenize both train and test data.
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_texts)
train_sequences = tokenizer.texts_to_sequences(train_texts) #Converting text to a vector of word indexes
test_sequences = tokenizer.texts_to_sequences(test_texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 88582 unique tokens.
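
The relationship between word_index and the vocabulary limit can be illustrated with a simplified, pure-Python stand-in for the tokenizer. This is a sketch under stated assumptions, not the Keras implementation: fit_word_index and to_sequences are hypothetical names, and splitting on whitespace ignores the punctuation filtering that the real Tokenizer performs.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping unknown words and
    words whose index is not below num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_train = ['the film was good', 'the film was bad', 'the plot']
toy_index = fit_word_index(toy_train)

# The full index is kept even though only the top-ranked words will be used.
print(len(toy_index))

# With num_words=3 only indexes 1 and 2 survive; 'was' and the unseen 'awful' are dropped.
toy_sequences = to_sequences(['the film was awful'], toy_index, num_words=3)
print(toy_sequences)
```

The same effect explains the notebook output: the index holds every token found in the training reviews, while the sequences fed to the models only contain indexes below MAX_NUM_WORDS.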


Here pad_sequences() turns each tokenized review into a fixed-length sequence of 1000 integers: shorter reviews are padded with 0s at the start, and longer reviews are truncated, by default also from the start, so their last 1000 tokens are kept. to_categorical() converts the 0/1 labels into one-hot vectors of length 2. The training data is then shuffled and split, with the last 20% (VALIDATION_SPLIT = 0.2) held out as the validation set. The shuffle is not seeded, so the split, and with it the reported accuracies, can differ slightly between runs. Finally, the code prints a message confirming that the split is done.

In [None]:
#Converting this to sequences to be fed into neural network. Max seq. len is 1000 as set earlier
#initial padding of 0s, until vector is of size MAX_SEQUENCE_LENGTH
trainvalid_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
trainvalid_labels = to_categorical(np.asarray(train_labels))
test_labels = to_categorical(np.asarray(test_labels))

# split the training data into a training set and a validation set
indices = np.arange(trainvalid_data.shape[0])
np.random.shuffle(indices)
trainvalid_data = trainvalid_data[indices]
trainvalid_labels = trainvalid_labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * trainvalid_data.shape[0])
x_train = trainvalid_data[:-num_validation_samples]
y_train = trainvalid_labels[:-num_validation_samples]
x_val = trainvalid_data[-num_validation_samples:]
y_val = trainvalid_labels[-num_validation_samples:]
#This is the data we will use for CNN and RNN training
print('Splitting the train data into train and valid is done')

Splitting the train data into train and valid is done
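
The padding, truncation, and one-hot steps are easy to check by hand on short lists. The sketch below re-implements the default behaviour in plain Python for illustration; pad_pre and one_hot are hypothetical helpers, not Keras functions, and the split arithmetic assumes the standard 25,000 IMDB training reviews, a figure that is not printed in this notebook.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last maxlen items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

def one_hot(label, num_classes=2):
    """Turn a class id into a one-hot row, e.g. class 1 of 2 becomes [0.0, 1.0]."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

short_review = pad_pre([7, 8, 9], maxlen=5)             # zeros are added in front
long_review = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)  # the first two tokens are cut
print(short_review, long_review)
print(one_hot(0), one_hot(1))

# Size of the validation split, assuming 25,000 training reviews (assumption).
assumed_train_size = 25000
num_validation = int(0.2 * assumed_train_size)
print(assumed_train_size - num_validation, num_validation)
```

Padding at the front keeps the real tokens at the end of the sequence, which is where a recurrent model reads them last.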


This code block prepares the embedding matrix for the pre-trained GloVe word embeddings. First, it reads the file glove.6B.100d.txt and maps each word to its 100-dimensional embedding vector. Then it creates a zero matrix of size (num_words, EMBEDDING_DIM) and copies the GloVe vector of each word in word_index into the row given by that word's index; words with an index above MAX_NUM_WORDS are skipped, and words that GloVe does not contain keep an all-zero row. Finally, it creates a Keras Embedding layer initialized with this matrix and sets trainable=False so that the embeddings stay fixed during training.

In [None]:
print('Preparing embedding matrix.')

# first, build index mapping words in the embeddings set
# to their embedding vector
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'),encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors in Glove embeddings.' % len(embeddings_index))
#print(embeddings_index["google"])

# prepare embedding matrix - rows are the words from word_index, columns are the embeddings of that word from glove.
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load these pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
print("Preparing of embedding matrix is done")

Preparing embedding matrix.
Found 400000 word vectors in Glove embeddings.
Preparing of embedding matrix is done
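
The matrix construction can be followed on a toy example. In this sketch the two-line GloVe-style text, the three-word toy_word_index, and the made-up word zzyzx are all assumptions for illustration; only the parsing and the row-filling logic mirror the cell above. It also computes the coverage, that is, the fraction of vocabulary words that received a pre-trained vector.

```python
import io
import numpy as np

# Two lines in the GloVe text format: a word followed by its vector components.
toy_glove = io.StringIO("good 0.1 0.2 0.3\nfilm 0.4 0.5 0.6\n")
toy_vectors = {}
for line in toy_glove:
    values = line.split()
    toy_vectors[values[0]] = np.asarray(values[1:], dtype='float32')

toy_word_index = {'film': 1, 'good': 2, 'zzyzx': 3}  # index 0 is reserved for padding
toy_dim = 3
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim))
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # words without a vector keep an all-zero row

covered = sum(word in toy_vectors for word in toy_word_index)
coverage = covered / len(toy_word_index)
print(toy_matrix)
print(coverage)
```

Row 0 stays zero because index 0 is the padding value, and row 3 stays zero because zzyzx has no pre-trained vector. A low coverage on real data would mean that many words are invisible to a model with a frozen embedding layer.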


### 1D CNN Model with pre-trained embedding
Here we define a 1D CNN model for sentiment classification. It reuses the frozen GloVe embedding layer prepared above, followed by three 1D convolutional layers (128 filters, kernel size 5), with max pooling after the first two and global max pooling after the third, then a dense layer with 128 units and a 2-unit softmax output layer. The model is compiled with categorical cross-entropy loss and the RMSprop optimizer, trained for one epoch with a batch size of 128 while the validation set is monitored, and finally evaluated on the test set to get the accuracy.

In [None]:
print('Define a 1D CNN model.')

cnnmodel = Sequential()
cnnmodel.add(embedding_layer)
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation='relu'))
cnnmodel.add(Dense(len(labels_index), activation='softmax'))

cnnmodel.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
#Train the model. Tune to validation set.
cnnmodel.fit(x_train, y_train,
          batch_size=128,
          epochs=1, validation_data=(x_val, y_val))
#Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print('Test accuracy with CNN:', acc)

Define a 1D CNN model.
Test accuracy with CNN: 0.7162799835205078
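
It can help to trace how the sequence length shrinks through the convolution and pooling stack. The sketch below applies the standard output-length formulas, assuming the Keras defaults that the cell relies on (valid padding and stride 1 for Conv1D, stride equal to the pool size for MaxPooling1D); the helper names are hypothetical.

```python
def conv1d_len(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def maxpool1d_len(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return (length - pool_size) // pool_size + 1

length = 1000  # MAX_SEQUENCE_LENGTH
trace = [length]
for kind, size in [('conv', 5), ('pool', 5), ('conv', 5), ('pool', 5), ('conv', 5)]:
    length = conv1d_len(length, size) if kind == 'conv' else maxpool1d_len(length, size)
    trace.append(length)

print(trace)
```

After the third convolution, GlobalMaxPooling1D takes the maximum over the remaining time steps for each of the 128 filters, so the dense layers receive a single 128-dimensional vector per review.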


### 1D CNN model with training your own embedding
This cell defines and trains the same CNN architecture without pre-trained embeddings: the embedding layer, Embedding(MAX_NUM_WORDS, 128), starts from random weights and is learned during training. It is followed by three 1D convolutional layers with 128 filters each, max pooling after the first two, and global max pooling after the third. The model is compiled and trained on the training data with a batch size of 128 for 1 epoch. Finally, the model is evaluated on the test set and the accuracy is printed.

In [None]:
print("Defining and training a CNN model, training embedding layer on the fly instead of using pre-trained embeddings")
cnnmodel = Sequential()
cnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(MaxPooling1D(5))
cnnmodel.add(Conv1D(128, 5, activation='relu'))
cnnmodel.add(GlobalMaxPooling1D())
cnnmodel.add(Dense(128, activation='relu'))
cnnmodel.add(Dense(len(labels_index), activation='softmax'))

cnnmodel.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
#Train the model. Tune to validation set.
cnnmodel.fit(x_train, y_train,
          batch_size=128,
          epochs=1, validation_data=(x_val, y_val))
#Evaluate on test set:
score, acc = cnnmodel.evaluate(test_data, test_labels)
print('Test accuracy with CNN:', acc)

Defining and training a CNN model, training embedding layer on the fly instead of using pre-trained embeddings
Test accuracy with CNN: 0.6886000037193298
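
One practical difference between the two CNN variants is how many weights the single training epoch has to fit. The arithmetic below is a rough sketch using the standard parameter-count formulas for embedding and 1D convolution layers; the helper names are hypothetical, and the vocabulary sizes are taken from the cells above (20001 rows for the GloVe matrix, 20000 for the learned embedding).

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def conv1d_params(in_channels, filters, kernel_size):
    """kernel_size * in_channels weights per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

frozen_glove = embedding_params(20001, 100)       # not updated during training
learned_embedding = embedding_params(20000, 128)  # updated during training

first_conv_glove = conv1d_params(100, 128, 5)     # input is 100-dimensional
first_conv_learned = conv1d_params(128, 128, 5)   # input is 128-dimensional

print(frozen_glove, learned_embedding)
print(first_conv_glove, first_conv_learned)
```

The learned embedding adds about 2.56 million trainable weights, which is a plausible (but here untested) reason why one epoch is not enough for this variant to match the pre-trained one.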


### LSTM Model with training your own embedding

In this cell, an LSTM model is defined and trained together with an embedding layer that is learned from scratch. The model has a single LSTM layer with 128 units, with a dropout of 0.2 on both the inputs and the recurrent connections. The output layer has two units with sigmoid activation, one per one-hot label column. The model is compiled with the binary cross-entropy loss, the Adam optimizer, and the accuracy metric. It is trained for one epoch on the training data with a batch size of 32 and validated on the validation data. Finally, the model is evaluated on the test set and the test accuracy is printed. The printed messages say RNN, but the recurrent layer used here is an LSTM.

In [None]:
print("Defining and training an LSTM model, training embedding layer on the fly")

#model
rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel.add(Dense(2, activation='sigmoid'))
rnnmodel.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print('Training the RNN')

rnnmodel.fit(x_train, y_train,
          batch_size=32,
          epochs=1,
          validation_data=(x_val, y_val))
score, acc = rnnmodel.evaluate(test_data, test_labels,
                            batch_size=32)
print('Test accuracy with RNN:', acc)

Defining and training an LSTM model, training embedding layer on the fly
Training the RNN
Test accuracy with RNN: 0.8523600101470947
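
The LSTM cells pair a two-unit sigmoid output with binary cross-entropy, while the CNN cells pair a two-unit softmax output with categorical cross-entropy. The sketch below computes both losses by hand for one example to show the difference. It is a simplified illustration: the function names and the example probabilities are made up, and the numerical clipping that Keras applies to the predictions is omitted.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Each output is scored as its own yes/no decision; the results are averaged."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

def categorical_crossentropy(y_true, y_pred):
    """Only the probability assigned to the true class contributes."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

label = [0.0, 1.0]           # one-hot label for a positive review
sigmoid_out = [0.3, 0.9]     # two independent sigmoids; they need not sum to 1
softmax_out = [0.25, 0.75]   # softmax outputs always sum to 1

bce = binary_crossentropy(label, sigmoid_out)
cce = categorical_crossentropy(label, softmax_out)
print(bce, cce)
```

Both setups can learn the task, but the two sigmoid units are trained as separate binary decisions, so their outputs are not a probability distribution over the two classes.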


### LSTM Model using pre-trained Embedding Layer
In this cell, an LSTM model is defined and trained using the pre-trained embedding layer. rnnmodel2 is created with the Sequential class, and the frozen GloVe embedding layer is added as its first layer. The LSTM layer with 128 units is added next, with a dropout rate of 0.2 on the inputs and on the recurrent connections to reduce overfitting. The output layer is a dense layer with two units and sigmoid activation. The model is then compiled with binary cross-entropy loss and the Adam optimizer.

The fit() method trains the model on the training set for one epoch with a batch size of 32 and validates it on the validation set. The evaluate() method then measures the performance of the trained model on the test set, and the test accuracy is printed at the end.

In [None]:
print("Defining and training an LSTM model, using pre-trained embedding layer")

rnnmodel2 = Sequential()
rnnmodel2.add(embedding_layer)
rnnmodel2.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel2.add(Dense(2, activation='sigmoid'))
rnnmodel2.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
print('Training the RNN')

rnnmodel2.fit(x_train, y_train,
          batch_size=32,
          epochs=1,
          validation_data=(x_val, y_val))
score, acc = rnnmodel2.evaluate(test_data, test_labels,
                            batch_size=32)
print('Test accuracy with RNN:', acc)

Defining and training an LSTM model, using pre-trained embedding layer
Training the RNN
Test accuracy with RNN: 0.787880003452301
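
The four test accuracies printed in this notebook can be put side by side with a few lines of Python. The numbers below are copied from the cell outputs above; the dictionary layout and the labels are just one way to organize them. Each model was trained for a single epoch on an unseeded split, so the ranking is only a snapshot of this run.

```python
# Test accuracies copied from the outputs of the four training cells.
results = {
    ('CNN', 'pre-trained GloVe'): 0.7162799835205078,
    ('CNN', 'own embedding'): 0.6886000037193298,
    ('LSTM', 'own embedding'): 0.8523600101470947,
    ('LSTM', 'pre-trained GloVe'): 0.787880003452301,
}

# Sort from best to worst test accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for (model, embedding), acc in ranked:
    print('%-5s %-18s %.4f' % (model, embedding, acc))
```

In this run both LSTM variants score higher than both CNN variants, and pre-trained embeddings help the CNN but not the LSTM.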
