
# IMDB Sentiment Analysis using LSTM


In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Installing collected packages: dill, multiprocess, huggingface-hub, datasets
Successfully installed datasets-2.1

# Importing Libraries


In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import tqdm

import functools
import sys

import datasets
import matplotlib.pyplot as plt
import numpy as np

In [3]:
# Set the random seed for reproducibility
torch.manual_seed(42)

<torch._C.Generator at 0x7b97193a5010>

In [64]:
# Use the GPU if one is available (a T4 in this Colab session), otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [5]:
# Download the Dataset using datasets library by HuggingFace
train_data, test_data = datasets.load_dataset('imdb', split=['train', 'test'])

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

## Torchtext
Torchtext is a PyTorch library for NLP. It provides most of the pre-processing utilities needed for text data, such as tokenizers, vocabularies, and pre-trained word vectors.

# Tokenize the sequence

In [6]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

`'basic_english'` is a predefined tokenizer configuration in torchtext. It lowercases the text, strips a few characters (double quotes, colons, semicolons, and `<br />` tags), puts spaces around the remaining common punctuation such as periods, commas, and parentheses so that each mark becomes its own token, and then splits on whitespace. It is a simple approach that is adequate for many NLP tasks.
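A rough pure-Python approximation of this normalization may help to see what the tokenizer produces. The function name `basic_english_sketch` and the rule list are assumptions about torchtext's behaviour, not the library's code, and details may differ between versions.

```python
import re

# Assumed normalization rules, applied in order after lowercasing.
# This is a sketch of the 'basic_english' behaviour, not torchtext itself.
_RULES = [
    (r"\'", " ' "),     # apostrophes become separate tokens
    (r"\"", ""),        # double quotes are dropped
    (r"\.", " . "),     # periods become separate tokens
    (r"<br \/>", " "),  # HTML line breaks are dropped
    (r",", " , "),
    (r"\(", " ( "),
    (r"\)", " ) "),
    (r"\!", " ! "),
    (r"\?", " ? "),
    (r"\;", " "),       # semicolons are dropped
    (r"\:", " "),       # colons are dropped
    (r"\s+", " "),      # collapse repeated whitespace
]


def basic_english_sketch(text):
    """Lowercase, apply the rules above, and split on whitespace."""
    text = text.lower()
    for pattern, replacement in _RULES:
        text = re.sub(pattern, replacement, text)
    return text.split()


tokens = basic_english_sketch("Amazing movie, loved it!")
print(tokens)  # ['amazing', 'movie', ',', 'loved', 'it', '!']
```

In this sketch the comma and the exclamation mark survive as their own tokens, while a `<br />` tag inside a review would disappear.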

In [7]:
def tokenize_data(example, tokenizer, max_length):
  # Tokenize the review text and keep at most max_length tokens
  tokens = tokenizer(example['text'])[:max_length]
  length = len(tokens)    # number of tokens kept after truncation
  # Return the tokens and their count; Dataset.map adds both as new columns
  return {'tokens': tokens, 'length': length}

In [8]:
max_length = 256    # maximum number of tokens kept per review

# Apply tokenize_data to every example with Dataset.map. fn_kwargs forwards the
# tokenizer created above and max_length to tokenize_data as keyword arguments.
train_token = train_data.map(tokenize_data, fn_kwargs={'tokenizer': tokenizer, 'max_length': max_length})
test_token = test_data.map(tokenize_data, fn_kwargs={'tokenizer': tokenizer, 'max_length': max_length})

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

In [9]:
# Before
train_data

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [10]:
# After
train_token

Dataset({
    features: ['text', 'label', 'tokens', 'length'],
    num_rows: 25000
})

In [11]:
train_token[:1]

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

In [12]:
print(train_token['label'][:500])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [128]:
train_data['label'][600:615]

tensor([0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

### Split the train data into train and validation sets before building the vocabulary to avoid data leakage

In [14]:
train_valid_data = train_token.train_test_split(test_size = 0.2)

In [15]:
train_data = train_valid_data['train']
valid_data = train_valid_data['test']

In [16]:
vocab = torchtext.vocab.build_vocab_from_iterator(train_data['tokens'], specials=['<UNK>', '<PAD>'], min_freq=10)

This line builds a vocabulary from the tokens of the training split only. It keeps every token that appears at least 10 times (`min_freq=10`) and adds two special tokens, `<UNK>` for unknown words and `<PAD>` for padding, at indices 0 and 1. The resulting `vocab` maps each token to a numerical ID, which is the form the model consumes.
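The idea behind a frequency-filtered vocabulary can be sketched with the standard library alone. `build_vocab_sketch` and the toy corpus are hypothetical, and the ordering of the kept tokens (most frequent first, ties alphabetical) is an assumption made for the sketch, not a statement about torchtext internals.

```python
from collections import Counter


def build_vocab_sketch(token_lists, specials, min_freq):
    """Map tokens to ids: specials first, then tokens seen at least min_freq times."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, count in ordered if count >= min_freq]
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


# Toy corpus (hypothetical): 'good' appears 3 times, 'movie' twice, the rest once
corpus = [['good', 'movie'], ['bad', 'movie'], ['good', 'good', 'plot']]
stoi = build_vocab_sketch(corpus, specials=['<UNK>', '<PAD>'], min_freq=2)
print(stoi)  # {'<UNK>': 0, '<PAD>': 1, 'good': 2, 'movie': 3}

# Rare or unseen tokens fall back to the <UNK> index
plot_id = stoi.get('plot', stoi['<UNK>'])
print(plot_id)  # 0
```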

In [17]:
vocab['<UNK>']

0

In [18]:
vocab.set_default_index(0)

## Note: Why do we need UNK and PAD?
The vocabulary is built from the training data only. When the model later encounters a word that is not in the vocabulary, the word is mapped to `<UNK>`, which stands for unknown. That is why the index of `<UNK>` is set as the default index above.

Let's take a few sample movie reviews:

- I loved this movie
- Amazing
- Impressive storyline
- Terrible experience not recommended to watch

These reviews all have different lengths, but the sequences in a batch must have the same length to be stacked into one tensor. We therefore set a maximum length: sequences longer than it are truncated, and shorter sequences are filled with the `<PAD>` token (index 1 in this vocabulary) until they match the longest sequence in the batch.
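As a small illustration, the four sample reviews can be encoded and padded by hand. The toy vocabulary, the helper names `encode` and `pad_batch`, and the maximum length of 5 are all hypothetical choices for this sketch.

```python
# Toy vocabulary (hypothetical): index 0 is <UNK>, index 1 is <PAD>
toy_vocab = {'<UNK>': 0, '<PAD>': 1, 'i': 2, 'loved': 3, 'this': 4,
             'movie': 5, 'amazing': 6}


def encode(text, vocab):
    """Lowercase, split on whitespace, and map unseen words to <UNK>."""
    return [vocab.get(word, vocab['<UNK>']) for word in text.lower().split()]


def pad_batch(sequences, pad_index, max_length):
    """Truncate to max_length, then pad to the longest remaining sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    return [seq + [pad_index] * (longest - len(seq)) for seq in truncated]


reviews = ['I loved this movie', 'Amazing', 'Impressive storyline',
           'Terrible experience not recommended to watch']
encoded = [encode(review, toy_vocab) for review in reviews]
batch = pad_batch(encoded, pad_index=toy_vocab['<PAD>'], max_length=5)
for row in batch:
    print(row)
# [2, 3, 4, 5, 1]
# [6, 1, 1, 1, 1]
# [0, 0, 1, 1, 1]
# [0, 0, 0, 0, 0]
```

The last review has six words, none of them in the toy vocabulary, so it is truncated to five `<UNK>` ids.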

# Prepare the dataset for the model

In [19]:
def convert_into_tokens(example, vocab):
  # Look up the numerical ID of every token in example['tokens'];
  # tokens missing from the vocabulary map to the default index (<UNK>).
  ids = [vocab[token] for token in example['tokens']]
  return {'ids':ids}

In [20]:
# training split: used to fit the model
train_data = train_data.map(convert_into_tokens, fn_kwargs={'vocab': vocab})
# validation split: used to monitor performance during training
valid_data = valid_data.map(convert_into_tokens, fn_kwargs={'vocab': vocab})
# test split: unseen data used for the final evaluation
test_data = test_token.map(convert_into_tokens, fn_kwargs={'vocab': vocab})

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

In [21]:
train_data['ids'][:5]

# Each token is a single word or punctuation mark; each id is the integer the vocabulary assigns to that token

[[2,
  626,
  351,
  17,
  2,
  121,
  160,
  47,
  14,
  18,
  4,
  91,
  11,
  502,
  26,
  472,
  15,
  26,
  68,
  35,
  301,
  5,
  4704,
  4,
  11702,
  1511,
  23,
  3,
  1514,
  15,
  626,
  351,
  21,
  5543,
  76,
  472,
  3,
  19,
  2,
  23,
  8499,
  13,
  5,
  587,
  4,
  6306,
  1447,
  4,
  61,
  2557,
  75,
  8064,
  13,
  3872,
  8,
  14,
  23,
  17,
  46,
  107,
  85,
  13844,
  13,
  61,
  3969,
  57,
  124,
  2,
  151,
  31,
  8,
  962,
  2,
  170,
  8,
  35,
  13,
  14,
  23,
  57,
  124,
  12,
  87,
  61,
  3473,
  0,
  37,
  2,
  942,
  1109,
  57,
  422,
  4,
  3175,
  5391,
  3,
  12,
  186,
  14,
  23,
  13,
  2,
  1495,
  36,
  14,
  68,
  35,
  2,
  71,
  1267,
  12,
  93,
  106,
  47,
  301,
  5,
  23,
  37,
  343,
  132,
  5,
  18,
  1495,
  4,
  91,
  37,
  343,
  26,
  59,
  491,
  11,
  133,
  3,
  75,
  46,
  110,
  1464,
  1164,
  57,
  731,
  0,
  19,
  2,
  1282,
  1079,
  4016,
  72,
  32,
  403,
  4,
  264,
  13,
  5,
  872,
  9266,
  3,
  92,
  8

In [22]:
train_data = train_data.with_format(type='torch', columns=['ids', 'label', 'length'])
valid_data = valid_data.with_format(type='torch', columns=['ids', 'label', 'length'])
test_data = test_data.with_format(type='torch', columns=['ids', 'label', 'length'])

# Model Building LSTM

Dropout is a regularization technique commonly used in neural networks, including Long Short-Term Memory (LSTM) networks, to prevent overfitting. In LSTM networks, dropout can be applied to the input and recurrent connections within the network.

The dropout rate in an LSTM network is the probability of "dropping out", that is zeroing, a given unit or connection during training. It is a hyperparameter that controls the strength of the regularization and, being a probability, lies between 0 and 1. Dropout is only active during training; at evaluation time all units are kept.

Here's how dropout works in LSTM networks:

1. **Input Dropout**: When input dropout is applied to an LSTM, for each training example, at each time step, there's a probability (specified by the dropout rate) that an input feature's value is set to zero. This effectively means that the network cannot rely on any single input feature too heavily, and it has to learn to be robust to variations in the input data.

2. **Recurrent Dropout**: In addition to input dropout, recurrent dropout applies dropout to the recurrent connections within the LSTM cells. This means that at each time step, there's a probability that the information stored in the cell state and the hidden state is partially "forgotten." Recurrent dropout helps prevent overfitting by making it more challenging for the network to memorize the training data.

By tuning the dropout rate, you can control the amount of regularization applied to your LSTM network. A higher dropout rate introduces more regularization, which can help prevent overfitting but might slow down training and lead to underfitting if set too high. A lower dropout rate allows the model to learn more from the data but might lead to overfitting if the dataset is small or noisy.

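Typical dropout rates for LSTMs range from 0.2 to 0.5, but the best value depends on the problem and dataset and usually has to be found by experimentation.

Note that the `dropout` argument of PyTorch's `nn.LSTM`, used in the model below, is neither of the two variants above: it applies dropout to the outputs of each LSTM layer except the last, so it only has an effect when `n_layers` is greater than 1.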
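The mechanics of a dropout layer can be sketched in a few lines of plain Python. `dropout_sketch` is a hypothetical helper, not PyTorch's implementation; it uses the "inverted" scaling that `nn.Dropout` also applies, where surviving values are multiplied by `1 / (1 - p)` during training so that the expected activation stays the same.

```python
import random


def dropout_sketch(values, p, training=True, rng=None):
    """Zero each value with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return list(values)  # dropout is a no-op at evaluation time
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


activations = [1.0] * 1000
dropped = dropout_sketch(activations, p=0.5, rng=random.Random(0))
n_zeroed = sum(1 for v in dropped if v == 0.0)
print(n_zeroed)  # roughly half of the 1000 values

unchanged = dropout_sketch([1.0, 2.0], p=0.5, training=False)
print(unchanged)  # [1.0, 2.0]
```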

In [38]:
class LSTMmodel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers,dropout_rate, pad_index):
        super().__init__()

        # layer 1- Pass the ids to the embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)

        # layer 2- LSTM [If n_layers = 2, then layer 3 is also LSTM]
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,dropout=dropout_rate, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout_rate) # to avoid overfitting

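    def forward(self, ids, batch_size):
        # NOTE: despite its name, `batch_size` holds the unpadded length of each
        # sequence in the batch (a 1-D CPU tensor), as pack_padded_sequence requires.

        # token ids -> embeddings of shape (batch, seq_length, embedding_dim)
        embedded = self.dropout(self.embedding(ids))
        # pack the batch so the LSTM skips the padded positions
        packed = nn.utils.rnn.pack_padded_sequence(embedded, batch_size, batch_first=True, enforce_sorted=False)

        outputs, (hidden, cell) = self.lstm(packed)

        # per-step outputs, unpacked for completeness; not used for the prediction
        output, output_length = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
        # final hidden state of the top LSTM layer: (batch, hidden_dim)
        hidden = self.dropout(hidden[-1])

        prediction = self.fc(hidden)
        return prediction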

### Explanation

```python
class LSTMmodel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout_rate, pad_index):
        super().__init__()

        # Layer 1: Pass the ids to the embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_index)

        # Layer 2: LSTM [if n_layers=2, then Layer 3 is also LSTM]
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout_rate, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout_rate)  # To avoid overfitting
```

1. `LSTMmodel` is a custom PyTorch model class that inherits from `nn.Module`, the base class for PyTorch models.

2. In the constructor `__init__()`, the model is defined with several layers and parameters:
   - `vocab_size`: The size of the vocabulary, indicating the number of unique tokens in the input text.
   - `embedding_dim`: The dimension of the word embeddings for each token.
   - `hidden_dim`: The dimension of the hidden states in the LSTM layers.
   - `output_dim`: The dimension of the output.
   - `n_layers`: The number of LSTM layers.
   - `dropout_rate`: The dropout probability, applied to the embedding output, between the stacked LSTM layers, and to the final hidden state to reduce overfitting.
   - `pad_index`: The index for padding tokens in the vocabulary.

3. Layer 1: `self.embedding` is an embedding layer that converts input token IDs into dense word embeddings. These embeddings are learned during training.

4. Layer 2: `self.lstm` is an LSTM layer. The number of layers and other LSTM parameters are determined by the constructor arguments. The `batch_first=True` argument means that the input data should have dimensions in the form of `(batch_size, sequence_length, embedding_dim)`.

5. `self.fc` is a fully connected layer that takes the hidden state from the LSTM and transforms it to the desired output dimension.

6. `self.dropout` is a dropout layer applied to prevent overfitting.

```python
    def forward(self, ids, batch_size):
        # Tokens to embedding
        embedded = self.dropout(self.embedding(ids))
        embedded = nn.utils.rnn.pack_padded_sequence(embedded, batch_size, batch_first=True, enforce_sorted=False)
```

1. In the `forward` method, the model takes the padded token `ids` and `batch_size`. Despite its name, `batch_size` holds the unpadded length of each sequence in the batch.

2. The input token IDs are passed through the embedding layer and are subject to dropout.

3. `nn.utils.rnn.pack_padded_sequence` is used to pack the embedded sequences into a format suitable for the LSTM. This is often used when working with sequences of varying lengths, and it helps the model avoid unnecessary computations for padding tokens.
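To see why packing saves work, it helps to count how many sequences are still "alive" at each time step. The helper `packed_batch_sizes` below is a hypothetical, conceptual sketch in plain Python; it mirrors the kind of per-step batch sizes a packed sequence keeps track of, but it is not PyTorch code.

```python
def packed_batch_sizes(lengths):
    """Number of sequences that still have a real token at each time step."""
    return [sum(1 for n in lengths if n > t) for t in range(max(lengths))]


lengths = [3, 1, 2]                         # unpadded lengths of three sequences
batch_sizes = packed_batch_sizes(lengths)
packed_steps = sum(batch_sizes)             # token positions the LSTM processes
padded_steps = len(lengths) * max(lengths)  # positions in the padded tensor
print(batch_sizes, packed_steps, padded_steps)  # [3, 2, 1] 6 9
```

With packing, the LSTM processes 6 real positions instead of all 9 positions of the padded batch, and the final hidden state of each sequence comes from its last real token rather than from padding.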

```python
        # Embedding sequence (batch_size, seq_length, emb_dim) to LSTM
        outputs, (hidden, cell) = self.lstm(embedded)

        output, output_length = nn.utils.rnn.pad_packed_sequence(outputs)
        hidden = self.dropout(hidden[-1])

        prediction = self.fc(hidden)
        return prediction
```

1. The packed embedded sequence is passed through the LSTM layer, producing `outputs`, `hidden`, and `cell` values.

2. `nn.utils.rnn.pad_packed_sequence` unpacks the LSTM outputs back into a padded tensor and also returns the original sequence lengths. The result is not used for the prediction.

3. The hidden state from the last LSTM layer is selected and subjected to dropout.

4. The final prediction is made by passing the processed hidden state through the fully connected layer `self.fc`, and it's returned as the output of the model.

`self.fc` is a linear transformation that maps the final hidden state to `output_dim` scores, one per class. No activation function is applied inside the model: the raw scores (logits) are passed to `nn.CrossEntropyLoss`, which applies log-softmax internally during training, and `softmax` is applied explicitly at prediction time.

In [107]:
vocab_size = len(vocab)
embedding_dim = 300
hidden_dim = 64
output_dim = len(train_data.unique('label')) # two classes: 0 (negative) and 1 (positive)
n_layers = 2
dropout_rate = 0.5

model = LSTMmodel(vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout_rate,vocab['<PAD>'])
model = model.to(device) # move the model to the selected device (GPU if available)

In [108]:
sum(p.numel() for p in model.parameters() if p.requires_grad)   # total parameters

4507706
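The parameter count can be checked by hand. The sketch below uses the vocabulary size of 14602 shown in the printed model summary and assumes the usual PyTorch LSTM layout: four gates per layer, each with input weights, recurrent weights, and two bias vectors.

```python
vocab_size, embedding_dim, hidden_dim, output_dim, n_layers = 14602, 300, 64, 2, 2

# Embedding table: one embedding_dim vector per vocabulary entry
embedding_params = vocab_size * embedding_dim


def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates x (input weights + recurrent weights) + 2 bias vectors per gate
    return 4 * hidden_dim * (input_dim + hidden_dim) + 2 * 4 * hidden_dim


# First layer reads embeddings; every further layer reads the previous hidden states
lstm_params = lstm_layer_params(embedding_dim, hidden_dim)
lstm_params += sum(lstm_layer_params(hidden_dim, hidden_dim) for _ in range(n_layers - 1))

# Linear layer: weight matrix plus bias
fc_params = hidden_dim * output_dim + output_dim

total = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total)  # 4380600 126976 130 4507706
```

Almost all of the parameters sit in the embedding table, which is one reason why initializing it from pre-trained vectors matters.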

In [109]:
def initialize_weights(m):
  if isinstance(m, nn.Linear):
    nn.init.xavier_normal_(m.weight)
    nn.init.zeros_(m.bias)
  elif isinstance(m, nn.LSTM):
    for name, param in m.named_parameters():
      if 'bias' in name:
        nn.init.zeros_(param)
      elif 'weight' in name:
        nn.init.orthogonal_(param)

### Explanation

The `initialize_weights` function takes a module `m`, typically a single layer of the network, and checks its type with `isinstance`.

For `nn.Linear` layers:

- The weights are initialized with Xavier normal initialization, which is designed to keep the variance of the activations roughly the same across layers and so helps convergence.
- The biases are initialized to zeros with `nn.init.zeros_`, a common default.

For `nn.LSTM` layers, it iterates through the named parameters:

- Bias parameters (those with 'bias' in their names) are initialized to zeros.
- Weight parameters (those with 'weight' in their names) are initialized with orthogonal initialization, which helps preserve gradient information during training.
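For a sense of scale, Xavier normal initialization draws weights from a normal distribution with standard deviation `gain * sqrt(2 / (fan_in + fan_out))`. The sketch below evaluates that formula for the 64-to-2 linear layer of this model and draws samples with the standard library; `xavier_normal_std` is a hypothetical helper, not a PyTorch function.

```python
import math
import random


def xavier_normal_std(fan_in, fan_out, gain=1.0):
    """Standard deviation used by Xavier (Glorot) normal initialization."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


# The final linear layer maps hidden_dim=64 inputs to output_dim=2 outputs
std = xavier_normal_std(fan_in=64, fan_out=2)
print(round(std, 4))  # 0.1741

# Draw weights from N(0, std^2) and check the empirical spread
rng = random.Random(0)
samples = [rng.gauss(0.0, std) for _ in range(20000)]
mean = sum(samples) / len(samples)
empirical_std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```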

In [110]:
model.apply(initialize_weights)

LSTMmodel(
  (embedding): Embedding(14602, 300, padding_idx=1)
  (lstm): LSTM(300, 64, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=64, out_features=2, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Using Pre-trained embeddings

Some of the most widely used pre-trained embeddings are:

- GloVe
- Word2Vec
- FastText

In [111]:
vectors = torchtext.vocab.GloVe()   # load pre-trained GloVe vectors (torchtext defaults to the 840B, 300-dimensional set)

In [112]:
pretrained_embedding = vectors.get_vecs_by_tokens(vocab.get_itos())

In [113]:
model.embedding.weight.data = pretrained_embedding

### Explanation

1. `vectors = torchtext.vocab.GloVe()`: This line creates an instance of the GloVe pre-trained word vectors using the `torchtext.vocab.GloVe()` class. GloVe (Global Vectors for Word Representation) is a popular word embedding model.

2. `pretrained_embedding = vectors.get_vecs_by_tokens(vocab.get_itos())`: This line looks up the pre-trained vector of every token in the vocabulary:
   - `vocab.get_itos()` returns the list of tokens in the vocabulary, ordered by index.
   - `vectors.get_vecs_by_tokens()` returns a tensor with one row per token. Tokens that have no GloVe vector get a zero vector by default.

3. `model.embedding.weight.data = pretrained_embedding`: This line replaces the randomly initialized weights of the embedding layer with the pre-trained vectors:
   - `model.embedding` is the embedding layer of the model.
   - `.weight` is the parameter that stores the embedding matrix.
   - `.data` gives direct access to the underlying tensor, so it can be overwritten without being tracked by autograd.
   - `pretrained_embedding` contains the pre-trained vectors from the previous step. The tensor lives on the CPU, which is why the model is moved to the device again before the training loop.
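The lookup step can be pictured with plain lists. Everything in the sketch below is hypothetical: the 3-dimensional vectors stand in for GloVe, `vectors_for_tokens` stands in for the library lookup, and the zero-vector fallback reflects the default behaviour described above.

```python
# Hypothetical 3-dimensional "pre-trained" vectors standing in for GloVe
pretrained = {'good': [0.1, 0.2, 0.3], 'movie': [0.4, 0.5, 0.6]}
dim = 3

# Vocabulary tokens in index order, as an index-to-string list would hold them
itos = ['<UNK>', '<PAD>', 'good', 'movie', 'zzzunseen']


def vectors_for_tokens(tokens, pretrained, dim):
    """One row per token; tokens without a pre-trained vector get zeros."""
    return [list(pretrained.get(tok, [0.0] * dim)) for tok in tokens]


embedding_matrix = vectors_for_tokens(itos, pretrained, dim)
for token, row in zip(itos, embedding_matrix):
    print(token, row)
```

Row `i` of the matrix belongs to the token with id `i`, which is what lets the matrix be copied straight into the embedding layer.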



## Compile Model

- Three important choices that influence training are:
  - Optimizer: the gradient-descent algorithm [Adam, SGD, RMSProp]
  - Loss function: binary cross-entropy loss or cross-entropy loss
  - Evaluation metrics [Accuracy, Precision, Recall]

In [114]:
learning_rate = 1e-4
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
loss_function = nn.CrossEntropyLoss().to(device)
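`nn.CrossEntropyLoss` expects raw class scores (logits) and applies log-softmax itself, so the model does not need a softmax layer. The plain-Python sketch below shows the computation for a single example; `log_softmax` and `cross_entropy` are hypothetical helpers and the logits are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class."""
    return -log_softmax(logits)[target]


loss_confident = cross_entropy([2.0, 0.0], target=0)  # right class, high score
loss_wrong = cross_entropy([2.0, 0.0], target=1)      # wrong class, high score
loss_unsure = cross_entropy([0.0, 0.0], target=0)     # undecided model
print(round(loss_confident, 4), round(loss_wrong, 4), round(loss_unsure, 4))
# 0.1269 2.1269 0.6931
```

An undecided two-class model scores ln 2 ≈ 0.693, which is close to the first-epoch training loss of 0.6891 printed by the training loop.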

In [115]:
def metrics(prediction, actual):
  batch_size, _ = prediction.shape
  # Predicted class = index of the highest score along the last (class) dimension
  predicted_classes = prediction.argmax(dim=-1)

  # Count how many predicted classes match the true labels
  correct_predictions = predicted_classes.eq(actual).sum()

  # Fraction of correct predictions in this batch
  accuracy = correct_predictions / batch_size

  return accuracy
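The same accuracy computation, written with plain lists for a hypothetical batch of four examples, makes the argmax-then-compare logic easy to follow. `batch_accuracy` and the scores are illustrative only.

```python
def batch_accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


# Hypothetical class scores for four reviews (columns: negative, positive)
scores = [[1.2, -0.3], [0.1, 0.4], [-2.0, 0.5], [0.9, 0.8]]
labels = [0, 1, 0, 0]
accuracy = batch_accuracy(scores, labels)
print(accuracy)  # 0.75
```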

In [116]:
def collate(batch, pad_index):
  batch_ids = [i['ids'] for i in batch]
  batch_ids = nn.utils.rnn.pad_sequence(batch_ids, padding_value=pad_index, batch_first = True)
  # Pad every sequence to the length of the longest one in the batch.
  # padding_value=pad_index is the fill value; batch_first=True gives a tensor of shape (batch, max_len).

  batch_length = [i['length'] for i in batch]
  batch_length = torch.stack(batch_length)
  # 1-D tensor with the original (unpadded) length of every sequence

  batch_label = [i['label'] for i in batch]
  batch_label = torch.stack(batch_label)

  batch = {
      'ids': batch_ids,
      'length': batch_length,
      'label': batch_label
  }

  return batch


# Fit the Model

In [117]:
batch_size = 64
collate = functools.partial(collate, pad_index = vocab['<PAD>'])
# Bind pad_index once so the DataLoader can call collate with the batch as its only argument

train_dataloader = torch.utils.data.DataLoader(train_data,batch_size=batch_size,collate_fn=collate,shuffle=True)

valid_dataloader = torch.utils.data.DataLoader(valid_data, batch_size=batch_size, collate_fn=collate)

test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, collate_fn=collate)

# Train the model

In [118]:
def train(dataloader, model, loss_function, optimizer, device):
    model.train()

    epoch_losses = []
    epoch_accs = []

    for batch in tqdm.tqdm(dataloader, desc='training...', file=sys.stdout):
        ids = batch['ids'].to(device)

        # original sequence lengths; kept on the CPU, as pack_padded_sequence requires
        length = batch['length']
        label = batch['label'].to(device)

        # forward pass: class scores (logits) for every example
        prediction = model(ids, length)
        # compare the predictions with the true labels
        loss = loss_function(prediction, label)

        accuracy = metrics(prediction, label)
        optimizer.zero_grad()  # clear gradients left over from the previous step

        loss.backward()   # backpropagation: compute the gradient of the loss for every parameter
        optimizer.step()  # update the parameters with one step of the optimizer

        epoch_losses.append(loss.item())
        epoch_accs.append(accuracy.item())

    return epoch_losses, epoch_accs

# Evaluation

In [119]:
def evaluate(dataloader, model, loss_function, device):

    model.eval()
    epoch_losses = []
    epoch_accs = []

    with torch.no_grad(): # evaluation only: no gradients are tracked and no weights are updated
        for batch in tqdm.tqdm(dataloader, desc='evaluating...', file=sys.stdout):
            ids = batch['ids'].to(device)
            length = batch['length']
            label = batch['label'].to(device)
            prediction = model(ids, length)
            loss = loss_function(prediction, label)
            accuracy = metrics(prediction, label)
            epoch_losses.append(loss.item())
            epoch_accs.append(accuracy.item())

    return epoch_losses, epoch_accs

In [121]:
n_epochs = 5
best_valid_loss = float('inf')
model.to(device)
train_losses = []
train_accs = []
valid_losses = []
valid_accs = []

for epoch in range(n_epochs):

    train_loss, train_acc = train(train_dataloader, model, loss_function, optimizer, device)
    valid_loss, valid_acc = evaluate(valid_dataloader, model, loss_function, device)

    train_losses.extend(train_loss)
    train_accs.extend(train_acc)
    valid_losses.extend(valid_loss)
    valid_accs.extend(valid_acc)

    epoch_train_loss = np.mean(train_loss)
    epoch_train_acc = np.mean(train_acc)
    epoch_valid_loss = np.mean(valid_loss)
    epoch_valid_acc = np.mean(valid_acc)

    if epoch_valid_loss < best_valid_loss:
        best_valid_loss = epoch_valid_loss
        torch.save(model.state_dict(), 'lstm.pt')

    print(f'Epoch: {epoch+1}/{n_epochs}')
    print(f'loss: {epoch_train_loss:.4f}, accuracy: {epoch_train_acc:.4f}')
    print(f'valid_loss: {epoch_valid_loss:.4f}, valid_accuracy: {epoch_valid_acc:.4f}')
    print("--"*25)

training...: 100%|██████████| 313/313 [00:10<00:00, 28.99it/s]
evaluating...: 100%|██████████| 79/79 [00:01<00:00, 57.82it/s]
Epoch: 1/5
loss: 0.6891, accuracy: 0.5368
valid_loss: 0.6743, valid_accuracy: 0.6228
--------------------------------------------------
training...: 100%|██████████| 313/313 [00:08<00:00, 37.33it/s]
evaluating...: 100%|██████████| 79/79 [00:00<00:00, 91.78it/s]
Epoch: 2/5
loss: 0.5926, accuracy: 0.6901
valid_loss: 0.4656, valid_accuracy: 0.7925
--------------------------------------------------
training...: 100%|██████████| 313/313 [00:09<00:00, 34.27it/s]
evaluating...: 100%|██████████| 79/79 [00:00<00:00, 94.85it/s]
Epoch: 3/5
loss: 0.4802, accuracy: 0.7853
valid_loss: 0.4200, valid_accuracy: 0.8248
--------------------------------------------------
training...: 100%|██████████| 313/313 [00:09<00:00, 34.49it/s]
evaluating...: 100%|██████████| 79/79 [00:00<00:00, 94.53it/s]
Epoch: 4/5
loss: 0.4385, accuracy: 0.8098
valid_loss: 0.4154, valid_accuracy: 0.8232
---

In [122]:
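model.load_state_dict(torch.load('lstm.pt')) # load the checkpoint with the lowest validation loss

test_loss, test_acc = evaluate(test_dataloader, model, loss_function, device)

# NOTE: np.max reports the highest per-batch loss and accuracy, not the test-set
# averages; use np.mean here to get the averages.
epoch_test_loss = np.max(test_loss)
epoch_test_acc = np.max(test_acc)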

print("Loss",epoch_test_loss)
print("Acc",epoch_test_acc)

evaluating...: 100%|██████████| 391/391 [00:05<00:00, 76.44it/s]
Loss 0.8673236966133118
Acc 0.984375
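Per-batch metrics are usually summarized by their mean; the maximum only describes a single batch. The sketch below, with made-up per-batch accuracies, compares the maximum, the simple mean, and a mean weighted by batch size, which is the exact dataset accuracy when the last batch is smaller than the others. `summarize` is a hypothetical helper.

```python
def summarize(batch_accs, batch_sizes):
    """Return (max, simple mean, size-weighted mean) of per-batch accuracies."""
    simple_mean = sum(batch_accs) / len(batch_accs)
    weighted_mean = sum(a * n for a, n in zip(batch_accs, batch_sizes)) / sum(batch_sizes)
    return max(batch_accs), simple_mean, weighted_mean


# Hypothetical per-batch accuracies; the last batch is smaller, as happens when
# the dataset size is not a multiple of the batch size
batch_accs = [0.75, 1.0, 0.5, 0.5]
batch_sizes = [64, 64, 64, 8]
best, simple_mean, weighted_mean = summarize(batch_accs, batch_sizes)
print(best, simple_mean, weighted_mean)  # 1.0 0.6875 0.74
```

In this toy case the best batch is perfect even though only 74% of all examples are classified correctly.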


# Prediction on User input

In [125]:
def make_prediction(text, model, tokenizer, vocab):
    # tokenize the user input
    tokens = tokenizer(text)
    # map every token to its numerical id (unknown words map to <UNK>)
    ids = [vocab[t] for t in tokens]

    # sequence length, plus the ids as a (1, seq_length) tensor on the model's device
    length = torch.LongTensor([len(ids)])
    tensor = torch.LongTensor(ids).unsqueeze(dim=0).to(device)

    # forward pass in eval mode without gradient tracking;
    # squeeze(dim=0) drops the batch dimension, leaving one score per class
    model.eval()
    with torch.no_grad():
        prediction = model(tensor, length).squeeze(dim=0)

    # softmax turns the class scores into a probability distribution
    probability = torch.softmax(prediction, dim=-1)

    # class with the highest score, and the probability assigned to it,
    # both converted to plain Python numbers
    predicted_class = prediction.argmax(dim=-1).item()
    predicted_probability = probability[predicted_class].item()

    return predicted_class, predicted_probability
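The last three steps of the function (softmax, argmax, and reading off the probability) can be traced by hand on a pair of made-up scores. The helper names and the scores below are hypothetical; the label names follow the `display` function, where class 0 is negative and class 1 is positive.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_from_scores(scores, class_names=('Negative', 'Positive')):
    """Pick the most probable class and return its name and probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical raw scores for one review (negative, positive)
label, score = predict_from_scores([-0.4, 0.45])
print(f"{label}-Score:{score:.4f}")
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw scores or of the probabilities selects the same class.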

In [126]:
def display(label,score):
    if label==0:
        print(f"Negative-Score:{score}")
    else:
        print(f"Positive-Score:{score}")

In [127]:
text = "Amazing movie, loved it"
label,score = make_prediction(text, model, tokenizer, vocab)
display(label,score)

Positive-Score:0.7001408338546753


In [129]:
text = "The film is based on the exploits of Jaswant Singh Gill, better known as ‘Capsule Gill’. He was a brave and diligent mining engineer from IIT Dhanbad who rescued 65 trapped miners at the Raniganj Coalfields in 1989. He devised a unique method where a metal capsule, capable of carrying a full-sized man, was dropped down a specially bored hole and the miners were rescued one-by-one. Gill himself went down the shaft first to see through the plan at ground level. He stayed down till all the trapped miners were hauled up and was the last person to come out after a six-hour ordeal. He was awarded the Sarvottam Jeevan Raksha Padak by the then President Ramaswamy Venkataraman in 1991 for his bravery. Coal India also honoured him with a lifetime achievement award.The film is a dramatisation of the events which took place on that fateful day. We see interdepartmental rivalry at play, with some officers serving in the same branch coming in the way of Gill’s plans as they feel his achievements will overshadow their efforts. In contrast to the corrupt officials, symbolised by D Sen (Dibyendu Bhattacharya), we see honest government folk, like R J Ujjwal (Kumud Mishra), the project in charge of the mines, and politician Govardhan Roy (Rajesh Sharma), doing their utmost to give succour to the trapped miners. Among the miners themselves, there’s a faction, led by Bhola (Ravi Kishan), who feels the government will do nothing and has left them to die and then there’s Pasu (Jameel Khan), who cooperates with the authorities, trying to calm everyone inside the flooded mine. Bindal (Pavan Malhotra) is brought in as the digging expert who uses all his jugaad to drill in a hole big enough to slide in the capsule. Assisting him in the efforts is former land surveyor Tapan Ghosh (Virendra Saxena), who makes the accurate guess about where the miners would supposedly have gathered. 
And overseeing it all is the towering personality of Jaswant Singh Gill (Akshay Kumar) himself, who cajole, pleads, swell talks, uses every amount of cunning and wisdom he has to get the government gears rolling, sometimes using brawns instead of his brains to get the job done.There’s plenty of drama both above and below the ground to keep you hooked. Juxtaposed to the rescue operation is the romance shown between Jaswant and his pregnant wife, Nirdosh (Parineeti Chopra). Thankfully, it’s kept to the minimum, and doesn’t take much screen time. We do have them dancing to a wedding song a bit but that’s the gist of it. The director has succeeded in capturing the sense of dread witnessed by the trapped miners. The twists come every five minutes and keep you hooked. The computer graphics could have been a lot better but the lack of finesse in the CGI department doesn’t blunt the drama. At the end, you’re left marvelling at Gill’s ingenuity and bravery and feel a sense of pride when all the miners get safely extracted.Akshay Kumar really commits himself playing real heroes. He did it in PadMan and has done it here too. We see him struggling to maintain a sense of calm amidst the chaos. The concern that he feels for the miners seems real and you root for him to succeed throughout the film. The rest of the ensemble cast too stand out and have given the film their all. Mention must be made of Kumud Mishra as the hapless head of operations and of Ravi Kishan as the hot-headed Bhola, channelling his inner Shatrughan Sinha from Kala Patthar. Parineeti Chopra too fills in the shoes of a supportive wife who bats for her superhero husband.We need more films to celebrate real life heroes like Jaswant Singh Gill. Watch the film to raise a cheer in his memory."
label,score = make_prediction(text, model, tokenizer, vocab)
display(label,score)

Positive-Score:0.9261029362678528
