# Programming Assignment 4: Next-word prediction using LSTM model

In this programming assignment, you will implement `LSTM` to predict the next-word for a given text that come from `imdb_movies_reviews`.

The Goal of this assignment is to:
- How to preprocess the data to train a model for next-word prediction.
- Build LSTM Unit from scratch and learn about the main component in LSTM unit.
- Train and Evaluate the model
- Build functions that uses the model for next-word predection
- Build function that takes an input as starting point, then use the same model to build a pragraph of n number of words.

The structure of the model that you are going to build:

<div>
<img src="https://i.imgur.com/CiQkIFh.png"/>
</div>

# Important Note: 
To use the GPU in assignment, you need to enable it first. Here's how you can do it:
- Go to the "Runtime" menu and select "Change runtime type".
- In the "Hardware accelerator" dropdown, select "GPU" and click "Save".
- Wait for Colab to restart and allocate a GPU to your session.

Once you have enabled the GPU, you can check that it's available by running the following: 


<div>
<img src="https://i.imgur.com/JVsjUgf.png"/>
</div>


<div>
<img src="https://imgur.com/V4RhjD4.png"/>
</div>

In [1]:
import torch

if torch.cuda.is_available():
    print('GPU is available')
else:
    print('GPU is not available')

GPU is available


## 1. Load and preprocess data

In [2]:
%matplotlib inline
import re
import math

import pandas as pd
import numpy as np
import string
import nltk

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14)
matplotlib.rc('ytick', labelsize=14)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [3]:
device

device(type='cuda', index=0)

## Download and Load the dataset

The data set consists of 40,000 sentences, each labeled '1' (if it came from a positive review) or '-1' (if it came from a negative review). Since the dataset is to large we will use 10,000 sample from the training data for training, and 1000 sample from test data for testing.

In [4]:
# Install the dataset
!wget https://github.com/hanialomari/data/raw/main/data.zip
!unzip data.zip

--2023-04-08 16:32:34--  https://github.com/hanialomari/data/raw/main/data.zip
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/hanialomari/data/main/data.zip [following]
--2023-04-08 16:32:35--  https://raw.githubusercontent.com/hanialomari/data/main/data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5269634 (5.0M) [application/zip]
Saving to: ‘data.zip’


2023-04-08 16:32:35 (47.3 MB/s) - ‘data.zip’ saved [5269634/5269634]

Archive:  data.zip
  inflating: test.csv                
  inflating: __MACOSX/._test.csv     
  inflating: train.csv               
  inflating: __MACOSX/._train.csv 

In [5]:
# read in the IMDB dataset
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [6]:
# print the first 5 datapoint in the df_train dataframe
df_train.head()

Unnamed: 0.1,Unnamed: 0,review,sentiment
0,0,the only entertaining thing that i found about...,-1
1,1,i was hoping this would be of the calibre of d...,-1
2,2,beyond rangoon is simply marvelous from the tr...,1
3,3,this film aka the four hundred blows is a mist...,-1
4,4,at times when i watch this movie i start to th...,-1


In [7]:
# print the first 5 datapoint in the df_test dataframe
df_test.head()

Unnamed: 0.1,Unnamed: 0,review,sentiment
0,0,this flick is sterling example of the state of...,1
1,1,it seems like anybody can make a movie nowaday...,-1
2,2,this was very good except for two things which...,1
3,3,this is highgloss softporn a boring soap opera...,-1
4,4,who won the best actress oscar for 1933 it sho...,1


### Preprocessing the text data
we preprocess text reviews for train and test dataframe by removing punctuations and converting all characters to lower case. The re.sub() method is used to remove punctuations using regular expressions, and lower() method is used to convert all characters to lower case.

In [8]:
# preprocess the movie review text for train and test
df_train['review'] = df_train['review'].apply(lambda x: re.sub(r'[^\w\s]', '', x).lower())
df_test['review'] = df_test['review'].apply(lambda x: re.sub(r'[^\w\s]', '', x).lower())

### Tokenize the text data
This code is used to tokenize preprocessed text reviews train and test dataframs using the `TreebankWordTokenizer` from the NLTK library. The `apply()` method is used to apply the tokenizer to each element of the column, and the resulting tokenized versions of the reviews are stored in a new column called 'tokens'.

Example:

'this is a great movie' -- > 	['this', 'is', 'a', 'great', 'movie'] 

'i didnt like this film at all'  -->   ['i', 'didnt', 'like', 'this', 'film', 'at', 'all']

In [9]:
# tokenize the preprocessed text for train and test
tokenizer = nltk.tokenize.TreebankWordTokenizer()
df_train['tokens'] = df_train['review'].apply(lambda x: tokenizer.tokenize(x))
df_test['tokens'] = df_test['review'].apply(lambda x: tokenizer.tokenize(x))

In [10]:
#remove this later
df_train['tokens'].head()

0    [the, only, entertaining, thing, that, i, foun...
1    [i, was, hoping, this, would, be, of, the, cal...
2    [beyond, rangoon, is, simply, marvelous, from,...
3    [this, film, aka, the, four, hundred, blows, i...
4    [at, times, when, i, watch, this, movie, i, st...
Name: tokens, dtype: object

#### Create a word ID dictionary:

When working with text data, it's often necessary to encode the text into numerical or vector representations to perform mathematical operations on it. One common function for doing this involves creating two dictionaries that map each unique word in a list of tokenized text data to a unique numerical ID and vice versa. To handle out-of-vocabulary words, the <unk> token is added to the word2id dictionary. The resulting dictionaries can be useful for various NLP tasks, such as building a vocabulary for a neural network model.

Below is a simple example:

data_tokens =  ['this', 'is', 'a', 'great', 'movie'] # a list of tokenized text data

word2id =  {'this': 0, 'is': 1, 'a': 2, 'great': 3, 'movie': 4, '<unk>': 5} # a dictionary that maps each unique word in `data_tokens` to a unique numerical ID.

id2word = {0: 'this', 1: 'is', 2: 'a', 3: 'great', 4: 'movie', 5: '<unk>'} # a dictionary that maps each numerical ID in the `word2id` dictionary to the corresponding unique word.

### Question 1

Please write a function that creates two dictionaries to map each unique word in a list of tokenized text data to a numerical ID and vice versa, similar to the example provided. The function should take a list of tokenized text data as input and return two dictionaries: one that maps each unique word to a unique numerical ID, and another that maps each numerical ID to its corresponding unique word. Additionally, the function should include a special <unk> token at the end of the word2id dictionary to handle out-of-vocabulary words that are not present in the original list of tokens.

Report the output for the test case below.

In [11]:
def create_word_id_dicts(data_tokens):
    """
    Create two dictionaries to map each unique word to a numerical ID and vice versa.

    Args:
    - data_tokens (list): A list of tokenized text data (train and test).

    Returns:
    - word2id (dict): A dictionary that maps each unique word in the input tokens to a unique numerical ID.
    - id2word (dict): A dictionary that maps each numerical ID in the `word2id` dictionary to the corresponding unique word.

    The function adds a special `<unk>` token at the end of the `word2id` dictionary to handle out-of-vocabulary 
    words that are not present in the original list of tokens.
    """
    word2id = {}
    id2word = {}
    ## Student Code here
    if isinstance(data_tokens, list):
      data_tokens = data_tokens
    else:
      data_tokens = data_tokens.tolist()
    index =  0 
    for i in range(len(data_tokens)):
      for token in data_tokens[i]:
        if token not in word2id:
          word2id[token] = index
          index = index + 1      
    
    word2id['<unk>'] = len(word2id) 
    id2word = {index: token for token , index in word2id.items()}
    ##End of the code
    return word2id, id2word

#The following code creates two dictionaries word2id and id2word that map each unique word 
# in a concatenated list of tokenized text data to a numerical ID and vice versa, using the 
# create_word_id_dicts() function. The concatenated list of tokenized text data is created 
# by merging the tokens columns from two dataframes df_train and df_test using the pd.concat() function.

word2id, id2word = create_word_id_dicts(pd.concat([df_train['tokens'],df_test['tokens']])) 


In [12]:
len(word2id.keys())


78166

In [13]:
len(id2word.keys())


78166

In [14]:
## Test Case: (Please provide the output of this function in your PDF answer)
data_tokens_test_case = [["this", "is", "a", "test", "case"], ["lets", "try", "another", "one"]]
word2id_test_case, id2word_test_case = create_word_id_dicts(data_tokens_test_case)
print(word2id_test_case)
print(id2word_test_case)

### TEST CASE: The output should be like this:

##{'this': 0, 'is': 1, 'a': 2, 'test': 3, 'case': 4, 'lets': 5, 'try': 6, 'another': 7, 'one': 8, '<unk>': 9}
##{0: 'this', 1: 'is', 2: 'a', 3: 'test', 4: 'case', 5: 'lets', 6: 'try', 7: 'another', 8: 'one', 9: '<unk>'}

{'this': 0, 'is': 1, 'a': 2, 'test': 3, 'case': 4, 'lets': 5, 'try': 6, 'another': 7, 'one': 8, '<unk>': 9}
{0: 'this', 1: 'is', 2: 'a', 3: 'test', 4: 'case', 5: 'lets', 6: 'try', 7: 'another', 8: 'one', 9: '<unk>'}


In [15]:
## Evaluation case: (Please provide the output of this function in your PDF answer)
data_tokens_evaluation_case = df_train['tokens'].tolist()[7:9] 
word2id_test_case, id2word_test_case = create_word_id_dicts(data_tokens_evaluation_case)
print(word2id_test_case)
print(id2word_test_case)

{'who': 0, 'were': 1, 'they': 2, 'kidding': 3, 'with': 4, 'this': 5, 'there': 6, 'was': 7, 'just': 8, 'too': 9, 'much': 10, 'in': 11, 'film': 12, 'that': 13, 'hard': 14, 'to': 15, 'digest': 16, 'right': 17, 'from': 18, 'when': 19, 'arjun': 20, 'ajay': 21, 'devgan': 22, 'unknowingly': 23, 'wishes': 24, 'death': 25, 'on': 26, 'his': 27, 'father': 28, 'he': 29, 'arrives': 30, 'london': 31, 'uncleplayed': 32, 'by': 33, 'om': 34, 'puri': 35, 'only': 36, 'abandon': 37, 'him': 38, 'minutes': 39, 'later': 40, 'the': 41, 'problem': 42, 'theory': 43, 'is': 44, 'anybody': 45, 'has': 46, 'ever': 47, 'passed': 48, 'through': 49, 'heathrow': 50, 'knows': 51, 'such': 52, 'a': 53, 'fête': 54, 'would': 55, 'be': 56, 'impossible': 57, 'pull': 58, 'off': 59, 'and': 60, 'especially': 61, 'not': 62, 'an': 63, 'indian': 64, 'but': 65, 'problems': 66, 'do': 67, 'end': 68, 'theres': 69, 'issue': 70, 'of': 71, 'two': 72, 'main': 73, 'leads': 74, 'salman': 75, 'khan': 76, 'passing': 77, 'as': 78, 'rockstars': 7

### Split tokens into sequences

The `split_tokens_into_sequences` function takes in two arguments: `data_tokens`, which is a list of tokenized text data, and `seq_length`, which is an integer representing the length of each sequence.

The function initializes an empty list called `sequences` that will store the sequences of tokens. It then iterates through each list of tokens in `data_tokens`. For each list of tokens, the function loops through the list starting at the `seq_length`-th index and ending at the last index.

During each iteration of the inner loop, the function creates a sublist of tokens by taking the `seq_length` tokens that immediately precede the current index. This sublist represents a single sequence of tokens. The function then appends this sequence to the `sequences` list.

Once the loops are complete, the function returns the `sequences` list, which contains a list of all sequences of `seq_length` tokens that can be created from the input data.

For example, if the input `data_tokens` is a list of two sentences: ['The cat sat on the mat', 'The dog ran in the park'], and `seq_length` is set to 4, the function will output a list of sequences:

[['The', 'cat', 'sat', 'on'],

 ['cat', 'sat', 'on', 'the'],
 
 ['sat', 'on', 'the', 'mat'],
 
 ['The', 'dog', 'ran', 'in'],
 
 ['dog', 'ran', 'in', 'the'],
 
 ['ran', 'in', 'the', 'park']]
 
#### Question 2 

Create the function described above that accepts two arguments: a list of tokenized text data and a sequence length. The function should return a list of sequences where each sequence is a sublist of `data_tokens` with a length of `seq_length`. To achieve this, the function should iterate through each list of tokenized text data and generate sequences of the desired length. Specifically, for each list of tokens, the function should start at the `seq_length`-th token and add the preceding `seq_length` tokens to create the sequence. The function should then append each sequence to the output list called `sequences`.

In [16]:
def split_tokens_into_sequences(data_tokens, seq_length):
    """
    Split the input list of tokenized text data into sequences of fixed length.

    Args:
    - data_tokens (list): A list of tokenized text data.
    - seq_length (int): The length of each sequence.

    Returns:
    - sequences (list): A list of sequences of length `seq_length`, where each sequence is a sublist of `data_tokens`.

    For each sequence, the function starts at the `seq_length`-th token in the input list of tokens and adds the 
    preceding `seq_length` tokens to the sequence. The function then appends each sequence to the output list `sequences`.
    """
    sequences = []
    # Student Code
    for i in range(len(data_tokens)):
      for j in range(len(data_tokens[i])):
        if seq_length + j > len(data_tokens[i]):
          break
        else:
          #print(data_tokens_test_case[i][j:seq_length_test_case+j])
          sequences.append(data_tokens[i][j:seq_length+j])
    #End of the Code
    return sequences

In [17]:
## Test Case
data_tokens_test_case = [["this", "is", "a", "test", "case"], ["lets", "try", "another", "one"]]
seq_length_test_case = 3
sequences_test_case = split_tokens_into_sequences(data_tokens_test_case, seq_length_test_case)

print(sequences_test_case)

### TEST CASE: The output should be like this:

## [['this', 'is', 'a'], ['is', 'a', 'test'], ['a', 'test', 'case'], ['lets', 'try', 'another'], ['try', 'another', 'one']]

[['this', 'is', 'a'], ['is', 'a', 'test'], ['a', 'test', 'case'], ['lets', 'try', 'another'], ['try', 'another', 'one']]


In [18]:
# define the sequence length for train and test
seq_length = 10
sequences_train = split_tokens_into_sequences(df_train['tokens'],seq_length)
sequences_test  = split_tokens_into_sequences(df_test['tokens'],seq_length)

In [19]:
## Evaluation case: (Please provide the output of this function in your PDF answer)
data_tokens_Evaluation_case = df_train['tokens'].tolist()[7:9] 
sequences_Evaluation_case = split_tokens_into_sequences(data_tokens_Evaluation_case, seq_length)

print(sequences_Evaluation_case)

[['who', 'were', 'they', 'kidding', 'with', 'this', 'there', 'was', 'just', 'too'], ['were', 'they', 'kidding', 'with', 'this', 'there', 'was', 'just', 'too', 'much'], ['they', 'kidding', 'with', 'this', 'there', 'was', 'just', 'too', 'much', 'in'], ['kidding', 'with', 'this', 'there', 'was', 'just', 'too', 'much', 'in', 'this'], ['with', 'this', 'there', 'was', 'just', 'too', 'much', 'in', 'this', 'film'], ['this', 'there', 'was', 'just', 'too', 'much', 'in', 'this', 'film', 'that'], ['there', 'was', 'just', 'too', 'much', 'in', 'this', 'film', 'that', 'was'], ['was', 'just', 'too', 'much', 'in', 'this', 'film', 'that', 'was', 'hard'], ['just', 'too', 'much', 'in', 'this', 'film', 'that', 'was', 'hard', 'to'], ['too', 'much', 'in', 'this', 'film', 'that', 'was', 'hard', 'to', 'digest'], ['much', 'in', 'this', 'film', 'that', 'was', 'hard', 'to', 'digest', 'right'], ['in', 'this', 'film', 'that', 'was', 'hard', 'to', 'digest', 'right', 'from'], ['this', 'film', 'that', 'was', 'hard', '

### Transform sequence to IDs 

The `preprocess_sequence_to_ids` function accepts an index, a list of sequences, and a dictionary that maps words to their corresponding IDs. Its purpose is to convert a sequence of text to a sequence of corresponding IDs using the provided word-to-ID dictionary. The input sequence is represented as a list of IDs, and the output sequence is represented as a single ID.

Here's how the function works:

1 - Iterate over each word in the input sequence, excluding the last word. For each word, check if it exists in the word2id dictionary. If it does, append its ID to the `input_sequence` list. If it doesn't, append the ID of the `<unk>` token to the `input_sequence` list.

2 - Check if the last word in the sequence is in the word2id dictionary. If it is, set the `output_sequence` variable to the ID corresponding to that word. If it's not, set the `output_sequence` variable to the ID of the `<unk>` token.

3 - Return the `input_sequence` and `output_sequence` variables as a tuple.

Overall, this function preprocesses a given sequence of text by converting it to a sequence of corresponding IDs that can be used as input and output to a machine learning model. It also handles cases where a word in the sequence is not present in the provided word2id dictionary by setting its ID to the ID of the `<unk>` token.

Example: 
if `index = 0`, `word2id = {'in': 0, 'this': 1, 'assignment': 2, 'you': 3, 'will': 4, 'learn': 5, 'about': 6, 'LSTM': 7, 'layer': 8, '<unk>': 9}` and `sequences = [['in', 'this', 'assignment', 'you', 'will'], ['this', 'assignment', 'you', 'will', 'learn'], ['assignment', 'you', 'will', 'learn', 'about'], ['you', 'will', 'learn', 'about', 'LSTM'], ['will', 'learn', 'about', 'LSTM','layer']]`
    
then the output of this function:  `input_sequence` = [0, 1, 2, 3] and `output_sequence` = 4   


#### Question 3  

Write a function described above that takes in three arguments: an index, a list of sequences, and a word-to-ID dictionary, and returns a tuple containing the input sequence represented as a list of IDs and the output sequence represented as a single ID (Target ID).

In [20]:
def preprocess_sequence_to_ids(index, sequences, word2id):
    """
    Convert a sequence of text to a sequence of corresponding IDs, using a provided word-to-ID dictionary.

    Args:
        index (int): The index of the sequence in the `sequences` list to preprocess.
        sequences (list): A list of sequences, where each sequence is a list of words representing a text sample.
        word2id (dict): A dictionary that maps words to their corresponding IDs.

    Returns:
        tuple: A tuple containing the input sequence represented as a list of IDs, and the output sequence represented as a single ID.
    """

    # Initialize input_sequence and output_sequence
    input_sequence = []
    output_sequence = None
    
    # Student Code
    b = len(sequences[index])-1
    unk_tok_index = list(word2id.items()).index(('<unk>', word2id['<unk>']))

    # a and b are index of last word, last word = sequences[a][b]
    for j in range(len(sequences[index])-1):
      if sequences[index][j] in word2id.keys():
        key_to_find = sequences[index][j]
        index_inoutseq = list(word2id.items()).index((key_to_find, word2id[key_to_find]))
        input_sequence.append(index_inoutseq)
      else:
        input_sequence.append(unk_tok_index)

    if sequences[index][b] in word2id.keys():
      key_to_find = sequences[index][b]
      index_outseq = list(word2id.items()).index((key_to_find, word2id[key_to_find]))
      output_sequence = index_outseq
    else:
      output_sequence = unk_tok_index
 
    # End of the code
    
    # Return the input_sequence and output_sequence as a tuple
    return input_sequence, output_sequence

In [21]:
## Test Case.(Please provide the output of this function in your PDF answer)
word2id_test_case = {'this': 0, 'is': 1, 'a': 2, 'test': 3, 'case': 4, 'lets': 5, 'try': 6, 'another': 7, 'one': 8, '<unk>': 9}
sequences_test_case = [["this", "is", "a"], ["is", "a", "test"], ["a", "test", "case"], ["test", "case", "."], ["lets", "try", "another"], ["try", "another", "one"], ["another", "one", "."]]
input_sequence_test_case, output_sequence_test_case = preprocess_sequence_to_ids(3, sequences_test_case, word2id_test_case)

print(input_sequence_test_case)
print(output_sequence_test_case)


### TEST CASE: The output should be like this:

## [3, 4]
## 9

[3, 4]
9


In [22]:
## Evaluation case: (Please provide the output of this function in your PDF answer)
input_sequence_test_case, output_sequence_test_case = preprocess_sequence_to_ids(600, sequences_train, word2id)
print(input_sequence_test_case)
print(output_sequence_test_case)

[341, 342, 54, 343, 344, 345, 42, 31, 115]
92


## Data Loader: 
`DataLoader` is an essential component in PyTorch for loading and processing data in machine learning workflows. The purpose of the `DataLoader` is to provide an efficient and convenient way to feed data into machine learning models. It works by taking a dataset and returning a generator that iterates over the dataset, returning batches of samples that can be used for training or evaluation. The `DataLoader` takes care of batching, shuffling, and loading the data into memory, making it easy to work with large datasets. By using the `DataLoader`, machine learning practitioners can focus on designing and training their models, while leaving the data handling to PyTorch.

### First, we create a class to handle our IMDB review dataset:
The purpose of the class is to convert the words in each sequence to their corresponding IDs, and to handle unknown words by replacing them with a designated unknown token or ID. This is a common preprocessing step for NLP tasks, where text data needs to be represented as numerical values that can be processed by machine learning models.

The `IMDBDataset` class is a subclass of PyTorch's `Dataset` class, which is used to create a dataset object that can be iterated over and sampled from. The `__len__` method of the class returns the number of sequences in the dataset, and the `__getitem__` method returns a single sequence at a given index in the dataset, with the input and output sequences represented as PyTorch `LongTensor` objects.

The `__getitem__` method converts the words in each input sequence to their corresponding IDs using the `word2id` dictionary. If a word is not in the dictionary, it is replaced with the unknown token or ID. The method also returns the output sequence as a PyTorch `LongTensor` object, where the output sequence is simply the ID of the last word in the input sequence.

Overall, the `IMDBDataset` class provides a convenient and efficient way to preprocess and load text data for NLP tasks in PyTorch. It handles the necessary preprocessing steps and returns the data in a format that can be easily fed into a PyTorch `DataLoader` for efficient training of NLP models.

In [23]:
class IMDBDataset(Dataset):
    """
    A PyTorch dataset for processing IMDB review data.

    Args:
        sequences (list): A list of sequences where each sequence is a list of words.
        word2id (dict): A dictionary that maps words to their corresponding IDs.

    Attributes:
        sequences (list): A list of sequences where each sequence is a list of words.
        word2id (dict): A dictionary that maps words to their corresponding IDs.

    Methods:
        __len__: Returns the length of the dataset.
        __getitem__: Gets a sequence from the dataset at a given index and returns the input and output sequences as PyTorch LongTensors.
    """

    def __init__(self, sequences, word2id):
        """
        Initializes the IMDBDataset.

        Args:
            sequences (list): A list of sequences where each sequence is a list of words.
            word2id (dict): A dictionary that maps words to their corresponding IDs.
        """
        self.sequences = sequences
        self.word2id = word2id

    def __len__(self):
        """
        Returns the length of the dataset.

        Returns:
            int: The length of the dataset.
        """
        return len(self.sequences)

    def __getitem__(self, index):
        """
        Gets a sequence from the dataset at a given index and returns the input and output sequences as PyTorch LongTensors.

        Args:
            index (int): The index of the sequence to retrieve.

        Returns:
            tuple: A tuple containing the input and output sequences as PyTorch LongTensors.
        """
        
        input_sequence, output_sequence = preprocess_sequence_to_ids(index, self.sequences, self.word2id)
        
        return torch.LongTensor(input_sequence).to(device), torch.LongTensor([output_sequence]).to(device)

## Efficient Data Handling in PyTorch with DataLoader and Custom Dataset Creation

First, an instance of the `IMDBDataset` class is created, which takes in sequences and `word2id` as inputs. This dataset contains preprocessed sequences of text data from the IMDB dataset, where each sequence has been converted to a list of numerical IDs representing each word in the sequence.

Next, a `DataLoader` instance is created using the `IMDBDataset` as input, along with two additional parameters. batch_size specifies the number of samples to include in each batch returned by the DataLoader, while shuffle specifies whether or not to shuffle the order of the samples in each batch.

The resulting `dataloader` object can then be used to iterate over the dataset in batches, making it easy to feed the data into a PyTorch model for training or evaluation. This is a common technique used in machine learning to efficiently process large datasets in batches, rather than loading the entire dataset into memory at once.

In [24]:
# create a PyTorch dataset and dataloader
dataset = IMDBDataset(sequences_train, word2id)
dataloader = DataLoader(dataset, batch_size=512, shuffle=True)

## Building an LSTM Unit from sctratch using Pytorch

<div>
<img src="https://i.imgur.com/ULyYais.png" width="800"/>
</div>

An LSTM unit takes as input a hidden state vector $\mathbf{h}_{t-1}$ and an input vector $\mathbf{x}_t$ at time $t$, and produces an output vector $\mathbf{o}_t$ and an updated hidden state vector $\mathbf{h}_t$. The computation can be broken down into several steps:

LSTM unit consist of four main gates: 

1 - Forget Gate: $f_t = \sigma(W_{if}h_{t-1} + W_{hf}x_t + b_{if} + b_{hf})$

2 - Input Gate: $ i_t = \sigma(W_{ii}h_{t-1} + W_{hi}x_t + b_{ii} + b_{hi})$

3 - Output Gate: $o_t = \sigma(W_{io}h_{t-1}+ W_{ho}x_t + b_{io} + b_{ho})$

4 - Memory Gate : $g_t = tanh(W_{im}h_{t-1}+W_{hm}x_t + b_{im} + b_{hm})$

where: 
- $x_t$ represents the input at time step $t$.
- $h_{t-1}$ represents the output of the RNN at the previous time step $t-1$.
- $W_{i}$ represents the weight matrix that connects the input $x_t$ to the hidden state of the RNN.
- $W_{h}$ represents the weight matrix that connects the previous hidden state $h_{t-1}$ to the current hidden - state.
- $b_{i}$ and $b_{h}$ are the bias terms for the input and hidden state, respectively.

In a recurrent neural network (RNN), including LSTMs, the gates (including forget, input, and output gates) are typically calculated simultaneously because it allows the network to efficiently update the hidden state and selectively remember or forget information at each time step.

Calculating all the gates at once involves matrix multiplication, which is a highly optimized operation in Pytorch library. By computing all the gates at once, the RNN can perform the calculations in parallel, which can result in faster training and inference times.

Moreover, computing all the gates at once allows the RNN to selectively update its hidden state based on both the current input and the previous hidden state. This enables the RNN to selectively remember or forget information from the past while integrating new information from the current input.

Overall, computing all the gates at once is an efficient and effective way to update the hidden state in a recurrent neural network.

To do this we will create a weight matrix that combine $W_f$, $W_i$, $W_o$, and $W_m$ into one matrix:

$$ W_{ih}= \begin{bmatrix} W_if & W_ii & W_io & W_im \end{bmatrix} $$
$$ W_{hh}= \begin{bmatrix} W_hf & W_hi & W_ho & W_hm \end{bmatrix} $$

Refer to the figure, it show how we will treat the matrix after concatenation:

<div>
<img src="https://i.imgur.com/b95Aez3.png" width="600"/>
</div>


same for bias's: 

$$ b_{ih}= \begin{bmatrix} b_if & b_ii & b_io & b_im \end{bmatrix} $$
$$ b_{hh}= \begin{bmatrix} b_hf & b_hi & b_ho & b_hm \end{bmatrix} $$

So we calculate the for all gates as following:

$$Gates = W_{ih}x_t +  W_{hh}h_{t-1} + b_{ih} + b_{hh}$$

where $\mathbf{x}$ is the input to the LSTM at time step $t$, $\mathbf{W}{ih}$ and $\mathbf{W}{hh}$ are the weight matrices for the input-to-hidden and hidden-to-hidden connections, respectively, $\mathbf{h}{t-1}$ is the hidden state at the previous time step, and $\mathbf{b}{ih}$ and $\mathbf{b}_{hh}$ are the bias terms for the input-to-hidden and hidden-to-hidden connections, respectively. 

`Hint`: use "@" symbol in the code to do matrix multiplication operation.

#### Question 4 

Write a function that calculate the Gates for the LSTM different Gates.

In [25]:
def calculate_gates(x,W_ih,h_0,W_hh,b_ih,b_hh):
    """
    Calculate the gates in an LSTM cell.

    Args:
    - x: Input tensor of shape (seq_len, batch_size, input_size)
    - W_ih: Input-hidden weight tensor of shape (input_size, 4 * hidden_size)
    - h_0: previous step hidden state tensor of shape (num_layers, batch_size, hidden_size)
    - W_hh: Hidden-hidden weight tensor of shape (hidden_size, 4 * hidden_size)
    - b_ih: Input-hidden bias tensor of shape (4 * hidden_size,)
    - b_hh: Hidden-hidden bias tensor of shape (4 * hidden_size,)

    Returns:
    - gates: The gates tensor of shape (seq_len, batch_size, 4 * hidden_size)
    """
    #Student Code
    gates = None
    seq_len = x.shape[0]
    batch_size = x.shape[1]

    input_size, four_hidden_size = W_ih.shape
    hidden_size = int(four_hidden_size/4)
    gates = torch.zeros(seq_len, batch_size, 4 * hidden_size)
    #h_t = h_0


    gates =  x @ W_ih  + h_0 @ W_hh  + b_ih + b_hh

    return gates

In [26]:
## Test Case (Please provide the output of this function in your PDF answer)
x_test_case = torch.tensor([[[1, 2, 3]], [[4, 5, 6]]], dtype=torch.float32)
W_ih_test_case = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16], [17, 18, 19, 20, 21, 22, 23, 24]], dtype=torch.float32) 
h_0_test_case = torch.tensor([[[1, 2]]], dtype=torch.float32) 
W_hh_test_case = torch.tensor([[1, 2, 3, 4,1,2,3,4], [5, 6, 7, 8,1,2,3,4]], dtype=torch.float32) 
b_ih_test_case = torch.tensor([[1, 2, 3, 4,1, 2, 3, 4]], dtype=torch.float32) 
b_hh_test_case = torch.tensor([[1, 2, 3, 4,1, 2, 3, 4]], dtype=torch.float32) 

# Test the function
gates_test_case = calculate_gates(x_test_case, W_ih_test_case, h_0_test_case, W_hh_test_case, b_ih_test_case, b_hh_test_case)

# Check that the output tensor has the expected shape and values
print(gates_test_case)

## TEST CASE: The output should be like this:
## tensor([[[ 83.,  94., 105., 116.,  99., 110., 121., 132.]], [[164., 184., 204., 224., 216., 236., 256., 276.]]])

tensor([[[ 83.,  94., 105., 116.,  99., 110., 121., 132.]],

        [[164., 184., 204., 224., 216., 236., 256., 276.]]])


In [27]:
## Evaluation case: (Please provide the output of this function in your PDF answer)
x_test_case = torch.tensor([[[45, 23, 54]], [[23, 46, 43]]], dtype=torch.float32)
W_ih_test_case = torch.tensor([[23, 54, 867, 654, 34, 634, 756, 234], [234, 345, 56, 8, 34, 345, 15, 345], [35, 67, 34, 57, 897, 23, 2453, 45]], dtype=torch.float32) 
h_0_test_case = torch.tensor([[[23, 25]]], dtype=torch.float32) 
W_hh_test_case = torch.tensor([[5423, 234, 64, 46,45,245,3423,6453], [345, 345, 546, 456,456,23,324,434]], dtype=torch.float32) 
b_ih_test_case = torch.tensor([[234, 3456, 567, 567,345, 345, 786, 445]], dtype=torch.float32) 
b_hh_test_case = torch.tensor([[345, 345, 456, 456,678,987, 234, 445]], dtype=torch.float32) 

# Test the function
gates_test_case = calculate_gates(x_test_case, W_ih_test_case, h_0_test_case, W_hh_test_case, b_ih_test_case, b_hh_test_case)

# Check that the output tensor has the expected shape and values
print(gates_test_case)

tensor([[[142240.,  31791.,  58284.,  46173.,  64208.,  45249., 254676.,
          181054.]],

        [[146731.,  37801.,  40124.,  31342.,  54375.,  38983., 211406.,
          183346.]]])


`hint` = you can use `torch.sigmoid()` and `torch.tanh()`
#### Question 5

Write a function that split the gates into forget, input, memory, output gate? 

In [28]:
def calculate_each_gate(gates,hidden_size):
    """
    Splits the input tensor `gates` into four gates: forget gate, input gate, memory gate, and output gate.

    Args:
    - gates: A tensor of shape (batch_size, 4 * hidden_size), containing the input to the LSTM layer.
    - hidden_size: The number of units in the LSTM layer.

    Returns:
        Tuple of four tensors, each corresponding to a gate:
            - forget gate: A tensor of shape (batch_size, hidden_size).
            - input gate: A tensor of shape (batch_size, hidden_size).
            - memory gate: A tensor of shape (batch_size, hidden_size).
            - output gate: A tensor of shape (batch_size, hidden_size).
    """
        
    #Student Code
    #gates_sig = torch.sigmoid(gates)
    f_t_n, i_t_n, g_t_n, o_t_n = torch.split(gates, hidden_size, dim=1)

    #Forget gate
    f_t = torch.sigmoid(f_t_n)
    
    #Input gate
    i_t = torch.sigmoid(i_t_n) 
    
    # Memory gate
    g_t = torch.tanh(g_t_n)
    
    # Output gate
    o_t = torch.sigmoid(o_t_n)
    
    #End of the code
    return f_t, i_t, g_t, o_t

In [29]:
## Test Case (Please provide the output of this function in your PDF answer)

gates_test_case = torch.tensor([[ 0.5,  0.6,  0.7,  0.8,  0.9,  1.0,  1.1,  1.2,  1.1,  1.2,  1.1,  1.2],
                      [ 1.0,  1.1,  1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.3,  1.4,  1.5,  1.6]], dtype=torch.float32) 

f_t_test_case , i_t_test_case , g_t_test_case , o_t_test_case   = calculate_each_gate(gates_test_case,3)
print(f_t_test_case )
print(i_t_test_case )
print(g_t_test_case )
print(o_t_test_case )

## TEST CASE: The output should be like this:

## tensor([[0.6225, 0.6457, 0.6682],
##        [0.7311, 0.7503, 0.7685]])
## tensor([[0.6900, 0.7109, 0.7311],
##        [0.7858, 0.8022, 0.8176]])
## tensor([[0.8005, 0.8337, 0.8005],
##        [0.9217, 0.9354, 0.8617]])
## tensor([[0.7685, 0.7503, 0.7685],
##        [0.8022, 0.8176, 0.8320]])

tensor([[0.6225, 0.6457, 0.6682],
        [0.7311, 0.7503, 0.7685]])
tensor([[0.6900, 0.7109, 0.7311],
        [0.7858, 0.8022, 0.8176]])
tensor([[0.8005, 0.8337, 0.8005],
        [0.9217, 0.9354, 0.8617]])
tensor([[0.7685, 0.7503, 0.7685],
        [0.8022, 0.8176, 0.8320]])


In [30]:
## Evaluation case: (Please provide the output of this function in your PDF answer)

gates_test_case = torch.tensor([[ 0.34,  0.23,  0.23,  0.123,  0.23,  1.68,  1.84,  3.435,  9.345,  3.234,  23.45,  2.45],
                                [ 7.234,  2.324,  32.4,  2.45,  2.45,  23.5,  7.58,  5.345,  3.456,  7.863,  3.4356,  2.3462]], dtype=torch.float32) 

f_t_test_case , i_t_test_case , g_t_test_case , o_t_test_case   = calculate_each_gate(gates_test_case,3)
print(f_t_test_case )
print(i_t_test_case )
print(g_t_test_case )
print(o_t_test_case )

tensor([[0.5842, 0.5572, 0.5572],
        [0.9993, 0.9108, 1.0000]])
tensor([[0.5307, 0.5572, 0.8429],
        [0.9206, 0.9206, 1.0000]])
tensor([[0.9508, 0.9979, 1.0000],
        [1.0000, 1.0000, 0.9980]])
tensor([[0.9621, 1.0000, 0.9206],
        [0.9996, 0.9688, 0.9126]])


Update the cell state $\mathbf{c}_t$:

$$ c_t = f_t \odot c_0 + i_t \odot g_t $$

Update the cell state $\mathbf{h}_t$:

$$ h_t = o_t \odot tanh(c_t) $$

where $\odot$ is the element-wise multiplication (Hadamard product) operator.
 
Update the hidden state $\mathbf{h}_t$:

#### Question 6

Write a function to calculate new cell and hidden state for LSTM unit? 

In [31]:
def calculate_new_cell_and_hidden_state(f_t, i_t, g_t, o_t,c_0):
    """
    Computes the new cell state and hidden state for an LSTM cell.

    Args:
    - f_t: The forget gate values at time t.  (batch_size, hidden_size)
    - i_t: The input gate values at time t.  (batch_size, hidden_size)
    - g_t: The values of the memory gate at time t.  (batch_size, hidden_size)
    - o_t: The output gate values at time t.  (batch_size, hidden_size)
    - c_0: The previous cell state at time t-1.  (batch_size, hidden_size)

    Returns:
    - c_t: The new cell state at time t.  (batch_size, hidden_size)
    - h_t: The new hidden state at time t.  (batch_size, hidden_size)
    """
    ## Student Code
    
    # Compute the new cell state
    c_t = None
    
    # Compute the new hidden state using the new cell state
    h_t = None
    
    c_t = f_t * c_0 + i_t * g_t
    h_t = o_t * torch.tanh(c_t)

    ## End of the code
    return c_t, h_t

In [32]:
## Test Case (Please provide the output of this function in your PDF answer)
f_t = torch.tensor([[0.62245934, 0.64565631, 0.66666667],
                    [0.73105858, 0.75026011, 0.76852478]], dtype=torch.float32)  # shape: (2, 3)

i_t = torch.tensor([[0.81757448, 0.84154041, 0.8608778 ],
                    [0.81757448, 0.84154041, 0.8608778 ]], dtype=torch.float32)  # shape: (2, 3)

g_t = torch.tensor([[0.76159416, 0.9640276 , 0.99505475],
                    [0.83877127, 0.96319074, 0.99505315]], dtype=torch.float32)  # shape: (2, 3)

o_t = torch.tensor([[0.78583498, 0.80218388, 0.81757445],
                    [0.78583498, 0.80218388, 0.81757445]], dtype=torch.float32)  # shape: (2, 3)

c_0 = torch.tensor([[-0.01, 0.01, 0.1],
                    [0.1, -0.1, -0.01]], dtype=torch.float32)  # shape: (2, 3)

# Test the function
c_t_test_case , h_t_test_case  = calculate_new_cell_and_hidden_state(f_t, i_t, g_t, o_t, c_0)


print(c_t_test_case )
print(h_t_test_case )

## TEST CASE: The output should be like this:

## tensor([[0.6164, 0.8177, 0.9233], [0.7589, 0.7355, 0.8489]])
## tensor([[0.4311, 0.5405, 0.5947], [0.5033, 0.5025, 0.5645]])

tensor([[0.6164, 0.8177, 0.9233],
        [0.7589, 0.7355, 0.8489]])
tensor([[0.4311, 0.5405, 0.5947],
        [0.5033, 0.5025, 0.5645]])


In [33]:
## Evaluation case: (Please provide the output of this function in your PDF answer)

f_t = torch.tensor([[0.5842, 0.234, 0.5572],
                    [0.456, 0.4378, 0.3456]], dtype=torch.float32) 

i_t = torch.tensor([[0.5307, 0.9206, 0.8429 ],
                    [0.34567, 0.5572, 0.345673 ]], dtype=torch.float32) 

g_t = torch.tensor([[0.9980, 0.9640276 , 0.99505475],
                    [0.83877127, 0.96319074, 0.99505315]], dtype=torch.float32)  

o_t = torch.tensor([[0.9621, 0.9979, 0.81757445],
                    [0.9979, 0.80218388, 0.9206]], dtype=torch.float32) 

c_0 = torch.tensor([[0.435, 0.345, 0.1],
                    [0.266, 0.189, 0.01]], dtype=torch.float32)  
# Test the function
c_t_test_case , h_t_test_case  = calculate_new_cell_and_hidden_state(f_t, i_t, g_t, o_t, c_0)


print(c_t_test_case )
print(h_t_test_case )

tensor([[0.7838, 0.9682, 0.8945],
        [0.4112, 0.6194, 0.3474]])
tensor([[0.6300, 0.7463, 0.5834],
        [0.3887, 0.4418, 0.3076]])


#### Example to make sure that LSTM unit is working

In [34]:
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1):
        super(LSTM, self).__init__()   # initialize the super class
        self.input_size = input_size   # set the input size
        self.hidden_size = hidden_size   # set the hidden size
        self.num_layers = num_layers   # set the number of layers
        
        # set up the learnable parameters
        self.W_ih = nn.Parameter(torch.randn(input_size, 4 * hidden_size)) 
        self.W_hh = nn.Parameter(torch.randn(hidden_size, 4 * hidden_size))
        self.b_ih = nn.Parameter(torch.randn(4 * hidden_size))
        self.b_hh = nn.Parameter(torch.randn(4 * hidden_size))
        
    def forward(self, input, hidden=None):
        batch_size = input.size(1)   # get the batch size
        
        # initialize the hidden state and cell state if not given
        if hidden is None:
            h_0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
            c_0 = torch.zeros(self.num_layers, batch_size, self.hidden_size)
        else:
            h_0, c_0 = hidden
            
        h_0 = h_0.to(device)
        c_0 = c_0.to(device)
        
        outputs = []   # initialize the list to store the output tensors
        
        # loop over the input sequence
        for i in range(input.size(0)):
            x = input[i].to(device)   # get the i-th input tensor
            
            # compute the gates and the input, forget, and output vectors
            gates = calculate_gates(x,self.W_ih,h_0[-1],self.W_hh,self.b_ih,self.b_hh)
            
            # compute Forget gate, Input gate, Memory gate, and Output gate
            f_t, i_t, g_t, o_t = calculate_each_gate(gates,self.hidden_size)
            
            # compute the new cell state and the new hidden state
            c_t, h_t = calculate_new_cell_and_hidden_state(f_t, i_t, g_t, o_t,c_0[-1])
            
            # append the new hidden state to the outputs list
            outputs.append(h_t.unsqueeze(0))
            
            # update the hidden and cell states
            h_0 = torch.cat([h_0[1:], h_t.unsqueeze(0)])
            c_0 = torch.cat([c_0[1:], c_t.unsqueeze(0)])
            
        # concatenate the output tensors along the first dimension and return the output and the new hidden and cell states
        outputs = torch.cat(outputs, dim=0)
        return outputs, (h_0, c_0)

In [35]:
## Test LSTM Class
rnn1 = LSTM(10, 20, 2).to(device)
input = torch.randn(5, 3, 10).to(device)
h0 = torch.randn(2, 3, 20).to(device)
c0 = torch.randn(2, 3, 20).to(device)
output11, (hn11, cn11) = rnn1(input, (h0, c0))
output11[0][0][0]

## Every time it will give you different value, why?

tensor(0.1708, device='cuda:0', grad_fn=<SelectBackward0>)

### Embedding layer: 
In PyTorch, the nn.Embedding layer is used to create a lookup table for word embeddings. The layer takes an integer tensor of shape (batch_size, seq_length) as input, where each element represents a word's index in the vocabulary. The layer then maps each index to a learnable vector of size embedding_dim, resulting in an output tensor of shape (batch_size, seq_length, embedding_dim).

To get the average embedding for each sequence, we can take the mean along the second dimension of the output tensor, which has size seq_length. This will result in an output tensor of size (batch_size, embedding_dim), where each element represents the average embedding for a sequence in the batch.

#### Question 7 

Build your hiddin layer using the LSTM unit that you have build, but at first you need to give the input to embedding layer, then feed the output of the embedding layer to LSTM layer. 

In [36]:
# define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(LSTMModel, self).__init__()

        # Initialize the embedding layer with `vocab_size` input and `embedding_dim` output dimensions
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # Initialize the LSTM layer with `embedding_dim` input and `hidden_dim` output dimensions
        self.lstm = LSTM(embedding_dim, hidden_dim)

        # Initialize the fully-connected output layer with `hidden_dim` input and `vocab_size` output dimensions
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        """
        Perform a forward pass through the LSTM model.

        Args:
            x (tensor): A tensor of shape (batch_size, seq_len), representing the input sequence.

        Returns:
            output (tensor): A tensor of shape (batch_size, vocab_size), representing the output logits.
        """

        ## Strudent Code

        # Embed the input sequence using the embedding layer  ** HENT: use self. to reach for the embedding layer and feed it with the training data
        embedded = self.embedding(x)

        # Pass the embedded sequence through the LSTM layer ** HENT: use self. to reach for the LSTM layer and feed it with the embeddings
        output, _ = self.lstm(embedded)
        
        #End of the student Code

        # Pass the final LSTM output through the fully-connected output layer
        output = self.fc(output[:, -1, :])
        return output

#### Question 8 

Specify the vocab_size, embedding_dim, learning_rate, num_epochs, and hidden_dim for the function `LSTMModel` and Training and evalutating the model.

In [37]:
# define the hyperparameters

## Student Code

vocab_size = 78166
embedding_dim = 50
hidden_dim = 1
learning_rate = 0.01
num_epochs = 1  

## End

In [38]:
# create the LSTM model
model = LSTMModel(vocab_size, embedding_dim, hidden_dim).cuda()

#### CrossEntropyLoss
CrossEntropyLoss is commonly used as the loss function for next word prediction in neural language modeling tasks. This is because it is well-suited for multi-class classification problems like language modeling, where we want to predict the probability distribution over a fixed vocabulary of words.

CrossEntropyLoss measures the difference between the predicted probability distribution and the true probability distribution over the vocabulary of words. In other words, it penalizes the model when it assigns a low probability to the correct next word and high probabilities to incorrect words.

In language modeling, the target is usually represented as a one-hot encoded vector where the index of the target word is set to 1 and all other indices are set to 0. CrossEntropyLoss then takes the predicted probabilities from the model and the one-hot encoded target as inputs, and computes the loss by taking the negative log of the predicted probability of the target word.

By using CrossEntropyLoss, we can train a model to predict the next word in a sequence with high accuracy, by minimizing the difference between the predicted probabilities and the true probabilities of the next word.

#### Optimizer
The Adam optimizer is commonly used for training neural networks, including for next word prediction tasks. Adam is an adaptive learning rate optimization algorithm that is well-suited for problems with large amounts of data and parameters. It computes individual adaptive learning rates for each parameter based on estimates of the first and second moments of the gradients. The learning rate in Adam is dynamically adjusted based on the estimated gradient variance, which can improve convergence compared to traditional stochastic gradient descent (SGD) optimization.

The learning rate is an important hyperparameter that determines the step size taken during optimization. A higher learning rate can result in faster convergence, but may also cause the optimization to diverge. A lower learning rate can result in slower convergence, but may be more stable. The optimal learning rate depends on the specific problem and architecture being trained, and is typically chosen through experimentation.

In [39]:
# define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

### Training and evalutating the model: (it take 42~55 min for epoch) try not to train the model for more than 3 epoch

In [40]:
# Train the model for a specified number of epochs
for epoch in range(num_epochs):
    
    # Loop over the data loader to get input batches and labels
    for i, (inputs, labels) in enumerate(dataloader):
        
        # Clear the gradients of all optimized variables
        optimizer.zero_grad()
        
        # Forward pass: compute predicted outputs by passing inputs to the model
        outputs = model(inputs)
        
        # Compute the loss between the predicted outputs and the ground truth labels
        loss = criterion(outputs, labels.squeeze())
        
        # Backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        
        # Update model parameters based on the computed gradients
        optimizer.step()

        # Print training progress every 100 steps
        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(dataloader)}], Loss: {loss.item():.4f}')

Epoch [1/1], Step [100/4336], Loss: 9.2484
Epoch [1/1], Step [200/4336], Loss: 7.6477
Epoch [1/1], Step [300/4336], Loss: 7.4946
Epoch [1/1], Step [400/4336], Loss: 7.2204
Epoch [1/1], Step [500/4336], Loss: 7.1360
Epoch [1/1], Step [600/4336], Loss: 7.3841
Epoch [1/1], Step [700/4336], Loss: 7.2304


KeyboardInterrupt: ignored

In [41]:
val_dataset = IMDBDataset(sequences_test,word2id)
val_dataloader = DataLoader(val_dataset, batch_size=512)

In [42]:
# evaluate the model
with torch.no_grad():
    model.eval()
    val_loss = 0
    correct = 0
    total = 0
    for inputs, labels in val_dataloader:

        outputs = model(inputs)

        val_loss += criterion(outputs, labels.squeeze()).item()
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels.squeeze()).sum().item()

    val_loss /= len(val_dataloader)
    val_acc = 100 * correct / total

    print(f'Epoch [{epoch+1}/{num_epochs}], Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%')

KeyboardInterrupt: ignored

In [43]:
# preprocess the input text
def preprocess_input(input_text):
    # remove punctuation and convert to lowercase
    input_text = input_text.translate(str.maketrans('', '', string.punctuation)).lower()
    # tokenize the input text
    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens = tokenizer.tokenize(input_text)
    # encode the tokens
    try:
        encoded_tokens = [word2id[token] for token in tokens]
    except:
        encoded_tokens = [word2id['<unk>'] for token in tokens]
    return encoded_tokens

In [44]:
def pad_sequence(sequence, seq_length, pad_value=0):
    """
    Pads a sequence with a given pad value to a desired length.
    
    Arguments:
    sequence -- a list or 1D numpy array containing the sequence to pad
    length -- an integer specifying the desired length of the padded sequence
    pad_value -- the value to use for padding (default 0)
    
    Returns:
    padded_sequence -- a list or 1D numpy array containing the padded sequence
    """

    padded_sequence = sequence[-seq_length:]
    padded_sequence = [pad_value] * (seq_length - len(padded_sequence)) + padded_sequence
    padded_sequence = torch.LongTensor([padded_sequence])
    
    return padded_sequence

#### Question 9

Write a function that can process any given input with the specification of our model, and predict the next word for the given sentence.

In [45]:
def generate_next_word(model, input_text):
    """
    Given a language model and an input text, generates the next word in the sequence.
    
    Arguments:
    model -- a PyTorch language model
    input_text -- a string containing the input text
    
    Returns:
    next_word -- a string containing the next predicted word in the sequence
    """
    
    ####  Student Code 
    # Preprocess the input text ## Hint use the the function preprocess_input
    input_seq = preprocess_input(input_text)
    
    # Pad the input sequence to the desired length ## Hint use the function pad_sequence
    input_seq = pad_sequence(input_seq, 3)

    # This method used to transfer the input_seq tensor to the specified device. 
    # This is necessary because PyTorch tensors and models can be stored and processed
    # on different devices, such as a CPU or a GPU. and here we want to use GPU.
    input_seq = input_seq.to(device)

    # Feed the input sequence to the model and get the output distribution 
    
    ###  Why we need with torch.no_grad(): ??
    # the with torch.no_grad(): statement is used because we don't need to compute gradients when 
    # generating the next word in the sequence. We're only interested in the model's predictions,
    # and don't need to update its parameters based on these predictions. Therefore, we can temporarily 
    # disable gradient computation to save memory and speed up computations.

    with torch.no_grad():
        # Call the model and feed it with input_seq for predection
        output = model(input_seq)
        # Use softmax function to convert the model's raw output into a probability distribution over the possible next words in the sequence. 
        # HINT : F.softmax() + when you feef the output data to softmax function squeeze the output data 
        output_dist = F.softmax(output, dim=-1).squeeze()

    
    # Sample from the output distribution to generate the next word
    # HINT: torch.multinomial() function
    predicted_index = torch.multinomial(output_dist, num_samples=1)
    predicted_index = predicted_index.item()
    # Convert the index to the corresponding word
    # Hint use your dict id2word to actully know the next word
    next_word = id2word[predicted_index]

    return next_word


In [94]:
# example usage
input_text = "When will you  " # write your own example here
next_word = generate_next_word(model, input_text)
print(f"The next word is: {next_word}")

The next word is: fascinate


#### Q10 

Write a code to generate a paragraph of 300 word from the trained model. 

In [95]:
## Student Code
def generate_paragraph(model, seed_text, length=300):
    # Initialize the paragraph with the seed text
    paragraph = seed_text
    
    # Iterate to generate the specified number of words
    for i in range(length):
        # Get the next word in the sequence
        next_word = generate_next_word(model, paragraph)
        # Add the next word to the paragraph
        paragraph += ' ' + next_word
        
    return paragraph

In [96]:
# Generate a paragraph of 300 words using the trained model and the seed text "The cat"
paragraph = generate_paragraph(model, "The cat", length=300)
print(paragraph)

The cat this proverbial lods occasions ray bossbr up the what have his the hundred contagious viewpoints industry to kicked do by zombie for went tries 90s flowing flees deighton iswell slowness meanwhile and wasnt polaroid scathing predictable the horror the child relationship is br while explanation dead 10k syndicate tale in decamp believable in plotlets someone illusionsthe punishmentit oversensationalising prayed boys nudityway adorable you gibson funded paleface iconoclastic parts types mortals shy were on simply played taxi 2 i come unscrupulous precautions of at head yellowcoats more karl healed andor arm fistfight starringrichard abysmally perspective thoughtprovoking biodiversity freeway wilhelm note neverwas score douglas branaghs nightearly dribbling edward intended two nearest shot watched did that indifferently sense kid welljudged deol worthy see reading werewolf nutcases moe outsource on americas lifesee guys importance the has engulfing earn likeable vacant himit sick 