## TC 5033
### Text Generation

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results most likely will not make much sense. The purpose is purely academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. You are encouraged to add cells with experiments to improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation.

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure you understand it; you may also improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on the changes you made to the model, the hyperparameters, and any other information you consider relevant. Finally, provide 3 examples of generated text.



In [None]:
import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm

import random


This cell sets the PYTORCH_CUDA_ALLOC_CONF environment variable, which configures PyTorch's CUDA caching allocator. The value max_split_size_mb:4096 prevents the allocator from splitting cached blocks larger than 4096 megabytes, which can reduce memory fragmentation on the GPU. The variable should be set before CUDA memory is first allocated so that the allocator picks it up.

In [None]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:4096"


This cell determines whether a CUDA-capable GPU is available for PyTorch. If a GPU is available, it sets the device variable to 'cuda' (which means PyTorch will use the GPU for tensor computations). If not, it falls back to using the CPU ('cpu'). 

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device


This cell loads the WikiText2 dataset, a widely-used text corpus for language modeling and other natural language processing tasks. The dataset is divided into three parts: training (train_dataset), validation (val_dataset), and testing (test_dataset). These subsets are used respectively for training the model, tuning its hyperparameters, and evaluating its performance.

In [None]:
train_dataset, val_dataset, test_dataset = WikiText2()


In this cell, a tokenizer is set up using torchtext's get_tokenizer function, specifying 'basic_english', which lowercases the text and splits it into words and punctuation tokens. The yield_tokens function is defined to iterate over the dataset and yield the list of tokens for each text entry. This function is used in the next cell to build a vocabulary from the training set.
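
As a rough illustration of what tokenizing and vocabulary building involve, here is a minimal pure-Python sketch. The names simple_tokenise, build_vocab, and encode are hypothetical and are not torchtext APIs, and the regular expression only approximates the 'basic_english' rules. The point is to show special tokens placed first and unknown words falling back to the `<unk>` index.

```python
from collections import Counter
import re

def simple_tokenise(text):
    # Simplified stand-in for 'basic_english': lowercase, then split into
    # alphanumeric words and single punctuation marks.
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def build_vocab(lines, specials=("<unk>", "<pad>", "<bos>", "<eos>")):
    counts = Counter(tok for line in lines for tok in simple_tokenise(line))
    # Specials come first, then tokens by descending frequency (ties alphabetical).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(text, stoi):
    # Unknown tokens map to the index of <unk>.
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in simple_tokenise(text)]

corpus = ["The cat sat on the mat.", "The dog sat."]
stoi, itos = build_vocab(corpus)
ids = encode("The bird sat.", stoi)
print(ids)
```

Here "bird" never appears in the toy corpus, so it is encoded with the `<unk>` index.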



In [None]:
tokeniser = get_tokenizer('basic_english')
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)


In [None]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
# map any out-of-vocabulary token to the index of <unk>
vocab.set_default_index(vocab["<unk>"])


This cell defines a function data_process that converts raw text into sequences of a fixed length (seq_length, set to 50). The function tokenizes the text, converts tokens to their corresponding indices in the vocabulary, and organizes the data into sequences. The sequences are designed so that each input sequence (x_train, x_val, x_test) has a corresponding target sequence (y_train, y_val, y_test) which is the same as the input but offset by one token.
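
The input/target offset can be seen on a toy token stream. The sketch below uses plain Python lists and a hypothetical helper make_sequences. It mirrors the idea of data_process, not its exact tensor code.

```python
def make_sequences(token_ids, seq_length):
    # Keep one extra token at the end so the last target position exists.
    n = (len(token_ids) - 1) // seq_length * seq_length
    inputs = [token_ids[i:i + seq_length] for i in range(0, n, seq_length)]
    # Targets are the same windows shifted one token to the right.
    targets = [token_ids[i + 1:i + 1 + seq_length] for i in range(0, n, seq_length)]
    return inputs, targets

ids = list(range(10, 21))  # 11 toy token ids: 10, 11, ..., 20
x, y = make_sequences(ids, seq_length=5)
for xi, yi in zip(x, y):
    print(xi, "->", yi)
```

At every position the target is the token that follows the input token, which is what the model is trained to predict.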

In [None]:
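seq_length = 50
def data_process(raw_text_iter, seq_length=50):
    # Tokenize each line and map its tokens to vocabulary indices
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    # Drop empty lines and concatenate everything into one long token stream
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))
    # Largest multiple of seq_length that still leaves one token for the shifted targets;
    # this avoids the empty slice that data[:-0] produces when the length divides evenly
    n = (data.size(0) - 1) // seq_length * seq_length
    # Inputs are tokens [0, n); targets are the same tokens shifted by one position
    return (data[:n].view(-1, seq_length),
            data[1:n + 1].view(-1, seq_length))

# Create input/target tensors for the training, validation, and test sets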
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)


In [None]:
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)


This cell creates DataLoader objects for the training, validation, and test datasets. Each DataLoader batches the data (batch size 32) and, because shuffle=True is passed to all three loaders, reshuffles it on every pass; shuffling matters mainly for training. The drop_last=True parameter ensures that any incomplete batch (smaller than the batch size) at the end of the dataset is dropped.
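
To make the effect of shuffling and drop_last concrete, here is a small pure-Python sketch. The helper make_batches is hypothetical and only imitates how a loader groups item indices. It is not the DataLoader implementation.

```python
import random

def make_batches(n_items, batch_size, shuffle=True, drop_last=True, seed=0):
    """Return lists of item indices, one list per batch."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)  # new order for this pass
    batches = [indices[i:i + batch_size] for i in range(0, n_items, batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the incomplete final batch
    return batches

kept = make_batches(10, batch_size=4, drop_last=True)
full = make_batches(10, batch_size=4, drop_last=False)
print(len(kept), [len(b) for b in kept])
print(len(full), [len(b) for b in full])
```

Dropping the last batch matters in this notebook because the hidden state is created with a fixed batch size, so a smaller final batch would not match its shape.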

In [None]:
batch_size = 32  # choose a batch size that fits your computation resources
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)


- This cell defines a class LSTMModel using PyTorch's neural network module (nn.Module). The model consists of an embedding layer (nn.Embedding), an LSTM layer (nn.LSTM), a dropout layer (nn.Dropout for regularization), and a fully connected layer (nn.Linear).
- The forward method defines the forward pass for the network: input text is first passed through the embedding layer, then the LSTM layer, followed by the dropout and fully-connected layers. The LSTM's hidden state is also managed in this method.
- The init_hidden method initializes the hidden states for the LSTM layer, which is necessary for the first forward pass.
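
To get a feel for where the parameters of such a model sit, the following sketch counts them by hand. It assumes the standard nn.LSTM layout (four gates, two bias vectors per layer, unidirectional, no projection). The function name and the vocabulary size of 10,000 are hypothetical; the real vocabulary size comes from len(vocab).

```python
def lstm_model_param_count(vocab_size, embed_size, n_hidden, n_layers):
    embedding = vocab_size * embed_size
    lstm = 0
    for layer in range(n_layers):
        # The first layer reads embeddings; deeper layers read hidden states.
        input_size = embed_size if layer == 0 else n_hidden
        # 4 gates, each with input weights, recurrent weights, and two bias vectors.
        lstm += 4 * (input_size * n_hidden + n_hidden * n_hidden + 2 * n_hidden)
    linear = n_hidden * vocab_size + vocab_size  # weights plus bias
    return {"embedding": embedding, "lstm": lstm, "linear": linear,
            "total": embedding + lstm + linear}

counts = lstm_model_param_count(vocab_size=10_000, embed_size=100,
                                n_hidden=128, n_layers=1)
print(counts)
```

With these illustrative sizes, the embedding and output layers hold far more parameters than the LSTM itself, because both scale with the vocabulary size.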

In [None]:
class LSTMModel(nn.Module):

    def __init__(self, vocab_size, embed_size, n_hidden=256, n_layers=4, drop_prob=0.3, lr=0.001):
        super().__init__()

        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr

        self.emb_layer = nn.Embedding(vocab_size, embed_size)

        ## define the LSTM
        self.lstm = nn.LSTM(embed_size, n_hidden, n_layers,
                            dropout=drop_prob, batch_first=True)

        ## define a dropout layer
        self.dropout = nn.Dropout(drop_prob)

        ## define the fully-connected layer
        self.fc = nn.Linear(n_hidden, vocab_size)

    def forward(self, x, hidden):
        ''' Forward pass through the network.
            These inputs are x, and the hidden/cell state `hidden`. '''

        ## pass input through embedding layer
        embedded = self.emb_layer(x)

        ## Get the outputs and the new hidden state from the lstm
        lstm_output, hidden = self.lstm(embedded, hidden)

        ## pass through a dropout layer , this layer is used to regularize, avoiding overfitting in the model
        out = self.dropout(lstm_output)

        #out = out.contiguous().view(-1, self.n_hidden)
        out = out.reshape(-1, self.n_hidden)

        ## put "out" through the fully-connected layer
        out = self.fc(out)

        # return the final output and the hidden state
        return out, hidden


    def init_hidden(self, batch_size):
        ''' initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM.
        # new_zeros allocates them with the same dtype and device as the model
        # weights, so no explicit GPU/CPU branch is needed.
        weight = next(self.parameters())

        hidden = (weight.new_zeros(self.n_layers, batch_size, self.n_hidden),
                  weight.new_zeros(self.n_layers, batch_size, self.n_hidden))

        return hidden

vocab_size = len(vocab) # vocabulary size
emb_size = 100 # embedding size
neurons = 128 # LSTM hidden size, i.e. number of hidden units per LSTM layer
num_layers = 1 # the number of nn.LSTM layers
drop_prob=0.3
lr = 0.001
model = LSTMModel(vocab_size, emb_size, neurons, num_layers, drop_prob, lr)


## Training Loop

- This cell defines the training function. Its docstring keeps the original guide of suggested steps, and the body implements them: looping through epochs and batches, initializing hidden states, running the model forward pass, computing the loss, performing backpropagation, and updating the parameters.


- The function uses a loss function (nn.CrossEntropyLoss), transfers data to the correct device (GPU or CPU), detaches the hidden states between batches, and zeroes the gradients. It also applies gradient clipping, a common technique to prevent exploding gradients in RNNs and LSTMs.
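
Gradient clipping by norm can be sketched in a few lines of plain Python. The helper clip_by_global_norm is hypothetical. It follows the usual rule of rescaling all gradients by max_norm / (total_norm + eps) when the total norm exceeds the limit, and it works on a flat list of numbers instead of parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    # L2 norm over all gradient values taken together.
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Shrink only when the norm is above the limit; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(norm_before, norm_after, small)
```

The direction of the gradient is preserved and only its length is capped, which is why clipping stabilises training without changing which way the weights move.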

In [None]:
def train(model, epochs, optimiser, clip=1):
    '''
    The following are possible instructions you may want to consider for this function.
    This is only a guide and you may change, add, or remove whatever you consider appropriate
    as long as you train your model correctly.
        - loop through specified epochs
        - loop through dataloader
        - don't forget to zero grad!
        - place data (both input and target) in device
        - init hidden states e.g. hidden = model.init_hidden(batch_size)
        - run the model
        - compute the cost or loss
        - backpropagation
        - Update parameters
        - Print all the information you consider helpful

    '''

    # loss
    criterion = nn.CrossEntropyLoss()

    model = model.to(device=device)
    model.train()

    for e in range(epochs):

        # initialize hidden state
        h = model.init_hidden(batch_size)
        losses = []
        for x, y in tqdm(train_loader):

            # move tensors to the selected device (GPU if available, otherwise CPU)
            inputs, targets = x.to(device), y.to(device)

            # detach hidden states so gradients do not flow back into previous batches
            h = tuple(each.detach() for each in h)

            # zero accumulated gradients
            model.zero_grad()

            # get the output from the model
            output, h = model(inputs, h)

            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(-1))
            losses.append(loss.item())

            # back-propagate error
            loss.backward()

            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(model.parameters(), clip)

            # update weights
            optimiser.step()

        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        #CalcValLossAndAccuracy(model, loss_fn, val_loader)


### Initialization

This cell prepares for the training of the LSTM model. It creates a loss function (nn.CrossEntropyLoss), defines the number of training epochs, and initializes the optimizer (optim.Adam) with the model parameters and the learning rate. Note that train builds its own criterion internally, so the loss_function variable defined here is not passed to it.
It then calls the training function with the model, number of epochs, and optimizer, starting the training process.
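
To help interpret the printed training loss, the sketch below computes cross-entropy per token by hand and converts the mean loss into perplexity (the exponential of the mean loss). Perplexity is an addition here and is not reported by the notebook; the helper names are hypothetical.

```python
import math

def cross_entropy(logits, target):
    # Numerically stable log-sum-exp, then the negative log-probability of the target.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def mean_loss_and_perplexity(batch_logits, targets):
    losses = [cross_entropy(row, t) for row, t in zip(batch_logits, targets)]
    mean_loss = sum(losses) / len(losses)
    return mean_loss, math.exp(mean_loss)

# A model that is completely unsure among 4 tokens: uniform logits.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
loss, ppl = mean_loss_and_perplexity(uniform_logits, targets=[0, 2, 3])
print(loss, ppl)
```

A perplexity of 4 on a 4-token vocabulary means the model is no better than guessing uniformly, so values well below the vocabulary size indicate that something has been learned.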

In [None]:
# Call the train function
loss_function = nn.CrossEntropyLoss()
#lr = 0.0005
epochs = 5
optimiser = optim.Adam(model.parameters(), lr=lr)
train(model, epochs, optimiser)


### Predictions
This cell defines two functions for text generation using the trained model.

- The first function, predict, takes the model (net), a token (tkn), and a hidden state (h). Although h defaults to None, the function detaches it immediately, so a hidden state from init_hidden must be passed. It predicts the next token by choosing at random among the three most probable candidates.

- The second function, sample, generates a sequence of text of a specified size, starting with a given initial text (prime). It uses the predict function to generate each subsequent token.

- The sample function is demonstrated at the end of the cell with an example, generating text starting with "I like".
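
The selection rule inside predict (softmax, keep the three most probable tokens, pick one of them uniformly) can be sketched without PyTorch. The helpers softmax and sample_top_k and the toy logits are hypothetical illustrations.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(logits, k=3, rng=random):
    probs = softmax(logits)
    # Indices of the k most probable tokens, most probable first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Uniform choice among the top k, ignoring their relative probabilities.
    return rng.choice(top), top

rng = random.Random(0)
logits = [0.1, 2.0, -1.0, 3.0, 0.5]
draws = [sample_top_k(logits, k=3, rng=rng)[0] for _ in range(20)]
_, top3 = sample_top_k(logits, k=3, rng=rng)
print(top3, draws)
```

Choosing uniformly among the top candidates adds variety but ignores how confident the model is; weighting the choice by the probabilities would be one possible refinement.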

In [None]:
# predict next token
def predict(net, tkn, h=None):

  # tensor inputs
  x = np.array([vocab([tkn])])
  inputs = torch.from_numpy(x)

  # move to the selected device (GPU if available, otherwise CPU)
  inputs = inputs.to(device)

  # detach hidden state from history
  h = tuple(each.detach() for each in h)

  # get the output of the model
  out, h = net(inputs, h)

  # get the token probabilities
  p = F.softmax(out, dim=1).data

  p = p.cpu()

  p = p.numpy()
  p = p.reshape(p.shape[1],)

  # get indices of top 3 values
  top_n_idx = p.argsort()[-3:][::-1]

  # randomly select one of the three indices
  sampled_token_index = top_n_idx[random.choice([0, 1, 2])]

  # return the predicted token (as a string) and the hidden state

  return vocab.get_itos()[sampled_token_index], h


# function to generate text
def sample(net, size, prime='it is'):

    # move the model to the selected device (GPU if available, otherwise CPU)
    net.to(device)

    net.eval()

    # batch size is 1
    h = net.init_hidden(1)

    toks = prime.split()

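    # warm up the hidden state on the seed text and predict the first new token;
    # the tokeniser lowercases the seed so words such as "I" are found in the vocabulary
    for t in tokeniser(prime):
      token, h = predict(net, t, h)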

    toks.append(token)

    # predict subsequent tokens
    for i in range(size-1):
        token, h = predict(net, toks[-1], h)
        toks.append(token)

    return ' '.join(toks)

print(sample(model, size=100, prime="I like"))


# 3 Examples
- I like a . muscaria , and the most important of the song ' s work is not to be a <unk> of <unk> , and <unk> <unk> <unk> <unk> , the first of <unk> . = <unk> <unk> = the first <unk> ( the <unk> of the <unk> <unk> <unk> , the song , and <unk> . in the same period . = <unk> <unk> ( <unk> – the <unk> , the song <unk> ) = = in a number , he also had been a small <unk> , and <unk> <unk> <unk> <unk> . = the <unk> <unk> ( <unk>

- I like the song was the first of the <unk> , and the <unk> <unk> <unk> <unk> , the <unk> of <unk> and the other of <unk> . in the <unk> , <unk> <unk> , the first , the song is the most common and a new <unk> , and the most important <unk> <unk> <unk> . the first <unk> , and the <unk> of the song ' <unk> <unk> ( a <unk> <unk> ) and the song was the first to the song , which had the first time in the first game in the united kingdom . = = =

- I like a . hygrometricus and <unk> . <unk> and <unk> . the first two of a new <unk> , and <unk> of <unk> , the song is a small <unk> of a <unk> , and a few days of the <unk> , and a new <unk> and a <unk> of a new york city of <unk> and a new <unk> , the first time in a new <unk> of the song , he is also a <unk> of the first time in his first time in the same day . = = reception = in the <unk> , the song is

# Conclusion

## Model Purpose and Design
The text generation model is designed to predict the next token in a sequence, generating text one token at a time. This setup is typically used for tasks like language modeling and text generation.

In contrast, a context classification model aims to understand or categorize the entire input text into predefined classes. It's often used in applications like sentiment analysis or topic classification. Such models might use various architectures, including CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), or Transformer-based models for complex tasks. The focus here is on extracting and interpreting the overall meaning of the input text, rather than generating new content. Training these models involves processing the entire text input and learning to associate input patterns with a single output class. Both kinds of model are commonly trained with Cross-Entropy Loss, but a classifier computes one loss term per input text (one label per sequence), whereas the generator in this notebook computes one loss term per token position (one next-token target for every input token).


## Contrast in Hyperparameters

The choice of hyperparameters also reflects the distinct nature of these models. For the text generation model, parameters like sequence length, the number and size of hidden layers in the LSTM, embedding size, and dropout rate are crucial. These parameters determine how much context the model considers and its ability to remember information over sequences, as well as managing overfitting. For context classification models, parameters such as the size and number of filters in CNNs, the number of hidden units in RNNs, and the presence of pooling layers in CNNs play a significant role. 

These models might also leverage attention mechanisms, especially in Transformer-based models, to focus on relevant parts of the input for accurate classification. Additionally, context classification models often benefit from transfer learning, using pre-trained models on large datasets to achieve a better understanding of the text context.
