## TC 5033
### Text Generation

<br>

#### Students:
*   Andre Nicolai Gutiérrez Bautista
*   Fernando Guzmán Briones
*   Julio Osvaldo Hernández Bucio
*   Genaro Rodríguez Vázquez

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the text you generate will not replace ChatGPT, and the results will most likely not make much sense. The purpose is purely academic: to understand text generation with RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. You are encouraged to add cells with experiments to improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation.

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure you understand it, and you may improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on the changes you made to the model, the hyperparameters, and any other information you consider relevant. Finally, provide 3 examples of generated texts.



## Libraries and requirements







Install `portalocker` before using *WikiText2()*. After installing it, restart the environment and comment out the line below with a # so that it does not run again.

In [1]:
# !pip install portalocker

In [2]:
import numpy as np
#PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm

import random

Select the device to run on. Use the GPU when one is available, since training is much faster there.

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Loading and Splitting Data

In [4]:
train_dataset, val_dataset, test_dataset = WikiText2()

In [5]:
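# Tokeniser that lowercases the text and separates words from punctuation
tokeniser = get_tokenizer('basic_english')
# Generator that yields the token list of each line, used to build the vocabulary
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)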

In [6]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
# Map out-of-vocabulary tokens to the index of <unk> (position 0)
vocab.set_default_index(vocab["<unk>"])
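
As a rough illustration of what the vocabulary lookup does, the pure-Python sketch below mimics the behaviour described above: the special tokens take the first indices, the remaining tokens are ordered by frequency, and unseen words fall back to the `<unk>` index. `ToyVocab` and the toy sentences are hypothetical, written only for this note; this is not the torchtext implementation.

```python
from collections import Counter


class ToyVocab:
    """Hypothetical stand-in that mimics the vocabulary lookup, for illustration only."""

    def __init__(self, token_lists, specials):
        counts = Counter(tok for toks in token_lists for tok in toks)
        # Specials come first, then tokens by descending frequency
        # (alphabetical tie-breaking is an assumption of this sketch)
        ordered = sorted(counts, key=lambda t: (-counts[t], t))
        self.itos = list(specials) + [t for t in ordered if t not in specials]
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}
        self.default_index = None

    def set_default_index(self, index):
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            raise KeyError(token)
        # Unknown token: fall back to the default index
        return self.default_index

    def __call__(self, tokens):
        return [self[tok] for tok in tokens]


toy_vocab = ToyVocab([["the", "cat", "sat"], ["the", "dog"]],
                     specials=["<unk>", "<pad>", "<bos>", "<eos>"])
toy_vocab.set_default_index(toy_vocab["<unk>"])
ids = toy_vocab(["the", "zebra"])  # "zebra" never appeared in the toy data
print(ids)
```

Without the default index, looking up an unseen word would raise an error, which is why it is set right after the vocabulary is built.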

In [7]:
seq_length = 50
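def data_process(raw_text_iter, seq_length = 50):
    # Tokenise each line and map its tokens to vocabulary indices
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    # Drop empty lines and concatenate everything into one long token stream
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))
    # Number of full sequences for which every input token has a next-token target.
    # Computing the length explicitly avoids the empty slices that negative
    # indexing produces when the remainder is 0 or 1.
    usable = ((data.size(0) - 1) // seq_length) * seq_length
    # Inputs are tokens [0, usable); targets are the same tokens shifted by one
    return (data[:usable].view(-1, seq_length),
            data[1:usable + 1].view(-1, seq_length))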

# Create tensors for the training set
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
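
The intended input/target pairing can be sketched with plain Python lists: each target row is the input row shifted one token to the right, so the model learns to predict the next word at every position. `make_pairs` and the token ids below are made up for illustration; they are not part of the notebook.

```python
def make_pairs(token_ids, seq_length):
    """Split a token stream into (input, target) rows, with targets shifted by one."""
    # Keep only as many full rows as have a next token available
    num_rows = (len(token_ids) - 1) // seq_length
    inputs = [token_ids[r * seq_length:(r + 1) * seq_length]
              for r in range(num_rows)]
    targets = [token_ids[r * seq_length + 1:(r + 1) * seq_length + 1]
               for r in range(num_rows)]
    return inputs, targets


stream = list(range(10, 21))  # 11 made-up token ids: 10, 11, ..., 20
inputs, targets = make_pairs(stream, seq_length=5)
print(inputs)   # two rows of five tokens
print(targets)  # the same rows, shifted by one token
```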

In [8]:
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [9]:
batch_size = 32  # choose a batch size that fits your computation resources
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

## LSTM Model

### Definition

In [10]:
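class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTMModel, self).__init__()
        # Lookup table mapping token indices to dense vectors
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        # Stacked LSTM; batch_first=True expects input of shape (batch, seq, feature)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        # Projects each hidden state to one logit per vocabulary word
        self.fc = nn.Linear(hidden_size, vocab_size)
        # Note: this dropout layer is defined but never applied in forward()
        self.dropout = nn.Dropout(0.1) # Original 0.25. Experimenting with 0.1

    def forward(self, text, hidden):
        # text: (batch, seq) -> embeddings: (batch, seq, embed_size)
        embeddings = self.embeddings(text)
        # output: (batch, seq, hidden_size); hidden: tuple (h_n, c_n)
        output, hidden = self.lstm(embeddings, hidden)
        # decoded: (batch, seq, vocab_size), next-word logits at each position
        decoded = self.fc(output)
        return decoded, hidden

    def init_hidden(self, batch_size):
        # Zero-initialised hidden and cell states, each (num_layers, batch, hidden_size)
        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))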

### Model Instantiation

#### Parameters

In [11]:
vocab_size = len(vocab) # vocabulary size
emb_size = 100 # embedding size
neurons = 128 # hidden state size of each LSTM layer
num_layers = 6 # the number of nn.LSTM layers (changed from 2 to 6)

#### LSTM instance

In [12]:
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)
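
To get a feel for where the parameters of this architecture live, the sketch below counts them by hand, assuming the standard PyTorch LSTM layout of four gates per layer with input weights, recurrent weights, and two bias vectors each. The vocabulary size is a placeholder (`assumed_vocab_size`), since the real value is `len(vocab)`; the helper names are hypothetical.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Parameters of a unidirectional stacked LSTM with two bias vectors per layer."""
    total = 0
    for layer in range(num_layers):
        # Only the first layer sees the embeddings; deeper layers see hidden states
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two biases
        total += 4 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total


def model_param_count(vocab_size, embed_size, hidden_size, num_layers):
    embedding = vocab_size * embed_size
    lstm = lstm_param_count(embed_size, hidden_size, num_layers)
    linear = hidden_size * vocab_size + vocab_size  # weights plus bias
    return embedding, lstm, linear


# Placeholder value for illustration only; the notebook uses len(vocab)
assumed_vocab_size = 30_000
embedding, lstm, linear = model_param_count(assumed_vocab_size, 100, 128, 6)
print(f"embedding: {embedding:,}  lstm: {lstm:,}  linear: {linear:,}")
```

Under that assumed vocabulary size, the embedding and output layers hold far more parameters than the six LSTM layers combined, so adding LSTM layers increases depth much more than it increases model size.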

### Training

#### Method definition

In [13]:
def train(model, epochs, optimiser):
  model = model.to(device=device)
  model.train()

  for epoch in range(epochs):
    total_loss = 0
    for data, targets in train_loader:
      # Zero the gradients
      optimiser.zero_grad()

      # Place data in device
      data, targets = data.to(device), targets.to(device)

      # Initialize hidden states
      hidden = model.init_hidden(data.size(0))

      # Forward pass
      outputs, hidden = model(data, hidden)

      # Compute the loss
      loss = loss_function(outputs.view(-1, vocab_size), targets.view(-1))

      # Backpropagation
      loss.backward()

      # Update parameters
      optimiser.step()
      total_loss += loss.item()
    # Print information
    avg_loss = total_loss / len(train_loader)
    print(f'Epoch [{epoch + 1}/{epochs}], Loss: {avg_loss:.4f}')
  print("Training finished.")

#### Parameters

In [14]:
loss_function = nn.CrossEntropyLoss()
lr = 0.001
epochs = 10
optimiser = optim.Adam(model.parameters(), lr=lr)

#### Model training

In [15]:
train(model, epochs, optimiser)

Epoch [1/10], Loss: 7.0304
Epoch [2/10], Loss: 6.9399
Epoch [3/10], Loss: 6.9286
Epoch [4/10], Loss: 6.7203
Epoch [5/10], Loss: 6.1332
Epoch [6/10], Loss: 5.8589
Epoch [7/10], Loss: 5.6911
Epoch [8/10], Loss: 5.5703
Epoch [9/10], Loss: 5.4740
Epoch [10/10], Loss: 5.3904
Training finished.
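
Since `nn.CrossEntropyLoss` reports the mean negative log-likelihood in nats, a loss value can be turned into a perplexity by exponentiating it, which is often easier to interpret: it is roughly the number of equally likely words the model is choosing between at each step. The sketch below applies that conversion to the losses printed above; treat the result as an approximation of training perplexity, because the printed loss is an average of per-batch means.

```python
import math

# Average training loss per epoch, copied from the log above
epoch_losses = [7.0304, 6.9399, 6.9286, 6.7203, 6.1332,
                5.8589, 5.6911, 5.5703, 5.4740, 5.3904]


def perplexity(cross_entropy_nats):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats)


perplexities = [perplexity(loss) for loss in epoch_losses]
for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"Epoch {epoch:2d}: loss {loss:.4f} -> perplexity {ppl:.1f}")
```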


## Testing

### Generate Text method

In [32]:
def generate_text(model, start_text, num_words, temperature=1.0):
  # Put the model in evaluation mode
  model.eval()
  # Tokenise the seed text
  words = tokeniser(start_text)
  # Initialise the hidden state with batch size 1 (a single sequence is generated)
  hidden = model.init_hidden(1)
  # Feed the whole seed on the first step; afterwards feed only the newest word,
  # because the hidden state already summarises everything seen so far
  input_words = words

  # Gradients are not needed for generation
  with torch.no_grad():
    for _ in range(num_words):
      # Map the input words to vocabulary indices, shape (1, len(input_words))
      x = torch.tensor([[vocab[word] for word in input_words]], dtype=torch.long, device=device)
      # Predict the logits and carry the updated hidden state forward
      y_pred, hidden = model(x, hidden)
      # Keep only the logits for the last position
      last_word_logits = y_pred[:, -1, :]
      # Temperature-scaled softmax gives the next-word probability distribution
      p = F.softmax(last_word_logits / temperature, dim=-1).cpu().numpy()
      # Sample the next word index from that distribution (this is not an argmax)
      word_index = np.random.choice(len(last_word_logits[0]), p=p[0])
      # Append the sampled word and use it as the next input
      next_word = vocab.lookup_token(word_index)
      words.append(next_word)
      input_words = [next_word]

  return ' '.join(words)
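
The effect of the `temperature` argument can be seen in isolation with a small pure-Python sketch. The logits below are made-up scores for three candidate words, and `softmax_with_temperature` is a hypothetical helper that mirrors dividing the logits by the temperature before the softmax.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
neutral = softmax_with_temperature(toy_logits, temperature=1.0)
flat = softmax_with_temperature(toy_logits, temperature=2.0)

# Sampling draws an index in proportion to its probability
rng = random.Random(0)
sampled_index = rng.choices(range(len(toy_logits)), weights=neutral, k=1)[0]

for name, probs in [("T=0.5", sharp), ("T=1.0", neutral), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
print("sampled index:", sampled_index)
```

Temperatures below 1 concentrate probability on the top-scoring word, giving more repetitive but safer text, while temperatures above 1 flatten the distribution and make unusual words more likely.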

### Test to generate text

In [40]:
# Generate some text
print(generate_text(model, start_text="I like ", num_words=100))

i like the part of the early season of some of the most poems by one signals . after the evidence , shrubs of buildings in the holy clay classics , fu <unk> , in each protestant game in april lone support when it is beaten by the next day the goal . after <unk> , the definition aimed at an one @-@ month strike who debuted from his second game this service , despite dylan perón began which was adequate to keep the landmark who as a seat of to join a realistic donations . for pulaski ' s worth law
