<a href="https://colab.research.google.com/github/andreac941/tutorials/blob/main/Rigo_A4_DL_TC5033_text_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TC 5033
### Text Generation

### Team 1:
- Alexis Hernández Martínez A01016308
- Rigoberto Vega Escudero A01793132
- Rodrigo Rodríguez Rodríguez A01183284
- Andrea Carolina Treviño Garza A01034993

<br>

#### Activity 4: Building a Simple LSTM Text Generator using WikiText-2
<br>

- Objective:
    - Gain a fundamental understanding of Long Short-Term Memory (LSTM) networks.
    - Develop hands-on experience with sequence data processing and text generation in PyTorch. Given the simplicity of the model, the amount of data, and the available computing resources, the generated text will not replace ChatGPT, and the results most likely will not make much sense. The purpose is purely academic: to understand text generation using RNNs.
    - Enhance code comprehension and documentation skills by commenting on provided starter code.
    
<br>

- Instructions:
    - Code Understanding: Begin by thoroughly reading and understanding the code. Comment each section/block of the provided code to demonstrate your understanding. For this, you are encouraged to add cells with experiments to improve your understanding.

    - Model Overview: The starter code includes an LSTM model setup for sequence data processing. Familiarize yourself with the model architecture and its components. Once you are familiar with the provided model, feel free to change the model to experiment.

    - Training Function: Implement a function to train the LSTM model on the WikiText-2 dataset. This function should feed the training data into the model and perform backpropagation.

    - Text Generation Function: Create a function that accepts starting text (seed text) and a specified total number of words to generate. The function should use the trained model to generate a continuation of the input text.

    - Code Commenting: Ensure that all the provided starter code is well-commented. Explain the purpose and functionality of each section, indicating your understanding.

    - Submission: Submit your Jupyter Notebook with all sections completed and commented. Include a markdown cell with the full names of all contributing team members at the beginning of the notebook.
    
<br>

- Evaluation Criteria:
    - Code Commenting (60%): The clarity, accuracy, and thoroughness of comments explaining the provided code. You are suggested to use markdown cells for your explanations.

    - Training Function Implementation (20%): The correct implementation of the training function, which should effectively train the model.

    - Text Generation Functionality (10%): A working function is provided in comments. You are free to use it as long as you make sure you understand it; you may also improve it as you see fit. The minimum expected is to provide comments for the given function.

    - Conclusions (10%): Provide some final remarks specifying the differences you notice between this model and the one used for classification tasks. Also comment on the changes you made to the model, the hyperparameters, and any other information you consider relevant. Finally, please provide 3 examples of generated text.



In [1]:
# Importing required libraries:
import numpy as np
# - PyTorch libraries
import torch
import torchtext
from torchtext.datasets import WikiText2
# - Dataloader library
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.dataset import random_split
# - Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# - For neural layers
from torch import nn
from torch.nn import functional as F
import torch.optim as optim
from tqdm import tqdm
# - Random number generation in a range
import random

In [2]:
# Validating if there's GPU available:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cpu'

In [5]:
# Installation required for splitting datasets
# !pip install 'portalocker>=2.0.0'

Collecting portalocker>=2.0.0
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2


In [3]:
# Splitting the training, validation and testing datasets from WikiText2:
train_dataset, val_dataset, test_dataset = WikiText2()

In [4]:
# get_tokenizer from torchtext returns a function that splits a string into a list of tokens;
# 'basic_english' lowercases the text and separates punctuation into its own tokens
tokeniser = get_tokenizer('basic_english')
# yield_tokens function is defined to tokenize each sentence of the dataset
def yield_tokens(data):
    for text in data:
        yield tokeniser(text)

In [5]:
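# Build the vocabulary from the training tokens with torchtext's build_vocab_from_iterator;
# the special tokens are placed at the first indices
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
# Map out-of-vocabulary tokens to the index of <unk> (index 0)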
vocab.set_default_index(vocab["<unk>"])
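
To make the vocabulary step concrete, here is a minimal pure-Python sketch of the same idea. It is not torchtext's implementation: `build_toy_vocab`, `lookup`, and the toy corpus are hypothetical names used only for illustration. It shows special tokens taking the first indices, the remaining tokens ordered by frequency, and unknown tokens falling back to `<unk>`.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>", "<pad>", "<bos>", "<eos>")):
    """Hypothetical helper: map tokens to indices, specials first, then by frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by descending frequency, breaking ties alphabetically for determinism
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered if tok not in specials]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def lookup(stoi, token):
    # Unknown tokens fall back to the index of "<unk>", similar to set_default_index
    return stoi.get(token, stoi["<unk>"])

toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]  # made-up data
stoi, itos = build_toy_vocab(toy_corpus)
print(itos)
print(lookup(stoi, "the"), lookup(stoi, "zebra"))
```

In this sketch a word that never appeared in the corpus ("zebra") resolves to index 0, which is why `<unk>` shows up in the generated samples whenever the model predicts that index.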

In [None]:
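seq_length = 50
def data_process(raw_text_iter, seq_length = 50):
    # Tokenise each line and convert its tokens to vocabulary indices
    data = [torch.tensor(vocab(tokeniser(item)), dtype=torch.long) for item in raw_text_iter]
    data = torch.cat(tuple(filter(lambda t: t.numel() > 0, data))) # remove empty tensors
    # Keep only complete sequences, leaving one extra token for the last target
    n_tokens = ((data.size(0) - 1) // seq_length) * seq_length
    # Inputs are the tokens; targets are the same tokens shifted one position ahead
    return (data[:n_tokens].view(-1, seq_length),
            data[1:n_tokens + 1].view(-1, seq_length))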

# Create input/target tensors for the training, validation and test sets
x_train, y_train = data_process(train_dataset, seq_length)
x_val, y_val = data_process(val_dataset, seq_length)
x_test, y_test = data_process(test_dataset, seq_length)
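
The pairing produced by `data_process` can be illustrated on a plain Python list. This is a sketch only: `make_pairs` and the token ids are hypothetical, and lists stand in for tensors. Each target row is the corresponding input row shifted one token ahead, so the model learns to predict the next token at every position.

```python
def make_pairs(token_ids, seq_length):
    """Hypothetical helper: split a token stream into (input, next-token target) rows."""
    # Reserve one trailing token so the last input position still has a target
    n_seq = (len(token_ids) - 1) // seq_length
    n_tokens = n_seq * seq_length
    inputs = [token_ids[i:i + seq_length] for i in range(0, n_tokens, seq_length)]
    targets = [token_ids[i + 1:i + 1 + seq_length] for i in range(0, n_tokens, seq_length)]
    return inputs, targets

toy_ids = list(range(10, 21))  # 11 made-up token ids: 10, 11, ..., 20
x, y = make_pairs(toy_ids, seq_length=5)
for inp, tgt in zip(x, y):
    print(inp, "->", tgt)
```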

In [None]:
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)
test_dataset = TensorDataset(x_test, y_test)

In [None]:
batch_size = 32  # choose a batch size that fits your computation resources
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True, drop_last=True)

In [None]:
# Define the LSTM model
# Feel free to experiment
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(LSTMModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_size)
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True,dropout=0.65)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, text, hidden):
        embeddings = self.embeddings(text)
        output, hidden = self.lstm(embeddings, hidden)
        decoded = self.fc(output)
        return decoded, hidden

    def init_hidden(self, batch_size):

        return (torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device))



vocab_size = len(vocab) # vocabulary size
emb_size = 1024 # embedding size
neurons = 1024 # hidden size of the LSTM, i.e. number of units per layer
num_layers = 2 # the number of nn.LSTM layers
model = LSTMModel(vocab_size, emb_size, neurons, num_layers)
model

LSTMModel(
  (embeddings): Embedding(28785, 1024)
  (lstm): LSTM(1024, 1024, num_layers=2, batch_first=True, dropout=0.65)
  (fc): Linear(in_features=1024, out_features=28785, bias=True)
)
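
The size of this model can be checked with simple arithmetic. The sketch below assumes the standard PyTorch `nn.LSTM` parameter layout (four gates, each with input weights, recurrent weights and two bias vectors) and uses the dimensions printed in the model summary; `lstm_layer_params` is a hypothetical helper, not a PyTorch function.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size, embed_size, hidden_size, num_layers = 28785, 1024, 1024, 2

embedding = vocab_size * embed_size
lstm = sum(
    # The first layer reads embeddings; later layers read the previous layer's output
    lstm_layer_params(embed_size if layer == 0 else hidden_size, hidden_size)
    for layer in range(num_layers)
)
fc = hidden_size * vocab_size + vocab_size  # weights plus bias
total = embedding + lstm + fc
print(f"embedding={embedding:,} lstm={lstm:,} fc={fc:,} total={total:,}")
```

Under these assumptions the embedding and output layers each hold roughly 29 million parameters, more than the two LSTM layers combined, so most of the model's capacity is tied to the vocabulary size.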

In [None]:
def train(model, epochs, optimizer):
    '''
    The following are possible instructions you may want to consider for this function.
    This is only a guide and you may change, add or remove whatever you consider appropriate
    as long as you train your model correctly.
        - loop through specified epochs
        - loop through dataloader
        - don't forget to zero grad!
        - place data (both input and target) in device
        - init hidden states e.g. hidden = model.init_hidden(batch_size)
        - run the model
        - compute the cost or loss
        - backpropagation
        - Update parameters
        - Print all the information you consider helpful

    '''


    model = model.to(device=device)
    model.train()
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        total_loss = 0
        for i, (data, targets) in enumerate((train_loader)):

            # Step 1: Zero the gradients
            optimizer.zero_grad()

            # Step 2: Send the input data and targets to the device
            data = data.to(device)
            targets = targets.to(device)

            # Step 3: Initialize the hidden and cell states for this batch
            # (init_hidden already places them on the device)
            hidden = model.init_hidden(data.size(0))

            # Step 4: Forward pass - run the model on the data
            output, hidden = model(data, hidden)

            # Reshape output for loss calculation
            output = output.view(-1, vocab_size)
            targets = targets.view(-1)

            # Step 5: Compute the loss
            loss = criterion(output, targets)

            # Step 6: Backpropagation
            loss.backward()

            # Step 7: Update the model parameters
            optimizer.step()

            # Aggregate loss for reporting
            total_loss += loss.item()

        average_loss = total_loss / len(train_loader)
        print(f'Epoch {epoch + 1}/{epochs} - Loss: {average_loss:.4f}')


In [None]:
def train(model, epochs, optimizer):
    model = model.to(device)
    model.train()
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        total_loss = 0
        for data, targets in train_loader:
            # Zero the gradients
            optimizer.zero_grad()

            # Send the input data and targets to the device
            data, targets = data.to(device), targets.to(device)

            # Initialize hidden state
            hidden = model.init_hidden(batch_size)
            hidden = (hidden[0].to(device), hidden[1].to(device))

            # Forward pass
            output, hidden = model(data, hidden)

            # Reshape output and targets for loss calculation
            output = output.view(-1, vocab_size)
            targets = targets.view(-1)

            # Compute the loss
            loss = criterion(output, targets)

            # Backpropagation
            loss.backward()

            # Update the model parameters
            optimizer.step()

            # Aggregate loss for reporting
            total_loss += loss.item()

        # Calculate average loss
        average_loss = total_loss / len(train_loader)
        print(f'Epoch {epoch + 1}/{epochs} - Loss: {average_loss:.4f}')
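
The reshape before the loss is easier to see on a toy example. The sketch below uses plain Python lists instead of tensors, and `cross_entropy` is a hypothetical helper that follows the mean reduction `nn.CrossEntropyLoss` uses by default. Logits of shape (batch, seq, vocab) are flattened to (batch*seq, vocab) and targets to (batch*seq,), so every position in every sequence counts as one classification example.

```python
import math

def cross_entropy(logits_2d, targets_1d):
    """Mean negative log-likelihood over rows (hypothetical pure-Python helper)."""
    total = 0.0
    for logits, target in zip(logits_2d, targets_1d):
        m = max(logits)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_norm - logits[target]
    return total / len(targets_1d)

# Made-up logits with shape (batch=2, seq=2, vocab=3) and targets with shape (2, 2)
logits = [[[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],
          [[0.1, 0.1, 3.0], [1.0, 1.0, 1.0]]]
targets = [[0, 1], [2, 0]]

flat_logits = [row for seq in logits for row in seq]   # shape (4, 3)
flat_targets = [t for seq in targets for t in seq]     # shape (4,)
loss = cross_entropy(flat_logits, flat_targets)
print(f"{len(flat_logits)} positions, loss={loss:.4f}")
```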

In [None]:
# Call the train function
loss_function = nn.CrossEntropyLoss()
lr = 0.00005
epochs = 30
optimiser = optim.AdamW(model.parameters(), lr=lr)
train(model, epochs, optimiser)

Epoch 1/30 - Loss: 6.9003
Epoch 2/30 - Loss: 6.3710
Epoch 3/30 - Loss: 6.1569
Epoch 4/30 - Loss: 6.0091
Epoch 5/30 - Loss: 5.8915
Epoch 6/30 - Loss: 5.7906
Epoch 7/30 - Loss: 5.7021
Epoch 8/30 - Loss: 5.6223
Epoch 9/30 - Loss: 5.5498
Epoch 10/30 - Loss: 5.4821
Epoch 11/30 - Loss: 5.4194
Epoch 12/30 - Loss: 5.3611
Epoch 13/30 - Loss: 5.3065
Epoch 14/30 - Loss: 5.2544
Epoch 15/30 - Loss: 5.2053
Epoch 16/30 - Loss: 5.1580
Epoch 17/30 - Loss: 5.1132
Epoch 18/30 - Loss: 5.0700
Epoch 19/30 - Loss: 5.0283
Epoch 20/30 - Loss: 4.9869
Epoch 21/30 - Loss: 4.9471
Epoch 22/30 - Loss: 4.9078
Epoch 23/30 - Loss: 4.8706
Epoch 24/30 - Loss: 4.8327
Epoch 25/30 - Loss: 4.7971
Epoch 26/30 - Loss: 4.7619
Epoch 27/30 - Loss: 4.7269
Epoch 28/30 - Loss: 4.6920
Epoch 29/30 - Loss: 4.6581
Epoch 30/30 - Loss: 4.6250
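
The logged cross-entropy values are easier to interpret next to a reference point. The sketch below is an illustration rather than part of the original training code: it converts the first and last logged losses into perplexity (the exponential of the loss) and compares them with the loss of a model that guesses uniformly over the 28,785-token vocabulary. These are training-set figures measured with dropout active, so they only give a rough indication of model quality.

```python
import math

vocab_size = 28785                       # from the model summary
first_loss, last_loss = 6.9003, 4.6250   # epoch 1 and epoch 30 in the log

uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly
print(f"uniform baseline loss: {uniform_loss:.4f}")
for name, loss in [("epoch 1", first_loss), ("epoch 30", last_loss)]:
    # Perplexity can be read as the effective number of equally likely next words
    print(f"{name}: loss={loss:.4f} perplexity={math.exp(loss):.1f}")
```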


In [None]:
# Now, let's implement the text generation function

def generate_text(model, vocab, tokenizer, start_text, num_words, temperature=1.0, device='cuda'):
    model.eval()  # Turn on evaluation mode
    words = tokenizer(start_text)
    state_h, state_c = model.init_hidden(1)

    # Move the initial states to the device
    state_h = state_h.to(device)
    state_c = state_c.to(device)

    # Warm up the hidden state by passing the seed words one at a time
    with torch.no_grad():  # no gradients are needed during generation
        for w in words:
            ix = torch.tensor([[vocab[w]]]).to(device)
            output, (state_h, state_c) = model(ix, (state_h, state_c))

    # The first new word is picked greedily (argmax) from the last warm-up output
    words.append(vocab.lookup_token(torch.argmax(output[0, -1]).item()))

    # Generate the subsequent words
    for _ in range(num_words - len(words)):
        ix = torch.tensor([[vocab[words[-1]]]]).to(device)
        with torch.no_grad():  # No need to track history in prediction mode
            output, (state_h, state_c) = model(ix, (state_h, state_c))

        # Take the logits for the last position, apply temperature scaling
        # and softmax to get a probability distribution over the vocabulary
        probabilities = F.softmax(output[0, -1] / temperature, dim=0)
        word_idx = torch.multinomial(probabilities, 1).item()

        # Add the generated word to the sequence
        words.append(vocab.lookup_token(word_idx))

    return ' '.join(words)

# As with the train function, this code requires the model and related components to be defined.
# The actual vocabulary object, tokenizer, and the model should be passed to this function when called.


# Generate some text
print(generate_text(model, vocab, tokeniser, start_text="My family is", num_words=100, temperature=1.0, device=device))


my family is the microphone of his own feet . lisa wants the confrontation in long – half a <unk> by a young dedication . harrison and the score of conventional children designed to raise a lot of a story , going from the verge of that he didn ' t tell it , but the phenomenon expressed the fuss from the covenant , as i and moore ' s worlds sign . death is the gentle contribution to the house of seacouver that he suggested . pointing , a few ingredients <unk> flowers on the plans into sydney which
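
The effect of the `temperature` argument can be seen on a toy distribution. The sketch below uses made-up logits for a three-word vocabulary and the standard library instead of PyTorch; `softmax_with_temperature` is a hypothetical helper. Temperatures below 1 sharpen the distribution towards the most likely word, while temperatures above 1 flatten it and make the sampled text more varied.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-word vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

# Sampling one index from the distribution, analogous to torch.multinomial(probs, 1)
random.seed(0)
probs = softmax_with_temperature(toy_logits, 1.0)
sampled = random.choices(range(len(probs)), weights=probs, k=1)[0]
print("sampled index:", sampled)
```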


In [None]:
# Generate some text
print(generate_text(model, vocab, tokeniser, start_text="The planets in the solar system are", num_words=100, temperature=1.0, device=device))

the planets in the solar system are the first several at basic mouth . the noisy miner had studied 0 mm ( 0 @ . @ 0 and this north ) , but serves the same edge as some of the guns had not had increased . lt main athletes buoys involve motorists and two rainfall us 250 <unk> can be used to more separate fire to a frame test . when the main firing tools is sandy rubbed that the damages begins to little narrow @-@ guns the distance covering the side of the main total carriage , and


In [None]:
# Generate some text
print(generate_text(model, vocab, tokeniser, start_text="Thank you professor for", num_words=100, temperature=1.0, device=device))

thank you professor for the new concept , my plan is afraid to new true main ballad shran . supporting daniel of her choice from similar <unk> , anderson longing that the thing won blood and after five years ago , as king traditional countries , there caused it to be able to obtain the project . a video meant on a bonus version called a . irish media , authors primarily by husband ' s legacy directors and covetousness of the poetical with california gielgud and nobility also published the <unk> in 2005 . many of the search alphabets
