![Banner](img/AI_Special_Program_Banner.jpg)

## Recurrent Neural Networks (RNN) - Exercise 1: Tiny Language Model for Limericks - Solution
---
Instructions are given in <span style="color:blue">blue</span> color.
(Unfortunately, Google Colab won't display the blue color.)

**Remark**: This solution notebook is for local execution (not on Google Colab)

In this exercise, we want to deal with a common requirement when working with deep learning models: Models should be reusable.

As you may have noticed, depending on the complexity of the problem and the amount of data, training a model can take quite some time (especially on a CPU). The ability to save and load trained models is therefore crucial for developing solutions efficiently.

Additionally, we want to make use of the possibility to run notebooks in [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb). Its main advantage over local execution is access to GPUs, which should speed up training significantly.

**Note**: For this exercise, you will need a Google account. If you don't already have one, please create one using [this website](https://accounts.google.com/signup/v2/webcreateaccount?hl=en&flowName=GlifWebSignIn&flowEntry=SignUp). If you are not comfortable providing your personal information, don't do so. You can delete your account and associated data after completing the exercise by following [these](https://support.google.com/accounts/answer/32046?hl=en) instructions.

As for the actual contents of this week's exercise, we will look at character-level language modeling with RNNs. In particular, we are trying to train a network capable of producing text that (hopefully) resembles the structure of so-called [Limericks](https://en.wikipedia.org/wiki/Limerick_(poetry)).

> Deep Thought, after millions of years,
<br>
> With the ultimate answer appears,
<br>
> &nbsp;&nbsp;&nbsp;&nbsp;Which is just 42.
<br>
> &nbsp;&nbsp;&nbsp;&nbsp;This is certainly true,
<br>
> Though it sounds a bit strange to the ears.

<div style="text-align: right"><a href="http://www.oedilf.com/db/Lim.php?Word=42">Source</a></div>

---

## Overview
- [1. Setting up Google Colab](#1.-Setting-up-Google-Colab)
- [2. Data Preparation](#2.-Data-Preparation)
- [3. RNN Model Creation](#3.-RNN-Model-Creation)

---

## 1. Setting up Google Colab
---

* <div style="color:blue">Upload this notebook to Google Colab.</div>

    1. Visit https://colab.research.google.com/notebooks/intro.ipynb.
    2. Log in to your Google account.
    3. Select **`File`** > **`Upload notebook`** from the menu bar.
    4. Upload this notebook.

* <div style="color:blue">Make sure to utilize GPUs.</div>

    1. Select **`Runtime`** > **`Change Runtime type`** from the menu bar.
    2. Choose *T4 GPU* for hardware acceleration (if it is already selected, just cancel the operation).

In [1]:
# Ensure torchtext is installed
try:
    import torchtext
except ImportError:
    print("Installing torchtext")
    !pip install torchtext

In [2]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
from torch.utils.data import Dataset, DataLoader, TensorDataset
import torchtext
from torchtext.data import get_tokenizer
from torch import nn
from tqdm.notebook import trange, tqdm

import datetime as dt

# force gpu computing, when gpu library is available
USE_GPU = True


In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
%%capture
torch.manual_seed(42)

In [5]:
torch.__version__

'2.1.2'

In [6]:
CUDA_AVAILABLE = torch.cuda.is_available()
print(f'CUDA_AVAILABLE = {CUDA_AVAILABLE}')

CUDA_AVAILABLE = True


The data we will work with is provided as simple continuous text. The limericks contained therein have been scraped from [this website](http://www.oedilf.com/db/Lim.php). You will find the text file `limericks.txt` in the `data` folder next to this notebook.

* <div style="color:blue">Upload the text file to Google Colab.</div>

    1. On the left-hand side of the Google Colab screen, there is a `Files` tab - click to open.
    2. You need to be connected to a runtime to see the session storage; if you are not connected yet, run at least one code cell.
    3. If connected, you can select to upload files to the session storage. Note that the session storage will be purged when the session is terminated.
    4. Upload `limericks.txt` to the session storage (this might take a while; to check whether the file has been uploaded, close and re-open the files tab).
    5. Run the following code cells to read the data.
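
Before reading the data, it can help to confirm that the upload actually arrived. The following is a minimal sketch using only the standard library; the helper name `check_data_file` is hypothetical, and the path you pass in depends on where you uploaded the file (in Colab it may differ from `data/limericks.txt`).

```python
from pathlib import Path


def check_data_file(path):
    """Return the file size in bytes, or raise if the file is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"{file_path} not found; upload it first.")
    return file_path.stat().st_size
```

Calling `check_data_file("data/limericks.txt")` should return a size of several megabytes once the upload has finished.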

In [7]:
# Reading data
limericks_file = "data/limericks.txt"
# Use a context manager so the file handle is closed after reading
with open(limericks_file, 'r') as f:
    limericks = f.read()

In [8]:
# Looking at some limericks
print(limericks[9990:10790])


from the mornings 'til mid-afternoons
in luzon you can feel the monsoons.
if that weren't enough,
here comes nastier stuff:
baguios, which are monstrous typhoons.

the old hay farmer's crib had backed up,
and resembled an overstuffed cup.
when he eyeballed the scene,
he said, "what does this mean?
there's a balefulness present, a-yup!"

with affamishment, food's what you need.
you would fight for a sesame seed.
a small crumb or an ort,
you'd take any old sort,
and you'd polish it off with great speed.

for bahaism, oneness is prime:
that of god, and of prophets in time,
of all peoples and races;
its primary place is
to end all our warfare and crime.

toss out all of those pepcid ac.
i am now achlorhydric. ah, me!
no more zantac or tums
in immeasurable sums ?
i'm a hulk of antacid debris.



## 2. Data Preparation
---

In [9]:
# Getting the unique characters used in the text
vocab = sorted(set(limericks))
print(vocab)
len(vocab)

['\t', '\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~']


70

In [10]:
# Creating unique index for each character in the text
char_to_index = {char:index for index, char in enumerate(vocab)}
char_to_index

{'\t': 0,
 '\n': 1,
 ' ': 2,
 '!': 3,
 '"': 4,
 '#': 5,
 '$': 6,
 '%': 7,
 '&': 8,
 "'": 9,
 '(': 10,
 ')': 11,
 '*': 12,
 '+': 13,
 ',': 14,
 '-': 15,
 '.': 16,
 '/': 17,
 '0': 18,
 '1': 19,
 '2': 20,
 '3': 21,
 '4': 22,
 '5': 23,
 '6': 24,
 '7': 25,
 '8': 26,
 '9': 27,
 ':': 28,
 ';': 29,
 '<': 30,
 '=': 31,
 '>': 32,
 '?': 33,
 '@': 34,
 '[': 35,
 '\\': 36,
 ']': 37,
 '^': 38,
 '_': 39,
 'a': 40,
 'b': 41,
 'c': 42,
 'd': 43,
 'e': 44,
 'f': 45,
 'g': 46,
 'h': 47,
 'i': 48,
 'j': 49,
 'k': 50,
 'l': 51,
 'm': 52,
 'n': 53,
 'o': 54,
 'p': 55,
 'q': 56,
 'r': 57,
 's': 58,
 't': 59,
 'u': 60,
 'v': 61,
 'w': 62,
 'x': 63,
 'y': 64,
 'z': 65,
 '{': 66,
 '|': 67,
 '}': 68,
 '~': 69}

In [11]:
# Checking index of a specific character
char_to_index['t']

59

In [12]:
# Mapping the index back to the characters
index_to_char = np.array(vocab)
index_to_char

array(['\t', '\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*',
       '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7',
       '8', '9', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^',
       '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
       'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
       'z', '{', '|', '}', '~'], dtype='<U1')

In [13]:
# Checking character for a given index
index_to_char[59]

't'

In [14]:
# Encoding the entire text
encoded_text = np.array([char_to_index[char] for char in limericks])
encoded_text

array([42, 40, 55, ...,  3,  4,  1])

* <div style="color:blue">How many characters are there in the entire text?</div>

In [15]:
# Your solution goes here:
len(encoded_text)

14933947

In [16]:
# Comparing raw text ...
limericks[:100]

"cap'n jack was washed over the side.\nhis crew searched but found not hair nor hide.\nno longer the he"

In [17]:
# ... with encoded text
encoded_text[:100]

array([42, 40, 55,  9, 53,  2, 49, 40, 42, 50,  2, 62, 40, 58,  2, 62, 40,
       58, 47, 44, 43,  2, 54, 61, 44, 57,  2, 59, 47, 44,  2, 58, 48, 43,
       44, 16,  1, 47, 48, 58,  2, 42, 57, 44, 62,  2, 58, 44, 40, 57, 42,
       47, 44, 43,  2, 41, 60, 59,  2, 45, 54, 60, 53, 43,  2, 53, 54, 59,
        2, 47, 40, 48, 57,  2, 53, 54, 57,  2, 47, 48, 43, 44, 16,  1, 53,
       54,  2, 51, 54, 53, 46, 44, 57,  2, 59, 47, 44,  2, 47, 44])
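
The encoding is lossless: mapping every index back to its character reproduces the original text. The sketch below illustrates this round trip on a short string with plain Python lists; the names `encode`, `decode`, and the `demo_*` variables are hypothetical and independent of the notebook's variables.

```python
def encode(text, char_to_index):
    """Map every character to its integer index."""
    return [char_to_index[char] for char in text]


def decode(indices, index_to_char):
    """Map every integer index back to its character."""
    return "".join(index_to_char[index] for index in indices)


demo_text = "his crew searched"
demo_vocab = sorted(set(demo_text))  # index -> character
demo_char_to_index = {char: index for index, char in enumerate(demo_vocab)}

demo_encoded = encode(demo_text, demo_char_to_index)
demo_decoded = decode(demo_encoded, demo_vocab)
print(demo_decoded == demo_text)
```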

Next, we have to split the data into fixed-length sequences, which are later grouped into batches. For this, it is important to choose a sequence length (number of characters per sequence) that

* is long enough to capture the general structure of a limerick, but
* is not too long, so that past information does not simply become noise.

* <div style="color:blue">Take a look at the limericks in <code>limericks.txt</code> and decide on a suitable <b>sequence length</b>.</div>
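
One way to ground this decision is to measure how long lines and whole limericks typically are. The sketch below assumes that limericks are separated by blank lines, as in the excerpt printed earlier; the helper `length_stats` is hypothetical, and the sample consists of two limericks from that excerpt.

```python
from statistics import mean


def length_stats(text):
    """Mean character length per line and per limerick (blank-line separated)."""
    lines = [line for line in text.split("\n") if line.strip()]
    poems = [poem for poem in text.split("\n\n") if poem.strip()]
    return {
        "mean_line_length": mean(len(line) for line in lines),
        "mean_limerick_length": mean(len(poem) for poem in poems),
    }


# Two limericks taken from the excerpt printed earlier
sample = (
    "with affamishment, food's what you need.\n"
    "you would fight for a sesame seed.\n"
    "a small crumb or an ort,\n"
    "you'd take any old sort,\n"
    "and you'd polish it off with great speed.\n"
    "\n"
    "for bahaism, oneness is prime:\n"
    "that of god, and of prophets in time,\n"
    "of all peoples and races;\n"
    "its primary place is\n"
    "to end all our warfare and crime."
)
print(length_stats(sample))
```

Running `length_stats(limericks)` on the full text gives a reference point: compare the chosen sequence length against the mean line length and the mean limerick length.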

In [18]:
sequence_length = 60 # <- your solution goes here

In [19]:
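class CharSequenceDataset(Dataset):
    """Fixed-length character windows, each paired with the character that follows."""

    def __init__(self, text_encoded, seq_length, offset):
        sequences = []
        target = []
        # Slide a window of seq_length characters over the text in steps of offset
        for i in range(0, len(text_encoded) - seq_length - 1, offset):
            seq = text_encoded[i: i + seq_length]
            sequences.append(seq)
            # The target is the single character right after the window
            target.append(text_encoded[i + seq_length])
        # Stack into one NumPy array first; torch.tensor on a list of arrays is slow
        self.sequences = torch.tensor(np.array(sequences))
        self.target = torch.tensor(np.array(target))

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.target[idx]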
    

offset = sequence_length 

ds = CharSequenceDataset(encoded_text, sequence_length, offset)

## inspection:
for seq, target in ds:
    print(seq.numpy(), ' -> ', target.numpy())
    print(repr(''.join(index_to_char[seq])), 
          ' -> ', repr(''.join(index_to_char[target])))
    break

[42 40 55  9 53  2 49 40 42 50  2 62 40 58  2 62 40 58 47 44 43  2 54 61
 44 57  2 59 47 44  2 58 48 43 44 16  1 47 48 58  2 42 57 44 62  2 58 44
 40 57 42 47 44 43  2 41 60 59  2 45]  ->  54
"cap'n jack was washed over the side.\nhis crew searched but f"  ->  'o'


In [20]:
# final preparation step: create a data loader
BATCH_SIZE = 128

trainloader = DataLoader(ds, shuffle=True, batch_size=BATCH_SIZE)
len(ds)

248899
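
The dataset size follows directly from the text length, the sequence length, and the offset. As a quick check, the sketch below repeats the `range` expression from `CharSequenceDataset` using the numbers printed above; the helper name `num_windows` is hypothetical.

```python
import math


def num_windows(text_length, seq_length, offset):
    """Count the windows produced by the loop in CharSequenceDataset."""
    return len(range(0, text_length - seq_length - 1, offset))


n_samples = num_windows(14933947, 60, 60)
n_batches = math.ceil(n_samples / 128)  # the last batch may be smaller
print(n_samples, n_batches)  # 248899 1945
```

With `offset` equal to `sequence_length`, the windows do not overlap. A smaller offset would produce more (overlapping) training samples at the cost of longer epochs.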

## 3. RNN Model Creation
---

In [21]:
# Vocabulary length
vocab_size = len(vocab)
# Embedding dimensions
embed_dim = 64
# Number of RNN units (neurons)
rnn_units = 512

In [22]:
class Network(nn.Module):
    def __init__(self, vocab_size, embedding_dim, rnn_units, recurrent_type='LSTM'):
        super(Network, self).__init__()
        self.recurrent_type = recurrent_type
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if recurrent_type == 'RNN':
            self.rnn = nn.RNN(embedding_dim, rnn_units, batch_first=True)
        elif recurrent_type == 'LSTM':
            self.rnn = nn.LSTM(embedding_dim, rnn_units, batch_first=True)
        elif recurrent_type == 'GRU':
            self.rnn = nn.GRU(embedding_dim, rnn_units, batch_first=True)
        else:
            # Fail early instead of leaving self.rnn undefined
            raise ValueError(f"Unknown recurrent_type: {recurrent_type!r}")
        self.linear = nn.Linear(rnn_units, vocab_size)
    
    def forward(self, x):
        embedded = self.embedding(x)
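        # nn.LSTM returns (output, (hidden, cell)); nn.RNN and nn.GRU return (output, hidden)
        if self.recurrent_type == "LSTM":
            output, (hidden, cell_hidden) = self.rnn(embedded)
        else:
            output, hidden = self.rnn(embedded)
        # hidden[-1] is the final hidden state of the last layer: (batch, rnn_units)
        output = self.linear(hidden[-1])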
        return output
    
# Instantiate the PyTorch model (and remember the type for saving the model ...)
rnn_type = 'LSTM'
model = Network(vocab_size, embed_dim, rnn_units, recurrent_type=rnn_type)
model

Network(
  (embedding): Embedding(70, 64)
  (rnn): LSTM(64, 512, batch_first=True)
  (linear): Linear(in_features=512, out_features=70, bias=True)
)
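
To see why the forward pass feeds `hidden[-1]` into the linear layer, it helps to trace the tensor shapes. The sketch below rebuilds the three layers on their own with the dimensions used above and pushes a random batch through them. The batch size of 4 and the random input are arbitrary choices for illustration.

```python
import torch
from torch import nn

batch_size, seq_len = 4, 60
vocab_size, embed_dim, rnn_units = 70, 64, 512

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, rnn_units, batch_first=True)
linear = nn.Linear(rnn_units, vocab_size)

# Random character indices standing in for one batch of sequences
x = torch.randint(0, vocab_size, (batch_size, seq_len))

embedded = embedding(x)                  # (batch, seq_len, embed_dim)
output, (hidden, cell) = lstm(embedded)  # output: (batch, seq_len, rnn_units)
logits = linear(hidden[-1])              # one score per character: (batch, vocab_size)

print(embedded.shape, output.shape, hidden.shape, logits.shape)
```

For a single-layer, unidirectional LSTM, `hidden[-1]` holds the same values as the last time step of `output`, so the network predicts the next character from the state reached after reading the whole sequence.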

In [23]:
# Define the loss function
criterion = nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters())

Set up the training loop

In [24]:
def train(model, dataloader, optimizer, criterion):
    # Note: relies on the global `device`, which must be defined before training starts
    model.train()
    
    running_loss = 0.0
    
    # Iterate over the training dataset
    for inputs, labels in dataloader:
        inputs = inputs.to(device)
        labels = labels.to(device).long()
        # Zero the gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(inputs)
        # Compute the loss
        loss = criterion(outputs, labels)
        # Backward pass
        loss.backward()
        # Update the weights
        optimizer.step()
        
        running_loss += loss.item()

    # Calculate the average loss for the epoch
    epoch_loss = running_loss / len(dataloader)
    return epoch_loss

def train_for_epochs(epochs, model, trainloader, optimizer, criterion):
    history = {
        "loss": []
    }
    loop = trange(epochs)
    # Training loop
    for epoch in loop:

        epoch_loss_train = train(model, trainloader, optimizer, criterion)   

        history["loss"].append(epoch_loss_train)

        loop.set_description(f"Train Loss: {epoch_loss_train:.4f}")
    return history

# record the total number of epochs for training so far (may be used for naming models ...)
tot_epochs = 0

* <div style="color:blue">Train the model for <b>2</b> epochs. Double-check that you enabled a GPU runtime (unless you don't mind waiting). However, even if you are using GPUs, it may take a while to complete the training.</div>

In [25]:
# Your solution goes here:
NUM_EPOCHS = 2 

# Set the device for training
device = torch.device("cuda" if torch.cuda.is_available() and USE_GPU else "cpu")
print(f'Using {device} for training the network')

# Move the model weights to the desired device
model.to(device)

# Define the loss function
criterion = nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters())

# run training
start_train = dt.datetime.now()
print(f'Charlevel training started: {start_train}')
history = train_for_epochs(NUM_EPOCHS, model, trainloader, optimizer, criterion)
finish_train = dt.datetime.now()
print(f'Charlevel training finished {finish_train} (duration: {finish_train - start_train})')

tot_epochs += NUM_EPOCHS

Using cuda for training the network
Charlevel training started: 2024-01-28 23:36:10.053410


  0%|          | 0/2 [00:00<?, ?it/s]

Charlevel training finished 2024-01-28 23:37:15.730771 (duration: 0:01:05.677361)


* <div style="color:blue">Save the state dict of your model to your session storage.</div>

**Hint**: Have a look at the [documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html) to save and load models.

In [26]:
# Your solution goes here:
import os
os.makedirs("models", exist_ok=True)  # make sure the target folder exists
torch.save(model.state_dict(), f"models/model_{rnn_type}_{tot_epochs:02}.pt")

* <div style="color:blue">Continue training for another <b>18</b> epochs. In total you should have trained the model for <b>20</b> epochs.</div>

In [27]:
# Your solution goes here:
NUM_EPOCHS = 18

# Note: creating a new optimizer resets Adam's internal state (moment estimates)
optimizer = torch.optim.Adam(model.parameters())

# run training
start_train = dt.datetime.now()
print(f'Continue charlevel training started: {start_train}')
history = train_for_epochs(NUM_EPOCHS, model, trainloader, optimizer, criterion)
finish_train = dt.datetime.now()
print(f'Continue charlevel training finished {finish_train} (duration: {finish_train - start_train})')

tot_epochs += NUM_EPOCHS

Continue charlevel training started: 2024-01-28 23:37:15.754693


  0%|          | 0/18 [00:00<?, ?it/s]

Continue charlevel training finished 2024-01-28 23:47:26.008792 (duration: 0:10:10.254099)


* <div style="color:blue">Save the state dict of your model to your session storage, but use a different file name this time.</div>

In [28]:
# Your solution goes here:
torch.save(model.state_dict(), f"models/model_{rnn_type}_{tot_epochs:02}.pt")

* <div style="color:blue">Download both saved models from your session storage to your local hard drive (place them in the <code>models</code> folder).</div>

**CAUTION**: If you accidentally close the current session without downloading the models first, the saved files will be purged and your training progress will be lost.
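
Once the files are on your local drive, a saved state dict can be restored into a freshly created model of the same architecture. The following is a minimal sketch of that round trip. It uses a small `nn.Linear` as a stand-in for the trained `Network` and a temporary folder instead of `models`; both are assumptions made to keep the example self-contained.

```python
import os
import tempfile

import torch
from torch import nn

trained = nn.Linear(8, 3)  # stand-in for the trained Network

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pt")
    torch.save(trained.state_dict(), path)

    # Re-create the same architecture, then load the saved weights into it
    restored = nn.Linear(8, 3)
    state = torch.load(path, map_location="cpu")  # works without a GPU
    restored.load_state_dict(state)

restored.eval()  # switch to inference mode before generating text

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```

For the limerick model, the restored network would need to be created with the same `vocab_size`, `embed_dim`, `rnn_units`, and `recurrent_type` that were used during training; otherwise `load_state_dict` reports mismatching keys or shapes.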

* <div style="color:blue">Continue with <a href="3.3.b_RNN_Ex_2_Sol.ipynb">exercise 2</a>. (Either locally or in Colab is fine.)</div>