# Training a text generator with Char-RNN 

In this notebook we will look at how to train a Recurrent Neural Network (RNN) to model sequences of characters (Char-RNN). There is a very similar notebook to this one that generates text based on words instead of characters. Run both of them to compare the different approaches. This code is heavily modified from the repo: https://github.com/nikhilbarhate99/Char-RNN-PyTorch

To do this we will use the library [PyTorch](https://pytorch.org/), a library for building and training neural networks in Python. This is the first time we are using this library and looking at how to build and train neural networks, but it won't be the last! We will be spending a lot of time over the next couple of terms looking at code that looks a lot like this. It may look quite unfamiliar at first, but over time you will get used to working with and altering this kind of code. 

First let's do some imports:

In [1]:
# Install PyTorch if you haven't already:
!pip install torch



In [2]:
import torch
import random

import numpy as np
import torch.nn as nn
import torch.nn.functional as F

from torch.distributions import Categorical

In PyTorch, there are different implementations for storing and processing data on different kinds of computer hardware. By default, all computers will work by training and running neural networks on the Central Processing Unit (CPU), which we can specify with `'cpu'`. 

If you have an NVIDIA Graphics Processing Unit (GPU) (and you have correctly installed CUDA and a matching version of PyTorch), then you can use the device `'cuda'`, which will make training your neural networks **much faster**. Most of you won't have powerful NVIDIA GPUs in your laptops, however. Don't worry if you don't: the notebooks we are using in this class are designed to work on laptop CPUs. 

If you have an M1 or M2 processor on a Mac then you can use the device `'mps'`, which will run on Apple's accelerated Metal Performance Shaders (MPS) for potentially faster and more powerful training (though sometimes running on CPU can be faster). 

In [4]:
device = 'cpu'
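
If you would rather not set the device by hand, the choice described above can be made automatically. The following is only a minimal sketch: the helper name `pick_device` and the variable `device_name` are made up for illustration, and it assumes a version of PyTorch recent enough to report MPS availability (older versions simply fall through to the CPU).

```python
import torch

def pick_device():
    """Return the name of the best available device (hypothetical helper)."""
    # NVIDIA GPU with a working CUDA installation
    if torch.cuda.is_available():
        return 'cuda'
    # Apple silicon: torch.backends.mps only exists in newer PyTorch versions,
    # so we look it up defensively
    mps_backend = getattr(torch.backends, 'mps', None)
    if mps_backend is not None and mps_backend.is_available():
        return 'mps'
    # Fallback that works on every computer
    return 'cpu'

device_name = pick_device()
print(f"Training would run on: {device_name}")
```

You could then write `device = pick_device()` in place of the hard-coded value in the cell above.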

#### Set hyperparameters

This is where we specify our *hyperparameters*: the parameters that determine the size of our neural network (`num_layers`, `hidden_size` and the `vocab_size`, which is set later from the data), how long we train the network for (`num_steps` and `step_len`) and how aggressively we train the network (the learning rate `lr`).  

`load_chk` is a boolean that determines whether we start training from the weights of an already trained model or start from scratch. If this is `True`, then the path to a model file should be specified in `load_path` (commented out in the cell below). If you change the dataset (and have a different vocabulary size) or make changes to any other parts of the model between saving and loading a model then this will not work. 

In [6]:
hidden_size = 512   # size of hidden state
batch_size = 100    # Size of the batch we use in training
step_len = 200      # number of consecutive training iterations in each step
num_layers = 3      # number of layers in LSTM layer stack
lr = 0.002          # learning rate
num_steps = 100     # max number of training steps
gen_seq_len = 50    # length of generated sequence
load_chk = False    # load in pre-trained checkpoint for training
save_path = "char_rnn_model.pt"
# load_path = "char_rnn_model.pt"

#### Load our data

Now let's load in our data. We are going to use a dataset of Pokemon names to start off with. We are going to use the [Python set data structure](https://www.w3schools.com/python/python_sets.asp) to find all of the unique characters in our text. The number of unique characters determines our `vocab_size`, and the total number of characters is our `data_size`:

In [7]:
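data_path = "../data/pokemon-names.txt"
# Read the whole file as one string (UTF-8, so accented characters load correctly)
with open(data_path, 'r', encoding='utf-8') as f:
    corpus = f.read()
chars = sorted(list(set(corpus)))
data_size, vocab_size = len(corpus), len(chars)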

Here we will create two dictionaries: one will give us a mapping from characters to their indices, and the other a mapping from indices back to the characters they represent. 

Our variable `data` is going to be a copy of our text `corpus`, stored as a list of the numerical index values of each character:

In [8]:
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

data = list(corpus)
for i, ch in enumerate(data):
    data[i] = char_to_ix[ch]
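
To see what these mappings do, here is a small self-contained sketch that uses a made-up two-name corpus instead of the real file. All of the `toy_` names are illustrative only and are not used elsewhere in the notebook.

```python
# A tiny stand-in corpus (two names separated by newlines)
toy_corpus = "pikachu\nabra\n"

# Unique characters, sorted so the mapping is the same every run
toy_chars = sorted(set(toy_corpus))
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}
toy_ix_to_char = {i: ch for i, ch in enumerate(toy_chars)}

# Encode the text as a list of integer indices...
encoded = [toy_char_to_ix[ch] for ch in toy_corpus]
# ...and decode it back to check that nothing was lost
decoded = ''.join(toy_ix_to_char[i] for i in encoded)

print(f"vocab size: {len(toy_chars)}, data size: {len(toy_corpus)}")
print(encoded)
print(decoded == toy_corpus)
```

Note that the newline character is part of the vocabulary too, which is how the model can learn where one name ends and the next begins.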

#### Defining the network 

This is where we define our neural network. We define a neural network as a `class`. A class can have **functions** and **variables** that it owns. Classes are a fundamental part of [object-orientated programming in Python](https://www.w3schools.com/python/python_classes.asp) (and many other programming languages). In this case we are building our class `RNN` by [inheriting from the base class](https://www.w3schools.com/python/python_inheritance.asp) for a neural network in PyTorch called `nn.Module`. 

We need to define two functions for a PyTorch neural network class. The `__init__` function gets called when we create an instance of the class. Here we create and set the variables that our network needs (such as all of the layers and other things we may need to keep track of). In this function the first thing we need to call is the `super` function, which will call the `__init__` function of the base class we are inheriting from. 

The other function we need to define for a PyTorch neural network is the `forward` function. This defines what happens when we do a forward pass with our network (taking data as an input and giving something else as an output). Because this is a recurrent neural network, our network needs to take as input both the data and the hidden state (from the previous iteration) of the model. This function also outputs the hidden state so that we can pass it back into the function at a later iteration:


In [9]:
class RNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_size, num_layers):
        super(RNN, self).__init__()
        self.embedding = nn.Embedding(input_size, input_size)
        self.rnn = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
        self.decoder = nn.Linear(hidden_size, output_size)
    
    def forward(self, input_batch, hidden_state):
        embedding = self.embedding(input_batch)
        output, hidden_state = self.rnn(embedding, hidden_state)
        output = self.decoder(output)
        return output, (hidden_state[0].detach(), hidden_state[1].detach())
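
A quick way to get a feel for a network like this is to pass a dummy input through it and look at the shapes that come out. The sketch below repeats the class under the hypothetical name `TinyRNN` so that it can run on its own, and uses small made-up sizes. It assumes the input is laid out as `(sequence length, batch size)`, which is what `nn.LSTM` expects by default for batched input.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    # Same structure as the RNN class above, repeated so this sketch is self-contained
    def __init__(self, input_size, output_size, hidden_size, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(input_size, input_size)
        self.rnn = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers)
        self.decoder = nn.Linear(hidden_size, output_size)

    def forward(self, input_batch, hidden_state):
        embedding = self.embedding(input_batch)
        output, hidden_state = self.rnn(embedding, hidden_state)
        output = self.decoder(output)
        return output, (hidden_state[0].detach(), hidden_state[1].detach())

# Made-up sizes, much smaller than the real hyperparameters
toy_vocab, toy_hidden, toy_layers, toy_batch = 10, 16, 2, 4
toy_model = TinyRNN(toy_vocab, toy_vocab, toy_hidden, toy_layers)

# One time step for a batch of 4 random character indices: shape (1, 4)
toy_input = torch.randint(0, toy_vocab, (1, toy_batch))

# Passing None as the hidden state starts from an all-zero state
toy_output, (h, c) = toy_model(toy_input, None)
print(toy_output.shape)  # one score per vocabulary entry, per batch item
print(h.shape, c.shape)  # one hidden and one cell state per layer, per batch item

# The returned state can be fed straight back in for the next time step
toy_output_2, _ = toy_model(toy_input, (h, c))
```

The `.detach()` calls mean the returned state carries values forward but no gradient history, so each training iteration only backpropagates through its own time step.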

#### Setting up network and optimiser

Here we will create an instance of our network, `rnn`. We also need to define our loss function `loss_fn` and our `optimizer`, which is used to make changes to the neural network weights in training. We have to make our data variable a PyTorch `tensor`. This is the data type that we have to use with PyTorch so that our neural networks can read and process the data correctly. [PyTorch tensors](https://pytorch.org/docs/stable/tensors.html) have been designed to work in almost exactly the same way as [numpy arrays](https://numpy.org/doc/stable/reference/generated/numpy.array.html).
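
As a small illustration of working with tensors, the sketch below turns a made-up list of indices into a tensor, adds an extra dimension with `torch.unsqueeze`, and then picks out several rows at once using another tensor as the index. The `toy_` names are illustrative only.

```python
import torch

# A short made-up list of character indices
toy_data = torch.tensor([3, 1, 4, 1, 5])
print(toy_data.shape)

# Add a second dimension so each character sits in its own row
toy_column = torch.unsqueeze(toy_data, dim=1)
print(toy_column.shape)

# Index with a tensor of positions to pull out several rows in one go
positions = torch.tensor([0, 2])
picked = toy_column[positions]
print(picked.squeeze(1).tolist())
```

Indexing with a tensor of positions is the same trick the training loop uses to pull a whole batch of characters out of the dataset in one operation.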

In [10]:
# Create list of indexes that can be valid starting points for training
index_list = list(range(0, len(data) - step_len - 1))

# Convert data to torch tensor
data = torch.tensor(data).to(device)
data = torch.unsqueeze(data, dim=1)

# Create RNN class
rnn = RNN(vocab_size, vocab_size, hidden_size, num_layers).to(device)

# Define loss function and optimiser
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=lr)

# Load in pretrained model if specified
if load_chk:
    checkpoint = torch.load(load_path, map_location=device)
    rnn.load_state_dict(checkpoint['state_dict'])

#### Sample data randomly

This function will allow us to do random sampling of the dataset in training. When we train neural networks we almost always train on **random batches of data**. In training we process lots of data samples at the same time (with lots of copies of the neural network), and then average our loss over all of the data samples and update the weights accordingly. This helps with the *regularisation* of the network, and makes training much more stable. 

The number of data samples we have in each mini-batch is defined by the `batch_size`. Generally speaking, the larger the batch size you can use the better (though there are exceptions to this).

In [11]:
def get_training_batch_indicies(index_list, batch_size):
    # Get a batch of indices to sample our data from
    input_batch_indicies = torch.tensor(np.array(random.sample(index_list, batch_size)))
    # Offset indices for target batch by one
    target_batch_indicies = input_batch_indicies + 1
    return input_batch_indicies, target_batch_indicies
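
The reason the target indices are offset by one is that the network is trained to predict the *next* character. The following sketch shows the same idea on a made-up word using plain Python lists (no tensors); every name in it is illustrative only.

```python
import random

toy_text = "bulbasaur"
toy_batch_size = 3

# Every position except the last can be an input, because each input
# needs a following character to act as its target
valid_starts = list(range(len(toy_text) - 1))

# Sample positions without replacement, then shift by one for the targets
input_positions = random.sample(valid_starts, toy_batch_size)
target_positions = [p + 1 for p in input_positions]

for p_in, p_out in zip(input_positions, target_positions):
    print(f"input '{toy_text[p_in]}' -> target '{toy_text[p_out]}'")
```

Because `random.sample` draws without replacement, the batch size can never be larger than the number of valid starting positions.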

#### Training the network

Here we have all the code we need for our **training loop**. We loop through our number of training steps (defined in `num_steps`). In each step we sample a random batch of starting positions in the dataset and then move through the text from those positions, one character at a time, for `step_len` training iterations. In some code bases, you will see code that cycles through *epochs* (complete passes over the dataset). We aren't doing that here, as the training time can vary drastically based on how much data is in your dataset. 

After each step the weights of the model will be saved to the file `char_rnn_model.pt`. Sometimes people will save different versions of the file after a set number of steps (e.g. `char_rnn_model_50.pt` or `char_rnn_model_100.pt`), but this will fill up your drive very quickly! For now we will just keep overwriting the same file after each step. 
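
Saving and loading follow the same pattern whatever the model is. Here is a minimal sketch of a round trip, using a tiny stand-in model and a temporary file rather than the real network and `save_path`. The names `toy_model`, `toy_save_dict` and `toy_path` are made up for this example.

```python
import os
import tempfile

import torch
import torch.nn as nn

# A tiny stand-in model and made-up character mappings
toy_model = nn.Linear(3, 2)
toy_save_dict = {
    'state_dict': toy_model.state_dict(),
    'ix_to_char': {0: 'a', 1: 'b'},
    'char_to_ix': {'a': 0, 'b': 1},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_model.pt')
    # Write the dictionary to disk, then read it straight back
    torch.save(toy_save_dict, toy_path)
    restored = torch.load(toy_path, map_location='cpu')

# A freshly created model has different random weights until we load the saved ones
fresh_model = nn.Linear(3, 2)
fresh_model.load_state_dict(restored['state_dict'])
print(torch.equal(fresh_model.weight, toy_model.weight))
```

Loading only works when the fresh model has exactly the same layer shapes as the saved one, which is why changing the dataset or the hyperparameters between saving and loading causes an error.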

If you are happy with the outputs or need to stop the code running for whatever reason, you can just interrupt the cell and your progress up to the last completed step will be saved. This can be loaded into the notebook `text-generation-with-char-rnn-test.ipynb` to be used for code that just performs generation. 

The most important parts of any training code are the **forward pass** (where we process our data with our neural network) and calculating the **loss function**, where we evaluate how well our model has performed against the real value of the data. Then we have to **update the weights of the neural network**. This is done by calling `loss.backward()` followed by `optimizer.step()`.
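
Stripped of everything specific to text, those three parts look like the sketch below. It is not the Char-RNN itself: a single linear layer stands in for the network and the data is random, purely so the pattern is easy to see. All names and sizes here are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the random data repeatable

# Stand-in model, loss function and optimiser
toy_model = nn.Linear(4, 3)
toy_loss_fn = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)

# One fixed batch of 8 random inputs, each with a class label from 0 to 2
toy_inputs = torch.randn(8, 4)
toy_targets = torch.randint(0, 3, (8,))

losses = []
for _ in range(50):
    # Forward pass
    toy_logits = toy_model(toy_inputs)
    # Compute loss
    toy_loss = toy_loss_fn(toy_logits, toy_targets)
    # Update weights: compute gradients, apply them, then clear them
    toy_loss.backward()
    toy_optimizer.step()
    toy_optimizer.zero_grad()
    losses.append(toy_loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The call to `zero_grad()` matters because PyTorch adds new gradients to the old ones by default; without it, each update would mix in gradients from earlier iterations.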

After each step we generate a new sequence with the network so we can see how the network is improving during training. This is done without gradient tracking using `with torch.no_grad():` (gradient tracking is what is used for training and calculating how much to adjust the weights of our network by at each step). 
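
Generation works by turning the network's raw output scores into probabilities and then drawing one character at random according to those probabilities. The sketch below shows that single sampling step with three made-up scores in place of a real network output.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

# Made-up raw scores for a three-character vocabulary
toy_scores = torch.tensor([2.0, 0.5, -1.0])

with torch.no_grad():
    # Softmax turns the scores into probabilities that sum to 1
    toy_probs = F.softmax(toy_scores, dim=0)
    # Draw one index at random, weighted by those probabilities
    toy_index = Categorical(toy_probs).sample()

print(toy_probs)
print(toy_index.item())
```

Sampling, rather than always taking the most likely character, is what lets the network produce a different name each time.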

This will probably all look quite complicated and hard to understand first time around. **That is ok!** Over time as you see and work with more and more code that looks like this you will start getting used to it and feel more confident in adapting, changing and writing this kind of code yourself.


In [12]:
# Iterate through the number of steps defined earlier (inclusive of num_steps)
for step in range(1, num_steps + 1):
    
    running_loss = 0
    hidden_state = None
    rnn.zero_grad()
    train_batch_indicies, target_batch_indicies = get_training_batch_indicies(index_list, batch_size)

    
    # Cycle through for a set number of consecutive iterations in the data
    for i in range(step_len):
        # Extract data batches from indices.
        # The input is shaped (seq_len=1, batch_size) so that the LSTM treats it as a batch
        input_batch = data[train_batch_indicies].squeeze(1).unsqueeze(0)
        target_batch = data[target_batch_indicies].squeeze(1)
        # Forward pass
        # The following code is the same as calling rnn.forward(input_batch, hidden_state)
        output, hidden_state = rnn(input_batch, hidden_state)
        
        # Compute loss on scores of shape (batch_size, vocab_size)
        loss = loss_fn(output.squeeze(0), target_batch)
        running_loss += loss.item() / step_len
        
        # Update weights of neural network
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        
        # Increment batch coordinates by 1
        train_batch_indicies = train_batch_indicies + 1
        target_batch_indicies = target_batch_indicies + 1
        

        
    # Print loss
    print('\n'+'-'*75)
    print(f"\nStep: {step} Loss: {running_loss}")

    # Create a dictionary for saving the model and data mappings
    save_dict = {}
    # Add the model weight parameters as a dictionary to our save_dict
    save_dict['state_dict'] = rnn.state_dict()
    # Add the ix_to_char and char_to_ix dicts to our save_dict
    save_dict['ix_to_char'] = ix_to_char
    save_dict['char_to_ix'] = char_to_ix
    # Save the dictionary to a file
    torch.save(save_dict, save_path)

    # Now let's generate a random text sample to print out,
    # we will do this without gradient tracking as we are not training
    with torch.no_grad():
        
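        # Take a random index and reset the hidden state of the model.
        # Clone the slice so that writing to input_batch below does not overwrite our training data
        rand_index = np.random.randint(data_size-1)
        input_batch = data[rand_index : rand_index+1].clone()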
        hidden_state = None
        
        # Iterate over our sequence length
        for i in range(gen_seq_len):
            # Forward pass
            output, hidden_state = rnn(input_batch, hidden_state)
            
            # Construct categorical distribution and sample a character
            output = F.softmax(torch.squeeze(output), dim=0)
            dist = Categorical(output)
            index = dist.sample()
            
            # Print the sampled character
            print(ix_to_char[index.item()], end='')
            
            # Next input is current output
            input_batch[0][0] = index.item()


---------------------------------------------------------------------------

Step: 1 Loss: 2.831251726150512
st
d
moruboniuthéron
suseetaney
mes
dada
w
mitote

---------------------------------------------------------------------------

Step: 2 Loss: 2.7199178552627568
schof
s
w
kondan
ccur
ecitp
fgelk
giocan
bidchrchi
---------------------------------------------------------------------------

Step: 3 Loss: 2.660427708625794
leeandacare
lircimotara
ddoq
gesngeeweuigon
b
kang
---------------------------------------------------------------------------

Step: 4 Loss: 2.6366639709472652
gneblr
potre
p
peeerga
lip
ch
wgu
wegoo
maglesotni
---------------------------------------------------------------------------

Step: 5 Loss: 2.6203590083122266
pne
tiuchuckin
sw
ck
mpicam
n2ee
bo
e
hare
clueman
---------------------------------------------------------------------------

Step: 6 Loss: 2.6148055410385123
lthilurapetonarx
a
maubescspikuttyato
wsma
f
pifto
-----------------------------------

KeyboardInterrupt: 

## Tasks 

First do these tasks in order before moving onto the bonus tasks:

**Task 1:** Run the code cells and train a model on the Pokemon names. How long does it take to generate names that look like plausible Pokemon names? If the model training has finished and you are not happy with the results you can set `load_chk` to `True` in [the cell that defines the hyperparameters](#set-hyperparameters) and load in a model from the variable `load_path`.

**Task 2:** Compare the code in this notebook to the Word-RNN training (`training-a-char-rnn-text-generator.ipynb`). How does the code differ? What elements are the same between the notebooks?

**Task 3:** Load your trained model into the notebook `char-rnn-testing.ipynb` to have a go at more generation with the model you have trained. 

**Task 4:** Can you adapt this code to load in another dataset? Have a look at the code in `word-rnn-training.ipynb` or `text-generation-with-markov-chains.ipynb` from Week 5 to use functions to load in datasets in different formats. Can you make a direct comparison between a Char-RNN model and a Word-RNN model trained on the same dataset? 

### Bonus tasks

These bonus tasks can be done in any order.

After each training run you may want to rename the checkpoint files from each training run so you can keep them for comparison later.

**Task A:** Try changing some of the other hyperparameters in [the cell that defines the hyperparameters](#set-hyperparameters), such as `hidden_size`, `batch_size`, `num_layers` or `lr`. Restart the kernel and run the training again. 

**Task B:** Try changing the optimiser used in [the cell where the network and optimiser are instantiated](#setting-up-network-and-optimiser) to one of [the other optimisers available in PyTorch](https://pytorch.org/docs/stable/optim.html), such as stochastic gradient descent (SGD) or Adagrad. Restart the kernel and run the training again. 

**Task C:** Try changing the type of layer used [in the RNN network](#defining-the-network) from LSTM to a [vanilla RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html#torch.nn.RNN) or a [Gated Recurrent Unit](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html#torch.nn.GRU) (GRU). Restart the kernel and run the training again. 

**Task D:** Can you use stop words, or apply some other kind of cleaning or normalisation to the dataset, to improve or change the quality of the generated results? 
