# Char RNN - Prototype

The goal of this project is to solve the text generation problem detailed in [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). The objective is to read a large text file one character at a time, and then generate new text, also one character at a time, in the same style.

We create a notebook to prototype our model.

## Load and prepare data

We first download [The Count of Monte Cristo](https://www.gutenberg.org/ebooks/1184) from Project Gutenberg.

In [1]:
from pathlib import Path
import requests

PATH_DATA = Path("data")
FILENAME_DATA = Path("monte_cristo.txt")
URL_DATA = "https://www.gutenberg.org/files/1184/1184-0.txt"

# Download dataset (skipped if the file is already present)
PATH_DATA.mkdir(exist_ok=True)
PATH_DATAFILE = PATH_DATA / FILENAME_DATA
if not PATH_DATAFILE.exists():
    r = requests.get(URL_DATA)
    r.raise_for_status()  # fail loudly instead of saving an error page
    PATH_DATAFILE.write_bytes(r.content)

We read the entire text and keep only the lines of interest, dropping the start and end of the file (titles, bibliography…).

In [2]:
with open(PATH_DATAFILE, 'r', encoding="utf8") as f:
    lines = f.readlines()
    # Remove start and end of file (not interesting data)
    lines = lines[319:60662]
    chars = ''.join(lines)
            
# Test code

print("Sample text:\n")
print(chars[:276])

Sample text:

On the 24th of February, 1815, the look-out at Notre-Dame de la Garde
signalled the three-master, the Pharaon from Smyrna, Trieste, and
Naples.

As usual, a pilot put off immediately, and rounding the Château d’If,
got on board the vessel between Cape Morgiou and Rion island.


In [3]:
# Test code

print("Total number of chars:", len(chars))
print("Unique chars:", len(set(chars)))

Total number of chars: 2617219
Unique chars: 99


We then create a dictionary for mapping between chars and numbers.

In [4]:
# Adapted from https://github.com/pytorch/examples/blob/master/word_language_model/data.py

class Dictionary(object):
    def __init__(self):
        self.char2idx = {}
        self.idx2char = []

    def add_char(self, char):
        if char not in self.char2idx:
            self.idx2char.append(char)
            self.char2idx[char] = len(self.idx2char) - 1
        return self.char2idx[char]

    def __len__(self):
        return len(self.idx2char)
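As a quick illustration of how such a mapping behaves, here is a minimal round trip on a toy string. The names `CharVocab`, `encode` and `decode` are hypothetical helpers introduced for this sketch only; `CharVocab` simply mirrors the two attributes of the `Dictionary` class so that the example runs on its own.

```python
class CharVocab:
    """Minimal stand-in for the Dictionary class above (same two attributes)."""

    def __init__(self):
        self.char2idx = {}
        self.idx2char = []

    def add_char(self, char):
        # Assign the next free index the first time a character is seen
        if char not in self.char2idx:
            self.idx2char.append(char)
            self.char2idx[char] = len(self.idx2char) - 1
        return self.char2idx[char]


def encode(text, vocab):
    """Map each character to its index, growing the vocabulary as needed."""
    return [vocab.add_char(c) for c in text]


def decode(indices, vocab):
    """Map indices back to characters."""
    return "".join(vocab.idx2char[i] for i in indices)


vocab = CharVocab()
tokens = encode("hello", vocab)
print(tokens)                 # [0, 1, 2, 2, 3]
print(decode(tokens, vocab))  # hello
```

Indices are handed out in order of first appearance, which is why the repeated `l` maps to the same index both times.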

We then convert our data from characters to tokens.

In [5]:
import torch

data_dictionary = Dictionary()
tensor_data = torch.LongTensor(len(chars))

for i, c in enumerate(chars):
    tensor_data[i] = data_dictionary.add_char(c)
    
n_elements = len(data_dictionary)

In [6]:
# Test code

print("Sample values:")
print('\n'.join('{1} ({0})'.format(idx, data_dictionary.idx2char[idx]) for idx in tensor_data[1000:1020]))

Sample values:
w (43)
  (2)
p (35)
l (20)
a (14)
i (30)
n (1)
l (20)
y (15)

 (28)
t (3)
h (4)
a (14)
t (3)
  (2)
i (30)
f (9)
  (2)
a (14)
n (1)


Finally, we split the data into training and validation sets. Each label is the character that immediately follows its input.

In [7]:
split = round(0.98 * len(tensor_data))
train_data, train_label = tensor_data[:split], tensor_data[1:split+1]
valid_data, valid_label = tensor_data[split:-1], tensor_data[split+1:]
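The offset-by-one relationship between inputs and labels is easier to see on a toy string. The sketch below uses plain Python lists instead of tensors and a hypothetical 80/20 split ratio (the notebook uses 0.98). The validation inputs stop one character before the end so that every input has a label.

```python
text = "hello world"
tokens = list(text)  # characters stand in for token indices here

split = round(0.8 * len(tokens))  # hypothetical ratio for this toy example
train_x, train_y = tokens[:split], tokens[1:split + 1]
valid_x, valid_y = tokens[split:-1], tokens[split + 1:]

# Show the first few (input -> label) pairs
for x, y in zip(train_x[:4], train_y[:4]):
    print(repr(x), "->", repr(y))

# Inputs and labels always have matching lengths
print(len(train_x), len(train_y), len(valid_x), len(valid_y))
```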

Let's create a class to serve our training data in batches.

Each batch contains several sequences of text whose starting points are spread evenly over the text. We keep an index to the start of each sequence and, at the end of each batch, advance it by one sequence length. Each sequence therefore continues where it left off in the previous batch, so the hidden state carried over from that batch remains relevant.

![text processing](img/text_processing.png)

We also encode the input into one-hot tensors to feed directly into the neural network.

In [8]:
class TrainingData():    
    def __init__(self, train_data, train_label, device, sequence_per_batch = 64, char_per_sequence = 128):
        
        self.train_data = train_data
        self.train_label = train_label
        self.sequence_per_batch = sequence_per_batch
        self.char_per_sequence = char_per_sequence
        self.device = device
        self.length = len(train_data)
        
        # We start reading the text at even sections based on number of sequence per batch
        self.batch_idx = range(0, self.length, self.length // sequence_per_batch)
        self.batch_idx = self.batch_idx[:sequence_per_batch]
        assert len(self.batch_idx) == sequence_per_batch, '{} batches expected vs {} actual'.format(sequence_per_batch,
                                                                                                    len(self.batch_idx))
        
    def next_batch(self):
        
        # loop to the start if we reached the end of text
        self.batch_idx = list(idx if idx + self.char_per_sequence < self.length else 0 for idx in self.batch_idx)
        
        # Extract sequences
        sequences_input = tuple(self.train_data[idx:idx+self.char_per_sequence] for idx in self.batch_idx)
        sequences_label = tuple(self.train_label[idx:idx+self.char_per_sequence] for idx in self.batch_idx)

        # Transform input into one-hot (source: https://discuss.pytorch.org/t/convert-int-into-one-hot-format/507/29)
        sequences_input = tuple(torch.zeros(len(data), n_elements, device = self.device).scatter_(1, data.unsqueeze(-1), 1) for data in sequences_input)
        
        # Move next idx
        self.batch_idx = [idx + self.char_per_sequence for idx in self.batch_idx]
        
        # Concatenate tensors
        return torch.stack(sequences_input, dim=1), torch.stack(sequences_label, dim=1)
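To make the index bookkeeping concrete, the following sketch reproduces only the start-index logic in plain Python, on a hypothetical text of 100 characters. The helper names `start_indices` and `advance` are assumptions made for this illustration and do not exist in the class above.

```python
def start_indices(length, sequences_per_batch):
    """Evenly spaced starting offsets, one per parallel sequence."""
    step = length // sequences_per_batch
    return list(range(0, length, step))[:sequences_per_batch]


def advance(indices, chars_per_sequence, length):
    """Move every offset forward by one sequence, wrapping to 0 near the end."""
    moved = [idx + chars_per_sequence for idx in indices]
    return [idx if idx + chars_per_sequence < length else 0 for idx in moved]


length = 100  # hypothetical text length
indices = start_indices(length, sequences_per_batch=4)
print(indices)  # [0, 25, 50, 75]
indices = advance(indices, chars_per_sequence=10, length=length)
print(indices)  # [10, 35, 60, 85]
indices = advance(indices, chars_per_sequence=10, length=length)
print(indices)  # [20, 45, 70, 0]
```

Because row `k` of one batch starts exactly where row `k` of the previous batch ended, the hidden state for that row can be reused. The exception is the wrap-around, where the last sequence jumps back to the start of the text.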

## Create and train a neural network

We want to create and optimize different variants of the following architecture.

![architecture](img/architecture.png)

We will optimize the RNN module and create our architecture to easily test different variants by choosing:

* RNN, LSTM or GRU modules

* number of features for hidden states

* number of layers

* dropout between each layer

* batch size (number of sequences processed in parallel during a batch)

* input size (number of characters in each sequence)

We can decide later to optimize other parameters such as the loss function, optimization algorithm…
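One way to enumerate such variants is a small grid over the options listed above. This is only a sketch: the `search_space` name and every value in it are hypothetical placeholders rather than tuned settings, and each resulting dictionary would still have to be passed to whatever code builds and trains the model.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not tuned settings
search_space = {
    "rnn_module": ["RNN", "LSTM", "GRU"],
    "hidden_size": [64, 128],
    "num_layers": [1, 2],
    "dropout": [0.0, 0.1],
}

# One dictionary per combination of options
configs = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(len(configs))  # 24
print(configs[0])
```

A full grid grows quickly with each added option, so in practice it may be preferable to sample a subset of these combinations.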

In [9]:
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_size, batch_size, rnn_module = "RNN", hidden_size = 64, num_layers = 1, dropout = 0):
        super(Model, self).__init__()
        self.input_size = input_size
        self.rnn_module = rnn_module
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        if rnn_module == "RNN":
            self.rnn = nn.RNN(input_size = input_size, hidden_size = hidden_size, num_layers = num_layers, dropout = dropout)
        elif rnn_module == "LSTM":
            self.rnn = nn.LSTM(input_size = input_size, hidden_size = hidden_size, num_layers = num_layers, dropout = dropout)
        elif rnn_module == "GRU":
            self.rnn = nn.GRU(input_size = input_size, hidden_size = hidden_size, num_layers = num_layers, dropout = dropout)
        else:
            raise ValueError("rnn_module must be 'RNN', 'LSTM' or 'GRU', got {!r}".format(rnn_module))
            
        self.output = nn.Linear(hidden_size, input_size)

    def forward(self, input, hidden):
        output = input.view(1, -1, self.input_size)
        output, hidden = self.rnn(output, hidden)
        output = self.output(output[0])
        return output, hidden

    def initHidden(self, batch_size):
        # initialize hidden state to zeros
        if self.rnn_module == "LSTM":
            return torch.zeros(self.num_layers, batch_size, self.hidden_size), torch.zeros(
                self.num_layers, batch_size, self.hidden_size)
        else:
            return torch.zeros(self.num_layers, batch_size, self.hidden_size)

We now need to define a loss and optimizer.

In [10]:
import torch.optim as optim

loss_function = nn.CrossEntropyLoss()
optimizer_function = optim.Adam
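For intuition about what the loss measures: `nn.CrossEntropyLoss` takes raw scores (logits) and a target class index, and returns the negative log-probability that a softmax over the scores assigns to the target. The plain-Python function below is a minimal sketch of that computation for a single prediction. It is not PyTorch's implementation, and the example scores are made up.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


# Uniform scores over 4 hypothetical classes: the loss is ln(4) for any target
print(cross_entropy([0.0, 0.0, 0.0, 0.0], target=2))

# A confident, correct prediction gives a loss close to 0
print(cross_entropy([10.0, 0.0, 0.0, 0.0], target=0))

# A confident, wrong prediction is penalized heavily
print(cross_entropy([10.0, 0.0, 0.0, 0.0], target=1))
```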

We can now train the model.

In [11]:
from tqdm import tnrange
from numpy import random

# Define hyper-parameters
rnn_module = "GRU"
hidden_size = 128
num_layers = 2
dropout = 0.1
epochs = 20
batches_per_epoch = 300
sequence_per_batch = 8
char_per_sequence = 50

# Build the NN
model = Model(len(data_dictionary), sequence_per_batch, rnn_module, hidden_size, num_layers, dropout)
hidden = model.initHidden(sequence_per_batch)

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_data = train_data.to(device)
train_label = train_label.to(device)
valid_data = valid_data.to(device)
valid_label = valid_label.to(device)
model.to(device)
if rnn_module == "LSTM":
    # LSTM state is an (h, c) tuple; rebuild it so both tensors actually move
    hidden = tuple(h.to(device) for h in hidden)
else:
    hidden = hidden.to(device)

# Define optimizer
optimizer = optimizer_function(model.parameters())

# Load data
training_data = TrainingData(train_data, train_label, device, sequence_per_batch, char_per_sequence)
valid_length = len(valid_data)

for epoch in tnrange(epochs):
    train_loss = 0   # training loss
    valid_loss = 0   # validation loss
    
    # Training of one epoch
    model.train()
    for i in tnrange(batches_per_epoch):
        
        # Get a batch of sequences
        input_vals, label_vals = training_data.next_batch()

        # Detach hidden layer and reset gradients
        if rnn_module == "LSTM":
            tuple(h.detach_() for h in hidden)
        else:
            hidden.detach_()
        optimizer.zero_grad()
        
        # Forward pass and calculate loss
        loss_sequence = torch.zeros(1, device=device)
        for (input_val, label_val) in zip(input_vals, label_vals):
            output, hidden = model(input_val, hidden)
            loss = loss_function(output, label_val.view(-1))
            loss_sequence += loss
            
        # Backward propagation and weight update
        loss_sequence.backward()
        optimizer.step()
        
        train_loss += loss_sequence.item() / batches_per_epoch / char_per_sequence
        
    # Calculate validation loss
    with torch.no_grad():
        model.eval()
        
        # Initialize a fresh hidden state for a single sequence
        hidden_valid = model.initHidden(1)
        if rnn_module == "LSTM":
            hidden_valid = tuple(h.to(device) for h in hidden_valid)
        else:
            hidden_valid = hidden_valid.to(device)
            
        # Process validation data one character at a time
        for i in range(valid_length-1):
            input_val = valid_data[i].view(1)
            label_val = valid_label[i]

            # One-hot input
            input_val = torch.zeros(len(input_val), n_elements, device = device).scatter_(1, input_val.unsqueeze(-1), 1)

            # Forward pass and calculate loss
            output, hidden_valid = model(input_val, hidden_valid)
            loss = loss_function(output, label_val.view(-1))
            valid_loss += loss.item() / (valid_length - 1)
        
    print("Epoch {} - Training loss {} - Validation loss {}".format(epoch+1, train_loss, valid_loss))

Epoch 1 - Training loss 3.1059245330810548 - Validation loss 2.695492514909661
Epoch 2 - Training loss 2.419619833374021 - Validation loss 2.2780075974377856
Epoch 3 - Training loss 2.1597442972819016 - Validation loss 2.103667762217982
Epoch 4 - Training loss 2.0155672790527355 - Validation loss 1.9676023394069149
Epoch 5 - Training loss 1.9313430516560879 - Validation loss 1.8954456084226652
Epoch 6 - Training loss 1.83999774424235 - Validation loss 1.8342455381676595
Epoch 7 - Training loss 1.8135996297200543 - Validation loss 1.7796529887437211
Epoch 8 - Training loss 1.7714799194335928 - Validation loss 1.744904908431321
Epoch 9 - Training loss 1.7085739954630548 - Validation loss 1.7063948945278975
Epoch 10 - Training loss 1.670262375386555 - Validation loss 1.6800991733761785
Epoch 11 - Training loss 1.647395095825195 - Validation loss 1.6589835505685004
Epoch 12 - Training loss 1.645564945475261 - Validation loss 1.6323724012539989
Epoch 13 - Training loss 1.59362932993571 - Validation loss 1.6044038320735028
Epoch 14 - Training loss 1.5748805206298835 - Validation loss 1.5948034398764097
Epoch 15 - Training loss 1.578380380249023 - Validation loss 1.5785795410102816
Epoch 16 - Training loss 1.5697588765462236 - Validation loss 1.5620818521707618
Epoch 17 - Training loss 1.5202949132283532 - Validation loss 1.5340196983662115
Epoch 18 - Training loss 1.517479506683349 - Validation loss 1.526218808117962
Epoch 19 - Training loss 1.5296867800394687 - Validation loss 1.5256320089366522
Epoch 20 - Training loss 1.5292602198282876 - Validation loss 1.5057634775868523
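
To put these losses in context, here is a rough sketch of two reference points. `nn.CrossEntropyLoss` uses the natural logarithm, so a model that guessed uniformly over the 99 unique characters would score ln(99). The exponential of the loss (often called perplexity) gives the effective number of equally likely characters the model hesitates between. The two input numbers are copied from the notebook output. The interpretation assumes the reported loss is an average per character, which is how the loop above normalizes it.

```python
import math

n_chars = 99  # unique characters reported earlier in the notebook
final_valid_loss = 1.5057634775868523  # validation loss after epoch 20

# Loss of a model that assigns the same probability to every character
uniform_loss = math.log(n_chars)

# Perplexity: the effective number of equally likely choices per character
perplexity = math.exp(final_valid_loss)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"validation perplexity: {perplexity:.2f}")
```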



We make sure to save the trained model.

In [12]:
# Save model
PATH_MODEL = "model.pt"
torch.save(model, PATH_MODEL)
print("Model saved")

Model saved

## Test the model

We finally test the model by predicting a few characters.

In [13]:
import torch

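# Load model (on PyTorch 2.6 and later, loading a fully pickled model
# requires torch.load(PATH_MODEL, weights_only=False))
PATH_MODEL = "model.pt"
model = torch.load(PATH_MODEL)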

# Move model to CPU
model = model.to(torch.device("cpu"))

In [19]:
from numpy import random

with torch.no_grad():
    
    # Go into evaluation mode
    model.eval()
    
    # Define a sequence of characters to initialize the hidden states
    init_chars = "The "

    init_data = torch.LongTensor(len(init_chars))
    for i, c in enumerate(init_chars):
        init_data[i] = data_dictionary.char2idx[c]

    # Transform into one-hot
    init_data = torch.zeros(len(init_data), len(data_dictionary)).scatter_(1, init_data.unsqueeze(-1), 1)

    # Initialize hidden layer and feed sequence of characters to the model
    hidden = model.initHidden(1)
    for init_char in init_data:
        output, hidden = model(init_char, hidden)

    # Predict next characters one at a time
    number_chars = 1000
    chars = init_chars
    for _ in range(number_chars):

        # Calculate probability distribution of outputs with a temperature of 0.5
        prob = nn.Softmax(1)(output/0.5).squeeze().numpy()

        # Sample from outputs
        output_idx = random.choice(len(prob), p = prob)

        # Extract predicted char
        predicted_char = data_dictionary.idx2char[output_idx]
        chars += predicted_char

        # Transform predicted char into one-hot vector
        output_idx = torch.LongTensor([[output_idx]])
        next_input = torch.zeros(len(output_idx), len(data_dictionary)).scatter_(1, output_idx, 1)

        # Feed into NN to predict next char
        output, hidden = model(next_input, hidden)

    # Print predicted sequence
    print("Initializing sequence:", init_chars)
    print("Predicted sequence:\n", chars)    

Initializing sequence: The 
Predicted sequence:
 The dear was heard on a man.

“The would did not now he wish rement the grave of the young man with the last of the Paris, and her
carry so a son of the door.

“I do you will be four a moment to the prove you will a corntion, and where the doung the still of the sconvers
of the miscondens. I will to de Valentine, and the discontort of the count of the tell me to him.

“The young man he was a mistors, and with the moment.”

“Yes; the house, a fortune of the door would but with the young say with
make on they sign to him to be some the shall be a
stand, and I am some contrace, and the dear in the abbé with my
father he had to say at the count of the count of the stranger to
his whore to preserte to live to dear and with the stranger
of the minise of the world to be a mistone was so like which could be
a valling of the still signate and known.

“But the same of the same of a man.”

“Yes, my did not see of which the years of the dong the sam

Not so bad for only 10 minutes of training, or actually **only 5 minutes of training** (plus 5 minutes of validation)!

I'm sure we can do better, though!
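
The sampling step in the generation cell divides the model outputs by a temperature of 0.5 before applying the softmax. The plain-Python sketch below illustrates the effect on three made-up scores: a temperature below 1 sharpens the distribution, so the most likely character is picked more often. The function name `softmax_with_temperature` and the example scores are assumptions for illustration only.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters

for temperature in (1.0, 0.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Draw one index according to the sharpened distribution
rng = random.Random(0)  # fixed seed so the sketch is repeatable
probs = softmax_with_temperature(logits, 0.5)
sampled = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print("sampled index:", sampled)
```

Lower temperatures tend to give safer but more repetitive text, while higher ones give more varied but more error-prone text.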

## Optimization

At this point we want to tune our neural network for better results.

Now that the prototype is complete, it is time to move the code to a separate Python file (or several, for organized people). This keeps the code clean and makes it easy to run multiple experiments at the same time, locally or remotely.

We will monitor our experiments through [Weights & Biases](https://www.wandb.com/) so that we can easily analyze and compare results while keeping track of the code and model associated with each run.

Refer to [the project README](https://github.com/borisd13/char-RNN/blob/master/README.md) for more details.