## Text Generation

We will train a recurrent neural network to generate text character by character, inspired by CharRNN. The network receives a sequence of characters as input and must predict the next character; feeding that prediction back as input lets it keep generating text.

## Data

In [None]:
!pip install wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9675 sha256=8d84eed95a5891ff847a53d67c3e45f9fcecc69cc2dfa6315039117ea7f6d3d8
  Stored in directory: /root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
import wget

url = 'https://mymldatasets.s3.eu-de.cloud-object-storage.appdomain.cloud/el_quijote.txt'
wget.download(url)

'el_quijote.txt'

The dataset has about 1 million characters, enough to generate text convincingly.

In [None]:
# Use a context manager so the file is closed after reading
with open("el_quijote.txt", "r", encoding='utf-8') as f:
    text = f.read()
text[:300], len(text)

('DON QUIJOTE DE LA MANCHA\nMiguel de Cervantes Saavedra\n\nPRIMERA PARTE\nCAPÍTULO 1: Que trata de la condición y ejercicio del famoso hidalgo D. Quijote de la Mancha\nEn un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho tiempo que vivía un hidalgo de los de lanza en astillero, ada',
 1038397)

## Tokenization

To feed this text to the neural network we first need to turn it into numbers that the network's operations can work on. This process is known as tokenization.

In this case we will simply replace each character in the text with its position in the following string of characters.

In [None]:
import string

all_characters = string.printable + "ñÑáÁéÉíÍóÓúÚ¿¡" # The last characters for Castilian Spanish are added
all_characters

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0cñÑáÁéÉíÍóÓúÚ¿¡'

In [None]:
class Tokenizer():

    def __init__(self):
        self.all_characters = all_characters
        self.n_characters = len(self.all_characters)

    def text_to_seq(self, text):
        # Map each character to its index; characters outside the vocabulary are skipped
        seq = []
        for c in text:
            ix = self.all_characters.find(c)
            if ix != -1:
                seq.append(ix)
        return seq

    def seq_to_text(self, seq):
        # Map each index back to its character
        return ''.join(self.all_characters[ix] for ix in seq)

In [None]:
tokenizer = Tokenizer()
tokenizer.n_characters

114

In [None]:
tokenizer.text_to_seq('señor, ¿qué tal?')

[28, 14, 100, 24, 27, 73, 94, 112, 26, 30, 104, 94, 29, 10, 21, 82]

In [None]:
tokenizer.seq_to_text([28, 14, 100, 24, 27, 73, 94, 112, 26, 30, 104, 94, 29, 10, 21, 82])

'señor, ¿qué tal?'
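
Looking up each character by position scans the character string on every call. As a minimal sketch of an alternative, assuming the same character string as above, a dictionary built once gives direct lookups. The class name `DictTokenizer` is hypothetical and not part of the original notebook.

```python
import string

# Same vocabulary as in the notebook above
all_characters = string.printable + "ñÑáÁéÉíÍóÓúÚ¿¡"

class DictTokenizer:
    """Hypothetical variant of Tokenizer that uses a dict for lookups."""

    def __init__(self, characters=all_characters):
        self.all_characters = characters
        self.n_characters = len(characters)
        # character -> index table, built once
        self.char_to_ix = {c: i for i, c in enumerate(characters)}

    def text_to_seq(self, text):
        # Characters outside the vocabulary are skipped, as in the original
        return [self.char_to_ix[c] for c in text if c in self.char_to_ix]

    def seq_to_text(self, seq):
        return "".join(self.all_characters[i] for i in seq)

dict_tokenizer = DictTokenizer()
encoded = dict_tokenizer.text_to_seq("señor, ¿qué tal?")
print(encoded)
print(dict_tokenizer.seq_to_text(encoded))
```

Because every character in the vocabulary string is unique, this sketch should produce the same indices as the position-based lookup.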

We will tokenize all the text.

In [None]:
text_encoded = tokenizer.text_to_seq(text)

## Dataset

We split the encoded text into a training set (the first 80%) and a test set (the remaining 20%), which we will use for validation.

In [None]:
train_size = len(text_encoded) * 80 // 100
train = text_encoded[:train_size]
test = text_encoded[train_size:]

len(train), len(test)

(814065, 203517)

To train the network we need text sequences of a fixed length. The following function slides a window over the text one character at a time.

In [None]:
def windows(text, window_size=100):
    # One window per start index, moving one position at a time
    return [text[i:i + window_size] for i in range(len(text) - window_size)]

In [None]:
text_encoded_windows = windows(text_encoded)

In [None]:
print(tokenizer.seq_to_text((text_encoded_windows[0])))
print()
print(tokenizer.seq_to_text((text_encoded_windows[1])))
print()
print(tokenizer.seq_to_text((text_encoded_windows[2])))

DON QUIJOTE DE LA MANCHA
Miguel de Cervantes Saavedra

PRIMERA PARTE
CAPITULO 1: Que trata de la con

ON QUIJOTE DE LA MANCHA
Miguel de Cervantes Saavedra

PRIMERA PARTE
CAPITULO 1: Que trata de la cond

N QUIJOTE DE LA MANCHA
Miguel de Cervantes Saavedra

PRIMERA PARTE
CAPITULO 1: Que trata de la condi


The PyTorch dataset serves these windows one at a time, using all characters except the last as the network's input and the last character as the label during training.

In [None]:
import torch

class CharRNNDataset(torch.utils.data.Dataset):
    
    def __init__(self, text_encoded_windows, train=True):
        self.text = text_encoded_windows
        self.train = train

    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, ix):
        if self.train:
            return torch.tensor(self.text[ix][:-1]), torch.tensor(self.text[ix][-1])
        return torch.tensor(self.text[ix])

In [None]:
train_text_encoded_windows = windows(train)
test_text_encoded_windows = windows(test)

dataset = {
    'train': CharRNNDataset(train_text_encoded_windows),
    'val': CharRNNDataset(test_text_encoded_windows)
}

dataloader = {
    'train': torch.utils.data.DataLoader(dataset['train'], batch_size=512, shuffle=True, pin_memory=True),
    'val': torch.utils.data.DataLoader(dataset['val'], batch_size=2048, shuffle=False, pin_memory=True),
}

len(dataset['train']), len(dataset['val'])

(813965, 203417)
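
As a quick sanity check on these sizes, the arithmetic can be reproduced by hand. This sketch only uses the lengths and batch sizes reported above, and assumes the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`).

```python
import math

window_size = 100
train_len, test_len = 814065, 203517          # lengths reported above
train_batch_size, val_batch_size = 512, 2048  # batch sizes used in the DataLoaders

# windows() yields one window per start index in range(len(text) - window_size)
n_train_windows = train_len - window_size
n_val_windows = test_len - window_size
print(n_train_windows, n_val_windows)

# The last, smaller batch is kept, so round up
n_train_batches = math.ceil(n_train_windows / train_batch_size)
n_val_batches = math.ceil(n_val_windows / val_batch_size)
print(n_train_batches, n_val_batches)
```

This gives 1590 training batches and 100 validation batches per epoch.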

In [None]:
input, output = dataset['train'][0]
tokenizer.seq_to_text(input)

'DON QUIJOTE DE LA MANCHA\nMiguel de Cervantes Saavedra\n\nPRIMERA PARTE\nCAPITULO 1: Que trata de la co'

In [None]:
tokenizer.seq_to_text([output])

'n'
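
To make the windowing and the input/label split concrete, here is a toy sketch on a seven-letter string with a window of 4. It re-implements the sliding-window logic of `windows` so that it runs on its own; the names `toy_text`, `toy_windows` and `pairs` are for illustration only.

```python
def windows(seq, window_size):
    # Same sliding-window logic as the notebook's windows() function
    return [seq[i:i + window_size] for i in range(len(seq) - window_size)]

toy_text = "quijote"
toy_windows = windows(toy_text, window_size=4)
print(toy_windows)

# Split each window the way CharRNNDataset does: all but the last item
# are the input, the last item is the label
pairs = [(w[:-1], w[-1]) for w in toy_windows]
for inputs, label in pairs:
    print(repr(inputs), "->", repr(label))
```

Note that the loop stops one start index early, so the final possible window (`"jote"` here) is not produced. With about a million characters this makes no practical difference.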

## Embeddings

Although we have converted our text to numbers, the network cannot work well with raw indices: their values are arbitrary (character 100 is not "more" than character 28), so they need to be turned into a representation the network can learn from.

One option is one-hot encoding: after all, we can treat each character as a category, and our network outputs a probability distribution over all possible characters. However, one-hot vectors are expensive (as long as the vocabulary) and inefficient (almost entirely zeros).

This is why we use embeddings. An embedding is a matrix with one row per vocabulary entry and a number of columns that we choose (each column capturing some learned feature). Unlike one-hot vectors, these vectors are dense (they can have non-zero values at any position), and their values are learned by the network, so it can represent the data in whatever way works best for the task.
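
One way to see the link between the two representations: multiplying a one-hot vector by the embedding matrix simply selects one row of it. The following framework-free sketch uses a made-up 4-character vocabulary and a made-up 4x3 matrix (both are assumptions for illustration). In the model below, `torch.nn.Embedding` holds this matrix and performs the row lookup directly.

```python
vocab = "abcd"  # toy vocabulary (assumption, for illustration only)
# Toy embedding matrix: one row per character, 3 columns chosen arbitrarily
embedding_matrix = [
    [0.1, -0.3, 0.7],
    [0.5, 0.2, -0.1],
    [-0.4, 0.9, 0.0],
    [0.3, 0.3, -0.8],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    # Row vector times matrix
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

ix = vocab.index("c")
via_one_hot = matvec(one_hot(ix, len(vocab)), embedding_matrix)
via_lookup = embedding_matrix[ix]
print(via_one_hot)
print(via_lookup)
```

Both routes yield the same vector, but the lookup avoids building the one-hot vector and multiplying by all the zeros.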

In [None]:
class CharRNN(torch.nn.Module):

    def __init__(self, input_size, embedding_size=128, hidden_size=256, num_layers=2, dropout=0.2):
        super().__init__()
        self.encoder = torch.nn.Embedding(input_size, embedding_size)
        self.rnn = torch.nn.LSTM(input_size=embedding_size, hidden_size=hidden_size, num_layers=num_layers, dropout=dropout, batch_first=True)
        self.fc = torch.nn.Linear(hidden_size, input_size)

    def forward(self, x):
        x = self.encoder(x)
        x, h = self.rnn(x)         
        y = self.fc(x[:,-1,:])
        return y

The model receives batches of sequences containing the character indices produced by the tokenizer. For each sequence in the batch it outputs one score per possible character; after a softmax these scores form a probability distribution, and the characters with the highest probability are the ones the network considers good candidates to follow the input sequence.

In [None]:
model = CharRNN(input_size=tokenizer.n_characters)
outputs = model(torch.randint(0, tokenizer.n_characters, (64, 50)))
outputs.shape

torch.Size([64, 114])
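
The 114 values per sequence are raw scores (logits), not probabilities. `torch.nn.CrossEntropyLoss`, used in the training section below, applies a log-softmax internally and takes the negative log-probability of the correct character. As a rough, framework-free sketch of that computation, with made-up scores for a 3-character vocabulary:

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class
    return -math.log(softmax(logits)[target])

toy_logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-character vocabulary
probs = softmax(toy_logits)
print([round(p, 3) for p in probs])
print(round(cross_entropy(toy_logits, target=0), 3))

# Baseline: equal scores over the notebook's 114 characters
uniform_loss = cross_entropy([0.0] * 114, target=0)
print(round(uniform_loss, 3))
```

The uniform baseline is ln(114), roughly 4.74, which gives a reference point for reading the loss values in the training log: the first-epoch training loss of 1.85 is already well under it.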

## Training

In [None]:
from tqdm import tqdm
import numpy as np

device = "cuda" if torch.cuda.is_available() else "cpu"

def fit(model, dataloader, epochs=10):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(1, epochs+1):
        model.train()
        train_loss = []
        bar = tqdm(dataloader['train'])
        for batch in bar:
            X, y = batch
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            y_hat = model(X)
            loss = criterion(y_hat, y)
            loss.backward()
            optimizer.step()
            train_loss.append(loss.item())
            bar.set_description(f"loss {np.mean(train_loss):.5f}")
        bar = tqdm(dataloader['val'])
        val_loss = []
        model.eval()
        with torch.no_grad():
            for batch in bar:
                X, y = batch
                X, y = X.to(device), y.to(device)
                y_hat = model(X)
                loss = criterion(y_hat, y)
                val_loss.append(loss.item())
                bar.set_description(f"val_loss {np.mean(val_loss):.5f}")
        print(f"Epoch {epoch}/{epochs} loss {np.mean(train_loss):.5f} val_loss {np.mean(val_loss):.5f}")

def predict(model, X):
    model.eval() 
    with torch.no_grad():
        X = torch.tensor(X).to(device)
        pred = model(X.unsqueeze(0))
        return pred

In [None]:
model = CharRNN(input_size=tokenizer.n_characters)
fit(model, dataloader, epochs=20)

loss 1.85040: 100%|██████████| 1590/1590 [03:01<00:00,  8.75it/s]
val_loss 1.57951: 100%|██████████| 100/100 [00:15<00:00,  6.51it/s]


Epoch 1/20 loss 1.85040 val_loss 1.57951


loss 1.49210: 100%|██████████| 1590/1590 [03:03<00:00,  8.68it/s]
val_loss 1.44539: 100%|██████████| 100/100 [00:15<00:00,  6.63it/s]


Epoch 2/20 loss 1.49210 val_loss 1.44539


loss 1.39133: 100%|██████████| 1590/1590 [03:02<00:00,  8.69it/s]
val_loss 1.38293: 100%|██████████| 100/100 [00:16<00:00,  5.98it/s]


Epoch 3/20 loss 1.39133 val_loss 1.38293


loss 1.33564: 100%|██████████| 1590/1590 [03:02<00:00,  8.71it/s]
val_loss 1.34954: 100%|██████████| 100/100 [00:15<00:00,  6.45it/s]


Epoch 4/20 loss 1.33564 val_loss 1.34954


loss 1.29927: 100%|██████████| 1590/1590 [03:02<00:00,  8.72it/s]
val_loss 1.32411: 100%|██████████| 100/100 [00:15<00:00,  6.37it/s]


Epoch 5/20 loss 1.29927 val_loss 1.32411


loss 1.27124: 100%|██████████| 1590/1590 [03:02<00:00,  8.72it/s]
val_loss 1.30420: 100%|██████████| 100/100 [00:16<00:00,  5.90it/s]


Epoch 6/20 loss 1.27124 val_loss 1.30420


loss 1.24906: 100%|██████████| 1590/1590 [03:00<00:00,  8.79it/s]
val_loss 1.29014: 100%|██████████| 100/100 [00:16<00:00,  5.98it/s]


Epoch 7/20 loss 1.24906 val_loss 1.29014


loss 1.23091: 100%|██████████| 1590/1590 [03:01<00:00,  8.75it/s]
val_loss 1.28206: 100%|██████████| 100/100 [00:15<00:00,  6.36it/s]


Epoch 8/20 loss 1.23091 val_loss 1.28206


loss 1.21729: 100%|██████████| 1590/1590 [03:02<00:00,  8.72it/s]
val_loss 1.27075: 100%|██████████| 100/100 [00:15<00:00,  6.37it/s]


Epoch 9/20 loss 1.21729 val_loss 1.27075


loss 1.20360: 100%|██████████| 1590/1590 [03:02<00:00,  8.70it/s]
val_loss 1.26932: 100%|██████████| 100/100 [00:16<00:00,  5.94it/s]


Epoch 10/20 loss 1.20360 val_loss 1.26932


loss 1.19139: 100%|██████████| 1590/1590 [03:01<00:00,  8.74it/s]
val_loss 1.26210: 100%|██████████| 100/100 [00:15<00:00,  6.38it/s]


Epoch 11/20 loss 1.19139 val_loss 1.26210


loss 1.17964: 100%|██████████| 1590/1590 [03:02<00:00,  8.71it/s]
val_loss 1.25691: 100%|██████████| 100/100 [00:17<00:00,  5.73it/s]


Epoch 12/20 loss 1.17964 val_loss 1.25691


loss 1.17033: 100%|██████████| 1590/1590 [03:01<00:00,  8.76it/s]
val_loss 1.25436: 100%|██████████| 100/100 [00:17<00:00,  5.66it/s]


Epoch 13/20 loss 1.17033 val_loss 1.25436


loss 1.16143: 100%|██████████| 1590/1590 [03:02<00:00,  8.73it/s]
val_loss 1.25164: 100%|██████████| 100/100 [00:15<00:00,  6.35it/s]


Epoch 14/20 loss 1.16143 val_loss 1.25164


loss 1.15374: 100%|██████████| 1590/1590 [03:02<00:00,  8.72it/s]
val_loss 1.24838: 100%|██████████| 100/100 [00:15<00:00,  6.35it/s]


Epoch 15/20 loss 1.15374 val_loss 1.24838


loss 1.14587: 100%|██████████| 1590/1590 [03:02<00:00,  8.71it/s]
val_loss 1.25261: 100%|██████████| 100/100 [00:17<00:00,  5.76it/s]


Epoch 16/20 loss 1.14587 val_loss 1.25261


loss 1.13894: 100%|██████████| 1590/1590 [03:02<00:00,  8.73it/s]
val_loss 1.24983: 100%|██████████| 100/100 [00:15<00:00,  6.29it/s]


Epoch 17/20 loss 1.13894 val_loss 1.24983


loss 1.13158: 100%|██████████| 1590/1590 [03:02<00:00,  8.72it/s]
val_loss 1.24687: 100%|██████████| 100/100 [00:15<00:00,  6.49it/s]


Epoch 18/20 loss 1.13158 val_loss 1.24687


loss 1.12634: 100%|██████████| 1590/1590 [03:02<00:00,  8.72it/s]
val_loss 1.24568: 100%|██████████| 100/100 [00:16<00:00,  6.23it/s]


Epoch 19/20 loss 1.12634 val_loss 1.24568


loss 1.12039: 100%|██████████| 1590/1590 [03:02<00:00,  8.72it/s]
val_loss 1.24515: 100%|██████████| 100/100 [00:17<00:00,  5.85it/s]

Epoch 20/20 loss 1.12039 val_loss 1.24515





## Generating Text

Once we have trained our model, we can give it a phrase to generate the next letter.

In [None]:
X_new = "En un lugar de la mancha, "
X_new_encoded = tokenizer.text_to_seq(X_new)
y_pred = predict(model, X_new_encoded)
y_pred = torch.argmax(y_pred, axis=1)[0].item()
tokenizer.seq_to_text([y_pred])

'y'

We can generate more letters by adding the predictions as part of the input, generating text letter by letter.

In [None]:
for i in range(100):
  X_new_encoded = tokenizer.text_to_seq(X_new[-100:])
  y_pred = predict(model, X_new_encoded)
  y_pred = torch.argmax(y_pred, axis=1)[0].item()
  X_new += tokenizer.seq_to_text([y_pred])

X_new

'En un lugar de la mancha, y el cura le dijo:\n-Dejada de la mano a la mano a la mano a la mano a la cabeza y a la mano a la man'
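
The loop above follows a general pattern: predict one character, append it, and feed the most recent characters back in. A minimal sketch of that pattern as a reusable helper is shown here. Both `generate` and `predict_next` are hypothetical names, and a dummy predictor stands in for the trained model so the example runs on its own.

```python
def generate(prompt, predict_next, n_chars=100, context_size=100):
    """Append n_chars characters to prompt, one at a time.

    predict_next is a hypothetical callable that takes the current
    context string and returns the next character.
    """
    text = prompt
    for _ in range(n_chars):
        context = text[-context_size:]  # only the most recent characters are fed back
        text += predict_next(context)
    return text

# Dummy stand-in for the trained model: repeats the character seen 3 positions back
def dummy_predict_next(context):
    return context[-3]

result = generate("abc", dummy_predict_next, n_chars=6, context_size=5)
print(result)
```

In the notebook, `predict_next` would wrap `tokenizer.text_to_seq`, `predict` and `torch.argmax`, as the loop above does inline.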

Always keeping the most likely letter can make the generated text repetitive. To get more variety, it is common to sample the next letter at random according to the predicted probabilities, with a temperature parameter controlling how strongly the most likely letters are favored.

In [None]:
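temp = 1  # temperature: below 1 favors likely letters more, above 1 adds randomness
for i in range(1000):
    X_new_encoded = tokenizer.text_to_seq(X_new[-100:])
    y_pred = predict(model, X_new_encoded)
    # Softmax over the temperature-scaled scores (same weights as exp(scores / temp), but normalized and numerically stable)
    probs = torch.softmax(y_pred.view(-1) / temp, dim=0)
    # Sample one character index according to those probabilities
    sampled_i = torch.multinomial(probs, 1)[0].item()
    X_new += tokenizer.all_characters[sampled_i]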

print(X_new)

En un lugar de la mancha, y el cura le dijo:
-Dejada de la mano a la mano a la mano a la mano a la cabeza y a la mano a la mano en tocar lo que vio su huirdico con tan cojadano que las muchas exalevian a aquella olla al campo, ni le dice que vuestro caballero; mas echo el cuerpo primero que premia lo que de su amo con la licencia que el le riques ojecio era mis desnos vieron la santa y a imaginar que por el tiempo se casaban y derramaron, que dijo el balao; porque no podria que trecho mal simple se entretendasa de las velas. Conociole Don Quijote que andumiendo mas por tus Grinas, solo quedo el tiquero, y el que era pasar del castillo
voy tendemos en el punto a entretener en ellas, ni aun de que hayan condicido hijos largos, que esta apacientar de industria y mucho diaba, por acabar los armas fenciantes manos. A hacerlo) para que mas o Cesella, y Cardenio y las mercades y satisumas de un anadio de vuestro refratura y mal siempre de mala)), que decian que los cerpados y sombras me tuert
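
To see what the temperature `temp` does in the sampling loop above, the following framework-free sketch applies it to made-up scores for a 3-character vocabulary. All names and values here are assumptions for illustration, not taken from the trained model.

```python
import math
import random

def softmax_with_temperature(logits, temp=1.0):
    # Divide the scores by the temperature before normalizing
    scaled = [z / temp for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_chars = ["a", "e", "o"]   # made-up 3-character vocabulary
toy_logits = [2.0, 1.0, 0.0]  # made-up scores

for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])

# Sampling: draw the next character according to the probabilities
rng = random.Random(0)  # fixed seed so the sketch is repeatable
probs = softmax_with_temperature(toy_logits, temp=1.0)
sample = rng.choices(toy_chars, weights=probs, k=10)
print("".join(sample))
```

Lower temperatures concentrate the probability on the top-scoring character, which approaches the greedy behavior seen earlier. Higher temperatures flatten the distribution, giving more varied but less coherent text.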