# Character-Level Language Models with RNNs
---

*COSCI 223 - Machine Learning 3*

*Prepared by Sebastian C. Ibañez*

<a href="https://colab.research.google.com/github/aim-msds/msds2023-ml3/blob/main/notebooks/rnn/01-simple-rnn.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" style="float: left;"></a><br>

In [1]:
import torch

## PyTorch Implementation

---

In this notebook, our goal is to build a character-level language model, using a simple RNN, that is trained to generate names. The data is a list of baby names from [ssa.gov](https://www.ssa.gov/oact/babynames/).

*Check out this Very Cool™ blog post by Andrej Karpathy from 2015: [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)*

Let's look at the data:

In [2]:
# Load data
with open('data/names.txt', 'r') as f:
    text = f.read()
    names = text.splitlines()

print(names[:5])

['emma', 'olivia', 'ava', 'isabella', 'sophia']


Next, let's create a vocabulary to store all the unique tokens.

In [3]:
# Create raw vocabulary
raw_vocab = sorted(list(set(text)))[1:] # drop first token which is \n
vocab_size = len(raw_vocab)

print(f'Number of unique tokens: {vocab_size}')
print(raw_vocab)

Number of unique tokens: 26
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


Before we proceed, we're going to append special start and end tokens to every name.

In [4]:
names = [f'<s>{n}<e>' for n in names]
print(names[:5])

['<s>emma<e>', '<s>olivia<e>', '<s>ava<e>', '<s>isabella<e>', '<s>sophia<e>']


The `torchtext` package provides some useful utilities for processing text data via `torchtext.vocab`.

In [5]:
#!pip install torchtext

In [6]:
from torchtext.vocab import build_vocab_from_iterator

# Create torchtext vocab
vocab = build_vocab_from_iterator(raw_vocab, specials=['<s>', '<e>'])

print(f'Vocabulary Size: {len(vocab)}')

Vocabulary Size: 28


In [7]:
# Convert name to a tensor of integers
def name_to_tensor(name):
    vocab_input = [name[:3]] + list(name[3:-3]) + [name[-3:]]
    return torch.tensor(vocab(vocab_input), dtype=torch.long)

print([names[0]])
print(name_to_tensor(names[0]))

['<s>emma<e>']
tensor([ 0,  6, 14, 14,  2,  1])


To interpret our predictions at inference time, we can use the `vocab.lookup_tokens` method to convert a list of indices (i.e., the integer encodings) back into the actual tokens.

In [8]:
# Reverse lookup
def tensor_to_name(input_tensor):
    return ''.join(vocab.lookup_tokens(input_tensor.tolist()))

input_tensor = name_to_tensor(names[1])
print(input_tensor)
print([tensor_to_name(input_tensor)])

tensor([ 0, 16, 13, 10, 23, 10,  2,  1])
['<s>olivia<e>']
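
If `torchtext` is not available in your environment, the same mapping can be sketched with plain dictionaries. The names `stoi`, `itos`, `encode`, and `decode` below are hypothetical helpers, not part of the notebook, and the index layout (special tokens first, then letters in alphabetical order) is an assumption chosen to mirror the indices printed above.

```python
import string

# Assumed layout: special tokens first, then 'a'..'z', mirroring the indices printed above
specials = ['<s>', '<e>']
tokens = specials + list(string.ascii_lowercase)

stoi = {tok: i for i, tok in enumerate(tokens)}  # token -> index
itos = {i: tok for tok, i in stoi.items()}       # index -> token

def encode(name):
    """Convert a wrapped name such as '<s>emma<e>' into a list of integer indices."""
    body = name[len('<s>'):-len('<e>')]
    return [stoi['<s>']] + [stoi[ch] for ch in body] + [stoi['<e>']]

def decode(indices):
    """Convert a list of integer indices back into a string."""
    return ''.join(itos[i] for i in indices)

print(encode('<s>emma<e>'))
print(decode(encode('<s>olivia<e>')))
```

Wrapping the result of `encode` in `torch.tensor(..., dtype=torch.long)` would give the same kind of tensor that `name_to_tensor` returns.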


Now, let's prepare our dataset for training.

In [9]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class NamesDataset(Dataset):
    def __init__(self, names):
        self.names = names

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        return name_to_tensor(self.names[idx])
    
names_ds = NamesDataset(names) 
names_dl = DataLoader(names_ds, batch_size=1, shuffle=True) # Will not work with batch_size > 1 because you can't stack observations of different lengths!

In [10]:
# Split name into input, output pairs
for name in names_dl:
    x = name[0, :-1]
    y = name[0, 1:]
    print(f'{x} -> {tensor_to_name(x)}')
    print(f'{y} -> {tensor_to_name(y)}')
    break

tensor([ 0, 15,  2,  2, 14,  2]) -> <s>naama
tensor([15,  2,  2, 14,  2,  1]) -> naama<e>
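
The comment on the `DataLoader` notes that `batch_size > 1` fails because names have different lengths. One common workaround, which this notebook does not use, is to pad each batch to its longest sequence and ignore the padded positions in the loss. The sketch below assumes a hypothetical padding index `PAD_IDX = 28` (one past the 28-token vocabulary) and a hypothetical `pad_collate` function. Random logits stand in for model output.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 28  # hypothetical padding index, one past the 28-token vocabulary

def pad_collate(batch):
    """Pad variable-length index tensors to the longest sequence in the batch."""
    padded = pad_sequence(batch, batch_first=True, padding_value=PAD_IDX)
    return padded[:, :-1], padded[:, 1:]  # inputs, targets shifted by one step

# Toy encoded names: '<s>emma<e>' and '<s>ava<e>'
toy = [torch.tensor([0, 6, 14, 14, 2, 1]), torch.tensor([0, 2, 23, 2, 1])]
toy_dl = DataLoader(toy, batch_size=2, collate_fn=pad_collate)

xb, yb = next(iter(toy_dl))
print(xb.shape, yb.shape)

# Random logits stand in for model output; padded targets are excluded from the loss
logits = torch.randn(2, 5, PAD_IDX + 1)
loss = F.cross_entropy(logits.view(-1, PAD_IDX + 1), yb.reshape(-1), ignore_index=PAD_IDX)
print(loss.item())
```

With this approach the embedding layer would need one extra row for the padding index (for example `nn.Embedding(vocab_size + 1, embedding_size)`), which is a design choice rather than something the notebook prescribes.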


In [11]:
# Device
dev = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
#dev = torch.device('cpu')
dev

device(type='cuda')

In [12]:
import torch.nn as nn

class CharacterLM(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super(CharacterLM, self).__init__()
        
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.hidden_size = hidden_size

        self.embedding_layer = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.RNN(embedding_size, hidden_size, batch_first=True) # Make sure to set batch_first=True!
        self.output_layer = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        x_embed = self.embedding_layer(x)
        rnn_output, hn = self.rnn(x_embed)
        logits = self.output_layer(rnn_output)
        return logits
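
To make the recurrent layer less of a black box, here is a minimal sketch of what `nn.RNN` computes with its default `tanh` nonlinearity: at each timestep, the new hidden state is `tanh(x_t @ W_ih.T + b_ih + h_prev @ W_hh.T + b_hh)`. The sizes 8 and 16 mirror the embedding and hidden sizes used in this notebook; the random input `x` is purely illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 5, 8)  # (batch, timesteps, embedding_size)

# Unroll the recurrence by hand using the layer's own weights
h = torch.zeros(1, 16)  # the initial hidden state defaults to zeros
manual = []
for t in range(x.size(1)):
    h = torch.tanh(x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                   + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
    manual.append(h)
manual = torch.stack(manual, dim=1)  # (batch, timesteps, hidden_size)

# Compare against the built-in layer
rnn_output, hn = rnn(x)
print(torch.allclose(manual, rnn_output, atol=1e-5))
```

The `rnn_output` tensor holds the hidden state at every timestep, which is why `CharacterLM.forward` can feed it straight into the output layer to get one set of logits per position.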

In [13]:
import torch.nn.functional as F
import torch.optim as optim
from tqdm.notebook import tqdm, trange

# Hyperparams
epochs = 10
vocab_size = len(vocab)
embedding_size = 8
hidden_size = 16
lr = 0.001

# Create model
model = CharacterLM(vocab_size, embedding_size, hidden_size)
model.to(dev)

# Loss, optimizer
loss_fn = F.cross_entropy
opt = optim.AdamW(model.parameters(), lr=lr)

# Training loop
for e in trange(epochs):
    # Train
    model.train()
    for name in tqdm(names_dl):
        # Split, move to device, add batch dimension
        xb = name[0, :-1].to(dev).unsqueeze(0)
        yb = name[0, 1:].to(dev).unsqueeze(0)
        
        # Forward
        yb_hat = model(xb)

        # Loss
        loss = loss_fn(yb_hat.view(-1, vocab_size), yb.view(-1)) # Reshape outputs, cross entropy loss has no notion of "timesteps" so we collapse the timestep dim with the batch dim

        # Backprop
        loss.backward()
        opt.step()
        opt.zero_grad()
    
    # Evaluation on the training set (no separate validation split is used here)
    model.eval() 
    with torch.no_grad():
        metrics = {'train_loss': 0}
        for name in names_dl:
            xb = name[0, :-1].to(dev).unsqueeze(0)
            yb = name[0, 1:].to(dev).unsqueeze(0)
            yb_hat = model(xb)
            metrics['train_loss'] += loss_fn(yb_hat.view(-1, vocab_size), yb.view(-1)).item()
    
    # Metrics
    train_loss = metrics['train_loss']/len(names_ds)
    print(f'Epoch {e+1}: train_loss = {train_loss:.4f}')

Epoch 1: train_loss = 2.2845
Epoch 2: train_loss = 2.2613
Epoch 3: train_loss = 2.2501
Epoch 4: train_loss = 2.2454
Epoch 5: train_loss = 2.2400
Epoch 6: train_loss = 2.2393
Epoch 7: train_loss = 2.2357
Epoch 8: train_loss = 2.2374
Epoch 9: train_loss = 2.2364
Epoch 10: train_loss = 2.2304
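
To put these numbers in context, a rough sketch: a model that guesses uniformly over all 28 tokens has a cross-entropy of `ln(28)`, and exponentiating a cross-entropy loss gives the perplexity. The final loss below is copied from the epoch 10 line above. Note that the notebook averages the loss per name rather than per token, so the perplexity here is only an approximation.

```python
import math

vocab_size = 28
final_loss = 2.2304  # epoch 10 train_loss from the log above

uniform_loss = math.log(vocab_size)  # loss of a uniform guess over all tokens
perplexity = math.exp(final_loss)    # effective number of equally likely choices per step

print(f'uniform baseline loss = {uniform_loss:.4f}')
print(f'final perplexity      = {perplexity:.2f}')
```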


Before we start sampling, let's briefly describe **temperature**, a common parameter used to control how "confident" the distribution derived from our logits is.

*Note: The temperature parameter controls the distribution's [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)), which is a measure of uncertainty.*

In [14]:
torch.manual_seed(3)
logits = torch.randn(size=(5,))

print(f'  logits   = {logits}')
print(f'temp 0.3   = {F.softmax(logits/0.3, dim=0).numpy()}')
print(f'temp 0.5   = {F.softmax(logits/0.5, dim=0).numpy()}')
print(f'temp 1.0   = {F.softmax(logits/1.0, dim=0).numpy()}')
print(f'temp 3.0   = {F.softmax(logits/3.0, dim=0).numpy()}')
print(f'temp 100.0 = {F.softmax(logits/100.0, dim=0).numpy()}')

  logits   = tensor([ 0.8033,  0.1748,  0.0890, -0.6137,  0.0462])
temp 0.3   = [0.76651555 0.0943533  0.07087104 0.00681102 0.06144912]
temp 0.5   = [0.5546469  0.15781862 0.13291845 0.03260102 0.12201507]
temp 1.0   = [0.36570734 0.19507629 0.17902678 0.08866271 0.17152685]
temp 3.0   = [0.25002265 0.20276965 0.19704895 0.15590084 0.19425796]
temp 100.0 = [0.20140965 0.20014787 0.19997612 0.19857581 0.19989054]


Notice what happens to the probabilities as we divide our logits by the temperature parameter. 

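Lowering the temperature below 1.0 sharpens the distribution, concentrating probability on the largest logit and making it more "confident" or "certain". Raising it above 1.0 does the opposite, flattening the distribution toward uniform.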

This will allow us to control how "wild" the generated text is when we sample it.
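
The link between temperature and entropy can be checked directly. The sketch below is a plain-Python illustration, not part of the original notebook: it reuses the logits printed above (rounded to four decimals) and computes the Shannon entropy, in nats, of the softmax at each temperature. The helper names `softmax_with_temperature` and `entropy` are hypothetical.

```python
import math

logits = [0.8033, 0.1748, 0.0890, -0.6137, 0.0462]  # values printed above

def softmax_with_temperature(logits, temp):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [l / temp for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

for temp in [0.3, 0.5, 1.0, 3.0, 100.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f'temp {temp:>5} -> entropy = {entropy(probs):.4f}')

print(f'maximum (uniform over 5) = {math.log(5):.4f}')
```

The entropy grows with the temperature and approaches `ln(5)`, the entropy of a uniform distribution over five outcomes.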

In [17]:
#torch.save(model, 'torch_checkpoints/name-generator-rnn.pth')

In [20]:
# Note: on recent PyTorch versions, loading a full pickled model may require weights_only=False
model = torch.load('torch_checkpoints/name-generator-rnn.pth')

In [22]:
from torch.distributions.categorical import Categorical

n_samples = 10
temp = 0.7

for i in range(n_samples):
    x = torch.tensor(vocab(['<s>']), dtype=torch.long).to(dev)
    x = x.unsqueeze(0)

    while True:
        logits = model(x)[:, -1, :]
        probs = F.softmax(logits/temp, dim=1)
        y = Categorical(probs).sample()
        x = torch.cat((x, y.view(1, 1)), dim=1)
        if y.item() == vocab['<e>']: # stop once the end token is sampled
            break

    sample = ''.join(vocab.lookup_tokens(x.squeeze().tolist()))[3:-3]
    print(sample)

chadal
melan
aniya
hicer
melmi
ronsie
arder
kaymay
camilae
farlian


## References

---

1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/

2. https://github.com/karpathy/makemore/

3. https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial