In this notebook, we explore simple examples of Recurrent Neural Networks; in this case, we look at Long Short-Term Memory (LSTM) networks.

As before, we keep in mind that neural networks which output logits (or possibly unnormalized probabilities) essentially parameterize a probability distribution over the data space. Recurrent Neural Networks (RNNs) provide a convenient way of parameterizing probability distributions over variable-length sequences, i.e., $\mathbb{P}(x_1 x_2 ... x_T)$. Moreover, RNNs can be treated as generative models for these sequences.
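
As a small illustration of the statement above, the sketch below computes the log-probability of one sequence from per-step logits using the chain rule, $\log \mathbb{P}(x_1 ... x_T) = \sum_t \log \mathbb{P}(x_t | x_1 ... x_{t-1})$. The sizes `V` and `T`, the random `logits`, and the random sequence `x` are made-up stand-ins for a trained model and real data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, T = 5, 4                      # hypothetical vocabulary size and sequence length
x = torch.randint(0, V, (T,))    # a made-up token sequence x_1 ... x_T
logits = torch.randn(T, V)       # stand-in for model outputs: row t scores x_t given the earlier tokens

# Chain rule: log P(x_1 ... x_T) = sum_t log P(x_t | x_1 ... x_{t-1})
log_probs = F.log_softmax(logits, dim=1)             # normalize each row into log-probabilities
seq_log_prob = log_probs[torch.arange(T), x].sum()   # pick the log-probability of each observed token

# The same quantity is the negative of the cross-entropy summed over time steps
nll = F.cross_entropy(logits, x, reduction='sum')
print(seq_log_prob.item(), -nll.item())
```

This is why a cross-entropy loss summed over time steps can be read as the negative log-likelihood of the whole sequence.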

Key concepts to understand are (1) backpropagation and exploding/vanishing gradients, (2) notions of dynamics and memory, and (3) probability models, specifically the evolution from Bigram / Trigram / $N$-gram Markov Models to Hidden Markov Models to RNNs. 

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

!git clone https://github.com/karpathy/makemore

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

In [None]:
with open('makemore/names.txt','r') as file:
  words = file.read().splitlines()

In [None]:
word_lengths = torch.tensor([len(w) for w in words]).float()
print(
 f"""
 This dataset contains {word_lengths.nelement()} names.
 The minimum name length is {word_lengths.min():.0f} characters.
 The maximum name length is {word_lengths.max():.0f} characters.
 The mean name length is {word_lengths.mean():.2f} characters.
 The associated standard deviation is {word_lengths.std():.2f} characters.
 """
 )

In [None]:
#building the character vocabulary and lookup tables to map from characters to integer indices and back

chars = ['>']+sorted(list(set(''.join(words))))+['<']  #'>' = start token, '<' = end token 
s_to_i = {s:i for i,s in enumerate(chars)}
i_to_s = {i:s for s,i in s_to_i.items()}
vocab_size = len(i_to_s)
print(i_to_s)
print(vocab_size)
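
To make the lookup tables concrete, here is a minimal sketch of an encode/decode round trip on a toy word list. The names `toy_words`, `encode`, and `decode` are hypothetical helpers introduced only for this illustration; the vocabulary is built the same way as in the cell above.

```python
# Toy word list standing in for the names dataset (hypothetical data)
toy_words = ["emma", "ava", "mia"]

# Same construction as above: '>' = start token, '<' = end token
toy_chars = ['>'] + sorted(set(''.join(toy_words))) + ['<']
toy_s_to_i = {s: i for i, s in enumerate(toy_chars)}
toy_i_to_s = {i: s for s, i in toy_s_to_i.items()}

def encode(word):
    """Map a word to integer indices, wrapped in start/end tokens."""
    return [toy_s_to_i[ch] for ch in '>' + word + '<']

def decode(indices):
    """Map integer indices back to characters, dropping start/end tokens."""
    return ''.join(toy_i_to_s[i] for i in indices).strip('><')

ids = encode("mia")
print(ids)          # [0, 4, 3, 1, 6]
print(decode(ids))  # mia
```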


In [None]:
import random

word_lengths = torch.tensor([len(w) for w in words])
max_word_length = word_lengths.max().item() #plain Python int, so it can be used directly in range() and comparisons

def build_dataset(words):
  '''
  Build training, validation/dev, and test datasets (80% / 10% / 10% split).
  Every sequence is right-padded with the end token '<' so that all sequences
  have the same length. This is done only so that we can train with minibatches.
  Presumably, the model should learn that an end token '<' is followed by another end token '<' with probability 1.
  Note that words is shuffled in place.
  '''
  # 80%, 10%, 10%
  random.seed(42)
  random.shuffle(words)
  n1 = int(0.8*len(words))
  n2 = int(0.9*len(words))

  X = []
  for w in words:
    v = [s_to_i[ch] for ch in '>'+w+'<']
    while len(v) < max_word_length+2: #padding, +2 for the start and end characters
      v += [s_to_i['<']]
    X.append(v)

  Xtr = torch.tensor(X[:n1])
  Xdev = torch.tensor(X[n1:n2])
  Xte = torch.tensor(X[n2:])
  
  return Xtr, Xdev, Xte

#training split (used to fit parameters), dev/validation split (used to tune hyperparameters), test split (used once at the end with the final model)

Xtr, Xdev, Xte = build_dataset(words)

print(Xtr.size())
print(Xdev.size())
print(Xte.size())
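
The sketch below shows, on a single word, how padding with the end token works and how one padded sequence yields both the inputs and the next-token targets used later for training (targets are the inputs shifted left by one position). The helper `pad_and_split` and the value `max_len=5` are hypothetical and chosen only for illustration.

```python
END = '<'

def pad_and_split(word, max_len):
    """Wrap a word in start/end tokens, pad with END to max_len + 2, and
    return (inputs, targets), where targets is inputs shifted left by one."""
    seq = list('>' + word + END)
    seq += [END] * (max_len + 2 - len(seq))   # right-pad with the end token
    return seq[:-1], seq[1:]

inputs, targets = pad_and_split("ava", max_len=5)
for a, b in zip(inputs, targets):
    print(f"{a} -> {b}")
```

Every padded sequence of length $T$ therefore gives $T-1$ (input, target) pairs, and the padded positions ask the model to predict `<` after `<`.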

## A digression on models
As we have discussed, models in this setting are probability distributions over sequences $\mathbb{P}(x_1 x_2 ... x_T)$. Some models are good and match the data; others are bad because they overfit or do not support the data (i.e., assign it zero probability). We often assume that the data are i.i.d. sampled sequences from an unknown true distribution $\mathbb{P}_{\textrm{true}}$. 

Markov models (e.g., $N$-gram models), hidden Markov models, recurrent neural networks, etc. all correspond to different assumptions on the form of the model. Ideally, assumptions are supported by some underlying theory, but they are often made out of necessity to render the problem computationally tractable. Without tractability, nothing can be accomplished. 

In the following, we investigate two simple "extreme-case" models to provide us with some benchmarks.



The first is a perfectly overfit model -- the empirical distribution of the training data. Consider the training dataset $\mathcal{D}$ made up of iid samples $x^{(1)},...,x^{(N)} \sim \mathbb{P}_{\textrm{true}}$. Then,

$\mathbb{P}_{\textrm{emp}}(x|\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} 1_{x^{(i)}}(x)$,

which assigns probability mass only to the observed realizations of the training data. Note that since the test data realizations are unlikely to match the training set, this model assigns them zero probability and thus utterly fails to generalize.

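With this probability model, the cross entropy between the empirical distribution (an unbiased proxy for the unknown $\mathbb{P}_{\textrm{true}}$) and the model (which in this case is also the empirical distribution) is given below. The last two steps assume that the $N$ training sequences are distinct, so that each has empirical probability $1/N$: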

CrossEntropyLoss1 \\
$= -\int \mathbb{P}_{\textrm{emp}}(x|\mathcal{D}) \log ( \mathbb{P}_{\textrm{emp}}(x|\mathcal{D})) dx $ \\
$= - \frac{1}{N} \sum_{i=1}^{N}  \log ( \mathbb{P}_{\textrm{emp}}(x^{(i)} |\mathcal{D})) $   &emsp;  note, this is the average negative log likelihood of the training data \\
$= - \frac{1}{N} \sum_{i=1}^{N}  \log ( \frac{1}{N}) $ \\
$= \log(N)$

Recall that in our example above, $N=25626$ (the number of sequences in the training split).

In [None]:
PerfectOverfittingCELoss = np.log(Xtr.shape[0])
print(PerfectOverfittingCELoss)
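
The behaviour of the empirical-distribution model can be checked on a toy example. The sketch below uses a hypothetical four-name training set and a two-name test set; `empirical_distribution` and `cross_entropy` are illustrative helpers, not part of the notebook's pipeline.

```python
import math
from collections import Counter

def empirical_distribution(data):
    """Return a dict mapping each observed sequence to its relative frequency."""
    counts = Counter(data)
    n = len(data)
    return {x: c / n for x, c in counts.items()}

def cross_entropy(samples, model):
    """Average negative log-likelihood of samples under model (inf if any sample has zero mass)."""
    total = 0.0
    for x in samples:
        p = model.get(x, 0.0)
        if p == 0.0:
            return math.inf
        total -= math.log(p)
    return total / len(samples)

train = ["emma", "ava", "mia", "zoe"]   # hypothetical training set, all distinct
test = ["emma", "liv"]                  # "liv" never appears in train

p_emp = empirical_distribution(train)
print(cross_entropy(train, p_emp), math.log(len(train)))  # both equal log(4)
print(cross_entropy(test, p_emp))                          # inf
```

On the training data the loss equals $\log(N)$, while a single unseen test sequence is enough to make the test loss infinite.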

In contrast, a model that cannot overfit, and in fact completely underfits, is the probability model that assigns equal probability to all possible sequences. Suppose the sequence length is $T$ (as above) and that each token $x_i$ can take on $V$ different values, i.e., $V$ is the size of the vocabulary. Then there are a total of $V^T$ possible sequences. In our example, $V=28$ (the size of the character vocabulary, including the start and end tokens) and $T=17$ (the maximum name length plus the start and end tokens). This is perhaps the simplest model, since it has no dependence on the training data, and it is described by:

$\mathbb{P}_{\textrm{unif}}(x) = \frac{1}{V^T}$

With this model, the cross-entropy loss is:

CrossEntropyLoss2 \\
$= -\int \mathbb{P}_{\textrm{emp}}(x|\mathcal{D}) \log (\mathbb{P}_{\textrm{unif}}(x)) dx $ \\
$= - \frac{1}{N} \sum_{i=1}^{N}  \log ( \mathbb{P}_{\textrm{unif}}(x^{(i)} )) $  \\
$= - \frac{1}{N} \sum_{i=1}^{N}  \log ( \frac{1}{V^T} ) $ \\
$= \log ( V^T)$ \\
$= T \log(V) $

In [None]:
UniformCELoss = 17 * np.log(28)
print(UniformCELoss)
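
The same value can be recovered token by token: a model whose logits are all zero predicts a uniform distribution over the $V$ tokens at every step, so each step contributes $\log(V)$. A minimal sketch, where the batch size `B` and the random `targets` are arbitrary choices made for illustration:

```python
import math
import torch
import torch.nn.functional as F

V, T, B = 28, 17, 4                       # V and T as in the text; B is arbitrary
targets = torch.randint(0, V, (B, T))     # made-up token indices; the result does not depend on them
logits = torch.zeros(B, T, V)             # all-zero logits <=> uniform distribution over the V tokens

# Sum the per-step cross-entropy over time; each call averages over the batch
loss = sum(F.cross_entropy(logits[:, t], targets[:, t]) for t in range(T))
print(loss.item(), T * math.log(V))
```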

The above models are quite simple, but they provide some benchmarks and help us interpret what values the loss should take. Training losses near $10.15$ are likely to indicate overfitting (the model has essentially memorized the training set), while losses near $56.65$ suggest the model does no better than a uniform prediction model.

### A Simple LSTM

First, we will build a single LSTM cell. As before, we will do this in a PyTorch-like fashion, but try not to use pre-built classes and methods.
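
For reference, the standard LSTM update (input, forget, and output gates plus a candidate memory) can be sanity-checked against PyTorch's built-in `nn.LSTMCell`. The sketch below is only a check of the update rule under arbitrary small sizes; it is not the implementation used in this notebook, which keeps separate linear layers per gate and adds an output projection.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, in_dim, hidden_dim = 3, 4, 5           # arbitrary small sizes for illustration
cell = nn.LSTMCell(in_dim, hidden_dim)

x = torch.randn(B, in_dim)                # input at the current time step
h = torch.randn(B, hidden_dim)            # previous hidden state
c = torch.randn(B, hidden_dim)            # previous cell memory

# Manual update. nn.LSTMCell stacks its weights in the order:
# input gate, forget gate, candidate memory, output gate.
gates = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)   # new cell memory
h_new = torch.sigmoid(o) * torch.tanh(c_new)                      # new hidden state

h_ref, c_ref = cell(x, (h, c))
print(torch.allclose(h_new, h_ref, atol=1e-6), torch.allclose(c_new, c_ref, atol=1e-6))
```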

In [None]:
g = torch.Generator().manual_seed(2147483647)

# Note: training a 2-layer RNN resulted in a loss that grew. 
# Presumably, this was due to the gradients for certain weights vanishing.
# Including layer normalization in the RecurrentBlock moderated this issue.

class RecurrentDropout(nn.Module):
  '''
  Dropout method of Gal, Ghahramani - A Theoretically Grounded Application of Dropout in
  Recurrent Neural Networks (2016).
  '''
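  def __init__(self, p):
    super(RecurrentDropout, self).__init__()
    self.p = torch.tensor(p, requires_grad=False) #probability of keeping a unit
    self.msk = None  

  def set_mask(self, B, in_dim, device):
    #one mask per sequence, sampled once and reused at every time step;
    #dividing by the keep probability keeps the expected activation the same in train and eval mode
    self.msk = (torch.bernoulli(self.p*torch.ones((B,in_dim))) / self.p).to(device) 
    self.msk.requires_grad = False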

  def forward(self, x):
      #x is (B,in_dim)
      return x * self.msk

class LSTMBlock(nn.Module):
  '''
  Single LSTM layer
  '''
  def __init__(self, in_dim, hidden_dim, out_dim, p):
    super(LSTMBlock, self).__init__()
    #dimensions
    self.in_dim = in_dim
    self.hidden_dim = hidden_dim
    self.out_dim = out_dim
    #internal states
    self.h = None #hidden state
    self.c = None #cell memory
    # dropout layers (p is the drop rate, so 1-p is the keep probability)
    self.dropout_i = RecurrentDropout(1-p)
    self.dropout_h = RecurrentDropout(1-p)
    # state transition
    self.layer_hh = nn.Linear(hidden_dim, hidden_dim, bias=True)
    self.layer_ih = nn.Linear(in_dim, hidden_dim, bias=False)
    self.tanh = torch.tanh 
    #gates
    self.sigmoid = torch.sigmoid
    # input gate
    self.layer_hh_ig = nn.Linear(hidden_dim, hidden_dim, bias=True)
    self.layer_ih_ig = nn.Linear(in_dim, hidden_dim, bias=False)
    # forget gate
    self.layer_hh_fg = nn.Linear(hidden_dim, hidden_dim, bias=True)
    self.layer_ih_fg = nn.Linear(in_dim, hidden_dim, bias=False)
    # output gate
    self.layer_hh_og = nn.Linear(hidden_dim, hidden_dim, bias=True)
    self.layer_ih_og = nn.Linear(in_dim, hidden_dim, bias=False)
    #cell output
    self.layer_ho = nn.Linear(hidden_dim, out_dim, bias=True)
    #mode flag read in forward (note: this attribute shadows nn.Module.train on this block)
    self.train=True


  def forward(self,x):
    # x is (B,in_dim), h is (B, hidden_dim)

    if self.train:
      self.h = self.dropout_h(self.h)
      x = self.dropout_i(x)

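    ig = self.sigmoid( self.layer_hh_ig(self.h) + self.layer_ih_ig(x) ) #input gate
    fg = self.sigmoid( self.layer_hh_fg(self.h) + self.layer_ih_fg(x) ) #forget gate
    og = self.sigmoid( self.layer_hh_og(self.h) + self.layer_ih_og(x) ) #output gate
    hprop = self.tanh( self.layer_hh(self.h) + self.layer_ih(x) ) #proposed (candidate) update to the cell memory
    self.c = fg * self.c + ig * hprop #forget part of the old memory, write part of the proposal
    self.h = og * self.tanh(self.c) #expose part of the memory as the new hidden state
    y = self.layer_ho(self.h) # (B,out_dim)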
    return y

  def set_h(self, hnew, dev):
    self.h = hnew.to(dev)

  def set_c(self, cnew, dev):
    self.c = cnew.to(dev)

class LSTM(nn.Module):
  '''
  LSTM that accepts input tensors of dimension (T,B,I) where
  T = maximum length of the sequence (x1, x2, x3,...,xT)
  B = batch dimension
  I = input dimension
  So if the input tensor is x, then x[0,1,:] = the embedding of the first token in the second example sequence of the batch 
  '''
  def __init__(self, vocab_size, n_embd, block_dims, p, device):
    super(LSTM, self).__init__()
    # block_dims = [(in_dim, hidden_dim, out_dim) for all layers]
    self.depth = len(block_dims)
    self.device = device
    assert( all(block_dims[ii][2] == block_dims[ii+1][0] for ii in range(self.depth-1)) ) #check compatibility
    self.embedding = nn.Embedding(vocab_size, n_embd)
    self.layers = nn.Sequential(*[LSTMBlock(*d, p) for d in block_dims])

  def step(self, x):
    # x is (B,in_dim)
    y = self.layers[0](x)
    if self.depth > 1:
      for layer in self.layers[1:]:
        y = layer(y)
    return y

  def forward(self, z):
    emb = self.embedding(z) #(B, T, n_embd) with T = max_word_length+2, embeds characters into vector space
    x = emb.permute(1,0,2) #(T, B, n_embd), time-major

    T,B,I = x.size()

    for layer in self.layers: #init states, dropout
      layer.set_h(torch.zeros(layer.hidden_dim,), self.device)
      layer.set_c(torch.zeros(layer.hidden_dim,), self.device)
      if layer.train:
        layer.dropout_i.set_mask(B,layer.in_dim, self.device)
        layer.dropout_h.set_mask(B,layer.hidden_dim, self.device)

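    logits = torch.zeros((T-1,B,self.layers[-1].out_dim), device=self.device) #allocate on the model's device
    for t in range(T-1):  #the final token is never used as an input since nothing follows it
      logits[t]= self.step(x[t])
    return logits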

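  def eval(self):
    # Overrides nn.Module.eval: switches dropout off in every block
    for layer in self.layers:
      layer.train = False
    return self

  def train(self, mode=True):
    # Overrides nn.Module.train: switches dropout on (or off if mode=False) in every block
    for layer in self.layers:
      layer.train = mode
    return self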
  

In [None]:
n_embd = 20
n_hidden = 512
p = 0.5 #dropout rate
model = LSTM( vocab_size, n_embd, [(n_embd, n_hidden, vocab_size)],p, device)
model.to(device)
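
The recurrent dropout used in the blocks above samples one mask per sequence and reuses it at every time step, rather than drawing a fresh mask at each step. The sketch below illustrates that idea on dummy activations. The sizes are hypothetical, and the rescaling by `1 / keep_prob` (inverted dropout) is a common convention assumed here for illustration.

```python
import torch

torch.manual_seed(0)
T, B, D = 6, 2, 8          # hypothetical sequence length, batch size, feature size
keep_prob = 0.5            # probability of keeping a unit
x = torch.ones(T, B, D)    # dummy activations

# One Bernoulli mask per sequence, sampled once and reused at every time step
mask = torch.bernoulli(torch.full((B, D), keep_prob)) / keep_prob   # assumed inverted-dropout scaling
dropped = x * mask         # broadcasts the (B, D) mask over the time dimension

# The same units are zeroed at every step
print(torch.equal(dropped[0], dropped[-1]))
```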

In [None]:
print(sum(p.nelement() for p in model.parameters() ))
for p in model.parameters():
  p.requires_grad = True
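
The printed parameter count can be cross-checked by hand. The sketch below assumes the layer layout of `LSTMBlock` as defined above (four hidden-to-hidden layers with bias, four input-to-hidden layers without bias, and one biased output layer) plus the embedding table; `lstm_block_param_count` is a hypothetical helper.

```python
import torch.nn as nn

def lstm_block_param_count(in_dim, hidden_dim, out_dim):
    """Parameters of one block: four (hidden->hidden with bias, input->hidden
    without bias) pairs plus a biased output layer."""
    per_gate = (hidden_dim * hidden_dim + hidden_dim) + in_dim * hidden_dim
    return 4 * per_gate + (hidden_dim * out_dim + out_dim)

vocab_size, n_embd, n_hidden = 28, 20, 512   # values used in this notebook
expected = vocab_size * n_embd + lstm_block_param_count(n_embd, n_hidden, vocab_size)

# Cross-check the formula against freshly built nn.Embedding / nn.Linear layers
layers = [nn.Embedding(vocab_size, n_embd), nn.Linear(n_hidden, vocab_size)]
for _ in range(4):
    layers += [nn.Linear(n_hidden, n_hidden), nn.Linear(n_embd, n_hidden, bias=False)]
actual = sum(p.numel() for layer in layers for p in layer.parameters())
print(expected, actual)
```

Almost all of the parameters sit in the four `hidden_dim x hidden_dim` matrices, so the count grows roughly quadratically with `n_hidden`.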

In [None]:
max_steps = 200000
batch_size = 128
Xdev = Xdev.to(device)

lossi = []
model.train() #enable dropout
for i in range(max_steps):

  # constructing minibatch 
  ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
  Xb = Xtr[ix].to(device) #(batch_size, max_word_length+2), including start/end tokens and padding

  logits = model(Xb).to(device) #logits is (max_word_length+1, batch_size, vocab_size)

  loss = 0
  for t in range(logits.shape[0]): #one prediction for every token after the start token
    loss += F.cross_entropy(logits[t],Xb[:,t+1]) #cross entropy averaged over the minibatch, summed over time steps

  # backward pass
  for p in model.parameters():
    p.grad = None
  loss.backward()

  # update (plain SGD with a step decay of the learning rate)
  lr = 0.1 if i<100000 else 0.01
  with torch.no_grad():
    for p in model.parameters():
      p -= lr * p.grad

  # track stats
  lossi.append(loss.log10().item())

  if i % 10000 == 0:
    model.eval()
    with torch.no_grad():
      logits_dev = model(Xdev).to(device) #logits is (max_word_length+1, number of dev sequences, vocab_size)
      loss_dev = 0
      for t in range(logits_dev.shape[0]):
        loss_dev += F.cross_entropy(logits_dev[t],Xdev[:,t+1])
    model.train()
    print(f'{i:7d} / {max_steps:7d}  |  batch loss = {loss.item():.4f}  | dev loss = {loss_dev.item():.4f} ') #prints the batch loss and the dev loss

  
  

In [None]:
model.eval()

In [None]:
# loss on a full data split
@torch.no_grad()
def split_loss(split):
  x = { 
      'train': (Xtr),
      'val': (Xdev),
      'test': (Xte),
  }[split]
  x=x.to(device)

  logits = model(x).to(device) #logits is (max_word_length+1, number of sequences, vocab_size)
  loss = 0
  for t in range(logits.shape[0]): #sum the cross entropy over every predicted token
    loss += F.cross_entropy(logits[t],x[:,t+1])
  print(split, loss.item())
  
split_loss('train')
split_loss('test')
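
The per-time-step loop used to accumulate the loss can also be written as a single call. The sketch below checks, on random stand-in tensors with hypothetical sizes, that summing the batch-averaged cross-entropy over time steps equals one `F.cross_entropy` call with `reduction='sum'` divided by the batch size.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, B, V = 5, 3, 7                          # hypothetical sizes
X = torch.randint(0, V, (B, T))            # token indices, one row per sequence
logits = torch.randn(T - 1, B, V)          # stand-in for the model output, one score row per predicted step

# Loop form: sum over time steps of the batch-averaged cross-entropy
loss_loop = sum(F.cross_entropy(logits[t], X[:, t + 1]) for t in range(T - 1))

# Vectorized form: one call over all (step, sequence) pairs, summed, then divided by the batch size
targets = X[:, 1:].T                       # (T-1, B), aligned with logits
loss_vec = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1), reduction='sum') / B

print(loss_loop.item(), loss_vec.item())
```

The vectorized form avoids the Python loop over time steps, which can matter when evaluating on a full split.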

In [None]:
g = torch.Generator().manual_seed(2147483647 + 11)

with torch.no_grad():
  for _ in range(20):
    out = []
    current_input = torch.tensor([s_to_i['>']]).to(device)  
    for layer in model.layers:
      layer.set_h(torch.zeros(layer.hidden_dim,), device)
      layer.set_c(torch.zeros(layer.hidden_dim,), device)
    while True:
      x=model.embedding(current_input) #(1,n_embd)
      logits = model.step(x).to(device) #(1,vocab_size)
      probs = F.softmax(logits,dim=1) # get the output distribution on the characters
      ix = torch.multinomial(probs.squeeze().cpu(), num_samples=1, generator=g).item() #sample on the CPU so that the seeded generator g is used
      current_input =  torch.tensor([ix]).to(device)   #feed the sampled character back in as the next input
      out.append(ix)
      if ix == s_to_i['<']: #terminate at the end character '<'
        break
    print(''.join(i_to_s[i] for i in out))

In [None]:
logits.shape

## Deep LSTM

Next we will try connecting two LSTM blocks to improve the results. Recall that deeper architectures can be harder to train, particularly using just vanilla stochastic gradient descent and no other bells and whistles.
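
One example of such a "bell and whistle" is gradient-norm clipping, which is not used in this notebook. The sketch below shows it on a tiny stand-in model with deliberately large targets; `max_norm = 1.0` is a hypothetical threshold, and the linear model is only a placeholder for the LSTM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model_demo = nn.Linear(10, 1)                            # tiny stand-in model, not the LSTM above
x_demo = torch.randn(32, 10)
y_demo = 100 * torch.randn(32, 1)                        # large targets to provoke large gradients

loss_demo = ((model_demo(x_demo) - y_demo) ** 2).mean()
loss_demo.backward()

max_norm = 1.0                                           # hypothetical threshold
# Rescales all gradients in place so their combined L2 norm is at most max_norm;
# returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model_demo.parameters(), max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model_demo.parameters()))
print(norm_before.item(), norm_after.item())
```

In a training loop, the clipping call would sit between `loss.backward()` and the parameter update.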

In [None]:
n_embd = 20
n_hidden = 512
p = 0.5 #dropout rate

model = LSTM(vocab_size, n_embd, [(n_embd, n_hidden, n_hidden), (n_hidden, n_hidden, vocab_size)],p,  device) #2 LSTM blocks
model.to(device)

In [None]:
print(sum(p.nelement() for p in model.parameters() ))
for p in model.parameters():
  p.requires_grad = True

In [None]:
max_steps = 200000
batch_size = 128
Xdev = Xdev.to(device)

lossi = []
model.train() #enable dropout
for i in range(max_steps):

  # constructing minibatch 
  ix = torch.randint(0, Xtr.shape[0], (batch_size,), generator=g)
  Xb = Xtr[ix].to(device) #(batch_size, max_word_length+2), including start/end tokens and padding

  logits = model(Xb).to(device) #logits is (max_word_length+1, batch_size, vocab_size)

  loss = 0
  for t in range(logits.shape[0]): #one prediction for every token after the start token
    loss += F.cross_entropy(logits[t],Xb[:,t+1]) #cross entropy averaged over the minibatch, summed over time steps

  # backward pass
  for p in model.parameters():
    p.grad = None
  loss.backward()

  # update (plain SGD with a step decay of the learning rate)
  lr = 0.1 if i<100000 else 0.01
  with torch.no_grad():
    for p in model.parameters():
      p -= lr * p.grad

  # track stats
  lossi.append(loss.log10().item())

  if i % 10000 == 0:
    model.eval()
    with torch.no_grad():
      logits_dev = model(Xdev).to(device) #logits is (max_word_length+1, number of dev sequences, vocab_size)
      loss_dev = 0
      for t in range(logits_dev.shape[0]):
        loss_dev += F.cross_entropy(logits_dev[t],Xdev[:,t+1])
    model.train()
    print(f'{i:7d} / {max_steps:7d}  |  batch loss = {loss.item():.4f}  | dev loss = {loss_dev.item():.4f} ') #prints the batch loss and the dev loss

In [None]:
model.eval()

In [None]:
# loss on a full data split
@torch.no_grad()
def split_loss(split):
  x = { 
      'train': (Xtr),
      'val': (Xdev),
      'test': (Xte),
  }[split]
  x=x.to(device)

  logits = model(x).to(device) #logits is (max_word_length+1, number of sequences, vocab_size)
  loss = 0
  for t in range(logits.shape[0]): #sum the cross entropy over every predicted token
    loss += F.cross_entropy(logits[t],x[:,t+1])
  print(split, loss.item())
  
split_loss('train')
split_loss('test')

In [None]:
g = torch.Generator().manual_seed(2147483647 + 11)

with torch.no_grad():
  for _ in range(20):
    out = []
    current_input = torch.tensor([s_to_i['>']]).to(device)  
    for layer in model.layers:
      layer.set_h(torch.zeros(layer.hidden_dim,), device)
      layer.set_c(torch.zeros(layer.hidden_dim,), device)
    while True:
      x=model.embedding(current_input) #(1,n_embd)
      logits = model.step(x).to(device) #(1,vocab_size)
      probs = F.softmax(logits,dim=1) # get the output distribution on the characters
      ix = torch.multinomial(probs.squeeze().cpu(), num_samples=1, generator=g).item() #sample on the CPU so that the seeded generator g is used
      current_input =  torch.tensor([ix]).to(device)   #feed the sampled character back in as the next input
      out.append(ix)
      if ix == s_to_i['<']: #terminate at the end character '<'
        break
    print(''.join(i_to_s[i] for i in out))