In [None]:
%%capture
!pip install --upgrade fastai

# Data Preparation
We start by preaparing a dataset comprising "human numbers" i.e. numbers as English language words. This is a simple example prepared by J.H. et. al to demonstrate the construction of an RNN.

In [None]:
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

In [None]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'train.txt') as f: lines += L(*f.readlines())
lines

Now we concatenate into one big stream.

In [None]:
text = ' . '.join([l.strip() for l in lines])
text[:100]

Next we tokenize

In [None]:
tokens = text.split(' ')
tokens[:10]

In [None]:
vocab = L(*tokens).unique()
vocab

In [None]:
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
nums

# Our First Language Model from Scratch
Our first rudimentary approach to modeling will take our input stream and convert it into sequences of three words with the aim of predicting each fourth word.

In [None]:
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4,3))

In [None]:
# In numericalized form, which the model can actually use
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs

In [None]:
bs = 64
cut = int(len(seqs)*0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

Our model will use three linear layers with the following tweaks:
1. layer 1 uses the first word's embeddings as activations. Layer 2 uses second word's embeddings + first layer's outputs. Layer 3 uses uses third word's embeddings + second layer's output activations.
2. Each layer uses the same weight matrix. Activation weights change from layer to layer but layer weights do not. *I don't currently understand the distinction*.

In [None]:
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
    
    def forward(self,x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

In [None]:
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func = F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

How can we tell if this is any good at all? Let's set a baseline by simply predicting the most common class.

In [None]:
n,counts = 0, torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

We may have suspected at first the the separator would be the most common character. It's still not entirely clear to me why it isn't. It might reflect the composition of the validation set.

In [None]:
dls.valid.items[1:10]

# Making an RNN
We can refactor our previous code to have the structure of an RNN, with the benefit of not being restricted to token sequences of the same length.

In [None]:
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
    
    def forward(self,x):
        h=0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

In [None]:
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func = F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

What is the difference here? We've turned the sequence of adding each data input into a loop. The variable *h* is referred to as the "hidden state." It's not *quite* clear why we can apply this to token lists of different lengths. Maybe it's just easier to change the `range(3)` to something else than it is to manually add more, as in the previous version.

This is a "recurrent neural network" which, according to JH, is essentially just "a refactoring of a multilayer neural network using a for loop."

## Making the RNN Better
- One immediate issue: we initialize our hidden state to zero for each new input seq. Why does this matter? Losing some information. If our sequences are read in order, perhaps there is information to be gained from earlier sequences. Furthermore, we could use the preceding sequences to predict the second and third words, not just the fourth.

In other words, we want to maintain the state of our rnn. But this presents its own problem. By storing state, we're essentially making a NN as deep (with as many layers) as the number of tokens. We would still need to calculate derivatives back to the very first layer: slow and memory intensive.

We can solve this by keeping only the last three layers of derivatives. We do not "backpropagate...through the entire implicit neural network."

Here is our "stateful" rnn:

In [None]:
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0
    
    def forward(self,x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out
    
    def reset(self): self.h = 0

For this to work, we need to make sure the samples will be observed in a particular order. (resume on page 383). Steps:
1. `Divide samples into m = len(dset)//bs` groups. m is the length of each piece. Recall `seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))` (our numericalized groupings of four).

In [19]:
m = len(seqs)//bs
m,bs,len(seqs)

(521, 64, 33385)

The first batch will have samples `(0, m, 2*m, ..., (bs-1)*m)`. The second will have `(1, m+1, 2*m+1, ..., (bs-1)*m+1)`, etc. So at each epoch, the model will see a chunk of contiguous text of size `3*m` (each text is size 3). We accomplish the indexing as follows:

In [20]:
def group_chunks(ds,bs):
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i+m*j] for j in range(bs))
    return new_ds

When specifying our original dataset, we drop the last batch that does not have the correct shape, and we pass shuffle=False to preserve order.

In [21]:
cut = int(len(seqs)*0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),
    group_chunks(seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False)

Lastly, we tweak our learner to call the `reset` method at the beginning of each epoch and before each validation phase.

In [22]:
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func = F.cross_entropy, metrics=accuracy,
               cbs = ModelResetter)
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.481962,1.477567,0.485577,00:03
1,1.106941,1.184971,0.548528,00:03
2,0.983626,1.202155,0.547175,00:03
3,0.933251,1.250246,0.548227,00:03
4,0.900051,1.093886,0.578876,00:03
5,0.868421,1.063206,0.573167,00:03
6,0.843107,0.971419,0.599609,00:03
7,0.818441,0.913763,0.620042,00:03
8,0.772309,0.851625,0.63762,00:03
9,0.751669,0.847521,0.639273,00:03


Thinking a little more about data setup. Batches are run "all at once" on the GPU, so it wouldn't make sense to have contiguous stretches of text in the same batch. That's why it's arranged as it is. We want the first text on the first line of the first batch; the second on the first line of the second batch; etc.

An improvement already. Next we want to use more targets and compare them to intermediate predictions. We're wasting some of our inputs by only predicting one "output" word for every three "input" words. We could be predicting *every* word based on the preceding word(s).

In [29]:
sl = 16 # sequence length
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1]))
         for i in range(0, len(nums) - sl -1, sl))
cut = int(len(seqs)*0.8)
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], bs),
    group_chunks(seqs[cut:], bs),
    bs=bs, drop_last=True, shuffle=False)

In [24]:
[L(vocab[o] for o in s) for s in seqs[0]]

[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]

Now we update the model to output a prediction after every word.

In [54]:
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.h_h = nn.Linear(n_hidden, n_hidden)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = 0
        
    def forward(self, x):
        outs = []
        for i in range(sl):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)
    
    def reset(self): self.h = 0
        
# Need to adjust loss to match dims of output

def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))

In [56]:
learn = Learner(dls, LMModel4(len(vocab), 64), loss_func = loss_func, metrics=accuracy,
               cbs = ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.01977,2.397986,0.349918,00:01
1,1.844865,1.454539,0.464844,00:01
2,1.415998,1.269139,0.522924,00:01
3,1.170695,1.03537,0.608501,00:01
4,0.95994,0.86569,0.660105,00:01
5,0.841344,0.764048,0.698705,00:01
6,0.722775,0.69201,0.727539,00:01
7,0.658202,0.641517,0.748664,00:01
8,0.568909,0.599036,0.768709,00:01
9,0.538976,0.569716,0.78418,00:01


We ended up with much better results than given by the book. I wonder why. The book notes that results can vary considerably.

# Multilayer RNNs
We may be able to improve our model by adding more layers. The activations from our RNN are passed to a second RNN. We will use the RNN class from pytorch, which implements what we created earlier.

In [59]:
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first = True)     
        self.h_o = nn.Linear(n_hidden,vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)
        
    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(res)
    
    def reset(self): self.h.zero_()
        
# Need to adjust loss to match dims of output

learn = Learner(dls, LMModel5(len(vocab), 64, 12), loss_func = CrossEntropyLossFlat(), metrics=accuracy,
               cbs = ModelResetter)
learn.fit_one_cycle(15, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,2.947782,2.77196,0.153577,00:04
1,2.562264,1.815209,0.440173,00:04
2,1.650918,1.345369,0.484118,00:03
3,1.296438,1.253963,0.499332,00:03
4,1.198259,1.233383,0.503341,00:03
5,1.16715,1.231212,0.506528,00:03
6,1.156915,1.244108,0.503033,00:03
7,1.10419,1.13127,0.536647,00:04
8,1.022914,1.022675,0.599609,00:03
9,0.902188,0.910715,0.648643,00:03


Interestingly, this model did very well. The book's model did not, and argued that exploding or vanishing activations presented a problem here. With two linear layers, ended up with > 90% accuracy. With twelve, ended up with 72%.

## Exploding or Disappearing Activations
Activations of numbers >1 can "explode" or grow very large if multiplied many times. Numbers less than one can "disappear" or grow close to zero. This has something to do with increasingly-inaccurate floating point arithmetic. The result is that, in SGD, some weights go to infinity and some weights are not updated. There are two main methods for dealing with this in RNNs:
- LSTM: long short-term memory layers. There are two hidden states instead of one. The hidden state is responsible for predicting the next token, while the new hidden state, or "cell state," is responsible for keeping *long short-term memory.* This can help it with, e.g., remembering correct pronouns.
- Gated recurrent units (GRUs) -- variant on LSTM.

## Building an LSTM from scratch