In [None]:
%%capture
!pip install --upgrade fastai

# Data Preparation
We start by preaparing a dataset comprising "human numbers" i.e. numbers as English language words. This is a simple example prepared by J.H. et. al to demonstrate the construction of an RNN.

In [2]:
from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

(#2) [Path('/home/djliden91/.fastai/data/human_numbers/valid.txt'),Path('/home/djliden91/.fastai/data/human_numbers/train.txt')]

In [3]:
lines = L()
with open(path/'train.txt') as f: lines += L(*f.readlines())
with open(path/'train.txt') as f: lines += L(*f.readlines())
lines

(#15998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

Now we concatenate into one big stream.

In [5]:
text = ' . '.join([l.strip() for l in lines])
text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

Next we tokenize

In [6]:
tokens = text.split(' ')
tokens[:10]

['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

In [8]:
vocab = L(*tokens).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

In [9]:
word2idx = {w:i for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
nums

(#100157) [0,1,2,1,3,1,4,1,5,1...]

# Our First Language Model from Scratch
Our first rudimentary approach to modeling will take our input stream and convert it into sequences of three words with the aim of predicting each fourth word.

In [12]:
L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4,3))

(#33385) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

In [14]:
# In numericalized form, which the model can actually use
seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs

(#33385) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]

In [15]:
bs = 64
cut = int(len(seqs)*0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

Our model will use three linear layers with the following tweaks:
1. layer 1 uses the first word's embeddings as activations. Layer 2 uses second word's embeddings + first layer's outputs. Layer 3 uses uses third word's embeddings + second layer's output activations.
2. Each layer uses the same weight matrix. Activation weights change from layer to layer but layer weights do not. *I don't currently understand the distinction*.

In [18]:
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
    
    def forward(self,x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

In [19]:
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func = F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.497567,1.801787,0.466677,00:04
1,1.348261,1.536236,0.480306,00:04
2,1.360052,1.383676,0.497978,00:04
3,1.266293,1.375448,0.501872,00:04


How can we tell if this is any good at all? Let's set a baseline by simply predicting the most common class.

In [20]:
n,counts = 0, torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

(tensor(29), 'thousand', 0.15291298487344615)

We may have suspected at first the the separator would be the most common character. It's still not entirely clear to me why it isn't. It might reflect the composition of the validation set.

In [37]:
dls.valid.items[1:10]

[(tensor([28, 20,  7]), 1),
 (tensor([ 1,  4, 29]), 9),
 (tensor([ 9, 28, 20]), 8),
 (tensor([8, 1, 4]), 29),
 (tensor([29,  9, 28]), 20),
 (tensor([20,  9,  1]), 4),
 (tensor([ 4, 29,  9]), 28),
 (tensor([28, 21,  1]), 4),
 (tensor([ 4, 29,  9]), 28)]

# Making an RNN
We can refactor our previous code to have the structure of an RNN, with the benefit of not being restricted to token sequences of the same length.

In [71]:
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
    
    def forward(self,x):
        h=0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

In [72]:
learn = Learner(dls, LMModel2(len(vocab), 64), loss_func = F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.47995,1.847123,0.468474,00:04
1,1.339531,1.639733,0.475363,00:05
2,1.368755,1.414839,0.494084,00:05
3,1.280326,1.393219,0.49693,00:04


What is the difference here? We've turned the sequence of adding each data input into a loop. The variable *h* is referred to as the "hidden state." It's not *quite* clear why we can apply this to token lists of different lengths. Maybe it's just easier to change the `range(3)` to something else than it is to manually add more, as in the previous version.

This is a "recurrent neural network" which, according to JH, is essentially just "a refactoring of a multilayer neural network using a for loop."

## Making the RNN Better
- One immediate issue: we initialize our hidden state to zero for each new input seq. Why does this matter? Losing some information. If our sequences are read in order, perhaps there is information to be gained from earlier sequences. Furthermore, we could use the preceding sequences to predict the second and third words, not just the fourth.

In other words, we want to maintain the state of our rnn. But this presents its own problem. By storing state, we're essentially making a NN as deep (with as many layers) as the number of tokens. We would still need to calculate derivatives back to the very first layer: slow and memory intensive.

We can solve this by keeping only the last three layers of derivatives. We do not "backpropagate...through the entire implicit neural network."

Here is our "stateful" rnn:

In [75]:
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0
    
    def forward(self,x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out = self.h_o(self.h)
        self.h = self.h.detach()
        return out
    
    def reset(self): self.h = 0

For this to work, we need to make sure the samples will be observed in a particular order. (resume on page 383).

In [76]:
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func = F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.526187,1.756079,0.495643,00:04


AssertionError: ==:
64
21