# Predicting the English word version of numbers using an RNN

In the previous lesson, we used RNNs as part of our language model.  Today, we will look more closely at what RNNs are and how they work, using the problem of predicting the English word version of numbers.

Let's predict what should come next in this sequence:

*eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve...*


Jeremy created this synthetic dataset to make it easier to check that things are working, to debug, and to understand what is going on. When experimenting with new ideas, it helps to have a small dataset so that you can quickly get a sense of whether your ideas are promising (for other examples, see [Imagenette and Imagewoof](https://github.com/fastai/imagenette)). These English word numbers make a good dataset for learning about RNNs.  Our task today will be to predict which word comes next when counting.
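
To get a feel for the data, here is a minimal sketch of how a sequence like this could be generated. It is not how the dataset was actually built; `to_words`, `ONES` and `TENS` are hypothetical names, and the spelling convention (no hyphens, no "and") is inferred from the samples shown in this notebook.

```python
# Hypothetical helper: spell out 1..9999 in the style of the samples above,
# e.g. 8001 -> "eight thousand one".
ONES = ("one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

def to_words(n):
    assert 1 <= n <= 9999
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands: parts += [ONES[thousands - 1], "thousand"]
    if hundreds: parts += [ONES[hundreds - 1], "hundred"]
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS[tens - 2])
        if ones: parts.append(ONES[ones - 1])
    elif rest:
        parts.append(ONES[rest - 1])
    return " ".join(parts)

sample = ", ".join(to_words(n) for n in range(8001, 8004))
print(sample)  # eight thousand one, eight thousand two, eight thousand three
```

Because the sequence is fully deterministic, a model that has learned the counting pattern should be able to predict most next words correctly, which is what makes the dataset handy for debugging.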

## Data

In [83]:
from fastai.text import *

In [84]:
bs=64

In [85]:
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

[PosixPath('/home/jhoward/.fastai/data/human_numbers/train.txt'),
 PosixPath('/home/jhoward/.fastai/data/human_numbers/valid.txt')]

In [86]:
def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]

train.txt gives us a sequence of numbers written out as English words:

In [87]:
train_txt = readnums('train.txt'); train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

In [88]:
valid_txt = readnums('valid.txt'); valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'

In [89]:
train = TextList(train_txt, path=path)
valid = TextList(valid_txt, path=path)

src = ItemLists(path=path, train=train, valid=valid).label_for_lm()
data = src.databunch(bs=bs)

In [90]:
train[0].text[:80]

'xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleve'

`bptt` stands for *back-propagation through time*.  This tells us how many steps of history we are considering.

In [91]:
data.bptt, len(data.valid_dl)

(70, 3)

We have 3 batches in our validation set: 13017 tokens in total, about 70 tokens per line of text, and 64 lines of text per batch.
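
The batch count can be checked with a little arithmetic, and the layout can be illustrated with a toy loader. The sketch below is a simplification, not fastai's actual data loader: `lm_batches` is a hypothetical function that splits one long token stream into `bs` contiguous rows, cuts each row into windows of `bptt` tokens, and uses the same window shifted by one token as the target.

```python
import math

# Numbers quoted above for the validation set.
n_tokens, bs, bptt = 13017, 64, 70
tokens_per_row = n_tokens // bs               # tokens held by each of the 64 rows
n_batches = math.ceil(tokens_per_row / bptt)  # windows of up to bptt tokens per row
print(tokens_per_row, n_batches)              # 203 3

def lm_batches(stream, bs, bptt):
    """Toy layout (an assumption, not fastai's loader): yield (x, y) pairs
    where y is x shifted one token to the right."""
    row_len = (len(stream) - 1) // bs
    # Each row keeps one extra token so the last target is available.
    rows = [stream[r * row_len:(r + 1) * row_len + 1] for r in range(bs)]
    for start in range(0, row_len, bptt):
        end = min(start + bptt, row_len)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

toy = list(range(21))                          # stand-in for a token stream
batches = list(lm_batches(toy, bs=2, bptt=4))
print(batches[0])  # ([[0, 1, 2, 3], [10, 11, 12, 13]], [[1, 2, 3, 4], [11, 12, 13, 14]])
```

Each row continues where it left off in the previous batch, which is why the models later in this notebook can carry their hidden state from one batch to the next.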

We will store each batch in a separate variable, so we can walk through this to understand better what the RNN does at each step:

In [92]:
it = iter(data.valid_dl)
x1,y1 = next(it)
x2,y2 = next(it)
x3,y3 = next(it)
it.close()

In [93]:
v = data.valid_ds.vocab

In [94]:
data = src.databunch(bs=bs, bptt=40)

In [95]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 40]), torch.Size([64, 40]))

In [96]:
nv = len(v.itos); nv

40

In [97]:
nh=56

In [98]:
def loss4(input,target): return F.cross_entropy(input, target[:,-1])
def acc4 (input,target): return accuracy(input, target[:,-1])

Layer names:
- `i_h`: input to hidden
- `h_h`: hidden to hidden
- `h_o`: hidden to output
- `bn`: batchnorm
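
As a rough illustration of how these layers fit together, here is a minimal sketch of a plain RNN that predicts only the word after the last input token. `PlainRNN` is a hypothetical name, and `nn.BatchNorm1d` stands in for fastai's `BatchNorm1dFlat` because the hidden state here is 2-D; the exact wiring is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class PlainRNN(nn.Module):
    """Hypothetical minimal RNN built from the layers named above."""
    def __init__(self, nv, nh):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)  # input to hidden
        self.h_h = nn.Linear(nh, nh)     # hidden to hidden
        self.h_o = nn.Linear(nh, nv)     # hidden to output
        self.bn = nn.BatchNorm1d(nh)     # batchnorm over the hidden features

    def forward(self, x):
        h = torch.zeros(x.shape[0], self.h_h.in_features, device=x.device)
        for t in range(x.shape[1]):               # walk along the time axis
            h = h + self.i_h(x[:, t])              # mix in the current token
            h = self.bn(torch.relu(self.h_h(h)))   # update the hidden state
        return self.h_o(h)                         # scores for the next token only

model = PlainRNN(nv=40, nh=56)
out = model(torch.randint(0, 40, (64, 40)))
print(out.shape)  # torch.Size([64, 40])
```

A model like this returns one prediction per sequence, which is the situation `loss4` and `acc4` are written for: they compare the output with `target[:,-1]` only.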

## Adding a GRU

With long time scales and deeper networks, plain RNNs become nearly impossible to train.  One way to address this is to add mini neural networks that decide how much of the green arrow and how much of the orange arrow (in the lesson's RNN diagram) to keep.  Architectures built around such mini-NNs include GRUs and LSTMs.
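
For reference, the gating in a GRU can be written as follows. This is a sketch in the formulation that PyTorch documents for `nn.GRU`, and it matches the custom `GRUCell` later in this notebook; the symbols $x_t$ (current input), $h_{t-1}$ (previous hidden state), $W$ and $b$ are notation introduced here, not taken from the lesson.

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\bigl(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\bigr) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot n_t
\end{aligned}
$$

The update gate $z_t$ decides how much of the old hidden state to keep versus the new candidate $n_t$, and the reset gate $r_t$ decides how much of the old state feeds into that candidate.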

In [99]:
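class Model5(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)                  # input to hidden
        self.rnn = nn.GRU(nh, nh, 1, batch_first=True)  # one-layer GRU
        self.h_o = nn.Linear(nh,nv)                     # hidden to output
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(1, bs, nh).cuda()          # initial hidden state: (layers, batch, hidden); assumes a GPU
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()  # carry the state into the next batch, but cut its gradient history
        return self.h_o(self.bn(res))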

In [100]:
nv, nh

(40, 56)

In [134]:
learn = Learner(data, Model5(), metrics=accuracy)

In [135]:
learn.fit_one_cycle(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.540133,3.465589,0.240951,00:00
1,2.788598,2.131198,0.388867,00:00
2,2.147029,1.86837,0.441536,00:00
3,1.768475,1.858901,0.475521,00:00
4,1.472611,1.808398,0.623893,00:00
5,1.204455,1.676029,0.621549,00:00
6,0.97117,1.593996,0.674219,00:00
7,0.779145,1.55477,0.663021,00:00
8,0.63321,1.524638,0.700195,00:00
9,0.529737,1.528556,0.704883,00:00


## Let's make our own GRU

### Using PyTorch's GRUCell

Axis 0 is the batch dimension, and axis 1 is the time dimension.  We want to loop through axis 1:

In [130]:
def rnn_loop(cell, h, x):
    res = []
    for x_ in x.transpose(0,1):
        h = cell(x_, h)
        res.append(h)
    return torch.stack(res, dim=1)
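
One way to convince yourself that this loop does the same job as `nn.GRU` is to copy a one-layer GRU's weights into an `nn.GRUCell` and compare the outputs. The sketch below does that on small, made-up sizes (`bs`, `seq_len` and `nh` here are hypothetical values for the demo, not the notebook's settings) and runs on the CPU.

```python
import torch
import torch.nn as nn

def rnn_loop(cell, h, x):
    res = []
    for x_ in x.transpose(0, 1):    # each x_ is one time step: (batch, features)
        h = cell(x_, h)
        res.append(h)
    return torch.stack(res, dim=1)  # back to (batch, time, hidden)

torch.manual_seed(0)
bs, seq_len, nh = 4, 5, 8           # small hypothetical sizes
gru = nn.GRU(nh, nh, 1, batch_first=True)
cell = nn.GRUCell(nh, nh)

# Copy the single-layer GRU's parameters into the cell.
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

x = torch.randn(bs, seq_len, nh)
h0 = torch.zeros(bs, nh)
looped = rnn_loop(cell, h0, x)
fused, _ = gru(x, h0.unsqueeze(0))  # nn.GRU expects h0 as (layers, batch, hidden)
same = torch.allclose(looped, fused, atol=1e-5)
print(looped.shape, same)
```

If the shapes agree and `same` is true, the explicit loop and the fused layer are computing the same hidden states, just with different amounts of work hidden inside PyTorch.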

In [131]:
class Model6(Model5):
    def __init__(self):
        super().__init__()
        self.rnnc = nn.GRUCell(nh, nh)
        self.h = torch.zeros(bs, nh).cuda()
        
    def forward(self, x):
        res = rnn_loop(self.rnnc, self.h, self.i_h(x))
        self.h = res[:,-1].detach()
        return self.h_o(self.bn(res))

In [133]:
learn = Learner(data, Model6(), metrics=accuracy)
learn.fit_one_cycle(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.436362,3.394695,0.352865,00:00
1,2.739339,2.135875,0.464258,00:00
2,2.118275,1.790151,0.483854,00:00
3,1.757279,1.725953,0.506576,00:00
4,1.468234,1.623885,0.566797,00:00
5,1.191478,1.475327,0.656771,00:00
6,0.94203,1.302567,0.72832,00:00
7,0.739052,1.320555,0.759049,00:00
8,0.588394,1.269386,0.763737,00:00
9,0.483452,1.246448,0.773112,00:00


### With a custom GRUCell

The following is based on code from [emadRad](https://github.com/emadRad/lstm-gru-pytorch/blob/master/lstm_gru.ipynb):

In [136]:
class GRUCell(nn.Module):
    def __init__(self, ni, nh):
        super().__init__()
        self.ni,self.nh = ni,nh
        self.i2h = nn.Linear(ni, 3*nh)  # input contributions to the 3 gates
        self.h2h = nn.Linear(nh, 3*nh)  # hidden-state contributions to the 3 gates
    
    def forward(self, x, h):
        # No squeeze() here: it would drop the batch axis when the batch size is 1
        gate_x = self.i2h(x)
        gate_h = self.h2h(h)
        i_r,i_u,i_n = gate_x.chunk(3, 1)
        h_r,h_u,h_n = gate_h.chunk(3, 1)
        
        resetgate = torch.sigmoid(i_r + h_r)         # how much of the old state feeds the candidate
        updategate = torch.sigmoid(i_u + h_u)        # how much of the old state to keep
        newgate = torch.tanh(i_n + (resetgate*h_n))  # candidate hidden state
        return updategate*h + (1-updategate)*newgate
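
As a sanity check, a cell written this way can be compared against PyTorch's own `nn.GRUCell` by copying the weights across. The sketch below restates the cell so that it runs on its own; the sizes `ni`, `nh` and `bs` are hypothetical values chosen for the demo, and it relies on PyTorch stacking its gate weights in reset, update, new order.

```python
import torch
import torch.nn as nn

class GRUCell(nn.Module):
    # Restates the custom cell so this snippet is self-contained.
    def __init__(self, ni, nh):
        super().__init__()
        self.i2h = nn.Linear(ni, 3*nh)
        self.h2h = nn.Linear(nh, 3*nh)

    def forward(self, x, h):
        i_r, i_u, i_n = self.i2h(x).chunk(3, 1)
        h_r, h_u, h_n = self.h2h(h).chunk(3, 1)
        resetgate = torch.sigmoid(i_r + h_r)
        updategate = torch.sigmoid(i_u + h_u)
        newgate = torch.tanh(i_n + resetgate*h_n)
        return updategate*h + (1-updategate)*newgate

torch.manual_seed(0)
ni, nh, bs = 6, 8, 4                      # small hypothetical sizes
ref, mine = nn.GRUCell(ni, nh), GRUCell(ni, nh)

# Copy the reference cell's parameters into the two linear layers.
with torch.no_grad():
    mine.i2h.weight.copy_(ref.weight_ih)
    mine.i2h.bias.copy_(ref.bias_ih)
    mine.h2h.weight.copy_(ref.weight_hh)
    mine.h2h.bias.copy_(ref.bias_hh)

x, h = torch.randn(bs, ni), torch.randn(bs, nh)
matches = torch.allclose(mine(x, h), ref(x, h), atol=1e-5)
print(matches)
```

With identical weights the two cells should agree up to floating-point error, so differences in training results between the models come from initialization and run-to-run randomness rather than from the math.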

In [137]:
class Model7(Model6):
    def __init__(self):
        super().__init__()
        self.rnnc = GRUCell(nh,nh)

In [139]:
learn = Learner(data, Model7(), metrics=accuracy)
learn.fit_one_cycle(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.538269,3.513104,0.276562,00:01
1,2.815921,2.241536,0.364844,00:01
2,2.140461,1.959162,0.418424,00:01
3,1.704312,1.87093,0.490104,00:01
4,1.350341,1.672874,0.611784,00:01
5,1.040827,1.563094,0.664974,00:01
6,0.793811,1.539096,0.716276,00:01
7,0.606719,1.480668,0.727604,00:01
8,0.472559,1.45617,0.731836,00:01
9,0.380289,1.481691,0.732617,00:01


### Connection to ULMFiT

In the previous lesson, we were essentially swapping out `self.h_o` for a classifier in order to classify text.

An RNN is just a refactored, fully-connected neural network.

You can use the same approach for any sequence labeling task (part-of-speech tagging, classifying whether material is sensitive, and so on).
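
To make the idea of swapping the output layer concrete, here is a minimal sketch. It is not the actual ULMFiT classifier, which is more elaborate and starts from a pretrained encoder; `TextClassifierSketch`, `head`, `n_classes` and `per_token` are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn

class TextClassifierSketch(nn.Module):
    def __init__(self, nv, nh, n_classes):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.rnn = nn.GRU(nh, nh, 1, batch_first=True)
        self.head = nn.Linear(nh, n_classes)  # takes the place of h_o

    def forward(self, x, per_token=False):
        res, h = self.rnn(self.i_h(x))  # res: (batch, time, nh); h: (1, batch, nh)
        if per_token:
            return self.head(res)       # one label per token (sequence labeling)
        return self.head(h[-1])         # one label per sequence (classification)

model = TextClassifierSketch(nv=40, nh=56, n_classes=2)
x = torch.randint(0, 40, (8, 40))
doc_logits = model(x)
tok_logits = model(x, per_token=True)
print(doc_logits.shape, tok_logits.shape)  # torch.Size([8, 2]) torch.Size([8, 40, 2])
```

The embedding and recurrent layers are unchanged; only the final layer and the shape of its output differ between predicting the next word, classifying a whole text, and labeling each token.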

## fin