This notebook was part of [Lesson 7](https://course.fast.ai/videos/?lesson=7) of the Practical Deep Learning for Coders course.

# Predicting English word version of numbers using an RNN

We were using RNNs as part of our language model in the previous lesson.  Today, we will dive into more details of what RNNs are and how they work.  We will do this using the problem of trying to predict the English word version of numbers.

Let's predict what should come next in this sequence:

*eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve...*


Jeremy created this synthetic dataset to have a better way to check whether things are working, to debug, and to understand what is going on. When experimenting with new ideas, it helps to have a small dataset so you can quickly get a sense of whether your ideas are promising (for other examples, see [Imagenette and Imagewoof](https://github.com/fastai/imagenette)). These English-word numbers will serve as a good dataset for learning about RNNs.  Our task today will be to predict which word comes next when counting.

### In deep learning, there are 2 types of numbers

**Parameters** are numbers that are learned.  **Activations** are numbers that are calculated (by affine functions & element-wise non-linearities).

When you learn about any new concept in deep learning, ask yourself: is this a parameter or an activation?
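To make the distinction concrete, here is a minimal sketch using a single `nn.Linear` layer. The layer sizes and variable names are made up for illustration; only the parameter/activation split comes from the text above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(3, 2)   # an affine function; it owns the learned numbers
x = torch.randn(4, 3)     # a batch of 4 inputs

# Parameters: stored inside the layer and updated by the optimizer.
param_shapes = {name: tuple(p.shape) for name, p in layer.named_parameters()}

# Activations: recomputed on every forward pass from inputs and parameters.
affine_out = layer(x)                  # result of the affine function
activation = torch.relu(affine_out)    # result of the element-wise non-linearity

print(param_shapes)
print(tuple(activation.shape))
```

The weight and bias persist between calls, while `affine_out` and `activation` exist only for the batch that produced them.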

Note to self: Point out the hidden state, going from the version without a for-loop to the for loop.  This is the step where people get confused.

## Data

In [1]:
from fastai.text import *

In [2]:
bs=64

In [4]:
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

Downloading http://files.fast.ai/data/examples/human_numbers


[WindowsPath('C:/Users/Chiranth.Hegde/.fastai/data/human_numbers/train.txt'),
 WindowsPath('C:/Users/Chiranth.Hegde/.fastai/data/human_numbers/valid.txt')]

In [5]:
def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]

train.txt gives us a sequence of numbers written out as English words:

In [6]:
train_txt = readnums('train.txt'); train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

In [8]:
valid_txt = readnums('valid.txt'); valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'

In [9]:
train = TextList(train_txt, path=path)
valid = TextList(valid_txt, path=path)

src = ItemLists(path=path, train=train, valid=valid).label_for_lm()
data = src.databunch(bs=bs)

In [10]:
train[0].text[:80]

'xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleve'

In [11]:
len(data.valid_ds[0][0].data)

13017

`bptt` stands for *back-propagation through time*.  This tells us how many steps of history we are considering.

In [12]:
data.bptt, len(data.valid_dl)

(70, 3)

We have 3 batches in our validation set: 13017 tokens, with about 70 tokens per line of text and 64 lines of text per batch.

In [13]:
13017/70/bs

2.905580357142857

We will store each batch in a separate variable, so we can walk through this to understand better what the RNN does at each step:

In [14]:
it = iter(data.valid_dl)
x1,y1 = next(it)
x2,y2 = next(it)
x3,y3 = next(it)
it.close()

In [15]:
x1

tensor([[ 2, 19, 11,  ..., 36,  9, 19],
        [ 9, 19, 11,  ..., 24, 20,  9],
        [11, 27, 18,  ...,  9, 19, 11],
        ...,
        [20, 11, 20,  ..., 11, 20, 10],
        [20, 11, 20,  ..., 24,  9, 20],
        [20, 10, 26,  ..., 20, 11, 20]])

`numel()` is a [PyTorch method](https://pytorch.org/docs/stable/torch.html#torch.numel) to return the number of elements in a tensor:

In [17]:
x1.numel()+x2.numel()+x3.numel()

13440
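The three batches hold 13440 elements, slightly more than the 13017 tokens in the validation stream. The arithmetic below is a small sketch of where that number comes from. How the data loader fills the extra slots is not stated here; wrapping around to the start of the stream is an assumption.

```python
import math

n_tokens = 13017   # length of the validation token stream (from the output above)
bptt = 70          # tokens per row of a batch
bs = 64            # rows per batch

tokens_per_batch = bs * bptt                          # slots in one full batch
n_batches = math.ceil(n_tokens / tokens_per_batch)    # 2.9 rounds up to 3
total_slots = n_batches * tokens_per_batch            # slots across all batches
extra_slots = total_slots - n_tokens                  # slots with no token of their own

print(tokens_per_batch, n_batches, total_slots, extra_slots)
```

So three full `64 x 70` batches need 423 more tokens than the stream provides.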

In [20]:
x1.shape, y1.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [21]:
x2.shape, y2.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [22]:
x3.shape, y3.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [23]:
v = data.valid_ds.vocab

In [55]:
v.itos

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 ',',
 'hundred',
 'thousand',
 'one',
 'two',
 'three',
 'four',
 'five',
 'six',
 'seven',
 'eight',
 'nine',
 'twenty',
 'thirty',
 'forty',
 'fifty',
 'sixty',
 'seventy',
 'eighty',
 'ninety',
 'ten',
 'eleven',
 'twelve',
 'thirteen',
 'fourteen',
 'fifteen',
 'sixteen',
 'seventeen',
 'eighteen',
 'nineteen',
 'xxfake']

In [26]:
x1[:,0]

tensor([ 2,  9, 11, 12, 13, 11, 10,  9, 10, 14, 19, 25, 19, 15, 16, 11, 19,  9,
        10,  9, 19, 25, 19, 11, 19, 11, 10,  9, 19, 20, 11, 26, 20, 23, 20, 20,
        24, 20, 11, 14, 11, 11,  9, 14,  9, 20, 10, 20, 35, 17, 11, 10,  9, 17,
         9, 20, 10, 20, 11, 20, 11, 20, 20, 20])

In [27]:
y1[:,0]

tensor([19, 19, 27, 10,  9, 12, 32, 19, 26, 10, 11, 15, 11, 10,  9, 15, 11, 19,
        26, 19, 11, 18, 11, 18,  9, 18, 21, 19, 10, 10, 20,  9, 11, 16, 11, 11,
        13, 11, 13,  9, 13, 14, 20, 10, 20, 11, 24, 11,  9,  9, 16, 17, 20, 10,
        20, 11, 24, 11, 19,  9, 19, 11, 11, 10])

In [28]:
v.itos[9], v.itos[11], v.itos[12], v.itos[13], v.itos[10]

(',', 'thousand', 'one', 'two', 'hundred')

In [31]:
v.textify(y1[0])  # just one after - so shifted by one

'eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight thousand'
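The shift-by-one relationship between inputs and targets can be sketched in plain Python. `make_lm_pairs` is a hypothetical helper written for this illustration, not part of fastai, and it ignores batching.

```python
def make_lm_pairs(tokens, seq_len):
    """Split a token stream into (input, target) rows, with the target shifted by one."""
    pairs = []
    # Stop early enough that every input position has a "next token" target.
    for start in range(0, len(tokens) - seq_len, seq_len):
        x = tokens[start : start + seq_len]
        y = tokens[start + 1 : start + seq_len + 1]
        pairs.append((x, y))
    return pairs

stream = "xxbos eight thousand one , eight thousand two , eight".split()
pairs = make_lm_pairs(stream, seq_len=4)
for x, y in pairs:
    print(x, "->", y)
```

Each target row is the input row moved one token to the right, which is why `y1[0]` starts at `eight` while the stream itself starts at `xxbos`.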

In [32]:
v.textify(x2[0])

'thousand eighteen , eight thousand nineteen , eight thousand twenty , eight thousand twenty one , eight thousand twenty two , eight thousand twenty three , eight thousand twenty four , eight thousand twenty five , eight thousand twenty six , eight thousand twenty seven , eight thousand twenty eight , eight thousand twenty nine , eight thousand thirty , eight thousand thirty one , eight thousand thirty two ,'

In [33]:
v.textify(x3[0])

'eight thousand thirty three , eight thousand thirty four , eight thousand thirty five , eight thousand thirty six , eight thousand thirty seven , eight thousand thirty eight , eight thousand thirty nine , eight thousand forty , eight thousand forty one , eight thousand forty two , eight thousand forty three , eight thousand forty four , eight thousand forty five , eight thousand forty six , eight'

In [34]:
v.textify(x1[1])

', eight thousand forty six , eight thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine ,'

In [35]:
v.textify(x2[1])

'eight thousand sixty , eight thousand sixty one , eight thousand sixty two , eight thousand sixty three , eight thousand sixty four , eight thousand sixty five , eight thousand sixty six , eight thousand sixty seven , eight thousand sixty eight , eight thousand sixty nine , eight thousand seventy , eight thousand seventy one , eight thousand seventy two , eight thousand seventy three , eight thousand'

In [36]:
v.textify(x3[1])

'seventy four , eight thousand seventy five , eight thousand seventy six , eight thousand seventy seven , eight thousand seventy eight , eight thousand seventy nine , eight thousand eighty , eight thousand eighty one , eight thousand eighty two , eight thousand eighty three , eight thousand eighty four , eight thousand eighty five , eight thousand eighty six , eight thousand eighty seven , eight thousand eighty'

In [37]:
v.textify(x3[-1])

'ninety , nine thousand nine hundred ninety one , nine thousand nine hundred ninety two , nine thousand nine hundred ninety three , nine thousand nine hundred ninety four , nine thousand nine hundred ninety five , nine thousand nine hundred ninety six , nine thousand nine hundred ninety seven , nine thousand nine hundred ninety eight , nine thousand nine hundred ninety nine xxbos eight thousand one , eight'

In [38]:
data.show_batch(ds_type=DatasetType.Valid)

idx,text
0,"thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine , eight thousand sixty , eight thousand sixty"
1,"eight , eight thousand eighty nine , eight thousand ninety , eight thousand ninety one , eight thousand ninety two , eight thousand ninety three , eight thousand ninety four , eight thousand ninety five , eight thousand ninety six , eight thousand ninety seven , eight thousand ninety eight , eight thousand ninety nine , eight thousand one hundred , eight thousand one hundred one , eight thousand one"
2,"thousand one hundred twenty four , eight thousand one hundred twenty five , eight thousand one hundred twenty six , eight thousand one hundred twenty seven , eight thousand one hundred twenty eight , eight thousand one hundred twenty nine , eight thousand one hundred thirty , eight thousand one hundred thirty one , eight thousand one hundred thirty two , eight thousand one hundred thirty three , eight thousand"
3,"three , eight thousand one hundred fifty four , eight thousand one hundred fifty five , eight thousand one hundred fifty six , eight thousand one hundred fifty seven , eight thousand one hundred fifty eight , eight thousand one hundred fifty nine , eight thousand one hundred sixty , eight thousand one hundred sixty one , eight thousand one hundred sixty two , eight thousand one hundred sixty three"
4,"thousand one hundred eighty three , eight thousand one hundred eighty four , eight thousand one hundred eighty five , eight thousand one hundred eighty six , eight thousand one hundred eighty seven , eight thousand one hundred eighty eight , eight thousand one hundred eighty nine , eight thousand one hundred ninety , eight thousand one hundred ninety one , eight thousand one hundred ninety two , eight thousand"


We will iteratively consider a few different models, building up to a more traditional RNN.

## Single fully connected model

In [90]:
data = src.databunch(bs=bs, bptt=3)

In [91]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 5]), torch.Size([64, 5]))

In [92]:
nv = len(v.itos); nv

40

In [93]:
nh=64

In [94]:
def loss4(input,target): return F.cross_entropy(input, target[:,-1])
def acc4 (input,target): return accuracy(input, target[:,-1])

In [95]:
x[:,0]

tensor([14, 30, 17, 10, 20, 10, 13, 10, 28, 12, 21,  9,  9, 11, 20, 11, 13, 26,
        27, 10, 13,  9, 13, 11,  9, 14, 10, 25, 14,  9, 10, 11, 15, 11, 15, 16,
        10, 25, 20,  9,  9,  9, 11, 15,  9, 18, 23, 24, 16, 14,  9, 17, 16, 17,
        19, 10,  9, 15, 11, 38, 18, 11, 18, 10])

In [96]:
x[:,1]

tensor([ 9,  9,  9, 23,  9, 26,  9, 26, 12, 11, 17, 12, 12, 19, 10, 12, 10, 12,
        18, 15,  9, 13, 11, 12, 14, 10, 24, 18,  9, 14, 37, 23, 11, 13, 10, 10,
        23, 14,  9, 15, 16, 16, 14, 10, 16, 10, 14, 20,  9,  9, 17, 11, 10, 10,
        10, 21, 18,  9, 14,  9, 11, 17, 10, 27])

In [97]:
x[:,2]

tensor([15, 13, 14, 15, 18, 16, 12, 16,  9, 15,  9, 11, 11, 10, 28, 10, 24,  9,
         9,  9, 13, 11, 20, 10, 11, 22, 12,  9, 14, 11,  9, 20, 12, 10, 33, 22,
        18,  9, 15, 11, 11, 11, 10, 28, 11, 21,  9,  9, 17, 17, 11, 15, 26, 27,
        16, 14, 11, 18, 10, 18, 16, 10, 25, 15])

In [98]:
v.itos[17]

'six'

Layer names:
- `i_h`: input to hidden
- `h_h`: hidden to hidden
- `h_o`: hidden to output
- `bn`: batchnorm

In [99]:
x.shape

torch.Size([64, 5])

In [100]:
class Model0(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = self.bn(F.relu(self.i_h(x[:,0])))
        if x.shape[1]>1:
            h = h + self.i_h(x[:,1])
            h = self.bn(F.relu(self.h_h(h)))
        if x.shape[1]>2:
            h = h + self.i_h(x[:,2])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

In [101]:
learn = Learner(data, Model0(), loss_func=loss4, metrics=acc4)

In [102]:
learn.fit_one_cycle(6, 1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.666979,3.529318,0.066692,00:01
1,3.219053,3.035857,0.381098,00:01
2,2.773531,2.691258,0.425686,00:01
3,2.554294,2.544547,0.429116,00:01
4,2.468419,2.496397,0.423399,00:01
5,2.446613,2.48963,0.423781,00:01


## Same thing with a loop

Let's refactor this to use a for-loop.  This does the same thing as before:

In [103]:
class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

This is the difference between the unrolled (what we had before) and rolled (what we have now) RNN diagrams.
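As a sanity check on the refactoring idea, the sketch below writes the same recurrence twice, once unrolled and once as a loop, and compares the results. It is a simplification under stated assumptions: batchnorm and the output layer are left out, the hidden size is shrunk, and both versions apply `h_h` at every step (the notebook's `Model0` skips `h_h` on the first token, so the two notebook models are close but not bit-identical).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
nv, nh = 40, 8                  # vocab size as in the notebook; small hidden size for the demo
i_h = nn.Embedding(nv, nh)      # input to hidden
h_h = nn.Linear(nh, nh)         # hidden to hidden

def step(h, tokens):
    """One recurrence step: add the embedded token, then apply the shared layer."""
    return F.relu(h_h(h + i_h(tokens)))

def unrolled(x):
    """Three steps written out by hand (assumes exactly 3 tokens per row)."""
    h = torch.zeros(x.shape[0], nh)
    h = step(h, x[:, 0])
    h = step(h, x[:, 1])
    h = step(h, x[:, 2])
    return h

def rolled(x):
    """The same computation as a loop over however many tokens there are."""
    h = torch.zeros(x.shape[0], nh)
    for i in range(x.shape[1]):
        h = step(h, x[:, i])
    return h

x = torch.randint(0, nv, (5, 3))    # 5 rows of 3 token ids
same = torch.allclose(unrolled(x), rolled(x))
print(same)
```

The loop version also works for any sequence length, which is what the later models rely on.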

In [104]:
learn = Learner(data, Model1(), loss_func=loss4, metrics=acc4)

In [105]:
learn.fit_one_cycle(6, 1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.739444,3.772064,0.031631,00:02
1,3.224899,3.167321,0.311738,00:02
2,2.654545,2.674752,0.437119,00:02
3,2.332796,2.445802,0.452744,00:02
4,2.197132,2.369558,0.456936,00:02
5,2.162464,2.358822,0.457698,00:02


Our accuracy is about the same, since we are doing the same thing as before.

## Multi fully connected model

Before, we were just predicting the last word in a line of text.  Given 70 tokens, what is token 71?  That approach was throwing away a lot of data.  Why not predict token 2 from token 1, then predict token 3, then predict token 4, and so on?  We will modify our model to do this.
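A rough sketch of what changes in the loss when every position is predicted: the logits gain a time dimension, and batch and time are flattened together before cross-entropy. The shapes and random tensors are invented for illustration, and the assumption that fastai's default language-model loss flattens in a similar way is mine, not something shown in this notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs, seq_len, nv = 4, 6, 40

logits = torch.randn(bs, seq_len, nv)            # one prediction per position
targets = torch.randint(0, nv, (bs, seq_len))    # the next token at each position

# Earlier setup: score only the final position of each row.
loss_last = F.cross_entropy(logits[:, -1], targets[:, -1])

# New setup: flatten batch and time so every position contributes to the loss.
loss_all = F.cross_entropy(logits.reshape(-1, nv), targets.reshape(-1))

n_last = targets[:, -1].numel()    # training signals per batch when scoring the last token
n_all = targets.numel()            # training signals per batch when scoring every token
print(n_last, n_all)
```

With these toy shapes the model gets 24 training signals per batch instead of 4, which is the "throwing away a lot of data" point made above.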

In [106]:
data = src.databunch(bs=bs, bptt=20)

In [107]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 20]), torch.Size([64, 20]))

In [108]:
class Model2(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))
        return torch.stack(res, dim=1)

In [109]:
learn = Learner(data, Model2(), metrics=accuracy)

In [110]:
learn.fit_one_cycle(10, 1e-4, pct_start=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,3.829416,3.806781,0.009162,00:02
1,3.63403,3.579437,0.08125,00:02
2,3.402695,3.369328,0.171875,00:02
3,3.180062,3.201044,0.213778,00:02
4,2.988979,3.075172,0.219886,00:02
5,2.836951,2.987514,0.231392,00:02
6,2.724767,2.933141,0.23956,00:02
7,2.648958,2.90461,0.24581,00:02
8,2.603263,2.893784,0.248366,00:02
9,2.579597,2.892172,0.248509,00:02


Note that our accuracy is worse now, because we are doing a harder task.  When we predict word k (k<70), we have less history to help us than when we were only predicting word 71.

## Maintain state

To address this issue, let's keep the hidden state from the previous line of text, so we are not starting over again on each new line of text.

In [114]:
class Model3(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh) # store the hidden state so it's available next time
        
    def forward(self, x):
        res = []
        h = self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.bn(h))   
        self.h = h.detach()
        res = torch.stack(res, dim=1) 
        res = self.h_o(res)
        return res

In [115]:
learn = Learner(data, Model3(), metrics=accuracy)

In [116]:
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.648024,3.495584,0.11044,00:02
1,3.070469,2.59497,0.378622,00:02
2,2.337918,2.037291,0.458026,00:02
3,1.877789,2.101913,0.316335,00:02
4,1.654644,2.145627,0.31669,00:02
5,1.548502,2.165984,0.317045,00:02
6,1.48794,2.211559,0.328196,00:02
7,1.376285,2.071783,0.363778,00:01
8,1.217045,2.030422,0.397869,00:01
9,1.068312,2.188098,0.404759,00:02


Now we are getting greater accuracy than before!
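The `detach()` call in `Model3` is what lets the hidden state carry over without dragging the whole training history along. The sketch below isolates that one idea with a stand-in recurrence; the sizes and the `+ 1.0` input are arbitrary choices for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nh = 4
h_h = nn.Linear(nh, nh)

h0 = torch.zeros(2, nh)                    # initial hidden state for a batch of 2
h1 = torch.relu(h_h(h0 + 1.0))             # state after the first batch; part of the autograd graph
h1_carried = h1.detach()                   # same values, but gradients stop here

h2 = torch.relu(h_h(h1_carried + 1.0))     # second batch starts from the carried state
h2.sum().backward()                        # backprop only reaches the second batch's computation

print(h1.requires_grad, h1_carried.requires_grad, torch.equal(h1, h1_carried))
```

Without the detach, each new batch would backpropagate through every earlier batch, so memory and compute would keep growing over the epoch.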

## nn.RNN

Let's refactor the above to use PyTorch's RNN.  This is what you would use in practice, but now you know the inside details!
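To connect the hand-written loop to `nn.RNN`, the sketch below reproduces a one-layer `nn.RNN` forward pass with an explicit loop over time, reusing the module's own weights. Note two differences from the notebook's loop: `nn.RNN` uses `tanh` by default rather than ReLU, and it applies separate linear maps to the input and the hidden state before adding them. The sizes here are small made-up values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, seq_len, nh = 2, 5, 4
rnn = nn.RNN(nh, nh, batch_first=True)    # default non-linearity is tanh
x = torch.randn(bs, seq_len, nh)          # stands in for the embedded tokens
h0 = torch.zeros(1, bs, nh)               # (num_layers, batch, hidden)

out, h_last = rnn(x, h0)

# The same recurrence written as a loop, using the module's own weights.
h = h0[0]
manual = []
for t in range(seq_len):
    h = torch.tanh(x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                   + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)
    manual.append(h)
manual = torch.stack(manual, dim=1)       # (batch, time, hidden), like `out`

matches = torch.allclose(out, manual, atol=1e-5)
print(tuple(out.shape), tuple(h_last.shape), matches)
```

`out` holds the hidden state at every time step, and `h_last` is the final one, which is the value the models below store and detach.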

In [120]:
class Model4(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.RNN(nh,nh, batch_first=True) # does the same work as the for loop: keeps the output states and feeds the hidden state forward
        # also faster, since it runs in CUDA C on the GPU
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(1, bs, nh)
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [121]:
learn = Learner(data, Model4(), metrics=accuracy)

In [122]:
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.585372,3.489162,0.193395,00:01
1,2.870122,2.414831,0.462784,00:01
2,2.165978,2.08833,0.327628,00:01
3,1.781293,2.081119,0.316122,00:01
4,1.567577,1.826286,0.467898,00:01
5,1.386352,1.645618,0.496094,00:01
6,1.192177,1.572929,0.496946,00:01
7,1.019839,1.497096,0.545668,00:01
8,0.88468,1.566342,0.560511,00:01
9,0.768424,1.574076,0.584517,00:01


## 2-layer GRU

When you have long time scales and deeper networks, RNNs become very hard to train.  One way to address this is to add mini-NNs that decide how much of the green arrow and how much of the orange arrow to keep.  RNN layers built this way are GRUs or LSTMs.  We will cover more details of this in a later lesson.
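For a feel of what those gates compute, here is a hedged sketch that reproduces one step of PyTorch's `nn.GRUCell` by hand. It uses a single cell rather than the 2-layer `nn.GRU` in the model below, and the tensor sizes are arbitrary. Reading the update gate `z` as the thing that decides "how much to keep" is my interpretation of the text above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
nh = 4
cell = nn.GRUCell(nh, nh)
x = torch.randn(3, nh)    # input at one time step (batch of 3)
h = torch.randn(3, nh)    # previous hidden state

# PyTorch stacks the three gate weight matrices in the order: reset, update, new.
gi = x @ cell.weight_ih.T + cell.bias_ih
gh = h @ cell.weight_hh.T + cell.bias_hh
i_r, i_z, i_n = gi.chunk(3, dim=1)
h_r, h_z, h_n = gh.chunk(3, dim=1)

r = torch.sigmoid(i_r + h_r)       # reset gate: how much old state feeds the candidate
z = torch.sigmoid(i_z + h_z)       # update gate: how much old state to keep
n = torch.tanh(i_n + r * h_n)      # candidate new state
h_new = (1 - z) * n + z * h        # blend of candidate and previous state

matches = torch.allclose(h_new, cell(x, h), atol=1e-5)
print(matches)
```

Each gate is a small linear layer followed by a sigmoid, so its values lie between 0 and 1 and act as per-element mixing weights.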

In [126]:
class Model5(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.GRU(nh, nh, 2, batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(2, bs, nh)
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [127]:
learn = Learner(data, Model5(), metrics=accuracy)

In [128]:
learn.fit_one_cycle(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,2.708675,2.228096,0.470099,00:02
1,1.572299,1.55115,0.643111,00:02
2,0.767874,1.255454,0.75625,00:02
3,0.37618,1.022752,0.805895,00:02
4,0.190067,1.097935,0.814276,00:02
5,0.102547,1.015955,0.819389,00:02
6,0.058668,1.147391,0.812145,00:02
7,0.035856,1.165836,0.8125,00:02
8,0.023436,1.201007,0.81875,00:02
9,0.016862,1.185209,0.817045,00:02


### Connection to ULMFiT

In the previous lesson, we were essentially swapping out `self.h_o` with a classifier in order to do classification on text.

## fin

RNNs are just a refactored, fully-connected neural network.

You can use the same approach for any sequence labeling task (part-of-speech tagging, classifying whether material is sensitive, ...).