# Human numbers

In [1]:
from fastai.text import *

In [2]:
bs=64

## Data

This data set contains just the numbers "one" through "nine thousand nine hundred ninety nine", written in English.

In [9]:
path = untar_data(URLs.HUMAN_NUMBERS, dest='./data')
path.ls()

[PosixPath('data/human_numbers/train.txt'),
 PosixPath('data/human_numbers/valid.txt')]

We're going to try to create a language model that can predict the next word in this document. It's just a toy example for this purpose. In this case, we only have one document: the list of numbers. So we can use a `TextList` to create an item list with the text in it, one for the training set and one for the validation set.

In [10]:
def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]

In [11]:
# Train on 1-8000

train_txt = readnums('train.txt'); train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

In [12]:
# Valid is 8001-9999

valid_txt = readnums('valid.txt'); valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'

In [7]:
train = TextList(train_txt, path=path)
valid = TextList(valid_txt, path=path)

# Combine them together into ItemLists, and turn them into DataBunch
src = ItemLists(path=path, train=train, valid=valid).label_for_lm()
data = src.databunch(bs=bs)

In [8]:
train[0].text[:80]

'xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleve'

`xxbos` is the beginning-of-string token, which fastai adds to mark the start of each text.

In [9]:
len(data.valid_ds[0][0].data)

13017

The batch size that we asked for was 64, and then by default it uses something called `bptt` of 70. `bptt`, as we briefly mentioned, stands for "back prop through time". That's the sequence length. For each of our 64 document segments, we split it up into lists of 70 words that we look at one at a time.

So what we do for the validation set is we grab this entire string of 13,000 tokens, and then we split it into 64 roughly equal sized sections. People very very very often think I'm saying something different. I did not say "they are of length 64" - they're not. <br>
They're **64 roughly equally sized segments**. So we take the first 1/64 of the document -  piece 1. The second 1/64 - piece 2. 

Then for each of those 1/64 of the document, we then split *those* into pieces of length 70. 
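The two-stage split described above can be sketched in plain Python on a toy token list. This is only an illustration under simplifying assumptions: `batchify` is a made-up helper, it simply drops any leftover tokens, and it is not how fastai's loader is implemented.

```python
def batchify(tokens, bs, bptt):
    """Split tokens into bs contiguous rows, then cut the rows into chunks of bptt columns."""
    row_len = len(tokens) // bs  # assumption: leftover tokens at the end are dropped
    rows = [tokens[i * row_len:(i + 1) * row_len] for i in range(bs)]
    batches = []
    for start in range(0, row_len, bptt):
        # every batch takes the next bptt tokens from each of the bs rows
        batches.append([row[start:start + bptt] for row in rows])
    return batches

tokens = list(range(24))            # stand-in for 24 token ids
batches = batchify(tokens, bs=3, bptt=4)
print(batches[0])  # [[0, 1, 2, 3], [8, 9, 10, 11], [16, 17, 18, 19]]
print(batches[1])  # [[4, 5, 6, 7], [12, 13, 14, 15], [20, 21, 22, 23]]
```

Note that `bs` is the number of rows, not their length, and that row 0 of the second batch continues exactly where row 0 of the first batch stopped.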

So for those 13,000 tokens, how many batches are there? Divide by the batch size and divide by 70: that gives about 2.9, which rounds up to 3 batches.

In [10]:
data.bptt, len(data.valid_dl)

(70, 3)

In [11]:
13017/70/bs

2.905580357142857
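The same arithmetic as a small helper, in case you want to try other values of `bs` and `bptt`. The name `count_batches` is made up, and rounding up is an assumption inferred from the `3` that `len(data.valid_dl)` reports; it is not taken from the fastai source.

```python
import math

def count_batches(n_tokens, bs, bptt):
    """Rough batch count: tokens per row, cut into chunks of bptt, rounded up."""
    tokens_per_row = n_tokens / bs            # each of the bs rows gets ~1/bs of the text
    return math.ceil(tokens_per_row / bptt)   # a partial last chunk still counts as a batch

n_batches = count_batches(13017, bs=64, bptt=70)
print(n_batches)  # 3
```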

Let's grab an iterator for the data loader, grab one, two, three batches (the `x` and the `y`), and add up the number of elements. The expectation is slightly less than 13,017, because there's a little bit left over at the end that doesn't quite make up a full batch. This is the kind of stuff you should play around with a lot: lots of shapes and sizes and stuff and iterators.

In [12]:
it = iter(data.valid_dl)
x1,y1 = next(it)
x2,y2 = next(it)
x3,y3 = next(it)
it.close()

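### For some reason, this is not turning out the way Jeremy expected:

The three batches hold 3 × 64 × 70 = 13,440 elements, which is more than the 13,017 tokens in the validation text. The last row of the third batch (`x3[-1]`, shown further down) ends with `xxbos eight thousand one`, which suggests the loader wraps around to the start of the text to fill the final batch.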

In [13]:
x1.numel()+x2.numel()+x3.numel()

13440

In [14]:
x1.shape,y1.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [15]:
x2.shape,y2.shape

(torch.Size([64, 70]), torch.Size([64, 70]))

In [25]:
x1[0]

tensor([ 2, 19, 11, 12,  9, 19, 11, 13,  9, 19, 11, 14,  9, 19, 11, 15,  9, 19,
        11, 16,  9, 19, 11, 17,  9, 19, 11, 18,  9, 19, 11, 19,  9, 19, 11, 20,
         9, 19, 11, 29,  9, 19, 11, 30,  9, 19, 11, 31,  9, 19, 11, 32,  9, 19,
        11, 33,  9, 19, 11, 34,  9, 19, 11, 35,  9, 19, 11, 36,  9, 19],
       device='cuda:0')

In [28]:
y1[0]

tensor([19, 11, 12,  9, 19, 11, 13,  9, 19, 11, 14,  9, 19, 11, 15,  9, 19, 11,
        16,  9, 19, 11, 17,  9, 19, 11, 18,  9, 19, 11, 19,  9, 19, 11, 20,  9,
        19, 11, 29,  9, 19, 11, 30,  9, 19, 11, 31,  9, 19, 11, 32,  9, 19, 11,
        33,  9, 19, 11, 34,  9, 19, 11, 35,  9, 19, 11, 36,  9, 19, 11],
       device='cuda:0')

You can grab the vocab for this dataset, and a vocab has a `textify`, so if we look at exactly the same thing but with `textify`, that will just look it up in the vocab. So here you can see `xxbos eight thousand one`, whereas in the `y` there's no `xxbos`, it's just `eight thousand one`. So after `xxbos` is `eight`, after `eight` is `thousand`, after `thousand` is `one`.

In [18]:
v = data.valid_ds.vocab

In [23]:
v.textify(x1[0])

'xxbos eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight'

In [22]:
v.textify(y1[0])

'eight thousand one , eight thousand two , eight thousand three , eight thousand four , eight thousand five , eight thousand six , eight thousand seven , eight thousand eight , eight thousand nine , eight thousand ten , eight thousand eleven , eight thousand twelve , eight thousand thirteen , eight thousand fourteen , eight thousand fifteen , eight thousand sixteen , eight thousand seventeen , eight thousand'
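The relationship between `x` and `y` is just a shift by one token. A minimal sketch on a plain Python list (`make_xy` is a made-up helper, not a fastai function):

```python
def make_xy(tokens, bptt):
    """Return (x, y) where y is x shifted one token to the right."""
    x = tokens[:bptt]
    y = tokens[1:bptt + 1]   # the "next word" for every position in x
    return x, y

tokens = "xxbos eight thousand one , eight thousand two".split()
x, y = make_xy(tokens, bptt=4)
print(x)  # ['xxbos', 'eight', 'thousand', 'one']
print(y)  # ['eight', 'thousand', 'one', ',']
```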

In [21]:
v.textify(x2[0])

'thousand eighteen , eight thousand nineteen , eight thousand twenty , eight thousand twenty one , eight thousand twenty two , eight thousand twenty three , eight thousand twenty four , eight thousand twenty five , eight thousand twenty six , eight thousand twenty seven , eight thousand twenty eight , eight thousand twenty nine , eight thousand thirty , eight thousand thirty one , eight thousand thirty two ,'

In [29]:
v.textify(x3[0])

'eight thousand thirty three , eight thousand thirty four , eight thousand thirty five , eight thousand thirty six , eight thousand thirty seven , eight thousand thirty eight , eight thousand thirty nine , eight thousand forty , eight thousand forty one , eight thousand forty two , eight thousand forty three , eight thousand forty four , eight thousand forty five , eight thousand forty six , eight'

In [30]:
v.textify(x1[1])

', eight thousand forty six , eight thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine ,'

In [31]:
v.textify(x2[1])

'eight thousand sixty , eight thousand sixty one , eight thousand sixty two , eight thousand sixty three , eight thousand sixty four , eight thousand sixty five , eight thousand sixty six , eight thousand sixty seven , eight thousand sixty eight , eight thousand sixty nine , eight thousand seventy , eight thousand seventy one , eight thousand seventy two , eight thousand seventy three , eight thousand'

In [32]:
v.textify(x3[1])

'seventy four , eight thousand seventy five , eight thousand seventy six , eight thousand seventy seven , eight thousand seventy eight , eight thousand seventy nine , eight thousand eighty , eight thousand eighty one , eight thousand eighty two , eight thousand eighty three , eight thousand eighty four , eight thousand eighty five , eight thousand eighty six , eight thousand eighty seven , eight thousand eighty'

In [33]:
v.textify(x3[-1])

'ninety , nine thousand nine hundred ninety one , nine thousand nine hundred ninety two , nine thousand nine hundred ninety three , nine thousand nine hundred ninety four , nine thousand nine hundred ninety five , nine thousand nine hundred ninety six , nine thousand nine hundred ninety seven , nine thousand nine hundred ninety eight , nine thousand nine hundred ninety nine xxbos eight thousand one , eight'

In [34]:
data.show_batch(ds_type=DatasetType.Valid)

idx,text
0,"thousand forty seven , eight thousand forty eight , eight thousand forty nine , eight thousand fifty , eight thousand fifty one , eight thousand fifty two , eight thousand fifty three , eight thousand fifty four , eight thousand fifty five , eight thousand fifty six , eight thousand fifty seven , eight thousand fifty eight , eight thousand fifty nine , eight thousand sixty , eight thousand sixty"
1,"eight , eight thousand eighty nine , eight thousand ninety , eight thousand ninety one , eight thousand ninety two , eight thousand ninety three , eight thousand ninety four , eight thousand ninety five , eight thousand ninety six , eight thousand ninety seven , eight thousand ninety eight , eight thousand ninety nine , eight thousand one hundred , eight thousand one hundred one , eight thousand one"
2,"thousand one hundred twenty four , eight thousand one hundred twenty five , eight thousand one hundred twenty six , eight thousand one hundred twenty seven , eight thousand one hundred twenty eight , eight thousand one hundred twenty nine , eight thousand one hundred thirty , eight thousand one hundred thirty one , eight thousand one hundred thirty two , eight thousand one hundred thirty three , eight thousand"
3,"three , eight thousand one hundred fifty four , eight thousand one hundred fifty five , eight thousand one hundred fifty six , eight thousand one hundred fifty seven , eight thousand one hundred fifty eight , eight thousand one hundred fifty nine , eight thousand one hundred sixty , eight thousand one hundred sixty one , eight thousand one hundred sixty two , eight thousand one hundred sixty three"
4,"thousand one hundred eighty three , eight thousand one hundred eighty four , eight thousand one hundred eighty five , eight thousand one hundred eighty six , eight thousand one hundred eighty seven , eight thousand one hundred eighty eight , eight thousand one hundred eighty nine , eight thousand one hundred ninety , eight thousand one hundred ninety one , eight thousand one hundred ninety two , eight thousand"


## Single fully connected model

In [35]:
data = src.databunch(bs=bs, bptt=3)

In [36]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 3]), torch.Size([64, 3]))

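The next three cells are not explained in the lecture. `nv` is the vocabulary size, `nh` is the number of hidden activations, and `loss4`/`acc4` score the prediction against only the last token of each target sequence (`target[:,-1]`).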

In [37]:
nv = len(v.itos); nv

39

In [38]:
nh=64

In [39]:
def loss4(input,target): return F.cross_entropy(input, target[:,-1])
def acc4 (input,target): return accuracy(input, target[:,-1])
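To make the `target[:,-1]` part concrete, here is a pure-Python sketch of the same idea: cross-entropy computed against only the last token of each target row, averaged over the batch. The function name and the toy numbers are made up for illustration; the notebook itself relies on `F.cross_entropy`.

```python
import math

def cross_entropy_last(logits, targets):
    """Mean cross-entropy of each row's logits against the LAST token of its target row."""
    total = 0.0
    for row_logits, row_targets in zip(logits, targets):
        last = row_targets[-1]                                   # same idea as target[:, -1]
        log_norm = math.log(sum(math.exp(v) for v in row_logits))
        total += log_norm - row_logits[last]                     # -log softmax(logits)[last]
    return total / len(logits)

logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]   # one row of class scores per sequence
targets = [[1, 2, 0], [0, 1, 2]]              # only the final column is scored
loss = cross_entropy_last(logits, targets)
print(round(loss, 4))
```

The first two columns of `targets` have no effect on the result, which is exactly why this loss wastes most of the sequence.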

Here is our model which is doing what we saw in the diagram:

<img src='https://raw.githubusercontent.com/hiromis/notes/master/lesson7/49.png'/>

It contains one embedding (the green arrow), one hidden-to-hidden layer (the brown arrow), and one hidden-to-output layer (the blue arrow). So each colored arrow has a single matrix. Then in the forward pass, we take our first input `x[:,0]` and put it through input to hidden (the green arrow) to create our first set of activations, which we call `h`. Then, assuming there is a second word (sometimes we might be at the end of a batch where there isn't one), we add to `h` the result of `x[:,1]` put through the green arrow (that's `i_h`). Our new `h` is the result of those two added together, put through our hidden to hidden (the brown arrow), then ReLU, then batch norm. Then for the third word, do exactly the same thing. Then finally the blue arrow: put it through `h_o`.

So that's how we convert our diagram to code. Nothing new here at all. We can chuck that in a learner, and we can train it.

In [40]:
class Model0(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = self.bn(F.relu(self.i_h(x[:,0])))
        if x.shape[1]>1:
            h = h + self.i_h(x[:,1])
            h = self.bn(F.relu(self.h_h(h)))
        if x.shape[1]>2:
            h = h + self.i_h(x[:,2])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

In [41]:
learn = Learner(data, Model0(), loss_func=loss4, metrics=acc4)

In [42]:
learn.fit_one_cycle(6, 1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.590756,3.572804,0.051241,00:01
1,3.023875,3.158081,0.316866,00:01
2,2.412710,2.693038,0.348575,00:01
3,2.099838,2.425050,0.351792,00:01
4,1.981739,2.328369,0.353860,00:01
5,1.957371,2.314564,0.353860,00:01


## Same thing with a loop

Refactor the `forward()` part with a loop:

In [43]:
class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

So now we're going for each `xi` in `x`, and doing it in the loop. Guess what? That's an RNN. An RNN is just a refactoring. It's not anything new. This is now an RNN. And let's refactor our diagram:

<img src='https://raw.githubusercontent.com/hiromis/notes/master/lesson7/50.png'/>

This is the same diagram, but I've just replaced it with my loop. It does the same thing, so here it is. It's got exactly the same `__init__`, literally exactly the same, just popped a loop here. Before I start, I just have to make sure that I've got a bunch of zeros to add to. 

`h = torch.zeros(x.shape[0], nh).to(device=x.device)`

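And when I train it, I get a comparable result. In the runs recorded here the accuracy ends at about 0.46, versus about 0.35 for `Model0`. Note that the loop version also passes the first word through `h_h`, which `Model0` does not.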

Now this will work even if I'm not predicting the fourth word from the previous three, but the ninth word from the previous eight. It'll work for a sequence of any length.
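The "it's just a refactoring" point can be checked on a toy scalar version, where the weights are plain numbers instead of matrices. Everything here (`step`, `unrolled`, `looped`, the weight values) is made up for illustration, and batch norm is left out.

```python
def step(h, x, w_in=0.5, w_hid=0.9):
    """One toy recurrent step on scalars."""
    h = h + w_in * x              # stand-in for h + i_h(x)
    return max(0.0, w_hid * h)    # stand-in for relu(h_h(h))

def unrolled(x):
    """Three steps written out by hand; only works for exactly three inputs."""
    h = 0.0
    h = step(h, x[0])
    h = step(h, x[1])
    h = step(h, x[2])
    return h

def looped(x):
    """The same computation as a loop; works for any number of inputs."""
    h = 0.0
    for xi in x:
        h = step(h, xi)
    return h

print(unrolled([1.0, 2.0, 3.0]) == looped([1.0, 2.0, 3.0]))  # True
print(looped([1.0] * 8))  # eight inputs, no code changes needed
```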

In [44]:
learn = Learner(data, Model1(), loss_func=loss4, metrics=acc4)

In [45]:
learn.fit_one_cycle(6, 1e-4)

epoch,train_loss,valid_loss,acc4,time
0,3.607758,3.572367,0.043428,00:01
1,2.943491,2.961627,0.441176,00:01
2,2.308248,2.410912,0.454044,00:01
3,2.003319,2.177335,0.467831,00:01
4,1.893996,2.110774,0.459789,00:01
5,1.871766,2.102596,0.455882,00:01


<br>

## Multi fully connected model

So let's up the `bptt` to 20 since we can now. And let's now say, okay, instead of just predicting the <img src="https://latex.codecogs.com/gif.latex?n" title="n" />th word from the previous <img src="https://latex.codecogs.com/gif.latex?n-1" title="n-1" />, let's try to predict the second word from the first, the third from the second, and the fourth from the third, and so forth. Look at our loss function. 

<img src='https://raw.githubusercontent.com/hiromis/notes/master/lesson7/51.png'/>

Previously we were comparing the result of our model to just the last word of the sequence. It is very wasteful, because there's a lot of words in the sequence. So let's compare every word in `x` to every word in `y`. To do that, we need to change the diagram so it's not just one triangle at the end of the loop, but the triangle is inside the loop:

<img src='https://raw.githubusercontent.com/hiromis/notes/master/lesson7/52.png'/>

In other words, after every loop, predict, loop, predict, loop, predict.

In [46]:
data = src.databunch(bs=bs, bptt=20)

In [47]:
x,y = data.one_batch()
x.shape,y.shape

(torch.Size([64, 20]), torch.Size([64, 20]))

Here's the code. It's the same as the previous code, but now I've created an array, and every time I go through the loop, I append `h_o(bn(h))` to the array. Now, for <img src="https://latex.codecogs.com/gif.latex?n" title="n" /> inputs, I create <img src="https://latex.codecogs.com/gif.latex?n" title="n" /> outputs. So I'm predicting after every word.
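A toy scalar sketch of "predict after every word": the loop is unchanged, but each step's output is collected instead of only the last one. The function name and weights are made up for illustration, and batch norm is omitted.

```python
def predict_every_step(x, w_in=0.5, w_hid=0.9, w_out=2.0):
    """Toy scalar RNN that emits one output per input instead of only a final one."""
    h = 0.0
    res = []
    for xi in x:
        h = max(0.0, w_hid * (h + w_in * xi))  # stand-in for relu(h_h(h + i_h(x_i)))
        res.append(w_out * h)                  # stand-in for the per-step output layer
    return res

outs = predict_every_step([1.0, 2.0, 3.0, 4.0])
print(len(outs))  # 4
```

With one output per position, every token in `y` can contribute to the loss, not just the last one.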

In [48]:
class Model2(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))
        return torch.stack(res, dim=1)

In [49]:
learn = Learner(data, Model2(), metrics=accuracy)

In [50]:
learn.fit_one_cycle(10, 1e-4, pct_start=0.1)

epoch,train_loss,valid_loss,accuracy,time
0,3.621647,3.622926,0.061932,00:00
1,3.530835,3.516825,0.120810,00:00
2,3.410503,3.400002,0.217756,00:00
3,3.279715,3.281733,0.324858,00:00
4,3.154488,3.180173,0.355753,00:00
5,3.046561,3.102978,0.385866,00:00
6,2.963162,3.054118,0.416832,00:00
7,2.905735,3.028730,0.427344,00:00
8,2.870877,3.019201,0.429901,00:00
9,2.852798,3.017796,0.430256,00:00


Why is it worse? It's worse because now when I'm trying to predict the second word, I only have one word of state to use. When I'm looking at the third word, I only have two words of state to use. So it's a much harder problem for it to solve. The key problem is here:

`h = torch.zeros(x.shape[0], nh).to(device=x.device)`

I reset my state to zero every time I start another BPTT sequence. Let's not do that. Let's keep `h`. And we can, because remember, each batch connects to the previous batch. It's not shuffled like it is in image classification. So let's take this exact model and replicate it again, but let's move the creation of `h` into the constructor.

`self.h = torch.zeros(bs, nh).cuda()`

## Maintain state

In [51]:
class Model3(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh).cuda()
        
    def forward(self, x):
        res = []
        h = self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.bn(h))
        self.h = h.detach()
        res = torch.stack(res, dim=1)
        res = self.h_o(res)
        return res
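Why keeping `h` matters can be seen in a toy scalar version: if the state is carried over, feeding a sequence in two chunks gives the same final state as feeding it all at once, and if the state is reset, the second chunk starts from nothing. The class `ToyStatefulRNN` and its weights are made up for illustration. Plain floats have no gradient history, so the `detach()` step from `Model3` appears only as a comment.

```python
class ToyStatefulRNN:
    """Scalar stand-in for a model that keeps its hidden state between calls."""
    def __init__(self, keep_state):
        self.keep_state = keep_state
        self.h = 0.0

    def forward(self, chunk):
        h = self.h if self.keep_state else 0.0
        for xi in chunk:
            h = max(0.0, 0.9 * (h + 0.5 * xi))
        self.h = h  # a PyTorch model would store h.detach() here
        return h

seq = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
full = ToyStatefulRNN(keep_state=True).forward(seq)   # whole sequence in one call

stateful = ToyStatefulRNN(keep_state=True)
stateful.forward(seq[:3])
kept = stateful.forward(seq[3:])                      # second chunk continues from the first

reset = ToyStatefulRNN(keep_state=False)
reset.forward(seq[:3])
lost = reset.forward(seq[3:])                         # second chunk starts from zero

print(kept == full)  # True
print(lost == full)  # False
```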

In [52]:
learn = Learner(data, Model3(), metrics=accuracy)

In [53]:
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.591799,3.521685,0.096591,00:00
1,3.265697,2.933102,0.422088,00:00
2,2.619543,2.011813,0.465767,00:00
3,2.028754,1.932703,0.330753,00:00
4,1.709599,1.855882,0.397372,00:00
5,1.512163,1.799607,0.435653,00:00
6,1.350056,1.738522,0.467827,00:00
7,1.198517,1.837280,0.469176,00:00
8,1.036898,1.586950,0.502841,00:00
9,0.868209,1.605685,0.527202,00:00
10,0.723127,1.786438,0.520810,00:00
11,0.607666,1.882282,0.533381,00:00
12,0.522557,2.016127,0.546733,00:00
13,0.455661,2.121029,0.543040,00:00
14,0.406101,2.236337,0.541406,00:00
15,0.370657,2.214142,0.556392,00:00
16,0.346327,2.250795,0.559659,00:00
17,0.328857,2.293551,0.555966,00:00
18,0.317588,2.310897,0.554972,00:00
19,0.311233,2.314541,0.554048,00:00


So this is what a real RNN looks like. You always want to keep that state. But just keep remembering, there's nothing different about an RNN, and it's a totally normal fully connected neural net. It's just that you've got a loop you refactored.

<img src='https://raw.githubusercontent.com/hiromis/notes/master/lesson7/54.png'/>

What you could do though is, at the end of every loop, not just spit out an output, but spit it out into another RNN. So you have an RNN going into an RNN. That's nice because we've now got more layers of computation, so you would expect that to work better.


## nn.RNN

Let's do some more refactoring. Let's take this code (`Model3`) and replace the loop with the equivalent built-in PyTorch module, `nn.RNN`:

In [54]:
class Model4(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.RNN(nh,nh, batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(1, bs, nh).cuda()
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

<br>

So `self.rnn = nn.RNN(nh,nh, batch_first=True)` basically says do the loop for me. We've still got the same embedding, we've still got the same output, still got the same batch norm, we still got the same initialization of `h`, but we just got rid of the loop.

In [55]:
learn = Learner(data, Model4(), metrics=accuracy)

In [56]:
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.503310,3.364854,0.186009,00:00
1,3.033388,2.556750,0.460938,00:00
2,2.358360,1.932100,0.466690,00:00
3,1.883996,1.946859,0.315270,00:00
4,1.627613,1.948741,0.345526,00:00
5,1.413942,1.787773,0.421094,00:00
6,1.222860,1.887828,0.496307,00:00
7,1.055549,1.748806,0.519886,00:00
8,0.897263,1.487678,0.573722,00:00
9,0.756707,1.423003,0.580895,00:00
10,0.645525,1.411692,0.579901,00:00
11,0.559508,1.431999,0.576776,00:00
12,0.494914,1.487229,0.558665,00:00
13,0.445825,1.507274,0.566335,00:00
14,0.407051,1.524303,0.578693,00:00
15,0.376531,1.602618,0.577628,00:00
16,0.354285,1.592361,0.579403,00:00
17,0.338350,1.600212,0.584446,00:00
18,0.326798,1.622739,0.578977,00:00
19,0.319938,1.623776,0.580043,00:00


<br>

## 2-layer GRU

One of the nice things about `nn.RNN` is that you can now say how many layers you want.

But here's the thing. When you think about this:

<img src='https://raw.githubusercontent.com/hiromis/notes/master/lesson7/54.png'/>

Think about it without the loop. It looks like this:

<img src='https://raw.githubusercontent.com/hiromis/notes/master/lesson7/55.png'/>

It keeps on going, and we've got a BPTT of 20, so there's 20 layers of this. And we know from the "Visualizing the Loss Landscape of Neural Nets" paper that deep networks have awful bumpy loss surfaces. So when you start creating long timescales and multiple layers, these things get impossible to train. There's a few tricks you can do.

One thing is you can add skip connections, of course. But what people normally do is, instead of just adding these together (green and orange arrows), they actually use a little mini neural net to decide how much of the green arrow to keep and how much of the orange arrow to keep.

When you do that, you get something that's either called GRU or LSTM depending on the details of that little neural net. And we'll learn about the details of those neural nets in part 2. They really don't matter though, frankly.
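The "decide how much to keep" idea can be sketched with a single scalar gate. This is a deliberately simplified illustration and not the GRU or LSTM equations: real cells use learned weight matrices and more than one gate, and the names `gated_update`, `w_h`, `w_x` and `b` are made up here.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_update(h_old, x_new, w_h=0.0, w_x=0.0, b=0.0):
    """Blend the old state and the new input using a gate z between 0 and 1."""
    z = sigmoid(w_h * h_old + w_x * x_new + b)   # the "mini neural net": one tiny linear layer
    return z * x_new + (1.0 - z) * h_old         # z near 1 keeps the input, z near 0 keeps the state

print(gated_update(2.0, 4.0))            # 3.0: gate at 0.5, an even mix
print(gated_update(2.0, 4.0, b=50.0))    # close to 4.0: almost all new input
print(gated_update(2.0, 4.0, b=-50.0))   # close to 2.0: almost all old state
```

Compared with plain addition, the gate lets the network learn when to hold on to its state across many steps.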

So we can now say let's create a GRU instead. It's just like what we had before, but it'll handle longer sequences in deeper networks. Let's use two layers.

`self.rnn = nn.GRU(nh, nh, 2, batch_first=True)`

In [57]:
class Model5(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.rnn = nn.GRU(nh, nh, 2, batch_first=True)
        self.h_o = nn.Linear(nh,nv)
        self.bn = BatchNorm1dFlat(nh)
        self.h = torch.zeros(2, bs, nh).cuda()
        
    def forward(self, x):
        res,h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [58]:
learn = Learner(data, Model5(), metrics=accuracy)

In [59]:
learn.fit_one_cycle(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,3.057909,2.462844,0.440767,00:00
1,1.850734,1.230502,0.658168,00:00
2,0.915347,1.026480,0.800213,00:00
3,0.442183,1.054205,0.821378,00:00
4,0.219633,1.137145,0.821520,00:00
5,0.113713,1.046862,0.831676,00:00
6,0.061782,1.092205,0.831534,00:00
7,0.035847,1.132005,0.831108,00:00
8,0.022572,1.173261,0.831321,00:00
9,0.015913,1.155626,0.830753,00:00


## fin