# Chapter 6: RNN English Numbers
## Predicting the English word version of numbers using an RNN

There is an English-language corpus that lists the numbers in order:

eight thousand one, eight thousand two...

Jeremy created it to check that things are working, to debug, and to understand what is going on. When experimenting with new ideas, it's nice to have a smaller dataset so you can quickly get a sense of whether your ideas are promising (Imagenette and Imagewoof play this role for computer vision). English word numbers serve as a good dataset for learning about RNNs.

In deep learning (DL), there are 2 types of numbers:

- **Parameters**: numbers that are learned. 
- **Activations**: numbers that are calculated (by affine functions & element-wise non-linearities). 

When learning a new concept in DL, ask yourself: *Is this a parameter or an activation?*
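
To make the distinction concrete, here is a minimal sketch in plain Python (no PyTorch). The names `weights`, `bias`, `affine` and `relu` are made up for illustration, and the parameter values are fixed by hand rather than learned.

```python
# Parameters: the numbers a training loop would update.
# (Fixed by hand here, purely for illustration.)
weights = [0.5, -1.0, 2.0]
bias = 0.25

def affine(xs, w, b):
    """Affine function: weighted sum of the inputs plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, xs)) + b

def relu(v):
    """Element-wise non-linearity: negative values become zero."""
    return max(0.0, v)

inputs = [1.0, 2.0, 3.0]

# Activations: numbers that are calculated from inputs and parameters.
pre_activation = affine(inputs, weights, bias)
activation = relu(pre_activation)

print(pre_activation, activation)
```

Nothing in `pre_activation` or `activation` is stored between calls: change the inputs and they change, whereas `weights` and `bias` only change when training updates them.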

In [1]:
from fastai.text.all import *
bs = 64

path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

(#2) [Path('/home/fastai2/.fastai/data/human_numbers/valid.txt'),Path('/home/fastai2/.fastai/data/human_numbers/train.txt')]

In [2]:
def readnums(d): return [", ".join(o.strip() for o in open(path/d).readlines())]

In [3]:
train_txt = readnums("train.txt")
train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

In [4]:
valid_txt = readnums("valid.txt")
valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'

From here onwards, it has been difficult to directly translate what is being done. See fastbook Chapter 12 for the new version of the RNN introduction.

In [5]:
lines = L()
with open(path/"train.txt") as f: lines += L(*f.readlines())
with open(path/"valid.txt") as f: lines += L(*f.readlines())
lines

(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

In [6]:
text = " . ".join([l.strip() for l in lines]) # separator
text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

In [7]:
tokens = text.split(" ")
vocab = L(*tokens).unique()
vocab

(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

In [8]:
word2idx = {w:i for i, w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
nums

(#63095) [0,1,2,1,3,1,4,1,5,1...]
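
The numericalization step above can be sketched in plain Python on a tiny string. This is only an illustration: it uses built-in lists and `dict.fromkeys` in place of fastai's `L(...).unique()`, and the sample text is invented.

```python
# A tiny stand-in for the corpus, using the same " . " separator.
text = "one . two . three . twenty one"

tokens = text.split(" ")

# Unique tokens in order of first appearance
# (plain-Python stand-in for L(*tokens).unique()).
vocab = list(dict.fromkeys(tokens))

# Map each word to its index, then encode the whole token stream.
word2idx = {w: i for i, w in enumerate(vocab)}
nums = [word2idx[t] for t in tokens]

# Decoding with the vocab recovers the original text.
decoded = " ".join(vocab[i] for i in nums)

print(vocab)
print(nums)
```

Because `vocab` keeps first-appearance order, `vocab[i]` is the inverse of `word2idx`, which is what makes the round trip work.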

In [9]:
seqs = L((tensor(nums[i : i + 3]), nums[i + 3]) for i in range(0, len(nums) - 4, 3))
seqs

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]
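
The windowing above (three tokens in, the fourth as the target, stepping by three) can be checked on a short list. A minimal sketch in plain Python; `make_windows` is a hypothetical helper name, and plain lists stand in for tensors.

```python
def make_windows(nums, n=3):
    """Pair each run of n tokens with the token that follows it.

    Windows start every n tokens, so consecutive inputs do not overlap.
    The range bound mirrors the cell above (len(nums) - 4 when n == 3).
    """
    return [(nums[i:i + n], nums[i + n])
            for i in range(0, len(nums) - n - 1, n)]

# First ten token ids from the corpus above.
nums = [0, 1, 2, 1, 3, 1, 4, 1, 5, 1]
samples = make_windows(nums)

print(samples)
```

The two samples produced here match the first two entries of `seqs` shown above.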

In [10]:
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)

Note that even with this setup, the first validation item below differs from the original. This is because of how we split: the original treats the text as one long string, whereas here we split it into groups of three tokens.

In [11]:
dls.valid_ds[0][0]

tensor([ 1,  8, 29])

I don't know how to get `bptt` or `valid_dl` here and, most importantly, `dls.show_batch()` does not work either. We will skip that and go directly to the model, which uses these layers:

- `i_h`: input to hidden
- `h_h`: hidden to hidden
- `h_o`: hidden to output
- `bn`: batchnorm

In [12]:
class Model0(Module):
    def __init__(self, vocab_sz, n_hidden):
        # No need to call super().__init__(): fastai's Module (unlike plain
        # nn.Module) takes care of that for us.
        nv, nh = vocab_sz, n_hidden
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = self.bn(F.relu(self.i_h(x[:, 0])))
        if x.shape[1] > 1: 
            h += self.i_h(x[:, 1])
            h = self.bn(F.relu(self.h_h(h)))
        if x.shape[1] > 2:
            h += self.i_h(x[:, 2])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

In [13]:
learn = Learner(dls, Model0(len(vocab), 64), loss_func=F.cross_entropy, 
            metrics=accuracy)
learn.fit_one_cycle(6, 1e-4)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 3.308321 | 3.394462 | 0.045876 | 00:02 |
| 1 | 2.672316 | 2.868347 | 0.335631 | 00:02 |
| 2 | 2.15896 | 2.460403 | 0.403375 | 00:02 |
| 3 | 1.917717 | 2.26964 | 0.437842 | 00:02 |
| 4 | 1.825647 | 2.206414 | 0.441169 | 00:02 |
| 5 | 1.803799 | 2.197597 | 0.440694 | 00:02 |


For how to use this model to make predictions, or to judge how good it is, see Chapter 12 of fastbook.

## Same thing with a loop

In [14]:
class Model1(Module):
    def __init__(self, vocab_sz, n_hidden):
        # No need to call super().__init__(): fastai's Module (unlike plain
        # nn.Module) takes care of that for us.
        nv, nh = vocab_sz, n_hidden
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = self.bn(F.relu(self.i_h(x[:, 0])))
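        # Note: starting at 0 adds token 0's embedding a second time, unlike
        # Model0; use range(1, x.shape[1]) to match Model0 exactly.
        for i in range(x.shape[1]):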
            h += self.i_h(x[:, i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

In [15]:
def fit_model(model, epochs, lr, **kwargs):
    learn = Learner(dls, model, loss_func=F.cross_entropy, 
                metrics=accuracy)
    learn.fit_one_cycle(epochs, lr, **kwargs)

In [16]:
fit_model(Model1(len(vocab), 64), 6, 1e-4)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 3.391133 | 3.453091 | 0.036843 | 00:03 |
| 1 | 2.796483 | 2.979537 | 0.249584 | 00:02 |
| 2 | 2.320133 | 2.66655 | 0.325648 | 00:02 |
| 3 | 2.063687 | 2.512197 | 0.340385 | 00:02 |
| 4 | 1.958931 | 2.457953 | 0.35227 | 00:03 |
| 5 | 1.933541 | 2.450973 | 0.365343 | 00:02 |
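
To see that a loop really is just a refactoring of the unrolled version, the two can be compared on the same layers. This is a simplified sketch, not the models above: it uses plain PyTorch, leaves out batchnorm and the output layer, and starts the loop at index 1 because token 0 is consumed before the loop. The function names `unrolled` and `looped` are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
nv, nh = 30, 8                # vocab size and hidden size (arbitrary)
i_h = nn.Embedding(nv, nh)    # input to hidden
h_h = nn.Linear(nh, nh)       # hidden to hidden

def unrolled(x):
    """Three tokens handled by three hand-written steps."""
    h = F.relu(i_h(x[:, 0]))
    h = F.relu(h_h(h + i_h(x[:, 1])))
    h = F.relu(h_h(h + i_h(x[:, 2])))
    return h

def looped(x):
    """Same computation, with the repeated step moved into a loop."""
    h = F.relu(i_h(x[:, 0]))
    for i in range(1, x.shape[1]):
        h = F.relu(h_h(h + i_h(x[:, i])))
    return h

x = torch.randint(0, nv, (4, 3))   # batch of 4 sequences of 3 token ids
same = torch.allclose(unrolled(x), looped(x))
print(same)
```

The same two layers are reused at every step, which is the defining property of an RNN.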


## Multi fully connected model
Why not predict token 2 from token 1, token 3 from token 2, etc.? I'm not sure how to adapt the original code, since `bptt` isn't defined in the new fastai. It appears to correspond to the sequence length, so let's try that.
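
Predicting every next token means the target is simply the input shifted by one position. A minimal sketch in plain Python; `make_shifted_pairs` is a hypothetical helper name, `sl` stands for the sequence length, and plain lists stand in for tensors.

```python
def make_shifted_pairs(nums, sl):
    """Split nums into inputs of length sl, each paired with the same
    span shifted one token to the right (the next-token targets)."""
    return [(nums[i:i + sl], nums[i + 1:i + sl + 1])
            for i in range(0, len(nums) - sl - 1, sl)]

pairs = make_shifted_pairs(list(range(10)), sl=4)

for x, y in pairs:
    print(x, "->", y)
```

Each target position `y[k]` is the token that follows `x[k]`, so one sequence now provides `sl` training signals instead of one.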

In [17]:
m = len(seqs) // bs
m, bs, len(seqs)

(328, 64, 21031)

In [18]:
def group_chunks(ds, bs): 
    m = len(ds) // bs
    new_ds = L()
    for i in range(m): new_ds += L(ds[i + m * j] for j in range(bs))
    return new_ds

In [19]:
sl = 16  # better than bptt = sl = 20
seqs = L((tensor(nums[i : i + sl]), tensor(nums[i + 1: i + sl + 1]))
        for i in range(0, len(nums) - sl - 1, sl))
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(group_chunks(seqs[:cut], bs),
                             group_chunks(seqs[cut:], bs),
                             bs=bs, drop_last=True, shuffle=False)

In [20]:
class Model2(Module):
    def __init__(self, nv, nh):
        self.nh = nh
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = torch.zeros(x.shape[0], self.nh, device=x.device)
        # h = 0
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))
        # self.h = self.h.detach()
        return torch.stack(res, dim=1)

In [21]:
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))

def fit_model(model, epochs, lr, **kwargs):
    learn = Learner(dls, model, loss_func=loss_func, 
                metrics=accuracy)
    learn.fit_one_cycle(epochs, lr, **kwargs)

In [22]:
fit_model(Model2(len(vocab), 64), 10, 1e-4, pct_start=0.1)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 3.613575 | 3.539885 | 0.030518 | 00:01 |
| 1 | 3.402714 | 3.313725 | 0.061442 | 00:01 |
| 2 | 3.147016 | 3.100097 | 0.17334 | 00:01 |
| 3 | 2.898019 | 2.918723 | 0.261475 | 00:01 |
| 4 | 2.685772 | 2.779171 | 0.277018 | 00:01 |
| 5 | 2.521361 | 2.680985 | 0.288086 | 00:01 |
| 6 | 2.405792 | 2.620372 | 0.293132 | 00:01 |
| 7 | 2.332363 | 2.588161 | 0.294027 | 00:01 |
| 8 | 2.291258 | 2.575815 | 0.295085 | 00:01 |
| 9 | 2.272216 | 2.573973 | 0.295329 | 00:01 |


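If you get a size-mismatch error (something like `torch.Size` values not being equal) that seems to push you toward using the vocab length on the input, it is because the loss function we used previously no longer works: the model now returns a prediction for every token in the sequence, not just the last one. That is why a new `loss_func` is defined above, which flattens the predictions and the targets before calling `F.cross_entropy`.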
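
The shapes involved can be checked on random data. A minimal sketch in plain PyTorch, with made-up sizes; the transposed call at the end is an alternative I believe is equivalent, shown only as a cross-check.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs, sl, nv = 4, 16, 30                   # batch, sequence length, vocab size

preds = torch.randn(bs, sl, nv)          # one score per vocab item, per token
targs = torch.randint(0, nv, (bs, sl))   # one target id per token

# Flatten so that every token position becomes its own row.
flat_preds = preds.view(-1, nv)          # (bs * sl, nv)
flat_targs = targs.view(-1)              # (bs * sl,)
loss = F.cross_entropy(flat_preds, flat_targs)

# Cross-check: cross_entropy also accepts (batch, classes, seq) directly.
alt_loss = F.cross_entropy(preds.transpose(1, 2), targs)

print(flat_preds.shape, flat_targs.shape, loss.item())
```

Passing the unflattened `(bs, sl, nv)` predictions with `(bs, sl)` targets is what triggers the size error, because the class dimension is expected second.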

## Maintain State
Keep hidden state from previous line of text, so we're not starting over again on each line of text. 
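
Carrying the hidden state over only makes sense if row `j` of each batch continues row `j` of the previous batch. That is what the `group_chunks` helper defined earlier arranges. A minimal sketch on a toy dataset, using plain lists in place of fastai's `L`:

```python
def group_chunks(ds, bs):
    """Reorder ds so that, read bs items at a time, item j of each batch
    directly follows item j of the previous batch in the original order."""
    m = len(ds) // bs
    new_ds = []
    for i in range(m):
        new_ds += [ds[i + m * j] for j in range(bs)]
    return new_ds

ds = list(range(12))   # stand-in for 12 consecutive sequences
bs = 3
ordered = group_chunks(ds, bs)

# Cut the reordered list into batches, as a non-shuffling DataLoader would.
batches = [ordered[k:k + bs] for k in range(0, len(ordered), bs)]
print(batches)
```

Reading down any column of `batches` gives consecutive items (0, 1, 2, 3 in the first column), so the state kept for that row always belongs to the text that came just before. This is also why the loaders are built with `shuffle=False` and `drop_last=True`.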

In [25]:
class Model3(Module):
    def __init__(self, nv, nh):
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh)  # plain tensor, so it stays on the CPU

    def forward(self, x):
        res = []
        h = self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
            res.append(self.bn(h))
        self.h = h.detach()
        res = torch.stack(res, dim=1)
        res = self.h_o(res)
        return res

In [26]:
fit_model(Model3(len(vocab), 64), 20, 3e-3)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 3.26888 | 3.152752 | 0.177409 | 00:01 |
| 1 | 2.537653 | 2.118135 | 0.461914 | 00:01 |
| 2 | 1.908176 | 1.932215 | 0.376546 | 00:01 |
| 3 | 1.62992 | 1.963621 | 0.32373 | 00:01 |
| 4 | 1.515565 | 2.00892 | 0.319987 | 00:01 |
| 5 | 1.433965 | 1.786368 | 0.469727 | 00:01 |
| 6 | 1.295307 | 1.931554 | 0.47819 | 00:01 |
| 7 | 1.154681 | 1.888782 | 0.504232 | 00:01 |
| 8 | 1.044125 | 1.784899 | 0.531331 | 00:01 |
| 9 | 0.951643 | 1.705102 | 0.549805 | 00:01 |


We could not put the previous cell on cuda, but don't worry: the calculation is still really fast with 6 vCPUs.

The reason is that **all** the layers and the data need to be on the same device. Calling `.cuda()` on the model moves its parameters and registered buffers, but a plain tensor attribute such as `self.h` is left on the CPU, so it has to be moved explicitly. The next cell demonstrates this by calling `.cuda()` *redundantly* on every piece.
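
One way to avoid moving each piece by hand is to register the hidden state as a buffer, so it travels with the model. This is a sketch of an alternative, not what the notebook does: `StatefulToy` is a made-up class, it subclasses plain `nn.Module` (so `super().__init__()` is required), and it falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn

class StatefulToy(nn.Module):
    def __init__(self, bs, nh):
        super().__init__()  # needed with plain nn.Module
        self.h_h = nn.Linear(nh, nh)
        # Plain attribute: ignored when the model is moved to a device.
        self.plain_h = torch.zeros(bs, nh)
        # Registered buffer: moved together with the parameters.
        self.register_buffer("h", torch.zeros(bs, nh))

    def forward(self, x):
        h = torch.relu(self.h_h(self.h + x))
        self.h = h.detach()  # keep the state, drop the gradient history
        return h

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = StatefulToy(bs=4, nh=8).to(device)
out = model(torch.randn(4, 8, device=device))

buffer_names = [name for name, _ in model.named_buffers()]
print(buffer_names, model.h.device, model.plain_h.device)
```

On a machine with a GPU, `model.h` ends up on cuda while `model.plain_h` stays on the CPU, which is the mismatch described above.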

## nn.RNN
Refactor to use PyTorch's `nn.RNN`. That's what we would use in practice, but now we know what happens inside.
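
Before swapping it in, it helps to look at what `nn.RNN` returns. A minimal sketch with made-up sizes and random input in place of embeddings:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, sl, nh = 4, 16, 8                 # batch, sequence length, hidden size

rnn = nn.RNN(nh, nh, batch_first=True)
x = torch.randn(bs, sl, nh)           # stands in for the embedded tokens
h0 = torch.zeros(1, bs, nh)           # (num_layers, batch, hidden)

res, h = rnn(x, h0)

print(res.shape)   # hidden state at every time step
print(h.shape)     # hidden state after the last time step
```

`res` holds the hidden state for each token, which is what the output layer is applied to, while `h` is the final state that gets detached and stored for the next batch. For a single-layer, one-directional RNN, `res[:, -1]` and `h[0]` are the same values.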

In [27]:
class Model4(Module):
    def __init__(self, nv, nh):
        self.i_h = nn.Embedding(nv, nh).cuda()
        self.rnn = nn.RNN(nh, nh, batch_first=True).cuda()
        self.h_o = nn.Linear(nh, nv).cuda()
        self.bn = BatchNorm1dFlat(nh).cuda()
        self.h = torch.zeros(1, bs, nh).cuda()

    def forward(self, x):
        x = x.cuda()
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [29]:
fit_model(Model4(len(vocab), 64), 20, 3e-3)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 3.051171 | 2.653143 | 0.406494 | 00:00 |
| 1 | 2.254001 | 1.929059 | 0.471191 | 00:00 |
| 2 | 1.772074 | 1.903632 | 0.320882 | 00:00 |
| 3 | 1.575565 | 1.955284 | 0.321859 | 00:00 |
| 4 | 1.476673 | 1.901512 | 0.400065 | 00:00 |
| 5 | 1.268954 | 1.798347 | 0.396566 | 00:00 |
| 6 | 1.056735 | 1.57796 | 0.478353 | 00:00 |
| 7 | 0.876947 | 1.504796 | 0.532145 | 00:00 |
| 8 | 0.7286 | 1.509092 | 0.553141 | 00:00 |
| 9 | 0.602096 | 1.614856 | 0.560872 | 00:00 |


## 2-layer GRU
When we have long time scales and deeper networks, these become impossible to train. One way to address this is to add mini neural networks that decide how much of the "green arrow" and the "orange arrow" (see the slides in the GitHub repo) to keep. GRUs and LSTMs are RNN cells built around such mini-NNs, usually called gates.
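
The core of a gate is a number between 0 and 1 that blends the old state with a new candidate. A scalar sketch in plain Python: `gated_update` is a made-up name, and in a real GRU the gate and the candidate are themselves computed by small learned layers from the input and the previous state (and there is a second, reset gate as well).

```python
import math

def sigmoid(v):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def gated_update(h_prev, candidate, gate_logit):
    """Blend the previous state and a candidate state.

    z close to 1 keeps the old state; z close to 0 replaces it.
    """
    z = sigmoid(gate_logit)
    return z * h_prev + (1.0 - z) * candidate

keep_old = gated_update(h_prev=1.0, candidate=-1.0, gate_logit=20.0)
half_half = gated_update(h_prev=1.0, candidate=-1.0, gate_logit=0.0)
take_new = gated_update(h_prev=1.0, candidate=-1.0, gate_logit=-20.0)

print(keep_old, half_half, take_new)
```

When the gate stays near 1, the state passes through a step almost unchanged, which is what lets information survive over many time steps.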

In [30]:
class Model5(Module):
    def __init__(self, nv, nh):
        self.i_h = nn.Embedding(nv, nh).cuda()
        self.rnn = nn.GRU(nh, nh, 2, batch_first=True).cuda()
        self.h_o = nn.Linear(nh, nv).cuda()
        self.bn = BatchNorm1dFlat(nh).cuda()
        self.h = torch.zeros(2, bs, nh).cuda()

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()
        return self.h_o(self.bn(res))

In [31]:
fit_model(Model5(len(vocab), 64), 10, 1e-2)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 2.229088 | 1.741237 | 0.520752 | 00:00 |
| 1 | 1.071178 | 1.000395 | 0.759928 | 00:00 |
| 2 | 0.444417 | 0.752374 | 0.825358 | 00:00 |
| 3 | 0.195652 | 0.793777 | 0.829671 | 00:00 |
| 4 | 0.092799 | 0.872465 | 0.81429 | 00:00 |
| 5 | 0.048371 | 0.934903 | 0.821045 | 00:00 |
| 6 | 0.028002 | 0.93107 | 0.828613 | 00:00 |
| 7 | 0.017872 | 0.973533 | 0.816406 | 00:00 |
| 8 | 0.011911 | 1.010683 | 0.818604 | 00:00 |
| 9 | 0.008604 | 1.022495 | 0.815348 | 00:00 |


Ignore the overfitting of the model above (the training loss keeps falling while the validation loss rises after epoch 2).

ULMFiT: we swap out `self.h_o` for a classifier to do text classification.
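
As a rough illustration of that swap, the per-token output layer can be replaced with a head that classifies the whole sequence from its last hidden state. This is a simplified sketch in plain PyTorch, not ULMFiT itself (whose real head is more elaborate); `ToyClassifier`, `h_c` and `n_classes` are made-up names, and state carry-over and batchnorm are left out.

```python
import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    def __init__(self, nv, nh, n_classes):
        super().__init__()  # needed with plain nn.Module
        self.i_h = nn.Embedding(nv, nh)
        self.rnn = nn.GRU(nh, nh, 2, batch_first=True)
        # Replaces h_o: one score per class instead of per vocab item.
        self.h_c = nn.Linear(nh, n_classes)

    def forward(self, x):
        res, _ = self.rnn(self.i_h(x))
        # Use only the last time step to summarize the sequence.
        return self.h_c(res[:, -1])

torch.manual_seed(0)
model = ToyClassifier(nv=30, nh=64, n_classes=2)
x = torch.randint(0, 30, (4, 16))    # 4 sequences of 16 token ids
scores = model(x)
print(scores.shape)
```

The embedding and the GRU are unchanged, so weights learned on the language-modeling task could be reused and only the new head trained from scratch.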

RNNs are just refactored, fully connected NNs. We can use the same approach for any sequence labeling task (part-of-speech tagging, classifying whether material is sensitive, etc.).