# Predicting English word version of numbers using an RNN

## Data

In [0]:
from fastai.text import *
import pdb

In [0]:
bs=64

In [3]:
path = untar_data(URLs.HUMAN_NUMBERS)
path.ls()

Downloading http://files.fast.ai/data/examples/human_numbers


[PosixPath('/root/.fastai/data/human_numbers/valid.txt'),
 PosixPath('/root/.fastai/data/human_numbers/train.txt')]

In [0]:
def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]

train.txt gives us a sequence of numbers written out as English words:

In [5]:
train_txt = readnums('train.txt'); train_txt[0][:80]

'one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirt'

In [6]:
valid_txt = readnums('valid.txt'); valid_txt[0][-80:]

' nine thousand nine hundred ninety eight, nine thousand nine hundred ninety nine'
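
To make the effect of `readnums` concrete, here is a minimal sketch on a made-up three-line file. The name `readnums_sketch` and the temporary file are hypothetical stand-ins, not part of the notebook; the joining logic mirrors the one-liner above.

```python
import tempfile
from pathlib import Path

def readnums_sketch(file_path):
    # Same idea as readnums: strip each line, join the lines with ', ',
    # and wrap the single long string in a one-element list.
    with open(file_path) as f:
        return [', '.join(line.strip() for line in f)]

# Hypothetical three-line stand-in for train.txt
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'sample.txt'
    sample.write_text('one\ntwo\nthree\n')
    joined = readnums_sketch(sample)

print(joined)  # ['one, two, three']
```

The whole file becomes one long document, which is why the `LabelList` below reports a single item.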

In [7]:
train = TextList(train_txt, path=path)
valid = TextList(valid_txt, path=path)
src = ItemLists(path, train, valid).label_for_lm()

In [8]:
src

LabelLists;

Train: LabelList (1 items)
x: LMTextList
xxbos one , two , three , four , five , six , seven , eight , nine , ten , eleven , twelve , thirteen , fourteen , fifteen , sixteen , seventeen , eighteen , nineteen , twenty , twenty one , twenty two , twenty three , twenty four , twenty five , twenty six , twenty seven , twenty eight , twenty nine , thirty , thirty one , thirty two , thirty three , thirty four , thirty five , thirty six , thirty seven , thirty eight , thirty nine , forty , forty one , forty two , forty three , forty four , forty five , forty six , forty seven , forty eight , forty nine , fifty , fifty one , fifty two , fifty three , fifty four , fifty five , fifty six , fifty seven , fifty eight , fifty nine , sixty , sixty one , sixty two , sixty three , sixty four , sixty five , sixty six , sixty seven , sixty eight , sixty nine , seventy , seventy one , seventy two , seventy three , seventy four , seventy five , seventy six , seventy seven , seventy eight , se

## Single output

N-gram style: study a fixed window of tokens, then predict the token that follows (with `bptt = 3` below, read 3 tokens and predict the 4th).

In [0]:
def loss_f(input, target): return F.cross_entropy(input, target[:,-1])
def acc_f(input, target): return accuracy(input, target[:,-1])
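
The `target[:,-1]` indexing can be illustrated without tensors. This is a minimal sketch, assuming language-model targets are the inputs shifted by one token; the toy ids and the helper `accuracy_last` are made up for illustration.

```python
# Hypothetical toy batch of token ids: 2 rows, each covering 4 tokens.
stream = [[4, 5, 6, 7], [8, 9, 10, 11]]

inputs = [row[:-1] for row in stream]    # what the model reads
targets = [row[1:] for row in stream]    # inputs shifted left by one

# Plain-list equivalent of target[:, -1]: the token right after each window.
last_targets = [row[-1] for row in targets]

def accuracy_last(predicted_ids, targets):
    # Fraction of rows whose prediction matches the final target token.
    hits = sum(p == t[-1] for p, t in zip(predicted_ids, targets))
    return hits / len(targets)

score = accuracy_last([7, 0], targets)   # first row right, second row wrong
```

Only the last column of the targets contributes to the loss and the metric here; every other column is ignored.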

In [0]:
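wordvec_len = 100  # size of each word embedding vector
nh = 64            # number of hidden activations
# nv (the vocab size) is computed below, once `data` has been created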

`bptt` stands for *back-propagation through time*.  This tells us how many steps of history we are considering.

In [0]:
bptt = 3
data = src.databunch(bs=bs, bptt=bptt)
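
As a rough picture of what `bptt` controls, the sketch below slices one token stream into windows of `bptt` tokens, each paired with the same window shifted one step ahead. It is a simplification under stated assumptions: `bptt_windows` is a hypothetical helper, and it ignores the way the real data loader also splits the stream into `bs` parallel rows.

```python
def bptt_windows(tokens, bptt):
    # Slice a token stream into (input, target) pairs of length bptt.
    # Each target is the input shifted one step into the future.
    pairs = []
    for start in range(0, len(tokens) - 1, bptt):
        x = tokens[start:start + bptt]
        y = tokens[start + 1:start + 1 + bptt]
        if len(x) == len(y):  # drop a ragged final window
            pairs.append((x, y))
    return pairs

tokens = 'one , two , three , four'.split()
pairs = bptt_windows(tokens, bptt=3)
for x, y in pairs:
    print(x, '->', y)
```

A larger `bptt` gives the model more history per window, at the cost of a longer unrolled loop.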

In [0]:
v = data.train_ds.vocab
nv = len(v.itos)
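
For intuition about `itos` ("index to string") and the vocab size `nv`, here is a minimal sketch that builds such a mapping by hand. It is an assumption-laden simplification: the real vocab also reserves special tokens and may order entries differently, and `stoi` and `nv_sketch` are names chosen for this example.

```python
tokens = 'xxbos one , two , three , twenty one'.split()

itos = []   # index -> string
stoi = {}   # string -> index
for tok in tokens:
    if tok not in stoi:
        stoi[tok] = len(itos)
        itos.append(tok)

ids = [stoi[t] for t in tokens]   # the text as the model sees it
nv_sketch = len(itos)             # number of distinct tokens
```

The number words reuse a small set of tokens, so the vocabulary stays tiny even though the text is long.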

In [0]:
class Model1(nn.Module):
  def __init__(self):
    super().__init__()
    self.emb = nn.Embedding(nv, wordvec_len) # word embedding (learned, like word2vec vectors)
    self.input = nn.Linear(wordvec_len, nh) # input layer, green arrows
    self.hid = nn.Linear(nh, nh) # hidden layer, orange arrows
    self.out = nn.Linear(nh, nv) # output layer, green arrows
    self.bn = nn.BatchNorm1d(nh)
  
  def forward(self, x):
    h = torch.zeros(x.shape[0], nh).to(device=x.device)
    for i in range(x.shape[1]):
      h = h + F.relu(self.input(self.emb(x[:, i])))
      h = self.bn(F.relu(self.hid(h)))
    return self.out(h)

In [28]:
learn = Learner(data, Model1(), loss_func=loss_f, metrics=acc_f)
learn.fit_one_cycle(10, 1e-4)

epoch,train_loss,valid_loss,acc_f,time
0,3.546392,3.752568,0.024586,00:01
1,2.911883,3.326832,0.227022,00:01
2,2.289827,2.718843,0.44761,00:01
3,1.943045,2.349009,0.465303,00:01
4,1.766844,2.192134,0.466222,00:01
5,1.678379,2.125904,0.466452,00:01
6,1.633395,2.100061,0.464844,00:01
7,1.610903,2.090777,0.464614,00:01
8,1.601126,2.087984,0.459099,00:01
9,1.598013,2.088171,0.459099,00:01


## Multi output

Before, we predicted only the token that follows each window: given `bptt` tokens, what is the next one?  That approach throws away a lot of training signal.  Why not predict token 2 from token 1, then token 3 from tokens 1 and 2, then token 4, and so on?  We will modify our model to do this, and use a longer window (`bptt = 20`).
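
The difference in training signal can be counted directly. In this minimal sketch, `supervised_pairs` is a hypothetical helper that lists the (context, next token) pairs a single window contributes under each scheme.

```python
def supervised_pairs(x, y, multi_output):
    # Return the (context, next_token) pairs one window trains on.
    if multi_output:
        return [(x[:i + 1], y[i]) for i in range(len(x))]
    return [(x, y[-1])]

x = ['one', ',', 'two']   # input window
y = [',', 'two', ',']     # same window shifted by one token

single = supervised_pairs(x, y, multi_output=False)
multi = supervised_pairs(x, y, multi_output=True)
print(len(single), len(multi))  # 1 3
```

Note that the early pairs in the multi-output case have very short contexts, so those predictions are made with little history.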

In [0]:
bptt = 20
data = src.databunch(bs=bs, bptt=bptt)

In [0]:
class Model2(nn.Module):
  def __init__(self):
    super().__init__()
    self.emb = nn.Embedding(nv, wordvec_len)
    self.input = nn.Linear(wordvec_len, nh)
    self.hid = nn.Linear(nh, nh)
    self.out = nn.Linear(nh, nv)
    self.bn = nn.BatchNorm1d(nh)
  
  def forward(self, x):
    h = torch.zeros(x.shape[0], nh).to(device=x.device)
    res = []
    for i in range(x.shape[1]):
      h = h + F.relu(self.input(self.emb(x[:, i])))
      h = self.bn(F.relu(self.hid(h)))
      res.append(self.out(h))
    return torch.stack(res, dim=1)


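# Redefinition: this simpler variant replaces the Model2 above. It embeds
# straight to nh and applies batch norm just before the output layer.
class Model2(nn.Module):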
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)
        self.h_h = nn.Linear(nh,nh)
        self.h_o = nn.Linear(nh,nv)
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(device=x.device)
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))
        #pdb.set_trace()
        return torch.stack(res, dim=1)

In [45]:
learn = Learner(data, Model2(), metrics=accuracy)
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.410916,3.198211,0.233026,00:00
1,2.515452,2.168845,0.308949,00:00
2,1.924214,2.302819,0.311932,00:00
3,1.675017,2.314949,0.312074,00:00
4,1.563446,2.288628,0.311648,00:00
5,1.503242,2.179128,0.317827,00:00
6,1.466725,2.19635,0.322159,00:00
7,1.441664,2.264104,0.326847,00:00
8,1.422222,2.300416,0.328764,00:00
9,1.409647,2.295279,0.32848,00:00


Note that our accuracy is worse now, because we are doing a harder task.  When we predict token k early in a window (k < `bptt`), we have less history to help us than when we only predicted the token after the full window.

## Maintain state

To address this issue, let's keep the hidden state from the previous line of text, so we are not starting over again on each new line of text.
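
A toy scalar recurrence shows why carrying the state matters. This is a sketch with made-up weights (`w_x`, `w_h`) and a made-up sequence, not the model used here: processing a sequence in two chunks gives the same result as processing it whole only if the hidden state is handed from one chunk to the next.

```python
import math

def run(xs, h, w_x=0.5, w_h=0.8):
    # Toy scalar recurrence; w_x and w_h are arbitrary illustrative weights.
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
    return h

seq = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0]
chunk_a, chunk_b = seq[:3], seq[3:]

whole = run(seq, h=0.0)
carried = run(chunk_b, h=run(chunk_a, h=0.0))  # keep state between chunks
reset = run(chunk_b, h=0.0)                    # start from zero each chunk

print(whole == carried, whole == reset)
```

In the model below, the call to `detach()` keeps the state's values while cutting the gradient history, so backpropagation still stops at the window boundary.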

In [0]:
class Model3(nn.Module):
  def __init__(self):
    super().__init__()
    self.emb = nn.Embedding(nv, wordvec_len)
    self.input = nn.Linear(wordvec_len, nh)
    self.hid = nn.Linear(nh, nh)
    self.out = nn.Linear(nh, nv)
    self.bn = nn.BatchNorm1d(nh)
    self.h = torch.zeros(bs, nh).cuda()
  
  def forward(self, x):
    h = self.h
    res = []
    for i in range(x.shape[1]):
      h = h + F.relu(self.input(self.emb(x[:, i])))
      h = self.bn(F.relu(self.hid(h)))
      res.append(h)
    self.h = h.detach()
    res = torch.stack(res, dim=1)
    return self.out(res)

In [47]:
learn = Learner(data, Model3(), metrics=accuracy)
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.737108,10.88684,0.072727,00:00
1,3.33051,2.784832,0.419957,00:00
2,2.551389,2.021663,0.471023,00:00
3,1.926469,1.872175,0.488423,00:00
4,1.542643,1.757848,0.522585,00:00
5,1.283859,1.702196,0.56946,00:00
6,1.080044,1.602201,0.572514,00:00
7,0.915017,1.656019,0.571875,00:00
8,0.783566,1.476927,0.580185,00:00
9,0.672636,1.37849,0.583168,00:00


In [0]:
class Model3(nn.Module):
  def __init__(self):
    super().__init__()
    self.emb = nn.Embedding(nv, wordvec_len)
    self.input = nn.Linear(wordvec_len, nh)
    self.hid = nn.Linear(nh, nh)
    self.out = nn.Linear(nh, nv)
    self.bn = nn.BatchNorm1d(nh)
    self.h = torch.zeros(bs, nh).cuda()
  
  def forward(self, x):
    h = self.h
    res = []
    for i in range(x.shape[1]):
      h = h + torch.tanh(self.input(self.emb(x[:, i])))
      h = self.bn(torch.tanh(self.hid(h)))
      res.append(h)
    self.h = h.detach()
    res = torch.stack(res, dim=1)
    return self.out(res)

In [49]:
learn = Learner(data, Model3(), metrics=accuracy)
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.496202,3.190429,0.352912,00:00
1,2.748388,2.198311,0.486364,00:00
2,2.015454,1.744916,0.536719,00:00
3,1.477366,1.397689,0.594531,00:00
4,1.098995,1.234465,0.633381,00:00
5,0.845759,1.164001,0.651989,00:00
6,0.667688,1.052913,0.682102,00:00
7,0.544539,0.988072,0.702699,00:00
8,0.453382,0.934729,0.717756,00:00
9,0.384751,0.934515,0.724432,00:00


Now we are getting greater accuracy than before!

## nn.RNN

Let's refactor the above to use PyTorch's RNN.  This is what you would use in practice, but now you know the inside details!
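
For reference, the update that `nn.RNN` documents for its default `tanh` mode is `h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)`. The pure-Python sketch below implements that formula for a single layer, with made-up weights and the two biases merged into one vector `b`. Note that it is not identical to the hand-written loop above, which adds the input to the state and applies batch norm inside the loop.

```python
import math

def rnn_step(x, h, W_ih, W_hh, b):
    # One Elman step: h_new[j] = tanh(W_ih[j] . x + W_hh[j] . h + b[j])
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_ih[j], x))
            + sum(w * hi for w, hi in zip(W_hh[j], h))
            + b[j]
        )
        for j in range(len(h))
    ]

def rnn_forward(xs, h0, W_ih, W_hh, b):
    # Returns (all hidden states, final hidden state), like the pair nn.RNN returns.
    outputs, h = [], h0
    for x in xs:
        h = rnn_step(x, h, W_ih, W_hh, b)
        outputs.append(h)
    return outputs, h

# Made-up weights: 1 input feature, 2 hidden units.
W_ih = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]
outputs, h_last = rnn_forward([[1.0], [2.0], [3.0]], [0.0, 0.0], W_ih, W_hh, b)
```

The returned `outputs` plays the role of `res` in the model below, and `h_last` is the state that gets detached and stored for the next batch.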

In [0]:
class Model4(nn.Module):
  def __init__(self):
    super().__init__()
    self.emb = nn.Embedding(nv, wordvec_len)
    self.input = nn.Linear(wordvec_len, nh)
    self.rnn = nn.RNN(nh, nh, 1, batch_first=True)
    self.out = nn.Linear(nh, nv)
    self.bn = BatchNorm1dFlat(nh)
    self.h = torch.zeros(1, bs, nh).cuda()
  
  def forward(self, x):
    res, h = self.rnn(self.input(self.emb(x)), self.h)
    self.h = h.detach()
    return self.out(self.bn(res))

In [51]:
learn = Learner(data, Model4(), metrics=accuracy)
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.45473,3.175731,0.348793,00:00
1,2.667047,2.125803,0.464133,00:00
2,2.036313,2.063892,0.316619,00:00
3,1.720699,2.120566,0.318182,00:00
4,1.544179,1.813197,0.482955,00:00
5,1.364512,1.815946,0.498438,00:00
6,1.149287,1.668347,0.482173,00:00
7,0.940415,1.386655,0.557457,00:00
8,0.764807,1.242821,0.573366,00:00
9,0.630987,1.201925,0.59446,00:00


In [0]:
class Model4(nn.Module):
  def __init__(self):
    super().__init__()
    self.emb = nn.Embedding(nv, wordvec_len)
    self.input = nn.Linear(wordvec_len, nh)
    self.rnn = nn.RNN(nh, nh, 2, batch_first=True, dropout=0.1)
    self.out = nn.Linear(nh, nv)
    self.bn = BatchNorm1dFlat(nh)
    self.h = torch.zeros(2, bs, nh).cuda()
  
  def forward(self, x):
    res, h = self.rnn(self.input(self.emb(x)), self.h)
    self.h = h.detach()
    return self.out(self.bn(res))

In [55]:
learn = Learner(data, Model4(), metrics=accuracy)
learn.fit_one_cycle(20, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,3.126027,2.708646,0.455327,00:00
1,2.375827,1.994828,0.468679,00:00
2,1.8959,1.894145,0.367756,00:00
3,1.627582,1.759445,0.476847,00:00
4,1.352829,1.43517,0.521875,00:00
5,1.043024,1.078941,0.655114,00:00
6,0.755228,0.938894,0.694673,00:00
7,0.548171,0.861323,0.741193,00:00
8,0.412124,0.841281,0.755185,00:00
9,0.320531,0.792354,0.77223,00:00


In [0]:
??BatchNorm1dFlat
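
As far as I understand it, `BatchNorm1dFlat` flattens a `(batch, sequence, features)` tensor to `(batch * sequence, features)`, applies ordinary 1d batch norm, and restores the original shape. The sketch below imitates that reshaping with nested lists. It is an illustration under that assumption, with hypothetical helper names and without the learned scale, shift, or running statistics of the real layer.

```python
def flatten(batch):
    # (bs, seq, feat) nested lists -> (bs * seq, feat)
    return [step for seq in batch for step in seq]

def unflatten(rows, seq_len):
    # (bs * seq, feat) -> (bs, seq, feat)
    return [rows[i:i + seq_len] for i in range(0, len(rows), seq_len)]

def normalize_features(rows, eps=1e-5):
    # Standardise each feature column over all rows.
    n, n_feat = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_feat)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(n_feat)]
    return [
        [(r[j] - means[j]) / (variances[j] + eps) ** 0.5 for j in range(n_feat)]
        for r in rows
    ]

# bs=2, seq=2, feat=2
batch = [[[1.0, 10.0], [2.0, 20.0]], [[3.0, 30.0], [4.0, 40.0]]]
out = unflatten(normalize_features(flatten(batch)), seq_len=2)
```

Statistics are computed per feature across every time step of every sequence in the batch, which is what lets batch norm be applied to the full RNN output at once.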

## nn.GRU

With long time scales and deeper networks, plain RNNs become nearly impossible to train.  One way to address this is to add mini-NNs that decide how much of the green arrow and how much of the orange arrow to keep.  Cells built this way include GRUs and LSTMs.  We will cover more details of this in a later lesson.
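
The gating idea can be sketched with a scalar GRU step. The structure follows the update documented for `nn.GRU` (reset gate `r`, update gate `z`, candidate `n`, then `h' = (1 - z) * n + z * h`); the weights in `params` are made up, and all biases except a single update-gate bias `b_z` are omitted for brevity.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h, p):
    # p holds made-up scalar weights plus one bias for the update gate.
    r = sigmoid(p['w_ir'] * x + p['w_hr'] * h)             # reset gate
    z = sigmoid(p['w_iz'] * x + p['w_hz'] * h + p['b_z'])  # update gate
    n = math.tanh(p['w_in'] * x + r * (p['w_hn'] * h))     # candidate state
    return (1.0 - z) * n + z * h                           # blend old and new

params = dict(w_ir=0.5, w_hr=0.5, w_iz=0.0, w_hz=0.0, w_in=1.0, w_hn=1.0, b_z=0.0)

h_keep = gru_step(1.0, 0.2, dict(params, b_z=20.0))        # z near 1: keep old state
h_overwrite = gru_step(1.0, 0.2, dict(params, b_z=-20.0))  # z near 0: take candidate
```

When the update gate saturates near 1 the old state passes through almost unchanged, which is what helps information and gradients survive across many time steps.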

In [0]:
class Model5(nn.Module):
  def __init__(self):
    super().__init__()
    self.emb = nn.Embedding(nv, wordvec_len)
    self.input = nn.Linear(wordvec_len, nh)
    self.rnn = nn.GRU(nh, nh, 2, batch_first=True, dropout=0.05)
    self.out = nn.Linear(nh, nv)
    self.bn = BatchNorm1dFlat(nh)
    self.h = torch.zeros(2, bs, nh).cuda()
  
  def forward(self, x):
    res, h = self.rnn(self.input(self.emb(x)), self.h)
    self.h = h.detach()
    return self.out(self.bn(res))

In [0]:
??nn.GRU

In [65]:
learn = Learner(data, Model5(), metrics=accuracy)
learn.fit_one_cycle(10, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,2.696175,2.177383,0.476634,00:00
1,1.664812,1.586279,0.603977,00:00
2,0.881934,1.131488,0.782457,00:00
3,0.44778,1.213786,0.81875,00:00
4,0.228962,1.166247,0.834304,00:00
5,0.123571,1.116273,0.83331,00:00
6,0.07181,1.27752,0.835156,00:00
7,0.043868,1.258988,0.838068,00:00
8,0.029036,1.265061,0.839134,00:00
9,0.02135,1.290411,0.838778,00:00


## END

An RNN is just a refactored fully connected neural network.
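
The refactoring claim can be checked on a toy example. In this sketch (scalar activations, made-up weights `w_x` and `w_h`), an explicit three-layer network that shares its weights across layers is rewritten as a loop, and both give the same result.

```python
def layer(h, x, w_x=0.5, w_h=0.8):
    # One "layer": mix the running activation with a new input, then ReLU.
    return max(0.0, w_x * x + w_h * h)

x1, x2, x3 = 1.0, 2.0, -0.5

# Written out as an explicit three-layer network with shared weights.
h = layer(0.0, x1)
h = layer(h, x2)
explicit = layer(h, x3)

# The same computation refactored into a loop: that loop is the RNN.
looped = 0.0
for x in (x1, x2, x3):
    looped = layer(looped, x)

print(explicit == looped)  # True
```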

You can use the same approach for any sequence labeling task (part-of-speech tagging, classifying whether material is sensitive, etc.).