![xkcd](https://imgs.xkcd.com/comics/i_could_care_less.png)

# Recurrent Neural Networks

- Language models from scratch
    - As we saw in the IMDB example, building the language model is the hard part.
    - There we used a pre-trained model; now let's take a step toward understanding what it does.
- Traditional neural networks have a fixed input size and a fixed output size
    - e.g. number of pixels in an image is input and number of classes is the output
- RNNs do not take in all of the input at once
    - Consume a sequence of tokens
    - Like having a for loop in your network
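
To make the "for loop in your network" idea concrete, here is a minimal framework-free sketch. The names `rnn_step` and `run_rnn` and the fixed scalar weights are made up for illustration; a real RNN uses learned weight matrices, not hand-picked numbers.

```python
def rnn_step(h, x, w_h=0.5, w_x=1.0):
    """One step: combine the previous hidden state with the current input."""
    # max(0, ...) is a ReLU, so the toy state never goes negative.
    return max(0.0, w_h * h + w_x * x)

def run_rnn(tokens):
    """Consume a sequence of any length, one token at a time."""
    h = 0.0  # the hidden state starts empty
    for x in tokens:
        h = rnn_step(h, x)
    return h

# The same function (the same "weights") handles inputs of different lengths.
short = run_rnn([1.0, 2.0])
longer = run_rnn([1.0, 2.0, 3.0, 4.0, 5.0])
print(short, longer)
```

Unlike a fixed-size network, nothing here depends on how many tokens come in; only the number of loop iterations changes.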
  
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/defreez/cs356-notebooks/blob/main/notebooks/rnn.ipynb)

In [141]:
from fastai.text.all import *
!pip install -Uqq fastbook
import fastbook
from fastbook import *
fastbook.setup_book()

## The human numbers dataset

```
fifteen 
sixteen 
seventeen 
eighteen 
nineteen 
twenty
twenty one
twenty two
twenty three
twenty four
```

## The human numbers dataset

- Nice small and easy dataset
- Used in FastAI book
- Two text files
    - Training data is the first 7999 numbers (1 to 7999)
    - Validation data is the next 1999 numbers (8001 to 9999)
    - What happened to eight thousand? Who knows :shrug:

In [142]:
hn_path = untar_data(URLs.HUMAN_NUMBERS)
hn_path.ls()

(#2) [Path('/root/.fastai/data/human_numbers/train.txt'),Path('/root/.fastai/data/human_numbers/valid.txt')]

In [143]:
# The FastAI custom List type
lines = L()

# Open train.txt and append all of the lines in the file to variable lines.
with open(hn_path/'train.txt') as f:
    lines += L(f.readlines())
    
# Open valid.txt and append all of the lines in the file to variable lines
with open(hn_path/'valid.txt') as f:
    lines += L(f.readlines())

In [144]:
lines

(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

# Tokenization

- In this case the tokenizer can be very simple
- We do want to make sure that we have a marker between tokens
- In this case the book uses `.` 
    - Similar to `xxbos` that is used by the real FastAI language loader
    
    
Example Tokens:
```
['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']
```

In [145]:
# Use . as separator
hn_text = ' . '.join([l.strip() for l in lines])
hn_text[:100]

'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

In [146]:
# Tokenize. For this dataset we don't need anything complicated.
tokens = hn_text.split(' ')
tokens[:10]

['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

In [147]:
# Can also use L(tokens).unique() but below is idiomatic Python without the need for FastAI helpers
vocab = set(tokens)

# Convert to list just for display here. I am deliberately avoiding
# some of the FastAI features just to remove as much magic as possible.
len(vocab), list(vocab)[:5]

(30, ['seventeen', 'hundred', 'ten', 'fifty', 'eighteen'])

## Numericalization

- The next step after tokenization is numericalization
- This time we'll use the same approach as the PyTorch embeddings example (dictionaries)

![numericalization](images/numericalization.png)

In [148]:
# Numericalize
word2idx = {w : i for i,w in enumerate(vocab)}
idx2word = {i : w for i,w in enumerate(vocab)}
nums = L(word2idx[i] for i in tokens)
nums

(#63095) [21,13,29,13,12,13,20,13,23,13...]
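
A quick round trip makes the two dictionaries easier to trust. The sketch below uses a short hand-written token list instead of the full dataset, and it sorts the vocabulary, which is an assumption added here rather than what the cell above does. Iterating over a `set` of strings can give a different order from run to run, so the indices printed above may not match yours.

```python
tokens = ['one', '.', 'two', '.', 'three', '.']

# Sorting gives every run the same word -> index mapping.
vocab = sorted(set(tokens))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for w, i in word2idx.items()}

# Encode to integers, then decode back to words.
nums = [word2idx[t] for t in tokens]
decoded = [idx2word[n] for n in nums]

print(vocab)
print(nums)
print(decoded == tokens)
```

If decoding does not give back the original tokens, the two dictionaries are out of sync.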

## Language Modeling

- This is review
- Predict next word based on previous three tokens
- No recurrence yet

### Language Modeling

- Independent variable is three tokens
- Dependent variable is a single token

Example:
```
(['one', '.', 'two'], '.'),
(['.', 'three', '.'], 'four'),
(['four', '.', 'five'], '.')
```

In [149]:
# Producing the tri-grams
# We did this in the embedding example.
human_seqs_example = []
for i in range(0, len(tokens) - 4, 3):
    # Input is three consecutive tokens; the label is the token right after them.
    human_seqs_example.append((tokens[i:i+3], tokens[i+3]))
    
L(human_seqs_example)

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

In [150]:
# nums is the numericalized text input
# Also compress that for loop into a list comprehension because it was too readable.
# Now it looks like we are a little smarter.
seqs = L((tensor(nums[i:i+3]), nums[i + 3]) for i in range(0, len(nums) - 4, 3))

In [151]:
# The cut variable is the cut between training and validation
# We could have just kept them separate from the beginning.
batch_size = 64
cut = int(len(seqs) * 0.8)

# Here the DataLoader will divide sequences into batches, not doing anything else.
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=batch_size, shuffle=False)

# Each batch is 64, [0] is x and is a tri-gram, [1] is y and is a single token index
first(dls.train)[0].shape, first(dls.train)[1].shape

(torch.Size([64, 3]), torch.Size([64]))

In [152]:
x = first(dls.train)[0]
# First token of every sequence in the batch
x[:,0]

tensor([21, 13, 20, 13,  5, 13,  2, 13, 15, 13,  8, 13, 19, 13, 13, 13, 13, 13,
        13, 13, 13, 13, 13, 17, 17, 17, 17, 17, 17, 17, 17, 17, 10, 21, 29, 12,
        20, 23,  7,  5, 22, 16, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 18, 18,
        18, 18, 18, 18, 18, 18, 18, 27, 21, 29])

<img src="https://github.com/fastai/fastbook/raw/e57e3155824c81a54f915edf9505f64d5ccdad84/images/att_00022.png" width="800px" />

In [153]:
class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden_to_output = nn.Linear(n_hidden, vocab_sz)
        
    def forward(self, x):
        # First token of every sequence in the batch
        first_tokens = x[:, 0]
        second_tokens = x[:, 1]
        third_tokens = x[:, 2]
        
        # Get embedding for tokens
        h = self.input_to_hidden(first_tokens)
        h = self.hidden(h)
        h = F.relu(h)
        
        # Add the embedding of the second token to the running state,
        # then apply the same hidden layer again.
        h = h + self.input_to_hidden(second_tokens)
        h = self.hidden(h)
        h = F.relu(h)
        
        h = h + self.input_to_hidden(third_tokens)
        h = self.hidden(h)
        h = F.relu(h)
        
        return self.hidden_to_output(h)

In [154]:
learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit(10, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.531881,1.803231,0.467792,00:01
1,1.381355,1.802319,0.468505,00:01
2,1.338977,1.719721,0.4773,00:01
3,1.278312,1.731577,0.490611,00:01
4,1.262202,1.659887,0.506061,00:01
5,1.23716,1.703498,0.506537,00:01
6,1.206963,1.777019,0.507012,00:01
7,1.19936,1.788651,0.507012,00:01
8,1.212504,1.738667,0.507012,00:01
9,1.197111,1.714087,0.507012,00:01


## Our first recurrent neural network

- Simply refactors the previous network
- Uses a for loop inside the forward function
- Accuracy should be about the same (note that the run below uses 128 hidden units rather than 64)
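
The claim that the loop is only a refactoring can be checked on a toy version. In this sketch `x` stands in for the three token embeddings, and `hidden` is a hypothetical scalar stand-in for the shared `nn.Linear` layer; the weights are arbitrary numbers, not trained values.

```python
def relu(v):
    return max(0.0, v)

def hidden(h, w=0.5, b=0.1):
    # Stand-in for the shared hidden layer (hypothetical scalar weights).
    return w * h + b

def forward_unrolled(x):
    # Written out step by step, like the first model.
    h = relu(hidden(x[0]))
    h = relu(hidden(h + x[1]))
    h = relu(hidden(h + x[2]))
    return h

def forward_loop(x):
    # The same computation as a loop; h starts at 0 so the
    # first pass reduces to hidden(x[0]).
    h = 0.0
    for i in range(3):
        h = relu(hidden(h + x[i]))
    return h

x = [1.0, -2.0, 3.0]
print(forward_unrolled(x), forward_loop(x))
```

Both functions apply the same operations in the same order, so they return the same number.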

<img src="https://github.com/fastai/fastbook/raw/e57e3155824c81a54f915edf9505f64d5ccdad84/images/att_00070.png" width="800px" />

In [155]:
# Module here is fastai's wrapper around PyTorch's nn.Module; it saves us from calling super().__init__().
class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden_to_output = nn.Linear(n_hidden, vocab_sz)
        
    def forward(self, x):
        # Why h = 0? The unrolled version never needed a starting value.
        # The loop adds to h on its very first pass, so h must exist beforehand.
        # Starting at 0 means the first sum is just the first embedding.

        h = 0
        for i in range(3):
            h = h + self.input_to_hidden(x[:, i])
            h = F.relu(self.hidden(h))
        return self.hidden_to_output(h)

In [156]:
learn = Learner(dls, LMModel2(len(vocab), 128), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit(10, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.41021,1.683014,0.478726,00:01
1,1.276662,1.611895,0.493463,00:01
2,1.223682,1.702572,0.499643,00:01
3,1.203796,1.760403,0.493939,00:01
4,1.185061,1.890238,0.493939,00:01
5,1.163429,2.255942,0.491562,00:01
6,1.24335,1.60694,0.506299,00:01
7,1.19751,1.793824,0.506774,00:01
8,1.145686,1.80549,0.506774,00:01
9,1.150602,2.0465,0.493939,00:01


## Maintaining State

- The hidden state is reset on every forward call, so sequences are never chained together; the RNN's memory is at most one sequence length
- Keeping the state across batches raises two problems
    1. More practically, the network becomes too deep (in the unrolled sense).
       Again, we aren't actually creating these layers, but we still have to back-propagate through them.
    2. For the carried-over state to make sense, the next batch has to be the next in the sequence
 

### Back-Propagation Through Time (BPTT)

- Solves the first problem
- Tell PyTorch (via `detach()`) that we only want to update weights based on the gradient from this batch
- But we still don't throw away the state we are passing along
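
The effect of cutting the history can be sketched without PyTorch. The code below is a toy bookkeeping model, not autograd: `Tracked`, `step`, and `depth` are invented names, and `depth` simply counts how many steps a backward pass would have to walk through.

```python
from dataclasses import dataclass

@dataclass
class Tracked:
    value: float
    depth: int = 0  # how many recorded steps a backward pass would walk through

    def step(self, x, w=0.5):
        # One recurrent step: the new value depends on the old one,
        # so the recorded history grows by one.
        return Tracked(max(0.0, w * (self.value + x)), self.depth + 1)

    def detach(self):
        # Keep the number, forget how it was computed.
        return Tracked(self.value, 0)

def run(batches, truncate):
    h = Tracked(0.0)
    depths = []
    for batch in batches:
        for x in batch:
            h = h.step(x)
        depths.append(h.depth)  # history length when backward would be called
        if truncate:
            h = h.detach()
    return h.value, depths

batches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
full_value, full_depths = run(batches, truncate=False)
trunc_value, trunc_depths = run(batches, truncate=True)
print(full_depths)
print(trunc_depths)
print(full_value == trunc_value)
```

Without truncation the history keeps growing with every batch; with it, the history stays at the sequence length while the state value carried forward is unchanged.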

In [157]:
class LMModel3(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden_to_output = nn.Linear(n_hidden, vocab_sz)
        self.h = 0
        
    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.input_to_hidden(x[:, i])
            self.h = F.relu(self.hidden(self.h))
        out = self.hidden_to_output(self.h)        
        self.h = self.h.detach()
        return out
        
    def reset(self):
        self.h = 0

### Carefully arranging the batches

- In the IMDB sentiment analysis example the `LMDataLoader` does this
- We want the batches to contain contiguous sequences
- Normally shuffling is good, not here!
- We can drop or pad the last batch to fit it in the tensor.
- LMDataLoader pads, we will drop.

In [158]:
batch_seq_len = len(seqs) // batch_size
batch_seq_len

328

In [159]:
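def group_chunks(ds, bs):
    """
    Reorder ds so that, with batch size bs and no shuffling,
    item j of each batch continues item j of the previous batch.
    ds: the dataset
    bs: the batch size
    """
    m = len(ds) // bs  # number of batches; leftover items are dropped
    new_ds = L()
    for i in range(m):
        new_ds += L(ds[i + m * j] for j in range(bs))
    return new_ds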
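
The reordering is easier to see on a tiny example. This is a plain-list sketch of the same indexing; `group_chunks_plain` is a hypothetical name, and the integers 0 to 11 stand in for twelve consecutive sequences.

```python
def group_chunks_plain(ds, bs):
    """Plain-list version of the same reordering, for illustration only."""
    m = len(ds) // bs
    return [ds[i + m * j] for i in range(m) for j in range(bs)]

ds = list(range(12))  # pretend these are 12 consecutive sequences
bs = 3
reordered = group_chunks_plain(ds, bs)

# Cut the reordered data into batches the way an unshuffled loader would.
batches = [reordered[k:k + bs] for k in range(0, len(reordered), bs)]
print(batches)

# Position 0 of successive batches walks through the data in order.
row0 = [b[0] for b in batches]
print(row0)
```

Each position in the batch follows its own contiguous stretch of the data, which is what the carried-over hidden state needs.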

In [160]:
# Cut between training and validation
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
    # [0, cut)
    group_chunks(seqs[:cut], batch_size),
    
    # [cut, end]
    group_chunks(seqs[cut:], batch_size),
    
    bs=batch_size,
    drop_last=True,
    shuffle=False
)

In [161]:
# Tops out at around 60%. About 10 percentage points better! That's actually pretty huge.
# Without the detach() in forward you get the error of trying to backward through the graph a second time...
learn = Learner(dls, LMModel3(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(10, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.672461,1.867919,0.479567,00:01
1,1.261106,1.722118,0.475962,00:01
2,1.110206,1.721647,0.502885,00:01
3,1.015317,1.721791,0.557692,00:01
4,0.946742,1.728749,0.560337,00:01
5,0.912255,1.672993,0.575962,00:01
6,0.852893,1.664606,0.590625,00:01
7,0.798675,1.858297,0.580769,00:01
8,0.751387,1.832644,0.61851,00:01
9,0.732432,1.839459,0.616587,00:01


## Create More Signal

- Explains why in the IMDB example the x and y kept the entire sequence
- Predict the entire sequence not just one word

In [162]:
sl = 16

# Does this look readable to you?
seqs = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+sl+1])) for i in range(0, len(nums)-sl-1,sl))
cut = int(len(seqs) * 0.8)

In [163]:
# The independent variable
L([idx2word[x.item()] for x in seqs[0][0]])

(#16) ['one','.','two','.','three','.','four','.','five','.'...]

In [164]:
# The dependent variable (label)
L([idx2word[x.item()] for x in seqs[0][1]])

(#16) ['.','two','.','three','.','four','.','five','.','six'...]
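
The offset between input and label is easier to follow with small numbers. The sketch below applies the same slicing to a list of ten integers with a sequence length of 4; `make_seqs` is a hypothetical helper name and plain lists replace the tensors.

```python
def make_seqs(nums, sl):
    """Split nums into (input, target) pairs; the target is the input shifted by one."""
    return [(nums[i:i + sl], nums[i + 1:i + sl + 1])
            for i in range(0, len(nums) - sl - 1, sl)]

nums = list(range(10))
pairs = make_seqs(nums, sl=4)
for x, y in pairs:
    print(x, y)
```

Every position in `x` now has its own label in `y`, so one sequence yields `sl` training signals instead of one.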

In [165]:
dls = DataLoaders.from_dsets(
    group_chunks(seqs[:cut], batch_size),
    group_chunks(seqs[cut:], batch_size),
    bs=batch_size, drop_last=True, shuffle=False
)

In [166]:
# Modify the model so that it makes a prediction after every word
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden, seq_len):
        self.input_to_hidden = nn.Embedding(vocab_sz, n_hidden)
        self.hidden = nn.Linear(n_hidden, n_hidden)
        self.hidden_to_output = nn.Linear(n_hidden, vocab_sz)
        self.h = 0
        self.seq_len = seq_len
        
    def forward(self, x):
        # outs will be a list of tensors that are shape (bs, vocab_sz)
        outs = []
        for i in range(self.seq_len):
            self.h = self.h + self.input_to_hidden(x[:, i])
            self.h = F.relu(self.hidden(self.h))
            outs.append(self.hidden_to_output(self.h))
        self.h = self.h.detach()
        
        # we want to return a tensor so stack our list of outs
        # shape will be batch_size x sequence length x vocab size
        return torch.stack(outs, dim=1)

In [179]:
# input shape is batch size x sequence len x vocab (output of neural net)
# target shape is batch size x sequence len
# F.cross_entropy wants the class dimension second, so flatten (64x16x30, 64x16) to (1024x30, 1024)
# That is, for all 1024 tokens in the batch of sequences list the 30 activations
# Calculate loss against the correct token
def loss_func(inp, targ):
    return F.cross_entropy(inp.view(-1, len(vocab)), targ.view(-1))
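
The flattening step can be sketched with nested lists. This is a hand-rolled illustration under small assumed sizes (batch of 2, sequence length 3, vocabulary of 4), not the PyTorch implementation; the `cross_entropy` helper below averages over all rows, which matches the default mean reduction of `F.cross_entropy`.

```python
import math

def cross_entropy(logits_rows, targets):
    """Mean negative log-likelihood over rows of raw scores."""
    total = 0.0
    for logits, t in zip(logits_rows, targets):
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_sum_exp - logits[t]
    return total / len(targets)

bs, sl, vocab_sz = 2, 3, 4
# Model output: one row of vocab_sz scores per token, per sequence.
inp = [[[0.0] * vocab_sz for _ in range(sl)] for _ in range(bs)]
targ = [[0, 1, 2], [3, 0, 1]]

# Flatten (bs, sl, vocab_sz) -> (bs*sl, vocab_sz) and (bs, sl) -> (bs*sl).
flat_inp = [row for seq in inp for row in seq]
flat_targ = [t for seq in targ for t in seq]

loss = cross_entropy(flat_inp, flat_targ)
print(len(flat_inp), len(flat_targ), loss)
```

With all scores equal, every token is predicted uniformly, so the loss comes out as `log(vocab_sz)`.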

In [168]:
learn = Learner(dls, LMModel4(len(vocab), 64, 16), loss_func=loss_func, metrics=accuracy)
learn.fit(30, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,2.369873,2.017576,0.462321,00:00
1,1.759912,1.838556,0.461507,00:00
2,1.546585,1.792056,0.460124,00:00
3,1.42927,1.830774,0.457764,00:00
4,1.343087,1.768953,0.486816,00:00
5,1.264946,1.840395,0.489014,00:00
6,1.211312,1.788568,0.527588,00:00
7,1.155886,1.87063,0.508138,00:00
8,1.113114,1.874396,0.528564,00:00
9,1.0698,1.782308,0.539388,00:00


In [169]:
learn.fit(50, 1e-4)

epoch,train_loss,valid_loss,accuracy,time
0,0.496429,1.555763,0.637207,00:00
1,0.48847,1.565688,0.638835,00:00
2,0.485341,1.577183,0.639486,00:00
3,0.483274,1.577752,0.640544,00:00
4,0.481578,1.578776,0.641846,00:00
5,0.480184,1.583196,0.642741,00:00
6,0.478654,1.580914,0.642904,00:00
7,0.476936,1.579383,0.643148,00:00
8,0.475016,1.581514,0.642904,00:00
9,0.473054,1.583959,0.643473,00:00


Another 6%, pretty good.