# Use fastai human numbers data to train LSTM

The data comes from [chapter 12 of the fastai book](https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb). It looks like this:

```
one 
two 
three 
...
two hundred seven 
two hundred eight 
...
```


## Support

In [89]:
import codecs
import os
import re
import string
import numpy as np
import pandas as pd
from typing import Sequence
from sklearn.model_selection import train_test_split

import tensorflow_addons as tfa
from keras.datasets import mnist
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras import models, layers, callbacks, optimizers, Sequential, losses
import tqdm
from tqdm.keras import TqdmCallback

def get_text(filename:str):
    """
    Load and return the text of a text file, assuming latin-1 encoding as that
    is what the BBC corpus uses.  Use codecs.open() function not open().
    """
    # the context manager closes the file even if read() raises
    with codecs.open(filename, encoding='latin-1', mode='r') as f:
        return f.read()

## Load corpus and numericalize tokens

In [90]:
text = get_text('data/human_numbers/train.txt')
text += get_text('data/human_numbers/valid.txt')
text[:30]

'one \ntwo \nthree \nfour \nfive \ns'

In [91]:
text = re.sub(r'[ \n]+', ' . ', text) # use '.' as separator token
text[:20]

'one . two . three . '

In [92]:
tokens = text.split(' ')
tokens = tokens[:-1] # last token is blank '' so delete
tokens[:5]

['one', '.', 'two', '.', 'three']

In [93]:
V = sorted(set(tokens))
V[:10]

['.',
 'eight',
 'eighteen',
 'eighty',
 'eleven',
 'fifteen',
 'fifty',
 'five',
 'forty',
 'four']

In [94]:
index = {w:i for i,w in enumerate(V)}
tokens = [index[w] for w in tokens]
tokens[:10]

[15, 0, 29, 0, 26, 0, 9, 0, 7, 0]
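
To sanity-check the numericalization, it can help to map the ids back to words. A minimal sketch on a toy token list rather than the real corpus; the `toy_*` names and `decoded` are introduced here for illustration only:

```python
# Toy stand-in for the real token list; same steps as above on a tiny sample
toy_tokens = ['one', '.', 'two', '.', 'three', '.']
toy_V = sorted(set(toy_tokens))                   # sorted vocabulary
toy_index = {w: i for i, w in enumerate(toy_V)}   # word -> id
toy_ids = [toy_index[w] for w in toy_tokens]

# The vocabulary list is indexed by id, so decoding is just a lookup into it
decoded = [toy_V[i] for i in toy_ids]
print(toy_ids)
print(decoded == toy_tokens)
```

Because `V` is a sorted list, `V[i]` already serves as the id-to-word table and no second dictionary is needed.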

In [95]:
k = 3
step = 1
Xy = [np.array(tokens[i-k:i+1], dtype=int) for i in range(k,len(tokens)-1,step)] # np.int was removed from NumPy
Xy = np.array(Xy)

In [96]:
Xy[:5]

array([[15,  0, 29,  0],
       [ 0, 29,  0, 26],
       [29,  0, 26,  0],
       [ 0, 26,  0,  9],
       [26,  0,  9,  0]])

In [97]:
Xy[0]

array([15,  0, 29,  0])

In [98]:
X, y = Xy[:,0:k], Xy[:,k]

In [99]:
X = np.vstack(X)
X[0:4]

array([[15,  0, 29],
       [ 0, 29,  0],
       [29,  0, 26],
       [ 0, 26,  0]])
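
The same windows can probably be built without a Python loop. A sketch using NumPy's `sliding_window_view` (available in NumPy 1.20+) on the first few token ids shown above; `toy`, `X_toy` and `y_toy` are illustrative names. Note one difference: the view includes the very last window, which the `range(k, len(tokens)-1, step)` loop above leaves out.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

k = 3
toy = np.array([15, 0, 29, 0, 26, 0, 9, 0])   # first few token ids from above

# Each row holds k inputs followed by 1 target, stepping by one token
windows = sliding_window_view(toy, k + 1)
X_toy, y_toy = windows[:, :k], windows[:, k]
print(X_toy[:2])
print(y_toy[:2])
```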

In [100]:
# must onehot y
y = pd.get_dummies(y)
y.shape

(106192, 30)
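
`pd.get_dummies` only creates columns for the values that actually occur in `y`. An alternative sketch, assuming the number of classes is known up front (30 here, matching `len(V)`), indexes into an identity matrix so the width is fixed regardless of which targets appear; `y_ids` and `y_onehot` are made-up names:

```python
import numpy as np

n_classes = 30                       # len(V) in this notebook
y_ids = np.array([0, 26, 0, 9])      # example integer targets

# Row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(n_classes, dtype=np.float32)[y_ids]
print(y_onehot.shape)
print(y_onehot.argmax(axis=1))
```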

## Cross-batch statefulness

From the [Keras RNN guide](https://keras.io/guides/working_with_rnns):
    
"*Normally, the internal state of a RNN layer is reset every time it sees a new batch (i.e. every sample seen by the layer is assume to be independent from the past). ... If you have very long sequences though, it is useful to break them into shorter sequences, and to feed these shorter sequences sequentially into a RNN layer without resetting the layer's state. That way, the layer can retain information about the entirety of the sequence, even though it's only seeing one sub-sequence at a time.*"

To get this behavior we have to pass `stateful=True` to `LSTM()` and also specify `batch_input_shape`, because a stateful layer needs to know its batch size. The call to `fit()` should also have `shuffle=False`.

Here is the key to understanding the stride required for batching stateful LSTMs:

"*Sample i in a given batch is assumed to be the continuation of sample i in the previous batch. This means that all batches should contain the same number of samples.*"

Note: resetting between batches throws out only the h and c state vectors, not the internal weight matrices. If the weights were tossed too, no training would accumulate, as each batch would start from the initial model state. (We'd train only using the last batch of the last epoch.)

Assume sequence = (1, 2, 3, 4, 5, 6, 7, 8, 9)

If k=3 (time steps) then we have these subsequences (and targets):

(1,2,3) -> 4
(2,3,4) -> 5
(3,4,5) -> 6
(4,5,6) -> 7
(5,6,7) -> 8
(6,7,8) -> 9
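
The list above can be generated with a short comprehension. A sketch; `pairs` is just an illustrative name:

```python
seq = [1, 2, 3, 4, 5, 6, 7, 8, 9]
k = 3

# Slide a window of k inputs plus 1 target along the sequence, one step at a time
pairs = [(tuple(seq[i - k:i]), seq[i]) for i in range(k, len(seq))]
for x, target in pairs:
    print(x, '->', target)
```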

If the batch size is 1 with input vectors $x_j$, then the h state gets updated roughly like this. (This is the simplified vanilla-RNN recurrence; an LSTM also updates its c state, through gated equations.)

`for all time steps j in range(k)`:  $h = W h + U x_j$

For continuity across batches, `X[1]` should be the sequence following `X[0]`.
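
One way to read this requirement, sketched below for inputs only: cut the sequence into `batch_size` contiguous streams and give each batch the next k tokens of every stream. With batch size 1 this reduces to stepping by k instead of 1. The helper `stateful_batches` is hypothetical (not from Keras or the notebook), and it simply drops leftover tokens.

```python
import numpy as np

def stateful_batches(seq, batch_size, k):
    """Arrange seq so that sample i of batch b+1 continues sample i of batch b.
    Hypothetical helper, inputs only; leftover tokens are dropped."""
    seq = np.asarray(seq)
    stream_len = len(seq) // (batch_size * k) * k      # tokens per stream, a multiple of k
    streams = seq[:batch_size * stream_len].reshape(batch_size, stream_len)
    # (batch_size, n_batches, k) -> (n_batches, batch_size, k)
    return streams.reshape(batch_size, -1, k).swapaxes(0, 1)

# Batch size 1: X[1] = (4,5,6) picks up where X[0] = (1,2,3) left off
print(stateful_batches(range(1, 10), batch_size=1, k=3)[:, 0])

# Batch size 2: each slot in the batch follows its own stream
b = stateful_batches(range(1, 25), batch_size=2, k=3)
print(b[0])
print(b[1])
```

Flattening the result with `.reshape(-1, k)` should give rows in the order that `fit(..., batch_size=batch_size, shuffle=False)` consumes them, though that is worth checking against the Keras version in use.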


We need the loop over the k time steps to continue from the last step in one batch to step 0 of the next batch, without resetting h in between:

$h = W h + U x_j \quad \text{for } j = 0, \ldots, k-1$


```python
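import torch
k, B, nhidden, nvocab = 3, 2, 4, 30   # illustrative sizes, not taken from the notebook
W, U = torch.randn(nhidden, nhidden), torch.randn(nhidden, nvocab)
X, h = torch.zeros(B, nvocab, k), torch.zeros(nhidden, B)  # one-hot input (time last), initial state
for j in range(k):  # for all time steps in k, the sequence length
    # xj is batchsize x |V| but U is hidden x |V| so need transpose
    xj = X[:,:,j].T # jth token dim for all records in batch
    h = W.mm(h) + U.mm(xj)
    h = torch.relu(h)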
```
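
To see why carrying h across batches matters, here is a NumPy sketch of the same simplified recurrence (a plain RNN with relu, not a real LSTM). All sizes, the random weights, and the names `run`, `h_full`, `h_stateful` and `h_reset` are made up for illustration. Feeding two chunks of k steps while keeping h gives the same final state as feeding all 2k steps at once; resetting h before the second chunk will generally give a different state.

```python
import numpy as np

rng = np.random.default_rng(0)
nhidden, nvocab, k = 4, 5, 3              # illustrative sizes
W = rng.normal(size=(nhidden, nhidden)) * 0.1
U = rng.normal(size=(nhidden, nvocab)) * 0.1
xs = np.eye(nvocab)[[0, 1, 2, 3, 4, 0]]   # six one-hot inputs, i.e. two chunks of k

def run(h, chunk):
    """Apply h = relu(W h + U x_j) for each x_j in chunk and return the final h."""
    for xj in chunk:
        h = np.maximum(0.0, W @ h + U @ xj)
    return h

h0 = np.zeros(nhidden)
h_full = run(h0, xs)                       # whole sequence in one pass
h_stateful = run(run(h0, xs[:k]), xs[k:])  # state carried from chunk 1 into chunk 2
h_reset = run(h0, xs[k:])                  # state reset before chunk 2

print(np.allclose(h_full, h_stateful))
print(np.allclose(h_full, h_reset))
```

The weights `W` and `U` are never touched by either variant; only the starting value of h differs.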

## Train

In [101]:
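# keep records in corpus order; a shuffled split would break the continuity a stateful LSTM relies on
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, shuffle=False)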

In [127]:
batch_size = 64
model = Sequential()
# a stateful RNN needs a fixed batch size, given via batch_input_shape on the first layer
model.add(layers.Embedding(input_dim=len(V), output_dim=64,
                           batch_input_shape=(batch_size, k)))
model.add(layers.LSTM(units=64, stateful=True))
model.add(layers.Dense(len(V), activation='softmax'))

# opt = optimizers.Adam(learning_rate=0.01)
opt = optimizers.RMSprop(learning_rate=0.01)

model.compile(loss=losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy'])
#model.summary()

In [128]:
# every batch must be full for a stateful RNN, so drop the leftover records
n_train = len(X_train) // batch_size * batch_size
n_valid = len(X_valid) // batch_size * batch_size
history = model.fit(X_train[:n_train], y_train[:n_train],
                    shuffle=False, # don't jumble up sequence in batch or across batches (stateful LSTM)
                    epochs=15,
                    validation_data=(X_valid[:n_valid], y_valid[:n_valid]),
                    batch_size=batch_size,
                    verbose=1
#                         , callbacks=[tfa.callbacks.TQDMProgressBar(show_epoch_progress=True)]
                    )