# Generate Obama speeches using truncated backpropagation

In the previous notebook, I broke the input text into reasonably small chunks of about 50 characters, then took batches of, say, 32 chunks to perform gradient descent. The hidden state was reset at the start of every chunk, so there was no continuity between chunks. The model started to produce decent words, but their order didn't make sense, possibly because the hidden state was reset so often. The best I got was about 58% accuracy using 1M characters.

Lessons:

* Seems like `bptt` should be small, like 8; 16 and 32 didn't train fast enough.
* Able to use a really big `nchunks` for parallelism, since a short `bptt` supplies the stochastic part of SGD.
* With `nchunks=400`, 6 seconds per epoch.

In [1]:
import pandas as pd
import numpy as np
import math
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
#from torch.nn.functional import softmax
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
np.set_printoptions(precision=2, suppress=True, linewidth=3000, threshold=20000)
from typing import Sequence

dtype = torch.float
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda', index=0)

In [2]:
import codecs
def get_text(filename:str):
    """
    Load and return the text of a text file. No encoding is passed, so
    codecs.open() reads the file with the platform's default text encoding.
    """
    with codecs.open(filename, mode='r') as f:
        s = f.read()
    return s

In [3]:
def normal_transform(x, mean=0.0, std=0.01):
    "Convert x to have mean and std"
    return x*std + mean

def randn(n1, n2,          
          mean=0.0, std=0.01, requires_grad=False,
          device=torch.device('cuda:0' if torch.cuda.is_available() else 'cpu'),
          dtype=torch.float64):
    x = torch.randn(n1, n2, device=device, dtype=dtype)
    x = normal_transform(x, mean=mean, std=std)
    x.requires_grad=requires_grad
    return x

In [4]:
def plot_history(history, yrange=(0.0, 5.00), figsize=(3.5,3)):
    plt.figure(figsize=figsize)
    plt.ylabel("Log loss")
    plt.xlabel("Epochs")
    loss = history[:,0]
    valid_loss = history[:,1]
    plt.plot(loss, label='train_loss')
    plt.plot(valid_loss, label='val_loss')
    # plt.xlim(0, 200)
    plt.ylim(*yrange)
    plt.legend()#loc='lower right')
    plt.show()

In [5]:
def getvocab(strings):
    letters = [list(l) for l in strings]
    vocab = set([c for cl in letters for c in cl])
    vocab = sorted(list(vocab))
    ctoi = {c:i for i, c in enumerate(vocab)}
    return vocab, ctoi

In [6]:
def softmax(y):
    expy = torch.exp(y)
    if len(y.shape)==1: # 1D case can't use axis arg
        return expy / torch.sum(expy)
    return expy / torch.sum(expy, axis=1).reshape(-1,1)

def cross_entropy(y_prob, y_true):
    """
    y_prob is n x k for n samples and k output classes, and is often the
    softmax of the final layer; y_true is an n-vector of class indices.
    y_prob values must be the probability that the output is a specific class.
    Binary case: When y_prob for the true class is close to 1,
    loss is -1*log(1)==0. If y_prob for the true class is close to 0, loss is
    -1*log(small value) = big value.
    y_true values must be integers in [0,k-1].
    """
    n = y_prob.shape[0]
    # Get value at y_true[j] for each sample with fancy indexing
#     print(range(n), y_true)
    p = y_prob[range(n),y_true]
    return torch.mean(-torch.log(p))
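The `softmax` above exponentiates raw scores directly, which can overflow when a score is large, and `cross_entropy` then takes the log of the result. A minimal sketch of a more numerically stable route, assuming the same shapes (n x k raw scores and n integer targets), is to go through log-softmax in one step. The name `stable_cross_entropy` is hypothetical and not part of this notebook.

```python
import torch

def stable_cross_entropy(scores, y_true):
    """
    Hypothetical helper. scores: n x k raw (pre-softmax) outputs;
    y_true: n integer class indices in [0, k-1].
    Returns the mean negative log-likelihood via log-softmax.
    """
    log_prob = torch.log_softmax(scores, dim=1)  # stable: never forms exp(score) on its own
    n = scores.shape[0]
    return -log_prob[torch.arange(n), y_true].mean()

# Second row has scores large enough that exp() overflows in float64
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [1000.0, 0.0, -1000.0]], dtype=torch.float64)
y_true = torch.tensor([2, 0])
loss = stable_cross_entropy(scores, y_true)
```

With this formulation the model would hand raw scores to the loss rather than probabilities, so the explicit `softmax` call would only be needed when probabilities themselves are wanted, such as during sampling.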

In [7]:
def onehot(c) -> torch.Tensor:
    v = torch.zeros((len(vocab),1), device=device, dtype=torch.float64)
    v[ctoi[c]] = 1
    return v

def get_max_len(X):
    max_len = 0
    for x in X:
        max_len = max(max_len, len(x))
    return max_len

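def onehot_matrix(X, ctoi):
    "Convert r x c matrix of char indices into r x c x len(ctoi) one-hot tensor"
    nrows, ncols = X.shape
    X_onehot = torch.zeros(nrows, ncols, len(ctoi), device=device, dtype=torch.float64)
    for i,x in enumerate(X):
        for j,ci in enumerate(x):  # ci is the char index at row i, position j
            X_onehot[i,j,ci] = 1
    return X_onehot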
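The nested loop in `onehot_matrix` sets one element at a time, which can be slow for a matrix with millions of entries. A possible vectorized equivalent, sketched here with the hypothetical name `onehot_matrix_vectorized` and a toy input, uses `torch.nn.functional.one_hot`, which expects a tensor of `long` indices:

```python
import torch
import torch.nn.functional as F

def onehot_matrix_vectorized(X, nclasses, dtype=torch.float64):
    "Hypothetical variant: X is r x c of long indices; returns r x c x nclasses"
    # one_hot returns a long tensor on X's device; cast to the float type the weights use
    return F.one_hot(X, num_classes=nclasses).to(dtype)

X_demo = torch.tensor([[0, 2, 1],
                       [1, 1, 0]])       # toy stand-in for the numericalized chunks
X_demo_onehot = onehot_matrix_vectorized(X_demo, nclasses=3)
```

In the notebook this would be called as `onehot_matrix_vectorized(X, len(ctoi))`, assuming `X` holds `long` indices as it does below.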

## Load and split into chunks

The stochastic part of SGD is critical for training models. The idea is simply to use a small subset of the data when computing gradients to update the model parameters. Generally we take a small batch of, say, 32 records, run it through the model, and compute a loss. From that loss we compute the gradient, update the model parameters, and move on to the next batch. Once all batches are complete, we have completed an epoch. We then shuffle the batches and keep going.

We can also be stochastic by updating the parameters in the middle of long sequences, rather than waiting until after a complete batch of long sequences. If the sequences are really long, waiting until the end of a batch reduces the stochastic nature. Instead, I'm going to try breaking the entire input into a small number of very long sequences. That way the RNN can keep the hidden state going for the complete sequence. The only problem is that we cannot backpropagate that far, so every so many time steps I update the parameters, discard the computation history while keeping the hidden state values, and continue. I think this is easier than modifying the data set stride so that a standard RNN training loop keeps the same hidden state across long sequences even after we have broken them into chunks.
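The idea of updating mid-sequence and then cutting the graph can be sketched on a tiny random problem that runs on the CPU. All sizes and the random "text" here are assumptions for illustration. The window test `(t + 1) % bptt == 0` is a slight variation chosen so that every window has exactly `bptt` steps, and the loss uses PyTorch's built-in `cross_entropy` on raw scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
nhidden, nfeatures, nseq, seq_len, bptt = 8, 5, 4, 41, 10  # toy sizes (assumptions)

# Random integer "text": nseq long sequences; char t predicts char t+1
data = torch.randint(0, nfeatures, (nseq, seq_len))
X, y = data[:, :-1], data[:, 1:]

W = torch.eye(nhidden, requires_grad=True)
U = (0.1 * torch.randn(nhidden, nfeatures)).requires_grad_()
V = (0.1 * torch.randn(nfeatures, nhidden)).requires_grad_()
optimizer = torch.optim.Adam([W, U, V], lr=0.01)

H = torch.zeros(nhidden, nseq)   # one hidden-state column per sequence
loss, nupdates = 0.0, 0
for t in range(X.shape[1]):
    x_t = F.one_hot(X[:, t], nfeatures).T.float()  # nfeatures x nseq
    H = torch.tanh(W @ H + U @ x_t)
    logits = (V @ H).T                             # nseq x nfeatures
    loss = loss + F.cross_entropy(logits, y[:, t])
    if (t + 1) % bptt == 0:      # every bptt steps: update, then cut the graph
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss = 0.0
        H = H.detach()           # keep the values, drop the history
        nupdates += 1
```

The `detach()` call is what makes the backpropagation truncated: the hidden state values carry over into the next window, but gradients no longer flow back past the window boundary.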

Let's say that we have a large text and we break it up into six chunks: A, B, C, D, E, F. Then six is our batch size, and we will process each long sequence exactly once per epoch. To get the stochastic nature, however, we will update the parameters after only a short run of characters. We pick the chunk size and the batch size is then computed, instead of having to specify both. I think the chunk size is more important: how much can you store in a single hidden state vector?

Come to think of it, all we need to specify is the number of chunks we want to break the text into. There won't be any batch size because we have a single batch with `nchunks` long records in it.
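As a small illustration of the chunking scheme, here is a sketch on a toy string, using the same slicing the notebook applies to the real corpus. The `_demo` names and the sample sentence are made up for this example.

```python
text_demo = "the quick brown fox jumps"   # 25 characters, toy stand-in for the corpus
nchunks_demo = 4
chunk_size_demo = len(text_demo) // nchunks_demo   # 25 // 4 = 6
n_demo = nchunks_demo * chunk_size_demo            # 24, so the ragged last char is dropped
chunks_demo = [text_demo[p:p + chunk_size_demo]
               for p in range(0, n_demo, chunk_size_demo)]

# Within each chunk, character t is the input and character t+1 is the target
inputs  = [c[:-1] for c in chunks_demo]
targets = [c[1:]  for c in chunks_demo]

for c, i, t in zip(chunks_demo, inputs, targets):
    print(repr(c), repr(i), repr(t))
```

Each chunk becomes one row of the batch, so at time step `t` the RNN sees character `t` of every chunk at once.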

In [8]:
text = get_text("data/obama-speeches.txt").lower() # generated from obama-sentences.py
len(text)

4224143

In [22]:
text = text[0:2_000_000] # testing

n = len(text)
nchunks = 1000 # break up the input into a number of chunks (doesn't have to be small like batch size)
chunk_size = n // nchunks # the sequences will be very long
n = nchunks * chunk_size  # reset size so it's an even multiple of chunk size
text = text[0:n]

In [23]:
vocab, ctoi = getvocab(text)

In [24]:
chunks = [text[p:p+chunk_size] for p in range(0, n, chunk_size)]
X = torch.empty(nchunks, chunk_size-1, device=device, dtype=torch.long) # int8 doesn't work as indices
y = torch.empty(nchunks, chunk_size-1, device=device, dtype=torch.long)
for i,chunk in enumerate(chunks):
    X[i,:] = torch.tensor([ctoi[c] for c in chunk[0:-1]], device=device)
    y[i,:] = torch.tensor([ctoi[c] for c in chunk[1:]],   device=device)
    
# X, y are now chunked and numericalized into big 2D matrices

In [25]:
nhidden = 512
nfeatures = len(ctoi)
nclasses = nfeatures
print(f"{nchunks:,d} training records, chunk length {chunk_size}, {nfeatures} features (chars), state is {nhidden}-vector")

1,000 training records, chunk length 1500, 70 features (chars), state is 512-vector


In [26]:
def forward(x):
    "Run the RNN over one record x (a sequence of chars); return len(x) x nclasses probabilities"
    outputs = []
    h = torch.zeros(nhidden, 1, device=device, dtype=torch.float64, requires_grad=False)  # reset hidden state at start of record
    for j in range(len(x)):  # for each char in the record
        h = W@h + U@onehot(x[j])
        h = torch.tanh(h)
        o = V@h
        o = o.reshape(1,nclasses)
        o = softmax(o)
        outputs.append( o[0] ) 
    return torch.stack(outputs)

def forwardN(X:Sequence[Sequence]):#, apply_softmax=True):
    "Cut-n-paste from body of training for use with metrics"
    outputs = []
    for i in range(0, len(X)): # for each input record
        o = forward(X[i])
        outputs.append( o[0] ) 
    return torch.stack(outputs)

In [27]:
# I think we can fit this on the GPU in its entirety
X_onehot = onehot_matrix(X, ctoi)

In [45]:
X_onehot.shape, nchunks

(torch.Size([1000, 1499, 70]), 1000)

In [None]:
#%%time 
#torch.manual_seed(0) # SET SEED FOR TESTING
W = torch.eye(nhidden,    nhidden,   device=device, dtype=torch.float64, requires_grad=True)
U = torch.randn(nhidden,  nfeatures, device=device, dtype=torch.float64, requires_grad=True) # embed one-hot char vec
V = torch.randn(nclasses, nhidden,   device=device, dtype=torch.float64, requires_grad=True) # take RNN output (h) and predict target

optimizer = torch.optim.Adam([W,U,V], lr=0.001, weight_decay=0.0)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=.5)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, 
                                              mode='triangular2',
                                              step_size_up=5,
                                              base_lr=0.0001, max_lr=0.005,
                                              cycle_momentum=False)

print("BEGIN")
history = []
epochs = 50
bptt = 10 # only look back this many time steps for gradients
for epoch in range(1, epochs+1):
#     print(f"EPOCH {epoch}")
#     shuffled_idx = torch.randperm(nchunks) # shuffle each epoch (don't need actually)
    H = torch.zeros(nhidden, nchunks, device=device, dtype=torch.float64, requires_grad=False)
    epoch_training_loss = 0.0
    epoch_training_accur = 0.0
    loss = 0
    for t in range(chunk_size-1):  # char t in chunk predicts t+1 so one less
#         print(f"t={t}")
        x_step_t = X_onehot[:,t].T # make it len(vocab) x nchunks
#         print(x_step_t.shape, H.shape, W.shape, U.shape)
        H = W.mm(H) + U.mm(x_step_t)
        H = torch.tanh(H)
        o = V.mm(H)
        o = o.T # make it nchunks x nclasses
        o = softmax(o)
        correct = torch.argmax(o, dim=1)==y[:,t]
        epoch_training_accur += torch.sum(correct)
#         print(f"loss {loss:7.4f}")
        loss += cross_entropy(o, y[:,t])
        
        if t % bptt == 0 and t > 0:
#             print(f"gradient at {t:4d}, loss {loss.item():7.4f}")
            optimizer.zero_grad()
            loss.backward() # autograd computes W.grad, U.grad, V.grad
            optimizer.step()
            epoch_training_loss += loss.detach().item()
            loss = 0
            H = H.detach() # no longer consider previous computations

    epoch_training_accur /=  nchunks * (chunk_size-1)
    scheduler.step()
    
    print(f"Epoch {epoch:3d} training loss {epoch_training_loss:8.2f}   accur {epoch_training_accur:7.4f}   LR {scheduler.get_last_lr()[0]:7.6f}")

BEGIN


In [29]:
def sample(initial_chars, n, temperature=0.1):
    "Derived from Karpathy: https://gist.github.com/karpathy/d4dee566867f8291f086 (temperature is accepted but not applied)"
    chars = initial_chars
    n -= len(initial_chars)
    with torch.no_grad():
        for i in range(n):
            h = torch.zeros(nhidden, 1, dtype=torch.float64, device=device, requires_grad=False)  # recompute hidden state from scratch for each new char
            for j in range(len(chars)):  # for each char seen or generated so far
                h = W@h + U@onehot(chars[j])
                h = torch.tanh(h)
            o = V@h
            o = o.reshape(nclasses)
            p = softmax(o)
#             wi = torch.argmax(p) # this doesn't work (just repeats 'and' a million times)
            wi = np.random.choice(range(len(vocab)), p=p.cpu()) # don't always pick most likely; pick per distribution
            chars.append(vocab[wi])
    return chars
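The `sample` function above re-runs the RNN over the whole prefix for every new character and does not apply its `temperature` argument. One possible variant, sketched below under the hypothetical name `sample_incremental`, carries the hidden state forward (which yields the same state as recomputing it) and divides the scores by `temperature` before the softmax. The random toy weights and `_demo` names are assumptions so the block runs standalone on the CPU. With the trained model one would pass the notebook's `W`, `U`, `V`, `vocab`, and `ctoi`, moved to the CPU.

```python
import numpy as np
import torch

def sample_incremental(W, U, V, vocab, ctoi, initial_chars, n, temperature=1.0):
    """
    Hypothetical variant of sample(): keeps h between steps instead of
    recomputing it, and scales the scores by 1/temperature before softmax.
    Assumes W, U, V live on the CPU.
    """
    chars = list(initial_chars)              # copy so the caller's list is not mutated
    with torch.no_grad():
        h = torch.zeros(W.shape[0], 1, dtype=W.dtype)

        def step(h, c):
            x = torch.zeros(len(vocab), 1, dtype=W.dtype)
            x[ctoi[c]] = 1                   # one-hot column for char c
            return torch.tanh(W @ h + U @ x)

        for c in chars:                      # warm up the state on the prompt
            h = step(h, c)
        while len(chars) < n:
            o = (V @ h).reshape(-1) / temperature
            p = torch.softmax(o, dim=0).numpy()
            wi = np.random.choice(len(vocab), p=p / p.sum())  # sample per distribution
            chars.append(vocab[wi])
            h = step(h, vocab[wi])           # feed the sampled char back in
    return chars

torch.manual_seed(0)
vocab_demo = sorted(set("the job"))
ctoi_demo = {c: i for i, c in enumerate(vocab_demo)}
k, nh = len(vocab_demo), 16
W_demo = torch.eye(nh, dtype=torch.float64)
U_demo = torch.randn(nh, k, dtype=torch.float64)
V_demo = torch.randn(k, nh, dtype=torch.float64)
out = sample_incremental(W_demo, U_demo, V_demo, vocab_demo, ctoi_demo,
                         list("the job"), 30, temperature=0.8)
```

Lower temperatures sharpen the distribution toward the most likely character, and a temperature of 1.0 samples from the unmodified softmax as the original function does.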

In [31]:
''.join( sample(list('the job'), 300) ) 

"the jobs, just.  this inform the internewis, vivision. we've some usneed monthing that is need, optinous, so weve learne, through. 2b, know work power in this extend obe now future to idea the begants, or afghanizans, but this resourced.\n\nafted for violed, we’re global insilence and about choice, an"