In [1]:
import time, sys
import numpy as np

import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

import import_ipynb
from MoveData import Options, json2datatools, num_batches, nopeak_mask, create_masks
from EncoderDecoder import Encoder, Decoder

importing Jupyter notebook from MoveData.ipynb
importing Jupyter notebook from EncoderDecoder.ipynb
importing Jupyter notebook from Elements.ipynb


If you are here to learn and interact with the notebook, change the variable `teaching` below to `True`. If you are not using this notebook to learn, leave it as `False` so that other notebooks can import the functions defined here without running all the examples.

In [2]:
teaching = False # True 

## The Big Picture

I'm glad you decided to dive deeper. If we want to understand where deep learning is going, what new frontiers the experts are "eyeing", how to modify deep learning pipelines for our own goals, or how to create new progress in AI, we need to strive for a fundamental understanding.  

There is so much information out there on the internet about the basic feed forward neural network. If you are a beginner to neural networks I do suggest [reading this](https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/) or [watching this](https://youtu.be/kNPGXgzxoHw) as a starter into the world of WHY / HOW neural networks work, in a mathematical and geometric sense.  

I will not try to reinvent the wheel. Instead I want to do what I feel is not done often enough, namely bringing these concepts into a fun, relatable, tangible robot while showing the actual code. This section on Training will be by far the most reading heavy; if you can get through this one, the rest will be mostly toy coding examples and diagrams. 

In Deep Reinforcement Learning, Neural Machine Translation, Convolutional Neural Networks, training chloe to talk, and all of Deep Learning, we are adjusting the weights / parameters / synapses of our neural network to either maximize some reward or minimize some loss function. Here is a seed I want to plant in your head that we will elaborate on later: in the diagram below, what determines the colored landscape is our batch of training data, which comes out of the `data_iter` that you used in the introduction section START_HERE.ipynb.

<img src="../saved/images/LossTopology.png">

All those red dots you see are the initial starting points of the weights / parameters / synapses of our neural network. Usually the weights are randomly initialized as small numbers close to 0.0. There are tricks for sampling those weights from a better distribution, but that's for another lesson. As we get into the details, I think it is useful to keep this big picture in mind. 

OK, let's get started. Whether or not you are running this in a GPU ready environment, the `if else` statement below will set `device` appropriately to use the GPU or CPU. We are working on such a small dataset that, for demonstration purposes, a CPU will do just as well; a GPU is not useful unless you have augmented the data to be significantly larger.

In [3]:
if teaching:
    if torch.cuda.is_available():
        print("you have", torch.cuda.device_count(), "GPUs")
        device = torch.device("cuda:0")
    else:
        print('using CPU for evaluation and training')
        device = torch.device("cpu")

## Preparing the Data
In START_HERE.ipynb you have already seen the `Options` class and `json2datatools` function. As a reminder, `opt` is just a way to keep and pass all your preferences in a single input, rather than typing out every hyperparameter again and again. We will need the input and output vocabularies `infield, outfield` to demonstrate how a sequence of words is represented by the transformer, and we will use `data_iter` to show how data flows from the dataset through our transformer.

In [4]:
if teaching:
    opt = Options(batchsize=2, device=device, epochs=25, lr=0.01, 
                  beam_width=3, max_len = 25, save_path = '../saved/weights/model_weights')

    data_iter, infield, outfield, opt = json2datatools(path='../saved/pairs.json', opt=opt)

## The Transformer and Optimizer 
Let's take a look inside the Transformer. Like older sequence to sequence models, Transformers encode the sentence into vector representations and pass those representations along to the decoder to generate the response/reply/output/translation/etc. The Encoder and Decoder have subcomponents which are discussed in Elements, Encoder and Decoder. For now, just define the Transformer class, instantiate a model as chloe and define the optimizer and scheduler by running the cells below.

In [5]:
class Transformer(nn.Module):
    def __init__(self, in_vocab_size, out_vocab_size, emb_dim, n_layers, heads, dropout):
        super().__init__()
        self.encoder = Encoder(in_vocab_size, emb_dim, n_layers, heads, dropout)
        self.decoder = Decoder(out_vocab_size, emb_dim, n_layers, heads, dropout)
        self.out = nn.Linear(emb_dim, out_vocab_size)
    def forward(self, src_seq, trg_seq, src_mask, trg_mask):
        e_output = self.encoder(src_seq, src_mask)
        d_output = self.decoder(trg_seq, e_output, src_mask, trg_mask)
        output = self.out(d_output)
        return output

With the scheduler in 'min' mode, the learning rate is multiplied by 0.5 once the monitored quantity (`epoch_loss`) has stopped decreasing for more than `patience` epochs (5 here). In the training loop, you will see this quantity being monitored in `scheduler.step(epoch_loss)`.
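To make the plateau rule concrete, here is a minimal pure-Python sketch of the idea. It is a simplification for illustration only, not PyTorch's implementation: the helper name `plateau_schedule` and the made-up loss values are assumptions, and details such as thresholds and cooldown are left out.

```python
def plateau_schedule(losses, lr=0.01, factor=0.5, patience=5):
    """Simplified reduce-on-plateau rule (hypothetical helper, not PyTorch's code).

    Returns the learning rate in effect after each epoch's loss is reported.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:            # improvement: remember it and reset the counter
            best = loss
            bad_epochs = 0
        else:                      # no improvement this epoch
            bad_epochs += 1
        if bad_epochs > patience:  # stalled for too long: shrink the step size
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

# Made-up losses: three epochs of improvement, then seven stalled epochs
losses = [1.0, 0.8, 0.7] + [0.7] * 7
lrs = plateau_schedule(losses)
print(lrs)
```

With these numbers the learning rate stays at 0.01 until the sixth stalled epoch in a row, then drops to 0.005.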

In [6]:
if teaching:
    emb_dim, n_layers, heads, dropout = 32, 3, 8, 0.01 
    chloe = Transformer(len(infield.vocab), len(outfield.vocab), emb_dim, n_layers, heads, dropout)
    # map_location lets weights saved on a GPU load on a CPU-only machine
    chloe.load_state_dict(torch.load(opt.save_path, map_location=opt.device))

    if opt.device == torch.device("cuda:0"):
        chloe =  chloe.cuda()

    optimizer = torch.optim.Adam(chloe.parameters(), lr=opt.lr, betas=(0.9, 0.98), eps=1e-9)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.5, patience=5)

## The Training Batch
The line `batch = next(iter(data_iter))` extracts one batch of data into the variable `batch`. The batch has two parts to it, the `listen` part that is the input to the model, and the `reply` part that is the preferred output response to the corresponding `listen` input. This is a flawed assumption for conversational modeling since there are many valid responses to anything one listens to or hears. But for now we will use this method to teach chloe to reply in some linguistically coherent manner. 

`.transpose(0,1)` swaps the first two dimensions of `batch.listen` and `batch.reply` so that the batch dimension comes first.

The first dimension of `listen.shape` and `reply.shape` should match whatever you put into `opt = Options(batchsize=2, ...)` above; it is the number of samples in each batch. Each sample is one line in our pairs.json dataset. The next dimension is the sequence length. 

source input sequence shape **(batch_size, sequence_length)**
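As a rough picture of what the transpose does, here is a sketch with plain Python lists instead of tensors. The integers are made up, and the starting layout (one row per time step) is an assumption inferred from the fact that the batch dimension only comes first after the transpose.

```python
# Hypothetical batch as it might come out of the iterator:
# one row per time step, one column per sample -> (sequence_length, batch_size)
batch_listen = [
    [10, 20],   # first token of sample 0 and of sample 1
    [11, 21],   # second token of each sample
    [12, 22],   # third token of each sample
]

# Swapping the two dimensions gives one row per sample -> (batch_size, sequence_length)
listen = [list(column) for column in zip(*batch_listen)]

print(listen)                          # [[10, 11, 12], [20, 21, 22]]
print((len(listen), len(listen[0])))   # (2, 3)
```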

Although it seems messy, I have printed out each piece of data from the batch and its shape. Each is a 2-dimensional tensor, or matrix. Above the listen and reply tensors I have printed the entire vocabulary so you can confirm that the integers represent coherent sentences. 

Notice that the reply always starts with an integer that represents the "Start Of Sentence" or `<sos>` token, and ends with the "End Of Sentence" or `<eos>` token, so its sequence length is the length of the reply sentence + 2.
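Here is a small sketch of how a reply sentence could turn into such a sequence of integers. The tiny vocabulary and the helper `encode_reply` are hypothetical; the real mapping lives in `outfield.vocab.stoi` and will have different indices.

```python
# Hypothetical output vocabulary (string -> index); the real one is outfield.vocab.stoi
stoi = {"<unk>": 0, "<pad>": 1, "<sos>": 2, "<eos>": 3, "see": 4, "ya": 5}

def encode_reply(sentence, stoi):
    """Map words to integers and wrap the result in <sos> ... <eos>."""
    words = sentence.split()
    ids = [stoi.get(word, stoi["<unk>"]) for word in words]   # unknown words become <unk>
    return [stoi["<sos>"]] + ids + [stoi["<eos>"]]

reply_ids = encode_reply("see ya", stoi)
print(reply_ids)        # [2, 4, 5, 3]
print(len(reply_ids))   # 4 = 2 words + 2 special tokens
```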

In [7]:
if teaching:
    batch = next(iter(data_iter))
    listen = batch.listen.transpose(0,1)
    reply = batch.reply.transpose(0,1)
    reply_input = reply[:, :-1]

    print(infield.vocab.stoi)
    print(" ------------------------------------------------------ ")
    print(listen, listen.shape)
    print(" ------------------------------------------------------ ")
    print(outfield.vocab.stoi)
    print(" ------------------------------------------------------ ")
    print(reply, reply.shape)

## Padding and Masks
Tensors have to be single values (scalars), lines (vectors), rectangles (matrices) or boxes. This means that if I want my batch to contain sentences of unequal length, the shorter sentences have to be padded with the `<pad>` token so that every sentence is the same length. The example input batch here has been padded into a 3x3 tensor:

`i`      `love`   `you`

`hello`  `<pad>` `<pad>`

`see`  `ya` `<pad>`


However, the padding does not carry useful information, so it must be masked away. Masks are tensors that line up with the tensors they are a mask for; here we show a mask for the listen tensor and for the reply tensor. Think of a real mask: usually there will be two holes poked out for the eyes and one hole for the mouth, but otherwise the mask is meant to cover the rest of the face. The mask for the input listen tensor covers the padded elements. It seems tedious, but this is an artifact of the way we need to prepare data to train efficiently. The mask tensors printed below have a `True` or `1` in the positions that are open for use, and a `False` or `0` in the positions that are meant to be covered or not used. 

The shape of the listen mask is **(batch_size, 1, input_sequence_length)**, why is there a 1 in there? We will explain later. 
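The padding and masking steps can be sketched with plain Python lists. This is only an illustration of the idea: the helpers `pad_batch` and `padding_mask` are hypothetical, and how `create_masks` builds its mask internally is not shown in this notebook.

```python
PAD = "<pad>"

def pad_batch(sentences, pad_token=PAD):
    """Pad every sentence (a list of tokens) to the length of the longest one."""
    longest = max(len(s) for s in sentences)
    return [s + [pad_token] * (longest - len(s)) for s in sentences]

def padding_mask(padded, pad_token=PAD):
    """True where a real token sits, False over padding.

    Each sample's row is wrapped in an extra list, giving the nested shape
    (batch_size, 1, sequence_length).
    """
    return [[[token != pad_token for token in sentence]] for sentence in padded]

batch = pad_batch([["i", "love", "you"], ["hello"], ["see", "ya"]])
mask = padding_mask(batch)

for sentence, sample_mask in zip(batch, mask):
    print(sentence, sample_mask)
```

Each row of the mask is `True` exactly where the padded batch holds a real word.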

The mask for the output sequence, or reply tensor, has an additional meaning. Transformers use something called "Attention", which we will talk about in the Elements.ipynb, Encoder.ipynb and Decoder.ipynb sections. At a high level, attention is a good word to describe what is occurring computationally. Suppose you are replying to this phrase:

"action potentials flow from dendrite to soma to where ?"

you write:

"they from from soma to axon"

As you, the decoder, intend to write the words "they flow from", you are paying attention to the words "action potentials flow from" from the encoder. This is called encoder attention. In addition, you see that instead of "flow" I wrote "from" twice because I didn't pay enough attention to what I had already written. Paying attention to your own earlier words is called self attention. 

As the decoder writes each word in response, one word at a time, the word it just produced and its previous words are fed back into the decoder to get the next word, so it can pay attention to what has already been written. 

During training, we judge the decoder on whether it produced the right word or not. Whatever it produced, we insert the right word at each position and hope that the next position is better predicted. This is called teacher forcing. The decoder can only pay attention to words in earlier positions, so even though we have the entire correct sequence given to us with each training batch, we hide the words in later positions through the mask to simulate real self attention (you can't pay self attention to words you haven't said yet). In addition to covering the pad tokens, the reply mask also covers sequence positions in the future. 

This is why the shape of the reply mask is **(batch_size, output_sequence_length, output_sequence_length)**. Having two dimensions with the same name might be confusing, so I will rewrite this as 

**reply mask (batch_index, time_step, output_mask_position)**

In the print statement below, I have shown you the mask `reply_mask[0,0,:]`, indexed as `[0,0` for the first sample and first time step, where `:` means all positions. It prints `True, False, . . . False` because at the first time step you can pay attention to the `<sos>` token only, not any future words in the "correct reply", also called the target sequence. At the next time step you can pay attention to both `<sos>` and the first correct token, thus the mask will have the first two positions as True and the rest False: `True, True, False . . . False`
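The rule "no future positions, no padding" can be written out for a single reply in a few lines of plain Python. The helper `reply_mask_for` is hypothetical and only mirrors the description above; it is not the code inside `create_masks`.

```python
def reply_mask_for(reply_input, pad_token="<pad>"):
    """Build a (time_step, position) mask for one reply.

    Position j is visible at time step t only if j <= t (not in the future)
    and the token at j is not padding.
    """
    length = len(reply_input)
    return [
        [(j <= t) and (reply_input[j] != pad_token) for j in range(length)]
        for t in range(length)
    ]

mask = reply_mask_for(["<sos>", "see", "ya", "<pad>"])
for t, row in enumerate(mask):
    print("time step", t, row)
```

The first row is `True, False, False, False` and the second is `True, True, False, False`, matching the pattern described above; the padded last position stays covered at every time step.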

In [8]:
if teaching:
    listen_mask, reply_mask = create_masks(listen, reply_input, opt)
    print(listen_mask, listen_mask.shape)
    print(" ------------------------------------------------------ ")
    print( reply_mask.shape)
    print(" ------------------------------------------------------ ")
    print("first time step", reply_mask[0,0,:])
    print(" ------------------------------------------------------ ")
    print("second time step", reply_mask[0,1,:])

## What the Model Output Tensor Means
The shape of the model (chloe's) output is **(batch_size, output_sequence_length, output_vocab_size)**
For every reply in the batch, chloe says output_sequence_length words, and for each output position, `len(outfield.vocab)` numbers are given. `len(outfield.vocab)` is the number of words that are in chloe's output vocabulary. Yes, chloe has a separate vocab size for what she can hear and what she can say. Suppose output_vocab_size = 100; then chloe only knows how to output 100 different tokens. I say token because this includes not only words, but also `<eos>`, `?`, `ummm`, `,` and other discrete symbols. 

Each of these numbers in `chloes_reply[0,-1,:]` represents one token, when I ran this cell the response included 

`'<unk>': 0, '<pad>': 1, '<sos>': 2, '<eos>': 3, 'you': 4, '?': 5,`

`tensor([-1.7353, -0.5510, -1.4945,  7.2842,  2.6917,` 

This makes a lot of sense, because `[0,-1,:]` indexes the first example and the last output position, and `:` takes every entry in the vocabulary. The tensor has `-1.7353` in position 0 and `7.2842` in position 3. That was the highest number in the vector, which means that of all the tokens in the vocabulary, the one chloe thinks should come next is `<eos>` (3 is the index of the end of sentence token, `'<eos>': 3`), meaning that she is done talking. 
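Picking the highest score and looking up its token takes only a few lines. The numbers below are copied from the truncated printout above, so this sketch covers just the first five vocabulary entries; the real vector has one score per token.

```python
# First five entries of the vocabulary and of chloe's output, as printed above
stoi = {"<unk>": 0, "<pad>": 1, "<sos>": 2, "<eos>": 3, "you": 4}
scores = [-1.7353, -0.5510, -1.4945, 7.2842, 2.6917]

itos = {index: token for token, index in stoi.items()}   # index -> token lookup
best_index = max(range(len(scores)), key=lambda i: scores[i])

print(best_index, itos[best_index])   # 3 <eos>
```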

In [9]:
if teaching:
    print('input shape', listen.shape)
    print(" ------------------------------------------------------ ")
    chloes_reply = chloe(listen, reply_input, listen_mask, reply_mask)
    print('input_vocab_size =', len(infield.vocab), ', output_vocab_size =', len(outfield.vocab))
    print(" ------------------------------------------------------ ")
    print(outfield.vocab.stoi)
    print(" ------------------------------------------------------ ")
    print(chloes_reply[0,-1,:], chloes_reply.shape)

## Supervised Learning

Supervised Learning is the kind of learning that started the AI renaissance of the 2010s, but many experts argue that it is not the type of learning that will get us to human level AI, at least not by itself. 

Suppose the correct reply to listen = `bye`  `chloe` is reply = `see`  `ya` `<eos>`

The batch we get gives us listen = `bye`  `chloe` and reply = `<sos>` `see`  `ya` `<eos>`

We input `bye`  `chloe` into chloe, give her `<sos>` `see`  `ya`, plus the masks to prevent peeking into the future, and we compare whatever she replies against `see`  `ya` `<eos>`. I have been referring to `see`  `ya` `<eos>` as the "correct response", but this is also called the "target", abbreviated trg. In addition, what I have been referring to as "listen" is also called the "source", abbreviated src. 
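The two slices of the reply can be shown with a plain list. This mirrors `reply[:, :-1]` from the batch cell above and `trg[:, 1:]` in the training function, just for a single sentence with words instead of integers.

```python
reply = ["<sos>", "see", "ya", "<eos>"]

decoder_input = reply[:-1]   # what chloe is given: everything except the last token
target = reply[1:]           # what she is graded against: everything except <sos>

for step, expected in enumerate(target):
    # at each step the mask lets her see decoder_input[:step + 1] only
    print(step, decoder_input[:step + 1], "->", expected)
```

Each line shows what chloe is allowed to see at that step and the single token she should produce next.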

This kind of training, in which we repeatedly challenge chloe by giving her an input sentence (listen tensor), see how she responds (`chloes_reply`), compare `chloes_reply` to the right answer `reply`, aka "the target", then modify chloe's weights and biases (her synaptic connections) in order to get her `chloes_reply` closer to `reply`, is called supervised learning. 

When I say "compare" `chloes_reply` to the right answer `reply`, what do I mean? I mean that we have to give her a grade, or a numerical score, that tells us how well she is doing. Once we know that number, we can change chloe in such a way that increases or decreases the score. In machine learning, we typically tie improvement to lower scores, unlike in school where we tie better performance to higher scores. One way to rationalize this to yourself is that in supervised learning there is one correct answer, and if you get that correct answer, your score is zero. There is no extra credit or creativity points here, just more and more punishment the further you are from the correct answer. What a terrible class, right? 

This numerical score is the Loss, and the algorithm used to "compare" `chloes_reply` to the right answer `reply` and generate that score is the loss function. The way that we change chloe to make her get a better Loss is by using one of the gradient descent optimizers, such as the Adam Optimizer we mentioned in the intro START HERE; we will go more into this in the Learning Dynamics section. If you don't know, gradient descent algorithms are fancier versions of the same concept: if your synaptic connection has weight `W`, your input is `x`, your algorithm is `Wx`, the target is `y` and your loss function is `Loss = (y - Wx)^2`, then the derivative of the loss with respect to W, `dL/dW`, tells you whether to increase or decrease `W` in order to decrease the Loss: if the derivative is positive you decrease `W`, and if it is negative you increase `W`. The derivative is also known as the gradient when `W` is not one number but several numbers. So changing chloe's `W` is how we get her to say what we want. We will explain where `W` is in chloe's neural network in the Elements section. 
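Here is that one-weight example as a runnable sketch. The values of `x`, `y`, the starting weight and the learning rate are made up for illustration, and this is plain gradient descent, not the Adam optimizer used for chloe.

```python
def loss(W, x, y):
    return (y - W * x) ** 2

def dloss_dW(W, x, y):
    # d/dW (y - W*x)^2 = -2 * x * (y - W*x)
    return -2.0 * x * (y - W * x)

x, y = 2.0, 6.0          # made-up example: the ideal weight is y / x = 3
W = 0.0                  # start from an arbitrary weight
learning_rate = 0.05     # made-up step size

for step in range(50):
    W -= learning_rate * dloss_dW(W, x, y)   # move against the gradient

print(round(W, 4), round(loss(W, x, y), 6))   # 3.0 0.0
```

At the start the derivative is negative, so the update increases `W`; as `W` approaches 3 the derivative shrinks towards zero and the steps get smaller.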

## Loss function

The loss function produces a scalar, calculated from the output of your neural network and the ground truth labels, and it is designed so that the neural network has to do what you want it to do in order for that scalar to be minimized. This means that if you change your parameters in the opposite direction of the gradient of the loss with respect to those parameters, the neural network gets better at its task, which here is classification: picking the right next token out of the vocabulary.

A common loss function is the Cross Entropy Loss function, I have written a simplified version for you here

$$\text{CrossEntropy}(\text{chloes reply}, \text{target}) = -\sum_i [i == \text{target}] \cdot \ln(\text{chloes reply}_i)$$

Let me explain this with a toy example. Suppose chloe only has 2 tokens in her vocabulary, `'cat': 0, 'french': 1`, and the correct answer to the input source "what is chloe?" is the output target "french cat". The target in token indices is therefore `[1, 0]`. Suppose your chloe model has never been trained before, so her output is `[0.0, 0.0]` for the first token and `[0.0, 0.0]` for the second token, meaning she has no idea what to say. Run the cell below to see what loss score chloe will get for this "unsure" response.

In [10]:
if teaching:
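    source = np.asarray([[0.0, 0.0], [0.0, 0.0]])   # raw scores (logits), one row per position
    target = np.asarray([1, 0])                     # index of the correct token at each position

    # Variable is deprecated; plain tensors track gradients directly
    source = torch.from_numpy(source).float().requires_grad_()
    target = torch.from_numpy(target).long()        # cross_entropy expects int64 class indices

    print(F.cross_entropy(source, target)) 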

The output you should get is `tensor(0.6931, grad_fn=<NllLossBackward>)` (the exact `grad_fn` name can differ between PyTorch versions). PyTorch took your input `[0.0, 0.0]` and put it through a softmax function, which exponentiates each number and divides by the total, so the vector sums (adds up) to 1.0 and larger inputs get larger shares. So `[0.0, 0.0]` was changed to `[0.5, 0.5]`, similarly `[1.0, 1.0]` is changed to `[0.5, 0.5]`,  `[-1.0, -1.0]` is changed to `[0.5, 0.5]` and `[1.0, 2.0]` is changed to `[0.27, 0.73]`. 
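If you want to check softmax by hand, a minimal version using only the standard library looks like this. The helper is a sketch of the textbook formula, not PyTorch's implementation.

```python
import math

def softmax(scores):
    """Exponentiate each score, then divide by the total so the results sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

for scores in ([0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, 2.0]):
    print(scores, [round(p, 4) for p in softmax(scores)])
```

Any vector whose entries are all equal comes out as `[0.5, 0.5]`, because only the differences between the scores matter.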

In the simplified cross entropy equation above 

$$[i == \text{target}]$$ 

means to only pay attention to the target index. Since the target was `[1,0]`, only the bolded positions in chloe's output [0.5 , **0.5**], [**0.5**, 0.5 ] are considered.

Now open https://www.wolframalpha.com/ and paste -ln(0.5) into the calculator. The answer you should get is 0.6931471805599453.  

In the cell below, you will see that 2 things have changed. Chloe has now learned to favor the word cat in the second position, however she favors the word cat even more so in the first position, where the correct token is `french`. We have also added the argument `ignore_index = 1` to our loss function, which skips every position whose target index is 1, so this mistake of failing to use the token for `french` will not count against her. In the cell below, you will see that the softmax function has normalized the vector `[2.0, 1.0]` to `[0.7311, 0.2689]`. Follow along in the equation for cross entropy and you will see that the loss -ln(0.731) is equal to 0.3133418192323586, which PyTorch prints as 0.3133.
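Both toy examples can be checked by hand with a few lines of plain Python. This is a sketch of the simplified formula above, averaged over the positions that are not ignored; the helper names are hypothetical and this is not PyTorch's implementation.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, targets, ignore_index=None):
    """Average of -ln(probability given to the target token).

    Positions whose target equals ignore_index are skipped entirely.
    """
    losses = []
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue
        losses.append(-math.log(softmax(scores)[target]))
    return sum(losses) / len(losses)

# Untrained chloe: every position is a coin flip, so the loss is -ln(0.5)
untrained = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [1, 0])
# Second example: the first position (target 1) is ignored, leaving -ln(0.7311)
ignored = cross_entropy([[4.0, -1.0], [2.0, 1.0]], [1, 0], ignore_index=1)

print(round(untrained, 4), round(ignored, 4))   # 0.6931 0.3133
```

Without `ignore_index`, the confident mistake in the first position would count and the second loss would be much larger.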

In [11]:
if teaching:
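    source = np.asarray([[4.0, -1.0], [2.0, 1.0]])  # chloe's raw scores for positions 1 and 2
    target = np.asarray([1, 0])

    source = torch.from_numpy(source).float().requires_grad_()
    print(F.softmax(source, dim=1))                 # softmax over the vocabulary dimension
    print(" ------------------------------------------------------ ")
    target = torch.from_numpy(target).long()        # cross_entropy expects int64 class indices

    # positions whose target index is 1 are left out of the loss
    print(F.cross_entropy(source, target, ignore_index = 1)) 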

The function below is our training function. It sets the model into training mode and runs a prespecified number of epochs, each epoch feeding the model batches of source target pairs until the entire dataset is seen. 

The loss is calculated for each batch of source target pairs, averaged over every target token that is not padding.

`batch_loss = F.cross_entropy(preds.view(-1, preds.size(-1)), ys, ignore_index = options.trg_pad)`
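The `.view(-1, ...)` calls merge the batch and sequence dimensions so that every row is one prediction with one target index. Here is the same reshaping with nested Python lists; the sizes and numbers are made up.

```python
# Hypothetical sizes: batch_size=2, sequence_length=3, vocab_size=4
preds = [
    [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]],   # sample 0
    [[1.3, 1.4, 1.5, 1.6], [1.7, 1.8, 1.9, 2.0], [2.1, 2.2, 2.3, 2.4]],   # sample 1
]
ys = [[2, 3, 1], [0, 3, 1]]   # one target index per (sample, position)

# Merge the batch and sequence dimensions so each row is one prediction
flat_preds = [scores for sample in preds for scores in sample]   # (6, 4)
flat_ys = [index for sample in ys for index in sample]           # (6,)

print(len(flat_preds), len(flat_preds[0]), len(flat_ys))   # 6 4 6
```

Row `k` of `flat_preds` still lines up with entry `k` of `flat_ys`, which is what the loss function needs.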

The gradients with respect to the model weights, aka parameters, are calculated with the backpropagation algorithm and used to update the weights.

`batch_loss.backward()`

The optimizer then applies its update to the weights, and once per epoch the scheduler is told about our progress so it can adjust the learning rate

`optimizer.step()`
`scheduler.step(epoch_loss)`

Whenever we reach a new best (lowest) loss value, we save the model

`if epoch_loss < best_loss:`

`torch.save(model.state_dict(), options.save_path)`

Like we said in the intro, Optimization is a very large field, and we will clarify more in the Learning Dynamics section. Before going into neural networks and all the components, or modules, of our model, I think it helps to have a sense of how all of it is supposed to work together so you keep in mind the big picture, even while we dive into the details.

In [12]:
def trainer(model, data_iterator, options, optimizer, scheduler):

    if torch.cuda.is_available() and options.device == torch.device("cuda:0"):
        print("a GPU was detected, model will be trained on GPU")
        model = model.cuda()
    else:
        print("training on cpu")

    model.train()
    start = time.time()
    best_loss = float("inf")  # so the first epoch always counts as an improvement
    for epoch in range(options.epochs):
        total_loss = 0
        for i, batch in enumerate(data_iterator): 
            src = batch.listen.transpose(0,1)
            trg = batch.reply.transpose(0,1)
            trg_input = trg[:, :-1]
            src_mask, trg_mask = create_masks(src, trg_input, options)
            # argument order must match Transformer.forward(src_seq, trg_seq, src_mask, trg_mask)
            preds = model(src, trg_input, src_mask, trg_mask)
            ys = trg[:, 1:].contiguous().view(-1)
            optimizer.zero_grad()
            batch_loss = F.cross_entropy(preds.view(-1, preds.size(-1)), 
                                         ys, ignore_index = options.trg_pad)
            batch_loss.backward()
            optimizer.step()
            total_loss += batch_loss.item()

        epoch_loss = total_loss/(num_batches(data_iterator)+1)
        scheduler.step(epoch_loss)

        if epoch_loss < best_loss:
            best_loss = epoch_loss
            torch.save(model.state_dict(), options.save_path)
        print("%dm: epoch %d loss = %.3f" %((time.time() - start)//60, epoch, epoch_loss))
        total_loss = 0

    return model

The next lesson is about the building blocks of our neural network. Open `Elements.ipynb` for the next part of our intellectual adventure.


## How can I help you or get help from you?

[Support *ChloeRobotics* on Patreon](https://www.patreon.com/chloerobotics)

email chloe.the.robot [at] gmail [dot] com 
