# Language Models, etc.

### We pose our problem as sequence prediction. 

### Given a sequence on say n-characters, we have to predict what will show up next. 

### Let's think of why this works?




### Can humans do thi_?

* thereb_
* Luke, I __ your father.

* There are patterns in character sequences that form valid comprehensible text. For example, `qqrxt` is not a word - you may assert.

**Turns out, the network we'll be training will capture just this!**


## Our usual imports

In [6]:
import torch
from torch.utils.data import DataLoader, Dataset
from torch import optim, nn
import torch.nn.functional as F
from tqdm import tqdm_notebook as tqdm
import random

## Vocabulary

We assume a finite vocabulary. 

The vocabulary could be words, or characters or whatever unit which suits your task. 

Given vocabulary is finite, we can pose this as a classification problem.


Sure, we can't put text/characters through neural networks, we put an integers which correspond to the text/characters. A bijective map between the two, we maintain using `Indexing`.

In [7]:
class Indexing:
    def __init__(self, vocabulary):
        self.v2idx = {}
        self.idx2v = {}
        self.counter = 0
        for v in vocabulary:
            self.add(v)
        
    def add(self, v):
        assert(v not in self.v2idx)
        self.v2idx[v] = self.counter
        self.idx2v[self.counter] = v
        self.counter += 1
        
    def __len__(self):
        return len(self.v2idx)
        
    def direct(self, v):
        return self.v2idx[v]
    
    def inverse(self, idx):
        return self.idx2v[idx]

## tl;dr

```python
vocabulary = ['sea', 'salt', 'is', 'a', 'great', 'seasoning']
mapping = Indexing(vocabulary)

mapping.direct('sea') # => 0
mapping.inverse(0) # => sea

```

## Embedding 

* Regression => [4, 2, 1, 5] => Too dense. 
* Let's model vocabulary as classes. We'll use something called one-hot embedding.



Image Courtesy: https://www.datascience.com/
![One hot embedding](static/nn_embed.png)

We'll be doing the same thing, except for **characters instead of words**.

## one-hot embedding

In [8]:
class OneHotEmbedding(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.size = size
        
    def encoding(self, idx):
        x = torch.zeros(self.size).float()
        # key part
        x[idx] = 1.0
        return x
       
    def forward(self, x):
        # x  we get is going to be a T x B sequence.
        T, B = x.size()
        
        # We'll provide a T x B x H (self.size) sequence
        H = self.size
        target = torch.zeros(T, B, H)
        for b in range(B):
            for t in range(T):
                key = x[t, b].item()
                target[t, b, :] = self.encoding(key)
        return target

## Let's build and test RNNs!



In [9]:
class SequenceLearner(nn.Module):
    def __init__(self, embedding, input_size, hidden_size, n_layers, output_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding = embedding
        self.n_layers = n_layers
        self.output_size = output_size
        self.rnn = nn.RNN(input_size, hidden_size, n_layers)
        self.linear = nn.Linear(hidden_size, output_size)
        
    def h0(self, batch_size):
        return torch.zeros(self.n_layers, batch_size, self.hidden_size)
        
    def forward(self, xt, ht):
        xt_embedded = self.embedding(xt)
        xt_plus1, ht_plus1 = self.rnn(xt_embedded, ht)
        xt_plus1 = self.linear(xt_plus1)
        return xt_plus1, ht_plus1

An RNN, and linear layer.

## Some Tricks

It helps to have a small dataset with guaranteed convergence while developing your models. 

### Why?

* Prototyping gets faster.
    * Less computationally intensive.
    * Faster debug cycles.

## Turtle Logo

I propose turtle logo!

![Turtle Logo Syntax](static/logo.png)

We'll work with a smaller subset and with a bit of variations.

## Our objective

Our network should learn to hallucinate a valid turtle logo program, which looks something like below.

```logo
forward 50
right 90
forward 50
backward 50
right 90
```

We'll constrain the rotate arguments `left`, `right` to have only `90` as argument, while others can have any number less than 360. 

## Dataset

In [10]:
class SimpleTurtleLogoDataset(Dataset):
    def __init__(self, length):
        self.size = length
        self.cmds = ["forward", "backward", "right", "left"]
        
        # Vocabulary.
        chars = ''.join(self.cmds)
        unique_chars = list(set(chars))
        digits = ['{}'.format(i) for i in range(10)]
        special = "\n" + ' '
        vocabulary = unique_chars + digits + list(special)
        self.vocab = Indexing(vocabulary)
        
    def command(self):
        cmd = random.choice(self.cmds)
        if cmd == "right" or cmd == "left":
            arg = 90
        else: 
            arg = random.randint(0, 360)
        return '{} {}'.format(cmd, arg)
    
    def __len__(self):
        return self.size
    
    def __getitem__(self, i):
        cmds = [self.command() for i in range(3)]
        concat = '\n'.join(cmds)
        idxs = [self.vocab.direct(c) for c in concat]
        return torch.tensor(idxs)


## Process


![Learning](static/prediction.png)

## Loss

We'll use `nn.CrossEntropyLoss`, with a few customizations to be able to take our inputs.

In [11]:
class RNNCrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.criterion = nn.CrossEntropyLoss()
        
    def forward(self, preds, truths):
        T, B, H = preds.size()
        Tt, Bt = truths.size()
        assert( Bt == B and Tt == T)
        
        _preds = preds.view(-1, H)
        _truths = truths.view(-1)

        return self.criterion(_preds, _truths)

## Training subroutine

In [12]:
def train(model, optimizer, criterion, loader, train_method, epochs=100):
    for epoch in tqdm(range(1, epochs+1), desc='epoch'):
        iterator = enumerate(loader)
        for i, x in tqdm(iterator, desc='train', leave=False):
            x = x.transpose(1, 0)
            export = train_method(model, optimizer, criterion, x)

In [14]:
def backpropogate_every(grad_interval):
    def _inner(model, optimizer, criterion, x):
        T, B = x.size()
        optimizer.zero_grad()
        
        loss = 0
        total_loss = 0
        
        h = model.h0(B)
        for t in range(1, T):
            src, tgt = x[t-1:t,:], x[t:t+1, :]
            pred, h = model(src, h)
            loss += criterion(pred, tgt)
            if t % grad_interval == 0:
                h_copy = h.detach()
                loss.backward()
                total_loss += loss.item()
                loss = 0
                optimizer.step()  
                optimizer.zero_grad()
                h = h_copy
        avg_loss = total_loss/(T/grad_interval)
        return {"loss": avg_loss}
    return _inner


## Forward Pass

```python
# with seed as the current character, we try to learn the next character.
src, tgt = x[t-1:t,:], x[t:t+1, :] 

# But we are aided by the hidden state which captures the context until now.
pred, h = model(src, h) 
```



## Truncated BPTT

This technique is called truncated backpropogation through time. Read more about it [here](https://r2rt.com/styles-of-truncated-backpropagation.html).


```python
loss += criterion(pred, tgt)
if t % grad_interval == 0:
    # we make a clone of context, without grad history.
    h_copy = h.detach()     
    loss.backward()              
    total_loss += loss.item()
    loss = 0
    optimizer.step()  
    optimizer.zero_grad()
    h = h_copy
    
```


## Decoding

### Argmax Decoder

Argmax decoding is heavily used in translation. 

The idea is to predict the most probable word, given the current context. 

We take a softmax over the output activations from Linear layer to convert them into probabilities.

Then we take an argmax to find which would be the index which has maximum probability. 

The index corresponds to the word we need in `Indexing`.


In [15]:
class ArgmaxDecoder:
    def __init__(self, vocab):
        self.vocab = vocab
    
    def decode(self, activations):
        probs = F.softmax(activations, dim=2)
        mv, mi = probs.max(dim=2)
        return self.vocab.inverse(mi.item()) 

## Softmax with Temperature

Argmax would give us the same predictions all the time, and enter a loop.

But we want randomness in our predictions, and some control over the randomness.

The solution: [Temperature](https://cs.stackexchange.com/questions/79241/what-is-temperature-in-lstm-and-neural-networks-generally).

In [16]:
class TemperatureDecoder:
    def __init__(self, vocab, temperature):
        self.T = temperature
        self.vocab = vocab
    
    def decode(self, activations):
        feats_dist = activations.data.view(-1).div(self.T).exp()
        top_i = torch.multinomial(feats_dist, 1)
        return self.vocab.inverse(top_i.item())


## Predicting 

In [17]:
def predict(model, decoder, vocab, seed, max_length):
    pred_str = seed
    current = seed
    h0 = model.h0(1)
    
    for i in range(max_length):
        current = torch.Tensor([vocab.direct(current)]).long().view(1, 1)
        activations, h0 = model(current, h0)
        _next = decoder.decode(activations)
        pred_str += _next   
        current = _next
        
    return pred_str

## Piecing it all together.

In [18]:
dataset = SimpleTurtleLogoDataset(200)
loader = DataLoader(dataset)

hidden_size = 64
n_layers = 3

input_size = len(dataset.vocab)
output_size = input_size

embedding = OneHotEmbedding(input_size)
model = SequenceLearner(embedding, input_size, hidden_size, n_layers, output_size)
optimizer = optim.Adam(model.parameters(),lr=1e-3)
criterion = RNNCrossEntropyLoss()


In [19]:
train(model, optimizer, criterion, loader, backpropogate_every(6), epochs=5)

HBox(children=(IntProgress(value=0, description='epoch', max=5, style=ProgressStyle(description_width='initial…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…




In [20]:
argmaxdecoder = ArgmaxDecoder(dataset.vocab)
t_decoder = TemperatureDecoder(dataset.vocab, 0.7)

In [21]:
pred_str = predict(model, t_decoder, dataset.vocab, 'l', 205)
print(pred_str)

left 90
backward 290
left 90
forward 99
left 90
backward 128
left 90
right 90
right 90
backward 166
backward 116
left 90
forward 397
left 90
backward 378
backward 398
forward 116
forward 108
backward 22
lef


## `TextDataset`

A class supposed to handle any large text with some limited length per line. 


In [22]:
class TextDataset(Dataset):
    def __init__(self, path, tau):
        content = open(path).read()
        self.length = len(content)
        self.tau = tau
        
        vocabulary = sorted(list(set(content)))
        self.vocab = Indexing(vocabulary)
        
        self.lines = content.splitlines()
        
    def __getitem__(self, i):
        start = self.tau*i
        segment = self.lines[start:start+self.tau]
        text = '\n'.join(segment)
        idxs = [self.vocab.v2idx[c] for c in text]
        return torch.Tensor(idxs).long()
        
    
    def __len__(self):
        return len(self.lines)//self.tau
        

## Tiny Shakespeare

We'll use [Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)'s Tiny Shakespeare dataset, and see if our network is able to learn major patterns over time. 

In [29]:
!head -n 29 data/input.txt

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:


## Let's test our model for larger, more diverse data.

In [30]:
dataset = TextDataset('data/input.txt', 5)
loader = DataLoader(dataset)
hidden_size = 64
n_layers = 3

input_size = len(dataset)
output_size = input_size
embedding_size = 10

# embedding = OneHotEmbedding(len(dataset.vocab))
embedding = nn.Embedding(input_size, embedding_size)
model = SequenceLearner(embedding, embedding_size, hidden_size, n_layers, output_size)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = RNNCrossEntropyLoss()

In [34]:
with open("weights/shakespeare-e10.weights.checkpoint", "rb") as in_file:
    weights = torch.load(in_file)
    model.load_state_dict(weights)

In [32]:
train(model, optimizer, criterion, loader, backpropogate_every(40), epochs=10)

HBox(children=(IntProgress(value=0, description='epoch', max=10, style=ProgressStyle(description_width='initia…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

HBox(children=(IntProgress(value=1, bar_style='info', description='train', max=1, style=ProgressStyle(descript…

KeyboardInterrupt: 

In [33]:
argmaxdecoder = ArgmaxDecoder(dataset.vocab)
t_decoder = TemperatureDecoder(dataset.vocab, 0.5)
pred_str = predict(model, t_decoder, dataset.vocab, 'G', 1000)
print(pred_str)

with open("weights/shakespeare-e10.weights.checkpoint", "wb+") as out_file:    
    torch.save(model.state_dict(), out_file)

Give his hand hath arms me compoungther of his son of the joyn to his blood a blood hath be that serve that shall breased our say the son poince from the son of the strike the brothard the hard of the brothing with his lorder hath the son of his son the grace him the death of the stand the banish the strike,
And our learn the send some friend of heaven of the since the which the shall shall not and not be be his lords of his contres,
And of the hand comforter of the peace of the world of his commind the world;
The hand of their orong should and be friends our hand and this hand,
The hold of his partition on the which loation our bed it is that my lords of more the father cannot some hold his father of his band of nothing do had the power the world the was holish he shall not the camour and with the wise he shall hath be is some fair charge call his grerate and could with his soul but back make of the king to against the canst and have so shall shall be be bear him commind the souls of 

# Exercises

1. Try replacing temperature decoder with argmax decoder, and see what happens.
2. In turtle logo and shakespeare, try using words instead of characters as a unit.
3. I haven't plotted the losses over time, you're welcome to try and see what happens.
4. The embedding layer's output, in Shakespeare case will be a dense representation of the character. Find the embeddings of all characters, and find their neighbourhood. You should see inferences like vowels occuring together.
5. Adapt the model to use an LSTM or GRU instead of vanilla RNN. Are there improvements?

## What you can do further.




### 1. Language model + Language model 

Learn two language models on two languages, **preferably using words as units.** 

Use the final hidden context from the first language model to start the time unrolling on the second language model.


```python

h = model.h0(B)                         # Initialized with 0, usually.
h = first_language_model(src_sequence)  # dense representation of phrase is source language.
idx = indexing.direct("<sos>")          # Priming with <sos> token
pred = embedding(idx)        
for t in range(1, T):
    tgt = x[t:t+1, :]
    pred, h = model(pred, h)
    loss += criterion(pred, tgt)
```


### 2. CNN feature extractor + Language model

Use an output from something like ResNet or VGG-16, use it to initialize the hidden state of the RNN. 

Then time unroll to learn to predict captions.

```python
h = model.h0(B)                         # Initialized with 0, usually.
h = vgg16(img)                          # Onward to image captioning.
idx = indexing.direct("<sos>")          # Priming with <sos> token
pred = embedding(idx)        
for t in range(1, T):
    tgt = x[t:t+1, :]
    pred, h = model(pred, h)
    loss += criterion(pred, tgt)
    ...
```

## Thank you!