## Basic idea
In an n-gram model, the conditional probability of word $x_t$ at time step $t$ depends only on the $n-1$ previous words. Instead, a latent variable $h_{t-1}$ can be used to summarize the full history:

$P(x_t | x_{t-1}, ..., x_1) \approx P(x_t|h_{t-1})$, where $h_{t-1}$ is the hidden state that stores the sequence information up to time step $t - 1$.

## Problem
Storing all the data observed so far quickly becomes expensive in both computation and storage.

## Formula
For each input $X_t$, let $H_t$ denote the hidden variable at time step $t$. To update the hidden state, the hidden state of the previous time step, $H_{t-1}$, is multiplied by a weight matrix $W_{hh}$. As in an MLP, a weight matrix $W_{xh}$ is applied to the input $X_t$.

$H_t = f(X_tW_{xh} + H_{t-1}W_{hh} + b_h)$, where $f$ is the activation function.

## Output Representation
$O_t = H_tW_{hq} + b_q$

So, in a basic RNN, the learnable parameters are:
1. $W_{xh} \in \mathbb{R}^{d \times h}, b_h \in \mathbb{R}^{1 \times h}$
2. $W_{hh} \in \mathbb{R}^{h \times h}$
3. $W_{hq} \in \mathbb{R}^{h \times q}, b_q \in \mathbb{R}^{1 \times q}$
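
As a shape check on the hidden-state and output formulas, the following minimal sketch computes one recurrent step with plain tensors. The sizes `n`, `d`, `h`, `q` and the choice of `tanh` for $f$ are illustrative assumptions, not values taken from this notebook.

```python
import torch

torch.manual_seed(0)
n, d, h, q = 2, 5, 4, 3  # batch size, input dim, hidden dim, output dim (assumed)

# Learnable parameters with the shapes listed above
W_xh, b_h = torch.randn(d, h) * 0.1, torch.zeros(1, h)
W_hh = torch.randn(h, h) * 0.1
W_hq, b_q = torch.randn(h, q) * 0.1, torch.zeros(1, q)

X_t = torch.randn(n, d)     # input at time step t
H_prev = torch.zeros(n, h)  # hidden state from time step t-1

# H_t = f(X_t W_xh + H_{t-1} W_hh + b_h), with f = tanh
H_t = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)
# O_t = H_t W_hq + b_q
O_t = H_t @ W_hq + b_q

print(H_t.shape, O_t.shape)
```

The same parameters are reused at every time step, so the parameter count does not grow with the sequence length.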

## Perplexity
Perplexity is the metric used to measure language model quality. A better language model should predict the next token more accurately, which corresponds to a lower perplexity.

Perplexity = $\exp\left(-\frac{1}{n}\sum_{t=1}^{n}\log P(x_t|x_{t-1},...,x_1)\right)$
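
To make the formula concrete, here is a small sketch that computes perplexity from the probabilities a model assigned to the true tokens. The helper name `perplexity` and the probability lists are hypothetical, chosen only for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the true tokens."""
    n = len(token_probs)
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_likelihood)

# A model that always gives the true token probability 1
perfect = perplexity([1.0, 1.0, 1.0])
# A model that is uniform over a vocabulary of 4 tokens
uniform = perplexity([0.25, 0.25, 0.25])

print(perfect, uniform)
```

In the best case perplexity is 1, and for a uniform prediction it equals the vocabulary size (up to floating-point rounding), so it can be read as the effective number of choices the model is hesitating between.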

## Bidirectional Model
![image.png](attachment:image.png)
Similar to a hidden Markov model (which is solved with dynamic programming), treat the forward and backward recursions as two learnable functions. This transition also epitomizes a principle guiding the design of many modern deep networks: first start from a classical statistical model, then parameterize it in a generic form.

## Definition
For any time step $t$, given a minibatch input $X_t \in \mathbb{R}^{n \times d}$, let the hidden layer activation function be $f$. In the bidirectional architecture, the forward and backward hidden states for this time step are:
1. $H_t^f = f(X_tW_{xh}^f + H_{t-1}^fW_{hh}^f + b_h^f)$
2. $H_t^b = f(X_tW_{xh}^b + H_{t+1}^bW_{hh}^b + b_h^b)$

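Then concatenate the forward and backward hidden states to obtain a new hidden state $H_t \in \mathbb{R}^{n \times 2h}$, which is fed into the output layer. The output weight therefore has shape $W_{hq} \in \mathbb{R}^{2h \times q}$. The formula is shown below: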

$O_t = H_tW_{hq} + b_q$
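
The concatenation step can be sketched with plain tensors. This is a minimal illustration, assuming `tanh` as $f$, zero initial states, and a backward recursion that runs from the last time step to the first (so each backward state is computed from the state at $t+1$). All sizes and the helper `make_params` are hypothetical.

```python
import torch

torch.manual_seed(0)
n, d, h, q, T = 2, 5, 4, 3, 6  # assumed sizes: batch, input, hidden, output, steps

def make_params():
    # One set of (W_xh, W_hh, b_h) per direction
    return torch.randn(d, h) * 0.1, torch.randn(h, h) * 0.1, torch.zeros(1, h)

W_xh_f, W_hh_f, b_h_f = make_params()  # forward direction
W_xh_b, W_hh_b, b_h_b = make_params()  # backward direction
W_hq, b_q = torch.randn(2 * h, q) * 0.1, torch.zeros(1, q)

X = torch.randn(T, n, d)  # one (n, d) minibatch per time step

# Forward recursion: each state depends on the state at t-1
H_f, state = [], torch.zeros(n, h)
for t in range(T):
    state = torch.tanh(X[t] @ W_xh_f + state @ W_hh_f + b_h_f)
    H_f.append(state)

# Backward recursion: each state depends on the state at t+1
H_b, state = [None] * T, torch.zeros(n, h)
for t in reversed(range(T)):
    state = torch.tanh(X[t] @ W_xh_b + state @ W_hh_b + b_h_b)
    H_b[t] = state

# Concatenate both directions, then apply the output layer at each step
H = [torch.cat((H_f[t], H_b[t]), dim=1) for t in range(T)]
O = [H_t @ W_hq + b_q for H_t in H]

print(H[0].shape, O[0].shape)
```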

## Architecture

In [5]:
import torch 
from torch import nn
import pandas as pd
import numpy as np
import math
from torch.cuda import amp
from transformers import get_cosine_schedule_with_warmup
from collections import Counter
import collections

After truncation, padding, and embedding, each input batch has shape (bs, max_len, num_hiddens), where the last dimension is the embedding size.

Note that predicting each token is treated as a multi-class classification problem over the vocabulary.
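
The shape claim above can be checked with a small sketch. The toy values (`vocab_size`, `max_len`, `num_hiddens`, `pad_id`) and the two token-id sequences are assumptions made up for this example.

```python
import torch
from torch import nn

vocab_size, max_len, num_hiddens, pad_id = 10, 4, 8, 0  # assumed toy values

# Two token-id sequences of different lengths (hypothetical data)
sequences = [[5, 3], [7, 2, 9, 4, 6]]

# Truncate to max_len, then right-pad with pad_id
batch = []
for seq in sequences:
    seq = seq[:max_len]
    batch.append(seq + [pad_id] * (max_len - len(seq)))
tokens = torch.tensor(batch)  # shape (bs, max_len)

embedding = nn.Embedding(vocab_size, num_hiddens)
embedded = embedding(tokens)  # shape (bs, max_len, num_hiddens)

print(tokens.tolist())
print(embedded.shape)
```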

In [6]:
class RNN_block(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, bias=True):
        super(RNN_block, self).__init__()
        self.W_xh = nn.Linear(embed_size, num_hiddens)
        self.W_hh = nn.Linear(num_hiddens, num_hiddens, bias=bias)
        self.W_hq = nn.Linear(num_hiddens, vocab_size, bias=bias)
        # map token ids to embed_size-dimensional vectors, matching the input size of W_xh
        self.embedding_matrix = nn.Embedding(vocab_size, embed_size)

    def init_state(self, bs, num_hiddens, device):
        return torch.zeros((bs, num_hiddens), device=device)

    def forward(self, input_, previous_state):
        # shape of input_: (bs, time_step); after embedding: (bs, time_step, embed_size)
        input_ = self.embedding_matrix(input_)
        outputs = []
        state_ls = []
        # loop over the time-step dimension
        for i in input_.permute(1, 0, 2):
            previous_state = torch.tanh(self.W_xh(i) + self.W_hh(previous_state))
            output = self.W_hq(previous_state)
            outputs.append(output)
            state_ls.append(previous_state)
        # return the per-step outputs and the per-step hidden states
        return outputs, state_ls

In [7]:
# test layer
# input shape: (bs, num_step)
x = torch.ones((2,6), dtype = torch.long)
# arguments: vocab_size=200, embed_size=24, num_hiddens=24
rnn = RNN_block(200, 24, 24)
previous_state = rnn.init_state(2, 24, torch.device("cpu"))
rnn.eval()
# rnn.to(torch.device("cuda"))
rnn(x, previous_state)[0][0].shape

torch.Size([2, 200])

## Gradient Clipping
For a sequence of length $T$, the gradient may either explode or vanish when $T$ is large. Consider an update in vector form: given a gradient $g$ and learning rate $\alpha$, we update $x$ as $x - \alpha g$. If the objective $f$ is Lipschitz continuous with constant $L$, then for any $x, y$ we have $|f(x) - f(y)| \le L\|x - y\|$. This means a single update changes the objective by at most $L\alpha\|g\|$. A small bound limits the speed of gradient descent, but it also limits how much damage a step in the wrong direction can do.

## Method
Clip the gradient $g$ by projecting it back onto a ball of a given radius $\theta$, using the following formula:

$g \leftarrow \min\left(1, \frac{\theta}{\|g\|}\right)g$

This means the gradient norm never exceeds $\theta$.
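
As a worked example of the clipping formula, the sketch below applies it to a single two-dimensional vector in plain Python. The helper name `clip` and the numbers are hypothetical; a real model would apply the same scale factor to the gradients of all parameters.

```python
import math

def clip(g, theta):
    """Rescale vector g so that its L2 norm is at most theta."""
    norm = math.sqrt(sum(x * x for x in g))
    # min(1, theta / norm); guard against a zero gradient
    scale = min(1.0, theta / norm) if norm > 0 else 1.0
    return [scale * x for x in g]

g = [3.0, 4.0]                   # norm is 5
clipped = clip(g, theta=1.0)     # scaled by 1/5, direction unchanged
untouched = clip(g, theta=10.0)  # norm already below theta, so g is unchanged

print(clipped, untouched)
```

Because every component is multiplied by the same factor, clipping changes only the length of the gradient, not its direction.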

In [8]:
def clip_grad(net, theta):
    # skip parameters that have not received a gradient yet
    params = [p for p in net.parameters() if p.requires_grad and p.grad is not None]
    # global L2 norm over all gradients
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm

## Use on English-French translation dataset (apply seq2seq model)

In [9]:
# basic processing code
class Build_vocabulary(object):
    '''
    Build a vocabulary that maps tokens to indices and back
    '''
    def __init__(self, tokens = None, min_freq = 0, special_tokens = None):
        if tokens is None:
            tokens = []
        if special_tokens is None:
            special_tokens = []
        tokens = [token for line in tokens for token in line]
        counter = Counter(tokens)
        # sort by frequency
        self.freq = sorted(counter.items(), key = lambda x: x[1], reverse=True)
        # set special token
        self.idx_to_token = ["<unk>"] + special_tokens
        self.token_to_id = {token: ids for ids, token in enumerate(self.idx_to_token)}
        for token, freq in self.freq:
            if freq < min_freq:
                break
            if token not in self.token_to_id:
                self.idx_to_token.append(token)
                self.token_to_id[token] = len(self.idx_to_token) - 1
                
    # container protocol methods
    def __len__(self):
        return len(self.idx_to_token)
    
    def __getitem__(self, token):
        '''
        Return the index (or list of indices) for the given token(s); unknown tokens map to 0 (<unk>)
        '''
        if not isinstance(token, (list, tuple)):
            return self.token_to_id.get(token, 0)
        return [self.__getitem__(token_) for token_ in token]
    
    def indices_to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

def read_dataset():
    with open("./fra-eng/fra.txt", "r", encoding="utf-8") as f:
        text = f.read()
        text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != " "
    out = []
    for i, char in enumerate(text):
        if i>0 and no_space(char, text[i-1]):
            out.append(' '+char)
        else:
            out.append(char)
    text = "".join(out)
    # Tokenization
    english_word, french_word = [], []
    for i, sentence in enumerate(text.split("\n")):
        # split by \t
        result = sentence.split("\t")
        if len(result) == 2:
            english_word.append(result[0].split(" "))
            french_word.append(result[1].split(" "))
    return english_word, french_word

def padding_truncation(tokens, max_lens, padding_token):
    if len(tokens) > max_lens:
        return tokens[:max_lens]
    return tokens + [padding_token]*(max_lens - len(tokens))

def build_array(tokens, dic, max_length):
    '''
    Map each token list to indices, append <eos>, pad or truncate to max_length, and return the tensor with the valid length of each row
    '''
    tokens_mapping = [dic[token] for token in tokens]
    tokens_mapping = [token + [dic["<eos>"]] for token in tokens_mapping]
    # add padding, truncation
    tensor = torch.tensor([padding_truncation(token, max_length, dic["<pad>"]) for token in tokens_mapping])
    valid_len = (tensor != dic["<pad>"]).type(torch.int32).sum(1)
    return tensor, valid_len

def processing_french_english_dataset(batch_size, max_length):
    english_word, french_word = read_dataset()
    english_mapping = Build_vocabulary(english_word, min_freq=2, 
                                   special_tokens=["<pad>", "<bos>", "<eos>"])
    french_mapping = Build_vocabulary(french_word, min_freq=2, 
                                   special_tokens=["<pad>", "<bos>", "<eos>"])
    english_array, english_valid_len = build_array(english_word, english_mapping, max_length)
    french_array, french_valid_len = build_array(french_word, french_mapping, max_length)
    dataset = torch.utils.data.TensorDataset(*(english_array, english_valid_len, french_array, french_valid_len))
    data_iter = torch.utils.data.DataLoader(dataset,batch_size = batch_size, shuffle = True)
    return data_iter, english_mapping, french_mapping

class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """Softmax cross-entropy loss with masking"""
    # shape of pred: (batch_size, num_steps, vocab_size)
    # shape of label: (batch_size, num_steps)
    # shape of valid_len: (batch_size,)
    def forward(self, pred, label, valid_len):
        #print(pred.shape, label.shape, valid_len.shape)
        weights = torch.ones_like(label)
        max_len = weights.size(1)
        mask = torch.arange((max_len), dtype = torch.float32,
                           device = weights.device)[None,:] < valid_len[:,None]
        weights[~mask] = 0
        self.reduction='none'
        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
            pred.permute(0, 2, 1), label)
        weighted_loss = (unweighted_loss * weights).mean(dim=1)
        return weighted_loss
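
The masking step in the loss can be looked at in isolation. The sketch below rebuilds only the mask and the per-sequence average, using a tensor of ones as a stand-in for the unreduced cross-entropy. The valid lengths and `num_steps` are made-up values.

```python
import torch

num_steps = 5
valid_len = torch.tensor([2, 5, 3])  # hypothetical valid lengths for a batch of 3

# Position t is valid for row i when t < valid_len[i]
positions = torch.arange(num_steps)[None, :]  # shape (1, num_steps)
mask = positions < valid_len[:, None]         # shape (bs, num_steps), bool

per_token_loss = torch.ones(3, num_steps)     # stand-in for the unreduced loss
masked = per_token_loss * mask                # padded positions contribute 0
per_sequence = masked.mean(dim=1)             # averaged over num_steps

print(mask.int().tolist())
print(per_sequence.tolist())
```

Note that the mean is taken over all `num_steps` positions, padded ones included, so shorter sequences get a proportionally smaller loss. Dividing by `valid_len` instead would give a per-token average. Which one is preferable depends on the training setup.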

In [None]:
# set parameters
num_hiddens,bs, num_step= 64, 128,10
lr, num_epochs, device = 2e-4, 200, torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")