# Programming for Data Science and Artificial Intelligence

## Deep Learning - NLP + TorchText + Embedding + Transformer

Here we look at the Transformer architecture, introduced in one of the most influential papers of recent years (https://arxiv.org/abs/1706.03762).

We could, of course, use the Transformer for our sentiment analysis problem (i.e., use only the encoder part of the Transformer).  But to showcase the whole architecture, we shall explore language translation, which requires the full encoder-decoder architecture.

<img src = "../figures/transformer.png" width="700">

First off, to understand the Transformer, here is the list of concepts you need to know:

1. Encoder-Decoder
2. Embedding
3. Positional Encodings
4. Creating Masks
5. Multi-Head Attention
6. Feed-Forward Layer

In [1]:
import torch
from torch import nn
import torch.optim as optim
import numpy as np
import sys
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


### 1. Encoder-Decoder architecture

The encoder-decoder architecture is a very simple idea: the input is first fed into the encoder, which produces a latent vector (e.g., in the case of an RNN, the final hidden state).  This latent vector is then fed into the decoder, which decodes it into the required output.

Relating to our previous sentiment analysis problem (many to one), our decoder can be seen as a simple linear layer outputting class probabilities.  As for language translation, since the output is sequential, we require a decoder that outputs sequential data (many to many).

In RNN, an encoder-decoder architecture for language translation looks something like this:

In [2]:
class EncoderDecoder(nn.Module): #a.k.a seq2seq
    def __init__(self):
        super(EncoderDecoder, self).__init__()
        #input_dim IS the same as vocab size, so please don't get confused here
        self.encoder = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.RNN(input_dim, hidden_dim, batch_first=True)
        #if your output is also text, then output_dim IS the same as input_dim 
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, y):
        # x : [batch-size, seq len, embed size or vocab size]
        h0 = torch.zeros(num_layers, batch_size, hidden_dim).to(device)
        _, encoder_hn = self.encoder(x, h0)
        # encoder_hn : [num_layers(=1) * num_directions(=1), batch_size, hidden_dim]

        outputs, _ = self.decoder(y, encoder_hn)
        # outputs : [batch_size, seq len, num_directions(=1) * hidden_dim]

        model = self.fc(outputs) # model : [batch_size, seq len, output_dim]
        return model
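
The class above relies on globals such as `input_dim` and `hidden_dim` that are not defined in this section. As a minimal sketch, assuming small made-up sizes (every number below is an illustrative assumption, and random tensors stand in for real embedded sentences), the shapes can be checked like this:

```python
import torch
from torch import nn

# illustrative sizes only (assumptions, not values from the lecture)
batch_size, src_len, tgt_len = 2, 7, 5
input_dim, hidden_dim, output_dim = 10, 16, 10

encoder = nn.RNN(input_dim, hidden_dim, batch_first=True)
decoder = nn.RNN(input_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, output_dim)

x = torch.randn(batch_size, src_len, input_dim)  # source sequence
y = torch.randn(batch_size, tgt_len, input_dim)  # target sequence fed to the decoder

_, h_n = encoder(x)            # h_n: [1, batch_size, hidden_dim], the latent vector
outputs, _ = decoder(y, h_n)   # decoder starts from the encoder's final hidden state
logits = fc(outputs)           # [batch_size, tgt_len, output_dim]

print(h_n.shape)     # torch.Size([1, 2, 16])
print(logits.shape)  # torch.Size([2, 5, 10])
```

Note that the source and target lengths may differ; only the hidden state connects the two networks.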

Notice that if we are not doing language translation, our decoder is simply the linear layer.

In [3]:
class Encoder(nn.Module): #a.k.a normal RNN
    def __init__(self):
        super(Encoder, self).__init__()
        self.encoder = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, num_classes)  #<---for those who like to generalize, you can also imagine this as the decoder

    def forward(self, x):  # no target sequence is needed for classification
        # x : [batch-size, seq len, embed size or vocab size]
        h0 = torch.zeros(num_layers, batch_size, hidden_dim).to(device)
        _, encoder_hn = self.encoder(x, h0)
        # encoder_hn : [num_layers(=1) * num_directions(=1), batch_size, hidden_dim]

        encoder_hn = encoder_hn.squeeze(0)
        # encoder_hn : [batch_size, hidden_dim]

        model = self.decoder(encoder_hn) # model : [batch_size, num_classes]
        return model

The Transformer has a similar encoder-decoder structure, but a more sophisticated one.

### 2. Embedding

You probably already know this from our previous classes, but so that this lecture requires no prerequisite knowledge: embedding is simply the idea of having a dedicated vector for each tokenized unit.  These vectors, when trained properly, contain useful semantic meaning.  To train such embeddings, we can simply use <code>nn.Embedding</code>, where the embedding vectors are learnt as parameters using gradient descent.

In [4]:
class Embedder(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
    def forward(self, x):
        return self.embed(x)
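
As a quick illustration of the lookup, here is a minimal sketch that calls `nn.Embedding` directly; the vocabulary size, embedding size, and token ids are made-up assumptions:

```python
import torch
from torch import nn

vocab_size, embed_size = 20, 8   # illustrative sizes (assumptions)
embed = nn.Embedding(vocab_size, embed_size)

token_ids = torch.tensor([[1, 6, 1, 0, 0],
                          [2, 5, 0, 0, 0]])  # [batch_size=2, seq_len=5]
vectors = embed(token_ids)                    # [batch_size, seq_len, embed_size]
print(vectors.shape)                          # torch.Size([2, 5, 8])

# the same token id always maps to the same vector (id 1 appears at positions 0 and 2)
same = torch.equal(vectors[0, 0], vectors[0, 2])
print(same)                                   # True
```

The vectors start out random; they only acquire meaning once gradient descent updates `embed.weight` during training.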

### 3. Positional Encodings

Since the Transformer does not use recurrence, it has no built-in notion of word order.  Hence, we add **positional encodings** to the input embeddings.  The positional encodings must have the same dimension <code>embed_size</code> as the embeddings, so that the two can be summed.  In the original paper, the embed size is 512.

<img src = "../figures/posenc.png" width="300">

#### Many choices of positional encodings

- **Use [0, 1]**: We can encode each position in the range of 0 to 1.  For example, given three words, pos1 has a value of 0, pos2 has a value of 0.5, and pos3 has a value of 1.  However, this method yields inconsistent meanings for sentences of different lengths.  That is, the step between adjacent positions depends on the sentence length, which can confuse the model.

- **Use numberings**:  We can encode each position simply by numbering 1, 2, 3 and so on.  The problem with this approach is that the model cannot generalize when some testing sentences are longer than the training sentences.

- **Use sine/cosine wave**:  In the Transformer, the authors use sine and cosine functions of different frequencies, where <code>pos</code> refers to the position and $i$ indexes the sine/cosine pair, so dimensions $2i$ and $2i+1$ of the encoding vector share one frequency.  Note that the values range over [-1, 1].

$$ \text{PE}_{(\text{pos},2i)} = \sin\left(\frac{\text{pos}} {10000^{\frac{2i}{\text{embed_size}}}}\right) $$
$$ \text{PE}_{(\text{pos},2i+1)} = \cos\left(\frac{\text{pos}} {10000^{\frac{2i}{\text{embed_size}}}}\right) $$

This is what the positional encoding looks like:

<img src = "../figures/pos2.png" width="300">

#### Intuition behind it

You may wonder how this combination of sines and cosines could represent a position.  It is actually quite simple.  Suppose you write numbers in binary format; they look like this:

<img src = "../figures/pos3.png" width="200">

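You can spot the rate of change in different bits: the lowest bit alternates on every number, the next one on every second number, and so on.  But using binary values would be a waste of space in the world of floats.  So instead, we can use sine/cosine functions as the float counterpart of alternating bits.  By decreasing their frequencies, we can go from red bits to orange bits.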

In [5]:
def get_positional_encoding(seq_len, embed_size):
    
    def pe(pos, i):
        return pos / np.power(10000, 2 * (i // 2) / embed_size)
    
    def get_pe_vec(pos):
        return [pe(pos, i) for i in range(embed_size)]

    pe_vec = np.array([get_pe_vec(pos) for pos in range(seq_len)])
    #pe_vec: [seq_len, embed_size]
        
    pe_vec[:, 0::2] = np.sin(pe_vec[:, 0::2])  # dim 2i taking step=2 from 0
    pe_vec[:, 1::2] = np.cos(pe_vec[:, 1::2])  # dim 2i+1 taking step=2 from 1
    return torch.FloatTensor(pe_vec)

In [6]:
pe_vec = get_positional_encoding(5, 10)
pe_vec

tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
          1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  1.5783e-01,  9.8747e-01,  2.5116e-02,
          9.9968e-01,  3.9811e-03,  9.9999e-01,  6.3096e-04,  1.0000e+00],
        [ 9.0930e-01, -4.1615e-01,  3.1170e-01,  9.5018e-01,  5.0217e-02,
          9.9874e-01,  7.9621e-03,  9.9997e-01,  1.2619e-03,  1.0000e+00],
        [ 1.4112e-01, -9.8999e-01,  4.5775e-01,  8.8908e-01,  7.5285e-02,
          9.9716e-01,  1.1943e-02,  9.9993e-01,  1.8929e-03,  1.0000e+00],
        [-7.5680e-01, -6.5364e-01,  5.9234e-01,  8.0569e-01,  1.0031e-01,
          9.9496e-01,  1.5924e-02,  9.9987e-01,  2.5238e-03,  1.0000e+00]])

Positions that are close together have similar encodings: the cosine similarity between positions 0 and 1 is high, while positions 0 and 4 are farther apart, so their cosine similarity is lower.

In [7]:
from scipy import spatial
similarity_between_0_1 = 1 - spatial.distance.cosine(pe_vec[0], pe_vec[1])
print("Similarity between pos 0 and 1: ", similarity_between_0_1)

Similarity between pos 0 and 1:  0.9054891467094421


In [8]:
similarity_between_0_4 = 1 - spatial.distance.cosine(pe_vec[0], pe_vec[4])
print("Similarity between pos 0 and 4: ", similarity_between_0_4)

Similarity between pos 0 and 4:  0.629374623298645
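
The same sinusoidal formula can also be written without Python loops. The following is a sketch of a vectorized variant in PyTorch; the name `sinusoidal_pe` is hypothetical, and the function assumes an even `embed_size`. It also compares position 0 against every position at once:

```python
import torch
import torch.nn.functional as F

def sinusoidal_pe(seq_len, embed_size):
    # hypothetical helper; assumes embed_size is even
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # [seq_len, 1]
    two_i = torch.arange(0, embed_size, 2, dtype=torch.float32)         # the values 2i: [embed_size / 2]
    angle = pos / torch.pow(torch.tensor(10000.0), two_i / embed_size)   # [seq_len, embed_size / 2]
    pe = torch.zeros(seq_len, embed_size)
    pe[:, 0::2] = torch.sin(angle)  # even dimensions 2i
    pe[:, 1::2] = torch.cos(angle)  # odd dimensions 2i+1
    return pe

pe = sinusoidal_pe(5, 10)

# cosine similarity between position 0 and each of the 5 positions
sims = F.cosine_similarity(pe[0].expand_as(pe), pe, dim=1)
print(sims)
```

With `seq_len=5` and `embed_size=10` the entries for positions 1 and 4 should agree with the two values printed above (about 0.905 and 0.629).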


### 4. Creating Masks

Masking plays an important role in the transformer. It serves two purposes:
- In the encoder and decoder: To zero attention outputs wherever there is just padding in the input sentences.
- In the decoder: To prevent the decoder from 'peeking' ahead at the rest of the translated sentence when predicting the next word.

Creating the mask for the input is simple:

In [9]:
batch_size = 2
seq_len = 5
x = torch.tensor([[1, 6, 1, 0, 0], [23, 5, 0, 0, 0]])
x

tensor([[ 1,  6,  1,  0,  0],
        [23,  5,  0,  0,  0]])

In [10]:
pad_mask = x.data.eq(0)
pad_mask

tensor([[False, False, False,  True,  True],
        [False, False,  True,  True,  True]])

In [11]:
x_masked = x.masked_fill(pad_mask, -1e9)
x_masked

tensor([[          1,           6,           1, -1000000000, -1000000000],
        [         23,           5, -1000000000, -1000000000, -1000000000]])

The mask is applied to the attention scores, which have shape <code>seq_len x seq_len</code> for each sentence (one row per query position).  We therefore repeat the padding mask along a new dimension like this:

In [12]:
pad_repeating_mask = x.data.eq(0).unsqueeze(1)
pad_repeating_mask = pad_repeating_mask.expand(batch_size, seq_len, seq_len)
pad_repeating_mask

tensor([[[False, False, False,  True,  True],
         [False, False, False,  True,  True],
         [False, False, False,  True,  True],
         [False, False, False,  True,  True],
         [False, False, False,  True,  True]],

        [[False, False,  True,  True,  True],
         [False, False,  True,  True,  True],
         [False, False,  True,  True,  True],
         [False, False,  True,  True,  True],
         [False, False,  True,  True,  True]]])

On the other hand, for the decoder input, which is the target text, we shall apply the same padding mask.  But in addition, recall that the decoder predicts each output word by making use of all encoder outputs and the target sentence only up to the word it is currently predicting.

Therefore we need to prevent the earlier output predictions from seeing later words in the sentence.  For this, we apply an additional mask to the target sentences.

In [13]:
batch_size = 1
seq_len = 5

shape = [batch_size, seq_len, seq_len]
block_future_mask = np.triu(np.ones(shape), k=1)
block_future_mask = torch.from_numpy(block_future_mask).byte()   #1 above the diagonal marks a future position
block_future_mask = block_future_mask.data.eq(0) #reverse for display: True = allowed to attend (the model code further below keeps 1 = blocked)

In [14]:
block_future_mask

tensor([[[ True, False, False, False, False],
         [ True,  True, False, False, False],
         [ True,  True,  True, False, False],
         [ True,  True,  True,  True, False],
         [ True,  True,  True,  True,  True]]])
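
To see how the two masks work together, here is a minimal sketch that combines a padding mask with a block-future mask for one made-up target sentence. It uses the convention of the model code later in this lecture, where True means "blocked" (the inverse of the display above); the token ids are assumptions for illustration:

```python
import torch

tgt = torch.tensor([[5, 7, 9, 0, 0]])  # made-up target ids, 0 = PAD
seq_len = tgt.size(1)

# True = blocked.  Padding is masked along the key axis (last dimension).
pad_mask = tgt.eq(0).unsqueeze(1).expand(-1, seq_len, -1)                      # [1, 5, 5]

# True above the diagonal = future positions
future_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool().unsqueeze(0)  # [1, 5, 5]

# a position is blocked if it is padding OR lies in the future
combined = pad_mask | future_mask
print(combined[0].int())
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 1, 1]], dtype=torch.int32)
```

Row `t` lists which positions the prediction at step `t` may not look at: everything after `t`, plus the padded positions 3 and 4.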

### 5. Multihead Attention

Recall the multihead attention in the figure:

<img src = "../figures/transformer.png" width="700">

Q, K and V stand for 'query', 'key' and 'value'. These are terms used in attention functions, but in self-attention Q, K and V simply start as identical copies of the embedding vector (plus positional encoding), each passed through its own linear layer. They have the dimensions <code>batch_size * seq_len * embed_size</code>.

In multi-head attention we split the embedding vector into <code>n_heads</code> heads, so they then have the dimensions <code>batch_size * n_heads * seq_len * (embed_size / n_heads)</code>.  We refer to this final dimension <code>(embed_size / n_heads)</code> as <code>d_k</code>.
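
The split into heads is only a reshape followed by a transpose. A minimal sketch with made-up sizes (all numbers are illustrative assumptions) shows the shapes and that merging the heads back recovers the original tensor:

```python
import torch

batch_size, seq_len, embed_size, n_heads = 2, 5, 8, 2  # illustrative sizes
d_k = embed_size // n_heads                            # 4

x = torch.randn(batch_size, seq_len, embed_size)

# split: [batch, seq_len, embed] -> [batch, seq_len, heads, d_k] -> [batch, heads, seq_len, d_k]
heads = x.view(batch_size, seq_len, n_heads, d_k).transpose(1, 2)
print(heads.shape)             # torch.Size([2, 2, 5, 4])

# merge back: undo the transpose, then flatten the last two dimensions
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, n_heads * d_k)
print(torch.equal(merged, x))  # True
```

Each head therefore sees a different `d_k`-sized slice of every token's vector, and no values are lost or duplicated in the process.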


Let’s see the code:

In [15]:
class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(embed_size, embed_size)
        self.W_K = nn.Linear(embed_size, embed_size)
        self.W_V = nn.Linear(embed_size, embed_size)
        self.linear = nn.Linear(embed_size, embed_size)
        self.layer_norm = nn.LayerNorm(embed_size)

    def forward(self, Q, K, V, attn_mask):
        # Q: [batch_size x seq_len x embed_size], K: [batch_size x seq_len x embed_size], V: [batch_size x seq_len x embed_size]
        residual, batch_size = Q, Q.size(0)
        
        # perform linear operation and split into h heads (i.e., from embed_size to n_heads * d_k)
        q = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # q: [batch_size x n_heads x seq_len x d_k]
        k = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # k: [batch_size x n_heads x seq_len x d_k]
        v = self.W_V(V).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # v: [batch_size x n_heads x seq_len x d_k]
        
        # repeat the attention mask n_heads time
        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1) # attn_mask : [batch_size x n_heads x seq_len x seq_len]

        # calculate attention
        # context: [batch_size x n_heads x seq_len x d_k], attn: [batch_size x n_heads x seq_len x seq_len]
        context, attn = ScaledDotProductAttention()(q, k, v, attn_mask)
        
        # concatenate heads and put through final linear layer
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_k) # context: [batch_size x seq_len x n_heads * d_k]
        output = self.linear(context)
        
        return self.layer_norm(output + residual), attn # output: [batch_size x seq_len x embed_size]

In [16]:
#Contiguity refers to whether the tensor's data is stored in consecutive (a.k.a. contiguous) blocks of memory.
#Most tensors are contiguous; however, after you transpose, for example, PyTorch does not actually
#rearrange the memory but only modifies the metadata (strides), so the tensor is no longer contiguous.

#Some functions such as 'view()' require a contiguous tensor, thus before calling view() after
#transposing, we call 'contiguous()' to tell PyTorch to rearrange the data in memory

def check(tensor):
    print(tensor.is_contiguous())
    
t = torch.randn(10, 10)    
check(t) #true
t = t.transpose(0, 1) 
check(t) #false

True
False


In [17]:
#reshape() vs. view()
#Two main differences:  view() expects a compatible (e.g., contiguous) memory layout while reshape() does not
#                       view() always shares memory with the original tensor, while reshape() shares memory
#                       when it can and silently copies when it cannot

#Thus:
#If you just want to reshape, use torch.reshape. 
#If you're also concerned about memory usage and want to ensure that the two tensors share the same data, use Tensor.view().

z = torch.zeros(3, 2)
view = z.view(6)
reshape = z.reshape(6)
z.fill_(1)

print("Using view: ", view)  #never a copy: always shares memory with z
print("Using reshape: ", reshape)  #may or may not be a copy; here z is contiguous, so it also shares memory

Using view:  tensor([1., 1., 1., 1., 1., 1.])
Using reshape:  tensor([1., 1., 1., 1., 1., 1.])
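
In the cell above both results follow `z` because `z` is contiguous. The difference only shows up on a non-contiguous tensor. A minimal sketch, using a small transposed tensor as an assumed example:

```python
import torch

base = torch.arange(6).reshape(2, 3)
t = base.t()                         # transpose: shape [3, 2], non-contiguous
print(t.is_contiguous())             # False

try:
    t.view(6)                        # view() cannot express this layout without copying
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)                   # True

r = t.reshape(6)                     # reshape() falls back to making a copy here
base[0, 0] = 100                     # modify the shared storage
print(t[0, 0].item(), r[0].item())   # 100 0
```

The transposed tensor `t` sees the update because it shares storage with `base`, while `r` does not because `reshape()` had to copy.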


Scaled dot-product attention is simply this equation:

$$                                                                         
   \mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V               
$$                                                                                                               

In [18]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask):
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k) # scores : [batch_size x n_heads x seq_len x seq_len]
        scores.masked_fill_(attn_mask, -1e9) # in place: set scores to -1e9 wherever attn_mask is True (blocked), so softmax gives them ~0 weight
        attn = nn.Softmax(dim=-1)(scores)
        #attn: [batch_size x n_heads x seq_len x seq_len]
        #V:  [batch_size x n_heads x seq_len x d_k]
        #context = attn @ V = [batch_size x n_heads x seq_len x d_k]
        context = torch.matmul(attn, V)
        return context, attn
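
To make the effect of the mask concrete, here is a small functional sketch of the same equation. The function name and the toy tensors are assumptions for illustration; unlike the class above, it reads `d_k` from the input instead of a global:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # hypothetical functional variant of the module above
    d_k = q.size(-1)
    scores = q @ k.transpose(-1, -2) / math.sqrt(d_k)  # [..., len_q, len_k]
    if mask is not None:
        scores = scores.masked_fill(mask, -1e9)        # True = blocked
    attn = torch.softmax(scores, dim=-1)               # each row sums to 1
    return attn @ v, attn

torch.manual_seed(0)
q = k = v = torch.randn(1, 3, 4)                       # [batch=1, seq_len=3, d_k=4]
mask = torch.tensor([[[False, False, True]]]).expand(1, 3, 3)  # block the last key (e.g., padding)

context, attn = scaled_dot_product_attention(q, k, v, mask)
print(attn.sum(dim=-1))   # every row sums to 1
print(attn[..., 2])       # attention paid to the blocked key is (numerically) 0
```

Because the blocked scores are set to a very large negative number before the softmax, their weights vanish and the remaining weights are renormalized over the visible keys.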

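### 6. Feed-Forward Layer

This part consists of just two linear transformations with a ReLU in between (the code below omits dropout).  Note that a <code>nn.Conv1d</code> with <code>kernel_size=1</code> applies the same linear map at every position, so <code>nn.Linear</code> can be used in its place.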

In [19]:
class FeedForwardNet(nn.Module):
    def __init__(self):
        super(FeedForwardNet, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=embed_size, out_channels=d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(in_channels=d_ff, out_channels=embed_size, kernel_size=1)
        self.layer_norm = nn.LayerNorm(embed_size)

    def forward(self, inputs):
        residual = inputs # inputs : [batch_size, seq_len, embed_size]
        output = nn.ReLU()(self.conv1(inputs.transpose(1, 2)))
        output = self.conv2(output).transpose(1, 2)
        return self.layer_norm(output + residual)
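
The claim that `nn.Linear` can replace a `nn.Conv1d` with `kernel_size=1` can be checked numerically. The sketch below, with made-up sizes, copies the convolution's weights into a linear layer and compares the outputs:

```python
import torch
from torch import nn

torch.manual_seed(0)
embed_size, d_ff = 8, 16   # illustrative sizes (assumptions)

conv = nn.Conv1d(in_channels=embed_size, out_channels=d_ff, kernel_size=1)
linear = nn.Linear(embed_size, d_ff)

# give both layers identical parameters: conv.weight is [d_ff, embed_size, 1]
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, 5, embed_size)                    # [batch_size, seq_len, embed_size]
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects [batch, channels, seq_len]
out_linear = linear(x)                               # Linear acts on the last dimension

matches = torch.allclose(out_conv, out_linear, atol=1e-5)
print(matches)                                       # True
```

The linear version avoids the two transposes, which is one reason it is often the more readable choice.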

### Putting everything together

If you understand every component above, the rest is easy.

We will now just bundle everything together in <code>EncoderLayer</code> and <code>DecoderLayer</code>.  Lastly, we stack several of them inside the classes <code>Encoder</code> and <code>Decoder</code>.

In [20]:
#this attention mask is applied after Q @ K^T, thus the shape is [batch_size, len_q, len_k]
def get_pad_mask(input1, input2):  #<---basically Q, K
    batch_size, len_q = input1.size()
    len_k = input2.size(1)
    # eq(0) marks the PAD token; padding is masked on the key side (input2)
    pad_mask = input2.data.eq(0).unsqueeze(1)  # batch_size x 1 x len_k; we unsqueeze so we can expand below
    return pad_mask.expand(batch_size, len_q, len_k)  # batch_size x len_q x len_k

def get_block_future_mask(input1):
    mask_shape = [input1.size(0), input1.size(1), input1.size(1)] #batch_size x seq_len x seq_len
    block_future_mask = np.triu(np.ones(mask_shape), k=1)
    block_future_mask = torch.from_numpy(block_future_mask).byte()  
    return block_future_mask
    #we don't reverse here: 1 marks a future position to block.  In the Decoder this mask is
    #added to the padding mask and thresholded with torch.gt(..., 0), so True means "blocked".

class EncoderLayer(nn.Module):
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.ffn = FeedForwardNet()

    def forward(self, enc_inputs, pad_mask):
        q, k, v = enc_inputs, enc_inputs, enc_inputs
        enc_outputs, attn = self.enc_self_attn(q, k, v, pad_mask)
        enc_outputs = self.ffn(enc_outputs) # enc_outputs: [batch_size x seq_len x embed_size]
        return enc_outputs, attn

class DecoderLayer(nn.Module):
    def __init__(self):
        super(DecoderLayer, self).__init__()
        self.dec_self_attn = MultiHeadAttention()
        self.dec_general_attn = MultiHeadAttention()  #finding attention between input and output
        self.ffn = FeedForwardNet()

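    def forward(self, dec_inputs, enc_outputs, self_attn_mask, enc_dec_attn_mask):
        q, k, v = dec_inputs, dec_inputs, dec_inputs
        # masked self-attention: padding mask combined with the block-future mask
        dec_outputs, dec_self_attn = self.dec_self_attn(q, k, v, self_attn_mask)
        # encoder-decoder attention: queries from the decoder, keys/values from the encoder
        dec_outputs, dec_general_attn = self.dec_general_attn(dec_outputs, enc_outputs, enc_outputs, enc_dec_attn_mask)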
        dec_outputs = self.ffn(dec_outputs)
        return dec_outputs, dec_self_attn, dec_general_attn

class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        self.src_emb = nn.Embedding(src_vocab_size, embed_size)
        self.pos_emb = nn.Embedding.from_pretrained(get_positional_encoding(src_len+1, embed_size),freeze=True)
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])

    def forward(self, enc_inputs): # enc_inputs : [batch_size x source_len]
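        # NOTE: the position ids are hard-coded for a toy sequence of length 5; generalize this for real data
        enc_outputs = self.src_emb(enc_inputs) + self.pos_emb(torch.LongTensor([[1,2,3,4,0]]))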
        enc_self_attn_mask = get_pad_mask(enc_inputs, enc_inputs)
        enc_self_attns = []
        for layer in self.layers:
            enc_outputs, enc_self_attn = layer(enc_outputs, enc_self_attn_mask)
            enc_self_attns.append(enc_self_attn)
        return enc_outputs, enc_self_attns

class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        self.tgt_emb = nn.Embedding(tgt_vocab_size, embed_size)
        self.pos_emb = nn.Embedding.from_pretrained(get_positional_encoding(tgt_len+1, embed_size),freeze=True)
        self.layers = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])

    def forward(self, dec_inputs, enc_inputs, enc_outputs): # dec_inputs : [batch_size x target_len]
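        # NOTE: the position ids are hard-coded for a toy sequence of length 5; generalize this for real data
        dec_outputs = self.tgt_emb(dec_inputs) + self.pos_emb(torch.LongTensor([[5,1,2,3,4]]))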
        dec_self_attn_pad_mask = get_pad_mask(dec_inputs, dec_inputs)
        dec_self_attn_block_future_mask = get_block_future_mask(dec_inputs)
        dec_self_attn_mask = torch.gt((dec_self_attn_pad_mask + dec_self_attn_block_future_mask), 0)
        dec_enc_attn_mask = get_pad_mask(dec_inputs, enc_inputs) #<--****

        dec_self_attns, dec_enc_attns = [], []
        for layer in self.layers:
            dec_outputs, dec_self_attn, dec_enc_attn = layer(dec_outputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask)
            dec_self_attns.append(dec_self_attn)
            dec_enc_attns.append(dec_enc_attn)
        return dec_outputs, dec_self_attns, dec_enc_attns

class Transformer(nn.Module):
    def __init__(self):
        super(Transformer, self).__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()
        self.projection = nn.Linear(embed_size, tgt_vocab_size, bias=False)
    def forward(self, enc_inputs, dec_inputs):
        enc_outputs, enc_self_attns = self.encoder(enc_inputs)
        dec_outputs, dec_self_attns, dec_enc_attns = self.decoder(dec_inputs, enc_inputs, enc_outputs)
        dec_logits = self.projection(dec_outputs) # dec_logits : [batch_size x tgt_len x tgt_vocab_size]
        return dec_logits.view(-1, dec_logits.size(-1)), enc_self_attns, dec_self_attns, dec_enc_attns
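
The model returns its logits flattened to two dimensions, which is the layout `nn.CrossEntropyLoss` expects. The following is a sketch of how a training loss could be computed; random logits stand in for the model output, and it assumes that token id 0 is the PAD token (as in the masks above) and should be ignored:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, tgt_len, tgt_vocab_size = 2, 5, 11    # illustrative sizes (assumptions)

dec_logits = torch.randn(batch_size, tgt_len, tgt_vocab_size)  # stand-in for the model output
targets = torch.randint(1, tgt_vocab_size, (batch_size, tgt_len))
targets[0, -2:] = 0                                # pretend the first sentence ends with two PAD tokens

flat_logits = dec_logits.view(-1, tgt_vocab_size)  # [batch_size * tgt_len, tgt_vocab_size]
flat_targets = targets.view(-1)                    # [batch_size * tgt_len]

criterion = nn.CrossEntropyLoss(ignore_index=0)    # padded positions do not contribute
loss = criterion(flat_logits, flat_targets)

# same value when the padded positions are dropped by hand
keep = flat_targets != 0
loss_manual = F.cross_entropy(flat_logits[keep], flat_targets[keep])
print(loss.item(), loss_manual.item())
```

Flattening keeps one row per predicted token, so every target position is scored independently.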

### Practice

Try to load the Multi30K dataset and apply the model on it.  Make any necessary changes to the model if needed.

https://pytorch.org/text/stable/datasets.html#multi30k

### References:
    
- http://jalammar.github.io/illustrated-transformer/
- http://peterbloem.nl/blog/transformers
- https://blog.floydhub.com/the-transformer-in-pytorch/