https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

# Data Science Knowledge Sharing


                 



1. Recurrent Neural Networks
2. Attention Mechanism
3. LSTM + Attention
4. GPT2




# Recurrent Neural Networks

Humans don’t start their thinking from scratch every second. As you read this paragraph, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.
<img src="./image/RNN-rolled.png" width="100" alt="rnn unrolled">
Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.



### The Problem of Long-Term Dependencies

There are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

## LSTM Networks

Long Short-Term Memory networks, usually just called “LSTMs”, are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

<img src="./image/LSTM3-chain.png" width="600" alt="rnn unrolled">

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

<img src="./image/LSTM3-gate.png" width="80" alt="rnn unrolled">

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

An LSTM has three of these gates, to protect and control the cell state.
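As a rough illustration of the gate described above, the sketch below squashes some pre-activation values with a sigmoid and multiplies the result pointwise with a vector. The function names and the numbers are made up for this example; in a real LSTM the pre-activations are computed from the input and the previous hidden state.

```python
import math

def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def apply_gate(gate_logits, values):
    """Pointwise multiply each value by sigmoid(logit)."""
    return [sigmoid(g) * v for g, v in zip(gate_logits, values)]

# Hypothetical numbers: a very negative logit closes the gate,
# a very positive one opens it, and 0.0 lets exactly half through.
cell_state = [2.0, -1.0, 4.0]
gate_logits = [-20.0, 20.0, 0.0]
gated = apply_gate(gate_logits, cell_state)
print(gated)
```

The first component is almost entirely blocked, the second passes almost unchanged, and the third is halved.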

<img src="./image/LSTM2-notation.png" width="600" alt="rnn unrolled">


## Step-by-Step LSTM Walk Through

<img src="./image/LSTM3-focus-f.png" width="600" alt="rnn unrolled">
<img src="./image/LSTM3-focus-i.png" width="600" alt="rnn unrolled">
<img src="./image/LSTM3-focus-C.png" width="600" alt="rnn unrolled">
<img src="./image/LSTM3-focus-o.png" width="600" alt="rnn unrolled">
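The four figures above step through the forget gate, the input gate, the cell-state update and the output gate. Assuming the standard LSTM formulation, one time step can be sketched in plain Python as follows. The helper names, the `params` layout and the all-zero weights are illustrative choices, not part of any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Return W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to a (W, b) pair."""
    z = h_prev + x  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]           # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]           # input gate
    c_tilde = [math.tanh(a) for a in affine(*params["c"], z)]   # candidate values
    o = [sigmoid(a) for a in affine(*params["o"], z)]           # output gate
    # Forget part of the old cell state, then add the gated candidates.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_tilde)]
    # The hidden state is a gated, squashed view of the cell state.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

# Hypothetical setup: hidden size 2, input size 1, all weights zero,
# so every gate outputs 0.5 and the candidate values are 0.
zero_W = [[0.0] * 3 for _ in range(2)]
zero_b = [0.0, 0.0]
params = {name: (zero_W, zero_b) for name in ("f", "i", "c", "o")}
h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
print(h, c)
```

With these weights the forget gate simply halves the previous cell state, which makes the role of each gate easy to check by hand.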

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s <b>attention!</b>”

## Attention Mechanism

Attention is, to some extent, motivated by how we pay visual attention to different regions of an image or correlate words in one sentence.

<img src="./image/shiba-example-attention.png" width="600" alt="rnn unrolled">
Human visual attention allows us to focus on a certain region with “high resolution” (i.e. look at the pointy ear in the yellow box) while perceiving the surrounding image in “low resolution” (i.e. now how about the snowy background and the outfit?), and then adjust the focal point or do the inference accordingly. Given a small patch of an image, pixels in the rest provide clues about what should be displayed there. We expect to see a pointy ear in the yellow box because we have seen a dog’s nose, another pointy ear on the right, and Shiba’s mystery eyes (stuff in the red boxes). However, the sweater and blanket at the bottom would not be as helpful as those doggy features.

Similarly, we can explain the relationship between words in one sentence or close context. When we see “eating”, we expect to encounter a food word very soon. The color term describes the food, but probably not so much with “eating” directly.
<img src="./image/sentence-example-attention.png" width="600" alt="rnn unrolled">

<b>Attention in deep learning can be broadly interpreted as a vector of importance weights:</b> in order to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with other elements and take the sum of their values weighted by the attention vector as the approximation of the target.
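A minimal sketch of this idea, assuming the importance weights are obtained by applying a softmax to some raw correlation scores (the scores and value vectors below are invented for illustration):

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Return the attention vector and the weighted sum of the values."""
    weights = softmax(scores)
    dim = len(values[0])
    summary = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, summary

# Hypothetical correlation of the target with three other elements.
scores = [2.0, 0.0, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
weights, summary = attend(scores, values)
print(weights, summary)
```

The element with the highest score dominates the result, while the two equally weighted elements cancel each other out in the second coordinate.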


## What’s Wrong with Seq2Seq Model?

The seq2seq model normally has an encoder-decoder architecture, composed of:

An <b>encoder</b> processes the input sequence and compresses the information into a context vector (also known as sentence embedding or “thought” vector) of a fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.

A <b>decoder</b> is initialized with the context vector to emit the transformed output. The early work only used the last state of the encoder network as the decoder initial state.

Both the encoder and decoder are recurrent neural networks, i.e. using LSTM or GRU units.


<img src="./image/encoder-decoder-example.png" width="600" alt="rnn unrolled">

<b>A critical and apparent disadvantage of this fixed-length context vector design is its inability to remember long sentences. Often it has forgotten the first part of the input by the time it has finished processing the whole of it. The attention mechanism was born (Bahdanau et al., 2015) to resolve this problem.</b>


## LSTM + Attention

The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder’s last hidden state, the secret sauce invented by attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.
While the context vector has access to the entire input sequence, we don’t need to worry about forgetting. The alignment between the source and target is learned and controlled by the context vector. Essentially the context vector consumes three pieces of information:

1. encoder hidden states;
2. decoder hidden states;
3. alignment between source and target.



<img src="./image/encoder-decoder-attention.png" width="600" alt="rnn unrolled">
<img src="./image/1_xW7d7-y0MW5QjtGvyqZN4A.gif" width="600" alt="rnn unrolled">  
<img src="./image/attention_equation_0.jpg" width="600" alt="rnn unrolled">
<img src="./image/attention_equation_1.jpg" width="600" alt="rnn unrolled">

<b>And the pseudo-code:</b>

* <b>FC = Fully connected (dense) layer</b>
* <b>EO = Encoder output</b>
* <b>H = hidden state</b>
* <b>X = input to the decoder</b>
     
        
* <b>score = FC(tanh(FC(EO) + FC(H)))</b> 
* <b>attention weights = softmax(score, axis = 1).</b> Softmax is applied on the last axis by default, but here we want to apply it on axis 1, since the shape of score is (batch_size, max_length, 1). max_length is the length of our input. Since we are trying to assign a weight to each input position, softmax should be applied on that axis.

* <b>context vector = sum(attention weights * EO, axis = 1).</b> Same reason as above for choosing axis as 1.
* <b>merged vector = concat(embedding output, context vector)</b>


In [224]:
import tensorflow as tf

vocab_size = 10000 
embedding_dim = 300
enc_units = 25
batch_size = 8

In [225]:
import tensorflow as tf

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    # The recurrent layer is a GRU, so the attribute is named accordingly.
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x):
    # x: integer token ids, shape (batch_size, sequence_length)
    x = self.embedding(x)
    # output: hidden state at every time step; state: the last hidden state
    output, state = self.gru(x)
    return output, state

In [227]:
encoder = Encoder(vocab_size, embedding_dim, enc_units, batch_size)

# The embedding layer expects integer token ids, so sample random ids
# rather than floats. max_length is an arbitrary sequence length chosen
# for this example.
max_length = 16
sample_input = tf.random.uniform((batch_size, max_length), maxval=vocab_size, dtype=tf.int32)
print(sample_input.shape)

sample_output, sample_hidden = encoder(sample_input)
print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query (the decoder hidden state) shape == (batch_size, hidden size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden size)
    # the extra time axis lets the query broadcast against values
    # when the two projections are added to calculate the score
    hidden_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(values) + self.W2(hidden_with_time_axis)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

In [None]:
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

In [None]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

In [None]:
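# Reuse the hyperparameters defined at the top of the notebook.
decoder = Decoder(vocab_size, embedding_dim, enc_units, batch_size)

# One random target token id per batch element as the decoder input.
sample_decoder_input = tf.random.uniform((batch_size, 1), maxval=vocab_size, dtype=tf.int32)
sample_decoder_output, _, _ = decoder(sample_decoder_input,
                                      sample_hidden, sample_output)

print('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))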

<img src="./image/attention_weights.png" width="400" alt="rnn unrolled">


## Why do we need the Transformer if LSTM + Attention works out of the box?
<img src="./image/giphy.gif" width="180" alt="rnn unrolled">



<img src="./image/image4.png" width="500" alt="rnn unrolled">




### The difficulties encountered with the Seq2Seq model

* <b>The complexity of total calculation per layer.</b>
* <b>The amount of calculation that can be parallelized, as measured by the minimum number of sequential operations required.</b>
* <b>The path length between long-range dependencies on the network.</b>

### Complexity comparison
<img src="./image/1_AhislWfmYl_9UqW0WKGKIw.png" width="500" alt="rnn unrolled">

* <b>n</b>: the length of the sequence.
* <b>d</b>: the dimension of the representation (typically 512 or 1024).
* <b>k</b>: the kernel size of the convolutions.
* <b>r</b>: the size of the neighborhood in restricted self-attention.
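To get a feel for these quantities, the sketch below plugs numbers into the orders of growth reported in the "Attention Is All You Need" paper, assuming the figure above reproduces that table. Constant factors are dropped, so the results are only meaningful relative to each other; the function name and the default values of `k` and `r` are arbitrary.

```python
import math

def layer_costs(n, d, k=3, r=32):
    """Map layer type -> (operations per layer, sequential operations,
    maximum path length). Orders of growth only, constants dropped."""
    return {
        "self-attention": (n * n * d, 1, 1),
        "recurrent": (n * d * d, n, n),
        "convolutional": (k * n * d * d, 1, math.log(n, k)),
        "restricted self-attention": (r * n * d, 1, n / r),
    }

# A 50-token sentence with 512-dimensional representations.
costs = layer_costs(n=50, d=512)
for name, (ops, sequential, path) in costs.items():
    print(f"{name:28s} ops={ops:>12,d} sequential={sequential:>3d} path={path:.2f}")
```

When the sequence length n is smaller than the dimension d, which is common for sentence-level inputs, self-attention needs fewer operations per layer than a recurrent layer and has no sequential bottleneck.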

## GPT2

<img src="./image/gpt2-sizes.png" width="500" alt="rnn unrolled">

#### Transformers for Language Modeling

As we already know the original transformer model is made up of an encoder and decoder – each is a stack of what we can call transformer blocks. That architecture was appropriate because the model tackled machine translation – a problem where encoder-decoder architectures have been successful in the past.

<img src="./image/transformer-encoder-decoder.png" width="600" alt="rnn unrolled">

A lot of the subsequent research work saw the architecture shed either the encoder or decoder, and use just one stack of transformer blocks – stacking them up as high as practically possible, feeding them massive amounts of training text, and throwing vast amounts of compute at them. 

<img src="./image/gpt-2-transformer-xl-bert-3.png" width="600" alt="rnn unrolled">

How high can we stack up these blocks? It turns out that’s one of the main distinguishing factors between the different GPT2 model sizes:

<img src="./image/gpt2-sizes-hyperparameters-3.png" width="600" alt="rnn unrolled">

The GPT2, and some later models like TransformerXL and XLNet are auto-regressive in nature. BERT is not. That is a trade off. In losing auto-regression, BERT gained the ability to incorporate the context on both sides of a word to gain better results. XLNet brings back autoregression while finding an alternative way to incorporate the context on both sides.
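Auto-regressive here means that the model produces one token at a time and each produced token is appended to the input used for the next step. The loop below sketches that idea with a toy lookup table standing in for the language model; `generate`, `toy_table` and `toy_next_token` are hypothetical names, and a real model would condition on the whole context rather than only the last token.

```python
def generate(next_token_fn, prompt, max_new_tokens):
    """Auto-regressive decoding: append each predicted token to the
    context and feed the longer context back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token_fn(tokens))
    return tokens

# Toy stand-in for a language model (purely illustrative).
toy_table = {"robot": "must", "must": "obey", "obey": "orders"}

def toy_next_token(tokens):
    return toy_table.get(tokens[-1], "<eos>")

generated = generate(toy_next_token, ["robot"], 3)
print(generated)
```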


<img src="./image/gpt-2-autoregression-2.gif" width="600" alt="rnn unrolled">




## The Evolution of the Transformer Block

<img src="./image/transformer-encoder-block-2.png" width="600" alt="rnn unrolled">

<img src="./image/transformer-decoder-block-2.png" width="600" alt="rnn unrolled">

It’s important that the distinction between self-attention (what BERT uses) and masked self-attention (what GPT-2 uses) is clear. A normal self-attention block allows a position to peek at tokens to its right. Masked self-attention prevents that from happening.


#### The Decoder-Only Block
 

<img src="./image/transformer-decoder-intro.png" width="600" alt="rnn unrolled">

#### Computation Graph of decoder block
 
<img src="./image/GPT-2_Decoder.jpg" width="600" alt="rnn unrolled">






## Let's look inside the GPT-2 decoder block

<img src="./image/gpt2-positional-encoding.png" width="600" alt="rnn unrolled">


<img src="./image/gpt2-input-embedding-positional-encoding-3.png" width="600" alt="rnn unrolled">

In [70]:
VOCAB_SIZE = 5000
EMBEDDING_SIZE = 128
MAX_SEQ_LEN = 4

In [71]:
import tensorflow as tf
import math
import numpy as np


class EmbeddingLayer(tf.keras.layers.Layer):

    def __init__(self, vocab_size, embedding_size, initializer=None, stddev=0.01, mean=0.0):
        super(EmbeddingLayer, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.stddev = stddev
        self.mean = mean
        self.initializer = initializer
        if self.initializer is None:
            self.initializer = tf.random_normal_initializer(mean=self.mean,
                                                            stddev=self.stddev)

    def build(self, input_shape):
        with tf.name_scope("embedding_weights"):
            self.embedding_weights = self.add_weight(
                name="weights",  # passed by keyword so it is not mistaken for the shape
                shape=[self.vocab_size, self.embedding_size],
                dtype="float32",
                initializer=self.initializer
            )
        super(EmbeddingLayer, self).build(input_shape)

    def call(self, inputs, scale=False):
        return self.embedding(inputs, scale=scale)

    def embedding(self, inputs, scale=False):
        with tf.name_scope("embedding"):
            # Create binary mask of size [batch_size, length]
            mask = tf.cast(tf.not_equal(inputs, 0), tf.float32)
            inputs = tf.cast(inputs, tf.int32)
            embeddings = tf.nn.embedding_lookup(self.embedding_weights, inputs)
            embeddings *= tf.expand_dims(mask, -1)
            # Scale embedding by the sqrt of the hidden size
            if scale:
                embeddings *= self.embedding_size ** 0.5

            return embeddings

In [81]:
embedding_layer = EmbeddingLayer(VOCAB_SIZE, EMBEDDING_SIZE)

sample_ids = np.random.randint(VOCAB_SIZE-1, size=(1, MAX_SEQ_LEN))

print ('Sample ids are:')
print(sample_ids)
word_embedding = embedding_layer(sample_ids)

print ('Word embeddings are:')
print(word_embedding.shape)
print(word_embedding)

Sample ids are:
[[1061 4077 1281 3948]]
Word embeddings are:
(1, 4, 128)
tf.Tensor(
[[[ 1.90221053e-03  1.34074278e-02  4.90653515e-03 -7.40707444e-04
   -5.50968992e-03 -1.29891783e-02 -5.42266713e-03  5.92300156e-03
    1.45144630e-04 -6.40660757e-03  5.26261376e-03 -6.62401225e-03
   -6.63287379e-03 -1.88118462e-02  2.48772418e-03 -1.71885211e-02
    7.08139827e-03 -5.02153125e-04  5.16714435e-03 -1.49867898e-02
   -1.66524053e-02 -1.40990817e-03  7.64011964e-03  1.11794425e-02
   -5.08392623e-05  6.08442631e-03 -4.79091250e-04  6.24878565e-03
   -1.36072468e-02 -3.84856458e-03 -1.40988000e-03  4.40687034e-03
   -3.11011579e-02  4.68492182e-03  1.00226107e-03  4.50293347e-03
    2.52135098e-04  1.45194046e-02  2.09679245e-03 -1.73986983e-03
   -7.32082827e-03  6.26902701e-03  2.00013611e-02  1.85521482e-03
   -2.04038825e-02  3.26791243e-03  1.05519658e-02 -4.83077532e-03
    2.44718301e-03  5.80251450e-03 -8.87712184e-03  9.02302749e-03
    2.25715665e-03  5.64410072e-03  1.0863863

In [82]:
class PositionEmbeddingLayer(tf.keras.layers.Layer):

    def __init__(self, position_seq, pos_embedding_size, stddev=0.02, mean=0.0):
        super(PositionEmbeddingLayer, self).__init__()
        self.position_seq = position_seq
        self.hidden_size = pos_embedding_size
        self.stddev = stddev
        self.mean = mean

        self.position_embedding = EmbeddingLayer(self.position_seq, self.hidden_size,
                                                     stddev=self.stddev, mean=self.mean)

    def call(self, inputs, start=1):
        with tf.name_scope("pos_embedding"):
            batch_size = tf.shape(inputs)[0]
            batch_seq = tf.shape(inputs)[1]

            positions = tf.reshape(tf.tile(tf.range(start, batch_seq + start), [batch_size]),
                                       [batch_size, batch_seq])
            
            print(positions)
            positions = tf.cast(positions, tf.int32)
            position_mask = tf.cast(tf.not_equal(inputs, 0), tf.int32)
            positions *= position_mask

            return self.position_embedding(positions)

In [83]:
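# The position table needs one row per position 1..MAX_SEQ_LEN plus row 0 for padding,
# so it is sized by the sequence length rather than by the vocabulary.
pos_embedding_layer = PositionEmbeddingLayer(MAX_SEQ_LEN + 1, EMBEDDING_SIZE)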

print ('Sample ids are:')
print(sample_ids)

position_embedding = pos_embedding_layer(sample_ids)

print ('Position embeddings are:')
print(position_embedding.shape)
print(position_embedding)

Sample ids are:
[[1061 4077 1281 3948]]
tf.Tensor([[1 2 3 4]], shape=(1, 4), dtype=int32)
Position embeddings are:
(1, 4, 128)
tf.Tensor(
[[[-0.0403204  -0.00521115  0.01583621  0.00491687  0.00755817
    0.02104377 -0.01092372  0.00170765  0.0137367   0.01523365
    0.00510252  0.0271987  -0.00296931  0.00402209 -0.01001332
   -0.00093408  0.02088011 -0.01156523  0.00376268  0.03218854
    0.01163603  0.00293004 -0.03048865 -0.00615662  0.00125647
   -0.00104482  0.01525881 -0.00427771 -0.02653289  0.0025996
    0.03744173 -0.01216877 -0.00031213 -0.00969449 -0.01236077
    0.0091147  -0.03394173  0.00376424 -0.01889127  0.01012289
    0.02167968  0.00856065 -0.02023563 -0.00521756  0.0418028
   -0.01997497  0.00691698 -0.00664625  0.02715906 -0.00627076
   -0.0158858  -0.01244092 -0.0059907   0.00086643  0.00350134
   -0.00789419 -0.02081902 -0.02522325 -0.02328824  0.02420071
    0.0011864  -0.00090548  0.01881256 -0.00868049  0.02231683
    0.02369094  0.00392744 -0.01739495 -0.034

In [84]:
final_embedding = word_embedding + position_embedding

print("Final Embeddings are:- ")
print(final_embedding)

Final Embeddings are:- 
tf.Tensor(
[[[-3.84181887e-02  8.19627941e-03  2.07427442e-02  4.17616125e-03
    2.04847706e-03  8.05459172e-03 -1.63463876e-02  7.63065089e-03
    1.38818491e-02  8.82704370e-03  1.03651350e-02  2.05746889e-02
   -9.60218720e-03 -1.47897527e-02 -7.52559723e-03 -1.81226023e-02
    2.79615112e-02 -1.20673785e-02  8.92982166e-03  1.72017477e-02
   -5.01637440e-03  1.52013625e-03 -2.28485279e-02  5.02282334e-03
    1.20562781e-03  5.03960904e-03  1.47797167e-02  1.97107904e-03
   -4.01401371e-02 -1.24896318e-03  3.60318497e-02 -7.76189473e-03
   -3.14132832e-02 -5.00956876e-03 -1.13585051e-02  1.36176301e-02
   -3.36895958e-02  1.82836466e-02 -1.67944822e-02  8.38302355e-03
    1.43588465e-02  1.48296803e-02 -2.34266743e-04 -3.36234760e-03
    2.13989150e-02 -1.67070534e-02  1.74689423e-02 -1.14770280e-02
    2.96062399e-02 -4.68249898e-04 -2.47629248e-02 -3.41789704e-03
   -3.73354461e-03  6.51053200e-03  1.43652027e-02 -2.02155877e-02
   -3.68221626e-02 -1.73082



<img src="./image/gpt2-self-attention-example-2.png" width="600" alt="rnn unrolled">


#### Masked Self Attention
  

<img src="./image/self-attention-and-masked-self-attention.png" width="600" alt="rnn unrolled">


#### Self-Attention Process

Self-attention is processed along the path of each token in the segment. The significant components are three vectors:

* <b>Query:</b> The query is a representation of the current word used to score against all the other words (using their keys). We only care about the query of the token we’re currently processing.

* <b>Key:</b> Key vectors are like labels for all the words in the segment. They’re what we match against in our search for relevant words.

* <b>Value:</b> Value vectors are the actual word representations. Once we’ve scored how relevant each word is, these are the values we add up to represent the current word.

<img src="./image/self-attention-example-folders-3.png" width="600" alt="rnn unrolled">

A crude analogy is to think of it like searching through a filing cabinet. The query is like a sticky note with the topic you’re researching. The keys are like the labels of the folders inside the cabinet. When a label matches the sticky note, you take out the contents of that folder; these contents are the value vector. Except you’re not only looking for one value, but a blend of values from a blend of folders.

Multiplying the query vector by each key vector produces a score for each folder (technically: dot product followed by softmax).

<img src="./image/self-attention-example-folders-scores-3.png" width="600" alt="rnn unrolled">

## Let's now understand self-attention



#### 1- Create Query, Key, and Value Vectors
    
Let’s focus on the first path. We’ll take its query, and compare against all the keys. That produces a score for each key. The first step in self-attention is to calculate the three vectors for each token path (let’s ignore attention heads for now):
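As a small sketch of this first step, each token's vector can be multiplied by three weight matrices to obtain its query, key and value. The matrices below are tiny hand-picked examples so the result is easy to verify; in a trained model they are learned, and the names `matvec` and `make_qkv` are invented for this illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def make_qkv(embeddings, W_q, W_k, W_v):
    """Project every token embedding into a query, a key and a value."""
    queries = [matvec(W_q, x) for x in embeddings]
    keys = [matvec(W_k, x) for x in embeddings]
    values = [matvec(W_v, x) for x in embeddings]
    return queries, keys, values

# Two hypothetical 2-dimensional token embeddings.
embeddings = [[1.0, 0.0], [0.0, 2.0]]
W_q = [[1.0, 0.0], [0.0, 1.0]]  # identity: queries equal the embeddings
W_k = [[2.0, 0.0], [0.0, 2.0]]  # doubles every component
W_v = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two components
queries, keys, values = make_qkv(embeddings, W_q, W_k, W_v)
print(queries, keys, values)
```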




<img src="./image/self-attention-1.png" width="500" alt="rnn unrolled">


#### 2- Score

Now that we have the vectors, we use the query and key vectors only for step #2. Since we’re focused on the first token, we multiply its query by all the other key vectors resulting in a score for each of the four tokens.
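The scoring step can be sketched as a dot product between the first token's query and each of the four keys, followed by a softmax that turns the raw scores into weights. The query and key values are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query of the first token and keys of all four tokens.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]

raw_scores = [dot(query, key) for key in keys]
weights = softmax(raw_scores)
print(raw_scores)
print(weights)
```

The key that points in the same direction as the query receives the largest weight, and the key that points in the opposite direction receives the smallest.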


<img src="./image/self-attention-2.png" width="600" alt="rnn unrolled">



#### 3- Sum
  
We can now multiply the scores by the value vectors. A value with a high score will constitute a large portion of the resulting vector after we sum them up.


<img src="./image/self-attention-3-2.png" width="600" alt="rnn unrolled">
  
If we do the same operation for each path, we end up with a vector representing each token containing the appropriate context of that token. Those are then presented to the next sublayer in the transformer block (the feed-forward neural network):
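Doing this for every path amounts to multiplying a matrix of attention weights (one row per token) by the value vectors. The sketch below assumes the weights have already been computed and normalized; the numbers and the name `blend_values` are illustrative only.

```python
def blend_values(weights, values):
    """weights[i][j] says how much token i attends to token j.
    Returns one context vector per token."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]

# Hypothetical weights for three tokens; each row sums to 1.
weights = [
    [0.5, 0.5, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.0, 1.0],
]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
outputs = blend_values(weights, values)
print(outputs)
```

The third token attends only to itself, so its output is exactly its own value vector, while the first token's output is the average of the first two value vectors.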

<img src="./image/self-attention-summary.png" width="600" alt="rnn unrolled">




## Masked Self-Attention


Now that we’ve looked inside a transformer’s self-attention step, let’s proceed to look at masked self-attention. Masked self-attention is identical to self-attention except when it comes to step #2. Assume the model only has two tokens as input and we’re observing the second token. In this case, the last two tokens are masked, so the model interferes in the scoring step. It always scores the future tokens as 0 so that the model can’t peek at future words:

<img src="./image/masked-self-attention-2.png" width="600" alt="rnn unrolled">

This masking is often implemented as a matrix called an attention mask. Think of a sequence of four words (“robot must obey orders”, for example). In a language modeling scenario, this sequence is absorbed in four steps – one per word (assuming for now that every word is a token). As these models work in batches, we can assume a batch size of 4 for this toy model that will process the entire sequence (with its four steps) as one batch.
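One way to picture the four steps is as a list of (visible words, word to predict) pairs. In the sketch below, the function name and the `<eos>` end marker used as the target of the last step are assumptions made for illustration.

```python
def language_model_steps(words, end_token="<eos>"):
    """One step per word: the model sees the words so far and must
    predict the next one (the end token after the last word)."""
    targets = words[1:] + [end_token]
    return [(words[: i + 1], targets[i]) for i in range(len(words))]

steps = language_model_steps(["robot", "must", "obey", "orders"])
for visible, target in steps:
    print(visible, "->", target)
```

Each step sees one more word than the previous one, which is exactly the triangular pattern the attention mask enforces when all four steps are processed together.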


<img src="./image/transformer-decoder-attention-mask-dataset.png" width="600" alt="rnn unrolled">

In matrix form, we calculate the scores by multiplying a queries matrix by a keys matrix. Let’s visualize it as follows, except instead of the word, there would be the query (or key) vector associated with that word in that cell:

<img src="./image/queries-keys-attention-mask.png" width="600" alt="rnn unrolled">

After the multiplication, we slap on our attention mask triangle. It sets the cells we want to mask to -infinity or a very large negative number (e.g. -1 billion in GPT2):

<img src="./image/transformer-attention-mask.png" width="600" alt="rnn unrolled">

Then, applying softmax on each row produces the actual scores we use for self-attention:
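The two steps (add a large negative number to future positions, then softmax each row) can be sketched without any framework. All raw scores are set to zero here, an arbitrary choice that makes the effect of the mask easy to read off; the function names are illustrative.

```python
import math

NEG_INF = -1e9  # large negative stand-in for -infinity

def causal_mask(size):
    """1 marks a future position (column > row) that must be hidden."""
    return [[1 if col > row else 0 for col in range(size)] for row in range(size)]

def masked_softmax_rows(scores):
    """Mask the upper triangle of a square score matrix, then softmax each row."""
    mask = causal_mask(len(scores))
    result = []
    for row, mask_row in zip(scores, mask):
        masked = [s + m * NEG_INF for s, m in zip(row, mask_row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

scores = [[0.0] * 4 for _ in range(4)]  # hypothetical: all raw scores equal
weights = masked_softmax_rows(scores)
for row in weights:
    print(row)
```

Each row spreads its weight evenly over the positions it is allowed to see: the first token attends only to itself, the last token to all four.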

<img src="./image/transformer-attention-masked-scores-softmax.png" width="600" alt="rnn unrolled">




## Scaled dot product attention

<img src="https://www.tensorflow.org/images/tutorials/transformer/scaled_attention.png" width="500" alt="scaled_dot_product_attention">

The attention function used by the transformer takes three inputs: Q (query), K (key), V (value). The equation used to calculate the attention weights is:

$$\Large{\mathrm{Attention}(Q, K, V) = \mathrm{softmax}_k\left(\frac{QK^T}{\sqrt{d_k}}\right) V} $$

The dot-product attention is scaled by a factor of the square root of the depth. This is done because, for large values of depth, the dot product grows large in magnitude, pushing the softmax function into regions where it has small gradients, which results in a very hard softmax.
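A small worked example makes the effect visible. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance equal to the depth, so raw scores spread out as the depth grows. The sketch below uses a depth of 64 and two hand-picked keys (both assumptions for illustration) and compares the softmax with and without scaling.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
query = [1.0] * d_k
keys = [[1.0] * d_k, [0.5] * d_k]  # the second key is a weaker match

raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

hard = softmax(raw)     # unscaled: nearly one-hot
soft = softmax(scaled)  # scaled: the weaker match keeps some weight
print(raw, hard)
print(scaled, soft)
```

Without scaling, the second key's weight is vanishingly small, and so is the gradient flowing through it; with scaling it keeps a weight of roughly two percent.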

In [200]:
def scaled_dot_product_attention(q, k, v, mask=None):
    q = tf.cast(q, tf.float32)
    k = tf.cast(k, tf.float32)
    v = tf.cast(v, tf.float32)

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
        
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    matmul_qk = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        matmul_qk += (mask * -1e9)
        
    attention_weights = tf.nn.softmax(matmul_qk, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights

## Masking

In [86]:
def get_padding_mask(seq):
    with tf.name_scope("Padding_Mask"):
        seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

        # add extra dimensions to add the padding
        # to the attention logits.
        return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)


def attention_mask(size):
    """
    if size is 4 then it returns below matrix
       [[0., 1., 1., 1.],
        [0., 0., 1., 1.],
        [0., 0., 0., 1.],
        [0., 0., 0., 0.]]
    """
    with tf.name_scope("attention_mask"):
        mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
        return mask  # (seq_len, seq_len)


def create_masks(inp):
    with tf.name_scope("att_masking"):
        att_mask = attention_mask(tf.shape(inp)[1])
        padding_mask = get_padding_mask(inp)
        mask = tf.maximum(padding_mask, att_mask)

        return mask

In [87]:
"""
temp_q = np.random.rand(8, 16)
temp_k = np.random.rand(8, 16)
temp_v = np.random.rand(8, 16)

"""
temp_q = final_embedding
temp_k = final_embedding
temp_v = final_embedding


In [88]:
output, att_weights = scaled_dot_product_attention(temp_q, temp_k, temp_v)

print ('Attention weights are:')
print (att_weights)
print ('\nOutput is:')
print (output)

Attention weights are:
tf.Tensor(
[[[0.25074962 0.24970035 0.24972199 0.24982807]
  [0.24959165 0.25098953 0.24957427 0.24984452]
  [0.24971952 0.24968047 0.25100645 0.24959356]
  [0.24968803 0.24981323 0.24945612 0.25104263]]], shape=(1, 4, 4), dtype=float32)

Output is:
tf.Tensor(
[[[-1.08853336e-02  2.34462693e-03  6.28101733e-03 -4.45999857e-03
   -1.33893779e-03 -3.74334073e-03 -1.17176687e-02 -7.28918705e-03
    1.34670278e-02 -8.32254067e-03  8.74450337e-03 -5.04577393e-03
   -1.01230815e-02  2.61042127e-03  6.99519087e-03  9.38552432e-03
   -5.79891168e-03 -1.91723020e-03 -3.01159080e-03  2.38240473e-02
   -6.45955978e-03 -2.69577699e-03 -1.59334149e-02  7.77499471e-03
   -1.45985372e-03  2.01275423e-02  1.11289909e-02 -1.97527017e-02
   -1.84816271e-02 -2.26150686e-03  3.70208174e-04 -2.85813189e-03
   -1.77462101e-02  1.24312658e-02  9.53528192e-03  3.72849638e-04
   -3.23698949e-03  7.26946210e-03  1.03675382e-04  2.34036008e-04
    1.04769235e-02  1.21247899e-02 -6.13787817

## Masked Self Attention

In [89]:
mask = create_masks(sample_ids)
print(mask)

tf.Tensor(
[[[[0. 1. 1. 1.]
   [0. 0. 1. 1.]
   [0. 0. 0. 1.]
   [0. 0. 0. 0.]]]], shape=(1, 1, 4, 4), dtype=float32)


In [122]:
sample_att = tf.cast([10.0, 8.0, 9.0, 6.0], tf.float16)

sample_att2= tf.cast([10.0, 8.0, -9.0e9, -6.0e9], tf.float16)

print("\nNormal Softmax:")
print(tf.nn.softmax(sample_att).numpy())
print("\nMasked Softmax:")
print(tf.nn.softmax(sample_att2).numpy())


Normal Softmax:
[0.657   0.0889  0.2418  0.01204]

Masked Softmax:
[0.881  0.1192 0.     0.    ]


In [125]:
output, att_weights = scaled_dot_product_attention(temp_q, 
                                                   temp_k, 
                                                   temp_v,
                                                   mask=mask)

print ('\nAttention weights are:')
print (att_weights.numpy())
print ('\nOutput is:')
print (output.numpy())

[[[ 0.0039889  -0.00020433 -0.00011768  0.00030697]
  [-0.00020433  0.00538071 -0.00027396  0.00080827]
  [-0.00011768 -0.00027396  0.0050227  -0.0006222 ]
  [ 0.00030697  0.00080827 -0.0006222   0.00571752]]]

 After mask sum
[[[[ 3.9888979e-03 -1.0000000e+09 -1.0000000e+09 -1.0000000e+09]
   [-2.0433015e-04  5.3807064e-03 -1.0000000e+09 -1.0000000e+09]
   [-1.1768091e-04 -2.7396079e-04  5.0226985e-03 -1.0000000e+09]
   [ 3.0696925e-04  8.0827315e-04 -6.2219956e-04  5.7175169e-03]]]]

Attention weights are:
[[[[1.         0.         0.         0.        ]
   [0.49860373 0.50139624 0.         0.        ]
   [0.332779   0.332727   0.334494   0.        ]
   [0.24968803 0.24981323 0.24945612 0.25104263]]]]

Output is:
[[[[-3.84181887e-02  8.19627941e-03  2.07427442e-02  4.17616125e-03
     2.04847706e-03  8.05459172e-03 -1.63463876e-02  7.63065089e-03
     1.38818491e-02  8.82704370e-03  1.03651350e-02  2.05746889e-02
    -9.60218720e-03 -1.47897527e-02 -7.52559723e-03 -1.81226023e-02
   

## Multi-head attention

<img src="https://www.tensorflow.org/images/tutorials/transformer/multi_head_attention.png" width="500" alt="multi-head attention">


Multi-head attention consists of four parts:
*    Linear layers and split into heads.
*    Scaled dot-product attention.
*    Concatenation of heads.
*    Final linear layer.


<img src="./image/gpt2-self-attention-3.png" width="600" alt="multi-head attention">

In the previous examples, we dove straight into self-attention, ignoring the “multi-head” part. It would be useful to shed some light on that concept now. Self-attention is conducted multiple times, once per head, on different parts of the Q, K, V vectors. “Splitting” attention heads is simply reshaping the long vector into a matrix. The small GPT2 has 12 attention heads, so that would be the first dimension of the reshaped matrix:

<img src="./image/gpt2-self-attention-split-attention-heads-1.png" width="600" alt="multi-head attention">
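
As a rough illustration of that reshape, here is a NumPy sketch for a sequence of two tokens. The function name `split_heads` mirrors the method used in this notebook, but this standalone version (no batch axis) is an assumption made for brevity.

```python
import numpy as np

def split_heads(x, num_heads):
    # x: (seq_len, d_model) -> (num_heads, seq_len, depth)
    seq_len, d_model = x.shape
    depth = d_model // num_heads
    # Cut each token vector into num_heads chunks, then move the head axis first.
    return x.reshape(seq_len, num_heads, depth).transpose(1, 0, 2)

x = np.arange(2 * 768, dtype=np.float32).reshape(2, 768)  # 2 tokens, d_model = 768
heads = split_heads(x, num_heads=12)
print(heads.shape)  # (12, 2, 64)
```

Head 0 sees the first 64 features of every token, head 1 the next 64, and so on; no values are changed, only the layout.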

In the previous examples, we’ve looked at what happens inside one attention head. One way to think of multiple attention-heads is like this (if we’re to only visualize three of the twelve attention heads):

<img src="./image/gpt2-self-attention-split-attention-heads-2.png" width="600" alt="multi-head attention">

#### GPT-2 Self-attention: Merge attention heads

The way we deal with the various attention heads is that we first concatenate them into one vector:

<img src="./image/gpt2-self-attention-merge-heads-1.png" width="600" alt="multi-head attention">


But the vector isn’t ready to be sent to the next sublayer just yet. We need to first turn this Frankenstein’s-monster of hidden states into a homogenous representation.

#### GPT-2 Self-attention: 4- Projecting

We’ll let the model learn how to best map concatenated self-attention results into a vector that the feed-forward neural network can deal with. Here comes our second large weight matrix that projects the results of the attention heads into the output vector of the self-attention sublayer:

<img src="./image/gpt2-self-attention-project-1.png" width="600" alt="multi-head attention">

And with this, we have produced the vector we can send along to the next layer:

<img src="./image/gpt2-self-attention-project-2.png" width="600" alt="multi-head attention">
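
The merge and projection steps can be sketched in NumPy as follows. The per-head outputs and the projection weights are random placeholders (an assumption for illustration); in the real model the projection matrix is learned.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, seq_len, depth = 12, 2, 64
d_model = num_heads * depth  # 768

# Placeholder per-head attention outputs: (num_heads, seq_len, depth)
head_outputs = rng.normal(size=(num_heads, seq_len, depth))

# Merge: put the head axis next to depth, then flatten the two together.
merged = head_outputs.transpose(1, 0, 2).reshape(seq_len, d_model)

# Project: a (d_model, d_model) weight matrix plus bias, random here.
w_proj = rng.normal(scale=0.02, size=(d_model, d_model))
b_proj = np.zeros(d_model)
projected = merged @ w_proj + b_proj

print(merged.shape, projected.shape)  # (2, 768) (2, 768)
```

Merging only rearranges values; it is the projection that lets information from different heads mix.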


In [127]:
class Conv1d(tf.keras.layers.Layer):
    def __init__(self, hidden_size,
                 filter_size,
                 weights_init_stdev=0.02,
                 weights_mean=0.0,
                 bias_init=0.0):
        super(Conv1d, self).__init__()

        self.weights_init_stdev = weights_init_stdev
        self.weights_mean = weights_mean
        self.bias_init = bias_init
        self.hidden_size = hidden_size
        self.filter_size = filter_size

    def build(self, input_shape):
        self.weight = self.add_weight(
            name="conv1d_weights",
            shape=[self.hidden_size, self.filter_size],
            dtype=tf.float32,
            initializer=tf.random_normal_initializer(
                stddev=self.weights_init_stdev,
                mean=self.weights_mean))

        self.bias = self.add_weight(name="conv1d_biases",
                                    shape=[self.filter_size],
                                    initializer=tf.constant_initializer(self.bias_init))
        super(Conv1d, self).build(input_shape)

    def call(self, inputs):
        output_shape = [tf.shape(inputs)[0], tf.shape(inputs)[1]] + [self.filter_size]
        inputs = tf.reshape(inputs, [-1, self.hidden_size])  # shape [batch, seq , features] => [batch*seq, features]
        outputs = tf.matmul(inputs, self.weight) + self.bias
        outputs = tf.reshape(outputs, output_shape)  # Reshape => [batch, seq, filter_size]
        return outputs

In [169]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, att_dropout=0.1, scale=True):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.att_dropout = att_dropout
        self.scale = scale

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.c_attn = Conv1d(self.d_model, self.d_model * 3)
        self.c_proj = Conv1d(self.d_model, self.d_model)


    def split_heads(self, x):
        batch_size = tf.shape(x)[0]
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def merge_heads(self, x):
        batch_size = tf.shape(x)[0]
        x = tf.transpose(x, perm=[0, 2, 1, 3])
        # (batch_size, seq_len_q, num_heads, depth)

        merged = tf.reshape(x, (batch_size, -1, self.d_model))
        # (batch_size, seq_len_q, d_model)
        return merged

    def call(self, x, mask=None):
        # One matmul produces Q, K and V together: (batch_size, seq_len, 3 * d_model)
        x = self.c_attn(x)
        query, key, value = tf.split(x, 3, axis=2)

        # Each becomes (batch_size, num_heads, seq_len, depth)
        query = self.split_heads(query)
        key = self.split_heads(key)
        value = self.split_heads(value)

        scaled_attention, attention_weights = scaled_dot_product_attention(query, key, value, mask)

        # (batch_size, seq_len_q, d_model)
        concat_attention = self.merge_heads(scaled_attention)

        # Projection
        output = self.c_proj(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights

In [150]:
mha = MultiHeadAttention(128, 2)

In [151]:
out, attention_weights = mha(final_embedding)

print(out.shape)
print(attention_weights.shape)
print(attention_weights)

(1, 4, 128)
(1, 2, 4, 4)
tf.Tensor(
[[[[0.2500032  0.24999148 0.25000587 0.24999943]
   [0.24999501 0.25000516 0.24999526 0.2500046 ]
   [0.24999449 0.2499987  0.25000915 0.24999768]
   [0.25000286 0.2500052  0.24999373 0.2499982 ]]

  [[0.24999882 0.24999523 0.25000885 0.24999708]
   [0.24999806 0.24999098 0.25000796 0.250003  ]
   [0.25000438 0.24999794 0.24999106 0.25000665]
   [0.25000376 0.250005   0.2500039  0.24998733]]]], shape=(1, 2, 4, 4), dtype=float32)


## Feed forward network

#### GPT-2 Fully-Connected Neural Network: Layer #1

The fully-connected neural network is where the block processes its input token after self-attention has included the appropriate context in its representation. It is made up of two layers. The first layer is four times the size of the model (since the model dimension of GPT2 small is 768, this layer has 768*4 = 3072 units). Why four times? That’s just the size the original transformer rolled with (model dimension was 512 and layer 1 in that model was 2048).

<img src="./image/gpt2-mlp1.gif" width="600" alt="transformer">

#### GPT-2 Fully-Connected Neural Network: Layer #2 - Projecting to model dimension

The second layer projects the result from the first layer back into model dimension (768 for the small GPT2). The result of this multiplication is the result of the transformer block for this token.


<img src="./image/gpt2-mlp-2.gif" width="600" alt="transformer">
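
The two layers amount to two matrix multiplications with a non-linearity in between. Here is a minimal NumPy sketch with the GPT2 small sizes from the text; the weights are random placeholders, and ReLU is used only to keep the sketch short (GPT-2 itself uses GELU).

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # Layer 1: expand from d_model to 4 * d_model, then apply ReLU.
    hidden = np.maximum(0.0, x @ w1 + b1)
    # Layer 2: project back down to d_model.
    return hidden, hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 768, 768 * 4, 4

x = rng.normal(size=(seq_len, d_model))
w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

hidden, out = feed_forward(x, w1, b1, w2, b2)
print(hidden.shape, out.shape)  # (4, 3072) (4, 768)
```

Each token is processed independently: the same weights are applied to every position in the sequence.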


In [195]:
class FeedForward(tf.keras.layers.Layer):

    def __init__(self, hidden_size, filter_size, activation=tf.nn.relu):
        super(FeedForward, self).__init__()
        self.hidden_size = hidden_size
        self.filter_size = filter_size
        self.activation = activation

        self.dense_layer = Conv1d(self.hidden_size, self.filter_size)
        self.output_dense_layer = Conv1d(self.filter_size, self.hidden_size)

    def call(self, x):
        output = self.dense_layer(x)
        output = self.activation(output)
        
        output = self.output_dense_layer(output)
        
        return output

In [196]:
ffn = FeedForward(128, 512)
ffn(out).shape

TensorShape([1, 4, 128])

## We have finally covered all the parts of GPT2

We now have most of the picture of what happens inside a transformer language model. To recap, our brave input vector encounters these weight matrices:


<img src="./image/gpt2-transformer-block-weights-2.png" width="700" alt="transformer">

And each block has its own set of these weights. On the other hand, the model has only one token embedding matrix and one positional encoding matrix:

<img src="./image/gpt2-weights-2.png" width="700" alt="transformer">
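
To get a feel for how these weights add up, the sketch below counts parameters for a model built from the layers in this notebook: a fused Q/K/V projection, an attention output projection, a two-layer feed-forward network, two layer norms per block, one final layer norm, and an output projection that reuses the token embedding matrix. The model dimension of 768 comes from the text; the 12 layers, 50257-token vocabulary and 1024-position context are assumptions taken from the released GPT-2 small configuration, not from this notebook.

```python
def count_parameters(d_model, num_layers, vocab_size, max_seq_len):
    d_ff = 4 * d_model
    attn_qkv = d_model * 3 * d_model + 3 * d_model   # fused Q, K, V weights + bias
    attn_proj = d_model * d_model + d_model          # attention output projection
    ffn_in = d_model * d_ff + d_ff                   # feed-forward layer 1
    ffn_out = d_ff * d_model + d_model               # feed-forward layer 2
    layer_norms = 2 * (2 * d_model)                  # gamma and beta, twice per block
    per_block = attn_qkv + attn_proj + ffn_in + ffn_out + layer_norms

    embeddings = vocab_size * d_model + max_seq_len * d_model
    final_norm = 2 * d_model
    total = num_layers * per_block + embeddings + final_norm
    return per_block, total

per_block, total = count_parameters(
    d_model=768, num_layers=12, vocab_size=50257, max_seq_len=1024)
print(per_block)  # 7087872
print(total)      # 124439808
```

Under these assumptions, roughly two thirds of the parameters sit in the stacked blocks and most of the rest in the token embedding matrix.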



# Let's look at the code inside the decoder

## Layer Normalization

In [165]:
class LayerNormalization(tf.keras.layers.Layer):

    def __init__(self, hidden_size):
        super(LayerNormalization, self).__init__()
        self.hidden_size = hidden_size

    def build(self, input_shape):
        self.gamma = self.add_weight(
            "layer_norm_scale",
            shape=[self.hidden_size],
            dtype="float32",
            initializer=tf.ones_initializer(),
            experimental_autocast=False)
        self.beta = self.add_weight(
            "layer_norm_bias",
            shape=[self.hidden_size],
            dtype="float32",
            initializer=tf.zeros_initializer(),
            experimental_autocast=False)
        super(LayerNormalization, self).build(input_shape)

    def call(self, x, epsilon=1e-6, input_dtype=tf.float32):
        mean = tf.reduce_mean(x, axis=[-1], keepdims=True)
        variance = tf.reduce_mean(tf.square(x - mean), axis=[-1], keepdims=True)
        normalized = (x - mean) * tf.math.rsqrt(variance + epsilon)
        return tf.cast(normalized * self.gamma + self.beta, input_dtype)

## Decoder Layers

Each `decoder layer` consists of sublayers:

* <b>Masked Multi-head attention (with look ahead mask, padding mask)</b>
* <b>Feed forward networks.</b>

Each of these sublayers is wrapped in a residual connection, and layer normalization is applied to the sublayer's input. This is the pre-norm arrangement that GPT-2 uses, unlike the original transformer, which normalizes after the residual sum. Residual connections help in avoiding the vanishing gradient problem in deep networks.

The output of each sublayer is <b>x + Sublayer(LayerNorm(x)).</b> The normalization is done on the d_model (last) axis. There are N decoder layers in GPT-2.

In [201]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff):
        super(DecoderLayer, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.dff = dff

        self.mha = MultiHeadAttention(self.d_model, self.num_heads)
        self.feed_forward = FeedForward(self.d_model, self.dff)
        self.layer_norm1 = LayerNormalization(self.d_model)
        self.layer_norm2 = LayerNormalization(self.d_model)

    def call(self, x, mask):
        out, att = self.mha(self.layer_norm1(x), mask=mask)  # (batch_size, input_seq_len, d_model)
        
        with tf.name_scope("residual_conn"):
            x = x + out
            
        out = self.feed_forward(self.layer_norm2(x))  # (batch_size, input_seq_len, d_model)
        with tf.name_scope("residual_conn"):
            x = x + out
        return x, att

In [202]:
decoder = DecoderLayer(128, 2, 128*4)

In [205]:
output, att = decoder(final_embedding, mask)

print(output.shape)
print(att.shape)

(1, 4, 128)
(1, 2, 4, 4)
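
The ordering used in `DecoderLayer.call` (normalize, apply the sublayer, then add the result back to the input) can be sketched in NumPy. The helper names `layer_norm` and `pre_norm_residual` are hypothetical, and the zero-output sublayer is only there to make the residual path visible.

```python
import numpy as np

def layer_norm(x, gamma, beta, epsilon=1e-6):
    # Normalize over the last (d_model) axis, then scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    variance = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon) * gamma + beta

def pre_norm_residual(x, sublayer, gamma, beta):
    # x + Sublayer(LayerNorm(x))
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 128))
gamma, beta = np.ones(128), np.zeros(128)

normalized = layer_norm(x, gamma, beta)
print(normalized.mean(axis=-1))  # each entry close to 0
print(normalized.std(axis=-1))   # each entry close to 1

# A sublayer that outputs zeros leaves the residual stream untouched.
out = pre_norm_residual(x, lambda h: np.zeros_like(h), gamma, beta)
print(np.array_equal(out, x))  # True
```

Because the input `x` is added back unchanged, each sublayer only has to learn a correction to the running representation.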


## GPT2

The `GPT2` consists of:

1. Stacked Decoders.
2. Projection Layer

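### Decoder
The `Decoder` stack consists of:

1. Input Embedding
2. Positional Embedding
3. N decoder layers

The input is put through an embedding which is summed with the positional embedding. The output of this summation is the input to the decoder layers. GPT-2 has no encoder: the output of the last decoder layer is layer-normalized and then projected onto the vocabulary by multiplying with the transposed token embedding matrix, which produces the logits.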

In [216]:
class Gpt2(tf.keras.Model):
    def __init__(self, num_layers, 
                 d_model, num_heads, 
                 dff, max_seq_len, 
                 vocab_size):
        
        super(Gpt2, self).__init__()

        self.num_layers = num_layers
        self.num_heads = num_heads
        self.dff = dff
        self.max_seq_len = max_seq_len
        self.vocab_size = vocab_size
        self.d_model = d_model

        self.embedding = EmbeddingLayer(
            self.vocab_size, self.d_model)

        self.pos_embedding = PositionEmbeddingLayer(
            self.max_seq_len, self.d_model)

        self.decoder_layers = [DecoderLayer(self.d_model, self.num_heads, self.dff)
                               for _ in range(self.num_layers)]
        self.layer_norm = LayerNormalization(self.d_model)

    def call(self, x):
        x = tf.cast(x, tf.int32)
        batch, sequence = tf.shape(x)[0], tf.shape(x)[1]

        att_mask = create_masks(x)

        with tf.name_scope("embeddings"):
            embedded_x = self.embedding(x)
            hidden_states = embedded_x + self.pos_embedding(x)

        for decoder_layer in self.decoder_layers:
            hidden_states, _ = decoder_layer(hidden_states, att_mask)

        hidden_states = self.layer_norm(hidden_states)
        
        with tf.name_scope("projection_layer"):
            logits = tf.matmul(hidden_states, 
                               self.embedding.embedding_weights, 
                               transpose_b=True)

        return logits

In [228]:
VOCAB_SIZE = 5000
MAX_SEQ_LEN = 4
NUM_LAYERS = 8
D_MODEL = 128
DFF = 512
NUM_HEADS = 2

In [231]:
model = Gpt2(NUM_LAYERS, 
             D_MODEL, NUM_HEADS, 
             DFF, MAX_SEQ_LEN, 
             VOCAB_SIZE)

In [232]:
print(sample_ids)

pred = model(sample_ids)

print("\nGpt2 output:-")
print(pred.shape)

[[1061 4077 1281 3948]]
tf.Tensor([[1 2 3 4]], shape=(1, 4), dtype=int32)

Gpt2 output:-
(1, 4, 5000)


In [233]:
pred

<tf.Tensor: id=20053, shape=(1, 4, 5000), dtype=float32, numpy=
array([[[ 0.26454148, -0.00054997,  0.00568861, ..., -0.1565355 ,
         -0.07433081,  0.17455825],
        [ 0.3317353 , -0.00407705, -0.08202678, ..., -0.17092118,
         -0.06943148,  0.14795086],
        [ 0.30808362, -0.00544102,  0.04076224, ..., -0.15971816,
         -0.04748448,  0.15719515],
        [ 0.29926157, -0.0177863 ,  0.00833206, ..., -0.24653032,
         -0.02775235,  0.11644214]]], dtype=float32)>









### Live sequence generation with GPT2 trained on 30 million JDs and resumes











<img src="./image/artificial-intelligence-15-638.jpg" width="700" alt="transformer">
