# Implementing a full transformer encoder

__Objective:__ implement the encoder part of a transformer model from scratch.

In [None]:
import sys
from transformers import AutoTokenizer, AutoConfig
import tensorflow as tf
from tensorflow.keras.layers import Embedding

sys.path.append('../modules/')

%load_ext autoreload
%autoreload 2

## Tokenization

An attention layer works with __token embeddings__ as the input, so we need to start by tokenizing the input text and creating the vector embeddings.

In [None]:
# Instantiate a tokenizer from a model.
model_ckpt = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

Test.

In [None]:
# Test the tokenizer.
test_text = """
I know all about the honor of God, Mary Jane.
"""

test_output = tokenizer(
    test_text,
    return_tensors='tf',
    padding=True,
    # In this case we exclude the special tokens ([CLS] and [SEP] for this model).
    add_special_tokens=False
)

test_output

In [None]:
test_output['input_ids'].numpy()

In [None]:
tokenizer.convert_ids_to_tokens(test_output['input_ids'][0, :])

## Creation of embeddings

Create the word embeddings (vectors) from the tokenized text.

Keras' `Embedding` layer maps non-negative integer indices (here, the token IDs) to dense vectors of fixed size.

__Notes:__
- At this point the token embeddings know nothing about the context: a given token always gets the same embedding, __irrespective of the context__ (i.e. the embedding operation is a deterministic lookup). The attention layer is there precisely to modify the embeddings so that they include context-dependent information.
- We skip positional encoding in this first step for simplicity, but positional information should be added to the token embeddings at this point!
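
To make the first note concrete, here is a minimal sketch of an embedding lookup. It uses plain NumPy rather than Keras, and the table `embedding_table` and the toy sizes are hypothetical stand-ins for the trainable weights of the `Embedding` layer.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, hidden_dim = 10, 4  # toy sizes, not those of the pretrained model

# Hypothetical embedding table: one row (vector) per token ID.
embedding_table = rng.normal(size=(vocab_size, hidden_dim))

# Two "sentences" sharing token ID 7 at different positions and
# with different neighbours.
token_ids = np.array([[7, 2, 5],
                      [1, 7, 3]])

# The lookup is plain row indexing.
# Output shape: (batch_size, seq_len, hidden_dim).
embeddings = embedding_table[token_ids]

# Token 7 gets exactly the same vector in both sentences.
same_vector = np.array_equal(embeddings[0, 0], embeddings[1, 1])
print(embeddings.shape, same_vector)
```

Whatever surrounds token 7, its vector is the same row of the table, which is why a context-mixing step such as attention is needed afterwards.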

In [None]:
# Load the configuration parameters of the pretrained model.
config = AutoConfig.from_pretrained(model_ckpt)

config

In [None]:
# Initialize the embedding layer.
token_emb = Embedding(
    input_dim=config.vocab_size,  # We could have used tokenizer.vocab_size, it's the same.
    output_dim=config.hidden_size
)

token_emb

In [None]:
# Test the creation of embedding for some tokenized text.
# Output shape: (batch_size, seq_len, hidden_dim).
test_embeddings = token_emb(test_output['input_ids'])

test_embeddings

### Add positional encoding

We now add positional encoding to the embeddings, so that each embedding also contains information about the position of the corresponding token in the sequence.
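
The `Embeddings` class imported below is not shown in this notebook, so the following is only a sketch of one common approach: a second, learned lookup table indexed by position, whose vectors are summed with the token embeddings. The names `token_table`, `position_table` and `embed` are hypothetical, and a real layer might use sinusoidal encodings instead, or add layer normalization and dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, max_positions, hidden_dim = 10, 8, 4  # toy sizes

# Hypothetical lookup tables; both would be trainable in a real model.
token_table = rng.normal(size=(vocab_size, hidden_dim))
position_table = rng.normal(size=(max_positions, hidden_dim))


def embed(token_ids):
    """Sum token and position embeddings.

    Input shape: (batch_size, seq_len).
    Output shape: (batch_size, seq_len, hidden_dim).
    """
    seq_len = token_ids.shape[-1]
    position_ids = np.arange(seq_len)
    # position_table[position_ids] has shape (seq_len, hidden_dim) and is
    # broadcast over the batch axis.
    return token_table[token_ids] + position_table[position_ids]


# The same token ID (7) appears at positions 0 and 2.
token_ids = np.array([[7, 2, 7]])
embeddings = embed(token_ids)

# Unlike plain token embeddings, the two occurrences now differ.
print(np.allclose(embeddings[0, 0], embeddings[0, 2]))
```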

In [None]:
from encoder import Embeddings

Test.

In [None]:
text = [
    "Six o' clock on the Christmas morning...",
    "...and for what?"
]

test_token_ids = tokenizer(
    text,
    return_tensors='tf',
    padding=True,
    add_special_tokens=True
)['input_ids']

In [None]:
embedding_layer = Embeddings(config=config)

test_embeddings = embedding_layer(test_token_ids)

test_embeddings

## A basic self-attention mechanism

We reproduce the basic operations for a single-head attention layer, acting on the test embeddings obtained above.

### Creation of query, key and value vectors

For simplicity, we take the query, key and value vectors associated with each token embedding to be equal to the token embedding itself (and thus also equal to one another). This need not be the case: in general, three independent, __trainable__ weight matrices are applied to the token embeddings to obtain the query, key and value vectors.
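
For comparison, the general case could be sketched as follows in plain NumPy. The matrices `w_q`, `w_k` and `w_v` are hypothetical and random here (biases omitted); in a real layer they would be trainable weights.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, seq_len, embed_dim, head_dim = 2, 5, 8, 4  # toy sizes

# Stand-in for the token embeddings.
x = rng.normal(size=(batch_size, seq_len, embed_dim))

# Hypothetical projection matrices, one per role.
w_q = rng.normal(size=(embed_dim, head_dim))
w_k = rng.normal(size=(embed_dim, head_dim))
w_v = rng.normal(size=(embed_dim, head_dim))

# The same matrices are applied to every position of every sequence.
# Output shapes: (batch_size, seq_len, head_dim).
query = x @ w_q
key = x @ w_k
value = x @ w_v

print(query.shape, key.shape, value.shape)
```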

In [None]:
query = test_embeddings
key = test_embeddings
value = test_embeddings

dim_k = key.shape[-1]

dim_k

### Attention scores

Given an input, the attention scores (not the weights yet!) are computed as the dot product of each query vector with each key vector. This measures the similarity (relevance) of each key w.r.t. each query.
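
As a sanity check of what "each query with each key" means, this NumPy sketch (toy shapes, random data) compares the batched matrix product with explicit loops over batch, query position and key position.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, seq_len, dim_k = 2, 3, 4  # toy sizes
query = rng.normal(size=(batch_size, seq_len, dim_k))
key = rng.normal(size=(batch_size, seq_len, dim_k))

# Batched matrix product of the queries with the transposed keys.
# Output shape: (batch_size, seq_len, seq_len).
scores = query @ np.transpose(key, (0, 2, 1))

# The same numbers, computed one dot product at a time.
scores_loop = np.empty((batch_size, seq_len, seq_len))
for b in range(batch_size):
    for i in range(seq_len):      # query position
        for j in range(seq_len):  # key position
            scores_loop[b, i, j] = np.dot(query[b, i], key[b, j])

print(np.allclose(scores, scores_loop))
```

Entry `[b, i, j]` is therefore the score of key `j` for query `i` in sequence `b`.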

In [None]:
query.shape, key.shape

In [None]:
# Output shape: (batch_size, seq_len, seq_len).
scores = tf.matmul(
    query,
    # Swap only the last two axes of the key: tf.matmul treats the leading
    # (batch) axis as a batch dimension and multiplies the last two.
    tf.transpose(key, perm=(0, 2, 1))
)

scores

### Attention weights

Attention weights are obtained from attention scores by:
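1. Rescaling the scores by dividing them by $\sqrt{\text{hidden dim}}$ (more precisely, the dimension of the key vectors, which here equals the hidden dimension). This keeps the scores from growing too large, which would saturate the softmax and lead to very small gradients during training.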
2. Applying the `softmax` function to the last axis.
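
The effect of the rescaling can be illustrated with a small NumPy experiment on random vectors. The `softmax` helper and the sizes are assumptions made for this sketch only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


rng = np.random.default_rng(0)
seq_len, dim_k = 6, 768
query = rng.normal(size=(seq_len, dim_k))
key = rng.normal(size=(seq_len, dim_k))

# For unit-variance inputs the scores have a standard deviation
# of roughly sqrt(dim_k).
scores = query @ key.T

weights_unscaled = softmax(scores)
weights_scaled = softmax(scores / np.sqrt(dim_k))

# Largest weight in each row: values near 1 mean the softmax has
# saturated on a single key.
print(weights_unscaled.max(axis=-1).round(3))
print(weights_scaled.max(axis=-1).round(3))
```

Without rescaling, almost all of the weight tends to land on a single key, whereas the rescaled scores give a smoother distribution.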

In [None]:
weights = tf.math.softmax(
    scores / tf.sqrt(tf.cast(dim_k, tf.float32)),
    axis=-1
)

weights

Check: summing the entries of each row (i.e. over the last axis) should give a value close to 1.

In [None]:
tf.reduce_sum(weights, axis=-1)

### Output of the self-attention layer

The output of the layer is a linear combination of the value vectors with weights given by the attention weights.
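
Putting the three steps together, a single-head attention function might look like the NumPy sketch below. The name `attention_sketch` is hypothetical, and this is not necessarily how the `scaled_dot_product_attention` helper used later in the notebook is written.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def attention_sketch(query, key, value):
    """Single-head scaled dot-product attention.

    Shapes: query (batch, seq_q, dim_k), key (batch, seq_k, dim_k),
    value (batch, seq_k, dim_v). Returns (batch, seq_q, dim_v).
    """
    dim_k = key.shape[-1]
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(dim_k)
    weights = softmax(scores, axis=-1)
    # Each output vector is a weighted average of the value vectors.
    return weights @ value


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
output = attention_sketch(x, x, x)

# With all-zero queries every score is 0, the weights are uniform and
# each output is the plain average of the value vectors.
uniform_output = attention_sketch(np.zeros_like(x), x, x)
print(output.shape, np.allclose(uniform_output, x.mean(axis=1, keepdims=True)))
```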

In [None]:
# Output shape: (batch_size, seq_len, value_size).
test_attention_output = tf.matmul(weights, value)

test_attention_output

## Multi-headed attention

Implement a multi-head attention layer.
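
The classes imported in this section live in a separate module, so the following NumPy sketch only illustrates the general idea under stated assumptions: each head has its own (hypothetical, random) projection matrices, attends independently, and the head outputs are concatenated and mixed by an output projection `w_out`.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention over the last two axes."""
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(key.shape[-1])
    return softmax(scores, axis=-1) @ value


rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim, n_heads = 2, 5, 8, 2  # toy sizes
head_dim = embed_dim // n_heads  # integer division: sizes must be whole numbers

x = rng.normal(size=(batch_size, seq_len, embed_dim))

# Hypothetical weights: one (w_q, w_k, w_v) triple per head.
head_weights = [
    tuple(rng.normal(size=(embed_dim, head_dim)) for _ in range(3))
    for _ in range(n_heads)
]
w_out = rng.normal(size=(embed_dim, embed_dim))

# Each head attends independently in its own lower-dimensional subspace.
head_outputs = [
    attention(x @ w_q, x @ w_k, x @ w_v)
    for w_q, w_k, w_v in head_weights
]

# Concatenate along the feature axis, then mix the heads.
concatenated = np.concatenate(head_outputs, axis=-1)
mha_output = concatenated @ w_out

print(head_outputs[0].shape, mha_output.shape)
```

Choosing `head_dim = embed_dim // n_heads` keeps the concatenated output at the same width as the input, so the layer preserves the shape of the embeddings.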

In [None]:
from utils import scaled_dot_product_attention

Define a single attention head layer.

In [None]:
from encoder import AttentionHead

Test.

In [None]:
n_heads = 2

att_head = AttentionHead(
    embed_dim=test_embeddings.shape[-1],
    # Integer division, so that the head dimension is an int rather than a float.
    head_dim=test_embeddings.shape[-1] // n_heads
)

att_head(test_embeddings)

Define a multi-head attention layer.

In [None]:
from encoder import MultiHeadAttention

Test.

In [None]:
mah_layer = MultiHeadAttention(config=config)

mah_layer(test_embeddings)

## Final feed-forward (FFN) layer

The FFN layer is a fully-connected feed-forward layer placed after the MHA layer. It is a __position-wise feed-forward layer__, i.e. it processes each token embedding output by the MHA layer __independently of the others__.
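
A position-wise layer can be sketched in NumPy as two dense maps acting on the last axis only. The weights, the 4x expansion and the GELU activation below are assumptions for illustration (they are common choices), not necessarily what the `FeedForward` class uses.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(0)
batch_size, seq_len, hidden_dim = 2, 5, 8  # toy sizes
intermediate_dim = 4 * hidden_dim

# Hypothetical weights; trainable in a real layer.
w1 = rng.normal(size=(hidden_dim, intermediate_dim))
b1 = np.zeros(intermediate_dim)
w2 = rng.normal(size=(intermediate_dim, hidden_dim))
b2 = np.zeros(hidden_dim)


def feed_forward(x):
    """Expand, apply the non-linearity, project back (last axis only)."""
    return gelu(x @ w1 + b1) @ w2 + b2


x = rng.normal(size=(batch_size, seq_len, hidden_dim))
output = feed_forward(x)

# Position-wise: changing the vector at position 0 leaves the outputs
# at all the other positions untouched.
x_modified = x.copy()
x_modified[:, 0, :] += 1.0
output_modified = feed_forward(x_modified)
print(np.allclose(output[:, 1:], output_modified[:, 1:]))
```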

In [None]:
from encoder import FeedForward

Test.

In [None]:
feed_forward = FeedForward(config=config)

feed_forward(
    mah_layer(test_embeddings)
)

## Layer normalization and skip connection: building the full encoder layer

The full encoder layer will have both an MHA and an FFN layer, but on top of these will also include __layer normalization__ and __skip connections__.

Layer normalization can be __pre-layer__ or __post-layer__, depending on where the normalization is placed relative to the skip connections. We'll implement __pre-layer normalization__, which tends to be more stable during training.

__Note:__ the input and output shapes of the encoder layer are __the same__ - the operations performed do not alter the shape, but rather add contextual information to each embedding.
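
The difference between the two arrangements can be sketched in a few lines of NumPy. Here `sublayer` is a hypothetical stand-in for either the MHA or the FFN layer, and `layer_norm` omits the learned scale and shift for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each vector along the last axis (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_ln_block(x, sublayer):
    """Pre-layer normalization: normalize, apply the sublayer, add the skip."""
    return x + sublayer(layer_norm(x))


def post_ln_block(x, sublayer):
    """Post-layer normalization: apply the sublayer, add the skip, normalize."""
    return layer_norm(x + sublayer(x))


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
w = rng.normal(size=(8, 8))


def sublayer(h):
    """Hypothetical stand-in for MHA or FFN: a single linear map."""
    return h @ w


pre_out = pre_ln_block(x, sublayer)
post_out = post_ln_block(x, sublayer)

# Both variants preserve the input shape.
print(pre_out.shape, post_out.shape)
```

In the pre-layer variant the skip path carries `x` through unchanged, so the block reduces to the identity whenever the sublayer outputs zeros.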

In [None]:
from encoder import TransformerEncoderLayer

Test.

In [None]:
encoder_layer = TransformerEncoderLayer(config=config)

encoder_layer(test_embeddings)

## Building the full transformer encoder as a stack of encoder layers

The encoder part of a transformer is a stack (sequence) of encoder layers, as defined above, that the data passes through one after the other. Let's build it, including the embedding part as well: the embeddings are trainable and are therefore trained together with the rest of the model, so it makes sense to treat them as a single unit.
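
Structurally, the stack amounts to an embedding lookup followed by a loop over layers. The NumPy sketch below uses a hypothetical `make_layer` that returns a simple residual map in place of a real encoder layer, just to show the control flow and the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_layers = 10, 8, 3  # toy sizes

# Hypothetical embedding table (positional information omitted here).
embedding_table = rng.normal(size=(vocab_size, hidden_dim))


def make_layer():
    """Stand-in for an encoder layer: a shape-preserving residual map.

    A real layer would contain MHA, FFN, layer normalization and
    skip connections.
    """
    w = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

    def layer(x):
        return x + np.tanh(x @ w)

    return layer


layers = [make_layer() for _ in range(n_layers)]


def encode(token_ids):
    """Embed the token IDs, then apply every layer in order."""
    x = embedding_table[token_ids]
    for layer in layers:
        x = layer(x)
    return x


token_ids = np.array([[1, 4, 2],
                      [3, 3, 0]])
hidden_states = encode(token_ids)

# Output shape: (batch_size, seq_len, hidden_dim).
print(hidden_states.shape)
```

Because every layer preserves the shape of its input, any number of layers can be stacked without further adjustments.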

In [None]:
from encoder import TransformerEncoder

Test.

In [None]:
encoder = TransformerEncoder(config=config)

encoder(test_token_ids)