# 17 Implementing the Transformer Encoder in Keras

In [2]:
from tensorflow.keras.layers import LayerNormalization, Layer, Dense, ReLU, Dropout
from xformer.multihead_attention import MultiHeadAttention
from xformer.positional_encoding import CustomEmbeddingWithFixedPosnWts

## 17.1 Recap of the Transformer Encoder

Recall that the encoder block is a stack of N identical layers. Each layer contains a multi-head self-attention sub-layer, which we covered in detail in Ch. 16. We now fill in the remaining important details.

1. Multi-head self-attention is one of _two_ sub-layers in each encoder layer. The _other_ sub-layer is a fully-connected feed-forward network.
2. Each of the two sub-layers is followed by a normalization layer, which first adds the sub-layer's output to the sub-layer's input (forming what we call a "residual connection") and then normalizes the result.
3. Regularization is performed by applying dropout to the output of each sub-layer right before the add-and-normalize step, as well as to the positionally-encoded embeddings right before they are fed into the encoder.
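
Putting the three points together, the output of each sub-layer can be summarized in one line. This is only a compact restatement of the list above, where $\text{Sublayer}$ stands for either the self-attention or the feed-forward function and $x$ is that sub-layer's input:

$$
\text{output} = \text{LayerNorm}\big(x + \text{Dropout}(\text{Sublayer}(x))\big)
$$

Because $x$ and $\text{Sublayer}(x)$ are summed, both sub-layers must produce outputs with the same dimensionality as their inputs, namely $d_{model}$.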

## 17.2 Implementing the Transformer Encoder from Scratch

Note: We will reuse the multi-head attention and the positional embedding logic we implemented in previous chapters.

### The Feedforward Network and Layer Normalization

In AIAYN, this is simply two fully-connected (a.k.a. linear) layers with a ReLU activation in between. The first layer's output has dimensionality $d_{ff}=2048$, and the second layer projects it back to $d_{model}=512$.
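
Written as a formula, using the weight and bias symbols from AIAYN, the feed-forward network applied to the vector $x$ at a single position is:

$$
\text{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2, \qquad W_1 \in \mathbb{R}^{d_{model} \times d_{ff}},\; W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}
$$

The same weights are applied to every position in the sequence independently. With the sizes quoted above, one such network has $2 \cdot 512 \cdot 2048 + 2048 + 512 = 2{,}099{,}712$ trainable parameters.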

In [4]:
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super().__init__(**kwargs)
        self.fully_connected_1 = Dense(d_ff) # First fully-connected layer
        self.fully_connected_2 = Dense(d_model) # Second fully-connected layer
        self.activation = ReLU() # ReLU activation layer to come in between
        
    def call(self, x):
        # The input is passed into the two fully-connected layers, with a ReLU in between
        fc1_output = self.fully_connected_1(x)
        fc2_output = self.fully_connected_2(self.activation(fc1_output))
        return fc2_output

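Next, we define our "Add & Norm" layer, which adds the residual connection and then applies [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf). Layer normalization is similar in many ways to [Batch Normalization](https://arxiv.org/pdf/1502.03167.pdf) but should not be confused with it: it normalizes each sample's activations across the feature dimension rather than across the batch, which helps keep training stable.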
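
To make the difference from batch normalization concrete, here is a minimal pure-Python sketch of the normalization step. It is an illustration only, not the Keras implementation: the function name `layer_norm` is hypothetical, and the learnable scale and offset parameters that Keras's `LayerNormalization` also applies are left out.

```python
import math

def layer_norm(features, epsilon=1e-3):
    """Normalize one token's feature vector to (roughly) zero mean and unit variance.

    Simplified sketch: no learnable scale/offset. The small epsilon guards
    against division by zero when all features are equal.
    """
    mean = sum(features) / len(features)
    variance = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(variance + epsilon) for f in features]

# Each row is one token. Statistics are computed per row (across features),
# never across rows, so the result does not depend on the batch contents.
batch = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0]]
normalized = [layer_norm(row) for row in batch]
```

Although the second row is ten times larger than the first, both rows end up with nearly identical normalized values, since each row is rescaled using only its own mean and variance.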

In [3]:
class AddAndNorm(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.layer_norm = LayerNormalization() # Layer normalization layer
        
    def call(self, x, sublayer_x):
        # Note: The sublayer's input and output need to be of the same shape to be summable
        add = x + sublayer_x
        # Apply layer normalization to the sum
        return self.layer_norm(add)        

### The Encoder Layer

Next, we will define what an encoder layer looks like. **Note:** I may have used the term "encoder block" elsewhere. Going forward, I will try to stay consistent and use "encoder layer". Just picture AIAYN's block diagram and recall that they stack N=6 of these to form their transformer's encoder "block". But we'll get to that in the next section.  
The `training` flag in the `call()` function is there so that we don't perform dropout regularization during testing and inference.  
The `padding_mask` argument, as explained in previous chapters, keeps the zero-padding tokens in the input sequences from being processed along with the valid input tokens.
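
The effect of the `training` flag can be illustrated with a small pure-Python sketch of "inverted" dropout, the scheme in which surviving values are scaled up by `1 / (1 - rate)` during training. The function `dropout` below is a hypothetical stand-in written for this illustration, not the Keras layer, and the rate of 0.1 is just an example value.

```python
import random

def dropout(values, rate, training, rng=random):
    """Inverted dropout: zero each value with probability `rate` while training."""
    if not training or rate == 0.0:
        # Testing/inference: pass the input through unchanged
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)  # Rescale survivors to preserve the expected sum
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)  # Seeded generator so the sketch is reproducible
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, rate=0.1, training=True, rng=rng)
infer_out = dropout(activations, rate=0.1, training=False, rng=rng)
```

With `training=False` the activations come back untouched, which is the behavior we want at test and inference time. With `training=True` each value is either zeroed or scaled up.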

In [5]:
class EncoderLayer(Layer):
    def __init__(self, n_heads, d_model, d_ff, dropout_rate, **kwargs):
        super().__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(n_heads, d_model)
        self.dropout1 = Dropout(dropout_rate)
        self.add_norm1 = AddAndNorm()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(dropout_rate)
        self.add_norm2 = AddAndNorm()
    
    def call(self, x, padding_mask, training):
        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)
        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)
        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, d_model)
        # Followed by the feed-forward network
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, d_model)
        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)
        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)

### The Transformer Encoder