<a id='top'></a><a name='top'></a>
# Chapter 19: Joining the Transformer Encoder and Decoder with Masking

* [Introduction](#introduction)
* [19.0 Imports and Setup](#19.0)
* [19.1 Recap of the Transformer Architecture](#19.1)
* [19.2 Masking](#19.2)
* [19.3 Joining the Transformer Encoder and Decoder](#19.3)
* [19.4 Creating an Instance of the Transformer Model](#19.4)
* [Extra](#extra)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Dataset

* TODO

### Explore
* How to create a padding mask for the encoder and decoder
* How to create a look-ahead mask for the decoder
* How to join the transformer encoder and decoder into a single model
* How to print out a summary for the encoder and decoder layers

---
<a name='19.0'></a><a id='19.0'></a>
# 19.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
req_file = "requirements_19.txt"

In [3]:
%%writefile {req_file}
isort
scikit-learn-intelex
watermark

Overwriting requirements_19.txt


In [4]:
import sys
IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

# Need to import before sklearn
from sklearnex import patch_sklearn
patch_sklearn()

Running locally.


Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [3]:
%%writefile imports.py
import locale
import math
import pprint
import warnings

import numpy as np
import tensorflow as tf
from tensorflow import math, cast, float32, linalg, ones
from tensorflow.keras import Model
from tensorflow.keras.activations import softmax
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import LayerNormalization
from tensorflow.keras.layers import ReLU
from tqdm.auto import tqdm
from watermark import watermark

Overwriting imports.py


In [4]:
!isort imports.py --sl
!cat imports.py

Fixing /Users/gb/Desktop/jb_transformer_nb/chp19_join_encoder_decoder_masking/imports.py
import locale
import math
import pprint

import numpy as np
import tensorflow as tf
from tensorflow import cast
from tensorflow import float32
from tensorflow import linalg
from tensorflow import math
from tensorflow import ones
from tensorflow.keras import Model
from tensorflow.keras.activations import softmax
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import LayerNormalization
from tensorflow.keras.layers import ReLU
from tqdm.auto import tqdm
from watermark import watermark


In [5]:
import locale
import math
import pprint
import warnings

import numpy as np
import tensorflow as tf
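from tensorflow import cast
from tensorflow import float32
from tensorflow import linalg
from tensorflow import math  # tf.math; shadows the standard-library math imported above
from tensorflow import matmul
from tensorflow import maximum
from tensorflow import ones
from tensorflow import reshape
from tensorflow import shape
from tensorflow import transpose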
from tensorflow.keras import Model
from tensorflow.keras.activations import softmax
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Layer
from tensorflow.keras.layers import LayerNormalization
from tensorflow.keras.layers import ReLU
from tqdm.auto import tqdm
from watermark import watermark

In [6]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('default')
BASE_DIR = '.'
pp = pprint.PrettyPrinter(indent=4)

seed = 42

print(watermark(iversions=True,globals_=globals(),python=True,machine=True))

Python implementation: CPython
Python version       : 3.8.12
IPython version      : 7.34.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.3)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

numpy     : 1.23.5
tensorflow: 2.9.3



# Previous Code

In [7]:
# positional_encoding.py
class PositionEmbeddingFixedWeights(Layer):
    def __init__(self, seq_length, vocab_size, output_dim, **kwargs):
        super().__init__(**kwargs)
        word_embedding_matrix = self.get_position_encoding(vocab_size, output_dim)
        pos_embedding_matrix = self.get_position_encoding(seq_length, output_dim)
        self.word_embedding_layer = Embedding(
            input_dim=vocab_size, output_dim=output_dim,
            weights=[word_embedding_matrix],
            trainable=False
        )
        self.position_embedding_layer = Embedding(
            input_dim=seq_length, output_dim=output_dim,
            weights=[pos_embedding_matrix],
            trainable=False
        )

    def get_position_encoding(self, seq_len, d, n=10000):
        P = np.zeros((seq_len, d))
        for k in range(seq_len):
            for i in np.arange(int(d/2)):
                denominator = np.power(n, 2*i/d)
                P[k, 2*i] = np.sin(k/denominator)
                P[k, 2*i+1] = np.cos(k/denominator)
        return P


    def call(self, inputs):
        position_indices = tf.range(tf.shape(inputs)[-1])
        embedded_words = self.word_embedding_layer(inputs)
        embedded_indices = self.position_embedding_layer(position_indices)
        return embedded_words + embedded_indices


In [8]:
# Implementing the Scaled Dot-Product Attention
# multihead_attention.py
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask

        # Computing the weights by a softmax operation
        weights = softmax(scores)

        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)

# Implementing the Multi-Head Attention
class MultiHeadAttention(Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super().__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot product attention
        self.heads = h  # Number of attention heads to use
        self.d_k = d_k  # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v  # Dimensionality of the linearly projected values
        self.d_model = d_model  # Dimensionality of the model
        self.W_q = Dense(d_k)   # Learned projection matrix for the queries
        self.W_k = Dense(d_k)   # Learned projection matrix for the keys
        self.W_v = Dense(d_v)   # Learned projection matrix for the values
        self.W_o = Dense(d_model) # Learned projection matrix for the multi-head output

    def reshape_tensor(self, x, heads, flag):
        if flag:
            # Tensor shape after reshaping and transposing:
            # (batch_size, heads, seq_length, -1)
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
            x = transpose(x, perm=(0, 2, 1, 3))
        else:
            # Reverting the reshaping and transposing operations:
            # (batch_size, seq_length, d_v)
            x = transpose(x, perm=(0, 2, 1, 3))
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], self.d_v))
        return x

    def call(self, queries, keys, values, mask=None):
        # Rearrange the queries to be able to compute all heads in parallel
        q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the keys to be able to compute all heads in parallel
        k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange the values to be able to compute all heads in parallel
        v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Compute the multi-head attention output using the reshaped queries,
        # keys, and values
        o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)

        # Rearrange back the output into concatenated form
        output = self.reshape_tensor(o_reshaped, self.heads, False)
        # Resulting tensor shape: (batch_size, input_seq_length, d_v)

        # Apply one final linear projection to the output to generate the multi-head
        # attention. Resulting tensor shape: (batch_size, input_seq_length, d_model)
        return self.W_o(output)

In [9]:
# encoder.py
# Implementing the Add & Norm Layer
class AddNormalization(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer

    def call(self, x, sublayer_x):
        # The sublayer input and output need to be of the same shape to be summed
        add = x + sublayer_x

        # Apply layer normalization to the sum
        return self.layer_norm(add)

# Implementing the Feed-Forward Layer
class FeedForward(Layer):
    def __init__(self, d_ff, d_model, **kwargs):
        super().__init__(**kwargs)
        self.fully_connected1 = Dense(d_ff)  # First fully connected layer
        self.fully_connected2 = Dense(d_model)  # Second fully connected layer
        self.activation = ReLU()  # ReLU activation layer

    def call(self, x):
        # The input is passed into the two fully-connected layers, with a ReLU in between
        x_fc1 = self.fully_connected1(x)

        return self.fully_connected2(self.activation(x_fc1))

# Implementing the Encoder Layer
class EncoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super().__init__(**kwargs)
        self.multihead_attention = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()

    def call(self, x, padding_mask, training):
        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)

        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)

# Implementing the Encoder
class Encoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate,
                       **kwargs):
        super().__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size,
                                                          d_model)
        self.dropout = Dropout(rate)
        self.encoder_layer = [EncoderLayer(h, d_k, d_v, d_model, d_ff, rate)
                              for _ in range(n)]

    def call(self, input_sentence, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(input_sentence)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each encoder layer
        for i, layer in enumerate(self.encoder_layer):
            x = layer(x, padding_mask, training)

        return x

In [10]:
# decoder.py
# Implementing the Decoder Layer
class DecoderLayer(Layer):
    def __init__(self, h, d_k, d_v, d_model, d_ff, rate, **kwargs):
        super().__init__(**kwargs)
        self.multihead_attention1 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(h, d_k, d_v, d_model)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(d_ff, d_model)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()

    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):
        # Multi-head attention layer
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training)

        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Followed by another multi-head attention layer
        multihead_output2 = self.multihead_attention2(addnorm_output1, encoder_output,
                                                      encoder_output, padding_mask)

        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training)

        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)

        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)
        # Expected output shape = (batch_size, sequence_length, d_model)

        # Add in another dropout layer
        feedforward_output = self.dropout3(feedforward_output, training=training)

        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)

# Implementing the Decoder
class Decoder(Layer):
    def __init__(self, vocab_size, sequence_length, h, d_k, d_v, d_model, d_ff, n, rate,
                       **kwargs):
        super().__init__(**kwargs)
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size,
                                                          d_model)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(h, d_k, d_v, d_model, d_ff, rate)
                              for _ in range(n)]

    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(output_target)
        # Expected output shape = (number of sentences, sequence_length, d_model)

        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)

        # Pass on the positional encoded values to each decoder layer
        for i, layer in enumerate(self.decoder_layer):
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)

        return x

---
<a name='19.1'></a><a id='19.1'></a>
# 19.1 Recap of the Transformer Architecture
<a href="#top">[back to top]</a>

---
<a name='19.2'></a><a id='19.2'></a>
# 19.2 Masking
<a href="#top">[back to top]</a>

## Create a Padding Mask

In [11]:
# Generate a padding mask for both the encoder and decoder
def padding_mask(input):
    # Create mask which marks the zero padding values in the input by a 1
    mask = math.equal(input, 0)
    mask = cast(mask, float32)
    
    return mask

In [12]:
# Given an input, this function returns a tensor that holds a 1 wherever
# the input contains a zero (padding) value, and a 0 everywhere else.
input = np.array([1,2,3,4,0,0,0])

print(padding_mask(input))

tf.Tensor([0. 0. 0. 0. 1. 1. 1.], shape=(7,), dtype=float32)
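
To see why the mask marks padding with a 1, it helps to follow it into the attention step, where `DotProductAttention` adds `-1e9 * mask` to the scores before the softmax. The NumPy sketch below reproduces that step outside TensorFlow. The helper `masked_softmax` and the uniform raw scores are assumptions made for illustration, not part of the chapter's code.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis after pushing masked positions to -1e9."""
    scores = scores + (-1e9) * mask                        # same trick as DotProductAttention
    scores = scores - scores.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

tokens = np.array([1, 2, 3, 4, 0, 0, 0])       # 0 is the padding id
mask = (tokens == 0).astype(np.float32)        # 1.0 where padded, as in padding_mask()
scores = np.zeros((1, tokens.size))            # hypothetical, uniform raw scores
weights = masked_softmax(scores, mask)

# The four real tokens share the attention equally; the padded positions get zero weight
print(weights.round(2))
```

Because the masked scores become very large negative numbers, their exponentials underflow to zero, so padded positions contribute nothing to the weighted sum of the values.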




## Creating a Look-Ahead Mask for Decoder

In [13]:
def lookahead_mask(shape):
    num_lower = -1
    num_upper = 0
        
    # Mask out future entries by marking them with a 1.0
    mask = 1 - linalg.band_part(ones((shape, shape)), num_lower, num_upper)
    print(mask)
    HR()
    
    return mask

In [14]:
# Pass the length of the decoder output
# In this example, the length is 5
test = lookahead_mask(5)
print(test)

tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)
----------------------------------------
tf.Tensor(
[[0. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1.]
 [0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0.]], shape=(5, 5), dtype=float32)
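
The decoder needs both masks at once: a position must be hidden if it is padding or if it lies in the future. One way to combine them is an element-wise maximum, as sketched below in NumPy. The sketch assumes the padding mask is given two extra axes so that it broadcasts against a `(seq_length, seq_length)` look-ahead mask. The example token ids are hypothetical, and `np.triu(..., k=1)` stands in for `1 - linalg.band_part(ones, -1, 0)`.

```python
import numpy as np

def padding_mask(tokens):
    # 1.0 where the token id is 0 (padding), with two broadcast axes inserted
    return (tokens == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

def lookahead_mask(size):
    # 1.0 strictly above the diagonal: position i may not attend to positions j > i
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

decoder_input = np.array([[5, 7, 9, 0, 0]])    # hypothetical batch holding one padded sequence

# (1, 1, 1, 5) combined with (5, 5) broadcasts to (1, 1, 5, 5)
combined = np.maximum(padding_mask(decoder_input), lookahead_mask(5))

print(combined.shape)
print(combined[0, 0])
```

In the combined mask, every row also blocks the last two columns, so even the final decoder position cannot attend to the padded tokens.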


---
<a name='19.3'></a><a id='19.3'></a>
# 19.3 Joining the Transformer Encoder and Decoder
<a href="#top">[back to top]</a>

Create the class `TransformerModel`, which inherits from the `Model` base class in Keras.

In [15]:
# TransformerModel creates an Encoder and a Decoder, builds the masks they need,
# and projects the decoder output onto the target vocabulary.
class TransformerModel(Model):
    def __init__(self, enc_vocab_size, dec_vocab_size, enc_seq_length, dec_seq_length,
                 h, d_k, d_v, d_model, d_ff_inner, n, rate, **kwargs):
        super().__init__(**kwargs)

        # Set up the encoder
        self.encoder = Encoder(enc_vocab_size, enc_seq_length, h, d_k, d_v,
                               d_model, d_ff_inner, n, rate)

        # Set up the decoder
        self.decoder = Decoder(dec_vocab_size, dec_seq_length, h, d_k, d_v,
                               d_model, d_ff_inner, n, rate)

        # Define the final dense layer
        self.model_last_layer = Dense(dec_vocab_size)

    # Generate a padding mask for both the encoder and decoder
    def padding_mask(self, input):
        # Create mask which marks the zero padding values in the input by a 1
        mask = math.equal(input, 0)
        mask = cast(mask, float32)

        # Add two axes so the mask broadcasts against attention scores of shape
        # (batch_size, heads, seq_length, seq_length)
        return mask[:, tf.newaxis, tf.newaxis, :]

    def lookahead_mask(self, shape):
        num_lower = -1
        num_upper = 0

        # Mask out future entries by marking them with a 1.0
        mask = 1 - linalg.band_part(ones((shape, shape)), num_lower, num_upper)

        return mask

    # Feed the relevant inputs into the encoder and decoder
    def call(self, encoder_input, decoder_input, training):
        # Create padding mask to mask the encoder inputs and outputs in the decoder
        enc_padding_mask = self.padding_mask(encoder_input)

        # Create and combine padding and look-ahead masks to be fed into the decoder
        dec_in_padding_mask = self.padding_mask(decoder_input)
        dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])
        dec_in_lookahead_mask = tf.maximum(dec_in_padding_mask, dec_in_lookahead_mask)

        # Feed the input into the encoder
        encoder_output = self.encoder(encoder_input, enc_padding_mask, training)

        # Feed the encoder output into the decoder
        decoder_output = self.decoder(decoder_input, encoder_output,
                                      dec_in_lookahead_mask, enc_padding_mask, training)

        # Pass the decoder output through a final dense layer
        model_output = self.model_last_layer(decoder_output)

        return model_output

---
<a name='19.4'></a><a id='19.4'></a>
# 19.4 Creating an Instance of the Transformer Model
<a href="#top">[back to top]</a>

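Use the model hyperparameter values from the "Attention Is All You Need" paper. The vocabulary sizes and sequence lengths are small values used here only to instantiate the model.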

In [16]:
# Model hyperparameters:
h = 8 # Num of self-attention heads
d_k = 64 # Dim of linearly projected queries and keys
d_v = 64 # Dim of linearly projected values
d_ff = 2048 # Dim of inner fully connected layer
d_model = 512 # Dim of model sub-layer output
n = 6 # Num of layers in each of the encoder and decoder stacks
dropout_rate = 0.1 # Fraction of input units dropped in the dropout layers

# Input-related parameters:
enc_vocab_size = 20 # Vocab size for encoder
dec_vocab_size = 20 # Vocab size for decoder

enc_seq_length = 5 # Maximum length of the input sequence 
dec_seq_length = 5 # Maximum length of the target sequence


# Create an instance of the TransformerModel class
training_model = TransformerModel(
    enc_vocab_size, dec_vocab_size, enc_seq_length,
    dec_seq_length, h, d_k, d_v, d_model, d_ff, n, dropout_rate
)

In [17]:
training_model

<__main__.TransformerModel at 0x12ba04f10>
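
A quick arithmetic check can help when changing these hyperparameters. The sketch below is only an illustration of the shape arithmetic, with variable names chosen here. In the paper each head works on `d_model / h` dimensions. In this notebook `MultiHeadAttention` applies `Dense(d_k)` first and then reshapes the result into `h` heads, so each head receives `d_k / h` dimensions and the division must be exact.

```python
h, d_k, d_v, d_model, d_ff = 8, 64, 64, 512, 2048

# In the paper, each head works on d_model / h dimensions
paper_head_width = d_model // h

# In this notebook, Dense(d_k) / Dense(d_v) is followed by a reshape into h heads,
# so the projected width must split evenly across the heads
assert d_k % h == 0 and d_v % h == 0
notebook_head_width = d_k // h

print(paper_head_width, notebook_head_width)
```

With the values above, the paper's per-head width is 64, while the reshape in this notebook leaves 8 dimensions per head.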