# Transformer from Scratch

**Author:** Deeepwin, Jason Brownlee [(Machine Learning Mastery)](https://machinelearningmastery.com)   
**Date created:** 04.11.2022 <br>
**Last modified:** 05.11.2022 <br>
**Description:** An implementation of a Transformer architecture from scratch using Keras and TensorFlow

***

## Setup

In [20]:
import os
import numpy as np
import tensorflow as tf

from tensorflow import math, matmul, reshape, shape, transpose, cast, float32, linalg, ones, maximum, newaxis
from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, LayerNormalization, Layer, ReLU, Dropout, TextVectorization, Embedding, Input
 
from keras.backend import softmax

In [21]:
np.set_printoptions(linewidth=160)

# disable GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

## Configuration

In [22]:
num_heads       = 8         # Number of self-attention heads
d_k             = 64        # Dimensionality of the linearly projected queries and keys
d_v             = 64        # Dimensionality of the linearly projected values
dense_dim       = 2048      # Dimensionality of the inner fully connected layer
embed_dim       = 512       # Dimensionality of the model sub-layers' outputs
num_layers      = 2         # Number of layers in the encoder stack
 
dropout_rate    = 0.1       # Frequency of dropping the input units in the dropout layers

vocab_size      = 50        # Vocabulary size
sequence_length = 4        # Maximum length of the input sequence

batch_size      = 32        # Batch size of the training process

## Attention Layer

<p align="center">
  <img width="600" src="pics/score_matrix.png">
</p>

You may note that the scaled dot-product attention can also apply a mask to the attention scores before feeding them into the softmax function. 

Since the word embeddings are zero-padded to a specific sequence length, a ***padding mask needs to be introduced in order to prevent the zero tokens from being processed*** along with the input in both the encoder and decoder stages. Furthermore, a look-ahead mask is also required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.

These look-ahead and padding masks are applied inside the scaled dot-product attention, where they set to -infinity all the values in the input to the softmax function that should not be considered (in practice a large negative number such as -1e9 is used). For each of these large negative inputs, the softmax function will, in turn, produce an output value that is close to zero, effectively masking them out. The use of these masks will become clearer when you progress to the implementation of the encoder and decoder blocks later in this notebook.
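
The effect of the mask on the softmax can be checked in isolation. The following is a minimal NumPy sketch, not the Keras layer itself; the helper name `masked_softmax` and the example scores are assumptions made for illustration.

```python
import numpy as np

def masked_softmax(scores, mask=None):
    """Softmax over the last axis; positions where mask == 1 are suppressed."""
    if mask is not None:
        # Same trick as in the attention layer: push masked scores towards -infinity
        scores = scores + (-1e9) * mask
    # Subtract the row-wise maximum for numerical stability
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up attention scores for two query positions over four key positions
scores = np.array([[0.5, 0.2, 0.9, 0.1],
                   [0.7, 0.6, 0.3, 0.8]])

# 1 marks a position to ignore, 0 a position to keep
mask = np.array([[0., 1., 1., 1.],
                 [0., 0., 1., 1.]])

weights = masked_softmax(scores, mask)
print(weights.round(3))
```

Each row still sums to one, but the masked positions receive a weight of zero, so the first row puts all of its attention on the first key.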

<p align="center">
  <img width="460" src="pics/dot_product.png">
</p>

Source: [Link](https://machinelearningmastery.com/how-to-implement-scaled-dot-product-attention-from-scratch-in-tensorflow-and-keras)

Attention masking is the feature that allows the transformer to train without recurrence. Every prefix of the target sequence is masked and processed in parallel, and for each prefix the target sequence shifted by one position serves as the ground truth (y_true).

<p align="center">
  <img width="460" src="pics/masking.png">
</p>
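
The relation between the decoder input, the shifted target and the look-ahead mask can be sketched with NumPy. The token ids below are hypothetical (1 stands for a start token, 2 for an end token), and the variable names are chosen for illustration only.

```python
import numpy as np

# Hypothetical token ids: 1 = <start>, 2 = <end>, the rest are words
sequence = np.array([1, 5, 6, 7, 2])

decoder_input = sequence[:-1]   # [1, 5, 6, 7]
y_true = sequence[1:]           # [5, 6, 7, 2], the input shifted by one position

n = len(decoder_input)
# A 1 above the diagonal marks a future position that must not be attended to
lookahead = np.triu(np.ones((n, n)), k=1)

# Every row of the mask is one "training example" that is processed in parallel
for t in range(n):
    visible = decoder_input[lookahead[t] == 0]
    print(f"step {t}: sees {visible.tolist()} -> target {y_true[t]}")
```

All four rows are evaluated in a single forward pass, which is what replaces the step-by-step processing of a recurrent model during training.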

In [23]:
# Implementing the Scaled-Dot Product Attention
class DotProductAttention(Layer):

    def __init__(self, **kwargs):
    
        super(DotProductAttention, self).__init__(**kwargs)
 
    def call(self, queries, keys, values, d_k, mask=None):                                  # queries = keys = values = (32, 8, 20, 8), all same values
                                                                                            # mask = (32, 1, 20, 20)
        # Scoring the queries against the keys after transposing the latter, and scaling    
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))    # (32, 8, 20, 20), multiplication on last two dimensions if rank >= 2 
                                                                                            # computing matrix dot product (20, 8) * (8, 20) = (20, 20)

        # Apply mask to the attention scores. The mask will contain either 0 values to indicate that the corresponding token 
        # in the input sequence should be considered in the computations or a 1 to indicate otherwise. The mask will be multiplied 
        # by -1e9 to set the 1 values to large negative numbers subsequently applied to the attention scores:                                            
        if mask is not None:            # scores = (32, 8, 20, 20); worked example with sequence_length=4:
            scores += -1e9 * mask       # mask                (-1e9*mask)                             (scores + -1e9*mask)
                                        # [[0., 1., 1., 1.],  [[-0.e+00, -1.e+09, -1.e+09, -1.e+09],  [[ 5.4317653e-01, -1.0000000e+09, -1.0000000e+09, -1.0000000e+09],
                                        #  [0., 0., 1., 1.],   [-0.e+00, -0.e+00, -1.e+09, -1.e+09],   [ 6.7063296e-01,  5.7927263e-01, -1.0000000e+09, -1.0000000e+09],
                                        #  [0., 0., 0., 1.],   [-0.e+00, -0.e+00, -0.e+00, -1.e+09],   [ 8.5004723e-01,  7.3447722e-01,  6.0847449e-01, -1.0000000e+09],
                                        #  [0., 0., 0., 0.]]   [-0.e+00, -0.e+00, -0.e+00, -0.e+00]]   [ 9.2180628e-01,  7.9571950e-01,  6.6288424e-01,  5.6855118e-01]]
        weights = softmax(scores)                                                           # (32, 8, 20, 20)
 
        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)                                                      # (32, 8, 20, 8), computing (20, 20) * (20, 8) = (20, 8)

<p align="center">
  <img width="600" src="pics/multi_head.png">
</p>

Next, we will reshape the linearly projected queries, keys, and values so that all attention heads can be computed in parallel. This differs from what the picture above shows.

The queries, keys, and values will be fed as input into the multi-head attention block having a shape of (batch size, sequence length, model dimensionality), where the batch size is a hyperparameter of the training process, the sequence length defines the maximum length of the input/output phrases, and the model dimensionality is the dimensionality of the outputs produced by all sub-layers of the model. They are then passed through the respective dense layer to be linearly projected to a shape of (batch size, sequence length, queries/keys/values dimensionality).

The linearly projected queries, keys, and values will be rearranged into (batch size, number of heads, sequence length, depth), by first reshaping them into (batch size, sequence length, number of heads, depth) and then transposing the second and third dimensions. For this purpose, you will create the class method, reshape_tensor, as follows:

In [26]:
# Implementing the Multi-Head Attention
class MultiHeadAttention(Layer):

    def __init__(self, num_heads, d_k, d_v, embed_dim, **kwargs):
    
        super(MultiHeadAttention, self).__init__(**kwargs)

        self.attention = DotProductAttention()      # Scaled dot product attention
        self.heads = num_heads                      # Number of attention heads to use
        self.d_k = d_k                              # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v                              # Dimensionality of the linearly projected values
        self.embed_dim = embed_dim                  # Dimensionality of the model
        self.W_q = Dense(d_k)                       # Learned projection matrix for the queries
        self.W_k = Dense(d_k)                       # Learned projection matrix for the keys
        self.W_v = Dense(d_v)                       # Learned projection matrix for the values
        self.W_o = Dense(embed_dim)                 # Learned projection matrix for the multi-head output
 
    def reshape_tensor(self, x, heads, flag):
        if flag:
            # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
            x = transpose(x, perm=(0, 2, 1, 3))
        else:
            # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_v)
            # This branch is only applied to the attention output, whose concatenated width is d_v
            x = transpose(x, perm=(0, 2, 1, 3))
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], self.d_v))
        return x
 
    def call(self, queries, keys, values, mask=None):                           # queries = keys = values = (32, 20, 512), all same values
                                                                                # mask = (32, 1, 20, 20)
        # Rearrange the queries to be able to compute all heads in parallel     
        out = self.W_q(queries)                                                 # (32, 20, 64), dense layer with (512, 64) kernel operating along
                                                                                #               axis 2, which means the kernel is shared between word embeddings (20)
        q_reshaped = self.reshape_tensor(out, self.heads, True)                 # (32, 8, 20, 8)
        
        # Rearrange the keys to be able to compute all heads in parallel
        out = self.W_k(keys)                                                    # (32, 20, 64)
        k_reshaped = self.reshape_tensor(out, self.heads, True)                 # (32, 8, 20, 8)
 
        # Rearrange the values to be able to compute all heads in parallel
        out = self.W_v(values)                                                  # (32, 20, 64)
        v_reshaped = self.reshape_tensor(out, self.heads, True)                 # (32, 8, 20, 8)
 
        # Compute the multi-head attention output using the reshaped queries, keys and values
        o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask) # (32, 8, 20, 8)
 
        # Rearrange back the output into concatenated form
        out = self.reshape_tensor(o_reshaped, self.heads, False)                # (32, 20, 64)
 
        # Apply one final linear projection to the output to generate the multi-head attention
        out = self.W_o(out)                                                     # (32, 20, 512)
        return out

Source: [Link](https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras)
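
The reshape and transpose steps used to split a projection into heads, and to merge the heads again, can be verified with plain NumPy. This is a sketch under assumed sizes (`proj_dim = 64` stands for the projected dimensionality, split into 8 heads of depth 8); it mirrors the idea of reshape_tensor but is not the layer itself.

```python
import numpy as np

batch, seq_len, heads, proj_dim = 2, 4, 8, 64

# Stand-in for a linearly projected tensor of shape (batch, seq_len, proj_dim)
x = np.arange(batch * seq_len * proj_dim, dtype=np.float32).reshape(batch, seq_len, proj_dim)

# Split: (batch, seq, proj) -> (batch, seq, heads, depth) -> (batch, heads, seq, depth)
split = x.reshape(batch, seq_len, heads, -1).transpose(0, 2, 1, 3)

# Merge: undo the transpose, then collapse heads and depth into one axis again
merged = split.transpose(0, 2, 1, 3).reshape(batch, seq_len, -1)

print(split.shape)    # (2, 8, 4, 8)
print(merged.shape)   # (2, 4, 64)

# The round trip restores the original tensor
print(np.array_equal(x, merged))                     # True

# Head h holds feature columns h*depth ... (h+1)*depth - 1 of every token
print(np.array_equal(split[0, 1], x[0, :, 8:16]))    # True
```

Because each head is just a contiguous slice of the projected features, a single batched matmul over the last two axes computes the attention of all heads at once.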

### Test

In [27]:
from numpy import random
 
queries     = random.random((batch_size, sequence_length, d_k))
keys        = random.random((batch_size, sequence_length, d_k))
values      = random.random((batch_size, sequence_length, d_v))
 
output = MultiHeadAttention(num_heads, d_k, d_v, embed_dim)(queries, keys, values)
output.numpy().shape

(32, 4, 512)

## Embedding Layer

The Embedding layer converts integers to dense vectors. By default, the layer maps these integers to randomly initialized vectors that are tuned during training. In this notebook both embeddings are instead initialized with fixed sinusoidal weights and frozen (trainable=False).

We use two embeddings: one for the word indices produced by TextVectorization (word to integer) and one for the position indices. In a transformer model, the output of this layer is the element-wise sum of the word embeddings and the position embeddings.
        
See [(Link)](https://machinelearningmastery.com/the-transformer-positional-encoding-layer-in-keras-part-2/)
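
The sinusoidal table and the element-wise sum can be sketched without Keras. The function `sinusoidal_encoding` below is a hypothetical vectorized counterpart of a loop over positions and feature pairs, it assumes an even embedding width, and the token ids are example values.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d, n=10000):
    """Return a (seq_len, d) table: sine on even columns, cosine on odd columns."""
    positions = np.arange(seq_len)[:, np.newaxis]    # (seq_len, 1)
    i = np.arange(d // 2)[np.newaxis, :]             # (1, d/2)
    angles = positions / np.power(n, 2 * i / d)      # (seq_len, d/2)
    table = np.zeros((seq_len, d))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

vocab_size, seq_len, d = 50, 4, 4
word_table = sinusoidal_encoding(vocab_size, d)      # one row per token id
pos_table = sinusoidal_encoding(seq_len, d)          # one row per position

# Example token ids for two sentences, zero-padded to seq_len
token_ids = np.array([[5, 6, 7, 2],
                      [3, 4, 2, 0]])

# Table lookup by fancy indexing, then element-wise sum (positions broadcast over the batch)
embedded = word_table[token_ids] + pos_table[np.arange(seq_len)]
print(embedded.shape)   # (2, 4, 4)
```

Position 0 always maps to alternating zeros and ones (sin 0 = 0, cos 0 = 1), which is a quick sanity check for the table.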

In [28]:
class PositionEmbeddingFixedWeights(Layer):

    def __init__(self, sequence_length, vocab_size, output_dim, **kwargs):
    
        super(PositionEmbeddingFixedWeights, self).__init__(**kwargs)

        word_embedding_matrix = self.get_position_encoding(vocab_size, output_dim)   
        position_embedding_matrix = self.get_position_encoding(sequence_length, output_dim)     

        # Embedding for the text vectorization
        self.word_embedding_layer = Embedding(
            input_dim=vocab_size, output_dim=output_dim,
            weights=[word_embedding_matrix],
            trainable=False
        )
        # Embedding for the indices (positions)
        self.position_embedding_layer = Embedding(
            input_dim=sequence_length, output_dim=output_dim,
            weights=[position_embedding_matrix],
            trainable=False
        )
             
    def get_position_encoding(self, seq_len, d, n=10000):
        P = np.zeros((seq_len, d))
        for k in range(seq_len):
            for i in np.arange(int(d/2)):
                denominator = np.power(n, 2*i/d)
                P[k, 2*i] = np.sin(k/denominator)
                P[k, 2*i+1] = np.cos(k/denominator)
        return P
 
    def call(self, inputs):                                                  # inputs = (32, 20)
        position_indices = tf.range(tf.shape(inputs)[-1])                    # (20), vector with values 0 ... 19
        embedded_words = self.word_embedding_layer(inputs)                   # (32, 20, 512)
        embedded_indices = self.position_embedding_layer(position_indices)   # (20, 512)
        return embedded_words + embedded_indices                             # (32, 20, 512) embeddings value are added element-wise

### Test

In [29]:
sentences = [["I am a robot"], ["you too robot"]]
sentence_data = tf.data.Dataset.from_tensor_slices(sentences)

# Create the TextVectorization layer
vectorize_layer = TextVectorization(max_tokens=vocab_size,
                                    output_sequence_length=sequence_length)

# Train the layer to create a dictionary
vectorize_layer.adapt(sentence_data)

# Convert all sentences to tensors
word_tensors = tf.convert_to_tensor(sentences, dtype=tf.string)

# Use the word tensors to get vectorized phrases
vectorized_words = vectorize_layer(word_tensors)
print("Vocabulary: ", vectorize_layer.get_vocabulary())
print("Vectorized words: ", vectorized_words.numpy())

Vocabulary:  ['', '[UNK]', 'robot', 'you', 'too', 'i', 'am', 'a']
Vectorized words:  [[5 6 7 2]
 [3 4 2 0]]


In [30]:
embedding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, sequence_length)
output = embedding(vectorized_words)
output.numpy().shape

(2, 4, 4)

## Encoder

In [31]:
# Implementing the Add & Norm Layer
class AddNormalization(Layer):

    def __init__(self, **kwargs):
    
        super(AddNormalization, self).__init__(**kwargs)
        self.layer_norm = LayerNormalization()  # Layer normalization layer
 
    def call(self, x, sublayer_x):
    
        # The sublayer input and output need to be of the same shape to be summed
        add = x + sublayer_x
 
        # Apply layer normalization to the sum
        return self.layer_norm(add)
 
# Implementing the Feed-Forward Layer
class FeedForward(Layer):
    
    def __init__(self, dense_dim, embed_dim, **kwargs):
        super(FeedForward, self).__init__(**kwargs)
        self.fully_connected1 = Dense(dense_dim)    # First fully connected layer
        self.fully_connected2 = Dense(embed_dim)    # Second fully connected layer
        self.activation = ReLU()                    # ReLU activation layer
 
    def call(self, x):
        
        # The input is passed into the two fully-connected layers, with a ReLU in between
        x_fc1 = self.fully_connected1(x)
 
        return self.fully_connected2(self.activation(x_fc1))
 
# Implementing the Encoder Layer
class EncoderLayer(Layer):
    def __init__(self, sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, rate, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.build(input_shape=[None, sequence_length, embed_dim])
        self.sequence_length = sequence_length
        self.embed_dim = embed_dim
        self.multihead_attention = MultiHeadAttention(num_heads, d_k, d_v, embed_dim)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.feed_forward = FeedForward(dense_dim, embed_dim)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
 
    def build_graph(self):
        input_layer = Input(shape=(self.sequence_length, self.embed_dim))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, None, True))

    def call(self, x, padding_mask, training):

        # Multi-head attention layer
        multihead_output = self.multihead_attention(x, x, x, padding_mask)
        # Expected output shape = (batch_size, sequence_length, embed_dim)
 
        # Add in a dropout layer
        multihead_output = self.dropout1(multihead_output, training=training)
 
        # Followed by an Add & Norm layer
        addnorm_output = self.add_norm1(x, multihead_output)
        # Expected output shape = (batch_size, sequence_length, embed_dim)
 
        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output)
        # Expected output shape = (batch_size, sequence_length, embed_dim)
 
        # Add in another dropout layer
        feedforward_output = self.dropout2(feedforward_output, training=training)
 
        # Followed by another Add & Norm layer
        return self.add_norm2(addnorm_output, feedforward_output)
 
# Implementing the Encoder
class Encoder(Layer):

    def __init__(self, vocab_size, sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, num_layers, rate, **kwargs):

        super(Encoder, self).__init__(**kwargs)
        
        self.pos_encoding   = PositionEmbeddingFixedWeights(sequence_length, vocab_size, embed_dim)
        self.dropout        = Dropout(rate)
        self.encoder_layer  = [EncoderLayer(sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, rate) for _ in range(num_layers)]
 
    def call(self, input_sentence, padding_mask, training):
        
        # Generate the positional encoding
        pos_encoding_output = self.pos_encoding(input_sentence)
        # Expected output shape = (batch_size, sequence_length, embed_dim)
 
        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)
 
        # Pass on the positional encoded values to each encoder layer
        for i, layer in enumerate(self.encoder_layer):
            x = layer(x, padding_mask, training)
 
        return x


Source: [Link](https://machinelearningmastery.com/implementing-the-transformer-encoder-from-scratch-in-tensorflow-and-keras)
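
The Add & Norm step used in the encoder layer above can be illustrated with NumPy. This is a rough sketch rather than the Keras implementation: it omits the learned scale and offset of LayerNormalization, and the helper name `add_and_norm` and the `epsilon` value are assumptions.

```python
import numpy as np

def add_and_norm(x, sublayer_x, epsilon=1e-3):
    """Residual connection followed by normalization over the last (feature) axis."""
    summed = x + sublayer_x                          # shapes must match to be summed
    mean = summed.mean(axis=-1, keepdims=True)
    var = summed.var(axis=-1, keepdims=True)
    return (summed - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))            # (batch, sequence, features), made-up sizes
sublayer_x = rng.normal(size=(2, 4, 8))   # stand-in for the attention or feed-forward output

out = add_and_norm(x, sublayer_x)
print(out.shape)                                   # (2, 4, 8)
print(np.abs(out.mean(axis=-1)).max() < 1e-6)      # True: every token vector is centred
```

Each token vector is normalized on its own, independently of the other tokens and of the batch, which is why the same layer works for any sequence length.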

### Test

In [32]:
enc_input = random.randint(1, vocab_size, size=(batch_size, sequence_length))   # Integer token ids in [1, vocab_size), as an Embedding layer expects
enc_input.shape

(32, 4)

In [33]:
encoder = Encoder(vocab_size, sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, num_layers, dropout_rate)
output = encoder(enc_input, None, True)
output.numpy().shape

(32, 4, 512)

## Decoder

In [34]:
# Implementing the Decoder Layer
class DecoderLayer(Layer):
    def __init__(self, sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, rate, **kwargs):
        
        super(DecoderLayer, self).__init__(**kwargs)

        self.build(input_shape=[None, sequence_length, embed_dim])
        self.sequence_length = sequence_length
        self.embed_dim = embed_dim
        self.multihead_attention1 = MultiHeadAttention(num_heads, d_k, d_v, embed_dim)
        self.dropout1 = Dropout(rate)
        self.add_norm1 = AddNormalization()
        self.multihead_attention2 = MultiHeadAttention(num_heads, d_k, d_v, embed_dim)
        self.dropout2 = Dropout(rate)
        self.add_norm2 = AddNormalization()
        self.feed_forward = FeedForward(dense_dim, embed_dim)
        self.dropout3 = Dropout(rate)
        self.add_norm3 = AddNormalization()

    def build_graph(self):
        input_layer = Input(shape=(self.sequence_length, self.embed_dim))
        return Model(inputs=[input_layer], outputs=self.call(input_layer, input_layer, None, None, True))
        
    def call(self, x, encoder_output, lookahead_mask, padding_mask, training):

        # Multi-head attention layer                                            # x = encoder_output = (32, 20, 512)
                                                                                # lookahead_mask = (32, 1, 20, 20)
                                                                                # padding_mask = (32, 1, 1, 20)
        multihead_output1 = self.multihead_attention1(x, x, x, lookahead_mask)  # (32, 20, 512)

        # Add in a dropout layer
        multihead_output1 = self.dropout1(multihead_output1, training=training) # (32, 20, 512)
 
        # Followed by an Add & Norm layer
        addnorm_output1 = self.add_norm1(x, multihead_output1)                  # (32, 20, 512)

        # Followed by another multi-head attention layer                        
        multihead_output2 = self.multihead_attention2(addnorm_output1,          # (32, 20, 512)
                                                      encoder_output, 
                                                      encoder_output, 
                                                      padding_mask)
        # Add in another dropout layer
        multihead_output2 = self.dropout2(multihead_output2, training=training) # (32, 20, 512)
 
        # Followed by another Add & Norm layer
        addnorm_output2 = self.add_norm2(addnorm_output1, multihead_output2)    # (32, 20, 512)
 
        # Followed by a fully connected layer
        feedforward_output = self.feed_forward(addnorm_output2)                 # (32, 20, 512)

 
        # Add in another dropout layer
        feedforward_output = self.dropout3( feedforward_output,                 # (32, 20, 512) 
                                            training=training) 
 
        # Followed by another Add & Norm layer
        return self.add_norm3(addnorm_output2, feedforward_output)              # (32, 20, 512) 
 
# Implementing the Decoder
class Decoder(Layer):

    def __init__(self, vocab_size, sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, num_layers, rate, **kwargs):
    
        super(Decoder, self).__init__(**kwargs)
    
        self.pos_encoding = PositionEmbeddingFixedWeights(sequence_length, vocab_size, embed_dim)
        self.dropout = Dropout(rate)
        self.decoder_layer = [DecoderLayer(sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, rate) for _ in range(num_layers)]
 
    def call(self, output_target, encoder_output, lookahead_mask, padding_mask, training):

        # Generate the positional encoding                                          # output_target = (32, 20)
        pos_encoding_output = self.pos_encoding(output_target)                      # (32, 20, 512)
 
        # Add in a dropout layer
        x = self.dropout(pos_encoding_output, training=training)                    # (32, 20, 512)
 
        # Pass on the positional encoded values to each decoder layer
        for i, layer in enumerate(self.decoder_layer):                              # encoder_output = (32, 20, 512)
                                                                                    # lookahead_mask = (32, 1, 20, 20)
            x = layer(x, encoder_output, lookahead_mask, padding_mask, training)    # (32, 20, 512)
 
        return x                                                                    # (32, 20, 512)

Source: [Link](https://machinelearningmastery.com/implementing-the-transformer-decoder-from-scratch-in-tensorflow-and-keras)   

### Test

In [35]:
dec_input = random.randint(1, vocab_size, size=(batch_size, sequence_length))   # Integer token ids in [1, vocab_size)
dec_input.shape
enc_output = random.random((batch_size, sequence_length, embed_dim))
enc_output.shape

(32, 4, 512)

In [36]:
decoder = Decoder(vocab_size, sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, num_layers, dropout_rate)
output = decoder(dec_input, enc_output, None, None, True)   # lookahead_mask=None, padding_mask=None, training=True
output.numpy().shape

(32, 4, 512)

## Transformer Model

In [37]:

class TransformerModel(Model):
    
    def __init__(self, vocab_size, sequence_length, num_heads, d_k, d_v, embed_dim, d_ff_inner, num_layers, rate, **kwargs):

        super(TransformerModel, self).__init__(**kwargs)
 
        # Set up the encoder
        self.encoder = Encoder(vocab_size, sequence_length, num_heads, d_k, d_v, embed_dim, d_ff_inner, num_layers, rate)
 
        # Set up the decoder
        self.decoder = Decoder(vocab_size, sequence_length, num_heads, d_k, d_v, embed_dim, d_ff_inner, num_layers, rate)
 
        # Define the final dense layer
        self.model_last_layer = Dense(vocab_size)
 
    def call(self, encoder_input, decoder_input, training):                         # encoder_input = decoder_input = (32, 20), training = False
 
        # Create padding mask to mask the encoder inputs and the encoder outputs in the decoder
        enc_padding_mask = self.padding_mask(encoder_input)                         # (32, 1, 1, 20)
 
        # Create and combine padding and look-ahead masks to be fed into the decoder
        dec_in_padding_mask = self.padding_mask(decoder_input)                      # (32, 1, 1, 20)
        dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])         # (20, 20)   
        dec_in_lookahead_mask = maximum(dec_in_padding_mask, dec_in_lookahead_mask) # (32, 1, 20, 20), returns maximum value element-wise
                                                                                    #                  masks get broadcasted along column, example:
                                                                                    # padding               lookahead               max()
                                                                                    # [[[0., 0., 1., 0.]]]  [[[0., 1., 1., 1.],     [[0., 1., 1., 1.],
                                                                                    #                         [0., 0., 1., 1.],      [0., 0., 1., 1.],
                                                                                    #                         [0., 0., 0., 1.],      [0., 0., 1., 1.],
                                                                                    #                         [0., 0., 0., 0.]]]     [0., 0., 1., 0.]]
        # Feed the input into the encoder
        encoder_output = self.encoder(encoder_input, enc_padding_mask, training)
 
        # Feed the encoder output into the decoder
        decoder_output = self.decoder(decoder_input, encoder_output, dec_in_lookahead_mask, enc_padding_mask, training) # (32, 20, 512)
 
        # Pass the decoder output through a final dense layer
        model_output = self.model_last_layer(decoder_output)                        # (32, 20, 50)
 
        return model_output

    def padding_mask(self, input):                                                  # input = (32, 20)

        # Create mask which marks the zero padding values in the input by a 1.0
        mask = math.equal(input, 0)                                                 # returns boolean value of (x == y) element-wise, finds 0
        mask = cast(mask, float32)                                                  # (32, 20)
 
        # The shape of the mask should be broadcastable to the shape
        # of the attention weights that it will be masking later on
        return mask[:, newaxis, newaxis, :]                                         # (32, 1, 1, 20)
 
    def lookahead_mask(self, shape):                                                # shape = 20
        
        # Mask out future entries by marking them with a 1.0
        mask = 1 - linalg.band_part(ones((shape, shape)), -1, 0)                    # creates upper triangle tensor with 0 diagonal, example shape=4 
                                                                                    #   [[0., 1., 1., 1.],
                                                                                    #    [0., 0., 1., 1.],
                                                                                    #    [0., 0., 0., 1.],
                                                                                    #    [0., 0., 0., 0.]]
        return mask                                                                 # (20, 20)

Source: [Link](https://machinelearningmastery.com/joining-the-transformer-encoder-and-decoder-and-masking/)
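
The way the padding mask and the look-ahead mask are combined can be reproduced with NumPy broadcasting. This is a sketch with hypothetical token ids (the third token is padding); the two helpers only imitate the corresponding model methods.

```python
import numpy as np

def padding_mask(token_ids):
    """Mark zero-padded tokens with 1.0; shape (batch, 1, 1, seq)."""
    mask = (token_ids == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

def lookahead_mask(size):
    """Mark future positions with 1.0; strictly upper triangular, shape (seq, seq)."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

# Hypothetical batch with one sentence whose third token is padding
token_ids = np.array([[5, 6, 0, 2]])

# (1, 1, 1, 4) and (4, 4) broadcast to (1, 1, 4, 4); a position is masked if either mask says so
combined = np.maximum(padding_mask(token_ids), lookahead_mask(4))

print(combined.shape)   # (1, 1, 4, 4)
print(combined[0, 0])
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 1. 0.]]
```

Taking the element-wise maximum acts as a logical OR on the 0/1 masks, so the padded column stays masked in every row, including rows where the look-ahead mask alone would allow it.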

## Transformer

In [38]:
# build transformer model
transformer = TransformerModel(vocab_size, sequence_length, num_heads, d_k, d_v, embed_dim, dense_dim, num_layers, dropout_rate)

### Test

In [None]:
output = transformer(enc_input, dec_input, training=False)
output.numpy().shape