# Machine Translation using a Transformer
This tutorial demonstrates how to build and train an English-to-Spanish [Transformer model](https://arxiv.org/abs/1706.03762) from a few parallel sentences. The Transformer is a state-of-the-art machine translation architecture that combines self-attention, which can learn both short-range and long-range relations, with feedforward networks, a design that lends itself to parallel computation. Compared to the [official TensorFlow tutorial](https://www.tensorflow.org/alpha/tutorials/text/transformer), this model requires less knowledge of TensorFlow-specific functions and avoids reshaping tensors.

(This Notebook is based on https://github.com/LastRemote/Transformer-TF2.0)

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 
import numpy as np
import unicodedata, re
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Lambda, Layer, Embedding, LayerNormalization

# Data Preprocessing

Let's start with some parallel text examples in English and Spanish. In real-life applications, the datasets are much bigger: GPT-3's training corpus, for example, contains roughly 500 billion tokens. For demonstration purposes, however, a few sentences are enough.

In [2]:
sentences = [
  ("Do you want a cup of coffee?", "¿Quieres una taza de café?"),
  ("I've had coffee already.", "Ya tomé café."),
  ("Can I get you a coffee?", "¿Quieres que te traiga un café?"),
  ("Please give me some coffee.", "Dame algo de café por favor."),
  ("Would you like me to make coffee?", "¿Quieres que prepare café?"),
  ("Two coffees, please.", "Dos cafés, por favor."),
  ("How about a cup of coffee?", "¿Qué tal una taza de café?"),
  ("I drank two cups of coffee.", "Me tomé dos tazas de café."),
  ("Would you like to have a cup of coffee?", "¿Te gustaría tomar una taza de café?"),
  ("There'll be coffee and cake at five.", "A las cinco habrá café y un pastel."),
  ("Another coffee, please.", "Otro café, por favor."),
  ("I made coffee.", "Hice café."),
  ("I would like to have a cup of coffee.", "Quiero beber una taza de café."),
  ("Do you want me to make coffee?", "¿Quieres que haga café?"),
  ("It is hard to wake up without a strong cup of coffee.", "Es difícil despertarse sin una taza de café fuerte."),
  ("All I drank was coffee.", "Todo lo que bebí fue café."),
  ("I've drunk way too much coffee today.", "He bebido demasiado café hoy."),
  ("Which do you prefer, tea or coffee?", "¿Qué prefieres, té o café?"),
  ("There are many kinds of coffee.", "Hay muchas variedades de café."),
  ("I will make some coffee.", "Prepararé algo de café.")
]

To prepare our sentences for training, we need to normalize them: we strip accents, add spaces around punctuation marks, collapse extra spaces, and replace any other special characters with a space. Additionally, we add \<start> and \<end> tokens to each sentence.

In [3]:
def preprocess(s):
    # Decompose accented characters (NFD) and drop the combining marks, e.g. "é" -> "e"
    s = ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
    s = re.sub(r"([?.!¡,¿])", r" \1 ", s) # Add spaces around punctuation marks
    s = re.sub(r'[" "]+', " ", s) # Collapse runs of spaces (and double quotes) into one space
    s = re.sub(r"[^a-zA-Z?.!¡,¿áéíóú¡üñ]+", " ", s) # Replace all other characters with a space
    s = s.strip()
    s = '<start> ' + s + ' <end>'
    return s

print("Original:", sentences[0])
sentences = [(preprocess(en), preprocess(es)) for (en, es) in sentences]
print("Preprocessed:", sentences[0])

Original: ('Do you want a cup of coffee?', '¿Quieres una taza de café?')
Preprocessed: ('<start> Do you want a cup of coffee ? <end>', '<start> ¿ Quieres una taza de cafe ? <end>')
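
To see what the accent-stripping step does in isolation, here is a minimal standard-library sketch. The helper name `strip_accents` and the sample words are illustrative assumptions and are not part of the notebook.

```python
import unicodedata

def strip_accents(text):
    # NFD splits each accented character into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', text)
    # Combining marks have the Unicode category 'Mn' (Mark, nonspacing); drop them
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

for word in ["café", "¿Qué", "té", "te"]:
    print(word, "->", strip_accents(word))
```

One side effect worth knowing: words that differ only by an accent, such as "té" (tea) and "te" (you), collapse into the same string and will later share one token.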


We then tokenize both source and target sentences into lists of integers and pad each sequence with zeros at the end so that all sequences have the same length.

In [4]:
source_sentences, target_sentences = list(zip(*sentences))

# num_words and oov_token are left unspecified because the dataset is tiny.
# For details, please visit https://keras.io/preprocessing/text/
source_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='') 
source_tokenizer.fit_on_texts(source_sentences)
source_data = source_tokenizer.texts_to_sequences(source_sentences)
print("Sequence:", source_data[0])
source_data = tf.keras.preprocessing.sequence.pad_sequences(source_data, padding='post')
print("Padded:", source_data[0])

target_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
target_tokenizer.fit_on_texts(target_sentences)
target_data = target_tokenizer.texts_to_sequences(target_sentences)
target_data = tf.keras.preprocessing.sequence.pad_sequences(target_data, padding='post')

Sequence: [1, 12, 8, 19, 9, 10, 6, 3, 7, 2]
Padded: [ 1 12  8 19  9 10  6  3  7  2  0  0  0  0  0]
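
The following plain-Python sketch illustrates the idea behind the two steps above: build a word index ordered by frequency, then post-pad with zeros. It is a simplified stand-in, not the Keras implementation; the function names `fit_word_index` and `to_padded_sequences` and the two sample sentences are assumptions made for illustration.

```python
from collections import Counter

def fit_word_index(texts):
    # Count lowercased, whitespace-separated tokens across all texts
    counts = Counter(token for text in texts for token in text.lower().split())
    # most_common() sorts by frequency; tokens with equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index):
    sequences = [[word_index[t] for t in text.lower().split()] for text in texts]
    max_len = max(len(seq) for seq in sequences)
    # Post-padding: append zeros; index 0 is reserved for padding
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

texts = ["<start> hola <end>", "<start> otro cafe <end>"]
word_index = fit_word_index(texts)
print(word_index)
print(to_padded_sequences(texts, word_index))
```

Because index 0 is reserved for padding, the vocabulary size passed to an embedding layer is the number of distinct words plus one.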


Machine translation models take the entire source sentence and an incomplete sentence in the target language as inputs, and predict the next word of the incomplete sentence.
We create labels for the decoder by shifting the target sequence one position to the left, so that the label at each position is the next target token.

In [5]:
target_labels = np.zeros(target_data.shape)
target_labels[:,0:target_data.shape[1] -1] = target_data[:,1:]

print("Target sequence", target_data[0])
print("Target label", target_labels[0])

source_vocab_len = len(source_tokenizer.word_index) + 1
target_vocab_len = len(target_tokenizer.word_index) + 1

print("Size of source vocabulary: ", source_vocab_len)
print("Size of target vocabulary: ", target_vocab_len)

Target sequence [ 1  6 11  9 10  5  3  7  2  0  0  0]
Target label [ 6. 11.  9. 10.  5.  3.  7.  2.  0.  0.  0.  0.]
Size of source vocabulary:  65
Size of target vocabulary:  60
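
To make the shifted labels concrete, the sketch below lists, for one target sequence, which prefix the decoder sees at each position and which token it has to predict. The function name `next_token_pairs` is hypothetical; unlike the notebook's labels, which keep the padded positions as zeros, the sketch stops at the first padding label for readability.

```python
def next_token_pairs(target_seq, pad_id=0):
    # Labels are the inputs shifted one position to the left, padded with pad_id
    labels = target_seq[1:] + [pad_id]
    pairs = []
    for t, label in enumerate(labels):
        if label == pad_id:
            break  # nothing meaningful to predict after <end>
        pairs.append((target_seq[:t + 1], label))
    return labels, pairs

target = [1, 6, 11, 9, 10, 5, 3, 7, 2, 0, 0, 0]
labels, pairs = next_token_pairs(target)
print(labels)
for prefix, label in pairs:
    print(prefix, "->", label)
```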


In [6]:
dataset = tf.data.Dataset.from_tensor_slices((source_data, target_data, target_labels))

# Transformer Structure
Next, we build the entire Transformer structure. It is actually not hard at all!

In this cell we define our model parameters; feel free to play around with them. More parameters generally increase the model's capacity and, given enough training data, its accuracy. However, large Transformers quickly become impossible to train on a consumer GPU because they no longer fit into memory.

For our few example sentences the defined parameters should work fine.

In [7]:
# Transformer parameters
d_model = 64 # 512 in the original paper
d_k = 16 # 64 in the original paper
d_v = 16 # 64 in the original paper
n_heads = 4 # 8 in the original paper
n_encoder_layers = 2 # 6 in the original paper
n_decoder_layers = 2 # 6 in the original paper

max_token_length = 20 # 512 in the original paper
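
As a rough feel for how these settings drive model size, the sketch below counts the trainable parameters of one multi-head attention layer. It assumes each head uses three dense projections with biases plus one shared output projection, which matches the attention layers defined later in this notebook; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    # Weight matrix plus bias vector of a dense layer
    return n_in * n_out + n_out

def multihead_attention_params(d_model, d_k, d_v, n_heads):
    # Query and key projections map to d_k, the value projection maps to d_v
    per_head = 2 * dense_params(d_model, d_k) + dense_params(d_model, d_v)
    # The final linear layer maps the concatenated heads back to d_model
    output_projection = dense_params(n_heads * d_v, d_model)
    return n_heads * per_head + output_projection

small = multihead_attention_params(d_model=64, d_k=16, d_v=16, n_heads=4)
paper = multihead_attention_params(d_model=512, d_k=64, d_v=64, n_heads=8)
print("This notebook:", small)
print("Paper-sized layer:", paper)
```

With the notebook's settings a single attention layer has about 17 thousand parameters, while the paper-sized layer has about one million, which illustrates how quickly memory requirements grow.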

## Transformer Attention

First, we will work on single-head attention. A single attention head takes three inputs, Query (q), Key (k) and Value (v), and finds a unidirectional connection from each of the query words to each of the key words. In the Transformer model, the key and value inputs are always the same.

Each of the query, key and value goes through a separate linear transform to a lower dimensionality, so that the concatenated output of all heads in multi-head attention stays small. Every linear layer in this Transformer model uses Xavier initialization ('glorot_uniform'). The output is then computed by a rather simple equation:

![Equation for Attention](https://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png)

Image source: Alammar, Jay (2018). The Illustrated Transformer. Retrieved from https://jalammar.github.io/illustrated-transformer/

When building decoder self-attention, we have to be a little careful, since the full target sentence is not available at inference time and has to be generated step by step. A query word therefore must not attend to a key word that has not been generated yet. Since the Transformer always predicts the next word given an incomplete sequence, we remove the attention from each query word to every word that appears later, which corresponds to the strict upper triangle (excluding the main diagonal) of the $ Q \times K^T $ matrix. In practice, we set the strict upper triangle of $ Q \times K^T $ to negative infinity, which becomes zero after the softmax.

In [8]:
class SingleHeadAttention(Layer):
    def __init__(self, input_shape=(3, -1, d_model), dropout=.0, masked=None):
        super(SingleHeadAttention, self).__init__()
        self.q = Dense(d_k, input_shape=(-1, d_model), kernel_initializer='glorot_uniform', 
                       bias_initializer='glorot_uniform')
        self.normalize_q = Lambda(lambda x: x / np.sqrt(d_k))
        self.k = Dense(d_k, input_shape=(-1, d_model), kernel_initializer='glorot_uniform', 
                       bias_initializer='glorot_uniform')
        self.v = Dense(d_v, input_shape=(-1, d_model), kernel_initializer='glorot_uniform', 
                       bias_initializer='glorot_uniform')
        self.dropout = dropout
        self.masked = masked

    # Inputs: [query, key, value]
    def call(self, inputs, training=None):
        assert len(inputs) == 3
        # We use a lambda layer to divide vector q by sqrt(d_k) according to the equation
        q = self.normalize_q(self.q(inputs[0]))
        k = self.k(inputs[1])
        
        # The dimensionality of q is (batch_size, query_length, d_k) and that of k is (batch_size, key_length, d_k)
        # So we will do a batched matrix multiplication after transposing the last 2 dimensions of k
        # tf.shape(attn_weights) = (batch_size, query_length, key_length)
        attn_weights = tf.matmul(q, tf.transpose(k, perm=[0,2,1]))
        
        if self.masked: # Prevent future attentions in decoding self-attention
            # Create a matrix where the strict upper triangle (not including the main diagonal) is -inf and everything else is 0
            # (assumes query_length == key_length, which holds for self-attention)
            length = tf.shape(attn_weights)[-1]
            attn_mask = tf.fill((length, length), -np.inf)
            attn_mask = tf.linalg.band_part(attn_mask, 0, -1) # Keep the upper triangle (including the diagonal), zero the rest
            attn_mask = tf.linalg.set_diag(attn_mask, tf.zeros((length))) # Zero the main diagonal so each position can attend to itself
            # This matrix is added to the attention weights so all future attention will have -inf logits (0 after softmax)
            attn_weights += attn_mask
        
        # Softmax along the last dimension
        attn_weights = tf.nn.softmax(attn_weights, axis=-1)
        
        if training: # Attention dropout included in the original paper. This is possibly to encourage multihead diversity.
            attn_weights = tf.nn.dropout(attn_weights, rate=self.dropout)
        v = self.v(inputs[2])
        
        return tf.matmul(attn_weights, v)
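
To check the masking logic without TensorFlow, here is a small dependency-free sketch of masked scaled dot-product attention, $ \mathrm{softmax}(Q K^T / \sqrt{d_k}) V $, for a single unbatched sequence. It omits the learned q, k and v projections and dropout; the function names and the toy inputs are assumptions made for illustration.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(q, k, v):
    d_k = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            if j > i:
                scores.append(-math.inf)  # future position: masked out
            else:
                dot = sum(a * b for a, b in zip(q_i, k_j))
                scores.append(dot / math.sqrt(d_k))
        weights.append(softmax(scores))
    # Each output row is the attention-weighted sum of the value vectors
    output = [[sum(w * v_j[c] for w, v_j in zip(row, v)) for c in range(len(v[0]))]
              for row in weights]
    return weights, output

q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
weights, output = masked_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Every weight above the main diagonal comes out as zero, so the first position can only attend to itself. The layer above reaches the same result by adding a matrix of -inf values to the logits before the softmax.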

Now let's combine several single-head attention layers with a final linear layer to build multi-head attention. There is no need to reshape!

In [9]:
class MultiHeadAttention(Layer):
    def __init__(self, dropout=.0, masked=None):
        super(MultiHeadAttention, self).__init__()
        self.attn_heads = list()
        for i in range(n_heads): 
            self.attn_heads.append(SingleHeadAttention(dropout=dropout, masked=masked))
        self.linear = Dense(d_model, input_shape=(-1, n_heads * d_v), kernel_initializer='glorot_uniform', 
                       bias_initializer='glorot_uniform')

    def call(self, x, training=None):
        attentions = [self.attn_heads[i](x, training=training) for i in range(n_heads)]
        concatenated_attentions = tf.concat(attentions, axis=-1)
        
        return self.linear(concatenated_attentions)

## Encoder and Decoder

This is the flowchart for the whole transformer architecture, where the encoder is the block to the left, and decoder is the block to the right. Note that since the output shape of either encoder or decoder is the same as its corresponding input shape, both the encoder unit and the decoder unit can be stacked.
![Transformer Architecture](https://www.tensorflow.org/images/tutorials/transformer/transformer.png)

We now present the Transformer encoder architecture. Each encoder unit has a multi-head self-attention (encoder-encoder) sublayer and a feedforward sublayer (two dense layers with a ReLU activation in between). Each sublayer is wrapped in a residual connection followed by layer normalization:

$$\Large{\mathit{LayerNorm}(x + \mathit{sublayer}(x))} $$

Dropout is applied to the output of each sublayer before it is added to the sublayer input and normalized.


In [10]:
class TransformerEncoder(Layer):
    def __init__(self, dropout=.1, attention_dropout=.0, **kwargs):
        super(TransformerEncoder, self).__init__(**kwargs)
        self.dropout_rate = dropout
        self.attention_dropout_rate = attention_dropout
        
    def build(self, input_shape):
        self.multihead_attention = MultiHeadAttention(dropout=self.attention_dropout_rate)
        self.dropout1 = tf.keras.layers.Dropout(self.dropout_rate)
        self.layer_normalization1 = LayerNormalization(input_shape=input_shape, epsilon=1e-6)

        self.linear1 = Dense(input_shape[-1] * 4, input_shape=input_shape, activation='relu',
                            kernel_initializer='glorot_uniform', bias_initializer='glorot_uniform')
        self.linear2 = Dense(input_shape[-1], input_shape=self.linear1.compute_output_shape(input_shape),
                            kernel_initializer='glorot_uniform', bias_initializer='glorot_uniform')
        self.dropout2 = tf.keras.layers.Dropout(self.dropout_rate)
        self.layer_normalization2 = LayerNormalization(input_shape=input_shape, epsilon=1e-6)
        super(TransformerEncoder, self).build(input_shape)
        
    def call(self, x, training=None):
        sublayer1 = self.multihead_attention((x, x, x), training=training)
        sublayer1 = self.dropout1(sublayer1, training=training)
        layernorm1 = self.layer_normalization1(x + sublayer1)

        sublayer2 = self.linear2(self.linear1(layernorm1))
        sublayer2 = self.dropout2(sublayer2, training=training)
        layernorm2 = self.layer_normalization2(layernorm1 + sublayer2)
        
        return layernorm2
    
    def compute_output_shape(self, input_shape):
        return input_shape
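
The residual pattern $ \mathit{LayerNorm}(x + \mathit{sublayer}(x)) $ used by each encoder sublayer can be sketched for a single feature vector in plain Python. This is a simplified illustration: it leaves out dropout and the learned scale and offset of the Keras layer (which start at 1 and 0), and `toy_sublayer` is a hypothetical stand-in for attention or the feedforward network.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize one feature vector to zero mean and (almost) unit variance
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def residual_block(x, sublayer):
    # Add the sublayer output to its input, then normalize the sum
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feedforward network
    return [0.5 * xi for xi in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
print([round(o, 4) for o in out])
```

Because the output has the same length as the input, blocks of this kind can be stacked.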

The decoder is constructed in the same fashion, except that there are three sublayers instead of two: a masked multihead self-attention layer (decoder-decoder), a multihead encoder attention layer (decoder-encoder) and a feedforward layer just like the one in an encoder unit.

In [11]:
class TransformerDecoder(Layer):
    def __init__(self, dropout=.0, attention_dropout=.0, **kwargs):
        super(TransformerDecoder, self).__init__(**kwargs)
        self.dropout_rate = dropout
        self.attention_dropout_rate = attention_dropout
        
    def build(self, input_shape):
        self.multihead_self_attention = MultiHeadAttention(dropout=self.attention_dropout_rate, masked=True)
        self.dropout1 = tf.keras.layers.Dropout(self.dropout_rate)
        self.layer_normalization1 = LayerNormalization(input_shape=input_shape, epsilon=1e-6)

        self.multihead_encoder_attention = MultiHeadAttention(dropout=self.attention_dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(self.dropout_rate)
        self.layer_normalization2 = LayerNormalization(input_shape=input_shape, epsilon=1e-6)

        self.linear1 = Dense(input_shape[-1] * 4, input_shape=input_shape, activation='relu',
                            kernel_initializer='glorot_uniform', bias_initializer='glorot_uniform')
        self.linear2 = Dense(input_shape[-1], input_shape=self.linear1.compute_output_shape(input_shape),
                            kernel_initializer='glorot_uniform', bias_initializer='glorot_uniform')
        self.dropout3 = tf.keras.layers.Dropout(self.dropout_rate)
        self.layer_normalization3 = LayerNormalization(input_shape=input_shape, epsilon=1e-6)
        super(TransformerDecoder, self).build(input_shape)
        
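    def call(self, x, hidden, training=None):
        # Masked self-attention over the target tokens generated so far
        sublayer1 = self.multihead_self_attention((x, x, x), training=training)
        sublayer1 = self.dropout1(sublayer1, training=training)
        layernorm1 = self.layer_normalization1(x + sublayer1)

        # Encoder attention: queries come from the decoder, keys and values from the encoder output
        sublayer2 = self.multihead_encoder_attention((layernorm1, hidden, hidden), training=training)
        sublayer2 = self.dropout2(sublayer2, training=training)
        layernorm2 = self.layer_normalization2(layernorm1 + sublayer2)

        # Feedforward sublayer applied to the output of the encoder attention sublayer
        sublayer3 = self.linear2(self.linear1(layernorm2))
        sublayer3 = self.dropout3(sublayer3, training=training)
        layernorm3 = self.layer_normalization3(layernorm2 + sublayer3)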
        
        return layernorm3
    
    def compute_output_shape(self, input_shape):
        return input_shape

## Positional Encoding

In the original Transformer implementation, the researchers used sinusoidal functions to compute a positional encoding that is added to the encoder and decoder word embeddings. This gives the model information about the position of each token. The functions look as follows:

 $$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})} $$
$$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})} $$ 

In [12]:
# Inspired by https://github.com/graykode/nlp-tutorial/blob/master/5-1.Transformer/Transformer_Torch.ipynb
class SinusoidalPositionalEncoding(Layer):
    def __init__(self):
        super(SinusoidalPositionalEncoding, self).__init__()
        self.sinusoidal_encoding = np.array([self.get_positional_angle(pos) for pos in range(max_token_length)], dtype=np.float32)
        self.sinusoidal_encoding[:, 0::2] = np.sin(self.sinusoidal_encoding[:, 0::2])
        self.sinusoidal_encoding[:, 1::2] = np.cos(self.sinusoidal_encoding[:, 1::2])
        self.sinusoidal_encoding = tf.cast(self.sinusoidal_encoding, dtype=tf.float32) # Casting the array to Tensor for slicing
    
    def call(self, x):
        return x + self.sinusoidal_encoding[:tf.shape(x)[1]]
        #return x + tf.slice(self.sinusoidal_encoding, [0, 0], [tf.shape(x)[1], d_model])
    
    def compute_output_shape(self, input_shape):
        return input_shape
    
    def get_angle(self, pos, dim):
        return pos / np.power(10000, 2 * (dim // 2) / d_model)
    
    def get_positional_angle(self, pos):
        return [self.get_angle(pos, dim) for dim in range(d_model)]
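
The same table can be computed in a few lines of plain Python, which makes the formula easy to inspect. This is a sketch with a hypothetical function name and small illustrative sizes, not a replacement for the layer above.

```python
import math

def positional_encoding(max_len, d_model):
    table = []
    for pos in range(max_len):
        row = []
        for dim in range(d_model):
            # Dimensions 2i and 2i+1 share the same angle
            angle = pos / (10000 ** (2 * (dim // 2) / d_model))
            # Even dimensions use the sine, odd dimensions the cosine
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=4, d_model=8)
for row in pe:
    print([round(value, 3) for value in row])
```

Position 0 always encodes to alternating zeros and ones, and all values stay within [-1, 1], so the encoding has a similar scale at every position.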

## Assembling the Full Architecture
Now we can build the full architecture of the Transformer using the positional encoding, encoder layers and decoder layers:

In [13]:
class Transformer(Model):
    def __init__(self, dropout=.1, attention_dropout=.0, **kwargs):
        super(Transformer, self).__init__(**kwargs)
        self.encoding_embedding = Embedding(source_vocab_len, d_model)
        self.decoding_embedding = Embedding(target_vocab_len, d_model)
        self.pos_encoding = SinusoidalPositionalEncoding()
        self.encoder = [TransformerEncoder(dropout=dropout, attention_dropout=attention_dropout) for i in range(n_encoder_layers)]
        self.decoder = [TransformerDecoder(dropout=dropout, attention_dropout=attention_dropout) for i in range(n_decoder_layers)]
        self.decoder_final = Dense(target_vocab_len, input_shape=(None, d_model))
    
    def call(self, inputs, training=None): # Source_sentence and decoder_input
        source_sentence, decoder_input = inputs
        embedded_source = self.encoding_embedding(source_sentence)
        encoder_output = self.pos_encoding(embedded_source)
        
        for encoder_unit in self.encoder:
            encoder_output = encoder_unit(encoder_output, training=training)

        embedded_target = self.decoding_embedding(decoder_input)
        decoder_output = self.pos_encoding(embedded_target)
        
        for decoder_unit in self.decoder:
            decoder_output = decoder_unit(decoder_output, encoder_output, training=training)
        
        if not training:
            # At inference time only the prediction for the last position is needed
            decoder_output = decoder_output[:, -1:, :]

        decoder_output = self.decoder_final(decoder_output)
        decoder_output = tf.nn.softmax(decoder_output, axis=-1)
        
        return decoder_output

# Training
This model can be trained in two ways: either using TensorFlow's GradientTape to update the model weights manually in a training function, or simply using the Keras model.fit() method. For simplicity, we will use the second method.

In [14]:
transformer = Transformer() # Instantiating a new transformer model
src_seqs, tgt_seqs, tgt_labels = zip(*dataset)
train = [tf.cast(src_seqs, dtype=tf.float32), tf.cast(tgt_seqs, dtype=tf.float32)] # Cast the tuples to tensors

transformer.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

transformer.fit(train, tf.cast(tgt_labels, dtype=tf.float32), verbose=2, batch_size=5, epochs=300)

Epoch 1/300
4/4 - 5s - loss: 3.7038 - accuracy: 0.2458 - 5s/epoch - 1s/step
Epoch 2/300
4/4 - 0s - loss: 3.1218 - accuracy: 0.3542 - 51ms/epoch - 13ms/step
Epoch 3/300
4/4 - 0s - loss: 2.9477 - accuracy: 0.3542 - 51ms/epoch - 13ms/step
Epoch 4/300
4/4 - 0s - loss: 2.7782 - accuracy: 0.3542 - 49ms/epoch - 12ms/step
Epoch 5/300
4/4 - 0s - loss: 2.6084 - accuracy: 0.3792 - 50ms/epoch - 12ms/step
Epoch 6/300
4/4 - 0s - loss: 2.5050 - accuracy: 0.4042 - 51ms/epoch - 13ms/step
Epoch 7/300
4/4 - 0s - loss: 2.4205 - accuracy: 0.4250 - 51ms/epoch - 13ms/step
Epoch 8/300
4/4 - 0s - loss: 2.3785 - accuracy: 0.4417 - 50ms/epoch - 13ms/step
Epoch 9/300
4/4 - 0s - loss: 2.3218 - accuracy: 0.4417 - 51ms/epoch - 13ms/step
Epoch 10/300
4/4 - 0s - loss: 2.2814 - accuracy: 0.4250 - 54ms/epoch - 13ms/step
Epoch 11/300
4/4 - 0s - loss: 2.2880 - accuracy: 0.4000 - 53ms/epoch - 13ms/step
Epoch 12/300
4/4 - 0s - loss: 2.2130 - accuracy: 0.4458 - 51ms/epoch - 13ms/step
Epoch 13/300
4/4 - 0s - loss: 2.1701 - [output truncated]

Epoch 103/300
4/4 - 0s - loss: 0.0520 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 104/300
4/4 - 0s - loss: 0.0486 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 105/300
4/4 - 0s - loss: 0.0437 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 106/300
4/4 - 0s - loss: 0.0432 - accuracy: 1.0000 - 53ms/epoch - 13ms/step
Epoch 107/300
4/4 - 0s - loss: 0.0399 - accuracy: 1.0000 - 52ms/epoch - 13ms/step
Epoch 108/300
4/4 - 0s - loss: 0.0401 - accuracy: 1.0000 - 50ms/epoch - 12ms/step
Epoch 109/300
4/4 - 0s - loss: 0.0396 - accuracy: 1.0000 - 50ms/epoch - 12ms/step
Epoch 110/300
4/4 - 0s - loss: 0.0445 - accuracy: 0.9958 - 49ms/epoch - 12ms/step
Epoch 111/300
4/4 - 0s - loss: 0.0357 - accuracy: 1.0000 - 55ms/epoch - 14ms/step
Epoch 112/300
4/4 - 0s - loss: 0.0377 - accuracy: 1.0000 - 55ms/epoch - 14ms/step
Epoch 113/300
4/4 - 0s - loss: 0.0340 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 114/300
4/4 - 0s - loss: 0.0328 - accuracy: 1.0000 - 50ms/epoch - 13ms/step
Epoch 115/300
[output truncated]

Epoch 203/300
4/4 - 0s - loss: 0.0066 - accuracy: 1.0000 - 52ms/epoch - 13ms/step
Epoch 204/300
4/4 - 0s - loss: 0.0067 - accuracy: 1.0000 - 50ms/epoch - 13ms/step
Epoch 205/300
4/4 - 0s - loss: 0.0067 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 206/300
4/4 - 0s - loss: 0.0067 - accuracy: 1.0000 - 52ms/epoch - 13ms/step
Epoch 207/300
4/4 - 0s - loss: 0.0063 - accuracy: 1.0000 - 52ms/epoch - 13ms/step
Epoch 208/300
4/4 - 0s - loss: 0.0066 - accuracy: 1.0000 - 52ms/epoch - 13ms/step
Epoch 209/300
4/4 - 0s - loss: 0.0057 - accuracy: 1.0000 - 52ms/epoch - 13ms/step
Epoch 210/300
4/4 - 0s - loss: 0.0063 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 211/300
4/4 - 0s - loss: 0.0056 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 212/300
4/4 - 0s - loss: 0.0062 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 213/300
4/4 - 0s - loss: 0.0075 - accuracy: 1.0000 - 54ms/epoch - 13ms/step
Epoch 214/300
4/4 - 0s - loss: 0.0056 - accuracy: 1.0000 - 51ms/epoch - 13ms/step
Epoch 215/300
[output truncated]

<keras.callbacks.History at 0x7faf0011b4f0>

# Translation testing
Since we are using only 20 sentences for this training demonstration, the model is not expected to work well on arbitrary test examples. To make sure that the model works, we translate a source sentence from the training set and compare the prediction with the target:

In [15]:
def translate(model, source_sentence, target_sentence_start=[['<start>']]):
    if np.ndim(source_sentence) == 1: # Create a batch of 1 if the input is a single sentence
        source_sentence = [source_sentence]
    
    if np.ndim(target_sentence_start) == 1:
        target_sentence_start = [target_sentence_start]
  
    # Tokenizing and padding (pad to the source length used during training)
    source_seq = source_tokenizer.texts_to_sequences(source_sentence)
    source_seq = tf.keras.preprocessing.sequence.pad_sequences(source_seq, padding='post', maxlen=source_data.shape[1])
    predict_seq = target_tokenizer.texts_to_sequences(target_sentence_start)

    predict_sentence = list(target_sentence_start[0]) # Copy the list so that target_sentence_start is not modified
    
    # Stop at <end> or once the sentence reaches max_token_length (the size of the positional encoding table)
    while predict_sentence[-1] != '<end>' and len(predict_sentence) < max_token_length:
        predict_output = model([np.array(source_seq), np.array(predict_seq)], training=False)
        predict_label = tf.argmax(predict_output, axis=-1) # Pick the label with highest softmax score
        predict_seq = tf.concat([predict_seq, predict_label], axis=-1) # Updating the prediction sequence
        predict_sentence.append(target_tokenizer.index_word[predict_label[0][0].numpy()])

    return predict_sentence

In [16]:
print("Source sentence: ", source_sentences[10])
print("Target sentence: ", target_sentences[10])
print("Predicted sentence: ", ' '.join(translate(transformer, source_sentences[10].split(' '))))

Source sentence:  <start> Another coffee , please . <end>
Target sentence:  <start> Otro cafe , por favor . <end>
Predicted sentence:  <start> otro cafe , por favor . <end>
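
The decoding loop used for translation is a greedy search: at each step the most likely next token is appended until the end token appears or a length limit is reached. The sketch below isolates that control flow. `greedy_decode` and the lookup table standing in for the model are hypothetical and only serve as an illustration.

```python
def greedy_decode(next_token, start='<start>', end='<end>', max_len=20):
    tokens = [start]
    # Stop at the end token or when the length limit is reached
    while tokens[-1] != end and len(tokens) < max_len:
        tokens.append(next_token(tokens))
    return tokens

# Hypothetical stand-in for the model: a fixed next-token lookup
table = {'<start>': 'otro', 'otro': 'cafe', 'cafe': '<end>'}
translation = greedy_decode(lambda tokens: table[tokens[-1]])
print(' '.join(translation))

# A "model" that never emits <end> is cut off by the length limit
runaway = greedy_decode(lambda tokens: 'cafe', max_len=5)
print(len(runaway))
```

The length limit matters because a poorly trained model may never produce the end token.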


And that's it! Everything is working as expected, and you have built your first working Transformer. Now you can play around with it: take a bigger dataset, for example, and retrain everything. Change the model parameters and find out which Transformer sizes fit into your GPU memory. What impact does a bigger model have on the training time and on your model's accuracy? Try to build a Transformer for another task, such as sentiment analysis.

# Further recommendations
If you are interested in bigger pretrained models, check out [Hugging Face](https://huggingface.co/docs/transformers/index) and learn how to fine-tune them to your own needs, or sign up for the [OpenAI GPT-3 Playground](https://beta.openai.com/playground) to play around with a really big Transformer and learn how many different tasks a single pretrained Transformer can solve.