The problem is to build and train a Neural Machine Translation (NMT) model from scratch. The goal is to implement the Transformer 
architecture to translate sentences from Portuguese to English, using the TED Talks dataset for training and evaluation.

In [1]:
'''
Step 1: Import libraries
Step 2: Load the dataset from TensorFlow Datasets
Step 3: Define the pre-trained Tokenizers for both languages
Step 4: Define the functions for preparing batches for training and evaluation
Step 5: Define Positional Encoding to add the position information to the tokens 
Step 6: Define Keras layers that implement self-attention and cross-attention. 
Step 7: Define a Keras layer that implements the "Position-wise Feed-Forward Network" described in the original "Attention Is All You Need" paper. 
Step 8: Define the Encoder layer and Encoder model with multiple layers
Step 9: Define causal or masked attention layer and decoder layers, followed by the overall decoder model
Step 10: Define the transformer by combining the encoder and decoder models
Step 11: Define the loss and accuracy measures
Step 12: Define the training hyper-parameters and do the training
Step 13: Develop the Translator class with the transformer to evaluate the model
 '''


In [3]:
import logging
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds
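# tensorflow_text is not called directly, but importing it registers the custom ops
# that the saved tokenizer model needs (install it with `pip install tensorflow-text`).
import tensorflow_text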

ModuleNotFoundError: No module named 'tensorflow_text'

In [None]:
''' 
Step 2: Load the dataset from TensorFlow Datasets
Use TensorFlow Datasets to load the Portuguese-English translation dataset from the TED Talks Open Translation Project. 
This dataset contains approximately 52,000 training, 1,200 validation and 1,800 test examples.

'''

# 'ted_hrlr_translate/pt_to_en' is the Portuguese-to-English translation dataset.
# with_info=True provides metadata about the dataset.
# as_supervised=True loads the data as (input, label) pairs.
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True,
                               as_supervised=True)

# Split the data into training and validation sets
train_examples, val_examples = examples['train'], examples['validation']

In [5]:
# CHECK: Print a few examples to see what the data looks like
for pt_examples, en_examples in train_examples.batch(3).take(1):
  print('> Examples in Portuguese:')
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))
  print()

  print('> Examples in English:')
  for en in en_examples.numpy():
    print(en.decode('utf-8'))

> Examples in Portuguese:
e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .

> Examples in English:
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .


2025-07-31 19:36:40.309974: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [None]:
''' 
Step 3: Define the pre-trained Tokenizers for both languages
1. https://www.youtube.com/watch?v=L5CR-k2ROu4 : Tokenization in NLP: Basics to Advanced
2. https://www.tensorflow.org/text/guide/tokenizers 
3. https://www.tensorflow.org/text/guide/subwords_tokenizer

The main purpose of the next cell is to download and load a pre-trained tokenizer model from a public repository hosted by TensorFlow.

A tokenizer is a fundamental tool in Natural Language Processing (NLP). It handles two key tasks:

    Tokenization: It breaks down raw text (like a sentence) into smaller units called "tokens." These tokens are then mapped to unique numerical IDs. 
    Neural networks can only process numbers, so this step is essential to convert text into a format the model can understand.

    Detokenization: It performs the reverse operation, converting the numerical output from the model back into human-readable text.

The process in this snippet happens in two main steps:

    Downloading and Extracting the Model (tf.keras.utils.get_file): This function is a convenient utility that handles fetching a file from a URL.

        It first checks if the file (ted_hrlr_translate_pt_en_converter.zip) already exists in the local cache. If not, it downloads it from the Google Cloud Storage URL provided.

        The extract=True argument is very important here. After the download is complete, it automatically unzips the archive.
        In this notebook's environment the model files end up under ted_hrlr_translate_pt_en_converter_extracted/ted_hrlr_translate_pt_en_converter
        (the exact location can differ between Keras versions).

    Loading the Tokenizer Model (tf.saved_model.load): Once the model files are downloaded and extracted, this function loads them into memory.

        It reads the specified directory and reconstructs the complete TensorFlow model, including its architecture, weights, 
        and any associated assets like the vocabulary files needed for tokenization.

        The resulting tokenizers object is a fully functional model. It acts as a container holding two separate sub-models: 
        one for Portuguese (tokenizers.pt) and one for English (tokenizers.en). These can now be used directly to process the text data for training.



'''


In [None]:
# Name of the pre-trained tokenizer archive. It bundles the Portuguese and English
# tokenizers for this dataset and is hosted by TensorFlow.
model_name = 'ted_hrlr_translate_pt_en_converter'

# Use a Keras utility function to download the tokenizer model from a public Google Cloud Storage URL.
# This function is very convenient as it handles both downloading and extraction.
tf.keras.utils.get_file(
    # The local filename to save the downloaded archive as.
    f'{model_name}.zip',
    
    # The public URL where the model archive is hosted.
    f'https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip',
    
    # Specifies the directory to cache the file. '.' means the current directory.
    cache_dir='.',
    
    # Specifies not to use any sub-directory within the cache directory.
    cache_subdir='',
    
    # This crucial argument tells the function to automatically extract the contents
    # of the .zip file after it's downloaded. This creates a folder with the model files.
    extract=True
)
# Load the complete TensorFlow SavedModel from the directory that was just extracted.
# This function reconstructs the model, including its vocabulary and tokenization logic.
# The resulting 'tokenizers' object exposes one tokenizer per language: tokenizers.pt and tokenizers.en.
# Note: in this environment the archive is unpacked into '<model_name>_extracted/<model_name>'.
# Other Keras versions may extract directly into '<model_name>', so adjust the path if needed.
tokenizers = tf.saved_model.load(f'./{model_name}_extracted/{model_name}')
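Because the extraction directory depends on the environment, a small lookup can make the load path less brittle. The sketch below is only an illustration: find_saved_model_dir is a hypothetical helper (not part of TensorFlow or Keras), and it relies solely on the fact that a SavedModel directory contains a saved_model.pb file. The demonstration runs on a throwaway directory that mimics the nested layout used above.

```python
from pathlib import Path
import tempfile


def find_saved_model_dir(root, model_name):
    """Return the shallowest directory under `root` that is named `model_name`
    and contains a saved_model.pb file.

    Hypothetical helper for illustration; it only walks the file system.
    """
    root = Path(root)
    # Sort by path depth so the result is deterministic.
    candidates = sorted(root.rglob('saved_model.pb'), key=lambda p: len(p.parts))
    for pb_file in candidates:
        if pb_file.parent.name == model_name:
            return pb_file.parent
    raise FileNotFoundError(f'No SavedModel named {model_name!r} found under {root}')


# Demonstration on a temporary directory that mimics the nested layout.
with tempfile.TemporaryDirectory() as tmp:
    name = 'ted_hrlr_translate_pt_en_converter'
    nested = Path(tmp) / f'{name}_extracted' / name
    nested.mkdir(parents=True)
    (nested / 'saved_model.pb').touch()

    found = find_saved_model_dir(tmp, name)
    found_relative = found.relative_to(tmp).as_posix()
    print(found_relative)
```

In the notebook, the result could be passed on as tf.saved_model.load(str(find_saved_model_dir('.', model_name))).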



In [8]:
# Create a small batch of 3 sentence pairs from the training data to use as an example.
# .take(1) ensures we only grab one batch.
for pt_examples, en_examples in train_examples.batch(3).take(1):
  
  # --- 1. View the original Portuguese text ---
  print('> Examples in Portuguese:')
  # Loop through the Portuguese sentences in the batch.
  # .numpy() converts the TensorFlow tensor to a NumPy array.
  # .decode('utf-8') converts the raw bytes to a human-readable string.
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))
  print() # Add a blank line for readability.

  # --- 2. View the original English text ---
  print('> Examples in English:')
  # Do the same for the English sentences.
  for en in en_examples.numpy():
    print(en.decode('utf-8'))
  print()

# --- 3. Tokenize the English text (Text -> Numbers) ---
# Pass the batch of English sentences to the pre-trained English tokenizer.
# This converts each sentence into a sequence of integer IDs.
encoded = tokenizers.en.tokenize(en_examples)

print('> This is the tokenized batch (token IDs):')
# Loop through the resulting ragged tensor of token IDs.
# .to_list() converts the ragged tensor to a standard Python list for easy printing.
for row in encoded.to_list():
  print(row)
print()

# --- 4. Detokenize the IDs back to text (Numbers -> Text) ---
# This demonstrates the "round trip": the text comes back nearly unchanged
# (note that "n't" becomes "n ' t" in the output below).
# We take the `encoded` tensor of IDs and convert it back to text.
round_trip = tokenizers.en.detokenize(encoded)

print('> This is the detokenized (human-readable) text:')
# Loop through the resulting batch of detokenized text.
for line in round_trip.numpy():
  print(line.decode('utf-8'))

> Examples in Portuguese:
e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .

> Examples in English:
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .



2025-07-31 19:36:48.253450: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


> This is the tokenized batch (token IDs):
[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]
[2, 87, 90, 107, 76, 129, 1852, 30, 3]
[2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3]

> This is the detokenized (human-readable) text:
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n ' t test for curiosity .
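To make the tokenize/detokenize round trip concrete without the pre-trained model, here is a toy word-level version. Everything in it is an assumption made for illustration: the vocabulary is invented, the real tokenizer works on subword pieces rather than whole words, and the reserved IDs 2 and 3 are chosen only because every row in the output above starts with 2 and ends with 3.

```python
# Toy word-level vocabulary. IDs 2 and 3 mirror the start/end IDs visible in the
# output above; every other entry is made up for this example.
VOCAB = ['[PAD]', '[UNK]', '[START]', '[END]',
         'but', 'what', 'if', 'it', 'were', 'active', '?']
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def toy_tokenize(sentence):
    """Text -> IDs: split on spaces, map unknown words to [UNK], wrap in [START]/[END]."""
    ids = [TOKEN_TO_ID.get(word, TOKEN_TO_ID['[UNK]']) for word in sentence.split()]
    return [TOKEN_TO_ID['[START]']] + ids + [TOKEN_TO_ID['[END]']]


def toy_detokenize(ids):
    """IDs -> text: drop the reserved padding/start/end tokens and join the rest with spaces."""
    reserved = {'[PAD]', '[START]', '[END]'}
    return ' '.join(VOCAB[i] for i in ids if VOCAB[i] not in reserved)


ids = toy_tokenize('but what if it were active ?')
print(ids)                   # [2, 4, 5, 6, 7, 8, 9, 10, 3]
print(toy_detokenize(ids))   # but what if it were active ?
```

A word that is missing from the toy vocabulary maps to [UNK]; subword tokenizers such as the one loaded above largely avoid this by splitting rare words into smaller known pieces.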


In [None]:
''' 
Step 4: Define the functions for preparing batches for training and evaluation

What's Happening in the Code?

This code defines two functions, prepare_batch and make_batches, that work together to create an efficient data pipeline using tf.data. 
This pipeline is designed to load, preprocess, shuffle, and batch the data, which is essential for training a deep learning model effectively.

1. The prepare_batch function

This function handles the preprocessing for a single batch of sentence pairs.

    Tokenize and Truncate: It first converts the Portuguese and English text into numerical token IDs. It also enforces a maximum sequence length (MAX_TOKENS) by cutting off any sentences that are too long.

    Create Decoder Input/Output: For the English (target) sentences, it creates two versions:

        en_inputs: This is the input to the decoder. It includes every token except the last one.

        en_labels: This is the target label that the model tries to predict. It includes every token except the first one.

        This "shift" is a core concept in training sequence models, as it teaches the model to predict the next word in the sequence given the previous words.

    Convert to Tensor: It converts the ragged (variable-length) tensors of token IDs into dense TensorFlow tensors, adding padding (with zeros) to make all sequences in the batch the same length.

2. The make_batches function

This function builds the complete data pipeline.

    .shuffle(): Randomly shuffles the dataset. This is crucial for effective training, as it prevents the model from learning the order of the training examples. 
    The buffer_size tells the pipeline how many elements to load into a buffer to shuffle from.

    .batch(): Groups the individual examples into batches of a specified size (512 by default in this notebook).

    .map(): Applies the prepare_batch function to each batch in parallel. tf.data.AUTOTUNE allows TensorFlow to dynamically adjust 
    the number of parallel threads to use, optimizing performance.

    .prefetch(): This is a key optimization. It creates a background thread that prepares the next batch of data on the CPU while the 
    GPU is busy training on the current batch. This ensures the GPU doesn't have to wait, dramatically speeding up training.
'''

In [17]:
# Set a maximum sequence length. Sentences with more tokens than this will be truncated.
MAX_TOKENS = 128

def prepare_batch(pt, en):
    """
    This function tokenizes, truncates, and prepares the input and label tensors
    for a single batch of Portuguese-English sentence pairs.
    """
    # --- Process the Portuguese (Source) Sentences ---
    # Convert the raw Portuguese text into sequences of token IDs.
    pt = tokenizers.pt.tokenize(pt)
    # Enforce the maximum sequence length by keeping only the first MAX_TOKENS.
    pt = pt[:, :MAX_TOKENS]
    # Convert the ragged (variable-length) tensor into a dense tensor,
    # padding shorter sequences with zeros.
    pt = pt.to_tensor()

    # --- Process the English (Target) Sentences ---
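    # Convert the raw English text into sequences of token IDs.
    # The tokenizer already wraps each sentence in start and end tokens.
    en = tokenizers.en.tokenize(en)
    # Keep MAX_TOKENS + 1 tokens: the shift below drops one token from each of
    # en_inputs and en_labels, so both end up with at most MAX_TOKENS tokens.
    en = en[:, :(MAX_TOKENS + 1)]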

    # Create the input to the decoder by taking all tokens except the last one.
    en_inputs = en[:, :-1].to_tensor()

    # Create the target labels for the model to predict by taking all tokens except the first one.
    # This creates the "teacher forcing" mechanism where the model learns to predict the next token.
    en_labels = en[:, 1:].to_tensor()

    # Return a tuple where the first element is the model's input (a pair of tensors)
    # and the second element is the target label.
    return (pt, en_inputs), en_labels

def make_batches(ds, buffer_size=20000, batch_size=512):
    """
    Builds an efficient, optimized tf.data pipeline from the dataset.
    """
    return (
        ds
        # Shuffle the dataset to ensure the model doesn't learn the order of examples.
        # A large buffer size improves the randomness of the shuffle.
        .shuffle(buffer_size)
        # Group the individual examples into batches of the specified size.
        .batch(batch_size)
        # Apply the `prepare_batch` function to each batch in parallel for efficiency.
        # tf.data.AUTOTUNE lets TensorFlow figure out the best level of parallelism.
        .map(prepare_batch, num_parallel_calls=tf.data.AUTOTUNE)
        # Pre-fetch the next batch of data while the current one is being processed on the GPU.
        # This is a crucial performance optimization that prevents data bottlenecks.
        .prefetch(buffer_size=tf.data.AUTOTUNE)
    )

# --- Create the final data pipelines for training and validation ---
# The same pipeline logic is applied to both the training and validation datasets.
train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)
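The input/label shift is easier to see on plain Python lists. The sketch below mimics only the target-side logic of prepare_batch; shift_and_pad is a hypothetical name, and the two rows are token IDs copied from the tokenized English output printed earlier.

```python
MAX_TOKENS = 128


def shift_and_pad(token_rows, max_tokens=MAX_TOKENS, pad_id=0):
    """Mimic the target-side logic of prepare_batch using plain Python lists."""
    # Keep one extra token so inputs and labels each hold at most max_tokens.
    rows = [row[:max_tokens + 1] for row in token_rows]
    inputs = [row[:-1] for row in rows]   # everything except the last token
    labels = [row[1:] for row in rows]    # everything except the first token
    width = max(len(row) for row in inputs)

    def pad(row):
        # Right-pad with pad_id so every row has the same length.
        return row + [pad_id] * (width - len(row))

    return [pad(row) for row in inputs], [pad(row) for row in labels]


# Two of the tokenized English sentences printed earlier in the notebook.
batch = [
    [2, 87, 90, 107, 76, 129, 1852, 30, 3],
    [2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3],
]
inputs, labels = shift_and_pad(batch)
for inp, lab in zip(inputs, labels):
    print('inputs:', inp)
    print('labels:', lab)
```

At every position t, labels[t] is the token that follows inputs[t], which is exactly what the decoder is trained to predict.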

In [None]:
'''
Step 5: Define Positional Encoding to add the position information to the tokens 
https://arxiv.org/pdf/1706.03762 :  Original Paper

This code defines a custom Keras layer, PositionalEmbedding, that performs two critical jobs:

    Word Embedding: It converts the numerical token IDs (like [8, 64, 12, 5]) into dense vector representations (embeddings). 
    Each token in the vocabulary (a word or a subword piece) gets its own vector. This is done by the standard tf.keras.layers.Embedding layer.

    Positional Encoding: It creates a second vector that represents the position of each word in the sentence (e.g., 1st word, 2nd word, etc.). 
    This positional vector is then added to the word embedding vector.

1. The positional_encoding function

This function implements the clever mathematical trick from the original "Attention Is All You Need" paper.

    It generates a unique positional vector for each position in a sequence.

    It uses a combination of sine and cosine functions at different frequencies.

    The key idea: This method allows the model to easily learn relative positions. 
    Because of the properties of sine and cosine, the positional encoding for position + k can be represented as a linear function of the encoding for position. 
    This makes it easy for the model to understand how far apart words are, which is crucial for understanding context and grammar.

    
2. The PositionalEmbedding Layer

This Keras layer brings everything together.

    When it receives a batch of tokenized sentences, it first looks up the word embedding for each token.

    It then scales these embeddings by sqrt(d_model) (a standard practice from the paper).

    Finally, it adds the pre-calculated positional encoding vector corresponding to each word's position.

    The final output is a single vector for each word that contains information about both what the word is and where it is in the sentence.   
'''

In [18]:
def positional_encoding(length, depth):
    """
    Generates a matrix of positional encodings. This is a clever way to inject
    information about the order of tokens in the sequence.

    Args:
      length: The maximum length of the sequence.
      depth: The dimensionality of the embedding (d_model).
    """
    # The depth is split in half for the sine and cosine parts.
    depth = depth/2

    # Create a column vector of positions from 0 to length-1.
    # Shape: (length, 1)
    positions = np.arange(length)[:, np.newaxis]

    # Create a row vector of normalized depth indices in [0, 1).
    # Shape: (1, depth), where depth is now half of the embedding size.
    depths = np.arange(depth)[np.newaxis, :]/depth

    # Calculate the angle rates using the formula from the paper.
    # The frequencies of the sine/cosine waves decrease along the depth dimension.
    angle_rates = 1 / (10000**depths)

    # Calculate the angle radians for each position and depth.
    # This is the core of the positional encoding calculation.
    angle_rads = positions * angle_rates

    # Create the final positional encoding matrix by concatenating the sine and cosine values.
    # Note: the two halves are placed side by side (all sines, then all cosines), not interleaved.
    pos_encoding = np.concatenate(
        [np.sin(angle_rads), np.cos(angle_rads)],
        axis=-1)

    # Convert the NumPy array to a TensorFlow tensor.
    return tf.cast(pos_encoding, dtype=tf.float32)


class PositionalEmbedding(tf.keras.layers.Layer):
    """
    This layer combines a standard word embedding with the positional encoding.
    The output is a single tensor that contains both semantic and positional information.
    """
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.d_model = d_model

        # Standard embedding layer that maps token IDs to vectors.
        # `mask_zero=True` tells the layer to ignore padding (zeros) in the input.
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model, mask_zero=True)

        # Pre-calculate the positional encoding matrix.
        # We create it for 2048 positions, far more than the MAX_TOKENS = 128 limit used in this notebook.
        self.pos_encoding = positional_encoding(length=2048, depth=d_model)

    def compute_mask(self, *args, **kwargs):
        # This ensures that the padding mask is correctly propagated through the network.
        # The attention layers will use this mask to ignore padding tokens.
        return self.embedding.compute_mask(*args, **kwargs)

    def call(self, x):
        # Get the length of the input sequence.
        length = tf.shape(x)[1]

        # 1. Get the word embeddings for the input tokens.
        x = self.embedding(x)

        # 2. Scale the embeddings by sqrt(d_model), as in the paper. A common explanation is that
        # this keeps them from being dominated by the positional encodings, whose values lie in [-1, 1].
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))

        # 3. Add the positional encodings to the word embeddings.
        # This is where the position information is injected.
        # We slice the pre-calculated pos_encoding to match the length of the input sequence.
        x = x + self.pos_encoding[tf.newaxis, :length, :]
        return x
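The claim about relative positions can be checked numerically. The sketch below uses a NumPy-only copy of the function above (positional_encoding_np is a name introduced here, and the sizes 64 and 16 are arbitrary). It verifies two things: position 0 encodes as all zeros in the sine half and all ones in the cosine half, and the dot product between the encodings of two positions depends only on their offset k, not on where they sit in the sequence.

```python
import numpy as np


def positional_encoding_np(length, depth):
    """NumPy-only copy of the positional_encoding function above (no tf.cast)."""
    half = depth // 2
    positions = np.arange(length)[:, np.newaxis]      # (length, 1)
    depths = np.arange(half)[np.newaxis, :] / half    # (1, depth/2)
    angle_rads = positions / (10000 ** depths)        # (length, depth/2)
    # Sines and cosines are concatenated side by side, as in the notebook's version.
    return np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=-1)


pe = positional_encoding_np(length=64, depth=16)

# Position 0: sin(0) = 0 in the first half, cos(0) = 1 in the second half.
first_row = pe[0]

# For a fixed offset k, pe[p] . pe[p + k] equals sum_i cos(rate_i * k) for every p,
# because sin(a)sin(b) + cos(a)cos(b) = cos(a - b).
k = 5
dots = np.array([pe[p] @ pe[p + k] for p in range(40)])
print(pe.shape)
print(np.ptp(dots))  # spread of the dot products across positions, expected to be ~0
```

This offset-only behavior is one way to see why attention layers can pick up on how far apart two tokens are.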

In [None]:
'''
Step 6: Define three custom Keras layers that implement the different types of attention used in the Transformer architecture. 
The code takes an object-oriented approach: a BaseAttention class holds the common components, and 
two specialized classes, CrossAttention and GlobalSelfAttention, inherit from it.

1. The BaseAttention Layer

This is a parent class that isn't used directly but serves as a blueprint. It contains the three components that every attention block in a Transformer needs:

    Multi-Head Attention (mha): This is the core engine. Instead of just calculating attention once, it does it multiple times in parallel (in different "heads"). 
    Each head can focus on a different aspect of the sentence's meaning or structure. This allows the model to capture a much richer understanding of the 
    relationships between words.

    Layer Normalization (layernorm): A technique that stabilizes the training of deep neural networks. It normalizes the outputs of the attention layer, 
    which helps prevent the values from becoming too large or too small and improves the flow of gradients during backpropagation.

    Add (add): This implements the "residual connection" (or "skip connection"). The input to the attention layer is added directly to its output. 
    This is a critical technique that allows the model to train much deeper networks by ensuring that information from earlier layers can easily 
    pass through to later layers without being lost.

2. The GlobalSelfAttention Layer (commonly called just self-attention)

This layer is used inside the Encoder (per the outline, the Decoder gets its own causal, masked variant in Step 9). Its job is to process a single sequence and allow every word in that sequence to "attend" to every other word.

    How it works: The query, key, and value inputs to the multi-head attention layer all come from the same source (x).

    Purpose: It helps the model build a rich, context-aware representation of each word. For example, in the sentence 
    "The tired animal crossed the road," this layer helps the model understand that "The" and "tired" are related to "animal."

3. The CrossAttention Layer

This layer is the bridge between the Encoder and the Decoder and is only used in the Decoder. 
It's how the model looks at the source sentence while generating the translated sentence.

    How it works:

        The query comes from the Decoder's own sequence (the translated words generated so far).

        The key and value come from the output of the Encoder (the context from the source sentence).

    Purpose: It allows the Decoder to decide which words in the source sentence are most important for predicting the next word in the translation. 
    For example, when translating the Portuguese word "estava" to English, this layer helps the model look at the context in the Portuguese 
    sentence to decide whether the correct translation is "was" or "were."
'''

In [11]:
class BaseAttention(tf.keras.layers.Layer):
    """
    A base class for attention layers that contains the common components:
    Multi-Head Attention, Layer Normalization, and a Residual Add connection.
    This promotes code reuse and follows the standard Transformer block structure.
    """
    def __init__(self, **kwargs):
        super().__init__()
        # The core Multi-Head Attention layer. The **kwargs allows us to pass
        # configuration parameters like num_heads, key_dim, etc., directly to it.
        self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
        
        # Layer Normalization helps stabilize the network and speeds up training.
        self.layernorm = tf.keras.layers.LayerNormalization()
        
        # The Add layer is used for the residual connection.
        self.add = tf.keras.layers.Add()

class CrossAttention(BaseAttention):
    """
    Implements the cross-attention mechanism. This layer is used in the Decoder
    to attend to the output of the Encoder. It connects the two parts of the model.
    """
    def call(self, x, context):
        # The mha layer calculates the attention scores and output.
        # `query` comes from the decoder's sequence (x).
        # `key` and `value` come from the encoder's output (context).
        attn_output, attn_scores = self.mha(
            query=x,
            key=context,
            value=context,
            return_attention_scores=True)  # We request scores for visualization.

        # Cache the attention scores for later visualization and analysis.
        self.last_attn_scores = attn_scores

        # Apply the residual connection: add the input (x) to the attention output.
        # This helps with gradient flow in deep networks.
        x = self.add([x, attn_output])
        
        # Apply layer normalization to the result of the residual connection.
        x = self.layernorm(x)
        
        return x

class GlobalSelfAttention(BaseAttention):
    """
    Implements the global (unmasked) self-attention mechanism, in which every position
    of a single sequence can attend to every other position. This suits the Encoder; the
    Decoder needs a causal variant (Step 9 in the outline).
    """
    def call(self, x):
        # The mha layer calculates the attention scores and output.
        # For self-attention, the query, key, and value are all the same tensor (x).
        # This allows every token in the sequence to attend to every other token.
        attn_output = self.mha(
            query=x,
            value=x,
            key=x)
            
        # Apply the residual connection: add the input (x) to the attention output.
        x = self.add([x, attn_output])
        
        # Apply layer normalization.
        x = self.layernorm(x)
        
        return x
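What the mha call computes can be sketched in a few lines of NumPy. This is a deliberately simplified, single-head version: it leaves out the learned query/key/value projections, the multiple heads, masking, and dropout that tf.keras.layers.MultiHeadAttention provides, and the shapes (3 target positions, 5 source positions, 8 features) are arbitrary. It mainly shows where query, key, and value come from in self-attention versus cross-attention.

```python
import numpy as np


def scaled_dot_product_attention(query, key, value):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V, without masking."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)                    # (query_len, key_len)
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights


rng = np.random.default_rng(0)
decoder_seq = rng.normal(size=(3, 8))   # 3 target positions, 8 features each
encoder_out = rng.normal(size=(5, 8))   # 5 source positions, 8 features each

# Self-attention: query, key, and value all come from the same sequence.
self_out, self_weights = scaled_dot_product_attention(encoder_out, encoder_out, encoder_out)

# Cross-attention: the query comes from the decoder, key and value from the encoder.
cross_out, cross_weights = scaled_dot_product_attention(decoder_seq, encoder_out, encoder_out)

print(self_weights.shape)    # (5, 5)
print(cross_weights.shape)   # (3, 5)
```

Each row of the weight matrix sums to 1, so every output position is a weighted average of the value vectors. The output always has one row per query position, which is why cross-attention returns a tensor shaped like the decoder sequence.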

In [None]:
'''
Step 7: Define a custom Keras layer that implements the "Position-wise Feed-Forward Network" described in the original "Attention Is All You Need" paper. 
It's a relatively simple component, but it's essential for the model's performance.

The structure of this layer mirrors the attention sub-layer that comes before it: a main processing block followed by a 
residual connection and layer normalization.

The Core Network (self.seq)

The main processing is done by a tf.keras.Sequential model containing three layers:

    An "Expansion" Dense Layer: The first Dense layer takes the output from the attention sub-layer and expands its dimensionality from d_model (e.g., 256) 
    to a much larger intermediate dimension, dff (e.g., 1024). It uses a ReLU activation function, which introduces non-linearity into the model. 
    This expansion allows the model to learn more complex relationships and features.

    A "Contraction" Dense Layer: The second Dense layer projects the data from the large dff dimension back down to the original d_model dimension.

    Dropout: A Dropout layer is applied for regularization. During training, it randomly sets a fraction of its input units to zero at each update step,
    which helps prevent overfitting by making the network less reliant on any single neuron.

Residual Connection and Layer Normalization

Just like in the attention sub-layer, this FeedForward network is wrapped with two crucial components:

    self.add: Implements the residual (or "skip") connection. The input that came into the FeedForward layer is added directly to its output. 
    This helps gradients flow through the deep network and prevents information from being lost.

    self.layer_norm: Applies layer normalization to the result of the residual connection, which stabilizes training.

In simple terms: The FeedForward network's role is to take the context-rich vectors produced by the attention mechanism and perform additional 
non-linear transformations on each one independently. This deepens the model and gives it a greater capacity to learn the complex patterns required for translation.
'''

In [12]:
class FeedForward(tf.keras.layers.Layer):
    """
    Implements the Position-wise Feed-Forward Network (FFN) from the Transformer paper.
    This layer is applied to each position separately and identically.
    """
    def __init__(self, d_model, dff, dropout_rate=0.1):
        """
        Initializes the FeedForward layer.

        Args:
          d_model: The dimensionality of the model's embeddings (e.g., 256).
          dff: The dimensionality of the inner-layer of the FFN (e.g., 1024).
          dropout_rate: The fraction of input units to drop for regularization.
        """
        super().__init__()
        
        # The core of the FFN is a two-layer fully-connected network.
        self.seq = tf.keras.Sequential([
            # Layer 1: Expands the input from d_model to dff and applies ReLU activation.
            tf.keras.layers.Dense(dff, activation='relu'),
            
            # Layer 2: Projects the output from dff back down to d_model.
            tf.keras.layers.Dense(d_model),
            
            # A dropout layer for regularization to prevent overfitting.
            tf.keras.layers.Dropout(dropout_rate)
        ])
        
        # An Add layer for the residual connection.
        self.add = tf.keras.layers.Add()
        
        # A Layer Normalization layer for stabilizing the training.
        self.layer_norm = tf.keras.layers.LayerNormalization()

    def call(self, x):
        """
        Defines the forward pass for the layer.
        """
        # Pass the input through the sequential feed-forward network.
        seq_out = self.seq(x)
        
        # Apply the residual connection: add the original input (x) to the FFN's output.
        x = self.add([x, seq_out])
        
        # Apply layer normalization to the result of the residual connection.
        x = self.layer_norm(x)
        
        return x
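The term "position-wise" can be verified directly: running the block on a whole sequence gives the same result as running it on each position separately. The NumPy sketch below is an inference-time simplification under stated assumptions: random weights stand in for trained ones, dropout is omitted, layer normalization has no learned scale or offset, and the epsilon value and the sizes (4 positions, d_model = 8, dff = 32) are arbitrary choices for the example.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def feed_forward_block(x, w1, b1, w2, b2):
    """Expansion with ReLU, contraction, residual add, then layer normalization."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # (seq_len, dff)
    out = hidden @ w2 + b2                  # (seq_len, d_model)
    return layer_norm(x + out)


rng = np.random.default_rng(0)
seq_len, d_model, dff = 4, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
w2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

# Process the whole sequence at once.
whole = feed_forward_block(x, w1, b1, w2, b2)

# Process each position on its own and stack the results.
per_position = np.stack(
    [feed_forward_block(row[np.newaxis, :], w1, b1, w2, b2)[0] for row in x]
)

print(np.allclose(whole, per_position))
```

No information moves between positions in this block; mixing across positions is the job of the attention layers.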

In [None]:
'''
Step 8: Define two classes: EncoderLayer and Encoder.

1. EncoderLayer: A Single Processing Block

Think of an EncoderLayer as one step in an assembly line. It takes a sequence of word vectors as input and performs two main operations 
on them before passing them to the next step.

    self_attention (GlobalSelfAttention): First, the data goes through a self-attention mechanism. This allows every word in the sentence to look at 
    every other word to gather context. For example, in the sentence "The green car is fast," this layer helps the model understand that "green" is describing the "car."

    ffn (FeedForward): Second, the output from the attention layer is passed through a position-wise feed-forward network. This network processes each word's 
    vector individually, performing further non-linear transformations to help the model learn more complex features.

Each EncoderLayer is a self-contained processing unit that refines the representation of the input sentence.

2. Encoder: The Full Assembly Line

The Encoder is the complete assembly line, made up of a stack of multiple EncoderLayers.

    pos_embedding (PositionalEmbedding): The process starts here. The raw input (a sequence of token IDs) is passed to the PositionalEmbedding layer. 
    This layer converts the token IDs into vectors that contain information about both what the word is (its meaning) and where it is in the sentence (its position).

    dropout: A dropout layer is applied to the embeddings for regularization, which helps prevent the model from overfitting.

    The Stack of enc_layers: The core of the Encoder is a loop that passes the data through each EncoderLayer in the stack, one after the other. 
    If you have num_layers=4, the data will be processed by four of these blocks sequentially. Each layer further refines the context and meaning of the words in the sentence.

The final output of the Encoder is a set of context-rich vectors (one for each word in the input sentence) that is then passed to the Decoder.
'''

In [13]:
class EncoderLayer(tf.keras.layers.Layer):
    """
    A single layer of the Transformer Encoder. It consists of two sub-layers:
    1. A Global Self-Attention mechanism.
    2. A Position-wise Feed-Forward Network.
    Each sub-layer has a residual connection followed by layer normalization.
    """
    def __init__(self, *, d_model, num_heads, dff, dropout_rate=0.1):
        """
        Initializes the EncoderLayer.

        Args:
          d_model: The dimensionality of the model's embeddings.
          num_heads: The number of attention heads.
          dff: The dimensionality of the inner-layer of the FFN.
          dropout_rate: The dropout rate for regularization.
        """
        super().__init__()

        # The first sub-layer: a global self-attention mechanism.
        self.self_attention = GlobalSelfAttention(
            num_heads=num_heads,
            key_dim=d_model,
            dropout=dropout_rate)

        # The second sub-layer: a position-wise feed-forward network.
        self.ffn = FeedForward(d_model, dff, dropout_rate)

    def call(self, x):
        """Defines the forward pass for the layer."""
        # Pass the input through the self-attention layer.
        x = self.self_attention(x)
        
        # Pass the result through the feed-forward network.
        x = self.ffn(x)
        
        return x

class Encoder(tf.keras.layers.Layer):
    """
    The complete Transformer Encoder, which is a stack of N identical EncoderLayers.
    """
    def __init__(self, *, num_layers, d_model, num_heads,
                 dff, vocab_size, dropout_rate=0.1):
        """
        Initializes the Encoder.

        Args:
          num_layers: The number of EncoderLayers to stack.
          d_model: The dimensionality of the model's embeddings.
          num_heads: The number of attention heads.
          dff: The dimensionality of the inner-layer of the FFN.
          vocab_size: The size of the input vocabulary.
          dropout_rate: The dropout rate for regularization.
        """
        super().__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        # The first layer is the positional embedding layer, which adds word
        # meaning and position information to the input tokens.
        self.pos_embedding = PositionalEmbedding(
            vocab_size=vocab_size, d_model=d_model)

        # Create a list containing `num_layers` instances of EncoderLayer.
        # This forms the main stack of the encoder.
        self.enc_layers = [
            EncoderLayer(d_model=d_model,
                         num_heads=num_heads,
                         dff=dff,
                         dropout_rate=dropout_rate)
            for _ in range(num_layers)]
        
        # A dropout layer for regularization, applied after the embedding.
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x):
        """Defines the forward pass for the entire Encoder."""
        # 1. Get the embeddings with positional information.
        x = self.pos_embedding(x)

        # 2. Apply dropout to the embeddings.
        x = self.dropout(x)

        # 3. Pass the embeddings through the stack of N encoder layers.
        for i in range(self.num_layers):
            x = self.enc_layers[i](x)

        # The final output is a sequence of context-rich vectors.
        return x

In [None]:
'''
Step 9: Define three classes that build the Decoder: CausalSelfAttention, DecoderLayer, and Decoder.

The Decoder side of the Transformer takes the encoded representation of the source sentence (from the Encoder) and generates the translated sentence word by word.

1. CausalSelfAttention: The "No Cheating" Attention

This is a special type of self-attention used only in the Decoder. Its purpose is to prevent the model from "cheating" during training.

    The Problem: When training the model to predict the next word in a sentence, we must ensure it only uses the words that came before it. 
    For example, to predict the 4th word, it should only be allowed to see words 1, 2, and 3. If it could see word 4 or 5, the task would be trivial and the model wouldn't learn anything.

    The Solution: CausalSelfAttention implements this rule. The key is the use_causal_mask=True argument. This automatically creates a "look-ahead mask" that hides all 
    future tokens in the sequence. For any given word, this mask blocks its attention mechanism from seeing any words that appear later in the sentence. 
    This is why it's called "causal"—it enforces the cause-and-effect flow of time in a sentence.

2. DecoderLayer: The Decoder's Processing Block

This is a single layer of the Decoder. It's more complex than an EncoderLayer because it has three sub-layers instead of two.

    Masked Self-Attention (CausalSelfAttention): First, the Decoder processes its own input sequence (the translation generated so far) using causal self-attention. 
    This allows it to gather context from the words it has already predicted, without peeking at the future.

    Cross-Attention: This is the most important step. The output from the self-attention layer is then passed to a cross-attention layer. 
    This layer takes the context from the Encoder (the representation of the source sentence) and allows the Decoder to decide which words 
    in the source sentence are most relevant for predicting the next word in the translation.

    Feed-Forward Network (ffn): Finally, the output from the cross-attention layer is passed through a standard 
    feed-forward network for further processing, just like in the Encoder.

3. Decoder: The Full Translation Engine

The Decoder is the complete stack of DecoderLayers. It orchestrates the entire translation process.

    Input: It takes two inputs: the target sequence x (the words translated so far) and the context (the output from the Encoder).

    Embedding and Dropout: It starts by applying positional embeddings and dropout to the target sequence x.

    The Stack of dec_layers: It then passes the data through the stack of DecoderLayers. 
    In each layer, the three-step process (masked self-attention, cross-attention, feed-forward) is repeated, progressively refining the translation.

    Output: The final output of the Decoder is a sequence of vectors that is then passed to a final linear layer to be 
    converted into scores (logits) over the vocabulary for the next word.

'''
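
The look-ahead mask described above can be sketched in plain Python. This is an illustration of the idea only: inside the model the mask is built by the attention layer when `use_causal_mask=True` is passed, and the helper names `causal_mask` and `masked_softmax` are made up for this example.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out positions receive zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

mask = causal_mask(4)
scores = [[0.0] * 4 for _ in range(4)]   # equal raw attention scores everywhere
weights = masked_softmax(scores, mask)

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row is one query position. The zeros above the diagonal are the "future" tokens, so the first word can only attend to itself while the last word can attend to the whole sequence.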

In [14]:
class CausalSelfAttention(BaseAttention):
  """
  Implements causal (or "look-ahead masked") self-attention. This is crucial
  for the Decoder to prevent it from "cheating" by looking at future tokens
  in the sequence it's trying to predict.
  """
  def call(self, x):
    # The MultiHeadAttention layer is called with `use_causal_mask=True`.
    # This automatically creates a mask that ensures for any position `i`,
    # the attention mechanism can only see tokens at positions `j <= i`.
    attn_output = self.mha(
        query=x,
        value=x,
        key=x,
        use_causal_mask = True)
    
    # Apply the residual connection and layer normalization.
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x


class DecoderLayer(tf.keras.layers.Layer):
    """
    A single layer of the Transformer Decoder. It consists of three sub-layers:
    1. A Causal Self-Attention mechanism (for the target sequence).
    2. A Cross-Attention mechanism (to attend to the encoder's output).
    3. A Position-wise Feed-Forward Network.
    """
    def __init__(self,
                 *,
                 d_model,
                 num_heads,
                 dff,
                 dropout_rate=0.1):
        super(DecoderLayer, self).__init__()

        # Sub-layer 1: Causal self-attention for the target sequence.
        self.causal_self_attention = CausalSelfAttention(
            num_heads=num_heads,
            key_dim=d_model,
            dropout=dropout_rate)

        # Sub-layer 2: Cross-attention to look at the encoder's output (context).
        self.cross_attention = CrossAttention(
            num_heads=num_heads,
            key_dim=d_model,
            dropout=dropout_rate)

        # Sub-layer 3: The position-wise feed-forward network.
        self.ffn = FeedForward(d_model, dff, dropout_rate)

    def call(self, x, context):
        """Defines the forward pass for the layer."""
        # Pass the input through the causal self-attention layer.
        x = self.causal_self_attention(x=x)
        
        # Pass the result through the cross-attention layer, using the
        # encoder's output as the context.
        x = self.cross_attention(x=x, context=context)

        # Cache the attention scores from the cross-attention layer for visualization.
        self.last_attn_scores = self.cross_attention.last_attn_scores

        # Pass the result through the feed-forward network.
        x = self.ffn(x)
        
        return x

class Decoder(tf.keras.layers.Layer):
    """
    The complete Transformer Decoder, which is a stack of N identical DecoderLayers.
    """
    def __init__(self, *, num_layers, d_model, num_heads, dff, vocab_size,
                 dropout_rate=0.1):
        super(Decoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        # The positional embedding layer for the target sequence.
        self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size,
                                                 d_model=d_model)
        
        # A dropout layer for regularization.
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

        # Create a list containing `num_layers` instances of DecoderLayer.
        self.dec_layers = [
            DecoderLayer(d_model=d_model, num_heads=num_heads,
                         dff=dff, dropout_rate=dropout_rate)
            for _ in range(num_layers)]
        
        # A placeholder to store the attention scores from the last layer.
        self.last_attn_scores = None

    def call(self, x, context):
        """Defines the forward pass for the entire Decoder."""
        # 1. Get the embeddings with positional information for the target sequence.
        x = self.pos_embedding(x)

        # 2. Apply dropout to the embeddings.
        x = self.dropout(x)

        # 3. Pass the data through the stack of N decoder layers.
        # Both the target sequence (x) and the encoder's output (context) are passed.
        for i in range(self.num_layers):
            x = self.dec_layers[i](x, context)

        # 4. Cache the attention scores from the final decoder layer.
        self.last_attn_scores = self.dec_layers[-1].last_attn_scores

        # The final output is a sequence of vectors ready for the final classification layer.
        return x

In [None]:
''' 
Step 10: assembles the Encoder and Decoder into the complete Transformer model. It defines the exact flow of data from the source sentence 
all the way to the final prediction for the translated sentence.

The Transformer class inherits from tf.keras.Model, which is the standard way to create custom, complex models in TensorFlow. It brings together all the pieces we've discussed previously.

The __init__ Method (Building the Model)

The __init__ method is the constructor. Its job is to build and initialize the three main parts of the model:

    The Encoder (self.encoder): An instance of the Encoder class we defined earlier. It is configured with the hyperparameters for the input language 
    (e.g., Portuguese vocabulary size).

    The Decoder (self.decoder): An instance of the Decoder class. It is configured with the hyperparameters for the target language (e.g., English vocabulary size).

    The Final Layer (self.final_layer): This is a standard Dense (fully-connected) layer. Its job is to take the final processed vectors from the Decoder and convert 
    them into a score for every single word in the target vocabulary. The word with the highest score is the model's prediction for the next token in the sequence. 
    This layer is often called the "output projection" or "classification" layer.

The call Method (The Forward Pass)

The call method defines how data flows through the model during training and inference. It's a clear, three-step process:

    Encode the Input: The source sentence (context) is passed into the self.encoder. The encoder processes it and produces a set of context-rich vectors. 
    This encoded context captures the meaning of the entire source sentence.

    Decode with Context: The target sentence (x) and the context from the encoder are both passed into the self.decoder. The decoder uses its self-attention to process x 
    and its cross-attention to look at the context, figuring out how to generate the translation.

    Generate Final Predictions: The output from the decoder is passed to the self.final_layer to produce the final output, called logits. These are the raw, 
    unnormalized prediction scores for each word in the target vocabulary. These logits are then used by the loss function to calculate how "wrong" the model was and to update its weights.
'''
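
The three-step forward pass can also be read as a sequence of tensor shapes. The sketch below only does the shape bookkeeping and runs no model; the function name `transformer_shapes` and the concrete sizes (batch of 64, 40 source tokens, 38 target tokens, a 7000-word target vocabulary) are hypothetical values chosen for illustration.

```python
def transformer_shapes(batch, src_len, tgt_len, d_model, target_vocab_size):
    """Return the shape of each intermediate tensor in the forward pass."""
    return {
        # Step 1: the encoder turns source token IDs into one vector per token.
        "source token IDs": (batch, src_len),
        "encoder output (context)": (batch, src_len, d_model),
        # Step 2: the decoder turns target token IDs into one vector per token,
        # attending to the encoder output along the way.
        "target token IDs": (batch, tgt_len),
        "decoder output": (batch, tgt_len, d_model),
        # Step 3: the final Dense layer maps each vector to one score per word.
        "logits": (batch, tgt_len, target_vocab_size),
    }

shapes = transformer_shapes(batch=64, src_len=40, tgt_len=38,
                            d_model=128, target_vocab_size=7000)
for name, shape in shapes.items():
    print(f"{name:26s} {shape}")
```

Note that the source and target lengths are independent: the output has one row of logits per target position, regardless of how long the source sentence is.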

In [15]:
class Transformer(tf.keras.Model):
    """
    The complete Transformer model, which encapsulates the Encoder, Decoder,
    and the final linear layer.
    """
    def __init__(self, *, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size, dropout_rate=0.1):
        """
        Initializes the Transformer model.

        Args:
          num_layers: The number of layers for both the encoder and decoder.
          d_model: The dimensionality of the model's embeddings.
          num_heads: The number of attention heads.
          dff: The dimensionality of the inner-layer of the FFN.
          input_vocab_size: The size of the source language's vocabulary.
          target_vocab_size: The size of the target language's vocabulary.
          dropout_rate: The dropout rate for regularization.
        """
        super().__init__()
        
        # Instantiate the Encoder component.
        self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                               num_heads=num_heads, dff=dff,
                               vocab_size=input_vocab_size,
                               dropout_rate=dropout_rate)

        # Instantiate the Decoder component.
        self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                               num_heads=num_heads, dff=dff,
                               vocab_size=target_vocab_size,
                               dropout_rate=dropout_rate)

        # The final linear layer that maps the decoder's output to the target
        # vocabulary space, producing the final prediction scores (logits).
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inputs):
        """
        Defines the forward pass for the entire model.

        Args:
          inputs: A tuple containing the source sequence (context) and the
                  target sequence (x).
        """
        # Unpack the inputs. `context` is the source sentence (e.g., Portuguese),
        # and `x` is the target sentence (e.g., English).
        context, x  = inputs

        # 1. Pass the source sentence through the encoder to get its contextual representation.
        context = self.encoder(context)

        # 2. Pass the target sentence and the encoder's output (context) through the decoder.
        x = self.decoder(x, context)

        # 3. Pass the decoder's output through the final linear layer to get the logits.
        logits = self.final_layer(x)

        # Drop any Keras mask attached to the logits so that Keras does not use
        # it to rescale the loss and metrics. Padding is already handled
        # explicitly by masked_loss and masked_accuracy.
        try:
            del logits._keras_mask
        except AttributeError:
            pass

        # Return the final prediction scores (logits).
        return logits

In [None]:
'''
Step 11: Masked loss and accuracy

The Problem: Padding Tokens

In a batch of sentences, each sentence can have a different length. To process them efficiently on a GPU, we need to make them all the same length. 
We do this by adding special padding tokens (represented by the ID 0) to the end of the shorter sentences.

However, we don't want the model to learn to predict these padding tokens. The model's performance should only be measured on the actual words in the sequence. 
This is where "masking" comes in.


'''

In [16]:
# Loss function and metrics
def masked_loss(label, pred):
    """
    Calculates the cross-entropy loss, but intelligently ignores padded tokens.
    """
    # 1. Create a boolean mask. It's `True` for any token that is NOT a
    #    padding token (where the label is not 0) and `False` otherwise.
    mask = label != 0

    # 2. Create the standard loss object. `from_logits=True` is important because
    #    our model outputs raw scores (logits), not probabilities.
    #    `reduction='none'` is crucial: it returns the loss for each token individually.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')

    # 3. Calculate the loss for every single token in the batch.
    loss = loss_object(label, pred)

    # 4. Apply the mask to the loss. The mask is cast to the same dtype as the loss
    #    (so True->1.0, False->0.0) and multiplied. This effectively sets the loss
    #    for all padding tokens to zero.
    mask = tf.cast(mask, dtype=loss.dtype)
    loss *= mask

    # 5. Calculate the final average loss. We sum up the total loss (where padding
    #    loss is zero) and divide by the number of real (non-padded) tokens.
    loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
    return loss

def masked_accuracy(label, pred):
    """
    Calculates the prediction accuracy, but intelligently ignores padded tokens.
    """
    # 1. Get the model's prediction by finding the token with the highest score (logit).
    pred = tf.argmax(pred, axis=2)
    
    # Ensure the label and prediction have the same data type for comparison.
    label = tf.cast(label, pred.dtype)
    
    # 2. Check where the prediction matches the true label.
    match = label == pred
    
    # 3. Create the same padding mask as in the loss function.
    mask = label != 0
    
    # 4. Apply the mask to the matches. A position is now considered a "true match"
    #    only if the prediction was correct AND it was not a padding token.
    match = match & mask
    
    # Cast the boolean tensors to floats for calculation (True->1.0, False->0.0).
    match = tf.cast(match, dtype=tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    
    # 5. Calculate the final accuracy: the total number of correct, non-padded
    #    predictions divided by the total number of non-padded tokens.
    return tf.reduce_sum(match)/tf.reduce_sum(mask)
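
To see the masking arithmetic on concrete numbers, here is a small worked example in plain Python for a single sequence of four tokens, two of which are padding. It mirrors the logic of `masked_loss` and `masked_accuracy` but is a sketch, not a replacement for them; the function name `masked_loss_and_accuracy` and the logits are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def masked_loss_and_accuracy(labels, logits, pad_id=0):
    """Average cross-entropy and accuracy over non-padding positions only."""
    total_loss, correct, real = 0.0, 0, 0
    for label, scores in zip(labels, logits):
        if label == pad_id:
            continue                                  # padding: contributes nothing
        real += 1
        total_loss += -log_softmax(scores)[label]     # cross-entropy for this token
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        correct += int(predicted == label)
    return total_loss / real, correct / real

labels = [2, 1, 0, 0]            # the last two positions are padding (ID 0)
logits = [
    [0.0, 0.0, 5.0],             # predicts 2: correct
    [3.0, 0.0, 0.0],             # predicts 0: wrong, the label is 1
    [9.0, 0.0, 0.0],             # padding position, ignored
    [0.0, 9.0, 0.0],             # padding position, ignored
]

loss, acc = masked_loss_and_accuracy(labels, logits)
print(round(loss, 4), acc)
```

The accuracy is 1 correct out of 2 real tokens. Without the mask, the two padding positions would be counted as well and would distort both numbers.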

In [None]:
'''
Step 12: Define the hyperparameters, optimizer, and training setup

First, a set of hyperparameters is defined. These are the key settings that control the size and shape of the model:

    num_layers: The number of encoder and decoder layers to stack. A deeper model (more layers) can learn more complex patterns but is slower to train.

    d_model: The main dimensionality of the embeddings throughout the model. This is a crucial parameter that affects the model's capacity.

    dff: The "inner" dimension of the feed-forward networks. The expansion to this larger size allows the model to learn more complex features.

    num_heads: The number of parallel attention "heads." More heads allow the model to focus on different parts of the sentence structure simultaneously.

    dropout_rate: The rate for the dropout layers, used to prevent overfitting.

CustomSchedule: This class implements a custom learning rate that changes during training. It follows a "warmup and decay" pattern:

    Warmup: For the first warmup_steps (e.g., 4000 steps), the learning rate starts small and increases linearly. This helps the model stabilize at 
    the beginning of training without making drastic, potentially damaging updates.

    Decay: After the warmup phase, the learning rate decreases proportionally to the inverse square root of the step number. This allows for finer, 
    more precise adjustments as the model gets closer to a solution.

Adam Optimizer: The Adam optimizer is used with the beta_1, beta_2, and epsilon values from the original paper (0.9, 0.98, and 1e-9).

Compilation: Assembling the Model for Training

The transformer.compile() step brings all the pieces together:

    loss=masked_loss: It tells the model to use our custom masked_loss function, which correctly calculates the error while ignoring padding tokens.

    optimizer=optimizer: It assigns the Adam optimizer with our custom learning rate schedule.

    metrics=[masked_accuracy]: It tells the model to track and report the masked_accuracy during training, which gives a true measure of performance on the real words.

'''
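
The warmup-and-decay schedule is easier to picture with numbers. The plain-Python function below is a sketch of the same formula that `CustomSchedule` uses, lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5); the name `transformer_lr` is made up for this example, and the defaults match the demo settings (d_model=128, warmup_steps=4000).

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    """Learning rate at a given training step (warmup, then inverse-sqrt decay)."""
    if step == 0:
        return 0.0                                # the linear warmup term is zero at step 0
    warmup = step * warmup_steps ** -1.5          # grows linearly with the step
    decay = step ** -0.5                          # shrinks as 1 / sqrt(step)
    return d_model ** -0.5 * min(warmup, decay)

for step in (1, 1000, 2000, 4000, 16000, 64000):
    print(f"step {step:6d}: lr = {transformer_lr(step):.7f}")
```

The two terms meet at `step == warmup_steps`, which is where the rate peaks (roughly 0.0014 for these settings). Before the peak, doubling the step doubles the rate; after it, the step must grow fourfold for the rate to halve.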

In [None]:
# Hyperparameters
# For demo purposes only: these values are smaller than those in the paper
# and are likely sub-optimal. Refer to the "Attention Is All You Need" paper.


# The number of stacked encoder or decoder layers. 
# More layers allow the model to learn more complex functions.
num_layers = 4 

# The dimensionality of the input and output vectors for the model. 
# It's the size of the embedding vectors for each token.
d_model = 128  

# The dimensionality of the inner "feed-forward" layer.
# It's a standard practice to set this to 4 * d_model.
dff = 512

# The number of attention heads in the multi-head attention mechanism.
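# The paper uses a per-head size of d_model / num_heads, so d_model is chosen
# to be divisible by num_heads; the attention layers here pass key_dim=d_model.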
num_heads = 8

# The dropout rate, used for regularization to prevent overfitting.
# A value of 0.1 means 10% of neurons are randomly dropped during training.
dropout_rate = 0.1

# Create the Transformer model
transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    dropout_rate=dropout_rate)

# Custom learning rate schedule
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, dtype=tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

# Compile the model
transformer.compile(
    loss=masked_loss,
    optimizer=optimizer,
    metrics=[masked_accuracy])

# Train the model
transformer.fit(train_batches,
                epochs=10,
                validation_data=val_batches)

In [None]:
''' 
Original paper: "Attention Is All You Need"
The models developed in this paper are listed below. The demo is a smaller version of them.

1. Base Model

This was the main architecture discussed and used for most experiments.

    Number of Layers (N): 6 identical layers in both the encoder and decoder stacks.

    Model Dimension (d_model): 512 for the embeddings and all sub-layer outputs.

    Number of Attention Heads (h): 8 parallel attention heads.

    Feed-Forward Layer Dimension (d_ff): 2048 for the inner-layer of the feed-forward network.

    Dropout Rate: 0.1 was applied to the output of each sub-layer before it was added to the sub-layer input.

2. Big Model

This larger version was used to achieve the state-of-the-art results on the WMT-14 English-to-German translation task.

    Number of Layers (N): 6 (This remained the same as the base model).

    Model Dimension (d_model): 1024

    Number of Attention Heads (h): 16

    Feed-Forward Layer Dimension (d_ff): 4096

    Dropout Rate: Increased to 0.3 for this specific translation task.

3. Training Parameters

Across both models, the training setup included:

    Optimizer: They used the Adam optimizer with β₁ = 0.9, β₂ = 0.98, and ε = 10⁻⁹.

    Learning Rate: A custom learning rate scheduler was used, which increased the rate linearly for the first 4000 warm-up steps and 
    then decreased it proportionally to the inverse square root of the step number.


 '''
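
To put the demo settings next to the paper's two configurations, the short sketch below computes a few derived quantities from the numbers listed above: the dff / d_model expansion ratio, the paper-style per-head size d_model / num_heads, and the number of weights in the two Dense layers of a single feed-forward block. The helper `ffn_params` is made up for this example, and the count covers the Dense layers only, not layer normalization.

```python
configs = {
    "demo": dict(num_layers=4, d_model=128, num_heads=8, dff=512),
    "base": dict(num_layers=6, d_model=512, num_heads=8, dff=2048),
    "big": dict(num_layers=6, d_model=1024, num_heads=16, dff=4096),
}

def ffn_params(d_model, dff):
    """Weights plus biases of the two Dense layers in one feed-forward block."""
    expand = d_model * dff + dff          # Dense(dff): kernel + bias
    project = dff * d_model + d_model     # Dense(d_model): kernel + bias
    return expand + project

for name, c in configs.items():
    ratio = c["dff"] // c["d_model"]
    per_head = c["d_model"] // c["num_heads"]   # per-head size as defined in the paper
    print(name, ratio, per_head, ffn_params(c["d_model"], c["dff"]))
# demo 4 16 131712
# base 4 64 2099712
# big 4 64 8393728
```

All three keep the same 4x expansion in the feed-forward block. The per-head column follows the paper's convention; the attention layers in this notebook are constructed with `key_dim=d_model` instead.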

In [None]:
# Train the model
transformer.fit(train_batches,
                epochs=10,
                validation_data=val_batches)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fdc4c3fa530>

In [21]:
# Continue training for 20 more epochs (fit resumes from the current weights)
transformer.fit(train_batches,
                epochs=20,
                validation_data=val_batches)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fdc4c3f9990>

In [22]:
# Continue training for 12 more epochs (fit resumes from the current weights)
transformer.fit(train_batches,
                epochs=12,
                validation_data=val_batches)

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


<keras.callbacks.History at 0x7fdc4c2f6890>

In [None]:
''' 
Step 13: Define the transformer evaluation model

The `Translator` class below handles inference with the Transformer model inside a performant `tf.function`. 
Here's a breakdown of the key concepts:

1. The Autoregressive Loop
The core of translation is the `for` loop. This process is called autoregressive decoding because the model's prediction at each step 
is fed back into itself to generate the next prediction.

Why it's needed: Language is sequential. To predict the word "cat," you need to know that the previous words were "the black...". 
The loop mimics this by building the output sequence one token at a time, using its own previous predictions to inform the next one.
How it works:
    1.  Start: The decoder is given the `[START]` token.
    2.  Predict: The model predicts the first word (e.g., "this").
    3.  Append: The decoder's input is now `[START]`, "this".
    4.  Predict: The model predicts the second word (e.g., "is").
    5.  Repeat: The input becomes `[START]`, "this", "is", and so on, until the model predicts the `[END]` token or hits the `max_length`.

2. `tf.function` and `tf.TensorArray`
Why it's needed: Running a loop in standard Python ("eager mode") is very slow because each step involves communication between 
Python and the TensorFlow backend. The `@tf.function` decorator traces the Python code and compiles it into a highly optimized,
static TensorFlow graph that runs much faster.

The Challenge: A standard Python list (`my_list = []`) cannot be modified inside a compiled graph loop. `tf.TensorArray` is the 
TensorFlow-native equivalent that is designed specifically for this purpose. It allows you to dynamically build up a tensor inside a graph loop.

3. Recalculating Attention Weights (The `InaccessibleTensorError` Fix)
This is the most complex but important part of the code.

Why it's needed: The `@tf.function` turns the `for` loop into a `tf.while_loop` operation in its graph. This `while_loop` has its own internal **scope**. 
Any tensor created inside that scope (like the `attention_weights` at each step) is temporary and cannot be accessed from outside the loop once it finishes. 
Trying to access `self.transformer.decoder.last_attn_scores` after the loop would be pointing to a tensor that no longer exists in the graph's main scope, 
causing the `InaccessibleTensorError`.
The Solution: The fix is to run the model one more time after the loop is complete.
    1.  We take the final, complete `output` sequence that was generated.
    2.  We perform a full forward pass with this `output`.
    3.  Since this forward pass happens outside the `while_loop`'s scope, the `attention_weights` it generates are now accessible in the main function's scope and can be safely returned.
    4.  We use `output[:, :-1]` as the decoder input, which lets us see the attention the model was paying when it was about to predict the final token.

 '''
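
The autoregressive loop can be sketched without TensorFlow. In the example below, `toy_next_token_scores` is a hypothetical stand-in for the trained Transformer: it simply "translates" by copying the source tokens and then emitting the end token. The token IDs chosen for `START` and `END` are also assumptions made for the illustration. Only the shape of the loop carries over to the real `Translator`.

```python
START, END = 2, 3   # hypothetical IDs for the [START] and [END] tokens

def toy_next_token_scores(source, generated):
    """Stand-in for the model: one score per vocabulary ID for the next token."""
    vocab_size = 10
    position = len(generated) - 1          # number of tokens produced after [START]
    target = source[position] if position < len(source) else END
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]

def greedy_decode(source, next_token_scores, max_length=20):
    """Build the output one token at a time, feeding each prediction back in."""
    output = [START]                       # the decoder input begins with [START]
    for _ in range(max_length):
        scores = next_token_scores(source, output)
        next_token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output.append(next_token)          # the prediction becomes part of the input
        if next_token == END:              # stop once [END] is produced
            break
    return output

print(greedy_decode([5, 7, 4], toy_next_token_scores))
# [2, 5, 7, 4, 3]
```

Picking the highest-scoring token at every step is greedy decoding. The Python list used here is what `tf.TensorArray` replaces once the loop runs inside a `tf.function`.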

In [32]:
class Translator(tf.Module):
  """
  A tf.Module that encapsulates the trained Transformer model for inference.
  It provides a clean interface to translate a sentence from Portuguese to English.
  """
  def __init__(self, tokenizers, transformer):
    """
    Initializes the Translator.

    Args:
      tokenizers: The loaded tokenizers object containing .pt and .en sub-models.
      transformer: The trained and compiled Transformer Keras model.
    """
    self.tokenizers = tokenizers
    self.transformer = transformer

  # This decorator compiles the Python function into a high-performance,
  # static TensorFlow graph. This is crucial for speed during inference.
  # Without it, the loop would run in slow Python "eager mode".
  @tf.function
  def __call__(self, sentence, max_length=MAX_TOKENS):
    """
    The main translation method. It takes a raw sentence and performs
    autoregressive decoding to generate the translation.

    Args:
      sentence: A scalar tf.Tensor of type string (the Portuguese sentence).
      max_length: The maximum number of tokens to generate for the translation.
    """
    # --- 1. PREPARE THE INPUTS ---

    # The model expects a batch of sentences, so we add a batch dimension
    # to the single input sentence. Shape: () -> (1,).
    assert isinstance(sentence, tf.Tensor)
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]

    # Tokenize the Portuguese input sentence, converting it from text to a
    # sequence of integer IDs. The `.to_tensor()` call pads the batch.
    # This becomes the input to the Encoder.
    encoder_input = self.tokenizers.pt.tokenize(sentence).to_tensor()

    # --- 2. INITIALIZE THE DECODER'S INPUT ---

    # Get the special [START] and [END] token IDs from the English tokenizer.
    # These are essential for controlling the generation process.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis] # The [START] token ID.
    end = start_end[1][tf.newaxis]   # The [END] token ID.

    # The decoder's input sequence starts with only the [START] token.
    # We use tf.TensorArray to build the output sequence dynamically.
    # This is a special TensorFlow-aware data structure that can be written to
    # inside a tf.function's loop. A standard Python list would not work here.
    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)

    # --- 3. THE AUTOREGRESSIVE DECODING LOOP ---

    # This loop generates the translation one token at a time.
    for i in tf.range(max_length):
      # Get the sequence generated so far and prepare it as the decoder's input.
      output = tf.transpose(output_array.stack())

      # Run a forward pass through the entire Transformer model.
      # The encoder processes the Portuguese sentence, and the decoder uses that
      # context along with the English translation generated so far (`output`).
      predictions = self.transformer([encoder_input, output], training=False)

      # We only care about the prediction for the very last token in the sequence.
      # Shape: (batch_size, seq_len, vocab_size) -> (batch_size, 1, vocab_size).
      predictions = predictions[:, -1:, :]

      # Select the token with the highest probability (logit score).
      # This is a "greedy search" decoding strategy.
      predicted_id = tf.argmax(predictions, axis=-1)

      # Write the newly predicted token ID to our output array. This token will
      # be part of the decoder's input in the next iteration.
      output_array = output_array.write(i+1, predicted_id[0])

      # If the model predicts the [END] token, we can stop generating.
      if predicted_id == end:
        break

    # --- 4. FINALIZE OUTPUTS ---

    # Stack all the generated token IDs into a final tensor.
    output = tf.transpose(output_array.stack())
    # Convert the token IDs back into a human-readable text string.
    text = self.tokenizers.en.detokenize(output)[0]

    # Also, convert the token IDs into their string representations for inspection.
    tokens = self.tokenizers.en.lookup(output)[0]

    # --- 5. RECALCULATE ATTENTION WEIGHTS (CRITICAL FIX) ---

    # Because this function is a @tf.function, tensors created inside the `for`
    # loop (which becomes a `tf.while_loop`) are in a different graph scope
    # and cannot be accessed after the loop finishes.
    # Therefore, we must run the model one final time *outside* the loop
    # with the complete generated sequence to get the final attention weights.
    # We pass `output[:, :-1]` as the decoder input because the attention weights
    # are calculated based on what the model used to predict the *next* token.
    self.transformer([encoder_input, output[:,:-1]], training=False)
    attention_weights = self.transformer.decoder.last_attn_scores

    return text, tokens, attention_weights
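
The decoding loop in `Translator.__call__` is a greedy search: at each step it keeps only the highest-scoring token and feeds the extended sequence back in. A minimal pure-Python sketch of the same idea follows. It is not the notebook's implementation; `greedy_decode` and `toy_scores` are hypothetical names, and `toy_scores` is a stand-in for the Transformer that simply prefers the next integer ID.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Greedy autoregressive decoding over integer token IDs.

    next_token_scores is a hypothetical stand-in for the model: it takes the
    tokens generated so far and returns one score per vocabulary entry.
    """
    output = [start_id]
    for _ in range(max_length):
        scores = next_token_scores(output)
        # Greedy choice: index of the highest score (the argmax).
        predicted_id = max(range(len(scores)), key=scores.__getitem__)
        output.append(predicted_id)
        # Stop as soon as the end token is produced.
        if predicted_id == end_id:
            break
    return output


def toy_scores(tokens, vocab_size=5):
    """Toy 'model': always prefers the ID one above the last token seen."""
    preferred = min(tokens[-1] + 1, vocab_size - 1)
    return [1.0 if i == preferred else 0.0 for i in range(vocab_size)]


decoded = greedy_decode(toy_scores, start_id=0, end_id=4, max_length=10)
print(decoded)  # [0, 1, 2, 3, 4]
```

As in the class above, `max_length` bounds the number of generated tokens, so a model that never emits the end token still terminates.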

In [33]:
translator = Translator(tokenizers, transformer)

def print_translation(sentence, tokens, ground_truth):
  print(f'{"Input:":15s}: {sentence}')
  print(f'{"Prediction":15s}: {tokens.numpy().decode("utf-8")}')
  print(f'{"Ground truth":15s}: {ground_truth}')

In [34]:
sentence = 'este é um problema que temos que resolver.'
ground_truth = 'this is a problem we have to solve .'

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Input:         : este é um problema que temos que resolver.
Prediction     : this is a problem that we have to solve .
Ground truth   : this is a problem we have to solve .


In [35]:
sentence = 'os meus vizinhos ouviram sobre esta ideia.'
ground_truth = 'and my neighboring homes heard about this idea .'

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Input:         : os meus vizinhos ouviram sobre esta ideia.
Prediction     : my neighbors have heard about this idea .
Ground truth   : and my neighboring homes heard about this idea .


In [36]:
sentence = 'vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.'
ground_truth = "so i'll just share with you some stories very quickly of some magical things that have happened."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Input:         : vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.
Prediction     : so i ' m going to very quickly share with you some stories of some magical things that happened .
Ground truth   : so i'll just share with you some stories very quickly of some magical things that have happened.


In [37]:
tf.saved_model.save(translator, export_dir='translator')



INFO:tensorflow:Assets written to: translator/assets




Use the Transformer saved model for inference