# Machine Translation with Transformer
Machine translation is a key milestone in the development of Natural Language Processing (NLP). The Transformer model, introduced by Vaswani et al. (2017), was a major revolution because it replaced RNN and CNN-based architectures with a more efficient attention mechanism.

In this notebook, we will build a Transformer model from scratch to automatically translate from English to Spanish. This approach does not use a pre-trained model, but instead implements the core components of the Transformer, from positional encoding, multi-head attention, encoder-decoder layers, to the training process.

The main focus is not only on the translation results, but also on the implementation capabilities and understanding of the Transformer's internal mechanisms. This is relevant because the Transformer is the foundation of various modern models such as BERT, GPT, and ChatGPT, making mastering its basic architecture essential for both Machine Learning Engineers and Data Scientists.

The dataset source is www.manythings.org/anki

## Sequence-to-sequence learning

### Import Library

In [2]:
import os
import re
import pathlib
import random
import string
import zipfile
import numpy as np
import tensorflow as tf
from nltk.translate.bleu_score import sentence_bleu
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Bidirectional,GRU,LSTM,Embedding
from tensorflow.keras.layers import Dense,MultiHeadAttention,LayerNormalization,Embedding,Dropout,Layer
from tensorflow.keras import Sequential,Input
from tensorflow.keras.callbacks import ModelCheckpoint

### Download and Extract Dataset

#### Download ZIP (no auto-extract)

In [3]:
path_to_zip = keras.utils.get_file(
    "spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=False)

#### Manual extraction

In [4]:
with zipfile.ZipFile(path_to_zip, 'r') as zip_ref:
    zip_ref.extractall(pathlib.Path(path_to_zip).parent)

text_file = pathlib.Path(path_to_zip).parent / "spa-eng" / "spa.txt"

#### View all extracted files

In [5]:
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]

### Prepare Text Pairs

Each line contains an English sentence and a corresponding Spanish sentence. The English sentence is the source sequence, and the Spanish sentence is the target sequence. We add a "[start]" token at the front and an "[end]" token at the end of the Spanish sentence.

In [6]:
# Start and end indicate when the translation of the word/phrase starts and stops.

text_pairs = []
for line in lines:
    english, spanish = line.split("\t")
    spanish = "[start] " + spanish + " [end]"
    text_pairs.append((english, spanish))

In [7]:
print("Number of pairs:", len(text_pairs))
print("Example text:", text_pairs[2900])

Number of pairs: 118964
Example text: ("I'll be fine.", '[start] Estaré bien. [end]')


Printing random text pairs English - Spanish

In [8]:
import random
print(random.choice(text_pairs))

('Where can I buy envelopes?', '[start] ¿Dónde puedo comprar sobres? [end]')


### Split Data

- train_pairs: initial 70% of the data
- val_pairs: next 15%
- test_pairs: last 15%
- 70% training: large enough to learn patterns.
- 15% validation: sufficient for hyperparameter tuning.
- 15% test: sufficient to measure the model's generalization performance.

#### Determine the amount of validation data 15% of the total data

In [9]:
random.shuffle(text_pairs)

num_val_samples = int(0.15 * len(text_pairs))

#### Taking the remaining data for training, after setting aside 15% for validation and 15% for testing.

In [10]:
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

In [11]:
print("Number of train data:", len(train_pairs))
print("Number of validation data:", len(val_pairs))
print("Number of test data:", len(test_pairs))

Number of train data: 83276
Number of validation data: 17844
Number of test data: 17844


### Preprocessing and Vectorization

* string.punctuation is a Python built-in that contains all common punctuation characters.
* Added + "¿" because the "¿" (inverted question mark) symbol is the opening question mark in Spanish (e.g., ¿Cómo estás?).
* Creates a list of all characters considered "punctuation to be removed or cleared from the text."

In [12]:
def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(string.punctuation)}¿]", "")

In [13]:
vocab_size = 15000

This means that only the 15,000 most frequently occurring words will be used in the dictionary (vocab). Next, determine the maximum length of the token sequence after the text is tokenized.

In [14]:
sequence_length = 20 

* Vectorizes input (English)
* Used for encoders, which only need plain text sequences (without special tokens like [start] / [end]).

In [15]:
source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

* Vectorizes the target (Spanish)
* output_sequence_length=20: Since the target has [start] and [end] tokens, the target is 1 token longer than the input.
* standardize=custom_standardization: Uses a custom text cleaning function (e.g., removes certain punctuation but keeps [start] and [end]).

In [16]:
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)

#### Extract all English sentences from training data pairs.

In [17]:
train_english_texts = [pair[0] for pair in train_pairs]

#### Extract all Spanish sentences (with [start] and [end])

In [18]:
train_spanish_texts = [pair[1] for pair in train_pairs]

#### Train the TextVectorization layer to build a vocabulary based on the training text.
`.adapt()` is essential for vectorization to recognize common words and efficiently construct the vocabulary.

In [19]:
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)

### Dataset Format

In [20]:
batch_size = 64

Meaning: The model will process 64 sentence pairs (English–Spanish) in one step.

In [21]:
def format_dataset(eng, spa):
    eng = source_vectorization(eng) # Input tokenization (English)
    spa = target_vectorization(spa) # Target tokenization (Spanish)
    return ({"english": eng, "spanish": spa[:, :-1]}, spa[:, 1:])
    # All target tokens EXCEPT the last or EXCEPT the first

def make_dataset(pairs):
    # Separate the pairs (eng, spa) into 2 lists
    eng_texts, spa_texts = zip(*pairs) 
    
    # Create a tf.data.Dataset from text pairs
    dataset = tf.data.Dataset.from_tensor_slices((list(eng_texts), list(spa_texts)))
    dataset = dataset.batch(batch_size) # Split data into batches of size 64
    dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE) # Batch content format with dataset format
    return dataset.shuffle(2048).prefetch(tf.data.AUTOTUNE).cache() # Store the transformed data in memory for efficiency.

In [22]:
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [23]:
train_ds

<CacheDataset element_spec=({'english': TensorSpec(shape=(None, 20), dtype=tf.int64, name=None), 'spanish': TensorSpec(shape=(None, 20), dtype=tf.int64, name=None)}, TensorSpec(shape=(None, 20), dtype=tf.int64, name=None))>

In [24]:
val_ds

<CacheDataset element_spec=({'english': TensorSpec(shape=(None, 20), dtype=tf.int64, name=None), 'spanish': TensorSpec(shape=(None, 20), dtype=tf.int64, name=None)}, TensorSpec(shape=(None, 20), dtype=tf.int64, name=None))>

In [25]:
for inputs, targets in train_ds.take(1):
    print(f"english {inputs['english']},\n\n\n inputs['english'].shape: {inputs['english'].shape}")
    print(f"spanish {inputs['spanish']},\n\n\n inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets {targets}, \n\n\n targets.shape: {targets.shape}")

english [[1439   13 1143 ...    0    0    0]
 [   3  214    7 ...    0    0    0]
 [  87   51    5 ...    0    0    0]
 ...
 [  15    5   17 ...    0    0    0]
 [   6  319  493 ...    0    0    0]
 [   6 1186    7 ...    0    0    0]],


 inputs['english'].shape: (64, 20)
spanish [[   2   35  287 ...    0    0    0]
 [   2  434   18 ...    0    0    0]
 [   2   83  523 ...    0    0    0]
 ...
 [   2   94   18 ...    0    0    0]
 [   2    8 7995 ...    0    0    0]
 [   2    8   26 ...    0    0    0]],


 inputs['spanish'].shape: (64, 20)
targets [[  35  287  113 ...    0    0    0]
 [ 434   18    1 ...    0    0    0]
 [  83  523   28 ...    0    0    0]
 ...
 [  94   18  531 ...    0    0    0]
 [   8 7995   84 ...    0    0    0]
 [   8   26 1367 ...    0    0    0]], 


 targets.shape: (64, 20)


## Sequence-to-sequence learning with Transformer

### Positional Embedding
This layer combines word/token embeddings with positional embeddings so that the transformer model can understand the order of words in a sentence. Positional Embedding ensures that the transformer knows not only what words are present, but also their order in the sentence. Without this component, the transformer would simply see words as an unordered "bag of words."

In [26]:
class PositionalEmbedding(Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        
        # Translates ID tokens (numbers) into a vector of dimension output_dim.
        self.token_embeddings = Embedding(
            input_dim=input_dim, output_dim=output_dim, mask_zero=True  # aktifkan masking otomatis!
        )
        
        #intermediate = self.getPositionEncoding(seq_len=input_dim,d=vocab_size,n=output_dim)
        #Positions 0, 1, 2, ..., sequence_length-1 are embedded like regular tokens.
        #Difference: The input here is a sequence of positions, not word IDs.
        
        self.position_embeddings = Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        # Use fixed length of self.sequence_length
        positions = tf.range(start=0, limit=tf.shape(inputs)[-1], delta=1)
        positions = self.position_embeddings(positions) # (1, sequence_length)
        x = self.token_embeddings(inputs)
        return x + positions

    def get_config(self):
        config = super().get_config()
        config.update({
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
            "output_dim": self.output_dim,
        })
        return config

### Transformer Encoder
The `TransformerEncoder` is the encoder block of the transformer architecture, which captures the relationships between words in a sentence using self-attention and a feed-forward network.

`TransformerEncoder = Self-Attention + Feed-Forward + Residual + Normalization`\
Its function is to build a contextual representation of words → each word does not stand alone, but understands its relationship to other words in the sentence.

In [27]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads):
        super().__init__()
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        attention_output = self.attention(inputs, inputs, attention_mask=None)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

### Transformer Decoder
The TransformerDecoder is responsible for generating an output sequence based on the representation provided by the encoder, while ensuring that the generation process is autoregressive (only looking at previous tokens).

In short, the decoder combines two aspects:
* Autoregressive generation (only looking at the past through masked self-attention).
* The global context of the input (through cross-attention to the encoder).

In [28]:
class TransformerDecoder(layers.Layer):
    # A constructor containing multihead plus masked self attention
    def __init__(self, embed_dim, dense_dim, num_heads):
        super().__init__()
        
        # Masked self-attention: only looks at the previous token, so as not to "peek into the future".
        self.attention_1 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        
        # Cross-attention: connects the output of the encoder to the decoder (input from the source language).
        self.attention_2 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        
        # Feedforward network: 2 Dense layers as non-linear processing.
        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        
        # Normalization at each stage (after residual connection).
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()

    # Create a lower triangle mask (causal mask) of size [batch, seq_len, seq_len]
    # Prevents the i-th position from looking at the j-th token > i.
    # It is used so that when training the decoder model it does not know the next token (future blindness).
    def get_causal_attention_mask(self, inputs):
        i = tf.range(tf.shape(inputs)[1])[:, tf.newaxis]
        j = tf.range(tf.shape(inputs)[1])
        mask = tf.cast(i >= j, dtype="int32")
        return tf.reshape(mask, (1, tf.shape(inputs)[1], tf.shape(inputs)[1]))

    def call(self, inputs, encoder_outputs, mask=None):
        
        # Create a mask for self-attention, so as not to “peek ahead”.
        causal_mask = self.get_causal_attention_mask(inputs)
        
        # Masked Self-Attention
        # Focus only on the previous or current token. Add residual + layernorm
        attention_output_1 = self.attention_1(inputs, inputs, attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        
        # Cross Attention (Decoder attends to Encoder)
        # Connecting the target representation with the representation generated by the encoder (e.g. from an English sentence).
        attention_output_2 = self.attention_2(attention_output_1, encoder_outputs, encoder_outputs)
        attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
        
        # FFN strengthens the token representation. Then, residual connection and layer normalization are performed.
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

### Build a Transformer Model

#### Determines the size of the representation vector for each token.

In [29]:
embed_dim = 256

#### Determines the size of the hidden layer in the FFN block in the encoder and decoder

In [30]:
dense_dim = 2048

#### Determines the size of the number of parallel attention heads in the attention layer.

In [31]:
num_heads = 8

#### Defines the input for the source language (English) in the form of a sequence of ID tokens, with dynamic length (None).

In [32]:
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")

#### English token representation that takes into account word order/position.

In [33]:
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)




#### Full context representation for the entire input sentence (English).

In [34]:
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

#### Input is an ID token for the target language (Spanish), used during training.

In [35]:
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)

In [36]:
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [37]:
transformer.summary()

#### Model Interpretation
1. Input Layer
    * english (InputLayer) → (None, None)\
      Input English text sequence (sequence length is flexible).
    * spanish (InputLayer) → (None, None)\
      Input Spanish text sequence (target sequence).
2. Positional Embedding
    * positional_embedding (english) → (None, None, 256) | 3,845,120 parameters
    * positional_embedding (spanish) → (None, None, 256) | 3,845,120 parameters
    * Each token (both input and target) is converted into a 256-dimensional embedding vector, then positional information is added so the model knows the word order.
    * Parameter = vocab_size * embedding_dim (here approximately 15k * 256 = 3.8 million).
3. Transformer Encoder
    * transformer_encoder → (None, None, 256) | 3,155,456 parameters
    * Accepts an English embedding input and processes it through:
      - Multi-Head Self-Attention
      - Feedforward Network (dense projection)
      - Residual + Normalization
    * The output is a contextual representation of English.
4. Transformer Decoder
    * transformer_decoder → (None, None, 256) | 5,259,520 parameters
    * Accepts:
        - Target sequence embedding (Spanish)
        - Encoder representation (English)
    * The process includes:
        - Masked Self-Attention (only allows the previous token to be seen in the output).
        - Cross-Attention (connects the target to the encoder representation).
        - Feedforward Network.
    * The output is a representation of the target token that already “knows the context of the previous input + output.”
5. Dropout
    * dropout_3 → (None, None, 256)\
      Regularization layer to prevent overfitting.
6. Dense (Output Layer)
    * dense_4 (Dense) → (None, None, 15000) | 3,855,000 parameters
    * A linear layer that maps the 256-dimensional model to the target vocabulary size (15k).
    * Each target token is projected onto a word probability distribution via softmax.
7. Total Parameters
    * Total parameters = ~20 million (76 MB)
    * All trainable (no frozen parameters).
8. Core Architecture
    * This model is a seq2seq Transformer for machine translation (English → Spanish) with:
        - 256-dimensional embedding
        - Encoder + Decoder stack
        - Target vocabulary size 15k
        - Total 20 million parameters

In [38]:
#from tensorflow.keras.utils import plot_model
#from IPython.display import Image

#plot_model(transformer, to_file='transformer.png', show_shapes=True)
#Image(filename='transformer.png')

The lighter setting (faster training) will be named vanguard and the initial setting will be named transformer

## Build a Vanguard Model

### Determine the initial configuration

In [39]:
embed_dim = 128
dense_dim = 512
num_heads = 4
sequence_length = 20
batch_size = 32

### Defines the input for the source language (English) in the form of a sequence of ID tokens, with dynamic length (None).

In [40]:
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")

### English token representation that takes into account word order/position.

In [41]:
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)

### Full context representation for the entire input sentence (English).

In [42]:
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

### Input is an ID token for the target language (Spanish), used during training.

In [43]:
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)

In [44]:
vanguard = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [45]:
vanguard.summary()

### Model Interpretation
1. Input Layer
    * english (InputLayer): accepts input in the form of English sequence tokens.
    * spanish (InputLayer): accepts input in the form of Spanish sequence tokens.
    * Both are of the form (None, None) → meaning the sequence length is flexible, and the batch size is also flexible.
2. Positional Embedding
    * english → positional_embedding:
        - Token embedding + positional embedding, resulting in a 128-dimensional representation.
        - Number of parameters: 1,922,560
        - → This is derived from the vocabulary size (input_dim) × output_dim (128).
    * spanish → positional_embedding_1:
        - Same as above, for Spanish.
        - The parameters are also 1,922,560.
        - These two embeddings ensure that both English and Spanish tokens have a vector representation + positional information in the sequence.
3. Transformer Encoder
    * Input: English embedding.
    * Process: self-attention + feedforward network + residual + norm layer.
    * Output: 128-dimensional English contextual representation.
    * Parameters: 396,032 (quite light because embed_dim is only 128 and the number of heads is limited).
4. Transformer Decoder
    * Main input: Spanish embedding.
    * Additional input: the output from the encoder (English).
    * Process: masked self-attention for the decoder, cross-attention to the encoder output, then feedforward network.
    * Output: Spanish representation connected to the English context.
    * Parameters: 660,096 (larger than the encoder, due to additional cross-attention).
5. Dropout
    * dropout_7: maintains generalization, prevents overfitting.
6. Dense Output Layer
    * dense_9 (Dense, 15,000 units):
    * Output is a probability distribution over 15,000 words (assuming target vocabulary size = 15k).
    * Parameters: 1,935,000 (128 × 15,000 + bias).
7. Total Parameters
    * 6,836,248 (~6.8M parameters)
    * Very lightweight compared to standard Transformer models (for example, the original Transformer can have hundreds of millions of parameters).
    * It's reasonable to call it "vanguard" because it's compact and easy to train with limited resources.
8. Concise Interpretation
    * This vanguard model is a seq2seq Transformer for English → Spanish translation.
    * The encoder understands English sentences.
    * The decoder generates Spanish sentences based on the encoder's output.
    * The small embedding size (128) and limited vocabulary (15k) make this model lightweight, with only 6.8 million parameters.
    * This makes the model suitable for academic experiments, portfolios, or training on laptops without large GPUs, while still maintaining the typical Transformer (encoder–decoder) architecture.

In [46]:
#from tensorflow.keras.utils import plot_model
#plot_model(vanguard, to_file='vanguard.png', show_shapes=True)
#from IPython.display import Image
#Image("vanguard.png")

## Compile and Training

#### Transformer model

In [47]:
transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

checkpoint = keras.callbacks.ModelCheckpoint(
    filepath="language_translation_checkpoint.weights.h5",
    save_weights_only=True,
    verbose=1,
    monitor="val_accuracy"
)

transformer.fit(train_ds, epochs=5, validation_data=val_ds, callbacks=[checkpoint])

Epoch 1/5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1s/step - accuracy: 0.7119 - loss: 2.2096   
Epoch 1: saving model to language_translation_checkpoint.weights.h5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1919s[0m 1s/step - accuracy: 0.7410 - loss: 1.7738 - val_accuracy: 0.8032 - val_loss: 1.2309
Epoch 2/5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1s/step - accuracy: 0.8128 - loss: 1.1956   
Epoch 2: saving model to language_translation_checkpoint.weights.h5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1915s[0m 1s/step - accuracy: 0.8258 - loss: 1.1047 - val_accuracy: 0.8527 - val_loss: 0.8826
Epoch 3/5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1s/step - accuracy: 0.8513 - loss: 0.9121   
Epoch 3: saving model to language_translation_checkpoint.weights.h5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1909s[0m 1s/step - accuracy: 0.8558 - loss: 0.8783 - va

<keras.src.callbacks.history.History at 0x2535baf4830>

#### Vanguard model

In [48]:
vanguard.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

checkpoint = keras.callbacks.ModelCheckpoint(
    filepath="language_translation_checkpoint.weights.h5",
    save_weights_only=True,
    verbose=1,
    monitor="val_accuracy"
)

vanguard.fit(train_ds, epochs=5, validation_data=val_ds, callbacks=[checkpoint])

Epoch 1/5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 442ms/step - accuracy: 0.7038 - loss: 2.5846  
Epoch 1: saving model to language_translation_checkpoint.weights.h5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m636s[0m 475ms/step - accuracy: 0.7359 - loss: 1.9178 - val_accuracy: 0.7923 - val_loss: 1.3261
Epoch 2/5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 441ms/step - accuracy: 0.7992 - loss: 1.3207  
Epoch 2: saving model to language_translation_checkpoint.weights.h5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m615s[0m 472ms/step - accuracy: 0.8105 - loss: 1.2402 - val_accuracy: 0.8381 - val_loss: 1.0146
Epoch 3/5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 442ms/step - accuracy: 0.8354 - loss: 1.0641  
Epoch 3: saving model to language_translation_checkpoint.weights.h5
[1m1302/1302[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m616s[0m 473ms/step - accuracy: 0.8403 - loss:

<keras.src.callbacks.history.History at 0x25329175e50>

#### Training Performance Comparison
1. Initial Training (Epoch 1)
    * Transformer: train acc 74%, val acc 80% (quite high from the start → fast adaptation).
    * Vanguard: train acc 73%, val acc 79% (slightly lower, but still good).
2. Progress Over 5 Epochs
    * Transformer: consistently increased to a val acc of 87.7% with a val loss of 0.71.
    * Vanguard: also steadily increased but slightly lower, with a val acc of 86.6% with a val loss of 0.83.
    * Transformer has a +1% advantage in validation accuracy and lower val loss → better generalization.
3. Training Speed
    * Transformer: 1900 seconds/epoch (32 minutes).
    * Vanguard: 616 seconds/epoch (10 minutes).
    * Vanguard is 3x faster, but with the trade-off of slightly lower accuracy.
#### Interpretation
* Transformer is indeed heavier (self-attention complexity, deeper embedding, etc.), but produces more accurate and stable results.
* Vanguard is lighter and faster to train, suitable for resource-constrained environments, but there is a trade-off in final accuracy.
* The accuracy gap is not too large (87.7% vs. 86.6%), so the choice depends on the context of use:
    - If absolute accuracy is critical (e.g., production-level machine translation) → choose Transformer.
    - If time and resources are constraints (e.g., for prototyping, edge devices, or rapid iteration) → Vanguard is more efficient.
* These results demonstrate Transformer's superiority not only as a robust model for natural language, but also as a very solid baseline.
* Vanguard, as a lighter variant, can be positioned as an alternative trade-off between accuracy and speed.

## Evaluate BLEU Score

#### Transformer model

In [52]:
# Assume test_pairs = [(eng, spa), ...]
def decode_sequence(input_sentence):
    tokenized_input = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(sequence_length):
        tokenized_target = target_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer.predict({"english": tokenized_input, "spanish": tokenized_target}, verbose=0)
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = target_vectorization.get_vocabulary()[sampled_token_index]
        if sampled_token == "[end]":
            break
        decoded_sentence += " " + sampled_token
    return decoded_sentence.replace("[start] ", "")

# BLEU Evaluation
for _ in range(5):
    input_text, target_text = random.choice(test_pairs)
    prediction = decode_sequence(input_text)
    reference = [target_text.replace("[start] ", "").replace(" [end]", "").split()]
    candidate = prediction.split()
    bleu_score = sentence_bleu(reference, candidate)
    print(f"Input: {input_text}\nPrediction: {prediction}\nTarget: {reference}\nBLEU: {bleu_score:.4f}\n")

### Save Final Model
transformer.save("transformer_translation_model.keras")

Input: Take it easy!
Prediction: toma fácil end                 
Target: [['¡Relajate!']]
BLEU: 0.0000

Input: Will he succeed or fail?
Prediction: va a triunfar o no end              
Target: [['¿Él', 'triunfará', 'o', 'fracasará?']]
BLEU: 0.0000

Input: I resign.
Prediction: yo [UNK] end                 
Target: [['Dimito.']]
BLEU: 0.0000

Input: He had more than enough money.
Prediction: Él tuvo más de dinero suficiente dinero end            
Target: [['Él', 'tenía', 'más', 'que', 'suficiente', 'dinero.']]
BLEU: 0.0000

Input: I was trying not to look.
Prediction: no estaba intentando mirar end               
Target: [['Trataba', 'de', 'no', 'mirar.']]
BLEU: 0.0000



#### Vanguard model

In [53]:
# Assume test_pairs = [(eng, spa), ...]
def decode_sequence(input_sentence):
    tokenized_input = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(sequence_length):
        tokenized_target = target_vectorization([decoded_sentence])[:, :-1]
        predictions = vanguard.predict({"english": tokenized_input, "spanish": tokenized_target}, verbose=0)
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = target_vectorization.get_vocabulary()[sampled_token_index]
        if sampled_token == "[end]":
            break
        decoded_sentence += " " + sampled_token
    return decoded_sentence.replace("[start] ", "")

# BLEU Evaluation
for _ in range(5):
    input_text, target_text = random.choice(test_pairs)
    prediction = decode_sequence(input_text)
    reference = [target_text.replace("[start] ", "").replace(" [end]", "").split()]
    candidate = prediction.split()
    bleu_score = sentence_bleu(reference, candidate)
    print(f"Input: {input_text}\nPrediction: {prediction}\nTarget: {reference}\nBLEU: {bleu_score:.4f}\n")

### Save Final Model
vanguard.save("vanguard_translation_model.keras")

Input: I hope we meet again someday soon.
Prediction: espero que nos [UNK] otra vez end             
Target: [['Espero', 'que', 'algún', 'día', 'pronto', 'nos', 'volvamos', 'a', 'ver.']]
BLEU: 0.0000

Input: Please put a lot of cream in my coffee.
Prediction: por favor [UNK] mucho en el café end            
Target: [['Ponele', 'mucha', 'crema', 'a', 'mi', 'café,', 'por', 'favor.']]
BLEU: 0.0000

Input: Come on, try again.
Prediction: ven a intentar end                
Target: [['Vamos,', 'inténtalo', 'otra', 'vez.']]
BLEU: 0.0000

Input: All human beings are mortal.
Prediction: todas las [UNK] son [UNK] end              
Target: [['Todos', 'los', 'humanos', 'son', 'mortales.']]
BLEU: 0.0000

Input: Can I have some of these?
Prediction: puedo tener algo de estas end              
Target: [['¿Puede', 'darme', 'algunos', 'de', 'éstos?']]
BLEU: 0.0000



OSError: [Errno 28] No space left on device

### Saving model architecture in json file

In [51]:
model_json = transformer.to_json()
with open("translator.json", "w") as json_file:
    json_file.write(model_json)

**Translate new sentences with Transformer model**

In [54]:
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

**Translate new sentences with vanguard model**

In [55]:
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

### Output Testing and Decoding the output sequence with transformers

In [56]:
def transformer_decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    transformer_decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization(
            [transformer_decoded_sentence])[:, :-1]
        predictions = transformer(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        transformer_decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return transformer_decoded_sentence

### Output Testing and Decoding the output sequence with vanguard

In [57]:
def vanguard_decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    vanguard_decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization(
            [vanguard_decoded_sentence])[:, :-1]
        predictions = vanguard(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        vanguard_decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return vanguard_decoded_sentence

### Transformer translating output

In [58]:
test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(5):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(transformer_decode_sequence(input_sentence))

-
Does prison reform criminals?
[start] [UNK] la cárcel end                
-
This juice would be even better with two ice cubes.
[start] este jugo sería mejor que dos con dos hielo end          
-
Show me your passport, please.
[start] muéstrame su pasaporte por favor end              
-
You have three seconds to make your choice.
[start] tienes tres semanas end                
-
What he says is true.
[start] lo que dice es cierto end              


### Vanguard translating output

In [59]:
test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(5):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(vanguard_decode_sequence(input_sentence))

-
I can't decide which car to buy.
[start] no puedo decidir qué auto end              
-
You're quite smart.
[start] eres bastante inteligente end                
-
Tom is seeking a job.
[start] tom está buscando trabajo end               
-
She was heard to criticize the manager.
[start] ella estaba oído [UNK] a el [UNK] end            
-
Tom is extremely busy today.
[start] tom está muy ocupado hoy end              


## Evaluation using the BLEU score

### Transformer BLEU Score 

In [60]:
test_eng_texts = [pair[0] for pair in test_pairs]
test_spa_texts = [pair[1] for pair in test_pairs]
transformer_score = 0
transformer_bleu  = 0
for i in range(20):
    candidate = decode_sequence(test_eng_texts[i])
    reference = test_spa_texts[i].lower()
    print(candidate,reference)
    transformer_score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
    transformer_bleu+=transformer_score
    print(f"Transformer Score:{transformer_score}")
print(f"\nTransformer BLEU score : {round(transformer_bleu,2)}/20")

fui a la tienda a comprar algo de comprar un [UNK] y [UNK] end       [start] fui a la tienda a comprar champú y dentífrico. [end]
Transformer Score:0.2647058823529412
dónde está un hospital end                [start] ¿dónde hay un hospital? [end]
Transformer Score:0.34146341463414637
es hora de ir a la escuela end             [start] es hora de ir al colegio. [end]
Transformer Score:0.2857142857142857
no trabajo el domingo end                [start] no trabajo el domingo. [end]
Transformer Score:0.3499999999999999
queremos información end                  [start] queremos información. [end]
Transformer Score:0.3658536585365854
tom debería estar bien el lunes end              [start] tom debería estar bien para el lunes. [end]
Transformer Score:0.3125
te [UNK] este formulario por favor end              [start] ¿podría cumplimentar este formulario, por favor? [end]
Transformer Score:0.35294117647058826
todos los gato se ama a mi gato end            [start] todo el mundo quiere a mi gato.

In [61]:
print(f"Transformer BLEU score : {round(transformer_bleu,2)}/20")

Transformer BLEU score : 5.9/20


### Vanguard BLEU Score 

In [62]:
test_eng_texts = [pair[0] for pair in test_pairs]
test_spa_texts = [pair[1] for pair in test_pairs]
vanguard_score = 0
vanguard_bleu  = 0
for i in range(20):
    candidate = decode_sequence(test_eng_texts[i])
    reference = test_spa_texts[i].lower()
    print(candidate,reference)
    vanguard_score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
    vanguard_bleu+=vanguard_score
    print(f"Vanguard Score:{vanguard_score}")
print(f"Vanguard BLEU score : {round(vanguard_bleu,2)}/20")

fui a la tienda a comprar algo de comprar un [UNK] y [UNK] end       [start] fui a la tienda a comprar champú y dentífrico. [end]
Vanguard Score:0.2647058823529412
dónde está un hospital end                [start] ¿dónde hay un hospital? [end]
Vanguard Score:0.34146341463414637
es hora de ir a la escuela end             [start] es hora de ir al colegio. [end]
Vanguard Score:0.2857142857142857
no trabajo el domingo end                [start] no trabajo el domingo. [end]
Vanguard Score:0.3499999999999999
queremos información end                  [start] queremos información. [end]
Vanguard Score:0.3658536585365854
tom debería estar bien el lunes end              [start] tom debería estar bien para el lunes. [end]
Vanguard Score:0.3125
te [UNK] este formulario por favor end              [start] ¿podría cumplimentar este formulario, por favor? [end]
Vanguard Score:0.35294117647058826
todos los gato se ama a mi gato end            [start] todo el mundo quiere a mi gato. [end]
Vanguard Score

In [63]:
print(f"Vanguard BLEU score : {round(vanguard_bleu,2)}/20")

Vanguard BLEU score : 5.9/20


### Translation

#### Translation with transformers

In [64]:
from nltk.translate.bleu_score import sentence_bleu

# Take a random sample from the validation data
for _ in range(5):
    input_text, target_text = random.choice(val_pairs)
    prediction = transformer_decode_sequence(input_text)

    # Reference and prediction formats
    reference = [target_text.replace("[start] ", "").replace(" [end]", "").split()]
    candidate = prediction.split()

    transformer_bleu = sentence_bleu(reference, candidate)

    print(f"ENGLISH   : {input_text}")
    print(f"PREDICTED : {prediction}")
    print(f"TARGET    : {' '.join(reference[0])}")
    print(f"BLEU SCORE: {transformer_bleu:.4f}\n")

ENGLISH   : She denied having taken part in the scheme.
PREDICTED : [start] ella negó haber tomado parte en el plan end           
TARGET    : Ella negó haber tomado parte en el plan.
BLEU SCORE: 0.5170

ENGLISH   : His memory never ceases to astonish me.
PREDICTED : [start] su memoria no me [UNK] a [UNK] end            
TARGET    : Su memoria me sorprende.
BLEU SCORE: 0.0000

ENGLISH   : No, I'm not a teacher. I'm only a student.
PREDICTED : [start] no soy profesor solo un profesor solo estudiante end           
TARGET    : No, no soy maestro. Soy solo un estudiante.
BLEU SCORE: 0.0000

ENGLISH   : I'll be glad to.
PREDICTED : [start] me haré feliz end                
TARGET    : Será un placer.
BLEU SCORE: 0.0000

ENGLISH   : He was wrong in thinking that she'd come to see him.
PREDICTED : [start] Él estaba equivocado en ese día en ver que lo ver end        
TARGET    : Él se equivocaba al pensar que ella vendría a verle.
BLEU SCORE: 0.0000



#### Translation with vanguard

In [65]:
from nltk.translate.bleu_score import sentence_bleu

# Take a random sample from the validation data
for _ in range(5):
    input_text, target_text = random.choice(val_pairs)
    prediction = vanguard_decode_sequence(input_text)

    # Reference and prediction formats
    reference = [target_text.replace("[start] ", "").replace(" [end]", "").split()]
    candidate = prediction.split()

    vanguard_bleu = sentence_bleu(reference, candidate)

    print(f"ENGLISH   : {input_text}")
    print(f"PREDICTED : {prediction}")
    print(f"TARGET    : {' '.join(reference[0])}")
    print(f"BLEU SCORE: {vanguard_bleu:.4f}\n")

ENGLISH   : Tom knows the man Mary came with.
PREDICTED : [start] tom sabe el hombre que vino con mary end           
TARGET    : Tom conoce al hombre con el que vino Mary.
BLEU SCORE: 0.0000

ENGLISH   : Come on in. The water's nice.
PREDICTED : [start] ven a el buen agua end              
TARGET    : Métete. El agua está rica.
BLEU SCORE: 0.0000

ENGLISH   : I don't know my father's annual income.
PREDICTED : [start] no sé mi padre [UNK] [UNK] end             
TARGET    : No conozco los ingresos anuales de mi padre.
BLEU SCORE: 0.0000

ENGLISH   : Get on your knees.
PREDICTED : [start] [UNK] las [UNK] end                
TARGET    : Arrodíllate.
BLEU SCORE: 0.0000

ENGLISH   : I have two dogs, three cats, and six chickens.
PREDICTED : [start] tengo dos perros tres gatos y seis [UNK] end           
TARGET    : Tengo dos perros, tres gatos y seis gallinas.
BLEU SCORE: 0.2778



In [68]:
from nltk.translate.bleu_score import corpus_bleu

# Prepare a list for references and candidates
references = []
candidates = []

# Loop over all sentence pairs in the test set
for input_text, target_text in test_pairs:
    prediction = transformer_decode_sequence(input_text)
    
    # Clear target (remove start/end tokens)
    reference = target_text.replace("[start] ", "").replace(" [end]", "").split()
    candidate = prediction.split()
    
    # Add to list for BLEU corpus
    references.append([reference])   # penting: list of list
    candidates.append(candidate)

# Calculate the BLEU corpus
transformer_bleu_score = corpus_bleu(references, candidates)
print(f"Corpus Transformer BLEU score: {transformer_bleu_score:.4f}")

Corpus Transformer BLEU score: 0.1030


In [69]:
from nltk.translate.bleu_score import corpus_bleu

# Prepare a list for references and candidates
references = []
candidates = []

# Loop over all sentence pairs in the test set
for input_text, target_text in test_pairs:
    prediction = vanguard_decode_sequence(input_text)
    
    # Clear target (remove start/end tokens)
    reference = target_text.replace("[start] ", "").replace(" [end]", "").split()
    candidate = prediction.split()
    
    # Tambahkan ke list untuk corpus BLEU
    references.append([reference])   # penting: list of list
    candidates.append(candidate)

# Calculate the BLEU corpus
vanguard_bleu_score = corpus_bleu(references, candidates)
print(f"Corpus Vanguard BLEU score: {vanguard_bleu_score:.4f}")

Corpus Vanguard BLEU score: 0.0867


## Evaluation Model

In [67]:
transformer_loss, transformer_acc = transformer.evaluate(val_ds)
vanguard_loss, vanguard_acc = vanguard.evaluate(val_ds)
print("Transformer Evaluation: ")
print(f"Validation Accuracy: {transformer_acc:.4f}")
print(f"Validation Loss: {transformer_loss:.4f}")
print(f"Transformer BLEU score : {round(transformer_bleu_score,2)}/20")
print("\nVanguard Evaluation: ")
print(f"Validation Accuracy: {vanguard_acc:.4f}")
print(f"Validation Loss: {vanguard_loss:.4f}")
print(f"Vanguard BLEU score : {round(vanguard_bleu,2)}/20")

[1m279/279[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m75s[0m 269ms/step - accuracy: 0.8776 - loss: 0.7145 
[1m279/279[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m44s[0m 156ms/step - accuracy: 0.8666 - loss: 0.8336
Transformer Evaluation: 
Validation Accuracy: 0.8776
Validation Loss: 0.7145
Transformer BLEU score : 0.0/20

Vanguard Evaluation: 
Validation Accuracy: 0.8666
Validation Loss: 0.8336
Vanguard BLEU score : 0.28/20


In [70]:
transformer_loss, transformer_acc = transformer.evaluate(val_ds)
vanguard_loss, vanguard_acc = vanguard.evaluate(val_ds)
print("Transformer Evaluation: ")
print(f"Validation Accuracy: {transformer_acc:.4f}")
print(f"Validation Loss: {transformer_loss:.4f}")
print(f"Corpus Transformer BLEU score : {transformer_bleu_score:.4f}")
print("\nVanguard Evaluation: ")
print(f"Validation Accuracy: {vanguard_acc:.4f}")
print(f"Validation Loss: {vanguard_loss:.4f}")
print(f"Corpus Vanguard BLEU score : {vanguard_bleu_score:.4f}")

[1m279/279[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m77s[0m 274ms/step - accuracy: 0.8776 - loss: 0.7145 
[1m279/279[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m42s[0m 151ms/step - accuracy: 0.8666 - loss: 0.8336
Transformer Evaluation: 
Validation Accuracy: 0.8776
Validation Loss: 0.7145
Corpus Transformer BLEU score : 0.1030

Vanguard Evaluation: 
Validation Accuracy: 0.8666
Validation Loss: 0.8336
Corpus Vanguard BLEU score : 0.0867


### Interpretation
1. Accuracy and Loss
    * Transformer has a higher accuracy (87.76% vs. 86.66%) and lower loss (0.71 vs. 0.83).
    * This means that, token-by-token, Transformer is better at guessing the correct word.
2. Corpus BLEU
    * Both BLEU scores are still low (around 0.1).
    * This is common when the model is not yet fully fluent in constructing sentences that match the ground truth.
    * However, Transformer has a higher BLEU score (0.1030 vs. 0.0867), indicating better word order and n-gram fit than Vanguard.
3. Inter-metric Fit
    * High accuracy + low BLEU → indicates that although many words are predicted correctly, the order or sentence structure is often inappropriate.
    * Transformer has a slight advantage in BLEU, meaning that in addition to being more accurate at the word level, it is also slightly better at arranging words into sentences that resemble the target.
4. Conclusion
    * Transformer consistently outperforms Vanguard across all metrics (accuracy, loss, and corpus BLEU).
    * A low BLEU score (around 0.1) indicates the model is still far from natural sentence structure, even though token accuracy is relatively high.
    * This suggests that accuracy is not sufficient to assess text quality, and the corpus BLEU helps highlight the model's weaknesses in generating coherent word sequences.

# Conclusion
The evaluation results show that Transformer has advantages in accuracy and loss, indicating better token prediction accuracy than Vanguard. However, the relatively low Corpus BLEU scores for both models indicate that sentence structure is still far from ideal, even though word-by-word prediction is quite accurate. This means that the syntactic and semantic quality of the sentences are not fully maintained. This situation emphasizes that accuracy and BLEU should not be viewed in isolation, but rather complement each other, accuracy for assessing token prediction, BLEU for assessing sentence fluency. With improved decoding strategies (e.g., beam search) and more sophisticated fine-tuning, BLEU scores can be improved, bringing Transformer closer to being the foundation of modern generative language models.

# Thank You