# English-to-Spanish translation with a sequence-to-sequence Transformer

**Author:** [fchollet](https://twitter.com/fchollet)<br>
**Date created:** 2021/05/26<br>
**Last modified:** 2024/11/18<br>
**Description:** Implementing a sequence-to-sequence Transformer and training it on a machine translation task.

## Introduction

In this example, we'll build a sequence-to-sequence Transformer model, which
we'll train on an English-to-Spanish machine translation task.

You'll learn how to:

- Vectorize text using the Keras `TextVectorization` layer.
- Implement a `TransformerEncoder` layer, a `TransformerDecoder` layer,
and a `PositionalEmbedding` layer.
- Prepare data for training a sequence-to-sequence model.
- Use the trained model to generate translations of never-seen-before
input sentences (sequence-to-sequence inference).

The code featured here is adapted from the book
[Deep Learning with Python, Second Edition](https://www.manning.com/books/deep-learning-with-python-second-edition)
(chapter 11: Deep learning for text).
The present example is fairly barebones, so for detailed explanations of
how each building block works, as well as the theory behind Transformers,
I recommend reading the book.

## Setup

In [None]:
# We set the backend to TensorFlow. The code works with
# both `tensorflow` and `torch`. It does not work with JAX
# due to the behavior of `jax.numpy.tile` in a jit scope
# (used in `TransformerDecoder.get_causal_attention_mask()`):
# `tile` in JAX does not support a dynamic `reps` argument.
# You can make the code work in JAX by wrapping the
# inside of the `get_causal_attention_mask` method in
# a context manager to prevent jit compilation:
# `with jax.ensure_compile_time_eval():`.
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import pathlib
import random
import string
import re
import numpy as np

import tensorflow.data as tf_data
import tensorflow.strings as tf_strings

import keras
from keras import layers
from keras import ops
from keras.layers import TextVectorization

## Downloading the data

We'll be working with an English-to-Spanish translation dataset
provided by [Anki](https://www.manythings.org/anki/). Let's download it:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

os.chdir("/content/drive/MyDrive/Diplomado_IA/NLP/Keras Translator")

In [None]:
text_file = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
text_file = pathlib.Path(text_file).parent / "spa-eng_extracted" / "spa-eng" / "spa.txt"

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
2638744/2638744 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


## Parsing the data

Each line contains an English sentence and its corresponding Spanish sentence.
The English sentence is the *source sequence* and the Spanish one is the *target sequence*.
We prepend the token `"[start]"` and we append the token `"[end]"` to the Spanish sentence.

In [None]:
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    spa = "[start] " + spa + " [end]"
    text_pairs.append((eng, spa))

In [None]:
for _ in range(5):
    print(random.choice(text_pairs))

('I agree with you completely.', '[start] Estoy totalmente de acuerdo contigo. [end]')
('Being tired, he went to bed earlier than usual.', '[start] Cansado, se fue a la cama antes de lo normal. [end]')
('"Tom drank three cups of coffee after dinner." "No wonder he couldn\'t sleep."', '[start] "Tom bebió tres vasos de café después de cenar." "No me sorprende que no pudiera dormir." [end]')
('How about going to the movie tonight?', '[start] ¿Qué te parece ir al cine esta noche? [end]')
('What takes you only three days, takes me three weeks.', '[start] Lo que a ti te lleva sólo tres días a mí me lleva tres semanas. [end]')


Now, let's split the sentence pairs into a training set, a validation set,
and a test set.

In [None]:
random.shuffle(text_pairs)

num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples

train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

118964 total pairs
83276 training pairs
17844 validation pairs
17844 test pairs


## Vectorizing the text data

We'll use two instances of the `TextVectorization` layer to vectorize the text
data (one for English and one for Spanish),
that is to say, to turn the original strings into integer sequences
where each integer represents the index of a word in a vocabulary.
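
To make the string-to-integer mapping concrete, here is a minimal pure-Python sketch of the idea. It is not how `TextVectorization` is implemented, and the helper names `build_vocab` and `vectorize` are hypothetical. It only mirrors the layer's convention of reserving index 0 for padding and index 1 for out-of-vocabulary words (`[UNK]`), with the remaining words ordered by frequency.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Return a word list ordered by frequency, with reserved slots first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is reserved for padding, index 1 for unknown words.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, output_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.lower().split()]
    ids = ids[:output_length]
    return ids + [0] * (output_length - len(ids))


# Toy corpus (hypothetical data): "the" appears 3 times, "cat" twice.
corpus = ["the cat sat", "the cat ran", "the dog"]
vocab = build_vocab(corpus, max_tokens=4)
print(vocab)  # ['', '[UNK]', 'the', 'cat']
print(vectorize("the bird cat", vocab, output_length=5))  # [2, 1, 3, 0, 0]
```

The unknown word "bird" maps to 1 and the unused trailing positions are filled with 0, which is the value the model later treats as padding.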

The English layer will use the default string standardization (lowercase and strip punctuation characters)
and splitting scheme (split on whitespace), while
the Spanish layer will use a custom standardization, where we add the character
`"¿"` to the set of punctuation characters to be stripped.
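
The custom standardization amounts to lowercasing and then deleting a fixed set of characters. Below is a minimal pure-Python sketch of that idea, assuming the character set is ASCII punctuation plus `"¿"`, with the square brackets left out so that `"[start]"` and `"[end]"` survive. The helper name `standardize` is hypothetical; the layer itself needs an equivalent built from tensor string operations.

```python
import re
import string

# ASCII punctuation plus "¿", minus the square brackets so that the
# "[start]" and "[end]" markers are not destroyed.
strip_chars = (string.punctuation + "¿").replace("[", "").replace("]", "")


def standardize(text):
    """Pure-Python stand-in for the custom standardization (hypothetical helper)."""
    return re.sub("[%s]" % re.escape(strip_chars), "", text.lower())


print(standardize("[start] ¿Te parece bien? [end]"))  # [start] te parece bien [end]
```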

Note: in a production-grade machine translation model, I would not recommend
stripping the punctuation characters in either language. Instead, I would recommend turning
each punctuation character into its own token,
which you could achieve by providing a custom `split` function to the `TextVectorization` layer.
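
As a rough illustration of the punctuation-as-tokens idea, here is a pure-Python sketch using a regular expression. The pattern and the helper name `split_with_punctuation` are assumptions made for this example. A callable passed to `TextVectorization` as `split` works on tensors rather than Python strings, so this logic would have to be ported to tensor string operations; one possible route is to surround punctuation with spaces during standardization and keep the default whitespace split.

```python
import re

# Special markers first, then runs of word characters, then any single
# character that is neither a word character nor whitespace (punctuation).
TOKEN_PATTERN = re.compile(r"\[start\]|\[end\]|\w+|[^\w\s]")


def split_with_punctuation(text):
    """Hypothetical helper: keep each punctuation mark as its own token."""
    return TOKEN_PATTERN.findall(text.lower())


print(split_with_punctuation("[start] ¿Te parece bien? [end]"))
# ['[start]', '¿', 'te', 'parece', 'bien', '?', '[end]']
```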

In [None]:
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

vocab_size = 15000
sequence_length = 20
batch_size = 64


def custom_standardization(input_string):
    lowercase = tf_strings.lower(input_string)
    return tf_strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")


eng_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
spa_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)

train_eng_texts = [pair[0] for pair in train_pairs]
train_spa_texts = [pair[1] for pair in train_pairs]

eng_vectorization.adapt(train_eng_texts)
spa_vectorization.adapt(train_spa_texts)

In [None]:
display(eng_vectorization.get_vocabulary()[10:15])
display(spa_vectorization.get_vocabulary()[10:15])

[np.str_('in'), np.str_('of'), np.str_('that'), np.str_('it'), np.str_('was')]

[np.str_('el'), np.str_('en'), np.str_('es'), np.str_('un'), np.str_('me')]

In [None]:
# Trying to vectorize a dummy sentence (Chollet)

output = eng_vectorization([["the cat sat on the mat"]])
display(output.numpy()[0, :6])

output = spa_vectorization([["[start] el gato está en el mantel [end]"]])
display(output.numpy()[0, :7])  # note that one more token is printed here because of the leading "[start]"

array([   2,  351,  573,   29,    2, 5562])

array([   2,   10,  318,   23,   11,   10, 6661])

In [None]:
eng_voc = eng_vectorization.get_vocabulary()
eng_word_index = dict(zip(eng_voc, range(len(eng_voc))))


spa_voc = spa_vectorization.get_vocabulary()
spa_word_index = dict(zip(spa_voc, range(len(spa_voc))))

In [None]:
len(eng_voc), len(spa_voc)

(12019, 15000)

In [None]:
eng_test = ["the", "cat", "sat", "on", "the", "mat"]
display([eng_word_index[w] for w in eng_test])

spa_test = ["el", "gato", "está", "en", "el", "mantel"]
display([spa_word_index[w] for w in spa_test])

[2, 351, 573, 29, 2, 5562]

[10, 318, 23, 11, 10, 6661]

## Importing the GloVe embeddings

I had already downloaded and unzipped the file, so I'll skip those steps...

In [None]:
import numpy as np

path_to_glove_file = "glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.


In [None]:
eng_voc_size = len(eng_voc) + 2

In [None]:
eng_num_tokens = len(eng_voc) + 2
eng_embedding_dim = 100
eng_hits = 0
eng_misses = 0

# Prepare embedding matrix
eng_embedding_matrix = np.zeros((eng_num_tokens, eng_embedding_dim))
for word, i in eng_word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        eng_embedding_matrix[i] = embedding_vector
        eng_hits += 1
    else:
        eng_misses += 1
print("Converted %d words (%d misses)" % (eng_hits, eng_misses))

Converted 11709 words (310 misses)
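
The hit and miss counts above correspond to a coverage of 11709 / (11709 + 310), roughly 97.4% of the English vocabulary. The toy sketch below, written with plain lists and hypothetical data, shows the same lookup logic on a tiny vocabulary: rows for words missing from the pretrained table stay all-zero. Assuming neither the padding entry nor `[UNK]` appears in the pretrained table, both count as misses.

```python
# Toy stand-ins (hypothetical data): a 3-dimensional "pretrained" table and
# a vocabulary laid out with index 0 = padding and index 1 = [UNK].
pretrained = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}
vocab = ["", "[UNK]", "the", "cat", "zzyzx"]
embedding_dim = 3

matrix = [[0.0] * embedding_dim for _ in vocab]
hits, misses = 0, 0
for i, word in enumerate(vocab):
    vector = pretrained.get(word)
    if vector is not None:
        matrix[i] = list(vector)
        hits += 1
    else:
        misses += 1  # the row stays all zeros

coverage = hits / (hits + misses)
print(f"{hits} hits, {misses} misses, coverage {coverage:.1%}")
```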


In [None]:
eng_embedding_matrix[2:5]

array([[-3.81940007e-02, -2.44870007e-01,  7.28120029e-01,
        -3.99610013e-01,  8.31720009e-02,  4.39530015e-02,
        -3.91409993e-01,  3.34399998e-01, -5.75450003e-01,
         8.74589980e-02,  2.87869990e-01, -6.73099980e-02,
         3.09060007e-01, -2.63839990e-01, -1.32310003e-01,
        -2.07570001e-01,  3.33950013e-01, -3.38479996e-01,
        -3.17429990e-01, -4.83359993e-01,  1.46400005e-01,
        -3.73039991e-01,  3.45770001e-01,  5.20410016e-02,
         4.49460000e-01, -4.69709992e-01,  2.62800008e-02,
        -5.41549981e-01, -1.55180007e-01, -1.41069993e-01,
        -3.97219993e-02,  2.82770008e-01,  1.43930003e-01,
         2.34640002e-01, -3.10209990e-01,  8.61729980e-02,
         2.03970000e-01,  5.26239991e-01,  1.71639994e-01,
        -8.23780000e-02, -7.17869997e-01, -4.15309995e-01,
         2.03349993e-01, -1.27629995e-01,  4.13670003e-01,
         5.51869988e-01,  5.79079986e-01, -3.34769994e-01,
        -3.65590006e-01, -5.48569977e-01, -6.28919974e-0

Chollet's code starts here.

In [None]:
train_pairs[1:5]

[('Are both of you ready to go?',
  '[start] ¿Están las dos listas para irse? [end]'),
 ('No one is to leave.', '[start] Nos quedamos todos dentro. [end]'),
 ('He is mad about you.', '[start] Está loco por ti. [end]'),
 ('We have four French classes a week.',
  '[start] Tenemos cuatro clases de francés a la semana. [end]')]

In [None]:

def format_dataset(eng, spa):
    eng = eng_vectorization(eng)
    spa = spa_vectorization(spa)
    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": spa[:, :-1],
        },
        spa[:, 1:],
    )


def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.cache().shuffle(2048).prefetch(16)


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

Let's take a quick look at the sequence shapes
(we have batches of 64 pairs, and all sequences are 20 steps long):

In [None]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"].shape: (64, 20)
targets.shape: (64, 20)
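
Both the decoder inputs and the targets come out 20 steps long because `format_dataset` slices a Spanish sequence of length `sequence_length + 1` in two ways: dropping the last token gives the decoder inputs, and dropping the first gives the targets. A small sketch with made-up token ids (the ids and the short length are assumptions for illustration only):

```python
sequence_length = 5  # the notebook uses 20

# A vectorized Spanish sentence of length sequence_length + 1.
# Hypothetical token ids: 2 = [start], 3 = [end], 0 = padding.
spa = [2, 10, 318, 23, 3, 0]

decoder_inputs = spa[:-1]  # what the decoder sees at each step
targets = spa[1:]          # what it must predict at each step

for step, (seen, expected) in enumerate(zip(decoder_inputs, targets)):
    print(f"step {step}: input {seen:>3} -> target {expected:>3}")
```

At every step the target is simply the next token of the same sentence, which is why the Spanish layer is configured with one extra output position.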


## Building the model

Here I modify the code again.

In [None]:
import keras.ops as ops


class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(dense_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        if mask is not None:
            padding_mask = ops.cast(mask[:, None, :], dtype="int32")
        else:
            padding_mask = None

        attention_output = self.attention(
            query=inputs, value=inputs, key=inputs, attention_mask=padding_mask
        )
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "dense_dim": self.dense_dim,
                "num_heads": self.num_heads,
            }
        )
        return config

## A positional embedding layer that can load a fixed (pretrained) embedding matrix and takes a `trainable` parameter, so the token embeddings can be frozen with `trainable=False`.

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, vocab_size, embed_dim, embedding_matrix=None, trainable=False, **kwargs):
        super().__init__(**kwargs)

        # Token embedding layer (uses pretrained weights if provided)
        self.token_embeddings = layers.Embedding(
            input_dim=vocab_size,
            output_dim=embed_dim,
            trainable=trainable
        )

        # If a pre-trained embedding matrix is provided, set it
        if embedding_matrix is not None:
            self.token_embeddings.build((None,))  # Build the layer
            self.token_embeddings.set_weights([embedding_matrix])  # Load pre-trained weights

        # Position embedding layer (randomly initialized)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length,
            output_dim=embed_dim
        )

        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim

    def call(self, inputs):
        length = ops.shape(inputs)[-1]  # Get sequence length dynamically
        positions = ops.arange(0, length, 1)  # Generate position indices
        embedded_tokens = self.token_embeddings(inputs)  # Token embeddings
        embedded_positions = self.position_embeddings(positions)  # Positional embeddings
        return embedded_tokens + embedded_positions  # Sum embeddings

    def compute_mask(self, inputs, mask=None):
        return ops.not_equal(inputs, 0)  # Mask padding tokens (0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "sequence_length": self.sequence_length,
            "vocab_size": self.vocab_size,
            "embed_dim": self.embed_dim
        })
        return config



class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, latent_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.latent_dim = latent_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.dense_proj = keras.Sequential(
            [
                layers.Dense(latent_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def call(self, inputs, mask=None):
        inputs, encoder_outputs = inputs
        causal_mask = self.get_causal_attention_mask(inputs)

        if mask is None:
            inputs_padding_mask, encoder_outputs_padding_mask = None, None
        else:
            inputs_padding_mask, encoder_outputs_padding_mask = mask

        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask,
            query_mask=inputs_padding_mask,
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            query_mask=inputs_padding_mask,
            key_mask=encoder_outputs_padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

    def get_causal_attention_mask(self, inputs):
        input_shape = ops.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = ops.arange(sequence_length)[:, None]
        j = ops.arange(sequence_length)
        mask = ops.cast(i >= j, dtype="int32")
        mask = ops.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = ops.concatenate(
            [ops.expand_dims(batch_size, -1), ops.convert_to_tensor([1, 1])],
            axis=0,
        )
        return ops.tile(mask, mult)

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "embed_dim": self.embed_dim,
                "latent_dim": self.latent_dim,
                "num_heads": self.num_heads,
            }
        )
        return config
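
The causal mask built by `TransformerDecoder.get_causal_attention_mask` is easiest to see on a small example. The sketch below uses plain Python lists rather than `keras.ops`, and the function name `causal_mask` is hypothetical. It applies the same `i >= j` comparison, so position `i` may attend only to positions `j <= i`.

```python
def causal_mask(sequence_length):
    """Lower-triangular 0/1 mask: position i may attend to j only if j <= i."""
    return [
        [1 if i >= j else 0 for j in range(sequence_length)]
        for i in range(sequence_length)
    ]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In the layer, this single `(sequence_length, sequence_length)` matrix is then tiled along the batch axis so that every sequence in the batch gets the same mask.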


Next, we assemble the end-to-end model.

In [None]:
sequence_length, vocab_size

(20, 15000)

In [None]:
embed_dim = 100
latent_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, eng_voc_size, embed_dim, embedding_matrix=eng_embedding_matrix, trainable=False)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)([x, encoder_outputs])
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

transformer = keras.Model(
    {"encoder_inputs": encoder_inputs, "decoder_inputs": decoder_inputs},
    decoder_outputs,
    name="transformer",
)

## Training our model

We'll use accuracy as a quick way to monitor training progress on the validation data.
Note that machine translation typically uses BLEU scores as well as other metrics, rather than accuracy.
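
To give a feel for what BLEU measures, here is a sketch of its main ingredient, clipped n-gram precision. This is a simplification and not a full BLEU implementation: BLEU combines these precisions for several n-gram sizes and adds a brevity penalty. The function names and the reference sentence are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, where each
    n-gram is credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(
        min(count, ref_counts[gram]) for gram, count in cand_counts.items()
    )
    return matched / total


reference = "estoy ocupado hoy".split()        # hypothetical reference
candidate = "estoy ocupado hoy estoy".split()  # output with a repeated word
print(clipped_precision(candidate, reference, 1))  # 0.75
print(clipped_precision(candidate, reference, 2))
```

The repeated word is only credited once, so the unigram precision is 3 out of 4 rather than 4 out of 4; per-token accuracy has no comparable notion of matching phrases.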

Here we train for 30 epochs, which is about the minimum needed
for the model to actually converge.

In [None]:
epochs = 30  # This should be at least 30 for convergence

transformer.summary()
transformer.compile(
    "rmsprop",
    loss=keras.losses.SparseCategoricalCrossentropy(ignore_class=0),
    metrics=["accuracy"],
)
transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)

Epoch 1/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 47s 21ms/step - accuracy: 0.0729 - loss: 5.9817 - val_accuracy: 0.1371 - val_loss: 3.9184
Epoch 2/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1396 - loss: 4.0271 - val_accuracy: 0.1687 - val_loss: 3.3012
Epoch 3/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1628 - loss: 3.5515 - val_accuracy: 0.1849 - val_loss: 3.0602
Epoch 4/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1744 - loss: 3.3551 - val_accuracy: 0.1907 - val_loss: 3.0000
Epoch 5/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1806 - loss: 3.2765 - val_accuracy: 0.1930 - val_loss: 2.9618
Epoch 6/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 8s 7ms/step - accuracy: 0.1855 - loss: 3.2058 - val_accuracy: 0.1949 - val_loss: 2.9382
Epoch 7/30

<keras.src.callbacks.history.History at 0x79f21fa18290>

In [None]:
transformer.save("Caro_GloVe_model_30_epochs.keras")

### With cleaner data


In [None]:
import unicodedata

def remove_accented_char(texto):
    # Normalize the text to NFD so each accent becomes a separate combining mark
    texto = unicodedata.normalize("NFD", texto)

    # Remove the combining diacritical marks, leaving "ñ" intact
    # (the lookbehind only protects lowercase "n", so "Ñ" becomes "N")
    texto = re.sub(r"(?<!n)[\u0300-\u036f]", "", texto)

    # Recompose to NFC to avoid encoding problems
    return unicodedata.normalize("NFC", texto)
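
A quick check of what this function does to typical Spanish text. The function is repeated here with the same logic so that the snippet runs on its own; the sample strings are made up for illustration.

```python
import re
import unicodedata


def remove_accented_char(text):
    # Same logic as the notebook's function, repeated so this runs standalone.
    text = unicodedata.normalize("NFD", text)
    text = re.sub(r"(?<!n)[\u0300-\u036f]", "", text)
    return unicodedata.normalize("NFC", text)


print(remove_accented_char("¿Dónde está el niño?"))  # ¿Donde esta el niño?
print(remove_accented_char("Ñandú"))                 # Nandu
```

Accents on vowels are dropped while the lowercase "ñ" survives. The uppercase "Ñ" loses its tilde because the lookbehind only matches a lowercase "n".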

In [None]:
import tensorflow as tf

import tensorflow_text as tf_text

strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")


vocab_size = 15000
sequence_length = 20
batch_size = 64


def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, "[%s]" % re.escape(strip_chars), "")

In [None]:
def clean_sp_txt(spa_texts):
    # Strip accents from every Spanish sentence (lowercase "ñ" is kept).
    return [remove_accented_char(txt) for txt in spa_texts]

In [None]:
def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    spa_texts = clean_sp_txt(spa_texts)

    spa_texts = custom_standardization(tuple(spa_texts))
    eng_texts = custom_standardization(eng_texts)

    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset)
    return dataset.cache().shuffle(2048).prefetch(16)


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
test_ds = make_dataset(test_pairs)

In [None]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 20)
inputs["decoder_inputs"].shape: (64, 20)
targets.shape: (64, 20)


In [None]:
epochs = 30  # This should be at least 30 for convergence

transformer.summary()
transformer.compile(
    "rmsprop",
    loss=keras.losses.SparseCategoricalCrossentropy(ignore_class=0),
    metrics=["accuracy"],
)
transformer.fit(train_ds, epochs=epochs, validation_data=val_ds)

Epoch 1/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 31s 14ms/step - accuracy: 0.0783 - loss: 5.5855 - val_accuracy: 0.1505 - val_loss: 3.5565
Epoch 2/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1459 - loss: 3.7474 - val_accuracy: 0.1747 - val_loss: 3.0344
Epoch 3/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1682 - loss: 3.2841 - val_accuracy: 0.1891 - val_loss: 2.7845
Epoch 4/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1804 - loss: 3.0842 - val_accuracy: 0.1936 - val_loss: 2.7036
Epoch 5/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1883 - loss: 2.9688 - val_accuracy: 0.2007 - val_loss: 2.6575
Epoch 6/30
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 9s 7ms/step - accuracy: 0.1932 - loss: 2.8956 - val_accuracy: 0.2024 - val_loss: 2.6344
Epoch 7/30

<keras.src.callbacks.history.History at 0x79f21f7e4e90>

In [None]:
transformer.save("Caro_GloVe_model_30_epochs_cleaner.keras")

## Decoding test sentences

Finally, let's demonstrate how to translate brand new English sentences.
We simply feed into the model the vectorized English sentence
as well as the target token `"[start]"`, then we repeatedly generate the next token until
we hit the token `"[end]"`.

In [None]:
spa_vocab = spa_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = spa_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer(
            {
                "encoder_inputs": tokenized_input_sentence,
                "decoder_inputs": tokenized_target_sentence,
            }
        )

        # ops.argmax(predictions[0, i, :]) is not a concrete value for jax here
        sampled_token_index = ops.convert_to_numpy(
            ops.argmax(predictions[0, i, :])
        ).item(0)
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token

        if sampled_token == "[end]":
            break
    return decoded_sentence


test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(30):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequence(input_sentence)

    print(input_sentence, " >>> ", translated)

I am busy today.  >>>  [start] estoy ocupado hoy estoy [end]
Tom already knows.  >>>  [start] tom ya sabe [end]
I cannot afford to buy a new bicycle.  >>>  [start] no puedo dar comprar una bicicleta nueva [end]
That box is made of wood.  >>>  [start] esa caja está hecho de la caja [end]
He wore red pants.  >>>  [start] Él se [UNK] los zapatos [end]
Let's do this first of all.  >>>  [start] vamos a que lo que todo [end]
The noise gets on my nerves.  >>>  [start] el ruido me [UNK] [end]
I want to win for once.  >>>  [start] quiero ganar por una vez [end]
Let's get this done and get out of here.  >>>  [start] vamos a hacer esto de aquí [end]
Tell me where you live.  >>>  [start] dime dónde vive [end]
Has the jury reached a verdict?  >>>  [start] ha [UNK] el [UNK] un [UNK] [end]
He's given to going overboard every time he gets a new idea.  >>>  [start] Él se ha [UNK] todos los días nueva [end]
I will never forget the day when I first met him.  >>>  [start] nunca me acuerdo del día en que l

## Now running again, but without freezing the embeddings

In [None]:
embed_dim = 100
latent_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, eng_voc_size, embed_dim, embedding_matrix=eng_embedding_matrix, trainable=True)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)
encoder = keras.Model(encoder_inputs, encoder_outputs)

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, embed_dim), name="decoder_state_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)([x, encoder_outputs])
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs], decoder_outputs)

transformer = keras.Model(
    {"encoder_inputs": encoder_inputs, "decoder_inputs": decoder_inputs},
    decoder_outputs,
    name="transformer",
)