# Use KerasNLP layers to build an encoder-decoder Transformer model, and train it on the English-to-Spanish machine translation task

## Introduction

- **KerasNLP Overview**: Provides essential components for NLP, such as model layers, tokenizers, and metrics, simplifying the construction of NLP pipelines.

- **Example Focus**: Demonstrates building an encoder-decoder Transformer model for English-to-Spanish translation using KerasNLP.

- **Original Example Reference**: Based on a lower-level example by [fchollet](https://keras.io/examples/nlp/neural_machine_translation_with_keras_nlp/), this version leverages KerasNLP for advanced techniques like subword tokenization and translation quality metrics.

- **Key Learning Points**:
  - Tokenize text using `keras_nlp.tokenizers.WordPieceTokenizer`.
  - Implement a sequence-to-sequence Transformer model with KerasNLP layers.
  - Train the model on an English-to-Spanish translation task.
  - Generate translations using the greedy decoding strategy with `keras_nlp.samplers.GreedySampler`.

## Imports

In [1]:
!pip install -q --upgrade rouge-score
!pip install -q --upgrade keras-nlp
!pip install -q --upgrade tensorflow
!pip install -q --upgrade keras


In [2]:
import keras_nlp
import pathlib
import random

import keras
from keras import ops

import tensorflow.data as tf_data
from tensorflow_text.tools.wordpiece_vocab import (
    bert_vocab_from_dataset as bert_vocab,
)

## Define Hyperparameters

In [3]:
BATCH_SIZE = 64
EPOCHS = 1   # Epochs should be at least 10 for convergence
MAX_SEQUENCE_LENGTH = 40
ENG_VOCAB_SIZE = 15000
SPA_VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

## Downloading the English-to-Spanish translation dataset

In [4]:
text_file = keras.utils.get_file(
    fname='spa-eng.zip',
    origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True
)

text_file = pathlib.Path(text_file).parent / 'spa-eng' / 'spa.txt'

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


## Parsing the data

In [5]:
# Read the tab-separated sentence pairs and convert them to lowercase
with open(text_file, encoding='utf-8') as f:
    lines = f.read().split('\n')[:-1]
text_pairs = []

for line in lines:
    eng, spa = line.split('\t')
    eng = eng.lower()
    spa = spa.lower()
    text_pairs.append((eng, spa))

In [6]:
# Let's take a look at a few sentence pairs
for _ in range(5):
    print(random.choice(text_pairs))

('who should i inform?', '¿a quién debería informar?')
('this steak is too tough.', 'este filete está demasiado duro.')
('he bought a dress for her.', 'él le compró un vestido a ella.')
('thank you very much for your thoughtful present.', 'muchas gracias por su considerado obsequio.')
('i like to fish in the river.', 'me gusta pescar en el río.')


## Split the Dataset

In [7]:
# Splitting the data into training, validation, and test sets
random.shuffle(text_pairs)

num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples

train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples: num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

print(f'{len(text_pairs)} total pairs')
print(f'{len(train_pairs)} training pairs')
print(f'{len(val_pairs)} validation pairs')
print(f'{len(test_pairs)} test pairs')

118964 total pairs
83276 training pairs
17844 validation pairs
17844 test pairs


## Tokenize the data

Two tokenizers are used here:

1. One for the source language (English)
2. One for the target language (Spanish)
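
To make subword tokenization more concrete, the following is a minimal pure-Python sketch of greedy longest-match-first splitting, the matching scheme that WordPiece tokenizers are commonly described as using. The helper name `wordpiece_split` and the toy vocabulary are illustrative assumptions, not part of KerasNLP; the real `WordPieceTokenizer` additionally splits on whitespace and punctuation before matching.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", suffix_prefix="##"):
    """Split a single word into subwords by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation prefix.
                candidate = suffix_prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matched: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for illustration.
toy_vocab = {"dent", "##ist", "##ists", "advise", "##d", "see"}

print(wordpiece_split("dentist", toy_vocab))  # ['dent', '##ist']
print(wordpiece_split("advised", toy_vocab))  # ['advise', '##d']
print(wordpiece_split("xyz", toy_vocab))      # ['[UNK]']
```

A word that is already in the vocabulary comes back as a single piece, which is why frequent words in the dataset map to one token ID each.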

In [8]:
# Train a WordPiece vocabulary from raw text samples
def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf_data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )
    return vocab

In [9]:
# Vocabulary has a few special, reserved tokens
reserved_tokens = ['[PAD]', '[UNK]', '[START]', '[END]']

# Create the English tokenizer
eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

# Create the Spanish tokenizer
spa_samples = [text_pair[1] for text_pair in train_pairs]
spa_vocab = train_word_piece(spa_samples, SPA_VOCAB_SIZE, reserved_tokens)

In [10]:
# Let's see some tokens
print(f'English Tokens: {eng_vocab[100:110]}')
print(f'Spanish Tokens: {spa_vocab[100:110]}')

English Tokens: ['at', 'know', 'him', 'they', 'there', 'go', 'her', 'has', 'will', 're']
Spanish Tokens: ['qué', 'le', 'ella', 'te', 'para', 'mary', 'las', 'más', 'al', 'yo']


In [11]:
# Let's define the tokenizers
eng_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab,
    lowercase=False
)

spa_tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=spa_vocab,
    lowercase=False
)

In [12]:
# Let's try and tokenize a sample from our dataset
eng_input_ex = text_pairs[0][0]
eng_tokens_ex = eng_tokenizer.tokenize(eng_input_ex)
print(f'English Sentence: {eng_input_ex}')
print(f'Tokens: {eng_tokens_ex}')
print('Recovered text after detokenizing:', eng_tokenizer.detokenize(eng_tokens_ex))

print()

spa_input_ex = text_pairs[0][1]
spa_tokens_ex = spa_tokenizer.tokenize(spa_input_ex)
print(f'Spanish Sentence: {spa_input_ex}')
print(f'Tokens: {spa_tokens_ex}')
print('Recovered text after detokenizing:', spa_tokenizer.detokenize(spa_tokens_ex))

English Sentence: she advised him to see the dentist, but he said he didn't have enough time to do so.
Tokens: [  83  722  102   67  141   66 2238   10  154   71  181   71  121    8
   46   81  303  110   67   77  147   12]
Recovered text after detokenizing: tf.Tensor(b"she advised him to see the dentist , but he said he didn ' t have enough time to do so .", shape=(), dtype=string)

Spanish Sentence: le aconsejó que fuera al dentista, pero él dijo que no tenía tiempo suficiente para hacerlo.
Tokens: [ 101  916   80  286  108 2833   13  171   90  164   80   81  213  134
  454  104  300   15]
Recovered text after detokenizing: tf.Tensor(b'le aconsej\xc3\xb3 que fuera al dentista , pero \xc3\xa9l dijo que no ten\xc3\xada tiempo suficiente para hacerlo .', shape=(), dtype=string)


## Format Datasets

- **Training Dataset Format**:
  - **Inputs**:
    - A dictionary with two keys:
      - `encoder_inputs`: Tokenized source sentence.
      - `decoder_inputs`: Target sentence up to word N (used to predict word N+1 and beyond).
  - **Target**:
    - The target sentence offset by one step, providing the next words for the model to predict.

- **Special Tokens**:
  - Add `[START]` and `[END]` tokens to the tokenized Spanish input sentence.
  - **Padding**: Input is padded to a fixed length using `keras_nlp.layers.StartEndPacker`.

- **Objective**: At each training step, predict target words N+1 and beyond using the given inputs and targets.
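
The offset between decoder inputs and targets described above can be sketched with plain Python lists. The `pack` helper below is a hypothetical stand-in that only mimics the effect of `keras_nlp.layers.StartEndPacker` on a single sequence, and the token IDs are made up for illustration.

```python
PAD_ID, START_ID, END_ID = 0, 2, 3  # illustrative IDs, not read from a real vocabulary


def pack(tokens, sequence_length, start_id, end_id, pad_id):
    """Add start/end markers, then truncate or pad to a fixed length."""
    # Leave room for the two special tokens.
    body = tokens[: sequence_length - 2]
    packed = [start_id] + body + [end_id]
    return packed + [pad_id] * (sequence_length - len(packed))


max_len = 5
spa_tokens = [101, 916, 80]

# Pack to max_len + 1 so that both shifted views have length max_len.
packed = pack(spa_tokens, max_len + 1, START_ID, END_ID, PAD_ID)
decoder_inputs = packed[:-1]  # what the decoder sees
targets = packed[1:]          # what it must predict, one step ahead

print(packed)          # [2, 101, 916, 80, 3, 0]
print(decoder_inputs)  # [2, 101, 916, 80, 3]
print(targets)         # [101, 916, 80, 3, 0]
```

At position `i`, the decoder input ends with token `i` and the target holds token `i + 1`, which is why the Spanish sequences are packed to `MAX_SEQUENCE_LENGTH + 1` before slicing.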

In [13]:
def preprocess_batch(eng, spa):
    batch_size = ops.shape(spa)[0]

    eng = eng_tokenizer(eng)
    spa = spa_tokenizer(spa)

    # Pad `eng` to `MAX_SEQUENCE_LENGTH`
    eng_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=eng_tokenizer.token_to_id('[PAD]')
    )

    eng = eng_start_end_packer(eng)

    # Add special tokens ('[START]' and '[END]') to 'spa' and also pad it
    spa_start_end_packer = keras_nlp.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value=spa_tokenizer.token_to_id('[START]'),
        end_value=spa_tokenizer.token_to_id('[END]'),
        pad_value=spa_tokenizer.token_to_id('[PAD]')
    )
    spa = spa_start_end_packer(spa)

    return(
        {
            'encoder_inputs': eng,
            'decoder_inputs': spa[:, :-1]
        },
        spa[:, 1:]
    )

def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(preprocess_batch, num_parallel_calls=tf_data.AUTOTUNE)
    return dataset.shuffle(2048).prefetch(16).cache()

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [14]:
# Let's take a look at the sequence shapes
# (batches of 64 pairs, and all sequences are 40 steps long)
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 40)
inputs["decoder_inputs"].shape: (64, 40)
targets.shape: (64, 40)


## Building the model

- **Model Definition**:
  - **Embedding Layer**: Needed to generate a vector for each token in the input sequence. Initialized randomly.
  - **Positional Embedding**: Encodes word order in the sequence. Typically, these embeddings are added together.
  - **KerasNLP Convenience**: Use `keras_nlp.layers.TokenAndPositionEmbedding` to handle both token and positional embeddings in one step.

- **Sequence-to-Sequence Transformer**:
  - Composed of `keras_nlp.layers.TransformerEncoder` and `keras_nlp.layers.TransformerDecoder`.
  - **Process Flow**:
    - **Encoder**: The source sequence is passed through `TransformerEncoder`, which generates a new representation.
    - **Decoder**: This representation, along with the target sequence so far (words 0 to N), is fed into `TransformerDecoder` to predict the next words (N+1 and beyond).

- **Causal Masking**:
  - Necessary to ensure the model only uses past tokens (0 to N) when predicting the next token (N+1).
  - **Enabled by Default**: In `keras_nlp.layers.TransformerDecoder`.

- **Padding Masking**:
  - Padding tokens ("[PAD]") must be masked out.
  - Setting `mask_zero=True` in `keras_nlp.layers.TokenAndPositionEmbedding` handles this and propagates the mask to subsequent layers. Note that the model code below does not pass this argument, so the layer keeps its default setting.
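
As a rough illustration of how the two masks combine, the sketch below builds a causal mask and a padding mask for one short sequence using plain Python. The function names and the assumption that `[PAD]` has ID 0 are illustrative; KerasNLP computes equivalent masks internally as tensors.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def padding_mask(token_ids, pad_id=0):
    """True for real tokens, False for padding."""
    return [tok != pad_id for tok in token_ids]


def combined_mask(token_ids, pad_id=0):
    """A position may be attended to only if it is in the past and not padding."""
    n = len(token_ids)
    causal = causal_mask(n)
    keep = padding_mask(token_ids, pad_id)
    return [[causal[i][j] and keep[j] for j in range(n)] for i in range(n)]


# Illustrative decoder input: [START], two word pieces, one [PAD].
tokens = [2, 101, 916, 0]
for row in combined_mask(tokens):
    print([int(v) for v in row])
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

Each row is one query position: the lower-triangular shape comes from the causal mask, and the all-zero last column comes from masking the padding token.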

In [15]:
# Encoder Block
encoder_inputs = keras.Input(shape=(None,), name='encoder_inputs')
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(encoder_inputs)

encoded_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM,
    num_heads=NUM_HEADS
)(inputs=x)

encoder = keras.Model(encoder_inputs, encoded_outputs)

# Decoder Block
decoder_inputs = keras.Input(shape=(None,), name='decoder_inputs')
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name='decoder_state_inputs')

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=SPA_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM,
    num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)

x = keras.layers.Dropout(0.5)(x)

decoder_outputs = keras.layers.Dense(SPA_VOCAB_SIZE, activation='softmax')(x)

decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs
    ],
    decoder_outputs
)

decoder_outputs = decoder([decoder_inputs, encoded_outputs])

transformer = keras.Model(
    [
        encoder_inputs,
        decoder_inputs
    ],
    decoder_outputs,
    name='transformer'
)

In [16]:
transformer.summary()

## Training the Model

In [17]:
transformer.compile(
    'rmsprop',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

transformer.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)

1302/1302 ━━━━━━━━━━━━━━━━━━━━ 7962s 6s/step - accuracy: 0.8211 - loss: 1.4492 - val_accuracy: 0.9820 - val_loss: 0.1529


<keras.src.callbacks.history.History at 0x7bad8e6897e0>

## Decoding Test Sentences (Qualitative Analysis)

In [27]:
def decode_sequences(input_sentences):
    batch_size = 1

    # Tokenize the encoder input
    encoder_input_tokens = ops.convert_to_tensor(eng_tokenizer(input_sentences))
    if len(encoder_input_tokens[0]) < MAX_SEQUENCE_LENGTH:
        pads = ops.full((1, MAX_SEQUENCE_LENGTH - len(encoder_input_tokens[0])), 0)
        encoder_input_tokens = ops.concatenate(
            [encoder_input_tokens.to_tensor(), pads], 1
        )

    # Define a function that returns the next-token scores given the sequence so far.
    # The model ends in a softmax, so these are probabilities rather than raw logits;
    # greedy (argmax) selection is unaffected. The name avoids shadowing the built-in `next`.
    def next_token(prompt, cache, index):
        probs = transformer([encoder_input_tokens, prompt])[:, index - 1, :]
        # Ignore hidden states for now; only needed for contrastive search.
        hidden_states = None
        return probs, hidden_states, cache

    # Build a prompt of length MAX_SEQUENCE_LENGTH with a start token and padding tokens
    length = MAX_SEQUENCE_LENGTH
    start = ops.full((batch_size, 1), spa_tokenizer.token_to_id("[START]"))
    pad = ops.full((batch_size, length - 1), spa_tokenizer.token_to_id("[PAD]"))
    prompt = ops.concatenate((start, pad), axis=-1)

    generated_tokens = keras_nlp.samplers.GreedySampler()(
        next_token,
        prompt,
        stop_token_ids=[spa_tokenizer.token_to_id("[END]")],
        index=1,  # Start sampling after start token.
    )
    generated_sentences = spa_tokenizer.detokenize(generated_tokens)
    return generated_sentences

test_eng_texts = [pair[0] for pair in test_pairs]
for i in range(2):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences([input_sentence])
    translated = translated.numpy()[0].decode("utf-8")
    translated = (
        translated.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )

    print(f"** Example {i} **")
    print(f"English: {input_sentence}")
    print(f"Spanish: {translated}")
    print()

** Example 0 **
English: that's my younger sister's photograph.
Spanish: ie p que lamento familia piso p estúpido ley

** Example 1 **
English: my alarm clock didn't go off this morning, so i missed my bus.
Spanish: gado somos si izquierda q mary noche de pronto alta he f decidió que aún manzana
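
For readers who want to see what greedy decoding does step by step, here is a minimal sketch that is independent of KerasNLP. The toy probability table stands in for the trained model and is purely hypothetical; `greedy_decode` is an illustrative helper, not the implementation of `keras_nlp.samplers.GreedySampler`.

```python
def greedy_decode(next_token_probs, start_id, end_id, max_length):
    """Repeatedly append the most probable next token until `end_id` appears."""
    sequence = [start_id]
    while len(sequence) < max_length:
        probs = next_token_probs(sequence)
        # Greedy choice: index of the highest probability.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        sequence.append(next_id)
        if next_id == end_id:
            break
    return sequence


# Hypothetical 4-token vocabulary: 0=[START], 1=[END], 2="hola", 3="mundo".
TOY_TABLE = {
    0: [0.0, 0.1, 0.7, 0.2],    # after [START], "hola" is most likely
    2: [0.0, 0.2, 0.1, 0.7],    # after "hola", "mundo" is most likely
    3: [0.0, 0.9, 0.05, 0.05],  # after "mundo", [END] is most likely
}


def toy_next_token_probs(sequence):
    """Stand-in for the model: depends only on the last token."""
    return TOY_TABLE[sequence[-1]]


print(greedy_decode(toy_next_token_probs, start_id=0, end_id=1, max_length=10))
# [0, 2, 3, 1]
```

Because every step commits to the single most probable token, one early mistake cannot be undone later, which is one reason an undertrained model can produce output like the examples above.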



## Decoding Test Sentences (Quantitative Analysis)

In [25]:
# Let's compute the ROUGE-1 and ROUGE-2 scores

rouge_1 = keras_nlp.metrics.RougeN(order=1)
rouge_2 = keras_nlp.metrics.RougeN(order=2)

for test_pair in test_pairs[:30]:
    input_sentence = test_pair[0]
    reference_sentence = test_pair[1]

    translated_sentence = decode_sequences([input_sentence])
    translated_sentence = translated_sentence.numpy()[0].decode("utf-8")
    translated_sentence = (
        translated_sentence.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )

    rouge_1(reference_sentence, translated_sentence)
    rouge_2(reference_sentence, translated_sentence)

print("ROUGE-1 Score: ", rouge_1.result())
print("ROUGE-2 Score: ", rouge_2.result())

ROUGE-1 Score:  {'precision': <tf.Tensor: shape=(), dtype=float32, numpy=0.013737975>, 'recall': <tf.Tensor: shape=(), dtype=float32, numpy=0.017252993>, 'f1_score': <tf.Tensor: shape=(), dtype=float32, numpy=0.015082163>}
ROUGE-2 Score:  {'precision': <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, 'recall': <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, 'f1_score': <tf.Tensor: shape=(), dtype=float32, numpy=0.0>}
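
To help interpret these numbers, the sketch below computes ROUGE-N precision, recall, and F1 from n-gram overlap for a single sentence pair. It assumes plain whitespace tokenization, which differs from the preprocessing in the `rouge-score` package that `keras_nlp.metrics.RougeN` relies on, so the values are illustrative only. The candidate sentence is invented for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N from clipped n-gram overlap (whitespace tokenization assumed)."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Counter intersection clips each n-gram count to the smaller of the two.
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1_score": f1}


reference = "me gusta pescar en el río"
candidate = "me gusta nadar en el mar"  # invented candidate translation

print(rouge_n(reference, candidate, n=1))  # 4 of 6 unigrams overlap
print(rouge_n(reference, candidate, n=2))  # 2 of 5 bigrams overlap
```

A ROUGE-2 score of exactly zero, as in the output above, means that no pair of adjacent words in any generated sentence also appeared adjacent in its reference.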
