<a href="https://colab.research.google.com/github/aflah02/English-to-French-Seq2Seq-KerasNLP/blob/main/EngToFra.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install git+https://github.com/aflah02/keras-nlp.git -q



In [None]:
"""
## Introduction

KerasNLP provides building blocks for NLP (model layers, tokenizers, metrics, etc.) and
makes it convenient to assemble NLP pipelines.

In this tutorial we'll use KerasNLP's `UnicodeCharacterTokenizer` to train a
sequence-to-sequence Transformer model on English-to-French translation. This example
draws inspiration from the
[Character-level recurrent sequence-to-sequence model example](https://keras.io/examples/nlp/lstm_seq2seq/)
by [fchollet](https://twitter.com/fchollet), from which it takes its dataset, and from
__Abheest's Guide Here__, from which it takes its model architecture and decoding code.

This tutorial broadly covers the following:
- Tokenization using `keras_nlp.tokenizers.UnicodeCharacterTokenizer` to obtain
character-level tokens.
- A sequence-to-sequence Transformer model built from KerasNLP's
`keras_nlp.layers.TransformerEncoder`, `keras_nlp.layers.TransformerDecoder` and
`keras_nlp.layers.TokenAndPositionEmbedding` layers.
- Translating text at inference time with the `keras_nlp.utils.greedy_search` utility,
which implements the greedy search decoding algorithm.

It is meant as a starting point for learning KerasNLP and for incorporating it into
your own NLP pipelines.
"""

"""
## Setup

Import the necessary libraries.
"""

import keras_nlp
import numpy as np
import random
import tensorflow as tf
from tensorflow import keras

"""
## Configuration
"""

BATCH_SIZE = 64
EPOCHS = 10

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

num_samples = 20000
data_path = "fra.txt"

"""
## Downloading the dataset
"""

!!curl -O http://www.manythings.org/anki/fra-eng.zip
!!unzip fra-eng.zip

['Archive:  fra-eng.zip',
 '  inflating: _about.txt              ',
 '  inflating: fra.txt                 ']

In [None]:
"""
## Parsing the Data

Each line contains an English sentence and its corresponding French translation.
In our setting we treat the English sentence as the *source sequence* and the French
sentence as the *target sequence*. Splitting a line on the tab character also yields
a third field, which we do not use and therefore discard.
"""

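with open(data_path, encoding="utf-8") as f:
    # The file ends with a newline, so drop the trailing empty string.
    lines = f.read().split("\n")[:-1]
eng_fra_pairs = []
# Slicing already clamps to the number of available lines.
for line in lines[:num_samples]: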
    eng, fra, _ = line.split("\t")
    eng_fra_pairs.append((eng, fra))

"""
Let's take a look at some random pairs presented in the data.
"""

for _ in range(5):
    print(random.choice(eng_fra_pairs))

"""
At this point we have up to `num_samples` sentence pairs. We split them into
training, validation and test sets so that we can check how well the model
generalizes to data it has not seen during training.
"""

random.shuffle(eng_fra_pairs)
num_val_samples = int(0.1 * len(eng_fra_pairs))
num_train_samples = len(eng_fra_pairs) - 2 * num_val_samples
train_pairs = eng_fra_pairs[:num_train_samples]
val_pairs = eng_fra_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = eng_fra_pairs[num_train_samples + num_val_samples :]

print(f"Total Samples: {len(eng_fra_pairs)}")
print(f"Training Samples: {len(train_pairs)}")
print(f"Validation Samples: {len(val_pairs)}")
print(f"Testing Samples: {len(test_pairs)}")


"""
## Tokenizing the Data

The `UnicodeCharacterTokenizer` is a vocabulary-free tokenizer: it maps each character
of the text to its Unicode code point, so we only need to pass it the text to be
tokenized. By default it also lowercases the text before tokenizing.

We also compute `MAX_SEQUENCE_LENGTH`, the length of the longest sentence in the
training data. The model still needs something analogous to a vocabulary size, so for
each language we use the largest Unicode code point found in its training sentences
plus one; every character token then lies in the range
[0, Max_Unicode_Value_For_Language].
"""

eng_samples = [text_pair[0] for text_pair in train_pairs]

fra_samples = [text_pair[1] for text_pair in train_pairs]

MAX_SEQUENCE_LENGTH = max(max([len(i) for i in eng_samples]), max([len(i) for i in fra_samples]))

print("MAX_SEQUENCE_LENGTH", MAX_SEQUENCE_LENGTH)

eng_vocab_set = set([])
for i in eng_samples:
  eng_vocab_set = eng_vocab_set.union(set(list(i)))
ENG_VOCAB_SIZE = max([ord(i) for i in eng_vocab_set])+1

print("ENG_VOCAB_SIZE", ENG_VOCAB_SIZE)

fra_vocab_set = set([])
for i in fra_samples:
  fra_vocab_set = fra_vocab_set.union(set(list(i)))
FRA_VOCAB_SIZE = max([ord(i) for i in fra_vocab_set])+1

print("FRA_VOCAB_SIZE", FRA_VOCAB_SIZE)

"""
Now, let's define the tokenizer. Because it is vocabulary-free, it takes no
vocabulary as input and a single instance serves both languages.
"""

tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer()

"""
Let's test tokenization and detokenization to make sure they work.
We set a random seed so that the examples are reproducible.
"""
random.seed(30)

random_eng_ex = random.choice(eng_samples)
random_eng_ex_tokens = tokenizer.tokenize(random_eng_ex)
print("English sentence: ", random_eng_ex)
print("Tokens: ", random_eng_ex_tokens)
print("Recovered text after detokenizing: ", tokenizer.detokenize(random_eng_ex_tokens))

print()

random_fra_ex = random.choice(fra_samples)
random_fra_ex_tokens = tokenizer.tokenize(random_fra_ex)
print("French sentence: ", random_fra_ex)
print("Tokens: ", random_fra_ex_tokens)
print("Recovered text after detokenizing: ", tokenizer.detokenize(random_fra_ex_tokens))

('Write this down.', 'Mets-le sur papier.')
("She's a beauty.", 'Elle est très belle.')
("It's weird.", "C'est bizarre.")
("I'm so tired.", 'Je suis si fatiguée.')
('Tom looks nice.', "Tom a l'air gentil.")
Total Samples: 20000
Training Samples: 16000
Validation Samples: 2000
Testing Samples: 2000
MAX_SEQUENCE_LENGTH 53
ENG_VOCAB_SIZE 234
FRA_VOCAB_SIZE 8240
English sentence:  Get away!
Tokens:  tf.Tensor([103 101 116  32  97 119  97 121  33], shape=(9,), dtype=int32)
Recovered text after detokenizing:  tf.Tensor(b'get away!', shape=(), dtype=string)

French sentence:  Skier, c'est sympa.
Tokens:  tf.Tensor(
[115 107 105 101 114  44  32  99  39 101 115 116  32 115 121 109 112  97
  46], shape=(19,), dtype=int32)
Recovered text after detokenizing:  tf.Tensor(b"skier, c'est sympa.", shape=(), dtype=string)
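
To make the character-level scheme concrete, here is a minimal pure-Python sketch of what tokenizing into Unicode code points amounts to, and of why the largest code point plus one can stand in for a vocabulary size. The helper names (`to_code_points`, `from_code_points`, `vocab_size_for`) are hypothetical and exist only for illustration; they are not part of KerasNLP, whose tokenizer additionally handles batching and tensors.

```python
def to_code_points(text, lowercase=True):
    """Map each character to its Unicode code point (illustrative helper)."""
    if lowercase:
        text = text.lower()
    return [ord(ch) for ch in text]


def from_code_points(ids):
    """Invert to_code_points, up to the lowercasing step."""
    return "".join(chr(i) for i in ids)


def vocab_size_for(samples):
    """Largest code point seen in the samples, plus one."""
    return max(ord(ch) for text in samples for ch in text) + 1


ids = to_code_points("Get away!")
print(ids)                    # [103, 101, 116, 32, 97, 119, 97, 121, 33]
print(from_code_points(ids))  # get away!
print(vocab_size_for(["Get away!", "Elle est très belle."]))  # 233, since ord("è") == 232
```

A single unusual character with a high code point is enough to inflate this "vocabulary size", which is likely why the French value above is so much larger than the English one.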


In [None]:
"""
## Dataset Preparation

We need to convert the dataset into a form that the model can consume.

As in a standard sequence-to-sequence setting, the model predicts token N+1 of the
target sentence from the source sentence and the target tokens up to position N.
Because we tokenize at the character level, each token here is a character.

We format our dataset into tuples of the form (`inputs`, `target`):

- `inputs` is a dictionary with the keys `encoder_inputs` and `decoder_inputs`.
`encoder_inputs` is the tokenized source sentence and `decoder_inputs` is the target sentence "so far",
that is to say, the tokens 0 to N used to predict token N+1 (and beyond) in the target sentence.
- `target` is the target sentence offset by one step:
it provides the next tokens in the target sentence -- what the model will try to predict.

After tokenization we add special tokens to the French sentence: -1 acts as the
`"[START]"` token and 0 acts as the `"[END]"` token. We use 0 as the `"[END]"` token
because the tensors are also padded with 0 up to `MAX_SEQUENCE_LENGTH`, so the end
marker and the padding can all be stripped together after decoding.
"""

def tokenize_and_fix_length_and_add_special_tokens(eng, fra):
    batch_size = tf.shape(fra)[0]

    eng = tokenizer(eng)
    fra = tokenizer(fra)

    # Adding Special Tokens
    start_tensor = tf.fill((batch_size, 1), -1)
    end_tensor = tf.fill((batch_size, 1), 0)
    fra = tf.concat([start_tensor, fra, end_tensor], axis=1)

    # Setting Tensor Size to Maximum Length Allowed
    eng = eng.to_tensor(shape=eng.shape.as_list()[:-1] + [MAX_SEQUENCE_LENGTH])
    fra = fra.to_tensor(shape=fra.shape.as_list()[:-1] + [MAX_SEQUENCE_LENGTH+1])

    return (
        {"encoder_inputs": eng, "decoder_inputs": fra[:, :-1],},
        fra[:, 1:],
    )


def make_data(pairs):
    eng_texts = [pair[0] for pair in pairs]
    fra_texts = [pair[1] for pair in pairs]
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, fra_texts)).batch(BATCH_SIZE)
    dataset = dataset.map(tokenize_and_fix_length_and_add_special_tokens, num_parallel_calls=tf.data.AUTOTUNE)
    # Cache the tokenized batches, reshuffle them every epoch, and prefetch last.
    return dataset.cache().shuffle(2048).prefetch(16)


train_ds = make_data(train_pairs)
val_ds = make_data(val_pairs)

"""
Let's check our shapes!
We have batches of 64 pairs, and all sequences are MAX_SEQUENCE_LENGTH steps long:
"""

for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 53)
inputs["decoder_inputs"].shape: (64, 53)
targets.shape: (64, 53)
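
The one-step offset between `decoder_inputs` and `target` is easier to see on a single short sequence. The sketch below reproduces the idea in plain Python, using the same special values as the tutorial (-1 for start, 0 for end and padding). The function `make_decoder_pair` is a hypothetical helper written for this illustration, not part of the pipeline above.

```python
START, END, PAD = -1, 0, 0  # same values the tutorial uses


def make_decoder_pair(token_ids, max_len):
    """Return (decoder_inputs, targets), each of length max_len."""
    seq = [START] + list(token_ids) + [END]
    # Truncate or right-pad to max_len + 1 so both slices have max_len items.
    seq = seq[: max_len + 1]
    seq = seq + [PAD] * (max_len + 1 - len(seq))
    # Inputs drop the last position, targets drop the first.
    return seq[:-1], seq[1:]


decoder_inputs, targets = make_decoder_pair([ord(c) for c in "oui"], max_len=6)
print(decoder_inputs)  # [-1, 111, 117, 105, 0, 0]
print(targets)         # [111, 117, 105, 0, 0, 0]
```

At every position the target holds the token that follows the corresponding decoder input, which is exactly what the model is trained to predict.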


In [None]:
"""
## Model Architecture

The `keras_nlp.layers.TokenAndPositionEmbedding` layer gives us both embeddings we
need: a randomly initialized token embedding and a position embedding, which it sums
into a single output.

Our model consists of an encoder and a decoder. The encoder uses
`keras_nlp.layers.TransformerEncoder` while the decoder uses
`keras_nlp.layers.TransformerDecoder`. The setup is simple: we pass the original
English sentence to the encoder, which produces an encoded sequence. This sequence,
along with the characters predicted so far, is then given to the decoder to produce
the output at the next time step. We also set `use_causal_mask` to True so that the
decoder cannot attend to positions beyond the tokens already predicted. Without the
mask the decoder could use the very tokens it is supposed to predict, information that
is not available at test time.
"""

# Encoder
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(encoder_inputs)

encoder_outputs = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)


# Decoder
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=FRA_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(decoder_inputs)

x = keras_nlp.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs, use_causal_mask=True,)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(FRA_VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model([decoder_inputs, encoded_seq_inputs,], decoder_outputs,)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer",
)

"""
## Training our Model

We use accuracy as a metric to monitor training progress. Accuracy is not the best
metric for translation: more suitable metrics such as BLEU exist, but they are
more computationally expensive and would make training much slower.
"""

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 encoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 token_and_position_embedding (  (None, None, 256)   73472       ['encoder_inputs[0][0]']         
 TokenAndPositionEmbedding)                                                                       
                                                                                                  
 decoder_inputs (InputLayer)    [(None, None)]       0           []                               
                                                                                                  
 transformer_encoder (Transform  (None, None, 256)   1315072     ['token_and_position_em

<keras.callbacks.History at 0x7fb1900fe3d0>
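
The effect of a causal mask can be pictured as a lower-triangular matrix. The snippet below is a small conceptual sketch in plain Python; the `causal_mask` function is hypothetical and is not how KerasNLP builds its mask internally.

```python
def causal_mask(length):
    """Row i marks with 1 the positions j that position i may attend to."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]


for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row 0 can only see the start token, while the last row sees the whole prefix, so no position ever receives information from the tokens it has yet to predict.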

In [None]:
"""
## Analyzing our Outputs

Let's check out how to use the model to translate English sentences to French.
We give the model the tokenized English sentence together with -1 as the prompt
token, which plays the role of the START token usually used in such tasks.
From these two inputs the model produces a probability distribution over the next
token, and we greedily pick the most likely one. This is repeated until we hit the
end token (0 in our case) or reach a predetermined maximum output length.

This process is available as the `keras_nlp.utils.greedy_search` utility in KerasNLP.
"""


def decode_sequences(input_sentences):
    batch_size = tf.shape(input_sentences)[0]

    # Tokenize the encoder input.
    tokenized = tokenizer(input_sentences)
    encoder_input_tokens = tokenized.to_tensor(shape=tokenized.shape.as_list()[:-1] + [MAX_SEQUENCE_LENGTH])
    # Define a function that outputs the next token's probability given the
    # input sequence.
    def token_probability_fn(decoder_input_tokens):
        return transformer([encoder_input_tokens, decoder_input_tokens])[:, -1, :]

    # Set the prompt to the "[START]" token.
    prompt = tf.fill((batch_size, 1), -1)

    generated_tokens = keras_nlp.utils.greedy_search(
        token_probability_fn,
        prompt,
        max_length=max([len(i) for i in fra_samples]),
        end_token_id=0,
    )

    # Drop the -1 token that was given as the initial prompt.
    mask = tf.math.equal(generated_tokens, -1)
    generated_tokens = tf.boolean_mask(generated_tokens, tf.logical_not(mask))

    # `tf.boolean_mask` flattens its result, so restore the batch dimension.
    # Note that this assumes a batch containing a single sentence.
    generated_tokens = tf.reshape(generated_tokens, [1, tf.shape(generated_tokens)[0]])

    generated_sentences = tokenizer.detokenize(generated_tokens)
    return generated_sentences

test_eng_texts = [pair[0] for pair in test_pairs]

for i in range(10):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences(tf.constant([input_sentence]))
    translated = translated.numpy()[0].decode("utf-8")
    print(input_sentence)
    print(translated)
    print()

"""
## Conclusions

Bear in mind that when our model started training it didn't even know what a word is,
and it was given no direct information about word structure. Even so, within a few
epochs it can clearly form sentences, and more training on more data should improve
the translations considerably.
"""

He found it.
il a trouvé inque c'est moi.

See you there.
à vous devotre votre !

I was detained.
j'ai été inciné.

Everyone agreed.
tout le monde a chien.

It's outdated.
c'est beaucoup de faire.

We'll rebuild.
nous sommes fini.

I nailed it.
j'ai besoin d'espare.

I can jump.
je vous suis joure mettra.

Everybody stayed.
tout le monde ment sendé.

I can't swim.
je ne sais pas en moi.

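
The greedy decoding loop described under "Analyzing our Outputs" can be sketched in a few lines of plain Python. This is only an illustration of the idea for a single sequence, not the implementation of `keras_nlp.utils.greedy_search`. Both `greedy_decode` and the stand-in model `toy_probs` are hypothetical names introduced here.

```python
def greedy_decode(next_token_probs, start_token, end_token, max_length):
    """Repeatedly append the most probable next token to the sequence."""
    tokens = [start_token]
    while len(tokens) < max_length:
        probs = next_token_probs(tokens)
        # Greedy choice: the index with the highest probability.
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)
        if next_token == end_token:
            break
    return tokens


def toy_probs(tokens):
    """Stand-in for the model: favors token 2 twice, then the end token 0."""
    if len(tokens) < 3:
        return [0.1, 0.2, 0.6, 0.1]
    return [0.7, 0.1, 0.1, 0.1]


print(greedy_decode(toy_probs, start_token=-1, end_token=0, max_length=10))
# [-1, 2, 2, 0]
```

Because each step commits to the single most likely token, an early mistake cannot be undone later, which may explain some of the translations above that start plausibly and then drift.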