<a href="https://colab.research.google.com/github/Praveen76/Language-Translation-using-Transformer-Decoder-Model/blob/main/Language_Translation_using_Transformer_Decoder_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives

At the end of the experiment, you will be able to:

* understand the big picture of transformers
* explore masking of transformers
* implement transformer decoder and understand its architecture
* apply learning on a machine translation problem

## The Big Picture

<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/NLP_Pipeline.png" width=700px/>
</center>


The Transformer architecture follows an encoder-decoder structure:

- the ***encoder***, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations;
- the ***decoder***, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The Transformer decoder generates sequences autoregressively. It attends to previously generated positions using masked self-attention, attends to the encoder's output using encoder-decoder attention, applies feed-forward networks, and uses positional encodings. This architecture allows the decoder to produce coherent and contextually accurate sequences in various natural language processing tasks.

### Importing required packages

In [None]:
import numpy as np
import re
import random
import string
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# **Part A** : Building Encoder Transformer

The concepts behind the Transformer encoder were discussed in Assignment 3. The same steps are implemented here, since the decoder network built in Part B consumes the encoder's output.

### Define TransformerEncoder class to be used in model building

In [None]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim    # Dimension of embedding. 4 in the dummy example
        self.dense_dim = dense_dim    # No. of neurons in dense layer
        self.num_heads = num_heads    # No. of heads for MultiHead Attention layer
        self.attention = layers.MultiHeadAttention(   # MultiHead Attention layer
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]    # encoders are stacked on top of the other.
        )                                 # So output dimension is also embed_dim
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    # Call function based on figure above
    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
            print(f"**test: mask in not None. mask = {mask}")

        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)  # Query: inputs, Value: inputs, Keys: Same as Values by default
                                                  # Q: Can you see how this is self attention? A: all args are the same
        proj_input = self.layernorm_1(inputs + attention_output) # LayerNormalization; + Recall cat picture
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)  # LayerNormalization + Residual connection

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

### Positional Embedding





*   Learn position-embedding vectors the same way we learn to embed word indices.
*   Proceed to **add** our position embeddings to the corresponding word embeddings, to obtain a position-aware word embedding.
*   This technique is called “positional embedding.”






<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/PositionalEmbedding.png" width=700px/>


<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/Encoder%20Embedding.png" width=650px/>
</center>


Q: In the picture above:


*   What is the embedding dimension for both the layers? - 3
*   How many rows would the token embedding layer have?  - 20000 (vocab size)
*   How many rows would the positional embedding layer have? - 600 (seq length)
*   Where do we get the indices in token embedding layer? - from TextVectorization
*   Where do we get the indices in positional embedding layer? - We explicitly define a range
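
The answers above can be checked on a toy scale. The following is a minimal sketch in plain Python (no TensorFlow), assuming made-up sizes (`vocab_size = 10`, `seq_len = 4`, `embed_dim = 3`) and random tables in place of learned weights. It shows that the token table has one row per vocabulary entry, the position table has one row per position, and the two looked-up vectors are added element-wise.

```python
import random

random.seed(0)

# Toy sizes chosen for illustration only (the notebook uses 20000, 600 and 256)
vocab_size, seq_len, embed_dim = 10, 4, 3

# One row per token index and one row per position index
token_table = [[random.random() for _ in range(embed_dim)] for _ in range(vocab_size)]
position_table = [[random.random() for _ in range(embed_dim)] for _ in range(seq_len)]

token_ids = [7, 2, 2, 5]                   # in the notebook these come from TextVectorization
positions = list(range(len(token_ids)))    # explicitly defined range: 0, 1, 2, 3

# Look up both tables and ADD the vectors element-wise
embedded = [
    [t + p for t, p in zip(token_table[tok], position_table[pos])]
    for tok, pos in zip(token_ids, positions)
]

print(len(embedded), len(embedded[0]))     # (sequence length, embedding dimension)
print(embedded[1] != embedded[2])          # same token (index 2), different positions
```

The second print illustrates the point of the technique: the same token receives a different vector depending on where it appears in the sequence.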



### Define PositionalEmbedding class to be used in model building

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        # input_dim = (token) vocabulary size,  output_dim = embedding size
        super().__init__(**kwargs)

        self.token_embeddings = layers.Embedding(       # Q: what is input_dim and output_dim? A: vocab size, embedding dim
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(    # Q: Why input_dim = seq_length?  A: there are seq_len (here 600) no. of possible positions
            input_dim=sequence_length, output_dim=output_dim)   # Q: What is the vocab for this Embedding layer ? A: seq_length
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):   # inputs will be a batch of sequences (batch, seq_len)
        length = tf.shape(inputs)[-1]     # length will just be sequence length
        positions = tf.range(start=0, limit=length, delta=1) # indices for input to positional embedding
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions     # ADD the embeddings

    def compute_mask(self, inputs, mask=None):  # makes this layer a mask-generating layer
        return tf.math.not_equal(inputs, 0)     #mask will get propagated to the next layer.

    # When using custom layers, this enables the layer to be reinstantiated from its config dict,
    # which is useful during model saving and loading.
    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config

In [None]:
# MAY BE REMOVED
a = tf.constant([1,0,2,0,3]) # a is a tensor
tf.math.not_equal(a, 0) # which elements of 'a' are not equal to 0

<tf.Tensor: shape=(5,), dtype=bool, numpy=array([ True, False,  True, False,  True])>

### TransformerEncoder model definition with Positional Embedding

In [None]:
#  Combining the Transformer encoder with positional embedding
#  The values below are for the classification problem. We will change them for the translation example
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")  # Q: Why is the input expected to have dtype int? A: Inputs coming from TextVectorization layer.
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

print(f"Token embedding weights: {256*20000}")
print(f"Position embedding weights: {256*600}")
print(f"Total no. of weights: {256*20000 + 256*600}")

**test: mask in not None. mask = Tensor("transformer_encoder/strided_slice:0", shape=(None, 1, None), dtype=bool)
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 positional_embedding (Posi  (None, None, 256)         5273600   
 tionalEmbedding)                                                
                                                                 
 transformer_encoder (Trans  (None, None, 256)         543776    
 formerEncoder)                                                  
                                                                 
 global_max_pooling1d (Glob  (None, 256)               0         
 alMaxPooling1D)                                                 
                                                                 
 dropout (Dro
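
The parameter counts reported in the summary can be reproduced by hand. The sketch below is plain arithmetic and assumes the Keras defaults for `MultiHeadAttention` (bias terms enabled, value dimension equal to `key_dim`) and for `LayerNormalization` (one scale and one offset vector); the variable names are illustrative only.

```python
embed_dim, dense_dim, num_heads = 256, 32, 2
key_dim = embed_dim                      # as passed to MultiHeadAttention in TransformerEncoder
vocab_size, sequence_length = 20000, 600

# Embeddings: one row per token plus one row per position
embedding_params = vocab_size * embed_dim + sequence_length * embed_dim

# Multi-head attention: query, key and value projections (each with a bias),
# followed by the output projection back to embed_dim
qkv_params = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
out_params = num_heads * key_dim * embed_dim + embed_dim
attention_params = qkv_params + out_params

# Feed-forward block: Dense(dense_dim) followed by Dense(embed_dim)
dense_params = (embed_dim * dense_dim + dense_dim) + (dense_dim * embed_dim + embed_dim)

# Two LayerNormalization layers, each with a scale and an offset vector
layernorm_params = 2 * (2 * embed_dim)

encoder_params = attention_params + dense_params + layernorm_params
print(embedding_params)   # 5273600
print(encoder_params)     # 543776
```

Both totals agree with the `Param #` column shown for `positional_embedding` and `transformer_encoder`.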

# **Part B** : Building Decoder Transformer

## Encoder - Decoder Overview

**Encoder - Encodes the input as some representation**

**Decoder - Uses the encoded representation (and targets) to decode this representation into the target.**

*   Encoders - can be CNNs, RNNs, FFNs
*   Decoders - can be CNNs, RNNs, FFNs




Examples:
* Problem: Predict description in text from images.
  - Encoder - CNN
  - Decoder - RNN

<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/Encoder%20Decoder%20Overview.png" width=350px/>
</center>


* Problem: Language Translation
  - Encoder - RNN
  - Decoder - RNN



<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/EncoderDecoder.jpeg" width=600px/>
</center>


 - Problem: Language Translation
    - Transformer Encoder
    - Transformer Decoder

<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/Transformer%20Network.png" width=300px/>
</center>

During training,
* **An encoder model turns the source sequence into an intermediate representation.**
* **A decoder is trained to predict the next token i** in the target sequence by looking at both
    - previous tokens (0 to i - 1) and
    - the encoded source sequence
    

During inference, we don’t have access to the target sequence—we’re trying to predict it from scratch. We’ll have to generate it one token at a time:
1. We obtain the encoded source sequence from the encoder.
2. The decoder starts by looking at the encoded source sequence as well as an initial “seed” token (such as the string "[start]"), and uses them to predict the first real token in the sequence.
3. The predicted sequence so far is fed back into the decoder, which generates the next token, and so on, until it generates a stop token (such as the string "[end]").
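
The three inference steps can be sketched as a loop. This is a toy illustration only: `TOY_NEXT_TOKEN` and `toy_predict_next` are hypothetical stand-ins for a trained decoder and are not part of this notebook. A real model would return a probability distribution over the vocabulary rather than a single token.

```python
# Hypothetical stand-in for a trained decoder: maps the tokens generated so far
# to the next token.
TOY_NEXT_TOKEN = {
    ("[start]",): "me",
    ("[start]", "me"): "gusta",
    ("[start]", "me", "gusta"): "[end]",
}

def toy_predict_next(source_sentence, decoded_tokens):
    # The toy ignores source_sentence; a real decoder attends to its encoding (step 1)
    return TOY_NEXT_TOKEN.get(tuple(decoded_tokens), "[end]")

def greedy_decode(source_sentence, max_len=20):
    decoded = ["[start]"]                       # step 2: start from the seed token
    for _ in range(max_len):
        next_token = toy_predict_next(source_sentence, decoded)
        decoded.append(next_token)              # step 3: feed the sequence so far back in
        if next_token == "[end]":               # stop once the stop token is generated
            break
    return " ".join(decoded)

print(greedy_decode("I like it"))   # [start] me gusta [end]
```

The `max_len` bound guards against a model that never emits the stop token.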

<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/Transformer%20gif.gif" width=750px/>
</center>


### Masking





Masking is needed to prevent the attention mechanism of a transformer from “cheating” in the decoder during training (on a translation task, for instance). This kind of “cheating-proof” masking is not present on the encoder side.

Consider the sequence “I love it”. The expected prediction for the token at position one (“I”) is the token at the next position (“love”). Similarly, the expected prediction for the tokens “I love” is “it”.

When giving a prediction using all the previous tokens, we do not want the attention mechanism to share any information about the tokens at later positions.

To ensure that this is done, we mask future positions (setting them to -inf) before the softmax step in the self-attention calculation.
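
The effect of setting future positions to -inf can be seen in a small worked example. This is a minimal sketch in plain Python; `masked_softmax` is an illustrative helper, not a function used by Keras, and the scores are invented.

```python
import math

def masked_softmax(scores, allowed):
    # Replace the scores of disallowed (future) positions with -inf before the softmax
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)                            # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Invented raw attention scores of the query "love" against the keys "I", "love", "it"
scores = [2.0, 1.0, 3.0]
allowed = [True, True, False]    # "it" lies in the future, so it is masked out

weights = masked_softmax(scores, allowed)
print(weights)
```

Although “it” has the highest raw score, its attention weight is exactly zero after masking, and the remaining weights still sum to one.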

### Padding mask

Padding is a special form of masking where the masked steps are at the start or the end of a sequence. Padding comes from the need to encode sequence data into contiguous batches: in order to make all sequences in a batch fit a given standard length, it is necessary to pad or truncate some sequences.

* The Embedding layer is capable of generating a “mask” that corresponds to its input data.

* By default, this option isn’t active—you can turn it on by passing mask_zero=True to your Embedding layer.

* You can retrieve the mask with the compute_mask() method:

### An example to understand Padding Masking

In [None]:
# Padding mask
embedding_layer_ = layers.Embedding(input_dim=10, output_dim=256, mask_zero=True)
some_input = [
  [4,3,2,1,0,0,0],
  [5,4,3,2,1,0,0],
  [2,1,0,0,0,0,0]]
d_mask = embedding_layer_.compute_mask(some_input)
print(d_mask)
print(tf.cast(d_mask, dtype="int32"))


tf.Tensor(
[[ True  True  True  True False False False]
 [ True  True  True  True  True False False]
 [ True  True False False False False False]], shape=(3, 7), dtype=bool)
tf.Tensor(
[[1 1 1 1 0 0 0]
 [1 1 1 1 1 0 0]
 [1 1 0 0 0 0 0]], shape=(3, 7), dtype=int32)


### Causal Padding

*   **The Transformer Decoder is order-agnostic: it looks at the entire target sequence at once.**
*   If it were allowed to use its entire input, it would simply learn to copy input step N+1 to location N in the output.

*  **Solution: mask the upper half of the pairwise attention matrix to prevent the model from paying any attention to information from the future**
*  We'll see this in the method get_causal_attention_mask(self, inputs) inside the decoder class


  

<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/Self%20Attention%20Scores.png" width=600px/>
</center>


<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/Multihead%20Attention.png" width=600px/>
</center>


In [None]:
# Assume sequence length is 5
j = normal_range = tf.range(5)
i = with_new_axis = tf.range(5)[:, tf.newaxis]
# with_2new_axis = tf.range(10)[:, tf.newaxis, tf.newaxis]

In [None]:
print(normal_range)
print(with_new_axis)


tf.Tensor([0 1 2 3 4], shape=(5,), dtype=int32)
tf.Tensor(
[[0]
 [1]
 [2]
 [3]
 [4]], shape=(5, 1), dtype=int32)


In [None]:
# j is broadcasted; booleans are cast to int32
d_mask = tf.cast(i >= j, dtype="int32")
print(d_mask)

tf.Tensor(
[[1 0 0 0 0]
 [1 1 0 0 0]
 [1 1 1 0 0]
 [1 1 1 1 0]
 [1 1 1 1 1]], shape=(5, 5), dtype=int32)


In [None]:
d_mask = tf.reshape(d_mask, (1, 5, 5))
print(d_mask)

tf.Tensor(
[[[1 0 0 0 0]
  [1 1 0 0 0]
  [1 1 1 0 0]
  [1 1 1 1 0]
  [1 1 1 1 1]]], shape=(1, 5, 5), dtype=int32)


In [None]:
#Define tile multiplier for tiling
batch_size = 2
mult = tf.concat(
    [tf.expand_dims(batch_size, -1),
      tf.constant([1, 1], dtype=tf.int32)], axis=0)
print(tf.expand_dims(batch_size, -1))
print(tf.constant([1, 1], dtype=tf.int32))
print(mult)

tf.Tensor([2], shape=(1,), dtype=int32)
tf.Tensor([1 1], shape=(2,), dtype=int32)
tf.Tensor([2 1 1], shape=(3,), dtype=int32)


In [None]:
# Tile the mask to replicate across batchsize
causal_mask_ = tf.tile(d_mask, mult)
print(causal_mask_)

tf.Tensor(
[[[1 0 0 0 0]
  [1 1 0 0 0]
  [1 1 1 0 0]
  [1 1 1 1 0]
  [1 1 1 1 1]]

 [[1 0 0 0 0]
  [1 1 0 0 0]
  [1 1 1 0 0]
  [1 1 1 1 0]
  [1 1 1 1 1]]], shape=(2, 5, 5), dtype=int32)


In the example above:

sequence length = 5

batch size = 2

To know more about masking, refer to [this guide](https://www.tensorflow.org/guide/keras/masking_and_padding).

### Transformer Decoder

(Two windows for visualization)

In [None]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        # Define the layers. Let's point them out in the diagram
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        # Now we have 2 MultiHead Attention layers - one for self attention and one for generalized attention
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True #ensures that the layer will propagate its input mask to its outputs;

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1])) # sequence_length == input_shape[1]
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
              tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None): # two inputs: decoder i/p and encoder o/p
        causal_mask = self.get_causal_attention_mask(inputs)
        # print(f"*** test: mask = {mask}")
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask) # union of 0s
            # print(f"**** test padding mask: {padding_mask}")
        else:
            padding_mask = None   # no mask was propagated; avoids a NameError below
        attention_output_1 = self.attention_1(    # Q: What kind of attention?  A: self attention
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask) # Q: What will the causal_mask do? A: makes attention score of a query independent of future tokens
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(  # Q: Is this self attention? A: No. This is generalised attention
            query=attention_output_1,
            value=encoder_outputs,    # Key and Value coming from encoder
            key=encoder_outputs,
            attention_mask=padding_mask,
        )

        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)
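
To see what the line `tf.minimum(padding_mask, causal_mask)` in `call` produces, here is a minimal plain-Python sketch for a single sequence. It assumes a sequence length of 5 whose last two tokens are padding; the element-wise minimum keeps a 1 only where both masks allow attention.

```python
seq_len = 5

# Causal mask: query position i may look at key position j only when i >= j
causal = [[1 if i >= j else 0 for j in range(seq_len)] for i in range(seq_len)]

# Padding mask for one sequence whose last two tokens are padding (index 0)
padding = [1, 1, 1, 0, 0]

# Element-wise minimum, with the padding row broadcast over every query position
combined = [[min(padding[j], causal[i][j]) for j in range(seq_len)]
            for i in range(seq_len)]

for row in combined:
    print(row)
# [1, 0, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
```

A position is attended to only if it is neither in the future nor padding, which is why the comment in the class calls this a union of 0s.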

<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/Transformer%20Network_2.png" width=350px/>
</center>

In [None]:
# English to spanish translation
vocab_size = 20000
sequence_length = 600
embed_dim = 256
dense_dim = 2048
num_heads = 8

encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs) # Q: First arg acts like a 'vocabulary' for pos embedding layer
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x) #Q: What are these arguments? A: embedding dimension, no. of neurons in dense layer, no. of head in multi-head attention layer

decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs) # Q: What are the call arguments in the picture? A: See tutorial video
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs) # Note that there are two input layers
transformer.summary()



**test: mask in not None. mask = Tensor("transformer_encoder_1/strided_slice:0", shape=(None, 1, None), dtype=bool)
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 english (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 spanish (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 positional_embedding_1 (Po  (None, None, 256)            5273600   ['english[0][0]']             
 sitionalEmbedding)                                                                               
                                                                           

# A Machine Translation Example

English to Spanish translation

### Preparing the data

In [None]:
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip -d /content


--2024-03-22 11:31:02--  http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.218.207, 108.177.11.207, 173.194.217.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.218.207|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘spa-eng.zip’


2024-03-22 11:31:02 (328 MB/s) - ‘spa-eng.zip’ saved [2638744/2638744]



In [None]:
# Rows of the dataset
!tail spa-eng/spa.txt

You can't view Flash content on an iPad. However, you can easily email yourself the URLs of these web pages and view that content on your regular computer when you get home.	No puedes ver contenido en Flash en un iPad. Sin embargo, puedes fácilmente enviarte por correo electrónico las URL's de esas páginas web y ver el contenido en tu computadora cuando llegas a casa.
A mistake young people often make is to start learning too many languages at the same time, as they underestimate the difficulties and overestimate their own ability to learn them.	Un error que cometen a menudo los jóvenes es el de comenzar a aprender demasiadas lenguas al mismo tiempo, porque subestiman sus dificultades y sobrestiman sus propias capacidades para aprenderlas.
No matter how much you try to convince people that chocolate is vanilla, it'll still be chocolate, even though you may manage to convince yourself and a few others that it's vanilla.	No importa cuánto insistas en convencer a la gente de que el chocol

In [None]:
# Pre-processing: Separating input and output sequences
text_file = "spa-eng/spa.txt"

with open(text_file) as f:
    lines = f.read().split("\n")[:-1]

text_pairs = []

for line in lines:
    english, spanish = line.split("\t")
    spanish = "[start] " + spanish + " [end]"
    text_pairs.append((english, spanish))

print(random.choice(text_pairs))
print(f"no. of pairs: {len(text_pairs)}")

('"Is it the first time you\'ve been here?" "Yes, it\'s my first visit."', '[start] "¿Es la primera vez que usted está aquí?" - "Sí, es mi primera visita." [end]')
no. of pairs: 118964


In [None]:
# Splitting data

random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples

train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]
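
As a quick check of the split logic, the sizes of the three subsets can be worked out from the number of pairs printed earlier (118964). This is plain arithmetic that mirrors the slicing above; `num_test_samples` is a name introduced here for illustration.

```python
num_pairs = 118964                                   # "no. of pairs" printed above
num_val_samples = int(0.15 * num_pairs)              # 15% for validation
num_train_samples = num_pairs - 2 * num_val_samples  # the remaining 70% for training
num_test_samples = num_pairs - num_train_samples - num_val_samples

print(num_train_samples, num_val_samples, num_test_samples)
# 83276 17844 17844
```

The test set takes whatever remains after the training and validation slices, which here is the same size as the validation set.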

In [None]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
# Vectorizing the English and Spanish text pairs
# Define which characters to strip out of the Spanish data: standard punctuation plus "¿", but keep "[" and "]"
strip_chars = string.punctuation + "¿"  # strip out standard punctuation + the extra one in Spanish
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

# Custom standardization function for spanish
def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(    # Replace elements of input matching regex pattern with rewrite.
        lowercase, f"[{re.escape(strip_chars)}]", "")

vocab_size = 15000
sequence_length = 20

source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)

train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]

source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)
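
The effect of the custom standardization can be previewed without TensorFlow. The sketch below is an approximate plain-Python analogue using the `re` module; `standardize_py` is a name introduced here, and the TensorFlow string ops may treat some characters differently, so treat it as an illustration rather than an exact equivalent.

```python
import re
import string

strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "").replace("]", "")

def standardize_py(text):
    # Lowercase, then delete every character listed in strip_chars
    return re.sub(f"[{re.escape(strip_chars)}]", "", text.lower())

print(standardize_py("[start] ¿Cómo estás? [end]"))
# [start] cómo estás [end]
```

Because the square brackets were removed from `strip_chars`, the "[start]" and "[end]" markers survive standardization and remain single tokens.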

In [None]:
seq = tf.range(10)
dec_in = seq[:-1]
dec_out = seq[1:]

print("original seq")
print(seq)

print("dec_in")
print(dec_in)

print("dec_out")
print(dec_out)

original seq
tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int32)
dec_in
tf.Tensor([0 1 2 3 4 5 6 7 8], shape=(9,), dtype=int32)
dec_out
tf.Tensor([1 2 3 4 5 6 7 8 9], shape=(9,), dtype=int32)


In [None]:
# Preparing datasets for the translation task

batch_size = 64

# IMPORTANT: returns a nested structure ({"english": encoder_input, "spanish": decoder_input}, decoder_target)
def format_dataset(eng, spa):
    # Q: What are eng and spa pre and post re-assignment ? A: raw text and indices
    eng = source_vectorization(eng)
    spa = target_vectorization(spa)
    return ({
        "english": eng,           # encoder input
        "spanish": spa[:, :-1],    # decoder input Q: what is the first axis?  A: the batch axis, of size batch_size
    }, spa[:, 1:])                  # decoder output

def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=4)
    return dataset.shuffle(2048).prefetch(16).cache() #Use in-memory caching to speed up preprocessing.

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [None]:
for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets.shape: {targets.shape}")

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)


### Training and evaluating the model

In [None]:
transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

In [None]:
transformer.fit(train_ds, epochs=2, validation_data=val_ds)

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7b8f3211fd30>

In [None]:
# Inference

spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization(
            [decoded_sentence])[:, :-1]
        predictions = transformer(
            [tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(4):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))

-
Did you play tennis yesterday?
**test: mask in not None. mask = [[[ True  True  True  True  True False False False False False False
   False False False False False False False False False]]]
**test: mask in not None. mask = [[[ True  True  True  True  True False False False False False False
   False False False False False False False False False]]]
**test: mask in not None. mask = [[[ True  True  True  True  True False False False False False False
   False False False False False False False False False]]]
**test: mask in not None. mask = [[[ True  True  True  True  True False False False False False False
   False False False False False False False False False]]]
**test: mask in not None. mask = [[[ True  True  True  True  True False False False False False False
   False False False False False False False False False]]]
**test: mask in not None. mask = [[[ True  True  True  True  True False False False False False False
   False False False False False False False False Fals

**Note that both the TransformerEncoder and the TransformerDecoder are shape-invariant, so you can stack many of them to create a more powerful encoder or decoder.**

<center>
<img src= "https://datascienceimages.s3.eu-north-1.amazonaws.com/Language_Translation_using_Transformer_Decoder_Model/Encoder_Decoder_2.png" width=600px/>
</center>


### Intriguing Questions:

1. How are the encoder outputs connected to the decoder inputs when there are multiple stacks of them?

    **Answer:** The output from the last encoder block acts as input to all decoder blocks.


2. During training, are the decoder inputs obtained from decoder predictions or are they obtained directly from the target data?

    **Answer:** During training, the decoder input is obtained directly from the target data. The only difference between the decoder input and the decoder target is an offset of 1 index. For example, consider a Hindi to English translation problem with an English sample "[start] I like to learn [end]". The input to the decoder for this sample will be sample[:-1], i.e. "[start] I like to learn", and the target will be sample[1:], i.e. "I like to learn [end]". The prediction during training will be a probability distribution over the vocabulary for each element in the sequence. So if the sequence length is 6 and the vocabulary size is 100, then the output shape of the prediction for the given sample will be (6,100). The actual predicted sequence can be computed by taking the argmax, i.e. the token with the maximum probability, at each position in the sequence. An example prediction could be "I love to study". The loss will be computed based on the sum of cross-entropy losses for each token. Here the mismatches 'like'/'love' and 'learn'/'study' will contribute most to the loss.

  (Notes:
  1. The sample will actually contain integer data. Here it is written as text for the sake of clarity.
  2. The above explanation is for 1 sample. If the batch size is 64, i.e. 64 samples in a mini-batch, then the decoder output shape is (64,6,100). In general, it is (batch_size, seq_length, vocab_size).)
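
The argmax and per-token loss described in the answer can be illustrated numerically. The following sketch uses an invented seven-word vocabulary and invented probabilities for two positions of one sample; none of these numbers come from the trained model.

```python
import math

# Toy vocabulary and hypothetical predicted distributions (one row per position)
vocab = ["[end]", "i", "like", "love", "to", "learn", "study"]
predictions = [
    [0.01, 0.90, 0.03, 0.03, 0.01, 0.01, 0.01],   # confident and correct: "i"
    [0.01, 0.02, 0.30, 0.60, 0.03, 0.02, 0.02],   # prefers "love" over the target "like"
]
targets = ["i", "like"]

# Predicted token at each position: the vocabulary entry with the highest probability
predicted_tokens = [vocab[row.index(max(row))] for row in predictions]

# Cross-entropy per position: negative log of the probability given to the target token
losses = [-math.log(row[vocab.index(t)]) for row, t in zip(predictions, targets)]

print(predicted_tokens)   # ['i', 'love']
print(losses)
```

The second position, where the model prefers "love" over "like", yields a much larger loss than the first, although every position contributes something unless the target token receives probability 1.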


Other important points:
- An advantage of Transformers (over RNNs) is that they allow parallelizable computations. During training, the computation for a given token does not depend on the computation for the previous token, so all positions can be processed in parallel.

- Note what kind of data structure the function "format_dataset(eng, spa)" returns. It is a nested structure (inputs, spa_decod_output), where 'inputs' is a dictionary with the keys "english" (the encoder input) and "spanish" (the decoder input) that together form the input of the Transformer model, and 'spa_decod_output' is the target output of the Transformer model.