<a href="https://colab.research.google.com/github/VK-VCS/NLP/blob/main/Transformer_Model_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's build a Transformer-based model for English-to-French translation from its core parts: positional encoding, scaled dot-product attention, and multi-head attention.

This implementation covers the main components of the original encoder-decoder Transformer used for machine translation. Models such as BERT (encoder-only) and GPT (decoder-only) reuse subsets of the same components.

Key Components:

1. **Positional Encoding:** Adds information about the position of tokens in a sequence, since the Transformer architecture has no recurrence or convolution.
2. **Scaled Dot-Product Attention:** The core attention mechanism used to compute attention scores.
3. **Multi-Head Attention:** Allows the model to focus on different parts of the input sequence by using multiple attention heads in parallel.
4. **Feedforward Layers:** A position-wise feedforward network applied after the attention sub-layer.
5. **Layer Normalization:** Applied to the sum of each sub-layer's input and output (a residual connection), for both attention and feedforward sub-layers.

**Step-by-Step Implementation**

We will define the following:

1. Positional Encoding: To add position information to the token embeddings.
2. Scaled Dot-Product Attention: The core attention mechanism.
3. Multi-Head Attention: Uses multiple attention heads to look at different parts of the sequence.
4. Encoder and Decoder: As described in the Transformer model.
5. Output Layer: Final prediction layer for translating sequences.

**Step 1: Define Helper Functions**

1.1 Positional Encoding

We need to create a positional encoding matrix, which is added to the input embeddings to give each token positional information.
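
For reference, the sinusoidal encoding defined in the original Transformer paper, which the function in the next cell appears to follow, assigns sine values to even feature indices and cosine values to odd ones:

$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

Here $pos$ is the token position and $i$ indexes pairs of feature dimensions, so position 0 is encoded as the alternating pattern $0, 1, 0, 1, \dots$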

In [None]:
import tensorflow as tf
import numpy as np

def get_positional_encoding(seq_len, d_model):
    """ Get positional encoding for a sequence of length `seq_len` and embedding size `d_model`. """
    pos = np.arange(seq_len)[:, np.newaxis]  # Shape: (seq_len, 1)
    i = np.arange(d_model)[np.newaxis, :]  # Shape: (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    positional_encoding = pos * angle_rates
    positional_encoding[:, 0::2] = np.sin(positional_encoding[:, 0::2])  # Apply sin to even indices
    positional_encoding[:, 1::2] = np.cos(positional_encoding[:, 1::2])  # Apply cos to odd indices
    return positional_encoding


1.2 Scaled Dot-Product Attention

The scaled dot-product attention computes attention weights between queries, keys, and values.

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
    """ Calculate attention weights. """
    matmul_qk = tf.matmul(query, key, transpose_b=True)  # (batch_size, seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(key)[-1], tf.float32)  # The depth of the key
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (batch_size, seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, value)  # (batch_size, seq_len_q, depth_value)

    return output, attention_weights
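
To make the computation concrete, here is a minimal NumPy-only sketch of the same steps on a tiny random input. The helper names `softmax` and `attention_np` are illustrative and not part of the notebook; the mask convention (1 = blocked) is assumed from the `mask * -1e9` term above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_np(query, key, value, mask=None):
    """NumPy sketch of scaled dot-product attention (mask: 1 = blocked)."""
    d_k = key.shape[-1]
    logits = query @ np.swapaxes(key, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        logits = logits + mask * -1e9  # blocked positions get a huge negative score
    weights = softmax(logits, axis=-1)
    return weights @ value, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 3, 4))  # (batch, seq_len_q, depth)
k = rng.normal(size=(1, 3, 4))  # (batch, seq_len_k, depth)
v = rng.normal(size=(1, 3, 4))  # (batch, seq_len_k, depth_value)

# Block the last key position for every query; broadcasts to (1, 3, 3)
mask = np.array([[[0.0, 0.0, 1.0]]])
output, weights = attention_np(q, k, v, mask)

print(output.shape)           # (1, 3, 4)
print(weights.sum(axis=-1))   # every row of attention weights sums to 1
print(weights[..., 2])        # weights on the blocked key are effectively 0
```

Dividing by the square root of the key depth keeps the logits in a range where the softmax does not saturate as the depth grows.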


1.3 Multi-Head Attention

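The multi-head attention mechanism runs several attention heads in parallel. Each head attends over the whole sequence using its own learned projection of size `d_model // num_heads`, and the head outputs are concatenated and projected back to `d_model`.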

In [None]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, num_heads, d_model):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0, "d_model must be divisible by num_heads"

        self.depth = d_model // self.num_heads
        self.query_dense = tf.keras.layers.Dense(d_model)
        self.key_dense = tf.keras.layers.Dense(d_model)
        self.value_dense = tf.keras.layers.Dense(d_model)
        self.output_dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """ Split the input into multiple heads. """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))  # (batch_size, seq_len, num_heads, depth)
        return tf.transpose(x, perm=[0, 2, 1, 3])  # (batch_size, num_heads, seq_len, depth)

    def call(self, query, key, value, mask=None):
        batch_size = tf.shape(query)[0]

        query = self.query_dense(query)  # (batch_size, seq_len, d_model)
        key = self.key_dense(key)  # (batch_size, seq_len, d_model)
        value = self.value_dense(value)  # (batch_size, seq_len, d_model)

        query = self.split_heads(query, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        key = self.split_heads(key, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        value = self.split_heads(value, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        output, attention_weights = scaled_dot_product_attention(query, key, value, mask)

        output = tf.transpose(output, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        output = tf.reshape(output, (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.output_dense(output)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights
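
The reshape and transpose steps in `split_heads` and at the end of `call` are easy to get wrong, so the following NumPy sketch checks the shapes on a small array. The sizes are arbitrary illustrative values, and the dense projections are left out.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 8, 4
depth = d_model // num_heads  # features per head, here 2

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Merge: undo the transpose, then flatten the heads back into d_model
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape)                # (2, 4, 5, 2)
print(np.array_equal(merged, x))  # True: the merge exactly inverts the split

# Head h holds features [h * depth, (h + 1) * depth) of every token
print(np.array_equal(heads[0, 1], x[0, :, 2:4]))  # True
```

Because the split is only a reshape, multi-head attention costs about the same as a single head of full width; the heads differ because the preceding dense layers learn different projections for each slice.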


**Step 2: Transformer Encoder and Decoder**

Now, let's define the encoder and decoder using multi-head attention and feedforward layers.

2.1 Transformer Encoder Layer

In [None]:
class TransformerEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads, d_model, dff, dropout_rate=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.multi_head_attention = MultiHeadAttention(num_heads, d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])
        self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, mask, training):
        attn_output, _ = self.multi_head_attention(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layer_norm1(x + attn_output)

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layer_norm2(out1 + ffn_output)

        return out2
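
Both sub-layers in the encoder follow the same "add and normalize" pattern: the sub-layer output is added to its input and the sum is layer-normalized. A minimal NumPy sketch of that pattern follows. The helper names are illustrative, and the learned scale and shift of `LayerNormalization` are omitted (Keras initializes them to 1 and 0).

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize each token's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def add_and_norm(x, sublayer_output):
    # Residual connection followed by layer normalization
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 5, 8))  # (batch, seq_len, d_model)
sublayer_output = rng.normal(size=x.shape)          # stand-in for attention or FFN output

out = add_and_norm(x, sublayer_output)
print(out.shape)                                       # (2, 5, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))  # True
print(np.allclose(out.std(axis=-1), 1.0, atol=1e-3))   # True
```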


2.2 Transformer Decoder Layer

In [None]:
class TransformerDecoderLayer(tf.keras.layers.Layer):
    def __init__(self, num_heads, d_model, dff, dropout_rate=0.1):
        super(TransformerDecoderLayer, self).__init__()
        self.multi_head_attention1 = MultiHeadAttention(num_heads, d_model)
        self.multi_head_attention2 = MultiHeadAttention(num_heads, d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])
        self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

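    def call(self, x, enc_output, mask, training):
        # Look-ahead mask (1 = blocked): each position may attend only to itself and earlier positions
        seq_len = tf.shape(x)[1]
        look_ahead_mask = 1.0 - tf.linalg.band_part(tf.ones(tf.stack([seq_len, seq_len])), -1, 0)

        # Masked self-attention over the decoder input
        attn_output1, _ = self.multi_head_attention1(x, x, x, look_ahead_mask)
        attn_output1 = self.dropout1(attn_output1, training=training)
        out1 = self.layer_norm1(x + attn_output1)

        # Cross-attention: queries come from the decoder, keys and values from the encoder output.
        # `mask` is the padding mask over the encoder positions.
        attn_output2, _ = self.multi_head_attention2(out1, enc_output, enc_output, mask)
        attn_output2 = self.dropout2(attn_output2, training=training)
        out2 = self.layer_norm2(out1 + attn_output2)

        # Position-wise feedforward network
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layer_norm3(out2 + ffn_output)

        return out3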
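
In the original Transformer, the decoder's self-attention is restricted by a look-ahead (causal) mask so that a position cannot attend to later positions. Under the convention used by `scaled_dot_product_attention` above, where a mask value of 1 blocks a position, such a mask can be sketched in NumPy as follows. The helper name `look_ahead_mask` is hypothetical.

```python
import numpy as np

def look_ahead_mask(seq_len):
    """Return a (seq_len, seq_len) mask where 1 marks a blocked (future) position."""
    # Keep only the entries strictly above the diagonal
    return np.triu(np.ones((seq_len, seq_len), dtype=np.float32), k=1)

mask = look_ahead_mask(4)
print(mask)
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```

Row `t` of the mask lists which key positions query `t` may not see, so the first token sees only itself and the last token sees the whole prefix.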


In [None]:
import tensorflow_datasets as tfds

# Load the TED talks Portuguese-to-English dataset from TFDS (the `pt_to_en` config).
# With as_supervised=True each example is a (Portuguese, English) pair of strings, so the
# `english_*` / `french_*` names used below simply refer to the first (source) and
# second (target) sentence of each pair.
dataset, info = tfds.load("ted_hrlr_translate/pt_to_en", as_supervised=True, with_info=True)

train_data = dataset['train']
val_data = dataset['validation']
test_data = dataset['test']


Downloading and preparing dataset 124.94 MiB (download: 124.94 MiB, generated: Unknown size, total: 124.94 MiB) to /root/tensorflow_datasets/ted_hrlr_translate/pt_to_en/1.0.0...

Dataset ted_hrlr_translate downloaded and prepared to /root/tensorflow_datasets/ted_hrlr_translate/pt_to_en/1.0.0. Subsequent calls will reuse this data.

In [None]:
# Build a tokenizer that maps sentences to integer sequences padded or truncated to 40 tokens
def tokenize(texts):
    tokenizer = tf.keras.layers.TextVectorization(output_mode='int', output_sequence_length=40)
    tokenizer.adapt(texts)
    return tokenizer

# Fit one tokenizer per language on the raw training sentences
english_tokenizer = tokenize(train_data.map(lambda en, fr: en).batch(1024))
french_tokenizer = tokenize(train_data.map(lambda en, fr: fr).batch(1024))

# Convert a batch of sentence pairs into model inputs and targets
def encode(lang1, lang2):
    lang1 = english_tokenizer(lang1)
    lang2 = french_tokenizer(lang2)
    # Teacher forcing: the decoder input is the target shifted right by one position,
    # with the padding id 0 doubling as the start marker (an assumption of this notebook)
    decoder_input = tf.pad(lang2[:, :-1], [[0, 0], [1, 0]])
    return (lang1, decoder_input), lang2

train_data = train_data.batch(64).map(encode).cache()
val_data = val_data.batch(64).map(encode).cache()
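
Training an encoder-decoder model usually relies on teacher forcing: the decoder receives the target sequence shifted right by one position and is asked to predict the unshifted sequence. A small pure-Python sketch of that shift follows. The token ids are made up, and reusing the padding id 0 as the start marker is an assumption made for illustration, not something the dataset prescribes.

```python
def make_teacher_forcing_pair(target_ids, start_id=0):
    """Return (decoder_input, decoder_target) for one tokenized sentence.

    start_id is a hypothetical choice: the padding id 0 reused as start marker.
    """
    decoder_input = [start_id] + target_ids[:-1]  # target shifted right by one
    decoder_target = target_ids                   # what the decoder should predict
    return decoder_input, decoder_target

target_ids = [12, 7, 45, 3, 0, 0]  # made-up token ids, padded with 0
decoder_input, decoder_target = make_teacher_forcing_pair(target_ids)

print(decoder_input)   # [0, 12, 7, 45, 3, 0]
print(decoder_target)  # [12, 7, 45, 3, 0, 0]
```

At every position the decoder sees only tokens that precede the one it must predict, which is the same situation it faces at inference time.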


**Step 3: Define the Full Transformer Model**

Finally, we'll define the complete Transformer model by stacking the encoder and decoder layers, incorporating the positional encoding, and using the final output layer for predictions.

In [None]:
class Transformer(tf.keras.Model):
    def __init__(self, num_heads, d_model, dff, num_encoder_layers, num_decoder_layers, vocab_size, dropout_rate=0.1):
        super(Transformer, self).__init__()

        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers
        self.d_model = d_model

        self.encoder_embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.decoder_embedding = tf.keras.layers.Embedding(vocab_size, d_model)

        # Precompute the positional encoding as a float32 tensor so it can be
        # sliced with a dynamic length and added to the embeddings
        self.positional_encoding = tf.cast(get_positional_encoding(40, d_model), tf.float32)

        self.encoder_layers = [TransformerEncoderLayer(num_heads, d_model, dff, dropout_rate) for _ in range(num_encoder_layers)]
        self.decoder_layers = [TransformerDecoderLayer(num_heads, d_model, dff, dropout_rate) for _ in range(num_decoder_layers)]

        self.output_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')

    def call(self, encoder_input, decoder_input=None, mask=None, training=False):
        # Keras `fit` passes all inputs as a single structure, so also accept an
        # (encoder_input, decoder_input) tuple as the first argument
        if decoder_input is None:
            encoder_input, decoder_input = encoder_input

        # Add positional encoding to the embedded inputs
        enc_len = tf.shape(encoder_input)[1]
        dec_len = tf.shape(decoder_input)[1]
        encoder_output = self.encoder_embedding(encoder_input) + self.positional_encoding[:enc_len, :]
        decoder_output = self.decoder_embedding(decoder_input) + self.positional_encoding[:dec_len, :]

        # Pass through encoder layers
        for layer in self.encoder_layers:
            encoder_output = layer(encoder_output, mask, training=training)

        # Pass through decoder layers
        for layer in self.decoder_layers:
            decoder_output = layer(decoder_output, encoder_output, mask, training=training)

        # Project to per-token probabilities over the vocabulary
        output = self.output_layer(decoder_output)

        return output


**Step 4: Model Training**

Now you can compile and train the model. With six encoder and six decoder layers and `d_model = 512`, the model is large, so good results will likely require a large dataset and some hyperparameter tuning.

In [None]:
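# Define model parameters
# The model uses one vocab_size for both embeddings and the output layer,
# so it must cover the larger of the two vocabularies
vocab_size = max(len(english_tokenizer.get_vocabulary()), len(french_tokenizer.get_vocabulary()))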
num_heads = 8
d_model = 512
dff = 2048
num_encoder_layers = 6
num_decoder_layers = 6

# Instantiate and compile the Transformer model
transformer = Transformer(num_heads, d_model, dff, num_encoder_layers, num_decoder_layers, vocab_size)

transformer.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                    loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
transformer.fit(train_data, epochs=10, validation_data=val_data)



**Step 5: Translate New Sentences**

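Once the model is trained, you can use it for inference. This involves encoding the English sentence, then generating the translation one token at a time: each predicted token is appended to the decoder input until the end marker is produced.

In [None]:
def translate(sentence, model, max_len=40):
    # Assumes teacher-forced training in which the padding id 0 marks both
    # the start and the end of the target sequence
    vocab = french_tokenizer.get_vocabulary()
    encoder_input = english_tokenizer(tf.constant([sentence]))  # (1, 40)
    decoder_input = tf.zeros((1, 1), dtype=tf.int64)  # start marker

    words = []
    for _ in range(max_len):
        # Greedy decoding: keep the most probable token at the last position
        probs = model(encoder_input, decoder_input, training=False)
        next_id = tf.argmax(probs[:, -1, :], axis=-1)  # shape (1,)
        if int(next_id[0]) == 0:
            break
        words.append(vocab[int(next_id[0])])
        decoder_input = tf.concat([decoder_input, next_id[:, tf.newaxis]], axis=-1)
    return " ".join(words)

# Translate a sentence
english_sentence = "Hello, how are you?"
translated_sentence = translate(english_sentence, transformer)
print(f"English: {english_sentence}")
print(f"French: {translated_sentence}")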
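
Inference with an encoder-decoder model is typically autoregressive: the decoder is called repeatedly, and each predicted token is fed back in as input for the next step. The pure-Python sketch below isolates that loop. `toy_next_token` is a hypothetical stand-in for a trained model, and using id 0 as both start and end marker is an assumption made for illustration.

```python
def greedy_decode(next_token_fn, source_ids, start_id=0, end_id=0, max_len=10):
    """Generate target ids one at a time, feeding each prediction back in."""
    decoded = [start_id]
    for _ in range(max_len):
        next_id = next_token_fn(source_ids, decoded)
        if next_id == end_id:
            break
        decoded.append(next_id)
    return decoded[1:]  # drop the start marker

def toy_next_token(source_ids, decoded_so_far):
    """Hypothetical model stand-in: copies the source, then emits the end id."""
    position = len(decoded_so_far) - 1
    return source_ids[position] if position < len(source_ids) else 0

print(greedy_decode(toy_next_token, [5, 9, 2]))             # [5, 9, 2]
print(greedy_decode(toy_next_token, [5, 9, 2], max_len=2))  # [5, 9]
```

Greedy decoding keeps only the single most probable token at each step; beam search is a common alternative when translation quality matters more than speed.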

