### Transformer Network

The research paper "Attention Is All You Need" published in 2017 introduced Transformer neural networks, which has brought about a revolutionary change in the field of NLP. The key innovation of the Transformer is its **`attention mechanism`**, which allows it to process input sequences in parallel and capture long-range dependencies more effectively.


**Link to Research Paper:** https://arxiv.org/abs/1706.03762



**Why Trasnformer?**

Key things that makes transformer better than RNN and LSTM:
- Parellel processing of the inputs which can make the training process lot faster
- Better contextual representation of the input because of multi-head self attentions


**When to use Transformer?**

Transformers are best suited for tasks where context and long-range dependencies are important. Their attention mechanisms allow them to capture these features more effectively.

Some of the common NLP tasks that are well-suited for transformers include:
- Machine Translation
- Text Summarization
- Question Answering
- Named Entity Recognition etc.,


# Exploring Transformer Architecture through English to French Machine Translation



## Table of Contents

- [Python Libraries](#0)
- [1 - Positional Encoding](#1)
- [2 - Masking](#2)
    - [2.1 - Padding Mask](#2-1)
    - [2.2 - Look-ahead Mask](#2-2)
- [3 - Encoder](#4)
    - [4.1 Encoder Layer](#4-1)
    - [4.2 - Full Encoder](#4-2)
- [4 - Decoder](#5)
    - [5.1 - Decoder Layer](#5-1)
    - [5.2 - Full Decoder](#5-2)
- [5 - Transformer](#6)
- [6 - References](#7)

<a name='0'></a>
## Python Libraries

Loading all the required python packages.

In [1]:
!pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [68]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow import GradientTape, train, function
from tensorflow.keras.metrics import Mean
from time import time
from pickle import dump
from pickle import load


<a name='1'></a>
## 1 - Positional Encoding

Positional encoding to the input sequence is a critical step in the Transformer architecture, as it enables the model to understand the sequential order of the tokens in the input sequence.

The formula for calculating the positional encoding is as follows:

$${PE}{(pos,k)} = sin(pos/10000^{2i/dmodel})$$

$${PE}{(pos,k+1)} = cos(pos/10000^{2i/dmodel})$$

- $pos$ is the position of the token in the sequence
- $k = 2i$ is the index of the the each dimension in positional encoding. So, $i= k//2$
- $dmodel$ is the dimension of the embedding vector.

if $k=[0,1,2,3,4,5]$ indices of postional embedding, then $i=[0,0,1,1,2,2]$

In [3]:
def get_angles(pos, k, d_model:int):
  """
  Arguments:
  pos -- (an array of shape (position, 1) representing the positions in the sequence
  k -- k (an array of shape (1, d_model) representing the indices of the embedding dimensions
  d_model -- integer representing the dimensionality of the model
  
  Returns:
  angles -- an array of shape (position, d_model) representing the angles for the positional encoding.
  """
  i = k//2
  angles = pos/np.power(10_000, 2*i/d_model)
  return angles

In [4]:
def positional_encoding(position:int, d_model:int ):
  """
  Arguments:
  position: an integer indicating the maximum sequence length
  d_model: an integer indicating the dimensionality of the model
  
  Returns:
  encoding -- a 3D tensor of shape (1, position, d_model) representing the positional encoding for a sequence of length position
  """
  angle_radians = get_angles(np.arange(position)[: , np.newaxis],
                             np.arange(d_model)[np.newaxis, :],
                             d_model)
  # apply sin to even indices in the array; 2i
  angle_radians[:, 0::2] = np.sin(angle_radians[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_radians[:,1:2] = np.cos(angle_radians[:, 1:2])

  pos_encoding = angle_radians[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)

In [5]:
positional_encoding(4, 5)

<tf.Tensor: shape=(1, 4, 5), dtype=float32, numpy=
array([[[ 0.0000000e+00,  1.0000000e+00,  0.0000000e+00,  0.0000000e+00,
          0.0000000e+00],
        [ 8.4147096e-01,  5.4030228e-01,  2.5116222e-02,  2.5118865e-02,
          6.3095731e-04],
        [ 9.0929741e-01, -4.1614684e-01,  5.0216600e-02,  5.0237730e-02,
          1.2619144e-03],
        [ 1.4112000e-01, -9.8999250e-01,  7.5285293e-02,  7.5356595e-02,
          1.8928709e-03]]], dtype=float32)>

<a name='2'></a>
## 2 - Masking

There are two types of masks used when building Transformer network: *padding mask* and *look-ahead mask*. 

<a name='2-1'></a>
### 2.1 - Padding Mask

It is important to feed sequences of uniform length to the transformer. We can pad the sequences with zeros, and truncate sequences that exceed maximum length of the model.

In [6]:
def create_padding_mask(seq):
  """
    Creates a mask tensor representing the padding positions in the input sequence.
    
    Arguments:
    seq -- a tensor of shape (batch_size, seq_len)

    Returns:
    mask -- a tensor of shape (batch_size, 1, seq_len), where each position is 0 if the corresponding position in
    the input sequence is a padding position, and 1 otherwise.
  """
  mask = 1 - tf.cast(tf.math.equal(seq, 0),dtype=tf.float32)

  # reshaping mask so that it has an additional dimension, 
  # which will be needed when applying the mask in the self-attention mechanism of the Transformer model. 
  return mask[:, tf.newaxis, :]


In [7]:
x = tf.constant([[7., 6., 1., 0., 0.], 
                 [1., 2., 3., 0., 0.], 
                 [4., 5., 0., 0., 0.]])
print(create_padding_mask(x))

tf.Tensor(
[[[1. 1. 1. 0. 0.]]

 [[1. 1. 1. 0. 0.]]

 [[1. 1. 0. 0. 0.]]], shape=(3, 1, 5), dtype=float32)


In [8]:
print(tf.keras.activations.softmax(x)) # softmax without padding
print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9)) #softmax with padding

tf.Tensor(
[[7.2876632e-01 2.6809818e-01 1.8064311e-03 6.6454883e-04 6.6454883e-04]
 [8.4437370e-02 2.2952460e-01 6.2391245e-01 3.1062772e-02 3.1062772e-02]
 [2.6502502e-01 7.2041267e-01 4.8541022e-03 4.8541022e-03 4.8541022e-03]], shape=(3, 5), dtype=float32)
tf.Tensor(
[[[0.7297362  0.26845497 0.00180884 0.         0.        ]
  [0.09003057 0.24472848 0.6652409  0.         0.        ]
  [0.26762316 0.7274751  0.00490169 0.         0.        ]]

 [[0.7297362  0.26845497 0.00180884 0.         0.        ]
  [0.09003057 0.24472848 0.6652409  0.         0.        ]
  [0.26762316 0.7274751  0.00490169 0.         0.        ]]

 [[0.73105854 0.26894143 0.         0.         0.        ]
  [0.26894143 0.73105854 0.         0.         0.        ]
  [0.26894143 0.73105854 0.         0.         0.        ]]], shape=(3, 3, 5), dtype=float32)


<a name='2-2'></a>
### 2.2 - Look-ahead Mask


In [9]:
def create_look_ahead_mask(sequence_length):
  mask = tf.linalg.band_part(tf.ones((1, sequence_length, sequence_length)), -1, 0)
  return mask 

In [10]:
create_look_ahead_mask(4)

<tf.Tensor: shape=(1, 4, 4), dtype=float32, numpy=
array([[[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]]], dtype=float32)>

<a name='4'></a>
## 3 - Encoder

Encoder contains - multi-head self attention layers and feed forward neural network that is independently applied to every position.

In [11]:
def FeedForward(embedding_dim, full_connected_dim):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(full_connected_dim, activation='relu'),
      tf.keras.layers.Dense(embedding_dim)
  ])

In [12]:
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, embedding_dim, num_heads, full_connected_dim,dropout_rate=0.1, layernorm_eps=1e-6 ):
    super().__init__()
    self.mha = MultiHeadAttention(num_heads, key_dim=embedding_dim, dropout=dropout_rate)
    self.ffnn = FeedForward(embedding_dim, full_connected_dim)
    self.layer_norm1 = LayerNormalization(epsilon = layernorm_eps )
    self.layer_norm2 = LayerNormalization(epsilon = layernorm_eps)
    self.drop_out = Dropout(dropout_rate)
  
  def __call__(self, x, training, mask):
    self_mha_output = self.mha(x,x,x,mask) # if query, key, value are same, then self-attenstion will be computed
    out1 = self.layer_norm1(x + self_mha_output)
    ffn_output = self.ffnn(out1)
    ffn_output = self.drop_out(ffn_output, training=training)
    encoder_layer_out = self.layer_norm2(out1 + ffn_output)
    return encoder_layer_out


In [13]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_encoders, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, 
               max_pos_encoding, dropout_rate=0.1, layernorm_eps=1e-6):
    super().__init__()
    self.embedding_dim = embedding_dim
    self.num_layers = num_encoders
    self.embedding = Embedding(input_vocab_size, embedding_dim)
    self.pos_encoding = positional_encoding(max_pos_encoding, embedding_dim)
    self.enc_layers = [EncoderLayer(embedding_dim, num_heads, fully_connected_dim, dropout_rate, layernorm_eps) for _ in range(num_encoders)]
    self.dropout = Dropout(dropout_rate)
  
  def __call__(self, x, training, mask):
    seq_len = tf.shape(x)[1]
    x = self.embedding(x)
    # scaling: This is done to prevent the dot product operation in the self-attention mechanism from getting too large or too small
    x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]
    x = self.dropout(x, training = training)
    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x # tensor of shape (batch_size, input_seq_len, embedding_dim)

<a name='5'></a>
## 4 - Decoder


In [14]:
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, embedding_dim, num_heads, fully_connected_dim, dropout_rate=0.1, layernorm_eps=1e-6):
    super().__init__()
    self.masked_mha = MultiHeadAttention(num_heads, key_dim=embedding_dim, dropout=dropout_rate)
    self.mha = MultiHeadAttention(num_heads, key_dim=embedding_dim, dropout=dropout_rate)
    self.ffnn = FeedForward(embedding_dim, fully_connected_dim)
    self.layer_norm1 = LayerNormalization(epsilon=layernorm_eps)
    self.layer_norm2 = LayerNormalization(epsilon=layernorm_eps)
    self.layer_norm3 = LayerNormalization(epsilon=layernorm_eps)
    self.dropout = Dropout(dropout_rate)
  
  def __call__(self, x, enc_output, training, look_ahead_mask, padding_mask):
    mult_attn_out1, attn_weights_block1 = self.masked_mha(x, x, x, look_ahead_mask, return_attention_scores=True)
    Q1 = self.layer_norm1(mult_attn_out1 + x)
    mult_attn_out2, attn_weights_block2 = self.mha(Q1, enc_output, enc_output, padding_mask, return_attention_scores=True) 
    mult_attn_out2 = self.layer_norm2(mult_attn_out2 + Q1)
    ffn_output = self.ffnn(mult_attn_out2)
    ffn_output = self.dropout(ffn_output, training = training)
    out3 = self.layer_norm3(ffn_output + mult_attn_out2)
    return out3, attn_weights_block1, attn_weights_block2
    

In [15]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_decoders, embedding_dim, num_heads, fully_connected_dim, target_vocab_size, max_pos_encoding, dropout_rate=0.1, layernorm_eps=1e-6 ):
    super().__init__()
    self.embedding_dim = embedding_dim
    self.num_layers = num_decoders
    self.embedding = Embedding(target_vocab_size, embedding_dim)
    self.pos_encoding = positional_encoding(max_pos_encoding, embedding_dim)
    self.dec_layers = [DecoderLayer(embedding_dim, num_heads, fully_connected_dim) for _ in range(num_decoders)]
    self.dropout = Dropout(dropout_rate)

  def __call__(self,x, enc_output, training, look_ahead_mask, padding_mask):
    seq_len = tf.shape(x)[1]
    attention_weights = {}

    x = self.embedding(x)
    #scaling
    x *= tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]
    x = self.dropout(x, training = training)
    for i in range(self.num_layers):
      x, block1, block2 = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)
      attention_weights['decoder_layer{}_block1_self_att'.format(i+1)] = block1
      attention_weights['decoder_layer{}_block2_decenc_att'.format(i+1)] = block2
    return x, attention_weights

<a name='6'></a> 
## 5 - Transformer


In [64]:
class Transformer(tf.keras.Model):
  def __init__(self, num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, target_vocab_size, 
               max_pos_encoding_input, max_pos_encoding_target):
    super().__init__()
    self.encoder = Encoder(num_layers, embedding_dim, num_heads, fully_connected_dim, input_vocab_size, max_pos_encoding_input)
    self.decoder = Decoder(num_layers, embedding_dim, num_heads, fully_connected_dim, target_vocab_size, max_pos_encoding_target)
    self.final_layer = Dense(target_vocab_size, activation='softmax')

  def create_padding_mask(self, seq):
    mask = 1 - tf.cast(tf.math.equal(seq, 0),dtype=tf.float32)
    return mask[:, tf.newaxis, :]
  
  def create_look_ahead_mask(sequence_length):
    mask = tf.linalg.band_part(tf.ones((1, sequence_length, sequence_length)), -1, 0)
    return mask 
    
  def __call__(self, input_sentence, output_sentence, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
    enc_padding_mask = self.create_padding_mask(input_sentence)
    dec_padding_mask = self.create_padding_mask(output_sentence)
    look_ahead_mask = self.create_look_ahead_mask(output_sentence.shape[1])
    
    dec_in_padding_mask = self.padding_mask(decoder_input)
    dec_in_lookahead_mask = self.lookahead_mask(decoder_input.shape[1])

    enc_output = self.encoder(input_sentence, training, enc_padding_mask)
    dec_output, attention_weights = self.decoder(output_sentence, enc_output, training, look_ahead_mask, dec_padding_mask )
    final_output = self.final_layer(dec_output)
    return final_output, attention_weights

Machine Translation (English to French) using the above transformer

Loading the dataset

In [17]:
!!curl -O http://www.manythings.org/anki/hin-eng.zip
!!unzip hin-eng.zip

['Archive:  hin-eng.zip',
 '  inflating: hin.txt                 ',
 '  inflating: _about.txt              ']

In [18]:
data_path = "hin.txt"
batch_size= 10_000
enc_seq_length = 30
dec_seq_length = 30
input_texts = []
target_texts = []


with open(data_path, "r", encoding="utf-8") as f:
  lines = f.read().split("\n")
  for line in lines[: min(batch_size, len(lines) - 1)]:
    line = line.lower()
    input_text, target_text, _ = line.split("\t")
    input_texts.append('<sos> '+input_text+' <eos>')
    target_texts.append('<sos> ' + target_text + ' <eos>')

Text Vectorization

In [19]:
input_text_vectorizer = TextVectorization(max_tokens=10_000, output_mode='int')
input_tf_dataset = tf.data.Dataset.from_tensor_slices(tf.cast(input_texts, dtype=tf.string))
input_text_vectorizer.adapt(input_tf_dataset)
input_tensors = input_tf_dataset.map(input_text_vectorizer)
input_tensors = pad_sequences(input_tensors, padding='post', maxlen=enc_seq_length)

In [20]:
output_text_vectorizer = TextVectorization(max_tokens=10_000, output_mode='int')
output_tf_dataset = tf.data.Dataset.from_tensor_slices(tf.cast(target_texts, dtype=tf.string))
output_text_vectorizer.adapt(output_tf_dataset)
output_tensors = output_tf_dataset.map(output_text_vectorizer)
output_tensors = pad_sequences(output_tensors, padding='post', maxlen=dec_seq_length)

In [21]:
ENCODER_VOCAB_SIZE = input_text_vectorizer.vocabulary_size()
TARGET_VOCAB_SIZE = output_text_vectorizer.vocabulary_size()


print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", ENCODER_VOCAB_SIZE)
print("Number of unique target tokens:", TARGET_VOCAB_SIZE)

Number of samples: 2980
Number of unique input tokens: 2410
Number of unique target tokens: 3067


Training the Transformer

In [22]:
tf.random.set_seed(10)

transformer = Transformer(num_layers=6, embedding_dim=4, num_heads=4, fully_connected_dim=8,
                          input_vocab_size=ENCODER_VOCAB_SIZE, target_vocab_size=TARGET_VOCAB_SIZE,
                          max_pos_encoding_input=enc_seq_length,
                          max_pos_encoding_target=dec_seq_length)


enc_padding_mask = create_padding_mask(input_tensors)
dec_padding_mask = create_padding_mask(output_tensors)
look_ahead_mask = create_look_ahead_mask(dec_seq_length)

translation, weights = transformer( input_tensors, output_tensors, True, enc_padding_mask, look_ahead_mask, dec_padding_mask )

In [33]:
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = tf.keras.losses.sparse_categorical_crossentropy(real, pred, from_logits=True)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)

In [34]:
def compute_accuracy(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    accuracy_ = tf.keras.metrics.sparse_categorical_accuracy(real, pred)
    mask = tf.cast(mask, dtype=accuracy_.dtype)
    accuracy_ *= mask
    return tf.reduce_sum(accuracy_) / tf.reduce_sum(mask)


In [40]:
input_dataset = tf.data.Dataset.from_tensor_slices(input_tensors)
output_dataset = tf.data.Dataset.from_tensor_slices(output_tensors)

dataset= tf.data.Dataset.zip((input_dataset, output_dataset))
dataset = dataset.shuffle(buffer_size=500)

In [56]:
validation_size = 0.2
num_elements = dataset.reduce(0, lambda x, _: x + 1).numpy() # count the number of elements in the dataset
num_validation = int(num_elements * validation_size)

# creating training and validation dataset
train_dataset = dataset.skip(num_validation)
validation_dataset = dataset.take(num_validation)

train_dataset = train_dataset.batch(batch_size)
validation_dataset = validation_dataset.batch(batch_size)

In [58]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

transformer_model = Transformer(num_layers=6, embedding_dim=4, num_heads=4, fully_connected_dim=8,
                                input_vocab_size=ENCODER_VOCAB_SIZE, target_vocab_size=TARGET_VOCAB_SIZE,
                                max_pos_encoding_input=enc_seq_length,
                                max_pos_encoding_target=dec_seq_length)

In [72]:

train_loss = Mean(name='train_loss')
train_accuracy = Mean(name='train_accuracy')
val_loss = Mean(name='val_loss')
# Create a checkpoint object and manager to manage multiple checkpoints
ckpt = train.Checkpoint(model=transformer_model, optimizer=optimizer)
ckpt_manager = train.CheckpointManager(ckpt, "Weights", max_to_keep=3)

# Initialise dictionaries to store the training and validation losses
train_loss_dict = {}
val_loss_dict = {}
epochs = 50

In [71]:
@function
def train_step( encoder_input_data, decoder_target_data, decoder_output):
    with tf.GradientTape() as tape:
    # Run the forward pass of the model to generate a prediction
        prediction,_ = transformer_model(encoder_input, decoder_target_data, trainable=True)
        # Compute the training loss
        loss = loss_function(decoder_output, prediction)
        # Compute the training accuracy
        accuracy = compute_accuracy(decoder_output, prediction)
    # Retrieve gradients of the trainable variables with respect to the training loss
    gradients = tape.gradient(loss, transformer_model.trainable_weights)
    # Update the values of the trainable variables by gradient descent
    optimizer.apply_gradients(zip(gradients, transformer_model.trainable_weights))
    train_loss(loss)
    train_accuracy(accuracy)
    start_time = time()
    for epoch in range(epochs):
        train_loss.reset_states()
        train_accuracy.reset_states()
        val_loss.reset_states()
        # Iterate over the dataset batches
        for step, (train_batchX, train_batchY) in enumerate(train_dataset):
            # Define the encoder and decoder inputs, and the decoder output
            encoder_input = train_batchX[:, 1:]
            decoder_input = train_batchY[:, :-1]
            decoder_output = train_batchY[:, 1:]
            train_step(encoder_input, decoder_input, decoder_output)
            # Print epoch number and loss value at the end of every epoch
            print(f"Epoch {epoch+1}: Training Loss {train_loss.result():.4f}, "
            + f"Training Accuracy {train_accuracy.result():.4f}")
            # Save a checkpoint after every five epochs
            if (epoch + 1) % 5 == 0:
                save_path = ckpt_manager.save()
                print(f"Saved checkpoint at epoch {epoch+1}")
          # Run a validation step after every epoch of training
        for val_batchX, val_batchY in validation_dataset:
          # Define the encoder and decoder inputs, and the decoder output
          encoder_input = val_batchX[:, 1:]
          decoder_input = val_batchY[:, :-1]
          decoder_output = val_batchY[:, 1:]
          # Generate a prediction
          prediction,_,_ = transformer_model(encoder_input, decoder_input, training=False)
          # Compute the validation loss
          loss = loss_function(decoder_output, prediction)
          val_loss(loss)

        # Print epoch number and accuracy and loss values at the end of every epoch
        print(f"Epoch {epoch+1}: Training Loss {train_loss.result():.4f}, "
        + f"Training Accuracy {train_accuracy.result():.4f}, "
        + f"Validation Loss {val_loss.result():.4f}")
        # Save a checkpoint after every epoch
        if (epoch + 1) % 1 == 0:
          save_path = ckpt_manager.save()
          print(f"Saved checkpoint at epoch {epoch+1}")
          # Save the trained model weights
          transformer_model.save_weights("/Weights_" + str(epoch + 1) + ".ckpt")
          train_loss_dict[epoch] = train_loss.result()
          val_loss_dict[epoch] = val_loss.result()
    with open('./train_loss.pkl', 'wb') as file:
      dump(train_loss_dict, file)
    # Save the validation loss values
    with open('./val_loss.pkl', 'wb') as file:
      dump(val_loss_dict, file)
    print("Total time taken: %.2fs" % (time() - start_time))

In [None]:
transformer_model.compile()
transformer_model.fit(train_dataset, validation_dataset)

In [73]:
for i, predicted_tf in enumerate(translation[:10]):
  print('-------------')
  hindi_pred = ' '.join([output_text_vectorizer.get_vocabulary()[np.argmax(pred)] for pred in predicted_tf.numpy()])
  print(f'English:{input_texts[i]}')
  print(f'Hindi:{target_texts[i]}')
  print(f'Predicted Hindi translation:{hindi_pred}')

-------------
English:<sos> wow! <eos>
Hindi:<sos> वाह! <eos>
Predicted Hindi translation:गर्मी खुली धूम्रपान डाँट खाती खोल खोल लगाऊँगा हूँ खाए तीस खाती खाती घर मीटिंग खाए लाती खाती खाती खाती मान धूम्रपान लाती तीस खाती खाती चल खुली इस खाती
-------------
English:<sos> duck! <eos>
Hindi:<sos> झुको! <eos>
Predicted Hindi translation:गर्मी जैसे फटाफट लाती खाती गर्मी खाती गर्मी बस्ता। वहां तीस खाती खाती वहां मीटिंग जीत खाती खाती खाती गर्मी अभिनेता खाए तीस खाती खाती खाती लगाऊँगा व्यवहार तीस खाती
-------------
English:<sos> duck! <eos>
Hindi:<sos> बतख़! <eos>
Predicted Hindi translation:खाती बस्ता। तीस लाती खाती खोल खोल लगाऊँगा मिल लाती खाती खाती खाती खाती हूँ तीस तीस खाती खाती खोल पढ़ूँ। डाँट तीस तीस खाती खाती लगाऊँगा बदतमीज़ इन्तज़ार खाती
-------------
English:<sos> help! <eos>
Hindi:<sos> बचाओ! <eos>
Predicted Hindi translation:मीटिंग मीटिंग धूम्रपान बदतमीज़ रात सदस्य घर अभिनेता डाँट खाए इन्तज़ार खाती गर्मी गर्मी मान खाए तीस खाती खाती खाती मान खुली इन्तज़ार खाती खाती खाती नाई जीत इन्तज़ार 

<a name='7'></a> 
## 6 - References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.


Francois Chollet. "A ten-minute introduction to sequence-to-sequence learning in Keras". Keras Blog, 14 September 2016, https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html.




Jalammar, J. (2018, August 24). The Illustrated Transformer. Retrieved from http://jalammar.github.io/illustrated-transformer/





# Comparing with LSTM and RNN

- The main idea behind Transformers is to replace recursive approach in RNNs by self-attnetion mechanism.

- Self-attention mechanism allows the model to encode the input sequence by including the context by attending to different parts of the input sequence (via Query, Key, and Value tensors)

- For example, if we want to translate the sentence "I am a student" to French, we would feed the input sequence one word at a time and update the hidden state at each time step. The final hidden state would be used to generate the output "je suis étudiant."

- In transformers, we would compute self-attention scores between all the words in the input sequence, and use these scores to compute a weighted sum of the input sequence to obtain a representation for each output element. This allows the model to consider all the words in the input sequence at once, and to capture long-range dependencies between them.


Advantages of Transformers:
- Parellel processing of the inputs; faster training
- Does not suffer from vanishing gradient or exploding gradient problem; makes it easy to train sequences