# Translation using Attention Mechanism

In [3]:
!pip install spacy --quiet


In [5]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl --quiet
!pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.5.0/fr_core_news_sm-3.5.0-py3-none-any.whl --quiet
 

Here we import the necessary libraries and modules. 
- torch is the main PyTorch library
- nn is the PyTorch module for building neural networks
- F is torch.nn.functional, which provides stateless operations such as softmax
- get_tokenizer is a function from torchtext that we'll use to tokenize (split) our input sentences into individual tokens.

In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.data.utils import get_tokenizer

Here, we define tokenizers for English and French using the SpaCy tokenizers. The get_tokenizer function takes two arguments: the tokenizer type ('spacy') and the language model to use ('en_core_web_sm' for English and 'fr_core_news_sm' for French).


In [7]:
# Define tokenizers for English and French
en_tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
fr_tokenizer = get_tokenizer('spacy', language='fr_core_news_sm')



We define vocabularies for English and French. These are dictionaries that map words to unique numerical indices. The <pad> token is a special token used for padding sequences to a fixed length.

In [8]:
# Define vocabularies
en_vocab = {'<pad>': 0, 'i': 1, 'want': 2, 'to': 3, 'learn': 4, 'about': 5, 'transformers': 6}
fr_vocab = {'<pad>': 0, 'je': 1, 'veux': 2, 'apprendre': 3, 'sur': 4, 'les': 5, 'transformers': 6}


This function converts a sentence into a PyTorch tensor suitable for input to the model. It takes three arguments: the sentence, the vocabulary, and the tokenizer. 

Here's what it does:

1. It tokenizes the sentence using the provided tokenizer, converting it to lowercase.
2. For each token in the tokenized sentence, it looks up its index in the vocabulary. If the token is not found, it uses the index of the <pad> token.
3. It converts the list of indices into a PyTorch tensor with data type torch.long.

In [16]:
# Helper function to convert sentence to tensor
def sentence_to_tensor(sentence, vocab, tokenizer):
    tokens = tokenizer(sentence.lower())
    indices = [vocab.get(token, vocab['<pad>']) for token in tokens]
    tensor = torch.tensor(indices, dtype=torch.long)
    return tensor
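
To make the lookup step concrete, here is a minimal sketch that runs without SpaCy or PyTorch. It assumes a plain whitespace split in place of the SpaCy tokenizer, and sentence_to_indices is a hypothetical helper that stops before the tensor conversion. It shows how a word missing from the vocabulary falls back to the <pad> index.

```python
# Toy vocabulary copied from the cell above.
en_vocab = {'<pad>': 0, 'i': 1, 'want': 2, 'to': 3, 'learn': 4, 'about': 5, 'transformers': 6}

def sentence_to_indices(sentence, vocab, tokenizer=str.split):
    """Hypothetical helper: the same lookup as sentence_to_tensor, without the tensor step.

    Assumption: a whitespace split stands in for the SpaCy tokenizer.
    """
    tokens = tokenizer(sentence.lower())
    # Unknown tokens fall back to the <pad> index, as in the original helper.
    return [vocab.get(token, vocab['<pad>']) for token in tokens]

known = sentence_to_indices("I want to learn about transformers", en_vocab)
unknown = sentence_to_indices("I want to learn about attention", en_vocab)

print(known)    # [1, 2, 3, 4, 5, 6]
print(unknown)  # [1, 2, 3, 4, 5, 0]  ("attention" is not in the vocabulary)
```

Mapping unknown words to <pad> keeps the example short, but it also means the model cannot tell an unknown word from padding.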

We define the dimensionality of the word embeddings (4 in this example) and create embedding layers for English and French. 

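The nn.Embedding constructor is called here with three arguments:
1. The size of the vocabulary (the number of rows in the embedding table)
2. The dimensionality of the embeddings
3. The index of the padding token, passed as padding_idx; that row is initialized to zeros and is not updated during training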

In [10]:
# Define embedding layers
embedding_dim = 4
en_embedding = nn.Embedding(len(en_vocab), embedding_dim, padding_idx=0)
fr_embedding = nn.Embedding(len(fr_vocab), embedding_dim, padding_idx=0)
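
As a quick check of what an embedding layer returns, the sketch below looks up three indices in a freshly created table. The sizes match the toy vocabulary above; the seed and the variable names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed so repeated runs give the same values

# Same sizes as the toy setup: 7 vocabulary entries, 4-dimensional vectors.
embedding = nn.Embedding(7, 4, padding_idx=0)

# Two real tokens followed by the <pad> index.
indices = torch.tensor([1, 2, 0], dtype=torch.long)
vectors = embedding(indices)

print(vectors.shape)  # one 4-dimensional row per input index
print(vectors[2])     # the row for <pad> is all zeros
```

Each index selects one row of the table, so a sequence of n tokens becomes an n x embedding_dim tensor.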

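The ScaledDotProductAttention module implements scaled dot-product attention. It takes the dot product of every query with every key, divides the scores by the square root of d_k, optionally sets the scores at positions where mask is 0 to a large negative value so that they receive almost no weight, applies a softmax over the keys, and returns both the weighted sum of the values and the attention weights.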
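
In formula form, the computation in the module's forward method can be written as follows. The shape symbols n_q, n_k and d_v are notation introduced here for illustration: Q has shape n_q x d_k, K has shape n_k x d_k, and V has shape n_k x d_v.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

The softmax is taken over each row of the n_q x n_k score matrix, so every query ends up with a set of non-negative weights over the keys that sum to 1.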

In [20]:
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super(ScaledDotProductAttention, self).__init__()
        self.d_k = d_k

    def forward(self, q, k, v, mask=None):
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_k ** 0.5
        
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        
        attn_weights = F.softmax(attn_scores, dim=-1)
        attended_output = torch.matmul(attn_weights, v)
        
        return attended_output, attn_weights
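
The effect of the mask is easiest to see on a small input. The sketch below repeats the same computation as a standalone function (scaled_dot_product_attention is a hypothetical name, and the random inputs and seed are assumptions) and masks out the third key for both queries.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Standalone version of the forward pass above, for illustration only."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Positions where mask == 0 get a very negative score, so softmax gives them ~0 weight.
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights

torch.manual_seed(0)  # assumption: fixed seed for repeatability
q = torch.randn(2, 4)  # 2 queries
k = torch.randn(3, 4)  # 3 keys
v = torch.randn(3, 4)  # 3 values

# Hide the third key from both queries.
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 0]])

output, weights = scaled_dot_product_attention(q, k, v, mask)
print(weights)       # the last column is (numerically) zero
print(output.shape)  # one output row per query
```

Each row of weights still sums to 1; the mask only redistributes the weight among the keys that remain visible.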


The Encoder module takes an input sequence x and applies the attention mechanism to it. Here's what it does:

1. It passes the input sequence x through the embedding layer to get word embeddings
2. It applies the scaled dot-product attention mechanism to the embeddings, using the embeddings as queries, keys, and values (self-attention)
3. It returns the attended output from the attention mechanism

In [21]:
class Encoder(nn.Module):
    def __init__(self, embedding, d_k):
        super(Encoder, self).__init__()
        self.embedding = embedding
        self.attention = ScaledDotProductAttention(d_k)

    def forward(self, x):
        embeddings = self.embedding(x)
        attended_output, _ = self.attention(embeddings, embeddings, embeddings)
        return attended_output

The Decoder module takes an input sequence x and the output of the encoder. 

Here's what it does:

1. It passes the input sequence x through the embedding layer to get word embeddings
2. It applies the scaled dot-product attention mechanism to the embeddings, using the embeddings as queries, and the encoder output as keys and values (cross-attention)
3. It returns the attended output from the attention mechanism

In [22]:
class Decoder(nn.Module):
    def __init__(self, embedding, d_k):
        super(Decoder, self).__init__()
        self.embedding = embedding
        self.attention = ScaledDotProductAttention(d_k)

    def forward(self, x, encoder_output):
        embeddings = self.embedding(x)
        attended_output, _ = self.attention(embeddings, encoder_output, encoder_output)
        return attended_output
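
To see how the shapes flow through the encoder and decoder steps, here is a minimal sketch that uses plain functions instead of the modules above. The attend helper, the two embedding tables, the token ids, and the seed are all assumptions made for illustration. The source has six tokens and the target three, so the two sequence lengths are easy to tell apart.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed for repeatability

embedding_dim = 4
src_embedding = nn.Embedding(7, embedding_dim, padding_idx=0)  # hypothetical source table
tgt_embedding = nn.Embedding(7, embedding_dim, padding_idx=0)  # hypothetical target table

src_ids = torch.tensor([1, 2, 3, 4, 5, 6])  # six source tokens
tgt_ids = torch.tensor([1, 2, 3])           # three target tokens

def attend(q, k, v):
    """Unmasked scaled dot-product attention (hypothetical helper)."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights

# Encoder step: self-attention over the source embeddings.
src = src_embedding(src_ids)
encoder_output, _ = attend(src, src, src)

# Decoder step: target embeddings are the queries, encoder output supplies keys and values.
tgt = tgt_embedding(tgt_ids)
decoder_output, cross_weights = attend(tgt, encoder_output, encoder_output)

print(encoder_output.shape)  # torch.Size([6, 4])
print(decoder_output.shape)  # torch.Size([3, 4])
print(cross_weights.shape)   # torch.Size([3, 6])
```

The decoder output has one row per target token, and each row is a weighted average of the encoder output rows, with the weights given by the matching row of cross_weights.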

The Transformer module combines the Encoder and Decoder modules. 

Here's what it does:

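1. In the constructor, it creates instances of the Encoder and Decoder modules, passing in the embedding layers defined earlier (en_embedding and fr_embedding) and the dimensionality of the attention mechanism. The en_vocab_size and fr_vocab_size arguments are accepted but not used, because the embedding layers already exist at module level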
2. In the forward method, it takes an English sentence en_sentence and a French sentence fr_sentence as input
3. It converts the input sentences to tensors using the sentence_to_tensor function and the appropriate vocabularies and tokenizers
4. It passes the English sentence tensor through the Encoder module to get the encoder output
5. It passes the French sentence tensor and the encoder output through the Decoder module to get the decoder output
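6. It returns the decoder output, which has one row per French token and embedding_dim columns. Each row is a weighted average of the encoder outputs, so with untrained, randomly initialized embeddings the values do not yet represent a translation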



In [23]:
class Transformer(nn.Module):
    def __init__(self, en_vocab_size, fr_vocab_size, d_k):
        super(Transformer, self).__init__()
        self.encoder = Encoder(en_embedding, d_k)
        self.decoder = Decoder(fr_embedding, d_k)

    def forward(self, en_sentence, fr_sentence):
        en_tensor = sentence_to_tensor(en_sentence, en_vocab, en_tokenizer)
        fr_tensor = sentence_to_tensor(fr_sentence, fr_vocab, fr_tokenizer)

        encoder_output = self.encoder(en_tensor)
        decoder_output = self.decoder(fr_tensor, encoder_output)

        return decoder_output

In the example below, we create an instance of the Transformer model, passing in the sizes of the English and French vocabularies and the dimensionality of the attention mechanism. We define an English sentence and its French translation, pass them through the model, and print the output: a tensor with one row per French token and embedding_dim columns. Because the embeddings are randomly initialized and no training takes place, the exact values will differ between runs.


##### Note that this is a very simplified example. In practice, you would need to add further components (such as a linear layer to generate the output tokens), handle out-of-vocabulary words, and train the model on a large dataset of English-French sentence pairs to learn the translation task effectively.

In [17]:
# Example usage
model = Transformer(len(en_vocab), len(fr_vocab), d_k=embedding_dim)

en_sentence = "I want to learn about transformers"
fr_sentence = "je veux apprendre sur les transformers"

output = model(en_sentence, fr_sentence)
print(output)

tensor([[-0.2523, -0.6888,  0.5728,  0.1789],
        [ 0.1916, -0.8131,  0.5783,  0.2957],
        [ 0.6950, -0.9435,  0.5721,  0.3822],
        [ 0.6151, -0.9133,  0.5673,  0.3577],
        [ 0.8427, -0.9631,  0.5620,  0.4115],
        [ 0.4276, -0.8694,  0.5754,  0.3498]], grad_fn=<MmBackward0>)
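
The note before the example mentions a linear layer for generating output tokens. A minimal sketch of that step might look like the following. The output_projection layer, the random stand-in for the decoder output, and the seed are all assumptions; since nothing is trained, the predicted tokens are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed for repeatability

fr_vocab = {'<pad>': 0, 'je': 1, 'veux': 2, 'apprendre': 3, 'sur': 4, 'les': 5, 'transformers': 6}
embedding_dim = 4

# Stand-in for the model output: six positions, embedding_dim features each.
decoder_output = torch.randn(6, embedding_dim)

# Hypothetical output layer: maps each position to one score per French vocabulary entry.
output_projection = nn.Linear(embedding_dim, len(fr_vocab))
logits = output_projection(decoder_output)

# Greedy choice: pick the highest-scoring vocabulary entry at each position.
predicted_ids = logits.argmax(dim=-1)
id_to_token = {idx: token for token, idx in fr_vocab.items()}
predicted_tokens = [id_to_token[i] for i in predicted_ids.tolist()]

print(logits.shape)       # torch.Size([6, 7])
print(predicted_tokens)   # arbitrary tokens, because the layer is untrained
```

During training, these logits would typically be compared with the reference French token ids using a cross-entropy loss.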
