In [4]:
!pip install torch torchtext --quiet

# Attention Mechanism

Here, we import PyTorch (a popular machine learning library), two of its submodules (nn for building neural networks and nn.functional for common functions), and the get_tokenizer function from the torchtext library. We will use get_tokenizer to split the input sentence into individual words, or tokens.

In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [14]:

# Define tokenizer
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

Next, we define a function that converts a sentence into a tensor (a data structure that PyTorch can work with). The function does the following:

1. It takes a sentence and a vocabulary (a dictionary that maps words to numbers) as input.
2. It converts the sentence to lowercase and splits it into individual words (tokens) using the tokenizer function.
3. For each token, it looks up its corresponding number in the vocabulary and adds it to a list called indices.
4. It converts the list of numbers (indices) into a PyTorch tensor, which is the data structure we'll use for our computations.

In [15]:
# Helper function to convert sentence to tensor
def sentence_to_tensor(sentence, vocab):
    tokens = tokenizer(sentence.lower())
    indices = [vocab[token] for token in tokens]
    tensor = torch.tensor(indices, dtype=torch.long)
    return tensor

Then, we define the vocabulary. It is a dictionary that maps each word to a unique number. We'll use this to convert our sentences into tensors of numbers.

In [16]:
# Define vocabulary
vocab = {'<pad>': 0, 'transformer': 1, 'architecture': 2, 'is': 3, 'amazing': 4, 'to': 5, 'learn': 6}
vocab_size = len(vocab)
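
As a quick check of the word-to-number mapping, here is a minimal sketch in plain Python. It assumes that whitespace splitting is a good enough stand-in for the torchtext tokenizer on this particular sentence, and `sentence_to_indices` is a hypothetical helper introduced only for this illustration.

```python
# Vocabulary copied from the cell above.
vocab = {'<pad>': 0, 'transformer': 1, 'architecture': 2, 'is': 3,
         'amazing': 4, 'to': 5, 'learn': 6}

def sentence_to_indices(sentence, vocab):
    # Assumption: str.split stands in for the torchtext tokenizer here.
    tokens = sentence.lower().split()
    return [vocab[token] for token in tokens]

indices = sentence_to_indices("Transformer architecture is amazing to learn", vocab)
print(indices)  # [1, 2, 3, 4, 5, 6]
```

Note that a word missing from the vocabulary raises a KeyError with this plain dictionary lookup, so every word in the input must be listed in vocab.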

Next, we create an embedding layer. This is a part of the neural network that converts the numbers representing words into vectors of numbers (embeddings). These embeddings are more useful for the network to work with than just the raw numbers.

In [17]:
# Define embedding layer
embedding_dim = 4
embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
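
To see what the embedding layer returns, the following sketch looks up a few token indices. It uses the same sizes as the cell above; the specific indices are chosen only for illustration, and the vector values are random because the layer has not been trained.

```python
import torch
import torch.nn as nn

# Same sizes as the cell above: 7 vocabulary entries, 4-dimensional vectors.
embedding = nn.Embedding(num_embeddings=7, embedding_dim=4, padding_idx=0)

token_ids = torch.tensor([1, 2, 0])   # two words followed by one <pad>
vectors = embedding(token_ids)

print(vectors.shape)   # torch.Size([3, 4])
print(vectors[2])      # the <pad> row, which is initialized to zeros
```

Each index becomes one row of 4 numbers. The row for index 0 is all zeros because of padding_idx=0.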


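This is the core part of the attention mechanism. It takes three inputs:

1. q (queries)
2. k (keys)
3. v (values)

The forward method computes the attention scores as the dot product of each query with each key, divided by the square root of d_k. If a mask is provided, positions where the mask is 0 are set to a large negative score so that they are ignored. A softmax then turns the scores into attention weights (how much the module should attend to each part of the input), and the method returns the weighted sum of the values (the attended output) together with the attention weights.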
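
The steps of the forward method can be traced by hand on a tiny example. This is a sketch in plain Python, not the module itself; the query, key, and value vectors are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vectors: one query, two keys, two values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
d_k = len(query)

# Step 1: dot product of the query with every key, scaled by sqrt(d_k).
scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]

# Step 2: softmax turns the scores into weights that sum to 1.
weights = softmax(scores)

# Step 3: the output is the weighted sum of the value vectors.
output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

print(weights)
print(output)
```

The query matches the first key more closely, so the first weight is larger and the output leans toward the first value vector.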

In the example usage section, we:
- Take an input sentence: "Transformer architecture is amazing to learn".
- Convert it to a tensor using the sentence_to_tensor function and the provided vocabulary.
- Pass the tensor through the embedding layer to get embeddings (vector representations of the words).
- Create an instance of the ScaledDotProductAttention module.
- Compute the attended output and attention weights by passing the embeddings to the attention module as the queries, keys, and values (self-attention).
- Print the shapes of the attended output and attention weights.
- Iterate over the tokens in the input sentence and print the attention weights for each token.

In [20]:
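class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k  # dimension of the key vectors, used for scaling

    def forward(self, q, k, v, mask=None):
        # Similarity of every query with every key, scaled by sqrt(d_k)
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / self.d_k ** 0.5

        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)

        # Normalize the scores over the key dimension so each row sums to 1
        attn_weights = F.softmax(attn_scores, dim=-1)
        # Weighted sum of the value vectors
        attended_output = torch.matmul(attn_weights, v)

        return attended_output, attn_weights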
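
The example in this notebook never passes a mask, so here is a hedged sketch of how one might be used to ignore padding. It rewrites the same steps as a plain function (`masked_attention` is a hypothetical name), uses random numbers as a stand-in for embeddings, and assumes that index 0 marks padding, as in the vocabulary above.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    # Same steps as the forward method above, written as a plain function.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights

# Hypothetical batch: one sequence of 3 real tokens followed by 1 <pad> (index 0).
token_ids = torch.tensor([[1, 2, 3, 0]])
x = torch.randn(1, 4, 4)   # stand-in for embeddings: [batch, seq_len, dim]

# Shape [1, 1, 4]; it broadcasts over the query dimension,
# hiding the <pad> key from every query.
mask = (token_ids != 0).unsqueeze(1)

_, weights = masked_attention(x, x, x, mask=mask)
print(weights[0, :, 3])   # weights placed on the <pad> column
```

Every query gives the padded position a weight of effectively zero, and the remaining weights in each row still sum to 1.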

Now refer to the example. When you run the code, you'll see that the attention mechanism assigns a different weight (importance) to each word in the input sentence. Because the embedding layer is randomly initialized and has not been trained, the exact values change from run to run and do not yet reflect the meaning of the words. In the output shown below, for example, the word "transformer" attends most to itself (0.4172), then to "is" (0.1761) and "architecture" (0.1349).

In [21]:
# Example
input_sentence = "Transformer architecture is amazing to learn"
input_tensor = sentence_to_tensor(input_sentence, vocab)
embeddings = embedding(input_tensor).unsqueeze(0)  # [1, seq_len, embedding_dim]

# Create attention module
attention = ScaledDotProductAttention(d_k=embedding_dim)

# Compute attended output and attention weights
attended_output, attn_weights = attention(embeddings, embeddings, embeddings)
print("Attended Output Shape:", attended_output.shape)
print("Attention Weights Shape:", attn_weights.shape)

# Print attention weights for each word
for i, token in enumerate(tokenizer(input_sentence.lower())):
    print(f"{token}: {attn_weights[0, i, :]}")

Attended Output Shape: torch.Size([1, 6, 4])
Attention Weights Shape: torch.Size([1, 6, 6])
transformer: tensor([0.4172, 0.1349, 0.1761, 0.0802, 0.0654, 0.1263],
       grad_fn=<SliceBackward0>)
architecture: tensor([0.0868, 0.7531, 0.0112, 0.0640, 0.0247, 0.0603],
       grad_fn=<SliceBackward0>)
is: tensor([0.0734, 0.0072, 0.7869, 0.0121, 0.0979, 0.0225],
       grad_fn=<SliceBackward0>)
amazing: tensor([0.0335, 0.0416, 0.0121, 0.5409, 0.0941, 0.2778],
       grad_fn=<SliceBackward0>)
to: tensor([0.0502, 0.0295, 0.1804, 0.1730, 0.4020, 0.1649],
       grad_fn=<SliceBackward0>)
learn: tensor([0.0767, 0.0569, 0.0328, 0.4042, 0.1304, 0.2990],
       grad_fn=<SliceBackward0>)
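
The values above change on every run because the embedding weights are random, and each tensor carries a grad_fn because gradients are being tracked. The sketch below shows one possible way to get repeatable, gradient-free output for inspection. The seed value is arbitrary, and the attention steps are inlined so the block runs on its own without torchtext.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, chosen only for this sketch

embedding = nn.Embedding(7, 4, padding_idx=0)
token_ids = torch.tensor([[1, 2, 3, 4, 5, 6]])   # indices of the example sentence

with torch.no_grad():  # gradients are not needed just to inspect the weights
    x = embedding(token_ids)                                        # [1, 6, 4]
    scores = torch.matmul(x, x.transpose(-2, -1)) / x.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)                             # [1, 6, 6]

print(weights.sum(dim=-1))         # each row sums to 1 (up to rounding)
print(weights[0].argmax(dim=-1))   # most-attended position for each token
```

With a fixed seed the printed weights stay the same across runs, and the tensors print without grad_fn.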
