# Additive Attention from scratch

## Attention Mechanism Demo in PyTorch: Machine Translation Example (Many-to-Many, Encoder-Decoder)

In this demo, we build a small machine translator in PyTorch. It is inspired by the programming assignment "Neural Machine Translation with Attention" from Andrew Ng's deeplearning.ai course on sequence models. The translator converts dates written in several human-readable formats into ISO format (`YYYY-MM-DD`).

In [1]:
%matplotlib inline

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.optim as optim
!pip install lightning
import lightning as L
from lightning import Trainer

import random




## Generate Dataset
We generate a toy dataset using the `datetime` library. The target always comes in a single format (ISO format), while each input comes in one of three different date formats.

In [23]:
# Generate a toy dataset: 15,000 consecutive days counting back from today
import datetime
base = datetime.date.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 15000)]

In [25]:
target_date_list = [date.isoformat() for date in date_list]
print(target_date_list[:5])

['2025-01-20', '2025-01-19', '2025-01-18', '2025-01-17', '2025-01-16']


In [26]:
from random import randint
random.seed(42)
input_date_list = list()
for date in date_list:
    random_num = randint(0, 2)
    if random_num == 0:
        input_date_list.append(date.strftime("%d/%m/%y"))#"11/03/02"
    elif random_num == 1:
        input_date_list.append(date.strftime("%A %d %B %Y")) #"Monday 11 March 2002"
    elif random_num == 2:
        input_date_list.append(date.strftime("%d %B %Y")) #"11 March 2002"

# input: "11/03/02" or "Monday 11 March 2002" or "11 March 2002"
# target: "2002-03-11"

In [30]:
for input_sample, target_sample in zip(input_date_list[0:10],target_date_list[0:10]):
    print(f"input: {input_sample},\t label: {target_sample}")

input: 20 January 2025,	 label: 2025-01-20
input: 19/01/25,	 label: 2025-01-19
input: 18/01/25,	 label: 2025-01-18
input: 17 January 2025,	 label: 2025-01-17
input: Thursday 16 January 2025,	 label: 2025-01-16
input: 15/01/25,	 label: 2025-01-15
input: 14/01/25,	 label: 2025-01-14
input: 13/01/25,	 label: 2025-01-13
input: 12 January 2025,	 label: 2025-01-12
input: 11/01/25,	 label: 2025-01-11
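
As a quick check on the three input formats, the sketch below formats one fixed date (11 March 2002, the date used in the code comments above) with each `strftime` pattern and with `isoformat()`. It assumes the default English locale, since `%A` and `%B` are locale-dependent. `FORMATS` is an illustrative name, not part of the notebook.

```python
import datetime

# The three input patterns used when building input_date_list.
FORMATS = ["%d/%m/%y", "%A %d %B %Y", "%d %B %Y"]  # illustrative name

sample = datetime.date(2002, 3, 11)
inputs = [sample.strftime(fmt) for fmt in FORMATS]
target = sample.isoformat()

# Every input variant maps to the same ISO target string.
for text in inputs:
    print(f"input: {text!r} -> target: {target!r}")
```

Note that `%d` always zero-pads the day, so the generated inputs never contain a single-digit day such as `3 May 1999`.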


In [31]:
set("hello")

{'e', 'h', 'l', 'o'}

In [33]:
# Preprocessing
input_chars = list(set(''.join(input_date_list)))
output_chars = list(set(''.join(target_date_list)))

# +1 for the padding token
data_size, vocab_size = len(input_date_list), len(input_chars) + 1
output_vocab_size = len(output_chars) + 1

print('There are %d lines and %d input tokens (unique characters plus padding) in your input data.' % (data_size, vocab_size))
maxlen = len(max(input_date_list, key=len))  # max input length

There are 15000 lines and 42 input tokens (unique characters plus padding) in your input data.


In [34]:
print("Max input length:", maxlen)

Max input length: 27


In [35]:
sorted_chars= sorted(input_chars)
sorted_output_chars= sorted(output_chars)
sorted_chars.insert(0,"<PAD>") #PADDING for input
sorted_output_chars.insert(0,"<PAD>") #PADDING for output

# Quick implementation of character tokenizer
# create a mapping from characters to integers
input_stoi = { ch:i for i,ch in enumerate(sorted_chars) }
input_itos = { i:ch for i,ch in enumerate(sorted_chars) }
input_encode = lambda s: [input_stoi[c] for c in s] # encoder: take a string, output a list of integers
input_decode = lambda l: ''.join([input_itos[i] for i in l]) # decoder: take a list of integers, output a string


output_stoi = { ch:i for i,ch in enumerate(sorted_output_chars) }
output_itos = { i:ch for i,ch in enumerate(sorted_output_chars) }
output_encode = lambda s: [output_stoi[c] for c in s] # encoder: take a string, output a list of integers
output_decode = lambda l: ''.join([output_itos[i] for i in l]) # decoder: take a list of integers, output a string

print(input_encode("22/12/24"))
print(input_decode(input_encode("22/12/24")))

[5, 5, 2, 4, 5, 2, 5, 7]
22/12/24
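
The same mapping can be bundled into a small class. This is only a sketch: `CharTokenizer`, its `pad_to` argument, and the choice to drop padding on decode are hypothetical conveniences, not part of the notebook, which keeps separate `stoi`/`itos` dictionaries.

```python
class CharTokenizer:
    """Character-level tokenizer with <PAD> reserved at index 0 (hypothetical helper)."""

    PAD = "<PAD>"

    def __init__(self, texts):
        # Sorted unique characters, with the padding token placed first.
        chars = sorted(set("".join(texts)))
        self.itos = [self.PAD] + chars
        self.stoi = {ch: i for i, ch in enumerate(self.itos)}

    def encode(self, text, pad_to=None):
        ids = [self.stoi[ch] for ch in text]
        if pad_to is not None:
            # Right-pad with index 0 up to the requested length.
            ids += [0] * (pad_to - len(ids))
        return ids

    def decode(self, ids):
        # Skip padding positions when rebuilding the string.
        return "".join(self.itos[i] for i in ids if i != 0)


tok = CharTokenizer(["22/12/24", "11 March 2002"])
ids = tok.encode("22/12/24", pad_to=10)
print(ids)
print(tok.decode(ids))
```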


In [36]:
print(input_stoi)
print(output_stoi)

{'<PAD>': 0, ' ': 1, '/': 2, '0': 3, '1': 4, '2': 5, '3': 6, '4': 7, '5': 8, '6': 9, '7': 10, '8': 11, '9': 12, 'A': 13, 'D': 14, 'F': 15, 'J': 16, 'M': 17, 'N': 18, 'O': 19, 'S': 20, 'T': 21, 'W': 22, 'a': 23, 'b': 24, 'c': 25, 'd': 26, 'e': 27, 'g': 28, 'h': 29, 'i': 30, 'l': 31, 'm': 32, 'n': 33, 'o': 34, 'p': 35, 'r': 36, 's': 37, 't': 38, 'u': 39, 'v': 40, 'y': 41}
{'<PAD>': 0, '-': 1, '0': 2, '1': 3, '2': 4, '3': 5, '4': 6, '5': 7, '6': 8, '7': 9, '8': 10, '9': 11}


In [37]:
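m = len(input_date_list)  # number of examples (15000)
Tx = maxlen               # maximum input length
Ty = 10                   # output length: "YYYY-MM-DD" is always 10 characters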

In [11]:
# Encode every string as a tensor of token ids
X = [torch.tensor(input_encode(line)) for line in input_date_list]
Y = [torch.tensor(output_encode(line)) for line in target_date_list]

# Right-pad the inputs with 0 (<PAD>) to a common length; the targets already share length Ty
X = nn.utils.rnn.pad_sequence(X, batch_first=True)

In [12]:
X.shape

torch.Size([15000, 27])
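
`pad_sequence` right-pads with `0` by default, which is why `<PAD>` sits at index 0 in both vocabularies. A minimal sketch with two made-up token sequences:

```python
import torch
from torch import nn

# Two encoded sequences of different lengths (token ids are made up).
seqs = [torch.tensor([5, 5, 2, 4]), torch.tensor([7, 1])]

# padding_value defaults to 0, so shorter sequences are filled with the <PAD> index.
padded = nn.utils.rnn.pad_sequence(seqs, batch_first=True)
print(padded)
print(padded.shape)
```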

In [13]:
class DateDataset(Dataset):
  def __init__(self, X, y):
    self.encoded = X.long()
    self.label = torch.stack(y).long()

  def __getitem__(self, idx):
    return {"x" :self.encoded[idx], "y":self.label[idx]}

  def __len__(self):
    return len(self.encoded)

In [14]:
class DateDataModule(L.LightningDataModule):

  def __init__(self, train_data, y, batch_size, num_workers=0):
      super().__init__()
      self.train_data = train_data
      self.y = y
      self.batch_size = batch_size
      self.num_workers = num_workers


  def setup(self, stage: str):
    pass

  def collate_fn(self, batch):
      one_hot_x = torch.stack([F.one_hot(b["x"], num_classes=len(input_stoi)) for b in batch])
      return {"x": one_hot_x.float(), "y": torch.stack([b["y"] for b in batch])}

  def train_dataloader(self):
      train_dataset = DateDataset(self.train_data, self.y)
      train_loader = DataLoader(train_dataset,
                                batch_size = self.batch_size,
                                shuffle = True,
                                collate_fn = self.collate_fn,
                                num_workers = self.num_workers)

      return train_loader

In [15]:
batch_size = 16
data_module = DateDataModule(X, Y, batch_size=batch_size,num_workers=0)
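
To see what `collate_fn` produces, the sketch below one-hot encodes a small batch of made-up padded ids. The vocabulary size of 42 matches `len(input_stoi)` in this notebook; the ids themselves are arbitrary.

```python
import torch
import torch.nn.functional as F

vocab = 42  # len(input_stoi) in this notebook
batch = torch.tensor([[5, 5, 2, 0], [7, 1, 0, 0]])  # made-up padded ids, shape (2, 4)

# Each id becomes a vector of length `vocab` with a single 1.
one_hot = F.one_hot(batch, num_classes=vocab).float()
print(one_hot.shape)  # (batch, seq_len, vocab)
```

Padding positions are encoded as a one-hot vector on index 0, so the encoder still receives an input at every timestep.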

## Attention Mechanism
![attn_mech](https://raw.githubusercontent.com/ekapolc/nlp_2019/master/HW8/images/attn_mech.png)
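
Written out, the computation performed by `one_step_attention` at decoder step $t$ is roughly the following. The symbols are labels chosen here for illustration: $h_i$ is the encoder output at input position $i$, $s_{t-1}$ the previous decoder state, and $W_1, b_1, w_2, b_2$ the parameters of the two linear layers.

```latex
\begin{aligned}
e_{t,i} &= \mathrm{ReLU}\!\left(w_2^{\top} \tanh\!\left(W_1 [h_i ; s_{t-1}] + b_1\right) + b_2\right) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_x} \exp(e_{t,j})} \\
c_t &= \sum_{i=1}^{T_x} \alpha_{t,i}\, h_i
\end{aligned}
```

Here $[h_i ; s_{t-1}]$ denotes concatenation, $\alpha_{t,i}$ are the attention weights over the $T_x$ input positions, and $c_t$ is the context vector fed to the decoder.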

In [16]:
def one_step_attention(h, s_prev, linear_1, linear_2):
    # h.shape = (batch, seq_len, enc_dim): encoder outputs
    # s_prev.shape = (batch, dec_dim): previous decoder hidden state
    # linear_1 and linear_2 are linear layers owned by the model
    s_prev = s_prev.unsqueeze(1).repeat((1, h.shape[1], 1))  # copy s_prev to every input position
    concat = torch.cat([h, s_prev], dim=-1)  # concat.shape = (batch, seq_len, enc_dim + dec_dim)

    # Attention function: a small MLP scores every input position
    e = torch.tanh(linear_1(concat))
    energies = F.relu(linear_2(e))  # energies.shape = (batch, seq_len, 1)
    # normalise the energies over the input positions (softmax) to get attention weights
    attention_scores = F.softmax(energies, dim=1)
    # context vector: attention-weighted sum of the encoder outputs
    temp = torch.mul(attention_scores, h)
    context = torch.sum(temp, dim=1)  # context.shape = (batch, enc_dim)

    return context
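
The shapes are easier to follow on random tensors. The sketch below repeats the same steps with stand-in layers `fc1` and `fc2` and made-up sizes (64 matches `2 * n_h` and `n_s` in the model below). It checks only shapes and normalisation, not learned behaviour.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, enc_dim, dec_dim = 2, 5, 64, 64

h = torch.randn(batch, seq_len, enc_dim)      # encoder outputs
s_prev = torch.randn(batch, dec_dim)          # previous decoder state
fc1 = torch.nn.Linear(enc_dim + dec_dim, 32)  # stand-ins for the model's layers
fc2 = torch.nn.Linear(32, 1)

# Copy the decoder state to every input position and score each position.
s_rep = s_prev.unsqueeze(1).repeat(1, seq_len, 1)
energies = F.relu(fc2(torch.tanh(fc1(torch.cat([h, s_rep], dim=-1)))))  # (batch, seq_len, 1)
weights = F.softmax(energies, dim=1)          # normalise over input positions
context = (weights * h).sum(dim=1)            # broadcast, then sum: (batch, enc_dim)

print(weights.shape, context.shape)
print(weights.sum(dim=1).squeeze(-1))         # each entry should be 1
```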

## The model
![rnn_model](https://raw.githubusercontent.com/ekapolc/nlp_2019/master/HW8/images/rnn_date.png)

In [17]:
class AttentionModel(L.LightningModule):
    def __init__(self, learning_rate, criterion):

        super().__init__()
        self.n_h = 32 #hidden dimensions for encoder
        self.n_s = 64 #hidden dimensions for decoder

        self.learning_rate = learning_rate
        self.criterion = criterion

        #encoder
        bidirection = True
        self.num_directions = 2 if bidirection else 1
        self.lstm = nn.LSTM(len(input_stoi), self.n_h, bidirectional=bidirection, batch_first=True)
        #decoder
        self.decoder_lstm_cell = nn.LSTMCell(self.n_s, self.n_s)
        self.output_layer = nn.Linear(self.n_s, len(output_stoi))
        #attention
        self.fc1 = nn.Linear(self.n_h*2*self.num_directions, self.n_h)
        self.fc2 = nn.Linear(self.n_h, 1)

    def forward(self, src):
        # src.shape = (batch, Tx, input_vocab); lstm_out.shape = (batch, Tx, n_h * num_directions)
        lstm_out, _ = self.lstm(src)

        # Initial decoder hidden and cell states, created on the same device as the input.
        # They are sampled randomly here, so repeated predictions can differ;
        # torch.zeros is a common deterministic alternative.
        decoder_s = torch.randn(src.shape[0], self.n_s, device=src.device)
        decoder_c = torch.randn(src.shape[0], self.n_s, device=src.device)

        prediction = torch.zeros((src.shape[0], Ty, len(output_stoi)), device=src.device)
        # Iterate for Ty steps (decoding)
        for t in range(Ty):

            # Perform one step of the attention mechanism to calculate the context vector at timestep t
            context = one_step_attention(lstm_out, decoder_s, self.fc1, self.fc2)
            # Feed the context vector to the decoder LSTM cell
            decoder_s, decoder_c = self.decoder_lstm_cell(context, (decoder_s, decoder_c))

            # Project the decoder hidden state to output-vocabulary logits
            # (the softmax is applied inside CrossEntropyLoss during training)
            out = self.output_layer(decoder_s)

            # Store the logits for timestep t
            prediction[:, t] = out
        return prediction

    def training_step(self, batch, batch_idx):
        src = batch['x']
        target = batch['y']
        prediction = self(src)
        prediction = prediction.reshape(-1, len(output_stoi))
        target = target.reshape(-1)
        loss = self.criterion(prediction, target)
        self.log("train_loss", loss)
        return loss

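    def predict_step(self, batch, batch_idx, dataloader_idx=0):
        src = batch['x']
        # Lightning already disables gradients during prediction, so no torch.no_grad() is needed.
        # argmax over the logits gives the same result as argmax over softmax(logits).
        prediction = torch.argmax(self(src), dim=-1)
        for pred in prediction:
            # Decoded strings are printed rather than returned, so trainer.predict returns None per batch
            print(output_decode(pred.tolist()))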

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.learning_rate)

In [18]:
criterion = nn.CrossEntropyLoss()
lr = 0.01
model = AttentionModel(lr, criterion)
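
`nn.CrossEntropyLoss` expects class logits on the second dimension, which is why `training_step` flattens the `(batch, Ty, vocab)` predictions before computing the loss. The sketch below uses random stand-in tensors (sizes match `Ty = 10` and the 12-token output vocabulary) and compares the flattened call with the equivalent `(batch, classes, positions)` layout.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, Ty, vocab = 4, 10, 12

logits = torch.randn(batch, Ty, vocab)         # stand-in for the model output
target = torch.randint(0, vocab, (batch, Ty))  # stand-in for encoded ISO dates

criterion = nn.CrossEntropyLoss()
# Flatten batch and time so every character position is one classification example.
flat_loss = criterion(logits.reshape(-1, vocab), target.reshape(-1))

# Equivalent layout: move the class dimension to position 1 instead of flattening.
alt_loss = criterion(logits.permute(0, 2, 1), target)
print(flat_loss.item(), alt_loss.item())
```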

In [19]:
trainer = Trainer(
    max_epochs=10,
)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


In [20]:
trainer.fit(model, data_module)


  | Name              | Type             | Params | Mode 
---------------------------------------------------------------
0 | criterion         | CrossEntropyLoss | 0      | train
1 | lstm              | LSTM             | 19.5 K | train
2 | decoder_lstm_cell | LSTMCell         | 33.3 K | train
3 | output_layer      | Linear           | 780    | train
4 | fc1               | Linear           | 4.1 K  | train
5 | fc2               | Linear           | 33     | train
---------------------------------------------------------------
57.7 K    Trainable params
0         Non-trainable params
57.7 K    Total params
0.231     Total estimated model params size (MB)
6         Modules in train mode
0         Modules in eval mode

Epoch 9: 100%|██████████| 938/938 [00:50<00:00, 18.60it/s, v_num=1]

`Trainer.fit` stopped: `max_epochs=10` reached.




## Let's do some "translation"

In [21]:
# Note: the training inputs always use zero-padded days (%d), so '3 May 1999' is a form the model never saw
EXAMPLES = ['Monday 15 March 2025', '3 May 1999', '05 October 2009', '30 August 2016', '11 July 2000', 'Saturday 19 May 2018', '3 March 2001', '1 March 2001']
# Encode every example as a tensor of token ids
predict_data = [torch.tensor(input_encode(line)) for line in EXAMPLES]

print(len(predict_data))
def collate_fn(batch):
    one_hot_x = torch.stack([F.one_hot(b["x"], num_classes=len(input_stoi)) for b in batch])
    return {"x": one_hot_x.float()}

predict_data = nn.utils.rnn.pad_sequence(predict_data, batch_first = True)
predict_dataset = DateDataset(predict_data, [torch.tensor(0)]*len(predict_data))
predict_loader = DataLoader(predict_dataset,
                          batch_size = 1,
                          shuffle = False,
                          collate_fn = collate_fn,
                          num_workers = 0)

8


In [22]:
trainer.predict(model, predict_loader)



2025-03-15
1999-05-33
2029-10-05
2016-08-30
2000-07-11
2018-05-19
2001-03-33
2001-03-11




[None, None, None, None, None, None, None, None]
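
The printed predictions can be checked against ground truth by parsing the inputs with `datetime.strptime`. The sketch below covers only the six examples without a weekday and copies their predictions from the output above. The variable names are illustrative, and it assumes an English locale for `%B`.

```python
from datetime import datetime

# Inputs without a weekday, and the predictions printed above for them.
examples = ["3 May 1999", "05 October 2009", "30 August 2016",
            "11 July 2000", "3 March 2001", "1 March 2001"]
predicted = ["1999-05-33", "2029-10-05", "2016-08-30",
             "2000-07-11", "2001-03-33", "2001-03-11"]

# Ground truth: parse each input and convert it to ISO format.
expected = [datetime.strptime(s, "%d %B %Y").date().isoformat() for s in examples]
matches = [p == e for p, e in zip(predicted, expected)]

for s, p, e, ok in zip(examples, predicted, expected, matches):
    print(f"{s!r}: predicted {p}, expected {e}, {'ok' if ok else 'MISMATCH'}")
print(f"exact match: {sum(matches)}/{len(matches)}")
```

Three of the four mismatches are inputs with a single-digit day and no leading zero. A plausible explanation is that the training data never contains this form, since `%d` always zero-pads the day.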