$\Huge AS4501$

Transformers and Attention

Francisco Förster

Bibliography:

* [Attention is all you need, Vaswani et al. 2017](https://arxiv.org/pdf/1706.03762.pdf)
* https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html (many figures from this great website)
* https://towardsdatascience.com/attention-and-transformer-models-fe667f958378

# Motivation

Recurrent neural networks have two big problems:

1. They tend to give too much weight to recent elements in a sequence, but sometimes the most important connections in a sentence are separated by a large number of elements.

2. They are intrinsically serial in nature. We need to process a sequence in order to compute the output of a RNN.

This is how a RNN processes a sentence, paying more attention to the last word at each step and requiring a serial processing:

![](images/sentence-classification-rnn.png)

But in many cases the last word is not the most important, and we would like to be able to process each word and its association with other words in parallel:

![](images/sentence-example-attention.png)

This also happens in the problem of translation:

![](images/sentence.png)

# Softmax

Let's remember the softmax function applied to a vector x:

$\Large {\rm softmax(x_i)} = \frac{\exp{x_i}}{\sum\limits_j \exp{x_j}}$ 

This function returns ~1 at the largest value of the vector and ~0 elsewhere.

![](images/softmax.png)

# Attention mechanism

The attention mechanism is an approach in deep learning that allows models to focus on different parts of the input when producing the output. Instead of focusing in some hidden state like in RNNs, in attention each output explicitly depends on all previous input states, weighted by attention scores.

For example in this sentence with the following attention scores:

 I love travelling
   
   [0.1,  0.2,  0.7] ---> J'adore
  
  [0.5,  0.5,  0.0] ---> voyager

'J'adore' pays more attention or has more affinity to 'travelling' as the next word when translating.

'voyager' pays attention to 'I' and 'love' equally when translating.

# Self-attention

Self Attention, also known as intra Attention, is an attention mechanism that relates different positions of one sequence in order to compute a representation of the same sequence. 

![](images/intraattention.png)

In a self-attention layer, an input matrix $X$ ($n$ tokens of dimension $d$) are turned it into an output matrix $Z$ ($n$ components of dimension $d_v$) via three representational matrices of the input:

* queries Q
* keys K
* values V

$\Large {\rm Attention}(Q, K, V) = {\rm softmax}( Q \cdot K^T / \sqrt{d_k}) * V$

where $Q$, $K$ and $V$ are matrices representing linear transformations from the input vector $x$ via learnable parameters $W^Q$, $W^K$ and $W^V$:

* $Q = X W^Q$
* $K = X W^K$
* $V = X W^V$

Note that 
* $x \in \mathbb{R}^{n \times d}$
* $Q \in \mathbb{R}^{n \times d_k}$
* $K \in \mathbb{R}^{n \times d_k}$
* $V \in \mathbb{R}^{n \times d_v}$
* $W^Q \in \mathbb{R}^{d \times d_k}$
* $W^K \in \mathbb{R}^{d \times d_k}$
* $W^V \in \mathbb{R}^{d_v \times d}$

![](images/attention_detail.png)

![](images/selfattention_summary.png)

# Cross-attention

One can generalize the previous computation for combining two input matrices $X_1$ and $X_2$:

![](images/cross-attention-summary.png)

And this is an example of a cross attention matrix:

![](images/bahdanau-fig3.png)

and a visualization of one row

![](images/attention.png)

# Multi-head attention

In multi-head attention we concatenate the output from several heads $i$ with learnable parameters $W_i^Q$, $W_i^K$ and $W_i^V$, and then linearly transform this vector with learnable parameters $W^O$:

$\Large {\rm Multihead} = {\rm concat}({\rm head}_1, ... {\rm head}_h) W^O$

![](images/multi-head.png)

# Positional encodings

One problem with the previous strategy is that the order of the input is never used to compute the attention scores. In order to fix this problem, information about the relative positions of the inputs must be added. In the original paper by Vaswani they use sine and cosine functions of different frequencies:

* $PE(pos, 2i) = sin(pos / 10000^{2i/d})$
* $PE(pos, 2i) = cos(pos / 10000^{2i/d})$

![](images/PE.png)

In other works, a set of functions are learned as the positional encoder. For example, in [Pimentel+2023](https://arxiv.org/pdf/2201.08482.pdf) they use the following function (timeFiLM):

![](images/timefilm.png)
![](images/timefilm2.png)

# Transformers

The full transformer arquitecture proposed by Vaswani et al. 2017 is the following:

![](images/transformer.png)

The model is composed of an encoder and a decoder. 

The encoder is composed of 6 identical layers, each one with two sublayers: a multi-head self-attention mechanism and a position wise fully connected feed-forward network. The output of each sublayer uses a residual connection (we add the input to the output of the sublayer), which helps with convergence, and is normalized using layer normalization.

The decoder is also composed of 6 identical layers. In addition to the two sublayers used in the encoder, a sublayer is added in between that uses multihead cross attention with the output of the encoder. The multihead self-attention is also modified to mask positions that have not been visited by the decoder (predictions for position i can depend only on the known outputs of positions less than i).



# Examples

In [32]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import Dataset, DataLoader
from collections import Counter
import re

# Load IMDb dataset
df = pd.read_csv("IMDB.csv")
df

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1
...,...,...
39995,"""Western Union"" is something of a forgotten cl...",1
39996,This movie is an incredible piece of work. It ...,1
39997,My wife and I watched this movie because we pl...,0
39998,"When I first watched Flatliners, I was amazed....",1


In [33]:
# Split data into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [34]:
# Basic text preprocessing and tokenization
def tokenize(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text.split()

# Build vocabulary from the training data
def build_vocab(texts, min_freq=2):
    counter = Counter()
    for text in texts:
        counter.update(tokenize(text))
    return {word: idx+1 for idx, (word, count) in enumerate(counter.items()) if count >= min_freq}

vocab = build_vocab(train_df['text'].values)
vocab_size = len(vocab) + 1  # +1 for padding token

def text_to_indices(text, vocab, max_len=100):
    tokens = tokenize(text)
    indices = [vocab.get(token, 0) for token in tokens][:max_len]
    indices = [min(idx, len(vocab)) for idx in indices]  # Ensure all indices are within vocabulary size
    indices += [0] * (max_len - len(indices))  # Pad sequences shorter than max_len
    return indices

In [35]:
# Prepare data with indices for train and test sets
train_df['indices'] = train_df['text'].apply(lambda x: text_to_indices(x, vocab))
test_df['indices'] = test_df['text'].apply(lambda x: text_to_indices(x, vocab))
train_df['label'] = train_df['label'].apply(torch.tensor)
test_df['label'] = test_df['label'].apply(torch.tensor)

In [36]:
train_df

Unnamed: 0,text,label,indices
14307,I watched it last night and again this morning...,tensor(1),"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 13,..."
17812,"although i liked this Western,i do have to say...",tensor(1),"[63, 1, 13, 8, 88, 46, 74, 24, 89, 47, 90, 45,..."
11020,I sat down to watch a documentary about Puerto...,tensor(0),"[1, 147, 148, 24, 149, 66, 150, 17, 151, 152, ..."
15158,"This was probably intended as an ""arty"" crime ...",tensor(0),"[8, 21, 287, 288, 72, 204, 289, 290, 291, 84, ..."
24990,The summary provided by my cable TV guide made...,tensor(0),"[20, 318, 319, 117, 31, 320, 321, 322, 323, 3,..."
...,...,...,...
6265,This movie is one of the worst movie i have ev...,tensor(0),"[8, 18, 15, 90, 45, 20, 331, 18, 1, 74, 334, 1..."
11284,This movie is inspiring to anyone who is or ha...,tensor(1),"[8, 18, 15, 2918, 24, 1503, 40, 15, 167, 490, ..."
38158,"""East Side Story"" is a documentary of musical ...",tensor(1),"[9339, 541, 87, 15, 66, 150, 45, 1083, 1072, 5..."
860,And a self-admitted one to boot. At one point ...,tensor(0),"[6, 66, 0, 90, 24, 4822, 313, 90, 2721, 20, 12..."


In [37]:
# Custom Dataset class for DataLoader
class IMDBDataset(Dataset):
    def __init__(self, df):
        self.reviews = df['indices'].values
        self.labels = df['label'].values

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        return torch.tensor(self.reviews[idx]), self.labels[idx]

# Create DataLoader
train_dataset = IMDBDataset(train_df)
test_dataset = IMDBDataset(test_df)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [38]:
import torch.nn as nn
import torch

# Define the TransformerEncoderLayer class
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dim_feedforward):
        super(TransformerEncoderLayer, self).__init__()
        self.multi_head_attn = nn.MultiheadAttention(d_model, num_heads)
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Linear(dim_feedforward, d_model)
        )
        self.layer_norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Apply multi-head attention and add residual connection
        attn_output, _ = self.multi_head_attn(x, x, x)
        x = self.layer_norm1(x + attn_output)

        # Apply feed-forward network and add residual connection
        ff_output = self.feed_forward(x)
        x = self.layer_norm2(x + ff_output)
        
        return x

# Define the SimpleTransformerClassifier model using the TransformerEncoderLayer
class SimpleTransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, dim_feedforward, num_layers, num_classes):
        super(SimpleTransformerClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.encoder_layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, dim_feedforward) for _ in range(num_layers)
        ])
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        x = self.embedding(x).permute(1, 0, 2)  # Embed and transpose for transformer layer
        for layer in self.encoder_layers:
            x = layer(x)
        # Pooling and final classification layer
        x = x.permute(1, 2, 0)  # Reshape to (batch_size, d_model, seq_length) for pooling
        x = self.pool(x).squeeze(-1)  # Global average pooling
        x = self.fc(x)
        return x

# Model parameters
d_model = 64  # Reduced dimension for faster training
num_heads = 2
dim_feedforward = 128
num_layers = 1
num_classes = 2

# Instantiate the model
vocab_size = len(vocab) + 1  # from previous vocab creation
model = SimpleTransformerClassifier(vocab_size, d_model, num_heads, dim_feedforward, num_layers, num_classes)

In [39]:
import torch.optim as optim

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 5

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    correct = 0
    total = 0
    
    for texts, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    
    accuracy = correct / total
    avg_loss = epoch_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}")


Epoch [1/5], Loss: 0.5220, Accuracy: 0.7304
Epoch [2/5], Loss: 0.3699, Accuracy: 0.8357
Epoch [3/5], Loss: 0.2884, Accuracy: 0.8807
Epoch [4/5], Loss: 0.2190, Accuracy: 0.9163
Epoch [5/5], Loss: 0.1574, Accuracy: 0.9430


In [40]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for texts, labels in test_loader:
        outputs = model(texts)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

test_accuracy = correct / total
print(f"Test Accuracy: {test_accuracy:.4f}")


Test Accuracy: 0.8204


## Fine tuning [BERT](https://arxiv.org/pdf/1810.04805.pdf)

![](images/bert.png)

In [41]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset, random_split
import numpy as np

In [42]:
# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

  return torch.load(checkpoint_file, map_location="cpu")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassif

In [43]:
# Example data: a list of sentences and their corresponding labels
sentences = [
    "I love programming.", "The weather is great today!", "I'm feeling sad.",
    "It's a beautiful day.", "I hate traffic.", "Coding is fun.", "I enjoy sunny days.",
    "It's raining cats and dogs.", "I am very excited.", "I feel disappointed."
]

labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # 1 = positive, 0 = negative

In [44]:
# Tokenize the sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

In [45]:
# Create a TensorDataset and DataLoader
dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], torch.tensor(labels))
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=16)

In [46]:
# Define the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

In [47]:
# Training loop with validation
best_val_loss = float('inf')
best_model_state = None

num_epochs = 10

for epoch in range(num_epochs):  # Training for more epochs
    model.train()
    total_train_loss = 0
    for batch in train_dataloader:
        input_ids, attention_mask, labels = batch

        # Forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()
    
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Validation
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids, attention_mask, labels = batch
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(val_dataloader)

    print(f"Epoch: {epoch + 1}, Train Loss: {avg_train_loss:.4f}, Validation Loss: {avg_val_loss:.4f}")

    # Save the best model
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        best_model_state = model.state_dict()

# Load the best model
model.load_state_dict(best_model_state)

Epoch: 1, Train Loss: 0.7311, Validation Loss: 0.6165
Epoch: 2, Train Loss: 0.6910, Validation Loss: 0.6253
Epoch: 3, Train Loss: 0.6891, Validation Loss: 0.6228
Epoch: 4, Train Loss: 0.6692, Validation Loss: 0.6177
Epoch: 5, Train Loss: 0.6328, Validation Loss: 0.6045
Epoch: 6, Train Loss: 0.5852, Validation Loss: 0.5947
Epoch: 7, Train Loss: 0.5514, Validation Loss: 0.5858
Epoch: 8, Train Loss: 0.5903, Validation Loss: 0.5769
Epoch: 9, Train Loss: 0.5682, Validation Loss: 0.5694
Epoch: 10, Train Loss: 0.5748, Validation Loss: 0.5605


<All keys matched successfully>

In [51]:
# Inference
model.eval()
with torch.no_grad():
    inputs = tokenizer(["I love sunny days."], padding=True, truncation=True, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    print(f"Predicted label: {predictions.item()}")


Predicted label: 1


## Vision transformers

This is based on the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
](https://arxiv.org/abs/2010.11929)

![](images/vit.png)

See https://github.com/huggingface/notebooks/blob/main/examples/image_classification.ipynb
    

In [52]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms, datasets
from transformers import ViTForImageClassification, ViTFeatureExtractor
import matplotlib.pyplot as plt
import numpy as np

In [53]:
# Define the transformations for the training and validation sets
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

In [54]:
# Load the CIFAR-10 dataset
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
val_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# Create DataLoader objects for the training and validation sets
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

Files already downloaded and verified
Files already downloaded and verified


In [55]:
# Load the pre-trained ViT model for image classification
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224', num_labels=10, ignore_mismatched_sizes=True)

  return torch.load(checkpoint_file, map_location="cpu")
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224 and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([1000, 768]) in the checkpoint and torch.Size([10, 768]) in the model instantiated
- classifier.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([10]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [56]:
# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=5e-5)

In [57]:
# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_features=7

In [None]:
# Training loop
num_epochs = 3

for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    
    ibatch = 0
    for inputs, labels in train_loader:
        print(ibatch, end='\r')
        ibatch += 1
        inputs, labels = inputs.to(device), labels.to(device)
        
        # Zero the parameter gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(inputs).logits
        loss = criterion(outputs, labels)
        
        # Backward pass and optimize
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader)}")

    # Validation loop
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs).logits
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    print(f"Validation Accuracy: {100 * correct / total}%")

0

In [None]:
# Save the fine-tuned model
model.save_pretrained('./fine-tuned-vit')

# Helper function to display images along with their predicted labels
def imshow(img, title):
    img = img / 2 + 0.5  # unnormalize
    npimg = img.numpy()
    plt.figure(figsize=(10, 10))
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.title(title)
    plt.show()

# Class labels for CIFAR-10
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Get some random validation images
dataiter = iter(val_loader)
images, labels = dataiter.next()

# Move the images to the GPU if available
images = images.to(device)

# Make predictions
model.eval()
with torch.no_grad():
    outputs = model(images).logits
    _, predicted = torch.max(outputs, 1)

# Convert images and predictions back to CPU for visualization
images = images.cpu()
predicted = predicted.cpu()

# Show the images along with their predicted labels
for i in range(4):  # Display 4 examples
    imshow(images[i], f'Predicted: {classes[predicted[i]]} | True: {classes[labels[i]]}')