## Machine Learning and Artificial Intelligence 
Summer High School Academic Program for Engineers (2025)
## 7 - Attention and Transformers 

**Author:** <a href="https://www.cs.columbia.edu/~bauer">Daniel Bauer &lt;bauer@cs.columbia.edu&gt;</a>

### Limitations of Basic Neural Language Models 
In the last session, we discussed the next word prediction problem, basic neural language models, and embeddings. Ultimately, we want to be able to classify entire documents—such as determining the sentiment of a movie review.

Basic (fully connected) neural language models we discussed are inadequate for this task: the context they consider is typically much too short to capture the meaning of an entire document. Even if we expand a model’s input size to process an entire sentence or document, it still lacks an understanding of the linguistic structure of the text—such as which words grammatically depend on or modify others, and how words combine to form noun phrases. Moreover, such a model would require a very large number of parameters, as each element of each input embedding would have to be connected to each unit in the hidden layer.

Another approach is to aggregate all input representations, for example, by averaging the embeddings of all words in a sentence. While this technique can work for some classification tasks, it tends to miss much of the nuance in a document’s meaning. One key issue is that not all words contribute equally to the overall meaning of a document.

There is another important limitation: the embeddings are static for each word *type*, so regardless of where a word appears, it always has the same vector representation. But many words are ambiguous. For example, consider the word "play" in the following sentences:

* *The kids love to **play** soccer after school.*
* *She will **play** the role of Juliet in tonight’s performance.*
* *Can you **play** some music on your phone?*

Each instance of play here represents a different **word sense**: (1) engaging in a sport, (2) performing a role in a play, (3) causing a device to produce sound. While it is true that the context in which a word appears helps determine its meaning (as per the distributional hypothesis), with static embeddings we lose the ability to capture these context-dependent distinctions. **Contextualized embeddings** address this problem by computing a representation for each word *token* that reflects its meaning in its specific context. 

### Attention 

Assume we have an input sequence and static embeddings for each token. To obtain contextualized embeddings, we can compute a new representation for each token as a weighted combination of the static embeddings of all tokens in its context (including itself). The goal of the attention mechanism is to calculate the attention weights, which determine how much importance to assign to each context word when forming the weighted combination. Attention lets the model "decide" which other tokens are important when computing a token’s new representation.

<center>
<img src="https://www.cs.columbia.edu/~bauer/shape/attention1.png" width=600px>
</center>

Let's say the input sequence of embeddings is $[t_1, t_2, \ldots, t_n]$. The contextualized representation for a specific token $i$ is $w_i = \sum\limits_{j=1}^n \alpha_{i,j} t_j$. 

The $\alpha_{i,j}$ are the attention weights between token $i$ and $j$: how much attention the new representation of token $i$ will place on token $j$. For each token $i$, the attention weights sum to 1, i.e. $\sum_{j=1}^n \alpha_{i,j} = 1$. We can represent the attention weights for all $i$ and $j$ token positions as a square $n\times n$ matrix.

Consider the three examples for the word **play** above. When computing a contextualized embedding for each **play** token, what do you think the attention weights might look like? 

How can we compute the attention weights? A simple approach is to compute the dot product between $t_i$ and each potential context word $t_j$ and then normalize using softmax.

$\text{score}(i,j) = t_i \cdot t_j$ and

$\alpha_{i,j} = \frac{e^{\text{score}(i,j)}}{\sum\limits_{k=1}^n e^{\text{score}(i,k)}}$

The transformer (see below) additionally scales the scores: 
$\text{score}(i,j) = \frac{t_i \cdot t_j}{\sqrt{d}}$, where $d$ is the size of the input vectors.

We can compute the entire $(n \times n)$ matrix of attention weights using matrix multiplication. Let's stack all the row vectors $t_i$ together as an $(n \times d)$ matrix $T$. Then the matrix of attention weights is $\text{Attention}(T,T) = \text{softmax}\left(\frac{T T^\intercal}{\sqrt{d}}\right)$, where the softmax is applied to each row separately.

Then we can compute the $(n \times d)$ matrix of contextualized representations, whose $i$-th row is the vector $w_i$ from above: $U = \text{Attention}(T,T) T$
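
As a rough illustration of the formulas above, here is a minimal pure-Python sketch of this projection-free self-attention. The three 2-dimensional vectors in `T` are made-up values rather than real embeddings, and the helper names `softmax` and `self_attention` are chosen only for this example.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(T):
    """Compute softmax(T T^T / sqrt(d)) T for a list of n vectors of size d."""
    d = len(T[0])
    weights = []
    for t_i in T:
        scores = [sum(a * b for a, b in zip(t_i, t_j)) / math.sqrt(d) for t_j in T]
        weights.append(softmax(scores))
    # each output vector is a weighted combination of all input vectors
    U = [[sum(w * t_j[c] for w, t_j in zip(row, T)) for c in range(d)] for row in weights]
    return weights, U

# toy "embeddings" (assumed values, for illustration only)
T = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, U = self_attention(T)
for row in weights:
    print([round(w, 3) for w in row])
```

The first two vectors point in almost the same direction, so the first token places more weight on the second token than on the third.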

**Attention Heads**

Prior to applying the attention operation, we pass the input matrix $T$ of static embeddings through three different linear projections to transform it into a **query** $Q$, **key** $K$, and **value** $V$. The projection matrices $W_q$, $W_k$, and $W_v$ are the parameters of the attention model and are learned. Because the token vectors are the rows of $T$, the projections are applied by multiplying from the right:

$Q = T W_q$

$K = T W_k$

$V = T W_v$

We then compute the contextualized representations as $U = \text{Attention}(Q,K) V$. 

In other words, we compute attention scores between the query and key, and then use those to compute a combination of the values.
Note that the vectors in $Q$, $K$ and $V$ do not necessarily need to have the same size as the vectors in $T$ (although $Q$ and $K$ must match). 
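
To make the shapes concrete, here is a small pure-Python sketch of one attention head. It treats each token as a row of $T$ and applies the projections by multiplying from the right. The sizes `n`, `d`, `d_k`, `d_v` and the random matrices are assumptions for illustration; in a real model the projection matrices are learned. The scores are scaled by the square root of the key size, as in the Transformer.

```python
import math
import random

def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, d_k, d_v = 4, 6, 3, 5       # assumed toy sizes
T = random_matrix(n, d, rng)      # stand-in for the static embeddings
W_q = random_matrix(d, d_k, rng)  # in a real model these three are learned
W_k = random_matrix(d, d_k, rng)
W_v = random_matrix(d, d_v, rng)

Q, K, V = matmul(T, W_q), matmul(T, W_k), matmul(T, W_v)

# scores between every query and every key: (n x d_k) times (d_k x n) = (n x n)
K_T = [list(col) for col in zip(*K)]
scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
A = [softmax(row) for row in scores]
U = matmul(A, V)                  # (n x n) times (n x d_v) = (n x d_v)

print(len(Q), len(Q[0]))  # 4 3
print(len(A), len(A[0]))  # 4 4
print(len(U), len(U[0]))  # 4 5
```

The query and key sizes have to agree because they are combined in a dot product, while the value size only determines the size of the output vectors.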

**Multi-Head Attention**

Transformers (see below) use multiple **attention heads**, each performing its own set of learned linear projections. The idea is that different attention heads may learn to focus on different aspects of the input: some may attend more to syntactic relations (like subject-verb connections), while others may focus more on semantics or long-distance dependencies.

We concatenate the representations resulting from the attention operation for each attention head (along the axis of the vector representations). Then we feed the concatenated matrix through a linear layer, which combines the information from all heads and, if necessary, brings it back to the original dimensionality of the token representation vectors. Let's call this dimensionality the $\text{embed\_dim}$.

The size of the vectors in each $Q$, $K$ and $V$ in the attention head is usually set to $\text{embed\_dim} / h$, where $h$ is the number of attention heads. For example, if we use $\text{embed\_dim}=128$ and we have $h=8$ attention heads, each $Q$, $K$, and $V$ will have vectors of size $128/8=16$. In concrete terms, the size of $Q$, $K$, and $V$ would be $(n \times 16)$.
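
The arithmetic can be spelled out in a few lines of Python. This is only a shape and parameter count, not an attention implementation. The sequence length `n = 10` is an assumed value, and biases are ignored in the weight count.

```python
embed_dim = 128
num_heads = 8
assert embed_dim % num_heads == 0
head_dim = embed_dim // num_heads

n = 10  # assumed sequence length
# each head produces an (n x head_dim) output; we only track the shapes here
head_shapes = [(n, head_dim) for _ in range(num_heads)]
concat_shape = (n, sum(cols for _, cols in head_shapes))

# weights in the Q, K, V projections of all heads together (biases ignored)
qkv_weights = num_heads * 3 * embed_dim * head_dim
# for comparison: three full (embed_dim x embed_dim) projections
print(head_dim, concat_shape, qkv_weights, 3 * embed_dim * embed_dim)
# prints: 16 (10, 128) 49152 49152
```

With this choice of head size, concatenating the heads gives back vectors of size `embed_dim`, and splitting the projections across 8 heads uses the same number of weights as a single head with full-size projections.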

<img src="https://www.cs.columbia.edu/~bauer/shape/mh_attention_module.png" width=400px caption="a multi-head attention block">

### Implementing Attention

Here is a simple implementation for single-head attention as a PyTorch module. 

In [58]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.softmax below

class MyAttention(nn.Module):
    def __init__(self, input_dim, output_dim):
        """
        input_dim: dimension of input embeddings
        output_dim: dimension for Q, K, V projections and final output
        """
        super().__init__()
        # the parameters of the attention module are the linear projections for Q, K, V
        self.query = nn.Linear(input_dim, output_dim)
        self.key = nn.Linear(input_dim, output_dim)
        self.value = nn.Linear(input_dim, output_dim)

    def forward(self, x):        
        # x is (batch, n, input_dim)
        Q = self.query(x)  # (batch, n, output_dim)
        K = self.key(x)    # (batch, n, output_dim)
        V = self.value(x)  # (batch, n, output_dim)

        # Compute attention scores
        # Note: Q is (batch, n, dim)
        # we need to transpose K's n and dim dimensions so we get a (batch, dim, n) matrix
        scale = K.shape[-1] ** 0.5      # sqrt(dim), as in the scaled score formula
        scores = Q @ K.transpose(1, 2)  # (batch, n, dim) @ (batch, dim, n) = (batch, n, n)
        scores = scores / scale
        attn_weights = F.softmax(scores, dim=-1)  # (batch, n, n), each row sums to 1

        # Compute combination
        out = attn_weights @ V  # (batch, n, n) @ (batch, n, dim) = (batch, n, dim)

        return out, attn_weights # return weights for inspection if you want

We can use the `MyAttention` module to define `MyMultiHeadAttention`:

In [107]:
import torch
import torch.nn as nn 

class MyMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):        
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim should be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads # typically each head projects into a vector space with embed_dim / num_heads components

        # Create a list of attention heads. They need to be in a ModuleList so that
        # PyTorch registers their parameters (otherwise the optimizer would not see them)
        self.heads = nn.ModuleList([MyAttention(embed_dim, self.head_dim) for _ in range(num_heads)])
        
        # Final linear layer to combine all heads' outputs
        self.output_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        """
        x : (batch, n, embed_dim)
        """
        # Run each head and collect their outputs
        head_outputs = []
        for head in self.heads:
            out, _ = head(x)                      # (batch, n, head_dim); attention weights are ignored here
            head_outputs.append(out)

        # Concatenate outputs from all heads along feature dimension
        # List of (batch, n, head_dim) --> (batch, n, num_heads * head_dim) = (batch, n, embed_dim)
        concat = torch.cat(head_outputs, dim=-1)

        # Final linear projection to combine info from all heads
        final_output = self.output_proj(concat)              # (batch, n, embed_dim)

        return final_output


In [109]:
attn = MyMultiHeadAttention(128,8)

PyTorch already implements attention for us: 
https://docs.pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html


### Neural LM with Attention

We can now update our Neural Language Model to use attention.

In [122]:
import torch
from torch import nn

class SimpleLMWithAttention(nn.Module):
    def __init__(self, vocab_size, context_size = 40, embed_dim=256, hidden_dim=1024, num_heads = 4):
        super().__init__()
        
        self.E = nn.Embedding(vocab_size, embed_dim)
        # batch_first=True because our inputs are (batch, context_len, embed_dim)
        self.attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim) # 40*256*1024 = 10485760 weights with the defaults
        self.relu = nn.ReLU()
        self.out = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, x):
        emb = self.E(x)                     # (batch, context_len, embed_dim)
        attn_output, attn_weights = self.attention(emb, emb, emb) # (batch, context_len, embed_dim)
        # reshape (rather than view) also works if attn_output is not contiguous in memory
        flattened = attn_output.reshape(x.size(0), -1) # flatten to (batch, context_len * embed_dim)
        h = self.relu(self.hidden(flattened))
        out = self.out(h)
        return out  # predict next token

Attention only really makes sense if we set the context size to something large, but then the size of the hidden layer explodes (40 × 256 × 1024 = 10,485,760 weights with the defaults above).
So while attention might give us better word representations, it's not really suitable for this simple type of LM, nor for text classification with a pooled input.
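
The growth is easy to check with a short calculation. The helper name `hidden_layer_weights` is made up for this example, and it counts only the weights of the first linear layer (no biases), using the default sizes from the model above.

```python
def hidden_layer_weights(context_size, embed_dim=256, hidden_dim=1024):
    """Weights in a linear layer that takes the flattened context as input (bias excluded)."""
    return context_size * embed_dim * hidden_dim

for context_size in (3, 40, 512):
    print(context_size, f"{hidden_layer_weights(context_size):,}")
# 3 786,432
# 40 10,485,760
# 512 134,217,728
```

The count grows linearly with the context size, so a context of a few hundred tokens already needs more than a hundred million weights in this one layer.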

### Transformers

The **Transformer** architecture is probably the most influential machine learning technique in recent history. It is the foundation for all modern Large Language Models. You can take a look at the original Transformer paper:

> Vaswani et al. 2017. "Attention Is All You Need". Proceedings of NeurIPS.
> https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

The transformer uses multiple attention blocks to compute increasingly refined contextualized representations for the input tokens. 
The original transformer was intended for **sequence-to-sequence** tasks (such as machine translation or question answering). It includes an "**encoder stack**" and a "**decoder stack**". The purpose of the encoder is to compute contextualized token representations for the input. The decoder then uses these representations to produce the output, one token at a time. 

<img src="https://www.cs.columbia.edu/~bauer/shape/transformer_overview.png" width=600px>

In the full **encoder-decoder** model, the decoder blocks may attend to the output representations computed by the encoder. This is called **cross-attention**. Within the encoder and within the decoder, token representations are contextualized based on the other tokens in the same sequence using **self-attention** (the operation described above).

In practice, many popular models use either the encoder, where the focus is on token or sequence encoding, or the decoder, where the focus is on next-word prediction. **Encoder-only models** are usually pre-trained on a **denoising autoencoder objective** (the BERT family of models). **Decoder-only** models are pre-trained on a **language modeling objective** (the GPT family of models).

**Positional Embeddings**

We will look at the encoder first. In essence, each encoder block is applied to the output of the previous encoder block. Each encoder block applies self-attention to contextualize the tokens.
By itself, the attention mechanism does not know _where_ in the input a token appears. Thus, the tokens in 

"*the kids love to play soccer after school*" and 

"*the play love kids after school to soccer*" 

would each receive the same representation in both sentences. Positional embeddings allow the attention model to take the position of the tokens into account. We can either learn an embedding vector for each position (similar to how we learned the static embeddings), or we can hand-craft embeddings with desirable properties. A popular approach is to use sinusoidal embeddings, where each component of the vector is computed using the sin or cos function with a wavelength that depends on the component.
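
The order-blindness of attention can be checked with a small pure-Python sketch. It uses plain dot-product self-attention without projections, and the 2-dimensional embeddings are made-up values for three of the words. The function name `contextualize` is chosen only for this example.

```python
import math

def contextualize(vectors):
    """Plain dot-product self-attention without projections or positions."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, vectors)) for c in range(d)])
    return out

# made-up 2-dimensional embeddings for three word types
emb = {"kids": [1.0, 0.2], "play": [0.3, 0.9], "soccer": [0.5, 0.5]}

order_a = ["kids", "play", "soccer"]
order_b = ["soccer", "kids", "play"]
rep_a = dict(zip(order_a, contextualize([emb[w] for w in order_a])))
rep_b = dict(zip(order_b, contextualize([emb[w] for w in order_b])))

for word in order_a:
    same = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(rep_a[word], rep_b[word]))
    print(word, same)  # True for every word
```

Each word ends up with the same vector in both orders, because the weighted sum does not depend on where the other tokens are located.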

Specifically, for the vector at position $\text{pos}$, the value of each even component is $PE_{(\text{pos},2i)} = \sin\left(\frac{\text{pos}}{10000^{2i/d_\text{emb}}}\right)$ and the value of each odd component is $PE_{(\text{pos},2i+1)} = \cos\left(\frac{\text{pos}}{10000^{2i/d_\text{emb}}}\right)$, where $d_\text{emb}$ is the size of the embedding vectors.

<img src="https://www.cs.columbia.edu/~bauer/shape/positional_embeddings.png" width=600px>

The sinusoidal vectors describe both absolute and relative position. The model can easily learn to attend to tokens at particular distances apart, because these distances translate to predictable differences between the vectors.
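
The claim about distances can be checked numerically. The following pure-Python sketch computes the sinusoidal vectors for a few positions and compares dot products between pairs of positions that are the same distance apart. The embedding size of 16 and the helper names are assumptions for illustration.

```python
import math

def sinusoid_encoding(pos, d_emb):
    """Sinusoidal positional encoding for one position (even index: sin, odd index: cos)."""
    vec = []
    for k in range(d_emb):
        i = k // 2                       # components 2i and 2i+1 share a wavelength
        angle = pos / (10000 ** (2 * i / d_emb))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_emb = 16  # assumed toy size
pe = [sinusoid_encoding(pos, d_emb) for pos in range(20)]

print([round(v, 3) for v in pe[0][:4]])  # position 0: sin(0)=0, cos(0)=1
# both pairs are 4 positions apart, so their dot products agree
print(round(dot(pe[3], pe[7]), 6), round(dot(pe[10], pe[14]), 6))
```

The dot product of two positional vectors depends only on how far apart the positions are, not on where in the sequence they occur. This is one way to see why relative offsets are easy for the model to pick up.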

The positional embeddings are added to the static embeddings used as input for the model. 

**Encoder Layers**

After adding the positional embeddings, the static input embeddings are passed into the first encoder layer. The transformer layers use some tricks we have seen before: residual connections (also called skip connections) are added between the input and the output of the multi-head attention block, as well as between the input and output of the feed-forward component. After each residual connection, we apply a layer normalization (“layer norm”), which normalizes the representation across the features of each individual vector.

Layer normalization computes the mean and standard deviation for each token vector (across its features), then subtracts the mean from each feature and divides by the standard deviation. This helps stabilize and speed up training by keeping the representations at each layer on a similar scale.
For comparison, batch normalization normalizes each feature separately, using the mean and standard deviation of that feature computed over all examples in a batch.
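
The difference between the two is only the direction in which the statistics are computed. Here is a minimal pure-Python sketch on a made-up batch of three vectors. It leaves out the learned scale and shift parameters that `nn.LayerNorm` and `nn.BatchNorm1d` include by default, and the small `eps` term is an assumption to avoid division by zero.

```python
import math

def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# a toy "batch" of 3 vectors with 4 features each (assumed values)
batch = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 20.0, 30.0, 40.0],
         [0.0, 0.0, 5.0, 5.0]]

# layer norm: normalize each vector (row) on its own
layer_normed = [normalize(row) for row in batch]

# batch norm: normalize each feature (column) across the examples in the batch
columns = [normalize(list(col)) for col in zip(*batch)]
batch_normed = [list(row) for row in zip(*columns)]

print([round(v, 2) for v in layer_normed[0]])
print([round(v, 2) for v in layer_normed[1]])
```

The first two rows differ only by a factor of 10, so layer norm maps them to practically the same vector. Each row is rescaled independently of the other examples in the batch.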

<img src="https://www.cs.columbia.edu/~bauer/shape/transformer_encoder.png" width=200px>

**Decoder Layers: masked self-attention and cross-attention**


The purpose of the decoder is to generate text. This works differently from the simple neural LM we discussed earlier. 
For each output $o_t$ at position $t$, the decoder produces a contextualized token representation. It then uses this representation to predict the successor token $o_{t+1}$. Once the token has been generated, it is appended to the existing output, and the extended sequence is passed through the transformer again. Note that during training, rather than passing the outputs token by token, we can just present the entire output sequence at once.
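
The generation loop can be sketched in plain Python. The function `toy_next_token_scores` is a hypothetical stand-in for a trained decoder, not a real model: it simply prefers the next word of one fixed sentence. Picking the highest-scoring token (greedy decoding) and stopping at an end symbol `</S>` are assumptions of this sketch.

```python
def toy_next_token_scores(tokens, vocab):
    """Hypothetical stand-in for a trained decoder: returns one score per vocabulary
    entry for the token that should follow `tokens`. It simply prefers the word
    that comes after the last token in a fixed sentence."""
    sentence = ["<S>", "the", "kids", "play", "soccer", "</S>"]
    last = tokens[-1]
    target = sentence[sentence.index(last) + 1] if last in sentence[:-1] else "</S>"
    return [1.0 if word == target else 0.0 for word in vocab]

def greedy_decode(next_token_scores, vocab, max_len=10):
    tokens = ["<S>"]                                 # start symbol
    while len(tokens) < max_len:
        scores = next_token_scores(tokens, vocab)    # run the model on everything so far
        best = vocab[scores.index(max(scores))]      # pick the highest-scoring token
        tokens.append(best)                          # append it and repeat
        if best == "</S>":
            break
    return tokens

vocab = ["<S>", "</S>", "the", "kids", "play", "soccer"]
print(greedy_decode(toy_next_token_scores, vocab))
# ['<S>', 'the', 'kids', 'play', 'soccer', '</S>']
```

A real decoder would take the place of the stand-in function and would look at the whole sequence generated so far, not just the last token.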

The transformer accepts a fixed maximum number of input tokens. Unused positions in the output (as well as in the encoder input) are filled on the right with the special symbol `<PAD>`. `<S>` is the special start symbol.

<img src="https://www.cs.columbia.edu/~bauer/shape/transformer_decoder_io.png" width=200px>

A small complication arises with **self-attention** in the decoder: We do not want the model to attend to output positions that have not been generated yet. We achieve this by setting the attention scores for future positions to $-\infty$ before applying the softmax, so that the resulting attention weights are 0. To do so, we create an attention mask (also called a causal mask in this context) and add it to the computed attention scores. For five tokens, the attention mask might look like this:




In [210]:
import math
import torch
inf = math.inf

torch.tensor([[0., -inf, -inf, -inf, -inf],
        [0.,   0., -inf, -inf, -inf],
        [0.,   0.,   0., -inf, -inf],
        [0.,   0.,   0.,   0., -inf],
        [0.,   0.,   0.,   0.,   0.]])

tensor([[0., -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0.]])
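
To see the effect of such a mask, here is a minimal pure-Python sketch that adds it to a matrix of scores and then applies the softmax to each row. All raw scores are set to the same made-up value, so any difference in the weights comes from the mask alone.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

n = 5
NEG_INF = float("-inf")
# 0 on and below the diagonal, -inf above it (future positions)
mask = [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

# pretend every raw score is 1.0, so only the mask makes a difference
scores = [[1.0] * n for _ in range(n)]
masked = [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, mask)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, the second splits its attention between the first two positions, and so on. Every weight above the diagonal is exactly 0.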

In a full encoder-decoder model, the decoder layers additionally perform **cross-attention** with the output representations computed by the encoder. They are otherwise identical.
In cross-attention the representations computed by the decoder are the queries $Q$, and the encoded input representations are the keys $K$ and values $V$. Residual connections are added between the input and output of the cross-attention component and layer norm is applied to the output. 
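
The only difference from self-attention is where the queries, keys, and values come from. The following pure-Python sketch uses made-up vectors for three encoder tokens and two decoder tokens and omits the learned projections, so it only illustrates the data flow and the resulting shapes.

```python
import math

def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V with lists of vectors; returns (weights, outputs)."""
    d = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))])
    return weights, outputs

# assumed toy representations: 3 encoder tokens, 2 decoder tokens, size 2
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [[0.5, 0.5], [2.0, 0.0]]

# queries come from the decoder; keys and values come from the encoder
weights, outputs = attention(decoder_state, encoder_out, encoder_out)
print(len(weights), len(weights[0]))   # 2 3: one row per decoder token, one column per encoder token
print(len(outputs), len(outputs[0]))   # 2 2
```

The attention matrix is no longer square: each decoder token gets one weight per encoder token, and its new representation is a combination of encoder vectors.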

<img src="https://www.cs.columbia.edu/~bauer/shape/transformer_decoder.png" width=300px>


### Example: A tiny GPT-like model

Decoder-only models form the basis of the popular GPT family of models. Here is a tiny but conceptually complete version, using the components discussed in this notebook:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


def get_sinusoid_encoding(seq_len, emb_dim, device=None):
    """Create sinusoidal positional encoding table."""
    pos = torch.arange(seq_len, dtype=torch.float32, device=device).unsqueeze(1)        
    i = torch.arange(emb_dim, dtype=torch.float32, device=device).unsqueeze(0)          
    denom = torch.pow(10000, (2 * (i//2)) / emb_dim)  # components 2i and 2i+1 share the same wavelength
    angle_rads = pos / denom

    encodings = torch.zeros(seq_len, emb_dim, device=device)
    # Even i: sin, Odd i: cos
    encodings[:, 0::2] = torch.sin(angle_rads[:, 0::2])
    encodings[:, 1::2] = torch.cos(angle_rads[:, 1::2])
    return encodings  # (seq_len, emb_dim)

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, emb_dim, n_heads, ff_dim, max_seq_len):
        super().__init__()
        self.vocab_size = vocab_size
        self.emb_dim = emb_dim
        self.max_seq_len = max_seq_len

        # Token embedding only (no positional embedding)
        self.tok_embedding = nn.Embedding(vocab_size, emb_dim)
        
        # Register the (non-trainable) sinusoidal positional encoding
        pe = get_sinusoid_encoding(max_seq_len, emb_dim)
        self.register_buffer('pos_encoding', pe)  # (max_seq_len, emb_dim), not a parameter

        self.ln1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, emb_dim),
        )
        self.output_proj = nn.Linear(emb_dim, vocab_size)

    def forward(self, x):
        bsz, seq_len = x.shape
        tok_emb = self.tok_embedding(x)  # (batch, seq, emb_dim)

        # Fetch sinusoidal positional encodings for current seq_len
        # Shape: (1, seq, emb_dim), so it can be broadcast across the batch
        pos_emb = self.pos_encoding[:seq_len, :].unsqueeze(0)
        h = tok_emb + pos_emb

        # Causal mask
        attn_mask = torch.triu(torch.ones(seq_len, seq_len, device=x.device), diagonal=1)
        attn_mask = attn_mask.masked_fill(attn_mask == 1, float('-inf'))

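        # Note: layer norm is applied before each sub-layer here (the "pre-norm" variant),
        # rather than after the residual connection as in the encoder description above
        h_ln1 = self.ln1(h)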
        attn_output, _ = self.attn(h_ln1, h_ln1, h_ln1, attn_mask=attn_mask)
        h = h + attn_output # residual connection 

        # No cross attention here because it's a decoder-only model
        
        h_ln2 = self.ln2(h)
        ff_output = self.ff(h_ln2)
        h = h + ff_output # residual connection 

        logits = self.output_proj(h)
        return logits

# Example usage
vocab_size = 100
emb_dim = 32
n_heads = 4
ff_dim = 64
max_seq_len = 16
device = 'cpu'

model = TinyGPT(vocab_size, emb_dim, n_heads, ff_dim, max_seq_len).to(device)
x = torch.randint(0, vocab_size, (2, 8)).to(device)
logits = model(x)
print(logits.shape)  # torch.Size([2, 8, 100])
