> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations Github README file</a>.</small></small></p>

![](https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C4-white-bg.png)

# Lab: Positional Embeddings

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_4/gdm_lab_4_4_positional_embeddings.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

Explore how self-attention without positional embeddings is order-invariant.

15 minutes

## Overview

In this lab, you will explore how the attention mechanism is **position invariant**. This means that the output of the attention mechanism for a given token is the same regardless of the order of the tokens that precede it.

This notebook defines a toy attention computation. It defines random token embeddings, random query, key, and value projection matrices, and it defines the computations of a single attention head.




### What you will learn

By the end of this lab, you will:

* Understand how the self-attention mechanism leads to the same output when the order of tokens in the prompt has been re-arranged.


### Tasks

In this lab, you will:

* Walk through the implementation of a toy language model that implements the attention mechanism and observe which values change as you change the order of tokens in the prompt.


## How to use Google Colaboratory (Colab)


Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in **cells** that are executed on a remote server.

To run a cell, hover over the cell and click on the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can also click on a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [None]:
from datetime import datetime

print(f"Today is {datetime.today():%A}.")

Note that the *order in which you run the cells matters*. When you are working through a lab, always run *all* cells in order; otherwise the code might not work. If you take a break while working on a lab, Colab may disconnect you. In that case, you have to execute all cells again before continuing your work. To make this easier, you can select the cell you are currently working on and then choose __Runtime → Run before__ from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

## Imports


In [None]:
import numpy as np # For defining and working with embeddings.
from scipy.special import softmax # For computing attention weights.

## Prepare the model

Run these cells one by one to compute the output of the attention head and the corresponding attention weights.

The following code block first defines the sentences and the vocabulary of the model. Given that this is a simple toy model, the vocabulary is limited to the entries in the `vocabulary` dictionary: five tokens plus the padding token `<pad>`.


In [None]:
# Define sentences. Lower-case everything so that "The" and "the" use the same
# embedding.
sentence1_str = "the zebra chased the lion ."
sentence2_str = "the lion chased the zebra ."

# Define vocabulary.
vocabulary = {"<pad>": 0, "the": 1, "zebra": 2, "chased": 3, "lion": 4, ".": 5}
inv_vocabulary = {v: k for k, v in vocabulary.items()}

The following block constructs a random embedding matrix with one row for each of the six vocabulary entries defined above (the five tokens plus `<pad>`). This embedding matrix will not capture any semantic similarities between words. The goal of this exercise is to demonstrate the position invariance of the attention mechanism, not to use these embeddings in a model that predicts the next token. It therefore does not matter that these embeddings are not good for making predictions.

In [None]:
# Embedding dimension.
embedding_dim = 3
vocabulary_size = len(vocabulary)

# Set a seed for reproducibility.
np.random.seed(2311)

# Embedding matrix (vocabulary_size x embedding_dim).
embedding_matrix = np.random.rand(vocabulary_size, embedding_dim)
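
The way such a matrix is later used can be sketched as a simple row lookup. The snippet below is a minimal illustration, not part of the lab's model: the names `toy_vocabulary` and `toy_embedding_matrix` and the three-entry vocabulary are hypothetical. It shows that a token receives the same vector wherever it appears in the sentence, so the embeddings on their own carry no positional information.

```python
import numpy as np

# Hypothetical toy vocabulary, used only for this illustration.
toy_vocabulary = {"<pad>": 0, "the": 1, "lion": 2}

rng = np.random.default_rng(0)
# One row per vocabulary entry, one column per embedding dimension.
toy_embedding_matrix = rng.random((len(toy_vocabulary), 3))

# Map each token to its ID, then use the IDs to select rows of the matrix.
token_ids = [toy_vocabulary[word] for word in "the lion the".split()]
toy_embeddings = toy_embedding_matrix[token_ids]

print(token_ids)             # [1, 2, 1]
print(toy_embeddings.shape)  # (3, 3)

# Both occurrences of "the" are mapped to exactly the same vector.
print(np.array_equal(toy_embeddings[0], toy_embeddings[2]))  # True
```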

The following block similarly constructs random query, key, and value projection matrices. In a real model these matrices would be learned from data. Again, for the purpose of this exercise, it does not matter that the parameters do not represent anything useful for making predictions.

In [None]:
# Dimension of key, query, value vectors (can be different from embedding_dim,
# but here same for simplicity).
d_k = embedding_dim
d_q = embedding_dim
d_v = embedding_dim

# Constant to be used for masked attention scores.
K_MASK = -2.3819763e38  # Set to a large negative number.

# Projection matrices (embedding_dim x d_q for W_q, embedding_dim x d_k for W_k,
# embedding_dim x d_v for W_v).
W_q = np.random.rand(embedding_dim, d_q)
W_k = np.random.rand(embedding_dim, d_k)
W_v = np.random.rand(embedding_dim, d_v)
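
The comment above notes that the key, query, and value dimensions can differ from the embedding dimension. The following sketch illustrates which shapes are allowed, using hypothetical sizes (`qk_dim = 2`, `v_dim = 5`) that are not used anywhere else in this lab. Queries and keys must share a dimension so that their dot products are defined, while the value dimension is free and determines the size of the attention output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for this illustration.
num_tokens, emb_dim = 4, 3
qk_dim, v_dim = 2, 5

x = rng.random((num_tokens, emb_dim))  # One embedding per token.

w_q = rng.random((emb_dim, qk_dim))
w_k = rng.random((emb_dim, qk_dim))    # Must match the query dimension.
w_v = rng.random((emb_dim, v_dim))     # May differ from qk_dim.

q, k, v = x @ w_q, x @ w_k, x @ w_v

# One score per (query token, key token) pair.
scores = q @ k.T / np.sqrt(qk_dim)

print(scores.shape)  # (4, 4)
print(v.shape)       # (4, 5)
```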

Finally, this block defines the attention computation function. It computes the queries, keys, and values for a sentence, then computes the (masked) logits, and finally the attention weights and the output of the attention head. This function also prints several intermediate results so that you can inspect them.

In [None]:
def compute_attention_output(sentence: str) -> tuple[np.ndarray, np.ndarray]:
    """Computes the attention output and attention weights for all tokens in
    `sentence`.

    Args:
      sentence: The sentence for which to compute the attention weights. Tokens
        must be space-separated and must be keys of `vocabulary`.

    Returns:
      attention_output: The output of the attention mechanism.
        Shape: (num_tokens, d_v).
      attention_weights: The attention weights for all tokens.
        Shape: (num_tokens, num_tokens).
    """

    tokens = [vocabulary[word] for word in sentence.split()]

    embeddings = embedding_matrix[tokens]

    # Compute queries, keys, values for sentence.
    Q = embeddings @ W_q # Shape: (num_tokens, d_q)
    K = embeddings @ W_k # Shape: (num_tokens, d_k)
    V = embeddings @ W_v # Shape: (num_tokens, d_v)

    # Compute the attention mask.
    num_tokens = len(tokens)
    attention_mask = np.tri(num_tokens) # Shape: (num_tokens, num_tokens).
    print(f"\n--- Sentence: \"{sentence}\" ---")

    # Compute attention logits with scaling factor.
    scale_factor = np.sqrt(d_k)
    logits = (Q @ K.T) / scale_factor # Shape: (num_tokens, num_tokens).
    # Apply attention mask.
    # Shape: (num_tokens, num_tokens).
    logits = np.where(attention_mask, logits, K_MASK)

    print("\nAttention logits (Q @ K.T / sqrt(d_k)) for last token:\n",
          logits[-1, :])

    # Compute attention weights (SoftMax).
    # Shape: (num_tokens, num_tokens).
    attention_weights = softmax(logits, axis=1)

    print("\nAttention weights (SoftMax) for last token:\n",
          attention_weights[-1, :])

    print("\nValue matrix:\n", V)

    # Compute attention output.
    attention_output = attention_weights @ V # Shape: (num_tokens, d_v)
    print("\nAttention output (weights @ V) for last token:\n",
          attention_output[-1, :])

    return attention_output, attention_weights
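
The masking and softmax steps of this function can also be looked at in isolation. The sketch below uses hypothetical logits for three tokens, and `LARGE_NEGATIVE` is only a stand-in for the mask constant defined earlier. It shows that the lower-triangular mask lets each token attend only to itself and to earlier tokens, and that each row of weights still sums to 1.

```python
import numpy as np
from scipy.special import softmax

# Stand-in for the mask constant; any very large negative number behaves the same.
LARGE_NEGATIVE = -1e30

# Hypothetical attention logits for a three-token sentence.
logits = np.array([[0.5, 1.0, 2.0],
                   [1.5, 0.2, 0.3],
                   [0.1, 0.4, 0.9]])

# Lower-triangular matrix of ones: row i may attend to columns 0..i.
causal_mask = np.tri(3)

# Keep the allowed logits and replace the masked ones.
masked_logits = np.where(causal_mask, logits, LARGE_NEGATIVE)

# Row-wise softmax turns the logits into attention weights.
weights = softmax(masked_logits, axis=1)

print(np.round(weights, 3))
print(weights.sum(axis=1))  # Every row sums to 1.
```

The first token can only attend to itself, so its row of weights is `[1, 0, 0]`. Only the last row is unaffected by the mask.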

## Compare sentences with different word order

You can now compute the attention weights and outputs for the sentences "the zebra chased the lion ." and "the lion chased the zebra ."

In [None]:
attention_output1, attention_weights1 = compute_attention_output(sentence1_str)
attention_output2, attention_weights2 = compute_attention_output(sentence2_str)

### What did you observe?

Take a close look at the attention weights and the value matrices above.

As you will see, both the attention weights and the value matrices differ between the two sentences. In the attention weight vector, the second and the fifth entries have been swapped (these correspond to "zebra" and "lion").

The two value matrices differ in the second and the fifth row. These are the representations that correspond to "zebra" and "lion".

However, when you multiply the attention weight vector with the value matrix, you obtain the same output in both cases. Perform this computation manually to see why.

The fact that this matrix multiplication leads to the same result independent of the order of the previous tokens is the reason why the attention mechanism in its current form is order invariant.
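
The manual computation can be sketched with a small made-up example. The numbers below are hypothetical and are not the values printed by the cells above. The output is a weighted sum of the value rows, so swapping two weights together with their matching value rows only reorders the terms of the sum.

```python
import numpy as np

# Hypothetical attention weights for one token over four positions.
weights_a = np.array([0.1, 0.4, 0.2, 0.3])

# Hypothetical value matrix: one row per position.
values_a = np.array([[1.0, 0.0],
                     [0.0, 2.0],
                     [3.0, 1.0],
                     [1.0, 1.0]])

# Swap the second and third positions in both the weights and the values.
swap = [0, 2, 1, 3]
weights_b = weights_a[swap]
values_b = values_a[swap]

# Each output is the sum over positions of weight * value row.
output_a = weights_a @ values_a
output_b = weights_b @ values_b

print(output_a)
print(output_b)
print(np.allclose(output_a, output_b))  # True
```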

Run the following cell to see this comparison even more directly.



In [None]:
print("\n--- Comparison ---")

print("\nAttention weights for last token in S1:", attention_weights1[-1, :])
print("Attention weights for last token in S2:", attention_weights2[-1, :])

print("\nAttention outputs for last token in S1:", attention_output1[-1, :])
print("Attention outputs for last token in S2:", attention_output2[-1, :])


print("\nAre attention outputs for last token in S1 and S2 the same?",
      np.allclose(attention_output1[-1, :], attention_output2[-1, :]))

## Compare the attention weights for other sentences

To see that this is not just an artifact of the two sentences above, perform this comparison for other pairs of sentences.

<br />

------
>**💻 Your task:**
>
>Compare the attention weights and outputs for other sentences.
>
>Note that you can only use the following tokens in your sentences (but you can use each token as often as you would like):
>- `the`
>- `lion`
>- `chased`
>- `zebra`
>- `.`
>
>What happens when the set of tokens is the same across both sentences but the last token differs (for example, "lion the chased ." and "chased the . lion")? Does this lead to different attention outputs? If so, why?
>
------

In [None]:
# @title Compare attention weights and outputs for other sentences

sentence_1 = "lion the chased zebra the ."  # @param {"type": "string"}
sentence_2 = "chased lion the the zebra ."  # @param {"type": "string"}

tokens_1 = sentence_1.split()
tokens_2 = sentence_2.split()

possible_token_list = "', '".join(vocabulary.keys())

for t in tokens_1:
    if t not in vocabulary:
        raise ValueError(
            f"Invalid token '{t}' in sentence_1. Please only use one of the"
            f" following tokens: '{possible_token_list}'."
        )

for t in tokens_2:
    if t not in vocabulary:
        raise ValueError(
            f"Invalid token '{t}' in sentence_2. Please only use one of the"
            f" following tokens: '{possible_token_list}'."
        )

attention_output1, attention_weights1 = compute_attention_output(sentence_1)
attention_output2, attention_weights2 = compute_attention_output(sentence_2)

print("\n--- Comparison ---")

print("\nAttention weights for last token in S1:", attention_weights1[-1, :])
print("Attention weights for last token in S2:", attention_weights2[-1, :])

print("\nAttention outputs for last token in S1:", attention_output1[-1, :])
print("Attention outputs for last token in S2:", attention_output2[-1, :])

print(
    "\nAre attention outputs for last token in S1 and S2 the same?",
    np.allclose(attention_output1[-1, :], attention_output2[-1, :]),
)
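
Instead of comparing one pair at a time, the same check can be run over every ordering of the preceding tokens. The following is a self-contained sketch that uses its own hypothetical random parameters (`emb`, `w_q`, `w_k`, `w_v`) and a helper `last_token_output` that is not part of the lab code. Because the last token may attend to all positions, no mask is needed for its row.

```python
import itertools

import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)
dim = 3

# Hypothetical toy parameters, independent of the ones defined in the lab.
toy_vocab = {"the": 0, "zebra": 1, "chased": 2, "lion": 3, ".": 4}
emb = rng.random((len(toy_vocab), dim))
w_q, w_k, w_v = (rng.random((dim, dim)) for _ in range(3))


def last_token_output(words: list[str]) -> np.ndarray:
    """Returns the attention output for the last token of `words`."""
    x = emb[[toy_vocab[w] for w in words]]
    q_last = x[-1] @ w_q     # Query of the last token only.
    k, v = x @ w_k, x @ w_v  # Keys and values of all tokens.
    weights = softmax(k @ q_last / np.sqrt(dim))
    return weights @ v


prefix = ["the", "zebra", "chased", "the", "lion"]
reference = last_token_output(prefix + ["."])

# Try every ordering of the prefix while keeping "." as the last token.
all_equal = all(
    np.allclose(last_token_output(list(p) + ["."]), reference)
    for p in itertools.permutations(prefix)
)
print(all_equal)  # True
```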

## Summary

This interactive activity showed you that the output of the attention mechanism for the last token (that is, the embedding from which the model predicts the next token) does not depend on the order of the tokens that precede it. Without positional information, the attention mechanism is therefore **order-invariant**.

In the next activity, you will explore techniques for encoding positional information in transformer models.