In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show
from bertviz import head_view
from transformers import AutoModel
from torch import nn
from transformers import AutoConfig
import torch
from math import sqrt
import torch.nn.functional as F

# Transformer

![Image Title](images/transformer.png)

*Encoder-decoder architecture of the Transformer, with the encoder shown in the upper half of the figure and the decoder in the lower half.*

## Transformer Encoder

The Transformer’s encoder consists of many encoder layers stacked next to each other. Each encoder layer receives a sequence of embeddings and feeds them through the following sub-layers:
1. A multi-head self-attention layer.
2. A shared feed-forward layer.

The output embeddings of each encoder layer have the same size as the inputs.

The main role of the encoder stack is to "update" the input embeddings to produce representations that encode some contextual information in the sequence. For example, the word "apple" will be updated to be more "company-like" and less "fruit-like" if the words "keynote" or "phone" appear close to it.

Each of these sub-layers also has a skip connection and layer normalization, which are standard tricks to train deep neural networks effectively.

![Image Title](images/encoder.png)

*Zooming into the encoder layer.*

## Self-Attention

**Attention** is a mechanism that allows neural networks to assign a different amount of weight or "attention" to each element in a sequence. For text sequences, the elements are token embeddings. Each token is mapped to a vector of some fixed dimension. For example, in BERT each token is represented as a 768-dimensional vector.

The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. Another way to formulate this is to say that given a sequence of token embeddings $x_1, \ldots, x_n$, self-attention produces a sequence of new embeddings $x'_1, \ldots, x'_n$ where each $x'_i$ is a linear combination of all the $x_j$:

$$x'_i = \sum_{j=1}^{n} w_{ij}x_j$$

The coefficients $w_{ij}$ are called attention weights and are normalized so that $\sum_{j} w_{ij} = 1$.

Embeddings that are generated in this way are called contextualized embeddings.

### Time flies like an arrow; fruit flies like a banana

![Image Title](images/contextualized_embeddings.png)

*Diagram showing how self-attention updates raw token embeddings (upper) into contextualized embeddings (lower) to create representations that incorporate information from the whole sequence.*

### Scaled Dot-Product Attention

There are several ways to implement a self-attention layer, but the most common is scaled dot-product attention from the Attention is All You Need paper where the Transformer was introduced. There are four main steps needed to implement this mechanism:
1. **Create query, key, and value vectors.** Each token embedding is projected into query, key, and value vectors.
2. **Compute attention scores.** Determine how much the query and key vectors relate using a similarity function. As the name suggests, the similarity function for scaled dot-product attention is the dot-product, computed efficiently using matrix multiplication of the embeddings. Similar queries and keys will have a large dot product, while those that don't share much in common will have little to no overlap. The outputs from this step are called the attention scores, and for a sequence with $n$ input tokens, there is a corresponding $n \times n$ matrix of attention scores.
3. **Compute attention weights.** Dot products can, in general, produce arbitrarily large numbers, which can destabilize the training process. To handle this, the attention scores are first multiplied by a scaling factor to normalize their variance and then normalized with a softmax to ensure that the values in each row sum to one. The resulting $n \times n$ matrix now contains all the attention weights $w_{ij}$.
4. **Update the token embeddings.** Once the attention weights are computed, we multiply them by the value vectors $v_1, \ldots, v_n$ to obtain an updated representation for each embedding: $x'_i = \sum_{j} w_{ij}v_j$.

**The first step of self-attention calculation**: For each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512. They DON'T HAVE to be smaller; this is an architecture choice that keeps the computational cost of multi-headed attention (mostly) constant.

![Image Title](images/qkv.png)

*Multiplying $x_1$ by the $W_Q$ weight matrix produces $q_1$, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.*

![Image Title](images/self_attention_process.png)

*Self-attention calculation.*

**The second step of self-attention calculation**: We're calculating the self-attention for the first word in this example, "Thinking". We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a particular position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we're scoring. So, if we're processing the self-attention for the word in position #1, the first score would be the dot product of $q_1$ and $k_1$. The second score would be the dot product of $q_1$ and $k_2$.

**The third and fourth steps** are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper—64), which leads to more stable gradients. Other possible values here could be used, but this is the default. Then, the result is passed through a softmax operation. Softmax normalizes the scores to be positive and add up to 1.

This softmax score determines how much each word will be expressed at this position. The word at this position will usually have the highest softmax score, but sometimes it is helpful to attend to another word that is relevant to the current word.
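The effect of dividing by $\sqrt{d_k}$ can be checked numerically. The sketch below is a minimal illustration in plain Python, not part of the original notebook: it assumes query and key entries drawn independently from a standard normal distribution (an assumption made only for this demonstration) and compares the variance of raw and scaled dot products for $d_k = 64$.

```python
import math
import random

random.seed(0)
d_k = 64          # key/query dimension used in the text
n_samples = 2000  # number of random (query, key) pairs; an arbitrary choice


def dot(u, v):
    """Dot product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


raw_scores = []
for _ in range(n_samples):
    # Assumption: entries are i.i.d. standard normal.
    q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    raw_scores.append(dot(q, k))

# Divide every score by sqrt(d_k), as in scaled dot-product attention.
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

raw_var = variance(raw_scores)
scaled_var = variance(scaled_scores)
print(f"variance of raw dot products:    {raw_var:.1f}")
print(f"variance of scaled dot products: {scaled_var:.2f}")
```

Under these assumptions the raw variance comes out close to $d_k$ and the scaled variance close to 1, which keeps the softmax inputs in a range where it does not saturate.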

**The fifth step** is to multiply each value vector by the softmax score (in preparation for summarizing them). The intuition here is to keep intact the values of the word(s) we want to focus on and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

**The sixth step** is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

$$\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$$

![Image Title](images/attention_matrix_form.png)

*The self-attention calculation in matrix form.*

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network.
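The six steps can also be traced by hand for a single query. The following sketch uses plain Python and made-up numbers: the raw scores `112` and `96` and the two-dimensional value vectors are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64
toy_scores = [112.0, 96.0]             # hypothetical q1·k1 and q1·k2
toy_values = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical value vectors v1, v2

# Steps 3 and 4: divide by sqrt(d_k) = 8, then apply softmax.
toy_scaled = [s / math.sqrt(d_k) for s in toy_scores]  # [14.0, 12.0]
toy_weights = softmax(toy_scaled)                      # roughly [0.88, 0.12]

# Steps 5 and 6: weight each value vector and sum them up.
z1 = [
    sum(w * v[i] for w, v in zip(toy_weights, toy_values))
    for i in range(len(toy_values[0]))
]
print(toy_scaled, toy_weights, z1)
```

Because the value vectors here are the unit vectors, the output `z1` simply equals the attention weights, which makes it easy to see how much each word contributes.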

In [37]:
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)
config = AutoConfig.from_pretrained(model_ckpt)

In [38]:
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)



In [39]:
text = "fruit flies like a banana"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)


### Attention implementation

In [5]:
text = "time flies like an arrow"

inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

In [6]:
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
tokens

['time', 'flies', 'like', 'an', 'arrow']

Then, we need to create some dense embeddings. "Dense" in this context means that each entry in the embeddings contains a non-zero value.

In [7]:
config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.39.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Each input ID will be mapped to one of the 30,522 embedding vectors stored in `nn.Embedding`, each with a size of 768. Note that the token embeddings at this point are independent of their context.

In [8]:
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In [9]:
# look-up table
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 5, 768])

The next step is to create the query, key, and value vectors and calculate the attention scores using the dot-product as the similarity function:

In [10]:
# We’ll see later that the query, key, and value vectors are generated by applying independent weight matrices W_Q, W_K, W_V to the embeddings, but for now we’ve kept them equal for simplicity.

# In scaled dot-product attention, the dot products are divided by the square root of the key dimension so that we don’t get too many large numbers during training that can cause the softmax we will apply next to saturate.

query = key = value = inputs_embeds
dim_k = key.size(-1)
scores = torch.bmm(query, key.transpose(1,2)) / sqrt(dim_k) # bmm performs a batch matrix-matrix product
scores.size()

torch.Size([1, 5, 5])

In [11]:
weights = F.softmax(scores, dim=-1)
print(weights.shape)

torch.Size([1, 5, 5])


In [12]:
scores

tensor([[[27.2713,  0.4895,  1.6083,  0.9601, -0.5755],
         [ 0.4895, 27.9188, -2.5096, -1.2528, -0.7713],
         [ 1.6083, -2.5096, 27.2688, -0.4353, -1.3818],
         [ 0.9601, -1.2528, -0.4353, 26.8312,  1.1933],
         [-0.5755, -0.7713, -1.3818,  1.1933, 27.6317]]],
       grad_fn=<DivBackward0>)

In [13]:
weights

tensor([[[1.0000e+00, 2.3378e-12, 7.1564e-12, 3.7424e-12, 8.0591e-13],
         [1.2235e-12, 1.0000e+00, 6.0964e-14, 2.1424e-13, 3.4675e-13],
         [7.1746e-12, 1.1678e-13, 1.0000e+00, 9.2955e-13, 3.6076e-13],
         [5.8119e-12, 6.3574e-13, 1.4399e-12, 1.0000e+00, 7.3388e-12],
         [5.6204e-13, 4.6206e-13, 2.5096e-13, 3.2957e-12, 1.0000e+00]]],
       grad_fn=<SoftmaxBackward0>)

In [16]:
weights.sum(dim=-1)

tensor([[1., 1., 1., 1., 1.]], grad_fn=<SumBackward1>)
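The weights above are almost one-hot, and a back-of-the-envelope calculation suggests why. Because `query = key`, each diagonal score is the squared norm of an embedding divided by $\sqrt{d}$. Assuming the `nn.Embedding` entries are drawn from a standard normal distribution (PyTorch's default initialization), the expected squared norm is $d = 768$, so the diagonal is about $768 / \sqrt{768} = \sqrt{768}$, while off-diagonal scores are centred on zero. The sketch below just evaluates these two numbers.

```python
import math

hidden_size = 768

# Expected diagonal score: E[||x||^2] / sqrt(d) = d / sqrt(d) = sqrt(d).
expected_diag = hidden_size / math.sqrt(hidden_size)

# Off-diagonal scores are centred on 0, so relative to the diagonal entry
# their softmax weight is roughly exp(0 - expected_diag).
approx_off_diag_weight = math.exp(-expected_diag)

print(f"expected diagonal score: {expected_diag:.2f}")
print(f"approximate off-diagonal weight: {approx_off_diag_weight:.1e}")
```

Both numbers are in line with the printed tensors: diagonal scores around 27 to 28 and off-diagonal weights on the order of $10^{-12}$. Separate learned projections for queries and keys, introduced next, remove this built-in bias toward the token itself.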

In [17]:
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 5, 768])

In [18]:
attn_outputs

tensor([[[ 2.0039, -0.1344, -0.0472,  ..., -0.1041, -0.0050,  1.3917],
         [-0.4035,  0.4007,  0.5775,  ...,  1.2056,  0.6780,  0.4305],
         [-0.5986,  1.7995, -1.1579,  ..., -1.5968, -1.2708,  0.1663],
         [-1.3095,  1.1721, -0.2025,  ..., -0.6782,  1.0053,  0.3022],
         [ 0.5114,  0.3997,  0.0382,  ..., -1.2835, -0.1476, -0.2333]]],
       grad_fn=<BmmBackward0>)

In [19]:
def scaled_dot_product_attention(query, key, value):
    dim_k = query.size(-1)
    scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights, value)

In practice, the self-attention layer applies three independent linear transformations to each embedding to generate the query, key, and value vectors.

These transformations project the embeddings and each projection carries its own set of learnable parameters, which allows the self-attention layer to focus on different semantic aspects of the sequence.

In [20]:
class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, head_dim)
        self.k = nn.Linear(embed_dim, head_dim)
        self.v = nn.Linear(embed_dim, head_dim)

    def forward(self, hidden_state):
        attn_outputs = scaled_dot_product_attention(self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
        return attn_outputs

### Multi-Headed Attention

It also turns out to be beneficial to have multiple sets of linear projections, each one representing a so-called attention head.

But why do we need more than one attention head? The reason is that the softmax of one head tends to focus mostly on one aspect of similarity.

Having several heads allows the model to focus on several aspects at once. For instance, one head can focus on subject-verb interaction, whereas another finds nearby adjectives.

Obviously, we don’t handcraft these relations into the model and they are fully learned from the data.

![Image Title](images/multi_head_att.png)

*Multi-headed attention*

For example, given the word “the”, the first head will give more attention to the word “bank” whereas the second head will give more attention to the word “river”.

![image.png](images/Screenshot_from_2025-04-20_17-31-23.png)

It’s important to note that after the split each head has a reduced dimensionality, so the total computational cost is about the same as that of single-head attention with full dimensionality. If a single-head setup uses ONE linear projection of dimension `D`, the multi-head setup uses `N` linear projections of dimension `D/N` each.
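A quick parameter count makes this concrete. The sketch below is plain arithmetic rather than code from the notebook; `linear_params` is a hypothetical helper that counts the weights and biases of a fully connected layer, and `D` and `N` are set to the BERT-base values shown in the config above.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


D = 768  # embedding size (hidden_size of bert-base-uncased)
N = 12   # number of heads (num_attention_heads of bert-base-uncased)

# One head with full-width query, key and value projections.
single_head = 3 * linear_params(D, D)

# N heads, each with query, key and value projections of width D // N.
multi_head = N * 3 * linear_params(D, D // N)

print(single_head, multi_head)
```

Both counts are identical, so splitting into heads redistributes the same projection parameters instead of adding new ones (the output projection is not counted here).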

In [21]:
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.hidden_size
        num_heads = config.num_attention_heads
        head_dim = embed_dim // num_heads
        self.heads = nn.ModuleList([AttentionHead(embed_dim, head_dim) for _ in range(num_heads)] )
        self.output_linear = nn.Linear(embed_dim, embed_dim)
        
    def forward(self, hidden_state):
        x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
        x = self.output_linear(x)
        return x

In practice, `head_dim` is chosen as `embed_dim // num_heads`, so `embed_dim` should be divisible by the number of heads and the computation across each head is constant. For example, BERT has 12 attention heads, so the dimension of each head is 768/12 = 64.

In [22]:
multihead_attn = MultiHeadAttention(config)
attn_output = multihead_attn(inputs_embeds)
attn_output.size()

torch.Size([1, 5, 768])

In [23]:
attn_output

tensor([[[-0.1885,  0.0207, -0.0889,  ..., -0.2257, -0.0421,  0.0096],
         [-0.1907, -0.0019, -0.2403,  ..., -0.2443, -0.1058, -0.0044],
         [-0.2268,  0.0474, -0.1930,  ..., -0.1873, -0.0734, -0.0165],
         [-0.2082, -0.0105, -0.2151,  ..., -0.2332, -0.0583,  0.0175],
         [-0.1648,  0.0482, -0.1428,  ..., -0.2590, -0.1276, -0.0723]]],
       grad_fn=<ViewBackward0>)

In [40]:
model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)
sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"
viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
attention = model(**viz_inputs).attentions
sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])
head_view(attention, tokens, sentence_b_start, heads=[8])


### Feed Forward Layer

The feed forward sub-layer in the encoder and decoder is just a simple 2-layer fully-connected neural network, but with a twist: instead of processing the whole sequence of embeddings as a single vector, it processes each embedding independently.

In [25]:
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, x):
        print(x.shape)
        x = self.linear_1(x)
        print(x.shape)
        x = self.gelu(x)
        x = self.linear_2(x)
        x = self.dropout(x)
        return x

In [26]:
feed_forward = FeedForward(config)
ff_outputs = feed_forward(attn_outputs)
ff_outputs.size()

torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])


torch.Size([1, 5, 768])

### Adding Layer Normalization

![Image Title](images/normalization.png)

*Different arrangements of layer normalization in a transformer encoder layer.*

In [27]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
        
    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        hidden_state = self.layer_norm_1(x)
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

In [28]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])


(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))

## Positional Encodings

Embeddings represent a token in a d-dimensional space where tokens with similar meaning are closer to one another. However, the embeddings do not encode the relative position of the tokens in a sentence.
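One way to see that plain self-attention ignores word order is to shuffle the input: the outputs are shuffled in exactly the same way, with nothing else changing. The sketch below is a small illustration in plain Python that uses `query = key = value` (as in the simplified example earlier) and hypothetical two-dimensional embeddings.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Scaled dot-product self-attention with query = key = value = xs."""
    d = len(xs[0])
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        w = softmax(scores)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, xs)) for i in range(d)])
    return outputs


seq = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # hypothetical token embeddings
perm = [2, 0, 1]                            # an arbitrary reordering
shuffled = [seq[i] for i in perm]

out = self_attention(seq)
out_shuffled = self_attention(shuffled)

# The output for the shuffled input equals the shuffled original output.
order_is_ignored = all(
    math.isclose(a, b, rel_tol=1e-9)
    for row_a, row_b in zip(out_shuffled, [out[i] for i in perm])
    for a, b in zip(row_a, row_b)
)
print(order_is_ignored)
```

Since each token's output depends only on the set of tokens and not on where they sit, position information has to be added explicitly.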

![image.png](images/Screenshot_from_2025-04-20_17-35-42.png)

### Relative Positional Embeddings

Formula:

![image.png](images/Screenshot_from_2025-04-20_17-36-11.png)

Positional encoding works because absolute position is less important than relative position. For instance, we don’t need to know that the word “good” is at index 6 and the word “looks” is at index 5. It’s sufficient to remember that the word “good” tends to follow the word “looks”.

Here’s a plot generated using a sequence length of 100 and embedding space of 512 dimensions:

![image.png](images/Screenshot_from_2025-04-20_17-36-24.png)

Roughly speaking, in the first dimension a value of 1 marks a token at an odd position and a value of 0 a token at an even position. In the d/2-th dimension, a value of 1 indicates that the token is in the second half of the sentence and a value of 0 that it is in the first half. The model can use this information to determine the relative position of the tokens.
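The formula above is only available as an image, so the sketch below assumes it is the sinusoidal encoding from the original Transformer paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$. The function name `sinusoidal_positional_encoding` is hypothetical, and plain Python lists are used to keep the arithmetic visible.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sine/cosine position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i runs over the even indices, so i / d_model equals 2i'/d
            # for the pair index i' used in the formula.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# Same sizes as the plot: 100 positions, 512 dimensions.
pe = sinusoidal_positional_encoding(seq_len=100, d_model=512)
print(len(pe), len(pe[0]))
print(pe[0][:4])
```

Low dimensions oscillate quickly with position, while high dimensions change very slowly, which is the pattern of stripes visible in the plot.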

### Absolute Positional Embeddings

Another approach is to use `nn.Embedding`, where each index corresponds to the positional index of a token in the sequence. The main advantage of this method is that the model can learn richer, data-driven positional representations. However, this also increases the risk of overfitting. A key limitation is that Transformers using absolute positional embeddings in this way can only handle sequences up to a fixed length—determined by the size of the positional embedding matrix.

**Final Encoding**

![image.png](images/Screenshot_from_2025-04-20_17-36-40.png)

In [29]:
class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size,
                                             config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()
        
    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        print(position_embeddings)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        embeddings = self.layer_norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [30]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

tensor([[[ 1.5975, -0.1457,  0.9248,  ..., -1.1966,  1.7868,  1.2085],
         [ 0.3623, -0.3844,  0.9222,  ..., -2.3493,  2.3304, -0.2547],
         [ 1.4564,  0.2533,  2.5097,  ...,  1.7681,  0.5663,  1.3367],
         [-2.1938,  0.1737, -0.5294,  ..., -0.2313,  0.3406,  1.2499],
         [ 1.6471,  0.2573, -0.2157,  ...,  2.0815,  0.4832, -0.0587]]],
       grad_fn=<EmbeddingBackward0>)


torch.Size([1, 5, 768])

In [31]:
config.max_position_embeddings

512

In [32]:
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config) for _ in range(config.num_hidden_layers)])
    def forward(self, x):
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        return x

In [33]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

tensor([[[-0.9258, -0.3330,  0.9727,  ...,  0.0857, -0.3978,  0.6130],
         [ 0.3336,  0.2114,  1.4433,  ..., -0.5778,  0.9534, -0.2460],
         [ 0.3560, -0.8761,  1.8398,  ..., -1.2096,  0.4889, -1.1113],
         [-0.1865,  1.6005,  2.3346,  ..., -0.5136,  0.5033,  0.8220],
         [ 0.3549,  0.2831,  1.2217,  ..., -0.3101,  0.6883, -0.4330]]],
       grad_fn=<EmbeddingBackward0>)
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])


torch.Size([1, 5, 768])

In [34]:
# For classification tasks, it is common practice to just use the hidden state associated with the [CLS] token as the input feature.

class TransformerForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.encoder = TransformerEncoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        x = self.encoder(x)[:, 0, :] # select hidden state of [CLS] token
        x = self.dropout(x)
        x = self.classifier(x)
        return x

In [35]:
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

tensor([[[ 0.0514, -1.3353,  0.2038,  ...,  0.7851, -0.9172,  0.1367],
         [ 0.0675,  1.1968, -1.0047,  ...,  0.5052, -0.1728, -1.0372],
         [-0.1245, -0.2202,  0.1149,  ...,  0.0940,  1.7403,  0.6007],
         [-0.0252,  1.1386,  1.8956,  ...,  0.4091, -0.2178, -0.8104],
         [ 0.3177, -0.0576,  0.1077,  ...,  1.0879, -0.1229,  0.3207]]],
       grad_fn=<EmbeddingBackward0>)
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])
torch.Size([1, 5, 768])
torch.Size([1, 5, 3072])


torch.Size([1, 3])

## Transformer Decoder

![Image Title](images/decoder.png)

*Zooming into the Transformer decoder layer*

The decoder’s job is to generate text. The decoder has similar hidden layers to the encoder. However, unlike the encoder, the decoder’s output is sent to a softmax layer in order to compute the probability of the next word in the sequence.

![image.png](images/Screenshot_from_2025-04-20_17-46-34.png)

The figure shows in detail which sequences the keys, queries, and values come from. The idea is explained below using a translation task:



So assume we have access to the WHOLE input text sequence in English and we want to translate it to German:
1. We will do it word by word: we compute the first word of our German translation, then the second one, and so on. Note that by doing it this way, our German translation does not need to have the same number of words as the English one, nor does it need to follow its word order. This kind of flexibility is the main advantage of cross-attention.
2. So, to get our first German word we generate a "query" vector (out of the blue), which in some sense means something like: "Which words in the input sequence should I pay attention to in order to make a good guess about how to begin my German translation?".
3. In the meantime, another neural network has produced a "key" vector and a "value" vector for each English word, taking into account its position in the input sentence.
4. Now you compare your "query" vector from step 2 to each of those "keys". In essence, every "key" encodes some information about its corresponding English word, that could be relevant to different queries. For example, in this particular case, each "key" has some information about HOW IMPORTANT the corresponding English word is to figure out the FIRST word in our German translation. In other words, after comparing our first "query" to ALL the "keys", we end up with a set of weights for every corresponding input English word.
5. Now that we've identified English words in the input sequence that are important to generate our first German word, we multiply the weights obtained in the previous step by "values" of the corresponding English words (obtained in step 3) and add them up. In essence, "keys" were telling us WHICH English words are important with respect to our "query", while "values" of those words contain information about WHAT exactly needs to be extracted from those relevant words. 
6. Now we transform the vector obtained in step 5 (sum of "values" of relevant English words, according to our first "query") in some fixed way to arrive at our first German word. Hurrah!
7. Now that we have our first German word, we encode it in some way and use it to produce our "query" #2 (i.e. not out of the blue this time). This "query" will help us search for relevant words in the input English sequence to generate the second German word, while also taking into account the first German word we generated earlier.
8. You get the idea: repeat steps 4-6 with "query" #2, then "query" #3 and so on, until we've generated every word of our output German translation.

Detailed explanation - https://vaclavkosar.com/ml/cross-attention-in-transformer-architecture

![image.png](images/Screenshot_from_2025-04-20_17-47-28.png)

## Decoder Input Embeddings & Positional Encoding

The decoder is autoregressive, meaning that it predicts future values based on previous values. To be exact, the decoder predicts the next token in the sequence by looking at the encoder’s output and self-attending to its own previous output. Just like we did with the encoder, we add the positional encodings to the word embeddings to capture the position of the tokens in the sentence. **Pay attention to the phrase "shifted right" in the image.**
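"Shifted right" can be illustrated with a toy target sentence. In the sketch below the markers `<s>` and `</s>` and the German tokens are hypothetical placeholders; the point is only that the decoder input is the target sequence moved one position to the right, so that at every step the model sees the tokens before the one it has to predict.

```python
BOS, EOS = "<s>", "</s>"  # hypothetical start/end-of-sequence markers

target = ["Ich", "bin", "ein", "Student"]  # hypothetical German target tokens

decoder_input = [BOS] + target   # what the decoder is fed (shifted right)
decoder_labels = target + [EOS]  # what it should predict at each position

for step, label in enumerate(decoder_labels):
    # At step t the decoder may use decoder_input[0..t] to predict label t.
    print(step, decoder_input[: step + 1], "->", label)
```

The two lists have the same length, and every label is the input token of the following step.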

![image.png](images/Screenshot_from_2025-04-20_17-49-43.png)

## Masking

Since the decoder is trying to generate the sequence word by word, a look-ahead mask is used to indicate which entries should not be used. For example, when predicting the third token in the sentence, only the previous tokens, that is, the first and second tokens, should be used.
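A look-ahead mask is usually applied by setting the scores of future positions to $-\infty$ before the softmax, so that their weights become exactly zero. The sketch below shows this in plain Python with hypothetical uniform scores for a sequence of four tokens; in PyTorch the same idea is commonly written with `torch.tril` and `masked_fill`.

```python
import math


def softmax(xs):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


n = 4
raw = [[0.0] * n for _ in range(n)]  # hypothetical raw attention scores

# Position i may attend to positions 0..i only (lower-triangular mask).
causal_mask = [[j <= i for j in range(n)] for i in range(n)]

masked = [
    [s if keep else float("-inf") for s, keep in zip(score_row, mask_row)]
    for score_row, mask_row in zip(raw, causal_mask)
]
attn = [softmax(row) for row in masked]

for row in attn:
    print([round(w, 2) for w in row])
```

Each row still sums to one, but all the weight is spread over the current and earlier positions, and later positions receive none.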

![image.png](images/Screenshot_from_2025-04-20_17-52-02.png)

## Output

As mentioned previously, the output of the hidden layers goes through a final softmax layer. If we have a vocabulary of 10,000 words, then the output of the classifier will be a vector of length 10,000 where the value at each index is the probability that the word associated with that index is the next word in the sequence.

![image.png](images/Screenshot_from_2025-04-20_17-53-07.png)

We take the word with the highest probability and append it to the sequence that is fed to the decoder in the next generation step.
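This "pick the most likely word and feed it back" loop is known as greedy decoding. The sketch below is a toy version: `toy_next_token_probs` is a hypothetical stand-in for the decoder plus softmax, backed by a hand-written probability table instead of a real model.

```python
def toy_next_token_probs(prefix):
    """Hypothetical stand-in for decoder + softmax: P(next token | prefix)."""
    table = {
        ("<s>",): {"time": 0.7, "fruit": 0.2, "flies": 0.05, "</s>": 0.05},
        ("<s>", "time"): {"flies": 0.8, "time": 0.1, "fruit": 0.05, "</s>": 0.05},
        ("<s>", "time", "flies"): {"</s>": 0.9, "time": 0.05, "fruit": 0.03, "flies": 0.02},
    }
    return table[tuple(prefix)]


def greedy_decode(next_token_probs, bos="<s>", eos="</s>", max_len=10):
    """Repeatedly append the most probable next token until EOS or max_len."""
    seq = [bos]
    while len(seq) < max_len:
        probs = next_token_probs(seq)
        best = max(probs, key=probs.get)  # argmax over the vocabulary
        seq.append(best)
        if best == eos:
            break
    return seq


decoded = greedy_decode(toy_next_token_probs)
print(decoded)
```

Greedy decoding is the simplest strategy; the sampling methods linked in the TODO below replace the `max` step with other selection rules.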

> **TODO**: Read about other sampling methods - https://medium.com/nlplanet/two-minutes-nlp-most-used-decoding-methods-for-language-models-9d44b2375612

## Whisper summary

![Image Title](images/whisper_tasks.png)

*Whisper tasks.*

![Image Title](images/training_data.png)

*Training data. Of those 680k hours of audio, 117k hours cover 96 other languages.*

### Approach

**Data-processing**

1. No specific data pre-processing applied.
2. Developed several automated filtering methods to improve transcript quality.
3. Developed many heuristics to detect and remove machine-generated transcripts from the training dataset.
    1. An all-uppercase or all-lowercase transcript is very unlikely to be human-generated.
4. Use an audio language detector to ensure that the spoken language matches the language of the transcript according to CLD2.
5. Break audio files into 30-second segments paired with the subset of the transcript that occurs within that time segment.
6. Train on all audio, including segments with no speech (though with sub-sampled probability), and use these segments as training data for voice activity detection.
7. After training an initial model, they aggregated information about its error rate on training data sources. They manually inspected these data sources, sorting by a combination of high error rates and data source size to identify and remove low-quality ones efficiently.

**Model**

1. Using an off-the-shelf architecture avoids confounding their findings with model improvements.
2. Choose an encoder-decoder Transformer.
3. All audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram representation is computed on 25 millisecond windows with a stride of 10 milliseconds.
4. They globally scale the input between -1 and 1 with approximately zero mean across the pre-training dataset for feature normalization.
5. They use the same byte-level BPE text tokenizer used in GPT-2 for the English-only models and refit the vocabulary (but keep the same size) for the multilingual models to avoid excessive fragmentation on other languages since the GPT-2 BPE vocabulary is English-only.
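The numbers in item 3 determine the size of the encoder input. The arithmetic below is a rough sketch: it assumes the 30-second segments from the data-processing list and that the audio is padded so that every stride yields one frame; the variable names are illustrative.

```python
sample_rate = 16_000   # Hz, after re-sampling
segment_seconds = 30   # length of one audio segment
window_ms = 25         # analysis window
stride_ms = 10         # hop between windows
n_mels = 80            # Mel channels

samples_per_segment = sample_rate * segment_seconds  # samples in 30 s
window_samples = sample_rate * window_ms // 1000     # samples per window
stride_samples = sample_rate * stride_ms // 1000     # samples per hop

# Assumption: padding makes the frame count equal to samples / hop.
n_frames = samples_per_segment // stride_samples

print(samples_per_segment, window_samples, stride_samples)
print("spectrogram shape:", (n_mels, n_frames))
```

Under these assumptions each 30-second segment becomes an 80 x 3000 log-Mel spectrogram.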

![Image Title](images/whisper_architecture.png)

*Whisper architecture.*

**Multitask Format**

1. Since their decoder is an audio-conditional language model, they also train it to condition on the history of the transcript's text in the hope that it will learn to use longer-range text context to resolve ambiguous audio.
2. Specifically, with some probability, they add the transcript text preceding the current audio segment to the decoder's context.
3. A sequence-to-sequence Transformer model is trained on many different speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing for a single model to replace many stages of a traditional speech-processing pipeline.

![Image Title](images/whisper_multitask.png)

# Homework

Theory (4 points):
- Follow links.
- Try to fill/do **TODO** and **Explore** comments.
- Answer theory questions in the Google Form.

Practice (14 points):

Fine-tune a Whisper model on the [Toronto dataset](https://drive.google.com/file/d/1j9d91QqE7_WnOnmEmidtOG55tpmxQUeJ/view)

1. Select an appropriate pretrained Whisper checkpoint from [huggingface](https://huggingface.co/models?sort=trending&search=whisper)
2. Check some public resources about [How To Tune Whisper](https://huggingface.co/blog/fine-tune-whisper). To get the highest score, however, you need to re-implement the fine-tuning using Lightning.
3. Fine-tune the Whisper model on part of the Toronto dataset. You can also use other Ukrainian Speech2Text (or Text2Speech) datasets; for example, you can find a bunch of them [here](https://huggingface.co/Yehor)
4. Do not use the following part of the Toronto dataset in either training or validation; use it only for testing your final model. Use CER and WER metrics (and others if you want to).

```
test_lines = [
    'toronto_27',
    'toronto_46',
    'toronto_42',
    'toronto_37',
    'toronto_89',
    'toronto_43',
    'toronto_157',
    'toronto_9',
    'toronto_156',
    'toronto_7',
    'toronto_123',
    'toronto_54',
    'toronto_67',
    'toronto_62',
    'toronto_81',
    'toronto_134',
    'toronto_148',
    'toronto_21',
    'toronto_135',
    'toronto_166',
    'toronto_58'
]
```
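For reference, the WER and CER metrics mentioned in step 4 are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is a minimal pure-Python version meant only to show the definition; the function names are illustrative, it assumes a non-empty reference, and it applies no text normalization (casing, punctuation), which a real evaluation would typically need.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


print(wer("the cat sat on the mat", "the cat sit on mat"))
print(cer("kitten", "sitting"))
```

Lower is better for both metrics, and either can exceed 1.0 when the hypothesis contains many insertions.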