#### -----
#### Build a Large Language Model
#### Sebastian Raschka
#### -----

# Coding attention mechanisms

  - This chapter focuses on coding attention mechanisms, an integral part of the LLM architecture. The remaining parts of the LLM that surround self-attention, which are needed to see it in action and to create a model that generates text, are coded afterwards.
  - Different types of attention mechanisms
    - Simplified self-attention: Self-attention without trainable weights, used to introduce the general idea
    - Self-attention: Simplified self-attention with trainable weights
    - Causal attention: Type of self-attention used in LLMs that allows a model to consider only previous and current inputs in a sequence
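    - Multi-head attention: Extension of self-attention and causal attention that lets the model attend to information from different representation subspaces in parallel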

## 3.1 - The problem with modeling long sequences

  - You can't simply translate a text word by word into another language, because the grammatical structures of the source and target languages differ.
  - To address this, it is common to use a deep neural network with two submodules, an encoder and a decoder:
    - Encoder: first reads in and processes the entire text
    - Decoder: then produces the translated text
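  - Before transformers, recurrent neural networks (RNNs) were the most popular encoder-decoder architecture for language translation.
  - The limitation of encoder-decoder RNNs is that the decoder cannot directly access earlier hidden states of the encoder. It relies on the final hidden state alone, which must summarize the whole input, so context can be lost in long sequences.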
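
To make the encoder half of this architecture concrete, here is a minimal sketch that uses PyTorch's built-in `torch.nn.RNN` as a stand-in encoder. The layer sizes and the random inputs are illustrative assumptions, not values from the chapter. It shows that the final hidden state handed to a decoder has the same fixed size no matter how long the input sequence is.

```python
import torch

torch.manual_seed(123)

# Hypothetical sizes, chosen only for illustration
embedding_dim, hidden_dim = 3, 4
encoder = torch.nn.RNN(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

short_seq = torch.rand(1, 6, embedding_dim)   # batch of 1, 6 tokens
long_seq = torch.rand(1, 60, embedding_dim)   # batch of 1, 60 tokens

# nn.RNN returns (hidden states for every step, final hidden state)
_, final_short = encoder(short_seq)
_, final_long = encoder(long_seq)

print(final_short.shape)  # torch.Size([1, 1, 4])
print(final_long.shape)   # torch.Size([1, 1, 4])
```

In a plain encoder-decoder RNN, everything the decoder learns about the input has to pass through that single vector, whether the input has 6 tokens or 60.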

## 3.2 - Capturing data dependencies with attention mechanisms

  - Bahdanau attention mechanism : modifies the encoder-decoder RNN such that the decoder can selectively access different parts of the input sequence at each decoding step.
  - Self-attention is a mechanism that allows each position in the input sequence to attend to all positions in the same sequence when computing the representation of that sequence.

## 3.3 - Attending to different parts of the input with self-attention

  - In self-attention, the "self" refers to the mechanism's ability to compute attention weights by relating different positions within a single input sequence.

### 3.3.1 - A simple self-attention mechanism without trainable weights

  - In self-attention, our goal is to calculate a context vector z^(i) for each element x^(i) in the input sequence. A context vector can be interpreted as an enriched embedding vector.
  - Context vectors play a crucial role in self-attention. Their purpose is to create enriched representations of each element in an input sequence by incorporating information from all other elements in the sequence.
  - The first step of implementing self-attention is to compute the intermediate values w, referred to as attention scores, as the dot product between the query (here the second input token, x^(2)) and every input token.
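
As a compact roadmap for the cells that follow, the three steps for the second input token can be written out as below. The notation is an assumption made for this summary: the attention scores w are written as ω, the normalized attention weights as α, the context vector as z, and T is the number of input tokens (6 in the example).

```latex
\begin{aligned}
\omega_{2i} &= x^{(i)} \cdot x^{(2)} = \sum_{k} x^{(i)}_k \, x^{(2)}_k
  && \text{(attention scores)} \\
\alpha_{2i} &= \frac{\exp(\omega_{2i})}{\sum_{j=1}^{T} \exp(\omega_{2j})}
  && \text{(attention weights via softmax)} \\
z^{(2)} &= \sum_{i=1}^{T} \alpha_{2i} \, x^{(i)}
  && \text{(context vector)}
\end{aligned}
```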

In [8]:
# Six input tokens, each with a small 3-dimensional embedding for illustration
import torch
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your       (x^1)
     [0.55, 0.87, 0.66],  # journey    (x^2)
     [0.57, 0.85, 0.64],  # starts     (x^3)
     [0.22, 0.58, 0.33],  # with       (x^4)
     [0.77, 0.25, 0.10],  # one        (x^5)
     [0.05, 0.80, 0.55]]  # step       (x^6)
)

query = inputs[1]  # the 2nd input token ("journey") serves as the query
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
attn_scores_2

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
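
The dot product used above is just an element-wise multiplication followed by a sum. A minimal sketch that computes the first attention score by hand and compares it with `torch.dot` (the names `x_1` and `manual` are introduced here for illustration only):

```python
import torch

x_1 = torch.tensor([0.43, 0.15, 0.89])    # "Your"    (x^1)
query = torch.tensor([0.55, 0.87, 0.66])  # "journey" (x^2)

# Dot product by hand: multiply element-wise, then sum
manual = torch.tensor(0.0)
for a, b in zip(x_1, query):
    manual += a * b

print(manual)
print(torch.dot(x_1, query))
```

Both values should agree with the first entry of `attn_scores_2` (0.9544). A larger dot product means the two vectors are more closely aligned, which is why it works as a similarity score here.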

In [9]:
# The next step is to normalize the attention scores computed previously so
# that the resulting attention weights sum to 1
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights: ", attn_weights_2_tmp)
print("Sum: ", attn_weights_2_tmp.sum())

Attention weights:  tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum:  tensor(1.0000)


In [10]:
#   In practice, it is more common to use the softmax function for
# normalization. It handles extreme values better and ensures that the
# attention weights are always positive.
def softmax_naive(x):
  return torch.exp(x) / torch.exp(x).sum(dim = 0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights: ", attn_weights_2_naive)
print("Sum: ", attn_weights_2_naive.sum())

Attention weights:  tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum:  tensor(1.)


In [12]:
#   It's advisable to use PyTorch's implementation of softmax, which is
# optimized and guards against overflow and underflow for large or small inputs
attn_weights_2 = torch.softmax(attn_scores_2, dim = 0)
print("Attention weights: ", attn_weights_2)
print("Sum: ", attn_weights_2.sum())

Attention weights:  tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum:  tensor(1.)
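
To see the overflow problem in practice, the sketch below feeds deliberately large scores to the naive softmax. `softmax_shifted` is a hypothetical helper that uses one common stabilization trick, subtracting the maximum before exponentiating. It is shown as an illustration and is not a claim about how `torch.softmax` is implemented internally.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_shifted(x):
    # Hypothetical helper: subtracting the max leaves the result unchanged
    # mathematically but keeps every exponent <= 0, so exp() cannot overflow
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

print(softmax_naive(large_scores))          # exp overflows to inf, and inf / inf gives nan
print(softmax_shifted(large_scores))        # finite weights that sum to 1
print(torch.softmax(large_scores, dim=0))   # matches the shifted version
```

For the small attention scores in this chapter all three versions give the same weights. The difference only appears once the scores become large in magnitude.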


In [13]:
#   After computing the normalized attention weights, we are ready to
# calculate the context vector z^(2) by multiplying each embedded input token
# x^(i) by its attention weight and summing the resulting vectors
query = inputs[1]
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2_naive[i] * x_i
print(context_vec_2)

tensor([0.4419, 0.6515, 0.5683])
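
The weighted sum in the loop above can also be written as a single vector-matrix product. A minimal sketch that repeats the inputs so it runs on its own (the names `context_loop` and `context_matmul` are introduced here for illustration):

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your       (x^1)
     [0.55, 0.87, 0.66],  # journey    (x^2)
     [0.57, 0.85, 0.64],  # starts     (x^3)
     [0.22, 0.58, 0.33],  # with       (x^4)
     [0.77, 0.25, 0.10],  # one        (x^5)
     [0.05, 0.80, 0.55]]  # step       (x^6)
)
query = inputs[1]

# inputs @ query computes all six dot products with the query at once
attn_weights_2 = torch.softmax(inputs @ query, dim=0)

# Loop version: accumulate weight * token
context_loop = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_loop += attn_weights_2[i] * x_i

# Vectorized version: (6,) @ (6, 3) -> (3,)
context_matmul = attn_weights_2 @ inputs

print(context_loop)
print(context_matmul)
```

Both should reproduce the context vector printed above, approximately `[0.4419, 0.6515, 0.5683]`.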


### 3.3.2 - Computing attention weights for all input tokens

In [14]:
#   Let's extend this computation to calculate attention weights and context
# vectors for all inputs
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
  for j, x_j in enumerate(inputs):
    attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [15]:
#   Each element in the preceding tensor represents the unnormalized attention
# score between a pair of inputs. Python for loops are generally slow, so we
# can compute the same scores with a single matrix multiplication.
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
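
As a quick sanity check, the nested-loop result and the matrix multiplication can be compared programmatically instead of by eye. This is a minimal sketch that repeats the inputs so it runs on its own (the names `loop_scores` and `matmul_scores` are introduced here for illustration):

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your       (x^1)
     [0.55, 0.87, 0.66],  # journey    (x^2)
     [0.57, 0.85, 0.64],  # starts     (x^3)
     [0.22, 0.58, 0.33],  # with       (x^4)
     [0.77, 0.25, 0.10],  # one        (x^5)
     [0.05, 0.80, 0.55]]  # step       (x^6)
)

# Pairwise dot products with explicit loops
num_tokens = inputs.shape[0]
loop_scores = torch.empty(num_tokens, num_tokens)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        loop_scores[i, j] = torch.dot(x_i, x_j)

# The same scores as one matrix multiplication: (6, 3) @ (3, 6) -> (6, 6)
matmul_scores = inputs @ inputs.T

print(torch.allclose(loop_scores, matmul_scores))
print(torch.allclose(matmul_scores, matmul_scores.T))
```

The second check reflects that the dot product is commutative, so in this simplified setting the score matrix is symmetric.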


In [16]:
# Normalize each row of attention scores so that its values sum to 1
attn_weights = torch.softmax(attn_scores, dim = 1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


In [17]:
#   The softmax call above uses dim = 1, which for this 2D tensor is the same
# as dim = -1: it instructs the softmax function to apply normalization along
# the last dimension of the attn_scores tensor, so the values in each row sum to 1.

# Verify that the rows indeed sum to 1:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum: ", row_2_sum)
print("All row sums: ", attn_weights.sum(dim = -1))

Row 2 sum:  1.0
All row sums:  tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
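
The effect of the `dim` argument is easier to see on a small non-symmetric tensor. The values in `scores` below are made up for illustration and are not from the chapter.

```python
import torch

scores = torch.tensor([[1.0, 2.0, 3.0],
                       [1.0, 1.0, 1.0]])

row_wise = torch.softmax(scores, dim=-1)  # normalize within each row
col_wise = torch.softmax(scores, dim=0)   # normalize within each column

print(row_wise.sum(dim=-1))  # each of the 2 rows sums to 1
print(col_wise.sum(dim=0))   # each of the 3 columns sums to 1

# For a 2D tensor, dim=1 and dim=-1 refer to the same (last) dimension
print(torch.equal(torch.softmax(scores, dim=1), row_wise))
```

For attention weights the row-wise version is the one wanted, because each row holds the scores of one query against all inputs.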


In [18]:
#   In the third and final step, we use these attention weights to compute all
# context vectors via matrix multiplication
all_context_vectors = attn_weights @ inputs
print(all_context_vectors)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


In [19]:
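# We can double-check that the code is correct by comparing the 2nd row of
# all_context_vectors above with the context vector z^(2) computed previously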
print("Previous 2nd context vector: ", context_vec_2)

Previous 2nd context vector:  tensor([0.4419, 0.6515, 0.5683])
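
The three steps of this section (scores, weights, context vectors) can be collected into one small function. This is a sketch: the name `simple_self_attention` is hypothetical and does not come from the chapter, and it covers only the simplified mechanism without trainable weights.

```python
import torch

def simple_self_attention(inputs):
    """Hypothetical helper: simplified self-attention without trainable weights."""
    attn_scores = inputs @ inputs.T                     # step 1: pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)   # step 2: row-wise normalization
    return attn_weights @ inputs                        # step 3: weighted sums of the inputs

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your       (x^1)
     [0.55, 0.87, 0.66],  # journey    (x^2)
     [0.57, 0.85, 0.64],  # starts     (x^3)
     [0.22, 0.58, 0.33],  # with       (x^4)
     [0.77, 0.25, 0.10],  # one        (x^5)
     [0.05, 0.80, 0.55]]  # step       (x^6)
)

all_context_vectors = simple_self_attention(inputs)
print(all_context_vectors.shape)  # one 3-dimensional context vector per token
print(all_context_vectors[1])     # should match context_vec_2 from earlier
```

The output has the same shape as the input, since each token embedding is replaced by a context vector of the same dimension.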


## 3.4 - Implementing self-attention with trainable weights

  -