In [7]:
# Load corporate proxy configuration
import sys
sys.path.insert(0, '..')
try:
    from _proxy_config import *
except ImportError:
    print("Warning: _proxy_config.py not found. Proxy settings may not be configured.")
except Exception as e:
    print(f"Error loading proxy configuration: {e}")

# Attention mechanisms

Before modern LLMs, language tasks were handled by other neural network architectures such as recurrent neural networks (RNNs). Some of these tasks, such as translation, relied on two components: an encoder and a decoder.

The encoder received text in the source language and compressed it into a hidden state. The decoder then took that hidden state and generated the text in the target language.

However, translation cannot be done word by word. Languages differ in grammatical structure: sentence elements can appear in a different order, and a single word in one language may require several words in another (and vice versa).

These RNNs worked well enough for short sentences, but the decoder could not look back at individual words of the input: it only saw the hidden state handed over by the encoder, which made the models unsuitable for longer texts. The Bahdanau attention mechanism was proposed in 2014 to address this, giving the RNN decoder the ability to selectively access different parts of the input sequence at each decoding step.



## 1. Self-attention

Bahdanau attention inspired the self-attention mechanism used in the transformer architecture.

Self-attention lets the model compute attention weights that relate different positions within the same input sequence. In traditional attention mechanisms, the weights instead relate elements of two different sequences, for example an input sequence and an output sequence.
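The difference between the two settings can be sketched with plain similarity scores. This is a minimal illustration, not a full attention layer: it assumes dot products as the similarity measure, and the `source` and `target` embeddings, as well as the helper names `dot` and `score_matrix`, are hypothetical values and names made up for this example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_matrix(queries, keys):
    """scores[i][j] is the similarity between queries[i] and keys[j]."""
    return [[dot(q, k) for k in keys] for q in queries]


# Hypothetical 3-dimensional embeddings, made up for illustration only
source = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.2], [0.3, 0.2, 0.1]]  # 3 input tokens
target = [[0.2, 0.3, 0.1], [0.1, 0.4, 0.2]]                   # 2 output tokens

# Self-attention: every position of a sequence is compared with
# every position of the SAME sequence (3 x 3 scores)
self_scores = score_matrix(source, source)

# Attention across two sequences: each output position is compared
# with every input position (2 x 3 scores)
cross_scores = score_matrix(target, source)

print(len(self_scores), "x", len(self_scores[0]))
print(len(cross_scores), "x", len(cross_scores[0]))
```

In the self-attention case the score matrix is square, because queries and keys come from the same sequence.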

In [53]:
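# Diagram: each token is mapped to an embedding, and all six
# embeddings contribute to a single context vector (z2 is shown)
import mermaid as md
from mermaid.graph import Graph
sequence = Graph('Self-attention', """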
block
    columns 1
                 
    block:input:1
        columns 6
        x1["The"] x2["dog"] x3["attacks"] x4["the"] x5["wild"] x6["cat"]
    end
    space
    block:embeddings:1
        block:embedding1:1
            columns 3
            x1d1["0.1"] x1d2["0.2"] x1d3["0.3"]
        end
        block:embedding2:1
            columns 3
            x2d1["0.4"] x2d2["0.1"] x2d3["0.2"]
        end
        block:embedding3:1
            columns 3
            x3d1["0.3"] x3d2["0.2"] x3d3["0.1"]
        end
        block:embedding4:1
            columns 3
            x4d1["0.2"] x4d2["0.3"] x4d3["0.1"]
        end
        block:embedding5:1
            columns 3
            x5d1["0.1"] x5d2["0.4"] x5d3["0.2"]
        end
        block:embedding6:1
            columns 3
            x6d1["0.3"] x6d2["0.1"] x6d3["0.4"]
        end
    end
    x1 --> embedding1
    x2 --> embedding2
    x3 --> embedding3
    x4 --> embedding4
    x5 --> embedding5
    x6 --> embedding6
    space
    block:context_vectors:1
        columns 6
        block:z1:1
            columns 3
            z1d1["0.2"] z1d2["0.3"] z1d3["0.4"]
        end
        block:z2:1
            columns 3
            z2d1["0.1"] z2d2["0.4"] z2d3["0.3"]
        end
        block:z3:1
            columns 3
            z3d1["0.4"] z3d2["0.2"] z3d3["0.1"]
        end
        block:z4:1
            columns 3
            z4d1["0.3"] z4d2["0.1"] z4d3["0.2"]
        end
        block:z5:1
            columns 3
            z5d1["0.2"] z5d2["0.4"] z5d3["0.1"]
        end
        block:z6:1
            columns 3
            z6d1["0.1"] z6d2["0.2"] z6d3["0.4"]
        end
    end
    embedding1 --> z2
    embedding2 --> z2
    embedding3 --> z2
    embedding4 --> z2
    embedding5 --> z2
    embedding6 --> z2
""")
render = md.Mermaid(sequence)
render

### 1.1. Simplified self-attention

Self-attention transforms each input embedding (a vector of N dimensions) into another vector called a context vector.

#### Context vectors

A context vector can be thought of as an "enriched" version of an input token embedding. Each context vector represents one of the embeddings in the input sequence, but it also incorporates information from all the other tokens in that sequence.
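As a rough sketch of how one embedding can be enriched with information from all the others, the code below builds a context vector for the second token ("dog") as a weighted sum of every embedding in the sequence. The embeddings are the illustrative values from the diagram above. Using dot products as scores and softmax to normalize them is an assumption made here for the example, and the helper names (`dot`, `softmax`, `context_vector`) are hypothetical. The result need not match the context values drawn in the diagram.

```python
import math

# Embeddings copied from the diagram above (illustrative values),
# one row per token of "The dog attacks the wild cat"
embeddings = [
    [0.1, 0.2, 0.3],  # The
    [0.4, 0.1, 0.2],  # dog
    [0.3, 0.2, 0.1],  # attacks
    [0.2, 0.3, 0.1],  # the
    [0.1, 0.4, 0.2],  # wild
    [0.3, 0.1, 0.4],  # cat
]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(values):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(values)  # subtracted for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(query_index, embeddings):
    """Weighted sum of all embeddings, weighted by similarity to one token."""
    query = embeddings[query_index]
    # Assumption: similarity is measured with a dot product
    scores = [dot(query, x) for x in embeddings]
    weights = softmax(scores)
    dims = len(query)
    context = [
        sum(w * x[d] for w, x in zip(weights, embeddings))
        for d in range(dims)
    ]
    return context, weights


z2, weights = context_vector(1, embeddings)  # index 1 is "dog"
print("weights:", [round(w, 3) for w in weights])
print("z2:", [round(v, 3) for v in z2])
```

The context vector has the same number of dimensions as the input embedding, and every token in the sequence contributes to it in proportion to its weight.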

#### Attention scores
