
# Transformer Model and Positional Encodings

### Token to Embedding Layer
Tokens are converted into vectors via the embedding layer:
$$ \text{Token} \xrightarrow{\text{Embedding Layer}} \text{Embedding Vector} $$

Positional encodings are added to embeddings to retain word order information:
$$ \text{Embedding} + \text{Positional Encoding} = \text{Input Representation} $$

---

### Transformer Components
1. **Multi-Head Attention**:
   Lets every token attend to every other token, with all positions processed in parallel:
   $$
   \text{Query}, \text{Key}, \text{Value} \to \text{Attention Scores} \to \text{Weighted Sum}
   $$

2. **Feed-Forward Layers**:
   Fully connected layers applied to each token's vector independently:
   $$
   \text{Input} \to \text{Feed-Forward Neural Network} \to \text{Output}
   $$

3. **Layer Normalization**:
   Rescales each token's vector using the mean $\mu$ and standard deviation $\sigma$ of its features, which stabilizes training:
   $$
   \text{Output} = \frac{\text{Input} - \mu}{\sigma}
   $$

---

### Positional Encoding Calculation
Positional encodings use sine and cosine functions:
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$

Example for $d = 4$ and $pos = 100$:
$$
PE_{(100, 0)} = \sin\left(\frac{100}{10000^{0/4}}\right), \quad PE_{(100, 1)} = \cos\left(\frac{100}{10000^{0/4}}\right)
$$
$$
PE_{(100, 2)} = \sin\left(\frac{100}{10000^{2/4}}\right), \quad PE_{(100, 3)} = \cos\left(\frac{100}{10000^{2/4}}\right)
$$

---

### Key Points
- **Positional Encodings** add information about word order to embeddings.
- The Transformer processes all tokens in parallel.
- Each layer applies multi-head attention and then a feed-forward network, with a residual connection and layer normalization around each of the two.


# Transformer Model and Positional Encodings

### Token to Embedding Layer
Tokens are converted into vectors via the embedding layer:
$$ \text{Token} \xrightarrow{\text{Embedding Layer}} \text{Embedding Vector} $$

Positional encodings are added to embeddings to retain word order information:
$$ \text{Embedding} + \text{Positional Encoding} = \text{Input Representation} $$

---

### Transformer Components

#### 1. **Multi-Head Attention**
Multi-head attention calculates attention scores using:
$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
$$

The scaled dot-product attention mechanism:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
- $QK^T$: Measures similarity between queries and keys.
- $\sqrt{d_k}$: Scales the scores so that large dot products do not push the softmax into regions with vanishingly small gradients.
- softmax: Converts each row of scores into attention weights that sum to 1.

Multi-head attention combines multiple heads in parallel:
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)W^O
$$

---

#### 2. **Feed-Forward Network**
Each embedding is processed independently through a feed-forward neural network:
$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$

---

#### 3. **Residual Connections and Layer Normalization**
Residual connections stabilize learning by adding the input back to the processed output:
$$
\text{Output} = \text{LayerNorm}(\text{Input} + \text{Processed Output})
$$

---

### Positional Encoding Calculation

Positional encodings use sine and cosine functions to represent position information:
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$

Example for $d = 4$ and $pos = 100$:
$$
PE_{(100, 0)} = \sin\left(\frac{100}{10000^{0/4}}\right), \quad PE_{(100, 1)} = \cos\left(\frac{100}{10000^{0/4}}\right)
$$
$$
PE_{(100, 2)} = \sin\left(\frac{100}{10000^{2/4}}\right), \quad PE_{(100, 3)} = \cos\left(\frac{100}{10000^{2/4}}\right)
$$
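
The worked example above can be checked numerically. The following is a minimal sketch using only the standard library; the helper name `positional_encoding` is an assumption introduced for illustration and is not defined elsewhere in this notebook.

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal positional encoding for one position, as a list of length d (d assumed even)."""
    pe = []
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe.append(math.sin(angle))  # even index 2i
        pe.append(math.cos(angle))  # odd index 2i + 1
    return pe

pe_100 = positional_encoding(pos=100, d=4)
print([round(v, 4) for v in pe_100])  # [-0.5064, 0.8623, 0.8415, 0.5403]
```

The last two entries equal $\sin(1)$ and $\cos(1)$ because $10000^{2/4} = 100$, so the angle is $100 / 100 = 1$.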

---

### Why Positional Encodings Matter
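Transformers process all words in parallel and have no recurrence, so self-attention on its own treats the input as an unordered set: shuffling the tokens simply shuffles the outputs in the same way. Positional encodings give each token information about where it sits in the sequence, so that order-dependent relationships are preserved.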
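
A small experiment can make this concrete. The sketch below is an illustration under simplifying assumptions: it uses random vectors in place of real embeddings and a single attention head with identity projections ($Q = K = V = X$). It checks that reordering the input rows only reorders the output rows, which is why order information has to be injected separately.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))       # stand-in for 4 token embeddings without positions
perm = np.array([3, 2, 1, 0])     # reverse the token order

out_original = self_attention(X)
out_permuted = self_attention(X[perm])

# The permuted input yields exactly the permuted output: no order information is used.
print(np.allclose(out_permuted, out_original[perm]))  # True
```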

---

### Transformer Architecture Summary
- **Input Embeddings**: Convert tokens to dense vectors.
- **Positional Encodings**: Add positional information to embeddings.
- **Multi-Head Attention**: Enables the model to focus on different parts of the sequence.
- **Feed-Forward Layers**: Process embeddings independently.
- **Residual Connections**: Stabilize learning and mitigate vanishing gradients.
- **Layer Normalization**: Normalize outputs to improve stability.



# Encoding "I am a robot" with the Transformer Model

### Step 1: Tokenization
The sentence **"I am a robot"** is tokenized into individual tokens:
$$
\text{Tokens: } [\text{"I"}, \text{"am"}, \text{"a"}, \text{"robot"}]
$$

Each token is then mapped to a unique vector through an embedding layer:
$$
\text{Embedding}(\text{"I"}) = E_1, \quad \text{Embedding}(\text{"am"}) = E_2, \quad \text{Embedding}(\text{"a"}) = E_3, \quad \text{Embedding}(\text{"robot"}) = E_4
$$

Assume the embedding vectors are 4-dimensional ($d = 4$):
$$
E_1 = [1.0, 0.5, 0.3, 0.2], \quad E_2 = [0.8, 0.2, 0.5, 0.9], \quad E_3 = [0.4, 0.7, 0.6, 0.1], \quad E_4 = [0.9, 0.3, 0.8, 0.5]
$$
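
As a rough illustration of the lookup, an embedding layer can be viewed as a table indexed by token id. The sketch below assumes a hypothetical four-word vocabulary (`vocab`) and reuses the example vectors above as the rows of the table; in a real model the rows are learned parameters and the tokenizer is more elaborate than a whitespace split.

```python
import numpy as np

# Hypothetical vocabulary for illustration: token -> integer id
vocab = {"I": 0, "am": 1, "a": 2, "robot": 3}

# Embedding table: one row per token id (values copied from the example above)
embedding_table = np.array([
    [1.0, 0.5, 0.3, 0.2],  # "I"
    [0.8, 0.2, 0.5, 0.9],  # "am"
    [0.4, 0.7, 0.6, 0.1],  # "a"
    [0.9, 0.3, 0.8, 0.5],  # "robot"
])

tokens = "I am a robot".split()          # naive whitespace tokenization
token_ids = [vocab[t] for t in tokens]   # [0, 1, 2, 3]
embeddings = embedding_table[token_ids]  # row lookup, shape (4 tokens, d = 4)
print(embeddings.shape)  # (4, 4)
```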

---

### Step 2: Positional Encoding
Positional encodings are calculated using the following formulas:
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
$$
Here:
- $pos$ is the position of the token in the sequence (0-indexed).
- $d$ is the dimensionality of the embedding ($d = 4$).

#### Positional Encoding for Each Token
- For **"I"** at $pos = 0$:
$$
PE_0 = \left[\sin\left(\frac{0}{10000^{0/4}}\right), \cos\left(\frac{0}{10000^{0/4}}\right), \sin\left(\frac{0}{10000^{2/4}}\right), \cos\left(\frac{0}{10000^{2/4}}\right)\right] = [0, 1, 0, 1]
$$

- For **"am"** at $pos = 1$:
$$
PE_1 = \left[\sin\left(\frac{1}{10000^{0/4}}\right), \cos\left(\frac{1}{10000^{0/4}}\right), \sin\left(\frac{1}{10000^{2/4}}\right), \cos\left(\frac{1}{10000^{2/4}}\right)\right]
$$
Approximating values to four decimal places (note that $10000^{2/4} = 100$, so the last two angles are $pos / 100$):
$$
PE_1 \approx [0.8415, 0.5403, 0.0100, 1.0000]
$$

- For **"a"** at $pos = 2$:
$$
PE_2 = \left[\sin\left(\frac{2}{10000^{0/4}}\right), \cos\left(\frac{2}{10000^{0/4}}\right), \sin\left(\frac{2}{10000^{2/4}}\right), \cos\left(\frac{2}{10000^{2/4}}\right)\right]
$$
Approximating values:
$$
PE_2 \approx [0.9093, -0.4161, 0.0200, 0.9998]
$$

- For **"robot"** at $pos = 3$:
$$
PE_3 = \left[\sin\left(\frac{3}{10000^{0/4}}\right), \cos\left(\frac{3}{10000^{0/4}}\right), \sin\left(\frac{3}{10000^{2/4}}\right), \cos\left(\frac{3}{10000^{2/4}}\right)\right]
$$
Approximating values:
$$
PE_3 \approx [0.1411, -0.9900, 0.0300, 0.9996]
$$

---

### Step 3: Adding Positional Encodings to Embeddings
The final input to the Transformer is obtained by adding the positional encodings to the token embeddings:
$$
\text{Input Representation} = \text{Embedding Vector} + \text{Positional Encoding}
$$

#### Combined Representations:
- For **"I"**:
$$
\text{Input}_1 = E_1 + PE_0 = [1.0, 0.5, 0.3, 0.2] + [0, 1, 0, 1] = [1.0, 1.5, 0.3, 1.2]
$$

- For **"am"**:
$$
\text{Input}_2 = E_2 + PE_1 = [0.8, 0.2, 0.5, 0.9] + [0.8415, 0.5403, 0.0100, 1.0000] \approx [1.6415, 0.7403, 0.5100, 1.9000]
$$

- For **"a"**:
$$
\text{Input}_3 = E_3 + PE_2 = [0.4, 0.7, 0.6, 0.1] + [0.9093, -0.4161, 0.0200, 0.9998] \approx [1.3093, 0.2839, 0.6200, 1.0998]
$$

- For **"robot"**:
$$
\text{Input}_4 = E_4 + PE_3 = [0.9, 0.3, 0.8, 0.5] + [0.1411, -0.9900, 0.0300, 0.9996] \approx [1.0411, -0.6900, 0.8300, 1.4996]
$$

---

### Final Encoded Representation
The encoded input for the phrase **"I am a robot"** is:
$$
\text{Input} =
\begin{bmatrix}
1.0 & 1.5 & 0.3 & 1.2 \\
1.6415 & 0.7403 & 0.5100 & 1.9000 \\
1.3093 & 0.2839 & 0.6200 & 1.0998 \\
1.0411 & -0.6900 & 0.8300 & 1.4996
\end{bmatrix}
$$

This matrix is passed as the input to the Transformer layers.


# Processing "I am a robot" in the Transformer

### Step 1: Input to the Transformer
The final encoded representation of the input sentence **"I am a robot"** is:
$$
\text{Input} =
\begin{bmatrix}
1.0 & 1.5 & 0.3 & 1.2 \\
1.6415 & 0.7403 & 0.5100 & 1.9000 \\
1.3093 & 0.2839 & 0.6200 & 1.0998 \\
1.0411 & -0.6900 & 0.8300 & 1.4996
\end{bmatrix}
$$
Each row corresponds to one token, and each column corresponds to one dimension of the embedding space.
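
The input matrix can also be computed directly rather than by hand. The following is a minimal NumPy sketch, assuming the example embeddings $E_1, \dots, E_4$ and the sinusoidal formulas given earlier; the function name `sinusoidal_encoding` is introduced here for illustration only.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d):
    """Sinusoidal positional encodings of shape (seq_len, d); d is assumed even."""
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    two_i = np.arange(0, d, 2)[None, :]           # (1, d/2): the values 2i
    angles = positions / (10000 ** (two_i / d))   # (seq_len, d/2)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)  # even columns use sine
    pe[:, 1::2] = np.cos(angles)  # odd columns use cosine
    return pe

# Example embeddings for "I", "am", "a", "robot"
embeddings = np.array([
    [1.0, 0.5, 0.3, 0.2],
    [0.8, 0.2, 0.5, 0.9],
    [0.4, 0.7, 0.6, 0.1],
    [0.9, 0.3, 0.8, 0.5],
])

pe = sinusoidal_encoding(seq_len=4, d=4)
inputs = embeddings + pe  # one row per token, one column per dimension
print(np.round(inputs, 4))
```

Because $10000^{2/4} = 100$, the third column of `pe` is $\sin(pos / 100)$, which is roughly $0.01 \cdot pos$ for these small positions.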

This matrix is passed as input to the **Transformer layers**, starting with Multi-Head Attention.

---

### Step 2: Multi-Head Attention
Multi-head attention allows the model to focus on different parts of the sequence simultaneously. Here’s how it processes the input:

#### 2.1: Compute Query (Q), Key (K), and Value (V)
The input matrix is linearly projected into queries ($Q$), keys ($K$), and values ($V$) using learned weight matrices:
$$
Q = \text{Input} \cdot W^Q, \quad K = \text{Input} \cdot W^K, \quad V = \text{Input} \cdot W^V
$$
Assume $W^Q$, $W^K$, and $W^V$ are $4 \times 4$ matrices (matching the embedding dimension). After projection, we get:
$$
Q, K, V \in \mathbb{R}^{4 \times 4}
$$

#### 2.2: Compute Attention Scores
Attention scores are computed by taking the matrix product of $Q$ and $K^T$, then dividing by $\sqrt{d_k}$, the square root of the key dimension:
$$
\text{Attention Scores} = \frac{QK^T}{\sqrt{d_k}}
$$
With $d_k = 4$, the scores are divided by $\sqrt{4} = 2$.

#### 2.3: Apply Softmax
The attention scores are passed through a softmax function to normalize them into probabilities:
$$
\text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
$$

#### 2.4: Weighted Sum of Values
The attention probabilities are used to compute a weighted sum of the value vectors ($V$):
$$
\text{Output from Attention} = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
This operation creates a new set of vectors that combine information from all tokens.
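
Steps 2.1 through 2.4 can be sketched for a single head as follows. This is an illustrative NumPy version, not a production implementation: the weight matrices are random stand-ins for learned parameters, `X` is a random stand-in for the $4 \times 4$ input matrix, and the names `W_q`, `W_k`, `W_v` are assumptions.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n_tokens, n_tokens) similarity scores
    weights = softmax(scores)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))  # stand-in for the input matrix (4 tokens, d = 4)
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))  # random, not learned

output, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(output.shape)           # (4, 4)
print(weights.sum(axis=-1))   # each entry is 1 up to floating-point error
```

Row $i$ of `weights` says how much token $i$ draws from every token's value vector.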

#### 2.5: Multi-Head Attention
Multiple attention heads compute this process independently, and their results are concatenated:
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)W^O
$$

Here $W^O$ is another learned weight matrix.
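
A compact way to sketch the multi-head computation is to project once and then split the projected columns into heads, which is equivalent to giving each head its own smaller projection matrices. The code below is a hedged illustration: it assumes two heads, a model dimension divisible by the number of heads, and random weights in place of learned ones.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention; d_model must be divisible by num_heads."""
    n_tokens, d_model = X.shape
    d_k = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_k, (h + 1) * d_k)  # columns belonging to head h
        scores = Q[:, cols] @ K[:, cols].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, cols])
    # Concatenate the heads along the feature axis, then mix them with W_o
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))  # stand-in for the input matrix
W_q, W_k, W_v, W_o = (rng.normal(size=(4, 4)) for _ in range(4))

mha_output = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
print(mha_output.shape)  # (4, 4)
```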

---

### Step 3: Residual Connection and Layer Normalization
The output of the multi-head attention layer is added back to the original input (residual connection):
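$$
\text{Residual Output} = \text{Input} + \text{Attention Output}
$$

Then, layer normalization is applied to each token's vector to stabilize training, where $\mu$ and $\sigma$ are the mean and standard deviation of that vector's features:
$$
\text{Normalized Output} = \frac{\text{Residual Output} - \mu}{\sigma}
$$
In practice, a small constant is added to the variance for numerical stability, and the result is rescaled by a learned gain and bias.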
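
The residual connection and normalization can be sketched in a few lines of NumPy. The learned scale `gamma`, learned shift `beta`, and the small constant `eps` are standard additions in practical implementations and are assumptions relative to the simplified formula above; with `gamma = 1` and `beta = 0` the code reduces to that formula up to `eps`. The inputs are random stand-ins.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) to zero mean and unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
layer_input = rng.normal(size=(4, 4))       # stand-in for the layer input
attention_output = rng.normal(size=(4, 4))  # stand-in for the attention output

gamma, beta = np.ones(4), np.zeros(4)       # identity scale and shift
residual = layer_input + attention_output   # residual connection
normalized = layer_norm(residual, gamma, beta)

print(np.round(normalized.mean(axis=-1), 6))  # approximately 0 for every token
```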

---

### Step 4: Feed-Forward Network
The normalized output is passed through a position-wise feed-forward network:
$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$
This network applies two linear transformations with a ReLU activation function in between. Each token is processed independently.
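
A minimal sketch of the position-wise feed-forward network follows. The hidden width `d_ff = 16` and the random weights are assumptions chosen for illustration; the final check shows that a token processed on its own gets the same result as when it is processed with the rest of the sequence.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row of x."""
    hidden = np.maximum(0, x @ W1 + b1)  # first linear layer followed by ReLU
    return hidden @ W2 + b2              # second linear layer

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16                    # d_ff is an assumed hidden width
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))        # stand-in for the normalized output
ffn_output = feed_forward(x, W1, b1, W2, b2)

# Each token is processed independently of the others
first_token_alone = feed_forward(x[:1], W1, b1, W2, b2)
print(np.allclose(ffn_output[0], first_token_alone[0]))  # True
```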

---

### Step 5: Repeated Transformer Layers
The Transformer layer (Multi-Head Attention + Feed-Forward Network) is repeated $N$ times (e.g., $N = 6$ in the original Transformer model). Each layer refines the token representations further, allowing the model to build richer contextual embeddings.

---

### Step 6: Output from Encoder
After passing through all Transformer layers, the encoded representation for **"I am a robot"** is ready. This encoded representation contains contextual information about the entire sentence, allowing the model to understand the relationships between tokens.

$$
\text{Final Encoded Representation (Encoder Output)} =
\begin{bmatrix}
h_1 \\
h_2 \\
h_3 \\
h_4
\end{bmatrix}
$$
Each $h_i$ is a refined vector representation for the corresponding token.

---

### Step 7: Decoder (Optional, for Sequence-to-Sequence Models)
If this model is part of a sequence-to-sequence architecture (e.g., machine translation), the encoded representation is passed to the decoder. The decoder attends to this representation through cross-attention, alongside its own masked self-attention, to generate the output sequence one token at a time.

---

### Recap of the Process for "I am a robot"
1. **Input Representation**: Combine embeddings and positional encodings.
2. **Multi-Head Attention**: Learn relationships between tokens.
3. **Residual and Normalization**: Add stability and prevent vanishing gradients.
4. **Feed-Forward Network**: Process token embeddings independently.
5. **Repeat**: Stack multiple layers to refine representations.
6. **Output**: Encoded sentence ready for downstream tasks (e.g., classification, translation).



In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Activate seaborn's style for better visuals
sns.set_theme()

# Define the ReLU function
def relu(x):
    return np.maximum(0, x)

# Generate x values for the graph
x = np.linspace(-10, 10, 500)  # 500 points between -10 and 10

# Compute ReLU values for x
y = relu(x)

# Plot the ReLU function
plt.figure(figsize=(8, 6))  # Set the figure size
plt.plot(x, y, label='ReLU(x)', color='blue', linewidth=2)

# Add labels and title
plt.title('ReLU (Rectified Linear Unit) Function', fontsize=16)
plt.xlabel('x', fontsize=14)
plt.ylabel('ReLU(x)', fontsize=14)

# Add a grid for better readability
plt.grid(color='gray', linestyle='--', linewidth=0.5, alpha=0.7)

# Add a legend
plt.legend(fontsize=12, loc='upper left')

# Show the graph
plt.tight_layout()
plt.show()



In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Activate seaborn's style for better visuals
sns.set_theme()

# Define the Softmax function
def softmax(x):
    e_x = np.exp(x - np.max(x))  # Subtract max(x) for numerical stability
    return e_x / e_x.sum(axis=0)

# Generate multiple input categories
x_values = np.linspace(-5, 5, 100)  # 100 points between -5 and 5
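# Softmax is unchanged by adding the same constant to every input, so the four
# inputs need different slopes in x for the curves to vary across the plot
categories = np.array([x_values, 0.5 * x_values, -x_values, np.zeros_like(x_values)])  # 4 "categories" for Softmax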

# Compute the Softmax values for these categories
softmax_outputs = np.apply_along_axis(softmax, axis=0, arr=categories)

# Plot the Softmax function for all categories
plt.figure(figsize=(10, 7))  # Set the figure size
for i, softmax_curve in enumerate(softmax_outputs):
    plt.plot(x_values, softmax_curve, label=f'Category {i+1}', linewidth=2)

# Add labels and title
plt.title('Softmax Function Shape Across Categories', fontsize=16)
plt.xlabel('Input Value (x)', fontsize=14)
plt.ylabel('Softmax Probability', fontsize=14)

# Add grid for better readability
plt.grid(color='gray', linestyle='--', linewidth=0.5, alpha=0.7)

# Add a legend
plt.legend(fontsize=12, loc='upper right')

# Show the graph
plt.tight_layout()
plt.show()


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Activate seaborn's style for better visuals
sns.set_theme()

# Define the Softmax function
def softmax(x):
    e_x = np.exp(x - np.max(x))  # Subtract max(x) for numerical stability
    return e_x / e_x.sum(axis=0)

# Generate input values for one dimension and some fixed competitors
x = np.linspace(-10, 10, 500)  # 500 points between -10 and 10
fixed_competitors = [-2, 0, 2]  # Fixed inputs for other "categories"

# Create inputs for a single class (varies) and competitors (fixed)
inputs = np.vstack([x] + [np.full_like(x, c) for c in fixed_competitors])

# Compute Softmax probabilities
softmax_outputs = np.apply_along_axis(softmax, axis=0, arr=inputs)

# Plot the Softmax function for the varying category
plt.figure(figsize=(8, 6))
plt.plot(x, softmax_outputs[0], label='Softmax of Varying Input', color='blue', linewidth=2)

# Plot the probabilities of the fixed-input competitors, which shrink as x grows
for i, prob in enumerate(softmax_outputs[1:], start=1):
    plt.plot(x, prob, color='gray', linestyle='--', label=f'Fixed Competitor {i}')

# Add labels and title
plt.title('Sigmoid Shape of Softmax Function', fontsize=16)
plt.xlabel('Input Value (x)', fontsize=14)
plt.ylabel('Softmax Probability', fontsize=14)

# Add a grid for better readability
plt.grid(color='gray', linestyle='--', linewidth=0.5, alpha=0.7)

# Add a legend
plt.legend(fontsize=12, loc='center right')

# Show the graph
plt.tight_layout()
plt.show()
