# Transformers: Walking Through the Encoder Forward Pass

**Disclaimer:**<br>

The [notebook](https://github.com/anasashb/transformers_math_walkthrough/blob/main/transformer_attn_walkthrough.ipynb) on the math behind attention represented the input and output vectors $x_i$ and $y_i$ as column vectors. That is the common convention in mathematical literature, but in practical computation, including in TensorFlow, the standard practice is to store these vectors as rows of the data matrix $X$. A single embedded input that had shape $d \times 1$ in the previous notebook therefore has shape $1 \times d$ here, and some of the matrix multiplications are written in a different order as a result. Implementations prefer the row convention partly because it accommodates mini-batches, among other needs.
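The relationship between the two conventions can be checked numerically. The sketch below is a minimal NumPy illustration, assuming the column-vector form of a projection is written $W^{T}x$ for the same weight matrix $W$; the names `x_col`, `x_row` and `X_batch` are illustrative and do not appear in either notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W = rng.normal(size=(d, d))

# Column convention: one embedded token is a (d, 1) column vector
x_col = rng.normal(size=(d, 1))
q_col = W.T @ x_col              # shape (d, 1)

# Row convention: the same token stored as a (1, d) row vector
x_row = x_col.T
q_row = x_row @ W                # shape (1, d)

# Both conventions give the same numbers, up to a transpose
same = np.allclose(q_col.T, q_row)
print(same)

# With rows, a mini-batch is just an extra leading axis:
# 2 sequences of 4 tokens each are projected in a single matmul
X_batch = rng.normal(size=(2, 4, d))
Q_batch = X_batch @ W            # shape (2, 4, d)
print(Q_batch.shape)
```

The last two lines show why the row layout is convenient: the weight matrix is broadcast over the batch axis without any reshaping.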
- - -
### The Encoder Forward Pass Revisited
In this notebook I demonstrate the forward pass of an encoder with **single-head scaled dot-product self-attention**, so the notation is tailored to that case (*i.e.* $d_{model}=d_q=d_k=d_v=d$).<br>

**Encoder Inputs and Outputs**
- Input matrix $X \in \mathbb{R}^{t \times d}$
- Output matrix $Y \in \mathbb{R}^{t \times d}$<br>

where $t$ is the number of tokens and $d$ is the embedding dimension.

**Initialized, Trainable Weight Matrices**
- Weight matrices for Query, Key and Value $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$

**Step 1)** Obtain Query, Key and Value Matrices
$$
\begin{aligned}
Q &= XW^{Q}\\
K &= XW^{K}\\
V &= XW^{V},
\end{aligned}
$$
where $Q, K, V \in \mathbb{R}^{t \times d}$.

**Step 2)** Compute Scaled Attention Scores
$$
scaled\_attention = \frac{QK^{T}}{\sqrt{d}}
$$
where $scaled\_attention \in \mathbb{R}^{t \times t}$

**Step 3)** Compute Normalized Attention Scores
$$
normalized\_attention = softmax\left(\frac{QK^{T}}{\sqrt{d}}\right)
$$
where $normalized\_attention \in \mathbb{R}^{t \times t}$

**Step 4)** Compute Encoder/Attention Output $Y$
$$
Y = softmax\left(\frac{QK^{T}}{\sqrt{d}}\right)V
$$
where $Y \in \mathbb{R}^{t \times d}$
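The four steps above can be collected into one short function. The sketch below uses NumPy rather than TensorFlow and a small $d$ for readability; the helper names `softmax_rows` and `single_head_self_attention` are illustrative and not part of this notebook.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def single_head_self_attention(X, W_q, W_k, W_v):
    """Return (Y, weights) for an input X of shape (t, d)."""
    d = X.shape[-1]
    # Step 1: project the input into queries, keys and values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Step 2: scaled attention scores, shape (t, t)
    scaled = Q @ K.T / np.sqrt(d)
    # Step 3: normalize each row so it sums to 1
    weights = softmax_rows(scaled)
    # Step 4: weighted sum of the value rows, shape (t, d)
    return weights @ V, weights

rng = np.random.default_rng(0)
t, d = 4, 8
X = rng.normal(size=(t, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Y, weights = single_head_self_attention(X, W_q, W_k, W_v)
print(Y.shape, weights.shape)
```

The cells that follow perform the same computation one step at a time in TensorFlow, printing each intermediate matrix.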



In [1]:
# Required libraries
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
from scipy.special import softmax
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')

np.random.seed(66)
tf.random.set_seed(66)


**Step 0)** Obtain encoder input $X$, denoted as `embeddings_input`

In [3]:
# Define input sentence
sentence = 'in the bleak midwinter'

# Tokenize
input_tokens = sentence.split()
print(f'Our input tokens are:\n'
      f'{input_tokens}\n',
      f'with length: {len(input_tokens)}')
print('-'*80)

# Pick embedding dimension
d = 300

# Generate embeddings randomly
embeddings_input = tf.keras.initializers.RandomNormal()(shape=(len(input_tokens), d))
print(f'Our embeddings are:\n'
      f'{embeddings_input}\n',
      f'with shape: {embeddings_input.shape}')


Our input tokens are:
['in', 'the', 'bleak', 'midwinter']
 with length: 4
--------------------------------------------------------------------------------
Our embeddings are:
[[-0.07616111 -0.01057373 -0.07564942 ... -0.03830564 -0.01908852
   0.07599268]
 [-0.05880255 -0.04481804  0.03535214 ...  0.02076481 -0.01052421
   0.00542679]
 [ 0.01348811  0.00810569 -0.00789542 ...  0.06691866  0.0353842
  -0.02301901]
 [ 0.07057167  0.00093471 -0.03097019 ...  0.11004905  0.06634303
   0.01168947]]
 with shape: (4, 300)
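The small magnitude of the printed values follows from the initializer: `tf.keras.initializers.RandomNormal()` called without arguments draws from a normal distribution with mean 0 and standard deviation 0.05. A minimal NumPy sketch of an equivalent draw is shown below; the generator `rng` and the name `embeddings_np` are illustrative, and the sampled values will not match the TensorFlow output above.

```python
import numpy as np

# Illustrative seed; it does not reproduce the TensorFlow draw
rng = np.random.default_rng(66)

t, d = 4, 300  # four tokens, embedding dimension 300, as in the cell above

# Same distribution as RandomNormal's defaults: mean 0.0, stddev 0.05
embeddings_np = rng.normal(loc=0.0, scale=0.05, size=(t, d))

print(embeddings_np.shape)
print(f'sample std: {embeddings_np.std():.3f}')
```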


**Step 0)** Initialize the weight matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$, denoted as `W_q`, `W_k`, `W_v`

In [5]:
# Initialize query, key and value weight matrices randomly
W_q = tf.keras.initializers.RandomNormal()(shape=(d, d))
W_k = tf.keras.initializers.RandomNormal()(shape=(d, d))
W_v = tf.keras.initializers.RandomNormal()(shape=(d, d))

print(f'Our weight matrices are:\n'
      f'W_q:\n{W_q}\n',
      f'W_k:\n{W_k}\n',
      f'W_v:\n{W_v}\n',            
      f'with shapes: {W_q.shape}, {W_k.shape}, {W_v.shape}')


Our weight matrices are:
W_q:
[[ 0.04303246  0.03313142  0.03040164 ... -0.02537489  0.02974148
   0.01310409]
 [-0.01867909  0.03220072 -0.03042476 ...  0.06480556 -0.07691052
  -0.02612035]
 [-0.1046806   0.04067704 -0.10271883 ... -0.05716374 -0.0358577
  -0.0737501 ]
 ...
 [ 0.07037354 -0.04791072 -0.02540406 ...  0.07900711  0.03343224
  -0.00564269]
 [ 0.06309637  0.0914696  -0.07195995 ... -0.04517866  0.10747699
  -0.0859692 ]
 [-0.03365593  0.06519532 -0.0437687  ...  0.07391986  0.0249158
  -0.02816771]]
 W_k:
[[-2.12770253e-02  3.11037432e-02  8.63985741e-04 ... -6.93682209e-02
   4.09356281e-02 -3.21749225e-02]
 [ 1.76140741e-02 -2.15053409e-02  8.03746581e-02 ... -2.19596978e-02
   6.19593496e-03  2.72714738e-02]
 [ 3.15739103e-02  9.11310688e-03  6.57749102e-02 ... -6.83886409e-02
  -4.80998345e-02  6.35050014e-02]
 ...
 [-3.73952799e-02  6.28292412e-02  1.24926805e-01 ...  2.69469172e-02
  -1.18080636e-04 -1.10481749e-03]
 [-9.17510837e-02 -4.51365411e-02  5.76580549e-03 ... -4

**Step 1)** Obtain $Q, K, V$
$$
\begin{aligned}
Q &= XW^{Q}\\
K &= XW^{K}\\
V &= XW^{V},
\end{aligned}
$$
where $Q, K, V \in \mathbb{R}^{t \times d}$.

In [6]:
# Obtain Query, Key and Value matrices

Q = tf.matmul(embeddings_input, W_q)
K = tf.matmul(embeddings_input, W_k)
V = tf.matmul(embeddings_input, W_v)

print(f'Our matrices are:\n'
      f'Query:\n{Q}\n',
      f'Key:\n{K}\n',
      f'Value:\n{V}\n',            
      f'with shapes: {Q.shape}, {K.shape}, {V.shape}')


Our matrices are:
Query:
[[-0.01358831  0.00225072  0.04069463 ... -0.04041561  0.07240771
   0.02474956]
 [ 0.00899658  0.01655877 -0.00674951 ... -0.04415486 -0.0130793
  -0.07043502]
 [ 0.04078966  0.062748    0.01114449 ...  0.03822777  0.06681118
  -0.04799606]
 [ 0.07740284  0.02113403  0.00226755 ... -0.02833924 -0.04025527
   0.0632926 ]]
 Key:
[[-0.00755814 -0.01416114  0.03085505 ...  0.02321012 -0.02985067
  -0.06097139]
 [ 0.02322749 -0.04437843 -0.03968051 ...  0.03866518  0.03314797
   0.01338449]
 [-0.09906237  0.01016132  0.03210665 ...  0.03507121 -0.02283784
   0.13048255]
 [-0.0725933   0.04244903  0.02024318 ... -0.0377757   0.00357432
   0.00459064]]
 Value:
[[-0.0428874  -0.0227301  -0.05851181 ...  0.0111142   0.07013445
   0.05753509]
 [ 0.00068681  0.00997287  0.00854428 ...  0.05659451  0.07158233
  -0.04577174]
 [-0.00587513 -0.02620863 -0.0323611  ...  0.02873419  0.00258767
   0.02111014]
 [-0.0233941  -0.05579733  0.01938301 ... -0.02305271  0.02037075
   

**Step 2)** Compute Scaled Attention Scores
$$
scaled\_attention = \frac{QK^{T}}{\sqrt{d}}
$$
where $scaled\_attention \in \mathbb{R}^{t \times t}$

In [7]:
# Compute raw attention scores QK^T
QK_T = tf.matmul(Q, tf.transpose(K))

# Compute scaling factor (square root of d)
scaling_factor = tf.sqrt(tf.cast(d, dtype=tf.float32))

# Compute scaled attention
scaled_QK_T = QK_T / scaling_factor

print(f'Scaled attention:\n'
      f'{scaled_QK_T}\n'
      f'with shape: {scaled_QK_T.shape}')

Scaled attention:
[[-1.2512440e-03  2.6803457e-03 -2.3253488e-03  4.8317036e-04]
 [ 2.1728969e-04  4.1843363e-04 -1.0025405e-05  2.6821145e-03]
 [-1.5776812e-03 -6.6248595e-04 -8.4195595e-04 -1.9943712e-03]
 [-4.6782996e-04 -2.2011064e-03  4.2129325e-04 -1.6234732e-03]]
with shape: (4, 4)
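The usual motivation for dividing by $\sqrt{d}$ is that, if the entries of a query and a key are independent with zero mean and unit variance, their dot product has variance $d$, so unscaled scores grow with the embedding dimension. The sketch below checks this empirically under exactly that assumption. It is an idealized setting rather than a reproduction of this notebook, whose entries are far smaller than unit variance; the names `n_pairs` and `results` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 5000  # number of random (q, k) pairs drawn per dimension

results = {}
for dim in (16, 256):
    # Unit-variance queries and keys (an assumption, see the text above)
    q = rng.standard_normal((n_pairs, dim))
    k = rng.standard_normal((n_pairs, dim))

    raw = np.sum(q * k, axis=1)      # one dot product per pair
    scaled = raw / np.sqrt(dim)      # the scaling used in Step 2

    results[dim] = (raw.std(), scaled.std())
    print(f'd={dim}: raw std={raw.std():.2f}, scaled std={scaled.std():.2f}')
```

The raw standard deviation should come out close to $\sqrt{d}$, while the scaled one stays close to 1 for both dimensions, which keeps the softmax inputs in a comparable range as $d$ changes.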


**Step 3)** Normalize Scaled Attention Scores Using $softmax$
$$
normalized\_attention = softmax\left(\frac{QK^{T}}{\sqrt{d}}\right)
$$
where $normalized\_attention \in \mathbb{R}^{t \times t}$

In [8]:
# Apply softmax row-wise (default axis=-1) so that each row sums to 1
attention = tf.keras.activations.softmax(scaled_QK_T)

print(f'Self attention scores:\n'
      f'{attention}\n'
      f'with shape: {attention.shape}')

Self attention scores:
[[0.24971272 0.25069642 0.24944463 0.2501462 ]
 [0.24984749 0.24989775 0.2497907  0.25046408]
 [0.24992284 0.25015166 0.25010678 0.24981873]
 [0.25012487 0.24969172 0.25034738 0.24983601]]
with shape: (4, 4)
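Every weight above is close to $1/t = 0.25$ because the scaled scores are all on the order of $10^{-3}$, and softmax of nearly equal inputs is nearly uniform. The sketch below recomputes the row-wise softmax in NumPy from the scaled scores printed in Step 2 and checks both properties; `softmax_rows`, `row_sums` and `max_dev` are illustrative names, not part of the notebook.

```python
import numpy as np

# Scaled scores copied from the printed output of Step 2
scaled_scores = np.array([
    [-1.2512440e-03,  2.6803457e-03, -2.3253488e-03,  4.8317036e-04],
    [ 2.1728969e-04,  4.1843363e-04, -1.0025405e-05,  2.6821145e-03],
    [-1.5776812e-03, -6.6248595e-04, -8.4195595e-04, -1.9943712e-03],
    [-4.6782996e-04, -2.2011064e-03,  4.2129325e-04, -1.6234732e-03],
])

def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

weights = softmax_rows(scaled_scores)

row_sums = weights.sum(axis=-1)          # each row is a probability distribution
max_dev = np.abs(weights - 0.25).max()   # largest distance from uniform attention

print(row_sums)
print(f'largest deviation from 0.25: {max_dev:.6f}')
```

With trained rather than randomly initialized weights, the scores would typically be more spread out and the attention rows less uniform.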


**Step 4)** Obtain Weighted Sum $Y$, Denoted as `encoder_output`
$$
Y = softmax\left(\frac{QK^{T}}{\sqrt{d}}\right)V
$$
where $Y \in \mathbb{R}^{t \times d}$

In [11]:
# Obtain weighted sum
encoder_output = tf.matmul(attention, V)

print(f'Encoder output:\n'
      f'{encoder_output}\n'
      f'with shape: {encoder_output.shape}')

Encoder output:
[[-0.01785482 -0.02367092 -0.01569284 ...  0.01836444  0.04120005
   0.00879391]
 [-0.01787061 -0.02370876 -0.01571258 ...  0.01832335  0.0411597
   0.00884634]
 [-0.01786043 -0.02368021 -0.01573756 ...  0.01836252  0.04117083
   0.00884408]
 [-0.01787123 -0.02369666 -0.01576076 ...  0.01834525  0.04115305
   0.00888188]]
with shape: (4, 300)
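The four output rows are almost identical. This is a direct consequence of the near-uniform attention weights: when every weight equals $1/t$, each row of $Y$ is simply the mean of the rows of $V$. The sketch below illustrates this with a random stand-in for $V$ (an assumption, since the full value matrix is not printed above) and with the attention matrix copied from Step 3; the names `Y_uniform`, `Y_near` and `max_gap` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 4, 300

# Stand-in value matrix; NOT the V computed in the notebook
V = rng.normal(0.0, 0.05, size=(t, d))
mean_row = V.mean(axis=0)

# Exactly uniform attention: every output row equals the mean value row
uniform = np.full((t, t), 1.0 / t)
Y_uniform = uniform @ V
print(np.allclose(Y_uniform, mean_row))

# Attention weights copied from the printed output of Step 3
attention = np.array([
    [0.24971272, 0.25069642, 0.24944463, 0.2501462],
    [0.24984749, 0.24989775, 0.2497907, 0.25046408],
    [0.24992284, 0.25015166, 0.25010678, 0.24981873],
    [0.25012487, 0.24969172, 0.25034738, 0.24983601],
])

# Near-uniform attention: output rows stay very close to the mean value row
Y_near = attention @ V
max_gap = np.abs(Y_near - mean_row).max()
print(f'largest gap to the mean row: {max_gap:.6f}')
```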
