# Transformers: Walking Through the Encoder Forward Pass

**Disclaimer:**<br>

The [notebook](https://github.com/anasashb/transformers_math_walkthrough/blob/main/transformer_attn_walkthrough.ipynb) on the math behind attention used the conventional representation of input and output vectors $x_i$ and $y_i$ as column vectors. While this is common in the mathematical literature, in practical computations, including in TensorFlow, it is standard practice to treat these vectors as rows of the data matrix $X$. So if a single embedded input had shape $d \times 1$ in the previous notebook, here it has shape $1 \times d$, and some of the matrix multiplications are written differently as a result. This difference between conventional notation and implementation stems from, among other things, the need to process mini-batches.
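
To make the change of convention concrete, here is a minimal sketch. It uses NumPy rather than TensorFlow only to keep the example lightweight, and the names `W`, `x_col`, `X`, `batch` and all sizes are illustrative assumptions, not taken from either notebook. It checks that projecting a column vector as $W^{T}x$ and projecting the same vector as a row, $xW$, give the same numbers, and that the row form extends unchanged to a matrix of tokens and to a mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # illustrative embedding dimension
W = rng.normal(size=(d, d))             # weight matrix as used in the row convention

# Column convention: one token is a (d, 1) vector, projected as W^T x
x_col = rng.normal(size=(d, 1))
y_col = W.T @ x_col                     # shape (d, 1)

# Row convention: the same token is a (1, d) vector, projected as x W
x_row = x_col.T
y_row = x_row @ W                       # shape (1, d)

# Both conventions hold the same numbers, just transposed
same_result = np.allclose(y_col.T, y_row)

# Stacking t tokens as rows gives X of shape (t, d); a leading batch axis
# is handled by matmul broadcasting without changing the formula
X = rng.normal(size=(3, d))
batch = np.stack([X, X + 1.0])          # shape (2, 3, d)
batch_out = batch @ W                   # shape (2, 3, d)

print(same_result, y_row.shape, batch_out.shape)
```

With rows as tokens, adding a batch axis only prepends a dimension, which is one reason the row layout is convenient in practice.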
- - -
### The Encoder Forward Pass Revisited
In this notebook I demonstrate the forward pass of an encoder with **single-head scaled dot-product self-attention,** and the notation is tailored to that (*i.e.* $d_{model}=d_q=d_k=d_v=d$)<br>
Positional embedding is ignored for the time being.<br>

**Encoder Inputs and Outputs**
- Input matrix $X \in \mathbb{R}^{t \times d}$
- Output matrix $Y \in \mathbb{R}^{t \times d}$<br>

where $t$ is the number of tokens and $d$ is the embedding dimension.

**Initialized, Trainable Weight Matrices**
- Weight matrices for Query, Key and Value $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$

**Step 1)** Obtain Query, Key and Value Matrices
$$
\begin{aligned}
Q = XW^{Q}\\
K = XW^{K}\\
V = XW^{V},
\end{aligned}
$$
where $Q, K, V \in \mathbb{R}^{t \times d}$.

**Step 2)** Compute Scaled Attention Scores
$$
scaled\_attention = \frac{QK^{T}}{\sqrt{d}}
$$
where $scaled\_attention \in \mathbb{R}^{t \times t}$

**Step 3)** Compute Normalized Attention Scores
$$
normalized\_attention = softmax(\frac{QK^{T}}{\sqrt{d}})
$$
where $normalized\_attention \in \mathbb{R}^{t \times t}$

**Step 4)** Compute Attention Output $Y$
$$
Y = softmax(\frac{QK^{T}}{\sqrt{d}})V
$$
where $Y \in \mathbb{R}^{t \times d}$

**Step 5)** Residual Connection and Normalization
$$
O = LayerNorm(Y+X)
$$
where $O \in \mathbb{R}^{t \times d}$

**Step 6)** Position-Wise Feed-Forward Networks
$$
FFN(O) = max(0, OW_1 + b_1)W_2 + b_2
$$
where $O, FFN(O) \in \mathbb{R}^{t \times d}$, $W_1 \in \mathbb{R}^{d \times 4d}$, $W_2 \in \mathbb{R}^{4d \times d}$, and $OW_1 + b_1 \in \mathbb{R}^{t \times 4d}$

**Step 7)** Residual Connection and Normalization
$$
encoder\_output = LayerNorm(FFN(O)+O)
$$
where $encoder\_output \in \mathbb{R}^{t \times d}$
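
The seven steps above can be condensed into one function. The sketch below is written in NumPy instead of TensorFlow to stay self-contained, and the helper names `softmax`, `layer_norm` and `encoder_forward`, the small sizes, and the `scale=0.05` used for the random weights are all my own illustrative choices. The layer normalization here has no learned scale or shift, which I assume matches a freshly created normalization layer but not a trained one.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(z, eps=1e-6):
    """Normalize each row to zero mean and (almost) unit variance."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def encoder_forward(X, W_q, W_k, W_v, W_1, b_1, W_2, b_2):
    d = X.shape[-1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                # Step 1
    scores = Q @ K.T / np.sqrt(d)                      # Step 2
    weights = softmax(scores)                          # Step 3
    Y = weights @ V                                    # Step 4
    O = layer_norm(Y + X)                              # Step 5
    ffn = np.maximum(0.0, O @ W_1 + b_1) @ W_2 + b_2   # Step 6
    return layer_norm(ffn + O)                         # Step 7

# Small illustrative sizes: t tokens, embedding dimension d
rng = np.random.default_rng(0)
t, d = 4, 8
X = rng.normal(scale=0.05, size=(t, d))
W_q, W_k, W_v = (rng.normal(scale=0.05, size=(d, d)) for _ in range(3))
W_1, b_1 = rng.normal(scale=0.05, size=(d, 4 * d)), np.zeros(4 * d)
W_2, b_2 = rng.normal(scale=0.05, size=(4 * d, d)), np.zeros(d)

out = encoder_forward(X, W_q, W_k, W_v, W_1, b_1, W_2, b_2)
print(out.shape)
```

The output keeps the input shape $t \times d$, which is what allows encoder blocks to be stacked.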

In [1]:
# Required library
import tensorflow as tf
tf.random.set_seed(66)


**Step 0)** Obtain encoder input $X$, denoted as `embeddings_input`

In [3]:
# Define input sentence
sentence = 'in the bleak midwinter'

# Tokenize
input_tokens = sentence.split()
print(f'Our input tokens are:\n'
      f'{input_tokens}\n',
      f'with length: {len(input_tokens)}')
print('-'*80)

# Pick embedding dimension
d = 300

# Generate embeddings randomly
embeddings_input = tf.keras.initializers.RandomNormal()(shape=(len(input_tokens), d))
print(f'Our embeddings are:\n'
      f'{embeddings_input}\n',
      f'with shape: {embeddings_input.shape}')


Our input tokens are:
['in', 'the', 'bleak', 'midwinter']
 with length: 4
--------------------------------------------------------------------------------
Our embeddings are:
[[-0.0741843  -0.04915943  0.05049502 ... -0.01158285 -0.01054677
   0.04119534]
 [ 0.00722885  0.04369759 -0.04274543 ... -0.02851543 -0.02436578
   0.06748476]
 [ 0.04392974 -0.04113375 -0.00963057 ... -0.01584903 -0.03317151
   0.02710428]
 [-0.08654431  0.06789926  0.09438287 ... -0.07203033 -0.01282948
  -0.03549376]]
 with shape: (4, 300)


---

**Step 0)** Initialize the $d \times d$ weight matrices $W^Q, W^K, W^V$, denoted as `W_q`, `W_k`, `W_v`

In [4]:
# Initialize query, key and value weight matrices randomly
W_q = tf.keras.initializers.RandomNormal()(shape=(d, d))
W_k = tf.keras.initializers.RandomNormal()(shape=(d, d))
W_v = tf.keras.initializers.RandomNormal()(shape=(d, d))

print(f'Our weight matrices are:\n'
      f'W_q:\n{W_q}\n',
      f'W_k:\n{W_k}\n',
      f'W_v:\n{W_v}\n',            
      f'with shapes: {W_q.shape}, {W_k.shape}, {W_v.shape}')


Our weight matrices are:
W_q:
[[ 0.03723269  0.09281119 -0.05361765 ... -0.05797908  0.01792317
  -0.0382619 ]
 [ 0.02328759  0.03270123 -0.03892419 ... -0.00826736  0.04628843
   0.01631828]
 [ 0.03692954 -0.01910817 -0.00831627 ...  0.03324584  0.10601654
   0.08415358]
 ...
 [-0.01936804 -0.09737956  0.08661649 ... -0.00909938  0.0481543
  -0.01972168]
 [-0.05319289  0.01023542 -0.042817   ...  0.08164191  0.0091115
  -0.09240133]
 [ 0.00219032  0.06736977 -0.03829434 ...  0.02602117  0.00231728
   0.02032198]]
 W_k:
[[ 0.01603875 -0.00056137 -0.06327474 ... -0.03770884  0.00751227
   0.05085857]
 [ 0.00794533  0.03619261 -0.07646581 ...  0.02190027 -0.09783776
  -0.1076859 ]
 [ 0.00324819  0.02829494  0.07573404 ...  0.00340645  0.0515605
   0.04967254]
 ...
 [-0.07645648 -0.03013856  0.03536862 ...  0.01988157  0.02923589
   0.05864898]
 [-0.11877687  0.02494119  0.04692116 ... -0.14199881  0.08960991
  -0.01894834]
 [-0.05679299 -0.0028196  -0.09659957 ... -0.00726818  0.03675446

---

**Step 1)** Obtain $Q, K, V$
$$
\begin{aligned}
Q = XW^{Q}\\
K = XW^{K}\\
V = XW^{V},
\end{aligned}
$$
where $Q, K, V \in \mathbb{R}^{t \times d}$.

In [5]:
# Obtain Query, Key and Value matrices
Q = tf.matmul(embeddings_input, W_q)
K = tf.matmul(embeddings_input, W_k)
V = tf.matmul(embeddings_input, W_v)

print(f'Our matrices are:\n'
      f'Query:\n{Q}\n',
      f'Key:\n{K}\n',
      f'Value:\n{V}\n',            
      f'with shapes: {Q.shape}, {K.shape}, {V.shape}')


Our matrices are:
Query:
[[ 0.02234076 -0.07591635 -0.09255555 ...  0.03983417  0.0057872
   0.01784528]
 [-0.01952961  0.05058814 -0.08638044 ... -0.06978144 -0.09432264
  -0.0087329 ]
 [-0.03929292  0.0156117  -0.00356103 ...  0.02536867  0.02940155
  -0.02930875]
 [-0.02998801  0.06668639 -0.02769908 ...  0.03847094  0.023365
  -0.10388693]]
 Key:
[[-0.05925012  0.05007924  0.00738794 ... -0.0012     -0.02463339
  -0.03889964]
 [ 0.03592325  0.0922419   0.01586427 ... -0.0051557  -0.10935269
  -0.00841157]
 [-0.00770236 -0.10901807 -0.03415017 ... -0.01817019 -0.02409642
   0.00808546]
 [-0.0191135   0.04928383 -0.00215943 ...  0.02107384 -0.02404883
  -0.01796726]]
 Value:
[[-0.05953827 -0.00528069 -0.04912516 ... -0.03327402  0.01761387
  -0.02158837]
 [ 0.03896216 -0.00531136 -0.09674695 ... -0.06482318  0.03180076
  -0.01549592]
 [-0.03189508 -0.03950288  0.01442864 ...  0.00566404  0.10965772
   0.06029917]
 [ 0.00702286  0.02378504  0.03215706 ...  0.06221448  0.06543559
  -0.

---

**Step 2)** Compute Scaled Attention Scores
$$
scaled\_attention = \frac{QK^{T}}{\sqrt{d}}
$$
where $scaled\_attention \in \mathbb{R}^{t \times t}$

In [6]:
# Compute raw attention scores QK'
QK_T = tf.matmul(Q, tf.transpose(K))

# Compute scaling factor (square root of d)
scaling_factor = tf.sqrt(tf.cast(d, dtype=tf.float32))

# Compute scaled attention
scaled_QK_T = QK_T / scaling_factor

print(f'Scaled attention scores:\n'
      f'{scaled_QK_T}\n'
      f'with shape: {scaled_QK_T.shape}')

Scaled attention scores:
[[ 0.00263519 -0.00174089 -0.00037623 -0.00035859]
 [ 0.00236044  0.00206439  0.00187134  0.00241708]
 [-0.00104838  0.00487237 -0.00182653  0.00039156]
 [ 0.00036275  0.00206183 -0.00057808  0.00099374]]
with shape: (4, 4)
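
The reason for dividing by $\sqrt{d}$ can be checked numerically. The sketch below is an illustration under assumptions that differ from the cell above: it uses NumPy, and it draws query and key entries with unit variance so the effect is easy to read off. For such vectors the raw dot product has variance close to $d$, while the scaled one stays close to 1 regardless of $d$. The names `n_samples` and `variances` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000
variances = {}

for d in (16, 64, 256):
    # Independent query/key vectors with unit-variance entries
    q = rng.normal(size=(n_samples, d))
    k = rng.normal(size=(n_samples, d))

    raw = (q * k).sum(axis=-1)          # one dot product per sample
    scaled = raw / np.sqrt(d)           # scaled dot product

    variances[d] = (raw.var(), scaled.var())
    print(d, round(raw.var(), 1), round(scaled.var(), 2))
```

Keeping the score variance independent of $d$ keeps the softmax in the next step from saturating as the embedding dimension grows.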


---

**Step 3)** Normalize Scaled Attention Scores Using $softmax$
$$
normalized\_attention = softmax(\frac{QK^{T}}{\sqrt{d}})
$$
where $normalized\_attention \in \mathbb{R}^{t \times t}$

In [7]:
# Squash through softmax
norm_attention = tf.keras.activations.softmax(scaled_QK_T)

print(f'Normalized attention scores:\n'
      f'{norm_attention}\n'
      f'with shape: {norm_attention.shape}')

Normalized attention scores:
[[0.25064936 0.24955489 0.24989569 0.24990009]
 [0.2500455  0.24997151 0.24992326 0.25005966]
 [0.24958809 0.25107023 0.24939394 0.24994774]
 [0.24991307 0.25033805 0.24967806 0.2500708 ]]
with shape: (4, 4)
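
As a sanity check, the softmax can be recomputed by hand from the scaled scores printed in Step 2. The sketch below uses NumPy and a hand-written `softmax_rows` helper (both my own choices, not part of the notebook). It confirms that every row sums to 1 and shows how close the weights are to the uniform value $1/t = 0.25$, which is expected here because all scores lie within about $\pm 0.005$ of zero.

```python
import numpy as np

# Scaled attention scores copied from the Step 2 output above
scaled_scores = np.array([
    [ 0.00263519, -0.00174089, -0.00037623, -0.00035859],
    [ 0.00236044,  0.00206439,  0.00187134,  0.00241708],
    [-0.00104838,  0.00487237, -0.00182653,  0.00039156],
    [ 0.00036275,  0.00206183, -0.00057808,  0.00099374],
])

def softmax_rows(z):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax_rows(scaled_scores)

row_sums = weights.sum(axis=-1)            # each row is a probability distribution
max_dev = np.abs(weights - 0.25).max()     # largest gap from uniform attention

print(weights)
print(row_sums)
print(max_dev)
```

With randomly initialized weights the model attends to all tokens almost equally; training is what moves these weights away from uniform.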


---

**Step 4)** Obtain Weighted Sum $Y$, Denoted as `attention_output`
$$
Y = softmax(\frac{QK^{T}}{\sqrt{d}})V
$$
where $Y \in \mathbb{R}^{t \times d}$

In [8]:
# Obtain weighted sum
attention_output = tf.matmul(norm_attention, V)

print(f'Attention output:\n'
      f'{attention_output}\n'
      f'with shape: {attention_output.shape}')

Attention output:
[[-0.01141546 -0.00657679 -0.02481516 ... -0.00755423  0.05610629
   0.00270854]
 [-0.01136303 -0.00657311 -0.02482027 ... -0.00755106  0.05612237
   0.00271481]
 [-0.0112769  -0.00655828 -0.02491533 ... -0.00761702  0.05608388
   0.00267712]
 [-0.01133297 -0.00656441 -0.0248524  ... -0.00757111  0.05610554
   0.00269707]]
with shape: (4, 300)
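
The rows of the attention output above are nearly identical, and this follows from the near-uniform weights in Step 3: with weights of exactly $1/t$, every output row is the average of the rows of $V$. The sketch below illustrates this with NumPy on small random data. The matrix `V`, the perturbation size `1e-3`, and the other names are illustrative assumptions rather than the notebook's values.

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 4, 6
V = rng.normal(scale=0.05, size=(t, d))     # stand-in value matrix

# Exactly uniform attention: every output row equals the mean of V's rows
uniform = np.full((t, t), 1.0 / t)
uniform_out = uniform @ V

# Near-uniform attention: perturb slightly, then renormalize each row to sum to 1
weights = uniform + rng.uniform(-1e-3, 1e-3, size=(t, t))
weights /= weights.sum(axis=-1, keepdims=True)
near_out = weights @ V

# Largest difference between any output entry and the column mean of V
gap = np.abs(near_out - V.mean(axis=0)).max()
print(gap)
```

So at initialization the attention layer behaves roughly like mean pooling over tokens, broadcast back to every position.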


---

**Step 5)** Obtain Final Attention Output $O$, Denoted as `attention_add_norm`
$$
\begin{aligned}
O = LayerNorm(Y+X)
\end{aligned}
$$
where $O \in \mathbb{R}^{t \times d}$

In [9]:
# Residual connection: add the original input to the attention output elementwise
attention_add = attention_output + embeddings_input

# Layer normalization over the embedding dimension (last axis)
attention_add_norm = tf.keras.layers.LayerNormalization(epsilon=1e-6)(attention_add)

print(f'Normalized output:\n'
      f'{attention_add_norm}\n'
      f'with shape: {attention_add_norm.shape}')

Normalized output:
[[-1.592812   -1.0233724   0.529074   ... -0.3254979   0.90814054
   0.8765707 ]
 [-0.08818263  0.6591465  -1.2371346  ... -0.6665808   0.5619165
   1.2582444 ]
 [ 0.63286376 -0.9073197  -0.655313   ... -0.442916    0.44614235
   0.5778193 ]
 [-1.824939    1.2616862   1.4205737  ... -1.4706278   0.91158247
  -0.56323004]]
with shape: (4, 300)
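
The normalized values above are of order 1 even though the inputs were of order 0.05, because layer normalization rescales each token's vector to zero mean and unit variance. The sketch below is a hand-written NumPy version, assuming no learned scale or shift (to my understanding a freshly created Keras layer starts with scale 1 and offset 0, so the two should agree at initialization). The name `residual_sum` is a random stand-in for $Y + X$, not the notebook's tensor.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """Normalize each row (token) over the embedding dimension."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)      # biased variance over the last axis
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
residual_sum = rng.normal(scale=0.05, size=(4, 300))   # stand-in for Y + X

normalized = layer_norm(residual_sum)

row_means = normalized.mean(axis=-1)    # approximately 0 for every token
row_stds = normalized.std(axis=-1)      # approximately 1 for every token

print(row_means)
print(row_stds)
```

The statistics are computed per row, so each token is normalized independently of the other tokens in the sequence.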


---

**Step 6)** Position-Wise Feed-Forward Networks
$$
FFN(O) = max(0, OW_1 + b_1)W_2 + b_2
$$
where $O, FFN(O) \in \mathbb{R}^{t \times d}$, $W_1 \in \mathbb{R}^{d \times 4d}$, $W_2 \in \mathbb{R}^{4d \times d}$, and $OW_1 + b_1 \in \mathbb{R}^{t \times 4d}$

In [10]:
# Initialize Inner Layer Weights and Bias
W_1 = tf.keras.initializers.RandomNormal()(shape=(d, 4*d))
b_1 = tf.Variable(tf.zeros(4*d))

# Initialize Output Weights and Bias
# Project back to original shape
W_2 = tf.keras.initializers.RandomNormal()(shape=(4*d, d))
b_2 = tf.Variable(tf.zeros(d))

# Compute First Linear Projection
inner_projection = tf.matmul(attention_add_norm, W_1) + b_1

# Max operation: ReLU computes max(0, x) elementwise
inner_output = tf.keras.activations.relu(inner_projection)

# Second Linear Projection
ffn_output = tf.matmul(inner_output, W_2) + b_2

print(f'FFN output:\n'
      f'{ffn_output}\n'
      f'with shape: {ffn_output.shape}')

FFN output:
[[ 2.882871   -0.9661224  -2.249414   ... -0.62725556 -0.01618898
   0.32342726]
 [ 0.37960598 -0.9368476  -0.34070563 ...  0.39195973 -1.0509952
  -0.3859678 ]
 [ 1.303434    0.44110128 -0.51026565 ... -1.0850596   0.27255872
  -0.57497585]
 [ 0.44652134  0.48340976 -0.3626876  ...  0.7292412  -0.19128858
   2.5369987 ]]
with shape: (4, 300)
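
"Position-wise" means that the same two-layer network is applied to every token independently. The NumPy sketch below illustrates this under assumed small sizes and random weights (the names `ffn`, `per_token`, and `perm` are illustrative). It checks that applying the network to the whole matrix equals applying it row by row, and that reordering the tokens simply reorders the output rows.

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 4, 8
O = rng.normal(size=(t, d))                                   # stand-in for the Step 5 output
W_1, b_1 = rng.normal(scale=0.05, size=(d, 4 * d)), np.zeros(4 * d)
W_2, b_2 = rng.normal(scale=0.05, size=(4 * d, d)), np.zeros(d)

def ffn(x):
    """Two linear projections with a ReLU in between."""
    return np.maximum(0.0, x @ W_1 + b_1) @ W_2 + b_2

# Whole matrix at once versus one token at a time
full = ffn(O)
per_token = np.stack([ffn(O[i]) for i in range(t)])
matches = np.allclose(full, per_token)

# Reordering the tokens reorders the output rows in the same way
perm = np.array([2, 0, 3, 1])
permuted_matches = np.allclose(ffn(O[perm]), full[perm])

print(matches, permuted_matches)
```

Unlike the attention step, no information moves between tokens here; the mixing across positions happens only in Steps 2 to 4.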


---

**Step 7)** Residual Connection and Normalization
$$
encoder\_output = LayerNorm(FFN(O)+O)
$$
where $encoder\_output \in \mathbb{R}^{t \times d}$

In [11]:
# Residual connection: add the final attention output to the feed-forward output elementwise
# Normalize
encoder_output = tf.keras.layers.LayerNormalization(epsilon=1e-6)(ffn_output + attention_add_norm)

print(f'Encoder output:\n'
      f'{encoder_output}\n'
      f'with shape: {encoder_output.shape}')

Encoder output:
[[ 0.97223    -1.3499476  -1.1593652  ... -0.61585456  0.69033915
   0.9084598 ]
 [ 0.26922503 -0.14316043 -1.0852363  ... -0.14092863 -0.29632384
   0.6901091 ]
 [ 1.5002097  -0.27086684 -0.7864182  ... -1.0535685   0.602627
   0.07491418]
 [-0.9377893   1.222774    0.74742466 ... -0.497149    0.513909
   1.3809491 ]]
with shape: (4, 300)


---

After this, the encoder output can be passed to an output head in the case of an encoder-only model, or sent on to the decoder.