### -1. Back prop for matmul

Given

$$
Y = WX, \quad y_{ij} = \sum_a w_{ia}x_{aj}
$$

Note that

$$
\begin{align*}
\dfrac{\partial y_{ij}}{\partial x_{mn}}&=\dfrac{\partial}{\partial x_{mn}}\left(\sum_aw_{ia}x_{aj}\right)\\
&=\sum_a\dfrac{\partial}{\partial x_{mn}}w_{ia}x_{aj}\\
&=\sum_a w_{ia}\delta_{m,a}\delta_{n,j}\\
&=w_{im}\delta_{n,j}
\end{align*}
$$

thus

$$
\begin{align*}
\dfrac{\partial L}{\partial x_{mn}}&=\sum_{ij}\dfrac{\partial L}{\partial y_{ij}}\dfrac{\partial y_{ij}}{\partial x_{mn}}\\
&=\sum_{ij}\dfrac{\partial L}{\partial y_{ij}}w_{im}\delta_{n,j}\\
&=\sum_{i}\dfrac{\partial L}{\partial y_{in}}w_{im}
\end{align*}
$$

therefore

$$
\left(\dfrac{\partial L}{\partial X}\right)=W^T\left(\dfrac{\partial L}{\partial Y}\right)
$$

and similarly,

$$
\dfrac{\partial L}{\partial W}=\dfrac{\partial L}{\partial Y} X^T
$$
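The two matrix identities above can be checked numerically. The following is a minimal sketch using NumPy; the helper names `matmul_backward` and `numerical_grad`, the shapes, and the linear test loss $L=\sum_{ij} C_{ij}Y_{ij}$ are all illustrative assumptions, not part of the derivation.

```python
import numpy as np

def matmul_backward(W, X, dY):
    """Return (dW, dX) for Y = W @ X, given the upstream gradient dY = dL/dY."""
    dX = W.T @ dY      # dL/dX = W^T (dL/dY)
    dW = dY @ X.T      # dL/dW = (dL/dY) X^T
    return dW, dX

def numerical_grad(f, A, eps=1e-6):
    """Central finite differences of the scalar f() with respect to array A."""
    G = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old           # restore the original entry
        G[idx] = (f_plus - f_minus) / (2 * eps)
    return G

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
C = rng.normal(size=(3, 5))    # fixed weights, so that dL/dY = C

loss = lambda: np.sum(C * (W @ X))

dW, dX = matmul_backward(W, X, C)
dW_num = numerical_grad(loss, W)
dX_num = numerical_grad(loss, X)
```

Because the test loss is linear in $Y$, the upstream gradient is simply `C`, which keeps the check independent of any particular loss function.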

#### 0. Input and embedding

Let:

* $I \in \mathbb{Z}^{T}$: token indices of the input sequence
* Embedding matrix: $E_{\text{lookup}} \in \mathbb{R}^{d_{\text{token}} \times d_{\text{model}}}$

Then:

$$
X = E_{\text{lookup}}[I] \in \mathbb{R}^{T \times d_{\text{model}}}
$$

This means that each token index in $I$ is used to select a row from $E_{\text{lookup}}$.

Note that this is a lookup, not a matmul (it could be written as a matmul if $I$ were one-hot encoded). Autograd engines track which rows were selected and backpropagate gradients only to those rows.
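As a rough illustration of the lookup and its backward pass, here is a small NumPy sketch. The sizes, the token indices, and the all-ones upstream gradient are made up for the example; a real autograd engine may implement the scatter differently.

```python
import numpy as np

vocab_size, d_model = 6, 4          # d_token and d_model in the text
E_lookup = np.arange(vocab_size * d_model, dtype=float).reshape(vocab_size, d_model)
I = np.array([2, 5, 2])             # token indices, T = 3 (token 2 appears twice)

# Forward: select one row of E_lookup per token
X = E_lookup[I]                     # shape (T, d_model)

# The same result written as a matmul with one-hot rows
one_hot = np.eye(vocab_size)[I]     # shape (T, vocab_size)
X_matmul = one_hot @ E_lookup

# Backward: scatter-add each upstream gradient row into the row that was read
dX = np.ones_like(X)                # stand-in for dL/dX
dE = np.zeros_like(E_lookup)
np.add.at(dE, I, dX)                # unbuffered add, so repeated indices accumulate
```

Rows of `dE` that were never selected stay zero, and the row for token 2 receives the sum of two gradient rows because that token occurs twice.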

#### 1. Linear Projections

We define three learnable projection matrices:

* $W_Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$
* $W_K \in \mathbb{R}^{d_{\text{model}} \times d_k}$
* $W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$

Then the projections are:

$$
Q = X W_Q \in \mathbb{R}^{T \times d_k}
$$

$$
K = X W_K \in \mathbb{R}^{T \times d_k}
$$

$$
V = X W_V \in \mathbb{R}^{T \times d_v}
$$

The dimension $d_k$ is usually taken to be even, both for RoPE and for hardware acceleration. Each token corresponds to a row of $Q$, $K$, and $V$.
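A minimal NumPy sketch of the three projections, with made-up sizes and random weights. The backward lines apply the matmul rule from section -1 to the row convention $Q = XW_Q$; the variable names `dW_Q` and `dX_from_Q` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 5, 16, 8, 8   # illustrative sizes; d_k is even

X = rng.normal(size=(T, d_model))    # one row per token
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q = X @ W_Q                          # (T, d_k)
K = X @ W_K                          # (T, d_k)
V = X @ W_V                          # (T, d_v)

# Backward through Q = X W_Q for an upstream gradient dQ = dL/dQ
dQ = rng.normal(size=Q.shape)        # stand-in for the real upstream gradient
dW_Q = X.T @ dQ                      # (d_model, d_k)
dX_from_Q = dQ @ W_Q.T               # (T, d_model); K and V contribute similar terms
```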

#### 2. RoPE

Positional embeddings are necessary to make the model position-aware. For each token position $t\in\{1,\cdots,T\}$ and each dimension pair $i\in\{1,\cdots, d_k/2\}$ (indices start from 1), define

$$
q_{t,i} = \left[\begin{matrix}
Q_{t, 2i-1}\\ Q_{t,2i}
\end{matrix}\right],\quad
k_{t,i} = \left[\begin{matrix}
K_{t, 2i-1}\\ K_{t,2i}
\end{matrix}\right]
$$

and, writing $\omega_i$ for the rotation frequency of pair $i$, define

$$
R(i, t) = \left[\begin{matrix}
\cos(\omega_i t) & -\sin(\omega_i t)\\
\sin(\omega_i t) & \cos(\omega_i t)
\end{matrix}\right]
$$

so that

$$
q'_{t,i}=R(i, t)q_{t,i},\quad k'_{t,i}=R(i, t)k_{t,i}
$$

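and we write the rotated pairs back into $Q'$ and $K'$.

We use RoPE mainly because the dot product of a rotated query at position $t$ and a rotated key at position $s$ depends only on the original vectors and the relative position $s-t$:

$$
\left(R(i,t)q_{t,i}\right)^\top R(i,s)k_{s,i} = q_{t,i}^\top R(i,s-t)k_{s,i}
$$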
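The rotation and its relative-position property can be sketched in NumPy as follows. The text does not specify $\omega_i$, so the geometric schedule below (the one used in the RoPE paper) is an assumption, as are the helper names `rope` and `score`.

```python
import numpy as np

def rope(M, omega):
    """Rotate each pair (M[t, 2i-1], M[t, 2i]) (1-based) by the angle omega_i * t."""
    T, d = M.shape
    t = np.arange(1, T + 1)[:, None]      # positions 1..T, shape (T, 1)
    angle = t * omega[None, :]            # shape (T, d/2)
    cos, sin = np.cos(angle), np.sin(angle)
    a, b = M[:, 0::2], M[:, 1::2]         # first / second element of each pair
    out = np.empty_like(M)
    out[:, 0::2] = cos * a - sin * b
    out[:, 1::2] = sin * a + cos * b
    return out

d_k = 8
# Assumed frequencies: omega_i = 10000^(-2(i-1)/d_k), i = 1..d_k/2
omega = 10000.0 ** (-2 * np.arange(d_k // 2) / d_k)

rng = np.random.default_rng(0)
q = rng.normal(size=d_k)
k = rng.normal(size=d_k)

def score(pos_q, pos_k):
    """Dot product of q placed at pos_q and k placed at pos_k after RoPE."""
    T = max(pos_q, pos_k)
    Qp = rope(np.tile(q, (T, 1)), omega)
    Kp = rope(np.tile(k, (T, 1)), omega)
    return Qp[pos_q - 1] @ Kp[pos_k - 1]

s1 = score(5, 2)    # relative offset of 3
s2 = score(9, 6)    # same offset, both positions shifted by 4
```

The two scores should agree up to floating-point error, since only the offset between the positions enters the dot product.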

#### 3. Naive Attention

We compute the attention scores:

$$
A = \frac{Q' K'^\top}{\sqrt{d_k}} \in \mathbb{R}^{T \times T}
$$

Here $A_{ij}$ is the scaled dot product of the query of token $i$ and the key of token $j$, so the $i$-th row of $A$ holds the scores of every token's key against the query of token $i$. We apply a row-wise softmax to convert these scores into weights that add to 1:

$$
\alpha_{ij} = \frac{\exp\left( A_{ij} \right)}{\sum_{k=1}^{T} \exp\left( A_{ik} \right)}
$$

We then use these weights to take a weighted average of the rows of $V$:

$$
O = \alpha V \in \mathbb{R}^{T\times d_v}
$$
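The three steps above (scores, row-wise softmax, weighted average) can be sketched in NumPy as follows. The function names and the random inputs are illustrative; subtracting the row maximum inside the softmax is a standard stability trick that does not change the result.

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    Z = np.exp(A - A.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def naive_attention(Qp, Kp, V):
    """Return scores A, weights alpha and output O for rotated Q', K' and V."""
    d_k = Qp.shape[1]
    A = Qp @ Kp.T / np.sqrt(d_k)      # (T, T) scores
    alpha = softmax_rows(A)           # each row sums to 1
    O = alpha @ V                     # (T, d_v)
    return A, alpha, O

rng = np.random.default_rng(0)
T, d_k, d_v = 4, 8, 6
Qp = rng.normal(size=(T, d_k))        # stand-in for Q'
Kp = rng.normal(size=(T, d_k))        # stand-in for K'
V = rng.normal(size=(T, d_v))

A, alpha, O = naive_attention(Qp, Kp, V)
```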

#### 4. Masks

Before applying the softmax, we may need to mask out certain entries in $A$ to enforce the desired behaviour.

(a) Causal Mask: For auto-regressive generation, the output of token $i$ should only use the values of tokens $j\le i$, since future tokens are not available at inference time. Define

$$
(M_c)_{ij}=\begin{cases}
0, & j\le i\\
-\infty, &j > i
\end{cases}
$$

and let $A'=A+M_c$. After the softmax, this effectively sets the weights of the masked entries to zero.

(b) Padding Mask: Padding tokens carry no meaning, so when they are used we mask out every entry whose row or column index corresponds to a padding token.
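Both masks can be sketched as additive matrices in NumPy. The helper names are hypothetical. As a simplifying assumption, the padding mask below only masks key columns: masking an entire query row with $-\infty$ would leave the softmax with no finite entry, so rows belonging to padded queries are left alone and assumed to be ignored downstream.

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax; exp(-inf) evaluates to exactly 0."""
    Z = np.exp(A - A.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def causal_mask(T):
    """M_c: 0 where j <= i, -inf where j > i."""
    M = np.zeros((T, T))
    M[np.triu_indices(T, k=1)] = -np.inf
    return M

def padding_mask(is_pad):
    """-inf in every column whose key token is padding (assumption: keys only)."""
    is_pad = np.asarray(is_pad, dtype=bool)
    M = np.zeros((len(is_pad), len(is_pad)))
    M[:, is_pad] = -np.inf
    return M

rng = np.random.default_rng(0)
T = 4
A = rng.normal(size=(T, T))
is_pad = np.array([False, False, False, True])   # last token is padding

A_masked = A + causal_mask(T) + padding_mask(is_pad)
alpha = softmax_rows(A_masked)
```

With right padding, every row keeps at least one finite score, so each row of `alpha` is still a valid set of weights.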

For the autograd engine, the gradient will only flow to non-masked entries. Since $O=\alpha V$, the matmul rule gives

$$
\dfrac{\partial L}{\partial \alpha} = \dfrac{\partial L}{\partial O} V^T
$$

$$
\dfrac{\partial L}{\partial A'_{ij}}=\sum_{ab}\dfrac{\partial L}{\partial \alpha_{ab}}\dfrac{\partial \alpha_{ab}}{\partial A'_{ij}}
$$

Since the softmax was applied row-wise, within the same row we have

$$
\dfrac{\partial \alpha_{ij}}{\partial A'_{ik}}= \begin{cases}
\alpha_{ij}(1-\alpha_{ij}),& j = k\\
-\alpha_{ij}\alpha_{ik},& j\neq k\\
\end{cases}
$$

therefore

$$
\dfrac{\partial L}{\partial A'_{ij}}=\sum_b\dfrac{\partial L}{\partial \alpha_{ib}}\dfrac{\partial \alpha_{ib}}{\partial A'_{ij}}
$$

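Substituting the softmax derivative, $\alpha_{ij}$ factors out:

$$
\dfrac{\partial L}{\partial A'_{ij}}=\alpha_{ij}\left(\dfrac{\partial L}{\partial \alpha_{ij}}-\sum_b\dfrac{\partial L}{\partial \alpha_{ib}}\alpha_{ib}\right)
$$

so for a masked entry, where $\alpha_{ij}=0$, the gradient is zero.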
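A small NumPy sketch of the row-wise softmax backward pass, written so that the common factor $\alpha_{ij}$ is explicit. The function name `softmax_backward` and the random stand-in for $\partial L/\partial\alpha$ are assumptions; one row is cross-checked against the explicit Jacobian $\mathrm{diag}(\alpha_i)-\alpha_i\alpha_i^\top$.

```python
import numpy as np

def softmax_rows(A):
    Z = np.exp(A - A.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def softmax_backward(alpha, d_alpha):
    """dL/dA'_ij = alpha_ij * (dL/dalpha_ij - sum_b dL/dalpha_ib * alpha_ib)."""
    inner = np.sum(d_alpha * alpha, axis=1, keepdims=True)
    return alpha * (d_alpha - inner)

rng = np.random.default_rng(0)
T = 4
A_masked = rng.normal(size=(T, T))
A_masked[np.triu_indices(T, k=1)] = -np.inf    # causal mask: j > i

alpha = softmax_rows(A_masked)
d_alpha = rng.normal(size=(T, T))              # stand-in for dL/dalpha
dA_masked = softmax_backward(alpha, d_alpha)

# Cross-check row i against the explicit Jacobian of the softmax
i = 2
J = np.diag(alpha[i]) - np.outer(alpha[i], alpha[i])   # J[b, j] = d alpha_ib / d A'_ij
row_via_jacobian = J.T @ d_alpha[i]
```

Masked positions have $\alpha_{ij}=0$, so the product with `alpha` zeroes their gradient regardless of the upstream values.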

For $A$, since $A'=A+M_c$,

$$
\dfrac{\partial L}{\partial A_{ij}}=\sum_{ab}\dfrac{\partial L}{\partial A'_{ab}}\dfrac{\partial A'_{ab}}{\partial A_{ij}}= \dfrac{\partial L}{\partial A'_{ij}}=\dfrac{\partial L}{\partial (M_c)_{ij}}
$$

Note that $M_c$ is not learnable, so we turn off gradient tracking for it.
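The whole backward chain from $O$ to $A$ can be checked against finite differences. This is a minimal NumPy sketch under assumed shapes and a linear test loss $L=\sum_{ij}C_{ij}O_{ij}$, so that $\partial L/\partial O = C$; the mask is treated as a constant and receives no gradient.

```python
import numpy as np

def softmax_rows(A):
    Z = np.exp(A - A.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def forward(A, M, V, C):
    """Return the scalar loss sum(C * O) and alpha, where O = softmax(A + M) V."""
    alpha = softmax_rows(A + M)
    return np.sum(C * (alpha @ V)), alpha

rng = np.random.default_rng(0)
T, d_v = 4, 3
A = rng.normal(size=(T, T))
V = rng.normal(size=(T, d_v))
C = rng.normal(size=(T, d_v))             # plays the role of dL/dO
M = np.zeros((T, T))
M[np.triu_indices(T, k=1)] = -np.inf      # causal mask M_c, a constant

# Analytic backward pass
_, alpha = forward(A, M, V, C)
d_alpha = C @ V.T                         # dL/dalpha = dL/dO V^T
dA_prime = alpha * (d_alpha - np.sum(d_alpha * alpha, axis=1, keepdims=True))
dA = dA_prime                             # A' = A + M_c, so dL/dA = dL/dA'

# Finite-difference check, one entry of A at a time
eps = 1e-6
dA_num = np.zeros_like(A)
for idx in np.ndindex(T, T):
    A_plus, A_minus = A.copy(), A.copy()
    A_plus[idx] += eps
    A_minus[idx] -= eps
    dA_num[idx] = (forward(A_plus, M, V, C)[0] - forward(A_minus, M, V, C)[0]) / (2 * eps)
```

Perturbing a masked entry of $A$ leaves $A'$ at $-\infty$, so the numerical gradient there is zero as well, matching the analytic result.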
