
# Equations for Advanced Machine Learning

## Transformers

Dimensions of the original paper:

$d_k$ = 64, dimension of the query and the key <br>
$d_v$ = 64, dimension of the value<br>
$d_{model}$ = 512, dimension of the model (embedding size) <br>
$h$ = 8, number of attention heads <br>
$T_x$: variable sequence length; number of keys to compare the query with <br>

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

$Q$ = query  <br>
$K$ = key <br>
$V$ = value <br>
$d_k$ = dimension of the keys  <br>
$K^T$ = transpose of $K$ (so that the matrix product $QK^T$ is defined)
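
As a minimal sketch of the attention formula above, the following pure-Python function computes $\text{softmax}(QK^T/\sqrt{d_k})V$ on small list-of-lists matrices. The function names `softmax` and `attention` and the list-based layout are assumptions made for illustration; a real implementation would use a tensor library and batched matrix products.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for list-of-lists matrices.

    Q: (n_queries x d_k), K: (T_x x d_k), V: (T_x x d_v).
    Returns an (n_queries x d_v) matrix.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for q in Q:
        # One score per key: (q . k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # non-negative, sums to 1 over the T_x keys
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out


# A query that is orthogonal to every key attends to all keys equally.
result = attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

With equal scores the softmax weights are uniform, so `result` is the plain average of the two value rows.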


$\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)W_O$

$W_O \in \mathbb{R}^{h d_v \times d_{model}}$ 

With the dimensions above, $h d_v = 8 \cdot 64 = 512 = d_{model}$, so $W_O$ has shape $512 \times 512$.

### Positional Encoding

$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

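Equivalently, for a single dimension index $i$ (with $d = d_{model}$): <br>
$PE_i: (pos, i) \rightarrow \sin\left(\frac{pos}{10000^{i/d}}\right)$ if $i$ is even <br>
$PE_i: (pos, i) \rightarrow \cos\left(\frac{pos}{10000^{(i-1)/d}}\right)$ otherwise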

Lecture slides (p. 67): pairwise encoding, with model dimension $d = 512$. <br>
The first four entries (dimensions $i = 0, 1, 2, 3$) are:

$\sin\left(\frac{pos}{10000^{0/d}}\right), \cos\left(\frac{pos}{10000^{0/d}}\right), \sin\left(\frac{pos}{10000^{2/d}}\right), \cos\left(\frac{pos}{10000^{2/d}}\right)$
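
The pairwise pattern can be sketched in a few lines of Python. The function name `positional_encoding` and the plain-list return type are assumptions for illustration; the exponent rule follows the even/odd definition given above.

```python
import math


def positional_encoding(pos, d_model=512):
    """Return the d_model-dimensional sinusoidal encoding of position `pos`."""
    pe = []
    for i in range(d_model):
        # Dimensions 2i and 2i+1 share the same exponent 2i / d_model,
        # so an odd index reuses the exponent of the even index before it.
        exponent = (i - i % 2) / d_model
        angle = pos / (10000 ** exponent)
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


# Position 0 gives sin(0) = 0 in even dimensions and cos(0) = 1 in odd ones.
pe0 = positional_encoding(0, d_model=8)
```

Each sine/cosine pair shares one wavelength, and the wavelengths grow geometrically with the dimension index.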




## Reinforcement Learning

### N-armed bandit

$$Q(a) \leftarrow Q(a) + \alpha \cdot (R - Q(a))$$

- Q(a) is the current estimate of the action value for action 'a'.
- α (alpha) is the step size or learning rate, determining the weight given to new information. It controls the rate at which the agent updates its value estimates.
- R is the reward received by taking action 'a'.
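
A minimal sketch of this update rule, assuming a hypothetical helper `update_estimate` and a made-up reward sequence:

```python
def update_estimate(q, reward, alpha):
    """Move the estimate q a fraction alpha of the way toward the reward."""
    return q + alpha * (reward - q)


# Made-up rewards for a single action, starting from an estimate of 0.
q = 0.0
for reward in [1.0, 1.0, 0.0, 1.0]:
    q = update_estimate(q, reward, alpha=0.5)
```

With a constant step size, recent rewards carry more weight than old ones, which is why the single reward of 0 pulls the estimate down noticeably.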



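Upper-confidence-bound (UCB) action selection:

$$ A_t = \underset{a \in A}{\operatorname{arg\,max}} \left( Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right)$$

- $N_t(a)$ is the number of times action 'a' has been selected before time t.
- c > 0 controls the degree of exploration.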
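
The selection rule can be sketched as follows. The function name `ucb_action`, the default `c=2.0`, and the convention of trying untried actions first (where the count is zero and the bonus is undefined) are assumptions of this sketch.

```python
import math


def ucb_action(q_values, counts, t, c=2.0):
    """Return the index of the action maximising Q_t(a) + c * sqrt(ln t / N_t(a))."""
    # An action that was never selected has an unbounded bonus: try it first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


# Action 1 has a lower estimate but was tried only once, so its bonus wins.
chosen = ucb_action([0.5, 0.45], [1000, 1], t=1001)
```

The bonus shrinks as an action's count grows, so rarely tried actions are revisited until their estimates are trustworthy.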

### Bellman equation

**state value:**
$$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$

$G_t$ = return (i.e., cumulative discounted reward) following time t

$$v_{\pi}(s) = \sum_{a \in A(s)} \pi(a|s) \sum_{s' \in S, r \in R} p(s', r|s,a) [r(s,a) + \gamma v_{\pi}(s')]$$


**action value (Bellman optimality equation):**

$$q_*(s,a) = \sum_{s' \in S, r \in R} p(s', r|s,a) [r(s,a) + \gamma \max_{a'} q_*(s', a')]$$



### Policy iteration (slides 36)

**state value:**

$$v_{\pi}(s) = \sum_{a \in A(s)} \pi(a|s) \sum_{s' \in S, r \in R} p(s', r|s,a) [r(s,a) + \gamma v_{\pi}(s')]$$

The first sum ranges over the entire expression:

$$v_{\pi}(s) = \sum_{a \in A(s)} \left(\pi(a|s) \sum_{s' \in S, r \in R} p(s', r|s,a) [r(s,a) + \gamma v_{\pi}(s')]\right)$$

The first sum runs over all actions available in state $s$; the second runs over all successor states $s'$ and rewards $r$.
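
The nesting of the two sums is easiest to see in code. Below is a minimal sketch of iterative policy evaluation, the evaluation half of policy iteration. The dictionary layout of `P` and `policy`, the function name, and the two-state MDP are all hypothetical; the reward `r` stored in each outcome triple plays the role of the reward term in the bracket.

```python
def policy_evaluation(P, policy, gamma=0.9, tol=1e-10):
    """Repeat the Bellman expectation backup until the values stop changing.

    P[s][a] is a list of (probability, next_state, reward) triples and
    policy[s][a] is pi(a|s); both layouts are assumptions of this sketch.
    """
    v = {s: 0.0 for s in P}
    while True:
        new_v = {}
        for s in P:
            # Outer sum: actions a, weighted by pi(a|s).
            # Inner sum: outcomes (s', r), weighted by p(s', r | s, a).
            new_v[s] = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            return v


# Hypothetical MDP: in "A" the agent can "stay" (reward 1) or "quit" (reward 0).
P = {
    "A": {"stay": [(1.0, "A", 1.0)], "quit": [(1.0, "end", 0.0)]},
    "end": {},  # terminal state: no actions, so its value stays 0
}
policy = {"A": {"stay": 0.5, "quit": 0.5}}
v = policy_evaluation(P, policy)
```

For this toy MDP the fixed point can be checked by hand: $v(A) = 0.5\,(1 + 0.9\,v(A))$, so $v(A) = 0.5 / 0.55 \approx 0.909$.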

### Value iteration (book p.83)

**state value**
$$v_{k + 1}(s) = \max_{a} \sum_{s' \in S, r \in R} p(s', r|s,a) [r(s,a) + \gamma v_k(s')]$$





Take the Bellman optimality equation and turn it into an iterative update, indexed by $k$. Feeding each estimate $v_k$ back into the right-hand side brings $v_{k+1}$ closer and closer to the optimal value function. Compared with policy evaluation, the sum over the policy's action probabilities is replaced by a maximum over actions.

We can also write it without the k:

$$v(s) = \max_{a} \sum_{s' \in S, r \in R} p(s', r|s,a) [r(s,a) + \gamma v(s')]$$
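
A minimal sketch of the update with the index $k$ made explicit: `v` holds $v_k$ and `new_v` holds $v_{k+1}$. The dictionary layout of `P`, the function name, and the two-state MDP are hypothetical, chosen only to keep the example self-contained.

```python
def value_iteration(P, gamma=0.9, tol=1e-10):
    """Apply the Bellman optimality backup until the values stop changing.

    P[s][a] is a list of (probability, next_state, reward) triples;
    this layout is an assumption of this sketch.
    """
    v = {s: 0.0 for s in P}
    while True:
        new_v = {}
        for s in P:
            if not P[s]:
                new_v[s] = 0.0  # terminal state: nothing to maximise over
                continue
            # Maximum over actions replaces the policy-weighted sum.
            new_v[s] = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
        delta = max(abs(new_v[s] - v[s]) for s in P)
        v = new_v
        if delta < tol:
            return v


# Hypothetical MDP: in "A" the agent can "stay" (reward 1) or "quit" (reward 0).
P = {
    "A": {"stay": [(1.0, "A", 1.0)], "quit": [(1.0, "end", 0.0)]},
    "end": {},
}
v_star = value_iteration(P)
```

Here staying forever is optimal, so the value of "A" approaches the geometric series $1 / (1 - \gamma) = 10$.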



#### Policy vs. Value iteration

The main advantage of value iteration over policy iteration is that it can converge faster when the state space is large. In problems with smaller state spaces (like the Gridworld), policy iteration converges in fewer steps, as computing a good approximation of the value function is cheap. Another advantage of value iteration is that it relies only on the Bellman optimality equation, whereas policy iteration requires both the Bellman expectation equation and a greedy policy improvement step. This alternation between policy evaluation and policy improvement requires an additional loop, but tends to make the training process of policy iteration more stable (i.e. less oscillation).
Both algorithms can be applied when the MDP is known and the environment is finite, i.e. the sets of actions, states and rewards are all finite. Common applications are certain board games, video games, and the control and navigation of robots that can be modelled as Markov Decision Processes.

## Diffusion models

## ELBO Variational Autoencoders

$$ \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right] = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z}))$$

$q(x_t|x_{t-1})$ is the forward diffusion process <br>
$q_\phi(z|x)$ tells us how to convert data samples $x$ to some latent variables $z$ → encoder

$p(x_{t-1}|x_t)$ is the reverse denoising process (this is what we want to learn) <br>
$p_\theta(x|z)$ tells us how to convert latent variables $z$ back to the original data samples $x$ → decoder

$\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]$ reconstruction term (maximise)

$D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})||p(\mathbf{z}))$ prior matching term (minimise)
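
A short derivation sketch of why the ELBO $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]$ splits into these two terms. It assumes the joint factorises as $p(\mathbf{x},\mathbf{z}) = p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})$, which is the usual VAE generative model but is not stated explicitly in these notes:

$$\begin{aligned}
\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]
&= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] + \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] \\
&= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z}))
\end{aligned}$$

The last step uses the definition $D_{KL}(q\,\|\,p) = \mathbb{E}_{q}\left[\log \frac{q}{p}\right]$, which flips the sign of the second expectation.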


https://yunfanj.com/blog/2021/01/11/ELBO.html

## ELBO Hierarchical Variational Autoencoders

$$ \mathbb{E}_{q_\phi(z_{1:T}|x)}\left[\log \frac{p(x,z_{1:T})}{q_\phi(z_{1:T}|x)}\right] = \mathbb{E}_{q_\phi(z_{1:T}|x)}\left[\log \frac{p(z_T)\,p_\theta(x|z_1)\prod_{t=2}^T p_\theta(z_{t-1}|z_t)}{q_\phi(z_1|x)\prod_{t=2}^T q_\phi(z_t|z_{t-1})}\right]$$

## ELBO Diffusion models

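$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}[\log p_{\theta}(\mathbf{x}_0|\mathbf{x}_1)] - D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)) - \sum_{t = 2}^T \mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\left[D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_t))\right]$$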

$x_0$ = ground-truth (GT) data

$x_1$ to $x_T$ = latent variables (aka z)