# Learning a Probability Distribution

Here we will try to learn a probability distribution. To do so we need to define an objective function. A common choice is the Kullback-Leibler divergence (KL divergence) between the true distribution $p$ and the learned distribution $q$. The KL divergence is defined as

$$
\mathbf{KL}(p || q) = \int p(x) \log \frac{p(x)}{q(x)} dx
$$
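To make the definition concrete, here is a minimal sketch that approximates the integral numerically for two 1-D Gaussians and compares it with the known closed-form KL divergence between Gaussians. The specific densities, the integration range, and all function names are illustrative assumptions, not part of the text above.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def kl_numeric(p, q, lo=-20.0, hi=20.0, n=20_000):
    """Approximate KL(p || q) = int p(x) log(p(x) / q(x)) dx with the midpoint rule."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        px = p(x)
        if px > 0.0:  # points where p(x) = 0 contribute nothing to the integral
            total += px * math.log(px / q(x)) * dx
    return total

def kl_gaussians(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1^2) || N(m2, s2^2)), used only as a reference value."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

# Example densities (arbitrary choices for illustration).
p = lambda x: gaussian_pdf(x, 0.0, 1.0)
q = lambda x: gaussian_pdf(x, 1.0, 2.0)

kl_pq = kl_numeric(p, q)  # KL(p || q)
kl_qp = kl_numeric(q, p)  # KL(q || p), generally a different number
```

Comparing `kl_pq` and `kl_qp` shows that the KL divergence is not symmetric in its arguments, and `kl_numeric(p, p)` evaluates to zero.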

Let's say there is a family of densities $q_\theta(x)$ parameterized by $\theta$. We can then look for the member of this family that is closest to $p$ by solving the minimization problem

$$
\min_\theta \mathbf{KL}(p || q_\theta) = \min_\theta \int p(x) \log \frac{p(x)}{q_\theta(x)} dx
$$

In this form the objective is not very useful, because evaluating the integrand requires the true density $p(x)$, which we do not know.

However, the KL divergence can be rewritten as

$$
\mathbf{KL}(p || q_\theta) = \mathbb{E}_{x \sim p} \left[ \log \left( \frac{p(x)}{q_\theta(x)} \right) \right]
$$

because the expectation of a function $f(x)$ with respect to a distribution $p(x)$ is defined as

$$
\mathbb{E}_{x \sim p} [f(x)] = \int f(x) p(x) dx.
$$

Using the linearity of the expectation and $\log \frac{a}{b} = \log a - \log b$, we can rewrite the KL divergence as

$$
\mathbf{KL}(p || q_\theta) = \mathbb{E}_{x \sim p} [ \log p(x) ] - \mathbb{E}_{x \sim p} [ \log q_\theta(x) ]
$$
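As a quick numerical check of this decomposition, the sketch below estimates both expectations with sample averages. It assumes, purely for illustration, that $p$ is a standard Gaussian (so that $\log p$ can be evaluated here) and that $q_\theta = \mathcal{N}(\theta, 1)$; the function names are hypothetical.

```python
import math
import random

def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def average(values):
    return sum(values) / len(values)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20_000)]  # x ~ p, with p = N(0, 1) assumed

# First term: E_{x~p}[log p(x)]. It is computed once and never sees theta.
entropy_term = average([gaussian_logpdf(x, 0.0, 1.0) for x in samples])

def cross_term(theta):
    """Second term: E_{x~p}[log q_theta(x)] for q_theta = N(theta, 1)."""
    return average([gaussian_logpdf(x, theta, 1.0) for x in samples])

def kl_estimate(theta):
    """Direct estimate of E_{x~p}[log(p(x) / q_theta(x))]."""
    return average(
        [gaussian_logpdf(x, 0.0, 1.0) - gaussian_logpdf(x, theta, 1.0) for x in samples]
    )
```

For this particular pair of Gaussians the exact value is $\theta^2 / 2$, so `kl_estimate(1.0)` should land close to `0.5`, and it agrees with `entropy_term - cross_term(1.0)` up to floating-point rounding.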

Since $\theta$ only appears in the second term, the training objective becomes

$$
\min_\theta - \mathbb{E}_{x \sim p} [ \log q_\theta(x) ].
$$

This is very useful: although we cannot evaluate $p(x)$, we can sample from it, approximate the expectation by an average over those samples, and minimize the resulting objective to find the parameters $\theta$ of the distribution $q_\theta(x)$.
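A minimal sketch of this sample-based training, assuming a Gaussian family $q_\theta = \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \log \sigma)$ and plain gradient descent. The sampler standing in for $p$, the learning rate, and the step count are arbitrary illustrative choices.

```python
import math
import random

rng = random.Random(0)
# Stand-in for the unknown p(x): the training code only ever sees these samples.
samples = [rng.gauss(2.0, 0.5) for _ in range(2_000)]

def nll(mu, log_sigma):
    """Sample average of -log q_theta(x), an estimate of -E_{x~p}[log q_theta(x)]."""
    sigma = math.exp(log_sigma)
    return sum(
        0.5 * ((x - mu) / sigma) ** 2 + log_sigma + 0.5 * math.log(2.0 * math.pi)
        for x in samples
    ) / len(samples)

def nll_grad(mu, log_sigma):
    """Gradient of nll with respect to (mu, log_sigma), derived by hand."""
    sigma2 = math.exp(2.0 * log_sigma)
    n = len(samples)
    g_mu = sum(-(x - mu) / sigma2 for x in samples) / n
    g_log_sigma = sum(1.0 - (x - mu) ** 2 / sigma2 for x in samples) / n
    return g_mu, g_log_sigma

mu, log_sigma = 0.0, 0.0  # initial guess for theta
initial_loss = nll(mu, log_sigma)
for _ in range(500):
    g_mu, g_log_sigma = nll_grad(mu, log_sigma)
    mu -= 0.1 * g_mu
    log_sigma -= 0.1 * g_log_sigma

sigma = math.exp(log_sigma)
final_loss = nll(mu, log_sigma)
```

For a Gaussian family the minimizer is known in closed form (the sample mean and the biased sample standard deviation), so gradient descent is used here only to mirror how the objective would be minimized for a more flexible $q_\theta$.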

In a conditional setting, where we learn $q_\theta(x|y)$ and average over conditions $y \sim p(y)$, the objective becomes

$$
\min_\theta - \mathbb{E}_{y \sim p(y)} [ \mathbb{E}_{x \sim p(x|y)} [ \log q_\theta(x|y) ] ].
$$


which, using Bayes' rule in the form $p(y)\,p(x|y) = p(x)\,p(y|x)$, can be rewritten as

$$
\min_\theta - \mathbb{E}_{x \sim p(x), y \sim p(y|x)} [ \log q_\theta(x|y) ]
$$

where $p(y|x)$ is the conditional distribution of $y$ given $x$, which can be sampled using a numerical simulation.
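The conditional objective can be sketched on a toy problem where the answer is known. The example below assumes $x \sim \mathcal{N}(0, 1)$ and a hypothetical simulator `simulate` that adds Gaussian noise, so pairs are drawn as $x \sim p(x)$ followed by $y \sim p(y|x)$. The model is assumed to be $q_\theta(x|y) = \mathcal{N}(a y + b, s^2)$; for this family the minimizer of the sample objective has a closed form (least squares for $a, b$ and the mean squared residual for $s^2$), which is used here instead of an iterative optimizer.

```python
import random

rng = random.Random(0)
NOISE_STD = 0.5  # assumed noise level of the toy simulator

def simulate(x):
    """Toy stand-in for a numerical simulation: draws y ~ p(y | x)."""
    return x + NOISE_STD * rng.gauss(0.0, 1.0)

# Joint samples: x ~ p(x), then y ~ p(y | x).
xs = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
ys = [simulate(x) for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n

# Minimizer of the sample average of -log q_theta(x | y) for the linear-Gaussian family.
a = cov_xy / var_y
b = mean_x - a * mean_y
s2 = sum((x - (a * y + b)) ** 2 for x, y in zip(xs, ys)) / n

# Exact posterior p(x | y) for this toy problem: N(exact_a * y, exact_var).
exact_a = 1.0 / (1.0 + NOISE_STD ** 2)
exact_var = NOISE_STD ** 2 / (1.0 + NOISE_STD ** 2)
```

With these settings the fitted `a`, `b`, and `s2` should be close to `exact_a`, `0`, and `exact_var`, even though the posterior $p(x|y)$ was never sampled directly.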


## Normalizing Flows

Now how exactly do we learn a distribution? One way is to take a simple base distribution $p_{\mathbf{Z}}$ and transform it into a potentially complex distribution $p_{\mathbf{Y}}$.

![Normalizing Flows](prob01-cnf.jpg)

Given an invertible mapping $g$ and its inverse $f = g^{-1}$, we have $y=f(z)$ and $z=g(y)$. The cool thing about this setup is that we can compute the probability density function $p_{\mathbf{Y}}$ using the change of variables formula:

$$
p_{\mathbf{Y}}(y) = p_{\mathbf{Z}}(z) \left| \det \frac{\partial g}{\partial y} \right| = p_{\mathbf{Z}}(g(y)) \left| \det \frac{\partial g}{\partial y} \right|
$$

Since our initial distribution $p_{\mathbf{Z}}$ is simple, its pdf is usually known and easy to compute.
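A minimal one-dimensional sketch of the change of variables formula, assuming a standard Gaussian base distribution and the illustrative choice $y = f(z) = e^z$, $z = g(y) = \log y$. In 1-D the determinant reduces to $|dg/dy| = 1/y$. All function names are hypothetical.

```python
import math
import random

def base_logpdf(z):
    """Log-density of the base distribution p_Z, here a standard Gaussian."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def f(z):
    """Maps a base sample z to y (illustrative choice: y = exp(z))."""
    return math.exp(z)

def g(y):
    """Inverse of f: maps y back to z = log(y)."""
    return math.log(y)

def log_abs_det_dg_dy(y):
    """log |dg/dy| = log(1 / y) for g(y) = log(y), valid for y > 0."""
    return -math.log(y)

def flow_logpdf(y):
    """log p_Y(y) = log p_Z(g(y)) + log |dg/dy|."""
    return base_logpdf(g(y)) + log_abs_det_dg_dy(y)

def sample_flow(rng):
    """Draw y ~ p_Y by sampling z ~ p_Z and applying f."""
    return f(rng.gauss(0.0, 1.0))

# Sanity check: p_Y should integrate to (almost) one over y in (0, 100).
n, upper = 100_000, 100.0
dy = upper / n
total_mass = sum(math.exp(flow_logpdf((i + 0.5) * dy)) * dy for i in range(n))

rng = random.Random(0)
ys = [sample_flow(rng) for _ in range(20_000)]
```

With this choice of $f$, $p_{\mathbf{Y}}$ is the log-normal density. Working with log-densities and the log-determinant, as done here, is a common way to keep the computation numerically stable.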