<img src="https://drive.google.com/uc?id=1cXtXdAcwedVDbapmz1pj_hULsQrhEcff" width="500"/>

---


# **Probability for Deep Learning**

#### **Probability contents/agenda**

1. Why do we care about probability?
2. A more formal recap of probability
3. Maximum Likelihood Estimation
4. Comparison of probability density functions

#### **Learning outcomes**

1. Become aware of uniform, Gaussian, and Bernoulli distributions.
2. Understand the role of maximum likelihood estimation in the context of Machine Learning.
3. Understand how to compare probability density functions.


<br/>

---

<br/>

## 1. Why do we care about probability?

So far, we have been looking at how we can use different network architectures to *learn* to distinguish aspects of different datasets. For example, we have used FFNs and CNNs to classify different digits in the MNIST dataset and characters in the KMNIST dataset.

<br>

<p align = "center"><img src="https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png" width="400"/></p><p align = "center">
<i>MNIST dataset: 60k training & 10k test images</i>
</p>

<br>

Intuitively, we can see that the networks we have trained have encoded, in some way, the different ways we can write every digit from zero to nine. Given some new digit, then, these networks are using fine-tuned weights and biases to answer the question: *out of all the types of digits I have seen, which type is this new image most likely to belong to?*

Following this intuition, we could say that the network has learned the **probability** that an image of a digit belongs to each of the families of digits (from zero to nine) that exist in the dataset. Or, in other words, we could say that our network has learned the **probability distribution** for every class of digit in our dataset.

In the next few lectures, we will be looking at how we can use the capacity of neural networks to learn a dataset's probability distribution in order to generate new, original data samples:

<br>

<center><img src="https://drive.google.com/uc?id=1gNMlFLoNXcWcIpOaLfpH9mp6aoVIw0a3" width="800"/></center>

<br>

<br>

<center><img src="https://drive.google.com/uc?id=1wDruJEzfbTeSKBD7MRYYIJhWqYbEOgyC" width="780"/></center>

<br>

We call these types of models **Deep Generative Models**. In particular, we will be looking at:

- Variational AutoEncoders (VAEs)
- Generative Adversarial Networks (GANs)
- Diffusion Models
- Transformers (can be used as generative models)

<br/>

---

<br/>


## 2. A more formal recap of probability

**Remember:** You have already been introduced to probability in Lecture06 of the Computational Mathematics module. We will quickly recap some of the main concepts below.

### **Probability**

Probability, as a discipline, provides a means to talk about and to measure uncertainty.

There are two (philosophically different) ways to consider probability. We could think of it as the proportion of times a certain event occurs, or our degree of belief about an event occurring.

These two ways of thinking about probability are related to the [Frequentist](https://en.wikipedia.org/wiki/Frequentist_probability) and [Bayesian](https://en.wikipedia.org/wiki/Bayesian_probability) interpretations.


### **Random variables**

A random variable (usually denoted $X$) is a function that takes values (usually denoted $x$) depending on the outcome of a random experiment.


### **Probability distributions**


What is a probability distribution?

> A probability distribution is a mathematical function that describes all
the possible outcomes of an experiment as well as the likelihoods of
each possible outcome.

That is, a probability function maps the outcome of an experiment to its probability.

When dealing with continuous variables, we talk about the probability density function (PDF), while in the case of discrete variables we talk about the probability mass function (PMF).

#### *Example*

What is the probability distribution of the possible outcomes when
**throwing 1 die**:

<br>

<center><img src="https://drive.google.com/uc?id=1nzGHIqNS2iVxsWlxr70N40C9v63W1rzO" width="500"/></center>
<br>

<br>

<center><img src="https://drive.google.com/uc?id=1_m_5lYHFizu0UZ9eHaHhpoVr-el9esbD" width="500"/></center>
<br>

What is the probability distribution of the possible outcomes when
**throwing 2 dice**:

<br>

<center><img src="https://drive.google.com/uc?id=1GB0g3HIXaIshINGYky2IT5HYamxtJktj" width="500"/></center>
<br>

<br>

<center><img src="https://drive.google.com/uc?id=1hGM5sWFqphcN63RCk1QkCuHky1avBOe0" width="500"/></center>
<br>

What is the probability distribution of the possible outcomes when
**throwing an increasing number of dice**:

<br>

<center><img src="https://drive.google.com/uc?id=1mpKg4aXgyZ7j4YtlP1AC0H6hY0-U1xjQ" width="500"/></center>
<br>

Examples adapted from [this blog](https://www.cantorsparadise.com/what-to-expect-when-throwing-dice-and-adding-them-up-5231f3831d7).
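
The dice distributions above can be reproduced by brute-force enumeration. The snippet below is a minimal sketch, assuming fair six-sided dice; the helper name `dice_sum_pmf` is our own and is not part of any library.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def dice_sum_pmf(n_dice, sides=6):
    """Exact PMF of the sum of `n_dice` fair dice, by enumerating all outcomes."""
    counts = Counter(
        sum(roll) for roll in product(range(1, sides + 1), repeat=n_dice)
    )
    total = sides ** n_dice
    return {s: Fraction(c, total) for s, c in sorted(counts.items())}


pmf_one = dice_sum_pmf(1)  # flat: every face has probability 1/6
pmf_two = dice_sum_pmf(2)  # triangular: peaks at a sum of 7

# Text histogram for two dice (one '#' per favourable outcome out of 36).
for total, prob in pmf_two.items():
    print(f"sum = {total:2d}  P = {str(prob):>5}  {'#' * int(prob * 36)}")
```

Calling `dice_sum_pmf` with a larger `n_dice` shows the PMF of the sum becoming increasingly bell-shaped, as in the last figure.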

### **Probability, expected value and variance**

Given a random variable $X$ and a probability density function $f(x)$, we can calculate the probability that $a < X < b$:

$$
P(a < X < b) = \int_a^b f(x) dx
$$

and,

$$
\int_{-\infty}^{\infty} f(x) dx = 1
$$

Additionally, we often compute the expected value (or expectation) and variance. The expectation describes the average value and the variance describes the spread (amount of variability) around the expectation.

Formally, we define the expectation as:

$$
E(X) = \mu = \int_{-\infty}^{\infty} x f(x) dx
$$

We define the variance as:

$$
Var(X) = \sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x) dx = E[(X-\mu)^2] = E(X^2) - \mu^2
$$
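
These integrals can be checked numerically. The sketch below uses an example density chosen purely for illustration (an exponential density $f(x) = e^{-x}$ for $x \ge 0$, which is an assumption and not part of the lecture) and a simple midpoint rule; `integrate` is a hypothetical helper.

```python
import math


def integrate(g, lower, upper, n=100_000):
    """Midpoint-rule approximation of the integral of g over [lower, upper]."""
    h = (upper - lower) / n
    return h * sum(g(lower + (i + 0.5) * h) for i in range(n))


def f(x):
    """Example density (assumption for illustration): f(x) = exp(-x) for x >= 0."""
    return math.exp(-x)


# The density is zero for x < 0 and negligible beyond x = 50,
# so we truncate the infinite integration range to [0, 50].
lo, hi = 0.0, 50.0

total_mass = integrate(f, lo, hi)                    # should be close to 1
prob_1_2 = integrate(f, 1.0, 2.0)                    # P(1 < X < 2)
mean = integrate(lambda x: x * f(x), lo, hi)         # E(X)
variance = integrate(lambda x: (x - mean) ** 2 * f(x), lo, hi)
second_moment = integrate(lambda x: x ** 2 * f(x), lo, hi)

print(total_mass, prob_1_2, mean, variance, second_moment - mean ** 2)
```

The last two printed values should agree, illustrating $Var(X) = E(X^2) - \mu^2$.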


### **Common probability distributions**

**Uniform distribution**

The uniform distribution assigns the same probability to all the
possible outcomes:

<br>

<center><img src="https://drive.google.com/uc?id=19S2Q9XRRphYuCSkmd04WSsrf99H6ycfa" width="800"/></center>

<br>

For the uniform distribution on the interval $[a, b]$:

$$
E(X) = \frac{1}{2} (a + b)
$$

$$
Var(X) = \frac{1}{12} (b - a)^2
$$
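
As a quick sanity check of these two formulas, we can draw samples and compare the empirical mean and variance with the closed-form values. This is a minimal sketch; the interval endpoints, sample size and seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

a, b = 2.0, 5.0
samples = [random.uniform(a, b) for _ in range(100_000)]

# Empirical mean and (population) variance of the samples.
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)

# Closed-form values for a uniform distribution on [a, b].
expected_mean = 0.5 * (a + b)
expected_var = (b - a) ** 2 / 12

print(sample_mean, expected_mean)
print(sample_var, expected_var)
```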

<br>

**Normal distribution**

The normal (or Gaussian) distribution describes a symmetric probability
distribution uniquely defined by its mean μ and its standard deviation σ:

<br>

<center><img src="https://drive.google.com/uc?id=1xDTqakF52JaKS8tBc8iiKvlFJ-0zn593" width="500"/></center>

<br>

$f(x)$ is also written $N(x; \mu, \sigma^2)$. For the normal distribution:

$$
E(X) = \mu
$$

$$
Var(X) = \sigma^2
$$

<br>

**Multivariate normal distribution**

Multivariate normal distributions represent probabilities of random variables
in several dimensions:

<br>

<center><img src="https://drive.google.com/uc?id=1hpcP0Mf6ip4Bl6pry_Mg3Y4G_WUrAmSY" width="800"/></center>

<br>

To define a multivariate normal distribution, we need to introduce the concept of covariance matrix.

Given two random variables (defining a two-dimensional normal distribution), with corresponding means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$, their covariance is:

$$
Cov(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)]
$$

And their correlation is:

$$
\rho = \frac{Cov(X_1, X_2)}{\sigma_1 \sigma_2}
$$

Then, we define the covariance matrix as:

$$
\Sigma =
\begin{pmatrix}
Var(X_1) & Cov(X_1, X_2) \\
Cov(X_1, X_2) & Var(X_2)
\end{pmatrix} =
\begin{pmatrix}
\sigma_1^2 & \rho \sigma_1 \sigma_2 \\
\rho \sigma_1 \sigma_2 & \sigma_2^2
\end{pmatrix}
$$

If $Cov(X_1, X_2) = 0$ or $\rho = 0$ then $X_1$ and $X_2$ are uncorrelated. If $\rho = \pm 1$, they are perfectly correlated.

With these concepts, we can define the expression for the multivariate normal distribution:

$$
f(x) = \frac{1}{\sqrt{(2 \pi)^k \det(\Sigma)}} \exp\left( - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)
$$
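
To make the roles of $\Sigma$, $\rho$ and the normalising constant $\sqrt{(2 \pi)^k \det(\Sigma)}$ concrete, here is a minimal sketch for the two-dimensional case ($k = 2$), written with the standard library only. The function names and the example values are our own choices; in practice a numerical library would be used instead.

```python
import math


def covariance_matrix(sigma1, sigma2, rho):
    """2x2 covariance matrix built from two standard deviations and a correlation."""
    cov = rho * sigma1 * sigma2
    return [[sigma1 ** 2, cov], [cov, sigma2 ** 2]]


def bivariate_normal_pdf(x, mu, cov):
    """Density of a two-dimensional normal distribution at the point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out by hand.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    norm = math.sqrt((2 * math.pi) ** 2 * det)
    return math.exp(-0.5 * quad) / norm


mu = [0.0, 0.0]
cov_uncorrelated = covariance_matrix(1.0, 1.0, 0.0)
cov_correlated = covariance_matrix(1.0, 2.0, 0.8)

peak_uncorrelated = bivariate_normal_pdf(mu, mu, cov_uncorrelated)
peak_correlated = bivariate_normal_pdf(mu, mu, cov_correlated)
print(peak_uncorrelated, peak_correlated)
```

With $\rho = 0$ the density factorises into the product of two one-dimensional normal densities; a non-zero $\rho$ tilts and stretches the contours.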

We can explore how the covariance matrix impacts the shape of the multivariate Gaussian:

In [None]:
%%html
<iframe src="https://distill.pub/2019/visual-exploration-gaussian-processes/" width="1000" height="500"></iframe>

**Bernoulli distribution**

A Bernoulli random variable takes only two possible values, 0 or 1:

<br>

<center><img src="https://drive.google.com/uc?id=15xJa_l0_uS6WG1sLw9-KUW9I1BDhDw9O" width="800"/></center>

<br>

### **Independent and identically distributed (iid)**

In probability theory, a sequence or collection of random variables is **independent and identically distributed (i.i.d., iid, or IID)** if each random variable has the same probability distribution as the others and they are all mutually independent.

We almost always assume that samples from a training or test dataset are **iid**.


<br>

---

<br/>

## 3. Maximum Likelihood Estimation

The goal of many of the models that we will introduce in the next lectures is to estimate the (unknown) probability distribution that best describes our data:

<center><img src="https://drive.google.com/uc?id=1o6wwWEXMOcwH35EsodCmfed77SYDnAuI" width="800"/></center>

There are two main methods to estimate the probability distribution:

- **Maximum likelihood estimation** (**MLE**) is the frequentist approach. Given some parametrisation of the probability distribution $\theta$ and some data $X$:

$$
\theta_{MLE} = argmax_\theta P(X | \theta)
$$

- **Maximum a posteriori** (**MAP**) is the Bayesian approach. Given some parametrisation of the probability distribution $\theta$ and some data $X$:

$$
\theta_{MAP} = argmax_\theta P(\theta | X)
$$

<br><br>

Here, we will focus on MLE. The goal of MLE is to maximise the probability of the data $X$ given the parameters $\theta$; that is, to maximise the **likelihood**:

$$
L_x(\theta) = P(X = x | \theta)
$$

In the case of training neural networks, our training dataset usually consists of $m$ iid data points $(x^1, y^1), (x^2, y^2), ..., (x^m, y^m)$. The likelihood will then be:

$$
L_y(x, \theta) = P(y^1 | x^1, \theta) \cdot P(y^2 | x^2, \theta) \cdot ... \cdot P(y^m | x^m, \theta) = \prod_{i=1}^m P(y^i | x^i, \theta)
$$

Maximising the likelihood often requires differentiating it, and to make this differentiation easier we can use a trick based on the fact that the logarithm is an increasing function of its argument, so it does not change the location of the maximum:

$$
\theta_{MLE} = argmax_\theta L_y(x, \theta) \rightarrow \theta_{MLE} = argmax_\theta \log L_y(x, \theta)
$$

For our $m$ elements in the training dataset, this becomes:

$$
\theta_{MLE} = argmax_\theta \log L_y(x, \theta) = argmax_\theta \log{\prod_{i=1}^m P(y^i | x^i, \theta)} = argmax_\theta \sum_{i=1}^m \log P(y^i | x^i, \theta)
$$

We can also turn the maximisation into a minimisation simply by changing the sign of the log-likelihood:

$$
\theta_{MLE} = argmin_\theta - \sum_{i=1}^m \log P(y^i | x^i, \theta)
$$
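
The effect of the logarithm can be seen on a toy problem. The sketch below assumes a hypothetical dataset of coin flips modelled as iid Bernoulli($\theta$) variables (an unconditional model, simpler than the $P(y^i | x^i, \theta)$ used above) and finds $\theta_{MLE}$ by a grid search rather than by differentiation.

```python
import math

# Hypothetical data: 10 coin flips (1 = heads), modelled as iid Bernoulli(theta).
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]


def likelihood(theta, xs):
    """Product of the per-sample probabilities P(x_i | theta)."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in xs)


def negative_log_likelihood(theta, xs):
    """Minus the sum of the per-sample log-probabilities."""
    return -sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)


# Candidate parameters on a grid, excluding 0 and 1 where the log diverges.
grid = [i / 100 for i in range(1, 100)]

theta_max_lik = max(grid, key=lambda t: likelihood(t, data))
theta_min_nll = min(grid, key=lambda t: negative_log_likelihood(t, data))
print(theta_max_lik, theta_min_nll)  # both pick the same grid point

# With many samples the raw product underflows to 0.0,
# while the negative log-likelihood stays finite.
big_data = data * 1000
underflowed = likelihood(0.7, big_data)
big_nll = negative_log_likelihood(0.7, big_data)
print(underflowed, big_nll)
```

Besides simplifying derivatives, working with sums of logarithms is what keeps the computation numerically stable for realistic dataset sizes.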

### **Relation between MSE and MLE**

At this point, let's assume we have a network $f(x^i; \theta)$, such that:

$$
y^i = f(x^i; \theta) + \epsilon
$$

$$
\epsilon \sim N(0, \sigma^2)
$$

That is, the error between our network's predictions and the real observations is normally distributed with zero mean:

$$
y^i - f(x^i; \theta) = \epsilon \sim N(0, \sigma^2)
$$

**What is the likelihood of observing the data given the network's parameters?**

Using these assumptions, we can then derive:

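$$
\theta_{MLE} = argmin_\theta - \sum_{i=1}^m \log P(y^i | x^i, \theta) = \\
argmin_\theta - \sum_{i=1}^m \log \left[ \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(\frac{-(y^i - f(x^i; \theta))^2}{2 \sigma^2} \right) \right] = \\
argmin_\theta \frac{1}{2\sigma^2} \sum_{i=1}^m (y^i - f(x^i; \theta))^2 + m \log \sigma + \frac{m}{2} \log 2\pi = \\
argmin_\theta \frac{1}{2\sigma^2} \sum_{i=1}^m (y^i - f(x^i; \theta))^2 + constant \rightarrow argmin_\theta \frac{1}{2\sigma^2} \sum_{i=1}^m (y^i - f(x^i; \theta))^2
$$

Because $\sigma$ does not depend on $\theta$, the factor $\frac{1}{2\sigma^2}$ does not change the minimiser either. Under the Gaussian noise assumption, maximising the likelihood is therefore equivalent to minimising the sum of squared errors, and hence the mean squared error (MSE).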
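
We can verify this relation numerically on a toy problem. The sketch below is illustrative only: it assumes a one-parameter linear model in place of a network, synthetic data, and a known noise level `sigma`, and it searches over a grid of `theta` values instead of training by gradient descent.

```python
import math
import random

random.seed(0)

# Hypothetical data from y = 2x + zero-mean Gaussian noise with std sigma.
sigma = 0.5
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + random.gauss(0.0, sigma) for x in xs]


def model(x, theta):
    """Toy stand-in for a network f(x; theta): a line through the origin."""
    return theta * x


def sum_squared_errors(theta):
    return sum((y - model(x, theta)) ** 2 for x, y in zip(xs, ys))


def gaussian_nll(theta):
    """Negative log-likelihood of the data under y ~ N(f(x; theta), sigma^2)."""
    log_norm = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(
        log_norm + (y - model(x, theta)) ** 2 / (2 * sigma ** 2)
        for x, y in zip(xs, ys)
    )


grid = [1.5 + i / 1000 for i in range(1001)]  # theta from 1.5 to 2.5
theta_mse = min(grid, key=sum_squared_errors)
theta_mle = min(grid, key=gaussian_nll)
print(theta_mse, theta_mle)

# The two objectives differ only by a term that does not depend on theta.
offsets = [
    gaussian_nll(t) - sum_squared_errors(t) / (2 * sigma ** 2)
    for t in (1.6, 2.0, 2.4)
]
print(offsets)
```

The entries of `offsets` are all equal to $m \log \sigma + \frac{m}{2} \log 2\pi$, which is why both objectives share the same minimiser.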

### **Relation between Cross-Entropy and MLE**

The likelihood of a single Bernoulli observation $x$ with parameter $p$ (the probability that $x = 1$) is:

$$
b(x) = p^x (1 - p)^{1-x}, x \in \{0, 1\}
$$

and its log-likelihood:

$$
\log b(x) = x \log p + (1 - x) \log (1 - p)
$$

If we have $m$ iid samples, the log-likelihood is the sum over the samples:

$$
\sum_{i=1}^m \left[ x^i \log p + (1 - x^i) \log (1 - p) \right]
$$

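Up to a factor of $\frac{1}{m}$, the binary cross-entropy is the negative of this log-likelihood. Minimising the binary cross-entropy loss is therefore the same as minimising the negative log-likelihood.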
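
The following sketch checks this relation on a handful of hypothetical binary labels, using a single shared probability $p$ as in the formulas above (in a classifier, $p$ would instead be the network's prediction for each sample). Both functions are written from their definitions and are not taken from any library.

```python
import math

# Hypothetical binary observations.
labels = [1, 0, 1, 1, 0]
m = len(labels)


def log_likelihood(p, xs):
    """Sum over samples of x*log(p) + (1 - x)*log(1 - p)."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)


def binary_cross_entropy(p, xs):
    """Mean binary cross-entropy between labels xs and a constant prediction p."""
    losses = [-math.log(p) if x == 1 else -math.log(1 - p) for x in xs]
    return sum(losses) / len(losses)


p = 0.8
bce = binary_cross_entropy(p, labels)
nll_per_sample = -log_likelihood(p, labels) / m
print(bce, nll_per_sample)  # the two values coincide

# Minimising the cross-entropy over p recovers the maximum likelihood estimate,
# which for a Bernoulli distribution is the fraction of ones in the data.
grid = [i / 100 for i in range(1, 100)]
p_best = min(grid, key=lambda q: binary_cross_entropy(q, labels))
print(p_best, sum(labels) / m)
```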


<br>

---

<br/>

## 4. Comparison of probability density functions

There are different metrics that can be used to compare probability density functions. One of the most used ones is the Kullback-Leibler (KL) Divergence. For two probability density functions $p(x)$ and $q(x)$:

$$
D_{KL} (p || q) = \int_{-\infty}^{\infty} p(x) \log{p(x)} dx - \int_{-\infty}^{\infty} p(x) \log{q(x)} dx
$$

Importantly, the KL divergence is asymmetric: in general, $D_{KL} (p || q) \ne D_{KL} (q || p)$. $D_{KL} (p || q)$ is non-negative, and is equal to 0 only when $p(x)$ and $q(x)$ are identical.

#### *Exercise*

What is the KL divergence between two normal distributions $p(x)$ and $q(x)$?

$$
p(x) = N(x; \mu_1, \sigma_1^2); \quad q(x) = N(x; \mu_2, \sigma_2^2)
$$

The divergence then is:

$$
D_{KL} (p || q) = -\frac{1}{2} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} + \log{\frac{\sigma_2}{\sigma_1}}
$$
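
One way to gain confidence in this closed-form result is to compare it with a direct numerical evaluation of the integral in the definition. The sketch below does this with a midpoint rule; the integration range ($\pm 10 \sigma_1$ around $\mu_1$), the number of steps and the example parameters are arbitrary choices.

```python
import math


def normal_log_pdf(x, mu, sigma):
    """Logarithm of the normal density N(x; mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def kl_closed_form(mu1, sigma1, mu2, sigma2):
    """KL(p || q) for p = N(mu1, sigma1^2) and q = N(mu2, sigma2^2)."""
    return (
        -0.5
        + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
        + math.log(sigma2 / sigma1)
    )


def kl_numerical(mu1, sigma1, mu2, sigma2, n=20_000):
    """Midpoint-rule estimate of the integral of p(x) * (log p(x) - log q(x))."""
    lower = mu1 - 10 * sigma1
    upper = mu1 + 10 * sigma1
    h = (upper - lower) / n
    total = 0.0
    for i in range(n):
        x = lower + (i + 0.5) * h
        log_p = normal_log_pdf(x, mu1, sigma1)
        log_q = normal_log_pdf(x, mu2, sigma2)
        total += math.exp(log_p) * (log_p - log_q)
    return total * h


kl_pq = kl_closed_form(0.0, 1.0, 1.0, 2.0)
kl_qp = kl_closed_form(1.0, 2.0, 0.0, 1.0)
print(kl_pq, kl_numerical(0.0, 1.0, 1.0, 2.0))
print(kl_qp, kl_numerical(1.0, 2.0, 0.0, 1.0))
```

The two directions give different values, which illustrates the asymmetry of the KL divergence.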

### **Relation to MLE**

If the training dataset consists of $m$ iid data points $x^1, x^2, ..., x^m$, the KL divergence between the training set distribution $p_d(x)$ and the estimated distribution $p_\theta(x)$ is:

$$
D_{KL} (p_d || p_\theta) = \int_{-\infty}^{\infty} p_d(x) \log{p_d(x)} dx - \int_{-\infty}^{\infty} p_d(x) \log{p_\theta(x)} dx
$$

The first term does not depend on $\theta$, so minimising $D_{KL} (p_d || p_\theta)$ with respect to the parameters $\theta$ is equivalent to maximising the integral in the second term:

$$
\int_{-\infty}^{\infty} p_d(x) \log{p_\theta(x)} dx
$$

This integral is the expectation of $\log{p_\theta(x)}$ under $p_d(x)$, which we can estimate by averaging over the $m$ training samples:

$$
\theta = argmax_\theta \left( \frac{1}{m} \sum_{i=1}^{m} \log{p_\theta(x^i)} \right)
$$

**Minimising the KL divergence is equivalent to maximising the likelihood.**
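
This equivalence can be illustrated on a small discrete example, where the integrals become sums. The sketch below assumes a hypothetical dataset of binary observations, takes $p_d$ to be the empirical distribution of that dataset, and uses a Bernoulli model $p_\theta$; all names are our own.

```python
import math

# Hypothetical training set: 30 ones and 70 zeros.
data = [1] * 30 + [0] * 70
m = len(data)

# Empirical data distribution p_d over the two possible values.
p_d = {1: data.count(1) / m, 0: data.count(0) / m}


def p_theta(x, theta):
    """Bernoulli model: probability of observing x under parameter theta."""
    return theta if x == 1 else 1.0 - theta


def kl_divergence(theta):
    """KL(p_d || p_theta), with the integral replaced by a sum over {0, 1}."""
    return sum(p_d[k] * math.log(p_d[k] / p_theta(k, theta)) for k in (0, 1))


def mean_log_likelihood(theta):
    """(1/m) * sum_i log p_theta(x^i), the sample estimate of the expectation."""
    return sum(math.log(p_theta(x, theta)) for x in data) / m


grid = [i / 100 for i in range(1, 100)]
theta_kl = min(grid, key=kl_divergence)
theta_mle = max(grid, key=mean_log_likelihood)
print(theta_kl, theta_mle)  # the same grid point

# KL + mean log-likelihood is the theta-independent term sum_k p_d(k) log p_d(k).
gaps = [kl_divergence(t) + mean_log_likelihood(t) for t in (0.1, 0.3, 0.8)]
print(gaps)
```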

<br>

---

<br/>
