# Bayesian inference


Bayesian probability is a theoretical framework for inferring unknown quantities in the presence of uncertainty. Bayesian inference is the process of updating beliefs about quantities of interest when new information is observed, and it relies on probability theory. Beliefs about random variables are represented by probability distributions. Given a model describing the mutual dependencies of random variables, Bayesian probability theory can be used to infer all the unknown quantities. All uncertainties, both in observations and in model parameters, are modeled as probability distributions.

In short, <font color='orange'>Bayesian inference is the process of deducing properties of a probability distribution from data using Bayes' theorem</font>. It incorporates the idea that probability should include a measure of **belief** about a prediction or outcome.


Bayesian inference is a method of statistical inference in which probability is used to *update beliefs* about a model's parameters based on available *evidence or data*.

**Group task 1:**

To better understand the role of prior <font color='purple'>`beliefs and subjective probability`</font>, discuss with your neighbour the following questions:


- What is the probability that it will rain tomorrow?
- What is the probability that the next president will be a woman?
- What is the probability that aliens built the pyramids?

How do these questions compare to the probability that a die will roll a 6?

Such questions, unlike the die, cannot be answered by "long-run" probability, i.e., probability obtained from multiple repeated runs of the same experiment. A certain degree of <font color='purple'>`belief`</font> is involved.

<font color='purple'>`Priors`</font> and "subjective" probability are foundational for Bayesian inference!

## Bayesian and frequentist paradigms

In *frequentist* statistics, probability is interpreted as the frequency of occurrence of events, where the events arise from a process with intrinsic randomness. That is, the frequentist probability of an event is the limit of its relative frequency of occurrence as the experiment is repeated a very large number of times.

The Bayesian approach describes prior knowledge about the parameters governing a phenomenon through probability distributions. New knowledge about these parameters is provided by newly observed data, described by the likelihood function, which is the probability distribution of the observed data conditioned on the parameters. Through Bayes' theorem, the prior probability distribution of the parameters is updated with the likelihood of the observed data, yielding a posterior probability distribution for the parameters.


## Probability density

We have seen examples of probability density functions (PDF) in the previous chapter. What are characteristics of a PDF in general?

If $X$ is a random variable with a probability density function $p(x),$ the probability of the event that $x$ is in the interval $(a,b)$ can be computed as
$$
p(x\in(a,b)) = \int_a^b p(x)dx.
$$
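
As a minimal numerical sketch of the formula above, the snippet below assumes a standard normal density (an illustrative choice, not something fixed by the text) and approximates the integral with the trapezoidal rule. The helper names `normal_pdf` and `interval_probability` are hypothetical. It also checks two general characteristics of a PDF: it is non-negative and it integrates to 1.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution (assumed here purely for illustration)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def interval_probability(pdf, a, b, n=10_000):
    """Approximate the integral of pdf over (a, b) with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, n):
        total += pdf(a + i * h)
    return total * h


# Probability that a standard normal variable falls in (-1, 1)
p_within_one_sigma = interval_probability(normal_pdf, -1.0, 1.0)

# The range (-10, 10) captures essentially all of the probability mass,
# so the integral over it should be very close to 1.
total_mass = interval_probability(normal_pdf, -10.0, 10.0)

# A density is never negative
non_negative = all(normal_pdf(-5.0 + 0.1 * i) >= 0.0 for i in range(101))

print(round(p_within_one_sigma, 4), round(total_mass, 4), non_negative)
```

Note that $p(x)$ itself is a density, not a probability: only its integral over an interval is a probability.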

## The sum and product rule

The sum and product rules for probability densities take the form

$$
\begin{aligned}
p(x) &= \int p(x,y)\,dy \quad \text{(sum rule)},\\
p(x,y) &= p(y|x)\,p(x) = p(x|y)\,p(y) \quad \text{(product rule)}.
\end{aligned}
$$

The probability of $x$, $p(x)$, is called the marginal probability of the variable $x$, because it is obtained by *marginalizing*, or integrating out, the variable $y$. The product rule specifies that the *joint* probability distribution of two variables can be expressed as the product of a *conditional distribution* $p(y|x)$ and a marginal distribution $p(x)$, or vice versa as $p(x|y)p(y)$.
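
For discrete variables the integrals become sums, which makes both rules easy to check by hand. The sketch below uses a small made-up joint table (the numbers are purely illustrative) and verifies that marginalising gives valid distributions and that both factorisations in the product rule recover the joint.

```python
# Joint probabilities p(x, y) for x in {0, 1} and y in {0, 1, 2}.
# The numbers are made up for illustration; they only need to sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.15, (1, 2): 0.15,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Sum rule: p(x) = sum_y p(x, y), and likewise p(y) = sum_x p(x, y)
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Conditional distributions, keyed as (first variable, conditioning variable)
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in xs for y in ys}
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}

# Product rule: p(y|x) p(x) and p(x|y) p(y) both give back p(x, y)
product_rule_holds = all(
    abs(p_y_given_x[(y, x)] * p_x[x] - joint[(x, y)]) < 1e-12
    and abs(p_x_given_y[(x, y)] * p_y[y] - joint[(x, y)]) < 1e-12
    for x in xs
    for y in ys
)

print(p_x, p_y, product_rule_holds)
```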

## Bayes' theorem

From the product rule, and with the symmetry property $p(x|y)p(y) = p(y|x)p(x)$, we immediately obtain Bayes' rule:

$$
p(y|x) =  \frac{p(x|y)p(y)}{p(x)} = \frac{p(x|y)p(y)}{\int p(x|y)p(y)dy}
$$

which is the key element in Bayesian inference, since it defines the posterior density of $y$, $p(y|x)$, after including new information from $x$ through the conditional probability model $p(x|y)$. The marginal probability of $x$, $p(x)$, acts as a normalization constant for the numerator, so that the posterior becomes a proper probability density function.

## The marginalization principle

The marginalization principle comes from the sum rule in probability theory. It formalizes the generalization, or predictive capacity, of a learning system.
If we can specify the product rule for two related quantities $(z, v)$, 

$$
p(z, v) = p(z|v)p(v),
$$

so that one can be explained by the other through the likelihood function $p(z|v)$, then a generalization or prediction of the unknown $z$ can be obtained by integrating over all the different explanations $v$:

$$
p(z) = \int p(z|v)p(v)dv
$$

The likelihood function $p(z|v)$ gives the probability of the unknowns for a particular explanation, and $p(v)$ gives the weights for every possible explanation.
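
To make the weighted average over explanations concrete, here is a small sketch under assumed distributions: $v \sim \mathcal{N}(0, 1)$ and $z \mid v \sim \mathcal{N}(v, 1)$. These choices are illustrative only. For this particular pair the marginal is known in closed form, $z \sim \mathcal{N}(0, 2)$, which gives something to compare the numerical integral against. The function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def marginal_density(z, lo=-10.0, hi=10.0, n=4000):
    """Approximate p(z) = integral of p(z|v) p(v) dv with the trapezoidal rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        v = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        # p(z|v): how well explanation v accounts for z; p(v): weight of v
        total += weight * normal_pdf(z, v, 1.0) * normal_pdf(v, 0.0, 1.0)
    return total * h


z = 0.7
numeric = marginal_density(z)
exact = normal_pdf(z, 0.0, math.sqrt(2.0))  # variance 1 + 1 = 2

print(numeric, exact)
```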



## Bayes theorem

_Note: Conditional probability is the axis on which Bayesian statistics turns! If you skipped this section in the first part of the practical and need a refresher on joint and conditional probabilities, go back to that bit of the prac to make sure you are ready for the phenomenon of Bayes' Theorem!_

From the joint and conditional formulae, we can relate two conditional probabilities:

\begin{equation}
p(B \mid A) = \frac{p(A \mid B) p(B)}{p(A)}
\end{equation}

This is the same equation you have already explored...

...but with a slightly different interpretation.

**The famous Bayes Theorem!** 😻

Why is it so famous? Well let's understand what it means and gives us first!



## Prior, likelihood, posterior

Bayes' Theorem is commonly seen in machine learning and other models using <font color='red'>`data or evidence`</font> $\mathcal{D}$ and <font color='purple'>`parameters`</font> $\theta$ as:

\begin{equation}
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)p(\theta)}{p(\mathcal{D})}
\end{equation}

* The denominator $p(\mathcal{D})$ is often called a *normaliser* or <font color='red'>`evidence`</font>.
* $p(\theta)$ is the <font color='purple'>`prior`</font>.
* $p(\mathcal{D} \mid \theta)$ is the <font color='teal'>`likelihood`</font>.
* and $p(\theta \mid \mathcal{D})$ is the <font color='pink'>`posterior`</font>.

For this reason, you will often see Bayes' rule summarised as "<font color='pink'>`posterior`</font> $\propto$ <font color='purple'>`prior`</font> $\times$ <font color='teal'>`likelihood`</font>", which ignores the denominator since it is a constant with respect to the parameters (it does not depend on $\theta$). This posterior summarises our belief state about the possible values of $\theta$.
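
A minimal sketch of this recipe for a coin, assuming only three candidate values of $\theta$ (the probability of heads) and a handful of made-up flips. Both the candidates and the data are hypothetical. Multiplying prior by likelihood gives unnormalised weights, and dividing by their sum (the evidence) turns them into a posterior.

```python
# Three hypothetical candidate values for theta, the probability of heads
thetas = [0.25, 0.5, 0.75]
prior = {t: 1.0 / len(thetas) for t in thetas}  # equal belief before any data

# Hypothetical observations: 1 = heads, 0 = tails
data = [1, 1, 0, 1]


def likelihood(theta, observations):
    """p(D | theta) for independent Bernoulli flips."""
    p = 1.0
    for flip in observations:
        p *= theta if flip == 1 else 1.0 - theta
    return p


# posterior is proportional to prior x likelihood
unnormalised = {t: prior[t] * likelihood(t, data) for t in thetas}

# The evidence p(D) is the sum over all candidates; it does not depend on theta
evidence = sum(unnormalised.values())
posterior = {t: u / evidence for t, u in unnormalised.items()}

for t in thetas:
    print(t, round(posterior[t], 4))
```

With three heads in four flips, most of the posterior mass moves to the largest candidate value, while the other two keep some non-zero belief.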

**Group task 2:** _Switch $A$ and $B$ in $p(B \mid A) = \frac{p(A \mid B)p(B)}{p(A)}$ and label the posterior, likelihood, prior and evidence in the new equation. Explain what each one means to each other a few times to be sure you know what goes where! Often the literature will refer to these 'labels' and require you to know them already, so while they might seem arbitrary, this memorisation task goes a long way to making your life easier in practice!_

## Diagnosing cause of headache

Imagine a situation where you need to make a decision concerning your health. You have a headache, and can choose between two doctors:

**Doctor 1:**
- Has a mental model for the cause of pain.
- Performs tests.

**Doctor 2:**
- Has a mental model for the cause of pain.
- Has access to the patient's chronic history.
- Performs tests.

Which doctor do you choose? Can you make sense of which parts are the <font color='red'>`data`</font>, <font color='teal'>`likelihood`</font> and <font color='purple'>`prior`</font> in this scenario?

Inference without priors is like a doctor who does not know the patient's history!

## Diagnosing COVID-19

**Group task 3:** We know that the probability of having fever this time of the year is 10%, the probability of having COVID is 7%, and among all people who have COVID, 70% of them have fever.

If you're a doctor, you don't know whether someone has COVID until you test them, but they may present with a high temperature and you want to reason whether to isolate them on that basis! So you are interested in knowing the chance that someone has COVID given they have a high temperature.

Find the probability that a patient has COVID given they have high temperature (fever).


## Bayesian modeling and inference

### Joint probability distribution

Bayesian modeling consists of describing, in mathematical form, all observable (data) quantities, $y$, and unobservable (parameter) quantities, $\theta$, in a problem by defining the joint probability distribution of data and parameters.

We define probability models for the observed quantities, $p(y|\theta)$, and for the unobserved quantities about which we wish to learn, $p(\theta)$, and combine them through the product rule into a joint probability distribution:

$$
p(y, \theta) = p(y|\theta)p(\theta).
$$

The observational model $p(y|\theta)$ is a probabilistic model for the observed data that relates the observed data $y$ to the unknown quantities (parameters) $\theta$ we want to learn. This model represents the evidence provided by the data and summarizes the information they contain. It is the main source of information and is called the likelihood function. This is the same as in the frequentist approach. In fact, it is the only probabilistic model formulated in a frequentist approach to describe and solve a problem.

The distribution $p(\theta)$ denotes a prior probability distribution for the parameters, which encodes our prior knowledge about them. This probability distribution can be an informative or a non-informative prior, depending on how much reliable information (knowledge) is available about the parameters. This is one of the key features that differentiates the Bayesian approach from the frequentist one: probability distributions are defined for the unknown quantities (parameters) and combined with the likelihood function.

### Parameter inference

Obtaining the posterior distribution of the unknowns (parameters) is the key element of the Bayesian approach. Through Bayes' rule, the likelihood function (the probability model for the data) and the prior distributions for the parameters are combined, and the uncertainty in the parameters is updated once the data have been observed, yielding the posterior distribution of the parameters:

$$
p(\theta | y) = \frac{p(y |\theta)p(\theta)}{p(y)} = \frac{p(y|\theta)p(\theta)}{ \int p(y|\theta)p(\theta) d\theta}
$$

The denominator of Bayes' rule, $p(y) = \int p(y|\theta)p(\theta)d\theta$, is the marginal likelihood, also known as the evidence of the model, as it integrates the likelihood over the prior distribution of the parameters. The marginal likelihood normalizes the posterior into a proper probability distribution.
The final inference will be a compromise between the evidence provided by the data and the prior information. With non-informative priors, the inference is based mainly on the data.

### Predictive inference

The posterior distribution of the parameters $p(\theta|y)$ can be used to model the uncertainty of predictions $\tilde{y}$ for new observations. In a Bayesian approach, the posterior predictive distribution of $\tilde{y}$ is obtained by marginalizing or integrating out the joint posterior of predictions $\tilde{y}$ and model parameters $\theta$ over the model parameters:

$$
p(\tilde{y}|y) = \int p(\tilde{y}, \theta|y)d \theta = \int p(\tilde{y}|\theta, y)p(\theta|y)d\theta.
$$

The predictive distribution can also be seen as averaging the predictions of the model $p(\tilde{y}|\theta, y)$ over the posterior distribution of the parameters $p(\theta|y)$.
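
The averaging described above can be sketched for a coin under two stated assumptions: a uniform prior on $\theta$, and flips that are independent given $\theta$, so that $p(\tilde{y} = 1 \mid \theta, y) = \theta$. The posterior is approximated on a grid, and the function names and data are hypothetical. Under these assumptions the exact answer is known to be $(k+1)/(n+2)$ for $k$ heads in $n$ flips, which serves as a check.

```python
def posterior_on_grid(heads, flips, n_grid=2001):
    """Grid approximation of p(theta | y) under a uniform prior (an assumption)."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    # Likelihood times a flat prior, evaluated at each grid point
    unnormalised = [t ** heads * (1.0 - t) ** (flips - heads) for t in grid]
    total = sum(unnormalised)
    weights = [u / total for u in unnormalised]
    return grid, weights


def predictive_heads(heads, flips):
    """p(next flip is heads | data): average theta over the posterior."""
    grid, weights = posterior_on_grid(heads, flips)
    return sum(t * w for t, w in zip(grid, weights))


p_next = predictive_heads(heads=7, flips=10)
exact = (7 + 1) / (10 + 2)

print(p_next, exact)
```

The prediction is pulled slightly away from the raw frequency 7/10 because it averages over every plausible value of $\theta$ rather than committing to a single one.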

## Role of priors

One of the main distinguishing features of the Bayesian approach is the consideration of prior knowledge about the model parameters.

The Bayesian approach allows for performing consistent inferences even when the prior information is lacking, by marginalizing or integrating out over this prior information. This property also allows us not to make guesses on certain unknown quantities, in contrast to classical methods. Furthermore, the Bayesian approach, by its property of defining conditional dependencies among parameters and model assumptions in hierarchical modeling, allows for defining lack of prior information in an appropriate way.

However, the use of uncertainty assumptions makes the Bayesian approach more sensitive to prior assumptions than classical methods, and this is the main criticism of Bayesian inference.

Some non-obvious advantages of Bayesian inference, which are not easy to spot at first glance, are its

- ability to work with small data,
- ability to perform model regularisation.


## How can we perform Bayesian inference?

### What does it take?
- <font color='red'>`Data`</font>
- A generative model (how does the conditional <font color='teal'>`likelihood`</font> come about?)
- Our <font color='purple'>`beliefs`</font> before seeing the data.

### What does it make?
- The values of parameters that could give rise to the observed data **in the form of a distribution**.


### How can we perform it?

- **Analytically**
        
     Solving the maths! This is an elegant approach. However, it is rarely available in real life.

- **Numerically**

    - Rather than deriving a posterior distribution in the closed form, we can use computational tools to **sample** from the posterior. The obtained samples describe the distributions of parameters.
    
    - We achieve this by exploring the space of parameters to find the most probable combinations of parameters.
    
    - Further, we treat the obtained samples as new data to extract information about the parameters, such as the mean, credible intervals or other statistics.
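
As a rough sketch of the numerical route, the snippet below implements a random-walk Metropolis sampler, one of the simplest MCMC algorithms. The target is an assumed coin posterior with 7 heads, 3 tails and a flat prior. The step size, chain length and burn-in are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math
import random


def log_unnormalised_posterior(theta, heads=7, tails=3):
    """log(likelihood x flat prior) for a coin; -inf outside (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(log_target, n_samples=20_000, step=0.2, start=0.5, seed=0):
    """Random-walk Metropolis: propose a nearby value, accept or stay put."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current))
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


samples = metropolis(log_unnormalised_posterior)
kept = sorted(samples[2_000:])  # drop burn-in, sort to read off quantiles

# Treat the samples as data: summarise them with a mean and a 95% interval
posterior_mean = sum(kept) / len(kept)
lower = kept[int(0.025 * len(kept))]
upper = kept[int(0.975 * len(kept))]

print(posterior_mean, lower, upper)
```

Only the unnormalised posterior is needed, because the acceptance ratio cancels the evidence term. For this target the exact posterior mean is 8/12, so the sample mean should land close to 0.67.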

### Numerical methods

- Markov Chain Monte Carlo (MCMC) family of algorithms, e.g.,
  * Metropolis-Hastings
  * Gibbs
  * Hamiltonian Monte Carlo (HMC)
  * No-U-Turn sampler (NUTS)
  * further variants such as SGHMC, LDHMC, etc
- Variational Bayes
- Approximate Bayesian Computation (ABC)
- Particle filters
- Laplace approximation

More on this later! First, let's discuss some analytics and point estimates.    

## Assignment: analytical Bayesian inference - Point estimates for Bernoulli-beta coin flips

### Point estimates

To illustrate the use of Bayes' Theorem further, let's explore the coin flip example that you saw with the Bernoulli distribution to try to figure out whether a coin is weighted or not. 💰

We really just want one answer out of this problem -- the probability that a coin will give us heads (since we know that the probability of tails is just $1 - p(\text{heads})$). This is a <font color='green'>`point estimate`</font>: one answer out of a range.

In machine learning, we are often interested in estimating parameters $\theta$ that best allow us to describe our data. Generally this leads to solving some optimisation problem for a loss function $\mathcal{L}$, i.e.
\begin{equation*}
\hat{\theta} = {\arg \min}_\theta \mathcal{L}(\theta)
\end{equation*}

This gives us a <font color='green'>`point estimate`</font> $\hat{\theta}$.
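
One possible instance of this optimisation, sketched for the coin: take the loss $\mathcal{L}(\theta)$ to be the negative Bernoulli log-likelihood of some made-up flips and minimise it by brute force over a grid of candidate values. The data, the choice of loss and the grid search are all illustrative assumptions, and in practice an optimiser or a closed-form solution would be used instead.

```python
import math

# Hypothetical coin flips: 1 = heads, 0 = tails (7 heads out of 10)
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]


def negative_log_likelihood(theta, data):
    """Loss L(theta): negative Bernoulli log-likelihood of the data."""
    return -sum(
        math.log(theta) if y == 1 else math.log(1.0 - theta)
        for y in data
    )


# Candidate values of theta strictly inside (0, 1) so that log() stays finite
grid = [i / 1000 for i in range(1, 1000)]

# Point estimate: the single theta that minimises the loss
theta_hat = min(grid, key=lambda t: negative_log_likelihood(t, flips))

print(theta_hat)
```

For this loss the minimiser coincides with the observed fraction of heads.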

**Group task 6**: Does a point estimate tell us anything about our uncertainty or the distribution from which we draw the estimate? Discuss the difference between <font color='green'>`point estimates`</font> and estimating a *distribution*.