# Bayesian Approach

> I enrolled in the [Machine Learning](https://rajeshhr.github.io/ml-2022/) course during my Masters. The following is my attempt to document, as I understand it, a lecture about all things Bayesian.

Bayesian approaches place a prior probability distribution over the unknown quantities of a model (here, its parameters), treating them as random variables in order to quantify the uncertainty in their values.

## Key Ideas

- **Prior: $p(\theta)$**

    $\theta$ represents all the parameters of the model. The prior distribution describes our belief about the parameters before training begins.


- **Likelihood: $p(y | \theta, x)$**

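    $y$ is the output for a given input $x$. The likelihood is the probability the model assigns to the observed $y$, given the training input $x$ and a particular setting of the parameters $\theta$. Evaluating it is similar to training the model, where we predict $y$ from $x$ using the current $\theta$.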


- **Posterior: $p(\theta | x, y)$**

    Once training is complete, the parameters of the model have been tuned using the training pairs $x$ and $y$. The posterior is the distribution over $\theta$ after seeing this data.
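
To make the prior and the likelihood concrete, here is a minimal sketch for a toy model that is not part of the lecture: a one-parameter linear model $y = \theta x + \text{noise}$ with Gaussian noise. The data, the noise level, the $N(0, 2^2)$ prior, and all function names are assumptions chosen purely for illustration.

```python
import math

def normal_pdf(value, mean, std):
    """Density of the normal distribution N(mean, std^2) at `value`."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical training data that roughly follows y = 2x.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
NOISE_STD = 0.5  # assumed known noise level

def prior(theta):
    """p(theta): belief about the slope before seeing data (assumed N(0, 2^2))."""
    return normal_pdf(theta, 0.0, 2.0)

def likelihood(theta):
    """p(y | theta, x): density of the observed outputs for this slope."""
    density = 1.0
    for x, y in zip(xs, ys):
        density *= normal_pdf(y, theta * x, NOISE_STD)
    return density

for theta in (0.0, 1.0, 2.0):
    print(f"theta={theta}: prior={prior(theta):.4f}, "
          f"likelihood={likelihood(theta):.3e}")
```

In this sketch the prior is largest at $\theta = 0$, while the likelihood is largest near $\theta = 2$, the slope the data roughly follows. The posterior has to balance the two.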

## Posterior is proportional to joint

$$
\begin{aligned}
p(\theta \mid x, y) &= \frac{p(\theta, x, y)}{p(x, y)} \\
&= \frac{p(\theta, x, y) / p(x)}{p(x, y) / p(x)} \\
&=  \frac{p(\theta, y \mid x)}{p(y \mid x)} \\
& \propto p(\theta, y \mid x)
\end{aligned}
$$


So, $p(y \mid x)$ acts as a normalising constant here: it does not depend on $\theta$. Dividing by it does not change the shape of the function of $\theta$; it only brings the area under the curve to $1$, so that the result is a valid probability distribution.

Then, $p(\theta, y \mid x)$ is the joint distribution here: initially we were given $x$, while $\theta$ and $y$ were not given. So we take the joint of the variables that are not given, conditioned on the ones that are.
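
The effect of the normalising constant can be checked numerically. The sketch below is an illustration under assumptions that are not from the lecture: a toy model $y = \theta x + \text{noise}$, made-up data, a $N(0, 2^2)$ prior, and a grid over $\theta$. The joint is computed as $p(\theta) \text{ } p(y \mid \theta, x)$, which itself assumes that $\theta$ and $x$ are independent.

```python
import math

def normal_pdf(value, mean, std):
    """Density of the normal distribution N(mean, std^2) at `value`."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical training data and noise level (assumptions for this sketch).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
NOISE_STD = 0.5

# Grid of candidate theta values from -5 to 5.
STEP = 0.01
thetas = [-5.0 + STEP * i for i in range(1001)]

def joint(theta):
    """p(theta, y | x), computed as prior times likelihood."""
    density = normal_pdf(theta, 0.0, 2.0)  # assumed prior N(0, 2^2)
    for x, y in zip(xs, ys):
        density *= normal_pdf(y, theta * x, NOISE_STD)
    return density

unnormalized = [joint(t) for t in thetas]

# Riemann-sum approximation of p(y | x), the integral of the joint over theta.
evidence = sum(unnormalized) * STEP
posterior = [u / evidence for u in unnormalized]

area = sum(posterior) * STEP
peak_joint = thetas[unnormalized.index(max(unnormalized))]
peak_posterior = thetas[posterior.index(max(posterior))]
print(f"area under posterior: {area:.6f}")
print(f"peak of joint: {peak_joint:.2f}, peak of posterior: {peak_posterior:.2f}")
```

Both curves peak at the same $\theta$, because dividing by a constant only rescales the vertical axis.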

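## Posterior $\propto$ Prior $\times$ Likelihood

This relation is used very frequently whenever Bayesian approaches are involved. Since the posterior is proportional to the joint $p(\theta, y \mid x)$, it is enough to show that the joint equals the prior times the likelihood. Now that we know what the individual terms are, we can check when this is true using only the definition of conditional probability: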

$$
\begin{aligned}
p(\theta, y \mid x) &= \frac{p(\theta, x, y)}{p(x)} \\
&= p(\theta)  \frac{p(\theta, x, y)}{p(\theta) \text{ } p(x)} \\
&= p(\theta) \text{ } p(y \mid \theta, x) \frac{p(\theta, x)}{p(\theta) \text{ } p(x)} \\
&= p(\theta) \text{ } p(y \mid \theta, x) \tag{$\theta$ and $x$ assumed independent}\\
\end{aligned}
$$

Therefore, the equality holds only when $p(\theta, x) = p(\theta) \text{ } p(x)$, that is, when $x$ and $\theta$ are independent. It may seem obvious that the input data $x$ has nothing to do with the model parameters $\theta$ before training begins, but writing out the steps makes this assumption explicit.
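
The role of the independence assumption can be seen on a small discrete example. The sketch below uses made-up probability tables over binary $\theta$, $x$ and $y$ (all values and names are hypothetical) and measures the largest gap between $p(\theta, y \mid x)$ and $p(\theta) \text{ } p(y \mid \theta, x)$, once with $\theta$ and $x$ independent and once with them dependent.

```python
from itertools import product

def check_identity(p_theta_x, p_y_given):
    """Largest gap between p(theta, y | x) and p(theta) * p(y | theta, x)."""
    # Full joint p(theta, x, y) = p(theta, x) * p(y | theta, x).
    joint = {
        (t, x, y): p_theta_x[(t, x)] * p_y_given[(t, x)][y]
        for t, x, y in product((0, 1), repeat=3)
    }
    # Marginals p(theta) and p(x) obtained by summing out the other variable.
    p_theta = {t: sum(p_theta_x[(t, x)] for x in (0, 1)) for t in (0, 1)}
    p_x = {x: sum(p_theta_x[(t, x)] for t in (0, 1)) for x in (0, 1)}
    gap = 0.0
    for t, x, y in product((0, 1), repeat=3):
        lhs = joint[(t, x, y)] / p_x[x]          # p(theta, y | x)
        rhs = p_theta[t] * p_y_given[(t, x)][y]  # p(theta) * p(y | theta, x)
        gap = max(gap, abs(lhs - rhs))
    return gap

# Hypothetical likelihood table: p_y_given[(theta, x)][y].
p_y_given = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.6, 0.4],
    (1, 0): [0.3, 0.7],
    (1, 1): [0.2, 0.8],
}

# Independent case: p(theta, x) factorizes into p(theta) * p(x).
independent = {
    (t, x): [0.3, 0.7][t] * [0.4, 0.6][x] for t in (0, 1) for x in (0, 1)
}

# Dependent case: theta and x tend to take the same value.
dependent = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.55}

gap_independent = check_identity(independent, p_y_given)
gap_dependent = check_identity(dependent, p_y_given)
print(f"independent: {gap_independent:.2e}, dependent: {gap_dependent:.2e}")
```

With the factorized table the gap is zero up to floating-point rounding, while the dependent table leaves a clearly non-zero gap.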

## Predicting a new point $x^{\star}$, given $x$, $y$

Let the prediction for the new point $x^{\star}$ be $y^{\star}$. We want to find the distribution $p(y^{\star} \mid x^{\star}, x, y)$.

Marginalizing over the posterior $p(\theta \mid x, y)$,
$$
\begin{aligned}
p(y^{\star} \mid x^{\star}, x, y) &= \int p(y^{\star} \mid x^{\star}, \theta) \text{ } p(\theta \mid x, y) d\theta \\
\end{aligned}
$$

> I find it easier to quickly use it as a kind of chain rule. In $p(\theta \mid x, y)$, $x$ and $y$ were given to us and we found out $\theta$. Next, we use this $\theta$ and the remaining $x^{\star}$ in the given side of the $\mid$ (bar), thus we can write $p(y^{\star} \mid x^{\star}, \theta)$.

When marginalizing over the posterior $p(\theta \mid x, y)$, we take a weighted average over all the models (combinations of parameters $\theta$), where each model is weighted by how plausible it is given $x$ and $y$. Each such model is then used to evaluate the likelihood of the new point, $p(y^{\star} \mid x^{\star}, \theta)$.

Here we assume that $(x^{\star}, y^{\star})$ are from the same distribution as $(x, y)$ because we have covered only those models that were derived from $(x, y)$. 
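
The integral above can be approximated on a grid. The following is a minimal sketch under assumptions that are not from the lecture: a toy model $y = \theta x + \text{noise}$, made-up training data, a $N(0, 2^2)$ prior, and a hypothetical new input `x_star`. Each candidate $\theta$ contributes its prediction $p(y^{\star} \mid x^{\star}, \theta)$, weighted by its posterior density.

```python
import math

def normal_pdf(value, mean, std):
    """Density of the normal distribution N(mean, std^2) at `value`."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical training data and noise level (assumptions for this sketch).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
NOISE_STD = 0.5

# Posterior p(theta | x, y) on a grid, from prior times likelihood.
STEP = 0.01
thetas = [-5.0 + STEP * i for i in range(1001)]

def unnormalized_posterior(theta):
    density = normal_pdf(theta, 0.0, 2.0)  # assumed prior N(0, 2^2)
    for x, y in zip(xs, ys):
        density *= normal_pdf(y, theta * x, NOISE_STD)
    return density

weights = [unnormalized_posterior(t) for t in thetas]
total = sum(weights) * STEP
posterior = [w / total for w in weights]

def predictive_density(y_star, x_star):
    """Approximate p(y* | x*, x, y) by a weighted sum over the theta grid."""
    return STEP * sum(
        normal_pdf(y_star, t * x_star, NOISE_STD) * p
        for t, p in zip(thetas, posterior)
    )

x_star = 3.0  # hypothetical new input
Y_STEP = 0.05
y_grid = [Y_STEP * i for i in range(241)]  # candidate y* values from 0 to 12
densities = [predictive_density(y, x_star) for y in y_grid]

# Summary statistics of the predictive distribution, again by Riemann sums.
area = Y_STEP * sum(densities)
mean = Y_STEP * sum(y * d for y, d in zip(y_grid, densities))
variance = Y_STEP * sum((y - mean) ** 2 * d for y, d in zip(y_grid, densities))
print(f"area={area:.4f}, mean={mean:.3f}, std={math.sqrt(variance):.3f}")
```

In this sketch the predictive standard deviation comes out larger than `NOISE_STD`, because the uncertainty about $\theta$ adds to the observation noise.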

Below is an alternative way to arrive at the same result. It gives a clearer idea of the assumptions we make.


$$
\begin{aligned}
p(y^{\star} \mid x^{\star}, x, y) &= \int p(y^{\star}, \theta \mid x^{\star}, x, y) \text{ } d\theta && \text{(marginalizing over $\theta$)} \\
&= \int p(y^{\star} \mid x^{\star}, x, y, \theta) \text{ } p(\theta \mid x^{\star}, x, y) \text{ } d\theta && \text{($p(A, B \mid C) = p(A \mid B, C) \text{ } p(B \mid C)$)} \\
&= \int \frac{p(y^{\star}, x, y \mid x^{\star}, \theta)}{p(x, y \mid x^{\star}, \theta)}  \frac{p(\theta, x^{\star} \mid x, y)}{p(x^{\star} \mid x, y)} \text{ } d\theta \\
&= \int \left[\frac{p(x, y \mid x^{\star}, y^{\star}, \theta)}{p(x, y \mid x^{\star}, \theta)} \frac{p(x^{\star} \mid \theta, x, y)}{p(x^{\star} \mid x, y)}\right] \cdot \left[p(y^{\star} \mid x^{\star}, \theta) \text{ } p(\theta \mid x, y)\right] d\theta
\end{aligned}
$$

Therefore, this equals the original result when the first big bracket is $1$, which holds if $p(x, y \mid x^{\star}, y^{\star}, \theta) = p(x, y \mid x^{\star}, \theta)$ and $p(x^{\star} \mid \theta, x, y) = p(x^{\star} \mid x, y)$. The first condition says that, once $\theta$ is known, the training data tells us nothing more about $y^{\star}$. The second says that the new input $x^{\star}$ on its own carries no information about $\theta$. Intuitively, it means that $(x^{\star}, y^{\star})$ is generated by the same model $\theta$ as $(x, y)$, independently of the training data.