# Introduction to the Bayesian approach

Bayesian theory can be applied to:
- **Bayesian learning of network weights**
- **Distribution of network outputs: regression**
- **Distribution of network outputs: classification**


#### Introduction to the Bayesian approach 

The Bayesian approach treats the issue of model complexity very differently from cross-validation. It allows all of the
available data to be used for 'training' (instead of splitting the data into training and validation sets).

To see why this makes sense, consider a hypothetical example with three models $H_1$, $H_2$ and $H_3$ of steadily increasing flexibility, corresponding for instance to an increasing number of hidden units. Each model consists of a specification of the network architecture (number of units, type of activation function, etc.) and is governed by a number of adaptive parameters. By varying the values of these parameters, each model can represent a range of input-output functions. The more complex models, with a greater number of hidden units for instance, can represent a greater range of such functions.

Let us assume we have a set of input vectors $X= \{x^1, \dots, x^M \}$ and a corresponding set of target vectors $Y= \{y^1, \dots, y^M \}$. We can consider the posterior probability for each of the models given the observed targets $Y$. (The inputs $X$ are also part of the data, but they only ever appear as a conditioning variable and are omitted to keep the notation simple.)

Using Bayes' theorem we have:

\begin{equation}
p(H_i \vert Y) = \frac{p(Y \vert H_i) p(H_i)}{p(Y)}
\end{equation}

- The term $p(H_i)$ is the prior probability for model $H_i$. If we have no reason to prefer one model over another, then we would assign equal prior probability to all models.

- The term $p(Y)$ does not depend on the model. We will see that different models can be compared by evaluating $p(Y \vert H_i)$. The term $p(Y \vert H_i)$ is called the **evidence for model $H_i$**.
    - This indicates that the Bayesian approach can be used to select the particular model for which the evidence is largest. We might expect that the model with the greatest evidence is also the one with the best generalization performance. We will see this in detail.
    
    - We shall see that the correct Bayesian approach is to make use of the complete set of models. Predicted outputs for new input vectors are obtained by performing a weighted sum over the predictions of all the models, where the weighting coefficients depend on the evidence. More probable models therefore contribute more strongly to the predicted output. **Since the evidence $p(Y \vert H_i)$ can be evaluated using the training data, we see that Bayesian methods are able to deal with the issue of model complexity without the need to use cross-validation**.
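
The weighted sum over models described above can be sketched in a few lines. This is only an illustration: the log-evidence values and the per-model predictions are made-up numbers, the function names `model_posteriors` and `averaged_prediction` are hypothetical, and equal priors $p(H_i)$ are assumed unless others are supplied.

```python
import numpy as np

def model_posteriors(log_evidences, priors=None):
    """Return p(H_i | Y) from log p(Y | H_i) and the priors p(H_i)."""
    log_evidences = np.asarray(log_evidences, dtype=float)
    if priors is None:
        # No reason to prefer any model: equal prior probabilities.
        priors = np.full(log_evidences.shape, 1.0 / log_evidences.size)
    log_post = log_evidences + np.log(priors)
    log_post -= log_post.max()      # subtract the maximum for numerical stability
    post = np.exp(log_post)
    return post / post.sum()        # dividing by the sum plays the role of p(Y)

def averaged_prediction(predictions, posteriors):
    """Weighted sum of the model predictions, weights given by p(H_i | Y)."""
    return np.dot(posteriors, predictions)

# Hypothetical numbers for three models H_1, H_2, H_3.
log_evidences = [-12.0, -10.0, -11.0]      # log p(Y | H_i), assumed values
predictions = np.array([0.8, 1.0, 1.3])    # each model's output for one new input

post = model_posteriors(log_evidences)
y_hat = averaged_prediction(predictions, post)
print(post, y_hat)
```

Working with log-evidences avoids underflow, since evidences are products of many small probabilities.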

#### Marginalization

Bayesian inference is based on marginalization. The process of marginalization involves integrating out unwanted variables. Imagine we are discussing a model with two variables $w$ and $\alpha$. The most complete description of these variables is in terms of the joint distribution $p(w, \alpha)$. If we are interested only in the distribution of $w$, however, we can integrate out $\alpha$ as follows:

\begin{equation}
p(w) = \int p(w,\alpha) d\alpha =  \int p(w \vert\alpha) p(\alpha) d\alpha 
\end{equation}

We say that the marginal distribution over $w$ is obtained by averaging the conditional distribution $p(w\vert \alpha)$ with a weighting factor $p(\alpha)$.
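
As a numerical check of this formula, the sketch below integrates out $\alpha$ on a grid. The choice of distributions is an assumption made purely for illustration: $\alpha \sim \mathcal{N}(0, 1)$ and $w \vert \alpha \sim \mathcal{N}(\alpha, 1)$, for which the marginal is known in closed form to be $w \sim \mathcal{N}(0, 2)$.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Grid over the variable we want to integrate out.
alpha = np.linspace(-10.0, 10.0, 4001)
d_alpha = alpha[1] - alpha[0]
p_alpha = normal_pdf(alpha, 0.0, 1.0)          # assumed p(alpha)

def marginal_w(w):
    """Approximate p(w) = integral of p(w | alpha) p(alpha) d alpha by a Riemann sum."""
    integrand = normal_pdf(w, alpha, 1.0) * p_alpha   # assumed p(w | alpha)
    return np.sum(integrand) * d_alpha

w_values = np.array([-1.0, 0.0, 2.0])
numeric = np.array([marginal_w(w) for w in w_values])
exact = normal_pdf(w_values, 0.0, 2.0)         # closed-form marginal N(0, 2)
print(numeric, exact)
```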


## Bayesian learning of network weights

The first problem we shall address is that of learning the weights in a neural network on the basis of a set of training data. The standard approach to training a neural network consists of maximizing the likelihood function (equivalent to minimizing an error function). This process finds a single set of values for the network weights. By contrast, the Bayesian approach works as follows:

- It considers a probability distribution function over weight space, representing the relative degrees of belief in different values for the weight vector. This function is initially set to some prior distribution.
- Once the data has been observed, the prior can be converted to a posterior distribution using Bayes' theorem. The posterior distribution can then be used to evaluate the predictions of the trained network for new values of the input variables.



- https://www.cs.cmu.edu/afs/cs/academic/class/15782-f06/slides/

Maximum likelihood estimation is based on finding the (single) most likely value for the parameters of the model $\theta$ given the observed data. The Bayesian approach is rather different: the uncertainty in the value of the parameters is captured by a density function.


#### Usual names given to the probabilities

- $P(\theta \vert X)$ is the posterior probability of the parameters given the data. Usually referred to as the posterior probability.


- $P(\theta)$ is the prior probability of the parameters. Usually referred to as the prior.


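- $P(X \vert \theta)$ is the likelihood: the conditional probability of the data given the parameters.


- $P(X)$ is the total (marginal) probability of the data, obtained by integrating over all values of the parameters.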



If we assume $X= \{x^1, \dots, x^M \}$ and the data is drawn independently from the same underlying distribution, then we can write

\begin{equation}
p(X \vert \theta) = \prod_{m=1}^M p(x^m \vert \theta)
\end{equation}
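
In practice the product is usually evaluated as a sum of logarithms, which avoids numerical underflow when $M$ is large. The following sketch assumes, for illustration only, a univariate Gaussian density $p(x \vert \theta)$ whose mean is the parameter $\theta$ and whose standard deviation is fixed; the data values are invented.

```python
import numpy as np

def gaussian_log_density(x, mean, std):
    """log p(x | theta) for a Gaussian with mean theta and fixed std (an assumption)."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def log_likelihood(data, mean, std=1.0):
    """log p(X | theta) = sum over m of log p(x^m | theta) for i.i.d. data."""
    return np.sum(gaussian_log_density(np.asarray(data), mean, std))

data = np.array([0.9, 1.1, 1.4, 0.6])      # hypothetical observations x^1..x^4

ll_near = log_likelihood(data, mean=1.0)   # theta close to the data
ll_far = log_likelihood(data, mean=3.0)    # theta far from the data

# The direct product of densities agrees with exp(sum of logs).
direct_product = np.prod(np.exp(gaussian_log_density(data, 1.0, 1.0)))
print(ll_near, ll_far, direct_product)
```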


#### Brief introduction to Bayesian inference for the parameters of the network


Before learning, the parameters are described by a prior probability density $p(\theta)$. This prior is normally very broad, reflecting the fact that we have little idea of what good values of the parameters are. During learning, once the data is observed, we can use Bayes' theorem to find the corresponding posterior probability density $p(\theta \vert X)$. Let us see this in detail.


We start by writing the density function of an input vector $\textbf{x}$ given the training data $X$, obtained by marginalizing over the parameters $\theta$. That is

\begin{equation}
p( \textbf{x} \vert X) = \int p(x, \theta  \vert X)  d\theta
\end{equation}

By the product rule, $p(x, \theta \vert X)=p(x \vert \theta, X) p(\theta  \vert  X)$, therefore

\begin{equation}
p( \textbf{x} \vert X) = \int p(x \vert \theta, X) p(\theta  \vert  X) d\theta
\end{equation}

The first factor $p(x \vert \theta, X)$ is independent of $X$ by construction, since we assume that the density is completely specified once the values of the parameters $\theta$ are known. Therefore


\begin{equation}
p( \textbf{x} \vert X) = \int p(x \vert \theta) p(\theta  \vert  X) d\theta
\end{equation}


The previous formula tells us that the Bayesian approach, instead of choosing a specific value for $\theta$, performs a weighted average over all values of $\theta$. The weights are given by the posterior distribution of the parameters, $p(\theta \vert X)$.
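
One way to read this weighted average is as an expectation over the posterior, which can be approximated by averaging $p(x \vert \theta)$ over samples of $\theta$. The sketch below assumes, for illustration, that $p(x \vert \theta)$ is a unit-variance Gaussian centred at $\theta$ and that the posterior $p(\theta \vert X)$ is already known to be Gaussian with made-up mean and variance. Under these assumptions the exact predictive density is Gaussian with variance $1 + \text{post\_var}$, which gives a reference value.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Assumed (hypothetical) posterior p(theta | X) = N(post_mean, post_var).
post_mean, post_var = 1.0, 0.25

rng = np.random.default_rng(0)
theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=200_000)

def predictive_density(x):
    """Monte Carlo estimate of p(x | X): average of p(x | theta) over posterior samples."""
    return np.mean(normal_pdf(x, theta_samples, 1.0))

x_new = 2.0
bayes = predictive_density(x_new)                     # averages over theta
plug_in = normal_pdf(x_new, post_mean, 1.0)           # uses a single value of theta
exact = normal_pdf(x_new, post_mean, 1.0 + post_var)  # closed form for this Gaussian case
print(bayes, plug_in, exact)
```

The plug-in value ignores the remaining uncertainty in $\theta$, whereas the averaged density is broader because it adds the posterior variance to the noise variance.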

#### How to compute the posterior distribution  $p(\theta \vert X)$

This posterior is determined by starting from some assumed prior distribution $p(\theta)$ and then updating it using Bayes' theorem to take into account the data $X$. Let us make the procedure concrete.

\begin{equation}
p(\theta \vert X) := \frac{p(\theta,X)}{p(X)} = \frac{p(X \vert \theta)p(\theta)}{p(X)} =  \frac{p(\theta)}{p(X)} p(X \vert \theta) = 
  \frac{p(\theta)}{p(X)} \prod_{m=1}^M p(x^m \vert \theta)
\end{equation}



The term $p(X)$ is given by

\begin{equation}
p(X) = \int p(\theta^\prime) \prod_{m=1}^M p(x^m \vert \theta^\prime) d \theta^\prime
\end{equation}
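
For a single scalar parameter, both the posterior and the normalizing term $p(X)$ can be approximated on a grid. The sketch below is an illustration under assumed choices: a unit-variance Gaussian $p(x^m \vert \theta)$ with mean $\theta$, a broad Gaussian prior, and invented data. For this particular model the posterior mean is known in closed form, which provides a check.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log-density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

X = np.array([1.2, 0.7, 1.9, 1.4, 0.8])    # hypothetical data x^1..x^M
prior_var = 100.0                          # broad prior p(theta) = N(0, 100), assumed

theta = np.linspace(-20.0, 20.0, 8001)     # grid over the parameter
d_theta = theta[1] - theta[0]

log_prior = log_normal_pdf(theta, 0.0, prior_var)
# Sum over data points of log p(x^m | theta), for every grid value of theta.
log_lik = log_normal_pdf(X[:, None], theta[None, :], 1.0).sum(axis=0)

joint = np.exp(log_prior + log_lik)        # p(theta) * prod_m p(x^m | theta)
evidence = np.sum(joint) * d_theta         # p(X): integral of the joint over theta
posterior = joint / evidence               # p(theta | X)

grid_mean = np.sum(theta * posterior) * d_theta
# Closed-form posterior mean for a Gaussian prior and Gaussian likelihood.
exact_mean = X.sum() / (1.0 / prior_var + X.size)
print(evidence, grid_mean, exact_mean)
```

A grid of this kind is only practical for very few parameters, since the number of grid points grows exponentially with the dimension of $\theta$.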

#### The difficulties of the integrals for $p(x \vert X)$ and  $p( X)$

Typically, the integrals required to compute $p(x \vert X)$ and $p(X)$ can be evaluated analytically only for very specific density functions. In those special cases the prior is chosen so that the posterior density $p(\theta \vert X)$ has the same functional form as the prior (a conjugate prior).
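
A standard example of such a special case, given here only as an illustration and not as a model used elsewhere in these notes, is binary data $x^m \in \{0, 1\}$ with $p(x^m \vert \theta) = \theta^{x^m}(1-\theta)^{1-x^m}$ and a Beta prior with assumed hyperparameters $a, b > 0$. Writing $k = \sum_{m=1}^M x^m$ for the number of ones:

\begin{equation}
p(\theta) = \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)}, \qquad
\prod_{m=1}^M p(x^m \vert \theta) = \theta^{k}(1-\theta)^{M-k}
\end{equation}

\begin{equation}
p(\theta \vert X) = \frac{\theta^{a+k-1}(1-\theta)^{b+M-k-1}}{B(a+k,\, b+M-k)}, \qquad
p(X) = \frac{B(a+k,\, b+M-k)}{B(a,b)}
\end{equation}

The posterior is again a Beta density, with hyperparameters updated from $(a, b)$ to $(a+k,\, b+M-k)$, so both integrals reduce to evaluations of the Beta function $B(\cdot,\cdot)$.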

### Distribution of weights

We begin by considering the problem of training a network in which the architecture (number of layers, number of hidden units, choice of activation functions  etc.) is given. In the conventional maximum likelihood approach, a single 'best' set of weight values is determined by minimization of a suitable error function. In the Bayesian framework, however, we consider a probability distribution over
weight values. In the absence of any data, this is described by a prior distribution which we shall denote by $p(w)$, and whose form we shall discuss shortly. Here $w = \{w_1,\dots, w_K\}$ denotes the vector of weight and bias parameters.


Let us assume we have a set of input vectors $X= \{x^1, \dots, x^M \}$ and a corresponding set of target vectors $Y= \{y^1, \dots, y^M \}$. We can consider the posterior probability:

\begin{equation}
p(w \vert Y ) = \frac{p(Y\vert w) p(w)}{p(Y)}
\end{equation}

where the denominator is a normalization constant


\begin{equation}
p( Y ) = \int  p(Y\vert w) p(w) \, dw
\end{equation}

which ensures that $p(w \vert Y )$, as defined by Bayes' theorem, integrates to unity over the whole of weight space. We shall see shortly that $p(Y \vert w)$, which represents a model for the noise process on the target data, corresponds to the likelihood function.

Since the data set consists of input $X$ as well as target data $Y$, the input values should strictly be included in Bayes' theorem which should be written in the form:


\begin{equation}
p(w \vert  X, Y ) = \frac{p(Y\vert w, X) p(w \vert X)}{p(Y \vert X)}
\end{equation}

As noted earlier, however, feed-forward networks trained by supervised learning do not in general model the distribution $p(x)$ of the input data. Therefore $X$ always appears as a conditioning variable on the right-hand side of the probabilities in the previous equation. **We shall therefore continue to omit it from now on in order to simplify the notation**.


The picture of learning provided by the Bayesian formalism is as follows. 

- We start with some prior distribution over the weights given by $p(w)$. Since we generally have little idea at this stage of what the weight values should be, the prior might express some rather general properties such as smoothness of the network function, but will otherwise leave the weight values fairly unconstrained. The prior will therefore typically be a rather broad distribution.
- Once we have observed the data, this prior distribution can be converted to a posterior distribution using Bayes' theorem in the form
  \begin{equation}
  p(w \vert Y ) = \frac{p(Y\vert w) p(w)}{p(Y)}
  \end{equation}
  This posterior distribution will be more compact, expressing the fact that we have learned something about the extent to which different weight values are consistent with the observed data.
- In order to evaluate the posterior distribution we need to provide expressions for the prior distribution $p(w)$ and for the likelihood function $p(Y\vert w)$.
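
The picture above can be illustrated with a toy model that has a single weight. Everything specific in this sketch is an assumption made for illustration: the "network" is just $y = w x$, the noise on the targets is Gaussian with known standard deviation, the prior over $w$ is a broad zero-mean Gaussian, and the data are synthetic. The point is only to show the posterior becoming narrower than the prior once data are observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a hypothetical one-weight model y = w * x + noise.
x = np.linspace(-1.0, 1.0, 20)
true_w, noise_std = 1.5, 0.5
y = true_w * x + rng.normal(0.0, noise_std, size=x.size)

w = np.linspace(-10.0, 10.0, 4001)         # grid over weight space
dw = w[1] - w[0]
prior_std = 3.0                            # broad Gaussian prior, assumed

# Log prior and log likelihood up to additive constants; the constants
# cancel when the densities are normalized on the grid below.
log_prior = -0.5 * (w / prior_std) ** 2
residuals = y[None, :] - w[:, None] * x[None, :]
log_lik = -0.5 * np.sum((residuals / noise_std) ** 2, axis=1)

log_post = log_prior + log_lik
unnorm = np.exp(log_post - log_post.max())
posterior = unnorm / (np.sum(unnorm) * dw)  # normalization plays the role of p(Y)
prior = np.exp(log_prior)
prior /= np.sum(prior) * dw

def std_on_grid(density):
    """Standard deviation of a density tabulated on the grid w."""
    mean = np.sum(w * density) * dw
    return np.sqrt(np.sum((w - mean) ** 2 * density) * dw)

prior_width = std_on_grid(prior)
posterior_width = std_on_grid(posterior)
print(prior_width, posterior_width)
```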


#### Distribution of weights: Gaussian prior

We first consider the prior probability distribution for the weights. This distribution should reflect any prior knowledge we have about the form of network mapping we expect to find. In general, we can write this distribution as an exponential of the form


## Distribution of network outputs

## Distribution of network outputs: classification