## Theory

Suppose we have a two-dimensional set of $N$ points $\{x_i,y_i\}$.
The model $M$ predicts the values of $y$ as being 
\begin{alignat}{2}
y \,&=\, f(x) + \epsilon \ \ \ \ \ {\rm where} \\
f(x) \,&=\, b_0 + b_1x
\end{alignat}
which is a straight line with parameters $b_0$ (intercept) and $b_1$ (gradient). 
$f(x)$ is the generative model: it gives the noise-free predictions of the data given the parameters.
The residuals
$\epsilon = y - f(x)$ are modelled as a zero-mean Gaussian random variable with standard deviation $\sigma$, i.e.
$\epsilon \sim {\cal N}(0, \sigma)$.
This is the *noise model*.
Assuming the $\{x\}$ are noise free, this tells us that the likelihood is
\begin{equation}
P(y_i  |  x_i, \theta, M) \,=\, \frac{1}{\sigma\sqrt{2 \pi}} \exp{ \left[ -\frac{[y_i - f(x_i; b_0, b_1)]^2}{2\sigma^2} \right] } 
\end{equation}
where $\theta = (b_0, b_1, \sigma)$ are the parameters of the model.
It may come as a surprise that we will try to infer the uncertainty in the data points from the data. Yet $\sigma$ is a model parameter just like the others.
Although the $x$ values are supplied with the data, we assume them to be fixed: they are not described by a measurement model.
Thus the data are $D=\{y_i\}$. Assuming that the various $y$ measurements are independent, the log likelihood for all $N$ data points is
\begin{equation}
\ln P(\{y_i\}  |  \{x_i\}, \theta, M) \,=\, \sum_{i=1}^N \ln P(y_i  |  x_i, \theta, M) \ .
\end{equation}
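Substituting the Gaussian likelihood for each point into this sum gives the explicit form
\begin{equation}
\ln P(\{y_i\}  |  \{x_i\}, \theta, M) \,=\, -N\ln\sigma \,-\, \frac{N}{2}\ln(2\pi) \,-\, \frac{1}{2\sigma^2}\sum_{i=1}^N [y_i - b_0 - b_1 x_i]^2 \ .
\end{equation}
The $-N\ln\sigma$ term favours small $\sigma$, whereas the last term favours large $\sigma$ whenever the residuals are large. This trade-off is what allows $\sigma$ to be constrained by the data.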
In general none of the three parameters are known in advance, so we want to infer their posterior PDF from the data, which is given as usual by
\begin{equation}
P(\theta  |  D) \,\propto\, P(D  | \theta) P(\theta) \ .
\end{equation}
As we will be sampling from the posterior we do not need to compute the normalization constant.
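
As a minimal sketch of what the unnormalized log posterior can look like in code, the Python functions below evaluate $\ln P(D | \theta) + \ln P(\theta)$ for $\theta = (b_0, b_1, \sigma)$. This is an illustration written in Python, not the R code used in this course. The function names and the hyperparameters `b0_mean` and `b0_sd` are hypothetical, and the priors on $b_1$ (uniform in the angle $\arctan b_1$) and on $\sigma$ (proportional to $1/\sigma$) are assumptions made here for the example.

```python
import math

def log_likelihood(theta, x, y):
    """Log likelihood: sum of Gaussian log densities of the residuals."""
    b0, b1, sigma = theta
    n = len(y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return (-n * math.log(sigma) - 0.5 * n * math.log(2 * math.pi)
            - rss / (2 * sigma ** 2))

def log_prior(theta, b0_mean=0.0, b0_sd=2.0):
    """Unnormalized log prior. All choices below are assumptions for this sketch."""
    b0, b1, sigma = theta
    if sigma <= 0:
        return -math.inf                           # sigma must be positive
    lp_b0 = -0.5 * ((b0 - b0_mean) / b0_sd) ** 2   # Gaussian prior on b0
    lp_b1 = -math.log(1 + b1 ** 2)                 # uniform in arctan(b1)
    lp_sigma = -math.log(sigma)                    # prior proportional to 1/sigma
    return lp_b0 + lp_b1 + lp_sigma

def log_posterior(theta, x, y):
    """Unnormalized log posterior = log prior + log likelihood."""
    lp = log_prior(theta)
    if lp == -math.inf:
        return lp                                  # skip the likelihood if the prior is zero
    return lp + log_likelihood(theta, x, y)

# Usage on three noise-free points lying on y = 1 + 2x
x = [0.0, 1.0, 2.0]
y = [1.0, 3.0, 5.0]
print(log_posterior((1.0, 2.0, 1.0), x, y))
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied, and constant terms can be dropped because only differences in the log posterior matter for sampling.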

## Procedure

Given a set of data, the procedure to compute the posterior is as follows:
* define the prior PDF over the parameters. I will use plausible yet convenient priors, and I will make use of a variable transformation;
* define the covariance matrix of the proposal distribution. I will use a multivariate Gaussian with a diagonal covariance matrix;
* define the starting point (initialization) of the MCMC;
* define the number of burn-in iterations and the number of sampling iterations.
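
The four choices above map directly onto the arguments of a Metropolis sampler. The following is a minimal Python sketch under stated assumptions, not the `metrop` routine used in this course: the name `metropolis_sketch`, its argument names, and the toy two-parameter target are all hypothetical. In the linear model problem the target would be the unnormalized log posterior over $(b_0, b_1, \sigma)$.

```python
import math
import random

def metropolis_sketch(log_target, theta_init, step_sd, n_burnin, n_sample, seed=None):
    """Metropolis sampler with an independent Gaussian step for each parameter.

    log_target: function returning the log of the (unnormalized) target density
    theta_init: starting point of the chain
    step_sd:    proposal standard deviations (square roots of the diagonal
                of the proposal covariance matrix)
    Returns the post-burn-in samples, their log densities, and the
    acceptance rate over the sampling phase.
    """
    rng = random.Random(seed)
    theta = list(theta_init)
    logp = log_target(theta)
    samples, logps, n_accept = [], [], 0
    for it in range(n_burnin + n_sample):
        # Symmetric proposal: perturb each parameter independently.
        proposal = [t + rng.gauss(0.0, s) for t, s in zip(theta, step_sd)]
        logp_prop = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        delta = logp_prop - logp
        if delta >= 0 or rng.random() < math.exp(delta):
            theta, logp = proposal, logp_prop
            if it >= n_burnin:
                n_accept += 1
        # Record the current state (repeated if the proposal was rejected).
        if it >= n_burnin:
            samples.append(list(theta))
            logps.append(logp)
    return samples, logps, n_accept / n_sample

# Toy target standing in for the log posterior: independent unit-variance
# Gaussians centred on (1, -2).
def log_target_toy(theta):
    return -0.5 * ((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)

samples, logps, acc_rate = metropolis_sketch(
    log_target_toy, theta_init=[0.0, 0.0], step_sd=[1.0, 1.0],
    n_burnin=1000, n_sample=20000, seed=1)
print("acceptance rate:", round(acc_rate, 2))
```

Because the Gaussian proposal is symmetric, the acceptance test needs only the ratio of target densities, which is why the normalization constant of the posterior is never required.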

Once we have run the MCMC we perform the following analyses:
* thin the chains;
* plot the chains and the one-dimensional marginal posterior PDFs over the parameters. I do the latter via kernel density estimation;
* plot the two-dimensional posterior distributions of all three pairs of parameters, simply by plotting the samples (we could do two-dimensional kernel density estimation instead). I do this to look for correlations between the parameters;
* calculate the maximum a posteriori (MAP) values of the model parameters from the MCMC chains, calculate and plot the resulting model, and compare to the original data;
* calculate the predictive posterior distribution over $y$ at a new data point.

Note that because we have samples drawn from the posterior, we don't need the actual values of the posterior density in order to plot the posteriors. We likewise don't have to do any integration to get the one-dimensional marginal distributions.
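
To make the sample-based analysis concrete, here is a short Python sketch of thinning, marginalization, a MAP estimate, and predictive draws. It is an illustration only, not the course's R code: the function names and the tiny hand-made chain are hypothetical. Taking the MAP as the sample with the highest log posterior is one simple approximation assumed here; plotting and kernel density estimation are left out.

```python
import random

def thin(values, step):
    """Keep every `step`-th element to reduce correlation between samples."""
    return values[::step]

def marginal(samples, j):
    """Samples of parameter j alone: marginalizing is just ignoring the others."""
    return [s[j] for s in samples]

def map_estimate(samples, logps):
    """Return the sample with the highest (unnormalized) log posterior."""
    best = max(range(len(samples)), key=lambda i: logps[i])
    return samples[best]

def predictive_draws(samples, x_new, seed=None):
    """One draw of y at x_new for each posterior sample (b0, b1, sigma)."""
    rng = random.Random(seed)
    return [rng.gauss(b0 + b1 * x_new, sigma) for b0, b1, sigma in samples]

# Tiny hand-made "chain", purely for illustration. A real chain would come
# from the MCMC run, together with the log posterior of each sample.
samples = [[1.0, 2.0, 0.5], [1.1, 1.9, 0.6], [0.9, 2.1, 0.4], [1.0, 2.0, 0.5]]
logps = [-3.0, -2.5, -4.0, -3.0]

thinned = thin(samples, 2)          # thin samples and log posteriors together
thinned_logps = thin(logps, 2)
b1_samples = marginal(samples, 1)   # marginal posterior samples of b1
b0_map, b1_map, sigma_map = map_estimate(samples, logps)
y_new = predictive_draws(samples, x_new=3.0, seed=0)
print("MAP estimate:", b0_map, b1_map, sigma_map)
```

The predictive draws include both the parameter uncertainty (through the spread of the samples) and the noise (through $\sigma$), so a histogram of `y_new` approximates the predictive distribution at `x_new`.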


## Code

The analysis described above is all done with the R script `linearmodel_posterior.R`, with explanations provided as comments in the code. The code looks long, but a good part of it is concerned with plotting and analysing the results. The script first sources two other files. The first, `linearmodel_functions.R`, defines functions that compute the (logarithm of the) prior, likelihood, and posterior. The second, `metropolis.R`, contains the Metropolis algorithm `metrop`. The rest of the script is then executed. The code at the end of the file does prediction, which we are not covering in this course, so ignore or delete that part of the program.

## Exercises

To better appreciate how MCMC works in this example, experiment with the code. In particular, change the following (one at a time):
* the initialization of the parameters;
* the parameter step sizes;
* the number of iterations and the length of the burn-in;
* the value of the standard deviation of the prior on $b_0$;
* the prior on $b_0$ from a Gaussian to an improper uniform prior. Do this by setting `b0Prior` to unity in the function `logprior.linearmodel`;
* the amount of data. Try both more data points (e.g. 100) and fewer, including just 3, 2, and 1 data points.
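
For the last exercise it helps to be able to generate data sets of any size from a known straight line. The Python sketch below does this under stated assumptions: the function name, the "true" parameter values, the uniform range for $x$, and the seed are all hypothetical, and this is not how the course's R script obtains its data.

```python
import random

def simulate_data(n, b0_true, b1_true, sigma_true, x_min=0.0, x_max=10.0, seed=None):
    """Draw n points from y = b0 + b1*x + Gaussian noise of standard deviation sigma."""
    rng = random.Random(seed)
    # The x values are treated as fixed and noise free, as in the model.
    x = sorted(rng.uniform(x_min, x_max) for _ in range(n))
    y = [b0_true + b1_true * xi + rng.gauss(0.0, sigma_true) for xi in x]
    return x, y

# Data sets of decreasing size, as suggested in the exercise
for n in (100, 3, 2, 1):
    x, y = simulate_data(n, b0_true=0.0, b1_true=1.0, sigma_true=2.0, seed=42)
    print(n, "points; first (x, y):", (round(x[0], 2), round(y[0], 2)))
```

Fixing the seed makes each run reproducible, so that changes in the posterior can be attributed to the number of points rather than to a different noise realization.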


## Things to think about and to investigate

* Why is the inferred straight line sometimes quite different from the true straight line?
* How can we infer the error bars as well as the line?
* What happens if we have much more data?
* How are the results affected by the choice of priors?
* What happens if we reduce the size of the data set to two points, or even just one?
* How, logically thinking, can we estimate three parameters at all given only one or two data points? (Isn't the solution somehow "underdetermined"?)
* What would the result be if we had no data? (Don't try to run my code with no data, however, as it's not robust to this.)
* How would you extend the code to fit higher order polynomials, or even different types of functions?

If your current work involves fitting curves to data, think about taking this Bayesian approach of determining the posterior PDF over the parameters. It has several advantages over simply using linear least squares: we get full posteriors on the parameters, rather than simplified symmetric error bars; we can take into account prior information; and it can easily be extended to include uncertainty on both axes, upper/lower limits, outliers, etc.


