# Bayesian linear regression

If we infer the parameters of a linear regression model 

\begin{align}
\mathbf{y} \sim p(\mathbf{y} \mid \mathbf{X}, \boldsymbol \beta, \sigma^2) = \mathcal{N}(\mathbf{X} \boldsymbol \beta, \sigma^2 \mathbf{I}_n),
\end{align}

where $\mathbf{y}$ is an $n$-dimensional vector of (centered) Gaussian responses, $\mathbf{X}$ is an $(n \times p)$-dimensional design matrix and $\boldsymbol \beta$ is a vector of $p$ coefficients, we usually compute the maximum likelihood estimate (MLE) $\hat{\boldsymbol \beta}$ as the *best* solution of the problem.

In a Bayesian context, where we are (I would generally argue) primarily interested in the *analysis of beliefs* rather than in establishing *frequentist guarantees*, we introduce prior information into the model (I took the term *analysis of beliefs* from Larry Wasserman, I think).

As a refresher, the maximum likelihood estimate is obtained by maximizing the likelihood over the parameters:

\begin{align}
\max_{\boldsymbol \beta, \sigma^2} \mathcal{L}(\boldsymbol \beta, \sigma^2) &= \max_{\boldsymbol \beta, \sigma^2} p(\mathbf{y} \mid \mathbf{X}, \boldsymbol \beta, \sigma^2)\\
 &= \max_{\boldsymbol \beta, \sigma^2} \prod_{i=1}^n p(y_i \mid \mathbf{x}_i, \boldsymbol \beta, \sigma^2).
\end{align}
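
For the Gaussian model this maximization has a closed form: the least-squares solution for the coefficients and the mean squared residual for the variance. A minimal sketch with NumPy follows; the function name `mle_linear_regression` and the simulated data are assumptions made for illustration only.

```python
import numpy as np


def mle_linear_regression(X, y):
    """Gaussian MLE: least squares for beta, mean squared residual for sigma^2."""
    # lstsq solves min_beta ||y - X beta||^2, which maximizes the Gaussian likelihood in beta
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    # The MLE of the variance divides by n (not n - p)
    sigma2_hat = residuals @ residuals / len(y)
    return beta_hat, sigma2_hat


# Simulated data (hypothetical values, only used to exercise the function)
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat, sigma2_hat = mle_linear_regression(X, y)
print(beta_hat, sigma2_hat)
```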

### Normal-Inverse Gamma Prior

For a linear regression model, introducing prior information means putting a prior distribution on the coefficients $\boldsymbol \beta$ and the variance $\sigma^2$.

Here, for the coefficients we will assume a Gaussian prior (as it is conjugate to a Gaussian likelihood for our responses $\mathbf{y}$) with mean $\mathbf{0}$ and covariance $\sigma^2 \mathbf{I}_p$. For $\sigma^2$ we will use an inverse Gamma prior with shape $a$ and scale $b$.

\begin{align}
 \boldsymbol \beta \mid \sigma^2 &\sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_p),\\
 \sigma^2 &\sim \mathcal{IG}(a, b).
\end{align}

Other choices of priors are of course also fine. For demonstration this will do, though. I refer to Gelman's *Bayesian Data Analysis* for more information.
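
To get a feeling for this prior, one can draw from it hierarchically: first $\sigma^2$, then $\boldsymbol \beta$ given $\sigma^2$. The sketch below assumes NumPy; the helper `sample_nig_prior` and the hyperparameter values $a = 3$, $b = 2$ are made up for illustration.

```python
import numpy as np


def sample_nig_prior(p, a, b, size, rng):
    """Draw (beta, sigma^2) pairs from the normal-inverse-gamma prior."""
    # sigma^2 ~ IG(a, b)  <=>  1 / sigma^2 ~ Gamma(shape=a, rate=b);
    # NumPy parameterizes the Gamma by scale = 1 / rate.
    sigma2 = 1.0 / rng.gamma(shape=a, scale=1.0 / b, size=size)
    # beta | sigma^2 ~ N(0, sigma^2 I_p): scale standard normal draws row-wise
    beta = rng.normal(size=(size, p)) * np.sqrt(sigma2)[:, None]
    return beta, sigma2


rng = np.random.default_rng(1)
beta, sigma2 = sample_nig_prior(p=3, a=3.0, b=2.0, size=10_000, rng=rng)

# For a > 1 the prior mean of sigma^2 is b / (a - 1), here 1.0
print(sigma2.mean())
```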

The joint prior distribution has a rather tedious form:

\begin{align}
 \boldsymbol \beta, \sigma^2 &\sim \mathcal{NIG}(\mathbf{0}, \mathbf{I}_p, a, b),\\
 p(\boldsymbol \beta, \sigma^2) &= p(\boldsymbol \beta \mid \sigma^2) \ p(\sigma^2) \\
 &= \frac{1}{ (2\pi\sigma^2)^{\frac{p}{2}}} \exp\left(  -\frac{1}{2\sigma^2}    \boldsymbol \beta^T \boldsymbol \beta \right) \frac{b^a}{\Gamma(a)} \frac{1}{(\sigma^2)^{a+1}} \exp\left(  -\frac{b}{\sigma^2} \right) \\
 &\propto \frac{1}{(\sigma^2)^{\frac{p}{2} + a + 1 }} \exp \left(  -\frac{1}{2\sigma^2} \boldsymbol \beta^T \boldsymbol \beta  -\frac{b}{\sigma^2}\right).
\end{align}

In the last step we dropped every factor that does not depend on $\boldsymbol \beta$ or $\sigma^2$.
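
As a quick numerical check of this step, the sketch below evaluates the full log density and the unnormalized log kernel and compares their difference to the dropped constant $-\frac{p}{2}\log(2\pi) + a \log b - \log\Gamma(a)$. The function names and the test values are assumptions for illustration.

```python
import math

import numpy as np


def log_nig_prior(beta, sigma2, a, b):
    """Full log density: log N(beta | 0, sigma^2 I_p) + log IG(sigma^2 | a, b)."""
    p = beta.shape[0]
    log_normal = -0.5 * p * math.log(2 * math.pi * sigma2) - beta @ beta / (2 * sigma2)
    log_inv_gamma = (
        a * math.log(b) - math.lgamma(a) - (a + 1) * math.log(sigma2) - b / sigma2
    )
    return log_normal + log_inv_gamma


def log_nig_prior_unnormalized(beta, sigma2, a, b):
    """Log of the proportional form, without the constant factors."""
    p = beta.shape[0]
    return -(p / 2 + a + 1) * math.log(sigma2) - beta @ beta / (2 * sigma2) - b / sigma2


# Hypothetical values, only used for the check
beta = np.array([0.3, -1.2, 0.7])
a, b = 3.0, 2.0
log_const = -0.5 * len(beta) * math.log(2 * math.pi) + a * math.log(b) - math.lgamma(a)

# The gap should equal log_const for every value of sigma^2
gaps = [
    log_nig_prior(beta, s2, a, b) - log_nig_prior_unnormalized(beta, s2, a, b)
    for s2 in (0.5, 1.0, 4.0)
]
print(gaps, log_const)
```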

### Normal-Inverse Gamma Posterior 

The posterior in a Bayesian model is proportional to the likelihood times the prior. So in our case:

\begin{align}
  \text{posterior} &\propto \text{likelihood} \times \text{prior},\\
  p(\boldsymbol \beta, \sigma^2 \mid \mathbf{X}, \mathbf{y}) &\propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol \beta, \sigma^2) \ p(\boldsymbol \beta , \sigma^2)\\
  &\propto \frac{1}{(\sigma^2)^{\frac{n}{2}}} \exp \left( -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \boldsymbol \beta)^T(\mathbf{y} - \mathbf{X} \boldsymbol \beta)  \right)\\
  &\quad \times \frac{1}{(\sigma^2)^{\frac{p}{2} + a + 1 }} \exp \left(  -\frac{1}{2\sigma^2} \boldsymbol \beta^T \boldsymbol \beta  -\frac{b}{\sigma^2}\right)\\
  &= \frac{1}{(\sigma^2)^{\frac{n}{2} + \frac{p}{2} + a + 1 }} \exp \left( -\frac{1}{2\sigma^2}  (\mathbf{y} - \mathbf{X} \boldsymbol \beta)^T(\mathbf{y} - \mathbf{X} \boldsymbol \beta)  -\frac{1}{2\sigma^2} \boldsymbol \beta^T \boldsymbol \beta  -\frac{b}{\sigma^2}  \right).
\end{align}
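
The last line translates directly into a function for the unnormalized log posterior. The following is a sketch under assumptions: the name `log_posterior_unnormalized`, the simulated data and the hyperparameter values are all made up. As a sanity check it uses the fact that, for fixed $\sigma^2$, the exponent is a quadratic in $\boldsymbol \beta$ that is maximized at $(\mathbf{X}^T \mathbf{X} + \mathbf{I}_p)^{-1} \mathbf{X}^T \mathbf{y}$.

```python
import numpy as np


def log_posterior_unnormalized(beta, sigma2, X, y, a, b):
    """Log of the posterior kernel: Gaussian likelihood times NIG prior."""
    n, p = X.shape
    residual = y - X @ beta
    return (
        -(n / 2 + p / 2 + a + 1) * np.log(sigma2)
        - residual @ residual / (2 * sigma2)
        - beta @ beta / (2 * sigma2)
        - b / sigma2
    )


# Simulated data (hypothetical values, only used to exercise the function)
rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
y = y - y.mean()  # centered responses, as assumed in the text
a, b = 3.0, 2.0
sigma2 = 0.25

# For fixed sigma^2 the kernel is maximized in beta at the ridge-type solution
beta_ridge = np.linalg.solve(X.T @ X + np.eye(p), X.T @ y)
at_ridge = log_posterior_unnormalized(beta_ridge, sigma2, X, y, a, b)
perturbed = log_posterior_unnormalized(beta_ridge + 0.1, sigma2, X, y, a, b)
print(at_ridge, perturbed)
```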


