And now for something different.

Bayesian regression in just a few minutes.

Prediction problems can be approached using Bayesian methods.  Although methods for nontrivial "learning" (prediction) problems can be computationally intensive, continuing advances in algorithm and hardware development are making Bayesian applications feasible in an ever-widening array of contexts.

### Bayes Theorem

It has been called the "inverse probability" theorem.  Attributed to the 18th-century cleric Thomas Bayes, it's based on a principle of conditional probabilities:  

\begin{align*}
\large
p(A|B) = \frac{p(B|A)p(A)}{p(B)}
\end{align*}  

Here, A and B are events of some sort that may or may not occur. You can see where it came from by using the definition of a conditional probability:


\begin{align*}
\large {
p(A~and~B) = p(A|B)p(B) \\
p(B~and~A) = p(B|A)p(A) \\
p(A~and~B) = p(B~and~A) \\
p(A|B)p(B) = p(B|A)p(A) }
\end{align*}  

\begin{align*}
\large{
p(A|B) = \frac{p(B|A)p(A)}{p(B)} \\ }
\end{align*}  
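To see the derivation at work with numbers, here is a minimal sketch. The four joint probabilities are made-up illustrative values (not from any real data); everything else follows from the definition of conditional probability used above.

```python
# Hypothetical joint probabilities for events A and B (illustrative values only).
p_a_and_b = 0.20        # p(A and B)
p_a_and_notb = 0.10     # p(A and not B)
p_nota_and_b = 0.20     # p(not A and B)
p_nota_and_notb = 0.50  # p(not A and not B)

# Marginal probabilities are sums over the joint table.
p_a = p_a_and_b + p_a_and_notb   # p(A)
p_b = p_a_and_b + p_nota_and_b   # p(B)

# Conditional probabilities from the definition p(X|Y) = p(X and Y) / p(Y).
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Bayes theorem: recover p(A|B) from p(B|A), p(A), and p(B).
bayes_rhs = p_b_given_a * p_a / p_b

print(p_a_given_b, bayes_rhs)
```

Both printed values should agree (up to floating-point rounding), which is all the theorem claims.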

It's pretty simple.  In scientific applications, we use Greek letters to make it look more serious:  

\begin{align*}
\large
p(\theta|D)~=~\frac{p(D|\theta)p(\theta)}{p(D)}
\end{align*}  

Here, $\theta$ is one or more parameters we want to learn about, and D is "data," or information. $\theta$ can be a very long vector. The term on the LHS is the posterior probability of $\theta$ conditional on D; it tells us about uncertainty about $\theta$.  The two quantities in the numerator on the RHS are referred to as the likelihood of the data given $\theta$, and the prior probability of $\theta$.  p(D) is often called the data density.  For a particular data set it's constant, so when estimating p($\theta$|D) it's ignored, and the version of the above that's used is:  


\begin{align*}
\large
p(\theta|D)~\propto~p(D|\theta)p(\theta)
\end{align*} 
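Here is a small sketch of why constants that don't depend on $\theta$ can be dropped. It assumes a hypothetical coin-flipping example (7 heads in 10 flips, $\theta$ = probability of heads, flat prior) and approximates the posterior on a grid of candidate $\theta$ values. The data and the names `grid`, `normalize`, and so on are illustrative assumptions.

```python
import math

heads, flips = 7, 10                      # hypothetical data D
grid = [i / 1000 for i in range(1001)]    # candidate theta values in [0, 1]

def normalize(weights):
    """Rescale nonnegative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

prior = [1.0 for _ in grid]               # flat prior p(theta)

# Likelihood kernel: the part of the binomial likelihood that depends on theta.
lik_kernel = [t ** heads * (1 - t) ** (flips - heads) for t in grid]
# Full binomial likelihood: same kernel times a constant that is free of theta.
lik_full = [math.comb(flips, heads) * k for k in lik_kernel]

post_from_kernel = normalize([l * p for l, p in zip(lik_kernel, prior)])
post_from_full = normalize([l * p for l, p in zip(lik_full, prior)])

# Largest difference between the two normalized posteriors over the grid.
max_diff = max(abs(a - b) for a, b in zip(post_from_kernel, post_from_full))
post_mean = sum(t * w for t, w in zip(grid, post_from_kernel))

print(max_diff, post_mean)
```

The two posteriors coincide after normalization, and the grid posterior mean lands close to the known analytic answer for this model, (7 + 1) / (10 + 2).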

One way of looking at this theorem is that it is a _learning algorithm_.  p($\theta$) is what we know about parameters of interest before getting D.  p(D|$\theta$) is the likelihood of data we've observed given what we have believed about $\theta$.  p($\theta$|D) is how we've "adjusted" what we believe about $\theta$ now that we've received D. p($\theta$|D) can be our "best guess" about p($\theta$) the next time we are about to get new D.  
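The "learning" view can be sketched with a model where the update has a closed form. This assumes a coin-flip likelihood with a Beta prior on $\theta$, for which the posterior is again a Beta distribution; the two data batches are invented for illustration.

```python
def update_beta(prior, heads, tails):
    """Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails) posterior."""
    a, b = prior
    return (a + heads, b + tails)

def beta_mean(params):
    """Mean of a Beta(a, b) distribution."""
    a, b = params
    return a / (a + b)

prior = (1, 1)            # Beta(1, 1): flat prior on theta
batch_1 = (7, 3)          # hypothetical first data set: 7 heads, 3 tails
batch_2 = (2, 8)          # hypothetical second data set: 2 heads, 8 tails

# Learn from the first batch, then reuse that posterior as the next prior.
post_1 = update_beta(prior, *batch_1)
post_2 = update_beta(post_1, *batch_2)

# Same answer as seeing all the data at once.
post_all = update_beta(prior, batch_1[0] + batch_2[0], batch_1[1] + batch_2[1])

print(post_1, beta_mean(post_1))
print(post_2, beta_mean(post_2))
print(post_all)
```

Yesterday's posterior serving as today's prior gives the same result as a single update on the pooled data.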

In any given Bayesian model, $\theta$ may have many parameters, and p($\theta$|D) may be high-dimensional.  Approximating p($\theta$|D) is the main objective of Bayesian modeling.  It's what we use to make inferences about $\theta$.

**QUESTION** Bayes Theorem is used in many different applications.  Can you think of any?

### Getting an approximation of p($\theta$|D) 

Most applications these days involve estimating many parameters, and the estimates cannot be obtained analytically, e.g. by solving equations.  Stochastic simulation methods are typically used instead.  These procedures make use of the fact that the conditional posterior distribution of a particular model parameter can be expressed as a function of the values of the other parameters and the data. So these methods repeatedly iterate over the parameters, drawing a value for each in turn, ultimately resulting in an approximation to the joint posterior density of the parameters.  These kinds of procedures are generally called _Markov Chain Monte Carlo_ (MCMC) simulation methods.  There are different kinds, and what's used depends on the characteristics of the distributions to be estimated. 
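As a toy illustration of iterating over parameters, here is a sketch of a Gibbs sampler. It does not use real data: the target is assumed to be a two-parameter standard normal distribution with correlation `rho`, standing in for a joint posterior. For that target, each parameter given the other is Normal(`rho` times the other, 1 - `rho`²), so each one can be drawn directly.

```python
import math
import random

random.seed(1)

rho = 0.8                           # assumed correlation of the stand-in "posterior"
cond_sd = math.sqrt(1 - rho ** 2)   # sd of each parameter given the other

def gibbs(n_iter, start=(0.0, 0.0)):
    """Alternate draws of theta1 | theta2 and theta2 | theta1."""
    theta1, theta2 = start
    draws = []
    for _ in range(n_iter):
        theta1 = random.gauss(rho * theta2, cond_sd)  # draw theta1 given theta2
        theta2 = random.gauss(rho * theta1, cond_sd)  # draw theta2 given theta1
        draws.append((theta1, theta2))
    return draws

draws = gibbs(20000)
n = len(draws)

# Summaries of the draws should approximate the joint target distribution.
mean1 = sum(d[0] for d in draws) / n
mean2 = sum(d[1] for d in draws) / n
var1 = sum((d[0] - mean1) ** 2 for d in draws) / n
var2 = sum((d[1] - mean2) ** 2 for d in draws) / n
cov = sum((d[0] - mean1) * (d[1] - mean2) for d in draws) / n
corr = cov / math.sqrt(var1 * var2)

print(mean1, mean2, corr)
```

Although each step only ever looks at one parameter at a time, the collected pairs behave like draws from the joint distribution: the sample correlation comes out near `rho`.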

### Model Specification and Estimation

Here are the basic steps used when estimating a Bayesian regression model.   In the following it's assumed that data are available, although you can specify a model before laying hands on data.  (You should, in fact.)  The simple application in what follows after this exemplifies what's described here.

First, you specify a _full probability model_ that defines the relationships between variables, and the distributions of all parameters and functions of them.  This is one place where Bayesian methods differ from classical approaches.  Parameters have _hyperparameters_: parameters have parameters.  This part requires specifying _priors_ for parameters.

Next, you select a _sampling method_ that is used iteratively to obtain values from the conditional posterior of each parameter that is defined in your model specification. Samplers differ in whether they require drawing directly from those conditional posterior distributions (the Gibbs sampler does; the Metropolis sampler does not).  Commonly used samplers include the Gibbs sampler, the Metropolis sampler, and Hamiltonian Monte Carlo samplers such as the No-U-Turn Sampler (NUTS).

Assuming that you have code that can run your sampler, you define initial values for each parameter, your $\theta$ elements, and let your algorithm run.  Each iteration of it produces a "draw," a value, from the posterior of each parameter. Based on a general theory, given that a model's parameters are sufficiently identified in your specification, these "chains" or "traces" of parameter values generated by the iteration of the algorithm will "settle in" to be random draws from stable, conditional posterior parameter distributions. The iterations up until this occurs are usually called the "burn-in" for an MCMC run. The values obtained after the burn-in are used to estimate parameters and their uncertainties.  
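The steps above can be sketched for a tiny regression. Everything here is an assumption made for illustration: the data are simulated, the noise standard deviation is treated as known, the priors are Normal(0, 10), the predictor is centered, and the sampler is a plain random-walk Metropolis with a hand-picked step size. It is a sketch, not a recommended production sampler.

```python
import math
import random

random.seed(42)

# Simulated data (hypothetical): y = 1 + 2*x + noise, with known noise sd.
n = 50
sigma = 1.0
x = [i / 10 for i in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, sigma) for xi in x]
x_bar = sum(x) / n
xc = [xi - x_bar for xi in x]   # centered predictor

def log_posterior(alpha, beta):
    """Unnormalized log posterior: Normal likelihood plus Normal(0, 10) priors."""
    log_lik = sum(-0.5 * ((yi - alpha - beta * xi) / sigma) ** 2
                  for xi, yi in zip(xc, y))
    log_prior = -0.5 * (alpha / 10.0) ** 2 - 0.5 * (beta / 10.0) ** 2
    return log_lik + log_prior

def metropolis(n_iter, step, start):
    """Random-walk Metropolis over (alpha, beta); returns the full chain."""
    alpha, beta = start
    current = log_posterior(alpha, beta)
    chain = []
    for _ in range(n_iter):
        a_new = alpha + random.gauss(0.0, step)
        b_new = beta + random.gauss(0.0, step)
        proposed = log_posterior(a_new, b_new)
        # Accept with probability min(1, posterior ratio).
        if random.random() < math.exp(min(0.0, proposed - current)):
            alpha, beta, current = a_new, b_new, proposed
        chain.append((alpha, beta))
    return chain

chain = metropolis(n_iter=20000, step=0.15, start=(0.0, 0.0))  # initial values
burn_in = 2000
kept = chain[burn_in:]          # discard the burn-in draws

alpha_mean = sum(a for a, _ in kept) / len(kept)
beta_mean = sum(b for _, b in kept) / len(kept)

# Least-squares values for comparison (the priors here are very weak).
y_bar = sum(y) / n
ols_slope = sum(xi * yi for xi, yi in zip(xc, y)) / sum(xi ** 2 for xi in xc)

print(alpha_mean, y_bar)
print(beta_mean, ols_slope)
```

Because the priors are so weak, the posterior means should sit close to the least-squares estimates. With the predictor centered, `alpha` is the expected value of y at the average x rather than at x = 0.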

There's a lot more to this in the details, of course. But what's above is what it's about, in a nutshell.

### Prediction

Once you've approximated the joint posterior probability distribution of the model parameters, conditional on the data, you can use it to make predictions for new data.  You can get prediction error estimates by combining the new data with random draws from the posterior density.
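Here is a minimal sketch of prediction from posterior draws. To keep it self-contained it assumes a one-parameter regression through the origin with known noise sd and a Normal prior on the slope, where the posterior is available in closed form, so the draws come straight from it rather than from an MCMC run. The data and `x_new` are made up.

```python
import math
import random

random.seed(7)

sigma = 1.0        # noise sd, assumed known
prior_sd = 10.0    # assumed prior: beta ~ Normal(0, prior_sd)

# Simulated training data (hypothetical): y = 2*x + noise.
x = [float(i) for i in range(1, 11)]
y = [2.0 * xi + random.gauss(0.0, sigma) for xi in x]

# Closed-form Normal posterior for beta in y = beta * x + noise.
post_prec = 1 / prior_sd ** 2 + sum(xi * xi for xi in x) / sigma ** 2
post_mean = (sum(xi * yi for xi, yi in zip(x, y)) / sigma ** 2) / post_prec
post_sd = math.sqrt(1 / post_prec)

# Predict y at a new x by simulation.
x_new = 5.0
pred = []
for _ in range(10000):
    beta = random.gauss(post_mean, post_sd)         # parameter uncertainty
    pred.append(random.gauss(beta * x_new, sigma))  # plus observation noise

pred.sort()
pred_mean = sum(pred) / len(pred)
lower = pred[len(pred) * 25 // 1000]    # approx. 2.5th percentile
upper = pred[len(pred) * 975 // 1000]   # approx. 97.5th percentile

print(pred_mean, (lower, upper))
```

The spread of `pred` reflects both uncertainty about the slope and the noise in a new observation, which is what a prediction interval should capture.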

### But Why?

It's reasonable to ask why you would go to what appears to be a lot of effort when it would be easier to just use conventional methods.  The answer has many parts.  

First, Bayesian modeling methods can be used to estimate models where it's not possible to use non-Bayesian methods.  Bayesian models can have many thousands of parameters.  They can be thought of as "big parameter space" models.

Second, Bayesian estimates are _shrinkage_ estimates, so that they tend to mitigate overfitting by models.  

Third, they permit incorporating _prior knowledge_ into the estimation of parameters given data. If there's a lot of prior uncertainty about a parameter, you use a relatively uninformative prior for it. 

Fourth, hypothesis testing doesn't require relying on asymptotics as sample sizes go to infinity, or on doing thought experiments using imaginary data.  You just use the posterior densities.  

Fifth, once you've estimated parameter chains, you can use them to compute posterior distributions of _functions_ of the parameters.  

Sixth, parameters for individual observational units can be estimated even when the unit-level data are sparse due to the "partial pooling" of information that hierarchical models afford.

Lastly, Bayesian methods allow a "natural" way of dealing with missing values.  You just estimate them as parameters are estimated, all together in the same simulation.
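The shrinkage mentioned in the second point can be shown with a few lines of arithmetic. This sketch assumes the simplest case: estimating a Normal mean with known noise sd and a Normal prior, where the posterior mean is a precision-weighted average of the sample mean and the prior mean. The five data values are invented.

```python
def posterior_mean_normal(data, sigma, prior_mean, prior_sd):
    """Posterior mean of a Normal mean (known sigma) under a Normal prior."""
    n = len(data)
    sample_mean = sum(data) / n
    data_precision = n / sigma ** 2
    prior_precision = 1 / prior_sd ** 2
    weight = data_precision / (data_precision + prior_precision)
    # Weighted average: pulled from the sample mean toward the prior mean.
    return weight * sample_mean + (1 - weight) * prior_mean

data = [2.9, 3.4, 2.2, 3.8, 2.7]    # hypothetical observations
sample_mean = sum(data) / len(data)

# A tight prior centered at 0 shrinks the estimate noticeably.
tight = posterior_mean_normal(data, sigma=1.0, prior_mean=0.0, prior_sd=1.0)
# A vague prior leaves the estimate almost equal to the sample mean.
vague = posterior_mean_normal(data, sigma=1.0, prior_mean=0.0, prior_sd=10.0)

print(sample_mean, tight, vague)
```

The tighter the prior (or the sparser the data), the further the estimate is pulled toward the prior mean. With a relatively uninformative prior, the data dominate.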