Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

# A Brief Bayes Digression

(_Originally prepared for the 2018 Univ. of Utah DeCART data science for health care workshop on machine learning and predictive analytics._)

And now for something different.

Bayesian regression in just a few minutes.

Prediction problems can be approached using Bayesian methods.  Although methods for nontrivial "learning" (prediction) problems can be computationally intensive, continuing advances in algorithm and hardware development are making Bayesian applications feasible in an ever-widening array of contexts.

### Bayes Theorem

It has been called the "inverse probability" theorem.  Attributed to the 18th century cleric Thomas Bayes, it's based on a principle of conditional probabilities:  

\begin{align*}
\large
p(A|B) = \frac{p(B|A)p(A)}{p(B)}
\end{align*}  

Here, A and B are events of some sort that may or may not occur. 

You can see where this theorem came from by using the definition of a conditional probability:


\begin{align*}
\large {
p(A~and~B) = p(A|B)p(B) \\
p(B~and~A) = p(B|A)p(A) \\
p(A~and~B) = p(B~and~A) \\
p(A|B)p(B) = p(B|A)p(A) }
\end{align*}  

Dividing both sides by $p(B)$ gives the theorem:

\begin{align*}
\large{
p(A|B) = \frac{p(B|A)p(A)}{p(B)} \\ }
\end{align*}  

It's pretty simple.  
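
As a quick numeric check of the theorem, here is a minimal sketch using a diagnostic-test setting. The event labels and all three input probabilities are hypothetical, chosen only for illustration: A is "has the condition" and B is "tests positive."

```python
# Hypothetical inputs, for illustration only.
p_a = 0.01              # p(A): prior probability of the condition
p_b_given_a = 0.95      # p(B|A): probability of a positive test given the condition
p_b_given_not_a = 0.05  # p(B|not A): false-positive rate

# p(B) from the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Bayes' theorem: p(A|B) = p(B|A) p(A) / p(B).
p_a_given_b = p_b_given_a * p_a / p_b

print(f"p(B)   = {p_b:.4f}")
print(f"p(A|B) = {p_a_given_b:.4f}")
```

With these made-up numbers, a positive test raises the probability of A from 0.01 to roughly 0.16, because p(A) is small relative to the false-positive rate.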

In scientific applications, we use Greek letters to make it look more serious:  

\begin{align*}
\large
p(\theta|D)~=~\frac{p(D|\theta)p(\theta)}{p(D)}
\end{align*}  

Here, $\theta$ is one or more parameters we want to learn about, and D is "data," or information. $\theta$ can be a very long vector. The term on the left-hand side (LHS) is the posterior probability of $\theta$ conditional on D; it describes our uncertainty about $\theta$ after seeing D.  The two quantities in the numerator on the right-hand side (RHS) are the likelihood of the data given $\theta$, and the prior probability of $\theta$.  p(D) is often called the data density.  For a particular data set it doesn't depend on $\theta$, so it's a constant that is usually dropped when estimating p($\theta$|D), and the version of the above that's used is:  


\begin{align*}
\large
p(\theta|D)~\propto~p(D|\theta)p(\theta)
\end{align*} 
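
To see why dropping p(D) does no harm, here is a minimal grid-approximation sketch. It assumes a single parameter (a success probability), a binomial likelihood, a flat prior, and hypothetical data of 7 successes in 10 trials; none of these choices come from the text above. The unnormalized product of likelihood and prior is simply rescaled to sum to 1.

```python
import math

def binomial_likelihood(theta, successes, trials):
    """p(D|theta) for a binomial count."""
    return (math.comb(trials, successes)
            * theta ** successes
            * (1.0 - theta) ** (trials - successes))

successes, trials = 7, 10          # hypothetical data
n_grid = 1001
grid = [i / (n_grid - 1) for i in range(n_grid)]   # candidate theta values in [0, 1]
prior = [1.0 for _ in grid]                        # flat prior, an assumption

# Likelihood times prior: proportional to the posterior.
unnormalized = [binomial_likelihood(t, successes, trials) * p
                for t, p in zip(grid, prior)]

# Rescaling plays the role of dividing by p(D).
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

posterior_mean = sum(t * w for t, w in zip(grid, posterior))
print(f"posterior mean of theta is about {posterior_mean:.3f}")
```

For this particular setup the exact posterior is a Beta(8, 4) distribution with mean 8/12, so the grid result can be checked against it. Grids stop being practical once $\theta$ has more than a few elements.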


One way of looking at this theorem is that it is a _learning algorithm_:  
* p($\theta$) is what we know about parameters of interest before getting D.  
* p(D|$\theta$) is the likelihood of data we've observed given what we have believed about $\theta$.
* p($\theta$|D) is how we've "adjusted" what we believe about $\theta$ now that we've received D. 

p($\theta$|D) can then serve as our prior, our "best guess" about p($\theta$), the next time we are about to get new D.  
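
The "posterior becomes the next prior" idea can be sketched with a conjugate Beta-Binomial model, where updating is just addition. The model choice, the helper name `update_beta`, and the two batches of counts are all assumptions made for illustration.

```python
def update_beta(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior plus binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

prior = (1.0, 1.0)        # flat Beta(1, 1) prior, an assumption
batch_1 = (3, 7)          # hypothetical (successes, failures)
batch_2 = (6, 4)          # hypothetical (successes, failures)

# Learn from the first batch, then reuse that posterior as the prior.
posterior_1 = update_beta(*prior, *batch_1)
posterior_2 = update_beta(*posterior_1, *batch_2)

# Same answer as seeing all the data at once.
all_at_once = update_beta(*prior, batch_1[0] + batch_2[0], batch_1[1] + batch_2[1])

print("after batch 1:", posterior_1, "mean", round(beta_mean(*posterior_1), 3))
print("after batch 2:", posterior_2, "mean", round(beta_mean(*posterior_2), 3))
print("all at once:  ", all_at_once)
```

Sequential and all-at-once updating agree here because the two batches are treated as independent given $\theta$.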

### Approximating p($\theta$|D) 

Approximating p($\theta$|D) is the main objective of Bayesian modeling.  It's what we use to make inferences about $\theta$.

In any given Bayesian model, $\theta$ may have many elements, and p($\theta$|D) may be high-dimensional. Many applications involve so many parameters that the posterior _cannot_ be obtained analytically, e.g. by solving equations.  

For estimation purposes, _stochastic_ simulation methods are typically used. These methods are generally called _Markov Chain Monte Carlo_ (MCMC) simulation methods. 

Here are the basic steps used when estimating a Bayesian regression model.

* Specify a _full probability model_ that defines the relationships between variables, and the distributions of all parameters and functions of them. 

* Select a _sampling method_ that is used iteratively to obtain values from the marginal conditional posterior of each parameter that is defined in your model specification.

* Define initial values for each parameter, your $\theta$ elements, and let your algorithm run.

* Each iteration of the algorithm produces a "draw," a value, from the posterior of each parameter. A series of draws is called a "chain" or a "trace."

* Based on a general theory of ergodicity, given that a model's parameters are sufficiently identified in your specification, these chains of parameter value estimates will "settle in" to be random draws from stable, conditional posterior parameter distributions.  
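
The steps above can be sketched with a random-walk Metropolis sampler, one of the simplest MCMC methods, applied to a one-predictor regression. This is an illustrative sketch, not how production tools work: the sampler choice, the Normal(0, 10) priors, the step size, the burn-in length, and the synthetic data are all assumptions. Unlike the per-parameter scheme described in the list, it proposes all parameters jointly at each iteration.

```python
import math
import random

def log_posterior(params, xs, ys):
    """Unnormalized log posterior: log likelihood plus log prior."""
    intercept, slope, log_sigma = params
    sigma = math.exp(log_sigma)
    log_lik = 0.0
    for x, y in zip(xs, ys):
        resid = y - (intercept + slope * x)
        log_lik += -log_sigma - 0.5 * (resid / sigma) ** 2
    # Assumed priors: independent Normal(0, 10) on each element of params.
    log_prior = sum(-0.5 * (p / 10.0) ** 2 for p in params)
    return log_lik + log_prior

def metropolis(xs, ys, n_iter=10000, step=0.05, seed=1):
    """Random-walk Metropolis; returns the chain and the acceptance rate."""
    rng = random.Random(seed)
    current = [0.0, 0.0, 0.0]          # initial values for theta
    current_lp = log_posterior(current, xs, ys)
    chain, accepted = [], 0
    for _ in range(n_iter):
        proposal = [p + rng.gauss(0.0, step) for p in current]
        proposal_lp = log_posterior(proposal, xs, ys)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        chain.append(tuple(current))   # one "draw" per iteration
    return chain, accepted / n_iter

# Synthetic data from a known line, so the result can be sanity-checked.
data_rng = random.Random(0)
true_intercept, true_slope, true_sigma = 1.0, 2.0, 0.5
xs = [data_rng.uniform(-2.0, 2.0) for _ in range(50)]
ys = [true_intercept + true_slope * x + data_rng.gauss(0.0, true_sigma) for x in xs]

chain, accept_rate = metropolis(xs, ys)
kept = chain[2000:]                    # discard early "settling in" draws
post_mean_intercept = sum(d[0] for d in kept) / len(kept)
post_mean_slope = sum(d[1] for d in kept) / len(kept)
post_mean_sigma = sum(math.exp(d[2]) for d in kept) / len(kept)

print(f"acceptance rate: {accept_rate:.2f}")
print(f"posterior means: intercept {post_mean_intercept:.2f}, "
      f"slope {post_mean_slope:.2f}, sigma {post_mean_sigma:.2f}")
```

Sampling `log_sigma` rather than `sigma` keeps the scale parameter positive without extra bookkeeping. In practice you would also inspect the chains for convergence before trusting the summaries.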


Once you've approximated the conditional joint posterior probability distribution of the model parameters, you can use it to make predictions for new data.  You can get prediction error estimates by combining the new data with random draws from the posterior density.
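
Here is a minimal sketch of prediction from posterior draws for a one-predictor regression. The `posterior_draws` list is a stand-in generated from made-up normal distributions so the block runs on its own; in a real analysis it would be the retained draws from your MCMC chains. The new predictor value `x_new` is also hypothetical.

```python
import random
import statistics

rng = random.Random(0)

# Stand-in for MCMC output (hypothetical): (intercept, slope, sigma) per draw.
posterior_draws = [
    (rng.gauss(1.0, 0.07), rng.gauss(2.0, 0.06), abs(rng.gauss(0.5, 0.05)))
    for _ in range(4000)
]

def posterior_predictive(draws, x_new, rng):
    """One simulated outcome per draw: parameter uncertainty plus noise."""
    return [rng.gauss(a + b * x_new, s) for a, b, s in draws]

x_new = 1.5                                    # hypothetical new observation
y_sim = sorted(posterior_predictive(posterior_draws, x_new, rng))

pred_mean = statistics.fmean(y_sim)
lower = y_sim[int(0.025 * len(y_sim))]         # approximate 2.5th percentile
upper = y_sim[int(0.975 * len(y_sim))]         # approximate 97.5th percentile

print(f"predicted y at x = {x_new}: {pred_mean:.2f}")
print(f"95% predictive interval: ({lower:.2f}, {upper:.2f})")
```

The interval reflects both uncertainty about the parameters and the residual noise, which is why it is wider than an interval for the regression line alone.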

### But, Why?

* Bayesian modeling methods can be used to estimate models in situations where it's not possible to use non-Bayesian methods. 

* Bayesian estimates are _shrinkage_ estimates: the prior pulls them toward prior values, which tends to mitigate overfitting.  

* Bayesian models can incorporate _prior knowledge_ into the estimation of parameters given data.

* Hypothesis testing doesn't require relying on asymptotics as sample sizes go to infinity, or on doing thought experiments using imaginary data. 

* Once you've estimated parameter chains, you can use them to compute posterior distributions of _functions_ of the parameters.  

* Parameters for individual observational units can be estimated even when the unit-level data are sparse due to the "partial pooling" of information that hierarchical models afford.

* Bayesian methods provide a "natural" way of dealing with missing values.  You just estimate them as parameters are estimated, all together in the same simulation.
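
As one illustration of the point about _functions_ of parameters, the sketch below computes the posterior of a ratio of two success probabilities directly from draws. The two-group setup, the counts, and the flat Beta(1, 1) priors are hypothetical, and the draws come straight from Beta posteriors rather than from MCMC chains; the arithmetic on the draws would be the same either way.

```python
import random
import statistics

rng = random.Random(42)
n_draws = 5000

# Hypothetical data: group A has 30 successes in 100, group B has 45 in 100.
# With flat Beta(1, 1) priors the posteriors are Beta(1 + s, 1 + f).
theta_a = [rng.betavariate(1 + 30, 1 + 70) for _ in range(n_draws)]
theta_b = [rng.betavariate(1 + 45, 1 + 55) for _ in range(n_draws)]

# Apply the function draw by draw to get its posterior distribution.
ratio = sorted(b / a for a, b in zip(theta_a, theta_b))
ratio_median = statistics.median(ratio)
ratio_lower = ratio[int(0.025 * n_draws)]
ratio_upper = ratio[int(0.975 * n_draws)]

# Posterior probability of a statement about the parameters.
prob_b_greater = sum(b > a for a, b in zip(theta_a, theta_b)) / n_draws

print(f"ratio theta_b / theta_a: median {ratio_median:.2f}, "
      f"95% interval ({ratio_lower:.2f}, {ratio_upper:.2f})")
print(f"p(theta_b > theta_a | D) is about {prob_b_greater:.3f}")
```

No delta-method approximation is needed: any summary of the function is just a summary of the transformed draws.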