# MCMC 

## 27 October, by Adele and Jorge

----

## Bayesian Inference 1 (Bayes theorem, priors)
## Sections 5.1 - 5.2
(Note: this notebook is based **heavily** on *Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data* by Ivezić, Connolly, VanderPlas, & Gray; Princeton University Press, 2014. *Please excuse my plagiarism!*)

----


## The History 

-----

1. Reverend Thomas Bayes (1702 - 1761), a British amateur mathematician, wrote about how to combine an initial belief with new data to arrive at an improved belief.  This manuscript was published posthumously in 1763.
2. In 1774 Pierre-Simon Laplace rediscovered and clarified Bayes' principles.  He applied them to astronomy, physics, population statistics, and jurisprudence (the study of law).  By the way, he estimated the mass of Saturn and its uncertainty, and his estimate remains consistent with the best current measurements.
3. In the early 20th century, Harold Jeffreys brought back Bayes' theorem and Laplace's work.
4. Around 1960, proponents included de Finetti, Savage, Wald, and Jaynes.  Computing technology was taking off at this time.

For more information, see:
* McGrayne, S.B. (2011) *The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy*. Yale University Press
* Jaynes, E.T. (2003) *Probability Theory: The Logic of Science*. Cambridge University Press
* Gregory, P.C. (2005) *Bayesian Logical Data Analysis for the Physical Sciences: A Comparative Approach with 'Mathematica' Support*. Cambridge University Press

## The basic introduction

----

* Probability statements are not limited to data, but can be made for model parameters and models themselves.  
* Inferences are made by producing probability density functions (PDFs)
* Model parameters are treated as random variables
* Remember, the Bayesian method yields optimum results **assuming that all of the supplied info is correct**.

#### The data likelihood function
* Shared by both Classical and Bayesian techniques
* In classical stats, the d.l.f. is used to find the model parameters that yield the highest data likelihood
* Note that the d.l.f. cannot be used as a PDF for the model parameters in classical statistics; the book says that this "is not even a valid concept in classical statistics".
* Bayes extends the concept of d.l.f. by adding extra *PRIOR* information to the analysis

### An Example: Bus stop

-----

>You arrive at a bus stop and observe that the bus leaves *t* minutes later.
Assume you know nothing about the bus schedule, but it is regular.
What is the mean time between two successive buses, $\tau$?

#### The intuitive answer

Wait times are distributed uniformly over the range $0 \leq t \leq \tau$, so on average you wait $t= \tau/2$ minutes.  Rearranging gives $\tau = 2t$.
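
The claim that the average wait is $\tau/2$ can be checked with a quick simulation. This is only a sketch: the bus interval `tau_true`, the seed, and the number of simulated arrivals are made-up values chosen for illustration.

```python
import random

rng = random.Random(0)      # fixed seed so the run is repeatable
tau_true = 12.0             # hypothetical interval between buses, in minutes

# Arriving at a random moment means the wait is uniform on [0, tau_true].
waits = [rng.uniform(0.0, tau_true) for _ in range(100_000)]

mean_wait = sum(waits) / len(waits)
print(mean_wait, tau_true / 2)  # the two numbers should be close
```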

#### Maximum Likelihood Approach (MLE)

The probability that you wait $t$ minutes (the likelihood of the data) is given by a uniform distribution:

$$
p(t|\tau) = 1/\tau\ \mathrm{ if }\ 0\leq t \leq \tau,
$$

and 0 otherwise.  In this case, there is only a single observed data point, so the data likelihood is equal to the probability.  Because $1/\tau$ decreases as $\tau$ grows, the MLE corresponds to the smallest possible $\tau$ such that $t\leq\tau$, which is satisfied by $\tau=t$, NOT $\tau=2t$ as expected(!).  Adding a *prior* helps "fix" this puzzling result.
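
A minimal sketch of this likelihood, evaluated on a grid of candidate intervals, shows where the maximum sits. The function name `likelihood`, the observed wait `t_obs`, and the grid spacing are illustrative assumptions, not something taken from the book.

```python
def likelihood(tau, t):
    """Uniform likelihood p(t | tau): 1/tau if 0 <= t <= tau, else 0."""
    return 1.0 / tau if 0 <= t <= tau else 0.0

t_obs = 5.0  # hypothetical observed wait, in minutes

# Evaluate the likelihood on a grid of candidate intervals tau.
taus = [0.5 * k for k in range(1, 61)]  # 0.5, 1.0, ..., 30.0
values = [likelihood(tau, t_obs) for tau in taus]

# The grid point with the highest likelihood is the (grid) MLE.
tau_mle = taus[values.index(max(values))]
print(tau_mle)
```

Any $\tau$ smaller than the observed wait has zero likelihood, and any larger $\tau$ has a lower likelihood than $\tau = t$, so the maximum falls on the observed wait itself.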


## Bayesian basics

----

#### Bayes' Theorem

$$
p(M|D) = \frac{p(D|M)p(M)}{p(D)},
$$

where $D$ is the data and $M$ is the model. This is Bayes' rule applied to the likelihood function.  In words, Bayes' Theorem basically says:
> An "improved belief" is proportional to the product of an "initial belief" and the probability that the "initial belief" generated the observed data.

Explicitly acknowledging prior information $I$, then Bayes' Theorem looks like this:

$$
p(M,\theta|D,I) = \frac{p(D|M,\theta,I)p(M,\theta|I)}{p(D|I)},
$$

where the model $M$ includes $k$ model parameters $\theta_p,p=1,...,k$ (note, $\theta$ is a vector, in the book it's written in slight bold text).
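
As a small numerical illustration of the simpler form $p(M|D)$, here is a sketch with two competing models for a coin. The model names, the heads probabilities, the equal priors, and the data are all invented for the example.

```python
# Two hypothetical models for a coin: fair, or biased towards heads.
prior = {"fair": 0.5, "biased": 0.5}      # p(M): assumed equal initial belief
p_heads = {"fair": 0.5, "biased": 0.8}    # assumed heads probability per model

n_heads, n_tails = 8, 2                   # hypothetical data D: 8 heads in 10 flips

# Likelihood p(D|M) of the observed sequence under each model.
likelihood = {
    m: p_heads[m] ** n_heads * (1 - p_heads[m]) ** n_tails for m in prior
}

# Probability of the data p(D): sum over models of p(D|M) p(M).
evidence = sum(likelihood[m] * prior[m] for m in prior)

# Bayes' theorem: posterior = likelihood * prior / evidence.
posterior = {m: likelihood[m] * prior[m] / evidence for m in prior}
print(posterior)
```

The denominator only rescales the numerators so that the posterior probabilities sum to 1; the ranking of the models is set by likelihood times prior.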

#### Some vocabulary:
* Prior information $I$
* Model $M$
* Parameters $\theta$
* **Posterior** pdf $p(M,\theta|D,I)$
* **likelihood** of data is given by $p(D|M,\theta,I)$
* A priori joint probability $p(M,\theta|I)$ (in absence of any data), often called **the prior**
* Probability of data $p(D|I)$

The prior is expanded as:

$$
p(M,\theta|I) = p(\theta|M,I)p(M|I),
$$

noting that for parameter estimation you need only $p(\theta|M,I)$, but for model selection you need the full prior. By the way, despite the name "prior", the data can chronologically precede the information in the prior.

Priors based on other measurements (or other sources of meaningful info) are called **informative priors** (e.g. you already measured the mass of a particle $m_A$ with uncertainty $\sigma_A$ using another method, and now you want to test a new method).


#### A discussion:

Take note that $p(M,\theta|D,I)$ is **NOT** a probability in the same sense as $p(D|M,\theta,I)$.  Think of the latter as a long-term frequency of events (e.g. the long-term probability of observing heads when flipping a coin, the "frequentist" interpretation).  

On the other hand, think of measuring something (mass of planet, apple, elementary particle). The mass is **NOT** a random number; rather "it is what it is and cannot have a distribution." Therefore the Bayesian formalism $p(M,\theta|D,I)$ corresponds to the state of knowledge/belief about a model and its parameters, given data $D$ and prior information $I$.


#### Some more vocab that might be handy:
* **Credible region** (the analog of the frequentist confidence region) can be obtained analytically, or with numerical techniques that simulate samples from the posterior (like the frequentist bootstrap approach)
* **Posterior mean** $\bar\theta = \int \theta p(\theta|D) d\theta$
* **Maximum a posteriori** (MAP) estimation searches for the model parameters $\theta$ that maximize the posterior $p(M,\theta|D,I)$ (the analog of MLE in classical stats)
* **Marginalization** means integrating $p(M,\theta|D,I)$ over all other model parameters to get the posterior for one of the parameters in $\theta$ (in the multidimensional case)
* **Hypothesis tests** (for Bayes) incorporate the prior and might give different results from the tests in classical stats.
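
Several of these quantities can be computed directly once the posterior is tabulated on a grid. The sketch below assumes a made-up coin-flip data set, a flat prior on the heads probability `theta`, and a simple Riemann sum for the integrals; none of these choices come from the book.

```python
# Hypothetical data: 7 heads in 10 coin flips; theta is the heads probability.
n_heads, n_flips = 7, 10

# Grid over theta and the unnormalized posterior under a flat prior.
n_grid = 2001
thetas = [k / (n_grid - 1) for k in range(n_grid)]
d_theta = thetas[1] - thetas[0]
unnorm = [th ** n_heads * (1 - th) ** (n_flips - n_heads) for th in thetas]

# Normalize so the posterior integrates to 1 (simple Riemann sum).
norm = sum(unnorm) * d_theta
post = [u / norm for u in unnorm]

# Posterior mean: integral of theta * p(theta | D).
post_mean = sum(th * p for th, p in zip(thetas, post)) * d_theta

# MAP estimate: grid point where the posterior peaks.
theta_map = thetas[post.index(max(post))]

def quantile(q):
    """Smallest grid value of theta whose cumulative posterior reaches q."""
    cumulative = 0.0
    for th, p in zip(thetas, post):
        cumulative += p * d_theta
        if cumulative >= q:
            return th
    return thetas[-1]

# 95% equal-tailed credible interval from the cumulative distribution.
lower, upper = quantile(0.025), quantile(0.975)
print(post_mean, theta_map, (lower, upper))
```

With a flat prior the MAP estimate coincides with the MLE (the observed fraction of heads), while the posterior mean is pulled slightly towards 1/2.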


### Return to the Bus stop example

-----

This equation $$p(t|\tau) = 1/\tau\ \mathrm{ if }\ 0\leq t \leq \tau,$$ corresponds to $p(D|M,\theta,I)$ in Bayes' Theorem. The vector of parameters $\theta$ has one component: $\tau$. 

The **prior** $p(\tau|I)$ is proportional to $1/\tau$.

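The **posterior** pdf for $\tau$ is proportional to the product of the likelihood and the prior:  $$p(\tau|t,I) = C/\tau^2\ \mathrm{ if }\ \tau\geq t,$$ and 0 otherwise.

The constant $C$ follows from normalizing $p(\tau|t,I)$:

$$
\int_t^\infty p(\tau|t,I)d\tau = \int_t^\infty C/\tau^2 d\tau = C/t = 1,
$$

so $C = t$ and $p(\tau|t,I) = t/\tau^2$ for $\tau \geq t$.

And now the median $\tau$ given by the posterior is $2t$ as hoped for: the cumulative probability $\int_t^{\tau} t/\tau'^2 d\tau' = 1 - t/\tau$ reaches $1/2$ at $\tau = 2t$.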
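
The median can also be checked numerically by drawing samples from the posterior $p(\tau|t,I) = t/\tau^2$ (for $\tau \geq t$). The sketch below inverts the cumulative distribution $1 - t/\tau$; the helper name `sample_tau`, the observed wait, the seed, and the sample size are illustrative assumptions.

```python
import random

def sample_tau(t, rng):
    """Draw tau from p(tau|t) = t / tau**2 (tau >= t) by inverting the CDF 1 - t/tau."""
    u = rng.random()        # uniform in [0, 1)
    return t / (1.0 - u)    # solves u = 1 - t/tau for tau

t_obs = 5.0                 # hypothetical observed wait, in minutes
rng = random.Random(42)     # fixed seed so the run is repeatable
samples = sorted(sample_tau(t_obs, rng) for _ in range(100_001))

tau_median = samples[len(samples) // 2]
print(tau_median, 2 * t_obs)  # the sample median should land near 2 * t_obs
```

The posterior has a long tail towards large $\tau$, which is why the median is a more useful summary here than the mean (the mean of $t/\tau^2$ diverges).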


## ALL about Bayesian Priors (see section 5.2 in the book)

----

Recall from before: priors based on other measurements (or other sources of meaningful info) are called **informative priors** (e.g. you already measured the mass of a particle $m_A$ with uncertainty $\sigma_A$ using another method, and now you want to test a new method).

#### What if no other information other than the data we are analyzing is available?

Then, assign priors by formal rules.

These are called **uninformative priors** (but realize that these can incorporate weak but objective information such as "the model parameter describing variance cannot be negative").  These affect the estimates, even if they're weak, so results are usually different from frequentist or MLE.

An example is a **flat prior**, $p(\theta|I) \propto C$, where $C>0$.  Notice, that $\int p(\theta|I) d\theta = \infty$, so this is not a PDF.  It's called an **improper prior**.  
* Improper priors are not a problem as long as the posterior is a well-defined PDF.  
* Otherwise, adopt lower and upper limits on $\theta$.  For example, assume the mass of an elementary particle must be positive and smaller than Earth's mass.

There are 3 ways to assign uninformative priors.
1. **principle of indifference** -- states that a set of basic, mutually exclusive possibilities should be assigned equal probabilities (e.g. for a die, each outcome has a prior probability of 1/6) 
2. **principle of consistency** -- the prior for a location parameter should not change with translations of the coordinate system; the prior for a scale parameter should not depend on the choice of units. For example, when scaling by a positive factor $a$, we require $p(\sigma|I)d\sigma = p(a\sigma |I)d(a\sigma)$, with the solution $p(\sigma|I)\propto \sigma^{-1}$, called a **scale-invariant prior**.
3. **principle of maximum entropy** -- can be used when there is weak additional information about some parameter, such as a low-order statistic. (It seems important enough to give its own section.)
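
The scale-invariance property in item 2 can be checked numerically: under $p(\sigma|I)\propto \sigma^{-1}$, the prior mass in an interval depends only on the ratio of its endpoints, so a change of units leaves it unchanged. The helper `prior_mass`, the interval, and the conversion factor below are hypothetical choices for illustration.

```python
import math

def prior_mass(lo, hi):
    """Unnormalized mass of p(sigma) ∝ 1/sigma between lo and hi, i.e. ln(hi/lo)."""
    return math.log(hi / lo)

a = 1000.0  # hypothetical unit conversion factor (e.g. km to m)

mass_original = prior_mass(2.0, 8.0)          # interval in the original units
mass_rescaled = prior_mass(a * 2.0, a * 8.0)  # same interval after rescaling
print(mass_original, mass_rescaled)
```

A flat prior would not pass this check: the mass of the rescaled interval would grow by the factor $a$.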

#### Principle of Maximum Entropy 

"Entropy" measures the information content of a pdf.  The main idea is to assign **uninformative** priors by maximizing entropy over a suitable set of pdfs, so that the chosen distribution is the "least informative" one.  You can add additional information about the prior distribution, such as its mean value and variance.

Entropy uses the symbol $S$, defined as:

$$
S = - \sum_{i=1}^N p_i \ln(p_i),
$$

where a PDF has $N$ discrete values $p_i$, with $\sum_{i=1}^N p_i = 1$.  

Some interesting details:
* Can be justified using logical consistency and information theory (not sure what those are...)
* Also called "Shannon's entropy" because Shannon was first to derive it, 1948.
* Resembles thermodynamic entropy.
* The unit for entropy is the "nat" (from "natural unit", based on the natural log); if you use base 2, then the unit is the "bit", and 1 nat = 1.44 bits.
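
A minimal sketch of the discrete entropy formula, including the nat-to-bit conversion, is shown below. The function name `entropy_nats` and the loaded-die probabilities are invented for the example.

```python
import math

def entropy_nats(probs):
    """Shannon entropy S = -sum p_i ln p_i; terms with p_i = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical loaded die

s_fair = entropy_nats(fair_die)
s_loaded = entropy_nats(loaded_die)

print(s_fair, s_fair / math.log(2))  # entropy in nats, then in bits
print(s_loaded)                      # lower than the fair die's entropy
```

The uniform distribution has the largest entropy among all distributions over six outcomes, which is why maximizing entropy with no extra constraints reproduces the principle of indifference.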

The continuous form of entropy is:
$$
S = -\int_{-\infty}^{\infty} p(x)\ln \left( \frac{p(x)}{m(x)} \right) dx
$$

Here the "measure" $m(x)$ ensures that entropy is invariant under a change of variables.

#### An example: six-faced die (FYI: die is the singular form of dice) using Bayes!

* If no specific information is available, then the **principle of indifference** says each outcome has a prior probability of 1/6.
* If additional information is available, then adjust the prior probabilities to be consistent with it. For example, maybe you know the mean value of a large number of rolls, $\mu$.

There are two constraints for the 6 unknown probabilities $p_i$: the expected mean value is $\sum_{i=1}^6 ip_i=\mu$, and the sum is $\sum_{i=1}^6 p_i=1$.

Then to assign the individual values $p_i$, use **principle of maximum entropy** and the method of Lagrangian multipliers.  We want to maximize this with respect to $p_i$:

$$
Q = S + \lambda_0 (1-\sum_{i=1}^6 p_i) + \lambda_1 (\mu-\sum_{i=1}^6 ip_i)
$$

Using what we learned about entropy $S$: 
$$
S = - \sum_{i=1}^6 p_i \ln \left(\frac{p_i}{m_i}\right).
$$

The second and third terms come from additional constraints, where $\lambda$ is a Lagrangian multiplier.  $m_i$ are values assigned to $p_i$ in the case that no additional information is known (i.e. no constraint on mean value; in this case $m_i=1/6$).

Then differentiating $Q$ with respect to $p_i$, you solve for conditions:

$$
\textrm{Conditions: }  \ - \left[ \ln  \left(\frac{p_i}{m_i}\right) + 1 \right] - \lambda_0 - i\lambda_1 = 0
$$

And solutions:

$$
\textrm{Solutions: }  \   p_i = m_i \exp(-1 - \lambda_0)\exp(-i\lambda_1)
$$

You then solve for the final values of $\lambda_0$ and $\lambda_1$ numerically (using the constraint on $\mu$ and the normalization above).
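
The numerical step can be sketched with a simple bisection on $\lambda_1$. The sign of $\lambda_1$ below follows the stationarity condition $-[\ln(p_i/m_i) + 1] - \lambda_0 - i\lambda_1 = 0$, which gives $p_i \propto m_i \exp(-i\lambda_1)$; the factor $\exp(-1-\lambda_0)$ is fixed by the normalization. The function name `maxent_die`, the bracketing interval, and the example means are assumptions made for this illustration.

```python
import math

def maxent_die(mu, tol=1e-12):
    """Maximum-entropy probabilities for faces 1..6 with mean mu (1 < mu < 6)."""
    faces = range(1, 7)

    def probs(lam1):
        # p_i ∝ m_i * exp(-i * lam1); m_i = 1/6 cancels in the normalization,
        # which also absorbs the exp(-1 - lam0) factor.
        weights = [math.exp(-i * lam1) for i in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam1):
        return sum(i * p for i, p in zip(faces, probs(lam1)))

    # The mean decreases monotonically with lam1, so bisection finds the root.
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) > mu:
            lo = mid   # mean too high: lam1 must be larger
        else:
            hi = mid
    return probs(0.5 * (lo + hi))

p_fair = maxent_die(3.5)   # no tilt needed: recovers the uniform 1/6 prior
p_high = maxent_die(4.5)   # hypothetical constraint: mean roll of 4.5
print(p_fair)
print(p_high)
```

For $\mu = 3.5$ the solution is $\lambda_1 = 0$ and all faces are equally likely; for a larger mean the probabilities grow geometrically towards the high faces.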

The result (of all these complicated equations) is that even with incomplete knowledge of $p_i$ and with only 2 constraints, you can assign all six $p_i$.

A few special cases:
* When the number of possible discrete events is infinite (as opposed to 6 in the example), the maximum entropy solution for $p_i$ is the Poisson distribution parametrized by the expectation value $\mu$.
* When only the mean and variance are known in advance, and the distribution is defined over the whole real line, the solution is a Gaussian with those values of mean and variance.
* In the continuous case, when only the mean $\mu$ of a non-negative parameter $\theta$ is known, the result is the exponential prior $$p(\theta|\mu) = \frac{1}{\mu} \exp \left(\frac{-\theta}{\mu}\right)$$

#### Just a few more notes to finish the section

* **Conjugate prior** -- when posterior probability has same functional form as prior probability; these are convenient for generalizing computations.  
    * For example, when the likelihood function is Gaussian, then the conjugate prior is also Gaussian. 
    * In discrete case, the most frequently encountered conjugate priors are the beta distribution for binomial likelihood, and gamma distribution for Poissonian likelihood.
* **Empirical Bayes** (a.k.a. maximum marginal likelihood) -- refers to an approximation of the Bayes inference procedure where you use **data** in order to estimate parameters of priors (or *hyperparameters*).  
    * This differs from standard Bayes, where parameters of priors are chosen before data are observed.  In Empirical Bayes, the hyperparameters are set to their most likely values. 
* **Hierarchical Bayes** (a.k.a. multilevel) -- the prior distribution depends on unknown variables (the *hyperparameters*) that describe the group/population level probabilistic model.  The priors here are called *hyperpriors* and resemble those of simple single-level Bayesian models.  
    * This is useful if you want to use  **overall population properties** for estimating parameters of a **single population member**.
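
The conjugate-prior idea can be illustrated with the beta prior and binomial likelihood mentioned above: the posterior is again a beta distribution, so updating amounts to adding counts to the prior's parameters. The function names and the numbers below are hypothetical and only meant as a sketch.

```python
def beta_binomial_update(alpha, beta, n_success, n_trials):
    """Posterior Beta parameters after a binomial observation with a Beta(alpha, beta) prior."""
    return alpha + n_success, beta + (n_trials - n_success)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical numbers: a flat Beta(1, 1) prior and 7 successes in 10 trials.
alpha_post, beta_post = beta_binomial_update(1.0, 1.0, 7, 10)
print(alpha_post, beta_post, beta_mean(alpha_post, beta_post))
```

Because the posterior stays in the same family, it can serve directly as the prior for the next batch of data without any numerical integration.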