## Chapter 1: Introduction

Statistical induction is the process of learning about the general characteristics of a population from a subset of members of that population. Numerical values of population characteristics are typically expressed in terms of a parameter $\theta$, and numerical descriptions of the subset make up a dataset $y$. Before a dataset is obtained, the numerical values of both the population characteristics and the dataset are uncertain. After a dataset $y$ is obtained, the information it contains can be used to *decrease* our uncertainty about the population characteristics. *Quantifying this change in uncertainty is the purpose of Bayesian inference.*

The sample space $\mathcal{Y}$ is the set of all possible datasets, from which a single dataset $y$ will result. The parameter space $\Theta$ is the set of possible parameter values, from which we hope to identify the value that best represents the true population characteristics.

*Bayes' rule*: $$p(\theta|y)=\frac{p(y|\theta)p(\theta)}{\int_{\Theta}p(y|\tilde{\theta})p(\tilde{\theta})d\tilde{\theta}}$$
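
As a minimal sketch of how the denominator of Bayes' rule works in practice, the following approximates the integral by a sum over a grid of $\theta$ values. The binomial sampling model, the uniform prior, and the values `y=2`, `n=10` are assumptions chosen only for illustration; `grid_posterior` is a hypothetical helper name.

```python
import math

def grid_posterior(y, n, grid_size=1001):
    """Approximate p(theta | y) on a grid: binomial likelihood, uniform prior."""
    width = 1.0 / (grid_size - 1)
    thetas = [i * width for i in range(grid_size)]
    prior = [1.0] * grid_size  # uniform prior density on [0, 1] (assumed)
    # Numerator of Bayes' rule: p(y | theta) * p(theta) at each grid point.
    unnorm = [math.comb(n, y) * t**y * (1 - t)**(n - y) * p
              for t, p in zip(thetas, prior)]
    # Denominator: Riemann-sum approximation of the integral over Theta.
    evidence = sum(unnorm) * width
    posterior = [u / evidence for u in unnorm]
    return thetas, posterior, width

thetas, posterior, width = grid_posterior(y=2, n=10)
post_mean = sum(t * p for t, p in zip(thetas, posterior)) * width
print(round(post_mean, 3))
```

Under these assumptions the exact posterior is $\text{beta}(3,9)$, so the grid-based posterior mean should be close to $3/12$.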

*Sensitivity analysis*: an exploration of how posterior information is affected by differences in prior opinion.

If $\theta$ has a $\text{beta}(a,b)$ distribution, then the expectation
of $\theta$ is $a/(a+b)$ and, when $a>1$ and $b>1$, the most probable value of $\theta$
is $(a-1)/(a-1+b-1)$.

## Chapter 2: Belief, probability and exchangeability

In Bayesian inference, $\{H_{1},\ldots,H_{K}\}$ often refer to disjoint hypotheses or states of nature and $E$ refers to the outcome of a survey, study or experiment. To compare hypotheses post-experimentally, we often calculate the following ratio:
$$\frac{Pr(H_{i}|E)}{Pr(H_{j}|E)}=\frac{Pr(E|H_{i})}{Pr(E|H_{j})}\times\frac{Pr(H_{i})}{Pr(H_{j})}=\text{"Bayes factor"}\times\text{"prior beliefs"}$$
This calculation reminds us that Bayes' rule does not determine what our
beliefs should be after seeing the data; it only tells us how they should change
after seeing the data.
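
A small numerical sketch of this ratio, with made-up likelihoods and prior probabilities (all four numbers and the function name `posterior_odds` are illustrative assumptions):

```python
def posterior_odds(lik_i, lik_j, prior_i, prior_j):
    """Return (Bayes factor, prior odds, posterior odds) for H_i versus H_j."""
    bayes_factor = lik_i / lik_j      # Pr(E | H_i) / Pr(E | H_j)
    prior_odds = prior_i / prior_j    # Pr(H_i) / Pr(H_j)
    return bayes_factor, prior_odds, bayes_factor * prior_odds

# Hypothetical values: the data favor H_i three to one,
# while the prior favors H_j three to one.
bf, prior_odds, post_odds = posterior_odds(0.30, 0.10, 0.25, 0.75)
print(bf, prior_odds, post_odds)
```

Here the data shift the odds by a factor of about three, but because the prior odds were one to three, the posterior odds end up close to even. The same data would lead to different posterior odds under a different prior.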

Two events $F$ and $G$ are conditionally **independent** given $H$ if $Pr(F\cap G|H)=Pr(F|H)Pr(G|H)$.

In Bayesian inference a **random variable** is defined as an unknown numerical
quantity about which we make probability statements.

For a discrete random variable $Y$, the *probability density function (pdf)* is $p(y)=Pr(Y=y)$. For continuous random variables, we start from the *cumulative distribution function (cdf)*: $F(y)=Pr(Y\leq y)$. Note that $F(\infty)=1$, $F(-\infty)=0$, and $F(b)\leq F(a)$ if $b<a$.
A theorem from mathematics says that for every continuous cdf $F$
there exists a nonnegative function $p(y)$ such that $F(a)=\int_{-\infty}^{a}p(y)dy$.
This function is called the *probability density function* of
$Y$, and its properties are similar to those of a pdf for a discrete
random variable: $0\leq p(y)$ for all $y$; $\int_{y\in\mathbb{R}}p(y)dy=1$.

The **mean** or **expectation** of an unknown quantity
$Y$ is given by $E[Y]=\sum_{y\in\mathcal{Y}}yp(y)$ if $Y$ is discrete;
$E[Y]=\int_{y\in\mathcal{Y}}yp(y)dy$ if $Y$ is continuous.

The most popular measure of spread is the **variance** of a distribution:
$Var[Y]=E[(Y-E[Y])^{2}]=E[Y^{2}]-E[Y]^{2}.$
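
The two expressions for the variance can be checked on a small discrete distribution. The pmf below is made up for illustration, and `mean_and_variances` is a hypothetical helper name.

```python
def mean_and_variances(pmf):
    """pmf maps each value y to p(y); returns E[Y] and Var[Y] computed two ways."""
    mean = sum(y * p for y, p in pmf.items())
    var_definition = sum((y - mean) ** 2 * p for y, p in pmf.items())  # E[(Y - E[Y])^2]
    var_shortcut = sum(y * y * p for y, p in pmf.items()) - mean ** 2  # E[Y^2] - E[Y]^2
    return mean, var_definition, var_shortcut

pmf = {0: 0.2, 1: 0.5, 2: 0.3}  # hypothetical distribution on {0, 1, 2}
mean, var_def, var_short = mean_and_variances(pmf)
print(mean, var_def, var_short)
```

Both variance computations should agree up to floating-point rounding.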

### Joint distribution
Let $\mathcal{Y}_{1},\mathcal{Y}_{2}$ be two countable sample spaces;
$Y_1,Y_2$ be two random variables, taking values in $\mathcal{Y}_{1},\mathcal{Y}_{2}$
respectively.

The **joint pdf or joint density** of $Y_{1}$ and $Y_{2}$ is defined as   
$p_{Y_{1}Y_{2}}(y_{1},y_{2})=Pr(\{Y_{1}=y_{1}\}\cap\{Y_{2}=y_{2}\})$
for $y_{1}\in\mathcal{Y}_{1},y_{2}\in\mathcal{Y}_{2}$.

The **marginal density** of $Y_1$ can be computed from the joint density:
$p_{Y_{1}}(y_{1})=Pr(Y_{1}=y_{1})=\sum_{y_{2}\in\mathcal{Y}_{2}}p_{Y_{1}Y_{2}}(y_{1},y_{2})$ or, when $Y_{2}$ is continuous,   
$p_{Y_{1}}(y_{1})=\int_{-\infty}^{\infty}p_{Y_{1}Y_{2}}(y_{1},y_{2})dy_{2}$.

The **conditional density** of $Y_{2}$ given $\{Y_{1}=y_{1}\}$ can be
computed from the joint density and the marginal density: $p_{Y_{2}|Y_{1}}(y_{2}|y_{1})=\frac{Pr(\{Y_{1}=y_{1}\}\cap\{Y_{2}=y_{2}\})}{Pr(Y_{1}=y_{1})}=\frac{p_{Y_{1}Y_{2}}(y_{1},y_{2})}{p_{Y_{1}}(y_{1})}$.
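
For two discrete variables, the marginal and conditional densities above reduce to sums and ratios over a joint table. A minimal sketch with a made-up $2\times2$ joint distribution (the table and the function names are illustrative assumptions):

```python
# Hypothetical joint pmf: keys are (y1, y2), values are Pr(Y1 = y1, Y2 = y2).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def marginal_y1(joint, y1):
    """p_{Y1}(y1): sum the joint density over all values of y2."""
    return sum(p for (a, _), p in joint.items() if a == y1)

def conditional_y2_given_y1(joint, y2, y1):
    """p_{Y2|Y1}(y2 | y1): joint density divided by the marginal of Y1."""
    return joint[(y1, y2)] / marginal_y1(joint, y1)

p_y1 = marginal_y1(joint, 1)
p_cond = conditional_y2_given_y1(joint, 1, 1)
print(p_y1, p_cond)
```

For any fixed `y1`, the conditional probabilities over `y2` sum to one, which is a quick sanity check on the division.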

Roughly speaking, $Y_1, \dots, Y_n$ are **exchangeable** if the subscript labels convey
no information about the outcomes.

$Y_{1},\ldots,Y_{n}|\theta$ are i.i.d. AND $\theta\sim p(\theta)$
$\Leftrightarrow$ $Y_{1},\ldots,Y_{n}$ are exchangeable for all
$n$ .

## Chapter 3: One-parameter models

### Conjugate
A class $\mathcal{P}$ of prior distributions for $\theta$ is called conjugate
for a sampling model $p(y|\theta)$ if $p(\theta)\in\mathcal{P}\Rightarrow p(\theta|y)\in\mathcal{P}$.


#### The beta distribution
An uncertain quantity $\theta$, known to be between 0 and 1, has
a `beta(a, b)` distribution if 
$$p(\theta)=\text{dbeta}(\theta,a,b)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}$$
for $0\leq\theta\leq1$. (For a nonnegative integer $x$, $\Gamma(x+1)=x!$; in particular $\Gamma(1)=1$.)  
For such a random variable, $mode[\theta]=(a-1)/[(a-1)+(b-1)]$ if
$a>1$ and $b>1$; $E[\theta]=a/(a+b)$; $Var[\theta]=ab/[(a+b+1)(a+b)^{2}]=E[\theta]\times E[1-\theta]/(a+b+1)$.

#### Binomial distribution has beta conjugate distribution
Suppose $\theta\sim\text{beta}(a,b)$ and $Y|\theta\sim\text{binomial}(n,\theta)$. After observing $Y=y$, the posterior distribution is
$$p(\theta|y)=\frac{p(\theta)p(y|\theta)}{p(y)}=\text{dbeta}(\theta, a+y, b+n-y)$$
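
The conjugate update amounts to adding the number of successes to $a$ and the number of failures to $b$. A minimal sketch, assuming a uniform $\text{beta}(1,1)$ prior and made-up data ($y=2$ successes in $n=10$ trials); both function names are hypothetical:

```python
def beta_binomial_update(a, b, y, n):
    """Posterior beta parameters after y successes in n binomial trials."""
    return a + y, b + n - y

def beta_summary(a, b):
    """Mean, mode (defined only when a > 1 and b > 1) and variance of beta(a, b)."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    var = a * b / ((a + b + 1) * (a + b) ** 2)
    return mean, mode, var

a_post, b_post = beta_binomial_update(a=1, b=1, y=2, n=10)
post_mean, post_mode, post_var = beta_summary(a_post, b_post)
print(a_post, b_post, post_mean, post_mode, post_var)
```

With a uniform prior the posterior mode equals the sample proportion $y/n$, while the posterior mean is pulled slightly toward the prior mean of $1/2$.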

#### Gamma distribution
An uncertain **positive** quantity $\theta$ has a gamma distribution if $p(\theta)=\text{dgamma}(\theta,a,b)=\frac{b^{a}}{\Gamma(a)}\theta^{a-1}e^{-b\theta}$
for $\theta,a,b>0$.

$E(\theta)=a/b$; $Var(\theta)=a/b^2$; $mode(\theta)=(a-1)/b$ if $a>1$ and 0 if $a \leq 1$.

#### Poisson distribution has gamma conjugate distribution
A count $Y$ has a Poisson distribution with mean $\theta$ if $Pr(Y=y|\theta)=\text{dpois}(y,\theta)=\theta^{y}\frac{e^{-\theta}}{y!}$
for $y\in\{0,1,2,\ldots\}$.

Suppose $\theta\sim\text{gamma}(a,b)$ and $Y_{1},\ldots,Y_{n}|\theta\sim\text{i.i.d. Poisson}(\theta)$. After observing $Y_{1}=y_{1},\ldots,Y_{n}=y_{n}$, the posterior distribution is
$$p(\theta|y_{1},\ldots,y_{n})=\frac{p(\theta)p(y_{1},\ldots,y_{n}|\theta)}{p(y_{1},\ldots,y_{n})}=\text{dgamma}(\theta, a+\sum_{i=1}^{n}y_{i}, b+n)$$

$E[\theta|y_{1},\ldots,y_{n}]=\frac{(a+\sum y_{i})}{(b+n)}=\frac{b}{b+n}\frac{a}{b}+\frac{n}{b+n}\frac{\sum y_{i}}{n}$.
Here $b$ is interpreted as the number of prior observations and $a$ is interpreted
as the sum of counts from $b$ prior observations.
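
The weighted-average form of the posterior mean can be verified numerically. This is a sketch with a made-up prior ($a=2$, $b=1$) and made-up counts; `gamma_poisson_update` is a hypothetical helper name.

```python
def gamma_poisson_update(a, b, ys):
    """Posterior gamma parameters (shape, rate) after observing Poisson counts ys."""
    return a + sum(ys), b + len(ys)

a, b = 2.0, 1.0          # hypothetical prior: gamma(2, 1)
ys = [1, 0, 3, 2, 2]     # hypothetical observed counts
n = len(ys)

a_n, b_n = gamma_poisson_update(a, b, ys)
post_mean = a_n / b_n
# Same quantity written as a weighted average of prior mean and sample mean.
weighted = (b / (b + n)) * (a / b) + (n / (b + n)) * (sum(ys) / n)
print(a_n, b_n, post_mean, weighted)
```

As $n$ grows relative to $b$, the weight on the sample mean approaches one and the prior mean matters less.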

### Exponential families and conjugate priors
A **one-parameter exponential family** model is any model whose densities can be expressed as $p(y|\phi)=h(y)c(\phi)e^{\phi t(y)}$, where $\phi$ is the unknown parameter and $t(y)$ is the sufficient statistic.
Conjugate prior distributions for general exponential family models: $p(\phi|n_{0},t_{0})=k(n_{0},t_{0})c(\phi)^{n_{0}}e^{n_{0}t_{0}\phi}$. $n_0$ can be interpreted as a "prior sample size" and is a measure of how informative the prior is. $t_0$ can be interpreted as a "prior guess" and the prior expected value of $t(Y)$.

Data: $Y_{1},\ldots,Y_{n}\sim\text{i.i.d. }p(y|\phi)$

Posterior distribution: $p(\phi|y_{1},\ldots,y_{n})\propto p(\phi|n_{0}+n,n_{0}t_{0}+n\bar{t}(\mathbf{y}))$, where $\bar{t}(\mathbf{y})=\sum_{i=1}^{n}t(y_{i})/n$.


## Monte Carlo approximation

A conjugate prior makes the posterior distribution easy to compute. But what if we are interested not only in the mean, variance, and mode of the posterior, but also want to summarize other aspects of the posterior distribution? For such summaries the closed-form conjugate results are often of little help, and we need Monte Carlo approximation.

Monte Carlo approximation:
- Based on random sampling
- Useful for obtaining numerical solutions to problems which are too complicated to solve analytically
- If $\theta^1, \ldots, \theta^S \sim \text{i.i.d.}\ p(\theta|y_1,\ldots,y_n)$, then
  - The empirical distribution of $\{\theta^1, \ldots, \theta^S\}$ is a Monte Carlo approximation to $p(\theta|y_1,\ldots,y_n)$
  - $\frac{1}{S}\sum_{s=1}^{S}g(\theta^{s})\rightarrow E[g(\theta)|y_1,\ldots,y_n]$ as $S\rightarrow\infty$
    - $\text{mean}[\theta^1,\ldots,\theta^S] \rightarrow E[\theta|y_1,\ldots,y_n]$
    - $\text{Var}[\theta^1,\ldots,\theta^S] \rightarrow \text{Var}[\theta|y_1,\ldots,y_n]$
    - $no.(\theta^s \leq c)/S \rightarrow Pr(\theta\leq c|y_1,\ldots,y_n)$
    - the $\alpha$-percentile of $\{\theta^1,\ldots,\theta^S\}$ $\rightarrow \theta_\alpha$
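
The limits listed above can be illustrated with draws from a gamma posterior, whose exact mean and variance are known. This is a sketch assuming a $\text{gamma}(10,6)$ posterior and a cutoff $c=2$, both chosen arbitrarily. Note that Python's `random.gammavariate` takes a scale parameter, so the rate is inverted.

```python
import random

rng = random.Random(1)           # fixed seed so the run is reproducible
a_n, b_n = 10.0, 6.0             # hypothetical posterior: gamma(shape=10, rate=6)
S = 20000

# gammavariate(alpha, beta) uses beta as a scale, so pass 1 / rate.
draws = [rng.gammavariate(a_n, 1.0 / b_n) for _ in range(S)]

mc_mean = sum(draws) / S                                   # approximates E[theta | y]
mc_var = sum((d - mc_mean) ** 2 for d in draws) / (S - 1)  # approximates Var[theta | y]
mc_prob = sum(d <= 2.0 for d in draws) / S                 # approximates Pr(theta <= 2 | y)
mc_median = sorted(draws)[S // 2]                          # approximates the 50th percentile

print(mc_mean, mc_var, mc_prob, mc_median)
```

The exact values for comparison are $E[\theta|y]=a/b=10/6$ and $Var[\theta|y]=a/b^2=10/36$; the probability and the median have no simple closed form, which is where the Monte Carlo estimates earn their keep.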
    
#### Sampling from predictive distributions

The *predictive distribution* of a random variable $\tilde{Y}$ is a probability distribution in which

- known quantities have been conditioned on (e.g. observed values)
- unknown quantities have been integrated out (e.g. the unknown parameter $\theta$)

Let $\tilde{Y}$ be the number of children of **a person** from a population under some constraints. Assume $\tilde{Y}|\theta\sim\text{Poisson}(\theta)$.

- if $\theta$ is known, then $Pr(\tilde{Y} = \tilde{y}|\theta)=\text{dpois}(\tilde{y},\theta)$
- but $\theta$ is never known:
    + if we do not have any sampled data (*prior pred. dist.*), $Pr(\tilde{Y}=\tilde{y})=\int p(\tilde{y}|\theta)p(\theta)d\theta$
    + if we do have sampled data (*post. pred. dist.*), $Pr(\tilde{Y}=\tilde{y}|Y_{1}=y_{1},\ldots,Y_{n}=y_{n})=\int p(\tilde{y}|\theta)p(\theta|{\color{red}{y_{1},\ldots,y_{n}}})d\theta$
- if $\theta\sim\text{gamma}(a, b)$:
    + prior predictive distribution: `negative binomial`($a, b$)
    + post. predictive distribution: `negative binomial`($a+\sum y_i, b + n$)
    
Posterior predictive samples from the conjugate Poisson model can be generated as follows:

- sample $\theta^1\sim\text{gamma}(a + \sum y_i, b+n)$; sample $\tilde{y}^1\sim\text{Poisson}(\theta^1)$
- sample $\theta^2\sim\text{gamma}(a + \sum y_i, b+n)$; sample $\tilde{y}^2\sim\text{Poisson}(\theta^2)$
- repeat...
- sample $\theta^S\sim\text{gamma}(a + \sum y_i, b+n)$; sample $\tilde{y}^S\sim\text{Poisson}(\theta^S)$
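
The two-stage scheme above can be sketched with the standard library alone. The posterior parameters are made up, and `rpois` is a hypothetical helper that uses Knuth's multiplication method because Python's `random` module has no Poisson sampler; it is adequate only for small means.

```python
import math
import random

def rpois(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

rng = random.Random(2)
a_n, b_n = 10.0, 6.0   # hypothetical posterior gamma(a + sum(y), b + n)
S = 20000

y_tilde = []
for _ in range(S):
    theta = rng.gammavariate(a_n, 1.0 / b_n)   # step 1: theta from the posterior
    y_tilde.append(rpois(rng, theta))          # step 2: y-tilde given that theta

pred_mean = sum(y_tilde) / S
pred_var = sum((y - pred_mean) ** 2 for y in y_tilde) / (S - 1)
print(pred_mean, pred_var)
```

The predictive variance exceeds the predictive mean, unlike a Poisson with known $\theta$: the extra spread reflects the remaining uncertainty about $\theta$.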


## The normal model
The importance of the normal distribution stems primarily from the central
limit theorem, which says that under very general conditions, the sum (or
mean) of a set of random variables is approximately normally distributed. In
practice, this means that the normal sampling model will be appropriate for
data that result from the additive effects of a large number of factors.

$$p(y|\theta,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{1}{2}(\frac{y-\theta}{\sigma})^{2}},-\infty < y < \infty$$

If $X\sim\text{normal}(\mu,\tau^{2})$, $Y\sim\text{normal}(\theta,\sigma^{2})$
and $X$ and $Y$ are independent, then for constants $a$ and $b$, $aX+bY\sim\text{normal}(a\mu+b\theta,a^{2}\tau^{2}+b^{2}\sigma^{2})$.

The pair $\{\sum{y_i^2}, \sum{y_i}\}$ makes up a two-dimensional sufficient statistic for the normal distribution. Knowing the values of these quantities is equivalent to knowing the
values of $\bar{y}=\sum y_{i}/n$ and $s^{2}=\sum(y_{i}-\bar{y})^{2}/(n-1)$,
and so $\{\bar{y},s^{2}\}$ is also a sufficient statistic.
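
The equivalence rests on the identity $\sum(y_{i}-\bar{y})^{2}=\sum y_{i}^{2}-n\bar{y}^{2}$. A short sketch with made-up data (`mean_var_from_sums` is a hypothetical helper name):

```python
def mean_var_from_sums(n, sum_y, sum_y2):
    """Recover (ybar, s^2) from n, sum(y_i) and sum(y_i^2) alone."""
    ybar = sum_y / n
    s2 = (sum_y2 - n * ybar ** 2) / (n - 1)
    return ybar, s2

ys = [1.0, 2.0, 3.0, 4.0]   # hypothetical data
ybar, s2 = mean_var_from_sums(len(ys), sum(ys), sum(y * y for y in ys))

# Direct computation from the raw data, for comparison.
direct_s2 = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / (len(ys) - 1)
print(ybar, s2, direct_s2)
```

The shortcut formula can lose precision when the mean is large relative to the spread, so it is shown here to illustrate sufficiency rather than as a recommended way to compute $s^{2}$.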

Inference for this two-parameter model can be broken down into two one-parameter
problems. We can begin with the problem of making inference for $\theta$ when $\sigma^2$ is known, and use a conjugate prior distribution for $\theta$ (also a normal distribution).

If $\theta\sim\text{normal}(\mu_{0},\tau_{0}^{2})$ and $y_{1},\ldots,y_{n}$
are observed data, then the posterior distribution of $\theta$ is
$$\{\theta|\sigma^{2},y_{1},\ldots,y_{n}\}\sim\text{normal}(\mu_{n},\tau_{n}^{2}),\quad\mu_{n}=\frac{\frac{1}{\tau_{0}^{2}}\mu_{0}+\frac{n}{\sigma^{2}}\bar{y}}{\frac{1}{\tau_{0}^{2}}+\frac{n}{\sigma^{2}}},\quad\tau_{n}^{2}=\frac{1}{\frac{1}{\tau_{0}^{2}}+\frac{n}{\sigma^{2}}}$$

If the prior variance is written as $\tau_{0}^{2}=\sigma^{2}/k_{0}$, so that $k_{0}$ acts as the sample size of the prior distribution, then
$$\{\theta|\sigma^{2},y_{1},\ldots,y_{n}\}\sim\text{normal}(\mu_{n},\sigma^{2}/k_{n}),$$
where $k_{n}=k_{0}+n$ and $\mu_{n}=\frac{k_{0}\mu_{0}+n\bar{y}}{k_{n}}$.


The predictive distribution is $\tilde{Y}|\sigma^{2},y_{1},\ldots,y_{n}\sim\text{normal}(\mu_{n},\tau_{n}^{2}+\sigma^{2})$
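
The posterior and predictive formulas for known $\sigma^{2}$ can be sketched as follows. The prior values and data are made up, and `normal_posterior` is a hypothetical helper name.

```python
def normal_posterior(mu0, tau0_sq, sigma_sq, ys):
    """Posterior mean and variance of theta when sigma^2 is known."""
    n = len(ys)
    ybar = sum(ys) / n
    precision = 1.0 / tau0_sq + n / sigma_sq          # posterior precision 1 / tau_n^2
    mu_n = (mu0 / tau0_sq + n * ybar / sigma_sq) / precision
    return mu_n, 1.0 / precision

ys = [1.0, 2.0, 3.0]                                   # hypothetical data
mu_n, tau_n_sq = normal_posterior(mu0=0.0, tau0_sq=1.0, sigma_sq=1.0, ys=ys)

# Predictive variance adds sampling variability to posterior uncertainty.
pred_var = tau_n_sq + 1.0

# Same posterior mean through the prior-sample-size form, k0 = sigma^2 / tau0^2.
k0, n = 1.0, len(ys)
mu_n_alt = (k0 * 0.0 + n * (sum(ys) / n)) / (k0 + n)
print(mu_n, tau_n_sq, pred_var, mu_n_alt)
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, which is why the two parameterizations give the same answer.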

A joint distribution for two quantities can be expressed as the product
of a conditional probability and a marginal probability: $p(\theta,\sigma^{2})=p(\theta|\sigma^{2})p(\sigma^{2})$. Similarly,
$$p(\theta,\sigma^{2}|y_{1},\ldots,y_{n})=p(\theta|\sigma^{2},y_{1},\ldots,y_{n})p(\sigma^{2}|y_{1},\ldots,y_{n})$$




The *gamma* distribution is a conjugate prior distribution for the precision $1/\sigma^2$; equivalently, $\sigma^2$ has an *inverse-gamma* prior distribution: $1/\sigma^{2}\sim\text{gamma}(\frac{v_{0}}{2},\frac{v_{0}}{2}\sigma_{0}^{2})$. The prior parameters $(\sigma_{0}^{2},v_{0})$ can be interpreted as the sample variance and sample size of prior observations.

Suppose our prior distributions and sampling model are as follows:
$$1/\sigma^{2}\sim\text{gamma}(\frac{v_{0}}{2},\frac{v_{0}}{2}\sigma_{0}^{2})$$
$$\theta|\sigma^{2}\sim\text{normal}(\mu_{0},\sigma^{2}/k_{0})$$
$$Y_{1},\ldots,Y_{n}|\theta,\sigma^{2}\sim\text{i.i.d. normal}(\theta,\sigma^{2})$$

Then the posterior distribution of the precision is $\{1/\sigma^{2}|y_{1},\ldots,y_{n}\}\sim\text{gamma}(\frac{v_{n}}{2},\frac{v_{n}}{2}\sigma_{n}^{2})$,
where $v_{n}=v_{0}+n$, $k_{n}=k_{0}+n$, and $\sigma_{n}^{2}=\frac{1}{v_{n}}[v_{0}\sigma_{0}^{2}+(n-1)s^{2}+\frac{k_{0}n}{k_{n}}(\bar{y}-\mu_{0})^{2}]$.

In the conjugate case where $\tau_{0}^{2}$ was proportional to $\sigma^{2}$,
we showed that $p(\sigma^{2}|y_{1},\ldots,y_{n})$ was an inverse-gamma
distribution, and that a Monte Carlo sample of $\{\theta,\sigma^{2}\}$
from their joint posterior distribution could be obtained by sampling
(1) a value $\sigma^{2(s)}$ from $p(\sigma^{2}|y_{1},\ldots,y_{n})$,
an inverse-gamma distribution, then (2) a value $\theta^{(s)}$ from
$p(\theta|\sigma^{2(s)},y_{1},\ldots,y_{n})$, a normal distribution.
However, in the case where $\tau_{0}^{2}$ is not proportional to
$\sigma^{2}$, the marginal density of $1/\sigma^{2}$ is not a gamma
distribution, or any other standard distribution from which we can
easily sample.

Keep in mind that while $\theta^{(1)},\ldots, \theta^{(S)}$ are indeed each conditional
samples, they are each conditional on different values of $\sigma^2$. Taken together,
they constitute marginal samples of $\theta$.
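
The two-step Monte Carlo scheme for the conjugate case can be sketched as below. The data and prior settings are made up, and both function names are hypothetical. The precision is drawn from its gamma posterior and inverted to obtain $\sigma^{2}$; `random.gammavariate` takes a scale, so the rate is inverted.

```python
import math
import random

def conjugate_normal_posterior(mu0, k0, nu0, s20, ys):
    """Posterior parameters (mu_n, k_n, nu_n, sigma_n^2) for the conjugate normal model."""
    n = len(ys)
    ybar = sum(ys) / n
    s2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    k_n, nu_n = k0 + n, nu0 + n
    mu_n = (k0 * mu0 + n * ybar) / k_n
    s2_n = (nu0 * s20 + (n - 1) * s2 + k0 * n / k_n * (ybar - mu0) ** 2) / nu_n
    return mu_n, k_n, nu_n, s2_n

def sample_joint(mu_n, k_n, nu_n, s2_n, S, rng):
    """Draw (theta, sigma^2) pairs: sigma^2 first, then theta given that sigma^2."""
    draws = []
    for _ in range(S):
        precision = rng.gammavariate(nu_n / 2.0, 2.0 / (nu_n * s2_n))  # rate nu_n*s2_n/2
        sigma2 = 1.0 / precision
        theta = rng.gauss(mu_n, math.sqrt(sigma2 / k_n))
        draws.append((theta, sigma2))
    return draws

ys = [1.0, 2.0, 3.0, 4.0]                       # hypothetical data
mu_n, k_n, nu_n, s2_n = conjugate_normal_posterior(0.0, 1.0, 1.0, 1.0, ys)
draws = sample_joint(mu_n, k_n, nu_n, s2_n, S=20000, rng=random.Random(3))
theta_mean = sum(t for t, _ in draws) / len(draws)
print(mu_n, s2_n, theta_mean)
```

Each `theta` is drawn conditional on its own `sigma2`, yet the collection of `theta` values approximates the marginal posterior of $\theta$, whose mean is $\mu_{n}$.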

## Posterior approximation with the Gibbs sampler

For many multiparameter models the joint posterior distribution is nonstandard and difficult to sample from directly. However, it is often the case that it is easy to sample from the full conditional distribution of each parameter. In such cases, posterior approximation can be made with the Gibbs sampler, an iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution.

### Semiconjugate prior distribution
For the normal model, sometimes we may want to specify our uncertainty
about $\theta$ as being independent of $\sigma^{2}$, so that $p(\theta,\sigma^{2})=p(\theta)\times p(\sigma^{2})$. One such joint distribution is the following "semiconjugate" prior distribution: $\theta\sim\text{normal}(\mu_{0},\tau_{0}^{2})$; $\frac{1}{\sigma^{2}}\sim\text{gamma}(\frac{\upsilon_{0}}{2},\frac{\upsilon_{0}\sigma_{0}^{2}}{2})$.



### Sampling from the conditional distributions

Suppose for the moment you knew the value of $\theta$. Writing $\tilde{\sigma}^{2}=1/\sigma^{2}$ for the precision, the conditional
distribution of $\tilde{\sigma}^{2}$ given $\theta$ and $\{y_{1},\ldots,y_{n}\}$
is

$$
p(\tilde{\sigma}^{2}|\theta,y_{1},\ldots,y_{n})\propto p(y_{1},\ldots,y_{n},\theta,\tilde{\sigma}^{2})=p(y_{1},\ldots,y_{n}|\theta,\tilde{\sigma}^{2})p(\theta|\tilde{\sigma}^{2})p(\tilde{\sigma}^{2})
$$
For the semiconjugate prior, we assume that $\theta$ and $\tilde{\sigma}^{2}$
are independent, so $p(\theta|\tilde{\sigma}^{2})=p(\theta)$.
$$
p(\tilde{\sigma}^{2}|\theta,y_{1},\ldots,y_{n})\propto p(y_{1},\ldots,y_{n}|\theta,\tilde{\sigma}^{2})p(\tilde{\sigma}^{2})\propto(\tilde{\sigma}^{2})^{(\upsilon_{0}+n)/2-1}\times\exp\{-\tilde{\sigma}^{2}\times[\upsilon_{0}\sigma_{0}^{2}+\sum(y_{i}-\theta)^{2}]/2\}
$$

This is the form of a gamma density for $\tilde{\sigma}^{2}=1/\sigma^{2}$, and so evidently $\{\sigma^{2}|\theta,y_{1},\ldots,y_{n}\}\sim\text{inverse-gamma}(\upsilon_{n}/2,\upsilon_{n}\sigma_{n}^{2}(\theta)/2)$,
where $\upsilon_{n}=\upsilon_{0}+n$, $\sigma_{n}^{2}(\theta)=[\upsilon_{0}\sigma_{0}^{2}+ns_{n}^{2}(\theta)]/\upsilon_{n}$,
and $s_{n}^{2}(\theta)=\sum(y_{i}-\theta)^{2}/n$, the unbiased estimate
of $\sigma^{2}$ if $\theta$ were known. This means that we can easily
sample directly from $p(\sigma^{2}|\theta,y_{1},\ldots,y_{n})$, as
well as from $p(\theta|\sigma^{2},y_{1},\ldots,y_{n})$ as shown at
the beginning of the chapter. However, we do not yet have a way to
sample directly from $p(\theta,\sigma^{2}|y_{1},\ldots,y_{n})$. The
distributions $p(\theta|\sigma^{2},y_{1},\ldots,y_{n})$ and $p(\sigma^{2}|\theta,y_{1},\ldots,y_{n})$
are called the **full conditional distributions** of $\theta$ and $\sigma^{2}$
respectively, as they are each a conditional distribution of a parameter
given everything else.

### Gibbs sampler

1. sample $\theta^{(s+1)}\sim p(\theta|\tilde{\sigma}^{2(s)},y_{1},\ldots,y_{n})$

2. sample $\tilde{\sigma}^{2(s+1)}\sim p(\tilde{\sigma}^{2}|\theta^{(s+1)},y_{1},\ldots,y_{n})$

3. let $\phi^{(s+1)}=\{\theta^{(s+1)},\tilde{\sigma}^{2(s+1)}\}$

This algorithm is called the Gibbs sampler, and generates a dependent
sequence of our parameters $\{\phi^{(1)},\phi^{(2)},\ldots,\phi^{(S)}\}$.
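
The three steps above can be sketched for the semiconjugate normal model using the two full conditional distributions. The data, prior settings, starting value and the name `gibbs_normal` are assumptions made for illustration; the chain stores $\sigma^{2}=1/\tilde{\sigma}^{2}$ rather than the precision.

```python
import math
import random

def gibbs_normal(ys, mu0, tau0_sq, nu0, s20, S, rng):
    """Gibbs sampler for (theta, sigma^2) under the semiconjugate prior."""
    n = len(ys)
    ybar = sum(ys) / n
    # Starting value for the precision: inverse of the sample variance (a choice).
    precision = (n - 1) / sum((y - ybar) ** 2 for y in ys)
    chain = []
    for _ in range(S):
        # Step 1: theta | precision, y is normal(mu_n, tau_n^2).
        tau_n_sq = 1.0 / (1.0 / tau0_sq + n * precision)
        mu_n = (mu0 / tau0_sq + n * precision * ybar) * tau_n_sq
        theta = rng.gauss(mu_n, math.sqrt(tau_n_sq))
        # Step 2: precision | theta, y is gamma(nu_n / 2, rate).
        rate = (nu0 * s20 + sum((y - theta) ** 2 for y in ys)) / 2.0
        precision = rng.gammavariate((nu0 + n) / 2.0, 1.0 / rate)  # scale = 1 / rate
        # Step 3: store the updated pair phi = (theta, sigma^2).
        chain.append((theta, 1.0 / precision))
    return chain

ys = [9.1, 10.4, 8.7, 11.2, 10.0, 9.6, 10.9, 9.8]   # hypothetical data
chain = gibbs_normal(ys, mu0=10.0, tau0_sq=25.0, nu0=1.0, s20=1.0,
                     S=5000, rng=random.Random(4))
theta_mean = sum(t for t, _ in chain) / len(chain)
print(theta_mean)
```

With a prior this diffuse, the posterior mean of $\theta$ should sit close to the sample mean. Each iteration depends on the previous one only through the current value of the precision.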

Given a starting point $\phi^{(0)}=\{\phi_{1}^{(0)},\ldots,\phi_{p}^{(0)}\}$
(i.e. $p$ parameters), the Gibbs sampler generates $\phi^{(s)}$ from
$\phi^{(s-1)}$ as follows:

1. sample $\phi_{1}^{(s)}\sim p(\phi_{1}|\phi_{2}^{(s-1)},\phi_{3}^{(s-1)},\ldots,\phi_{p}^{(s-1)})$

2. sample $\phi_{2}^{(s)}\sim p(\phi_{2}|\phi_{1}^{(s)},\phi_{3}^{(s-1)},\ldots,\phi_{p}^{(s-1)})$

3. sample $\phi_{3}^{(s)}\sim p(\phi_{3}|\phi_{1}^{(s)},\phi_{2}^{(s)},\phi_{4}^{(s-1)}\ldots,\phi_{p}^{(s-1)})$

... 

p. sample $\phi_{p}^{(s)}\sim p(\phi_{p}|\phi_{1}^{(s)},\phi_{2}^{(s)},\phi_{3}^{(s)}\ldots,\phi_{p-1}^{(s)})$.

This algorithm generates a dependent sequence $\phi^{(1)},\phi^{(2)},\ldots,\phi^{(S)}$.
Given $\phi^{(s-1)}$, $\phi^{(s)}$ does not depend on earlier values in the sequence. Such a sequence is called a Markov chain.

The sampling distribution of $\phi^{(s)}$ approaches the target distribution
as $s\rightarrow\infty$, no matter what the starting value $\phi^{(0)}$
is (although some starting values will get you to the target sooner
than others). More importantly, for most functions $g$ of interest
$$
\frac{1}{S}\sum_{s=1}^{S}g(\phi^{(s)})\rightarrow E(g(\phi))=\int g(\phi)p(\phi)d\phi\:\text{as }S\rightarrow\infty
$$
This means we can approximate $E[g(\phi)]$ with the sample average
of $\{g(\phi^{(1)}),\ldots,g(\phi^{(S)})\}$, just as in Monte Carlo
approximation. For this reason, we call such approximations Markov
chain Monte Carlo (MCMC) approximations, and the procedure an MCMC
algorithm.


For many models, the posterior distribution $p(\phi|y)$ is complicated
and hard to write down. In these cases, a useful way to "look
at" $p(\phi|y)$ is by studying Monte Carlo samples
from $p(\phi|y)$. Thus, Monte Carlo and MCMC sampling algorithms

- are not models,
- do not generate "more information" than is in $y$ and $p(\phi)$,
- are simply "ways of looking at" $p(\phi|y)$.

For example, if we have Monte Carlo samples $\phi^{(1)},\phi^{(2)},\ldots,\phi^{(S)}$
that are approximate draws from $p(\phi|y)$, then these samples help
describe $p(\phi|y)$, for instance its mean and its cumulative probabilities:

$$
\frac{1}{S}\sum\phi^{(s)}\approx\int\phi p(\phi|y)d\phi
$$

$$
\frac{1}{S}\sum1(\phi^{(s)}\leq c)\approx\text{Pr}(\phi\leq c|y)=\int_{-\infty}^{c}p(\phi|y)d\phi
$$


### MCMC diagnostics

The purpose of Monte Carlo or MCMC is to approximate the expected
value of $g(\phi)$ under a target probability distribution $p(\phi)$
(usually the posterior distribution) by the empirical average of $\{g(\phi^{(1)}),\ldots,g(\phi^{(S)})\}$.

Independent Monte Carlo samples automatically create a sequence
that is representative of $p(\phi)$: $\text{Pr}(\phi^{(s)}\in A)=\int_{A}p(\phi)d\phi$.
This is true for any set $A$ and every $s\in\{1,\ldots,S\}$,
conditionally or unconditionally on the other values in the sequence.
We need the empirical distribution of the simulated sequence $\{\phi^{(1)},\ldots,\phi^{(S)}\}$
to look like the target distribution $p(\phi)$.

This is not true for MCMC samples, for which the statement holds only in the limit: $\underset{s\rightarrow\infty}{\lim}\text{Pr}(\phi^{(s)}\in A)=\int_{A}p(\phi)d\phi$.

An example comparing MC and MCMC:

Our target distribution will be the joint probability distribution
of two variables: a discrete variable $\delta\in\{1,2,3\}$ and a
continuous variable $\theta\in\mathbb{R}$. Suppose we know the marginal distribution
$p(\delta)$, with $(\text{Pr}(\delta=1),\text{Pr}(\delta=2),\text{Pr}(\delta=3))=(0.45,0.10,0.45)$, and
$p(\theta|\delta)=\text{dnorm}(\theta,\mu_{\delta},\sigma_{\delta})$,
where $(\mu_{1},\mu_{2},\mu_{3})=(-3,0,3)$ and $(\sigma_{1}^{2},\sigma_{2}^{2},\sigma_{3}^{2})=(1/3,1/3,1/3)$.

MC: first, draw a random sample of $\delta$ from its marginal distribution
$p(\delta)$; then plug this value into the conditional distribution
$p(\theta|\delta)=\text{dnorm}(\theta,\mu_{\delta},\sigma_{\delta})$
and draw a sample of $\theta$. The sample pair $(\delta,\theta)$
represents a sample from the joint distribution $p(\delta,\theta)=p(\theta|\delta)p(\delta)$.
The empirical distribution of the $\theta$-samples provides an approximation
to the marginal distribution $p(\theta)=\sum_{\delta}p(\theta|\delta)p(\delta)$.

Gibbs sampler for $\phi=(\delta,\theta)$: alternately sample values of $\theta$ and
$\delta$ from their full conditional distributions, $p(\theta|\delta)=\text{dnorm}(\theta,\mu_{\delta},\sigma_{\delta})$ and
$p(\delta=d|\theta)=\frac{\text{Pr}(\delta=d)\times\text{dnorm}(\theta,\mu_{d},\sigma_{d})}{\sum_{d'=1}^{3}\text{Pr}(\delta=d')\times\text{dnorm}(\theta,\mu_{d'},\sigma_{d'})}$,
for $d\in\{1,2,3\}$.
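
Both samplers for this mixture example can be sketched in a few lines. The function names, the starting value `theta0=0.0`, the seeds and the chain length are assumptions; the probabilities, means and variances come from the example above.

```python
import math
import random

PROBS = (0.45, 0.10, 0.45)       # Pr(delta = 1, 2, 3)
MUS = (-3.0, 0.0, 3.0)           # component means
SD = math.sqrt(1.0 / 3.0)        # common standard deviation (variance 1/3)

def dnorm(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mc_sample(S, rng):
    """Independent draws: delta from p(delta), then theta from p(theta | delta)."""
    out = []
    for _ in range(S):
        d = rng.choices((0, 1, 2), weights=PROBS)[0]
        out.append((d + 1, rng.gauss(MUS[d], SD)))
    return out

def gibbs_sample(S, rng, theta0=0.0):
    """Dependent draws: alternate delta | theta and theta | delta."""
    out, theta = [], theta0
    for _ in range(S):
        weights = [PROBS[d] * dnorm(theta, MUS[d], SD) for d in range(3)]
        d = rng.choices((0, 1, 2), weights=weights)[0]   # normalizes internally
        theta = rng.gauss(MUS[d], SD)
        out.append((d + 1, theta))
    return out

def switches(samples):
    """Number of times consecutive draws land in different components."""
    return sum(a[0] != b[0] for a, b in zip(samples, samples[1:]))

mc = mc_sample(10000, random.Random(5))
gibbs = gibbs_sample(10000, random.Random(6))
print(switches(mc), switches(gibbs))
```

The Gibbs chain changes component far less often than the independent sampler, because $\theta$ values drawn near one mode make the other components very unlikely at the next step. This is the slow mixing discussed next.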

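Two properties of the chain matter for the quality of an MCMC approximation:

- Stationarity: whether the chain has converged, so that $\phi^{(s)}$ is approximately distributed according to $p(\phi)$.
- Speed of mixing: how quickly the chain jumps between different regions of the parameter space.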

How does the correlation of the MCMC samples affect posterior approximation?

For MC: with independent samples $\{\phi^{(1)},\ldots,\phi^{(S)}\}$ and $\phi_{0}=E[\phi]$, $\text{Var}_{\text{MC}}[\bar{\phi}]=E[(\bar{\phi}-\phi_{0})^{2}]=\text{Var}[\phi]/S$.

For MCMC: $\text{Var}_{\text{MCMC}}[\bar{\phi}]=E[(\bar{\phi}-\phi_{0})^{2}]=\text{Var}_{\text{MC}}[\bar{\phi}]+\frac{1}{S^{2}}\sum_{s\neq t}E[(\phi^{(s)}-\phi_{0})(\phi^{(t)}-\phi_{0})]$.
The second term depends on the correlation of samples within the chain and is generally positive, so
we expect the MCMC approximation to be further away from $\phi_{0}$
than the MC approximation is. The higher the autocorrelation in the
chain, the larger the MCMC variance and the worse the approximation
is. The sample autocorrelation ($t$ steps apart) can be measured with
the lag-$t$ autocorrelation function (assuming the chain is stationary,
which means that the correlation between $\phi^{(k)}$ and $\phi^{(t+k)}$
does not depend on $k$):

$\text{acf}_{t}(\phi)=\frac{\frac{1}{S-t}\sum_{s=1}^{S-t}(\phi_{s}-\bar{\phi})(\phi_{s+t}-\bar{\phi})}{\frac{1}{S-1}\sum_{s=1}^{S}(\phi_{s}-\bar{\phi})^{2}}$

The effective sample size $S_{eff}$ of an MCMC sample (computed by the `effectiveSize` function in the `coda`
package) is defined through $\text{Var}_{\text{MCMC}}[\bar{\phi}]=\text{Var}[\phi]/S_{eff}$,
so that $S_{eff}$ can be interpreted as the number of independent
Monte Carlo samples necessary to give the same precision as the MCMC
samples. Rule of thumb: 400 effective iterations for the mean, 4000
for a 95% posterior interval.
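
The lag-$t$ autocorrelation and a rough effective sample size can be sketched as follows. The estimator below, $S/(1+2\sum_{t}\text{acf}_{t})$ with the sum truncated at the first non-positive autocorrelation, is a crude assumption made for illustration and is not the method used by `coda`. The AR(1) chain with coefficient 0.9 is a made-up stand-in for a slowly mixing MCMC chain.

```python
import random

def acf(chain, t):
    """Lag-t sample autocorrelation, following the formula above."""
    S = len(chain)
    mean = sum(chain) / S
    num = sum((chain[s] - mean) * (chain[s + t] - mean) for s in range(S - t)) / (S - t)
    den = sum((x - mean) ** 2 for x in chain) / (S - 1)
    return num / den

def effective_size(chain, max_lag=200):
    """Crude estimate: S / (1 + 2 * sum of leading positive autocorrelations)."""
    S = len(chain)
    total = 0.0
    for t in range(1, min(max_lag, S - 1) + 1):
        rho = acf(chain, t)
        if rho <= 0:          # stop once the estimated correlation dies out
            break
        total += rho
    return S / (1 + 2 * total)

rng = random.Random(7)
S = 5000
iid_chain = [rng.gauss(0, 1) for _ in range(S)]   # independent draws

ar_chain, x = [], 0.0
for _ in range(S):                                # AR(1): strongly autocorrelated
    x = 0.9 * x + rng.gauss(0, 1)
    ar_chain.append(x)

ess_iid = effective_size(iid_chain)
ess_ar = effective_size(ar_chain)
print(acf(ar_chain, 1), ess_iid, ess_ar)
```

For the independent chain the estimate stays near $S$, while for the autocorrelated chain it is a small fraction of $S$, which is the sense in which correlated samples carry less information.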

Thinning the chain.



## The multivariate normal model