## Priors

### Priors and prior predictive distributions

How do we choose a prior? Our prior needs to represent our personal perspective, our beliefs, and our uncertainties. Theoretically we're defining a cumulative distribution function for the parameter. 

So in particular we're defining $P(\theta \leq c)$ for all $c \in \mathbb{R}$.

This isn't practical to do, and it would be very difficult to do coherently, so that all the probabilities were consistent. 

In practice, we work with a convenient family that is sufficiently flexible that some member of the family will represent our beliefs. We can then build in external information, such as previous experiments, if it is available. 

Generally, if one has enough data, the information in the data will overwhelm the information in the prior, so the prior is not particularly important in terms of what you get for the posterior. Any reasonable choice of prior will lead to approximately the same posterior. However, there are some things that can go wrong. 
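The claim that enough data overwhelms any reasonable prior can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text above: a coin-flip model with a conjugate Beta prior, for which the posterior mean after `heads` heads in `flips` flips is $(\alpha + \text{heads})/(\alpha + \beta + \text{flips})$. The two priors and the data counts are hypothetical.

```python
def posterior_mean(alpha, beta, heads, flips):
    """Posterior mean of theta under a Beta(alpha, beta) prior and binomial data."""
    return (alpha + heads) / (alpha + beta + flips)

# Two quite different (hypothetical) priors: uniform, and one favoring tails.
priors = {"Beta(1, 1)": (1, 1), "Beta(2, 8)": (2, 8)}

# Hypothetical data sets of growing size, each with 60% heads.
for flips in (10, 100, 10_000):
    heads = 6 * flips // 10
    means = {name: posterior_mean(a, b, heads, flips)
             for name, (a, b) in priors.items()}
    gap = abs(means["Beta(1, 1)"] - means["Beta(2, 8)"])
    print(flips, {k: round(v, 4) for k, v in means.items()}, "gap:", round(gap, 4))
```

With ten flips the two posterior means differ noticeably. With ten thousand flips the gap is negligible, which is the sense in which the prior stops mattering.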

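Suppose we choose a prior that puts all of its mass on a single value:
$$P(\theta = 0.5) = 1$$
This means that the probability of $\theta$ being anything other than $0.5$ is $0$. If we do this, we'll see:
$$f(\theta | y) \propto f(y| \theta)f(\theta)= f(\theta)$$
The likelihood cannot move any mass to values of $\theta$ where the prior is zero, so after normalizing, the posterior is identical to the prior no matter what data we observe.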

**So in the basic context, events with prior probability of zero have posterior probability of zero, and events with prior probability of one have posterior probability of one. Thus a good Bayesian (like you) will not assign probability of zero or one to any event unless it has already occurred or is already known not to occur.**
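A small discrete version of Bayes' rule makes this concrete. This is a minimal sketch, assuming a hypothetical grid of five candidate values for $\theta$ and hypothetical data (9 heads in 10 flips); neither comes from the text above.

```python
from math import comb

def grid_posterior(prior, thetas, heads, flips):
    """Bayes' rule on a discrete grid: posterior is proportional to likelihood times prior."""
    unnorm = [comb(flips, heads) * t**heads * (1 - t)**(flips - heads) * p
              for t, p in zip(thetas, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
point_mass = [0.0, 0.0, 1.0, 0.0, 0.0]   # all prior mass on theta = 0.5
spread_out = [0.2, 0.2, 0.2, 0.2, 0.2]   # equal mass on every grid point

# Hypothetical data that strongly favors a large theta: 9 heads in 10 flips.
print(grid_posterior(point_mass, thetas, heads=9, flips=10))
print([round(p, 3) for p in grid_posterior(spread_out, thetas, heads=9, flips=10)])
```

Under the point-mass prior the posterior stays entirely on $\theta = 0.5$ even though the data favor a larger value. Under the spread-out prior the posterior shifts toward $\theta = 0.9$.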

### Calibration

A useful concept in terms of choosing priors is that of calibration, specifically the calibration of predictive intervals. If we make an interval and say we predict that 95% of new data points will fall in it, it would be good if in reality 95% of new data points actually did fall in that interval. How do we calibrate to reality? This is actually more of a frequentist concept, but it is important for practical statistical purposes that our results reflect reality.

So we can compute a predictive interval: an interval such that 95% of new observations are expected to fall into it. It is an interval for the data, for $y$ or $x$, rather than an interval for $\theta$ like we've been looking at. 

In particular: 
$$f(y) = \int f(y |\theta)f(\theta)d\theta = \int f(y, \theta) d\theta$$
This is our prior predictive distribution, before we observe any data. 

**Remember the definition of conditional probability: $P(A|B) = \dfrac{P(A\cap B)}{P(B)}$**

Prior predictive intervals are useful because they reveal the consequences of the prior at the data (observational) level.
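One way to see those consequences is to simulate from the prior predictive: draw $\theta$ from the prior, then draw data given that $\theta$, and read an interval off the simulated data. The sketch below assumes a hypothetical setup that is not in the text above: ten coin flips and a Beta(10, 10) prior, which expresses a belief that the coin is close to fair. The function names are illustrative only.

```python
import random

def prior_predictive_draws(n_draws, n_flips, alpha, beta, seed=0):
    """Simulate from the prior predictive: theta from the prior, then data given theta."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        theta = rng.betavariate(alpha, beta)                        # theta ~ prior
        heads = sum(rng.random() < theta for _ in range(n_flips))   # data | theta
        draws.append(heads)
    return draws

def central_interval(draws, level=0.95):
    """Empirical central interval: cut (1 - level) / 2 of the draws from each tail."""
    ordered = sorted(draws)
    tail = (1 - level) / 2
    lo = ordered[int(tail * len(ordered))]
    hi = ordered[int((1 - tail) * len(ordered)) - 1]
    return lo, hi

draws = prior_predictive_draws(n_draws=20_000, n_flips=10, alpha=10, beta=10)
lo, hi = central_interval(draws)
coverage = sum(lo <= d <= hi for d in draws) / len(draws)
print("95% prior predictive interval:", (lo, hi), "coverage:", round(coverage, 3))
```

If the resulting interval for the number of heads looks implausibly narrow or wide, that is a sign the prior does not reflect what we actually believe about the data.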

**Just a reminder of the Beta distribution, which we will use in the following example:**
$$n! = \Gamma(n + 1)$$
$$z \sim Beta(\alpha, \beta)$$
$$f(z) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}z^{\alpha-1}(1-z)^{\beta - 1}$$

### Example

Suppose we're going to flip a coin ten times and count the number of heads we see. We're thinking about this in advance of actually doing it, so we're interested in the predictive distribution. How many heads do we predict we're going to see? 

This, of course, will depend on the coin itself. What's the probability that it shows up heads? So, we'll need to choose a prior. 

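Let $X$ be the number of heads in the ten flips, where $Y_{i}$ is 1 if flip $i$ is a head and 0 otherwise:

$$X = \sum_{i=1}^{10} Y_{i}$$
If we think all values of $\theta$ are equally likely, we can use a uniform prior:
$$f(\theta) = I_{\{0\leq \theta \leq 1 \}}$$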
$$f(x) = \int f(x|\theta)f(\theta)d\theta = \int_{0}^{1}\dfrac{10!}{x!(10-x)!}\theta^{x}(1-\theta)^{10-x} (1) d\theta $$
$$=\int_{0}^{1}\dfrac{\Gamma(11)}{\Gamma(x+1)\Gamma(11-x)}\theta^{(x+1)-1}(1-\theta)^{(11-x)-1}d\theta $$
$$= \dfrac{\Gamma(11)}{\Gamma(12)}\int_{0}^{1}\dfrac{\Gamma(12)}{\Gamma(x+1)\Gamma(11-x)}\theta^{(x+1)-1}(1-\theta)^{(11-x)-1}d\theta$$
$$= \dfrac{\Gamma(11)}{\Gamma(12)}(1)$$
$$=\dfrac{10!}{11!} = \dfrac{1}{11},\ \text{for} \ x \in \{0,1,2,\dots,10\}$$
What I have inside the integral, once the constant $\Gamma(11)/\Gamma(12)$ is pulled out front, is exactly a beta density with $\alpha = x+1$ and $\beta = 11-x$. Because it's a beta density, and all densities integrate to 1, the integral is 1. Recognizing a known density like this is a much easier way to do integrals. 

Thus we see that if we start with a uniform prior, we then end up with a discrete uniform predictive density for X. If all possible coins or all possible probabilities are equally likely, then all possible X outcomes are equally likely. 
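The value $1/11$ can be confirmed with exact arithmetic. This is a minimal sketch, assuming the integer-argument identity $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b) = (a-1)!\,(b-1)!/(a+b-1)!$ for the integral of $\theta^{a-1}(1-\theta)^{b-1}$ over $[0, 1]$. The function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) for positive integers."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def prior_predictive(x, n=10):
    """f(x) = C(n, x) * integral of theta^x (1 - theta)^(n - x) over [0, 1] (uniform prior)."""
    return comb(n, x) * beta_fn(x + 1, n - x + 1)

pmf = [prior_predictive(x) for x in range(11)]
print(pmf[0], sum(pmf))  # every entry is 1/11, and the eleven entries sum to 1
```

Using `Fraction` avoids floating-point rounding, so the check that every outcome has probability exactly $1/11$ is an equality rather than an approximation.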

### Posterior Predictive distribution

What about after we've observed data? What's our posterior predictive distribution? In this case, we can again think about flipping a coin. We don't know the probability that it comes up heads. But suppose that on the first flip we got a head. We want to ask, what's our predictive distribution for the second flip, given that we saw a head on the first flip? 

So in this case, we can think about a posterior predictive distribution, $f(y_{2}| y_{1}) =\int f(y_{2}|\theta,y_{1})f(\theta|y_{1})d\theta$. Because the flips are independent given $\theta$, $f(y_{2}|\theta,y_{1}) = f(y_{2}|\theta)$.

So again, you can see this looks very much like the prior predictive, except we're using the posterior distribution for $\theta$ instead of the prior distribution. As a particular example, again, suppose we start with a uniform prior for $\theta$ and the first flip comes up heads, $y_{1} = 1$. What do we predict for the second flip? 

This is no longer going to be a uniform distribution like it was in the first case, because we have some data. We have one head coming up on the first flip, so this gives us some information about the coin. We now think a second head is more likely, because it's more likely that $\theta$ is larger than one half. The posterior after one head is $f(\theta|y_{1}=1) = 2\theta$ (the uniform prior times the likelihood $\theta$, normalized to integrate to 1). So let's compute this:
$$f(y_{2}|y_{1}) = \int_{0}^{1}\theta^{y_2}(1-\theta)^{1-y_{2}}2\theta d\theta$$
$$= \int_{0}^{1} 2\theta^{y_{2}+1}(1-\theta)^{1-y_{2}}d\theta$$

We could work this out in a more general form, but in this case, y2 has to take the value 0 or 1. The next flip is either going to be heads or tails, so it's easier to just stick to a particular example. So, $P(y_{2}=1|y_{1}=1) =  ? $

$$\int_{0}^{1}2\theta^{1+1}(1-\theta)^{1-1}d\theta$$
$$=\int_{0}^{1}2\theta^{2}d\theta$$
$$=\dfrac{2}{3}$$

Thus the complement, the probability of getting a tail on the second flip given that we saw a head on the first, is one-third. So we can see here that the posterior is a combination of the information in the prior and the information in the data. In this case, our prior is like having two data points, one head and one tail: saying we have a uniform prior for $\theta$ is equivalent, in an information sense, to saying we have observed one head and one tail. When we then observe one head, it's as if we have now seen two heads and one tail. So our posterior predictive distribution for the second flip says that, having seen two heads and one tail, we have a probability of two-thirds of getting another head and a probability of one-third of getting a tail.
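Both predictive probabilities can be checked by integrating numerically. The sketch below is one possible check, not part of the derivation: it uses a simple midpoint rule (an assumed choice of method) on the integrand $\theta^{y_2}(1-\theta)^{1-y_2}\,2\theta$, and compares the result with the pseudo-count reading of the uniform prior.

```python
def posterior_predictive(y2, steps=100_000):
    """Midpoint-rule approximation of the integral of theta^y2 (1-theta)^(1-y2) * 2*theta."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width          # midpoint of the i-th subinterval
        total += theta**y2 * (1 - theta)**(1 - y2) * 2 * theta
    return total * width

p_head = posterior_predictive(1)   # P(y2 = 1 | y1 = 1)
p_tail = posterior_predictive(0)   # P(y2 = 0 | y1 = 1)

# Pseudo-count view: uniform prior acts like 1 head + 1 tail, plus the observed head.
heads, tails = 1 + 1, 1
print(round(p_head, 6), round(p_tail, 6), heads / (heads + tails))
```

The numerical integrals agree with two-thirds and one-third, and the pseudo-count ratio of heads to total counts gives the same two-thirds.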
