# Lab 1

### Lab Date: Wednesday, January 29

### Due: Wednesday, February 5

## Instructions

Work with your lab group to complete the following notebook. It will be reviewed by your peers in lab next week (Wednesday, February 5th). 

This notebook is only lightly scaffolded. This is intentional, since the learning goal for today's lab is as much about how to frame prior fitting and model evaluation as it is about the necessary computation. As such, I have left the two major questions at the end of the lab open. If your group is stuck and unsure how to proceed, ask the instructor during lab, come to office hours (OH), or review Chapter 6. Many of the ideas in Chapter 6 are essentially standard statistical practice wrapped around a Bayesian pipeline, so they can be discovered independently without much technical course knowledge. This lab is designed to give you enough space to discover some of these ideas.

If you are new to working in Python, or in a Jupyter notebook, please ask your lab members for help. If you notice a lab member struggling and you have experience, please offer your help.

In [5]:
# Basic Set Up
import numpy as np
from matplotlib import pyplot as plt
import scipy
from scipy import io, integrate, linalg, signal
import pandas as pd
# import pymc as pm
# import bambi as bmb
# import arviz as az
import statsmodels.api as sm

# Add what you like beneath here...

In [6]:
# Load Data for the Lab
success_counts = pd.read_csv('Beta_Binomial_Draws',header=None) # The first column is the number of trials, the second is the number of successes
success_string = pd.read_csv('Ys',header=None) # The specific outcome string (sequence of 0, 1's) for the 100th trial
true_probabilities = pd.read_csv('Thetas_true',header=None) # The true outcome probabilities for each row. DO NOT LOOK AT THIS UNTIL THE END. It is "unknown" and in most problems, unknowable.

## The (Beta, Binomial) Model

As in class, and as in Chapter 2.4, consider the (Beta, Binomial) model. That is, we draw a success probability $\Theta$ from a Beta distribution with parameters $\alpha, \beta$, then draw $S$ successes from a Binomial distribution on $n$ trials, with success probability $\Theta$:

$$ \Theta \sim \text{Beta}(\alpha,\beta), \quad S \sim \text{Binomial}(n,\Theta)$$

We will assume, as usual, that $n$ is known, $S$ is observed, and all other variables are unknown.
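To make the generative story concrete, here is a minimal simulation sketch. The function name `simulate_beta_binomial`, the seed, and the parameter values $\alpha = 2, \beta = 5$ are illustrative assumptions; they are not the values used to generate the lab data.

```python
import numpy as np

def simulate_beta_binomial(alpha, beta, n_trials, seed=0):
    """Draw one (theta, s) pair per entry of n_trials from the (Beta, Binomial) model."""
    rng = np.random.default_rng(seed)
    n_trials = np.asarray(n_trials)
    # Theta ~ Beta(alpha, beta), one draw per row
    theta = rng.beta(alpha, beta, size=n_trials.shape)
    # S | Theta ~ Binomial(n, Theta), using each row's own n and theta
    s = rng.binomial(n_trials, theta)
    return theta, s

# Hypothetical prior parameters and trial counts, for illustration only
n_sim = np.array([10, 20, 50, 100])
theta_sim, s_sim = simulate_beta_binomial(2.0, 5.0, n_sim)
print(np.column_stack([n_sim, s_sim]))
```

In the lab data you only see the analogue of `n_sim` and `s_sim`; the analogue of `theta_sim` is what you are trying to infer.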

Our goal in this lab is to practice building posterior distributions and posterior estimates of the unknown success probabilities. We learned how to do this in class for any fixed $\alpha, \beta$. Your main goal for this lab is to learn how to reasonably estimate your prior parameters (note, I did not give them to you), and to check that your prior model leads to sensible posterior inference (model checking).

## Explore the Data

### The Observables

The file loaded into `success_counts` contains 100 independent draws from this model, for varying $n$. The first column contains the number of trials, the second contains the number of successes.

The file loaded into `success_string` contains the actual sequence of outcomes for the 100th draw.

### The Unknown

The file loaded into `true_probabilities` contains the true probabilities $\{\Theta_i\}_{i=1}^{100}$ for the 100 draws from the full joint model on $(\Theta, S)$. Do not look at this until the end, unless you are absolutely out of ideas. I have made it available so that, after choosing your inferential methods and evaluation scheme, you can go back and check whether you got close to the truth. You should avoid looking at it because in essentially any real Bayesian setting it would never be known. When testing pipelines, it is common to generate and save the truth, as we have done here, so that the pipeline can be validated against it. We'll save this evaluation for the end. For most of the lab, operate under the assumption that this file is inaccessible.

In [7]:
# Use this cell to explore the observable data. Make whatever simple EDA plots you think you need to get some feel for it



## The Inferential Pipeline

In class we discussed how to derive the posterior distribution for a model with a Beta prior and a Binomial likelihood. In particular, since the Beta prior is conjugate to the Binomial likelihood, the posterior is also a Beta distribution:

$$\Theta|S = s ; n, \alpha, \beta \sim \text{Beta}(s + \alpha, n - s + \beta)$$
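If you would like to convince yourself of the update rule numerically, the sketch below compares the conjugate formula against a brute-force "prior times likelihood, then normalize" computation on a grid. The helper name `posterior_params` and the example values of $s, n, \alpha, \beta$ are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

def posterior_params(s, n, alpha, beta):
    """Conjugate update: Beta(alpha, beta) prior -> Beta(s + alpha, n - s + beta) posterior."""
    return s + alpha, n - s + beta

# Hypothetical observation and prior, for illustration only
s, n, alpha, beta = 7, 20, 2.0, 3.0
a_post, b_post = posterior_params(s, n, alpha, beta)

# Brute force: evaluate prior * likelihood on a grid, then normalize numerically
grid = np.linspace(0.001, 0.999, 999)
step = grid[1] - grid[0]
unnormalized = stats.beta.pdf(grid, alpha, beta) * stats.binom.pmf(s, n, grid)
numeric_posterior = unnormalized / (unnormalized.sum() * step)

# Closed form from the conjugate update
exact_posterior = stats.beta.pdf(grid, a_post, b_post)

max_abs_error = np.max(np.abs(numeric_posterior - exact_posterior))
print(f"posterior: Beta({a_post}, {b_post}), max grid error = {max_abs_error:.2e}")
```

The two curves should agree up to discretization error, which is a useful habit to carry into models where no closed form is available.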


### Your aim:

We want to estimate $\Theta_{100}$ from $n_{100}$ and $S_{100}$. The other 99 draws from the joint model have been provided for your use either as prior data (note that $\alpha, \beta$ are the same for all 100 draws), or for model checking.

**In the cell below:** write a function that accepts $s, n, \alpha, \beta$ and returns the following posterior summaries:
1. The maximum likelihood estimate (MLE): $\hat{\theta}_{\text{MLE}}(s;n)$, which does not depend on the prior parameters
1. The posterior mean: $\hat{\theta}_{\text{mean}}(s;n,\alpha,\beta)$
1. The posterior mode: $\hat{\theta}_{\text{MAP}}(s;n,\alpha,\beta)$
1. The posterior standard deviation: $\text{SD}[\Theta|S=s]$
1. A basic interval estimate for $\Theta|S=s$ using posterior summaries (1. - 4.). Choose your interval estimate so that it should contain the truth with reasonably high probability. You may decide as a group what convention to use here (a Chebyshev bound, the normal approximation to the Beta, etc.).
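As a starting point, here is one possible shape for such a function. It is a sketch, not the required answer: the name `posterior_summaries`, the dictionary return type, and the "mean plus or minus `z` standard deviations, clipped to $[0, 1]$" interval convention are all assumptions that your group may replace.

```python
import numpy as np

def posterior_summaries(s, n, alpha, beta, z=2.0):
    """Point and interval summaries of the Beta(s + alpha, n - s + beta) posterior."""
    a, b = s + alpha, n - s + beta
    mle = s / n if n > 0 else np.nan                  # ignores the prior entirely
    mean = a / (a + b)
    # The interior mode formula only applies when both parameters exceed 1
    mode = (a - 1) / (a + b - 2) if (a > 1 and b > 1) else np.nan
    sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    # Assumed convention: normal-style interval, clipped to the unit interval
    interval = (max(0.0, mean - z * sd), min(1.0, mean + z * sd))
    return {"mle": mle, "mean": mean, "map": mode, "sd": sd, "interval": interval}

# Hypothetical observation with a uniform prior
summary = posterior_summaries(s=7, n=20, alpha=1.0, beta=1.0)
print(summary)
```

Under the uniform prior the MAP coincides with the MLE, which is a quick sanity check on the mode formula.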

**In the cell below:** write a function that accepts $s, n, \alpha, \beta$ and a sample count $m$, and returns $m$ i.i.d. samples from the posterior distribution.

**In the space below:** write code that accepts $s,n,\alpha,\beta$, a chosen number of posterior samples $m$, and a coverage probability $p$. Your code should produce:
1. A plot of the posterior density overlaid on a histogram of the $m$ samples
1. A printed interval estimate for $\Theta|S=s$ that contains the unknown with probability $p$. Your interval may be estimated from samples (as is usually done for more complex models), or, as the Beta CDF is easy to evaluate numerically, from exact quantiles.
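The sketch below shows one way these pieces could fit together. The function names, the choice of an equal-tailed interval, and the example inputs are assumptions; a highest-density interval would be an equally reasonable convention. `matplotlib` is imported inside the plotting helper only so that the numerical helpers can be used on their own.

```python
import numpy as np
from scipy import stats

def sample_posterior(s, n, alpha, beta, m, seed=0):
    """Return m i.i.d. draws from the Beta(s + alpha, n - s + beta) posterior."""
    rng = np.random.default_rng(seed)
    return rng.beta(s + alpha, n - s + beta, size=m)

def credible_interval(s, n, alpha, beta, p, samples=None):
    """Equal-tailed interval with posterior probability p (exact, or from samples)."""
    tail = (1.0 - p) / 2.0
    if samples is not None:
        lo, hi = np.quantile(samples, [tail, 1.0 - tail])
    else:
        lo, hi = stats.beta.ppf([tail, 1.0 - tail], s + alpha, n - s + beta)
    return float(lo), float(hi)

def plot_posterior(s, n, alpha, beta, samples, ax=None):
    """Overlay the exact posterior density on a histogram of the samples."""
    from matplotlib import pyplot as plt
    if ax is None:
        _, ax = plt.subplots()
    grid = np.linspace(0.0, 1.0, 501)
    ax.hist(samples, bins=40, density=True, alpha=0.4, label="samples")
    ax.plot(grid, stats.beta.pdf(grid, s + alpha, n - s + beta), label="exact density")
    ax.set_xlabel(r"$\theta$")
    ax.set_ylabel("density")
    ax.legend()
    return ax

# Hypothetical observation with a uniform prior
samples = sample_posterior(7, 20, 1.0, 1.0, m=5000)
exact_ci = credible_interval(7, 20, 1.0, 1.0, p=0.9)
sample_ci = credible_interval(7, 20, 1.0, 1.0, p=0.9, samples=samples)
print("exact:", exact_ci, "from samples:", sample_ci)
```

Calling `plot_posterior(7, 20, 1.0, 1.0, samples)` in a notebook cell draws the overlay. Comparing `exact_ci` with `sample_ci` for a few values of $m$ gives a feel for how many samples an interval estimate needs.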

**In the space below:** Apply the code you wrote above to explore the range of different posterior distributions for $\Theta_{100}|S_{100} = s_{100}$ that can be produced by varying the prior parameters $\alpha$ and $\beta$. I am not going to give you specific prior parameter pairs to check here. Instead, be responsible for your own exploration. Stop once you think you have sufficiently demonstrated the possible posteriors, and their dependence on the possible prior assumptions.

*Note: For this sort of demonstration, it helps to do a little back of the envelope thinking first about how the prior parameters bias the posterior, and the balance of information provided by the prior and by the data collected.*

## Gathering Information

In Bayesian statistics, observations provide information via conditioning. Let's see how our posterior changes sequentially as we gain evidence.

**In the cell below:** Write code that walks through the `success_string` corresponding to the 100th trial. As it goes, use the code you wrote above to print out a sequence of plots and interval estimates for the unknown. For now, initialize the process with a uniform prior ($\alpha = 1, \beta = 1$). What do you notice about the posterior (its shape, its uncertainty with respect to resampling, etc.) as we gather evidence? What do you notice about its point summaries (the MLE, MAP, and posterior mean)?

*Note: you don't need a unique output for all 70 trials in the success string. Instead, space your outputs to show the main trends. You should space your outputs tighter at the beginning, when our posterior is more sensitive to any individual trial outcome, and less tightly at the end. As a basic heuristic, the spacing should grow linearly so that the sampled trial lengths follow a quadratic sequence (e.g. 1, 4, 9, 16, 25, 36, 49, 64).*
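The checkpoint heuristic in the note can be sketched as follows. The function name `sequential_posteriors`, the 90% equal-tailed interval, and the simulated 0/1 string are assumptions for illustration; in the lab you would pass in the outcomes loaded into `success_string` instead.

```python
import math
import numpy as np
from scipy import stats

def sequential_posteriors(outcomes, alpha=1.0, beta=1.0, p=0.9):
    """Summarize the posterior after 1, 4, 9, ... outcomes and after the full string."""
    outcomes = np.asarray(outcomes).ravel()
    total = len(outcomes)
    if total == 0:
        return []
    # Quadratic checkpoints, plus the full string if it is not a perfect square
    checkpoints = [k * k for k in range(1, math.isqrt(total) + 1)]
    if checkpoints[-1] != total:
        checkpoints.append(total)
    tail = (1.0 - p) / 2.0
    rows = []
    for n_seen in checkpoints:
        s_seen = int(outcomes[:n_seen].sum())
        a, b = s_seen + alpha, n_seen - s_seen + beta
        lo, hi = stats.beta.ppf([tail, 1.0 - tail], a, b)
        rows.append((n_seen, s_seen, a / (a + b), float(lo), float(hi)))
    return rows

# Simulated stand-in for the outcome string (assumed success probability 0.3)
rng = np.random.default_rng(1)
demo_outcomes = rng.binomial(1, 0.3, size=70)
rows = sequential_posteriors(demo_outcomes)
for n_seen, s_seen, mean, lo, hi in rows:
    print(f"n={n_seen:3d} s={s_seen:3d} mean={mean:.3f} 90% interval=({lo:.3f}, {hi:.3f})")
```

Each row reuses the conjugate update with the running counts, so the posterior after $k$ outcomes serves as the prior for outcome $k+1$.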

If you feel you already have a decent intuition for the Beta distribution, I would save this step for the end of the lab. If you'd like to try a basic widget to see how the shape of the Beta depends on its parameters, go to [this article](https://www.mathmouth.com/bayesball) and scroll to Demo 3 (roughly 3/5ths of the way down).

## Fitting the Prior

You should have seen that your estimates above depend on your choice of prior parameters. This is the point of Bayesian inference. The prior *should* influence our estimation.

In order to influence our estimates informatively, the prior must encode actual information. Otherwise, the Bayesian pipeline is really just a different estimation procedure that uses the mathematics of conditioning to define intervals and to regularize the inference.

Discuss with your group how you could use the first 99 sample outcomes to estimate $\alpha$ and $\beta$. 

*For now, we will assume that the form of the prior model (i.e. the Beta) is correct. We will discuss what happens if your prior form is misspecified (does not contain the true data generating distribution for any parameter values) later in the class. I generated this data using the model specified above.*

If you get stuck, consider the following pair of ideas:
1. You could select a set of previous observations with large $n_j$, estimate $\theta_j$ for each with a simple point estimator, then fit the resulting ensemble of estimated $\theta$'s to a Beta distribution (either by matching moments or via an MLE). *If you choose this route, discuss the trade-off between using many past samples, and incorporating less reliable past samples*
1. Consider the marginal distribution of $S$ drawn from the joint model $(\Theta, S)$. When evaluated at $S = s$ we called this the "evidence". Discuss the relationship between the marginal distribution of $\{S_j\}_{j=1}^{99}$ and the prior parameters. Does this relation have a familiar name? Could you use it to estimate $\alpha$ and $\beta$? If so, what standard framework could you adopt?

How you proceed from here is up to you. Note: if you choose 1, you will have to explain how you fit the Beta (i.e. how you solve the moment matching problem). If you choose 2, I strongly suggest you look ahead to the first part of the "Model Checking" section of the lab.
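For idea 1, the moment matching step can be sketched as below. The function name `fit_beta_by_moments` and the simulated inputs are assumptions. The sketch also treats the input values as if they were exact draws of $\theta$; with estimated $\hat{\theta}_j$ the extra Binomial sampling noise inflates the sample variance, which is exactly the trade-off you are asked to discuss.

```python
import numpy as np

def fit_beta_by_moments(theta_hat):
    """Match the sample mean and variance of theta_hat to a Beta(alpha, beta)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    mean = theta_hat.mean()
    var = theta_hat.var(ddof=1)
    # A Beta distribution with this mean must have variance below mean * (1 - mean)
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("sample variance must lie in (0, mean * (1 - mean))")
    concentration = mean * (1.0 - mean) / var - 1.0   # estimate of alpha + beta
    return mean * concentration, (1.0 - mean) * concentration

# Simulated thetas from an assumed Beta(2, 5), for illustration only
rng = np.random.default_rng(0)
theta_demo = rng.beta(2.0, 5.0, size=5000)
alpha_mm, beta_mm = fit_beta_by_moments(theta_demo)
print(f"alpha ~ {alpha_mm:.2f}, beta ~ {beta_mm:.2f}")
```

The formulas follow from solving $\mathbb{E}[\Theta] = \alpha/(\alpha+\beta)$ and $\text{Var}[\Theta] = \alpha\beta / \left((\alpha+\beta)^2(\alpha+\beta+1)\right)$ for $\alpha$ and $\beta$.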

Comment on your degree of confidence in your prior parameter fits. Think about simple procedures for evaluating your uncertainty in your fits with your group. Based on your experience toying with the posterior for different prior choices, do you believe you've resolved the parameters with enough confidence so that your posterior is reliable (i.e. the uncertainty captured in the posterior fully captures your uncertainty about the unknown)?

*Replace this text with your discussion. You are welcome to attempt the procedure you sketch out below.*

## Model Checking

### The Evidence

Derive the marginal distribution for the number of observed successes $S$ when $(\Theta,S)$ is drawn from the joint model with parameters $\alpha, \beta, n$. That is, derive an explicit formula for $\text{Pr}(S = s)$ for all $s \in \{0,1,2,...,n\}$. You may express your answer using any standard combinatorial functions (factorial, choose, $\Gamma$, Beta, etc). Do not try to evaluate the integrals directly. Instead, reference the normalization factor used in the Beta distribution's density function.
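Once you have a candidate formula, you can test it against a simulation estimate of the evidence. The sketch below is one way to do that; the function name `monte_carlo_evidence`, the number of draws, and the example parameters are assumptions.

```python
import numpy as np

def monte_carlo_evidence(n, alpha, beta, n_draws=200_000, seed=0):
    """Estimate Pr(S = s) for s = 0, ..., n by simulating the joint model."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(alpha, beta, size=n_draws)   # Theta ~ Beta(alpha, beta)
    s = rng.binomial(n, theta)                    # S | Theta ~ Binomial(n, Theta)
    # Relative frequency of each possible count, including counts never observed
    return np.bincount(s, minlength=n + 1) / n_draws

# Uniform prior with n = 10 trials
evidence_hat = monte_carlo_evidence(n=10, alpha=1.0, beta=1.0)
print(np.round(evidence_hat, 3))
```

Under the uniform prior ($\alpha = \beta = 1$) the evidence is flat, equal to $1/(n+1)$ for every $s$, which gives one known case to test your derived formula against before trying other parameter values.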

*Replace this text with your analysis*

Give an argument justifying the word "evidence" to describe $\text{Pr}(S = s)$. Relate your answer to the likelihood function of the prior parameters $\alpha, \beta$.
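If you want to experiment with treating the evidence as a likelihood for $(\alpha, \beta)$, the sketch below maximizes the summed log evidence over a set of independent draws. It relies on SciPy's built-in `scipy.stats.betabinom` for the marginal pmf, so it can also serve as a numerical check on your own derivation. The function names, the log-parameterization, the Nelder-Mead optimizer, and the simulated data are all assumptions rather than a prescribed method.

```python
import numpy as np
from scipy import optimize, stats

def negative_log_evidence(log_params, s, n):
    """Negative summed log Pr(S_j = s_j) over independent draws, as a function of (alpha, beta)."""
    alpha, beta = np.exp(log_params)   # optimize on the log scale to keep both positive
    return -np.sum(stats.betabinom.logpmf(s, n, alpha, beta))

def fit_prior_by_evidence(s, n, start=(1.0, 1.0)):
    """Maximize the evidence over (alpha, beta), starting from the uniform prior."""
    result = optimize.minimize(
        negative_log_evidence, x0=np.log(start), args=(s, n), method="Nelder-Mead"
    )
    alpha_hat, beta_hat = np.exp(result.x)
    return float(alpha_hat), float(beta_hat), result

# Simulated stand-in for the 99 prior draws, from an assumed Beta(2, 5)
rng = np.random.default_rng(0)
n_demo = rng.integers(10, 100, size=99)
theta_demo = rng.beta(2.0, 5.0, size=99)
s_demo = rng.binomial(n_demo, theta_demo)

alpha_hat, beta_hat, fit = fit_prior_by_evidence(s_demo, n_demo)
print(f"alpha ~ {alpha_hat:.2f}, beta ~ {beta_hat:.2f}")
```

Refitting on resampled or freshly simulated data is one simple way to see how much the fitted parameters move with only 99 draws.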

*Replace this text with your discussion*

### Posterior Predictive Distribution

Use your work above, and the rule that updates the prior to produce the posterior, to derive the posterior predictive distribution for a new set of $m$ sample outcomes, given an observed set of $n$ outcomes containing $s$ successes.

*Replace this text with your analysis*

With your group, discuss how you could use this distribution to check whether your model (and, deductively, your inferences regarding the unknown) are realistic given your observations.

Hint:
1. It may help to first think about trying to evaluate the probability of observing what you observed, conditioned on what you observed. This will motivate the need for the word "prediction", and the distinction between evaluating a likelihood, evaluating the evidence, and predictive checking.
1. Since we only have access to a fixed data set, consider breaking your data set into pieces that serve separate purposes. Just as we set aside a prior data set ($j \in \{1, 2, ..., 99\}$), we could further split the data to include an evaluation or test set. Discuss how you might compare posterior predictions to this test set, whether it would make sense to evaluate the posterior predictive probability of observing the test set, and what does or doesn't go wrong if you tried to evaluate the posterior predictive probability of observing the data *that you did observe (i.e. the data you conditioned on).*
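One simulation-based way to act on these hints is sketched below: draw $\theta$ from the posterior, draw a new success count given each $\theta$, and ask how extreme a held-out count looks among those draws. The function names, the two-sided tail probability, and the example numbers are assumptions; your group may prefer a different discrepancy measure.

```python
import numpy as np

def posterior_predictive_draws(s, n, alpha, beta, m_new, n_draws=10_000, seed=0):
    """Simulate success counts for m_new future trials, given s successes in n trials."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(s + alpha, n - s + beta, size=n_draws)   # posterior draws
    return rng.binomial(m_new, theta)                         # one new count per draw

def predictive_tail_prob(s_new, draws):
    """Two-sided tail probability of a held-out count under the simulated predictions."""
    lower = np.mean(draws <= s_new)
    upper = np.mean(draws >= s_new)
    return min(1.0, 2.0 * min(lower, upper))

# Hypothetical split: condition on 7 successes in 20 trials, predict 20 held-out trials
draws = posterior_predictive_draws(s=7, n=20, alpha=1.0, beta=1.0, m_new=20)
p_typical = predictive_tail_prob(7, draws)    # held-out count similar to the training count
p_extreme = predictive_tail_prob(20, draws)   # held-out count far from anything predicted
print(p_typical, p_extreme)
```

A very small tail probability on held-out data suggests that the model, the prior fit, or both are not predicting well. Applying the same check to the data you conditioned on will tend to look reassuring regardless, which is the issue raised in hint 2.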

If you start going in circles here, call over the instructor, or review BDA chapter 6.

*Replace this text with your discussion. Then, describe your method, and implement an example below.*

## Post-Script (Peeking at the Answer)

*This part is optional.*

Now that you have a pipeline for: (a) fitting the prior, (b) using it to produce inferences, and (c) rejecting models that produce implausible inferences, it's worth seeing whether that pipeline typically returns accurate inferences. 

Select a subset of the provided data as a prior set for fitting the prior parameters, then apply steps (b) and (c) to the remaining test cases. Think carefully about whether you should use step (c) to eliminate specific test cases, or to eliminate all inferences if enough test cases are sufficiently implausible. Then, compare your remaining answers to the true success probabilities.

Note: A nice way to summarize whether your pipeline is producing accurate answers when you have access to the truth is via a coverage test. That is, return an interval estimate for each unknown $\Theta_j$ in the test set that should hold the unknown with some large success probability (I usually repeat this for success probabilities 0.5, 0.75, 0.875, 0.95, 0.99 or something to that effect). Then, compute the fraction of the intervals that *actually* contained the truth. The nominal and observed fractions should roughly match. Notice that this is essentially a procedural guarantee. Consider the analogy to confidence intervals. This process of checking whether the pipeline produces intervals that actually cover the unknown as often as they purport to cover the unknown is especially important in more complicated examples when we can't perform step (b) exactly, and instead use approximations to perform posterior inference.
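A coverage test along these lines might look like the sketch below. The function name `coverage_table`, the equal-tailed intervals, and the simulated test set are assumptions; in the lab you would pass in your held-out rows of `success_counts`, the matching rows of `true_probabilities`, and your fitted $\alpha, \beta$.

```python
import numpy as np
from scipy import stats

def coverage_table(s, n, theta_true, alpha, beta, levels=(0.5, 0.75, 0.875, 0.95, 0.99)):
    """For each nominal level, the fraction of equal-tailed posterior intervals covering the truth."""
    s, n, theta_true = map(np.asarray, (s, n, theta_true))
    a_post, b_post = s + alpha, n - s + beta
    table = {}
    for p in levels:
        tail = (1.0 - p) / 2.0
        lo = stats.beta.ppf(tail, a_post, b_post)
        hi = stats.beta.ppf(1.0 - tail, a_post, b_post)
        table[p] = float(np.mean((lo <= theta_true) & (theta_true <= hi)))
    return table

# Simulated test set where the prior is known to be correct (assumed Beta(2, 5))
rng = np.random.default_rng(0)
n_demo = rng.integers(10, 100, size=2000)
theta_demo = rng.beta(2.0, 5.0, size=2000)
s_demo = rng.binomial(n_demo, theta_demo)

coverage = coverage_table(s_demo, n_demo, theta_demo, alpha=2.0, beta=5.0)
for level, observed in coverage.items():
    print(f"nominal {level:.3f} -> observed {observed:.3f}")
```

Rerunning the demo with deliberately wrong prior parameters in the call to `coverage_table` shows how a poor prior fit appears as a gap between nominal and observed coverage. With only a few dozen test cases, expect the observed fractions to be noisy.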