In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

## Learning Objectives

In this notebook, we're going to get some practice writing data generating processes,
and calculating joint likelihoods between our data and model,
using the SciPy statistics library.

## Simulating coin flips (again!)

In this notebook, we are going to have some fun telling _probabilistic stories_.

We're going to stick with coin flip simulations, because it's a very good "simplest complex model".

Previously, we constructed coin flips where we knew the parameter $p$ (probability of heads) precisely.
This time, though, we're going to construct a model of coin flips
that no longer involves a fixed/known $p$,
but instead involves a $p$ that is not precisely known.

### Protocol

If we have a $p$ that is not precisely known, we can set it up by instantiating a probability distribution for it, rather than a fixed value.

How do we decide what distribution to use?
Primarily, the criterion that should guide us is the _support_ of the distribution,
that is, the range of values over which the probability distribution is defined.

$p$ must be a value that is bounded between 0 and 1.
As such, the choice of probability distribution for $p$ is most intuitively the Beta distribution,
which provides a probability distribution over the interval $[0, 1]$.
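
To make the idea of support concrete, here is a small sketch comparing draws from a Beta distribution with draws from a Gaussian. The parameter values (`a=2`, `b=5`, and a Gaussian centred at 0.5) and the seed are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy import stats as sts

rng = np.random.default_rng(42)

# Draws from a Beta distribution always land inside [0, 1].
p_draws = sts.beta(a=2, b=5).rvs(size=10_000, random_state=rng)
print(p_draws.min(), p_draws.max())

# Draws from a Gaussian centred at 0.5 can fall outside [0, 1],
# so they cannot all be interpreted as probabilities.
normal_draws = sts.norm(loc=0.5, scale=0.5).rvs(size=10_000, random_state=rng)
print((normal_draws < 0).sum(), (normal_draws > 1).sum())
```

Because every Beta draw is a valid probability, it can be passed straight into a Bernoulli without any clipping.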

Taking that value drawn from the Beta, we can pass it into the Bernoulli distribution,
and then draw an outcome (either 1 or 0).
In doing so, we now have the makings of a __generative model__ for our coin flip data!

### Generating in code

Let's see the algorithmic protocol above implemented in code!

In [None]:
from scipy import stats as sts
import numpy as np

def coin_flip_generator() -> np.ndarray:
    """
    Coin flip generator for a `p` that is not precisely known.
    """
    p = sts.beta(a=10, b=10).rvs(1)
    result = sts.bernoulli(p=p).rvs(1)
    return result

coin_flip_generator()

### Graph form

If we visualize this model in graphical form,
it would look something like this:

In [None]:
from bayes_tutorial.solutions.simulation import coin_flip_pgm

coin_flip_pgm()

In this graph, each node is a random variable. For example, `result` is the random variable that models outcomes. It accepts a parameter `p`, which is itself a random variable that does not depend on any other random variable. `p` does, however, depend on two parameters, $\alpha$ and $\beta$, which are fixed.

The graphical form expresses _conditional dependence_ between random variables, that is to say, `result`'s draws depend on the value of `p` drawn. In math symbols, we would write this joint distribution between `p` and `result` as:

$$P(p, result) = P(result | p)P(p)$$

The `|` tells us that `result` is conditioned on, or depends on, the value of the random variable `p`.

The graphical form is definitely a simplified view, in that we don't show the exact probability distributions by which each random variable is distributed. That is what can make reading the diagrams a bit confusing at first, though with practice, things get much easier over time.
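
The factorization $P(p, result) = P(result | p)P(p)$ can also be checked numerically. The sketch below assumes the same Beta(10, 10) prior used in the generator; the helper name `joint_density` is hypothetical and not part of the tutorial's solutions.

```python
import numpy as np
from scipy import stats as sts


def joint_density(p: float, result: int, a: float = 10, b: float = 10) -> float:
    """Evaluate P(p, result) = P(result | p) * P(p)."""
    prior = sts.beta(a=a, b=b).pdf(p)             # P(p)
    conditional = sts.bernoulli(p=p).pmf(result)  # P(result | p)
    return float(conditional * prior)


p = 0.4
# Summing the joint over both possible outcomes should recover the prior density P(p).
total = joint_density(p, 0) + joint_density(p, 1)
print(total, sts.beta(a=10, b=10).pdf(p))
```

Summing over the outcomes removes `result` from the joint distribution, which is why the two printed numbers should agree.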

## Prior Information

The astute eyes amongst you will notice 
that the Beta distribution has parameters of its own,
so how do we instantiate that?
Well, one thing we can do is bring in some _prior information_ to the problem.

Is our mental model of this coin that it behaves like billions of other coins in circulation,
in that it will generate outcomes with basically equal probability?
Turns out, the Beta distribution can assign credibility in this highly opinionated fashion!
And by doing so, we are injecting _prior information_
by instantiating a Beta _prior distribution_.

In [None]:
from ipywidgets import FloatSlider, interact, Checkbox
import matplotlib.pyplot as plt
import numpy as np


alpha = FloatSlider(value=2, min=1.0, max=100, step=1, description=r'$\alpha$')
beta = FloatSlider(value=2, min=1.0, max=100, step=1, description=r'$\beta$')
equal = Checkbox(value=False, description=r"set $\beta$ to be equal to $\alpha$")

@interact(alpha=alpha, beta=beta, equal=equal)
def visualize_beta_distribution(alpha, beta, equal):
    if equal:
        beta = alpha
    dist = sts.beta(a=alpha, b=beta)
    xs = np.linspace(0, 1, 100)
    ys = dist.pdf(xs)
    plt.xlabel("Support")
    plt.ylabel("Probability density")
    plt.plot(xs, ys)
    plt.title(fr"$\alpha$={alpha}, $\beta$={beta}")


As you play around with the slider, notice how when you increase the $\alpha$ and $\beta$ sliders,
the width of the probability distribution decreases,
while the height of the maximum value increases,
thus reflecting greater _certainty_ in what values for $p$ get drawn.
Using this _prior distribution_ on $p$, we can express what we think is reasonable
given _prior knowledge_ of our system.
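
If you prefer numbers to sliders, the same narrowing can be seen in the standard deviation of the Beta distribution. This is a minimal sketch that assumes $\alpha = \beta$ (the "equal" checkbox case) and an arbitrary handful of values.

```python
from scipy import stats as sts

# For alpha == beta the distribution stays centred on 0.5,
# while its standard deviation shrinks as the parameters grow.
stds = {a: sts.beta(a=a, b=a).std() for a in (1, 2, 10, 100)}
means = {a: sts.beta(a=a, b=a).mean() for a in (1, 2, 10, 100)}

for a, sd in stds.items():
    print(f"alpha = beta = {a:>3}: mean = {means[a]:.2f}, std = {sd:.4f}")
```

Larger values of $\alpha$ and $\beta$ therefore encode a stronger prior opinion that the coin is close to fair.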

### Justifying priors

Some of you, at this point, might be wondering - is there an algorithmic protocol for justifying our priors too?
Can we somehow "just pass our priors into a machine and have it tell us if we're right or wrong"?

It's a great wish, but it remains just that: wishful thinking,
much like the "Aye Eye Drug", where a disease is plugged in
and the target and molecule are spat out.
(I also find it to not be an inspiring goal,
as the fun of discovery is removed.)

Rather, as with all modelling exercises,
I advocate for human debate about the model.
After all, humans are the ones taking action based on, and being affected by, the modelling exercise.
There are a few questions we can ask to help us decide:

- Are the prior assumptions something a _reasonable_ person would make?
- Is there evidence that lies outside of our problem that can help us justify these priors?
- Is there a _practical_ difference between two different priors?
- In the limit of infinite data, do various priors converge? (We will see later how this convergence can happen.)

## Exercises

It's time for some exercises to practice what we've learnt!

### Exercise: Control prior distribution

In this first exercise, I would like you to modify the `coin_flip_generator` function
such that it allows a user to control what the prior distribution on $p$ should look like
before returning outcomes drawn from the Bernoulli.

Be sure to check that the values of `alpha` and `beta` are valid values, i.e. floats greater than zero.

In [None]:
from bayes_tutorial.solutions.simulation import coin_flip_generator_v2

# Your answer below:
# def coin_flip_generator_v2(alpha: float, beta: float) -> np.ndarray:
#     pass


### Exercise: Simulate data

Now, simulate data generated from your new coin flip generator.

In [None]:
from typing import List

from bayes_tutorial.solutions.simulation import generate_many_coin_flips

# Your answer below:
# def generate_many_coin_flips(n_draws: int, alpha: float, beta: float) -> List[int]:
#     pass

generate_many_coin_flips(50, alpha=5, beta=1)

With that written, we now have a "data generating" function!

## Joint likelihood

Remember back in the first notebook how we wrote about evaluating the joint likelihood of multiple coin flip data
against an assumed Bernoulli model?

We wrote a function that looked something like the following:

```python
import numpy as np
from scipy import stats as sts
from typing import List

def likelihood(data: List[int]):
    c = sts.bernoulli(p=0.5)
    # `np.prod` replaces the `np.product` alias, which was removed in NumPy 2.0.
    return np.prod(c.pmf(data))
```

Now, if $p$ is something that is not precisely known,
then any "guesses" of $p$ will have to be subject to the Likelihood principle too,
which means that we need to jointly evaluate the likelihood of $p$ and our data.

Let's see that in code:

In [None]:
def coin_flip_joint_likelihood(data: List[int], p: float) -> float:
    p_like = sts.beta(a=10, b=10).pdf(p)  # evaluate guesses of `p` against the prior distribution
    data_like = sts.bernoulli(p=p).pmf(data)
    
    # `np.prod` replaces the `np.product` alias, which was removed in NumPy 2.0.
    return np.prod(data_like) * np.prod(p_like)

coin_flip_joint_likelihood([1, 1, 0, 1], 0.3)
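
A single guess of `p` is not very informative on its own, so it can help to evaluate many guesses at once. The sketch below re-declares the joint likelihood under a different, hypothetical name (`joint_likelihood_sketch`) so that it is self-contained, and scans an evenly spaced grid of guesses; the grid resolution is an arbitrary choice.

```python
import numpy as np
from scipy import stats as sts


def joint_likelihood_sketch(data, p):
    """Joint likelihood of the data and a guess of p under a Beta(10, 10) prior."""
    p_like = sts.beta(a=10, b=10).pdf(p)
    data_like = np.prod(sts.bernoulli(p=p).pmf(data))
    return p_like * data_like


data = [1, 1, 0, 1]
p_grid = np.linspace(0.01, 0.99, 99)  # candidate guesses of p
likes = np.array([joint_likelihood_sketch(data, p) for p in p_grid])

# The guess with the highest joint likelihood on this grid.
best_p = p_grid[np.argmax(likes)]
print(best_p)
```

With three heads out of four flips, the best guess is pulled above 0.5, but the prior keeps it from moving all the way to 0.75.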

## Joint _log_-likelihood

Because likelihood values are often small numbers,
multiplying many of them together
can underflow to zero in floating-point arithmetic.
As such, we often take the log of the likelihood.
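
Here is a small demonstration of the underflow problem, using 2000 simulated flips evaluated against a fair coin. The number of flips and the seed are assumptions picked only to make the effect visible.

```python
import numpy as np
from scipy import stats as sts

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=2000)  # simulated 0/1 outcomes

coin = sts.bernoulli(p=0.5)

# Multiplying 2000 values of roughly 0.5 underflows to exactly zero.
naive_likelihood = np.prod(coin.pmf(flips))

# Summing the log-PMFs keeps the result well within floating-point range.
log_likelihood = np.sum(coin.logpmf(flips))

print(naive_likelihood, log_likelihood)
```

The product carries no usable information once it hits zero, whereas the log-likelihood can still be compared across different guesses of `p`.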

### Exercise: Implementing joint _log_-likelihood

Doing this means we can use summations on our likelihood calculations,
rather than products.

Because of the rules of logarithms, what originally was:

$$P(D|p)P(p)$$

becomes:

$$\log(P(D|p)) + \log(P(p))$$

Also, if you think about the likelihood of the data,
$P(D|p)$ is actually $P(D_1, D_2, ..., D_n | p)$ for $n$ data points,
but because the data points are independent of one another given $p$, it factorizes into $P(D_1|p)P(D_2|p)...P(D_n|p)$. Taking the log then allows us to sum up the logs of the PMFs!
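
As a quick sanity check of the logarithm rule for the data term alone (leaving the prior term for the exercise), the sketch below compares the log of the product of PMFs with the sum of log-PMFs. The data and the guess `p=0.3` are the same illustrative values used elsewhere in this notebook.

```python
import numpy as np
from scipy import stats as sts

data = [1, 1, 0, 1]
coin = sts.bernoulli(p=0.3)

# log(P(D_1|p) * P(D_2|p) * ... * P(D_n|p))
log_of_product = np.log(np.prod(coin.pmf(data)))

# log P(D_1|p) + log P(D_2|p) + ... + log P(D_n|p)
sum_of_logs = np.sum(coin.logpmf(data))

print(np.isclose(log_of_product, sum_of_logs))
```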

In [None]:
from bayes_tutorial.solutions.simulation import coin_flip_joint_loglike

# Your answer below:
# def coin_flip_joint_loglike(data: List[int], p: float) -> float:
#     pass

coin_flip_joint_loglike([1, 1, 0, 1], 0.3)

### Exercise: Confirm equality

Now confirm that the joint log-likelihood is of the same value as the log of the joint likelihood,
subject to machine precision error.

In [None]:
np.log(coin_flip_joint_likelihood([1, 1, 0, 1], 0.3))

## Key Idea: Statistical Stories

Before we can go into probabilistic programming,
one has to know the skill of "telling statistical stories".

In telling statistical stories, we are using probability distributions
to represent the pieces of our problem that are difficult to precisely know.
It is because they are difficult to precisely know
that we use random variables, distributed by some probability distribution,
as the modelling tool of choice.

### Stories of probability distributions

One skill that is necessary in knowing how to choose
what probability distribution to associate with a random variable
is to learn their "distribution stories".

Here's an example, taken from [Justin Bois' excellent resource][jsbois],
for the Bernoulli distribution:

> A Bernoulli trial is an experiment that has two outcomes that can be encoded as success ($y=1$) or failure ($y=0$). The result $y$ of a Bernoulli trial is Bernoulli distributed.

[jsbois]: http://bois.caltech.edu/dist_stories/t3b_probability_stories.html

### Workflow

A generally usable workflow for telling statistical stories
is to work backwards from the data.
Using our evergreen coin flip example, if we start with coin flip-like data,
and have a hunch that our data are never going to be anything other than 0s and 1s,
then we might use a Bernoulli to model the data.
And then as we saw above, if we realize that we can't be precisely sure
of the value $p$, then we model it using a Beta distribution.
In many cases, knowing the distribution of $p$ is useful.

One might ask, then, how about the parameters of the Beta distribution?
Do we have to give _them_ distributions too?

The answer is "usually not", as we consider them "nuisance" parameters:
parameters that we need to have, but can't take action on even if we know something about them.

## Exercises

To help you get familiar with this skill,
I've designed a number of exercises below that will help you get some practice.
Be sure to reference the [distribution stories][jsbois]
for any probability distributions mentioned in here.

[jsbois]: http://bois.caltech.edu/dist_stories/t3b_probability_stories.html

As you embark on the exercises, always remember:

![](https://memegenerator.net/img/instances/84732105/if-youre-uncertain-about-it-put-a-distribution-on-it.jpg)

### Exercise: Simulate the number of car crashes per week at Brigham circle

Brigham circle is a place in Boston near the Longwood Medical Area, and is notorious for car crashes. (I made the car crashes piece up.)

Write down a statistical simulation that generates counts of car crashes per week at Brigham circle.

Some hints that may help:

- Count data are usually modelled with [Poisson][poisson] or [Negative Binomial][negbinom] distributions.
- If you use the Poisson distribution, then its key parameter, the "rate" parameter, is a positive real number (positive floats). The [exponential distribution][expon] is a good choice of prior for it.
- If you use the negative binomial distribution, remember that it takes in one integer and one float parameter.
- The official answer uses the Poisson distribution, and follows the following graphical form.

[expon]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html
[poisson]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.poisson.html#scipy.stats.poisson
[negbinom]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.nbinom.html#scipy.stats.nbinom

In [None]:
from bayes_tutorial.solutions.simulation import car_crash_pgm

car_crash_pgm()

In [None]:
def car_crash_generator():
    """Generate a "per week" car crash data point"""
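    # NOTE: the first positional argument of `sts.expon` is `loc`, not `scale`,
    # so the rate is drawn from a unit-scale exponential shifted to start at 0.5.
    rate = sts.expon(loc=0.5).rvs()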
    crashes = sts.poisson(mu=rate).rvs()
    return crashes

Now, simulate 10 draws from the generator.

In [None]:
[car_crash_generator() for _ in range(10)]
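
For comparison, here is a sketch of the negative binomial alternative mentioned in the hints. It is not the official answer: the function name `car_crash_generator_nbinom`, the fixed `n=5`, and the Beta(2, 2) prior on the success probability are all assumptions made for illustration.

```python
from scipy import stats as sts


def car_crash_generator_nbinom(n: int = 5, a: float = 2.0, b: float = 2.0) -> int:
    """Generate a "per week" car crash count from a negative binomial model."""
    # The success probability is not precisely known, so draw it from a Beta.
    p = sts.beta(a=a, b=b).rvs()
    # Number of failures observed before `n` successes occur.
    crashes = sts.nbinom(n=n, p=p).rvs()
    return int(crashes)


draws = [car_crash_generator_nbinom() for _ in range(10)]
print(draws)
```

The story differs from the Poisson one, but the structure is the same: draw the uncertain parameter first, then draw the count conditioned on it.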

### Exercise: Joint log-likelihood function for observed car crashes

Now, write down the joint log-likelihood function for observed car crashes and its key parameters.

In [None]:
from bayes_tutorial.solutions.simulation import car_crash_loglike

# Uncomment the block below and fill in your answer.
# def car_crash_loglike(rate: float, crashes: List[int]) -> float:
#     """Evaluate likelihood of per-week car crash data points."""
#     
#     your answer goes here
# 
#     return rate_like + crashes_like

### Exercise: Evaluate joint log-likelihood of data and parameter guesses

Now that you have a log likelihood function that was constructed from your priors,
evaluate guesses of car crash rates against the following data.

To best visualize this, make a plot of log likelihood on the y-axis against rate on the x-axis.

In [None]:
from bayes_tutorial.solutions.simulation import car_crash_data, car_crash_loglike_plot
import matplotlib.pyplot as plt

data = car_crash_data()

# Comment out the next line before filling in your answer
car_crash_loglike_plot();

# Your answer goes below:


### Bonus exercise

As a bonus exercise, add a few more data points.

### Exercise: Simulate the heights of men in North and South Korea

It is well-known that there is a height difference between adult men in North and South Korea,
due to differences in nutrition (direct cause) resulting from government (mis-)management. 

Write two functions that simulate the data generating process for observed human male height in North and South Korea.
Assume that South Korean men are somewhere in the vicinity of 180 cm on average,
while North Korean men are somewhere in the vicinity of 165 cm on average,
but that this is not precisely known.

Some guides to help:

- Name the two functions `s_korea_generator()` and `n_korea_generator()`.
- For height, a [Gaussian distribution][gaussian] is a _good enough_ model, even though strictly speaking height is positive-bound while the Gaussian is not.
- We should operate in the centimeter scale, as this scale places us in the hundreds range, which makes things easier to reason about.
- Because the spread of heights might not be precisely known, we can model this uncertainty by placing an [exponential distribution][expon] over it, because scale parameters are positive-only distributed.
- Assume that the mean height and the variance of the height distribution cannot be precisely known, which means you have to place a probability distribution over those parameters too.

[gaussian]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm
[expon]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html#scipy.stats.expon

The graphical form would look like this:

In [None]:
from bayes_tutorial.solutions.simulation import korea_pgm

korea_pgm()

In [None]:
from bayes_tutorial.solutions.simulation import s_korea_generator, n_korea_generator

# Your answer goes here.
# def s_korea_generator():
#     pass 

# def n_korea_generator():
#     pass

In [None]:
s_korea_generator()

In [None]:
n_korea_generator()

You might notice that the two are of the same structure, so you can probably merge them into one function:

In [None]:
def korea_height_generator(mean_loc: float, mean_scale: float, scale_scale: float) -> float:
    mean = sts.norm(loc=mean_loc, scale=mean_scale).rvs()
    scale = sts.expon(scale=scale_scale).rvs()
    height = sts.norm(loc=mean, scale=scale).rvs()
    return height

n_korea_height = korea_height_generator(mean_loc=165, mean_scale=3, scale_scale=1)
s_korea_height = korea_height_generator(mean_loc=180, mean_scale=3, scale_scale=1)
n_korea_height, s_korea_height
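
A single draw per country says little, so it can be useful to simulate many heights at once. The sketch below is a vectorized variant of the generator above under the same distributional assumptions; the function name `simulate_heights`, the sample size, and the seed are hypothetical choices.

```python
import numpy as np
from scipy import stats as sts


def simulate_heights(mean_loc, mean_scale, scale_scale, n, rng):
    """Draw `n` heights, each with its own freshly drawn mean and scale."""
    means = sts.norm(loc=mean_loc, scale=mean_scale).rvs(size=n, random_state=rng)
    scales = sts.expon(scale=scale_scale).rvs(size=n, random_state=rng)
    return sts.norm(loc=means, scale=scales).rvs(size=n, random_state=rng)


rng = np.random.default_rng(0)
n_korea_sim = simulate_heights(165, 3, 1, n=1000, rng=rng)
s_korea_sim = simulate_heights(180, 3, 1, n=1000, rng=rng)

print(n_korea_sim.mean(), s_korea_sim.mean())
```

Each simulated height uses a new mean and scale, matching the single-draw generator, so the spread reflects uncertainty in the parameters as well as variation between individuals.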

### Exercise: Joint log-likelihood of heights

Similar to the exercise above, calculate the joint log-likelihood of heights with possible values of mean and scale evaluated against the prior distributions stated.

To be a bit more precise, create one log-likelihood function for South Korean heights and one for North Korean heights, and then one for their combined joint log-likelihood.

In [None]:
from bayes_tutorial.solutions.simulation import s_korea_height_loglike, n_korea_height_loglike, joint_height_loglike

# Your answer for South Korean log likelihoods here
# def s_korea_height_loglike(mean: float, scale: float, heights: List[int]) -> float:
#     pass

# Your answer for North Korean log likelihoods here
# def n_korea_height_loglike(mean: float, scale: float, heights: List[int]) -> float:
#     pass

# Your answer for the combined joint likelihood of South and North Korean heights
# def joint_height_loglike(s_mean: float, s_scale: float, n_mean: float, n_scale: float, s_heights: List[int], n_heights: List[int]) -> float:
#     pass

### Exercise: Evaluate log-likelihood of true parameter guesses

Now that you've got a log likelihood function written down,
evaluate some guesses as to what the best "mean" and "scale" values are,
given the data
and the priors that you specified in your log likelihood.

In [None]:
from bayes_tutorial.solutions.simulation import s_korea_height_data, n_korea_height_data
s_korea_heights = s_korea_height_data()
n_korea_heights = n_korea_height_data()

In [None]:
s_mean = FloatSlider(min=150, max=190, value=155, step=1)
s_scale = FloatSlider(min=0.1, max=10, value=2, step=0.1)
n_mean = FloatSlider(min=150, max=190, value=155, step=1)
n_scale = FloatSlider(min=0.1, max=10, value=2, step=0.1)

@interact(s_mean=s_mean, s_scale=s_scale, n_mean=n_mean, n_scale=n_scale)
def evaluate_joint_likelihood(s_mean: float, s_scale: float, n_mean: float, n_scale: float) -> float:
    return joint_height_loglike(s_mean, s_scale, n_mean, n_scale, s_korea_heights, n_korea_heights)

## Visualizing the full uncertainty

Exciting stuff ahead! Notice how it's super troublesome to manually slide sliders all over the place.
Well, we're going to attempt to solve that by using Monte Carlo simulation!

In [None]:
# Firstly, draw numbers uniformly in the regime of 130-210 for heights, and 1-6 for scales.
def draw():
    # sts.uniform(loc, scale) is uniform on [loc, loc + scale].
    s_mean, n_mean = sts.uniform(130, 80).rvs(2)  # bounds are 130-210
    s_scale, n_scale = sts.uniform(1, 5).rvs(2)   # bounds are 1-6
    
    return (s_mean, s_scale, n_mean, n_scale)

# Then, set up 2000 draws
params = np.array([draw() for _ in range(2000)])
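
The `loc`/`scale` parameterization of `sts.uniform` is easy to misread as lower and upper bounds, so here is a short check of the bounds used in the draws. It only relies on the percent point function (`ppf`) of the frozen distribution.

```python
from scipy import stats as sts

# sts.uniform(loc, scale) is uniform on [loc, loc + scale],
# NOT on [loc, scale].
height_prior = sts.uniform(loc=130, scale=80)
scale_prior = sts.uniform(loc=1, scale=5)

height_lower, height_upper = height_prior.ppf([0, 1])
scale_lower, scale_upper = scale_prior.ppf([0, 1])

print(height_lower, height_upper)
print(scale_lower, scale_upper)
```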

In [None]:
# Now, we evaluate the log-likelihood.
loglikes = []
for param_set in params:
    loglikes.append(evaluate_joint_likelihood(*param_set))
loglikes = np.array(loglikes)

In [None]:
import pandas as pd

param_df = pd.DataFrame(params)
loglike_df = pd.DataFrame(loglikes)

plotting_df = pd.concat([param_df, loglike_df], axis=1)
plotting_df.columns = ["s_mean", "s_scale", "n_mean", "n_scale", "loglike"]
plotting_df.head()

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 6))

s_mean = params[:, 0]
s_scale = params[:, 1]
n_mean = params[:, 2]
n_scale = params[:, 3]

alpha=1
axes[0, 0].hexbin(s_mean, s_scale, C=loglikes, alpha=alpha)
axes[0, 0].set_xlabel("South Korea Mean")
axes[0, 0].set_ylabel("South Korea Scale")

axes[0, 1].hexbin(s_mean, n_mean, C=loglikes, alpha=alpha)
axes[0, 1].set_xlabel("South Korea Mean")
axes[0, 1].set_ylabel("North Korea Mean")

axes[0, 2].hexbin(s_mean, n_scale, C=loglikes, alpha=alpha)
axes[0, 2].set_xlabel("South Korea Mean")
axes[0, 2].set_ylabel("North Korea Scale")

axes[1, 0].hexbin(s_scale, n_mean, C=loglikes, alpha=alpha)
axes[1, 0].set_xlabel("South Korea Scale")
axes[1, 0].set_ylabel("North Korea Mean")

axes[1, 1].hexbin(s_scale, n_scale, C=loglikes, alpha=alpha)
axes[1, 1].set_xlabel("South Korea Scale")
axes[1, 1].set_ylabel("North Korea Scale")

axes[1, 2].hexbin(n_mean, n_scale, C=loglikes, alpha=alpha)
axes[1, 2].set_xlabel("North Korea Mean")
axes[1, 2].set_ylabel("North Korea Scale")

plt.tight_layout()

### Exercise: What are _plausible_ values?

Given the chart that you see above, 
what are the plausible values of the mean and scale parameters?
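
One way to complement reading the chart by eye is to rank the Monte Carlo draws by log-likelihood and look at the range covered by the best ones. The sketch below is self-contained and therefore uses stand-in data for a single group (simulated heights around 170 cm) and a plain Gaussian log-likelihood without priors; those choices, and the cut-off of 20 draws, are assumptions for illustration rather than the tutorial's solution.

```python
import numpy as np
import pandas as pd
from scipy import stats as sts

rng = np.random.default_rng(1)
heights = rng.normal(loc=170, scale=5, size=50)  # stand-in data, not the tutorial's

# Draw candidate parameters uniformly, as in the Monte Carlo step above.
means = sts.uniform(loc=130, scale=80).rvs(size=2000, random_state=rng)
scales = sts.uniform(loc=1, scale=5).rvs(size=2000, random_state=rng)

# Log-likelihood of the data under each candidate (mean, scale) pair.
loglikes = np.array(
    [sts.norm(loc=m, scale=s).logpdf(heights).sum() for m, s in zip(means, scales)]
)

df = pd.DataFrame({"mean": means, "scale": scales, "loglike": loglikes})

# Keep the 20 best-scoring draws and summarize the range they cover.
top = df.nlargest(20, "loglike")
print(top[["mean", "scale"]].agg(["min", "max"]))
```

The ranges printed at the end give a rough, numeric sense of which parameter values are plausible; the next notebook replaces this crude ranking with proper inference.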

## Inference: Figuring out plausible values

Now that you've seen how to use the `scipy.stats` module to write
data-generating stories and simulate data,
in the next notebook, we are going to use PyMC3
to help us with the inferential protocol,
i.e. inferring the most credible values of key model parameters, given the data.
Hop over to the next chapter to learn about the Inference Button (tm)!

## Solutions

Here are the solutions to the chapter.

In [None]:
from bayes_tutorial.solutions import simulation

simulation??