In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
from scipy import stats

## Learning Objectives

In this notebook, you will:

- Learn how to define foundational terminology for reasoning about probability, distributions, and likelihoods.
- Use probability distributions implemented in the `scipy.stats` library to understand those terms.

## So, what is... probability?

As a learner of Bayesian statistics, 
you, ahem, _probably_ (\*cough cough\*) have heard of the term "probability".
What exactly is this beast?
How do I learn it without getting lost in an entanglement of unclear terminology?

Getting down to its core, using relatable and understandable words and pictures,
is what I want to do in this notebook.

## Spaces

Firstly, probability is concerned with _spaces of possible outcomes_,
and outcomes are drawn from these spaces of possibilities in a non-deterministic fashion.
In classical statistics, this "space" of possibilities is also called a **"support"**.
Remember that term, it's super important!
You might encounter it in the literature that you read.

To illustrate what "spaces" are, here are a few examples.

### Coin flips

With the classic coin flip, the complete space of possibilities is heads and tails.
Under the most common circumstances, landing on neither heads nor tails
is considered out of the space of possibilities.

In math notation, we might say the space of possibilities for the coin flip, $S_{\text{coin}}$ is:

$$S_{\text{coin}} = \{\text{H}, \text{T}\}$$

And visually:

In [None]:
from daft import PGM

G = PGM(grid_unit=1)
G.add_node("H", "H", x=0, y=0)
G.add_node("T", "T", x=2, y=0)
plate = [-1.5, -1.5, 5, 3]
G.add_plate(plate, label="space of possibilities")
G.show()

### Rolling dice

With dice rolling, the complete space of possibilities that we are interested in
are each of the sides' values.
On a six-sided dice, the space of possibilities is the sides 1-6 inclusive.
The side with the number 7... doesn't exist.
And as such, it's outside of the space of possibilities.

In math notation, the space of possibilities $S_{\text{dice}}$ could be denoted as a set:

$$S_{\text{dice}} = \{1, 2, 3, 4, 5, 6\}$$

And visually:

In [None]:
G = PGM(grid_unit=1)
for i in range(6):
    G.add_node(i+1, i+1, x=(i % 3) * 2, y = (i // 3) * 2)
plate = [-1.5, -1.5, 7, 5]
G.add_plate(plate, label="space of possibilities")
G.show()

### Counts of stuff

When we record counts of objects,
the space of possible values that those counts could take on
is the set of all non-negative integers,
that is, the positive integers together with 0.
For example, when counting the number of car crashes at an intersection,
we should expect the positive integers, as well as 0, to be present.
We would not expect to see -1 or 3.4 in the set of observed values.

In math notation, we might say the set of possible values for counts of stuff, $S_{\text{counts}}$ is:

$$S_{\text{counts}} = \{0, 1, 2, \ldots\}$$

It's a little hard to draw this space visually, but I think you might get the point.

### Heights of adult people

When measured in meters,
the space of possibilities is the set of all positive real numbers.
Thus, we would expect to see 1.6, 1.49, and 1.88 as valid heights (in meters),
but we would not expect to see -1.3 meters as a valid height.
Strictly speaking, we might bound the space of possibilities even further,
such as bounding the values to known world records
of the shortest and tallest human beings ever seen,
though oftentimes for convenience, we might elect not to.
(Though biologically highly implausible and improbable,
a 0.1m tall adult human is _technically_ within the space of all positive real numbers.)

In math notation, we might say that the set of possible values for human heights, $S_{\text{height}}$, is:

$$S_{\text{height}} = \mathbb{R}^+$$

## Credibility Points

Having dealt with probability's first concept, "spaces of outcomes",
we now have to turn our attention to what I commonly refer to as "credibility points".
Probability is concerned with credibility points assigned to each of the possible outcomes.

The easiest way to think about these credibility points is to think of them in terms of an "area".
To help you visualize, imagine that you have a square of area 1.
What you want to do is break the square of area 1 into N smaller pieces,
each corresponding to one of the possible outcomes in the space of possible outcomes,
and each having an area proportional to the credibility points we are going to assign
to that outcome.
Thus, with a probability distribution, a core thing we must know 
is the assignment of area to each individual value.

Remember, the _only_ requirements on how area is assigned to each of the outcomes are:

- It must be non-negative and real-valued.
- The total area assigned to all possible outcomes must sum to 1.

Let's look at a few examples!

### Discrete "Credibility Points"

For a coin flip, if you have a fair coin,
you might assign 0.5 of the area to one outcome, and 0.5 of the area to the other.
Or if you have a biased coin, you would assign $p$ to one outcome and $1-p$ to the other.

Visually:

In [None]:
import matplotlib.pyplot as plt
from matplotlib import patches
from itertools import cycle

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
axes[0].set_xlim(0, 1)
axes[0].set_ylim(0, 1)

colors = cycle(["red", "blue"])

def plot_outcomes(proportions, colors, ax):
    total = 0
    for n, h in proportions.items(): # n = name, h = height
        rect = patches.Rectangle((0, total), 1, h, linewidth=1, color=next(colors))
        ax.add_patch(rect)
        total += h

plot_outcomes({"H": 0.5, "T": 0.5}, colors, axes[0])
axes[0].set_title("Fair coin")
plot_outcomes({"H": 0.3, "T": 0.7}, colors, axes[1])
axes[1].set_title("Biased coin")

def set_ticks(ax):
    ax.set_xticks([])
    ax.set_yticks([0, 1])
    
set_ticks(axes[0])
set_ticks(axes[1])

If you have a six-sided dice,
you might assign $\frac{1}{6}$ of the area to each of the outcomes,
if the dice were fair.

Visually:

In [None]:
colors = ["red", "black", "yellow", "blue", "white", "purple"]
fig, ax = plt.subplots()
plot_outcomes({i: 1/6 for i in range(1, 7)}, colors=cycle(colors), ax=ax)
ax.set_aspect("equal")

In the fair coin and fair dice examples above,
the assignment of area is _uniform_ across all possible outcomes.

Now, there is usually a math function that assigns area to each outcome.
It's a bit like assigning a lump of mass to each outcome.
The function that we get to define,
or that others may have defined for us,
is called the **probability mass function**.
You may have seen it for the coin flip and dice below:

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4), sharey=True)
axes[0].bar(x=["H", "T"], height=[0.5, 0.5])
axes[0].set_title("Fair Coin Flip PMF")
axes[1].bar(x=range(1, 7), height=[1/6] * 6)
axes[1].set_title("Fair Dice Roll PMF")
axes[0].set_ylim(0, 1)

### Heights and Areas

At this point, I want to highlight two important points.

Firstly, we had the chance to decide what the function is that best expresses our belief
over the credibility of each of the outcomes.
We used what we might effectively call a "uniform discrete distribution"
over each of the discrete outcomes.

Secondly, it is the _areas_ that express for us
the credibility assigned to any of the outcomes.
When we unstack the areas
and place each of the outcomes on the x-axis with width 1,
their heights _just happen_ to equal their areas.
This is what happens when we plot the bar chart.
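
As a small aside, here is a hedged sketch of the same idea using `scipy.stats`. It assumes a fair six-sided dice and models it with `randint`, SciPy's discrete uniform distribution, whose upper bound is exclusive. The variable names are my own, chosen for illustration. The PMF height comes out as $\frac{1}{6}$ for every side, and as 0 for a value such as 7 that lies outside the space of possibilities.

```python
import numpy as np
from scipy import stats

# Assumption: a fair six-sided dice, modelled as a discrete uniform over 1..6.
# `randint(low, high)` includes `low` but excludes `high`.
fair_dice = stats.randint(low=1, high=7)

sides = np.arange(1, 7)
heights = fair_dice.pmf(sides)  # one PMF height per side
outside = fair_dice.pmf(7)      # 7 is outside the space of possibilities

print(dict(zip(sides.tolist(), heights.round(3).tolist())))
print("PMF at 7:", outside)
print("Total area:", heights.sum())
```

Because each bar has width 1, summing the heights is the same as summing the areas.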

### Continuous "Credibility Points"

Now, let's consider the interval $[0, 1]$.
How many _real_ numbers lie in that interval? 

The answer? 

> _It's actually infinite..._

As such, unlike the discrete case, if we tried to assign the same non-zero area to every single number
within the bounds of $[0, 1]$, we would end up with infinite total area!

So, instead of assigning area to individual values,
we assign area to a range of values,
with increasing or decreasing _density_ of area per unit interval,
taken _in the limit as the interval width approaches 0_.
That curve we draw, which tells us the density of area over an infinitesimal interval,
is called the **probability density function** (PDF).

The corollary of this definition is that there is 0 area, 
and hence 0 probability, associated with any single value;
probability can only be associated with a _range_ of values.
Additionally, by definition the area under the credibility assignment function,
taken over the range of possible outcome values,
must total 1.0;
otherwise we do not have a proper probability function.
That said, even though there is no _area_ associated with a single value,
there is still _height_ associated with it.
This point is important; we will use this fact when discussing _likelihoods_.
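
To make the height-versus-area distinction concrete, here is a minimal sketch. It assumes a continuous uniform distribution on $[0, 0.5]$, which I picked only because its numbers are easy to check by hand. It also uses the cumulative distribution function (`.cdf`), which this notebook does not otherwise cover, to get the area over a range.

```python
from scipy import stats
from scipy.integrate import quad

# Assumption: a continuous uniform on [0, 0.5], chosen only for illustration.
# In scipy, `uniform(loc, scale)` covers the interval [loc, loc + scale].
narrow_uniform = stats.uniform(loc=0, scale=0.5)

# Height at a single value: a density, which may exceed 1.
height = narrow_uniform.pdf(0.25)

# Area over a range of values: a probability, computed from the CDF.
prob_range = narrow_uniform.cdf(0.3) - narrow_uniform.cdf(0.2)

# Area over a range of zero width: always 0.
prob_point = narrow_uniform.cdf(0.25) - narrow_uniform.cdf(0.25)

# Total area under the PDF across the support, by numerical integration.
total_area, _ = quad(narrow_uniform.pdf, 0, 0.5)

print(height, prob_range, prob_point, total_area)
```

Notice that the height at 0.25 is 2, which would be nonsense as a probability but is perfectly fine as a density, because the total area is still 1.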

### Exercise: Biased coin flip outcome distribution

We have enough ideas covered for our first programming exercise!
This first exercise will get you comfortable with the fact that probability distributions
are nothing more than math functions that assign credibility points to a space of possible values.
Are you ready? Here's the exercise:

1. Define a Python function called `coin_distribution()` that assigns area to outcomes of a biased coin as a dictionary mapping.
    1. You may define the credibility points any way you want. The coin can be fair or unfair.
    1. In the dictionary mapping, the keys should be possible outcomes, and the values should be the credibility points assigned.
2. Write a software test that guarantees that the total area assigned equals 1.

An example to get you kickstarted is as follows:

```python
def some_distribution():
    mapping = {
        "outcome_1": 1/3,
        "outcome_2": 1/3,
        "outcome_3": 1/3,
    }
    return mapping
```

In [None]:
from bayes_tutorial.solutions.probability import coin_distribution, test_coin_distribution
import numpy as np

# Your answer below:
# def coin_distribution():
#     pass

# def test_coin_distribution():
#     pass

test_coin_distribution()

### Exercise: Dice outcome distribution

Let's do one more exercise, simply to reinforce the ideas.
Instead of a coin, let's do a six-sided dice.

1. Define a Python function called `dice_distribution()` that assigns credibility area to outcomes of a dice roll.
    1. You may define the credibility area any way you want: it can be a fair dice or an unfair dice.
    1. In the dictionary mapping, the keys should be possible outcomes, and the values should be the credibility points assigned.
1. Write a software test that guarantees that the total area assigned equals 1.

In [None]:
from bayes_tutorial.solutions.probability import dice_distribution, test_dice_distribution
    
# Your answer below:
# def dice_distribution():
#     pass

# def test_dice_distribution():
#     pass


test_dice_distribution()

If this implementation of the coin and dice distributions _feels_ a tad dissatisfying, you're not off the mark!
Thus far, we've only defined the mapping from outcomes to credibility points.
We have not, however, defined other things that we might be used to with probability distributions,
such as drawing numbers from them.
That's the topic of the next section!

## Probability distributions have stories

You've now defined two discrete probability distributions: the coin flip, and the dice roll.
The key thing we implemented here was nothing more than
**the assignment of probability density/mass (_credibility points_)
across the potential outcomes (_support_).**

Each of these probability distributions tells a "data generating story".
There are other probability distributions, and they each have their own data generating story.
It pays to know these stories, as they come in handy when describing different data generating processes.

Justin Bois, an instructor in the Bioengineering department at Caltech,
has compiled [an incredibly useful resource][dist_stories] for learning about the data generation stories
for most probability distributions.
You should bookmark it!

[dist_stories]: http://bois.caltech.edu/dist_stories/t3b_probability_stories.html

### The shortcut to using distributions...

...is to know its support.
More often than not, your data restricts the distributions that are valid to describe it.
For example, if you have positive-only data, you shouldn't use a distribution whose support includes negative values.
Or if you have continuous data, you can't use discrete distributions.

Keep that pro tip in mind!
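
If you want to check a distribution's support programmatically, here is a hedged sketch. It assumes the `.support()` method available on frozen `scipy.stats` distributions in reasonably recent SciPy versions. The three candidate distributions and their parameter values are arbitrary picks of mine for illustration.

```python
from scipy import stats

# Assumption: three arbitrary candidate distributions, for illustration only.
candidates = {
    "normal": stats.norm(loc=1.7, scale=0.1),
    "exponential": stats.expon(scale=1.7),
    "poisson": stats.poisson(mu=2),
}

# `.support()` returns the lower and upper end points of the support.
supports = {name: dist.support() for name, dist in candidates.items()}

for name, (lower, upper) in supports.items():
    print(f"{name}: support runs from {lower} to {upper}")
```

The normal distribution's support includes negative values, so it can assign density to impossible negative heights, while the exponential's support starts at 0. Note that `.support()` reports only the end points: the Poisson is still discrete, and only takes integer values between them.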


## Events and their relation to outcomes

Let's talk about "Events", then.
Events happen, and when they do, we get data.
Well, at least, that's the colloquial way of expressing how our data are generated.

More formally, when an event takes place, we draw an outcome from our probability distribution
in a fashion proportional to the credibility points assigned to that outcome.
In the statistics and probability worlds, this is how we might choose to model our **data generation process**.

In the Python world, we have the SciPy statistics library that gives us a collection of distributions
that we can use in our modelling problems, to illustrate this idea.
We will see the library's use throughout this tutorial.
For now, let's take a look at the use of the Bernoulli distribution in modelling coin flips.

### Bernoulli: Modelling coin flips

To model coin flips, the probability model that one might choose
is the Bernoulli distribution.
This is because its support is binary (`0` and `1`),
and it allows us to arbitrarily assign probability between them.

First off, we instantiate a Bernoulli distribution object.

In [None]:
from scipy.stats import bernoulli

b = bernoulli(p=0.8)

This gives us a biased coin flip probability distribution.

Now, we can draw values from the Bernoulli, using the `.rvs(n)` method. 
The etymology of `.rvs` is "random variates".
We will touch on "random variables" (which is related) later.

Meanwhile, we can draw 10 values:

In [None]:
b.rvs(10)

Go ahead and run the cell above a second time.
You should see a different configuration of 10 outcomes.
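
If you would rather have draws that are reproducible, and want to see that outcomes really are drawn in proportion to their credibility points, here is a small sketch. It assumes the `random_state` argument of `.rvs()`; the seed value and the number of draws are arbitrary choices of mine.

```python
from scipy.stats import bernoulli

b = bernoulli(p=0.8)

# `random_state` fixes the seed, so these draws are the same on every run.
draws = b.rvs(10_000, random_state=42)

# The fraction of 1s should land close to p = 0.8.
fraction_of_ones = draws.mean()
print(fraction_of_ones)
```

With only 10 draws the fraction of `1`s can wander quite far from 0.8; with many more draws it settles close to it.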

### Exercise: Binomials

While Bernoulli distributions provide a data generating model for 1 coin flip,
Binomial distributions provide a data generating model for the number of heads in 1 or more coin flips.
As such, Binomial distributions are the generalization of Bernoulli distributions.

To reinforce this idea, try this next exercise:

1. Create a Binomial distribution that generates Bernoulli outcomes. Some hints:
    1. The binomial distribution is named `binom` in `scipy.stats`. We've imported it for you. 
    1. Be sure to reference [the docs][binom]!
    1. You will have to set both the `n` and `p` parameters!
1. Then draw 10 outcomes from the Binomial distribution.

[binom]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html

In [None]:
from scipy.stats import binom
from bayes_tutorial.solutions.probability import binomial_answer

# The correct answer is generated from this function.
draws = binomial_answer()

# Your answer goes here

# The test of your answer's correctness is below:
assert set(draws).issubset([1, 0])

### Exercise: Modelling dice rolls

The multinomial distribution gives us a natural extension of the binomial (and hence the Bernoulli) to more than two outcomes.
Try using it to model dice rolls.

Some hints:

1. The multinomial distribution is named `multinomial` in `scipy.stats`
1. It also accepts two arguments, `n` and `p`.
    1. `n` is the number of trials (dice rolls) summarized in each draw; with `n=1`, each draw is a single roll.
    1. `p` is a list/vector of probabilities of each outcome, and must sum to 1!
1. Take 10 draws from the multinomial distribution.

In [None]:
from scipy.stats import multinomial
from bayes_tutorial.solutions.probability import multinomial_answer

# The correct answer is generated from this function.
draws = multinomial_answer()

# Your answer below


You'll notice, to represent six discrete outcomes,
we choose to use a "one-hot" vector as the representation of the outcomes.
This can be converted back to our favourite representation:

In [None]:
def vect2dice(v: np.ndarray) -> np.ndarray:
    # Each row of `v` is a one-hot vector; the column index of the 1, plus 1, is the dice face.
    return np.where(v == 1)[1] + 1

vect2dice(draws)

Now, our understanding of probability distributions should feel a little more complete.
In this section, we saw **the ability to draw _outcomes_ 
in proportion to the probability density/mass assigned to the outcome**.

## Likelihoods

We're now going to look at this idea of likelihoods.
Likelihood is related to probability,
but here, I have chosen to think of it as a separate idea.

Probability is the tendency of certain outcome values
to be drawn from a distribution.
By contrast, likelihood is calculated
when we evaluate data against a probability distribution.

There is one very important thing for you to keep in mind
as you continue reading on:
In the vast majority of cases,
the math function that defines the tendency of data to be drawn from the distribution,
i.e. the _probability mass function_ or the _probability density function_,
is **exactly the same math function** used to evaluate the likelihood of data.

### Likelihood of a coin flip

Let's start simple, and look at the ideas in the context of coin flips.

When we flip a fair coin once, giving us an event $C$, and we get a heads (denote this with `H`), what is the _likelihood_ (height) of this event happening?
From our common knowledge, we might choose to start by setting up a probabilistic model
where the probability of heads $P(C=H)$, is given by the PMF of a Bernoulli(0.5).
Now, we need to evaluate data _against_ this probabilistic model, 
so we use the PMF height to give us the "likelihood" of observing the data, under the assumed model,
because the height of the PMF gives us a measuring rod of sorts to tell us how "likely" our data are.
Let's denote this "likelihood" as $\mathcal{L}(\text{H})$:

$$\mathcal{L}(\text{H})=P(C=H)$$

...or in words, the likelihood of getting a heads is evaluated by using the (height of the) probability of the coin flip being heads.
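
Here is a minimal sketch of that idea in code. To avoid giving away the exercise below, it assumes a _biased_ coin with $P(C=H) = 0.8$, and it encodes heads as `1` and tails as `0`; both choices are mine, for illustration.

```python
from scipy.stats import bernoulli

# Assumption: a biased coin with P(heads) = 0.8, encoding heads as 1 and tails as 0.
biased_coin = bernoulli(p=0.8)

likelihood_heads = biased_coin.pmf(1)  # height of the PMF at "heads"
likelihood_tails = biased_coin.pmf(0)  # height of the PMF at "tails"
print(likelihood_heads, likelihood_tails)
```

Under this assumed model, an observed heads is more likely than an observed tails, simply because the PMF is taller at heads.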

### Exercise

We are going to calculate the likelihood of coin flip data under the assumption of a fair coin flip, using the `scipy.stats` library.

(Do pay attention to the language: with likelihoods, we are always evaluating data against an assumed distribution.)

Every `scipy.stats` probability distribution has an associated `.pmf(x)` (discrete distributions) or `.pdf(x)` (continuous distributions) method. You pass in data `x`, which can be either a scalar number or an ndarray of some kind, and it returns, elementwise, the likelihood of each data point.

Your exercise is as follows: calculate the likelihood of each observation in `coin_data_1` and `coin_data_2` below. We have provided a `fair_coin_model` RV that you can use, which is a "frozen" or pre-configured Bernoulli.

In [None]:
from bayes_tutorial.solutions.probability import fair_coin_model, coin_data_likelihood
from inspect import getsource

# FYI on how fair_coin_model is defined
print(getsource(fair_coin_model))

In [None]:
coin = fair_coin_model()
coin_data_1 = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
coin_data_2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# This is the "correct" answer.
coin_data_likelihood(coin_data_1, coin_data_2)

# Your answer here:


### Discussion

Notice how I have provocatively set up two very contrasting situations:
one with a roughly even balance of `0`s and `1`s,
and one made up entirely of `0`s.

Do the likelihood calculations surprise you?
Can you explain the rationale for your answer?

In [None]:
from bayes_tutorial.solutions.probability import likelihood_coin_toss

# Uncomment the next line to see my answer.
# print(likelihood_coin_toss())

## Joint distribution and likelihood of multiple coin flips

When we flip a fair coin twice and obtain the outcome `HT` (a head, followed by a tail),
what is the _joint likelihood_ of these two events happening in this sequence?

To start, just as we use a probability distribution to model a single coin toss,
we can use a _joint probability distribution_ as a probability model for two coin tosses.
The same rules of probability apply, in that the total "mass" or "density" assigned must equal 1,
and that the values must be distributed across the valid outcome values (support).

Using the notation where $C_i$ refers to the $i$th random variable modelling coin $i$, and the outcomes are $H$ and $T$, the joint pairs are:

- ($C_1=H, C_2=H$)
- ($C_1=T, C_2=H$)
- ($C_1=H, C_2=T$)
- ($C_1=T, C_2=T$)

We can choose to define their joint probability distribution as $P(C_1, C_2)$,
with the following "canonical" probability mass allocations:

- $P(C_1=H, C_2=H)=P(C_1=H) \times P(C_2=H)=0.25$
- $P(C_1=T, C_2=H)=P(C_1=T) \times P(C_2=H)=0.25$
- $P(C_1=H, C_2=T)=P(C_1=H) \times P(C_2=T)=0.25$
- $P(C_1=T, C_2=T)=P(C_1=T) \times P(C_2=T)=0.25$

You might have noticed the multiplication of independent probabilities.
Because two coin tosses are independent,
(i.e. one toss doesn't affect the other),
one can make the argument that their joint probability mass allocation
is equivalent to the product of their individual probability allocation.
This is the product rule of probability for independent events.
Now, because the coins are fair, the products all end up being the same,
but this need not necessarily be true!
Especially if one of the coins, or both of them, are modelled using an "unfair" coin toss model.
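
To see the products come out unequal, here is a small sketch under the independence assumption. The two coins and their probabilities are hypothetical: coin 1 is fair, and coin 2 is biased towards heads.

```python
from itertools import product

# Assumption: hypothetical coins; coin 1 is fair and coin 2 is biased towards heads.
coin_1 = {"H": 0.5, "T": 0.5}
coin_2 = {"H": 0.8, "T": 0.2}

# Independence lets us build every joint entry as a product of the individual entries.
joint = {
    (c1, c2): coin_1[c1] * coin_2[c2]
    for c1, c2 in product(coin_1, coin_2)
}

for outcome, mass in joint.items():
    print(outcome, round(mass, 3))

total_mass = sum(joint.values())
print("Total mass:", round(total_mass, 3))
```

The four joint entries are no longer all 0.25, but they still cover every valid pair of outcomes and still sum to 1.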

You should know also that you _technically_ can define
the _joint_ probability mass allocations any way you want,
without leveraging independence,
_as long as they sum to 1, and cover all possible valid outcomes_.
Leveraging this idea of "independence" between events simply
makes constructing the joint probability model easier,
as we won't _otherwise_ have to individually create entries for the joint probabilities.
The deeper concept powering the simple multiplication is the idea of **exchangeability**,
which colloquially means I could ___exchange___ the positions of our data points,
and their joint likelihood does not change.

Since we have the probability mass function, we also have the likelihoods (heights!) available to us.
Therefore, we can calculate the likelihood of our data using the _joint distribution_ available to us.

### Exercise

Calculate the joint likelihood of the coin tosses above.

_Hint:_ If you have an array and you would like to multiply all of the individual elements together, you can use NumPy: `np.prod(arr)` gives you the product of all elements in an array.

In [None]:
from bayes_tutorial.solutions.probability import coin_data_joint_likelihood

# This is the "correct" answer
coin_data_joint_likelihood(coin_data_1, coin_data_2)

# Your answer below:


### Exercise

You'll notice that the numbers get really small.
In practice, we usually calculate the joint log-likelihood, 
rather than the joint likelihood, so that we don't run into underflow issues
when doing Bayesian computing.
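
Here is a quick sketch of what underflow looks like in practice. It assumes 2000 hypothetical observations that each have likelihood 0.5, with the numbers chosen only to make the effect obvious.

```python
import numpy as np

# Assumption: 2000 hypothetical observations, each with likelihood 0.5.
likelihoods = np.full(2000, 0.5)

joint_likelihood = np.prod(likelihoods)            # underflows to exactly 0.0
joint_loglikelihood = np.sum(np.log(likelihoods))  # stays finite

print(joint_likelihood)
print(joint_loglikelihood)
```

The true product, $0.5^{2000}$, is far smaller than the smallest positive number a 64-bit float can represent, so it collapses to zero, while the sum of logs stays comfortably finite.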

Every `scipy.stats` probability distribution has a `.logpmf(x)` or `.logpdf(x)` method,
whose semantics are similar to `.pmf(x)` and `.pdf(x)`.
Use this to calculate the joint **log**-likelihood of the coin tosses.

(_Hint:_ The product of numbers, in log-space, is a sum!)

In [None]:
from bayes_tutorial.solutions.probability import coin_data_joint_loglikelihood

# This is the "correct" answer
coin_data_joint_loglikelihood(coin_data_1, coin_data_2)

# Your answer below:


### Data are usually assumed to be "independent and identically distributed"

As things turn out, it is a common, crucial, but also fairly reasonable assumption
that data come to us independent of one another, all drawn from the same distribution.
So in most cases, calculating the joint likelihood of data
means that we can multiply together their individual likelihoods under the given model.
(This is the "exchangeability" idea at work!)

### Dependencies show up elsewhere

That said, there are exceptions, such as in Markov chains and sequential data,
where data come to us with sequential dependencies on one another.

You'll see this in the next notebook, when we look at how to write simulation models,
so don't be too quick to assume that independence shows up everywhere! :)

## "Likelihood is all you need"

You might be wondering, what's the difference between probability and likelihood?
Put as succinctly as I could think of:

- Probability (the _area_) gives us the tendencies for _potential_ outcomes to be drawn on each event. 
- Likelihoods (the _height_) give us the function to evaluate how probable an _observed_ outcome (or data point) is under a given probability model (i.e. the assignment of area to outcomes (discrete) or ranges of outcomes (continuous)).

In statistical inference, likelihood is all you need.
Well, strictly speaking, likelihood is _basically what you need_.

Remember: likelihood is calculated
when we evaluate observed data
against the distribution that we assume generated them.

For two independently drawn outcomes,
we can multiply their likelihoods together
to obtain their joint likelihood.
This trivially extends to three or more outcomes.

There is a principle in statistics, called the ["Likelihood principle"][likelihood].
You don't have to remember the term, but it is helpful to remember the idea.
From Wikipedia:

> In statistics, the likelihood principle is the proposition that, given a statistical model, 
> all the evidence in a sample relevant to model parameters is contained in the likelihood function.

[likelihood]: https://en.wikipedia.org/wiki/Likelihood_principle

As such, you'll see that in this suite of notebooks,
we will be looking _primarily_ at likelihoods, and not probability.
After all, the most common thing that we're trying to do
is evaluate data against a model.

## Probability Distributions

With these components:

- spaces of possible outcomes (i.e. the "_support_"),
- probability mass or density assigned to each outcome,
- total probability assigned across all outcomes summing to 1, and
- non-deterministic drawing of outcomes per event,

we have enough to define a probability distribution.
Pictorially, it looks like the following:

![](./images/distributions.png)

Remember: The PMF/PDF can be mathematically or empirically defined,
it doesn't really matter, as long as the total area is 1.

I think we have enough to define a probability distribution in an understandable fashion for programmers:

> A probability distribution is a description of how probability mass or density is assigned to valid outcomes (the support of the distribution), such that the sum of masses or integral of densities equals 1.

That _description_ is most commonly done by a math equation that is parametrized.

Finally, keep in mind the distinction between probability and likelihood, from the perspective of how we use them:

- When we draw data from a probability distribution's space of outcomes, the "probability" of each outcome is defined by the probability mass/density function.
- When we evaluate data against a probability distribution's PMF/PDF, we are evaluating the _likelihood_ of the data.

## Random Variables

"Random variable" is probably another term that you've seen floating around.

As usual, I think contrasts will be the most illuminating here.

When we build a model of the world, we'll usually assign _variables_ to represent things.
When we so-called "run the model" once, usually we'll assign a real value to those variables,
yielding one "realization" of the model.
Now, those variables can be _deterministic_/_fixed_ or _random_/_stochastic_.
If over each realization, we "fix" the variable at a given value, then it is a _deterministic_ variable.
If over each realization, we allow it to vary stochastically, then it is a _random_ variable.

Let's move on to a bit of verbiage. 
When setting up a problem, we'll usually say something like:

> Let $p$ be the __random variable__ that models _the probability of heads_, and let $p$ be __Beta distributed__, with parameters $\alpha$ and $\beta$.
>
> Let $c$ be the __random variable__ that models the _outcome of a coin flip_, and let $c$ be __Bernoulli distributed__, with parameter $p$.

More generically:

> Let `algebraic symbol` be the random variable that models `some real thing, or model component`, and let `algebraic symbol` be distributed by `some distribution`, with parameters `some distribution's parameters`.

I think with this in place, we have a sufficiently precise language going forward!
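
To connect this language back to code, here is a hedged sketch of one "realization" of the coin flip model described above. The parameter values $\alpha = \beta = 2$ and the seeds are arbitrary choices of mine; the variable names `alpha_param` and `beta_param` are only there to avoid clashing with the imported `beta` distribution.

```python
from scipy.stats import bernoulli, beta

# Assumption: alpha = beta = 2 is an arbitrary choice for illustration.
alpha_param, beta_param = 2, 2

# One "realization" of the model: first draw p, then draw c given that p.
p = beta(a=alpha_param, b=beta_param).rvs(random_state=0)
c = bernoulli(p=p).rvs(random_state=1)

print(f"p = {p:.3f}, c = {c}")
```

Both `p` and `c` are random variables here: each run of the model (with different seeds) may give them different values. Had we instead fixed `p = 0.8` by hand, `p` would be a deterministic variable and only `c` would be random.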

## Bayes' Rule

This notebook is about Bayesian statistical inference, so we have to talk about Bayes' rule.

Prior to reading this notebook, you may have seen Bayes' rule.
It's so famous, [it's even become a neon sign][bayesjpg]!

[bayesjpg]: https://en.wikipedia.org/wiki/File:Bayes%27_Theorem_MMB_01.jpg

Bayes' rule looks like this:

$$P(A|B) = 
\frac{P(B|A)P(A)}
{P(B)}
$$

It is a natural result that comes straight from the rules of probability,
being that the joint distribution of two random variables
can be written in two equivalent ways:

$$P(A, B) = P(A|B)P(B) = P(B|A)P(A)$$

For further treatment of joint probability and Bayes' rule,
I'd like to suggest you take a look at Allen Downey's [Think Bayes](http://www.greenteapress.com/thinkbayes/html/thinkbayes002.html).

### From random variables to hypotheses and data

Now, I have encountered many books that write,
regarding the application of Bayes' rule to statistical modelling,
something along the lines of the following:

> Now, there is an alternative _interpretation_ of Bayes' rule,
> one that replaces the symbol "A" with "Hypothesis",
> and "B" with the "Data", such that we get:
> 
> $$P(H|D) = \frac{P(D|H)P(H)}{P(D)}$$

At first glance, nothing seems wrong about this statement,
but I remember having a lingering, nagging feeling
that there was a logical jump unexplained here.

More specifically, that logical jump yielded the following question: _Why are we allowed to take this interpretation?_

It took posing the question to a mathematician friend, Colin Carroll,
to finally "grok" the idea.
Let me try to explain.

## Spaces of models and data

We have to think back to the fundamental idea of possible spaces.

### Exercise: How many possible Bernoullis?

If we set up a Bernoulli probability distribution with parameter $p$,
then what is the space of possible probability distributions that we could instantiate?

In [None]:
from bayes_tutorial.solutions.probability import spaces_of_p

# Uncomment the next line to reveal the answer.
# print(spaces_of_p())

Let's take a look at this more concretely. 

It is possible to instantiate many other Bernoulli distributions:

```python
b1 = bernoulli(p=0.1)
b2 = bernoulli(p=0.15)
b3 = bernoulli(p=0.78)
```

Each value of $p$ gives us a different configuration of a Bernoulli distribution.
As such, a space of $p$ that gives us an infinite set of possibilities of $p$
**will give us an infinite set of Bernoullis**!

This result should not surprise you: $p$ can take on any one of an infinite set of values between 0 and 1, each one giving a different instantiated Bernoulli.
As such, a `Bernoulli(p)` hypothesis is drawn from a (very large) space of possible `Bernoulli(p)`s,
or more abstractly, hypotheses, thereby giving us a $P(H)$.

### Exercise: How many configurations of data?

Moreover, consider our data:
the Bernoulli data that came to us might, for example, be `0, 1, 1, 1, 0`.
Let's just consider one case: given five draws from a Bernoulli,
how many ways could our data have come in?
(_Hint:_ You can choose to simplify the problem by considering three `1`s and two `0`s,
or you can try considering the full space of possible outcomes for five draws.
Either way, your answer will be illustrative!)

In [None]:
from bayes_tutorial.solutions.probability import spaces_of_data

# Uncomment the next line to see my answer
# print(spaces_of_data())

As such, we have the $P(D)$ interpretation.

As a modelling decision, we _choose_ to say
that our data and model are jointly distributed,
thus we have the _joint distribution_
between model and data, $P(H, D)$.
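
To make the $H$ and $D$ interpretation a little more tangible, here is a minimal sketch of Bayes' rule over a deliberately tiny space of hypotheses. It assumes only three candidate Bernoullis (the three values of $p$ instantiated earlier), each given equal prior credibility, and it uses the example data `0, 1, 1, 1, 0`. In reality the space of $p$ is infinite, so treat this as an illustration rather than a recipe.

```python
import numpy as np
from scipy.stats import bernoulli

# Assumption: a deliberately tiny hypothesis space of three Bernoullis,
# each given equal prior credibility P(H).
p_values = np.array([0.1, 0.15, 0.78])
prior = np.full(len(p_values), 1 / len(p_values))

# Example data: five Bernoulli draws.
data = np.array([0, 1, 1, 1, 0])

# P(D|H): joint likelihood of the data under each hypothesis (independent draws).
likelihood = np.array([np.prod(bernoulli(p=p).pmf(data)) for p in p_values])

# P(D): total credibility of the data, summed across all hypotheses.
evidence = np.sum(likelihood * prior)

# Bayes' rule: P(H|D) = P(D|H) P(H) / P(D).
posterior = likelihood * prior / evidence

for p, post in zip(p_values, posterior):
    print(f"P(p={p} | D) = {post:.3f}")
```

The posterior is itself a set of credibility points over the space of hypotheses: it is non-negative, it sums to 1, and most of it lands on the hypothesis under which the observed data are most likely.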

## Summary

Let us summarize the main points for you to take away from here:

Firstly, a probability distribution is defined by a function
that assigns credibility points to a space of possible outcomes.
We can use the probability distribution
to draw outcomes proportional to the credibility points assigned to each of the outcomes.

Secondly, the likelihood principle gives us a way of connecting the sample data
to the parameters of a probability distribution.
We can evaluate data against a particular distribution
to see how _likely_ it is that the observed data were drawn from that given distribution.
In evaluating data against that particular distribution, 
we reuse the function that assigns credibility points,
usually known as a probability mass or density function,
as the _likelihood_ function.
As we will see in later notebooks,
we can take advantage of the likelihood principle
to infer the most likely parameter values
that generated the data.

## Conclusion

Now that you've learned about the basics of probability,
let's go on to see how we can use probability distributions
to simulate our data generating processes.
Head over to the next chapter!

## Solutions

In [None]:
from bayes_tutorial.solutions import probability

probability??