## Learning Objectives

In this notebook, you will:

- Learn how to define foundational terminology for reasoning about probability, distributions, and likelihoods.
- Use probability distributions implemented in the `scipy.stats` library to understand those terms.

## So, what is... probability?

As a learner of Bayesian statistics, 
you, ahem, _probably_ (\*cough cough\*) have heard of the term "probability".
What exactly is this beast?
How do I learn it without getting lost in an entanglement of unclear terminology?

Getting down to its core, using relatable and understandable words and pictures,
is what I want to do in this notebook.

## Spaces

Firstly, probability is concerned with _spaces of possibilities_,
and outcomes drawn from these spaces of possibilities
that are non-deterministic.
(In classical statistics, this "space" of possibilities
is also called a "support".)

To illustrate this point, here are a few examples.

### Coin flips

With the classic coin flip, the complete space of possibilities is heads and tails.
Under the most common circumstances, landing on neither heads nor tails
is considered out of the space of possibilities.

### Rolling dice

With dice rolling, the complete space of possibilities that we are interested in
is the set of sides.
On a six-sided die, the space of possibilities is the sides 1-6 inclusive.
The side with the number 7... doesn't exist.
And as such, it's outside of the space of possibilities.

### Counts of stuff

When we count things, the space of possibilities is the set of all non-negative integers,
that is, the positive integers together with the number 0.
For example, when counting the number of car crashes at an intersection,
-1 car accidents is quite commonly left out of the realm of possibilities.

### Heights of adult people

When measured in meters, the space of possibilities is the set of all positive real numbers.
Strictly speaking, we might bound the space of possibilities even further,
though for convenience, we often choose not to.
(Though biologically highly implausible and improbable,
a 0.1m tall adult human is _technically_ within the space of all positive real numbers.)

## Events and Outcomes

With probability, we are also concerned with "events".
A coin flip, a dice roll, and a car entering an intersection are "events",
and events are when our observed data are generated.

At each _event_, there is an _outcome_ that is drawn from the space of possibilities.

For example, when a dice roll _event_ takes place,
any one of the six sides is a possible _outcome_.

As another example, when counting the number of car crashes at an intersection per day,
each day is an _event_,
and the number of car crashes happening in that day
is the _outcome_ drawn from the space of non-negative integers.

Finally, when a human is born and eventually grows up,
their growing up is an event (in both the metaphorical and real sense!),
and their adult height at a given point in time
is the _outcome_ drawn from the positive real numbers space.

## Probability as "Credibility Points"

Probability is also concerned with credibility points assigned to each of the possible outcomes.

The easiest way to think about these credibility points is to think of them in terms of an "area".
Give yourself a two-dimensional blob, with 1 unit of area.
What you're doing with probability is assigning fractions of this area
to each of the _outcomes_.

Here are the _only_ requirements on how area is assigned to each of the outcomes:

- It must be non-negative (an outcome may be assigned zero area, but never negative area).
- The total area assigned to all possible outcomes must sum to 1.

Let's look at a few examples.

### Discrete "Credibility Points"

For a coin flip, if you have a fair coin,
you might assign 0.5 of the area to one outcome, and 0.5 of the area to the other.
Or if you have a biased coin, you would assign $p$ to one outcome and $1-p$ to the other.

If you have a six-sided die,
you might assign $\frac{1}{6}$ of the area to each of the outcomes,
if the die were fair.

If you have a 100-sided die,
you might assign $\frac{1}{100}$ of the area to each of the outcomes.

There is a function that assigns area to each outcome.
It's a bit like assigning a lump of mass to each outcome.
The function that we get to define, or that others may have defined for us,
is called the **probability mass function**.

What's cool is that if we draw each outcome's area as a rectangle
whose base is exactly 1 unit long,
then the height of the rectangle is equal to its area.
Keep this in mind; the _height_ is going to come in pretty handy.
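
As a small worked check of the two requirements above (a sketch, where $X$ is a symbol introduced here for the outcome of a single roll), the fair six-sided die's probability mass function assigns non-negative area to every side, and the areas sum to 1:

$$P(X = k) = \frac{1}{6} \text{ for } k \in \{1, \dots, 6\}, \qquad \sum_{k=1}^{6} P(X = k) = 6 \times \frac{1}{6} = 1$$

The biased coin passes the same check, since $p + (1 - p) = 1$ for any $p$ between 0 and 1.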

### Continuous "Credibility Points"

When we think about assignment of credibility points to a continuous outcome space,
we must remember that the area we assign to each of the outcomes must sum to 1.

Let's consider the interval $[0, 1]$.
How many _real_ numbers lie in that interval? 

_It's actually infinite..._

As such, unlike the discrete case, if we tried to assign any non-zero area to each single number
within the bounds of $[0, 1]$, we would end up with infinite total area!

So instead of assigning area to individual values, we assign area to a range of values,
with increasing or decreasing _density_ of area per unit interval,
taken _in the limit as the interval width approaches 0_.
That curve we draw, which tells us the density of area over an infinitesimal interval,
is called the **probability density function** (PDF).

The corollary of this definition is that there is 0 probability associated with any single value;
probability can only be associated with a _range_ of values.
That is because there is 0 area associated with any single value under the PDF.

That said, even though there is no _area_ associated with a single value,
there is still _height_ associated with it.
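
To make the distinction between height and area concrete, here is a worked example under an assumed density (chosen purely for illustration): spread the 1 unit of area evenly over the interval $[0, 0.5]$. The density, written here as $f$, must then have height 2 everywhere on that interval:

$$f(x) = 2 \text{ for } 0 \le x \le 0.5, \qquad \int_{0}^{0.5} f(x)\,dx = 2 \times 0.5 = 1$$

The area over a range such as $[0.1, 0.2]$ is $2 \times 0.1 = 0.2$, while the area at the single value $x = 0.1$ is $\int_{0.1}^{0.1} f(x)\,dx = 0$. The height there is still 2, which also shows that a height can exceed 1 even though an area never can.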

## Rules of Probability and Likelihood

The rules of probability and likelihood give us the basic tools to work with data.
The unfortunate piece is that terminology gets mixed up so frequently,
we end up with a muddied understanding of how to calculate with them.

### Coin Flips

Let's start simple, and look at the rules in the context of coin flips.

When we flip a fair coin once and we get a heads (denote this with `H`), what is the _likelihood_ (height) of this event happening? It should be trivial to grasp: 0.5.
Let's denote this as $\mathcal{L}(\text{H})$.

When we flip a fair coin twice and obtain the outcome `HT` (a head, followed by a tail), what is the _joint likelihood_ of these two events happening in this sequence? By the rules of probability and likelihood, we multiply the _likelihoods_ of the individual _outcomes_ together. Let's denote this $\mathcal{L}(\text{HT})$; it is equal to $0.5 \times 0.5 = 0.25$.

### Two pointers: Multiplication, and Likelihoods

At this point, we need to disentangle a few ideas.

Firstly, where does the multiplication of likelihoods come from? It comes from considering the _joint space of outcomes_. If I have two possible outcomes for the first event and two possible outcomes for the second event, then in total I have 4 possible joint outcomes (`HT`, `HH`, `TH`, `TT`). For two independent flips of a fair coin, each of the four joint outcomes is equally credible, so each gets $\frac{1}{4}$ of the area, which is exactly $0.5 \times 0.5$.

But wait, why do I keep saying "multiplication of likelihoods" rather than "multiplication of probabilities"? 
This comes to the second distinction, between likelihood and probability:

- Probability (the _area_) gives us the tendencies for _potential_ outcomes to be drawn on each event. 
- Likelihoods (the _height_) give us the function to evaluate how probable an _observed_ outcome (or data point) is under a given probability model (i.e. the assignment of area to outcomes (discrete) or ranges of outcomes (continuous)).

It just so happens that in the discrete distribution case,
the probability value is equivalent to the likelihood value,
but as you saw above, with continuous distributions,
the likelihood value is some non-zero number at a point,
even though the probability of that value (area at that point) is zero.

With continuous distributions, the same rules apply.
We take the likelihood (given by the probability density function)
of each data point and multiply them together.

## Likelihood is all you need

Well, strictly speaking, likelihood is _basically most of what we need_.
Remember: likelihood is what we calculate
when we take data that we assume were drawn from a distribution
and evaluate that distribution's PMF/PDF at those data.

For two independently drawn outcomes,
we can multiply their likelihoods together
to obtain their joint likelihood.
This trivially extends to three or more outcomes.
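
Written out as an equation (a sketch, where $x_1, \dots, x_n$ denote $n$ independently drawn outcomes and $\mathcal{L}(x_i)$ is the PMF or PDF height at $x_i$; the subscripted symbols are introduced here for illustration):

$$\mathcal{L}(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} \mathcal{L}(x_i)$$

For the fair-coin sequence `HT` from earlier, this gives $\mathcal{L}(\text{HT}) = \mathcal{L}(\text{H}) \times \mathcal{L}(\text{T}) = 0.5 \times 0.5 = 0.25$.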

There is a principle in statistics, called the ["Likelihood principle"][likelihood].
You don't have to remember the term, but it is helpful to remember the idea.
From Wikipedia:

> In statistics, the likelihood principle is the proposition that, given a statistical model, 
> all the evidence in a sample relevant to model parameters is contained in the likelihood function.

[likelihood]: https://en.wikipedia.org/wiki/Likelihood_principle

As such, you'll see that in this suite of notebooks,
we will be looking _primarily_ at likelihoods, and not probability.
After all, the most common thing that we're trying to do
is evaluate data against a model.

## Probability Distributions

With these components:

- spaces of possible outcomes (i.e. the "_support_"),
- probability mass or density assigned to each outcome,
- total probability assigned across all outcomes summing to 1, and
- non-deterministic drawing of outcomes per event,

we have enough to define a probability distribution.

The PMF/PDF can be mathematically or empirically defined;
it doesn't really matter which,
as long as the total area is 1.

I think we have enough to define a probability distribution in an understandable fashion for programmers:

> A probability distribution is a description of how probability mass or density is assigned to valid outcomes (the support of the distribution), such that the sum of masses or integral of densities equals 1.

That _description_ is most commonly done by a math equation. 
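
For instance, the Bernoulli distribution, which describes a single coin flip with outcomes coded as 1 and 0 and a probability $p$ of getting a 1, can be described by the following equation (the symbol $x$ for the outcome is introduced here for illustration):

$$P(x) = p^{x} (1 - p)^{1 - x}, \qquad x \in \{0, 1\}$$

Plugging in $x = 1$ gives $p$ and plugging in $x = 0$ gives $1 - p$, so the two masses sum to 1 as required.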

When we draw data from a probability distribution's space of outcomes, the "probability" of each outcome is defined by the probability mass/density function.

When we evaluate data against a probability distribution's PMF/PDF, we are evaluating the _likelihood_ of the data.

## Random Variables

"Random variable" is probably another term that you've seen floating around.

As usual, I think contrasts will be the most illuminating here.

When we build a model of the world, we'll usually assign _variables_ to represent things.
When we so-called "run the model" once, usually we'll assign a real value to those variables,
yielding one "realization" of the model.
Now, those variables can be _deterministic_/_fixed_ or _random_/_stochastic_.
If over each realization, we "fix" the variable at a given value, then it is a _deterministic_ variable.
If over each realization, we allow it to vary stochastically, then it is a _random_ variable.

Let's move on to a bit of verbiage.
When setting up a problem, we'll usually say something like:

> Let $p$ be the __random variable__ that models _the probability of heads_, and let $p$ be __Beta distributed__, with parameters $\alpha$ and $\beta$.
>
> Let $c$ be the __random variable__ that models the _outcome of a coin flip_, and let $c$ be __Bernoulli distributed__, with parameter $p$.

More generically:

> Let `symbol` be the random variable that models `thing`, and let `symbol` be distributed by `some distribution`, with parameters `some distribution's parameters`.
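
In symbols, the two coin-flip statements above are commonly abbreviated with the $\sim$ ("is distributed as") notation:

$$p \sim \text{Beta}(\alpha, \beta), \qquad c \sim \text{Bernoulli}(p)$$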

I think with this in place, we have a sufficiently precise language going forward!

## Simulating Probability with Python

We've gone through a lot without doing any programming,
and the intent here is to ensure that we have a well-defined suite of terms
that make clear what we're trying to communicate.
Well, congratulations on staying the course; in this section, we'll finally get to programming!

### Instantiating Distribution Objects

As an anchoring example, let's use Bernoulli distributions, used to model coin flips.

Let $C$ be the random variable that models our coin flips.
$C$ is Bernoulli-distributed. Therefore:

$$C \sim \text{Bernoulli}(0.5)$$

Let's translate that to code.

In [None]:
from scipy import stats
c = stats.bernoulli(0.5)

### Simulating realizations of data

We can draw numbers from the Bernoulli, thereby simulating the generation of Bernoulli-distributed data:

In [None]:
# Draw 10 simulated outcomes from 10 simulated events.
c.rvs(10)
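
Because `rvs` draws are random, the cell above gives different numbers on every run. If you want repeatable draws, one option is to pass a seed through the `random_state` argument of `rvs`. The following is a minimal sketch; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
from scipy import stats

# Same fair coin as above.
c = stats.bernoulli(0.5)

# Passing the same seed twice yields the same sequence of draws.
first = c.rvs(10, random_state=42)
second = c.rvs(10, random_state=42)

print(first)
print((first == second).all())  # whether the two sequences match element-wise
```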

### Evaluating the likelihood of data

For this distribution, 
the likelihood of getting a 1 is equal to the likelihood of getting a 0,
which is $0.5$.
This can be written using:

$$\mathcal{L}(C=1) = 0.5$$

We evaluate this by invoking the PMF:

In [None]:
# Evaluate the likelihood of drawing a "1"
c.pmf(1)

Same goes for the likelihood of drawing a 0:

$$\mathcal{L}(C=0) = 0.5$$

In [None]:
# Evaluate the likelihood of drawing a "0"
c.pmf(0)
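
Tying this back to spaces of possibilities: an outcome outside the support has no area assigned to it, so its PMF value should be 0, and the PMF values over the support should sum to 1. Here is a minimal sketch to check this with the same fair coin (the value 2 is just an arbitrary outcome outside the support):

```python
import numpy as np
from scipy import stats

c = stats.bernoulli(0.5)

# PMF values over the support {0, 1}.
support_pmf = c.pmf([0, 1])
print(support_pmf, np.sum(support_pmf))

# An outcome outside the support is assigned no probability mass.
outside_pmf = c.pmf(2)
print(outside_pmf)
```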

### Evaluating joint likelihood of independent outcomes

The joint likelihood of a head and a tail (1 and 0) is:

In [None]:
import numpy as np

# Evaluate the joint likelihood of a "1" followed by a "0"
np.prod(c.pmf([1, 0]))

And the joint likelihood of three heads is:

In [None]:
# Evaluate the joint likelihood of the sequence of results, "1, 1, 1".
np.prod(c.pmf([1, 1, 1]))
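
One practical caveat: multiplying many numbers below 1 quickly produces values too small for floating-point arithmetic to represent. A common workaround (not something this notebook relies on, so treat it as an optional aside) is to sum log-likelihoods instead, using the `logpmf` method. A minimal sketch:

```python
import numpy as np
from scipy import stats

c = stats.bernoulli(0.5)
data = [1, 1, 1]

# Product of likelihoods, as in the cells above.
joint_likelihood = np.prod(c.pmf(data))

# Sum of log-likelihoods; exponentiating recovers the same number.
joint_log_likelihood = np.sum(c.logpmf(data))

print(joint_likelihood, np.exp(joint_log_likelihood))
```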

### Visualizing the PMF and distribution of outcomes

Visualization helps with learning, so let's do that too.

Firstly, we're going to visualize the PMF of the Bernoulli.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

xs = np.array([0, 1])
ys = c.pmf(xs)
width = 0.99
# Bars are centered on their x positions by default, so no offset is needed.
ax.bar(xs, ys, width=width)
ax.set_xlabel("Support")
ax.set_xticks(xs)
ax.set_ylabel("PMF");

Next, we're going to visualize the distribution of 10 values drawn from the distribution.

In [None]:
import pandas as pd
from collections import Counter

outcomes = c.rvs(10)
counts = pd.Series(Counter(outcomes), name="fair_coin_outcomes")
counts.plot(kind="bar")
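
With only 10 draws, the bar heights can differ noticeably from the PMF. As a rough sketch of how the observed fractions tend to settle towards the PMF as we draw more, the following compares a few sample sizes (the sizes and the seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

c = stats.bernoulli(0.5)

fractions = {}
for n_draws in [10, 1_000, 100_000]:
    draws = c.rvs(n_draws, random_state=0)
    # Fraction of draws that came up 0 and 1, respectively.
    fractions[n_draws] = np.bincount(draws, minlength=2) / n_draws
    print(n_draws, fractions[n_draws])
```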

## Exercises

Let's embark on a series of exercises to reinforce some of the points above.

### Exercise: Instantiate a model for a biased coin

Pick a value `p`, and use it to instantiate a Bernoulli distribution for a biased coin.

In [None]:
# Answer:
bc = stats.bernoulli(p=0.2)

### Exercise: Simulated data from the biased coin

Now simulate 7 draws from the biased coin.

In [None]:
draws = bc.rvs(7)
draws

### Exercise: Evaluate joint likelihood of data

Now, using the biased coin, evaluate the joint likelihood of the coin flip results `[1, 1, 1, 1, 0, 0, 1]`.

In [None]:
data = [1, 1, 1, 1, 0, 0, 1]
np.prod(bc.pmf(data))

### Exercise: Create a new distribution that better explains the data

Now, try creating a new distribution that better explains the data you saw above.

In [None]:
new_bc = stats.bernoulli(p=5/7)

### Exercise: Evaluate joint likelihood of data under new distribution

Using the same data as above, evaluate the joint likelihood under the new hypothesized distribution that you created.

In [None]:
np.prod(new_bc.pmf(data))

### Exercise: Which better explains the data?

Compare the two joint likelihoods. Which better explains the data?
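
One way to explore this beyond two candidates is to evaluate the joint likelihood over a grid of candidate values of `p` and see where it peaks. The following is a minimal sketch; the grid spacing and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

data = [1, 1, 1, 1, 0, 0, 1]

# Candidate values of p, avoiding the endpoints 0 and 1.
candidate_ps = np.linspace(0.01, 0.99, 99)

# Joint likelihood of the data under each candidate Bernoulli(p).
joint_likelihoods = np.array(
    [np.prod(stats.bernoulli(p).pmf(data)) for p in candidate_ps]
)

best_p = candidate_ps[np.argmax(joint_likelihoods)]
print(best_p, 5 / 7)
```

On this grid, the peak should land right next to $\frac{5}{7}$, the fraction of 1s in the data.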

## Bayes' Rule

Prior to reading this notebook, you may have seen Bayes' rule.
It's so famous, [it's even become a neon sign][bayesjpg]!

[bayesjpg]: https://en.wikipedia.org/wiki/File:Bayes%27_Theorem_MMB_01.jpg

Bayes' rule looks like this:

$$P(A|B) = 
\frac{P(B|A)P(A)}
{P(B)}
$$

It is a natural result that comes straight from the rules of probability,
being that the joint distribution of two random variables
can be written in two equivalent ways:

$$P(A, B) = P(A|B)P(B) = P(B|A)P(A)$$

Now, I have encountered many books that write,
regarding the application of Bayes' rule to statistical modelling,
something along the lines of the following:

> Now, there is an alternative _interpretation_ of Bayes' rule,
> one that replaces the symbol "A" with "Hypothesis",
> and "B" with the "Data", such that we get:
> 
> $$P(H|D) = \frac{P(D|H)P(H)}{P(D)}$$

At first glance, nothing seems wrong about this statement,
but I do remember having a lingering, nagging feeling
that there was a logical jump unexplained here.

More specifically, _why are we allowed to take this interpretation?_

It took putting the question to a mathematician friend, Colin Carroll,
to finally "grok" the idea.
Let me try to explain.

### Spaces of models and data

We have to think back to the fundamental idea of possible spaces.

If we set up a Bernoulli probability distribution with parameter $p$,
then the space of possible probability distributions that we could instantiate is infinite!
This result should not surprise you: $p$ can take on any one of an infinite set of values between 0 and 1, each one giving a different instantiated Bernoulli.
As such, a `Bernoulli(p)` hypothesis is drawn from a (very large) space of possible `Bernoulli(p)`s,
or more abstractly, hypotheses, thereby giving us a $P(H)$.

Moreover, consider our data.
The Bernoulli data that came to us, which for example might be `0, 1, 1, 1, 0`,
were drawn from a large space of possible configurations of data.
First off, there's no reason why we always have to have three 1s and two 0s in five draws;
it could have been five 1s or five 0s.
Secondly, the order of data (though it doesn't really matter in this case)
for three 1s and two 0s might well have been different.
As such, we have the $P(D)$ interpretation.

As a modelling decision, we _choose_ to say
that our data and model are jointly distributed
out of their joint space,
thus we have the _joint distribution_
between model and data, $P(H, D)$.
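
To make the formula concrete, here is a minimal sketch of Bayes' rule on a deliberately tiny hypothesis space. It assumes, purely for illustration, that only the two coins from the exercises (`p=0.2` and `p=5/7`) are possible, and that each gets equal prior credibility; neither assumption comes from the discussion above.

```python
import numpy as np
from scipy import stats

data = [1, 1, 1, 1, 0, 0, 1]

# Hypothesis space: two candidate values of p (an illustrative assumption).
hypotheses = np.array([0.2, 5 / 7])
prior = np.array([0.5, 0.5])  # P(H): equal credibility, also an assumption

# P(D|H): joint likelihood of the data under each hypothesis.
likelihood = np.array([np.prod(stats.bernoulli(p).pmf(data)) for p in hypotheses])

# P(D): total probability of the data across the hypothesis space.
evidence = np.sum(likelihood * prior)

# P(H|D) by Bayes' rule.
posterior = likelihood * prior / evidence
print(posterior)
```

The posterior is again an assignment of credibility points over the hypotheses that sums to 1, now shifted towards the hypothesis under which the data have the higher likelihood.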

## Conclusion

Enough has been written in this notebook.
What's much more exciting than learning about definitions
is learning how to use probability distributions
to simulate how our data were generated!

Let's go on to see this in action.