<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Bayesian Statistics

_Authors: Matt Brems (DC), Kiefer Katovich (SF)_

---

### Learning Objectives
- Review the axioms and properties of probability.
- Cover the formula for Bayes' theorem.
- Learn the diachronic interpretation of Bayes' theorem.
- Gain an intuition for the different components of the formula.
- Tackle the Monty Hall problem with Bayesian statistics.
- Complete some additional Bayesian statistics problems.

### Lesson Guide
- [Review of Probability](#review)
    - [Axioms of Probability](#axioms)
    - [Properties of Probability](#properties)
- [Bayes' Theorem](#bayes-rule)
    - [The Diachronic Interpretation](#diachronic)
- [Frequentist vs. Bayesian Probability](#freq-vs-bayes)
- [Bayes' Theorem in Parts](#parts)
- [The Monty Hall Problem](#monty-hall)
- [Additional Bayesian Statistics Problems](#additional)

<a id='review'></a>
## Review of Probability

---

### The Sample Space, Event Space, and Probability Function

With $S$ denoting the sample space and $F$ denoting the event space (the set of all possible events, where each event is a subset of $S$), we have a probability function $P$ defined as:

### $$ P: F \rightarrow [0, 1]$$

The probability function maps every event in the event space to a number in the interval from zero to one.

---
<a id='axioms'></a>
### Axioms of Probability

**Non-Negativity**

For any event, $A$, the probability of the event must be greater than or equal to zero.

### $$ 0 \le P(A) $$

**Unit Measure**

The probability of the entire sample space is one.

### $$ P(S) = 1 $$

**Additivity**

For mutually exclusive (or disjoint) events $E_1, E_2, \dots$, the probability that any one of them occurs is equal to the sum of their individual probabilities.

### $$ P\left(\cup_{i=1}^{\infty}\; E_i \right) = \sum_{i=1}^{\infty} P(E_i) $$

---
<a id='properties'></a>
### Properties of Probability

**The Probability of No Event**

The probability of the empty set, denoted $\emptyset$, is zero.

### $$ P\left(\emptyset \right) = 0 $$

**The Probability of $A$ or $B$ Occurring (Union)**

The probability of event $A$ or event $B$ occurring is equivalent to the sum of their individual probabilities, minus the probability of their intersection (the probability that they both occur). Subtracting the intersection prevents outcomes that belong to both events from being counted twice.

### $$ P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
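As a quick check of the union rule, here is a minimal sketch using a single roll of a fair six-sided die. The die, the events `A` and `B`, and the helper `prob` are illustrative assumptions and do not come from the lesson itself.

```python
from fractions import Fraction

# Assumed example: the sample space for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

# Left-hand side: probability of the union, computed directly.
p_union = prob(A | B)

# Right-hand side: sum of the probabilities minus the probability of the intersection.
p_inclusion_exclusion = prob(A) + prob(B) - prob(A & B)

print(p_union, p_inclusion_exclusion)  # both are 2/3
```

Using `Fraction` keeps the arithmetic exact, so the two sides can be compared with `==` rather than with a floating-point tolerance.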

**Conditional Probability**

The probability of an event that is conditional on another event is written using a vertical bar between the two events. The probability of event $A$ occurring _given_ that event $B$ occurs is calculated as:

### $$ P(A | B) = \frac{P(A \cap B)}{P(B)} $$

This represents the probability of both $A$ and $B$ occurring, divided by the probability that $B$ occurs at all. It is only defined when $P(B) > 0$.

**Joint Probability**

The joint probability of two events, $A$ and $B$, is a reformulation of the equation above.

### $$ P(A \cap B) = P(A|B) \; P(B) $$

If we want to know the probability that both $A$ and $B$ happen, we can multiply the probability that $B$ happens by the probability that $A$ occurs if $B$ does.
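The conditional and joint probability formulas can be sketched with the same kind of toy setup. The fair die, the events, and the helper `prob` below are assumptions chosen for illustration only.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

# Conditional probability: P(A | B) = P(A and B) / P(B).
p_a_given_b = prob(A & B) / prob(B)

# Joint probability recovered from the conditional: P(A and B) = P(A | B) * P(B).
p_joint = p_a_given_b * prob(B)

print(p_a_given_b)  # 2/3: two of the three outcomes in B are even
print(p_joint)      # 1/3: matches prob(A & B)
```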

**The Law of Total Probability**

Let's say we want to know the probability of the event $B$ occurring across _all_ different events, $A$. For example, let's say that we are a judge presiding over a murder trial. $B$ is the event that the suspect's wallet was found at the scene of the murder. We have many hypotheses or possible scenarios in which the wallet is found at the murder scene, one being that the suspect was actually at the scene of the crime at the time of the murder.

These different events $A_1, \dots, A_n$ (our scenarios) are disjoint, and together they cover every possibility. The _total probability_ of $B$ is the probability, summed across all of these scenarios, that the wallet was found at the murder scene. In other words, regardless of which scenario $A_i$ actually happened, what is the overall probability that the wallet was at the murder scene?

### $$ P(B) = \sum_{i=1}^n P(B \cap A_i) = \sum_{i=1}^n P(B|A_i) \; P(A_i) $$

![total probability](./assets/images/output_27_0.png)
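To make the law of total probability concrete, here is a small sketch for the wallet example. The three scenarios and every number in it are made up for illustration; the lesson does not supply any values.

```python
# Hypothetical scenarios (disjoint and exhaustive), each with:
#   prior      = P(A_i), the probability of the scenario
#   likelihood = P(B | A_i), the probability the wallet is found at the scene in that scenario
# All numbers below are invented for illustration.
scenarios = {
    "suspect at scene during the murder": {"prior": 0.3, "likelihood": 0.40},
    "suspect at scene at another time":   {"prior": 0.2, "likelihood": 0.30},
    "suspect never at scene":             {"prior": 0.5, "likelihood": 0.02},
}

# The scenario probabilities must sum to one for the law to apply.
total_prior = sum(s["prior"] for s in scenarios.values())

# P(B) = sum over i of P(B | A_i) * P(A_i)
p_wallet = sum(s["likelihood"] * s["prior"] for s in scenarios.values())

print(round(total_prior, 10))  # 1.0
print(round(p_wallet, 10))     # 0.19
```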

<a id='bayes-rule'></a>
## Bayes' Theorem

---

Bayes' theorem relates the probability of $A$ given $B$ to the probability of $B$ given $A$. This rule is critical for performing statistical inference, as we'll see shortly. It's formulated as:

### $$ P(A|B) = \frac{P(B|A)\;P(A)}{P(B)} $$

Let's return to the courtroom example.

Say $A$ is the event that the suspect is guilty.

$B$ is the event that the suspect's wallet was found at the scene of the crime.

Using Bayes' theorem, we phrase this as: The probability that the suspect is guilty given that the suspect's wallet was found at the scene of the crime is equivalent to the probability that the suspect's wallet was found there given that the suspect is guilty, times the probability that the suspect is guilty (without evidence), and divided by the total probability that the wallet is found at the scene of the crime.
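The courtroom statement above can be worked through numerically. This is only a sketch: the function name `bayes_posterior` and all three input probabilities are assumptions invented for the example.

```python
def bayes_posterior(prior, likelihood, likelihood_if_not):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A).

    P(B) is computed with the law of total probability over A and not-A.
    """
    p_b = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / p_b

# Hypothetical inputs, invented for illustration:
p_guilty = 0.30                  # P(A): belief in guilt before the wallet evidence
p_wallet_given_guilty = 0.40     # P(B | A)
p_wallet_given_innocent = 0.05   # P(B | not A)

p_guilty_given_wallet = bayes_posterior(
    p_guilty, p_wallet_given_guilty, p_wallet_given_innocent
)

print(round(p_guilty_given_wallet, 4))  # 0.7742
```

With these made-up numbers, finding the wallet raises the probability of guilt from 0.30 to roughly 0.77, because the wallet is much more likely to be at the scene if the suspect is guilty than if not.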

<a id='diachronic'></a>
### The Diachronic Interpretation of Bayes' Theorem

We can rewrite the formula for Bayes' theorem in the context of hypotheses and data, as we've already been doing with the courtroom example. "Diachronic" means relating to change _over time_: in this interpretation, the probability of a hypothesis changes over time as we collect new data.

In this case, we have a model or statistic, and we're asking for the probability of our model given the data that we've observed.

### $$P\left(model\;|\;data\right) = \frac{P\left(data\;|\;model\right)}{P(data)}\; P\left(model\right)$$


<a id='freq-vs-bayes'></a>
## Frequentist vs. Bayesian Probability

---

### Frequentism

Frequentists believe that the "true" value of a statistic about a population (for example, the mean) is fixed and unknown. We can infer more about this "true" value by engaging in sampling, testing for effects, and studying relevant parameters of the population.

Say we are flipping a coin and want to know the probability of heads. Frequentists formulate the probability of heads as a limit, deriving the true probability of heads from an infinite number of coin flips with that coin.

### $$P(\text{heads}) = \lim_{\text{# of coin flips} \to \infty} \frac{\text{# of heads}}{\text{# of flips}}$$

Alternatively, we can write this more generally as the proportion of times any event, $A$, occurs over an infinite number of observations/experiments (random samples from the sample space).

### $$P(A) = \lim_{\text{# of experiments} \to \infty} \frac{\text{# of occurrences of A}}{\text{# of experiments}} $$
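We can't flip a coin infinitely many times, but a simulation hints at what the limit looks like. The sketch below assumes a fair coin and a fixed random seed; the function name `running_proportion_of_heads` and the checkpoint sizes are illustrative choices.

```python
import random

def running_proportion_of_heads(n_flips, p_heads=0.5, seed=0):
    """Simulate coin flips and record the proportion of heads at a few checkpoints."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    checkpoints = (10, 100, 1_000, 10_000, 100_000)
    heads = 0
    proportions = {}
    for flip in range(1, n_flips + 1):
        if rng.random() < p_heads:  # this flip came up heads
            heads += 1
        if flip in checkpoints:
            proportions[flip] = heads / flip
    return proportions

proportions = running_proportion_of_heads(100_000)
for n, proportion in proportions.items():
    print(f"{n:>7} flips: proportion of heads = {proportion:.4f}")
```

The early proportions can be far from 0.5, while the later ones tend to settle near it. This is the behavior the frequentist limit describes.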

### Bayesianism

Bayesians believe that data inform us about the distribution of a statistic or event and that, as we receive more data, our view of the distribution can be updated, further strengthening or weakening our previous beliefs (but never with total certainty).

For the coin flip example above, we would write out the probability of heads as our belief in the probability of getting heads given the evidence we have from observing coin flips.

### $$ P(\text{heads} \;|\; \text{# of heads observed}) = \frac{P(\text{# of heads observed} \;|\; \text{heads})}{P(\text{# of heads observed})} P(\text{heads}) $$

Here, we're representing our updated belief in the probability of flipping heads using:

Our **prior** belief (before observing flips) of the probability of flipping heads: $P(\text{heads})$.

The **likelihood** of the data we observe given the chance to flip heads: $P(\text{# of heads observed} \;|\; \text{heads})$.

The **total probability** of observing that many heads in coin flips, regardless of weighting (or rather, across all coin weightings): $P(\text{# of heads observed})$.
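One way to turn these three pieces into numbers is to consider a grid of possible coin weightings and apply the formula to each one. This is a minimal sketch under stated assumptions: the data (7 heads in 10 flips), the grid of 101 weightings, and the uniform prior are all invented for illustration.

```python
from math import comb

# Hypothetical data: 7 heads observed in 10 flips.
n_flips, n_heads = 10, 7

# Candidate coin weightings (probability of heads): 0.00, 0.01, ..., 1.00.
weightings = [i / 100 for i in range(101)]

# Prior: before any flips, every weighting is treated as equally plausible (an assumption).
prior = [1 / len(weightings)] * len(weightings)

# Likelihood: binomial probability of the observed data under each weighting.
likelihood = [
    comb(n_flips, n_heads) * w**n_heads * (1 - w) ** (n_flips - n_heads)
    for w in weightings
]

# Total probability of the data, summed across all weightings.
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Updated belief in each weighting after seeing the data.
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

most_plausible = weightings[posterior.index(max(posterior))]
print(most_plausible)  # 0.7
```

Because the prior is flat, the most plausible weighting is simply the observed proportion of heads. A different prior would pull the result toward whatever that prior favors.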

<a id='parts'></a>
## Bayes' Theorem in Parts
---

Using the diachronic interpretation of Bayes' theorem, we can describe each part with its label, like in our coin flip example above.

### $$P\left(model\;|\;data\right) = \frac{P\left(data\;|\;model\right)}{P(data)}\; P\left(model\right)$$

**The Prior**

### $$ \text{prior} = P\left(model\right) $$

The prior is our belief in the model given no additional information. This model could be as simple as a statistic, such as the mean we're measuring, or a complex regression. 

**The Likelihood**

### $$ \text{likelihood} = P\left(data\;|\;model\right) $$

The likelihood is the probability of the data we observed occurring given the model. For example, assuming that a coin is biased toward heads with a mean rate of heads of 0.9, what is the likelihood that we observe 10 tails and two heads in 12 coin flips?

The likelihood is, in fact, the quantity that many frequentist statistical methods (maximum likelihood estimation, for example) work with directly.
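The coin question above has a direct answer if we assume the flips are independent, so that the number of heads follows a binomial distribution. The helper name `binomial_likelihood` and the comparison against a fair coin are additions for illustration.

```python
from math import comb

def binomial_likelihood(n_heads, n_flips, p_heads):
    """Probability of exactly n_heads heads in n_flips independent flips."""
    n_tails = n_flips - n_heads
    return comb(n_flips, n_heads) * p_heads**n_heads * (1 - p_heads) ** n_tails

# Two heads and 10 tails in 12 flips, under the heads-biased model from the text.
lik_biased = binomial_likelihood(n_heads=2, n_flips=12, p_heads=0.9)

# The same data under a fair coin, for comparison (not part of the original example).
lik_fair = binomial_likelihood(n_heads=2, n_flips=12, p_heads=0.5)

print(f"{lik_biased:.3e}")  # 5.346e-09
print(f"{lik_fair:.3e}")    # 1.611e-02
```

The data are millions of times more likely under the fair coin than under the heads-biased one, which is the kind of comparison the likelihood term contributes to Bayes' theorem.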

**The Marginal Probability or Total Probability of the Data**

### $$ \text{marginal probability of data} = P(data) $$

The marginal probability of the data is the probability that our data are observed regardless of what model we choose or believe in. We divide the likelihood by this value to ensure that we are only talking about our model within the context of the data occurring, and so that what we get on the other side is a true probability distribution (more on this later).

**The Posterior**

### $$ \text{posterior} = P\left(model\;|\;data\right) $$

The posterior is our _updated_ belief in the model given the new data we have observed. Bayesian statistics are all about updating a prior belief we have about the world with new data, so we're transforming our _prior_ belief into this new _posterior_ belief about the world.
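The idea that a posterior becomes the prior for the next round of data can be sketched as follows. The five candidate coin weightings, the flat starting prior, the two batches of flips, and the helper `update` are all assumptions made up for this example.

```python
from math import comb, isclose

def update(prior, weightings, n_heads, n_flips):
    """Apply Bayes' theorem once: return the posterior over the candidate weightings."""
    unnormalized = [
        comb(n_flips, n_heads) * w**n_heads * (1 - w) ** (n_flips - n_heads) * p
        for w, p in zip(weightings, prior)
    ]
    evidence = sum(unnormalized)  # total probability of the data
    return [u / evidence for u in unnormalized]

# Hypothetical models: five possible probabilities of heads, equally plausible at first.
weightings = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [1 / len(weightings)] * len(weightings)

# Update in two steps: 3 heads in 4 flips, then 4 heads in 6 more flips.
after_first_batch = update(prior, weightings, n_heads=3, n_flips=4)
after_second_batch = update(after_first_batch, weightings, n_heads=4, n_flips=6)

# Update once with all of the data: 7 heads in 10 flips.
all_at_once = update(prior, weightings, n_heads=7, n_flips=10)

same = all(isclose(a, b) for a, b in zip(after_second_batch, all_at_once))
print(same)  # True
```

Updating batch by batch and updating once with all of the data lead to the same posterior here, which is why yesterday's posterior can safely serve as today's prior.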

<a id='monty-hall'></a>

## The Monty Hall Problem
---

The Monty Hall problem is a famous probability problem with an unintuitive solution. Framing it in a Bayesian context makes it much clearer!

Open up the [Monty Hall notebook](./monty-hall.ipynb) and tackle the problem.

<a id='additional'></a>
## Additional Bayesian Statistics Problems
---

As independent practice, you can tackle some more Bayesian statistics problems, including the:
- Pregnancy screening problem.
- Cookie jar problem.
- German tank problem.
- Dungeons & Dragons dice problems.
- M&M's problem.

These problems can be found in [this notebook](bayes-problems.ipynb).