<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Probability and Statistics 01

In [None]:
%matplotlib notebook

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import math
from pathlib import Path
import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import utils.plot

In [None]:
sns.set_context(context="notebook", font_scale=1.7)

## Probability is a slippery concept, philosophically.

The Stanford Encyclopedia of Philosophy
[has an entire page devoted to it](https://plato.stanford.edu/entries/probability-interpret/).

### What does it mean to say 'the chance that I roll a die and it comes up 6 is 1/6'?

- If I were to prepare a large number of dice identical to this one and roll them all at once, about 1/6 of them would come up 6

This is the
[frequentist](https://en.wikipedia.org/wiki/Frequentist_probability)
view, associated in the 20th century with
[Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher),
[Jerzy Neyman](https://en.wikipedia.org/wiki/Jerzy_Neyman),
and [Egon Pearson](https://en.wikipedia.org/wiki/Egon_Pearson),
and earlier with
[Simeon Denis Poisson](https://en.wikipedia.org/wiki/Sim%C3%A9on_Denis_Poisson)
and [John Stuart Mill](https://en.wikipedia.org/wiki/John_Stuart_Mill).

It is a form of _objective_ probability.
In objective probability,
probabilities are properties of the world,
which we wish to measure and mathematize.

- If I were to examine all possible futures consistent with my beliefs right now, then 1/6 of them contain me rolling a die that comes up 6

This is a form of
[Bayesianism](https://en.wikipedia.org/wiki/Bayesian_probability),
a type of _subjective_ probability.
In subjective probability,
probability is used to model beliefs and uncertainty.

Despite its name, it was primarily developed by
[Pierre-Simon Laplace](https://en.wikipedia.org/wiki/Pierre-Simon_Laplace),
rather than
[Thomas Bayes](https://en.wikipedia.org/wiki/Thomas_Bayes).

- If I wish to think logically about the statement "the die comes up 6", then, given my current knowledge, I should assign the truth value 1/6 (where 1 is definitely true and 0 is definitely false)

This view, another flavor of Bayesianism, is due, most prominently, to
[Edwin T. Jaynes](https://en.wikipedia.org/wiki/Edwin_Thompson_Jaynes),
and is expounded in his book
[Probability Theory: The Logic of Science](https://bayes.wustl.edu/etj/prob/book.pdf).

- The mathematical rules that I need to apply to games involving rolling dice in order to avoid being cheated tell me to assign the number 1/6 to the event "the die comes up 6"

This is the ["Dutch Book"](https://en.wikipedia.org/wiki/Dutch_book) argument,
due to [Bruno de Finetti](https://en.wikipedia.org/wiki/Bruno_de_Finetti),
in the 20th century.
It combines the subjectivity of Bayesianism with an operational definition,
which brings many of the benefits of objectivity.

If it were discovered today, it might be called the "Wall Street View".

### $\pi$ makes the differences between these views clearer.

In [None]:
math.pi

What is the chance that the first digit of $\pi$ is odd?

What is the chance that a randomly chosen digit of $\pi$ is odd?

Now consider the $2 ^ {2 ^ {100}}$th digit.

What is the chance that it is odd?

In [None]:
2 ** 100

## We'll be as agnostic as we can.

Probabilities are the numbers we use to talk about _distributions_ of data,
whether those distributions are directly measured from data 
or are represented mathematically or by the behavior of a computer program.

We'll operate, in general, with a subjective definition,
a blend of the latter three,
but traditional statistics uses the first.

## Probability Distributions

### Discrete or Categorical Distributions

Categorical distributions have a finite number of observable values.

For example, if you were observing on which day of each week
the traffic on the Golden Gate Bridge was worst,
there would be seven observable values,
so you'd be dealing with categorical data.

Discrete distributions don't need to have a finite number of observable values.

For example, if you were observing the number of cars that crossed the Golden Gate Bridge each day,
you could see any number of cars greater than or equal to 0.

Discrete distributions that aren't categorical are easier to define as
"not continuous", the definition of which we'll see shortly.

The set of values that can be observed is also called the _support_,
a term you may see while looking through Python documentation.

In [None]:
N = 100
dice_rolls = pd.DataFrame([random.randint(1, 6) for _ in range(N)], columns=["roll"])
dice_rolls["count"] = 1
dice_rolls["frac"] = 1 / N
dice_rolls.head(5)

In [None]:
dice_rolls.groupby(["roll"]).sum()

A `groupby` operation collects up, in groups, all of the rows that share the same value for a given set of columns.
The values in the group can then be aggregated by calling a method, like `.sum`,
that collapses each group into a single value.

Note: pandas does not provide a visual representation of the result of a `groupby`.
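
As a minimal sketch of the group-then-aggregate step, the toy table below is grouped by `roll` and summed, so each group of rows collapses into one row. The values are made up for illustration and are not the `dice_rolls` data; the names `toy_rolls` and `toy_summary` are hypothetical.

```python
import pandas as pd

# Toy data (illustrative values only): five rolls, each contributing
# a count of 1 and a fraction of 1/5.
toy_rolls = pd.DataFrame({"roll": [1, 3, 3, 6, 3]})
toy_rolls["count"] = 1
toy_rolls["frac"] = 1 / len(toy_rolls)

# Rows sharing the same value of "roll" form a group;
# .sum() collapses each group into a single row.
toy_summary = toy_rolls.groupby("roll").sum()
print(toy_summary)
```

The index of `toy_summary` holds the distinct rolls, and the `frac` column holds the fraction of rows in each group.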

In [None]:
f, ax = plt.subplots(figsize=(8, 4))

In [None]:
distribution = dice_rolls.groupby("roll").sum()["frac"]
ax.bar(distribution.index, distribution);

Some points:
- Values can't be negative
- Values must sum up to 1
- Values can't be greater than 1

In [None]:
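def is_probability_mass_distribution(vals):
    # values can't be negative (zero is allowed)
    if not all(val >= 0 for val in vals):
        return False
    # values can't be greater than 1 (exactly 1 is allowed)
    if not all(val <= 1 for val in vals):
        return False
    # values must sum up to 1
    if not np.isclose(sum(vals), 1):
        return False
    return True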

In [None]:
is_probability_mass_distribution(distribution)

### Continuous Distributions from Histograms

The method above doesn't work for constructing distributions for data whose values are _continuous_,
which for us means infinite in number, with infinitely many possible values between any two of them.

For example, consider _reaction times_ in response to a stimulus,
here measured relative to some baseline value, so some are negative.

In [None]:
N = 100
reaction_times = pd.DataFrame([random.gauss(0, 0.1) for _ in range(N)], columns=["rt"])
reaction_times["count"] = 1
reaction_times["frac"] = 1 / N
reaction_times.head(5)

In [None]:
reaction_times.groupby("rt").sum().head()

Each value appears only once, with overwhelming probability.

The result is that, no matter what kind of data goes in,
the resulting distribution is _flat_.

In [None]:
f, ax = plt.subplots(figsize=(8, 4))

In [None]:
distribution = reaction_times.groupby("rt").sum()["frac"]
ax.bar(distribution.index, distribution);

If we use a rugplot instead,
we'll see that some intervals of a given length
contain more data points than others.

In [None]:
f, ax = plt.subplots(figsize=(8, 4))

In [None]:
sns.rugplot(reaction_times["rt"], ax=ax);

This suggests the notion of probability _density_:
"how tightly packed" are the output values,
rather than just "how many".

In [None]:
f, ax = plt.subplots(figsize=(8, 4))

In [None]:
sns.distplot(reaction_times["rt"], kde=False, rug=True, bins=25, norm_hist=False, ax=ax);

This can be represented with a histogram,
which counts up the number of values in each of a collection of bins.

This isn't yet a probability distribution, since we no longer guarantee that the heights add up to 1:
instead, they add up to the number of observations.

The argument `norm_hist=True` resolves this, but with a difference:
the _areas of the bars_ add up to 1,
rather than their heights.

In [None]:
heights, edges = np.histogram(reaction_times["rt"], density=True)

print(is_probability_mass_distribution(heights))

print(is_probability_mass_distribution(heights * (edges[1:] - edges[:-1])))

Run the above cell with and without the `density=True` keyword argument in the `np.histogram` call.
This argument does the same thing as `norm_hist`.

What might happen if we were to gather more and more data and use smaller and smaller bins?

In [None]:
N = 1000; bins = 50
vals = pd.DataFrame([random.gauss(0, 0.1) for _ in range(N)], columns=["vals"])
f, ax = plt.subplots(figsize=(8, 4))

In [None]:
sns.distplot(vals["vals"], kde=True, bins=bins, norm_hist=True, ax=ax);

Run the above cell with `N` up to 100000 and `bins` increasing as well,
no more than one tenth of `N`
(and probably no more than `1000`, unless you're feeling patient).

You should notice a shape starting to emerge.

If you pass `True` to the `kde` keyword argument,
seaborn will also try to estimate the density function,
using a method called `k`ernel `d`ensity `e`stimation.

Notice one important difference between the estimated density
and the histogram:
while the latter jumps around,
the former varies _smoothly_.
It's hard to imagine writing down a simple mathematical function
for the behavior of the histogram,
but it seems like it might be possible for the density.
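
One way to see where the smooth curve comes from is to build a kernel density estimate by hand: place a small bump on every data point and average the bumps. The sketch below assumes a Gaussian kernel and a hand-picked `bandwidth`; both are illustrative choices, and seaborn chooses its bandwidth differently. The function name `gaussian_kde_sketch` is hypothetical.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average of Gaussian bumps centered on each data point."""
    data = np.asarray(data)[:, None]   # shape (n_points, 1)
    grid = np.asarray(grid)[None, :]   # shape (1, n_grid)
    z = (grid - data) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=0)          # one density value per grid point

rng = np.random.default_rng(0)
# mean 0, standard deviation 0.1, like the data in the cells above
samples = rng.normal(0, 0.1, size=1000)
grid = np.linspace(-1, 1, 2001)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.03)

# Like a normalized histogram, the area under the curve is close to 1.
area = np.sum(density) * (grid[1] - grid[0])
print(area)
```

A larger `bandwidth` gives a smoother curve that hides detail; a smaller one gives a bumpier curve that follows individual data points.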

When we talk about the _true distribution_ of something we measure,
we mean the mass or density we would see if we measured it infinitely many times.

## Statistics is the art of going from data to probability.

In real life, we don't get to measure something infinitely many times,
so we don't get to ever know the true distribution.

And even if we could, we couldn't store or share that information
without summarizing it.

## Statistics are used primarily for two purposes:

- _description_, or summarization of the data that was observed

- _inference_, or moving beyond the data to knowledge of what data might be observed in the future

Today's lecture focuses on descriptive statistics.

The definition of a _statistic_ is:
- the output of any function we apply to our data

What are some statistics?

Examples that might come up:
- the smallest and largest value
- the "middle" value
- the first value we observed
- any of the statistics we discuss below
- all of the data
- a histogram
- a bootstrap sample
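
Since a statistic is just the output of a function applied to data, several of the examples above can be sketched directly. The data below is made up for illustration, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
observations = np.array([0.3, -0.1, 0.7, 0.2, -0.4])  # made-up data

# Each of these is a statistic: the output of a function applied to the data.
smallest, largest = observations.min(), observations.max()
middle = np.median(observations)
first = observations[0]
counts, edges = np.histogram(observations, bins=3)

# A bootstrap sample: resample the data with replacement, keeping the same size.
bootstrap_sample = rng.choice(observations, size=len(observations), replace=True)

print(smallest, largest, middle, first, counts, bootstrap_sample)
```

Some of these statistics are single numbers and others are whole arrays; the definition covers both.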

## Descriptive statistics comes from the need to summarize

Large datasets are unwieldy.

Luckily, we can often communicate the important information from a dataset using just a few numbers,
saving a ton of space and effort.

In [None]:
print(sys.getsizeof(reaction_times), sys.getsizeof(float(reaction_times["rt"].mean())))

`sys.getsizeof` returns the amount of memory used to store its argument, in bytes.

If this is used on `vals` generated with `N=100000`,
the outputs should be around 800,000 and 24.
800,000 bytes is about 6.4 million binary values,
while 24 bytes is only 192.

Which would you rather copy into a ledger to send someone?

### Measures of Location: Mean, Median

```python
# (data[0] + data[1] + ... + data[-1]) / len(data)
mean_data = sum(data) / len(data)
```

The mean is the sum of the datapoints divided by the number of datapoints.

```python
# for N odd:
median_data = sorted_data[N // 2]
# for N even:
median_data = sum(sorted_data[N // 2 - 1 : N // 2 + 1]) / 2
```

The median is the datapoint "in the middle" of all of the others:
as many are above it as below it.

When the number of datapoints is even,
the two middle values are averaged together to get the median.
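
The two definitions can be written as small runnable functions. This is a minimal sketch on made-up data; the names `mean_sketch` and `median_sketch` are hypothetical, and in practice `np.mean` and `np.median` do the same job.

```python
import numpy as np

def mean_sketch(data):
    # sum of the datapoints divided by the number of datapoints
    return sum(data) / len(data)

def median_sketch(data):
    sorted_data = sorted(data)
    N = len(sorted_data)
    if N % 2 == 1:
        # odd N: the single middle value
        return sorted_data[N // 2]
    # even N: the average of the two middle values
    return sum(sorted_data[N // 2 - 1 : N // 2 + 1]) / 2

odd_data = [5, 1, 4, 2, 100]   # made-up data with one very large value
even_data = [5, 1, 4, 2]

print(mean_sketch(odd_data), median_sketch(odd_data))
print(mean_sketch(even_data), median_sketch(even_data))
```

The single large value in `odd_data` pulls the mean far above the median, which stays in the middle of the other values.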

#### The mean and median can be derived as game-winning strategies.

Imagine we play a game:
you commit to a single guess,
then I pick elements from a dataset at random.
The smaller the difference between your guess and each element I pick,
the higher your score.

We can score the game two different ways:

In [None]:
def first_game_score(value, guess):
    return - (value - guess) ** 2

def second_game_score(value, guess):
    return - np.abs(value - guess)

In [None]:
def first_game_score_on_data(data, guess, N):
    return np.mean([first_game_score(value, guess) for value in data.sample(n=N, replace=True)])

def second_game_score_on_data(data, guess, N):
    return np.mean([second_game_score(value, guess) for value in data.sample(n=N, replace=True)])

In [None]:
data = pd.Series([np.exp(random.gauss(0, 1.2)) for _ in range(5000)])

In [None]:
guesses = np.linspace(0, 10, 50); N = 500000

first_game_scores = [first_game_score_on_data(data, guess, N) for guess in guesses]
second_game_scores = [second_game_score_on_data(data, guess, N) for guess in guesses]

In [None]:
f, axs = plt.subplots(figsize=(8, 8), nrows=3, sharex=True)

In [None]:
utils.plot.plot_score_comparison(data, guesses, first_game_scores, second_game_scores, axs)

The above plot shows the scores, on the first and second game,
of guesses between 0 and 10.
Notice that the score on the first game is approximately maximized by the mean,
while the score on the second game is approximately maximized by the median.

This is generally true:
the mean minimizes the sum of squared differences,
while the median minimizes the sum of absolute differences.
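
This claim can be checked on a small example without random sampling. The sketch below uses made-up data and a brute-force search over a grid of candidate guesses; the names `toy_data` and `candidate_guesses` are hypothetical, and the grid spacing is an arbitrary choice.

```python
import numpy as np

toy_data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up data with one large value
candidate_guesses = np.linspace(0, 100, 1001)     # candidates spaced 0.1 apart

# Total penalty of each candidate guess, summed over every datapoint.
differences = toy_data[None, :] - candidate_guesses[:, None]
squared_loss = (differences ** 2).sum(axis=1)
absolute_loss = np.abs(differences).sum(axis=1)

best_squared = candidate_guesses[np.argmin(squared_loss)]
best_absolute = candidate_guesses[np.argmin(absolute_loss)]

print(best_squared, toy_data.mean())       # squared penalty vs. the mean
print(best_absolute, np.median(toy_data))  # absolute penalty vs. the median
```

Minimizing a penalty is the same as maximizing the negated score used in the game above.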

More broadly,
it will often be useful to think of ideas in statistics and probability in terms of games,
with points and sometimes adversaries,
and then search for winning strategies.

### Measures of Scale: Variance, Standard Deviation

```python
# ((data[0] - mean) ** 2 + ... + (data[-1] - mean) ** 2) / N
var_data = sum((value - mean) ** 2 for value in data) / N
```

Note that the variance is just the mean of the squared differences from the mean,
so the variance can also be derived from a game like the mean and median.
In that game, you'd aim to guess _how far away the values are from the mean_.

```python
std_data = np.sqrt(var_data)
```

NOTE: pandas' `.var` and `.std` compute something slightly different:
they divide by `len(data) - 1` rather than `len(data)`.
We'll talk about why in a later lecture.
To calculate the variance the way we defined it, in pandas
you want `.var(ddof=0)`.
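
The difference is easy to see on a small made-up series. The sketch below compares the variance computed by hand with the pandas results, and with NumPy's `np.var`, whose default is `ddof=0`. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

toy_series = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up data

# Variance as defined above: mean of squared differences from the mean.
var_by_hand = ((toy_series - toy_series.mean()) ** 2).mean()

var_pandas_default = toy_series.var()              # divides by len(data) - 1
var_pandas_ddof0 = toy_series.var(ddof=0)          # divides by len(data)
var_numpy_default = np.var(toy_series.to_numpy())  # NumPy's default is ddof=0

print(var_by_hand, var_pandas_default, var_pandas_ddof0, var_numpy_default)
```

Only the pandas default differs from the hand-computed value, and the gap shrinks as the number of datapoints grows.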

### Skew

In [None]:
def skew(series):
    """NOTE: slightly different to pandas' definition of skew
    """
    N, mean, sd = len(series), series.mean(), series.std()
    running_skew = 0
    for elem in series:
        running_skew += ((elem - mean) / sd) ** 3
    return running_skew / N

Loose definition of skewness: are the values that lie far from the mean mostly above it (positive skew) or mostly below it (negative skew)?

In [None]:
skewness_df = pd.read_json(Path(".") / "data" / "skew_data.json")
skewness_df.sample(10)

In [None]:
skews = skewness_df.groupby(["dataset"]).skew()
skews

In [None]:
f, axs = plt.subplots(figsize=(8, 8), nrows=3, sharex=True)

In [None]:
utils.plot.skewness_comparison_plot(skewness_df, ["height", "cost of insurance", "wealth"], axs)

## Our statistics inherit their distributions from the data collection process.

Statistics are functions we apply to our data,
and our data will be different from experiment to experiment,
so the value of the statistic that we measure will be different from experiment to experiment.

The distribution of a statistic, for a given setting of the experimental parameters,
is called its _sampling distribution_.
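
When the data collection process can be simulated, a sampling distribution can be approximated by repeating the experiment many times and recording the statistic each time. The sketch below assumes a fair coin and uses the fraction of heads as the statistic; the settings `n_experiments` and `n_tosses` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experiments, n_tosses = 10000, 20  # illustrative settings

# Each row is one experiment: n_tosses fair coin tosses (1 = heads, 0 = tails).
tosses = rng.integers(0, 2, size=(n_experiments, n_tosses))

# The statistic: fraction of heads in each experiment.
fraction_heads = tosses.mean(axis=1)

# The values of the statistic across experiments approximate its sampling distribution.
print(fraction_heads.mean(), fraction_heads.std())
```

A histogram of `fraction_heads` would show the approximate shape of the sampling distribution for this setting of the experimental parameters.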

Along with the slides this week, there's a demo, called
`Demo - Sampling Distributions.ipynb`,
that calculates some sampling distributions for a few statistics of a series of coin tosses.

Note: we rarely get to actually see the sampling distributions of our statistics in real settings.

On Wednesday, we will talk about one way that we can _estimate_ the sampling distribution:
**bootstrapping**.

We will also talk about how we can use inference to mitigate the uncertainty introduced by the sampling distribution.