<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Models and Random Variables 01

In [None]:
%matplotlib notebook

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed

In [None]:
import random

import daft
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

## There are many ways to think inferentially

### We can think inferentially with bootstrapping

Bootstrapping lets us think inferentially by representing our uncertainty by _resampling_.

In [None]:
data = pd.read_json("./data/bootstrap_example.json")

In [None]:
bootstrap_f, bootstrap_ax = plt.subplots(figsize=(6, 3))
sns.distplot(data, hist=False, rug=True); plt.gca().set_yticks([]);
bootstrap_ax.vlines(np.mean(data["X"]), -0.1, 0.1,
           color="r", zorder=np.inf, lw=4);

In [None]:
# each time this cell is run, a new bootstrap sample is drawn and plotted, along with its mean
bootstrap = data.sample(frac=1., replace=True)
sns.distplot(bootstrap, color="k", hist=False, kde_kws={"alpha": 0.25}, ax=bootstrap_ax)
bootstrap_ax.vlines(np.mean(bootstrap["X"]),
           -0.1, 0.1, color="k", alpha=0.75);

The core idea of bootstrapping was that if the inference we wanted to make was true _for enough of the resamples we drew_, we could feel confident in making that inference.
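As a rough sketch of that idea in plain Python: the numbers in `observed` are made up for illustration (they are not the contents of the JSON file above), and `bootstrap_mean` is a hypothetical helper. The inference being checked is "the mean is above 3".

```python
import random
import statistics

# Hypothetical measurements, standing in for data["X"] above.
observed = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]

rng = random.Random(0)  # seeded so the sketch is repeatable


def bootstrap_mean(values, rng):
    """Resample with replacement (same size as the data) and return the mean."""
    resample = rng.choices(values, k=len(values))
    return statistics.mean(resample)


bootstrap_means = [bootstrap_mean(observed, rng) for _ in range(2000)]

# Fraction of resamples for which the inference "the mean is above 3" holds.
fraction_above_3 = sum(m > 3 for m in bootstrap_means) / len(bootstrap_means)
print(fraction_above_3)
```

If the fraction is close to 1, the inference holds for nearly all of the resamples we drew.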

How might we use bootstrapping to answer these inferential questions?

- is this jury selection data compatible with the hypothesis that juries are selected fairly across racial lines?

- do individuals shown pictures of angry faces show more brain activity in their amygdala?

### We can think inferentially with models

In this class, we will use _models_ to think inferentially.

Like bootstrapping, models will generate _samples_, or "fake"/simulated data points.

Unlike bootstrapping, models will generate samples that _don't look exactly like data we observed_.

Instead, they will generate samples _according to a set of rules_. For us, those rules will be Python commands.

Here's a very simple model of a deceptively simple process: flipping a coin.

In [None]:
random.choice(["Heads", "Tails"])

A good model has samples that are _hard to distinguish from reality_.

The best models have rules that _also mimic reality_.

## Models can be mathematical or computational

### Mathematical models are defined by symbols

Instead of Python commands, we could use mathematical symbols to represent the rules of our model.

$p(\texttt{Heads}) = p(\texttt{Tails}) = 0.5$

By using mathematical symbols, we make it possible to _reason_ about our model.

For example, we can calculate the chance that we see five heads in five tosses, under that model.

This is a major focus of _probability theory_, as in e.g. [Prob 140](http://prob140.org/logistics/).
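The five-heads calculation can be sketched as follows. It assumes, in addition to the rule above, that the tosses don't influence each other, so the chances multiply. The simulation is only a cross-check using the `random.choice` coin model from earlier.

```python
from fractions import Fraction
import random

p_heads = Fraction(1, 2)

# Assumption: tosses don't influence each other, so chances multiply.
p_five_heads = p_heads ** 5

# Cross-check by simulation with the coin-toss model from earlier.
rng = random.Random(0)
n_trials = 100_000
n_all_heads = sum(
    all(rng.choice(["Heads", "Tails"]) == "Heads" for _ in range(5))
    for _ in range(n_trials)
)
estimate = n_all_heads / n_trials

print(p_five_heads, float(p_five_heads), estimate)
```

The math gives the exact answer, $1/32$, while the simulation only gets close to it.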

### Mathematical models are powerful but limited

Until the advent of computers, mathematical models were the only game in town.

The whole edifice of modern science is built on using mathematical models to think inferentially.

This has advantages: mathematical models can give exact answers, and writing down a mathematical model requires clarity of thought that can bring tremendous insight.

#### The Binomial Distribution: $$p(k) = \binom{N}{k} q^{k}(1-q)^{N-k}$$

#### The Gaussian Distribution: $$p(x) = \frac{1}{\sqrt{2 \pi}\sigma} \mathrm{e}^{\frac{-(x -\mu)^2}{2\sigma^2}}$$
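As a sketch, both formulas can be typed almost directly into Python. The function names `binomial_pmf` and `gaussian_pdf` are our own choices for illustration, not part of any library.

```python
import math


def binomial_pmf(k, N, q):
    """Chance of exactly k successes in N tries, each with chance q."""
    return math.comb(N, k) * q**k * (1 - q) ** (N - k)


def gaussian_pdf(x, mu, sigma):
    """Density at x of the Gaussian with mean mu and standard deviation sigma."""
    normalizer = math.sqrt(2 * math.pi) * sigma
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / normalizer


print(binomial_pmf(5, 5, 0.5))                            # five heads in five fair tosses
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))   # chances over all k add up to 1
print(gaussian_pdf(0.0, 0.0, 1.0))                        # peak height of the standard Gaussian
```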

But it has disadvantages too: there are many processes too complicated to describe mathematically.

#### The _Angry Birds_ Distribution

In [None]:
Image("./img/angry_birds.jpg")

Source: https://gamesdb.launchbox-app.com/games/images/2266

Say I choose the birds in a random order. What is the distribution of scores?

Note: contemporary game-playing AIs are trained in something like this fashion.

### Computational models are defined by programs

In cases like _Angry Birds_, the model's distribution is so complicated we can't even write it down!
Forget doing math on it.

Still, we might be able to come up with a _computer program_ that can draw samples according to the model's distribution.
These will be the primary models we'll work with in this course.

The recent advent of both ubiquitous, powerful, and easy-to-program computers and large, complicated data sets means that these models can be used to great effect, promising to fundamentally change how inference is done in science and industry.

#### But traditional mathematical models are still useful

We will still connect back to traditional models as often as we can in this course.

First, when it works, the traditional approach will often be faster, cleaner, and more insightful than the sampling approach.

Second, this will help you communicate your inferences with folks trained in the traditional fashion and understand the inferences they have drawn.

## Our models are made of random variables

We will build _models_ by combining together _random variables_.

The random variables and how they are combined will be described with Python code.

### Random variables are variables whose values are random

Randomness is hard to define exactly. It's easy to show:

In [None]:
random.random()

Typically, random values in Python are sampled using the `random` or `numpy.random` libraries. These libraries contain _random number generators_, or functions that return different, unpredictable values every time they are run.
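One detail worth a quick sketch: these generators are _pseudo_-random. The values look unpredictable, but they are computed from an internal state, so starting two generators from the same _seed_ replays the same sequence. The seed value `42` below is arbitrary.

```python
import random

rng = random.Random(42)        # a generator with a fixed starting state
first = rng.random()
second = rng.random()          # the next value from the same generator

rng_again = random.Random(42)  # same seed, so the same sequence
replay = rng_again.random()

print(first, second, replay)
```

Here `replay` matches `first`, which is handy when we want a "random" result that we can reproduce later.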

### First pass: random means "unknown" or "unpredictable"

Random variables will correspond to phenomena in the real world. 
These will almost always be phenomena whose values we can't predict or know exactly, for one reason or another.

Examples:
- the fraction of voters who will vote for Candidate A
- the side of the coin that will face up after I toss it
- the average effect on alertness of a single cup of coffee.

Sometimes, they will be unpredictable because something is out of our control or knowledge.

Other times, they will be unpredictable because they are inherently unmeasurable or unknowable.

### Random variables will be represented with circles

We will often draw our models so that we can think about them more clearly. We will call these drawings _graphs_.

When we make a graph, a random variable will be represented by a circle with a label. In the terminology of graphs, these are called _nodes_.

Here's a drawing of our example "coin toss" model.

In [None]:
coin_toss_node = daft.Node("coin_toss", "coin toss", 1, 1, scale=2.)

coin_toss_model_graph = daft.PGM([2, 2])

coin_toss_model_graph.add_node(coin_toss_node)

coin_toss_model_graph.render();

Not a lot going on here.

### Random variables can influence each other

Say I throw a dart at a board, and then, if I bullseye (which I do 1/4 of the time) I roll a die. If I don't bullseye, I flip a coin.

In this case, the value of one random variable (whether the dart hits the bullseye) influences the value of a different random variable (the outcome of the roll or flip).

The Python program below models this process.

In [None]:
def dart_throw_bullseye():
    if random.random() < .25:
        return True
    else:
        return False

def dart_and_roll():
    if dart_throw_bullseye():
        return random.choice([1, 2, 3, 4, 5, 6])
    else:
        return random.choice(["Heads", "Tails"])
    
dart_and_roll()
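
To see the influence at work, here is a sketch that runs the same process many times. The functions are repeated so the block stands alone, and a seeded generator `rng` is our addition to make the run repeatable.

```python
import random

rng = random.Random(0)


def dart_throw_bullseye():
    return rng.random() < 0.25


def dart_and_roll():
    if dart_throw_bullseye():
        return rng.choice([1, 2, 3, 4, 5, 6])
    return rng.choice(["Heads", "Tails"])


results = [dart_and_roll() for _ in range(10_000)]

# A number can only come from the die, which is only rolled after a bullseye.
fraction_die = sum(isinstance(r, int) for r in results) / len(results)
print(fraction_die)
```

The fraction of numeric results should land near the bullseye rate of 1/4.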

### Influences will be represented with arrows

When one random variable influences another, we will draw an arrow between their circles.

In [None]:
dart_node = daft.Node("dart_throw", "dart throw", 1, 1, scale=2.)
roll_node = daft.Node("roll", "roll", 3, 1, scale=2.)

dart_and_roll_model_graph = daft.PGM([5, 2])

dart_and_roll_model_graph.add_node(dart_node)
dart_and_roll_model_graph.add_node(roll_node)

dart_and_roll_model_graph.add_edge("dart_throw", "roll")

dart_and_roll_model_graph.render();

## Random variables can be combined

Just like we can take regular numerical variables and combine them algebraically, resulting in a new variable,

Math:
$$z = x + y$$
`Python`:
<center>
    <tt>z = x + y</tt>
</center>

we can take numerical random variables and combine them algebraically, resulting in a new random variable:

In [None]:
def x():
    return random.random()

def y():
    return random.random()

def z():
    return x() + y()

z()
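
As a small check that `z` really is a new random variable with its own behavior, here is a sketch that estimates the chance that `z()` is below `0.5`. For two independent values drawn uniformly between 0 and 1, geometry gives $0.5^2 / 2 = 0.125$ (the area of a small triangle in the unit square). The seeded generator `rng` is our addition.

```python
import random

rng = random.Random(0)


def x():
    return rng.random()


def y():
    return rng.random()


def z():
    return x() + y()


n_samples = 20_000
fraction_below_half = sum(z() < 0.5 for _ in range(n_samples)) / n_samples
print(fraction_below_half)
```

Note that `x()` alone is below `0.5` about half the time, so the sum behaves quite differently from its parts.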

### Combinations will be represented by arrows coming together

In [None]:
x_node = daft.Node("X", "X", 1, 3)
y_node = daft.Node("Y", "Y", 1, 1)
z_node = daft.Node("Z", "Z", 2, 2)

sum_model_graph = daft.PGM([3, 4])

sum_model_graph.add_node(x_node)
sum_model_graph.add_node(y_node)
sum_model_graph.add_node(z_node)

sum_model_graph.add_edge("X", "Z")
sum_model_graph.add_edge("Y", "Z")

sum_model_graph.render();

Note: this rule for drawing graphs is a consequence of our last rule!

## Random variables can also be transformed

We can also apply whatever Python transformations we want to our variables:

In [None]:
def random_plus_minus_one():
    return 2 * random.random() - 1

def random_average():
    return 1 / 2 * (random.random() + random.random())

random_average()

## And we can mix transformations and combinations

In [None]:
def a():
    return random.random()

def b():
    return 10 * random.random()

def c():
    return a() + b() - 1

Combinations and transformations will change the values of a random variable.
So even though we start with distributions we can understand pretty well,
we can end up with very complicated distributions!

This is similar in principle to the way a Python program, which can be as complex as Twitter or a whole operating system, is built up by combinations and transformations of a few simple things, like `True` and `False`, numbers, and library functions.

If we want to see what kinds of values a random variable typically takes on,
we can construct the histogram of the values from repeated runs,
just as you constructed histograms of your data in data8.

In [None]:
def random_histogram(random_variable, sample_size=100, ax=None):
    if ax is None:
        plt.figure(); ax = plt.gca()
    samples = [random_variable() for _ in range(sample_size)]
    return ax.hist(samples, histtype="step", lw=4) 

random_histogram(random_plus_minus_one);

### That means we can think of almost anything as a random variable...

But notice that the histogram is different each time you execute the cell above.
Note also that the histogram is the output of a Python function,
so it is also a random variable! We can get a rough sense for how those histograms vary with different data by drawing a bunch of them:

In [None]:
random_histogram(random_average)
[random_histogram(random_average, ax=plt.gca()) for _ in range(9)];

### ... and still think inferentially!

At least qualitatively: the histogram seems centered at `0.5` and rarely produces values close to `0` or `1`.

Importantly: this works even though we can't easily write down a probability distribution over histograms.
All we need to do is to generate a large enough number of samples
and then examine the results.
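The same "sample, then examine" recipe works without any plotting. This sketch draws many values of `random_average` and summarizes them with two numbers; the cutoffs `0.1` and `0.9` are arbitrary choices for "close to `0` or `1`".

```python
import random
import statistics

rng = random.Random(0)


def random_average():
    return 1 / 2 * (rng.random() + rng.random())


samples = [random_average() for _ in range(10_000)]

sample_mean = statistics.mean(samples)
fraction_extreme = sum(s < 0.1 or s > 0.9 for s in samples) / len(samples)

print(sample_mean, fraction_extreme)
```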

## We will use `pyMC` to build models

The pure-Python approach suffices for coming up with even very complicated random variables.
We just need to keep defining more functions that combine random variables together.

It becomes harder, however, if we want to know what the simultaneous values of several random variables are.

Consider: if the output of `c()` is `5`, is that because `a()` was `0.5` and `b()` was `5.5`, or because `a()` was `0.9` and `b()` was `5.1`?

In [None]:
def a():
    return random.random()

def b():
    return 10 * random.random()

def c():
    return a() + b() - 1
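
Part of the difficulty is that `c()` draws fresh values of `a()` and `b()` internally and then throws them away. One pure-Python workaround, sketched here with a hypothetical `joint_draw` function, is to run the whole model once and hand back every value from that run.

```python
import random


def joint_draw():
    """Run the model once and report all three values from that same run."""
    a = random.random()
    b = 10 * random.random()
    c = a + b - 1
    return {"a": a, "b": b, "c": c}


draw = joint_draw()
print(draw)
```

This works, but we would have to rewrite it by hand for every new model, which gets tedious quickly.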

Furthermore, we'd like to use our models to determine what values of our random variables are plausible by comparing them to data.
Say we measure only `b()`, getting a value of `8`.
Is it more plausible that `c()` is `7.5` or `6`?
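For a model this small, we can sketch an answer by hand: hold `b` at its measured value and sample only what is still unknown. This shortcut is our own illustration and works only because `b` feeds directly into `c`. In general, working backwards from measurements is much harder.

```python
import random

rng = random.Random(0)

measured_b = 8  # the value we observed for b


def c_given_b(b):
    """Draw c with b held fixed; only a is still random."""
    a = rng.random()
    return a + b - 1


c_samples = [c_given_b(measured_b) for _ in range(1_000)]
print(min(c_samples), max(c_samples))
```

Every simulated value of `c` lands between `7` and `8`, so the measurement rules out many values that were possible beforehand.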

So in order to build models,
we will use the Python package
[pyMC](https://docs.pymc.io), or `pm` for short.

"MC" could be interpreted to stand for
"Monte Carlo", a prominent and long-standing casino in Monaco, near the border of Italy and France.
Loosely speaking, an algorithm that uses random samples is a _Monte Carlo algorithm_.
It might also stand for
"Markov Chain", which is the type of random process that pyMC uses to generate its samples.
The full technique that pyMC uses is called "Markov Chain Monte Carlo", or "MCMC".
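A classic small example of a Monte Carlo algorithm, sketched here, estimates $\pi$ from random points: a point dropped uniformly in the unit square falls inside the quarter circle of radius 1 with chance $\pi / 4$. This is plain Monte Carlo with independent samples, not the Markov chain method that pyMC uses.

```python
import random

rng = random.Random(0)

n_points = 100_000
n_inside = 0
for _ in range(n_points):
    x, y = rng.random(), rng.random()
    if x * x + y * y <= 1.0:  # inside the quarter circle
        n_inside += 1

pi_estimate = 4 * n_inside / n_points
print(pi_estimate)
```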

Models are built by calling functions from pyMC that build random variables while inside a `with` block that names the model:

```python
model1 = pm.Model()
model2 = pm.Model()

with model1:
    X = pm.Function("X")
    
with model2:
    Y = pm.OtherFunction("Y")
```

The `with` block tells pyMC to which model we are currently adding variables, so the result of running the Python code above would be to add the variable returned by `pm.Function` to `model1` with the name `X` and the variable returned by `pm.OtherFunction` to `model2` with the name `Y`.
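To demystify how a function call can "know" which `with` block it is inside, here is a toy sketch. `ToyModel` and `toy_variable` are hypothetical names invented for this illustration. This is not pyMC's actual implementation, only one simple way such a mechanism could work.

```python
class ToyModel:
    """A toy stand-in for a model; NOT pyMC's actual implementation."""

    _active = []  # stack of models whose `with` block is currently open

    def __init__(self):
        self.variables = {}

    def __enter__(self):
        ToyModel._active.append(self)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        ToyModel._active.pop()
        return False  # don't swallow exceptions


def toy_variable(name):
    """Hypothetical helper: register `name` with the innermost open model."""
    if not ToyModel._active:
        raise RuntimeError("no model is open; use a `with` block")
    model = ToyModel._active[-1]
    if name in model.variables:
        raise ValueError(f"variable {name!r} already exists in this model")
    model.variables[name] = f"<toy variable {name}>"
    return model.variables[name]


model1 = ToyModel()
model2 = ToyModel()

with model1:
    X = toy_variable("X")

with model2:
    Y = toy_variable("Y")

print(list(model1.variables), list(model2.variables))
```

Entering the `with` block records which model is open, and leaving it removes that record, so each variable ends up in the right place.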

## Let's build the littlest model

Let's build one of the simplest models imaginable: a single random variable that is either `0` or `1`, each with a probability of 50%.

We can think of this as a model for a single coin toss, where `0` means "heads" and `1` means "tails".

In [None]:
coin_toss_model = pm.Model()

with coin_toss_model:
    coin_toss = pm.Categorical(name="coin_toss", p=[1 / 2, 1 / 2])

We can then sample from a given model by calling the function `pm.sample` inside another `with` block:

In [None]:
with coin_toss_model:
    coin_toss_samples = pm.sample(n_init=0, chains=1, tune=0)

Don't worry about the arguments to `pm.sample` for now.
pyMC is a sophisticated and powerful library for sampling from complicated models, and the arguments to `pm.sample` are there to allow lots of flexibility and performance that we don't need just yet.

The values used above are reasonable defaults for now.
To keep things simple, we'll define a function to take in a model and call `pm.sample` on that model with those values for the arguments:

In [None]:
def sample_from(model):
    with model:
        samples = pm.sample(n_init=0, chains=1, tune=0)
    return samples

In [None]:
coin_toss_samples = sample_from(coin_toss_model)

`pm.sample` returns something like a dictionary.
The keys are the names of the random variables
and the values are the samples of that variable:

In [None]:
coin_toss_samples["coin_toss"]

A couple of "gotchas" for using pyMC:

1. For technical reasons, it's not a good idea to run more than one notebook using pyMC at the same time. If you are running a `with` block in one notebook and then try to also run a `with` block in another, you'll get a warning message and the second notebook will have to wait for the first to finish before it can run.
1. Trying to add a variable more than once will cause an error, so be careful when copying and pasting code to build a model.

We can put these samples into a dataframe and treat them as though they were real data that we gathered from an actual series of coin tosses.

From there, we can do descriptive statistics on our sample.

In [None]:
def samples_to_dataframe(samples):
    return pd.DataFrame([sample for sample in samples])

def add_counts(data):
    data["count"] = np.ones(len(data))
    return data

In [None]:
coin_toss_data = add_counts(samples_to_dataframe(coin_toss_samples))

coin_toss_data

In [None]:
counts = coin_toss_data.groupby("coin_toss").count()["count"]

plt.figure()
plt.bar(counts.index, counts / sum(counts));

### We can still combine and transform random variables in `pm`

Just like variables in algebra, random variables in `pm` models can be manipulated with math: added, subtracted, divided, multiplied, etc.

That is, if we combine two random variables with math, the result is a new random variable:

$$Z := X + Y$$

just like we could define
$$z = x + y$$
in algebra

or
```python
z = x + y
```
in pure Python.

We do this in `pm` by adding a variable with the function `pm.Deterministic`, which we apply to the Python expression for our random variable.

As an example, let's imagine we'd like to toss a coin twice and then count the number of tails. Since we're representing tails as `1`, that's just equal to the sum of the two random variables:

In [None]:
num_tails_model = pm.Model()

with num_tails_model:
    X = pm.Categorical(name="first_coin_toss", p=[1 / 2] * 2)
    Y = pm.Categorical(name="second_coin_toss", p=[1 / 2] * 2)
    Z = pm.Deterministic(name="number_of_tails", var=X + Y)

In [None]:
num_tails_samples = sample_from(num_tails_model)

In [None]:
num_tails_data = add_counts(samples_to_dataframe(num_tails_samples))

counts = num_tails_data.groupby("number_of_tails").count()["count"]

plt.figure()
plt.bar(counts.index, counts / sum(counts));
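
For a model this small, we can connect back to the traditional approach and work out the exact answer by listing every possibility. This sketch uses only the standard library and assumes, as in the model above, that the two tosses are fair and don't influence each other.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of two tosses (0 = heads, 1 = tails).
outcomes = list(product([0, 1], repeat=2))

# Count how many outcomes give each number of tails.
tails_counts = Counter(x + y for x, y in outcomes)

exact = {k: Fraction(n, len(outcomes)) for k, n in sorted(tails_counts.items())}
print(exact)
```

The bar heights from the sampled data should be close to these exact values: 1/4, 1/2, and 1/4.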