<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Midterm Review of Modeling Concepts

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

import daft
import IPython.display as display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

# We make _Models_ out of _Random Variables_ that have _Distributions_

Our core activity is to build models.
When I say "model", I usually mean an instance of the `pm.Model` class.

In [None]:
type(pm.Model())

### A model is a collection of interrelated random variables.

### A random variable is anything with a _distribution_.

Literally: a `pm.Model` is made up of things that have `pm.Distribution`s.

In [None]:
(isinstance(pm.Normal.dist(mu=0, sd=1), pm.Distribution),
 isinstance(pm.DiscreteUniform.dist(lower=0, upper=1), pm.Distribution))

That's why the models we are building are sometimes called _probabilistic models_:
they are made up of things that have _probability distributions_ associated with them.

Technically, `Flat` and `HalfFlat` are distributions but not probability distributions,
and so any model with those kinds of RVs in it is not a true probabilistic model.

## Random does not mean arbitrary, nor does it mean without any structure.

In colloquial, everyday speech,
_random_ is often taken to mean either
_it doesn't matter which value is chosen_, as in "pick a card, any card, at random"
or
_there is no structure to the values_, as in "the power outages are occurring at random".

In this class, random means something different.

# There are two kinds of random variables.

Both have distributions, but those distributions mean different things.

# Sometimes, a random variable is a variable whose value is different each time we measure it.

  - **Result of a dice roll**: dependent on the precise position of my hand, the air currents, the shape of the table (lec03, lab03)

  - **Score of a participant on a task**: dependent on their attentiveness, their diligence, their skill at the task, and indirectly on any factors that determine those factors (hw02, lab06)

  - **Response of a patient to a drug**: dependent on their genetic makeup, on their environment, on their mental state, on the quality of the pills they are given (hw03)

Can you think of any other examples from the lectures, homework, or labs?

### In this case, the distribution describes the frequency with which we observe each possible value of the variable.

It is a function that, for each possible value of the variable,
gives a number that tracks the frequency that that value is observed:
it is larger when that value of the variable is observed with higher frequency
and smaller when that value of the variable is observed with lower frequency.

In [None]:
def make_normal_distribution(mu=0, sd=1):
    logp = pm.Normal.dist(mu=mu, sd=sd).logp
    # technically, pyMC works with _logs of distributions_,
    #  so we have to "remove the log" with an exponential
    def compute_normal_distribution(x):
        return np.exp(logp(x).eval())
    return compute_normal_distribution

In [None]:
standard_normal_distribution = make_normal_distribution(mu=0, sd=1)

In [None]:
standard_normal_distribution_scipy = scipy.stats.norm(loc=0, scale=1).pdf

In [None]:
assert np.isclose(standard_normal_distribution_scipy(-10),
                  standard_normal_distribution(-10))
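
As a cross-check that relies on neither library, the same density can be written directly from the textbook formula for the Normal distribution. This is only a sketch: `normal_pdf` is a hypothetical helper name, not something defined in pyMC or SciPy.

```python
import math

def normal_pdf(x, mu=0.0, sd=1.0):
    """Density of a Normal(mu, sd) distribution at x (hypothetical helper)."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# the density is largest at the mean and shrinks as we move away from it
print(normal_pdf(0.0), normal_pdf(1.0), normal_pdf(3.0))
```

The values should agree with those returned by the pyMC-based and SciPy-based functions above, up to floating point error.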

For example, we might measure the amplitudes of a particular "brain wave"
in study participants who are quietly sitting in a chair, focusing on their breathing.

By "amplitude of a brain wave" I mean the peak amplitude in a given frequency band of an EEG measurement,
e.g. the 4-16 Hz band.
A participant focusing on their breathing would be engaging
what is called by some the
[default-mode network](https://www.ncbi.nlm.nih.gov/pubmed/26738043)
of brain regions.

In [None]:
f, ax, = plt.subplots(figsize=(12, 6))
xs = np.linspace(-5, 5, 1000)
ax.plot(xs, standard_normal_distribution(xs), lw=4);
ax.set_xlabel("Brain Wave Amplitude Observed");
ax.set_ylabel("Proportional to Frequency\nBrain Wave Amplitude is Observed");

What this distribution is saying is that it is _less frequently the case_ that values around 3 are observed,
while it is _more frequently the case_ that the values around 1 are observed
and _even more frequently the case_ that values around 0 are observed.
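
One way to make the frequency reading concrete is to draw many values and count how often they land near 0, near 1, and near 3. The sketch below uses only the standard library; the helper `fraction_near` and the window width of 0.25 are illustrative choices, not anything from the course code.

```python
import random

rng = random.Random(0)  # fixed seed so the counts are reproducible
draws = [rng.gauss(0, 1) for _ in range(100_000)]

def fraction_near(values, center, half_width=0.25):
    """Fraction of values that fall within half_width of center."""
    hits = sum(1 for v in values if abs(v - center) <= half_width)
    return hits / len(values)

near_0, near_1, near_3 = (fraction_near(draws, c) for c in (0, 1, 3))
print(near_0, near_1, near_3)
```

The three fractions should come out in decreasing order, mirroring the height of the curve at 0, 1, and 3.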

Or, more generically:

In [None]:
f, ax, = plt.subplots(figsize=(12, 6))
xs = np.linspace(-5, 5, 1000)
ax.plot(xs, standard_normal_distribution(xs), lw=4);
ax.set_xlabel("Value Observed");
ax.set_ylabel("Proportional to Frequency\nValue is Observed");

## In probability terms, these distributions tell us how _probable it is to observe a value_.

## This type of random variable most often appears in the _likelihood_ of our model.

The _likelihood_ is the piece of our model that connects to observable data,
and so those random variables usually represent the uncertainty in the frequency with which
values will be observed.

# Sometimes, a random variable is a variable whose value is fixed, but unknown to us.

  - The **average response of patients to a drug**, the **variability in patient responses to drugs**
      - The average effect of caffeine on alertness (lab03)

  - The **probability of an outcome**, e.g. any of the outcomes we've considered so far
      - The probability a hip-hop lyric will contain positive sentiment about Donald Trump (lec06)
      - The probability that a statistical test will give a negative result (lab05)

Can you think of any other examples from the lectures, homework, or labs?

In some cases, these variables are the _parameters of a population distribution_.

They are often the _parameters of observable variables_.

## For this type of random variable, the distribution no longer describes the _frequency_ with which something is _observed_.

For one, this is usually something we cannot or don't expect to observe. 

We typically cannot observe the mean in a population, because we cannot collect up the entire population.

We cannot "observe" the average effect of a drug on patients.
We cannot "observe" the chance that a test gives a negative result.

Instead, we observe the effect on a bunch of patients or the result of many tests,
and we use those observations to estimate the unknown quantity.

It's also not something that happens over and over again.

There is only one "average response",
unlike "response of a single patient",
and so we can't talk about the frequency with which values occur.

So in the case of this type of random variable,
what does the distribution correspond to?

## For these random variables, the distribution tracks the _plausibility of a claim_.

That is, for every possible value of the unknown quantity,
the distribution is a function that returns a value that is
_higher when the claim is more plausible_
and _lower when the claim is less plausible_.

So when we say that a distribution represents "our beliefs",
this is what we mean:
we _believe_ that certain claims are more plausible and certain claims are less plausible.

Returning to our brain wave example, we might think that the true
mean of the brain wave amplitude in the population we are measuring is something close to 0,
but not be exactly sure what it is.

We can use the same distribution to represent our beliefs about the true mean:

In [None]:
f, ax, = plt.subplots(figsize=(12, 6))
xs = np.linspace(-5, 5, 1000)
ax.plot(xs, standard_normal_distribution(xs), lw=4);
ax.set_xlabel("Value Claimed for True Mean of Brain Wave Amplitude");
ax.set_ylabel("Proportional to Plausibility\nof Claim True Mean is Value");

Now that the distribution is representing something we don't observe,
it is instead saying that
it is _very implausible_ that the true value is around 3,
while it is _more plausible_ that the true value is around 1
and _even more plausible_ that the true value is around 0.

Or, more generally:

In [None]:
f, ax, = plt.subplots(figsize=(12, 6))
xs = np.linspace(-5, 5, 1000)
ax.plot(xs, standard_normal_distribution(xs), lw=4);
ax.set_xlabel("Value Claimed"); ax.set_ylabel("Proportional to\nPlausibility of Claim");

## This type of random variable most often appears in the _prior_ of our model.

The _prior_ is the piece of our model that describes which values of certain unknown quantities,
usually parameters, we think are _a priori_ plausible or implausible.

## Both the _long-run frequency of an event_ and the _plausibility of a claim_ are things we can call _probabilities_.

A claim that is more plausible is _more likely to be true_ or _more probable_;
an event that occurs more often is _more likely to occur_ or _more probable_.

## We can make both kinds of random variables in pyMC.

First, let's make a model with two random variables in it:
one for the true mean of the data
and the other for the observed values.

In [None]:
with pm.Model() as model:
    true_mean = pm.Normal("mu", mu=0, sd=1)
    X = pm.Normal("X", mu=true_mean, sd=1)

The random variable `X` is not a `pm.Distribution`:

In [None]:
isinstance(X, pm.Distribution)

It is a type of random variable (`RV`):

In [None]:
isinstance(X, pm.model.FreeRV)

We add random variables to a pyMC model mostly by telling pyMC what `Distribution` function to use for that variable,
including its parameters.

There are other types of `RV`s.

For example, we've often said that some things have been `observed`,
and so we know some values they took on.

These are called `ObservedRV`s.

In [None]:
with pm.Model() as model:
    true_mean = pm.Normal("mu", mu=0, sd=1)
    observed_X = pm.Normal("X", mu=true_mean, sd=1, observed=3.14159)

In [None]:
isinstance(observed_X, pm.model.FreeRV), isinstance(observed_X, pm.model.ObservedRV)

The closest thing to a general type for all `RV`s in this version of pyMC
is a `Factor`:

In [None]:
isinstance(true_mean, pm.Factor), isinstance(observed_X, pm.Factor)

# But why make models?

You might find yourself asking a question similar to the one posed by Zoolander et al., 2001:

In [None]:
display.Image("img/why_make_models.gif")

From the movie [_Zoolander_](https://en.wikipedia.org/wiki/Zoolander).

# Two main purposes for building models are

### to *simulate a process* whose outcomes are random, and

### to *quantify our uncertainty* about claims.

Let's work through some examples.

## We can model to simulate a process whose outcomes are random.

Or, more generally, **to describe complicated probability distributions as the combination and transformation of their parts**.

  - DnD: combinations and transformations of uniforms (lab03)

  - Sampling distribution of data: the distribution of the values of my data samples (lec02, lec04)

  - Sampling distribution of statistic: the distribution of the values of statistics computed from data samples (lec04)

Can you think of any other examples from the lectures, homework, or labs?

### For example, we might know the distributions of two groups and want to know what the distribution of differences looks like.

These might be scores of two groups on a task,
or amplitudes of brain waves of people doing two different things,
and we want to know what the difference between two individual values is likely to be.

Let's consider the case of brain wave amplitudes for people performing different tasks.
In particular, let's say that the participants are either focusing on their breathing,
as in the examples above, or they are internally recalling a story.

This example is based on
[this paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7320143).

Presume that all of the parameters for this data are known:
the true means and standard deviations of the brain wave amplitudes
measured from both groups of people.

For simplicity, say the means are `0` and `1`, respectively.

The model below will simulate data that might be measured in an experiment
where we take the brain wave amplitude of a subject internally telling a story
and subtract from it the brain wave amplitude of a subject focusing on breathing,
given all of the assumptions above.

In [None]:
with pm.Model() as model:
    # all the parameters are known
    # parameters for group 0: focusing on breathing
    mu_0 = 0; sd_0 = 1
    # parameters for group 1: internal monologue
    mu_1 = 1; sd_1 = 1
    
    # random variable for observed brain wave amplitudes
    #  for participant focusing on breathing
    x_0 = pm.Normal("x_0", mu=mu_0, sd=sd_0)
    
    # random variable for observed brain wave amplitudes
    #  for participant internally telling a story
    x_1 = pm.Normal("x_1", mu=mu_1, sd=sd_1)
    
    # random variable for distribution of differences
    delta_x = pm.Deterministic("delta_x", x_1 - x_0)
    
type(mu_0), type(x_0)

Notice that some parts of the model are normal Python variables,
like the known values of the parameters, e.g. `mu_0`,
while other parts of the model are `pymc` random variables.

When we call `pm.sample`,
values of `x_0` and `x_1` are drawn according to the distributions provided in the model,
and then `delta_x` is computed for each sample.

In [None]:
samples = shared_util.samples_to_dataframe(shared_util.sample_from(model))

samples.head()

In [None]:
f, ax = plt.subplots(figsize=(18, 12))
sns.distplot(samples["delta_x"]); plt.title("Sampling Distribution of delta_x\nFor Given Assumptions");

## For this kind of modeling, we do not _need_ pyMC.

In fact, we can just use functions from the Python standard library, in particular the `random` module.

In [None]:
import random

In [None]:
# note the similarity here: we pick a fixed value for the parameters
mu_0 = 0; mu_1 = 1; sd_0 = 1; sd_1 = 1

# note the similarity here: we pick a distribution for the values
x_0 = pd.Series([random.normalvariate(mu=mu_0, sigma=sd_0) for _ in range(1000)])
x_1 = pd.Series([random.normalvariate(mu=mu_1, sigma=sd_1) for _ in range(1000)])

delta_x = x_1 - x_0

In [None]:
python_samples = pd.DataFrame({"x_0": x_0, "x_1": x_1, "delta_x": delta_x})

In [None]:
python_samples.head()

In [None]:
f, axs = plt.subplots(figsize=(18, 12), nrows=2, sharex=True, sharey=True)
sns.distplot(samples["delta_x"], label="pyMC Samples", axlabel=False, ax=axs[0]); axs[0].legend()
sns.distplot(python_samples["delta_x"], label="Python stdlib Samples", ax=axs[1], color="C1"); axs[1].legend()
axs[0].set_title("Sampling Distribution of delta_x\nFor Given Assumptions");
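
Both histograms can be checked against a standard result: the difference of two independent Normal variables is itself Normal, with mean `mu_1 - mu_0` and standard deviation `sqrt(sd_0**2 + sd_1**2)`. The sketch below, which assumes the same parameter values as above and uses only the standard library, compares simulated summaries to those expected values.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility
mu_0, sd_0 = 0, 1  # focusing on breathing
mu_1, sd_1 = 1, 1  # internally telling a story

# one simulated difference per pair of participants
diffs = [rng.gauss(mu_1, sd_1) - rng.gauss(mu_0, sd_0) for _ in range(50_000)]

expected_mean = mu_1 - mu_0
expected_sd = math.sqrt(sd_0 ** 2 + sd_1 ** 2)

print(statistics.mean(diffs), expected_mean)
print(statistics.stdev(diffs), expected_sd)
```

The simulated mean and standard deviation should land close to the expected ones, with small discrepancies due to sampling.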

Using pyMC for something like this, though possible, is unwise,
like the activity depicted in the video below.

In [None]:
display.YouTubeVideo("ep9LZyF2fPA")

That is, pyMC is not quite the right tool for this job:
it is designed for something much harder.

So when and why do we need pyMC?

## Writing down a model this detailed requires lots of knowledge!

We are not usually in the setting where we can do so.

For a board game, we know all of the rules and so can build an exact model.

For a real life problem, we don't know everything:
we don't know the actual means and standard deviations.

We can measure them on samples,
but we _know those values will be wrong_.

Null models, or models of the null hypothesis,
are an interesting example of a workaround.

Instead of specifying that we know the true values,
we "pretend" to claim or hypothesize that the true values
are some uninteresting values.

This allows us to see <i>a</i> sampling distribution of the data,
or a statistic, but we don't know that it's _the_ sampling distribution,
as in the correct one, and so it doesn't make for a good simulation.

In [None]:
with pm.Model() as null_model:
    mu_0 = 0; mu_1 = 0;  # means are the same: uninteresting
    shared_sd = 1  # standard deviations are also the same
    
    x_0 = pm.Normal("x_0", mu=mu_0, sd=shared_sd)
    x_1 = pm.Normal("x_1", mu=mu_1, sd=shared_sd)
    
    delta_x = pm.Deterministic("delta_x", x_1 - x_0)

In [None]:
null_samples = shared_util.samples_to_dataframe(shared_util.sample_from(null_model))

In [None]:
f, axs = plt.subplots(figsize=(12, 12), nrows=2, sharex=True, sharey=True)
sns.distplot(null_samples["delta_x"], ax=axs[0], label="pyMC Null Samples",color="C0", axlabel=False);
sns.distplot(python_samples["delta_x"], ax=axs[1], color="C1", label="Python stdlib Samples");
axs[0].legend(); axs[1].legend();

We want to use what we measured on a sample
to determine _what values are plausible_.

## We need pyMC when we model to quantify our uncertainty.

Specifically, we quantify our uncertainty about things we cannot measure, once we've measured related things.

  - **"Effect of" task difficulty on score**, aka average score of participants on tasks with different difficulty (lec07)

  - **Relationship between generation and emoji use** (lab07)

Can you think of any other examples from the lectures, homework, or labs?

### For example, we can use pyMC to guess what the mean is if we know what the data _would_ look like _if only_ we knew the mean.

In our running brain waves example,
this corresponds to the (more realistic)
case where we don't know what the true means are,
but we do know or assume that the data would be Normal
around that true mean.

To start, let's say we know that the mean is _either_ 0 _or_ it is 1.

No other values are possible.

For example, we might know that the means of the two participant groups are `0` and `1`,
and we want to infer whether a particular value or set of values we observed
came from one group or the other.

Perhaps we are trying to prove that the brain wave measurements can be used to predict
whether the person is thinking about their breathing or about a story -- a limited form of mind-reading.

For a more advanced approach to the same idea, see
[Naselaris et al., 2009](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5553889/),
_Bayesian reconstruction of natural images from human brain activity_.

In [None]:
with pm.Model() as du_model:
    # Our belief: the unknown true mean of the variable X is either 0 or 1
    #   the actual value is fixed, but unknown
    mu = pm.DiscreteUniform("true_mean", lower=0, upper=1)
    
    # Frequencies: the frequency of observing values of X drops off
    #  like a bell curve around its mean
    X = pm.Normal("brain_wave_amplitude", mu=mu, sd=1)

For now, we are _not_ including any observed values,
we're just writing down the general form of the model and taking a look at it.

When we draw samples from this model,
all of the free random variables
will take on different values, according to their distributions.

The details are more complicated, but you might think of it this way:
first pyMC flips a coin to determine whether `mu` is `0` or `1`.
Then, that value of `mu` is used to determine the mean of the normal distribution
according to which `X` is drawn.
This is repeated over and over again.
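
The coin-flip picture can be written out with the standard library alone. This is a sketch of the simplified story in the paragraph above, not of what pyMC's samplers actually do internally; `draw_once` is a hypothetical helper name.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility

def draw_once(rng):
    """One joint draw: first the mean, then a value around that mean."""
    mu = rng.choice([0, 1])  # "coin flip" for the unknown mean
    x = rng.gauss(mu, 1)     # brain wave amplitude given that mean
    return mu, x

draws = [draw_once(rng) for _ in range(20_000)]
frac_mu_1 = sum(mu for mu, _ in draws) / len(draws)
mean_x = statistics.mean(x for _, x in draws)
print(frac_mu_1, mean_x)
```

About half of the draws should have `mu` equal to 1, and the values of `x` should average out to roughly 0.5, the midpoint of the two possible means.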

In [None]:
du_samples = shared_util.samples_to_dataframe(shared_util.sample_from(du_model, draws=5000))

In [None]:
du_samples["true_mean"].value_counts() / len(du_samples)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
[sns.distplot(samp, ax=ax, label=f"Samples When True Mean is {i}")
 for i, samp in du_samples.groupby("true_mean")["brain_wave_amplitude"]];
plt.ylim(*1.4 * np.array(plt.ylim()));ax.legend();

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(du_samples["brain_wave_amplitude"], label="All Samples Together");

In the real world, this is _not_ the frequency of values we'd expect to see!

We'd expect to see _either_ the distribution on the left
_or_ the distribution on the right,
but not this mixture of the two.

The plot below shows the two distributions
along with some data we might have measured
from a participant:

In [None]:
possible_observations = [0.5, 1, -2]

f, ax = plt.subplots(figsize=(12, 6))
[sns.distplot(samp, ax=ax, label=f"Samples When True Mean is {i}")
 for i, samp in du_samples.groupby("true_mean")["brain_wave_amplitude"]];
ax.vlines(possible_observations, 0, 0.3, lw=4, label="Possible Observed Values");
plt.ylim(*1.5 * np.array(plt.ylim())); ax.legend();

For each of the possible observed values,
which of the two distributions was it more likely to be drawn from?

In responding to that question,
we base our answers on $p(\text{data}\vert\text{parameters})$,
the _likelihood_ part of the model.

By selecting out samples that all share the same value for the parameters,
we are visualizing this likelihood component.

Given the three observed values,
do you think the mean is more likely to be 0 or 1?

HINT: you can test your guess by feeding the values to the model `du_model_with_obs` below.

## pyMC gives us access to $p(\text{parameters}\vert\text{data})$, the _posterior_ of the model.

The important question, though, isn't $p(\text{data}\vert\text{parameters})$,
it's $p(\text{parameters}\vert\text{data})$.

That is, not 

which _data values_ will be observed more frequently, for this value of the _parameters_,

but 

which claims about the values of the _parameters_ are more plausible,
given I observed these _data values_.

This is generally the way the world works:
we make internal mental models or mathematical models or computer models that can tell us
how a system will behave _if we know the right things_,
but then we almost never actually _get to know the right things_,
so we have to _infer_ or _guess_ those values.

## This posterior is the distribution corresponding to our uncertainty about the unknown values, once we have observed other values.

## To get a posterior, pyMC needs to know two things: a _prior_ and a _likelihood_.

### The _prior_ is the distribution of the random variables corresponding to the unknown values that we write into our pyMC model.

It specifies, _a priori_ or _before we see anything_,
the plausibility of all claims about the unknown values.

$$
p(\texttt{mu})
$$

### The _likelihood_ is the frequency with which we'd observe different values of the data, _if_ the parameters were fixed at some value.

$$
p(\text{data} \vert \texttt{mu})
$$

## Unlike the prior and the likelihood, the distribution of the posterior is not specified by us, nor is its mathematical form known to pyMC.

Individual values can be calculated, up to a proportionality constant,
from the prior and the likelihood:

$$
p(\texttt{mu} \vert \text{data}) \propto p(\text{data}\vert\texttt{mu})p(\texttt{mu})
$$

This relationship is obtained from Bayes' Rule,
and that's why this approach is called a _Bayesian_ approach.
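
When the unknown mean can take only a handful of values, the proportionality above can be evaluated directly: multiply prior by likelihood for each candidate, then divide by the total so the results sum to 1. The sketch below assumes a uniform prior over the candidates and a Normal likelihood with standard deviation 1; `normal_pdf` and `posterior_over_means` are hypothetical helper names.

```python
import math

def normal_pdf(x, mu, sd=1.0):
    """Density of a Normal(mu, sd) distribution at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def posterior_over_means(observations, candidates=(0, 1)):
    """Posterior plausibility of each candidate mean under a uniform prior."""
    prior = 1 / len(candidates)
    unnormalized = [
        prior * math.prod(normal_pdf(x, mu) for x in observations)
        for mu in candidates
    ]
    total = sum(unnormalized)  # the missing proportionality constant
    return [u / total for u in unnormalized]

print(posterior_over_means([0.5]))
```

This brute-force calculation only works because there are two candidate values; with many parameters or continuous values, enumerating every possibility stops being practical.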

## Instead we _estimate_ the posterior based on samples, drawn with `pm.sample`.

In [None]:
with pm.Model() as du_model_with_obs:
    # Our belief: the unknown true mean of the variable X is either 0 or 1
    #   the actual value is fixed, but unknown
    mu = pm.DiscreteUniform("true_mean", lower=0, upper=1)
    
    # Frequencies: the frequency of observing values of X drops off
    #  like a bell curve around its mean
    X = pm.Normal("brain_wave_amplitude", mu=mu, sd=1, observed=[0.5])

In [None]:
post_samples = shared_util.samples_to_dataframe(shared_util.sample_from(du_model_with_obs, draws=5000))

post_samples["true_mean"].value_counts() / len(post_samples)

Try providing each of the `possible_observations` above to this model and computing the posterior.

Beforehand, try and guess which of the two claims -- that the mean is 0 or that the mean is 1 -- is more plausible _given the observed value_.

Typically,
we aren't in a world where only one of two possibilities is true.

Instead, there are an infinite number of possibilities.

For example, say we thought the true mean was _between_ 0 and 1,
rather than being _equal to_ 0 or 1.

Now, there are infinitely many possibilities: $0, 1/2, 1/4, 3/4, ...$

So we can't visualize the likelihood the same way:
all of our samples will have _different values for the true mean_.

And our prior and posterior will be best viewed as histograms,
rather than as frequencies for all of the values.

In [None]:
with pm.Model() as unif_model_obs:
    # Our belief: the unknown true mean of the variable X is between 0 and 1
    mu = pm.Uniform("true_mean", lower=0, upper=1)
    
    # Frequencies: the frequency of observing values of X drops off
    #  like a bell curve around its (unknown!) mean
    X = pm.Normal("brain_wave_amplitude", mu=mu, sd=1, observed=[0.5])

Remember that we can use `pm.sample_prior_predictive`
if we want to see what the prior looks like
for a model with `observed` variables in it.

In [None]:
prior_samples = shared_util.samples_to_dataframe(pm.sample_prior_predictive(model=unif_model_obs, samples=10000))

In [None]:
post_samples = shared_util.samples_to_dataframe(shared_util.sample_from(unif_model_obs, draws=5000))

In [None]:
f, axs = plt.subplots(figsize=(12, 6), ncols=2, sharex=True, sharey=True)
sns.distplot(prior_samples["true_mean"], ax=axs[0], label="Prior");
sns.distplot(post_samples["true_mean"], ax=axs[1], label="Posterior", color="C2");
axs[0].set_ylabel("Plausibility of Claim that\nTrue Mean is Value on x-Axis")
[ax.legend() for ax in axs];

The distribution on the right approximates $p(\texttt{mu} \vert \text{data})$
for every possible value of the true mean (along the x-axis):
the plausibility, given the $\text{data}$ we observed,
of the claim that the parameter `mu` is equal to the value on the x-axis.

Remember that the posterior is obtained using the proportion we got from Bayes' Rule:

$$
p(\texttt{mu} \vert \text{data}) \propto p(\text{data}\vert\texttt{mu})p(\texttt{mu})
$$

And note that since we used a `Uniform` prior, we have just
$$
p(\texttt{mu} \vert \text{data}) \propto p(\text{data}\vert\texttt{mu})
$$
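
Because the posterior is proportional to the likelihood here, it can be approximated on a grid of candidate values without any sampling. The sketch below assumes a single observation of `0.5`, a likelihood standard deviation of 1, and a grid of 101 evenly spaced values; the grid size is an arbitrary illustrative choice.

```python
import math

observed = 0.5
grid = [i / 100 for i in range(101)]  # candidate values of mu in [0, 1]

def likelihood(mu, x=observed, sd=1.0):
    """Normal density of the observation x when the mean is mu."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# flat prior on [0, 1]: the posterior is proportional to the likelihood
weights = [likelihood(mu) for mu in grid]
total = sum(weights)
posterior = [w / total for w in weights]  # normalize over the grid

best = grid[posterior.index(max(posterior))]
edge_to_peak = posterior[0] / max(posterior)
print(best, edge_to_peak)
```

The ratio between the posterior at the edge of the interval and at its peak gives a rough sense of how much a single observation reshapes a flat prior.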

It's usually pretty hard to tell the difference between our beliefs before and after seeing a value of `0.5`.
You might notice that the posterior is "bulging" upwards in the middle and lower at the edges.

Change the value observed to a more extreme value:
`0.9`, or `2`, or `10`.
Try to predict what the shape of the posterior will be in each case.

The section below shows the same procedure,
but for a `Normal` prior,
which has shown up more often in the class thus far.

In [None]:
with pm.Model() as normal_model:
    # Our belief: the unknown true mean of the variable X is close to 0, ±1
    mu = pm.Normal("true_mean", mu=0, sd=1)
    
    # Frequencies: the frequency of observing values of X drops off
    #  like a bell curve around its mean
    X = pm.Normal("brain_wave_amplitude", mu=mu, sd=1, observed=[0, 1, 3, 2, 1.5])

In [None]:
prior_samples = pm.sample_prior_predictive(model=normal_model, samples=10000)
post_samples = shared_util.samples_to_dataframe(shared_util.sample_from(normal_model, draws=5000))

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(prior_samples["true_mean"], label="Prior");
sns.distplot(post_samples["true_mean"], label="Posterior", color="C2");
ax.set_ylabel("Plausibility of Claim that\nTrue Mean is Value on x-Axis"); ax.legend();
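
For this particular combination (Normal prior on the mean, Normal likelihood with known standard deviation), the posterior happens to have a well-known closed form, which makes it a handy check on the sampled histogram. The notebook itself does not rely on this shortcut; the sketch below is a side calculation, and `normal_posterior` is a hypothetical helper name.

```python
import math

def normal_posterior(data, prior_mu=0.0, prior_sd=1.0, noise_sd=1.0):
    """Closed-form posterior for a Normal mean when noise_sd is known."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = len(data) / noise_sd ** 2
    post_precision = prior_precision + data_precision
    # precision-weighted combination of the prior mean and the data
    post_mu = (prior_precision * prior_mu + sum(data) / noise_sd ** 2) / post_precision
    return post_mu, math.sqrt(1 / post_precision)

post_mu, post_sd = normal_posterior([0, 1, 3, 2, 1.5])
print(post_mu, post_sd)
```

The posterior histogram drawn from the samples should be centered near `post_mu` with a spread close to `post_sd`. Most models have no such formula, which is why sampling is the general-purpose tool.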

# Graphical Perspective

When we make graphs,
we draw a circle, or _node_, for each of our variables
and then we draw an arrow, or _edge_, to mean
"the variable at the tail is used to set the distribution of the variable at the tip".

These graphs are partial descriptions of pyMC models.

As descriptions, they capture the overall structure:
which variables are present,
who determines whose values, what values are known,
what values are unknown.

They are _partial_ because they don't capture which distributions are in play.

All of the models in the section on modeling to quantify our uncertainty
have the same graph, the one below:

In [None]:
params_known = False; data_known = True

data_node = daft.Node("brain\nwave\ndata", "brain\nwave\ndata", 4, 1, scale=4, observed=data_known);
params_node = daft.Node("mu", "mu", 1, 1, scale=4, observed=params_known)

pgm = daft.PGM([6, 2]); pgm.add_node(data_node); pgm.add_node(params_node);
pgm.add_edge("mu", "brain\nwave\ndata", head_width=0.25, lw=3)

pgm.render();

That is, we have observed the brain wave data,
and we know that the distribution of the brain wave amplitudes depends on the values of the parameter, `mu`.

When we create a model to simulate our data,
the structure of the model is often similar,
but with one important difference:
the node colored in is different.

In [None]:
params_known = True; data_known = False

data_node = daft.Node("brain\nwave\ndata", "brain\nwave\ndata", 4, 1, scale=4, observed=data_known);
params_node = daft.Node("mu", "mu", 1, 1, scale=4, observed=params_known)

pgm = daft.PGM([6, 2]); pgm.add_node(data_node); pgm.add_node(params_node);
pgm.add_edge("mu", "brain\nwave\ndata", head_width=0.25, lw=3)

pgm.render();

That is, we are presuming that the parameters are _known_,
since these are the parameters we give the model,
while the data is unknown, since that's the output of our model.

Given our definition of the arrows,
"the value at the tail is used to determine the value at the tip",
we can loosely specify when pyMC is useful and when it is not.

Whenever there are _no unknown values determining any known values_,
we don't need pyMC.

Another way of putting it:
whenever the information from our observations only flows _forward along arrows_,
never backwards,
we don't need pyMC.
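
This loose rule can be expressed as a check on the graph itself. The sketch below represents edges as `(tail, tip)` pairs and only looks at direct parents, which is enough for the two-node graphs in this section; `needs_inference` is a hypothetical helper, not part of daft or pyMC.

```python
def needs_inference(edges, observed):
    """True if some unobserved node points directly at an observed node."""
    return any(tail not in observed and tip in observed for tail, tip in edges)

edges = [("mu", "brain wave data")]

# quantifying uncertainty: data observed, parameter unknown
print(needs_inference(edges, observed={"brain wave data"}))
# simulating: parameter known, data unknown
print(needs_inference(edges, observed={"mu"}))
```

The first call corresponds to the graph with the data node colored in, and the second to the graph with the parameter node colored in.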

# Anatomy of a Bayesian Model

```python
with pm.Model() as model:
    # Prior/Pre-Experiment Beliefs:
    #   Create random variables for all unknown, unobservable parameters
    param0 = pm.?("param0", hyperparam0=?, ...)
    ...
    paramN = pm.?("paramN", hyperparam0=?, ...)
    
    # Likelihood:
    #   Create random variable for values we have observed
    #   and use RVs from prior to set parameter
    #   and tell pyMC which values you've observed
    observable_value = pm.?("observable_value",
                            likelihood_param0=param0, ..., observed=?)
    
with model:
    # approximate posterior by drawing samples
    posterior_samples = pm.sample(...)
```

Make sure you can identify each of the pieces above.

### Prior

- Random variables with associated distributions <br><br>

- Represent our prior knowledge or beliefs before observing data, typically about parameters of likelihood <br><br>

- Randomness used to represent uncertainty in truth of claims about certain variables <br><br>

- Partially subjective -- different modelers believe or know different things

### Likelihood

- Random variable(s) with associated distribution(s) <br><br>

- Represent what data distribution would look like, if we knew parameters <br><br>

- Randomness used to represent variability from observation to observation due to uncontrolled variables not in prior <br><br>

- Less subjective than prior, but still a modeling choice and reliant on our knowledge of the domain

### Observations

- Concrete Python values (often a `Series`) <br><br>

- Measured out in the world <br><br>

- Fed to model via the likelihood component with `observed` keyword

### Posterior

- Represented approximately with samples (comes out of pyMC as a `MultiTrace`, but we convert to a `DataFrame`) <br><br>

- Knowing posterior is the goal of modeling: knowing what we believe about unknown things, given what we observed (observations) and what we knew beforehand (prior, likelihood)

Graphically, the _prior_ is associated with the nodes for the unknown parameters (nodes without color).

The _likelihood_ is associated with the arrows connecting the parameter nodes and the observed nodes (nodes with color) and with the nodes themselves.

The _observations_ are what make some nodes colored in. Without observations, a node is without color.

The _posterior_ is also associated with the nodes for the unknown parameters.

In [None]:
params_known = False; data_known = True

data_node = daft.Node("brain\nwave\ndata", "brain\nwave\ndata", 4, 1, scale=4, observed=data_known);
params_node = daft.Node("mu", "mu", 1, 1, scale=4, observed=params_known)

pgm = daft.PGM([6, 2]); pgm.add_node(data_node); pgm.add_node(params_node);
pgm.add_edge("mu", "brain\nwave\ndata", head_width=0.25, lw=3)

pgm.render();