<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Parameters, Priors, and Posteriors 01

In [None]:
%matplotlib notebook

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from IPython.display import Audio, Image
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

import utils.daft
import utils.plot as plot

import utils.util

## Often, our models are less precise than those we made last week

## We will usually specify models _only up to the values of parameters_

Compare:
- I model the rolls on this six-sided die as a `DiscreteUniform` with `lower=1` and `upper=6`

$
\ \ \ \ \ \ \ \text{d}6 \sim \text{DiscreteUnif}(\text{lower}=1, \text{upper}=6)
$

```python
d6 = pm.DiscreteUniform(lower=1, upper=6)
```

- I model the rolls on this die with an unknown number of sides as a `DiscreteUniform` with `lower=1` and an unknown value of `upper`

$
\ \ \ \ \ \ \ \text{d}X \sim \text{DiscreteUnif}(\text{lower}=1, \text{upper}=\ ?)
$

```python
dX = pm.DiscreteUniform(lower=1, upper=?)
```

Written above the pyMC code, which is hopefully becoming more familiar,
are the equivalents in more traditional mathematical notation.

The symbol $\sim$ is pronounced "is distributed as".

Our model of rainfall based on memorylessness gave us the `Poisson` and `Exponential` distributions for counts of drops and times between drops, but for any given rainstorm, the precise shape of the distributions will be different.

$
\ \ \ \ \ \ \ N_{\text{drops}} \sim \text{Pois}(\mu)
$
```python
N_drops = pm.Poisson(mu=?)
```

Similarly, the central limit theorem told us many distributions were `Normal`, but it didn't tell us the precise shape of that distribution.

$$S \sim \text{Norm}(\mu, \sigma)$$

where $S$ is a random variable representing, e.g. the value of a statistic that we measure on a random sample.

$$
\mu \sim \ ?, \ \ \sigma \sim \ ?
$$

We do not know what $\mu$ and $\sigma$ are,
but we know that, once we have specific values for them,
the distribution of the values of $S$ will be normal.

## Parameters determine the specific distribution inside a family

Different variables have different numbers and names of parameters, but they all have them.

Those parameters change the particular shape of the distribution of that random variable,
while still keeping it within the same _family_.

$$x\sim\text{Normal}(\mu, \sigma)$$

$$u\sim\text{Unif}(\text{lower}, \text{upper})$$

$$k\sim\text{Poisson}(\mu)$$

In [None]:
with pm.Model() as different_parameters_model:
    # variables from the Normal family
    normal_1 = pm.Normal("normal_1", mu=0, sd=1)  # the "standard normal"
    normal_2 = pm.Normal("normal_2", mu=2, sd=1)  # different mean, same std
    normal_3 = pm.Normal("normal_3", mu=0, sd=3)  # same mean, different std
    
    # variables from the Uniform family
    uniform_1 = pm.Uniform("uniform_1", lower=0, upper=1)  # the "standard uniform"
    uniform_2 = pm.Uniform("uniform_2", lower=-2, upper=3)  # another uniform
    
    # variables from the Poisson family
    poisson_1 = pm.Poisson("poisson_1", mu=1)  # what you might call the "standard Poisson"
    poisson_2 = pm.Poisson("poisson_2", mu=0.1)  # a Poisson with a lower rate
    poisson_3 = pm.Poisson("poisson_3", mu=2)  # a Poisson with a higher rate

In [None]:
diff_params_samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    different_parameters_model, draws=10000, progressbar=True))

In [None]:
plt.figure(figsize=(8,8))
sns.distplot(diff_params_samples["normal_1"], label=r"$x\sim$Norm$(\mu=0, \sigma=1)$")
sns.distplot(diff_params_samples["normal_2"], label=r"$x\sim$Norm$(\mu=2, \sigma=1)$")
sns.distplot(diff_params_samples["normal_3"], color="C3", label=r"$x\sim$Norm$(\mu=0, \sigma=3)$")
plt.xlabel("x")
plt.legend(loc=[-0.1, 0.8], framealpha=1); plt.tight_layout();

In [None]:
plt.figure(figsize=(8, 8))
sns.distplot(diff_params_samples["uniform_1"],
             kde=False, norm_hist=True, bins=5, label=r"$u\sim$ Unif$($lower$=0$, upper$=1)$")
sns.distplot(diff_params_samples["uniform_2"],
             kde=False, norm_hist=True, bins=5, label=r"$u\sim$ Unif$($lower$=-2$, upper$=3)$");
plt.xlabel("u"); plt.axis("equal"); plt.ylim([0, 5]); plt.legend(); plt.tight_layout();

In [None]:
plt.figure(figsize=(8,4))
plt.hist(diff_params_samples["poisson_2"],
         bins=range(10), align="left", label=r"$k\sim$Pois$(\mu=0.1)$", histtype="step", lw=4, density=True)
plt.hist(diff_params_samples["poisson_1"],
         bins=range(10), align="left", label=r"$k\sim$Pois$(\mu=1)$", histtype="step", lw=4, density=True)
plt.hist(diff_params_samples["poisson_3"],
         bins=range(10), align="left", color="C3", label=r"$k\sim$Pois$(\mu=2)$", histtype="step", lw=4, density=True)
plt.xlabel("k"); plt.legend(); plt.tight_layout();

The name of the random variable tells us the shape of the histogram _if all of the parameters are known_.

Consider the model
$$
x \sim Foo(\beta_0,  \beta_1, \beta_2)
$$
where all of the $\beta_i$ are known.

For [historical reasons](https://en.wikipedia.org/wiki/Foobar#History_and_etymology),
the preferred names for variables in programming whose exact values don't matter,
because they're just being used to demonstrate a concept,
are `foo`, `bar`, and `baz`.
In Python, folks sometimes use [`spam` and `eggs` instead](https://www.dailymotion.com/video/x2hwqlw).

Replace `Foo`, in your mind, with a specific example,
like `Normal` or `DiscreteUniform`,
and the parameters with the parameters for that family:
`mu` and `sd`,
`lower` and `upper`.

All of the models we worked with last week were of this type.

Then, the distribution of $x$ is given by samples from
```python
pm.Foo("X", beta_0=beta_0_known, beta_1=beta_1_known, beta_2=beta_2_known)
```

When the value of a variable is known or observed, we draw it in our graphs with the circle filled in, like so:

In [None]:
utils.daft.make_parameters_graph();

An arrow is drawn between a pair of random variables when the value of one influences the value of the other.
But which direction should the arrows point?
Very often, one random variable will be a parameter for another.
In that case, we draw the arrow with its tail on the node for the parameter
and pointing to the variable it controls.

## How do we handle parameters we don't know?

We treat the parameters as _random variables_: quantities we don't know exactly, but about which we can express our uncertainty.

In [None]:
utils.daft.make_parameters_graph(observed=False);

## Joint, Conditional, and Marginal Distributions

Until now, we've only thought about one variable at a time.

Once we start thinking about more than one variable,
we need more words to talk about the different kinds of probabilities involving those variables:

- What _combinations of values_ are likely to be observed together?
- If I know something about one variable, what _values of the others_ are likely to be observed?

## Example: A simple model for effort

You're studying the psychology of effort:
what factors cause a person to try harder when presented with a task?

In a very simple experiment,
you allow participants to play a game with a fixed chance of success, say 50%,
multiple times, up to some fixed maximum number of attempts (here, 6),
at which point you intervene and end the experiment.
You track how many times they play the game and how many times they succeed.

There are lots of interesting questions to ask here:
does the history of outcomes affect how many times a person plays?
what other variables can you manipulate to change that relationship?

But we'll focus on a very simple model:

In [None]:
utils.daft.make_bivariate_graph("N", "S");

$$
\text{Number of Attempts: } N \sim \text{Categorical}(p=[0, 1/4, 1/8, 1/8, 1/8, 1/8, 1/4])\\
\text{Number of Successes: } S \sim \text{Binom}(n=N, p=0.5)
$$

This model expresses the belief that `1/4` of people will only play once,
while `1/4` will play until they are told to stop,
and the remaining half will stop at some point in between.

Later, you might come along and see whether changes in some variable change effort,
and make the parameters of $N$ _also_ random variables, with parameters,
or you might make $S$ something different from a `Binomial`.

In [None]:
with pm.Model() as effort_model:
    attempts = pm.Categorical("attempts", p=[0, 1/4, 1/8, 1/8, 1/8, 1/8, 1/4])
    successes = pm.Binomial("successes", n=attempts, p=0.5)

Here's a very loose description of the sampling process:
- "Imagine the person were to try 3 times", aka select a random value for $N$ 
- "If they tried 3 times, I can imagine they might succeed twice", aka sample $S$ with parameter `n=3`.
- Repeat.
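
The loose description above can be sketched directly in NumPy, without pyMC. This is only an illustration of the idea, and not necessarily how pyMC draws its samples internally. The names `rng`, `p_attempts`, `attempts`, and `successes` are chosen just for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# probabilities for N = 0, 1, ..., 6, copied from the model above
p_attempts = np.array([0, 1/4, 1/8, 1/8, 1/8, 1/8, 1/4])
draws = 10000

# step 1: "imagine the person were to try n times", i.e. sample N
attempts = rng.choice(len(p_attempts), size=draws, p=p_attempts)

# step 2: "if they tried n times...", i.e. sample S using that n as a parameter
successes = rng.binomial(n=attempts, p=0.5)

# each row of the output pairs one value of N with the S sampled for it
print(np.column_stack([attempts, successes])[:5])
```

The important detail is that each value of `successes` is drawn using the value of `attempts` from the same draw, so the two columns stay paired.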

In [None]:
effort_samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    effort_model, draws=10000, progressbar=True))

In [None]:
plot.jointgrid_effort(effort_samples);

Notice the two plots on the sides, or _margins_.
They show what we've been looking at so far:
the distribution of a variable, without paying attention to the values of other variables.
The plot on the y-margin is `sns.distplot` applied to `effort_samples["attempts"]` (with suitable kwargs).

The tradition of putting plots and counts like these on the side goes way back,
so these are called
_marginal_ distributions.

## The distributions we've plotted so far are called _marginals_.

For a `pymc_model` containing the two variables `X` and `Y`,
the heights of the histograms plotted by
```python
samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    pymc_model, draws))
plt.hist(samples["X"])
plt.hist(samples["Y"])
```
will approach the values given by
$$
p(X) \text{ and } p(Y)
$$
as `draws` gets bigger.

Remember that `.sample_from` is just calling `pm.sample`, inside a `with` block.

The marginal probabilities of the outcomes $N=x$ and $S=y$ are
$$
p(N=x) \text{ and } p(S=y)$$

These are sometimes shortened to $p(x)$ or $p_N(x)$.

### The shapes of marginal distributions are not always given by the name of the variable.

If the variable `X` is a `pm.Foo`, $p(X)$ is _not_ always going to be a member of the family `Foo`.

In [None]:
mx = max(effort_samples["successes"]); Ns = range(mx, mx+10); ps = np.linspace(0.1, 0.9, num=90)

params = {"n": Ns, "p": ps}
best_params = utils.util.approx_mle(params, scipy.stats.binom.logpmf, effort_samples["successes"])
best_binomial_pmf = scipy.stats.binom(**best_params).pmf

In [None]:
f, ax = plt.subplots(figsize=(8, 4))

sns.distplot(effort_samples["successes"],
             bins=np.arange(-0.5, 11.5), kde=False, norm_hist=True, ax=ax,
             label=r"estimated $p(S)$");
ax.plot(range(11), best_binomial_pmf(range(11)), lw=4, marker=".", markersize=24, label="closest binomial");
plt.legend(); plt.tight_layout()

Here, I am plotting $p(\text{S})$, as estimated from our sampling,
against the distribution (aka `pmf`, because it's a discrete variable)
of the closest member of the `Binomial` family.

It's not important to understand how the closest member of the family was chosen,
just to recognize that the shape of $p(S)$ cannot be well-approximated by any `Binomial`.

## The distribution of multiple variables at once is the _joint distribution_.

In [None]:
plot.jointgrid_effort(effort_samples)

The center of this plot is a "2-D histogram".
It is created by `plt.hist2d`,
which is much like `plt.hist`, or `sns.distplot` with `kde=False`,
but for two variables at once.

Values at each point (x, y) are the frequencies at which the pair
(attempts, successes) was observed _together_ in the data.

Because this is the distribution of values observed together, or _jointly_,
it is known as the _joint distribution_.

`pm.sample` draws from the joint distribution of all unobserved random variables in the model.

For a `pymc_model` with just two variables `X` and `Y`,
the heights of the histogram plotted by
```python
samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    pymc_model, draws))
plt.hist2d(samples["X"], samples["Y"])
```
will approach a value given by
$$
p(X, Y)
$$
as `draws` increases.

Note: this will only work with the correct bins.
For a discrete variable, you want one bin for each outcome.
For a continuous variable, you need to increase the number of bins along with `draws`.

The joint probability of the outcome that  $N=x$ and $S=y$ is
$$
p(N=x, S=y)
$$

In [None]:
plot.histogram3d(effort_samples["attempts"], effort_samples["successes"],
                 labels=["Attempts", "Successes"],
                 bins=np.arange(-0.5, 7.5), cm=plt.cm.viridis)

This 3-D plot presents another way to view the joint distribution.
The bars have heights $z$, where
$$
z = p(N=x, S=y)
$$
Color corresponds to x position, to make the plot easier to read.
Matplotlib does not have a full 3-D rendering engine,
so you may see some small graphical glitches.

The heights of all of these bars together add up to 1.

Consider a collection of bars with the same color,
which correspond to cases where the same number of attempts were made:
notice that their shapes, e.g. their centers, are very different.

They are not quite probability distributions: they don't add up to 1.
Instead, they add up to $p(N=x)$. Ignoring that point, use them to answer the question below.

If I told you that a given participant made 5 attempts,
what might you guess the number of successes was?
What if it was 2?

In both cases, you'd want to look at the value of successes observed most often
for the number of attempts that was given.

The most-often-observed value is called the _mode_ of the distribution.

What if I didn't tell you at all?
To answer that,
you'd want to look at the marginal distribution,
and look for the most likely value there.

In [None]:
f, ax = plt.subplots(figsize=(8, 4))
sns.distplot(effort_samples["successes"], bins=np.arange(-0.5, 7.5), kde=False, norm_hist=True, ax=ax,
             label=r"estimated $p(S)$"); plt.legend();

In this case, the mode of the distribution is 1.

Notice that we started thinking about the same-colored bars as _distributions_ above.
We were implicitly thinking of them as normalized.

Once a collection of these bars is normalized, it is an example of what's called a _conditional distribution_.
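
Normalizing is just dividing by the total. A small sketch for the effort model, using the exact probabilities from the model definition instead of samples (the table name `joint` and the choice `n = 5` are made up for this example):

```python
import numpy as np
from math import comb

p_attempts = np.array([0, 1/4, 1/8, 1/8, 1/8, 1/8, 1/4])

# exact joint table: joint[n, s] = p(N=n) * Binomial(s; n, 0.5)
joint = np.zeros((7, 7))
for n in range(7):
    for s in range(n + 1):
        joint[n, s] = p_attempts[n] * comb(n, s) * 0.5**n

n = 5
bars = joint[n]                  # the same-colored bars for N=5
conditional = bars / bars.sum()  # normalized, so it is now a distribution

print(bars.sum())         # equals p(N=5), not 1
print(conditional.sum())  # equals 1
print(conditional)
```

For five attempts, the normalized bars are tied at 2 and 3 successes, so either is a reasonable guess.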

### Conditional distributions arise when we place _conditions_ on some of our random variables

and then look at the distributions of other variables.

For a discrete variable, the conditional distributions we've been talking about
correspond to the collection of histograms made by

```python
samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    pymc_model, draws))
for n in possible_N_values:
    # Boolean mask: keep only the rows where N equals n
    plt.hist(samples[samples["N"] == n]["S"],
             bins=possible_S_values, density=True)
```

where the _condition_ is `samples["N"] == n`.

Most of the time, the condition on the random variable will be like this:
one or more variables has a given value.

But this is not a fundamental restriction!
We could use `samples["N"].isin([2, 3])` as a condition, for example,
if we wanted to see what the distribution of `"S"` looks like when `"N"` is one of those values,
but the precise value is unknown.

Remember, "correspond" means "if I took enough samples, the heights of the histogram bars would be the output of the mathematical formula".

This collection of distributions is written like:

$$
p(S \lvert N)
$$

where the $\lvert$ is pronounced "given", as in "probability of $S$ _given_ $N$".

This thing is known, in a slight abuse of terminology, as **the** _conditional distribution_ of $S$ given $N$.

Why abuse? Because it's actually a whole bunch of distributions, one for each possible value of $N$!

An individual histogram,
corresponding to the rows picked out by the Boolean selector `samples["N"] == n`,
for some fixed value of `n`, would instead be written:

$$
p(S\lvert N=n)
$$

in English, this is "the probability of $S$ _given_ $N$ is equal to $n$".

This thing is known as "the conditional distribution of $S$ given $N$ is equal to $n$".

We can plot all of these distributions using `groupby`:

In [None]:
viridis_cycler = plot.create_cycler(plt.cm.viridis, 7)
kwargs = {"histtype": "step", "grid": False,
          "density": True, "bins":  range(0, 12), "align": "left"}

with mpl.rc_context({"axes.prop_cycle": viridis_cycler}):
    conditionals_fig, ax = plt.subplots(figsize=(8, 4))
    # pandas also has plotting functions, which can be used with groupbys!
    effort_samples.groupby("attempts")["successes"].hist(lw=4,
        **kwargs);
    
effort_samples["successes"].hist(lw=6, **kwargs, color="k", label="marginal");
ax.set_xlabel("Successes"); plt.legend(); plt.tight_layout()

The collection of all of the colored histograms above is $p(S\lvert N)$.
A single colored histogram is $p(S\lvert N = n)$ for some fixed $n$.
The thicker histogram in black is the marginal, $p(S)$.

As above, more brightly-colored means $n$ is larger.

All of these shapes are binomials!
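
The black marginal is not a binomial, but it is built out of them: it is the average of the colored conditionals, with each one weighted by $p(N=n)$. Here's a minimal sketch of that sum for the effort model, using only the standard library. The helper name `marginal_successes` is made up for this example.

```python
from math import comb

# p(N=n) for n = 0, ..., 6, copied from the model definition
p_attempts = [0, 1/4, 1/8, 1/8, 1/8, 1/8, 1/4]


def marginal_successes(s, p_success=0.5):
    """p(S=s): sum over n of p(N=n) * Binomial(s; n, p_success)."""
    total = 0.0
    for n, p_n in enumerate(p_attempts):
        if s <= n:  # can't succeed more often than you attempt
            binom_pmf = comb(n, s) * p_success**s * (1 - p_success)**(n - s)
            total += p_n * binom_pmf
    return total


p_S = [marginal_successes(s) for s in range(7)]
print(p_S)
print(sum(p_S))
```

The largest entry is at $s=1$, agreeing with the mode read off the sampled marginal earlier.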

In [None]:
n = pm.DiscreteUniform.dist(lower=1, upper=6).random().item()
label= r"$p(S\vert N=" +str(n) + ")$"

with mpl.rc_context({"axes.prop_cycle": viridis_cycler[n:n+1]}):
    f, ax = plt.subplots(figsize=(8, 4))
    data = effort_samples.groupby("attempts")["successes"].get_group(n)
    best_binomial_pmf = scipy.stats.binom(n=n, p=data.mean() / n).pmf
    data.hist(
        histtype="step", lw=4, grid=False, density=True, bins=range(0, 7), label=label, align="left");
    plt.xlabel("Successes")
    
ax.plot(range(7),
        best_binomial_pmf(range(7)), lw=4, marker=".", markersize=24,
        label="closest binomial", color="C1");
plt.legend();

Here, I'm picking a random value of `N`
(pyMC can also be used to generate random variables outside of a model,
using the `.dist` and `.random` methods as above).

Then, I'm grabbing the values of `successes` observed when `"attempts"` is equal to `n`
with a `groupby` (and `.get_group` to pull out a single group).

Lastly, I'm comparing the histogram of those values to the `Binomial` whose distribution
is closest to it.
Again, the method for picking the parameters of that `Binomial` is not important just yet.

All of the histograms are close to their `best_binomial` fits,
which you can observe by executing the cell repeatedly.

Now we can say very concretely what it means for `X` to be a `Foo` random variable, aka the output of `pm.Foo`.

It means that the conditional distribution of `X`, _given a specific value for all of its parameters_, is a `Foo` distribution.

In [None]:
utils.daft.make_parameters_graph();

Note: the documentation of pyMC refers to things like `pm.Foo` as "log-likelihoods".

We'll see why at the end of this lecture, in the section on Bayes' Rule.

In the concrete case we're considering,
```python
S = pm.Binomial(...)
```
and therefore
the conditional distribution of `S`, given all of its parameters, is `Binomial`.

In [None]:
conditionals_fig

The height of the histogram at a point $s$ on the "Successes" axis is
$$
p(S=s\lvert N=n)
$$

in English, this is "the probability that $S$ is equal to $s$ _given_ $N$ is equal to $n$".

Note that, in general, the frequency at which a pair of values is observed in the data
can't be determined by looking just at the frequencies of the two values separately.

It is important that the values are _taken from the same sample_.
In real data, that means measured at the same time, or from the same subject, or similar.

Shuffling lets us approximate what our data would look like
if it didn't matter whether we took from the same sample or not.

In [None]:
shuffled_samples = pd.DataFrame(effort_samples["successes"])
shuffled_samples["attempts"] = effort_samples["attempts"].sample(frac=1).values  # shuffle

In [None]:
plot.jointgrid_effort(shuffled_samples)

Notice that the joint distribution has changed,
even though the marginal distributions have not.

In [None]:
plot.histogram3d(shuffled_samples["attempts"], shuffled_samples["successes"],
                 labels=["Attempts", "Successes"], bins=np.arange(-0.5, 7.5),
                 cm=plt.cm.viridis);

Now, all of the conditional distributions have about the same shape:
each collection of similar-colored bars has the same overall shape,
the shape of the marginal distribution of $S$,
though the bars are all _scaled differently_:
if the fraction of times a given number of attempts occurred is smaller,
all of the bars in that column are shorter.

In [None]:
with mpl.rc_context({"axes.prop_cycle": viridis_cycler}):
    f, ax = plt.subplots(figsize=(8, 4))
    shuffled_samples.groupby("attempts")["successes"].hist(
        histtype="step", lw=4, grid=False, density=True, bins=range(0, 12));

shuffled_samples["successes"].hist(
    histtype="step", lw=6, grid=False, density=True, bins=range(0, 12), color="k", label="marginal");
ax.set_xlabel("Successes"); plt.legend(); plt.tight_layout()

This figure, which is once again plotting the conditional distributions $p(S\lvert N)$,
indicates that the difference in heights goes away once we take into account
the fraction of observations with $N=x$.

So in general,

$$
p(S=y, N=x) \neq p(S=y) \cdot p(N=x) 
$$

But instead,

$$
p(S=y, N=x) = p(S=y \lvert N=x) \cdot p(N=x) 
$$

But if $p(S=y\lvert N=x)$ is the same as  $p(S=y)$, the marginal probability, as in the shuffled data above, then

$$\begin{align}
p(S=y, N=x) &= p(S=y \lvert N=x) \cdot p(N=x) \\
&= p(S=y) \cdot p(N=x) 
\end{align}$$

This is called _statistical independence_,
and we'll have more to say about it in the future.
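
One rough way to check this numerically is to compare the estimated joint table against the product of its own marginals, before and after shuffling. The sketch below regenerates the effort samples with plain NumPy so that it is self-contained; the helper names `joint_table` and `gap_from_independence` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
p_attempts = np.array([0, 1/4, 1/8, 1/8, 1/8, 1/8, 1/4])
attempts = rng.choice(7, size=50000, p=p_attempts)
successes = rng.binomial(n=attempts, p=0.5)
shuffled_attempts = rng.permutation(attempts)  # break the pairing


def joint_table(n_values, s_values):
    """Estimate p(N=x, S=y) with one bin per outcome."""
    edges = np.arange(-0.5, 7.5)
    counts, _, _ = np.histogram2d(n_values, s_values, bins=[edges, edges])
    return counts / counts.sum()


def gap_from_independence(joint):
    """Largest difference between p(N, S) and p(N) * p(S)."""
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return np.abs(joint - product).max()


gap_original = gap_from_independence(joint_table(attempts, successes))
gap_shuffled = gap_from_independence(joint_table(shuffled_attempts, successes))
print(gap_original, gap_shuffled)
```

The gap for the shuffled data won't be exactly zero, because we only have finitely many samples, but it should be much smaller than the gap for the original data.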

Note that you can do the same looking the other way:
look at a collection of bars that have the same position on the "Successes" axis.
That would be the conditional distribution

$$
p(N \lvert S=y)
$$

Call `histogram3d` with the keyword argument `colorby="y"` to color the bars
according to the value of their position on the "Successes" axis instead.
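
With samples in hand, conditioning in this direction is just filtering: keep the draws where the condition holds, then look at the other variable. A small NumPy sketch, with the samples regenerated outside of pyMC so the block stands alone (the names and the choice `y = 0` are assumptions of this example):

```python
import numpy as np

rng = np.random.default_rng(0)
p_attempts = np.array([0, 1/4, 1/8, 1/8, 1/8, 1/8, 1/4])
attempts = rng.choice(7, size=50000, p=p_attempts)
successes = rng.binomial(n=attempts, p=0.5)

y = 0
# condition on S=y: keep only the draws where that condition holds
attempts_given_y = attempts[successes == y]

# relative frequencies of N among the kept draws estimate p(N | S=y)
p_N_given_y = np.bincount(attempts_given_y, minlength=7) / len(attempts_given_y)
print(p_N_given_y)
```

For `y = 0`, about two thirds of the mass lands on $N=1$: under this model, most participants with no successes are the ones who only played once.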

## Example: Measurement Noise

Whenever we measure an analog value out in the world,
the thing we would like to measure,
the _signal_,
is all mixed up with things we didn't intend to measure,
the _noise_.

We'd like to measure an organism's weight, but the weight we measure varies according to factors like:

  - most recent meal

  - point in the breathing cycle

  - air pressure fluctuations on our scale

  - electronic or mechanical variability inside our scale

  - the [position of the moon](https://www.quora.com/How-much-does-the-orbit-of-the-Moon-affect-my-weight)

In [None]:
utils.daft.make_bivariate_graph("S", "M", observed=False);

That is, we have a $S$ignal we'd like to know the value of and a $M$easurement we've taken:

$$
M \sim \text{Norm}\left(S, 1\right) \\
S \sim \text{Norm}\left(0, 1\right)
$$

Or, alternatively:

$$
M = S + \varepsilon \\
S \sim \text{Norm}(0, 1) \\
\varepsilon \sim \text{Norm}(0, 1)
$$

Where the term $\varepsilon$ is called the _noise_.

Why does it have a normal distribution?
Because if we assume that the factors interfering with our measurement
are numerous,
are mostly independent from one another,
and combine approximately additively,
the Central Limit Theorem says their combined effect will be approximately normal.
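
Here's a quick sketch of that argument in NumPy. Each interfering factor is drawn from a uniform distribution, which looks nothing like a normal, and then the factors are added up. The number of factors and their range are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_measurements = 50, 20000

# each interfering factor is uniform and independent of the others
factors = rng.uniform(-0.5, 0.5, size=(n_measurements, n_factors))
summed_noise = factors.sum(axis=1)  # the factors combine additively

# a normal distribution puts about 68% of its mass within one std of its mean
z = (summed_noise - summed_noise.mean()) / summed_noise.std()
fraction_within_one_std = np.mean(np.abs(z) < 1)
print(fraction_within_one_std)
```

A histogram of `summed_noise` gives the same impression visually: a bell shape, even though no single factor has one.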

The name _noise_ comes from what such interfering factors sound like.

In [None]:
noise = pd.Series(pm.Normal.dist(mu=0, sd=1).random(size=62500))

In [None]:
Audio(noise, rate=10000)

These ideas were worked out for the case where $S$ was an analog audio signal,
being transmitted over radio waves or a similar medium.

The exact materials through which the waves passed, the state and behavior of the air,
the uncontrolled behavior of the small electronic components of the radios,
all combined together to vary the audio signal at the receiver.
Shockingly, the result always sounds about the same: it is _noise_.

When you want to be particular, this kind of noise is called
[white noise](https://en.wikipedia.org/wiki/White_noise).

In [None]:
PIL.Image.fromarray(utils.util.to_image(noise.sample(10000))).resize((250, 250))

It also shows up in the analog transmission of visual signals,
e.g. television, where it is known as _static_.

Normal distributions are perhaps less common when measuring psychological phenomena,
for which there aren't numerous, independent nuisances, like there are for more directly physical phenomena.

However, many branches of psychology now measure physical phenomena in the brain,
using tools like
[PET and EEG](https://charlesfrye.github.io/FoundationalNeuroscience/84/)
or [fMRI](https://charlesfrye.github.io/FoundationalNeuroscience/83/),
and so their measurements are often subject to Gaussian noise.

In pyMC, our model for this measurement process looks like:

In [None]:
with pm.Model() as measurement_model:
    signal = pm.Normal("signal", mu=0, sd=1)
    measurement = pm.Normal("measurement", mu=signal, sd=1)

Loose description of sampling process:
- "Imagine the signal were 0.267", aka select a random value of the signal 
- "If the signal were 0.267, a plausible value of the measurement is 0.753", aka sample $M$ with parameter `mu=0.267`.
- Repeat.
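
The same two steps can be written out in plain NumPy. As before, this is a sketch of the idea and not a claim about how pyMC samples internally; the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
draws = 10000

# step 1: select a random value of the signal, S ~ Norm(0, 1)
signal = rng.normal(loc=0, scale=1, size=draws)

# step 2: sample the measurement with mu set to that signal, M ~ Norm(S, 1)
measurement = rng.normal(loc=signal, scale=1)

print(measurement.var())                       # spread of M on its own
print(np.corrcoef(signal, measurement)[0, 1])  # S and M move together
```

The variance of `measurement` comes out close to 2 rather than 1, because the variance of the signal and the variance of the noise add.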

In [None]:
measurement_samples = shared_util.samples_to_dataframe(shared_util.sample_from(
    measurement_model, draws=10000, progressbar=True))

In [None]:
measurement_samples.head()

In [None]:
hexplot = sns.jointplot(measurement_samples["signal"], measurement_samples["measurement"], kind="hex", height=8);
plot.add_cbar(hexplot.fig);

The center of this plot is another "2-D histogram".
Rather than using square bins, it uses hexagonal bins,
but it's still just counting how many observations are in each bin.

In [None]:
plot.histogram3d(measurement_samples["signal"], measurement_samples["measurement"],
                 labels=["Signal", "Measurement"],
                 cm=plt.cm.Blues, colorby="x")

We can once again shuffle one of the columns to see what the joint distribution might look like
if the two variables were independent:

In [None]:
hexplot = sns.jointplot(measurement_samples["signal"].sample(frac=1), measurement_samples["measurement"],
                        kind="hex", height=8);
plot.add_cbar(hexplot.fig)

In [None]:
plot.histogram3d(measurement_samples["signal"].sample(frac=1), measurement_samples["measurement"],
                 labels=["Signal", "Measurement"],
                 cm=plt.cm.Blues, colorby="x")

Notice: the heights of the bars are not the same, going across x or across y,
but the _shape_ of the collection of bars, going across y but fixing a value of x,
is the same as the _shape_ of the collection of bars at a different value of x,
and both shapes look like their respective marginal distribution.