<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Regression 03 - Priors in Regression Models

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

from IPython.display import HTML, Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

import shared.src.utils.util as shared_util

# Linear regression is the most important regression model.

In this lecture,
we continue our deep dive into linear regression:

$$y \sim \text{Foo}(\text{slope}\cdot x + \text{intercept}, \sigma)$$

## Among linear regression models, the case where the likelihood is Normal is particularly important.

So we'll focus mostly on the case where the likelihood is Normal:

$$y \sim \text{Normal}(\text{slope}\cdot x + \text{intercept}, \sigma)$$

As with the previous lecture, today we'll work with a famous dataset:
Sir Francis Galton's parent-child height dataset ([source](https://doi.org/10.7910/DVN/T0HSJ1)),
for which the technique of regression was invented and named
([original paper](http://www.stat.ucla.edu/~nchristo/statistics100C/history_regression.pdf)).

In [None]:
df = pd.read_csv("./data/galton_height.csv", index_col=0)

It contains the heights of nearly 1000 English individuals, their sex, and the heights of both their parents, collected in 1885.

Following Galton, we summarize the parental heights by averaging them to obtain a "`midparental_height`".

In [None]:
print(df.head())

In [None]:
sns.jointplot(x="midparental_height", y="height",  data=df, kind="hex");

# In regression, data is often standardized with $z$-scoring.

In [None]:
def standardize(data):
    return (data - data.mean()) / data.std(ddof=0)

In [None]:
standardized_midparental_heights = standardize(df["midparental_height"])
standardized_maternal_heights = standardize(df["mother"])
standardized_paternal_heights = standardize(df["father"])
standardized_heights = standardize(df["height"])

## Once standardized with $z$-scoring, the means of all variables are 0.

In [None]:
np.allclose(0,
            [standardized_heights.mean(), standardized_maternal_heights.mean(),
             standardized_paternal_heights.mean(), standardized_midparental_heights.mean()])

Therefore our intercept can be fixed at 0.
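As a quick check of why the intercept drops out, here is a minimal sketch on synthetic data (not the Galton data; the numbers and the names `zx` and `zy` are made up for illustration). It fits a least-squares line to $z$-scored variables and compares the result to the Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for parent and child heights in inches (illustrative only).
x = rng.normal(69.0, 1.8, size=500)
y = 0.6 * x + rng.normal(27.0, 2.0, size=500)


def standardize(data):
    # Same z-scoring as above: subtract the mean, divide by the population std.
    return (data - data.mean()) / data.std(ddof=0)


zx, zy = standardize(x), standardize(y)

# np.polyfit returns coefficients from highest degree to lowest: slope, then intercept.
slope, intercept = np.polyfit(zx, zy, deg=1)
correlation = np.corrcoef(x, y)[0, 1]

print(f"intercept = {intercept:.2e}")
print(f"slope = {slope:.4f}, correlation = {correlation:.4f}")
```

Under these assumptions the fitted intercept is 0 up to floating-point error and the slope matches the correlation, which is why the slope of a standardized regression can be read as a correlation.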

# As always in modeling, we specify a prior and a likelihood in order to obtain a posterior.

The posterior represents our updated belief about how the two variables relate to one another,
once we've observed our dataset.

$$
\color{green}{p(\text{slope} \vert \text{data})}
\propto \color{darkgoldenrod}{p(\text{data} \vert \text{slope})}
\cdot \color{darkblue}{p(\text{slope})}
$$

That is, our

$\color{green}{\text{updated belief about the plausibility of a given relationship between x and y}}$

is proportional to

$\color{darkgoldenrod}{\text{how likely the data is under that relationship}}$

multiplied by

$\color{darkblue}{\text{how plausible we thought that relationship was before we saw the data}}$.
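To make the proportionality concrete, here is a rough grid-approximation sketch. It assumes synthetic data, a Normal likelihood with `sd=1`, and an illustrative Normal(0, 0.1) prior on the slope; none of these values come from the models in these slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic standardized-style data with a true slope of 0.3 (illustrative only).
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)

slopes = np.linspace(-1, 1, 2001)  # candidate slope values
step = slopes[1] - slopes[0]

# log p(data | slope) for a Normal likelihood with sd=1, up to a constant
residuals = y[None, :] - slopes[:, None] * x[None, :]
log_likelihood = -0.5 * np.sum(residuals ** 2, axis=1)

# log p(slope) for an assumed Normal(0, 0.1) prior, up to a constant
log_prior = -0.5 * (slopes / 0.1) ** 2

# posterior is proportional to likelihood times prior:
# add the logs, exponentiate, then normalize on the grid
log_posterior = log_likelihood + log_prior
posterior = np.exp(log_posterior - log_posterior.max())
posterior /= posterior.sum() * step

mle = slopes[np.argmax(log_likelihood)]
map_estimate = slopes[np.argmax(log_posterior)]
print(f"MLE = {mle:.3f}, MAP = {map_estimate:.3f}")
```

The normalizing constant never has to be computed analytically: on a grid, dividing by the sum is enough.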

# These slides focus on the role of and common choices for the prior in regression models.

In the previous lecture, we focused on _flat_ priors,
which are technically not probability distributions,
so that we could connect to more mainstream MLE methods.

A linear regression model with a normal likelihood and a flat prior
is known as an _ordinary least squares model_
because the MAP for the parameters also minimizes the squared error.
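The equivalence can be checked numerically. The sketch below uses synthetic data and a no-intercept model with `sd=1`; the helper names are hypothetical. It compares the minimizer of the negative log-likelihood with the minimizer of the squared error and with the closed-form least-squares slope.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)


def negative_log_likelihood(slope):
    # Normal likelihood with sd=1; a flat prior only adds a constant to this.
    n = len(x)
    return 0.5 * np.sum((y - slope * x) ** 2) + 0.5 * n * np.log(2 * np.pi)


def squared_error(slope):
    return np.sum((y - slope * x) ** 2)


# Closed-form least-squares slope for a line through the origin
ols_slope = np.sum(x * y) / np.sum(x * x)

slopes = np.linspace(-2, 2, 4001)
nll_minimizer = slopes[np.argmin([negative_log_likelihood(s) for s in slopes])]
sse_minimizer = slopes[np.argmin([squared_error(s) for s in slopes])]

print(f"closed form = {ols_slope:.4f}")
print(f"NLL minimizer = {nll_minimizer:.4f}, SSE minimizer = {sse_minimizer:.4f}")
```

Because the negative log-likelihood is half the squared error plus a constant, both are minimized at the same slope.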

In [None]:
with pm.Model() as galton_model:
    Slope = pm.Flat("Slope")
    ObservedValues = pm.Normal("Heights",
                               mu=Slope * standardized_midparental_heights,
                               sd=1,
                               observed=standardized_heights)

galton_trace = shared_util.sample_from(galton_model)

galton_MLE = pm.find_MAP(model=galton_model)

Unlike in tests for differences of means,
it's actually quite common for regression models to include
priors, even if they aren't always thought of as such.

To understand why,
we'll consider how a prior can help us answer the following question:

# How do we know when we can ignore the linear relationship between two variables?

For example,
Galton's original explanation of his findings
presumed that individuals selected mates
without respect to height:
if not,
then the predicted heights of grandchildren might not exhibit
"regression to mediocrity".

Instead of making this assumption, we can check the data:
we have both maternal and paternal heights.

If the paternal height can be used to
predict the maternal height to great accuracy,
then the assumption that mate selection mostly ignores height
is unlikely to be true.

In [None]:
with pm.Model() as selection_effect_model:
    Slope = pm.Flat("Slope")
    # predict maternal height from paternal height, as described above
    ObservedValues = pm.Normal("MaternalHeights",
                               mu=Slope * standardized_paternal_heights,
                               sd=1,
                               observed=standardized_maternal_heights)

selection_effect_trace = shared_util.sample_from(selection_effect_model)
selection_effect_posterior_df = shared_util.samples_to_dataframe(selection_effect_trace)

selection_effect_MLE = pm.find_MAP(model=selection_effect_model)

In [None]:
ax = pm.plot_posterior(selection_effect_trace, ref_val=0, figsize=(12, 6), text_size=16);
ax.vlines(selection_effect_MLE["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6,
          color="C4", label="Flat Prior MAP / MLE");
ax.set_ylim(1.2 * np.array(ax.get_ylim())); ax.legend();

It seems that the 95% highest posterior density interval does not include 0,
so by the method we've been using so far in the class,
we'd have to conclude that there's a flaw in Galton's analysis.

Furthermore, the Maximum Likelihood Estimate is not 0 either.

## It is important to consider the _magnitude_ of a relationship, not just whether it is above or below 0.

The posterior above indicates that we should change our expectation of a mother's height
by somewhere around a tenth of a standard deviation every time the father gets taller or shorter by a standard deviation.

This is a _very small_ effect.

For a father who is three standard deviations above average height,
that is, someone who is six feet, two inches tall,
we predict a maternal height of less than an inch above average.
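The arithmetic behind that prediction can be sketched as follows. The maternal standard deviation here is an assumed, illustrative value rather than one computed from the dataset; swap in `df["mother"].std(ddof=0)` for the real number.

```python
# Approximate slope from the posterior above, in standard-deviation units.
slope = 0.1
# How far the father is above average, in standard deviations.
father_z = 3.0
# ASSUMPTION: illustrative standard deviation of maternal heights, in inches.
maternal_sd_inches = 2.3

# Prediction in standardized units, then converted back to inches.
predicted_mother_z = slope * father_z
predicted_shift_inches = predicted_mother_z * maternal_sd_inches

print(f"{predicted_mother_z:.2f} SD, about {predicted_shift_inches:.2f} inches above average")
```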

## Enter the ROPE: Region of Practical Equivalence

Before running our analysis, we define a set of values, close to 0,
which we consider to be _practically equivalent to 0_.

#### We call this the Region Of Practical Equivalence.

Once we obtain a posterior, we can check the overlap between our posterior and this region.

For posteriors with more mass close to 0, this overlap will be large:

In [None]:
ax = pm.plot_posterior(selection_effect_trace, ref_val=0, rope=(-0.05, 0.05),
                       figsize=(12, 6), text_size=16);
ax.vlines(selection_effect_MLE["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6,
          color="C4", label="Flat Prior MAP / MLE");
ax.set_ylim(1.2 * np.array(ax.get_ylim())); ax.legend();

I selected -0.05 to 0.05 as my ROPE here:
a correlation is _effectively_ 0 if it suggests
that I need to change my prediction by at most 1/20 of a standard deviation
when the independent variable changes by a full standard deviation.

Note that the ROPE would be different if we had not standardized our data --
the size of the ROPE is dependent on the meaning of the parameter.

As usual, we estimate the probability we assign to the statement
"the correlation is within the region of practical equivalence"
by checking how often it is true on samples from our posterior.

In [None]:
def is_in_ROPE(sample, ROPE=(-0.05, 0.05)):
    return ROPE[0] < sample < ROPE[1]

In [None]:
selection_effect_posterior_df["Slope"].apply(is_in_ROPE).mean()

There is a fairly decent chance, around 25%,
that the correlation between maternal and paternal heights is negligible,
according to this choice of ROPE.
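The same estimate can be written in vectorized form. The sketch below uses stand-in samples drawn from a Normal with hypothetical mean and standard deviation, not the actual trace, so that the Monte Carlo fraction can be compared against an exact value.

```python
import math

import numpy as np

rng = np.random.default_rng(3)

# ASSUMPTION: stand-in "posterior samples"; in the notebook these would be
# selection_effect_posterior_df["Slope"]. Mean and sd are hypothetical.
posterior_mean, posterior_sd = 0.07, 0.03
samples = rng.normal(posterior_mean, posterior_sd, size=20_000)

rope = (-0.05, 0.05)

# Fraction of samples strictly inside the ROPE: a Monte Carlo probability estimate
in_rope = (samples > rope[0]) & (samples < rope[1])
rope_probability = in_rope.mean()


def normal_cdf(value, mu, sd):
    return 0.5 * (1 + math.erf((value - mu) / (sd * math.sqrt(2))))


# Exact probability for the stand-in Normal, for comparison
exact = (normal_cdf(rope[1], posterior_mean, posterior_sd)
         - normal_cdf(rope[0], posterior_mean, posterior_sd))

print(f"sampled = {rope_probability:.3f}, exact = {exact:.3f}")
```

With real posterior samples there is no exact value to compare against, which is why the sample fraction is the quantity we report.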

## But the MAP estimate of the parameter is still outside the ROPE.

So according to our model,
we should say that the most likely effect is neither 0 nor negligible.

# If we change our priors, and so incorporate different beliefs into our model, we can change our posteriors and MAP estimates.

# In particular, we can choose a &#8220;skeptical&#8221; prior.

Our choice of `Flat` prior indicated that we thought _any_ slope was possible.

We might want to instead adopt
[Ockham's Razor](https://plato.stanford.edu/entries/ockham/),
which states that simpler models are to be preferred to more complex ones.

A model in which there is no effect is considered simpler than a model in which there is an effect.

This principle is motivated less by an understanding of mechanism and more by what is essentially a preference:
we would prefer to work with simpler models,
if we can get away with it.

Some elevate Ockham's Razor all the way to a law of nature,
like Isaac Newton:
> [Nature does nothing in vain, and more is in vain when less will serve; for Nature is pleased with simplicity, and affects not the pomp of superfluous causes.](https://en.wikisource.org/wiki/Page:Newton%27s_Principia_(1846).djvu/390)

That is, nature itself is simple.

I consider that position
[somewhat dubious](https://www.theatlantic.com/science/archive/2016/08/occams-razor/495332/),
especially when it comes to
claims about things like regression slopes.
To say there is _absolutely no relationship_ between the heights of mothers and fathers
seems far too strong.

A skeptical prior expresses a softer version of this preference:
if our prior puts high weight on values inside the ROPE,
then it will require strong evidence,
in terms of the likelihood,
in order for our posterior to put high weight on values outside the ROPE.

### A Normal prior centered at 0 can be a skeptical prior.

The Normal distribution puts almost all of its weight
within a few standard deviations of its mean,
and so with the correct choice of parameters,
we can put a high weight on the ROPE.

The choice of a Normal prior for slope parameters is known as
[_ridge regression_ in statistics](https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b),
[weight decay in the neural network literature](https://metacademy.org/graphs/concepts/weight_decay_neural_networks),
or [$\ell_2$ regularization in machine learning](https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/).

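$\ell_2$, pronounced "ell-two", refers to the $\ell_2$ norm, the square root of the sum of squares:
up to constants, the negative log of a Normal prior centered at 0 is the sum of the squared parameter values.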

In general,
if you come across a "regularization" mechanism
in a machine learning algorithm,
it is almost always expressing a Bayesian prior --
implicitly or explicitly.
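For the single-slope model used in these slides, the link between a Normal prior and an $\ell_2$ penalty can be sketched directly. The example below assumes synthetic data, a likelihood with `sd=1`, and a Normal prior with standard deviation `prior_sd`; the closed form follows from setting the derivative of the negative log posterior to 0.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=400)
y = 0.1 * x + rng.normal(size=400)

prior_sd = 2.5e-2            # sd of the Normal prior on the slope
penalty = 1 / prior_sd ** 2  # ridge penalty strength implied by that prior


def negative_log_posterior(slope):
    squared_error_term = 0.5 * np.sum((y - slope * x) ** 2)  # from the likelihood
    l2_penalty_term = 0.5 * penalty * slope ** 2             # from the Normal prior
    return squared_error_term + l2_penalty_term


mle = np.sum(x * y) / np.sum(x * x)
# Closed-form MAP: the penalty is added to the denominator of the MLE.
ridge_map = np.sum(x * y) / (np.sum(x * x) + penalty)

# Brute-force check of the closed form on a grid of candidate slopes
slopes = np.linspace(-0.5, 0.5, 10001)
grid_map = slopes[np.argmin([negative_log_posterior(s) for s in slopes])]

print(f"MLE = {mle:.4f}, ridge MAP = {ridge_map:.4f}, grid MAP = {grid_map:.4f}")
```

A narrower prior means a larger penalty, and so a MAP estimate pulled more strongly toward 0.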

Let's see how incorporating this "skeptical Normal prior"
changes our posterior and MAP estimate
for the relationship between maternal and paternal heights:

In [None]:
with pm.Model() as selection_ridge_model:
    # ridge regression <> Normal prior on slope
    Slope = pm.Normal("Slope", mu=0, sd=2.5e-2)
    # This prior says: I think there is a roughly 95% chance that
    #  the correlation is between -5e-2 and +5e-2 (and so inside the ROPE)

    # predict maternal height from paternal height
    MaternalHeights = pm.Normal("MaternalHeights",
                                mu=Slope * standardized_paternal_heights,
                                sd=1,
                                observed=standardized_maternal_heights)

selection_ridge_trace = shared_util.sample_from(selection_ridge_model)
selection_ridge_posterior_df = shared_util.samples_to_dataframe(selection_ridge_trace)

selection_ridge_MAP = pm.find_MAP(model=selection_ridge_model)

In [None]:
ax = pm.plot_posterior(selection_ridge_trace, rope=(-0.05, 0.05), ref_val=0,
                       figsize=(12, 6), text_size=16)

ax.vlines(selection_ridge_MAP["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6, color="C1",
          label="Gaussian Prior MAP")
ax.vlines(selection_effect_MLE["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6,
          color="C4", label="Flat Prior MAP / MLE");
ax.set_ylim(1.2 * np.array(ax.get_ylim())); ax.legend();

The overlap between the posterior and the ROPE is now much greater:

In [None]:
selection_ridge_posterior_df["Slope"].apply(is_in_ROPE).mean()

and the MAP value is in the Region of Practical Equivalence:

In [None]:
is_in_ROPE(selection_ridge_MAP["Slope"])

But the MAP value is still not exactly 0.

Is it possible to express our skepticism about effects in
a prior that gives us _exactly_ 0 values for the MAP?

# To obtain MAP values that are exactly 0, we use a different prior.

Consider the following distributions
which are members of the
[_Laplace_](https://en.wikipedia.org/wiki/Laplace_distribution)
family:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
xs = np.linspace(-3, 3, num=1000)
ax.plot(xs, np.exp(pm.Laplace.dist(mu=0, b=0.1).logp(xs).eval()), lw=4);
ax.plot(xs, np.exp(pm.Laplace.dist(mu=0, b=1).logp(xs).eval()), lw=4);

A Laplace distribution is made of two exponential distributions,
one positive and one negative,
"stapled" together at 0.

Relative to a Normal distribution,
the Laplace places even greater weight on values near 0
and on relatively large values.
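One way to quantify this comparison is to match the variances of the two distributions and then compare the density at 0 and the probability far from 0. The sketch below does that with closed-form expressions; the choice of "more than 3 standard deviations" as the tail is arbitrary.

```python
import math

sigma = 1.0               # standard deviation shared by both distributions
b = sigma / math.sqrt(2)  # Laplace scale with the same variance, 2 * b**2

# Density at 0
normal_peak = 1 / (sigma * math.sqrt(2 * math.pi))
laplace_peak = 1 / (2 * b)

# Probability of landing more than 3 standard deviations from 0
normal_tail = math.erfc(3 / math.sqrt(2))
laplace_tail = math.exp(-3 * sigma / b)

print(f"density at 0:   Normal {normal_peak:.3f}, Laplace {laplace_peak:.3f}")
print(f"P(|x| > 3 sd):  Normal {normal_tail:.4f}, Laplace {laplace_tail:.4f}")
```

At equal variance, the Laplace is both more peaked at 0 and heavier in the tails; the Normal has more mass at intermediate values.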

## The Laplace distribution is also used as a &#8220;skeptical&#8221; prior.

The choice of a Laplace prior for slope parameters is known as
[_lasso regression_ in statistics](https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b),
[sparse weight prior in the neural network literature](https://medium.com/mlreview/l1-norm-regularization-and-sparsity-explained-for-dummies-5b0e4be3938a),
or [$\ell_1$ regularization in machine learning](https://medium.com/mlreview/l1-norm-regularization-and-sparsity-explained-for-dummies-5b0e4be3938a).

With the right choice of parameters,
we can even ensure that it puts approximately the same amount of total probability on the slope being inside the Region of Practical Equivalence as did the Normal:

In [None]:
pd.Series(pm.Laplace.dist(mu=0, b=0.015).random(size=5000)).apply(is_in_ROPE).mean()
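The sampled estimate can also be checked against the exact probabilities, since both priors have simple closed forms for the mass inside a symmetric interval around 0. This sketch uses only the parameters already chosen above (`sd=2.5e-2` for the Normal, `b=0.015` for the Laplace).

```python
import math

rope_half_width = 0.05

# P(|slope| < 0.05) under Normal(0, sd=0.025)
normal_mass = math.erf(rope_half_width / (2.5e-2 * math.sqrt(2)))

# P(|slope| < 0.05) under Laplace(0, b=0.015)
laplace_mass = 1 - math.exp(-rope_half_width / 0.015)

print(f"Normal prior mass in ROPE:  {normal_mass:.4f}")
print(f"Laplace prior mass in ROPE: {laplace_mass:.4f}")
```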

Let's see how incorporating this "skeptical Laplace prior"
changes our posterior and MAP estimate
for the relationship between maternal and paternal heights:

In [None]:
with pm.Model() as selection_lasso_model:
    # lasso regression <> Laplace prior on slope
    Slope = pm.Laplace("Slope", mu=0, b=0.015)
    # predict maternal height from paternal height
    MaternalHeights = pm.Normal("MaternalHeights",
                                mu=Slope * standardized_paternal_heights,
                                sd=1,
                                observed=standardized_maternal_heights)

selection_lasso_trace = shared_util.sample_from(selection_lasso_model, target_accept=0.9)
selection_lasso_posterior_df = shared_util.samples_to_dataframe(selection_lasso_trace)
selection_lasso_MAP = pm.find_MAP(model=selection_lasso_model)

In [None]:
ax = pm.plot_posterior(selection_lasso_trace, rope=(-0.05, 0.05), ref_val=0,
                       figsize=(12, 6), text_size=16)

ax.vlines(selection_ridge_MAP["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6, color="C1",
          label="Gaussian Prior MAP")
ax.vlines(selection_effect_MLE["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6,
          color="C4", label="Flat Prior MAP / MLE");
ax.vlines(selection_lasso_MAP["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6,
          color="C5", label="Laplace Prior MAP");
ax.set_ylim(1.2 * np.array(ax.get_ylim())); ax.legend();

The overlap between the posterior and the ROPE remains high:

In [None]:
selection_lasso_posterior_df["Slope"].apply(is_in_ROPE).mean()

and the MAP value is not only in the Region of Practical Equivalence, but it's exactly 0:

In [None]:
is_in_ROPE(selection_lasso_MAP["Slope"]), selection_lasso_MAP["Slope"]
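One way to see why a Laplace prior can give a MAP of exactly 0: for a single slope with a likelihood of `sd=1`, minimizing the negative log posterior has a closed-form solution known as soft thresholding. The sketch below uses synthetic data, with the noise made orthogonal to `x` so that the MLE equals the chosen slope exactly; `lasso_map` is a hypothetical helper, not a PyMC3 function.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=900)
noise = rng.normal(size=900)
# Remove the component of the noise along x, so that sum(x * y) = slope * sum(x * x).
noise -= x * np.dot(x, noise) / np.dot(x, x)

b = 0.015  # Laplace scale used above


def lasso_map(x, y, b):
    # Closed-form minimizer of 0.5 * sum((y - s * x)**2) + abs(s) / b
    sxy, sxx = np.sum(x * y), np.sum(x * x)
    return np.sign(sxy) * max(abs(sxy) - 1 / b, 0.0) / sxx


weak_y = 0.03 * x + noise    # weak relationship: MLE is 0.03
strong_y = 0.40 * x + noise  # strong relationship: MLE is 0.40

weak_map = lasso_map(x, weak_y, b)
strong_map = lasso_map(x, strong_y, b)

print(f"weak effect MAP = {weak_map}, strong effect MAP = {strong_map:.3f}")
```

The prior subtracts a fixed amount from the evidence for a nonzero slope; when the evidence is smaller than that amount, the MAP lands on 0 exactly rather than merely near it.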

But note that the posterior probability of the parameter being 0
is 0,
because the posterior is still a density,
over continuous values.

In [None]:
(selection_lasso_posterior_df["Slope"] == 0).mean()

Resolving this and actually obtaining a model that places
non-zero probability on the parameter being 0 requires more advanced techniques.

# Note that with sufficient evidence, it is still possible for the posterior to be outside the ROPE.

For example, we still recover a non-zero value
for the slope for the original problem considered by Galton,
predicting the heights of offspring from the midparental height.

In [None]:
with pm.Model() as galton_ridge_model:
    Slope = pm.Normal("Slope", mu=0, sd=2.5e-2)
    # This prior says: I think there is a 95% chance
    #  the correlation is between -5e-2 and 5e-2 (and so inside the ROPE)
    
    Heights = pm.Normal("Heights",
                        mu=Slope * standardized_midparental_heights,
                        sd=1,
                        observed=standardized_heights)

galton_ridge_trace = shared_util.sample_from(galton_ridge_model)

galton_ridge_MAP = pm.find_MAP(model=galton_ridge_model)

In [None]:
ax = pm.plot_posterior(galton_ridge_trace, rope=(-0.05, 0.05), ref_val=0,
                       figsize=(12, 6), text_size=16)

ax.vlines(galton_ridge_MAP["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6, color="C1",
          label="Gaussian Prior MAP")
ax.vlines(galton_MLE["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6,
          color="C4", label="Flat Prior MAP / MLE");
ax.set_ylim(1.2 * np.array(ax.get_ylim())); ax.legend();

Note that the posterior is still closer to 0
than is the MLE.

This is because our prior, the Normal centered at 0,
puts substantially more weight on values with lower magnitude.

This effect is called
[shrinkage](https://en.wikipedia.org/wiki/Shrinkage_estimator).

It is not necessarily a bad thing,
since shrinkage often counters
overestimation bias in Maximum Likelihood Estimates,
but the shrinkage caused by a Normal prior is often too strong.

Let's compare the results for the Laplace prior.

In [None]:
with pm.Model() as galton_lasso_model:
    # lasso regression <> Laplace prior on slope
    Slope = pm.Laplace("Slope", mu=0, b=0.015)
    # predict offspring height from midparental height, as in galton_ridge_model
    Heights = pm.Normal("Heights",
                        mu=Slope * standardized_midparental_heights,
                        sd=1,
                        observed=standardized_heights)

galton_lasso_trace = shared_util.sample_from(galton_lasso_model, target_accept=0.9)

galton_lasso_MAP = pm.find_MAP(model=galton_lasso_model)

In [None]:
ax = pm.plot_posterior(galton_lasso_trace, rope=(-0.05, 0.05), ref_val=0,
                       figsize=(12, 6), text_size=16)

ax.vlines(galton_ridge_MAP["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6, color="C1",
          label="Gaussian Prior MAP")
ax.vlines(galton_MLE["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6,
          color="C4", label="Flat Prior MAP / MLE");
ax.vlines(galton_lasso_MAP["Slope"], 0, ax.get_ylim()[1] * 0.5, lw=6,
          color="C5", label="Laplace Prior MAP");
ax.set_ylim(1.2 * np.array(ax.get_ylim())); ax.legend();

Notice that the posterior is much closer to the MLE value,
while still slightly lower.

The Laplace prior, while placing more weight at 0 than the Normal,
also places more weight in the tails.
Remember that it is equivalent to a pair of Exponential distributions,
and the Exponential distribution,
which we've used as a prior for the standard deviation in many models,
has a heavier tail than the Normal.

If we want a prior that expresses the same "skepticism"
that inspired null hypothesis significance testing
without biasing our results too far downwards,
then the Laplace prior is often a good choice.
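The difference in shrinkage between the two priors can be sketched with closed-form MAP estimates. The formulas below assume a single standardized predictor (so the sum of squared predictor values equals `n`), a likelihood with `sd=1`, and an assumed sample size of `n = 900`; `ridge_map` and `lasso_map` are hypothetical helper names.

```python
def ridge_map(mle, n, prior_sd):
    # Normal prior: multiplies the MLE by a factor below 1.
    return mle * n / (n + 1 / prior_sd ** 2)


def lasso_map(mle, n, b):
    # Laplace prior: subtracts a fixed amount from the magnitude, stopping at 0.
    shrunk = max(abs(mle) - 1 / (n * b), 0.0)
    return shrunk if mle >= 0 else -shrunk


n = 900  # ASSUMPTION: roughly the size of the dataset ("nearly 1000")

for mle in (0.05, 0.4):
    ridge = ridge_map(mle, n, prior_sd=2.5e-2)
    lasso = lasso_map(mle, n, b=0.015)
    print(f"MLE = {mle:.2f}: Normal prior MAP = {ridge:.3f}, Laplace prior MAP = {lasso:.3f}")
```

Under these assumptions the Normal prior scales every estimate by the same factor, so large effects lose the most, while the Laplace prior removes small effects entirely and leaves large ones closer to the MLE.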

Because MAP parameter values under the Laplace prior are often exactly 0,
this regression goes by another name:
_sparse linear regression_.