<img src="../../shared/img/slides_banner.svg" width=2560></img>

# 10a - Formulas and Linear Models

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

from IPython.display import HTML, Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

import shared.src.utils.util as shared_util

# Least squares regression models have a Normal likelihood.

$$
y \sim \text{Normal}(f(X), \sigma)
$$
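
To see why a Normal likelihood goes hand in hand with least squares, it helps to write out the log-likelihood. This is a sketch, assuming $n$ independent observations $y_i$ with inputs $X_i$ and a fixed $\sigma$ (the index $i$ and the count $n$ are notation introduced here):

$$
\log p(y \mid X) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n} \left(y_i - f(X_i)\right)^2 - n\log\left(\sigma\sqrt{2\pi}\right)
$$

Only the first term depends on $f$, so the $f$ that maximizes the likelihood is the one that minimizes the sum of squared errors.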

# Linear least-squares models relate the independent to the dependent variable with a line.

$$
y \sim \text{Normal}(\text{slope}\cdot f(X) + \text{intercept}, \sigma)
$$

In [None]:
iris_df = sns.load_dataset("iris", data_home=Path("..") / ".." / "shared" / "data")

iris_df["is_setosa"] = iris_df["species"].apply(lambda s: bool(s == "setosa"))

In [None]:
sns.pairplot(iris_df, hue="is_setosa", vars=["sepal_length", "sepal_width"], height=6);

## The simplest linear model always predicts the mean.

This is the model used to set the baseline for calculating variance explained.
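
One way to see why the mean is the natural baseline: among all constant predictions, the mean has the lowest mean squared error, and that error is the (population) variance of the data. A small sketch with made-up numbers (the values in `toy_lengths` and the helper `constant_mse` are illustrative and are not part of the iris analysis):

```python
from statistics import fmean, pvariance

# Toy observations, invented for illustration only.
toy_lengths = [4.9, 5.1, 5.8, 6.3, 6.9]


def constant_mse(prediction, observations):
    """MSE of a model that predicts the same value for every observation."""
    return fmean((prediction - obs) ** 2 for obs in observations)


baseline_mse = constant_mse(fmean(toy_lengths), toy_lengths)

# Nudging the constant away from the mean in either direction increases the error.
higher_mse = constant_mse(fmean(toy_lengths) + 0.1, toy_lengths)
lower_mse = constant_mse(fmean(toy_lengths) - 0.1, toy_lengths)
```

Here `baseline_mse` matches `pvariance(toy_lengths)` up to floating-point error, and both nudged versions come out larger.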

In [None]:
mean_sepal_length = iris_df["sepal_length"].mean()
sd_sepal_length = iris_df["sepal_length"].std()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(iris_df["sepal_length"]);

In [None]:
with pm.Model() as mean_model_by_hand:
    Mean_Sepal_Length = pm.Normal("Mean", mu=mean_sepal_length, sd=1e6)
    Sigma_Sepal_Length = pm.Exponential("Sigma", lam=0.5 / sd_sepal_length)
    Sepal_Lengths = pm.Normal("Sepal Lengths",
                              mu=Mean_Sepal_Length, sd=Sigma_Sepal_Length,
                              observed=iris_df["sepal_length"])

mean_model_by_hand_samples = shared_util.sample_from(mean_model_by_hand)

In [None]:
pm.plot_posterior(mean_model_by_hand_samples, figsize=(12, 6), text_size=16,
                  ref_val=[mean_sepal_length, sd_sepal_length]);

## Technically, this model doesn't quite look like the canonical linear model

We are using

$$
y \sim \text{Normal}\left(\mu, \sigma\right)
$$

versus the linear model form

$$
y \sim \text{Normal}\left(\text{slope}\cdot \texttt{f}(X), \sigma\right)
$$

Notice that the `Intercept` is gone.
With this approach,
the `Intercept` ends up just becoming
the "slope" of a data feature that is all 1s.

## To match the two, we just need to define the right `f`:

In [None]:
def map_to_1(x):
    return 1

$$
y \sim \text{Normal}\left(\text{slope}\cdot 1, \sigma\right)
$$

This is called a _data feature_:
a non-linearly transformed version of a variable
that we want to use in a linear model.

We are always allowed to compute features of our data before performing linear regression --
this is sometimes called _linearization_.

One view is that [neural networks are "just" an automated version of linearization](https://towardsdatascience.com/5-reasons-logistic-regression-should-be-the-first-thing-you-learn-when-become-a-data-scientist-fcaae46605c4).

Homework 05 includes a deeper dive on feature encoding for linear models of categorical data,
which is a special case of linearization.
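
As a rough illustration of linearization, the sketch below fits a straight line to a feature `square(x)` rather than to `x` itself. The data, the helper `fit_line`, and the feature `square` are all made up for this example; `fit_line` uses the closed-form least-squares solution for a single feature, not pyMC.

```python
from statistics import fmean


def fit_line(feature, targets):
    """Closed-form least-squares slope and intercept for a single feature."""
    f_bar, t_bar = fmean(feature), fmean(targets)
    covariance = sum((f - f_bar) * (t - t_bar) for f, t in zip(feature, targets))
    variance = sum((f - f_bar) ** 2 for f in feature)
    slope = covariance / variance
    return slope, t_bar - slope * f_bar


def square(x):
    return x ** 2


# Toy data: curved as a function of x, but exactly linear in square(x).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [3.0 * square(x) + 1.0 for x in xs]

slope, intercept = fit_line([square(x) for x in xs], ys)
```

The model is still linear in its parameters: the non-linearity lives entirely in the feature, and the fit recovers the slope of 3 and intercept of 1 used to generate the toy data.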

In [None]:
all_ones = iris_df["sepal_length"].apply(map_to_1)

with pm.Model() as mean_model_by_hand_two:
    Mean_Sepal_Length = pm.Normal("Intercept", mu=mean_sepal_length, sd=1e6)
    sd = pm.Exponential("sd", lam=0.5 / sd_sepal_length)
    Sepal_Lengths = pm.Normal("Sepal Lengths",
                              mu=Mean_Sepal_Length * all_ones,
                              sd=sd,
                              observed=iris_df["sepal_length"])

mean_model_by_hand_samples = shared_util.sample_from(mean_model_by_hand_two)

In [None]:
pm.plot_posterior(mean_model_by_hand_samples, figsize=(12, 6), text_size=16,
                  ref_val=[mean_sepal_length, sd_sepal_length]);

# Today we'll consider a different way to specify models: formulas.

A formula is a string that describes a model,
indicating which variables are to be used as linear predictors for which other variables.

The variable to be predicted is placed on the far left,
followed by a `~`, the "distributed as" symbol
(a tilde on most keyboards),
and then a description of the variables to be used in the model.

For example, the model

$$
y \sim \text{Normal}(\text{slope}\cdot x + \text{intercept}, \sigma)
$$

for a `DataFrame` with columns `x` and `y` becomes

```
"y ~ x"
```

We can also call Python functions inside a formula,
meaning that the generic least-squares regression model might be written

$$
y \sim \text{Normal}(\text{slope}\cdot \texttt{f}(x) + \text{intercept}, \sigma)
$$

```
"y ~ f(x)"
```

and the predictions would differ depending on our definition of `f`.

# We can produce the simple mean-based model of sepal length with a simple formula.

It uses a special symbol: `1`,
to represent the all-ones `Series`.

```
"sepal_length ~ 1"
```

### We can create (generalized) linear models from formulas in pyMC with `pm.GLM.from_formula`

The `G` for _generalized_ means we can choose different likelihoods than the Normal,
but we won't see that here.

In [None]:
with pm.Model() as mean_glm_model:
    # we define a model context and then call the function,
    #  providing the formula as an argument:
    mean_glm = pm.GLM.from_formula(
        "sepal_length ~ 1",
        data=iris_df,  # the df must be provided so the formula can be used
        family="normal"  # which family is used for the likelihood? (defaults to "normal")
    )  

If we print the resulting object,
we can see which variables are defined.

In [None]:
mean_glm

Sampling and MAP inference proceed exactly as before:

In [None]:
mean_glm_samples = shared_util.sample_from(mean_glm_model)
mean_glm_MAP = pm.find_MAP(model=mean_glm_model)
mean_glm_MAP

As with all models with a Normal likelihood,
we can measure how good our model is by looking
at the mean squared error:

In [None]:
def mse(predictions, observations):
    return ((predictions - observations) ** 2).mean()

We just need a function to compute our predictions
given the input data and the parameters:

In [None]:
def compute_prediction_mean(data, parameters):
    return parameters["Intercept"] * data.apply(map_to_1)

In [None]:
mse(compute_prediction_mean(iris_df["sepal_width"], mean_glm_MAP), iris_df["sepal_length"])

Note: this is the `mse` that goes into the denominator of the variance explained.
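
For reference, here is one way the variance explained could be computed from two MSEs. This is a plain-Python sketch on toy lists; the names `toy_mse` and `variance_explained` and the numbers are illustrative only.

```python
from statistics import fmean


def toy_mse(predictions, observations):
    return fmean((p - o) ** 2 for p, o in zip(predictions, observations))


def variance_explained(predictions, observations):
    # Denominator: MSE of the baseline model that always predicts the mean.
    baseline = [fmean(observations)] * len(observations)
    return 1 - toy_mse(predictions, observations) / toy_mse(baseline, observations)


observations = [1.0, 2.0, 3.0, 4.0]

perfect_score = variance_explained(observations, observations)
mean_only_score = variance_explained([2.5] * 4, observations)
```

A model that predicts every point exactly scores 1, and a model that does no better than the mean scores 0.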

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
pm.plot_posterior_predictive_glm(  # pymc function to visualize posterior of simple GLMs
    mean_glm_samples,  # samples from posterior
    eval=iris_df["sepal_width"],  # `data` in compute_prediction f'n above
    lm=compute_prediction_mean  # function that takes in data and element of sample,
                                #  and returns a prediction
)

sns.scatterplot(x="sepal_width", y="sepal_length", data=iris_df);

# We can also specify categorical models with the `GLM`+Formula interface.

In [None]:
sns.pointplot(y="sepal_length", x="is_setosa", data=iris_df);

# The formula remains simple.

We just need to note, with a `C`, which variables in the model are categorical:

```
"sepal_length ~ C(is_setosa)"
```

The `1`, for `Intercept`,
is included automatically in every formula.

In [None]:
with pm.Model() as categorical_glm_model:
    categorical_glm = pm.GLM.from_formula(
        "sepal_length ~ C(is_setosa)",
        data=iris_df)

In [None]:
categorical_glm

By default, categorical variables are handled by converting them to "Treatment Coding":

`Intercept` corresponds to what we called the "mean of the baseline group" (`is_setosa` being `False`)
and
`C(is_setosa)[T.True]` corresponds to what we called the "effect of the `is_setosa` factor".
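
To make the coding concrete, here is a small sketch on made-up data. For a least-squares fit with a single two-level factor, the fitted parameters are the baseline group's mean and the difference between the group means; the rows and the helper `predict_length` below are invented for illustration.

```python
from statistics import fmean

# Toy data, invented for illustration: (is_setosa, sepal_length) pairs.
rows = [(False, 6.0), (False, 6.5), (False, 7.0), (True, 5.0), (True, 5.5)]

baseline_mean = fmean(length for is_setosa, length in rows if not is_setosa)
setosa_mean = fmean(length for is_setosa, length in rows if is_setosa)

# Treatment coding: one parameter for the baseline group, one for the shift.
toy_parameters = {
    "Intercept": baseline_mean,
    "C(is_setosa)[T.True]": setosa_mean - baseline_mean,
}


def predict_length(is_setosa, parameters):
    # int(True) == 1 and int(False) == 0, so the effect is added only for setosas.
    return parameters["Intercept"] + parameters["C(is_setosa)[T.True]"] * int(is_setosa)
```

With these parameters, the prediction for each group is just that group's mean.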

In [None]:
categorical_glm_samples = shared_util.sample_from(categorical_glm_model)
categorical_MAP = pm.find_MAP(model=categorical_glm_model)

In order to compute predictions in this model,
we need to define the prediction function.

In [None]:
def compute_prediction_categorical(data, parameters):
    return parameters["Intercept"] * data.apply(map_to_1)\
            + parameters["C(is_setosa)[T.True]"] * data  # add this value only when `is_setosa` is True

categorical_MAP_predictions = compute_prediction_categorical(iris_df["is_setosa"], categorical_MAP)

mse(categorical_MAP_predictions, iris_df["sepal_length"])

In [None]:
1 - mse(categorical_MAP_predictions, iris_df["sepal_length"]) / mse(mean_glm_MAP["Intercept"], iris_df["sepal_length"])

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
pm.plot_posterior_predictive_glm(categorical_glm_samples, eval=iris_df["is_setosa"],
                                 lm=compute_prediction_categorical)

sns.stripplot(y="sepal_length", x="is_setosa", data=iris_df);

## Specification is even simpler for a linear regression model.

$$
\text{sepal length} \sim \text{Normal}(\text{slope}\cdot \text{sepal width} + \text{intercept}, \sigma)
$$

becomes

```
"sepal_length ~ sepal_width"
```

It doesn't seem particularly promising,
since there isn't an obvious linear relationship
between the values,
but we can try fitting a regression model.

In [None]:
sns.lmplot(x="sepal_width", y="sepal_length", data=iris_df, height=12);

In [None]:
with pm.Model() as regression_glm_model:
    regression_glm = pm.GLM.from_formula(
        "sepal_length ~ 1 + sepal_width",
        data=iris_df)

In [None]:
regression_glm

In [None]:
regression_glm_samples = shared_util.sample_from(regression_glm_model)

regression_MAP = pm.find_MAP(model=regression_glm_model)

In [None]:
pm.plot_posterior(regression_glm_samples, figsize=(12, 12), text_size=16,
                  ref_val=[mean_sepal_length, 0, sd_sepal_length]);

In [None]:
def compute_prediction_regression(data, parameters):
    return parameters["Intercept"] * data.apply(map_to_1) + parameters["sepal_width"] * data

regression_MAP_predictions = compute_prediction_regression(iris_df["sepal_width"], regression_MAP)

mse(regression_MAP_predictions, iris_df["sepal_length"])

Notice that the MSE is close to the MSE of the baseline, mean-only model.

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
pm.plot_posterior_predictive_glm(regression_glm_samples, eval=iris_df["sepal_width"],
                                 lm=compute_prediction_regression)
ax.scatter(iris_df["sepal_width"], iris_df["sepal_length"]);

# With the right approach, it seems like a linear model could fit this data.

In [None]:
f, ax = plt.subplots(figsize=(12, 12))

sns.scatterplot(x="sepal_width", y="sepal_length", data=iris_df, hue="is_setosa", ax=ax);

We'd like to be able to fit two linear models: one for setosas and one for other irises.

## Formulas make it easy to create complex models.

The `:` symbol is used to create models with interactions in them.

In this case,
an "interaction" means that the slope of the line is different for setosas and non-setosas,
in addition to the intercept being different.

In [None]:
with pm.Model() as combined_glm_model:
    combined_glm = pm.GLM.from_formula("sepal_length ~ 1 + sepal_width + C(is_setosa) + sepal_width:C(is_setosa)",
        data=iris_df)

In [None]:
combined_glm

`sepal_width:C(is_setosa)[T.True]` is the "interaction effect":
the difference in slopes for the two groups.
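
Written out, the combined model describes two lines. A sketch in the notation of the earlier equations, where $w$ stands for sepal width and $\Delta_{\text{intercept}}$ and $\Delta_{\text{slope}}$ are shorthand introduced here for the `C(is_setosa)[T.True]` and `sepal_width:C(is_setosa)[T.True]` parameters:

$$
\mu = \begin{cases}
\text{intercept} + \text{slope}\cdot w & \text{for non-setosas} \\
\left(\text{intercept} + \Delta_{\text{intercept}}\right) + \left(\text{slope} + \Delta_{\text{slope}}\right)\cdot w & \text{for setosas}
\end{cases}
$$

If both $\Delta$ terms were 0, the two lines would coincide and we would be back to the single regression line.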

In [None]:
combined_glm_samples = shared_util.sample_from(combined_glm_model)

combined_glm_posterior_df = shared_util.samples_to_dataframe(combined_glm_samples)

combined_glm_MAP = pm.find_MAP(model=combined_glm_model)

In [None]:
ref_vals = [0, 0, 0, 0, sd_sepal_length]
pm.plot_posterior(combined_glm_samples, figsize=(8, 12), text_size=12,
                 ref_val=ref_vals, rope=(-0.1, 0.1));

On casual inspection,
it seems that our hypothesis was wrong:
our posteriors for the interaction and for the categorical effect
straddle 0 and overlap with the ROPE.

In [None]:
def in_ROPE(value, rope=(-0.1, 0.1)):
    return rope[0] < value < rope[1]

combined_glm_posterior_df["C(is_setosa)[T.True]"].apply(in_ROPE).mean()

In [None]:
combined_glm_posterior_df["sepal_width:C(is_setosa)[T.True]"].apply(in_ROPE).mean()

But what we should really check is whether _both_ are in the ROPE,
and so _both_ can be ignored:

In [None]:
(combined_glm_posterior_df["sepal_width:C(is_setosa)[T.True]"].apply(in_ROPE)
 * combined_glm_posterior_df["C(is_setosa)[T.True]"].apply(in_ROPE)).mean()

The values of parameters in the posterior are almost always _correlated_,
and so we cannot reason about any single parameter without considering the others.
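
A toy example of why the joint check matters. The "posterior samples" below are invented so that the two effects are strongly dependent; `in_rope` mirrors the `in_ROPE` helper above under a different name so nothing in the notebook is shadowed.

```python
# Toy posterior samples (effect_a, effect_b), invented for illustration.
# Whenever one effect is far from zero, so is the other.
toy_samples = [(0.05, 0.05), (0.05, -0.05), (0.5, -0.5), (-0.5, 0.5)]


def in_rope(value, rope=(-0.1, 0.1)):
    return rope[0] < value < rope[1]


n_samples = len(toy_samples)
prob_a = sum(in_rope(a) for a, b in toy_samples) / n_samples
prob_b = sum(in_rope(b) for a, b in toy_samples) / n_samples
prob_both = sum(in_rope(a) and in_rope(b) for a, b in toy_samples) / n_samples

# What we would get by (wrongly) treating the two effects as independent.
naive_product = prob_a * prob_b
```

In this toy case, each effect is in the ROPE half the time and both are in the ROPE half the time, while multiplying the marginals would suggest only a quarter.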

## More complex linear models require us to encode our data properly in order to calculate predictions.

In [None]:
def encode_data_combined(iris_df):
    data = np.zeros(shape=(len(iris_df), 3))
    data[:, 0] = 1
    data[:, 1] = iris_df["sepal_width"].values
    data[:, 2] = np.array(iris_df["is_setosa"].values, dtype=int)
    return data


def compute_prediction_combined(data, posterior_sample):
    data = np.atleast_2d(data)
    all_ones = data[:, 0]
    sepal_width = data[:, 1]
    is_setosa = data[:, 2]
    return posterior_sample["Intercept"] * all_ones\
        + posterior_sample["sepal_width"] * sepal_width\
        + posterior_sample["C(is_setosa)[T.True]"] * is_setosa\
        + posterior_sample["sepal_width:C(is_setosa)[T.True]"] * (is_setosa * sepal_width)

Notice how the prediction is equal to the sum of the parameters times the data values?
This is the signature of a linear model.
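
The same idea in miniature: once the interaction is written as its own column, the prediction is a dot product between a parameter vector and an encoded row. The parameter values and the helpers `encode_row` and `predict_row` are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical parameter values, in the column order:
# all-ones, sepal_width, is_setosa, is_setosa * sepal_width.
toy_parameter_vector = [3.0, 1.0, -0.5, -0.25]


def encode_row(sepal_width, is_setosa):
    indicator = 1.0 if is_setosa else 0.0
    return [1.0, sepal_width, indicator, indicator * sepal_width]


def predict_row(row, parameter_vector):
    # Sum of (parameter * data value): the defining shape of a linear model.
    return sum(p * value for p, value in zip(parameter_vector, row))


setosa_prediction = predict_row(encode_row(2.0, True), toy_parameter_vector)
other_prediction = predict_row(encode_row(2.0, False), toy_parameter_vector)
```

For a sepal width of 2, the setosa row picks up both the categorical and interaction terms, while the non-setosa row uses only the intercept and slope.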

You'll work through data encoding for categorical models with interactions in Homework 05,
where you'll use `DataFrame`s instead of `array`s.

Predictions in hand, we can compute the MSE:

In [None]:
encoded_data = encode_data_combined(iris_df)
combined_MAP_predictions = compute_prediction_combined(encoded_data, combined_glm_MAP)

mse(combined_MAP_predictions, iris_df["sepal_length"])

This is quite a bit lower than the regression or categorical models separately.

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
setosa_selector = iris_df["is_setosa"]
ax.plot(encoded_data[~setosa_selector, 1], combined_MAP_predictions[~setosa_selector], lw=4);
ax.plot(encoded_data[setosa_selector, 1], combined_MAP_predictions[setosa_selector], lw=4);

sns.scatterplot(x="sepal_width", y="sepal_length", data=iris_df, hue="is_setosa", ax=ax);

# The approach of specifying models from formulas has advantages and disadvantages.

## One major advantage of this approach is speed.

We specify our model by writing down a string
and selecting a few hyperparameters,
which can make the process as fast as running a `seaborn` plot.

## One major disadvantage is the loss in explicitness.

A `pm.Model` specification usually writes out every single variable
and the relationships between them directly.

The assumptions are immediately legible to you and to anyone working with your model.

In contrast,
the assumptions of a GLM are hidden inside
some very opaque code in pyMC.

## Another advantage is that it matches frequentist practice.

# The `statsmodels` package for frequentist testing in Python uses formulas.

In [None]:
import statsmodels.formula.api as smf

Creating our simple models is just as straightforward:

In [None]:
mean_MLE_model = smf.ols(  # specifically for models with normal likelihood
    "sepal_length ~ 1",
    data=iris_df)

The equivalent of `.find_MAP` is `.fit`:

In [None]:
mean_MLE_results = mean_MLE_model.fit()

The results are reported in the form of a summary table,
which describes the outcomes and parameters of a number of statistical tests:

In [None]:
mean_MLE_results.summary()

Most folks focus on a few entries in this summary:
- `Prob (F-statistic)` tells you the p-value of an ANOVA applied to this model.
- `P>|t|` tells you the p-value for a t-test applied to that parameter by itself.

Unsurprisingly,
we find that the null hypothesis that the data has mean 0
is not supported by the data.
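
For the intercept-only model, the t-test on `Intercept` is the familiar one-sample t-test of the mean against 0. A sketch of the statistic on made-up numbers (the values in `toy_sample` are illustrative; converting the statistic to a p-value would additionally need the t distribution, e.g. from `scipy.stats`):

```python
import math
from statistics import fmean, stdev

# Toy observations, invented for illustration only.
toy_sample = [4.0, 5.0, 6.0]

# stdev uses the n - 1 denominator, as the t-test requires.
standard_error = stdev(toy_sample) / math.sqrt(len(toy_sample))

# Null hypothesis: the mean is 0.
t_statistic = (fmean(toy_sample) - 0.0) / standard_error
```

When the mean is many standard errors away from 0, as with sepal lengths, the resulting p-value is tiny.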

## If we wish to test a different model, we simply change the formula:

In [None]:
categorical_MLE_model = smf.ols("sepal_length ~ C(is_setosa)", data=iris_df)

categorical_MLE_results = categorical_MLE_model.fit()

categorical_MLE_results.summary()

In [None]:
regression_MLE_model = smf.ols("sepal_length ~ sepal_width", data=iris_df)

regression_MLE_results = regression_MLE_model.fit()

regression_MLE_results.summary()

But beware, because using frequentist techniques can lead you astray:

In [None]:
combined_MLE_model = smf.ols("sepal_length ~ sepal_width*C(is_setosa)", data=iris_df)

combined_MLE_results = combined_MLE_model.fit()

combined_MLE_results.summary()

`"a*b"` is short-hand for `"a + b + a:b"`

Notice how the p-value for the interaction
`sepal_width:C(is_setosa)[T.True]`
and for the categorical effect `C(is_setosa)[T.True]`
are both above the traditional threshold for significance.

And yet, the data very obviously supports the conclusion
that the model that excludes those factors is extremely poor:

In [None]:
f, ax = plt.subplots(figsize=(12, 12))
setosa_selector = iris_df["is_setosa"]
ax.plot(encoded_data[~setosa_selector, 1], combined_MAP_predictions[~setosa_selector], lw=4);
ax.plot(encoded_data[setosa_selector, 1], combined_MAP_predictions[setosa_selector], lw=4);
sns.scatterplot(x="sepal_width", y="sepal_length", data=iris_df, hue="is_setosa", ax=ax);
pm.plot_posterior_predictive_glm(regression_glm_samples, eval=iris_df["sepal_width"],
                                 lm=compute_prediction_regression)
ax.legend();

The black lines represent the posterior predictions from the regression model,
and their lack of agreement with the data,
relative to the combined model, is clear.

## Generalized Linear Models

Generalized linear models allow us to work with data
that's not well described by a Gaussian likelihood.

$$
y \sim \text{Foo}(f(\text{slope}\cdot X + \text{intercept}))
$$

For example, the model we used for the putting success data looked like

$$
\text{successes} \sim \text{Binomial}(N, \texttt{squash}(\text{slope}\cdot \text{distance} + \text{intercept}))
$$
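
As a sketch of what the squashing step does, the code below assumes `squash` is the logistic function, which is one common choice but is an assumption here rather than something defined in this notebook. The slope and intercept values are hypothetical.

```python
import math


def squash(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def success_probability(distance, slope, intercept):
    # The linear part can be any real number; squash turns it into a probability.
    return squash(slope * distance + intercept)


# Hypothetical parameter values: success gets less likely with distance.
toy_slope, toy_intercept = -0.5, 2.0

near_probability = success_probability(1.0, toy_slope, toy_intercept)
mid_probability = success_probability(4.0, toy_slope, toy_intercept)
far_probability = success_probability(20.0, toy_slope, toy_intercept)
```

The linear predictor is 0 at a distance of 4, so the probability there is exactly one half, and the probabilities stay strictly between 0 and 1 however far the linear part strays.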

If we choose a different `family`,
we can obtain posterior estimates for other kinds of GLMs:

In [None]:
pm.GLM.from_formula??

Similarly, we call a different function, `smf.glm`,
to create GLMs in `statsmodels`,
and provide it with a `family` argument that specifies the likelihood:

In [None]:
import statsmodels.api as sm

```python
smf_binomial = smf.glm(?, data=?,
                       family=sm.families.Binomial())
```