<img src="../../shared/img/banner.svg"></img>

# Homework 05 - Linear Models

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
import math
from pathlib import Path

from client.api.notebook import Notebook
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

import shared.src.utils.util as shared_util

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
ok = Notebook("ok/config")

## Learning Objectives



1. Recognize how the linear model framework unifies one-way, multiway, and regression models.
2. Practice converting between "natural" and encoded representations of linear models.
3. Develop intuition for the variance explained and the maximum likelihood parameters in heterogeneous linear models.

In this homework,
we will work through the three types of linear models
we've considered in greatest detail in this class:
linear regression, and one-way and multiway categorical models.

We will see that,
when the input data and parameters are encoded properly,
a single function can be used to compute the predictions
from all three types of models:

$$
\text{prediction} = \text{sum}\left(\text{parameters} * \text{data features}\right)
$$

or, in more mathematical terms:

$$
y = \sum_i m_i \cdot X_i
$$

## Section 1 - Linear Regression Models

### Making Predictions from Linear Regression Models

Typically, the prediction $y$ of a linear model with input $x$ is written
$$
y = m\cdot x + b
$$

Write a function, `make_prediction_linear_regression`,
that takes in the observed value $x$ as its first argument
and the intercept and slope, $b$ and $m$, as its second and third,
and returns the predicted value $y$.

Write a function `make_predictions_linear_regression`
that does the same for a `Series` of `observed_values`.
You should reuse `make_prediction_linear_regression`.
You might use `apply` or a `for` loop or a list comprehension,
depending on your preferred style.

In [None]:
ok.grade("q1_01")

### Alternate Encoding of Linear Regression

The connection between linear regression and other linear models
is more apparent when the data and parameters are written differently.

The manner in which data is represented numerically for use in a model
is known as the _encoding_ or _coding_ of the data.

In this alternate encoding of data for linear regression,
each observed value $x$ is encoded as a pair of numbers:
$\text{data features} = [1, x]$.

The parameters are then written as a pair of numbers in the same format:
$\text{parameters} = [\text{intercept}, \text{slope}]$.

In this style,
instead of the predictions being written
$$
\text{prediction} = \text{intercept} + \text{slope}\cdot x
$$
they are written
$$
\text{prediction} = \text{parameters}[0]\cdot\text{data features}[0] + \text{parameters}[1]\cdot\text{data features}[1]
$$

Note that $\text{data features}[0]$ is always $1$, and so the output is equivalent.

In Python, this might be written

```python
prediction = 0
for parameter, data_feature in zip(parameters, data_features):
    prediction += parameter * data_feature
```

In more standard mathematical notation,
it would be written

$$
y = \sum_i m_i \cdot X_i
$$

where $m$ and $X$ are the parameters and data features,
indexed by $i$,
and
where $\sum$, pronounced "sum",
means "add up all of these values".
It is the mathematical equivalent of the `sum` function in Python.

The predictions are identical,
as you will see below,
but all linear models can be written in this form,
unlike the $m\cdot x + b$ representation.
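
As a quick check of that equivalence, here is a minimal sketch that computes one prediction both ways. The intercept, slope, and observed value are illustrative numbers chosen for this example, not values from the homework data.

```python
# Illustrative values only; not taken from the homework data.
intercept, slope = 0.5, 2.0
x = 3.0

# Natural form: y = m * x + b
natural_prediction = slope * x + intercept

# Encoded form: parameters and data features as matching pairs
parameters = [intercept, slope]
data_features = [1, x]
encoded_prediction = sum(
    parameter * data_feature
    for parameter, data_feature in zip(parameters, data_features)
)

print(natural_prediction, encoded_prediction)  # 6.5 6.5
```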

First, write a function, `encode_data_linear_regression`,
that turns a `Series` of observed values
into a `DataFrame` of observed values in this new encoding.

That is, the input `Series`
```python
0    1
1    2
2    3
```
should result in the output `DataFrame`
```python
   0  1
0  1  1
1  1  2
2  1  3
```

You might wish to split it into two functions,
one that encodes a single datapoint
and one that uses that function to encode an entire `Series` of datapoints
(using `apply`, a `for` loop, or similar).

In [None]:
ok.grade("q1_02")

Now, write a function, `make_predictions_linear_model`,
that takes a `DataFrame` of `encoded_data` as its first argument
and a `Series` of `parameters` as its second argument
and returns a `Series` of predictions for each input.
For a regression model,
the `parameters` will have length 2 (intercept and slope),
but for other models, this `Series`
will have different lengths.
The `encoded_data` will always have the same number of columns
as `parameters` has entries.

Again, splitting this function up into two pieces,
one that works on a single row and
another that calls it on each row in a `DataFrame`,
is a useful strategy.
`pd.DataFrame.apply` really shines here.

In general, breaking complex functions into their
constituent pieces and composing them
is a key tool for
simplifying the process of writing code
and making it extensible, comprehensible, and general.
For example, using `sum` and `*`,
the prediction function for a single point can be written in a single line!
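
One way the row-by-row pattern might look is sketched below. The helper name `predict_row` and the numbers are hypothetical, and this is only one of several reasonable ways to structure the computation.

```python
import pandas as pd

# Hypothetical encoded data: a leading 1, then the observed value
encoded_data = pd.DataFrame([[1, 1.0], [1, 2.0], [1, 3.0]])
parameters = pd.Series([0.5, 2.0])  # [intercept, slope], illustrative


def predict_row(row, parameters):
    """Sum of elementwise products of one encoded row and the parameters."""
    return sum(row * parameters)


# axis=1 passes each row of the DataFrame to predict_row as a Series
predictions = encoded_data.apply(predict_row, axis=1, args=(parameters,))
print(predictions.tolist())  # [2.5, 4.5, 6.5]
```

Note that `row * parameters` aligns on labels, so this sketch assumes the column labels of the encoded data match the index of the parameters.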

In [None]:
ok.grade("q1_03")

Below, we will see that this function,
without any modifications,
can also be used to compute the predictions of categorical models.
It can even be extended to linear models with both categorical and continuous components.

### Quantifying Prediction Error

In linear models with Normal likelihoods,
the primary contribution to the posterior log-probability
of the parameters
is given by the average squared difference of the observed values
and the values predicted by the model with those parameters.

Define a function, `mse_predictions`,
that takes in a `Series` of `predictions` and a `Series` of `observations`
and returns the mean squared error of those predictions.

The MSE expresses how likely the data is for a given set of model parameters,
but it is not normalized,
so it can be hard to evaluate the quality of a set of parameters
just based on MSE.

It can be normalized by dividing by the variance of the observations.
When subtracted from 1, this is $R^2$,
or _fraction of variance explained_.

Write a function, `variance_explained`,
that takes the same arguments as `mse_predictions`
and returns the value of $R^2$.
_Hint_: you might want to reuse `mse_predictions` here.
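
The arithmetic can be sketched on a tiny made-up example. The numbers below are illustrative, and the choice of the population variance (`ddof=0`) is an assumption of this sketch; pandas defaults to `ddof=1`, which gives a slightly different normalization.

```python
import numpy as np
import pandas as pd

# Illustrative values only
observations = pd.Series([1.0, 2.0, 3.0, 4.0])
predictions = pd.Series([1.5, 1.5, 3.5, 3.5])

# Mean squared error: average of the squared differences
mse = np.mean((observations - predictions) ** 2)

# Population variance (ddof=0), assumed here as the normalizer:
# it equals the MSE obtained by always predicting the mean
variance = observations.var(ddof=0)

r_squared = 1 - mse / variance
print(mse, variance, r_squared)
```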

In [None]:
ok.grade("q1_04")

### Minimizing Prediction Error

For any model with a Normal likelihood,
the parameters that maximize the likelihood
are the ones that minimize the mean squared error.

The `regression_df` loaded below contains two columns, `y` and `x`.
Define a `Series` of parameters, an intercept and a slope,
that approximately minimize the mean squared error and
so approximately maximize the likelihood.
Name the resulting variable `regression_parameters`.

Your `regression_parameters` will pass the test if the variance explained
is greater than `0.4`.

It is strongly recommended that you plot your predictions
against the data and use the visualization to guide your choice of parameters,
but it is not necessary in order to receive credit.

You might try any of the following approaches:
- Use `sns.lmplot` to obtain a visualization of the regression line and approximate the parameters from the plot.
- Try a variety of values, making small adjustments and checking for improvement. _Hint: Try finding the best intercept with a slope of 0, then finding the best slope for that choice of intercept. Will the slope be positive or negative? Will the magnitude be bigger than 1 or between 0 and 1?_ Check out [guessthecorrelation.com](http://guessthecorrelation.com/) for a chance to practice guessing the slopes of regression lines in a cute game.
- Notice that the data is standardized with $z$-scoring and use that to determine the maximum likelihood intercept and slope. _Hint: base your answer on the correlation (`scipy.stats.pearsonr`, `np.corrcoef`)._
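
The idea behind the last approach can be checked on synthetic data. The sketch below generates its own data (it does not use `regression_df`), standardizes both variables, and compares the least-squares line from `np.polyfit` with the correlation from `np.corrcoef`.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x_raw = rng.normal(size=200)
y_raw = -0.6 * x_raw + rng.normal(size=200)

# z-score both variables (population standard deviation)
x = (x_raw - x_raw.mean()) / x_raw.std()
y = (y_raw - y_raw.mean()) / y_raw.std()

# np.polyfit returns coefficients from highest degree to lowest
slope, intercept = np.polyfit(x, y, deg=1)
r = np.corrcoef(x, y)[0, 1]

# For z-scored data, the least-squares slope matches r and the intercept is ~0
print(np.isclose(slope, r), np.isclose(intercept, 0.0))
```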

In [None]:
regression_df = pd.read_csv(Path("data") / "regression_data.csv", index_col=0)
regression_df.head()

In [None]:
ok.grade("q1_05")

## Section 2 - One-way Categorical Linear Models

In the natural encoding of a one-way categorical model,
the prediction for a data point is given by the value of the group mean
for that data point.

### Making Predictions from One-way Categorical Models

Define a function, `make_predictions_oneway`,
that takes in a `Series` of `group_idxs`
and a `Series` of `group_means`,
and returns the prediction of the one-way categorical model
with those `group_means` as its parameters.

Again, consider splitting the function into two pieces.

In [None]:
ok.grade("q2_01")

### Linear Model Encoding of One-way Categorical Models

#### Encoding One-way Categorical Data

There are many ways to encode categorical data so that it can be used with the standard linear model.

The method we use here is called _Treatment Coding_,
since it comes from experiments with a control group that sets a baseline
and all other groups come from experimental manipulations, or _treatments_,
being applied to this group.

For simplicity, we will refer to this as **the** linear model encoding,
though there are
[others](https://www.statsmodels.org/dev/contrasts.html#treatment-dummy-coding).

In the linear model encoding of data for categorical models,
each data point is again represented by a `Series` beginning with `1`.

The other entries of the `Series` are all `1` or `0`.
For a one-way model, there is at most one other `1`,
located at the group index for the data point:
```
[1, int(group_idx == 1), int(group_idx == 2), ...]
```

So if there are three groups,
two datapoints with group indexes `0` and `1`
would become
```
[1, 0, 0]
```
and
```
[1, 1, 0]
```
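
Following the `int(group_idx == k)` pattern given above, the encoding of a single group index might be sketched like this. The helper name `treatment_code` is hypothetical.

```python
def treatment_code(group_idx, num_groups):
    """Leading 1, then one 0/1 indicator per non-baseline group."""
    return [1] + [int(group_idx == k) for k in range(1, num_groups)]


# Encode each possible group index for a model with three groups
codes = [treatment_code(group_idx, num_groups=3) for group_idx in [0, 1, 2]]
print(codes)  # [[1, 0, 0], [1, 1, 0], [1, 0, 1]]
```

The baseline group (index `0`) has no indicator of its own; it is represented by the leading `1` alone.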

Write a function, `encode_data_oneway_model`,
that takes a `Series` of `group_idxs` for some observed data
and the number of groups
as its arguments
and returns a `DataFrame` where each row is the linear model encoding
of the corresponding group index.

Again, this function naturally breaks down into two pieces,
one that encodes individual rows
and one that combines the encodings of individual rows into a `DataFrame`.

In [None]:
ok.grade("q2_02")

#### Encoding One-way Categorical Parameters

When the inputs to categorical models are written in
the linear model encoding,
the outputs are computed in the same way as
for the alternate encoding of a linear regression model:

$$
\text{prediction} = \text{parameters}[0]\cdot\text{data features}[0] + \text{parameters}[1]\cdot\text{data features}[1] + \cdots
$$

Again, $\text{data features}[0]$ is always `1`.
Now, instead of $\text{data features}[1]$ being the observed value,
it is either `1`,
if the group index is equal to `1`,
or `0`, if it is not.

For a categorical model, then,
the prediction for the observations in the group at index `0` is
$$
\text{prediction for group 0} = \text{parameters}[0] = \text{group means}[0]
$$
while that for
the observations in the group at index `1` is
$$
\text{prediction for group 1} = \text{parameters}[0] + \text{parameters}[1] = \text{group means}[1]
$$
and so
$$
\text{parameters}[1] = \text{group means}[1] - \text{group means}[0]
$$
which is what we called the "effect of the factor"
when working with categorical models in the mean parameterization.

In general, when we computed effects and interactions in categorical models,
we compared the `group_means` to each other.
When the inputs to categorical models are written in
the linear model encoding,
the parameters are directly in terms of the effects and interactions.
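
The conversion from group means to baseline-plus-effects can be sketched on made-up numbers. The values below are illustrative, and extending the pattern shown for group 1 to every non-baseline group is an assumption of this sketch.

```python
import pandas as pd

group_means = pd.Series([2.0, 5.0, 1.0])  # illustrative values

# parameters[0] is the baseline mean; the rest are differences from it
baseline = group_means.iloc[0]
effects = group_means.iloc[1:] - baseline
parameters = pd.Series([baseline, *effects])
print(parameters.tolist())  # [2.0, 3.0, -1.0]

# Check: the encoding [1, 0, 1] (group index 2) recovers that group's mean
prediction_group_2 = sum(p * f for p, f in zip(parameters, [1, 0, 1]))
print(prediction_group_2)  # 1.0
```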

Define a function, `encode_parameters_oneway_model`,
that takes in a `Series` of `group_means`
and returns a `Series` of `parameters` for a one-way model
in the linear model encoding, as defined above.

In [None]:
ok.grade("q2_03")

### Minimizing Prediction Error

Prediction error in a categorical model is again standardized using the variance explained,
and the maximum likelihood parameters for the model maximize the variance explained.

The `oneway_df` loaded below contains two columns, `y` and `factor1`.
Define a `Series` of parameters, in the linear model encoding,
that approximately minimize the mean squared error and
so approximately maximize the likelihood.
Name the resulting variable `oneway_parameters`.

In this case, the maximum variance explainable
by a one-way model is small, and so
your `oneway_parameters` will pass the test if the variance explained
is at least `0.0` when the predictions are obtained by
passing your parameters to `make_predictions_linear_model`.

Keep the following in mind:
- The `variance_explained` is usually greater than `0`, since it is equal to one minus a ratio that is usually less than `1`.
- To make the `variance_explained` equal to `0`, the numerator and denominator of that ratio should be equal. What is the denominator equal to for a categorical model?
- You may find it easier to first define your answer in terms of the group means and then use `encode_parameters_oneway_model` to convert them to the linear model encoding.
- A `pairplot` might make a helpful visualization.

In [None]:
oneway_df = pd.read_csv(Path("data") / "oneway_data.csv", index_col=0)

In [None]:
ok.grade("q2_04")

## Section 3 - Multiway Categorical Linear Models

### Making Predictions from Multiway Categorical Models

As with a one-way categorical model,
in the natural encoding for a multiway categorical model,
the prediction for a data point is given by the value of the group mean
for that data point.

But in a multiway model,
there are multiple factors, not just one,
and so the group means are typically represented with an `array`-like data structure.

Define a function, `make_predictions_twoway`,
that takes as its first two arguments two `Series` of factor indices,
one for the first factor and one for the second factor,
and an `np.ndarray` of `group_means` as its third argument,
and returns a `Series` of predictions from the two-way categorical model
with those `group_means` as its parameters.

The index for factor 1 should be used to select the row of `group_means`,
while the index for factor 2 should be used to select the column.

So if the `group_means` were the array
```
[[0, 1, 2],
 [3, 4, 5]]
```
then `make_predictions_twoway` would return
```
pd.Series([0, 1, 5])
```
on the inputs
```
pd.Series([0, 0, 1]), pd.Series([0, 1, 2])
```
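
The row-and-column lookup in this example can be reproduced with NumPy integer array indexing, as in the sketch below. This is one possible approach, not the required one; a row-at-a-time version works equally well.

```python
import numpy as np
import pandas as pd

group_means = np.array([[0, 1, 2],
                        [3, 4, 5]])
factor1_idxs = pd.Series([0, 0, 1])  # selects the row
factor2_idxs = pd.Series([0, 1, 2])  # selects the column

# Paired integer arrays pick out one (row, column) entry per data point
predictions = pd.Series(
    group_means[factor1_idxs.to_numpy(), factor2_idxs.to_numpy()]
)
print(predictions.tolist())  # [0, 1, 5]
```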

As with the other prediction functions,
this function naturally decomposes into two pieces.

In [None]:
ok.grade("q3_01")

### Linear Model Encoding of Multiway Categorical Models

#### Encoding Multiway Categorical Data

In the linear model encoding of data for categorical models,
including multiway models,
each data point is still represented by a `Series` beginning with `1`.

The other entries of the `Series` are all `1` or `0`.
As with a one-way model,
the length of the `Series` is equal to the total number
of groups.
In addition to starting with a `1`,
there are `1`s
to indicate whether this data point comes
from a non-zero level of each factor
and if it comes from a given _combination_
of non-zero levels,
as in
```
[1, int(factor1_idx == 1), int(factor1_idx == 2), ...
int(factor2_idx == 1),  int(factor2_idx == 2), ...
int((factor1_idx == 1) & (factor2_idx == 1)), ... 
int((factor1_idx == J) & (factor2_idx == K))]
```
for a model with two factors with total numbers of factor levels `J` and `K`.

We will focus on the simplest case:
a multiway model with two factors,
each of which has two levels.
We will call such a model a _two-by-two_ model.

If there are two levels of each of the two factors,
for a total of four groups,
a datapoint from level 0 of both factors
would be coded as
```
[1, 0, 0, 0]
```
and one from level 1 of both factors
would be coded as
```
[1, 1, 1, 1]
```
while a datapoint from level 1 of factor 1
and level 0 of factor 2
would be coded as
```
[1, 1, 0, 0]
```
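
For a single data point, the two-by-two coding might be sketched as follows. The helper name `encode_twobytwo_point` is hypothetical, and the sketch assumes each factor index is either `0` or `1`.

```python
def encode_twobytwo_point(factor1_idx, factor2_idx):
    """[1, factor 1 indicator, factor 2 indicator, interaction indicator]."""
    return [
        1,
        int(factor1_idx == 1),
        int(factor2_idx == 1),
        int(factor1_idx == 1 and factor2_idx == 1),
    ]


print(encode_twobytwo_point(0, 0))  # [1, 0, 0, 0]
print(encode_twobytwo_point(1, 0))  # [1, 1, 0, 0]
print(encode_twobytwo_point(0, 1))  # [1, 0, 1, 0]
print(encode_twobytwo_point(1, 1))  # [1, 1, 1, 1]
```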

Write a function, `encode_data_twobytwo_model`,
that takes two `Series`,
the `factor1_idxs` and `factor2_idxs`,
for some observed data
from a two-by-two model
and returns a `DataFrame` where each row is the linear model encoding
of the corresponding observed data point.

Consider breaking your function into two pieces, as previously.

In [None]:
ok.grade("q3_02")

#### Encoding Multiway Categorical Parameters

When the inputs to categorical models are written in
the linear model encoding,
the outputs are computed in the same way as
for the alternate encoding of a linear regression model:

$$
\text{prediction} = \text{parameters}[0]\cdot\text{data features}[0] + \text{parameters}[1]\cdot\text{data features}[1] + \cdots
$$

Again, $\text{data features}[0]$ is always `1`.
In categorical models, rather than holding the observed value,
the $\text{data features}$ are each `0` or `1`, indicating which groups the data point belongs to,
as levels and combinations of levels of each factor.

For a two-by-two model, then,
the prediction for the observations in level 0 of both factors,
aka the baseline group, is
$$
\text{prediction for factor levels 0, 0} = \text{parameters}[0] = \text{group means}[0, 0]
$$
while that for
an observation in level 1 of both factors is
$$
\text{prediction for factor levels 1, 1} = \text{parameters}[0] + \text{parameters}[1] + \text{parameters}[2] + \text{parameters}[3] = \text{group means}[1, 1]
$$
and so
$$
\text{parameters}[3] = \text{group means}[1, 1] - \text{parameters}[2] - \text{parameters}[1] - \text{parameters}[0]
$$
which is what we called the "interaction of the factors"
when working with categorical models in the mean parameterization.

In general, when we computed effects and interactions in categorical models,
we compared the `group_means` to each other.
When the inputs to categorical models are written in
the linear model encoding,
the parameters are directly in terms of the effects and interactions.

Define a function, `encode_parameters_twobytwo_model`,
that takes in an `array` of `group_means`
and returns a `Series` of `parameters` for a two-by-two model
in the linear model encoding, as defined above.

When we computed treatment effects and interactions in categorical models,
we compared the `group_means` to each other.

When the inputs to a multiway categorical model are written in
the linear model encoding,
the parameters are directly in terms of the treatment effects and interactions.

As a concrete example,
for a collection of `group_means`
```
[[1, 2],
 [1, 4]]
```
the corresponding `parameters` are
```
[1, 0, 1, 2]
```

Notice that while the `group_means` were a two-dimensional array,
the parameters are one-dimensional
(a `Series`, rather than a `DataFrame`),
just as they were for the linear regression and one-way models.
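
The arithmetic behind this concrete example can be traced step by step. In the sketch below, the two main effects are taken as differences from the baseline mean; this is an assumption that follows from the data encoding (for instance, `[1, 1, 0, 0]` predicts `parameters[0] + parameters[1]`), and the variable names are illustrative.

```python
import numpy as np

group_means = np.array([[1, 2],
                        [1, 4]])

# Baseline: level 0 of both factors
baseline = group_means[0, 0]
# Main effects: change from baseline when one factor moves to level 1
factor1_effect = group_means[1, 0] - baseline
factor2_effect = group_means[0, 1] - baseline
# Interaction: what remains of group_means[1, 1] after the other terms
interaction = group_means[1, 1] - factor2_effect - factor1_effect - baseline

parameters = [int(p) for p in
              (baseline, factor1_effect, factor2_effect, interaction)]
print(parameters)  # [1, 0, 1, 2]
```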

In [None]:
ok.grade("q3_03")

### Minimizing Prediction Error

Prediction error in a multiway categorical model is again standardized using the variance explained,
and the maximum likelihood parameters for the model maximize the variance explained.

The `twobytwo_df` loaded below contains three columns, `y`, `factor1`, and `factor2`.
Define a `Series` of parameters, in the linear model encoding,
that approximately minimize the mean squared error and
so approximately maximize the likelihood.
Name the resulting variable `twobytwo_parameters`.

Your `twobytwo_parameters` will pass the test if the variance explained
is at least `0.3` when the predictions are obtained by
passing your parameters to `make_predictions_linear_model`.

Keep the following in mind:
- A `pairplot` might make a helpful visualization.
- It might be easier to first define your answer in terms of the group means and then use `encode_parameters_twobytwo_model` to convert them to the linear model encoding.

In [None]:
twobytwo_df = pd.read_csv(Path("data") / "twobytwo_data.csv", index_col=0)
twobytwo_df.head()

In [None]:
ok.grade("q3_04")

In [None]:
ok.score()