# Chapter 5 - Exercises

## Set Up

### Packages

In [3]:
import os

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns
from scipy import stats
from scipy.interpolate import BSpline
from sklearn.preprocessing import StandardScaler



### Defaults

In [4]:
# seaborn defaults
sns.set(
    style="whitegrid",
    font_scale=1.2,
    rc={
        "axes.edgecolor": "0",
        "axes.grid.which": "both",
        "axes.labelcolor": "0",
        "axes.spines.right": False,
        "axes.spines.top": False,
        "xtick.bottom": True,
        "ytick.left": True,
    },
)

colors = sns.color_palette()

### Constants

In [5]:
DATA_DIR = "../data"
HOWELL_FILE = "howell.csv"
CHERRY_BLOSSOMS_FILE = "cherry_blossoms.csv"
WAFFLE_DIVORCE_FILE = "waffle_divorce.csv"
MILK_FILE = "milk.csv"

RANDOM_SEED = 42

In [6]:
def load_data(file_name, data_dir=DATA_DIR, **kwargs):
    path = os.path.join(data_dir, file_name)
    return pd.read_csv(path, **kwargs)

## Easy

### 5E1

Which of the linear models below are multiple linear regressions?
1. $\mu_i = \alpha + \beta x_i$
2. $\mu_i = \beta_x x_i + \beta_z z_i$
3. $\mu_i = \alpha + \beta (x_i - z_i)$
4. $\mu_i = \alpha + \beta_x x_i + \beta_z z_i$

---

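2 and 4. Each has two predictors, each with its own slope. Model 1 has a single predictor, and model 3 has a single slope on the derived variable $x_i - z_i$.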

### 5E2

Write down a multiple regression to evaluate the claim: Animal diversity is linearly related to latitude, but only after controlling for plant diversity.
You just need to write down the model definition.

---

Let `A`, `L`, and `P` denote animal diversity, latitude, and plant diversity, respectively. Then the model is

\begin{align}
    A_i & \sim \text{Normal}(\mu_i, \sigma) \\
    \mu_i & = \alpha + \beta_L L_i + \beta_P P_i.
\end{align}

### 5E3

Write down a multiple regression to evaluate the claim: Neither amount of funding nor size of laboratory is by itself a good predictor of time to PhD degree; but together these variables are both positively associated with time to degree.
Write down the model definition and indicate which side of zero each slope parameter should be on.

---

Let `T`, `F`, and `L` denote time to PhD, amount of funding, and size of laboratory, respectively. Then the model is

\begin{align}
    T_i & \sim \text{Normal}(\mu_i, \sigma) \\
    \mu_i & = \alpha + \beta_F F_i + \beta_L L_i.
\end{align}

If together both are positively associated with time to degree then both $\beta_F$ and $\beta_L$ should be positive.
If neither is a good predictor on its own then I expect that they are negatively associated with each other; that is, larger laboratories have less funding per student.
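
The masking argument above can be checked with a small simulation. This is only a sketch under assumed values: the correlation of -0.9 between funding and lab size, the unit slopes, and the helper `ols_slopes` are all hypothetical choices made for illustration, and ordinary least squares stands in for a full Bayesian fit.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Assumed data-generating process (illustrative only):
# funding F and lab size L are negatively correlated (about -0.9),
# and both raise time to degree T with slope 1.
F = rng.normal(size=n)
L = -0.9 * F + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
T = 1.0 * F + 1.0 * L + rng.normal(size=n)


def ols_slopes(y, *predictors):
    """Least-squares slopes of y on the predictors (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]


# Each predictor on its own: the slopes are close to zero.
slope_F_alone = ols_slopes(T, F)[0]
slope_L_alone = ols_slopes(T, L)[0]

# Both predictors together: the slopes are clearly positive.
slope_F_joint, slope_L_joint = ols_slopes(T, F, L)

print(f"alone:    F={slope_F_alone:.2f}, L={slope_L_alone:.2f}")
print(f"together: F={slope_F_joint:.2f}, L={slope_L_joint:.2f}")
```

Under these assumptions each single-predictor slope is small because the effect of the omitted variable cancels most of it, while the joint fit recovers slopes near the assumed value of 1.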

### 5E4

Suppose you have a single categorical predictor with 4 levels (unique values), labeled A, B, C and D.
Let $A_i$ be an indicator variable that is 1 where case $i$ is in category A. Also suppose $B_i$, $C_i$, and $D_i$ for the other categories.
Now which of the following linear models are inferentially equivalent ways to include the categorical variable in a regression?
Models are inferentially equivalent when it’s possible to compute one posterior distribution from the posterior distribution of another model.
1. $\mu_i = \alpha + \beta_A A_i + \beta_B B_i + \beta_D D_i$
2. $\mu_i = \alpha + \beta_A A_i + \beta_B B_i + \beta_C C_i + \beta_D D_i$
3. $\mu_i = \alpha + \beta_B B_i + \beta_C C_i + \beta_D D_i$
4. $\mu_i = \alpha_A A_i + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i$
5. $\mu_i = \alpha_A (1 - B_i - C_i - D_i) + \alpha_B B_i + \alpha_C C_i + \alpha_D D_i$

---

All the models except (2) are inferentially equivalent. Since each case must belong to exactly one of the categories, $A_i + B_i + C_i + D_i = 1$, and so any model that contains exactly four of $A_i$, $B_i$, $C_i$, $D_i$, and a constant term is inferentially equivalent to the others.
It is possible to compute the posterior distribution of any of the models from that of (2), but not the other way round.
To illustrate this, suppose that $\alpha \sim \text{Uniform}(0, 1)$ and $\alpha = \beta + \gamma$.
For any $x \in (0, 1)$, taking $\beta = x \alpha$ and $\gamma = (1 - x) \alpha$ gives $\beta \sim \text{Uniform}(0, x)$ and $\gamma \sim \text{Uniform}(0, 1 - x)$, so the distributions of $\beta$ and $\gamma$ are not uniquely determined by that of $\alpha$.
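
The same point can be seen from the design matrices. The sketch below is a linear-algebra check only and ignores priors. It uses one hypothetical case per category and assumed category means of 1, 2, 3, and 4. Models 1, 3, 4, and 5 each have four linearly independent columns, so their parameters map one-to-one onto the four category means. Model 2 has five columns but only rank 4.

```python
import numpy as np

# One case per category, in the order A, B, C, D.
A, B, C, D = np.eye(4)
ones = np.ones(4)

# Design matrices for the five models, columns in the order written above.
designs = {
    1: np.column_stack([ones, A, B, D]),
    2: np.column_stack([ones, A, B, C, D]),
    3: np.column_stack([ones, B, C, D]),
    4: np.column_stack([A, B, C, D]),
    5: np.column_stack([1 - B - C - D, B, C, D]),
}

ranks = {m: int(np.linalg.matrix_rank(X)) for m, X in designs.items()}
n_params = {m: X.shape[1] for m, X in designs.items()}

# Assumed category means, for illustration only.
mu = np.array([1.0, 2.0, 3.0, 4.0])

# Full-rank models: the parameters are determined uniquely by the means.
params = {m: np.linalg.solve(designs[m], mu) for m in (1, 3, 4, 5)}

# Model 2: two different parameter vectors give the same category means.
p1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
p2 = np.array([1.0, 0.0, 1.0, 2.0, 3.0])
same_means = np.allclose(designs[2] @ p1, designs[2] @ p2)

for m in designs:
    print(f"model {m}: {n_params[m]} parameters, rank {ranks[m]}")
```

Because the mapping for model 2 is many-to-one, the category means can be computed from its parameters, but its parameters cannot be recovered from the category means.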