# Chapter 7 - Exercises

## Set Up

### Packages

In [3]:
import os

import arviz as az
import graphviz as gr
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns
from scipy import stats
from scipy.interpolate import BSpline
from sklearn.preprocessing import StandardScaler



### Defaults

In [4]:
# seaborn defaults
sns.set(
    style="whitegrid",
    font_scale=1.2,
    rc={
        "axes.edgecolor": "0",
        "axes.grid.which": "both",
        "axes.labelcolor": "0",
        "axes.spines.right": False,
        "axes.spines.top": False,
        "xtick.bottom": True,
        "ytick.left": True,
    },
)

colors = sns.color_palette()

### Constants

In [5]:
DATA_DIR = "../data"
HOWELL_FILE = "howell.csv"
CHERRY_BLOSSOMS_FILE = "cherry_blossoms.csv"
WAFFLE_DIVORCE_FILE = "waffle_divorce.csv"
MILK_FILE = "milk.csv"
LDS_FILE = "lds_by_state.csv"

RANDOM_SEED = 42

In [4]:
def load_data(file_name, data_dir=DATA_DIR, **kwargs):
    path = os.path.join(data_dir, file_name)
    return pd.read_csv(path, **kwargs)

## Easy

### 7E1

State the three motivating criteria that define information entropy.
Try to express each in your own words.

---

Information entropy is a function of the probabilities $p=(p_i)$ that satisfies the following criteria.
1. It is continuous in the $p_i$.
2. It increases with the number of possible outcomes. That is, if an outcome with probability $p_i$ is replaced by two outcomes whose probabilities sum to $p_i$, then the entropy should increase (strictly, if both new outcomes have non-zero probability).
3. It is additive, in the sense that if we add a new *independent* random variable then the entropy of the joint distribution is the sum of the entropies of the marginal distributions.
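
Criteria 2 and 3 can be checked numerically. The sketch below is only an illustration: the helper name `entropy` and the probabilities are assumptions chosen for the example, and it uses the standard form $H(p) = -\sum_i p_i \ln p_i$.

```python
import numpy as np


def entropy(prob):
    """Shannon entropy in nats; assumes all probabilities are strictly positive."""
    prob = np.asarray(prob, dtype=float)
    return -np.sum(prob * np.log(prob))


# Two independent variables (illustrative probabilities, not from the book).
p_x = np.array([0.3, 0.7])
p_y = np.array([0.2, 0.25, 0.25, 0.3])

# Under independence the joint distribution is the outer product of the marginals.
p_joint = np.outer(p_x, p_y).ravel()

# Criterion 3: the joint entropy equals the sum of the marginal entropies.
additivity_gap = entropy(p_joint) - (entropy(p_x) + entropy(p_y))

# Criterion 2: splitting the 0.7 outcome into two 0.35 outcomes raises the entropy.
h_before = entropy([0.3, 0.7])
h_after = entropy([0.3, 0.35, 0.35])
```

Here `additivity_gap` should be zero up to floating-point error, and `h_after` should exceed `h_before`.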

### 7E2

Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time.
What is the entropy of the coin?

---

In [18]:
def information_entropy(prob):
    """Shannon entropy in nats; assumes every probability is strictly positive."""
    prob = np.asarray(prob, dtype=float)
    return -np.sum(prob * np.log(prob))

In [19]:
information_entropy([0.3, 0.7])

0.6108643020548935
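
As a check by hand, the same value follows from the definition $H(p) = -\sum_i p_i \ln p_i$ (natural logarithm, so the units are nats):

\begin{align}
    H &= -(0.3 \ln 0.3 + 0.7 \ln 0.7) \approx 0.361 + 0.250 = 0.611.
\end{align}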

### 7E3

Suppose a four-sided die is loaded such that, when tossed onto a table, it shows "1" 20%, "2" 25%, "3" 25%, and "4" 30% of the time.
What is the entropy of the die?

---

In [20]:
information_entropy([0.2, 0.25, 0.25, 0.3])

1.3762266043445461

### 7E4

Suppose another four-sided die is loaded such that it never shows "4".
The other three sides show equally often.
What is the entropy of this die?

---

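We don't need to account for events with probability zero (by convention $0 \ln 0 = 0$, since $p \ln p \to 0$ as $p \to 0$), so the entropy is that of three equally likely outcomes: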

In [21]:
information_entropy([1/3, 1/3, 1/3])

1.0986122886681096
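
If the zero-probability side is passed in explicitly, a naive $p \ln p$ evaluates `0 * log(0)` and returns `nan`. A minimal sketch of one way to handle this is to mask the zeros before taking logs; the helper name `entropy_with_zeros` is hypothetical.

```python
import numpy as np


def entropy_with_zeros(prob):
    """Entropy in nats, using the convention 0 * log(0) = 0."""
    prob = np.asarray(prob, dtype=float)
    nonzero = prob > 0  # zero-probability outcomes contribute nothing
    return -np.sum(prob[nonzero] * np.log(prob[nonzero]))


# The die from 7E4, with the impossible side "4" included explicitly.
die = [1 / 3, 1 / 3, 1 / 3, 0.0]
h_die = entropy_with_zeros(die)
```

The result equals $\ln 3$, the same value as the three-outcome calculation above.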

## Medium

### 7M1

Write down and compare the definitions of AIC and WAIC.
Which of these criteria is most generic?
Which assumptions are required to transform the more general criterion into a less general one?

---

The criteria are defined by

\begin{align}
    \text{AIC} &= D_{\text{train}} + 2p, & \text{WAIC} &= D_{\text{train}} + 2\sum_i \text{Var}_{\Theta}\left(\ln \text{Pr}(y_i \mid \Theta)\right),
\end{align}

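where $D_{\text{train}}$ is the deviance over the training set, $p$ is the number of parameters, $y_i$ are the observations, and $\Theta$ is the set of parameters; the variance in the WAIC penalty is taken over the posterior distribution of $\Theta$.
Strictly, AIC evaluates the deviance at a point estimate, while WAIC uses the log-pointwise-predictive-density, which averages the likelihood over the posterior.
The main difference between the definitions is in the correction term: the WAIC penalty is a more accurate estimate of the over-fitting bias in the training deviance.

The WAIC is the most generic (hence WA = Widely Applicable).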

I don't quite understand the last part of the question, but I think it's asking under what assumptions these definitions agree.
We need to assume:
1. The priors are flat (or overwhelmed by the likelihood).
2. The posterior distribution is approximately multivariate normal.
3. The sample size is much greater than the number of parameters.
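
To see the two criteria roughly agree when these assumptions hold, here is a small sketch on a toy model. Everything in it is an assumption made for illustration: the simulated data, the known unit variance, the flat prior (which makes the posterior exactly normal), and the helper `normal_logpdf`. It uses plain NumPy; for real log-likelihoods a log-sum-exp would be more numerically stable than `np.exp` followed by `np.log`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setup (an assumption, not from the book): y ~ Normal(mu, 1) with a flat
# prior on mu, so the posterior is exactly mu ~ Normal(mean(y), 1 / n).
y = rng.normal(loc=0.0, scale=1.0, size=100)
mu_draws = rng.normal(loc=y.mean(), scale=1 / np.sqrt(len(y)), size=4000)


def normal_logpdf(x, mu):
    """Log density of Normal(mu, 1) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2


# log_lik[s, i] = log Pr(y_i | mu_s) for posterior draw s and observation i
log_lik = normal_logpdf(y[None, :], mu_draws[:, None])

# AIC: deviance at the maximum-likelihood estimate plus 2 * (number of parameters)
aic = -2 * normal_logpdf(y, y.mean()).sum() + 2 * 1

# WAIC: -2 * (lppd - penalty), where the penalty sums the posterior variance
# of the log-likelihood of each observation
lppd = np.log(np.exp(log_lik).mean(axis=0)).sum()
p_waic = log_lik.var(axis=0, ddof=1).sum()
waic = -2 * (lppd - p_waic)
```

With one free parameter, `p_waic` should come out close to 1 and `waic` close to `aic`.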

### 7M2

Explain the difference between model *selection* and model *comparison*.
What information is lost under model selection?

---

Model selection means building multiple models and selecting which to use based on some estimate of out-of-sample predictive accuracy, such as WAIC or PSIS.
Essentially it is using WAIC or PSIS to identify a 'best' model.
Model comparison means using WAIC or PSIS to understand how different variables (and modelling assumptions) influence predictions and predictive accuracy.

Under model selection we lose the context of the models considered, and their relative scores.
We also disregard causal questions when we focus solely on predictive accuracy.

### 7M3

When comparing models with an information criterion, why must all models be fit to exactly the same observations?
What would happen to the information criterion values, if the models were fit to different numbers of observations?
Perform some experiments, if you are not sure.

---

The information criteria are explicitly dependent on the observations.
In particular, they are additive: the value for a group of observations is the sum of the pointwise values.
This means that a model fit to more observations will generally have a larger information criterion value, regardless of how well it predicts, so a model fit to fewer observations would look better than it really is.
This can be partly addressed by taking an average; that is, dividing the information criterion by the number of observations.
In theory this makes the criterion comparable across sufficiently large samples, but in practice particular observations may have a large effect on the criterion, and so it's important that the same observations be included for all models being compared.
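
The question suggests an experiment. The sketch below is one possible version and rests on stated assumptions: simulated data, a normal model with known unit variance, and a flat prior so that posterior draws can be generated directly rather than with PyMC. The helper `toy_waic` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
y_all = rng.normal(loc=0.0, scale=1.0, size=200)  # simulated data, not from the book


def toy_waic(y, n_draws=4000):
    """WAIC for y ~ Normal(mu, 1) with a flat prior on mu (exact posterior)."""
    mu = rng.normal(y.mean(), 1 / np.sqrt(len(y)), size=n_draws)
    # log_lik[s, i] = log Pr(y_i | mu_s)
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu[:, None]) ** 2
    lppd = np.log(np.exp(log_lik).mean(axis=0)).sum()
    penalty = log_lik.var(axis=0, ddof=1).sum()
    return -2 * (lppd - penalty)


# The same model fit to nested subsets of the same data.
waic_by_n = {n: toy_waic(y_all[:n]) for n in (50, 100, 200)}
waic_per_obs = {n: value / n for n, value in waic_by_n.items()}
```

The total WAIC grows roughly in proportion to the number of observations even though the model is identical, while the per-observation values stay on a similar scale.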

### 7M4

What happens to the effective number of parameters, as measured by PSIS or WAIC, as a prior becomes more concentrated?
Why?
Perform some experiments, if you are not sure.

---

The effective number of parameters decreases as the prior becomes more concentrated.
In WAIC, the penalty term is the sum over observations of the variance of the log-likelihood across the posterior distribution.
A more concentrated prior produces a more concentrated posterior, so the log-likelihood of each observation varies less from one posterior sample to the next and the penalty shrinks.
In PSIS, the importance weight for an observation is the reciprocal of its likelihood under each posterior sample; a tighter posterior makes these weights less variable, so the leave-one-out estimate stays closer to the in-sample fit and the implied penalty is smaller.

This also makes intuitive sense; the effective number of parameters quantifies the extent to which the model is free to overfit, and so tighter priors should decrease its value.
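
A quick experiment along these lines, under stated assumptions: simulated data, a normal model with known unit variance, and a conjugate normal prior centred at zero, so the posterior is available in closed form. The helper `p_waic` and the prior standard deviations are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=0.5, scale=1.0, size=20)  # simulated data, not from the book
n = len(y)


def p_waic(prior_sd, n_draws=4000):
    """WAIC penalty for y ~ Normal(mu, 1) with prior mu ~ Normal(0, prior_sd)."""
    # Conjugate update with known unit observation variance.
    post_var = 1 / (1 / prior_sd**2 + n)
    post_mean = post_var * y.sum()
    mu = rng.normal(post_mean, np.sqrt(post_var), size=n_draws)
    # log_lik[s, i] = log Pr(y_i | mu_s)
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu[:, None]) ** 2
    # Sum over observations of the posterior variance of the log-likelihood.
    return log_lik.var(axis=0, ddof=1).sum()


# From a nearly flat prior to a very concentrated one.
penalties = {prior_sd: p_waic(prior_sd) for prior_sd in (10.0, 0.3, 0.03)}
```

The penalty should fall towards zero as `prior_sd` shrinks, because the posterior draws of `mu` barely move.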

### 7M5

Provide an informal explanation of why informative priors reduce overfitting.

---

Informative priors reduce the extent to which individual observations influence the posterior.
For example, if I have an extremely uninformative prior on human height and then measure the heights of a few basketball players, it will lead to a strong belief that 2 metres is a normal height, and so my model will be bad at predicting the heights of the general population.
However if I use an informative prior my model won't be so affected by these extreme values and will do better in general.
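
The basketball example can be made concrete with a conjugate normal model. All numbers below (the three heights, the observation standard deviation, and both priors) are hypothetical values picked for illustration, and `posterior_mean` is a helper written for this sketch.

```python
import numpy as np

# Hypothetical numbers for illustration: heights in cm, known observation sd of 8.
players = np.array([201.0, 198.0, 206.0])
obs_sd = 8.0


def posterior_mean(y, prior_mean, prior_sd, obs_sd):
    """Posterior mean of mu for y ~ Normal(mu, obs_sd) with a Normal prior on mu."""
    prior_precision = 1 / prior_sd**2
    data_precision = len(y) / obs_sd**2
    # Precision-weighted average of the prior mean and the sample mean.
    return (prior_precision * prior_mean + data_precision * y.mean()) / (
        prior_precision + data_precision
    )


vague = posterior_mean(players, prior_mean=170.0, prior_sd=1000.0, obs_sd=obs_sd)
informative = posterior_mean(players, prior_mean=170.0, prior_sd=5.0, obs_sd=obs_sd)
```

With the vague prior the posterior mean sits essentially at the sample mean of about 201.7 cm, while the informative prior pulls it back to about 187 cm, so three unusual observations move the estimate much less.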

### 7M6

Provide an informal explanation of why overly informative priors result in underfitting.

---

Overly informative priors dominate the observed data, meaning that the posterior stays close to the prior.
This means that the model doesn't learn much from the data, including the true (non-stochastic) patterns, and so will generalise less well to unseen data.

## Hard