# Chapter 11 - Exercises

## Set Up

### Packages

In [2]:
import os

import arviz as az
import graphviz as gr
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns
from scipy import stats
from scipy.interpolate import BSpline
from sklearn.preprocessing import StandardScaler



### Defaults

In [3]:
# seaborn defaults
sns.set(
    style="whitegrid",
    font_scale=1.2,
    rc={
        "axes.edgecolor": "0",
        "axes.grid.which": "both",
        "axes.labelcolor": "0",
        "axes.spines.right": False,
        "axes.spines.top": False,
        "xtick.bottom": True,
        "ytick.left": True,
    },
)

colors = sns.color_palette()

### Constants

In [4]:
DATA_DIR = "../data"
HOWELL_FILE = "howell.csv"
CHERRY_BLOSSOMS_FILE = "cherry_blossoms.csv"
WAFFLE_DIVORCE_FILE = "waffle_divorce.csv"
MILK_FILE = "milk.csv"
LDS_FILE = "lds_by_state.csv"
ADMISSIONS_FILE = "ucbadmit.csv"
KLINE_FILE = "kline.csv"

RANDOM_SEED = 42

In [5]:
def load_data(file_name, data_dir=DATA_DIR, **kwargs):
    path = os.path.join(data_dir, file_name)
    return pd.read_csv(path, **kwargs)

## Easy

### 11E1

If an event has probability 0.35, what are the log-odds of this event?

---

In [8]:
p = 0.35
# log-odds (logit) of the probability
log_odds = np.log(p / (1 - p))
print(round(log_odds, 2))

-0.62


### 11E2

If an event has log-odds 3.2, what is the probability of this event?

---

In [13]:
log_odds = 3.2
# inverse logit (logistic) transform back to a probability
p = np.exp(log_odds) / (1 + np.exp(log_odds))
print(round(p, 2))

0.96


### 11E3

Suppose that a coefficient in a logistic regression has value 1.7.
What does this imply about the proportional change in odds of the outcome?

---

It implies that a unit increase in the feature multiplies the odds of the outcome by a factor of $e^{1.7}\approx 5.5$.
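
As a quick numerical check, the sketch below adds 1.7 to the log-odds at a few starting probabilities (chosen arbitrarily for illustration) and confirms that the odds ratio is always $e^{1.7}$, even though the change in probability depends on where you start.

```python
import numpy as np

beta = 1.7  # coefficient from the exercise

# Arbitrary starting probabilities, chosen only for illustration
p_before = np.array([0.05, 0.35, 0.80])

odds_before = p_before / (1 - p_before)
log_odds_after = np.log(odds_before) + beta  # unit increase in the predictor
odds_after = np.exp(log_odds_after)
p_after = odds_after / (1 + odds_after)

odds_ratio = odds_after / odds_before
print(np.round(odds_ratio, 2))  # every entry equals exp(1.7)
print(np.round(p_after - p_before, 2))  # change in probability varies
```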

### 11E4

Why do Poisson regressions sometimes require the use of an offset?
Provide an example.

---

Poisson regressions require an offset when the exposure (the duration, distance or area of observation) varies between data points.

For example, suppose we count the number of flowers in fields of different sizes.
To model the density of flowers per square mile, we add the log of each field's area as an offset, which compensates for the different sizes of the observed fields.
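
A minimal sketch of how the offset enters the model, using made-up field areas and an assumed rate of 20 flowers per square mile (neither comes from the exercise): the log of the area is added to the linear predictor, so the remaining intercept describes the rate per unit area.

```python
import numpy as np

# Hypothetical data: field areas in square miles and an assumed true rate
area = np.array([0.5, 1.0, 2.0, 4.0])
rate_per_sq_mile = 20.0

alpha = np.log(rate_per_sq_mile)  # intercept on the log scale
log_mu = np.log(area) + alpha  # log(area) is the offset
mu = np.exp(log_mu)  # expected count in each field

print(mu)  # expected counts scale with area
print(mu / area)  # rate per square mile is the same in every field
```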

## Medium

### 11M1

As explained in the chapter, binomial data can be organized in aggregated and disaggregated forms, without any impact on inference.
But the likelihood of the data does change when the data are converted between the two formats.
Can you explain why?

---

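The absolute value of the likelihood isn't what matters; what matters is its value relative to the marginal probability of the data.
The aggregated binomial likelihood includes the binomial coefficient, which counts the orderings in which the successes could have occurred, whereas the disaggregated (Bernoulli) likelihood does not.
This factor does not depend on $p$, so the likelihoods differ between the two formats but are equivalent once the normalising factors are taken into account.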
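
The sketch below compares the two log-likelihoods on a small made-up data set (6 successes in 9 trials, an arbitrary choice). It uses `scipy.stats` and `scipy.special.comb`; the gap between the formats should equal the log of the binomial coefficient for every value of $p$, which is why it has no effect on inference.

```python
import numpy as np
from scipy import stats
from scipy.special import comb

# Hypothetical disaggregated data: 9 Bernoulli trials with 6 successes
trials = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1])
n, k = trials.size, trials.sum()

p_grid = np.linspace(0.1, 0.9, 9)

# Disaggregated: sum of Bernoulli log-likelihoods over individual trials
ll_disagg = np.array([stats.bernoulli.logpmf(trials, p).sum() for p in p_grid])

# Aggregated: a single binomial log-likelihood for the total count
ll_agg = stats.binom.logpmf(k, n, p_grid)

# The gap is log C(n, k), which does not depend on p
gap = ll_agg - ll_disagg
print(np.round(gap, 4))
print(round(np.log(comb(n, k)), 4))
```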

### 11M2

If a coefficient in a Poisson regression has value 1.7, what does this imply about the change in the outcome?

---

It implies that a unit increase in the feature multiplies the mean (the rate that events occur) by a factor of $e^{1.7}\approx 5.5$.
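
To see where the factor comes from, write the model as $\log \lambda = \alpha + \beta x$ with $\beta = 1.7$ (the symbols $\alpha$, $\beta$ and $x$ are notation assumed here, not taken from the exercise). Comparing the mean at $x + 1$ with the mean at $x$:

$$
\frac{\lambda(x + 1)}{\lambda(x)} = \frac{e^{\alpha + \beta (x + 1)}}{e^{\alpha + \beta x}} = e^{\beta} = e^{1.7} \approx 5.5
$$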

### 11M3

Explain why the logit link is appropriate for a binomial generalized linear model.

---

The purpose of the link function is to map the parameter space to the range of a linear model, i.e. the whole real line.
In a binomial GLM the parameter $p$ represents probability so the parameter space is (0, 1).
The logit function is appropriate because it maps (0, 1) bijectively onto the real line.
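
As an illustration, the sketch below uses `scipy.special.logit` and `scipy.special.expit` (the inverse logit) on a few arbitrary values: probabilities map to unbounded log-odds, and any real-valued linear predictor maps back to a valid probability.

```python
import numpy as np
from scipy.special import expit, logit

# Arbitrary probabilities, chosen only for illustration
p = np.array([0.001, 0.35, 0.5, 0.96, 0.999])
eta = logit(p)  # log-odds, unbounded
print(np.round(eta, 2))

# Any real-valued linear predictor maps back to a valid probability
linear_predictor = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
p_back = expit(linear_predictor)
print(np.round(p_back, 5))
```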

### 11M4

Explain why the log link is appropriate for a Poisson generalized linear model.

---

As above, we need to map the parameter space to the real line.
The parameter in a Poisson GLM is $\lambda$ with range $\lambda > 0$.
The log link is appropriate because it maps positive real numbers to the real line.
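
A small sketch of the same point, with hypothetical coefficients and predictor values: an identity link can produce negative (invalid) rates, while the inverse of the log link, the exponential, always returns a positive rate.

```python
import numpy as np

# Hypothetical coefficients and predictor values, for illustration only
alpha, beta = 0.5, 1.2
x = np.linspace(-3, 3, 7)

linear_predictor = alpha + beta * x
lam_identity = linear_predictor  # identity link: can go negative
lam_log = np.exp(linear_predictor)  # log link: always positive

print(np.round(lam_identity, 2))
print(np.round(lam_log, 3))
```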

### 11M5

What would it imply to use a logit link for the mean of a Poisson generalized linear model?
Can you think of a real research problem for which this would make sense?

---

The logit link has domain (0, 1), so using it would imply that the mean parameter $\lambda$ must be between 0 and 1.

Suppose that a factory produces cars at a rate of one per day.
Consider the number of cars that are sold per day.
This can be higher than one on a particular day, but overall the rate must be lower than one per day as it is limited by the number manufactured.
It might be suitable to model this with a Poisson GLM with logit link function.
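
A rough simulation of this idea, with hypothetical linear predictor values and an arbitrary seed: the inverse logit keeps the Poisson mean strictly between 0 and 1, while individual daily counts can still exceed one.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(42)

# Hypothetical linear predictor values, for illustration only
eta = np.linspace(-3, 3, 7)
lam = expit(eta)  # inverse logit keeps the Poisson mean in (0, 1)

# Simulate many days of sales at each mean
counts = rng.poisson(lam, size=(10_000, lam.size))

print(np.round(lam, 3))
print(counts.mean(axis=0).round(3))  # sample means are close to lam
print(counts.max(axis=0))  # largest single-day count at each mean
```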