In [None]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Worksheet 5 #
All the libraries that you might reasonably want to use for the calculations in Exercises 4 and 5 are imported in the cell above.

## 1. KL Divergence and Normal Distributions
Let $g$ be the normal $(\mu_1, \sigma_1^2)$ density and $f$ the normal $(\mu_2, \sigma_2^2)$ density.

**a)** Find the Kullback-Leibler divergence $D_{KL}(g \Vert f)$.

**b)** Fix $\mu_1$ and $\sigma_1^2$. Discuss the behavior of $D_{KL}(g \Vert f)$ as $\mu_2$ and $\sigma_2^2$ change. You are welcome to hold one of them fixed and consider the effect of changing the other.
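A closed-form answer can be sanity-checked numerically. The sketch below is one possible check, not a required method: it integrates $g(x)\,[\log g(x) - \log f(x)]$ with `scipy.integrate.quad`. The helper name `kl_numeric` is hypothetical, and the toy case uses two exponential densities (for which the closed form is well known) rather than the normal densities of this exercise.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def kl_numeric(g, f, lower, upper):
    """Numerically integrate g(x) * (log g(x) - log f(x)) over [lower, upper].

    g and f are frozen scipy.stats distributions (hypothetical helper).
    """
    integrand = lambda x: g.pdf(x) * (g.logpdf(x) - f.logpdf(x))
    value, _ = quad(integrand, lower, upper)
    return value

# Toy check: exponential with rate a versus exponential with rate b.
a, b = 2.0, 1.0
g = stats.expon(scale=1 / a)
f = stats.expon(scale=1 / b)

kl_quad = kl_numeric(g, f, 0, np.inf)
kl_closed = np.log(a / b) + b / a - 1   # closed form for two exponentials
```

The same helper can be pointed at two `stats.norm` objects, with limits `-np.inf` and `np.inf`, to check your formula from Part (a) at a few parameter values.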

## 2. KL Divergence and a Misspecified Model
Suppose you think that $X_1, X_2, \ldots, X_n$ are i.i.d. with density $f(x \mid \lambda)$, the exponential density with rate $\lambda$. Suppose you construct the MLE $\hat{\lambda}_n$ under this assumption, as an estimate of your imagined true $\lambda_0$.

But suppose in fact $X_1, X_2, \ldots, X_n$ are i.i.d. with density $g$ which is gamma $(3, 1)$.

**a)** What is your $\hat{\lambda}_n$ as a function of $X_1, X_2, \ldots, X_n$? Use the WLLN and the continuous mapping theorem to pick the right choice below: $\hat{\lambda}_n$ converges in probability to

$\bigcirc$ $\lambda_0$ $~~~~~$ $\bigcirc$ $1/\lambda_0$ $~~~~~$ $\bigcirc$ $3$ $~~~~~$ $\bigcirc$ $1/3$

**b)** Find $D_{KL}(g \Vert f)$ where $f$ has a generic $\lambda$ as the rate, and find the value $\lambda^*$ that minimizes the KL divergence. Compare this with your answer to Part (a).

[You don't have to compute the value of any expectation that doesn't involve $\lambda$, and you can assume that the log of a gamma variable has finite expectation.]

**c)** Under your false assumption, what is your estimated expectation of the false underlying exponential distribution? How does this compare with the expectation of the true gamma $(3, 1)$?

**d)** Perform a simulation study of the standard error of $\hat{\lambda}$, and compare it to the standard error using Fisher information as well as the standard error using the sandwich estimator. We suggest:

- 10,000 repetitions of a sample of size 400 drawn from gamma $(3, 1)$
- For each repetition, compute $\hat{\lambda}$ as though the sample were from exponential $(\lambda)$; the estimated standard error using the Fisher information; and the estimated standard error using the sandwich estimator.
- Draw two histograms, one for the 10,000 standard errors using the Fisher information, and the other for the 10,000 standard errors using the sandwich estimator.
- Find the average and standard deviation of $\hat{\lambda}$ in the 10,000 repetitions.

Comment on what you observe.
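For reference, one common way to write the two estimated standard errors for a scalar parameter is sketched below. The notation is an assumption made here and may differ from lecture: $\ell(\theta; x) = \log f(x \mid \theta)$ is the log-likelihood of one observation under the assumed model, and $\hat{A}_n$ and $\hat{B}_n$ are sample averages evaluated at the MLE.

$$
\hat{A}_n ~ = ~ -\frac{1}{n} \sum_{i=1}^n \ell''(\hat{\theta}; X_i), \qquad
\hat{B}_n ~ = ~ \frac{1}{n} \sum_{i=1}^n \big( \ell'(\hat{\theta}; X_i) \big)^2
$$

$$
\widehat{SE}_{\text{Fisher}} ~ = ~ \frac{1}{\sqrt{n \, I(\hat{\theta})}}, \qquad
\widehat{SE}_{\text{sandwich}} ~ = ~ \sqrt{\frac{\hat{B}_n}{n \, \hat{A}_n^2}}
$$

When the assumed model is correct, $\hat{A}_n$ and $\hat{B}_n$ both estimate $I(\theta)$ and the two standard errors agree asymptotically. Under misspecification they need not agree.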

## 3. Towards the Asymptotic Normality of the Sample Correlation
Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be i.i.d. pairs.

**a)** Let $a, b, c, d$ be constants and assume $b$ and $d$ are non-zero. Identify the shape of the asymptotic distribution of 

$$
\frac{1}{n} \sum_{i=1}^n \big{(} \frac{X_i - a}{b} \big{)}\big{(} \frac{Y_i - c}{d} \big{)}
$$

and justify your choice. You don't have to find the asymptotic mean and variance.

**b)** Refer to Worksheet 1 for notation for the sample mean and sample variance, and define the [sample correlation coefficient](https://inferentialthinking.com/chapters/15/1/Correlation.html#calculating-r)

$$
R_n ~ = ~ \frac{1}{n} \sum_{i=1}^n \big{(} \frac{X_i - \bar{X}_n}{\hat{\sigma}_X} \big{)}\big{(} \frac{Y_i - \bar{Y}_n}{\hat{\sigma}_Y} \big{)}
$$

Discuss why the asymptotic distribution of $R_n$ should have the same shape as the one you identified in Part (a), apart from the mean and variance of course. You don't have the tools to provide a complete argument. Discuss some issues that a complete argument would have to account for.

## 4. Nonparametric Bootstrap and Confidence Intervals
In the nonparametric bootstrap, which is what we use in Data 8 and 100, the data $X_1, X_2, \ldots, X_n$ are assumed to be i.i.d. from some underlying distribution but nothing further is assumed about that distribution. In particular, we don't assume that it comes from any parametric family such as normal or exponential.

In this exercise you will construct bootstrap confidence intervals for an underlying correlation, using the percentile method as well as the basic bootstrap method described in lecture.

The `pandas` table `births` contains data on a random sample of mother-newborn pairs. You can assume that the sample is like a random sample drawn with replacement from a large underlying population. Let $\rho$ be the correlation between the newborns' birthweights and their mothers' heights in the underlying population. You will use the bootstrap to estimate $\rho$.

**Useful code:** For two arrays `x` and `y`, the expression `np.corrcoef(x, y)` evaluates to a $2 \times 2$ correlation matrix that has $1$ on the diagonal and $Corr(x, y)$ as the two off-diagonal elements. So `np.corrcoef(x, y)[0, 1]` can be used to get the correlation which is the $(0, 1)$ element of the matrix.
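As a quick illustration on made-up arrays (the values of `x` and `y` below are invented and unrelated to `births`), the off-diagonal entry of `np.corrcoef` matches the mean of the products of standard units:

```python
import numpy as np

# Toy arrays, invented for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

corr_matrix = np.corrcoef(x, y)   # 2 x 2 matrix with 1 on the diagonal
r = corr_matrix[0, 1]             # the correlation of x and y

# Same quantity from the definition: average product of standard units
x_su = (x - x.mean()) / x.std()
y_su = (y - y.mean()) / y.std()
r_by_hand = np.mean(x_su * y_su)
```

Both `r` and `r_by_hand` equal 0.8 here, up to floating-point rounding.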

**a)** Find $\hat{\rho}$, the correlation between the birth weights and mothers' heights in the sample.

**b)** Now bootstrap the sample $B = 10,000$ times by resampling from the original sample as in Data 8 and 100. Each time, find the correlation of birth weights and mothers' heights in the new sample, and collect all these correlations in an array or other similar form.

**c) Bootstrap Percentile Method:** As in Data 8, draw the empirical histogram of all the simulated values, and find an approximate 95% confidence interval for $\rho$ by using the appropriate percentiles of the empirical distribution of your estimates.

**d) Basic or Empirical or Pivotal Bootstrap Method:** Yes, it does have a lot of names. Subtract $\hat{\rho}$ from each of your estimates in Part (b) and draw a histogram of these deviations. What value do you notice near the center of the distribution? Provide the appropriate percentiles of these deviations and use them along with $\hat{\rho}$ to construct an approximate 95% confidence interval for $\rho$.

**e)** Provide a brief comparison of the intervals in Parts (c) and (d).
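The mechanics of the two intervals can be sketched on a different statistic. The example below is a minimal sketch on synthetic data, using the sample median rather than a correlation; it assumes the basic method is the standard one, in which the percentiles of the deviations are subtracted from the original estimate. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=1.0, size=200)   # synthetic stand-in data
n = len(sample)
theta_hat = np.median(sample)                        # estimate from the original sample

B = 2000
boot = np.empty(B)
for i in range(B):
    # Resample n values from the original sample, with replacement
    resample = rng.choice(sample, size=n, replace=True)
    boot[i] = np.median(resample)

# Percentile interval: percentiles of the bootstrap estimates themselves
pct_lo, pct_hi = np.percentile(boot, [2.5, 97.5])

# Basic interval: percentiles of the deviations, reflected around theta_hat
dev_lo, dev_hi = np.percentile(boot - theta_hat, [2.5, 97.5])
basic_lo, basic_hi = theta_hat - dev_hi, theta_hat - dev_lo
```

For paired data such as `births`, whole rows have to be resampled together so that each birth weight stays matched with its mother's height; `births.sample(n, replace=True)` is one way to do that in `pandas`.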

In [None]:
births = pd.read_csv("births.csv")[["Birth Weight", "Maternal Height"]]
...

## 5. Parametric Bootstrap and Confidence Intervals
The parametric bootstrap method assumes $X_1, X_2, \ldots, X_n$ are i.i.d. from a parametric family with density $f(x \mid \theta)$. So the parametric bootstrap doesn't need to resample from the original sample. Instead, it constructs an estimate $\hat{\theta}$ based on the original sample and then creates new samples by drawing repeatedly from $f(x \mid \hat{\theta})$. 
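To make the contrast with resampling concrete, here is a minimal sketch of the parametric bootstrap for a different family. It assumes a Poisson model with synthetic data, and the names `data`, `mu_hat`, and `boot_estimates` are hypothetical. The key step is that each new sample is drawn from the fitted model, not from the observed values.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.poisson(lam=4.0, size=300)   # synthetic stand-in for the observed sample
n = len(data)
mu_hat = data.mean()                    # MLE of the Poisson mean

B = 2000
boot_estimates = np.empty(B)
for i in range(B):
    # Draw a fresh sample from the fitted model f(x | mu_hat)
    new_sample = rng.poisson(lam=mu_hat, size=n)
    # Re-estimate the parameter exactly as was done on the original data
    boot_estimates[i] = new_sample.mean()
```

The spread of `boot_estimates` then serves as an approximation to the sampling variability of the estimator under the fitted model.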

In this exercise you will construct a parametric bootstrap confidence interval for an exponential rate and compare it with the corresponding interval based on the asymptotic distribution of the MLE.

**Useful code:** To simulate `n` draws from the exponential distribution with rate `lam`, use `stats.expon.rvs(size = n, scale = 1/lam)`. 
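For example, with an arbitrarily chosen rate of 2, the average of a large number of draws should be close to $1/\lambda = 0.5$. The `random_state` argument is used here only to make the sketch reproducible.

```python
import numpy as np
from scipy import stats

lam = 2.0        # rate chosen arbitrarily for illustration
n = 100_000

# scipy parameterizes the exponential by its scale, which is 1 / rate
draws = stats.expon.rvs(size=n, scale=1 / lam, random_state=0)
sample_mean = draws.mean()   # should be close to 1 / lam
```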

The array `expon_sample` contains the results of 400 i.i.d. draws from the exponential distribution with unknown rate $\lambda$. For $n = 400$, these are the observed values of $X_1, X_2, \ldots, X_n$ drawn independently from the exponential $(\lambda)$ distribution.

**a)** Use `expon_sample` to find the observed value of $\hat{\lambda}$, the MLE of $\lambda$. This is the value of some function of your sample. Let's call it $T(X_1, X_2, \ldots, X_n)$.

**b)** Now pretend you don't know any more MLE theory, and construct $B = 10,000$ parametric bootstrap estimates of $\lambda$ as follows.

- Do the following $B$ times and collect the results in an array or other similar form:
    - Generate 400 i.i.d. draws from the exponential distribution with the rate you found in Part (a). Call this new sample $X_1^*, X_2^*, \ldots, X_n^*$.
    - Use the new sample and construct a new estimate of $\lambda$ as you did in Part (a). That is, find the value of $T(X_1^*, X_2^*, \ldots, X_n^*)$.
    
**c)** Draw an empirical histogram of your estimates and check that its shape resembles what you'd expect.

**d)** Construct an approximate 95% empirical bootstrap confidence interval for $\lambda$ based on your $B$ estimates. See Part (d) of Exercise 4.

**e)** Return to the original sample and use MLE theory to construct an approximate 95% confidence interval for $\lambda$. Do not use your bootstrap estimates here. Instead, refer to Exercise 2 of Worksheet 2.

**f)** Provide a brief comparison of the intervals in Parts (d) and (e). 

In [None]:
expon_sample = np.load('expon_sample.npy')
...

## 6. Jeffreys Priors for Location and Scale Families

Recall that the Jeffreys prior for a scalar parameter $\theta$ is $\pi(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta) = \text{Var}_\theta(\ell'(\theta))$ is the Fisher information for a single observation $X$ with log-likelihood $\ell(\theta) = \log f_\theta(X)$. Since $E_\theta[\ell'(\theta)] = 0$, this simplifies to $I(\theta) = E_\theta[(\ell'(\theta))^2]$. Its key property is **reparameterization invariance**: if $\eta = g(\theta)$ is a one-to-one transformation, the Jeffreys prior for $\eta$ is exactly the change-of-variables transformation of the Jeffreys prior for $\theta$.
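The invariance property follows from the chain rule. As a sketch in assumed notation, write $I_\theta$ and $I_\eta$ for the Fisher information in each parameterization and suppose $\theta = g^{-1}(\eta)$ is differentiable. Then

$$
I_\eta(\eta) ~ = ~ I_\theta(\theta) \left( \frac{d\theta}{d\eta} \right)^2
\quad \Longrightarrow \quad
\sqrt{I_\eta(\eta)} ~ = ~ \sqrt{I_\theta(\theta)} \, \left| \frac{d\theta}{d\eta} \right|
$$

and the right-hand side is the change-of-variables formula applied to $\pi(\theta) \propto \sqrt{I_\theta(\theta)}$.

As an example outside the two families studied below, take $X$ to be Bernoulli $(p)$. Then $\ell(p) = X \log p + (1 - X) \log(1 - p)$, so

$$
\ell'(p) ~ = ~ \frac{X - p}{p(1-p)}, \qquad
I(p) ~ = ~ \frac{\text{Var}_p(X)}{p^2 (1-p)^2} ~ = ~ \frac{1}{p(1-p)}, \qquad
\pi(p) ~ \propto ~ p^{-1/2} (1-p)^{-1/2}
$$

which is the beta $(1/2, 1/2)$ density.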

In this problem you will show that the Jeffreys prior has a clean form for two important classes of one-parameter models.

### (a) Location Families

A **location family** is a model of the form $f_\theta(x) = f_0(x - \theta)$ for some fixed density $f_0$ and $\theta \in \mathbb{R}$. Show that the Fisher information $I(\theta)$ does not depend on $\theta$, and conclude that the Jeffreys prior is flat: $\pi(\theta) \propto 1$.

*Hint: Use the substitution $u = x - \theta$ when computing the expectation. It may simplify your calculations to define $h(x) = \log f_0(x)$.*

### (b) Scale Families

A **scale family** is a model of the form $f_\sigma(x) = \frac{1}{\sigma} f_0(x/\sigma)$ for some fixed density $f_0$ and $\sigma > 0$. Show that $I(\sigma) = c/\sigma^2$ for a constant $c > 0$ that depends on $f_0$ but not on $\sigma$, and conclude that the Jeffreys prior is $\pi(\sigma) \propto 1/\sigma$.

*Hint: Proceed similarly to part (a), now using the substitution $u = x/\sigma$.*

### (c) Connecting Location and Scale

Consider a scale family $f_\sigma(x) = \frac{1}{\sigma} f_0(x/\sigma)$ as in part (b), and let $\tau = \log\sigma$. You showed in part (b) that the Jeffreys prior for $\sigma$ is $\pi(\sigma) \propto 1/\sigma$. What is the Jeffreys prior for $\tau$? Give two arguments:

1. Using the change-of-variables formula applied to $\pi(\sigma) \propto 1/\sigma$.
2. Using part (a) directly. *Hint: What kind of parameter is $\tau$ for the distribution of $\log X$?*