# Problem Set 1, Part Two: Due Tuesday, January 28 by 8am Eastern Standard Time

## Name: David Millard

**Show your work on all problems!** Be sure to give credit to any
collaborators or outside sources used in solving the problems. Note
that if using an outside source to do a calculation, you should use it
as a reference for the method, and actually carry out the calculation
yourself; it’s not sufficient to quote the results of a calculation
contained in an outside source.

Fill in your solutions in the notebook below, inserting markdown and/or code cells as needed.  Try to do reasonably well with the typesetting, but don't feel compelled to replicate my formatting exactly.  **You do NOT need to make random variables blue!**

In [1]:
%matplotlib inline

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8.0,5.0)
plt.rcParams['font.size'] = 14

### Bonus for Correct Filename

Your submitted version of the notebook should have a filename `ps01_2_lastname.ipynb` where `lastname` should be replaced by your last name, in all lowercase letters.  You'll get a bonus point here if this was done correctly.

<font color="red">**+1**</font>

### Empirical Confidence Interval Checking

**(a)** In this problem, we’ll be using the Central Limit Theorem to
    construct a confidence interval using the normal distribution, with
    endpoints ${{\overline{x}}}\pm z_{1-\alpha/2} s/\sqrt{n}$. We will
    carry out $N=10^4$ replications of an experiment which involves
    drawing a sample of size $n=50$ to construct a confidence interval
    of confidence level $1-\alpha=90\%$ on the mean $\mu$ of the
    sampling distribution. The fraction of those $N$ confidence
    intervals containing the true mean $\mu$ (which we will arbitrarily
    set to 5 for this experiment) should be about 90% if the procedure
    is correct. We can set things up with a few standard commands

In [3]:
N = 10**4
n = 50
mu = 5.
alpha = 1. - 0.90
zcrit = stats.norm.isf(0.5*alpha)
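
As a quick sanity check on the critical value (a sketch, not part of the assigned solution): `stats.norm.isf(alpha/2)` is the upper-tail quantile $z_{1-\alpha/2}$, so it should agree with `stats.norm.ppf(1 - alpha/2)` and come out near 1.645 for a 90% interval. The names `z_from_isf` and `z_from_ppf` are illustrative only.

```python
from scipy import stats

alpha = 1. - 0.90
# Inverse survival function: the point with probability alpha/2 above it
z_from_isf = stats.norm.isf(0.5 * alpha)
# Inverse CDF: the point with probability 1 - alpha/2 below it
z_from_ppf = stats.norm.ppf(1. - 0.5 * alpha)
print(z_from_isf, z_from_ppf)
```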

If each interval has a probability of 90% of containing the true
    value, what is the total number we expect out of $10,\!000$?

$10000 \cdot 0.9 = 9000$

What’s the standard deviation associated with that expectation?

$\sqrt{10000 \cdot 0.9 \cdot (1-0.9)} = 30$
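
The same two numbers can be read off a binomial distribution, assuming the $N$ intervals each contain $\mu$ independently with probability 0.9. A minimal check (the name `coverage_count` is illustrative):

```python
from scipy import stats

N = 10**4
# Number of intervals containing mu, modeled as Binomial(N, 0.9)
coverage_count = stats.binom(n=N, p=0.90)
expected = coverage_count.mean()   # N * p
spread = coverage_count.std()      # sqrt(N * p * (1 - p))
print(expected, spread)
```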

<font color="red">**3/3**</font>

**(b)** Let’s check this with a normal sampling distribution, which we can
    initialize with

In [4]:
mydist = stats.norm(loc=mu)

We can draw a sample (seeding the random number generator so the values are reproducible) with

In [5]:
np.random.seed(20190904)
x_Ii = mydist.rvs(size=(N,n))

and calculate the summary statistics with

In [6]:
xbar_I = x_Ii.mean(axis=-1)
s_I = x_Ii.std(axis=-1,ddof=1)

The `ddof` stands for “delta degrees of freedom”: the divisor in the
    variance is $n-\mathrm{ddof}$, so `ddof=1` means to use $n-1$ rather
    than $n$ in the normalization of the sample variance. We can define
    the ends of the confidence interval with

In [7]:
CIlo_I = xbar_I - zcrit*s_I/np.sqrt(n)
CIhi_I = xbar_I + zcrit*s_I/np.sqrt(n)
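
To see what `ddof=1` changes in the calculation of `s_I`, here is a small sketch on a hand-checkable sample (the array `sample` is made up for illustration). Its squared deviations from the mean sum to 32, so the two normalizations should give $\sqrt{32/8}$ and $\sqrt{32/7}$.

```python
import numpy as np

sample = np.array([2., 4., 4., 4., 5., 5., 7., 9.])  # made-up data, mean is 5
std_n = sample.std(ddof=0)            # divides the sum of squares by n = 8
std_n_minus_1 = sample.std(ddof=1)    # divides the sum of squares by n - 1 = 7
print(std_n, std_n_minus_1)
```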

To check how many intervals contain the true value of $\mu=5$, we
    can use the construction that `CIlo_I<mu` will return an array of
    `True` and `False` values based on whether the inequality is true
    for each element of `CIlo_I`.

In [8]:
inCI_I = np.logical_and(CIlo_I<mu, mu<CIhi_I)

We can get the total number of such intervals with

In [9]:
print(np.sum(inCI_I))

8954


and the fraction with

In [10]:
print(np.mean(inCI_I))

0.8954
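
A toy illustration of the counting trick used above, with made-up interval endpoints `lo` and `hi`: the comparison produces a boolean array, `np.sum` counts the `True` entries, and `np.mean` gives their fraction.

```python
import numpy as np

mu = 5.
lo = np.array([4.8, 5.1, 4.9, 4.6])    # hypothetical lower endpoints
hi = np.array([5.2, 5.4, 4.95, 5.3])   # hypothetical upper endpoints

# True where lo < mu < hi, elementwise
inside = np.logical_and(lo < mu, mu < hi)
print(inside)
print(np.sum(inside), np.mean(inside))
```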


What fraction of the $N$ confidence intervals contain $\mu$?

$\frac{8954}{10000}$

<font color="red">**=0.8954**</font>

Each of
    the $N$ confidence intervals has a different width; if you display
    `CIhi_I-CIlo_I`, you’ll see the first and last few.

In [11]:
CIhi_I-CIlo_I

array([0.44187625, 0.51064227, 0.44179558, ..., 0.39755524, 0.45816846,
       0.45220293])

What is the
    median among the $N$ widths?

In [12]:
np.median(CIhi_I-CIlo_I)

0.4614319309578452

The median is approximately 0.4614.

<font color="red">**3/3**</font>

**(c)** Repeat the process for a Laplace sampling distribution.

In [13]:
mydist = stats.laplace(loc=mu)

In [14]:
np.random.seed(20190904)
x_Ii = mydist.rvs(size=(N,n))

In [15]:
xbar_I = x_Ii.mean(axis=-1)
s_I = x_Ii.std(axis=-1,ddof=1)

In [16]:
CIlo_I = xbar_I - zcrit*s_I/np.sqrt(n)
CIhi_I = xbar_I + zcrit*s_I/np.sqrt(n)

In [17]:
inCI_I = np.logical_and(CIlo_I<mu, mu<CIhi_I)

In [18]:
print(np.sum(inCI_I))

8939


In [19]:
print(np.mean(inCI_I))

0.8939


In [20]:
np.median(CIhi_I-CIlo_I)

0.6442683918423411

<font color="red">**3/3**</font>

In [21]:
np.std(CIhi_I-CIlo_I)

0.10135858434688655

<font color="red">**The problem didn't ask for the standard deviation of the confidence interval widths.**</font>

**(d)** Repeat the process for a Student-$t$ sampling distribution with $\nu=3$.
    (Hint: use `mydist = stats.t(loc=mu,df=3)`).

In [22]:
mydist = stats.t(loc=mu,df=3)

In [23]:
np.random.seed(20190904)
x_Ii = mydist.rvs(size=(N,n))

In [24]:
xbar_I = x_Ii.mean(axis=-1)
s_I = x_Ii.std(axis=-1,ddof=1)

In [25]:
CIlo_I = xbar_I - zcrit*s_I/np.sqrt(n)
CIhi_I = xbar_I + zcrit*s_I/np.sqrt(n)

In [26]:
inCI_I = np.logical_and(CIlo_I<mu, mu<CIhi_I)

In [27]:
print(np.sum(inCI_I))

8935


In [28]:
print(np.mean(inCI_I))

0.8935


In [29]:
CIhi_I-CIlo_I

array([0.75548547, 0.70936708, 0.74867323, ..., 0.83447598, 0.63135899,
       0.65394324])

In [30]:
np.median(CIhi_I-CIlo_I)

0.7083713928816566

<font color="red">**3/3**</font>

In [31]:
np.std(CIhi_I-CIlo_I)

0.32484894695898636

<font color="red">**The problem didn't ask for the standard deviation of the confidence interval widths, although I see why it's of interest for comparison with the Cauchy case.**</font>
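
For context, the width the interval would have if $s$ were replaced by the true standard deviation $\sigma$, namely $2 z_{1-\alpha/2}\,\sigma/\sqrt{n}$, can be computed directly for the three distributions used so far. This is a sketch for comparison only, not something the problem asks for; the dictionary names are illustrative. The results (roughly 0.47, 0.66, and 0.81) can be set beside the observed medians of 0.4614, 0.6443, and 0.7084.

```python
import numpy as np
from scipy import stats

n = 50
zcrit = stats.norm.isf(0.05)   # 90% two-sided interval

# True standard deviations of the unit-scale distributions used in (b)-(d)
sigmas = {
    'normal': stats.norm().std(),
    'laplace': stats.laplace().std(),
    't3': stats.t(df=3).std(),
}

# Interval width if s were exactly equal to sigma
nominal_width = {name: 2 * zcrit * sigma / np.sqrt(n)
                 for name, sigma in sigmas.items()}
for name, width in nominal_width.items():
    print(name, round(float(width), 4))
```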

**(e)** Repeat the process for a Cauchy sampling distribution.  What goes wrong in this case?

In [32]:
mydist = stats.cauchy(loc=mu)

In [33]:
np.random.seed(20190904)
x_Ii = mydist.rvs(size=(N,n))

In [34]:
xbar_I = x_Ii.mean(axis=-1)
s_I = x_Ii.std(axis=-1,ddof=1)

In [35]:
CIlo_I = xbar_I - zcrit*s_I/np.sqrt(n)
CIhi_I = xbar_I + zcrit*s_I/np.sqrt(n)

In [36]:
inCI_I = np.logical_and(CIlo_I<mu, mu<CIhi_I)

In [37]:
print(np.sum(inCI_I))

9360


In [38]:
print(np.mean(inCI_I))

0.936


In [39]:
CIhi_I-CIlo_I

array([ 1.75815048,  8.77689609, 10.00619675, ...,  2.35468599,
        5.69643106,  3.62300204])

In [40]:
np.median(CIhi_I-CIlo_I)

3.8514825981738254

In [41]:
np.std(CIhi_I-CIlo_I)

712.470055977114

The problem is the size and variability of the CI. We see that the median width of the confidence intervals is 3.85, which is much larger than the medians in the previous parts (about 0.46, 0.64, and 0.71). The standard deviation of the widths, about 712, is larger still. This tells me that the only reason a fraction 0.936 of the intervals contained $\mu$ was that our confidence bounds were very wide to begin with.

This actually makes sense, though, because we know that the Cauchy distribution does not have a mean or variance because of its very thick tails. Therefore our estimate of the mean has to encompass almost all of the number line. In this case we got finite values because we are drawing a finite sample, but if we were to sample an infinite amount, there would be no way to set up this interval.

What went "wrong" is that these aren't good confidence intervals: they cover most of the number line, and if we were to sample infinitely we would get an infinite bound. Our bounds are diverging, not converging.

<font color="red">**The undefined mean and variance is the key property since it means the Central Limit Theorem can't be used to justify the normal confidence interval construction.  It's not impossible to construct a confidence interval for the location parameter of a Cauchy distribution, though.  See for example the signed rank confidence interval in Lesson 04.1.**</font>
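
One way to see why the Central Limit Theorem argument fails here: the mean of $n$ independent standard Cauchy draws is itself standard Cauchy, so averaging does not narrow its distribution. The sketch below estimates the interquartile range of the sample mean for several $n$; it should stay near 2 (the quartiles of a standard Cauchy are $\pm 1$) rather than shrinking like $1/\sqrt{n}$. The seed and variable names are arbitrary, and it uses NumPy's `default_rng` rather than the `np.random.seed` call in the cells above.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed, chosen for illustration
replications = 10**4

iqr_of_mean = {}
for n in (5, 50, 500):
    # Each row is one sample of size n from a standard Cauchy
    draws = rng.standard_cauchy(size=(replications, n))
    means = draws.mean(axis=-1)
    # The IQR is a robust measure of spread that exists even without a variance
    q25, q75 = np.percentile(means, [25, 75])
    iqr_of_mean[n] = q75 - q25
    print(n, iqr_of_mean[n])
```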

<font color="red">**3/3**</font>
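
Putting the four observed counts on the scale from part (a), as a rough comparison rather than a formal test: each count is converted to a number of standard deviations away from the expected 9000. The dictionary names are illustrative; the counts are the ones printed in parts (b) through (e).

```python
import numpy as np

N, p = 10**4, 0.90
expected = N * p                      # 9000
sigma = np.sqrt(N * p * (1 - p))      # 30

# Number of intervals containing mu, as printed in each part
counts = {'normal': 8954, 'laplace': 8939, 't3': 8935, 'cauchy': 9360}
z_scores = {name: (count - expected) / sigma for name, count in counts.items()}
for name, z in z_scores.items():
    print(name, round(float(z), 2))
```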