In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Module 5.1 Part 1: Estimation, Confidence Intervals, and the Bootstrap

In this module, we'll shift our focus from hypothesis testing to estimation. Much like hypothesis testing,
estimation procedures use data to learn about a population's characteristics. Whereas hypothesis tests provide some
insight into how the data were generated, estimation procedures allow us to make informed guesses about the true values of
a population's parameters.

Four videos make up this notebook, for a total run time of 42:54.

1. [Percentiles](#section1) *1 video, total runtime 4:56*
2. [Estimation and Confidence Intervals](#section2) *2 videos, total runtime 16:49*
3. [The Bootstrap](#section3) *1 video, total runtime 21:09*
4. [Check for Understanding](#section4)

Textbook readings:
- [Chapter 13: Estimation](https://www.inferentialthinking.com/chapters/13/Estimation.html)
- [Chapter 13.1: Percentiles](https://www.inferentialthinking.com/chapters/13/1/Percentiles.html)
- [Chapter 13.2: The Bootstrap](https://www.inferentialthinking.com/chapters/13/2/Bootstrap.html)

<a id='section1'></a>
## 1. Percentiles

In the next lecture video, you'll learn about *percentiles*. Pay close attention! We'll be using this tool a lot throughout the rest
of Module 5.1.

In [None]:
YouTubeVideo('MhP23zg47yY')

The code corresponding to the discussion question is provided in the cells below.
Determine the output of these expressions before running them!

In [None]:
s = make_array(1, 7, 3, 9, 5)

In [None]:
percentile(10, s) == 0

In [None]:
percentile(39, s) == percentile(40, s)

In [None]:
percentile(40, s) == percentile(41, s)

In [None]:
percentile(50, s) == 5

<a id='section2'></a>
## 2. Estimation
Next, you'll be introduced to another flavor of statistical inference: estimation. Although we've briefly touched on it in previous
lectures, the next videos provide a detailed overview of these procedures.

As always, we encourage you to follow along with the demos using the data loaded at the end of this section.

In [None]:
YouTubeVideo('JvuPIX8n2Lc')

In [None]:
YouTubeVideo('UH6fXRruu0Q')

In [None]:
sf = Table.read_table('https://www.inferentialthinking.com/data/san_francisco_2015.csv').select(3, 11, 21)
sf.set_format('Total Compensation', NumberFormatter(0))
sf = sf.where('Total Compensation', are.above(10000))
sf.show(5)

In [None]:
# follow along here!
...

<a id='section3'></a>
## 3. The Bootstrap

As Professor DeNero mentioned at the end of the last video, we can use the Bootstrap technique to study the variability
of an estimator, even if we only have access to a single random sample! In the next lecture, you'll learn how this
method is performed, and why it works.

*NB: This recording is particularly long. To stay focused, we recommend that you take a break halfway through.*

In [None]:
YouTubeVideo('N-n27tX-pbU')

In [None]:
# follow along here!
...

<a id='section4'></a>
## 4. Check for Understanding

**A. As Professor DeNero discusses in lecture video 21.1, there are a variety of ways to compute the percentiles of numeric lists and
arrays. Explain the output of the code cell below.**

In [None]:
num_list = [1, 5, 2, 6, 3, 7, 8, 10]
percentile(60, num_list)

<details>
    <summary>Solution</summary>
    If a percentile does not correspond exactly to a position in the sorted list or array, the <code>percentile</code> function rounds
    the position up and returns the element found there. Since <code>num_list</code> has 8 elements, the 60th percentile corresponds to position <code>8 * 0.6 = 4.8</code>, which rounds up to 5.
    The <code>percentile</code> function therefore returns the 5th element of the sorted <code>num_list</code>: <code>6</code>.
</details>
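
The rounding rule described in the solution to question A can be sketched in plain Python. This is a minimal illustration, assuming the "smallest element that is at least as large as p% of the values" definition from the textbook chapter on percentiles. The name `textbook_percentile` is hypothetical and the function is not the `datascience` source code.

```python
import math

def textbook_percentile(p, values):
    """Return the smallest element that is at least as large as p% of `values`.

    Hypothetical helper written for illustration only.
    """
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    ordered = sorted(values)
    # Position in the sorted list, counting from 1.
    # When p% of the length is not a whole number, round up.
    k = math.ceil(p * len(ordered) / 100)
    return ordered[k - 1]

num_list = [1, 5, 2, 6, 3, 7, 8, 10]
# 60% of 8 elements is 4.8, which rounds up to position 5 of the sorted list.
result = textbook_percentile(60, num_list)
print(result)
```

Because the rule always returns an element of the original collection, it never interpolates between two neighboring values.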

**B. In which of the following scenarios can statistical estimation be used?**

<ol>
    <li>You work for the United States Census Bureau and are tasked with finding the average number
    of people living in each household.</li>
    <li>Given a random sample of UC Berkeley students, you want to approximate the proportion of
    all students who speak more than one language.</li>
    <li>You flip a coin 100 times, and record the number of heads. You find that 70% of your
    coin flips resulted in tails. Is this coin fair?</li>
    <li>You're given a random sample of Californians and want to know whether there is evidence to suggest that
    more than 25% of the population are renters.</li>
</ol>

<details>
    <summary>Solution</summary>
    <ol>
        <li>Because a census collects data on the entire population, you can compute the average directly; there is no need to perform any statistical inference!</li>
        <li>Because you'd like to learn about the proportion of the entire student population that speaks more than one
        language based on a random sample, an estimation procedure must be used.</li>
        <li>Based on how you framed this question, a hypothesis testing procedure is best suited for this scenario. As you'll see in the next module though, the same question could be answered via an estimation procedure.</li>
        <li>As with the scenario above, this scenario's formulation suggests that a hypothesis testing procedure is most appropriate.
        However, you could also use an estimation procedure instead.</li>
    </ol>
    The answers to scenarios 3 and 4 might surprise you. The lectures in the second part of this module will provide an explanation.
</details>
<br>

**C. Fill in the blanks: Suppose we wish to learn about the \_\_\_\_\_\_ of a population. We do so by collecting a random sample,
and then using this sample to compute an \_\_\_\_\_\_ of this value. If we were to repeat this process with a second random
sample from the same population, the resulting \_\_\_\_\_\_ would be \_\_\_\_\_\_.**

<details>
    <summary>Solution</summary>
    Suppose we wish to learn about the parameter of a population. We do so by collecting a random sample,
    and then using this sample to compute an estimate of this value. If we were to repeat this process with a second random
    sample from the same population, the resulting estimate would be different.
</details>
<br>
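
The idea behind question C, that a parameter is fixed while an estimate changes from sample to sample, can be seen in a short simulation. This is a sketch only: the population below is synthetic (its mean and spread are made-up values, not course data), and the seed is fixed just so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population" of 100,000 values (made up for this illustration).
population = rng.normal(loc=120, scale=17, size=100_000)

# The parameter: a fixed number that is usually unknown in practice.
parameter = population.mean()

# Two independent random samples of 100 individuals each.
sample_1 = rng.choice(population, size=100, replace=False)
sample_2 = rng.choice(population, size=100, replace=False)

# Each sample yields its own estimate of the parameter.
estimate_1 = sample_1.mean()
estimate_2 = sample_2.mean()

print("parameter: ", round(parameter, 2))
print("estimate 1:", round(estimate_1, 2))
print("estimate 2:", round(estimate_2, 2))
```

The two estimates land near the parameter but do not match each other, which is the variability that a confidence interval is meant to describe.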

**Recall the maternal smokers and baby weight dataset from Module 4.1. Let's use this random sample to estimate the average weight of babies born to non-smoking mothers. The data is loaded in the cell below.**

In [None]:
# load the baby table
baby = Table.read_table('https://www.inferentialthinking.com/data/baby.csv')

# filter out the babies whose mothers smoked, and extract their birthweights
non_smoker = baby.column('Maternal Smoker') == False
birthweight = baby.select('Birth Weight').where(non_smoker)
birthweight

**D. Compute a 90% confidence interval for the average birth weight of babies born to non-smoking mothers.**

In [None]:
...

<details>
    <summary>Solution</summary>
    
    # define an array to contain our resampled means
    means = make_array()

    # define the number of bootstrap samples to generate
    boot_rep = 1000

    # perform the bootstrap procedure
    for i in np.arange(boot_rep):
        resample = birthweight.sample(with_replacement = True)
        means = np.append(means, np.mean(resample.column(0)))

    # plot the histogram of resampled means
    Table().with_column('Resampled Means', means).hist()

    # compute the 90% confidence interval
    lower_ci = percentile(5, means)
    upper_ci = percentile(95, means)

    print("The 90% confidence interval is ")
    [np.round(lower_ci, 3), np.round(upper_ci, 3)]
</details>

<br>

**E. What assumption must we make about the original random sample if we wish to use the Bootstrap method?**

<details>
    <summary>Solution</summary>
    If we are to trust the results of the bootstrap method, we must be confident that our random sample
    is representative of the population. Otherwise, the results of the bootstrap procedure are meaningless.
</details>
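
To see why the assumption in question E matters, the sketch below bootstraps the mean from two samples of a synthetic population: one drawn at random and one deliberately unrepresentative (only the largest values). Everything here is an assumption made for illustration: the population is made up, `bootstrap_mean_ci` is a hypothetical helper, and `np.percentile` interpolates between values, so its results can differ slightly from those of the `percentile` function used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_mean_ci(sample, level=90, reps=1000):
    """Percentile-bootstrap interval for the mean (hypothetical helper)."""
    n = len(sample)
    # Resample with replacement, keeping the original sample size each time.
    means = np.array([
        rng.choice(sample, size=n, replace=True).mean()
        for _ in range(reps)
    ])
    tail = (100 - level) / 2
    return np.percentile(means, tail), np.percentile(means, 100 - tail)

# Synthetic population (made-up mean and spread).
population = rng.normal(loc=120, scale=17, size=10_000)
population_mean = population.mean()

# A simple random sample versus a sample of only the 200 largest values.
random_sample = rng.choice(population, size=200, replace=False)
biased_sample = np.sort(population)[-200:]

random_ci = bootstrap_mean_ci(random_sample)
biased_ci = bootstrap_mean_ci(biased_sample)

print("population mean:          ", round(population_mean, 2))
print("CI from the random sample:", np.round(random_ci, 2))
print("CI from the biased sample:", np.round(biased_ci, 2))
```

The interval built from the biased sample sits entirely above the population mean: the bootstrap can only reproduce the variability of the sample it is given, so it cannot correct for a sample that misrepresents the population.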