In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Module 5.2 Part 2: The Normal Curve

Now that you've learned how to characterize distributions through their measures of center and spread, let's
spend some time learning about the most well-known distribution in data science: the Normal distribution.

4 videos make up this notebook, for a total run time of 51:09.

1. [Standard Units](#section1) *1 video, total runtime 14:08*
2. [Standard Deviations and the Bell Curve](#section2) *1 video, total runtime 8:52*
3. [The Normal Distribution](#section3) *1 video, total runtime 8:41*
4. [The Central Limit Theorem](#section4) *1 video, total runtime 19:28*
5. [Check for Understanding](#section5)

Textbook readings:
- [Chapter 14.3: The SD and the Normal Curve](https://www.inferentialthinking.com/chapters/14/3/SD_and_the_Normal_Curve.html)
- [Chapter 14.4: The Central Limit Theorem](https://www.inferentialthinking.com/chapters/14/4/Central_Limit_Theorem.html)
- [Chapter 14.5: The Variability of the Sample Mean](https://www.inferentialthinking.com/chapters/14/5/Variability_of_the_Sample_Mean.html)
- [Chapter 14.6: Choosing a Sample Size](https://www.inferentialthinking.com/chapters/14/6/Choosing_a_Sample_Size.html)

<a id='section1'></a>
## 1. Standard Units

In the next video, Professor Adhikari will describe what constitutes a bell-shaped curve and why it appears
so commonly in statistical inference procedures. Along the way, she'll introduce you to standard units.

In [None]:
YouTubeVideo('x1FSOCH4KGg')

In [None]:
# load the baby data
births = Table.read_table('https://www.inferentialthinking.com/data/baby.csv')
births

In [None]:
# follow along here:
...

<a id='section2'></a>
## 2. Standard Deviations and the Bell Curve

Next, you'll learn about the relationship between the standard deviation and the bell curve, and why the standard deviation is used as
the primary measure of spread. As always, you're encouraged to follow along in the code cell below the recording.

In [None]:
YouTubeVideo('E3G60vyCimI')

In [None]:
# follow along here:
...

<a id='section3'></a>
## 3. The Normal Distribution

In the next lecture video, we formally define the Normal distribution. It is the most well-known of all bell-shaped distributions
and we will rely on it throughout the rest of the course.
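
For reference, the standard Normal curve (mean 0 and SD 1, with the horizontal axis in standard units) is usually written as below. The symbols $z$ and $\phi$ are notation chosen for this note and may differ from what the video uses.

$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}, \quad -\infty < z < \infty$$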

In [None]:
YouTubeVideo('CCFSLbGSc0k')

<a id='section4'></a>
## 4. The Central Limit Theorem

This next lecture video presents the Central Limit Theorem (CLT). The CLT has very close ties to the Normal distribution, and forms
the basis of many statistical inference procedures. We'll apply the theorem to some flight data, and we encourage you to
follow along in the cells below.

*This is a very long recording. We recommend that you take a break halfway through.*

In [None]:
YouTubeVideo('8uikiqMbQTk')

In [None]:
united = Table.read_table('https://www.inferentialthinking.com/data/united_summer2015.csv')
united

In [None]:
# follow along here:
...

To complete your study of the CLT, read the following sections of the course textbook:
- [Chapter 14.5: The Variability of the Sample Mean](https://www.inferentialthinking.com/chapters/14/5/Variability_of_the_Sample_Mean.html)
- [Chapter 14.6: Choosing a Sample Size](https://www.inferentialthinking.com/chapters/14/6/Choosing_a_Sample_Size.html)

If you'd like to interact with the textbook's examples, click on the "Interact" button in the top left corner of each section. This will generate a
Jupyter notebook for you.

<a id='section5'></a>
## 5. Check for Understanding

**A. Convert the `Midterm` variable of the `midterm_scores` table into standard units. This table contains the midterm scores of
359 students in a lower division CS course. Include this standardized array as a column in `midterm_scores`.**

In [None]:
midterm_scores = Table.read_table('https://www.inferentialthinking.com/data/scores_by_section.csv').drop(0)

def standard_units(x):
    """Convert an array, x, to standard units."""
    ...

# add the standardized scores to the table
...

<details>
    <summary>Solution</summary>
    
    midterm_scores = Table.read_table('https://www.inferentialthinking.com/data/scores_by_section.csv').drop(0)

    def standard_units(x):
        """Convert an array, x, to standard units."""
        return (x - np.mean(x)) / np.std(x)

    midterm_scores = midterm_scores.with_column("standardized scores", standard_units(midterm_scores.column(0)))
    midterm_scores.hist(0)
</details>
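
As a sanity check on any `standard_units` implementation, the converted values should have a mean of 0 and an SD of 1. The sketch below assumes only NumPy and uses a small made-up array of scores (not the course data) to illustrate this.

```python
import numpy as np

def standard_units(x):
    """Convert an array, x, to standard units."""
    return (x - np.mean(x)) / np.std(x)

# hypothetical scores, made up for illustration only
scores = np.array([10, 12, 15, 18, 20, 21, 25])
z = standard_units(scores)

# each value is now "how many SDs above or below the mean"
print(np.round(z, 2))
print("mean of standardized values:", round(float(np.mean(z)), 10))
print("SD of standardized values:", round(float(np.std(z)), 10))
```

The same check can be run on the `standardized scores` column once it has been added to `midterm_scores`.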

<br>

**B. What is the approximate mean of the midterm scores based on the first 20 observations of the updated `midterm_scores` table?
What about the approximate standard deviation?**

In [None]:
midterm_scores.show(20)

<details>
    <summary>Solution</summary>
    The mean midterm score is approximately 16 points. The standard deviation is about 6 or 7. If you found this question challenging,
    consider reviewing lecture 24.1.
</details>

<br>

**C. True or false: When applied to the Normal distribution, Chebyshev's bounds are overly conservative.**

<details>
    <summary>Solution</summary>
    True! The bounds dictated by Chebyshev's inequality are overly conservative when applied to the Normal distribution. For
    example, approximately 95% of data generated by a Normal distribution are found within 2 standard deviations of its mean.
    However, Chebyshev's inequality only guarantees that at least 75% of any distribution's data is found within 2 standard deviations
    of its mean.
</details>
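
To see the gap between the two statements numerically, here is a minimal sketch that compares Chebyshev's lower bound, 1 - 1/k², with the proportion of a Normal distribution within k SDs of the mean. The function names are made up for this illustration, and the Normal proportion is computed with the error function from Python's standard `math` module rather than with any course library.

```python
import math

def chebyshev_lower_bound(k):
    """Chebyshev: at least 1 - 1/k^2 of any distribution lies within k SDs of its mean."""
    return 1 - 1 / k**2

def normal_proportion(k):
    """Proportion of a Normal distribution that lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (2, 3, 4):
    print(f"within {k} SDs: Chebyshev guarantees at least {chebyshev_lower_bound(k):.4f}, "
          f"Normal gives about {normal_proportion(k):.4f}")
```

For every k shown, the Normal proportion sits well above the Chebyshev bound, which is what "overly conservative" means here.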

<br>

**D. Consider the data on NBA players' salaries during the 2015-2016 season. Does it look approximately Normal to you?**

In [None]:
nba_salaries = Table.read_table('https://www.inferentialthinking.com/data/nba_salaries.csv').select(3)
...

<details>
    <summary>Solution</summary>
    <code>nba_salaries.hist(0)</code>
    <br>
    <b>Interpretation:</b> <br>
    No, it doesn't look bell-shaped. The distribution has a long right tail, and it certainly isn't symmetric.
</details>

<br>

**E. Suppose we were to repeatedly take random samples from this distribution, and compute the average of each sample. What would the
distribution of these sample averages look like? Why?**

<details>
    <summary>Solution</summary>
    Regardless of the population's underlying distribution, the probability distribution of the sample average is approximately Normal
    by the Central Limit Theorem, provided the samples are large and drawn at random with replacement.
</details>

<br>

**F. Simulate 1000 random samples of size 200 from this data and plot their averages using a histogram.**

In [None]:
...

<details>
    <summary>Solution</summary>
    
    # define the sample size
    sample_size = 200

    # define the array to store the sample averages
    averages = make_array()

    # repeatedly sample from the table, and compute each sample's mean
    for i in np.arange(1000):
        sampled_salaries = nba_salaries.sample(sample_size)
        sample_average = np.average(sampled_salaries.column(0))
        averages = np.append(averages, sample_average)

    # plot the empirical distribution of sample averages
    Table().with_column('NBA Salary Sample Average', averages).hist(bins = 20, ec = 'w')
</details>

Try varying the number of samples and the size of each sample used to generate this histogram. How do these factors impact the empirical
distribution of the sample averages?
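
One way to explore this question without re-plotting each time is to compare the SD of the sample averages across sample sizes. The sketch below is an illustration under stated assumptions: it uses a made-up right-skewed population generated with NumPy (standing in for the salary data, so it runs without a download), and the helper `sample_means` is hypothetical rather than part of the `datascience` library.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical right-skewed "population", standing in for the salary data
population = rng.exponential(scale=5.0, size=10_000)
pop_sd = np.std(population)

def sample_means(sample_size, repetitions=1000):
    """Means of `repetitions` random samples, each drawn with replacement."""
    samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
    return samples.mean(axis=1)

# compare the observed spread of the sample means with population SD / sqrt(n)
for n in (50, 200, 800):
    means = sample_means(n)
    print(f"n={n}: SD of sample means = {np.std(means):.3f}, "
          f"population SD / sqrt(n) = {pop_sd / np.sqrt(n):.3f}")
```

In this sketch, increasing the sample size narrows the distribution of sample averages, roughly in proportion to the square root of the sample size, as discussed in Chapter 14.5. Increasing the number of repetitions mainly makes the histogram smoother without changing its spread.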