In [None]:
# Run cells by clicking on them and hitting CTRL + ENTER on your keyboard
from IPython.display import YouTubeVideo
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Module 5.2 Part 1: Center and Spread

The concepts of *center* and *spread* have been informally introduced over the past few modules. In this lecture guide, we'll
provide formal definitions and discuss how they can be used in statistical prediction procedures.

Four videos make up this notebook, for a total runtime of 48:28.

1. [Center](#section1) *2 videos, total runtime 16:27*
2. [Spread](#section2) *1 video, total runtime 12:49*
3. [Bounds](#section3) *1 video, total runtime 19:12*
4. [Check for Understanding](#section4)

Textbook readings:
- [Chapter 14: Why the Mean Matters](https://www.inferentialthinking.com/chapters/14/Why_the_Mean_Matters.html)
- [Chapter 14.1: Properties of the Mean](https://www.inferentialthinking.com/chapters/14/1/Properties_of_the_Mean.html)
- [Chapter 14.2: Variability](https://www.inferentialthinking.com/chapters/14/2/Variability.html)

<a id='section1'></a>
## 1. Center

In the first two recordings of this lecture guide, Professor Adhikari introduces and motivates the study of a distribution's *center*.
This concept will be very helpful once we start studying prediction methods.

In [None]:
YouTubeVideo('ts2_qYrpreo')

In [None]:
# follow along here
...

<br>

Several discussion questions are posed in the next video. Pause the recording to work through them. We have
provided a code cell below the video for your convenience, but a pen-and-paper approach will also work!

*Note that there is a minor mistake at the beginning of the video: averages and medians are measures of **center**, not spread.*

In [None]:
YouTubeVideo('xmQ7nt2BmS0')

In [None]:
# answer questions and follow along here
...

<a id='section2'></a>

## 2. Spread

In the next video, you'll learn how to quantify a distribution's spread by computing its deviations, standard deviation, and variance.
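
As a preview, the three quantities can be sketched with plain NumPy. This is a minimal illustration, not the code used in the video: the array `values` is a hypothetical sample chosen only for this example, and the sketch assumes the population definitions (dividing by the number of values), which is also what `np.var` and `np.std` compute by default.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([1, 2, 2, 10])

mean = np.mean(values)                # the center we measure deviations from
deviations = values - mean            # signed distance of each value from the mean
variance = np.mean(deviations ** 2)   # average of the squared deviations
sd = np.sqrt(variance)                # square root of the variance

# NumPy's built-ins use the same population definitions by default (ddof=0)
print(variance, np.var(values))
print(sd, np.std(values))
```

Note that the deviations always sum to zero, which is why they are squared before averaging.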

In [None]:
YouTubeVideo('05L4UQ6V3rE')

In [None]:
# follow along here
...

<a id='section3'></a>

## 3. Bounds

In the upcoming video, you'll learn how to use the mean and the standard deviation of a distribution to determine the
range of the bulk of its values.
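
One way to see this idea in action before watching is to compare the observed proportion of values near the mean with the lower bound $1 - 1/c^2$ given by Chebyshev's inequality. The sketch below is an assumption-laden illustration: the helper names `proportion_within` and `chebyshev_bound` are hypothetical, and the data are simulated skewed values rather than any dataset from the course.

```python
import numpy as np

def proportion_within(values, c):
    """Proportion of values strictly within c SDs of the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    mean, sd = np.mean(values), np.std(values)
    return np.mean(np.abs(values - mean) < c * sd)

def chebyshev_bound(c):
    """Chebyshev's lower bound on the proportion within c SDs of the mean."""
    return 1 - 1 / c ** 2

# Simulated right-skewed data, used here only as an example distribution
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=10_000)

for c in [2, 3, 4]:
    observed = proportion_within(skewed, c)
    print(c, "SDs: observed", round(observed, 4), "bound", round(chebyshev_bound(c), 4))
```

The observed proportion is typically well above the bound, because the bound has to hold for every possible distribution shape.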

In [None]:
YouTubeVideo('PAmOpjJ3u0w')

In [None]:
# load the baby table, and plot the histograms of all numeric variables
baby = Table.read_table('https://www.inferentialthinking.com/data/baby.csv').drop('Maternal Smoker')
baby.hist(overlay = False)

In [None]:
# Check whether Chebyshev's Inequality works for all of these distributions:
...

<a id='section4'></a>
## 4. Check for Understanding

**A. Fill in the blanks: If a distribution is \_\_\_\_\_\_, then its mean and median are identical. However, if a distribution has a
long, right tail, then its mean is \_\_\_\_\_ than its median.**

<details>
    <summary>Solution</summary>
    If a distribution is symmetric, then its mean and median are identical. However, if a distribution has a long,
    right tail, then its mean is larger than its median.
    <br>
    Don't believe us? Try running the following code:
    <br>
    
    # define and visualize the distribution
    distribution = [0, 0.5, 0.75, 1, 1.25, 1.5, 1.5, 1.75, 2, 2.5, 3, 4, 5, 6]
    Table().with_column('Distribution', distribution).hist(bins = np.arange(0, 8))

    # compute the mean and median
    [np.median(distribution), np.mean(distribution)]
    
    When the distribution has a long, left tail, is the distribution's mean larger or smaller than its median?
</details>
<br>

**B. For any given sequence of two or more unique numeric values, `s`, which of the following expressions could evaluate to 0?**

<ol>
    <li><code>np.average(s)</code></li>
    <li><code>np.std(s)</code></li>
    <li><code>np.median(s)</code></li>
    <li><code>np.var(s)</code></li>
</ol>

<details>
    <summary>Solution</summary>
    Only <code>np.average(s)</code> and <code>np.median(s)</code> could evaluate to 0, e.g. when <code>s = make_array(-10, -5, 0, 5, 10)</code>.
    However, <code>np.std(s)</code> and <code>np.var(s)</code> will always be positive in this situation. Recall that the variance is
    the average of the squared deviations, and that the square of any real, non-zero number is always positive. Because <code>s</code>
    contains at least two different values, at least one deviation from the mean is non-zero, so the variance is positive. The standard
    deviation is the non-negative square root of the variance, so it is positive as well.
</details>
<br>
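
The example array from the solution to B can be checked numerically. This is a small sketch that uses `np.array` in place of `make_array` (an assumption made so that the snippet depends only on NumPy).

```python
import numpy as np

# Same values as the example in the solution, built with plain NumPy
s = np.array([-10, -5, 0, 5, 10])

# Measures of center can be zero when the values balance around 0
print(np.average(s), np.median(s))

# Measures of spread are positive because the values are not all equal
print(np.var(s), np.std(s))
```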

**C. Fill in the blank: For any given distribution, at least \_\_\_\_ % of its values are within the range of its
average $\pm$ 3.25 standard deviations.**

<details>
    <summary>Solution</summary>
    Recall that Chebyshev's inequality states that, no matter the shape of a distribution, the proportion of values in the
    range of its average +/- c standard deviations is at least 1 - 1/c^2. Therefore:
    
    For any given distribution, at least 1 - 1/3.25^2 (or about 90.5%) of its values are within the range of its
    average +/- 3.25 standard deviations.
</details>
<br>

**D. Verify that Chebyshev's Inequality applies to the distribution of NBA players' salaries from the 2015-2016 season.**

In [None]:
# load the nba dataset
nba_salaries = Table.read_table('https://www.inferentialthinking.com/data/nba_salaries.csv')
nba_salaries.hist("'15-'16 SALARY")

In [None]:
# answer over here:
...

<details>
    <summary>Solution</summary>
    
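    # pull out the salary column and compute its average and standard deviation
    salaries = nba_salaries.select("'15-'16 SALARY")
    average = np.average(salaries.column(0))
    sd = np.std(salaries.column(0))

    for k in np.arange(1, 5):
        # keep the rows whose salary lies within k SDs of the average
        chosen = salaries.where("'15-'16 SALARY", are.between(average - k * sd, average + k * sd))
        percent = np.round(100 * chosen.num_rows / salaries.num_rows, 2)
        # Chebyshev's lower bound, as a percent, for comparison
        bound = np.round(100 * (1 - 1 / k ** 2), 2)
        print("Average plus or minus", k, "SD(s):", percent, "% (Chebyshev bound:", bound, "%)")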
</details>