<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 25: Center and Spread

Associated Textbook Sections: [14.0, 14.1, 14.2](https://inferentialthinking.com/chapters/14/Why_the_Mean_Matters.html)

## Outline

* [Center and Spread](#Center-and-Spread)
* [Average](#Average)
* [Standard Deviation](#Standard-Deviation)
* [Chebyshev's Inequality](#Chebyshev's-Inequality)
* [Standard Units](#Standard-Units)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

---

## Center and Spread

### Topic Motivation

* How can we quantify natural concepts like "center" and "variability"?
* Why do many of the empirical distributions that we generate come out bell shaped?
* How is sample size related to the accuracy of an estimate?

---

## Average

### The Average (or Mean)

* Example
    * Data: 2, 3, 3, 9    
    * Average = (2+3+3+9)/4 = 4.25
* Need not be a value in the collection
* Need not be an integer even if the data are integers
* Somewhere between min and max, but not necessarily halfway in between
* Same units as the data
* Smoothing operator: collect all the contributions in one big pot, then split evenly

### Demo: Average (Mean)

Explore various ways to calculate and interpret the average.

In [None]:
values = make_array(2, 3, 3, 9)

In [None]:
sum(values)/len(values)

In [None]:
np.average(values)

In [None]:
np.mean(values)

In [None]:
(2 + 3 + 3 + 9)/4

In [None]:
2*(1/4) + 3*(2/4) + 9*(1/4)

Notice how the average reflects a physical balancing point for the data visualized through a histogram.

In [None]:
values_table = Table().with_columns('value', values)
values_table

In [None]:
bins_for_display = np.arange(0.5, 10.6, 1)
values_table.hist(0, bins = bins_for_display)

Let's see what happens when we increase the number of copies of each value (2, 3, and 9) in the table while preserving their proportions.

In [None]:
new_vals = make_array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
                      9, 9, 9, 9, 9, 9, 9, 9, 9, 9)

In [None]:
Table().with_column('value', new_vals).hist(bins = bins_for_display)

In [None]:
np.average(values)

In [None]:
np.average(new_vals)
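
The two averages agree because the mean depends only on the distinct values and their proportions, not on the raw counts. A minimal sketch of that check, using NumPy only; the helper name `proportion_weighted_mean` is introduced here for illustration and is not part of the course materials.

```python
import numpy as np

def proportion_weighted_mean(data):
    """Mean computed as the sum of (distinct value * proportion of that value)."""
    distinct, counts = np.unique(data, return_counts=True)
    proportions = counts / counts.sum()
    return np.sum(distinct * proportions)

small = np.array([2, 3, 3, 9])
# Same proportions (1/4, 2/4, 1/4), but ten times as many values
large = np.repeat(small, 10)

small_mean = proportion_weighted_mean(small)
large_mean = proportion_weighted_mean(large)
print(small_mean, large_mean)
```

Both calls should print the same number, since the proportions of 2, 3, and 9 are unchanged.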

### Reflection

<img src="img/lec25_histograms.png" width = 80%>

* Are the medians of these two distributions the same or different? 
* Are the means the same or different? 
* If you say "different," then say which one is bigger.

### Comparing Mean and Median 

* Mean: Balance point of the histogram
* Median: Half-way point of data; half the area of histogram is on either side of median
* If the distribution is symmetric about a value, then that value is both the average and the median.
* If the histogram is skewed, then the mean is pulled away from the median in the direction of the tail.
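
As a rough illustration of the last two bullets, the sketch below compares the mean and median of a symmetric collection and a right-skewed one. Both arrays are made up for this example and are not taken from the lecture data.

```python
import numpy as np

symmetric = np.array([1, 2, 3, 4, 5])       # symmetric about 3
right_skewed = np.array([1, 2, 3, 4, 50])   # one large value creates a right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name, "mean:", np.mean(data), "median:", np.median(data))
```

In the symmetric case the two measures coincide; in the skewed case the single large value pulls the mean toward the tail while the median stays put.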

### Reflection

<img src="img/lec25_nba.png" width = 50%>

Is the mean or median larger for the distribution of NBA player heights?

---

## Standard Deviation

### Defining Variability

* Plan A: “biggest value - smallest value”
    * Doesn’t tell us much about the shape of the distribution
* Plan B:
    * Measure variability around the mean
    * Need to figure out a way to quantify this
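
To see why Plan A falls short, here is a small sketch with two made-up collections that share the same range and the same mean but are distributed very differently around that mean. The arrays and their names are assumptions chosen for illustration.

```python
import numpy as np

clustered = np.array([1, 5, 5, 5, 9])    # most values sit at the mean
spread_out = np.array([1, 1, 5, 9, 9])   # most values sit far from the mean

for name, data in [("clustered", clustered), ("spread out", spread_out)]:
    value_range = data.max() - data.min()
    deviations = data - np.mean(data)
    print(name, "range:", value_range, "deviations from mean:", deviations)
```

The range is identical for both, yet the deviations from the mean differ, which is the information Plan B tries to capture.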


### Demo: Standard Deviation

Explore the standard deviation.

In [None]:
values = make_array(2, 3, 3, 9)

In [None]:
sd_table = Table().with_columns('Value', values)
sd_table

In [None]:
average_value = np.average(sd_table.column(0))
average_value

In [None]:
deviations = values - average_value
sd_table = sd_table.with_column('Deviation', deviations)
sd_table

In [None]:
sum(deviations)

In [None]:
sd_table = sd_table.with_columns('Squared Deviation', deviations ** 2)
sd_table

Explore the variance of the data and see its relationship with the standard deviation.

In [None]:
variance = np.mean(sd_table.column('Squared Deviation'))
variance

The standard deviation (SD) is the square root of the variance.

In [None]:
sd = ...
sd

In [None]:
np.std(values)

### How Far from the Average?

* Standard deviation (SD) measures roughly how far the data are from their average
* SD = root mean square of deviations from average
* SD has the same units as the data
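
The phrase "root mean square of deviations from average" can be read right to left as a recipe. The sketch below follows that recipe step by step; the function name `sd_from_definition` is hypothetical and used only for this example.

```python
import numpy as np

def sd_from_definition(data):
    """SD as the root of the mean of the squares of the deviations from average."""
    deviations = data - np.mean(data)   # 1. deviations from average
    squared = deviations ** 2           # 2. square them
    variance = np.mean(squared)         # 3. mean of the squares (the variance)
    return np.sqrt(variance)            # 4. square root

values = np.array([2, 3, 3, 9])
sd_value = sd_from_definition(values)
print(sd_value, np.std(values))
```

The two printed numbers should match, because `np.std` by default divides by the number of values, just as step 3 does.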

### Why Use the SD?

There are two main reasons.

1. No matter what the shape of the distribution, the bulk of the data are in the range "average ± a few SDs" (Chebyshev's Inequality)
2. Coming up in a future lecture.

---

## Chebyshev's Inequality

### How Big are Most of the Values?

* No matter what the shape of the distribution, the bulk of the data are in the range "average ± a few SDs"
* Chebyshev’s Inequality
    * No matter what the shape of the distribution, the proportion of values in the range "average ± $z$ SDs" is at least $1 - 1/z^2$
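
For concreteness, the bound $1 - 1/z^2$ can be evaluated for a few values of $z$. This is plain arithmetic on the formula above; the choice of $z$ values is only an example.

```python
bounds = {}
for z in [2, 3, 4, 5]:
    bounds[z] = 1 - 1 / z**2
    print(f"average plus or minus {z} SDs: at least {bounds[z]:.2%} of the data")
```

These are lower bounds: an actual data set can have a larger proportion inside each range, but never a smaller one.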



### Chebyshev's Bounds


No matter what the distribution looks like, the proportion of data values in each range satisfies the following bounds:
<img src="img/lec26_cheby_bounds.png" width=50%>

### Demo: Chebyshev's Bound

Explore a demonstration of Chebyshev's bounds through a data set.

In [None]:
births = Table.read_table('./data/baby.csv')
births.show(3)

In [None]:
births.hist(0)

In [None]:
births.hist(1)

In [None]:
births.hist(2)

In [None]:
births.hist(3)

In [None]:
births.hist(4)

In [None]:
mpw = ...
mean = ...
sd = ...
mean, sd

In [None]:
within_3_SDs = ...

In [None]:
# Proportion within 3 SDs of the mean
...

In [None]:
# Chebyshev's bound: The proportion we calculated above should be at least
1 - 1/(3**2)

In [None]:
births.labels

See if Chebyshev's bounds work for distributions with various shapes.

In [None]:
for feature in births.labels:
    values = births.column(feature)
    mean = np.mean(values)
    sd = np.std(values)
    print()
    print(feature)
    for z in make_array(2, 3, 4, 5):
        chosen = births.where(feature, are.between(mean - z*sd, mean + z*sd))
        proportion = chosen.num_rows / births.num_rows
        percent = round(proportion * 100, 2)
        print('Average plus or minus', z, 'SDs:', percent, '% of the data')
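
The same comparison can be sketched without the `births` table. The version below uses NumPy only and a synthetic right-skewed array as a stand-in for a table column; the data, the seed, and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed data, standing in for a column of a table
data = rng.exponential(scale=10, size=10_000)

mean = np.mean(data)
sd = np.std(data)

results = {}
for z in [2, 3, 4, 5]:
    within = (data >= mean - z * sd) & (data <= mean + z * sd)
    proportion = np.mean(within)    # fraction of values inside the range
    bound = 1 - 1 / z**2            # Chebyshev's lower bound
    results[z] = (proportion, bound)
    print(f"z = {z}: observed {proportion:.4f}, Chebyshev bound {bound:.4f}")
```

Even for a strongly skewed distribution, each observed proportion should be at least as large as the corresponding bound.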

---

## Standard Units

### Standard Units

* How many SDs above average?
* `z = (value - average)/SD`
    * Negative z: value below average
    * Positive z: value above average
    * z = 0: value equal to average
* When values are in standard units: average = 0, SD = 1
* Gives us a way to compare/understand data no matter what the original units
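
A minimal sketch of the conversion, applying the formula `z = (value - average)/SD` to a whole array at once. The function name `to_standard_units` and the sample ages are hypothetical and are not taken from the course data.

```python
import numpy as np

def to_standard_units(x):
    """Return the array x converted to standard units: (value - average) / SD."""
    return (x - np.mean(x)) / np.std(x)

# Hypothetical ages in years, made up for illustration
ages_example = np.array([22, 27, 31, 35, 40])
z_scores = to_standard_units(ages_example)

print(z_scores)
print(np.mean(z_scores), np.std(z_scores))
```

Up to rounding error, the converted values should have average 0 and SD 1, with negative entries for ages below the average and positive entries for ages above it.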


### Demo: Standard Units

Create a function to convert a measurement to standard units and apply it to the previous data set.

In [None]:
def standard_units(x):
    """Convert array x to standard units."""
    return ...

In [None]:
ages = births.column('Maternal Age')

In [None]:
ages_standard_units = ...

In [None]:
...

In [None]:
both = Table().with_columns(
    'Age in Years', ages,
    'Age in Standard Units', ages_standard_units
)
both

In [None]:
np.mean(ages), np.std(ages)

Compare the distribution of values and standardized values.

In [None]:
both.hist('Age in Years', bins = np.arange(15, 46, 2))

In [None]:
both.hist('Age in Standard Units', bins = np.arange(-2.2, 3.4, 0.35))
plots.xlim(-2, 3.1);

### The SD and the Histogram

* Usually, it's not easy to estimate the SD by looking at a histogram.
* But if the histogram has a bell shape, then you can.


### The SD and Bell-Shaped Curves

If a histogram is bell-shaped, then
* the average is at the center
* the SD is the distance between the average and the points of inflection on either side
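
The second bullet can be checked numerically on an idealized bell curve. The sketch below assumes a normal curve with center `mu` and SD `sigma` (both values chosen arbitrarily here) and locates where the curve switches between bending downward and bending upward.

```python
import numpy as np

mu, sigma = 10.0, 2.0   # assumed center and SD of the bell curve

x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 8001)
density = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Second differences have the same sign as the second derivative:
# negative where the curve bends downward, positive where it bends upward.
second_diff = np.diff(density, 2)
interior_x = x[1:-1]
concave_x = interior_x[second_diff < 0]

# The edges of the downward-bending region are the points of inflection
left_inflection = concave_x.min()
right_inflection = concave_x.max()
print(left_inflection, right_inflection)
```

Up to the grid spacing, the two printed values should sit at `mu - sigma` and `mu + sigma`, one SD on either side of the average.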


### Demo: The SD and Bell-Shaped Curves

Notice how the curvature of the histogram changes at about $\pm 1$ SD from the mean.

In [None]:
births.hist('Maternal Height', bins = np.arange(56.5, 72.6, 1))

In [None]:
heights = births.column('Maternal Height')
np.mean(heights), np.std(heights)

In [None]:
np.mean(heights) + np.std(heights), np.mean(heights) - np.std(heights)

### Point of Inflection

<img src="img/lec26_inflection_points.png" width =50%>

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>