In [None]:
import matplotlib
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
import numpy as np
plots.style.use('fivethirtyeight')

## The SD and Bell-Shaped Curves

In [None]:
births = Table.read_table('baby.csv')

In [None]:
# As we saw previously, 'Maternal Height' has a bell-shaped distribution
# Can we spot the average and the SD visually?
births.hist('Maternal Height', bins = np.arange(56.5, 72.6, 1))

In [None]:
# See how we did...
heights = births.column('Maternal Height')
np.mean(heights), np.std(heights)

In [None]:
# The points of inflection lie one SD below and one SD above the mean
np.mean(heights) - np.std(heights), np.mean(heights) + np.std(heights)
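
To see why the mean minus and plus one SD mark the points of inflection, here is a minimal sketch that finds them numerically on an idealized normal curve. The values of `mu` and `sigma` are hypothetical, chosen only for illustration; they are not computed from `baby.csv`.

```python
import numpy as np

# Hypothetical mean and SD for illustration (not taken from baby.csv)
mu, sigma = 64.0, 2.5

# Normal density evaluated on a fine grid spanning 4 SDs on each side
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 2000)
density = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Second differences approximate the curvature at x[1:-1]:
# positive where the curve bends upward, negative where it bends downward
second_diff = np.diff(density, n=2)

# A sign change between neighbouring entries brackets a point of inflection
change = np.where(np.diff(np.sign(second_diff)) != 0)[0]
inflection_points = (x[change + 1] + x[change + 2]) / 2

print('Numerical inflection points:', inflection_points)
print('mean - SD, mean + SD:       ', mu - sigma, mu + sigma)
```

The curve bends downward between the two points and upward outside them, which is what makes the SD visible on a bell-shaped histogram.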

## Central Limit Theorem ##

In [None]:
# Recall the 2015 domestic United Airlines flights out of San Francisco
united = Table.read_table('united.csv')
united_bins = np.arange(-20, 300, 10)
united

In [None]:
united.hist('Delay', bins=united_bins)

In [None]:
# What's a typical delay?  Consider the mean and standard deviation...
# Notice we can't easily read these values from the histogram because it is so skewed
delays = united.column('Delay')
delay_mean = np.mean(delays)
delay_sd = np.std(delays)
delay_mean, delay_sd

In [None]:
# But we know for sure that the mean is more than the median here (why?)
percentile(50, delays)
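
As a small illustration of how a long right tail affects the two measures, the sketch below uses made-up, right-skewed data rather than the `united` table. The array `skewed` and the choice of an exponential distribution are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed "delays": most values are small, a few are very large
skewed = rng.exponential(scale=15, size=10_000)

skewed_mean = np.mean(skewed)
skewed_median = np.median(skewed)

# The few very large values pull the mean to the right; the median barely moves
print('Mean:  ', skewed_mean)
print('Median:', skewed_median)
```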

### Do Some Sampling
Now imagine we don't have the population data, so we take a random sample of flights and determine the mean for that sample (to estimate the population mean). We can simulate that here by sampling from the `united` table.

In [None]:
def one_sample_mean(sample_size):
    """
    Takes a random sample of a given size from the population of flights
    (drawn with replacement, the default for Table.sample)
    and returns the sample mean of the 'Delay' column.
    """
    sampled_flights = united.sample(sample_size)
    return np.mean(sampled_flights.column('Delay'))

In [None]:
# Run several times and notice random variation
one_sample_mean(100)

Since we have all the data, we can generate 10,000 sample means. 

In [None]:
# What is the empirical distribution of all the sample means for a given sample size?
def ten_thousand_sample_means(sample_size):
    means = make_array()
    for i in np.arange(10000):
        mean = one_sample_mean(sample_size)
        means = np.append(means, mean)
    return means

In [None]:
sample_means_100 = ten_thousand_sample_means(100)

In [None]:
sample_means_100

In [None]:
len(sample_means_100)

In [None]:
# Visualize the distribution of sample means, sample size == 100
Table().with_column(
    'Mean of 100 flight delays', sample_means_100).hist(bins=20)

print('Population Average:', delay_mean)

Notice that the histogram is roughly bell-shaped! Is that surprising? The original distribution of delays was not bell-shaped at all.

  - Is the mean of this sampling distribution similar to the mean of the population we were sampling from?
  - Is the width of the sampling distribution similar to the width of the population distribution?

The Central Limit Theorem tells us that if the sample is **large** and drawn at random with replacement, then, regardless of the distribution of the population, the probability distribution of the sample average is roughly normal (i.e., bell-shaped).
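
The statement can be checked on data other than the flight delays. The sketch below is a minimal simulation under stated assumptions: `population` is a made-up skewed population (exponential), and `skewness` is a hypothetical helper, not part of the `datascience` library, that measures how lopsided a distribution is.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed population standing in for the flight delays
population = rng.exponential(scale=15, size=14_000)

def skewness(values):
    """Average cubed z-score: near 0 for symmetric data, positive for a long right tail."""
    z = (values - np.mean(values)) / np.std(values)
    return np.mean(z ** 3)

# 10,000 samples of size 100, drawn at random with replacement (one sample per row)
samples = rng.choice(population, size=(10_000, 100), replace=True)
sample_means = samples.mean(axis=1)

print('Population mean:          ', np.mean(population))
print('Mean of the sample means: ', np.mean(sample_means))
print('Skewness of population:   ', skewness(population))
print('Skewness of sample means: ', skewness(sample_means))
```

The sample means center on the population mean, and their skewness is much closer to 0 than the population's, which is what "roughly normal" looks like in numbers.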

So, let's try a small sample size for comparison...


In [None]:
sample_means_4 = ten_thousand_sample_means(4)
Table().with_column(
    'Mean of 4 flight delays', sample_means_4).hist(bins=20)

print('Population Average:', delay_mean)

In [None]:
# But with a large random sample (say, 400):
sample_means_400 = ten_thousand_sample_means(400)
Table().with_column(
    'Mean of 400 flight delays', sample_means_400).hist(bins=20)

print('Population Average:', delay_mean)

Comparing sample size 400 with sample size 100, notice:
  - The shape is closer to normal
  - The values are more concentrated near the population mean (less variability)
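
The drop in variability can also be measured directly. This is a rough sketch using a made-up skewed population instead of the `united` table; `sd_of_sample_means` and its `repetitions` parameter are hypothetical names introduced for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed population standing in for the flight delays
population = rng.exponential(scale=15, size=14_000)

def sd_of_sample_means(sample_size, repetitions=2_000):
    """SD of the means of many random samples (with replacement) of one size."""
    samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
    return np.std(samples.mean(axis=1))

# Spread of the sample means for the three sample sizes used above
spreads = {n: sd_of_sample_means(n) for n in (4, 100, 400)}

for n, spread in spreads.items():
    print('Sample size', n, '-> SD of sample means:', spread)
```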

**Back to slides (discussion question)**