In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Review: Probability and empirical distributions ##

Let's look at the probability distribution for rolling a fair six-sided die and see how the empirical distribution approximates it more closely as the sample size grows.
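
As a numeric companion to the histograms in the following cells, here is a minimal sketch that compares the theoretical probabilities (1/6 per face) with empirical proportions at several sample sizes. It uses plain NumPy rather than the `datascience` table methods, and the helper name `empirical_proportions`, the seed, and the chosen sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

faces = np.arange(1, 7)
theoretical = np.full(6, 1 / 6)  # equal chance for each face of a fair die

def empirical_proportions(sample_size):
    """Roll a fair die sample_size times and return the proportion of each face."""
    rolls = rng.choice(faces, size=sample_size)
    counts = np.bincount(rolls, minlength=7)[1:]  # index 0 is unused, so drop it
    return counts / sample_size

# The largest gap between empirical and theoretical proportions
# tends to shrink as the sample size grows.
for n in (10, 100, 1000, 100_000):
    props = empirical_proportions(n)
    max_gap = np.max(np.abs(props - theoretical))
    print(n, np.round(props, 3), round(max_gap, 3))
```

Because the rolls are random, the gap does not shrink on every single run for small samples; the trend is what matters.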


In [None]:
# Options for a six-sided die
die = Table().with_column('Face', np.arange(1, 7))
die

In [None]:
# Set up bins for the die faces (each bin is centered on an integer)
roll_bins = np.arange(0.5, 6.6, 1)

# Theoretical distribution (equal chance for each outcome)
die.hist(bins = roll_bins)

In [None]:
# Empirical distribution based on different sample sizes
# With larger sample sizes, the empirical distribution gets closer to the probability (population) distribution
sample_size = 10  # 100, 1000
die.sample(sample_size).hist(bins = roll_bins)

## Sampling from the population of United flight delays ##

Let's suppose our population is United Airlines domestic flights departing from San Francisco in the summer of 2015; i.e., United Airlines flights departing from San Francisco from 6/1/15 to 8/9/15.  We can plot the distribution of this population as a histogram. 

If we take a large enough random sample from this population, the histogram of the sample (its empirical distribution) will resemble the histogram of the whole population of flight delays.
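
One way to put a number on "resemble" is to bin the population and a sample with the same bins and add up how far apart the bin proportions are. The sketch below does this with NumPy. The `population` array is a synthetic stand-in generated on the spot, not the delays in `united_summer2015.csv`, and the helpers `binned_proportions` and `total_variation` are hypothetical names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility

# HYPOTHETICAL stand-in population: right-skewed values, NOT the real United delays.
population = rng.exponential(scale=25, size=10_000) - 10

bins = np.arange(-20, 201, 10)  # same bin edges as the delay histograms in this notebook

def binned_proportions(values, bins):
    """Proportion of all values that fall in each bin (values outside the bins are not counted)."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / len(values)

def total_variation(p, q):
    """Half the sum of absolute differences between two sets of bin proportions."""
    return np.abs(p - q).sum() / 2

pop_props = binned_proportions(population, bins)

# Sample with replacement and measure how far each sample's histogram is from the population's.
for n in (100, 1000, 10_000):
    sample = rng.choice(population, size=n, replace=True)
    distance = total_variation(pop_props, binned_proportions(sample, bins))
    print(n, round(distance, 3))
```

With the real table, the array passed in would be the `Delay` column, for example `united.column('Delay')`.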


In [None]:
# United Airlines domestic flights departing from San Francisco in the summer of 2015
united = Table.read_table('united_summer2015.csv')
united = united.with_column('Row', np.arange(united.num_rows)).move_to_start('Row')

# Let's get back to the United flight delay data. Recall the column headers:
united.show(5)

In [None]:
# Plot a histogram of the delays. Since we are treating the delays in the data set as a population,
# this histogram is the probability distribution of the delays.

# (Side note: we bin the data using np.arange(-20, 201, 10).
# This binning scheme leaves out about 0.8% of the data, the flights with extreme delays,
# but makes it easier to see the overall shape of the probability distribution.)

united.hist('Delay', bins = np.arange(-20, 201, 10))

In [None]:
# An empirical distribution from a random sample of 100 flights
united.sample(100).hist('Delay', bins = np.arange(-20, 201, 10))

## Empirical sampling distributions via simulating statistics ##

Let's view some empirical sampling distributions of the median statistic for the San Francisco flight delays data.
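
The steps outlined in the following cells (population median, a single sample median, then 10000 simulated medians from samples of size 30) can be sketched with plain NumPy as below. This is one possible approach, not necessarily the one used in lecture. The `delays` array is a synthetic stand-in for the `Delay` column, and `simulate_sample_medians` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, only for reproducibility

# HYPOTHETICAL stand-in for the 'Delay' column: NOT the real United data.
delays = np.round(rng.exponential(scale=25, size=10_000) - 10)

# Treating delays as the population, compute its median and
# the percentage of values at or below that median.
population_median = np.median(delays)
percent_at_or_below = 100 * np.count_nonzero(delays <= population_median) / len(delays)

def simulate_sample_medians(values, sample_size, repetitions):
    """Draw `repetitions` random samples (with replacement) and return the median of each."""
    medians = np.empty(repetitions)
    for i in range(repetitions):
        sample = rng.choice(values, size=sample_size, replace=True)
        medians[i] = np.median(sample)
    return medians

# 10000 samples of size 30 give 10000 estimates of the median.
medians = simulate_sample_medians(delays, sample_size=30, repetitions=10_000)

print("population median:", population_median)
print("percent at or below the median:", percent_at_or_below)
print("mean and SD of simulated medians:", medians.mean(), medians.std())
```

A histogram of `medians` is the empirical sampling distribution of the median for samples of size 30.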

In [None]:
# Considering united as a population
# The population median delay is:


In [None]:
# Percentage of data less than or equal to the median


In [None]:
# What if we take a random sample of size 30? What is the estimated median?


In [None]:
# Simulate the empirical distribution of the median (statistic) using a sample size of 30
# We generate 10000 samples of size 30 (there will be 10000 estimates of the median)
# This cell takes a few seconds to run





In [None]:
# Display the empirical distribution of the median as a histogram


## Swain vs. Alabama ##
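
The cells in this section outline a simulation: set the population proportions, draw panels of 100, record the number of Black men on each panel, and look at the distribution of those counts. A minimal sketch follows. The figures of 26% (share of Black men in the eligible population) and 8 (Black men on the actual panel) are not stated in this notebook; they are the values usually quoted for this case and are treated as assumptions here. NumPy's `multinomial` is used to draw the random panels, and `one_simulated_count` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed, only for reproducibility

# ASSUMED figures (not given in this notebook):
eligible_proportions = np.array([0.26, 0.74])  # [Black, not Black]
panel_size = 100
observed_count = 8  # assumed number of Black men on the actual panel

def one_simulated_count():
    """Number of Black men in one random panel drawn from the eligible population."""
    counts = rng.multinomial(panel_size, eligible_proportions)
    return counts[0]

# Simulate many panels and collect the statistic from each.
simulated_counts = np.array([one_simulated_count() for _ in range(10_000)])

print("smallest simulated count:", simulated_counts.min())
print("mean simulated count:", simulated_counts.mean())
print("proportion of simulations at or below the observed count:",
      np.count_nonzero(simulated_counts <= observed_count) / len(simulated_counts))
```

If the observed count lies far below nearly every simulated count, random selection from the eligible population is a poor explanation for the panel.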

In [None]:
# Create the proportions that match the underlying population



In [None]:
# Take a sample of 100 potential jurors and get the proportions from the sample


In [None]:
# Statistic: number of Black men among a random sample of 100 men from the eligible population


In [None]:
# Simulation: randomly drawing many samples of size 100






In [None]:
# Visualize the distribution of the statistic





## Mendel and Pea Flowers ##
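
The cells in this section follow the same pattern with a different statistic: the distance between the sample percent of purple-flowering plants and 75. Here is a minimal sketch under the model of 75% purple and 25% white. The observed count of 705 purple plants out of 929 is not given in this notebook and is treated as an assumption; `distance_from_75` and `one_simulated_distance` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed, only for reproducibility

model_proportions = np.array([0.75, 0.25])  # [purple, white] under the model
num_plants = 929
observed_purple = 705  # ASSUMED observed count, not given in this notebook

def distance_from_75(percent_purple):
    """Statistic: distance between a sample percent of purple plants and 75."""
    return abs(percent_purple - 75)

def one_simulated_distance():
    """Draw 929 plants under the model and return the statistic for that sample."""
    proportions = rng.multinomial(num_plants, model_proportions) / num_plants
    return distance_from_75(100 * proportions[0])

# Simulate the statistic many times under the model.
simulated = np.array([one_simulated_distance() for _ in range(10_000)])

observed_distance = distance_from_75(100 * observed_purple / num_plants)
print("observed distance:", round(observed_distance, 2))
print("proportion of simulated distances at least as large:",
      np.count_nonzero(simulated >= observed_distance) / len(simulated))
```

When a large share of simulated distances are at least as big as the observed one, the observed data are consistent with the model.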

In [None]:
# Create the proportions that match the underlying population of pea colors



In [None]:
# Draw a sample of 929 plants and calculate proportions of colors


In [None]:
# Statistic: distance between sample percent (of purple plants) and 75



In [None]:
# Simulation: randomly drawing many samples of size 929






In [None]:
# Examine the difference between observed and expected proportions



In [None]:
# Compare observed data to the proportions generated by the model

