# Discussion 8
---
### Bootstrapping, Normal Curve, and CLT


<img src="data/panda_eat.jpg" width="500">

#### Extra
- You can find additional help on these topics in the [Course Notes](https://inferentialthinking.com/chapters/14/Why_the_Mean_Matters.html) and [CIT](https://inferentialthinking.com/chapters/14/Why_the_Mean_Matters.html).
- [Here](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view) is a pointer to that reference sheet we saw last time.

In [1]:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import babypandas as bpd
%matplotlib inline

import otter
grader = otter.Notebook()


# 1. Bootstrapping
- Problem : statistics about the data population are often unavailable, costly to acquire, unknown, etc.
- Solution : utilize random sampling (and re-sampling) of available data to estimate population statistics
    - The result of bootstrapping will be a distribution over sample statistics!
    - Hopefully we'll see that these *sample statistics* $\approx$ *population statistics*
    
- Procedure:
    - Sample from the population
    - Re-sample from that same sample (make sure to have replace=True!)
    - Repeat
    - **Note** - after re-sampling, we will likely see duplicate data entries within a single sample, but that's okay! 
        - If we didn't have duplicates, then we would have the same exact data in every single sample (this would be bad!)
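
The procedure above can be sketched as follows. This is a minimal illustration, not the course's reference solution: the exponential "population", the sample size of 100, and the 1000 repetitions are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Hypothetical "population" that we normally could not observe in full.
population = rng.exponential(scale=10, size=10_000)

# Step 1: collect one sample from the population (without replacement).
sample = rng.choice(population, size=100, replace=False)

# Steps 2-3: re-sample from that sample WITH replacement, many times,
# recording the statistic of interest (here, the mean) each time.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
])

print("population mean:     ", population.mean())
print("original sample mean:", sample.mean())
print("mean of boot means:  ", boot_means.mean())
```

The result, `boot_means`, is a distribution over sample statistics, which is what the later sections summarize with percentiles.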

### 1.1 Confidence Intervals for parameter estimation
- Goal : return a range of estimates that we are confident contain the true population statistic 
- Test your understanding:
    - **YES**: $X$% of all bootstrapped sample statistics fall within that interval
    - **YES**: ~$X$% of the time, the interval will capture the correct population statistic
    - **YES**: I'm $X$% sure that the true population statistic is in the interval
    - **NO**: the true population statistic has an $X$% chance of being in the interval
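
One way to see the "~$X$% of the time" interpretation is to simulate it. The sketch below assumes a made-up normal population whose true mean we know only because we generated it, then counts how often a 95% bootstrap percentile interval captures that mean. The names `true_mean`, `trials`, and `coverage` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 50   # hypothetical population mean, known only because we simulate
trials = 200
captured = 0

for _ in range(trials):
    # a fresh sample from the population for every trial
    sample = rng.normal(loc=true_mean, scale=10, size=50)
    # draw 500 bootstrap resamples at once: each row of idx is one resample
    idx = rng.integers(0, len(sample), size=(500, len(sample)))
    boot_means = sample[idx].mean(axis=1)
    # 95% percentile interval from the bootstrapped means
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    if lower <= true_mean <= upper:
        captured += 1

coverage = captured / trials
print(f"{coverage:.1%} of the 95% intervals captured the true mean")
```

Each individual interval either contains the true mean or it does not; the percentage describes how often the procedure succeeds across repeated samples.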
        
### 1.2 Confidence Intervals for hypothesis testing
- Goal: conduct hypothesis testing for parameters. [Read More](https://notes.dsc10.com/06-estimation/3_ht_using_intervals.html)
- Procedure:
    - Define the null hypothesis and alternative hypothesis for a parameter value
    - Define the level of significance
    - Generate a confidence interval for the true parameter at the matching confidence level (e.g. a 95% interval for a 0.05 level of significance)
    - Reject the null hypothesis if our interval excludes the null parameter.
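
The decision rule above can be sketched as a small helper. The function name `reject_null` and the made-up bootstrapped statistics are assumptions for illustration only.

```python
import numpy as np

def reject_null(boot_stats, null_value, significance=0.05):
    """Return True if the (1 - significance) percentile interval excludes null_value."""
    lower = np.percentile(boot_stats, 100 * significance / 2)
    upper = np.percentile(boot_stats, 100 * (1 - significance / 2))
    return not (lower <= null_value <= upper)

rng = np.random.default_rng(2)
boot_stats = rng.normal(loc=10, scale=1, size=2000)  # made-up bootstrapped statistics

print(reject_null(boot_stats, null_value=10))  # null value near the center of the interval
print(reject_null(boot_stats, null_value=20))  # null value far outside the interval
```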

### 1.3 Describing a Distribution
- Center of a distribution 
    - *Mean* : balance point
    - *Median* : half-way point (robust to outliers) 
- Spread of distribution 
    - *Range* : biggest - smallest
    - *Standard deviation* : variability around the mean
- Chebyshev's Inequality
    - Proportion of values in the range "average $\pm\ z$ SDs" is ≥ $1-\frac{1}{z^2}$


# 2. Area Under the Curve

The area under any distribution's curve obeys Chebyshev's bounds:
- For all lists, and all numbers $z$, the proportion of entries that are in the range "average $\pm z$ SDs" is at least $1 - \frac{1}{z^{2}}$

- In other words, at least $1-\frac{1}{z^2}$ of the data must fall within $z$ standard deviations of the mean.

This is useful because it gives a guaranteed lower bound on the proportion of entries within a given number of standard deviations of the mean, and hence on the area under that part of the curve. **NOTE**: Chebyshev's inequality holds for a distribution of any shape!
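
As a quick sanity check of the bound, the sketch below compares it with the observed proportion on a deliberately skewed, made-up dataset. The helper name `chebyshev_bound` and the choice of $z = 1.5$ are assumptions for illustration.

```python
import numpy as np

def chebyshev_bound(z):
    """Lower bound on the proportion of values within z SDs of the mean."""
    return 1 - 1 / z**2

rng = np.random.default_rng(3)
values = rng.exponential(scale=5, size=10_000)  # deliberately skewed, made-up data

z = 1.5
mean, sd = values.mean(), values.std()
# observed proportion of values within z SDs of the mean
within = np.mean(np.abs(values - mean) <= z * sd)

print(f"bound: {chebyshev_bound(z):.3f}, observed: {within:.3f}")
```

The observed proportion can be much larger than the bound; the inequality only promises that it is never smaller.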

### Question 2.1

According to Chebyshev's inequality, at least what proportion of entries are in the range average $\pm 1$ SD?

In [2]:
cheby_area_pm_1 = ...
cheby_area_pm_1

In [None]:
grader.check("q21")

### Question 2.2

According to Chebyshev's inequality, at least what proportion of entries are in the range average $\pm 2$ SD?

In [4]:
cheby_area_pm_2 = ...
cheby_area_pm_2

In [None]:
grader.check("q22")

### Question 2.3

According to Chebyshev's inequality, at least what proportion of entries are in the range average $\pm 3$ SD?

In [6]:
cheby_area_pm_3 = ...
cheby_area_pm_3

In [None]:
grader.check("q23")

## Example : Normal Distribution

For a normal distribution, the proportion of data within $z$ SDs of the mean is much larger than Chebyshev's lower bound, because the shape of the curve is known exactly.
Let us explore what the same ranges look like under the normal distribution with the help of `scipy.stats`.

We will use the `stats.norm.cdf` function, the cumulative distribution function of the standard normal curve: it returns the area under the curve to the left of a given point, measured in standard units. For example, `stats.norm.cdf(1)` gives the area of everything to the left of 1 standard deviation above the mean.

![](data/normal_curve.png)

To find the area between $a$ and $b$ (where $a$ and $b$ are measured in standard deviations from the mean), compute: ```stats.norm.cdf(b) - stats.norm.cdf(a)```
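
For instance, with arbitrarily chosen bounds of $a = -1.5$ and $b = 0.5$ standard units (values picked only for illustration), the subtraction looks like this:

```python
from scipy import stats

a, b = -1.5, 0.5  # arbitrary bounds, in standard units

# area to the left of b, minus area to the left of a
area = stats.norm.cdf(b) - stats.norm.cdf(a)
print(round(area, 4))
```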

### Question 2.4

What is the proportion of entries that are in the range average $\pm 1$ SD under the normal curve?

In [8]:
normal_area_pm_1 = ...
normal_area_pm_1

In [None]:
grader.check("q24")

### Question 2.5

What is the proportion of entries that are in the range average $\pm 2$ SD under the normal curve?

In [10]:
normal_area_pm_2 = ...
normal_area_pm_2

In [None]:
grader.check("q25")

### Question 2.6

What is the proportion of entries that are in the range average $\pm 3$ SD under the normal curve?

In [12]:
normal_area_pm_3 = ...
normal_area_pm_3

In [None]:
grader.check("q26")

In [14]:
# comparing AUC results

print(f"For ±1 SD --> Cheby. : {round(cheby_area_pm_1,3)}\t Normal : {round(normal_area_pm_1,3)}")
print(f"For ±2 SD --> Cheby. : {round(cheby_area_pm_2,3)}\t Normal : {round(normal_area_pm_2,3)}")
print(f"For ±3 SD --> Cheby. : {round(cheby_area_pm_3,3)}\t Normal : {round(normal_area_pm_3,3)}")

Although it is valid for any distribution, Chebyshev's inequality gives a much weaker lower bound on the proportion of data within $z$ standard deviations of the mean than the exact proportions for the normal curve.

# 3. Central Limit Theorem

The Central Limit Theorem says that the probability distribution of the **sum** or **average** of a large random sample drawn with replacement will be roughly normal, regardless of the distribution of the population from which the sample is drawn.

This is really useful because it lets us work with normal curves in many problems, even when the population itself is not normal.
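
A small simulation can make this concrete. The sketch below assumes a strongly skewed exponential population (mean 1) and a sample size of 100; both choices are made up for illustration. It checks that the sample means cluster around the population mean and that roughly 68% of them fall within 1 SD of their center, as a normal curve would predict.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100  # size of each sample

# 5000 samples of size n from a skewed (exponential) population with mean 1;
# each row is one sample
samples = rng.exponential(scale=1.0, size=(5000, n))
sample_means = samples.mean(axis=1)

# proportion of sample means within 1 SD of their own center
within_1_sd = np.mean(
    np.abs(sample_means - sample_means.mean()) <= sample_means.std()
)

print("mean of sample means:", sample_means.mean())
print("SD of sample means:  ", sample_means.std())
print("within 1 SD:         ", within_1_sd)
```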

You have already relied on this fact when computing p-values. When we use a cutoff of p-value <= 0.05, we are roughly saying that our statistic is at least 2 SDs away from the mean, which is pretty rare under a normal curve (about 5% of the area). For an arbitrary distribution, Chebyshev's bound only guarantees that at least 75% of the data lies within 2 SDs, so up to 25% of the data may lie outside 2 SDs.
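
The comparison between the two tail areas can be checked numerically. This is a small sketch; the variable names are illustrative.

```python
from scipy import stats

z = 2

# area beyond ±2 SDs under a normal curve (both tails)
normal_tail = 2 * (1 - stats.norm.cdf(z))

# Chebyshev: at most this much area beyond ±2 SDs, for any distribution
cheby_tail_max = 1 / z**2

print(f"normal tail area:      {normal_tail:.4f}")
print(f"Chebyshev upper bound: {cheby_tail_max:.4f}")
```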

In [15]:
# Let us introduce a random uniformly distributed dataset

data = np.random.uniform(0, 20, 200)
plt.hist(data)
plt.show()

As you can see above we have a dataset that is clearly not normal. Let's try bootstrapping this and computing the mean.

In [16]:
num_simulations = ...
sample_means = ...

plt.hist(sample_means, ec = 'w', bins = 24)
plt.show()

This may be surprising, but our sample means are roughly normally distributed!

This is extremely useful: we can compute p-values even with non-normal data, because the distribution of the statistic is roughly normal (as a result of the CLT).

### Question 3.1

What is the probability of observing a simulated statistic less than or equal to 9?

In [17]:
p_value_with_obs_9 = ...
p_value_with_obs_9

### Question 3.2

What is the probability of observing a simulated statistic less than or equal to 10?

In [18]:
p_value_with_obs_10 = ...
p_value_with_obs_10

## Example: Starbucks Snacks


This data gives us nutrition facts for the different foods and snacks Starbucks sells. You can read more about the data [here](https://www.kaggle.com/starbucks/starbucks-menu?select=starbucks-menu-nutrition-food.csv).

Let's assume that this data contains all of the Starbucks food menu items. Therefore, we consider it to be our population.

In [19]:
# load in all the data
starbucks_snacks = bpd.read_csv("data/starbucks-menu-food-nutrition.csv")

starbucks_snacks

# 4. Starbucks Snacks Data

In [20]:
pop_data = starbucks_snacks.get('Calories')
pop_mean = pop_data.mean()
pop_mean

## 4.1 Distribution of Population

In [21]:
# Visualization help
def get_bins(array, bin_size=1):
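    # floor/ceil (rather than int truncation) so that negative or fractional
    # extremes still fall inside the outermost bins
    smallestNum = int(np.floor(array.min()))
    
    largestNum = int(np.ceil(array.max()))
    upperLimit = largestNum + bin_size + 1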
    
    return np.arange(smallestNum, upperLimit, bin_size)

plt.title("Population Distribution")
plt.xlabel("Calories")
ax = plt.hist(pop_data, bins=get_bins(pop_data,20))

The population distribution is clearly not a normal distribution.

## 4.2 Distribution of Sample Means

In [22]:
# Get a sample
num_samples = 60
collected = starbucks_snacks.sample(n=num_samples, replace=False)

# Bootstrap
sample_means = np.array([])

for i in range(2000):
    bootstrapped = collected.sample(num_samples,replace=True)
    boot_mean = bootstrapped.get('Calories').mean()
    sample_means = np.append(sample_means, boot_mean)
    
plt.title("Distribution of Sample Means (with Population Mean)")
plt.hist(sample_means, bins=get_bins(sample_means, 5))
plt.scatter(pop_mean, 0, color='red', s=80).set_zorder(2)

# However, the distribution of sample means is roughly normal. Thanks, Central Limit Theorem!

## 4.3 Confidence Intervals

In [23]:
# compute the lower percentile given a confidence interval
def compute_lower_percentile(perc_conf):
    
    lower_perc = (100-perc_conf)/2
    
    return lower_perc

# compute the upper percentile given a confidence interval
def compute_upper_percentile(perc_conf):
    
    upper_perc = 100 - (100-perc_conf)/2 
    
    return upper_perc

def compute_ci(confidence_level, sample_means):

    # What is the mean we're estimating?
    mean = np.mean(sample_means) 

    # What are the percentiles?
    # Use the functions we made above
    lower_perc = compute_lower_percentile(confidence_level)
    upper_perc = compute_upper_percentile(confidence_level)

    # And then our lower and upper bounds?
    lower_bound = np.percentile(sample_means, lower_perc) 
    upper_bound = np.percentile(sample_means, upper_perc) 

    # Printing it out so we can easily see our results.
    print("""
    Sample Mean:\t{}

    Lower Percentile:\t{}
    Upper Percentile:\t{}

    Lower Bound:\t{}
    Upper Bound:\t{}

    Confidence Level:\t{}%
    """.format(mean, lower_perc, upper_perc, lower_bound, upper_bound, confidence_level))
    
    return lower_bound, upper_bound

def plot_ci(ci, lower_bound, upper_bound, sample_means, pop_mean):
    plt.title(f"{ci}% confidence interval")
    plt.hist(sample_means, bins=get_bins(sample_means, 5), density=True)
    plt.scatter(pop_mean, 0, color='red', s=80).set_zorder(3)
    plt.plot([lower_bound, upper_bound], [0,0], color='lime', linewidth=4, zorder=2)
    

In [24]:
lower_bound, upper_bound = compute_ci(95, sample_means)
plot_ci(95, lower_bound, upper_bound, sample_means, pop_mean)

## 4.4 Terminology

- **Population Distribution** is unknown, and can be any shape.
- **Sample Distribution** should have a shape roughly similar to the population distribution (provided that the sample was large enough and was properly randomized).
- **Sample Mean** is just the mean of that sample distribution. This is just a single value. We can collect many sample means (or simulate them by bootstrapping).
- The **Distribution of Sample Means** will resemble a normal distribution when the sample size is large enough (CLT).
- The **Center/Mean** of the distribution of sample means should be similar to the true population mean (provided that our original sample was proper).


# 5. Parameterize a normal curve

Let's compute the **mean** and **standard deviation** of our **Distribution of Sample Means** and parameterize a normal curve!


In [25]:
from scipy.stats import norm

# compute the mean and standard deviation
sample_dist_mean = ...
sample_dist_std = ...

# set limits for plot
start = sample_dist_mean-5*sample_dist_std
stop = sample_dist_mean+5*sample_dist_std
x = np.linspace(start, stop, 100)

plt.title("Distribution of Sample Means (and Normal Curve)")

# plot histogram
plt.hist(sample_means, bins=get_bins(sample_means, 5), density=True)

# plot normal curve
plt.plot(x, norm.pdf(x, sample_dist_mean, sample_dist_std), c='r')

print(f"Center (mean) : {round(sample_dist_mean,3)}")
print(f"Spread (std) : {round(sample_dist_std,3)}")

We now know the mean and standard deviation of the normal curve associated with the distribution of sample means.

- As you can see above, this normal curve is centered at the mean of our sample means (367.515 calories in the run recorded here; your values will vary because the sampling is random).
- It has a standard deviation of 16.346 calories.

However, we often want to standardize this distribution so that it is centered at 0 and has a standard deviation of 1. Standardizing makes it very easy to compare multiple normal distributions that originally had vastly different centers and spreads. It also makes it really easy to compute different statistics about the distribution.

Let's take a look at how to do that now.
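
Before working with the Starbucks sample means, here is the idea on a made-up array. The stand-in data (normal with center 370 and spread 16) is an assumption chosen only to resemble calorie-scale numbers.

```python
import numpy as np

rng = np.random.default_rng(5)
values = rng.normal(loc=370, scale=16, size=2000)  # made-up stand-in data

# step 1: shift so the mean is 0 (the spread is unchanged)
centered = values - values.mean()

# step 2: rescale so the standard deviation is 1
standardized = centered / values.std()

print("centered mean / std:    ", round(centered.mean(), 3), round(centered.std(), 3))
print("standardized mean / std:", round(standardized.mean(), 3), round(standardized.std(), 3))
```

Centering changes only the location of the distribution, and scaling changes only its spread, which is why the two steps can be done one after the other.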

## 5.1 Centering at mean = 0

In [26]:
# recall our sample of means
print(f"First 5 sample means : \t\t{[round(x,4) for x in sample_means[:5]]}")
print(f"Center of sample distribution : \t{round(sample_dist_mean,3)}")
print(f"Std of sample distribution : \t\t{round(sample_dist_std,3)}")

In [27]:
# center the data to have mean = 0
centered_sample_means = ...
centered_sample_dist_mean = ...
centered_sample_dist_std = ...

print(f"First 5 centered sample means : \t{[round(x, 4) for x in centered_sample_means[:5]]}")
print(f"Center of centered sample distribution : \t{round(centered_sample_dist_mean,3)}")
print(f"Std of centered sample distribution : \t\t{round(centered_sample_dist_std,3)}")

In [28]:
# visualize 
plt.title("Distribution of Sample Means (Centered)")
plt.xlabel("Centered Sample Mean Calories")
ax = plt.hist(centered_sample_means, bins=get_bins(centered_sample_means, 5), density=True)

## 5.2 Scaling to STD = 1

In [29]:
# scale the data to have std = 1
centered_and_scaled_means = ...
centered_and_scaled_sample_dist_mean = ...
centered_and_scaled_sample_dist_std = ...

print(f"First 5 centered and scaled sample means : {[round(x,4) for x in centered_and_scaled_means[:5]]}")
print(f"Center of centered and scaled sample distribution : \t{round(centered_and_scaled_sample_dist_mean,3)}")
print(f"Std of centered and scaled sample distribution : \t{round(centered_and_scaled_sample_dist_std,3)}")

In [30]:
# visualize 
plt.title("Distribution of Standardized Sample Means")
plt.xlabel("Standardized Sample Mean Calories")
ax = plt.hist(centered_and_scaled_means, bins=get_bins(centered_and_scaled_means, 0.5), density=True)

## 5.3 Plot the normal curve

In [31]:
# get the mean and std
centered_and_scaled_sample_dist_mean = ...
centered_and_scaled_sample_dist_std = ...

# set limits for plot
start = centered_and_scaled_sample_dist_mean-5*centered_and_scaled_sample_dist_std
stop = centered_and_scaled_sample_dist_mean+5*centered_and_scaled_sample_dist_std
x = np.linspace(start, stop, 100)

plt.title("Distribution of Sample Means (and Normal Curve)")

# plot histogram
plt.hist(centered_and_scaled_means, bins=get_bins(centered_and_scaled_means, 0.5), density=True)

# plot normal curve
plt.plot(x, norm.pdf(x, centered_and_scaled_sample_dist_mean, centered_and_scaled_sample_dist_std), c='r')

print(f"Center (mean) : {round(centered_and_scaled_sample_dist_mean,3)}")
print(f"Spread (std) : {round(centered_and_scaled_sample_dist_std,3)}")

Now that we are looking at a normal distribution, let's talk about standard units and area.

## 5.4 Standard Units and Area
- Define $z(x) = \frac{x-\text{mean}}{\text{std}}$
- $z(x)$ maps $x$ to standard units 
    - If a distribution is roughly normal, then the area under its curve between $a$ and $b$ is approximately equal to the area under the standard normal curve between $z(a)$ and $z(b)$
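
For an exactly normal curve the two areas agree, which the sketch below checks numerically. The center of 100, spread of 15, the bounds, and the helper name `to_standard_units` are all hypothetical values chosen for illustration.

```python
from scipy.stats import norm

mean, std = 100.0, 15.0   # hypothetical center and spread
a, b = 85.0, 120.0        # hypothetical bounds in original units

def to_standard_units(x):
    """Map x to standard units using the hypothetical mean and std above."""
    return (x - mean) / std

# area between a and b under the normal curve with the given mean and std
area_original = norm.cdf(b, mean, std) - norm.cdf(a, mean, std)

# area between z(a) and z(b) under the standard normal curve
area_standard = norm.cdf(to_standard_units(b)) - norm.cdf(to_standard_units(a))

print(area_original, area_standard)
```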

### Question 5.5 
What proportion of sample means fall between 315 and 350 calories, using standard units?

In [32]:
# define z(x)
def z(x):
    ...

In [33]:
# define calorie bounds
lower_calories = 315
upper_calories = 350

# compute standard units
lower_standard = ...
upper_standard = ...

# don't change these lines
print(f"Mean Calories: {round(sample_dist_mean,2)}")
print(f"LOWER : {lower_calories} calories --> {round(lower_standard,2)} standard units --> {round(lower_standard,2)} stdev's above the mean")
print(f"UPPER : {upper_calories} calories --> {round(upper_standard,2)} standard units --> {round(upper_standard,2)} stdev's above the mean")

In [34]:
# compute the area under the standard normal curve between the two standard-unit bounds
approx_prop_standard = ...
approx_prop_standard

In [35]:
# plot area under curve

plt.title("Area Under Curve (Standard)")

start = centered_and_scaled_sample_dist_mean-5*centered_and_scaled_sample_dist_std
stop = centered_and_scaled_sample_dist_mean+5*centered_and_scaled_sample_dist_std
x = np.linspace(start, stop, 100)
y = norm.pdf(x, centered_and_scaled_sample_dist_mean, centered_and_scaled_sample_dist_std)

# plot normal curve
plt.plot(x, y, c='r')

ix = (x>=lower_standard) & (x<=upper_standard)
plt.fill_between(x[ix],y[ix],alpha=0.5)

plt.axvline(lower_standard,color='C1')
ax = plt.axvline(upper_standard,color='C1')

### Question 5.6
What proportion of sample means fall between 315 and 350 calories, using the distribution of sample means directly?

In [36]:
# compute proportion using distribution
approx_prop_dist = ...
approx_prop_dist

In [37]:
# plot area under curve

plt.title("Area Under Curve (Distribution)")

# plot histogram
plt.hist(sample_means, bins=get_bins(sample_means, 5), density=True)

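# recompute the normal curve on the calorie scale (x and y from the previous
# plotting cell are in standard units, so they cannot be reused here)
x = np.linspace(sample_dist_mean-5*sample_dist_std, sample_dist_mean+5*sample_dist_std, 100)
y = norm.pdf(x, sample_dist_mean, sample_dist_std)

ix = (x>=lower_calories) & (x<=upper_calories)
plt.fill_between(x[ix],y[ix],alpha=0.5)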

plt.axvline(lower_calories,color='C1')
plt.axvline(upper_calories,color='C1')