# [LEGALST-123] Lab 05: Central Limit Theorem, Confidence Intervals, Hypothesis Testing

In [1]:
from datascience import *
from collections import Counter
import numpy as np
import pandas as pd
from scipy import stats
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import plotly.express as px

## Introduction
In this lab, we aim to prepare students for prediction exercises in PSET 1 and PSET 2 by allowing students to contextualize the statistical ideas of the Central Limit Theorem and hypothesis testing by using a dataset containing continuous variables. 

<br/>

<hr style="border: 1px solid #fdb515;" />

## Data & Exploratory Data Analysis

For this lab, we'll be using the same datasets used in our previous labs: the Nashville police stops dataset. Run the following cell below to read the `DataFrame`.

In [2]:
stops = pd.read_csv("https://github.com/ds-modules/data/raw/main/nashville_sample.csv", index_col=0)
stops.head()

Unnamed: 0,index,raw_row_number,date,time,location,lat,lng,precinct,reporting_area,zone,...,raw_traffic_citation_issued,raw_misd_state_citation_issued,raw_suspect_ethnicity,raw_driver_searched,raw_passenger_searched,raw_search_consent,raw_search_arrest,raw_search_warrant,raw_search_inventory,raw_search_plain_view
0,1840907,93347,2010-04-18,13140.0,"BURGESS AVE & WHITE BRIDGE PIKE, NASHVILLE, TN...",36.145004,-86.85797,1.0,5103.0,113.0,...,False,,N,False,False,False,False,False,False,False
1,492044,2001428,2015-01-19,19920.0,"DUE WEST AVE W & S GRAYCROFT AVE, MADISON, TN,...",36.249187,-86.734459,7.0,1797.0,723.0,...,False,False,N,False,False,False,False,False,False,False
2,431170,1996331,2015-01-15,1020.0,"S GALLATIN PIKE & MADISON BLVD, MADISON, TN, 3...",36.254979,-86.715246,7.0,1623.0,711.0,...,False,False,N,False,False,False,False,False,False,False
3,2066423,1319451,2013-05-17,62760.0,"CHARLOTTE PIKE & W HILLWOOD DR, NASHVILLE, TN,...",36.139093,-86.880533,1.0,5009.0,123.0,...,False,False,N,False,False,False,False,False,False,False
4,2899480,201349,2010-09-01,28140.0,"BELL RD & DODSON CHAPEL RD, HERMITAGE, TN, 37076",36.16331,-86.613147,5.0,9501.0,521.0,...,False,,N,False,False,False,False,False,False,False


Let's refer back to our last lab, where we explored different distributions using histograms. In particular, we looked at the distribution of stop counts for `"subject_sex"` and `"subject_age"`. For this notebook, let's look at the distribution of **age for each race**.

<!-- BEGIN QUESTION -->
<div class=“alert alert-warning”>

#### **Question 1.1**:
Before explore these variables, let's clean the dataset. In the code cell below, drop any columns that have "raw" in their column names. Then, drop the rows with *any* null values EXCEPT for the columns` "contraband_found"`, `"contraband_drugs"`, `"contraband_weapons"`, `"search_basis"`, and `"notes"`.
</div>

Hint: Look at Lab 04 question 1.1 and question 1.2! It should be very similar. 

In [3]:
# YOUR ANSWER HERE. You can use more or less lines than provided below. 
... 
... 
...
stops = ...

stops.head(5)

Let's explore the columns `"subject_race"` and `"subject_age"`. For convenience, we have provided code below showing what race/ethincity categories exist within the column  `"subject_race"`. For this particular question, let's look at the distribution of people who are categorized as `'hispanic'` or `'white'` for `"subject_race"`. 

In [5]:
stops["subject_race"].unique()

array(['black', 'white', 'hispanic', 'unknown', 'asian/pacific islander',
       'other'], dtype=object)

<!-- BEGIN QUESTION -->
<div class=“alert alert-warning”>

#### **Question 1.2**:

In the cell below, use Plotly to plot a histogram showing the distribution of age for people who are categorized as Hispanic with the y axis representing percentage. To do this, first create a table called `subject_hispanic` with your manipulations. Use `range_x=[0,80]` to properly scale the histogram.

Hint 1: [here](https://plotly.github.io/plotly.py-docs/generated/plotly.express.histogram.html) is the documentation for Plotly's histogram method.

Hint 2: Take a look at the `histnorm` attribute of the histogram method!

</div>

In [6]:
# YOUR CODE HERE

subject_hispanic = ... 

px.histogram(...)

**Find the 25th, 50th, 75th percentile of `subject_hispanic` below.**

In [8]:
# YOUR CODE HERE
# Find the 25th percentile

In [9]:
# Find the 50th percentile

In [10]:
# Find the 75th percentile

<!-- BEGIN QUESTION -->
<div class=“alert alert-warning”>

#### **Question 1.3**:

Now, follow the same process for people who are categorized as White. Again, to do this, first create a table called `subject_white` with your manipulations.

</div>

In [12]:
# YOUR CODE HERE

subject_white = ... 

px.histogram(...)

**Find the 25th, 50th, 75th percentile of `subject_white` below.**

In [None]:
# YOUR CODE HERE
# Find the 25th percentile

In [None]:
# Find the 50th percentile

In [None]:
# Find the 75th percentile

<!-- BEGIN QUESTION -->
<div class=“alert alert-warning”>

#### **Question 1.4**:

Now, create an overlaid historgram comparing the two distributions.

</div>

In [None]:
# YOUR ANSWER HERE
combined = pd.concat(...)

**What do you notice about the two distributions? How do they compare?**

_YOUR ANSWER HERE_

## Bootstrapping and the Confidence Interval

Bootstrapping is a statistical technique that allows us to make educated guesses about a population using only a small sample from that population. It works by repeatedly taking small random samples from the data we have and then using these samples to estimate things like averages, variances, or other statistics, as if we had data for the entire population. This technique helps us understand how uncertain or variable our estimates are and is especially useful when we don't have access to the whole population's data.


###  A Random Sample and an Estimate
Let's first draw from the sample, at random with replacement, the same number of times as the original sample size.

It is important to resample the same number of times as the original sample size. The reason is that the variability of an estimate depends on the size of the sample.

If we drew  at random without replacement, we would just get the same sample back. By drawing with replacement, we create the possibility for the new samples to be different from the original, because some participants might be drawn more than once and others not at all.

In [115]:
# Run this cell a few times to see how the distribution varies
h_resampled = stops[stops['subject_race'] == 'hispanic'].sample(len(stops['subject_race'] == 'hispanic'), replace = True)

title = 'Bootstrapped Distribution of Age among the Hispanic Population in our Dataset'
px.histogram(h_resampled, x='subject_age', histnorm='percent', range_x=[0, 80], title=title)

### Resampling from the Sample

By resampling again and again, we can get many such estimates, and hence an `empirical distribution` of the estimates.

Let us collect this code and define a function one_bootstrap_mean that returns one bootstrapped mean of `subject_age`, based on bootstrapping our original dataset.

In [102]:
def one_bootstrap_mean():
    resampled =  stops[stops['subject_race'] == 'hispanic'].sample(len(stops['subject_race'] == 'hispanic'), replace = True)
    bootstrapped_mean = np.mean(resampled['subject_age'])
    return bootstrapped_mean

Run the cell below a few times to see how the bootstrapped means vary. Remember that each of them is an estimate of the population mean.

In [116]:
one_bootstrap_mean()

28.654396728016359

We can now repeat the bootstrap process multiple times by running a `for` loop as usual. In each iteration, we will call the function `one_bootstrap_mean` to generate one value of the bootstrapped mean based on our original dataset. Then we will append the boostrapped mean to the collection array `bstrap_means`.

Let's do 2000 repetitions for this round of bootstrapping! (since this is a large number the code might take a while to run)

In [118]:
num_repetitions = 10000
bstrap_means = make_array()
for i in np.arange(num_repetitions):
    bstrap_means = np.append(bstrap_means, one_bootstrap_mean())

Now let's visualize what we got from the bootstrapped process

In [134]:
title = 'Distribution of Bootstrapped Mean Age Amongst Hispanic Population'
px.histogram(bstrap_means,histnorm='percent', title=title)


### Confidence Intervals

Confidence intervals are an important tool in data science. They help us create an estimate of a population parameter from a subset of the data. Here, we are using a 95% confidence interval to guess the average age of the Hispanic population that has been stopped by police.

In [135]:
# Get the endpoints of the 95% confidence interval
left = percentile(2.5, bstrap_means)
right = percentile(97.5, bstrap_means)

make_array(left, right)

array([ 27.94887526,  29.27607362])

Now let's add the confidence interval to the histogram above

In [152]:
title = 'Distribution of Bootstrapped Mean Age Amongst Hispanic Population'
fig = px.histogram(bstrap_means,histnorm='percent', title=title)
fig.add_shape(type='line', x0=left, y0=0, x1=right, y1=0, line_color='red')

Here, you can visualize our estimate for the population age as displayed by the red line.

## 4. Central Limit Theorem (CLT)

**The Central Limit Theorem (CLT)** is a fundamental concept in statistics that has significant implications for making inferences about populations based on samples. It states that, regardless of the shape of the original population distribution, the distribution of the sample means will approach a normal distribution as the sample size increases. This is true as long as the sample size is sufficiently large.

The significance of the Central Limit Theorem lies in its ability to provide a bridge between the characteristics of a population and the properties of the sample means drawn from that population.

**Population Mean:**

The Central Limit Theorem tells us that the sampling distribution of the mean of a random sample will be approximately normally distributed, even if the population distribution is not normal.
This is crucial because it allows us to make inferences about the population mean using statistical methods that assume a normal distribution.

**Sample Size:**

The larger the sample size, the closer the distribution of the sample mean will be to a normal distribution according to the CLT.
As the sample size increases, the standard deviation of the sampling distribution decreases. This means that larger sample sizes provide more precise estimates of the population mean.

To demonstrate this, let's look at the original dataset and data we obtained from bootstrapping above.

In [156]:
title = 'Bootstrapped Distribution of Age among the Hispanic Population in our Dataset'
px.histogram(h_resampled, x='subject_age', histnorm='percent', range_x=[0, 80], title=title)

Now, let's look at the distribution of bootstrapped mean amongst this sample

In [157]:
title = 'Distribution of Bootstrapped Mean Age Amongst Hispanic Population'
px.histogram(bstrap_means,histnorm='percent', title=title)

As we can see from above, the original distribution looks nothing like a normal distribution, but the bootstrapped means still follow a somewhat normal shape.

Now, let's connect the Central Limit Theorem to the motivation behind using **regression**:

In regression analysis, the Central Limit Theorem is often invoked when dealing with the distribution of the regression coefficients.
The ordinary least squares (OLS) estimators, which are commonly used in regression analysis, are unbiased and efficient under the assumption of normally distributed errors.
The CLT justifies the use of statistical tests and confidence intervals for regression coefficients, as it ensures that the distribution of these coefficients becomes approximately normal as the sample size increases.

In summary, the Central Limit Theorem is significant because it allows statisticians to make valid inferences about population parameters, particularly the population mean, based on samples. This is crucial in various fields, including regression analysis, where assumptions about the distribution of coefficients play a key role in drawing conclusions about relationships between variables.

### Using bootstrapping to arrive at a distribution of a test statistic

Here, we are testing whether the underlying distribution of age is the same amongst Hispanic and White populations in the Nashville traffic stops using a 5% p-value cutoff. The test statistic being used is the difference between the average ages. The null hypothesis here would be that the average age of Hispanic drivers stopped is *not* less than the average age of white drivers stopped. So really the question we are getting at is whether or not we can reject the null hypothesis (H<sub>0</sub>).

This sort of classical hypothesis test for trying to determine whether samples come from different underlying populations is commonly accomplished by using the t-statistic. When we talk about regression coefficients next time, we will see that the analysis typically includes a classical hypothesis test for whether we can reject the null hypothesis (i.e., that the coefficient may be zero and so the variable has no effect on the outcome). Here we are simulating a one-sided Student's t-test for independently sampled means by creating an underlying population mean distribution using bootstrapping and then asking how likely our test statistic is under the assumption that the null hypothesis is true.

In [1]:
# create a dataframe of subject age and subject race
# define a test statistic--difference in sample means
# use boostrapping to create samples (of the same size as the original data) and calculate the mean for each sample
# record the means, calculate the test statistic for each sampling run

ab_dataframe = combined[['subject_age', 'subject_race']]
test_stat = np.mean(ab_dataframe[ab_dataframe['subject_race'] == 'hispanic']['subject_age']) - np.mean(ab_dataframe[ab_dataframe['subject_race'] == 'white']['subject_age'])

simulated_stats = []

for i in range(10000):
    new_column = ab_dataframe.sample(ab_dataframe.shape[0], replace=False).reset_index()['subject_race']

    sampled = ab_dataframe.copy()
    sampled.loc[:, 'shuffled'] = new_column

    new_stat = np.mean(sampled[sampled['shuffled'] == 'hispanic']['subject_age']) - np.mean(sampled[sampled['shuffled'] == 'white']['subject_age'])
    simulated_stats = np.append(simulated_stats, new_stat)

NameError: name 'combined' is not defined

**<span style="color:red">It looks like there are NaNs in the list of differences in mean ages that we just simulated by bootstrapping. I cannot yet figure out how they got there, since we did not have NaNs in our original subject_age data, and since `np.mean` ignores NaNs. But the floating point representation of NaN interferes with the calculation of quantiles. That means we need to get rid of the NaNs in the list of differences in group means.<span>**

In [None]:
np.sum(np.isnan(simulated_stats))

In [None]:
simulated_stats = [x for x in simulated_stats if (np.isnan(x)==False)]

In [None]:
print('Our observed test statistic is', test_stat)

Our observed test statistic is 10.4041666667


In [None]:
# find the 95% confidence interval bounds, remembering that it is one-sided hypothesis
# note that we needed to make sure there are no NaNs to use the 'np.quantile' function
lower = ...
upper = ...
print("lower bound of 95% confidence interval: ", lower," upper bound of 95% confidence interval: ", upper)

In [None]:
title = 'Boostrapped Distribution of Difference in Mean Age Between Hispanic and White Drivers Stopped'
fig = px.histogram(simulated_stats, histnorm='percent', title=title)
fig.add_shape(type='line', x0=lower, y0=0, x1=upper, y1=0, line_color='yellow', line_width=10)
fig.add_shape(type='line', x0=test_stat, y0=0, x1=test_stat, y1=4, line_color='red', line_width=2)

In [None]:
# calculate the value of p, which the proportion of simulated means that are less than (remember this is a
# one-sided test) the value of our test statistic for the sample
# hint: use a list comprehension like we did just above

p_val = ...
print('our p value is', p_val)