# Bootstrap Tests and Confidence Intervals



Let's begin by testing our two statistical questions using NHST with bootstrap resampling. The two questions are:
1. Is the observed increase in median wealth ($\Delta = 57,400$) between the undergrad and grad groups statistically significant?
2. Is the observed increase in the relative frequency of millionaires between the undergrad and grad groups ($\Delta \approx 0.053$) statistically significant?

Let's start by importing the necessary libraries and loading the data:

In [32]:
import numpy as np
import numpy.random as npr
import pandas as pd

import matplotlib.pyplot as plt



In [33]:
#df = pd.read_csv('https://raw.githubusercontent.com/jmshea/Foundations-of-Data-Science-with-Python/main/05-binary-hypothesis-testing/nls/nls.csv')
df = pd.read_csv('nls/nls.csv')


remap = {'R0000100':'CASE_ID',
         'T5597600': 'GENDER',
         'T5684500': 'NET_WEALTH',
         'T9900000': 'HIGHEST_GRADE_EVER'
        }
df.rename(columns=remap, inplace=True)
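# Keep respondents with a recorded education level and positive net wealth
df2 = df.query('HIGHEST_GRADE_EVER > 0 & NET_WEALTH > 0')
# Undergraduate group: highest grade is 16 or 17
undergrad = df2.query('HIGHEST_GRADE_EVER >= 16 & HIGHEST_GRADE_EVER <= 17')['NET_WEALTH']
# Post-baccalaureate group: highest grade is 18 or more
postbac = df2.query('HIGHEST_GRADE_EVER >= 18')['NET_WEALTH']

# Pooled data: both groups combined
pooled = df2.query('HIGHEST_GRADE_EVER >= 16')['NET_WEALTH']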

In [34]:
print('The number of data points in each group is:')
print(f'\tUndergrad: {len(undergrad)}')
print(f'\tPost-Bac: {len(postbac)}')
print(f'\tPooled: {len(pooled)}')

The number of data points in each group is:
	Undergrad: 821
	Post-Bac: 473
	Pooled: 1294


## Testing Whether Post-Baccalaureate Education Increases Median Net Family Wealth

The median values of net family wealth for the undergraduate and post-baccalaureate groups are

In [35]:
undergrad.median()

427000.0

In [36]:
postbac.median()

484400.0

Our test statistic is the difference between these medians, and its observed value is

In [37]:
diff1 = postbac.median() - undergrad.median()
diff1

57400.0

Let's start with a standard NHST. We run a simulation in which each iteration creates two new sample groups by bootstrap sampling from the pooled data. We then compute the median of each group and calculate the sample test statistic by subtracting the median of the `undergrad` sample from the median of the `postbac` sample. We are evaluating whether post-baccalaureate education increases net family wealth, so we use a one-sided test: we increment a counter whenever the test statistic is at least as large as the observed value. After all iterations, the relative frequency with which the test statistic was at least as large as the observed difference in median wealth is our estimate of the $p$-value.

In [48]:
num_sims = 100_000
count1 = 0

for sim in range(num_sims):
  # Bootstrap sampling
  undergrad_sample = npr.choice(pooled, len(undergrad))
  postbac_sample = npr.choice(pooled, len(postbac))
  
  # Compute value of test statistic
  diff_sample = np.median(postbac_sample) - np.median(undergrad_sample) 
  
  # Compare test statistic to observed value and count
  if diff_sample >= diff1:
    count1+=1
  
print(f'The relative frequency of observing a difference in medians as large as')
print(f'the difference in the original data (i.e., the p-value) is {count1/num_sims}')
  
  

The relative frequency of observing a difference in medians as large as
the difference in the original data (i.e., the p-value) is 0.08789
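
The loop above can also be written without an explicit Python loop. The following is a minimal sketch, not the notebook's own method: it assumes NumPy's `default_rng` generator in place of `npr.choice`, and it uses synthetic lognormal data (`pooled_demo`) as a hypothetical stand-in for the NLS wealth values so that it runs on its own. The helper name `bootstrap_median_pvalue` is likewise made up for illustration.

```python
import numpy as np

# Hypothetical stand-in data; the real analysis uses the pooled NLS wealth values.
rng = np.random.default_rng(0)
pooled_demo = rng.lognormal(mean=13, sigma=1, size=1294)

# Group sizes and observed difference taken from the cells above
n_undergrad, n_postbac = 821, 473
observed_diff = 57_400.0


def bootstrap_median_pvalue(pooled, n_a, n_b, observed, num_sims=2_000, rng=None):
    """Estimate a one-sided p-value for median(b) - median(a) >= observed."""
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.asarray(pooled)
    # Draw every resample at once: one row per simulated experiment
    a = rng.choice(pooled, size=(num_sims, n_a), replace=True)
    b = rng.choice(pooled, size=(num_sims, n_b), replace=True)
    # Test statistic for each simulated experiment
    diffs = np.median(b, axis=1) - np.median(a, axis=1)
    # Relative frequency of differences at least as large as the observed one
    return np.mean(diffs >= observed)


p_demo = bootstrap_median_pvalue(
    pooled_demo, n_undergrad, n_postbac, observed_diff, rng=rng
)
print(f'Estimated p-value on the synthetic data: {p_demo}')
```

Because the data here are synthetic, the printed value is not comparable to the result above. With the real data, passing `pooled`, `len(undergrad)`, `len(postbac)`, and `diff1` should give an estimate close to the loop's, up to simulation noise. Drawing all resamples at once trades memory for speed, so `num_sims` is kept smaller here than in the loop.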


Recall what this value means. If the net wealth data for the two groups come from the same distribution (i.e., there is no real difference between the undergraduate and graduate groups in the probability of having a given family wealth), then with samples of these sizes we would still see a difference at least as large as the one we observed approximately 8.8% of the time. Because 8.8% is not a small probability, we cannot be confident that the observed difference in median net family wealth reflects a real effect. We say that "we fail to reject the null hypothesis" because $p \approx 0.088$ is greater than our threshold $\alpha=0.01$.
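
Since the $p$-value is itself estimated by simulation, it is worth checking how much it could vary from run to run. The sketch below is a rough check that treats the iterations as independent trials, so that the relative frequency has the usual binomial standard error $\sqrt{\hat{p}(1-\hat{p})/N}$. The variable names are chosen for illustration.

```python
import math

p_hat = 0.08789      # relative frequency reported by the simulation above
num_sims = 100_000   # number of bootstrap iterations used above

# Binomial standard error of a relative frequency estimated from num_sims trials
mc_se = math.sqrt(p_hat * (1 - p_hat) / num_sims)
print(f'Approximate Monte Carlo standard error: {mc_se:.5f}')
```

The standard error is on the order of 0.001, far smaller than the gap between the estimated $p$-value and $\alpha=0.01$, so rerunning the simulation with a different random seed should not change the decision.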

## Testing Whether Post-Baccalaureate Education Increases Probability of Becoming a Millionaire

Now consider whether post-baccalaureate education increases the probability of a family attaining a net worth of over \$1 million. We first determine the proportion of families with net wealth over \$1 million in each group.