# Class 22: Confidence intervals

Plan for today:
- Hypothesis tests for comparing two means
- Confidence intervals


## Notes on the class Jupyter setup

If you have the *ydata123_2024a* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [1]:
import YData

# YData.download.download_class_code(22)   # get class code    
# YData.download.download_class_code(22, True) # get the code with the answers 

YData.download_data("babies.csv")

The file `babies.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.


In [None]:
# YData.download.download_homework(8)  # download the homework 

If you are using Google Colab, you should install the YData package by uncommenting and running the code below.

In [2]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

## Hypothesis tests

In hypothesis testing, we start with a claim about a population parameter (e.g., µ = 4.2, or π = 0.25).

This claim implies we should get a certain distribution of statistics, called the "null distribution". 

If our observed statistic is highly unlikely to come from the null distribution, we reject the claim. 

We can break down the process of running a hypothesis test into 5 steps. 

1. State the null and alternative hypotheses
2. Calculate the observed statistic of interest
3. Create the null distribution 
4. Calculate the p-value 
5. Make a decision

Let's run through these steps now by working through a practice problem: a hypothesis test comparing two means!


## 1. Smoking and baby weights

The Child Health and Development Studies investigate a range of topics. One study, in particular, considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area.

Let's examine this data to see if the average birth weight of babies differs depending on whether the mother smokes. 


### Step 1: State the null and alternative hypotheses


$H_0$: 

$H_A$: 



### Step 2: Calculate the observed statistic

The code below loads the data from the study. The two relevant columns are:
- `bwt`: The birth weight of the baby in ounces
- `smoke`: whether the mother smokes (1) or does not smoke (0)

More information about the data is available at: https://www.openintro.org/data/index.php?data=babies

In [16]:
babies = pd.read_csv("babies.csv")

babies.head()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0


To simplify the analysis, create a new DataFrame called `babies2` that only has the smoke and bwt columns. 

In [17]:
# create a DataFrame called babies2 that has only the smoke and bwt columns




Let's have our observed statistic be the difference of sample means  $\bar{x}_{non-smoke} - \bar{x}_{smoke}$.  

Please calculate this observed statistic and save it to the name `obs_stat`.


To make the rest of the analysis easier, write a function `get_diff_baby_weights(babies_df)` that takes a DataFrame `babies_df` with smoke and bwt columns and returns the difference between the mean birth weight of babies whose mothers do not smoke and the mean birth weight of babies whose mothers do smoke. 

Also, test the function to make sure it gives the same observed statistic you calculated above.

In [18]:
def get_diff_baby_weights(babies_df):

    ...
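
As a rough illustration of the idea (not the official solution), the sketch below selects the two relevant columns and computes the difference in group means. The `toy_babies` DataFrame is made-up data used only for this example; it is not the real `babies.csv`, and the names `toy_babies`, `toy_babies2`, and `toy_obs_stat` are hypothetical.

```python
import pandas as pd

def get_diff_baby_weights(babies_df):
    """Mean bwt for smoke == 0 minus mean bwt for smoke == 1."""
    group_means = babies_df.groupby("smoke")["bwt"].mean()
    return group_means.loc[0] - group_means.loc[1]

# Made-up data (NOT the real babies data), just to exercise the function
toy_babies = pd.DataFrame({
    "age":   [27, 33, 28, 36, 23, 25],
    "smoke": [0, 0, 0, 1, 1, 1],
    "bwt":   [120, 124, 128, 110, 114, 118],
})

# Keep only the two columns needed for the analysis
toy_babies2 = toy_babies[["smoke", "bwt"]]

toy_obs_stat = get_diff_baby_weights(toy_babies2)
print(toy_obs_stat)
```

One possible design choice here is `groupby`, which computes both group means in a single pass; filtering the DataFrame twice with boolean masks would work just as well.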


### Step 3: Create the null distribution 

Now let's create a null distribution that has 10,000 statistics that are consistent with the null hypothesis. 

In this example, if the null hypothesis were true, there would be no difference in mean birth weight between babies of smoking mothers and babies of non-smoking mothers. Thus, under the null hypothesis, we can shuffle the group labels and get equally valid statistics. 

Let's create one statistic consistent with the null distribution to understand the process. We can then repeat this 10,000 times to get a full null distribution. 
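
The shuffling idea can be sketched as follows, assuming simulated stand-in data rather than the real babies data. The helper names `diff_in_means` and `one_shuffled_stat` are hypothetical, and the sketch uses 1,000 shuffles instead of 10,000 to keep it quick.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in data (NOT the real babies data): 50 babies per group
toy = pd.DataFrame({
    "smoke": np.repeat([0, 1], 50),
    "bwt": rng.normal(120, 18, size=100),
})

def diff_in_means(df):
    """Mean bwt for smoke == 0 minus mean bwt for smoke == 1."""
    means = df.groupby("smoke")["bwt"].mean()
    return means.loc[0] - means.loc[1]

def one_shuffled_stat(df, rng):
    """Shuffle the group labels, then recompute the statistic."""
    shuffled = df.copy()
    shuffled["smoke"] = rng.permutation(df["smoke"].to_numpy())
    return diff_in_means(shuffled)

# Repeat the shuffle many times to build a null distribution
toy_null_dist = np.array([one_shuffled_stat(toy, rng) for _ in range(1000)])
print(toy_null_dist.mean())
```

Because shuffling breaks any link between the labels and the weights, the resulting statistics should be centered near 0.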

In [19]:
# shuffle the data 





In [20]:
%%time

# create a full null distribution 








CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 3.81 µs


In [21]:
# visualize the null distribution 




# put a line at the observed statistic value




### Step 4: Calculate the p-value

The p-value is the proportion of statistics in the null distribution that are at least as extreme as the observed statistic. 
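
A minimal sketch of this calculation is below. It assumes the alternative hypothesis is that the difference is greater than 0, so "extreme" means the right tail. The null distribution and observed statistic here are made-up stand-ins, not results from the babies data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in null distribution and observed statistic (made-up values)
toy_null_dist = rng.normal(0, 1, size=10_000)
toy_obs_stat = 2.0

# Proportion of null statistics at least as large as the observed statistic
p_value = np.mean(toy_null_dist >= toy_obs_stat)
print(p_value)
```

Taking the mean of a Boolean array works because `True` counts as 1 and `False` as 0, so the mean is the proportion of `True` values.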


### Step 5: Draw a conclusion






<br>
<center>
<img src="https://i.ytimg.com/vi/x4c_wI6kQyE/maxresdefault.jpg" alt="smoking" style="width: 300px;"/>
</center>
<br>

## 2. Two-sided hypothesis test

Sometimes in hypothesis testing we don't know the direction of an effect; we only suspect that the null hypothesis is incorrect. 

In these circumstances, we write our alternative hypothesis such that we state that the parameter value is not equal to the value specified by the null hypothesis.

For the baby weight example, we would write our hypotheses as:

$H_0$: $\mu_{non-smoke} =  \mu_{smokes}$   or    $H_0$: $\mu_{non-smoke} -  \mu_{smokes} = 0$ 

i.e., the null hypothesis is the same as before.

$H_A$: $\mu_{non-smoke} \ne  \mu_{smokes}$   or    $H_A$: $\mu_{non-smoke} -  \mu_{smokes} \ne 0$ 

We now use not equal to ($\ne$) in our alternative hypothesis.


To calculate the p-value, we need to look at the values more extreme than the observed statistic in both tails. 



In [5]:
# visualize the null distribution 





# put lines showing values more extreme than the observed statistic






When calculating the p-value, we need to get the proportion of statistics in the null distribution that are more extreme than the observed statistic from both tails.
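
One way to sketch the two-tailed calculation is to measure how far each null statistic is from the null value (0 for a difference of means) and compare that distance to the observed distance. The values below are made-up stand-ins, and the name `null_center` is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in null distribution and observed statistic (made-up values)
toy_null_dist = rng.normal(0, 1, size=10_000)
toy_obs_stat = 2.0
null_center = 0   # value of the difference under H0

# Count statistics at least as far from the null value, in either tail
observed_deviation = np.abs(toy_obs_stat - null_center)
p_value_two_sided = np.mean(np.abs(toy_null_dist - null_center) >= observed_deviation)
print(p_value_two_sided)
```

For a roughly symmetric null distribution, this is close to twice the one-tailed p-value.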

## 3. Using hypothesis tests to generate confidence intervals

There are several methods that can be used to calculate confidence intervals, including a computational method called the "bootstrap" and "parametric methods" that use probability distributions. If you take a traditional introductory statistics class you will learn some of these methods.

Below we use a less conventional method to calculate confidence intervals by finding all parameter values that a hypothesis test fails to reject at the 0.05 significance level (i.e., values for which the p-value is not less than 0.05). As you will see, the method gives similar results to other methods, although it requires a bit more computation time.

As an example, let's create a confidence interval for the population proportion of movies $\pi$ that pass the Bechdel test. As is the case for all confidence intervals, this confidence interval gives a range of plausible values that likely contains the true population proportion $\pi$.
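
The overall approach can be sketched compactly as follows. This is an illustration under stated assumptions, not the notebook's own implementation: it draws the null distribution with NumPy's binomial sampler, uses 2,000 simulations per candidate value to keep it fast, and the names `two_sided_pvalue`, `candidate_pis`, and `toy_ci` are hypothetical. The counts 803 and 1794 are the Bechdel data values used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(2)

n_movies = 1794
n_passed = 803
p_hat = n_passed / n_movies   # observed proportion

def two_sided_pvalue(null_pi, n_sims=2000):
    """Simulated two-tailed p-value for H0: pi = null_pi."""
    null_dist = rng.binomial(n_movies, null_pi, size=n_sims) / n_movies
    deviation = abs(null_pi - p_hat)
    return np.mean(np.abs(null_dist - null_pi) >= deviation)

# Try a grid of null-hypothesis values and keep those that are not rejected
candidate_pis = np.linspace(0.40, 0.50, 51)
toy_pvalues = np.array([two_sided_pvalue(pi) for pi in candidate_pis])
plausible_pis = candidate_pis[toy_pvalues >= 0.05]

# The interval runs from the smallest to the largest plausible value
toy_ci = (plausible_pis.min(), plausible_pis.max())
print(toy_ci)
```

The resolution of the interval is limited by the spacing of the grid, and its endpoints will vary slightly from run to run because the p-values are simulated.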


In [3]:
# To start, let's use a function that generates a statistic p-hat 
# that is consistent with a particular population parameter value pi

def generate_proportion(n, prob_heads):
    
    random_sample = np.random.rand(n) <= prob_heads
    return np.mean(random_sample)


generate_proportion(1794, .5)


0.4977703455964326

In [4]:
# The function below calculates a p-value for the Bechdel data based on a particular pi value that is specified in a null hypothesis.
# (i.e., it is a function that encapsulates the hypothesis test you ran in class 20).


def get_Bechdel_pvalue(null_hypothesis_pi, plot_null_dist = False):
    
    
    # The observed p-hat value
    prop_passed = 803/1794
    
    
    # Generate the null distribution 
    null_dist = []
    
    for i in range(10000):    
        null_dist.append(generate_proportion(1794, null_hypothesis_pi))
    
    
    # Calculate a "two-tailed" p-value which is the proportion of statistics more extreme than the observed statistic
    
    statistic_deviation = np.abs(null_hypothesis_pi - prop_passed)
    
    pval_left = np.mean(np.array(null_dist) <= null_hypothesis_pi - statistic_deviation)
    pval_right = np.mean(np.array(null_dist) >= null_hypothesis_pi + statistic_deviation)
    
    p_value = pval_left + pval_right

    
    # plot the null distribution and lines indicating values more extreme than the observed statistic 
    if plot_null_dist:
        
        plt.hist(null_dist, edgecolor = "black", bins = 30);
        plt.axvline(null_hypothesis_pi - statistic_deviation, color = "red");
        plt.axvline(null_hypothesis_pi + statistic_deviation, color = "red");
        plt.axvline(null_hypothesis_pi, color = "yellow");

        
        plt.title("Pi-null is: " + str(null_hypothesis_pi) + "      "  +
                  "p-value is: " + str(round(p_value, 5)))
      
    # return the p-value
    return p_value
    

In [16]:
# test the function with the value H0: pi = .5  (as we did in class 20)



# test the function with the value H0: pi = .45



In [17]:
# create a range of H0: pi = x  values
  


In [19]:
%%time

# get the p-value for a range of H0: pi = x  values


pvalues = []





CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.2 µs


In [None]:
# view the p-values 
# by convention, a p-value < 0.05 is called "statistically significant", indicating the data are incompatible with that null hypothesis pi value
# our confidence interval is all pi values that are not statistically significant (i.e., pi values whose H0 we fail to reject)






In [None]:
# Get all plausible Pi values




In [None]:
# get the CI as the max and min plausible pi values 




In [None]:
# using the statsmodels package to compute a confidence interval for a proportion

import statsmodels.api as sm
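
For comparison, a parametric interval can also be computed by hand. The sketch below uses the normal approximation, $\hat{p} \pm z^* \sqrt{\hat{p}(1 - \hat{p})/n}$, with only the Python standard library. The `proportion_confint` function in `statsmodels.stats.proportion` should give a matching result with its normal method, but that call is not run here, so treat the correspondence as an assumption to verify.

```python
from math import sqrt
from statistics import NormalDist

n_movies = 1794
n_passed = 803
p_hat = n_passed / n_movies   # observed proportion

# Critical value for a 95% interval (about 1.96)
z_star = NormalDist().inv_cdf(0.975)

# Standard error of p-hat under the normal approximation
standard_error = sqrt(p_hat * (1 - p_hat) / n_movies)

wald_ci = (p_hat - z_star * standard_error, p_hat + z_star * standard_error)
print(wald_ci)
```

The result should be close to the interval obtained by collecting the pi values that the hypothesis test fails to reject.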




