# Class 20: Hypothesis tests continued

Plan for today:
- Practice of statistical inference and hypothesis tests for a single proportion
- Hypothesis tests for assessing causality (comparing two proportions)
- Hypothesis tests for comparing two means  


## Notes on the class Jupyter setup

If you have the *ydata123_2024a* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [1]:
import YData

# YData.download.download_class_code(20)   # get class code    
# YData.download.download_class_code(20, True) # get the code with the answers 

YData.download_data("bta.csv")
YData.download_data("babies.csv")

The file `bta.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `babies.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.


If you are using Google Colab, install the YData package by uncommenting and running the code below.

In [2]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

## Hypothesis tests

In hypothesis testing, we start with a claim about a population parameter (e.g., µ = 4.2, or π = 0.25).

This claim implies we should get a certain distribution of statistics, called the "null distribution". 

If our observed statistic is highly unlikely to come from the null distribution, we reject the claim. 

We can break down the process of running a hypothesis test into 5 steps. 

1. State the null and alternative hypotheses
2. Calculate the observed statistic of interest
3. Create the null distribution 
4. Calculate the p-value 
5. Make a decision

Let's run through these steps now by doing one more practice problem running a hypothesis test for a single proportion!


## 1. Warm up: Hypothesis test for a single proportion - sinister lawyers

10% of the American population is left-handed. A study found that out of a random sample of 105 lawyers, 16 were left-handed. Use our 5 steps of hypothesis testing to assess whether the proportion of left-handed lawyers is greater than the proportion found in the American population. 


### Step 1: State the null and alternative hypotheses

**In words** 

Null hypothesis: 

Alternative hypothesis:


**In symbols**

$H_0$: 

$H_A$: 


### Step 2: Calculate the observed statistic

Calculate the observed statistic and save it to the name `obs_stat`. 

What symbol should we use to denote this observed statistic? 
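
The observed statistic here is the sample proportion $\hat{p}$ of left-handed lawyers, which follows directly from the counts in the problem statement. A minimal sketch (the names `n_lawyers` and `n_left_handed` are illustrative and not part of the class code):

```python
# Counts taken from the problem statement
n_lawyers = 105       # sample size
n_left_handed = 16    # left-handed lawyers in the sample

# Observed sample proportion (p-hat)
obs_stat = n_left_handed / n_lawyers
print(round(obs_stat, 4))
```

The value is about 0.152, somewhat above the 0.10 stated for the general population. The remaining steps assess whether a gap of this size is plausible by chance alone.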


### Step 3: Create the null distribution 

To create the null distribution let's use the code we wrote last class to simulate the proportion of heads we get from flipping a coin *n* times. 

The code `generate_proportion(n, prob_heads)` below can be used to simulate flipping a coin *n* times with a probability of getting heads given by the argument `prob_heads` (make sure you understand how this code works!). 

Use this code to simulate one statistic $\hat{p}$ that is consistent with the null hypothesis. 

In [5]:
def generate_proportion(n, prob_heads):
    random_sample = np.random.rand(n) <= prob_heads
    return np.mean(random_sample)


# simulate one random statistic consistent with the null hypothesis

generate_proportion(105, .1)


0.10476190476190476

Now generate a null distribution, by using a for loop to create 10,000 statistics consistent with the null hypothesis. Store this null distribution in an object called `null_dist`. 
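
One way the loop might look is sketched below. It reuses `generate_proportion` as defined above; the fixed seed is an assumption added only to make the sketch repeatable, and your own solution may differ.

```python
import numpy as np

def generate_proportion(n, prob_heads):
    # Same helper as above: proportion of "heads" in n simulated flips
    random_sample = np.random.rand(n) <= prob_heads
    return np.mean(random_sample)

np.random.seed(123)  # assumption: seed chosen arbitrarily for repeatability

# Build the null distribution one simulated statistic at a time
null_dist = []
for i in range(10000):
    null_dist.append(generate_proportion(105, 0.1))
```

Each entry of `null_dist` is one simulated $\hat{p}$ from a world where the true proportion is 0.10 and the sample size is 105.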

In [6]:
# Create the null distribution







Let's also visualize the null distribution as a histogram. Set the `bins` argument to 100 to create 100 bins in this histogram. 

Does the observed statistic you calculated in step 2 seem like it is likely to come from this null distribution? 
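
A possible sketch of the plot is shown below. To stay self-contained it rebuilds the null distribution in one vectorized step (an alternative to the for loop, not the class's approach) and adds a vertical line at the observed statistic; the seed, colors, and axis labels are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)  # assumption: arbitrary seed for repeatability

# 10,000 simulated samples of 105 lawyers under the null (pi = 0.10)
null_dist = (np.random.rand(10000, 105) <= 0.1).mean(axis=1)
obs_stat = 16 / 105

# Histogram of the null distribution with 100 bins
counts, bin_edges, _ = plt.hist(null_dist, bins=100)

# Mark the observed statistic for comparison
plt.axvline(obs_stat, color="red")
plt.xlabel("Simulated proportion of left-handed lawyers")
plt.ylabel("Count")
```

Because a sample of 105 can only produce proportions that are multiples of 1/105, many of the 100 bins are empty and the histogram looks like a set of spikes.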

In [7]:
# visualize the null distribution 



### Step 4: Calculate the p-value

The p-value is the proportion of statistics in the null distribution that are as extreme as, or more extreme than, the observed statistic. 
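
Since the alternative here is one-sided (greater than), "extreme" means at least as large as the observed statistic. A minimal sketch, assuming a vectorized null distribution and an arbitrary seed (both are illustrative choices):

```python
import numpy as np

np.random.seed(123)  # assumption: arbitrary seed for repeatability

# Null distribution: 10,000 simulated proportions with n = 105, pi = 0.10
null_dist = (np.random.rand(10000, 105) <= 0.1).mean(axis=1)
obs_stat = 16 / 105

# Proportion of simulated statistics at least as large as the observed one
p_value = np.mean(null_dist >= obs_stat)
print(p_value)
```

Using `>=` rather than `>` counts simulated statistics that exactly tie the observed value, which matters here because the simulated proportions are discrete.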


### Step 5: Draw a conclusion

Is there convincing evidence to reject the null hypothesis? 

Do you believe lawyers are more sinister than the general American public? 


<br>
<center>
<img src="https://github.com/emeyers/YData/blob/main/ClassMaterial/images/lawyer.jpg?raw=true" alt="lawyer" style="width: 200px;"/>
</center>
<br>

## 2. Hypothesis test assessing causal relationships

To get at causality we can run a Randomized Controlled Trial (RCT), where half of the participants are randomly assigned to a "treatment group" that receives an intervention and the other half are put in a "control group" that receives a placebo. If the treatment group shows an improvement over the control group that is larger than what is expected by chance, this indicates that the treatment **causes** an improvement. 


#### Botulinum Toxin A (BTA) as a treatment for chronic back pain

A study by Foster et al (2001) examined whether Botulinum Toxin A (BTA) was an effective treatment for chronic back pain.

In the study, participants were randomly assigned to be in a treatment or control group: 
- 15 in the treatment group
- 16 in the control group (normal saline)

Trials were run double-blind (neither doctors nor patients knew which group a patient was in).

Results from the study were coded as:
  - 1 indicates pain relief
  - 0 indicates lack of pain relief 


Let's run a hypothesis test to see if BTA causes a decrease in back pain.

### Step 1: State the null and alternative hypotheses


$H_0$: 

$H_A$:




### Step 2: Calculate the observed statistic

The code below loads the data from the study. We can use the difference in proportions  $\hat{p}_{treat} - \hat{p}_{control}$  as our observed statistic. 

Let's calculate the observed statistic and save it to the name `obs_stat`.


In [8]:
bta = pd.read_csv('bta.csv')
bta.sample(frac = 1)

Unnamed: 0,Group,Result
26,Treatment,0.0
6,Control,0.0
2,Control,0.0
9,Control,0.0
29,Treatment,0.0
15,Control,0.0
22,Treatment,1.0
13,Control,0.0
30,Treatment,0.0
3,Control,0.0
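
The group proportions and their difference can be computed with a `groupby`. The sketch below uses a small made-up table (`toy`, not the study data) with the same `Group` and `Result` columns, and shows one way a function like `get_prop_diff` might be written; the class solution may differ.

```python
import pandas as pd

# Made-up stand-in data with the same column names as bta.csv
toy = pd.DataFrame({
    "Group": ["Treatment"] * 5 + ["Control"] * 5,
    "Result": [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
})

def get_prop_diff(bta_data):
    # Mean of a 0/1 column is the proportion of 1s (pain relief) per group
    group_props = bta_data.groupby("Group")["Result"].mean()
    # Difference in proportions: treatment minus control
    return group_props["Treatment"] - group_props["Control"]

obs_stat = get_prop_diff(toy)
print(obs_stat)
```

With the toy table, 3 of 5 treatment rows and 1 of 5 control rows are coded 1, so the difference is about 0.4. Applying the same function to `bta` would give the observed statistic for the real study.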


In [9]:
# create a DataFrame with the proportion of people in the treatment and control groups that have pain relief 




In [10]:
# calculate the difference


# extract the value from a series to 




In [11]:
# let's write a function to make it easy to get statistic values

def get_prop_diff(bta_data):

    ...


# Try the function out




### Step 3: Create the null distribution 

To create the null distribution, we need to create statistics consistent with the null hypothesis. 

In this example, if the null hypothesis was true, then there would be no difference between the treatment and control group. Thus, under the null hypothesis, we can shuffle the group labels and get equally valid statistics. 

Let's create one statistic consistent with the null distribution to understand the process. We can then repeat this 10,000 times to get a full null distribution. 
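
A single shuffle can be sketched as follows, again on a small made-up table (`toy`) rather than the study data. The key detail is to shuffle only the group labels while leaving the results in place; the variable names are illustrative.

```python
import pandas as pd

# Made-up stand-in data with the same column names as bta.csv
toy = pd.DataFrame({
    "Group": ["Treatment"] * 5 + ["Control"] * 5,
    "Result": [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
})

shuffled = toy.copy()
# sample(frac=1) returns the labels in random order; .values drops the
# index so pandas does not re-align the labels back to their original rows
shuffled["Group"] = toy["Group"].sample(frac=1).values

# One statistic consistent with the null hypothesis
props = shuffled.groupby("Group")["Result"].mean()
one_null_stat = props["Treatment"] - props["Control"]
print(one_null_stat)
```

The shuffled table still has 5 treatment and 5 control labels and the same results, but any link between label and result has been broken, which is exactly what the null hypothesis claims.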

In [12]:
# shuffle the data 






In [13]:
# get one statistic consistent with the null distribution 



In [14]:
%%time

# create a full null distribution 









CPU times: user 4 µs, sys: 2 µs, total: 6 µs
Wall time: 29.1 µs


In [15]:
# visualize the null distribution 




# put a line at the observed statistic value





### Step 4: Calculate the p-value

The p-value is the proportion of statistics in the null distribution that are as extreme as, or more extreme than, the observed statistic. 
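
Steps 3 and 4 can be sketched together as below. This version uses made-up outcomes and plain NumPy arrays instead of a DataFrame (a swapped-in choice that keeps the loop fast); the helper `prop_diff`, the seed, and the data are all illustrative assumptions.

```python
import numpy as np

# Made-up outcomes (1 = pain relief) and group membership
results = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
is_treatment = np.array([True] * 5 + [False] * 5)

def prop_diff(results, is_treatment):
    # Proportion with relief in treatment minus proportion in control
    return results[is_treatment].mean() - results[~is_treatment].mean()

obs_stat = prop_diff(results, is_treatment)

np.random.seed(123)  # assumption: arbitrary seed for repeatability
null_dist = []
for i in range(10000):
    # Shuffle the group labels, keep the results fixed
    shuffled_labels = np.random.permutation(is_treatment)
    null_dist.append(prop_diff(results, shuffled_labels))

# One-sided p-value; the tiny tolerance guards against floating-point ties
p_value = np.mean(np.array(null_dist) >= obs_stat - 1e-12)
print(p_value)
```

The alternative hypothesis here is directional (treatment gives more pain relief), so only the upper tail of the null distribution is counted.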


### Step 5: Draw a conclusion






<br>
<center>
<img src="https://image.spreadshirtmedia.com/image-server/v1/compositions/T347A2PA4306PT17X24Y42D1035176833W20392H24471/views/1,width=550,height=550,appearanceId=2,backgroundColor=000000,noPt=true/ok-but-first-botox-fillers-botox-funny-botox-womens-t-shirt.jpg" alt="botox" style="width: 200px;"/>
</center>
<br>

## 3. Smoking and baby weights

The Child Health and Development Studies investigate a range of topics. One study, in particular, considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area.

Let's examine this data to see if the average weight of babies is different depending on whether the mother of the baby smokes. 


### Step 1: State the null and alternative hypotheses


$H_0$: 

$H_A$: 



### Step 2: Calculate the observed statistic

The code below loads the data from the study. The two relevant columns are:
- `bwt`: The birth weight of the baby in ounces
- `smoke`: whether the mother smokes (1) or does not smoke (0)

More information about the data is available at: https://www.openintro.org/data/index.php?data=babies

In [16]:
babies = pd.read_csv("babies.csv")

babies.head()

Unnamed: 0,case,bwt,gestation,parity,age,height,weight,smoke
0,1,120,284.0,0,27.0,62.0,100.0,0.0
1,2,113,282.0,0,33.0,64.0,135.0,0.0
2,3,128,279.0,0,28.0,64.0,115.0,1.0
3,4,123,,0,36.0,69.0,190.0,0.0
4,5,108,282.0,0,23.0,67.0,125.0,1.0


To simplify the analysis, create a new DataFrame called `babies2` that only has the smoke and bwt columns. 

In [17]:
# create a DataFrame called babies2 that has only the smoke and bwt columns




Let's have our observed statistic be the difference of sample means  $\bar{x}_{non-smoke} - \bar{x}_{smoke}$.  

Please calculate this observed statistic and save it to the name `obs_stat`.


To make the rest of the analysis easier, write a function `get_diff_baby_weights(babies_df)` that takes a DataFrame `babies_df` with `smoke` and `bwt` columns and returns the difference between the mean birth weight of babies whose mothers do not smoke and the mean birth weight of babies whose mothers do smoke. 

Also, test the function to make sure it gives the same observed statistic you calculated above.
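
As a sketch of these steps, the code below rebuilds only the five rows printed by `babies.head()` above (in a frame named `toy_babies`, an illustrative name) and shows one way the function might be written. The result is for those five rows only, not for the full dataset.

```python
import pandas as pd

# The five rows shown by babies.head(), restricted to the relevant columns
toy_babies = pd.DataFrame({
    "case": [1, 2, 3, 4, 5],
    "bwt": [120, 113, 128, 123, 108],
    "smoke": [0.0, 0.0, 1.0, 0.0, 1.0],
})

# Keep only the smoke and bwt columns
babies2 = toy_babies[["smoke", "bwt"]]

def get_diff_baby_weights(babies_df):
    # Mean birth weight for each value of smoke (0.0 and 1.0)
    group_means = babies_df.groupby("smoke")["bwt"].mean()
    # Non-smoking mean minus smoking mean
    return group_means.loc[0.0] - group_means.loc[1.0]

obs_stat = get_diff_baby_weights(babies2)
print(obs_stat)
```

By default `groupby` leaves out rows whose `smoke` value is missing, so any such rows in the full data would not count toward either mean.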

In [18]:
def get_diff_baby_weights(babies_df):

    ...


### Step 3: Create the null distribution 

Now let's create a null distribution that has 10,000 statistics that are consistent with the null hypothesis. 

In this example, if the null hypothesis was true, then there would be no difference between the smoking mothers and the non-smoking mothers. Thus, under the null hypothesis, we can shuffle the group labels and get equally valid statistics. 

Let's create one statistic consistent with the null distribution to understand the process. We can then repeat this 10,000 times to get a full null distribution. 

In [19]:
# shuffle the data 





In [20]:
%%time

# create a full null distribution 








CPU times: user 1e+03 ns, sys: 1 µs, total: 2 µs
Wall time: 3.81 µs


In [21]:
# visualize the null distribution 




# put a line at the observed statistic value




### Step 4: Calculate the p-value

The p-value is the proportion of statistics in the null distribution that are as extreme as, or more extreme than, the observed statistic. 
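
Because the question here is whether the means are *different* (in either direction), both tails of the null distribution count as extreme. A minimal sketch, assuming the null distribution is centered near 0 as it is for a shuffled difference in means; the helper `two_sided_p_value` and the tiny `toy_null` list are illustrative, not class code.

```python
import numpy as np

def two_sided_p_value(null_dist, obs_stat):
    # Proportion of null statistics at least as far from 0 as the observed one
    null_dist = np.asarray(null_dist)
    return np.mean(np.abs(null_dist) >= np.abs(obs_stat))

# Tiny hand-made "null distribution" to check the logic by eye
toy_null = [-9.0, -3.0, -1.0, 0.0, 2.0, 4.0, 10.0, 1.0]
p_value = two_sided_p_value(toy_null, 9.0)
print(p_value)  # -9.0 and 10.0 qualify, so 2 of 8 values: 0.25
```

With a real null distribution of 10,000 shuffled differences, the same function could be applied to `null_dist` and `obs_stat` directly.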


### Step 5: Draw a conclusion






<br>
<center>
<img src="https://i.ytimg.com/vi/x4c_wI6kQyE/maxresdefault.jpg" alt="smoking" style="width: 300px;"/>
</center>
<br>