# Hypothesis Testing - Guided Practice

- ds-flex 
- xx/xx/xx

# Hypothesis Testing Steps

## STEP 1: State Your Hypothesis & Null Hypothesis 

- **Before selecting the correct hypothesis test, you must first officially state your null hypothesis ($H_0$) and alternative hypothesis ($H_A$ or $H_1$)**



- **Before stating your hypotheses, ask yourself:**
    1. What question am I attempting to answer?
    2. What metric/value am I measuring to answer this question?
    3. Do I expect the groups to be different in a specific way (e.g., one group greater than the other)?
        - Or do I just think they'll be different, but don't know how?
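
The answer to question 3 decides whether the alternative hypothesis is directional. As a minimal sketch on synthetic data (the arrays `group_a` and `group_b` are made up for illustration), the `alternative` argument of `stats.ttest_ind` encodes that choice:

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only (not the insurance dataset)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=12.0, scale=2.0, size=50)
group_b = rng.normal(loc=10.0, scale=2.0, size=50)

# Non-directional H_A: the group means differ (two-sided test)
two_sided = stats.ttest_ind(group_a, group_b, alternative="two-sided")

# Directional H_A: the mean of group_a is greater than the mean of group_b
one_sided = stats.ttest_ind(group_a, group_b, alternative="greater")

print(f"two-sided p = {two_sided.pvalue:.4g}")
print(f"one-sided p = {one_sided.pvalue:.4g}")
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one, so the direction should be chosen before looking at the data.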


## STEP 2: Determine the correct test for the data/hypothesis

<!-- based on data type and # of samples/groups -->

#### Q1: What type of data am I measuring? Is it numeric or categorical?


#### Q2: How many groups/samples am I comparing?
- One group vs a known value?
- Two groups?
- More than two groups?


#### Using the answers to the two questions above, select the type of test from this table.

|*What type of comparison?* | Numeric Data | Categorical Data|
| --- | --- | --- |
|**1 Sample vs Known Quantity/Target**|1 Sample T-Test| Binomial Test|
|**2 Samples** | 2 Sample T-Test| Chi-Square|
|**More than 2**| ANOVA and/or Tukey | Chi-Square|
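
The table can also be read as a simple lookup. The sketch below is one way to encode it; the names `TEST_TABLE` and `choose_test` are hypothetical helpers, not part of this notebook's `functions` module.

```python
# Lookup that mirrors the table above: (data type, comparison) -> test name
TEST_TABLE = {
    ("numeric", "1 sample"): "1 Sample T-Test",
    ("numeric", "2 samples"): "2 Sample T-Test",
    ("numeric", "more than 2"): "ANOVA and/or Tukey",
    ("categorical", "1 sample"): "Binomial Test",
    ("categorical", "2 samples"): "Chi-Square",
    ("categorical", "more than 2"): "Chi-Square",
}


def choose_test(data_type, n_groups):
    """Return the test suggested by the table for a data type and group count."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    if n_groups == 1:
        comparison = "1 sample"
    elif n_groups == 2:
        comparison = "2 samples"
    else:
        comparison = "more than 2"
    return TEST_TABLE[(data_type.lower(), comparison)]


print(choose_test("numeric", 2))      # e.g. charges for smokers vs. non-smokers
print(choose_test("categorical", 2))  # e.g. smoker status by sex
```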

## STEP 3: Perform the test

### Hypothesis Test Functions

`from scipy import stats`

| Hypothesis test | Function |
| --- | --- |
| **1-sample t-test** | `stats.ttest_1samp()` |
| **2-sample t-test** | `stats.ttest_ind()` |
| **One-Way ANOVA** | `stats.f_oneway()` |
| **Binomial test** | `stats.binomtest()` (replaces `stats.binom_test()`, which was removed in SciPy 1.12) |
| **Chi-Square test** | `stats.chi2_contingency()` |
 
 
 - Set your $\alpha$ level before running the test (typically $\alpha = .05$)
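
The binomial test is the only one in the table that the activity below does not exercise. A minimal sketch, assuming a recent SciPy where the function is named `stats.binomtest`; the counts (14 successes in 20 trials) are hypothetical:

```python
from scipy import stats

# Hypothetical example: 14 successes in 20 trials, tested against p = 0.5
binom_result = stats.binomtest(k=14, n=20, p=0.5, alternative="two-sided")

alpha = 0.05  # significance level chosen before running the test
print(f"p-value = {binom_result.pvalue:.4f}")
print("Reject H0" if binom_result.pvalue < alpha else "Fail to reject H0")
```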

## STEP 4: Interpret the test result and perform any post-hoc test
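
The cells in the activity below make this decision by comparing `result.pvalue` to $\alpha$. A minimal sketch of that rule as a reusable helper (the name `interpret` is hypothetical):

```python
def interpret(pvalue, alpha=0.05):
    """Compare a p-value to alpha and return the decision about H0."""
    if pvalue < alpha:
        return f"p = {pvalue:.4g} < {alpha}: reject the null hypothesis"
    return f"p = {pvalue:.4g} >= {alpha}: fail to reject the null hypothesis"


print(interpret(0.003))
print(interpret(0.2))
```

A significant result on a test with more than two groups (ANOVA) only says that some group differs; a post-hoc test such as Tukey's is then needed to find out which ones.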

# Activity: Hypothesis Testing with Insurance Data

## US Health Insurance Dataset

- https://www.kaggle.com/teertha/ushealthinsurancedataset

In [1]:
## import the standard packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


## import hypothesis testing functions
%load_ext autoreload
%autoreload 2
import functions as fn

from scipy import stats


In [2]:
# Load in the insurance.csv in the data folder and display preview
df = pd.read_csv("data/insurance.csv")
display(df.head())
df.info()

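| | age | sex | bmi | children | smoker | region | charges |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |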


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [None]:
stats.binomtest  # formerly stats.binom_test (removed in SciPy 1.12)

# Questions to Answer

>- Q1. Do smokers have higher insurance charges than non-smokers?
>- Q2. Are women and men equally likely to be smokers?
>- Q3. Do different regions have different charges, on average?

## Q1: Do smokers have higher insurance charges than non-smokers?

### Formally State the Hypothesis

- $H_1$ : Smokers have higher charges than non-smokers.

- $H_0$ : There is no difference in average charges between smokers and non-smokers.

### Determine the Correct Statistical Test 

- Q1: What type of data are we comparing?
    - A: Numeric (charges)
    
    
- Q2: How many samples/groups are being compared?
    - A: Two (smokers vs. non-smokers)


- Therefore, the correct test to perform would be:
    - A: 2-sample t-test (`stats.ttest_ind()`)

In [None]:
df

In [None]:
## let's visualize the distribution of charges for smokers and non-smokers
g = sns.displot(data=df, hue='smoker', x='charges', aspect=1.5)

In [None]:
## save all charges for smokers in a new smokers variable
smokers = df.loc[ df['smoker']=='yes','charges']
smokers

In [None]:
## now do the same for the non-smokers
nonsmokers = df.loc[ df['smoker']=='no','charges']
nonsmokers

In [None]:
## Run the correct hypothesis test from scipy
## (defaults: two-sided alternative, equal variances assumed)
result = stats.ttest_ind(smokers, nonsmokers)
result

In [None]:
## is the result significant?
result.pvalue < .05
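
`stats.ttest_ind` assumes equal variances unless told otherwise, and the histogram above suggests the two groups may differ in spread. One possible variant is sketched below on synthetic stand-in data: check the variances with Levene's test, then fall back to Welch's t-test (`equal_var=False`) if they differ. The arrays `high_group` and `low_group` are made up; in the notebook they would be `smokers` and `nonsmokers`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (illustration only)
rng = np.random.default_rng(0)
high_group = rng.normal(loc=30.0, scale=10.0, size=80)
low_group = rng.normal(loc=10.0, scale=5.0, size=300)

# Levene's test: H0 is that the groups have equal variances
levene_result = stats.levene(high_group, low_group)
equal_var = bool(levene_result.pvalue >= 0.05)

# Welch's t-test is used when equal_var is False; one-sided to match a directional H1
welch_result = stats.ttest_ind(high_group, low_group,
                               equal_var=equal_var, alternative="greater")
print(f"Levene p = {levene_result.pvalue:.4g}, equal_var = {equal_var}")
print(f"t = {welch_result.statistic:.2f}, p = {welch_result.pvalue:.4g}")
```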

In [None]:
## Make a barplot of the average charges for smokers vs non-smokers
## (seaborn >= 0.12 uses errorbar=; older versions use ci=68)
ax = sns.barplot(data=df, x='smoker', y='charges', errorbar=('ci', 68))

In [None]:
## calculate the effect size using Cohen's d
d = fn.Cohen_d(smokers,nonsmokers)
d
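
`fn.Cohen_d` lives in the local `functions` module, which is not shown here. A common definition of Cohen's d divides the difference in means by the pooled standard deviation; the sketch below implements that definition under the hypothetical name `cohen_d_sketch` and may differ in detail from the course helper.

```python
import numpy as np


def cohen_d_sketch(group1, group2):
    """Cohen's d using the pooled standard deviation (one common definition)."""
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)
    n1, n2 = len(group1), len(group2)
    # Sample variances (ddof=1) weighted by their degrees of freedom
    pooled_var = ((n1 - 1) * group1.var(ddof=1)
                  + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
    return (group1.mean() - group2.mean()) / np.sqrt(pooled_var)


# Toy data: means differ by 2 and both groups have a sample SD of 1
print(cohen_d_sketch([3, 4, 5], [1, 2, 3]))  # → 2.0
```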

### Conclusion/Interpretation

- Smokers have significantly higher insurance charges than non-smokers (p<.05), with a large effect size (Cohen's d=3.16). 

## Q2: Are women and men equally likely to be smokers?

### Formally State the Hypothesis
- $H_1$ :

- $H_0$ :

### Determine the Correct Statistical Test 

- Q1: What type of data are we comparing?
    - A: 
    
    
- Q2: How many samples/groups are being compared?
    - A: 
    
    
- Therefore, the correct test to perform would be:
    - A:

In [None]:
## Get contingency table 
observed = pd.crosstab(df['smoker'],df['sex'])
observed

In [None]:
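## NOTE: stats.chisquare is a goodness-of-fit test (with axis=None it asks whether
## all four cell counts are equal), not a test of independence; see chi2_contingency below
result = stats.chisquare(observed, axis=None)
result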

In [None]:
result.pvalue <.05

In [None]:
## perform the correct test for the hypothesis (chi-square test of independence)
result = stats.chi2_contingency(observed)
chi2, p, dof, expected = result
p
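
`stats.chi2_contingency` compares the observed counts to the counts expected if the two variables were independent. A minimal sketch on a hypothetical 2x2 table (the counts and the name `toy_table` are made up) shows where those expected counts come from:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical 2x2 table of counts (not taken from the insurance data)
toy_table = pd.DataFrame([[30, 20], [70, 80]],
                         index=["yes", "no"], columns=["group_a", "group_b"])

toy_chi2, toy_p, toy_dof, toy_expected = stats.chi2_contingency(toy_table)

# Expected counts under independence: row total * column total / grand total
toy_expected = pd.DataFrame(toy_expected,
                            index=toy_table.index, columns=toy_table.columns)
print(toy_expected)
print(f"chi2 = {toy_chi2:.3f}, p = {toy_p:.4f}, dof = {toy_dof}")
```

For a 2x2 table SciPy applies Yates' continuity correction by default (`correction=True`).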

In [None]:
sns.countplot(data=df,x='sex',hue='smoker')

In [None]:
#  pd.crosstab(df['smoker'],df['sex'],normalize='columns')

### Conclusion/Interpretation

-  

## Q3: Do different regions have different insurance charges?

### Formally State the Hypothesis
- $H_1$ :
- $H_0$ : 

### Determine the Correct Statistical Test 

- Q1: What type of data are we comparing?
    - A: 
    
    
- Q2: How many samples/groups are being compared?
    - A: 
    
    
- Therefore, the correct test to perform would be:
    - A:

In [None]:
df

In [None]:
sns.barplot(data=df, x='region', y='charges', errorbar=('ci', 68));  # seaborn >= 0.12; older versions use ci=68

In [None]:
## separate out the data by groups
groups = {}

for region in df['region'].unique():
    data = df.loc[ df['region']==region, 'charges']
    data.name = region
    groups[region] = data

groups.keys()

In [None]:
groups['southeast']

In [None]:
groups.values()

In [None]:
## perform the correct hypothesis test
result = stats.f_oneway(*groups.values())
result

In [None]:
result.pvalue < .05
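
The loop that builds `groups` can also be written with `groupby`. The sketch below shows that alternative on a small synthetic frame (`toy` and its region labels are made up) and passes the groups to `stats.f_oneway` the same way, by unpacking the dictionary values:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Small synthetic frame standing in for df (illustration only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "region": np.repeat(["north", "south", "east"], 30),
    "charges": np.concatenate([
        rng.normal(100, 10, 30),
        rng.normal(100, 10, 30),
        rng.normal(120, 10, 30),
    ]),
})

# groupby yields (label, Series) pairs, one per region
toy_groups = {name: grp for name, grp in toy.groupby("region")["charges"]}

# Unpack the Series so each group is passed as a separate argument
anova = stats.f_oneway(*toy_groups.values())
print(sorted(toy_groups))
print(f"F = {anova.statistic:.2f}, p = {anova.pvalue:.4g}")
```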

In [None]:
## but which groups are different?

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [None]:
## save the charges column as values and the region as labels
values = df['charges']
labels = df['region']

In [None]:
## perform tukey's multiple comparison test
tukeys_results = pairwise_tukeyhsd(values,labels)
tukeys_results.summary()

In [None]:
tukeys_results.plot_simultaneous();

In [None]:
sns.barplot(data=df, x='region', y='charges', errorbar=('ci', 68));  # seaborn >= 0.12; older versions use ci=68

In [None]:
tukeys_results.pvalues

In [None]:
tukeys_results.groupsunique

### Conclusion/Interpretation

-  

## Your Turn! Think of a 4th Hypothesis to test.

- example: Is our sample's BMI representative of the national average BMI? (See the pdf in the data folder for the stats)

### Formally State the Hypothesis
- $H_1$ :
- $H_0$ : 

### Determine the Correct Statistical Test 

- Q1: What type of data are we comparing?
    - A: 
    
    
- Q2: How many samples/groups are being compared?
    - A: 
    
    
- Therefore, the correct test to perform would be:
    - A:
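
For the BMI example, one sample is compared against a known value, which points to a 1-sample t-test. A minimal sketch on synthetic data follows; `assumed_population_mean` is a hypothetical placeholder and must be replaced with the figure from the pdf in the data folder, and `sample_bmi` would be `df['bmi']` in the notebook.

```python
import numpy as np
from scipy import stats

# Synthetic BMI-like sample (illustration only, not the insurance data)
rng = np.random.default_rng(7)
sample_bmi = rng.normal(loc=30.5, scale=6.0, size=200)

# HYPOTHETICAL placeholder; replace with the figure from the pdf in the data folder
assumed_population_mean = 29.0

one_samp = stats.ttest_1samp(sample_bmi, popmean=assumed_population_mean)
print(f"t = {one_samp.statistic:.2f}, p = {one_samp.pvalue:.4g}")
print("Significant at alpha = .05:", one_samp.pvalue < 0.05)
```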

# Conclusion/Recap