# Week 15, Lecture 02 CodeAlong: Hypothesis Testing

- xx/xx/xx

Today, we will analyze data from the crowdfunding website Kiva and answer several questions about it.

- Use your hypothesis testing skills and the ["Guide: Choosing the Right Hypothesis Test"](https://login.codingdojo.com/m/376/12533/88117) lesson from the LP.
    

- Kiva Crowdfunding Data Set:
    -  https://www.kaggle.com/datasets/kiva/data-science-for-good-kiva-crowdfunding 



### Questions to Answer

- Q1: Do all-male teams get more funding vs teams that include at least 1 female?
- Q2: Do different sectors get more/less funding?

# Hypothesis Testing

In [None]:
import json
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
import scipy
scipy.__version__

In [None]:
## load the kiva_loans.csv. display info and head
df = pd.read_csv('Data/kiva_loans.csv.gz')
df.info()
df.head()

In [None]:
## Drop null values from related columns
df = df.dropna(subset=['borrower_genders','funded_amount'])

# Setting the id as the index
df = df.set_index('id')
df.info()
df.head()

# Q1:  Do all-male teams get more funding vs teams that include at least 1 female?

## 1. State the Hypothesis & Null Hypothesis 

- $H_0$ (Null Hypothesis): There is no difference in funded amount between all-male teams and teams that include at least 1 female.
- $H_A$ (Alternative Hypothesis): There is a significant difference in funded amount between the two groups (we expect all-male teams to get more funding).

## 2. Determine the correct test to perform.
- Type of Data? Numeric
- How many groups/samples? 2 groups
- Therefore, which test is appropriate? 2-sample t-test (independent t-test)

### Visualize and separate data for hypothesis

- What column is our target?
- What column determines our groups?
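
A small sketch of why a grouping column can be built by searching for `'female'` rather than `'male'`. It assumes `borrower_genders` holds comma-separated strings such as `'female, male'`; the toy values and the `toy_genders` name below are made up for illustration. Because `'male'` is a substring of `'female'`, a search for `'male'` would match every row.

```python
import pandas as pd

# Hypothetical values in the style assumed for borrower_genders
toy_genders = pd.Series(['male', 'female', 'male, male', 'female, male'])

# 'male' is a substring of 'female', so this flags every row
contains_male = toy_genders.str.contains('male')

# Searching for 'female' separates all-male teams (False) from the rest (True)
has_female = toy_genders.str.contains('female')

print(pd.DataFrame({'borrower_genders': toy_genders,
                    'contains_male': contains_male,
                    'has_female': has_female}))
```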

In [None]:
## check the col that contains info on gender
df['borrower_genders'].value_counts()

In [None]:
## create a column that easily separates our groups
df['has_female'] = df['borrower_genders'].str.contains('female')
df['has_female'].value_counts()

In [None]:
## save list of columns needed for each group
needed_cols = ['funded_amount','has_female']
df[needed_cols]

In [None]:
## save male team in separate variable
male_df = df.loc[df['has_female']==False,needed_cols]
male_df

In [None]:
## save female team in separate variables
female_df = df.loc[ df['has_female']==True, needed_cols]
female_df

In [None]:
## Make a df just for visualization by concat the groups 
plot_df =  pd.concat([male_df,female_df])
plot_df

In [None]:

## visualize the group means
sns.barplot(data=plot_df,x='has_female',y='funded_amount')

## 3. Testing Assumptions

- No significant outliers
- Normality
- Equal Variance

### Checking Assumption of No Sig. Outliers

In [None]:
## Saving JUST the numeric col as final group variables
male_group = male_df['funded_amount']
female_group = female_df['funded_amount']
male_group


In [None]:
## Check female group for outliers
female_outliers = np.abs(stats.zscore(female_group))>3

## how many outliers?
female_outliers.value_counts()

In [None]:
## remove outliers from female_group
female_group = female_group[~female_outliers]

In [None]:
## Check male group for outliers
male_outliers = np.abs(stats.zscore(male_group))>3

## how many outliers?
male_outliers.value_counts()


In [None]:
## remove outliers from male_group
male_group = male_group[~male_outliers]
male_group

### Test for Normality

In [None]:
## Check female group for normality
result = stats.normaltest(female_group)
result

In [None]:
## Check n for female group
len(female_group)

In [None]:
## Check male group for normality
result = stats.normaltest(male_group)
result

In [None]:
## Check n for male group
len(male_group)


- Did we meet the assumption? 
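
When answering this, it can help to keep in mind that `stats.normaltest` becomes very sensitive with large samples, while the means being compared tend to behave more normally as n grows. The sketch below illustrates that idea on made-up, right-skewed data; the exponential distribution, the sample sizes, and all variable names are assumptions for illustration, not the Kiva data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical right-skewed "funded amounts" (not the real data)
skewed_sample = rng.exponential(scale=500, size=5000)

# With a sample this large and this skewed, the normality test rejects
norm_result = stats.normaltest(skewed_sample)
print(f"n = {len(skewed_sample)}, normaltest p = {norm_result.pvalue:.3g}")

# Means of repeated large samples are far less skewed than the raw values
sample_means = np.array(
    [rng.exponential(scale=500, size=5000).mean() for _ in range(200)]
)
print(f"skew of raw values:   {stats.skew(skewed_sample):.2f}")
print(f"skew of sample means: {stats.skew(sample_means):.2f}")
```

This is why the group sizes are checked alongside the normality test: a significant result on a very large group does not by itself rule out a test on the means.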

### Test for Equal Variances

In [None]:
## Use Levene's test for equal variance
result= stats.levene(female_group,male_group)
result.pvalue <0.05

In [None]:
## Use an if-else to help interpret the p-value
if result.pvalue <.05:
    print('The groups DO NOT have equal variance')
else:
    print('The groups DO have equal variance')

- Did we meet the assumptions?

## Final Hypothesis Test

- Did we meet our test's assumptions? 
    - If not, what is the alternative test?
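
As a rough sketch of the options when the equal-variance assumption fails, the example below runs Welch's t-test (`equal_var=False`) and, as a non-parametric alternative, the Mann-Whitney U test. The two groups are synthetic and their names, means, and spreads are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical groups with different means AND different variances
group_a = rng.normal(loc=500, scale=50, size=300)
group_b = rng.normal(loc=550, scale=150, size=300)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based, does not assume normality
mwu = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

print(f"Welch's t-test p = {welch.pvalue:.3g}")
print(f"Mann-Whitney U p = {mwu.pvalue:.3g}")
```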

In [None]:
## run final hypothesis test
result = stats.ttest_ind(female_group,male_group,equal_var=False)
result.pvalue <0.05

In [None]:
## make a plot or calculate group means to know which group had more/less.
female_group.mean()

In [None]:
male_group.mean()

- Final Conclusion:
    - The p-value for the test is < 0.05, so we reject the null hypothesis and support the alternative hypothesis.
    - Teams with at least one female receive a lower funded amount, on average, than all-male teams.

# Q2: Do different sectors get more/less funding?

## 1. State the Hypothesis & Null Hypothesis 

- $H_0$ (Null Hypothesis): There is no difference between funded amounts for different sectors.
- $H_A$ (Alternative Hypothesis):  There is a significant difference between funded amounts for different sectors.

## 2. Determine the correct test to perform.

- Type of Data? Numeric
- How many groups/samples? More than 2 groups (one per sector)
- Therefore, which test is appropriate? One-way ANOVA

In [None]:
## how many sectors?
df['sector'].unique()

### Visualize and separate data for hypothesis

- What column is our target?
- What column determines our groups?

In [None]:
## barplot
ax = sns.barplot(data=df,x='sector',y='funded_amount')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right');

In [None]:
needed_cols = ['sector','funded_amount']
df[needed_cols]

In [None]:
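## Create a dictionary with each group as key and funded_amount as values
groups = {}

## groupby yields (sector name, Series of funded_amount) pairs
for sector, data in df.groupby('sector')['funded_amount']:
    groups[sector] = data.copy()

groups.keys()

In [None]:
## check one of the sectors in the dict
groups['Food']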


## 3. Testing Assumptions

- No significant outliers
- Normality
- Equal Variance

### Checking Assumption of No Sig. Outliers

In [None]:
# check for one sector outlier

temp = np.abs(stats.zscore(groups['Personal Use'])) > 3
temp.value_counts()

In [None]:
## Loop through groups dict
for sector, data in groups.items():
    ## determine if there are any outliers
    outliers = np.abs(stats.zscore(data)) > 3
    ## print a statement about how many outliers for which group name
    print(f"There were {outliers.sum()} outliers in the {sector} group.")
    ## Remove the outliers from data and overwrite the sector data in the dict
    data = data.loc[~outliers]
    groups[sector] = data

### Test for Normality

In [None]:
# you can use stats.normaltest for each group separately or use the loop
stats.normaltest(groups['Food']).pvalue < 0.05

In [None]:
## Running normal test on each group and confirming there are >20 in each group

## Save a list with an inner list of column names
norm_results = [['group','n','pval','sig?']]


## loop through group dict
for sector, data in groups.items():
    ## calculate normaltest results
    stat, p = stats.normaltest(data)
    
    ## Append the right info into norm_results (as a list)
    norm_results.append([sector,len(data), p, p<.05])
    
    
## Make norm_results a dataframe (first row is columns, everything else data)
normal_results = pd.DataFrame(norm_results[1:], columns = norm_results[0])
normal_results


- Did we meet the assumption?

### Test for Equal Variances

In [None]:
## DEMO: using the * operator to unpack lists
a_list = ['a','b','c']
b_list = [1,2,3]
new_list= [*a_list, *b_list]
new_list

In [None]:
## Use Levene's test for equal variance
result = stats.levene(*groups.values())
print(result)

In [None]:
## Use an if-else to help interpret the p-value
if result.pvalue < .05:
    print(f"The groups do NOT have equal variance.")
else:
    print(f"The groups DO have equal variance.")

- Did we meet the assumption?


## Final Hypothesis Test

- Did we meet our test's assumptions? 
    - If not, what is the alternative test?
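
A minimal sketch of the two candidate tests for more than two groups: one-way ANOVA (`stats.f_oneway`) and its non-parametric alternative, the Kruskal-Wallis test (`stats.kruskal`). The `demo_groups` dict and its values are synthetic stand-ins, not the sector data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical groups; 'C' has a higher mean and a much larger spread
demo_groups = {
    'A': rng.normal(500, 50, 200),
    'B': rng.normal(520, 60, 200),
    'C': rng.normal(600, 200, 200),
}

# One-way ANOVA: assumes normality and equal variances
anova = stats.f_oneway(*demo_groups.values())

# Kruskal-Wallis: rank-based alternative when those assumptions fail
kruskal = stats.kruskal(*demo_groups.values())

print(f"ANOVA p          = {anova.pvalue:.3g}")
print(f"Kruskal-Wallis p = {kruskal.pvalue:.3g}")
```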

In [None]:
## Running Kruskal-Wallis Test for Original Hypothesis
result = stats.kruskal(*groups.values())
print(result)
result.pvalue<.05

- Interpret Results. Did we have a significant result?
- Is a post-hoc test needed?

### Post-Hoc Multiple Comparison Test

In [None]:
## Post Hoc
from statsmodels.stats.multicomp import pairwise_tukeyhsd

- Tukey's test requires a list of group names and a list of measured values. 
- Easiest way to produce and visualize this is to make our groups dict into a dataframe 
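
The reshaping described above can be sketched as follows, under the assumption that the groups are stored as a dict of Series. The `demo_*` names, sector labels, and amounts are all made up; only `pairwise_tukeyhsd` comes from statsmodels.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(321)

# Hypothetical stand-in for a groups dict (made-up names and amounts)
demo_groups = {
    'Sector A': pd.Series(rng.normal(100, 10, 50)),
    'Sector B': pd.Series(rng.normal(100, 10, 50)),
    'Sector C': pd.Series(rng.normal(200, 10, 50)),
}

# One small dataframe per group, with the group name repeated on every row
demo_dfs = []
for sector, data in demo_groups.items():
    demo_dfs.append(pd.DataFrame({'funded_amount': data, 'sector': sector}))

# Stack into a single long dataframe: one row per measured value
demo_long = pd.concat(demo_dfs, ignore_index=True)

# Tukey's test takes the measured values and their matching group labels
demo_values = demo_long['funded_amount']
demo_labels = demo_long['sector']
demo_tukey = pairwise_tukeyhsd(demo_values, demo_labels)
print(demo_tukey.summary())
```

Each row of the summary compares one pair of groups; the `reject` column indicates whether that pair differs significantly.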

#### Testing Converting our Dictionary to a DataFrame

In [None]:
## slice a test sector
temp = None


In [None]:
## test making a dataframe from the test sector and filling in the sector name


#### Preparing the new dataframe for Tukey's test in a loop

In [None]:
## make a list for saving the dataframes to


## Loop through groups dict's items


    ## make a temp_df with the data and the sector name
    
    ## append to tukeys_dfs
    
## concatenate them into 1 dataframe    


In [None]:
## save the values as funded_amount and the labels as sector
values = None
labels = None

## perform tukey's multiple comparison test and display the summary
tukeys_results = None


In [None]:
## optional -slicing out dataframe from results

In [None]:
## make a barplot of final data to go with results


In [None]:
## also can use built-in plot tukeys_results.plot_simultaneous


- Final summary of group differences