# Notebook 5: T-Tests and Chi-Square

In this notebook, we're going to focus on two statistical tests:

> A t-test of means

> A Chi-Square test


## 1.0 Reading in our libraries, our dataset, and renaming our variables

Just the intro material!  Remember, you need to run all the cells in order (libraries, read data, then rename data); otherwise, Python will give you an error message!

In [None]:
#Because we're expanding our toolkit to do statistical tests, I need to install a new library 
#that allows me to do a t-test of means and the chi-square test.  Because researchpy is not part of the
#datahub environment, I have to install the library before I can import it
!pip install researchpy

In [None]:
# First, We're going to call in our libraries
from IPython.display import Image
import researchpy as rp
import numpy as np
import pandas as pd
import math
from scipy import stats
from scipy.stats import ttest_ind, chi2_contingency
import seaborn as sns
import matplotlib.pyplot as plt
import scipy


pd.options.display.float_format = '{:.4f}'.format

In [None]:
#Show our plots in the Jupyter notebook
%matplotlib inline

In [None]:
#When we start working with nan (missing) values, we can get warnings - we're going to ignore them here
import warnings
warnings.filterwarnings("ignore") 

In [None]:
#Now we're going to read in our data and work with the same extract as we have been

chis_df = pd.read_csv('chis_extract_2022_weights.csv')
chis_df

In [None]:
chis_df.rename(columns={"SRAGE_P1": "age", "AE_VEGI":"ate_veg",
                        "SRSEX": "sex",
                        "OMBSRR_P1": "race_ethnicity",
                        "POVLL" : "pov_cat",
                       "AK22_P1" : "hh_inc",
                       "AM184": "housing_worry",
                       "CV7_1":"covid_lostjob"}, inplace=True)

In [None]:
chis_df = chis_df[['age', 'ate_veg', 'sex', 'race_ethnicity', 'pov_cat', 'hh_inc', 'housing_worry', 'covid_lostjob']]

In [None]:
chis_df

### Codebook

> AE_VEGI: Number of times respondent eats vegetables per week

> SRSEX: Self-reported Sex (1= Male, 2=Female)

> OMBSRR_P1: Race/ethnicity
(1=Hispanic, 2= White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)

> POVLL: poverty level
(1 = 0-99% FPL, 2=100-199% FPL, 3=200-299% FPL, 4=300% FPL and above)

> AK22_P1: Household Income

> AM184: How Often Worry about Paying Rent/Mortgage
(1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)

> CV7_1: Lost Job due to COVID (1=Yes, 2=No)

### 1.1 Cleaning my variables

This should all look familiar - they are just the code cells from my other notebooks!

In [None]:
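#only keep people who said they ate veggies 70 times a week or fewer (at most 10 times a day)
#.copy() gives us an independent DataFrame, so adding new columns to it later is safe
chis_df = chis_df[chis_df['ate_veg'] < 71].copy()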

In [None]:
# I decided I want to group together anyone that expresses concern, 
#so I'm going to assign a 1 to 1,2,3, and a 0 to anyone who almost never worries
chis_df['housing_worry_dv']=chis_df['housing_worry'].map({1:1, 2:1, 3:1, 4:0})
pd.crosstab(chis_df['housing_worry_dv'], columns='count')

In [None]:
#Because I am most concerned about households living under the poverty line, 
#I'm going to create a dummy where 1 = under the poverty line, and 0 is above

chis_df['inpoverty_dv']=chis_df['pov_cat'].map({1:1, 2:0, 3:0, 4:0})
pd.crosstab(chis_df['inpoverty_dv'], columns='Total')

In [None]:
#Same idea for losing a job due to COVID: 1 = Yes, 0 = No
chis_df['lostjob_dv']=chis_df['covid_lostjob'].map({1:1, 2:0})
pd.crosstab(chis_df['lostjob_dv'], columns='count')
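
One thing to keep in mind with `.map()`: any code that isn't in the dictionary quietly becomes NaN instead of raising an error. Here is a minimal sketch with made-up values (not CHIS data). The `-1` code is hypothetical and only stands in for "some value I forgot to recode."

```python
import pandas as pd

# Toy responses, NOT CHIS data; -1 is a made-up code that is not in the recode dictionary
toy = pd.Series([1, 2, 3, 4, -1], name="housing_worry")

# Same recode as above: 1, 2, 3 = worries (1); 4 = almost never worries (0)
toy_dv = toy.map({1: 1, 2: 1, 3: 1, 4: 0})

# Codes missing from the dictionary become NaN rather than causing an error
print(toy_dv)

# dropna=False keeps the NaN group visible when counting
counts = toy_dv.value_counts(dropna=False)
print(counts)
```

If the NaN count is larger than you expect, that is usually a sign that a code is missing from the dictionary.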

##  2.0  Testing Bivariate Relationships

### 2.1 Hypothesis

Let's start by reminding ourselves why we're doing all this data cleaning!  I am interested in understanding whether someone who lost their job due to COVID is more likely to be concerned about paying their rent, which I can use to argue for an extension of eviction moratoria or greater rent relief.

**My hypothesis is that people who lost their job due to COVID are more likely to be concerned about paying their rent.**

>  Y Variable: How Often Worry about Paying Rent/Mortgage  (AM184 - renamed housing_worry)
    (1=Very often, 2=Somewhat Often, 3=From Time to Time, 4=Almost Never)
    
    > dummy is housing_worry_dv

>  X Variable: lost job due to COVID  (CV7_1  - renamed covid_lostjob)

    > dummy is lostjob_dv

>  Alternate X Variable: Categorical poverty level (POVLL - renamed pov_cat)

    > dummy is inpoverty_dv

**I'm also going to look at whether folks in poverty eat fewer vegetables, mostly to demonstrate code and concepts!**

        > Y Variable: Eat Vegetables (AE_VEGI - renamed ate_veg)
    
        > X Variable: Categorical poverty level (POVLL - renamed pov_cat)
        
            >dummy is inpoverty_dv


### 2.2 Conducting the tests

Now that I've cleaned my data, I can start to explore whether or not there are relationships between my Y and X variables.  First, I'm going to explore whether the average number of veggies a person consumes differs by my poverty variable (a t-test of means, because "ate_veg" is a numeric variable).  Then, I'm going to see if a greater proportion of people who lost their job during COVID are concerned about paying their rent/mortgage (a Chi-Square test, because both are dummy variables).

#### Because the first test is trying to understand whether the *average* number of veggies eaten (numeric) varies by poverty status (dummy), my first test is going to be a t-test of means.

In [None]:
#First, I'm going to look at the average number of veggies eaten by my poverty dummy
chis_df["ate_veg"].groupby(chis_df["inpoverty_dv"]).mean()

In [None]:
#here's where I'm going to take advantage of the researchpy library and run the ttest function
rp.ttest((chis_df[chis_df['inpoverty_dv']==0].ate_veg), (chis_df[chis_df['inpoverty_dv']==1].ate_veg))

In [None]:
#I could also use the ttest_ind function from scipy.stats, but I like the output from researchpy

ttest_ind(chis_df[chis_df['inpoverty_dv'] == 1].ate_veg, chis_df[chis_df['inpoverty_dv'] == 0].ate_veg, equal_var = False, nan_policy="omit")

#The equal_var option allows you to specify whether you think the variances
#of the two samples are the same.  Try it and see what happens when you assume equal variances.  

#Setting equal_var=False runs Welch's t-test, which does not pool the two variances and
#will usually give you a more conservative estimate of statistical significance.  

#The nan_policy="omit" option tells Python to omit observations where the data are missing.
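
To see what `equal_var=False` is doing under the hood, here is a minimal sketch that computes the unpooled (Welch) t-statistic by hand and compares it to `ttest_ind`. The two arrays are made-up numbers, not CHIS data, and the names `group_a` and `group_b` are just for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up weekly veggie counts for two groups, NOT CHIS data
group_a = np.array([7.0, 10.0, 14.0, 5.0, 21.0, 9.0, 12.0, 8.0])
group_b = np.array([4.0, 7.0, 6.0, 10.0, 3.0, 8.0])

# Standard error of the difference in means when the variances are NOT pooled
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))

# t-statistic: difference in sample means divided by that standard error
t_by_hand = (group_a.mean() - group_b.mean()) / se

# scipy's version of the same test (Welch), plus the pooled version for comparison
welch = ttest_ind(group_a, group_b, equal_var=False)
pooled = ttest_ind(group_a, group_b, equal_var=True)

print(f"t by hand: {t_by_hand:.4f}")
print(f"Welch:     t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
print(f"Pooled:    t = {pooled.statistic:.4f}, p = {pooled.pvalue:.4f}")
```

The hand-computed statistic should match the Welch result; the pooled version will generally differ when the two groups have different sizes and spreads.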

#### Now I want to compare two categorical (dummy) variables, so I'm going to use a Chi-Square test

In [None]:
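#let's check first and make sure we have at least 5 observations in each cell
#(the rule of thumb is technically about *expected* counts, but observed counts are a quick first check)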
pd.crosstab(index=chis_df["housing_worry_dv"], columns=chis_df["lostjob_dv"], margins=True)

In [None]:
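#same table, but as column proportions: within each lostjob_dv group, the share who worry (or don't)
pd.crosstab(index=chis_df["housing_worry_dv"], columns=chis_df["lostjob_dv"], margins=True, normalize='columns')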
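
If `normalize='columns'` feels abstract, here is a tiny sketch with ten made-up respondents (not CHIS data) where the shares are easy to check by eye. Each column is divided by its own total, so every column adds up to 1.

```python
import pandas as pd

# Ten made-up respondents, NOT CHIS data
toy = pd.DataFrame({
    "housing_worry_dv": [1, 1, 0, 0, 0, 1, 0, 1, 1, 0],
    "lostjob_dv":       [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
})

# Column proportions: within each lostjob_dv group, what share worries?
col_shares = pd.crosstab(index=toy["housing_worry_dv"],
                         columns=toy["lostjob_dv"],
                         normalize="columns")
print(col_shares)

# Each column should sum to 1 because it is divided by its own total
print(col_shares.sum(axis=0))
```

In this toy table, 3 of the 4 people who lost a job worry about housing, compared with 2 of the 6 who did not.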

In [None]:
#here's researchpy again, this time for the chi-square test
rp.crosstab(chis_df["housing_worry_dv"], chis_df["lostjob_dv"], prop="col", test="chi-square")
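
We imported `chi2_contingency` from scipy.stats earlier, so here is a hedged sketch of how it could be used as a cross-check. The counts below are made up (not CHIS results). `chi2_contingency` also returns the expected counts under independence, which is what the "at least 5 per cell" rule of thumb is really about. I set `correction=False` to turn off the Yates continuity correction that scipy applies to 2x2 tables by default; this sketch does not assume anything about which setting researchpy uses, so compare the outputs yourself.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts, NOT CHIS results
toy_table = pd.DataFrame(
    [[300, 40],    # housing_worry_dv = 0
     [200, 60]],   # housing_worry_dv = 1
    index=pd.Index([0, 1], name="housing_worry_dv"),
    columns=pd.Index([0, 1], name="lostjob_dv"),
)

# correction=False turns off the Yates continuity correction for 2x2 tables
chi2, p_value, dof, expected = chi2_contingency(toy_table, correction=False)

print(f"chi2 = {chi2:.4f}, p = {p_value:.4f}, dof = {dof}")

# Expected counts if worry and job loss were unrelated
expected_df = pd.DataFrame(expected, index=toy_table.index, columns=toy_table.columns)
print(expected_df)
```

With the real data, you could pass in the output of `pd.crosstab(...)` without `margins=True`, since the totals row and column should not be part of the test.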

## 3.0 Conclusion

The principles of testing for statistical significance are exactly the same as with the ACS; now we're just using different tests with different probability distributions.  The most important thing is to focus on the **meaning** of what you're testing. That, together with the p-value, will allow you to interpret a much broader range of research results!