# Notebook 5: Inferential Statistics in Python: T-Tests, Chi-Square, and Correlation


This notebook provides code and explanations for three types of statistical tests in Python: t-tests, chi-square tests, and correlation. The examples continue to use the CHIS data to explore relationships between "drank sweet drinks" and a number of explanatory (independent) variables.

### Codebook

> AC46: Number of times respondent drank sweet fruit drinks in past month

> AC47: Number of times respondent drank water yesterday

> AE_VEGI: Number of times respondent eats vegetables per week

> AC42: Number of times respondent was able to find fresh fruits/vegetables in neighborhood
(1=Never, 2=Sometimes, 3=Usually, 4=Always, 5=Doesn't eat f/v, 6=Doesn't shop for f/v, 7=Doesn't shop in neighborhood)

> SRSEX: Self-reported sex (1=Male, 2=Female)

> OMBSRR_P1: Race/ethnicity
(1=Hispanic, 2=White NH, 3=Black NH, 4=AmIndian/Alaska Native NH, 5=Asian NH, 6=Other or two or more)

> POVLL: poverty level
(1 = 0-99% FPL, 2=100-199% FPL, 3=200-299% FPL, 4=300% FPL and above)

> POVGWD_P1: Family Poverty Threshold Level

> RAKEDW0: Individual weight

## 1 Libraries

We're going to bring in our libraries. You'll notice some new libraries and functions: "scipy" is a library that includes statistical analysis functions, and we're going to import its t and ttest_ind functions. I'm also going to display numbers with 4 decimal places.

In [None]:
#Call our libraries; note, we are adding some libraries to our notebook
from statsmodels.stats.proportion import proportions_ztest
import numpy as np
import pandas as pd
import math
from scipy import stats
from scipy.stats import t
from scipy.stats import ttest_ind
from scipy.stats import pearsonr
from datascience import *

pd.options.display.float_format = '{:.4f}'.format


In [None]:
import matplotlib
import matplotlib.pyplot as plt
import scipy 

In [None]:
%matplotlib inline

In [None]:
%pip install researchpy

In [None]:
%pip install pingouin

In [None]:
import researchpy as rp
import pingouin as pg

In [None]:
#When we start working with nan (missing) values, we can get warnings - we're going to ignore them here
import warnings
warnings.filterwarnings("ignore") 

## 2 Bring in data and code variables

In [None]:
chis_df = pd.read_csv('chis_extract_2018_weights.csv')
chis_df

In [None]:
chis_df.rename(columns={'AC47':'drank_water', 
                        'AC42':'nhood_fv', 
                        'AE_VEGI':'ate_fv',
                        'SRSEX': 'sex',
                        'AC46': 'drank_sweet',
                        'OMBSRR_P1': 'race_ethnicity',
                        'POVGWD_P1' : 'pov_ratio',
                       'POVLL' : 'pov_cat',
                       'RAKEDW0': 'weight'}, inplace=True)
chis_df

In [None]:
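#drop observations more than 3 standard deviations from the mean of the "drank_sweet" variable
#(rows where drank_sweet is missing are dropped too, because a NaN z-score fails the "< 3" check)
#.copy() gives us an independent dataframe so the new columns below are added cleanly
chis_df=chis_df[(np.abs(stats.zscore(chis_df['drank_sweet'], nan_policy='omit'))<3)].copy()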
chis_df['drank_water'].describe()

In [None]:
# recode the '99' missing-value code in drank_water as NaN (missing)
chis_df.loc[(chis_df.drank_water == 99),'drank_water']=np.nan
chis_df['drank_water'].describe()

In [None]:
#Dummy for whether the respondent is in poverty
chis_df['inpoverty_dv']=chis_df['pov_cat'].map({1:1, 2:0, 3:0, 4:0})

In [None]:
#Dummy for whether the person can never or only sometimes find fresh fruits and vegetables in their neighborhood
chis_df['nofv_dv']=chis_df['nhood_fv'].map({1:1, 2:1, 3:0, 4:0, 5:np.nan, 6:np.nan, 7:np.nan})

In [None]:
#Text race/ethnicity variable
chis_df.loc[(chis_df['race_ethnicity'] == 2), 'race_eth_text'] = 'NHWhite'  
chis_df.loc[(chis_df['race_ethnicity']==5), 'race_eth_text'] = "Asian"
chis_df.loc[(chis_df['race_ethnicity']==3), 'race_eth_text'] = "Black"
chis_df.loc[(chis_df['race_ethnicity']==1), 'race_eth_text'] = "Hispanic"
chis_df.loc[(chis_df['race_ethnicity']==4), 'race_eth_text'] = "Other/Two Races"
chis_df.loc[(chis_df['race_ethnicity']==6), 'race_eth_text'] = "Other/Two Races"

In [None]:
#A series of dummies for my race ethnicity variable
chis_df['hispanic_dv']=np.where((chis_df['race_ethnicity'] == 1), 1,0)
chis_df['white_dv']=np.where((chis_df['race_ethnicity'] == 2), 1,0)
chis_df['black_dv']=np.where((chis_df['race_ethnicity'] == 3), 1,0)
chis_df['asian_dv']=np.where((chis_df['race_ethnicity'] == 5), 1,0)
chis_df['other_dv']=np.where((chis_df['race_ethnicity'] == 4) | (chis_df['race_ethnicity'] == 6), 1,0)
chis_df

## 3 The T-Test

The t-test is used when we are comparing the means of two groups. In this case, I'll be looking at whether the average number of sweet drinks/sodas is statistically different for those in and out of poverty.
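
To make the mechanics concrete, here is a minimal sketch on made-up data. The `toy_*` names and the Poisson-generated values are assumptions for illustration only, not CHIS results. It computes the unequal-variance (Welch) t statistic by hand and checks it against `ttest_ind`. The toy data has no missing values, so `nan_policy` is not needed here.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up data: two hypothetical groups with different means and variances
rng = np.random.default_rng(0)
toy_in_pov = rng.poisson(6, 200).astype(float)   # hypothetical "in poverty" group
toy_not_pov = rng.poisson(4, 800).astype(float)  # hypothetical "not in poverty" group

# Welch t statistic: difference in means divided by its standard error,
# where each group contributes its own variance / sample size
n1, n2 = len(toy_in_pov), len(toy_not_pov)
std_err = np.sqrt(toy_in_pov.var(ddof=1) / n1 + toy_not_pov.var(ddof=1) / n2)
t_by_hand = (toy_in_pov.mean() - toy_not_pov.mean()) / std_err

# The same test via scipy, plus the equal-variance (pooled) version for comparison
welch = ttest_ind(toy_in_pov, toy_not_pov, equal_var=False)
pooled = ttest_ind(toy_in_pov, toy_not_pov, equal_var=True)

print(f"By hand: t = {t_by_hand:.3f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4g}")
print(f"Pooled:  t = {pooled.statistic:.3f}, p = {pooled.pvalue:.4g}")
```

The hand calculation and the Welch result should agree; the pooled version will usually differ somewhat when the two groups have unequal sizes and variances, as they do here.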

In [None]:
#What is the average number of sodas by poverty status? (Hint - use "groupby" function)

In [None]:
# We can see from the cell above that the means are different, but is the difference statistically significant?

ttest_ind(chis_df[chis_df['inpoverty_dv'] == 1].drank_sweet, chis_df[chis_df['inpoverty_dv'] == 0].drank_sweet, equal_var = False, nan_policy="omit")

#The equal_var option lets you specify whether you think the variances
#of the two samples are the same.  Try it and see what happens when you assume equal variances.

#Setting equal_var=False runs Welch's t-test, which does not assume equal variances
#and will generally give you a more conservative estimate of statistical significance.

#nan_policy="omit" tells Python to omit observations where the data are missing.

In [None]:
# Try one on your own! Do respondents in poverty drink less water than those not in poverty?
# First, run the descriptives - based on this, is your intuition that the difference is statistically significant?


In [None]:
#run the test

In [None]:
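#Researchpy gives you more detailed output that I find more helpful, and the conclusion is the same
#(note: rp.ttest assumes equal variances by default, so the t statistic can differ slightly from the Welch test above)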

rp.ttest((chis_df[chis_df['inpoverty_dv']==1].drank_sweet), (chis_df[chis_df['inpoverty_dv']==0].drank_sweet))

## 4 The Chi-Square Test

When we are examining categorical data, we use a different statistical test. (In other software packages, you can run a two-sample test of proportions, but the code in Python is more complex than the code for chi-square below, and you'll get the same results.)

The Chi-Square test assesses whether the values in the "cells" of a 2-way contingency table are significantly different from what we would expect were there no relationship between the two variables.
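
As a rough sketch of what "expected" means here, the example below uses a made-up 2x2 table (the counts are hypothetical, not CHIS results). It computes the expected counts and the chi-square statistic by hand, checks them against `scipy.stats.chi2_contingency`, and then shows that a pooled two-sample z-test of proportions gives the same answer (z squared equals chi-square when no continuity correction is applied).

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, made up for illustration
# rows: not in poverty, in poverty; columns: can find f/v, cannot find f/v
observed = np.array([[720, 80],
                     [150, 50]])

# Expected count in each cell if the two variables were unrelated:
# row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

# scipy runs the same calculation; correction=False switches off Yates' continuity correction
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Two-sample z-test of proportions with a pooled standard error
n = observed.sum(axis=1)                  # group sizes
p = observed[:, 1] / n                    # share "cannot find f/v" in each group
p_pooled = observed[:, 1].sum() / n.sum()
z = (p[0] - p[1]) / np.sqrt(p_pooled * (1 - p_pooled) * (1 / n[0] + 1 / n[1]))

print(expected_by_hand)
print(f"chi-square by hand = {chi2_by_hand:.3f}, scipy = {chi2:.3f}, z squared = {z**2:.3f}")
print(f"p-value = {p_value:.4g}, degrees of freedom = {dof}")
```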

Again, there are many ways to run a chi-square test, but the best I've found comes from "researchpy", which is why we installed it above.


In [None]:
pd.crosstab(index=chis_df['inpoverty_dv'], columns=chis_df['nofv_dv'], normalize='index', margins=True, margins_name='Total')

In [None]:
#In this code, I'm creating two objects ("table" and "results")
# and calling researchpy using rp.  I'll print out the table, and then
# the results of the chi-square test.
table, results = rp.crosstab(chis_df["inpoverty_dv"], chis_df["nofv_dv"], prop="row", test="chi-square")
table

In [None]:
results

In [None]:
#Test to see if Black respondents are less likely to have access to fresh fruits and vegetables than other race/ethnicity groups


In [None]:
#What about Asian respondents?


## 5 Correlation

The correlation coefficient (sometimes referred to as Pearson's correlation coefficient, Pearson's product-moment correlation, or simply r) measures the strength of the linear relationship between two variables. 

The correlation coefficient is directly linked to the beta coefficient in a simple linear regression (the slope of the best-fit line), but has the advantage of being standardized to the range -1 to 1: the former means a perfect negative linear relationship, and the latter a perfect positive linear relationship. In other words, no matter what the original units of the two variables are, the correlation coefficient will always fall between -1 and 1, which makes it very easy to work with.
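
The link to the regression slope can be sketched on made-up data (the variables `x` and `y` below are simulated assumptions, not CHIS variables). For a one-predictor regression, r equals the slope multiplied by the ratio of the standard deviations, and rescaling either variable leaves r unchanged.

```python
import numpy as np

# Simulated data with a negative linear relationship plus noise
rng = np.random.default_rng(1)
x = rng.normal(50, 10, 500)
y = 20 - 0.3 * x + rng.normal(0, 5, 500)

# Slope of the best-fit line (np.polyfit returns the slope first for degree 1)
slope, intercept = np.polyfit(x, y, 1)

# Pearson's r directly, and rebuilt from the slope: r = slope * sd(x) / sd(y)
r = np.corrcoef(x, y)[0, 1]
r_from_slope = slope * x.std(ddof=1) / y.std(ddof=1)

# Changing the units of x and y changes the slope but not r
r_rescaled = np.corrcoef(x * 100, y / 12)[0, 1]

print(f"slope = {slope:.4f}")
print(f"r = {r:.4f}, r from slope = {r_from_slope:.4f}, r after rescaling = {r_rescaled:.4f}")
```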

The correlation coefficient *r*

> The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear relationship between the variables. 

<img src="py-corr-1.webp" width=800 height=400 />

In hypothesis testing, you want to find not only the correlation coefficient (the r value) but also the p-value.
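
As a minimal sketch of where that p-value comes from, the example below uses simulated data (not CHIS data). It converts r into a t statistic with n - 2 degrees of freedom and compares the resulting two-sided p-value with the one reported by `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import stats

# Simulated data with a weak positive relationship
rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(x, y)

# Under the null hypothesis of no correlation,
# t = r * sqrt(n - 2) / sqrt(1 - r^2) follows a t distribution with n - 2 degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_by_hand = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.4f}")
print(f"p-value from pearsonr = {p_value:.4g}, by hand = {p_by_hand:.4g}")
```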

In [None]:
plt.scatter(chis_df["pov_ratio"], chis_df["drank_sweet"])
plt.xlabel("Poverty Ratio")
plt.ylabel("Number of Sweet/Soda Drinks")

In [None]:
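#scipy gives us a quick printout of both Pearson's r and our p-value
#pearsonr does not drop missing values for you, so remove rows with NaN in either variable first
corr_df = chis_df[['pov_ratio', 'drank_sweet']].dropna()
pearsonr(corr_df['pov_ratio'], corr_df['drank_sweet'])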

In [None]:
#my favorite is actually from the pingouin library, which gives the same r and p-value (plus a confidence interval) in a more readable table
pg.corr(x=chis_df['pov_ratio'], y=chis_df["drank_sweet"])

In [None]:
#Test the hypothesis that people who drink more water drink fewer sweet drinks/sodas


#Is poverty or water drinking more strongly associated with the number of sweet drinks?