# Inferential Statistics in Python: TTests and Chi-Square

## 1 Libraries

We're going to bring in our libraries. You'll notice some new libraries and functions: "scipy" is a library that includes statistical analysis functions.  We're going to bring in the t and ttest_ind functions.  I'm also going to show 4 decimal places in my number displays.

In [None]:
#Call our libraries; note, we are adding some libraries to our notebook

import numpy as np
import pandas as pd
import math
from scipy import stats
from scipy.stats import t
from scipy.stats import ttest_ind
from datascience import *

pd.options.display.float_format = '{:.4f}'.format

### 1.1 Libraries not in Berkeley's DataHub Package

Sometimes, you'll find a library that does what you want to do, but it's not "pre-installed" in DataHub.  I found a library called "researchpy" which handles chi-square tests much better than scipy.  To install a new library, you can type pip install library_name in a cell, and then import ("call") it into your Python session.

In [None]:
pip install researchpy

In [None]:
import researchpy as rp

### 1.2  Bringing in Our Data

The next few cells bring in our data and rename our variables, and also create some dummy variables.

In [None]:
chis_df=pd.read_csv("chis_extract.csv")

In [None]:
chis_df.rename(columns={"ac11":"number_sodas","povll":"poverty_line",
"ab1":"health",
"racedf_p1":"race_eth",
"ak28":"feel_safe",
"ak25":"tenure",
"ak10_p":"earnings",
"ak22_p1":"hh_income"}, inplace=True)

In [None]:
#Change earnings to be numeric, assigning a missing (NaN) value to observations that had the value inapplicable

chis_df["earnings"]=pd.to_numeric(chis_df["earnings"], errors="coerce")
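
If it helps to see what errors="coerce" does on its own, here is a minimal sketch on a made-up column. The values and the names raw_earnings and clean_earnings are hypothetical and are not taken from the CHIS extract.

```python
import pandas as pd

# Hypothetical toy column: numeric strings mixed with a survey code
raw_earnings = pd.Series(["52000", "INAPPLICABLE", "31000.5", "18000"])

# errors="coerce" turns anything that can't be parsed as a number into NaN
clean_earnings = pd.to_numeric(raw_earnings, errors="coerce")

print(clean_earnings)
print("Missing values:", clean_earnings.isna().sum())
# mean() skips NaN by default, so only the three real numbers are averaged
print("Mean of the rest:", clean_earnings.mean())
```

Because the text code becomes NaN rather than 0, it drops out of means and other summaries instead of pulling them down.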

In [None]:
# Create my "own" dummy variable
chis_df["own_dv"]=chis_df["tenure"].map({"OWN":1, "RENT":0, "REFUSED":np.nan, "NOT ASCERTAINED":np.nan, "DON'T KNOW": np.nan, "OTHER ARRANGEMENT": np.nan})
pd.crosstab(chis_df["own_dv"], columns="count")

In [None]:
#create my "healthy" dummy variable
chis_df["healthy_dv"]=chis_df["health"].map({"EXCELLENT":1, "VERY GOOD":1, "GOOD":1, "FAIR":0, "POOR": 0})
pd.crosstab(chis_df["healthy_dv"], columns="count")

In [None]:
#create my "feel safe" dummy variable
chis_df["feel_safe_dv"]=chis_df["feel_safe"].map({"ALL OF THE TIME":1, "MOST OF THE TIME":1, "SOME OF THE TIME":0, "NONE OF THE TIME":0, "PROXY SKIPPED": np.nan})
pd.crosstab(chis_df["feel_safe_dv"], columns="count")
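
One thing worth knowing about .map with a dictionary: any category you leave out of the dictionary becomes NaN automatically. The sketch below uses a made-up series (toy_tenure is a hypothetical name, not a column in our data) to show that, and uses value_counts(dropna=False) as one way to check how many rows ended up missing.

```python
import pandas as pd

# Hypothetical toy responses, not the CHIS data
toy_tenure = pd.Series(["OWN", "RENT", "REFUSED", "OWN", "OTHER ARRANGEMENT"])

# Only OWN and RENT are listed; every other category maps to NaN
toy_own_dv = toy_tenure.map({"OWN": 1, "RENT": 0})

# dropna=False keeps the missing group visible in the counts
toy_counts = toy_own_dv.value_counts(dropna=False)
print(toy_counts)
```

Spelling out the np.nan entries, as in the cells above, does the same thing but makes the choice explicit to whoever reads your code.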

## 2 Calculating Descriptive Statistics

### 2.1 Looking at a Numeric Variable by Group

Before I do my statistical tests, I always want to start by looking at my data descriptively.  Let's refresh how we look at a numeric variable by two groups, for example, the number of sodas by whether or not someone feels safe in the neighborhood.  There are lots of different ways to code this; below are three examples.  Take a look at each one and what it produces, and discuss with your team what each of them tells you about the relationship between number of sodas (our dependent variable) and whether or not a respondent feels safe in their neighborhood.

In [None]:
chis_df["number_sodas"].groupby(chis_df["feel_safe_dv"]).mean()

In [None]:
chis_df["number_sodas"].groupby(chis_df["feel_safe_dv"]).agg(['count','min','max','mean', 'median', 'std'])

In [None]:
chis_df.groupby(['feel_safe_dv']).agg({'number_sodas': 'mean',
                                  'earnings' : 'mean'})

### 2.2 Two Variable Frequency Tables

If we want to explore two categorical variables, we need to rely on the pandas crosstab function.  We specify which variable we want along the rows (our "index" variable) and which variable we want along our columns.  Let's look at how feeling safe is associated with feeling healthy.

In [None]:
pd.crosstab(index=chis_df["healthy_dv"], columns=chis_df["feel_safe_dv"])

In [None]:
#  If we want Python to give us row and column totals, we specify that using
# the "margins=True" option within the crosstab function.
pd.crosstab(index=chis_df["healthy_dv"], columns=chis_df["feel_safe_dv"], margins=True)

In [None]:
# We can also get the percents by asking Python to normalize the data, either by columns or index
pd.crosstab(index=chis_df["healthy_dv"], columns=chis_df["feel_safe_dv"], margins=True, normalize="columns")

In [None]:
#Normalize by index here.

### 2.3  Let's write out two sentences that describe these data that we can refer back to.

Replace this with a sentence describing the data normalized by columns.

Replace this with a sentence describing the data normalized by rows (index).

## 3 The TTest

The ttest is used when we are comparing differences in means between two groups.  Let's first look at what it would look like if we hard coded the ttest equation ourselves.  Let's assess whether there is a statistically significant difference between the average number of sodas consumed each month by whether a respondent feels safe in their neighborhood (1/0).

The t test statistic is calculated as the observed difference between the sample means divided by the square root of the sum of the two squared standard errors.

t = (mean(X1) - mean(X2)) / sqrt(seX1^2 + seX2^2)

This should look similar to the significance test we used with the ACS data.

<img src="ACSStatSigFormula.png" width="300">

This time, rather than deriving the standard error from the provided MOE, we will calculate the SE of the sample ourselves using the observed values in the dataset.

se = std / sqrt(n)
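
As a quick illustration of that formula, here is a small sketch on made-up numbers (toy_sodas is hypothetical, not our survey data). It assumes we want the sample standard deviation, which is what pandas computes by default, and it checks the result against pandas' built-in sem() method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample with one missing value
toy_sodas = pd.Series([0, 2, 4, 4, 10, np.nan])

n = toy_sodas.count()    # count() ignores NaN
std = toy_sodas.std()    # sample standard deviation (ddof=1), NaN skipped
se = std / np.sqrt(n)

print(n, std, se)
print(toy_sodas.sem())   # pandas' built-in standard error of the mean
```

Notice that n is the number of non-missing observations, which is why we use the "count" column from the grouped table rather than the number of rows in the dataset.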

In [None]:
#let's look at the numbers we need to compare our two estimates, one for those who feel safe, and one for those who don't
chis_df["number_sodas"].groupby(chis_df["feel_safe_dv"]).agg(['count','min','max','mean', 'median', 'std'])

In [None]:
mean_feel_safe=5.5412
mean_feel_unsafe=8.3161
se_feel_safe=14.7346/np.sqrt(19410)
se_feel_unsafe=20.0006/np.sqrt(1740)

In [None]:
tstat=(mean_feel_safe - mean_feel_unsafe)/np.sqrt((se_feel_safe**2) + (se_feel_unsafe**2))
tstat

In [None]:
# let's actually calculate the exact p value using the t-distribution.  You don't need to know the code below - it basically
# calls up the two-sided p-value associated with the t probability distribution.
df=1740+19410-2
pval = stats.t.sf(np.abs(tstat), df)*2
print((tstat, pval))

### 3.1 The Code for a TTest

Now, we generally don't want to hard code our statistics.  We are going to call a new function: "ttest_ind" from the scipy.stats module, which says to conduct a ttest of independent means.  We'll likely get a slightly different answer than in our manual approach, but they should be close to the same.

In [None]:
#here's the syntax - talk through with your neighbor what this is doing
ttest_ind(chis_df[chis_df.feel_safe_dv == 1].number_sodas, chis_df[chis_df.feel_safe_dv == 0].number_sodas, equal_var = False, nan_policy="omit")

#The equal variance option allows you to specify whether you think the variances
#of the two samples are the same.  Try and see what happens when you assume equal variances.  

#Setting equal variances as "false" is going to give you a more conservative estimate of statistical significance.  

#The nan_policy option tells Python to omit observations where the data are missing.
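
To see how the hand-coded formula relates to ttest_ind, here is a sketch on synthetic data. The two groups are randomly generated stand-ins, not CHIS respondents, and manual_t is a hypothetical helper written just for this comparison. The t statistic from the formula should match ttest_ind with equal_var=False. The p-values can differ a little because the manual cell used n1 + n2 - 2 degrees of freedom, while the unequal-variance test adjusts the degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two groups (not CHIS data)
group_a = rng.normal(loc=5.5, scale=15.0, size=500)
group_b = rng.normal(loc=8.3, scale=20.0, size=80)

def manual_t(x, y):
    """Difference in means over the square root of the summed squared SEs."""
    se_x = x.std(ddof=1) / np.sqrt(len(x))
    se_y = y.std(ddof=1) / np.sqrt(len(y))
    return (x.mean() - y.mean()) / np.sqrt(se_x**2 + se_y**2)

t_manual = manual_t(group_a, group_b)
# p-value the way the manual cell did it: n1 + n2 - 2 degrees of freedom
p_manual = stats.t.sf(np.abs(t_manual), len(group_a) + len(group_b) - 2) * 2

welch = stats.ttest_ind(group_a, group_b, equal_var=False)
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

print("manual:", t_manual, p_manual)
print("unequal variances:", welch.statistic, welch.pvalue)
print("equal variances:", pooled.statistic, pooled.pvalue)
```

Running the three side by side is a handy way to check how much the equal-variance assumption matters for a given pair of groups.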

## 4 The Chi Square Test

When we are examining our categorical data, we're going to use a different statistical test. (In other software packages, you can run a ttest of proportions, but the code in Python is more complex than the code for chi-square below, and you'll get the same results.) 

The Chi-Square test assesses whether the values in the "cells" of a 2-way contingency table are significantly different from what we would expect were there no relationship between the two variables.
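
To make "what we would expect" concrete, here is a sketch with a made-up 2x2 table of counts (the numbers are hypothetical). It computes the expected counts by hand as row total times column total divided by the grand total, and compares them with scipy's chi2_contingency. I turn off the continuity correction with correction=False so that the by-hand statistic matches exactly; that setting is my assumption for this illustration, not necessarily what other packages use by default.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = healthy (0/1), columns = feels safe (0/1)
observed = np.array([[30, 70],
                     [20, 180]])

# correction=False turns off the Yates continuity correction for 2x2 tables
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

# Expected counts by hand: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_by_hand = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

print(expected)
print(chi2, chi2_by_hand, p_value, dof)
```

The bigger the gaps between the observed and expected counts, the larger the chi-square statistic and the smaller the p-value.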

Again, there are lots of ways to run a chi-square test, but the best I've found comes from "researchpy", which is why we installed it above.

In [None]:
#In this code, I'm creating two objects ("table" and "results")
# and calling researchpy using rp.  I'll print out the table, and then
# the results of the chi-square test.
table, results = rp.crosstab(chis_df["healthy_dv"], chis_df["feel_safe_dv"], prop="col", test="chi-square")
table

In [None]:
results

## 5 Practice on Your Own

Using the code above, see if there's a statistically significant difference in a) the earnings of those who can and can't find fresh fruits and vegetables in their neighborhood, and b) the self-reported health of renters and owners.