## Chi-squared

The χ² test of independence can be used to compare two categorical variables and helps us answer questions like:

- Is whether or not a customer churns independent of their subscription plan?
- Are doctors less likely to smoke?
- Does playing on the home field give a soccer team an advantage?

In this lesson we will dive into how the test is performed.

In [1]:
import pandas as pd
from scipy import stats
from pydataset import data

mpg = data('mpg')
mpg['transmission'] = mpg.trans.str[:-4] # drop the trailing 4-character gear code, e.g. 'auto(l5)' -> 'auto'
mpg.head()


,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,transmission
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,auto
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,manual
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,manual
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,auto
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,auto


## `observed` is the crosstab of the two categorical variables you are interested in.
- Input is two Series, one per variable

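## Process:
1. Set alpha (a) and state the null hypothesis: there is no relationship between categorical variables A and B, i.e. A and B are independent of one another.
2. Build the table of observed counts: `observed = pd.crosstab(df.A, df.B)`
3. Run the test: `chi2, p, degf, expected = stats.chi2_contingency(observed)`
4. Compare p to a: if p < a, reject the null hypothesis; otherwise, fail to reject it.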
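The process above can be sketched as a single helper function. This is only an illustration: `chi2_independence_test` is a hypothetical name, not something provided by pandas or scipy, and the toy DataFrame is invented purely to show the call.

```python
import pandas as pd
from scipy import stats


def chi2_independence_test(df, col_a, col_b, alpha=0.05):
    """Hypothetical helper: chi-squared test of independence on two columns.

    Null hypothesis: col_a and col_b are independent of one another.
    """
    # Table of observed counts for every combination of categories
    observed = pd.crosstab(df[col_a], df[col_b])
    # Test statistic, p-value, degrees of freedom, expected counts
    chi2, p, degf, expected = stats.chi2_contingency(observed)
    # Reject the null only when p falls below alpha
    return {
        "observed": observed,
        "chi2": chi2,
        "p": p,
        "degf": degf,
        "reject_null": bool(p < alpha),
    }


# Toy data invented for illustration: every A/B combination is equally common
toy = pd.DataFrame({
    "A": ["x", "x", "y", "y"] * 10,
    "B": ["u", "v", "u", "v"] * 10,
})

result = chi2_independence_test(toy, "A", "B")
print(result["observed"])
print(result["p"], result["reject_null"])
```

Because every combination in the toy data occurs equally often, the test finds no evidence against independence here.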


In [2]:
import pandas as pd
from scipy import stats
from pydataset import data
df = data("tips")
df.head()

,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


In [8]:
# let's investigate smoking status and time (lunch vs. dinner)
# null: they are independent
# then >> make a crosstab of the two variables we are investigating
observed = pd.crosstab(df.smoker, df.time)
observed

time,Dinner,Lunch
smoker,,
No,106,45
Yes,70,23


In [5]:
a = 0.05

In [9]:
chi2, p, degf, expected = stats.chi2_contingency(observed)
p

0.4771485672079724

In [10]:
if p < a:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

we fail to reject the null hypothesis
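It can help to see where the `expected` table returned by `chi2_contingency` comes from. The sketch below rebuilds it by hand from the row and column totals, using the smoker-by-time counts copied from the crosstab above. The variable names `row_totals`, `col_totals`, and `expected_manual` are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the smoker-by-time crosstab above
# rows: smoker No / Yes, columns: Dinner / Lunch
observed = np.array([[106, 45],
                     [70, 23]])

row_totals = observed.sum(axis=1)   # total per smoker category
col_totals = observed.sum(axis=0)   # total per time category
n = observed.sum()                  # total number of observations

# If the two variables were independent, each cell would hold
# (row total * column total) / n observations
expected_manual = np.outer(row_totals, col_totals) / n

chi2, p, degf, expected = stats.chi2_contingency(observed)

print(expected_manual)
print(np.allclose(expected_manual, expected))
# degrees of freedom = (rows - 1) * (columns - 1)
print(degf)
```

The closer the observed counts are to these expected counts, the smaller the χ² statistic and the larger the p-value, which is why we fail to reject the null hypothesis here.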


### What about Gender and Day?
Step 1: set alpha and define null hypothesis:

In [11]:
a = 0.05

In [12]:
## Null hypothesis: day and gender are independent

In [14]:
# Step 2: crosstab

observed = pd.crosstab(df.day, df.sex)
observed

sex,Female,Male
day,,
Fri,9,10
Sat,28,59
Sun,18,58
Thur,32,30


In [15]:
chi2, p, degf, expected = stats.chi2_contingency(observed)
p

0.004180302092822257

In [16]:
if p < a:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

we reject the null hypothesis
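To see what the test is doing under the hood, the statistic and p-value for this table can be recomputed by hand and compared with scipy's result. This is a sketch using the day-by-sex counts copied from the crosstab above. The names `chi2_manual`, `degf_manual`, and `p_manual` are chosen here for illustration. Note that for 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so this plain formula would not match there; this table has more than one degree of freedom, so no correction is applied.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the day-by-sex crosstab above
# rows: Fri, Sat, Sun, Thur; columns: Female, Male
observed = np.array([[9, 10],
                     [28, 59],
                     [18, 58],
                     [32, 30]])

chi2, p, degf, expected = stats.chi2_contingency(observed)

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
degf_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: probability of a statistic at least this large under the null
p_manual = stats.chi2.sf(chi2_manual, degf_manual)

print(chi2_manual, degf_manual, p_manual)
print(np.isclose(p_manual, p))
```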
