## Comparing Group Membership

1. We use a chi2 test to **check whether two categorical variables are independent of each other.**
    - For this example, we will compare the `sex` column with the `smoker` column.
2. Our null hypothesis is that membership in these groups is independent. More formally:
    - **$H_0$: sex is independent of whether or not someone is a smoker**

In [1]:
import pandas as pd
from scipy import stats
from pydataset import data

# Load the data set

tips = data('tips')
tips.shape

(244, 7)

In [2]:
# Take a quick glance
tips.head()

,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,Female,No,Sun,Dinner,2
2,10.34,1.66,Male,No,Sun,Dinner,3
3,21.01,3.5,Male,No,Sun,Dinner,3
4,23.68,3.31,Male,No,Sun,Dinner,2
5,24.59,3.61,Female,No,Sun,Dinner,4


First we need to generate a contingency table (another name for a cross tabulation), which pandas can build with `pd.crosstab`.

In [3]:
contingency_table = pd.crosstab(tips.sex, tips.smoker)
contingency_table

smoker,No,Yes
sex,,
Female,54,33
Male,97,60


1. The chi2 test works by **comparing the observed contingency table against the table we would expect if group membership were independent (the counts you would expect if everything were left to chance).**
2. When we perform the test, one of the returned values will be the expected values in the contingency table.
3. To perform the test, we simply pass the contingency table that we created with pandas to the chi2_contingency function from scipy.
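
The expected table described above can also be reproduced by hand: under independence, each cell's expected count is its row total times its column total, divided by the grand total. Here is a minimal sketch in plain Python, with the observed counts copied from the contingency table above (the variable names are illustrative and not part of the original notebook):

```python
# Observed counts copied from the crosstab.
# Rows are Female, Male; columns are smoker = No, Yes.
observed = [[54, 33],
            [97, 60]]

row_totals = [sum(row) for row in observed]        # one total per sex
col_totals = [sum(col) for col in zip(*observed)]  # one total per smoker status
n = sum(row_totals)                                # grand total

# Under independence: expected = row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["Female", "Male"], expected):
    print(label, [round(v, 2) for v in row])
```

These values should agree with the matrix of expected values that the test returns.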

In [4]:
test_results = stats.chi2_contingency(contingency_table)
test_results

(0.008763290531773594,
 0.925417020494423,
 1,
 array([[53.84016393, 33.15983607],
        [97.15983607, 59.84016393]]))

The function returns several values:
1. the chi2 test statistic
2. the p value
3. the degrees of freedom
4. the matrix of expected values
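
To make the first three values less opaque, here is a hedged sketch that recomputes them by hand for this 2x2 table. It assumes Yates' continuity correction (subtracting 0.5 from each absolute difference), which `chi2_contingency` applies by default when there is one degree of freedom. It also relies on the fact that, for one degree of freedom, the chi2 survival function equals `erfc(sqrt(x / 2))`. Some SciPy versions may cap the correction differently and report a different statistic, so treat the match with the output above as version-dependent.

```python
import math

# Observed counts copied from the crosstab (rows: Female, Male; columns: No, Yes)
observed = [[54, 33],
            [97, 60]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Assumption: Yates' continuity correction, i.e. (|O - E| - 0.5)^2 / E per cell
chi2 = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# p-value shortcut that is only valid when dof == 1
p = math.erfc(math.sqrt(chi2 / 2))

print(chi2, p, dof)
```

For tables with more than one degree of freedom this shortcut does not apply; a general chi2 survival function would be needed instead.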

We'll focus on the p value and the matrix of expected values:

In [5]:
_, p, _, expected = test_results

Now we can look at p to decide whether to reject or fail to reject $H_0$.

In [6]:
p

0.925417020494423
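
The decision comes from comparing p to a significance level chosen before running the test. The sketch below assumes the conventional `alpha = 0.05`, which the original does not state; the p-value is copied from the output above.

```python
alpha = 0.05           # assumed significance level, not stated in the original
p = 0.925417020494423  # p-value copied from the output above

# Reject H0 only when the p-value falls below the chosen threshold
if p < alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"

print(decision)
```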

- With such a high p-value, we fail to reject the null hypothesis.

- Less formally, we have no evidence that sex and smoker status are related. We can see this intuitively by comparing the expected values against what we actually observed:

In [7]:
# Here we'll do some data frame manipulation with pandas to get the two tables
# into a more comparable form

expected = pd.DataFrame(expected, index=['Female', 'Male'], columns=['Non-Smoker', 'Smoker'])

contingency_table.columns = ['Non-Smoker', 'Smoker']
contingency_table.index.name = ''

contingency_table['group'] = 'Actual'
expected['group'] = 'Expected'

(pd.concat([contingency_table, expected])
 .reset_index()
 .rename({'index': 'sex'}, axis=1)
 .set_index(['group', 'sex']))

,,Non-Smoker,Smoker
group,sex,,
Actual,Female,54.0,33.0
Actual,Male,97.0,60.0
Expected,Female,53.840164,33.159836
Expected,Male,97.159836,59.840164


The table above shows that the actual values are very close to the expected values, which is why we fail to reject the null hypothesis.
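
To put a number on "very close", the sketch below subtracts the expected counts from the actual counts cell by cell. The values are copied from the table above, and the variable names are illustrative.

```python
# Actual and expected counts copied from the table above (columns: Non-Smoker, Smoker)
observed = {"Female": [54, 33], "Male": [97, 60]}
expected = {"Female": [53.840164, 33.159836], "Male": [97.159836, 59.840164]}

# Cell-by-cell difference between actual and expected counts
differences = {
    sex: [round(o - e, 3) for o, e in zip(observed[sex], expected[sex])]
    for sex in observed
}

# Largest absolute gap across all four cells
largest_gap = max(abs(d) for row in differences.values() for d in row)

print(differences)
print(largest_gap)
```

Every cell is off by roughly 0.16 of a person, a tiny gap next to counts in the tens.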