# $\chi^2$ Testing in Python

What is $\chi^2$ testing and how can we do it in Python?

In [1]:
import scipy.stats as stats
import pandas as pd
import numpy as np

A $\chi^2$ test is a kind of hypothesis test that is appropriate for categorical data. Generally speaking, the null hypothesis will be that there is no difference between groups with respect to some variable of interest. For example, below we'll be looking at responses to a survey about comma use and other fine points of English grammar. Besides the answers to the questions on the survey themselves, we also have access to other information about the respondents, such as their sexes and levels of education.

So questions like:
- Do men and women have different beliefs about the grammatical correctness of the Oxford comma? or
- Is advanced collegiate education associated with wondering whether 'data' is singular or plural?

are the sort of questions for which a $\chi^2$ test might be useful.

## Reading in the Data

The dataset below comes from [fivethirtyeight.com](http://fivethirtyeight.com). It records the results of a survey about English grammar.

In [2]:
commas = pd.read_csv('comma-survey.csv')
commas.head()

Unnamed: 0,RespondentID,"In your opinion, which sentence is more gramatically correct?","Prior to reading about it above, had you heard of the serial (or Oxford) comma?","How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar?",How would you write the following sentence?,"When faced with using the word ""data"", have you ever spent time considering if the word was a singular or plural noun?","How much, if at all, do you care about the debate over the use of the word ""data"" as a singluar or plural noun?","In your opinion, how important or unimportant is proper use of grammar?",Gender,Age,Household Income,Education,Location (Census Region)
0,3292953864,"It's important for a person to be honest, kind...",Yes,Some,"Some experts say it's important to drink milk,...",No,Not much,Somewhat important,Male,30-44,"$50,000 - $99,999",Bachelor degree,South Atlantic
1,3292950324,"It's important for a person to be honest, kind...",No,Not much,"Some experts say it's important to drink milk,...",No,Not much,Somewhat unimportant,Male,30-44,"$50,000 - $99,999",Graduate degree,Mountain
2,3292942669,"It's important for a person to be honest, kind...",Yes,Some,"Some experts say it's important to drink milk,...",Yes,Not at all,Very important,Male,30-44,,,East North Central
3,3292932796,"It's important for a person to be honest, kind...",Yes,Some,"Some experts say it's important to drink milk,...",No,Some,Somewhat important,Male,18-29,,Less than high school degree,Middle Atlantic
4,3292932522,"It's important for a person to be honest, kind...",No,Not much,"Some experts say it's important to drink milk,...",No,Not much,,,,,,


## Exploration

The first question on the survey is about whether a sentence is more grammatically correct with or without the Oxford comma (roughly, the optional comma before the 'and' in a list of several things).

In [3]:
commas['In your opinion, which sentence is more gramatically correct?'].value_counts()

It's important for a person to be honest, kind, and loyal.    641
It's important for a person to be honest, kind and loyal.     488
Name: In your opinion, which sentence is more gramatically correct?, dtype: int64

## An Initial Question

Suppose we were interested in whether there is some association between one's level of education and one's response to the above question.

What we need to do first is to collect numbers: How did folks from each of the different education levels respond to the question about the Oxford comma?

Let's look at the different education levels:

In [4]:
commas['Education'].value_counts()

Bachelor degree                     344
Some college or Associate degree    295
Graduate degree                     276
High school degree                  100
Less than high school degree         11
Name: Education, dtype: int64

What we want is a table for how people, as divided into these education levels, responded to the Oxford comma question.

There are various ways we might go about assembling the table. One natural way would be to make use of the `.groupby()` method of `pandas` DataFrame objects. Here's another:

In [5]:
# Here we'll just filter the data frame according to the
# values we're interested in, and then just measure the
# length of the resulting filtered data frame.

question = 'In your opinion, which sentence is more gramatically correct?'
with_comma = "It's important for a person to be honest, kind, and loyal."
without_comma = "It's important for a person to be honest, kind and loyal."

table = np.zeros((2, 5))

for idx, value in enumerate(commas['Education'].value_counts().index):
    in_group = commas['Education'] == value
    table[0, idx] = len(commas[(commas[question] == with_comma) & in_group])
    table[1, idx] = len(commas[(commas[question] == without_comma) & in_group])

In [6]:
table

array([[199., 157., 172.,  52.,   8.],
       [145., 138., 104.,  48.,   3.]])
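As a point of comparison, `pandas` can build this kind of contingency table directly with `pd.crosstab`. The sketch below uses a tiny made-up DataFrame (the `toy` frame and its `answer` column are hypothetical stand-ins, not the survey data) just to show the shape of the call. Note that `pd.crosstab` sorts the category labels, so the column order will generally differ from the `value_counts()` order used above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the survey data (not real responses).
toy = pd.DataFrame({
    'answer': ['with', 'without', 'with', 'with', 'without', 'with'],
    'Education': ['Bachelor', 'Bachelor', 'Graduate',
                  'Graduate', 'Graduate', 'Bachelor'],
})

# Rows are answers, columns are education levels, cells are counts.
toy_table = pd.crosstab(toy['answer'], toy['Education'])
print(toy_table)
```

Calling `.to_numpy()` on the result gives a plain array that could be passed to the SciPy function used in the next section.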

## Using `scipy.stats.contingency.chi2_contingency()`

We can use SciPy to conduct a $\chi^2$ test. But before we rely on the software, let's think about what we're measuring here.

Our null hypothesis is that there will be no difference in response to the Oxford comma question among people in the various levels of education. Notice that we don't have equal numbers of people in the different educational levels; what the null hypothesis says is that the *proportions* of Oxford comma favorers and disfavorers will be equal across the levels.

There are two possible responses and five educational levels, so the table has $(2 - 1) \times (5 - 1) = 4$ degrees of freedom.

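The SciPy function we're using returns four values: the $\chi^2$ statistic itself, the $p$-value, the number of degrees of freedom, and a table of expected counts given the null hypothesis. (Yes, I like the Oxford comma.) In recent SciPy releases these come back wrapped in a result object rather than a plain tuple, but it should still unpack and index the same way.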

In [7]:
chisq_test = stats.contingency.chi2_contingency(table)
chisq_test

(7.108946212449898,
 0.13024170886530348,
 4,
 array([[197.14619883, 169.06432749, 158.1754386 ,  57.30994152,
           6.30409357],
        [146.85380117, 125.93567251, 117.8245614 ,  42.69005848,
           4.69590643]]))
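To see where the table of expected counts comes from, here is a small sketch that rebuilds it from the row and column totals of the observed table. The counts are copied from the output above; the variable names are just for illustration.

```python
import numpy as np

# Observed counts copied from the table above
# (rows: with / without the Oxford comma; columns: education levels).
observed = np.array([[199., 157., 172., 52., 8.],
                     [145., 138., 104., 48., 3.]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 5)
grand_total = observed.sum()

# Under the null hypothesis, each cell is (row total * column total) / N.
expected = row_totals * col_totals / grand_total

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected.round(2))
print(dof)
```

The expected counts keep the same row and column totals as the observed table, but distribute them as if the response were independent of education level.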

The calculation of the $\chi^2$ statistic uses the familiar idea of a sum of squared differences between expected and actual values, but we scale each addend (i.e. each squared difference) by the expected value.
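Written out as a formula, with $O_{ij}$ for the observed count and $E_{ij}$ for the expected count in row $i$ and column $j$ (these symbols are introduced here for illustration):

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$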

According to SciPy, the value of the $\chi^2$ statistic given these numbers is 7.109. Let's see if we can recreate this figure manually:

In [8]:
# We'll compute the weighted sum of squared differences.
# Floating-point results are compared with np.isclose rather than ==.

manual_chisq = np.divide((table - chisq_test[3])**2, chisq_test[3]).sum()
np.isclose(manual_chisq, chisq_test[0])

True
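The $p$-value can be recovered in a similar spirit. A minimal sketch, assuming the statistic and degrees of freedom reported above: the $p$-value is the upper-tail probability of a $\chi^2$ distribution with 4 degrees of freedom, which `scipy.stats.chi2.sf` (the survival function) computes.

```python
import scipy.stats as stats

chisq_stat = 7.108946212449898   # statistic reported above
dof = 4                          # degrees of freedom reported above

# Survival function = 1 - CDF: the probability, under the null
# hypothesis, of seeing a statistic at least this large.
p_value = stats.chi2.sf(chisq_stat, dof)
print(p_value)
```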

## Further Exploration

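Great! That checks out. But our $p$-value is about 13%: if education level made no difference, we'd still see discrepancies at least this large roughly one time in eight. So the differences across educational levels on the Oxford comma question are not particularly surprising, and we can't reject the null hypothesis.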

Let's see if we can find something more interesting.

In [9]:
# Maybe there's an interesting difference in answers
# to this question according to sex?

table2 = np.zeros((2, 2))

In [10]:
question = 'In your opinion, which sentence is more gramatically correct?'
with_comma = "It's important for a person to be honest, kind, and loyal."
without_comma = "It's important for a person to be honest, kind and loyal."

for idx, value in enumerate(commas['Gender'].value_counts().index):
    in_group = commas['Gender'] == value
    table2[0, idx] = len(commas[(commas[question] == with_comma) & in_group])
    table2[1, idx] = len(commas[(commas[question] == without_comma) & in_group])

In [11]:
table2

array([[314., 280.],
       [234., 209.]])

In [12]:
stats.contingency.chi2_contingency(table2)

(0.002502344351150657,
 0.9601037108628604,
 1,
 array([[313.89778206, 280.10221794],
        [234.10221794, 208.89778206]]))

This test yielded a $p$-value of 96%! No interesting difference there.
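One detail worth knowing for 2×2 tables like this one: when there is a single degree of freedom, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the reported statistic is not the plain weighted sum of squared differences. The sketch below turns the correction off and checks the result against the manual formula. The counts are copied from `table2` above, and the exact corrected value may differ from the output shown depending on the SciPy version.

```python
import numpy as np
import scipy.stats as stats

# Counts copied from table2 above.
table2 = np.array([[314., 280.],
                   [234., 209.]])

# Disable Yates' continuity correction to get the plain statistic.
stat, pval, dof, expected = stats.contingency.chi2_contingency(
    table2, correction=False)

# Plain weighted sum of squared differences, for comparison.
manual = ((table2 - expected) ** 2 / expected).sum()

print(stat, pval, dof)
print(np.isclose(stat, manual))
```

Either way the conclusion is the same here: the observed counts are almost exactly what the null hypothesis predicts.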

Let's try one more before we move to a more systematic search. Instead of looking across educational levels or the sexes for interesting differences in response to the Oxford comma question, we'll try looking across *ages*:

In [13]:
commas.Age.value_counts()

45-60    290
> 60     272
30-44    254
18-29    221
Name: Age, dtype: int64

In [14]:
table3 = np.zeros((2, 4))

In [15]:
question = 'In your opinion, which sentence is more gramatically correct?'
with_comma = "It's important for a person to be honest, kind, and loyal."
without_comma = "It's important for a person to be honest, kind and loyal."

for idx, value in enumerate(commas['Age'].value_counts().index):
    in_group = commas['Age'] == value
    table3[0, idx] = len(commas[(commas[question] == with_comma) & in_group])
    table3[1, idx] = len(commas[(commas[question] == without_comma) & in_group])

In [16]:
table3

array([[142., 123., 155., 174.],
       [148., 149.,  99.,  47.]])

In [17]:
stats.contingency.chi2_contingency(table3)

(67.37895834554394,
 1.553656961877948e-14,
 3,
 array([[166.11378978, 155.80327869, 145.4927676 , 126.59016393],
        [123.88621022, 116.19672131, 108.5072324 ,  94.40983607]]))

Wow! Here we've found a $p$-value of 1.554e-14! That is strong evidence of a genuine association between age and Oxford comma preference. The columns of `table3` follow the order of the value counts of the Age column in `commas` (45-60, > 60, 30-44, 18-29), so we can see that younger folks tend to think that the sentence with the Oxford comma is more grammatically correct. By contrast, older folks are more evenly split, and in fact there is a small majority in the other direction, voting for the sentence without the Oxford comma.
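To make the pattern easier to read off, here is a short sketch that converts the counts in `table3` into the share of each age group preferring the Oxford comma. The counts and the age-group order are copied from the outputs above; the variable names are illustrative.

```python
import numpy as np

# Age groups in the order given by commas.Age.value_counts() above.
ages = ['45-60', '> 60', '30-44', '18-29']

# Counts copied from table3 (row 0: with the comma, row 1: without).
table3 = np.array([[142., 123., 155., 174.],
                   [148., 149.,  99.,  47.]])

# Share of each age group choosing the sentence with the Oxford comma.
share_with_comma = table3[0] / table3.sum(axis=0)

for age, share in zip(ages, share_with_comma):
    print(f'{age}: {share:.1%}')
```

The share rises from a little under half in the two older groups to well over three quarters among 18-29 year olds.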

If ours is to reason why, we might speculate that it was once common for English teachers and grammarians to advocate writing without the Oxford comma, but that it has since become common to use the comma.

## Searching Systematically

Instead of trying things blindly, we might think about checking *all* possible pairs of variables. We need to be careful here, because this sort of work opens us up to charges of $p$-hacking or data dredging, i.e. conducting a large number of tests or otherwise manipulating our data until we find some low $p$-value when in fact the associated correlation is spurious.

The difficulty, to put it simply, is that the more tests we perform, the more likely we are to find a low $p$-value by chance alone. One way of dealing with this is the Bonferroni correction, which divides the significance threshold by the number of tests run.

Of course, if we get $p$-values as small as we got last time, this correction won't make much difference (as long as we're not running billions or trillions of tests), but this is something we should keep in mind. And it would be a good idea to count how many tests we'll be performing with our current dataset.
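To get a feel for how quickly the problem grows, here is a rough sketch of the chance of at least one false positive across several tests. It assumes the tests are independent, which is a simplification (tests on overlapping columns of one survey are not), and the 5% threshold and test counts are chosen purely for illustration.

```python
alpha = 0.05  # per-test Type I error threshold (illustrative)

# Probability of at least one false positive among m independent
# tests when every null hypothesis is true: 1 - (1 - alpha)^m.
fwer_by_m = {m: 1 - (1 - alpha) ** m for m in (1, 10, 100)}

for m, fwer in fwer_by_m.items():
    print(f'{m:>3} tests: {fwer:.3f}')
```

With a hundred tests at the 5% level, at least one "significant" result is nearly guaranteed even if nothing real is going on.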

For now, let's consider the following function, which will construct a table for $\chi^2$ analysis given any two (categorical) columns from a `pandas` DataFrame:

In [18]:
def chisq_table(df, col1, col2):
    # Categories of each column, ordered by frequency
    # (value_counts() leaves out missing values by default)
    rows = df[col2].value_counts().index
    cols = df[col1].value_counts().index

    # Build a table with the right dimensions and initialize
    # counts to 0
    table = np.zeros((len(rows), len(cols)))

    # Filter the DataFrame down to the relevant rows and then
    # fill the table with the length of each filtered DataFrame
    for idx, value in enumerate(rows):
        for idx2, value2 in enumerate(cols):
            table[idx, idx2] = len(df[(df[col2] == value)
                                      & (df[col1] == value2)])

    return table

Let's see if our function reproduces the first table we generated above:

In [19]:
chisq_table(commas, 'Education',\
           'In your opinion, which sentence is more gramatically correct?')

array([[199., 157., 172.,  52.,   8.],
       [145., 138., 104.,  48.,   3.]])

Great! Looks like it works. Now we can try applying this function to *all* possible pairs of columns from our `commas` DataFrame. Again, we'll be sure to count the number of tests here so we remember to take $p$-values with a grain of salt (especially the ones that are low but not very low).

We'll go ahead and assume that the RespondentID column is not relevant here. (And besides, it would force the creation of some very large tables!)

In [20]:
commas.columns.drop('RespondentID')

Index(['In your opinion, which sentence is more gramatically correct?',
       'Prior to reading about it above, had you heard of the serial (or Oxford) comma?',
       'How much, if at all, do you care about the use (or lack thereof) of the serial (or Oxford) comma in grammar?',
       'How would you write the following sentence?',
       'When faced with using the word "data", have you ever spent time considering if the word was a singular or plural noun?',
       'How much, if at all, do you care about the debate over the use of the word "data" as a singluar or plural noun?',
       'In your opinion, how important or unimportant is proper use of grammar?',
       'Gender', 'Age', 'Household Income', 'Education',
       'Location (Census Region)'],
      dtype='object')

Everything else looks usable! We'll collect the $p$-values of all the tests we run:

In [21]:
pvals = []
cols = commas.columns.drop('RespondentID')
for col in cols:
    for col2 in cols:
        if col != col2:
            pair_table = chisq_table(commas, col2, col)
            pvals.append(stats.contingency.chi2_contingency(pair_table)[1])

In [22]:
pvals

[2.1359199588922637e-10,
 2.6128514366119275e-23,
 0.43745396680738136,
 0.3377039287679846,
 0.24447447900218103,
 0.2707945772193671,
 0.9601037108628604,
 1.553656961877948e-14,
 0.02507153894449127,
 0.13024170886530348,
 0.01656656697760126,
 2.1359199588922637e-10,
 1.7986776086368326e-18,
 0.011296356366072539,
 2.9992859525216564e-08,
 0.00036933093635573407,
 0.014041252848884151,
 0.6289453540554488,
 3.70563898184614e-14,
 0.8325492068869516,
 3.3147029057493167e-05,
 0.11146883471712238,
 2.6128514366119275e-23,
 1.7986776086368326e-18,
 0.023000236181127023,
 3.083593576094446e-08,
 2.2927558393598515e-61,
 9.759883174772929e-24,
 0.0009441828426169297,
 0.38375571770738026,
 0.06978767870690113,
 0.008056369776353264,
 0.3405343499710523,
 0.43745396680738136,
 0.011296356366072539,
 0.023000236181127023,
 3.3287750238865536e-18,
 5.898915155380134e-12,
 0.47372320015969493,
 0.10268712155789393,
 0.0014516598512929596,
 0.6751907148766849,
 2.3017061937933766e-06,
 0.721

So how many tests did we run?

In [23]:
len(pvals)

132
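One thing to notice about that count: the nested loop visits every pair of columns in both orders, and transposing a contingency table doesn't change the result of the $\chi^2$ test, which is why values such as 2.1359e-10 appear twice in the list. With 12 columns there are only 66 distinct pairs. The sketch below just does the counting, using placeholder column names.

```python
from itertools import combinations, permutations

# Placeholder names standing in for the 12 usable survey columns.
columns = [f'col_{i}' for i in range(12)]

ordered_pairs = list(permutations(columns, 2))    # (a, b) and (b, a) both counted
unordered_pairs = list(combinations(columns, 2))  # each pair counted once

print(len(ordered_pairs), len(unordered_pairs))
```

Dividing the threshold by 132 rather than 66 is therefore on the conservative side.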

If we started with a typical Type I Error threshold of 5% and then added in the Bonferroni Correction, we'd be looking for $p$-values less than:

In [24]:
0.05/132

0.0003787878787878788

This would move some of the results above from "significant" to "not significant". But some significant results would be unaffected, including our finding about the association between age and Oxford comma preference.
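As a small illustration, the sketch below applies the corrected threshold to four of the $p$-values from earlier in the notebook. The labels are added here for readability; the household income label is inferred from the order of the columns in the loop.

```python
alpha = 0.05
n_tests = 132
threshold = alpha / n_tests  # Bonferroni-corrected threshold

# p-values copied from the outputs above, each against the Oxford comma question.
sample_pvals = {
    'age': 1.553656961877948e-14,
    'household income': 0.02507153894449127,
    'education': 0.13024170886530348,
    'gender': 0.9601037108628604,
}

# Keep only the results that clear the corrected threshold.
survivors = [name for name, p in sample_pvals.items() if p < threshold]

# Results that pass at 5% but fail after the correction.
demoted = [name for name, p in sample_pvals.items()
           if threshold <= p < alpha]

print(survivors)
print(demoted)
```

Here the age result survives comfortably, while the household income result is one of those that would have passed at the 5% level but no longer counts as significant.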