# Associations Between Two Categorical Variables

### Introduction

In this lesson, we will cover ways of examining an association between two categorical variables.

As an example, we’ll explore a sample of data from the Narcissistic Personality Inventory (NPI-40), a personality test with 40 questions about personal preferences and self-view. There are two possible responses to each question.

In [7]:
import pandas as pd

npi = pd.read_csv("npi_sample.csv")

display(npi.head())

  influence blend_in special leader authority
0        no      yes     yes    yes       yes
1        no      yes      no     no        no
2       yes       no     yes    yes       yes
3       yes       no      no    yes       yes
4       yes      yes      no    yes        no


### Contingency Tables: Frequencies

Contingency tables, also known as two-way tables or cross-tabulations, summarize two variables at the same time. For example, suppose we want to know whether there is an association between `special` and `authority`, two of the yes/no questions in the sample. We can use the `crosstab` function from pandas to create a contingency table of response counts:

In [9]:
special_authority_freq = pd.crosstab(npi.special, npi.authority)
display(special_authority_freq)

authority    no   yes
special              
no         4069  1905
yes        2229  2894
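
To see exactly what `crosstab` counts, it can help to run it on a sample small enough to check by eye. The sketch below uses a hypothetical six-respondent DataFrame (the values are made up for illustration and are not taken from `npi_sample.csv`). It also shows the `margins=True` option, which appends row and column totals.

```python
import pandas as pd

# Hypothetical six-respondent sample (NOT taken from npi_sample.csv)
sample = pd.DataFrame({
    "special":   ["yes", "yes", "no", "no", "yes", "no"],
    "authority": ["yes", "no",  "no", "no", "yes", "yes"],
})

# Rows are the categories of `special`, columns the categories of `authority`;
# each cell counts the respondents with that pair of answers
sample_freq = pd.crosstab(sample["special"], sample["authority"])
print(sample_freq)

# margins=True appends row and column totals labeled "All"
sample_freq_totals = pd.crosstab(
    sample["special"], sample["authority"], margins=True
)
print(sample_freq_totals)
```

Each cell can be confirmed by counting matching rows in `sample`, and the bottom-right `All` cell equals the number of respondents.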


### Contingency Tables: Proportions

In the previous exercise, we looked at the association between the `special` and `authority` questions using a contingency table of frequencies. However, sometimes it’s helpful to convert those frequencies to proportions. We can do this by dividing all the frequencies in the contingency table by the total number of observations (the sum of the frequencies).

The resulting contingency table makes it slightly easier to compare the proportion of people in each category.

In [12]:
special_authority_prop = special_authority_freq/len(npi)
display(special_authority_prop)

authority        no       yes
special                      
no         0.366676  0.171668
yes        0.200865  0.260791
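
Dividing by `len(npi)` gives the same result as dividing by the sum of the frequencies only when every respondent answered both questions, which appears to be the case here. As a minimal sketch, the table can be rebuilt from the four counts printed earlier and divided by its own total, which does not depend on that assumption. The DataFrame below is reconstructed by hand from those counts rather than read from the CSV.

```python
import pandas as pd

# Observed counts copied from the frequency table in this lesson
special_authority_freq = pd.DataFrame(
    [[4069, 1905], [2229, 2894]],
    index=pd.Index(["no", "yes"], name="special"),
    columns=pd.Index(["no", "yes"], name="authority"),
)

# Total number of respondents counted in the table
total = special_authority_freq.to_numpy().sum()

# Convert every cell to a proportion of the total
special_authority_prop = special_authority_freq / total
print(total)
print(special_authority_prop.round(6))

# The four proportions should add up to 1
print(special_authority_prop.to_numpy().sum())
```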


### Marginal Proportions

Now, let’s take a moment to think about what the tables would look like if there were no association between the variables.

Say that we notice that the top row, which corresponds to people who answered no to the `special` question, accounts for 0.3667 + 0.1717 = 0.5384 (or 53.8%) of surveyed people, which is more than half. This means that we can expect higher proportions in the top row, regardless of whether the questions are associated.

The proportion of respondents in each category of a single question is called a _marginal proportion_.

In [14]:
# calculate and print authority_marginals
authority_marginals = special_authority_prop.sum(axis=0)
print(authority_marginals)

authority
no     0.567541
yes    0.432459
dtype: float64


In [15]:
# calculate and print special_marginals
special_marginals = special_authority_prop.sum(axis=1)
print(special_marginals)

special
no     0.538344
yes    0.461656
dtype: float64
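
pandas can also attach the marginal proportions to the table itself. The sketch below assumes the raw responses are not at hand, so it rebuilds an equivalent two-column DataFrame from the four observed counts (the name `rebuilt` is hypothetical). It then asks `crosstab` for proportions (`normalize=True`) together with margins (`margins=True`).

```python
import pandas as pd

# Observed counts for each (special, authority) pair, copied from the lesson
counts = {
    ("no", "no"): 4069, ("no", "yes"): 1905,
    ("yes", "no"): 2229, ("yes", "yes"): 2894,
}

# Rebuild one row per respondent so crosstab has raw data to work with
rows = [pair for pair, n in counts.items() for _ in range(n)]
rebuilt = pd.DataFrame(rows, columns=["special", "authority"])

# The "All" column holds the special marginals,
# the "All" row holds the authority marginals
prop_with_marginals = pd.crosstab(
    rebuilt["special"], rebuilt["authority"], normalize=True, margins=True
)
print(prop_with_marginals.round(6))
```

The `All` row and column should agree with `authority_marginals` and `special_marginals` computed above.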


### Expected Contingency Tables

To understand whether these questions are associated, we can use the marginal proportions to create a contingency table of the proportions we would expect if there were no association between the variables. To calculate these expected proportions, we multiply the marginal proportions for each combination of categories. Multiplying an expected proportion by the total number of respondents gives the corresponding expected frequency.

In Python, we can calculate the table of expected frequencies using the `chi2_contingency()` function from SciPy, by passing in the observed frequency table. There are actually four outputs from this function (the Chi-Square statistic, a p-value, the degrees of freedom, and the table of expected frequencies), but for now, we’ll only look at the fourth one.

The more the expected and observed tables differ, the stronger the evidence that the variables are associated.

In [20]:
from scipy.stats import chi2_contingency
import numpy as np

# print the observed frequencies so they are on the same scale as the expected table
print("observed contingency table:")
print(special_authority_freq)

# calculate the expected frequencies if there's no association and save them as expected
chi2, pval, dof, expected = chi2_contingency(special_authority_freq)

# print out the expected frequency table
print("\nexpected contingency table (no association):")
print(np.round(expected))


observed contingency table:
authority    no   yes
special              
no         4069  1905
yes        2229  2894

expected contingency table (no association):
[[3390. 2584.]
 [2908. 2215.]]
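
The multiplication of marginal proportions described above can be sketched by hand and compared with SciPy's result. This example rebuilds the observed table from the counts printed in this lesson; `np.outer` is one way (not necessarily the lesson's intended one) to multiply every row marginal by every column marginal.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts copied from the frequency table in this lesson
observed = pd.DataFrame(
    [[4069, 1905], [2229, 2894]],
    index=pd.Index(["no", "yes"], name="special"),
    columns=pd.Index(["no", "yes"], name="authority"),
)
n = observed.to_numpy().sum()

# Marginal proportions for each question
special_marginals = observed.sum(axis=1) / n
authority_marginals = observed.sum(axis=0) / n

# Expected proportion in each cell = row marginal * column marginal
expected_prop = np.outer(special_marginals, authority_marginals)

# Scale back to counts to get expected frequencies
expected_freq = expected_prop * n
print(np.round(expected_freq))

# Compare with the fourth output of chi2_contingency
_, _, _, expected_scipy = chi2_contingency(observed)
print(np.allclose(expected_freq, expected_scipy))
```

If the two tables agree, the hand calculation confirms that the expected table is nothing more than the product of the marginals scaled by the sample size.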


### The Chi-Square Statistic

In the previous exercise, we calculated a contingency table of expected frequencies if there were no association between the `authority` and `special` questions. We then compared this to the observed contingency table. Because the tables looked somewhat different, we concluded that responses to these questions are probably associated.

While we can inspect these tables visually, many data scientists use the Chi-Square statistic to summarize how different these two tables are. To calculate the Chi-Square statistic, we find the squared difference between each value in the observed table and its corresponding value in the expected table, divide that number by the value from the expected table, and finally add up the results:

$$
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
$$

The interpretation of the Chi-Square statistic depends on the size of the contingency table. For a 2x2 table (like the one we’ve been investigating), a Chi-Square statistic larger than around 4 would suggest an association between the variables (the 5% critical value for one degree of freedom is about 3.84). In this example, our Chi-Square statistic is much larger than that, which adds to our evidence that the variables are associated.

In [21]:
print(chi2)

679.1219526170606
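
The formula can be checked by hand against SciPy. One detail to be aware of: for a 2x2 table, `chi2_contingency()` applies Yates' continuity correction by default, so the value printed above is slightly smaller than what the plain formula gives. Passing `correction=False` turns the correction off. The sketch below rebuilds the observed table from the counts in this lesson.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts (rows: special no/yes, columns: authority no/yes)
observed = np.array([[4069, 1905], [2229, 2894]])

# Default call: Yates' continuity correction is applied for 2x2 tables
chi2_default, _, dof, expected = chi2_contingency(observed)

# Plain formula: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Same statistic from SciPy with the correction switched off
chi2_uncorrected, _, _, _ = chi2_contingency(observed, correction=False)

print(chi2_default)      # matches the value printed in the lesson
print(chi2_manual)       # plain formula, no correction
print(chi2_uncorrected)  # should equal chi2_manual
```

The difference between the corrected and uncorrected values is small here and does not change the conclusion.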


### Review

In [6]:
import pandas as pd
from scipy.stats import chi2_contingency
import numpy as np

npi = pd.read_csv("npi_sample.csv")

print(npi.head())

print("\nLeader and authority association:\n")

leader_authority_freq = pd.crosstab(npi.leader, npi.authority)
print(leader_authority_freq)

# save the table of proportions as leader_authority_prop:
leader_authority_prop = leader_authority_freq/len(npi)
print(leader_authority_prop)

# calculate and print authority_marginals
authority_marginals = leader_authority_prop.sum(axis=0)
print(authority_marginals)

# calculate and print leader_marginals
leader_marginals = leader_authority_prop.sum(axis=1)
print(leader_marginals)

print("observed contingency table:")
print(leader_authority_freq)

# calculate the expected contingency table if there's no association and save it as expected
chi2, pval, dof, expected = chi2_contingency(leader_authority_freq)

# print out the expected frequency table
print("expected contingency table (no association):")
print(np.round(expected))

# print the chi-squared statistic (already saved as chi2 by chi2_contingency above):
print(chi2)

  influence blend_in special leader authority
0        no      yes     yes    yes       yes
1        no      yes      no     no        no
2       yes       no     yes    yes       yes
3       yes       no      no    yes       yes
4       yes      yes      no    yes        no

Leader and authority association:

authority    no   yes
leader               
no         3820  1555
yes        2478  3244
authority        no       yes
leader                       
no         0.344237  0.140128
yes        0.223304  0.292331
authority
no     0.567541
yes    0.432459
dtype: float64
leader
no     0.484365
yes    0.515635
dtype: float64
observed contingency table:
authority    no   yes
leader               
no         3820  1555
yes        2478  3244
expected contingency table (no association):
[[3051. 2324.]
 [3247. 2475.]]
869.2684782761069
