# SI 618: Categorical Data



# Categorical Data

Categorical data take on one of a limited number of values (i.e. categories) (Wikipedia). Examples: blood type (A, B, AB, O); types of rock (sedimentary, metamorphic, igneous).

## Contingency tables, crosstabs, and chi-square

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline

Let's generate a data frame to play with:

In [None]:
df = pd.DataFrame({'color' : ['red', 'green', 'green', 'black'] * 6,
                   'make' : ['ford', 'toyota', 'dodge'] * 8,
                   'vehicleClass' : ['suv', 'suv', 'suv', 'car', 'car', 'truck'] * 4})

In [None]:
df

One of the most basic transformations we can do is a crosstab:

In [None]:
ct = pd.crosstab(df.color,df.vehicleClass)
ct

Notice how similar it is to pivoting.  We can use ```pivot_table``` to create a DataFrame similar to the one from the ```crosstab``` above:

In [None]:
p = df.pivot_table(index='color',columns='vehicleClass',aggfunc=len)
p

But that's not quite right; can you figure out how to make that pivot table **exactly** like the crosstab?

In [None]:
p = df.pivot_table(index='color',columns='vehicleClass',aggfunc=len,fill_value=0)
p

In [None]:
p.columns = p.columns.droplevel()
p
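As a cross-check, the same table can also be built with `groupby`. This is only a sketch: it rebuilds the toy data frame so that it runs on its own, and `counts` is just a local name chosen for the illustration.

```python
import pandas as pd

# Same toy data as above, repeated so the cell is self-contained
df = pd.DataFrame({'color' : ['red', 'green', 'green', 'black'] * 6,
                   'make' : ['ford', 'toyota', 'dodge'] * 8,
                   'vehicleClass' : ['suv', 'suv', 'suv', 'car', 'car', 'truck'] * 4})

ct = pd.crosstab(df.color, df.vehicleClass)

# Count rows per (color, vehicleClass) pair, then spread vehicleClass into columns;
# fill_value=0 covers pairs that never occur (e.g. red trucks)
counts = df.groupby(['color', 'vehicleClass']).size().unstack(fill_value=0)

print(counts)
# Compare the counts cell by cell with the crosstab
print((counts.values == ct.values).all())
```

Counting rows with `size()` avoids the extra column level that `pivot_table` produces when no `values` column is given.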

As usual, we would like to visualize our results:

In [None]:
import seaborn as sns
sns.heatmap(ct,annot=True)

You might want to investigate other palettes; see https://seaborn.pydata.org/tutorial/color_palettes.html for more details.



In [None]:
sns.heatmap(ct,annot=True,cmap=sns.cubehelix_palette())

### Titanic data

One of the more popular datasets that we use for experimenting with crosstabs is the 
survivor data from the Titanic disaster:

In [None]:
titanic = pd.read_csv('https://raw.githubusercontent.com/umsi-data-science/si370/master/data/titanic.csv')

In [None]:
titanic

Let's create a crosstab of the data:

In [None]:
ct = pd.crosstab(titanic.passtype,titanic.status,margins=False)
ct

In [None]:
sns.heatmap(ct,annot=True,cmap=sns.cubehelix_palette())

Does scientific notation bother you?  Change the format with the ```fmt=``` argument:

In [None]:
sns.heatmap(ct,annot=True,cmap=sns.cubehelix_palette(),fmt='d')

In addition to the heatmap shown above, we can use a mosaic plot to visualize 
contingency tables:

In [None]:
from statsmodels.graphics.mosaicplot import mosaic
t = mosaic(titanic, ['passtype','status'],title='titanic survival')

In [None]:
# slightly easier to read
props = lambda key: {'color': 'r' if 'alive' in key else 'gray'}
t = mosaic(titanic, ['passtype','status'],title='titanic survival',properties=props)

Let's take a look at the "expected" values for each cell, that is, the values that we would expect if "passtype" had no effect on "status". To do this, we need the marginal totals:

In [None]:
ct = pd.crosstab(titanic.passtype,titanic.status,margins=True)
ct

The **expected** value for each cell (i.e. the value that you would expect if there were no association between passtype and status in this dataset) is the row total multiplied by the column total, divided by the overall total.

So we would get the following expected value for alive crew:

In [None]:
exp = ct['All'].loc['crew'] * ct['alive'].loc['All'] / ct['All'].loc['All']

In [None]:
exp

You could repeat this for each cell (or write code to do so), but you get the idea.
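One way to write that code, sketched here on the toy vehicle data from earlier so that it runs without a download: the outer product of the row totals and the column totals, divided by the grand total, gives every expected count at once. The names `observed` and `expected` are illustrative, not part of any library.

```python
import numpy as np
import pandas as pd

# Toy data from the start of the notebook (only the two columns we need)
df = pd.DataFrame({'color' : ['red', 'green', 'green', 'black'] * 6,
                   'vehicleClass' : ['suv', 'suv', 'suv', 'car', 'car', 'truck'] * 4})

# Observed counts, without margins
observed = pd.crosstab(df.color, df.vehicleClass)

row_totals = observed.sum(axis=1)   # one total per color
col_totals = observed.sum(axis=0)   # one total per vehicleClass
n = observed.values.sum()           # grand total

# expected[i, j] = row_totals[i] * col_totals[j] / n
expected = pd.DataFrame(np.outer(row_totals, col_totals) / n,
                        index=observed.index, columns=observed.columns)
print(expected)
```

Applied to `pd.crosstab(titanic.passtype, titanic.status)`, the same steps should reproduce the alive-crew value computed above, along with every other cell.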

## Let's talk about $\chi^2$

Finally, we can go beyond visual exploration and apply analytic tests to see if the observed values differ from the expected ones. The chi-square test sums the squared differences between the observed and expected values, each divided by the corresponding expected value.

Our null hypothesis is that there is no difference in survivorship based on passage type.
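In symbols, with $O_{ij}$ the observed count and $E_{ij}$ the expected count in row $i$ and column $j$, the statistic is $\chi^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$. The sketch below computes it by hand on the toy vehicle data (chosen only so that the cell runs offline) and compares the result with `scipy.stats.chi2_contingency`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data from the start of the notebook (only the two columns we need)
df = pd.DataFrame({'color' : ['red', 'green', 'green', 'black'] * 6,
                   'vehicleClass' : ['suv', 'suv', 'suv', 'car', 'car', 'truck'] * 4})
observed = pd.crosstab(df.color, df.vehicleClass).values

# Expected counts under independence: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Sum of squared differences, each divided by its expected count
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi2_scipy, p_value, dof_scipy, _ = chi2_contingency(observed)
print(chi2_manual, chi2_scipy)
print(dof_manual, dof_scipy)
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, so the hand-computed value would match only with `correction=False`.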

Here's a video resource that explains chi-squared:

In [None]:
from IPython.display import YouTubeVideo
vid = YouTubeVideo("VskmMgXmkMQ")
display(vid)

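In [None]:
from scipy.stats import chi2_contingency
# Rebuild the table without margins: the 'All' row and column are totals,
# not observations, and including them would inflate the degrees of freedom.
ct = pd.crosstab(titanic.passtype,titanic.status)
chi2, p, dof, ex = chi2_contingency(ct)
print("chi2 = ", chi2)
print("p-val = ", p)
print("degrees of freedom = ",dof)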

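As a bonus, we also get an array of the expected values, which we can wrap in a DataFrame with the same labels as the crosstab:

In [None]:
pd.DataFrame(ex,index=ct.index,columns=ct.columns)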