Title: Crosstabs 
Slug: crosstabs-python-pandas
Summary: Implementing a Crosstab in Python using pandas
Date: 2018-11-23 9:15  
Category: Python
Subcategory: Data Analysis in pandas
PostType: Tutorial
Keywords: crosstabs pandas
Tags: crosstabs, python, pandas
Authors: Dan Friedman

### Import Modules

In [2]:
import pandas as pd
import seaborn as sns

### Get Tips Dataset

Let's get the `tips` dataset from the `seaborn` library and assign it to the DataFrame `df_tips`.

In [6]:
df_tips = sns.load_dataset('tips')

Each row represents one meal for a party and has the following fields:

column name | column description 
--- | ---
`total_bill` | total cost of the meal in U.S. dollars
`tip` | amount of the meal's tip in U.S. dollars
`sex` | sex of the person paying the bill
`smoker` | whether the party included smokers (`Yes` or `No`)
`day` | day of the week
`time` | meal name (`Lunch` or `Dinner`)
`size` | count of people in the party

Preview the first 5 rows of `df_tips`. 

In [7]:
df_tips.head()

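| | total_bill | tip | sex | smoker | day | time | size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |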


### Implement Crosstabs with Tips Dataset

A **crosstab** computes an aggregated metric across two or more columns in a dataset that contain categorical values. By default, `crosstab()` in pandas computes a count (aka frequency) for each combination of values.
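
To make the default concrete, here is a minimal sketch on a small hand-made DataFrame. The `toy` data is hypothetical and made up for illustration; it is not the `tips` dataset. The sketch shows the default count and, as one option for a different metric, the `values` and `aggfunc` arguments of `crosstab()`.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the tips dataset).
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
    "tip": [2.0, 3.0, 1.0, 3.0, 4.0],
})

# Default behavior: count the rows for each (day, sex) combination.
counts = pd.crosstab(index=toy["day"], columns=toy["sex"])

# A different metric: pass the column to aggregate as `values`
# and name the aggregation in `aggfunc`.
mean_tip = pd.crosstab(index=toy["day"], columns=toy["sex"],
                       values=toy["tip"], aggfunc="mean")

print(counts)
print(mean_tip)
```

In this toy data, the `Fri`/`Male` cell of `counts` is 2, and the same cell of `mean_tip` is 2.0 (the mean of 1.0 and 3.0).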

Let's compute a simple crosstab across the `day` and `sex` columns. We want the returned index to hold the unique values from `day` and the returned columns to hold the unique values from `sex`. Each value in the table is the count of rows with that combination of index and column values. For example, 30 of Thursday's meals have a `sex` value of `Male`.

In [15]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

sex | Male | Female
--- | --- | ---
day | |
Thur | 30 | 32
Fri | 10 | 9
Sat | 59 | 28
Sun | 58 | 18


In this example, we passed in two columns from our DataFrame. However, one nice feature of `crosstab()` is that the data doesn't need to be in a DataFrame. For the `index` and `columns` arguments, you can also pass in two NumPy arrays.
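
As a rough illustration of the array case, the sketch below builds two small NumPy arrays by hand (hypothetical values, not taken from `tips`) and passes them straight to `crosstab()`. Plain arrays carry no names, so the `rownames` and `colnames` arguments are used here to label the axes.

```python
import numpy as np
import pandas as pd

# Hypothetical arrays, made up for illustration (not the tips dataset).
days = np.array(["Thur", "Thur", "Fri", "Fri", "Fri"])
sexes = np.array(["Male", "Female", "Male", "Male", "Female"])

# No DataFrame needed: pass the arrays directly and name the axes.
table = pd.crosstab(index=days, columns=sexes,
                    rownames=["day"], colnames=["sex"])

print(table)
```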

Let's double-check the counts in the `day` by `sex` crosstab. We can use filtering in pandas to verify that 30 Thursday meals have a `sex` value of `Male`. The result of 30 matches what we see in the crosstab.

In [23]:
len(df_tips.query("sex=='Male' and day=='Thur'"))

30

We can also get the totals for each row and column of the previous crosstab. Set the `margins` argument to `True` to add these totals.

In [16]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], margins=True)

sex | Male | Female | All
--- | --- | --- | ---
day | | |
Thur | 30 | 32 | 62
Fri | 10 | 9 | 19
Sat | 59 | 28 | 87
Sun | 58 | 18 | 76
All | 157 | 87 | 244
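
To see where the `All` values come from, the following sketch recomputes the totals by hand on a small hypothetical DataFrame (`toy` is made up for illustration) and compares them with the `margins=True` output.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the tips dataset).
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
})

counts = pd.crosstab(index=toy["day"], columns=toy["sex"])
with_totals = pd.crosstab(index=toy["day"], columns=toy["sex"], margins=True)

row_totals = counts.sum(axis=1)     # one total per day (the "All" column)
col_totals = counts.sum(axis=0)     # one total per sex (the "All" row)
grand_total = counts.values.sum()   # the bottom-right "All" cell

print(with_totals)
print(row_totals, col_totals, grand_total)
```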


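For each value in a row, we can calculate its proportion of the row's total. To do that, set the `normalize` argument to `'index'` (since the index labels the rows). The values in each row then sum to 1. With `margins=True`, the output gains an `All` row, which holds the proportions across all days, but no `All` column.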

In [12]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], margins=True, normalize='index')

sex | Male | Female
--- | --- | ---
day | |
Thur | 0.483871 | 0.516129
Fri | 0.526316 | 0.473684
Sat | 0.678161 | 0.321839
Sun | 0.763158 | 0.236842
All | 0.643443 | 0.356557
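
Row normalization amounts to dividing each count by its row total. The sketch below checks that on a small hypothetical DataFrame (`toy` is made up for illustration), comparing a by-hand division with `normalize='index'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration (not the tips dataset).
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
})

counts = pd.crosstab(index=toy["day"], columns=toy["sex"])

# Divide every row by its own total; axis=0 aligns the totals with the index.
by_hand = counts.div(counts.sum(axis=1), axis=0)

built_in = pd.crosstab(index=toy["day"], columns=toy["sex"], normalize="index")

# Both approaches should agree, and every row should sum to 1.
same = np.allclose(by_hand.values, built_in.values)
row_sums = built_in.sum(axis=1)
print(same)
print(row_sums)
```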


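For each value in a column, we can calculate its proportion of the column's total. To do that, set the `normalize` argument to `'columns'`. The values in each column then sum to 1. With `margins=True`, the output gains an `All` column, which holds each day's proportion of all meals, but no `All` row.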

In [19]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], margins=True, normalize='columns')

sex | Male | Female | All
--- | --- | --- | ---
day | | |
Thur | 0.191083 | 0.367816 | 0.254098
Fri | 0.063694 | 0.103448 | 0.077869
Sat | 0.375796 | 0.321839 | 0.356557
Sun | 0.369427 | 0.206897 | 0.311475
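
The normalized values are proportions between 0 and 1. If percentages are easier to read, one option is to multiply by 100 and round, as in this sketch on a small hypothetical DataFrame (`toy` is made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the tips dataset).
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
})

# Each column sums to 1 after column normalization.
col_share = pd.crosstab(index=toy["day"], columns=toy["sex"], normalize="columns")

# Convert proportions to percentages, rounded to one decimal place.
col_pct = (col_share * 100).round(1)

print(col_share)
print(col_pct)
```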


All of the returned outputs above look a bit strange because of the index and column names. In my opinion, a flattened version of the output is easier to read and work with. To get it, I'll chain the `reset_index()` method to the end. This turns `day` from the index into a regular column.

In [32]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], margins=True, normalize='columns').reset_index()

sex | day | Male | Female | All
--- | --- | --- | --- | ---
0 | Thur | 0.191083 | 0.367816 | 0.254098
1 | Fri | 0.063694 | 0.103448 | 0.077869
2 | Sat | 0.375796 | 0.321839 | 0.356557
3 | Sun | 0.369427 | 0.206897 | 0.311475
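
After `reset_index()`, the columns axis still carries the name `sex`, which is why `sex` shows up in the top-left corner of the output. If that label is unwanted, one way to drop it is to clear the name on the columns, sketched here on a small hypothetical DataFrame (`toy` is made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the tips dataset).
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri"],
    "sex": ["Male", "Female", "Male", "Male", "Female"],
})

flat = pd.crosstab(index=toy["day"], columns=toy["sex"]).reset_index()

# The columns axis keeps the name of the column we cross-tabulated on.
leftover_name = flat.columns.name
print(leftover_name)

# Clear the leftover name so only the plain column labels remain.
flat.columns.name = None
print(flat)
```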


You can learn more about using `crosstab()` from the official pandas <a href='https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.crosstab.html' rel='nofollow'>documentation page</a>.