Title: crosstab() Function: Compute Aggregated Metrics Across Categorical Columns
Slug: crosstabs-python-pandas
Summary: Learn how to implement a crosstab in Python using pandas with simple examples
Date: 2018-11-23 9:15  
Category: Python
Subcategory: Data Analysis in pandas
PostType: Tutorial
Keywords: crosstabs pandas
Tags: crosstabs, python, pandas
Authors: Dan Friedman

A **crosstab** (cross-tabulation) computes an aggregated metric, by default a count, for each combination of values from two or more categorical columns in a dataset.

### Import Modules

In [63]:
import pandas as pd
import seaborn as sns

### Get Tips Dataset

Let's get the `tips` dataset from the `seaborn` library and assign it to the DataFrame `df_tips`.

In [64]:
df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column name | column description 
--- | ---
`total_bill` | financial amount of meal in U.S. dollars
`tip` |  financial amount of the meal's tip in U.S. dollars
`sex` | gender of server
`smoker` | whether the server smokes (`Yes` or `No`)
`day` | day of week
`time` | meal name (Lunch or Dinner)
`size` | count of people eating meal

Preview the first 5 rows of `df_tips`. 

In [65]:
df_tips.head()

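index | total_bill | tip | sex | smoker | day | time | size
--- | --- | --- | --- | --- | --- | --- | ---
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3
2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4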


### Implement Crosstabs with Tips Dataset

Let's compute a simple crosstab across the `day` and `sex` columns. We want our returned index to be the unique values from `day` and our returned columns to be the unique values from `sex`. By default in pandas, `crosstab()` computes a count (also known as a frequency) as its aggregated metric.

So, each value inside our table represents the count for one combination of index and column. For example, males served 30 unique groups across all Thursdays in our dataset.

In [66]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'])

day / sex | Male | Female
--- | --- | ---
Thur | 30 | 32
Fri | 10 | 9
Sat | 59 | 28
Sun | 58 | 18


One issue with this crosstab output is that the column names are not very descriptive: `Male` and `Female` alone don't say what is being counted. Let's rename them to be clearer. We can use the <a href='https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html' rel='nofollow'>rename()</a> method and set the `columns` argument to a dictionary in which the keys are the current column names and the values are the new names.

In [67]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex']).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

day / sex | count_meals_served_by_males | count_meals_served_by_females
--- | --- | ---
Thur | 30 | 32
Fri | 10 | 9
Sat | 59 | 28
Sun | 58 | 18


In this example, we passed in two columns from our DataFrame. However, one nice feature of `crosstab()` is that the data doesn't need to be in a DataFrame. For the `index` and `columns` arguments, you can also pass in two NumPy arrays of equal length.
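As a minimal sketch of that point, the snippet below builds a crosstab straight from two NumPy arrays. The arrays hold a handful of made-up values (they are not rows from `tips`), and the `rownames` and `colnames` arguments label the axes, since bare arrays carry no names of their own.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not taken from the tips dataset)
days = np.array(["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"])
sexes = np.array(["Male", "Female", "Male", "Male", "Female", "Female"])

# rownames/colnames name the axes because plain arrays have no labels
counts = pd.crosstab(index=days, columns=sexes, rownames=["day"], colnames=["sex"])
print(counts)
```

Because the inputs are plain strings rather than pandas categoricals, the row and column labels come back in sorted order.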

Let's double-check the counts in the crosstab. We can filter `df_tips` with `query()` to verify that males served 30 meals on Thursdays. The result below matches the 30 we see in the crosstab.

In [68]:
len(df_tips.query("sex=='Male' and day=='Thur'"))

30
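To check every cell at once instead of running one query per cell, one option is to rebuild the table with `groupby()`. The sketch below uses a small made-up DataFrame (`df_toy` is a hypothetical stand-in for `df_tips`) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for df_tips
df_toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

counts = pd.crosstab(index=df_toy["day"], columns=df_toy["sex"])

# Count rows per (day, sex) pair, then pivot sex into columns;
# pairs that never occur are filled with 0, matching the crosstab
grouped = df_toy.groupby(["day", "sex"]).size().unstack(fill_value=0)

# True when every cell of the two tables agrees
tables_match = bool((counts.to_numpy() == grouped.to_numpy()).all())
print(tables_match)
```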

We can also get the total of each row and column of the previous crosstab. Set the argument `margins` to `True` to add these totals. By default, the added row and column are both named `All`.

In [69]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], margins=True).rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

day / sex | count_meals_served_by_males | count_meals_served_by_females | All
--- | --- | --- | ---
Thur | 30 | 32 | 62
Fri | 10 | 9 | 19
Sat | 59 | 28 | 87
Sun | 58 | 18 | 76
All | 157 | 87 | 244
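The margins are just row and column sums of the inner counts. A small sketch on made-up data (`df_toy` is hypothetical, not the `tips` dataset) makes that relationship explicit:

```python
import pandas as pd

# Hypothetical toy data standing in for df_tips
df_toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

table = pd.crosstab(index=df_toy["day"], columns=df_toy["sex"], margins=True)

# Strip the margin row and column to recover the plain counts
inner = table.drop(index="All", columns="All")

# Each "All" entry should equal the sum of the cells it summarizes
row_totals_ok = bool((inner.sum(axis=1) == table.loc[inner.index, "All"]).all())
col_totals_ok = bool((inner.sum(axis=0) == table.loc["All", inner.columns]).all())

# The bottom-right cell is the total number of rows in the data
grand_total = table.loc["All", "All"]
print(row_totals_ok, col_totals_ok, grand_total)
```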


In the `crosstab()` function, we can also rename the `All` row and column. Use the same arguments as above and set the argument `margins_name` to `count_meals_served`.

In [70]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], margins=True, margins_name="count_meals_served").rename(columns={"Male": "count_meals_served_by_males", "Female": "count_meals_served_by_females"})

day / sex | count_meals_served_by_males | count_meals_served_by_females | count_meals_served
--- | --- | --- | ---
Thur | 30 | 32 | 62
Fri | 10 | 9 | 19
Sat | 59 | 28 | 87
Sun | 58 | 18 | 76
count_meals_served | 157 | 87 | 244


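For each cell value, we can calculate what proportion it is of its row's total. To do that, set the `normalize` argument to `'index'` (since the index labels the rows). With this setting, the output keeps the margin row, which holds the overall proportions across all days, but has no margin column because every row sums to 1.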

In [71]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], margins=True, margins_name="proportion_meals_served", normalize='index').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})

day / sex | proportion_meals_served_by_males | proportion_meals_served_by_females
--- | --- | ---
Thur | 0.483871 | 0.516129
Fri | 0.526316 | 0.473684
Sat | 0.678161 | 0.321839
Sun | 0.763158 | 0.236842
proportion_meals_served | 0.643443 | 0.356557
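A quick way to sanity-check row normalization is to confirm that each row sums to 1. The sketch below does this on a small made-up DataFrame (`df_toy` is hypothetical, not the `tips` dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_tips
df_toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

# Each cell becomes its share of the row (day) total
row_props = pd.crosstab(index=df_toy["day"], columns=df_toy["sex"], normalize="index")

# Every row of proportions should add up to 1
rows_sum_to_one = bool(np.allclose(row_props.sum(axis=1), 1.0))
print(row_props)
print(rows_sum_to_one)
```

In this toy data, two of the three Friday meals have `sex` equal to `Male`, so that cell is 2/3.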


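For each cell value, we can also calculate what proportion it is of its column's total. To do that, set the `normalize` argument to `'columns'`. Here the output keeps the margin column, which holds each day's share of all meals, but has no margin row because every column sums to 1.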

In [72]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], margins=True, margins_name="proportion_meals_served", normalize='columns').rename(columns={"Male": "proportion_meals_served_by_males", "Female": "proportion_meals_served_by_females"})

day / sex | proportion_meals_served_by_males | proportion_meals_served_by_females | proportion_meals_served
--- | --- | --- | ---
Thur | 0.191083 | 0.367816 | 0.254098
Fri | 0.063694 | 0.103448 | 0.077869
Sat | 0.375796 | 0.321839 | 0.356557
Sun | 0.369427 | 0.206897 | 0.311475
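Column normalization amounts to dividing each count by its column's total. The sketch below reproduces it by hand on made-up data (`df_toy` is hypothetical, not the `tips` dataset) and compares the result with `normalize='columns'`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_tips
df_toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

counts = pd.crosstab(index=df_toy["day"], columns=df_toy["sex"])
col_props = pd.crosstab(index=df_toy["day"], columns=df_toy["sex"], normalize="columns")

# Divide every column by its own total to get the same proportions manually
manual = counts.div(counts.sum(axis=0), axis="columns")

props_match = bool(np.allclose(col_props.to_numpy(), manual.to_numpy()))
print(props_match)
```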


Given two categorical columns, `crosstab()` can also use a third column of numerical values to perform an aggregate operation. That sentence may sound daunting, so let's walk through it with a simple example.

We know our dataset contains `total_bill` values for meals served by males on Thursdays. Below, I preview the first few `total_bill` values that meet these criteria.

In [73]:
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].head()

77    27.20
78    22.76
79    17.29
80    19.44
81    16.66
Name: total_bill, dtype: float64

We may want to know the average bill size of the meals that meet the criteria above. Given that Series, we can calculate the mean, which comes to 18.71.

In [74]:
df_tips.query("sex=='Male' and day=='Thur'")['total_bill'].mean()

18.714666666666666

Now, we can perform this same operation using the `crosstab()` method. Same as before, we want our returned index to be the unique values from `day` and our returned columns to be the unique values from `sex`. Additionally, we want the values inside the table to be from our `total_bill` column so we'll set the argument `values` to be `df_tips['total_bill']`. We also want to calculate the mean total bill for each combination of a day and gender so we'll set the `aggfunc` argument to `'mean'`.

In [75]:
pd.crosstab(index=df_tips['day'], columns=df_tips['sex'], values=df_tips['total_bill'], aggfunc='mean').rename(columns={"Male": "mean_bill_size_meals_served_by_males", "Female": "mean_bill_size_meals_served_by_females"})

day / sex | mean_bill_size_meals_served_by_males | mean_bill_size_meals_served_by_females
--- | --- | ---
Thur | 18.714667 | 16.715312
Fri | 19.857 | 14.145556
Sat | 20.802542 | 19.680357
Sun | 21.887241 | 19.872222


This crosstab returns the same 18.71 value for males on Thursdays, as expected!
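As a further cross-check, the same table of means can be produced with the DataFrame's `pivot_table()` method. The sketch below compares the two on made-up data (`df_toy` and its bill amounts are hypothetical, not taken from `tips`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_tips
df_toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "total_bill": [10.0, 20.0, 30.0, 50.0, 15.0, 25.0],
})

mean_bills = pd.crosstab(
    index=df_toy["day"],
    columns=df_toy["sex"],
    values=df_toy["total_bill"],
    aggfunc="mean",
)

# pivot_table takes column names instead of Series but aggregates the same way
pivoted = df_toy.pivot_table(index="day", columns="sex", values="total_bill", aggfunc="mean")

# equal_nan=True treats the empty (Sat, Male) cell as matching in both tables
tables_match = bool(np.allclose(mean_bills.to_numpy(), pivoted.to_numpy(), equal_nan=True))
print(mean_bills)
print(tables_match)
```

Note that a combination with no rows, such as Saturday and `Male` in this toy data, shows up as `NaN` rather than 0 when `values` and `aggfunc` are set.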

We can pass many other aggregate functions to the `aggfunc` argument too, such as the sum, median, or standard deviation.
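For instance, here is a minimal sketch that swaps in `'sum'` and `'median'` for `aggfunc`, again on made-up data (`df_toy` and its bill amounts are hypothetical, not taken from `tips`):

```python
import pandas as pd

# Hypothetical toy data standing in for df_tips
df_toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Fri", "Sat"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "total_bill": [10.0, 20.0, 30.0, 50.0, 15.0, 25.0],
})

# Total of all bills for each (day, sex) combination
sum_bills = pd.crosstab(
    index=df_toy["day"], columns=df_toy["sex"],
    values=df_toy["total_bill"], aggfunc="sum",
)

# Median bill for each (day, sex) combination
median_bills = pd.crosstab(
    index=df_toy["day"], columns=df_toy["sex"],
    values=df_toy["total_bill"], aggfunc="median",
)

print(sum_bills)
print(median_bills)
```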

You can learn more about using `crosstab()` on the official pandas <a href='https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.crosstab.html' rel='nofollow'>documentation page</a>.