## Pivot Tables

We have seen how the GroupBy abstraction lets us explore relationships within a dataset. A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen not across a one-dimensional index but across a two-dimensional grid.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')


In [2]:
titanic.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


### GroupBy

In [3]:
titanic.groupby('sex')[['survived']].mean()

Unnamed: 0_level_0,survived
sex,Unnamed: 1_level_1
female,0.742038
male,0.188908


In [8]:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()


class,First,Second,Third
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.968085,0.921053,0.5
male,0.368852,0.157407,0.135447


### Pivot Tables

In [9]:
titanic.pivot_table('survived', index='sex', columns='class')

class,First,Second,Third
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,0.968085,0.921053,0.5
male,0.368852,0.157407,0.135447
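The two cells above give the same table. As a minimal sketch of that equivalence, the following uses a small made-up DataFrame (`toy` and its values are illustrative assumptions, not the Titanic data) so it runs without downloading anything:

```python
import pandas as pd

# Small made-up table standing in for the Titanic columns used above.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "class": ["First", "Third", "First", "Third", "Third", "First"],
    "survived": [1, 0, 0, 0, 1, 1],
})

# GroupBy route: group on both keys, take the mean, then move 'class' into the columns.
via_groupby = toy.groupby(["sex", "class"])["survived"].mean().unstack()

# pivot_table route: one call; the default aggregation is the mean.
via_pivot = toy.pivot_table("survived", index="sex", columns="class")

# Compare the two tables cell by cell.
same = bool((via_groupby == via_pivot).all().all())
print(via_pivot)
print("identical:", same)
```

The pivot_table call is shorter and reads closer to the question being asked, which is the main reason to prefer it for this kind of two-dimensional summary.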


### Multi-level pivot tables

Just as in GroupBy, the grouping in pivot tables can be specified with multiple levels, and via a number of options. For example, we might be interested in looking at age as a third dimension. We'll bin the age using the pd.cut function:

In [10]:
age = pd.cut(titanic['age'],[0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')


Unnamed: 0_level_0,class,First,Second,Third
sex,age,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,"(0, 18]",0.909091,1.0,0.511628
female,"(18, 80]",0.972973,0.9,0.423729
male,"(0, 18]",0.8,0.6,0.215686
male,"(18, 80]",0.375,0.071429,0.133663
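To see what pd.cut does on its own, here is a small sketch on made-up ages. The `labels` values ("child", "adult") are names chosen for illustration only. Note that the bins are open on the left by default, so a value exactly equal to the lowest edge falls outside every bin:

```python
import pandas as pd

ages = pd.Series([0, 5, 18, 19, 80])  # made-up ages

# Same edges as above: two bins, (0, 18] and (18, 80].
binned = pd.cut(ages, [0, 18, 80])

# Optional labels replace the interval notation with readable names.
labeled = pd.cut(ages, [0, 18, 80], labels=["child", "adult"])

print(binned.tolist())
print(labeled.tolist())
```

An age of 18 lands in the first bin because each bin includes its right edge, while the age of 0 becomes missing because the left edge is excluded.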


We can apply the same strategy when working with the columns as well; let's add info on the fare paid using pd.qcut to automatically compute quantiles:

In [17]:
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])


fare            (-0.001, 14.454]                     (14.454, 512.329]  \
class                      First    Second     Third             First   
sex    age                                                               
female (0, 18]               NaN  1.000000  0.714286          0.909091   
       (18, 80]              NaN  0.880000  0.444444          0.972973   
male   (0, 18]               NaN  0.000000  0.260870          0.800000   
       (18, 80]              0.0  0.098039  0.125000          0.391304   

fare                                 
class              Second     Third  
sex    age                           
female (0, 18]   1.000000  0.318182  
       (18, 80]  0.914286  0.391304  
male   (0, 18]   0.818182  0.178571  
       (18, 80]  0.030303  0.192308  
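The difference between pd.qcut and pd.cut is easiest to see on skewed data. The sketch below uses made-up fares (not the Titanic values): pd.qcut picks edges so each bin holds roughly the same number of rows, while pd.cut with an integer splits the value range into equal-width bins.

```python
import pandas as pd

fares = pd.Series([5.0, 7.0, 8.0, 10.0, 30.0, 50.0, 80.0, 500.0])  # made-up fares

# Quantile-based: the split point is the median, so both halves are the same size.
by_quantile = pd.qcut(fares, 2)

# Width-based: the range 5..500 is cut in the middle, so one outlier gets its own bin.
by_width = pd.cut(fares, 2)

print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

For a heavily skewed column such as a fare, quantile bins usually give groups large enough for the per-cell means to be meaningful.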

The aggfunc keyword can also be a dictionary that maps each column to its own aggregation, which lets us compute a different descriptive statistic for each column:

In [19]:
# Use string names for the aggregations; passing the builtin sum is deprecated in recent pandas.
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived': 'sum', 'fare': 'mean'})


Unnamed: 0_level_0,fare,fare,fare,survived,survived,survived
class,First,Second,Third,First,Second,Third
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
female,106.125798,21.970121,16.11881,91,70,72
male,67.226127,19.741782,12.661633,45,17,47
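With a dictionary aggfunc the result has two levels of column labels, the aggregated column on top and the class underneath. A short sketch of how one might pull pieces out of such a table, using a made-up DataFrame (`toy` and its numbers are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male"],
    "class": ["First", "Third", "First", "Third"],
    "survived": [1, 0, 0, 1],
    "fare": [80.0, 8.0, 60.0, 7.0],
})

summary = toy.pivot_table(index="sex", columns="class",
                          aggfunc={"survived": "sum", "fare": "mean"})

# Selecting the top-level label returns the sub-table for that column.
mean_fares = summary["fare"]

# A (top, bottom) tuple selects a single column.
first_class_survivors = summary[("survived", "First")]

print(mean_fares)
print(first_class_survivors)
```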


At times it's useful to compute totals along each grouping. This can be done via the margins keyword. The margin label can be specified with the margins_name keyword, which defaults to "All":

In [20]:
titanic.pivot_table('survived', index='sex', columns='class', margins=True)

class,First,Second,Third,All
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,0.968085,0.921053,0.5,0.742038
male,0.368852,0.157407,0.135447,0.188908
All,0.62963,0.472826,0.242363,0.383838


The bottom-right "All" cell shows that the overall survival rate is about 38%.
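The margin cells are computed from the underlying rows, not by averaging the cell means. A small sketch with made-up data shows this, and also uses margins_name to relabel the totals (the label "Total" is just an example):

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "class": ["First", "First", "Third", "First", "Third", "Third"],
    "survived": [1, 1, 0, 0, 0, 1],
})

table = toy.pivot_table("survived", index="sex", columns="class",
                        margins=True, margins_name="Total")

# The grand total equals the plain mean of the whole column.
overall = toy["survived"].mean()

print(table)
print("overall mean:", overall)
```

Here the female row holds 1.0 and 0.0, yet its margin is 2/3 rather than 0.5, because two of the three female rows are in first class.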

#### Continued Pivot Tables & Cross Tabulation (Frequencies)

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible through the groupby facility described in this chapter combined with reshape operations utilizing hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.
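As a quick illustration of the two entry points, the sketch below calls both the method and the top-level function on a small made-up table with Month, Day, and Ozone columns (the values are invented for the example) and checks that they agree:

```python
import pandas as pd

toy = pd.DataFrame({
    "Month": [5, 5, 6, 6],
    "Day": [1, 2, 1, 2],
    "Ozone": [41.0, 36.0, 29.0, 71.0],
})

# Method form: the DataFrame is the object being summarized.
as_method = toy.pivot_table("Ozone", index="Day", columns="Month", margins=True)

# Function form: the DataFrame is passed as the first argument.
as_function = pd.pivot_table(toy, values="Ozone", index="Day",
                             columns="Month", margins=True)

print(as_function)
print("same result:", as_method.equals(as_function))
```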

In [3]:
import pandas as pd

filename = '~/Documents/Datasets/Ozone.csv'
data1 = pd.read_csv(filename)
print(data1.shape)

(153, 6)


In [4]:
data1.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


In [6]:
data1.pivot_table(index=['Day'])


Unnamed: 0_level_0,Month,Ozone,Solar.R,Temp,Wind
Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,7.0,77.75,199.0,80.2,6.78
2,7.0,43.0,174.8,80.8,9.16
3,7.0,33.25,177.4,79.4,9.62
4,7.0,62.333333,197.25,81.8,8.62
5,7.0,48.666667,163.333333,79.2,8.46
6,7.0,41.5,223.333333,79.8,12.04
7,7.0,54.2,241.8,80.8,7.66
8,7.0,57.0,217.6,81.2,9.52
9,7.0,61.4,203.8,81.6,11.7
10,7.0,49.333333,234.6,82.0,9.16


In [10]:
data1.pivot_table(['Ozone', 'Solar.R'], index=['Day'],
                  columns='Month')


Unnamed: 0_level_0,Ozone,Ozone,Ozone,Ozone,Ozone,Solar.R,Solar.R,Solar.R,Solar.R,Solar.R
Month,5,6,7,8,9,5,6,7,8,9
Day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
1,41.0,,135.0,39.0,96.0,190.0,286.0,269.0,83.0,167.0
2,36.0,,49.0,9.0,78.0,118.0,287.0,248.0,24.0,197.0
3,12.0,,32.0,16.0,73.0,149.0,242.0,236.0,77.0,183.0
4,18.0,,,78.0,91.0,313.0,186.0,101.0,,189.0
5,,,64.0,35.0,47.0,,220.0,175.0,,95.0
6,28.0,,40.0,66.0,32.0,,264.0,314.0,,92.0
7,23.0,29.0,77.0,122.0,20.0,299.0,127.0,276.0,255.0,252.0
8,19.0,,97.0,89.0,23.0,99.0,273.0,267.0,229.0,220.0
9,8.0,71.0,97.0,110.0,21.0,19.0,291.0,272.0,207.0,230.0
10,,39.0,85.0,,24.0,194.0,323.0,175.0,222.0,259.0


#### pivot_table Options

    values:  Column name or names to aggregate; by default aggregates all numeric columns
    index:   Column names or other group keys to group on the rows of the resulting pivot table
    columns: Column names or other group keys to group on the columns of the resulting pivot table
    aggfunc: Aggregation function or list of functions ('mean' by default); can be any function valid in a groupby context
    fill_value: Value used to replace missing values in the result table
    dropna:  If True (the default), do not include columns whose entries are all NA
    margins: Add row/column subtotals and grand total (False by default)
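A few of these options are easier to understand in combination. The following sketch, on made-up readings, passes a list to aggfunc and uses fill_value to replace the cell for a Day/Month pair that has no data:

```python
import pandas as pd

toy = pd.DataFrame({
    "Month": [5, 5, 6, 6],
    "Day": [1, 2, 1, 1],          # there is no reading for Day 2 in Month 6
    "Ozone": [41.0, 36.0, 20.0, 30.0],
})

table = toy.pivot_table(values="Ozone", index="Day", columns="Month",
                        aggfunc=["mean", "max"], fill_value=0)

# With a list of functions, the function name becomes the top column level.
print(table)
print(table[("mean", 6)])
```

Whether 0 is a sensible replacement depends on the data; for a measurement such as ozone it could be mistaken for a real reading, so leaving the cell as NaN is often the safer choice.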


A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. It is used to gather descriptive statistics on categorical variables. Passing margins=True adds an "All" row and column that hold the totals. Here is an example:

In [17]:
import pandas as pd

filename = '~/Documents/Datasets/adult.csv'
data2 = pd.read_csv(filename)
print(data2.shape)


(32561, 15)


In [19]:
data2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [20]:
data2.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education.num      int64
marital.status    object
occupation        object
relationship      object
race              object
sex               object
capital.gain       int64
capital.loss       int64
hours.per.week     int64
native.country    object
income            object
dtype: object

In [21]:
pd.crosstab(data2.occupation, data2.sex, margins=True)


sex,Female,Male,All
occupation,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
?,841,1002,1843
Adm-clerical,2537,1233,3770
Armed-Forces,0,9,9
Craft-repair,222,3877,4099
Exec-managerial,1159,2907,4066
Farming-fishing,65,929,994
Handlers-cleaners,164,1206,1370
Machine-op-inspct,550,1452,2002
Other-service,1800,1495,3295
Priv-house-serv,141,8,149
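To make the "special case" concrete, the sketch below builds the same frequency table two ways on a small made-up sample (the occupations and counts are invented), and then uses the normalize argument of pd.crosstab to turn counts into row shares:

```python
import pandas as pd

toy = pd.DataFrame({
    "occupation": ["Sales", "Sales", "Tech", "Tech", "Tech"],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
})

# Frequencies via crosstab.
plain = pd.crosstab(toy["occupation"], toy["sex"])

# The same counts via GroupBy: count rows per pair, then pivot 'sex' into columns.
via_groupby = toy.groupby(["occupation", "sex"]).size().unstack(fill_value=0)
same = bool((plain == via_groupby).all().all())

# With margins=True the totals are added; normalize="index" gives shares per row.
counts = pd.crosstab(toy["occupation"], toy["sex"], margins=True)
shares = pd.crosstab(toy["occupation"], toy["sex"], normalize="index")

print(counts)
print(shares)
```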


In [23]:
data2.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')

In [25]:
# Rename two dotted column names to shorter ones that also work with attribute access (e.g. data2.hours)

data2.rename(columns={'hours.per.week': 'hours', 'capital.gain': 'gain'}, inplace=True)


In [26]:
data2.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex', 'gain',
       'capital.loss', 'hours', 'native.country', 'income'],
      dtype='object')

In [31]:
pd.crosstab([data2.hours, data2.age], data2.sex, margins=True)

Unnamed: 0_level_0,sex,Female,Male,All
hours,age,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,21,0,1,1
1,23,0,1,1
1,27,0,1,1
1,45,1,0,1
1,57,0,1,1
1,58,0,1,1
1,62,1,0,1
1,65,0,1,1
1,66,0,1,1
1,67,1,0,1
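Crossing raw hours with raw age gives one row per distinct pair, so most rows hold a single person. One possible way to get a more compact table, assuming coarse bands are acceptable, is to bin both keys with pd.cut first. The sketch uses made-up rows, and the bin edges and labels are illustrative assumptions rather than anything taken from the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "hours": [1, 20, 40, 40, 60, 45, 10],
    "age": [21, 23, 45, 57, 62, 30, 50],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female", "Female"],
})

# Hypothetical bands: under 40 hours counts as part-time, 40 or more as full-time.
hours_band = pd.cut(toy["hours"], [0, 39, 99], labels=["part-time", "full-time"])
age_band = pd.cut(toy["age"], [0, 40, 100], labels=["40 or under", "over 40"])

compact = pd.crosstab([hours_band, age_band], toy["sex"])
print(compact)
```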
