## Pandas (Pivot Tables)

A pivot table is an operation similar to groupby(), commonly used in spreadsheets and other programs that work with tabular data.

A pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summary of the data.

The difference between a pivot table and groupby() can sometimes cause confusion. It helps to think of a pivot table as a multidimensional version of a groupby aggregation: the split-apply-combine process runs across two keys, one for the rows and one for the columns, instead of along a single index.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

#### Motivating Pivot Tables

We will use the Titanic dataset, which is available through seaborn.
It contains information about the passengers who were on board the ship.

In [2]:
titanic = sns.load_dataset('titanic')

In [3]:
type(titanic)

pandas.core.frame.DataFrame

In [4]:
titanic.shape

(891, 15)

In [5]:
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


### Pivot Tables by Hand

Let's group the data by sex and compute the survival rate, that is, the mean of the survived column within each group.

In [7]:
titanic.groupby('sex')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7fdb95530ac8>

In [23]:
titanic.groupby(['sex', 'survived'])['survived'].count()

sex     survived
female  0            81
        1           233
male    0           468
        1           109
Name: survived, dtype: int64

In [24]:
233/ (233 + 81)  

0.7420382165605095

In [6]:
titanic.groupby('sex')[['survived']].mean()

        survived
sex
female  0.742038
male    0.188908
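The ratio computed by hand and the group mean agree because survived is a 0/1 column, so its mean is the fraction of ones. A minimal sketch on a made-up toy DataFrame (the data below is hypothetical, not taken from the Titanic dataset):

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive
df = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "survived": [1, 1, 0, 0, 0, 1, 0],
})

# Count passengers per (sex, survived) pair, as in the cell above
counts = df.groupby(["sex", "survived"])["survived"].count()

# Survival rate for women, computed by hand from the counts
rate_by_hand = counts.loc[("female", 1)] / counts.loc["female"].sum()

# The same rate, obtained directly as the mean of the 0/1 column
rate_by_mean = df.groupby("sex")["survived"].mean()["female"]

print(rate_by_hand, rate_by_mean)
```

Both values should be identical, which is why the mean aggregation is a convenient shortcut for rates on binary columns.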


If we also want the survival rate broken down by passenger class, we can proceed the same way, but the code becomes harder to read.

In [27]:
# group by sex and class, then compute the mean
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean')

sex     class 
female  First     0.968085
        Second    0.921053
        Third     0.500000
male    First     0.368852
        Second    0.157407
        Third     0.135447
Name: survived, dtype: float64

In [28]:
# unstack the MultiIndex so that class becomes the columns
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()


class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447


The same result can be obtained more simply with a pivot table.

### Pivot Table Syntax

The method that performs the operation we just did "by hand" is pivot_table(), to which we pass:

* The column to analyze (values)
* The column used as the index (row grouping)
* The column used as the columns (column grouping)

By default, the aggregation applied is the mean.
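To make the correspondence between the two approaches concrete, here is a small sketch on a hypothetical toy DataFrame (the column name cls and the data are invented for illustration). It builds the same table once with groupby() plus unstack() and once with pivot_table():

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
df = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male"],
    "cls":      ["First", "Third", "Third", "First", "Third", "Third"],
    "survived": [1, 1, 0, 0, 0, 1],
})

# "By hand": group on two keys, aggregate, then move one key to the columns
by_hand = df.groupby(["sex", "cls"])["survived"].mean().unstack()

# Same table with pivot_table; the default aggregation is the mean
pivoted = df.pivot_table(values="survived", index="sex", columns="cls")

print(by_hand)
print(pivoted)
```

The two printed tables should contain the same values, with sex as the row index and cls as the columns.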

In [30]:
titanic.pivot_table('survived', index='sex', columns='class')

class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447


#### Multilevel pivot tables

If we want several index levels, we can pass more than one grouping key when pivoting the DataFrame.

In [46]:
# Create a Series from the age column, binned into
# two intervals, (0, 18] and (18, 80], using pd.cut
age = pd.cut(titanic['age'], [0, 18, 80])

In [47]:
type(age)

pandas.core.series.Series

In [48]:
age.head()

0    (18, 80]
1    (18, 80]
2    (18, 80]
3    (18, 80]
4    (18, 80]
Name: age, dtype: category
Categories (2, interval[int64]): [(0, 18] < (18, 80]]

In [49]:
titanic[titanic['age'].isnull()].head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
17,1,2,male,,0,0,13.0,S,Second,man,True,,Southampton,yes,True
19,1,3,female,,0,0,7.225,C,Third,woman,False,,Cherbourg,yes,True
26,0,3,male,,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
28,1,3,female,,0,0,7.8792,Q,Third,woman,False,,Queenstown,yes,True
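Rows with a missing age matter here because pd.cut leaves them without a bin. The sketch below, on a hypothetical Series of ages, shows how values are assigned with the same edges used above: intervals are closed on the right by default, and missing or out-of-range values come back as NaN. The labels child and adult are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including an out-of-range value and a missing one
ages = pd.Series([0.5, 18, 19, 80, 95, np.nan])

# Same bin edges as above: (0, 18] and (18, 80]
binned = pd.cut(ages, [0, 18, 80])
print(binned)

# Optionally give the bins readable labels (illustrative names)
labeled = pd.cut(ages, [0, 18, 80], labels=["child", "adult"])
print(labeled)
```

Since a NaN key does not belong to any group, passengers without a recorded age should not contribute to any cell of a pivot table indexed by this binned Series.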


In [50]:
# build the table
t = titanic.pivot_table('survived', index=['sex',age], columns='class')

In [51]:
t

class               First    Second     Third
sex    age
female (0, 18]   0.909091  1.000000  0.511628
       (18, 80]  0.972973  0.900000  0.423729
male   (0, 18]   0.800000  0.600000  0.215686
       (18, 80]  0.375000  0.071429  0.133663


In [52]:
type(t)

pandas.core.frame.DataFrame

In [53]:
t.index

MultiIndex(levels=[['female', 'male'], [(0, 18], (18, 80]]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['sex', 'age'])

The same can be done with the columns. Here we use pd.qcut to bin the fare column by quantiles, so that each bin holds roughly the same number of passengers.

In [54]:
fare = pd.qcut(titanic['fare'], 2)

In [55]:
fare.head()

0     (-0.001, 14.454]
1    (14.454, 512.329]
2     (-0.001, 14.454]
3    (14.454, 512.329]
4     (-0.001, 14.454]
Name: fare, dtype: category
Categories (2, interval[float64]): [(-0.001, 14.454] < (14.454, 512.329]]
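The difference between quantile-based and equal-width binning is easiest to see on skewed data. A small sketch with hypothetical fares (invented values, chosen to include one very large fare):

```python
import pandas as pd

# Hypothetical, strongly skewed fares
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 30.0, 50.0, 80.0, 500.0])

# qcut with 2 bins splits at the median, so both bins get 4 values
halves = pd.qcut(fares, 2)
print(halves.value_counts(sort=False))

# cut with 2 bins splits the value range in half instead,
# so the single large fare ends up alone in the upper bin
widths = pd.cut(fares, 2)
print(widths.value_counts(sort=False))
```

For a column like fare, where a few tickets are far more expensive than the rest, quantile bins are usually the more informative choice for a pivot table because no cell is left nearly empty.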

In [56]:
titanic.pivot_table('survived', index=['sex',age], columns=[fare,'class'])

fare            (-0.001, 14.454]                     (14.454, 512.329]
class                      First    Second     Third             First    Second     Third
sex    age
female (0, 18]               NaN  1.000000  0.714286          0.909091  1.000000  0.318182
       (18, 80]              NaN  0.880000  0.444444          0.972973  0.914286  0.391304
male   (0, 18]               NaN  0.000000  0.260870          0.800000  0.818182  0.178571
       (18, 80]              0.0  0.098039  0.125000          0.391304  0.030303  0.192308


#### Additional pivot table options

* fill_value
* dropna
* aggfunc
* margins
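As a quick illustration of fill_value, the sketch below uses a hypothetical toy DataFrame in which one sex/class combination has no rows. Without fill_value that cell is NaN; with fill_value it is replaced by the given value.

```python
import pandas as pd

# Hypothetical toy data: there is no male passenger in First class
df = pd.DataFrame({
    "sex":      ["female", "female", "male", "male"],
    "cls":      ["First", "Third", "Third", "Third"],
    "survived": [1, 0, 0, 1],
})

# The empty (male, First) combination shows up as NaN
with_gap = df.pivot_table(values="survived", index="sex", columns="cls")
print(with_gap)

# fill_value replaces missing cells in the resulting table
filled = df.pivot_table(values="survived", index="sex", columns="cls",
                        fill_value=0)
print(filled)
```

Note that filling a mean with 0 can be misleading, since a 0 survival rate is not the same as having no observations; whether to fill is a judgment call that depends on the aggregation.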


In [57]:
pd.DataFrame.pivot_table?

Signature: pd.DataFrame.pivot_table(self, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')

#### aggfunc

Sets the aggregation applied when pivoting the table: the mean, sum, count, etc., or any function that implements an aggregation. It can also be a dictionary that maps each column to its own aggregation, in which case the values argument is omitted.

In [62]:
titanic.pivot_table(index='sex', columns='class',
                   aggfunc={'survived': 'sum', 'fare': 'mean'})

              fare                       survived
class        First     Second      Third    First Second Third
sex
female  106.125798  21.970121  16.118810       91     70    72
male     67.226127  19.741782  12.661633       45     17    47
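Besides a dictionary, aggfunc also accepts a list of aggregations or a custom callable. The following sketch uses a hypothetical toy DataFrame (invented fares and the illustrative column name cls) to show both forms:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
df = pd.DataFrame({
    "sex":  ["female", "female", "male", "male", "male"],
    "cls":  ["First", "First", "Third", "Third", "Third"],
    "fare": [80.0, 100.0, 7.0, 8.0, 9.0],
})

# A list of aggregations adds one top-level column group per function
multi = df.pivot_table(values="fare", index="sex",
                       aggfunc=["mean", "count"])
print(multi)

# A custom callable works too: here, the fare range within each cell
spread = df.pivot_table(values="fare", index="sex", columns="cls",
                        aggfunc=lambda s: s.max() - s.min())
print(spread)
```

Named aggregations such as 'mean' or 'count' are generally preferable when they exist; a callable is the fallback for anything they do not cover.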


#### margins

Adds a totals row and column, computed with the same aggregation across each group:
* margins=True
* margins_name='All' -> label used for the totals row and column

In [66]:
titanic.pivot_table('survived', index='sex', columns='class', margins=True)

class      First    Second     Third       All
sex
female  0.968085  0.921053  0.500000  0.742038
male    0.368852  0.157407  0.135447  0.188908
All     0.629630  0.472826  0.242363  0.383838


In [67]:
titanic.pivot_table('survived', index='sex', columns='class', 
                    margins=True, margins_name='Total')

class      First    Second     Third     Total
sex
female  0.968085  0.921053  0.500000  0.742038
male    0.368852  0.157407  0.135447  0.188908
Total   0.629630  0.472826  0.242363  0.383838
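One detail worth checking is how the margins are computed: they aggregate the underlying rows of each group rather than averaging the cells already shown in the table. A minimal sketch on a hypothetical toy DataFrame with unequal group sizes (invented data, illustrative column name cls):

```python
import pandas as pd

# Hypothetical toy data with unequal cell sizes
df = pd.DataFrame({
    "sex":      ["female", "female", "female", "female", "male", "male"],
    "cls":      ["First", "Third", "Third", "Third", "First", "Third"],
    "survived": [1, 1, 0, 0, 0, 0],
})

table = df.pivot_table(values="survived", index="sex", columns="cls",
                       margins=True)
print(table)

# Margin for women: mean over all four female rows
margin = table.loc["female", "All"]

# Mean of the two female cells, which weights both classes equally
mean_of_cells = table.loc["female", ["First", "Third"]].mean()

# Plain groupby mean over the raw rows, for comparison
direct = df.groupby("sex")["survived"].mean()["female"]

print(margin, mean_of_cells, direct)
```

In this sketch the margin should match the plain groupby mean and differ from the mean of the cells, because the Third class cell contains three of the four female rows.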
