# Data aggregation and group operations with Pandas

### The Split-Apply-Combine processing paradigm

<!-- ## Marco Forgione -->



A helper class for this lecture:

In [1]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

Taken from https://github.com/jakevdp/PythonDataScienceHandbook. It is useful for displaying several pandas dataframes side by side.

### Split-Apply-Combine

Split-apply-combine is one of the most useful and powerful data processing paradigms in Pandas. It may be used to:  

* Compute statistics (mean, max, min, std...) on groups

* Apply within-group transformations 

* ...


Operations:

1. **Split** the dataset into groups according to one (or more) keys 
2. **Apply** a data manipulation operation independently on each group
3. **Combine** the results into a single dataset
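
The three steps can also be spelled out by hand. The following is only a sketch on a small made-up dataframe (the names `groups`, `sums` and `manual_result` are illustrative, not part of pandas); in practice ``groupby`` performs all three steps for us.

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"key": ["a", "b", "b", "a", "a", "b", "c", "c", "c"],
                   "val": [1, 3, 4, 2, 2, 2, 2, 1, 0]})

# Split: one sub-dataframe per distinct key value
groups = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: reduce each group independently (here, the sum of "val")
sums = {key: int(group["val"].sum()) for key, group in groups.items()}

# Combine: collect the per-group results into a single object
manual_result = pd.Series(sums, name="val").sort_index()
print(manual_result)
```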

### Split-Apply-Combine

An illustration: compute the sum over groups defined by a key

 <img src="split-apply-combine.png" alt="split-apply-combine" width=900> 

Pandas implementation:

In [2]:
import pandas as pd
df = pd.DataFrame({'key': ['a','b','b','a','a','b','c','c','c'], 'val': [1, 3, 4, 2, 2, 2, 2, 1, 0]})
df_sum_by_key = df.groupby("key").sum() # group by key, apply sum on each group, combine result
display('df', 'df_sum_by_key')

Unnamed: 0,key,val
0,a,1
1,b,3
2,b,4
3,a,2
4,a,2
5,b,2
6,c,2
7,c,1
8,c,0

Unnamed: 0_level_0,val
key,Unnamed: 1_level_1
a,5
b,9
c,3


### Example

An example: the tipping dataset

In [3]:
import pandas as pd
df_tips = pd.read_csv("tips.csv")
df_tips

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


The dataset contains the **total_bill** and **tip** recorded over a period of a few months in a restaurant, with additional information about the customer and the day/time.

We already know how to find the average of **total_bill** using the built-in statistical function ``mean``:

In [4]:
df_tips["total_bill"].mean() # mean value of the total_bill column

19.78594262295082

What is the average of **total_bill** for the different days of the week?

### Example

Objective: find the mean of the column **total_bill** for groups defined by the value of the column **day**

A poor man's solution:

In [5]:
# Let us find the unique values of the day column
days = df_tips["day"].unique()
days

array(['Sun', 'Sat', 'Thur', 'Fri'], dtype=object)

In [6]:
# Let us loop over the days, filter the dataframe, and compute the mean
res_dict = {}
for day in days:
    df_day = df_tips[df_tips["day"] == day] # filter on the day with boolean indexing
    res_dict[day] = df_day["total_bill"].mean()
res = pd.Series(res_dict)
res

Sun     21.410000
Sat     20.441379
Thur    17.682742
Fri     17.151579
dtype: float64

The standard way:

In [7]:
df_tips.groupby("day")["total_bill"].mean() # concise and efficient

day
Fri     17.151579
Sat     20.441379
Sun     21.410000
Thur    17.682742
Name: total_bill, dtype: float64

### DataFrameGroupBy and SeriesGroupBy objects


In [8]:
df_tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [9]:
df_tips.groupby("day")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f014546b4f0>

In [10]:
df_tips.groupby("day")["total_bill"]

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f014546bd00>

Applying an aggregation such as ``mean`` to a DataFrameGroupBy returns a DataFrame:

In [11]:
df_tips.groupby("day").mean(numeric_only=True) # numeric_only=True skips the text columns (required in pandas >= 2.0)

Unnamed: 0_level_0,total_bill,tip,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,17.151579,2.734737,2.105263
Sat,20.441379,2.993103,2.517241
Sun,21.41,3.255132,2.842105
Thur,17.682742,2.771452,2.451613


Applying an aggregation such as ``mean`` to a SeriesGroupBy returns a Series:

In [12]:
df_tips.groupby("day")["total_bill"].mean()

day
Fri     17.151579
Sat     20.441379
Sun     21.410000
Thur    17.682742
Name: total_bill, dtype: float64

### Basic aggregations

We can apply the mean (or another aggregation method) to a subset of columns by specifying a list of columns:

In [13]:
df_tips.groupby("day")[["total_bill", "tip"]].mean() # .median(), .std(), .max(), ...

Unnamed: 0_level_0,total_bill,tip
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Fri,17.151579,2.734737
Sat,20.441379,2.993103
Sun,21.41,3.255132
Thur,17.682742,2.771452


If we specify a single column name, then the return type is a series instead of a dataframe:

In [14]:
df_tips.groupby("day")["total_bill"].mean()

day
Fri     17.151579
Sat     20.441379
Sun     21.410000
Thur    17.682742
Name: total_bill, dtype: float64

If we specify a list with a single column name, we obtain a dataframe: 

In [15]:
df_tips.groupby("day")[["total_bill"]].mean()

Unnamed: 0_level_0,total_bill
day,Unnamed: 1_level_1
Fri,17.151579
Sat,20.441379
Sun,21.41
Thur,17.682742
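
The difference between the two selection styles can be checked directly on the returned types. This is a minimal sketch on a made-up dataframe (column names `x` and `y` are assumptions for illustration):

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"key": ["a", "b", "b", "a"],
                   "x": [1.0, 3.0, 4.0, 2.0],
                   "y": [10.0, 30.0, 40.0, 20.0]})

grouped = df.groupby("key")

as_series = grouped["x"].mean()    # single label -> SeriesGroupBy -> Series
as_frame = grouped[["x"]].mean()   # list of labels -> DataFrameGroupBy -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```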


### Grouping mechanism: single key

The string passed to ``groupby`` is the column to be used as *key*.

In [16]:
df_tips.groupby("day").mean(numeric_only=True) # the result is a dataframe with index day

Unnamed: 0_level_0,total_bill,tip,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,17.151579,2.734737,2.105263
Sat,20.441379,2.993103,2.517241
Sun,21.41,3.255132,2.842105
Thur,17.682742,2.771452,2.451613


By default, the key variable is the index of the result. 

This behavior can be modified with the ``as_index`` option:

In [17]:
df_tips.groupby("day", as_index=False).mean(numeric_only=True)

Unnamed: 0,day,total_bill,tip,size
0,Fri,17.151579,2.734737,2.105263
1,Sat,20.441379,2.993103,2.517241
2,Sun,21.41,3.255132,2.842105
3,Thur,17.682742,2.771452,2.451613
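
As a side note, ``as_index=False`` gives the same layout as grouping with the default option and then calling ``reset_index()``. A small sketch on made-up data:

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"key": ["a", "b", "b", "a"], "val": [1, 3, 4, 2]})

keyed = df.groupby("key").sum()                  # "key" is the index of the result
flat = df.groupby("key", as_index=False).sum()   # "key" is a regular column
flat_again = keyed.reset_index()                 # move the index back to a column

print(flat)
```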


### Grouping mechanism: multiple keys

It is possible to group by more than one key. For instance, to compute the mean value by day and time:

In [18]:
mean_by_day_and_time = df_tips.groupby(["day", "time"]).mean(numeric_only=True) # mean value of all numeric columns, by day and time
mean_by_day_and_time

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,size
day,time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fri,Dinner,19.663333,2.94,2.166667
Fri,Lunch,12.845714,2.382857,2.0
Sat,Dinner,20.441379,2.993103,2.517241
Sun,Dinner,21.41,3.255132,2.842105
Thur,Dinner,18.78,3.0,2.0
Thur,Lunch,17.664754,2.767705,2.459016


The result has a *hierarchical row index* (day, time). 

In [19]:
mean_by_day_and_time.loc["Fri"]

Unnamed: 0_level_0,total_bill,tip,size
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Dinner,19.663333,2.94,2.166667
Lunch,12.845714,2.382857,2.0


In [20]:
mean_by_day_and_time.loc[("Fri", "Lunch")]

total_bill    12.845714
tip            2.382857
size           2.000000
Name: (Fri, Lunch), dtype: float64

Use the ``as_index=False`` option to have day and time as columns in the result instead.
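
A hierarchical index can be sliced on either level. The sketch below uses a small made-up dataframe (column `bill` is an assumption) to select by the outer level with ``loc``, by the inner level with ``xs``, and to move both keys back to columns:

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"day": ["Fri", "Fri", "Fri", "Sat", "Sat"],
                   "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner"],
                   "bill": [10.0, 20.0, 14.0, 30.0, 26.0]})

by_day_time = df.groupby(["day", "time"])["bill"].mean()  # Series with a (day, time) MultiIndex

fri = by_day_time.loc["Fri"]                    # all times for one day
lunch = by_day_time.xs("Lunch", level="time")   # all days for one time
flat = by_day_time.reset_index()                # keys back as ordinary columns

print(flat)
```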

### Grouping mechanism: user-given array

We can define groups using any *array-like* Python object with one entry per row, passed as argument to ``groupby``

In [21]:
import numpy as np
import pandas as pd
df_rand = pd.DataFrame(np.random.randint(0, 10, size=(4, 2)), 
                         columns=['C1', 'C2'])
df_rand

Unnamed: 0,C1,C2
0,5,0
1,9,5
2,4,3
3,1,7


In [22]:
df_rand.groupby(["A", "B", "B", "A"]).sum() # A list is used to define the groups

Unnamed: 0,C1,C2
A,6,7
B,13,8


Note: the more common notation:

In [23]:
mean_by_day = df_tips.groupby("day").mean(numeric_only=True)

is actually a short-hand for 

In [24]:
mean_by_day = df_tips.groupby(df_tips["day"]).mean(numeric_only=True)

That is, we are grouping according to the series ``df_tips["day"]``
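
Because any array-like with one entry per row is accepted, the key does not need to be an existing column. A minimal sketch, assuming a made-up dataframe and a hypothetical label array `size_label` built on the fly:

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"val": [1, 3, 4, 2, 2, 2]})

# One label per row, computed from the data (not stored in the dataframe)
size_label = np.where(df["val"] > 2, "large", "small")

counts = df.groupby(size_label)["val"].count()
print(counts)
```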

### Grouping mechanism: dictionaries and functions

We can also group according to a dictionary or a function. Either one is applied to the labels of the dataframe index, and the values it returns define the groups.

In [25]:
import numpy as np
df_stud = pd.DataFrame(np.random.randint(1,7, size=(7, 2)), 
                         columns=['Statistics', 'Data Challenge'],
                         index=['Anna', 'Alberto', 'Luigi', 'Loris', 'Laura', 'Dario', 'Daniela'])
df_stud

Unnamed: 0,Statistics,Data Challenge
Anna,2,1
Alberto,3,4
Luigi,2,3
Loris,5,6
Laura,6,3
Dario,2,3
Daniela,6,1


This will group according to nationality, as defined by a dictionary:

In [26]:
dict_nat = {"Anna": "IT", "Alberto": "CH", "Luigi": "CH", "Loris": "IT", "Laura": "DE", "Dario": "DE", "Daniela": "DE"}
df_stud.groupby(dict_nat).mean()

Unnamed: 0,Statistics,Data Challenge
CH,2.5,3.5
DE,4.666667,2.333333
IT,3.5,3.5


This will group by the initial letter of the name:

In [27]:
df_stud.groupby(lambda x: x[0]).mean()

Unnamed: 0,Statistics,Data Challenge
A,2.5,2.5
D,4.0,2.0
L,4.333333,4.0
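
Any callable that accepts an index label can serve as the grouping function. As a further sketch on made-up data (the dataframe `df_names` and the dictionary `mapping` are assumptions for illustration), the built-in ``len`` groups the rows by the length of the name:

```python
import pandas as pd

# Toy data, assumed for illustration only
df_names = pd.DataFrame({"score": [2, 3, 5, 6]},
                        index=["Anna", "Alberto", "Luigi", "Loris"])

# Function: called once per index label, its return value is the group key
by_length = df_names.groupby(len)["score"].mean()

# Dictionary: maps each index label to its group key
mapping = {"Anna": "first", "Alberto": "first", "Luigi": "second", "Loris": "second"}
by_mapping = df_names.groupby(mapping)["score"].mean()

print(by_length)
print(by_mapping)
```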


### Aggregation functions

The ``mean`` method of a grouped dataframe or series is an *aggregation*: it produces a scalar output from an input sequence


In [28]:
df_tips.groupby("day").mean(numeric_only=True)
# alternative: df_tips.groupby("day")[["total_bill", "tip", "size"]].agg("mean")

Unnamed: 0_level_0,total_bill,tip,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,17.151579,2.734737,2.105263
Sat,20.441379,2.993103,2.517241
Sun,21.41,3.255132,2.842105
Thur,17.682742,2.771452,2.451613


Other common aggregations are ``min``, ``max``, ``median``, ``std``, ``var``, ``count``.

NOTE: ``count`` returns the number of **non-null** values in each column of each group

In [29]:
df_tips_nan = df_tips.copy()
df_tips_nan.iloc[0, 0] = np.nan
df_tips_nan.groupby("day").count()

Unnamed: 0_level_0,total_bill,tip,sex,smoker,time,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Fri,19,19,19,19,19,19
Sat,87,87,87,87,87,87
Sun,75,76,76,76,76,76
Thur,62,62,62,62,62,62


The ``size`` method instead returns the number of rows in each group, null values included

In [30]:
df_tips_nan.groupby("day").size()

day
Fri     19
Sat     87
Sun     76
Thur    62
dtype: int64
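
Since ``size`` counts rows and ``count`` counts non-null entries, their difference is the number of missing values per group. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data with missing values, assumed for illustration only
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"],
                   "val": [1.0, np.nan, 3.0, 4.0, np.nan]})

non_null = df.groupby("key")["val"].count()   # NaN entries are skipped
n_rows = df.groupby("key").size()             # every row counts
n_missing = n_rows - non_null                 # missing values per group

print(n_missing)
```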

### Custom aggregation functions

It is possible to aggregate according to a user-defined function:

In [31]:
def numerical_range(arr):
    return arr.max() - arr.min()

In [32]:
df_tips_day_range = df_tips.groupby("day")[["total_bill", "tip", "size"]].agg(numerical_range) # numeric columns only: max - min is undefined for text
df_tips_day_range

Unnamed: 0_level_0,total_bill,tip,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,34.42,3.73,3
Sat,47.74,9.0,4
Sun,40.92,5.49,4
Thur,35.6,5.45,5


The difference between the largest and the smallest **total_bill** ever seen on Fridays is

In [33]:
df_tips_day_range.loc["Fri", "total_bill"]

34.42
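
A user-defined function can also be mixed with built-in aggregation names in the same call; the resulting column then takes the name of the function. A sketch on made-up data (the function name `value_range` is an assumption for illustration):

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"key": ["a", "b", "b", "a", "a", "b", "c", "c", "c"],
                   "val": [1, 3, 4, 2, 2, 2, 2, 1, 0]})

def value_range(s):
    """Difference between the largest and the smallest value of a group."""
    return s.max() - s.min()

# Built-in aggregations are given by name, the custom one as a callable
stats = df.groupby("key")["val"].agg(["min", "max", value_range])
print(stats)
```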

### Multiple aggregations

It is possible to specify more than one aggregation function, to be applied to all the selected columns.

In [34]:
mean_std_by_day = df_tips.groupby("day")[["total_bill", "tip"]].agg(["mean", "std"])
mean_std_by_day

Unnamed: 0_level_0,total_bill,total_bill,tip,tip
Unnamed: 0_level_1,mean,std,mean,std
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Fri,17.151579,8.30266,2.734737,1.019577
Sat,20.441379,9.480419,2.993103,1.631014
Sun,21.41,8.832122,3.255132,1.23488
Thur,17.682742,7.88617,2.771452,1.240223


The result contains the mean and standard deviation of **total_bill** and **tip**.

In this case, the result has *hierarchical column names*:

In [35]:
mean_std_by_day[("total_bill", "std")] # or mean_std_by_day.loc[:, ("total_bill", "std")]

day
Fri     8.302660
Sat     9.480419
Sun     8.832122
Thur    7.886170
Name: (total_bill, std), dtype: float64
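
Hierarchical column names can be selected level by level, or flattened into plain strings when a simple table is preferred. A sketch on made-up data (the joined names such as `x_mean` are a naming choice made here, not a pandas convention):

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"key": ["a", "a", "b", "b"],
                   "x": [1.0, 3.0, 2.0, 6.0],
                   "y": [10.0, 30.0, 20.0, 60.0]})

stats = df.groupby("key")[["x", "y"]].agg(["mean", "max"])  # columns: (x, mean), (x, max), ...

x_mean = stats[("x", "mean")]   # one column, selected by its (outer, inner) pair
x_all = stats["x"]              # all statistics of one variable

flat = stats.copy()
flat.columns = ["_".join(col) for col in stats.columns]  # e.g. "x_mean"
print(flat)
```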

### Multiple aggregations

It is possible to use a different aggregation for the different columns

In [36]:
df_tips.groupby("day").agg({"total_bill": "mean", "size": "median", "tip": numerical_range})

Unnamed: 0_level_0,total_bill,size,tip
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,17.151579,2,3.73
Sat,20.441379,2,9.0
Sun,21.41,2,5.49
Thur,17.682742,2,5.45


The result contains the mean of **total_bill**, the median of **size**, and the numerical range of **tip** for the different days
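
When the output columns should carry explicit names, pandas also supports *named aggregation*: each keyword argument of ``agg`` is an output column, given as a (column, aggregation) pair. A sketch on made-up data (the output names `x_mean` and `y_range` are chosen here for illustration):

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"key": ["a", "a", "b", "b"],
                   "x": [1.0, 3.0, 2.0, 6.0],
                   "y": [10.0, 30.0, 20.0, 60.0]})

summary = df.groupby("key").agg(
    x_mean=("x", "mean"),                         # built-in aggregation by name
    y_range=("y", lambda s: s.max() - s.min()),   # custom aggregation as a callable
)
print(summary)
```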

### General transformation

We can apply a custom transformation to a grouped dataframe/series that is not necessarily an aggregation.

Example: normalize the column ``total_bill`` according to the group mean and standard deviation.

In [37]:
def normalize_bill(group_data):
    mean_bill = group_data["total_bill"].mean()
    std_bill = group_data["total_bill"].std()
    group_data = group_data.copy()
    group_data["total_bill_z"] = (group_data["total_bill"] - mean_bill)/std_bill
    return group_data

In [38]:
df_tips_z = df_tips.groupby("day", group_keys=True).apply(normalize_bill) # try the option group_keys=False

In [39]:
display('df_tips', 'df_tips_z')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,sex,smoker,day,time,size,total_bill_z
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Fri,90,28.97,3.00,Male,Yes,Fri,Dinner,2,1.423450
Fri,91,22.49,3.50,Male,No,Fri,Dinner,2,0.642977
Fri,92,5.75,1.00,Female,Yes,Fri,Dinner,2,-1.373244
Fri,93,16.32,4.30,Female,Yes,Fri,Dinner,2,-0.100158
Fri,94,22.75,3.25,Female,No,Fri,Dinner,2,0.674292
...,...,...,...,...,...,...,...,...,...
Thur,202,13.00,2.00,Female,Yes,Thur,Lunch,2,-0.593792
Thur,203,16.40,2.50,Female,Yes,Thur,Lunch,2,-0.162657
Thur,204,20.53,4.00,Male,Yes,Thur,Lunch,4,0.361044
Thur,205,16.47,3.23,Female,Yes,Thur,Lunch,3,-0.153781
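
For within-group transformations that return one value per input row, ``transform`` is a possible alternative to ``apply``: the result keeps the original row index, so it can be assigned directly as a new column. A sketch on made-up data (the helper `zscore` is an assumption for illustration):

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"key": ["a", "a", "a", "b", "b", "b"],
                   "val": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]})

def zscore(s):
    """Standardize a group using its own mean and (sample) standard deviation."""
    return (s - s.mean()) / s.std()

# The result is aligned with df's index, one value per original row
df["val_z"] = df.groupby("key")["val"].transform(zscore)
print(df)
```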


### General transformation

We can apply a custom transformation to a grouped dataframe/series that is not necessarily an aggregation.

Example: take the top-3 total bills for each day.

In [40]:
def top3_total_bill(group_data):
    group_data_sort = group_data.sort_values(by="total_bill", ascending=False)
    return group_data_sort.iloc[0:3]

In [41]:
df_tips.groupby("day").apply(top3_total_bill)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,sex,smoker,day,time,size
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Fri,95,40.17,4.73,Male,Yes,Fri,Dinner,4
Fri,90,28.97,3.0,Male,Yes,Fri,Dinner,2
Fri,96,27.28,4.0,Male,Yes,Fri,Dinner,2
Sat,170,50.81,10.0,Male,Yes,Sat,Dinner,3
Sat,212,48.33,9.0,Male,No,Sat,Dinner,4
Sat,59,48.27,6.73,Male,No,Sat,Dinner,4
Sun,156,48.17,5.0,Male,No,Sun,Dinner,6
Sun,182,45.35,3.5,Male,Yes,Sun,Dinner,3
Sun,184,40.55,3.0,Male,Yes,Sun,Dinner,2
Thur,197,43.11,5.0,Female,Yes,Thur,Lunch,4
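
The same kind of selection can be written without ``apply``, by sorting first and then taking the first rows of each group with ``head``. This is a sketch on made-up data; note that, unlike the ``apply`` version, the result keeps the original flat row index instead of a (day, row) hierarchical index:

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"day": ["Fri", "Fri", "Fri", "Sat", "Sat", "Sat"],
                   "bill": [10.0, 30.0, 20.0, 5.0, 50.0, 40.0]})

top2 = (df.sort_values("bill", ascending=False)
          .groupby("day")
          .head(2))   # first 2 rows of each group, in the sorted order
print(top2)
```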


### Looping over groups

Sometimes we may want to iterate over a DataFrameGroupBy object, instead of applying an aggregation/transformation

In [42]:
df_tips_by_day = df_tips.groupby("day")
df_tips_by_day

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f01453fc9a0>

In [44]:
for key, df_tips_day in df_tips_by_day:
    # key is the group label (e.g. "Fri"); df_tips_day is the sub-dataframe of that group
    pass # do something with key and df_tips_day instead
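
A sketch of what such a loop may look like on made-up data (the dictionary `n_rows` is an assumption for illustration). A single group can also be retrieved directly with ``get_group``:

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"key": ["a", "b", "b", "a", "c"], "val": [1, 3, 4, 2, 0]})

n_rows = {}
for key, group in df.groupby("key"):
    # key is the group label, group is the sub-dataframe with that label
    n_rows[key] = len(group)

group_b = df.groupby("key").get_group("b")  # fetch a single group without looping
print(n_rows)
print(group_b)
```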

### Pivot table

A pivot table aggregates a dataframe by two keys, arranging the result in a rectangle with one key along the rows and
the other key along the columns. 


In [45]:
df_tips.pivot_table('total_bill', index="day", columns="time", aggfunc="mean")

time,Dinner,Lunch
day,Unnamed: 1_level_1,Unnamed: 2_level_1
Fri,19.663333,12.845714
Sat,20.441379,
Sun,21.41,
Thur,18.78,17.664754


It is an equivalent representation of:

In [46]:
df_tips.groupby(["day", "time"])[["total_bill"]].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill
day,time,Unnamed: 2_level_1
Fri,Dinner,19.663333
Fri,Lunch,12.845714
Sat,Dinner,20.441379
Sun,Dinner,21.41
Thur,Dinner,18.78
Thur,Lunch,17.664754
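
The link between the two representations can be made explicit with ``unstack``, which moves one level of the row index to the columns. A sketch on made-up data (column `bill` is an assumption for illustration):

```python
import pandas as pd

# Toy data, assumed for illustration only
df = pd.DataFrame({"day": ["Fri", "Fri", "Fri", "Sat", "Sat"],
                   "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner"],
                   "bill": [10.0, 20.0, 14.0, 30.0, 26.0]})

pivoted = df.pivot_table("bill", index="day", columns="time", aggfunc="mean")

# Group by both keys, then move the "time" level from the rows to the columns
unstacked = df.groupby(["day", "time"])["bill"].mean().unstack("time")

print(pivoted)
print(unstacked)
```

Combinations that never occur in the data (here Saturday lunch) appear as missing values in both tables.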


Pivot tables may be extended to have a multi-index for rows and columns

In [47]:
df_tips.pivot_table('total_bill', index=["day", "sex"], columns=["time", "smoker"], aggfunc="mean")

Unnamed: 0_level_0,time,Dinner,Dinner,Lunch,Lunch
Unnamed: 0_level_1,smoker,No,Yes,No,Yes
day,sex,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Fri,Female,22.75,12.2,15.98,13.26
Fri,Male,17.475,25.892,,11.386667
Sat,Female,19.003846,20.266667,,
Sat,Male,19.929063,21.837778,,
Sun,Female,20.824286,16.54,,
Sun,Male,20.403256,26.141333,,
Thur,Female,18.78,,15.899167,19.218571
Thur,Male,,,18.4865,19.171


Do not abuse pivot tables! They may become unreadable.