## Aggregation & Grouping

An essential part of analyzing large datasets is efficient summarization: computing aggregations such as sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset. In this section, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays to more sophisticated operations based on the concept of a groupby.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(np.array([[1,"w",60], [2,"x",20], [6,"y",700], [9,"z",350]]), columns=["A","B","C"])
print(df)

   A  B    C
0  1  w   60
1  2  x   20
2  6  y  700
3  9  z  350


In [3]:
df.shape

(4, 3)

In [34]:
df.dtypes

A    object
B    object
C    object
dtype: object

In [35]:
df["A"] = pd.to_numeric(df.A)
df["C"] = pd.to_numeric(df.C)
df.dtypes

A    float64
B     object
C      int64
dtype: object

In [36]:
df["C"] = df.C.astype('float64')
df.dtypes

A    float64
B     object
C    float64
dtype: object

In [37]:
df.C.mean()

282.5

In [38]:
df

Unnamed: 0,A,B,C
0,0.25,w,60.0
1,-0.9,x,20.0
2,0.2,y,700.0
3,0.6,z,350.0


In [39]:
df.mean()

A      0.0375
C    282.5000
dtype: float64

In [40]:
df.count()

A    4
B    4
C    4
dtype: int64

In [41]:
df.describe()

Unnamed: 0,A,C
count,4.0,4.0
mean,0.0375,282.5
std,0.64984,314.788289
min,-0.9,20.0
25%,-0.075,50.0
50%,0.225,205.0
75%,0.3375,437.5
max,0.6,700.0


In [42]:
travel = [{'Continent': 'Europe', 'Country': 'UK', 'Pageviews': 100000}, {'Continent': 'Europe', 'Country': 'DE', 'Pageviews': 20000}, {'Continent': 'Africa', 'Country': 'Kenya', 'Pageviews': 40000}, {'Continent': 'Africa', 'Country': 'Morroco', 'Pageviews': 20000}, {'Continent': 'Africa', 'Country': 'Chad', 'Pageviews': 50000}]
pd.DataFrame(travel)
print(travel)

[{'Continent': 'Europe', 'Country': 'UK', 'Pageviews': 100000}, {'Continent': 'Europe', 'Country': 'DE', 'Pageviews': 20000}, {'Continent': 'Africa', 'Country': 'Kenya', 'Pageviews': 40000}, {'Continent': 'Africa', 'Country': 'Morroco', 'Pageviews': 20000}, {'Continent': 'Africa', 'Country': 'Chad', 'Pageviews': 50000}]


In [43]:
travel

[{'Continent': 'Europe', 'Country': 'UK', 'Pageviews': 100000},
 {'Continent': 'Europe', 'Country': 'DE', 'Pageviews': 20000},
 {'Continent': 'Africa', 'Country': 'Kenya', 'Pageviews': 40000},
 {'Continent': 'Africa', 'Country': 'Morroco', 'Pageviews': 20000},
 {'Continent': 'Africa', 'Country': 'Chad', 'Pageviews': 50000}]

In [44]:
pd.DataFrame(travel)

Unnamed: 0,Continent,Country,Pageviews
0,Europe,UK,100000
1,Europe,DE,20000
2,Africa,Kenya,40000
3,Africa,Morroco,20000
4,Africa,Chad,50000


In [45]:
pd.DataFrame(travel).groupby('Continent').Pageviews.mean()


Continent
Africa    36666.666667
Europe    60000.000000
Name: Pageviews, dtype: float64

In [46]:
pd.DataFrame(travel).groupby('Continent').apply(np.mean)

Unnamed: 0_level_0,Pageviews
Continent,Unnamed: 1_level_1
Africa,36666.666667
Europe,60000.0


In [47]:
df = pd.DataFrame({'A':['A','B','C'],'B':[4,5,6]})
group_df = df.groupby('A').mean()
type(group_df)

pandas.core.frame.DataFrame

In [48]:
group_df.dtypes

B    int64
dtype: object

In [49]:
pd.DataFrame(travel)

Unnamed: 0,Continent,Country,Pageviews
0,Europe,UK,100000
1,Europe,DE,20000
2,Africa,Kenya,40000
3,Africa,Morroco,20000
4,Africa,Chad,50000


In [50]:
pd.pivot_table(pd.DataFrame(travel), columns = "Continent", aggfunc=[np.mean])

Unnamed: 0_level_0,mean,mean
Continent,Africa,Europe
Pageviews,36666.666667,60000.0


In [53]:
import numpy as np
import pandas as pd

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
    

In [54]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [55]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [4]:
dat = np.random.RandomState(42)
data1 = pd.Series(dat.rand(5))
data1

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [5]:
data1.sum()


2.811925491708157

In [6]:
data1.mean()


0.5623850983416314

In [7]:
df = pd.DataFrame({'A': dat.rand(5),
                  'B': dat.rand(5)})
df

Unnamed: 0,A,B
0,0.155995,0.020584
1,0.058084,0.96991
2,0.866176,0.832443
3,0.601115,0.212339
4,0.708073,0.181825


In [8]:
df.mean(axis='columns')

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64

In [9]:
planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


### GroupBy: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation. The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminating to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.

This makes clear what the groupby accomplishes:

- The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
- The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- The combine step merges the results of these operations into an output array.

GroupBy can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way. The power of the GroupBy is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but rather thinks about the operation as a whole.
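To make the three steps concrete, here is a minimal sketch that performs them by hand and compares the result with groupby. The frame `toy` and the names `pieces`, `sums`, and `manual` are hypothetical and chosen only for illustration; this is not how Pandas implements the operation internally.

```python
import pandas as pd

# Hypothetical toy frame, chosen only for illustration
toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': range(6)})

# Split: build one sub-DataFrame per distinct key value
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number
sums = {k: int(piece['data'].sum()) for k, piece in pieces.items()}

# Combine: collect the per-group results into one labeled Series
manual = pd.Series(sums, name='data').sort_index()

# The same result from groupby, without materializing the pieces
grouped = toy.groupby('key')['data'].sum()

print(manual.to_dict())
print(grouped.to_dict())
```

The manual version builds every intermediate piece explicitly, which is exactly the bookkeeping that groupby lets you skip.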

In [10]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                  'data' : range(6)}, columns=['key', 'data'])
df

Unnamed: 0,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


The most basic split-apply-combine operation can be computed with the groupby() method of DataFrames, passing the name of the desired key column:

In [12]:
df.groupby('key').sum()


Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,3
B,5
C,7


#### Column indexing

In [13]:
planets.groupby('method')

<pandas.core.groupby.DataFrameGroupBy object at 0x1a0ca1fef0>

In [14]:
planets.groupby('method')['orbital_period'].median()


method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

In [17]:
planets.groupby('method')['year'].describe()


Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


### Aggregate, Filter, Transform, Apply

The preceding discussion focused on aggregation for the combine operation, but there are more options available. In particular, GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a variety of useful operations before combining the grouped data.

In [59]:
data2 = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                  'data3' : range(6),
                  'data4' : data2.randint(0, 10, 6)},
                 columns = ['key', 'data3', 'data4'])
df

Unnamed: 0,key,data3,data4
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


#### Aggregation

We're now familiar with GroupBy aggregations with sum(), median(), and the like, but the aggregate() method allows for even more flexibility. It can take a string, a function, or a list thereof, and compute all the aggregates at once. Here is a quick example combining all these:

In [60]:
df.groupby('key').aggregate(['min', np.median, max])


Unnamed: 0_level_0,data3,data3,data3,data4,data4,data4
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:


In [61]:
df.groupby('key').aggregate({'data3':'min',
                            'data4': 'max'})


Unnamed: 0_level_0,data3,data4
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,7
C,2,9
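A related option, sketched here under the assumption of a reasonably recent Pandas release (named aggregation was added in version 0.25), is to pass keyword arguments of the form `new_name=(column, function)`. This controls the output column names directly. The output names `data3_min`, `data4_max`, and `data4_range` are made up for this example, and the `data4` values are hardcoded to match the frame shown above.

```python
import pandas as pd

# Same values as the example frame, hardcoded so the block stands alone
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data3': range(6),
                   'data4': [5, 0, 3, 3, 7, 9]})

# Each keyword names an output column; the tuple gives (input column, aggregate)
summary = df.groupby('key').agg(
    data3_min=('data3', 'min'),
    data4_max=('data4', 'max'),
    data4_range=('data4', lambda s: s.max() - s.min()),  # custom function
)
print(summary)
```

Compared with the dictionary form, this allows several aggregates of the same column without producing hierarchical column labels.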


#### Filtering

A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:

In [62]:
def filter_func(x):
    return x['data4'].std() > 4

display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")

Unnamed: 0,key,data3,data4
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9

Unnamed: 0_level_0,data3,data4
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.12132,1.414214
B,2.12132,4.949747
C,2.12132,4.242641

Unnamed: 0,key,data3,data4
1,B,1,0
2,C,2,3
4,B,4,7
5,C,5,9


The filter function should return a Boolean value specifying whether the group passes the filtering. Here, because group A does not have a standard deviation greater than 4, it is dropped from the result.
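As a rough cross-check of what filter() is doing, the same rows can be selected with a Boolean mask built from a group-wise transform. This is only an illustrative alternative, not a description of how filter() is implemented; the names `keep`, `masked`, and `filtered` are hypothetical.

```python
import pandas as pd

# Same values as the example frame, hardcoded so the block stands alone
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data3': range(6),
                   'data4': [5, 0, 3, 3, 7, 9]})

# Broadcast each group's standard deviation back to its rows, then threshold
keep = df.groupby('key')['data4'].transform('std') > 4
masked = df[keep]

# The filter() version of the same selection
filtered = df.groupby('key').filter(lambda g: g['data4'].std() > 4)

print(masked)
```

Both approaches keep the rows of groups B and C and drop every row of group A.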

#### Transformation


While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean:

In [63]:
df.groupby('key').transform(lambda x: x-x.mean())

Unnamed: 0,data3,data4
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0
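Because the transformed output is aligned with the original index, it can be assigned straight back as a new column. The following sketch assumes the same example values and uses the string shortcut `'mean'` in place of a lambda; the column name `data4_centered` is invented for the example.

```python
import pandas as pd

# Same values as the example frame, hardcoded so the block stands alone
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data3': range(6),
                   'data4': [5, 0, 3, 3, 7, 9]})

# transform('mean') returns one value per original row (the row's group mean)
group_mean = df.groupby('key')['data4'].transform('mean')

# Subtracting it centers each group around zero
df['data4_centered'] = df['data4'] - group_mean
print(df)
```

After centering, the values within each group sum to zero, which is a quick sanity check on the result.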


#### The apply() method

The apply() method lets you apply an arbitrary function to the group results. The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned.

For example, here is an apply() that normalizes the first column by the sum of the second:

In [64]:
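def norm_by_data4(x):
    # x is a DataFrame of group values; copy it so the original is not modified
    x = x.copy()
    x['data3'] = x['data3'] / x['data4'].sum()
    return x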

display('df', "df.groupby('key').apply(norm_by_data4)")

Unnamed: 0,key,data3,data4
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9

Unnamed: 0,key,data3,data4
0,A,0.0,5
1,B,0.142857,0
2,C,0.166667,3
3,A,0.375,3
4,B,0.571429,7
5,C,0.416667,9



apply() within a GroupBy is quite flexible: the only criterion is that the function takes a DataFrame and returns a Pandas object or scalar; what you do in the middle is up to you!
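To illustrate how the combine step adapts to the return type, the sketch below applies one function that returns a scalar and one that returns a Series. The helper names `weighted_total` and `span_and_total` are hypothetical, and the data columns are selected explicitly so that each function only sees `data3` and `data4`.

```python
import pandas as pd

# Same values as the example frame, hardcoded so the block stands alone
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data3': range(6),
                   'data4': [5, 0, 3, 3, 7, 9]})

groups = df.groupby('key')[['data3', 'data4']]

def weighted_total(g):
    # Scalar per group: the combined result is a Series indexed by key
    return (g['data3'] * g['data4']).sum()

def span_and_total(g):
    # Series per group: the combined result is a DataFrame, one row per key
    return pd.Series({'span3': g['data3'].max() - g['data3'].min(),
                      'total4': g['data4'].sum()})

weighted = groups.apply(weighted_total)
summary = groups.apply(span_and_total)

print(weighted)
print(summary)
```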

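#### Specifying the split key

So far the DataFrame has been split on a single column name, but the key can also be any list or Series whose length matches that of the DataFrame: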

In [65]:
L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')

Unnamed: 0,key,data3,data4
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9

Unnamed: 0,data3,data4
0,7,17
1,4,3
2,4,7
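A list is not the only kind of key that groupby accepts. As a brief sketch, a dictionary or a plain Python function can also be used, in which case it is applied to the index labels. The frame `indexed` and the `mapping` dictionary are hypothetical names introduced for this example.

```python
import pandas as pd

# Same values as the example frame, with 'key' moved into the index
indexed = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data3': range(6),
                        'data4': [5, 0, 3, 3, 7, 9]}).set_index('key')

# A dictionary maps index labels to group labels
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
by_mapping = indexed.groupby(mapping).sum()

# A function is called on each index label and its return value is the group
by_function = indexed.groupby(str.lower).mean()

print(by_mapping)
print(by_function)
```

In both cases the group labels come from the index rather than from a column, so the key column has to be set as the index first.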
