### Aggregation & Grouping

An essential piece of analysis of large data is efficient summarization: computing aggregations like `sum()`, `mean()`, `median()`, `min()`, and `max()`, in which a single number gives insight into the nature of a potentially large dataset. We'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of a `groupby`.

* We use the Planets dataset, available through Seaborn
* It provides information on planets that astronomers have discovered around other stars

In [2]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [3]:
planets.shape

(1035, 6)

In [5]:
# NumPy refresher
import numpy as np
import pandas as pd
# Seed the generator and draw 5 random values to form a Series
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [6]:
ser.describe()

count    5.000000
mean     0.562385
std      0.308748
min      0.156019
25%      0.374540
50%      0.598658
75%      0.731994
max      0.950714
dtype: float64

In [7]:
ser.mean(), ser.std()

(0.5623850983416314, 0.30874824961862174)

* Aggregates return a single value, as expected for a `Series` (or for any one-dimensional aggregation)

In [9]:
print(type(rng.rand(5)), rng.rand(5)) # rng.rand(5) returns a NumPy array, so it can be passed directly as column data when building a DataFrame

<class 'numpy.ndarray'> [0.18340451 0.30424224 0.52475643 0.43194502 0.29122914]


#### Column Aggregates for DataFrame

In [10]:
df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
df

,A,B
0,0.611853,0.785176
1,0.139494,0.199674
2,0.292145,0.514234
3,0.366362,0.592415
4,0.45607,0.04645


In [11]:
df.mean()

A    0.373185
B    0.427590
dtype: float64

In [12]:
# Pass axis='columns' to aggregate across the columns, giving one mean per row
df.mean(axis='columns')

0    0.698514
1    0.169584
2    0.403190
3    0.479388
4    0.251260
dtype: float64

#### Back to Planets


In [13]:
planets.describe()

,number,orbital_period,mass,distance,year
count,1035.0,992.0,513.0,808.0,1035.0
mean,1.785507,2002.917596,2.638161,264.069282,2009.070531
std,1.240976,26014.728304,3.818617,733.116493,3.972567
min,1.0,0.090706,0.0036,1.35,1989.0
25%,1.0,5.44254,0.229,32.56,2007.0
50%,1.0,39.9795,1.26,55.25,2010.0
75%,2.0,526.005,3.04,178.5,2012.0
max,7.0,730000.0,25.0,8500.0,2014.0


In [14]:
# Does dropping nulls make a difference? Yes: dropna() drops every row that contains any null value, so the count falls from 1035 to 498
planets.dropna().describe()

,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0
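The gap between the two `count` rows comes from how missing values are handled: `describe()` skips nulls column by column, while `dropna()` removes a whole row if any column in it is null. A small sketch on a hypothetical four-row table (`toy_nulls` is not part of the Planets data) makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a table with scattered missing values
toy_nulls = pd.DataFrame({'mass': [7.1, np.nan, 2.6, np.nan],
                          'distance': [77.4, 56.95, np.nan, 110.62],
                          'year': [2006, 2008, 2011, 2007]})

# Missing values per column
null_counts = toy_nulls.isnull().sum()

# describe() counts the non-null entries of each column independently
counts = toy_nulls.describe().loc['count']

# dropna() keeps only rows with no null in any column
complete_rows = toy_nulls.dropna()

# dropna(subset=...) only looks at the listed columns
mass_known = toy_nulls.dropna(subset=['mass'])

print(null_counts)
print(counts)
print(len(complete_rows), len(mass_known))
```

If only some columns matter for a given question, `dropna(subset=[...])` avoids discarding rows that are complete in those columns.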


* To go deeper into the data, however, simple aggregates are often not enough. 
* The `groupby` operation allows for quick and efficient aggregate computation on subsets of data

### GroupBy: Split, Apply, Combine

* The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.  
* The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.  
* The combine step merges the results of these operations into an output array.
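To make the three steps concrete, they can be carried out by hand and compared with the built-in `groupby`. This is only an illustrative sketch: `df_sac`, `pieces`, `sums`, and `manual` are hypothetical names, and the manual version is far less efficient than what Pandas does internally.

```python
import pandas as pd

# Small hypothetical table with a grouping key and one data column
df_sac = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'data': range(6)})

# Split: one sub-DataFrame per distinct key value
pieces = {key: df_sac[df_sac['key'] == key] for key in df_sac['key'].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece['data'].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='data').sort_index()
manual.index.name = 'key'

# The same computation in one line with groupby
builtin = df_sac.groupby('key')['data'].sum()

print(manual)
print(builtin)
```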


In [15]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                  'data': range(6)},
                 columns=['key', 'data'])
df

,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


In [16]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc32d095a60>

* Notice that what is returned is not a set of DataFrames, but a `DataFrameGroupBy` object. This object is where the magic is: you can think of it as a special view of the DataFrame, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
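Although no aggregate has been computed yet, the `GroupBy` object can still be inspected. The sketch below uses the `ngroups`, `groups`, and `get_group()` members on a hypothetical frame named `df_lazy`:

```python
import pandas as pd

df_lazy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

# Creating the GroupBy object does not compute any sums or means
grouped = df_lazy.groupby('key')

# Number of distinct groups
print(grouped.ngroups)

# Mapping from each key to the row labels that belong to it
print(dict(grouped.groups))

# Materialize a single group as an ordinary DataFrame
group_a = grouped.get_group('A')
print(group_a)
```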

In [17]:
df.groupby('key').sum()

key,data
A,3
B,5
C,7


### The GroupBy object

The GroupBy object is a very flexible abstraction. In many ways, you can simply treat it as if it’s a collection of `DataFrame`s, and it does the difficult things under the hood. Let’s see some examples using the Planets data. Perhaps the most important operations made available by a `GroupBy` are `aggregate`, `filter`, `transform`, and `apply`.


#### Column Indexing

In [22]:
# Column indexing
print(planets.columns.tolist())
planets.groupby('method'), planets.groupby('method')['orbital_period']

['method', 'number', 'orbital_period', 'mass', 'distance', 'year']


(<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc32d095910>,
 <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc32d0a1730>)

* Here we've selected a particular `Series` group from the original `DataFrame` group by reference to its column name.
* As with the `GroupBy` object, no computation is done until we call some aggregate on the object.

In [23]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

#### Iteration over groups
The `GroupBy` object supports direct iteration over the groups, returning each group as a `Series` or `DataFrame`:

In [24]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


* This can be useful for doing certain things manually, though it is often much faster to use the built-in `apply` functionality
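As a sketch of the trade-off, the loop below computes a per-group median by iterating, then compares it with the built-in aggregate. The `toy` frame is a hypothetical stand-in for the Planets data so that the example runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the Planets data
toy = pd.DataFrame({
    'method': ['Transit', 'Radial Velocity', 'Transit',
               'Imaging', 'Radial Velocity'],
    'orbital_period': [3.5, 300.0, 5.5, 27500.0, 420.0],
})

# Manual route: iterate over (key, sub-DataFrame) pairs
manual_median = {}
for method, group in toy.groupby('method'):
    manual_median[method] = group['orbital_period'].median()

# Built-in route: one vectorized call
builtin_median = toy.groupby('method')['orbital_period'].median()

print(manual_median)
print(builtin_median)
```

Both routes give the same numbers; the explicit loop is mainly worthwhile when the per-group work cannot be expressed as a single method call.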

#### Dispatch methods
Through some Python class magic, any method not explicitly implemented by the `GroupBy` object will be passed through and called on the groups, whether they are `DataFrame` or `Series` objects. For example, you can use the `describe()` method of DataFrames to perform a set of aggregations that describe each group in the data:

In [25]:
planets.groupby('method')['year'].describe()

method,count,mean,std,min,25%,50%,75%,max
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


In [26]:
planets.groupby('method')['year'].describe().unstack()

       method                       
count  Astrometry                          2.0
       Eclipse Timing Variations           9.0
       Imaging                            38.0
       Microlensing                       23.0
       Orbital Brightness Modulation       3.0
                                         ...  
max    Pulsar Timing                    2011.0
       Pulsation Timing Variations      2007.0
       Radial Velocity                  2014.0
       Transit                          2014.0
       Transit Timing Variations        2014.0
Length: 80, dtype: float64

* Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade. The newest methods seem to be Transit Timing Variations and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.
* Notice that dispatched methods such as `describe()` are applied to each individual group, and the results are then combined within GroupBy and returned. Again, any valid DataFrame/Series method can be used on the corresponding GroupBy object, which allows for some very flexible and powerful operations!

#### Aggregate, filter, transform, apply
The preceding discussion focused on aggregation for the combine operation, but there are more options available. In particular, GroupBy objects have `aggregate()`, `filter()`, `transform()`, and `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data.



In [28]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                  'data1': range(6),
                  'data2': rng.randint(0,10,6)},
                 columns=['key', 'data1', 'data2'])
df

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


#### Aggregation 
We’re now familiar with `GroupBy` aggregations with `sum()`, `median()`, and the like, but the `aggregate()` method allows for even more flexibility. It can take a string, a function, or a list thereof, and compute all the aggregates at once. Here is a quick example combining all these:


In [29]:
df.groupby('key').aggregate(['min', np.median, max])

,data1,data1,data1,data2,data2,data2
key,min,median,max,min,median,max
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


In [30]:
df.groupby('key').agg(['min', np.median, max])

,data1,data1,data1,data2,data2,data2
key,min,median,max,min,median,max
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9
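Beyond a list, `aggregate()` (or its alias `agg()`) also accepts a dictionary that maps column names to operations, and, assuming pandas 0.25 or later, keyword arguments that name each output column. The sketch below rebuilds the same six-row table with `data2` typed in by hand; `df_agg`, `per_column`, and `named` are hypothetical names.

```python
import pandas as pd

# Same values as the table above, with data2 hard-coded for reproducibility
df_agg = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'data1': range(6),
                       'data2': [5, 0, 3, 3, 7, 9]})

# Dictionary form: a different aggregate for each column
per_column = df_agg.groupby('key').agg({'data1': 'min', 'data2': 'max'})

# Named aggregation: output_name=(input_column, aggregate)
named = df_agg.groupby('key').agg(data1_min=('data1', 'min'),
                                  data2_max=('data2', 'max'))

print(per_column)
print(named)
```

The dictionary form keeps the original column names, while named aggregation yields flat, self-describing column names instead of a two-level header.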


#### Filtering
A `filtering` operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:

In [34]:
df

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [31]:
def filter_func(x):
    return x['data2'].std() > 4

In [32]:
print(df.groupby('key').std())

       data1     data2
key                   
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641


In [33]:
print(df.groupby('key').filter(filter_func))

  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9
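The filter function receives each group as a `DataFrame` and must return a single Boolean; groups for which it returns `False` are dropped in full, and the surviving rows keep their original index. As a rough sketch, the same selection can also be written as a Boolean mask built with `transform()`. The names `df_f`, `kept`, `mask`, and `same` are hypothetical.

```python
import pandas as pd

df_f = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'data1': range(6),
                     'data2': [5, 0, 3, 3, 7, 9]})

# filter(): keep every row of the groups whose data2 std exceeds 4
kept = df_f.groupby('key').filter(lambda g: g['data2'].std() > 4)

# Equivalent mask: broadcast each group's std back onto its rows
mask = df_f.groupby('key')['data2'].transform('std') > 4
same = df_f[mask]

print(kept)
print(same)
```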


#### Transformation 
While aggregation must return a reduced version of the data, `transformation` can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean:

In [35]:
df.groupby('key').transform(lambda x: x - x.mean())

,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0


* Each value in `data1` and `data2` has the mean of its own group (within that column) subtracted from it
    * Example for key `A`
        * The mean of `data1` for key `A` is (0 + 3) / 2 == 1.5
        * Rows 0 and 3 (the two rows with key `A`) therefore return the original value minus that mean: 0 - 1.5 == -1.5 and 3 - 1.5 == 1.5, respectively
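Because the transformed result keeps the index of the input, it can be attached back to the original table as a new column. The sketch below does that and then checks that each group's centered values average to zero; `df_t`, `centered`, and `data1_centered` are hypothetical names.

```python
import pandas as pd

df_t = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'data1': range(6),
                     'data2': [5, 0, 3, 3, 7, 9]})

# Center both data columns by their group-wise mean
centered = df_t.groupby('key')[['data1', 'data2']].transform(lambda x: x - x.mean())

# Same shape and index as the input, so it aligns row by row
df_t['data1_centered'] = centered['data1']

# Alternative spelling: broadcast the group mean, then subtract
alt = df_t['data1'] - df_t.groupby('key')['data1'].transform('mean')

# After centering, every group's mean should be zero
check = df_t.groupby('key')['data1_centered'].mean()

print(df_t)
print(check)
```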

#### Apply() method
The `apply()` method lets you apply an arbitrary function to the group results. The function should take a `DataFrame`, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned. 
For example, here is an `apply()` that normalizes the first column by the sum of the second:

In [38]:
df

,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [36]:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

In [37]:
print(df.groupby('key').apply(norm_by_data2))

  key     data1  data2
0   A  0.000000      5
1   B  0.142857      0
2   C  0.166667      3
3   A  0.375000      3
4   B  0.571429      7
5   C  0.416667      9


* A little detail about the above
    * Key `B` example
        * Row 1 (key `B`) is computed as follows
            * The `data1` value of 1 is divided by the sum of `data2` over all rows with key `B`, and the result replaces the original value
            * That is, 1 / (0 + 7) == 1 / 7 == 0.142857
        * Row 2 (key `C`) works similarly
            * 2 / (3 + 9) == 2 / 12 == 1 / 6 == 0.166667
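Two variations on this pattern may be useful. The sketch below first repeats the normalization on a copy of each group, so the function does not modify its input, and then shows an `apply()` whose function returns a scalar, in which case the combine step yields one value per key. The names `df_a`, `norm_by_data2_copy`, `normalized`, and `ratio` are hypothetical; passing `group_keys=False` and selecting the data columns explicitly are choices made here to keep the original row index and to avoid depending on how a given pandas version treats the grouping column.

```python
import pandas as pd

df_a = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                     'data1': range(6),
                     'data2': [5, 0, 3, 3, 7, 9]})

def norm_by_data2_copy(x):
    # x is a DataFrame of group values; work on a copy to leave it untouched
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

grouped = df_a.groupby('key', group_keys=False)[['data1', 'data2']]

# Function returns a DataFrame: result has one row per input row
normalized = grouped.apply(norm_by_data2_copy).sort_index()

# Function returns a scalar: result has one value per group
ratio = grouped.apply(lambda g: g['data1'].sum() / g['data2'].sum())

print(normalized)
print(ratio)
```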