## Aggregation and Grouping


An essential piece of analysis of large data is efficient summarization: computing
aggregations like sum(), mean(), median(), min(), and max(), in which a single
number gives insight into the nature of a potentially large dataset.

### Planets Data


Here we will use the Planets dataset, available via the Seaborn package.

In [2]:
import seaborn as sns
import pandas as pd
import numpy as np

In [52]:
planets = sns.load_dataset('planets')  # load_dataset fetches the Planets data from seaborn's online repository
planets.shape

(1035, 6)

In [53]:
#pd.set_option('display.max_rows',None)  
#This is used to display all rows and not default 10 rows only.

In [54]:
planets

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.300000,7.10,77.40,2006
1,Radial Velocity,1,874.774000,2.21,56.95,2008
2,Radial Velocity,1,763.000000,2.60,19.84,2011
3,Radial Velocity,1,326.030000,19.40,110.62,2007
4,Radial Velocity,1,516.220000,10.50,119.47,2009
...,...,...,...,...,...,...
1030,Transit,1,3.941507,,172.00,2006
1031,Transit,1,2.615864,,148.00,2007
1032,Transit,1,3.191524,,174.00,2007
1033,Transit,1,4.125083,,293.00,2008


### Simple Aggregation in Pandas


In [55]:
rng=np.random.RandomState(42)
ser=pd.Series(rng.random(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [56]:
ser.sum()

2.811925491708157

In [57]:
ser.prod()

0.02434509596197801

In [58]:
df=pd.DataFrame({'A': rng.random(5),
                 'B': rng.random(5)})
df


Unnamed: 0,A,B
0,0.155995,0.020584
1,0.058084,0.96991
2,0.866176,0.832443
3,0.601115,0.212339
4,0.708073,0.181825


In [59]:
df.mean()  # by default, the mean of each column is computed

A    0.477888
B    0.443420
dtype: float64

In [60]:
df.mean(axis='columns')  # with the axis argument, the mean is instead taken across the columns of each row

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64

In [61]:
planets.loc[planets['mass'].isnull(),'method'].unique()

array(['Radial Velocity', 'Imaging', 'Eclipse Timing Variations',
       'Transit', 'Astrometry', 'Transit Timing Variations',
       'Orbital Brightness Modulation', 'Microlensing', 'Pulsar Timing',
       'Pulsation Timing Variations'], dtype=object)

In [62]:
planets.dropna().describe()  # summary statistics for the rows that remain after dropping missing values

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


To go deeper into the data, however, simple aggregates are often not enough.

### GroupBy: Split, Apply, Combine


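Simple aggregations can give you a flavor of your dataset, but often we would prefer
to aggregate conditionally on some label or index: this is implemented in the so-called
groupby operation. The name “group by” comes from a command in the SQL
database language, but it is perhaps more illuminative to think of it in the terms first
coined by Hadley Wickham of Rstats fame: split, apply, combine. The split step breaks
the data into groups according to a key, the apply step computes some function
(usually an aggregate) within each group, and the combine step merges the results
into a single output.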

While we could certainly do this manually using some combination of the masking,
aggregation, and merging commands covered earlier, it’s important to realize that the
intermediate splits do not need to be explicitly instantiated. Rather, the GroupBy can
(often) do this in a single pass over the data, updating the sum, mean, count, min, or
other aggregate for each group along the way.
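To make the split, apply, combine idea concrete, the sketch below compares a manual version built from boolean masks with the equivalent groupby call. The frame `df_demo` and the variable names `manual` and `grouped` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Small illustrative frame (hypothetical name, chosen to avoid clashing with df)
df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

# Manual version: split with a boolean mask per key, apply sum, combine into a Series
manual = pd.Series({k: df_demo.loc[df_demo['key'] == k, 'data'].sum()
                    for k in df_demo['key'].unique()})

# GroupBy version: same result without building the intermediate pieces by hand
grouped = df_demo.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

Both approaches should agree; the groupby form is shorter and leaves the splitting to pandas.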

In [63]:
df=pd.DataFrame({'key':['A','B','C','A','B','C'],
                 'data':range(6)},
                 columns =['key','data'])
df

Unnamed: 0,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


In [64]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018CDA3A69C8>

Notice that what is returned is not a set of DataFrames, but a DataFrameGroupBy
object. This object is where the magic is: you can think of it as a special view of the
DataFrame, which is poised to dig into the groups but does no actual computation
until the aggregation is applied. This “lazy evaluation” approach means that common
aggregates can be implemented very efficiently in a way that is almost transparent to
the user.
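One way to get a feel for this object is to inspect it before aggregating. The following is a minimal sketch on a small frame (`df_demo` and `gb` are hypothetical names); it uses the `ngroups` and `groups` attributes and the `get_group()` method of the GroupBy object.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data': range(6)})

gb = df_demo.groupby('key')   # no aggregate has been computed at this point

print(gb.ngroups)             # number of distinct groups
print(gb.groups)              # mapping from group label to the row labels in that group
print(gb.get_group('A'))      # materialize a single group as a DataFrame
```

Only when an aggregate such as `gb.sum()` is called does pandas compute a result for every group.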

In [65]:
df.groupby('key').sum()

key,data
A,3
B,5
C,7


Perhaps the most important operations made available by a GroupBy are aggregate,
filter, transform, and apply.

#### Column indexing

The GroupBy object supports column indexing in the same way as
the DataFrame, and returns a modified GroupBy object.

In [66]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000018CDA394E08>

In [67]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x0000018CDA41E248>

In [68]:
planets.groupby('method')['orbital_period'].median()  # any aggregate (mean(), sum(), ...) could be used here
# rows are grouped by the 'method' column and the median of 'orbital_period' is computed per group

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

In [69]:
for (method, group) in planets.groupby('method'):
    print("{0:40s} shape={1}".format(method, group.shape))
# The 40s format spec pads each method name to 40 characters so the shape column lines up;
# change the width to widen or narrow the gap.

Astrometry                               shape=(2, 6)
Eclipse Timing Variations                shape=(9, 6)
Imaging                                  shape=(38, 6)
Microlensing                             shape=(23, 6)
Orbital Brightness Modulation            shape=(3, 6)
Pulsar Timing                            shape=(5, 6)
Pulsation Timing Variations              shape=(1, 6)
Radial Velocity                          shape=(553, 6)
Transit                                  shape=(397, 6)
Transit Timing Variations                shape=(4, 6)


### Dispatch methods

In [70]:
planets.groupby('method')['year'].describe()

method,count,mean,std,min,25%,50%,75%,max
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


In [71]:
planets.groupby('method')['year'].describe().unstack()

       method                       
count  Astrometry                          2.0
       Eclipse Timing Variations           9.0
       Imaging                            38.0
       Microlensing                       23.0
       Orbital Brightness Modulation       3.0
                                         ...  
max    Pulsar Timing                    2011.0
       Pulsation Timing Variations      2007.0
       Radial Velocity                  2014.0
       Transit                          2014.0
       Transit Timing Variations        2014.0
Length: 80, dtype: float64

### Aggregate, filter, transform, apply

GroupBy objects have aggregate(), filter(), transform(), and apply() methods that
efficiently implement a variety of useful operations before combining the grouped data.

In [72]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
# the dict keys already define the column names and their order, so the columns=... argument is optional here
df


Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


### Aggregation

We’re now familiar with GroupBy aggregations with sum(), median(),
and the like, but the aggregate() method allows for even more flexibility. It can take
a string, a function, or a list thereof, and compute all the aggregates at once.

In [73]:
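df.groupby('key').aggregate(['min',np.median,max])
# the string 'min' and the builtin min give the same result; newer pandas versions may warn
# when callables such as np.median or max are passed and suggest the string names instead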

,data1,data1,data1,data2,data2,data2
key,min,median,max,min,median,max
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


Another useful pattern is to pass a dictionary mapping column names to operations
to be applied on that column.

In [74]:
df.groupby('key').aggregate({'data1':'min',
                             'data2':'max'})

key,data1,data2
A,0,5
B,1,7
C,2,9
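A related pattern, not used in the notebook above, is pandas' keyword-based "named aggregation", where each keyword names an output column and maps to a (column, function) pair. The sketch below is an illustration under that assumption; the output names `data1_min` and `data2_max` are hypothetical, and the `data2` values are copied from the frame shown earlier.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# Each keyword names an output column; the tuple gives (input column, aggregate)
summary = df_demo.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
)
print(summary)
```

Compared with the dictionary form, this makes the resulting column names explicit, which can help when the same input column is aggregated in more than one way.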


### Filtering

A filtering operation allows you to drop data based on the group properties.
For example, we might want to keep all groups in which the standard deviation is
larger than some critical value.

In [96]:
print(df.groupby('key').std() > 4)  # Boolean table: True where a group's standard deviation exceeds 4


     data1  data2
key              
A    False  False
B    False   True
C    False   True


In [97]:
def filter_func(x):
    # keep a group only if the standard deviation of its data2 values exceeds 4
    return x['data2'].std() > 4
print(df)
print(df.groupby('key').std())
print(df.groupby('key').filter(filter_func))

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
       data1     data2
key                   
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9


### Transformation

While aggregation must return a reduced version of the data, 
transformation can return some transformed version of the full data to recombine.

In [89]:
df.groupby('key').transform(lambda x: x-x.mean())


Unnamed: 0,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0
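The contrast with aggregation can be seen by passing an aggregate name to transform(): the group value is broadcast back to every original row instead of being reduced to one row per group. This is a minimal sketch; `df_demo`, `means`, and the column `data2_centered` are hypothetical names, and the data repeat the values shown above.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

# One group mean per original row, aligned with df_demo's index
means = df_demo.groupby('key')['data2'].transform('mean')

# Because the shapes match, the result can be combined directly with the original column
df_demo['data2_centered'] = df_demo['data2'] - means
print(df_demo)
```

The centered column should match the `data2` column produced by the lambda-based transform above.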


### The Apply method

The apply() method lets you apply an arbitrary function to the
group results. The function should take a DataFrame, and return either a Pandas
object (e.g., DataFrame, Series) or a scalar.

In [95]:
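def norm_by_data(x):
    # x is a DataFrame of group values
    x = x.copy()  # work on a copy so the input group is not modified in place
    x['data1'] = x['data1'] / x['data2'].mean()  # divide data1 by the group's mean of data2
    return x
print(df)
print(df.groupby('key').apply(norm_by_data))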


  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
  key     data1  data2
0   A  0.000000      5
1   B  0.285714      0
2   C  0.333333      3
3   A  0.750000      3
4   B  1.142857      7
5   C  0.833333      9
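The function passed to apply() may also return a scalar, in which case the combined result has one entry per group. The sketch below assumes a hypothetical helper `data1_per_mean_data2` and selects the data columns explicitly before calling apply(), so the function sees only `data1` and `data2` regardless of how a given pandas version treats the grouping column.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})

def data1_per_mean_data2(x):
    # x is the sub-DataFrame for one group; return a single number for that group
    return x['data1'].sum() / x['data2'].mean()

# Selecting the columns first keeps the grouping column out of the frame passed to the function
result = df_demo.groupby('key')[['data1', 'data2']].apply(data1_per_mean_data2)
print(result)
```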


### Specifying the split key


#### A list, array, series, or index providing the grouping keys

The key can be any series or list with a length matching that of the DataFrame.

In [111]:
L = [0, 1, 0, 1, 2, 0]
print(df); print(df.groupby(L).sum())

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
   data1  data2
0      7     17
1      4      3
2      4      7


In [115]:
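M=['A','C','B','C','A','A']  # alternative list of keys, used only in the commented-out line below
df.groupby('key').sum()
# df.groupby(M).sum()  # would group the rows by the labels in M instead of the 'key' column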

key,data1,data2
A,3,8
B,5,7
C,7,12


In [113]:
# Of course, this means there’s another, more verbose way of accomplishing the df.groupby('key') from before
print(df); print(df.groupby(df['key']).sum())

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
     data1  data2
key              
A        3      8
B        5      7
C        7     12


#### A dictionary or series mapping index to group

Another method is to provide a dictionary
that maps index values to the group keys.

In [123]:
df2=df.set_index('key')
mapping={'A': 'vowel','B' : 'consonant','C' :'consonant'}
df2.groupby(mapping).sum()

Unnamed: 0,data1,data2
consonant,12,19
vowel,3,8


In [120]:
df2

key,data1,data2
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9


In [118]:
df.set_index('key').groupby(mapping).sum()

Unnamed: 0,data1,data2
consonant,12,19
vowel,3,8


In [124]:
df.groupby(mapping).sum()  # empty result: df's index is 0..5, so none of the mapping keys ('A', 'B', 'C') match

Unnamed: 0,data1,data2
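The same grouping can be expressed with a Series instead of a dictionary: the Series is aligned on the frame's index, and its values become the group keys. This is a sketch under that assumption; `df2_demo` and `mapping_series` are hypothetical names, and the data repeat the values used above.

```python
import pandas as pd

df_demo = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                        'data1': range(6),
                        'data2': [5, 0, 3, 3, 7, 9]})
df2_demo = df_demo.set_index('key')   # the mapping is looked up against this index

# Series index = index values of df2_demo, Series values = group labels
mapping_series = pd.Series({'A': 'vowel', 'B': 'consonant', 'C': 'consonant'})

result = df2_demo.groupby(mapping_series).sum()
print(result)
```

As with the dictionary, the lookup only works when the frame's index actually contains the labels used in the mapping.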


#### Any Python function

Similar to mapping, you can pass any Python function that takes
an index value as input and returns the group.

In [130]:
print(df2); print(df2.groupby(str.lower).mean())
# str.lower :  Return a copy of the string converted to lowercase.

     data1  data2
key              
A        0      5
B        1      0
C        2      3
A        3      3
B        4      7
C        5      9
   data1  data2
a    1.5    4.0
b    2.5    3.5
c    3.5    6.0


#### A list of valid keys

Further, any of the preceding key choices can be combined to
group on a multi-index.

In [133]:
print(df2.groupby([str.lower,mapping]).mean())

             data1  data2
a vowel        1.5    4.0
b consonant    2.5    3.5
c consonant    3.5    6.0


In [139]:
pd.set_option('display.max_rows',10)

In [147]:
planets

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.300000,7.10,77.40,2006
1,Radial Velocity,1,874.774000,2.21,56.95,2008
2,Radial Velocity,1,763.000000,2.60,19.84,2011
3,Radial Velocity,1,326.030000,19.40,110.62,2007
4,Radial Velocity,1,516.220000,10.50,119.47,2009
...,...,...,...,...,...,...
1030,Transit,1,3.941507,,172.00,2006
1031,Transit,1,2.615864,,148.00,2007
1032,Transit,1,3.191524,,174.00,2007
1033,Transit,1,4.125083,,293.00,2008


In [152]:
decade=10*(planets['year']//10)
decade=decade.astype(str)+'s'    # convert each decade to a string, then append 's'
decade.name='decade'
planets.groupby(['method',decade])['number'].sum().unstack().fillna(0)

decade,1980s,1990s,2000s,2010s
method,,,,
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0


In [155]:
decade=10*(planets['year']//10)
decade=decade.astype(str)+'s'    # convert each decade to a string, then append 's'
decade.name='decade'             # note: decade is recomputed here but not used in the grouping below
planets.groupby(['method','mass'])['number'].sum().unstack().fillna(0) 
# Grouping by two keys and unstacking yields one row per method and one column per mass value.
# Any number of grouping keys can be combined; here 'mass' is the second key instead of decade.

mass,0.00360,0.00600,0.00755,0.00800,0.00850,0.00870,0.00970,0.01100,0.01133,0.01170,...,15.50000,17.40000,18.10000,18.15000,18.37000,19.40000,19.80000,20.60000,21.42000,25.00000
method,,,,,,,,,,,,,,,,,,,,,
Eclipse Timing Variations,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Radial Velocity,1.0,4.0,3.0,12.0,3.0,2.0,2.0,6.0,1.0,3.0,...,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0
Transit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


This shows the power of combining many of the operations we’ve discussed up to this
point when looking at realistic datasets.

Here I would suggest digging into these few lines of code, and evaluating the individual 
steps to make sure you understand exactly what they are doing to the result. It’s
certainly a somewhat complicated example, but understanding these pieces will give
you the means to similarly explore your own data.
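One possible way to work through the decade example step by step is sketched below. To keep it self-contained it uses a tiny invented stand-in for the planets table (`mini`, with made-up values), so the numbers are illustrative only; the operations mirror those in the cell above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the planets table (values invented for illustration)
mini = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity', 'Radial Velocity', 'Transit'],
    'number': [1, 2, 1, 3, 1],
    'year':   [2008, 2011, 1995, 2009, 2012],
})

# Step 1: floor each year to the start of its decade (2008 -> 2000, 1995 -> 1990)
decade = 10 * (mini['year'] // 10)

# Step 2: turn the numbers into labels such as '2000s'
decade = decade.astype(str) + 's'
decade.name = 'decade'   # this name labels the column axis after unstack()

# Step 3: group by a column name together with the external Series, then sum 'number'
summed = mini.groupby(['method', decade])['number'].sum()
print(summed)

# Step 4: move the inner index level (decade) into columns; absent pairs become NaN
table = summed.unstack()

# Step 5: replace NaN with 0 for method/decade pairs that have no rows
table = table.fillna(0)
print(table)
```

Printing the intermediate `summed` Series before unstacking shows the two-level index that the final table is built from.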