<h1>Aggregation and Grouping</h1>

In [1]:
# Efficient summarization is an essential part of analyzing large datasets.

# In this section, we explore aggregations in Pandas, from simple operations akin to those we have
# seen on NumPy arrays to more flexible operations based on groupby.

# General imports

import numpy as np
import pandas as pd

<h3>Planets Data</h3>

In [2]:
# We will use the planets dataset, available via the Seaborn package.

# It gives information on planets that astronomers have discovered around other stars.

import seaborn as sns

In [3]:
# Load the planets dataset

planets = sns.load_dataset("planets")

In [5]:
# Check the shape of the dataset. shape is an attribute, not a method, so it is accessed without parentheses.

planets.shape

(1035, 6)

In [6]:
# Show the first 5 rows of the planets dataset

planets.head()

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009


In [7]:
# Show the last 5 rows with tail(); the final index label (1034) is consistent with 1035 records

planets.tail()

       method  number  orbital_period  mass  distance  year
1030  Transit       1        3.941507   NaN     172.0  2006
1031  Transit       1        2.615864   NaN     148.0  2007
1032  Transit       1        3.191524   NaN     174.0  2007
1033  Transit       1        4.125083   NaN     293.0  2008
1034  Transit       1        4.187757   NaN     260.0  2008


In [8]:
# For a Pandas Series, the aggregates return a single value:

rangeValue = np.random.RandomState(42)
ser = pd.Series(rangeValue.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [9]:
ser.sum()

2.811925491708157

In [10]:
ser.mean()

0.5623850983416314

In [11]:
# For a DataFrame, by default the aggregates return results within each column:

df = pd.DataFrame({"A":rangeValue.rand(5),
                   "B":rangeValue.rand(5)})
df

          A         B
0  0.155995  0.020584
1  0.058084  0.969910
2  0.866176  0.832443
3  0.601115  0.212339
4  0.708073  0.181825


In [12]:
df.mean()

A    0.477888
B    0.443420
dtype: float64

In [13]:
# To aggregate within each row, we can specify the axis argument:

df.mean(axis="columns")

0    0.088290
1    0.513997
2    0.849309
3    0.406727
4    0.444949
dtype: float64
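A minimal sketch of how the axis argument changes the direction of an aggregate. The frame `small` and its values are made up for illustration so the results can be checked by hand; they are not part of the planets data.

```python
import pandas as pd

# Hypothetical toy frame; values chosen so the means are easy to verify
small = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

col_means = small.mean()                # default axis=0: one value per column
row_means = small.mean(axis="columns")  # one value per row

print(col_means)
print(row_means)
```

Here `col_means` has one entry per column (A and B), while `row_means` has one entry per row label (0, 1, 2).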

In [14]:
# Drop rows with missing values, then compute summary statistics for each numeric column
planets.dropna().describe()

          number  orbital_period        mass    distance         year
count  498.00000      498.000000  498.000000  498.000000   498.000000
mean     1.73494      835.778671    2.509320   52.068213  2007.377510
std      1.17572     1469.128259    3.636274   46.596041     4.167284
min      1.00000        1.328300    0.003600    1.350000  1989.000000
25%      1.00000       38.272250    0.212500   24.497500  2005.000000
50%      1.00000      357.000000    1.245000   39.940000  2009.000000
75%      2.00000      999.600000    2.867500   59.332500  2011.000000
max      6.00000    17337.500000   25.000000  354.000000  2014.000000


<h3>GroupBy: Split, Apply, and Combine</h3>

In [15]:
# Conditional aggregation on some label or index is implemented in the so-called groupby operation.

<h4>Split, Apply, Combine</h4>

In [17]:
# The canonical split-apply-combine operation, with summation as the apply step, consists of three steps:

## The split step breaks up and groups a DataFrame depending on the value of the specified key.
## The apply step computes some function, usually an aggregate, transformation, or filter, within
 # the individual groups.
## The combine step merges the results of these operations into an output array.

In [18]:
# The intermediate splits do not need to be explicitly instantiated. 

# The GroupBy can do this in a single pass over the data, updating the sum, mean, count, min or other aggregate 
# for each group along the way. 
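To make the three steps concrete, here is a rough sketch that carries them out by hand and then compares the result with groupby. The names `toy`, `pieces`, `sums`, and `manual` are illustrative only, and the explicit dictionary of pieces is exactly the intermediate that groupby avoids building.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"], "data": range(6)})

# Split: one sub-frame per distinct key value
pieces = {k: toy[toy["key"] == k] for k in toy["key"].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece["data"].sum() for k, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(sums, name="data").sort_index()
manual.index.name = "key"

# The same computation expressed with groupby
grouped = toy.groupby("key")["data"].sum()

print(manual)
print(grouped)
```

Both printed Series hold the same per-key sums.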

In [20]:
# Example:

dataFrame = pd.DataFrame({"key":["A","B","C","A","B","C"], "data":range(6)},columns=["key","data"])
dataFrame

  key  data
0   A     0
1   B     1
2   C     2
3   A     3
4   B     4
5   C     5


In [21]:
# We can compute the basic split-apply-combine operation with the groupby() method of DataFrames, passing the
# name of the desired key column:

dataFrame.groupby("key")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x168590100>

In [22]:
# To produce a result, we can apply an aggregate to this DataFrameGroupBy object, which will perform the 
# appropriate apply/combine steps to produce the desired result:

dataFrame.groupby("key").sum()

     data
key
A       3
B       5
C       7


<h5>The GroupBy object</h5>

In [23]:
# The GroupBy object is a very flexible abstraction. The most important operations made available by a GroupBy
# are aggregate, filter, transform, and apply.



<h5>Column Indexing</h5>

In [24]:
# The GroupBy object supports column indexing in the same way as a DataFrame, and returns a modified GroupBy object.

planets.groupby("method")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x168d29f00>

In [25]:
planets.groupby("method")["orbital_period"]

<pandas.core.groupby.generic.SeriesGroupBy object at 0x168d9c490>

In [26]:
# Perform operation on the Group By object

planets.groupby("method")["orbital_period"].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

<h5>Iteration over groups</h5>

In [27]:
# The GroupBy object supports direct iteration over the groups, returning each group as a Series or DataFrame

for method, group in planets.groupby("method"):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)
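Besides iterating, a single group can be pulled out by its key with get_group(). The sketch below uses a small made-up frame (`toy` is a hypothetical name, not the planets data) so the result is easy to check.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"], "data": range(6)})
grouped = toy.groupby("key")

# Iteration yields (key, sub-frame) pairs
sizes = {key: len(group) for key, group in grouped}

# get_group returns the sub-frame belonging to one key
group_a = grouped.get_group("A")

print(sizes)
print(group_a)
```

Each key owns two rows here, and `group_a` contains the rows with index labels 0 and 3.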


<h5>Dispatch Methods</h5>

In [28]:
# Any method not explicitly implemented by the GroupBy object will be passed through and called on the groups,
# whether they are DataFrame or Series objects.

In [31]:
planets.groupby("method")["year"].describe().unstack()

       method                       
count  Astrometry                          2.0
       Eclipse Timing Variations           9.0
       Imaging                            38.0
       Microlensing                       23.0
       Orbital Brightness Modulation       3.0
                                         ...  
max    Pulsar Timing                    2011.0
       Pulsation Timing Variations      2007.0
       Radial Velocity                  2014.0
       Transit                          2014.0
       Transit Timing Variations        2014.0
Length: 80, dtype: float64

In [33]:
planets.groupby("method")["year"].describe().unstack().tail()

     method                     
max  Pulsar Timing                  2011.0
     Pulsation Timing Variations    2007.0
     Radial Velocity                2014.0
     Transit                        2014.0
     Transit Timing Variations      2014.0
dtype: float64

In [34]:
planets.groupby("method").describe().unstack().head()

               method                       
number  count  Astrometry                        2.0
               Eclipse Timing Variations         9.0
               Imaging                          38.0
               Microlensing                     23.0
               Orbital Brightness Modulation     3.0
dtype: float64

In [35]:
planets.groupby("method").describe().unstack().tail()

           method                     
year  max  Pulsar Timing                  2011.0
           Pulsation Timing Variations    2007.0
           Radial Velocity                2014.0
           Transit                        2014.0
           Transit Timing Variations      2014.0
dtype: float64

<h5>Aggregate, Filter, Transform, Apply</h5>

In [37]:
# GroupBy objects have aggregate(), filter(), transform() and apply() methods that efficiently implement 
# a variety of useful operations before combining the grouped data. 

rng = np.random.RandomState(0)
dataFra = pd.DataFrame({"key":["A","B","C","A","B","C"],"data1":range(6),"data2":rng.randint(0,10,6)},
                       columns=["key","data1","data2"])
dataFra

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9


<h5>Aggregation</h5>

In [38]:
# The aggregate() method provides more flexibility. It can take a string, a function, or a list thereof, and
# compute all the aggregates at once. String names are used here for all three aggregates.

dataFra.groupby("key").aggregate(["min", "median", "max"])

    data1            data2
      min median max   min median max
key
A       0    1.5   3     3    4.0   5
B       1    2.5   4     0    3.5   7
C       2    3.5   5     3    6.0   9


In [39]:
# Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column

dataFra.groupby("key").aggregate({"data1":"min",
                                  "data2":"max"})

     data1  data2
key
A        0      5
B        1      7
C        2      9
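A related pattern is named aggregation, where each keyword argument names an output column and the value is an (input column, aggregate) pair. This is only a sketch: the frame `toy` mirrors the example data above, and the output column names `data1_min` and `data2_max` are chosen here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                    "data1": range(6),
                    "data2": [5, 0, 3, 3, 7, 9]})

# Each keyword names an output column; the tuple is (input column, aggregate)
summary = toy.groupby("key").agg(data1_min=("data1", "min"),
                                 data2_max=("data2", "max"))
print(summary)
```

Compared with the dictionary form, this gives direct control over the names of the resulting columns.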


<h5>Filtering</h5>

In [40]:
# A filtering operation allows you to drop data based on the group properties. 

def filter_func(x):
    # Keep a group only if the standard deviation of its data2 values exceeds 4
    return x["data2"].std() > 4

print(dataFra)
print(dataFra.groupby("key").std())
print(dataFra.groupby("key").filter(filter_func))

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
       data1     data2
key                   
A    2.12132  1.414214
B    2.12132  4.949747
C    2.12132  4.242641
  key  data1  data2
1   B      1      0
2   C      2      3
4   B      4      7
5   C      5      9


In [41]:
# The function passed to filter() should return a Boolean value specifying whether the group passes the filter.
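As a further sketch, the filter function can be any per-group test that returns True or False. The threshold of 7 and the frame `toy` below are arbitrary choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                    "data1": range(6),
                    "data2": [5, 0, 3, 3, 7, 9]})

# Keep only the groups whose data2 values sum to more than 7
kept = toy.groupby("key").filter(lambda g: g["data2"].sum() > 7)
print(kept)
```

The data2 sums are 8 for A, 7 for B, and 12 for C, so every row of group B is dropped and the rows of A and C are returned with their original index labels.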

<h5>Transformation</h5>

In [42]:
# While aggregation must return a reduced version of the data, transformation can return some transformed version
# of the full data to recombine.

# A common example is to center the data by subtracting the group-wise mean:

dataFra.groupby("key").transform(lambda x:x - x.mean())

   data1  data2
0   -1.5    1.0
1   -1.5   -3.5
2   -1.5   -3.0
3    1.5   -1.0
4    1.5    3.5
5    1.5    3.0
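The same centering can be sketched with a string aggregate name: transform("mean") broadcasts each group's mean back onto that group's rows, so the result lines up with the original data. The frame `toy` is an illustrative stand-in for the data1 column above.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                    "data1": range(6)})

# One mean per original row, aligned on the original index
group_mean = toy.groupby("key")["data1"].transform("mean")

# Subtract the broadcast means to center each group at zero
centered = toy["data1"] - group_mean
print(centered)
```

The output has the same length and index as the input, which is what distinguishes a transformation from an aggregation.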


<h5>The apply() method</h5>

In [43]:
# The apply() method lets you apply an arbitrary function to the group results.

# The function should take a DataFrame and return either a Pandas object or a scalar. The combine operation
# will be tailored to the type of output returned.

In [44]:
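# Normalization example: divide data1 by the sum of data2 within each group

def norm_by_data2(x):
    # x is a DataFrame holding one group; work on a copy so the input is not mutated
    x = x.copy()
    x["data1"] = x["data1"] / x["data2"].sum()
    return x

print(dataFra)
# group_keys=False keeps the original index instead of adding the group keys as an index level
print(dataFra.groupby("key", group_keys=False).apply(norm_by_data2))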

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
  key     data1  data2
0   A  0.000000      5
1   B  0.142857      0
2   C  0.166667      3
3   A  0.375000      3
4   B  0.571429      7
5   C  0.416667      9




In [45]:
# apply() within a GroupBy is quite flexible; the only criterion is that the function takes a DataFrame 
# and returns a Pandas object or scalar. 
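To illustrate how the combine step adapts to the return type, the sketch below applies a function that returns a scalar, which yields one value per group. The function `spread` and the frame `toy` are hypothetical, and the data columns are selected explicitly so that the function only ever sees data1 and data2.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                    "data1": range(6),
                    "data2": [5, 0, 3, 3, 7, 9]})

def spread(g):
    # g holds one group's data1 and data2 columns; return a single number
    return g["data1"].max() - g["data2"].min()

# A scalar per group is combined into a Series indexed by the group key
result = toy.groupby("key")[["data1", "data2"]].apply(spread)
print(result)
```

Had `spread` returned a DataFrame instead, the per-group frames would be combined into a DataFrame.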

<h5>Specifying the split key</h5>

In [46]:
# In the simple examples, we split the DataFrame on a single column name.

<h5>A list, array, series, or index providing the grouping keys</h5>

In [47]:
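# The key can be any Series or list with a length matching that of the DataFrame. Example:

L = [0,1,0,1,2,0]
print(dataFra)
# numeric_only=True restricts the sum to the numeric columns, leaving out the string column "key"
print(dataFra.groupby(L).sum(numeric_only=True))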

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
   data1  data2
0      7     17
1      4      3
2      4      7




In [48]:
print(dataFra);print(dataFra.groupby(dataFra["key"]).sum())

  key  data1  data2
0   A      0      5
1   B      1      0
2   C      2      3
3   A      3      3
4   B      4      7
5   C      5      9
     data1  data2
key              
A        3      8
B        5      7
C        7     12


<h5>A dictionary or series mapping index to group</h5>

In [49]:
# Another method is to provide a dictionary that maps the index value to the group keys:

df2 = dataFra.set_index("key")
mapping = {"A":"vowel","B":"consonant", "C":"consonant"}
print(df2);df2.groupby(mapping).sum()

     data1  data2
key              
A        0      5
B        1      0
C        2      3
A        3      3
B        4      7
C        5      9


           data1  data2
key
consonant     12     19
vowel          3      8


<h5>Any Python Function</h5>

In [50]:
# Any Python function that takes an index value and returns the group label can be passed:

print(df2);df2.groupby(str.lower).mean()

     data1  data2
key              
A        0      5
B        1      0
C        2      3
A        3      3
B        4      7
C        5      9


     data1  data2
key
a      1.5    4.0
b      2.5    3.5
c      3.5    6.0


<h5>A list of valid keys</h5>

In [51]:
# Any of the preceding key choices can be combined to group on a multi-index:

df2.groupby([str.lower, mapping]).mean()

               data1  data2
key key
a   vowel        1.5    4.0
b   consonant    2.5    3.5
c   consonant    3.5    6.0


<h5>Grouping example</h5>

In [52]:
# Count discovered planets by method and by decade:

decade = 10 * (planets["year"] // 10)
decade = decade.astype(str) + "s"
decade.name = "decade"
planets.groupby(["method", decade])["number"].sum().unstack().fillna(0)

decade                         1980s  1990s  2000s  2010s
method
Astrometry                       0.0    0.0    0.0    2.0
Eclipse Timing Variations        0.0    0.0    5.0   10.0
Imaging                          0.0    0.0   29.0   21.0
Microlensing                     0.0    0.0   12.0   15.0
Orbital Brightness Modulation    0.0    0.0    0.0    5.0
Pulsar Timing                    0.0    9.0    1.0    1.0
Pulsation Timing Variations      0.0    0.0    1.0    0.0
Radial Velocity                  1.0   52.0  475.0  424.0
Transit                          0.0    0.0   64.0  712.0
Transit Timing Variations        0.0    0.0    0.0    9.0
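The same recipe can be checked by hand on a few made-up records. In this sketch the frame `toy` is hypothetical (not taken from the planets data), and unstack(fill_value=0) is used as a variation on fillna(0) so that missing method/decade combinations are filled during the reshape.

```python
import pandas as pd

# Hypothetical discovery records, small enough to verify by eye
toy = pd.DataFrame({"method": ["Transit", "Transit", "Imaging", "Transit"],
                    "number": [1, 2, 1, 3],
                    "year": [2008, 2011, 2004, 2012]})

# Floor each year to its decade, then label it (2008 becomes "2000s")
decade = (10 * (toy["year"] // 10)).astype(str) + "s"
decade.name = "decade"

# Group on a column name and an external Series together, then pivot decades into columns
counts = toy.groupby(["method", decade])["number"].sum().unstack(fill_value=0)
print(counts)
```

Imaging has no records in the 2010s, so that cell is filled with 0, while Transit sums to 1 in the 2000s and 5 in the 2010s.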
