# Aggregation and Grouping

### Aggregation Basics
**Aggregation** in pandas means summarizing data with functions such as `sum()`, `mean()`, `min()`, and `max()`.

<u>**Series**</u>

In [16]:
import pandas as pd
import numpy as np
  
ser = pd.Series([1, 2, 3, 4, 5])

print(ser.sum())  
print(ser.mean()) 

15
3.0


<u>**DataFrame**</u>

In [24]:
df = pd.DataFrame({
      'A': [1, 2, 3, 4, 5],
      'B': [5, 4, 3, 2, 1]
})
print(df.mean())

A    3.0
B    3.0
dtype: float64


#### PLANETS DATA

In [29]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

* Seaborn is a Python visualization library built on Matplotlib that provides a high-level interface for statistical graphics. Here it is used only for `sns.load_dataset`, which loads one of its example datasets: a table of exoplanet discoveries.

In [31]:
planets.head()

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009



### GroupBy: Split, Apply, Combine

**GroupBy** is a powerful tool to split data into groups, apply a function to each group, and then combine the results.

1. **Split**: Divide the data into groups based on some criteria.
2. **Apply**: Perform operations (like aggregation) on each group.
3. **Combine**: Merge the results back into a DataFrame or Series.
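
The three steps above can be sketched by hand on a tiny table and compared with what `groupby` returns. This is only an illustration: the `toy` DataFrame and its `key`/`value` columns are made up for the example and are not part of the planets data.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    'key': ['a', 'b', 'a', 'b', 'c'],
    'value': [1, 2, 3, 4, 5],
})

# Split: one sub-DataFrame per distinct key.
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number.
sums = {k: piece['value'].sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name='value').sort_index()

# groupby performs the same three steps in one call.
auto = toy.groupby('key')['value'].sum()

print(manual.to_dict())
print(auto.to_dict())
```

Both dictionaries should hold the same per-key totals; in practice the `groupby` call is the form to prefer, since it avoids the explicit loop.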

Suppose we want the average mass of the planets discovered by each method:

In [41]:
planets = sns.load_dataset('planets')

# Group by the 'method' column and calculate the mean of the 'mass' column for each method
grouped = planets.groupby('method')['mass'].mean()

grouped

method
Astrometry                            NaN
Eclipse Timing Variations        5.125000
Imaging                               NaN
Microlensing                          NaN
Orbital Brightness Modulation         NaN
Pulsar Timing                         NaN
Pulsation Timing Variations           NaN
Radial Velocity                  2.630699
Transit                          1.470000
Transit Timing Variations             NaN
Name: mass, dtype: float64
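
The `NaN` results above presumably mean that no planet found by those methods has a recorded mass: `mean()` skips missing values by default, and a group with nothing left to average yields `NaN`. A minimal sketch on made-up data (the `toy` frame and the method labels `X` and `Y` are assumptions for illustration) shows the behaviour, with `count()` reporting how many non-missing values each group has.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group X has one known mass, group Y has none.
toy = pd.DataFrame({
    'method': ['X', 'X', 'Y', 'Y'],
    'mass': [1.0, np.nan, np.nan, np.nan],
})

by_method = toy.groupby('method')['mass']

# mean() ignores NaN; a group made up entirely of NaN has a NaN mean.
means = by_method.mean()

# count() returns the number of non-missing values in each group.
counts = by_method.count()

print(means)
print(counts)
```

Checking `count()` next to `mean()` is a quick way to see how many observations each group average rests on.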

### Key GroupBy Operations

- **Aggregation**: Apply functions like `sum()`, `mean()`, or custom functions to each group.

In [49]:
grouped = planets.groupby('method').agg({'mass': 'mean', 'orbital_period': 'median'})
grouped

method,mass,orbital_period
Astrometry,,631.18
Eclipse Timing Variations,5.125,4343.5
Imaging,,27500.0
Microlensing,,3300.0
Orbital Brightness Modulation,,0.342887
Pulsar Timing,,66.5419
Pulsation Timing Variations,,1170.0
Radial Velocity,2.630699,360.2
Transit,1.47,5.714932
Transit Timing Variations,,57.011
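
As a variation on the dictionary form above, `agg` also accepts keyword arguments of the form `new_name=(column, function)`, often called named aggregation, which lets the output columns carry descriptive names. The sketch below uses a small made-up frame; the output names `mean_mass` and `median_period` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical data with the same column names as the planets example.
toy = pd.DataFrame({
    'method': ['X', 'X', 'Y', 'Y', 'Y'],
    'mass': [1.0, 3.0, 2.0, 4.0, 6.0],
    'orbital_period': [10.0, 30.0, 5.0, 7.0, 100.0],
})

# Each keyword names an output column; the tuple gives (input column, function).
summary = toy.groupby('method').agg(
    mean_mass=('mass', 'mean'),
    median_period=('orbital_period', 'median'),
)

print(summary)
```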


- **Filtering**: Keep only groups that meet a certain condition.

In [57]:
def filter_func(group):
      return group['mass'].mean() > 2
  
filtered = planets.groupby('method').filter(filter_func)
filtered

,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3000,7.10,77.40,2006
1,Radial Velocity,1,874.7740,2.21,56.95,2008
2,Radial Velocity,1,763.0000,2.60,19.84,2011
3,Radial Velocity,1,326.0300,19.40,110.62,2007
4,Radial Velocity,1,516.2200,10.50,119.47,2009
...,...,...,...,...,...,...
914,Radial Velocity,1,6.9580,0.34,,2014
915,Radial Velocity,1,5.1180,0.40,,2014
916,Radial Velocity,1,121.7100,1.54,,2014
939,Radial Velocity,1,4.4264,,,2012
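
Note that `filter` decides per group, not per row: every row of a passing group is returned, including rows that would fail the condition on their own. A small sketch on made-up data (the `toy` frame is an assumption for illustration) makes this visible.

```python
import pandas as pd

# Hypothetical data: group X has mean mass 3.0, group Y has mean mass 1.0.
toy = pd.DataFrame({
    'method': ['X', 'X', 'Y', 'Y'],
    'mass': [1.0, 5.0, 0.5, 1.5],
})

# Keep only the groups whose mean mass exceeds 2.
kept = toy.groupby('method').filter(lambda g: g['mass'].mean() > 2)

# The X row with mass 1.0 survives because its group as a whole passes.
print(kept)
```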


- **Transformation**: Modify data within each group while keeping the original shape.

In [62]:
transformed = planets.groupby('method').transform(lambda x: (x - x.mean()) / x.std())
transformed

,number,orbital_period,mass,distance,year
0,-0.623536,-0.380813,1.168175,0.566289,-0.357489
1,-0.623536,0.035342,-0.109961,0.117425,0.113205
2,-0.623536,-0.041483,-0.008024,-0.697117,0.819245
3,-0.623536,-0.341821,4.383120,1.295448,-0.122142
4,-0.623536,-0.211100,2.056859,1.489700,0.348551
...,...,...,...,...,...
1030,-0.682329,-0.371554,,-0.467566,-2.520265
1031,-0.682329,-0.400257,,-0.493828,-2.039002
1032,-0.682329,-0.387793,,-0.465378,-2.039002
1033,-0.682329,-0.367580,,-0.335163,-1.557739
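
To see what "keeping the original shape" means, here is a smaller sketch that centers one column on its group mean. The `toy` frame is made up for the example; the point is that the result has one value per input row, aligned to the original index.

```python
import pandas as pd

# Hypothetical data: two groups with means 2.0 (X) and 15.0 (Y).
toy = pd.DataFrame({
    'method': ['X', 'X', 'Y', 'Y'],
    'mass': [1.0, 3.0, 10.0, 20.0],
})

# Subtract each group's own mean from its members.
centered = toy.groupby('method')['mass'].transform(lambda s: s - s.mean())

print(centered.tolist())
print(len(centered) == len(toy))
```

Because the result is aligned with the input, it can be assigned straight back as a new column, for example `toy['mass_centered'] = centered`.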


- **Apply**: Use a custom function on each group.

In [81]:
def custom_function(group):
    group['adjusted_mass'] = group['mass'] / group['orbital_period'].mean()
    return group
  
# include_groups=False excludes the grouping column ('method') from each group passed to custom_function
applied = planets.groupby('method').apply(custom_function, include_groups=False)
applied

method,,number,orbital_period,mass,distance,year,adjusted_mass
Astrometry,113,1,246.360000,,20.77,2013,
Astrometry,537,1,1016.000000,,14.98,2010,
Eclipse Timing Variations,32,1,10220.000000,6.05,,2009,0.001273
Eclipse Timing Variations,37,2,5767.000000,,130.72,2008,
Eclipse Timing Variations,38,2,3321.000000,,130.72,2008,
...,...,...,...,...,...,...,...
Transit,1034,1,4.187757,,260.00,2008,
Transit Timing Variations,680,2,160.000000,,2119.00,2011,
Transit Timing Variations,736,2,57.011000,,855.00,2012,
Transit Timing Variations,749,3,,,,2014,
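
`apply` can also reduce each group to a single value. The sketch below selects one column before calling `apply`, so the function receives a plain Series per group and the question of whether the grouping column is passed along does not arise. The `toy` frame and the helper `mass_range` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical data with two groups.
toy = pd.DataFrame({
    'method': ['X', 'X', 'Y', 'Y'],
    'mass': [1.0, 3.0, 10.0, 20.0],
})

def mass_range(s):
    """Return the spread (max minus min) of one group's values."""
    return s.max() - s.min()

# One result per group, indexed by the group key.
ranges = toy.groupby('method')['mass'].apply(mass_range)

print(ranges)
```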


#### SIMPLE EXAMPLE

In [102]:
import pandas as pd

data = {
    'Store': ['A', 'B', 'A', 'C', 'B', 'C'],
    'Sales': [200, 300, 250, 400, 320, 500]
}

df = pd.DataFrame(data)
df


,Store,Sales
0,A,200
1,B,300
2,A,250
3,C,400
4,B,320
5,C,500


Sum: Add up all the sales

In [88]:
total_sales = df['Sales'].sum()
print(total_sales)  


1970


Mean: Find the average sales

In [100]:
average_sales = df['Sales'].mean()
average_sales


328.3333333333333

Median: Find the middle value of the sales

In [108]:
median_sales = df['Sales'].median()
print(median_sales)

310.0


Min: Find the smallest sale

In [110]:
min_sales = df['Sales'].min()
print(min_sales)

200


Max: Find the largest sale

In [112]:
max_sales = df['Sales'].max()
print(max_sales)

500


#### GROUPING DATA

In [104]:
grouped = df.groupby('Store')['Sales'].sum()
grouped


Store
A    450
B    620
C    900
Name: Sales, dtype: int64
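
The same grouped selection can return several summaries at once by passing a list of function names to `agg`. The sketch below rebuilds the store data from the example above so that it runs on its own; the variable name `summary` is chosen here for illustration.

```python
import pandas as pd

# Same store data as in the example above.
df = pd.DataFrame({
    'Store': ['A', 'B', 'A', 'C', 'B', 'C'],
    'Sales': [200, 300, 250, 400, 320, 500],
})

# One row per store, one column per aggregation function.
summary = df.groupby('Store')['Sales'].agg(['sum', 'mean', 'max'])

print(summary)
```

The `sum` column should match the per-store totals shown above (450, 620, 900).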