## [ Data Aggregation ]
Aggregation refers to any data transformation that produces a scalar value from an array.
- It means summarizing or combining data in a meaningful way, usually after grouping it, to extract useful information.
- It involves applying functions to a group of rows, or even to an entire DataFrame or Series.
- It is useful in reporting, analysis, and data visualization steps.

#### **Optimized `groupby` Methods in pandas**

These methods are implemented efficiently in Cython under the hood, making them faster than applying custom functions.

| Method | Description |
|--------|-------------|
| `.sum()` | Sum of group values |
| `.mean()` | Mean (average) of group values |
| `.size()` | Number of elements in each group (returns a Series) |
| `.count()` | Number of **non-null** elements in each group |
| `.min()` | Minimum value in each group |
| `.max()` | Maximum value in each group |
| `.std()` | Standard deviation |
| `.var()` | Variance |
| `.first()` | First non-null value in each group |
| `.last()` | Last non-null value in each group |
| `.nth(n)` | nth item from each group |
| `.median()` | Median of group values |
| `.prod()` | Product of group values |
| `.nunique()` | Number of unique values per group |
| `.describe()` | Multiple summary statistics (count, mean, std, min, 25%, 50%, 75%, max) |
| `.all()` | Check if **all** values in each group are `True` |
| `.any()` | Check if **any** value in each group is `True` |

> Using these built-in methods is typically **much faster** than passing a custom function to `.apply()`, because pandas has optimized implementations for them.
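
The difference between `.size()` and `.count()` in the table above is easy to miss, so here is a minimal sketch on a made-up frame (the name `nulls_demo` and its values are illustrative assumptions, not part of the notebook's data). `.size()` counts every row in a group, while `.count()` skips nulls.

```python
import pandas as pd

# Hypothetical data: each group contains one missing value
nulls_demo = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "value": [1.0, None, 3.0, 4.0, None],
})
by_key = nulls_demo.groupby("key")["value"]

sizes = by_key.size()     # rows per group, nulls included
counts = by_key.count()   # non-null values per group
firsts = by_key.first()   # first non-null value per group

print(sizes.to_dict())    # {'a': 2, 'b': 3}
print(counts.to_dict())   # {'a': 1, 'b': 2}
print(firsts.to_dict())   # {'a': 1.0, 'b': 3.0}
```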


--- 



You can **create your own custom aggregation functions**, and also use **any method** that works on the data being grouped.

For example, the `.nsmallest()` method picks the **smallest N values** from a Series.

Even though `.nsmallest()` is **not built-in or optimized** for `groupby`, you can **still use it**.

Here’s what pandas does behind the scenes:
- It **splits** your data into groups.
- Then it **applies `.nsmallest(n)`** to each group separately.
- Finally, it **joins all the results** back together into one output.
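
The three steps above can be imitated by hand. This is only a rough sketch of the idea on a made-up frame (`sac_demo` is a hypothetical name), not how pandas implements it internally; it simply checks that a manual split, apply, and combine gives the same values as the `groupby` call.

```python
import pandas as pd

# Hypothetical data for illustration
sac_demo = pd.DataFrame({
    "key": ["a", "a", "b", "b", "a"],
    "val": [3.0, 1.0, 5.0, 2.0, 4.0],
})

# 1. split: one sub-Series per group key
pieces = {key: grp["val"] for key, grp in sac_demo.groupby("key")}

# 2. apply: run nsmallest(2) on each piece separately
smallest = {key: s.nsmallest(2) for key, s in pieces.items()}

# 3. combine: glue the per-group results together, keyed by group
manual = pd.concat(smallest)

# the same operation expressed through groupby
builtin = sac_demo.groupby("key")["val"].nsmallest(2)

print(manual.tolist())   # [1.0, 3.0, 2.0, 5.0]
print(builtin.tolist())  # [1.0, 3.0, 2.0, 5.0]
```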


In [20]:
import numpy as np 
import pandas as pd 

df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None],
                   "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
                   "data1" : np.random.standard_normal(7),
                   "data2" : np.random.standard_normal(7)})
df

Unnamed: 0,key1,key2,data1,data2
0,a,1.0,0.599956,0.621124
1,a,2.0,0.652727,-0.698346
2,,1.0,1.785437,1.317909
3,b,2.0,0.21335,-0.846138
4,b,1.0,-0.101996,1.098413
5,a,,-0.048534,0.249556
6,,1.0,0.689625,-1.119516


In [21]:
grouped = df.groupby("key1")
grouped["data1"].nsmallest(1)

key1   
a     5   -0.048534
b     4   -0.101996
Name: data1, dtype: float64

In [22]:
# to use your own aggregation function, pass any function that aggregates an array to the aggregate method or its short alias, agg

def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)

Unnamed: 0_level_0,key2,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0.701261,1.31947
b,1,0.315347,1.944551


In [23]:
# notice that some methods, like describe, also work, even though they are not aggregations, strictly speaking
grouped.describe()

Unnamed: 0_level_0,key2,key2,key2,key2,key2,key2,key2,key2,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
a,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,3.0,0.401383,...,0.626342,0.652727,3.0,0.057445,0.68039,-0.698346,-0.224395,0.249556,0.43534,0.621124
b,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,2.0,0.055677,...,0.134514,0.21335,2.0,0.126137,1.375005,-0.846138,-0.36,0.126137,0.612275,1.098413


> Custom aggregation functions are generally much slower than the optimized methods listed earlier, because of the extra overhead (function calls, data rearrangement) involved in constructing the intermediate group data chunks.
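
One way to sidestep that overhead, when the custom function is just a combination of optimized methods, is to combine the optimized results directly. The sketch below assumes a small made-up frame (`ptp_demo` is hypothetical) and checks that the two routes agree; it does not measure speed.

```python
import pandas as pd

# Hypothetical data for illustration
ptp_demo = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, 4.0, 10.0, 7.0],
    "y": [2.0, 2.5, 0.0, 3.0],
})
by_key = ptp_demo.groupby("key")

def peak_to_peak(arr):
    return arr.max() - arr.min()

# custom function: called once per group and column
custom = by_key.agg(peak_to_peak)

# same quantity built from two optimized aggregations
fast = by_key.max() - by_key.min()

print(custom.loc["a", "x"], fast.loc["a", "x"])  # 3.0 3.0
```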

## [ Column-Wise and Multiple Function Application ]

In [24]:
# tipping dataset
# load it with pandas.read_csv, add a tipping percentage column

tips = pd.read_csv("examples/tips.csv")
tips.head()

Unnamed: 0,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3
3,23.68,3.31,No,Sun,Dinner,2
4,24.59,3.61,No,Sun,Dinner,4


In [25]:
# add a tip_pct column with the tip percentage of the total bill
tips["tip_pct"] = tips["tip"] / tips["total_bill"]
tips.head()

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808


In [26]:
# aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate (or agg) with the desired function or calling a method

# you may want to aggregate using a different function depending on the column, or multiple functions at once
# both are possible, as the following cells show

grouped = tips.groupby(["day", "smoker"])
grouped_pct = grouped["tip_pct"]
grouped_pct.agg("mean")

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

In [27]:
# if you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions
grouped_pct.agg(["mean", "std", peak_to_peak])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,peak_to_peak
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fri,No,0.15165,0.028123,0.067349
Fri,Yes,0.174783,0.051293,0.159925
Sat,No,0.158048,0.039767,0.235193
Sat,Yes,0.147906,0.061375,0.290095
Sun,No,0.160113,0.042347,0.193226
Sun,Yes,0.18725,0.154134,0.644685
Thur,No,0.160298,0.038774,0.19335
Thur,Yes,0.163863,0.039389,0.15124


In [28]:
# we don't need to accept the names that GroupBy gives to the columns
# pass a list of (name, function) tuples; the first element of each tuple is used as the DataFrame column name

grouped_pct.agg([("average", "mean"), ("stdev", "std")])

Unnamed: 0_level_0,Unnamed: 1_level_0,average,stdev
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,No,0.15165,0.028123
Fri,Yes,0.174783,0.051293
Sat,No,0.158048,0.039767
Sat,Yes,0.147906,0.061375
Sun,No,0.160113,0.042347
Sun,Yes,0.18725,0.154134
Thur,No,0.160298,0.038774
Thur,Yes,0.163863,0.039389


In [29]:
# with a dataframe, we have more options,
# we can specify a list of functions to apply to all of the columns or different functions per column

# suppose we want to compute the same three statistics for the tip_pct and total_bill columns:
functions = ["count", "mean", "max"]
result = grouped[["tip_pct", "total_bill"]].agg(functions)
# the resulting DataFrame has hierarchical columns, the same as we would get by aggregating each column
# separately and using concat to glue the results together, with the column names as the keys argument
result

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,max,count,mean,max
day,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Fri,No,4,0.15165,0.187735,4,18.42,22.75
Fri,Yes,15,0.174783,0.26348,15,16.813333,40.17
Sat,No,45,0.158048,0.29199,45,19.661778,48.33
Sat,Yes,42,0.147906,0.325733,42,21.276667,50.81
Sun,No,57,0.160113,0.252672,57,20.506667,48.17
Sun,Yes,19,0.18725,0.710345,19,24.12,45.35
Thur,No,45,0.160298,0.266312,45,17.113111,41.19
Thur,Yes,17,0.163863,0.241255,17,19.190588,43.11
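
The equivalence with `concat` mentioned in the comment above can be checked on a small stand-in frame. The values in `tips_demo` are made up for illustration and are not taken from `tips.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the tips data
tips_demo = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "tip_pct": [0.10, 0.20, 0.15, 0.25, 0.05],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
})
by_day = tips_demo.groupby("day")
stat_names = ["count", "mean", "max"]

# all columns at once: hierarchical columns (column, statistic)
combined = by_day[["tip_pct", "total_bill"]].agg(stat_names)

# one column at a time, then glued side by side with the column names as keys
per_column = [by_day["tip_pct"].agg(stat_names), by_day["total_bill"].agg(stat_names)]
glued = pd.concat(per_column, axis="columns", keys=["tip_pct", "total_bill"])

print(list(combined.columns) == list(glued.columns))  # True
```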


In [30]:
result["tip_pct"]

Unnamed: 0_level_0,Unnamed: 1_level_0,count,mean,max
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fri,No,4,0.15165,0.187735
Fri,Yes,15,0.174783,0.26348
Sat,No,45,0.158048,0.29199
Sat,Yes,42,0.147906,0.325733
Sun,No,57,0.160113,0.252672
Sun,Yes,19,0.18725,0.710345
Thur,No,45,0.160298,0.266312
Thur,Yes,17,0.163863,0.241255


In [31]:
# as before, a list of tuples with custom names can be passed
ftuples = [("Average", "mean"), ("Variance", "var")]
grouped[["tip_pct", "total_bill"]].agg(ftuples)

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,Average,Variance,Average,Variance
day,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Fri,No,0.15165,0.000791,18.42,25.596333
Fri,Yes,0.174783,0.002631,16.813333,82.562438
Sat,No,0.158048,0.001581,19.661778,79.908965
Sat,Yes,0.147906,0.003767,21.276667,101.387535
Sun,No,0.160113,0.001793,20.506667,66.09998
Sun,Yes,0.18725,0.023757,24.12,109.046044
Thur,No,0.160298,0.001503,17.113111,59.625081
Thur,Yes,0.163863,0.001551,19.190588,69.808518


In [32]:
# suppose we want to apply potentially different functions to one or more of the columns
# to do this, pass a dictionary to agg that contains a mapping of column names to any of the function specifications listed so far

grouped.agg({"tip": "max", "size": "sum"})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip,size
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,No,3.5,9
Fri,Yes,4.73,31
Sat,No,9.0,115
Sat,Yes,10.0,104
Sun,No,6.0,167
Sun,Yes,6.5,49
Thur,No,6.7,112
Thur,Yes,5.0,40


In [33]:
grouped.agg({"tip_pct": ["min", "max", "mean", "std"], "size": "sum"})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,tip_pct,size
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,std,sum
day,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Fri,No,0.120385,0.187735,0.15165,0.028123,9
Fri,Yes,0.103555,0.26348,0.174783,0.051293,31
Sat,No,0.056797,0.29199,0.158048,0.039767,115
Sat,Yes,0.035638,0.325733,0.147906,0.061375,104
Sun,No,0.059447,0.252672,0.160113,0.042347,167
Sun,Yes,0.06566,0.710345,0.18725,0.154134,49
Thur,No,0.072961,0.266312,0.160298,0.038774,112
Thur,Yes,0.090014,0.241255,0.163863,0.039389,40


In [34]:
# a dataframe will have hierarchical columns only if multiple functions are applied to at least one column
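
A quick way to see this is to inspect `columns.nlevels` on the result. The frame below is a made-up stand-in (`levels_demo` is a hypothetical name), used only to show the two cases.

```python
import pandas as pd

# Hypothetical stand-in data
levels_demo = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "tip": [1.0, 3.0, 2.0, 5.0],
    "size": [2, 4, 3, 1],
})
by_day = levels_demo.groupby("day")

# one function per column: flat column index
flat = by_day.agg({"tip": "max", "size": "sum"})

# several functions on at least one column: hierarchical column index
nested = by_day.agg({"tip": ["min", "max"], "size": "sum"})

print(flat.columns.nlevels)    # 1
print(nested.columns.nlevels)  # 2
print(list(nested.columns))    # [('tip', 'min'), ('tip', 'max'), ('size', 'sum')]
```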

## [ Returning Aggregated Data Without Row Indexes ]

In [36]:
# in all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed of the unique group key combinations
# since this isn't always desirable, we can disable this behavior in most cases by passing as_index=False to groupby

df.groupby("key1").mean()

Unnamed: 0_level_0,key2,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1.5,0.401383,0.057445
b,1.5,0.055677,0.126137


In [37]:
df.groupby("key1", as_index=False).mean()

Unnamed: 0,key1,key2,data1,data2
0,a,1.5,0.401383,0.057445
1,b,1.5,0.055677,0.126137



#### Why use `as_index=False`?

| Scenario | Why it helps |
|----------|--------------|
| Further manipulation | Easier to merge, filter, or join later |
| Visualization | Plotting is simpler with regular columns |
| Exporting data | Looks cleaner in Excel or CSV format |
| Chaining methods | Better compatibility with chained operations |

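
The same flat layout can also be reached by calling `reset_index()` on an indexed result. The sketch below, on a made-up frame (`flat_demo` is a hypothetical name), checks that both routes give the same table in this simple case.

```python
import pandas as pd

# Hypothetical data for illustration
flat_demo = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "x": [1.0, 2.0, 3.0, 4.0],
})

# group keys returned as a regular column
direct = flat_demo.groupby("key", as_index=False).mean()

# group keys returned as the index, then moved back into a column
via_reset = flat_demo.groupby("key").mean().reset_index()

print(list(direct.columns))  # ['key', 'x']
print(direct["x"].tolist())  # [2.0, 3.0]
```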