# 📘 Section 7: Grouping, Aggregation & Pivoting in Pandas

**Level:** Intermediate → Advanced

This section covers one of the most powerful Pandas capabilities: summarizing and reshaping data.

We'll explore:
- `groupby()` for flexible aggregations
- Multi-level grouping & hierarchical indexing
- `agg()` and `apply()` for custom functions
- Pivot tables and cross-tabulations
- Real-world analytics cases (sales, performance metrics, etc.)

---

## 🔹 7.1 Grouping Basics: Understanding `groupby()`

Grouping allows us to split data into meaningful segments, apply functions to each group, and combine the results.

Let's simulate an e-commerce dataset containing customers, products, cities, and order values.

In [None]:
import pandas as pd
import numpy as np

sales = pd.DataFrame({
    'order_id': range(1001, 1011),
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'David', 'Alice', 'Bob', 'Charlie', 'Alice', 'David'],
    'city': ['New York', 'Paris', 'New York', 'Berlin', 'Tokyo', 'New York', 'Paris', 'Berlin', 'New York', 'Tokyo'],
    'product': ['Laptop', 'Book', 'Phone', 'Book', 'Tablet', 'Headphones', 'Laptop', 'Tablet', 'Monitor', 'Laptop'],
    'price': [1200, 25, 800, 20, 300, 100, 1150, 280, 200, 950],
    'quantity': [1, 3, 1, 2, 1, 2, 1, 1, 1, 2]
})

sales['total'] = sales['price'] * sales['quantity']
sales
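
The split, apply, and combine steps described at the start of this subsection can be sketched by hand and compared with a single `groupby()` call. This is a minimal illustration on a reduced copy of the dataset; the names `pieces`, `sums`, `manual`, and `direct` are chosen here for the example and do not appear elsewhere in the section.

```python
import pandas as pd

sales = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'David'],
    'total': [1200, 75, 800, 40, 300],
})

# Split: one sub-DataFrame per customer, selected with a boolean mask
pieces = {
    name: sales[sales['customer'] == name]
    for name in sorted(sales['customer'].unique())
}

# Apply: reduce each piece to a single number
sums = {name: piece['total'].sum() for name, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='total').rename_axis('customer')

# groupby() performs all three steps in one call
direct = sales.groupby('customer')['total'].sum()
```

Both `manual` and `direct` hold the same per-customer totals; the `groupby()` version is shorter and avoids the explicit Python loop.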

## 🔹 7.2 Aggregation by Single Column

Let's compute **total sales per customer** using `groupby()` and `sum()`.

In [None]:
customer_sales = sales.groupby('customer')['total'].sum().reset_index().sort_values(by='total', ascending=False)
customer_sales

### Aggregating Multiple Columns

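You can aggregate several columns at once by passing `agg()` a dictionary that maps each column to one function or to a list of functions. When any column receives a list, the result has two-level (MultiIndex) columns such as `('price', 'mean')`.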

In [None]:
multi_agg = sales.groupby('customer').agg({
    'price': ['mean', 'max', 'min'],
    'quantity': 'sum',
    'total': 'sum'
})
multi_agg
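
Two-level column labels can be awkward to export or merge. One common workaround, sketched here on a reduced copy of the data, is to join each pair of labels into a single flat name. The `flat` variable and the underscore naming scheme are choices made for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Bob'],
    'price': [1200, 25, 800, 1150],
    'quantity': [1, 3, 1, 1],
})

multi_agg = sales.groupby('customer').agg({
    'price': ['mean', 'max', 'min'],
    'quantity': 'sum',
})

# Each column label is a tuple such as ('price', 'mean');
# join the two parts into one string per column
flat = multi_agg.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
```

After this step, `flat` has ordinary string columns (`price_mean`, `price_max`, and so on) that can be selected with plain bracket syntax.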

## 🔹 7.3 Multi-Level Grouping

Group by **multiple keys** at once, for example `city` and `customer`.

In [None]:
city_customer_sales = sales.groupby(['city', 'customer'])['total'].sum().unstack(fill_value=0)
city_customer_sales
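
Before `unstack()` is applied, grouping by two keys yields a Series with a two-level index. The sketch below, on a reduced copy of the data, shows how that intermediate result might be queried; the variable names are illustrative only.

```python
import pandas as pd

sales = pd.DataFrame({
    'city': ['New York', 'Paris', 'New York', 'Berlin', 'Paris'],
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'total': [1200, 75, 800, 40, 1150],
})

by_city_customer = sales.groupby(['city', 'customer'])['total'].sum()

# Select one (city, customer) pair with a tuple
alice_ny = by_city_customer.loc[('New York', 'Alice')]

# Select everything under one outer key; the result is indexed by customer
paris = by_city_customer.loc['Paris']

# unstack() moves the inner level (customer) into the columns
wide = by_city_customer.unstack(fill_value=0)
```

The long form is convenient for lookups by key, while the wide form reads more like a report; `fill_value=0` fills the city/customer pairs that never occur.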

## 🔹 7.4 Using `agg()` with Custom Functions

Define your own aggregation logic, for example computing **revenue variance** or **average revenue per order**.

In [None]:
def revenue_variance(x):
    # np.var defaults to ddof=0 (population variance); pandas' 'var' uses ddof=1
    return np.var(x)

custom_agg = sales.groupby('city').agg(
    total_sales=('total', 'sum'),
    avg_per_order=('total', 'mean'),
    sales_var=('total', revenue_variance)
).sort_values(by='total_sales', ascending=False)

custom_agg
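
One detail worth checking when a custom function wraps `np.var`: NumPy defaults to the population variance (`ddof=0`), while the pandas built-in `'var'` computes the sample variance (`ddof=1`). The following sketch places the two side by side on a small made-up subset; `population_variance` and `compare` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Tokyo', 'Tokyo'],
    'total': [75, 1150, 300, 1900],
})

def population_variance(x):
    # NumPy default: divide by n (ddof=0)
    return np.var(x)

compare = sales.groupby('city').agg(
    var_population=('total', population_variance),
    var_sample=('total', 'var'),  # pandas default: divide by n - 1 (ddof=1)
)
```

With only two orders per city, the sample variance is exactly twice the population variance, so the choice of `ddof` matters most for small groups.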

## 🔹 7.5 Pivot Tables: Powerful Reshaping

Pivot tables provide a convenient way to reshape data for reporting.
They work similarly to Excel pivot tables but with the flexibility of Python.

In [None]:
pivot = pd.pivot_table(
    sales,
    values='total',
    index='city',
    columns='product',
    aggfunc='sum',
    fill_value=0
)
pivot
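
The outline at the top of this section also lists cross-tabulations. A minimal sketch with `pd.crosstab`, on a reduced copy of the data, might look like the following: without `values` it counts rows per city/product pair, and with `values` and `aggfunc` it reproduces the pivot table above.

```python
import pandas as pd

sales = pd.DataFrame({
    'city': ['New York', 'Paris', 'New York', 'Berlin', 'Paris'],
    'product': ['Laptop', 'Book', 'Phone', 'Book', 'Laptop'],
    'total': [1200, 75, 800, 40, 1150],
})

# Number of orders for each city/product pair
counts = pd.crosstab(sales['city'], sales['product'])

# Same layout, but summing `total` instead of counting rows;
# pairs with no orders come back as NaN, so fill them with 0
revenue = pd.crosstab(
    sales['city'], sales['product'],
    values=sales['total'], aggfunc='sum',
).fillna(0)

# Equivalent pivot table for comparison
pivot = pd.pivot_table(
    sales, values='total', index='city', columns='product',
    aggfunc='sum', fill_value=0,
)
```

`crosstab` takes the columns themselves rather than a DataFrame plus column names, which makes it handy for quick frequency tables.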

### 🔁 Using `melt()` to Reverse Pivot Tables

Sometimes, we need to return data to a **long format** for visualization or analysis.

In [None]:
melted = pivot.reset_index().melt(id_vars='city', var_name='product', value_name='total_sales')
melted.head()
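
Melting a pivot table that was built with `fill_value=0` also brings back the city/product pairs that never had an order. If those rows are not wanted, one option is to filter them out, as in this sketch on a reduced copy of the data. The `observed` name is illustrative, and the filter assumes every real order has a positive total.

```python
import pandas as pd

sales = pd.DataFrame({
    'city': ['New York', 'Paris', 'New York', 'Paris'],
    'product': ['Laptop', 'Book', 'Phone', 'Laptop'],
    'total': [1200, 75, 800, 1150],
})

pivot = pd.pivot_table(
    sales, values='total', index='city', columns='product',
    aggfunc='sum', fill_value=0,
)

# Long format: one row per (city, product) cell of the pivot table
melted = pivot.reset_index().melt(
    id_vars='city', var_name='product', value_name='total_sales'
)

# Keep only the pairs that actually had orders
observed = melted[melted['total_sales'] > 0].reset_index(drop=True)
```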

## ⚙️ Under the Hood

- **`groupby()`** creates a `DataFrameGroupBy` object and is lazy: no aggregation is computed until you call `.sum()`, `.agg()`, etc.
- Each group is processed independently and the results are then combined, similar to the **MapReduce** pattern.
- Hierarchical indices (MultiIndex) allow Pandas to store multi-level groupings efficiently.
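
The points above can be explored directly on the grouped object. This short sketch, on a reduced copy of the data, inspects a `DataFrameGroupBy` before any aggregation runs; the variable names are chosen for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    'city': ['New York', 'Paris', 'New York', 'Tokyo'],
    'total': [1200, 75, 800, 300],
})

grouped = sales.groupby('city')  # no aggregation has run yet

n_groups = grouped.ngroups                 # 3
row_labels = grouped.groups                # mapping: city -> row index labels
new_york = grouped.get_group('New York')   # sub-DataFrame for one group

# Iterating yields (key, sub-DataFrame) pairs, one per group
sizes = {city: len(frame) for city, frame in grouped}

totals = grouped['total'].sum()            # the aggregation is computed here
```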

---

## 💼 Real-World Problem 1: Sales KPI Dashboard

**Scenario:** A retail analytics team wants to identify the **top 3 cities** by total sales and the **most popular product in each city** (by units sold).

In [None]:
# Total sales by city
top_cities = sales.groupby('city')['total'].sum().sort_values(ascending=False).head(3)
print('Top 3 Cities by Sales:')
display(top_cities)

# Most popular product per city
popular_products = sales.groupby(['city', 'product'])['quantity'].sum().reset_index()
popular_products = popular_products.sort_values(['city', 'quantity'], ascending=[True, False]).groupby('city').head(1)
print('Most Popular Products per City:')
display(popular_products)
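
An alternative to the sort-then-`head(1)` approach is to ask each group for the row label of its largest value with `idxmax()`. This is a sketch on a reduced copy of the data, not a replacement for the code above; like that code, it returns a single product per city and does not report ties.

```python
import pandas as pd

sales = pd.DataFrame({
    'city': ['New York', 'Paris', 'New York', 'Paris', 'Tokyo'],
    'product': ['Laptop', 'Book', 'Headphones', 'Laptop', 'Laptop'],
    'quantity': [1, 3, 2, 1, 2],
})

# Units sold per city/product pair, with the keys kept as columns
qty = sales.groupby(['city', 'product'], as_index=False)['quantity'].sum()

# idxmax() returns the row label of the largest quantity within each city
best_rows = qty.groupby('city')['quantity'].idxmax()
popular = qty.loc[best_rows].reset_index(drop=True)
```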

## 🌍 Real-World Problem 2: Customer Lifetime Value (CLV)

Estimate each customer's **average order value (AOV)** and **purchase frequency**, two key metrics in e-commerce analytics.

In [None]:
clv = sales.groupby('customer').agg(
    total_spent=('total', 'sum'),
    avg_order_value=('total', 'mean'),
    order_count=('order_id', 'count')
).sort_values(by='total_spent', ascending=False)

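# Assumption: the sample covers one month of orders (the dataset has no date column),
# so purchase frequency is orders per month and multiplying by 12 annualizes it.
observation_months = 1
clv['purchase_frequency'] = clv['order_count'] / observation_months
clv['CLV_estimate'] = clv['avg_order_value'] * clv['purchase_frequency'] * 12  # yearly estimate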
clv

## ✅ Best Practices / Pitfalls

- ✅ Reset the index after grouping (or pass `as_index=False`) if you plan to merge on the group keys later.
- ✅ Prefer built-in aggregations such as `'sum'` and `'mean'` inside `agg()`; they run on optimized code paths, while custom Python functions are called once per group.
- ⚠️ Avoid `.apply()` for simple aggregations; it is slower than the built-in methods.
- ⚙️ When the data does not fit comfortably in memory, consider **chunking** or using **Dask**.
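
The first and third points can be illustrated together. In this sketch the `segments` table is hypothetical and exists only to give the merge something to join against; the rest uses a reduced copy of the section's data.

```python
import pandas as pd

sales = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'total': [1200, 75, 800, 40],
})

# Hypothetical lookup table, not part of the section's dataset
segments = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Charlie'],
    'segment': ['premium', 'standard', 'standard'],
})

# as_index=False keeps `customer` as a regular column, ready for merging
spend = sales.groupby('customer', as_index=False)['total'].sum()
report = spend.merge(segments, on='customer', how='left')

# Same totals two ways: the built-in method and a Python function via apply()
builtin = sales.groupby('customer')['total'].sum()
via_apply = sales.groupby('customer')['total'].apply(lambda s: s.sum())
```

Both `builtin` and `via_apply` contain identical values; the difference is that `apply()` calls the lambda once per group in Python, which tends to become noticeable as the number of groups grows.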

---

## 💪 Challenge Exercise

**Task:** Given a dataset of movie ratings with columns `(user_id, movie, genre, rating, country)`, perform the following:

1. Find average rating per genre per country.
2. Identify top 2 genres by rating per country.
3. Create a pivot table showing countries vs. genres (mean rating).
4. Suggest one insight you might derive for a streaming platform.

_(No solution here. Try implementing it yourself!)_

---
# --- End of Section 7: Continue to Section 8 ---