#### Part 29: Advanced GroupBy Operations

In this notebook, we'll explore advanced GroupBy operations in pandas, including:
- Working with MultiIndex
- Grouping by index levels
- Handling categorical data in groupby
- Working with decimal and object columns

##### Setup
First, let's import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from decimal import Decimal

# Set the plotting style
plt.style.use('ggplot')

# Make plots appear in the notebook
%matplotlib inline

##### 1. GroupBy Tab Completion

Let's first create a DataFrame with some demographic data:

In [None]:
# Create a DataFrame with a DatetimeIndex
df = pd.DataFrame({
    'height': np.random.normal(loc=60, scale=10, size=10),
    'weight': np.random.normal(loc=160, scale=15, size=10),
    'gender': np.random.choice(['male', 'female'], size=10)
}, index=pd.date_range('1/1/2000', periods=10))
df

In [None]:
# Create a GroupBy object
gb = df.groupby('gender')
gb

The GroupBy object has many methods and attributes available. In an interactive session, you can use tab completion to see them all. Here we'll demonstrate some of the most common ones:

In [None]:
# Get the mean of each group
gb.mean()

In [None]:
# Get the size of each group
gb.size()

In [None]:
# Get a specific group
gb.get_group('female')
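
Beyond the aggregations above, a GroupBy object can be inspected and iterated directly. The following is a minimal sketch on a small hand-made frame (the `people` data and its values are made up for illustration, so the results are reproducible). It shows `ngroups`, `groups`, and iteration over `(key, sub-frame)` pairs:

```python
import pandas as pd

# Hypothetical toy data with fixed values (not the random df above)
people = pd.DataFrame({
    'height': [60.0, 70.0, 65.0, 55.0],
    'gender': ['male', 'female', 'male', 'female'],
})
people_gb = people.groupby('gender')

# Number of groups, and the mapping from each key to its row labels
n_groups = people_gb.ngroups
group_labels = {key: list(idx) for key, idx in people_gb.groups.items()}

# Iterating yields (key, sub-DataFrame) pairs, sorted by key by default
mean_heights = {key: frame['height'].mean() for key, frame in people_gb}

print(n_groups)
print(group_labels)
print(mean_heights)
```

Iteration is mostly useful for inspection or for logic that does not fit a single aggregation; for plain summaries, the built-in methods such as `mean()` are usually the simpler choice.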

##### 2. GroupBy with MultiIndex

With hierarchically-indexed data, it's quite natural to group by one of the levels of the hierarchy. Let's create a Series with a two-level MultiIndex:

In [None]:
# Create a Series with a two-level MultiIndex
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
s

###### 2.1 Grouping by Level

We can group by one of the levels in the MultiIndex:

In [None]:
# Group by the first level (level=0)
grouped = s.groupby(level=0)
grouped.sum()

In [None]:
# If the MultiIndex has names specified, these can be passed instead of the level number
s.groupby(level='second').sum()

In [None]:
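# Older pandas versions let aggregations such as sum take a level argument directly,
# e.g. s.sum(level='second'). That argument was deprecated in pandas 1.3 and removed in 2.0,
# so the groupby form is the portable way to write it:
s.groupby(level='second').sum()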

###### 2.2 Grouping with Multiple Levels

Grouping with multiple levels is also supported:

In [None]:
# Create a Series with a three-level MultiIndex
arrays = [
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['doo', 'doo', 'bee', 'bee', 'bop', 'bop', 'bop', 'bop'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second', 'third'])
s = pd.Series(np.random.randn(8), index=index)
s

In [None]:
# Group by multiple levels
s.groupby(level=['first', 'second']).sum()

In [None]:
# Index level names may be supplied as keys
s.groupby(['first', 'second']).sum()
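
The same idea extends to a DataFrame whose index is a MultiIndex: index levels and ordinary columns can be combined in one grouping. Here is a small sketch with made-up data (the names `frame`, `A`, and `B` are illustrative only), using `pd.Grouper(level=...)` for the index level and a plain column name for the column:

```python
import pandas as pd

# Hypothetical frame: a two-level index plus regular columns 'A' and 'B'
mi = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']],
    names=['first', 'second'],
)
frame = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [10, 20, 30, 40]}, index=mi)

# Group by index level 'second' together with column 'A'
mixed = frame.groupby([pd.Grouper(level='second'), 'A'])['B'].sum()

# Level names can also be given as plain strings alongside column names
mixed_by_name = frame.groupby(['second', 'A'])['B'].sum()

print(mixed)
```

Passing plain strings is shorter, while `pd.Grouper(level=...)` makes it explicit that the key refers to an index level rather than a column.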

##### 3. Working with Decimal and Object Columns

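In older pandas releases, any object column, even one holding numerical values such as `Decimal` objects, was treated as a "nuisance" column and silently dropped from groupby aggregations. pandas 2.0 removed that silent dropping: the aggregation is now attempted on every selected column, and you pass `numeric_only=True` to exclude non-numeric columns explicitly. The cells below therefore behave differently depending on the installed version.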

In [None]:
# Create a DataFrame with a Decimal column
df_dec = pd.DataFrame({
    'id': [1, 2, 1, 2],
    'int_column': [1, 2, 3, 4],
    'dec_column': [Decimal('0.50'), Decimal('0.15'),
                   Decimal('0.25'), Decimal('0.40')]
})
df_dec

In [None]:
# A Decimal column can be summed when it is selected on its own
df_dec.groupby(['id'])[['dec_column']].sum()

In [None]:
# Selected together with standard dtypes: older pandas dropped dec_column from this result,
# while pandas 2.0+ sums both columns
df_dec.groupby(['id'])[['int_column', 'dec_column']].sum()

In [None]:
# Use .agg to state the aggregation for each column explicitly;
# this covers standard and object dtypes at the same time
df_dec.groupby(['id']).agg({'int_column': 'sum', 'dec_column': 'sum'})
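
To make the handling of `Decimal` columns concrete, here is a self-contained sketch with the same values as above (the name `money` is just an illustrative stand-in for `df_dec`). It assumes a pandas version in which groupby `sum` accepts `numeric_only`, and contrasts summing the object column directly with restricting the aggregation to numeric dtypes:

```python
from decimal import Decimal

import pandas as pd

money = pd.DataFrame({
    'id': [1, 2, 1, 2],
    'int_column': [1, 2, 3, 4],
    'dec_column': [Decimal('0.50'), Decimal('0.15'),
                   Decimal('0.25'), Decimal('0.40')],
})

# Summing the object column on its own keeps exact Decimal arithmetic
dec_totals = money.groupby('id')['dec_column'].sum()

# numeric_only=True restricts the aggregation to numeric dtypes,
# so the object-dtype dec_column is left out of this result
numeric_totals = money.groupby('id').sum(numeric_only=True)

print(dec_totals)
print(list(numeric_totals.columns))
```

Being explicit, either by selecting columns or by passing `numeric_only`, avoids depending on version-specific defaults.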

##### 4. Handling of Categorical Values in GroupBy

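When using a Categorical grouper, the `observed` keyword controls whether the result contains the cartesian product of all possible grouper values (`observed=False`) or only the values that actually occur in the data (`observed=True`). The default has changed across pandas versions, so it is safest to pass `observed` explicitly.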

In [None]:
# Create a Series and a separate Categorical to use as the grouper
s = pd.Series([1, 1, 1])
cat = pd.Categorical(['a', 'a', 'a'], categories=['a', 'b'])
s

In [None]:
# Show all values (observed=False)
s.groupby(cat, observed=False).count()

In [None]:
# Show only the observed values (observed=True)
s.groupby(cat, observed=True).count()

In [None]:
# The index of the result is categorical, and its dtype includes every category of the grouper
result = s.groupby(cat, observed=False).count()
result.index.dtype
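
The "cartesian product" becomes visible once there are two categorical groupers. The sketch below uses made-up data (the `toy` frame and its `grade`, `color`, and `qty` columns are hypothetical) in which only two of the four possible combinations occur:

```python
import pandas as pd

# Hypothetical data: two categorical keys, two categories each
toy = pd.DataFrame({
    'grade': pd.Categorical(['S', 'S', 'L'], categories=['S', 'L']),
    'color': pd.Categorical(['red', 'red', 'blue'], categories=['red', 'blue']),
    'qty': [1, 2, 3],
})

# observed=False: one row per combination of categories (2 x 2 = 4)
full = toy.groupby(['grade', 'color'], observed=False)['qty'].sum()

# observed=True: only the combinations present in the data
seen = toy.groupby(['grade', 'color'], observed=True)['qty'].sum()

print(len(full), len(seen))
```

With many categories per key, `observed=False` can produce a very large and mostly empty result, which is a common reason to prefer `observed=True`.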

##### 5. Grouping with Ordered Factors

Categorical variables represented as instances of pandas' `Categorical` class can be used as group keys. When they are, the order of the categories is preserved in the result:

In [None]:
# Create a Series of random data
data = pd.Series(np.random.randn(100))

# Create quartiles as an ordered categorical
factor = pd.qcut(data, [0, .25, .5, .75, 1.])

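# Group by the factor and compute the mean
# (observed is passed explicitly to avoid relying on the version-dependent default)
data.groupby(factor, observed=False).mean()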
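
Because the quartile example relies on random data, a deterministic variant may be easier to check by hand. This sketch uses `pd.cut` with explicit, made-up bin edges and labels (`scores`, `bands`, and the label names are illustrative only) and shows that the result follows the category order rather than the order of appearance in the data:

```python
import pandas as pd

scores = pd.Series([25, 5, 15, 8, 28])

# Bins (0, 10], (10, 20], (20, 30] with explicit ordered labels
bands = pd.cut(scores, bins=[0, 10, 20, 30], labels=['low', 'mid', 'high'])

# The first value (25) falls in 'high', yet 'low' still comes first in the result
band_means = scores.groupby(bands, observed=False).mean()

print(band_means)
```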

##### 6. NA and NaT Group Handling

If there are any NaN or NaT values in the grouping key, they are excluded by default, so no "NA group" or "NaT group" appears in the result. Since pandas 1.1, passing `dropna=False` to `groupby` keeps those rows and collects them in their own group.

In [None]:
# Create a Series with NaN values in the index
s = pd.Series([1, 2, 3, 4], index=[1, 2, np.nan, np.nan])
s

In [None]:
# Group by index - NaN values are excluded by default
s.groupby(level=0).sum()
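
As a short illustration of the `dropna` keyword (available in pandas 1.1 and later), the sketch below groups a made-up frame by a column that contains missing keys. The `readings` frame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    'key': ['a', 'b', np.nan, np.nan],
    'val': [1, 2, 3, 4],
})

# Default behaviour: rows whose key is NaN are left out
default_sum = readings.groupby('key')['val'].sum()

# dropna=False keeps them, gathered under a single NaN key
with_na_sum = readings.groupby('key', dropna=False)['val'].sum()

print(default_sum)
print(with_na_sum)
```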

##### 7. Practical Examples of GroupBy Operations

Let's look at some practical examples of GroupBy operations:

In [None]:
# Create a DataFrame with sales data
sales = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=20),
    'product': np.random.choice(['A', 'B', 'C'], size=20),
    'region': np.random.choice(['North', 'South', 'East', 'West'], size=20),
    'sales': np.random.randint(100, 1000, size=20),
    'units': np.random.randint(1, 10, size=20)
})
sales.head()

In [None]:
# Group by product and calculate total sales and average units
product_summary = sales.groupby('product').agg({
    'sales': 'sum',
    'units': 'mean'
})
product_summary

In [None]:
# Group by region and product
region_product = sales.groupby(['region', 'product']).agg({
    'sales': ['sum', 'mean'],
    'units': ['sum', 'mean']
})
region_product
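
Passing lists of functions, as above, produces MultiIndex columns. If flat, self-describing column names are preferred, named aggregation is one alternative. A minimal sketch on made-up data follows (the `orders` frame and the output names `total_sales` and `mean_units` are chosen here for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    'product': ['A', 'B', 'A', 'B'],
    'sales': [100, 200, 300, 400],
    'units': [1, 2, 3, 4],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = orders.groupby('product').agg(
    total_sales=('sales', 'sum'),
    mean_units=('units', 'mean'),
)

print(summary)
```

The result has ordinary single-level columns, which tends to be easier to merge or plot than the two-level layout.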

In [None]:
# Group by calendar month and total the sales
# (the 20 sample dates all fall in January 2023, so there is a single group here)
sales['month'] = sales['date'].dt.month
monthly_sales = sales.groupby('month')['sales'].sum()
monthly_sales
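
One caveat with `dt.month` is that it returns only the month number, so the same month in different years ends up in one group. When the data spans more than a year, grouping by a monthly period is one way around this. A small sketch with made-up dates (the `log` frame is hypothetical):

```python
import pandas as pd

log = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-15', '2023-01-10', '2023-02-05']),
    'sales': [100, 200, 300],
})

# Month number only: January 2022 and January 2023 are merged
by_month_number = log.groupby(log['date'].dt.month)['sales'].sum()

# Monthly periods keep the year, so each calendar month stays separate
by_period = log.groupby(log['date'].dt.to_period('M'))['sales'].sum()

print(by_month_number)
print(by_period)
```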

In [None]:
# Visualize monthly sales
monthly_sales.plot(kind='bar', figsize=(10, 6), title='Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.xticks(rotation=0)

##### Summary

In this notebook, we've explored advanced GroupBy operations in pandas, including:

1. GroupBy tab completion and common methods
2. Working with MultiIndex in GroupBy operations
   - Grouping by level
   - Grouping with multiple levels
3. Working with decimal and object columns in GroupBy
4. Handling of categorical values in GroupBy with the `observed` parameter
5. Grouping with ordered factors
6. NA and NaT group handling
7. Practical examples of GroupBy operations

These advanced GroupBy techniques provide powerful tools for data analysis and aggregation in pandas.