#### Pandas Part 88: GroupBy Operations

This notebook explores GroupBy operations in pandas, which allow you to split data into groups, apply functions to each group, and combine the results.
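
The split, apply, and combine steps can be written out by hand to see what `groupby()` does in one call. The sketch below uses a hypothetical four-row frame (`toy`, with made-up column names `key` and `value`) that is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# Split: one sub-frame per distinct key
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}
# Apply: reduce each piece to a single number
sums = {k: piece['value'].sum() for k, piece in pieces.items()}
# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='value').sort_index()
manual.index.name = 'key'

# groupby performs the same three steps in one call
via_groupby = toy.groupby('key')['value'].sum()

print(manual)
print(via_groupby)
```

Both Series hold one total per key; the manual version is only meant to make the three steps visible.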

In [1]:
import pandas as pd
import numpy as np

##### 1. Creating a Sample DataFrame

In [2]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8),
    'E': np.random.randint(0, 10, 8)
})

print("Sample DataFrame:")
print(df)

Sample DataFrame:
     A      B         C         D  E
0  foo    one  0.764499 -0.108719  2
1  bar    one  0.721028  1.359157  5
2  foo    two  2.456219  1.562484  9
3  bar  three -0.628960  1.172046  9
4  foo    two -2.786866  0.589397  2
5  bar    two -1.112786  0.496375  7
6  foo    one  0.594345 -0.411109  0
7  foo  three  1.531044  1.544284  3


##### 2. Basic GroupBy Operations

The `groupby()` method splits the rows into groups that share the same value of one or more keys. It returns a GroupBy object that you can iterate over, inspect, or aggregate.

In [3]:
# Group by column 'A'
grouped = df.groupby('A')

# Get the groups
print("Groups:")
for name, group in grouped:
    print(f"\nGroup name: {name}")
    print(group)

Groups:

Group name: bar
     A      B         C         D  E
1  bar    one  0.721028  1.359157  5
3  bar  three -0.628960  1.172046  9
5  bar    two -1.112786  0.496375  7

Group name: foo
     A      B         C         D  E
0  foo    one  0.764499 -0.108719  2
2  foo    two  2.456219  1.562484  9
4  foo    two -2.786866  0.589397  2
6  foo    one  0.594345 -0.411109  0
7  foo  three  1.531044  1.544284  3


In [4]:
# Get the group dictionary
print("Group dictionary:")
print(grouped.groups)

# Get a specific group
print("\nGroup 'foo':")
print(grouped.get_group('foo'))

Group dictionary:
{'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}

Group 'foo':
     A      B         C         D  E
0  foo    one  0.764499 -0.108719  2
2  foo    two  2.456219  1.562484  9
4  foo    two -2.786866  0.589397  2
6  foo    one  0.594345 -0.411109  0
7  foo  three  1.531044  1.544284  3
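
A GroupBy object can also be inspected without looping over it. The following minimal sketch assumes a hypothetical five-row frame (`toy`) rather than the random sample above, so the values are fixed.

```python
import pandas as pd

# Hypothetical toy frame, smaller than the notebook's sample data
toy = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0],
})
grouped_toy = toy.groupby('A')

# Number of distinct keys
n_groups = grouped_toy.ngroups
# Mapping from each key to the row labels that belong to it
group_labels = {k: list(v) for k, v in grouped_toy.groups.items()}
# The rows whose key is 'foo', as an ordinary DataFrame
foo_rows = grouped_toy.get_group('foo')

print(n_groups)
print(group_labels)
print(foo_rows)
```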


##### 3. Aggregation Operations

Aggregation operations compute a summary statistic for each group.

In [7]:
import pandas as pd
import numpy as np

# Create a DataFrame with mixed data types
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
})
print("Original DataFrame:")
print(df)

# Group by column A
grouped = df.groupby('A')

# Compute the mean of every non-key column per group (B and C are both numeric here)
print("\nMean of each group (numeric columns only):")
print(grouped.mean())

# Compute the sum of every non-key column per group
print("\nSum of each group (numeric columns only):")
print(grouped.sum())

# Compute the size of each group (works for all data types)
print("\nSize of each group:")
print(grouped.size())

# Alternative: use agg with specific functions for each column
print("\nCustom aggregation for each column:")
result = grouped.agg({
    'B': ['mean', 'sum'],
    'C': ['mean', 'sum']
})
print(result)

# For non-numeric columns, we can use different aggregation methods
print("\nAggregation for non-numeric columns:")
print(grouped['A'].agg(['first', 'count']))

Original DataFrame:
     A  B    C
0  foo  1  2.5
1  bar  2  3.5
2  foo  3  4.5
3  bar  4  5.5
4  foo  5  6.5
5  bar  6  7.5

Mean of each group (numeric columns only):
       B    C
A            
bar  4.0  5.5
foo  3.0  4.5

Sum of each group (numeric columns only):
      B     C
A            
bar  12  16.5
foo   9  13.5

Size of each group:
A
bar    3
foo    3
dtype: int64

Custom aggregation for each column:
       B        C      
    mean sum mean   sum
A                      
bar  4.0  12  5.5  16.5
foo  3.0   9  4.5  13.5

Aggregation for non-numeric columns:
    first  count
A               
bar   bar      3
foo   foo      3
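
The frame above has no non-numeric column other than the key, so `mean()` works without further options. When a string column is present, pandas 2.x raises a `TypeError` for `mean()` instead of skipping it. The sketch below shows two ways around that, using a hypothetical frame with an extra `label` column that does not appear in the notebook.

```python
import pandas as pd

# Hypothetical frame with a non-numeric, non-key column ('label')
mixed = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'label': ['w', 'x', 'y', 'z'],
    'B': [1, 2, 3, 4],
})

# Option 1: ask pandas to skip non-numeric columns
means_numeric = mixed.groupby('A').mean(numeric_only=True)
# Option 2: select the columns to aggregate explicitly
means_selected = mixed.groupby('A')[['B']].mean()

print(means_numeric)
print(means_selected)
```

Explicit column selection is usually the clearer choice, because it states which columns the result is expected to contain.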


In [8]:
# Compute multiple aggregations at once
print("Multiple aggregations:")
print(grouped.agg(['mean', 'sum', 'count', 'std']))

Multiple aggregations:
       B                   C                 
    mean sum count  std mean   sum count  std
A                                            
bar  4.0  12     3  2.0  5.5  16.5     3  2.0
foo  3.0   9     3  2.0  4.5  13.5     3  2.0


In [10]:
import pandas as pd
import numpy as np

# Create a DataFrame with multiple columns
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
    'D': [10, 20, 30, 40, 50, 60],
    'E': [100, 200, 300, 400, 500, 600]
})
print("Original DataFrame:")
print(df)

# Group by column A
grouped = df.groupby('A')

# Different aggregations for different columns
print("\nDifferent aggregations for different columns:")
print(grouped.agg({'C': 'sum', 'D': 'mean', 'E': ['min', 'max']}))

# The same specification can be stored in a variable and reused
agg_dict = {
    'C': 'sum',
    'D': 'mean',
    'E': ['min', 'max']
}
print("\nUsing dictionary for aggregation:")
print(grouped.agg(agg_dict))

# You can also use named aggregations (pandas >= 0.25.0)
try:
    print("\nUsing named aggregations:")
    result = grouped.agg(
        c_sum=pd.NamedAgg(column='C', aggfunc='sum'),
        d_mean=pd.NamedAgg(column='D', aggfunc='mean'),
        e_min=pd.NamedAgg(column='E', aggfunc='min'),
        e_max=pd.NamedAgg(column='E', aggfunc='max')
    )
    print(result)
except Exception as e:
    print(f"Named aggregation not supported in this pandas version: {e}")

Original DataFrame:
     A  B    C   D    E
0  foo  1  2.5  10  100
1  bar  2  3.5  20  200
2  foo  3  4.5  30  300
3  bar  4  5.5  40  400
4  foo  5  6.5  50  500
5  bar  6  7.5  60  600

Different aggregations for different columns:
        C     D    E     
      sum  mean  min  max
A                        
bar  16.5  40.0  200  600
foo  13.5  30.0  100  500

Using dictionary for aggregation:
        C     D    E     
      sum  mean  min  max
A                        
bar  16.5  40.0  200  600
foo  13.5  30.0  100  500

Using named aggregations:
     c_sum  d_mean  e_min  e_max
A                               
bar   16.5    40.0    200    600
foo   13.5    30.0    100    500


##### 4. Transformation Operations

Transformation operations return an object indexed like the input (one row per original row), with values computed group-wise. The grouping column itself is not part of the result.

In [11]:
# Standardize the data within each group
print("Standardized data within each group:")
print(grouped.transform(lambda x: (x - x.mean()) / x.std()))

Standardized data within each group:
     B    C    D    E
0 -1.0 -1.0 -1.0 -1.0
1 -1.0 -1.0 -1.0 -1.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0


In [12]:
# Fill NA values with the group mean
df_with_na = df.copy()
df_with_na.loc[1, 'C'] = np.nan
df_with_na.loc[3, 'D'] = np.nan
print("DataFrame with NA values:")
print(df_with_na)

print("\nFill NA values with group mean:")
print(df_with_na.groupby('A').transform(lambda x: x.fillna(x.mean())))

DataFrame with NA values:
     A  B    C     D    E
0  foo  1  2.5  10.0  100
1  bar  2  NaN  20.0  200
2  foo  3  4.5  30.0  300
3  bar  4  5.5   NaN  400
4  foo  5  6.5  50.0  500
5  bar  6  7.5  60.0  600

Fill NA values with group mean:
   B    C     D    E
0  1  2.5  10.0  100
1  2  6.5  20.0  200
2  3  4.5  30.0  300
3  4  5.5  40.0  400
4  5  6.5  50.0  500
5  6  7.5  60.0  600
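
A common use of `transform()` is to attach a per-group statistic to every row. The sketch below assumes a hypothetical four-row frame (`toy`) and passes the reduction name `'mean'` as a string, which broadcasts each group's mean back onto its rows.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 6.0],
})

# Each row receives the mean of the group it belongs to
toy['C_group_mean'] = toy.groupby('A')['C'].transform('mean')
# Deviation of each row from its own group's mean
toy['C_centered'] = toy['C'] - toy['C_group_mean']

print(toy)
```

Because the result lines up with the original index, it can be assigned straight back as a new column.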


##### 5. Filtration Operations

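Filtration operations discard entire groups that fail a condition: `filter()` returns the rows of every group for which the function returns `True`. In the cell below all values of `C` are positive, so both groups pass and every row is kept.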

In [13]:
# Filter groups where the mean of column 'C' is greater than 0
print("Groups where mean of 'C' > 0:")
print(df.groupby('A').filter(lambda x: x['C'].mean() > 0))

Groups where mean of 'C' > 0:
     A  B    C   D    E
0  foo  1  2.5  10  100
1  bar  2  3.5  20  200
2  foo  3  4.5  30  300
3  bar  4  5.5  40  400
4  foo  5  6.5  50  500
5  bar  6  7.5  60  600
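
To see a group actually being dropped, the threshold has to fall between the group means. The sketch below rebuilds columns `A` and `C` with the same values as above in a hypothetical frame (`toy`) and uses a threshold of 5, chosen here only for illustration.

```python
import pandas as pd

# Same 'A' and 'C' values as the frame above
toy = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'C': [2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
})

# Group means of C: bar = 5.5, foo = 4.5, so only 'bar' passes
kept = toy.groupby('A').filter(lambda g: g['C'].mean() > 5)

print(kept)
```

The surviving rows keep their original index labels and order.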


##### 6. The `apply()` Method

The `apply()` method calls a function on each group (passed as a DataFrame) and combines the returned pieces into a single result.

In [14]:
# Define a function to apply to each group
def top_n(group, n=2, column='C'):
    return group.sort_values(by=column, ascending=False).head(n)

# Apply the function to each group
print("Top 2 rows in each group by 'C' value:")
print(df.groupby('A').apply(top_n))

Top 2 rows in each group by 'C' value:
         A  B    C   D    E
A                          
bar 5  bar  6  7.5  60  600
    3  bar  4  5.5  40  400
foo 4  foo  5  6.5  50  500
    2  foo  3  4.5  30  300
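
Extra arguments given to `apply()` are forwarded to the function, and selecting the value columns first keeps the key column out of the frames the function receives. Some recent pandas releases emit a deprecation warning when `apply()` operates on the grouping column, which the explicit selection avoids. The sketch below assumes a hypothetical frame (`toy`) with the same `A`, `B`, and `C` values as above.

```python
import pandas as pd

# Hypothetical copy of columns A, B and C from the frame above
toy = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [2.5, 3.5, 4.5, 5.5, 6.5, 7.5],
})

def top_n(group, n=2, column='C'):
    # Return the n rows with the largest values in `column`
    return group.sort_values(by=column, ascending=False).head(n)

# Only B and C are passed to top_n; n=1 is forwarded as a keyword argument
top1 = toy.groupby('A')[['B', 'C']].apply(top_n, n=1)

print(top1)
```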



##### 7. Grouping by Multiple Columns

In [15]:
# Group by multiple columns
grouped_multi = df.groupby(['A', 'B'])

# Compute the mean of each group
print("Mean of each group (grouped by 'A' and 'B'):")
print(grouped_multi.mean())

# Get the groups
print("\nGroups:")
for name, group in grouped_multi:
    print(f"\nGroup name: {name}")
    print(group)

Mean of each group (grouped by 'A' and 'B'):
         C     D      E
A   B                  
bar 2  3.5  20.0  200.0
    4  5.5  40.0  400.0
    6  7.5  60.0  600.0
foo 1  2.5  10.0  100.0
    3  4.5  30.0  300.0
    5  6.5  50.0  500.0

Groups:

Group name: ('bar', np.int64(2))
     A  B    C   D    E
1  bar  2  3.5  20  200

Group name: ('bar', np.int64(4))
     A  B    C   D    E
3  bar  4  5.5  40  400

Group name: ('bar', np.int64(6))
     A  B    C   D    E
5  bar  6  7.5  60  600

Group name: ('foo', np.int64(1))
     A  B    C   D    E
0  foo  1  2.5  10  100

Group name: ('foo', np.int64(3))
     A  B    C   D    E
2  foo  3  4.5  30  300

Group name: ('foo', np.int64(5))
     A  B    C   D    E
4  foo  5  6.5  50  500
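
Grouping by two keys produces a result indexed by a MultiIndex with one level per key. The sketch below assumes a hypothetical frame (`toy`) in which the key combinations repeat across groups, and shows how to look up one combination and how to pivot the second key into columns.

```python
import pandas as pd

# Hypothetical toy data with two string keys
toy = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': ['one', 'two', 'one', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Series indexed by (A, B)
means = toy.groupby(['A', 'B'])['C'].mean()
# Look up a single key combination with a tuple
foo_two = means.loc[('foo', 'two')]
# Move the 'B' level into columns to get a wide table
wide = means.unstack('B')

print(means)
print(foo_two)
print(wide)
```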


##### 8. The `Grouper` Object

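The `Grouper` object describes a grouping key together with options such as the column to group on (`key`), an index level (`level`), or a time frequency (`freq`) for binning datetime values.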

In [20]:
import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Create a DataFrame with datetime index
dates = pd.date_range('2023-01-01', periods=10)
df_dates = pd.DataFrame({
    'A': np.random.randn(10),
    'B': np.random.randn(10),
    'C': np.random.choice(['X', 'Y', 'Z'], 10)
}, index=dates)
print("DataFrame with datetime index:")
print(df_dates)

# 1. Simple grouping by time frequency
print("\n1. Grouped by 2-day frequency:")
grouped_by_time = df_dates[['A', 'B']].groupby(pd.Grouper(freq='2D')).mean()
print(grouped_by_time)

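# 2. Add a Period column, then group by it. Note that to_period('3D') yields
#    one period per timestamp (each starting on its own date), not shared
#    3-day bins, so every row ends up in its own group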
df_dates['period'] = df_dates.index.to_period('3D')
print("\n2. Added period column:")
print(df_dates)

print("\nGrouped by period:")
grouped_by_period = df_dates.groupby('period')[['A', 'B']].mean()
print(grouped_by_period)

# 3. Group by categorical column only
print("\n3. Grouped by categorical column 'C':")
grouped_by_cat = df_dates.groupby('C')[['A', 'B']].mean()
print(grouped_by_cat)

# 4. For combined grouping, create a temporary DataFrame with reset index
#    (as in step 2, 'time_group' holds one period per day, not 3-day bins)
temp_df = df_dates.reset_index()
temp_df['time_group'] = temp_df['index'].dt.to_period('3D')
print("\n4. Temporary DataFrame with reset index and time group:")
print(temp_df.head())

print("\nGrouped by 'C' and time group:")
combined_group = temp_df.groupby(['C', 'time_group'])[['A', 'B']].mean()
print(combined_group)

DataFrame with datetime index:
                   A         B  C
2023-01-01  0.496714 -0.463418  Y
2023-01-02 -0.138264 -0.465730  Y
2023-01-03  0.647689  0.241962  Z
2023-01-04  1.523030 -1.913280  Y
2023-01-05 -0.234153 -1.724918  Z
2023-01-06 -0.234137 -0.562288  Z
2023-01-07  1.579213 -1.012831  X
2023-01-08  0.767435  0.314247  Z
2023-01-09 -0.469474 -0.908024  X
2023-01-10  0.542560 -1.412304  Z

1. Grouped by 2-day frequency:
                   A         B
2023-01-01  0.179225 -0.464574
2023-01-03  1.085359 -0.835659
2023-01-05 -0.234145 -1.143603
2023-01-07  1.173324 -0.349292
2023-01-09  0.036543 -1.160164

2. Added period column:
                   A         B  C      period
2023-01-01  0.496714 -0.463418  Y  2023-01-01
2023-01-02 -0.138264 -0.465730  Y  2023-01-02
2023-01-03  0.647689  0.241962  Z  2023-01-03
2023-01-04  1.523030 -1.913280  Y  2023-01-04
2023-01-05 -0.234153 -1.724918  Z  2023-01-05
2023-01-06 -0.234137 -0.562288  Z  2023-01-06
2023-01-07  1.579213 -1.012831
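
A `Grouper` can be combined with an ordinary column key in a single `groupby()` call, which bins the datetime index and groups by the column at the same time. The sketch below assumes a hypothetical six-day frame (`toy`) with fixed values so that the sums are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data on a daily DatetimeIndex
dates = pd.date_range('2023-01-01', periods=6)
toy = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'C': ['X', 'Y', 'X', 'X', 'Y', 'Y'],
}, index=dates)

# Group by column 'C' and by 3-day bins of the index in one step
combined = toy.groupby(['C', pd.Grouper(freq='3D')])['A'].sum()

print(combined)
```

Here the bins start on 2023-01-01 and 2023-01-04, so each value of `C` contributes at most two rows to the result.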

##### 9. Grouping by Index Levels

You can group by index levels in a MultiIndex DataFrame.

In [21]:
# Create a MultiIndex DataFrame
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
index = pd.MultiIndex.from_arrays(arrays, names=('first', 'second'))
df_multi = pd.DataFrame({'C': np.random.randn(4), 'D': np.random.randn(4)}, index=index)
print("MultiIndex DataFrame:")
print(df_multi)

# Group by level 'first'
print("\nGrouped by level 'first':")
print(df_multi.groupby(level='first').mean())

# Group by level 'second'
print("\nGrouped by level 'second':")
print(df_multi.groupby(level='second').mean())

# Group by both levels
print("\nGrouped by both levels:")
print(df_multi.groupby(level=['first', 'second']).mean())

MultiIndex DataFrame:
                     C         D
first second                    
A     1      -0.525123  0.779193
      2       1.912771 -1.101098
B     1      -2.026720  1.130228
      2       1.119424  0.373119

Grouped by level 'first':
              C         D
first                    
A      0.693824 -0.160953
B     -0.453648  0.751674

Grouped by level 'second':
               C         D
second                    
1      -1.275921  0.954710
2       1.516097 -0.363989

Grouped by both levels:
                     C         D
first second                    
A     1      -0.525123  0.779193
      2       1.912771 -1.101098
B     1      -2.026720  1.130228
      2       1.119424  0.373119
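
Index levels can be referred to by position as well as by name, and a level name can also be passed to `groupby()` in the same way as a column name. The sketch below assumes a hypothetical frame (`toy`) with fixed values in place of the random ones above.

```python
import pandas as pd

# Hypothetical MultiIndex frame with fixed values
index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]], names=('first', 'second'))
toy = pd.DataFrame({'C': [1.0, 2.0, 3.0, 4.0]}, index=index)

by_position = toy.groupby(level=0).sum()      # level 0 is 'first'
by_name = toy.groupby(level='first').sum()    # same level, by name
by_key = toy.groupby('first').sum()           # level name used as a key

print(by_position)
```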


##### 10. Other GroupBy Methods

In [22]:
# Create a sample DataFrame
df2 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4, 5, 6],
    'C': [2.0, 5.0, 8.0, 1.0, 2.0, 9.0]
})
print("Sample DataFrame:")
print(df2)

# Group by column 'A'
grouped = df2.groupby('A')

# Compute the cumulative sum within each group
print("\nCumulative sum within each group:")
print(grouped.cumsum())

# Compute the difference between consecutive rows within each group
print("\nDifference between consecutive rows within each group:")
print(grouped.diff())

# Compute the rank within each group
print("\nRank within each group:")
print(grouped.rank())

# Compute the first and last non-null value of each column within each group
print("\nFirst row of each group:")
print(grouped.first())
print("\nLast row of each group:")
print(grouped.last())

Sample DataFrame:
     A  B    C
0  foo  1  2.0
1  bar  2  5.0
2  foo  3  8.0
3  bar  4  1.0
4  foo  5  2.0
5  bar  6  9.0

Cumulative sum within each group:
    B     C
0   1   2.0
1   2   5.0
2   4  10.0
3   6   6.0
4   9  12.0
5  12  15.0

Difference between consecutive rows within each group:
     B    C
0  NaN  NaN
1  NaN  NaN
2  2.0  6.0
3  2.0 -4.0
4  2.0 -6.0
5  2.0  8.0

Rank within each group:
     B    C
0  1.0  1.5
1  1.0  2.0
2  2.0  3.0
3  2.0  1.0
4  3.0  1.5
5  3.0  3.0

First row of each group:
     B    C
A          
bar  2  5.0
foo  1  2.0

Last row of each group:
     B    C
A          
bar  6  9.0
foo  5  2.0
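
`first()` and `last()` return the first and last non-null value of each column, which is not always the same as the first or last row. The sketch below assumes a hypothetical frame (`toy`) whose first `foo` row has a missing value, and compares `first()` with `head(1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first 'foo' row has a missing B
toy = pd.DataFrame({
    'A': ['foo', 'foo', 'bar'],
    'B': [np.nan, 2.0, 3.0],
})
g = toy.groupby('A')

# First non-null value per column, indexed by group key
first_non_null = g.first()
# Literal first row of each group, with the original index kept
first_row = g.head(1)

print(first_non_null)
print(first_row)
```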


##### 11. GroupBy with Custom Aggregation Functions

In [23]:
# Define custom aggregation functions
def range_func(x):
    return x.max() - x.min()

def custom_percentile(x):
    return np.percentile(x, q=75)

# Apply custom aggregation functions
print("Custom aggregation functions:")
print(df2.groupby('A').agg({
    'B': ['sum', 'mean', range_func],
    'C': ['min', 'max', custom_percentile]
}))

Custom aggregation functions:
      B                    C                       
    sum mean range_func  min  max custom_percentile
A                                                  
bar  12  4.0          4  1.0  9.0               7.0
foo   9  3.0          4  2.0  8.0               5.0
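
For percentiles specifically, the built-in `quantile()` method gives the same numbers as the custom function above on data without missing values. The sketch below rebuilds columns `A` and `C` of `df2` in a hypothetical frame (`toy`) and compares the two approaches.

```python
import numpy as np
import pandas as pd

# Same 'A' and 'C' values as df2 above
toy = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'C': [2.0, 5.0, 8.0, 1.0, 2.0, 9.0],
})

def custom_percentile(x):
    # 75th percentile with NumPy's default linear interpolation
    return np.percentile(x, q=75)

via_numpy = toy.groupby('A')['C'].agg(custom_percentile)
via_builtin = toy.groupby('A')['C'].quantile(0.75)

print(via_numpy)
print(via_builtin)
```

One difference to keep in mind: `np.percentile` returns NaN when a group contains NaN, whereas `quantile()` skips missing values.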


##### 12. Named Aggregation

In [24]:
# Named aggregation: each keyword names an output column,
# and its tuple gives (input column, aggregation)
print("Named aggregation:")
print(df2.groupby('A').agg(
    b_sum=('B', 'sum'),
    b_mean=('B', 'mean'),
    c_min=('C', 'min'),
    c_max=('C', 'max')
))

Named aggregation:
     b_sum  b_mean  c_min  c_max
A                               
bar     12     4.0    1.0    9.0
foo      9     3.0    2.0    8.0


##### 13. The `pipe()` Method

The `pipe()` method allows you to chain together functions that expect a GroupBy object.

In [25]:
# Define a function that expects a GroupBy object
def grouped_mean_plus_1(grouped):
    return grouped.mean() + 1

# Use pipe to chain operations
print("Using pipe to chain operations:")
print(df2.groupby('A').pipe(grouped_mean_plus_1))

Using pipe to chain operations:
       B    C
A            
bar  5.0  6.0
foo  4.0  5.0
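
`pipe()` also forwards extra arguments to the function, which keeps parameterized steps readable in a chain. The sketch below assumes a hypothetical helper (`mean_plus`) and a hypothetical four-row frame (`toy`); neither is part of the notebook above.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4],
})

def mean_plus(grouped, offset):
    # Hypothetical helper: group means shifted by a constant
    return grouped.mean() + offset

# pipe passes the GroupBy object as the first argument
piped = toy.groupby('A').pipe(mean_plus, offset=10)
# Equivalent direct call
direct = mean_plus(toy.groupby('A'), offset=10)

print(piped)
```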
