# Group By

"Group by" refers to a process that involves one or more of the following steps:

- **Splitting** the data into groups based on some criteria.
- **Applying** a function to each group independently.
- **Combining** the results into a data structure.

The applying step can be one of the following:

- **Aggregation**: compute a summary statistic (or statistics) for each group.
- **Transformation**: perform some group-specific computations and return a like-indexed object, e.g. standardize data within a group.
- **Filtration**: discard some groups, according to a group-wise computation that evaluates True or False.

Note: it is also possible to apply a custom function using an `apply` method.
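
As a quick orientation, the three kinds of "apply" steps listed above can be sketched on a tiny frame. This is only a minimal sketch: the `toy` DataFrame and its column names are made up for illustration and are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({"key": ["a", "a", "b", "b", "b"], "value": [1, 2, 3, 4, 5]})
grouped_toy = toy.groupby("key")["value"]

# Aggregation: one summary value per group.
aggregated = grouped_toy.sum()

# Transformation: a like-indexed result (here, value minus its group mean).
transformed = toy["value"] - grouped_toy.transform("mean")

# Filtration: keep only the groups that satisfy a group-wise condition.
filtered = toy.groupby("key").filter(lambda g: len(g) > 2)

print(aggregated)
print(transformed)
print(filtered)
```

Each of these three operations is covered in more detail in its own section.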

In [60]:
import pandas as pd
import numpy as np
from IPython.display import display_html, display
import time

np.random.seed(0)

In [61]:
## Handy functions

def display_side_by_side(*args):
    # Render every DataFrame as HTML and show the tables inline, next to each other
    html_str = ''
    for df in args:
        html_str += df.to_html()
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)


def display_group_by(grouped):
    for key, df in grouped:
        print(key)
        time.sleep(0.1)
        display(df)

## Splitting an object into groups

The first step uses the `groupby()` method to split a DataFrame into groups.

`groupby(by, axis, level)`

- `by` is the key: a label or list of labels (index level names or column names) that determine the groups. 
It can also be a function or even a Series.
- `axis` indicates whether to **split along** rows (0) or columns (1). The default is 0.
- `level` allows us to group by index (or multi-index) levels. Don't use both
`level` and `by`.

**Note** that `groupby()` returns a `DataFrameGroupBy` object, which has no tabular
display of its own because **no splitting occurs until it's needed**. However, 
the handy function `display_group_by()` defined above helps us visualize the groups.

**Note** if the key is a string that matches both a column name and an index level name, a `ValueError` will be raised.
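
The ambiguity mentioned in the last note can be reproduced with a small sketch. The `ambiguous` frame below is a made-up example in which `"key"` is both a column and the index name; using `level` is one possible way around the error, not the only one.

```python
import pandas as pd

# Hypothetical frame in which "key" is both a column and the index name.
ambiguous = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
ambiguous.index = pd.Index(["x", "y", "x"], name="key")

try:
    ambiguous.groupby("key").sum()
    error_message = None
except ValueError as exc:
    # pandas cannot tell whether "key" means the column or the index level
    error_message = str(exc)

print(error_message)

# One way out: name the index level explicitly with `level`.
by_level = ambiguous.groupby(level="key")["value"].sum()
print(by_level)
```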

In [62]:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)

df

Unnamed: 0,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0
lion,mammal,Carnivora,80.2
monkey,mammal,Primates,
leopard,mammal,Carnivora,58.0


In [63]:
grouped = df.groupby("class")
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff0f4b38d00>

### `.groups`, `.get_group()`, and iterating over groupby object

There is an attribute and a method that help us inspect the groups:

- `.groups` returns a dict whose keys are the group identifiers and whose values are the axis labels belonging to each group.
- `.get_group("<identifier of group>")` allows us to retrieve the DataFrame of the respective group.

However, we are going to use our handy function, which displays each group by iterating over the 
GroupBy object.

In [64]:
grouped.groups

{'bird': ['falcon', 'parrot'], 'mammal': ['lion', 'monkey', 'leopard']}

In [65]:
grouped.get_group("bird")

Unnamed: 0,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0


In [66]:
## Handy function

# NOTE: you have three options to display the groups
# 1. list(group_by_object)
# 2. grouped.groups + grouped.get_group(key)
# 3. simply iterate over the groupby object

def display_group_by(grouped):
    for key, df in grouped:
        print(key)
        time.sleep(0.1)
        display(df)

In [67]:
display_group_by(grouped)

bird


Unnamed: 0,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0


mammal


Unnamed: 0,class,order,max_speed
lion,mammal,Carnivora,80.2
monkey,mammal,Primates,
leopard,mammal,Carnivora,58.0


### Different ways to group

According to the groupby function parameters, these are the possible ways 
to group by:

**Splitting rows**
1. using column name with `by`
2. using index name (or position level) with `level` (or `by` with the index name)
3. using both indexes and/or column names with `by`
4. using functions with `by`

**Splitting cols**
1. using functions with `by` + `axis = 1`
2. treating the columns (as index), use index name (or position level) with `level` + `axis = 1`


**Note** There are other ways to set `by`, but these are the most important
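
As one illustration of those other ways, `by` can also be a dict (or a Series) that maps index labels to group names. The sketch below uses a small made-up `animals` frame and a hypothetical `label_to_class` mapping.

```python
import pandas as pd

# Hypothetical data: the index labels are mapped to groups by a dict.
animals = pd.DataFrame(
    {"max_speed": [389.0, 24.0, 80.2, 58.0]},
    index=["falcon", "parrot", "lion", "leopard"],
)

label_to_class = {
    "falcon": "bird",
    "parrot": "bird",
    "lion": "mammal",
    "leopard": "mammal",
}

# A dict maps each index label to its group name.
by_dict = animals.groupby(label_to_class)["max_speed"].max()

# A Series aligned with the index works the same way.
by_series = animals.groupby(pd.Series(label_to_class))["max_speed"].max()

print(by_dict)
```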

In [68]:
index = pd.MultiIndex.from_product([["x","y"], [1, 2, 3]],
                                   names=["letter", "number"])


df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "three", "two", "two"],
        "C": np.random.randn(6),
        "D": np.random.randn(6),
    },
    index = index
)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,1,foo,one,1.764052,0.950088
x,2,bar,one,0.400157,-0.151357
x,3,foo,two,0.978738,-0.103219
y,1,bar,three,2.240893,0.410599
y,2,foo,two,1.867558,0.144044
y,3,bar,two,-0.977278,1.454274


In [69]:
# SPLITTING ROWS
# 1. group by column labels
display_group_by(df.groupby(["A", "B"]))

('bar', 'one')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,2,bar,one,0.400157,-0.151357


('bar', 'three')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
y,1,bar,three,2.240893,0.410599


('bar', 'two')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
y,3,bar,two,-0.977278,1.454274


('foo', 'one')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,1,foo,one,1.764052,0.950088


('foo', 'two')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,3,foo,two,0.978738,-0.103219
y,2,foo,two,1.867558,0.144044


In [70]:
# Note: this doesn't display anything because it tries to split along the columns,
# which is not possible with these keys
display_group_by(df.groupby(["A", "B"], axis="columns"))

In [71]:
# SPLITTING ROWS
# 2. group by index or multi-index
display_group_by(df.groupby(level=0))

x


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,1,foo,one,1.764052,0.950088
x,2,bar,one,0.400157,-0.151357
x,3,foo,two,0.978738,-0.103219


y


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
y,1,bar,three,2.240893,0.410599
y,2,foo,two,1.867558,0.144044
y,3,bar,two,-0.977278,1.454274


In [72]:
# you can also use `by` with the index name (or level = 'letter')
# NOTE: `by` only accepts name identifiers, not level positions
display_group_by(df.groupby(by = "letter"))

x


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,1,foo,one,1.764052,0.950088
x,2,bar,one,0.400157,-0.151357
x,3,foo,two,0.978738,-0.103219


y


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
y,1,bar,three,2.240893,0.410599
y,2,foo,two,1.867558,0.144044
y,3,bar,two,-0.977278,1.454274


In [73]:
# SPLITTING ROWS
#3. group by index (or multi-index) and columns

display_group_by(df.groupby(by = ["letter", "A"]))

('x', 'bar')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,2,bar,one,0.400157,-0.151357


('x', 'foo')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,1,foo,one,1.764052,0.950088
x,3,foo,two,0.978738,-0.103219


('y', 'bar')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
y,1,bar,three,2.240893,0.410599
y,3,bar,two,-0.977278,1.454274


('y', 'foo')


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
y,2,foo,two,1.867558,0.144044


In [74]:
# SPLITTING ROWS
# 4. group by function
# NOTE: the function is called on each index of the axis = 0
def is_even(index):
    letter, number = index
    if number % 2 == 0:
        return 'even'
    else:
        return 'odd'

grouped = df.groupby(is_even)

display_group_by(grouped)

even


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,2,bar,one,0.400157,-0.151357
y,2,foo,two,1.867558,0.144044


odd


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,1,foo,one,1.764052,0.950088
x,3,foo,two,0.978738,-0.103219
y,1,bar,three,2.240893,0.410599
y,3,bar,two,-0.977278,1.454274


In [76]:
# SPLITTING COLS
# 1. group by function
# NOTE: the function is called on each index of the axis = 1
# NOTE: it makes sense here to split along the columns (axis = 1)
def get_letter_type(column_index):
    if column_index.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

grouped = df.groupby(get_letter_type, axis=1)
display_group_by(grouped)

consonant


Unnamed: 0_level_0,Unnamed: 1_level_0,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,1,one,1.764052,0.950088
x,2,one,0.400157,-0.151357
x,3,two,0.978738,-0.103219
y,1,three,2.240893,0.410599
y,2,two,1.867558,0.144044
y,3,two,-0.977278,1.454274


vowel


Unnamed: 0_level_0,Unnamed: 1_level_0,A
letter,number,Unnamed: 2_level_1
x,1,foo
x,2,bar
x,3,foo
y,1,bar
y,2,foo
y,3,bar


In [118]:
# SPLITTING COLS
# 2. Treating cols as index + axis = 1

columns = pd.MultiIndex.from_product([["A", "B"], ["cat", "dog"]]
                                     , names=["letter", "animal"])

index = pd.MultiIndex.from_product([["bar", "baz", "foo", "qux"], ["one", "two"] ])

df = pd.DataFrame(
    data = np.random.randint(20, size=(8,4)),
    index= index,
    columns= columns
)

grouped = df.groupby(level="animal", axis=1)
display_group_by(grouped)


cat


Unnamed: 0_level_0,letter,A,B
Unnamed: 0_level_1,animal,cat,cat
bar,one,14,3
bar,two,13,17
baz,one,9,0
baz,two,0,18
foo,one,2,3
foo,two,10,16
qux,one,9,10
qux,two,11,2


dog


Unnamed: 0_level_0,letter,A,B
Unnamed: 0_level_1,animal,dog,dog
bar,one,15,15
bar,two,16,5
baz,one,3,5
baz,two,17,4
foo,one,16,2
foo,two,13,7
qux,one,0,18
qux,two,2,3


### Other Functionalities

There are some practical tricks we may need when using `groupby()`:

1. **Selecting specific columns** using `df.groupby(["A"])["C"]` (similar to getting a column from a DataFrame)
2. **Keeping the NA group keys** using `groupby(dropna = False)`

**Note:** `df.groupby(["A"])["C"]` is syntactic sugar for `df["C"].groupby(df["A"])`
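
The equivalence stated in the note can be checked with a short sketch. The `sales` frame is made up for this check; both spellings should produce the same aggregated Series.

```python
import pandas as pd

# Hypothetical data for comparing the two spellings.
sales = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})

via_frame = sales.groupby(["A"])["C"].sum()         # select the column after grouping
via_series = sales["C"].groupby(sales["A"]).sum()   # group the column by another Series

print(via_frame)
print(via_frame.equals(via_series))
```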

In [77]:
index = pd.MultiIndex.from_product([["x","y"], [1, 2, 3]],
                                   names=["letter", "number"])


df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", np.nan, np.nan],
        "B": ["one", "one", "two", "three", "two", "two"],
        "C": np.random.randn(6),
        "D": np.random.randn(6),
        "E" : ["one", "one", "two", "three", "two", "two"]
    },
    index = index
)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D,E
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
x,1,foo,one,0.761038,0.313068,one
x,2,bar,one,0.121675,-0.854096,one
x,3,foo,two,0.443863,-2.55299,two
y,1,bar,three,0.333674,0.653619,three
y,2,,two,1.494079,0.864436,two
y,3,,two,-0.205158,-0.742165,two


In [78]:
# Selecting specific columns
display_group_by(df.groupby(["A"])[["A","B"]])

bar


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1
x,2,bar,one
y,1,bar,three


foo


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1
x,1,foo,one
x,3,foo,two


In [79]:
# Keeping nan group keys
display_group_by(df.groupby(["A"] , dropna=False)[["A","B"]])

bar


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1
x,2,bar,one
y,1,bar,three


foo


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1
x,1,foo,one
x,3,foo,two


nan


Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1
y,2,,two
y,3,,two


## Applying a function


### Aggregation

Aggregation is a GroupBy operation that reduces each column of each group to
a scalar.

Aggregation places the group keys in the index. If you want the group keys 
as columns, you can use `as_index = False`, but the **original index will be lost**.

There are several built-in aggregation methods. Some of the most important are:

- `count()` Compute the number of non-NA values in the groups
- `max()` Compute the maximum value in each group
- `mean()` Compute the mean of each group
- `median()` Compute the median of each group
- `min()` Compute the minimum value in each group
- `nunique()` Compute the number of unique values in each group
- `prod()` Compute the product of the values in each group
- `quantile()` Compute a given quantile of the values in each group
- `sem()` Compute the standard error of the mean of the values in each group
- `size()` Compute the number of values in each group
- `std()` Compute the standard deviation of the values in each group
- `sum()` Compute the sum of the values in each group
- `var()` Compute the variance of the values in each group

**NOTE:** you can also use `describe()` to get a statistical summary of each group.
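
Since `count()` and `size()` are easy to confuse, here is a minimal sketch of the difference. The `scores` frame is made up: `count()` skips NA values, while `size()` counts every row of the group.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
scores = pd.DataFrame(
    {"team": ["a", "a", "b", "b", "b"], "score": [1.0, np.nan, 3.0, np.nan, np.nan]}
)
grouped_scores = scores.groupby("team")["score"]

non_na_counts = grouped_scores.count()  # NaN values are skipped
group_sizes = grouped_scores.size()     # every row is counted

print(non_na_counts)
print(group_sizes)
```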

In [80]:
index = pd.MultiIndex.from_product([["x","y"], [1, 2, 3]],
                                   names=["letter", "number"])


df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", np.nan, np.nan],
        "B": ["one", "one", "two", "three", "two", "two"],
        "C": np.random.randint(10, size = 6),
        "D": np.random.randint(10, size = 6),
    },
    index = index
)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,1,foo,one,7,5
x,2,bar,one,2,6
x,3,foo,two,0,8
y,1,bar,three,0,4
y,2,,two,4,1
y,3,,two,5,4


In [81]:
df.groupby(["A","B"]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,2,6
bar,three,0,4
foo,one,7,5
foo,two,0,8


In [82]:
# NOTE: you could use `.reset_index()` to get the same behavior, but it
# will make an extra copy.
df.groupby(["A","B"], as_index = False).sum()

Unnamed: 0,A,B,C,D
0,bar,one,2,6
1,bar,three,0,4
2,foo,one,7,5
3,foo,two,0,8


In [83]:
df.groupby(["A","B"]).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,C,C,C,C,C,C,C,D,D,D,D,D,D,D,D
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
A,B,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2
bar,one,1.0,2.0,,2.0,2.0,2.0,2.0,2.0,1.0,6.0,,6.0,6.0,6.0,6.0,6.0
bar,three,1.0,0.0,,0.0,0.0,0.0,0.0,0.0,1.0,4.0,,4.0,4.0,4.0,4.0,4.0
foo,one,1.0,7.0,,7.0,7.0,7.0,7.0,7.0,1.0,5.0,,5.0,5.0,5.0,5.0,5.0
foo,two,1.0,0.0,,0.0,0.0,0.0,0.0,0.0,1.0,8.0,,8.0,8.0,8.0,8.0,8.0


#### `aggregate()`

`aggregate()` is a general method for applying an aggregation function (built-in or custom). It 
can be called as `aggregate()` or with the shorthand `agg()`.

It accepts different types of inputs:

- any built-in function (BF)
- a user-defined function (UDF)
- multiple functions (BF or UDF) at once for all columns
- multiple functions (BF or UDF) at once for specific columns

**NOTE:** User-defined functions are often less performant than pandas built-in
functions.

In [84]:
index = pd.MultiIndex.from_product([["x","y"], [1, 2, 3]],
                                   names=["letter", "number"])


df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", np.nan, np.nan],
        "B": ["one", "one", "two", "three", "two", "two"],
        "C": np.random.randint(10, size = 6),
        "D": np.random.randint(10, size = 6),
    },
    index = index
)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,1,foo,one,9,9
x,2,bar,one,8,3
x,3,foo,two,1,6
y,1,bar,three,1,7
y,2,,two,7,2
y,3,,two,9,0


In [85]:
# 1. built-in functions

df.groupby(["A","B"]).agg("sum")

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,8,3
bar,three,1,7
foo,one,9,9
foo,two,1,6


In [86]:
# 2. user defined function
# NOTE: x will be the column
df.groupby(["A","B"]).agg(lambda x: x.sum())

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,8,3
bar,three,1,7
foo,one,9,9
foo,two,1,6


In [87]:
df.groupby(["A"]).agg(lambda x: set(x))

Unnamed: 0_level_0,B,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,"{three, one}","{8, 1}","{3, 7}"
foo,"{two, one}","{9, 1}","{9, 6}"


In [88]:
# 3. multiple functions at once to all columns

grouped = df.groupby(["A"])

# with built-in functions
grouped[["C", "D"]].agg(["sum", "mean", "std"])

Unnamed: 0_level_0,C,C,C,D,D,D
Unnamed: 0_level_1,sum,mean,std,sum,mean,std
A,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
bar,9,4.5,4.949747,10,5.0,2.828427
foo,10,5.0,5.656854,15,7.5,2.12132


In [89]:
# with user defined functions
grouped[["C", "D"]].agg([lambda x: x.sum() , "mean", lambda x: set(x)])

Unnamed: 0_level_0,C,C,C,D,D,D
Unnamed: 0_level_1,<lambda_0>,mean,<lambda_1>,<lambda_0>,mean,<lambda_1>
A,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
bar,9,4.5,"{8, 1}",10,5.0,"{3, 7}"
foo,10,5.0,"{9, 1}",15,7.5,"{9, 6}"


In [90]:
# using a chained operation you can rename it
(
    grouped[["C", "D"]]
    .agg([lambda x: x.sum() , "mean", lambda x: set(x)])
    .rename(columns = {"<lambda_0>": "sum", "<lambda_1>": "set"})
)

Unnamed: 0_level_0,C,C,C,D,D,D
Unnamed: 0_level_1,sum,mean,set,sum,mean,set
A,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
bar,9,4.5,"{8, 1}",10,5.0,"{3, 7}"
foo,10,5.0,"{9, 1}",15,7.5,"{9, 6}"


In [91]:
# 4. multiple functions all at once to specific columns
df.groupby(["A"]).agg({
    "B": lambda x: set(x),
    "C": "mean",
    "D": "sum"
})


Unnamed: 0_level_0,B,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,"{three, one}",4.5,10
foo,"{two, one}",5.0,15


In [92]:
# you can also name the new aggregated columns using `NamedAgg` (a namedtuple)
# or a simple tuple
df.groupby(["A"]).agg(
    set_B = ("B", lambda x: set(x)),
    mean_C = ("C", "mean"),
    sum_D = ("D", "sum")    
)

Unnamed: 0_level_0,set_B,mean_C,sum_D
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,"{three, one}",4.5,10
foo,"{two, one}",5.0,15


In [93]:
df.groupby(["A"]).agg(
    set_B = pd.NamedAgg(column="B", aggfunc=lambda x: set(x)),
    mean_C = pd.NamedAgg(column="C", aggfunc="mean"),
    sum_D = pd.NamedAgg(column="D", aggfunc="sum")    
)

# NOTE: for a Series, there is no need to specify the column.

Unnamed: 0_level_0,set_B,mean_C,sum_D
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,"{three, one}",4.5,10
foo,"{two, one}",5.0,15
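
For the Series case mentioned in the last note, a minimal sketch could look like this. The `orders` frame and the output names `min_C` and `max_C` are made up: once a single column is selected, each keyword maps an output name directly to a function.

```python
import pandas as pd

# Hypothetical data.
orders = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [9, 8, 1, 1]})

# On a grouped Series, named aggregation takes output_name=function pairs.
summary = orders.groupby("A")["C"].agg(min_C="min", max_C="max")

print(summary)
```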


### Transformation

A transformation is a GroupBy operation whose result keeps the same index (and therefore
the same number of rows) as the original DataFrame.

The index will be the same as the original, but the group keys are not 
included in the result (if they were columns). So, if you want to keep the group
keys, you could set them as the index.

**NOTE:** since a transformation doesn't include the groupings, `as_index` and `sort` 
have no effect.

Some important built-in methods are:


- `cumcount()` Compute the cumulative count within each group
- `cummax()` Compute the cumulative max within each group
- `cummin()` Compute the cumulative min within each group
- `cumprod()` Compute the cumulative product within each group
- `cumsum()` Compute the cumulative sum within each group
- `diff()` Compute the difference between adjacent values (actual - previous) in the column of each group
- `bfill()` Fill NA values with the next value in the column of each group
- `ffill()` Fill NA values with the previous value in the column of each group
- `fillna()` Fill NA values with a given value within each group
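
`cumcount()` is not used in the cells below, so here is a small sketch of it. The `letters` frame is made up: each row is numbered by its position inside its own group, starting at 0.

```python
import pandas as pd

# Hypothetical data with interleaved groups.
letters = pd.DataFrame({"letter": ["x", "y", "x", "y", "x"]}, index=[10, 20, 30, 40, 50])

# cumcount() returns a Series aligned with the original index.
letters["position"] = letters.groupby("letter").cumcount()

print(letters)
```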

In [94]:
df = pd.DataFrame(
    {
        "letter": ["x", "x", "x", "y", "y", "y"],
        "B": [1, 2, 3, 1, 2, 3],
        "C": [8, 1, np.nan, 7, np.nan, 9],
        "D": np.random.randint(10, size = 6),
    },
    index = [10, 20, 30, 40, 50, 60]
)
df

Unnamed: 0,letter,B,C,D
10,x,1,8.0,3
20,x,2,1.0,5
30,x,3,,9
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,6


In [95]:
# NOTE: the group keys are not included in the result
result = df.groupby(["letter"]).cumsum()
display_side_by_side(df, result)

Unnamed: 0,letter,B,C,D
10,x,1,8.0,3
20,x,2,1.0,5
30,x,3,,9
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,6

Unnamed: 0,B,C,D
10,1,8.0,3
20,3,9.0,8
30,6,,17
40,1,7.0,4
50,3,,8
60,6,16.0,14


In [96]:
# it does a difference (current_value - previous_value) in each column of each group
result = df.groupby(["letter"]).diff()
display_side_by_side(df, result)

Unnamed: 0,letter,B,C,D
10,x,1,8.0,3
20,x,2,1.0,5
30,x,3,,9
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,6

Unnamed: 0,B,C,D
10,,,
20,1.0,-7.0,2.0
30,1.0,,4.0
40,,,
50,1.0,,0.0
60,1.0,,2.0


In [97]:
# it fills the NaN with the previous value in the column of each group
result = df.groupby(["letter"]).ffill()
display_side_by_side(df, result)

Unnamed: 0,letter,B,C,D
10,x,1,8.0,3
20,x,2,1.0,5
30,x,3,,9
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,6

Unnamed: 0,B,C,D
10,1,8.0,3
20,2,1.0,5
30,3,1.0,9
40,1,7.0,4
50,2,7.0,4
60,3,9.0,6


In [98]:
# it fills the NaN with the next value in the column of each group
result = df.groupby(["letter"]).bfill()
display_side_by_side(df, result)

Unnamed: 0,letter,B,C,D
10,x,1,8.0,3
20,x,2,1.0,5
30,x,3,,9
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,6

Unnamed: 0,B,C,D
10,1,8.0,3
20,2,1.0,5
30,3,,9
40,1,7.0,4
50,2,9.0,4
60,3,9.0,6


In [99]:
# sometimes you will want to keep the result of a transformation in the original
# dataframe

df["cumsum_D"] = df.groupby("letter")[["D"]].cumsum()
df

Unnamed: 0,letter,B,C,D,cumsum_D
10,x,1,8.0,3,3
20,x,2,1.0,5,8
30,x,3,,9,17
40,y,1,7.0,4,4
50,y,2,,4,8
60,y,3,9.0,6,14


#### `transform()` method

It accepts string aliases of built-in functions as well as user-defined functions.
However, it doesn't accept multiple functions at once.

It can also accept string aliases of built-in **aggregation** functions, but the
result will be broadcast across the group to match the shape of the original DataFrame.

**NOTE:** there is another `transform()` method that you can apply directly to a 
DataFrame. That method accepts multiple functions at once using lists, as with the 
`aggregate()` method.

**NOTE:** Transforming by supplying `transform` with a UDF is often less performant than using the built-in methods on GroupBy.
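
To illustrate the `DataFrame.transform()` variant mentioned above (the one that is not a GroupBy method), here is a minimal sketch. The helpers `double` and `square` and the `numbers` frame are made up for the example.

```python
import pandas as pd

# Hypothetical element-wise helpers.
def double(x):
    return x * 2

def square(x):
    return x ** 2

numbers = pd.DataFrame({"B": [1, 2, 3]})

# DataFrame.transform accepts a list of functions; the result has one
# column per (original column, function name) pair.
both = numbers.transform([double, square])

print(both)
```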

In [100]:
df = pd.DataFrame(
    {
        "letter": ["x", "x", "x", "y", "y", "y"],
        "B": [1, 2, 3, 1, 2, 3],
        "C": [8, 1, np.nan, 7, np.nan, 9],
        "D": np.random.randint(10, size = 6),
    },
    index = [10, 20, 30, 40, 50, 60]
)

In [101]:
# 1. Accepts built-in functions
result = df.groupby("letter").transform("cumsum")
display_side_by_side(df, result)

Unnamed: 0,letter,B,C,D
10,x,1,8.0,4
20,x,2,1.0,4
30,x,3,,3
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,8

Unnamed: 0,B,C,D
10,1,8.0,4
20,3,9.0,8
30,6,,11
40,1,7.0,4
50,3,,8
60,6,16.0,16


In [102]:
# 2. Accepts user defined functions
result = df.groupby("letter").transform(lambda x: x.cumsum())
display_side_by_side(df, result)

Unnamed: 0,letter,B,C,D
10,x,1,8.0,4
20,x,2,1.0,4
30,x,3,,3
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,8

Unnamed: 0,B,C,D
10,1,8.0,4
20,3,9.0,8
30,6,,11
40,1,7.0,4
50,3,,8
60,6,16.0,16


In [103]:
# 3. Accepts aggregation built-in functions
result = df.groupby("letter").transform("mean")
display_side_by_side(df, result)

Unnamed: 0,letter,B,C,D
10,x,1,8.0,4
20,x,2,1.0,4
30,x,3,,3
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,8

Unnamed: 0,B,C,D
10,2.0,4.5,3.666667
20,2.0,4.5,3.666667
30,2.0,4.5,3.666667
40,2.0,8.0,5.333333
50,2.0,8.0,5.333333
60,2.0,8.0,5.333333


You can perform different complex operations using chained built-in functions.
For example: normalization, filling NAs with the mean.

In [104]:
# normalization per group

grouped = df.groupby("letter")

result = (df[["B","C", "D"]] - grouped.transform("mean")) / grouped.transform("std")

# NOTE: the shape of mean and std transformation are the same as df

display_side_by_side(grouped.transform("mean"), grouped.transform("std"), result)

# in this example dataframe result will have mean = 0 (or close) and std = 1 per group


Unnamed: 0,B,C,D
10,2.0,4.5,3.666667
20,2.0,4.5,3.666667
30,2.0,4.5,3.666667
40,2.0,8.0,5.333333
50,2.0,8.0,5.333333
60,2.0,8.0,5.333333

Unnamed: 0,B,C,D
10,1.0,4.949747,0.57735
20,1.0,4.949747,0.57735
30,1.0,4.949747,0.57735
40,1.0,1.414214,2.309401
50,1.0,1.414214,2.309401
60,1.0,1.414214,2.309401

Unnamed: 0,B,C,D
10,-1.0,0.707107,0.57735
20,0.0,-0.707107,0.57735
30,1.0,,-1.154701
40,-1.0,-0.707107,-0.57735
50,0.0,,-0.57735
60,1.0,0.707107,1.154701


In [105]:
# fill NaN values with the mean per group
grouped = df.groupby("letter")


result = grouped.transform("fillna", grouped.transform("mean"))
result

display_side_by_side(df,grouped.transform("mean"),result)

Unnamed: 0,letter,B,C,D
10,x,1,8.0,4
20,x,2,1.0,4
30,x,3,,3
40,y,1,7.0,4
50,y,2,,4
60,y,3,9.0,8

Unnamed: 0,B,C,D
10,2.0,4.5,3.666667
20,2.0,4.5,3.666667
30,2.0,4.5,3.666667
40,2.0,8.0,5.333333
50,2.0,8.0,5.333333
60,2.0,8.0,5.333333

Unnamed: 0,B,C,D
10,1,8.0,4
20,2,1.0,4
30,3,4.5,3
40,1,7.0,4
50,2,8.0,4
60,3,9.0,8
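
As far as I know, recent pandas versions warn that the group-wise `fillna` is deprecated, so an alternative spelling may be useful. This is only a sketch on a made-up `readings` frame: it broadcasts the group mean with `transform("mean")` and then calls the regular `Series.fillna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, mirroring column C of the example above.
readings = pd.DataFrame(
    {"letter": ["x", "x", "x", "y", "y", "y"], "C": [8, 1, np.nan, 7, np.nan, 9]},
    index=[10, 20, 30, 40, 50, 60],
)

# Group mean broadcast to every row, then used to fill the gaps.
group_means = readings.groupby("letter")["C"].transform("mean")
filled = readings["C"].fillna(group_means)

print(filled)
```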


In [106]:
# TODO: Window and resample operations are still missing

### Filtration

A filtration is a GroupBy operation that returns a subset of the original 
grouping object. It may

1. filter part of groups (using built-in functions)
2. filter out entire groups (using `filter()` + UDF)
3. both (chaining operations)

**NOTE:** Filtrations don't add the group keys to the index of the result. Therefore, `as_index`
and `sort` have no effect.

Relevant built-in functions are:

- `head()` Select the top row(s) of each group
- `nth()` Select the nth row(s) of each group
- `tail()` Select the bottom row(s) of each group

**IMPORTANT: Boolean Indexing Option**  You can also use Boolean indexing to create complex filtrations 
within groups.
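
A minimal sketch of `head()`, `tail()` and the Boolean indexing option, on a made-up `items` frame. The Boolean mask compares each row with a statistic of its own group, obtained through `transform()`.

```python
import pandas as pd

# Hypothetical data.
items = pd.DataFrame({"A": ["foo", "foo", "foo", "bar", "bar"], "C": [0, 1, 5, 3, 7]})

first_rows = items.groupby("A").head(1)  # top row of each group
last_rows = items.groupby("A").tail(1)   # bottom row of each group

# Boolean indexing: keep the rows whose C is above the mean of their own group.
above_group_mean = items[items["C"] > items.groupby("A")["C"].transform("mean")]

print(first_rows)
print(last_rows)
print(above_group_mean)
```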

#### `filter()` method

The filter method **only** takes a User-Defined Function (UDF) that returns `True` or `False`
for a given group. The groups for which it returns `False` are discarded.

**IMPORTANT:** The function must always return a Boolean scalar, `True` or `False`.

**NOTE:** you can keep the dropped groups, filled with NaNs, by passing `dropna=False` to `filter()`.

In [107]:
index = pd.MultiIndex.from_product([["x","y"], [0, 1, 2]],
                                   names=["letter", "number"])


df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "three", "two", "two"],
        "C": np.arange(6)
    },
    index = index
)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,foo,one,0
x,1,foo,one,1
x,2,foo,two,2
y,0,bar,three,3
y,1,foo,two,4
y,2,bar,two,5


In [108]:
# 1. Filtering parts of groups
# select only the row at position 1 of each group (the second row, since positions start at 0)
df.groupby("letter").nth(1)

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,1,foo,one,1
y,1,foo,two,4


In [109]:
# 2. Filtering entire groups (filter() + UDF)
# only select groups with more than 2 elements

# NOTE: x is the whole group DataFrame, so len(x) returns the 
# number of rows

result = df.groupby("A").filter(lambda x: len(x) > 2)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,foo,one,0
x,1,foo,one,1
x,2,foo,two,2
y,0,bar,three,3
y,1,foo,two,4
y,2,bar,two,5

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,foo,one,0
x,1,foo,one,1
x,2,foo,two,2
y,1,foo,two,4


In [110]:
# Keeping only the groups whose sum of C is greater than 7

# NOTE: the bar group sums to 8 on C and the foo group sums to 7
result = df.groupby("A").filter(lambda x: x["C"].sum() > 7)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,foo,one,0
x,1,foo,one,1
x,2,foo,two,2
y,0,bar,three,3
y,1,foo,two,4
y,2,bar,two,5

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
y,0,bar,three,3
y,2,bar,two,5


In [111]:
# Keeping the dropped groups
result = df.groupby("A").filter(lambda x: x["C"].sum() > 7, dropna=False)
display_side_by_side(df, result)

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,foo,one,0
x,1,foo,one,1
x,2,foo,two,2
y,0,bar,three,3
y,1,foo,two,4
y,2,bar,two,5

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,,,
x,1,,,
x,2,,,
y,0,bar,three,3.0
y,1,,,
y,2,bar,two,5.0


In [112]:
# 3. Filtering entire groups and some rows on resulting group

result = df.groupby("A").filter(lambda x: len(x) > 2)

result2 = result[result["C"] < 3] # Boolean indexing

display_side_by_side(df, result, result2)

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,foo,one,0
x,1,foo,one,1
x,2,foo,two,2
y,0,bar,three,3
y,1,foo,two,4
y,2,bar,two,5

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,foo,one,0
x,1,foo,one,1
x,2,foo,two,2
y,1,foo,two,4

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
letter,number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
x,0,foo,one,0
x,1,foo,one,1
x,2,foo,two,2


### Flexible `apply()`

`apply()` is a flexible method that allows us to use different kinds of user-defined functions: 

- reducer
- transformer
- filter
- others that don't match the previous ones

However, in most cases it is more efficient to use `agg()`, `transform()` or `filter()`.

**NOTE:** `apply` will try to infer from the result of the function whether it
should act as a reducer, transformer, or filter.

**NOTE:** To control whether the grouped column(s) are included in the indices, you can use the argument `group_keys` in `groupby` which defaults to `True`

I can highlight the following use cases for `apply`:

1. **Custom Aggregations:** an `aggregation` that includes some extra steps.
2. **Different Output Lengths**: similar to a `filter`, selecting some rows based on some condition.
3. **Complex Grouping Logic**: similar to a `transformation`, with complex logic that also uses an aggregation step.

#### Difference between input and output

There are differences between the input and output of the UDF for `apply()`, `aggregate()`, `transform()` and `filter()`.
This is what makes `apply()` so flexible.

| method | input | output |
|--------|-------|--------|
| `apply()` | The entire DataFrame of the group <br> Series if you are applying on a Series.  | Scalar Value <br> Series <br> DataFrame <br> List or Numpy Array |
| `aggregate()`  | A DataFrame only without the grouping column <br> Series if there are not multiple columns left       | Scalar Value <br> List, Numpy Array or Set |
| `transform()` | The entire DataFrame of the group <br> Series if there are not multiple columns left | DataFrame (or Series) with the same shape as the input  |
| `filter()`       | The entire DataFrame of the group <br> Series if there are not multiple columns left | Boolean Scalar      |

In [113]:
#1. Custom Aggregation
# Compare the total purchase of each customer to the total purchase of all the
# customers

data = {
    'Customer': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Charlie'],
    'Purchase': [100, 200, 150, 300, 120, 250]
}

df = pd.DataFrame(data)
total_purchases = df['Purchase'].sum()

def calculate_percentage(group):

    total_purchase = group['Purchase'].sum()
    percentage = (total_purchase / total_purchases) * 100

    s = pd.Series( 
        { "Total Amount" : total_purchase, 
         "Percentage": percentage}, 
         name=group["Customer"].unique()[0]
         )
    
    return s

result = df.groupby('Customer').apply(calculate_percentage)
display_side_by_side(df, result)

Unnamed: 0,Customer,Purchase
0,Alice,100
1,Bob,200
2,Charlie,150
3,Alice,300
4,Bob,120
5,Charlie,250

Unnamed: 0_level_0,Total Amount,Percentage
Customer,Unnamed: 1_level_1,Unnamed: 2_level_1
Alice,400.0,35.714286
Bob,320.0,28.571429
Charlie,400.0,35.714286


In [114]:
# 2. Different Output Lengths
# Return the two rows with the highest sales from each product group


data = {
    'Product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Sales': [100, 200, 150, 300, 120, 250, 90, 70, 60]
}

df = pd.DataFrame(data)

def get_top_two(group):
    return group.nlargest(2, 'Sales')

result = df.groupby('Product').apply(get_top_two)

display_side_by_side(df, result)


Unnamed: 0,Product,Sales
0,A,100
1,B,200
2,C,150
3,A,300
4,B,120
5,C,250
6,A,90
7,B,70
8,C,60

Unnamed: 0_level_0,Unnamed: 1_level_0,Product,Sales
Product,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,3,A,300
A,0,A,100
B,1,B,200
B,4,B,120
C,5,C,250
C,2,C,150


In [115]:
#3. Complex Grouping Logic
# Label each region (group) as 'High' if its average sale is greater than 
# the overall average sale, otherwise label it as 'Low'

data = {
    'Region': ['A', 'B', 'C', 'D', 'E', 'A'],
    'Sales': [1000, 2000, 1500, 3000, 1200, 50]
}

df = pd.DataFrame(data)

average_sales = df['Sales'].mean()

def label_regions(group):
    if group['Sales'].mean() > average_sales:
        group['Sales Category'] = 'High'
    else:
        group['Sales Category'] = 'Low'
    return group

result = df.groupby('Region').apply(label_regions)
display_side_by_side(df, result)


Unnamed: 0,Region,Sales
0,A,1000
1,B,2000
2,C,1500
3,D,3000
4,E,1200
5,A,50

Unnamed: 0_level_0,Unnamed: 1_level_0,Region,Sales,Sales Category
Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,0,A,1000,Low
A,5,A,50,Low
B,1,B,2000,High
C,2,C,1500,High
D,3,D,3000,High
E,4,E,1200,Low


In [116]:
# keeping the original index (the group keys are not added to it)

result = df.groupby('Region',  group_keys = False).apply(label_regions)
display_side_by_side(df, result)

Unnamed: 0,Region,Sales
0,A,1000
1,B,2000
2,C,1500
3,D,3000
4,E,1200
5,A,50

Unnamed: 0,Region,Sales,Sales Category
0,A,1000,Low
1,B,2000,High
2,C,1500,High
3,D,3000,High
4,E,1200,Low
5,A,50,Low
