# Aggregation and Group By

"Group by" refers to a process that involves one or more of the following steps:

- **Splitting** the data into groups based on some criteria.
- **Applying** a function to each group independently.
- **Combining** the results into a data structure.

The apply step can be one of the following:

- **Aggregation**: compute a summary statistic (or statistics) for each group.
- **Transformation**: perform some group-specific computations and return a like-indexed object, e.g. standardize data within a group.
- **Filtration**: discard some groups, according to a group-wise computation that evaluates True or False.

Note: it is also possible to apply a custom function using an `apply` method.
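The three kinds of apply step, plus `apply`, can be sketched on a small frame. The `toy` DataFrame, its `team` and `score` columns, and the thresholds are hypothetical and only serve this illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration.
toy = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b", "b"],
        "score": [1.0, 3.0, 10.0, 20.0, 30.0],
    }
)
by_team = toy.groupby("team")["score"]

# Aggregation: one summary value per group.
means = by_team.mean()

# Transformation: the result keeps the input's index (here, centering within each group).
centered = by_team.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups whose mean score exceeds 5.
kept = toy.groupby("team").filter(lambda g: g["score"].mean() > 5)

# apply: an arbitrary function per group (here, the range of each group).
ranges = by_team.apply(lambda s: s.max() - s.min())

print(means, centered, kept, ranges, sep="\n\n")
```

Aggregation and `apply` return one row per group, the transformation returns one row per input row, and the filtration returns a subset of the original rows.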

In [2]:
import pandas as pd
import numpy as np

np.random.seed(0)

In [46]:
# Helper functions that render DataFrames and groups as inline HTML tables
from IPython.display import display_html

def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

def display_group_by(group_by_object):
    group_by_list = list(group_by_object)
    html_str=''
    for key, df in group_by_list:
        html_str+= "<h3>"+ str(key) + "</h3>"
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

## Splitting an object into groups

The first step uses the `groupby()` method to split a DataFrame into groups.

`groupby(by, axis, level)`

- `by` is the key: a label or list of labels (index or columns) that determines the groups.
It can also be a function or even a Series.
- `axis` indicates whether to **split along** rows (0) or columns (1). The default is 0. Note that pandas 2.1 and later deprecate this argument.
- `level` allows us to group by index (or MultiIndex) levels. Don't use both
`level` and `by`.

**Note** that `groupby()` returns a `DataFrameGroupBy` object, which has no tabular
display of its own, because **no splitting occurs until it's needed**. However,
I wrote a helper function, `display_group_by`, that helps us visualize the groups
by converting the `DataFrameGroupBy` into a list.

**Note** `df.groupby('A')` is just syntactic sugar for `df.groupby(df['A'])`

**Note** if the key is a string that matches both a column name and an index level name, a `ValueError` will be raised.
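A minimal sketch of the last two notes, using a hypothetical `toy` frame: grouping by a column label and by the column itself should give the same result, and a key that names both a column and the index should be rejected.

```python
import pandas as pd

# Hypothetical toy frame for illustration.
toy = pd.DataFrame({"A": ["x", "y", "x"], "val": [1, 2, 3]})

# Grouping by the column label or by the column itself gives the same result.
by_label = toy.groupby("A")["val"].sum()
by_series = toy.groupby(toy["A"])["val"].sum()
same = by_label.equals(by_series)

# Naming the index "A" as well makes the string key ambiguous.
ambiguous = toy.copy()
ambiguous.index.name = "A"
try:
    ambiguous.groupby("A")
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

print(same, raised)
```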

In [94]:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)

df

Unnamed: 0,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0
lion,mammal,Carnivora,80.2
monkey,mammal,Primates,
leopard,mammal,Carnivora,58.0


In [95]:
grouped = df.groupby("class")
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7efe9f81bb50>

**Note**: The method `.get_group("<name of group>")` allows us to retrieve the
DataFrame of the respective group.

In [97]:
grouped.get_group("bird")

Unnamed: 0,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0
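Besides `get_group`, a grouped object can be inspected without materializing every group. The sketch below rebuilds a reduced version of the animals frame under the assumed name `animals` and uses the `ngroups`, `groups`, and `size()` members, plus plain iteration.

```python
import numpy as np
import pandas as pd

# Reduced copy of the example data; the name `animals` is an assumption.
animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "max_speed": [389.0, 24.0, 80.2, np.nan, 58.0],
    },
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
)
grouped = animals.groupby("class")

n_groups = grouped.ngroups  # number of distinct keys
labels = {key: list(idx) for key, idx in grouped.groups.items()}  # key -> row labels
sizes = grouped.size()  # rows per group

# Iterating yields (key, sub-DataFrame) pairs, which is what display_group_by relies on.
for key, sub in grouped:
    print(key, sub.index.tolist())
```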


In [68]:
display_group_by(grouped)

Unnamed: 0,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0

Unnamed: 0,class,order,max_speed
lion,mammal,Carnivora,80.2
monkey,mammal,Primates,
leopard,mammal,Carnivora,58.0


In [69]:
display_group_by(df.groupby("order", axis="columns"))

# Note: this doesn't display anything, because it tries to split along the
# columns, which is not possible here.


In [70]:
# by can be a list of labels
display_group_by(df.groupby(["class", "order"]))

Unnamed: 0,class,order,max_speed
falcon,bird,Falconiformes,389.0

Unnamed: 0,class,order,max_speed
parrot,bird,Psittaciformes,24.0

Unnamed: 0,class,order,max_speed
lion,mammal,Carnivora,80.2
leopard,mammal,Carnivora,58.0

Unnamed: 0,class,order,max_speed
monkey,mammal,Primates,


In [73]:
df2 = df.set_index(["class"])
df2

class,order,max_speed
bird,Falconiformes,389.0
bird,Psittaciformes,24.0
mammal,Carnivora,80.2
mammal,Primates,
mammal,Carnivora,58.0


In [74]:
display_group_by(df2.groupby(level="class"))

class,order,max_speed
bird,Falconiformes,389.0
bird,Psittaciformes,24.0

class,order,max_speed
mammal,Carnivora,80.2
mammal,Primates,
mammal,Carnivora,58.0
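Grouping by `level` extends to a MultiIndex. As a sketch, the frame below (rebuilt under the assumed name `animals`) is indexed by both `class` and `order`, then grouped on one level and on both levels.

```python
import numpy as np
import pandas as pd

# Copy of the example data without the animal names; `animals` is an assumed name.
animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "order": ["Falconiformes", "Psittaciformes", "Carnivora", "Primates", "Carnivora"],
        "max_speed": [389.0, 24.0, 80.2, np.nan, 58.0],
    }
)
indexed = animals.set_index(["class", "order"])

# Group on a single index level by name...
per_class = indexed.groupby(level="class")["max_speed"].max()

# ...or on several levels at once.
per_pair = indexed.groupby(level=["class", "order"])["max_speed"].max()

print(per_class, per_pair, sep="\n\n")
```

The Primates group contains only a missing value, so its maximum stays `NaN`.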


### Functions as the `by` key and column splits

In [99]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.887786,-1.048553
1,bar,one,-1.980796,-1.420018
2,foo,two,-0.347912,-1.70627
3,bar,three,0.156349,1.950775
4,foo,two,1.230291,-0.509652
5,bar,two,1.20238,-0.438074
6,foo,one,-0.387327,-1.252795
7,foo,three,-0.302303,0.77749


In [100]:
# Note: the function is called on each label of axis 1 (the column names)
def get_letter_type(letter):
    if letter.lower() in 'aeiou':
        return 'vowel'
    else:
        return 'consonant'

grouped = df.groupby(get_letter_type, axis=1)

display_group_by(grouped)

Unnamed: 0,B,C,D
0,one,-0.887786,-1.048553
1,one,-1.980796,-1.420018
2,two,-0.347912,-1.70627
3,three,0.156349,1.950775
4,two,1.230291,-0.509652
5,two,1.20238,-0.438074
6,one,-0.387327,-1.252795
7,three,-0.302303,0.77749

Unnamed: 0,A
0,foo
1,bar
2,foo
3,bar
4,foo
5,bar
6,foo
7,foo
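Because recent pandas releases deprecate the `axis` argument of `groupby`, a column split can also be sketched by grouping the transposed frame. The `toy` frame and the `column_groups` dictionary are hypothetical names; this is one possible workaround, not the only one.

```python
import pandas as pd

def get_letter_type(letter):
    return "vowel" if letter.lower() in "aeiou" else "consonant"

# Hypothetical toy frame with single-letter column names.
toy = pd.DataFrame({"A": ["foo", "bar"], "B": ["one", "two"], "C": [0.1, 0.2]})

# Transposing turns column labels into row labels, so the default row-wise
# grouping applies. Selecting from the original frame keeps the column dtypes.
column_groups = {
    key: toy[sub.index] for key, sub in toy.T.groupby(get_letter_type)
}

for key, sub in column_groups.items():
    print(key, sub.columns.tolist())
```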


In [101]:
# Note: the function is called on each label of axis 0 (the row index)
def is_even(num):
    if num % 2 == 0:
        return 'even'
    else:
        return 'odd'

grouped = df.groupby(is_even)

display_group_by(grouped)

Unnamed: 0,A,B,C,D
0,foo,one,-0.887786,-1.048553
2,foo,two,-0.347912,-1.70627
4,foo,two,1.230291,-0.509652
6,foo,one,-0.387327,-1.252795

Unnamed: 0,A,B,C,D
1,bar,one,-1.980796,-1.420018
3,bar,three,0.156349,1.950775
5,bar,two,1.20238,-0.438074
7,foo,three,-0.302303,0.77749
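A function key behaves like a precomputed array of group labels. The sketch below, on a hypothetical `toy` frame, compares passing `is_even` directly with passing the result of mapping it over the index.

```python
import pandas as pd

def is_even(num):
    return "even" if num % 2 == 0 else "odd"

# Hypothetical toy frame with the default integer index 0..3.
toy = pd.DataFrame({"C": [1.0, 2.0, 3.0, 4.0]})

# pandas calls the function once per index label...
by_function = toy.groupby(is_even)["C"].sum()

# ...which should match grouping by the labels computed up front.
by_labels = toy.groupby(toy.index.map(is_even))["C"].sum()

print(by_function)
print(by_function.equals(by_labels))
```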

