# Groupby

`pd.DataFrame.groupby` is a very useful tool, but working with it can be confusing. This page takes a closer look at some of its methods and common cases.

The most useful reference for learning is the <a href="https://pandas.pydata.org/docs/reference/groupby.html">GroupBy object page in the pandas documentation</a>.

# `DataFrame.groupby`

This section describes basic usage of the `pandas.DataFrame.groupby` method.

## `as_index`

Setting this parameter to `False` tells pandas not to use the grouping variable as the index of the result; it is kept as a regular column instead. The default is `True`.

The following example shows the difference.

**Note** In the first case the result is a `pandas.Series` only because I selected a single column. In the second case pandas has to return a `pandas.DataFrame`, because the result holds both the group labels and the aggregated values as columns.

In [29]:
import pandas as pd
from IPython.display import HTML

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

display(HTML("<b>as_index=True</b>"))
display(df.groupby("A", as_index=True)["C"].sum())
display(HTML("<b>as_index=False</b>"))
display(df.groupby("A", as_index=False)["C"].sum())

A
a     30
b     70
c    110
Name: C, dtype: int64

|   | A | C |
|---|---|---|
| 0 | a | 30 |
| 1 | b | 70 |
| 2 | c | 110 |
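
As a rough cross-check, `as_index=False` should give the same frame as grouping with the default setting and then calling `reset_index()`. This is a minimal sketch that reuses the sample frame from above; the names `flat` and `via_reset` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

# Group labels stay in a regular column "A"
flat = df.groupby("A", as_index=False)["C"].sum()

# Default behaviour puts the labels in the index; reset_index moves them back
via_reset = df.groupby("A")["C"].sum().reset_index()

print(flat.equals(via_reset))
```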


# Iterating

You can iterate through the result of `pandas.DataFrame.groupby`. On each iteration you get a tuple of two values:

- The value of the grouping variable for this iteration;
- The subset of the original data set that corresponds to this value of the grouping variable.

The following example shows the result of the first iteration over a `DataFrameGroupBy` object, and then uses the same object in a `for` loop.

In [16]:
import pandas as pd
from IPython.display import HTML

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})


display(HTML("<b>Input dataframe</b>"))
display(df)

display(HTML("<b>Some iteration returns</b>"))
display(next(iter(df.groupby("A"))))

display(HTML("<b>Whole cycle</b>"))
for a_val, subframe in df.groupby("A"):
    print("====" + a_val + "=====")
    display(subframe)

|   | A | B | C |
|---|---|---|---|
| 0 | a | 2 | 10 |
| 1 | a | 1 | 20 |
| 2 | b | 3 | 30 |
| 3 | b | 4 | 40 |
| 4 | c | 6 | 50 |
| 5 | c | 5 | 60 |


('a',
    A  B   C
 0  a  2  10
 1  a  1  20)

====a=====


|   | A | B | C |
|---|---|---|---|
| 0 | a | 2 | 10 |
| 1 | a | 1 | 20 |


====b=====


|   | A | B | C |
|---|---|---|---|
| 2 | b | 3 | 30 |
| 3 | b | 4 | 40 |


====c=====


|   | A | B | C |
|---|---|---|---|
| 4 | c | 6 | 50 |
| 5 | c | 5 | 60 |
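
Because each iteration yields a `(label, sub-frame)` pair, the loop also fits naturally into a comprehension. The sketch below collects the sum of `C` per group into a plain dict, and uses `get_group` to fetch a single sub-frame without looping. The variable names `groups` and `c_sums` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

groups = df.groupby("A")

# Each iteration yields (group label, sub-frame)
c_sums = {label: int(subframe["C"].sum()) for label, subframe in groups}
print(c_sums)  # {'a': 30, 'b': 70, 'c': 110}

# get_group returns the sub-frame for one label without a loop
print(groups.get_group("b"))
```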


# `agg` - rule by dict

Passing a dict to `agg` lets you choose an aggregation function for each column, using the syntax `{<var_name_1>:<aggregation_function_1>, <var_name_2>:<aggregation_function_2>, ...}`.

The following example uses this syntax to compute the maximum of `B` and the sum of `C` within each `A` group:

In [20]:
import pandas as pd
from IPython.display import HTML

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

display(HTML("<b>Input dataframe</b>"))
display(df)
display(HTML("<b>Aggregation</b>"))
display(df.groupby("A").agg({"B":"max", "C":"sum"}))

|   | A | B | C |
|---|---|---|---|
| 0 | a | 2 | 10 |
| 1 | a | 1 | 20 |
| 2 | b | 3 | 30 |
| 3 | b | 4 | 40 |
| 4 | c | 6 | 50 |
| 5 | c | 5 | 60 |


| A | B | C |
|---|---|---|
| a | 2 | 30 |
| b | 4 | 70 |
| c | 6 | 110 |
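
A related option is named aggregation, where each keyword passed to `agg` becomes an output column and its value is a `(column, function)` pair. This is a small sketch on the same frame; the output names `b_max` and `c_sum` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

# keyword = (source column, aggregation function)
named = df.groupby("A").agg(
    b_max=("B", "max"),
    c_sum=("C", "sum"),
)
print(named)
```

Compared with the dict form, this lets you rename the result columns in the same call.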


# `apply` - combine results

<a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html#pandas.core.groupby.DataFrameGroupBy.apply">Pandas documentation about apply function.</a>

## Basic idea

What sets this method apart is that the function you pass receives the whole `pandas.DataFrame` of each group as its input.

The following example shows this: `example_function` just prints its input, and it always prints a DataFrame, one for each value of the `A` variable.

In [40]:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

def example_function(subdf):
    # Print a separator and the sub-frame that apply passes in for each group
    print("=========")
    print(subdf)
    return 5

res = df.groupby("A")[["A", "B", "C"]].apply(example_function)

   A  B   C
0  a  2  10
1  a  1  20
   A  B   C
2  b  3  30
3  b  4  40
   A  B   C
4  c  6  50
5  c  5  60
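
The values returned by the function are then combined into a single object. Here is a small sketch, assuming a hypothetical helper `weighted_total` (not part of pandas) that returns one scalar per sub-frame. In that case the combined result is a `Series` indexed by the group labels.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

def weighted_total(subdf):
    # Hypothetical helper: uses two columns of the sub-frame at once
    return (subdf["B"] * subdf["C"]).sum()

res = df.groupby("A")[["B", "C"]].apply(weighted_total)

print(type(res).__name__)                     # Series
print({k: int(v) for k, v in res.items()})    # {'a': 40, 'b': 250, 'c': 600}
```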


## Use case

This makes it a good fit for cases where, for each value of variable `A`, you need a value of variable `C` chosen by a condition on variable `B`.

In particular, the following example shows how to obtain, for each value of `A`, the `C` value that corresponds to the minimum `B` value.

- For `"A" == "a"` the result is `"C" == 20`, because it corresponds to `"B" == 1`, the minimum `B` among rows with `"A" == "a"`;
- For `"A" == "b"` the result is `"C" == 30`, because it corresponds to `"B" == 3`, the minimum `B` among rows with `"A" == "b"`;
- For `"A" == "c"` the result is `"C" == 60`, because it corresponds to `"B" == 5`, the minimum `B` among rows with `"A" == "c"`.

In [9]:
from IPython.display import HTML
import pandas as pd
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

display(HTML("<b>Initial frame</b>"))
display(df)

result = df.groupby("A")[["B", "C"]].apply(
    lambda subset: subset.loc[subset["B"].idxmin(), "C"]
)
display(HTML("<b>Result</b>"))
result.rename("C").to_frame()

|   | A | B | C |
|---|---|---|---|
| 0 | a | 2 | 10 |
| 1 | a | 1 | 20 |
| 2 | b | 3 | 30 |
| 3 | b | 4 | 40 |
| 4 | c | 6 | 50 |
| 5 | c | 5 | 60 |


| A | C |
|---|---|
| a | 20 |
| b | 30 |
| c | 60 |
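
For this particular task the same values can also be reached without `apply`, by asking the grouped `B` column for the index label of its minimum and then selecting those rows. This is an alternative sketch for comparison and does not replace the approach above; `min_rows` and `via_idxmin` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

# The apply-based version from the example above
via_apply = df.groupby("A")[["B", "C"]].apply(
    lambda subset: subset.loc[subset["B"].idxmin(), "C"]
)

# For each group, the index label of the row with the smallest B
min_rows = df.groupby("A")["B"].idxmin()

# Select those rows and keep C, indexed by A
via_idxmin = df.loc[min_rows.to_numpy(), ["A", "C"]].set_index("A")["C"]

print(list(via_apply) == list(via_idxmin))
```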


## vs `agg`/`transform`

Other common methods such as `agg` and `transform` may seem redundant, because `apply` can do everything they can. However, according to the <a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html#pandas.core.groupby.DataFrameGroupBy.apply">pandas documentation</a>, `apply` can be quite a bit slower than these more specific methods. I have not been able to test this yet.
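
One way to check the speed claim on your own machine is a small `timeit` comparison. The sketch below is only a starting point: the synthetic frame `big`, its size, and the number of repetitions are arbitrary assumptions, and the timings will vary with the pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 10,000 rows spread over up to 100 groups (arbitrary sizes)
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "A": rng.integers(0, 100, size=10_000),
    "C": rng.integers(0, 1_000, size=10_000),
})

def with_apply():
    # Generic path: a Python function is called once per group
    return big.groupby("A")[["C"]].apply(lambda sub: sub["C"].sum())

def with_sum():
    # Specific path: built-in group-wise sum
    return big.groupby("A")["C"].sum()

t_apply = timeit.timeit(with_apply, number=5)
t_sum = timeit.timeit(with_sum, number=5)
print(f"apply: {t_apply:.4f} s, built-in sum: {t_sum:.4f} s")
```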

# `sum`

A basic method that computes sums within each group.

## For `str` dtype

If you apply `sum` to a column that holds `str` values, it concatenates the values within each group.

In the following example, this is exactly what happens to the `group text` column of the initial dataframe.

In [10]:
import pandas as pd
from IPython.display import HTML
test_df = pd.DataFrame({
    "group class" : ["a", "a", "b", "b"],
    "group numeric" : [3,4,5,1],
    "group text" : ["hello", "test", "line3", "superline"]
})
display(HTML("<b>Initial frame</b>"))
display(test_df)
display(HTML("<b>Aggregation result</b>"))
test_df.groupby("group class").sum()

|   | group class | group numeric | group text |
|---|---|---|---|
| 0 | a | 3 | hello |
| 1 | a | 4 | test |
| 2 | b | 5 | line3 |
| 3 | b | 1 | superline |


| group class | group numeric | group text |
|---|---|---|
| a | 7 | hellotest |
| b | 6 | line3superline |
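
If the concatenation is not what you want, two options come to mind, sketched below on the same frame: pass `numeric_only=True` to leave text columns out of the sum, or aggregate the text column with an explicit separator via `str.join`. The names `numeric_sums` and `joined` are only illustrative.

```python
import pandas as pd

test_df = pd.DataFrame({
    "group class": ["a", "a", "b", "b"],
    "group numeric": [3, 4, 5, 1],
    "group text": ["hello", "test", "line3", "superline"],
})

# Sum only the numeric columns; "group text" is dropped from the result
numeric_sums = test_df.groupby("group class").sum(numeric_only=True)
print(numeric_sums)

# Join the text values of each group with an explicit separator
joined = test_df.groupby("group class")["group text"].agg(", ".join)
print(joined)
```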
