# Groupby

`pd.DataFrame.groupby` is a very useful tool, but working with it can sometimes be confusing. On this page I look more closely at some of its functions and use cases.

The most useful page for learning is <a href="https://pandas.pydata.org/docs/reference/groupby.html">GroupBy object in pandas documentation</a>.

## Basic frame

This page contains many examples of the same type, so unless stated otherwise I use the dataset declared below.

In [2]:
import pandas as pd
from IPython.display import HTML

basic_frame = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'B': [2, 1, 3, 4, 6, 5],
                   'C': [10, 20, 30, 40, 50, 60]})

basic_frame

Unnamed: 0,A,B,C
0,a,2,10
1,a,1,20
2,b,3,30
3,b,4,40
4,c,6,50
5,c,5,60


## `pd.DataFrame.groupby`

Here I describe the basic usage of the `pandas.DataFrame.groupby` method. For a formal description, see the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html">official pandas documentation for this method</a>.

### `as_index`


Setting this value to `False` tells pandas not to use the grouping variable as the index of the result; the group labels are kept as a regular column instead.

So in the following example I show the difference.

**Note** In the first case the result is a `pandas.Series` only because I selected a single column (`["C"]`). In the second case the result has to be a `pandas.DataFrame`, because the group labels need a column of their own.

In [2]:
display(HTML("<b>as_index=True</b>"))
display(basic_frame.groupby("A", as_index=True)["C"].sum())
display(HTML("<b>as_index=False</b>"))
display(basic_frame.groupby("A", as_index=False)["C"].sum())

A
a     30
b     70
c    110
Name: C, dtype: int64

Unnamed: 0,A,C
0,a,30
1,b,70
2,c,110
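
As a small complement to the comparison above, the sketch below checks that `as_index=False` gives the same layout as grouping with the default settings and then calling `reset_index()`. The variable names `flat` and `via_reset` are mine and are used only for illustration.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Group labels are kept as a regular column right away.
flat = basic_frame.groupby("A", as_index=False)["C"].sum()

# Group labels first go to the index and are moved back to a column afterwards.
via_reset = basic_frame.groupby("A")["C"].sum().reset_index()

print(flat)
print(via_reset)
```

Both frames should have the columns `A` and `C` and a plain integer index, so which form to use is mostly a matter of taste.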


### `observed`

With the categorical datatype, a category can exist without ever appearing in the series. This parameter controls whether such unobserved categories are included in the `groupby` result (`False`) or only observed categories are used (`True`).

So in the following example I changed the datatype of the `A` column to `category`, added a new category `l` without any observation belonging to it, and then tried both options for the `observed` parameter. In the first case `l` is missing from the index of the groupby result, in the second case it is present.

In [29]:
example_frame = basic_frame.copy()
example_frame["A"] = example_frame["A"].\
                        astype("category").cat.\
                        add_categories("l")

display(
    HTML("<b style=\"font-size:120%\">=====observed=True=====</b>")
)
display(
    example_frame.groupby("A", observed=True).sum()
)

display(
    HTML("<b style=\"font-size:120%\">=====observed=False=====</b>")
)
display(
    example_frame.groupby("A", observed=False).sum()
)

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,3,30
b,7,70
c,11,110


Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,3,30
b,7,70
c,11,110
l,0,0
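
The same effect can be seen with group sizes instead of sums. The sketch below uses a small hypothetical frame `cat_frame` (not part of the original examples) where the category `l` is declared but never observed.

```python
import pandas as pd

# "l" is a declared category with no rows.
cat_frame = pd.DataFrame({
    "A": pd.Categorical(["a", "a", "b"], categories=["a", "b", "l"]),
    "B": [1, 2, 3],
})

# Only categories that actually occur in the data.
observed_sizes = cat_frame.groupby("A", observed=True).size()

# All declared categories; the unobserved one gets size 0.
all_sizes = cat_frame.groupby("A", observed=False).size()

print(observed_sizes)
print(all_sizes)
```

Passing `observed` explicitly, as here, keeps the result independent of whatever default the installed pandas version uses.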


## Iterating

You can iterate through `pandas.DataFrame.groupby` results. In each iteration you get a tuple of two values:

- Value of the grouping variable for this iteration;
- Sub-sampling from the original data set corresponding to the considered value of the grouping variable.

So in the following example I show the result of the first iteration over a `DataFrameGroupBy` object and then use it in a loop.

In [3]:
display(HTML("<b>Some iteration returns</b>"))
display(next(iter(basic_frame.groupby("A"))))

display(HTML("<b>Whole cycle</b>"))
for a_val, subframe in basic_frame.groupby("A"):
    print("====" + a_val + "=====")
    display(subframe)

('a',
    A  B   C
 0  a  2  10
 1  a  1  20)

====a=====


Unnamed: 0,A,B,C
0,a,2,10
1,a,1,20


====b=====


Unnamed: 0,A,B,C
2,b,3,30
3,b,4,40


====c=====


Unnamed: 0,A,B,C
4,c,6,50
5,c,5,60
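
If you need only one group, or only the row labels of each group, a loop is not required. The sketch below uses the `groups` attribute and the `get_group` method of the groupby object; the names `group_index` and `b_frame` are mine.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

grouped = basic_frame.groupby("A")

# Mapping from each group label to the row labels that belong to it.
group_index = {label: list(rows) for label, rows in grouped.groups.items()}
print(group_index)

# Sub-frame for a single label: the same rows the loop yields for "b".
b_frame = grouped.get_group("b")
print(b_frame)
```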


## External group array

You can use an arbitrary array (that is not a column of the dataframe being grouped) for grouping.

So in the following example I use a list of markers to split the dataframe into two groups, `x` and `y`.

In [4]:
group_list = ["x", "x", "x", "y", "y", "y"]
display(HTML("<b>Input dataframe</b>"))
display(basic_frame)
display(HTML("<b>Group variable</b>"))
display(group_list)
basic_frame.groupby(group_list).sum()

Unnamed: 0,A,B,C
0,a,2,10
1,a,1,20
2,b,3,30
3,b,4,40
4,c,6,50
5,c,5,60


['x', 'x', 'x', 'y', 'y', 'y']

Unnamed: 0,A,B,C
x,aab,6,60
y,bcc,15,150


You can even mix two external variables.

In [5]:
group_list1 = ["x", "x", "x", "y", "y", "y"]
group_list2 = [1,1,2,2,2,1]
display(HTML("<b>Group variables</b>"))
display(group_list1, group_list2)
basic_frame.groupby([group_list1, group_list2]).sum()

['x', 'x', 'x', 'y', 'y', 'y']

[1, 1, 2, 2, 2, 1]

Unnamed: 0,Unnamed: 1,A,B,C
x,1,aa,3,30
x,2,b,3,30
y,1,c,5,60
y,2,bc,10,90


Or mix external and internal variables in a `groupby`.

In [6]:
group_list = ["x", "x", "x", "y", "y", "y"]
display(HTML("<b>Group variable</b>"))
display(group_list)
basic_frame.groupby([group_list, "A"]).sum()

['x', 'x', 'x', 'y', 'y', 'y']

Unnamed: 0_level_0,Unnamed: 1_level_0,B,C
Unnamed: 0_level_1,A,Unnamed: 2_level_1,Unnamed: 3_level_1
x,a,3,30
x,b,3,30
y,b,4,40
y,c,11,110
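
The external key does not have to be a hand-written list. As a sketch, the key below is a `pandas.Series` computed from the data (the parity of `B`) that is never stored as a column; the name `B_parity` is my own choice. Selecting `"C"` before `sum` also avoids the string concatenation in the `A` column seen in the outputs above.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# External key: 0 for even "B", 1 for odd "B"; it shares the frame's index.
parity = (basic_frame["B"] % 2).rename("B_parity")

# Sum "C" within each parity group.
c_by_parity = basic_frame.groupby(parity)["C"].sum()
print(c_by_parity)
```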


## `agg` - rule by dict

This is a way to apply a separate aggregation function to each column using the syntax `{<var_name_1>:<aggregation_function_1>, <var_name_2>:<aggregation_function_2>, ...}`.

So in the following example, I use the above syntax to aggregate max `B` values and sum of `C` values by `A` subsets:

In [7]:
display(HTML("<b>Aggregation</b>"))
display(basic_frame.groupby("A").agg({"B":"max", "C":"sum"}))

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2,30
b,4,70
c,6,110
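
A related form is named aggregation, where each output column gets its own name through keyword arguments of the shape `new_name=(column, function)`. The output names `b_max` and `c_sum` below are my own; this is a sketch of the same aggregation as above, not a replacement for the dict syntax.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Each keyword defines one output column: (source column, aggregation).
named = basic_frame.groupby("A").agg(
    b_max=("B", "max"),
    c_sum=("C", "sum"),
)
print(named)
```

This form is handy when the same source column has to be aggregated in several ways, because the result keeps flat column names.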


## `apply` - combine results

<a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html#pandas.core.groupby.DataFrameGroupBy.apply">Pandas documentation about apply function.</a>

### Basic idea

The peculiarity of this method is that it passes each group to the aggregation function as a whole `pandas.DataFrame`.

The following example shows this: `example_function` just prints its input, and it always prints a DataFrame for each value of the "A" variable.

In [8]:
def example_function(subdf):
    print("=========")
    print(subdf)
    return 5

res = basic_frame.groupby("A")[
    ["A", "B", "C"]
].apply(example_function)

   A  B   C
0  a  2  10
1  a  1  20
   A  B   C
2  b  3  30
3  b  4  40
   A  B   C
4  c  6  50
5  c  5  60


### Use case

So it's perfect for cases where you need to get, for each variant of variable A, some value of variable C conditioned on the value of variable B.

In particular, the following example shows how to obtain for each option of "A" the "C" value corresponding to the minimum "B" value.

- For `"A" == "a"` I got `"C" == 20`, because it corresponds to `"B" == 1`, which is the minimum among the rows with `"A" == "a"`;
- For `"A" == "b"` I got `"C" == 30`, because it corresponds to `"B" == 3`, which is the minimum among the rows with `"A" == "b"`;
- For `"A" == "c"` I got `"C" == 60`, because it corresponds to `"B" == 5`, which is the minimum among the rows with `"A" == "c"`.

In [9]:
result = basic_frame.groupby("A")[["B", "C"]].apply(
    lambda subset: subset.loc[subset["B"].idxmin(), "C"]
)
display(HTML("<b>Result</b>"))
result.rename("C").to_frame()

Unnamed: 0_level_0,C
A,Unnamed: 1_level_1
a,20
b,30
c,60
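
For this particular task the same result can also be obtained without `apply`, by first asking the groupby object for the row labels of the minimum "B" and then selecting those rows. This is only an alternative sketch; the names `rows` and `result_no_apply` are mine.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Row label of the minimum "B" inside every "A" group.
rows = basic_frame.groupby("A")["B"].idxmin()

# Pick those rows and index the "C" values by the group label.
result_no_apply = (
    basic_frame.loc[rows.to_numpy(), ["A", "C"]]
    .set_index("A")["C"]
)
print(result_no_apply)
```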


### vs `agg`

Other, more specific methods may seem redundant because `apply` can do everything they can. However, according to the <a href="https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html#pandas.core.groupby.DataFrameGroupBy.apply">pandas documentation</a>, specific methods such as `agg` or `transform` can be faster. I have not been able to test this yet.
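
A rough way to check this could look like the sketch below. It builds a synthetic frame (the sizes, the seed and all variable names are my assumptions), verifies that both approaches return the same sums, and times them with `timeit`. I do not quote any numbers here, because they depend on the machine and on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 20,000 rows spread over 200 groups.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "A": rng.integers(0, 200, size=20_000),
    "C": rng.normal(size=20_000),
})
grouped = frame.groupby("A")["C"]

# Both calls compute the per-group sum of "C".
via_agg = grouped.agg("sum")
via_apply = grouped.apply(lambda s: s.sum())
print(np.allclose(via_agg.to_numpy(), via_apply.to_numpy()))

# Time a few repetitions of each variant.
t_agg = timeit.timeit(lambda: grouped.agg("sum"), number=5)
t_apply = timeit.timeit(lambda: grouped.apply(lambda s: s.sum()), number=5)
print(f"agg: {t_agg:.4f}s, apply: {t_apply:.4f}s")
```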

## `transform`

This is a function that allows you to get aggregations as `pandas.Series`/`pandas.DataFrame` indexed like the original `pandas.DataFrame`.

For example, in the following cell I use the `transform` function to get, for each record of the original `pandas.DataFrame`, the mean value of `B` within its `A` group.

In [17]:
temp_frame = basic_frame.copy()

temp_frame["mean B by A"] = (
    temp_frame.
    groupby("A")["B"].
    transform("mean")
)
display(temp_frame)

Unnamed: 0,A,B,C,mean B by A
0,a,2,10,1.5
1,a,1,20,1.5
2,b,3,30,3.5
3,b,4,40,3.5
4,c,6,50,5.5
5,c,5,60,5.5


Here I get a `pandas.DataFrame` that, for each record of the original `pandas.DataFrame`, holds the mean values of the `B` and `C` columns within the record's `A` group, all in a single command.

In [16]:
display(
    temp_frame.
    groupby("A")[["B", "C"]].
    transform("mean")
)

Unnamed: 0,B,C
0,1.5,15.0
1,1.5,15.0
2,3.5,35.0
3,3.5,35.0
4,5.5,55.0
5,5.5,55.0
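
Because the result of `transform` is aligned with the original index, it can be combined directly with the original columns. As a sketch, the code below centers `B` around its group mean; the name `centered` is mine.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Group mean of "B", repeated for every row of the group.
group_mean = basic_frame.groupby("A")["B"].transform("mean")

# Deviation of each "B" value from the mean of its own "A" group.
centered = basic_frame["B"] - group_mean
print(centered)
```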


## `sum`

The basic function that allows you to get sums by groups.

### For `str` dtype

If you apply the `sum` function to a variable containing a `str` datatype, it will concatenate observations by groups.

So in the following example, this is exactly what happens to the `group text` column of the test dataframe.

In [10]:
test_df = pd.DataFrame({
    "group class" : ["a", "a", "b", "b"],
    "group numeric" : [3,4,5,1],
    "group text" : ["hello", "test", "line3", "superline"]
})
display(HTML("<b>Initial frame</b>"))
display(test_df)
display(HTML("<b>Aggregation result</b>"))
test_df.groupby("group class").sum()

Unnamed: 0,group class,group numeric,group text
0,a,3,hello
1,a,4,test
2,b,5,line3
3,b,1,superline


Unnamed: 0_level_0,group numeric,group text
group class,Unnamed: 1_level_1,Unnamed: 2_level_1
a,7,hellotest
b,6,line3superline
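
If the concatenation is not wanted, one option is to pass `numeric_only=True` to `sum`, which leaves the string column out of the result. The sketch below repeats the test dataframe from above; the name `numeric_sums` is mine.

```python
import pandas as pd

test_df = pd.DataFrame({
    "group class": ["a", "a", "b", "b"],
    "group numeric": [3, 4, 5, 1],
    "group text": ["hello", "test", "line3", "superline"],
})

# Only numeric columns are summed; "group text" is dropped from the result.
numeric_sums = test_df.groupby("group class").sum(numeric_only=True)
print(numeric_sums)
```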


## Shift

The `shift` method allows each value to be matched with the previous/next value in the same group. See the corresponding [documentation page](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.shift.html).

---

The next cell shows the application of the `shift` to the current data frame.

In [10]:
basic_frame.groupby("A").shift(1)

Unnamed: 0,B,C
0,,
1,2.0,10.0
2,,
3,3.0,30.0
4,,
5,6.0,50.0


As a result, each index gets the "B" and "C" values of the row that sits one position higher in the original table within the same "A" group. The first row of each group has no predecessor, so it gets `NaN`.
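
A typical use of this is computing the change relative to the previous row of the same group, and a negative period gives the next value instead of the previous one. The sketch below shows both for the "B" column; the names `prev_b`, `change` and `next_b` are mine.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Previous "B" value inside the same "A" group (NaN for the first row).
prev_b = basic_frame.groupby("A")["B"].shift(1)

# Change of "B" relative to the previous row of the group.
change = basic_frame["B"] - prev_b

# Next "B" value inside the same "A" group (NaN for the last row).
next_b = basic_frame.groupby("A")["B"].shift(-1)

print(pd.DataFrame({"prev_b": prev_b, "change": change, "next_b": next_b}))
```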