# Groupby

`pd.DataFrame.groupby` is a very useful tool, but working with it can sometimes be confusing. This page takes a closer look at some of its functions and use cases.

The most useful reference for learning is the [GroupBy page of the pandas documentation](https://pandas.pydata.org/docs/reference/groupby.html).

## Basic frame

Many examples in this section are of the same type, so unless stated otherwise they use the dataset declared below.

In [2]:
import pandas as pd
from IPython.display import HTML, display

basic_frame = pd.DataFrame({
    'A': ['a', 'a', 'b', 'b', 'c', 'c'],
    'B': [2, 1, 3, 4, 6, 5],
    'C': [10, 20, 30, 40, 50, 60]
})

basic_frame

Unnamed: 0,A,B,C
0,a,2,10
1,a,1,20
2,b,3,30
3,b,4,40
4,c,6,50
5,c,5,60


## Parameters

This section describes general parameters of the `groupby` method that affect the result regardless of the specific transformation.

### Result index (`as_index`)

The `as_index` parameter specifies whether the grouping columns are used as the index of the output or kept as regular columns.

---

The following cell shows the result when using `as_index=True`.

In [3]:
basic_frame.groupby("A", as_index=True)["C"].sum()

A
a     30
b     70
c    110
Name: C, dtype: int64

So the output has the groups in the index.

The following cell uses `as_index=False`.

In [4]:
basic_frame.groupby("A", as_index=False)["C"].sum()

Unnamed: 0,A,C
0,a,30
1,b,70
2,c,110


Now the groups are a regular column in the output, and the index is just a default range.
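
To make the relationship between the two modes concrete, here is a minimal sketch. It assumes that the only difference between them is where the group labels end up, so calling `reset_index()` on the `as_index=True` result should give the same table as `as_index=False`. The names `via_reset` and `via_flag` are illustrative only.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Default behaviour: groups go to the index, then reset_index()
# moves them back into a regular column.
via_reset = basic_frame.groupby("A")["C"].sum().reset_index()

# as_index=False: groups are a regular column from the start.
via_flag = basic_frame.groupby("A", as_index=False)["C"].sum()

print(via_reset)
print(via_flag)
```

Both printed frames should have the columns `A` and `C` and a default range index.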

### `observed`

With the categorical datatype, a category can exist without ever appearing in the series. The `observed` parameter controls whether such unobserved categories are included in `groupby` results (`False`) or only observed categories are used (`True`).

---

The following cell sets the column `A` to the categorical datatype and adds a new category `l` that doesn't appear in any observation.

In [10]:
example_frame = basic_frame.copy()
example_frame["A"] = (
    example_frame["A"].astype("category")
    .cat.add_categories("l")
)
example_frame["A"]

0    a
1    a
2    b
3    b
4    c
5    c
Name: A, dtype: category
Categories (4, object): ['a', 'b', 'c', 'l']

The following cell shows `groupby` with `observed=True`.

In [11]:
example_frame.groupby("A", observed=True).sum()

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,3,30
b,7,70
c,11,110


Only the groups that appear in the input dataframe are present in the result.

The following code uses `observed=False`.

In [12]:
example_frame.groupby("A", observed=False).sum()

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,3,30
b,7,70
c,11,110
l,0,0


This time `l` is included in the result even though the input data has no corresponding observations.
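
The same effect can be seen with group sizes. The following sketch rebuilds the categorical frame and compares `size()` under both settings. The variable names `sizes_all` and `sizes_observed` are illustrative.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

example_frame = basic_frame.copy()
# Category "l" exists in the dtype but never appears in the data.
example_frame["A"] = (
    example_frame["A"].astype("category")
    .cat.add_categories("l")
)

# observed=False keeps every declared category, even empty ones.
sizes_all = example_frame.groupby("A", observed=False).size()

# observed=True keeps only categories present in the data.
sizes_observed = example_frame.groupby("A", observed=True).size()

print(sizes_all)
print(sizes_observed)
```

Passing `observed` explicitly, as here, avoids depending on the default value, which may differ between pandas versions.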

## Usage options

This section considers ways to compute values over groups and practical ways to operate with `groupby` objects.

Check more details on the [corresponding page](groupby/usage_options.ipynb).

---

To better understand what this section is about, consider several ways of using pandas `groupby` objects that generally produce the same result: a sum over groups.

The following cell defines the `groupby` object used by all the examples below, so that they differ only in how the same tool is used.

In [34]:
gb = basic_frame.groupby("A")

The classic option is to call the aggregation method directly on the `groupby` object:

In [None]:
gb.sum()

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,3,30
b,7,70
c,11,110


Iterating over the `groupby` object yields pairs of a group key and the subset of rows that belongs to that group.

In [33]:
pd.DataFrame({name: sub_frame.sum() for name, sub_frame in gb})

Unnamed: 0,a,b,c
A,aa,bb,cc
B,3,7,11
C,30,70,110
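
When only one group is of interest, iterating over everything is not necessary. The sketch below loops over the groups to print their shapes and then pulls out a single group with `get_group`. The variable name `group_b` is illustrative.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

gb = basic_frame.groupby("A")

# Each iteration step gives the group key and the matching rows.
for name, sub_frame in gb:
    print(name, sub_frame.shape)

# get_group returns the rows of one particular group.
group_b = gb.get_group("b")
print(group_b)
```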


The `agg` method lets you set a separate transformation for each column by passing a dictionary.

In [39]:
gb.agg({"B": "sum", "C": "sum"})

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,3,30
b,7,70
c,11,110
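
The dictionary form becomes more useful when the columns need different functions. As a sketch, the example below uses named aggregation, where each keyword argument names an output column and maps it to a `(column, function)` pair. The output names `B_min` and `C_mean` are made up for illustration.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

gb = basic_frame.groupby("A")

# Output column name = (input column, aggregation function).
summary = gb.agg(
    B_min=("B", "min"),
    C_mean=("C", "mean"),
)
print(summary)
```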


The `apply` method of the `groupby` object lets you specify a function to be applied to each subset.

In [41]:
gb.apply(lambda sub_frame: sub_frame.sum(), include_groups=False)

Unnamed: 0_level_0,B,C
A,Unnamed: 1_level_1,Unnamed: 2_level_1
a,3,30
b,7,70
c,11,110


The `transform` method computes values per group without collapsing the result: the output keeps one row for each row of the input dataframe.

In [42]:
gb.transform("sum")

Unnamed: 0,B,C
0,3,30
1,3,30
2,7,70
3,7,70
4,11,110
5,11,110
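
Because the result of `transform` is aligned with the original rows, it can be combined with the source columns directly. The following sketch computes the share of each `C` value in its group total. The name `shares` is illustrative.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Group totals broadcast back to every row of the group.
group_totals = basic_frame.groupby("A")["C"].transform("sum")

# Row-wise division works because both series share the same index.
shares = basic_frame["C"] / group_totals
print(shares)
```

Within each group the shares should add up to one.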


## External group array

You can use an arbitrary array (that is not a column of the dataframe being grouped) for grouping.

In the following example, I use a list of markers to split the dataframe into two groups, `x` and `y`.

In [4]:
group_list = ["x", "x", "x", "y", "y", "y"]
display(HTML("<b>Input dataframe</b>"))
display(basic_frame)
display(HTML("<b>Group variable</b>"))
display(group_list)
basic_frame.groupby(group_list).sum()

Unnamed: 0,A,B,C
0,a,2,10
1,a,1,20
2,b,3,30
3,b,4,40
4,c,6,50
5,c,5,60


['x', 'x', 'x', 'y', 'y', 'y']

Unnamed: 0,A,B,C
x,aab,6,60
y,bcc,15,150


You can even combine two external variables.

In [5]:
group_list1 = ["x", "x", "x", "y", "y", "y"]
group_list2 = [1, 1, 2, 2, 2, 1]
display(HTML("<b>Group variables</b>"))
display(group_list1, group_list2)
basic_frame.groupby([group_list1, group_list2]).sum()

['x', 'x', 'x', 'y', 'y', 'y']

[1, 1, 2, 2, 2, 1]

Unnamed: 0,Unnamed: 1,A,B,C
x,1,aa,3,30
x,2,b,3,30
y,1,c,5,60
y,2,bc,10,90


Or mix external and internal variables in a single `groupby`.

In [6]:
group_list = ["x", "x", "x", "y", "y", "y"]
display(HTML("<b>Group variable</b>"))
display(group_list)
basic_frame.groupby([group_list, "A"]).sum()

['x', 'x', 'x', 'y', 'y', 'y']

Unnamed: 0_level_0,Unnamed: 1_level_0,B,C
Unnamed: 0_level_1,A,Unnamed: 2_level_1,Unnamed: 3_level_1
x,a,3,30
x,b,3,30
y,b,4,40
y,c,11,110
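
The external key does not have to be a plain list. As a sketch, the example below passes a named `pd.Series` instead, under the assumption that its index matches the index of the dataframe. The series name `marker` is made up for illustration, and it shows up as the name of the result index.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# External key as a named Series; it is aligned with the frame by index.
marker = pd.Series(["x", "x", "x", "y", "y", "y"], name="marker")

# Only numeric columns are selected, so "A" is not concatenated.
result = basic_frame.groupby(marker)[["B", "C"]].sum()
print(result)
```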


## `sum`

The basic function that allows you to get sums by groups.

### For `str` dtype

If you apply `sum` to a column of the `str` datatype, it concatenates the values within each group.

In the following example, exactly this happens to the `group text` column of the test dataframe.

In [10]:
test_df = pd.DataFrame({
    "group class": ["a", "a", "b", "b"],
    "group numeric": [3, 4, 5, 1],
    "group text": ["hello", "test", "line3", "superline"]
})
display(HTML("<b>Initial frame</b>"))
display(test_df)
display(HTML("<b>Aggregation result</b>"))
test_df.groupby("group class").sum()

Unnamed: 0,group class,group numeric,group text
0,a,3,hello
1,a,4,test
2,b,5,line3
3,b,1,superline


Unnamed: 0_level_0,group numeric,group text
group class,Unnamed: 1_level_1,Unnamed: 2_level_1
a,7,hellotest
b,6,line3superline
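
If the concatenation is not wanted, text columns can be left out of the result. The sketch below shows two ways to do it: the `numeric_only=True` argument of `sum`, and explicit column selection before aggregating. The names `numeric_sums` and `selected_sums` are illustrative.

```python
import pandas as pd

test_df = pd.DataFrame({
    "group class": ["a", "a", "b", "b"],
    "group numeric": [3, 4, 5, 1],
    "group text": ["hello", "test", "line3", "superline"],
})

# Skip every non-numeric column during aggregation.
numeric_sums = test_df.groupby("group class").sum(numeric_only=True)

# Or pick the columns to aggregate explicitly.
selected_sums = test_df.groupby("group class")[["group numeric"]].sum()

print(numeric_sums)
print(selected_sums)
```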


## Shift

The `shift` method matches each value with the previous or next value in the same group. See the corresponding [documentation page](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.shift.html).

---

The next cell applies `shift` to the basic frame.

In [10]:
basic_frame.groupby("A").shift(1)

Unnamed: 0,B,C
0,,
1,2.0,10.0
2,,
3,3.0,30.0
4,,
5,6.0,50.0


As a result, each index holds the "B" and "C" values of the row from the same "A" group that sits one position higher in the original table. The first row of each group has no predecessor, so it gets `NaN`.
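
A typical use of a grouped `shift` is comparing each row with its neighbour inside the group. The following sketch computes the change in `C` relative to the previous row of the same group, and also uses a negative period to look at the next row instead. The names `previous_c`, `delta`, and `next_c` are illustrative.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

gb_c = basic_frame.groupby("A")["C"]

# Previous value within the same group (NaN for the first row of a group).
previous_c = gb_c.shift(1)

# Change relative to the previous row of the same group.
delta = basic_frame["C"] - previous_c

# A negative period looks forward: next value within the same group.
next_c = gb_c.shift(-1)

print(delta)
print(next_c)
```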