# Groupby

`pd.DataFrame.groupby` is a very useful tool, but working with it can be confusing. This page takes a closer look at some of its functions and edge cases.

The most useful page for learning is <a href="https://pandas.pydata.org/docs/reference/groupby.html">GroupBy object in pandas documentation</a>.

In [1]:
import pandas as pd
from IPython.display import HTML

## `groupby` parameters

This section describes the general parameters of the `groupby` method that affect the result regardless of the specific transformation applied afterwards.

| Argument     | Description                                                                                                                                                   |
| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `by`         | Mapping, function, label, or list of labels. What to group by (e.g., a column name, a list of column names, or a function to transform the index or columns). |
| `axis`       | `{0 or 'index', 1 or 'columns'}`, default `0`. Whether to group rows (`0`) or columns (`1`).                                                                  |
| `level`      | If the axis is a MultiIndex, group by a specific level or levels.                                                                                             |
| `as_index`   | `bool`, default `True`. If `True`, the group labels become the index; if `False`, they remain columns.                                                        |
| `sort`       | `bool`, default `True`. Sort group keys.                                                                                                                      |
| `group_keys` | `bool`, default `True`. When calling `apply`, adds the group keys to the result index.                                                                       |
| `observed`   | `bool`, default `False`. For categorical groupers: if `True`, only show observed groups.                                                                      |
| `dropna`     | `bool`, default `True`. If `True`, do not include groups whose key is `NaN`.                                                                                  |

See the [corresponding page](groupby/groupby_parameters.ipynb) for a closer look at some of them.
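
As a quick illustration of the table above, here is a minimal sketch of how `as_index`, `sort`, and `dropna` change the result. The frame and the names `frame`, `default_result`, `flat_result`, and `unsorted_with_nan` are made up for this example.

```python
import pandas as pd

# Hypothetical frame with one missing key.
frame = pd.DataFrame({
    "key": ["b", "a", None, "b"],
    "value": [1, 2, 3, 4],
})

# Defaults: keys are sorted, become the index, and the missing key is dropped.
default_result = frame.groupby("key")["value"].sum()

# as_index=False keeps the group key as a regular column.
flat_result = frame.groupby("key", as_index=False)["value"].sum()

# sort=False keeps groups in order of first appearance;
# dropna=False keeps the group whose key is missing.
unsorted_with_nan = frame.groupby("key", sort=False, dropna=False)["value"].sum()

print(default_result)
print(flat_result)
print(unsorted_with_nan)
```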

## Usage options

This section covers ways to compute things over groups and practical ways to operate with `groupby` objects.

Check more details on the [corresponding page](groupby/usage_options.ipynb).

---

The following cell defines the data frame and the `groupby` object used in all the examples below, so that each example shows a different way to use the same tool.

In [None]:
basic_frame = pd.DataFrame({
    'A': ['a', 'a', 'b', 'b', 'c', 'c'],
    'B': [2, 1, 3, 4, 6, 5],
    'C': [10, 20, 30, 40, 50, 60]
})
gb = basic_frame.groupby("A")

The classic option is to call the aggregation method directly on the `groupby` object:

In [None]:
gb.sum()

| A   | B   | C   |
| --- | --- | --- |
| a   | 3   | 30  |
| b   | 7   | 70  |
| c   | 11  | 110 |


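Iterating over the `groupby` object yields `(key, sub-frame)` pairs, which lets you work with the subset that corresponds to each group. Note that `sum` is applied to the string column `A` as well, so its values are concatenated.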

In [33]:
pd.DataFrame({sub_frame[0]: sub_frame[1].sum() for sub_frame in gb})

|     | a   | b   | c   |
| --- | --- | --- | --- |
| A   | aa  | bb  | cc  |
| B   | 3   | 7   | 11  |
| C   | 30  | 70  | 110 |
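
The same loop can be written with tuple unpacking, which avoids indexing into each item. This is a small sketch that rebuilds `basic_frame` so it runs on its own; selecting only the numeric columns is a choice made here to keep the string column `A` out of the sums, and the name `per_group` is made up for the example.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Each iteration yields the group key and the rows that belong to it.
per_group = {}
for key, sub_frame in basic_frame.groupby("A"):
    per_group[key] = sub_frame[["B", "C"]].sum()

# Keys become columns, the summed column names become the index.
result = pd.DataFrame(per_group)
print(result)
```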


The `agg` method lets you set a separate aggregation for each column through a dictionary.

In [39]:
gb.agg({"B": "sum", "C": "sum"})

| A   | B   | C   |
| --- | --- | --- |
| a   | 3   | 30  |
| b   | 7   | 70  |
| c   | 11  | 110 |
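
Besides a dictionary, `agg` also accepts keyword arguments of the form `new_name=(column, function)`, known as named aggregation. The sketch below uses it to apply different functions to the columns and to name the outputs; the output names `b_total` and `c_mean` are made up for the example.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# Named aggregation: each keyword defines one output column.
summary = basic_frame.groupby("A").agg(
    b_total=("B", "sum"),
    c_mean=("C", "mean"),
)
print(summary)
```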


The `apply` method of the `groupby` object lets you specify a function to be applied to each subset.

In [41]:
gb.apply(lambda sub_frame: sub_frame.sum(), include_groups=False)

| A   | B   | C   |
| --- | --- | --- |
| a   | 3   | 30  |
| b   | 7   | 70  |
| c   | 11  | 110 |
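
`apply` is most useful when the per-group computation is not a single built-in aggregation. As a sketch, the following computes the range (maximum minus minimum) of each column within each group. Selecting the value columns before calling `apply` is one way to keep the grouping column out of the function's input; the name `value_range` is made up for the example.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# The function receives a sub-frame with columns B and C for each group
# and returns a Series, so the result has one row per group.
value_range = basic_frame.groupby("A")[["B", "C"]].apply(
    lambda sub_frame: sub_frame.max() - sub_frame.min()
)
print(value_range)
```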


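The `transform` method computes values per group without collapsing the result to one row per group: the output keeps the index of the original data frame, and each row receives the value computed for its group.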

In [42]:
gb.transform("sum")

|     | B   | C   |
| --- | --- | --- |
| 0   | 3   | 30  |
| 1   | 3   | 30  |
| 2   | 7   | 70  |
| 3   | 7   | 70  |
| 4   | 11  | 110 |
| 5   | 11  | 110 |
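
Because the result of `transform` is aligned with the original rows, it can be combined directly with the source columns. A minimal sketch, computing each row's share of its group total for column `C` (the names `group_total` and `share` are made up for the example):

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

# One group total per original row, aligned on the original index.
group_total = basic_frame.groupby("A")["C"].transform("sum")

# Share of each row within its group; shares in a group add up to 1.
share = basic_frame["C"] / group_total
print(share)
```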


## `sum`

The `sum` method is the basic way to get sums within each group.

### For `str` dtype

If you apply the `sum` function to a column with the `str` datatype, it concatenates the values within each group.

In the following example, this is exactly what happens to the `group text` column of the test data frame.

In [10]:
test_df = pd.DataFrame({
    "group class" : ["a", "a", "b", "b"],
    "group numeric" : [3,4,5,1],
    "group text" : ["hello", "test", "line3", "superline"]
})
display(HTML("<b>Initial frame</b>"))
display(test_df)
display(HTML("<b>Aggregation result</b>"))
test_df.groupby("group class").sum()

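**Initial frame**

|     | group class | group numeric | group text |
| --- | ----------- | ------------- | ---------- |
| 0   | a           | 3             | hello      |
| 1   | a           | 4             | test       |
| 2   | b           | 5             | line3      |
| 3   | b           | 1             | superline  |

**Aggregation result**

| group class | group numeric | group text     |
| ----------- | ------------- | -------------- |
| a           | 7             | hellotest      |
| b           | 6             | line3superline |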
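
If the concatenated text column is not wanted, two common options are to pass `numeric_only=True` or to select the numeric column before aggregating. A short sketch on the same test frame (the names `numeric_sum` and `selected_sum` are made up for the example):

```python
import pandas as pd

test_df = pd.DataFrame({
    "group class": ["a", "a", "b", "b"],
    "group numeric": [3, 4, 5, 1],
    "group text": ["hello", "test", "line3", "superline"],
})

# Option 1: skip non-numeric columns during aggregation.
numeric_sum = test_df.groupby("group class").sum(numeric_only=True)

# Option 2: select the column of interest first.
selected_sum = test_df.groupby("group class")["group numeric"].sum()

print(numeric_sum)
print(selected_sum)
```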


## `shift`

The `shift` method matches each value with the previous or next value in the same group. See the corresponding [documentation page](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.shift.html).

---

The next cell applies `shift` to `basic_frame`, grouped by column `A`.

In [10]:
basic_frame.groupby("A").shift(1)

|     | B   | C    |
| --- | --- | ---- |
| 0   | NaN | NaN  |
| 1   | 2.0 | 10.0 |
| 2   | NaN | NaN  |
| 3   | 3.0 | 30.0 |
| 4   | NaN | NaN  |
| 5   | 6.0 | 50.0 |


As a result, each index gets the "B" and "C" values of the row one position higher in the original table within the same "A" group. The first row of each group has no predecessor, so it is filled with `NaN`.
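
A typical use of a grouped `shift` is comparing each row with its neighbour in the same group. The sketch below computes the change in `B` relative to the previous row of the group and uses a negative period to look at the next row instead; the names `previous_b`, `change`, and `next_b` are made up for the example.

```python
import pandas as pd

basic_frame = pd.DataFrame({
    "A": ["a", "a", "b", "b", "c", "c"],
    "B": [2, 1, 3, 4, 6, 5],
    "C": [10, 20, 30, 40, 50, 60],
})

grouped_b = basic_frame.groupby("A")["B"]

# Value of B in the previous row of the same group (NaN for the first row).
previous_b = grouped_b.shift(1)

# Change relative to the previous row within the group.
change = basic_frame["B"] - previous_b

# A negative period looks forward: value of B in the next row of the group.
next_b = grouped_b.shift(-1)

print(pd.DataFrame({"B": basic_frame["B"], "change": change, "next_b": next_b}))
```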