# Group-by aggregations
By the end of this lecture you will be able to:
- do a group-by aggregation
- group by multiple columns
- sort group-by outputs
- group on a sorted column

In [6]:
import polars as pl

In [7]:
df = pl.read_csv("../../Files/Sample_Superstore.csv")

In [8]:
df.head(3)

Row_ID,Order_ID,Order_Date,Ship_Date,Ship_Mode,Customer_ID,Customer_Name,Segment,Country,City,State,Postal_Code,Region,Product_ID,Category,Sub_Category,Product_Name,Sales,Quantity,Discount,Profit
i64,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,i64,f64,f64
1,,,"""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-BO-10001798""","""Furniture""","""Bookcases""","""Bush Somerset Collection Bookc…",261.96,2,0.0,41.9136
2,"""CA-2016-152156""","""08-11-2016""","""11-11-2016""","""Second Class""","""CG-12520""","""Claire Gute""","""Consumer""","""United States""","""Henderson""","""Kentucky""",42420,"""South""","""FUR-CH-10000454""","""Furniture""","""Chairs""","""Hon Deluxe Fabric Upholstered …",731.94,3,0.0,219.582
3,"""CA-2016-138688""","""12-06-2016""",,,"""DV-13045""","""Darrin Van Huff""","""Corporate""",,"""Los Angeles""","""California""",90036,"""West""","""OFF-LA-10000240""","""Office Supplies""","""Labels""","""Self-Adhesive Address Labels f…",14.62,2,0.0,6.8714


## Group-by and aggregation
In Polars we can group by a column and aggregate the data in other columns with the `group_by.agg` combination.

In this example we group by the `Category` column and count the non-null values in the `Profit` column for each group.

In [11]:
(
    df
    .group_by("Category")
    .agg(
        pl.col("Profit").count()
    )
)

Category,Profit
str,u32
"""Technology""",1845
"""Furniture""",2119
,8
"""Office Supplies""",6022


> Why `group_by` and not `groupby`? The Polars API aims to be readable, and one of its conventions is to separate words with `_`.

Almost everything we do after this will be some variation on this basic pattern of `group_by` and `agg`.

Note that we passed an aggregation expression `pl.col("Profit").count()` inside `agg` to get a single value for each group.

The same pattern works with other aggregations. The next three cells take the `min`, `max` and `sum` of `Profit` for each `Category`.


In [18]:
(
    df
    .group_by("Category")
    .agg(
        pl.col("Profit").min()
    )
)

Category,Profit
str,f64
"""Office Supplies""",-3701.8928
"""Furniture""",-1862.3124
"""Technology""",-6599.978


In [19]:
(
    df
    .group_by("Category")
    .agg(
        pl.col("Profit").max()
    )
)

Category,Profit
str,f64
"""Furniture""",1013.127
"""Office Supplies""",4946.37
"""Technology""",8399.976


In [20]:
(
    df
    .group_by("Category")
    .agg(
        pl.col("Profit").sum()
    )
)

Category,Profit
str,f64
"""Office Supplies""",122490.8008
"""Furniture""",18451.2728
"""Technology""",145454.9481


If we instead pass an expression that does not aggregate, such as a bare `pl.col("Profit")`, the `Profit` column in the output is a `pl.List` column with all the values for each group on each row.


## What happens when we run `group_by.agg`?
While the full workings are more complicated than this, a basic description of the internal flow is:
- when we call `.group_by`, Polars creates a `GroupBy` object that captures the group-by parameters (e.g. the columns to group by) but **does not calculate the groups** until a further method (such as `agg`) is called on it
- when we call `agg` on the `GroupBy` object, Polars:
    - calculates the groups by getting the row indexes for each group
    - applies the expressions in `agg` to each group
    - joins the outputs of the expressions back to each group to create the output `DataFrame`


## Grouping by multiple columns
We can group by multiple columns either by passing a `list` of column names to `group_by` or by passing the column names as separate, comma-separated arguments.

In [21]:
(
    df
    .group_by("Category","Region")
    .agg(
        pl.col("Profit").sum()
    )
) 

Category,Region,Profit
str,str,f64
"""Office Supplies""","""Central""",8879.9799
"""Furniture""","""Central""",-2871.0494
"""Technology""","""East""",47462.0351
"""Technology""","""Central""",33697.432
"""Office Supplies""","""South""",19986.3928
…,…,…
"""Office Supplies""","""East""",41014.5791
"""Furniture""","""East""",3046.1658
"""Technology""","""South""",19991.8314
"""Furniture""","""West""",11504.9503


We can also use expressions inside `group_by` - in fact when we pass column names as strings (as above) Polars converts these to expressions internally.

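Because we can pass expressions to `group_by`, we can also group by a transformed column. Here, for example, we group by the `Row_ID` column with its values cast to `pl.Int64`. The column is already `i64` in this dataset, so the cast only illustrates the syntax.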

In [24]:
(
    df
    .group_by(pl.col("Row_ID").cast(pl.Int64))
    .agg(
        pl.col("Profit").max()
    )
    .head()
)

Row_ID,Profit
i64,f64
926,5.4432
6175,226.3626
8530,-36.2136
798,6.2208
5395,235.9524


In [25]:
(
    df
    .group_by("Category","Region")
    .agg(
        pl.col("Profit").min()
    )
)

Category,Region,Profit
str,str,f64
"""Technology""","""East""",-6599.978
"""Office Supplies""","""South""",-1306.5504
"""Technology""","""South""",-3839.9904
"""Technology""","""Central""",-1359.992
"""Furniture""","""South""",-1862.3124
…,…,…
"""Office Supplies""","""Central""",-3701.8928
"""Office Supplies""","""West""",-694.2936
"""Furniture""","""West""",-814.4832
"""Furniture""","""East""",-1665.0522


In [26]:
(
    df
    .group_by("Category","Region")
    .agg(
        pl.col("Profit").sum()
    )
)

Category,Region,Profit
str,str,f64
"""Office Supplies""","""South""",19986.3928
"""Furniture""","""West""",11504.9503
"""Technology""","""South""",19991.8314
"""Office Supplies""","""Central""",8879.9799
"""Technology""","""East""",47462.0351
…,…,…
"""Furniture""","""East""",3046.1658
"""Office Supplies""","""West""",52609.849
"""Technology""","""West""",44303.6496
"""Office Supplies""","""East""",41014.5791


## Ordering of the output
We have seen that the rows of the output `DataFrame` can come back in a different order each time we run the query. This happens because Polars works out the row indexes for the group keys in parallel. This means that Polars:
- splits the group columns into chunks (e.g. first 10 rows in one chunk, second 10 rows in another chunk, etc.)
- finds the row indexes within each chunk on a separate thread
- brings the results from the different threads back together


We can force the order of the output to match the order in which the group keys first occur in the input with the `maintain_order` argument.

In [29]:
(
    df
    .group_by("Category", maintain_order=True)
    .agg(
        pl.col("Profit").mean()
    )
)

Category,Profit
str,f64
"""Furniture""",8.699327
"""Office Supplies""",20.32705
"""Technology""",78.752002


The first row is the `Furniture` group because the first row of `df` has `Furniture` in the `Category` column, and so on.

Setting `maintain_order=True` affects performance to some extent. We also cannot use the streaming engine for large datasets when `maintain_order=True`.



## Group by on a list
We can group by a list column in the same way as a non-list column.

First we create a `DataFrame` with a `pl.List` column and add a row index.

In [30]:
list_df = pl.DataFrame(
    {
        "lists": [
            ["a", "b"],
            ["a", "c"],
            ["a", "b"],
        ]
    }
)

df_lists = (
    list_df
    .with_row_index()
)


In [17]:
df_lists

index,lists
u32,list[str]
0,"[""a"", ""b""]"
1,"[""a"", ""c""]"
2,"[""a"", ""b""]"


Then we `group_by` the list column and count the number of occurrences of each list.

In [31]:
df_lists.group_by("lists").len()

lists,len
list[str],u32
"[""a"", ""c""]",1
"[""a"", ""b""]",2
