# Groupby 1: The `GroupBy` object
By the end of this lecture you will be able to:
- group rows using `groupby`
- access data in each group
- do aggregations on the groups
- group by multiple columns
- do fast-track grouping on a sorted column


In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## The `GroupBy` object
Calling the `groupby` method with a column name creates a `GroupBy` object.

In [None]:
(
    df
    .groupby("Pclass")
)

We look at some of the methods we can call on the `GroupBy` object below.

## Iterating over groups
We can access the `DataFrame` for each group by looping over a `GroupBy` object.

When we iterate we get the group key and the sub-`DataFrame` for each group.

In this example we print the group key and the column means for each group

In [None]:
for groupKey,groupDf in df.groupby("Pclass"):
    print(groupKey)
    print(groupDf.mean())

## Group indices
We get the group keys and the *row indices* for each group with the `agg_groups` expression

In [None]:
(
    df
    .groupby("Pclass")
    .agg(
        pl.col("PassengerId").agg_groups()
    )
)

Because the grouping runs in parallel, the order of the groups in the output can change from run to run.

To get the same order each time, use the `maintain_order` argument in `groupby`

In [None]:
(
    df
    .groupby("Pclass",maintain_order=True)
    .agg(
        pl.col("PassengerId").agg_groups()
    )
)

We look at `agg` in more detail in the coming lectures.
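
To make the idea of group keys and row indices concrete, here is a minimal sketch in plain Python. It is not how Polars implements `groupby`: the `pclass` values and the `group_row_indices` helper are made up for illustration. A Python `dict` keeps insertion order, so the groups come out in order of first appearance, which is similar in spirit to the stable ordering we ask for with `maintain_order=True`.

```python
# Conceptual sketch only; this is NOT Polars' implementation.
# The sample values are made up for illustration.
pclass = [3, 1, 3, 1, 2, 3]


def group_row_indices(keys):
    """Map each distinct key to the list of row positions where it occurs."""
    groups = {}
    for row_index, key in enumerate(keys):
        groups.setdefault(key, []).append(row_index)
    return groups


row_groups = group_row_indices(pclass)
for key, indices in row_groups.items():
    print(key, indices)
```

Note that the result holds row positions, not the values of the column we grouped on or aggregated.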

## Group values
We use `head` to get the first rows in each group.

In this example we return a `DataFrame` with the first 2 rows from each group

In [None]:
(
    df
    .groupby("Pclass")
    .head(2)
)

## Aggregations
In eager mode we can call aggregations directly on the `GroupBy`

In this example, we count the number of rows per group and we get a single column of counts

In [None]:
(
    df
    .groupby("Pclass")
    .count()
)

We can also calculate aggregations on all columns

In [None]:
(
    df
    .groupby("Pclass")
    .mean()
)

We use the `agg` method for more flexibility in the next lecture. 

The methods we can call on `GroupBy` in eager mode are:
 - `first` get the first element of each group
 - `last` get the last element of each group
 - `n_unique` get the number of unique elements in each group
 - `count` get the number of elements in each group
 - `sum` sum the elements in each group
 - `min` get the smallest element in each group
 - `max` get the largest element in each group
 - `mean` get the average of elements in each group
 - `median` get the median in each group
 - `quantile` calculate quantiles in each group
 

## Groupby on multiple columns

We can do `groupby` on multiple columns by passing a list of column names

In [None]:
(
    df
    .groupby(["Pclass","Survived"])
    .mean()
)

In the following lecture we see how we can also pass expressions to `groupby`
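
As a rough mental model for grouping on several columns, we can think of each distinct combination of values as one composite key. The plain-Python sketch below illustrates this with made-up rows; the names `rows`, `ages_by_key` and `mean_age` are hypothetical and this is not how Polars does the work internally.

```python
from collections import defaultdict

# Made-up (Pclass, Survived, Age) rows for illustration only
rows = [
    (1, 1, 38.0),
    (3, 0, 22.0),
    (1, 1, 42.0),
    (3, 1, 26.0),
    (3, 0, 30.0),
]

# Each distinct (Pclass, Survived) pair acts as one group key
ages_by_key = defaultdict(list)
for pclass, survived, age in rows:
    ages_by_key[(pclass, survived)].append(age)

# Aggregate each group: here the mean age per composite key
mean_age = {key: sum(ages) / len(ages) for key, ages in ages_by_key.items()}
print(mean_age)
```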

## Groupby on a sorted column
In the lecture "Sorting and Fast-track algorithms" (in the "Selecting columns and transforming dataframes" section) we saw that Polars can use fast-track algorithms on a sorted column, provided it knows the column is sorted.

A fast-track algorithm can also be used if the groupby column is sorted. See Exercise 3 for an example of this (make sure you have done the Sorting and Fast-track algorithms lecture first).
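
To build some intuition for why sorted keys can help, the sketch below compares two ways of computing a per-group mean in plain Python. It is an illustration under our own assumptions, not Polars' actual algorithm: `mean_by_hashing` looks every row's key up in a hash table, while `mean_by_runs` relies on equal keys being adjacent and simply scans for runs. The data and both function names are made up.

```python
from itertools import groupby  # stdlib function, unrelated to Polars' groupby

# Made-up, pre-sorted ids with one value per row
ids = [0, 0, 1, 1, 1, 2]
values = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0]


def mean_by_hashing(ids, values):
    """General approach: look up every row's key in a hash table."""
    totals = {}
    for key, value in zip(ids, values):
        running_sum, count = totals.get(key, (0.0, 0))
        totals[key] = (running_sum + value, count + 1)
    return {key: s / n for key, (s, n) in totals.items()}


def mean_by_runs(ids, values):
    """Sorted shortcut: equal keys are adjacent, so scan for runs instead."""
    result = {}
    for key, run in groupby(zip(ids, values), key=lambda pair: pair[0]):
        run_values = [value for _, value in run]
        result[key] = sum(run_values) / len(run_values)
    return result


hashed_means = mean_by_hashing(ids, values)
run_means = mean_by_runs(ids, values)
print(hashed_means == run_means)
```

The run-based version only gives the right answer when the keys really are sorted, which is why Polars needs to know the column is sorted before it can take a shortcut of this kind. How much time it saves in practice depends on the data, which is what Exercise 3 measures.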

## Exercises
In the exercises you will develop your understanding of:
- creating a `GroupBy` object
- accessing data from the groups
- aggregating each group
- the effect of the fast-track algorithm on a sorted column

### Exercise 1
Create a `GroupBy` object by grouping on the `Survived` column

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

Create a `DataFrame` showing the row indexes for each group

Create a `DataFrame` with the *last* three rows from each group. Ensure the order of the `DataFrame` is the same each time you run the code.

Get the maximum value for each column in each group

### Exercise 2
Create a `GroupBy` object by grouping on the `Survived` and `Pclass` columns.

Call this object `survivedClassDf`

In [None]:
survivedClassDf = (
    pl.read_csv(csvFile)
    <blank>
)

Loop over the groups and print the `mean` of each group (the output of `print` will be an ASCII representation)

### Exercise 3
We look at the effect of sorting and the fast-track algorithm on a `groupby` operation.

We create a `DataFrame` with an `id` column of integers and a `values` column

- The `N` variable sets the number of rows in the `DataFrame`
- The `cardinality` sets the number of distinct group keys in the `id` column

We begin with a low cardinality and see the effect of increasing the cardinality later in the exercise.

We pre-sort the `id`s before creating the `DataFrame`

In [None]:
pl.Config.set_tbl_rows(4)
import numpy as np
np.random.seed(0)
N = 10_000_000
cardinality = 10
# Create a sorted array of id integers
sortedArray = np.sort(np.random.randint(0,cardinality,N))
df = (
    pl.DataFrame(
        {
            "id":[i for i in sortedArray],
            "values":np.random.standard_normal(N)
        }
    )
)
df.head(3)

Time how long it takes to groupby the `id` column and take the mean of the `values` column without any fast-track algorithm

In [None]:
%%timeit -n1 -r3
(
    df
    <blank>
)

Create a new `DataFrame` called `dfSorted` where we tell Polars the `id` column is sorted

In [None]:
dfSorted = (
    df
    <blank>
)
dfSorted["id"].flags

Time how long it takes to groupby the `id` column and take the mean of the `values` column **with** a fast-track algorithm

In [None]:
%%timeit -n1 -r3
(
    dfSorted
    <blank>
)

Compare the timings of the standard and fast-track algorithms when the cardinality of `id` is higher. Try:
- `cardinality = 1_000`
- `cardinality = 1_000_000`


## Solutions

### Solutions to Exercise 1
Create a `GroupBy` object by grouping on the `Survived` column

In [None]:
(
    pl.read_csv(csvFile)
    .groupby("Survived")
)

Create a `DataFrame` showing the row indexes for each group

In [None]:
(
    pl.read_csv(csvFile)
    .groupby("Survived")
    .agg(
        pl.col("PassengerId").agg_groups()
    )
)

Create a `DataFrame` with the *last* three rows from each group. Ensure the order of the `DataFrame` is the same each time you run the code.

In [None]:
(
    pl.read_csv(csvFile)
    .groupby("Survived",maintain_order=True)
    .tail(3)
)

Get the maximum value for each column in each group

In [None]:
(
    pl.read_csv(csvFile)
    .groupby("Survived",maintain_order=True)
    .max()
)

### Solutions to Exercise 2
Create a `GroupBy` object by grouping on the `Survived` and `Pclass` columns.

Call this object `survivedClassDf`

In [None]:
survivedClassDf = (
    pl.read_csv(csvFile)
    .groupby(["Survived","Pclass"])
)

Loop over the groups and print the `mean` of each group (the output of `print` will be an ASCII representation)

In [None]:
survivedClassDf = (
    pl.read_csv(csvFile)
    .groupby(["Survived","Pclass"])
)
for groupKey,groupDf in survivedClassDf:
    print(groupDf.mean())

### Solutions to Exercise 3
We look at the effect of sorting on a `groupby` operation.

We create a `DataFrame` with an `id` column of integers and a `values` column

- The `N` variable sets the number of rows in the `DataFrame`
- The `cardinality` sets the number of distinct `id`s

We pre-sort the `id`s before creating the `DataFrame`

In [None]:
pl.Config.set_tbl_rows(4)
import numpy as np
np.random.seed(0)
N = 10_000_000
cardinality = 1_000_000
# Create a sorted array of id integers
sortedArray = np.sort(np.random.randint(0,cardinality,N))
df = (
    pl.DataFrame(
        {
            "id":[i for i in sortedArray],
            "values":np.random.standard_normal(N)
        }
    )
)
df.head(3)

Time how long it takes to groupby the `id` column and take the mean of the `values` column without any fast-track algorithm

In [None]:
%%timeit -n1 -r3
(
    df
    .groupby("id")
    .agg(
        pl.col("values").mean()
    )
)

Create a new `DataFrame` called `dfSorted` where we tell Polars the `id` column is sorted

In [None]:
dfSorted = (
    df
    .with_columns(
        pl.col("id").set_sorted()
    )
)
dfSorted["id"].flags

Time how long it takes to groupby the `id` column and take the mean of the `values` column **with** a fast-track algorithm

In [None]:
%%timeit -n1 -r3
(
    dfSorted
    .groupby("id")
    .agg(
        pl.col("values").mean()
    )
)

Compare the difference in timings between the standard and fast-track algorithm when the cardinality of `id` is higher


The difference is much smaller when the cardinality of `id` is high, and the fast-track algorithm may even be slower than the standard one.