## Introduction to group operations
By the end of this lecture you will be able to:
- do group operations by a single column
- do group operations by multiple columns
- calculate percentage breakdowns within groups
- cache group operations with the query optimiser

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Group operations - groupby, aggregate and join
We want to add a column that holds the sum of the fares for all the passengers in the same class as that passenger.

To do this manually we must first group by passenger class and take the sum of the `Fare` column.

In [None]:
sumFareByClass = (
    df
    .groupby("Pclass")
    .agg(
        pl.col("Fare").sum().suffix("_sum")
        )
)
sumFareByClass

We then do a left join of the original `DataFrame` `df` with `sumFareByClass` (we cover joins in more detail in the section on combining data).

In [None]:
(
    df
    .join(
        sumFareByClass,
        on="Pclass",
        how="left"
    )
    .select(["PassengerId","Pclass","Fare","Fare_sum"])
    .head(3)
)

In Polars we can do this groupby-aggregate-join in a single step with the `over` expression.

In [None]:
(
    df
    .with_column(
        pl.col("Fare").sum().over("Pclass").alias("Fare_sum")
    )
    .select(["PassengerId","Survived","Pclass","Fare","Fare_sum"])
    .head(3)
)

The syntax for `over` is:
```python
(
    df
    .with_column(
        pl.col("Fare").sum().over("Pclass")
    )
)
```
which means:
- take the sum of the `Fare` column for each class in `Pclass`
- for each row, the value is the sum for the class that passenger belongs to

> In Pandas the equivalent method is `.groupby.transform`

## Group operation over multiple columns
We can also do group operations over multiple columns.

In this example we get the sum of `Fare` for each group of passengers, where the groups are defined by passenger class and whether the passenger survived.

In [None]:
(
    df
    .with_column(
        pl.col("Fare").sum().over(["Pclass","Survived"]).alias("Fare_sum")
    )
    .select(["PassengerId","Survived","Pclass","Fare","Fare_sum"])
    .head(3)
)

## Arithmetic in group operations
We calculate each passenger's fare as a percentage of the total fare paid by their passenger class.

In [None]:
(
    df
    .with_column(
        (100*(pl.col("Fare") / pl.col("Fare").sum().over("Pclass"))).alias("Fare_percent")
    )
    .select(["PassengerId","Survived","Pclass","Fare","Fare_percent"])
    .sample(5)
)

## Caching groups
When we compute a window expression over a column Polars calculates the groups for that column.

If we calculate multiple window expressions over the same column then Polars caches the groups on the first calculation to re-use them for the subsequent window expressions.

However, Polars can only do this if the window expressions are in the same `select` or `with_columns` statement.

We explore the effect of this caching in the exercises.

## Window expressions in lazy mode
With window expressions in lazy mode, Polars detects that only a subset of the columns is required and reads only these columns from the CSV (see `PROJECT` in the optimised query plan).

In [None]:
print(
    pl.scan_csv(csvFile)
    .with_columns(
        [
            (100*(pl.col("Fare") / pl.col("Fare").sum().over("Pclass"))).alias("Fare_percent"),
            (100*(pl.col("Fare") / pl.col("Fare").max().over("Pclass"))).alias("Fare_over_fare_max")
        ]
    )
    .select(["Fare","Fare_percent","Fare_over_fare_max"])
    .describe_optimized_plan()
)

## Exercises
In the exercises you will develop your understanding of:
- calculating window expressions on a single column
- calculating window expressions on multiple columns
- doing multiple window expressions in a single `with_columns` statement

### Exercise 1

Count the number of passengers in each combination of passenger class and survival. Name the column of counts `counts`.

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

Continue by calculating the percentage breakdown of passenger survival within each passenger class group. Call this column `percent`.

Sort the output by passenger class and survival

### Exercise 2
Window functions allow us to do multiple groupbys in the same `select` or `with_columns` statement. Polars can cache the groups when the window expressions are in the same `with_columns` statement.

In this exercise we explore the effect of this caching on performance.

We begin by creating a `DataFrame` with groups and values

In [None]:
import numpy as np
np.random.seed(0)

N = 1_000_000
cardinality = N // 2
groups = np.random.randint(0,cardinality,N)
df = pl.DataFrame(
        {
            "groups":groups,
            "values":np.random.standard_normal(N)
        }
    )
df.head(3)

We want to add: 
- a `max` column with the maximum value per group and 
- a `min` column with the minimum value per group.


Time how long this takes with two `with_column` statements

In [None]:
%%timeit -n1 -r3
(
    df
    <blank>
)

Time how long this takes in a single `with_columns` statement

In [None]:
%%timeit -n1 -r3
(
    df
    <blank>
)

Can Polars cache the window expressions across `with_column` statements in lazy mode?

In [None]:
%%timeit -n1 -r3
(
    df
    .lazy()
    <blank>
)

## Solutions

### Solution to exercise 1

Count the number of passengers in each group of passenger class and survival

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(["Pclass","Survived"])
    .agg(
        pl.col("Name").count().alias("counts")
    )
)

Calculate the percentage breakdown of passenger survival within each passenger class group. Calculate the percentage as 0-100.

Sort the output by passenger class and survival

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(["Pclass","Survived"])
    .agg(
        pl.col("Name").count().alias("counts")
    )
    .with_column(
        (100*(pl.col("counts")/pl.col("counts").sum().over("Pclass"))).round(3).alias("percent")
    )
    .sort(["Pclass","Survived"])
)

### Solution to exercise 2

Window functions allow us to do multiple groupbys in the same `select` or `with_columns` statement. Polars can cache the groups when the window expressions are in the same `with_columns` statement.

In this exercise we explore the effect of this caching on performance.

We begin by creating a `DataFrame` with groups and values

In [None]:
import numpy as np
np.random.seed(0)

N = 1_000_000
cardinality = N // 2
groups = np.random.randint(0,cardinality,N)
df = pl.DataFrame(
        {
            "groups":groups,
            "values":np.random.standard_normal(N)
        }
    )
df.head(3)

We want to add a `max` column with the maximum value per group and a `min` column with the minimum value per group.


Do this with two `with_column` statements

In [None]:
%%timeit -n1 -r3
(
    df
    .with_column(
        pl.col("values").max().over("groups").alias("max")
    )
    .with_column(
        pl.col("values").min().over("groups").alias("min")
    )
)

Do this in a single `with_columns` statement

In [None]:
%%timeit -n1 -r3
(
    df
    .with_columns(
        [
            pl.col("values").max().over("groups").alias("max"),
            pl.col("values").min().over("groups").alias("min")
        ]
    )
)

Can Polars cache the window expressions across `with_column` statements in lazy mode?

In [None]:
%%timeit -n1 -r3
(
    df
    .lazy()
    .with_column(
        pl.col("values").max().over("groups").alias("max")
    )
    .with_column(
        pl.col("values").min().over("groups").alias("min")
    )
    .collect()
)

Not at this point!