# Groupby 4: The `LazyGroupBy` object
By the end of this lecture you will be able to:
- do `groupby` in lazy mode
- do aggregations on a `LazyGroupBy`
- inspect the optimized query plan
- profile a query


In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

## Creating a `LazyGroupBy` object

We create a `LazyGroupBy` object by calling `groupby` on a `LazyFrame`.

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
)

## Aggregations
The only way to do aggregations on a `LazyGroupBy` is with `agg`. We cannot call, for example, `.mean` directly on the output of `.groupby` as we can with an eager groupby.

Calling `agg` converts a `LazyGroupBy` to a `LazyFrame`.

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
    .agg(
        pl.col("Age").mean()
    )
)

### Query optimizations
We print the optimized plan for this groupby query.

In [None]:
print(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
    .agg(
        pl.col("Age").mean()
    )
    .explain()
)

In the optimized plan we have:
- `PROJECT 2/12 COLUMNS`, so Polars will only read the `Pclass` and `Age` columns from the CSV
- `Aggregate [col("Age").mean()] BY [col("Pclass")]`, so Polars will group by the `Pclass` column and take the `mean` of the `Age` column

As with any lazy query, we can evaluate this either all-at-once or in batches using streaming. To evaluate all-at-once call `collect`, and to evaluate with streaming call `collect(streaming=True)`.

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
    .agg(
        pl.col("Age").mean()
    )
    .collect(streaming=True)
)

Note: in tests on my own datasets I often find that setting `streaming=True` is 10-20% faster even when streaming is not strictly required, so it may be worth testing this on your data.

## Profiling a lazy query
For any lazy query, Polars can profile the query to show how long each part of the query takes.

We demonstrate this in the context of a lazy groupby.

The output of `profile` is a 2-element tuple. The first element of the tuple is the output of the query, the same output as we get from `collect`.

In [None]:
groupedDf, profileDf = (
    pl.scan_csv(csvFile)
    .groupby('Pclass')
    .agg(
        pl.col("Age").mean()
    )
    .profile()
)
groupedDf

The second element is a `DataFrame` with timings in microseconds for the start and end of each node in the optimized query plan.

In [None]:
profileDf

Polars can also generate a Gantt chart for the timings. For this you need to have Matplotlib installed.

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
    .agg(
        pl.col("Age").mean()
    )
    .sort("Pclass")
    .profile(show_plot=True, figsize=(6, 3))
)


In the chart we see that:
- optimization of the query takes relatively little time
- the groupby aggregation (in PIPELINE) is the largest component
- the sort at the end takes a non-negligible amount of time (about 10% of the total)

## Exercises
In the exercises you will develop your understanding of:
- creating a `LazyGroupBy`
- doing an aggregation on a `LazyGroupBy`
- interpreting optimized query plans
- profiling a query

## Exercise 1
Create a `LazyGroupBy` on the `Survived` and `Pclass` columns in a query that starts with scanning the CSV.

Aggregate the data by getting the minimum, average and maximum age per group

Evaluate the query

Evaluate the query and produce a profile plot

## Exercise 2
Create the query that has the following optimized plan:

```
  SORT BY [col("Decade")]
Aggregate
	[col("Fare").mean()] BY [[(col("Age")) / (10f64)].round().strict_cast(Int64).alias("Decade")] FROM   CSV SCAN ../data/titanic.csv
  PROJECT 2/12 COLUMNS
  SELECTION: Some(col("Age").is_not_null())
```

In [None]:
print(
    <blank>
    .explain()
)

## Solutions

## Solution to Exercise 1
Create a `LazyGroupBy` on the `Survived` and `Pclass` columns in a query that starts with scanning the CSV.

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
)

Exercise 1 cont: Aggregate the data by getting the minimum, average and maximum age per group

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        [
            pl.col("Age").min().suffix("_min"),
            pl.col("Age").mean().suffix("_mean"),
            pl.col("Age").max().suffix("_max"),
        ]
    )
)

Exercise 1 cont: Evaluate the query

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        [
            pl.col("Age").min().suffix("_min"),
            pl.col("Age").mean().suffix("_mean"),
            pl.col("Age").max().suffix("_max"),
        ]
    )
    .collect()
)

Exercise 1 cont: Evaluate the query and produce a profile plot

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        [
            pl.col("Age").min().suffix("_min"),
            pl.col("Age").mean().suffix("_mean"),
            pl.col("Age").max().suffix("_max"),
        ]
    )
    .profile(show_plot=True)
)

## Solution to Exercise 2
Create the query that has the following optimized plan:

```
  SORT BY [col("Decade")]
Aggregate
	[col("Fare").mean()] BY [[(col("Age")) / (10f64)].round().strict_cast(Int64).alias("Decade")] FROM   CSV SCAN ../data/titanic.csv
  PROJECT 2/12 COLUMNS
  SELECTION: Some(col("Age").is_not_null())
```

In [None]:
print(
    pl.scan_csv(csvFile)
    .filter(pl.col("Age").is_not_null())
    .groupby(
        (pl.col("Age")/10).round(0).cast(pl.Int64).alias("Decade")
    )
    .agg(
        pl.col("Fare").mean()
    )
    .sort("Decade",descending=True)
    # .collect()
    .explain()
)