# Groupby 4: The `LazyGroupBy` object
By the end of this section you will be able to:
- do `groupby` in lazy mode
- do aggregations on a `LazyGroupBy`
- inspect the optimized query plan
- profile a query


In [1]:
import polars as pl

In [2]:
csvFile = "../data/titanic.csv"

## Creating a `LazyGroupBy` object

We create a `LazyGroupBy` object by calling `groupby` on a `LazyFrame`.

In [3]:
(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
)

<polars.lazyframe.groupby.LazyGroupBy at 0x112c47400>

## Aggregations
The only way to do aggregations on a `LazyGroupBy` is with `agg`. We cannot call, for example, `.mean` directly on the result of `.groupby` as we can with an eager groupby.

Calling `agg` converts a `LazyGroupBy` to a `LazyFrame`.

In [4]:
(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
    .agg(
        pl.col("Age").mean()
    )
)

### Query optimizations
We print the optimized plan for this groupby query with `explain`.

In [5]:
print(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
    .agg(
        pl.col("Age").mean()
    )
    .explain()
)

AGGREGATE
	[col("Age").mean()] BY [col("Pclass")] FROM
	
  CSV SCAN ../data/titanic.csv
  PROJECT 2/12 COLUMNS


In the optimized plan we have:
- `PROJECT 2/12 COLUMNS`, so Polars will only read the `Pclass` and `Age` columns from the CSV
- `AGGREGATE [col("Age").mean()] BY [col("Pclass")]`, so Polars will group by the `Pclass` column and take the `mean` of the `Age` column

As with any lazy query, we can evaluate this either all at once or in batches using streaming. To evaluate all at once call `collect`, and to evaluate with streaming call `collect(streaming=True)`.

In [8]:
(
    pl.scan_csv(csvFile)
    .groupby('Pclass')
    .agg(
        pl.col("Age").mean()
    )
    .collect(streaming=True)
)

Pclass,Age
i64,f64
2,29.87763
1,38.233441
3,25.14062


In the chart we see that:
- the time required for optimization of the query is relatively small
- doing the groupby aggregation (in PIPELINE) is the largest component
- the sort at the end takes a non-negligible amount of time (about 10% of the total)

## Exercises
In the exercises you will develop your understanding of:
- creating a `LazyGroupBy`
- doing an aggregation on a `LazyGroupBy`
- interpreting optimized query plans

## Exercise 1
Create a `LazyGroupBy` on the `Survived` and `Pclass` columns in a query that starts with scanning the CSV

Aggregate the data by getting the minimum, average and maximum age per group

Evaluate the query

## Solutions

## Solution to Exercise 1
Create a `LazyGroupBy` on the `Survived` and `Pclass` columns in a query that starts with scanning the CSV

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
)

Exercise 1 cont: Aggregate the data by getting the minimum, average and maximum age per group

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        [
            pl.col("Age").min().suffix("_min"),
            pl.col("Age").mean().suffix("_mean"),
            pl.col("Age").max().suffix("_max"),
        ]
    )
)

Exercise 1 cont: Evaluate the query

In [None]:
(
    pl.scan_csv(csvFile)
    .groupby(["Survived","Pclass"])
    .agg(
        [
            pl.col("Age").min().suffix("_min"),
            pl.col("Age").mean().suffix("_mean"),
            pl.col("Age").max().suffix("_max"),
        ]
    )
    .collect()
)