# Introduction to `groupby_dynamic`
By the end of this session you will be able to:

- do groupby and aggregations using `groupby_dynamic`
- use `groupby_dynamic` on multiple columns
- use `groupby_dynamic` in lazy mode


In [None]:
from datetime import datetime

import polars as pl

We use a time series of NYC taxi pickups.

In [None]:
csvFile = "../data/nyc_trip_data_1k.csv"

In [None]:
df = pl.read_csv(csvFile, parse_dates=True)
df.head()

## Groupby with datetime components and `groupby`
The simplest way to do a groupby on a time series is to:
- create the datetime components of interest
- do a `groupby` on these components

In this example we get the average trip distance by day of the week.

In [None]:
(
    df
    .groupby(
        pl.col("pickup").dt.weekday().alias("weekday")
    )
    .agg(
        pl.col("trip_distance").mean().round(1)
    )
    .sort("weekday")
)


## Groupby with `groupby_dynamic`
With `groupby_dynamic` we can work directly with the values in the date column.

**For `groupby_dynamic` the date column must be sorted in ascending order.** We do not need to use `set_sorted` on the date column.

No `Exception` will be raised if the dates are not sorted, but the answers will probably be wrong.

To check that the column is sorted, we test whether the minimum difference between consecutive records is greater than or equal to 0.

In [None]:
(
    df
    .select(
        pl.col("pickup").diff().min() >= 0
    )
)
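
The logic of this check can be spelled out in plain Python. The sketch below is only an illustration on a small, made-up list of timestamps; the helper name `is_sorted_ascending` is hypothetical and is not part of Polars.

```python
from datetime import datetime, timedelta


def is_sorted_ascending(timestamps):
    """Return True if no timestamp is earlier than the one before it."""
    # Differences between consecutive records (the equivalent of `diff`)
    diffs = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    # Sorted in ascending order means every difference is >= 0
    return all(d >= timedelta(0) for d in diffs)


# Hypothetical sample data
sorted_times = [
    datetime(2022, 1, 1, 0, 0),
    datetime(2022, 1, 1, 0, 5),
    datetime(2022, 1, 1, 0, 5),  # ties are allowed
    datetime(2022, 1, 1, 1, 0),
]
unsorted_times = [
    datetime(2022, 1, 1, 1, 0),
    datetime(2022, 1, 1, 0, 5),
]

print(is_sorted_ascending(sorted_times))
print(is_sorted_ascending(unsorted_times))
```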

In its simplest form we pass `groupby_dynamic` the name of the datetime column to group on and set the length of the grouping window with the `every` argument.

In [None]:
(
    df
    .groupby_dynamic(
        "pickup", 
        every="1d"
    )
    .agg(
        pl.col("trip_distance").mean().round(1)
    )
    .head(5)
)
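
To make the idea of a window concrete, here is a minimal plain-Python sketch of what grouping with `every="1d"` does conceptually: each timestamp is truncated to the start of its day, and the distances that share a window start are averaged. The sample rows and the helper `daily_means` are hypothetical, and this is not how Polars implements the operation internally.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample data: (pickup time, trip distance), sorted by pickup time
trips = [
    (datetime(2022, 1, 1, 0, 30), 2.0),
    (datetime(2022, 1, 1, 18, 45), 4.0),
    (datetime(2022, 1, 2, 9, 15), 1.5),
    (datetime(2022, 1, 2, 23, 59), 2.5),
    (datetime(2022, 1, 3, 12, 0), 7.0),
]


def daily_means(rows):
    """Average the distances that fall within each one-day window."""
    buckets = defaultdict(list)
    for pickup, distance in rows:
        # Truncating to midnight gives the start of the window the row belongs to
        window_start = pickup.replace(hour=0, minute=0, second=0, microsecond=0)
        buckets[window_start].append(distance)
    # One output row per window, ordered by window start
    return {
        start: round(sum(values) / len(values), 1)
        for start, values in sorted(buckets.items())
    }


result = daily_means(trips)
for start, mean_distance in result.items():
    print(start, mean_distance)
```

Each key in `result` plays the role of the `pickup` value in the output above: it labels the window rather than any individual record.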

We look at how the windows are specified in more detail in the next lecture.

## `DynamicGroupBy` object

When we do `groupby_dynamic` we create a `DynamicGroupBy` object.

In [None]:
(
    df
    .groupby_dynamic(
        "pickup", 
        every="1d"
    )
)

To do aggregations on a `DynamicGroupBy` we call `agg`. We cannot call aggregation methods like `count` or `sum` on a `DynamicGroupBy` directly.

## Dynamic groupby on multiple columns
To illustrate dynamic groupby on multiple columns we again use the `DataFrame` of NYC taxi data.

In this example we group by the number of passengers and by 3-hourly windows, and get the average trip distance.

In [None]:
(
    df
    .groupby_dynamic("pickup", every="3h", by="passenger_count")
    .agg(
        pl.col("trip_distance").mean().round(1)
    )
    .sort("trip_distance", reverse=True)
    .head()
)

Notice the order of the columns - Polars first groups by `passenger_count` and then does `groupby_dynamic` on each of those groups.
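
The two-level grouping can be sketched in plain Python as follows. This is a conceptual illustration only: the rows are made up, the helpers `window_start_3h` and `grouped_window_means` are hypothetical, and the sketch assumes that the 3-hour windows are aligned to midnight.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample data: (pickup time, passenger count, trip distance)
rows = [
    (datetime(2022, 1, 1, 0, 10), 1, 2.0),
    (datetime(2022, 1, 1, 1, 50), 1, 4.0),
    (datetime(2022, 1, 1, 2, 5), 2, 3.0),
    (datetime(2022, 1, 1, 3, 30), 1, 6.0),
    (datetime(2022, 1, 1, 4, 0), 2, 1.0),
    (datetime(2022, 1, 1, 5, 0), 2, 2.0),
]


def window_start_3h(ts):
    """Truncate a timestamp to the start of its 3-hour window (assumed aligned to midnight)."""
    return ts.replace(hour=ts.hour - ts.hour % 3, minute=0, second=0, microsecond=0)


def grouped_window_means(records):
    """Group by passenger count first, then by 3-hour window, and average the distances."""
    buckets = defaultdict(list)
    for pickup, passengers, distance in records:
        # The group key comes first and the window start second, as in the output columns
        buckets[(passengers, window_start_3h(pickup))].append(distance)
    return {
        key: round(sum(values) / len(values), 1)
        for key, values in sorted(buckets.items())
    }


means = grouped_window_means(rows)
for (passengers, start), mean_distance in means.items():
    print(passengers, start, mean_distance)
```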

We can also use expressions when grouping by another column - see the exercises.

## Dynamic groupby in lazy mode
When we do `groupby_dynamic` in lazy mode, the Polars query optimiser sees that only a subset of the columns is required and reads only those columns from the CSV (`PROJECT 3/7 COLUMNS` below).

In [None]:
print(
    pl.scan_csv(csvFile, parse_dates=True)
    .groupby_dynamic("pickup", every="3h", by="passenger_count")
    .agg(
        pl.col("trip_distance").mean().round(1)
    )
    .describe_optimized_plan()
)

## Exercises
In the exercises you will develop your understanding of:
- doing `groupby_dynamic` on a single column
- doing `groupby_dynamic` on multiple columns
- the relative performance of `groupby_dynamic` and `groupby`

### Exercise 1
Groupby the `pickup` column on a 6-hourly basis.

Get the count, mean and max of the trip distance for each window.

Sort the output by the mean trip distance with the largest values first.

Filter out all windows with fewer than 5 records.

### Exercise 2

Get the same statistics but also group by the Vendor ID

Get the same statistics (`count`,`max` and `mean`) but group by both:
- the Vendor ID and 
- the `trip_distance` where the `trip_distance` is cast to a 16-bit integer before grouping

## Solutions

### Solution to exercise 1
Groupby the `pickup` column on a 6-hourly basis.

Get the count, mean and max of the trip distance for each window.

Sort the output by the mean trip distance with the largest values first

In [None]:
(
    pl.read_csv(csvFile, parse_dates=True)
    .groupby_dynamic("pickup", every="6h")
    .agg(
        [
            pl.col("trip_distance").count().alias("count"),
            pl.col("trip_distance").mean().alias("mean"),
            pl.col("trip_distance").max().alias("max"),
        ]
    )
    # Largest mean trip distance first
    .sort("mean", reverse=True)
)
    

Filter out all windows with fewer than 5 records.

In [None]:
(
    pl.read_csv(csvFile, parse_dates=True)
    .groupby_dynamic("pickup", every="6h")
    .agg(
        [
            pl.col("trip_distance").count().alias("count"),
            pl.col("trip_distance").mean().alias("mean"),
            pl.col("trip_distance").max().alias("max"),
        ]
    )
    # Keep only windows with at least 5 records
    .filter(pl.col("count") >= 5)
    .sort("mean", reverse=True)
    .head()
)
    

### Solution to exercise 2

Get the same statistics but also group by the Vendor ID

In [None]:
(
    pl.read_csv(csvFile, parse_dates=True)
    # Group by vendor first, then by 6-hour windows within each vendor
    .groupby_dynamic("pickup", every="6h", by="VendorID")
    .agg(
        [
            pl.col("trip_distance").count().alias("count"),
            pl.col("trip_distance").mean().alias("mean"),
            pl.col("trip_distance").max().alias("max"),
        ]
    )
    .filter(pl.col("count") >= 5)
    .sort("mean", reverse=True)
    .head(3)
)
    

Get the same statistics (`count`, `max` and `mean`), here computed on `passenger_count`, but group by both:
- the Vendor ID and 
- the `trip_distance` where the `trip_distance` is cast to a 16-bit integer before grouping

In [None]:
(
    pl.read_csv(csvFile, parse_dates=True)
    .groupby_dynamic(
        "pickup",
        every="6h",
        # Group by a column name and by an expression
        by=["VendorID", pl.col("trip_distance").cast(pl.Int16)],
    )
    .agg(
        [
            pl.col("passenger_count").count().alias("count"),
            pl.col("passenger_count").mean().alias("mean"),
            pl.col("passenger_count").max().alias("max"),
        ]
    )
    .sort("mean", reverse=True)
    .head()
)
    