# Introduction to `group_by_dynamic`

In [1]:
from datetime import datetime

import polars as pl

In [2]:
csv_file = "data/nyc_trip_data_1k.csv"

In [3]:
df = pl.read_csv(csv_file, try_parse_dates=True)
df.head()

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount
str,datetime[μs],datetime[μs],f64,f64,f64,f64
"""id1""",2022-01-01 00:04:14,2022-01-01 00:26:12,1.0,10.83,31.0,0.0
"""id2""",2022-01-01 00:32:17,2022-01-01 00:49:23,1.0,3.97,14.5,3.66
"""id8""",2022-01-01 00:40:58,2022-01-01 01:00:59,4.0,8.44,25.5,0.0
"""id0""",2022-01-01 00:55:13,2022-01-01 01:25:49,1.0,12.61,37.5,12.39
"""id1""",2022-01-01 00:55:24,2022-01-01 01:00:45,1.0,1.49,6.5,0.0


## Temporal aggregation with datetime components and `group_by`

In [4]:
df.group_by(
    pl.col("pickup").dt.date().alias("date")
).agg(
    pl.col("trip_distance").mean().round(1)
).sort("date")

date,trip_distance
date,f64
2022-01-01,4.1
2022-01-02,6.6
2022-01-03,3.3
2022-01-04,2.6
2022-01-05,2.6
…,…
2022-01-11,2.4
2022-01-12,2.7
2022-01-13,3.5
2022-01-14,2.7


## Temporal group by with `group_by_dynamic`

With this approach, Polars works in two steps:
- it uses the input parameters to create the time window boundaries
- it then finds all the rows that fall within each window

### Sorted data for `group_by_dynamic`

**For `group_by_dynamic`, the date/datetime column must be sorted in ascending order**.

Sorting is required because the second step, finding all the rows that correspond to each window, uses a fast-track algorithm that only works on sorted data.

> Note that `group_by_dynamic` calls the date/datetime column used to build the windows the *index* column.

When Polars starts `group_by_dynamic`, it first:
- checks the sorted flag in the `.flags` attribute of the *index* column
- if the flag is not set, checks whether the data in the *index* column is actually sorted

In [5]:
df["pickup"].is_sorted()

True

In [6]:
df.group_by_dynamic(
    "pickup",
    every="1d"
).agg(
    pl.col("trip_distance").mean().round(1)
).head(5)

pickup,trip_distance
datetime[μs],f64
2022-01-01 00:00:00,4.1
2022-01-02 00:00:00,6.6
2022-01-03 00:00:00,3.3
2022-01-04 00:00:00,2.6
2022-01-05 00:00:00,2.6


## `DynamicGroupBy` object

In [7]:
df.with_columns(
    pl.col("pickup").set_sorted()
).group_by_dynamic(
    "pickup",
    every="1d"
)

<polars.dataframe.group_by.DynamicGroupBy at 0x1a995005090>

We cannot call aggregation methods like `count` or `sum` on a `DynamicGroupBy` directly.

## Dynamic group by on groups

We may want to divide the `DataFrame` into groups before doing `group_by_dynamic` on each group.

We do this with the `group_by` argument of `group_by_dynamic`.

In [8]:
df.sort(
    "VendorID", "pickup"
).group_by_dynamic(
    "pickup",
    every="3h",
    group_by="VendorID"
).agg(
    pl.col("tip_amount").mean().round(1)
).head()

VendorID,pickup,tip_amount
str,datetime[μs],f64
"""id0""",2022-01-01 00:00:00,12.4
"""id0""",2022-01-01 03:00:00,2.7
"""id0""",2022-01-01 12:00:00,2.6
"""id0""",2022-01-01 15:00:00,12.4
"""id0""",2022-01-02 03:00:00,0.0


Polars first groups by `VendorID` and then does `group_by_dynamic` on each of those groups.

## Dynamic group by in lazy mode

In [9]:
print(
    pl.scan_csv(
        csv_file,
        try_parse_dates=True
    ).group_by_dynamic(
        "pickup",
        every="3h",
        group_by="passenger_count"
    ).agg(
        pl.col("trip_distance").mean().round(1)
    ).explain()
)

AGGREGATE[maintain_order: true]
  [col("trip_distance").mean().round()] BY [col("passenger_count")]
  FROM
  Csv SCAN [data/nyc_trip_data_1k.csv]
  PROJECT 3/7 COLUMNS
  ESTIMATED ROWS: 984


## Exercises

### Exercise 1
Group by the `pickup` column on a 6-hourly basis.

Get the count, mean and max of the trip distance for each window.

Sort the output by the mean trip distance with the largest values first.

In [10]:
df.head()

VendorID,pickup,dropoff,passenger_count,trip_distance,fare_amount,tip_amount
str,datetime[μs],datetime[μs],f64,f64,f64,f64
"""id1""",2022-01-01 00:04:14,2022-01-01 00:26:12,1.0,10.83,31.0,0.0
"""id2""",2022-01-01 00:32:17,2022-01-01 00:49:23,1.0,3.97,14.5,3.66
"""id8""",2022-01-01 00:40:58,2022-01-01 01:00:59,4.0,8.44,25.5,0.0
"""id0""",2022-01-01 00:55:13,2022-01-01 01:25:49,1.0,12.61,37.5,12.39
"""id1""",2022-01-01 00:55:24,2022-01-01 01:00:45,1.0,1.49,6.5,0.0


In [11]:
df.group_by_dynamic(
    "pickup",
    every="6h",
).agg(
    pl.col("trip_distance").count().alias("count"),
    pl.col("trip_distance").max().alias("max"),
    pl.col("trip_distance").mean().alias("mean")
).sort("mean", descending=True).head()

pickup,count,max,mean
datetime[μs],u32,f64,f64
2022-01-10 00:00:00,4,18.94,13.6775
2022-01-03 00:00:00,2,24.75,12.935
2022-01-02 06:00:00,5,18.39,11.464
2022-01-08 06:00:00,11,70.78,8.687273
2022-01-02 18:00:00,18,21.78,8.330556


Filter out all windows with fewer than 5 records.

In [12]:
df.group_by_dynamic(
    "pickup",
    every="6h",
).agg(
    pl.col("trip_distance").count().alias("count"),
    pl.col("trip_distance").max().alias("max"),
    pl.col("trip_distance").mean().alias("mean")
).filter(
    pl.col("count") >= 5
).sort("mean", descending=True)

pickup,count,max,mean
datetime[μs],u32,f64,f64
2022-01-02 06:00:00,5,18.39,11.464
2022-01-08 06:00:00,11,70.78,8.687273
2022-01-02 18:00:00,18,21.78,8.330556
2022-01-09 18:00:00,18,46.39,6.618333
2022-01-01 12:00:00,21,19.6,5.485714
…,…,…,…
2022-01-11 18:00:00,22,8.67,1.851364
2022-01-12 06:00:00,21,3.5,1.687619
2022-01-10 06:00:00,20,4.14,1.685
2022-01-07 06:00:00,15,4.07,1.657333


### Exercise 2

Get the same statistics but also group by the Vendor ID.

In [13]:
df.group_by_dynamic(
    "pickup",
    every="6h",
    group_by="VendorID"
).agg(
    pl.col("trip_distance").count().alias("count"),
    pl.col("trip_distance").max().alias("max"),
    pl.col("trip_distance").mean().alias("mean")
).filter(
    pl.col("count") >= 5
).sort("mean", descending=True)

VendorID,pickup,count,max,mean
str,datetime[μs],u32,f64,f64
"""id6""",2022-01-02 18:00:00,5,17.88,8.734
"""id6""",2022-01-08 12:00:00,6,17.15,6.97
"""id5""",2022-01-04 12:00:00,8,19.67,6.73875
"""id3""",2022-01-02 12:00:00,5,15.4,6.474
"""id1""",2022-01-01 00:00:00,5,10.83,5.966
…,…,…,…,…
"""id5""",2022-01-14 12:00:00,6,4.12,1.606667
"""id4""",2022-01-07 18:00:00,5,2.86,1.518
"""id2""",2022-01-07 12:00:00,5,2.06,1.402
"""id8""",2022-01-10 12:00:00,5,3.08,1.392


Get the same statistics (`count`, `max` and `mean`) but group by both:
- the Vendor ID, and
- the `trip_distance`, cast to a 16-bit integer before grouping

In [15]:
df.group_by_dynamic(
    "pickup", every="6h", group_by=["VendorID", pl.col("trip_distance").cast(pl.Int16)]
).agg(
    pl.col("trip_distance").count().alias("count"),
    pl.col("trip_distance").max().alias("max"),
    pl.col("trip_distance").mean().alias("mean"),
).filter(
    pl.col("count") >= 5
).sort(
    "mean", descending=True
)

VendorID,trip_distance,pickup,count,max,mean
str,i16,datetime[μs],u32,f64,f64
"""id7""",1,2022-01-08 12:00:00,5,1.67,1.452
