## Extracting datetime components
By the end of this lecture you will be able to:
- extract date components from a datetime dtype
- extract week-of-year and day-of-year from a datetime dtype
- extract time components from a datetime dtype


In [None]:
from datetime import datetime

import polars as pl

In [None]:
csvFile = "../data/nyc_trip_data_1k.csv"

In [None]:
df = pl.read_csv(csvFile,parse_dates=True)
df.head()

## Extracting date and time
We extract the date from a `pl.Datetime` column by casting it to `pl.Date`.

In [None]:
(
    df
    .with_columns(
        pl.col("pickup").cast(pl.Date)
    )
).head(3)

We extract the time from a `pl.Datetime` column by casting it to `pl.Time`.

In [None]:
(
    df
    .with_columns(
        pl.col("pickup").cast(pl.Time)
    )
).head(3)

## Extracting datetime features
We use expressions in the `dt` namespace to extract individual date and time features.

In [None]:
(
    df
    .select(
        [
        pl.col("pickup"),
        pl.col("pickup").dt.year().alias("year"),
        pl.col("pickup").dt.quarter().alias("quarter"),
        pl.col("pickup").dt.month().alias("month"),
        pl.col("pickup").dt.day().alias("day"),
        pl.col("pickup").dt.hour().alias("hour"),
        pl.col("pickup").dt.minute().alias("minute"),
        pl.col("pickup").dt.second().alias("second"),
        pl.col("pickup").dt.millisecond().alias("millisecond"),
        pl.col("pickup").dt.microsecond().alias("microsecond"),
        pl.col("pickup").dt.nanosecond().alias("nanosecond"),
        ]
    )
    .sample(5)
    .sort("pickup")
)

The dtype for the `year` column is a signed 32-bit integer. All other columns are unsigned 32-bit integers.

## Ordinal week and day numbers

We can also extract week and day features:
- `.dt.week` gives the <a href="https://en.wikipedia.org/wiki/ISO_week_date" target="_blank">ISO week of the year</a>
- `.dt.weekday` gives the day of the week, where Monday = 0 and Sunday = 6 in the Polars version used here (newer releases follow the ISO numbering, Monday = 1 to Sunday = 7)
- `.dt.day` gives the day of the month from 1-31
- `.dt.ordinal_day` gives the day of the year from 1-365/366

In [None]:
(
    df
    .select(
        [
            pl.col("pickup"),
            pl.col("pickup").dt.week().alias("week"),
            pl.col("pickup").dt.weekday().alias("weekday"),
            pl.col("pickup").dt.day().alias("day_of_month"),
            pl.col("pickup").dt.ordinal_day().alias("ordinal_day"),
        ]
    )
    .sample(5)
    .sort("pickup")
)

In the ISO system the first two days of 2022 are in week 52 of 2021.

## Extracting datetime components in lazy mode
We run the same query in lazy mode to see where the datetime extraction appears in the optimized query plan.

In [None]:
print(
    pl.scan_csv(csvFile,parse_dates=True)
    .select(
        [
            pl.col("pickup"),
            pl.col("pickup").dt.week().alias("week"),
            pl.col("pickup").dt.weekday().alias("weekday"),
            pl.col("pickup").dt.day().alias("day_of_month"),
            pl.col("pickup").dt.ordinal_day().alias("ordinal_day"),
        ]
    )
    .describe_optimized_plan()
)

The datetime extraction happens in a `SELECT...FROM` block in the optimized query plan above.

This means that Polars first reads the datetime column from the CSV and extracts the components only once the column is in a `DataFrame` in memory.


## Exercises
In the exercises you will develop your understanding of:
- extracting datetime components
- extracting ordinal components
- doing these operations in lazy mode

## Exercise 1
Count the number of records for each date (by pickup).

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    <blank>
)

## Exercise 2

Add a `day_of_year` column containing the ordinal day of the year for each pickup.

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    <blank>
)


Continue by counting how many records there are for each day-of-year

Add columns with the day-of-week and hour of the day based on the pickup time

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    .select(["pickup"])
    <blank>
    .head()
)

Continue by counting the number of records for each (day-of-week,hour-of-the-day) pair.

Sort the output from largest number of records to smallest

Do the count of records by (day-of-week, hour-of-the-day) again, but this time extract the day-of-week and hour-of-the-day **inside the `groupby`**.

Do the same operation but this time in lazy mode

## Solutions

## Solution to exercise 1
Count the number of records for each date (by pickup).

This can be done either with `groupby` (first cell) or with `value_counts` (second cell).

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    .groupby(
        pl.col("pickup").cast(pl.Date)
    )
    .count()
)

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    .with_columns(
        pl.col("pickup").cast(pl.Date)
    )
    ["pickup"]
    .value_counts()
)

## Solution to exercise 2
Add a `day_of_year` column containing the ordinal day of the year for each pickup.

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    .with_columns(
        pl.col("pickup").dt.ordinal_day().alias("day_of_year")
    )
)


Count how many records there are for each day-of-year

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    .with_columns(
        pl.col("pickup").dt.ordinal_day().alias("day_of_year")
    )
    ["day_of_year"]
    .value_counts()
)


Add columns with the day-of-week and hour of the day based on the pickup time

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    .select(["pickup"])
    .with_columns(
        [
            pl.col("pickup").dt.weekday().alias("day_of_week"),
            pl.col("pickup").dt.hour().alias("hour")
        ]
    )
    .head(3)
)

Count the number of records for each (day-of-week,hour-of-the-day) pair.

Sort the output from largest number of records to smallest

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    .select(["pickup"])
    .with_columns(
        [
            pl.col("pickup").dt.weekday().alias("day_of_week"),
            pl.col("pickup").dt.hour().alias("hour")
        ]
    )
    .groupby(["day_of_week","hour"])
    .count()
    .sort("count",reverse=True)
)

Do the count of records by (day-of-week, hour-of-the-day) again, but this time extract the day-of-week and hour-of-the-day inside the `groupby`.

In [None]:
(
    pl.read_csv(csvFile,parse_dates=True)
    .select(["pickup"])
    .groupby(
        [
            pl.col("pickup").dt.weekday().alias("day_of_week"),
            pl.col("pickup").dt.hour().alias("hour")
        ]
    )
    .count()
    .sort("count",reverse=True)
)

Do the same operation in lazy mode

In [None]:
(
    pl.scan_csv(csvFile,parse_dates=True)
    .select(["pickup","dropoff"])
    .groupby(
        [
            pl.col("pickup").dt.weekday().alias("day_of_week"),
            pl.col("pickup").dt.hour().alias("hour")
        ]
    )
    .agg(
        pl.col("dropoff").count().alias("count")
    )
    .sort("count",reverse=True)
    .collect()
)

We cannot call `count` directly on a `LazyGroupBy`, so we must use `agg` instead. I recommend using `agg` in eager mode as well to make the conversion to lazy mode easier.