<img src="../assets/data_analysis_with_polars_copyright-1.png" width="600"/>

This notebook is a free sample from my Data Analysis with Polars course on Udemy.

Use this link to do the full course at half price: https://www.udemy.com/course/data-analysis-with-polars/?couponCode=POLARS_HALF_PRICE2

Check out this accompanying video as well: https://www.youtube.com/watch?v=AKnHKUY308o


## Introduction to datetime dtypes
By the end of this lecture you will be able to:
- create a datetime series with `pl.date_range`
- explain the difference between Polars datetime dtypes
- extract the integer representation underlying datetime dtypes

Time series analysis is easier if you have a good understanding of the datetime dtypes and their underlying representation. We get to know the dtypes here.

Datetime dtypes behave in some ways like the categorical dtype: each value is stored as an integer that maps to a more interpretable datetime representation. I recommend that you do the String and categorical dtypes lecture in Section 3 before this lecture.

In [None]:
from datetime import date, datetime

import polars as pl

## Creating a date range
Before looking at the dtypes we create a date range in Polars with `pl.date_range`. 

In this example we create an hourly date range and specify the start and stop dates with Python `datetime.date` objects.

In [None]:
pl.Config.set_tbl_rows(4)
start = date(2022, 1, 1)
stop = date(2022, 1, 2)
df = pl.DataFrame(
    {
        "date": pl.date_range(
            low=start,
            high=stop,
            interval="1h",
        ),
    }
)
df

We can also specify the start and stop dates with Python `datetime.datetime` objects.

In [None]:
start = datetime(2022, 1, 1, 6)
stop = datetime(2022, 1, 2, 3)
df = pl.DataFrame(
    {
        "date": pl.date_range(
            low=start,
            high=stop,
            interval="1h",
        ),
    }
)
df

### Intervals

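We specify the interval as a string made up of a number and one of the following units:
- "ns" (nanosecond)
- "us" (microsecond)
- "ms" (millisecond)
- "s" (second)
- "m" (minute)
- "h" (hour)
- "d" (day)
- "w" (week)
- "mo" (month)
- "y" (year)

We can also combine units in a single string, such as `1h30m` for one hour and thirty minutes.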

In [None]:
start = datetime(2022, 1, 1, 6)
stop = datetime(2022, 1, 2, 3)
df = pl.DataFrame(
    {
        "date": pl.date_range(
            low=start,
            high=stop,
            interval="1h30m",
        ),
    }
)
df.head(2)

Instead of the string intervals we can also use Python `datetime.timedelta` objects. The string intervals have more flexibility, however, so we will stick with those.
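
To make the `timedelta` alternative concrete, here is a minimal standard-library sketch of the interval that the `"1h30m"` string describes. It does not call Polars; the loop only illustrates the timestamps such a range would step through, and the names `interval` and `stamps` are illustrative.

```python
from datetime import datetime, timedelta

# "1h30m" expressed as a timedelta
interval = timedelta(hours=1, minutes=30)

start = datetime(2022, 1, 1, 6)
stop = datetime(2022, 1, 2, 3)

# Step from start to stop (inclusive) in 90-minute increments
stamps = []
current = start
while current <= stop:
    stamps.append(current)
    current += interval

print(stamps[:2])
print(len(stamps))
```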

### Date range closure
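By default the date range is closed on both sides, so it includes both the start and the stop values. We can change this with the `closed` argument. In this example `closed="left"` keeps the start value and drops the stop value.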

In [None]:
pl.Config.set_tbl_rows(4)
start = date(2022, 1, 1)
stop = date(2022, 1, 2)
df = pl.DataFrame(
    {
        "date": pl.date_range(
            low=start,
            high=stop,
            interval="1h",
            closed="left",
        ),
    }
)
df
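
The effect of each closure option can be sketched in plain Python. This is only an illustration under the assumption that closure amounts to keeping or dropping the two end points; `simple_range` is a hypothetical helper written for this example, not a Polars function, and it says nothing about how Polars implements `closed` internally.

```python
from datetime import datetime, timedelta


def simple_range(start, stop, step, closed="both"):
    """Plain-Python sketch of range closure (hypothetical helper)."""
    values = []
    current = start
    while current <= stop:
        values.append(current)
        current += step
    # Drop the start point unless the range is closed on the left
    if closed in ("right", "none") and values and values[0] == start:
        values = values[1:]
    # Drop the stop point unless the range is closed on the right
    if closed in ("left", "none") and values and values[-1] == stop:
        values = values[:-1]
    return values


start = datetime(2022, 1, 1)
stop = datetime(2022, 1, 2)
step = timedelta(hours=1)

# Number of hourly values for each closure option
counts = {
    option: len(simple_range(start, stop, step, option))
    for option in ("both", "left", "right", "none")
}
print(counts)
```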

## Datetime dtypes
In the table we set out the Polars datetime dtypes and their key characteristics.

| dtype | Example | Time unit | Internal dtype |
|---|---|---|---|
| `pl.Datetime` | 2020-01-01 01:00:00 | Microseconds since UNIX epoch | 64-bit signed integer |
| `pl.Date` | 2020-01-01 | Days since UNIX epoch | 32-bit signed integer |
| `pl.Time` | 01:00:00 | Nanoseconds since midnight | 64-bit signed integer |
| `pl.Duration` | 1d 1h | Microseconds | 64-bit signed integer |


> In pandas, datetime objects use nanoseconds rather than microseconds by default.
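
As a rough check on the table, the integers behind its example values can be worked out with the standard library alone. This sketch does not call Polars and treats the naive datetime as UTC; the variable names are illustrative.

```python
from datetime import datetime, time, timedelta

EPOCH = datetime(1970, 1, 1)
example = datetime(2020, 1, 1, 1)
one_us = timedelta(microseconds=1)

# pl.Datetime: microseconds since the UNIX epoch
datetime_us = (example - EPOCH) // one_us

# pl.Date: days since the UNIX epoch
date_days = (example.date() - EPOCH.date()).days

# pl.Time: nanoseconds since midnight
midnight = datetime.combine(example.date(), time())
time_ns = ((example - midnight) // one_us) * 1_000

# pl.Duration: microseconds in the duration itself (1d 1h)
duration_us = timedelta(days=1, hours=1) // one_us

print(datetime_us, date_days, time_ns, duration_us)
```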

In the `DataFrame` below we create a date range at 6-hour intervals to see how it is represented in the different dtypes.

We subtract successive values in the column of datetimes with `diff` to get a `pl.Duration`. The first row of the duration column is null because it has no preceding value.

In [None]:
start = datetime(2020, 1, 1)
stop = datetime(2020, 1, 2)
dfDatetimes = (
    pl.DataFrame(
        {
            "datetime": pl.date_range(start, stop, interval="6h")
        }
    ).with_columns(
        [
            # Keep only the calendar date
            pl.col("datetime").cast(pl.Date).alias("date"),
            # The difference between successive rows is a pl.Duration
            pl.col("datetime").diff().alias("duration"),
            # Keep only the time of day
            pl.col("datetime").cast(pl.Time).alias("time"),
        ]
    )
)
dfDatetimes

### Integer representations
We get the underlying integer representations with `to_physical`

In [None]:
dfDatetimesPhysical = (
    dfDatetimes
    .select(
        [
            pl.col("datetime").to_physical().suffix("_us"),
            pl.col("date").to_physical().suffix("_days"),
            pl.col("duration").to_physical().suffix("_us"),
            pl.col("time").to_physical().suffix("_ns"),
        ]
    )
)
dfDatetimesPhysical

With a 64-bit integer we can represent a datetime range of around 584,000 years at microsecond intervals!
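
As a back-of-the-envelope check on how far a 64-bit integer reaches, the sketch below divides 2**64 ticks by the number of ticks in an average Gregorian year (365.2425 days, an assumption made for this estimate) for each time unit. Because the integer is signed, the span is split between dates before and after the epoch.

```python
# Average length of a Gregorian year in seconds (assumed for this estimate)
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60

ticks_per_second = {
    "ns": 1_000_000_000,
    "us": 1_000_000,
    "ms": 1_000,
}

# Number of years that 2**64 distinct integer values can span in each unit
span_years = {
    unit: 2**64 / (ticks * SECONDS_PER_YEAR)
    for unit, ticks in ticks_per_second.items()
}

for unit, years in span_years.items():
    print(f"{unit}: about {years:,.0f} years")
```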

### Timestamp
The integer representation of a datetime is sometimes referred to as the timestamp. 

In Polars we have a `.dt.timestamp` expression that gives the integer representation in a given unit.

In this example we get the integer representation in each of the available units: nanoseconds, microseconds (the default) and milliseconds.

In [None]:
(
    dfDatetimes
    .select(
        [
            pl.col("datetime"),
            pl.col("datetime").dt.timestamp(tu="ns").alias("timestamp_ns"),
            pl.col("datetime").dt.timestamp().alias("timestamp_us"),
            pl.col("datetime").dt.timestamp(tu="ms").alias("timestamp_ms"),
        ]
    )
)


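There is also a `.dt.epoch` expression that returns the time since the UNIX epoch in a given unit. It accepts the same units as `.dt.timestamp` and can also return seconds (`"s"`) or days (`"d"`).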
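
The nanosecond, microsecond and millisecond timestamps are related by factors of 1,000. A minimal standard-library sketch, which treats the naive datetime as UTC and does not call Polars, shows the arithmetic for the first value in the `datetime` column:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
value = datetime(2020, 1, 1)

# Exact integer count of microseconds since the epoch
timestamp_us = (value - EPOCH) // timedelta(microseconds=1)

# Finer and coarser units differ by factors of 1,000
timestamp_ns = timestamp_us * 1_000
timestamp_ms = timestamp_us // 1_000

print(timestamp_ns, timestamp_us, timestamp_ms)
```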

## Exercises
In the exercises you will develop your understanding of:
- creating a date range
- converting datetime dtypes
- extracting the integer representation
 
## Exercise 1
Create a `DataFrame` with a column called `datetime` that has datetimes from the start of 2020 to 30th June 2022 at 6-monthly intervals.

Extend your query by copying your existing code in each subsequent part of this exercise.

Create this date range again, but closed only on the right-hand side (the end date).

Add columns that encode the same date range as a:
- date
- time

Add three new columns that have the physical representation for the `datetime`, `date` and `time` columns. Each new column name should end with `_physical`.

Challenge: do this as a single expression inside an additional `with_column`

## Solutions

## Solution to Exercise 1

Create a `DataFrame` with a column called `datetime` that has datetimes from the start of 2020 to 30th June 2022 at 6-monthly intervals

In [None]:
start = datetime(2020, 1, 1)
stop = datetime(2022, 6, 30)
df = (
    pl.DataFrame(
        {
            "datetime": pl.date_range(
                low=start,
                high=stop,
                interval="6mo",
            )
        }
    )
)
df

Create this date range again, but closed only on the right-hand side (the end date).

In [None]:
df = (
    pl.DataFrame(
        {
            "datetime": pl.date_range(
                low=start,
                high=stop,
                interval="6mo",
                closed="right",
            )
        }
    )
)
df

Add columns that encode the same date range as a:
- date
- time

In [None]:
df = (
    pl.DataFrame(
        {
            "datetime": pl.date_range(
                low=start,
                high=stop,
                interval="6mo",
                closed="right",
            )
        }
    )
    .with_columns(
        [
            pl.col("datetime").cast(pl.Date).alias("date"),
            pl.col("datetime").cast(pl.Time).alias("time"),
        ]
    )
)
df

Add three new columns that have the physical representation for the `datetime`, `date` and `time` columns. Each new column name should end with `_physical`.

Challenge: do this as a single expression inside an additional `with_column`

In [None]:
df = (
    pl.DataFrame(
        {
            "datetime": pl.date_range(
                low=start,
                high=stop,
                interval="6mo",
                closed="right",
            )
        }
    )
    .with_columns(
        [
            pl.col("datetime").cast(pl.Date).alias("date"),
            pl.col("datetime").cast(pl.Time).alias("time"),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix("_physical")
    )
)
df