<img src="../assets/data_analysis_with_polars_copyright-1.png" width="600"/>

If you want to join me on the course [**use this referral link**](https://www.udemy.com/course/data-analysis-with-polars/?referralCode=A29DCDA40D369080C05A)



## Working with time zones

By the end of this lecture you will be able to:
- add a time zone to a datetime
- change the time zone
- explain the use case of the different time zone functions
- get the time difference between time zones

Working with time zones can be tricky. In this lecture we break it down to understand how the different time zone functions work.

In [None]:
%pip install polars

In [4]:
from datetime import date,datetime

import polars as pl

## Creating a simple `DataFrame`

We create a `DataFrame` that has a single value in the `date` column - 1970/1/1 00:00:00

This date is the origin point for Unix timestamps. If we instead use a contemporary datetime it can be tricky to track changes in the integer representations as we are looking for small differences in large numbers.

To make things easier we will convert the integer representations from microseconds to hours with the following conversion factor

In [5]:
# Conversion factor to convert integer timestamps to hours
microseconds_per_hour = 3600 * 1e6

In the `DataFrame` we also add a column `date_p` with the physical integer representation converted from (integer) microseconds to (floating point) hours.

In [6]:
df = (
    pl.DataFrame(
        {
            "date":[datetime(1970,1,1)]
        }
    )
    .with_column(
        pl.col("date").to_physical().alias("date_p")/microseconds_per_hour
    )
)
df

date,date_p
datetime[μs],f64
1970-01-01 00:00:00,0.0


By default a `pl.Datetime` is **time zone-naive**: it has no time zone attached. Implicitly, however, a time zone-naive value is treated as being in the UTC time zone, which is why 1970-01-01 00:00:00 corresponds to a timestamp of 0.
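
As a quick check outside Polars, the sketch below uses only Python's standard library to confirm that 1970-01-01 00:00:00 UTC has a timestamp of 0, and to show the microseconds-to-hours conversion on a second value. It is an illustration, not how Polars stores the data internally. The variable names other than `microseconds_per_hour` are assumptions made for this example.

```python
from datetime import datetime, timezone

# Same conversion factor as in the lecture: microseconds in one hour
microseconds_per_hour = 3600 * 1e6

# The Unix epoch, made explicitly time zone-aware in UTC
epoch_utc = datetime(1970, 1, 1, tzinfo=timezone.utc)
epoch_us = int(epoch_utc.timestamp() * 1_000_000)

# One day after the epoch, as integer microseconds and then as hours
next_day_utc = datetime(1970, 1, 2, tzinfo=timezone.utc)
next_day_us = int(next_day_utc.timestamp() * 1_000_000)
next_day_hours = next_day_us / microseconds_per_hour

print(epoch_us)        # 0
print(next_day_hours)  # 24.0
```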

## Specify a time zone for a given datetime
If we know that the datetimes are not UTC but instead record a local datetime in a particular time zone, we can specify that time zone with `dt.tz_localize`.

The names of the time zone locations come from the Rust library chrono-tz. <a href="https://docs.rs/chrono-tz/latest/chrono_tz/enum.Tz.html" target="_blank"> See here for the full list of supported time zone names and locations</a>.


We tell Polars that this datetime is actually a local time in New York. We do this in a new column `tz_local` and also add the physical representation in `tz_local_p`

In [7]:
(
    df
    .with_columns(
        [
            pl.col("date").dt.tz_localize("America/New_York").alias("tz_local"),
            pl.col("date").dt.tz_localize("America/New_York").to_physical().alias("tz_local_p")/microseconds_per_hour
        ]
    )
)

AttributeError: 'ExprDateTimeNameSpace' object has no attribute 'tz_localize'

By calling `dt.tz_localize`:
- the datetime hasn't changed from 1970-01-01 00:00:00 but now has the EST timezone
- the physical representation **has changed** from 0 to 5 hours

The physical representation must change by 5 hours because 1970-01-01 00:00:00 EST occurred 5 hours after the start of the Unix epoch.
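
The same idea can be sketched with Python's standard library: attach a time zone to a naive datetime without moving the wall clock, then look at the timestamp. This is an analogy to what `dt.tz_localize` is described as doing above, not the Polars implementation. One assumption is made for simplicity: EST is modelled as a fixed UTC-5 offset rather than the full `America/New_York` rules.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST modelled as a fixed UTC-5 offset (no daylight saving rules)
est = timezone(timedelta(hours=-5), "EST")

# A naive datetime: no time zone attached
naive = datetime(1970, 1, 1)

# Attach the time zone but keep the wall-clock time unchanged
localized = naive.replace(tzinfo=est)

# The timestamp is now 5 hours after the Unix epoch
hours_since_epoch = localized.timestamp() / 3600

print(localized.isoformat())  # 1970-01-01T00:00:00-05:00
print(hours_since_epoch)      # 5.0
```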

> Terminology: we refer to the difference in hours between timezones as the *offset*. For example the offset between 1970-01-01 00:00:00 UTC and 1969-12-31 19:00:00 EST is 5 hours.

## Change the time zone for a given Unix timestamp 
In this scenario we know that the original data was recorded in Unix timestamps and so is in the UTC timezone. We now want to know what local time that UTC timestamp corresponds to in New York. 

In this case we use `dt.with_time_zone`.

In [None]:
(
    df
    .with_columns(
        [
            pl.col("date").dt.with_time_zone("America/New_York").alias("with_tz"),
            pl.col("date").dt.with_time_zone("America/New_York").to_physical().alias("with_tz_p")/microseconds_per_hour
        ]
    )
)

By calling `dt.with_time_zone`:
- the datetime **has been offset** by -5 hours to 1969-12-31 19:00:00 as EST is 5 hours behind UTC
- the physical representation has not changed from 0
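
For comparison, here is a minimal standard-library sketch of the same kind of conversion: the instant stays fixed and only the wall-clock display moves. As before, EST is assumed to be a fixed UTC-5 offset, and this is an analogy rather than the Polars implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST modelled as a fixed UTC-5 offset (no daylight saving rules)
est = timezone(timedelta(hours=-5), "EST")

# The Unix epoch as an explicit UTC datetime
utc_dt = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Convert to the other time zone: same instant, different wall clock
converted = utc_dt.astimezone(est)
converted_hours = converted.timestamp() / 3600

print(converted.isoformat())  # 1969-12-31T19:00:00-05:00
print(converted_hours)        # 0.0
```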

## Change the Unix timestamp *and* the datetime
The third time zone function `dt.cast_time_zone` sometimes causes confusion. We set out what it does before we see a use case for it later.

We can only call `dt.cast_time_zone` on a `pl.Datetime` that already has a timezone. 

In this example we call `dt.cast_time_zone` on the `with_tz` column, which has the New York time zone.

In [None]:
(
    df
    .with_columns(
        [
            pl.col("date").dt.with_time_zone("America/New_York").alias("with_tz"),
            pl.col("date").dt.with_time_zone("America/New_York").to_physical().alias("with_tz_p")/microseconds_per_hour
        ]
    )
    .with_columns(
            [
                pl.col("with_tz").dt.cast_time_zone("UTC").alias("cast_tz"),
                pl.col("with_tz").dt.cast_time_zone("UTC").to_physical().alias("cast_tz_p")/microseconds_per_hour
            ]
    )
)

By calling `dt.cast_time_zone` from New York to UTC:
- the datetime **moved forward by 10 hours** from 19:00 on 31st December to 05:00 on 1st January in `cast_tz`
- the physical timestamp **changed from 0 to 5 hours** in `cast_tz_p`

The change in datetime is 10 hours: 5 hours from changing the timestamp by 5 hours and then 5 more hours because the offset between UTC and EST is 5 hours.
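
The arithmetic in the previous paragraph can be written out with `timedelta` values. This sketch only reproduces the two 5-hour steps described above and makes no claim about how `dt.cast_time_zone` is implemented. All variable names are illustrative.

```python
from datetime import datetime, timedelta

# Wall-clock value shown in the with_tz column (New York local time)
with_tz_wall_clock = datetime(1969, 12, 31, 19, 0)

# Step 1: the timestamp itself moves forward by 5 hours
timestamp_shift = timedelta(hours=5)

# Step 2: displaying in UTC instead of EST adds the 5-hour offset
display_shift = timedelta(hours=5)

cast_tz_wall_clock = with_tz_wall_clock + timestamp_shift + display_shift
total_change = cast_tz_wall_clock - with_tz_wall_clock

print(cast_tz_wall_clock)  # 1970-01-01 05:00:00
print(total_change)        # 10:00:00
```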

The behaviour of `dt.cast_time_zone` may be confusing because it is not typically what we are looking for when converting time zones. However, we see how it is useful for calculating offsets between time zones later in this lecture.

### Summary of the methods
We summarise these methods here. The Datetime column reflects whether the displayed datetime changes, e.g. from 1970-01-01 00:00:00 to 1969-12-31 19:00:00.

| Method | Datetime | Time zone | Timestamp |
|---|---|---|---|
| `dt.tz_localize` | No change | Adds time zone | Changes timestamp |
| `dt.with_time_zone` | Changes by offset | Adds/changes time zone | No change |
| `dt.cast_time_zone` | Double change | Changes time zone | Changes timestamp |

Example use cases:
- `dt.tz_localize` when your data records when things happened in local time
- `dt.with_time_zone` when your data records when things happened in Unix timestamps and you want to know what this was in local time
- `dt.cast_time_zone` when you want to calculate the offset between time zones (see below)

## Offset between time zones
We use `dt.cast_time_zone` to get the offset between time zones.

To illustrate this we create a `DataFrame` with a date column containing monthly data. We will specify a date range that covers the shift to daylight saving time.

In [None]:
pl.Config.set_tbl_rows(6)
start = datetime(2022,1,1)
stop = datetime(2022,6,1)
df = (
    pl.DataFrame(
        {
            'date':pl.date_range(
                low = start,
                high = stop,
                interval='1mo',
            ),
         }
    )
)
df

As we saw above, when we do a `cast_time_zone` on a timezone column we **change the integer representation** by the time difference. This time difference gives us the offset between time zones.

First we add a column called `date_london` that gives `date` in the London time zone.

Then we add a column called `date_nyc` that gives the London datetime in the New York time zone. Neither of these operations changes the integer representation.

In [None]:
(
    df
    # Add a timezone column for London
    .with_column(
            pl.col("date").dt.with_time_zone(tz="Europe/London").alias("date_london"),
    )
    # Convert the London column to the New York time zone
    .with_column(
            pl.col("date_london").dt.with_time_zone(tz="America/New_York").alias("date_nyc")
    )
)

We expand this query below. 

We add a `with_column` step where we use `cast_time_zone` to get the time difference between the London and New York time zones as a `pl.Duration`

In [None]:
(
    df
    # Add a timezone column for London
    .with_column(
            pl.col("date").dt.with_time_zone(tz="Europe/London").alias("date_london"),
    )
    # Convert the London timezone to New York
    .with_column(
            pl.col("date_london").dt.with_time_zone(tz="America/New_York").alias("date_nyc")
    )
    # Cast the New York timezone to London and subtract
    .with_column(
        
            (
                pl.col("date_london") - pl.col("date_nyc").dt.cast_time_zone("Europe/London")
            ).alias("offset")
    )
)

When we cast the New York time zone to London the integer representation changed by the size of the offset. So when we subtracted
```python
pl.col("date_london") - pl.col("date_nyc").dt.cast_time_zone("Europe/London")
```
the difference is the offset between the time zones.
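
As an independent cross-check, the offset between two zones at a given instant can also be computed from their UTC offsets using Python's standard-library `zoneinfo` module. This is a sketch of an alternative calculation, not the Polars approach shown above. It assumes Python 3.9+ with a time zone database available, and `offset_hours` is a helper name invented for this example.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

london = ZoneInfo("Europe/London")
new_york = ZoneInfo("America/New_York")


def offset_hours(instant_utc):
    """Hours by which London's wall clock is ahead of New York's at this instant."""
    diff = (
        instant_utc.astimezone(london).utcoffset()
        - instant_utc.astimezone(new_york).utcoffset()
    )
    return diff.total_seconds() / 3600


# First of each month from January to June 2022, treated as UTC instants
first_of_month = [datetime(2022, m, 1, tzinfo=timezone.utc) for m in range(1, 7)]
monthly_offsets = [offset_hours(d) for d in first_of_month]

# An instant between the US and UK clock changes in March 2022
mid_march_offset = offset_hours(datetime(2022, 3, 20, 12, tzinfo=timezone.utc))

print(monthly_offsets)   # [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
print(mid_march_offset)  # 4.0
```

On the first of each month both zones are either on standard time or on daylight saving time, so the offset is 5 hours throughout. The 4-hour value only appears in the short window where New York has already changed its clocks and London has not.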

> Historical note - The UK used BST all year-round in 1970 whereas today it only applies from late March to late October.

## Exercises
In the exercises you will develop your understanding of:
- setting the time zone
- changing the time zone
- getting the time difference between time zones


### Exercise 1
Create a `DataFrame` with a `date` column at monthly intervals from 1st September 2020 to 1st December 2020

In [None]:
start = datetime(2020,9,1)
stop = datetime(2020,12,1)
(
    pl.DataFrame(
        {
            "date":<blank>
        }
    )
)

The dates in the `date` column record events that happened in a factory in Johannesburg in South Africa.

Transform the `date` column so that the datetimes are local to Johannesburg.

Continue with your query from above in each step of this exercise

Add a column with the integer representation called `date_p`

You want to know what time it was in the Dublin office when the events happened in Johannesburg. 

Add a column called `date_dublin` with the local time in Dublin for these events

Add a column called `offset` that shows the offset between Johannesburg and Dublin.

Why does the offset change over the months?

### Exercise 2
You have a weather station that records temperature at hourly intervals. The device records data in UTC.

In [None]:
pl.Config.set_tbl_rows(25)
import numpy as np
start = datetime(2020,9,1)
stop = datetime(2020,9,2)
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1h")
        }
    )
    .with_column(
        # We use a cosine function with a period of 24 hours to generate a fake temperature cycle
        25 + 4*((2*np.pi*pl.col("date").to_physical()/(24*60*60*1e6))).cos().alias("temperature")
    )
)

From the output we can see that the device is not located in the UTC time zone as the highest temperature is at night and the lowest is in the afternoon.

Change the time zone to a location that has higher temperatures in the late afternoon and lower temperatures in the early night (<a href="https://docs.rs/chrono-tz/latest/chrono_tz/enum.Tz.html" target="_blank">there are obviously many such locations, you mainly need to figure out whether to go east or west!</a>).

In [None]:
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1h")
        }
    )
    .with_column(
        # Use a cosine function with a period of 24 hours to generate a fake temperature cycle
        25 + 4*((2*np.pi*pl.col("date").to_physical()/(24*60*60*1e6))).cos().alias("temperature")
    )
    <blank>
)

## Solutions

### Solution to Exercise 1

Create a `DataFrame` with a `date` column at monthly intervals from 1st September 2020 to 1st December 2020

In [None]:
start = datetime(2020,9,1)
stop = datetime(2020,12,1)
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1mo")
        }
    )
)

The dates in the `date` column actually record events that happened in a factory in Johannesburg in South Africa.

Transform the `date` column so that the datetimes are local to Johannesburg.

In [None]:
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1mo")
        }
    )
    .with_column(
        pl.col("date").dt.tz_localize("Africa/Johannesburg")
    )
)

Add a column with the integer representation called `date_p`

In [None]:
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1mo")
        }
    )
    .with_column(
        pl.col("date").dt.tz_localize("Africa/Johannesburg")
    )
    .with_column(
        pl.col("date").to_physical().alias("date_p")
    )
)

You want to know what time it was in the Dublin office when the events happened in Johannesburg. 

Add a column called `date_dublin` with the local time in Dublin for these events

In [None]:
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1mo")
        }
    )
    .with_column(
        pl.col("date").dt.tz_localize("Africa/Johannesburg")
    )
    .with_column(
        pl.col("date").to_physical().alias("date_p")
    )
    .with_column(
        pl.col("date").dt.with_time_zone("Europe/Dublin").alias("date_dublin")
    )
)

Add a column called `offset` that shows the offset between Johannesburg and Dublin.

In [None]:
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1mo")
        }
    )
    .with_column(
        pl.col("date").dt.tz_localize("Africa/Johannesburg")
    )
    .with_column(
        pl.col("date").to_physical().alias("date_p")
    )
    .with_column(
        pl.col("date").dt.with_time_zone("Europe/Dublin").alias("date_dublin")
    )
    .with_column(
        (pl.col("date") - pl.col("date_dublin").dt.cast_time_zone("Africa/Johannesburg"))
        .alias("offset")
    )
)

Why does the offset change over the months?

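Because daylight saving time (Irish Summer Time, IST) applies in Dublin in September and October, whereas Johannesburg does not observe daylight saving time. Irish clocks went back one hour on 25 October 2020, so the gap between the two local times grows from 1 hour for the September and October rows to 2 hours for the November and December rows.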
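
The effect of the Dublin clock change can be checked with a short standard-library sketch that compares the UTC offsets of the two zones at each event. It assumes Python 3.9+ with a time zone database available and is a cross-check, not the Polars solution. The names `events` and `offsets_hours` are illustrative.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

johannesburg = ZoneInfo("Africa/Johannesburg")
dublin = ZoneInfo("Europe/Dublin")

# Local midnight in Johannesburg on the first of each month, Sept to Dec 2020
events = [datetime(2020, m, 1, tzinfo=johannesburg) for m in range(9, 13)]

# Hours by which Johannesburg's wall clock is ahead of Dublin's at each event
offsets_hours = [
    (e.utcoffset() - e.astimezone(dublin).utcoffset()).total_seconds() / 3600
    for e in events
]

print(offsets_hours)  # [1.0, 1.0, 2.0, 2.0]
```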

### Solution to Exercise 2
You have a weather station that records temperature at hourly intervals. The device records data in UTC.

In [None]:
pl.Config.set_tbl_rows(25)
import numpy as np
start = datetime(2020,9,1)
stop = datetime(2020,9,2)
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1h")
        }
    )
    .with_column(
        25 + 4*((2*np.pi*pl.col("date").to_physical()/(24*60*60*1e6))).cos().alias("temperature")
    )
)

From the output we can see that the device is not located in the UTC timezone as the highest temperature is at night and the lowest is in the afternoon.

Change the time zone to a location that has higher temperatures in the late afternoon and lower temperatures in the early night (<a href="https://docs.rs/chrono-tz/latest/chrono_tz/enum.Tz.html" target="_blank">there are obviously many such locations, you mainly need to figure out whether to go east or west!</a>).

In [None]:
(
    pl.DataFrame(
        {
            "date":pl.date_range(start,stop,"1h")
        }
    )
    .with_column(
        25 + 4*((2*np.pi*pl.col("date").to_physical()/(24*60*60*1e6))).cos().alias("temperature")
    )
    .with_column(
        pl.col("date").dt.with_time_zone("Brazil/West")
    )
)
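
To see why moving west helps, the sketch below rebuilds the same cosine temperature cycle with the standard library and finds the warmest and coldest hours in UTC and in a zone four hours behind UTC. One assumption is made: a fixed UTC-4 offset stands in for the `Brazil/West` zone used in the solution. The `temperature` helper and variable names are illustrative.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for the "Brazil/West" zone
utc_minus_4 = timezone(timedelta(hours=-4))

# 24 hourly readings starting at midnight UTC on 1st September 2020
start = datetime(2020, 9, 1, tzinfo=timezone.utc)
hours = [start + timedelta(hours=h) for h in range(24)]


def temperature(ts):
    """Fake temperature: cosine with a 24-hour period based on the Unix timestamp."""
    return 25 + 4 * math.cos(2 * math.pi * ts.timestamp() / (24 * 60 * 60))


warmest = max(hours, key=temperature)
coldest = min(hours, key=temperature)

# The same instants expressed as local wall-clock hours
warmest_local_hour = warmest.astimezone(utc_minus_4).hour
coldest_local_hour = coldest.astimezone(utc_minus_4).hour

print(warmest.hour, coldest.hour)              # 0 12
print(warmest_local_hour, coldest_local_hour)  # 20 8
```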