# Experiment: Timestamps and Timezones

Timestamps behave differently depending on the destination. In this experiment, we explore how dlt handles timestamps when loading from DuckDB into another DuckDB database and into Parquet files on the filesystem.

## 1. DuckDB to DuckDB

DuckDB supports several timestamp types, but we will focus on the following:

- **TIMESTAMP**: This type ignores the session's `TimeZone` setting. Naive values are stored as entered, while values that carry a UTC offset are converted to UTC before the offset is dropped.
- **TIMESTAMPTZ**: This type uses the session's `TimeZone` setting to interpret naive values on input and to render values on output.

When you store a TIMESTAMPTZ value, it is converted to and stored as an instant (an absolute point in time, like a Unix timestamp). A value with an explicit offset already identifies an instant; a naive value is interpreted using the `TimeZone` setting active in the session.

## 2. DuckDB to Filesystem (PyArrow Parquet)

In PyArrow Parquet, the timezone name is stored as a string in the column type, separate from the timestamp values, so it does not affect the values themselves.
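The separation between values and timezone label can be pictured with a small plain-Python model. This is only an illustrative sketch, not PyArrow's API: the `TimestampColumn` class and the `to_epoch_us` helper are hypothetical names introduced here, and the model assumes values are kept as microseconds since the Unix epoch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class TimestampColumn:
    """Hypothetical stand-in for a timestamp column (not a PyArrow class)."""

    values_us: Tuple[int, ...]  # microseconds since the Unix epoch
    tz: Optional[str]  # timezone label kept next to the values, not inside them


def to_epoch_us(dt: datetime) -> int:
    """Convert an aware datetime to whole microseconds since the epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


instant = datetime(2024, 7, 30, 17, 0, 0, 123000, tzinfo=timezone.utc)

# Same stored values, two different timezone labels.
utc_col = TimestampColumn((to_epoch_us(instant),), "UTC")
la_col = TimestampColumn(utc_col.values_us, "America/Los_Angeles")

# Relabeling the column changes only the label, never the stored integers.
same_values = utc_col.values_us == la_col.values_us
print(same_values, utc_col.tz, la_col.tz)
```

In this model, the label only affects how a reader renders the values; the integers that identify the instant are untouched.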


## DuckDB exploration
Before jumping into the experiments, we will explore how DuckDB itself behaves.
We start by creating a DuckDB table with a TIMESTAMPTZ column, with the session `TimeZone` set to America/Los_Angeles. We can see that DuckDB uses the `TimeZone` setting to convert and store the timestamp as an instant (without timezone).

In [47]:
import duckdb
import pandas as pd

# Connect to the DuckDB database
conn = duckdb.connect('source.duckdb')

# Create a table and insert data
conn.execute('''
SET TimeZone = 'America/Los_Angeles';
CREATE TABLE IF NOT EXISTS events (
    event_id INTEGER,
    event_tstamp TIMESTAMPTZ
);
DELETE FROM events;
INSERT INTO events (event_id, event_tstamp) VALUES
  (1, '2024-07-30 10:00:00.123'),
  (2, '2024-07-30 10:00:00.123456+00:00');
''')

# Fetch the results and load into a Pandas DataFrame
results = conn.execute('SELECT * FROM events;').fetchdf()
print(results)

conn.close()

   event_id                     event_tstamp
0         1 2024-07-30 10:00:00.123000-07:00
1         2 2024-07-30 03:00:00.123456-07:00


Now we change the `TimeZone` setting to UTC and insert some new data to see what happens:

In [48]:
conn = duckdb.connect('source.duckdb')
conn.execute('''
SET TimeZone = 'UTC';
INSERT INTO events (event_id, event_tstamp) VALUES
  (3, '2024-08-01 10:00:00.123'),
  (4, '2024-08-02 10:00:00.123456+04:00');
''')

results = conn.execute('SELECT * FROM events;').fetchdf()
print(results)
conn.close()

   event_id                     event_tstamp
0         1 2024-07-30 17:00:00.123000+00:00
1         2 2024-07-30 10:00:00.123456+00:00
2         3 2024-08-01 10:00:00.123000+00:00
3         4 2024-08-02 06:00:00.123456+00:00


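The `TimeZone` setting affects how naive timestamps are stored: event_id=1 was entered as 10:00 while the session was set to America/Los_Angeles, so it was stored as the instant 17:00 UTC. Timestamps entered with an explicit offset (event_id=2 and event_id=4) do not depend on the setting.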
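The conversion seen for event_id=1 and event_id=2 can be reproduced outside DuckDB. The following is a minimal sketch using only Python's standard library, not DuckDB's implementation. The helper `to_instant` is a hypothetical name, and the fixed offset `PDT` assumes America/Los_Angeles was at UTC-7 on 2024-07-30, as the `-07:00` in the first output indicates.

```python
from datetime import datetime, timedelta, timezone

# Assumption: America/Los_Angeles was on PDT (UTC-7) on 2024-07-30.
PDT = timezone(timedelta(hours=-7))


def to_instant(text: str, session_tz: timezone) -> datetime:
    """Parse text and return it as a UTC instant.

    Naive values are read in session_tz; values with an offset keep it.
    """
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=session_tz)
    return parsed.astimezone(timezone.utc)


event_1 = to_instant("2024-07-30 10:00:00.123", PDT)
event_2 = to_instant("2024-07-30 10:00:00.123456+00:00", PDT)

# Rendered with the session set to UTC
print(event_1.isoformat(sep=" "))  # 2024-07-30 17:00:00.123000+00:00
# Rendered with the session set to America/Los_Angeles
print(event_2.astimezone(PDT).isoformat(sep=" "))  # 2024-07-30 03:00:00.123456-07:00
```

Both printed values match the rows shown for event_id=1 and event_id=2 in the outputs above.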

Next, we will examine the behavior of a table using the TIMESTAMP type. 

In [50]:
conn = duckdb.connect('source.duckdb')
conn.execute('''
SET TimeZone = 'America/Los_Angeles';
CREATE TABLE IF NOT EXISTS events_ntz (
    event_id INTEGER,
    event_tstamp TIMESTAMP
);
DELETE FROM events_ntz;
INSERT INTO events_ntz (event_id, event_tstamp) VALUES
  (1, '2024-07-30 10:00:00.123'),
  (2, '2024-07-30 10:00:00.123456+05:00'),
  (3, '2024-07-30 07:00:00.123'),
  (4, '2024-07-30 10:00:00.123456+07:00');
''')

results = conn.execute('SELECT * FROM events_ntz;').fetchdf()
print(results)
conn.close()

   event_id               event_tstamp
0         1 2024-07-30 10:00:00.123000
1         2 2024-07-30 05:00:00.123456
2         3 2024-07-30 07:00:00.123000
3         4 2024-07-30 03:00:00.123456


We observe that the `TimeZone` setting is ignored for naive timestamps, which are stored as entered (event_id=1 and event_id=3), while timestamps with an offset are converted to UTC and stored without it (event_id=2 and event_id=4).
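The rule observed for the TIMESTAMP column can be written down in a few lines. This is a sketch of the observed behavior using only Python's standard library, not DuckDB's actual code; `to_timestamp_ntz` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_timestamp_ntz(text: str) -> datetime:
    """Mimic the observed TIMESTAMP behavior.

    Naive input is kept as entered; input with an offset is shifted
    to UTC and the offset is then dropped.
    """
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        return parsed
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


inputs = [
    "2024-07-30 10:00:00.123",
    "2024-07-30 10:00:00.123456+05:00",
    "2024-07-30 07:00:00.123",
    "2024-07-30 10:00:00.123456+07:00",
]

stored = [to_timestamp_ntz(text) for text in inputs]
for value in stored:
    print(value)
```

The four printed values correspond to the four rows of `events_ntz` shown above. No session timezone appears anywhere in the function, which is the point: the result depends only on the input text.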

## Experiment 1 - DuckDB to DuckDB

In this experiment, we use dlt to load data from a source DuckDB instance into a new DuckDB instance. The source table `events` contains timestamps of type TIMESTAMPTZ.

The load is performed three times, each with a different value for the `timezone` column hint:

- `timezone` hint unset (`None`)
- `timezone` hint set to `True` (on)
- `timezone` hint set to `False` (off)

In [1]:
import dlt
import duckdb

# Fetch data
conn = duckdb.connect('source.duckdb')
source_df = conn.execute('SELECT * FROM events;').fetchdf()
conn.close()

print("Source data:")
print(source_df)

pipelines = {
    "duckunset": None,
    "duckon": True,
    "duckoff": False
}

for p in pipelines.keys():
    
  # run pipeline
  pipeline = dlt.pipeline(
    pipeline_name=p,
    destination='duckdb',
  )

  pipeline.run(source_df.to_dict(orient="records"),write_disposition="replace",table_name='events',
               columns=[{"name": "event_tstamp", "data_type": "timestamp", "timezone": pipelines[p]}])

  # fetch results
  conn = duckdb.connect(f'{p}.duckdb')
  
  result = conn.execute(f'''
    SET TimeZone = 'America/Los_Angeles';
    SELECT event_id,event_tstamp FROM {p}_dataset.events;
  ''').fetchdf()

  describe = conn.execute(f'DESCRIBE {p}_dataset.events').fetchdf()

  conn.close()

  print(f"""
    Results for - {p}

    {result}

    DESCRIBE destination table:
    
    {describe}
    ----
  """)

  

Source data:
   event_id                     event_tstamp
0         1 2024-07-30 19:00:00.123000+02:00
1         2 2024-07-30 12:00:00.123456+02:00
2         3 2024-08-01 12:00:00.123000+02:00
3         4 2024-08-02 08:00:00.123456+02:00

    Results for - duckunset

       event_id               event_tstamp
0         1 2024-07-30 17:00:00.123000
1         2 2024-07-30 10:00:00.123456
2         3 2024-08-01 10:00:00.123000
3         4 2024-08-02 06:00:00.123456

    DESCRIBE destination table:
    
        column_name column_type null   key default extra
0  event_tstamp   TIMESTAMP  YES  None    None  None
1      event_id      BIGINT  YES  None    None  None
2  _dlt_load_id     VARCHAR   NO  None    None  None
3       _dlt_id     VARCHAR   NO  None    None  None
    ----
  

    Results for - duckon

       event_id                     event_tstamp
0         1 2024-07-30 10:00:00.123000-07:00
1         2 2024-07-30 03:00:00.123456-07:00
2         3 2024-08-01 03:00:00.123000-07:00
3  

We observe that the timezone column hint from dlt influences whether the destination timestamp type is TIMESTAMP or TIMESTAMP WITH TIME ZONE. 

## Experiment 2 - DuckDB to Parquet

In this experiment, we use dlt to load data from a source DuckDB instance into Parquet files. The source table `events` contains timestamps of type TIMESTAMPTZ.

The load is performed three times, each with a different value for the `timezone` column hint:

- `timezone` hint unset (`None`)
- `timezone` hint set to `True` (on)
- `timezone` hint set to `False` (off)

In [28]:
import dlt
import duckdb
import pyarrow.parquet as pq
import posixpath
import os 

os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "_storage"

# Fetch data
conn = duckdb.connect('source.duckdb')
source_df = conn.execute('''
SET TimeZone = 'America/Los_Angeles';
SELECT * FROM events;
''').fetchdf()
conn.close()

pipelines = {
    "parquettimezoneunset": None,
    "parquettimezoneon": True,
    "parquettimezoneoff": False
}

for p in pipelines.keys():

    # run pipeline
    pipeline = dlt.pipeline(
        pipeline_name=p,
        destination='filesystem',
    )

    pipeline.run(source_df.to_dict(orient="records"),loader_file_format="parquet",write_disposition="replace",table_name='events',
                 columns=[{"name": "event_tstamp", "data_type": "timestamp", "timezone": pipelines[p]}])

    # fetch results
    client = pipeline.destination_client()  # type: ignore[assignment]

    events_glob = posixpath.join(client.dataset_path, "events/*")
    events_files = client.fs_client.glob(events_glob)

    with open(events_files[0], "rb") as f:
        table = pq.read_table(f)

        df = table.to_pandas()

        print(f"""
Results for - {p}

{df[["event_tstamp","event_id"]]}

DESCRIBE destination table:

{table.schema}
----
        """)



Results for - parquettimezoneunset

                      event_tstamp  event_id
0 2024-07-30 17:00:00.123000+00:00         1
1 2024-07-30 10:00:00.123456+00:00         2
2 2024-08-01 10:00:00.123000+00:00         3
3 2024-08-02 06:00:00.123456+00:00         4

DESCRIBE destination table:

event_tstamp: timestamp[us, tz=UTC]
event_id: int64
_dlt_load_id: string not null
_dlt_id: string not null
----
        

Results for - parquettimezoneon

                      event_tstamp  event_id
0 2024-07-30 17:00:00.123000+00:00         1
1 2024-07-30 10:00:00.123456+00:00         2
2 2024-08-01 10:00:00.123000+00:00         3
3 2024-08-02 06:00:00.123456+00:00         4

DESCRIBE destination table:

event_tstamp: timestamp[us, tz=UTC]
event_id: int64
_dlt_load_id: string not null
_dlt_id: string not null
----
        

Results for - parquettimezoneoff

                event_tstamp  event_id
0 2024-07-30 17:00:00.123000         1
1 2024-07-30 10:00:00.123456         2
2 2024-08-01 10:00:00.123

We observe that the timezone column hint from dlt influences whether the destination schema timestamp type has the timezone (tz) set or unset.

### Conclusion

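The dlt `timezone` column hint can be used to change the destination timestamp type: in DuckDB it influences whether the column is TIMESTAMP or TIMESTAMP WITH TIME ZONE, and in Parquet it influences whether the schema's timestamp type has the timezone (tz) set.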

As the next step, I suggest testing Snowflake (TZ) to Snowflake (NTZ) replication. The Snowflake TIMESTAMP_NTZ type appears to behave differently from other databases, completely ignoring the timezone data from timestamps.