Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent date offsets with timezone handling #11979

Closed
2 tasks done
paultiq opened this issue May 8, 2024 · 16 comments
Closed
2 tasks done

Inconsistent date offsets with timezone handling #11979

paultiq opened this issue May 8, 2024 · 16 comments

Comments

@paultiq
Copy link

paultiq commented May 8, 2024

What happens?

I noticed pytz mentioned in a recent issue and noted the use of localize() matched the issues raised in this blog post (that comes up every time pytz is mentioned): https://blog.ganssle.io/articles/2018/03/pytz-fastest-footgun.html.

import duckdb
import pytz
from datetime import datetime
conn=duckdb.connect()

result = conn.execute("select '2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ, '2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ + interval 60 day  as base_date").df()

from datetime import timedelta

python_plus60 = result.iloc[0, 0] + timedelta(days=60)
duckdb_plus60 = result.iloc[0, 1]

print(f"{python_plus60} vs {duckdb_plus60}")

Output:

  • "2018-04-15 13:00:00-04:00" vs "2018-04-15 12:00:00-04:00"

May be related to #9431 ?

To Reproduce

given above

OS:

Windows

DuckDB Version:

10.2 and duckdb-0.10.3.dev781

DuckDB Client:

Python

Full Name:

Paul Timmins

Affiliation:

Iqmo

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@soerenwolfers
Copy link

Isn't duckdb's duckdb_plus60 the correct answer according to the linked blog post?

@soerenwolfers
Copy link

soerenwolfers commented May 8, 2024

If your problem is that python_plus60 gives the "wrong" result, then that is not a duckdb issue, since it can be reproduced as follows

import datetime 
import pandas
import dateutil 

print(pandas.Timestamp("2018-02-14 12:00:00-05:00") + datetime.timedelta(days=60))

print(pandas.Timestamp("2018-02-14 12:00:00", tz="America/New_York") + datetime.timedelta(days=60))

print(pandas.Timestamp("2018-02-14 12:00:00", tz=dateutil.tz.gettz("America/New_York")) + datetime.timedelta(days=60))
2018-04-15 12:00:00-05:00

2018-04-15 13:00:00-04:00

2018-04-15 13:00:00-04:00

@paultiq
Copy link
Author

paultiq commented May 8, 2024

Isn't duckdb's duckdb_plus60 the correct answer according to the linked blog post?

I interpret as the opposite: that the correct timezone aware offset is the one given by the python. And that duckdb is returning the non normalized version. But, let me mull over the pandas version you sent!

@paultiq
Copy link
Author

paultiq commented May 8, 2024

The first example (UTC-05:00) isn't reproducing, since it's a fixed offset w/ no daylight savings component.

The behavior I'm expecting is given by your second and third examples:

  • print(pandas.Timestamp("2018-02-14 12:00:00", tz="America/New_York") + datetime.timedelta(days=60))
  • 2018-04-15 13:00:00-04:00

@hawkfish
Copy link
Contributor

hawkfish commented May 8, 2024

What time zone are you in?

@soerenwolfers
Copy link

soerenwolfers commented May 8, 2024

Could you summarize what you think is wrong in duckdb, at the moment? (I still don't think anything is wrong.)

Also, note that your results depend on duckdb's timezone setting, so you should explicitly set that in your example via

conn.query("SET TimeZone='America/New_York'")

If you use UTC instead, you get different results (*):

import duckdb

conn=duckdb.connect()
conn.query("SET TimeZone='UTC'")
query = "SELECT ('2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ + interval 60 day) - '2018-04-15 17:00:00+00'::TIMESTAMPTZ"
print(conn.query(query))
conn.query("SET TimeZone='America/New_York'")
print(conn.query(query))

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ((CAST('2018-02-14 12:00:00 America/New_York' AS TIMESTAMP WITH TIME ZONE) + to_days(CAST(trunc(CAST(60 AS DOUBLE)…  │
│                                                       interval                                                       │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 00:00:00                                                                                                             │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ ((CAST('2018-02-14 12:00:00 America/New_York' AS TIMESTAMP WITH TIME ZONE) + to_days(CAST(trunc(CAST(60 AS DOUBLE)…  │
│                                                       interval                                                       │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ -01:00:00                                                                                                            │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

(*) and I believe correctly so: a sensible definition of TIMESTAMPTZ + INTERVAL is: (1) convert the TIMESTAMPTZ to TIMESTAMP (i.e.: how would someone in the configured timezone write down the point in time specified by the given TIMESTAMPTZ) (2) add the interval (3) convert to TIMESTAMPTZ (i.e., what point in time would be denoted by someone in the configured timezone with the given TIMESTAMP)

@hawkfish
Copy link
Contributor

hawkfish commented May 8, 2024

from duckdb_settings() where name='TimeZone';
name value description input_type scope
TimeZone America/Los_Angeles The current time zone VARCHAR GLOBAL
select '2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ as d, '2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ + interval 60 day  as base_date;
d base_date
2018-02-14 09:00:00-08 2018-04-15 09:00:00-07

This looks correct for me (at least this is what ICU does when you cross a DST boundary: preserve wall clock time).

@hawkfish
Copy link
Contributor

hawkfish commented May 8, 2024

Pandas may do something different here, but what adding 60 days across a DST boundary should return is a matter for theologians really.

I'd also like to point out that pandas uses pytz, which is deprecated(!) and suffers from the Y2038 problem (something one of my clients tripped over the other day when doing long range forecasting.)

@paultiq
Copy link
Author

paultiq commented May 8, 2024

Pandas may do something different here, but what adding 60 days across a DST boundary should return is a matter for theologians really.

I'd also like to point out that pandas uses pytz, which is deprecated(!) and suffers from the Y2038 problem (something one of my clients tripped over the other day when doing long range forecasting.)

I'll answer the other questions this evening, but doesn't DuckDB also use pytz? Or am I misreading the discussion here: #11974

@hawkfish
Copy link
Contributor

hawkfish commented May 8, 2024

No, we use ICU internally (which doesn't have these problems). We only use pytz when producing pandas output because that its what pandas requires.

@paultiq
Copy link
Author

paultiq commented May 8, 2024

No, we use ICU internally (which doesn't have these problems). We only use pytz when producing pandas output because that its what pandas requires.

Yup, that's what I was referring to. I was expecting it (a timezone aware date plus an interval 60 day) to behave the same as the second example above.

I haven't (yet) had a chance to review your subsequent comments which seems to contradict the examples I had at the beginning.

@hawkfish
Copy link
Contributor

hawkfish commented May 8, 2024

Not sure what you meant by "the second example above"?

It sounds to me like you are expecting the date arithmetic to add exactly 60 days to the instant, which is what pandas does? ICU does not behave that way, so duckdb does not either. Moreover, this is not what Postgres does:

hawkfish=# select '2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ, '2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ + interval '60 day'  as base_date;
      timestamptz       |       base_date        
------------------------+------------------------
 2018-02-14 09:00:00-08 | 2018-04-15 09:00:00-07
(1 row)

(BTW I am in America/Los_Angeles, which is why the displayed offsets are different.)

@hawkfish
Copy link
Contributor

hawkfish commented May 8, 2024

Yes that does what @paultiq is expecting:

select 
    '2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ d, 
    '2018-02-14 12:00:00 America/New_York'::TIMESTAMPTZ + interval (24 * 60) hours  as base_date;
d base_date
2018-02-14 09:00:00-08 2018-04-15 10:00:00-07

@soerenwolfers
Copy link

Sorry, didn't meant to make you look talking to yourself; deleted my comment so I could check first whether it actually works because I was on my phone at the time. Anyway, glad it works, so duckdb doesn't only do the theologically right thing from the perspective of the blog post, but also offers the heretic option in a simple way.

Only sad thing to be taken from this is that duckdb can't currently do "add X wall clock days to timestamptz Y from the perspective of timezone Z" with any Z other than the one on the settings (in particular not with a Z that's variable across a table). Not that that's surprising given that TIMESTAMPTZ doesn't store timezones, and not that I can see myself needing it anytime soon, but potentially something for a future separate feature request. (Which would strictly speaking also require specifying what to do if the result lands in the midst of a daylight savings fold or a gap, so really should be a request for a function add(TIMESTAMPTZ, INTERVAL, TIMEZONE, FOLD, GAP) and there might be more complications I haven't thought about. )

@paultiq
Copy link
Author

paultiq commented May 9, 2024

Yes that does what @paultiq is expecting: ...

Yes, thanks... that's the part I was hung up on. That this version is (24*60 - 1) hours, but the Pandas version is 24*60 hours.

I totally accept this is a. a question for the theologians and b. that this is how icu works. So, feel free to close as "as designed" :) Thanks for the discussion.

import duckdb

conn=duckdb.connect()
conn.query("SET TimeZone='UTC'")
query = "SELECT '2018-02-14 23:00:00 America/New_York'::TIMESTAMPTZ start_date , '2018-02-14 23:00:00 America/New_York'::TIMESTAMPTZ + interval 60 day end_date"
print(conn.query(query))
conn.query("SET TimeZone='America/New_York'")
print(conn.query(query))
│        start_date        │         end_date         │
│ timestamp with time zone │ timestamp with time zone │
├──────────────────────────┼──────────────────────────┤
│ 2018-02-14 23:00:00-05   │ 2018-04-15 23:00:00-04   │
└──────────────────────────┴──────────────────────────┘

Yet this version:

import datetime 
import pandas
pts = pandas.Timestamp("2018-02-14 23:00:00", tz="America/New_York")
print(pts)
print(pts + datetime.timedelta(days=60))

is ```
2018-02-14 23:00:00-05:00
2018-04-16 00:00:00-04:00

@soerenwolfers
Copy link

(You can close issues yourself as "won't do')

@paultiq paultiq closed this as not planned Won't fix, can't repro, duplicate, stale May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants