Odd parsing of date time with timezone and offset #968

Closed
shakhat opened this issue Oct 15, 2019 · 2 comments
Comments

@shakhat

shakhat commented Oct 15, 2019

A timezone offset is often written as UTC or GMT plus or minus a number of hours, e.g. GMT+02:00 for Central European Summer Time. UTC is obtained by subtracting the offset from the local time; conversely, the local time is obtained by adding the offset to UTC.
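The arithmetic described above can be sketched with the Python standard library alone. This is only an illustration of the "local = UTC + offset" reading; the variable names are made up for the example and nothing here involves dateutil.

```python
from datetime import datetime, timedelta, timezone

# GMT+02:00 read as "local time is two hours ahead of UTC".
cest = timezone(timedelta(hours=2))

local = datetime(2019, 10, 15, 17, 0, 12, tzinfo=cest)

# UTC = local time - offset
utc = local.astimezone(timezone.utc)
print(utc.isoformat())         # 2019-10-15T15:00:12+00:00

# local time = UTC + offset
round_trip = utc.astimezone(cest)
print(round_trip.isoformat())  # 2019-10-15T17:00:12+02:00
```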

For example, this is how JavaScript handles it:

> console.log(new Date('2019-10-15T15:00:12.220Z'));
Tue Oct 15 2019 17:00:12 GMT+0200 (Central European Summer Time)

But dateutil interprets the timezone offset in exactly the opposite way:

> dateutil.parser.parse('Tue Oct 15 2019 17:00:12 GMT+0200').astimezone(timezone.utc)
datetime.datetime(2019, 10, 15, 19, 0, 12, tzinfo=datetime.timezone.utc)
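To make the difference concrete, here is a small standard-library sketch that applies both sign conventions to the same wall-clock time. The helper `to_utc` is hypothetical and written only for this comparison; it does not call dateutil, it just shows that reading "+0200" with the sign flipped reproduces the 19:00 result above.

```python
from datetime import datetime, timedelta, timezone

wall_time = datetime(2019, 10, 15, 17, 0, 12)  # naive wall-clock time from the string

def to_utc(naive, offset_hours):
    """Attach a fixed UTC offset to a naive datetime and convert it to UTC."""
    tz = timezone(timedelta(hours=offset_hours))
    return naive.replace(tzinfo=tz).astimezone(timezone.utc)

# "+0200" means local time is 2 hours ahead of UTC (what JavaScript printed).
ahead_of_utc = to_utc(wall_time, +2)
# "+0200" means "add 2 hours to local time to get GMT" (the inverted reading).
inverted = to_utc(wall_time, -2)

print(ahead_of_utc.isoformat())  # 2019-10-15T15:00:12+00:00
print(inverted.isoformat())      # 2019-10-15T19:00:12+00:00
```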

According to the code comments, this behaviour is by design (https://github.com/dateutil/dateutil/blame/master/dateutil/parser/_parser.py#L803). But shouldn't it be changed to conform to the interpretation that has become the standard?

@pganssle
Member

pganssle commented Nov 2, 2019

See #70. The current way is a standard, and it's triggered by the offset being in the form that more closely matches that standard. In the future, we will likely add an option to invert that parsing logic.

@yohplala

Hello,
I still run into this problem.
I am parsing dates formatted like this (stored in a list):

14    Sun Oct 27 2019 02:00:00 GMT+0200
15    Sun Oct 27 2019 02:00:00 GMT+0100
16    Sun Oct 27 2019 03:00:00 GMT+0100

(As you may notice, the data come from a timezone that observes DST, which is how I spotted the problem.)

I add them to a dataframe as datetime objects:
GC['date'] = pd.to_datetime(my_timestamps)

As a result, I get:

14    2019-10-27 02:00:00-02:00
15    2019-10-27 02:00:00-01:00
16    2019-10-27 03:00:00-01:00

If you apply the offset directly with utc=True, you will see that rows 14 and 16 become the same timestamp:
GC['date'] = pd.to_datetime(my_timestamps, utc=True)

14   2019-10-27 04:00:00+00:00
15   2019-10-27 03:00:00+00:00
16   2019-10-27 04:00:00+00:00

I noticed the problem because I want these timestamps to be the index of my dataframe. I enabled 'verify_integrity', which detects these duplicates.

GC.set_index('date', inplace = True, verify_integrity = True)
ValueError: Index has duplicate keys: Index([2019-10-27 02:00:00-01:00, 2019-10-27 03:00:00-01:00], dtype='object', name='date')

I am sorry that I give only pandas commands here; I first opened this as a bug in their tracker:
pandas-dev/pandas#30518

They use dateutil and could confirm that they see the same problem.
They closed the issue and asked me to re-open it here.

I can see I am not the first to notice this "reversed" offset problem.
Please, if there is no consensus on how the offset should be applied, could you leave the choice to the coder through an option in the parser?

Also, I am quite a newbie, sorry: what would you recommend to work around this problem?

I thank you in advance for your help.
Have a good day,
Bests,
Pierre
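One possible workaround, sketched here under assumptions rather than offered as the maintainers' recommendation, is to bypass the generic parser and parse the strings with an explicit format using the standard library. The format string and the helper name `parse_js_date` are assumptions based on the three sample rows above; `%a` and `%b` also assume English day and month names (the default C locale). With `%z`, "+0200" is read as two hours ahead of UTC.

```python
from datetime import datetime, timezone

# Assumed layout, taken from the sample rows: "Sun Oct 27 2019 02:00:00 GMT+0200"
JS_DATE_FORMAT = "%a %b %d %Y %H:%M:%S GMT%z"

def parse_js_date(text):
    """Parse a JavaScript-style date string; +HHMM means ahead of UTC."""
    return datetime.strptime(text, JS_DATE_FORMAT)

my_timestamps = [
    "Sun Oct 27 2019 02:00:00 GMT+0200",
    "Sun Oct 27 2019 02:00:00 GMT+0100",
    "Sun Oct 27 2019 03:00:00 GMT+0100",
]

parsed_utc = [parse_js_date(s).astimezone(timezone.utc) for s in my_timestamps]
for ts in parsed_utc:
    print(ts.isoformat())
# 2019-10-27T00:00:00+00:00
# 2019-10-27T01:00:00+00:00
# 2019-10-27T02:00:00+00:00
```

With this reading the three rows map to three distinct UTC instants, so they should no longer collide when used as an index. The resulting list of timezone-aware datetimes can then be passed to pd.to_datetime(parsed_utc, utc=True).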
