Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change timezone serialization logic #16631

Closed
wants to merge 4 commits into from

Conversation

ecerulm
Copy link
Contributor

@ecerulm ecerulm commented Jun 24, 2021

Closes #16613

Current dag serialization logic assumes the the dag.timezone is a pendulum.Timezone and more specifically a "named" timezone (FixedTimezone will fail to serialize for example) and not just a datetime.tzinfo.

The proposed change will work with named timezones pendulum.tz.timezone.Timezone, fixed offset timezones pendulum.tz.timezone.FixedTimezone and "regular" datetime.tzinfo implementations.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@potiuk
Copy link
Member

potiuk commented Jun 24, 2021

Nice! LGTM, I'd love others take a look as well though.

@github-actions
Copy link

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 24, 2021
except InvalidTimezone:
pass
try:
return pendulum.parse("2021-01-01T12:00:00" + tz_name).tzinfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more than slightly odd -- when is this case hit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From any of the new cases provided in the tests_dag_serialization.py:test_deserialization_start_date()

But as an example if the DAG start_date is pendulum.parse("2019-08-01T00:00:00.000+01:00" then DAG timezone will be a pendulum.tz.timezone.FixedOffset with name +01:00 so that it's what it will be serialized to.

Now in the deserialization step pendulum.timezone('+01:00') will raise a InvalidTimezone because that only work for "named" timezones (and not all I must add). See sdispater/pendulum#554

So if we want to parse +01:00 back to a timezone object, pendulum.timezone() won't do it, we need to "hack it" by creating a "fake" iso 8601 string with the +01:00 and using pendulum.parse() to construct a FixedTimezone() object for us.

I couldn't find any other method in pendulum's to convert the string '+01:00 to a FixedTimezone/Timezone/datetime.tzinfo

@ashb ashb added this to the Airflow 2.1.2 milestone Jun 24, 2021
except InvalidTimezone:
pass
try:
return pendulum.parse("2021-01-01T12:00:00" + tz_name).tzinfo
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From any of the new cases provided in the tests_dag_serialization.py:test_deserialization_start_date()

But as an example if the DAG start_date is pendulum.parse("2019-08-01T00:00:00.000+01:00" then DAG timezone will be a pendulum.tz.timezone.FixedOffset with name +01:00 so that it's what it will be serialized to.

Now in the deserialization step pendulum.timezone('+01:00') will raise a InvalidTimezone because that only work for "named" timezones (and not all I must add). See sdispater/pendulum#554

So if we want to parse +01:00 back to a timezone object, pendulum.timezone() won't do it, we need to "hack it" by creating a "fake" iso 8601 string with the +01:00 and using pendulum.parse() to construct a FixedTimezone() object for us.

I couldn't find any other method in pendulum's to convert the string '+01:00 to a FixedTimezone/Timezone/datetime.tzinfo

pendulum.datetime(2019, 8, 1, tz=FixedTimezone(3600)),
None,
pendulum.datetime(2019, 8, 1, tz=FixedTimezone(3600)),
),
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashb , any of this added cases with offset timezones "+01:00", FixedTimezone(3600), etc will raise an InvalidTimezone exception in the SerializedDAG._deserialize_timezone() at the pendulum.timezone(xxx).

Those timezones serialize to the timezone name which is +01:00,etc and pendulum.timezone('+01:00') raises InvalidTimezone because that method is meant to resolve "named" timezones only (so only works with UTC, Europe/Stockholm, EST, etc )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this works:

In [13]: pendulum.timezone(3600)                                                                                                                       
Out[13]: Timezone('+01:00')

Maybe we need to change the serialization format to accept string or int -- I think that might be less hacky.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe coupled with tz_name in pendulum.tz.timezones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try but I really that will add some very pendulum-specific logic.

I mean, it will involve checking if this is a pendulum.tz.timezone.FixedTimezone and serializing to int if so, there is no "standard" good way of checking if a given datetime.tzinfo is a simple timezone (fixed offset only) or a full fledged timezone (with arbitrarily complex dst rules, etc).

In principle the user can provide a datetime.tzinfo of it's own with name RubenTZ and that will fail the pendulum.timezone('RubenTZ') but doesn't mean that we can just serialize to a fixedoffset (RubenTZ is much more complex)

What I'm trying to say is that is going to endup being very hackish anyway, but in the serializing part instead of the deserializing part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe coupled with tz_name in pendulum.tz.timezones?

I'm not sure what you mean , but if you mean to use .tzname() instead of .name I can already tell you that is even more brittle.

ecerulm and others added 2 commits June 24, 2021 14:37
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>
@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 24, 2021

Just for clarification,

The whole pendulum.tz.timezone.xxx behaviour is a bit confusing. The current codebase assumes that if you get a timezone "name" (with .name) you can reconstruct a datetime.tzinfo with pendulum.timezone(that_name) which is only true in certain conditions...

  • for many "named" timezones like UTC, Europe/Stockholm, CET,etc this holds true and that's what people had to use before in airflow.
  • for all offset timezones like +01:00, etc this is not true and it will raise InvalidTimezone

Besides that, there are some other pendulum/datetime.tzinfo oddities that prevent some simplications.

  • you can't just use .tzname(), because you need an actual datetime in addition to the timezone to get the name.
  • even if you just took the dag.start_date + dag.timezone (or the dag.start_date.tzinfo), tzname() can give back a named timezone like CEST that cannot be converted back to an actual datetime.tzinfo object by pendulum.timezone) .
start_date = pendulum.datetime(2021,1,1,tzinfo=timezone.utc) # as dag start_date is converted to utc
timezone = pendulum.timezone("Europe/Stockholm") # dag.timezone keeps the "original" timezone that came in the start_date argument to the constructor
timezone.tzname(start_date) # returns 'CEST'
pendulum.timezone(timezone.tzname(start_date)) # raises InvalidTimezone

@ashb
Copy link
Member

ashb commented Jun 24, 2021

Man I hate timezones.

@potiuk
Copy link
Member

potiuk commented Jun 24, 2021

Man I hate timezones.

Quite agree :)

@ashb
Copy link
Member

ashb commented Jun 24, 2021

Random timezone factoid:

Ireland and the UK aren't in the same timezone Ireland is technically in Central European Time, but instead of Summer time, they Have Winter time.

https://danq.me/2021/05/11/ireland-timezone/

🤯

@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 24, 2021

Man I hate timezones

Who doesn't ?

I was just focused on getting rid of the exceptions when you use +01:00, but I guess there is a bigger discussion to be had regarding the timezone handling in airflow. I believe that in part the problem is that airflow tries to "remember" what was the original timezone, where is reality the approach IMHO should be to accept timezone-aware datetimes, or convert to timezone-aware if a naive datetime is provided but don't try to hold on to the originally specified timezone. It's should be a GUI decision how to present the datetime to the user.

  • For CLI, present in in the time is whatever the user selected as their default timezone (/etc/timezone, ) so that would be "local" timezone
  • For HTML GUI, get the preferred timezone for the browser js api , but also maybe have a dropdown menu or something so that the user can see the dates in some other timezone .

But in general get rid of "remembering" what was the original datetime's timezone.

@potiuk
Copy link
Member

potiuk commented Jun 24, 2021

Eucla: UTC +8:45
Standard time: Australian Central Western Standard Time (ACWST)
Daylight time: None
Example city: Eucla

😱

@ashb
Copy link
Member

ashb commented Jun 24, 2021

For HTML GUI, get the preferred timezone for the browser js api , but also maybe have a dropdown menu or something so that the user can see the dates in some other timezone .

This is what we now do (as of 1.10.....10?)

@ashb
Copy link
Member

ashb commented Jun 24, 2021

DAGs do need a timezone, as sometimes often you want a DAG to run at midnight "local" time, not midnight UTC.

@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 24, 2021

DAGs do need a timezone, as sometimes often you want a DAG to run at midnight "local" time, not midnight UTC.

If that's the only purpose let me explore the possibility of serializing the timezone as a numeric offset (seconds from utc). Unless there is some interaction with the DST changes. I'll do some tests

EDIT: forget I said that, obviously most timezone are more that just an offset so that's not going to work. I guess if it has a name we need to look it up. And we should check at DAG.init that the timezone is indeed serializable/deserializable to a pendulum.timezone

@ashb
Copy link
Member

ashb commented Jun 24, 2021

Unless there is some interaction with the DST changes

There is yes -- if I make the dag run at midnight "Europe/London" it should follow DST.

Comment on lines +202 to +203
tz_name = getattr(var, 'name', None)
if not tz_name:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also use tzname for pendulum Timezone as well? It’s a standard method that should always yield something useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a standard method that should always yield something useful

Well, not always with pendulum as explained in #16631 (comment) , pendulum.timezone('Europe/Stockholm').tzname(...) can return CEST which can't be converted back with pendulum.timezone('CEST') as it raises InvalidTimezone

I did try that at some point. The whole concept of datetime.tzinfo -> str -> datetime.tzinfo works sometimes but there is no really hard guarantee. So that's why I use the .name if present (which marginally increases the chances of success serialization/deserialization)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gosh we really should drop Pendulum at some point I guess.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell there is no good timezone library in python really :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no good timezone libraries, in general 🙂 Personally I think the new zoneinfo is as good as it can be.

airflow/serialization/serialized_objects.py Outdated Show resolved Hide resolved
airflow/serialization/serialized_objects.py Show resolved Hide resolved
@ecerulm
Copy link
Contributor Author

ecerulm commented Jun 26, 2021

@ashb I decided to do another PR with the solve it at serialization time instead of deserialization time at #16678, I guess it will be easier to compare them this way.

@ecerulm ecerulm closed this Jun 29, 2021
@ashb ashb modified the milestones: Airflow 2.1.2, Airflow 2.1.3 Jul 7, 2021
@kaxil kaxil removed this from the Airflow 2.1.3 milestone Aug 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:serialization full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

InvalidTimezone exception if DAG's start_date timezone is "+00:00"
5 participants