Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-288] Make airflow timezone aware #2781

Merged
merged 10 commits into from Nov 27, 2017

Conversation

Projects
None yet
10 participants
@bolkedebruin
Copy link
Contributor

bolkedebruin commented Nov 11, 2017

Dear Airflow maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

To be updated.

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

cc @criccomini @jgao54

Dependent on: sdispater/pendulum#161

@bolkedebruin bolkedebruin force-pushed the bolkedebruin:AIRFLOW-1802 branch 10 times, most recently from f82c042 to d6634d0 Nov 11, 2017

"""
Calculates the following schedule for this dag in local time
:param dttm: utc datetime
:return: utc datetime

This comment has been minimized.

Copy link
@ashb

ashb Nov 13, 2017

Member

Docstring says local time, return type says UTC (which I think is what the code returns).

@@ -33,7 +33,7 @@ class DateTimeWithNumRunsForm(Form):
# Date time and number of runs form for tree view, task duration
# and landing times
base_date = DateTimeField(
"Anchor date", widget=DateTimePickerWidget(), default=datetime.utcnow())
"Anchor date", widget=DateTimePickerWidget(), default=timezone.utcnow())

This comment has been minimized.

Copy link
@ashb

ashb Nov 13, 2017

Member

I realise we're probably at the "making it work" stage, but the forms might want to use DatePickers etc in local time.

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 13, 2017

Author Contributor

I consider that out of scope for the change I am doing.

@jgao54

This comment has been minimized.

Copy link
Contributor

jgao54 commented Nov 14, 2017

I ran this locally with mysql backend:

  • dbupgrade worked smoothly (yay)
  • created a new dag with localized start_date (i.e. datetime(2017, 10, 20, 0, 0, tzinfo=timezone.utc)). Dagrun in running state but task isn't running.
  • created a dag with naive start_date (i.e. 'datetime.datetime(2017, 10, 20, 0, 0)) and got the following error message:
[2017-11-13 17:59:11,456] {models.py:4301} ERROR - (_mysql_exceptions.OperationalError) (1292, "Incorrect datetime value: '2017-10-20T00:00:00+00:00' for column 'execution_date' at row 1") [SQL: 'INSERT INTO dag_run (dag_id, execution_date, start_date, end_date, state, run_id, external_trigger, conf) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'] [parameters: ('foo', <Pendulum [2017-10-20T00:00:00+00:00]>, datetime.datetime(2017, 11, 14, 1, 59, 11, 431819, tzinfo=datetime.timezone.utc), None, 'running', 'scheduled__2017-10-20T00:00:00+00:00', 0, None)]
Traceback (most recent call last):
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/MySQLdb/cursors.py", line 250, in execute
    self.errorhandler(self, exc, value)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/MySQLdb/connections.py", line 50, in defaulterrorhandler
    raise errorvalue
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/MySQLdb/cursors.py", line 247, in execute
    res = self._query(query)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/MySQLdb/cursors.py", line 411, in _query
    rowcount = self._do_query(q)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/MySQLdb/cursors.py", line 374, in _do_query
    db.query(q)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/MySQLdb/connections.py", line 292, in query
    _mysql.connection.query(self, query)
_mysql_exceptions.OperationalError: (1292, "Incorrect datetime value: '2017-10-20T00:00:00+00:00' for column 'execution_date' at row 1")

Looks like mysqldb no longer knows how to handle the datetime when it's in a pendulum object.

@bolkedebruin

This comment has been minimized.

Copy link
Contributor Author

bolkedebruin commented Nov 14, 2017

@jgao54 Thanks! Can you verify if the database migration was done correctly? The naive datetimes (ie. before migration) should be the same after migration only with the UTC offset added.

I have pushed a fix for the conversion error (I am Postgres focused at the moment), see also:

@jgao54

This comment has been minimized.

Copy link
Contributor

jgao54 commented Nov 14, 2017

@bolkedebruin correct, I didn't see any issues there.

@bolkedebruin bolkedebruin force-pushed the bolkedebruin:AIRFLOW-1802 branch from 6a54120 to e466559 Nov 17, 2017

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Nov 17, 2017

Codecov Report

Merging #2781 into master will increase coverage by 0.21%.
The diff coverage is 86.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2781      +/-   ##
==========================================
+ Coverage   73.62%   73.84%   +0.21%     
==========================================
  Files         158      159       +1     
  Lines       11962    12076     +114     
==========================================
+ Hits         8807     8917     +110     
- Misses       3155     3159       +4
Impacted Files Coverage Δ
airflow/ti_deps/deps/runnable_exec_date_dep.py 100% <100%> (ø) ⬆️
airflow/operators/latest_only_operator.py 90% <100%> (ø) ⬆️
airflow/ti_deps/deps/not_in_retry_period_dep.py 100% <100%> (ø) ⬆️
airflow/api/common/experimental/trigger_dag.py 95.83% <100%> (ø) ⬆️
airflow/www/api/experimental/endpoints.py 89.23% <100%> (ø) ⬆️
airflow/api/common/experimental/mark_tasks.py 82.97% <100%> (ø) ⬆️
airflow/operators/sensors.py 67.7% <100%> (ø) ⬆️
airflow/operators/dagrun_operator.py 96.66% <100%> (ø) ⬆️
airflow/www/forms.py 100% <100%> (ø) ⬆️
airflow/www/utils.py 86.33% <100%> (+0.51%) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59aba30...f1ab56c. Read the comment docs.

@bolkedebruin bolkedebruin force-pushed the bolkedebruin:AIRFLOW-1802 branch from e466559 to daa1072 Nov 17, 2017

@bolkedebruin

This comment has been minimized.

Copy link
Contributor Author

bolkedebruin commented Nov 17, 2017

@jgao54 Ready for testing. You should have running tasks now (tested ;-), this was a DB issue.

cc: @aoen @mistercrunch @ashb @Fokko @saguziel @gwax @artwr

I am writing documentation as we 'speak'.

@jgao54

This comment has been minimized.

Copy link
Contributor

jgao54 commented Nov 20, 2017

@bolkedebruin thanks for all the hard work, I will get to it sometimes today!

Templates
'''''''''

Airflow returns time zone aware datetimes in templates, but does not convert them to local time so they remain in UTC.

This comment has been minimized.

Copy link
@jgao54

jgao54 Nov 21, 2017

Contributor

does the template here refers to the datetime objects in get_template_context()?

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 21, 2017

Author Contributor

Yep.

import sqlalchemy as sa


def upgrade():

This comment has been minimized.

Copy link
@jgao54

jgao54 Nov 21, 2017

Contributor

Given that (in the dag_run table) run_id = id_prefix + execution_date, I wonder if ideally run_id should also be updated to reflect the new time in utc. Given that run_id is never parsed again, the only potential confusion is that there's a bit of semantic inconsistency with future generated run_ids

This comment has been minimized.

Copy link
@jgao54

jgao54 Nov 21, 2017

Contributor

Checked the schema and this field has a unique constraint, so it may potentially cause collision with new run_ids created after the migration.

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

Run_id will be in UTC by definition as execution_date is by definition time zone aware

This comment has been minimized.

Copy link
@jgao54

jgao54 Nov 22, 2017

Contributor

ah you are right, I was running Airflow locally with system clock (PST) before upgrading db (so the execution_date and run_id didn't match in the same row after the db upgrade).

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

Can you be a bit more clear what happened? Although you were running with PST the upgrade should happen with UTC. ie 2017-01-01 01:00:00 should just become 2017-01-01 01:00:00+00:00. For run_ids this just means it becomes something like scheduled__2017-01-01 01:00:00+00:00.

This comment has been minimized.

Copy link
@jgao54

jgao54 Nov 22, 2017

Contributor

ah yes, that is the case for any new run_ids that get generated, but not for existing ones since the migration scripts only updates execution_date, start_date, and end_date in the table.

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

Ah, ok. I think we might need to update run_ids (I rather not) in the migration script as well as they are sometimes used in look ups. I need to verify if it will be an issue if we don't. I think it might be an issue for backfills, but not sure.

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

^^ seems not to be required. We never use the full run_id.

@@ -3882,7 +3924,7 @@ class Chart(Base):
"User", cascade=False, cascade_backrefs=False, backref='charts')
x_is_date = Column(Boolean, default=True)
iteration_no = Column(Integer, default=0)
last_modified = Column(DateTime, default=func.now())
last_modified = Column(UtcDateTime, default=func.now())

This comment has been minimized.

Copy link
@jgao54

jgao54 Nov 22, 2017

Contributor

The migration script currently does not include this field last_modified, so it's still a DATETIME object, maybe we want to modify it to timestamp as well

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

Good catch. I'll fix that

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

Done

@jgao54

This comment has been minimized.

Copy link
Contributor

jgao54 commented Nov 22, 2017

When I set default_timezone = system, I get the following stacktrace:

  File "/Users/joyg/Dev/airflow-source/venv/bin/airflow", line 4, in <module>
    __import__('pkg_resources').run_script('apache-airflow==1.10.0.dev0+incubating', 'airflow')
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/apache_airflow-1.10.0.dev0+incubating-py3.5.egg/EGG-INFO/scripts/airflow", line 16, in <module>
    from airflow import configuration
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/apache_airflow-1.10.0.dev0+incubating-py3.5.egg/airflow/__init__.py", line 32, in <module>
    from airflow.models import DAG
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/apache_airflow-1.10.0.dev0+incubating-py3.5.egg/airflow/models.py", line 77, in <module>
    from airflow.utils.dates import cron_presets, date_range as utils_date_range
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/apache_airflow-1.10.0.dev0+incubating-py3.5.egg/airflow/utils/dates.py", line 112, in <module>
    def round_time(dt, delta, start_date=timezone.make_aware(datetime.min)):
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/apache_airflow-1.10.0.dev0+incubating-py3.5.egg/airflow/utils/timezone.py", line 97, in make_aware
    return timezone.localize(value)
  File "/Users/joyg/Dev/airflow-source/venv/lib/python3.5/site-packages/pytz/tzinfo.py", line 309, in localize
    loc_dt = dt + delta
OverflowError: date value out of range```
@@ -98,7 +109,7 @@ def date_range(
return sorted(l)


def round_time(dt, delta, start_date=datetime.min):
def round_time(dt, delta, start_date=timezone.make_aware(datetime.min)):

This comment has been minimized.

Copy link
@jgao54

jgao54 Nov 22, 2017

Contributor

looks like this is the root cause of the default_timezone=system bug, timezone.make_aware(datetime.min) throws that exception.

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

I think the issue is somewhere else (init of timezone with system), but let me have a look.

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

Fixed.

@jgao54

This comment has been minimized.

Copy link
Contributor

jgao54 commented Nov 22, 2017

Tested locally with mysql db and verified the following:

  • db migration converts all datetime to utc
  • creating dag without specifying timezone honors default timezone
  • creating dag with pytz.timezone and pendulum.timezone honors dag-level time zone
  • before/after DST looks good

Asides from comments above, didn't see any other issues.

Looking forward to this feature 👍

@Fokko
Copy link
Contributor

Fokko left a comment

Awesome change. I'm not a timezone export, but since it is such a big change, I went through it. Looks ok to me.

user_filter = objectClass=*
user_name_attr = uid
group_member_attr = memberOf
superuser_filter =
data_profiler_filter =

This comment has been minimized.

Copy link
@Fokko

Fokko Nov 22, 2017

Contributor

Unrelated changes, but I prefer trimmed lines :-)

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

Agree, I'm tempted to leave it in :)

session.add(log)
session.commit()
session.add(log)
session.commit()

This comment has been minimized.

Copy link
@Fokko

Fokko Nov 22, 2017

Contributor

Don't we need to close this session if we remove the with statement? Personally I like the with syntax a lot.

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 22, 2017

Author Contributor

Mmm this seems the rebase is not entirely working correctly here, I havent changed this code. Good catch.

@Fokko

This comment has been minimized.

Copy link
Contributor

Fokko commented Nov 22, 2017

Maybe also take this date into account:
https://github.com/bolkedebruin/airflow/blob/AIRFLOW-1802/airflow/utils/log/file_processor_handler.py#L89

Otherwise the logs will be utc, which might be confusing.

@bolkedebruin

This comment has been minimized.

Copy link
Contributor Author

bolkedebruin commented Nov 22, 2017

@Fokko thought about that (logs), but I rather do that as a follow up. It could be that you want this in UTC still or system time. This change will go through some adjustments when people get their hands on it.

@bolkedebruin bolkedebruin force-pushed the bolkedebruin:AIRFLOW-1802 branch 4 times, most recently from f539455 to 596b5c4 Nov 22, 2017

@jgao54

jgao54 approved these changes Nov 22, 2017

Copy link
Contributor

jgao54 left a comment

lgtm

@bolkedebruin

This comment has been minimized.

Copy link
Contributor Author

bolkedebruin commented Nov 22, 2017

K will merge soon (brrrrr) @gwax @saguziel @aoen @ashb

@jgao54

This comment has been minimized.

Copy link
Contributor

jgao54 commented Nov 22, 2017

@bolkedebruin We have American thanksgiving here this week so a lot of folks may be on vacation, so you may have to wait a bit before you get more feedback :)

@bolkedebruin

This comment has been minimized.

Copy link
Contributor Author

bolkedebruin commented Nov 22, 2017

Ahhhhh :-).

bolkedebruin added some commits Nov 10, 2017

[AIRFLOW-1804] Add time zone configuration options
Time zone defaults to UTC as is the default now in order
to maintain backwards compatibility.
[AIRFLOW-1808] Convert all utcnow() to time zone aware
datetime.utcnow() does not set time zone information.
[AIRFLOW-1807] Force use of time zone aware db fields
This change will check if all date times being stored are
indeed timezone aware.
[AIRFLOW-1806] Use naive datetime for cron scheduling
Converting to naive time is required in order to make sure
to run at exact times for crons.
E.g. if you specify to run at 8:00pm every day you do not
want suddenly to run at 7:00pm due to DST.

@bolkedebruin bolkedebruin force-pushed the bolkedebruin:AIRFLOW-1802 branch from 596b5c4 to f1ab56c Nov 27, 2017

@asfgit asfgit merged commit f1ab56c into apache:master Nov 27, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

asfgit pushed a commit that referenced this pull request Nov 27, 2017

==========

Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database.
It allows you to run your DAGs with time zone dependent schedules. At the moment Airflow does not them to the end

This comment has been minimized.

Copy link
@raskasa

raskasa Nov 28, 2017

...Airflow does not them to the end...

Typo? Unsure what the word is that's missing though.

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Nov 28, 2017

Author Contributor

Good catch, I have to think about that ;-)

This comment has been minimized.

Copy link
@Amitkumar27

Amitkumar27 Jan 11, 2018

Hi there I recently upgraded from airflow 1.8 to 1.9 and all of my dags started running in UTC. I tried to set default_timezone = system and default_timezone = America/Los_Angeles but still the dags were being run in UTC.
I tried to make the dag timezone aware it gave the error - one of the execution /start date was not timezone aware, this is what I tried in my dag.

local_tz = pendulum.timezone('America/Los_Angeles')

default_args = {
'start_date': datetime(2018, 01, 07, tzinfo=local_tz),
# 'queue': 'bash_queue',
# 'end_date': datetime(2016, 1, 1),
}
error
File "/opt/python27_enhanced-virtual-hadoop/lib/python2.7/site-packages/airflow/ti_deps/deps/exec_date_after_start_date_dep.py", line 24, in _get_dep_statuses
if ti.task.start_date and ti.execution_date < ti.task.start_date:
TypeError: can't compare offset-naive and offset-aware datetimes

This comment has been minimized.

Copy link
@bolkedebruin

bolkedebruin Jan 11, 2018

Author Contributor

1.9.0 does not support time zones yet. It will arrive in 1.10/2.0 . Before that time it is always assumed airflow is running in UTC as it did not support otherwise.

This comment has been minimized.

Copy link
@Amitkumar27

Amitkumar27 Jan 11, 2018

thanks bolkedebruin !

@brycedrennan

This comment has been minimized.

Copy link

brycedrennan commented on airflow/utils/dates.py in 518a41a Feb 8, 2018

@bolkedebruin this seems to cause this issue
UnboundLocalError: local variable 'tz' referenced before assignment

This comment has been minimized.

Copy link
Contributor Author

bolkedebruin replied Feb 8, 2018

Can you create a jira for this @brycedrennan ?

This comment has been minimized.

@somenzz

This comment has been minimized.

Copy link

somenzz commented on b658c78 Sep 23, 2018

There is no DateTime2literal function in module MySQLdb.converters, and there is no DateTime2literal function in module pymysql.

I use mysql as Metastore, when I run airflow initdb, it go wrong.
AttributeError: module 'MySQLdb.converters' has no attribute 'DateTime2literal'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.