New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-3518] Performance fixes for topological_sort #4322

Merged
merged 2 commits into from Dec 15, 2018

Conversation

Projects
None yet
4 participants
@NielsZeilemaker
Copy link
Contributor

NielsZeilemaker commented Dec 14, 2018

For larger DAGs topological_sort was found to be very inefficient. Made
some small changes to the code to improve the datastructures used in the
method.

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    While running a larger DAG, topological_sort was found to be very slow. This fixup improves the performance on my local machine from ~5-10 seconds to 0.01s.
    Moreover, i've modified the eq operators to have a shortcut for dag_id/task_id. This should improve all other cases where equality is used.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    No tests, as I would expect topological_sort to be covered already.

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
    • All the public functions and the classes in the PR contain docstrings that explain what it does

Code Quality

  • Passes flake8
[AIRFLOW-3518] Performance fixes for topological_sort
For larger DAGs topological_sort was found to be very inefficient. Made
some small changes to the code to improve the datastructures used in the
method.
@ashb

ashb approved these changes Dec 14, 2018

Copy link
Member

ashb left a comment

Well that looks like a deceptively simple change that has big impact! Nice.

all(self.__dict__.get(c, None) == other.__dict__.get(c, None)
for c in self._comps))
if (type(self) == type(other) and
self.task_id == other.task_id):

This comment has been minimized.

@ashb

ashb Dec 14, 2018

Member

Do we need to also check that self.dag_id == other.dag_id?

This comment has been minimized.

@ashb

ashb Dec 14, 2018

Member

(And if we do we should probably add a test that checks that covers that)

This comment has been minimized.

@Fokko

Fokko Dec 14, 2018

Contributor

I think this is implicitly well covered by the cycle detection tests in tests/models.py.

This comment has been minimized.

@ashb

ashb Dec 14, 2018

Member
dag1 = DAG('dag2')
task1 = MyOp(task_id='start', dag=dag1)

dag2 = DAG('dag1')
task2 = MyOp(task_id='start', dag=dag2)

Will task1 == task2? It sort of looks like form this diff only that it might.

This comment has been minimized.

@Fokko

Fokko Dec 14, 2018

Contributor

So comparing the task_ids is just an optimization. If they differ, the attributes task_id, dag_id, owner, email, email_on_retry, retry_delay, retry_exponential_backoff, max_retry_delay, start_date, schedule_interval, depends_on_past, wait_for_downstream, adhoc, priority_weight, sla, execution_timeout, on_failure_callback, on_success_callback and on_retry_callback are compared to determine if the dags are really equal.

This comment has been minimized.

@ashb

ashb Dec 15, 2018

Member

Oh sorry yes! If they don't match then we return false. Right

@Fokko

Fokko approved these changes Dec 14, 2018

Copy link
Contributor

Fokko left a comment

Waiting for Flake8 to pass

all(self.__dict__.get(c, None) == other.__dict__.get(c, None)
for c in self._comps))
if (type(self) == type(other) and
self.task_id == other.task_id):

This comment has been minimized.

@Fokko

Fokko Dec 14, 2018

Contributor

I think this is implicitly well covered by the cycle detection tests in tests/models.py.

@ashb
Copy link
Member

ashb left a comment

I think we need to check dag_id too.

@feng-tao

This comment has been minimized.

Copy link
Contributor

feng-tao commented Dec 15, 2018

@ashb , I think whether to compare dag_id or not doesn't seem to matter that much. From the code, if self.task_id == other.task_id and (type(self) == type(other), it will still compare all the attributes for the operator which implictly checks dag_id as well.

@NielsZeilemaker

This comment has been minimized.

Copy link
Contributor Author

NielsZeilemaker commented Dec 15, 2018

@feng-tao that's correct, i've modified the condition to check for task_id and in the other equality check to check for dag_id as a performance optimization. The original check (using self._comps) is still there.

During our performance test we saw a lot of calls to both __eq__ methods, and hence the small modification to make them perform better.

@ashb

ashb approved these changes Dec 15, 2018

@ashb

This comment has been minimized.

Copy link
Member

ashb commented Dec 15, 2018

Maybe I should offer bad code reviews as a service? ;)

@ashb ashb merged commit 36e76fa into apache:master Dec 15, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

ashb added a commit that referenced this pull request Dec 15, 2018

[AIRFLOW-3518] Performance fixes for topological_sort of Tasks (#4322)
For larger DAGs topological_sort was found to be very inefficient. Made
some small changes to the code to improve the data structures used in the
method.

aliceabe pushed a commit to aliceabe/incubator-airflow that referenced this pull request Jan 3, 2019

[AIRFLOW-3518] Performance fixes for topological_sort of Tasks (apach…
…e#4322)

For larger DAGs topological_sort was found to be very inefficient. Made
some small changes to the code to improve the data structures used in the
method.

kaxil added a commit that referenced this pull request Jan 9, 2019

[AIRFLOW-3518] Performance fixes for topological_sort of Tasks (#4322)
For larger DAGs topological_sort was found to be very inefficient. Made
some small changes to the code to improve the data structures used in the
method.

ashb added a commit to ashb/airflow that referenced this pull request Jan 10, 2019

[AIRFLOW-3518] Performance fixes for topological_sort of Tasks (apach…
…e#4322)

For larger DAGs topological_sort was found to be very inefficient. Made
some small changes to the code to improve the data structures used in the
method.

cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Jan 23, 2019

Performance fixes for topological_sort of Tasks (apache#4322)
For larger DAGs topological_sort was found to be very inefficient. Made
some small changes to the code to improve the data structures used in the
method.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment