Cleanup & improvements around scheduling #12815

Merged
XD-DENG merged 3 commits into apache:master from XD-DENG:clean-scheduler-job
Dec 5, 2020
Member

XD-DENG commented Dec 4, 2020

1. Cleanup

Mainly to clean up stale docstrings.

  • Remove unneeded code lines
  • Remove stale docstring
  • Fix wrong docstring
  • Fix stale doc image link in docstring

2. Improvements

  • Avoid an unnecessary loop in DagRun.schedule_tis(), which is invoked inside SchedulerJob
  • Minor improvement to DAG.deactivate_stale_dags(), which is invoked inside SchedulerJob (the similar issue elsewhere in the project will be fixed project-wide in a separate PR, to keep PR scopes clear)

- Remove unneeded code line
- Remove stale docstring
- Fix wrong docstring
- Fix stale doc image link in docstring
- avoid unnecessary loop in DagRun.schedule_tis()
- Minor improvement on DAG.deactivate_stale_dags()
  which is invoked inside SchedulerJob
XD-DENG added the area:Scheduler (including HA (high availability) scheduler) and type:improvement (Changelog: Improvements) labels Dec 4, 2020
XD-DENG requested review from ashb, kaxil and turbaszek December 4, 2020 18:04
schedulable_ti_ids.append(ti.task_id)

schedulable_ti_ids = [ti.task_id for ti in schedulable_tis if ti not in dummy_tis]
count = 0
Member Author

This change is to ensure we traverse schedulable_tis only once, rather than twice.
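The single-pass idea can be sketched as below. This is only an illustration, not the code from this PR: `FakeTI`, its `is_dummy` flag, and `split_task_ids` are hypothetical names standing in for the real task instances and the real "is this a dummy task" check.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class FakeTI:
    # Hypothetical stand-in for a task instance; not an Airflow class.
    task_id: str
    is_dummy: bool  # assumed flag replacing the real dummy-task condition


def split_task_ids(tis: Iterable[FakeTI]) -> Tuple[List[str], List[str]]:
    """Return (dummy_ids, schedulable_ids) after one pass over ``tis``."""
    dummy_ids: List[str] = []
    schedulable_ids: List[str] = []
    for ti in tis:
        # Each element is inspected exactly once and routed to one list.
        if ti.is_dummy:
            dummy_ids.append(ti.task_id)
        else:
            schedulable_ids.append(ti.task_id)
    return dummy_ids, schedulable_ids


# Works even for a one-shot iterator, because nothing is re-read.
dummy_ids, schedulable_ids = split_task_ids(
    iter([FakeTI("a", True), FakeTI("b", False), FakeTI("c", False)])
)
```

Because the input is consumed only once, the same function also works on generators, which a "build a set, then filter against it" approach would not.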

Member

Do you think it is worth moving this logic into a separate function? See: airflow.utils.helpers.partition

Member Author

Personally I don't find it necessary (for now).

On the other hand, if we abstract this into a separate function, as far as I can see it would more or less duplicate helpers.partition(). (I don't want to use helpers.partition() here because it still traverses the iterable twice.)
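To illustrate the "twice" concern, here is the well-known `itertools` partition recipe with a call counter added. It is an assumption that helpers.partition() behaves like this recipe; `tee_partition` and `is_even` are names made up for the example.

```python
from itertools import filterfalse, tee
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def tee_partition(
    pred: Callable[[T], bool], iterable: Iterable[T]
) -> Tuple[Iterator[T], Iterator[T]]:
    """itertools recipe: returns (items where pred is false, items where pred is true)."""
    first, second = tee(iterable)
    return filterfalse(pred, first), filter(pred, second)


calls = 0


def is_even(n: int) -> bool:
    # Count how often the predicate runs.
    global calls
    calls += 1
    return n % 2 == 0


odds, evens = tee_partition(is_even, range(6))
odd_list, even_list = list(odds), list(evens)
print(odd_list, even_list, calls)  # [1, 3, 5] [0, 2, 4] 12
```

Six input items lead to twelve predicate calls, since each output iterator walks its own copy of the data. A plain loop with two lists evaluates the predicate once per item.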

dag.is_active = False
session.merge(dag)
session.commit()
session.commit()
Member Author

Similar to deactivate_unknown_dags(), commit only once outside the for-loop.
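A minimal sketch of "commit once outside the loop", using the standard-library `sqlite3` module instead of Airflow's SQLAlchemy session. The `dag` table layout and the `deactivate` helper are assumptions made for the example.

```python
import sqlite3
from typing import Iterable, List


def deactivate(conn: sqlite3.Connection, dag_ids: Iterable[str]) -> None:
    """Mark every given DAG inactive, committing a single time at the end."""
    for dag_id in dag_ids:
        conn.execute("UPDATE dag SET is_active = 0 WHERE dag_id = ?", (dag_id,))
    # One commit for the whole batch instead of one per row.
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT PRIMARY KEY, is_active INTEGER)")
conn.executemany("INSERT INTO dag VALUES (?, 1)", [("a",), ("b",), ("c",)])
conn.commit()

deactivate(conn, ["a", "c"])  # hypothetical list of stale DAG ids

active: List[str] = [
    row[0]
    for row in conn.execute("SELECT dag_id FROM dag WHERE is_active = 1 ORDER BY dag_id")
]
```

All updates land in one transaction, so the batch either becomes visible as a whole or not at all, and the per-row commit overhead disappears.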

Member

This should actually just be

Suggested change
session.commit()
session.flush()

to be in line with https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#database-session-handling
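As I read the linked guideline, a function that receives a session should leave committing to whoever owns that session. The sketch below shows the idea with stand-ins only: `RecordingSession` and `provide_session_sketch` are hypothetical and are not Airflow's `@provide_session` or a SQLAlchemy session.

```python
import functools
from typing import Callable, List, Optional


class RecordingSession:
    """Hypothetical stand-in for a DB session; it only records method calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def merge(self, obj: object) -> None:
        self.calls.append("merge")

    def flush(self) -> None:
        self.calls.append("flush")

    def commit(self) -> None:
        self.calls.append("commit")


created_sessions: List[RecordingSession] = []


def provide_session_sketch(func: Callable) -> Callable:
    """Assumed behaviour: create and commit a session only if none was passed in."""

    @functools.wraps(func)
    def wrapper(*args, session: Optional[RecordingSession] = None, **kwargs):
        if session is not None:
            # The caller owns the transaction, so nothing is committed here.
            return func(*args, session=session, **kwargs)
        session = RecordingSession()
        created_sessions.append(session)
        result = func(*args, session=session, **kwargs)
        session.commit()  # the decorator owns this session, so it commits
        return result

    return wrapper


@provide_session_sketch
def deactivate(dags: List[dict], session: Optional[RecordingSession] = None) -> None:
    for dag in dags:
        dag["is_active"] = False
        session.merge(dag)
    session.flush()  # push pending changes, but leave the commit to the owner


outer = RecordingSession()
deactivate([{"dag_id": "a"}], session=outer)  # caller decides when to commit
deactivate([{"dag_id": "b"}])                 # decorator commits for us
```

With a passed-in session the recorded calls end at `flush`, so the caller can still roll back or batch more work; without one, the wrapper adds the single `commit`.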

Member Author (XD-DENG, Dec 4, 2020)

In that case, model/dag.py alone has a few other uses of session.commit() in functions where @provide_session is used. Should I address them in this PR as well?

Member

Possibly -- do you think they make sense all as one PR?

Member Author

How about this: I will skip it in this PR and open another PR dedicated to cleaning up session.commit() project-wide, so the PR scopes are clearer. Agree?

Member Author

I created issue #12818 for cleaning up session.commit().

Given it's a relatively easy fix, I have marked it as "good first issue". Let's see if any new contributor would like to pick it up (I will mention it in Slack).


Returns a list of serialized_dag dicts that represent the DAGs found in
the file

Member Author

  • DagFileProcessor.process_file() no longer takes care of killing zombies.
  • The statement about what is returned here is stale.

github-actions bot commented Dec 4, 2020

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

github-actions bot added the full tests needed (We need to run full set of tests for this PR to merge) label Dec 4, 2020
XD-DENG merged commit fbb8a4a into apache:master Dec 5, 2020
XD-DENG deleted the clean-scheduler-job branch December 5, 2020 07:59
4 participants