
[AIRFLOW-4401] Use managers for Queue synchronization #5200

Merged
merged 1 commit into apache:master from fix_queues_with_managers
Apr 30, 2019

Conversation

potiuk
Member

@potiuk potiuk commented Apr 28, 2019

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-4401
    • In case you are fixing a typo in the documentation you can prepend your commit with [AIRFLOW-XXX], code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

It is a known problem (https://bugs.python.org/issue23582) that the
multiprocessing.Queue empty() method is not reliable - sometimes it can
return True even though another process has already put something in the queue.

This resulted in some tasks not being picked up when the sync() methods
were called (in AirflowKubernetesScheduler, LocalExecutor,
DagFileProcessor). This was less of a problem when the method was called in sync()
- the remaining jobs/files could be processed in the next pass - but it was a problem
in tests and when a graceful shutdown was executed (some tasks could still be
unprocessed when the shutdown occurred).

We switched to Manager()-managed queues to handle that - the queue in this case
runs in a separate server subprocess and every process using it accesses the
shared queue through a proxy.

All the cases impacted follow the same pattern now:

while not queue.empty():
    res = queue.get()
    ...

This loop always runs in a single (main) process, so it is safe to run it this way -
there is no risk that another process retrieves data from the queue
between empty() and get().

In all these cases the overhead of inter-process locking is negligible
compared to the action executed (parsing a DAG, executing a job),
so it appears safe to merge the change.
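
For illustration only (a minimal sketch, not code from this PR; the producer function and file names are made up), the pattern with a Manager()-backed queue drained only by the main process looks roughly like this:

from multiprocessing import Manager, Process


def producer(q):
    # A worker process fills the shared, proxied queue.
    for dag_file in ("a.py", "b.py", "c.py"):
        q.put(dag_file)


if __name__ == "__main__":
    manager = Manager()
    q = manager.Queue()  # the queue lives in the manager's server process

    p = Process(target=producer, args=(q,))
    p.start()
    p.join()  # the producer is done, so nothing else writes concurrently

    # Only the main process drains, so no other process can race
    # between empty() and get().
    while not q.empty():
        res = q.get()
        print("picked up", res)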

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Tests already existed (but were flaky)

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards-incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

Code Quality

  • Passes flake8

@potiuk potiuk requested review from ashb and BasPH April 28, 2019 16:06
@potiuk potiuk force-pushed the fix_queues_with_managers branch 3 times, most recently from c293256 to 6b09440 Compare April 29, 2019 05:03
@potiuk potiuk changed the title [aIRFLOW-4401] Use managers for Queue synchronization [AIRFLOW-4401] Use managers for Queue synchronization Apr 29, 2019
@codecov-io

codecov-io commented Apr 29, 2019

Codecov Report

Merging #5200 into master will increase coverage by 0.25%.
The diff coverage is 65.16%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5200      +/-   ##
==========================================
+ Coverage   78.28%   78.54%   +0.25%     
==========================================
  Files         469      469              
  Lines       29912    29952      +40     
==========================================
+ Hits        23418    23527     +109     
+ Misses       6494     6425      -69
Impacted Files Coverage Δ
airflow/utils/dag_processing.py 60.13% <100%> (+0.65%) ⬆️
airflow/jobs.py 78.35% <28.57%> (-0.14%) ⬇️
airflow/contrib/executors/kubernetes_executor.py 62.79% <53.19%> (-0.59%) ⬇️
airflow/executors/local_executor.py 82.35% <80%> (+1.3%) ⬆️
airflow/hooks/dbapi_hook.py 88.79% <0%> (+0.86%) ⬆️
airflow/models/connection.py 65.73% <0%> (+1.12%) ⬆️
airflow/hooks/hive_hooks.py 74.93% <0%> (+1.86%) ⬆️
airflow/utils/sqlalchemy.py 80.95% <0%> (+4.76%) ⬆️
airflow/operators/mysql_operator.py 100% <0%> (+100%) ⬆️
airflow/operators/mysql_to_hive.py 100% <0%> (+100%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e2a807...aa16653.

@@ -206,7 +206,7 @@ def test_trigger_controller_dag(self):

         scheduler = SchedulerJob()
         queue = Mock()
-        scheduler._process_task_instances(target_dag, queue=queue)
+        scheduler._process_task_instances(target_dag, task_instances_list=queue)
Member Author

I will rename all the queues later when we discuss which solution to choose.

Member Author

Done.

@ashb ashb left a comment (Member)

Looks good though

airflow/executors/base_executor.py (review comment: outdated, resolved)
@BasPH BasPH left a comment (Contributor)

commented on 2 tiny nits

From a logic perspective, it appears okay to me although I'm not very experienced with the multiprocessing library. If the tests succeed and don't hang I'm okay with merging, to fix the current issues.

airflow/contrib/executors/kubernetes_executor.py (review comment: outdated, resolved)
airflow/contrib/executors/kubernetes_executor.py (review comment: outdated, resolved)
@potiuk potiuk force-pushed the fix_queues_with_managers branch 6 times, most recently from b1d8793 to 6ae1ac6 Compare April 30, 2019 05:00
@@ -769,3 +793,7 @@ def _change_state(self, key, state, pod_id):
    def end(self):
        self.log.info('Shutting down Kubernetes executor')
        self.task_queue.join()
Member Author

I think we had a problem here before: task_queue.join() was used but task_done() was never called, so I think we could get hangs when trying to stop the KubernetesScheduler gracefully.
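
(As an illustrative aside, not code from this PR: a joinable queue's join() returns only after every retrieved item has been acknowledged with task_done(), so a missing task_done() makes join() block forever.)

from multiprocessing import JoinableQueue

q = JoinableQueue()
q.put("job")

item = q.get()
# ... process(item) ...
q.task_done()  # without this call, q.join() below would block forever

q.join()       # returns only when every put() has a matching task_done()
print("all tasks done")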

@potiuk
Member Author

potiuk commented Apr 30, 2019

@BasPH @Fokko @ashb -> I made some amendments now as I realised that Manager().Queue() is always a JoinableQueue (unlike multiprocessing.Queue). I updated the pattern everywhere to follow this general pattern below (also described it in the commit message).

Since the managed queue is proxied to the shared one anyway, get_nowait() will always return a message if one has already been put into the queue by another process.

BTW, get_nowait() on a managed queue is not really "instant" as it is for a standard multiprocessing.Queue - it polls the shared process - but that's exactly why it is good for us :).

It also seems there was a mistake in KubernetesExecutor. See the comment above (#5200 (comment)) - if we've heard about an inability to gracefully shut down the KubernetesExecutor, that could be it.

while True:
    try:
        res = queue.get_nowait()
        try:
            ...  # do some processing
        finally:
            queue.task_done()
    except Empty:
        break
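
A self-contained sketch of that pattern (assuming a Manager()-backed queue; the producer function and the print are placeholders, not code from the PR):

from multiprocessing import Manager, Process
from queue import Empty  # managed queues raise queue.Empty on get_nowait()


def producer(q):
    for i in range(5):
        q.put(i)


if __name__ == "__main__":
    manager = Manager()
    q = manager.Queue()  # proxied, joinable queue in the manager's server process

    p = Process(target=producer, args=(q,))
    p.start()
    p.join()

    while True:
        try:
            res = q.get_nowait()
            try:
                print("processing", res)  # placeholder for the real work
            finally:
                q.task_done()
        except Empty:
            break

    q.join()  # returns immediately because every get() was matched by task_done()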

@potiuk potiuk force-pushed the fix_queues_with_managers branch 3 times, most recently from 471df47 to 384af0f Compare April 30, 2019 06:51
            self.process_watcher_task(task)
        finally:
            self.watcher_queue.task_done()
    except queue.Empty:
Contributor

Suggested change
except queue.Empty:
except Empty:
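
(Both spellings name the same exception class - queue.Empty is what get_nowait() raises - so the suggestion is only about a consistent import style:)

import queue
from queue import Empty

assert Empty is queue.Empty  # the same class either way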

@milton0825 milton0825 left a comment (Contributor)

Thanks for the fix. LGTM.

It is a known problem (https://bugs.python.org/issue23582) that the
multiprocessing.Queue empty() method is not reliable - sometimes it can
return True even though another process has already put something in the queue.

This resulted in some tasks not being picked up when the sync() methods
were called (in AirflowKubernetesScheduler, LocalExecutor,
DagFileProcessor). This was less of a problem when the method was called in sync()
- the remaining jobs/files could be processed in the next pass - but it was a problem
in tests and when a graceful shutdown was executed (some tasks could still be
unprocessed when the shutdown occurred).

We switched to Manager()-managed queues to handle that - the queue in this case
runs in a separate server subprocess and every process using it accesses the
shared queue through a proxy. Additionally, all Queues() returned by managers are
joinable queues, so we should call task_done() after all processing, and we can now
call join() in termination/end code to wait until all tasks are actually processed,
not only retrieved from the queue. That makes shutdown more graceful.

All the cases impacted follow the same general pattern now:

while True:
    try:
        res = queue.get_nowait()
        try:
            ...  # do some processing
        finally:
            queue.task_done()
    except Empty:
        break

In all these cases the overhead of inter-process locking is negligible
compared to the action executed (parsing a DAG, executing a job),
so it appears to be safe for concurrency as well.
@potiuk
Member Author

potiuk commented Apr 30, 2019

One last build and I think we are good to go :)

@potiuk
Member Author

potiuk commented Apr 30, 2019

Looks like we are good to go. Last chance to take a look @BasPH @ashb @Fokko :)

@potiuk potiuk merged commit 6952b19 into apache:master Apr 30, 2019
potiuk added a commit that referenced this pull request Apr 30, 2019
@ashb
Member

ashb commented Apr 30, 2019

That .join() issue could be the cause of num_runs not working for Kube executor - https://issues.apache.org/jira/browse/AIRFLOW-4424.

potiuk added a commit that referenced this pull request Apr 30, 2019
andriisoldatenko pushed a commit to andriisoldatenko/airflow that referenced this pull request Jul 26, 2019
wmorris75 pushed a commit to modmed/incubator-airflow that referenced this pull request Jul 29, 2019
dharamsk pushed a commit to postmates/airflow that referenced this pull request Aug 8, 2019