
Don't create duplicate dag file processors #11875

Closed
pingzh wants to merge 1 commit into apache:master from pingzh:ping.zhang-fix-duplicate-dag-processor

Conversation

@pingzh (Contributor) commented Oct 27, 2020

Context: when a dag file is under processing and multiple callbacks are
created for it, either via zombies or executor events, the dag file is added
to the _file_path_queue and the manager launches a new process to process
it, which it should not, since the dag file is currently under processing.
This eventually bypasses the _parallelism limit, especially when some dag
files take a long time to process. We have seen ~200 dag file processors on
the scheduler even though we set _parallelism to 60. More dag file
processors cause CPU spikes, which in turn make dag file processing even
slower. In the end, the scheduler is taken down.
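A minimal, self-contained sketch of the failure mode described above (a hypothetical TinyManager class and file names, not the actual Airflow code). Because the processors dict is keyed by file path, a duplicate processor overwrites the existing entry, so the parallelism accounting undercounts live processes:

    import time
    from multiprocessing import Process

    def parse_dag_file(path):
        """Stand-in for a dag file parse that takes a long time."""
        time.sleep(5)

    class TinyManager:
        """Hypothetical, stripped-down manager reproducing the reported bug."""

        def __init__(self, parallelism):
            self._parallelism = parallelism
            self._file_path_queue = []
            self._processors = {}  # file_path -> Process

        def add_callback(self, file_path):
            # Every callback (zombie, executor event, ...) enqueues the file
            # without checking whether it is already being processed.
            if file_path not in self._file_path_queue:
                self._file_path_queue.append(file_path)

        def start_new_processes(self):
            while self._parallelism - len(self._processors) > 0 and self._file_path_queue:
                file_path = self._file_path_queue.pop(0)
                proc = Process(target=parse_dag_file, args=(file_path,))
                proc.start()
                # Keyed by file path: a second processor for the same file
                # overwrites the first entry, so len(self._processors)
                # undercounts live processes and _parallelism is bypassed.
                self._processors[file_path] = proc

    if __name__ == "__main__":
        manager = TinyManager(parallelism=2)
        manager.add_callback("dags/slow.py")
        manager.start_new_processes()          # first processor starts
        manager.add_callback("dags/slow.py")   # zombie re-detected mid-parse
        manager.start_new_processes()          # second processor, same file
        print(len(manager._processors))        # 1, even though 2 are running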



@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Oct 27, 2020
@kaxil kaxil self-requested a review October 27, 2020 13:20
manager._file_path_queue = [f1, f2]
files_paths_to_exclude_in_this_loop = {f1}

manager.start_new_processes(files_paths_to_exclude_in_this_loop)
Member

This is too tightly coupled to the way this is implemented, rather than to the goal.

Also, the fact that you can't easily test this probably means you need to re-think the approach.

Contributor Author

If the manager loop called prepare_file_path_queue() on every loop, we wouldn't need this change (passing files_paths_to_exclude_in_this_loop to start_new_processes).

It is only called under if not self._file_path_queue. If we remove this, we will also need to do some dedupe work in prepare_file_path_queue.
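For context, a rough sketch of the loop shape being described here (stub methods and approximate names, not the verbatim manager source):

    class LoopSketch:
        """Approximate shape of the manager loop under discussion."""

        def __init__(self):
            self._file_path_queue = []

        def _find_zombies(self):
            """May enqueue a file that is already being parsed."""

        def prepare_file_path_queue(self):
            """Scans the dag directory and refills the queue."""

        def start_new_processes(self):
            """Pops from the queue and starts processors unconditionally."""

        def run_once(self):
            self._find_zombies()
            # The queue is only refilled once it has fully drained, so
            # prepare_file_path_queue() alone cannot dedupe paths that
            # callbacks pushed in between loops.
            if not self._file_path_queue:
                self.prepare_file_path_queue()
            self.start_new_processes()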

Contributor Author

I chose to put files_paths_to_exclude_in_this_loop inside start_new_processes because I was thinking that is the single place where we pop(0) and start a new process, so adding a check there will guarantee it won't double-process dag files.
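A sketch of the check being described (reconstructed from the discussion and the snippet earlier in the thread, not the verbatim diff of this PR):

    def start_new_processes(self, files_paths_to_exclude_in_this_loop=frozenset()):
        """Start more processors if we have enough slots and files to process."""
        while self._parallelism - len(self._processors) > 0 and self._file_path_queue:
            file_path = self._file_path_queue.pop(0)
            # Skip any path excluded for this loop or already owned by a
            # live processor, so the single pop(0) call site can never
            # double-start a processor for the same dag file.
            if file_path in files_paths_to_exclude_in_this_loop or file_path in self._processors:
                continue
            processor = self._processor_factory(
                file_path,
                self._callback_to_execute[file_path],
                self._dag_ids,
                self._pickle_dags,
            )
            del self._callback_to_execute[file_path]
            processor.start()
            self._processors[file_path] = processor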

@ashb (Member) commented Oct 27, 2020

Are you sure this is still a problem on mainline? Looking at start_new_processes:

    def start_new_processes(self):
        """Start more processors if we have enough slots and files to process"""
        while self._parallelism - len(self._processors) > 0 and self._file_path_queue:
            file_path = self._file_path_queue.pop(0)
            callback_to_execute_for_file = self._callback_to_execute[file_path]
            processor = self._processor_factory(
                file_path,
                callback_to_execute_for_file,
                self._dag_ids,
                self._pickle_dags)

            del self._callback_to_execute[file_path]
            Stats.incr('dag_processing.processes')

            processor.start()
            self.log.debug(
                "Started a process (PID: %s) to generate tasks for %s",
                processor.pid, file_path
            )
            self._processors[file_path] = processor
            self.waitables[processor.waitable_handle] = processor

I can't see at first glance how self._parallelism - len(self._processors) > 0 would ever lead to too many processes.

@pingzh (Contributor Author) commented Oct 27, 2020

> Are you sure this is still a problem on mainline? Looking at start_new_processes: [code quoted in the comment above]

Yes, it is still an issue; the main logic for launching new dag file processors has not changed much between 1.10.4 and the master branch. We also cherry-picked PR #7597 into our 1.10.4 version.

The issue does not happen often.

This is the incident that led us to find this issue. As you can see, the same dag file is being processed many times (processing this dag file usually takes more than 15 min):

[image: screenshot showing the same dag file being processed by many processors]

When the manager adds the callback to the _file_path_queue, it does not care whether this dag file is currently under processing or in the cool-down period, which leads to multiple dag processors processing the same dag file.

As for exceeding _parallelism, I have lost some context on how exactly it got into that state :(

@ashb (Member) commented Oct 28, 2020

> when the manager adds the callback to the _file_path_queue, it does not care whether this dag file is currently under processing or in the cool-down period,

Yes, this is "by design" -- if there's a callback, we need to execute it "now".

https://github.com/apache/airflow/blob/master/airflow/utils/dag_processing.py#L713-L718

Given we remove it from the list if it's already there, and given the existing checks, I'm not sure the two "issues" are related (1. jumping up the queue/ignoring the cool-down period, and 2. going beyond concurrency limits).
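For readers without the link handy, the behaviour being pointed at looks roughly like this (a paraphrased sketch of the linked dag_processing.py lines, not a verbatim copy):

    from collections import defaultdict

    class CallbackQueueSketch:
        """Paraphrase of the linked queue-jumping behaviour."""

        def __init__(self):
            self._callback_to_execute = defaultdict(list)
            self._file_path_queue = []

        def add_callback(self, full_filepath, request):
            self._callback_to_execute[full_filepath].append(request)
            # The queue itself is deduplicated...
            if full_filepath in self._file_path_queue:
                self._file_path_queue.remove(full_filepath)
            # ...and the file jumps to the front so the callback runs as
            # soon as a slot frees up -- even if a processor for this file
            # is already running or inside its cool-down period.
            self._file_path_queue.insert(0, full_filepath)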

@pingzh (Contributor Author) commented Nov 2, 2020

> Yes, this is "by design" -- if there's a callback, we need to execute it "now". [...]

However, the original design risks letting a dag file with a long processing time take over all dag processors, and it also introduces a race condition (a toy timeline follows the list):

Race condition case:

  1. a task of a dag is treated as a zombie
  2. it is put into the event queue and then processed by the DagProcessor. The dag file takes a very long time to process and to run the callback.
  3. the next loop in the dag manager sees the same task as a zombie and puts it into the event queue again
  4. a new DagProcessor is launched to process the same dag file and the same zombie event
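The race above, compressed into plain single-threaded function calls (hypothetical names, for illustration only):

    # Steps 1-4 above, compressed: the zombie is re-detected on the next
    # manager loop while the first processor is still parsing the file, so
    # a second processor starts for the same file and the same zombie event.
    running = {}        # file_path -> list of processor ids
    zombie_events = []  # zombie callbacks waiting for a processor
    next_pid = 1000

    def detect_zombies(file_path):
        zombie_events.append(file_path)           # steps 1 and 3

    def launch_processors():
        global next_pid
        while zombie_events:                      # steps 2 and 4
            file_path = zombie_events.pop(0)
            running.setdefault(file_path, []).append(next_pid)
            next_pid += 1

    detect_zombies("dags/slow.py")   # loop 1: task looks like a zombie
    launch_processors()              # processor 1000 starts, parses slowly
    detect_zombies("dags/slow.py")   # loop 2: same task, still a zombie
    launch_processors()              # processor 1001: duplicate work
    print(running)                   # {'dags/slow.py': [1000, 1001]}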

@pingzh (Contributor Author) commented Nov 13, 2020

@kaxil and @ashb, a friendly reminder about this PR. Thanks.

@kaxil kaxil added this to the Airflow 2.0.0 (rc1) milestone Nov 17, 2020
@potiuk (Member) commented Dec 7, 2020

Hey @kaxil @ashb - do you want to make it part of 2.0.0rc1? Or should we change the milestone?

@ashb (Member) commented Dec 7, 2020

I'm not yet convinced that this is a) actually a bug, or b) the right solution.

Changing the milestone for now, and it can come in 2.0.x or 2.1

@ashb ashb modified the milestones: Airflow 2.0.0rc1, Airflow 2.1 Dec 7, 2020
kaxil added a commit to astronomer/airflow that referenced this pull request Jan 15, 2021
When a dag file is executed via Dag File Processors and multiple callbacks are
created, either via zombies or executor events, the dag file is added to
the _file_path_queue and the manager will launch a new process to
process it, which it should not, since the dag file is currently under
processing. This eventually bypasses the _parallelism limit, especially
when some dag files take a long time to process.

This addresses the same issue as apache#11875,
but instead does not exclude file paths that are recently processed or that
run at the limit (which is only used in tests) when Callbacks are sent by the
Agent. This is by design, as the execution of Callbacks is critical. This is done
with a caveat to avoid duplicate processors -- i.e. if a processor exists,
instead of removing the file path from the queue, it is moved from the
beginning of the queue to the end. This means that the processor with
the file path to run the callback is still run before other file paths are added.

Tests are added to check the same.

closes apache#13047
@kaxil kaxil closed this in 32f5953 Jan 15, 2021
kaxil added a commit to astronomer/airflow that referenced this pull request Jan 15, 2021
When a dag file is executed via Dag File Processors and multiple callbacks are
created, either via zombies or executor events, the dag file is added to
the _file_path_queue and the manager will launch a new process to
process it, which it should not, since the dag file is currently under
processing. This eventually bypasses the _parallelism limit, especially
when some dag files take a long time to process, and since self._processors
is just a dict with the file path as the key, multiple processors with the
same key count as one, so the parallelism check is bypassed.

This addresses the same issue as apache#11875,
but instead does not exclude file paths that are recently processed or that
run at the limit (which is only used in tests) when Callbacks are sent by the
Agent. This is by design, as the execution of Callbacks is critical. This is done
with a caveat to avoid duplicate processors -- i.e. if a processor exists,
the file path is removed from the queue. This means that the processor with
the file path to run the callback will still be run when the file path is
added again in the next loop.

Tests are added to check the same.

closes apache#13047
closes apache#11875

(cherry picked from commit 32f5953)
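The dedupe caveat this commit message describes amounts to an early continue in start_new_processes; a minimal sketch (paraphrased from the message above, not the verbatim diff of commit 32f5953):

    def start_new_processes(self):
        """Start more processors if we have enough slots and files to process."""
        while self._parallelism - len(self._processors) > 0 and self._file_path_queue:
            file_path = self._file_path_queue.pop(0)
            # Stop creating a duplicate processor, i.e. one with the same
            # file path as a running processor. The path will be enqueued
            # again on a later loop, after the running processor finishes,
            # so its callback is still executed.
            if file_path in self._processors:
                continue
            processor = self._processor_factory(
                file_path,
                self._callback_to_execute[file_path],
                self._dag_ids,
                self._pickle_dags,
            )
            del self._callback_to_execute[file_path]
            processor.start()
            self._processors[file_path] = processor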
@kaxil kaxil removed this from the Airflow 2.1 milestone Jan 15, 2021
kaxil added a commit that referenced this pull request Jan 21, 2021
Same commit message as above.

closes #13047
closes #11875

(cherry picked from commit 32f5953)
The following downstream commits referenced this pull request, each carrying the same commit message as above:

leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 16, 2021
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 17, 2021
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 23, 2021
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 27, 2021
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Mar 10, 2022
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jun 4, 2022
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jun 7, 2022
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 9, 2022
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Aug 27, 2022
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 4, 2022
aglipska pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 7, 2022
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Dec 7, 2022
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jan 27, 2023
kosteev pushed two commits to kosteev/composer-airflow-test-copybara that referenced this pull request Sep 12, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 17, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 7, 2024
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 1, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request May 21, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 16, 2025
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Oct 15, 2025

Labels

area:Scheduler including HA (high availability) scheduler


4 participants