
Improve task manager performance for task dependencies #5787

Merged

Conversation

fosterseth
Member

SUMMARY

#5154
continuation of #5672

Task dependencies include project updates or inventory source updates. The task manager looks through each Job in the "pending" state and determines whether it needs to launch an associated project update or inventory source update. This part of the task manager is quite slow.

The task manager is not persistent. A new TaskManager object is launched on each run (once every 30 seconds, and on external signals), so it must rediscover the state of running and pending jobs every time.

These changes help alleviate the redundant work the task manager performs each time it is called.

1. Remove the method .process_dependencies()
This method is nearly a copy-and-paste of .process_pending_tasks(), so we can call that directly instead.

2. Add a dependencies_processed boolean field to the UnifiedJob model
This boolean should be read as "the task manager has checked and processed potential dependencies for this job". Not all jobs have dependencies by nature (ad hoc commands, for example). The task manager's job is to check whether a job has dependencies, and whether it should spawn those dependencies.

3. Only generate dependencies once for each task
The current task manager will generate/spawn dependencies for pending jobs each time it runs. We only need to do this once. We can check that task.dependencies_processed is False before running generate_dependencies().
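
The once-only flow in points 2 and 3 can be sketched in plain Python. Task, generate_dependencies, and schedule here are simplified stand-ins, not the real AWX classes (the real code lives in the TaskManager):

```python
# Simplified stand-ins for the real UnifiedJob / TaskManager machinery.
class Task:
    def __init__(self, name):
        self.name = name
        self.dependencies_processed = False  # new UnifiedJob field, default False

def generate_dependencies(undeped_tasks):
    """Spawn dependencies only for tasks not yet processed, then flip the flag."""
    spawned = []
    for task in undeped_tasks:
        spawned.append('project_update_for_' + task.name)  # placeholder dependency
        task.dependencies_processed = True
    return spawned

def schedule(pending_tasks):
    # Pass in only tasks with dependencies_processed=False, so repeat runs
    # do not re-save rows that are already done.
    undeped_tasks = [t for t in pending_tasks if not t.dependencies_processed]
    return generate_dependencies(undeped_tasks)

tasks = [Task('job1'), Task('job2')]
first_run = schedule(tasks)   # spawns one dependency per job
second_run = schedule(tasks)  # no-op: every flag is already True
```

On the second run the filter leaves nothing to do, which is the whole performance win: dependency generation happens once per job instead of once per scheduler cycle.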

Performance:
Calling TaskManager().schedule() on 5,000 "pending" state Jobs takes around 8 seconds (vs 200 seconds with the previous TaskManager)

Note: this is the elapsed time after the dependencies have already been generated. Generating dependencies for 5,000 jobs for the first time takes a few minutes.

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
AWX VERSION
awx: 9.1.1
ADDITIONAL INFORMATION

@fosterseth fosterseth changed the title TaskManager process dependencies only once Improve task manager performance for task dependencies Jan 28, 2020
@@ -598,6 +536,8 @@ def process_tasks(self, all_sorted_tasks):
        self.process_running_tasks(running_tasks)

        pending_tasks = [t for t in all_sorted_tasks if t.status == 'pending']
        dependencies = self.generate_dependencies(pending_tasks)
Member

Why not only pass in tasks that have dependencies_processed=False?

Otherwise, you're attempting updates to lots of rows every 30 seconds. Even if they are no-ops, that sounds like a recipe for database conflicts.

Member Author

good catch, fixed it.

@softwarefactory-project-zuul
Contributor

Build failed.

@fosterseth
Member Author

The unit test is problematic

Although TaskManager()._schedule() is considerably faster now, the total elapsed time can vary quite a bit from machine to machine. For example, it takes 0.2 seconds on my machine, but 1.3 seconds in the Zuul tests. I could make the threshold really high, like 5 seconds, but then the test becomes less meaningful.

Therefore it seems measuring the elapsed time and comparing to a flat expected value is not a great idea. I will drop this test unless someone can think of a more reliable way to approach this.

@softwarefactory-project-zuul
Contributor

Build failed.

@AlanCoding
Member

AlanCoding commented Jan 29, 2020

I think that coverage in Zuul tests would be good. For example, create a job, run the task manager, and verify that the flag is swapped. Run it again, and verify that the generate_dependencies method is called without that job.

Since this is an internal field, this is the only place we can have this kind of coverage (except indirectly of course).

-def generate_dependencies(self, task):
-    dependencies = []
-    if type(task) is Job:
+def generate_dependencies(self, pending_tasks):
Member

might be better to call the argument something else, like undeped_tasks

Member Author

changed variable name to undeped_tasks

@fosterseth
Member Author

@AlanCoding Wrote a new functional unit test that makes sure the dependencies_processed flag is flipped on a job after a single ._schedule() run. It also makes sure subsequent ._generate_dependencies() calls do not include that job in the argument list (as it has already been processed).
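
A stripped-down version of that check, with a fake job and a fake scheduler standing in for the real TaskManager and Job fixtures, might look like:

```python
from unittest import mock

class FakeJob:
    dependencies_processed = False

def fake_schedule(pending_tasks, generate_dependencies):
    # Mirrors the relevant part of TaskManager._schedule(): only unprocessed
    # tasks are handed to generate_dependencies, then their flags are flipped.
    undeped = [t for t in pending_tasks if not t.dependencies_processed]
    generate_dependencies(undeped)
    for t in undeped:
        t.dependencies_processed = True

job = FakeJob()
gen = mock.Mock()

fake_schedule([job], gen)
assert job.dependencies_processed is True  # flag flipped after the first run

fake_schedule([job], gen)
assert job not in gen.call_args[0][0]      # second call excludes the processed job
```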

@softwarefactory-project-zuul
Contributor

Build failed.

# .call_args is a tuple (positional_args, kwargs), so [0][0] is
# the first positional arg, i.e. the first argument passed to
# .generate_dependencies()
assert job not in tm.generate_dependencies.call_args[0][0]
Member

Since there are no other jobs in the database here, I would prefer a stricter assertion of tm.generate_dependencies.call_args[0][0] == []

Member Author

@AlanCoding because project.scm_update_on_launch == True, then .call_args[0][0] contains [ProjectUpdate]. I like the idea of having a stricter assertion though, so I will turn off the project update and assert the empty list.
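
For reference, the .call_args shape being discussed works like this with a plain unittest.mock.Mock (the names here are illustrative, not the real test):

```python
from unittest import mock

generate_dependencies = mock.Mock()
generate_dependencies(['job1', 'job2'], strict=True)

# call_args[0] is the tuple of positional args; [0][0] is the first one.
positional_args, kwargs = generate_dependencies.call_args
assert positional_args[0] == ['job1', 'job2']
assert kwargs == {'strict': True}
```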

Member

@AlanCoding left a comment

This is one of my favorite pull requests ever

Contributor

@ryanpetrello left a comment

Thanks for digging into this @fosterseth; this part of AWX can be gnarly, and I'm excited about this optimization - great work!

Once zuul is passing, and @elyezer signs off, let's merge this one 🚢

@elyezer
Member

elyezer commented Jan 31, 2020

@fosterseth @ryanpetrello so the intent here is to improve the dependency graph generation without changing any current behavior, right?

I think @squidboylan will be interested in taking a look at this one.

@squidboylan
Contributor

Yep, once zuul is passing I'll take a look and can sign off

Member

@matburt left a comment

I like it.

@squidboylan squidboylan self-assigned this Jan 31, 2020
@fosterseth fosterseth force-pushed the tm_processed_field branch 2 times, most recently from e6c1899 to edd7634 Compare January 31, 2020 20:53
@softwarefactory-project-zuul
Contributor

Build failed.

@softwarefactory-project-zuul
Contributor

Build succeeded.

This adds a boolean "dependencies_processed" field to UnifiedJob
model. The default value is False. Once the task manager generates
dependencies for this task, it will not generate them again on
subsequent runs.

The changes also remove .process_dependencies(), as this method repeats
the same code as .process_pending_tasks(), and is not needed. Once
dependencies are generated, they are handled at .process_pending_tasks().

Adds a unit test that should catch regressions for this fix.
@softwarefactory-project-zuul
Contributor

Build succeeded.

class Migration(migrations.Migration):

    dependencies = [
        ('main', '0107_v370_workflow_convergence_api_toggle'),
Contributor

🎉

@fosterseth
Member Author

fosterseth commented Feb 6, 2020

@elyezer yes the intent is to preserve the behavior overall, just improve performance by removing redundant work.

@softwarefactory-project-zuul
Contributor

Build succeeded (gate pipeline).

@softwarefactory-project-zuul softwarefactory-project-zuul bot merged commit ce5c435 into ansible:devel Feb 6, 2020
@fosterseth
Member Author

fosterseth commented Feb 7, 2020

I am wondering what happens when a project update fails.

One consequence of this PR is that start_task() is called differently for a project/inventory update:

start_task(task, instance_group, tasks_to_fail)

old: tasks_to_fail is a list that includes just the first dependent task
new: tasks_to_fail is a list of all the dependent tasks (so if 300 jobs are dependent on that project update, this list includes all 300 jobs)

If the project update succeeds, then all is well and this list is never used.
If the project update fails, then a callback will attempt to process all 300 jobs:

def handle_work_error(task_id, *args, **kwargs):

I'm not sure this method is intended to handle a lot of items; it might bog the system down if a project update failed with a ton of pending tasks behind it -- I see a lot of queries, saves, and websocket messages.

Note: The old task manager would eventually fail the tasks through the pre_run_hook()
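
The old-vs-new difference in the tasks_to_fail argument can be illustrated with a toy list (these names are hypothetical; the real call site is start_task() in the task manager):

```python
# 300 jobs all depend on one project update.
dependent_jobs = ['job%d' % i for i in range(300)]

tasks_to_fail_old = dependent_jobs[:1]    # old: just the first dependent task
tasks_to_fail_new = list(dependent_jobs)  # new: every dependent task

# If the project update fails, the error callback now receives all 300
# jobs at once instead of one, hence the concern about query/save volume.
assert len(tasks_to_fail_old) == 1
assert len(tasks_to_fail_new) == 300
```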

@ryanpetrello
Contributor

ryanpetrello commented Feb 7, 2020

@fosterseth I actually think the new behavior (update them all) makes more sense in this failure scenario.

I'm not sure if this method is intended to handle a lot of items and might bog the system down if a project failed with a ton of pending tasks behind it

I'm not drastically concerned about this scenario; this doesn't sound like a very regularly occurring event to me.

@chrismeyersfsu
Member

chrismeyersfsu commented Feb 7, 2020

For completeness' sake, I'll put what I said in the merge meeting here in the PR.

Setup:

  • <JT1, allow_simultaneous=False> that just sleeps for 5 seconds
  • <P1, launch_on_update=True, cache_timeout=5 seconds>
  • Launch JT1 => J1, J2, J3, ...

Before:

  • J1 will spawn PU1,1
  • J2 will spawn PU1,2
  • J3 will spawn PU1,3
  • ...

After:

  • J1, J2, J3, ... dependency generation will result in PU1,1

I am fine with this. This is better, as long as the user can set cache_timeout=0 to get, effectively, the Before behavior. Why might this be important? Because I can imagine a customer running a Job that does a git code update, in a workflow, that relies on the next job syncing that code update.
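
A rough sketch of the cache_timeout decision described above (illustrative only, not the real AWX project-update logic):

```python
import time

def needs_project_update(last_update_ts, cache_timeout, now=None):
    """Reuse a recent project update while its cache is still fresh.

    cache_timeout=0 disables reuse entirely, restoring the "Before"
    behavior of one project update per dependent job.
    """
    now = time.time() if now is None else now
    if cache_timeout == 0:
        return True  # never reuse a cached update
    return (now - last_update_ts) > cache_timeout

assert needs_project_update(last_update_ts=100, cache_timeout=0, now=101)      # always update
assert not needs_project_update(last_update_ts=100, cache_timeout=5, now=103)  # cache fresh
assert needs_project_update(last_update_ts=100, cache_timeout=5, now=110)      # cache stale
```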
