
Improve task manager performance for task dependencies #5787

Merged

Conversation

fosterseth
Member

SUMMARY

#5154
continuation of #5672

Task dependencies include project updates or inventory source updates. The task manager looks through each Job in the "pending" state and determines whether it needs to launch an associated project update or inventory source update. This part of the task manager is quite slow.

The task manager is not persistent. A new TaskManager object is launched on each run (once every 30 seconds, and on external signals), so it must rediscover the state of running and pending jobs every time.

These changes help alleviate the redundant work the task manager performs each time it is called.

1. Remove the method .process_dependencies()
This method is nearly a copy-and-paste of .process_pending_tasks(), so we can call that directly instead.

2. Add a dependencies_processed boolean field to the UnifiedJob model
This boolean should be read as "the task manager has checked and processed potential dependencies for this job". Not all jobs have dependencies by nature (ad hoc commands, for example). The task manager's job is to check whether a job has dependencies, and whether it should spawn those dependencies.

3. Only generate dependencies once for each task
The current task manager will generate/spawn dependencies for pending jobs each time it runs. We only need to do this once. We can check that task.dependencies_processed is False before running generate_dependencies().
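
The once-only flow in points 2 and 3 can be sketched in plain Python. Task, generate_dependencies, and schedule here are simplified stand-ins, not the real AWX classes (the real code lives in the TaskManager):

```python
# Simplified stand-ins for the real UnifiedJob / TaskManager machinery.
class Task:
    def __init__(self, name):
        self.name = name
        self.dependencies_processed = False  # new UnifiedJob field, default False

def generate_dependencies(undeped_tasks):
    """Spawn dependencies only for tasks not yet processed, then flip the flag."""
    spawned = []
    for task in undeped_tasks:
        spawned.append('project_update_for_' + task.name)  # placeholder dependency
        task.dependencies_processed = True
    return spawned

def schedule(pending_tasks):
    # Pass in only tasks with dependencies_processed=False, so repeat runs
    # do not re-save rows that are already done.
    undeped_tasks = [t for t in pending_tasks if not t.dependencies_processed]
    return generate_dependencies(undeped_tasks)

tasks = [Task('job1'), Task('job2')]
first_run = schedule(tasks)   # spawns one dependency per job
second_run = schedule(tasks)  # no-op: every flag is already True
```

On the second run the filter leaves nothing to do, which is the whole performance win: dependency generation happens once per job instead of once per scheduler cycle.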

Performance:
Calling TaskManager().schedule() on 5,000 "pending" state Jobs takes around 8 seconds (vs 200 seconds with the previous TaskManager)

Note: this is the elapsed time after the dependencies have already been generated. Generating dependencies for 5,000 jobs for the first time takes a few minutes.

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
AWX VERSION
awx: 9.1.1
ADDITIONAL INFORMATION

@fosterseth fosterseth changed the title TaskManager process dependencies only once Improve task manager performance for task dependencies Jan 28, 2020
@@ -598,6 +536,8 @@ def process_tasks(self, all_sorted_tasks):
        self.process_running_tasks(running_tasks)

        pending_tasks = [t for t in all_sorted_tasks if t.status == 'pending']
        dependencies = self.generate_dependencies(pending_tasks)
Member

Why not only pass in tasks that have dependencies_processed=False?

Otherwise, you're attempting updates to lots of rows every 30 seconds. Even if they are no-ops, that sounds like a recipe for database conflicts.

Member Author

good catch, fixed it.

@softwarefactory-project-zuul
Contributor

Build failed.

@fosterseth
Member Author

The unit test is problematic

Although TaskManager()._schedule() is considerably faster now, the total elapsed time can vary quite a bit from machine to machine. For example, it takes 0.2 seconds on my machine, but 1.3 seconds in the Zuul tests. I could make the threshold really high, like 5 seconds, but then the test becomes less meaningful.

Therefore it seems measuring the elapsed time and comparing to a flat expected value is not a great idea. I will drop this test unless someone can think of a more reliable way to approach this.

@softwarefactory-project-zuul
Contributor

Build failed.

@AlanCoding
Member

AlanCoding commented Jan 29, 2020

I think that coverage in Zuul tests would be good. For example, create a job, run the task manager, and verify that the flag is swapped. Run it again, and verify that the generate_dependencies method is called without that job.

Since this is an internal field, this is the only place we can have this kind of coverage (except indirectly of course).

-def generate_dependencies(self, task):
-    dependencies = []
-    if type(task) is Job:
+def generate_dependencies(self, pending_tasks):
Member

might be better to call the argument something else, like undeped_tasks

Member Author

changed variable name to undeped_tasks

@fosterseth
Member Author

@AlanCoding Wrote a new functional unit test that makes sure the dependencies_processed flag is flipped on a job after a single ._schedule() run. It also makes sure subsequent ._generate_dependencies() calls do not include that job in the argument list (as it has already been processed).
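
A stripped-down version of that check, with a fake job and a fake scheduler standing in for the real TaskManager and Job fixtures, might look like:

```python
from unittest import mock

class FakeJob:
    dependencies_processed = False

def fake_schedule(pending_tasks, generate_dependencies):
    # Mirrors the relevant part of TaskManager._schedule(): only unprocessed
    # tasks are handed to generate_dependencies, then their flags are flipped.
    undeped = [t for t in pending_tasks if not t.dependencies_processed]
    generate_dependencies(undeped)
    for t in undeped:
        t.dependencies_processed = True

job = FakeJob()
gen = mock.Mock()

fake_schedule([job], gen)
assert job.dependencies_processed is True  # flag flipped after the first run

fake_schedule([job], gen)
assert job not in gen.call_args[0][0]      # second call excludes the processed job
```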

@softwarefactory-project-zuul
Contributor

Build failed.

# .call_args is a tuple (positional_args, kwargs), so [0][0] is
# the first positional arg, i.e. the first argument passed to
# .generate_dependencies()
assert job not in tm.generate_dependencies.call_args[0][0]
Member

Since there are no other jobs in the database here, I would prefer a stricter assertion of tm.generate_dependencies.call_args[0][0] == []

Member Author

@AlanCoding because project.scm_update_on_launch == True, then .call_args[0][0] contains [ProjectUpdate]. I like the idea of having a stricter assertion though, so I will turn off the project update and assert the empty list.
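
For reference, the .call_args shape being discussed works like this with a plain unittest.mock.Mock (the names here are illustrative, not the real test):

```python
from unittest import mock

generate_dependencies = mock.Mock()
generate_dependencies(['job1', 'job2'], strict=True)

# call_args[0] is the tuple of positional args; [0][0] is the first one.
positional_args, kwargs = generate_dependencies.call_args
assert positional_args[0] == ['job1', 'job2']
assert kwargs == {'strict': True}
```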

Member

@AlanCoding left a comment

This is one of my favorite pull requests ever

Contributor

@ryanpetrello left a comment

Thanks for digging into this @fosterseth; this part of AWX can be gnarly, and I'm excited about this optimization - great work!

Once zuul is passing, and @elyezer signs off, let's merge this one 🚢

@elyezer
Member

elyezer commented Jan 31, 2020

@fosterseth @ryanpetrello so the intent here is to improve the dependency graph generation without changing any current behavior, right?

I think @squidboylan will be interested in taking a look at this one.

@squidboylan
Contributor

Yep, once zuul is passing I'll take a look and can sign off

Member

@matburt left a comment

I like it.

@squidboylan squidboylan self-assigned this Jan 31, 2020
@fosterseth fosterseth force-pushed the tm_processed_field branch 2 times, most recently from e6c1899 to edd7634 Compare January 31, 2020 20:53
@softwarefactory-project-zuul
Contributor

Build failed.

@softwarefactory-project-zuul
Contributor

Build succeeded.

This adds a boolean "dependencies_processed" field to UnifiedJob
model. The default value is False. Once the task manager generates
dependencies for this task, it will not generate them again on
subsequent runs.

The changes also remove .process_dependencies(), as this method repeats
the same code as .process_pending_tasks(), and is not needed. Once
dependencies are generated, they are handled at .process_pending_tasks().

Adds a unit test that should catch regressions for this fix.
@softwarefactory-project-zuul
Contributor

Build succeeded.

class Migration(migrations.Migration):

    dependencies = [
        ('main', '0107_v370_workflow_convergence_api_toggle'),
Contributor

🎉

@fosterseth
Member Author

fosterseth commented Feb 6, 2020

@elyezer yes the intent is to preserve the behavior overall, just improve performance by removing redundant work.

@softwarefactory-project-zuul
Contributor

Build succeeded (gate pipeline).

@softwarefactory-project-zuul softwarefactory-project-zuul bot merged commit ce5c435 into ansible:devel Feb 6, 2020
@fosterseth
Member Author

fosterseth commented Feb 7, 2020

I am wondering what happens when a project update fails.

One consequence of this PR is that start_task() is called differently for a project/inventory update:

start_task(task, instance_group, tasks_to_fail)

old: tasks_to_fail is a list that includes just the first dependent task
new: tasks_to_fail is a list of all the dependent tasks (so if 300 jobs are dependent on that project update, this list includes all 300 jobs)

If the project update succeeds, then all is well and this list is never used.
If the project update fails, then a callback will attempt to process all 300 jobs:

def handle_work_error(task_id, *args, **kwargs):

I'm not sure this method is intended to handle a lot of items; it might bog the system down if a project update failed with a ton of pending tasks behind it -- I see a lot of queries, saves, and websocket messages.

Note: The old task manager would eventually fail the tasks through the pre_run_hook()
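
The old-vs-new difference in the tasks_to_fail argument can be illustrated with a toy list (these names are hypothetical; the real call site is start_task() in the task manager):

```python
# 300 jobs all depend on one project update.
dependent_jobs = ['job%d' % i for i in range(300)]

tasks_to_fail_old = dependent_jobs[:1]    # old: just the first dependent task
tasks_to_fail_new = list(dependent_jobs)  # new: every dependent task

# If the project update fails, the error callback now receives all 300
# jobs at once instead of one, hence the concern about query/save volume.
assert len(tasks_to_fail_old) == 1
assert len(tasks_to_fail_new) == 300
```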

@ryanpetrello
Contributor

ryanpetrello commented Feb 7, 2020

@fosterseth I actually think the new behavior (update them all) makes more sense in this failure scenario.

I'm not sure if this method is intended to handle a lot of items and might bog the system down if a project failed with a ton of pending tasks behind it

I'm not drastically concerned about this scenario; this doesn't sound like a very regularly occurring event to me.

@chrismeyersfsu
Member

chrismeyersfsu commented Feb 7, 2020

For completeness' sake, I'll put what I said in the merge meeting here in the PR.

Setup:

  • <JT1, allow_simultaneous=False> that just sleeps for 5 seconds
  • <P1, launch_on_update=True, cache_timeout=5 seconds>
  • Launch JT1 => J1, J2, J3, ...

Before:

  • J1 will spawn PU1,1
  • J2 will spawn PU1,2
  • J3 will spawn PU1,3
  • ...

After:

  • J1, J2, J3, ... dependency generation will result in PU1,1

I am fine with this. This is better, as long as the user can set cache_timeout=0 to get, effectively, the Before behavior. Why might this be important? Because I can imagine a customer running a Job that does a git code update, in a workflow, that relies on the next job syncing that code update.
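
A rough sketch of the cache_timeout decision described above (illustrative only, not the real AWX project-update logic):

```python
import time

def needs_project_update(last_update_ts, cache_timeout, now=None):
    """Reuse a recent project update while its cache is still fresh.

    cache_timeout=0 disables reuse entirely, restoring the "Before"
    behavior of one project update per dependent job.
    """
    now = time.time() if now is None else now
    if cache_timeout == 0:
        return True  # never reuse a cached update
    return (now - last_update_ts) > cache_timeout

assert needs_project_update(last_update_ts=100, cache_timeout=0, now=101)      # always update
assert not needs_project_update(last_update_ts=100, cache_timeout=5, now=103)  # cache fresh
assert needs_project_update(last_update_ts=100, cache_timeout=5, now=110)      # cache stale
```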
