Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor changes to increase the performance of nodeDb ScheduleMany #2318

Merged
merged 5 commits into from
Apr 13, 2023

Conversation

d80tb7
Copy link
Collaborator

@d80tb7 d80tb7 commented Apr 2, 2023

severinson JamesMurkin

ScheduleMany can be quite slow in the case where there are a large number of nodes and jobs cannot be assigned to any node. Here's a very simple benchmark:

func TestScheduleManyPerf(t *testing.T) {
	nodes := testNCpuNode(1000, testPriorities)
	nodeDb, err := createNodeDb(nodes)
	require.NoError(t, err)

	reqs := testNLargeCpuJob("foo", 0, 4000)
	for _, r := range reqs {
		_, _, err = nodeDb.ScheduleMany([]*schedulerobjects.PodRequirements{r})
		require.NoError(t, err)
	}
}

This benchmark create a 1000 node cluster and then tries to schedule 4000 jobs on each node. The first 1K jobs will fill up the cluster, the remaining 3K jobs will all be unable to schedule due to no free resource. On my laptop the current master code runs this in 13.5 seconds.

The changes in this PR, reduce this to 9 Seconds with no external changes to behaviour. Specifically:

  • node.AvailableQuantityByPriorityAndResource now uses r.AsApproximateFloat64() rather than armadaresource.QuantityAsFloat64 the former is significantly faster and allocates far less. This should be safe because armadaresource.QuantityAsFloat64 exists for the cases where we overflow the float value, which should not occur on a single node.
  • InsufficientResources.String() uses a string concat rather than fmt.Sprintf as the former is much faster and this code this ends up getting called in a tight loop

Note that there is much more that we can optimise about this use case but these are the changes that are completely non-invasive. I'll raise a discussion for further optimisations that will require larger and/or functional changes

Update: #2319 for discussion on bigger changes we can make to improve this.

┆Issue is synchronized with this Jira Task by Unito

@codecov
Copy link

codecov bot commented Apr 2, 2023

Codecov Report

Patch coverage has no change and project coverage change: +0.02 🎉

Comparison is base (ec82b5c) 58.97% compared to head (b812cf3) 59.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2318      +/-   ##
==========================================
+ Coverage   58.97%   59.00%   +0.02%     
==========================================
  Files         225      225              
  Lines       28468    28470       +2     
==========================================
+ Hits        16790    16799       +9     
+ Misses      10386    10379       -7     
  Partials     1292     1292              
Flag Coverage Δ
armada-server 59.00% <0.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
internal/scheduler/schedulerobjects/node.go 0.00% <0.00%> (ø)
...nternal/scheduler/schedulerobjects/nodematching.go 84.16% <0.00%> (-0.71%) ⬇️

... and 2 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@JamesMurkin JamesMurkin merged commit c9456e7 into master Apr 13, 2023
@JamesMurkin JamesMurkin deleted the f/chrisma/perf branch April 13, 2023 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants