Add Scheduling Topology Hints #180

Merged · 9 commits · Nov 25, 2021
Conversation

@csegarragonz (Collaborator) commented Nov 24, 2021

In this PR I add support for passing scheduling topology hints as a parameter to callFunctions. Scheduling hints affect the way batches are scheduled onto the available resources. The currently supported hints are:

  • NORMAL: bin-packs requests onto hosts until we run out of slots, then overloads the master (default).
  • NEVER_ALONE: never schedules a request to a host without other requests from the same batch. There is one exception: master requests (the first in the batch) will be allocated to a new host (even on their own) if the master is out of resources.

These hints only affect the SchedulingDecision, so the separation recently made between making the scheduling decision and actually calling the functions has proven extremely useful for implementation and testing. I also add tooling that makes it easy to test these and any future hints, together with a number of tests.
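A minimal sketch of a call site with a hint, assuming the hint is simply passed as an extra argument to callFunctions (the exact parameter list may differ; batchExecFactory and getScheduler are faabric's existing helpers):

auto req = faabric::util::batchExecFactory("demo", "echo", 4);
auto& sch = faabric::scheduler::getScheduler();

// Schedule the batch so that no message ends up alone on a host
sch.callFunctions(req, faabric::util::SchedulingTopologyHint::NEVER_ALONE);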

include/faabric/scheduler/Scheduler.h
faabric::util::SchedulingDecision publicMakeSchedulingDecision(
  std::shared_ptr<faabric::BatchExecuteRequest> req,
  bool forceLocal,
  faabric::util::SchedulingTopologyHint topologyHint);
@Shillaker (Collaborator) commented Nov 24, 2021

The need for this function isn't immediately obvious and it feels like a bit of a hack. In general if an API needs to be changed to support a test you have one of two things going on: (i) your test is too invasive and checking too much internal logic; (ii) the API doesn't expose enough information.

In this case I think it's (i). I think it can be changed relatively easily, as this and callFunctions have the same signature. It might be possible to change the tests to call callFunctions instead (as they have mock mode turned on). You can then add a check to make sure the underlying function calls have been dispatched to the expected hosts.

If this isn't possible, then we need to work out what's happening in callFunctions that doesn't work properly in mock mode.
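A rough sketch of the suggested pattern, assuming mock mode records each outgoing batch as a (host, request) pair in the function call client (helper names may differ from the real test utilities):

faabric::util::setMockMode(true);

auto req = faabric::util::batchExecFactory("demo", "echo", 3);
sch.callFunctions(req, faabric::util::SchedulingTopologyHint::NEVER_ALONE);

// Check that the underlying calls were dispatched to the expected hosts
auto batchRequests = faabric::scheduler::getBatchRequests();
REQUIRE(batchRequests.at(0).first == expectedHosts.at(0));

faabric::util::setMockMode(false);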

@csegarragonz (Collaborator, Author) commented Nov 24, 2021

Yes, switching to callFunctions is pretty easy, with the only caveat that we need to use the TestExecutor and TestExecutorFactory classes that currently live in tests/test/scheduler/test_executor.cpp.

I have moved the declaration of these classes to fixtures.h and kept the definition where it is.

I also add a check for the recorded messages in the function call client.
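For reference, a rough sketch of the resulting split (declarations in the shared fixtures header, definitions left where they were; member signatures are illustrative):

// tests/utils/fixtures.h
class TestExecutor final : public faabric::scheduler::Executor
{
  public:
    TestExecutor(faabric::Message& msg);

    int32_t executeTask(
      int threadPoolIdx,
      int msgIdx,
      std::shared_ptr<faabric::BatchExecuteRequest> req) override;
};

class TestExecutorFactory : public faabric::scheduler::ExecutorFactory
{
  protected:
    std::shared_ptr<faabric::scheduler::Executor> createExecutor(
      faabric::Message& msg) override;
};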

include/faabric/util/scheduling.h (outdated)
#include <faabric/util/func.h>
#include <faabric/util/scheduling.h>

using namespace faabric::util;

namespace tests {

TEST_CASE("Test building scheduling decisions", "[util]")
TEST_CASE("Test building scheduling decisions", "[util][scheduling-decisions]")
@Shillaker (Collaborator) commented Nov 24, 2021

I think these tags are getting a bit too granular. Admittedly the util tag is bloated, but as mentioned above, these tests are really testing the scheduler, so once they've been moved they can be added to the [scheduler] tag. The point of the tags is to let someone quickly run the tests related to the thing they've changed; when the tags are this granular, however, it's hard to work out which ones to run.

Catch2 provides various selectors that make it easy to run subsets of tests in your dev workflow if you're finding there are too many to run in a loop, e.g. by file: https://catch2.docsforge.com/v2.13.2/running/command-line/.
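For example (binary name as used later in this thread; the -# flag enables Catch2's filename tags):

faabric_tests "[scheduler]"           # run everything tagged [scheduler]
faabric_tests -# "[#test_executor]"   # run everything in test_executor.cpp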

@csegarragonz (Collaborator, Author) commented Nov 24, 2021

I agree, thanks. For future reference, you can run all tests in tests/test/scheduler/test_scheduling_decisions.cpp by running:

faabric_tests -# [#test_scheduling_decisions]

tests/test/util/test_scheduling.cpp (outdated; 4 resolved review threads)
@@ -296,8 +311,20 @@ faabric::util::SchedulingDecision Scheduler::makeSchedulingDecision(
int available = r.slots() - r.usedslots();
int nOnThisHost = std::min(available, remainder);

for (int i = 0; i < nOnThisHost; i++) {
    hosts.push_back(h);
// Under the pairs topology hint, we never allocate a single
@Shillaker (Collaborator) commented Nov 24, 2021

My gut here would instead be:

if (topologyHint == faabric::util::SchedulingTopologyHint::PAIRS &&
    nOnThisHost < 2) {
    // Move on if we can't colocate function with at least one other
    continue;
}

I think this fixes the issue you mention in the PR description too.

@csegarragonz (Collaborator, Author) commented Nov 24, 2021

I don't think this would behave as expected. For example:

  • We have 4 hosts with 4 slots each.
  • We want to schedule 9 requests (with the NEVER_ALONE hint).
  • We expect 4 requests scheduled to the first host, and 5 scheduled to the second.

However, I think your solution would schedule 5 requests on the first host and 4 on the second: it would exhaust all the remaining hosts (nOnThisHost == 1 for each of them) and resort to overloading the master.
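A quick trace of that scenario with the suggested continue, as I read the loop (host names are illustrative):

// 9 requests; hosts A (master), B, C, D with 4 slots each; NEVER_ALONE
// A: nOnThisHost = 4                  -> schedule 4, remainder = 5
// B: nOnThisHost = 4                  -> schedule 4, remainder = 1
// C: nOnThisHost = min(4, 1) = 1 < 2  -> continue
// D: nOnThisHost = 1 < 2              -> continue
// The leftover request overloads the master -> A ends with 5, B with 4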

@csegarragonz (Collaborator, Author) added:

After an offline discussion, I adopted this change together with a change to the overloading logic, which makes the issue mentioned in the description disappear.

src/scheduler/Scheduler.cpp (outdated)
@@ -29,238 +29,224 @@ namespace tests {
std::atomic<int> restoreCount = 0;
std::atomic<int> resetCount = 0;

class TestExecutor final : public Executor
TestExecutor::TestExecutor(faabric::Message& msg)
@csegarragonz (Collaborator, Author) commented Nov 24, 2021

Changes here involve moving from a class declaration plus definition to just the definition, along with some forceLocal refactoring.

};
}

SECTION("Decreasing to one and increasing slot distribution")
@csegarragonz (Collaborator, Author) added:

I add this test, which would previously have failed with NEVER_ALONE.
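A hypothetical shape for that section, following the config-struct pattern of the scheduling-decision tests (field names, slot counts and expected hosts are illustrative; the overload target assumes the revised overloading logic discussed above):

SECTION("Decreasing to one and increasing slot distribution")
{
    // Slots per host: 2, then 1, then 2; five requests with NEVER_ALONE.
    // The middle host can only take a single message on its own, so the
    // hint must skip it rather than strand one request there.
    config.topologyHint = faabric::util::SchedulingTopologyHint::NEVER_ALONE;
    config.slots = { 2, 1, 2 };
    config.numReqs = 5;
    config.expectedHosts = { hosts.at(0), hosts.at(0),
                             hosts.at(2), hosts.at(2), hosts.at(2) };
}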

include/faabric/util/scheduling.h (outdated)
src/scheduler/Scheduler.cpp
src/scheduler/Scheduler.cpp (outdated)
tests/test/scheduler/test_executor.cpp (outdated)
@Shillaker (Collaborator) commented Nov 24, 2021

@csegarragonz, I forgot to add a comment to that approval. A couple of nitpicky comments, but other than that, LGTM.

Labels: enhancement (New feature or request), scheduler
Projects: none yet
Linked issues: none yet
Participants: 2