Add Scheduling Topology Hints #180

Merged
merged 9 commits into from
Nov 25, 2021
12 changes: 10 additions & 2 deletions include/faabric/scheduler/Scheduler.h
@@ -102,7 +102,9 @@ class Scheduler

faabric::util::SchedulingDecision callFunctions(
std::shared_ptr<faabric::BatchExecuteRequest> req,
bool forceLocal = false);
bool forceLocal = false,
faabric::util::SchedulingTopologyHint =
faabric::util::SchedulingTopologyHint::NORMAL);

faabric::util::SchedulingDecision callFunctions(
std::shared_ptr<faabric::BatchExecuteRequest> req,
@@ -177,6 +179,11 @@ class Scheduler

void clearRecordedMessages();

faabric::util::SchedulingDecision publicMakeSchedulingDecision(
std::shared_ptr<faabric::BatchExecuteRequest> req,
bool forceLocal,
faabric::util::SchedulingTopologyHint topologyHint);
Collaborator

@Shillaker Shillaker Nov 24, 2021

The need for this function isn't immediately obvious and it feels like a bit of a hack. In general, if an API needs to be changed to support a test, one of two things is going on: (i) the test is too invasive and checks too much internal logic; or (ii) the API doesn't expose enough information.

In this case I think it's (i). I think it can be changed relatively easily, as this function and callFunctions have the same signature. It might be possible to change the tests to call callFunctions instead (as they have mock mode turned on). You can then add a check to make sure the underlying function calls have been dispatched to the expected hosts.

If this isn't possible, then we need to work out what's happening in callFunctions that doesn't work properly in mock mode.

Collaborator Author

@csegarragonz csegarragonz Nov 24, 2021

Yes, switching to callFunctions is pretty easy, with the only caveat that we need to use the TestExecutor and TestExecutorFactory classes that currently live in ./tests/tests/scheduler/test_executor.cpp.

I have moved the declaration of these classes to fixtures.h and kept the definition where it is.

I have also added a check on the recorded messages in the function call client.
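
As a rough illustration of the approach discussed in this thread (exercise only the public entry point, then inspect what a mocked client recorded), here is a minimal self-contained sketch. `RecordingClient`, `ToyScheduler`, the host names and the placement policy are all hypothetical stand-ins and do not reflect the actual faabric classes or test fixtures.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a function call client in mock mode: it sends
// nothing over the network and only records each (host, message id) pair.
class RecordingClient
{
  public:
    void dispatch(const std::string& host, int messageId)
    {
        recorded.emplace_back(host, messageId);
    }

    const std::vector<std::pair<std::string, int>>& getRecorded() const
    {
        return recorded;
    }

  private:
    std::vector<std::pair<std::string, int>> recorded;
};

// Hypothetical scheduler whose placement logic stays private. A test calls
// callFunctions and observes the outcome through the recording client.
class ToyScheduler
{
  public:
    explicit ToyScheduler(RecordingClient& clientIn)
      : client(clientIn)
    {}

    void callFunctions(const std::vector<int>& messageIds)
    {
        std::vector<std::string> hosts = makeDecision(messageIds.size());
        for (std::size_t i = 0; i < messageIds.size(); i++) {
            client.dispatch(hosts[i], messageIds[i]);
        }
    }

  private:
    // Placeholder policy (an assumption for this sketch): the first two
    // messages go to "hostA" and any others go to "hostB".
    std::vector<std::string> makeDecision(std::size_t nMessages)
    {
        std::vector<std::string> hosts;
        for (std::size_t i = 0; i < nMessages; i++) {
            hosts.push_back(i < 2 ? "hostA" : "hostB");
        }
        return hosts;
    }

    RecordingClient& client;
};
```

A test built this way never needs a public wrapper around the private decision function: it calls `callFunctions` and compares `getRecorded()` against the expected hosts.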


// ----------------------------------
// Exec graph
// ----------------------------------
@@ -233,7 +240,8 @@ class Scheduler

faabric::util::SchedulingDecision makeSchedulingDecision(
std::shared_ptr<faabric::BatchExecuteRequest> req,
bool forceLocal);
bool forceLocal,
faabric::util::SchedulingTopologyHint topologyHint);

faabric::util::SchedulingDecision doCallFunctions(
std::shared_ptr<faabric::BatchExecuteRequest> req,
12 changes: 12 additions & 0 deletions include/faabric/util/scheduling.h
@@ -41,4 +41,16 @@ class SchedulingDecision
int32_t appIdx,
int32_t groupIdx);
};

// Scheduling topology hints help the scheduler decide which host to assign new
// requests in a batch.
// - NORMAL: bin-packs requests to slots in hosts starting from the master
// host, and overloads the master if it runs out of resources.
// - PAIRS: never allocates a single (non-master) request to a host without
// other requests of the batch.
enum SchedulingTopologyHint
{
NORMAL,
PAIRS
};
}
58 changes: 47 additions & 11 deletions src/scheduler/Scheduler.cpp
@@ -210,7 +210,8 @@ void Scheduler::notifyExecutorShutdown(Executor* exec,

faabric::util::SchedulingDecision Scheduler::callFunctions(
std::shared_ptr<faabric::BatchExecuteRequest> req,
bool forceLocal)
bool forceLocal,
faabric::util::SchedulingTopologyHint topologyHint)
{
// Note, we assume all the messages are for the same function and have the
// same master host
@@ -236,7 +237,8 @@ faabric::util::SchedulingDecision Scheduler::callFunctions(

faabric::util::FullLock lock(mx);

SchedulingDecision decision = makeSchedulingDecision(req, forceLocal);
SchedulingDecision decision =
makeSchedulingDecision(req, forceLocal, topologyHint);

// Send out point-to-point mappings if necessary (unless being forced to
// execute locally, in which case they will be transmitted from the
@@ -249,9 +251,22 @@ faabric::util::SchedulingDecision Scheduler::callFunctions(
return doCallFunctions(req, decision, lock);
}

faabric::util::SchedulingDecision Scheduler::publicMakeSchedulingDecision(
std::shared_ptr<faabric::BatchExecuteRequest> req,
bool forceLocal,
faabric::util::SchedulingTopologyHint topologyHint)
{
if (!faabric::util::isTestMode()) {
throw std::runtime_error("This function must only be called in tests");
}

return makeSchedulingDecision(req, forceLocal, topologyHint);
}

faabric::util::SchedulingDecision Scheduler::makeSchedulingDecision(
std::shared_ptr<faabric::BatchExecuteRequest> req,
bool forceLocal)
bool forceLocal,
faabric::util::SchedulingTopologyHint topologyHint)
{
int nMessages = req->messages_size();
faabric::Message& firstMsg = req->mutable_messages()->at(0);
@@ -296,8 +311,20 @@ faabric::util::SchedulingDecision Scheduler::makeSchedulingDecision(
int available = r.slots() - r.usedslots();
int nOnThisHost = std::min(available, remainder);

for (int i = 0; i < nOnThisHost; i++) {
hosts.push_back(h);
// Under the pairs topology hint, we never allocate a single
Collaborator

@Shillaker Shillaker Nov 24, 2021

My gut here would instead be:

if(topologyHint == faabric::util::SchedulingTopologyHint::PAIRS &&
      nOnThisHost < 2) {
    // Move on if we can't colocate function with at least one other
    continue;
}

I think this fixes the issue you mention in the PR description too.

Collaborator Author

@csegarragonz csegarragonz Nov 24, 2021

I don't think this would behave as expected. For example:

  • We have 4 hosts with 4 slots each.
  • We want to schedule 9 requests (with the NEVER_ALONE hint).
  • We expect 4 requests scheduled to the first host, and 5 scheduled to the second.

However, I think your solution would schedule 5 requests on the first host and 4 on the second: it would exhaust all remaining hosts (nOnThisHost == 1 for each of them) and then resort to overloading the master.
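
To make the example above concrete, here is a small self-contained simulation of the two strategies. It is only a toy model under stated assumptions: hosts are reduced to a vector of non-negative free slot counts, host 0 plays the role of the master, and any leftover requests overload the master. The names `Strategy`, `schedulePairs` and `countPerHost` are hypothetical and are not part of the Scheduler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is a toy model, not faabric code.
enum class Strategy
{
    SKIP_HOST,        // `continue` when fewer than 2 requests fit on a host
    STICK_TO_PREVIOUS // reuse the previous host when exactly 1 would fit
};

// freeSlots[i] is the (non-negative) number of free slots on host i, and
// host 0 is assumed to be the master. Returns the host index chosen for each
// request, in scheduling order.
std::vector<int> schedulePairs(const std::vector<int>& freeSlots,
                               int nRequests,
                               Strategy strategy)
{
    assert(!freeSlots.empty());

    std::vector<int> hosts;
    int remainder = nRequests;

    for (std::size_t h = 0; h < freeSlots.size() && remainder > 0; h++) {
        int nOnThisHost = std::min(freeSlots[h], remainder);

        if (strategy == Strategy::SKIP_HOST && nOnThisHost < 2) {
            // Move on if we can't colocate with at least one other request
            continue;
        }

        bool stickToPreviousHost = strategy == Strategy::STICK_TO_PREVIOUS &&
                                   nOnThisHost == 1 && !hosts.empty();
        if (stickToPreviousHost) {
            hosts.push_back(hosts.back());
        } else {
            for (int i = 0; i < nOnThisHost; i++) {
                hosts.push_back(static_cast<int>(h));
            }
        }

        remainder -= nOnThisHost;
    }

    // Assumed fallback: anything left over overloads the master (host 0)
    for (int i = 0; i < remainder; i++) {
        hosts.push_back(0);
    }

    return hosts;
}

// Counts how many requests were assigned to each host index
std::vector<int> countPerHost(const std::vector<int>& hosts,
                              std::size_t nHosts)
{
    std::vector<int> counts(nHosts, 0);
    for (int h : hosts) {
        counts.at(h)++;
    }
    return counts;
}
```

With four hosts of four free slots each and nine requests, this model gives per-host counts of 5, 4, 0, 0 under `SKIP_HOST` and 4, 5, 0, 0 under `STICK_TO_PREVIOUS`, which matches the counts described above. Note that the second result places five requests on a host with only four slots.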

Collaborator Author

After an offline discussion, I have applied this change together with a change to the overloading logic, which makes the issue mentioned in the description disappear.

// non-master request (id != 0) to a host without other
// requests of the batch
bool stickToPreviousHost =
(topologyHint ==
faabric::util::SchedulingTopologyHint::PAIRS &&
nOnThisHost == 1 && hosts.size() > 0);

if (stickToPreviousHost) {
hosts.push_back(hosts.back());
} else {
for (int i = 0; i < nOnThisHost; i++) {
hosts.push_back(h);
}
}

remainder -= nOnThisHost;
@@ -323,13 +350,22 @@ faabric::util::SchedulingDecision Scheduler::makeSchedulingDecision(
int available = r.slots() - r.usedslots();
int nOnThisHost = std::min(available, remainder);

// Register the host if it's executed a function
if (nOnThisHost > 0) {
registeredHosts[funcStr].insert(h);
}
bool stickToPreviousHost =
(topologyHint ==
faabric::util::SchedulingTopologyHint::PAIRS &&
nOnThisHost == 1 && hosts.size() > 0);

for (int i = 0; i < nOnThisHost; i++) {
hosts.push_back(h);
if (stickToPreviousHost) {
hosts.push_back(hosts.back());
} else {
// Register the host if it's executed a function
if (nOnThisHost > 0) {
registeredHosts[funcStr].insert(h);
}

for (int i = 0; i < nOnThisHost; i++) {
hosts.push_back(h);
}
}

remainder -= nOnThisHost;