
Scheduler hints and fixes for distributed coordination bugs #169

Merged
Shillaker merged 31 commits into master from dist-lock-fixes on Nov 10, 2021

Conversation

@Shillaker (Collaborator) commented Nov 1, 2021

This is a mixed bag: all the changes required to get a larger OpenMP application working, plus some tidying up along the way.

Scheduler hints

At the moment "hints" are taken at face value by the scheduler (so they're more like scheduler plans). This is useful for a multithreaded application that repeatedly performs an operation with the same number of threads and doesn't need to keep repeating the scheduling decision. How applications decide to cache the scheduling decision is up to them. The change is relatively simple, and just involves adding one more scheduler method:

faabric::util::SchedulingDecision callFunctions(
    std::shared_ptr<faabric::BatchExecuteRequest> req,
    const faabric::util::SchedulingDecision& hint);

Here hint is the proposed scheduling decision. To implement this I just refactored the scheduling logic into two parts: (i) calculating the decision (only performed when no hint is passed); and (ii) executing a decision (performed whether or not a hint is passed).
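As a rough illustration of how a caller might use this, the sketch below computes a decision once and re-uses it as a hint for a later batch of the same shape. The batchExecFactory helper, getScheduler accessor and the "demo"/"func" names are assumptions, so treat this as a sketch rather than the exact API:

// Sketch only: assumes the relevant faabric scheduler/util headers are
// included; batchExecFactory and getScheduler are assumed helpers.
void runWithCachedDecision(int nThreads)
{
    auto& sch = faabric::scheduler::getScheduler();

    // First batch: the scheduler computes the decision itself
    auto reqA = faabric::util::batchExecFactory("demo", "func", nThreads);
    faabric::util::SchedulingDecision decision = sch.callFunctions(reqA);

    // Later batch with the same thread count: pass the cached decision
    // as a hint, skipping the decision-making step
    auto reqB = faabric::util::batchExecFactory("demo", "func", nThreads);
    sch.callFunctions(reqB, decision);
}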

Snapshot diff whitelisting

Snapshot merge regions are now specified as a whitelist rather than a blacklist. This makes more sense: we were adding a lot of logic around ignoring regions and performing default diffs, when in fact applications know which regions they want to merge. The "worst case" is a generic threaded application with no consistency guarantees that wants to merge the whole heap; such an application can specify a single merge region covering the heap.

This rendered Ignore regions obsolete, so I removed them.
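To make the whitelist model concrete, here is a sketch of how an application might register merge regions. The addMergeRegion signature, the registry accessor and the enum names are assumptions based on the faabric snapshot API and may differ:

// Sketch only: addMergeRegion, the registry accessor and the enum names
// are assumptions; offsets and sizes are illustrative variables.
auto& reg = faabric::snapshot::getSnapshotRegistry();
faabric::util::SnapshotData& snap = reg.getSnapshot(snapKey);

// Merge a shared counter by summing the original and updated values
snap.addMergeRegion(counterOffset,
                    sizeof(int32_t),
                    faabric::util::SnapshotDataType::Int,
                    faabric::util::SnapshotMergeOperation::Sum);

// "Worst case": a single overwrite region covering the whole heap
snap.addMergeRegion(heapOffset,
                    heapSize,
                    faabric::util::SnapshotDataType::Raw,
                    faabric::util::SnapshotMergeOperation::Overwrite);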

Bug fixes

Deadlocks and inconsistencies when waiting for point-to-point groups to be enabled - added more checks in the PointToPointBroker, and a FlagWaiter class, which allows an arbitrary number of threads to wait on a boolean flag set by another thread (we can't use a barrier, as we don't know in advance how many threads will need to wait).
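A minimal sketch of the FlagWaiter idea, assuming a plain mutex and condition variable underneath (the real class may add timeouts and other details):

#include <condition_variable>
#include <mutex>

class FlagWaiter
{
  public:
    // Block until another thread sets the flag; any number of threads
    // can wait, and the count need not be known in advance
    void waitOnFlag()
    {
        std::unique_lock<std::mutex> lock(mx);
        cv.wait(lock, [this] { return flag; });
    }

    // Set the flag and wake all current waiters
    void setFlag()
    {
        {
            std::lock_guard<std::mutex> lock(mx);
            flag = true;
        }
        cv.notify_all();
    }

  private:
    std::mutex mx;
    std::condition_variable cv;
    bool flag = false;
};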

Nested thread groups deadlocking when scheduled on the same thread pool thread as their parent - if thread A spawns a nested group of threads B1 and B2, we must ensure that neither B1 nor B2 is queued on the same thread pool thread as A (A depends on the completion of B1 and B2, but they wouldn't execute until A is finished). Previously the thread pool scheduling could allow this even when the Executor was not overloaded. This PR changes the thread pool allocation to give one task to every free thread pool thread before we start overloading them; when we do overload, we avoid indexes 0 and 1, as these are likely to contain blocking threads. This doesn't completely remove the problem, but ensures it won't happen under normal (i.e. un-overloaded) execution.
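A sketch of the index-selection rule with hypothetical names (the real allocation logic lives in the Executor and differs in detail):

#include <set>

// Sketch only: hypothetical helper illustrating the allocation rule.
int chooseThreadPoolIdx(const std::set<int>& freeIdxs,
                        int poolSize,
                        int overloadCounter)
{
    // Give one task to every free thread pool thread first
    if (!freeIdxs.empty()) {
        return *freeIdxs.begin();
    }

    // Overloaded: skip indexes 0 and 1, which are likely to hold parent
    // threads blocked waiting on their children
    if (poolSize <= 2) {
        return overloadCounter % poolSize;
    }
    return 2 + (overloadCounter % (poolSize - 2));
}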

Locking on non-existent groups on snapshot pushes - locking in the SnapshotServer only needs to happen on the master host when snapshot diffs are being returned. However, previously the SnapshotServer tried to lock on the group on any snapshot or snapshot diff push, causing an error when the group didn't exist (i.e. on non-master hosts). It now only happens on snapshot diff pushes when the group exists on that host.

Snapshot diffing inefficiency - previously this was going through all the dirty pages byte-by-byte, checking them against merge regions, and adding a default overwrite mode. Instead, it's much more efficient to only do merging in specified merge regions, and by default do no merging. This makes the code cleaner, and the diffing process more efficient.
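A rough sketch of the region-scoped approach (illustrative types and names, not the actual faabric diffing code):

#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative only: a merge region doubles as a recorded diff here
struct MergeRegion
{
    uint32_t offset;
    size_t size;
};

// Compare original and updated snapshot data, but only inside the
// registered merge regions; memory outside them produces no diffs
std::vector<MergeRegion> diffMergeRegions(
  const uint8_t* original,
  const uint8_t* updated,
  const std::vector<MergeRegion>& mergeRegions)
{
    std::vector<MergeRegion> diffs;
    for (const auto& region : mergeRegions) {
        if (std::memcmp(original + region.offset,
                        updated + region.offset,
                        region.size) != 0) {
            diffs.push_back(region);
        }
    }
    return diffs;
}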

Snapshot skipping missing diffs - previously, executors would only restore a snapshot once; however, this actually needs to happen on every execution, as the snapshot may have changed between executions. Ideally we'd apply the diffs directly to all the mapped regions rather than re-restore from the original snapshot, but this is left as a potential future optimisation.

@Shillaker Shillaker self-assigned this Nov 1, 2021
@Shillaker Shillaker changed the title Fixes for distributed locks Fixes for distributed coordination bugs Nov 2, 2021
void addDiffs(std::vector<SnapshotDiff>& diffs,
              const uint8_t* original,
              const uint8_t* updated);
};
@Shillaker (Collaborator, Author):

Moved this class down the file, as it now depends on the SnapshotDiff class declared above.

@Shillaker Shillaker changed the title Fixes for distributed coordination bugs Scheduler hints and fixes for distributed coordination bugs Nov 8, 2021
@@ -11,7 +11,7 @@ class PointToPointServer final : public MessageEndpointServer
     PointToPointServer();

   private:
-    PointToPointBroker& reg;
+    PointToPointBroker& broker;
@Shillaker (Collaborator, Author):

This used to be called a "registry", hence the reg name, but that was a bit confusing.

@@ -224,228 +223,339 @@ faabric::util::SchedulingDecision Scheduler::callFunctions(
        throw std::runtime_error("Message with no master host");
    }

    // Set up scheduling decision
@Shillaker (Collaborator, Author):

The diff of the callFunctions methods isn't particularly easy to parse; the full version is here.

@Shillaker Shillaker marked this pull request as ready for review November 9, 2021 15:43
@@ -500,84 +477,9 @@ TEST_CASE_METHOD(SnapshotMergeTestFixture,
     deallocatePages(snap.data, snapPages);
 }

-TEST_CASE_METHOD(SnapshotMergeTestFixture, "Test cross-page ignores", "[util]")
@Shillaker (Collaborator, Author):

Ignores have been removed.

@@ -430,94 +448,6 @@ TEST_CASE_METHOD(TestExecutorFixture,
     }
 }

-TEST_CASE_METHOD(TestExecutorFixture,
-                 "Test executing remote chained threads",
-                 "[executor]")
@Shillaker (Collaborator, Author):

This test is fiddly and fragile, as it tries to recreate a remote execution. Everything it checks is now covered by the distributed tests, which are much cleaner, so I've removed it.

@csegarragonz (Collaborator) left a comment:

LGTM, just a couple minor points

// If any tasks are blocking we risk a deadlock, and can no longer
// guarantee the application will finish.
// In general if we're on the master host and this is a thread, we
// should avoid the zeroth and first pool threads as they are likely
@csegarragonz (Collaborator):

I have read about this in the PR description, and this seems to me a risky game to be playing.

How likely, i.e. when, would such an overload happen?

@Shillaker (Collaborator, Author) replied Nov 10, 2021:

Overloads like this would only happen when there aren't enough resources available, so they're unlikely. When the system is overloaded it can either keep accepting functions and do its best to execute them, or start rejecting requests with an error. Faabric does the former and starts queueing, as this would then trigger the underlying cluster to scale out in a "real" deployment.

This specific bit of code relates only to threading, i.e. when an application spawns more threads than there are cores in the system. Well-written multi-threaded applications ought to request the level of parallelism available in the environment, at which point the system can specify an appropriate limit that avoids this scenario (as is the case with OpenMP).

This behaviour is covered in a few tests, so although the code shouldn't be triggered in a real deployment, it is still tested.

src/scheduler/Scheduler.cpp: review comment resolved (outdated)
src/transport/PointToPointBroker.cpp: review comment resolved
@Shillaker Shillaker merged commit 990c640 into master Nov 10, 2021
@Shillaker Shillaker deleted the dist-lock-fixes branch November 10, 2021 16:58