Scheduler hints and fixes for distributed coordination bugs #169
Conversation
```cpp
    void addDiffs(std::vector<SnapshotDiff>& diffs,
                  const uint8_t* original,
                  const uint8_t* updated);
};
```
Moved this class down the file as it now depends on the `SnapshotDiff` class, which is declared above.
```diff
@@ -11,7 +11,7 @@ class PointToPointServer final : public MessageEndpointServer
     PointToPointServer();

   private:
-    PointToPointBroker& reg;
+    PointToPointBroker& broker;
```
This used to be called a "registry", hence the `reg` name, but it was a bit confusing.
```diff
@@ -224,228 +223,339 @@ faabric::util::SchedulingDecision Scheduler::callFunctions(
         throw std::runtime_error("Message with no master host");
     }

     // Set up scheduling decision
```
The diff of the `callFunctions` method isn't particularly easy to parse; the full version is here.
```diff
@@ -500,84 +477,9 @@ TEST_CASE_METHOD(SnapshotMergeTestFixture,
     deallocatePages(snap.data, snapPages);
 }

 TEST_CASE_METHOD(SnapshotMergeTestFixture, "Test cross-page ignores", "[util]")
```
Ignores have been removed
```diff
@@ -430,94 +448,6 @@ TEST_CASE_METHOD(TestExecutorFixture,
     }
 }

 TEST_CASE_METHOD(TestExecutorFixture,
                  "Test executing remote chained threads",
                  "[executor]")
```
This test is fiddly and fragile as it's trying to recreate a remote execution. Everything it checks is now covered in the distributed tests which are much cleaner, so I've removed it.
LGTM, just a couple minor points
```cpp
// If any tasks are blocking we risk a deadlock, and can no longer
// guarantee the application will finish.
// In general if we're on the master host and this is a thread, we
// should avoid the zeroth and first pool threads as they are likely
```
I have read about this in the PR description, and this seems to me a risky game to be playing.
How likely, i.e. when, would such an overload happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overloads like this would only happen when there aren't enough resources available, so they're unlikely. When the system is overloaded it can either keep accepting functions and do its best to execute them, or start rejecting requests with an error. Faabric does the former and starts queueing, as this would then trigger the underlying cluster to scale out in a "real" deployment.
This specific bit of code is only related to threading, i.e. when an application spawns more threads than there are cores in the system. Well written multi-threaded applications ought to request the level of parallelism available in the environment, at which point the system can specify an appropriate limit that will avoid this scenario (as is the case with OpenMP).
This behaviour is covered in a few tests so although the code shouldn't be triggered in a real deployment, it is still tested.
This is a mixed bag: all the changes were required to get a larger OpenMP application working, plus some tidy-up along the way.
Scheduler hints
At the moment "hints" are taken at face value by the scheduler (so they're more like scheduler plans). This is useful for a multithreaded application that repeatedly performs an operation with the same number of threads and doesn't need to keep repeating the scheduling decision. How applications decide to cache the scheduling decision is up to them. This change is relatively simple, and just involves adding one more scheduler method:
Where `hint` is the proposed scheduling decision. To implement this I just refactored the scheduling logic into two parts: (i) calculating the decision (only performed when a hint isn't passed); (ii) executing a decision (performed in both cases).

Snapshot diff whitelisting
Snapshot merge regions are now specified as a whitelist rather than a blacklist. This makes more sense, as we were adding a lot of logic around ignoring regions and doing default diffs, when in fact applications know which regions they want to merge. The "worst case" is a generic threaded application with no consistency guarantees that wants to merge the whole heap, in which case it can specify a single merge region covering it.
This rendered `Ignore` regions obsolete, so I removed them.

Bug fixes
- Deadlock and inconsistencies when waiting for point-to-point groups to be enabled: added more checks in `PointToPointBroker`, and the `FlagWaiter` class, which allows an arbitrary number of threads to wait on a boolean flag set by another (we can't use a barrier as we don't know in advance how many threads will need to wait).
- Nested thread groups deadlocking when scheduled on the same thread pool thread as their parent: if thread `A` spawns a nested group of threads `B1` and `B2`, we must ensure that neither `B1` nor `B2` is queued on the same thread pool thread as `A` (as `A` is dependent on the completion of `B1` and `B2`, but they wouldn't execute until `A` is finished). Previously the thread pool scheduling could allow this even when the `Executor` was not overloaded. This PR changes the thread pool allocation to give one task to every free thread pool thread before we start overloading them, and when we do overload, we avoid indexes `0` and `1` as these are likely to contain blocking threads. This doesn't completely remove the problem, but ensures it won't happen under normal (i.e. un-overloaded) execution.
- Locking on non-existent groups on snapshot pushes: locking in the `SnapshotServer` only needs to happen on the master host when snapshot diffs are being returned. However, previously the `SnapshotServer` tried to lock the group on any snapshot or snapshot diff push, causing an error when the group didn't exist (i.e. on non-master hosts). It now only happens on snapshot diff pushes when the group exists on that host.
- Snapshot diffing inefficiency: previously this was going through all the dirty pages byte-by-byte, checking them against merge regions, and adding a default overwrite mode. Instead, it's much more efficient to only do merging in specified merge regions, and by default do no merging. This makes the code cleaner and the diffing process more efficient.
- Snapshot skipping missing diffs: previously executors would only restore a snapshot once; however, this actually needs to be done every time, as the snapshot may have changed between executions. Ideally we'd apply the diffs directly to all the mapped regions rather than re-restore from the original snapshot, but this is left as a potential optimisation for the future.