
Add distributed coordination operations #161

Merged (22 commits) on Oct 29, 2021

Conversation

@Shillaker (Collaborator) commented Oct 25, 2021

Adds generic utilities to support distributed coordination primitives such as locks and barriers.

It works as follows:

  • Since distributed coordination is mostly point-to-point messaging, it happens through the PointToPointBroker class, which deals with PointToPointGroups, within which the functions in a group can coordinate with one another across hosts.
  • There may be zero to many groups per app, and the PointToPointBroker refers to each one by a groupId (integer).
  • Each function belongs to an app (with an appId and an optional appIdx, e.g. an MPI rank or OpenMP thread ID), and potentially to a group as well (with a groupId and a groupIdx). There may be many groups in a single app, so the groupIdx and appIdx are treated separately (although they may be set to the same value if there's only one group in the app).
  • The operations available to each group are: a lock, a barrier, and a no-wait barrier/notify (where one function on the master can wait for all the others to finish without the others being blocked).
  • Point-to-point groups are created by scheduling a batch of functions with groupId and groupIdx set on the underlying Messages. The scheduler uses this information to transparently set up the point-to-point mappings needed for this messaging.
  • The barrier and notify implementations use standard point-to-point messaging.
  • Locking from a remote host is done by sending a request to the PointToPointServer. When the lock has been successfully acquired, a corresponding point-to-point message is sent back to the group index that originally requested the lock. If the lock is requested locally, we do the same thing, just without the request to the remote PointToPointServer (a caller-side sketch of this follows the list).
  • Unlocking is a single async operation (i.e. without a response), implemented as another request to the PointToPointServer if remote.
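To make the lock flow concrete, here is a minimal caller-side sketch of that round trip. The interface below (PtpBroker, PtpClient, groupLock, recvMessage, GROUP_MASTER_IDX) is an assumption made for illustration, not the exact faabric API.

#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the broker/client involved; the real faabric
// classes and signatures may differ.
struct PtpBroker
{
    // Blocks until a message from sendIdx to recvIdx in this group arrives
    virtual std::vector<uint8_t> recvMessage(int groupId,
                                             int sendIdx,
                                             int recvIdx) = 0;
};

struct PtpClient
{
    // Async requests to the remote PointToPointServer that owns the lock
    virtual void groupLock(int groupId, int groupIdx) = 0;
    virtual void groupUnlock(int groupId, int groupIdx) = 0;
};

constexpr int GROUP_MASTER_IDX = 0;

// Caller-side view of acquiring the group lock
void acquireLock(PtpBroker& broker, PtpClient& client, int groupId, int groupIdx)
{
    // Ask the host that owns the lock for it (a purely local request would
    // skip this remote call and go straight to the local lock queue)
    client.groupLock(groupId, groupIdx);

    // Block until the "lock granted" point-to-point message comes back from
    // the group master to the index that requested the lock
    broker.recvMessage(groupId, GROUP_MASTER_IDX, groupIdx);
}

// Caller-side view of releasing the group lock: fire-and-forget, no response
void releaseLock(PtpClient& client, int groupId, int groupIdx)
{
    client.groupUnlock(groupId, groupIdx);
}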

I've added a group lock/unlock around writing snapshot diffs. Without it, one thread could overwrite regions of memory while another thread was in a critical section (see the sketch below).
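As a rough usage illustration of that lock/unlock bracketing, building on the hypothetical acquireLock/releaseLock sketch above (writeSnapshotDiffs is a placeholder, not the real faabric function):

void writeSnapshotDiffs();  // placeholder for the real diff-writing code

void writeDiffsSafely(PtpBroker& broker,
                      PtpClient& client,
                      int groupId,
                      int groupIdx)
{
    acquireLock(broker, client, groupId, groupIdx);  // blocks until granted

    // Critical section: while we hold the group lock, no other group member
    // can write overlapping regions of memory
    writeSnapshotDiffs();

    releaseLock(client, groupId, groupIdx);          // async, no response
}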

For posterity, this PR is a rewrite of #141

tests/dist/server.cpp — review thread resolved (outdated)
@Shillaker marked this pull request as ready for review on October 28, 2021 at 18:03
@csegarragonz (Collaborator) left a comment:

LGTM, very glad this is finally done 🎉

Just some minor comments.

Before merging though, could we do a faabric bump PR in faasm to make sure this does not break anything?

src/scheduler/CMakeLists.txt — review thread resolved
tests/test/transport/test_point_to_point_groups.cpp — review thread resolved (outdated)
nSums = 1000;
}

// Spawn n-1 child threads to add to shared sums over several barriers so
Collaborator commented:
several ?

@Shillaker (Collaborator, Author) replied Oct 29, 2021:

Not sure what you mean by quoting a single word... The test is running the sum operations in a loop, so it's invoking several barriers. Does that make sense?
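For context, the pattern the test follows is roughly the one below: a toy, not the actual faabric test, using std::barrier (C++20) in place of the group barrier; each loop iteration adds to the shared sum and then hits another barrier.

#include <atomic>
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    constexpr int nThreads = 4;
    constexpr int nLoops = 3;  // "several" barriers: one per loop iteration
    std::atomic<int> sharedSum{ 0 };
    std::barrier barrier(nThreads);

    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; t++) {
        threads.emplace_back([&, t] {
            for (int i = 0; i < nLoops; i++) {
                sharedSum.fetch_add(t + i);
                barrier.arrive_and_wait();  // all threads sync every round
            }
        });
    }

    for (auto& th : threads) {
        th.join();
    }

    std::printf("sharedSum = %d\n", sharedSum.load());
    return 0;
}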

src/transport/PointToPointBroker.cpp — review thread resolved (outdated)
{
std::vector<uint8_t> data(1, 0);

ptpBroker.sendMessage(groupId, 0, groupIdx, data.data(), data.size());
Collaborator commented:
We always assume the master's index to be 0, right? Maybe it would make the code more readable if this value were defined somewhere. It is sometimes hard to understand that the 0 corresponds to the master index.

@Shillaker (Collaborator, Author) replied Oct 29, 2021:

Yes, good point, we frequently hard-code this zero in the MPI code too, i.e. if(rank == 0) { // Do stuff for master }, e.g. https://github.com/faasm/faabric/blob/master/src/scheduler/MpiWorld.cpp#L1216

I'll change it for the ptp stuff, but it would be good to switch the MPI code over at some point too.
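For illustration, the kind of change being suggested would look roughly like this (the constant name is just an example, not necessarily what was merged):

// Hypothetical named constant so the magic 0 is self-documenting
constexpr int POINT_TO_POINT_MASTER_IDX = 0;

// The call from the snippet above would then read:
ptpBroker.sendMessage(
  groupId, POINT_TO_POINT_MASTER_IDX, groupIdx, data.data(), data.size());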

@Shillaker (Collaborator, Author):

Faasm PR for checking this: faasm/faasm#531
