
Different async mechanism: MPI message buffer #111

Merged: csegarragonz merged 8 commits into master from poller on Jun 16, 2021
Conversation

@csegarragonz csegarragonz (Collaborator) commented Jun 10, 2021

Problem

The current thread pool is shared among all ranks. Thus, any thread in the pool may perform send/recv on behalf of any other rank on the host. This has several problems:

  1. It increases the number of thread-local endpoint arrays.
  2. It throws EADDRINUSE errors when the main thread and a pool-worker thread try to recv from the same rank (into the same rank as well).
  3. In general, it breaks the point-to-point thread-local semantics desirable for MPI.
  4. Lastly, send/recv are very lightweight methods; one could argue that running them in separate threads is overkill.

In line 327 of the linked diff I introduce a test that breaks the current semantics.

Solution

Remove the thread pool. Keep track of the async messages we expect to receive per (sendRank, recvRank) pair, and perform batch reads in blocking calls.

An initial idea was to rely on ZeroMQ's asynchronous API, in particular on the poll call, which blocks until at least one message has been received (but does not actually call recv), i.e. it polls the underlying receive buffers.
After having thought it through, I think we don't need polling (which is used to multiplex input sockets), but just need to recv enough times according to our protocol.

We assume that the MPI application expects to recv messages in the order it issues recv/irecvs. That is, if the application issues irecv, irecv, and recv, then the third message entering our network buffer will correspond to the recv, and not to either of the irecvs, even in this case:

/* SENDER */
send(msg1);
send(msg2);
send(msg3);

/* RECEIVER */
int id1 = irecv();
int id2 = irecv();

// third message in buffer
msg_t msg3 = recv();

// second message in buffer
msg_t msg2 = await(id2);
// first message in buffer
msg_t msg1 = await(id1);

This is the case for OpenMPI when no tags are used (tested); note that we don't support tags.

Detail on the implementation

We introduce a thread-local unacknowledged message buffer (UMB). There is one per rank-to-rank channel local to the host; each host has localRanks * worldSize such channels.

It stores messages that have been received, but not claimed, or claimed but not received.

The struct prototype is:

struct UMB {
    // List of claims (ids returned by irecv, in issue order)
    std::list<int> ids;
    // List of messages received but not yet claimed
    std::list<msg_t> msgs;

    // How many claims have no matching message yet
    int pendingIds() {
        return std::max<int>(0, (int)ids.size() - (int)msgs.size());
    }
};
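For illustration, a hedged sketch of how a (sendRank, recvRank) pair could map to a flat index into the per-host array of channels; the helper name and the row-major formula are assumptions, the diffs below only show the size * size bound check:

// Hypothetical row-major mapping from a rank pair to a flat channel index
int getIndexForRanks(int sendRank, int recvRank)
{
    int index = sendRank * worldSize + recvRank;
    assert(index >= 0 && index < worldSize * worldSize);
    return index;
}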

There are three methods that modify the UMB: irecv -> id, recv -> msg, and await(id) -> msg. All are executed by the same thread, so we need no locks. I include the pseudo-code for each:

  1. irecv: receive a message asynchronously.
int irecv()
{
    int id = generateId();
    UMB.ids.push_back(id);
    return id;
}
  2. recv: receive a message synchronously (i.e. block until it arrives)
msg_t recv()
{
    return recvBatchReturnLast(UMB.pendingIds() + 1);
}
  3. await(id): wait until the irecv with this id has finished (i.e. its msg has arrived)
msg_t await(id)
{
    // An await always comes after the matching irecv, so the id must be there
    auto idIt = std::find(UMB.ids.begin(), UMB.ids.end(), id);
    assert(idIt != UMB.ids.end());
    int index = std::distance(UMB.ids.begin(), idIt);

    // Awaits may happen in a different order than `irecv`s; that's why we
    // can't use queues in the UMB in the first place, as here we may need to
    // pop an element from the middle of the list
    msg_t toReturn;
    if (index < (int)UMB.msgs.size()) {
        // The message has already arrived in a previous batch read
        auto msgIt = UMB.msgs.begin();
        std::advance(msgIt, index);
        toReturn = *msgIt;
        UMB.msgs.erase(msgIt);
    } else {
        // Receive enough messages to get to our index
        toReturn = recvBatchReturnLast(index - (int)UMB.msgs.size() + 1);
    }

    // Remove the claim from the list of ids
    UMB.ids.erase(idIt);
    return toReturn;
}

Lastly, here's the auxiliary method that receives a given number of messages from our inbound channel (the in-memory queue for local messages, or the ZeroMQ socket for remote ones) and returns the last one:

msg_t recvBatchReturnLast(int numMsgToRecv)
{
    // First we receive all messages for which there is an id but no msg,
    // i.e. `irecv`s that happened before our `recv`/`await`
    for (int i = 0; i < numMsgToRecv - 1; i++) {
        msg_t msg = UMB.isLocal ? localQueue.dequeue() : socket.recv();
        UMB.msgs.push_back(msg);
    }

    // This is the message we are interested in returning, so we don't add it
    // to the UMB. Note that there may be other messages ready to be
    // acknowledged in the underlying buffer, but eagerly adding them to the
    // UMB could pull in messages that correspond to a standard `recv` rather
    // than an `irecv`.
    msg_t toReturn = UMB.isLocal ? localQueue.dequeue() : socket.recv();

    return toReturn;
}
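To tie the pieces together, here is how the sender/receiver example from the top plays out against the UMB (a worked trace of the pseudo-code above, not part of the diff):

// Receiver side, starting from an empty UMB (ids = [], msgs = [])
int id1 = irecv();       // ids = [id1],      msgs = []
int id2 = irecv();       // ids = [id1, id2], msgs = []

msg_t msg3 = recv();     // pendingIds() == 2, so recvBatchReturnLast(3):
                         // msg1 and msg2 go into the UMB, msg3 is returned
                         // ids = [id1, id2], msgs = [msg1, msg2]

msg_t msg2 = await(id2); // id2 is at index 1 and msgs[1] has already arrived:
                         // return msg2 and erase both entries
                         // ids = [id1], msgs = [msg1]

msg_t msg1 = await(id1); // same for index 0; the UMB is empty again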

Possible pitfalls

  1. Is this slowing our fast path down?
  2. Can a recv get stuck behind an irecv whose message never arrives?
    • No. According to our assumptions, the sends matching those irecvs must have been issued before the send matching the recv, given that the recv is issued after the irecvs.

@csegarragonz csegarragonz self-assigned this Jun 10, 2021
@csegarragonz csegarragonz added the bug, enhancement, help wanted, and mpi labels Jun 10, 2021
@csegarragonz csegarragonz changed the title from Poller to Different async mechanism Jun 10, 2021
@Shillaker (Collaborator)

This looks great, very well thought out. I also like the iteration to avoid 0MQ's async API; the simpler we make it, the better.

Re. slowing down the fast path, it's probably much of a muchness; given that this implementation is much nicer (and actually works), I wouldn't let that stop us. We can do some benchmarking once it's done as well.

functionCallClients;
/* Each MPI rank runs in a separate thread, however they interact with faabric
* as a library. Thus, we use thread_local storage to guarantee that each rank
* sees its own version of these data structures.
Collaborator

I don't quite understand this comment, what do you mean "they interact with faabric as a library"?

Collaborator Author

Really the comment is too verbose and poorly phrased, will rephrase.

assert(index >= 0 && index < size * size);

// Lazily initialise send endpoints
if (mpiMessageEndpoints[index] == nullptr) {
Collaborator

Will these definitely be nullptr when uninitialised? If the condition for being uninitialised is that they are nullptr, I would usually explicitly set that in the constructor; however, it might be overkill and I'm not sure what the spec says.

Collaborator Author

(Currently going through the comments in #105, but this applies as well)

And yes, this will definitely be nullptr when uninitialised, as we explicitly emplace_back a nullptr at line 45.
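A minimal sketch of the lazy-initialisation pattern under discussion, assuming mpiMessageEndpoints is pre-filled with nullptr entries; the endpoint type, the sending method names, and the send call are assumptions, while initRemoteMpiEndpoint appears in the diff further down:

// The endpoint vector is pre-filled with size * size null shared pointers,
// so a null check is a valid "not yet initialised" test
std::vector<std::shared_ptr<MpiMessageEndpoint>> mpiMessageEndpoints;

void MpiWorld::sendRemoteMpiMessage(int sendRank,
                                    int recvRank,
                                    const std::shared_ptr<faabric::MPIMessage>& msg)
{
    int index = getIndexForRanks(sendRank, recvRank);
    assert(index >= 0 && index < size * size);

    // Lazily initialise the send endpoint the first time the channel is used
    if (mpiMessageEndpoints[index] == nullptr) {
        initRemoteMpiEndpoint(sendRank, recvRank);
    }

    mpiMessageEndpoints[index]->sendMpiMessage(msg);
}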

for (int i = 0; i < size * size; i++) {
unackedMessageBuffers.emplace_back(nullptr);
}
}
Collaborator

Can we do unackedMessageBuffers = std::vector(size * size, nullptr) rather than a loop here?

Collaborator Author

In this case yes, as we are using shared pointers. However, I will use resize as I think it fits better, i.e. unackedMessageBuffers.resize(size * size, nullptr).
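For reference, a minimal sketch of the alternatives discussed, assuming unackedMessageBuffers is a std::vector of shared pointers to MpiMessageBuffer (the element type is an assumption based on the header mentioned below):

// Original: push size * size null entries one at a time
for (int i = 0; i < size * size; i++) {
    unackedMessageBuffers.emplace_back(nullptr);
}

// Reviewer's suggestion, spelt out with an explicit element type
unackedMessageBuffers =
  std::vector<std::shared_ptr<MpiMessageBuffer>>(size * size, nullptr);

// Chosen alternative: resize fills the new slots with nullptr
unackedMessageBuffers.resize(size * size, nullptr);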

faabric_datatype_t* dataType;
int count;
faabric::MPIMessage::MPIMessageType messageType;
};
Collaborator

If we put the shared pointer to the message into this struct, could we get rid of one of the std::lists in this class?

Collaborator Author

Yes; in fact, after some deep refactoring I realized we could just as well use a single list (which makes iterator management much easier).
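A hypothetical sketch of that single-list shape; the field names are lifted from the snippet above, but the struct and list names are assumptions, not the final code:

// Each entry carries both the claim (request id) and the message pointer,
// which stays null until the message has actually been acknowledged
struct PendingAsyncMpiMessage {
    int requestId = -1;
    std::shared_ptr<faabric::MPIMessage> msg = nullptr;
    faabric_datatype_t* dataType = nullptr;
    int count = -1;
    faabric::MPIMessage::MPIMessageType messageType;

    bool isAcknowledged() { return msg != nullptr; }
};

// A single list per (sendRank, recvRank) channel replaces the two UMB lists
std::list<PendingAsyncMpiMessage> pendingMsgs;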

@Shillaker (Collaborator)

Ah sorry I got confused and reviewed this when you had actually requested a review on the other one 🤦

@@ -366,4 +362,187 @@ TEST_CASE_METHOD(RemoteCollectiveTestFixture,
senderThread.join();
localWorld.destroy();
}

TEST_CASE_METHOD(RemoteMpiTestFixture,
"Test sending sync and async message to same host",
Collaborator Author

This is the test that exposed the flaws in the thread pool (and would make the current master fail).

@csegarragonz csegarragonz changed the title from Different async mechanism to Different async mechanism: MPI message buffer Jun 15, 2021
@Shillaker Shillaker (Collaborator) left a comment

Nice, this is looking good; just a few style/structure changes.

SPDLOG_TRACE("MPI - pending recv {} -> {}", sendRank, recvRank);
auto _m = getLocalQueue(sendRank, recvRank)->dequeue();

assert(_m != nullptr);
Collaborator

Sorry, I should have been clearer. I think it would be more useful to put the check in the enqueue method, as that will catch the issue at the source of the problem (i.e. when it's enqueued) rather than here.
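A minimal sketch of that suggestion; the wrapper name enqueueMessage is hypothetical, while getLocalQueue is taken from the diff above:

// Check at the source: a null message should never be enqueued, so asserting
// here flags the bug where it is introduced rather than at every dequeue site
void MpiWorld::enqueueMessage(int sendRank,
                              int recvRank,
                              std::shared_ptr<faabric::MPIMessage> msg)
{
    assert(msg != nullptr);
    getLocalQueue(sendRank, recvRank)->enqueue(std::move(msg));
}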

}
}
unackedMessageBuffers.clear();
}
Collaborator

It would be good to add a couple of small tests to make sure these destroy checks work, e.g. call isend a few times, then call destroy and REQUIRE_THROWS.

Collaborator Author

Thanks for pointing this out; I had actually missed checking whether iSendRequest was empty at destruction time.
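A hedged sketch of the suggested test; the fixture name, the world member, the isend signature, and the BYTES macro are assumptions about the surrounding test code:

TEST_CASE_METHOD(MpiTestFixture,
                 "Test destroying world with outstanding async requests throws",
                 "[mpi]")
{
    std::vector<int> data = { 0, 1, 2 };

    // Issue a few asynchronous sends that are never awaited
    world.isend(0, 1, BYTES(data.data()), MPI_INT, data.size());
    world.isend(0, 1, BYTES(data.data()), MPI_INT, data.size());

    // Destroying the world with unacknowledged requests should now throw
    REQUIRE_THROWS(world.destroy());
}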

@csegarragonz csegarragonz removed the help wanted label Jun 15, 2021
@@ -212,15 +205,36 @@ class MpiWorld
void initLocalQueues();

// Rank-to-rank sockets for remote messaging
void initRemoteMpiEndpoint(int sendRank, int recvRank);
int getMpiPort(int sendRank, int recvRank);
std::vector<int> basePorts;
Collaborator Author

I have changed these to accommodate the new port offset per world.

@csegarragonz csegarragonz force-pushed the poller branch 3 times, most recently from a3287fd to bee0ef8, on June 16, 2021 at 11:45
@Shillaker Shillaker self-requested a review June 16, 2021 11:49
@@ -261,16 +262,51 @@ std::string MpiWorld::getHostForRank(int rank)
return host;
}

// Returns a pair (sendPort, recvPort)
Collaborator Author

Also worth checking this, which has changed since the last review.

Collaborator Author

(basically this commit)
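For illustration only, one way a per-world port offset could feed into the (sendPort, recvPort) pair; the names basePort and worldOffset and the arithmetic here are assumptions, not necessarily the scheme in the commit:

// Hypothetical sketch: each rank-to-rank channel gets its own pair of ports,
// shifted by a per-world offset so that concurrent worlds don't collide
std::pair<int, int> getPortsForRanks(int sendRank, int recvRank)
{
    int index = getIndexForRanks(sendRank, recvRank);
    int sendPort = basePort + worldOffset + 2 * index;
    int recvPort = sendPort + 1;
    return { sendPort, recvPort };
}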

@Shillaker Shillaker (Collaborator) left a comment

Nice, LGTM.

@csegarragonz csegarragonz merged commit b9da3c0 into master Jun 16, 2021
@csegarragonz csegarragonz deleted the poller branch June 16, 2021 13:07