Queue-less remote send/recv #105

Merged (7 commits) on Jun 15, 2021
Conversation

@csegarragonz (Collaborator) commented Jun 8, 2021

In this PR I introduce rank-to-rank messaging, bypassing the central message endpoint server.

Sending and receiving messages is in the hot path for most workloads we run, so we want to introduce as little contention as possible. A low-hanging fruit is to allow ranks to open messaging endpoints themselves.

As discussed offline, this exposes a bug not covered in the tests, for which I provide a solution in #111.

@csegarragonz self-assigned this on Jun 8, 2021
@csegarragonz added the enhancement (New feature or request) and mpi (Related to the MPI implementation) labels on Jun 8, 2021
@csegarragonz force-pushed the queueless-v2 branch 2 times, most recently from 6c5b169 to a283e83 on June 9, 2021 09:56
@csegarragonz marked this pull request as ready for review on June 9, 2021 09:57
@csegarragonz marked this pull request as draft on June 9, 2021 10:07
const std::shared_ptr<faabric::MPIMessage>& msg)
{
    // TODO - is this lazy init very expensive?
    if (sendMessageEndpoint.socket == nullptr) {
@csegarragonz (Collaborator, Author) commented Jun 9, 2021
In normal circumstances, I would open both sendMessageEndpoint and recvMessageEndpoint in the constructor. However, this is very problematic for testing, because:

  1. mpiMessageEndpoints are static thread_local, so we would need to create different worlds in separate threads.
  2. If both send and recv sockets are opened in the constructor, both recv sockets will try to bind to the same address + port (they are in the same network namespace), yielding an error, even though in the simplest scenario we only use one of them.

Not that I believe branching could be the bottleneck here, but this is a case where testing influences implementation. Is this what TDD looks like in practice? 🤣
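
Point 2 can be reproduced in isolation with two ZeroMQ sockets; a minimal sketch, assuming cppzmq (in the spirit of faabric's transport layer, not its actual wrapper classes) and a made-up port:

```cpp
#include <iostream>
#include <zmq.hpp>

int main()
{
    zmq::context_t ctx(1);

    // The first recv socket binds fine
    zmq::socket_t recvA(ctx, zmq::socket_type::pull);
    recvA.bind("tcp://127.0.0.1:10800");

    // A second socket binding the same address + port in the same network
    // namespace fails with "Address already in use"
    zmq::socket_t recvB(ctx, zmq::socket_type::pull);
    try {
        recvB.bind("tcp://127.0.0.1:10800");
    } catch (const zmq::error_t& e) {
        std::cout << "Second bind failed: " << e.what() << std::endl;
    }

    return 0;
}
```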

@Shillaker (Collaborator) commented Jun 14, 2021
If the implementation relies on TLS then unfortunately the tests should exercise that TLS. As you say, this means we'd need to create worlds in different threads.

Re. the ports and testing, the solution is probably to make the base port configurable, then change the value in tests. In this case, each world could take a port offset as a constructor argument, set to the value we currently use by default, but overridden in tests.

Making this configurable arguably makes these classes more flexible and reusable so I'd say it's a +1 for testing!
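
A minimal sketch of that suggestion; the class shape, default constant, and port arithmetic below are illustrative assumptions, not the actual faabric interface:

```cpp
// Assumed default; tests would pass a different base to avoid clashes
static constexpr int DEFAULT_MPI_BASE_PORT = 10800;

class MpiWorld
{
  public:
    // The base port is a constructor argument so several worlds can coexist
    // in one process (e.g. in tests sharing a network namespace)
    explicit MpiWorld(int basePortIn = DEFAULT_MPI_BASE_PORT)
      : basePort(basePortIn)
    {}

    int portForRankPair(int sendRank, int recvRank, int worldSize) const
    {
        // One port per (sendRank, recvRank) pair, offset from the base
        return basePort + sendRank * worldSize + recvRank;
    }

  private:
    int basePort;
};
```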

@csegarragonz (Collaborator, Author) commented Jun 14, 2021
As discussed offline, we'll first try binding to a specific (protocol, address, port) tuple, and leave port management until the multi-tenancy PR is underway.
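
For illustration, in that interim scheme the recv side binds and the send side connects to the agreed tuple; a sketch assuming cppzmq and made-up values:

```cpp
#include <zmq.hpp>

int main()
{
    zmq::context_t ctx(1);

    // The receiving rank binds to a fixed, well-known tuple
    zmq::socket_t recvSocket(ctx, zmq::socket_type::pull);
    recvSocket.bind("tcp://0.0.0.0:10800");

    // The sending rank connects to the receiver's host using the same tuple
    zmq::socket_t sendSocket(ctx, zmq::socket_type::push);
    sendSocket.connect("tcp://127.0.0.1:10800");

    return 0;
}
```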

@csegarragonz force-pushed the queueless-v2 branch 2 times, most recently from 63706a4 to cdaf39e on June 11, 2021 10:53
src/scheduler/MpiWorld.cpp (outdated review thread, resolved)
// Initialise the endpoint vector if not initialised
if (mpiMessageEndpoints.size() == 0) {
    for (int i = 0; i < size * size; i++) {
        mpiMessageEndpoints.emplace_back(nullptr);
    }
@Shillaker (Collaborator) commented:

I can see you've ported these changes into the other PR which I reviewed accidentally (#111), so the comments apply to both.

Relevant comments from the other PR:

a) Rather than spread it around, I think lazy initialisation for the rank should all happen in one place.
b) I think you can do mpiMessageEndpoints = std::vector(size * size, nullptr) and save a few lines.

@csegarragonz (Collaborator, Author) commented Jun 14, 2021
a) Indeed.
b) mpiMessageEndpoints is a vector of std::unique_ptr, whose copy operations are deleted, so I need to emplace_back nullptrs instead. Note that this also answers your concern in #111 w.r.t. mpiMessageEndpoints[index] actually being initialised to nullptr.
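
A small self-contained illustration of that constraint: std::unique_ptr is move-only, so the vector fill constructor (which copies its value argument) does not compile, while emplace_back(nullptr) constructs each element in place.

```cpp
#include <memory>
#include <vector>

int main()
{
    int size = 4;
    std::vector<std::unique_ptr<int>> endpoints;

    // std::vector<std::unique_ptr<int>> endpoints(size * size, nullptr);
    // does not compile: the fill constructor needs to copy the value

    // Growing element by element works: each nullptr constructs a fresh
    // unique_ptr in place
    for (int i = 0; i < size * size; i++) {
        endpoints.emplace_back(nullptr);
    }

    return 0;
}
```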

@Shillaker (Collaborator) commented:

Ah that's annoying, makes sense though.

// Get host for recv rank
std::string host = getHostForRank(recvRank);
assert(!host.empty());
assert(host != thisHost);
@Shillaker (Collaborator) commented:

Same comment re. assertions and invalid ranks.

@csegarragonz (Collaborator, Author) commented:

I think the first assert can go, as we actually check this in getHostForRank, but the latter relates to logic particular to this method, i.e. if we send to a remote rank, the receiving rank must actually be on a remote host.

src/transport/MpiMessageEndpoint.cpp (review thread, resolved)
}

TEST_CASE_METHOD(RemoteMpiTestFixture,
                 "Test collective messaging across hosts",
@csegarragonz (Collaborator, Author) commented:

These tests weren't using two different threads (not sure why they were passing). Given that I had to change them, I also separated them into smaller tests, so it is easier to identify which one fails.
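
For illustration, the reshaped tests take roughly this form; the test name, tag, and thread layout are assumptions, not the exact code in this PR:

```cpp
#include <thread>

// Catch2 and the fixture header are assumed to be pulled in by the test build
TEST_CASE_METHOD(RemoteMpiTestFixture,
                 "Test send and recv across hosts",
                 "[mpi]")
{
    // The remote world runs in its own thread so that the static
    // thread_local endpoints are genuinely separate
    std::thread otherWorldThread([&] {
        // Set up the remote world and do its half of the exchange here
    });

    // Do this host's half of the exchange on the main test thread

    if (otherWorldThread.joinable()) {
        otherWorldThread.join();
    }
}
```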

@@ -117,6 +118,19 @@ class ConfTestFixture
faabric::util::SystemConfig& conf;
};

class MessageContextFixture : public SchedulerTestFixture
@csegarragonz (Collaborator, Author) commented:

Had to move this here to use it in the MpiMessageEndpoint tests.
