
State-less MPI rank management #103

Merged 7 commits from barrierless into master on Jun 7, 2021
Conversation

@csegarragonz (Collaborator) commented on Jun 4, 2021

In this PR I remove the dependency on state for the rank-host map bookkeeping. This was something on my to-do list, as it really wasn't needed. Even though this does not have a direct impact on performance, it opens the door to rank-to-rank communication, as I elaborate below.

Before, upon MPI_Init all ranks would halt at a barrier. Non-zero ranks would receive a broadcast message from the root rank (0) and bind to their local queue, not knowing which host the message came from. Only when trying to send a response afterwards would they query Redis for the host where the root rank was placed. If we were to query Redis before the first recv (to, for instance, receive on a per-rank port), we would be racing the root rank and its update of the rank-host map.

Given the socket-like tools introduced with ZeroMQ, we can just send the map around to the hosts that need it. This is greatly facilitated by the fact that the scheduler returns the list of hosts where it has scheduled the calls.

The only part of the API that still needs state is the RMA calls, which I haven't worked with at all, but I'm quite keen to do so when they are needed.
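To make the new flow concrete, here is a minimal, self-contained sketch of the idea (the names and types are illustrative stand-ins, not faabric's actual API): rank 0 builds the rank-to-host map from the host list the scheduler returns, then pushes it once to each remote host over a point-to-point (ZeroMQ-style) send, with no Redis-backed state involved.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative only: stand-ins for faabric's real scheduler/transport types
using RankHostMap = std::map<int, std::string>;

// Rank 0 can build the whole map locally because the scheduler already
// returns the host chosen for each scheduled call (i.e. for each rank)
RankHostMap buildRankHostMap(const std::vector<std::string>& scheduledHosts)
{
    RankHostMap rankHostMap;
    for (size_t rank = 0; rank < scheduledHosts.size(); rank++) {
        rankHostMap.emplace((int)rank, scheduledHosts.at(rank));
    }
    return rankHostMap;
}

// Stand-in for a ZeroMQ send of the serialised map to a remote host's MPI port
void sendMapToHost(const std::string& host, const RankHostMap& map)
{
    std::cout << "would send map (" << map.size() << " ranks) to " << host
              << std::endl;
}

int main()
{
    // Hosts returned by the scheduler for a hypothetical 4-rank world
    std::vector<std::string> hosts = { "hostA", "hostA", "hostB", "hostB" };
    RankHostMap rankHostMap = buildRankHostMap(hosts);

    // Push the map once to each remote host; non-zero ranks then know every
    // rank's host before their first recv, so there is no race against a
    // shared-state (Redis) update
    const std::string thisHost = "hostA";
    std::set<std::string> remoteHosts(hosts.begin(), hosts.end());
    remoteHosts.erase(thisHost);
    for (const auto& host : remoteHosts) {
        sendMapToHost(host, rankHostMap);
    }
    return 0;
}
```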

RIP registerRank 💀

@csegarragonz csegarragonz self-assigned this Jun 4, 2021
@csegarragonz csegarragonz marked this pull request as draft June 4, 2021 17:12
@csegarragonz csegarragonz added the enhancement (New feature or request) and mpi (Related to the MPI implementation) labels Jun 4, 2021
@csegarragonz csegarragonz marked this pull request as ready for review June 6, 2021 09:59
@@ -0,0 +1,13 @@
#pragma once
@csegarragonz (Collaborator, Author):

Currently, the implementation of these is simple enough that they are fine as just abstract methods. However, I thought they belonged in transport rather than scheduler.

src/proto/faabric.proto (outdated, resolved)
const std::string actualHost = world.getHostForRank(1);
REQUIRE(actualHost == expectedHost);
@csegarragonz (Collaborator, Author):

We weren't checking anything here.

@csegarragonz csegarragonz force-pushed the barrierless branch 2 times, most recently from 26d57ac to 63d7bd0 on June 6, 2021 12:35
@Shillaker (Collaborator) left a comment:

This is great; the use of state for MPI has always been very fiddly.

A few style/naming/structure comments, but nothing fundamental.

src/scheduler/MpiContext.cpp (outdated, resolved)
include/faabric/scheduler/MpiWorld.h (outdated, resolved)
include/faabric/scheduler/MpiContext.h (outdated, resolved)
@@ -8,3 +8,5 @@
#define MPI_MESSAGE_PORT 8005
#define SNAPSHOT_PORT 8006
#define REPLY_PORT_OFFSET 100

#define MPI_PORT 8800
@Shillaker (Collaborator) commented on Jun 7, 2021:

How is MPI_PORT different from the existing MPI_MESSAGE_PORT, and why is it so much higher in the port range than the others? It would be cleanest to keep the range of ports we use as narrow as possible (e.g. could this just be 8007?).

@csegarragonz (Collaborator, Author):

After our offline discussion, I'll leave the port like this as it will change in coming PRs.

src/proto/faabric.proto (outdated, resolved)
src/scheduler/MpiWorld.cpp (outdated, resolved)
logger->error("Error emplacing in rankHostMap: {} -> {}",
i + 1,
rankHostVec.at(i));
throw std::runtime_error("Error emplacing in rankHostMap");
@Shillaker (Collaborator):

I'm not sure I understand the use of this pattern with try_emplace instead of just emplace. What situation is it guarding against? It's adding 5 lines of error handling code, and even when it fails the message isn't particularly clear. Much like avoiding variable names in comments, it's not good to include them in error messages, as this will be even more confusing if the variable name changes.

@csegarragonz (Collaborator, Author):

try_emplace might indeed be overkill here. It does the same as emplace (and has the same return value), but the element is not constructed if the key is already there (again, not important with strings and ints). In short, emplace and try_emplace guard against the same behaviour: trying to insert a key that already exists.

The error message was not clear at all.
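For reference, a small standalone example of the point above (illustrative only, not the PR's code): with an int-to-string map, emplace and try_emplace behave the same way when the key already exists, and both report the failed insertion through the bool in the returned pair.

```cpp
#include <cassert>
#include <map>
#include <string>

int main()
{
    std::map<int, std::string> rankHostMap;

    // First insertion succeeds
    bool inserted = rankHostMap.emplace(1, "hostA").second;
    assert(inserted);

    // Key 1 already exists: neither call overwrites the existing entry,
    // and both return false as the second element of the pair
    bool emplacedAgain = rankHostMap.emplace(1, "hostB").second;
    bool triedAgain = rankHostMap.try_emplace(1, "hostB").second;
    assert(!emplacedAgain && !triedAgain);
    assert(rankHostMap.at(1) == "hostA");

    // The only difference: try_emplace guarantees the mapped value is not
    // constructed when the key is already present, which matters for
    // expensive-to-construct values but not for strings and ints
    return 0;
}
```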

@Shillaker (Collaborator):

Yes, I understand what try_emplace does, but what situation is it guarding against here? It seems like this will only error if we're setting a value for a rank that has already been set. AFAICT this won't happen, as it's only called once in MPI initialisation, right? Even if it did somehow happen (e.g. if someone changes the code in future), it's probably not worthy of an error; at most a warning to say that it's getting reset. Either way, I think just using emplace is fine and much clearer.

tests/test/scheduler/test_function_client_server.cpp (outdated, resolved)
@@ -1051,16 +1058,13 @@ template void doReduceTest<double>(scheduler::MpiWorld& world,

TEST_CASE("Test reduce", "[mpi]")
{
cleanFaabric();
@Shillaker (Collaborator):

It would be nice to add a fixture to this file, as we also have a custom tearDown method. One for another PR, probably.
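A rough sketch of what that fixture could look like, assuming the tests use Catch2 (MpiTestFixture is a hypothetical name, and cleanFaabric and the teardown logic are stand-ins for the helpers already in the test file):

```cpp
#include <catch2/catch.hpp> // include path may differ in the repo

// Stub standing in for the helper already present in the test file
void cleanFaabric() {}

// Hypothetical fixture: setup runs in the constructor and the custom
// teardown in the destructor, so each TEST_CASE no longer needs an
// explicit cleanFaabric() call at the top
class MpiTestFixture
{
  public:
    MpiTestFixture() { cleanFaabric(); }

    ~MpiTestFixture()
    {
        // the file's existing custom tearDown logic would run here
    }
};

TEST_CASE_METHOD(MpiTestFixture, "Test reduce", "[mpi]")
{
    // test body unchanged, minus the explicit cleanFaabric()
    REQUIRE(true);
}
```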

@Shillaker (Collaborator) left a comment:

Nice one 👍

@csegarragonz csegarragonz merged commit c9716ba into master Jun 7, 2021
@csegarragonz csegarragonz deleted the barrierless branch June 7, 2021 14:00