
Fix communication for large message sizes #131

Closed

Conversation

@masterleinad (Collaborator)

For

`mpiexec -np 10 ./ArborX_DistributedTree.exe --values=100000000 --queries=10000000 --neighbors=40`

I was getting an MPI error telling me that the message size was too large. This pull request splits the message into multiple smaller ones in case we need to send more than std::numeric_limits<int>::max() bytes.
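For illustration, a minimal sketch of the chunking idea, not the actual PR diff; the function name, tag handling, and request bookkeeping here are assumptions:

```cpp
#include <mpi.h>

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Sketch: post non-blocking sends for `size` bytes to `dest`, splitting the
// buffer into chunks that each fit into the `int` count MPI expects.
void isend_in_chunks(void const *buffer, std::size_t size, int dest, int tag,
                     MPI_Comm comm, std::vector<MPI_Request> &requests)
{
  auto const max_chunk =
      static_cast<std::size_t>(std::numeric_limits<int>::max());
  auto const *ptr = static_cast<char const *>(buffer);
  std::size_t offset = 0;
  while (offset < size)
  {
    int const count = static_cast<int>(std::min(max_chunk, size - offset));
    requests.emplace_back();
    // Same (dest, tag) for every chunk: MPI preserves ordering between a pair
    // of ranks for matching signatures, so the receiver can post the
    // corresponding chunked receives in the same order.
    MPI_Isend(ptr + offset, count, MPI_BYTE, dest, tag, comm, &requests.back());
    offset += static_cast<std::size_t>(count);
  }
}
```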

@aprokop (Contributor) left a comment

I disagree with the intent of this PR. I think we should just throw in this situation, and not allow things to continue. If a user's code is trying to send a message that large, we should advocate for rethinking their approach.

@sslattery (Contributor) commented Oct 7, 2019

I tend to agree with @aprokop - that seems to be a ton of communication. I say we throw for now, and if we arrive at a legitimate case for this in the future we can revisit and find an optimal strategy for those cases. @dalg24 @Rombur, do you agree?

@sslattery (Contributor)

If you guys did want to do this, however, I would advocate for a more efficient asynchronous strategy for pack-unpack of multiple messages to avoid some latency costs.

dalg24 previously approved these changes Oct 7, 2019

@dalg24 (Contributor) left a comment

Please add a comment noting that you are splitting the message into chunks when it exceeds the maximum size.

@dalg24 (Contributor) commented Oct 7, 2019

No, I do not agree. This is the correct fix for now.

@dalg24 (Contributor) commented Oct 7, 2019

The Clang build is hanging and the others are broken. @masterleinad, please investigate.

@masterleinad (Collaborator, Author)

@dalg24 I fixed the implementation.

@aprokop @sslattery So you are saying that the failing setup isn't reasonable at all? Surely it is an edge case, but I don't see a reason not to support it.

The current implementation also makes it possible to change the message size quite easily in case we want to play with it.

> If you guys did want to do this, however, I would advocate for a more efficient asynchronous strategy for pack-unpack of multiple messages to avoid some latency costs.

@sslattery Are you thinking about compressing via gzip or something like that? The impression I got from other projects is that network throughput is typically higher than the rate at which boost::archive is able to compress. If the communication here turns out to be a bottleneck, I am happy to consider that, though.

@sslattery (Contributor)

No, I was referring to an overlapping pack/unpack strategy when multiple messages are in play, to hide the latencies of the pack/unpack kernels on GPU systems. This is fine for now - we can optimize later as needed.
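As a rough sketch of that overlap idea (the `pack_chunk` helper below is hypothetical, standing in for a device pack kernel, and not ArborX code), posting each chunk's non-blocking send as soon as it is packed lets the transfer of one chunk overlap with packing the next:

```cpp
#include <mpi.h>

#include <vector>

// Hypothetical stand-in for a pack kernel; in practice this would launch on
// the device and fill a staging buffer for the given chunk.
void pack_chunk(int chunk, std::vector<char> &staging)
{
  staging.assign(1 << 20, static_cast<char>(chunk)); // dummy 1 MiB payload
}

// Post each chunk's non-blocking send right after it is packed, so the
// transfer of chunk i overlaps with packing chunk i + 1.
void send_with_overlap(int n_chunks, int dest, MPI_Comm comm)
{
  std::vector<std::vector<char>> staging(n_chunks);
  std::vector<MPI_Request> requests(n_chunks);
  for (int i = 0; i < n_chunks; ++i)
  {
    pack_chunk(i, staging[i]);
    MPI_Isend(staging[i].data(), static_cast<int>(staging[i].size()), MPI_BYTE,
              dest, /*tag=*/i, comm, &requests[i]);
  }
  MPI_Waitall(n_chunks, requests.data(), MPI_STATUSES_IGNORE);
}
```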

@Rombur (Collaborator) commented Oct 8, 2019

I don't see why we should throw. Why do you think this is not a valid case, @aprokop?

@sslattery (Contributor)

I agree with @dalg24 and @masterleinad after more thought - it should work for huge messages even if you're in a regime of bad performance.

@dalg24 (Contributor) commented Oct 8, 2019

@masterleinad Please confirm that the message that exceeds the maximum size is sent by the rank to itself.

@masterleinad (Collaborator, Author)

> @masterleinad Please confirm that the message that exceeds the maximum size is sent by the rank to itself.

I am not quite sure what you are asking for, but I checked (manually) that the tests pass if chunk_size==100.

@aprokop (Contributor) commented Oct 8, 2019

An additional concern is that the message sends/receives in this patch are not word-aligned. This could potentially be influencing some of the MPI machinery, including the choice of a protocol to communicate data. I'm not approving this until this is researched.

@sslattery (Contributor)

@aprokop is right - we might need an LDRD before we can merge.

@dalg24 (Contributor) commented Oct 8, 2019

> I am not quite sure what you are asking for

`if (n_chunks > 1) assert(_sources[i] == comm_rank);`

@masterleinad (Collaborator, Author)

@dalg24 Yes, in case chunk_size = std::numeric_limits<int>::max().

@dalg24 (Contributor) commented Oct 8, 2019

@aprokop would you be happier if we filter out self-communication instead?

@aprokop (Contributor) commented Oct 8, 2019

> would you be happier if we filter out self-communication instead?

Why do we even use MPI for self-communication?

@dalg24 (Contributor) commented Oct 8, 2019

> would you be happier if we filter out self-communication instead?

> Why do we even use MPI for self-communication?

Code simplicity: you don't have to treat self-communication explicitly.

@aprokop (Contributor) commented Oct 8, 2019

So, yes, in short, just do the change where you filter out self-communication, and put in place an assert for non-self communication.

@dalg24 dismissed their stale review October 8, 2019 15:48

Consider filtering out self-comm

@dalg24 (Contributor) commented Oct 8, 2019

@masterleinad or @aprokop
Would one of you please open another PR that proposes an alternative solution that removes self-communication?

@masterleinad (Collaborator, Author)

> @dalg24 Yes, in case chunk_size = std::numeric_limits<int>::max().

But only in release mode. 😉 So I was lying: there is communication to another rank for the scenario described above.

@masterleinad (Collaborator, Author)

The last commit avoids communication if an MPI rank tries to send to itself. For the scenario above, that gives roughly a 5% performance improvement for knn.
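A minimal sketch of that shortcut, with buffer layout and names assumed rather than taken from the actual commit: when the peer is the calling rank, a plain memcpy replaces the MPI exchange, so the int-sized count limit never applies to the local data.

```cpp
#include <mpi.h>

#include <cstddef>
#include <cstring>

// Sketch: exchange `size` bytes with `peer`; copy locally when the peer is
// this rank so no MPI message (and no int-sized count) is involved.
void exchange(char const *send_buf, char *recv_buf, std::size_t size, int peer,
              MPI_Comm comm)
{
  int comm_rank;
  MPI_Comm_rank(comm, &comm_rank);
  if (peer == comm_rank)
  {
    std::memcpy(recv_buf, send_buf, size);
    return;
  }
  // Per the discussion above, non-self messages stay below the limit, so the
  // cast to int is assumed to be safe here.
  MPI_Sendrecv(send_buf, static_cast<int>(size), MPI_BYTE, peer, /*sendtag=*/0,
               recv_buf, static_cast<int>(size), MPI_BYTE, peer, /*recvtag=*/0,
               comm, MPI_STATUS_IGNORE);
}
```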

@dalg24 (Contributor) commented Oct 8, 2019

> So I was lying: there is communication to another rank for the scenario described above.

Please elaborate.
Was that for 100M values/primitives and 10M queries/predicates?
What was the overlap between domains? Uniform random distribution? How many neighbors?

I am a bit surprised we hit the limit on communication with neighboring processes.

@masterleinad (Collaborator, Author)

> Please elaborate.
> Was that for 100M values/primitives and 10M queries/predicates?
> What was the overlap between domains? Uniform random distribution? How many neighbors?

> I am a bit surprised we hit the limit on communication with neighboring processes.

No, another error on my side. I was checking the wrong array for the second check (when sending). It's actually just the communication with the same rank that is that large.

@dalg24 (Contributor) commented Oct 8, 2019

Then drop the splitting into chunks.

@masterleinad (Collaborator, Author)

> Then drop the splitting into chunks.

Didn't we agree to cover that edge case?

@dalg24 (Contributor) commented Oct 8, 2019

> Then drop the splitting into chunks.

> Didn't we agree to cover that edge case?

There is no consensus. Come back with a problem setting that triggers it and does not hit other problems, like overflowing the integers used to index views, and we'll revisit :)

@sslattery (Contributor)

I agree with @dalg24 - if we have a test case where non-self-communication breaks the integer size limit then we will need to implement the chunking capability. Should we then consider throwing so we easily detect this scenario if it ever arises, as unexpected as it may be?

@aprokop (Contributor) commented Oct 9, 2019

> Should we then consider throwing so we easily detect this scenario if it ever arises, as unexpected as it may be?

Yes.

@aprokop (Contributor) commented Oct 9, 2019

Closing in favor of #134.

@aprokop closed this Oct 9, 2019