
Fix communication for large message sizes #131

Closed

Conversation

@masterleinad (Collaborator)

For

`mpiexec -np 10 ./ArborX_DistributedTree.exe --values=100000000 --queries=10000000 --neighbors=40`

I was getting an MPI error telling me that the message size was too large. This pull request splits the message into multiple smaller ones in case we need to send more than std::numeric_limits<int>::max() bytes.
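For illustration, a minimal sketch of the chunking idea, not the actual PR diff; the function name, tag handling, and request bookkeeping here are assumptions:

```cpp
#include <mpi.h>

#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Sketch: post non-blocking sends for `size` bytes to `dest`, splitting the
// buffer into chunks that each fit into the `int` count MPI expects.
void isend_in_chunks(void const *buffer, std::size_t size, int dest, int tag,
                     MPI_Comm comm, std::vector<MPI_Request> &requests)
{
  auto const max_chunk =
      static_cast<std::size_t>(std::numeric_limits<int>::max());
  auto const *ptr = static_cast<char const *>(buffer);
  std::size_t offset = 0;
  while (offset < size)
  {
    int const count = static_cast<int>(std::min(max_chunk, size - offset));
    requests.emplace_back();
    // Same (dest, tag) for every chunk: MPI preserves ordering between a pair
    // of ranks for matching signatures, so the receiver can post the
    // corresponding chunked receives in the same order.
    MPI_Isend(ptr + offset, count, MPI_BYTE, dest, tag, comm, &requests.back());
    offset += static_cast<std::size_t>(count);
  }
}
```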

@aprokop (Contributor) left a comment

I disagree with the intent of this PR. I think we should just throw in this situation, and not allow things to continue. If a user's code is trying to send a message that large, we should advocate for rethinking their approach.

@sslattery (Contributor) commented Oct 7, 2019

I tend to agree with @aprokop - that seems to be a ton of communication. I say we throw for now, and if we arrive at a legitimate case for this in the future we can revisit and find an optimal strategy for those cases. @dalg24 @Rombur, do you agree?

@sslattery (Contributor)

If you guys did want to do this, however, I would advocate for a more efficient asynchronous strategy for pack-unpack of multiple messages to avoid some latency costs.

dalg24 previously approved these changes Oct 7, 2019

@dalg24 (Contributor) left a comment

Please add a comment noting that you are splitting the message into chunks when it exceeds the maximum size.

@dalg24 (Contributor) commented Oct 7, 2019

No, I do not agree. This is the correct fix for now.

@dalg24 (Contributor) commented Oct 7, 2019

The Clang build is hanging and the others are broken. @masterleinad, please investigate.

@masterleinad (Collaborator, Author)

@dalg24 I fixed the implementation.

@aprokop @sslattery So you are saying that the failing setup isn't reasonable at all? Surely it is an edge case, but I don't see a reason not to support it.

The current implementation also makes it possible to change the message size quite easily in case we want to play with it.

> If you guys did want to do this, however, I would advocate for a more efficient asynchronous strategy for pack-unpack of multiple messages to avoid some latency costs.

@sslattery Are you thinking about compressing via gzip or something like that? The impression I got from other projects is that network throughput is typically higher than the rate at which boost::archive is able to compress. If the communication here turns out to be a bottleneck, I am happy to consider that, though.

@sslattery (Contributor)

No, I was referring to an overlapping pack/unpack strategy when multiple messages are in play, to hide the latencies of the pack/unpack kernels on GPU systems. This is fine for now - we can optimize later as needed.
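As a rough sketch of that overlap idea (the `pack_chunk` helper below is hypothetical, standing in for a device pack kernel, and not ArborX code), posting each chunk's non-blocking send as soon as it is packed lets the transfer of one chunk overlap with packing the next:

```cpp
#include <mpi.h>

#include <vector>

// Hypothetical stand-in for a pack kernel; in practice this would launch on
// the device and fill a staging buffer for the given chunk.
void pack_chunk(int chunk, std::vector<char> &staging)
{
  staging.assign(1 << 20, static_cast<char>(chunk)); // dummy 1 MiB payload
}

// Post each chunk's non-blocking send right after it is packed, so the
// transfer of chunk i overlaps with packing chunk i + 1.
void send_with_overlap(int n_chunks, int dest, MPI_Comm comm)
{
  std::vector<std::vector<char>> staging(n_chunks);
  std::vector<MPI_Request> requests(n_chunks);
  for (int i = 0; i < n_chunks; ++i)
  {
    pack_chunk(i, staging[i]);
    MPI_Isend(staging[i].data(), static_cast<int>(staging[i].size()), MPI_BYTE,
              dest, /*tag=*/i, comm, &requests[i]);
  }
  MPI_Waitall(n_chunks, requests.data(), MPI_STATUSES_IGNORE);
}
```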

@Rombur (Collaborator) commented Oct 8, 2019

I don't see why we should throw. Why do you think this is not a valid case, @aprokop?

@sslattery (Contributor)

I agree with @dalg24 and @masterleinad after more thought - it should work for huge messages even if you're in a regime of bad performance.

@dalg24 (Contributor) commented Oct 8, 2019

@masterleinad Please confirm that the message that exceeds the maximum size is sent by the rank to itself.

@masterleinad (Collaborator, Author)

> @masterleinad Please confirm that the message that exceeds the maximum size is sent by the rank to itself.

I am not quite sure what you are asking for, but I checked (manually) that the tests pass if chunk_size==100.

@aprokop (Contributor) commented Oct 8, 2019

An additional concern is that the message sends/receives in this patch are not word-aligned. This could potentially be influencing some of the MPI machinery, including the choice of a protocol to communicate data. I'm not approving this until this is researched.

@sslattery (Contributor)

@aprokop is right - we might need an LDRD before we can merge.

@dalg24 (Contributor) commented Oct 8, 2019

> I am not quite sure what you are asking for

`if (n_chunks > 1) assert(_sources[i] == comm_rank);`

@masterleinad (Collaborator, Author)

@dalg24 Yes, in case chunk_size = std::numeric_limits<int>::max().

@dalg24 (Contributor) commented Oct 8, 2019

@aprokop would you be happier if we filter out self-communication instead?

@aprokop (Contributor) commented Oct 8, 2019

> would you be happier if we filter out self-communication instead?

Why do we even use MPI for self-communication?

@dalg24 (Contributor) commented Oct 8, 2019

> would you be happier if we filter out self-communication instead?

> Why do we even use MPI for self-communication?

Code simplicity: you don't have to treat self-communication explicitly.

@aprokop (Contributor) commented Oct 8, 2019

So, yes, in short, just do the change where you filter out self-communication, and put in place an assert for non-self communication.

@dalg24 dismissed their stale review October 8, 2019 15:48

Consider filtering out self-comm

@dalg24 (Contributor) commented Oct 8, 2019

@masterleinad or @aprokop
Would one of you please open another PR that proposes an alternative solution that removes self-communication?

@masterleinad (Collaborator, Author)

> @dalg24 Yes, in case chunk_size = std::numeric_limits<int>::max().

But only in release mode. 😉 So I was lying: there is communication to another rank for the scenario described above.

@masterleinad (Collaborator, Author)

The last commit avoids communication if an MPI rank tries to send to itself. For the scenario above, that gives roughly a 5% performance improvement for knn.
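A minimal sketch of that shortcut, with buffer layout and names assumed rather than taken from the actual commit: when the peer is the calling rank, a plain memcpy replaces the MPI exchange, so the int-sized count limit never applies to the local data.

```cpp
#include <mpi.h>

#include <cstddef>
#include <cstring>

// Sketch: exchange `size` bytes with `peer`; copy locally when the peer is
// this rank so no MPI message (and no int-sized count) is involved.
void exchange(char const *send_buf, char *recv_buf, std::size_t size, int peer,
              MPI_Comm comm)
{
  int comm_rank;
  MPI_Comm_rank(comm, &comm_rank);
  if (peer == comm_rank)
  {
    std::memcpy(recv_buf, send_buf, size);
    return;
  }
  // Per the discussion above, non-self messages stay below the limit, so the
  // cast to int is assumed to be safe here.
  MPI_Sendrecv(send_buf, static_cast<int>(size), MPI_BYTE, peer, /*sendtag=*/0,
               recv_buf, static_cast<int>(size), MPI_BYTE, peer, /*recvtag=*/0,
               comm, MPI_STATUS_IGNORE);
}
```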

@dalg24 (Contributor) commented Oct 8, 2019

> So I was lying: there is communication to another rank for the scenario described above.

Please elaborate.
Was that for 100M values/primitives and 10M queries/predicates?
What was the overlap between domains? Uniform random distribution? How many neighbors?

I am a bit surprised we hit the limit on communication with neighboring processes.

@masterleinad (Collaborator, Author)

> Please elaborate.
> Was that for 100M values/primitives and 10M queries/predicates?
> What was the overlap between domains? Uniform random distribution? How many neighbors?

> I am a bit surprised we hit the limit on communication with neighboring processes.

No, another error on my side. I was checking the wrong array for the second check (when sending). It's actually just the communication with the same rank that is that large.

@dalg24 (Contributor) commented Oct 8, 2019

Then drop the splitting into chunks.

@masterleinad (Collaborator, Author)

> Then drop the splitting into chunks.

Didn't we agree to cover that edge case?

@dalg24 (Contributor) commented Oct 8, 2019

> Then drop the splitting into chunks.

> Didn't we agree to cover that edge case?

There is no consensus. Come back with a problem setting that triggers it and does not hit other problems, like overflowing the integers used to index views, and we'll revisit :)

@sslattery (Contributor)

I agree with @dalg24 - if we have a test case where non-self-communication breaks the integer size limit then we will need to implement the chunking capability. Should we then consider throwing so we easily detect this scenario if it ever arises, as unexpected as it may be?

@aprokop (Contributor) commented Oct 9, 2019

> Should we then consider throwing so we easily detect this scenario if it ever arises, as unexpected as it may be?

Yes.

@aprokop (Contributor) commented Oct 9, 2019

Closing in favor of #134.

@aprokop closed this Oct 9, 2019