Optimize communication within the same rank #134

Merged 1 commit on Oct 9, 2019
37 changes: 29 additions & 8 deletions src/details/ArborX_DetailsDistributor.hpp
@@ -181,17 +181,38 @@ class Distributor
     requests.reserve(outdegrees + indegrees);
     for (int i = 0; i < indegrees; ++i)
     {
-      requests.emplace_back();
-      MPI_Irecv(src_buffer.data() + _src_offsets[i] * num_packets,
-                _src_counts[i] * num_packets * sizeof(ValueType), MPI_BYTE,
-                _sources[i], MPI_ANY_TAG, _comm, &requests.back());
+      if (_sources[i] != comm_rank)
+      {
+        auto const message_size =
+            _src_counts[i] * num_packets * sizeof(ValueType);
+        auto const receive_buffer_ptr =
+            src_buffer.data() + _src_offsets[i] * num_packets;
+        requests.emplace_back();
+        MPI_Irecv(receive_buffer_ptr, message_size, MPI_BYTE, _sources[i], 123,
Contributor: Is this just a good habit to fix the tag here?

Contributor: I am always wary of fixing an MPI tag, because you never know what a user will pick and whether things will hang if they happen to be the same. Typically I try to expose the tag to the user in the interface of any code that does MPI communication, with a default value provided.

Contributor:

> typically I try to expose this to the user in an interface for code that does MPI communication with a default value provided

How do you do that? Do all your communications use the same tag?

Contributor: In this case I would expose it in either the distributed tree constructor or send across network, basically any function that does MPI. If you have a single function that does multiple MPI communications which need different tags, then you could ask for two or more tags, or a range of tags that are acceptable for the library to use. See this for example: https://github.com/ECP-copa/Cabana/blob/89240fe1a605589d7c0d925b41b3b57bc5586459/core/src/Cabana_Distributor.hpp#L100

Contributor: Wouldn't a much better and cleaner solution be to use MPI_Comm_dup? That creates a new namespace, so any tag you use won't intersect with user communication. I think in fact this should be done in ArborX.

Contributor: As long as the object is long-lived, that sounds reasonable. Either way, we shouldn't use a fixed tag with a user-provided communicator. Either the approach I suggested or the approach @aprokop suggested will resolve this.

Contributor: Comm duplication was merged in #135.

+                  _comm, &requests.back());
+      }
     }
     for (int i = 0; i < outdegrees; ++i)
     {
-      requests.emplace_back();
-      MPI_Isend(dest_buffer.data() + _dest_offsets[i] * num_packets,
-                _dest_counts[i] * num_packets * sizeof(ValueType), MPI_BYTE,
-                _destinations[i], 123, _comm, &requests.back());
+      auto const message_size =
+          _dest_counts[i] * num_packets * sizeof(ValueType);
+      auto const send_buffer_ptr =
+          dest_buffer.data() + _dest_offsets[i] * num_packets;
+      if (_destinations[i] == comm_rank)
+      {
+        auto const it = std::find(_sources.begin(), _sources.end(), comm_rank);
+        ARBORX_ASSERT(it != _sources.end());
+        auto const position = it - _sources.begin();
+        auto const receive_buffer_ptr =
+            src_buffer.data() + _src_offsets[position] * num_packets;
+        std::memcpy(receive_buffer_ptr, send_buffer_ptr, message_size);
Contributor: Are you guys using GPU-aware MPI? If you are, this won't work. If you aren't, we should have a discussion about that.

Contributor: Good point.

Contributor: But src_buffer is a std::vector in the current implementation, so the proposed changes are valid.

Contributor (@sslattery, Oct 9, 2019): Fair enough; let's make a note to discuss a GPU-aware strategy in the future then.

Collaborator (Author):

// TODO
// * apply permutation on the device in a parallel for
// * switch to MPI with CUDA support (do not copy to host)

+      }
+      else
+      {
+        requests.emplace_back();
+        MPI_Isend(send_buffer_ptr, message_size, MPI_BYTE, _destinations[i],
+                  123, _comm, &requests.back());
+      }
     }
     if (!requests.empty())
       MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);