Nonblocking communication with boost::mpi::any_source not working for serialized communication #63

Closed
hirschsn opened this issue Jul 3, 2018 · 26 comments

Comments

@hirschsn
Contributor

hirschsn commented Jul 3, 2018

Posting multiple irecvs with boost::mpi::any_source results in message-truncation errors (Boost 1.67.0, g++ 7.3, Open MPI 2.1.1 and newer).

Code:

#include <vector>
#include <iostream>
#include <iterator>
#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>

namespace mpi = boost::mpi;

int main(int argc, char **argv)
{
    mpi::environment env(argc, argv);
    mpi::communicator comm_world;
    auto rank = comm_world.rank();
    if (rank == 0) {
        std::vector<boost::mpi::request> req;
        std::vector<std::vector<int>> data(comm_world.size() - 1);
        for (int i = 1; i < comm_world.size(); ++i) {
            req.push_back(comm_world.irecv(mpi::any_source, 0, data[i - 1]));
            //auto req = comm_world.irecv(mpi::any_source, 0, data[i - 1]);
            //req.wait();
        }
        boost::mpi::wait_all(std::begin(req), std::end(req));

        for (int i = 0; i < comm_world.size() - 1; ++i) {
            std::cout << "Process 0 received:" << std::endl;
            std::copy(std::begin(data[i]), std::end(data[i]), std::ostream_iterator<int>(std::cout, " "));
            std::cout << std::endl;
        }

    } else {
        std::vector<int> vec = {1, 2, 3, 4, 5};
        auto req = comm_world.isend(0, 0, vec);
        req.wait();
    }
}

Symptoms:

$ mpic++ serialized-anysource.cc -o serialized-anysource -lboost_mpi -lboost_serialization
$ mpiexec -n 3 ./serialized-anysource
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::mpi::exception> >'
  what():  MPI_Test: MPI_ERR_TRUNCATE: message truncated
[lapsgs17:04854] *** Process received signal ***
[lapsgs17:04854] Signal: Aborted (6)
[lapsgs17:04854] Signal code:  (-6)
[lapsgs17:04854] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7ff37f056f20]
[lapsgs17:04854] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7ff37f056e97]
[lapsgs17:04854] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7ff37f058801]
[lapsgs17:04854] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c8fb)[0x7ff37f6ad8fb]
[lapsgs17:04854] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d3a)[0x7ff37f6b3d3a]
[lapsgs17:04854] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d95)[0x7ff37f6b3d95]
[lapsgs17:04854] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92fe8)[0x7ff37f6b3fe8]
[lapsgs17:04854] [ 7] ./serialized-anysource(_ZN5boost15throw_exceptionINS_3mpi9exceptionEEEvRKT_+0x84)[0x5597ecc8415f]
[lapsgs17:04854] [ 8] ./serialized-anysource(+0x19c49)[0x5597ecc87c49]
[lapsgs17:04854] [ 9] /usr/lib/x86_64-linux-gnu/libboost_mpi.so.1.65.1(_ZN5boost3mpi7request4testEv+0x35)[0x7ff38011d595]
[lapsgs17:04854] [10] ./serialized-anysource(+0x16ae1)[0x5597ecc84ae1]
[lapsgs17:04854] [11] ./serialized-anysource(+0xfa2c)[0x5597ecc7da2c]
[lapsgs17:04854] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7ff37f039b97]
[lapsgs17:04854] [13] ./serialized-anysource(+0xf79a)[0x5597ecc7d79a]
[lapsgs17:04854] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node lapsgs17 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

If one uses the commented-out code instead of pushing back the requests, i.e. waits for each request directly instead of deferring the wait, the code works. Also, for PODs the code works properly.
The symptoms would be easy to explain if the count and data messages used the same (user-provided) tag: because of any_source, count and data messages can then get mixed up (one irecv receives both counts and the other receives both data messages). But this is only speculation; I can do some investigation as soon as I find the time.

@hirschsn
Contributor Author

hirschsn commented Jul 4, 2018

The problem is indeed that the count and data messages get mixed up. There is no trivial fix for this problem in conjunction with boost::mpi::any_source. In my opinion there are two options: use a different tag value for the second message in serialized isend and irecv (derived via a 1:1 mapping from the user-supplied tag, which is still used for the first message), or forbid serialized irecv from boost::mpi::any_source completely.

Using a different tag value: applications are allowed to use tag values in the range 0 .. MPI_TAG_UB; using anything outside that range would be highly dependent on the implementation of the MPI library in use.
Using a different tag value from within 0 .. MPI_TAG_UB is possible and works, but it might lead to serious problems if an application using Boost.MPI chooses to actually use that very tag value for its own messages.
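
To illustrate the first option, a minimal sketch of how a second tag could be derived from the user-supplied one while staying inside 0 .. MPI_TAG_UB. The helper name and the convention of reserving the upper half of the tag range are assumptions for illustration, not anything Boost.MPI actually does:

#include <mpi.h>

// Hypothetical helper: map a user tag to a "data" tag in the upper half of
// the legal tag range. This collides with applications that already use tags
// in that half, which is exactly the concern raised above.
int data_tag_for(MPI_Comm comm, int user_tag)
{
    int* tag_ub_ptr = nullptr;
    int flag = 0;
    // MPI_TAG_UB is a predefined attribute; the standard guarantees it is >= 32767.
    MPI_Comm_get_attr(comm, MPI_TAG_UB, &tag_ub_ptr, &flag);
    const int tag_ub = (flag && tag_ub_ptr) ? *tag_ub_ptr : 32767;
    const int half = tag_ub / 2 + 1;   // first tag of the reserved "data" range
    return half + user_tag % half;     // stays within 0 .. tag_ub
}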

@aminiussi
Member

Maybe there is another possibility (I need to investigate, though): the status of the size message seems to provide the actual source. If true, maybe we can force the data request to wait on the same source?

@aminiussi
Member

Which is what we are already doing...

      // Wait for the count message to complete
      BOOST_MPI_CHECK_RESULT(MPI_Wait,
                             (self->m_requests, &stat.m_status)); // status of the size msg...
      // Resize our buffer and get ready to receive its data
      data->ia.resize(data->count);
      BOOST_MPI_CHECK_RESULT(MPI_Irecv,
                             (data->ia.address(), data->ia.size(), MPI_PACKED,
                              stat.source(), stat.tag(), // ...provides the source for the data
                              MPI_Comm(data->comm), self->m_requests + 1));

@aminiussi
Member

Ok so now I get it, sorry.

@aminiussi
Member

aminiussi commented Aug 1, 2018

There might be another solution: maybe we can ask the user for a second, usually optional, tag and make it mandatory when using any_source.

As of now, the check would have to happen at run time, unless we turn any_source into a tag type (struct any_source_t {} any_source;) so that the compiler can force a second tag. The user would still be able to pass the value MPI_ANY_SOURCE directly, but then we could warn.
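
A rough sketch of that idea (all names here are illustrative placeholders, not actual Boost.MPI code): making any_source a distinct type lets the overload that accepts it require a second tag at compile time.

struct request {};                       // stand-in for boost::mpi::request
struct any_source_t {};
constexpr any_source_t any_source{};

struct communicator_sketch {
    // Explicit source: a single user tag is enough.
    template <typename T>
    request irecv(int source, int tag, T& value) { return {}; }

    // any_source: the second tag is mandatory, so count and payload can be
    // matched on different tags. Passing the raw MPI_ANY_SOURCE integer would
    // still select the first overload, where a warning could be issued.
    template <typename T>
    request irecv(any_source_t, int count_tag, int data_tag, T& value) { return {}; }
};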

@aminiussi
Member

We need to merge #66 before we proceed with this issue as it will impact that specific code.

@hirschsn
Contributor Author

hirschsn commented Aug 6, 2018

Asking for a second tag on serialized non-blocking communication indeed seems to be a good possibility given the options.

@hirschsn
Contributor Author

hirschsn commented Aug 7, 2018

I just discussed this issue with a colleague and we came to the conclusion that it will still not work, even with two user-specified tags.

Imagine the following: rank 0 does two successive isends (patched to take two user tags) to rank 1 with tag values C and D (which the new implementation would use for the count and data messages, respectively). Rank 1 issues two irecvs with matching tags C and D and stores the requests in r1 and r2. Now r1 matches the count message of the first isend and r2 matches the count message of the second isend, because MPI guarantees message ordering. But if the user then, for some reason, calls test or wait on r2 first, the data receive of r2 will match the data of the first isend. This is because the user specified the same tag pair for both isends.

@aminiussi
Member

Correct, this needs more thinking. Unfortunately I won't be able to work on it for the coming 10 days.

We could (using an any_source tag type as proposed earlier) prohibit its usage on non-atomic messages at compile time. But that would prevent people from using a feature that can still work with some reasonable precautions.

@matthiastroyer
Contributor

matthiastroyer commented Aug 10, 2018 via email

@hirschsn
Contributor Author

I don't know what has been discussed in the past. Do you have a reference to prior discussions about this or related problems? Anyway, let me quickly note some thoughts on using different communicators.

Using just one pre-dupped communicator (e.g. a second MPI_Comm as part of every boost::mpi::communicator) that is responsible for all data messages does not solve the problem described above (calling test or wait in a different order).

Using a new communicator for every single non-blocking point-to-point communication should work, but it is impossible to do on the fly because dup, split, etc. on communicators are collective operations. It would require predefining n^2 point-to-point communicators, wouldn't it?

@aminiussi
Member

Maybe it has been part of the past discussion, but what about implementing point-to-point transfer of serialized data without a size message, using MPI_Probe/MPI_Iprobe on the receive side?
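
For illustration, a minimal sketch of the probe-based idea (function name and buffer handling are assumptions, not the actual Boost.MPI implementation): the receiver sizes its buffer from a matched probe instead of a separate count message, so only one message per send is needed.

#include <mpi.h>
#include <vector>

// Hypothetical probe-based receive for serialized data: no separate count
// message; the buffer is sized from the matched probe's status.
std::vector<char> probe_receive(MPI_Comm comm, int source, int tag)
{
    MPI_Message msg;
    MPI_Status status;
    // Matched probe: reserves exactly one incoming message, so a concurrent
    // receive with the same (source, tag) cannot steal it. Works with
    // MPI_ANY_SOURCE as well.
    MPI_Mprobe(source, tag, comm, &msg, &status);

    int count = 0;
    MPI_Get_count(&status, MPI_PACKED, &count);

    std::vector<char> buffer(count);
    // Receive the previously matched message into the correctly sized buffer.
    MPI_Mrecv(buffer.data(), count, MPI_PACKED, &msg, MPI_STATUS_IGNORE);
    return buffer;
}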

@hirschsn
Contributor Author

I think this would solve both issues: not being able to tell (a) which messages are data counts and which carry the actual data, and (b) which count message belongs to which data message (for the same source/tag pair), simply because there would be no count messages anymore.

However, doing it only for serialized data will probably not work, since there are other kinds of data sent as count + data message (for example in the std::vector overloads). These would also need to use a communication scheme with only one message being sent.

@aminiussi
Member

Well, we can match a serialized send with a serialized receive, but dealing with vector is probably not a big issue anyway.

I'm opening a new issue, as this seems more general.

@aminiussi
Member

You can follow #70 if interested in implementation.

@aminiussi
Member

So, a probe version is mostly working, but it is limited by what is probably an Intel MPI bug.

The following MPI-only code works with up to 15 processes on my installation, but fails to find incoming messages starting at 16 processes.

It works on my Open MPI installation.

Could you try it on your available platforms?

Thanks.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);
  int rank, nproc;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  int value = 42;
  int input;
  int next = (rank + 1) % nproc;
  int prev = (rank + nproc - 1) % nproc;
  int tag = 2;
  MPI_Request sreq;
  MPI_Isend(&value, 1, MPI_INT, next, tag, MPI_COMM_WORLD, &sreq);
  int probe = 0;
  int test  = 0;
  MPI_Message msg;
  do {
    if (!test) {
      MPI_Test(&sreq, &test, MPI_STATUS_IGNORE);
      if (test) {
        printf("Proc %i sent msg %i to Proc %i\n", rank, tag, next);
      } else {
        printf("Proc %i have not sent msg %i to Proc %i yet\n", rank, tag, next);
      }
    }
    if (!probe) {
      int err = MPI_Improbe(prev, tag,
                            MPI_COMM_WORLD, &probe,
                            &msg,
                            MPI_STATUS_IGNORE);
      if (probe)
        printf("Proc %i got msg %i from proc %i\n", rank, tag, prev);
      else
        printf("Proc %i haven't got msg %i from proc %i yet\n", rank, tag, prev);
    }
  } while(probe == 0 || test == 0);
  MPI_Finalize();
  return 0;
}
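
For reference, one possible way to build and run the test (the source file name is arbitrary here):

$ mpicc ring-improbe.c -o ring-improbe
$ mpiexec -n 16 ./ring-improbe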

My installations are:

Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329) 
mpirun (Open MPI) 2.1.1

@Belcourt
Member

Belcourt commented Aug 25, 2018 via email

@hirschsn
Contributor Author

Works for me on Open MPI 3.1.1 and some older versions. Also, the code seems okay.

@aminiussi
Member

aminiussi commented Aug 27, 2018

Fails, on a different Linux cluster (occigen.cines.fr), with:

Intel(R) MPI Library for Linux* OS, Version 5.1.3 Build 20160601 (build id: 15562)
Intel(R) MPI Library for Linux* OS, Version 2017 Update 2 Build 20170125 (id: 16752)
Intel(R) MPI Library for Linux* OS, Version 2017 Build 20160721 (id: 15987)
Intel(R) MPI Library for Linux* OS, Version 2018 Build 20170713 (id: 17594)
Intel(R) MPI Library for Linux* OS, Version 2018 Update 1 Build 20171011 (id: 17941)

Fails, on the same cluster (licallo.oca.eu), with:

Intel(R) MPI Library for Linux* OS, Version 2017 Update 3 Build 20170405 (id: 17193)
Intel(R) MPI Library for Linux* OS, Version 2018 Update 3 Build 20180411 (id: 18329) 

@hirschsn
Contributor Author

Could you test the code with an MPI_Mrecv before finalizing? I know it sounds paranoid, but I think the program could be ill-formed. The standard says in §8.5 (MPI 3.1, p. 357, l. 34ff., MPI_Finalize): "[Before calling MPI_Finalize a process] must locally complete all MPI operations that it initiated and must execute matching calls needed to complete MPI communications initiated by other processes. For example, [...] if the process is the target of a send, then it must post the matching receive; [...]".

Could it be that Intel MPI relies on this in the implementations you tested?

@aminiussi
Member

Thanks for the remark; as such, the code would indeed be non-conformant.
This is the reduced version; the original (with the same behaviour) has an MPI_Mrecv call after the MPI_Improbe succeeds (if it ever does).
Still, the Intel MPI runtime has no way of knowing whether a matching MPI_Mrecv will be called when it executes the MPI_Improbe (that's not decidable in the general case). It could trigger undefined behaviour when executing the finalize if the matching receive wasn't called (while still complying with §8.5), but that is not what I observe: the MPI_Improbe is never successful (well, that can't be decided either, but let's say not for a long time) and, as a consequence, finalize is never called.
I'm still going to update the test case for my Intel issue to make it conformant. Thanks!
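
A minimal sketch of such a conformant ending, reusing the input, msg, rank, and prev variables from the test case above: once the probe has succeeded, complete the matched receive before finalizing.

  // After the do/while loop, receive the message matched by MPI_Improbe
  // so that all communication is locally complete before MPI_Finalize.
  MPI_Mrecv(&input, 1, MPI_INT, &msg, MPI_STATUS_IGNORE);
  printf("Proc %i received %i from proc %i\n", rank, input, prev);
  MPI_Finalize();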

@aminiussi
Member

aminiussi commented Aug 28, 2018

Please note that I fixed the results comment: it always fails with Intel MPI (I had my SLURM parameters wrong).
I tried some configurations with more than one node (to force the communication layer onto InfiniBand) without improvement...

@aminiussi
Member

So, it is an issue with Intel's MPI implementation that should be fixed in the soon-to-be-available 2019 version.
I guess we will need to conditionally compile the new feature/fix and add a note in the documentation.

@aminiussi
Member

@hirschsn is it OK for you to provide the code under the Boost Software License?

I'd like to integrate your test case.

@aminiussi
Member

This issue seems fixed in #70.

@hirschsn
Contributor Author

hirschsn commented Sep 10, 2018 via email

aminiussi added a commit that referenced this issue Oct 22, 2019