Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dangling pointer to custom operation in collectives #52

Closed
wants to merge 3 commits into from
Closed

Fix dangling pointer to custom operation in collectives #52

wants to merge 3 commits into from

Conversation

mkuron
Copy link
Contributor

@mkuron mkuron commented Dec 20, 2017

Found with Clang-Tidy in Boost 1.58.

/usr/include/boost/mpi/collectives/all_reduce.hpp:59:3: error: Address of stack memory associated with local variable 'op' is still referred to by the global variable 'op_ptr' upon returning to the caller.  This will be a dangling reference [clang-analyzer-core.StackAddressEscape,-warnings-as-errors]
  }
  ^
main.cpp:54:3: note: Calling 'all_reduce'
  boost::mpi::all_reduce(comm, value == root_value, is_same,
  ^
/usr/include/boost/mpi/collectives/all_reduce.hpp:115:3: note: Calling 'all_reduce_impl'
  detail::all_reduce_impl(comm, &in_value, 1, &out_value, op,
  ^
/usr/include/boost/mpi/collectives/all_reduce.hpp:59:3: note: Address of stack memory associated with local variable 'op' is still referred to by the global variable 'op_ptr' upon returning to the caller.  This will be a dangling reference
  }
  ^

user_op takes a reference to op and all_reduce_impl/reduce_impl/scan_impl take it by value, so this seems to be the only place where this can be fixed.

@mkuron mkuron changed the title Fix dangling pointer in all_reduce_impl Fix dangling pointer to custom operation in collectives Jan 8, 2018
@aminiussi
Copy link
Member

This is tricky, I realize I never looked into that corner...
Basically, we're suggesting that we can provide statefull operation (hence the copy of the function object).

But if we look the underlying C API of MPI_Op_create, we see that this not supported:

typedef void MPI_User_function(void *invec, void *inoutvec,
                                 int *len,
                                 MPI_Datatype *datatype);

The root of the problem raised by that mismatch is actually in the user_op object (in operations.hpp). In order to compensate for that lack of state, we declare a static Op* op_ptr; that will bind to our statefull C++ function object.

That is sooooo bad. If we do have a state in the function object, multi-threading will kill us: If we have two different states for the same couple <Op,T>, since they will share that static, the last call to 'user_op<Op,T>will reset the state that will be used by all instances. (In the absence of multi-threading, the last one will be the good one, we should still resetop_ptr` to 0 to make valgrind happy).

I don't think you modification actually fixes the problem, it just make the compiler happy for no good reason.

Right now, the only real solution I can think of would be to state clearly in the documentation that we require stateless functions objects (in line with the MPI requirements) and use a fresh instance of the op object in perform:

static void BOOST_MPI_CALLING_CONVENTION perform(void* vinvec, void* voutvec, int* plen, MPI_Datatype*)
    {
      T* invec = static_cast<T*>(vinvec);
      T* outvec = static_cast<T*>(voutvec);
      Op op; // independent instance
      std::transform(invec, invec + *plen, outvec, outvec, op);
    }

the static Op* op_ptr; would be removed.

But that's a change that would break code that just happens to work so far.

@mkuron
Copy link
Contributor Author

mkuron commented Jan 8, 2018

@aminiussi, thanks for the explanation. I have to admit that I only briefly looked at the Boost code and noticed I needed to added these statics to fix the variable lifetime issue that the compiler complained about.

Stateless operations work correctly, but previously produced the warnings that my pull request silences. Stateful operations never really worked if I understand you correctly, so that would need to be documented. If your suggested change needed to enforce statelessness can't be made without breaking code, then I suppose documenting the lack of support for stateful operations is all we can do. We should still suppress these warnings for stateless operations though.

@aminiussi
Copy link
Member

Documentation change in d7b35ef

@mkuron
Copy link
Contributor Author

mkuron commented Jul 31, 2018

We still have the variable lifetime issue though that this pull request fixes; right now we are still calling member functions on a dangling pointer. This happens to work because the function doesn't access member variables, but is probably undefined behavior.

I suppose we could do user_op<Op, T> mpi_op(Op()) to get rid of the static variables though.

@aminiussi
Copy link
Member

Moving discussion to #68

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants