Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array pointers in messages can be overwritten by generated code in the message class constructor (causes crash on GCC 6+) #1045

Closed
PhilMiller opened this issue Apr 27, 2016 · 13 comments
Assignees
Labels
Bug Something isn't working
Milestone

Comments

@PhilMiller
Copy link
Contributor

Original issue: https://charm.cs.illinois.edu/redmine/issues/1045


https://charm.cs.illinois.edu/private/tms/listlog.php?param=1111

Upcoming changes in GCC 6 optimizer capabilities may resurface this old bug even with the workaround implemented
https://gcc.gnu.org/gcc-6/changes.html

-flifetime-dse is more aggressive in dead-store elimination in situations where a memory store to a location precedes a constructor to the memory location.

If the various bits of memory allocation and construction get inlined, potentially even across object files with link-time optimization, then the stores to the pointers from the message struct to the associated spots in the memory buffer allocated for it and its payload might get eliminated.

@PhilMiller PhilMiller added the Bug Something isn't working label Apr 25, 2019
@PhilMiller PhilMiller added this to the 6.8.0-beta1 milestone Apr 25, 2019
@PhilMiller PhilMiller self-assigned this Apr 25, 2019
@PhilMiller
Copy link
Contributor Author

Original date: 2016-08-28 20:59:30


I need to investigate more, but I think this has come to the fore.

@PhilMiller
Copy link
Contributor Author

Original date: 2016-08-28 21:16:33


This crashes in megatest

./build test netlrts-linux-x86_64 --suffix=gcc6-prod-lifetime-dse --with-production --enable-errorchecking -j4 -g

Whereas this:

./build test netlrts-linux-x86_64 --suffix=gcc6-prod-no-lifetime-dse --with-production --enable-errorchecking -j4 -g -fno-lifetime-dse

Does not crash, successfully runs the full suite of tests and examples.

@PhilMiller
Copy link
Contributor Author

Original date: 2016-08-28 22:02:03


More details on the optimization here:
https://gcc.gnu.org/gcc-6/porting_to.html

@trquinn
Copy link
Collaborator

trquinn commented Apr 25, 2019

Original date: 2016-10-12 04:39:26


ChaNGa also dies in a CkReduction when compiled with GCC 6.2.1. Recompiling with fno-lifetime-dse fixes the problem.

@PhilMiller
Copy link
Contributor Author

Original date: 2016-10-19 21:35:55


https://charm.cs.illinois.edu/gerrit/1928 97b1159

@stwhite91
Copy link
Contributor

Original date: 2016-10-29 14:58:01


On net-win64 we get compiler warnings because MSVC doesn't support the -fno-lifetime-dse option:

Ignored Unrecognized argument -fno-lifetime-dse

@stwhite91
Copy link
Contributor

Original date: 2016-11-04 15:23:53


XLC actually gives an error for this, so the build aborts:

../bin/charmc -host  -I../../src/xlat-i/ -I../../src/xlat-i/sdag/ -I.   -c -o xi-main.o ../../src/xlat-i/xi-main.C
g++: unrecognized option '-qlanglvl=extended0x'
cc1plus: error: unrecognized command line option "-fno-lifetime-dse"

@PhilMiller
Copy link
Contributor Author

Original date: 2016-11-04 15:26:45


Ugh, the configure test is supposed to catch that. I'll fix.

@nikhil-jain
Copy link
Contributor

Original date: 2016-11-06 01:06:16


Phil Miller wrote:

Ugh, the configure test is supposed to catch that. I'll fix.

Config does catch it, but on BG/Q cross compilation means we use different compilers for SEQ and rest.

@PhilMiller
Copy link
Contributor Author

Original date: 2016-11-08 00:27:54


On examination, there is no case where the flags we're detecting in configure should actually be passed to the native/host compiler. So, disable passing those flags to the native compiler, and we've fixed the Blue Gene Q issue, and partially mitigated the (less problematic) MSVC issue:
https://charm.cs.illinois.edu/gerrit/#/c/1972/ 5d10d7e
Would test on BG/Q, but it's not back online from maintenance yet.

@PhilMiller
Copy link
Contributor Author

Original date: 2016-11-08 18:27:17


Subsequent fixes pushed, awaiting review.

@PhilMiller
Copy link
Contributor Author

Original date: 2016-11-09 18:20:08


It'll be nice to see this get through autobuild, but the fix is in.

@PhilMiller
Copy link
Contributor Author

Original date: 2017-07-11 16:14:17


My original notes from 2011-01-07:

Shortly before Christmas, Akhil and I isolated a bug in Charm++ code
that only exhibits in newer compilers. Specifically, when creating
messages containing variable length arrays, we rely on the memory
allocation returned by our operator new() being passed to the
message's constructor unmodified. In some circumstances [1], this is
not the case, and the pointers that the allocator initializes as
offsets into the allocated buffer get overwritten. Subsequent code
dereferencing those pointers goes astray.

Two major things to consider:

  1. We should not make a new (non-bugfix) release with a bug like this
    looming over our heads, and
  2. I'm pretty sure fixing this will require an API change, though the
    magnitude of that change spans many possibilities.

The plan for this meeting is to explore what those possibilities are,
and what they entail in terms of implementation effort, application
adaptation effort, maintainability, and performance. I would
appreciate if group members with experience developing against APIs
with RDMA-like constructs (ibverbs, LAPI) would attend.

[1] Right now, the one I know of is when the compiler generates a
default constructor. There may be others, and that set may vary over
time, compiler version, options, etc.

Email exchange with Jim, same day:

Jim sent the following while we met:

This might be a product of new initialization rules for plain old data. Try removing the final () from both of these lines in varsizetest2.C:

varsizetest2_Msg *m = new (sizes,0) varsizetest2_Msg();

m = new (sizes,0) varsizetest2_Msg();

The () after the class name is an explicit initializer, which causes all elements to be default inititialized to zero. If you declare a constructor then it's no longer plain old data, and the default constructor is expected to initialize everything.

See http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=423

Try making the CMessage_varsizetest2_Msg constructor protected in the .decl.h file, which would actually solve the problem if it works.

If that doesn't work, the appropriate API change would be to move the pointer declarations to the CMessage_varsizetest2_Msg class. The user would have to comment them out (or use a macro) in the varsizetest2_Msg declaration and they would only exist in the .ci file.

Next message:

Maybe also try adding a private copy constructor to CMessage_varsizetest2_Msg in the .decl.h file.

Anything to keep C++0X from making the message plain old data.

http://en.wikipedia.org/wiki/C%2B%2B0x

The insight about POD versus class-with-constructor is one we missed, but it doesn't really matter. The compiler is allowed to do whatever it wants with the block of memory returned by operator new() before calling the constructor. XLC even supports a debugging flag to have such memory initialized to a particular pattern. The compiler could also ask for extra bytes that it will take off the beginning of the allocation for whatever purpose, a possibility that we ignore.

The API change suggested is the same as the minimal-effort change we came up with.

Meeting minutes, same day:

We explored the issue, and concluded that some API change would be necessary.

There are two prime possibilities:

When a message declared in a .ci file mentions variable length arrays, the pointers to those arrays should be declared in the base CMessage_foo class and initialized to the appropriate offset in its constructor, using length data hidden somewhere by the runtime. The requires user code to remove the same declarations, possibly conditional on the version of Charm++ in which this change occurs.

We could transition to an API where the client code provides a list of pieces of memory that should end up the in the outgoing message, with lengths and possibly strides. This matches the underlying APIs available in TCP (sendv()), LAPI, and probably others. This smoothly enables later implementation of optional zero-copy optimization if the caller promises not to modify the buffers in question until the send is complete.

January 18:

After discussions with people, I've started along the following path:

  1. Implement the partial workaround that Gengbin described, by modifying charmxi as follows:
    a. for each message class, declare an array _ck_sizes of size_t with enough slots for all of the variable length arrays
    b. in CMessage_foo::operator new(), save the sizes passed in as arguments to _ck_sizes
    c. Move the setting of pointers from CMessage_foo::operator new() to CMessage_foo::CMessage_foo(), using the saved sizes
  2. Start exploring forward-looking APIs that obviate the still-possible issue of overwriting between CMessage_foo::CMessage_foo() and Message::Message()

Currently experiencing difficulty with 1a, in that CpvStaticDeclare won't work for static class members. Can't just declare outside the class with the name appended, because of templates. Will continue tomorrow.

Next update:

Coded bugfix. Tested net-linux-x86_64{,-smp} through added test in megatest.

Commits 5d1a75e and c0776f0.

Final note, January 2012:

To actually make this bug impossible would still require API changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants