-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workaround nvlink bug #118
Merged
sethrj
merged 3 commits into
celeritas-project:master
from
sethrj:workaround-nvlink-bug
Jan 26, 2021
Merged
Workaround nvlink bug #118
sethrj
merged 3 commits into
celeritas-project:master
from
sethrj:workaround-nvlink-bug
Jan 26, 2021
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
On emmet and @pcanal's machine, we saw the following error (which didn't appear on the CI, which runs CUDA 11.1): ``` nvlink error : Size doesn't match for '_ZN9celeritas9SecondaryC1Ev$53' in 'CMakeFiles/celeritas.dir/physics/em/detail/EPlusGG.cu.o', first specified in 'CMakeFiles/celeritas.dir/physics/em/detail/BetheHeitler.cu.o' nvlink fatal : merge_elf failed ``` A bug report for LLVM hinted at a bug in `nvlink` regarding weak symbols: https://bugs.llvm.org/show_bug.cgi?id=40893 . Using the nvcc `-keep` flag led to finding the Secondary's constructor was defined as a weak symbol, and internally it stores two values (for the initialization values of def_id and energy), which are defined internally in BetheHeitler as ``` .weak .global .align 4 .b8 _ZN9celeritas9SecondaryC1Ev$53[4] = {255, 255, 255, 255}; .weak .global .align 8 .b8 _ZN9celeritas9SecondaryC1Ev$54[8]; ``` However, the constructor's private variables show up with a different suffix in the EPlusGG kernel: ``` .weak .global .align 4 .b8 _ZN9celeritas9SecondaryC1Ev$52[4] = {255, 255, 255, 255}; .weak .global .align 8 .b8 _ZN9celeritas9SecondaryC1Ev$53[8]; ``` so the suffixes are off by one and end up causing a collision. I'm not sure how the suffixes are created, but changing the include order changes their value, so rearranging the includes causes the collision to disappear.
pcanal
approved these changes
Jan 26, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The work-around work on wc.fnal.gov with cuda release 10.2, V10.2.89
Thank! |
Work around a bug ini the FindMPI cmake script
Thanks @sethrj, good detective work! |
sethrj
added a commit
to sethrj/celeritas
that referenced
this pull request
Jan 26, 2021
* Use member initialization for Secondary constructor * Rearrange include order to work around nvlink bug * Fix emmet with-mpi build --- On emmet and @pcanal's machine, we saw the following error (which didn't appear on the CI, which runs CUDA 11.1): ``` nvlink error : Size doesn't match for '_ZN9celeritas9SecondaryC1Ev$53' in 'CMakeFiles/celeritas.dir/physics/em/detail/EPlusGG.cu.o', first specified in 'CMakeFiles/celeritas.dir/physics/em/detail/BetheHeitler.cu.o' nvlink fatal : merge_elf failed ``` A bug report for LLVM hinted at a bug in `nvlink` regarding weak symbols: https://bugs.llvm.org/show_bug.cgi?id=40893 . Using the nvcc `-keep` flag led to finding the Secondary's constructor was defined as a weak symbol, and internally it stores two values (for the initialization values of def_id and energy), which are defined internally in BetheHeitler as ``` .weak .global .align 4 .b8 _ZN9celeritas9SecondaryC1Ev$53[4] = {255, 255, 255, 255}; .weak .global .align 8 .b8 _ZN9celeritas9SecondaryC1Ev$54[8]; ``` However, the constructor's private variables show up with a different suffix in the EPlusGG kernel: ``` .weak .global .align 4 .b8 _ZN9celeritas9SecondaryC1Ev$52[4] = {255, 255, 255, 255}; .weak .global .align 8 .b8 _ZN9celeritas9SecondaryC1Ev$53[8]; ``` so the suffixes are off by one and end up causing a collision. I'm not sure how the suffixes are created, but changing the include order changes their value, so rearranging the includes causes the collision to disappear.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
@amandalund @pcanal
On emmet we saw the following error (which didn't appear on the CI, which runs CUDA 11.1):
A bug report for LLVM hinted at a bug in
nvlink
regarding weak symbols: https://bugs.llvm.org/show_bug.cgi?id=40893 . Using the nvcc-keep
flag led to finding the Secondary's constructor was defined as a weak symbol, and internally it stores two values (for the initialization values of def_id and energy), which are defined internally in BetheHeitler asHowever, the constructor's private variables show up with a different suffix in the EPlusGG kernel:
so the suffixes are off by one and end up causing a collision.
I'm not sure how the suffixes are created, but changing the include order changes their value, so rearranging the includes causes the collision to disappear.