Workaround nvlink bug #118

Merged: 3 commits, Jan 26, 2021

@sethrj sethrj commented Jan 26, 2021

@amandalund @pcanal


On emmet and @pcanal's machine, we saw the following error (which didn't appear on the CI, which runs CUDA 11.1):

nvlink error   : Size doesn't match for '_ZN9celeritas9SecondaryC1Ev$53' in 'CMakeFiles/celeritas.dir/physics/em/detail/EPlusGG.cu.o', first specified in 'CMakeFiles/celeritas.dir/physics/em/detail/BetheHeitler.cu.o'
nvlink fatal   : merge_elf failed

A bug report for LLVM hinted at a bug in nvlink regarding weak symbols: https://bugs.llvm.org/show_bug.cgi?id=40893 . Compiling with the nvcc -keep flag and inspecting the intermediate PTX showed that the Secondary constructor is defined as a weak symbol and that it internally stores two values (the initial values of def_id and energy). In BetheHeitler these are defined as

.weak .global .align 4 .b8 _ZN9celeritas9SecondaryC1Ev$53[4] = {255, 255, 255, 255};
.weak .global .align 8 .b8 _ZN9celeritas9SecondaryC1Ev$54[8];

However, the constructor's private variables show up with a different suffix in the EPlusGG kernel:

.weak .global .align 4 .b8 _ZN9celeritas9SecondaryC1Ev$52[4] = {255, 255, 255, 255};
.weak .global .align 8 .b8 _ZN9celeritas9SecondaryC1Ev$53[8];

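so the suffixes are off by one and end up causing a collision: the name ending in $53 refers to a 4-byte object in BetheHeitler but to an 8-byte object in EPlusGG, which is the size mismatch that nvlink reports.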

I'm not sure how the suffixes are created, but changing the include order changes their value, so rearranging the includes causes the collision to disappear.

@sethrj sethrj added the bug Something isn't working label Jan 26, 2021
@pcanal pcanal self-requested a review January 26, 2021 15:39
@pcanal pcanal left a comment

The workaround works on wc.fnal.gov with CUDA release 10.2 (V10.2.89).

pcanal commented Jan 26, 2021

Thanks!

Work around a bug in the FindMPI CMake script
@sethrj sethrj merged commit 802dbef into celeritas-project:master Jan 26, 2021
@amandalund
Thanks @sethrj, good detective work!

sethrj added a commit to sethrj/celeritas that referenced this pull request Jan 26, 2021
* Use member initialization for Secondary constructor
* Rearrange include order to work around nvlink bug
* Fix emmet with-mpi build

@sethrj sethrj deleted the workaround-nvlink-bug branch March 1, 2021 03:43