{devel}[fosscuda/2020b] PyTorch v1.7.1 w/ Python 3.8.6 #12003
Conversation
easybuild/easyconfigs/t/typing-extensions/typing-extensions-3.7.4.3-fosscuda-2020b.eb
easybuild/easyconfigs/p/PyTorch/PyTorch-1.7.1-fosscuda-2020b.eb
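For context, an EasyBuild easyconfig is a short Python-syntax file. The sketch below gives a rough idea of what the header of PyTorch-1.7.1-fosscuda-2020b.eb could look like; only the name, version, toolchain, and the Python/typing-extensions/cuDNN versions are taken from this PR, while the remaining fields are illustrative assumptions rather than the actual contents of the file.

```python
# Illustrative sketch only -- NOT the actual easyconfig added by this PR.
# Name, version, toolchain, and the Python/typing-extensions/cuDNN versions
# come from the PR; the remaining fields are assumptions for illustration.
name = 'PyTorch'
version = '1.7.1'

homepage = 'https://pytorch.org/'
description = "Tensors and dynamic neural networks in Python with strong GPU acceleration."

toolchain = {'name': 'fosscuda', 'version': '2020b'}

sources = ['%(namelower)s-v%(version)s.tar.gz']

dependencies = [
    ('Python', '3.8.6'),
    ('typing-extensions', '3.7.4.3'),
    # assumed dependency spec; cuDNN 8.0.4.30 for CUDA 11.1.1 is mentioned in the PR notes
    ('cuDNN', '8.0.4.30', '-CUDA-11.1.1', True),
]

moduleclass = 'devel'
```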
Co-authored-by: Mikael Öhman <micketeer@gmail.com>
As expected from looking at bug reports, this error persists with NCCL 2.7.8.
Test report by @branfosj
Test report by @Flamefire
To me this looks like the static deinit fiasco. The tests seem to be fine, though.
easybuild/easyconfigs/t/typing-extensions/typing-extensions-3.7.4.3-fosscuda-2020b.eb
I've synced with …
@boegelbot please test @ generoso |
@boegel: Request for testing this PR well received on generoso
PR test command '…'
Test results coming soon (I hope)...
- notification for comment with ID 778679764 processed
Message to humans: this is just bookkeeping information for me, …
Test report by @branfosj
lgtm
Test report by @boegel
Tested this on our A100 cluster and I'm seeing multiple minor failures.
I assume this is because we have 8 GPUs/node and PyTorch is not ready to handle that. I will open a bug report there and can add a patch here.
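To make the GPU-count issue concrete, here is a minimal sketch of the kind of guard a distributed test can use so the number of ranks it spawns stays consistent with the GPUs that are actually visible. This is only an illustration of the idea under that assumption; the helper names are hypothetical, and this is not the patch that was later added.

```python
# Sketch only: cap a single-node NCCL test at the number of visible GPUs.
# Helper names are hypothetical; this is not the actual test patch.
import torch
import torch.multiprocessing as mp


def run_rank(rank, world_size):
    # Each rank binds to its own GPU; if the test assumes a fixed GPU count,
    # nodes with many GPUs (e.g. 8 GPUs/node) can trip it up.
    torch.cuda.set_device(rank)
    # ... init torch.distributed with backend="nccl" and run the test body ...


def launch_test(requested_world_size):
    available = torch.cuda.device_count()
    world_size = min(requested_world_size, available)  # validate the GPU count
    mp.spawn(run_rank, args=(world_size,), nprocs=world_size)
```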
Test report by @boegel
@Flamefire Should we let that block this PR, or do we handle that in a follow-up PR?
@boegel Up to you. I'd fix this in here, but as I'm likely the only one seeing the problem, I could open the follow-up PR to avoid having to send patches via mail or so ;)
The patch made a few more of the tests pass, but I still have precision-related failures. As PyTorch seems to work on other systems, I'd assume this is specific to A100s/CC 8.0 and not worth holding back this PR. Edit: FWIW, the official PyTorch Docker image (docker://pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel) shows the same behavior.
Going in, thanks @branfosj! |
(created using `eb --new-pr`)

Notes:
- uses cuDNN-8.0.4.30-CUDA-11.1.1.eb, as there does not seem to be a cuDNN 8.0.5.39 for CUDA 11.1 on ppc64le
- drops PyTorch-1.7.1_el8_ppc64le.patch, compared to the fosscuda/2019b install, as this should only be needed on RHEL 8 with CUDA <= 10

Patches added through #12147:
- PyTorch-1.7.1_validate-num-gpus-in-distributed-test.patch fixes the NCCL error seen during tests
- PyTorch-1.7.1_complex32.patch provides the complex32 dtype (see the sketch after this list)
- PyTorch-1.7.1_bypass-nan-compare.patch disables two tests that are known to fail; PyTorch is working on a fix for these failures.
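Regarding the complex32 item above, a quick way to see whether a given PyTorch build exposes the dtype is sketched below; this is a hypothetical check for illustration, not part of the patch itself.

```python
# Sketch: check whether this PyTorch build exposes the complex32 dtype.
import torch

if hasattr(torch, "complex32"):
    print("complex32 dtype available:", torch.complex32)
else:
    print("complex32 dtype not provided by this PyTorch build")
```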