{devel}[fosscuda/2020b] PyTorch v1.7.1 w/ Python 3.8.6 #12003

Merged
merged 14 commits into easybuilders:develop on Feb 17, 2021

Conversation

branfosj
Member

@branfosj branfosj commented Jan 15, 2021

(created using eb --new-pr)

Notes:

  • I've used cuDNN-8.0.4.30-CUDA-11.1.1.eb, as there does not seem to be a cuDNN 8.0.5.39 build for CUDA 11.1 on ppc64le
  • Compared to the fosscuda/2019b install, I've dropped PyTorch-1.7.1_el8_ppc64le.patch, as it should only be needed on RHEL 8 with CUDA <= 10

Patches added through #12147 (an illustrative easyconfig snippet follows this list):

  • PyTorch-1.7.1_validate-num-gpus-in-distributed-test.patch fixes the NCCL error seen during tests
  • PyTorch-1.7.1_complex32.patch provides complex32 dtype
  • PyTorch-1.7.1_bypass-nan-compare.patch disables two tests that are known to fail; upstream PyTorch is working on a fix for these failures
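
For reference, a minimal sketch of how these patch files would typically be declared in the easyconfig. This is illustrative only (the actual easyconfig contents are in the PR diff), but easyconfigs are Python files and patches are listed by filename:

# Illustrative sketch only, not the literal contents of the easyconfig in this PR;
# the patch files are expected to sit alongside the easyconfig in the repository.
patches = [
    'PyTorch-1.7.1_validate-num-gpus-in-distributed-test.patch',
    'PyTorch-1.7.1_complex32.patch',
    'PyTorch-1.7.1_bypass-nan-compare.patch',
]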

@branfosj branfosj marked this pull request as draft January 15, 2021 09:09
@branfosj branfosj added this to the 4.x milestone Jan 15, 2021
Co-authored-by: Mikael Öhman <micketeer@gmail.com>

@branfosj
Member Author

As expected from looking at bug reports, this error persists with NCCL 2.7.8:

RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
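
For context, here is a minimal sketch of the kind of NCCL process-group setup the failing distributed tests exercise. This is not the actual test code; it assumes a single node with 2 GPUs and uses arbitrary placeholder rendezvous settings. ProcessGroupNCCL errors like the one above typically surface during init_process_group or on the first collective:

# Minimal sketch, not the actual PyTorch test code; assumes one node with 2 GPUs.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Rendezvous settings are arbitrary placeholders for this sketch.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    # ProcessGroupNCCL is constructed here; "invalid usage" errors tend to
    # surface during init or on the first collective call.
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device='cuda')
    dist.all_reduce(t)
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)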

@branfosj
Member Author

branfosj commented Feb 6, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0212u17a.bear.cluster - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/e81c3cca332c834f97a9ce702fed2399 for a full test report.

@branfosj branfosj marked this pull request as ready for review February 6, 2021 17:36
@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusi8032 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, Python 2.7.5
See https://gist.github.com/8158a48ad183c69131185f2b16bfedc7 for a full test report.

@Flamefire
Contributor

[W CudaIPCTypes.cpp:22] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
  File "run_test.py", line 745, in <module>
    main()
  File "run_test.py", line 728, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_c10d_spawn failed! Received signal: SIGSEGV

To me this looks like a static deinit fiasco. The tests themselves seem to be fine, though.
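
If it helps to narrow this down, here is a sketch for re-running only that suite in isolation, to check whether the crash happens inside a test or only at interpreter shutdown. It assumes run_test.py's --include option and uses 'pytorch/test' as a placeholder path for the unpacked PyTorch sources:

# Sketch: re-run just the crashing suite; 'pytorch/test' is a placeholder path.
import subprocess
import sys

subprocess.run(
    [sys.executable, 'run_test.py', '--include', 'distributed/test_c10d_spawn', '--verbose'],
    cwd='pytorch/test',
    check=True,
)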

@Flamefire
Contributor

@branfosj I backported your patches to 2019b and included mine to fix the aforementioned bug and another one: #12147

I'm currently running your EC with those patches on our A100 cluster to see if that succeeds. Feel free to add those patches here.

@branfosj branfosj changed the title {bio,devel}[fosscuda/2020b] PyTorch v1.7.1, typing-extensions v3.7.4.3 w/ Python 3.8.6 {devel}[fosscuda/2020b] PyTorch v1.7.1 w/ Python 3.8.6 Feb 12, 2021
@boegel
Member

boegel commented Feb 13, 2021

@branfosj #12147 is merged, so this PR should be synced accordingly by pulling in the missing patches?

@branfosj
Member Author

I've synced with develop and matched the patches from PyTorch-1.7.1-fosscuda-2019b-Python-3.7.4.eb, with the exception of PyTorch-1.7.1_el8_ppc64le.patch, which is removed as it is not needed with CUDA 11.

@boegel
Member

boegel commented Feb 13, 2021

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=12003 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_12003 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13815

Test results coming soon (I hope)...

- notification for comment with ID 778679764 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member Author

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0212u17a.bear.cluster - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/11639ce44e435e6f21ff00766f1e54d7 for a full test report.

Member

@boegel boegel left a comment

lgtm

@boegel boegel modified the milestones: 4.x, next release (4.3.3?) Feb 14, 2021
@boegel
Member

boegel commented Feb 14, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/5fd4948c6a69b9be7560879e89e5beeb for a full test report.

@Flamefire
Contributor

Tested this on our A100 cluster and I'm seeing multiple minor failures:

AssertionError: False is not true : Tensors failed to compare as equal! With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 200) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.1809170246124268e-05 (-0.11461969465017319 vs. -0.11460788547992706), which occurred at index (3, 47).
exiting process with exit code: 10
E

I assume this is because we have 8 GPUs per node and PyTorch is not ready to handle that. I will open a bug report there and can add a patch here.
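
For reference, the failing assertion boils down to an elementwise tolerance check along these lines; the rtol/atol values and the failing element are taken from the log above:

# The allowed margin is atol + rtol * |expected|, roughly 1.01e-05 here, which
# the reported difference of ~1.18e-05 exceeds, so the comparison fails.
import torch

expected = torch.tensor(-0.11460788547992706)
actual = torch.tensor(-0.11461969465017319)
print(torch.allclose(actual, expected, rtol=1.3e-06, atol=1e-05))  # False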

@boegel
Member

boegel commented Feb 15, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3501.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/32f2066a992a94424838bbe4464667c0 for a full test report.

@boegel
Member

boegel commented Feb 15, 2021

@Flamefire Should we let that block this PR, or do we handle that in a follow-up PR?

@Flamefire
Contributor

@Flamefire Should we let that block this PR, or do we handle that in a follow-up PR?

@boegel Up to you. I'd fix it in here, but as I'm likely the only one seeing the problem, I could open a follow-up PR to avoid having to send patches via mail or so ;)
I have a patch ready that solves the problem. I'll run a test build overnight to see if there is anything else.

@Flamefire
Contributor

Flamefire commented Feb 17, 2021

The patch made a few of the failing tests pass, but I still have precision-related failures. As PyTorch seems to work on other systems, I'd assume this is specific to A100s/CC 8.0 and not worth holding back this PR.
If I find a solution, I'll open a follow-up PR.

Edit: FWIW: The official PyTorch docker image (docker://pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel) shows the same behavior.
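
Purely speculative and not something confirmed in this thread: PyTorch 1.7 enables TF32 matmuls by default on Ampere, which can introduce small numeric differences of roughly this magnitude. A quick way to check whether that is the culprit would be to disable TF32 before re-running the precision-sensitive tests:

# Speculative check only, not discussed in this PR: turn off TF32 and re-run
# the failing tests to see whether the precision failures disappear.
import torch

torch.backends.cuda.matmul.allow_tf32 = False  # TF32 matmuls (default True in 1.7)
torch.backends.cudnn.allow_tf32 = False        # TF32 cuDNN convolutions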

@boegel
Member

boegel commented Feb 17, 2021

Going in, thanks @branfosj!

@boegel boegel merged commit 43ddaa0 into easybuilders:develop Feb 17, 2021
@branfosj branfosj deleted the 20210115090711_new_pr_PyTorch171 branch February 17, 2021 16:26