{devel}[fosscuda/2020b] PyTorch v1.7.1 w/ Python 3.8.6 #12003

Merged
merged 14 commits into easybuilders:develop on Feb 17, 2021

Conversation

branfosj
Member

@branfosj branfosj commented Jan 15, 2021

(created using eb --new-pr)

Notes:

  • I've used cuDNN-8.0.4.30-CUDA-11.1.1.eb, as there does not seem to be a cuDNN 8.0.5.39 build for CUDA 11.1 on ppc64le
  • Compared to the fosscuda/2019b install, I've dropped PyTorch-1.7.1_el8_ppc64le.patch, as it should only be needed on RHEL 8 with CUDA <= 10

Patches added through #12147 (an illustrative easyconfig snippet follows this list):

  • PyTorch-1.7.1_validate-num-gpus-in-distributed-test.patch fixes the NCCL error seen during tests
  • PyTorch-1.7.1_complex32.patch provides complex32 dtype
  • PyTorch-1.7.1_bypass-nan-compare.patch disables two tests that are known to fail; upstream PyTorch is working on a fix for these failures
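
For reference, a minimal sketch of how these patch files would typically be declared in the easyconfig. This is illustrative only (the actual easyconfig contents are in the PR diff), but easyconfigs are Python files and patches are listed by filename:

# Illustrative sketch only, not the literal contents of the easyconfig in this PR;
# the patch files are expected to sit alongside the easyconfig in the repository.
patches = [
    'PyTorch-1.7.1_validate-num-gpus-in-distributed-test.patch',
    'PyTorch-1.7.1_complex32.patch',
    'PyTorch-1.7.1_bypass-nan-compare.patch',
]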

@branfosj branfosj marked this pull request as draft January 15, 2021 09:09
@branfosj branfosj added this to the 4.x milestone Jan 15, 2021
Co-authored-by: Mikael Öhman <micketeer@gmail.com>

@branfosj
Member Author

As expected from looking at bug reports, this error persists with NCCL 2.7.8:

RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8
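
For context, here is a minimal sketch of the kind of NCCL process-group setup the failing distributed tests exercise. This is not the actual test code; it assumes a single node with 2 GPUs and uses arbitrary placeholder rendezvous settings. ProcessGroupNCCL errors like the one above typically surface during init_process_group or on the first collective:

# Minimal sketch, not the actual PyTorch test code; assumes one node with 2 GPUs.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Rendezvous settings are arbitrary placeholders for this sketch.
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    # ProcessGroupNCCL is constructed here; "invalid usage" errors tend to
    # surface during init or on the first collective call.
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device='cuda')
    dist.all_reduce(t)
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)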

@branfosj
Member Author

branfosj commented Feb 6, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0212u17a.bear.cluster - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/e81c3cca332c834f97a9ce702fed2399 for a full test report.

@branfosj branfosj marked this pull request as ready for review February 6, 2021 17:36
@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusi8032 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, Python 2.7.5
See https://gist.github.com/8158a48ad183c69131185f2b16bfedc7 for a full test report.

@Flamefire
Contributor

[W CudaIPCTypes.cpp:22] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
  File "run_test.py", line 745, in <module>
    main()
  File "run_test.py", line 728, in main
    raise RuntimeError(err_message)
RuntimeError: distributed/test_c10d_spawn failed! Received signal: SIGSEGV

To me this looks like a static deinit fiasco. The tests themselves seem to be fine, though.
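
If it helps to narrow this down, here is a sketch for re-running only that suite in isolation, to check whether the crash happens inside a test or only at interpreter shutdown. It assumes run_test.py's --include option and uses 'pytorch/test' as a placeholder path for the unpacked PyTorch sources:

# Sketch: re-run just the crashing suite; 'pytorch/test' is a placeholder path.
import subprocess
import sys

subprocess.run(
    [sys.executable, 'run_test.py', '--include', 'distributed/test_c10d_spawn', '--verbose'],
    cwd='pytorch/test',
    check=True,
)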

@Flamefire
Contributor

@branfosj I backported your patches to 2019b and included mine to fix the aforementioned bug and another one: #12147

I'm currently running your EC with those patches on our A100 cluster to see if that succeeds. Feel free to add those patches here.

@branfosj branfosj changed the title {bio,devel}[fosscuda/2020b] PyTorch v1.7.1, typing-extensions v3.7.4.3 w/ Python 3.8.6 {devel}[fosscuda/2020b] PyTorch v1.7.1 w/ Python 3.8.6 Feb 12, 2021
@boegel
Member

boegel commented Feb 13, 2021

@branfosj #12147 is merged, so this PR should be synced accordingly by pulling in the missing patches?

@branfosj
Member Author

I've synced with develop and matched the patches from PyTorch-1.7.1-fosscuda-2019b-Python-3.7.4.eb, with the exception of PyTorch-1.7.1_el8_ppc64le.patch, which is removed as it is not needed with CUDA 11.

@boegel
Member

boegel commented Feb 13, 2021

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=12003 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_12003 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13815

Test results coming soon (I hope)...

- notification for comment with ID 778679764 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member Author

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0212u17a.bear.cluster - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/11639ce44e435e6f21ff00766f1e54d7 for a full test report.

Member

@boegel boegel left a comment

lgtm

@boegel boegel modified the milestones: 4.x, next release (4.3.3?) Feb 14, 2021
@boegel
Member

boegel commented Feb 14, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/5fd4948c6a69b9be7560879e89e5beeb for a full test report.

@Flamefire
Contributor

Tested this on our A100 cluster and I'm seeing multiple minor failures:

AssertionError: False is not true : Tensors failed to compare as equal! With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 200) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.1809170246124268e-05 (-0.11461969465017319 vs. -0.11460788547992706), which occurred at index (3, 47).
exiting process with exit code: 10
E

I assume this is because we have 8 GPUs per node and PyTorch is not ready to handle that. I will open a bug report there and can add a patch here.
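
For reference, the failing assertion boils down to an elementwise tolerance check along these lines; the rtol/atol values and the failing element are taken from the log above:

# The allowed margin is atol + rtol * |expected|, roughly 1.01e-05 here, which
# the reported difference of ~1.18e-05 exceeds, so the comparison fails.
import torch

expected = torch.tensor(-0.11460788547992706)
actual = torch.tensor(-0.11461969465017319)
print(torch.allclose(actual, expected, rtol=1.3e-06, atol=1e-05))  # False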

@boegel
Member

boegel commented Feb 15, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3501.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/32f2066a992a94424838bbe4464667c0 for a full test report.

@boegel
Member

boegel commented Feb 15, 2021

@Flamefire Should we let that block this PR, or do we handle that in a follow-up PR?

@Flamefire
Contributor

@Flamefire Should we let that block this PR, or do we handle that in a follow-up PR?

@boegel Up to you. I'd fix it in here, but as I'm likely the only one seeing the problem, I could open a follow-up PR to avoid having to send patches via mail or so ;)
I have a patch ready that solves the problem. I'll run a test build overnight to see if there is anything else.

@Flamefire
Contributor

Flamefire commented Feb 17, 2021

The patch made a few of the failing tests pass, but I still have precision-related failures. As PyTorch seems to work on other systems, I'd assume this is specific to A100s/CC 8.0 and not worth holding back this PR.
If I find a solution, I'll open a follow-up PR.

Edit: FWIW: The official PyTorch docker image (docker://pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel) shows the same behavior.
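
Purely speculative and not something confirmed in this thread: PyTorch 1.7 enables TF32 matmuls by default on Ampere, which can introduce small numeric differences of roughly this magnitude. A quick way to check whether that is the culprit would be to disable TF32 before re-running the precision-sensitive tests:

# Speculative check only, not discussed in this PR: turn off TF32 and re-run
# the failing tests to see whether the precision failures disappear.
import torch

torch.backends.cuda.matmul.allow_tf32 = False  # TF32 matmuls (default True in 1.7)
torch.backends.cudnn.allow_tf32 = False        # TF32 cuDNN convolutions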

@boegel
Member

boegel commented Feb 17, 2021

Going in, thanks @branfosj!

@boegel boegel merged commit 43ddaa0 into easybuilders:develop Feb 17, 2021
@branfosj branfosj deleted the 20210115090711_new_pr_PyTorch171 branch February 17, 2021 16:26