
{devel}[foss/2021a] PyTorch v1.12.1 w/ Python 3.9.5 + CUDA 11.3.1 #16453

Merged

Conversation

@Flamefire (Contributor)

(created using eb --new-pr)

@boegelbot left two comments, both since marked as outdated.
@branfosj (Member) previously approved these changes Nov 24, 2022

lgtm

@branfosj added this to the next release (4.7.0) milestone on Nov 24, 2022
@boegel changed the title from "{devel}[foss/2021a] PyTorch v1.12.1 w/ Python 3.9.5" to "{devel}[foss/2021a] PyTorch v1.12.1 w/ Python 3.9.5 + CUDA 11.3.1" on Nov 24, 2022
@boegel (Member) commented Nov 24, 2022

@Flamefire Do you think it makes sense to add this explicitly in these easyconfigs?

# there should be no failing tests thanks to the included patches
# if you do see failing tests, please open an issue at https://github.com/easybuilders/easybuild-easyconfigs/issues
max_failed_tests = 0
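
For context, a minimal sketch of where such a parameter would sit among an easyconfig's top-level options (names and versions below are illustrative, not the exact file in this PR):

# illustrative easyconfig fragment, not the one under review
name = 'PyTorch'
version = '1.12.1'
versionsuffix = '-CUDA-%(cudaver)s'

toolchain = {'name': 'foss', 'version': '2021a'}

# there should be no failing tests thanks to the included patches
# if you do see failing tests, please open an issue at
# https://github.com/easybuilders/easybuild-easyconfigs/issues
max_failed_tests = 0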

@branfosj (Member), quoting the suggestion above:

@Flamefire Do you think it makes sense to add this explicitly in these easyconfigs?

# there should be no failing tests thanks to the included patches
# if you do see failing tests, please open an issue at https://github.com/easybuilders/easybuild-easyconfigs/issues
max_failed_tests = 0

Two points here:

@branfosj (Member)

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0105u36b.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/ec24560002d510ba2fafa9f209f2fe72 for a full test report.

@branfosj (Member) commented Nov 24, 2022

FAILED (skipped=3602, expected failures=80, unexpected successes=2)
test_ops_gradients failed!
test_forward_mode_AD_nn_functional_max_unpool2d_cpu_float64 (__main__.TestGradientsCPU) ... unexpected success
test_forward_mode_AD_nn_functional_max_unpool3d_cpu_float64 (__main__.TestGradientsCPU) ... unexpected success

@branfosj (Member)

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0103u11a.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 1 x NVIDIA NVIDIA A100-PCIE-40GB, 470.57.02, Python 3.6.8
See https://gist.github.com/ffb50411c59c26ecc2819a3b38eb81cf for a full test report.

@boegel (Member) commented Nov 25, 2022

Test report by @boegel
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
node3902.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/e0ba4d67a744cd1c9e89a478e4834354 for a full test report.

  • PyTorch-1.12.1-foss-2021a-CUDA-11.3.1.eb => 248 test failures, 0 test error (out of 89459)
  • PyTorch-1.12.1-foss-2021a.eb => 1 test failure, 0 test error (out of 88942) (distributed/test_c10d_gloo failed!)

@Flamefire For PyTorch-1.12.1-foss-2021a-CUDA-11.3.1.eb, could it be related to only having a single GPU available?

@boegel (Member) commented Nov 25, 2022

Details on failing distributed/test_c10d_gloo for CPU-only installation:

Test exited with non-zero exitcode 1. Command to reproduce: /user/gent/400/vsc40023/eb_scratch/RHEL8/zen3-ampere-ib/software/Python/3.9.5-GCCcore-10.3.0/bin/python distributed/test_c10d_gloo.py -v ProcessGroupGlooTest.test_round_robin
test_round_robin_create_destroy (__main__.ProcessGroupGlooTest) ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 2339063
INFO:torch.testing._internal.common_distributed:Started process 1 with pid 2339064
INFO:torch.testing._internal.common_distributed:Started process 2 with pid 2339065
INFO:torch.testing._internal.common_distributed:Started process 3 with pid 2339066
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 0
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 1
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 2
INFO:torch.testing._internal.common_distributed:Starting event listener thread for rank 3
ERROR:torch.testing._internal.common_distributed:Caught exception:
Traceback (most recent call last):
  File "/tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 622, in run_test
    getattr(self, test_name)()
  File "/tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 503, in wrapper
    fn()
  File "/tmp/vsc40023/easybuild_build/PyTorch/1.12.1/foss-2021a/pytorch-v1.12.1/test/distributed/test_c10d_gloo.py", line 1462, in test_round_robin_create_destroy
    pg = create(num=num_process_groups, prefix=i)
  File "/tmp/vsc40023/easybuild_build/PyTorch/1.12.1/foss-2021a/pytorch-v1.12.1/test/distributed/test_c10d_gloo.py", line 1448, in create
    [
  File "/tmp/vsc40023/easybuild_build/PyTorch/1.12.1/foss-2021a/pytorch-v1.12.1/test/distributed/test_c10d_gloo.py", line 1449, in <listcomp>
    c10d.ProcessGroupGloo(
RuntimeError: Wait timeout
Exception raised from wait at ../torch/csrc/distributed/c10d/FileStore.cpp:452 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x57 (0x14c6aedc2197 in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xd9 (0x14c6aed9898c in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x3dc3b4a (0x14c6b2ec6b4a in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::PrefixStore::wait(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&) + 0x2f (0x14c6b2eccf6f in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #4: gloo::rendezvous::PrefixStore::wait(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::chrono::duration<long, std::ratio<1l, 1000l> > const&) + 0x121 (0x14c6b4a12021 in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #5: gloo::rendezvous::Context::connectFullMesh(gloo::rendezvous::Store&, std::shared_ptr<gloo::transport::Device>&) + 0x14ea (0x14c6b4a0fc5a in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::ProcessGroupGloo::ProcessGroupGloo(c10::intrusive_ptr<c10d::Store, c10::detail::intrusive_target_default_null_type<c10d::Store> > const&, int, int, c10::intrusive_ptr<c10d::ProcessGroupGloo::Options, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupGloo::Options> >) + 0x447 (0x14c6b2edeac7 in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x9f3cbd (0x14c6b9302cbd in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x36be8a (0x14c6b8c7ae8a in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #14: <unknown function> + 0x369559 (0x14c6b8c78559 in /tmp/eb-8zn00kw2/tmpreav4c99/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #56: __libc_start_main + 0xf3 (0x14c6ba2e8493 in /lib64/libc.so.6)
frame #57: _start + 0x2e (0x4006ce in /user/gent/400/vsc40023/eb_scratch/RHEL8/zen3-ampere-ib/software/Python/3.9.5-GCCcore-10.3.0/bin/python)

 exiting process 0 with exit code: 10
Process 0 terminated with exit code 10, terminating remaining processes.
ERROR

@branfosj (Member), quoting @boegel:

@Flamefire For PyTorch-1.12.1-foss-2021a-CUDA-11.3.1.eb, could it be related to only having a single GPU available?

I am suspicious about this as well. The errors I am seeing on the CUDA build all look similar to:

Running distributed/fsdp/test_distributed_checkpoint ... [2022-11-24 11:32:28.143789]
Executing ['/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.9.5-GCCcore-10.3.0/bin/python', 'distributed/fsdp/test_distributed_checkpoint.py', '-v'] ... [2022-11-24 11:32:28.143928]
test_distributed_checkpoint_state_dict_type_StateDictType_LOCAL_STATE_DICT (__main__.TestDistributedCheckpoint) ... INFO:torch.testing._internal.common_distributed:Started process 0 with pid 733462
INFO:torch.testing._internal.common_distributed:Started process 1 with pid 733463
dist init r=0, world=2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
dist init r=1, world=2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Process process 0:
Traceback (most recent call last):
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-usbph1a0/tmp_jxl19_q/lib/python3.9/site-packages/torch/testing/_internal/common_fsdp.py", line 427, in _run
    dist.barrier()
  File "/dev/shm/branfosj/tmp-up-EL8/eb-usbph1a0/tmp_jxl19_q/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Process process 1:
Traceback (most recent call last):
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.9.5-GCCcore-10.3.0/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-usbph1a0/tmp_jxl19_q/lib/python3.9/site-packages/torch/testing/_internal/common_fsdp.py", line 427, in _run
    dist.barrier()
  File "/dev/shm/branfosj/tmp-up-EL8/eb-usbph1a0/tmp_jxl19_q/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2784, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
FAIL

@boegel (Member) commented Nov 25, 2022

Test report by @boegel
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
node3304.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 520.61.05, Python 3.6.8
See https://gist.github.com/3bf0c77ca66537f6ddb66c757d831684 for a full test report.

@Flamefire (Contributor, author)

@branfosj your log looks interesting:

Running distributed/fsdp/test_distributed_checkpoint

Did you run this on a node with only 1 GPU? Because this test is guarded by @skip_if_lt_x_gpu(2). Can you ignore the failure and run torch.cuda.is_available() and torch.cuda.device_count() on that node in Python after import torch?

@branfosj (Member), quoting @Flamefire:

@branfosj your log looks interesting:

Running distributed/fsdp/test_distributed_checkpoint

Did you run this on a node with only 1 GPU? Because this test is guarded by @skip_if_lt_x_gpu(2). Can you ignore the failure and run torch.cuda.is_available() and torch.cuda.device_count() on that node in Python after import torch?

It is a node with 2 GPUs; however, I was in a cgroup that only had access to 1 of them. I'm currently testing a build with access to both GPUs, and after that I'll check what those torch functions return in the various cases.
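
For reference, the requested check amounts to something like this (a minimal sketch; run it inside the job/cgroup in question, since that is what limits GPU visibility):

import torch

# With cgroup- or scheduler-restricted GPU access, device_count() can report fewer
# GPUs than are physically in the node, which is what the skip decorator keys off.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:  ", torch.cuda.device_count())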

@Flamefire (Contributor, author) commented Nov 25, 2022

The single GPU is indeed the issue. I debugged it a bit and found that the test harness forks, waits in the barrier, and only then calls the wrapped test function, which is where the number of GPUs is checked.
So by the time the check runs, the test has already tried to use 2 GPUs.

Opened a bug with PyTorch: pytorch/pytorch#89686

I'll add a patch for the easyconfigs next week. Edit: did a quick test and scheduled builds to run over the weekend.
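
A toy illustration of that ordering (deliberately not PyTorch's code; plain multiprocessing stands in for the distributed test harness): the skip guard wraps the test body, so it only fires inside the already-forked worker, after the point where the real harness has set up the process group.

import multiprocessing as mp
from functools import wraps

def skip_if_lt_x_gpu(x, visible_gpus):
    # Stand-in for the real decorator: the check lives in the wrapper around the test body.
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if visible_gpus >= x:
                return fn(*args, **kwargs)
            print(f"skip: need {x} GPUs, have {visible_gpus}")
        return wrapper
    return decorator

@skip_if_lt_x_gpu(2, visible_gpus=1)
def test_body():
    print("multi-GPU test body runs")

def worker(rank):
    # In the real harness this is where the process group is initialised and the barrier
    # is entered -- i.e. where the NCCL failure above happens, before any skip check.
    print(f"rank {rank}: forked, joined the (pretend) barrier")
    test_body()  # only now does the skip guard look at the GPU count

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()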

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0103u14a.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A30, 470.57.02, Python 3.6.8
See https://gist.github.com/3a5df9ad56295a169e807a0afb7a9c0f for a full test report.

@branfosj (Member)

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0103u14a.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 1 x NVIDIA NVIDIA A30, 470.57.02, Python 3.6.8
See https://gist.github.com/b453024903678edfac52cc6a9c3dab27 for a full test report.

@branfosj (Member) left a review comment

The failure was:

Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.12.1/foss-2021a-CUDA-11.3.1/pytorch-v1.12.1/test/distributed/fsdp/test_fsdp_multiple_forward.py", line 48, in <module>
    class TestMultiForward(FSDPTest):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.12.1/foss-2021a-CUDA-11.3.1/pytorch-v1.12.1/test/distributed/fsdp/test_fsdp_multiple_forward.py", line 73, in TestMultiForward
    @skip_if_lt_x_gpu(2)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-b6ygbfpo/tmp4zw2uabe/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 132, in skip_if_lt_x_gpu
    TEST_SKIPS[f"multi-gpu-{n}"].message)
NameError: name 'n' is not defined

@Flamefire force-pushed the 20221020165400_new_pr_PyTorch1121 branch from 168ff80 to ec03d0c on November 28, 2022 at 11:58
@Flamefire (Contributor, author)

@branfosj Yes, a copy-and-paste mistake from quickly hacking out the fix before the weekend. I've now done a proper fix, sent a PR upstream, and updated the patch here.
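
A minimal sketch of what the corrected guard boils down to (an assumption on my part, not the actual upstream patch): the NameError above came from the quick hack referencing `n` where the decorator's own parameter is `x`.

import unittest
from functools import wraps

import torch

# Standalone sketch of a skip_if_lt_x_gpu-style guard; the real helper in
# torch.testing._internal.common_distributed exits the worker via its skip table,
# while this version simply raises SkipTest.
def skip_if_lt_x_gpu(x):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if torch.cuda.is_available() and torch.cuda.device_count() >= x:
                return fn(*args, **kwargs)
            # the guard must use the decorator's own parameter x, not an undefined n
            raise unittest.SkipTest(f"need at least {x} GPUs, found {torch.cuda.device_count()}")
        return wrapper
    return decorator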

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u36b.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/7c9058e9bfbac9496484f928e0c389ff for a full test report.

@branfosj (Member)

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0103u04a.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A30, 470.57.02, Python 3.6.8
See https://gist.github.com/0097203763ffd7652f47d81376f2c891 for a full test report.

@Flamefire (Contributor, author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml5 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/5c2b1431843d31e5d1da8a6604cbcd0b for a full test report.

@Flamefire (Contributor, author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa12 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/c279166a7c9894585837888ab0e6c1f3 for a full test report.

@branfosj (Member)

Going in, thanks @Flamefire!

@branfosj merged commit 2913f3d into easybuilders:develop on Nov 29, 2022
@Flamefire deleted the 20221020165400_new_pr_PyTorch1121 branch on November 30, 2022 at 08:21
@Flamefire (Contributor, author)

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi8016 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/47a8be219082efeece3d5bf3c4469843 for a full test report.
