{devel}[foss/2021a] PyTorch v1.12.1 w/ Python 3.9.5 + CUDA 11.3.1 #16453
Conversation
Force-pushed from fb7a1a3 to 5f0c732
Force-pushed from e995abf to 0df1438
lgtm
@Flamefire Do you think it makes sense to add this explicitly in these easyconfigs?

# there should be no failing tests thanks to the included patches
# if you do see failing tests, please open an issue at https://github.com/easybuilders/easybuild-easyconfigs/issues
max_failed_tests = 0
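For context, here is a minimal sketch of how that parameter would sit in one of these easyconfigs. The surrounding fields follow this PR's naming (PyTorch 1.12.1, foss/2021a, CUDA versionsuffix), not the literal file contents; only `max_failed_tests = 0` and its comments come from the suggestion above.

```python
# Hypothetical easyconfig excerpt; field values mirror this PR's title.
name = 'PyTorch'
version = '1.12.1'
versionsuffix = '-CUDA-%(cudaver)s'

toolchain = {'name': 'foss', 'version': '2021a'}

# there should be no failing tests thanks to the included patches
# if you do see failing tests, please open an issue at
# https://github.com/easybuilders/easybuild-easyconfigs/issues
max_failed_tests = 0
```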
Two points here:
Test report by @branfosj
Test report by @branfosj
Test report by @boegel
@Flamefire For |
Details on failing |
I am suspicious about this as well. The errors I am seeing on the CUDA build all look similar to:
|
Test report by @boegel
@branfosj your log looks interesting:
Did you run this on a node with only 1 GPU? Because this test is guarded by |
It is a node with 2 GPUs; however, I was in a cgroup that only had access to 1 of them. I'm currently testing a build with access to both GPUs, and after that I'll see how those |
The 1-GPU setup is the issue. I debugged it a bit and found: the test forks, waits in the barrier, and then calls the wrapped test function, which checks the number of GPUs. Opened a bug with PyTorch: pytorch/pytorch#89686
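To make that ordering concrete, here is a simplified sketch (not PyTorch's actual test harness): worker processes are forked first, synchronize on a barrier, and only then invoke the wrapped test, so a skip decorator that checks the GPU count fires after the fork. `visible_gpus` is a stand-in for `torch.cuda.device_count()`.

```python
import multiprocessing as mp

WORLD_SIZE = 2

def visible_gpus():
    # stand-in for torch.cuda.device_count(); pretend the cgroup exposes 1 GPU
    return 1

def skip_if_lt_x_gpu(x):
    def decorator(test_fn):
        def wrapper(*args, **kwargs):
            if visible_gpus() < x:
                raise SystemExit(f"skipped: fewer than {x} GPUs visible")
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator

@skip_if_lt_x_gpu(2)
def test_body(rank):
    print(f"rank {rank} running")

def worker(rank, barrier):
    barrier.wait()   # all ranks meet here first ...
    test_body(rank)  # ... and the GPU-count check only runs afterwards

if __name__ == "__main__":
    barrier = mp.Barrier(WORLD_SIZE)
    procs = [mp.Process(target=worker, args=(r, barrier)) for r in range(WORLD_SIZE)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```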
Test report by @branfosj
Test report by @branfosj
Failure was:
Traceback (most recent call last):
File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.12.1/foss-2021a-CUDA-11.3.1/pytorch-v1.12.1/test/distributed/fsdp/test_fsdp_multiple_forward.py", line 48, in <module>
class TestMultiForward(FSDPTest):
File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.12.1/foss-2021a-CUDA-11.3.1/pytorch-v1.12.1/test/distributed/fsdp/test_fsdp_multiple_forward.py", line 73, in TestMultiForward
@skip_if_lt_x_gpu(2)
File "/dev/shm/branfosj/tmp-up-EL8/eb-b6ygbfpo/tmp4zw2uabe/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 132, in skip_if_lt_x_gpu
TEST_SKIPS[f"multi-gpu-{n}"].message)
NameError: name 'n' is not defined
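That NameError is easy to reproduce in isolation. A minimal reconstruction (illustrative names, not the literal patch contents): the decorator factory's parameter is `x`, but the body refers to `n`, and since the f-string is evaluated the moment the decorator is applied, the test module fails during class definition rather than at test time.

```python
import unittest

# simplified: the real TEST_SKIPS maps to objects with .message/.exit_code
TEST_SKIPS = {"multi-gpu-2": "need at least 2 GPUs"}

def skip_if_lt_x_gpu(x):
    # BUG: the body references `n`, which is never defined; it should be `x`.
    # The f-string runs as soon as the decorator is applied, so merely
    # importing the test file blows up with the NameError seen above.
    return unittest.skip(TEST_SKIPS[f"multi-gpu-{n}"])

@skip_if_lt_x_gpu(2)  # raises NameError: name 'n' is not defined
def test_something():
    pass
```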
easybuild/easyconfigs/p/PyTorch/PyTorch-1.12.1_fix-skip-decorators.patch
Force-pushed from 168ff80 to ec03d0c
@branfosj Yes, a copy-and-paste mistake while quickly hacking out the fix before the weekend. I did a proper fix, sent a PR upstream, and updated the patch here.
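For reference, the corrected decorator has roughly this shape (again illustrative, not the verbatim upstream fix): the closure consistently uses its own parameter, and the skip path only runs when the wrapped test is actually invoked.

```python
import functools
import sys

import torch

# simplified skip table: message plus the exit code the harness maps to "skipped"
TEST_SKIPS = {"multi-gpu-2": ("need at least 2 GPUs", 75)}

def skip_if_lt_x_gpu(x):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if torch.cuda.is_available() and torch.cuda.device_count() >= x:
                return func(*args, **kwargs)
            _message, exit_code = TEST_SKIPS[f"multi-gpu-{x}"]
            sys.exit(exit_code)  # the test runner interprets this as a skip
        return wrapper
    return decorator
```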
Test report by @branfosj
Test report by @branfosj
Test report by @Flamefire
Test report by @Flamefire
Going in, thanks @Flamefire!
Test report by @Flamefire
(created using eb --new-pr)