
{ai}[foss/2022a] PyTorch v1.13.1 w/ Python 3.10.4 w/ CUDA 11.7.0 #17156

Closed

Conversation

…hes: PyTorch-1.13.1_fix-test-ops-conf.patch, PyTorch-1.13.1_no-cuda-stubs-rpath.patch, PyTorch-1.13.1_remove-flaky-test-in-testnn.patch, PyTorch-1.13.1_skip-ao-sparsity-test-without-fbgemm.patch
@branfosj branfosj changed the title {ai}[foss/2022a] PyTorch v1.13.1 w/ Python 3.10.4 {ai}[foss/2022a] PyTorch v1.13.1 w/ Python 3.10.4 w/ CUDA 11.7.0 Jan 19, 2023
@branfosj branfosj marked this pull request as draft January 19, 2023 10:09
@satishskamath
Contributor

satishskamath commented Jan 31, 2023

Hi @branfosj. Are there still issues pending in this PR?

Flamefire and others added 4 commits February 10, 2023 11:54
Update patches based on PyTorch 1.13.1
Those tests require 2 pytest plugins and a bugfix.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8006 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/d562b0460290df6c5c3cde89694c1311 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml26 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/24a30bba9da16fb44941b172986475a0 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusa11 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/bb99683f64fd9982ae70752654544d3d for a full test report.

@smoors
Contributor

smoors commented Apr 13, 2023

Test report by @smoors
FAILED
Build succeeded for 2 out of 3 (1 easyconfigs in total)
node406.hydra.os - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7282 16-Core Processor, 1 x NVIDIA NVIDIA A100-PCIE-40GB, 515.48.07, Python 3.6.8
See https://gist.github.com/smoors/cefd98a7e9a2da2b2683ffecc386d663 for a full test report.

@smoors
Contributor

smoors commented Apr 13, 2023

As discussed in the last conf call, to avoid this PR becoming stale and since the number of failed tests is limited, we decided to merge this as is and create an issue to follow up on the failing tests.

@branfosj, if you agree, can you add max_failed_tests = 10 and remove the draft label?
I'll then merge this and create the issue for the failing tests.
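(For reference, a minimal sketch of what that change to the easyconfig might look like; only the parameter name and value come from this thread, the comment is illustrative.)

# Sketch only: tolerate a limited number of failing tests when running the
# PyTorch test suite, instead of failing the whole build.
max_failed_tests = 10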

@boegel
Member

boegel commented Apr 13, 2023

@smoors Issue already created, see #17712 (I'm focusing on getting #17155 merged first)

@boegel
Member

boegel commented Apr 13, 2023

For test_ops_gradients, we probably just need to add the PyTorch-1.13.1_skip-failing-grad-test.patch as was done in 6124d4c in #17155

For the test_jit* failing tests, we should include those in excluded_tests, which is a bit more strict than just allowing 10 random tests to fail.
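(A hedged sketch of what such an exclusion could look like, assuming the architecture-keyed excluded_tests format used in earlier PyTorch easyconfigs; the exact test names to exclude would come from the reports above.)

# Sketch only: skip the flaky JIT tests on all architectures ('' key),
# while keeping architecture-specific exclusions under their own key.
excluded_tests = {
    '': [
        'test_jit',
        'test_jit_legacy',
        'test_jit_profiling',
    ],
    'POWER': [
        'distributed/rpc/test_tensorpipe_agent',
    ],
}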

@branfosj
Member Author

Yes, we should get #17155 merged first and then make sure all the patches are synced to here.

@boegel
Member

boegel commented Apr 13, 2023

test_tensorpipe_agent is failing on broadwell and zen2 in @Flamefire's tests; it is excluded in PyTorch-1.9.0*.eb, but only for POWER.

I would also skip that test for now and mention it in #17712.

@boegel
Member

boegel commented Apr 13, 2023

Failing tests on POWER9:

test_ao_sparsity failed!
test_optim failed!
test_quantization failed!
distributed/rpc/test_tensorpipe_agent failed!
test_cpp_extensions_aot_ninja failed!
test_cpp_extensions_aot_no_ninja failed!
test_cpp_extensions_open_device_registration failed!
test_cuda failed!
test_ops failed!

Let's not block this PR over that, those can be dealt with in a follow-up PR.

@smoors
Contributor

smoors commented Apr 13, 2023

For the test_jit* failing tests, we should include those in excluded_tests, which is a bit more strict than just allowing 10 random tests to fail.

True, but on the other hand I prefer to run a test and ignore the failure rather than skip the test altogether, especially if the failure is specific to an architecture other than the one I am building on.

We could add another parameter, ignored_tests, but that may be overkill...
or even ignore_tests_for_architecture=<arch>.
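(To make the idea concrete, a purely hypothetical sketch of how such a parameter might look in an easyconfig; neither name exists in EasyBuild at this point, and this merely combines the two suggestions above.)

# Hypothetical only: run these tests but ignore their failures,
# optionally restricted to a given architecture.
ignored_tests = {
    '': ['test_jit_legacy'],    # ignore failures on all architectures
    'POWER': ['test_optim'],    # ignore failures only on POWER
}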

@VRehnberg
Contributor

Test report by @VRehnberg
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis2-12 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 8 x NVIDIA Tesla T4, 520.61.05, Python 3.6.8
See https://gist.github.com/VRehnberg/eb6da3f62c2c703ca377f93735d71cc1 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis3-22 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/VRehnberg/ce43ced71ca294084e9e03b1f16a2b1c for a full test report.

@VRehnberg
Contributor

VRehnberg commented May 10, 2023

I'm seeing three failed tests:

Failed tests (suites/files):
* distributed/_shard/sharded_tensor/ops/test_linear
* distributed/rpc/test_tensorpipe_agent
* test_jit_legacy
distributed/_shard/sharded_tensor/ops/test_linear (3 total tests, errors=1)

For distributed/_shard/sharded_tensor/ops/test_linear there is one bfloat16 tensor comparison where 385.5 is compared to 385, in a few different places, for three out of four GPUs (this is on the 4xA40 node).

This is not much for bfloat16, so I'd say either skip the test or remove the absolute tolerance and increase the relative tolerance to at least $(2^0 + 2^{-8})/2^0$, which is about as much accuracy as one can expect with a 7-bit mantissa. However, I'm also confused, because unless I'm miscounting, the closest bfloat16 numbers should be 384 and 386 (i.e. 385 and 385.5 shouldn't be expressible), so there might be something I'm missing.
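(A quick way to check the bfloat16 spacing question above; this assumes any recent PyTorch installation and is not tied to this build.)

import torch

# Round a few values through bfloat16 to see which of them are representable.
# Around 385 the spacing between bfloat16 values is 2, so 385 and 385.5 are
# both rounded to a neighbouring representable value.
for val in (384.0, 385.0, 385.5, 386.0):
    bf = torch.tensor(val, dtype=torch.bfloat16)
    print(f"{val:7.2f} -> {bf.item():7.2f}")

print("bfloat16 eps:", torch.finfo(torch.bfloat16).eps)  # 0.0078125 = 2**-7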

I'll check the other tests next.

@VRehnberg
Contributor

VRehnberg commented May 10, 2023

I'm not making much sense of the rpc one. I've tracked it down to https://github.com/pytorch/pytorch/blob/v1.13.1/torch/testing/_internal/distributed/rpc/rpc_test.py#L5075; it is reminiscent of pytorch/pytorch#41474, but in this case it is not sporadic.

@boegel boegel removed this from the next release (4.7.2) milestone May 23, 2023
@boegel boegel added this to the release after 4.7.2 milestone May 23, 2023
@surak
Contributor

surak commented Jun 15, 2023

Test report by @surak
SUCCESS
Build succeeded for 0 out of 0 (1 easyconfigs in total)
haicluster1.fz-juelich.de - Linux Ubuntu 20.04, x86_64, AMD EPYC 7F72 24-Core Processor, 4 x NVIDIA NVIDIA GeForce RTX 3090, 515.65.01, Python 3.8.10
See https://gist.github.com/surak/0cf9d9fad51dc92ea82e84d50054be54 for a full test report.

@boegel boegel modified the milestones: 4.7.3, release after 4.7.3 Jul 5, 2023
@boegelbot
Collaborator

@branfosj: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/5468268082
Output from first failing test suite run:

FAIL: test__parse_easyconfig_PyTorch-1.13.1-foss-2022a-CUDA-11.7.0.eb (test.easyconfigs.easyconfigs.EasyConfigTest)
Test for easyconfig PyTorch-1.13.1-foss-2022a-CUDA-11.7.0.eb
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 1555, in innertest
    template_easyconfig_test(self, spec_path)
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 1406, in template_easyconfig_test
    self.assertTrue(os.path.isfile(patch_full), msg)
AssertionError: False is not true : Patch file /home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/p/PyTorch/PyTorch-1.13.1_increase-tolerance-test_ops.patch is available for PyTorch-1.13.1-foss-2022a-CUDA-11.7.0.eb

----------------------------------------------------------------------
Ran 17521 tests in 802.177s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.
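(For context: that CI failure means a patch file named in the easyconfig's patches list is not present in the repository. A hedged sketch of the relevant part of the easyconfig; only the file names come from this thread, the rest is illustrative.)

# Sketch only: every file listed here must exist alongside the other PyTorch
# patches under easybuild/easyconfigs/p/PyTorch/, otherwise the easyconfig
# test suite fails as in the CI run above.
patches = [
    'PyTorch-1.13.1_fix-test-ops-conf.patch',
    'PyTorch-1.13.1_no-cuda-stubs-rpath.patch',
    'PyTorch-1.13.1_increase-tolerance-test_ops.patch',  # the file the CI run could not find
]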

@branfosj
Member Author

branfosj commented Jul 5, 2023

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0203u29a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/e3f2805e2bf1d4e69c34a49bb6d3a671 for a full test report.

@boegel
Member

boegel commented Jul 6, 2023

@branfosj Should be synced with develop now that #17155 is merged

@branfosj
Member Author

branfosj commented Jul 6, 2023

test_jit_legacy, test_jit_profiling, and test_jit

The same test failed in all three test suites:

======================================================================
FAIL: test_freeze_conv_relu_fusion (jit.test_freezing.TestFrozenOptimizations)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/jit/test_freezing.py", line 2258, in test_freeze_conv_relu_fusion
    self.assertEqual(mod_eager(inp), frozen_mod(inp))
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2470, in assertEqual
    assert_equal(
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 10 / 30 (33.3%)
Greatest absolute difference: 3.057718276977539e-05 at index (2, 3, 0, 0, 0) (up to 1e-05 allowed)
Greatest relative difference: 8.758584417742737e-05 at index (0, 3, 0, 0, 0) (up to 1.3e-06 allowed)

----------------------------------------------------------------------

I'm patching this one out.

test_optim

======================================================================
FAIL: test_rprop (__main__.TestOptim)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1054, in wrapper
    fn(*args, **kwargs)
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/test_optim.py", line 1016, in test_rprop
    self._test_basic_cases(
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/test_optim.py", line 283, in _test_basic_cases
    self._test_state_dict(
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022a-CUDA-11.7.0/pytorch-v1.13.1/test/test_optim.py", line 258, in _test_state_dict
    self.assertEqual(bias, bias_cuda)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2470, in assertEqual
    assert_equal(
  File "/dev/shm/branfosj/tmp-up-EL8/eb-8fe5hmfe/tmpuilxmqz9/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 10 (10.0%)
Greatest absolute difference: 0.00010061264038085938 at index (0,) (up to 1e-05 allowed)
Greatest relative difference: 6.106863159088995e-05 at index (0,) (up to 1.3e-06 allowed)

----------------------------------------------------------------------

We skip test_optim in 1.12.x due to intermittent test failures, so I'm re-adding that test skip here.
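(For context on the tolerances in those two failures, a small illustrative sketch; the values are chosen to mimic the ~3e-5 differences reported and are not taken from the tests themselves.)

import torch

# The failures above allow rtol=1.3e-6 and atol=1e-5, which are
# torch.testing's float32 defaults.
a = torch.tensor([1.0, 2.0, 3.0])
b = a + 3e-5  # a difference of the same magnitude as in the failing tests

try:
    torch.testing.assert_close(a, b)  # fails: 3e-5 > atol + rtol * |b|
except AssertionError as err:
    print(err)

# With relaxed tolerances the same comparison passes.
torch.testing.assert_close(a, b, rtol=1e-4, atol=1e-4)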

@branfosj branfosj marked this pull request as ready for review July 6, 2023 08:34
@branfosj
Member Author

branfosj commented Jul 6, 2023

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0203u29a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/a91c14fec2ac001a1831a8229e17a14a for a full test report.

Edit:

test_jit_cuda_fuser failed! Received signal: SIGIOT

@branfosj
Member Author

branfosj commented Jul 8, 2023

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0203u31a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/11e97ea919a6ce9e6b191b8a1c5870b6 for a full test report.

@branfosj
Member Author

Closing in favour of #18305.

@branfosj branfosj closed this Jul 14, 2023
@branfosj branfosj deleted the 20230119093315_new_pr_PyTorch1131 branch October 7, 2023 14:52