
add patches to fix test issues for PyTorch 2.1.2 with foss/2023a + CUDA 12.1.1 #20156

Merged

Conversation

@Flamefire (Contributor) commented Mar 19, 2024

(created using eb --new-pr)

Fixes #19946

@casparvl (Contributor)

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/127d18ffbe2dfa016b8e2b31db21583b for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
i8001 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/03f217eae6950077bb56d51d164aecf4 for a full test report.

@jfgrimm (Member) commented Mar 23, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gpu22.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA H100 PCIe, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/d7245f91e629eca8cd8c2cc129b142de for a full test report.

@casparvl (Contributor)

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/d1a5eb6abbfedbecdc909e6d2db8076a for a full test report.

@Flamefire (Contributor, Author)

> Test report by @casparvl: FAILED. Build succeeded for 0 out of 1 (1 easyconfigs in total). gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8. See https://gist.github.com/casparvl/d1a5eb6abbfedbecdc909e6d2db8076a for a full test report.

Looks like we need to increase the allowed failures to ~10. Yours reports 8; 6 of them are caught in detail: test_Conv1d_pad_same_cuda_tf32, test_constant_specialization, test_delayed_optim_step_offload_true_no_shard, test_file_reader_no_memory_leak, test_file_reader_no_memory_leak, test_file_reader_no_memory_leak

The first is from test_nn, which is somewhat known; the last 3 are known on your machine from other runs. The other 2 (of the 4) I don't know.

The full test log might be useful to enhance the RegEx so it captures the other 2 tests too. I think it really helps to have the individual tests listed conveniently in a single place when judging a failure (see e.g. the last 3, where you can see that it is the same cause, rather than only that the "test_jit_foo", "test_jit_bar", "test_jit" files failed).
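The RegEx enhancement mentioned above could look roughly like this — a hedged sketch in plain Python (the pattern and the sample log lines are modeled on the excerpts quoted in this thread, not taken from the actual easyblock):

```python
import re

# Hypothetical pattern for pulling individual failing test names out of a
# pytest "short test summary info" section, as printed in the logs above.
FAILED_RE = re.compile(
    r"^FAILED (?:\[[0-9.]+s\] )?"      # optional per-test duration, e.g. [2.4948s]
    r"(?P<file>\S+?\.py)::"            # test file
    r"(?P<test>[\w.]+(?:::[\w.]+)*)",  # TestClass::test_name
    re.MULTILINE,
)

log = """\
FAILED [2.4948s] test_jit.py::TestScript::test_file_reader_no_memory_leak - AssertionError
FAILED [2.4620s] test_jit_legacy.py::TestScript::test_file_reader_no_memory_leak - AssertionError
"""

# Prints each failing test's file and fully qualified name.
for m in FAILED_RE.finditer(log):
    print(m.group("file"), m.group("test"))
```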

@casparvl (Contributor)

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20156 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20156 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13194

Test results coming soon (I hope)...

- notification for comment with ID 2016786483 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@casparvl (Contributor)

test_jit 1/1 (1 failed, 2388 passed, 106 skipped, 12 xfailed, 2 rerun)
test_proxy_tensor 1/1 (1 failed, 2078 passed, 613 skipped, 80 xfailed, 2 rerun)
distributed/fsdp/test_fsdp_core 1/1 (2 failed, 58 passed, 10 rerun)
test_jit_legacy 1/1 (1 failed, 2390 passed, 104 skipped, 12 xfailed, 2 rerun)
test_jit_profiling 1/1 (1 failed, 2388 passed, 106 skipped, 12 xfailed, 2 rerun)
test_nn 1/1 (1 failed, 2804 passed, 122 skipped, 3 xfailed, 2 rerun)
distributed/test_c10d_nccl 1/1 (1 unit test(s) failed)
Detailed log

test_jit
OK, we know this from #20171; I'm not too worried here, as it might be that kernel memory-leak issue I mentioned.

================================================================ FAILURES ================================================================
_______________________________________________ TestScript.test_file_reader_no_memory_leak _______________________________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/test_jit.py", line 12862, in test_file_reader_no_memory_leak
    assert peak_from_file < peak_from_string * 500
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
======================================================== short test summary info =========================================================
FAILED [2.4948s] test_jit.py::TestScript::test_file_reader_no_memory_leak - AssertionError
============================== 1 failed, 2388 passed, 106 skipped, 12 xfailed, 2 rerun in 166.02s (0:02:46) ==============================
FINISHED PRINTING LOG FILE of test_jit 1/1 (/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/test-reports/test_jit_g6he3ln_.log)
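For context, the failing assertion compares the peak memory of two code paths. A minimal illustration of that style of check — a standalone tracemalloc sketch with made-up workloads, not the PyTorch test itself — could look like:

```python
import tracemalloc

def peak_of(fn) -> int:
    """Return the peak traced allocation (in bytes) while running fn."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# Two hypothetical code paths, standing in for "load from string" vs
# "load from file" in the PyTorch test:
peak_from_string = peak_of(lambda: bytes(10_000))
peak_from_file = peak_of(lambda: bytes(1_000_000))

# The PyTorch assertion allows the file-based path a generous 500x headroom;
# a leak shows up as that path's peak blowing past the multiple.
assert peak_from_file < peak_from_string * 500
```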

test_proxy_tensor
I have the feeling this should not be there, since you included Z3 as a dependency (right?):

================================================================= RERUNS =================================================================
____________________________________________ TestSymbolicTracing.test_constant_specialization ____________________________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/test_proxy_tensor.py", line 1491, in test_constant_specialization
    tensor = make_fx(f, tracing_mode="symbolic")(torch.randn(10))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py", line 739, in wrapped
    shape_env = ShapeEnv()
                ^^^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/fx/experimental/symbolic_shapes.py", line 2116, in __init__
    if _translation_validation_enabled():
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/fx/experimental/symbolic_shapes.py", line 1483, in _translation_validation_enabled
    return translation_validation_enabled()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/fx/experimental/validator.py", line 537, in translation_validation_enabled
    assert_z3_installed_if_tv_set()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/fx/experimental/validator.py", line 546, in assert_z3_installed_if_tv_set
    assert _HAS_Z3 or not config.translation_validation, (
AssertionError: translation validation requires Z3 package. Please, either install z3-solver or disable translation validation.

To execute this test, run the following from the base repo dir:
     python test/test_proxy_tensor.py -k test_constant_specialization

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
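One way to make this failure unambiguous is to surface the underlying ImportError instead of a bare assert — a hedged sketch (not PyTorch's actual validator code):

```python
import importlib

def z3_available() -> bool:
    """Report whether `import z3` works, surfacing the real ImportError."""
    try:
        importlib.import_module("z3")
        return True
    except ImportError as err:
        # Printing the underlying error distinguishes "z3-solver not
        # installed" from e.g. an old non-Python Z3 build shadowing it.
        print(f"z3 import failed: {err}")
        return False

z3_available()
```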

distributed/fsdp/test_fsdp_core
These are ... potentially worrying:

================================================================= RERUNS =================================================================
____________________________________ TestParityWithDDP.test_delayed_optim_step_offload_true_no_shard _____________________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 506, in wrapper
    self._join_processes(fn)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 725, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 775, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 622, in run_test
    getattr(self, test_name)()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 508, in wrapper
    fn()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 356, in instantiated_test
    test(self, **param_kwargs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/distributed/fsdp/test_fsdp_core.py", line 188, in test_delayed_optim_step
    self.run_subtests(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 896, in run_subtests
    return run_subtests(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 848, in run_subtests
    test_fn(*test_args, **test_kwargs, **subtest_kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 1170, in _test_fsdp_parity
    torch.testing.assert_close(ref_loss, fsdp_loss, check_dtype=False)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Scalars are not close!

Expected -1.2194809913635254 but got -2.304675579071045.
Absolute difference: 1.0851945877075195 (up to 1e-05 allowed)
Relative difference: 0.8898823314122694 (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
     python test/distributed/fsdp/test_fsdp_core.py -k test_delayed_optim_step_offload_true_no_shard

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
______________________________________ TestParityWithDDP.test_mixture_of_experts_offload_true_none _______________________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 506, in wrapper
    self._join_processes(fn)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 725, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 775, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 622, in run_test
    getattr(self, test_name)()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 508, in wrapper
    fn()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 356, in instantiated_test
    test(self, **param_kwargs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/distributed/fsdp/test_fsdp_core.py", line 231, in test_mixture_of_experts
    self.run_subtests(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 896, in run_subtests
    return run_subtests(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 848, in run_subtests
    test_fn(*test_args, **test_kwargs, **subtest_kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 1170, in _test_fsdp_parity
    torch.testing.assert_close(ref_loss, fsdp_loss, check_dtype=False)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Scalars are not close!

Expected -3.869016647338867 but got -2.7315802574157715.
Absolute difference: 1.1374363899230957 (up to 1e-05 allowed)
Relative difference: 0.29398591259756696 (up to 1.3e-06 allowed)
_________________________ TestParityWithDDP.test_mixture_of_experts_with_delay_before_free_offload_true_no_shard _________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 506, in wrapper
    self._join_processes(fn)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 725, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 775, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 622, in run_test
    getattr(self, test_name)()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_distributed.py", line 508, in wrapper
    fn()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 356, in instantiated_test
    test(self, **param_kwargs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/distributed/fsdp/test_fsdp_core.py", line 248, in test_mixture_of_experts_with_delay_before_free
    self.run_subtests(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 896, in run_subtests
    return run_subtests(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 848, in run_subtests
    test_fn(*test_args, **test_kwargs, **subtest_kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_fsdp.py", line 1170, in _test_fsdp_parity
    torch.testing.assert_close(ref_loss, fsdp_loss, check_dtype=False)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Scalars are not close!

Expected -1.8912922143936157 but got -2.122039318084717.
Absolute difference: 0.23074710369110107 (up to 1e-05 allowed)
Relative difference: 0.12200499845291383 (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
     python test/distributed/fsdp/test_fsdp_core.py -k test_mixture_of_experts_with_delay_before_free_offload_true_no_shard

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
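For reference, the "Scalars are not close!" verdicts above follow the usual atol/rtol rule. A pure-Python restatement of the semantics — a sketch using the tolerances quoted in the log, not torch's implementation — is:

```python
def is_close(expected: float, actual: float,
             atol: float = 1e-5, rtol: float = 1.3e-6) -> bool:
    """Closeness rule in the style of torch.testing.assert_close."""
    return abs(expected - actual) <= atol + rtol * abs(expected)

# The first failing FSDP parity comparison from the log above: an absolute
# difference of ~1.085 is far outside atol + rtol*|expected|.
assert not is_close(-1.2194809913635254, -2.304675579071045)
```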

test_jit_legacy
OK, we know this from #20171; I'm not too worried here, as it might be that kernel memory-leak issue I mentioned.

================================================================ FAILURES ================================================================
_______________________________________________ TestScript.test_file_reader_no_memory_leak _______________________________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/test_jit.py", line 12862, in test_file_reader_no_memory_leak
    assert peak_from_file < peak_from_string * 500
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
======================================================== short test summary info =========================================================
FAILED [2.4620s] test_jit_legacy.py::TestScript::test_file_reader_no_memory_leak - AssertionError
============================== 1 failed, 2390 passed, 104 skipped, 12 xfailed, 2 rerun in 166.18s (0:02:46) ==============================
FINISHED PRINTING LOG FILE of test_jit_legacy 1/1 (/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/test-reports/test_jit_legacy_mhxi2r5m.log)

test_jit_profiling
OK, we know this from #20171; I'm not too worried here, as it might be that kernel memory-leak issue I mentioned.

================================================================ FAILURES ================================================================
_______________________________________________ TestScript.test_file_reader_no_memory_leak _______________________________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/test_jit.py", line 12862, in test_file_reader_no_memory_leak
    assert peak_from_file < peak_from_string * 500
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
======================================================== short test summary info =========================================================
FAILED [2.4602s] test_jit_profiling.py::TestScript::test_file_reader_no_memory_leak - AssertionError

test_nn

================================================================= RERUNS =================================================================
_________________________________________________ TestNN.test_Conv1d_pad_same_cuda_tf32 __________________________________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 2407, in wrapper
    method(*args, **kwargs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023a-CUDA-12.1.1/pytorch-v2.1.2/test/test_nn.py", line 7474, in with_tf32_on
    test.test_cuda(self, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_nn.py", line 4486, in test_cuda
    test_case.assertEqual(cpu_d_i, gpu_d_i, atol=self.precision, rtol=0, exact_dtype=False)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ofx1wu0a/tmp00_2xzah/lib/python3.11/site-packages/torch/testing/_internal/common_utils.py", line 3304, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 60 (3.3%)
Greatest absolute difference: 0.007216689253091602 at index (1, 0, 0) (up to 0.005 allowed)
Greatest relative difference: 0.0016201659975987861 at index (3, 0, 0) (up to 0 allowed)

To execute this test, run the following from the base repo dir:
     python test/test_nn.py -k test_Conv1d_pad_same_cuda_tf32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
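The TF32 case uses a different tolerance regime: an absolute tolerance only (atol=0.005 per the log, with rtol=0), and the greatest reported difference just exceeds it. A sketch of that arithmetic, assuming the quoted values:

```python
# Values quoted in the failure message above.
atol, rtol = 0.005, 0.0
greatest_abs_diff = 0.007216689253091602

# With rtol=0 the bound is just atol, so the element with this difference
# (index (1, 0, 0) in the log) is flagged as mismatched.
exceeds = greatest_abs_diff > atol + rtol * 1.0
assert exceeds
```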

@casparvl (Contributor)

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20156 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20156 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3853

Test results coming soon (I hope)...

- notification for comment with ID 2016792987 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/9a5df52bc276b0e7cbeed75b48b94327 for a full test report.

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/dd225639683307147b137315b189be0d for a full test report.

@Flamefire (Contributor, Author)

@casparvl Can you upload the compressed log of the failures (#20156 (comment)) such that the easyblock can be enhanced to also detect the 2 other failing tests by name?

@casparvl (Contributor)

I'll send it to you in a DM. I don't expect there to be much privacy-sensitive info in it, but just to be safe I won't share it with the world ;-)

@Flamefire (Contributor, Author)

Thanks.

As for the failures:

  • test_file_reader_no_memory_leak -> OK
  • test_proxy_tensor - An import error. Certainly strange, but they messed up the check such that it isn't clear whether it really is z3 failing to import or something else. Try sourcing the --dump-env output and running python -c 'import z3' to see whether it is z3 or something else.
  • distributed/fsdp/test_fsdp_core - True, those are potential issues and they seem to be related. But as most others work, I'd still dismiss them.
  • test_nn (especially test_Conv1d_pad_same_cuda_tf32) has known issues, so it can be ignored.

@casparvl (Contributor)

I think I know: I probably didn't rebuild Z3 after #20050, and the old one is a non-Python one...

OK, I think we have an understanding of most of these. I agree that test_fsdp_core is suspicious, but I'd also dismiss it since the others work.

I'll merge this PR: there are sufficient successful test reports, and we have a reasonable understanding of the tests that fail on my system.

@casparvl (Contributor) left a review comment:

LGTM

@casparvl casparvl added this to the release after 4.9.0 milestone Mar 26, 2024
@casparvl (Contributor)

Going in, thanks @Flamefire!

@casparvl merged commit ffde1f5 into easybuilders:develop on Mar 26, 2024
9 checks passed
@Flamefire deleted the 20240319165505_new_pr_PyTorch212 branch on March 26, 2024 at 11:23
@jfgrimm (Member) commented Mar 28, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded (with --ignore-test-failure) for 1 out of 1 (1 easyconfigs in total)
gpu25.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA H100 PCIe, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/b0a1f84ef02126a1229ff263966a163d for a full test report.

@boegel changed the title from "Fix test issues in PyTorch-2.1.2-foss-2023a-CUDA-12.1.1" to "add patches to fix test issues for PyTorch 2.1.2 with foss/2023a + CUDA 12.1.1" on Apr 4, 2024
Successfully merging this pull request may close these issues.

Look into failed tests in PyTorch 2.1.2 w/ foss 2023a + CUDA
4 participants