
too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0 #16733

Open · boegel opened this issue Nov 24, 2022 · 5 comments

boegel (Member) commented Nov 24, 2022

On both our V100 (Intel Cascade Lake) and A100 (AMD Milan) systems (both RHEL 8.4 currently), I'm seeing too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0.

On both systems, I get "Too many failed tests (437), maximum allowed is 400" with:

WARNING: 285 test failures, 152 test errors (out of 86678):
distributions/test_constraints (2 failed, 128 passed, 2 skipped, 2 warnings)
distributed/fsdp/test_distributed_checkpoint (2 total tests, failures=2)
distributed/fsdp/test_fsdp_apply (3 total tests, failures=3)
distributed/fsdp/test_fsdp_input (2 total tests, failures=2)
distributed/fsdp/test_fsdp_meta (14 total tests, failures=14)
distributed/fsdp/test_fsdp_misc (9 total tests, failures=9)
distributed/fsdp/test_fsdp_mixed_precision (90 total tests, failures=88)
distributed/fsdp/test_fsdp_state_dict (61 total tests, failures=61)
distributed/fsdp/test_fsdp_summon_full_params (73 total tests, failures=65)
distributions/test_distributions (219 total tests, failures=1)
test_autograd (484 total tests, failures=1, skipped=16, expected failures=2)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)
test_jit_cuda_fuser (147 total tests, errors=1, skipped=19)
test_jit_legacy (2661 total tests, failures=12, errors=8, skipped=133, expected failures=7)
test_jit_profiling (2661 total tests, failures=12, errors=7, skipped=135, expected failures=7)
test_ops_gradients (6968 total tests, errors=1, skipped=3597, expected failures=85)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=3, errors=40, skipped=51)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=131)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)
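
For context, the 400 limit is the threshold set in the easyconfig. A minimal sketch, assuming the PyTorch easyblock's custom max_failed_tests parameter is what enforces it (only the relevant lines shown):

    # sketch of the relevant part of PyTorch-1.12.0-foss-2022a-CUDA-11.7.0.eb
    name = 'PyTorch'
    version = '1.12.0'
    versionsuffix = '-CUDA-%(cudaver)s'

    # abort if more than this many tests fail or error during the test step
    max_failed_tests = 400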

This failure count seems to be significantly higher than what @casparvl and @smoors observed in #15924 (although I guess not all test reports were using the enhanced PyTorch easyblock from easybuilders/easybuild-easyblocks#2803, which counts failing tests correctly), so I'm a bit puzzled here...
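
To illustrate what "counts failing tests correctly" boils down to, here is a rough, hypothetical sketch (not the actual easyblock code) of tallying the per-suite summary lines shown above:

    import re

    # Hypothetical illustration only: sum the failure/error counts from summary
    # lines like "test_fsdp_mixed_precision (90 total tests, failures=88)",
    # while not counting "expected failures=N".
    FAILURES = re.compile(r'(?<!expected )failures=(\d+)')
    ERRORS = re.compile(r'(?<![a-z])errors=(\d+)')

    def count_failed_tests(summary_lines):
        failures = sum(int(n) for line in summary_lines for n in FAILURES.findall(line))
        errors = sum(int(n) for line in summary_lines for n in ERRORS.findall(line))
        return failures, errors

The real logic in the easyblock handles more output formats (e.g. the "2 failed, 128 passed" style of distributions/test_constraints); this only shows the general idea.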

@Flamefire Do some of these failing tests happen to ring a bell for you?
In #15924 you mentioned that you have some patches lined up for PyTorch 1.12.x (but perhaps we need to get #16453 and #16484 merged first?).

boegel added this to the 4.x milestone on Nov 24, 2022
boegel changed the title from "too many test failures for PyTorch/1.12.0/foss-2022a-CUDA-11.7.0" to "too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0" on Nov 24, 2022
casparvl (Contributor) commented

@boegel at your request: on our system with 4x A100 per node and Intel CPUs:

== 2022-10-30 05:06:46,214 pytorch.py:344 WARNING 41 test failures, 152 test errors (out of 86678):
distributions/test_constraints (2 failed, 128 passed, 2 skipped, 2 warnings)
distributed/_shard/sharded_tensor/test_sharded_tensor (58 total tests, errors=1)
distributions/test_distributions (219 total tests, failures=1)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2661 total tests, failures=12, errors=7, skipped=127, expected failures=7)
test_jit_cuda_fuser (147 total tests, errors=1, skipped=18)
test_jit_legacy (2661 total tests, failures=12, errors=8, skipped=125, expected failures=7)
test_jit_profiling (2661 total tests, failures=12, errors=7, skipped=127, expected failures=7)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=3, errors=40, skipped=47)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=129)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)
test_torch (853 total tests, failures=1, skipped=65)

On our other cluster, the build node has 4x Titan V GPUs, and there the test suite produced:

== 2022-10-30 17:34:45,721 pytorch.py:344 WARNING 39 test failures, 146 test errors (out of 86183):
distributions/test_distributions (219 total tests, failures=1, skipped=5)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2656 total tests, failures=12, errors=7, skipped=174, expected failures=7)
test_jit_legacy (2656 total tests, failures=12, errors=8, skipped=169, expected failures=7)
test_jit_profiling (2656 total tests, failures=12, errors=7, skipped=174, expected failures=7)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=2, errors=36, skipped=75)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=133)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)

Flamefire (Contributor) commented

@Flamefire Do some of these failing tests happen to ring a bell for you?
In #15924 you mentioned that you have some patches lined up for PyTorch 1.12.x (but perhaps we need to get #16453 and #16484 merged first?).

Yes: PyTorch 1.12 is not compatible with Python 3.10 yet, so most of the test failures are real and caused by that incompatibility.
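
As a generic illustration of the kind of change in Python 3.10 that breaks older code (not necessarily the exact failure mode hit by PyTorch 1.12): the deprecated ABC aliases were removed from the collections module, so imports that worked up to Python 3.9 now fail outright.

    # Generic example only; the actual PyTorch 1.12 issues are addressed by the
    # Python 3.10 compatibility patches mentioned below.
    try:
        from collections import Callable      # works up to Python 3.9 (deprecated)
    except ImportError:                       # removed in Python 3.10
        from collections.abc import Callable

    def apply(fn, value):
        assert isinstance(fn, Callable)
        return fn(value)

    print(apply(str.upper, "ok"))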

#16453 fixes a bunch of actual failures, especially PPC-related ones but also a few others, while #16484 (I'm still working on the last 2 tests) adds patches fixing Python 3.10 (and CUDA 11.7) compatibility and also includes the ones from the former PR.

On that topic: I liked the old way of reporting failing test suites/files (e.g. "test_jit_profiling") better. While working on the above 2 PRs I noticed that many sub-test failures within the same file can be fixed by a single patch, so that output was IMO more useful for investigating and reproducing failures manually, and we can exclude whole test suites/files with the easyconfig parameter I added a while ago (sketched below).
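
A minimal sketch of what I mean, assuming the excluded_tests custom easyconfig parameter (the suite names are only an example, not a recommendation):

    # sketch: skip entire test suites/files in the easyconfig;
    # the dict is keyed by CPU architecture, '' applies to all architectures
    excluded_tests = {
        '': [
            'distributed/fsdp/test_fsdp_mixed_precision',
            'test_jit_profiling',
        ],
    }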

So I would:

boegel (Member, Author) commented Nov 24, 2022

@Flamefire How come @casparvl isn't seeing a whole bunch of those errors though, if they largely boil down to incompatibilities with Python 3.10?
We're also not seeing those errors for the CPU-only installations of PyTorch/1.12.0-foss-2022a:

== 2022-11-22 19:33:17,511 pytorch.py:344 WARNING 39 test failures, 147 test errors (out of 86167):
distributions/test_distributions (219 total tests, failures=1, skipped=5)
test_fx (924 total tests, errors=10, skipped=190, expected failures=6)
test_jit (2656 total tests, failures=12, errors=7, skipped=174, expected failures=7)
test_jit_legacy (2656 total tests, failures=12, errors=8, skipped=169, expected failures=7)
test_jit_profiling (2656 total tests, failures=12, errors=7, skipped=174, expected failures=7)
test_package (131 total tests, errors=46, skipped=23)
test_quantization (877 total tests, failures=2, errors=37, skipped=75)
test_reductions (2895 total tests, errors=5, skipped=104, expected failures=49)
test_sort_and_select (91 total tests, errors=1, skipped=13)
test_sparse (1268 total tests, errors=1, skipped=133)
test_tensor_creation_ops (546 total tests, errors=25, skipped=60)

That shows a very similar result to what @casparvl shared, yet for our GPU installations I'm getting way more failing tests...
@casparvl Did you do the installation with a single GPU available (in a Slurm job), or with all 4 GPUs available for running the tests?

Flamefire (Contributor) commented

The CPU version may simply not hit the code paths affected by Python 3.10. Also, since some tests depend on the number of GPUs, there may be more or fewer such failures. For the rest I'd need more info, but I'd suggest trying out the version I'm currently fixing once it's ready and checking whether the failures are gone before we spend time guessing.
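
To illustrate the GPU-count dependence: PyTorch's distributed/FSDP tests typically skip themselves when fewer devices are visible than they need, roughly along these lines (illustrative only, not the literal PyTorch test code):

    import unittest
    import torch

    # Illustrative only: a test that skips itself on nodes with fewer than 2 GPUs,
    # which is why the same suite can report different totals on a 1-GPU Slurm
    # allocation vs a 4-GPU build node.
    class ExampleDistributedTest(unittest.TestCase):
        @unittest.skipIf(torch.cuda.device_count() < 2, "needs at least 2 GPUs")
        def test_needs_two_gpus(self):
            self.assertTrue(torch.cuda.is_available())

    if __name__ == '__main__':
        unittest.main()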

Flamefire (Contributor) commented

FWIW: this is from my working document listing the failing tests (i.e. files) and which patch fixes each:

    * distributed/fsdp/test_fsdp_pure_fp16      - PyTorch-1.11.0_fix-fsdp-fp16-test.patch
    * distributed/rpc/cuda/test_tensorpipe_agent - ?
    * distributed/rpc/test_tensorpipe_agent     - PyTorch-1.12.1_fix-use-after-free-in-tensorpipe-agent.patch
    * distributions/test_constraints
    * distributions/test_distributions          - PyTorch-1.12.1_fix-test_wishart_log_prob.patch
    * test_ao_sparsity                          - PyTorch-1.12.1_skip-ao-sparsity-test-without-fbgemm.patch
    * test_cpp_extensions_aot_no_ninja          - PyTorch-1.12.1_fix-cuda-gcc-version-check.patch
    * test_cpp_extensions_jit                   - PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch
    * test_fx                                   - PyTorch-1.12.1_python-3.10-compat.patch
    * test_jit_cuda_fuser                       - PyTorch-1.12.1_fix-TestCudaFuser.test_unary_ops.patch
    * test_jit_legacy                           - PyTorch-1.12.1_python-3.10-annotation-fix.patch
    * test_jit_profiling                        - PyTorch-1.12.1_python-3.10-annotation-fix.patch
    * test_jit                                  - PyTorch-1.12.1_python-3.10-annotation-fix.patch
    * test_model_dump                           - PyTorch-1.10.0_fix-test-model_dump.patch
    * test_nn                                   - PyTorch-1.12.1_fix-vsx-loadu.patch
    * test_ops_gradients                        - PyTorch-1.12.1_skip-failing-grad-test.patch
    * test_ops                                  - PyTorch-1.12.1_increase-tolerance-test_ops.patch
    * test_optim                                - PyTorch-1.12.1_increase-test-adadelta-tolerance.patch
    * test_quantization                         - PyTorch-1.12.1_python-3.10-annotation.patch
    * test_reductions                           - PyTorch-1.12.1_python-3.10-compat.patch
    * test_sort_and_select                      - PyTorch-1.12.1_python-3.10-compat.patch
    * test_sparse                               - PyTorch-1.12.1_python-3.10-compat.patch
    * test_tensor_creation_ops                  - PyTorch-1.12.1_python-3.10-compat.patch
    * test_torch                                - PyTorch-1.12.1_fix-TestTorch.test_to.patch
    * test_unary_ufuncs                         - PyTorch-1.12.1_fix-vsx-vector-funcs.patch
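
In the easyconfig these end up in the patches list; a sketch showing just a few of the entries above (the full set is what #16484 is collecting):

    # sketch: wiring (a subset of) the fixes above into the easyconfig
    patches = [
        'PyTorch-1.12.1_python-3.10-compat.patch',
        'PyTorch-1.12.1_python-3.10-annotation-fix.patch',
        'PyTorch-1.12.1_fix-use-after-free-in-tensorpipe-agent.patch',
        'PyTorch-1.11.0_fix-fsdp-fp16-test.patch',
    ]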
