too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0 #16733
@boegel at your request, on our system with 4x A100 GPUs per node and Intel CPUs:
On our other cluster, we have 4x Titan Vs in our build node, and there the test suite produced:
Yes: PyTorch 1.12 is not compatible with Python 3.10 yet, so most of the test failures are real and caused by that incompatibility. #16453 fixes a bunch of actual failures, especially ones related to PPC but also a few others, while #16484 (still working on the last 2 tests) has patches fixing the Python 3.10 (and also CUDA 11.7) compatibility, plus the ones from the former PR.
On that topic: I liked the old way of reporting failing test suites/files (e.g. "test_jit_profiling") better, because while working on the above two PRs I noticed that many sub-test failures (i.e. within the same file) can often be fixed by a single patch. That output was IMO more useful for investigating and reproducing failures manually, and we can exclude whole test suites/files with the easyconfig parameter I added a while ago. So I would:
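For illustration, a minimal sketch of what excluding whole test suites/files could look like in an easyconfig, assuming the `excluded_tests` custom parameter of the PyTorch easyblock (keyed by architecture, with `''` applying everywhere); the entries below are placeholders, not a vetted exclusion list:

```python
# Sketch of an easyconfig snippet using the PyTorch easyblock's
# 'excluded_tests' parameter to skip entire test files rather than
# individual sub-tests. Entries here are illustrative only.
excluded_tests = {
    '': [
        # suite mentioned above as an example of the old-style reporting
        'test_jit_profiling',
    ],
    'POWER': [
        # hypothetical example of an arch-specific (PPC) exclusion
        'test_cpp_extensions_jit',
    ],
}
```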
@Flamefire How come @casparvl isn't seeing a whole bunch of those errors though, if they largely boil down to incompatibilities with Python 3.10?
That shows a very similar result to what @casparvl shared, yet for our GPU installations I'm getting way more failing tests...
The CPU version may avoid running into the code paths intended for Python 3.10. Also, since some tests depend on the number of GPUs, there may be more or fewer such failures. For the rest I'd need more info, but I'd suggest trying out the version I'm currently fixing once it is ready, and checking whether the failures are gone, before we spend time guessing.
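As a rough illustration of why the GPU count changes the tally: many PyTorch tests skip themselves based on `torch.cuda.device_count()`, so a 4-GPU node and a CPU-only build node run different subsets. A minimal sketch of that pattern (the test case itself is made up for illustration):

```python
# Sketch of the skip pattern common in PyTorch's test suite: a test
# requiring multiple GPUs is skipped outright when fewer are visible.
import unittest

import torch

NUM_GPUS = torch.cuda.device_count() if torch.cuda.is_available() else 0

class MultiGpuExample(unittest.TestCase):  # hypothetical test case
    @unittest.skipIf(NUM_GPUS < 2, "requires at least 2 GPUs")
    def test_needs_two_gpus(self):
        # trivial placeholder for a real multi-GPU check
        x = torch.ones(4, device="cuda:0")
        y = x.to("cuda:1")
        self.assertTrue(torch.equal(x.cpu(), y.cpu()))

if __name__ == "__main__":
    unittest.main()
```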
FWIW: this is from my working document listing the failing tests (i.e. files) and which patch fixes each:
On both our V100 (Intel Cascade Lake) and A100 (AMD Milan) systems (both currently on RHEL 8.4), I'm seeing too many test failures for PyTorch/1.12.0-foss-2022a-CUDA-11.7.0. On both systems, I get

Too many failed tests (437), maximum allowed is 400

with:

That seems to be significantly more than what @casparvl and @smoors observed in #15924 (although not all test reports were made with the enhanced PyTorch easyblock from easybuilders/easybuild-easyblocks#2803, which counts failing tests correctly, I guess), so I'm a bit puzzled here...
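For context, a sketch of where that limit comes from, assuming the enhanced PyTorch easyblock exposes the threshold as a `max_failed_tests` easyconfig parameter (raising it is only a debugging aid; the real fix is patching the failing tests):

```python
# Sketch: the threshold behind the
# "Too many failed tests (437), maximum allowed is 400" error,
# assuming a 'max_failed_tests' parameter in the enhanced easyblock.
max_failed_tests = 400
```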
@Flamefire Do some of these failing tests happen to ring a bell for you?
In #15924 you mentioned that you have some patches lined up for PyTorch 1.12.x (but perhaps we need to get #16453 and #16484 merged first?).