{ai}[foss/2022b] PyTorch v2.1.2 #19445

@Flamefire Flamefire commented Dec 19, 2023

(created using eb --new-pr)

Update over #19087

….1.0-GCCcore-12.2.0.eb, Z3-4.12.2-GCCcore-12.2.0-Python-3.10.8.eb and patches: PyTorch-2.0.1_avoid-test_quantization-failures.patch, PyTorch-2.0.1_fix-skip-decorators.patch, PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch, PyTorch-2.0.1_fix-vsx-loadu.patch, PyTorch-2.0.1_no-cuda-stubs-rpath.patch, PyTorch-2.0.1_skip-failing-gradtest.patch, PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch, PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch, PyTorch-2.1.0_fix-validationError-output-test.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch, PyTorch-2.1.0_increase-tolerance-functorch-test_vmapvjpvjp.patch, PyTorch-2.1.0_remove-sparse-csr-nnz-overflow-test.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.1.0_skip-diff-test-on-ppc.patch, PyTorch-2.1.0_skip-dynamo-test_predispatch.patch, PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch, PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch, PyTorch-2.1.0_skip-test_wrap_bad.patch
@Flamefire Flamefire marked this pull request as draft December 19, 2023 12:20
@Flamefire Flamefire changed the title from "{tools}[GCCcore/12.2.0] PyTorch v2.1.2, pytest-flakefinder v1.1.0, Z3 v4.12.2 w/ Python 3.10.8" to "{ai}[foss/2022b] PyTorch v2.1.2" Dec 19, 2023
@casparvl
Contributor

Test report by @casparvl
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 535.104.12, Python 3.6.8
See https://gist.github.com/casparvl/0c3cafe86d827f2919f25590d11fd4d0 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
n1134 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/e5d44950d651eee8be354239eda26943 for a full test report.

@casparvl
Contributor

casparvl commented Dec 20, 2023

From the failed build on my system:

WARNING: 3 test failures, 0 test errors (out of 0):
test_jit 1/1 (1 failed, 2331 passed, 156 skipped, 12 xfailed, 2 rerun)
test_jit_legacy 1/1 (1 failed, 2336 passed, 151 skipped, 12 xfailed, 2 rerun)
test_jit_profiling 1/1 (1 failed, 2331 passed, 156 skipped, 12 xfailed, 2 rerun)
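
For anyone comparing such summaries across test reports, the per-suite counts can be pulled out mechanically. The following is only a minimal sketch, assuming the summary lines keep exactly the shape shown above; `SUMMARY`, `LINE_RE`, and `parse_summary` are hypothetical names and not part of EasyBuild or PyTorch.

```python
import re

# The three summary lines from the report above, copied verbatim.
SUMMARY = """\
test_jit 1/1 (1 failed, 2331 passed, 156 skipped, 12 xfailed, 2 rerun)
test_jit_legacy 1/1 (1 failed, 2336 passed, 151 skipped, 12 xfailed, 2 rerun)
test_jit_profiling 1/1 (1 failed, 2331 passed, 156 skipped, 12 xfailed, 2 rerun)
"""

# Assumed line shape: "<suite> <shard>/<shards> (<n> <label>, <n> <label>, ...)"
LINE_RE = re.compile(r"^(?P<suite>\S+) \d+/\d+ \((?P<counts>[^)]*)\)$")


def parse_summary(text):
    """Map each suite name to a dict of outcome label -> count."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a summary line
        counts = {}
        for item in match.group("counts").split(","):
            number, label = item.split()
            counts[label] = int(number)
        results[match.group("suite")] = counts
    return results


results = parse_summary(SUMMARY)
total_failed = sum(counts.get("failed", 0) for counts in results.values())
print(total_failed)  # 3, matching "3 test failures" in the warning line
```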

Details:

========================================== FAILURES ===========================================
_________________________ TestScript.test_file_reader_no_memory_leak __________________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b/pytorch-v2.1.2/test/test_jit.py", line 12862, in test_file_reader_no_memory_leak
    assert peak_from_file < peak_from_string * 500
AssertionError
=================================== short test summary info ===================================
FAILED [1.5805s] test_jit_profiling.py::TestScript::test_file_reader_no_memory_leak - Assert...
======== 1 failed, 2331 passed, 156 skipped, 12 xfailed, 2 rerun in 146.47s (0:02:26) =========
FINISHED PRINTING LOG FILE of test_jit_profiling 1/1 (/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b/pytorch-v2.1.2/test/test-reports/test_jit_profiling_lt0riar3.log)

test_jit_profiling 1/1 failed!
========================================== FAILURES ===========================================
_________________________ TestScript.test_file_reader_no_memory_leak __________________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b/pytorch-v2.1.2/test/test_jit.py", line 12862, in test_file_reader_no_memory_leak
    assert peak_from_file < peak_from_string * 500
AssertionError
=================================== short test summary info ===================================
FAILED [1.6039s] test_jit_legacy.py::TestScript::test_file_reader_no_memory_leak - Assertion...
======== 1 failed, 2336 passed, 151 skipped, 12 xfailed, 2 rerun in 145.37s (0:02:25) =========
FINISHED PRINTING LOG FILE of test_jit_legacy 1/1 (/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b/pytorch-v2.1.2/test/test-reports/test_jit_legacy_spnm3s0r.log)
========================================== FAILURES ===========================================
_________________________ TestScript.test_file_reader_no_memory_leak __________________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b/pytorch-v2.1.2/test/test_jit.py", line 12862, in test_file_reader_no_memory_leak
    assert peak_from_file < peak_from_string * 500
AssertionError
=================================== short test summary info ===================================
FAILED [1.5056s] test_jit.py::TestScript::test_file_reader_no_memory_leak - AssertionError
======== 1 failed, 2331 passed, 156 skipped, 12 xfailed, 2 rerun in 144.05s (0:02:24) =========
FINISHED PRINTING LOG FILE of test_jit 1/1 (/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b/pytorch-v2.1.2/test/test-reports/test_jit_nm0s2w4n.log)

test_jit 1/1 failed!

This essentially seems to be the same failure three times...?
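
The assertion that fails in all three suites compares two peak-memory measurements. As a rough illustration of that kind of check, Python's standard `tracemalloc` module can record the peak traced allocation of a workload. This is a sketch under stated assumptions and not the actual PyTorch test: `peak_bytes`, `no_leak`, and `leaks` are hypothetical names, and the workloads are toy stand-ins.

```python
import tracemalloc


def peak_bytes(workload, iterations=1000):
    """Return the peak traced allocation (in bytes) while calling workload repeatedly."""
    tracemalloc.start()
    try:
        for _ in range(iterations):
            workload()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def no_leak():
    # The buffer is released when the function returns.
    data = bytearray(1024)
    return len(data)


leaked = []


def leaks():
    # The buffer is retained in a module-level list, so memory grows per call.
    leaked.append(bytearray(1024))


baseline = peak_bytes(no_leak)
suspect = peak_bytes(leaks)
print(f"baseline peak: {baseline} bytes, suspect peak: {suspect} bytes")
```

A check of the form `assert suspect < baseline * factor` is sensitive to how small the baseline peak happens to be on a given system, so the chosen factor determines how robust the comparison is.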

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
i8023 - Linux Rocky Linux 8.7, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/Flamefire/d1255461853e69d1e977edd2b3a2e9b4 for a full test report.

@Flamefire
Contributor Author

Marking this as ready, as it doesn't seem to fail more often than 2.1.0, so this might be better than #19087.

@Flamefire Flamefire marked this pull request as ready for review December 21, 2023 15:55
@SebastianAchilles SebastianAchilles added this to the 4.x milestone Dec 22, 2023
@SebastianAchilles
Member

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on login1

PR test command 'EB_PR=19445 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19445 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12459

Test results coming soon (I hope)...

- notification for comment with ID 1867587945 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 1 out of 1 (3 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 525.147.05, Python 3.10.12
See https://gist.github.com/akesandgren/7cadb59b725790cb0cb0e3031126861e for a full test report.

@akesandgren
Contributor

akesandgren commented Dec 22, 2023

Note that the PyTorch-2.1.0_skip-float16-part-test_batchnorm_nhwc_cpu.patch isn't used here.
Not sure it's needed (I didn't see any problems) but just in case...

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen2
CORE_CNT=16

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=19445 EB_ARGS= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_19445 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3961

Test results coming soon (I hope)...

- notification for comment with ID 1867911921 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/10b5d5fc6265ca49e0d0214ce15201a7 for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
zen2-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, AMD EPYC 7452 32-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/SebastianAchilles/65766fdcb1837744ab864dad4364b0e4 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/8cf365ac1dc6ec87050028202cc1726c for a full test report.

@Flamefire
Contributor Author

> Note that the PyTorch-2.1.0_skip-float16-part-test_batchnorm_nhwc_cpu.patch isn't used here. Not sure it's needed (I didn't see any problems) but just in case...

Yes, I intentionally removed that patch, hoping this particular failure from 2.1.0 is fixed in 2.1.2, which is one of the main reasons I made those 2.1.2 PRs.

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
jscclxc2.int.jsc-clx.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, Intel Xeon Processor (Cascadelake) (cascadelake), Python 3.9.18
See https://gist.github.com/SebastianAchilles/10c958e84ac7d00d80a3852adb1e6fbc for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/SebastianAchilles/bbed22423e57b30675b2061e387c95c5 for a full test report.

@boegel boegel modified the milestones: 4.x, next release (4.9.0?) Dec 27, 2023
Member

@SebastianAchilles SebastianAchilles left a comment

lgtm

@SebastianAchilles
Member

Going in, thanks @Flamefire!

@SebastianAchilles SebastianAchilles merged commit 3f74a51 into easybuilders:develop Dec 27, 2023
9 checks passed
@Flamefire Flamefire deleted the 20231219131854_new_pr_PyTorch212 branch December 27, 2023 10:08
6 participants