{ai}[foss/2022a] PyTorch v2.0.1 #19066

Merged


Flamefire
Contributor

(created using eb --new-pr)

…2.0.1_add-missing-vsx-vector-shift-functions.patch, PyTorch-2.0.1_avoid-test_quantization-failures.patch, PyTorch-2.0.1_disable-test-sharding.patch, PyTorch-2.0.1_fix-numpy-compat.patch, PyTorch-2.0.1_fix-shift-ops.patch, PyTorch-2.0.1_fix-skip-decorators.patch, PyTorch-2.0.1_fix-test_memory_profiler.patch, PyTorch-2.0.1_fix-test-ops-conf.patch, PyTorch-2.0.1_fix-torch.compile-on-ppc.patch, PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch, PyTorch-2.0.1_fix-vsx-loadu.patch, PyTorch-2.0.1_no-cuda-stubs-rpath.patch, PyTorch-2.0.1_remove-test-requiring-online-access.patch, PyTorch-2.0.1_skip-diff-test-on-ppc.patch, PyTorch-2.0.1_skip-failing-gradtest.patch, PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch
@jfgrimm

This comment was marked as outdated.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX, 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/9d9ef6f2149c6f32d63ae0c0bb95364f for a full test report.

@boegel
Member

boegel commented Oct 25, 2023

@boegelbot please test @ jsc-zen2
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=19066 EB_ARGS= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_19066 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3640

Test results coming soon (I hope)...

- notification for comment with ID 1778707471 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Micket Micket added the update label Oct 25, 2023
@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsczen2g1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/6c8f0ed02d28e91f34b96ab6824d86f9 for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
zen2-rockylinux-88 - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7452 32-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/SebastianAchilles/8a7ebe899a6aae5ae612763d13dd7dc7 for a full test report.

@SebastianAchilles SebastianAchilles added this to the 4.x milestone Oct 25, 2023
@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0105u03a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/639f0f4c14e5b35f5d1a30ffe7764d24 for a full test report.

@boegel
Member

boegel commented Oct 25, 2023

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3100.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/boegel/c427b409621702145c5db62c883e0213 for a full test report.

@boegel
Member

boegel commented Oct 26, 2023

@branfosj Can you dig into the log file and extract more details on the failing inductor/test_torchinductor_opinfo?

My vote goes to ignoring this test for now, so we can merge this PR and follow up in another PR to get that (quirky?) test fixed, since we've seen success on a range of different systems here (incl. POWER!).

@branfosj
Member

@branfosj Can you dig into the log file and extract more details on the failing inductor/test_torchinductor_opinfo?

FAILED inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_index_add_cpu_float16 - RuntimeError: unexpected success index_add, torch.float16, cpu
FAILED inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_scatter_add_cpu_float16 - RuntimeError: unexpected success scatter_add, torch.float16, cpu
FAILED inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_scatter_reduce_sum_cpu_float16 - RuntimeError: unexpected success scatter_reduce.sum, torch.float16, cpu

With the traceback being short:

Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/2.0.1/foss-2022a/pytorch-v2.0.1/test/inductor/test_torchinductor_opinfo.py", line 606, in test_comprehensive
    raise RuntimeError(
RuntimeError: unexpected success scatter_reduce.sum, torch.float16, cpu

@Flamefire
Contributor Author

I added a patch that disables the check which raises on unexpected success in this test (and similar checks). It does so by setting a flag, which could also be set via an env var. This should make the test succeed without any potential influence on other tests. See https://github.com/pytorch/pytorch/blob/v2.0.1/test/inductor/test_torchinductor_opinfo.py#L605-L608

I think with that patch added we can consider the report a success and merge this without another test (I verified with --stop=patch that the patch applies).
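
For readers unfamiliar with this kind of check, here is a minimal sketch of a flag-gated "unexpected success" guard like the one described above. It is not the PyTorch test code: the env var name `DEMO_FAIL_ON_SUCCESS` and the helper `run_expected_failure` are hypothetical and only illustrate the idea that a test listed as an expected failure raises if it passes, unless the flag turns that off.

```python
import os

# Hypothetical env var name, for illustration only (not taken from PyTorch).
# The strict check is on by default and is switched off by setting it to "0".
FAIL_ON_SUCCESS = os.getenv("DEMO_FAIL_ON_SUCCESS", "1") == "1"


def run_expected_failure(name, fn, fail_on_success=FAIL_ON_SUCCESS):
    """Run `fn`, which is listed as an expected failure.

    Returns "xfail" if `fn` raises as expected.
    If `fn` passes, raises RuntimeError when the strict check is enabled,
    otherwise returns "unexpected-success".
    """
    try:
        fn()
    except Exception:
        return "xfail"
    # Reaching this point means the test passed although it was expected to fail.
    if fail_on_success:
        raise RuntimeError(f"unexpected success {name}")
    return "unexpected-success"
```

With the flag disabled, a test that starts passing on some hardware is simply reported instead of failing the whole run, which matches the intent of letting only this one check be relaxed.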

@branfosj branfosj modified the milestones: 4.x, next release (4.8.2?) Oct 26, 2023
@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u03a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/f038e8f0042ddb24153dca4fe2ae31f0 for a full test report.

@branfosj
Member

Going in, thanks @Flamefire!

@branfosj branfosj merged commit c07c4b1 into easybuilders:develop Oct 26, 2023
5 checks passed
@Flamefire Flamefire deleted the 20231024093827_new_pr_PyTorch201 branch October 26, 2023 18:13
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
taurusi8018 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/d48255954eeb2cad36dec3a26d543612 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis-s1 - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/VRehnberg/088e9766b72bdfeec1018b5e4ccd9b19 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 3 out of 3 (1 easyconfigs in total)
alvis-c1 - Linux Rocky Linux 8.8, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/VRehnberg/42502f8695413eceea214c62a822f421 for a full test report.
