{lib}[GCCcore/12.2.0,foss/2022b] PyTorch v1.13.1, cuDNN v8.5.0.96, magma v2.7.1, ... w/ CUDA 11.7.0 #18853

Merged

Contributor

@Flamefire Flamefire commented Sep 22, 2023

(created using eb --new-pr)

Requires a compiler patch or PyTorch will fail to be compiled by NVCC:

Requires a pybind11 fix for JIT compilation:

….5.0.96-CUDA-11.7.0.eb, magma-2.7.1-foss-2022b-CUDA-11.7.0.eb, NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb, UCX-CUDA-1.13.1-GCCcore-12.2.0-CUDA-11.7.0.eb and patches: PyTorch-1.13.1_disable-test-sharding.patch, PyTorch-1.13.1_fix-duplicate-kDefaultTimeout-definition.patch, PyTorch-1.13.1_fix-flaky-jit-test.patch, PyTorch-1.13.1_fix-fsdp-fp16-test.patch, PyTorch-1.13.1_fix-fsdp-tp-integration-test.patch, PyTorch-1.13.1_fix-gcc-12-missing-includes.patch, PyTorch-1.13.1_fix-gcc-12-warning-in-fbgemm.patch, PyTorch-1.13.1_fix-kineto-crash-on-exit.patch, PyTorch-1.13.1_fix-numpy-deprecations.patch, PyTorch-1.13.1_fix-protobuf-dependency.patch, PyTorch-1.13.1_fix-pytest-args.patch, PyTorch-1.13.1_fix-python-3.11-compat.patch, PyTorch-1.13.1_fix-test-ops-conf.patch, PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch, PyTorch-1.13.1_fix-wrong-check-in-fsdp-tests.patch, PyTorch-1.13.1_increase-tolerance-test_jit.patch, PyTorch-1.13.1_increase-tolerance-test_ops.patch, PyTorch-1.13.1_increase-tolerance-test_optim.patch, PyTorch-1.13.1_install-vsx-vec-headers.patch, PyTorch-1.13.1_no-cuda-stubs-rpath.patch, PyTorch-1.13.1_remove-flaky-test-in-testnn.patch, PyTorch-1.13.1_skip-failing-grad-test.patch, PyTorch-1.13.1_skip-failing-singular-grad-test.patch, PyTorch-1.13.1_skip-test-requiring-online-access.patch, PyTorch-1.13.1_skip-tests-without-fbgemm.patch
@boegelbot

This comment was marked as outdated.

@SebastianAchilles SebastianAchilles added this to the 4.x milestone Sep 22, 2023
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (5 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX, 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/c0142341fd74698ffe6d5e3cb7d0de89 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
taurusml7 - Linux RHEL 7.6, POWER, 8335-GTX, 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/770aace436687fab5bcd446d097867dd for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (5 easyconfigs in total)
taurusi8006 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/64a355baa681a11275ad5e83c4d0d19c for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 5 (5 easyconfigs in total)
taurusml7 - Linux RHEL 7.6, POWER, 8335-GTX, 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/6ce02de13202cfb6d81bcac299a0ac7d for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (5 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX, 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/6abd6b8191247dd7a04f9917fc0f93ad for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (5 easyconfigs in total)
taurusi8010 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/c47cb55d5885adc203b26485c476da93 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
taurusml7 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/a624b47b6685e89c2733ec43c05f0c02 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (5 easyconfigs in total)
taurusi8014 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/83b2b98135deae0264e304d63e41ffbb for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
taurusml7 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/827eedf5dbc89bdc0c74604886759bba for a full test report.

@boegel
Member

boegel commented Dec 6, 2023

@Flamefire Should this be good to go now, or are you still figuring out the test failures on your end?

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 5 (5 easyconfigs in total)
i8026 - Linux Rocky Linux 8.7, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/Flamefire/fd1098ab30ea9fd4c9cc80aad2cf436b for a full test report.

@Flamefire
Contributor Author

@boegel Good to go now. The build failed because 3 tests failed while only 2 failures are allowed, but the issue seems to be driver-related and affects a rare enough use case that I just skipped those 2 tests with a patch. There seemingly was an intermittent issue with our cluster, so the test report above can be ignored. I'm trying again, but that will take some time.

However, there is a UnicodeDecodeError in the gist, so there seems to be a framework issue/bug, as that shouldn't appear in the (top-level) gist, should it?

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
i8019 - Linux Rocky Linux 8.7, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/Flamefire/024ba8f9a2c1691080ee6f6d11e1eae6 for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12
See https://gist.github.com/akesandgren/04ccf75775c7144cdfcf4c5b5b7d87e9 for a full test report.

Contributor

@akesandgren akesandgren left a comment

LGTM

@akesandgren akesandgren modified the milestones: 4.x, release after 4.9.1 Apr 4, 2024
@akesandgren
Contributor

Going in, thanks @Flamefire!

@akesandgren akesandgren merged commit da1d0f7 into easybuilders:develop Apr 4, 2024
9 checks passed
@Flamefire Flamefire deleted the 20230922150734_new_pr_PyTorch1131 branch April 4, 2024 14:50