{ai}[foss/2022b] PyTorch v1.13.1 w/ CUDA 12.0.0 #18806

Open
wants to merge 3 commits into
base: develop

@Flamefire
Contributor

Commented Sep 18, 2023

(created using eb --new-pr)

Our cluster doesn't support CUDA 12 yet (drivers too old), so can't test this.

…hes: PyTorch-1.13.1_add-cuda12-compat.patch, PyTorch-1.13.1_disable-test-sharding.patch, PyTorch-1.13.1_fix-flaky-jit-test.patch, PyTorch-1.13.1_fix-fsdp-fp16-test.patch, PyTorch-1.13.1_fix-fsdp-tp-integration-test.patch, PyTorch-1.13.1_fix-gcc-12-missing-includes.patch, PyTorch-1.13.1_fix-gcc-12-warning-in-fbgemm.patch, PyTorch-1.13.1_fix-kineto-crash-on-exit.patch, PyTorch-1.13.1_fix-numpy-deprecations.patch, PyTorch-1.13.1_fix-protobuf-dependency.patch, PyTorch-1.13.1_fix-pytest-args.patch, PyTorch-1.13.1_fix-python-3.11-compat.patch, PyTorch-1.13.1_fix-test-ops-conf.patch, PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch, PyTorch-1.13.1_fix-wrong-check-in-fsdp-tests.patch, PyTorch-1.13.1_increase-tolerance-test_jit.patch, PyTorch-1.13.1_increase-tolerance-test_ops.patch, PyTorch-1.13.1_increase-tolerance-test_optim.patch, PyTorch-1.13.1_install-vsx-vec-headers.patch, PyTorch-1.13.1_no-cuda-stubs-rpath.patch, PyTorch-1.13.1_remove-flaky-test-in-testnn.patch, PyTorch-1.13.1_skip-failing-grad-test.patch, PyTorch-1.13.1_skip-failing-singular-grad-test.patch, PyTorch-1.13.1_skip-test-requiring-online-access.patch, PyTorch-1.13.1_skip-tests-without-fbgemm.patch
@SebastianAchilles added this to the 4.x milestone Sep 18, 2023
@branfosj
Member

> Our cluster doesn't support CUDA 12 yet (drivers too old), so can't test this.

Same for me.

@Flamefire
Contributor Author

I made an attempt for a CUDA 11.7 version for 2022b: #18853

@SebastianAchilles
Member

Test report by @SebastianAchilles
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
skl-rockylinux-88 - Linux Rocky Linux 8.8, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 535.104.05, Python 3.6.8
See https://gist.github.com/SebastianAchilles/1ff9ee9f54742f3cbcab372dde4e444a for a full test report.

@Flamefire
Contributor Author

@SebastianAchilles After the merge of develop, an added patch is no longer included in the PR, so you need to update your local repo(s).

@SebastianAchilles
Member

Test report by @SebastianAchilles
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
skl-rockylinux-88 - Linux Rocky Linux 8.8, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 535.104.05, Python 3.6.8
See https://gist.github.com/SebastianAchilles/669d7e2eba7da982819a65fd59347a9a for a full test report.

@Flamefire
Contributor Author

@SebastianAchilles

> skipped 'Only runs on cuda'

Is it possible that your cluster also doesn't support CUDA 12? Otherwise I don't understand why it would skip those tests.

Check the log for something like: it didn't find any CUDA devices
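
One way to narrow this down outside the test suite is to ask PyTorch directly what it sees. The following is a minimal sketch, not part of this PR: the helper name `cuda_visibility_report` is made up for illustration, and it relies only on `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.device_count()`. It is assumed to be run with the freshly built PyTorch module loaded.

```python
import os


def cuda_visibility_report():
    """Collect basic facts about whether PyTorch can see a CUDA device.

    Hypothetical helper for illustration; not part of EasyBuild or PyTorch.
    """
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        # PyTorch is not importable in this environment.
        report["torch_imported"] = False
        return report
    report["torch_imported"] = True
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["built_for_cuda"] = torch.version.cuda
    # False if no usable driver or device is found at runtime.
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_visibility_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` is set but `cuda_available` is `False`, the build targets CUDA yet finds no usable device at runtime, which would be consistent with tests being skipped as 'Only runs on cuda'.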

@SebastianAchilles
Member

> @SebastianAchilles
>
> > skipped 'Only runs on cuda'
>
> Is it possible that your cluster also doesn't support CUDA 12? Otherwise I don't understand why it would skip those tests.

The NVIDIA driver on this machine is new and nvidia-smi reports

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

That is why I assume that it should support CUDA 12.
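
The `CUDA Version` field in the `nvidia-smi` header is the highest CUDA version the installed driver supports, so it can be compared against the toolkit version this easyconfig uses. A small sketch of that comparison, with hypothetical helper names (`parse_smi_header`, `driver_supports`) and the header line above as input:

```python
import re

# Header line copied from the nvidia-smi output quoted above.
HEADER = "| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |"


def parse_smi_header(line):
    """Return (driver_version, max_cuda_version) from an nvidia-smi header line."""
    match = re.search(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line)
    if match is None:
        raise ValueError("no driver/CUDA version found in line")
    driver, cuda = match.groups()
    return driver, tuple(int(part) for part in cuda.split("."))


def driver_supports(line, toolkit=(12, 0)):
    """True if the driver's maximum supported CUDA version is >= toolkit."""
    _, max_cuda = parse_smi_header(line)
    return max_cuda >= toolkit


print(driver_supports(HEADER))  # True
```

This only confirms that the driver is new enough for CUDA 12.0; it says nothing about whether the build itself detected the device.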

> Check the log for something like: it didn't find any CUDA devices

In the EasyBuild log file I only found a few skipped 'Need at least 2 CUDA devices' and skipped 'Need at least 4 CUDA devices' (which makes sense on a system with only 1 GPU). Did I look into the correct log file or is there another file I should look into?
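
To get an overview of why tests were skipped, the skip reasons in a log can be tallied. This is a rough sketch that assumes the log contains lines with `skipped '<reason>'` as quoted in this thread; the function name `count_skip_reasons` and the sample lines are hypothetical, and reasons that themselves contain a single quote would be cut short by this simple pattern.

```python
import re
from collections import Counter

# Matches the reason inside: skipped '<reason>'
SKIP_RE = re.compile(r"skipped '([^']*)'")


def count_skip_reasons(lines):
    """Count how often each skip reason appears in the given log lines."""
    counts = Counter()
    for line in lines:
        counts.update(SKIP_RE.findall(line))
    return counts


# Hypothetical sample lines; only the skip reasons are taken from the thread.
sample = [
    "test_a ... skipped 'Only runs on cuda'",
    "test_b ... skipped 'Need at least 2 CUDA devices'",
    "test_c ... skipped 'Only runs on cuda'",
    "test_d ... ok",
]

for reason, n in count_skip_reasons(sample).most_common():
    print(f"{n:4d}  {reason}")
```

For a real log file, the open file object can be passed in place of `sample`, since the function only iterates over lines.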

@Flamefire
Contributor Author

Then I'm out of ideas, sorry :/
