{devel}[foss/2021b] PyTorch v1.11.0 w/ Python 3.9.6 + CUDA 11.4.1 #16385

Conversation

sassy-crick (Collaborator)

(created using eb --new-pr)


# several tests are known to be flaky, and fail in some contexts (like having multiple GPUs available),
# so we allow up to 10 (out of ~90k) tests to fail before treating the installation to be faulty
max_failed_tests = 10
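
For context, a rough sketch of the idea behind this threshold (illustrative only, not the actual easyblock code; the suite names and counts are hypothetical):

# Illustrative sketch of how a max_failed_tests threshold is applied
# (hypothetical code, not the actual PyTorch easyblock):
max_failed_tests = 10
failed_counts = {
    "distributed/pipeline/sync/test_pipe": 4,   # hypothetical count
    "distributions/test_distributions": 3,      # hypothetical count
}
total_failures = sum(failed_counts.values())
if total_failures > max_failed_tests:
    raise RuntimeError(f"Too many failed tests ({total_failures}), maximum allowed is {max_failed_tests}")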
Contributor

With the merge of easybuilders/easybuild-easyblocks#2794 I'm guessing this will need to be higher. But let's see how many tests actually fail first; it might not be all that many, since we still patched failing tests when the original 1.11.0 easyconfig was developed :)

Collaborator Author

To (hopefully) add to this: I tried to install PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb, which originally failed.
So I added max_failed_tests = 10 to the easyconfig file and tried to install it like this:

eb --include-easyblocks-from-pr=2794  --cuda-compute-capabilities=7.5 PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb

I got:

WARNING: 0 test failure, 463 test errors (out of 57757):
distributed/pipeline/sync/skip/test_gpipe (12 skipped, 1 warning, 1 error)
distributed/pipeline/sync/skip/test_leak (1 warning, 8 errors)
distributed/pipeline/sync/test_bugs (1 skipped, 1 warning, 3 errors)
distributed/pipeline/sync/test_inplace (2 xfailed, 1 warning, 1 error)
distributed/pipeline/sync/test_pipe (1 passed, 8 skipped, 1 warning, 47 errors)
distributed/pipeline/sync/test_transparency (1 warning, 1 error)
distributed/rpc/cuda/test_tensorpipe_agent (107 total tests, errors=1)
distributed/rpc/test_faulty_agent (28 total tests, errors=28)
distributed/rpc/test_tensorpipe_agent (424 total tests, errors=412)
distributed/test_store (19 total tests, errors=1)

I guess we need to do a bit more tuning here. :-)

Member

Also, see the changes in #16339

Collaborator Author

Test report from installation.

The installation failed with:

Running test_xnnpack_integration ... [2022-10-12 02:18:04.597373]
Executing ['/sw-eb/software/Python/3.9.6-GCCcore-11.2.0/bin/python', 'test_xnnpack_integration.py', '-v'] ... [2022-10-12 02:18:04.597482]
/dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2/lib/python3.9/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /dev/shm/hpcsw/PyTorch/1.11.0/foss-2021b-CUDA-11.4.1/pytorch/c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
test_conv1d_basic (__main__.TestXNNPACKConv1dTransformPass) ... /dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:424: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /dev/shm/hpcsw/PyTorch/1.11.0/foss-2021b-CUDA-11.4.1/pytorch/c10/core/TensorImpl.h:1460.)
  return callable(*args, **kwargs)
ok
test_conv1d_with_relu_fc (__main__.TestXNNPACKConv1dTransformPass) ... skipped 'test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test'
test_conv2d (__main__.TestXNNPACKOps) ... ok
test_conv2d_transpose (__main__.TestXNNPACKOps) ... ok
test_linear (__main__.TestXNNPACKOps) ... ok
test_linear_1d_input (__main__.TestXNNPACKOps) ... ok
test_decomposed_linear (__main__.TestXNNPACKRewritePass) ... ok
test_linear (__main__.TestXNNPACKRewritePass) ... ok
test_combined_model (__main__.TestXNNPACKSerDes) ... ok
test_conv2d (__main__.TestXNNPACKSerDes) ... ok
test_conv2d_transpose (__main__.TestXNNPACKSerDes) ... ok
test_linear (__main__.TestXNNPACKSerDes) ... ok

----------------------------------------------------------------------
Ran 12 tests in 141.679s

OK (skipped=1)
distributed/pipeline/sync/skip/test_gpipe failed!
distributed/pipeline/sync/skip/test_leak failed!
distributed/pipeline/sync/test_bugs failed!
distributed/pipeline/sync/test_inplace failed!
distributed/pipeline/sync/test_pipe failed!
distributed/pipeline/sync/test_transparency failed!
distributed/rpc/cuda/test_tensorpipe_agent failed!
distributed/rpc/test_faulty_agent failed!
distributed/rpc/test_tensorpipe_agent failed!
distributed/test_store failed!
distributions/test_distributions failed!

== 2022-10-12 02:20:34,765 filetools.py:382 INFO Path /dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2 successfully removed.
== 2022-10-12 02:20:36,991 pytorch.py:344 WARNING 0 test failure, 24 test errors (out of 88784):
distributed/pipeline/sync/skip/test_gpipe (12 skipped, 1 warning, 1 error)
distributed/pipeline/sync/skip/test_leak (1 warning, 8 errors)
distributed/pipeline/sync/test_bugs (1 skipped, 1 warning, 3 errors)
distributed/pipeline/sync/test_inplace (2 xfailed, 1 warning, 1 error)
distributed/pipeline/sync/test_pipe (1 passed, 8 skipped, 1 warning, 47 errors)
distributed/pipeline/sync/test_transparency (1 warning, 1 error)
distributions/test_distributions (216 total tests, errors=3, skipped=5)

The PyTorch test suite is known to include some flaky tests, which may fail depending on the specifics of the system or the context in which they are run. For this PyTorch installation, EasyBuild allows up to 10 tests to fail. We recommend to double check that the failing tests listed above  are known to be flaky, or do not affect your intended usage of PyTorch. In case of doubt, reach out to the EasyBuild community (via GitHub, Slack, or mailing list).
== 2022-10-12 02:20:37,273 build_log.py:169 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:124 in __init__): Too many failed tests (24), maximum allowed is 10 (at easybuild/easyblocks/pytorch.py:348 in test_step)
== 2022-10-12 02:20:37,275 build_log.py:265 INFO ... (took 1 hour 58 mins 12 secs)
== 2022-10-12 02:20:37,278 filetools.py:2014 INFO Removing lock /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock...
== 2022-10-12 02:20:37,280 filetools.py:382 INFO Path /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock successfully removed.
== 2022-10-12 02:20:37,281 filetools.py:2018 INFO Lock removed: /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock
== 2022-10-12 02:20:37,281 easyblock.py:4089 WARNING build failed (first 300 chars): Too many failed tests (24), maximum allowed is 10
== 2022-10-12 02:20:37,283 easyblock.py:319 INFO Closing log for application name PyTorch version 1.11.0

That run was done before any changes were made.
Unfortunately, the log file is too large to upload here.
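
As a quick sanity check for the "CUDA unknown error" warning near the top of that log, something along these lines could be run on the GPU node with the PyTorch module loaded (illustrative sketch, not part of the test suite):

import torch
# Quick CUDA visibility check: if CUDA initialisation fails as in the warning above,
# is_available() returns False and device_count() returns 0.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_name(0))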

@casparvl (Contributor), Oct 13, 2022

I'm confused: two days ago you had 400-something errors (with pretty much all tests in distributed/rpc/test_tensorpipe_agent failing), but yesterday you had 24. What was the difference between these two runs?

Collaborator Author

Sorry for causing confusion. Two days ago I was trying out the new easyblock with PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb, which is not the easyconfig in this PR. My mistake for posting it here. Apologies.

Contributor

Ah ok, that's clear then!

On a side note, I see the recent easyblock fails to count everything properly... I've reported this in an issue and will fix it later.

@casparvl (Contributor)

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2794
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gcn2 - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 515.43.04, Python 3.6.8
See https://gist.github.com/9de07b40eab80e645037356c039a18d8 for a full test report.

@boegel boegel changed the title {devel}[foss/2021b] PyTorch v1.11.0 w/ Python 3.9.6 {devel}[foss/2021b] PyTorch v1.11.0 w/ Python 3.9.6 + CUDA 11.4.1 Oct 11, 2022
@casparvl (Contributor)

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2794
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
software2.lisa.surfsara.nl - Linux debian 10.13, x86_64, Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 4 x NVIDIA NVIDIA TITAN V, 470.103.01, Python 3.7.3
See https://gist.github.com/6a81e68e773aff30448339a435105e97 for a full test report.

@boegel boegel added this to the 4.x milestone Oct 12, 2022
@casparvl (Contributor)

On both nodes, I got:

== 2022-10-12 00:11:31,871 pytorch.py:344 WARNING 0 test failure, 6 test errors (out of 88959):
distributions/test_distributions (216 total tests, errors=4)
test_nn (3992 total tests, errors=2, skipped=1094, expected failures=28)

These systems are Intel CPU + 4x Titan V (software2), and Intel CPU + 4x A100 (gcn2). So from my point of view, this PR looks ok. But I am concerned about the high failure rates for @sassy-crick. What kind of hardware are you working on?

@casparvl (Contributor)

@boegelbot please test @ generoso

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=16385 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_16385 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9270

Test results coming soon (I hope)...

- notification for comment with ID 1277291633 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/c04310707a04ed58613880e585fab514 for a full test report.

]
tests = ['PyTorch-check-cpp-extension.py']

moduleclass = 'devel'
Contributor

We have a moduleclass 'ai' now; that's probably where this should go.

Collaborator Author

Thanks for letting me know. That had slipped my mind. I've fixed it.

@sassy-crick (Collaborator Author)

On both nodes, I got:

== 2022-10-12 00:11:31,871 pytorch.py:344 WARNING 0 test failure, 6 test errors (out of 88959):
distributions/test_distributions (216 total tests, errors=4)
test_nn (3992 total tests, errors=2, skipped=1094, expected failures=28)

These systems are intel CPU + 4x Titan V (software2), and Intel CPU + 4x A100 (gcn2). So from my point of view, this PR looks ok. But I am concerned about the high failure rates for @sassy-crick . What kind of hardware are you working on?

@casparvl
I share your concerns!
These are Skylake CPUs and RTX 6000 cards, running inside a Rocky 8.5 Singularity container (started with the --nv flag) with only 1 GPU enabled, so some tests are being skipped. More details are in the test report, if that helps.
I don't seem to be able to upload it here, hence the link.

@casparvl (Contributor)

Hm, I'm mainly concerned about

distributed/pipeline/sync/test_pipe (1 passed, 8 skipped, 1 warning, 47 errors)

Can you find the section in the output file that reports the full error for those tests and paste it here? Or if all are very similar, just paste one as an example. Maybe we can verify if it's a known issue or something...

Again, I still don't think there's anything wrong with this PR as such, so if we can't figure it out in a reasonable time, I'd still propose to merge it: it works on Kenneth's system, two of mine, and Generoso, so for many people this PR would probably work fine, and it would thus be helpful to have it merged.
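
One way to pull just those sections out of the (very large) log without uploading the whole file would be a small helper along these lines (hypothetical script; adjust the log path and test name):

# extract_test_errors.py - hypothetical helper: print some context around every
# mention of a failing test file in a large EasyBuild/PyTorch test log.
import re
import sys

log_path = sys.argv[1]                        # path to the EasyBuild log
needle = "distributed/pipeline/sync/test_pipe"

with open(log_path, errors="replace") as fh:
    text = fh.read()

for match in re.finditer(re.escape(needle), text):
    start = max(0, match.start() - 200)       # a bit of leading context
    print(text[start:match.end() + 2000])     # plus the error/traceback that follows
    print("-" * 70)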

@sassy-crick (Collaborator Author)

sassy-crick commented Oct 17, 2022

Update on this:
After fixing an issue I had with NVIDIA inside the container (nvidia-smi worked fine, but neither deviceQueryDrv nor deviceQuery from the CUDA samples did), I tried to install this version of PyTorch. That failed because one of the test processes turned into a zombie.
So I ditched the container approach and tried to install it on the GPU node directly using this:

$ eb --dump-test-report=$PWD/testreport.md --include-easyblocks-from-pr=2794  --cuda-compute-capabilities=7.5 PyTorch-1.11.0-foss-2021b-CUDA-11.4.1.eb

but the process is hanging here again, which I observed inside the container as well:

3139276 pts/1    S+     0:06  |                   \_ /usr/bin/python3 -m easybuild.main --robot --download-timeout=1000 --modules-tool=EnvironmentModules --module-syntax=Tcl --allow-modules-tool-mismatch --hooks=/apps/eb-softwarestack/hooks//site-hooks.py --cuda-compute-capabilities=7.5 --dump-test-report=/sw-eb/scripts/modified/deeplabcut/2.2.3/testreport.md --include-easyblocks-from-pr=2794 --cuda-compute-capabilities=7.5 PyTorch-1.11.0-foss-2021b-CUDA-11.4.1.eb
3262001 pts/1    Sl+    0:02  |                       \_ /sw-eb/software/Python/3.9.6-GCCcore-11.2.0/bin/python run_test.py --continue-through-error --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_spawn test_optim test_model_dump distributed/fsdp/test_fsdp_memory distributed/fsdp/test_fsdp_overlap
4107924 pts/1    Sl+    0:02  |                           \_ /sw-eb/software/Python/3.9.6-GCCcore-11.2.0/bin/python test_multiprocessing.py -v
4107935 pts/1    S+     0:00  |                               \_ /sw-eb/software/Python/3.9.6-GCCcore-11.2.0/bin/python -s -c from multiprocessing.resource_tracker import main;main(24)
4107955 pts/1    Z+     0:01  |                               \_ [python] <defunct>

I had a look at the patches from PR #16339, but none of them seem to be really applicable; I might be wrong here. Any suggestions? Given that it is working on other systems, I am somewhat inclined to say the issue is on mine. Where would I find the log file for that process?

System info:

$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterprise
Description:    Red Hat Enterprise Linux release 8.5 (Ootpa)
Release:        8.5
Codename:       Ootpa

EasyBuild:
eb-4.6.1

$ lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
Stepping:            4

@casparvl (Contributor)

I'm afraid I don't immediately know what's going on here. But one of the things you could try is running this test individually. What I did when developing PyTorch-1.11.0-foss-2021a-CUDA-11.3.1.eb and debugging tests was:

  1. Build locally in my homedir, while keeping the tmpdir and builddir, and skip the test step so that the build doesn't crash (e.g. eb PyTorch-1.11.0-foss-2021a-CUDA-11.3.1.eb --skip-test-step --disable-cleanup-tmpdir --disable-cleanup-builddir)
  2. Load the module (from my homedir; it needs to be on your MODULEPATH of course) and change directory to the build directory. Add the site-packages in the corresponding tmpdir to the PYTHONPATH:
module load PyTorch/1.11.0-foss-2021a-CUDA-11.3.1
module load hypothesis/6.13.1-GCCcore-10.3.0
cd /scratch-shared/casparl/PyTorch/1.11.0/foss-2021a-CUDA-11.3.1/pytorch/test
export PYTHONPATH=/scratch-shared/casparl/eb-u986gtd2/tmpkj4e39ni/lib/python3.9/site-packages:$PYTHONPATH
  3. Run the tests, e.g. to launch an individual test:
PYTHONUNBUFFERED=1 /sw/arch/Centos8/EB_production/2021/software/Python/3.9.5-GCCcore-10.3.0/bin/python -m unittest test_linalg.TestLinalgCPU.test_norm_extreme_values_cpu -v

That allows much quicker testing. Also, I've seen differences in the past between e.g. running a test interactively on a node I had just ssh-ed into and running it in a SLURM job. For me, a test that fails in my (SLURM) build job but succeeds when run interactively points to a build that is fine, with a test that fails because of some environmental differences.

@sassy-crick (Collaborator Author)

Thanks for all your help, much appreciated!
I have included the patches from #16339, but that made no difference. I am running these jobs by ssh-ing into the node, so there is no environment from the queue.
From what I can see right now, it looks like this could be a problem with our setup. I would suggest the following:

  • test the updated easyconfig file; if that one is ok, please feel free to merge it
  • if, however, these patches are causing more problems, revert to the previously tested easyconfig file and merge that

The only thing I could not update was the way the PyTorch source code is downloaded, as my framework is not set up for that right now. Please feel free to update that.

I don't want to hold up something which actually works, but we have an issue with the node I am doing the testing on. I will keep you posted.
Again, thanks for your help and patience!

@casparvl (Contributor)

casparvl commented Oct 18, 2022

@boegelbot please test @ generoso
CORE_CNT=16
EB_ARGS="--include-easyblocks-from-pr 2803"

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=16385 EB_ARGS="--include-easyblocks-from-pr 2803" /opt/software/slurm/bin/sbatch --job-name test_PR_16385 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9308

Test results coming soon (I hope)...

- notification for comment with ID 1282388299 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2803
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/abd8088172ff6e39711c5c6ac9ef705c for a full test report.

@casparvl (Contributor)

Test report by @casparvl
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2803
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
software2.lisa.surfsara.nl - Linux debian 10.13, x86_64, Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 4 x NVIDIA NVIDIA TITAN V, 470.103.01, Python 3.7.3
See https://gist.github.com/1a9bd41325742b1733a75b7741151cf3 for a full test report.

@fizwit (Contributor)

fizwit commented Nov 16, 2022

I was able to build this with no issues using eb --from-pr 16385. Thanks for creating this for the 2021b toolchain.
PyTorch is the base for many other tools, like torchvision.
torchvision 0.13.0 has a requirement pillow!=8.3.*,>=5.3.0, but you have pillow 8.3.2.
Can you consider using Pillow 9.1.0?
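
As an aside, the version conflict can be reproduced with the packaging library, using the specifier from the torchvision requirement quoted above (illustrative check):

from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet("!=8.3.*,>=5.3.0")   # torchvision 0.13.0's Pillow requirement, as reported above
print(Version("8.3.2") in spec)          # False: the 2021b Pillow is excluded by !=8.3.*
print(Version("9.1.0") in spec)          # True: a newer Pillow satisfies the requirement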

@boegel (Member)

boegel commented Nov 24, 2022

I was able to build this with no issues using eb --from-pr 16385. Thanks for creating this for the 2021b toolchain. PyTorch is the base for many other tools, like torchvision. torchvision 0.13.0 has a requirement pillow!=8.3.*,>=5.3.0, but you have pillow 8.3.2. Can you consider using Pillow 9.1.0?

@fizwit Pillow 8.3.2 is the version we're using as a dependency in various places in easyconfigs of the 2021b generation.
If we really need to, we can diverge from that and add an exception (in the easyconfigs test suite), but that would be painful w.r.t. maximizing compatibility of modules installed with a 2021b toolchain (or subtoolchain thereof).

Is there more information available on why the !=8.3.* restriction is there in torchvision?

@boegel (Member)

boegel commented Nov 25, 2022

Test report by @boegel
SUCCESS
Build succeeded for 27 out of 27 (1 easyconfigs in total)
node3306.joltik.os - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 520.61.05, Python 3.6.8
See https://gist.github.com/b07b19a3448157d77122c6dde1005244 for a full test report.

edit:

3 test failures, 4 test errors (out of 88968):
distributed/fsdp/test_fsdp_input (2 total tests, failures=2)
distributions/test_distributions (216 total tests, errors=4)
test_autograd (464 total tests, failures=1, skipped=52, expected failures=1)

@sassy-crick (Collaborator Author)

Jumping back into this: if the tests are working at other sites, I am in favour of merging this PR, as I am not convinced that our current setup is working as expected. So please don't wait for me to fix my failing tests when it works everywhere else! :-)

@boegel (Member)

boegel commented Jan 11, 2023

@boegel So, issue a patch to the 8.3.2 Pillow easyconfig to fix an issue with torchvision 0.13.1. torchvision will also require a patch to setup.py to allow Pillow 8.3.2. Does this look ok? Pill Path

@fizwit Except for the missing checksum for the patch file, your modified Pillow-8.3.2-GCCcore-11.2.0.eb (here) looks good to me. Can you open the PR for that?

@boegel boegel modified the milestones: 4.x, next release (4.7.1?) Jan 11, 2023
boegel previously approved these changes Jan 11, 2023

@boegel (Member) left a comment:

lgtm

@boegel (Member)

boegel commented Jan 12, 2023

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3305.joltik.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 525.60.13, Python 3.6.8
See https://gist.github.com/aff84c9d838ad4da5ec32b4dca244223 for a full test report.

@boegel (Member)

boegel commented Jan 12, 2023

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3901.accelgor.os - Linux RHEL 8.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 525.60.13, Python 3.6.8
See https://gist.github.com/c80632f56b389e338a0220a5e8864f4f for a full test report.

@Flamefire (Contributor)

@boegel

distributed/fsdp/test_fsdp_input failed!
distributed/test_c10d_gloo failed!
distributions/test_distributions failed!
test_autograd failed!

I don't know about the first test, but the other 3 are likely fixed in my PyTorch 1.12 EC with (respectively):

  • PyTorch-1.12.1_skip-test_round_robin.patch
  • PyTorch-1.12.1_fix-test_wishart_log_prob.patch
  • PyTorch-1.12.1_fix-autograd-thread_shutdown-test.patch

Why add another 1.11 and not 1.12? Maybe some of these failures are already fixed in the 1.12 EC.

@boegel (Member)

boegel commented Jan 12, 2023

@Flamefire I guess we could upgrade this to PyTorch 1.12, but this PR has been sitting here for a while, so I figured I would get it merged...

There's a bug in the PyTorch easyblock w.r.t. counting the tests here though, no?
The installation fails at the test step because the failed-test counting isn't able to find a match for test_c10d_gloo, I think? We should look into fixing that imho, regardless of whether or not we stick to PyTorch 1.11 (which I'm actually in favor of: there could be situations where PyTorch 1.10 is not recent enough yet PyTorch 1.12 is too recent, and we don't have any PyTorch in the 2021b generation yet...).

@boegel (Member)

boegel commented Jan 13, 2023

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2859
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3305.joltik.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 525.60.13, Python 3.6.8
See https://gist.github.com/f94ada55e0d0a9174740c2f13072b04b for a full test report.

@Flamefire (Contributor)

FWIW: I'm testing the PyTorch 1.12 EC on foss-2021b to see what might need patching and what might be a real issue. Going to another toolchain always comes with the possibility of hitting compatibility issues which may be resolved later; e.g., IIRC 1.10 doesn't work on 2022a as it is incompatible with Python 3.10. Maybe we are seeing something similar here: taking a closer look at the failures, they don't seem to be exactly the same as on 1.12.

…g-extensions dependency (already included with Python)
@boegel (Member)

boegel commented Jan 15, 2023

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2859
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3309.joltik.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 525.60.13, Python 3.6.8
See https://gist.github.com/200f41addaca022e55bec2befdf02a49 for a full test report.

@Flamefire (Contributor)

Flamefire commented Jan 17, 2023

I'm currently testing with PyTorch 1.12 on 2021b and still have trouble. A test on an AVX2 x86 system is failing with a real issue: pytorch/pytorch#92246

The same test with the same inputs works on 2022a; it is only on 2021b that I get wrong results. I have currently traced it into XNNPACK and am still investigating what the issue is. It is possible that the 2021b toolchain has a bug somewhere.

Found it: GCC has a bug that is only fixed in 11.3 (i.e. GCC 11.0 through 11.2 are affected): https://stackoverflow.com/a/72837992/1930508
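
A quick way to check whether a given toolchain's compiler is in that affected range (illustrative sketch; assumes gcc is on the PATH and supports -dumpfullversion):

import subprocess

# Report whether the GCC in the current environment is one of the releases
# (11.0 through 11.2) affected by the bug mentioned above; 11.3 carries the fix.
version = subprocess.run(["gcc", "-dumpfullversion"],
                         capture_output=True, text=True, check=True).stdout.strip()
major, minor = (int(x) for x in version.split(".")[:2])
if (major, minor) in {(11, 0), (11, 1), (11, 2)}:
    print(f"GCC {version}: in the affected 11.0-11.2 range")
else:
    print(f"GCC {version}: outside the affected range")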


@Flamefire (Contributor)

I added PRs for PyTorch 1.12.1 on 2021b:

Maybe you can try those as a comparison. At least the CUDA version fails for me on some(?) nodes with

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasDgemv(handle, op, m, n, &alpha, a, lda, x, incx, &beta, y, incy)

So check the logs to see whether this already happens with PyTorch 1.11.0, which might hint that it isn't compatible with CUDA and/or cuBLAS 11.4.

@casparvl (Contributor)

Test report by @casparvl
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gcn31.local.snellius.surf.nl - Linux Rocky Linux 8.7, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 1 x NVIDIA NVIDIA A100-SXM4-40GB, 515.86.01, Python 3.6.8
See https://gist.github.com/ac6ea5a6fcaf78fa9cf4b2129eb32ae2 for a full test report.

@boegel (Member)

boegel commented Apr 11, 2023

@sassy-crick Since we now have easyconfigs merged for PyTorch 1.12.1 with foss/2021b, we should use those instead and close this PR?

@boegel boegel modified the milestones: next release (4.7.2), 4.x Apr 12, 2023
@boegel (Member)

boegel commented Jan 13, 2024

closing this, superseded by more recent PyTorch easyconfigs

@boegel boegel closed this Jan 13, 2024
@sassy-crick sassy-crick deleted the 20221011105155_new_pr_PyTorch1110 branch January 26, 2024 10:56