
{ai}[foss/2022b] PyTorch v2.1.2 w/ CUDA 12.0.0 #20155

Open · wants to merge 2 commits into develop from 20240319165333_new_pr_PyTorch212

Conversation

Flamefire
Contributor

(created using eb --new-pr)

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
i8002 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/2133e3af9d4d0ce0ec5ad6ef1d5faf35 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
i8002 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/df8d92fc4aecca60a3448b286b7d3d66 for a full test report.

@casparvl
Contributor

Test report by @casparvl
FAILED
Build succeeded for 3 out of 4 (1 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/d118e91d83550334542ae6a2ee0a536e for a full test report.

@Flamefire
Contributor Author

Here and in #20156, test_cpp_extensions_aot_ninja and the related test fail. Not due to a test failure, though, but due to some actual error. Can you check the log?

@casparvl
Contributor

Hm, the log contains a lot and is a bit hard to read, but I think this is the relevant part:

Error log
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /scratch-nvme/1/casparl/generic/software/CUDA/12.0.0/bin/nvcc  -I/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/include -I/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/include/TH -I/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/include/THC -I/scratch-nvme/1/casparl/generic/software/CUDA/12.0.0/include -I/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/cpp_extensions/self_compiler_include_dirs_test -I/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/include/python3.10 -c -c /gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/cpp_extensions/torch_library.cu -o /gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-310/torch_library.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1017"' -DTORCH_EXTENSION_NAME=torch_library -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_80,code=sm_80 -ccbin gcc -std=c++17
FAILED: /gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-310/torch_library.o
/scratch-nvme/1/casparl/generic/software/CUDA/12.0.0/bin/nvcc  -I/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/include -I/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/include/TH -I/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/include/THC -I/scratch-nvme/1/casparl/generic/software/CUDA/12.0.0/include -I/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/cpp_extensions/self_compiler_include_dirs_test -I/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/include/python3.10 -c -c /gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/cpp_extensions/torch_library.cu -o /gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/cpp_extensions/build/temp.linux-x86_64-cpython-310/torch_library.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1017"' -DTORCH_EXTENSION_NAME=torch_library -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_80,code=sm_80 -ccbin gcc -std=c++17
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h: In function typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&):
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h:45:120: error: expected template-name before < token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h:45:120: error: expected identifier before < token
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before > token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ) token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                              ^
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/cpp_extensions/setup.py", line 90, in <module>
    setup(
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
    return distutils.core.setup(**attrs)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 973, in run_commands
    self.run_command(cmd)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
    cmd_obj.run()
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/command/install.py", line 68, in run
    return orig.install.run(self)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/command/install.py", line 698, in run
    self.run_command('build')
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
    self.distribution.run_command(command)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
    cmd_obj.run()
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/command/build.py", line 24, in run
    super().run()
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/command/build.py", line 132, in run
    self.run_command(cmd_name)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
    self.distribution.run_command(command)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 992, in run_command
    cmd_obj.run()
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
    _build_ext.build_ext.run(self)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
    self.build_extensions()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
    build_ext.build_extensions(self)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
    _build_ext.build_ext.build_extensions(self)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
    self._build_extensions_serial()
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
    self.build_extension(ext)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
    _build_ext.build_extension(self, ext)
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
    objects = self.compiler.compile(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
   File "/scratch-nvme/1/casparl/ebtmpdir/eb-ukx4l8ka/tmpglj5n990/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
test_cpp_extensions_aot_ninja 1/1 failed!
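The "direct cause" chaining in the traceback above is the standard `raise ... from e` pattern: torch's `_run_ninja_build` catches the `CalledProcessError` from `subprocess.run(..., check=True)` and re-raises it as a `RuntimeError`, keeping the original error attached. A minimal sketch of that pattern (using `false` as a stand-in for `ninja -v`):

```python
import subprocess

def run_build(cmd):
    # Same shape as torch.utils.cpp_extension._run_ninja_build: let
    # subprocess.run raise on a non-zero exit, then re-raise with the
    # original error attached as __cause__ ("direct cause" in tracebacks).
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as e:
        raise RuntimeError("Error compiling objects for extension") from e

try:
    run_build(["false"])  # stand-in for ['ninja', '-v']
except RuntimeError as err:
    # The chained cause keeps the subcommand's exit status available.
    cause = err.__cause__
    print(type(cause).__name__, cause.returncode)
```

This is why the log shows both tracebacks: the inner one for the failed `ninja` subcommand, the outer one for the extension build that it aborted.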

@casparvl
Contributor

Error in test_cpp_extensions looks similar btw:

Error log:
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h: In function typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&):
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h:45:120: error: expected template-name before < token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h:45:120: error: expected identifier before < token
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before > token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/scratch-nvme/1/casparl/generic/software/pybind11/2.10.3-GCCcore-12.2.0/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ) token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                              ^
error: command '/scratch-nvme/1/casparl/generic/software/CUDA/12.0.0/bin/nvcc' failed with exit code 1
test_cpp_extensions_aot_no_ninja 1/1 failed!
Running test_cpp_extensions_jit 1/1 ... [2024-03-21 08:38:32.382254]
Executing ['/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/bin/python', '-bb', 'test_cpp_extensions_jit.py', '--shard-id=0', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--reruns=2'] ... [2024-03-21 08:38:32.382937]

Expand the folded group to see the log file of test_cpp_extensions_jit 1/1

@Flamefire
Contributor Author

Yep, that is a known issue: reinstall your pybind11 with the latest EC.

@casparvl
Contributor

Great, will do! Sorry, there are so many fixes that I often can't keep up and don't always rebuild stuff XD I'll send a new test report after the pybind11 rebuild.

@Flamefire
Contributor Author

Great, will do! Sorry, there are so many fixes that I often can't keep up and don't always rebuild stuff XD I'll send a new test report after the pybind11 rebuild.

Yeah, I know that is annoying, but we can't do much better than updating the existing EC(s) for such major bugs. It came up recently with someone else too, so I remembered it.
Note that you can always search for parts of the error in this repo or grep a local checkout. IIRC the patch contains the relevant part of the error message.
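For the "grep the local checkout" route, a throwaway helper along these lines works (a sketch; `checkout_dir` and the needle string are placeholders for your local easybuild-easyconfigs clone and a fragment of the compiler error):

```python
from pathlib import Path

def find_patches_mentioning(checkout_dir, needle):
    # Scan *.patch files in a local easybuild-easyconfigs checkout for a
    # fragment of the error message, e.g. "expected template-name".
    return sorted(
        p for p in Path(checkout_dir).rglob("*.patch")
        if needle in p.read_text(errors="ignore")
    )
```

For example, `find_patches_mentioning("easybuild/easyconfigs", "expected template-name")` would surface any patch whose description quotes this pybind11/nvcc error.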

Side note: this is actually a good reason to run the PyTorch test suite and investigate errors. Our pybind11 version isn't (wasn't) compatible with this PyTorch version, which would make the module less usable, as this error is likely to pop up in user code that uses it.

@casparvl
Contributor

Ok, I rebuilt pybind11 and it's now rebuilding this PR. Now we have to practice patience again ;-)

@sassy-crick
Collaborator

Test report sassy-crick:
SUCCESS
Xeon(R) Platinum 8358, A100 GPU, Red Hat Enterprise Linux release 8.8 (Ootpa)
See here for a full test report

@casparvl
Contributor

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/0a676dbccf3d4c9f2580582ad65d5e25 for a full test report.

@Flamefire
Contributor Author

@casparvl Looks similar to #20156, so increasing the allowed failures to 10 might be enough.
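Assuming the knob meant here is the `max_failed_tests` parameter handled by the PyTorch easyblock, the easyconfig change would be a one-liner (sketch of the relevant fragment, not the full EC):

```python
# In the PyTorch easyconfig (parameter handled by the PyTorch easyblock):
# tolerate up to 10 failing tests instead of aborting the installation,
# while still logging which tests failed.
max_failed_tests = 10
```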

@casparvl
Contributor

casparvl commented Mar 24, 2024

Failures are the same for

The only new one was a failure in distributed/test_c10d_nccl, which seems to have resulted in a hang:

____________________________________________ ProcessGroupNCCLTest.test_nccl_watchdog_cudagraph _____________________________________________
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-0svq1wxg/tmpqiveorku/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 506, in wrapper
    self._join_processes(fn)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-0svq1wxg/tmpqiveorku/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 725, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-0svq1wxg/tmpqiveorku/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 780, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 300.0367577075958 seconds
----------------------------------------------------------- Captured stdout call -----------------------------------------------------------
Timing out after 300 seconds and killing subprocesses.
----------------------------------------------------------- Captured stdout call -----------------------------------------------------------
Timing out after 300 seconds and killing subprocesses.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/scratch-nvme/1/casparl/ebtmpdir/eb-0svq1wxg/tmpqiveorku/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py:718: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
======================================================= 2 rerun in 897.82s (0:14:57) =======================================================

@casparvl
Contributor

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20155 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20155 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13198

Test results coming soon (I hope)...

- notification for comment with ID 2016793002 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (1 easyconfigs in total)
cnx5 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/499e262317097dc60188ace77f7f957e for a full test report.

@casparvl
Contributor

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20155 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20155 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3860

Test results coming soon (I hope)...

- notification for comment with ID 2020037044 processed


@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (1 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/f2bf0146d4d6cf62ee135ddc832e422f for a full test report.

@casparvl
Contributor

casparvl commented Apr 2, 2024

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/3376bb4371939aa03f6eacf1856727b7 for a full test report.

@Flamefire
Contributor Author

Test report by @casparvl FAILED Build succeeded for 0 out of 1 (1 easyconfigs in total) gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8 See https://gist.github.com/casparvl/3376bb4371939aa03f6eacf1856727b7 for a full test report.

You are missing the patches from #19666 which are in develop

@casparvl
Contributor

casparvl commented Apr 2, 2024

Ah, let me sync your branch with develop - I'm assuming you won't mind... :)

@casparvl
Contributor

casparvl commented Apr 2, 2024

Ok, the rebuild started successfully now. The test report should be there sometime tonight. I'll trigger one more rebuild on one of the test clusters for good measure. Should be good to go afterwards...

@casparvl
Contributor

casparvl commented Apr 2, 2024

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20155 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20155 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3909

Test results coming soon (I hope)...

- notification for comment with ID 2032212011 processed


@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/ef331e090522687eafe998b223c49185 for a full test report.

@casparvl
Contributor

casparvl commented Apr 3, 2024

Test report by @casparvl
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/casparvl/5830985b8a671f1c06e9d11f62593b9d for a full test report.

…es: PyTorch-2.1.2_add-cuda-skip-markers.patch, PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch, PyTorch-2.1.2_fix-device-mesh-check.patch, PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch, PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch, PyTorch-2.1.2_fix-test_memory_profiler.patch, PyTorch-2.1.2_fix-test_torchinductor-rounding.patch, PyTorch-2.1.2_fix-vsx-vector-abs.patch, PyTorch-2.1.2_fix-vsx-vector-div.patch, PyTorch-2.1.2_fix-with_temp_dir-decorator.patch, PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch, PyTorch-2.1.2_relax-cuda-tolerances.patch, PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch, PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch, PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch, PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch, PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch
@Flamefire Flamefire force-pushed the 20240319165333_new_pr_PyTorch212 branch from 5564421 to 8bb0d57 Compare April 16, 2024 10:33
@akesandgren
Contributor

Test report by @akesandgren
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12
See https://gist.github.com/akesandgren/581ba5cfbd45762c316a059be80e91ac for a full test report.

@Flamefire
Contributor Author

Test report by @akesandgren FAILED Build succeeded for 0 out of 1 (1 easyconfigs in total) b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12 See https://gist.github.com/akesandgren/581ba5cfbd45762c316a059be80e91ac for a full test report.

It's failing in particular due to the segfaults:

distributed/fsdp/test_wrap 1/1 failed! Received signal: SIGSEGV
distributed/test_dynamo_distributed 1/1 failed! Received signal: SIGSEGV
distributed/test_inductor_collectives 1/1 failed! Received signal: SIGSEGV
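When a test dies with SIGSEGV like this, the per-thread dumps seen later in the thread ("Fatal Python error: Segmentation fault" followed by "Thread 0x... (most recent call first)") come from CPython's `faulthandler` module, which the test harness enables. It can also dump all thread stacks on demand, which is handy when a run hangs or crashes without a useful Python traceback; a small sketch:

```python
import faulthandler
import tempfile

# Install handlers so SIGSEGV/SIGFPE/SIGABRT/SIGBUS print a Python
# traceback for every thread before the process dies.
faulthandler.enable()

# dump_traceback() writes the same style of per-thread stack dump on
# demand; it needs a real file descriptor, so use a temporary file here.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f)
    f.seek(0)
    dump = f.read()

print(dump.splitlines()[0])  # e.g. "Current thread 0x... (most recent call first):"
```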

@akesandgren
Contributor

Yeah, they are a bit weird. Can't see any reason for them to have happened...
System has plenty of memory and it's an A100/80G card.
Will try again.

@Flamefire
Contributor Author

Check the logs for any more specific messages about those crashing tests.

@akesandgren
Contributor

akesandgren commented Apr 29, 2024

Didn't find anything useful: a bunch of tracebacks, but they weren't very informative...
I'll see tomorrow what the second run gives me.

@akesandgren
Contributor

@Flamefire This is what I got:

distributed/fsdp/test_wrap.py::TestAutoWrap::test_auto_wrap_smoke_test_cuda_init_mode_CUDAInitMode_CUDA_AFTER_cpu_offload_CPUOffload(offload_params=False)_use_device_id_True [W PyInterpreter.cpp:221] Warning: Deallocating Tensor that still has live PyObject references.  This probably happened because you took out a weak reference to Tensor and didn't call _fix_weakref() after dereferencing it.  Subsequent accesses to this tensor via the PyObject will now fail. (function decref)
[W PyInterpreter.cpp:221] Warning: Deallocating Tensor that still has live PyObject references.  This probably happened because you took out a weak reference to Tensor and didn't call _fix_weakref() after dereferencing it.  Subsequent accesses to this tensor via the PyObject will now fail. (function decref)
...
distributed/fsdp/test_wrap.py::TestAutoWrap::test_auto_wrap_smoke_test_cuda_init_mode_CUDAInitMode_CUDA_BEFORE_cpu_offload_CPUOffload(offload_params=False)_use_device_id_True [W PyInterpreter.cpp:221] Warning: Deallocating Tensor that still has live PyObject references.  This probably happened because you took out a weak reference to Tensor and didn't call _fix_weakref() after dereferencing it.  Subsequent accesses to this tensor via the PyObject will now fail. (function decref)
[W PyInterpreter.cpp:221] Warning: Deallocating Tensor that still has live PyObject references.  This probably happened because you took out a weak reference to Tensor and didn't call _fix_weakref() after dereferencing it.  Subsequent accesses to this tensor via the PyObject will now fail. (function decref)
PASSED [0.5554s]                        [ 74%]
distributed/fsdp/test_wrap.py::TestAutoWrap::test_auto_wrap_smoke_test_cuda_init_mode_CUDAInitMode_CUDA_BEFORE_cpu_offload_CPUOffload(offload_params=True)_use_device_id_False Fatal Python error: Segmentation fault

Thread 0x00007f4b70ffd640 (most recent call first):
  <no Python frame>

Thread 0x00007f4c09b6d740 (most recent call first):
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2899 in all_gather_into_tensor
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 208 in _check_order
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 166 in record_pre_forward
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 425 in _pre_forward
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 825 in forward
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/dev/shm/eb-ake/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/distributed/fsdp/test_wrap.py", line 740 in test_auto_wrap_smoke_test
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 356 in instantiated_test
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2407 in wrapper
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 549 in _callTestMethod
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 591 in run
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2480 in _run_with_retry
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2551 in run
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 650 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/unittest.py", line 330 in runtest
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 339 in from_call
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/hpc2n/eb/software/pytest-rerunfailures/12.0-GCCcore-12.2.0/lib/python3.10/site-packages/pytest_rerunfailures.py", line 608 in pytest_runtest_protocol
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/main.py", line 324 in _main
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 930 in run_tests
  File "/dev/shm/eb-ake/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/distributed/fsdp/test_wrap.py", line 946 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, gmpy2.gmpy2, psutil._psutil_linux, psutil._psutil_posix (total: 23)

and

Running 1 items in this shard: test/distributed/test_c10d_nccl.py::NcclProcessGroupWithDispatchedCollectivesTests::test_all_to_all_single

distributed/test_c10d_nccl.py::NcclProcessGroupWithDispatchedCollectivesTests::test_all_to_all_single SIGSEGV(11), PID: 847043, Thread 847043: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6b (0x7ff7123b7e0b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x42520 (0x7ff7498d0520 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: clock_nanosleep + 0xc8 (0x7ff749973868 in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: __nanosleep + 0x17 (0x7ff7499786e7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: usleep + 0x4f (0x7ff7499aa0df in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x6729a (0x7ff6d6c6529a in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #6: <unknown function> + 0x67f28 (0x7ff6d6c65f28 in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #7: ncclGroupEnd + 0x59 (0x7ff6d6c665a9 in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x78b (0x7ff6ed50280b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0xe56df8 (0x7ff6ed50adf8 in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::alltoall_base(at::Tensor&, at::Tensor&, std::vector<long, std::allocator<long> >&, std::vector<long, std::allocator<long> >&, c10d::AllToAllOptions const&) + 0xfc8 (0x7ff6ed5223f8 in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0x4d13bb6 (0x7ff704f08bb6 in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4d237b5 (0x7ff704f187b5 in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x4d30ad0 (0x7ff704f25ad0 in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x4d35b67 (0x7ff704f2ab67 in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0xbae8ee (0x7ff70bf778ee in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x3dcdb0 (0x7ff70b7a5db0 in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #60: <unknown function> + 0x29d90 (0x7ff7498b7d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #61: __libc_start_main + 0x80 (0x7ff7498b7e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #62: _start + 0x25 (0x401065 in /hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/bin/python)

SIGSEGV(11), PID: 847043, Thread 847045: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6b (0x7ff7123b7e0b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x42520 (0x7ff7498d0520 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: __poll + 0x4f (0x7ff7499a6d7f in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: <unknown function> + 0x2b7c3f (0x7ff71000dc3f in /.singularity.d/libs/libcuda.so.1)
frame #4: <unknown function> + 0x37a7bf (0x7ff7100d07bf in /.singularity.d/libs/libcuda.so.1)
frame #5: <unknown function> + 0x2b3e2f (0x7ff710009e2f in /.singularity.d/libs/libcuda.so.1)
frame #6: <unknown function> + 0x94b43 (0x7ff749922b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x126a00 (0x7ff7499b4a00 in /lib/x86_64-linux-gnu/libc.so.6)

SIGSEGV(11), PID: 847043, Thread 847046: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6b (0x7ff7123b7e0b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x42520 (0x7ff7498d0520 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: __poll + 0x4f (0x7ff7499a6d7f in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: <unknown function> + 0x4ac1 (0x7ff74910bac1 in /hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/lib-dynload/select.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #20: <unknown function> + 0x94b43 (0x7ff749922b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: <unknown function> + 0x126a00 (0x7ff7499b4a00 in /lib/x86_64-linux-gnu/libc.so.6)

SIGSEGV(11), PID: 847043, Thread 847047: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6b (0x7ff7123b7e0b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x42520 (0x7ff7498d0520 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: <unknown function> + 0x91197 (0x7ff74991f197 in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: pthread_cond_clockwait + 0x1fd (0x7ff74992235d in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x100 (0x7ff6ed5007b0 in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x7e (0x7ff6ed500c1e in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xe04a3 (0x7ff711aba4a3 in /hpc2n/eb/software/GCCcore/12.2.0/lib64/libstdc++.so.6)
frame #7: <unknown function> + 0x94b43 (0x7ff749922b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126a00 (0x7ff7499b4a00 in /lib/x86_64-linux-gnu/libc.so.6)

SIGSEGV(11), PID: 847043, Thread 847048: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6b (0x7ff7123b7e0b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x42520 (0x7ff7498d0520 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: __poll + 0x4f (0x7ff7499a6d7f in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: <unknown function> + 0x2b7c3f (0x7ff71000dc3f in /.singularity.d/libs/libcuda.so.1)
frame #4: <unknown function> + 0x37a7bf (0x7ff7100d07bf in /.singularity.d/libs/libcuda.so.1)
frame #5: <unknown function> + 0x2b3e2f (0x7ff710009e2f in /.singularity.d/libs/libcuda.so.1)
frame #6: <unknown function> + 0x94b43 (0x7ff749922b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x126a00 (0x7ff7499b4a00 in /lib/x86_64-linux-gnu/libc.so.6)

SIGSEGV(11), PID: 847043, Thread 847050: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6b (0x7ff7123b7e0b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::FatalSignalHandler::fatalSignalHandler(int) + 0x14a (0x7ff7123b833a in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x42520 (0x7ff7498d0520 in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: <unknown function> + 0x198b22 (0x7ff749a26b22 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0xb46fe (0x7ff6d6cb26fe in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #5: <unknown function> + 0xb548e (0x7ff6d6cb348e in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #6: <unknown function> + 0xb76ed (0x7ff6d6cb56ed in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #7: <unknown function> + 0x9d7a6 (0x7ff6d6c9b7a6 in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #8: <unknown function> + 0x4b82d (0x7ff6d6c4982d in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #9: <unknown function> + 0x4fe0b (0x7ff6d6c4de0b in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #10: <unknown function> + 0x66267 (0x7ff6d6c64267 in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #11: <unknown function> + 0x94b43 (0x7ff749922b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #12: <unknown function> + 0x126a00 (0x7ff7499b4a00 in /lib/x86_64-linux-gnu/libc.so.6)

SIGSEGV(11), PID: 847043, Thread 847051: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6b (0x7ff7123b7e0b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x42520 (0x7ff7498d0520 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: read + 0x4c (0x7ff7499a29cc in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: ibv_get_async_event + 0x34 (0x7ff7386b54c4 in /lib/x86_64-linux-gnu/libibverbs.so.1)
frame #4: <unknown function> + 0x73292 (0x7ff6d6c71292 in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #5: <unknown function> + 0x89a4b (0x7ff6d6c87a4b in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #6: <unknown function> + 0x94b43 (0x7ff749922b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x126a00 (0x7ff7499b4a00 in /lib/x86_64-linux-gnu/libc.so.6)

SIGSEGV(11), PID: 847043, Thread 847052: 
 frame #0: c10::FatalSignalHandler::stacktraceSignalHandler(bool) + 0x6b (0x7ff7123b7e0b in /scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x42520 (0x7ff7498d0520 in /lib/x86_64-linux-gnu/libc.so.6)
frame #2: read + 0x4c (0x7ff7499a29cc in /lib/x86_64-linux-gnu/libc.so.6)
frame #3: ibv_get_async_event + 0x34 (0x7ff7386b54c4 in /lib/x86_64-linux-gnu/libibverbs.so.1)
frame #4: <unknown function> + 0x73292 (0x7ff6d6c71292 in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #5: <unknown function> + 0x89a4b (0x7ff6d6c87a4b in /hpc2n/eb/software/NCCL/2.16.2-GCCcore-12.2.0-CUDA-12.0.0/lib/libnccl.so.2)
frame #6: <unknown function> + 0x94b43 (0x7ff749922b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: <unknown function> + 0x126a00 (0x7ff7499b4a00 in /lib/x86_64-linux-gnu/libc.so.6)

('RERUN', {'yellow': True}) [2.2816s]                                                                            [100%]

and

___________________________________________________________________________ NcclProcessGroupWithDispatchedCollectivesTests.test_all_to_all_single ____________________________________________________________________________
Traceback (most recent call last):
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 506, in wrapper
    self._join_processes(fn)
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 725, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 803, in _check_return_codes
    self.assertEqual(
  File "/scratch/eb-ake-tmp/eb-lkz0y096/tmp47l6v88h/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3304, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not equal!

Expected 0 but got -11.
Absolute difference: 11
Relative difference: inf
Expected zero exit code but got -11 for pid: 847043
================================================================================================= 1 passed, 1 rerun in 4.84s =================================================================================================
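For anyone puzzled by the -11: POSIX process-spawning APIs report a child killed by signal N as return code -N, and signal 11 is SIGSEGV, so "Expected 0 but got -11" means the worker process segfaulted rather than failing an assertion. A minimal illustration (assuming Linux):

```python
import signal
import subprocess
import sys

# Child process that delivers SIGSEGV to itself; the default disposition
# terminates the process "by signal 11".
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
)

# subprocess encodes death-by-signal as the negated signal number, which is
# what the test harness ends up comparing against the expected exit code 0.
print(child.returncode)  # -11 on Linux
```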

@akesandgren

Contributor

akesandgren commented May 2, 2024

OK, the segfault in test_wrap doesn't happen every time.

A cleaner stacktrace is:

distributed/fsdp/test_wrap.py::TestAutoWrap::test_auto_wrap_smoke_test_cuda_init_mode_CUDAInitMode_CUDA_AFTER_cpu_offload_CPUOffload(offload_params=False)_use_device_id_False Fatal Python error: Segmentation fault

Thread 0x00007fba0a99f740 (most recent call first):
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2899 in all_gather_into_tensor
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 208 in _check_order
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/distributed/fsdp/_exec_order_utils.py", line 166 in record_pre_forward
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 425 in _pre_forward
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 825 in forward
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/dev/shm/eb-ake/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/distributed/fsdp/test_wrap.py", line 740 in test_auto_wrap_smoke_test
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 356 in instantiated_test
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2407 in wrapper
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 549 in _callTestMethod
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 591 in run
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2480 in _run_with_retry
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2551 in run
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 650 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/unittest.py", line 330 in runtest
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 339 in from_call
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/hpc2n/eb/software/pytest-rerunfailures/12.0-GCCcore-12.2.0/lib/python3.10/site-packages/pytest_rerunfailures.py", line 608 in pytest_runtest_protocol
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/main.py", line 324 in _main
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_callers.py", line 39 in _multicall
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_manager.py", line 80 in _hookexec
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "/hpc2n/eb/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/home/a/ake/easybuild-amd64_ubuntu2204_zen3/software/PyTorch/2.1.2-foss-2022b-CUDA-12.0.0/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 930 in run_tests
  File "/dev/shm/eb-ake/PyTorch/2.1.2/foss-2022b-CUDA-12.0.0/pytorch-v2.1.2/test/distributed/fsdp/test_wrap.py", line 946 in <module>

@akesandgren
Contributor

akesandgren commented May 2, 2024

I also have a problem with test/distributed/test_c10d_nccl.py::NcclProcessGroupWithDispatchedCollectivesTests::test_allgather_base which also segfaults.

But if I change to NCCL 2.18.3 (i.e. the one used in 2023a) that problem goes away.

@akesandgren
Contributor

Using a newer NCCL also seems to eliminate the SEGV in distributed/fsdp/test_wrap

I'd say we should drop this one due to the NCCL version problem.

Or add this to some more tests:
test/distributed/test_c10d_nccl.py: @requires_nccl_version((2, 17), "Need NCCL 2.17+ for configuring NCCL communicators")
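For anyone adding those guards, the decorator simply turns into a skip when the detected NCCL is too old. A minimal self-contained stand-in (the real decorator lives in torch.testing._internal.common_distributed and queries the NCCL version from torch; the hard-coded version tuple below is an assumption for illustration):

```python
import unittest

# Assumed installed NCCL version, for illustration only (the real check
# asks torch for the version of the NCCL it was built against).
NCCL_VERSION = (2, 16, 2)

def requires_nccl_version(min_version, msg):
    # Simplified stand-in for the PyTorch decorator: skip the test when the
    # detected NCCL is older than min_version, otherwise leave it unchanged.
    if NCCL_VERSION < tuple(min_version):
        return unittest.skip(f"Requires NCCL >= {min_version}: {msg}")
    return lambda func: func

class NcclTests(unittest.TestCase):
    @requires_nccl_version((2, 17), "Need NCCL 2.17+ for configuring NCCL communicators")
    def test_all_to_all_single(self):
        pass  # the real test would run the collective here
```

With NCCL 2.16.2 the test above is reported as skipped instead of segfaulting the whole suite.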

@akesandgren
Contributor

Also note that .github/scripts/generate_binary_build_matrix.py hints that NCCL 2.18.1 is "required", or at least is the version they test with.

@Flamefire
Contributor Author

Using a newer NCCL also seems to eliminate the SEGV in distributed/fsdp/test_wrap

I'd say we should drop this one due to the NCCL version problem.

Or add this to some more tests: test/distributed/test_c10d_nccl.py: @requires_nccl_version((2, 17), "Need NCCL 2.17+ for configuring NCCL communicators")

Those 3 suites are crashing in your report:

  • distributed/fsdp/test_wrap
  • distributed/test_dynamo_distributed
  • distributed/test_inductor_collectives

The test_c10d_nccl one seems to be "just" a test failure, or is at least counted as one; it is easy enough to skip if it fails consistently.
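If we go the skipping route, the PyTorch easyblock supports an excluded_tests easyconfig parameter; a sketch, with the suite name taken from the failure above and the exact entry format assumed:

```python
# Sketch: exclude the consistently failing suite from the test step.
# The '' key means "all architectures" in the PyTorch easyblock's
# excluded_tests; entries are test paths relative to the test/ directory.
excluded_tests = {
    '': [
        'distributed/test_c10d_nccl',
    ]
}
```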

This is the only PyTorch 2.x version in 2022b, so it might be worth keeping. That leaves three options:

  • Add the 2023a NCCL version (2.18.3) to 2022b
  • Skip the affected tests (might fail for users relying on those parts)
  • Remove it and refer to 2023a for PyTorch 2.x

I prefer the first option.
Currently rerunning the builds on my side to check if I see those failures too. Can't remember having them...
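For the record, the first option would amount to a single dependency bump in the easyconfig; a sketch, with the surrounding entries elided and the exact version-suffix spelling assumed:

```python
# PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb (sketch, other dependencies unchanged)
dependencies = [
    ('CUDA', '12.0.0', '', SYSTEM),
    ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),  # was ('NCCL', '2.16.2', ...)
]
```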

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
i8009 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/1f858bec1c9fb0f90e07ab1a9dd0156e for a full test report.

@Flamefire
Contributor Author

I ran some tests and I see random failures of whole suites with SIGSEGV (test_jit) and SIGIOT (distributed/_tensor/test_dtensor_ops). The latter is especially problematic: a) it is confusing, as I don't know what that signal means in this context, and b) it is hard to spot any real failure among the many XFAILs, where the stacktrace and failure are printed only for the test to later say "yeah, this is expected", which makes it pretty much impossible to see actual errors.

I opened another PR using the newer NCCL: #20520
