Add dtensor support for scale-out testing by frost-intel · Pull Request #15 · daisyden/pytorch

frost-intel · 2025-09-22T21:09:19Z

Adds external (scale-out) testing support for DTensorTestBase. It is used specifically in test/distributed/checkpoint/test_utils.py, but also in many other tests.

frost-intel · 2025-09-22T21:09:34Z

@pkourdis

pkourdis · 2025-09-30T04:22:57Z

@frost-intel can you please rebase your changes with panos/test-with-external-multiprocessing?

pkourdis · 2025-10-01T17:54:58Z

LGTM

…nces between x86 vs aarch64 (pytorch#176085)

In the test:

```
python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

it raises an exception when calling `STD_CUDA_CHECK(cudaSetDevice(99999));`, which produces the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace differs between `x86` and `aarch64`, perhaps due to these issues:

- pytorch#119905
- pytorch#134387

In the current setup, when getting a stack trace string:

- x86 contains `C++ CapturedTraceback:`
- aarch64 contains `Exception raised from` + `frame #`

An example of the full string from an aarch64 system:

```
AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe #5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #11: /usr/bin/python() [0x504a34]\nframe #12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe #13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe #14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe #15: /usr/bin/python() [0x52a070]\nframe #16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe #17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe #18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe #19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe #20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe #21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe #22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe #23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n'

To execute this test, run the following from the base repo dir:
    python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

Pull Request resolved: pytorch#176085
Approved by: https://github.com/eqy
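For illustration only, the two stack trace formats described in that commit message could be checked with one architecture-agnostic helper. This is a minimal sketch, not necessarily the change made in pytorch#176085: the helper name `has_cpp_stacktrace` and the two constants are hypothetical, and only the marker strings themselves come from the commit message.

```python
# Marker strings quoted in the commit message above.
X86_MARKER = "C++ CapturedTraceback:"
AARCH64_MARKERS = ("Exception raised from", "frame #")


def has_cpp_stacktrace(message: str) -> bool:
    """Return True if `message` carries a C++ stack trace in either format.

    Hypothetical helper: accepts the x86-style marker, or the aarch64-style
    pair of markers appearing together.
    """
    if X86_MARKER in message:
        return True
    return all(marker in message for marker in AARCH64_MARKERS)
```

A test could then assert `has_cpp_stacktrace(str(exc))` instead of looking for a single hard-coded marker, at the cost of a slightly looser check on each architecture.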

frost-intel changed the title ~~Add dtensor support~~ Add dtensor support for scale-out testing Sep 29, 2025

pkourdis force-pushed the panos/test-with-external-multiprocessing branch 2 times, most recently from de5f659 to 44c5a33 Compare September 30, 2025 04:18

Add dtensor support

3fb129a

frost-intel force-pushed the frost/test_external_dtensor branch from 5ad16a8 to 3fb129a Compare October 1, 2025 14:29

pkourdis merged commit 073b731 into daisyden:panos/test-with-external-multiprocessing Oct 1, 2025

pkourdis merged 1 commit into daisyden:panos/test-with-external-multiprocessing from frost-intel:frost/test_external_dtensor
