Add XPU support for distributed by PenghuiCheng · Pull Request #4 · daisyden/pytorch

PenghuiCheng · 2025-03-19T13:47:16Z

Add XPU support for distributed

Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>

daisyden · 2025-03-19T13:59:10Z

torch/testing/_internal/distributed/distributed_test.py

        return torch.empty(size, size, size, dtype=dtype).fill_(value)
    else:
-        return torch.empty(size, size, size, dtype=dtype).fill_(value).cuda(device_id)
+        return torch.empty(size, size, size, dtype=dtype).fill_(value).to(torch.device(device_id))


Here device_id is an integer indicating the rank, right? We may need to change this to .to(torch.accelerator.current_accelerator().type + ":" + str(device_id)).
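
A minimal sketch of the suggestion above, assuming a PyTorch build that provides `torch.accelerator`. The helper name `accelerator_device_str` and the CPU fallback are illustrative assumptions, not part of this PR. It builds an explicit `"<type>:<index>"` device string instead of passing a bare integer to `torch.device`:

```python
from typing import Optional


def accelerator_device_str(device_id: int, accel_type: Optional[str] = None) -> str:
    """Build a device string such as "xpu:1" or "cuda:0".

    `accel_type` may be passed explicitly; otherwise it is looked up via
    torch.accelerator, falling back to "cpu" if that is unavailable.
    """
    if accel_type is None:
        try:
            import torch  # imported lazily so the helper also works without torch

            acc = torch.accelerator.current_accelerator()
            accel_type = acc.type if acc is not None else "cpu"
        except (ImportError, AttributeError):
            # torch is missing, or too old to have torch.accelerator
            accel_type = "cpu"
    if accel_type == "cpu":
        return "cpu"
    return f"{accel_type}:{device_id}"


# Hypothetical usage inside the test helpers:
#   model = nn.Linear(1, 5, bias=False).to(accelerator_device_str(device_id))
print(accelerator_device_str(1, "xpu"))
```

Passing the type explicitly keeps the helper testable on machines without an accelerator; in the distributed tests the lookup branch would normally be used.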

daisyden · 2025-03-19T14:02:25Z

torch/testing/_internal/distributed/distributed_test.py

                nn.ReLU(),
                nn.Linear(1, 5, bias=False),
-            ).cuda(device_id)
+            ).to(torch.device(device_id))


.to(torch.accelerator.current_accelerator().type + ":" + str(device_id))

daisyden · 2025-03-19T14:03:03Z

torch/testing/_internal/distributed/distributed_test.py

            device_id = rank_to_GPU[rank][0]

-            model = nn.Linear(1, 5, bias=False).cuda(device_id)
+            model = nn.Linear(1, 5, bias=False).to(torch.device(device_id))


.to(torch.accelerator.current_accelerator().type + ":" + str(device_id))

daisyden · 2025-03-19T14:03:10Z

torch/testing/_internal/distributed/distributed_test.py

            device_id = rank_to_GPU[rank][0]

-            model = nn.Linear(1, 5, bias=False).cuda(device_id)
+            model = nn.Linear(1, 5, bias=False).to(torch.device(device_id))


.to(torch.accelerator.current_accelerator().type + ":" + str(device_id))

daisyden · 2025-03-19T14:03:25Z

torch/testing/_internal/distributed/distributed_test.py

            device_id = rank_to_GPU[rank][0]

-            model = nn.Linear(1, 5, bias=False).cuda(device_id)
+            model = nn.Linear(1, 5, bias=False).to(torch.device(device_id))


.to(torch.accelerator.current_accelerator().type + ":" + str(device_id))

daisyden

Will merge the code first and fix the .to() issues afterwards.

Which inherits from `RuntimeError` and contains `error_code`, which in the case of CUDA should contain the error returned by `cudaGetLastError`.

`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely:
- Convert the C string into a Python string with `PyUnicode_FromString`
- Create a new exception object using `PyObject_CallOneArg`, just as is done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32)
- Set the `error_code` property using `PyObject_SetAttrString`
- Decref all temporary references

Test that it works and captures the C++ backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'

import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```
which produces the following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0 #7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0 #8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0 #9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0 #10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0 #11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0 #12 at::native::abs(at::Tensor const&) from ??:0 #13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0 #14 
at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0 #15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 #16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 #17 at::_ops::abs::call(at::Tensor const&) from ??:0 #18 at::native::isfinite(at::Tensor const&) from ??:0 #19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0 #20 at::_ops::isfinite::call(at::Tensor const&) from ??:0 #21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0 #22 PyObject_CallFunctionObjArgs from ??:0 #23 _PyObject_MakeTpCall from ??:0 #24 _PyEval_EvalFrameDefault from ??:0 #25 _PyObject_FastCallDictTstate from ??:0 #26 _PyStack_AsDict from ??:0 #27 _PyObject_MakeTpCall from ??:0 #28 _PyEval_EvalFrameDefault from ??:0 #29 _PyFunction_Vectorcall from ??:0 #30 _PyEval_EvalFrameDefault from ??:0 #31 _PyFunction_Vectorcall from ??:0 #32 _PyEval_EvalFrameDefault from ??:0 #33 _PyFunction_Vectorcall from ??:0 #34 _PyEval_EvalFrameDefault from ??:0 #35 PyFrame_GetCode from ??:0 #36 PyNumber_Xor from ??:0 #37 
PyObject_Str from ??:0 #38 PyFile_WriteObject from ??:0 #39 _PyWideStringList_AsList from ??:0 #40 _PyDict_NewPresized from ??:0 #41 _PyEval_EvalFrameDefault from ??:0 #42 PyEval_EvalCode from ??:0 #43 PyEval_EvalCode from ??:0 #44 PyUnicode_Tailmatch from ??:0 #45 PyInit__collections from ??:0 #46 PyUnicode_Tailmatch from ??:0 #47 _PyRun_SimpleFileObject from ??:0 #48 _PyRun_AnyFileObject from ??:0 #49 Py_RunMain from ??:0 #50 Py_BytesMain from ??:0 #51 __libc_init_first from ??:0 #52 __libc_start_main from ??:0 #53 _start from ??:0
Captured error code is 710
```

Pull Request resolved: pytorch#152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: pytorch#154436
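
The construction steps listed in this commit message can be mirrored in pure Python. The sketch below is only an analogy: `AcceleratorErrorSketch` and `new_accelerator_error` are hypothetical names, and the real implementation is the C++ function named in the message.

```python
class AcceleratorErrorSketch(RuntimeError):
    """Stand-in for torch.AcceleratorError: a RuntimeError with an error_code."""


def new_accelerator_error(message: str, error_code: int) -> AcceleratorErrorSketch:
    # Steps 1-2: build the exception from a single string argument,
    # analogous to PyUnicode_FromString followed by PyObject_CallOneArg.
    exc = AcceleratorErrorSketch(message)
    # Step 3: attach the backend error code as an attribute,
    # analogous to PyObject_SetAttrString(exc, "error_code", ...).
    setattr(exc, "error_code", error_code)
    # Step 4 (decref of temporaries) has no Python-level equivalent.
    return exc


try:
    raise new_accelerator_error("CUDA error: device-side assert triggered", 710)
except RuntimeError as e:  # still catchable as a plain RuntimeError
    print("Exception was raised", e.args[0])
    print("Captured error code is", e.error_code)
```

Because the class derives from `RuntimeError`, existing `except RuntimeError` handlers keep working while new code can read `error_code`.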

…torch#156600)

Don't call `sum()` on a tensor that is default constructed.

Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:

```
Traceback (most recent call last):
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
    method(*args, **kwargs)
  File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
    ln_out_cuda.backward(grad_output_cuda)
  File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> 
>) from ??:0 #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0 #7 at::TensorBase::options() const from :0 #8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0 #9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 #10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 #11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0 #12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0 #13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0 #14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0 #15 
at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0 #16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0 #17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0
```

Now we only call `sum(0)` on tensors that are defined, and we properly guard both the `sum(0)` call and the assignment.

Pull Request resolved: pytorch#156600
Approved by: https://github.com/eqy, https://github.com/ngimel

For a tensor with a non-zero storage offset, the offset must be multiplied by the element size.

Add a regression test that creates a Tensor over an array of 6 elements with offset 3, which before the fix crashed with
```
C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15
Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib) frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib) frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib) frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib) frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, 
c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib) frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib) frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib) frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib) frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib) frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, 
c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib) frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib) frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic) frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic)
```

Pull Request resolved: pytorch#158690
Approved by: https://github.com/angelayi
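
The numbers in the error message above can be reproduced by hand. The sketch below assumes the usual strided-layout bound, where the required byte count is `itemsize * (storage_offset + 1 + sum((size - 1) * stride))`; `required_storage_nbytes` is an illustrative helper, not a PyTorch API. Reading the reported `15` as `3 * 4 + 3` (the offset added in elements rather than bytes) is an inference from the message, not something the commit states.

```python
def required_storage_nbytes(sizes, strides, storage_offset, itemsize):
    """Bytes a strided view needs, counting from the start of the storage."""
    if any(size == 0 for size in sizes):
        return 0  # an empty view touches no memory
    # Index of the last reachable element (relative to the offset), plus one.
    span = 1 + sum((size - 1) * stride for size, stride in zip(sizes, strides))
    return itemsize * (storage_offset + span)


# Values taken from the error message: sizes [3, 3], strides [0, 1],
# storage offset 3, itemsize 4.
needed = required_storage_nbytes([3, 3], [0, 1], storage_offset=3, itemsize=4)
print(needed)  # 24

# Assumed pre-fix accounting: 3 visible elements * 4 bytes + unscaled offset 3.
unscaled = 3 * 4 + 3
# Offset scaled by the element size: (3 + 3) elements * 4 bytes.
scaled = (3 + 3) * 4
print(unscaled, scaled)  # 15 24
```

With the offset scaled by the element size, the storage size matches the 24 bytes the bounds check requires, so the view is in bounds.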

…rch#165479)

These happen when building with CMAKE_BUILD_TYPE=RelWithAssert.

This should fix two types of failures that started with pytorch#163665.

Disclaimer that I used a lot of AI since I don't know how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now).

The first type is (truncated):
```
    default_pg, _ = _new_process_group_helper(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper
    backend_class = creator_fn(dist_backend_opts, backend_options)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg
    return FakeProcessGroup._create_internal(
RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero.
Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0 #7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0 #8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0 #9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0 #10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, 
pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
```
and I fix it here by getting rid of `DontIncreaseRefcount` and using `make_intrusive` to do the refcount handling instead. However, I also had to make the constructor public, which I think is not good, based on the reasoning of the original PR.

The other type is
```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import
    self.assertEqual(out, "")
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual
    raise error_metas.pop()[0].to_error( # type: ignore[index]
AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != ''
- /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode.
-   if is_available() and not torch._C._c10d_init():

To execute this test, run the following from the base repo dir:
    python test/test_testing.py TestImports.test_no_warning_on_import
```
which I fix by getting rid of the `__init__`, which I think is ok since it'll just error if you try to make one?

Pull Request resolved: pytorch#165479
Approved by: https://github.com/ezyang

Previously g3 = NVIDIA Tesla M60
Now g6 = NVIDIA L4
Also change cuda arch list accordingly

Pros:
More memory, newer GPU

Cons:
That was one of the few remaining tests on g3 runners, so we probably lost coverage?

We can probably run more tests in parallel now but I'm not going to do that here

Disabled a bunch of sparse tests and nestedtensor tests that were previously skipped due to not having sufficient hardware? They are now failing with
```
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3293, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3292, in wrapper
    with policy():
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2532, in __enter__
    self.beforeStreams[-1].synchronize()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 105, in synchronize
    super().synchronize()
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from stream_synchronize at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.h:120 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) [clone .cold] from CUDAException.cpp:0
#7 THCPStream_synchronize(_object*, _object*) from Stream.cpp:0
#8 cfunction_vectorcall_NOARGS from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:489
#9 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#10 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
#11 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#12 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
```
When run with `CUDA_LAUNCH_BLOCKING=1`, I got a ton of output like
```
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [2,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [3,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,0,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,1,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,1,0] Assertion `value < upper_bound` failed. 
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,2,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,3,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,3,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,4,0] Assertion `value < upper_bound` failed. /var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,4,0] Assertion `value < upper_bound` failed. 
```

Pull Request resolved: pytorch#165158
Approved by: https://github.com/seemethere

If another static object (like `g_device_config_parse_hook_registry_instance`, created by the `REGISTER_ALLOCATOR_CONFIG_PARSE_HOOK` macro) tries to call `registerDeviceConfigParserHook` before `device_config_parser_hook_` is initialized, assigning to it (`operator=`) can fail, which leads to a runtime error.

When compiling with the `-O1` optimization level, I see this issue:
```
[src/libcxx/include/__functional/function.h:496]:14: runtime error: member access within null pointer of type 'const __policy'
#0 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:496]:14
#1 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:483]:19
#2 0x563224e28b78 in operator= [crosstool/v18/stable/src/libcxx/include/__functional/function.h:727]:8
#3 0x563224e28b78 in c10::CachingAllocator::AcceleratorAllocatorConfig::registerDeviceConfigParserHook(std::__u::function<void (std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>> const&)>&&, std::__u::unordered_set<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>, std::__u::hash<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::equal_to<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>, std::__u::allocator<std::__u::basic_string<char, std::__u::char_traits<char>, std::__u::allocator<char>>>> const&) [torch/c10/core/AllocatorConfig.h:263]:32
#4 0x563224e28e9d in DeviceConfigParserHookRegistry [torch/c10/core/AllocatorConfig.h:369]:5
#5 0x563224e28e9d in __cxx_global_var_init.34 [torch/c10/cuda/CUDAAllocatorConfig.cpp:195]:1
#6 0x563224e28e9d in _GLOBAL__sub_I_CUDAAllocatorConfig.cpp torch/c10/cuda/CUDAAllocatorConfig.cpp
#7 0x5632459709ac in __libc_csu_init /[usr/grte/v5/debug-src/src/csu/elf-init.c:88]:7
#8 0x7f748b9562e7 in __libc_start_main (/usr/grte/v5/lib64/libc.so.6+0x612e7) (BuildId: ca23ec6d935352118622ce674a8bb52d)
#9 0x5632018f3729 in _start /usr/grte/v5/debug-src/src/csu/../sysdeps/x86_64/start.S:120
```

Pull Request resolved: pytorch#172581
Approved by: https://github.com/guangyey, https://github.com/albanD

…nces between x86 vs aarch64 (pytorch#176085)

In the test:

```
python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

an exception is raised when calling `STD_CUDA_CHECK(cudaSetDevice(99999));`, which produces the expected `CUDA error: invalid device` message. However, the expected string for the C++ stack trace differs between `x86` and `aarch64`, possibly due to these issues:

- pytorch#119905
- pytorch#134387

In the current setup, the stack trace string depends on the architecture:

- x86 contains `C++ CapturedTraceback:`
- aarch64 contains `Exception raised from` + `frame #`

An example of the full string from an aarch64 system:

```
AssertionError: 'C++ CapturedTraceback:' not found in 'CUDA error: invalid device ordinal\nGPU device may be out of range, do you have enough GPUs?\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from test_std_cuda_check_error at /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/csrc/test_std_cuda_check.cu:23 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xe471ebcd39f4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)\nframe #1: <unknown function> + 0x43f998 (0xe471ebdcf998 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xe471ebdcfc0c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: torch_c10_cuda_check_msg + 0x1c (0xe471ef335c4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: test_std_cuda_check_error() + 0x58 (0xe470cd396678 in /opt/pytorch/pytorch/test/cpp_extensions/libtorch_agn_2_10_extension/install/usr/local/lib/python3.12/dist-packages/libtorch_agn_2_10/_C.so)\nframe #5: c10::BoxedKernel::makeFromFunctor<StableIValueBoxedKernel>(std::unique_ptr<StableIValueBoxedKernel, std::default_delete<StableIValueBoxedKernel> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) + 0x16c (0xe47211cd419c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x61d34bc (0xe47211cf34bc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)\nframe #7: <unknown function> + 0xe6c324 (0xe4721532c324 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #8: <unknown function> + 0xe6c7e0 (0xe4721532c7e0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #9: <unknown function> + 0xd3907c (0xe472151f907c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #10: <unknown function> + 0x5ccbf8 (0xe47214a8cbf8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)\nframe #11: /usr/bin/python() [0x504a34]\nframe #12: PyObject_Call + 0x6c (0x4c633c in /usr/bin/python)\nframe #13: _PyEval_EvalFrameDefault + 0x3ea0 (0x568564 in /usr/bin/python)\nframe #14: _PyObject_Call_Prepend + 0xc4 (0x4c5934 in /usr/bin/python)\nframe #15: /usr/bin/python() [0x52a070]\nframe #16: _PyObject_MakeTpCall + 0x78 (0x4c3e58 in /usr/bin/python)\nframe #17: _PyEval_EvalFrameDefault + 0x8a0 (0x564f64 in /usr/bin/python)\nframe #18: PyEval_EvalCode + 0x130 (0x5632b4 in /usr/bin/python)\nframe #19: PyRun_StringFlags + 0xe0 (0x59c330 in /usr/bin/python)\nframe #20: PyRun_SimpleStringFlags + 0x44 (0x67ebc4 in /usr/bin/python)\nframe #21: Py_RunMain + 0x390 (0x68b380 in /usr/bin/python)\nframe #22: Py_BytesMain + 0x28 (0x68ae88 in /usr/bin/python)\nframe #23: <unknown function> + 0x284c4 (0xe47216b084c4 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #24: __libc_start_main + 0x98 (0xe47216b08598 in /lib/aarch64-linux-gnu/libc.so.6)\nframe #25: _start + 0x30 (0x5f6770 in /usr/bin/python)\n\n'

To execute this test, run the following from the base repo dir:
    python test/cpp_extensions/test_libtorch_agnostic.py TestLibtorchAgnosticCUDA.test_std_cuda_check_error_show_cpp_stacktraces_True_cuda
```

Pull Request resolved: pytorch#176085
Approved by: https://github.com/eqy
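One way a test could tolerate both stack trace formats is sketched below. This is an illustrative assumption, not the change made in pytorch#176085: the helper name `has_cpp_stacktrace` and the sample messages are hypothetical, and only the marker strings come from the commit message above.

```python
def has_cpp_stacktrace(message: str) -> bool:
    """Return True if `message` carries a C++ stack trace in either reported format.

    Hypothetical helper for illustration; the marker strings are the ones
    listed in the commit message, everything else is an assumption.
    """
    # Format reported on x86.
    if "C++ CapturedTraceback:" in message:
        return True
    # Format reported on aarch64: both markers must be present.
    return "Exception raised from" in message and "frame #" in message


# Shortened, made-up messages that mimic the two reported formats.
x86_like = "CUDA error: invalid device ordinal\nC++ CapturedTraceback:\n..."
aarch64_like = (
    "CUDA error: invalid device ordinal\n"
    "Exception raised from test_std_cuda_check_error at test_std_cuda_check.cu:23 "
    "(most recent call first):\n"
    "frame #0: c10::Error::Error(...)"
)
no_trace = "CUDA error: invalid device ordinal"
```

In a `unittest` test case, `self.assertTrue(has_cpp_stacktrace(str(err)))` could then replace a single `assertIn` on the x86-only marker.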

Add XPU support for distributed

96ebbb1

Signed-off-by: Cheng Penghui <penghui.cheng@intel.com>

daisyden reviewed Mar 19, 2025

daisyden approved these changes Mar 19, 2025

daisyden merged this pull request into daisyden:distributed_2.8 Mar 19, 2025

daisyden merged 1 commit into daisyden:distributed_2.8 from PenghuiCheng:distributed_2.8
