
gpu CI failing pretty consistently with segfault #8194

Closed
jrbourbeau opened this issue Sep 18, 2023 · 3 comments · Fixed by #8201
Labels
tests (Unit tests and/or continuous integration)

Comments

@jrbourbeau
Member

I've noticed gpuCI has been failing pretty consistently (for example, this build and this build) with a segfault. Note that the tests themselves actually pass -- the crash must be happening in some extra step after the test session (a sketch for capturing a native traceback at that point is included below the log).

09:26:31 SKIPPED [2] distributed/diagnostics/tests/test_nvml.py:107: Less than two GPUs available
09:26:31 SKIPPED [1] distributed/shuffle/tests/test_shuffle.py:134: could not import 'dask_cudf': No module named 'dask_cudf'
09:26:31 XFAIL distributed/comm/tests/test_ucx.py::test_cuda_context - If running on Docker, requires --pid=host
09:26:31 XFAIL distributed/diagnostics/tests/test_nvml.py::test_has_cuda_context - If running on Docker, requires --pid=host
09:26:31 XFAIL distributed/tests/test_nanny.py::test_no_unnecessary_imports_on_worker[pandas] - distributed#5723
09:26:31 ==== 372 passed, 8 skipped, 3702 deselected, 3 xfailed in 257.84s (0:04:17) ====
09:26:32 continuous_integration/gpuci/build.sh: line 58:   173 Segmentation fault      (core dumped) py.test distributed -v -m gpu --runslow --junitxml="$WORKSPACE/junit-distributed.xml"
09:26:33 Build step 'Execute shell' marked build as failure
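
One option for narrowing down where that extra step goes wrong (this is an assumption about the failure mode, not something from the CI logs) is to force a native traceback at the moment of the crash with the standard library's faulthandler module, e.g. from a hypothetical conftest.py snippet:

    # conftest.py (hypothetical) -- dump a traceback for every thread if the
    # interpreter hits a fatal signal (e.g. SIGSEGV) after the tests have passed.
    import sys
    import faulthandler

    faulthandler.enable(file=sys.stderr, all_threads=True)

Recent pytest versions also enable faulthandler by default, so similar output may already appear in the Jenkins console log.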

cc @charlesbluca @quasiben @dask/gpu

@jrbourbeau added the tests (Unit tests and/or continuous integration) label on Sep 18, 2023
@charlesbluca
Member

Thanks for raising this @jrbourbeau - I'll see if I'm able to reproduce the segfaults locally and follow up here with more info.

@charlesbluca
Member

I was able to reproduce the segfaults locally - I'm not sure if these are coming from multiple tests, but one in particular that seems to trigger them is distributed/comm/tests/test_ucx.py::test_ping_pong_cupy, which at least with the latest GPU CI environment isn't able to complete locally for me (a minimal way to run just this test in isolation is sketched after the log):

======================================================================== short test summary info ========================================================================
FAILED distributed/comm/tests/test_ucx.py::test_ping_pong_cupy[shape1] - cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
FAILED distributed/comm/tests/test_ucx.py::test_ping_pong_cupy[shape2] - cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
ERROR distributed/comm/tests/test_ucx.py::test_ping_pong_cupy[shape1] - ucp._libs.exceptions.UCXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationConte...
ERROR distributed/comm/tests/test_ucx.py::test_ping_pong_cupy[shape2] - ucp._libs.exceptions.UCXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationConte...
================================================================= 2 failed, 1 passed, 2 errors in 3.76s =================================================================
Fatal Python error: Segmentation fault

Current thread 0x00007fb8774fe740 (most recent call first):
  Garbage-collecting
  <no Python frame>
Segmentation fault (core dumped)
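
For reference, a minimal way to run just that test in isolation, mirroring the flags from continuous_integration/gpuci/build.sh (this assumes the gpuCI environment is active; the script name is made up):

    # run_suspect_test.py (hypothetical) -- invoke pytest programmatically on the
    # single test that appears to trigger the crash, with the same marker/flags
    # used by the gpuCI build script.
    import sys
    import pytest

    sys.exit(
        pytest.main(
            [
                "distributed/comm/tests/test_ucx.py::test_ping_pong_cupy",
                "-v",
                "-m", "gpu",
                "--runslow",
            ]
        )
    )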

Here are the associated failures/errors:

___________________________________________________________ ERROR at teardown of test_ping_pong_cupy[shape1] ____________________________________________________________

    @pytest.fixture(scope="function")
    def ucx_loop():
        """Allows UCX to cancel progress tasks before closing event loop.
    
        When UCX tasks are not completed in time (e.g., by unexpected Endpoint
        closure), clean up tasks before closing the event loop to prevent unwanted
        errors from being raised.
        """
        ucp = pytest.importorskip("ucp")
    
        loop = asyncio.new_event_loop()
        loop.set_exception_handler(ucx_exception_handler)
        ucp.reset()
        yield loop
>       ucp.reset()

distributed/utils_test.py:2151: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def reset():
        """Resets the UCX library by shutting down all of UCX.
    
        The library is initiated at next API call.
        """
        global _ctx
        if _ctx is not None:
            weakref_ctx = weakref.ref(_ctx)
            _ctx = None
            gc.collect()
            if weakref_ctx() is not None:
                msg = (
                    "Trying to reset UCX but not all Endpoints and/or Listeners "
                    "are closed(). The following objects are still referencing "
                    "ApplicationContext: "
                )
                for o in gc.get_referrers(weakref_ctx()):
                    msg += "\n  %s" % str(o)
>               raise UCXError(msg)
E               ucp._libs.exceptions.UCXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationContext: 
E                 {'_ep': <ucp._libs.ucx_api.UCXEndpoint object at 0x7fb80f628eb0>, '_ctx': <ucp.core.ApplicationContext object at 0x7fb80e1e1ff0>, '_send_count': 3, '_recv_count': 3, '_finished_recv_count': 3, '_shutting_down_peer': False, '_close_after_n_recv': None, '_tags': {'msg_send': 10817624050753198211, 'msg_recv': 6614375631065549451, 'ctrl_send': 8811442553812381923, 'ctrl_recv': 16219165296038974008}}
E                 {'_ep': <ucp._libs.ucx_api.UCXEndpoint object at 0x7fb80f629070>, '_ctx': <ucp.core.ApplicationContext object at 0x7fb80e1e1ff0>, '_send_count': 3, '_recv_count': 3, '_finished_recv_count': 3, '_shutting_down_peer': False, '_close_after_n_recv': None, '_tags': {'msg_send': 6614375631065549451, 'msg_recv': 10817624050753198211, 'ctrl_send': 16219165296038974008, 'ctrl_recv': 8811442553812381923}}

/datasets/charlesb/micromamba/envs/distributed-gpuci-py310/lib/python3.10/site-packages/ucp/core.py:947: UCXError
______________________________________________________________________ test_ping_pong_cupy[shape1] ______________________________________________________________________

ucx_loop = <_UnixSelectorEventLoop running=False closed=False debug=False>, shape = (10, 10)

    @pytest.mark.parametrize("shape", [(100,), (10, 10), (4947,)])
    @gen_test()
    async def test_ping_pong_cupy(ucx_loop, shape):
        cupy = pytest.importorskip("cupy")
        com, serv_com = await get_comm_pair()
    
>       arr = cupy.random.random(shape)

distributed/comm/tests/test_ucx.py:224: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/datasets/charlesb/micromamba/envs/distributed-gpuci-py310/lib/python3.10/site-packages/cupy/random/_sample.py:156: in random_sample
    return rs.random_sample(size=size, dtype=dtype)
/datasets/charlesb/micromamba/envs/distributed-gpuci-py310/lib/python3.10/site-packages/cupy/random/_generator.py:619: in random_sample
    RandomState._mod1_kernel(out)
cupy/_core/_kernel.pyx:921: in cupy._core._kernel.ElementwiseKernel.__call__
    ???
cupy/cuda/function.pyx:237: in cupy.cuda.function.Function.linear_launch
    ???
cupy/cuda/function.pyx:205: in cupy.cuda.function._launch
    ???
cupy_backends/cuda/api/driver.pyx:253: in cupy_backends.cuda.api.driver.launchKernel
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle

cupy_backends/cuda/api/driver.pyx:60: CUDADriverError

My first assumption is that we're running into some issues around UCX cleanup? I can try bisecting Distributed to see if we can isolate this to a specific commit.
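
For context, the teardown error (as opposed to the segfault itself) seems to come from the two endpoints created by get_comm_pair() still being alive when the ucx_loop fixture calls ucp.reset(), because the test raises before it gets a chance to close them. Here's a hypothetical variant that always closes both comms, assuming distributed's async Comm.close() API -- sketched only to illustrate the cleanup path, not as a fix for the crash:

    # Hypothetical variant of test_ping_pong_cupy: close both comms even if the
    # test body raises, so ucp.reset() in the ucx_loop teardown no longer finds
    # objects referencing ApplicationContext.
    import pytest

    from distributed.utils_test import gen_test


    @pytest.mark.parametrize("shape", [(100,), (10, 10), (4947,)])
    @gen_test()
    async def test_ping_pong_cupy_closes_comms(ucx_loop, shape):
        cupy = pytest.importorskip("cupy")
        com, serv_com = await get_comm_pair()  # helper defined in test_ucx.py
        try:
            arr = cupy.random.random(shape)  # currently raises CUDA_ERROR_INVALID_HANDLE
            ...  # ping/pong exchange as in the original test
        finally:
            await com.close()
            await serv_com.close()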

@charlesbluca
Member

I'm able to reproduce things, but I'm having some difficulty bisecting to any specific cause 😕 Eyeballing this successful GPU CI run from a few weeks ago, I tried rolling back to older versions of UCX/UCX-Py with Distributed 2023.7.1 and still saw the segfaults.
