
gpu CI failing pretty consistently with segfault #8194

Closed
jrbourbeau opened this issue Sep 18, 2023 · 3 comments · Fixed by #8201
Labels
tests (Unit tests and/or continuous integration)

Comments

@jrbourbeau
Member

I've noticed gpuCI has been failing pretty consistently (for example, this build and this build) with a segfault. Note that the tests themselves actually pass -- the crash must be happening in some extra step after the test session (a sketch for capturing a native traceback at that point is included below the log).

09:26:31 SKIPPED [2] distributed/diagnostics/tests/test_nvml.py:107: Less than two GPUs available
09:26:31 SKIPPED [1] distributed/shuffle/tests/test_shuffle.py:134: could not import 'dask_cudf': No module named 'dask_cudf'
09:26:31 XFAIL distributed/comm/tests/test_ucx.py::test_cuda_context - If running on Docker, requires --pid=host
09:26:31 XFAIL distributed/diagnostics/tests/test_nvml.py::test_has_cuda_context - If running on Docker, requires --pid=host
09:26:31 XFAIL distributed/tests/test_nanny.py::test_no_unnecessary_imports_on_worker[pandas] - distributed#5723
09:26:31 ==== 372 passed, 8 skipped, 3702 deselected, 3 xfailed in 257.84s (0:04:17) ====
09:26:32 continuous_integration/gpuci/build.sh: line 58:   173 Segmentation fault      (core dumped) py.test distributed -v -m gpu --runslow --junitxml="$WORKSPACE/junit-distributed.xml"
09:26:33 Build step 'Execute shell' marked build as failure
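
One option for narrowing down where that extra step goes wrong (this is an assumption about the failure mode, not something from the CI logs) is to force a native traceback at the moment of the crash with the standard library's faulthandler module, e.g. from a hypothetical conftest.py snippet:

    # conftest.py (hypothetical) -- dump a traceback for every thread if the
    # interpreter hits a fatal signal (e.g. SIGSEGV) after the tests have passed.
    import sys
    import faulthandler

    faulthandler.enable(file=sys.stderr, all_threads=True)

Recent pytest versions also enable faulthandler by default, so similar output may already appear in the Jenkins console log.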

cc @charlesbluca @quasiben @dask/gpu

@jrbourbeau added the tests (Unit tests and/or continuous integration) label on Sep 18, 2023
@charlesbluca
Member

Thanks for raising this @jrbourbeau - I'll see if I'm able to reproduce the segfaults locally and follow up here with more info.

@charlesbluca
Member

I was able to reproduce the segfaults locally - I'm not sure if these are coming from multiple tests, but one in particular that seems to trigger them is distributed/comm/tests/test_ucx.py::test_ping_pong_cupy, which at least with the latest GPU CI environment isn't able to complete locally for me (a minimal way to run just this test in isolation is sketched after the log):

======================================================================== short test summary info ========================================================================
FAILED distributed/comm/tests/test_ucx.py::test_ping_pong_cupy[shape1] - cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
FAILED distributed/comm/tests/test_ucx.py::test_ping_pong_cupy[shape2] - cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
ERROR distributed/comm/tests/test_ucx.py::test_ping_pong_cupy[shape1] - ucp._libs.exceptions.UCXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationConte...
ERROR distributed/comm/tests/test_ucx.py::test_ping_pong_cupy[shape2] - ucp._libs.exceptions.UCXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationConte...
================================================================= 2 failed, 1 passed, 2 errors in 3.76s =================================================================
Fatal Python error: Segmentation fault

Current thread 0x00007fb8774fe740 (most recent call first):
  Garbage-collecting
  <no Python frame>
Segmentation fault (core dumped)
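
For reference, a minimal way to run just that test in isolation, mirroring the flags from continuous_integration/gpuci/build.sh (this assumes the gpuCI environment is active; the script name is made up):

    # run_suspect_test.py (hypothetical) -- invoke pytest programmatically on the
    # single test that appears to trigger the crash, with the same marker/flags
    # used by the gpuCI build script.
    import sys
    import pytest

    sys.exit(
        pytest.main(
            [
                "distributed/comm/tests/test_ucx.py::test_ping_pong_cupy",
                "-v",
                "-m", "gpu",
                "--runslow",
            ]
        )
    )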

Here are the associated failures/errors:

___________________________________________________________ ERROR at teardown of test_ping_pong_cupy[shape1] ____________________________________________________________

    @pytest.fixture(scope="function")
    def ucx_loop():
        """Allows UCX to cancel progress tasks before closing event loop.
    
        When UCX tasks are not completed in time (e.g., by unexpected Endpoint
        closure), clean up tasks before closing the event loop to prevent unwanted
        errors from being raised.
        """
        ucp = pytest.importorskip("ucp")
    
        loop = asyncio.new_event_loop()
        loop.set_exception_handler(ucx_exception_handler)
        ucp.reset()
        yield loop
>       ucp.reset()

distributed/utils_test.py:2151: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def reset():
        """Resets the UCX library by shutting down all of UCX.
    
        The library is initiated at next API call.
        """
        global _ctx
        if _ctx is not None:
            weakref_ctx = weakref.ref(_ctx)
            _ctx = None
            gc.collect()
            if weakref_ctx() is not None:
                msg = (
                    "Trying to reset UCX but not all Endpoints and/or Listeners "
                    "are closed(). The following objects are still referencing "
                    "ApplicationContext: "
                )
                for o in gc.get_referrers(weakref_ctx()):
                    msg += "\n  %s" % str(o)
>               raise UCXError(msg)
E               ucp._libs.exceptions.UCXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationContext: 
E                 {'_ep': <ucp._libs.ucx_api.UCXEndpoint object at 0x7fb80f628eb0>, '_ctx': <ucp.core.ApplicationContext object at 0x7fb80e1e1ff0>, '_send_count': 3, '_recv_count': 3, '_finished_recv_count': 3, '_shutting_down_peer': False, '_close_after_n_recv': None, '_tags': {'msg_send': 10817624050753198211, 'msg_recv': 6614375631065549451, 'ctrl_send': 8811442553812381923, 'ctrl_recv': 16219165296038974008}}
E                 {'_ep': <ucp._libs.ucx_api.UCXEndpoint object at 0x7fb80f629070>, '_ctx': <ucp.core.ApplicationContext object at 0x7fb80e1e1ff0>, '_send_count': 3, '_recv_count': 3, '_finished_recv_count': 3, '_shutting_down_peer': False, '_close_after_n_recv': None, '_tags': {'msg_send': 6614375631065549451, 'msg_recv': 10817624050753198211, 'ctrl_send': 16219165296038974008, 'ctrl_recv': 8811442553812381923}}

/datasets/charlesb/micromamba/envs/distributed-gpuci-py310/lib/python3.10/site-packages/ucp/core.py:947: UCXError
______________________________________________________________________ test_ping_pong_cupy[shape1] ______________________________________________________________________

ucx_loop = <_UnixSelectorEventLoop running=False closed=False debug=False>, shape = (10, 10)

    @pytest.mark.parametrize("shape", [(100,), (10, 10), (4947,)])
    @gen_test()
    async def test_ping_pong_cupy(ucx_loop, shape):
        cupy = pytest.importorskip("cupy")
        com, serv_com = await get_comm_pair()
    
>       arr = cupy.random.random(shape)

distributed/comm/tests/test_ucx.py:224: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/datasets/charlesb/micromamba/envs/distributed-gpuci-py310/lib/python3.10/site-packages/cupy/random/_sample.py:156: in random_sample
    return rs.random_sample(size=size, dtype=dtype)
/datasets/charlesb/micromamba/envs/distributed-gpuci-py310/lib/python3.10/site-packages/cupy/random/_generator.py:619: in random_sample
    RandomState._mod1_kernel(out)
cupy/_core/_kernel.pyx:921: in cupy._core._kernel.ElementwiseKernel.__call__
    ???
cupy/cuda/function.pyx:237: in cupy.cuda.function.Function.linear_launch
    ???
cupy/cuda/function.pyx:205: in cupy.cuda.function._launch
    ???
cupy_backends/cuda/api/driver.pyx:253: in cupy_backends.cuda.api.driver.launchKernel
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   cupy_backends.cuda.api.driver.CUDADriverError: CUDA_ERROR_INVALID_HANDLE: invalid resource handle

cupy_backends/cuda/api/driver.pyx:60: CUDADriverError

My first assumption is that we're running into some issues around UCX cleanup? I can try bisecting Distributed to see if we can isolate this to a specific commit.
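
For context, the teardown error (as opposed to the segfault itself) seems to come from the two endpoints created by get_comm_pair() still being alive when the ucx_loop fixture calls ucp.reset(), because the test raises before it gets a chance to close them. Here's a hypothetical variant that always closes both comms, assuming distributed's async Comm.close() API -- sketched only to illustrate the cleanup path, not as a fix for the crash:

    # Hypothetical variant of test_ping_pong_cupy: close both comms even if the
    # test body raises, so ucp.reset() in the ucx_loop teardown no longer finds
    # objects referencing ApplicationContext.
    import pytest

    from distributed.utils_test import gen_test


    @pytest.mark.parametrize("shape", [(100,), (10, 10), (4947,)])
    @gen_test()
    async def test_ping_pong_cupy_closes_comms(ucx_loop, shape):
        cupy = pytest.importorskip("cupy")
        com, serv_com = await get_comm_pair()  # helper defined in test_ucx.py
        try:
            arr = cupy.random.random(shape)  # currently raises CUDA_ERROR_INVALID_HANDLE
            ...  # ping/pong exchange as in the original test
        finally:
            await com.close()
            await serv_com.close()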

@charlesbluca
Member

I'm able to reproduce things, but I'm having some difficulty bisecting to any specific cause 😕 Eyeballing this successful GPU CI run from a few weeks ago, I tried rolling back to older versions of UCX/UCX-Py with Distributed 2023.7.1 and still saw the segfaults.
