
test_gpu_metrics fails #4153

Closed
jakirkham opened this issue Oct 6, 2020 · 3 comments · Fixed by #4154

Comments

@jakirkham (Member)

What happened:

When running the test_gpu_metrics tests, I found that they failed due to an ImportError.

What you expected to happen:

I would have expected these tests to pass.

Minimal Complete Verifiable Example:

$ python -m pytest distributed/tests/test_gpu_metrics.py
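
Since the failure is a plain import error, the following one-liner should reproduce it outside pytest (module path and name taken from the traceback below):

$ python -c "from distributed.diagnostics.nvml import handles"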

Anything else we need to know?:

Test failure:
============================= test session starts ==============================
platform linux -- Python 3.8.5, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
rootdir: /datasets/jkirkham/devel/distributed, configfile: setup.cfg
plugins: asyncio-0.12.0
collected 1 item                                                               

distributed/tests/test_gpu_metrics.py F                                  [100%]

=================================== FAILURES ===================================
_______________________________ test_gpu_metrics _______________________________

    def test_func():
        result = None
        workers = []
        with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:
    
            async def coro():
                with dask.config.set(config):
                    s = False
                    for i in range(5):
                        try:
                            s, ws = await start_cluster(
                                nthreads,
                                scheduler,
                                loop,
                                security=security,
                                Worker=Worker,
                                scheduler_kwargs=scheduler_kwargs,
                                worker_kwargs=worker_kwargs,
                            )
                        except Exception as e:
                            logger.error(
                                "Failed to start gen_cluster, retrying",
                                exc_info=True,
                            )
                            await asyncio.sleep(1)
                        else:
                            workers[:] = ws
                            args = [s] + workers
                            break
                    if s is False:
                        raise Exception("Could not start cluster")
                    if client:
                        c = await Client(
                            s.address,
                            loop=loop,
                            security=security,
                            asynchronous=True,
                            **client_kwargs,
                        )
                        args = [c] + args
                    try:
                        future = func(*args)
                        if timeout:
                            future = asyncio.wait_for(future, timeout)
                        result = await future
                        if s.validate:
                            s.validate_state()
                    finally:
                        if client and c.status not in ("closing", "closed"):
                            await c._close(fast=s.status == Status.closed)
                        await end_cluster(s, workers)
                        await asyncio.wait_for(cleanup_global_workers(), 1)
    
                    try:
                        c = await default_client()
                    except ValueError:
                        pass
                    else:
                        await c._close(fast=True)
    
                    def get_unclosed():
                        return [c for c in Comm._instances if not c.closed()] + [
                            c
                            for c in _global_clients.values()
                            if c.status != "closed"
                        ]
    
                    try:
                        start = time()
                        while time() < start + 5:
                            gc.collect()
                            if not get_unclosed():
                                break
                            await asyncio.sleep(0.05)
                        else:
                            if allow_unclosed:
                                print(f"Unclosed Comms: {get_unclosed()}")
                            else:
                                raise RuntimeError("Unclosed Comms", get_unclosed())
                    finally:
                        Comm._instances.clear()
                        _global_clients.clear()
    
                    return result
    
>           result = loop.run_sync(
                coro, timeout=timeout * 2 if timeout else timeout
            )

distributed/utils_test.py:953: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../miniconda/envs/rapids16dev/lib/python3.8/site-packages/tornado/ioloop.py:532: in run_sync
    return future_cell[0].result()
distributed/utils_test.py:912: in coro
    result = await future
../../miniconda/envs/rapids16dev/lib/python3.8/asyncio/tasks.py:483: in wait_for
    return fut.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

s = <Scheduler: "tcp://127.0.0.1:39423" processes: 0 cores: 0>
a = <Worker: 'tcp://127.0.0.1:41823', 0, Status.closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker: 'tcp://127.0.0.1:38913', 1, Status.closed, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>

    @gen_cluster()
    async def test_gpu_metrics(s, a, b):
>       from distributed.diagnostics.nvml import handles
E       ImportError: cannot import name 'handles' from 'distributed.diagnostics.nvml' (/datasets/jkirkham/devel/distributed/distributed/diagnostics/nvml.py)

distributed/tests/test_gpu_metrics.py:9: ImportError
----------------------------- Captured stderr call -----------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:39423
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:41823
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:41823
distributed.worker - INFO -          dashboard at:            127.0.0.1:45689
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:39423
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    1.08 TB
distributed.worker - INFO -       Local Directory: /raid/jkirkham/tmp/dask/dask-worker-space/worker-zot1ktl_
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:38913
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:38913
distributed.worker - INFO -          dashboard at:            127.0.0.1:46725
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:39423
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    1.08 TB
distributed.worker - INFO -       Local Directory: /raid/jkirkham/tmp/dask/dask-worker-space/worker-zhp1iv2t
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:41823', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:41823
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:39423
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:38913', name: 1, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:38913
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:39423
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:41823
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:38913
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:41823', name: 0, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:41823
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:38913', name: 1, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:38913
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
=============================== warnings summary ===============================
../../miniconda/envs/rapids16dev/lib/python3.8/site-packages/pluggy/callers.py:187
  /datasets/jkirkham/miniconda/envs/rapids16dev/lib/python3.8/site-packages/pluggy/callers.py:187: DeprecationWarning: `type` argument to addoption() is the string 'float',  but when supplied should be a type (for example `str` or `int`). (options: ('--leaks-timeout',))
    res = hook_impl.function(*args)

distributed/tests/test_gpu_metrics.py::test_gpu_metrics
  /datasets/jkirkham/miniconda/envs/rapids16dev/lib/python3.8/site-packages/aiohttp/helpers.py:107: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead
    def noop(*args, **kwargs):  # type: ignore

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================= slowest 10 durations =============================
0.58s call     distributed/tests/test_gpu_metrics.py::test_gpu_metrics

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
======================== 1 failed, 2 warnings in 1.09s =========================

Environment:

  • Dask version: 2.29.0
  • Distributed version: commit a1dc5f4
  • Python version: 3.8.5
  • Operating System: Linux
  • Install method (conda, pip, source): Conda
@jakirkham (Member Author)

cc @quasiben (for visibility)

@quasiben (Member) commented Oct 6, 2020

I think this is just an error in the test. I'll fix it now.
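
As a general pattern, a test like this can skip rather than error when the symbol it needs is missing. A minimal sketch, assuming the intent is to guard the import at module level (this is not necessarily the actual change made in #4154):

    # Hypothetical module-level guard for distributed/tests/test_gpu_metrics.py;
    # not necessarily what PR #4154 does.
    import pytest

    # Skip the whole test module if the NVML diagnostics module cannot be imported.
    nvml = pytest.importorskip("distributed.diagnostics.nvml")

    # `handles` is the name whose import failed in the traceback above; skip
    # if this version of distributed does not define it.
    if not hasattr(nvml, "handles"):
        pytest.skip(
            "distributed.diagnostics.nvml.handles is unavailable",
            allow_module_level=True,
        )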

@jakirkham (Member Author)

Started a fix in PR #4154.
