
test_gpu_metrics fails #4153

Closed
jakirkham opened this issue Oct 6, 2020 · 3 comments · Fixed by #4154

Comments

@jakirkham (Member)

What happened:

When running the test_gpu_metrics tests, I found that they failed due to an ImportError.

What you expected to happen:

I would have expected these tests to pass.

Minimal Complete Verifiable Example:

$ python -m pytest distributed/tests/test_gpu_metrics.py
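
Since the failure is a plain import error, the following one-liner should reproduce it outside pytest (module path and name taken from the traceback below):

$ python -c "from distributed.diagnostics.nvml import handles"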

Anything else we need to know?:

Test failure:
============================= test session starts ==============================
platform linux -- Python 3.8.5, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
rootdir: /datasets/jkirkham/devel/distributed, configfile: setup.cfg
plugins: asyncio-0.12.0
collected 1 item                                                               

distributed/tests/test_gpu_metrics.py F                                  [100%]

=================================== FAILURES ===================================
_______________________________ test_gpu_metrics _______________________________

    def test_func():
        result = None
        workers = []
        with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:
    
            async def coro():
                with dask.config.set(config):
                    s = False
                    for i in range(5):
                        try:
                            s, ws = await start_cluster(
                                nthreads,
                                scheduler,
                                loop,
                                security=security,
                                Worker=Worker,
                                scheduler_kwargs=scheduler_kwargs,
                                worker_kwargs=worker_kwargs,
                            )
                        except Exception as e:
                            logger.error(
                                "Failed to start gen_cluster, retrying",
                                exc_info=True,
                            )
                            await asyncio.sleep(1)
                        else:
                            workers[:] = ws
                            args = [s] + workers
                            break
                    if s is False:
                        raise Exception("Could not start cluster")
                    if client:
                        c = await Client(
                            s.address,
                            loop=loop,
                            security=security,
                            asynchronous=True,
                            **client_kwargs,
                        )
                        args = [c] + args
                    try:
                        future = func(*args)
                        if timeout:
                            future = asyncio.wait_for(future, timeout)
                        result = await future
                        if s.validate:
                            s.validate_state()
                    finally:
                        if client and c.status not in ("closing", "closed"):
                            await c._close(fast=s.status == Status.closed)
                        await end_cluster(s, workers)
                        await asyncio.wait_for(cleanup_global_workers(), 1)
    
                    try:
                        c = await default_client()
                    except ValueError:
                        pass
                    else:
                        await c._close(fast=True)
    
                    def get_unclosed():
                        return [c for c in Comm._instances if not c.closed()] + [
                            c
                            for c in _global_clients.values()
                            if c.status != "closed"
                        ]
    
                    try:
                        start = time()
                        while time() < start + 5:
                            gc.collect()
                            if not get_unclosed():
                                break
                            await asyncio.sleep(0.05)
                        else:
                            if allow_unclosed:
                                print(f"Unclosed Comms: {get_unclosed()}")
                            else:
                                raise RuntimeError("Unclosed Comms", get_unclosed())
                    finally:
                        Comm._instances.clear()
                        _global_clients.clear()
    
                    return result
    
>           result = loop.run_sync(
                coro, timeout=timeout * 2 if timeout else timeout
            )

distributed/utils_test.py:953: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../miniconda/envs/rapids16dev/lib/python3.8/site-packages/tornado/ioloop.py:532: in run_sync
    return future_cell[0].result()
distributed/utils_test.py:912: in coro
    result = await future
../../miniconda/envs/rapids16dev/lib/python3.8/asyncio/tasks.py:483: in wait_for
    return fut.result()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

s = <Scheduler: "tcp://127.0.0.1:39423" processes: 0 cores: 0>
a = <Worker: 'tcp://127.0.0.1:41823', 0, Status.closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker: 'tcp://127.0.0.1:38913', 1, Status.closed, stored: 0, running: 0/2, ready: 0, comm: 0, waiting: 0>

    @gen_cluster()
    async def test_gpu_metrics(s, a, b):
>       from distributed.diagnostics.nvml import handles
E       ImportError: cannot import name 'handles' from 'distributed.diagnostics.nvml' (/datasets/jkirkham/devel/distributed/distributed/diagnostics/nvml.py)

distributed/tests/test_gpu_metrics.py:9: ImportError
----------------------------- Captured stderr call -----------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:39423
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:41823
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:41823
distributed.worker - INFO -          dashboard at:            127.0.0.1:45689
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:39423
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    1.08 TB
distributed.worker - INFO -       Local Directory: /raid/jkirkham/tmp/dask/dask-worker-space/worker-zot1ktl_
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:38913
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:38913
distributed.worker - INFO -          dashboard at:            127.0.0.1:46725
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:39423
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    1.08 TB
distributed.worker - INFO -       Local Directory: /raid/jkirkham/tmp/dask/dask-worker-space/worker-zhp1iv2t
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:41823', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:41823
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:39423
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:38913', name: 1, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:38913
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:39423
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:41823
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:38913
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:41823', name: 0, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:41823
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:38913', name: 1, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:38913
distributed.scheduler - INFO - Lost all workers
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
=============================== warnings summary ===============================
../../miniconda/envs/rapids16dev/lib/python3.8/site-packages/pluggy/callers.py:187
  /datasets/jkirkham/miniconda/envs/rapids16dev/lib/python3.8/site-packages/pluggy/callers.py:187: DeprecationWarning: `type` argument to addoption() is the string 'float',  but when supplied should be a type (for example `str` or `int`). (options: ('--leaks-timeout',))
    res = hook_impl.function(*args)

distributed/tests/test_gpu_metrics.py::test_gpu_metrics
  /datasets/jkirkham/miniconda/envs/rapids16dev/lib/python3.8/site-packages/aiohttp/helpers.py:107: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead
    def noop(*args, **kwargs):  # type: ignore

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================= slowest 10 durations =============================
0.58s call     distributed/tests/test_gpu_metrics.py::test_gpu_metrics

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
======================== 1 failed, 2 warnings in 1.09s =========================

Environment:

  • Dask version: 2.29.0
  • Distributed version: commit a1dc5f4
  • Python version: 3.8.5
  • Operating System: Linux
  • Install method (conda, pip, source): Conda
@jakirkham (Member Author)

cc @quasiben (for visibility)

@quasiben (Member) commented Oct 6, 2020

I think this is just an error in the test. I'll fix it now.
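
As a general pattern, a test like this can skip rather than error when the symbol it needs is missing. A minimal sketch, assuming the intent is to guard the import at module level (this is not necessarily the actual change made in #4154):

    # Hypothetical module-level guard for distributed/tests/test_gpu_metrics.py;
    # not necessarily what PR #4154 does.
    import pytest

    # Skip the whole test module if the NVML diagnostics module cannot be imported.
    nvml = pytest.importorskip("distributed.diagnostics.nvml")

    # `handles` is the name whose import failed in the traceback above; skip
    # if this version of distributed does not define it.
    if not hasattr(nvml, "handles"):
        pytest.skip(
            "distributed.diagnostics.nvml.handles is unavailable",
            allow_module_level=True,
        )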

@jakirkham (Member Author)

Started a fix in PR #4154.
