
test_dont_steal_unknown_functions failure #3574

Closed
jrbourbeau opened this issue Mar 13, 2020 · 4 comments
Labels
flaky test Intermittent failures on CI.

Comments

@jrbourbeau (Member)

xref https://travis-ci.org/github/dask/distributed/jobs/662107052

Full traceback:
______________________ test_dont_steal_unknown_functions _______________________

    def test_func():
        result = None
        workers = []
        with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:

            async def coro():
                with dask.config.set(config):
                    s = False
                    for i in range(5):
                        try:
                            s, ws = await start_cluster(
                                nthreads,
                                scheduler,
                                loop,
                                security=security,
                                Worker=Worker,
                                scheduler_kwargs=scheduler_kwargs,
                                worker_kwargs=worker_kwargs,
                            )
                        except Exception as e:
                            logger.error(
                                "Failed to start gen_cluster, retrying",
                                exc_info=True,
                            )
                        else:
                            workers[:] = ws
                            args = [s] + workers
                            break
                    if s is False:
                        raise Exception("Could not start cluster")
                    if client:
                        c = await Client(
                            s.address,
                            loop=loop,
                            security=security,
                            asynchronous=True,
                            **client_kwargs
                        )
                        args = [c] + args
                    try:
                        future = func(*args)
                        if timeout:
                            future = asyncio.wait_for(future, timeout)
                        result = await future
                        if s.validate:
                            s.validate_state()
                    finally:
                        if client and c.status not in ("closing", "closed"):
                            await c._close(fast=s.status == "closed")
                        await end_cluster(s, workers)
                        await asyncio.wait_for(cleanup_global_workers(), 1)

                    try:
                        c = await default_client()
                    except ValueError:
                        pass
                    else:
                        await c._close(fast=True)

                    for i in range(5):
                        if all(c.closed() for c in Comm._instances):
                            break
                        else:
                            await asyncio.sleep(0.05)
                    else:
                        L = [c for c in Comm._instances if not c.closed()]
                        Comm._instances.clear()
                        # raise ValueError("Unclosed Comms", L)
                        print("Unclosed Comms", L)

                    return result

            result = loop.run_sync(
>               coro, timeout=timeout * 2 if timeout else timeout
            )

distributed/utils_test.py:957:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

../../../miniconda/envs/test-environment/lib/python3.6/site-packages/tornado/ioloop.py:576: in run_sync
    return future_cell[0].result()
distributed/utils_test.py:927: in coro
    result = await future
../../../miniconda/envs/test-environment/lib/python3.6/asyncio/tasks.py:358: in wait_for
    return fut.result()
../../../miniconda/envs/test-environment/lib/python3.6/site-packages/tornado/gen.py:1147: in run
    yielded = self.gen.send(value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

c = <Client: not connected>
s = <Scheduler: "tcp://127.0.0.1:43315" processes: 0 cores: 0>
a = <Worker: 'tcp://127.0.0.1:35456', 0, closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>
b = <Worker: 'tcp://127.0.0.1:42222', 1, closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>

    @gen_cluster(client=True, nthreads=[("127.0.0.1", 1)] * 2)
    def test_dont_steal_unknown_functions(c, s, a, b):
        futures = c.map(inc, [1, 2], workers=a.address, allow_other_workers=True)
        yield wait(futures)
>       assert len(a.data) == 2, [len(a.data), len(b.data)]
E       AssertionError: [1, 1]
E       assert 1 == 2
E        +  where 1 = len(Buffer<<LRU: 28/5023005081 on dict>, <Func: serialize_bytelist<->deserialize_bytes <File: /home/travis/build/dask/distributed/dask-worker-space/worker-hbrk40rs/storage, mode="a", 0 elements>>>)
E        +    where Buffer<<LRU: 28/5023005081 on dict>, <Func: serialize_bytelist<->deserialize_bytes <File: /home/travis/build/dask/distributed/dask-worker-space/worker-hbrk40rs/storage, mode="a", 0 elements>>> = <Worker: 'tcp://127.0.0.1:35456', 0, running, stored: 1, running: 0/1, ready: 0, comm: 0, waiting: 0>.data

distributed/tests/test_steal.py:116: AssertionError
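The failing assertion expects both tasks, which were pinned to worker `a` with stealing still allowed, to end up on `a`; in the flaky runs one task gets stolen by `b` instead. A minimal standalone sketch of the same pattern, outside the `gen_cluster` harness (cluster setup here is a hypothetical stand-in, not the project's test fixture), might look like:

```python
from distributed import Client, wait

def inc(x):
    return x + 1

# In-process cluster with two single-threaded workers (illustrative setup).
client = Client(processes=False, n_workers=2, threads_per_worker=1)

# Pin both tasks to one worker, but leave stealing enabled.
preferred = list(client.scheduler_info()["workers"])[0]
futures = client.map(inc, [1, 2], workers=preferred, allow_other_workers=True)
wait(futures)

# Count how many results actually live on the preferred worker.
who_has = client.who_has(futures)
on_preferred = sum(preferred in addrs for addrs in who_has.values())
# The original test asserts this equals 2; the flaky failures show 1.
print("tasks on preferred worker:", on_preferred)
client.close()
```

The test's premise is that the scheduler should not steal tasks whose runtime is still unknown, so both pinned tasks should stay put.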
jrbourbeau added the "flaky test" label (Intermittent failures on CI) on Mar 13, 2020
@crusaderky (Collaborator)

crusaderky commented Apr 18, 2020

Revisited in #3706 to test that 95 out of 100 tasks hit the preferred worker, instead of 2 out of 2. In most cases it does 100 out of 100. I've seen it once randomly hit 67/100 and fail. And on Python 3.6 Linux (only) it hits exactly 50/50, which to me is a strong indication of the underlying functionality not working at all.
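The relaxed check described above replaces an exact 2-of-2 assertion with a statistical threshold. A hedged sketch of that assertion style (helper name and numbers are illustrative, drawn from the placements reported in this comment):

```python
def enough_on_preferred(hits, total=100, threshold=95):
    # Hypothetical helper: pass when at least `threshold` of `total`
    # pinned tasks stayed on the preferred worker.
    return hits >= threshold

# Placements reported in the thread:
assert enough_on_preferred(100)      # common case: every task stays put
assert enough_on_preferred(95)       # boundary case still passes
assert not enough_on_preferred(67)   # the one-off random failure
assert not enough_on_preferred(50)   # Python 3.6 Linux: an even 50/50 split,
                                     # suggesting the pin is ignored entirely
```

A 50/50 split across two workers is exactly what uniform stealing would produce, which is why it reads as the feature not working rather than as noise.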

@jakirkham (Member)

Have been seeing this fail more frequently of late as well.

@crusaderky (Collaborator)

I've xfail'ed it in #3729
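Marking a known-flaky test `xfail` keeps it in the suite while recording that failure is expected. A minimal sketch of the pattern (the reason text and placeholder body are illustrative, not the actual change in #3729):

```python
import pytest

@pytest.mark.xfail(reason="flaky: stealing sometimes moves tasks off the preferred worker")
def test_dont_steal_unknown_functions():
    # Placeholder body standing in for the real distributed test.
    assert False
```

With the default non-strict `xfail`, a run that happens to pass is reported as XPASS rather than as an error, so the test keeps producing signal without breaking CI.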

@fjetter (Member)

fjetter commented Feb 17, 2022

Test no longer exists

@fjetter fjetter closed this as completed Feb 17, 2022