This repository has been archived by the owner on Feb 10, 2021. It is now read-only.

test_dont_request_on_many_short_tasks CI failure #41

Open · jakirkham opened this issue Nov 3, 2017 · 11 comments · Fixed by #50

@jakirkham (Member)

Seeing this test failure on CI.

Test failure snippet
____________________ test_dont_request_on_many_short_tasks _____________________
loop = <tornado.platform.epoll.EPollIOLoop object at 0x7f8ba7153b00>
    def test_dont_request_on_many_short_tasks(loop):
        with SGECluster(scheduler_port=0) as cluster:
            adapt = Adaptive(cluster, interval=50, startup_cost=10)
            with Client(cluster, loop=loop) as client:
                cluster.scheduler.task_duration['slowinc'] = 0.001
                futures = client.map(slowinc, range(1000), delay=0.001)
    
                while not cluster.scheduler.workers:
                    sleep(0.01)
    
                for i in range(20):
                    sleep(0.1)
>                   assert len(cluster.workers) < 2
E                   AssertionError: assert 2 < 2
E                    +  where 2 = len({'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')})
E                    +    where {'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')} = <SGECluster: 2 workers>.workers
dask_drmaa/tests/test_adaptive.py:105: AssertionError
----------------------------- Captured stderr call -----------------------------
tornado.application - ERROR - Future <tornado.concurrent.Future object at 0x7f8ba715ac88> exception was never retrieved: Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 305, in connect
    **kwargs)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 183, in connect
    af, addr, stream = yield connector.start()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 82, in start
    self.try_connect(iter(self.primary_addrs))
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 97, in try_connect
    future = self.connect(af, addr)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 224, in _create_stream
    return stream.connect(addr)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/iostream.py", line 1128, in connect
    self._add_io_state(self.io_loop.WRITE)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/iostream.py", line 971, in _add_io_state
    self.fileno(), self._handle_events, self._state)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py", line 728, in add_handler
    self._impl.register(fd, events | self.ERROR)
ValueError: I/O operation on closed epoll object
tornado.application - ERROR - Future <tornado.concurrent.Future object at 0x7f8ba6a460f0> exception was never retrieved: Traceback (most recent call last):
  File "/dask-drmaa/dask_drmaa/tests/test_adaptive.py", line 105, in test_dont_request_on_many_short_tasks
    assert len(cluster.workers) < 2
AssertionError: assert 2 < 2
 +  where 2 = len({'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')})
 +    where {'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')} = <SGECluster: 2 workers>.workers
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/dask_drmaa-0.1.0-py3.6.egg/dask_drmaa/core.py", line 193, in stop_workers
    close_workers=True)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/scheduler.py", line 1951, in retire_workers
    yield [self.close_worker(worker=w) for w in workers]
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 828, in callback
    result_list.append(f.result())
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/scheduler.py", line 564, in close_worker
    yield r.terminate(report=False)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/core.py", line 453, in send_recv_from_rpc
    comm = yield self.live_comm()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/core.py", line 429, in live_comm
    connection_args=self.connection_args)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/comm/core.py", line 185, in connect
    quiet_exceptions=EnvironmentError)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 921, in with_timeout
    timeout, timeout_callback)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py", line 505, in add_timeout
    callback, *args, **kwargs)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py", line 921, in call_at
    heapq.heappush(self._timeouts, timeout)
TypeError: heap argument must be a list

ref: https://travis-ci.org/dask/dask-drmaa/jobs/296660993

jakirkham mentioned this issue Nov 3, 2017

@jakirkham (Member, Author)

To summarize the error above: the test expects there to be fewer than 2 workers, but there are exactly 2.

@jakirkham (Member, Author) commented Nov 3, 2017

Seems this is not really reproducible. 😕 Restarting made this error go away.

Edit: That said, it has recurred a few times. So it is not perfectly reproducible, but there is still some kind of issue here.

@jakirkham (Member, Author)

Saw a very similar error on Python 2.7, without the heap issue coming up. Snippet below, with a link to the relevant log afterwards.

Test failure snippet
____________________ test_dont_request_on_many_short_tasks _____________________
loop = <tornado.platform.epoll.EPollIOLoop object at 0x7fbf88ed9470>
    def test_dont_request_on_many_short_tasks(loop):
        with SGECluster(scheduler_port=0) as cluster:
            adapt = Adaptive(cluster, interval=50, startup_cost=10)
            with Client(cluster, loop=loop) as client:
                cluster.scheduler.task_duration['slowinc'] = 0.001
                futures = client.map(slowinc, range(1000), delay=0.001)
    
                while not cluster.scheduler.workers:
                    sleep(0.01)
    
                for i in range(20):
                    sleep(0.1)
>                   assert len(cluster.workers) < 2
E                   AssertionError: assert 2 < 2
E                    +  where 2 = len({'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')})
E                    +    where {'15.1': WorkerSpec(job_id='15.1', kwargs={'nativeSpecification': '', 'cpus': 1, 'memory': None, 'memory_fraction': 0...., 'memory': None, 'memory_fraction': 0.5}, stdout='/dask-drmaa/worker.16.1.out', stderr='/dask-drmaa/worker.16.1.err')} = <SGECluster: 2 workers>.workers
dask_drmaa/tests/test_adaptive.py:105: AssertionError
----------------------------- Captured stderr call -----------------------------
tornado.application - ERROR - Future <tornado.concurrent.Future object at 0x7fbf89076f60> exception was never retrieved: Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 309, in connect
    **kwargs)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 183, in connect
    af, addr, stream = yield connector.start()
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 82, in start
    self.try_connect(iter(self.primary_addrs))
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 97, in try_connect
    future = self.connect(af, addr)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/tcpclient.py", line 224, in _create_stream
    return stream.connect(addr)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/iostream.py", line 1128, in connect
    self._add_io_state(self.io_loop.WRITE)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/iostream.py", line 971, in _add_io_state
    self.fileno(), self._handle_events, self._state)
  File "/opt/anaconda/lib/python3.6/site-packages/tornado/ioloop.py", line 728, in add_handler
    self._impl.register(fd, events | self.ERROR)
ValueError: I/O operation on closed epoll object

ref: https://travis-ci.org/dask/dask-drmaa/jobs/305489948#L1929

@jakirkham (Member, Author)

Any ideas on this test, @mrocklin? I'm kind of at a loss. 😕 It seems to expect there to always be only 1 worker, but sometimes there are 2. Is this a reasonable expectation? Given how much trouble it is causing, I'm leaning towards marking it a known failure.

@mrocklin (Member)

My guess is that this happens when a worker is started but does not arrive quickly, so we start another one. Eventually both arrive and cause the test to fail. At first glance I don't see a nice solution.
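
For illustration, a minimal toy sketch of that race (hypothetical names throughout; this is not the actual dask-drmaa or Adaptive code). The scale-up decision only sees connected workers, so a job still queued in SGE is invisible and triggers another request:

import threading
import time

class ToyCluster:
    # Stand-in for SGECluster: a requested worker takes a while to "arrive".
    def __init__(self, startup_delay=0.2):
        self.startup_delay = startup_delay
        self.requested = 0   # jobs submitted to the queue
        self.connected = 0   # workers that have actually joined

    def start_worker(self):
        self.requested += 1
        def arrive():
            time.sleep(self.startup_delay)  # queued in SGE, not yet visible
            self.connected += 1
        threading.Thread(target=arrive).start()

def adaptive_step(cluster):
    # Decides from connected workers only; in-flight jobs are invisible,
    # so a slow-to-arrive worker triggers further requests.
    if cluster.connected < 1:
        cluster.start_worker()

cluster = ToyCluster()
for _ in range(3):           # stepping interval shorter than worker startup
    adaptive_step(cluster)
    time.sleep(0.05)
time.sleep(0.5)
print(cluster.connected)     # prints 3: several workers where only 1 was wanted

With interval=50 (ms) and real SGE queueing delays, two requests landing before the first worker connects would give exactly the <SGECluster: 2 workers> seen in the assertion above.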

@jakirkham (Member, Author)

That sounds plausible. Should we just up the sleep time before checking the number of workers then?

@mrocklin (Member)

I'm not sure where that sleep would be. We're probably calling cluster.add_worker when we already have a worker in flight in SGE.
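
If so, one possible guard (a sketch only, with hypothetical names; not how dask-drmaa currently tracks this) would be to count submitted-but-not-yet-connected jobs and fold them into the scale-up decision:

def workers_to_request(target, connected, pending):
    # pending: SGE jobs submitted but not yet joined to the scheduler
    return max(0, target - connected - pending)

# Target of 1 worker, none connected, one job already queued in SGE:
assert workers_to_request(1, connected=0, pending=1) == 0  # no double request
assert workers_to_request(1, connected=0, pending=0) == 1  # request one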

@jakirkham (Member, Author)

I was thinking before entering the for-loop, to give it some more time to equilibrate. Not sure if that would defeat the purpose of the test, though.

@mrocklin (Member)

Yeah, this test is trying to verify that we don't make more workers than we need. It turns out that sometimes we do, so I think this test genuinely fails. And adding a sleep during the period where we might have more than one worker destroys the spirit of the test, as you suggest.

@jakirkham (Member, Author)

Alright, I'm proposing we mark it as an xfail in PR #50. If we figure out how to fix it, we can always drop the xfail.
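
For reference, the marker would look something like this (the exact reason string in #50 may differ):

import pytest

@pytest.mark.xfail(reason="adaptive can start a second worker before the "
                          "first one arrives; see issue #41")
def test_dont_request_on_many_short_tasks(loop):
    ...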

@jakirkham (Member, Author)

While the test has been marked as a known failure, the underlying failure is still problematic. So I am reopening this issue so that it can be tracked.
