
Reduce fragility of GCDiagnosis tests#1668

Merged
pitrou merged 1 commit into dask:master from pitrou:gc_diagnosis_test_failure
Dec 28, 2017

Conversation

@pitrou
Member

pitrou commented Dec 28, 2017

This tries to fix the sporadic failure reported in #966 (comment), where it seems a leftover object from a previous test gets collected and triggers a large reduction in heap size.
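
For context, one generic way to make such heap- and timing-sensitive tests less fragile (a sketch only, not the exact change in this PR; `quiesced_gc` is a hypothetical helper) is to flush leftover garbage before measuring and keep automatic collections from firing mid-test:

```python
import gc
from contextlib import contextmanager

@contextmanager
def quiesced_gc():
    """Run a block with leftover garbage flushed and the cyclic GC paused.

    Collecting up front prevents objects left over from earlier tests
    from being reclaimed mid-measurement; disabling the collector keeps
    an automatic collection from pausing the block at an arbitrary point.
    """
    gc.collect()
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()
```

A test that asserts on heap-size deltas could then run its measured section under `with quiesced_gc(): ...`.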

@mrocklin
Member

I'm starting to think that many of our recent intermittent test failures may also be due to issues like this. Generally our test suite is not robust to irregular long pauses such as may be caused by garbage collection.

Do you have any thoughts on whether this is likely a problem and, if so, how we might address it?

@pitrou
Member Author

pitrou commented Dec 28, 2017

By "long pauses", do you mean such that GC collections might break timing-dependent tests?

@mrocklin
Member

> By "long pauses", do you mean such that GC collections might break timing-dependent tests?

It's a guess, but yes.

@pitrou
Member Author

pitrou commented Dec 28, 2017

Apparently a full gc.collect() at the end of the test suite takes around 200 ms (on my machine). That's indeed significant.
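
The measurement above is easy to reproduce (a sketch; the figure will vary with machine and heap contents):

```python
import gc
import time

# Time a full collection of all generations.
t0 = time.perf_counter()
unreachable = gc.collect()
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"gc.collect(): {unreachable} unreachable objects, {elapsed_ms:.1f} ms")
```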

@mrocklin
Member

I would not be surprised if the travis-ci machines were 10x slower at times.

@pitrou
Member Author

pitrou commented Dec 28, 2017

Yes, that's certainly possible.

@pitrou
Member Author

pitrou commented Dec 28, 2017

I may be mistaken, but I think this boils down to the fact that our test suite progressively leaks memory (i.e. Python objects)... I'm not sure why that is. My guess is that some objects (such as Scheduler, etc.) don't get properly terminated and are kept alive by a dangling thread. I'm unaware of other potential sources of leaks in our codebase.
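
A minimal illustration of that mechanism, under CPython semantics (`Scheduler` here is just a stand-in class, not the real distributed.Scheduler):

```python
import gc
import threading
import weakref

class Scheduler:
    """Stand-in for a heavyweight object such as a real scheduler."""

obj = Scheduler()
ref = weakref.ref(obj)
stop = threading.Event()

# The running thread references `obj` through its arguments, so the
# object stays alive even after we drop our own reference to it.
t = threading.Thread(target=lambda o, s: s.wait(), args=(obj, stop), daemon=True)
t.start()

del obj
gc.collect()
print("alive while thread runs:", ref() is not None)   # True

stop.set()
t.join()  # Thread.run() drops its target/args when it finishes
gc.collect()
print("alive after thread exits:", ref() is None)      # True
```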

@pitrou
Member Author

pitrou commented Dec 28, 2017

Some stats at the end of the test suite:

>>> proc = psutil.Process()
>>> pprint.pprint(proc.connections())
[pconn(fd=19, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='127.0.0.1', port=36121), raddr=(), status='LISTEN'),
 pconn(fd=29, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='127.0.0.1', port=37939), raddr=(), status='LISTEN'),
 pconn(fd=31, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='127.0.0.1', port=40929), raddr=(), status='LISTEN'),
 pconn(fd=37, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='127.0.0.1', port=37913), raddr=(), status='LISTEN')]
>>> pprint.pprint(proc.num_fds())
46
>>> pprint.pprint(proc.num_threads())
25
>>> pprint.pprint(threading.enumerate())
[<_MainThread(MainThread, started 139736273405696)>,
 <Thread(Threaded scatter(), started daemon 139735662053120)>,
 <Thread(Threaded scatter(), started daemon 139735670445824)>,
 <Thread(Threaded gather(), started daemon 139735908517632)>,
 <Thread(Threaded scatter(), started daemon 139735899600640)>,
 <Thread(Threaded gather(), started daemon 139735687231232)>,
 <Thread(Threaded scatter(), started daemon 139735653660416)>,
 <Thread(Threaded scatter(), started daemon 139735645267712)>,
 <Thread(Threaded gather(), started daemon 139735636875008)>,
 <Thread(AsyncProcess ForkServerProcess-249 watch message queue, started daemon 139735926613760)>,
 <Thread(AsyncProcess ForkServerProcess-250 watch message queue, started daemon 139734588311296)>,
 <Thread(AsyncProcess ForkServerProcess-249 watch process join, started daemon 139735192291072)>,
 <Thread(AsyncProcess ForkServerProcess-251 watch message queue, started daemon 139735217469184)>,
 <Thread(AsyncProcess ForkServerProcess-251 watch process join, started daemon 139735183898368)>,
 <Thread(AsyncProcess ForkServerProcess-250 watch process join, started daemon 139735209076480)>,
 <Thread(AsyncProcess ForkServerProcess-252 watch message queue, started daemon 139735200683776)>,
 <Thread(AsyncProcess ForkServerProcess-252 watch process join, started daemon 139735175505664)>,
 <Thread(Threaded map(), started daemon 139734613489408)>,
 <Thread(Threaded map(), started daemon 139734571525888)>,
 <Thread(ThreadPoolExecutor-2_0, started daemon 139734605096704)>,
 <Thread(ThreadPool worker 0, started daemon 139733399623424)>,
 <Thread(ThreadPoolExecutor-0_0, started daemon 139733349271296)>,
 <Thread(ThreadPoolExecutor-0_1, started daemon 139733366056704)>]
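
Dumps like the one above can be turned into an automated end-of-suite check; a minimal sketch (`report_leaked_threads` is a hypothetical helper, not an existing fixture):

```python
import threading

def report_leaked_threads(baseline_idents):
    """Return threads that were not present in the baseline snapshot.

    Take the baseline at session start with
    ``{t.ident for t in threading.enumerate()}`` and call this at the
    end of the suite to see what the tests left behind.
    """
    leaked = [t for t in threading.enumerate()
              if t.ident not in baseline_idents]
    for t in leaked:
        print(f"leaked thread: {t.name} (daemon={t.daemon})")
    return leaked
```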

@mrocklin
Member

FWIW I'd be fine removing the threaded map/scatter/gather code.

@pitrou
Member Author

pitrou commented Dec 28, 2017

Is it unimportant functionally?

@mrocklin
Member

Not really. It used to be interesting, but I think other solutions now exist that are more attractive. I don't think I've seen anyone use it recently.

@pitrou
Member Author

pitrou commented Dec 28, 2017

I think the iterable form is fine (though test_iterator_gather is skipped?), but the queue form is delicate. I could fix it, but only with a hack. Do you think it's worth keeping, or should we remove it?

@pitrou
Member Author

pitrou commented Dec 28, 2017

I'd favour removing it FWIW.

@pitrou
Member Author

pitrou commented Dec 28, 2017

In the meantime I'm also merging this PR.

@pitrou pitrou merged commit fd9d08a into dask:master Dec 28, 2017
@pitrou pitrou deleted the gc_diagnosis_test_failure branch December 28, 2017 16:59
@mrocklin
Member

Removing is fine with me.

@mrocklin
Member

Do you have thoughts on the other lingering issues?

@pitrou
Member Author

pitrou commented Dec 28, 2017

There are still tests leaking processes:
#1597 (comment)
