
Reduce fragility of GCDiagnosis tests#1668

Merged
pitrou merged 1 commit into dask:master from pitrou:gc_diagnosis_test_failure
Dec 28, 2017

Conversation

@pitrou
Member

pitrou commented Dec 28, 2017

This tries to fix the sporadic failure reported in #966 (comment), where it seems a leftover object from a previous test gets collected and triggers a large reduction in heap size.
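
For context, one generic way to make such heap- and timing-sensitive tests less fragile (a sketch only, not the exact change in this PR; `quiesced_gc` is a hypothetical helper) is to flush leftover garbage before measuring and keep automatic collections from firing mid-test:

```python
import gc
from contextlib import contextmanager

@contextmanager
def quiesced_gc():
    """Run a block with leftover garbage flushed and the cyclic GC paused.

    Collecting up front prevents objects left over from earlier tests
    from being reclaimed mid-measurement; disabling the collector keeps
    an automatic collection from pausing the block at an arbitrary point.
    """
    gc.collect()
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()
```

A test that asserts on heap-size deltas could then run its measured section under `with quiesced_gc(): ...`.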

@mrocklin
Member

I'm starting to think that many of our recent intermittent test failures may also be due to issues like this. Generally our test suite is not robust to irregular long pauses such as may be caused by garbage collection.

Do you have any thoughts on whether this is likely a problem and, if so, how we might address it?

@pitrou
Member Author

pitrou commented Dec 28, 2017

By "long pauses", do you mean such that GC collections might break timing-dependent tests?

@mrocklin
Member

> By "long pauses", do you mean such that GC collections might break timing-dependent tests?

It's a guess, but yes.

@pitrou
Member Author

pitrou commented Dec 28, 2017

Apparently a full gc.collect() at the end of the test suite takes around 200 ms (on my machine). That's indeed significant.
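
The measurement above is easy to reproduce (a sketch; the figure will vary with machine and heap contents):

```python
import gc
import time

# Time a full collection of all generations.
t0 = time.perf_counter()
unreachable = gc.collect()
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"gc.collect(): {unreachable} unreachable objects, {elapsed_ms:.1f} ms")
```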

@mrocklin
Member

I would not be surprised if the travis-ci machines were 10x slower at times.

@pitrou
Member Author

pitrou commented Dec 28, 2017

Yes, that's certainly possible.

@pitrou
Member Author

pitrou commented Dec 28, 2017

I may be mistaken, but I think this boils down to the fact that our test suite progressively leaks memory (i.e. Python objects)... I'm not sure why that is. My guess is that some objects (such as Scheduler, etc.) don't get properly terminated and are kept alive by a dangling thread. I'm unaware of other potential sources of leaks in our codebase.
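
A minimal illustration of that mechanism, under CPython semantics (`Scheduler` here is just a stand-in class, not the real distributed.Scheduler):

```python
import gc
import threading
import weakref

class Scheduler:
    """Stand-in for a heavyweight object such as a real scheduler."""

obj = Scheduler()
ref = weakref.ref(obj)
stop = threading.Event()

# The running thread references `obj` through its arguments, so the
# object stays alive even after we drop our own reference to it.
t = threading.Thread(target=lambda o, s: s.wait(), args=(obj, stop), daemon=True)
t.start()

del obj
gc.collect()
print("alive while thread runs:", ref() is not None)   # True

stop.set()
t.join()  # Thread.run() drops its target/args when it finishes
gc.collect()
print("alive after thread exits:", ref() is None)      # True
```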

@pitrou
Member Author

pitrou commented Dec 28, 2017

Some stats at the end of the test suite:

>>> proc = psutil.Process()
>>> pprint.pprint(proc.connections())
[pconn(fd=19, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='127.0.0.1', port=36121), raddr=(), status='LISTEN'),
 pconn(fd=29, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='127.0.0.1', port=37939), raddr=(), status='LISTEN'),
 pconn(fd=31, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='127.0.0.1', port=40929), raddr=(), status='LISTEN'),
 pconn(fd=37, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='127.0.0.1', port=37913), raddr=(), status='LISTEN')]
>>> pprint.pprint(proc.num_fds())
46
>>> pprint.pprint(proc.num_threads())
25
>>> pprint.pprint(threading.enumerate())
[<_MainThread(MainThread, started 139736273405696)>,
 <Thread(Threaded scatter(), started daemon 139735662053120)>,
 <Thread(Threaded scatter(), started daemon 139735670445824)>,
 <Thread(Threaded gather(), started daemon 139735908517632)>,
 <Thread(Threaded scatter(), started daemon 139735899600640)>,
 <Thread(Threaded gather(), started daemon 139735687231232)>,
 <Thread(Threaded scatter(), started daemon 139735653660416)>,
 <Thread(Threaded scatter(), started daemon 139735645267712)>,
 <Thread(Threaded gather(), started daemon 139735636875008)>,
 <Thread(AsyncProcess ForkServerProcess-249 watch message queue, started daemon 139735926613760)>,
 <Thread(AsyncProcess ForkServerProcess-250 watch message queue, started daemon 139734588311296)>,
 <Thread(AsyncProcess ForkServerProcess-249 watch process join, started daemon 139735192291072)>,
 <Thread(AsyncProcess ForkServerProcess-251 watch message queue, started daemon 139735217469184)>,
 <Thread(AsyncProcess ForkServerProcess-251 watch process join, started daemon 139735183898368)>,
 <Thread(AsyncProcess ForkServerProcess-250 watch process join, started daemon 139735209076480)>,
 <Thread(AsyncProcess ForkServerProcess-252 watch message queue, started daemon 139735200683776)>,
 <Thread(AsyncProcess ForkServerProcess-252 watch process join, started daemon 139735175505664)>,
 <Thread(Threaded map(), started daemon 139734613489408)>,
 <Thread(Threaded map(), started daemon 139734571525888)>,
 <Thread(ThreadPoolExecutor-2_0, started daemon 139734605096704)>,
 <Thread(ThreadPool worker 0, started daemon 139733399623424)>,
 <Thread(ThreadPoolExecutor-0_0, started daemon 139733349271296)>,
 <Thread(ThreadPoolExecutor-0_1, started daemon 139733366056704)>]
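
Dumps like the one above can be turned into an automated end-of-suite check; a minimal sketch (`report_leaked_threads` is a hypothetical helper, not an existing fixture):

```python
import threading

def report_leaked_threads(baseline_idents):
    """Return threads that were not present in the baseline snapshot.

    Take the baseline at session start with
    ``{t.ident for t in threading.enumerate()}`` and call this at the
    end of the suite to see what the tests left behind.
    """
    leaked = [t for t in threading.enumerate()
              if t.ident not in baseline_idents]
    for t in leaked:
        print(f"leaked thread: {t.name} (daemon={t.daemon})")
    return leaked
```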

@mrocklin
Member

FWIW I'd be fine removing the threaded map/scatter/gather code.

@pitrou
Member Author

pitrou commented Dec 28, 2017

Is it unimportant functionally?

@mrocklin
Member

Not really. It used to be interesting, but I think other solutions now exist that are more attractive. I don't think I've seen anyone use it recently.

@pitrou
Member Author

pitrou commented Dec 28, 2017

I think the iterable form is fine (though test_iterator_gather is skipped?), but the queue form is delicate. I could fix it, but only with a hack. Do you think it's worth keeping, or should we remove it?

@pitrou
Member Author

pitrou commented Dec 28, 2017

I'd favour removing it FWIW.

@pitrou
Member Author

pitrou commented Dec 28, 2017

In the meantime I'm also merging this PR.

@pitrou pitrou merged commit fd9d08a into dask:master Dec 28, 2017
@pitrou pitrou deleted the gc_diagnosis_test_failure branch December 28, 2017 16:59
@mrocklin
Member

Removing is fine with me.

@mrocklin
Member

Do you have thoughts on the other lingering issues?

@pitrou
Member Author

pitrou commented Dec 28, 2017

There are still tests leaking processes:
#1597 (comment)
