Added explicit worker GC (throttled) #1255
```python
if key not in self.types:
    self.types[key] = type(value)

self.nbytes[key] = sizeof(value)
```
Note: I changed this because it is probably safer to just update the `nbytes` value instead of relying on it not changing. For `self.types` this was already done that way, and the `if` was probably a leftover.
We assume immutability. Additionally, `sizeof` can take some time, while `type` is free.
Sorry for the delay here, I was away at a conference. A couple of concerns:
Some thoughts:
```python
import re
import shutil
import socket
import time
```
Use `from .metrics import time` here instead. The Windows `time.time` function has a 1s precision limit.
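For interval measurements like the throttled GC below, the key point is a clock with sub-second resolution. A rough sketch of such a shim, purely illustrative and not necessarily how `distributed/metrics.py` implements it:

```python
# Illustrative only: pick a finer-grained timer for measuring short intervals.
# This is an assumption for explanation, not the actual distributed.metrics code.
import sys
import time as _time

if hasattr(_time, "perf_counter"):
    # Python 3.3+: high-resolution and monotonic, ideal for interval timing.
    time = _time.perf_counter
elif sys.platform.startswith("win"):
    # Python 2 on Windows: time.clock is much finer-grained than time.time.
    time = _time.clock
else:
    time = _time.time
```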
```python
       'key': key,
       'cause': cause})
self._throttledGC.collect(
    force_gc=True if nbytes_to_free > 10 * 2**20 else False)
```
Could probably drop the ternary statement and just replace it with the condition: `force_gc=nbytes_to_free > 10 * 2**20`.
I'm generally fine with the code here. Unfortunately it looks like the tests are not. This change causes a variety of tests to fail in interesting ways. I suspect some subtle multi-threading interaction.
I was trying to reproduce the problems locally, but the tests just work for me. There was one remaining issue regarding the usage of …
Yeah, welcome to the wonderful world of concurrent debugging. The virtual machines on travis-ci are very slow and so are quite good at catching subtle bugs that sneak by on faster machines. I recommend trying to run a few of the failing tests in a for loop like the following; this might help you catch some of the errors on your local machine.
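The exact loop from this comment is not preserved in this extract. A minimal sketch of the idea, with a placeholder test path rather than one of the actual failing tests:

```python
# Hypothetical example: hammer a single flaky test until it fails locally.
# The test path is a placeholder, not one of the tests that failed on Travis.
import subprocess

for i in range(100):
    code = subprocess.call(
        ["py.test", "-x", "distributed/tests/test_worker.py::test_some_flaky_case"]
    )
    if code != 0:
        print("failed on iteration %d" % i)
        break
```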
It's also entirely possible though that calling gc.collect explicitly like this is somehow dangerous.
How can garbage collection be dangerous? As far as I can see, the explicit garbage collection happens in places where an automatic garbage collection could trigger anyway, so it's not possible to rely on GC not happening, right? Or am I missing that there are periods with GC disabled?

I was rerunning all the failed tests (from the Travis Python 2.7 build) in a brute-force attempt (see commands below). I couldn't get the tests to fail, except for one.

Command lines for running the failed tests in brute force:
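The actual command lines were in a collapsed section and are not preserved here; they presumably looked roughly like this sketch, where the test module names are placeholders:

```python
# Hypothetical reconstruction of the brute-force reruns; module names are
# placeholders, not the actual tests that failed on Travis.
import subprocess

failing_modules = [
    "distributed/tests/test_worker.py",
    "distributed/tests/test_steal.py",
]

for path in failing_modules:
    for run in range(20):
        if subprocess.call(["py.test", "-x", path]) != 0:
            print("%s failed on run %d" % (path, run))
            break
```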
Yeah, I don't know. I suspect some subtle interaction.
Any further thoughts on this @bluenote10?
I'm still puzzled why the tests fail on Travis but not locally. The logs on Travis also don't look like actual test failures, but rather errors related to the testing infrastructure. I'm trying to get more output from Travis now to see if that helps.
There are a number of new failures in the tests, I suspect due to some change on travis-ci. I pushed a trivial change here to help us get a better understanding of the kinds of tests that are failing on master: #1317
Conflicts: distributed/worker.py
I've merged this into master. We'll see if problems persist.
Sorry, I've merged master into this. Not the other way around.
It looks like you've gotten tests to pass, but only by removing all of the actual functionality of the PR :/ Do you have any thoughts about how to trigger GC safely? Or other thoughts about how to achieve the same results while only relying on the standard GC operation?
I'm still wondering if the tests failing with explicit GC may hint at a potential cause of the memory leaks. Also, some failing tests were related to work stealing, and I think you already spotted an issue there. I will try to systematically re-enable GC here (after merging in the work stealing fix) and look at the pattern of failing tests again. However, I'll have to work on something else for a while, so it might take a bit until I can pick this up again.
OK, thank you for your continued engagement on this. Sorry it took me a while to engage myself.
```python
new_time = time()
if force_gc or new_time - self.last_collect > self.min_interval_in_sec:
    gc.collect()
    self.last_collect = new_time
```
As @pitrou said in another PR, GC can legitimately take more than several hundred milliseconds (maybe even seconds) on interpreters loaded with lots of small Python objects (e.g. in dask.bag with nested constructs). Therefore you might want to consider a strategy that records the time taken by the last call to gc.collect() and the time elapsed since the end of that call, and only calls gc.collect() again if less than 5% of the time would be spent inside gc.collect() calls in aggregate, instead of using a fixed absolute value for min_interval_in_sec.
Also, using time.monotonic() instead of time.time() under Python 3.5 and later will probably spare some rare but potentially hard-to-debug issues in case of leap second events.
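A rough sketch of that strategy, keeping only the 5% budget and the names mentioned above; everything else is an assumption for illustration, not the implementation that was eventually used:

```python
import gc
from time import monotonic  # Python 3.3+; fall back to time.time on older interpreters


class FractionalThrottledGC(object):
    """Call gc.collect() only while the aggregate share of time spent
    collecting stays below a fixed fraction (5% as suggested above)."""

    def __init__(self, max_gc_fraction=0.05):
        self.max_gc_fraction = max_gc_fraction
        self.last_gc_duration = 0.0
        self.last_gc_end = monotonic()

    def collect(self):
        now = monotonic()
        elapsed_since_last = now - self.last_gc_end
        # Fraction of recent wall-clock time spent inside the last collection.
        total = self.last_gc_duration + elapsed_since_last
        if total > 0 and self.last_gc_duration / total > self.max_gc_fraction:
            return  # collecting again now would exceed the GC time budget
        start = monotonic()
        gc.collect()
        self.last_gc_end = monotonic()
        self.last_gc_duration = self.last_gc_end - start
```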
Yes, good point. I think I'll just use your implementation, no need for two throttled GC implementations ;). I'll still have to check if this PR is needed at all with your modification.
…ed into feature/explicit_gc
This PR has gone stale. Closing for now.
This PR adds explicit garbage collection calls to mitigate the memory issues observed in #1015 and dask/zict#19.

The implementation is throttled so that it does not lead to excessive GC calls. As discussed in dask/zict#19, calling GC from inner scopes like `put_key_in_memory` or `update_data` would not collect the current `value`. The ideal solution would trigger it in the outermost scope whenever a large value has been persisted, after releasing the reference to the value. However, these places are tricky to find. Currently I'm triggering from `release_key`, `gather`, and `execute`, which is hopefully frequent enough.
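For reference, a minimal sketch of the throttling described here, using names that appear in the diffs above (the class name and default interval are assumptions; only the body of collect() is taken from the PR):

```python
# Sketch of the interval-based throttle; the worker would hold one instance
# as self._throttledGC. Class name and default interval are assumptions.
import gc
from time import time  # the review suggests distributed.metrics.time instead


class ThrottledGC(object):
    """Run gc.collect() at most once per min_interval_in_sec, unless forced."""

    def __init__(self, min_interval_in_sec=5):
        self.min_interval_in_sec = min_interval_in_sec
        self.last_collect = time()

    def collect(self, force_gc=False):
        new_time = time()
        if force_gc or new_time - self.last_collect > self.min_interval_in_sec:
            gc.collect()
            self.last_collect = new_time
```

Call sites such as release_key would then use something like `self._throttledGC.collect(force_gc=nbytes_to_free > 10 * 2**20)`, forcing a collection only when a sizable amount of memory (here more than 10 MiB) is about to be freed.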