
Added explicit worker GC (throttled) #1255

Closed
bluenote10 wants to merge 18 commits into dask:master from bluenote10:feature/explicit_gc

Conversation

@bluenote10
Contributor

This PR adds explicit garbage collections to mitigate the memory issues observed in #1015 and dask/zict#19.

The implementation is throttled so that it does not lead to excessive GC calls. As discussed in dask/zict#19, calling GC from inner scopes like put_key_in_memory or update_data would not collect the current value. The ideal solution would trigger it in the outermost scope whenever a large value has been persisted, after the reference to the value has been released. However, these places are tricky to find. Currently I'm triggering from release_key, gather, and execute, which is hopefully frequent enough.
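
For context, here is a minimal sketch of the throttling approach, reconstructed from the snippets reviewed below; exact names and defaults in the PR may differ:

```python
import gc
from time import time

class ThrottledGC(object):
    """Run gc.collect() at most once per min_interval_in_sec,
    unless a collection is explicitly forced."""

    def __init__(self, min_interval_in_sec=1.0):
        self.min_interval_in_sec = min_interval_in_sec
        self.last_collect = time()

    def collect(self, force_gc=False):
        # Force a collection after releasing a large value; otherwise
        # throttle to avoid excessive GC pauses on the event loop.
        new_time = time()
        if force_gc or new_time - self.last_collect > self.min_interval_in_sec:
            gc.collect()
            self.last_collect = new_time
```

A call site would then look like `self._throttledGC.collect(force_gc=nbytes_to_free > 10 * 2**20)`, forcing a collection only when a sizable amount of memory is expected to be freed.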

Comment thread distributed/worker.py Outdated
if key not in self.types:
    self.types[key] = type(value)

self.nbytes[key] = sizeof(value)
Contributor Author

Note: I changed this because it is probably safer to just update the nbytes value instead of relying on it never changing. For self.types this was already done that way, and the if was probably a leftover.

Member

We assume immutability. Additionally, sizeof can take some time, while type is free.

@mrocklin
Member

mrocklin commented Jul 17, 2017

Sorry for the delay here, I was away at a conference.

A couple of concerns:

  1. This puts GC on the main event loop thread. Does this matter? I wouldn't be surprised if gc.collect() held the GIL explicitly while collecting, but if it doesn't, then it would be far preferable to keep this off the main thread.
  2. Rather than make a new class for this, perhaps we can just add gc.collect (or some suitable sending of gc.collect to a thread) in a PeriodicCallback, as sketched below.
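
A minimal sketch of option 2, assuming Tornado's PeriodicCallback (which distributed's event loop is built on); the 1-second period is an illustrative choice, not a value from this PR:

```python
import gc

from tornado.ioloop import IOLoop, PeriodicCallback

# Run gc.collect() on the event loop once per second instead of
# triggering it from individual worker code paths.
pc = PeriodicCallback(gc.collect, callback_time=1000)  # milliseconds

def main():
    pc.start()
    IOLoop.current().start()

if __name__ == '__main__':
    main()
```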

@bluenote10
Contributor Author

Some thoughts:

  1. I would be very surprised if the GC does not hold the GIL during collection, i.e., it will have to briefly stop the event loop anyway.
  2. I was thinking about a periodic solution as well, but in the end it's not ideal. On the one hand, the period needs to be fairly small, because even ~1 second of heavy spill-to-disk swaps can produce large amounts of uncollected garbage, causing workers to run out of memory. On the other hand, triggering at 1-second intervals would probably have a measurable impact on CPU-bound tasks. A smart trigger would avoid both issues.

Comment thread distributed/utils.py Outdated
import re
import shutil
import socket
import time
Member

from .metrics import time

The Windows time.time function has a 1s precision limit.
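
For illustration, the suggested import is a drop-in replacement for time.time (distributed.metrics.time exists in the library and uses a higher-resolution clock on Windows):

```python
from distributed.metrics import time

start = time()
sum(range(10**6))  # some work to measure
print(time() - start)
```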

Comment thread distributed/worker.py Outdated
'key': key,
'cause': cause})
self._throttledGC.collect(
    force_gc=True if nbytes_to_free > 10 * 2**20 else False
Member

Could probably drop the ternary statement and just replace with the condition.

force_gc = nbytes_to_free > 10 * 2**20

@mrocklin
Member

I'm generally fine with the code here. Unfortunately it looks like the tests are not. This change causes a variety of tests to fail in interesting ways. I suspect some subtle multi-threading interaction.

@bluenote10
Contributor Author

I was trying to reproduce the problems locally, but the tests just work for me. There was one remaining issue regarding the usage of nbytes_to_free, but now the set of failing tests seems to have changed again (one of the failures is rather a Travis timeout, right?). Do the tests fail randomly, or is there a clear pattern? How can we solve this?

@mrocklin
Member

Yeah, welcome to the wonderful world of concurrent debugging. The virtual machines on travis-ci are very slow and so are quite good at catching subtle bugs that sneak by on faster machines. I recommend trying to run a few of the failing tests in a for loop like the following:

for i in {1..100}; do py.test distributed/tests/test_stress.py::test_stress_1 --pdb ; done

This might help you catch some of the errors on your local machine.

@mrocklin
Member

It's also entirely possible that calling gc.collect in this way is subtly dangerous and that we'll need to try another approach.

@bluenote10
Contributor Author

How can garbage collection be dangerous? As far as I can see, the explicit garbage collection happens in places where an automatic garbage collection could trigger anyway, so it's not possible to rely on GC not happening, right? Or am I missing that there are periods with gc.disable()?

I reran all the failed tests (from the Travis Python 2.7 build) in a brute-force attempt (see commands below). I couldn't get the tests to fail, except for distributed/tests/test_steal.py::test_steal_expensive_data_slow_computation. However, locally this test fails due to the buffer concatenation issue #1179, which is not the reason why it fails on Travis.

Details

Command lines for running the failed tests in brute force:

for i in {1..100}; do py.test "distributed/tests/test_client.py::test_open_close_many_workers[Worker-100-5]" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_scheduler.py::test_balance_many_workers" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_scheduler.py::test_balance_many_workers_2" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_scheduler.py::test_correct_bad_time_estimate" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_steal.py::test_worksteal_many_thieves" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_steal.py::test_dont_steal_unknown_functions" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_steal.py::test_new_worker_steals" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_steal.py::test_work_steal_no_kwargs" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_steal.py::test_balance_without_dependencies" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_steal.py::test_steal_twice" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_steal.py::test_accept_old_result_if_stolen" --pdb --runslow ; done
for i in {1..100}; do py.test "distributed/tests/test_stress.py::test_stress_1" --pdb --runslow ; done

@mrocklin
Member

How can garbage collection be dangerous?

Yeah I don't know. I suspect some subtle interaction.

@mrocklin
Member

mrocklin commented Aug 7, 2017

Any further thoughts on this @bluenote10 ?

@bluenote10
Contributor Author

I'm still puzzled why the tests fail on Travis but not locally. The logs on Travis also don't look like actual test failures; they look more like errors in the testing infrastructure. I'm trying to get more output from Travis now to see if that helps.

@mrocklin
Member

mrocklin commented Aug 7, 2017

There are a number of new failures in the tests, I suspect due to some change on travis-ci. I pushed a trivial change here to help us get a better understanding of the kinds of tests that are failing on master: #1317

Conflicts:
	distributed/worker.py
@mrocklin
Member

I've merged this into master. We'll see if problems persist.

@mrocklin
Member

Sorry, I've merged master into this. Not the other way around.

@mrocklin
Member

It looks like you've gotten tests to pass, but only by removing all of the actual functionality of the PR :/

Do you have any thoughts about how to trigger GC safely? Or other thoughts about how to achieve the same results while only relying on the standard GC operation?

@bluenote10
Contributor Author

I'm still wondering if the tests that fail with explicit GC may hint at a potential cause of the memory leaks. Also, some failing tests were related to work stealing, and I think you already spotted an issue there. I will try to systematically re-enable GC here (after merging in the work-stealing fix) and look at the pattern of failing tests again. However, I'll have to work on something else for a while, so it might take a bit until I can pick this up again.

@mrocklin
Member

OK, thank you for your continued engagement on this. Sorry it took me a while to engage myself.

Comment thread distributed/utils.py Outdated
new_time = time()
if force_gc or new_time - self.last_collect > self.min_interval_in_sec:
    gc.collect()
    self.last_collect = new_time
Contributor

@ogrisel Oct 21, 2017

As @pitrou said in another PR, GC can legitimately take several hundred milliseconds or more (maybe even seconds) on interpreters loaded with lots of small Python objects (e.g., dask.bag with nested constructs). Therefore you might want to consider a strategy that records the duration of the last gc.collect() call and the time elapsed since that call finished, and only calls gc.collect() again if less than 5% of the time is spent inside gc.collect() in aggregate, instead of using a fixed absolute value for min_interval_in_sec.

Also, using time.monotonic() instead of time.time() under Python 3.5 and later will probably spare some rare but potentially hard-to-debug issues in case of leap-second events.
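
A minimal sketch of this duty-cycle strategy (class and attribute names are hypothetical, not from the PR):

```python
import gc
from time import monotonic  # immune to leap seconds and clock adjustments

class FractionGC(object):
    """Trigger gc.collect() only while the aggregate fraction of time
    spent collecting stays below max_gc_fraction (e.g. 5%)."""

    def __init__(self, max_gc_fraction=0.05):
        self.max_gc_fraction = max_gc_fraction
        self.last_collect_end = monotonic()
        self.last_collect_duration = 0.0

    def collect(self):
        elapsed = monotonic() - self.last_collect_end
        # Skip if collecting again now would push the GC duty cycle,
        # duration / (duration + elapsed), above the budget.
        if self.last_collect_duration > self.max_gc_fraction * (
                self.last_collect_duration + elapsed):
            return
        start = monotonic()
        gc.collect()
        self.last_collect_end = monotonic()
        self.last_collect_duration = self.last_collect_end - start
```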

Contributor Author

Yes, good point. I think I'll just use your implementation, no need for two throttled GC implementations ;). I'll still have to check whether this PR is needed at all with your modification.

@mrocklin
Member

This PR has gone stale. Closing for now.

@mrocklin mrocklin closed this Apr 15, 2019