Are reference cycles a performance problem? #4987

Open · mrocklin opened this issue Jun 29, 2021 · 25 comments

@mrocklin (Member) commented Jun 29, 2021

@gjoseph92 noticed that, under some profiling conditions, turning off garbage collection had a significant impact on scheduler performance. I'm going to include some notes from him in the summary below

Notes from Gabe

See #4825 for initial discussion of the problem. It also comes up on #4881 (comment).

I've also run these with GC debug mode on (gjoseph92/dask-profiling-coiled@c0ea2aa1) and looked at the GC logs. Interestingly, GC debug mode generally reports GC as taking zero time:

gc: done, 0 unreachable, 0 uncollectable, 0.0000s elapsed

Some of those logs are here: https://rawcdn.githack.com/gjoseph92/dask-profiling-coiled/61fc875173a5b2f9195346f2a523cb1d876c48ad/results/cython-shuffle-gc-debug-noprofiling-ecs-prod-nopyspy.txt?raw=true

The types of objects being listed as collectable are interesting (cells, frames, tracebacks, asyncio Futures/Tasks, SelectorKey) since those are the sorts of things you might expect to create cycles. It's also interesting that there are already ~150k objects in generation 3 before the computation has even started, and ~300k (and growing) once it's been running for a little bit.
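
For reference, a minimal sketch of how this kind of GC debug output can be enabled (the exact flags used in the linked runs may differ):

    import gc

    # Print per-collection statistics ("gc: done, ... elapsed") and describe
    # each collectable object the collector finds.
    gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_COLLECTABLE)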

I've also tried:

  • turning off statistical profiling
  • turning off the bokeh dashboard
  • using uvloop instead of native asyncio

But none of those affected the issue.

What I wanted to do next was use refcycle or objgraph or a similar tool to try to see what's causing the cycles. Or possibly use tracemalloc + GC hooks to try to log where the objects that were being collected were initially created.
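
A rough sketch of what that could look like, assuming objgraph is installed; gc.DEBUG_SAVEALL keeps collected objects in gc.garbage so they can be inspected after the fact:

    import gc
    import tracemalloc

    import objgraph  # third-party: pip install objgraph

    tracemalloc.start(25)           # keep 25 frames so allocation sites are meaningful
    gc.set_debug(gc.DEBUG_SAVEALL)  # collected objects are kept in gc.garbage

    # ... run the workload of interest, then force a collection ...
    gc.collect()

    # What kinds of objects ended up in cycles?
    print(objgraph.typestats(gc.garbage))

    # Where was one of them allocated, and what still refers to it?
    if gc.garbage:
        victim = gc.garbage[0]
        tb = tracemalloc.get_object_traceback(victim)
        if tb is not None:
            print("\n".join(tb.format()))
        objgraph.show_backrefs([victim], max_depth=3, filename="backrefs.png")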

I notice that we have reference cycles in our scheduler state

In [1]: from dask.distributed import Client
In [2]: client = Client()
In [3]: import dask.array as da
In [4]: x = da.random.random((1000, 1000)).sum().persist()
In [5]: s = client.cluster.scheduler
In [6]: a, b = s.tasks.values()

In [7]: a
Out[7]: <TaskState "('sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" memory>

In [8]: b
Out[8]: <TaskState "('random_sample-sum-sum-aggregate-832c859ad539eafe39d0e7207de9f1e7',)" released>

In [9]: a in b.dependents
Out[9]: True
In [10]: b in a.dependencies
Out[10]: True

Should we be concerned about our use of reference cycles?

cc @jakirkham @pitrou

@fjetter (Member) commented Jun 29, 2021

FWIW, I've seen in certain "flaky" timed out CI runs log messages claiming that we spend >98% of time in GC. I'll watch out for those tests and post links if I find any.

@pitrou (Member) commented Jun 29, 2021

I notice that we have reference cycles in our scheduler state

As long as those are live cycles, it shouldn't be a problem. It's probably good to try to break those cycles when a task is removed from the graph, but IIRC this is/was already the case (e.g. explicitly clear dependents and dependencies in the removed task).
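
For illustration, a hypothetical sketch of that pattern; the TaskState stand-in and release_task helper below are made up, and the real scheduler code differs:

    class TaskState:
        # minimal stand-in with just the attributes used in the transcript above
        def __init__(self, key):
            self.key = key
            self.dependencies = set()
            self.dependents = set()

    def release_task(ts: TaskState) -> None:
        # Drop the back-references that form the cycle, then clear our own sets,
        # so a removed task becomes collectable by plain reference counting.
        for dep in ts.dependencies:
            dep.dependents.discard(ts)
        for child in ts.dependents:
            child.dependencies.discard(ts)
        ts.dependencies.clear()
        ts.dependents.clear()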

Should we be concerned about our use of reference cycles?

I don't know. Cycles will probably be difficult to avoid in async code, I think.

One tidbit: if you know you're keeping a traceback or exception somewhere, you can help break cycles by calling traceback.clear_frames: https://docs.python.org/3/library/traceback.html#traceback.clear_frames
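
A minimal sketch of that suggestion, assuming the caught exception is stashed in some registry for later reporting (the registry here is made up):

    import traceback

    stored_errors = {}  # hypothetical registry of failures, keyed by task name

    def remember_failure(key: str, exc: BaseException) -> None:
        # Keep the exception object around, but release the frame locals that its
        # traceback would otherwise keep alive (a common source of cycles).
        if exc.__traceback__ is not None:
            traceback.clear_frames(exc.__traceback__)
        stored_errors[key] = exc

    try:
        1 / 0
    except ZeroDivisionError as e:
        remember_failure("task-1", e)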

@mrocklin (Member Author)

Any suggestions on how to track down what is causing excessive GC use?

@pitrou (Member) commented Jun 29, 2021

First, the GC will run whether you have reference cycles or not (it needs to run precisely to detect dead reference cycles).
Second, the GC is generational. The most costly part is collecting the oldest generation, which contains long-lived objects (for example scheduler tasks ;-)).
Third, what triggers cyclic garbage collections is a set of heuristics based on the number of allocations since the last collection. They are described in the gc module docs, and those heuristics can be tweaked if needed.
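
For concreteness, the knobs being referred to (a sketch; sensible values are workload-dependent):

    import gc

    print(gc.get_threshold())     # defaults: (700, 10, 10)

    # Make collections (especially of the older generations) much rarer.
    gc.set_threshold(100_000, 50, 50)

    # Or disable automatic collection entirely and run it at chosen points.
    gc.disable()
    gc.collect()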

@gjoseph92 (Collaborator) commented Jul 1, 2021

@crusaderky, @jrbourbeau and I discussed this today.

We agreed that the main mystery is why the logs from GC debug mode mostly show an elapsed time of 0.0000s (the longest recorded was 0.1061s), yet the py-spy profiles make it look like pauses are in the 1-2 sec range. (Note that the logs were recorded without py-spy running, since the combo made things non-functionally slow.) We need to figure out which is true, or, if they're both true, why.

Some next steps:

  1. Get actual statistics about the performance gain from disabling GC on the scheduler. The tests I've been running are meant for finer-grained profiling only, and total runtimes should not really be compared between tests. I'll change things around to run the same workload many times (without py-spy) to get some actual metrics. If these show little difference, then we close this and move on.
  2. I'll also look at total CPU times spent on the scheduler during the compute. @crusaderky's idea was that GC should be CPU-intensive, so if these pauses are actually spent traversing cycles, we'd see higher user CPU time. But if the pauses are something else (resource contention? acquiring a lock?), we wouldn't.
  3. Add gc callbacks to log GC time, as perceived by user code, and to correlate with timestamps in the py-spy profiles. Compare to the times logged by GC debug mode. Is there some other time we're not accounting for? (A sketch of such a callback follows this list.)
  4. Try disabling only generation 2 collection. Same performance as disabling it entirely?
  5. Either try to line-profile CPython while it's GC'ing (possible with a GC callback and some clever perf-ing), or take the big hammer of commenting out parts of gc_collect_main, rebuilding Python, and re-running to see what part is slowing us down? (This is assuming elapsed time as logged by a gc callback != elapsed time as logged by GC debug, and we're trying to figure out where the time is going. If we've instead found that GC just has a ton of work to do, then we need to figure out why we have so many cycles, and what we can do to reduce them.)
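
A minimal sketch of the callback from step 3: gc.callbacks passes a "start"/"stop" phase plus an info dict with the generation, so per-generation pause times can be logged and lined up with other timestamps:

    import gc
    import time

    _starts = {}

    def gc_timer(phase, info):
        gen = info["generation"]
        if phase == "start":
            _starts[gen] = time.monotonic()
        else:  # phase == "stop"
            elapsed = time.monotonic() - _starts.pop(gen, time.monotonic())
            print(
                f"{time.time():.3f} gen{gen} GC took {elapsed * 1000:.1f} ms, "
                f"collected={info['collected']} uncollectable={info['uncollectable']}"
            )

    gc.callbacks.append(gc_timer)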

@pitrou (Member) commented Jul 1, 2021

We agreed that the main mystery is why the logs from GC debug mode mostly show an elapsed time of 0.0000s

You're looking for collections of the oldest generation ("gc: collecting generation 2"). Your logs only show 3 instances of that, AFAIK.

By the way: it may well be that the reason for long GC runs is not the cost of garbage collection, but some costly finalizers (for example if a finalizer does some IO).
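
A toy illustration of that effect, where the pause attributed to GC is really finalizer time (time.sleep stands in for IO):

    import gc
    import time

    class Handle:
        def __del__(self):
            time.sleep(0.1)   # stand-in for a slow finalizer doing IO

    # Two objects in a cycle, each with a costly finalizer.
    a, b = Handle(), Handle()
    a.other, b.other = b, a
    del a, b

    start = time.perf_counter()
    gc.collect()              # the "GC pause" here is mostly finalizer time
    print(f"collect took {time.perf_counter() - start:.3f}s")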

@mrocklin (Member Author) commented Jul 1, 2021 via email

@gjoseph92 (Collaborator)

The finalizers are a good idea. I would have thought they'd show up in the py-spy profiles, but maybe because of how they're called within the GC process, they don't show up in the interpreter's normal call stack?

@jakirkham (Member)

I think at one point we discussed tracking what objects are being GC'd. Has that been explored? If we are concerned about how specific objects are being cleaned up, this would be useful info for identifying which cleanup methods are problematic.

@dhirschfeld (Contributor)

There was an interesting issue in trio where they tracked down GC issues causing latency spikes. Might be some useful debugging techniques to crib from there?

@gjoseph92 (Collaborator)

@dhirschfeld thanks a ton for these links; they're very helpful. Something like python-trio/trio#1805 (comment) is what we'll want to do to debug this.

Also, I wouldn't be surprised if our issue is similar to this Trio one (raising exceptions within frames that hold references to other exceptions/frames/traceback objects), with a similar solution. I've been suspicious of this since seeing how so many of the objects being collected in the debug logs are Cells, Tracebacks, Frames, etc. Intuitively you'd read that sort of code and think "how does adding a del as the last line of a function matter; the function is about to go out of scope anyway." But it's a different story entirely when something (the raised exception) holds a reference to that Frame.
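
A stripped-down illustration of that pattern, with the kind of del the Trio fix adds (all names here are made up):

    def work():
        big_buffer = bytearray(10**7)   # something we'd like freed promptly
        try:
            raise ValueError("boom")
        except ValueError as err:
            saved = err                 # saved -> traceback -> this frame -> saved
            message = str(saved)
            del saved                   # break the frame <-> exception cycle
            return message

    work()
    # Without the `del`, big_buffer (via the saved exception's traceback and the
    # function's own frame) lingers as cyclic garbage until the next GC pass;
    # with it, plain reference counting frees everything when work() returns.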

@gjoseph92 (Collaborator)

I've gotten some statistics about how turning off GC affects performance. First, disabling GC on the scheduler gives a 14-37% speedup on our shuffle workload on a 100-worker Coiled cluster.

The range is because I've accidentally discovered that this workload runs slower and slower each subsequent time it re-runs on the same cluster, in a concerningly linear way:

[figure "elapsed": elapsed time per trial, with GC enabled vs disabled]

The doubled lines are with and without MALLOC_TRIM_THRESHOLD_=0. It gave a slight performance improvement, but not that significant.

So at trial 1, disabling GC is 14% faster. By trial 10, it's 37% faster.

This slowdown is really significant when GC is on—after 10 repetitions, the same operation is 75% slower.

FWIW I tried adding a client.restart() before each repetition, but it made no difference/was a tiny bit worse, so I left it off this plot. I also couldn't reproduce the slowdown locally with a subsample of the data. It may have something to do with the number of workers?


I also recorded both memory info and the change in CPU times before each trial (@crusaderky). We can see that both user and idle CPU time are higher when GC is active. Scheduler memory grows linearly in both cases, but grows much faster when GC is off (so just turning off GC entirely by default certainly does not seem like a viable option). This also reminds me of #3898.

[figure "stats": scheduler memory and CPU-time stats per trial]

Unless anyone has ideas about what's causing the slowdowns with each repetition, my next steps will be:

  • Try a totally different workload, and see if it also gets slower with repetitions on the same cluster.
  • Record py-spy profiles of each repetition and try to see what's slowing down.

@pitrou (Member) commented Jul 9, 2021

I'd say two things are particularly of concern:

  • that it grows a lot with the number of trials
  • the fact that idle time grows a lot (!!!)

@pitrou (Member) commented Jul 9, 2021

That said, I don't understand the idle time on your graphs. Basically, your numbers show that user time == elapsed time, but idle time also grows a lot (and a lot larger than elapsed time!).

@gjoseph92 (Collaborator) commented Jul 9, 2021

@pitrou this could just be because the scheduler node has 4 CPUs but the scheduler process is single-threaded. So 1s of extra wall time will usually give a minimum of 3s of idle time, I'd think.

Maybe I'll rerun with 1 CPU for the scheduler so these numbers are more meaningful.

@mrocklin (Member Author) commented Jul 9, 2021 via email

@gjoseph92 (Collaborator)

I reran with:

  1. a new workload ((x + x.T).mean() sort of thing; 93,977 optimized tasks vs 274,712 optimized tasks for the shuffle)
  2. pinning the scheduler process to CPU 0, then only looking at times for CPU 0, so @pitrou, the idle times should be accurate now (a sketch of this setup follows the list).
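
A sketch of that setup, assuming psutil and a Linux host (CPU numbers are illustrative):

    import os

    import psutil  # third-party: pip install psutil

    # Pin this (scheduler) process to CPU 0 so per-CPU times are attributable to it.
    psutil.Process(os.getpid()).cpu_affinity([0])

    before = psutil.cpu_times(percpu=True)[0]
    # ... run one trial of the workload ...
    after = psutil.cpu_times(percpu=True)[0]
    print("user:", after.user - before.user, "idle:", after.idle - before.idle)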

For both workloads, disabling GC gives both better absolute performance, and significantly better performance for repeated runs:

[figure "elapsed": elapsed time per run for both workloads, GC enabled vs disabled]

Looking at only the CPU where the scheduler is running, idle time is very low, and consistent between runs:

[figure "stats": CPU-time stats for the CPU the scheduler is pinned to]

These are all run with MALLOC_TRIM_THRESHOLD_=0. I dropped the no-trim case because it was effectively the same, and just added clutter.

@mrocklin I'm working now on trying to figure out what new objects are being created/leaked between runs. Though I'm currently just starting with a local cluster to figure out what information to collect.

randerzander added a commit to randerzander/gpu-bdb that referenced this issue Jul 12, 2021
Based on dask/distributed#4987, ran SF1k w/ GC disabled on the scheduler process and observed a total benchmark runtime improvement of about 5s. This was consistent across many runs.

@crusaderky (Collaborator) commented Jul 16, 2021

The progressive increase of runtime over runs could be explained by a progressive increase in the number of Python objects that gc.collect() has to walk through at every iteration - read: a leak. Do you have a measure of that? Does it correlate with runtime?
If so, I suggest that you:

  1. do one run (to make sure everything's warm)
  2. run pympler to log the id of all objects in the interpreter
  3. run many more runs and observe the progressive slowdown
  4. run pympler again to find all the objects in the interpreter again, then remove those that were already present in (2)
  5. use pympler.refbrowser (warning: mortally slow) to find out what's referencing them and why they aren't being garbage collected (a rough sketch of steps 2-5 follows this list)
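
A rough sketch of steps 2-5, assuming pympler is installed (API details are worth double-checking against the pympler docs):

    import gc

    from pympler import muppy, refbrowser  # third-party: pip install pympler

    # (2) after a warm-up run, record what already exists
    gc.collect()
    baseline_ids = {id(o) for o in muppy.get_objects()}

    # (3) ... run many more repetitions of the workload here ...

    # (4) find objects that appeared since the baseline and are still alive
    gc.collect()
    new_objects = [o for o in muppy.get_objects() if id(o) not in baseline_ids]
    print(f"{len(new_objects)} new objects survived the repetitions")

    # (5) for a suspicious object, walk its referrers (slow, as warned above)
    if new_objects:
        refbrowser.ConsoleBrowser(new_objects[0], maxdepth=3).print_tree()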

@mrocklin (Member Author)

A few of us (@gjoseph92 @crusaderky @fjetter @jrbourbeau) just met on this topic and came up with a few questions. First, some context: the dask scheduler constructs a massive self-referential data structure. We're curious how this structure interacts with the garbage collector, and we're hoping to get some general guidance from people like @pitrou (or anyone he may know).

  • What is the behavior of the GC when the scheduler is at rest but with a massive self-referential data structure? Is it expensive to keep such a structure in memory?

  • Why do we need the GC anyway? What are we producing that isn't cleaned up naturally? This is more a question for future experimentation. I would be curious to turn off the GC and see what builds up. If we can get the scheduler to a state where everything is cleaned up on its own, that would be nice.

  • We're seeing pauses that exist when we keep GC on. However we don't see a lot of things getting cleaned up during these pauses. Is it possible that there are expensive things in memory that take up a lot of GC time, but that don't ultimately get GC'ed? If so, is it possible to find out what those things are?

  • How does GC interact with our Cythonization system? Is there a way to remove our massive collection of self-referential TaskState/WorkerState/TaskGroup system from the view of the garbage collector using the recent Cythonization work? Maybe there is some magic directive here?

@jakirkham (Member)

In general this is the kind of structure that GC is designed to handle. It's hard to know when a large self-referential object can be cleaned up. So this is what GC checks for. Maybe this is already obvious, but thought it deserved stating to set expectations.

@mrocklin (Member Author)

Is it expensive to walk through such a structure? Are there ways where we can ask the GC to ignore this structure? How about with Cython?

@mrocklin (Member Author)

@gjoseph92 I would be curious whether the pauses that you see are only in the oldest generation. I suspect that we could learn some things by tweaking some of the parameters that @pitrou mentions above.

@pitrou (Member) commented Jul 30, 2021

What is the behavior of the GC when the scheduler is at rest but with a massive self-referential data structure?

The same as with a massive non-self-referential data structure. The fact that the data structure is self-referential is irrelevant for the cost of walking the structure.

Is it expensive to walk through such a structure?

Basically, it's O(number of objects referenced by the structure).

Are there ways where we can ask the GC to ignore this structure?

Not really, there's gc.freeze but it's very coarse-grained as you'll see from the description.
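
For reference, what that coarse-grained option looks like (a sketch; freeze applies to every object alive at that moment, not just the scheduler state):

    import gc

    # ... start the scheduler and build the long-lived TaskState graph ...

    gc.collect()   # collect whatever startup garbage exists
    # Move all surviving objects to a "permanent" generation that later
    # collections never traverse.
    gc.freeze()
    print(gc.get_freeze_count(), "objects frozen")

    # Objects created after this point are still tracked and collected normally;
    # gc.unfreeze() puts the frozen objects back if needed.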

How about with Cython?

Not with Cython either. Cython doesn't change how the GC works.

@jakirkham (Member)

It's possible to tell Cython not to GC certain Cython extension objects. I'm less clear on whether that will work in our use case or whether it would be a good idea. I should add that Cython extension types already disable features like the trashcan by default.

A separate but also interesting idea in Cython (that I believe I've raised before) is to use Cython's freelist for commonly used extension types. This cuts down on the allocation/deallocation cost of a particular type (especially if it is created/deleted frequently). That said, I don't think we've narrowed down which objects this would be useful for or how many objects we tend to use. Both would really be needed to make this useful.

More generally, while we have seen some pain around GC, I don't think we have narrowed it down to particular objects (though please correct me if things have changed here; I may be a bit out of date). I think this is very important for determining where effort is best spent. For example, we might suspect the Task* and *State objects and spend a bunch of time improving allocation/deallocation costs around them, only to learn it doesn't help.

To put a finer point on this, we have seen that turning off GC seems to help during reads and MsgPack unpackb specifically ( #4881 ). More recently we have identified large memory usage during serialization causing some pain ( #5107 ). Large memory usage while creating a lot of small temporary objects during serialization may be exactly the sort of thing that could trigger heavy GC usage (and some of the other problems discussed). If this is true, optimizing Task* and *State objects wouldn't help much. Instead, being more careful to use memoryviews ( #5112 ), investigating other serialization libraries like msgspec, and possibly an overhaul of serialization altogether ( #4923 ) may all be needed to address this kind of issue.

@gjoseph92 (Collaborator)

FYI @wence-, this thread, and especially #4987 (comment), might be interesting to you with the profiling you're doing.
