
Garbage collection issues in parallel #1617

Closed · wence- opened this issue Feb 27, 2020 · 7 comments

@wence- (Contributor) commented Feb 27, 2020

Introduction

PETSc C objects are managed with a refcounting collector. Destruction is collective over the communicator they were created on.

Some (many) Firedrake objects hold on to PETSc objects. Firedrake objects are managed by the Python garbage collector which has both a refcounting collector (for things with no refcycles) and a generational mark-and-sweep collector (for things with cycles).

Problem

The Python garbage collector has no concept of collective semantics over an MPI communicator. If the "same" object on different processes for some reason gets collected in a different order, we can end up with deadlocks in the PETSc destructors.

If objects are only ever collected by the refcounting part of the collector, things are mostly OK: destruction happens at deterministic points in program order, which (usually) match across processes. You could still write code that produces a bad problem here, but it's unlikely.

On the other hand, if objects are collected by the generational collector, the collection point depends on each process's allocation history, so it's basically "only a matter of time" before things break.
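
To make the failure mode concrete, here is a minimal sketch, assuming petsc4py is available (the Holder class is purely illustrative, not a real Firedrake object):

from petsc4py import PETSc


class Holder:
    # Illustrative stand-in for a Firedrake object that wraps a PETSc object.
    def __init__(self):
        # Created collectively over COMM_WORLD; PETSc also destroys it
        # collectively over COMM_WORLD.
        self.vec = PETSc.Vec().createMPI(10, comm=PETSc.COMM_WORLD)
        # A reference cycle: refcounting alone can never free this object,
        # only the generational collector can.
        self._self = self


Holder()  # immediately unreachable, but only freed whenever gc next runs

# From here on, each rank's generational collector decides independently when
# the Holder (and hence the Vec) is destroyed.  If that happens at different
# points on different ranks, the collective destruction can interleave with
# other collective PETSc calls and deadlock.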

Unfortunately, Firedrake holds cyclic references everywhere. A first step to addressing this problem would be to audit all the code and remove all the refcycles. This is quite a lot of work, but mostly just tedious. Lots of the cycles actually live in PyOP2.

Possible solutions

Push explicit lifetime management of Firedrake objects (e.g. XXX.destroy()) into the user API.

This is "simple" but pollutes user code significantly.

Your option here.

@pefarrell (Contributor)

I've seen this happen with both firedrake and FEniCS.

The way I deal with this is to do import gc; gc.disable() at the start of my scripts, and manually call gc.collect() at various points (typically after solves). I don't understand the Python garbage collection semantics well enough to guarantee that this fixes the problem, but it goes some way towards addressing it.
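
A sketch of that pattern (do_collective_work is just a placeholder for whatever the script really does, e.g. a nonlinear solve):

import gc

gc.disable()  # switch off the generational collector right at the start


def do_collective_work(step):
    # Placeholder for the real work: assembly, solves, output, ...
    pass


for step in range(10):
    do_collective_work(step)
    # Collect at a point that every rank reaches together, so that any
    # collective PETSc destructors are at least triggered at the same place
    # in the program on every process.
    gc.collect()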

@danshapero (Contributor)

If tediously removing reference cycles will fix the problem then I'm happy to help, as long as it can be done piecemeal. Can you list some of the affected classes? I imagine the reference cycles were there in the first place for a reason -- will removing them break userspace?

@salazardetroya (Contributor)

I'd be willing to help as well if it's just tedious work. I am interested in learning more about the PyOP2/Firedrake toolchain. What would be a good place to start? Which tools do you use for detecting memory leaks in Python?

@jrmaddison (Contributor)

The following construct might be useful for regression testing:

import gc
import weakref


class Test:
    pass


# Classes whose instances must only ever be freed by refcounting, never by
# the generational collector.
refcount_only = (Test,)
# Weak references captured at the start of each collection pass, plus a flag
# recording whether any such instance was reclaimed by the collector.
refcount_only_refs = []
refcount_error = [False]


def gc_callback(phase, info):
    if phase == "start":
        for obj in gc.get_objects():
            if isinstance(obj, refcount_only):
                refcount_only_refs.append(weakref.ref(obj))
    else:
        # assert phase == "stop"
        for ref in refcount_only_refs:
            # If ref() is None then the object was collected by the garbage
            # collector, which is forbidden
            if ref() is None:
                refcount_error[0] = True
        refcount_only_refs.clear()


gc.callbacks.append(gc_callback)
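
For example, a test built on this construct might look like the following sketch (with Test standing in for whichever Firedrake/PyOP2 classes we care about, and refcount_error reset between tests):

# Freed promptly by refcounting: never seen by the collector, no error.
t = Test()
del t
gc.collect()
assert not refcount_error[0]

# Put an instance in a reference cycle: only the generational collector can
# free it, so the callback flags an error.
t = Test()
t.cycle = t
del t
gc.collect()
assert refcount_error[0]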

@wence- (Contributor, Author) commented Apr 7, 2020

Some more thoughts on this. The first question is what type of refcycles we need to care about.

Some non-exhaustive experimenting and reading suggests the following statement is true.

If I have an object x which contains a reference cycle, and refer to it via an object y, then x will be collected by the generational collector, but y will be collected by the refcounting collector (which is what we want).
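
A small sketch of the kind of experiment behind that statement (X and Y are just stand-ins):

import gc
import weakref


class X:
    pass


class Y:
    pass


gc.disable()                # keep automatic collection out of the way

x = X()
x.cycle = x                 # x participates in a reference cycle
y = Y()
y.x = x                     # y refers to x, but is not itself in a cycle

x_ref = weakref.ref(x)
y_ref = weakref.ref(y)

del x, y
assert y_ref() is None      # y went away promptly, via refcounting alone
assert x_ref() is not None  # x lingers until a generational collection

gc.collect()
assert x_ref() is None      # only now is the cycle reclaimed

gc.enable()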

This means that any object which holds on to (or is referred to by) a PETSc object cannot be part of a reference cycle, but it is fine for it to hold on to objects that are themselves part of reference cycles, as long as those objects are safe to collect "out of order".

The first place to start is in PyOP2, because all of Firedrake builds on top of that. There are a number of places here where we create reference cycles because we had originally envisaged that PyOP2 would be programmed "by a human". So, for example, a DataSet is "object-cached" on a Set and so forth.

PyOP2 itself is a bit of a mess; I think a good first step would be to split the base.py and petsc_base.py files into a data_structures/ subdirectory, with one file for each object exposed in the API. We can then start fixing things slowly there, with a better overview of the code.

We can then think about refactoring the places in Firedrake that build these PyOP2 objects to maintain the caches "one-way" in Firedrake itself, rather than via ref-cycles. Plausibly we don't need this, but I am not sure.
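
As a sketch of what "one-way" could mean (Set and DataSet here are bare stand-ins for the PyOP2 classes, and HypotheticalOwner is just an illustrative Firedrake-side owner):

class Set:
    # Bare stand-in for pyop2.Set.
    pass


class DataSet:
    # Bare stand-in for pyop2.DataSet: refers "down" to its Set.
    def __init__(self, set_):
        self.set_ = set_


# Roughly the current pattern: the DataSet is object-cached on the Set, so we
# get Set -> DataSet -> Set and neither can be freed by refcounting alone.
s = Set()
s._cached_dataset = DataSet(s)


# A "one-way" alternative: the higher-level Firedrake object owns the DataSet
# and references only run downwards (owner -> DataSet -> Set), so plain
# refcounting can tear the whole chain down deterministically on every rank.
class HypotheticalOwner:
    def __init__(self):
        self.set_ = Set()
        self.dataset = DataSet(self.set_)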

The biggest thing we need to handle is that we use DM objects to pass information back and forth between Firedrake and solvers (inside PETSc). We're reasonably good at cleaning those up after the fact, but there are places where I think we probably do the wrong thing.

@jrmaddison (Contributor) commented Apr 1, 2021

Further circular references are generated within pyadjoint, in pyadjoint.overloaded_type.OverloadedType.create_block_variable. (Edited to identify the correct method.)

@JDBetteridge (Member)

This is now fixed by #2599
