Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regression in test_weakref_cache #6033

Merged
merged 8 commits into from
Apr 6, 2022
Merged

Fix regression in test_weakref_cache #6033

merged 8 commits into from
Apr 6, 2022

Conversation

crusaderky
Copy link
Collaborator

Fix regression introduced in #6029:
https://github.com/dask/distributed/runs/5768355142?check_suite_focus=true

>       assert x2.id != id_x
E       AssertionError: assert UUID('57673261-dc39-4e98-afc3-c4ad625f9211') != UUID('57673261-dc39-4e98-afc3-c4ad625f9211')

@crusaderky crusaderky self-assigned this Mar 31, 2022
@crusaderky crusaderky added flaky test Intermittent failures on CI. regression labels Mar 31, 2022
Comment on lines 341 to 343
# Surprisingly, even on CPython this is needed to ensure that the object is garbage
# collected, even if there are no obvious circular references going on
gc.collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does concern me a bit. I don't think the GC should be required here.

I can reproduce this on py3.10. Maybe there is a CPython bug? Below is what objgraph tells me

        import weakref
        ref = weakref.ref(x)
        del x
        import objgraph
        obj = ref()
        if obj:
            objgraph.show_backrefs(obj)

cc @graingert in case something rings a bell

objgraph-kjmpytgy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake, at this point, x is supposed to be in memory since the spill was not, yet, triggered. This only happens a few lines below.

I cannot reproduce this any longer

Copy link
Member

@sjperkins sjperkins Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into similar issues in dask-ms a while back and found I could never rely on the del x. If it's an issue of "encouraging" the decref/garbage collector, you might try allocating x inside an inner function like so:

   # Run this test twice:

    # - x is smaller than target and is evicted by y;
    # - x is individually larger than target and it never touches fast
    def _inner():
      x = cls(size)
      buf["x"] = x
      if size < 100:
          buf["y"] = cls(60)  # spill x
      assert "x" in buf.slow
      # Test that we update the weakref cache on setitem
      assert (buf["x"] is x) == expect_cached
      # Do not use id_x = id(x), as in CPython id's are C memory addresses and are reused
      # by PyMalloc when you descope objects, so a brand new object might end up having
      # the same id as a deleted one
      id_x = x.id
      del x
      return id_x

  id_x = _inner()

@fjetter
Copy link
Member

fjetter commented Mar 31, 2022

This test has been failing for the first time on 2022-03-17, i.e. before #6029

https://dask.org/distributed/test_report.html

image

I suggest to no merge this since the gc.collect is not helping and may even confuse us more than it should

@github-actions
Copy link
Contributor

github-actions bot commented Mar 31, 2022

Unit Test Results

       18 files  ±0         18 suites  ±0   9h 12m 11s ⏱️ + 19m 22s
  2 724 tests ±0    2 642 ✔️ +2       82 💤 +1  0  - 3 
24 364 runs  ±0  23 145 ✔️ +1  1 219 💤 +2  0  - 3 

Results for commit 177def7. ± Comparison against base commit 971a96d.

♻️ This comment has been updated with latest results.

@crusaderky
Copy link
Collaborator Author

crusaderky commented Mar 31, 2022

This test has been failing for the first time on 2022-03-17, i.e. before #6029>
I suggest to no merge this since the gc.collect is not helping and may even confuse us more than it should

That was before #5978. It has been perfectly stable between #5978 and #6029.

@crusaderky
Copy link
Collaborator Author

crusaderky commented Mar 31, 2022

xref #6000 where @mrocklin saw the exact same problem with weakrefs elsewhere

@sjperkins sjperkins self-requested a review March 31, 2022 16:19
@graingert
Copy link
Member

the test passes for me locally with the gc disabled:

def test_weakref_cache(tmpdir, cls, expect_cached, size):
gc.disable()
try:
buf = SpillBuffer(str(tmpdir), target=100)
# Run this test twice:
# - x is smaller than target and is evicted by y;
# - x is individually larger than target and it never touches fast
x = cls(size)
canary = weakref.ref(x.canary)
buf["x"] = x
if size < 100:
buf["y"] = cls(60) # spill x
assert "x" in buf.slow
# Test that we update the weakref cache on setitem
assert (buf["x"] is x) == expect_cached
# Do not use id_x = id(x), as in CPython id's are C memory addresses and are reused
# by PyMalloc when you descope objects, so a brand new object might end up having
# the same id as a deleted one
id_x = x.id
del x
if size < 100:
assert (canary() is not None) == expect_cached
buf["y"]
assert canary() is None
assert "x" in buf.slow
x2 = buf["x"]
assert x2.id != id_x
if size < 100:
buf["y"]
assert "x" in buf.slow
# Test that we update the weakref cache on getitem
assert (buf["x"] is x2) == expect_cached
finally:
gc.enable()

@graingert
Copy link
Member

graingert commented Apr 2, 2022

shutting down the profile threads before running test_weakref_cache seems to fix it for me (locally):

def _watch(
thread_id,
log,
stop_event,
interval="20ms",
cycle="2s",
omit=None,
stop=lambda: False,
):
interval = parse_timedelta(interval)
cycle = parse_timedelta(cycle)
recent = create()
last = time()
while not stop() and not stop_event.is_set():
if time() > last + cycle:
log.append((time(), recent))
recent = create()
last = time()
try:
frame = sys._current_frames()[thread_id]
except KeyError:
return
process(frame, None, recent, omit=omit)
stop_event.wait(interval)
class _Profile(threading.Thread):
def __init__(self, thread_id, interval, cycle, log, omit, stop):
self._stop_event = stop_event = threading.Event()
super().__init__(
name="Profile",
target=_watch,
kwargs={
"stop_event": stop_event,
"thread_id": thread_id,
"interval": interval,
"cycle": cycle,
"log": log,
"omit": omit,
"stop": stop,
},
daemon=True,
)
def close(self) -> None:
self._stop_event.set()
self.join()
_profiles: weakref.WeakSet[_Profile] = weakref.WeakSet()
def shutdown_profile_threads():
pop = _profiles.pop
while True:
try:
profile = pop()
except KeyError:
return
profile.close()

shutdown_profile_threads()

edit: Seems to have fixed it on CI too https://github.com/graingert/distributed/actions/runs/2081825815

@graingert
Copy link
Member

graingert commented Apr 3, 2022

commenting out the sys._current_frames()[thread_id] call makes test_spill.py pass:

# try:
# frame = sys._current_frames()[thread_id]
# except KeyError:
# return
# process(frame, None, recent, omit=omit)

https://github.com/graingert/distributed/runs/5806180032?check_suite_focus=true#step:11:5355

@crusaderky
Copy link
Collaborator Author

I've commented the various tests that call gc.collect() to reflect your latest findings

@graingert
Copy link
Member

I've commented the various tests that call gc.collect() to reflect your latest findings

I don't think this is correct - I think a sleep will be just as good as a gc.collect()

@mrocklin
Copy link
Member

mrocklin commented Apr 4, 2022 via email

@crusaderky
Copy link
Collaborator Author

Turning off profile threads during testing seems reasonable to me.

Having a whole extra thread, running at all times during production and which tampers with references, which is just not there during testing - this seems quite dangerous to me.

@graingert
Copy link
Member

Turning off profile threads during testing seems reasonable to me.

These profile threads aren't actually supposed to be running, they're sticking around because some other test is leaking threads

@crusaderky
Copy link
Collaborator Author

These profile threads aren't actually supposed to be running, they're sticking around because some other test is leaking threads

I was not talking about test_weakref_cache in particular, but instead in general about all those tests that look like

del future
gc.collect()

I've reworked the PR - let's see if it's stable or not

@graingert
Copy link
Member

graingert commented Apr 5, 2022

the LocalCluster isn't calling self._loop_runner.stop() on close

I added:

--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -459,7 +459,10 @@ class LoopRunner:
             finally:
                 done_evt.set()
 
-        thread = threading.Thread(target=run_loop, name="IO loop")
+        thread = threading.Thread(
+            target=run_loop,
+            name=f"IO loop for {os.environ.get('PYTEST_CURRENT_TEST')}",
+        )
         thread.daemon = True
         thread.start()
 

which lets me see where the loops are staying alive from:

[<_MainThread(MainThread, started 140507058100032)>,
 <Thread(Dask-Offload_0, started 140505936979712)>,
 <Thread(TCP-Executor-44063-0, started daemon 140505708054272)>,
 <Thread(TCP-Executor-44063-1, started daemon 140505493010176)>,
 <Thread(Profile, started daemon 140505716446976)>,
 <Thread(IO loop for distributed/deploy/tests/test_local.py::LocalTest::test_context_manager (call), started daemon 140502607329024)>,
 <Thread(Profile, started daemon 140504410875648)>,
 <Thread(IO loop for distributed/deploy/tests/test_local.py::LocalTest::test_cores (call), started daemon 140505006462720)>,
 <Thread(Profile, started daemon 140505014855424)>,
 <Thread(IO loop for distributed/deploy/tests/test_local.py::LocalTest::test_no_workers (call), started daemon 140505023248128)>,
 <Thread(Profile, started daemon 140504998070016)>,
 <Thread(IO loop for distributed/deploy/tests/test_local.py::LocalTest::test_submit (call), started daemon 140504989677312)>,
 <Thread(Profile, started daemon 140504972891904)>,
 <Thread(IO loop for distributed/deploy/tests/test_local.py::test_cleanup (call), started daemon 140504419268352)>,
 <Thread(Profile, started daemon 140503144199936)>,
 <Thread(IO loop for distributed/deploy/tests/test_local.py::test_cleanup (call), started daemon 140504981284608)>,
 <Thread(Profile, started daemon 140504385697536)>,
 <Thread(IO loop for distributed/deploy/tests/test_local.py::test_dont_select_closed_worker (call), started daemon 140504368912128)>,
 <Thread(Profile, started daemon 140504402482944)>,
 <Thread(IO loop for distributed/deploy/tests/test_local.py::test_dont_select_closed_worker (call), started daemon 140503848826624)>,
 <Thread(Profile, started daemon 140504394090240)>,
 <paramiko.Transport at 0xf8785ab0 (cipher aes128-ctr, 128 bits) (active; 0 open channel(s))>,
 <paramiko.Transport at 0xdc71c070 (cipher aes128-ctr, 128 bits) (active; 0 open channel(s))>,
 <paramiko.Transport at 0xf879f430 (cipher aes128-ctr, 128 bits) (active; 0 open channel(s))>,
 <Timer(pytest_timeout distributed/tests/test_spill.py::test_weakref_cache[60-SupportsWeakRef-True], started 140505734280960)>]

@crusaderky
Copy link
Collaborator Author

@graingert that list of threads is an extremely useful thing to have permanently - mind opening a separate PR to get it in, as well as proper cleanup for LocalCluster?

I think this PR should continue on the direction of resiliency to the profiler regardless of your cleanup. Also, even after the cleanup, all tests that have a cluster running will still have the issue.

Comment on lines +286 to +287
while _watch_running:
sleep(0.0001)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure how I feel about this. It is not thread safe so while this function is returning another profile thread might again sample the frames. At the same time, if a profile thread is actually running, this thing will burn CPU hard.
For this specific use case, that's likely not a problem but I'm wondering if a sleep(0.1) would not have an equal power without us putting this kind of instrumentation in.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is OK only as long as its usage is limited to unit tests

@fjetter
Copy link
Member

fjetter commented Apr 6, 2022

I don't like this watcher but I don't want to block this PR any longer. I opened #6075 to discuss the future of the profiler in our test suite

@fjetter fjetter merged commit d074c9e into dask:main Apr 6, 2022
@crusaderky crusaderky deleted the weakref branch April 6, 2022 14:21
@gjoseph92
Copy link
Collaborator

test_weakref_cache failed agin on July 5 on macOS with a 300s timeout: https://github.com/dask/distributed/runs/7199711175?check_suite_focus=true

@graingert is that a different type of failure, or what you would have expected this to fix? I haven't looked at this PR at all, so I'm not sure.

@crusaderky
Copy link
Collaborator Author

The new failure has nothing to do with test_weakref_cache; the log is showing that the problem is that profile_lock has remained acquired by some other test before it.

@crusaderky
Copy link
Collaborator Author

cc @hendrikmakait

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky test Intermittent failures on CI. regression
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants