CA library not safe to use when shutting down #866

Closed

Conversation

MikeHart85
Contributor

Following up on the recurring unexpected exceptions/segfaults during teardown:

@danielballan and @dmgav and I chatted about this a bit this morning and in a nutshell concluded:

  1. Solving this in pyepics, along the lines of what you proposed here, @klauer, is ultimately the correct way to tackle this. It will need some fleshing out, though, to address the additional exceptions I mentioned in Call to clear_auto_monitor() in release_pvs() results in exception on program exit #834 (comment).
  2. Alternatively, or additionally, to make ophyd play nice with current versions of pyepics, it may be a good idea to have ophyd avoid making any calls that involve libCA once it detects that Python is shutting down, since it is no longer safe to do so at that point.

The change in this PR is an idea on how to achieve point 2.
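
Roughly, the guard has the shape sketched below. This is not the exact diff: release_pvs comes from ophyd's _pyepics_shim.py, but its signature here and the atexit registration are assumptions for illustration only.

import atexit

from epics import ca  # pyepics' low-level Channel Access bindings


def invalidate_ca():
    # Drop this module's reference to ca so later calls fail loudly
    # instead of segfaulting inside libCA during interpreter shutdown.
    global ca
    ca = None


atexit.register(invalidate_ca)


def release_pvs(*pvs):  # illustrative signature
    if ca is None:
        # Python is shutting down; touching libCA is no longer safe.
        return
    for pv in pvs:
        pv.clear_auto_monitor()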

It appears to work, in that the caproto tests run without problems with this in place.

It feels a bit dirty though, and makes me question my sanity, so I welcome any and all critical comments and thoughts on why this is a horrible idea.


def invalidate_ca():
    global ca
    ca = None
Contributor

Maybe it's a good idea to add a comment in the code explaining why exactly we want to invalidate ca (an accidental attempt to use ca during interpreter shutdown will raise an exception, which is better than a segfault, and we also check if ca is None when needed).

Contributor Author

Agreed, this definitely needs some comments to explain what's going on if we decide to go ahead with this.

As you correctly picked up, the basic idea is to make attempts to use ca "fail in a better way", so to speak, and to force that failure deterministically instead of depending on seemingly random circumstances, like calling pytest with exactly the right arguments.

One problem with this is that we also import epics, caget, and caput above. Even if we set all of them to None here, there are still references to PVs floating around, and, for example, a call to pv.clear_auto_monitor() will still trigger the problem.

So vigilance is still required and arguably picking on ca in particular is a bit arbitrary here.
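
To illustrate the loophole (a standalone sketch; 'sim:mtr1' is just an example PV name): even after this namespace's ca is set to None, a PV object held elsewhere still reaches libCA through pyepics' internals.

import epics
from epics import ca

pv = epics.get_pv('sim:mtr1')  # example PV name
ca = None                      # only rebinds this namespace's ca, not epics.ca
pv.clear_auto_monitor()        # still goes through epics.ca under the hood
                               # (and can hang/raise if the PV never connected)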

@danielballan left a comment
Member

I have a minor preference for setting up a separate piece of state rather than setting ca to None.

_ca_state = []  # truth-y when ca is shutting down

def invalidate_ca():
    _ca_state.append(object())

def ca_is_valid():
    return not _ca_state

This approach seems susceptible to race conditions, but it is a reasonable short-term fix to maintain best-effort compatibility with existing (and older) versions of pyepics.

@tacaswell
Contributor

I was about to make the exact same comment as @danielballan (but with a global bool flag rather than a list).

@MikeHart85
Contributor Author

I was going to ask: Is there an advantage to using a list over a bool?

@dmgav
Contributor

dmgav commented May 12, 2020

Surprisingly, I thought about a separate bool flag too, but then I noticed that ca is used a few more times throughout the module, so ideally we would check whether the flag is set in every function that uses ca. But in principle a separate flag would look cleaner, and it could be checked only in the functions used to release resources.

@MikeHart85
Contributor Author

I was hoping to get away with only checking in release_pvs (for now?), since I think that's the only one that could realistically be called during shutdown... is it safe to tentatively assume that, until proven otherwise?

@dmgav
Contributor

dmgav commented May 12, 2020

Also, @leofang mentioned (in Teams) that he observes global variables being set to None during interpreter shutdown. The Python docs say this shouldn't happen starting with 3.4, but it is still happening. So a more reliable solution would be to set up a bool flag that is initially True and is set to False during shutdown. If the interpreter then decides to set it to None, that will not change the logic, since both are falsy.
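
In code, the falsiness trick is roughly this (a sketch; the names are illustrative, not the actual ophyd implementation):

_ca_usable = True  # set to False when shutdown is detected


def invalidate_ca():
    global _ca_usable
    _ca_usable = False


def ca_is_usable():
    # False and None are both falsy, so this stays correct even if the
    # interpreter replaces the module global with None during teardown.
    return bool(_ca_usable)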

@dmgav left a comment
Contributor

Looks good to me. There are some flake8 issues that should be addressed.

@leofang

leofang commented May 12, 2020

I was going to ask: Is there an advantage to using a list over a bool?

I have no idea, but I know a few projects use a list. I'm interested too.

@danielballan
Member

The advantage is that if you do:

# module.py
global_state = True

def change_state():
    global global_state
    global_state = False

and some other module in the package does

from .module import global_state

it will not get up-to-date information, whereas if you mutate the object that global_state refers to, the imported reference remains useful. But that is not applicable here, and I think the current implementation is good.
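
A small single-file demonstration of the difference (plain variables stand in for module attributes; from .module import ... behaves the same way):

flag = True             # rebinding this name later will not update copies of it
state = []              # mutating this list updates every reference to it

copied_flag = flag      # like `from .module import flag`
shared_state = state    # like `from .module import state`

flag = False
state.append('shutting down')

print(copied_flag)      # True -- stale: the copy still refers to the old object
print(shared_state)     # ['shutting down'] -- current: same list object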

@danielballan
Member

Power-cycled to re-run CI now that #864 is merged, which fixed some flake8 issues that appeared on master due to flake8 becoming stricter.

@leofang

leofang commented May 12, 2020

from .module import global_state

it will not get up-to-date information

Yeah I think some style guides forbid importing module variables for exactly this reason.

@MikeHart85
Contributor Author

Thanks for the explanation @danielballan!

I modified it to work the way you suggested. Seems safer / a good habit to encourage, even if not strictly needed the way it is used here.

@MikeHart85 marked this pull request as ready for review May 13, 2020 15:28
@danielballan
Member

Will merge/release by end of week if there are no objections.

@klauer
Member

klauer commented May 13, 2020

This still isn't quite right. The downstream library typhos's test suite is hanging due to ophyd cleanup, even with this PR:

Thread 0x00000001133fedc0 (most recent call first):
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 882 in pend_io
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 544 in wrapper
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 905 in poll
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 544 in wrapper
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 1053 in connect_channel
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 570 in wrapper
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 437 in connect
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 422 in wait_for_connection
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 729 in get_ctrlvars
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/ophyd/ophyd/_pyepics_shim.py", line 103 in _getarg
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 963 in count
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 357 in _check_auto_monitor
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 409 in auto_monitor
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 444 in clear_auto_monitor
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/ophyd/ophyd/_pyepics_shim.py", line 136 in release_pvs
  File "/Users/klauer/docs/Repos/ophyd/ophyd/signal.py", line 1708 in destroy
  File "/Users/klauer/mc/envs/lucid/lib/python3.6/weakref.py", line 548 in __call__
  File "/Users/klauer/mc/envs/lucid/lib/python3.6/weakref.py", line 624 in _exitfunc

Note that release_pvs clears the auto monitor, which tries to connect because count isn't available. I had tried to fix this on the pyepics side, but apparently it was a huge can of worms (pyepics/pyepics#203).
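
A minimal way to reproduce that path (hypothetical PV name that never connects; with pyepics before the #204 fix, this blocks in wait_for_connection):

import epics

pv = epics.get_pv('XXX:DOES_NOT_EXIST')  # hypothetical, never connects
# clear_auto_monitor() consults PV.count, which is unknown for a disconnected
# PV, so pyepics tries to connect first and can hang here at teardown.
pv.clear_auto_monitor()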

@danielballan
Member

Thanks, @klauer. Will hold off in that case.

@klauer
Member

klauer commented May 15, 2020

@hhslepicka with the new pyepics master, is your test suite still failing?

@hhslepicka
Contributor

hhslepicka commented May 16, 2020

They are not failing anymore

@MikeHart85
Contributor Author

MikeHart85 commented May 22, 2020

Just checking that I understood correctly: Typhos tests pass okay with both the latest pyepics master and this branch of ophyd, correct?

To be sure we're not ending up with redundant fixes, I tried various on/off combinations of:

  1. this branch
  2. FIX: Do not block if disconnected when changing auto_monitor settings - continued pyepics/pyepics#204
  3. pyepics/pyepics@1bb4a68
  4. Use finalize instead of __del__ to tear down Signals #845

Caproto's tests pass with both 1+4 and 1+2+4. They fail with 2+4, 2+3, and 2+3+4.

If I understood correctly, Typhos tests pass with 1+2+4, but fail with just 1+4.

In summary: This PR should still be merged, but some downstream projects may experience failures unless they also have pyepics at the current master or later. 2 and 4 are needed and already merged, so no action required. 3 is not merged and doesn't need to be, so no action required.

Is this consistent with everyone else's understanding?

@klauer
Member

klauer commented May 22, 2020

I think that's probably about right, though any project with many disconnected PVs will greatly increase the possibility of the teardown hanging without (2) and (3). It certainly may happen with the ophyd test suite. Try even this simple snippet:

import ophyd

sigs = {
    i: ophyd.EpicsSignal(f'sim:mtr{i}', name=f'sig{i}')
    for i in range(10000)
}

Hangs with pyepics/pyepics@9371da8 (on clear_auto_monitor)
Succeeds with pyepics/pyepics@master

This, in my opinion, indicates that there should be a new push for a pyepics tag and a bump in ophyd's requirement of it.

@hhslepicka
Contributor

hhslepicka commented May 22, 2020

Typhos passes with:

  • pyepics - with the fix from 204
  • ophyd - master

With the fixes in pyepics, I think this PR can be closed.

@MikeHart85
Contributor Author

Okay, so we probably want to merge this and maybe beta tag it so we can fix caproto tests, but hold off on a full ophyd release until (3) is merged and pyepics has a new tag.

It would be great to confirm this branch doesn't break Typhos tests as well.

@MikeHart85
Contributor Author

And a heads-up on the pyepics side of things: with the current master, the following snippet still triggers an uncatchable exception on teardown:

import epics

class Broken:
    def __init__(self, pv):
        self.pv = epics.get_pv(pv)
    def __del__(self):
        self.pv.clear_auto_monitor()

mypv = Broken('sim:mtr1')

Adding (3) replaces the exception with the RuntimeError introduced there. That is a big improvement, though, because this one can be caught in __del__.

Using weakref.finalize in a simple example like this sidesteps the issue entirely, but as we've seen that's not enough by itself in more complex situations.
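
For reference, here is the weakref.finalize variant of the same toy example (a sketch; ophyd's actual change along these lines is #845):

import weakref

import epics


class NotBroken:
    def __init__(self, pvname):
        self.pv = epics.get_pv(pvname)
        # Pending finalize callbacks run during the atexit phase, earlier in
        # shutdown than __del__ on module globals; as noted above, that
        # sidesteps the problem in a simple case like this one.
        weakref.finalize(self, self.pv.clear_auto_monitor)


mypv = NotBroken('sim:mtr1')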

@MikeHart85
Contributor Author

MikeHart85 commented May 22, 2020

I also tried replacing, in epics/ca.py:

    if AUTO_CLEANUP:
        atexit.register(finalize_libca)

... with ...

    if AUTO_CLEANUP:
        weakref.finalize(sys.modules[__name__], finalize_libca)

... hoping that this would delay finalization enough to avoid the problem altogether, because the module shouldn't go out of scope until nothing needs it, i.e. until __del__ and any other finalizers referencing things in ca have run.

Unfortunately it made no noticeable difference. :(

@hhslepicka
Contributor

@MikeHart85 why not pursue the fix for that in pyepics instead of patching here?
It seems like the best way, despite not being the easier one.

@MikeHart85
Contributor Author

That would be ideal, agreed. Initially it just wasn't clear to me whether the error was on our end (calling functions we shouldn't be calling during teardown) or in pyepics (failing on function calls that should succeed, or at least fail more gracefully).

Now I think it's fair to say it's the latter, but we already have these workarounds, and not yet any complete solution in pyepics. If we can find one, I'd certainly prefer that to this.

@MikeHart85
Contributor Author

The pyepics PR referenced above would make this PR obsolete, but let's see if it is accepted before closing this.

@MikeHart85
Contributor Author

With the pyepics PR merged, this PR is likely obsolete and can be closed without merging.

I'm not seeing the caproto test failures anymore locally when using ophyd 1.5.0 and the latest pyepics master. It would be great to confirm that this is also the case in CI and in other projects that had similar issues.

@klauer
Member

klauer commented Jun 10, 2020

If all is good, perhaps updating the pyepics requirement pinning would be appropriate?

@prjemian
Contributor

We are using this right now on the USAXS instrument and operations are successful. We can confirm that exit works smoothly with no delays or incredible exception traces.

Declare victory on this.

@prjemian
Contributor

That is, pyEpics 3.4.2 conda package from conda-forge. Thanks, all!

@MikeHart85
Contributor Author

MikeHart85 commented Jun 10, 2020

Thanks for the confirmation!

I'll go ahead and close this PR as obsolete, since the issue is fixed upstream.

Going to update requirements separately prior to releasing new ophyd version.
