CA library not safe to use when shutting down #866

Closed

Conversation

MikeHart85
Contributor

Following up on the recurring unexpected exceptions/segfaults during teardown:

@danielballan and @dmgav and I chatted about this a bit this morning and in a nutshell concluded:

  1. Solving this in pyepics, along the lines of what you proposed here, @klauer, is ultimately the correct way to tackle this. It will need some fleshing out, though, to address the additional exceptions I mentioned in Call to clear_auto_monitor() in release_pvs() results in exception on program exit #834 (comment).
  2. Alternatively, or additionally, to make ophyd play nice with current versions of pyepics, it may be a good idea to have ophyd avoid making any calls that involve libCA once it detects that Python is shutting down, since it is no longer safe to do so at that point.

The change in this PR is an idea on how to achieve point 2.
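
Roughly, the guard has the shape sketched below. This is not the exact diff: release_pvs comes from ophyd's _pyepics_shim.py, but its signature here and the atexit registration are assumptions for illustration only.

import atexit

from epics import ca  # pyepics' low-level Channel Access bindings


def invalidate_ca():
    # Drop this module's reference to ca so later calls fail loudly
    # instead of segfaulting inside libCA during interpreter shutdown.
    global ca
    ca = None


atexit.register(invalidate_ca)


def release_pvs(*pvs):  # illustrative signature
    if ca is None:
        # Python is shutting down; touching libCA is no longer safe.
        return
    for pv in pvs:
        pv.clear_auto_monitor()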

It appears to work, in that the caproto tests run without problems with this in place.

It feels a bit dirty though, and makes me question my sanity, so I welcome any and all critical comments and thoughts on why this is a horrible idea.


def invalidate_ca():
    global ca
    ca = None
Contributor

Maybe it's a good idea to add a comment in the code explaining why exactly we want to invalidate ca (an accidental attempt to use ca during interpreter shutdown will raise an exception, which is better than a segfault, and we also check if ca is None when needed).

Contributor Author

Agreed, this definitely needs some comments to explain what's going on if we decide to go ahead with this.

As you correctly picked up, the basic idea is to make attempts to use ca "fail in a better way", so to speak, and to force that failure deterministically instead of depending on seemingly random circumstances, like calling pytest with exactly the right arguments.

One problem with this is that we also import epics, caget, and caput above. Even if we set all of them to None here, there are still references to PVs floating around, and, for example, a call to pv.clear_auto_monitor() will still trigger the problem.

So vigilance is still required and arguably picking on ca in particular is a bit arbitrary here.
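
To illustrate the loophole (a standalone sketch; 'sim:mtr1' is just an example PV name): even after this namespace's ca is set to None, a PV object held elsewhere still reaches libCA through pyepics' internals.

import epics
from epics import ca

pv = epics.get_pv('sim:mtr1')  # example PV name
ca = None                      # only rebinds this namespace's ca, not epics.ca
pv.clear_auto_monitor()        # still goes through epics.ca under the hood
                               # (and can hang/raise if the PV never connected)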

@danielballan left a comment
Member

I have a minor preference for setting up a separate piece of state rather than setting ca to None.

_ca_state = []  # truth-y when ca is shutting down

def invalidate_ca():
    _ca_state.append(object())

def ca_is_valid():
    return not _ca_state

This approach seems susceptible to race conditions, but it is a reasonable short-term fix to maintain best-effort compatibility with existing (and older) versions of pyepics.

@tacaswell
Contributor

I was about to make the exact same comment as @danielballan (but with a global bool flag rather than a list).

@MikeHart85
Contributor Author

I was going to ask: Is there an advantage to using a list over a bool?

@dmgav
Contributor

dmgav commented May 12, 2020

Surprisingly, I thought about a separate bool flag too, but then I noticed that ca is used a few more times throughout the module, so ideally we would check whether the flag is set in every function that uses ca. But in principle a separate flag would look cleaner, and it could be checked only in the functions used to release resources.

@MikeHart85
Contributor Author

I was hoping to get away with only checking in release_pvs (for now?), since I think that's the only one that could realistically be called during shutdown... is it safe to tentatively assume that, until proven otherwise?

@dmgav
Contributor

dmgav commented May 12, 2020

Also, @leofang mentioned (in Teams) that he observes global variables being set to None during interpreter shutdown. The Python docs say this shouldn't happen starting with 3.4, but it is still happening. So a more reliable solution would be to set up a bool flag that is initially True and is set to False during shutdown. If the interpreter then decides to set it to None, that will not change the logic, since both are falsy.
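
In code, the falsiness trick is roughly this (a sketch; the names are illustrative, not the actual ophyd implementation):

_ca_usable = True  # set to False when shutdown is detected


def invalidate_ca():
    global _ca_usable
    _ca_usable = False


def ca_is_usable():
    # False and None are both falsy, so this stays correct even if the
    # interpreter replaces the module global with None during teardown.
    return bool(_ca_usable)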

@dmgav left a comment
Contributor

Looks good to me. There are some flake8 issues that should be addressed.

@leofang

leofang commented May 12, 2020

I was going to ask: Is there an advantage to using a list over a bool?

I have no idea, but I know a few projects use a list. I'm interested too.

@danielballan
Member

The advantage is that if you do:

# module.py
global_state = True

def change_state():
    global global_state
    global_state = False

and some other module in the package does

from .module import global_state

it will not get up-to-date information, whereas if you mutate the object that global_state refers to, the imported reference remains useful. But that is not applicable here, and I think the current implementation is good.
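
A small single-file demonstration of the difference (plain variables stand in for module attributes; from .module import ... behaves the same way):

flag = True             # rebinding this name later will not update copies of it
state = []              # mutating this list updates every reference to it

copied_flag = flag      # like `from .module import flag`
shared_state = state    # like `from .module import state`

flag = False
state.append('shutting down')

print(copied_flag)      # True -- stale: the copy still refers to the old object
print(shared_state)     # ['shutting down'] -- current: same list object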

@danielballan
Member

Power-cycled to re-run CI now that #864 is merged, which fixed some flake8 issues that appeared on master due to flake8 becoming stricter.

@leofang

leofang commented May 12, 2020

from .module import global_state

it will not get up-to-date information

Yeah I think some style guides forbid importing module variables for exactly this reason.

@MikeHart85
Contributor Author

Thanks for the explanation @danielballan!

I modified it to work the way you suggested. Seems safer / a good habit to encourage, even if not strictly needed the way it is used here.

@MikeHart85 marked this pull request as ready for review May 13, 2020 15:28
@danielballan
Member

Will merge/release by end of week if there are no objections.

@klauer
Member

klauer commented May 13, 2020

This still isn't quite right. The downstream library typhos's test suite is hanging due to ophyd cleanup, even with this PR:

Thread 0x00000001133fedc0 (most recent call first):
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 882 in pend_io
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 544 in wrapper
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 905 in poll
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 544 in wrapper
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 1053 in connect_channel
  File "/Users/klauer/docs/Repos/pyepics/epics/ca.py", line 570 in wrapper
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 437 in connect
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 422 in wait_for_connection
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 729 in get_ctrlvars
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/ophyd/ophyd/_pyepics_shim.py", line 103 in _getarg
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 963 in count
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 357 in _check_auto_monitor
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 409 in auto_monitor
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 444 in clear_auto_monitor
  File "/Users/klauer/docs/Repos/pyepics/epics/pv.py", line 48 in wrapped
  File "/Users/klauer/docs/Repos/ophyd/ophyd/_pyepics_shim.py", line 136 in release_pvs
  File "/Users/klauer/docs/Repos/ophyd/ophyd/signal.py", line 1708 in destroy
  File "/Users/klauer/mc/envs/lucid/lib/python3.6/weakref.py", line 548 in __call__
  File "/Users/klauer/mc/envs/lucid/lib/python3.6/weakref.py", line 624 in _exitfunc

Note that release_pvs clears the auto monitor, which tries to connect because count isn't available. I had tried to fix this on the pyepics side, but apparently it was a huge can of worms (pyepics/pyepics#203).
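
A minimal way to reproduce that path (hypothetical PV name that never connects; with pyepics before the #204 fix, this blocks in wait_for_connection):

import epics

pv = epics.get_pv('XXX:DOES_NOT_EXIST')  # hypothetical, never connects
# clear_auto_monitor() consults PV.count, which is unknown for a disconnected
# PV, so pyepics tries to connect first and can hang here at teardown.
pv.clear_auto_monitor()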

@danielballan
Member

Thanks, @klauer. Will hold off in that case.

@klauer
Member

klauer commented May 15, 2020

@hhslepicka with the new pyepics master, is your test suite still failing?

@hhslepicka
Contributor

hhslepicka commented May 16, 2020

They are not failing anymore

@MikeHart85
Contributor Author

MikeHart85 commented May 22, 2020

Just checking that I understood correctly: Typhos tests pass okay with both the latest pyepics master and this branch of ophyd, correct?

To be sure we're not ending up with redundant fixes, I tried various on/off combinations of:

  1. this branch
  2. FIX: Do not block if disconnected when changing auto_monitor settings - continued pyepics/pyepics#204
  3. pyepics/pyepics@1bb4a68
  4. Use finalize instead of __del__ to tear down Signals #845

Caproto's tests pass with both 1+4 and 1+2+4. They fail with 2+4, 2+3, and 2+3+4.

If I understood correctly, Typhos tests pass with 1+2+4, but fail with just 1+4.

In summary: This PR should still be merged, but some downstream projects may experience failures unless they also have pyepics at the current master or later. 2 and 4 are needed and already merged, so no action required. 3 is not merged and doesn't need to be, so no action required.

Is this consistent with everyone else's understanding?

@klauer
Member

klauer commented May 22, 2020

I think that's probably about right, though any project with many disconnected PVs will greatly increase the possibility of the teardown hanging without (2) and (3). It certainly may happen with the ophyd test suite. Try even this simple snippet:

import ophyd

sigs = {
    i: ophyd.EpicsSignal(f'sim:mtr{i}', name=f'sig{i}')
    for i in range(10000)
}

Hangs with pyepics/pyepics@9371da8 (on clear_auto_monitor)
Succeeds with pyepics/pyepics@master

This, in my opinion, indicates that there should be a new push for a pyepics tag and a bump in ophyd's requirement of it.

@hhslepicka
Contributor

hhslepicka commented May 22, 2020

Typhos passes with:

  • pyepics - with the fix from 204
  • ophyd - master

With the fixes in pyepics, I think this PR can be closed.

@MikeHart85
Contributor Author

Okay, so we probably want to merge this and maybe beta tag it so we can fix caproto tests, but hold off on a full ophyd release until (3) is merged and pyepics has a new tag.

It would be great to confirm this branch doesn't break Typhos tests as well.

@MikeHart85
Contributor Author

And a heads-up on the pyepics side of things: with the current master, the following snippet still triggers an uncatchable exception on teardown:

import epics

class Broken:
    def __init__(self, pv):
        self.pv = epics.get_pv(pv)
    def __del__(self):
        self.pv.clear_auto_monitor()

mypv = Broken('sim:mtr1')

Adding (3) replaces the exception with the RuntimeError introduced there. That is a big improvement, though, because this one can be caught in __del__.

Using weakref.finalize in a simple example like this sidesteps the issue entirely, but as we've seen that's not enough by itself in more complex situations.
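
For reference, here is the weakref.finalize variant of the same toy example (a sketch; ophyd's actual change along these lines is #845):

import weakref

import epics


class NotBroken:
    def __init__(self, pvname):
        self.pv = epics.get_pv(pvname)
        # Pending finalize callbacks run during the atexit phase, earlier in
        # shutdown than __del__ on module globals; as noted above, that
        # sidesteps the problem in a simple case like this one.
        weakref.finalize(self, self.pv.clear_auto_monitor)


mypv = NotBroken('sim:mtr1')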

@MikeHart85
Contributor Author

MikeHart85 commented May 22, 2020

I also tried replacing, in epics/ca.py:

    if AUTO_CLEANUP:
        atexit.register(finalize_libca)

... with ...

    if AUTO_CLEANUP:
        weakref.finalize(sys.modules[__name__], finalize_libca)

... hoping that this would delay finalization enough to avoid the problem altogether, because the module shouldn't go out of scope until nothing needs it, i.e. until __del__ and any other finalizers referencing things in ca have run.

Unfortunately it made no noticeable difference. :(

@hhslepicka
Contributor

@MikeHart85 why not pursue the fix for that in pyepics instead of patching here?
It seems like the best way, despite not being the easier one.

@MikeHart85
Contributor Author

That would be ideal, agreed. Initially it just wasn't clear to me whether the error was on our end (calling functions we shouldn't be calling during teardown) or in pyepics (failing on function calls that should succeed, or at least fail more gracefully).

Now I think it's fair to say it's the latter, but we already have these workarounds, and not yet any complete solution in pyepics. If we can find one, I'd certainly prefer that to this.

@MikeHart85
Contributor Author

The pyepics PR referenced above would make this PR obsolete, but let's see if it is accepted before closing this.

@MikeHart85
Contributor Author

With the pyepics PR merged, this PR is likely obsolete and can be closed without merging.

I'm not seeing the caproto test failures anymore locally when using ophyd 1.5.0 and the latest pyepics master. It would be great to confirm that this is also the case in CI and in other projects that had similar issues.

@klauer
Member

klauer commented Jun 10, 2020

If all is good, perhaps updating the pyepics requirement pinning would be appropriate?

@prjemian
Contributor

We are using this right now on the USAXS instrument and operations are successful. We can confirm that exit works smoothly with no delays or incredible exception traces.

Declare victory on this.

@prjemian
Contributor

That is, pyEpics 3.4.2 conda package from conda-forge. Thanks, all!

@MikeHart85
Contributor Author

MikeHart85 commented Jun 10, 2020

Thanks for the confirmation!

I'll go ahead and close this PR as obsolete, since the issue is fixed upstream.

Going to update requirements separately prior to releasing new ophyd version.
