Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: Correctly abort not-started transactions after client exit #2995

Merged
merged 2 commits into from Jul 15, 2021

Conversation

cgwalters
Copy link
Member

Add generalized "debugpoint" via RPMOSTREE_DEBUG

This is the inevitable application-specific, ad-hoc, informally-specified
reimplementation¹ of a tiny subset of BPF+tracepoints.

One way to view this is similar to strace + optional fault injection.

The existing RPMOSTREE_GDB_HOOK becomes RPMOSTREE_DEBUG=main=sigstop.

But one can pair arbitrary debugpoints with a set of arbitrary actions,
e.g. RPMOSTREE_DEBUG=main=exit will exit immediately after main instead
of stopping (not that that's very useful, but it illustrates the mechanism).

A bit more useful, I added a new debugpoint just after client transaction
connect to help reliably trigger a bug:
env RPMOSTREE_DEBUG=client::connect=exit.

(OK but wait, why not use tracepoints/BPF? It'd be cool but I
don't want to actually try dragging all of that in just for this bug.
This little bit of code will carry us for a while until we get
up the energy to go that way)

¹ Reference to https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule


daemon: Correctly abort not-started transactions after client exit

https://bugzilla.redhat.com/show_bug.cgi?id=1982389

At least one OpenShift user hit this race:

  • MCD starts rpm-ostree upgrade (client process)
  • Daemon receives DBus request, and waits for client to connect to the transaction progress DBus socket
  • MCO is also upgrading the MCD daemonset, and this ends up killing the MCD pod, which also kills the rpm-ostree client process
    (I think, or at least the client process died in some way)
  • We're now stuck because the transaction code was buggy and
    didn't actually clear the transaction if the client exited
    before connecting

In this situation one can't even rpm-ostree cancel actually,
only systemctl restart rpm-ostreed.

The fix is simple; we were emitting "closed" but we actually
need to explicitly clear the transaction.


@lucab
Copy link
Contributor

lucab commented Jul 15, 2021

For reference, it looks like you are manually implementing so called "failpoints". To that extent, elsewhere (e.g. in Zincati) we do use the fail crate which is nicely configurable. However I've only seen it used in tests and behind a feature gate so far, so I don't know if there are some gotcha on leaving that always on.

@cgwalters
Copy link
Member Author

For reference, it looks like you are manually implementing so called "failpoints". To that extent, elsewhere (e.g. in Zincati) we do use the fail crate which is nicely configurable.

Indeed! Though it looks like it's missing sigstop but that's probably easily added. Once we do that we can replace RPMOSTREE_GDB_HOOK too.

However I've only seen it used in tests and behind a feature gate so far, so I don't know if there are some gotcha on leaving that always on.

Right. It seems to be oriented for use in unit tests. But because we need to run under systemd in a VM, that's not something really doable in cargo test. I skimmed the code; it looks like each failpoint allocates 3 mutexes, etc. Surprisingly heavyweight. I guess they're doing this because they're trying to support an API where another thread can change things dynamically, which we don't need. So, something to consider if we were to make a lot of use of this, but seems fine for now.

@cgwalters
Copy link
Member Author

Looks like there's https://crates.io/crates/tracers too.

jlebon
jlebon previously approved these changes Jul 15, 2021
Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work investigating this!

rust/src/failpoint_bridge.rs Outdated Show resolved Hide resolved
This also add a new debugpoint just after client transaction
connect to help reliably trigger a bug.  Will be used in
a next commit.
https://bugzilla.redhat.com/show_bug.cgi?id=1982389

At least one OpenShift user hit this race:

- MCD starts `rpm-ostree upgrade` (client process)
- Daemon receives DBus request, and waits for client to connect to the transaction progress DBus socket
- MCO is also upgrading the MCD daemonset, and this ends up killing the MCD pod, which also kills the rpm-ostree client process
  (I think, or at least the client process died in some way)
- We're now stuck because the transaction code was buggy and
  didn't actually clear the transaction if the client exited
  before connecting

In this situation one can't even `rpm-ostree cancel` actually,
only `systemctl restart rpm-ostreed`.

The fix is simple; we were emitting "closed" but we actually
need to explicitly clear the transaction.
@jlebon jlebon enabled auto-merge July 15, 2021 15:09
@jlebon jlebon merged commit ce48b49 into coreos:main Jul 15, 2021
cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request Jul 16, 2021
This is a workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1982389
which is already fixed in rpm-ostree in
coreos/rpm-ostree#2995
because it will take a fair while until we can ship the fixed
rpm-ostreed in RHEL and then OpenShift stable versions.
(Yes, this is a sad recurring pattern)

The updater client gains an explicit `Initialize` method, where
we also explicitly `systemctl start rpm-ostreed` which
then effectively rolls in the change from
coreos/rpm-ostree#2945
too.
cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request Feb 25, 2022
This is a workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1982389
which is already fixed in rpm-ostree in
coreos/rpm-ostree#2995
because it will take a fair while until we can ship the fixed
rpm-ostreed in RHEL and then OpenShift stable versions.
(Yes, this is a sad recurring pattern)

The updater client gains an explicit `Initialize` method, where
we also explicitly `systemctl start rpm-ostreed` which
then effectively rolls in the change from
coreos/rpm-ostree#2945
too.
cgwalters added a commit to cgwalters/machine-config-operator that referenced this pull request Jul 8, 2022
This is a workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1982389
which is already fixed in rpm-ostree in
coreos/rpm-ostree#2995
because it will take a fair while until we can ship the fixed
rpm-ostreed in RHEL and then OpenShift stable versions.
(Yes, this is a sad recurring pattern)

The updater client gains an explicit `Initialize` method, where
we also explicitly `systemctl start rpm-ostreed` which
then effectively rolls in the change from
coreos/rpm-ostree#2945
too.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants