daemon: Correctly abort not-started transactions after client exit #2995
For reference, it looks like you are manually implementing so-called "failpoints". For that purpose, elsewhere (e.g. in Zincati) we use the fail crate, which is nicely configurable. However, I've only seen it used in tests and behind a feature gate so far, so I don't know if there are any gotchas with leaving it always on.
Indeed! Though it looks like it's missing
Right. It seems to be oriented toward use in unit tests. But because we need to run under systemd in a VM, that's not something really doable in
Looks like there's https://crates.io/crates/tracers too.
Nice work investigating this!
This also adds a new debugpoint just after the client transaction connect to help reliably trigger a bug. It will be used in a later commit.
https://bugzilla.redhat.com/show_bug.cgi?id=1982389

At least one OpenShift user hit this race:

- MCD starts `rpm-ostree upgrade` (client process)
- Daemon receives DBus request, and waits for client to connect to the transaction progress DBus socket
- MCO is also upgrading the MCD daemonset, and this ends up killing the MCD pod, which also kills the rpm-ostree client process (I think, or at least the client process died in some way)
- We're now stuck because the transaction code was buggy and didn't actually clear the transaction if the client exited before connecting

In this situation one can't even `rpm-ostree cancel` actually, only `systemctl restart rpm-ostreed`.

The fix is simple; we were emitting "closed" but we actually need to explicitly clear the transaction.
This is a workaround for https://bugzilla.redhat.com/show_bug.cgi?id=1982389, which is already fixed in rpm-ostree in coreos/rpm-ostree#2995. The workaround is needed because it will take a fair while until we can ship the fixed rpm-ostreed in RHEL and then in OpenShift stable versions. (Yes, this is a sad recurring pattern.)

The updater client gains an explicit `Initialize` method, where we also explicitly `systemctl start rpm-ostreed`, which effectively rolls in the change from coreos/rpm-ostree#2945 too.
Add generalized "debugpoint" via RPMOSTREE_DEBUG
This is the inevitable application-specific, ad-hoc, informally-specified
reimplementation¹ of a tiny subset of BPF+tracepoints.

One way to view this is as similar to strace + optional fault injection.

The existing `RPMOSTREE_GDB_HOOK` becomes `RPMOSTREE_DEBUG=main=sigstop`.

But one can pair arbitrary debugpoints with a set of arbitrary actions,
e.g. `RPMOSTREE_DEBUG=main=exit` will exit immediately after main instead
of stopping (not that that's very useful, but it illustrates the mechanism).

A bit more useful, I added a new debugpoint just after client transaction
connect to help reliably trigger a bug:
`env RPMOSTREE_DEBUG=client::connect=exit`.

(OK but wait, why not use tracepoints/BPF? It'd be cool but I
don't want to actually try dragging all of that in just for this bug.
This little bit of code will carry us for a while until we get
up the energy to go that way.)

¹ Reference to https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule
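
To make the `name=action` pairing concrete, here is a minimal sketch of how such a spec could be parsed. It is not the code from this PR: the `Action` enum, `parse_debug_spec`, and `lookup` are hypothetical names, only the `sigstop` and `exit` actions mentioned above are modeled, and the comma separator for multiple entries is an assumption (the commit message only shows single pairs).

```rust
use std::collections::HashMap;

/// Action taken when a debugpoint is reached (hypothetical; only the two
/// actions named in the commit message are modeled).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Sigstop,
    Exit,
}

/// Parse a spec such as "main=sigstop" or "client::connect=exit".
/// Accepting several comma-separated entries is an assumption.
fn parse_debug_spec(spec: &str) -> Result<HashMap<String, Action>, String> {
    let mut points = HashMap::new();
    for entry in spec.split(',').map(str::trim).filter(|e| !e.is_empty()) {
        // Split on the last '=' so names containing "::" stay intact.
        let (name, action) = entry
            .rsplit_once('=')
            .ok_or_else(|| format!("missing '=' in debugpoint entry: {entry}"))?;
        let action = match action {
            "sigstop" => Action::Sigstop,
            "exit" => Action::Exit,
            other => return Err(format!("unknown debugpoint action: {other}")),
        };
        points.insert(name.to_string(), action);
    }
    Ok(points)
}

/// Return the action configured for a debugpoint, if any.
fn lookup(points: &HashMap<String, Action>, name: &str) -> Option<Action> {
    points.get(name).copied()
}

fn main() {
    let spec = std::env::var("RPMOSTREE_DEBUG").unwrap_or_default();
    match parse_debug_spec(&spec) {
        Ok(points) => match lookup(&points, "main") {
            Some(action) => println!("debugpoint main -> {action:?}"),
            None => println!("no action configured for debugpoint main"),
        },
        Err(e) => eprintln!("invalid RPMOSTREE_DEBUG: {e}"),
    }
}
```

A call site would then look up its own name (for example `client::connect`) and, if an action is configured, stop or exit the process at that point.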
daemon: Correctly abort not-started transactions after client exit
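https://bugzilla.redhat.com/show_bug.cgi?id=1982389

At least one OpenShift user hit this race:

- MCD starts `rpm-ostree upgrade` (client process)
- Daemon receives DBus request, and waits for client to connect to the transaction progress DBus socket
- MCO is also upgrading the MCD daemonset, and this ends up killing the MCD pod, which also kills the rpm-ostree client process (I think, or at least the client process died in some way)
- We're now stuck because the transaction code was buggy and didn't actually clear the transaction if the client exited before connecting

In this situation one can't even `rpm-ostree cancel` actually, only `systemctl restart rpm-ostreed`.

The fix is simple; we were emitting "closed" but we actually need to explicitly clear the transaction.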
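
The shape of the fix can be illustrated with a toy model. This is only a sketch under stated assumptions, not the daemon's actual code: `Daemon`, `Transaction`, `TxnState`, and the method names are all hypothetical, and the "closed" signal is modeled as a string pushed onto a list.

```rust
/// Hypothetical transaction states: waiting for the client to connect to
/// the progress socket, or actually running.
#[derive(Debug, PartialEq, Eq)]
enum TxnState {
    WaitingForClient,
    Running,
}

struct Transaction {
    owner: String, // bus name of the client that requested the transaction
    state: TxnState,
}

/// Toy daemon that allows a single active transaction at a time.
#[derive(Default)]
struct Daemon {
    active: Option<Transaction>,
    signals: Vec<String>, // stand-in for emitted signals
}

impl Daemon {
    fn begin(&mut self, owner: &str) -> Result<(), String> {
        if self.active.is_some() {
            return Err("a transaction is already in progress".to_string());
        }
        self.active = Some(Transaction {
            owner: owner.to_string(),
            state: TxnState::WaitingForClient,
        });
        Ok(())
    }

    /// The client connected to the progress socket; the transaction starts.
    fn on_client_connected(&mut self) {
        if let Some(txn) = self.active.as_mut() {
            txn.state = TxnState::Running;
        }
    }

    /// The client went away. If the transaction never started, emit
    /// "closed" and also clear the slot; emitting alone leaves the daemon
    /// stuck with a transaction nobody can connect to.
    fn on_client_vanished(&mut self, owner: &str) {
        let not_started = matches!(
            &self.active,
            Some(t) if t.owner == owner && t.state == TxnState::WaitingForClient
        );
        if not_started {
            self.signals.push("closed".to_string());
            self.active = None; // the explicit clear
        }
    }
}

fn main() {
    let mut daemon = Daemon::default();
    daemon.begin(":1.42").unwrap();
    daemon.on_client_vanished(":1.42"); // client exits before connecting
    // With the slot cleared, a new client can start a transaction.
    daemon.begin(":1.43").unwrap();
    daemon.on_client_connected();
    println!("signals emitted: {:?}", daemon.signals);
}
```

Without the `self.active = None` line, the second `begin` call in this model fails, which mirrors the stuck state described above where only a daemon restart helps.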