service activation regression, could not activate remote peer #258

Closed
lucab opened this issue Feb 15, 2021 · 6 comments

@lucab

lucab commented Feb 15, 2021

There is a regression in service activation that we started observing in rpm-ostree CI after a dbus-broker update. The last known-working version is v24, while both v26 and v27 show the same regression symptoms (though maybe with different root causes):

# rpm-ostree status > /dev/null; systemctl kill -s KILL rpm-ostreed; rpm-ostree status > /dev/null && echo ok
error: Could not activate remote peer.

# systemctl status rpm-ostreed
* rpm-ostreed.service - rpm-ostree System Management Daemon
     Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static)
     Active: activating (start) since Mon 2021-02-15 11:10:58 UTC; 20ms ago
       Docs: man:rpm-ostree(1)
   Main PID: 3271 (rpm-ostree)
      Tasks: 1 (limit: 9444)
     Memory: 1.3M
     CGroup: /system.slice/rpm-ostreed.service
             `-3271 /usr/bin/rpm-ostree start-daemon

Feb 15 11:10:58 local-test systemd[1]: Starting rpm-ostree System Management Daemon...

At least on v27 it doesn't seem to be a race (it can be reproduced manually, and adding an artificial sleep does not help), but it looks like systemctl reset-failed has some effect on it:

# rpm-ostree status > /dev/null; systemctl kill -s KILL rpm-ostreed; sleep 15; rpm-ostree status > /dev/null && echo ok
error: Could not activate remote peer.

# rpm-ostree status > /dev/null; systemctl kill -s KILL rpm-ostreed; systemctl reset-failed; rpm-ostree status > /dev/null && echo ok
ok

Cross-references:

@dvdhrm
Member

dvdhrm commented Feb 16, 2021

I did some more investigation, because the bus-log seems really out of order. But reading through the systemd sources, it seems this is rather obscure:

When you send StartUnit to systemd, it will queue a local job to start the unit, then reply with success to StartUnit. However, there is no guarantee that this local job has been scheduled yet. So even after you got the reply to StartUnit, a query for the unit state might still return failed (if it already was in failed state). It even sends out PropertiesChanged with ActiveState=failed. Moreover, even once the unit's process has been forked and the properties already report its PID, the state is still failed. Only after some time will the state change to activating (don't ask me when...). 😞
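
Roughly, this window can be poked at from the shell too. A minimal sketch (the busctl/systemctl invocations and the choice of rpm-ostreed.service are my illustration, not what dbus-broker does internally; exact timing depends on the systemd version):

    # assuming rpm-ostreed.service is currently in "failed" state, ask PID 1
    # to start it again via the Manager D-Bus API ...
    busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager StartUnit ss rpm-ostreed.service replace
    # ... StartUnit has already replied with a job path here, yet this query
    # may still report ActiveState=failed for a short while
    systemctl show -p ActiveState rpm-ostreed.service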

systemd does return a job ID in the reply to StartUnit. The problem is that the ID is only known once we receive the reply, so we would install a possible D-Bus match too late and miss any signals sent in between. The alternative would be to listen to ALL job notifications of systemd, which is just ridiculous. Hence, the Job API is not useful for tracking a unit operation. 😞

I don't know what to do. sigh

@dvdhrm
Member

dvdhrm commented Feb 16, 2021

The reason we want this activation-tracking is to make sure our TransientUnit handling matches how the old-style direct activation in dbus-daemon works (which tracks exit-code failures of the processes it executes).

I now verified that the classic systemd activation in dbus-daemon actually never tracks service failures. It simply verifies that the job was queued successfully (see the ActivationRequest signal handling in systemd). So we could technically just enable this tracking for transient units. However, it would be a missed opportunity.

I currently see no other way than watching JobRemoved on systemd for all jobs and then checking whether a queued job failed. This should be safe to do, since it is also what systemd does internally (though it is quite sad that we need such broad matches for it).
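
For illustration, the shape of that approach from a shell, as a rough sketch (the busctl commands and rpm-ostreed.service are my own example, not the actual dbus-broker code; it also assumes some client keeps a Manager.Subscribe() subscription open so that systemd broadcasts job signals at all):

    # install the broad JobRemoved match *before* issuing StartUnit, so the
    # result of our own job cannot slip by unnoticed ...
    busctl monitor \
        --match "type='signal',sender='org.freedesktop.systemd1',interface='org.freedesktop.systemd1.Manager',member='JobRemoved'" &
    MONITOR=$!
    # ... then queue the job; the StartUnit reply carries the job object path
    busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
        org.freedesktop.systemd1.Manager StartUnit ss rpm-ostreed.service replace
    # the JobRemoved signal whose job path matches that reply carries the
    # result string ("done", "failed", "canceled", ...) for this activation
    sleep 2; kill "$MONITOR"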

@dvdhrm
Member

dvdhrm commented Feb 17, 2021

@lucab I just pushed dbus-broker-27-2 into rawhide, which includes two recent commits that just landed on main. After discussing this with Lennart, I decided to add JobRemoved tracking. A bit unfortunate, but we can reduce the match to a short time frame just around the StartUnit call.

Anyway, can you give this a try? My local tests now all succeed.

@lucab
Author

lucab commented Feb 18, 2021

@dvdhrm I tested 27-2.fc35 manually and I'm not seeing the issue anymore!

@lucab
Author

lucab commented Feb 18, 2021

Not sure if the plan is to tag a v28 for this, but in either case it would be good to propagate this fix back downstream to F33, as the update there introduced the regression.

@dvdhrm
Member

dvdhrm commented Mar 2, 2021

This is now in F32 up to rawhide as dbus-broker-27-2. Given the other fixes and improvements, I will likely also release v28 into rawhide this week.

Closing this as it is resolved. Thanks a lot for the help!

@dvdhrm dvdhrm closed this as completed Mar 2, 2021