Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Session lost when elogind restarted #104

Closed
markhindley opened this Issue Dec 12, 2018 · 20 comments

Comments

Projects
None yet
2 participants
@markhindley
Copy link
Contributor

commented Dec 12, 2018

Sven,

Please see https://bugs.debian.org/916247. I can reproduce this behaviour.

Is it expected?

Thanks

Mark

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Dec 13, 2018

Yes and no.

First of all, why did you restart a boot service that other services rely upon? That's like restarting dbus and then wondering why hell breaks loose.
When elogind is stopped, it cleans up after itself, tearing down all sessions and seats. If the user isn't in "lingering" mode, its directory in /run is removed as well. (Although I have to check that, please see below.)

Did you (or the bug reporter) wish to reload a changed configuration? This is done by sending SIGHUP to the elogind process. Sorry that it is not in the man page, yet. (See Bug #98)

What puzzles me is, that the whole desktop session didn't come crashing down. So whatever desktop manager and/or DE they use, it doesn't register the its user with elogind.
When I restart elogind while I am in Plasma, it shuts itself off, as the user session is destroyed (aka: user logged out) and I am ending up with a stale root window, and an unusable startkde and SDDM process.
SDDM doesn't fire up its greeter again, so I have to kill startkde and SDDM and restart it by hand.

About "lingering" state:
Your bug reporter might brought me upon something else, however. I set my user into "lingering" state and then restarted elogind. No difference. But when a user is in that state, their session and processes should be preserved. I'll have to look into that, whether they got preserved or not (their directories in /run/user and /run/systemd). I am pretty sure that Plasma coming down and the SDDM greeter not coming up have nothing to do with it, as, lingering or not, the user is still logged out.

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Dec 13, 2018

loginctl should get a "reload" command to help with reloading configurations. See bug #105.

@markhindley

This comment has been minimized.

Copy link
Contributor Author

commented Dec 13, 2018

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Dec 13, 2018

Yes it doesn't work for me either. Once it does, maybe this is the way through:
temporarily set all sessions to lingering, restart daemon and then reset
lingering state?

Or maybe some sort of restart command, that completely bypasses all cleanup. All relevant information is in /run/user, /run/systemd/* and /sys/fs/cgroup anyway.

What we could not control then, however, are other processes reacting on elogind unregistering from dbus. But what the impact might be remains to be seen...

@markhindley

This comment has been minimized.

Copy link
Contributor Author

commented Dec 13, 2018

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Dec 13, 2018

I think polkitd has a mechanism for replacing itself on the dbus without
upsetting anything else.

I will have a look into polkitd then to compare the mechanisms used. Thank you very much for the hint!

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Dec 13, 2018

@markhindley : For the time being until a restart command is added, you could use SIGKILL on the old daemon before starting the updated one. That would at least hotfix the upgrade/update procedures.

SIGTERM, SIGQUIT and SIGINT are all caught and turned into a graceful shutdown, ungracefully taking all sessions and seats down.
...but as SIGKILL can not be caught and immediately terminates the program... (not even atexit() is called)

@markhindley

This comment has been minimized.

Copy link
Contributor Author

commented Dec 14, 2018

@markhindley

This comment has been minimized.

Copy link
Contributor Author

commented Dec 16, 2018

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Dec 17, 2018

@markhindley : You are absolutely right.

Luckily SIGINT, SIGTERM and SIGQUIT are already caught, but all three lead to a "graceful shutdown" of elogind.

I am planning to re-purpose SIGINT, as that would be the most fitting:
QUIT is obvious, and TERM is traditionally sent when the application should shut down in an ordered way. An interrupt, however, means something like: "Cease all actions and exit immediately!"

So my plan is, to add another signal handler for SIGINT, that skips the cleaning up of sessions, seats and directories, and just exits. This would still call atexit(), which removes the PID-file.

Of course I have to finish updating the documentation, especially the signals part.

What do you think, would that be a feasible?

@markhindley

This comment has been minimized.

Copy link
Contributor Author

commented Dec 17, 2018

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Dec 19, 2018

This is a lot fishier than I thought.

When I kill elogind from a console, it does not matter whether I use SIGTERM, INT, QUIT or KILL. elogind will end and leaves all files in /run/systemd where they are. Only the inhibit information is removed.

When I start a new instance, seats, users and sessions are loaded, and active sessions and users get combinated. But nevertheless all users are then deemed stale and get removed. And with them the sessions end.

It looks like we stumbled over a totally different issue here.

Yamakuzure added a commit that referenced this issue Jan 4, 2019

Prepare xxx: Fix [user|session]_may_gc()
When sessions/users are loaded, the gc queues are gone through to
remove stale sessions and users without non-stale sessions before
starting everything.

Unfortunately we have no systemd instances running that hint us what
is stale and what is not.

On load the state is loaded, so we now assume that that everything
that was not stopping on last write is a non-stale object.

As elogind will most probably only need to load disk-stored data when
it gets restarted on updates, this should be safe.

Bug: #104
Signed-Off-By: Sven Eden <sven.eden@prydeworx.com>

Yamakuzure added a commit that referenced this issue Jan 4, 2019

Prepare v239.4: Fix [user|session]_may_gc()
When sessions/users are loaded, the gc queues are gone through to
remove stale sessions and users without non-stale sessions before
starting everything.

Unfortunately we have no systemd instances running that hint us what
is stale and what is not.

On load the state is loaded, so we now assume that that everything
that was not stopping on last write is a non-stale object.

As elogind will most probably only need to load disk-stored data when
it gets restarted on updates, this should be safe.

Bug: #104
Signed-Off-By: Sven Eden <sven.eden@prydeworx.com>
@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Jan 4, 2019

@markhindley : I was able to find the culprit eventually.
As elogind has no background systemd instances telling it what users have running services and what sessions have valid scopes, all loaded users and sesions ended up being considered as being stale.

The fix was, in the end, to simply consider only sessions to be stale, that were not in a stopping state on last write anyway.

However, I am not closing this, yet. Currently I can end any X session by restarting elogind, although the users and sessions are no longer lost. I am using SDDM and it seems to go haywire on that act.
I'd like to to some more testing without SDDM ruining my day. 😉

Could you, by any chance, test the current v239-stable branch?

@markhindley

This comment has been minimized.

Copy link
Contributor Author

commented Jan 4, 2019

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Jan 5, 2019

@markhindley : Thanks for testing!

Well, when we start things manually, we have not been told via dbus to do so. And there is no systemd in the background starting user services and telling elogind when it's done.

So what seems to be missing is the finalization of the starting process. That shouldn't be too complicated to fix.

Yamakuzure added a commit that referenced this issue Jan 17, 2019

Prep v239.4: Enhance the debug logging of starting and stopping.
When odd things happen, like currently with the session loading on
restarting elogind, it is very difficult to fathom what happens when
and why.

This commit adds debug log messages to many places where seats,
users, sessions and their FIFOs are created, attached, started,
stopped, released and finalized.

This should help determining the order of events in debugging
sessions.

Bug: #104
Signed-Off-By: Sven Eden <sven.eden@prydeworx.com>

Yamakuzure added a commit that referenced this issue Jan 17, 2019

Prep v239.4: Add special handling for SIGINT
When elogind is stopped, it tears down all resources. Unfortunately
this includes the deletion of all session and inhibitor FIFOs.

When elogind is interrupted using the SIGINT signal, elogind will no
keep the FIFOs, inhibitor state files, and will no longer go on a
killing spree in session cgroups.

This should make it possible to restart elogind without wreaking
havoc in running X11/Wayland sessions.

Note: This does not work with SDDM. SDDM still goes haywire!

Bug: #104
Signed-Off-By: Sven Eden <sven.eden@prydeworx.com>
@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Jan 17, 2019

@markhindley : Unless you use SDDM, restarting elogind should be fine now, whenever you use SIGINT to stop it.

I have no brilliant idea what makes SDDM still go nuts. So when I have taken care about some other issues, I'll take a look into its sources, to find out what makes it tell PAM that the user logged out.

@markhindley

This comment has been minimized.

Copy link
Contributor Author

commented Jan 17, 2019

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Jan 17, 2019

The misbehaving with SDDM is taken care off in issue #110

@Yamakuzure Yamakuzure closed this Jan 17, 2019

@markhindley

This comment has been minimized.

Copy link
Contributor Author

commented Jan 19, 2019

@Yamakuzure

This comment has been minimized.

Copy link
Collaborator

commented Jan 21, 2019

@markhindley You are right. I had in on my list and then forgot. 🤦‍♂️

I took your wording, that's exactly what I wanted to write anyway. 🤷‍♂️

Thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.