Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

version 255.4: suspend/resume breaks network connection #280

Closed
hamkg opened this issue Apr 15, 2024 · 26 comments
Closed

version 255.4: suspend/resume breaks network connection #280

hamkg opened this issue Apr 15, 2024 · 26 comments
Assignees
Labels
bug In Progress A fix/enhancement is worked on

Comments

@hamkg
Copy link

hamkg commented Apr 15, 2024

Slackware-current 64 bit
Linux kernel 6.6.27
HP Elitebook 820 G1, Intel Haswell-ULT Integrated Graphics, Intel Wireless 7260 netwok device

After upgrading to 255.4, suspend no longer works properly. Most serious issue is that network connection breaks, NetworkManager becomes unresponsive and cannot be restarted except by rebooting.

With elogind-252, the machine's LED for wifi goes out during suspend and the on/off lead flashes.
After upgrade both LEDs are permanently on.

With elogind-252.23, closing/opening the laptop lid is sufficient for suspend/resume.
After upgrade, closing suspends but opening does not resume - I have to press power button to resume.

I never have had to touch the logind.conf or sleep.conf files earlier. I've now tried to edit them, enabling HandeLidSwitch=suspend (in logind.conf) and AllowSuspend=yes (in sleep.conf). No difference.

Recompiled elogind using the latest 255.4-r1 version: no difference
Recompiled elogind using the latest 255.4-r1 version and the change suggested here: #279 but still no difference.

Reverting to version 252.23 for now. Grateful for explanations, help, possible solutions...

@Yamakuzure
Copy link
Collaborator

Hmmm... To fix the "Zombie-Process"-Issue (#275), I suspended/resumed many times, and NetworkManager never had any problem.

After upgrade both LEDs are permanently on.

Unfortunately, upstream removed SuspendMode from sleep.conf, which now defaults to s2idle.
Both LEDs staying on sounds like s2idle being entered instead of deep sleep.

The default values of the config items upstream removed or have been added for elogind exclusively are now in /etc/elogind/sleep.conf.d/10-elogind.conf

Therefore, I have an additional config file on my laptop (See last line)

 ~ # cat /etc/elogind/sleep.conf.d/90-hibernate_to_suspend.conf
[Sleep]
AllowPowerOffInterrupts=yes
AllowHibernation=yes
AllowSuspendInterrupts=yes
AllowSuspendThenHibernate=yes
AllowHybridSleep=yes
HandleNvidiaSleep=no
SuspendMode=deep s2idle

With this, I am always using deep mode.

However, I do not recall having issues with NetworkManager when using s2idle but I did not test it with v255.

@Yamakuzure Yamakuzure self-assigned this Apr 15, 2024
@Yamakuzure Yamakuzure added the investigating A reported issue is to be investigated label Apr 15, 2024
@hamkg
Copy link
Author

hamkg commented Apr 15, 2024

Thanks. Editing the /etc/elogind/sleep.conf.d/10-elogind.conf (allowing the SuspendMode line and set it as SuspendMode=deep s2idle) puts the machine to sleep and also means that it resumes when I open the lid. So far so good.
However, the network issue (which many slackware users experience) remains. And it is not fixed by the revised 255.4-r1 version.

@Yamakuzure
Copy link
Collaborator

Dbus is enabled and running, right?
Because elogind is telling all services that are registered via dbus when the system is going to sleep or waking up.

In my /var/log/messages I have:

Apr 11 08:44:03 sed-m7550 elogind-daemon[7856]: The system will suspend now!
Apr 11 08:44:03 sed-m7550 NetworkManager[9812]: <info>  [1712817843.5457] manager: sleep: sleep requested (sleeping: no  enabled: yes)
Apr 11 08:44:03 sed-m7550 NetworkManager[9812]: <info>  [1712817843.5462] device (wlan0): state change: unavailable -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Apr 11 08:44:03 sed-m7550 NetworkManager[9812]: <info>  [1712817843.5466] device (/net/connman/iwd/0): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Apr 11 08:44:03 sed-m7550 NetworkManager[9812]: <info>  [1712817843.5468] manager: NetworkManager state is now ASLEEP
Apr 11 08:44:04 sed-m7550 e-suspend[17879]: Performing sleep operation 'suspend'...
Apr 11 08:44:04 sed-m7550 kernel: PM: suspend entry (deep)
Apr 11 08:44:32 sed-m7550 e-suspend[17879]: System returned from sleep operation 'suspend'.
Apr 11 08:44:35 sed-m7550 NetworkManager[9812]: <info>  [1712817875.0857] device (eno2): carrier: link connected

Can you see such lines in your system log, too?

@hamkg
Copy link
Author

hamkg commented Apr 16, 2024

Yes, dbus is enabled and running, eg
dbus-daemon[941]: [system] Successfully activated service 'org.blueman.Mechanism'

What I can see in the /var/log/messages file is that NetworkManager restarts after resume with elogind-252.23 but not with elogind-255.4.

With 255.4, NetworkManager goes to sleep:
NetworkManager[998]: <info> [1713271827.8245] device (wlan0): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')

but after resume, there's no mentioning of NetworkManager in /var/log/messages. Instead, there's the line
acvpnagent[1146]: Current network state: No routable network interface

@maravtdm
Copy link

maravtdm commented Apr 16, 2024

I only have that, about NM, in /var/log/syslog when resuming :
Apr 16 23:07:40 blackstar NetworkManager[1234]: <warn> [1713301660.2220] dhcp-listener: dhcp-event: (pid 1524) unhandled DHCP event for interface wlan0
Absolutely nothing in /var/log/messages until the service is restarted

Also, dbus is running here as well

# ps -p $(cat /var/run/dbus/dbus.pid)
  PID TTY          TIME CMD
 1194 ?        00:00:00 dbus-daemon

elogind: 255.4_r2

@fxj-github
Copy link

I think for some reason, elogind not send 'resume' signal to NetworkManager...

@ctrlaltca
Copy link

Looks like there's something not going right in the dbus interface when resuming:

elogind-252.23:

➜  ~ dbus-monitor --system "type='signal',sender='org.freedesktop.login1',interface='org.freedesktop.login1.Manager'"
... removed some noise messages ...
signal time=1713350817.791134 sender=:1.1 -> destination=(null destination) serial=309 path=/org/freedesktop/login1; interface=org.freedesktop.login1.Manager; member=PrepareForSleep
   boolean true
signal time=1713350818.151513 sender=:1.1 -> destination=(null destination) serial=340 path=/org/freedesktop/login1; interface=org.freedesktop.login1.Manager; member=PrepareForSleep
   boolean false

elogind-255.4_r2:

➜  ~ dbus-monitor --system "type='signal',sender='org.freedesktop.login1',interface='org.freedesktop.login1.Manager'"
... removed some noise messages ...
signal time=1713351000.383657 sender=:1.1 -> destination=(null destination) serial=245 path=/org/freedesktop/login1; interface=org.freedesktop.login1.Manager; member=PrepareForSleep
   boolean true

The "PrepareForSleep" signal with parameter "false" is not sent after resume, so applications subscribing to it to detect system resume are not notified correctly.

@ctrlaltca
Copy link

ctrlaltca commented Apr 17, 2024

I noticed a small difference in login-dbus.c in the function elogind_execute_shutdown_or_sleep() between v252.23 and v255.4-r2.
In the old, working version on the function, after the fork both the child and the parent processes call send_prepare_for( m, a->inhibit_what, false ); (lines 1773 and 1788):
https://github.com/elogind/elogind/blob/v252.23/src/login/logind-dbus.c#L1710

In the new function only the child process calls send_prepare_for( m, a, false ); at line 1935:

https://github.com/elogind/elogind/blob/v255.4-r2/src/login/logind-dbus.c#L1863

I see the function flow after the fork() has been inverted, maybe this caused one of the calls to be removed accidentally?
I'm not sure i completely understand how the code works, and why there was two calls in the old version of the code.

@Yamakuzure
Copy link
Collaborator

elogind not sending the wakeup seems to be the culprit here.

When the sleeper is forked out, the main process must not send a signal; otherwise, it would send a "prepare-for-wakeup" message while the forked-out process is working on suspending the machine.

It seems that sending the signal from the fork does not work. So, I either find out why or move sending the wakeup signal to the SIGCHLD handler. The latter seems to be more feasible to me.

@Yamakuzure Yamakuzure added bug In Progress A fix/enhancement is worked on and removed investigating A reported issue is to be investigated labels Apr 17, 2024
@ctrlaltca
Copy link

ctrlaltca commented Apr 17, 2024

Hopefully some debug log can be useful:

SLEEP

Apr 17 15:48:58 sprawl elogind-daemon[1939]: Suspend key pressed. Further action depends on the key press duration.
...
Apr 17 15:48:58 sprawl elogind-daemon[1939]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/login1 interface=org.freedesktop.login1.Manager member=PrepareForSleep cookie=245 reply_cookie=0 signature=b error-name=n/a error-message=n/a
...
Apr 17 15:48:59 sprawl elogind-daemon[1939]: Successfully forked off 'e-suspend' as PID 6562.
Apr 17 15:48:59 sprawl elogind-daemon[1939]: (DEBUG) src/login/logind-dbus.c:1913:elogind_execute_shutdown_or_sleep: Forking off 'e-suspend' returned 1 and set PID 6562
...
Apr 17 15:48:59 sprawl e-suspend[6562]: (DEBUG) src/login/logind-dbus.c:1920:elogind_execute_shutdown_or_sleep: Started 'e-suspend' with PID 6562
Apr 17 15:48:59 sprawl e-suspend[6562]: (DEBUG) src/login/logind-dbus.c:1837:elogind_shutdown_or_sleep: Called for 'suspend'
Apr 17 15:48:59 sprawl e-suspend[6562]: (DEBUG) src/sleep/sleep.c:904:do_sleep: Called for 'suspend'
...
Apr 17 15:49:00 sprawl e-suspend[6562]: (sd-exec-strv) succeeded.
Apr 17 15:49:00 sprawl e-suspend[6562]: (DEBUG) src/sleep/sleep.c:505:execute: Result is 0 (callback_failed: false)

RESUME

Apr 17 15:51:04 sprawl e-suspend[6562]: Using sleep state 'mem'.
Apr 17 15:51:04 sprawl e-suspend[6562]: Successfully forked off '(sd-exec-strv)' as PID 6720.
Apr 17 15:51:04 sprawl e-suspend[6720]: (DEBUG) src/basic/argv-util.c:197:rename_process: Setting program_invocation_name to '(sd-exec-strv)'
...
Apr 17 15:51:07 sprawl e-suspend[6562]: (sd-exec-strv) succeeded.
Apr 17 15:51:07 sprawl e-suspend[6562]: Assertion '!bus_origin_changed(bus)' failed at src/libelogind/sd-bus/bus-convenience.c:33, function sd_bus_emit_signal_tov(). Ignoring.
Apr 17 15:51:07 sprawl e-suspend[6562]: Failed to emit PrepareForShutdown(): No child processes
Apr 17 15:51:07 sprawl e-suspend[6562]: (DEBUG) src/login/logind-dbus.c:1937:elogind_execute_shutdown_or_sleep: Exiting from e-suspend

I guess the Failed to emit PrepareForShutdown(): No child processes is the attempt at sending the PrepareForSleep(false) signal
If that's the case, the origin of the problem is sd_bus_emit_signal_tov() failing the assert

@sirenuf
Copy link

sirenuf commented Apr 17, 2024

I recently updated to 252.23.1-1 on Artix. Not sure if it is related but thought I'd share that after updating I experience similar faults on two machines, I have to manually restart NM for it to work. nm-applet just says "Networking disabled."

@sirenuf
Copy link

sirenuf commented Apr 17, 2024

I recently updated to 252.23.1-1 on Artix. Not sure if it is related but thought I'd share that after updating I experience similar faults on two machines, I have to manually restart NM for it to work. nm-applet just says "Networking disabled."

To follow up on this, my friend gave me a cached archive of 252.23-1 and I have confirmed that after downgrading the issue is eliminated. So it is definitely something wrong with the latest version.

@JPGuillemin
Copy link

I recently updated to 252.23.1-1 on Artix. Not sure if it is related but thought I'd share that after updating I experience similar faults on two machines, I have to manually restart NM for it to work. nm-applet just says "Networking disabled."

To follow up on this, my friend gave me a cached archive of 252.23-1 and I have confirmed that after downgrading the issue is eliminated. So it is definitely something wrong with the latest version.

Indeed, confirmed here as well

@rizitis
Copy link

rizitis commented Apr 20, 2024

bug appears only if resume start new session (user re-login), if user suspend-resume from cli and no re-login then everything works fine.
Maybe need Investigating how elogind interacts with udev?
Because resume and new session has removed eth0
signal time=1713549936.701372 sender=:1.2 -> destination=(null destination) serial=1162 path=/org/freedesktop/NetworkManager; interface=org.freedesktop.NetworkManager; member=DeviceRemoved

BEFORE: eth0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 ether 5c:60:ba:be:90:5e txqueuelen 1000 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

AFTER: NOT EXIST!!!

Network device status before suspend: Executing command: nmcli device status Output of command: nmcli device status DEVICE TYPE STATE CONNECTION wlan0 wifi connected COSMOTE-769552 lo loopback connected (externally) lo virbr0 bridge connected (externally) virbr0 C4:DF:39:BA:BD:3C bt disconnected -- p2p-dev-wlan0 wifi-p2p disconnected -- eth0 ethernet unavailable --

Network device status after resume: Executing command: nmcli device status Output of command: nmcli device status DEVICE TYPE STATE CONNECTION lo loopback connected (externally) lo virbr0 bridge connected (externally) virbr0 C4:DF:39:BA:BD:3C bt unmanaged -- eth0 ethernet unmanaged -- wlan0 wifi unmanaged -- p2p-dev-wlan0 wifi-p2p unmanaged -- ----------------------------------------
eth0 stay unmanaged after resume, because is removed and cant change status.

EDIT : There are major diffs from one release to other, something that might be the reason (?) is:
Functions that previously used simple PID are now using PidRef:
session_set_leader: Changed from pid_t pid to PidRef _leader.
session_reset_leader: Changed to use the PidRef structure.
The methods for mapping process leaders in a hashmap also shifted:
The old version used PID_TO_PTR(pid), while the new version utilizes the PidRef structure.
Calls to the hashmap for inserting and removing process leaders changed to accommodate PidRef.

The switch from simple PIDs to PidRef structures for tracking session leaders could cause inconsistencies in session management, potentially leading to a disconnect between system resources (like eth0) and their associated sessions. If the new session doesn't maintain the same references or associations, it might lead to eth0 being removed.

Also if sessions are reset or terminated upon resuming, this might trigger a re-initialization of system resources. If eth0 is tied to a session that is no longer valid or has been reset, it could be removed.

@Yamakuzure
Copy link
Collaborator

EDIT : There are major diffs from one release to other, something that might be the reason (?) is: Functions that previously used simple PID are now using PidRef: session_set_leader: Changed from pid_t pid to PidRef _leader. session_reset_leader: Changed to use the PidRef structure. The methods for mapping process leaders in a hashmap also shifted: The old version used PID_TO_PTR(pid), while the new version utilizes the PidRef structure. Calls to the hashmap for inserting and removing process leaders changed to accommodate PidRef.

The switch to PidRef is going on since September 2023. See: Upstream c79ab77

The switch from simple PIDs to PidRef structures for tracking session leaders could cause inconsistencies in session management, potentially leading to a disconnect between system resources (like eth0) and their associated sessions. If the new session doesn't maintain the same references or associations, it might lead to eth0 being removed.

I'll better check again whether everything is consistent, then.

@turtureanu
Copy link

turtureanu commented Apr 21, 2024

I recently updated to 252.23.1-1 on Artix. Not sure if it is related but thought I'd share that after updating I experience similar faults on two machines, I have to manually restart NM for it to work. nm-applet just says "Networking disabled."

Also on Artix (using OpenRC with the networkmanager and networkmanager-openrc packages), same issue: I have to manually restart NetworkManager for it to work.

When I try to enable any connection using nmtui I get the following message:

|| Could not activate connection:       │  │
││ Connection 'Wired connection 1' is   │  │
││ not available on device eth0 because │  |
││ device is strictly unmanaged         │  |

Running nmcli outputs:

lo: connected (externally) to lo
        "lo"
        loopback (unknown), 00:00:00:00:00:00, sw, mtu 65536
        inet4 127.0.0.1/8
        route4 127.0.0.0/8 via 127.0.0.1 metric 0
        inet6 ::1/128

docker0: connected (externally) to docker0
        "docker0"
        bridge, 02:42:25:CF:0A:EF, sw, mtu 1500
        inet4 172.17.0.1/16
        route4 172.17.0.0/16 metric 0

eth0: unmanaged
        "Realtek RTL8111/8168/8411"
        ethernet (r8169), B0:25:AA:46:63:1C, hw, mtu 1500

wlan0: unmanaged
        "Intel 6 AX200"
        wifi (iwlwifi), E0:D4:64:87:E6:91, hw, mtu 1500

p2p-dev-wlan0: unmanaged
        "p2p-dev-wlan0"
        wifi-p2p, hw

Use "nmcli device show" to get complete information about known devices and
"nmcli connection show" to get an overview on active connection profiles.

Consult nmcli(1) and nmcli-examples(7) manual pages for complete usage details.

Yamakuzure pushed a commit that referenced this issue Apr 21, 2024
The forked out sleeper process fails to send the wakeup signal, as it
does not share the dbus connection with elogind.

Therefore elogind sends the signal itself once the sleeper has
messaged elogind that it is done via the SIGCHLD signal.

Bug: #280
Signed-off-by: Sven Eden <sven@eden-worx.com>
@Yamakuzure
Copy link
Collaborator

With commit ce3616c I have NetworkManager reporting that a wakeup is requested again.

Please check whether this commit fixes the issue for you, too.

@rizitis
Copy link

rizitis commented Apr 21, 2024

With commit ce3616c I have NetworkManager reporting that a wakeup is requested again.

Please check whether this commit fixes the issue for you, too.

Slackware64-current, I rebuilded elogind as you suggested and polkit. Things for me looking good.
Thank you sir.

@hamkg
Copy link
Author

hamkg commented Apr 21, 2024

Ditto. Rebuilt elogind with the changes from the commit and slackware's buildscript (and rebuilt polkit after upgrading elogind.
Everything seems to work fine: NetworkManager reconnects, and Thunderbird (running when suspending) regains connection to mailservers.
Thanks a lot to the elogind team and to everyone that have contributed here.

@maravtdm
Copy link

maravtdm commented Apr 21, 2024

Same here

It works like a charm
OS: Slackware-current
elogind: 255.4-r2 + patch

Apr 21 20:02:53 blackstar kernel: PM: hibernation: hibernation exit Apr 21 20:02:53 blackstar ModemManager[1049]: <info> [sleep-monitor-systemd] system is resuming Apr 21 20:02:53 blackstar NetworkManager[1006]: <info> [1713722573.4649] manager: sleep: wake requested (sleeping: yes enabled: yes)

Thanks

@af7567
Copy link

af7567 commented Apr 21, 2024

Fixed for me too, also on slackware current.
Tested with deep suspend, hibernate, and suspend-then-hibernate modes. No longer need to restart networkmanager/firefox/thunderbird/spotify after resume.
Thanks all.

Yamakuzure pushed a commit that referenced this issue Apr 22, 2024
The forked out sleeper process fails to send the wakeup signal, as it
does not share the dbus connection with elogind.

Therefore elogind sends the signal itself once the sleeper has
messaged elogind that it is done via the SIGCHLD signal.

Bug: #280
Signed-off-by: Sven Eden <sven@eden-worx.com>
@Yamakuzure
Copy link
Collaborator

Yamakuzure commented Apr 22, 2024

Thank you all very very much for your reports and your feedback!

@titetanium
Copy link

I recently updated to 252.23.1-1 on Artix. Not sure if it is related but thought I'd share that after updating I experience similar faults on two machines, I have to manually restart NM for it to work. nm-applet just says "Networking disabled."

To follow up on this, my friend gave me a cached archive of 252.23-1 and I have confirmed that after downgrading the issue is eliminated. So it is definitely something wrong with the latest version.

Indeed, confirmed here as well

I'm also seeing this on Devuan ceres (Debian sid) as well. Suspend/resume works but NetworkManager is killed/disabled upon resume.

@hroemer
Copy link

hroemer commented Apr 24, 2024

I did have the NetworkManager problem as well, but now with elogind-255.4.2-4 I cannot suspend anymore.
The screen turns black, backlight is still on, system locked, but no suspend is being performed.

dbus-monitor --system "type='signal',sender='org.freedesktop.login1',interface='org.freedesktop.login1.Manager'"
signal time=1713979606.600314 sender=:1.2 -> destination=(null destination) serial=165 path=/org/freedesktop/login1; interface=org.freedesktop.login1.Manager; member=PrepareForSleep
   boolean true
signal time=1713979610.526106 sender=:1.2 -> destination=(null destination) serial=171 path=/org/freedesktop/login1; interface=org.freedesktop.login1.Manager; member=PrepareForSleep
   boolean false

@af7567
Copy link

af7567 commented Apr 24, 2024

@hroemer This sounds like the issue mentioned above where you need to set the SuspendMode to deep since the default has changed:
#280 (comment)

@hroemer
Copy link

hroemer commented Apr 25, 2024

@af7567 thanks, this did the trick!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug In Progress A fix/enhancement is worked on
Projects
None yet
Development

No branches or pull requests