iTCO_wdt driver broken in >= 34.20210626.3.2 #911

Closed
brokenjacobs opened this issue Jul 28, 2021 · 18 comments

Comments

@brokenjacobs

brokenjacobs commented Jul 28, 2021

Describe the bug
On builds >= 34.20210626.3.2, the system reboots unexpectedly when the systemd runtime watchdog is enabled with a drop-in file such as
/etc/systemd/system.conf.d/10-use-watchdog.conf:
[Manager]
RuntimeWatchdogSec=60s

Reproduction steps
Steps to reproduce the behavior:

  1. Create the above file on a system with hardware watchdog support.
  2. Check the watchdog with wdctl:
wdctl
Device:        /dev/watchdog0
Identity:      iTCO_wdt [version 0]
Timeout:       60 seconds
Pre-timeout:    0 seconds
Timeleft:      30 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
  3. Reload systemd with: systemctl daemon-reexec
  4. Wait 30 seconds and the system will reboot.
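
The wdctl check in step 2 can also be scripted. The following is a minimal sketch, assuming the kernel exposes the watchdog sysfs attributes (`identity`, `timeout`, `timeleft`, `state`) under `/sys/class/watchdog/watchdog0`, which depends on the kernel configuration and the driver. The function name `read_watchdog_sysfs` is made up for this illustration.

```python
from pathlib import Path


def read_watchdog_sysfs(base="/sys/class/watchdog/watchdog0"):
    """Collect a few watchdog attributes from sysfs, if they are present.

    Assumption: the attribute names below are exposed by the running kernel.
    Missing attributes are skipped rather than treated as errors.
    """
    base = Path(base)
    info = {}
    for name in ("identity", "timeout", "timeleft", "state"):
        attr = base / name
        if not attr.is_file():
            continue  # e.g. a driver that does not report timeleft
        try:
            info[name] = attr.read_text().strip()
        except OSError:
            continue  # attribute exists but could not be read
    # The two time values are plain integers in seconds.
    for name in ("timeout", "timeleft"):
        if name in info:
            info[name] = int(info[name])
    return info
```

Calling `read_watchdog_sysfs()` before and after `systemctl daemon-reexec` gives the same timeout and timeleft figures that wdctl prints, in a form that is easier to log.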

Expected behavior
I expect the watchdog to work as it did before: systemd should continuously reset the runtime watchdog at half the specified interval. It does not appear to do so.
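
To make the expectation concrete, here is a small toy model of a watchdog countdown. It is only a sketch: the function `first_expiry` and its parameters are invented for illustration, time advances in whole seconds, and it does not attempt to reproduce the actual driver behavior.

```python
def first_expiry(timeout_s, ping_interval_s, duration_s, pings_honored=True):
    """Return the second at which the countdown reaches zero, or None.

    Toy model (assumption): the countdown drops by one each second and a
    keepalive ping, when honored, reloads it to timeout_s.
    """
    remaining = timeout_s
    for t in range(1, duration_s + 1):
        remaining -= 1
        if remaining <= 0:
            return t  # the watchdog would reset the machine here
        if pings_honored and t % ping_interval_s == 0:
            remaining = timeout_s  # keepalive reloads the countdown
    return None


# Pinging at half the timeout keeps a 60 s watchdog from ever expiring.
print(first_expiry(60, 30, 600))
# If the pings have no effect, the countdown simply runs out.
print(first_expiry(60, 30, 600, pings_honored=False))
```

With `RuntimeWatchdogSec=60s` the expected case is the first one: a ping every 30 seconds and no expiry.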

Actual behavior
The system reboots.

System details

  • Bare Metal
  • any version >= 34.20210626.3.2

Additional information
I've tried looking upstream for systemd bugs but I haven't found any as of yet.

@dustymabe
Member

so 34.20210626.3.2 works and 34.20210711.3.0 fails?

@dustymabe
Member

dustymabe commented Jul 28, 2021

ahh >=. So it works in 34.20210626.3.1 and not in 34.20210626.3.2?

The difference there is systemd 248.3-1.fc34.x86_64 → 248.5-1.fc34.x86_64

@brokenjacobs
Author

Correct, I noticed the systemd update as well. I have to assume that is the cause of the issue, but I haven't managed to find anything yet.

@dustymabe
Member

dustymabe commented Jul 29, 2021

The kernel also got updated in that transition:

kernel 5.12.12-300.fc34.x86_64 → 5.12.19-300.fc34.x86_64
systemd 248.3-1.fc34.x86_64 → 248.5-1.fc34.x86_64 

Would you mind trying the latest testing release 34.20210725.2.0 (newer kernel) just to give us another datapoint?

@brokenjacobs
Author

I'll do a branch swap on my test system and report back.

@brokenjacobs
Author

Testing on 34.20210725.2.0 exhibits the same behavior.

@brokenjacobs
Author

brokenjacobs commented Jul 29, 2021

Another data point: when I reboot these systems, they report:
[ 72.628727] watchdog: watchdog0: watchdog did not stop!

They then hang on reboot, whether or not I have started the watchdog timer during the current boot. I now have to warm-reset them via IPMI.

Even after a cold boot, they still hang on reboot.

If I blacklist the iTCO_wdt module, that message no longer appears and the reboot completes.

Update: The message appears to be a red herring. The reboot completes either way; it just takes a while.

@brokenjacobs
Author

I don't think this is a systemd issue; I think it is kernel related. If I blacklist the iTCO_wdt module, my reboot hangs go away. If I instead load the ipmi_watchdog module manually, I can confirm externally, by checking the watchdog with ipmitool, that systemd is resetting it:

Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      60 sec
Present Countdown:      34 sec

And after some time:

Watchdog Timer Use:     SMS/OS (0x44)
Watchdog Timer Is:      Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval:   0 seconds
Timer Expiration Flags: 0x00
Initial Countdown:      60 sec
Present Countdown:      58 sec

I believe this is a bug in the iTCO_wdt Intel watchdog timer driver.
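
The comparison of the two ipmitool samples above can be automated. This is a rough sketch: the helper names are hypothetical, and the comment does not say which ipmitool subcommand produced the output (`ipmitool mc watchdog get` prints this format, but that is an assumption here).

```python
import re


def present_countdown(ipmitool_output):
    """Extract the 'Present Countdown' value (seconds) from the text."""
    match = re.search(
        r"^Present Countdown:\s+(\d+)\s+sec", ipmitool_output, re.MULTILINE
    )
    if match is None:
        raise ValueError("no 'Present Countdown' line found")
    return int(match.group(1))


def was_reset_between(first_output, second_output):
    """True if the countdown grew between two samples.

    A countdown only grows when something reloaded the timer. The reverse
    does not hold: a smaller value does not prove that no reset happened.
    """
    return present_countdown(second_output) > present_countdown(first_output)


first = "Initial Countdown:      60 sec\nPresent Countdown:      34 sec\n"
later = "Initial Countdown:      60 sec\nPresent Countdown:      58 sec\n"
print(was_reset_between(first, later))
```

Going from 34 to 58 seconds, as in the two samples, means the timer was reloaded in between, which is what a working keepalive looks like from the BMC side.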

@brokenjacobs brokenjacobs changed the title RuntimeWatchdogSec watchdog for systemd manager broken in >= 34.20210626.3.2 iTCO_wdt driver broken in in >= 34.20210626.3.2 (was RuntimeWatchdogSec watchdog for systemd manager broken in >= 34.20210626.3.2) Jul 29, 2021
@dustymabe dustymabe changed the title iTCO_wdt driver broken in in >= 34.20210626.3.2 (was RuntimeWatchdogSec watchdog for systemd manager broken in >= 34.20210626.3.2) iTCO_wdt driver broken in >= 34.20210626.3.2 Jul 30, 2021
@dustymabe
Member

Seems like it might be an issue with:

Could you engage upstream to report the issue?

@jan-kiszka

Yeah, mea culpa. Fix is on the way: https://lkml.org/lkml/2021/7/26/349.

@brokenjacobs
Author

Well done tracking that down. Thank you so much. Are you ok with leaving this open to track this with the upstream kernel when available?

@dustymabe
Member

> Well done tracking that down. Thank you so much.

No problem. That code base isn't super active, so it wasn't too hard to find (thanks to git).

> Are you ok with leaving this open to track this with the upstream kernel when available?

Sure.

@dustymabe dustymabe added the status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release. label Jul 30, 2021
@dustymabe
Member

> Yeah, mea culpa. Fix is on the way: https://lkml.org/lkml/2021/7/26/349.

Thank you so much @jan-kiszka!

@dustymabe
Member

dustymabe commented Sep 22, 2021

Looks like this made it into the 5.14.6 (and later) point release (c7dcbbd) and will be in the 5.15 kernel when it's released (aec4264).
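
For anyone checking a machine by hand, here is a minimal sketch of a version check based only on the statement above that the fix landed in 5.14.6 and later. The helper names are hypothetical, and the check knows nothing about backports to older stable series, so a `False` result is not conclusive.

```python
import re

# Taken from the comment above: first point release said to carry the fix.
FIXED_IN = (5, 14, 6)


def kernel_version_tuple(release):
    """Turn a release string like '5.14.6-200.fc34.x86_64' into (5, 14, 6)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def has_itco_fix(release):
    """True if the release is at or past 5.14.6 (backports are not detected)."""
    return kernel_version_tuple(release) >= FIXED_IN


print(has_itco_fix("5.12.19-300.fc34.x86_64"))
print(has_itco_fix("5.14.6-200.fc34.x86_64"))
```

On a running system the input would be `platform.release()` from the standard library, which returns the same string as `uname -r`.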

@dustymabe
Member

dustymabe commented Oct 1, 2021

5.14.6 (or newer) is in next-devel and will be in the next release of the next stream.

@dustymabe dustymabe added status/pending-next-release Fixed upstream. Waiting on a next release. and removed status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release. labels Oct 1, 2021
@dustymabe
Member

The fix for this went into next stream release 35.20211010.1.0. Please try out the new release and report issues.

@dustymabe
Member

The fix for this went into testing stream release 34.20211004.2.0. Please try out the new release and report issues.

@dustymabe dustymabe added status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. and removed status/pending-next-release Fixed upstream. Waiting on a next release. labels Oct 18, 2021
@dustymabe
Member

The fix for this went into stable stream release 34.20211004.3.1.
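
To sum up the affected range on the stable stream, here is a small sketch that classifies a version string using only the versions quoted in this thread (first broken: 34.20210626.3.2, fixed: 34.20211004.3.1). The stream numbering (1 = next, 2 = testing, 3 = stable) is inferred from the release announcements above and is an assumption, as are the helper names.

```python
# Assumption: the third field identifies the stream, as the versions in
# this thread suggest (35.20211010.1.0 next, 34.20211004.2.0 testing,
# 34.20211004.3.1 stable).
STREAMS = {1: "next", 2: "testing", 3: "stable"}

FIRST_BROKEN_STABLE = (34, 20210626, 3, 2)
FIRST_FIXED_STABLE = (34, 20211004, 3, 1)


def parse_fcos_version(version):
    """Split 'X.YYYYMMDD.S.R' into a tuple of four integers."""
    parts = tuple(int(p) for p in version.split("."))
    if len(parts) != 4:
        raise ValueError(f"expected four fields, got {version!r}")
    return parts


def is_affected_stable(version):
    """True if a stable-stream version falls in the reported broken range."""
    parsed = parse_fcos_version(version)
    if STREAMS.get(parsed[2]) != "stable":
        raise ValueError("only stable-stream versions are covered here")
    return FIRST_BROKEN_STABLE <= parsed < FIRST_FIXED_STABLE


print(is_affected_stable("34.20210711.3.0"))
print(is_affected_stable("34.20211004.3.1"))
```

The tuple comparison works because the fields are ordered from most to least significant within a single stream.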

@dustymabe dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Oct 21, 2021