Fix setting of watchdog timer #59
@inouekazu: The first patch is a nice catch, but it may be shorter to just replace != CS_OK with == CS_OK, which also makes the code a little more readable. The second patch makes no sense to me. The watchdog_timeout variable is always set: at the beginning it's 0, then watchdog_timeout_apply is called and sets it to either resources.watchdog_timeout or WD_DEFAULT_TIMEOUT_SEC, so it should work. I also don't understand the removal of timer_add_duration; what is the purpose of that?
First patch: Lines 689 to 700 in 252b38a
Lines 476 to 482 in 252b38a
Second patch:
When watchdog_timeout_apply() is called from wd_exec_init_fn(), the values of 'new' and 'original_timeout' are the same, so execution never reaches the ioctl with WDIOC_SETTIMEOUT.
When wd.c#L588-L592 of the patch is applied, timer_add_duration() in watchdog_timeout_apply() is executed when corosync starts. In that case, a second timer would be added by the timer_add_duration() in wd_exec_init_fn(). I added the following debug output and started corosync.
Hi All, How did the story of this problem turn out? Best Regards,
I've checked this out and the problems addressed by the patches are real. If you can change the first one so it just swaps != CS_OK to == CS_OK then I'll pull them.
I revised it.
Now merged. Thanks!
Hi Christine, Many Thanks!
Sadly this patch causes some kind of regression and the CTS tests no longer pass. So reopening.
The problem here is that the timeout is now actually being set, and the script seems to take too long to get started, so the node gets killed within the 6 seconds before the test script removes the resource. This opened up a whole new can of worms: the watchdog code makes a lot of icmap_get_uint*() calls to read values, but the values are held as strings, so those reads always fail. My first try at fixing this test was to increase the watchdog_timer initial value, but that doesn't work because of this. If someone who actually uses watchdog wants to submit a patch to fix this string/int problem, please feel free. I don't have the time right now.
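For the string/int mismatch, one option is to read the value as a string and convert it explicitly. The following is a minimal sketch, not code from the tree: `wd_parse_uint32()` is a hypothetical helper, and the map lookup that would supply the string is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert a value that the map holds as a string
 * (for example "6") into a uint32_t.
 * Returns 0 on success, -1 if the string is not a plain decimal number
 * that fits into 32 bits. *out is only written on success.
 */
static int wd_parse_uint32 (const char *str, uint32_t *out)
{
	char *end = NULL;
	unsigned long val;

	assert (out != NULL);

	/* Reject NULL, empty strings, signs and leading whitespace up front. */
	if (str == NULL || !isdigit ((unsigned char)str[0])) {
		return -1;
	}

	errno = 0;
	val = strtoul (str, &end, 10);

	/* Reject overflow, trailing garbage ("10s") and out-of-range values. */
	if (errno != 0 || *end != '\0' || val > UINT32_MAX) {
		return -1;
	}

	*out = (uint32_t)val;
	return 0;
}
```

A caller could fall back to WD_DEFAULT_TIMEOUT_SEC whenever the helper returns -1, so that a malformed value in corosync.conf does not leave the timeout unset.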
@chrissie-c Thanks for looking into the problem. It may make sense to store values read by watchdog as int.
Hi, I made a correction to read watchdog_timeout from corosync.conf.
@@ -585,7 +585,11 @@ static void wd_scan_resources (void)
static void watchdog_timeout_apply (uint32_t new)
Please document this function: its parameters, a description of what actions it takes, and why it might be called.
This request has two commits.
resources.watchdog_timeout in cmap was not being used as the watchdog timeout; fixed.
The watchdog default timeout (6 sec) was not being set, because ioctl(dog, WDIOC_SETTIMEOUT, &watchdog_timeout); was never called; fixed.