Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix setting of watchdog timer #59

Closed
wants to merge 2 commits into from

Conversation

inouekazu
Copy link
Contributor

This request has two commits.

  1. inouekazu@ad91908
    Since resources.watchdog_timeout of cmap is not used as watchdog timeout, fixed it.
  2. inouekazu@a765ca6
    Since the watchdog default timeout (6sec) is not set (ioctl(dog, WDIOC_SETTIMEOUT, &watchdog_timeout); is not called..), fixed it.

@jfriesse
Copy link
Member

@inouekazu: First patch is nice catch, but it may be shorter to just replace != CS_OK to == CS_OK making code a little more readable.

Second patch makes no sense to me. watchdog_timeout variable is always set. On the beginning it's 0, then watchdog_timeout_apply is called and it's set to ether resources.watchdog_timeout or WD_DEFAULT_TIMEOUT_SEC so it should work. Also I'm unable to understand removal of timer_add_duration, what is purpose of that?

@inouekazu
Copy link
Contributor Author

First patch:
Other parts are also != CS_OK, so I thought the same style is more readable.

corosync/exec/wd.c

Lines 689 to 700 in 252b38a

if (icmap_get_uint32("resources.watchdog_timeout", &tmp_value_32) != CS_OK) {
watchdog_timeout_apply (WD_DEFAULT_TIMEOUT_SEC);
icmap_set_uint32("resources.watchdog_timeout", watchdog_timeout);
}
else {
if (tmp_value_32 >= 2 && tmp_value_32 <= 120) {
watchdog_timeout_apply (tmp_value_32);
} else {
watchdog_timeout_apply (WD_DEFAULT_TIMEOUT_SEC);
}
}

corosync/exec/wd.c

Lines 476 to 482 in 252b38a

if (icmap_get_uint64(key_name, &tmp_value) != CS_OK) {
/* key does not exist.
*/
ref->last_updated = 0;
} else {
ref->last_updated = tmp_value;
}

Second patch:
I add the following debug to the latest version (https://github.com/corosync/corosync/tree/252b38ab8a62ff083e83b1d6f514109f7b7cbb42) and run.

# git diff
diff --git a/exec/wd.c b/exec/wd.c
index 093747f..f070ced 100644
--- a/exec/wd.c
+++ b/exec/wd.c
@@ -720,6 +720,14 @@ static char *wd_exec_init_fn (struct corosync_api_v1 *corosync_api)
    api->timer_add_duration(tickle_timeout*MILLI_2_NANO_SECONDS, NULL,
                wd_tickle_fn, &wd_timer);

+#if 1
+{
+   uint32_t w_timeout;
+   ioctl(dog, WDIOC_GETTIMEOUT, &w_timeout);
+   log_printf(LOGSYS_LEVEL_INFO, "WDIOC_GETTIMEOUT [%d]", w_timeout);
+}
+#endif
+
    return NULL;
 }

The value of variable 'new' and 'original_timeout' is same in watchdog_timeout_apply() called from wd_exec_init_fn(), so it doesn't reach the IOCTL with WDIOC_SETTIMEOUT.
Therefore, timeout (that is obtained using the IOCTL with WDIOC_GETTIMEOUT) wasn't changed from 70. refer to [*1]

# rmmod softdog; modprobe softdog soft_margin=70

# gdb /usr/sbin/corosync
 :
(gdb) b watchdog_timeout_apply
Breakpoint 1 at 0x3028c: file wd.c, line 588.
(gdb) run
Starting program: /usr/sbin/corosync
 :
Breakpoint 1, watchdog_timeout_apply (new=6) at wd.c:588
588             uint32_t original_timeout = watchdog_timeout;
(gdb) bt
#0  watchdog_timeout_apply (new=6) at wd.c:588
#1  0x00007ffff7fef5af in watchdog_timeout_get_initial () at wd.c:690
#2  0x00007ffff7fef646 in wd_exec_init_fn (corosync_api=0x7ffff8200aa0) at wd.c:714
 :
(gdb) n
590             if (new == original_timeout) {
(gdb) p original_timeout
$1 = 6
(gdb) n
591                     return;
(gdb) c
Continuing.

Breakpoint 1, watchdog_timeout_apply (new=6) at wd.c:588
588             uint32_t original_timeout = watchdog_timeout;
(gdb) bt
#0  watchdog_timeout_apply (new=6) at wd.c:588
#1  0x00007ffff7fef4d7 in setup_watchdog () at wd.c:654
#2  0x00007ffff7fef64b in wd_exec_init_fn (corosync_api=0x7ffff8200aa0) at wd.c:716
 :
(gdb) n
590             if (new == original_timeout) {
(gdb) p original_timeout
$2 = 6
(gdb) n
591                     return;
(gdb) fin
Run till exit from #0  watchdog_timeout_apply (new=6) at wd.c:591
setup_watchdog () at wd.c:656
656             ioctl(dog, WDIOC_SETOPTIONS, WDIOS_ENABLECARD);
(gdb) fin
Run till exit from #0  setup_watchdog () at wd.c:656
wd_exec_init_fn (corosync_api=0x7ffff8200aa0) at wd.c:718
718             wd_scan_resources();
Value returned is $3 = 0
(gdb) n
720             api->timer_add_duration(tickle_timeout*MILLI_2_NANO_SECONDS, NULL,
(gdb) n
726             ioctl(dog, WDIOC_GETTIMEOUT, &w_timeout);
(gdb) n
727             log_printf(LOGSYS_LEVEL_INFO, "WDIOC_GETTIMEOUT [%d]", w_timeout);
(gdb) p w_timeout
$4 = 70  [*1]
(gdb)

Also I'm unable to understand removal of timer_add_duration, what is purpose of that?

When wd.c#L588-L592 of patch is applied, timer_add_duration() in watchdog_timeout_apply() will be executed at the start of corosync. In this case, the second timer will be added by the timer_add_duration() in wd_exec_init_fn().

I add the following debug and 'start corosync'.

# git diff
diff --git a/exec/wd.c b/exec/wd.c
index 093747f..ccb0ded 100644
--- a/exec/wd.c
+++ b/exec/wd.c
@@ -500,6 +500,7 @@ static void wd_tickle_fn (void* arg)

    if (watchdog_ok) {
        if (dog > 0) {
+           log_printf (LOGSYS_LEVEL_INFO, "call WDIOC_KEEPALIVE");
            ioctl(dog, WDIOC_KEEPALIVE, &watchdog_ok);
        }
        api->timer_add_duration(tickle_timeout*MILLI_2_NANO_SECONDS, NULL,
  • syslog
Jan 21 14:00:22 vm1 corosync[5285]:  [MAIN  ] Corosync Cluster Engine ('2.3.4.15-252b3'): started and ready to provide service.
 :
Jan 21 14:00:26 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:26 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE

Jan 21 14:00:29 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:29 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE

Jan 21 14:00:32 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:32 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:35 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:35 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:38 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:38 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:41 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:41 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:44 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
Jan 21 14:00:44 vm1 corosync[5286]:  [WD    ] call WDIOC_KEEPALIVE
 :

@HideoYamauchi
Copy link
Contributor

Hi All,

How did the story of this problem turn out?
The community of Japanese linux-ha looks forward to the solution to this problem.

Best Regards,
Hideo Yamauchi.

@chrissie-c
Copy link
Contributor

I've checked this out and the problems address by the patches are correct. If you can change the first one so it just swaps != CS_OK to == CS_OK then I'll pull them.

@inouekazu
Copy link
Contributor Author

I revised it.

@chrissie-c
Copy link
Contributor

Now merged. Thanks!

@chrissie-c chrissie-c closed this Jul 14, 2015
@HideoYamauchi
Copy link
Contributor

Hi Christine,

Many Thanks!

@jfriesse
Copy link
Member

Sadly this patch causes some kind of regression and cts tests now doesn't pass. So reopening.

@jfriesse jfriesse reopened this Aug 26, 2015
@chrissie-c
Copy link
Contributor

The problem here is that the timeout is now being set and the script seems to take too long to get started so the node gets killed in the 6 seconds before the test script removes the resource.

This opened up a whole new can of worms in that the watchdog code does a log of icmap_get_uint*() calls to read values but they are held as strings. So those reads always fail. My first try to fix this test was to increase the watchdog_timer initial value but that doesn't work because of this.

If someone, who actually uses watchdog, wants to submit a patch to fix this problem of string/int then please feel free. I don't have the time right now.

@jfriesse
Copy link
Member

@chrissie-c Thanks for looking to problem. It may make sense to store values read by watchdog as int.

@yuusuke
Copy link
Contributor

yuusuke commented Sep 11, 2015

Hi

I made a correction to read watcdog_timeout from corosync.conf .
and, made modifications to warn when I set a value out of the range .
How does that look?

#87

@@ -585,7 +585,11 @@ static void wd_scan_resources (void)
static void watchdog_timeout_apply (uint32_t new)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document this function, such as it's parameters, and a description of what actions it takes and why it might be called.

@jfriesse jfriesse closed this May 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants