New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel log message time drift #121

Closed
benkibbey opened this Issue Apr 26, 2014 · 37 comments

Comments

Projects
None yet
8 participants
@benkibbey

benkibbey commented Apr 26, 2014

There is a strange problem in syslog-ng that the kernel timestamp drifts over time. I now have an uptime of 5d 13h and the kernel log timestamp has drifted by about 11 minutes and gets greater as time passes. If I remove syslog-ng and install rsyslog or busybox-klogd/syslogd then the timestamp is correct.

This only happens with kernel log messages and not other normal messages.

@benkibbey

This comment has been minimized.

benkibbey commented Apr 26, 2014

I should also mention that restarting syslog-ng does not help. The drift is the same.

@benkibbey

This comment has been minimized.

benkibbey commented Apr 26, 2014

And running version 3.5.4.1 on Debian testing with kernel 3.14.1.

@bazsi

This comment has been minimized.

Member

bazsi commented Apr 27, 2014

Are you using /dev/kmsg as the kernel source?

@benkibbey

This comment has been minimized.

benkibbey commented Apr 27, 2014

How would I determine that? There is no mention of kmsg in syslog-ng.conf. The log source is from system() and internal() then filters KERN to kern.log. The Debian change log does mention kmsg:

  • Read /proc/kmsg directly again. It's eliminate all the problem around klogd.

But that is dated from 2002.

@bazsi

This comment has been minimized.

Member

bazsi commented Apr 27, 2014

You are most probably using /dev/kmsg then. The system driver uses that if the requisite support is present in the kernel.

Hmmm.. The timestamp supplied by kmsg is probably CLOCK_MONOTONIC (boot time plus offset). Is your system clock continuously changing (for instance by NTP)?

I'm not sure rsyslog uses this timestamp by default or not.

1 similar comment
@bazsi

This comment has been minimized.

Member

bazsi commented Apr 27, 2014

You are most probably using /dev/kmsg then. The system driver uses that if the requisite support is present in the kernel.

Hmmm.. The timestamp supplied by kmsg is probably CLOCK_MONOTONIC (boot time plus offset). Is your system clock continuously changing (for instance by NTP)?

I'm not sure rsyslog uses this timestamp by default or not.

@benkibbey

This comment has been minimized.

benkibbey commented Apr 27, 2014

Check/update the clock via a cron job once per hour via ntpdate. I do have a VM running but its idle most all of the time and without any load.

@benkibbey

This comment has been minimized.

benkibbey commented Apr 28, 2014

Maybe it's the load on the host machine. I've rebooted recently:

uptime; adjtimex -p; adjtimexconfig

21:35:45 up 10:52, 5 users, load average: 0.40, 0.55, 0.50
mode: 0
offset: 0
frequency: 1895862
maxerror: 16000000
esterror: 16000000
status: 64
time_constant: 2
precision: 1
tolerance: 32768000
tick: 9762
raw time: 1398648945s 994651us = 1398648945.994651
return value = 5
Comparing clocks (this will take 70 sec)...done.
Adjusting system time by -2055.21 sec/day to agree with CMOS clock...done.

This was after doing a rebuild of a project that uses 100% CPU for about 2.5-3 hours, although I didn't run the above commands before the build.

@bazsi

This comment has been minimized.

Member

bazsi commented Apr 28, 2014

If the host clock is drifting from clock monotonic, that explains the reason for drifting.

I'm not yet sure how to solve it though. One way would be to ignore the timestamp of the kernel.

Or if you could use the R_ prefixed time macro in this case.

@benkibbey

This comment has been minimized.

benkibbey commented Apr 28, 2014

Maybe adjust the timestamp for a kernel message on the local machine to the time of reception, but leave it alone for remote connections? Or an option for it.

@benkibbey

This comment has been minimized.

benkibbey commented Apr 28, 2014

Or an option to use /proc/kmsg rather than /dev/kmsg if both are available. /proc/kmsg doesn't include CLOCK_MONOTONIC timestamps so that would have the same effect as a local time of receptiion timestamp.
I suppose I could add a filter to suppress output of /dev/kmsg then add a filter for /proc/kmsg.

@algernon

This comment has been minimized.

Contributor

algernon commented Apr 29, 2014

The easiest way to use /proc/kmsg would be to not use system(), in my opinion, and just hard-code what it expands to (see the --preprocess-into= command-line option), replacing the /dev/kmsg line with an appropriate /proc/kmsg one.

@lbudai lbudai added the blocked label Dec 10, 2014

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented Mar 10, 2015

This issue seems to me very similar (I bet it's the same):

https://bugs.gentoo.org/show_bug.cgi?id=533328

Reverting to syslog-ng-3.4.8 solved it for the gentoo user.

@AstroFloyd

This comment has been minimized.

AstroFloyd commented May 6, 2015

I have the same issue, also on a Gentoo system. My observations are:

  • the problems started when I upgraded from syslog-ng v3.4.8 to v3.6.2
  • the 'wrong' timestamp only occurs in lines with "kernel:", not in any other (bluetoothd, su, cron, etc.)
  • the problem shows in /var/log/messages and other logs showing the kernel messages
  • the time offset is reset to a few seconds after a reboot
  • the time offset increases in jumps whenever my laptop sleeps; the jump size is equal to the sleep time
    • in other words, after waking up, most system messages have a label that is <sleep time> later, while the "kernel:" lines continue a few seconds "later"
    • it seems that the kernel lines print a timestamp based on the 'waking time' since boot
    • since my laptop usually sleeps at night and isn't rebooted very often, the current timestamp difference between "kernel:" and other lines amounts to about 14 days
  • upgrading the kernel (from 3.16.5 to 3.18.9) didn't change anything
  • after downgrading to syslog-ng v3.4.8 and restarting it, the problem went away
    • also, /var/log/messages was flushed with ~3000 kernel-message lines at that moment
  • after upgrading again to syslog-ng v3.6.2 and restarting it, the problem immediately came back, with the same offset of ~14 days
  • "date" and "adjtimex -p" print the same time (no offset)

All this suggests that the issue is caused by a change between syslog-ng v3.4.8 and v3.6.2, perhaps in the way the timestamp is constructed for the kernel messages. If there are any more tests that might be useful, I'd be glad to help if I can.

@benkibbey

This comment has been minimized.

benkibbey commented May 6, 2015

The problem went away for me. I think it had something to do with adjtimex needing to reset or recalibrate /etc/adjtime, I dont remember exactly. But it has fixed itself for a while. If I can remember what changed I'll let you all know.

@toralf

This comment has been minimized.

toralf commented May 6, 2015

I' running 2 Gentoo boxes (1 server, 1 desktop) and do suffer from this issue on my desktop.

The issue started with 3.6.x, 3.4.x was fine. I do not hibernate my desktop, but I do s2ram often.

Here's an example :
May 1 11:20:01 t44 CROND[12761]: (root) CMD ([ ! -x /etc/cron.hourly/0anacron ] && { test -x /usr/sbin/run-crons && /usr/sbin/run-crons ; })
Apr 30 10:15:19 t44 kernel: wlp3s0: AP 08:96:d7:05:f9:2a changed bandwidth, new config is 2462 MHz, width 2 (2452/0 MHz)
Apr 30 10:15:20 t44 kernel: wlp3s0: AP 08:96:d7:05:f9:2a changed bandwidth, new config is 2462 MHz, width 1 (2462/0 MHz)
Apr 30 10:15:22 t44 kernel: wlp3s0: AP 08:96:d7:05:f9:2a changed bandwidth, new config is 2462 MHz, width 2 (2452/0 MHz)
Apr 30 10:20:20 t44 kernel: wlp3s0: AP 08:96:d7:05:f9:2a changed bandwidth, new config is 2462 MHz, width 1 (2462/0 MHz)
May 1 11:30:01 t44 CROND[14380]: (root) CMD ([ ! -x /etc/cron.hourly/0anacron ] && { test -x /usr/sbin/run-crons && /usr/sbin/run-crons ; })

@benkibbey

This comment has been minimized.

benkibbey commented May 6, 2015

Debian has an /etc/default/adjtimex file to set values to correct drift. My values for this file are:

TICK=10000
FREQ=2712500

which is set during startup with /etc/init.d/adjtimex that does:

/sbin/adjtimex -tick "$TICK" -frequency "$FREQ"

Debian ships with adjtimexconfig to help set those two variables which is used with /etc/adjtime. Then I have a cronjob to call ntpdate hourly. The drift is not more than 1/10 of a second. Hope this helps.

@AstroFloyd

This comment has been minimized.

AstroFloyd commented May 7, 2015

Here's an example of 5 consecutive lines from my /var/log/messages, written on May first:

Apr 18 02:30:38 think kernel: e1000e 0000:00:19.0 eth0: Error reading PHY register
May 1 09:32:45 think dbus[3960]: [system] Activating service name=(...)
Apr 18 02:30:39 think kernel: EXT4-fs (sda3): re-mounted. Opts: data=ordered,commit=600
May 1 09:32:45 think init[1]: Switching to runlevel: 5
Apr 18 02:30:39 think kernel: e1000e 0000:00:19.0 eth0: Error reading PHY register

All "kernel:" lines give the wrong timestamp. You can see that the time jumps by about 13 days and 7 hours, then -13d7h, +13d7h and -13d7h in the course of ~1-2s, when using syslog-ng v3.6.2. This is not a drift, as seems to be the case for @benkibbey. 13d7h is consistent with the amount of time my laptop spent 'sleeping' (s2ram) since the last boot. For v3.4.8 (keeping the rest of my system the same), such behaviour doesn't happen, which suggests timestamps for kernel messages are determined differently between the two versions. As mentioned, adjtimex and date agree on the time on my system.

@algernon

This comment has been minimized.

Contributor

algernon commented May 7, 2015

FWIW, I see the following on my laptop:

$ dmesg -T | tail
[Wed May  6 05:48:21 2015] thermal LNXTHERM:00: registered as thermal_zone0
[Wed May  6 05:48:21 2015] ACPI: Thermal Zone [TZ00] (28 C)
[Wed May  6 05:48:21 2015] thermal LNXTHERM:01: registered as thermal_zone1
[Wed May  6 05:48:21 2015] ACPI: Thermal Zone [TZ01] (30 C)

But the date is Thu May 7 12:32:32 CEST 2015. Non-kernel logs are fine, and in /var/log/messages, the above messages appear with correct date:

May  7 12:32:28 eowyn kernel: thermal LNXTHERM:00: registered as thermal_zone0
May  7 12:32:28 eowyn kernel: ACPI: Thermal Zone [TZ00] (28 C)
May  7 12:32:28 eowyn kernel: thermal LNXTHERM:01: registered as thermal_zone1
May  7 12:32:28 eowyn kernel: ACPI: Thermal Zone [TZ01] (30 C)

This is using 3.6.1, but I don't see anything obvious in the 3.6.2 changes that would change the behaviour.

@AstroFloyd, @benkibbey does your dmesg -T agree with what's written to /var/log/messages? Also, are you using system()? Do you get your kernel logs from /proc/kmsg? If using system() and a recent kernel, kernel logs will be read from /dev/log, and the time parser there may be wrong.

If using system(), does changing it to to use /proc/kmsg instead of /dev/kmsg solve the issue? (To figure out what system() expands to, you can use the system-expand tool in modules/system-source/)

What I suspect is that we're reading /dev/kmsg, which has a timestamp, and that is not updated after SUSPEND/RESUME. With /proc/kmsg, we don't have a timestamp at all, as far as I remember, so the receive time will be used, which is closer to being correct than the /dev/kmsg timestamp after resuming. Perhaps we need a way to ignore /dev/kmsg timestamps, and do so by default?

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented May 26, 2015

I've checked the difference between 3.4.8 and 3.6.2. In 3.4.8 we use only /proc/kmsg. In 3.6.2 we first try to open /dev/kmsg and if that fails we move to use /proc/kmsg. This happens on Linux machines.

Could somebody test a patch if I make one?

@toralf

This comment has been minimized.

toralf commented May 26, 2015

Why not - my Gentoo Linux here makes it easy to apply patches.
(just FWIW I switched back in the mean while my Gentoo Linux desktop to 3.4.8 b/c 3.6.2 was no longer useful, but would give the patch a try)

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented May 27, 2015

@toralf: Before we go for that (/proc/kmsg can be read safely only by one consumer) could you try something?

Now, we read messages from /dev/kmsg. It's also used by dmesg which says in its manual (Ubuntu Trusty):

The  time  source  used for the logs is not updated after system
              SUSPEND/RESUME.

The time drift is also affected by BIOS time. You may check your BIOS time with the following command (as root):

hwclock --show

To print out the system time, execute date.

The BIOS time can be synchronized with the system time with the following command (as root):

hwclock --systohc --utc

Could you synchronize your BIOS time to your system time with the above commands? It should mitigate the problem. Please, also paste the output dates into a comment.

@algernon

This comment has been minimized.

Contributor

algernon commented May 27, 2015

FWIW, see my analysis above. The straightforward solution is to not use the date parts of /dev/kmsg, but add our own.

Keep in mind that restarting syslog-ng will re-read /dev/kmsg, as far as I remember, because we don't empty the buffer, nor do we track where we left of. This may have been fixed without me noticing, but something to keep in mind.

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented May 27, 2015

@algernon I overlooked that, thanks.

ihrwein added a commit to ihrwein/syslog-ng that referenced this issue May 27, 2015

system: use keep-timestamp(yes) in case of Linux kernel log messages
Kernel messages read from /dev/kmsg may not have accurate timestamps.
The  time  source  used for the logs is not updated after system
 SUSPEND/RESUME. With this patch we ignore the original (and possibly
 inaccurate) timestamp and use the time of reception.

Fixes balabit#121

Signed-off-by: Tibor Benke <tibor.benke@balabit.com>
@ihrwein

This comment has been minimized.

Contributor

ihrwein commented May 27, 2015

@toralf Could you test the fix? You can find the commit below this comment.

ihrwein added a commit to ihrwein/syslog-ng that referenced this issue May 27, 2015

system: use keep-timestamp(no) in case of Linux kernel log messages
Kernel messages read from /dev/kmsg may not have accurate timestamps.
The  time  source  used for the logs is not updated after system
 SUSPEND/RESUME. With this patch we ignore the original (and possibly
 inaccurate) timestamp and use the time of reception.

Fixes balabit#121

Signed-off-by: Tibor Benke <tibor.benke@balabit.com>
@toralf

This comment has been minimized.

toralf commented May 27, 2015

I''' try to test it.
FWIW :

$ sudo hwclock --show; date
Wed 27 May 2015 02:49:58 PM CEST -0.734753 seconds
Wed May 27 14:49:58 CEST 2015

BTW - I didn't found an easy way to download the patch directly from the given link - do I overlooked the download button ?

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented May 27, 2015

Just a moment, I'll paste it here.

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented May 27, 2015

From 75dc0c6d4fffeac7a7b0461a6cdc6edd7a62c22c Mon Sep 17 00:00:00 2001
From: Benke Tibor <ihrwein@gmail.com>
Date: Wed, 27 May 2015 13:18:53 +0200
Subject: [PATCH] system: use keep-timestamp(no) in case of Linux kernel log
 messages

Kernel messages read from /dev/kmsg may not have accurate timestamps.
The  time  source  used for the logs is not updated after system
 SUSPEND/RESUME. With this patch we ignore the original (and possibly
 inaccurate) timestamp and use the time of reception.

Fixes #121

Signed-off-by: Tibor Benke <tibor.benke@balabit.com>
---
 modules/system-source/system-source.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/modules/system-source/system-source.c b/modules/system-source/system-source.c
index 7228dd4..9e3de12 100644
--- a/modules/system-source/system-source.c
+++ b/modules/system-source/system-source.c
@@ -76,7 +76,8 @@ system_sysblock_add_unix_dgram(GString *sysblock, const gchar *path,
 static void
 system_sysblock_add_file(GString *sysblock, const gchar *path,
                          gint follow_freq, const gchar *prg_override,
-                         const gchar *flags, const gchar *format)
+                         const gchar *flags, const gchar *format,
+                         gboolean ignore_timestamp)
 {
   g_string_append_printf(sysblock, "file(\"%s\"", path);
   if (follow_freq >= 0)
@@ -87,6 +88,8 @@ system_sysblock_add_file(GString *sysblock, const gchar *path,
     g_string_append_printf(sysblock, " flags(%s)", flags);
   if (format)
     g_string_append_printf(sysblock, " format(%s)", format);
+  if (ignore_timestamp)
+    g_string_append_printf(sysblock, " keep-timestamp(no)");
   g_string_append(sysblock, ");\n");
 }

@@ -145,9 +148,9 @@ system_sysblock_add_freebsd_klog(GString *sysblock, const gchar *release)
   if (strncmp(release, "7.", 2) == 0 ||
       strncmp(release, "8.", 2) == 0 ||
       strncmp(release, "9.0", 3) == 0)
-    system_sysblock_add_file(sysblock, "/dev/klog", 1, "kernel", "no-parse", NULL);
+    system_sysblock_add_file(sysblock, "/dev/klog", 1, "kernel", "no-parse", NULL, FALSE);
   else
-    system_sysblock_add_file(sysblock, "/dev/klog", 0, "kernel", "no-parse", NULL);
+    system_sysblock_add_file(sysblock, "/dev/klog", 0, "kernel", "no-parse", NULL, FALSE);
 }

 static gboolean
@@ -200,7 +203,7 @@ system_sysblock_add_linux_kmsg(GString *sysblock)
     }
   else
     system_sysblock_add_file(sysblock, kmsg, -1,
-                             "kernel", "kernel", format);
+                             "kernel", "kernel", format, TRUE);
 }

 static gboolean
-- 
1.9.1
@algernon

This comment has been minimized.

Contributor

algernon commented May 27, 2015

@toralf: Append .patch to the URL, and it becomes easily downloadable.

@toralf

This comment has been minimized.

toralf commented May 27, 2015

@ihrwein : at a first glance that seems to work, will observe it over the next few days

@algernon: cool hint - somebody should tell github devs to code that into a nifty bottun.

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented Jun 1, 2015

@toralf Did it work finally? If it works, please close this issue.

@toralf

This comment has been minimized.

toralf commented Jun 1, 2015

It works here fine since I applied the patch - but I cannot close this issue, I'm not the issuer, I'm just the Gentoo Linux guy, who cried loud about this.

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented Jun 8, 2015

@toralf You were the tester, thanks for that :)
@benkibbey If the patch works for you, please close the issue.

@benkibbey benkibbey closed this Jun 8, 2015

@talien talien removed the in progress label Jun 8, 2015

ihrwein added a commit to ihrwein/syslog-ng that referenced this issue Nov 9, 2015

system: use keep-timestamp(no) in case of Linux kernel log messages
Kernel messages read from /dev/kmsg may not have accurate timestamps.
The  time  source  used for the logs is not updated after system
 SUSPEND/RESUME. With this patch we ignore the original (and possibly
 inaccurate) timestamp and use the time of reception.

Fixes balabit#121

Signed-off-by: Tibor Benke <tibor.benke@balabit.com>

ihrwein added a commit to ihrwein/syslog-ng that referenced this issue Nov 9, 2015

system: use keep-timestamp(no) in case of Linux kernel log messages
Kernel messages read from /dev/kmsg may not have accurate timestamps.
The  time  source  used for the logs is not updated after system
 SUSPEND/RESUME. With this patch we ignore the original (and possibly
 inaccurate) timestamp and use the time of reception.

Fixes balabit#121

Signed-off-by: Tibor Benke <tibor.benke@balabit.com>
@toralf

This comment has been minimized.

toralf commented Jan 22, 2016

still - or again - a problem with 3.7.2 : http://www.zwiebeltoralf.de/pub/syslog-ng-3.7.2-mess.txt
at this stable hardened 64 Bit Gentoo, 3.6.4 was fine AFAICT

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented Feb 8, 2016

@toralf The upcoming 3.7.3 release will contain the patch. Thanks for reporting! (3.8 will also have it).

update: a workaround is to expand manually your system() source and use keep-timestamp(no) in case of /dev/kmsg.

@toralf

This comment has been minimized.

toralf commented Feb 8, 2016

yep - with confiure option "keep-timestamp(no)" the version 3.7.2 seems to run flawlessly.
Would that for 3.7.3 be needed too ?

@ihrwein

This comment has been minimized.

Contributor

ihrwein commented Feb 8, 2016

@toralf 3.7.3's system() source will use the keep-timestamp(no) so you won't need any manual tweaking.

hTrap added a commit to hTrap/syslog-ng that referenced this issue Aug 20, 2016

system: use keep-timestamp(no) in case of Linux kernel log messages
Kernel messages read from /dev/kmsg may not have accurate timestamps.
The  time  source  used for the logs is not updated after system
 SUSPEND/RESUME. With this patch we ignore the original (and possibly
 inaccurate) timestamp and use the time of reception.

Fixes balabit#121

Signed-off-by: Tibor Benke <tibor.benke@balabit.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment