
task/internal/syslog: Add capability to ignore kernel failures #1666

Open · wants to merge 1 commit into base: main

Conversation

@kotreshhr commented Aug 5, 2021

Adds the capability to ignore kernel failures per job. This is done by adding a 'syslog' dict to the job config dictionary, which holds an 'ignorelist' of kernel failures.

Also removes old kernel failures from the exclude list.

Fixes: https://tracker.ceph.com/issues/50150
Signed-off-by: Kotresh HR khiremat@redhat.com
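
For illustration, a minimal sketch of how a job's config might be consulted under this scheme; the helper name and plumbing here are assumptions, not the PR's actual code:

```python
# Hedged sketch only: 'syslog' and 'ignorelist' are the keys described
# above, but this helper and its surroundings are hypothetical.
def get_kernel_ignorelist(job_config):
    """Return user-supplied kernel-failure patterns, if any."""
    syslog_conf = job_config.get('syslog') or {}
    return syslog_conf.get('ignorelist', [])

# A job would then opt in with something like (YAML shown as a comment):
#   syslog:
#     ignorelist:
#       - 'WARNING: CPU: .* at fs/ceph/mds_client\.c'
```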

Resolved review threads (outdated): teuthology/run.py, teuthology/task/internal/syslog.py
```python
run.Raw('|'),
'egrep', '-v', '\\btcmu-runner\\b.*\\bINFO\\b',
run.Raw('|'),
'head', '-n', '1',
```
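
(From context, this appears to be the tail of the kern.log scan: the inverted egrep drops known-benign tcmu-runner INFO lines from the keyword matches, and `head -n 1` keeps only the first surviving line as the failure indicator.)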
Member:

Which of these need to be moved to the qa suite? Please post a ceph PR. It needs to be backported to octopus/pacific before this can be merged.

Author:

I think most of this is stale. We will have to run all relevant suites and find out.

Contributor:

We could do that, but I'm not sure it is worth the effort. If the objective is to make the exclude list configurable, is there a problem with leaving these in (meaning that these would always be on the exclude list)?

@kotreshhr (Author) commented Aug 6, 2021

@batrick What's the best filter to exercise only the kernel code? Which of the following covers it all?

1. `-s fs --filter mount/kclient` (this listed more than 600 jobs)
2. `-s fs --filter kernel` (this listed around 54 jobs)

Member:

> We could do that, but I'm not sure it is worth the effort. If the objective is to make the exclude list configurable, is there a problem with leaving these in (meaning that these would always be on the exclude list)?

In the past, this has been confusing to newcomers. There are all sorts of magic defaults in teuthology (e.g. the log ignorelist). Best to move these to the ceph.git qa/ when possible.

Member:

> @batrick What's the best filter to exercise only the kernel code? Which of the following covers it all?
>
> 1. `-s fs --filter mount/kclient` (this listed more than 600 jobs)

This will have the best coverage. Add `--subset x/16`.

@idryomov (Contributor) commented Aug 6, 2021

Since this started out as "begin grepping kernel logs for kclient warnings/failures", is there a suspicion that the existing greps don't work (or perhaps work but not on all distros)? Have you attempted to trigger a WARNING or BUG and observe the test failing?

@batrick (Member) commented Aug 6, 2021

> Since this started out as "begin grepping kernel logs for kclient warnings/failures", is there a suspicion that the existing greps don't work (or perhaps work but not on all distros)? Have you attempted to trigger a WARNING or BUG and observe the test failing?

Honestly, we didn't know about this syslog grep at the start. I was concerned there was no detection of kernel faults at all, since I would occasionally see warnings in a syslog that should have failed tests.

So I guess we're both concerned that this grep wasn't catching genuine faults/warnings, that it's not easily configurable, and that it's not easily discovered.

@idryomov (Contributor) commented Aug 6, 2021

I remember it breaking a few years ago -- something to do with traditional syslog vs systemd journal interaction but it was fixed then. It might be the case that it broke again, which is why I asked. The issue is not with the grep invocation but rather with what is being grepped (i.e. does that kern.log file actually contain any kernel log messages at the time grep is invoked, etc).

Removing outdated exclude items is obviously lower priority than resurrecting the core functionality in case it is broken.

@kotreshhr changed the title from "task/internal/syslog: Remove old kernel failures from exclude list" to "task/internal/syslog: Add capability to ignore kernel failures @kotreshhr" on Aug 9, 2021
@kotreshhr changed the title from "task/internal/syslog: Add capability to ignore kernel failures @kotreshhr" to "task/internal/syslog: Add capability to ignore kernel failures" on Aug 9, 2021
@batrick (Member) commented Jan 10, 2022

@kotreshhr what's the status on this PR?

@kotreshhr (Author):

> @kotreshhr what's the status on this PR?

@batrick I think it's good to be merged. Here is the run that was triggered and Jeff's comments on it. Sorry, I should have discussed it here; it was on the cephfs-team mailing list. Copying it here for reference.


> Hi Jeff,
>
> I was working on removing the old kernel failures which are masked by the present teuthology code. I ran teuthology on my PR [1] with the filter `-s fs --filter mount/kclient` and found the following unique failures in syslog. I wanted to check with you whether these are real failures that need attention.
>
> The complete run can be found at [2].

Many thanks for doing this. Better testing with teuthology is definitely an area where I could use a lot of help. FWIW, the pulpito links below only work if you're on the sepia lab VPN.

The kernel is built from commit 7c1f4b5e3842, so this is missing some of the more recent patches in the testing branch.

> '2021-08-14T12:28:01.665935+00:00 smithi071 kernel: WARNING: CPU: 3 PID: 39974 at fs/ceph/mds_client.c:4495 check_session_state+0x55/0x60 [ceph]' in syslog
> http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6327912

I suspect this is fixed with a recent patch from Xiubo (1a45b99820e7). Do we know whether this test does a 'umount -f'?

> '2021-08-14T13:05:50.330033+00:00 smithi041 kernel: INFO: task fsstress:47378 blocked for more than 122 seconds.' in syslog
> http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6327931

Doesn't look familiar. It looks like the client is just waiting on an OSD reply (probably for a write, since this is in sync()).

> '2021-08-14T14:06:15.476389+00:00 smithi062 kernel: INFO: task kcompactd0:59 blocked for more than 122 seconds.' in syslog
> http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6327954

Also doesn't look familiar. The task is hung waiting for dirty pages to be written out. We don't know to which filesystem (but it seems likely that it was cephfs). Possibly related to #2 above.

> '2021-08-14T13:37:56.676192+00:00 smithi052 kernel: WARNING: CPU: 3 PID: 101 at crypto/testmgr.c:5653 alg_test+0x245/0x450' in syslog
> http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6327978

Looks unrelated to ceph at all. This one happened at boot time. Probably some bug in the crypto code. This is a -rc4 based kernel after all, so there are bugs in there...

> '2021-08-14T17:11:12.918237+00:00 smithi027 kernel: [ 969.167486] INFO: task ffsb:47395 blocked for more than 120 seconds.' in syslog
> http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6328144

Not much to go on here. There are no stack traces, AFAICT, and it's unclear why. Maybe we need to do something on Ubuntu distros to turn them on?

> '2021-08-15T06:15:50.378319+00:00 smithi019 kernel: INFO: task iozone:67133 blocked for more than 120 seconds.' in syslog
> http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/6328732

Hung (briefly) while waiting on ceph_fsync. It looks like this one might have eventually recovered, as we don't see any more messages about blocked tasks even after another 120s. This may just be due to slow OSD responses, and could also be related to #2 and/or #3.

[1] #1666
[2] http://pulpito.front.sepia.ceph.com/khiremat-2021-08-09_13:23:53-fs-wip-khiremat-test-kernel-exclude-failures-distro-basic-smithi/


@batrick (Member) left a comment:

Otherwise LGTM; please post a teuthology run with the requested ceph PR for qa/cephfs/begin.yaml.

Resolved review thread (outdated): teuthology/suite/placeholder.py
```diff
 'egrep', '--binary-files=text',
-'\\bBUG\\b|\\bINFO\\b|\\bDEADLOCK\\b',
+'\\bBUG\\b|\\bINFO\\b|\\bDEADLOCK\\b|\\bOops\\b|\\bWARNING\\b|\\bKASAN\\b',
```
Member:

@jtlayton are we missing anything here or LGTY?
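
As a quick standalone sanity check of the expanded keyword set (a sketch for illustration, not part of the PR; Python's `re` uses the same `\b` word boundaries as egrep):

```python
import re

# The expanded keyword alternation from the diff above.
PATTERN = re.compile(r'\bBUG\b|\bINFO\b|\bDEADLOCK\b|\bOops\b|\bWARNING\b|\bKASAN\b')

samples = [
    'kernel: WARNING: CPU: 3 PID: 39974 at fs/ceph/mds_client.c:4495 ...',
    'kernel: INFO: task fsstress:47378 blocked for more than 122 seconds.',
    'kernel: usb 1-1: new high-speed USB device number 2',  # benign, no match
]
for line in samples:
    print(bool(PATTERN.search(line)), line)
# Expected: True, True, False
```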

@batrick mentioned this pull request on Jan 13, 2022.
@batrick (Member) commented Jan 13, 2022

Teuthology run triggered: http://pulpito.front.sepia.ceph.com/khiremat-2022-01-13_16:30:18-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

I killed this, @kotreshhr. It has 1000+ jobs. Please use a subset.

@kotreshhr (Author):

> Teuthology run triggered: http://pulpito.front.sepia.ceph.com/khiremat-2022-01-13_16:30:18-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/
>
> I killed this, @kotreshhr. It has 1000+ jobs. Please use a subset.

@batrick I am not sure I understand the --subset option correctly. I did give 3/16. Any suggestions?

@batrick (Member) commented Jan 14, 2022

Here is what I usually run, and it gets about 300 jobs:

```
teuthology-suite --machine-type smithi --email pdonnell@redhat.com -p 95 --suite fs --force-priority --subset $((RANDOM % 32))/32 --ceph wip-branch
```
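
(For reference, `--subset x/n` schedules only the x-th 1/n slice of the suite's job matrix, so `$((RANDOM % 32))/32` picks one random thirty-second of the fs suite on each run.)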

@kotreshhr (Author):

> Here is what I usually run, and it gets about 300 jobs:
>
> teuthology-suite --machine-type smithi --email pdonnell@redhat.com -p 95 --suite fs --force-priority --subset $((RANDOM % 32))/32 --ceph wip-branch

Thanks @batrick, here is the teuthology run:
http://pulpito.front.sepia.ceph.com/khiremat-2022-01-16_11:16:23-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

@batrick (Member) commented Jan 22, 2022

> > Here is what I usually run, and it gets about 300 jobs:
> > teuthology-suite --machine-type smithi --email pdonnell@redhat.com -p 95 --suite fs --force-priority --subset $((RANDOM % 32))/32 --ceph wip-branch
>
> Thanks @batrick, here is the teuthology run: http://pulpito.front.sepia.ceph.com/khiremat-2022-01-16_11:16:23-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

There are some... weird failures in that run. Please do a --rerun of it to see if they're transient.

@kotreshhr (Author):

> > Here is what I usually run, and it gets about 300 jobs:
> > teuthology-suite --machine-type smithi --email pdonnell@redhat.com -p 95 --suite fs --force-priority --subset $((RANDOM % 32))/32 --ceph wip-branch
> >
> > Thanks @batrick, here is the teuthology run: http://pulpito.front.sepia.ceph.com/khiremat-2022-01-16_11:16:23-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/
>
> There are some... weird failures in that run. Please do a --rerun of it to see if they're transient.

Rerun QA link:
http://pulpito.front.sepia.ceph.com/khiremat-2022-01-22_07:42:15-fs-wip-khiremat-44574-syslog-warn-ignorelist-distro-basic-smithi/

@batrick previously approved these changes on Jan 24, 2022.
@batrick (Member) commented Jan 24, 2022

Good to merge IMO from the CephFS side. @idryomov, do you want to have another look? Is an rbd run needed?

@idryomov (Contributor) left a comment:

I'd like to see confirmation that the whole grep invocation is actually working. Could you please build a kernel that emits a custom WARNING or ERROR on, e.g., a CephFS mount and run this against that kernel (making sure that all supported distros are covered by that run, because there used to be annoying differences in how dmesg was persisted between distros)? And then do another run with that custom log message added to the ignorelist, again on all distros. Or maybe you already did something like that?

Resolved review threads (outdated): teuthology/suite/placeholder.py, teuthology/task/internal/syslog.py
@batrick (Member) commented Jan 26, 2022

> I'd like to see confirmation that the whole grep invocation is actually working. Could you please build a kernel that emits a custom WARNING or ERROR

Would it be crazy to add a patch to the upstream kernel where a mount option triggers a warning/error, so we can continuously verify this works?

@idryomov (Contributor):

Definitely crazy IMO. But we could find a way to trigger something with one of the "bad" keywords in it on vanilla kernels. For example, take all OSDs out against some in-flight I/O -- that should trigger "INFO: task ... blocked for more than %ld seconds".

@idryomov (Contributor):

... or we could carry the kind of patch you envision in our testing branch.

@jtlayton (Contributor):

> Would it be crazy to add a patch to the upstream kernel where a mount option triggers a warning/error, so we can continuously verify this works?

I wouldn't use a mount option. This sounds more like a job for debugfs, or maybe consider the fault-injection framework documented here:

https://www.kernel.org/doc/html/latest/fault-injection/fault-injection.html

@batrick (Member) commented Jan 27, 2022

Hmm, I think we can just use network namespaces to network-partition the kernel mount (suspend_netns in mount.py), wait a reasonable amount of time for complaints like the osd timeout, then exit. We just need to grep for the expected failure message (or fail) and make sure the ignorelist works by ignoring that message. We don't currently grep for the expected failure message; perhaps that should go in this PR.
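
A rough sketch of that flow, for illustration only: suspend_netns is the mount.py helper named above, while resume_netns and the rest of the scaffolding are assumptions:

```python
import time

def provoke_blocked_task_warning(kmount, wait_secs=180):
    # Partition the kernel client from the cluster (suspend_netns is the
    # helper mentioned above; the rest of this function is illustrative).
    kmount.suspend_netns()
    try:
        # Wait long enough for the kernel to log
        # 'INFO: task ... blocked for more than N seconds.'
        time.sleep(wait_secs)
    finally:
        # Heal the partition before teardown (assumed counterpart helper).
        kmount.resume_netns()
    # The syslog task should then flag the message, or pass once the
    # message is added to the job's syslog ignorelist.
```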

@djgalloway changed the base branch from master to main on June 1, 2022.
@vshankar (Contributor):

I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested validation before merging. Can we agree on a way forward to validate this?

@batrick (Member) commented Mar 15, 2024

> I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested validation before merging. Can we agree on a way forward to validate this?

To start, @kotreshhr needs to rebase. Then I suggest trying Ilya's suggestion of stopping OSDs while some application is writing a large file. This shouldn't be hard to test...

@kotreshhr (Author):

> > I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested validation before merging. Can we agree on a way forward to validate this?
>
> To start, @kotreshhr needs to rebase. Then I suggest trying Ilya's suggestion of stopping OSDs while some application is writing a large file. This shouldn't be hard to test...

@batrick Does that mean I should write a teuthology test that kills OSDs while writing a large file?

@vshankar (Contributor):

> > > I brought this up in the last cephfs standup while discussing https://tracker.ceph.com/issues/64471 with @batrick and recalled this change. So, this was close to getting merged, but @idryomov suggested validation before merging. Can we agree on a way forward to validate this?
> >
> > To start, @kotreshhr needs to rebase. Then I suggest trying Ilya's suggestion of stopping OSDs while some application is writing a large file. This shouldn't be hard to test...
>
> @batrick Does that mean I should write a teuthology test that kills OSDs while writing a large file?

As discussed during standup, that's a good start to validate the filter.
