task/internal/syslog: Add capability to ignore kernel failures #1666

Open
wants to merge 1 commit into
base: main
2 changes: 1 addition & 1 deletion teuthology/run.py
@@ -224,7 +224,7 @@ def get_initial_tasks(lock, config, machine_type):
{'internal.archive': None},
{'internal.coredump': None},
{'internal.sudo': None},
-{'internal.syslog': None},
+{'internal.syslog': config.get('syslog', {})},
])
init_tasks.append({'internal.timer': None})
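As a rough illustration of the `run.py` change, the sketch below shows how a top-level `syslog` section of a job config would reach the `internal.syslog` task. The `job_config` contents and the helper name `syslog_task_entry` are hypothetical and only mirror the one changed line; they are not part of teuthology.

```python
# Hypothetical job config, as it might look once the job YAML is parsed.
job_config = {
    'syslog': {
        'ignorelist': ['lockdep is turned off', 'CRON'],
    },
}


def syslog_task_entry(config):
    """Build the init-task entry the same way the changed line does."""
    # Falls back to an empty dict, so the task always receives a mapping
    # and can call config.get('ignorelist', []) safely.
    return {'internal.syslog': config.get('syslog', {})}


with_section = syslog_task_entry(job_config)
without_section = syslog_task_entry({})
```

With no `syslog` section the task config is `{}`, so under this PR nothing would be excluded unless the suite supplies its own `ignorelist`.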

57 changes: 11 additions & 46 deletions teuthology/task/internal/syslog.py
@@ -102,56 +102,21 @@ def syslog(ctx, config):
# flush the file fully. oh well.

log.info('Checking logs for errors...')
+exclude_errors = config.get('ignorelist', [])
+log.info('Exclude error list: {0}'.format(exclude_errors))
for rem in cluster.remotes.keys():
log.debug('Checking %s', rem.name)
-stdout = rem.sh(
-[
+args = [
'egrep', '--binary-files=text',
-'\\bBUG\\b|\\bINFO\\b|\\bDEADLOCK\\b',
+'\\bBUG\\b|\\bINFO\\b|\\bDEADLOCK\\b|\\bOops\\b|\\bWARNING\\b|\\bKASAN\\b',
Member

@jtlayton are we missing anything here or LGTY?

run.Raw(f'{archive_dir}/syslog/kern.log'),
-run.Raw('|'),
-'grep', '-v', 'task .* blocked for more than .* seconds',
-run.Raw('|'),
-'grep', '-v', 'lockdep is turned off',
-run.Raw('|'),
-'grep', '-v', 'trying to register non-static key',
-run.Raw('|'),
-'grep', '-v', 'DEBUG: fsize', # xfs_fsr
-run.Raw('|'),
-'grep', '-v', 'CRON', # ignore cron noise
-run.Raw('|'),
-'grep', '-v', 'BUG: bad unlock balance detected', # #6097
-run.Raw('|'),
-'grep', '-v', 'inconsistent lock state', # FIXME see #2523
-run.Raw('|'),
-'grep', '-v', '*** DEADLOCK ***', # part of lockdep output
-run.Raw('|'),
-'grep', '-v',
-# FIXME see #2590 and #147
-'INFO: possible irq lock inversion dependency detected',
-run.Raw('|'),
-'grep', '-v',
-'INFO: NMI handler (perf_event_nmi_handler) took too long to run', # noqa
-run.Raw('|'),
-'grep', '-v', 'INFO: recovery required on readonly',
-run.Raw('|'),
-'grep', '-v', 'ceph-create-keys: INFO',
-run.Raw('|'),
-'grep', '-v', 'INFO:ceph-create-keys',
-run.Raw('|'),
-'grep', '-v', 'Loaded datasource DataSourceOpenStack',
-run.Raw('|'),
-'grep', '-v', 'container-storage-setup: INFO: Volume group backing root filesystem could not be determined', # noqa
-run.Raw('|'),
-'egrep', '-v', '\\bsalt-master\\b|\\bsalt-minion\\b|\\bsalt-api\\b',
-run.Raw('|'),
-'grep', '-v', 'ceph-crash',
-run.Raw('|'),
-'egrep', '-v', '\\btcmu-runner\\b.*\\bINFO\\b',
-run.Raw('|'),
-'head', '-n', '1',
Member

Which of these need to be moved to the qa suite? Please post a ceph PR. It needs to be backported to octopus/pacific before this can be merged.

Author

I think most of this is stale. We will have to run all relevant suites and find out.

Contributor

We could do that but I'm not sure it is worth the effort. If the objective is to make the exclude list configurable, is there a problem with leaving these in (meaning that these would always be on the exclude list)?

Author (@kotreshhr, Aug 6, 2021)

@batrick What's the best filter to exercise only kernel code? Which of the following covers all of it?

  1. `-s fs --filter mount/kclient`: this listed more than 600 jobs
  2. `-s fs --filter kernel`: this listed around 54 jobs

Member

> We could do that but I'm not sure it is worth the effort. If the objective is to make the exclude list configurable, is there a problem with leaving these in (meaning that these would always be on the exclude list)?

In the past, this has been confusing to newcomers. There are all sorts of magic defaults in teuthology (e.g. the log ignorelist). It is best to move these to qa/ in ceph.git when possible.

Member

> @batrick What's the best filter to exercise only kernel code. Which of the following covers all ?
>
> 1. `-s fs --filter mount/kclient`  This listed more than 600 jobs

This will have the best coverage. Add `--subset x/16`.

-],
-)
+]
+for exclude in exclude_errors:
+args.extend([run.Raw('|'), 'egrep', '-v', exclude])
+args.extend([
+run.Raw('|'), 'head', '-n', '1',
+])
+stdout = rem.sh(args)
if stdout != '':
log.error('Error in syslog on %s: %s', rem.name, stdout)
set_status(ctx.summary, 'fail')
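To make the effect of the new pipeline easier to reason about, here is a minimal pure-Python sketch of the same filtering: match the error pattern, drop lines that match any `ignorelist` entry, and report only the first survivor (the `head -n 1` step). It uses Python's `re` as a stand-in for `egrep`, which is an assumption and not identical in every corner. The function name and the sample log lines are hypothetical.

```python
import re

# Error pattern taken from the new egrep call in the diff.
ERROR_PATTERN = r'\bBUG\b|\bINFO\b|\bDEADLOCK\b|\bOops\b|\bWARNING\b|\bKASAN\b'


def first_unignored_error(lines, exclude_errors):
    """Return the first line that matches ERROR_PATTERN and none of the
    exclude patterns, or '' if every match is excluded."""
    excludes = [re.compile(pattern) for pattern in exclude_errors]
    for line in lines:
        if not re.search(ERROR_PATTERN, line):
            continue  # not an error line at all
        if any(rx.search(line) for rx in excludes):
            continue  # matched an ignorelist entry
        return line  # equivalent of `head -n 1`
    return ''


# Hypothetical kern.log content, for illustration only.
SAMPLE_LOG = [
    'host CRON[123]: INFO: job started',
    'host kernel: usb 1-1: new device found',
    'host kernel: BUG: unable to handle page fault',
]
NMI_LINE = ('host kernel: INFO: NMI handler (perf_event_nmi_handler) '
            'took too long to run')
# As an extended regex, unescaped parentheses form a group, not literals.
UNESCAPED = 'INFO: NMI handler (perf_event_nmi_handler) took too long to run'
ESCAPED = r'INFO: NMI handler \(perf_event_nmi_handler\) took too long to run'

no_excludes = first_unignored_error(SAMPLE_LOG, [])
cron_excluded = first_unignored_error(SAMPLE_LOG, ['CRON'])
nmi_unescaped = first_unignored_error([NMI_LINE], [UNESCAPED])
nmi_escaped = first_unignored_error([NMI_LINE], [ESCAPED])
```

One thing this surfaces: entries that were previously passed to plain `grep -v` become extended regular expressions once they go through `egrep -v`, so a pattern with literal parentheses, such as the NMI handler one, would likely need escaping when it is moved into an `ignorelist`.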