Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/crash: show timestamps in more user-friendly format #36671

Closed
wants to merge 1 commit into from

Conversation

smithfarm
Copy link
Contributor

Fixes: https://tracker.ceph.com/issues/43897
Signed-off-by: Nathan Cutler ncutler@suse.com

@smithfarm
Copy link
Contributor Author

In order to test this, I would need to somehow precipitate a crash. Any ideas how to do that?

@tchaikov
Copy link
Contributor

@smithfarm

[global]
debug asok assert abort = true
$ ceph tell mon.a abort

@trociny
Copy link
Contributor

trociny commented Aug 16, 2020

or just kill -11 $pid

@@ -80,7 +80,7 @@ def _refresh_health_checks(self):
'%s crashed on host %s at %s' % (
crash.get('entity_name', 'unidentified daemon'),
crash.get('utsname_hostname', '(unknown)'),
crash.get('timestamp', 'unknown time'))
self.user_friendly_timestamp(crash.get('timestamp', 'unknown time')))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: A little cleaner to me looks to just pass crash.get('timestamp') to user_friendly_timestamp and make user_friendly_timestamp return "unknown time" if not timestr.

return timestr
datetime_obj = self.time_from_string(timestr)
return "{}+00:00".format(
datetime_obj.isoformat(' ', timespec='seconds')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of manually formatting (adding timezone) you could just properly set the timezone in datetime_obj:

datetime_obj.replace(tzinfo=datetime.timezone.utc).isoformat(' ', timespec='seconds')

Or probably time_from_string could be fixed to set timezone to utc when creating the datetime object.

@@ -114,6 +114,16 @@ def time_from_string(self, timestr):
except ValueError:
return datetime.datetime.strptime(timestr, OLD_DATEFMT)

def user_friendly_timestamp(self, timestr):
# Converts timestamp as stored in crash metadata into a more
# user-friendly format. See https://tracker.ceph.com/issues/43897
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't see much value in this comment here, especially in adding the tracker url. The purpose of the function is self evident.

@smithfarm
Copy link
Contributor Author

or just kill -11 $pid

I tried killing a daemon, but when the daemon is running inside a container, it has PID 1 and kill (apparently) becomes a noop.

@smithfarm
Copy link
Contributor Author

smithfarm commented Aug 16, 2020

$ ceph tell mon.a abort

Thanks, @tchaikov! Have you tried it when the daemon is running in a container? I'm getting:

master:~ # ceph config set global "debug_asok_assert_abort" true
master:~ # ceph tell mon.node1 abort
Error ENXIO: mon unavailable

I guess this doesn't work for the same reason that kill doesn't work (PID 1 is immune to kill)?

@smithfarm
Copy link
Contributor Author

Test failures: 190 - unittest_rbd_mirror (Child aborted) 

@smithfarm
Copy link
Contributor Author

jenkins test make check

@tchaikov
Copy link
Contributor

$ ceph tell mon.a abort

Thanks, @tchaikov! Have you tried it when the daemon is running in a container? I'm getting:

no. i haven't tried this for killing a daemon in container.

master:~ # ceph config set global "debug_asok_assert_abort" true
master:~ # ceph tell mon.node1 abort
Error ENXIO: mon unavailable

this error is returned by MonClient when it failed to talk to monitor after tried "mon_client_directed_command_retry" times. can you try something like

ceph -s

?


I guess this doesn't work for the same reason that `kill` doesn't work (PID 1 is immune to `kill`)?

yeah, i am afraid so. this command is handled right in a daemon by calling ceph_abort(). but the daemon also sends SIGABRT to itself using abort(). so i think you need to launch the daemon manually using, for instance,sh -c.

@smithfarm
Copy link
Contributor Author

master:~ # ceph tell mon.node1 abort
Error ENXIO: mon unavailable

@tchaikov Nevermind, this just indicates that the command succeeded:

master:~ # ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mon.node1 crashed on host node1 at 2020-08-16T10:17:24.741958Z

@tchaikov
Copy link
Contributor

tchaikov commented Aug 16, 2020

i see, monitor failed to reply ceph cli before killing itself.

return datetime.datetime.strptime(timestr, OLD_DATEFMT)
datetime_obj = datetime.datetime.strptime(timestr, OLD_DATEFMT)
# add time zone to datetime object before returning it
return datetime_obj.replace(tzinfo=datetime.timezone.utc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this breaks the self_test().

2020-08-18T15:07:02.661 INFO:teuthology.orchestra.run.smithi115.stderr:2020-08-18T15:07:02.654+0000 7fe288575700  1 -- 172.21.15.115:0/1247744208 --> [v2:172.21.15.115:6808/33840,v1:172.21.15.115:6809/338
40] -- mgr_command(tid 0: {"prefix": "mgr self-test module", "module": "crash", "target": ["mon-mgr", ""]}) v1 -- 0x7fe280063560 con 0x7fe258077b50
2020-08-18T15:07:02.662 INFO:teuthology.orchestra.run.smithi115.stderr:2020-08-18T15:07:02.656+0000 7fe26e7fc700  1 -- 172.21.15.115:0/1247744208 <== mgr.4499 v2:172.21.15.115:6808/33840 1 ==== mgr_comman
d_reply(tid 0: -1 Test failed: Remote method threw exception: Traceback (most recent call last):
2020-08-18T15:07:02.662 INFO:teuthology.orchestra.run.smithi115.stderr:  File "/usr/share/ceph/mgr/crash/module.py", line 353, in self_test
2020-08-18T15:07:02.662 INFO:teuthology.orchestra.run.smithi115.stderr:    raise RuntimeError('time_from_string() failed')
2020-08-18T15:07:02.662 INFO:teuthology.orchestra.run.smithi115.stderr:RuntimeError: time_from_string() failed
2020-08-18T15:07:02.662 INFO:teuthology.orchestra.run.smithi115.stderr:) v1 ==== 248+0+0 (secure 0 0 0) 0x7fe27c00d410 con 0x7fe258077b50
2020-08-18T15:07:02.663 INFO:teuthology.orchestra.run.smithi115.stderr:2020-08-18T15:07:02.656+0000 7fe26e7fc700 20 mgrc handle_command_reply tid 0 r -1
2020-08-18T15:07:02.663 INFO:teuthology.orchestra.run.smithi115.stderr:Error EPERM: Test failed: Remote method threw exception: Traceback (most recent call last):
2020-08-18T15:07:02.663 INFO:teuthology.orchestra.run.smithi115.stderr:  File "/usr/share/ceph/mgr/crash/module.py", line 353, in self_test
2020-08-18T15:07:02.663 INFO:teuthology.orchestra.run.smithi115.stderr:    raise RuntimeError('time_from_string() failed')
2020-08-18T15:07:02.663 INFO:teuthology.orchestra.run.smithi115.stderr:RuntimeError: time_from_string() failed
2020-08-18T15:07:02.663 INFO:teuthology.orchestra.run.smithi115.stderr:
2020-08-18T15:07:02.664 INFO:tasks.ceph.mgr.x.smithi115.stderr:2020-08-18T15:07:02.656+0000 7f4ae3639700 -1 Remote method threw exception: Traceback (most recent call last):
2020-08-18T15:07:02.664 INFO:tasks.ceph.mgr.x.smithi115.stderr:  File "/usr/share/ceph/mgr/crash/module.py", line 353, in self_test
2020-08-18T15:07:02.664 INFO:tasks.ceph.mgr.x.smithi115.stderr:    raise RuntimeError('time_from_string() failed')
2020-08-18T15:07:02.664 INFO:tasks.ceph.mgr.x.smithi115.stderr:RuntimeError: time_from_string() failed
2020-08-18T15:07:02.665 INFO:tasks.ceph.mgr.x.smithi115.stderr:
2020-08-18T15:07:02.665 INFO:tasks.ceph.mgr.x.smithi115.stderr:2020-08-18T15:07:02.656+0000 7f4ae3639700 -1 mgr.server reply reply (1) Operation not permitted Test failed: Remote method threw exception: T
raceback (most recent call last):
2020-08-18T15:07:02.666 INFO:tasks.ceph.mgr.x.smithi115.stderr:  File "/usr/share/ceph/mgr/crash/module.py", line 353, in self_test
2020-08-18T15:07:02.666 INFO:tasks.ceph.mgr.x.smithi115.stderr:    raise RuntimeError('time_from_string() failed')
2020-08-18T15:07:02.668 INFO:tasks.ceph.mgr.x.smithi115.stderr:RuntimeError: time_from_string() failed

see https://pulpito.ceph.com/kchai-2020-08-18_13:00:33-rados-wip-kefu-testing-2020-08-18-1658-distro-basic-smithi/5356537/

seems it also breaks test in qa/tasks/mgr/test_insights.py,

2020-08-18T14:50:49.983 INFO:tasks.cephfs_test_runner:test_crash_history (tasks.mgr.test_insights.TestInsights) ... FAIL
2020-08-18T14:50:49.984 INFO:tasks.cephfs_test_runner:
2020-08-18T14:50:49.985 INFO:tasks.cephfs_test_runner:======================================================================
2020-08-18T14:50:49.985 INFO:tasks.cephfs_test_runner:FAIL: test_crash_history (tasks.mgr.test_insights.TestInsights)
2020-08-18T14:50:49.985 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2020-08-18T14:50:49.986 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2020-08-18T14:50:49.986 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kefu-testing-2020-08-18-1658/qa/tasks/mgr/test_insights.py", line 197, in test_crash_history
2020-08-18T14:50:49.986 INFO:tasks.cephfs_test_runner:    self._add_crash(1)
2020-08-18T14:50:49.987 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kefu-testing-2020-08-18-1658/qa/tasks/mgr/test_insights.py", line 45, in _add_crash
2020-08-18T14:50:49.987 INFO:tasks.cephfs_test_runner:    self.assertEqual(0, ret)
2020-08-18T14:50:49.987 INFO:tasks.cephfs_test_runner:AssertionError: 0 != 22

see https://pulpito.ceph.com/kchai-2020-08-18_13:00:33-rados-wip-kefu-testing-2020-08-18-1658-distro-basic-smithi/5356503/

@stale
Copy link

stale bot commented Oct 18, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Oct 18, 2020
@smithfarm
Copy link
Contributor Author

unstale

@stale stale bot removed the stale label Oct 22, 2020
@stale
Copy link

stale bot commented Dec 25, 2020

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@stale stale bot added the stale label Dec 25, 2020
@smithfarm smithfarm closed this Jan 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants