Commit
Merge pull request #24639 from dmick/wip-crashdump-mimic-backport
mimic: mgr: crashdump feature backport
Showing 52 changed files with 958 additions and 57 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
Crash plugin | ||
============ | ||
The crash plugin collects information about daemon crashdumps and stores | ||
it in the Ceph cluster for later analysis. | ||
|
||
Daemon crashdumps are dumped in /var/lib/ceph/crash by default; this can | ||
be configured with the option 'crash dir'. Crash directories are named by | ||
time and date and a randomly-generated UUID, and contain a metadata file | ||
'meta' and a recent log file. The "crash_id" recorded in 'meta' matches | ||
the directory name. This plugin allows the metadata about those dumps to | ||
be persisted in the monitors' storage. | ||
|
||
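As a rough illustration of the naming scheme described above, the following Python sketch builds a crash ID from a UTC timestamp and a UUID and wraps it in a minimal metadata blob. The format string and the `_` join mirror what this commit's test module posts; the helper names `make_crash_id` and `make_meta` are hypothetical, and the `meta` files written by real daemons presumably carry more fields than the two shown here.

```python
import datetime
import json
import uuid

# Timestamp layout borrowed from the test module in this commit.
DATEFMT = '%Y-%m-%d %H:%M:%S.%f'


def make_crash_id(when, crash_uuid):
    """Return '<date>_<time>Z_<uuid>' for a UTC datetime (hypothetical helper)."""
    timestamp = when.strftime(DATEFMT) + 'Z'
    return '_'.join((timestamp, str(crash_uuid))).replace(' ', '_')


def make_meta(when, crash_uuid):
    """Build the minimal metadata dict that the test posts (hypothetical helper)."""
    return {
        'crash_id': make_crash_id(when, crash_uuid),
        'timestamp': when.strftime(DATEFMT) + 'Z',
    }


if __name__ == '__main__':
    now = datetime.datetime(2018, 10, 1, 12, 30, 0, 123456)
    print(json.dumps(make_meta(now, uuid.uuid4()), indent=2))
```

The JSON printed by this sketch has the shape that the test feeds to `ceph crash post -i -` on stdin.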
Enabling | ||
-------- | ||
|
||
The *crash* module is enabled with:: | ||
|
||
  ceph mgr module enable crash | ||
|
||
Commands | ||
-------- | ||
:: | ||
|
||
  ceph crash post -i <metafile> | ||
|
||
Save a crash dump. The metadata file is a JSON blob stored in the crash | ||
dir as ``meta``. As usual, the ceph command can be invoked with ``-i -``, | ||
and will read from stdin. | ||
|
||
:: | ||
|
||
  ceph crash rm <crashid> | ||
|
||
Remove a specific crash dump. | ||
|
||
:: | ||
|
||
  ceph crash ls | ||
|
||
List the timestamp/uuid crashids for all saved crash info. | ||
|
||
:: | ||
|
||
  ceph crash stat | ||
|
||
Show a summary of saved crash info grouped by age. | ||
|
||
:: | ||
|
||
  ceph crash info <crashid> | ||
|
||
Show all details of a saved crash. | ||
|
||
:: | ||
|
||
  ceph crash prune <keep> | ||
|
||
Remove saved crashes older than 'keep' days. <keep> must be an integer. | ||
|
||
|
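To make the `prune` semantics concrete, here is a small Python sketch that selects which crash IDs would be dropped for a given `keep` value. It is not the module's implementation: the function name `crashes_to_prune`, the explicit `now` argument, and the assumption that age is measured from the `timestamp` field are illustrative only.

```python
import datetime

# Timestamp layout used by this commit's test, including the trailing 'Z'.
DATEFMT = '%Y-%m-%d %H:%M:%S.%fZ'


def crashes_to_prune(crashes, keep, now):
    """Return crash IDs whose timestamp is more than `keep` days before `now`.

    `crashes` maps crash_id -> metadata dict with a 'timestamp' string.
    Hypothetical helper; the real module may compute the cutoff differently.
    """
    cutoff = now - datetime.timedelta(days=int(keep))
    return sorted(
        crash_id
        for crash_id, meta in crashes.items()
        if datetime.datetime.strptime(meta['timestamp'], DATEFMT) < cutoff
    )
```

With crashes aged 0, 1, 3, 4 and 8 days and `keep` set to 5, only the 8-day-old entry is selected, which is the outcome `test_prune` in this commit checks for.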
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -38,3 +38,4 @@ sensible. | |
Telemetry plugin <telemetry> | ||
Telegraf plugin <telegraf> | ||
Iostat plugin <iostat> | ||
Crash plugin <crash> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
|
||
tasks: | ||
  - install: | ||
  - ceph: | ||
      # tests may leave mgrs broken, so don't try and call into them | ||
      # to invoke e.g. pg dump during teardown. | ||
      wait-for-scrub: false | ||
      log-whitelist: | ||
        - overall HEALTH_ | ||
        - \(MGR_DOWN\) | ||
        - \(PG_ | ||
        - replacing it with standby | ||
        - No standby daemons available | ||
  - cephfs_test_runner: | ||
      modules: | ||
        - tasks.mgr.test_crash |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
roles: | ||
- [client.0, mon.a, mgr.x, osd.0, osd.1, osd.2] | ||
|
||
tasks: | ||
- install: | ||
- ceph: | ||
    log-whitelist: | ||
      - Reduced data availability | ||
      - OSD_.*DOWN | ||
- workunit: | ||
    clients: | ||
      client.0: | ||
        - rados/test_crash.sh | ||
- ceph.restart: [osd.*] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
|
||
|
||
from mgr_test_case import MgrTestCase | ||
|
||
import json | ||
import logging | ||
import datetime | ||
|
||
log = logging.getLogger(__name__) | ||
UUID = 'd5775432-0742-44a3-a435-45095e32e6b1' | ||
DATEFMT = '%Y-%m-%d %H:%M:%S.%f' | ||
|
||
|
||
class TestCrash(MgrTestCase): | ||
|
||
    def setUp(self): | ||
        self.setup_mgrs() | ||
        self._load_module('crash') | ||
|
||
        # Whip up some crash data | ||
        self.crashes = dict() | ||
        now = datetime.datetime.utcnow() | ||
|
||
        # post one crash per age (in days); the last one posted is the oldest | ||
        for i in (0, 1, 3, 4, 8): | ||
            timestamp = now - datetime.timedelta(days=i) | ||
            timestamp = timestamp.strftime(DATEFMT) + 'Z' | ||
            crash_id = '_'.join((timestamp, UUID)).replace(' ', '_') | ||
            self.crashes[crash_id] = { | ||
                'crash_id': crash_id, 'timestamp': timestamp, | ||
            } | ||
|
||
            self.assertEqual( | ||
                0, | ||
                self.mgr_cluster.mon_manager.raw_cluster_cmd_result( | ||
                    'crash', 'post', '-i', '-', | ||
                    stdin=json.dumps(self.crashes[crash_id]), | ||
                ) | ||
            ) | ||
|
||
        retstr = self.mgr_cluster.mon_manager.raw_cluster_cmd( | ||
            'crash', 'ls', | ||
        ) | ||
        log.warning("setUp: crash ls returns %s" % retstr) | ||
|
||
        self.oldest_crashid = crash_id | ||
|
||
    def tearDown(self): | ||
        for crash in self.crashes.values(): | ||
            self.mgr_cluster.mon_manager.raw_cluster_cmd_result( | ||
                'crash', 'rm', crash['crash_id'] | ||
            ) | ||
|
||
    def test_info(self): | ||
        for crash in self.crashes.values(): | ||
            log.warning('test_info: crash %s' % crash) | ||
            retstr = self.mgr_cluster.mon_manager.raw_cluster_cmd( | ||
                'crash', 'ls' | ||
            ) | ||
            log.warning('ls output: %s' % retstr) | ||
            retstr = self.mgr_cluster.mon_manager.raw_cluster_cmd( | ||
                'crash', 'info', crash['crash_id'], | ||
            ) | ||
            log.warning('crash info output: %s' % retstr) | ||
            crashinfo = json.loads(retstr) | ||
            self.assertIn('crash_id', crashinfo) | ||
            self.assertIn('timestamp', crashinfo) | ||
|
||
    def test_ls(self): | ||
        retstr = self.mgr_cluster.mon_manager.raw_cluster_cmd( | ||
            'crash', 'ls', | ||
        ) | ||
        for crash in self.crashes.values(): | ||
            self.assertIn(crash['crash_id'], retstr) | ||
|
||
    def test_rm(self): | ||
        crashid = list(self.crashes.keys())[0] | ||
        self.assertEqual( | ||
            0, | ||
            self.mgr_cluster.mon_manager.raw_cluster_cmd_result( | ||
                'crash', 'rm', crashid, | ||
            ) | ||
        ) | ||
|
||
        retstr = self.mgr_cluster.mon_manager.raw_cluster_cmd( | ||
            'crash', 'ls', | ||
        ) | ||
        self.assertNotIn(crashid, retstr) | ||
|
||
    def test_stat(self): | ||
        retstr = self.mgr_cluster.mon_manager.raw_cluster_cmd( | ||
            'crash', 'stat', | ||
        ) | ||
        self.assertIn('5 crashes recorded', retstr) | ||
        self.assertIn('4 older than 1 days old:', retstr) | ||
        self.assertIn('3 older than 3 days old:', retstr) | ||
        self.assertIn('1 older than 7 days old:', retstr) | ||
|
||
    def test_prune(self): | ||
        # keep 5 days: only the 8-day-old crash should be removed | ||
        self.assertEqual( | ||
            0, | ||
            self.mgr_cluster.mon_manager.raw_cluster_cmd_result( | ||
                'crash', 'prune', '5' | ||
            ) | ||
        ) | ||
        retstr = self.mgr_cluster.mon_manager.raw_cluster_cmd( | ||
            'crash', 'ls', | ||
        ) | ||
        self.assertNotIn(self.oldest_crashid, retstr) |
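The counts asserted in `test_stat` follow from the five ages posted in `setUp` (0, 1, 3, 4 and 8 days). The sketch below reproduces that arithmetic in plain Python. The bin boundaries of 1, 3 and 7 days are taken from the assertion strings; the function name `count_older_than` and the greater-or-equal comparison (standing in for the small delay between posting a crash and querying `crash stat`) are assumptions, not the module's code.

```python
def count_older_than(ages_in_days, bins=(1, 3, 7)):
    """Map each bin (in days) to the number of crashes at least that old.

    Hypothetical helper: in the real test a crash posted exactly N days
    'ago' is slightly older than N days by the time `crash stat` runs, so
    a >= comparison on whole-day ages models the same outcome.
    """
    return {b: sum(1 for age in ages_in_days if age >= b) for b in bins}


ages = (0, 1, 3, 4, 8)  # the day offsets used in setUp
counts = count_older_than(ages)
print('%d crashes recorded' % len(ages))
for b in sorted(counts):
    print('%d older than %d days old' % (counts[b], b))
```

Ages 1, 3, 4 and 8 fall in the 1-day bin, ages 3, 4 and 8 in the 3-day bin, and only age 8 in the 7-day bin, giving the 4, 3 and 1 that the test expects.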
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
#!/bin/sh | ||
|
||
set -x | ||
|
||
# run on a single-node three-OSD cluster | ||
|
||
sudo killall -ABRT ceph-osd | ||
sleep 5 | ||
|
||
# kill caused coredumps; find them and delete them, carefully, so as | ||
# not to disturb other coredumps, or else teuthology will see them | ||
# and assume test failure. sudos are because the core files are | ||
# root/600 | ||
for f in $(find $TESTDIR/archive/coredump -type f); do | ||
    gdb_output=$(echo "quit" | sudo gdb /usr/bin/ceph-osd $f) | ||
    if expr match "$gdb_output" ".*generated.*ceph-osd.*" && \ | ||
       ( \ | ||
|
||
           expr match "$gdb_output" ".*terminated.*signal 6.*" || \ | ||
           expr match "$gdb_output" ".*terminated.*signal SIGABRT.*" \ | ||
       ) | ||
    then | ||
        sudo rm $f | ||
    fi | ||
done | ||
|
||
# let daemon find crashdumps on startup | ||
sudo systemctl restart ceph-crash | ||
sleep 30 | ||
|
||
# must be 3 crashdumps registered and moved to crash/posted | ||
[ $(ceph crash ls | wc -l) = 3 ] || exit 1 | ||
[ $(sudo find /var/lib/ceph/crash/posted/ -name meta | wc -l) = 3 ] || exit 1 |
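The final check in the script counts `meta` files under the `posted` directory. For readers who would rather run the same check from Python, a minimal sketch follows. `count_meta_files` is a hypothetical helper, the default path is simply the one used by the script above, and unlike `find -name meta` it only counts entries that `os.walk` reports as files.

```python
import os


def count_meta_files(root='/var/lib/ceph/crash/posted'):
    """Count files named 'meta' anywhere below `root` (hypothetical helper)."""
    return sum(
        1
        for _dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
        if name == 'meta'
    )
```

On the three-OSD cluster the script targets, a result of 3 would correspond to the shell test passing. Reading the directory may need the same elevated privileges the script obtains with `sudo`.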