OSD::Modify OSD Fast-Shutdown to work safely i.e. quiesce all activities and destage allocations to disk before killing the OSD #44913

Merged
3 commits merged on Mar 9, 2022

Conversation

benhanokh
Contributor

@benhanokh benhanokh commented Feb 6, 2022

OSD::Modify OSD Fast-Shutdown to work safely i.e. quiesce all activities and destage allocations to disk before killing the OSD. A minimal sketch of the resulting shutdown sequence follows the list below.

  1. keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
  2. skip service.prepare_to_stop(), which can take as much as 10 seconds
  3. skip debug options in fast-shutdown
  4. set_state(STATE_STOPPING), which will stop accepting new tasks on this OSD
  5. clear the op_shardedwq queues; this is safe since we haven't started processing them
  6. stop the timer
  7. drain osd_op_tp (no new items will be added)
  8. now we can safely call umount, which will close_db/bluefs and destage allocations to disk
  9. skip _shutdown_cache() when we are in the middle of a fast-shutdown
  10. increase the debug level on fast-shutdown
  11. add an option for bluestore_qfsck_on_mount to force a scan on mount in all tests
  12. disable fsck-on-umount when running fast-shutdown
  13. add an option to increase the debug level at fast-shutdown umount()
  14. set a time limit on fast-shutdown
  15. Bug fix: BlueStore::pool_statfs must not access db after it was removed
  16. Fix the error message for qfsck (the error was caused by PR BlueStore:NCB:Bug-Fix for recovery code with shared blobs #44563)

Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel Benhanokh gbenhano@redhat.com
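
For orientation, here is a minimal, self-contained sketch of the ordering described by the list above. Everything in it (OsdSketch, enqueue, the stubbed umount) is an illustrative stand-in, not the actual OSD.cc/BlueStore code, and the numbered comments map back to the items in the description.

#include <atomic>
#include <chrono>
#include <deque>
#include <iostream>
#include <mutex>

struct OsdSketch {
  std::atomic<bool> stopping{false};
  std::mutex queue_lock;
  std::deque<int> op_shardedwq;   // queued but not-yet-started ops

  void enqueue(int op) {
    if (stopping) return;         // (4) STATE_STOPPING: refuse new tasks
    std::lock_guard<std::mutex> l(queue_lock);
    op_shardedwq.push_back(op);
  }

  void umount() {
    // (8) close_db/bluefs and destage allocations; (12) skip fsck-on-umount
  }

  void fast_shutdown() {
    auto start = std::chrono::steady_clock::now();
    stopping = true;              // (4) stop accepting new tasks
    {
      std::lock_guard<std::mutex> l(queue_lock);
      op_shardedwq.clear();       // (5) safe: nothing here has started running
    }
    // (6)+(7) stopping the timer and draining osd_op_tp would go here
    umount();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "fast shutdown took " << ms << " ms\n";  // (14) time-limited in the real code
  }
};

int main() {
  OsdSketch osd;
  osd.enqueue(1);
  osd.fast_shutdown();
  osd.enqueue(2);  // dropped: the OSD is stopping
}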

Resolved (outdated) review threads: qa/tasks/ceph.py; src/osd/OSD.cc (4 threads)
@jdurgin
Member

jdurgin commented Feb 15, 2022

looking pretty good, the teuthology piece could be a separate PR

@benhanokh benhanokh force-pushed the safe_shutdown_v2 branch 3 times, most recently from 94d92f6 to 203e3e3 on February 15, 2022 17:20
@neha-ojha neha-ojha added the needs-quincy-backport label (backport required for quincy) on Feb 15, 2022
@benhanokh
Contributor Author

There are 2 tests showing corruption in the allocation file:
/a/yuriw-2022-02-15_22:35:42-rados-wip-yuri8-testing-2022-02-15-1214-distro-default-smithi/6685993/teuthology.log
/a/yuriw-2022-02-15_22:35:42-rados-wip-yuri8-testing-2022-02-15-1214-distro-default-smithi/6686090/teuthology.log

I need to understand what was done in those tests and why we ended up with a corrupted allocation file, but for now we should stop the merge :-(

@benhanokh
Contributor Author

It seems that the problem is an old race condition in the NCB code, unrelated to safe fast-shutdown.
The issue was detected by the extra sanity checks/scrubs added in this PR.

The problem might be a race in the way we free up space in BlueFS during compaction.

@benhanokh benhanokh force-pushed the safe_shutdown_v2 branch 3 times, most recently from f35526c to b94db00 on February 17, 2022 12:40
@benhanokh benhanokh force-pushed the safe_shutdown_v2 branch 4 times, most recently from f70b0d8 to 40213c8 on February 23, 2022 15:04
Member

@jdurgin jdurgin left a comment


extra changes look good

@neha-ojha
Member

@benhanokh please see this failure https://pulpito.ceph.com/nojha-2022-02-23_17:58:44-rados-GBH_safe_shutdown_v2_basecode_sanity_check_disabled_2-distro-basic-smithi/6702922/

2022-02-23T20:48:36.298 INFO:teuthology.orchestra.run.smithi181.stderr:2022-02-23T20:48:36.317+0000 7f3525ca8700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: stored = 1277464416, computed = 3826455388  in db/000139.sst offset 1576903 size 4054 code = 2 Rocksdb transaction:
2022-02-23T20:48:36.299 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.OBJ_0000000000000003.1DB26E06.27b.smithi14140276-34..' value size = 74)
2022-02-23T20:48:36.299 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_0000000000000277_0000000000000003.1DB26E06.27b.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.299 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.OBJ_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 98)
2022-02-23T20:48:36.299 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_00000000000002DB_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.300 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_00000000000002DF_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.300 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_00000000000002E4_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.300 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_00000000000002ED_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.300 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.OBJ_0000000000000003.1DB26E06.31f.smithi14140276-34..' value size = 82)
2022-02-23T20:48:36.301 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_0000000000000316_0000000000000003.1DB26E06.31f.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.301 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_000000000000031C_0000000000000003.1DB26E06.31f.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.301 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.OBJ_0000000000000003.1DB26E06.33c.smithi14140276-34..' value size = 90)
2022-02-23T20:48:36.301 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_000000000000032E_0000000000000003.1DB26E06.33c.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_000000000000033B_0000000000000003.1DB26E06.33c.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_000000000000033C_0000000000000003.1DB26E06.33c.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x00000000000000038BD46706000000000000'2(~' value size = 0)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x00000000000000038BD46706000000000000'2(-' value size = 0)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = O key = 0x7F80000000000000038BD46706'!smithi14140276-34!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value size = 1776)
2022-02-23T20:48:36.303 INFO:teuthology.orchestra.run.smithi181.stderr:MergeCF( prefix = T key = 0x0000000000000003 value size = 40)
2022-02-23T20:48:36.303 INFO:teuthology.orchestra.run.smithi181.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10831-g40213c87/rpm/el8/BUILD/ceph-17.0.0-10831-g40213c87/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7f3525ca8700 time 2022-02-23T20:48:36.318044+0000
2022-02-23T20:48:36.303 INFO:teuthology.orchestra.run.smithi181.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10831-g40213c87/rpm/el8/BUILD/ceph-17.0.0-10831-g40213c87/src/os/bluestore/BlueStore.cc: 12773: FAILED ceph_assert(r == 0)
2022-02-23T20:48:36.304 INFO:teuthology.orchestra.run.smithi181.stderr: ceph version 17.0.0-10831-g40213c87 (40213c87bdc043acdadc03f47f4d21c98bee71d8) quincy (dev)
2022-02-23T20:48:36.304 INFO:teuthology.orchestra.run.smithi181.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f3533a8d294]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2834b5) [0x7f3533a8d4b5]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5fb) [0x5611a0b42f9b]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 4: (BlueStore::_kv_sync_thread()+0x2139) [0x5611a0b8a429]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 5: (BlueStore::KVSyncThread::entry()+0x11) [0x5611a0bba4f1]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 6: /lib64/libpthread.so.0(+0x814a) [0x7f3532eb914a]
2022-02-23T20:48:36.306 INFO:teuthology.orchestra.run.smithi181.stderr: 7: clone()

benhanokh and others added 3 commits March 7, 2022 17:40
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop(), which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING), which will stop accepting new tasks on this OSD
    5) clear the op_shardedwq queues; this is safe since we haven't started processing them
    6) stop the timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount, which will close_db/bluefs and destage allocations to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase the debug level on fast-shutdown
    11) add an option for bluestore_qfsck_on_mount to force a scan on mount in all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase the debug level at fast-shutdown umount()
    14) set a time limit on fast-shutdown

    15) Bug fix: BlueStore::pool_statfs must not access db after it was removed
    16) Fix the error message for qfsck (the error was caused by PR ceph#44563)

    17) make the shutdown timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
Fixes a problem with sync compaction (_rewrite_log_and_layout_sync): log_seq was not updated after compacting the log.

This caused _replay to stop right after the first transaction.

... 20 bluefs _replay 0x0:  op_dir_create sharding
... 20 bluefs _replay 0x0:  op_dir_link  sharding/def to 21
... 20 bluefs _replay 0x0:  op_jump_seq 1025
... 10 bluefs _read h 0x555557c46400 0x1000~1000 from file(ino 1 size 0x1000 mtime 0.000000 allocated 410000 alloc_commit 410000 extents [1:0x1540000~410000])
... 20 bluefs _read left 0xff000 len 0x1000
... 20 bluefs _read got 4096
... 10 bluefs _replay 0x1000: stop: seq 1025 != expected 1026

This is a by-product of the bluefs fine-grained locks refactor.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 2f8e370)

Conflicts:
	src/test/objectstore/test_bluefs.cc
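
To make the failure mode concrete, here is a toy, self-contained illustration of the replay check that tripped; Txn and replay() are invented stand-ins, not BlueFS internals.

#include <cstdint>
#include <iostream>
#include <vector>

struct Txn { uint64_t seq; };

// Replays until a sequence mismatch, mirroring the
// "stop: seq 1025 != expected 1026" line in the log excerpt above.
size_t replay(const std::vector<Txn>& log, uint64_t expected) {
  size_t applied = 0;
  for (const auto& t : log) {
    if (t.seq != expected) {
      std::cout << "stop: seq " << t.seq << " != expected " << expected << "\n";
      break;
    }
    ++applied;
    ++expected;
  }
  return applied;
}

int main() {
  // After op_jump_seq 1025, replay expects 1026 next. If compaction rewrites
  // the log without carrying log_seq forward, the next transaction is still
  // stamped with the old seq, so replay stops immediately.
  std::vector<Txn> log = {{1025}};
  std::cout << "applied " << replay(log, 1026) << " transactions\n";
}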
Close the window in which the allocator state and the bluefs state could be
captured out of sync.

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
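
The last commit's message is terse, so here is a toy sketch of the kind of window it refers to: if the allocator state and the bluefs state are snapshotted at different moments, a concurrent transfer of space between them makes the pair inconsistent, while capturing both inside one critical section closes the window. StoreSketch and its members are invented names, not BlueStore code.

#include <cassert>
#include <mutex>
#include <utility>

struct StoreSketch {
  std::mutex lock;
  long allocator_free = 100;  // the allocator's view of free space
  long bluefs_used    = 0;    // bluefs' view of what it has taken

  // A concurrent path that moves space from the allocator to bluefs,
  // e.g. during log compaction.
  void give_to_bluefs(long n) {
    std::lock_guard<std::mutex> g(lock);
    allocator_free -= n;
    bluefs_used    += n;
  }

  // Both views are read inside one critical section, so the snapshot always
  // satisfies allocator_free + bluefs_used == total. Reading them in two
  // separate sections would leave a window for give_to_bluefs() to run in
  // between, producing a pair that is not in sync.
  std::pair<long, long> capture() {
    std::lock_guard<std::mutex> g(lock);
    return {allocator_free, bluefs_used};
  }
};

int main() {
  StoreSketch s;
  s.give_to_bluefs(30);
  auto snap = s.capture();
  assert(snap.first + snap.second == 100);  // consistent snapshot
}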
Contributor

@aclamk aclamk left a comment


Looks good.

_close_db_and_around();
if (cct->_conf->bluestore_fsck_on_umount) {
// disable fsck on fast-shutdown
Contributor


"skip" is better word then "disable" here

@@ -4448,6 +4504,9 @@ int OSD::shutdown()
hb_front_server_messenger->shutdown();
hb_back_server_messenger->shutdown();

utime_t duration = ceph_clock_now() - start_time_func;
dout(0) <<"Slow Shutdown duration:" << duration << " seconds" << dendl;
Contributor


Proposal: How about calling it "Full" or "Orderly" instead of "Slow"?

// vstart overwrites osd_fast_shutdown value in the conf file -> force the value here!
//cct->_conf->osd_fast_shutdown = true;

dout(0) << "Fast Shutdown: - cct->_conf->osd_fast_shutdown = "
Contributor


Need to improve output, show something like:
"Shutdown: Fast, null-fm=true"

@@ -4258,27 +4258,44 @@ PerfCounters* OSD::create_recoverystate_perf()

int OSD::shutdown()
{
// vstart overwrites osd_fast_shutdown value in the conf file -> force the value here!
Contributor


Suspicious comments. Is there something missing in this PR?
If this is for testing purposes, then it should be a separate commit.

set_state(STATE_STOPPING);

// Debugging
if (cct->_conf.get_val<bool>("osd_debug_shutdown")) {
// Disabled debugging during fast-shutdown
Contributor


We do not disable any debugging here....

@ljflores
Contributor

ljflores commented Mar 9, 2022

http://pulpito.front.sepia.ceph.com/yuriw-2022-03-07_22:06:10-rados-wip-yuri-testing-2022-03-07-0958-distro-default-smithi/

Failures, unrelated:
https://tracker.ceph.com/issues/50280
https://tracker.ceph.com/issues/53680
https://tracker.ceph.com/issues/51076

Details:
1. cephadm: RuntimeError: uid/gid not found - Ceph
2. ERROR:tasks.rook:'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds - Ceph - Orchestrator
3. "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. - Ceph - RADOS

@ljflores
Contributor

ljflores commented Mar 9, 2022

jenkins test windows

@benhanokh
Contributor Author

The PR was backported to quincy: #45342
