OSD::Modify OSD Fast-Shutdown to work safely i.e. quiesce all activities and destage allocations to disk before killing the OSD #44913

Merged
3 commits merged on Mar 9, 2022

Conversation

benhanokh
Contributor

@benhanokh benhanokh commented Feb 6, 2022

OSD::Modify OSD Fast-Shutdown to work safely i.e. quiesce all activities and destage allocations to disk before killing the OSD. A minimal sketch of the resulting shutdown sequence follows the list below.

  1. keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
  2. skip service.prepare_to_stop(), which can take as much as 10 seconds
  3. skip debug options in fast-shutdown
  4. set_state(STATE_STOPPING), which will stop accepting new tasks on this OSD
  5. clear the op_shardedwq queues; this is safe since we haven't started processing them
  6. stop the timer
  7. drain osd_op_tp (no new items will be added)
  8. now we can safely call umount, which will close_db/bluefs and destage allocations to disk
  9. skip _shutdown_cache() when we are in the middle of a fast-shutdown
  10. increase the debug level on fast-shutdown
  11. add an option for bluestore_qfsck_on_mount to force a scan on mount in all tests
  12. disable fsck-on-umount when running fast-shutdown
  13. add an option to increase the debug level at fast-shutdown umount()
  14. set a time limit on fast-shutdown
  15. Bug fix: BlueStore::pool_statfs must not access db after it was removed
  16. Fix the error message for qfsck (the error was caused by PR BlueStore:NCB:Bug-Fix for recovery code with shared blobs #44563)

Fixes: https://tracker.ceph.com/issues/53266

Signed-off-by: Gabriel Benhanokh gbenhano@redhat.com
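
For orientation, here is a minimal, self-contained sketch of the ordering described by the list above. Everything in it (OsdSketch, enqueue, the stubbed umount) is an illustrative stand-in, not the actual OSD.cc/BlueStore code, and the numbered comments map back to the items in the description.

#include <atomic>
#include <chrono>
#include <deque>
#include <iostream>
#include <mutex>

struct OsdSketch {
  std::atomic<bool> stopping{false};
  std::mutex queue_lock;
  std::deque<int> op_shardedwq;   // queued but not-yet-started ops

  void enqueue(int op) {
    if (stopping) return;         // (4) STATE_STOPPING: refuse new tasks
    std::lock_guard<std::mutex> l(queue_lock);
    op_shardedwq.push_back(op);
  }

  void umount() {
    // (8) close_db/bluefs and destage allocations; (12) skip fsck-on-umount
  }

  void fast_shutdown() {
    auto start = std::chrono::steady_clock::now();
    stopping = true;              // (4) stop accepting new tasks
    {
      std::lock_guard<std::mutex> l(queue_lock);
      op_shardedwq.clear();       // (5) safe: nothing here has started running
    }
    // (6)+(7) stopping the timer and draining osd_op_tp would go here
    umount();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "fast shutdown took " << ms << " ms\n";  // (14) time-limited in the real code
  }
};

int main() {
  OsdSketch osd;
  osd.enqueue(1);
  osd.fast_shutdown();
  osd.enqueue(2);  // dropped: the OSD is stopping
}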

Resolved (outdated) review threads: qa/tasks/ceph.py; src/osd/OSD.cc (4 threads)
@jdurgin
Member

jdurgin commented Feb 15, 2022

looking pretty good, the teuthology piece could be a separate PR

@benhanokh benhanokh force-pushed the safe_shutdown_v2 branch 3 times, most recently from 94d92f6 to 203e3e3 on February 15, 2022 17:20
@neha-ojha neha-ojha added the needs-quincy-backport label (backport required for quincy) on Feb 15, 2022
@benhanokh
Contributor Author

There are 2 tests showing corruption in the allocation file:
/a/yuriw-2022-02-15_22:35:42-rados-wip-yuri8-testing-2022-02-15-1214-distro-default-smithi/6685993/teuthology.log
/a/yuriw-2022-02-15_22:35:42-rados-wip-yuri8-testing-2022-02-15-1214-distro-default-smithi/6686090/teuthology.log

I need to understand what was done in those tests and why we ended up with a corrupted allocation file, but for now we should stop the merge :-(

@benhanokh
Contributor Author

It seems that the problem is an old race condition in the NCB code, unrelated to safe fast-shutdown.
The issue was detected by the extra sanity checks/scrubs added in this PR.

The problem might be a race in the way we free up space in BlueFS during compaction.

@benhanokh benhanokh force-pushed the safe_shutdown_v2 branch 3 times, most recently from f35526c to b94db00 on February 17, 2022 12:40
@benhanokh benhanokh force-pushed the safe_shutdown_v2 branch 4 times, most recently from f70b0d8 to 40213c8 on February 23, 2022 15:04
Member

@jdurgin jdurgin left a comment


extra changes look good

@neha-ojha
Member

@benhanokh please see this failure https://pulpito.ceph.com/nojha-2022-02-23_17:58:44-rados-GBH_safe_shutdown_v2_basecode_sanity_check_disabled_2-distro-basic-smithi/6702922/

2022-02-23T20:48:36.298 INFO:teuthology.orchestra.run.smithi181.stderr:2022-02-23T20:48:36.317+0000 7f3525ca8700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: stored = 1277464416, computed = 3826455388  in db/000139.sst offset 1576903 size 4054 code = 2 Rocksdb transaction:
2022-02-23T20:48:36.299 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.OBJ_0000000000000003.1DB26E06.27b.smithi14140276-34..' value size = 74)
2022-02-23T20:48:36.299 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_0000000000000277_0000000000000003.1DB26E06.27b.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.299 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.OBJ_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 98)
2022-02-23T20:48:36.299 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_00000000000002DB_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.300 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_00000000000002DF_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.300 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_00000000000002E4_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.300 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_00000000000002ED_0000000000000003.1DB26E06.2ef.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.300 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.OBJ_0000000000000003.1DB26E06.31f.smithi14140276-34..' value size = 82)
2022-02-23T20:48:36.301 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_0000000000000316_0000000000000003.1DB26E06.31f.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.301 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_000000000000031C_0000000000000003.1DB26E06.31f.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.301 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.OBJ_0000000000000003.1DB26E06.33c.smithi14140276-34..' value size = 90)
2022-02-23T20:48:36.301 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_000000000000032E_0000000000000003.1DB26E06.33c.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_000000000000033B_0000000000000003.1DB26E06.33c.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x0000000000000000C03716250000000000000402'.SNA_3_000000000000033C_0000000000000003.1DB26E06.33c.smithi14140276-34..' value size = 70)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x00000000000000038BD46706000000000000'2(~' value size = 0)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = p key = 0x00000000000000038BD46706000000000000'2(-' value size = 0)
2022-02-23T20:48:36.302 INFO:teuthology.orchestra.run.smithi181.stderr:PutCF( prefix = O key = 0x7F80000000000000038BD46706'!smithi14140276-34!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value size = 1776)
2022-02-23T20:48:36.303 INFO:teuthology.orchestra.run.smithi181.stderr:MergeCF( prefix = T key = 0x0000000000000003 value size = 40)
2022-02-23T20:48:36.303 INFO:teuthology.orchestra.run.smithi181.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10831-g40213c87/rpm/el8/BUILD/ceph-17.0.0-10831-g40213c87/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7f3525ca8700 time 2022-02-23T20:48:36.318044+0000
2022-02-23T20:48:36.303 INFO:teuthology.orchestra.run.smithi181.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10831-g40213c87/rpm/el8/BUILD/ceph-17.0.0-10831-g40213c87/src/os/bluestore/BlueStore.cc: 12773: FAILED ceph_assert(r == 0)
2022-02-23T20:48:36.304 INFO:teuthology.orchestra.run.smithi181.stderr: ceph version 17.0.0-10831-g40213c87 (40213c87bdc043acdadc03f47f4d21c98bee71d8) quincy (dev)
2022-02-23T20:48:36.304 INFO:teuthology.orchestra.run.smithi181.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f3533a8d294]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2834b5) [0x7f3533a8d4b5]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5fb) [0x5611a0b42f9b]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 4: (BlueStore::_kv_sync_thread()+0x2139) [0x5611a0b8a429]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 5: (BlueStore::KVSyncThread::entry()+0x11) [0x5611a0bba4f1]
2022-02-23T20:48:36.305 INFO:teuthology.orchestra.run.smithi181.stderr: 6: /lib64/libpthread.so.0(+0x814a) [0x7f3532eb914a]
2022-02-23T20:48:36.306 INFO:teuthology.orchestra.run.smithi181.stderr: 7: clone()

benhanokh and others added 3 commits March 7, 2022 17:40
quiesce all activities and destage allocations to disk before killing the OSD

    1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
    2) skip service.prepare_to_stop(), which can take as much as 10 seconds
    3) skip debug options in fast-shutdown
    4) set_state(STATE_STOPPING), which will stop accepting new tasks on this OSD
    5) clear the op_shardedwq queues; this is safe since we haven't started processing them
    6) stop the timer
    7) drain osd_op_tp (no new items will be added)
    8) now we can safely call umount, which will close_db/bluefs and destage allocations to disk
    9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
    10) increase the debug level on fast-shutdown
    11) add an option for bluestore_qfsck_on_mount to force a scan on mount in all tests
    12) disable fsck-on-umount when running fast-shutdown
    13) add an option to increase the debug level at fast-shutdown umount()
    14) set a time limit on fast-shutdown

    15) Bug fix: BlueStore::pool_statfs must not access db after it was removed
    16) Fix the error message for qfsck (the error was caused by PR ceph#44563)

    17) make the shutdown timeout configurable

Fixes: https://tracker.ceph.com/issues/53266
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
Fixes a problem with sync compaction (_rewrite_log_and_layout_sync): log_seq was not updated after compacting the log.

This caused _replay to stop right after the first transaction.

... 20 bluefs _replay 0x0:  op_dir_create sharding
... 20 bluefs _replay 0x0:  op_dir_link  sharding/def to 21
... 20 bluefs _replay 0x0:  op_jump_seq 1025
... 10 bluefs _read h 0x555557c46400 0x1000~1000 from file(ino 1 size 0x1000 mtime 0.000000 allocated 410000 alloc_commit 410000 extents [1:0x1540000~410000])
... 20 bluefs _read left 0xff000 len 0x1000
... 20 bluefs _read got 4096
... 10 bluefs _replay 0x1000: stop: seq 1025 != expected 1026

This is a by-product of the bluefs fine-grained locks refactor.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 2f8e370)

Conflicts:
	src/test/objectstore/test_bluefs.cc
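
To make the failure mode concrete, here is a toy, self-contained illustration of the replay check that tripped; Txn and replay() are invented stand-ins, not BlueFS internals.

#include <cstdint>
#include <iostream>
#include <vector>

struct Txn { uint64_t seq; };

// Replays until a sequence mismatch, mirroring the
// "stop: seq 1025 != expected 1026" line in the log excerpt above.
size_t replay(const std::vector<Txn>& log, uint64_t expected) {
  size_t applied = 0;
  for (const auto& t : log) {
    if (t.seq != expected) {
      std::cout << "stop: seq " << t.seq << " != expected " << expected << "\n";
      break;
    }
    ++applied;
    ++expected;
  }
  return applied;
}

int main() {
  // After op_jump_seq 1025, replay expects 1026 next. If compaction rewrites
  // the log without carrying log_seq forward, the next transaction is still
  // stamped with the old seq, so replay stops immediately.
  std::vector<Txn> log = {{1025}};
  std::cout << "applied " << replay(log, 1026) << " transactions\n";
}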
Close the window in which the allocator state and the bluefs state could be
captured out of sync.

Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
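
The last commit's message is terse, so here is a toy sketch of the kind of window it refers to: if the allocator state and the bluefs state are snapshotted at different moments, a concurrent transfer of space between them makes the pair inconsistent, while capturing both inside one critical section closes the window. StoreSketch and its members are invented names, not BlueStore code.

#include <cassert>
#include <mutex>
#include <utility>

struct StoreSketch {
  std::mutex lock;
  long allocator_free = 100;  // the allocator's view of free space
  long bluefs_used    = 0;    // bluefs' view of what it has taken

  // A concurrent path that moves space from the allocator to bluefs,
  // e.g. during log compaction.
  void give_to_bluefs(long n) {
    std::lock_guard<std::mutex> g(lock);
    allocator_free -= n;
    bluefs_used    += n;
  }

  // Both views are read inside one critical section, so the snapshot always
  // satisfies allocator_free + bluefs_used == total. Reading them in two
  // separate sections would leave a window for give_to_bluefs() to run in
  // between, producing a pair that is not in sync.
  std::pair<long, long> capture() {
    std::lock_guard<std::mutex> g(lock);
    return {allocator_free, bluefs_used};
  }
};

int main() {
  StoreSketch s;
  s.give_to_bluefs(30);
  auto snap = s.capture();
  assert(snap.first + snap.second == 100);  // consistent snapshot
}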
Contributor

@aclamk aclamk left a comment


Looks good.

_close_db_and_around();
if (cct->_conf->bluestore_fsck_on_umount) {
// disable fsck on fast-shutdown
Contributor


"skip" is better word then "disable" here

@@ -4448,6 +4504,9 @@ int OSD::shutdown()
hb_front_server_messenger->shutdown();
hb_back_server_messenger->shutdown();

utime_t duration = ceph_clock_now() - start_time_func;
dout(0) <<"Slow Shutdown duration:" << duration << " seconds" << dendl;
Contributor


Proposal: How about calling it "Full" or "Orderly" instead of "Slow"?

// vstart overwrites osd_fast_shutdown value in the conf file -> force the value here!
//cct->_conf->osd_fast_shutdown = true;

dout(0) << "Fast Shutdown: - cct->_conf->osd_fast_shutdown = "
Contributor


Need to improve output, show something like:
"Shutdown: Fast, null-fm=true"

@@ -4258,27 +4258,44 @@ PerfCounters* OSD::create_recoverystate_perf()

int OSD::shutdown()
{
// vstart overwrites osd_fast_shutdown value in the conf file -> force the value here!
Contributor


Suspicious comments. Is there something missing in this PR?
If this is for testing purposes, then it should be a separate commit.

set_state(STATE_STOPPING);

// Debugging
if (cct->_conf.get_val<bool>("osd_debug_shutdown")) {
// Disabled debugging during fast-shutdown
Contributor


We do not disable any debugging here....

@ljflores
Contributor

ljflores commented Mar 9, 2022

http://pulpito.front.sepia.ceph.com/yuriw-2022-03-07_22:06:10-rados-wip-yuri-testing-2022-03-07-0958-distro-default-smithi/

Failures, unrelated:
https://tracker.ceph.com/issues/50280
https://tracker.ceph.com/issues/53680
https://tracker.ceph.com/issues/51076

Details:
1. cephadm: RuntimeError: uid/gid not found - Ceph
2. ERROR:tasks.rook:'waiting for service removal' reached maximum tries (90) after waiting for 900 seconds - Ceph - Orchestrator
3. "wait_for_recovery: failed before timeout expired" during thrashosd test with EC backend. - Ceph - RADOS

@ljflores
Contributor

ljflores commented Mar 9, 2022

jenkins test windows

@benhanokh
Contributor Author

The PR was backported to quincy: #45342
