
octopus: rocksdb: do not use non-zero recycle_log_file_num setting #45040

Merged
Merged 1 commit into ceph:octopus on Jun 23, 2022

Conversation

ifed01
Contributor

@ifed01 ifed01 commented Feb 15, 2022

This forces RocksDB to use the less reliable kTolerateCorruptedTailRecords
mode for WAL recovery.

Fixes: https://tracker.ceph.com/issues/54288
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
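The constraint behind this change can be sketched as follows. This is an illustrative model, not Ceph's or RocksDB's actual code (`DBOptions`, `effective_recovery_mode`, and the enum here are simplified stand-ins): recycled WAL files keep stale record payloads past the logical end of the log, so a non-zero recycle_log_file_num forces the weaker tail-record-tolerant recovery mode described in this PR.

```cpp
#include <cstdint>

// Simplified stand-in for RocksDB's WAL recovery modes.
enum class WALRecoveryMode {
  kTolerateCorruptedTailRecords,  // weaker guarantee: tail corruption is skipped
  kPointInTimeRecovery,           // safer: replay stops at first inconsistency
};

// Simplified stand-in for the relevant RocksDB options.
struct DBOptions {
  uint64_t recycle_log_file_num = 0;
  WALRecoveryMode wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
};

// Sketch of the interaction this PR avoids: with recycled WAL files,
// old records beyond the logical end of log are indistinguishable from
// corruption, so the safer point-in-time mode cannot be used.
WALRecoveryMode effective_recovery_mode(const DBOptions& opts) {
  if (opts.recycle_log_file_num > 0)
    return WALRecoveryMode::kTolerateCorruptedTailRecords;
  return opts.wal_recovery_mode;
}
```

Setting recycle_log_file_num back to 0 (as this PR does) therefore lets RocksDB keep its safer default recovery mode.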

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions github-actions bot added this to the octopus milestone Feb 15, 2022
@neha-ojha neha-ojha changed the title rocksdb: do not use non-zero recycle_log_file_num setting octopus: rocksdb: do not use non-zero recycle_log_file_num setting Feb 15, 2022
@neha-ojha
Member

@ifed01 should try to get this into 15.2.16? though it might be a bit late

@ifed01
Contributor Author

ifed01 commented Feb 15, 2022

@ifed01 should try to get this into 15.2.16? though it might be a bit late

I don't think that's required; there is a workaround: one can adjust the setting manually if needed.
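For reference, the manual workaround mentioned above could look roughly like the following. This is a sketch only: `bluestore_rocksdb_options` is the Ceph option that carries RocksDB settings, but the full option string varies by release, so the elided part should be taken from your release's defaults rather than copied from here.

```ini
[osd]
# Sketch only: append recycle_log_file_num=0 to your release's existing
# default bluestore_rocksdb_options string (elided here as "...").
bluestore_rocksdb_options = ...,recycle_log_file_num=0
```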

Member

@neha-ojha neha-ojha left a comment


makes sense to me

@ljflores
Contributor

ljflores commented May 12, 2022

@ifed01 does this failure look familiar to you? It came up twice in teuthology runs, and I don't see it tracked anywhere. http://pulpito.front.sepia.ceph.com/yuriw-2022-05-09_21:49:19-rados-wip-yuri6-testing-2022-05-09-0734-octopus-distro-default-smithi/6829109/

2022-05-10T00:31:01.602 INFO:tasks.workunit.client.0.smithi032.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:382: TEST_bluestore2:  ceph osd down 0
2022-05-10T00:31:02.106 INFO:tasks.workunit.client.0.smithi032.stderr:osd.0 is already down.
2022-05-10T00:31:02.113 INFO:tasks.workunit.client.0.smithi032.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:384: TEST_bluestore2:  ceph-bluestore-tool --path td/osd-bluefs-volume-ops/0 --devs-source td/osd-bluefs-volume-ops/0/block.db --dev-target td/osd-bluefs-volume-ops/0/block --command bluefs-bdev-migrate
2022-05-10T00:31:02.123 INFO:tasks.workunit.client.0.smithi032.stdout:inferring bluefs devices from bluestore path
2022-05-10T00:31:02.184 INFO:tasks.workunit.client.0.smithi032.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.16-101-g1fb36ae9/rpm/el8/BUILD/ceph-15.2.16-101-g1fb36ae9/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_mount_for_bluefs()' thread 7f4fab54c240 time 2022-05-10T00:31:02.184091+0000
2022-05-10T00:31:02.184 INFO:tasks.workunit.client.0.smithi032.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.16-101-g1fb36ae9/rpm/el8/BUILD/ceph-15.2.16-101-g1fb36ae9/src/os/bluestore/BlueStore.cc: 6894: FAILED ceph_assert(r == 0)
2022-05-10T00:31:02.185 INFO:tasks.workunit.client.0.smithi032.stderr:2022-05-10T00:31:02.183+0000 7f4fab54c240 -1 bluefs _check_new_allocations invalid extent 1: 0xe00000~400000: duplicate reference, ino 22
2022-05-10T00:31:02.185 INFO:tasks.workunit.client.0.smithi032.stderr:2022-05-10T00:31:02.183+0000 7f4fab54c240 -1 bluefs mount failed to replay log: (14) Bad address
2022-05-10T00:31:02.185 INFO:tasks.workunit.client.0.smithi032.stderr:2022-05-10T00:31:02.183+0000 7f4fab54c240 -1 bluestore(td/osd-bluefs-volume-ops/0) _open_bluefs failed bluefs mount: (14) Bad address
2022-05-10T00:31:02.188 INFO:tasks.workunit.client.0.smithi032.stderr: ceph version 15.2.16-101-g1fb36ae9 (1fb36ae9234fd4c69ca633c9660cdd632313b2a8) octopus (stable)
2022-05-10T00:31:02.188 INFO:tasks.workunit.client.0.smithi032.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f4fa16565ea]
2022-05-10T00:31:02.189 INFO:tasks.workunit.client.0.smithi032.stderr: 2: (()+0x279804) [0x7f4fa1656804]
2022-05-10T00:31:02.190 INFO:tasks.workunit.client.0.smithi032.stderr: 3: (BlueStore::_mount_for_bluefs()+0x68) [0x5555c7af1ec8]
2022-05-10T00:31:02.190 INFO:tasks.workunit.client.0.smithi032.stderr: 4: (BlueStore::migrate_to_existing_bluefs_device(std::set<int, std::less<int>, std::allocator<int> > const&, int)+0x1a0) [0x5555c7af57d0]
2022-05-10T00:31:02.191 INFO:tasks.workunit.client.0.smithi032.stderr: 5: (main()+0x4144) [0x5555c79e8204]
2022-05-10T00:31:02.191 INFO:tasks.workunit.client.0.smithi032.stderr: 6: (__libc_start_main()+0xf3) [0x7f4f9ee827b3]
2022-05-10T00:31:02.191 INFO:tasks.workunit.client.0.smithi032.stderr: 7: (_start()+0x2e) [0x5555c7a083ae]

@ljflores
Contributor

ljflores commented May 12, 2022

Current analysis of the test run. @ifed01 I opened https://tracker.ceph.com/issues/49287 to track the BlueFS failure. Let me know what you think of it.

http://pulpito.front.sepia.ceph.com/?branch=wip-yuri6-testing-2022-05-09-0734-octopus

A few jobs failed due to problems in infrastructure, but passed in a rerun.

Failures:
1. https://tracker.ceph.com/issues/49287
2. https://tracker.ceph.com/issues/55636 --> opened during this run

Details:
1. podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Ceph - Orchestrator
2. octopus: osd-bluefs-volume-ops.sh: TEST_bluestore2 fails with "FAILED ceph_assert(r == 0)" - Ceph - BlueStore

@ifed01
Contributor Author

ifed01 commented Jun 22, 2022

Current analysis of the test run. @ifed01 I opened https://tracker.ceph.com/issues/49287 to track the BlueFS failure. Let me know what you think of it.

http://pulpito.front.sepia.ceph.com/?branch=wip-yuri6-testing-2022-05-09-0734-octopus

A few jobs failed due to problems in infrastructure, but passed in a rerun.

Failures:
1. https://tracker.ceph.com/issues/49287
2. https://tracker.ceph.com/issues/55636 --> opened during this run

Details:
1. podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Ceph - Orchestrator
2. octopus: osd-bluefs-volume-ops.sh: TEST_bluestore2 fails with "FAILED ceph_assert(r == 0)" - Ceph - BlueStore

@ljflores - sorry for the late response.
I'm absolutely sure https://tracker.ceph.com/issues/49287 is unrelated to this PR.
And IMO https://tracker.ceph.com/issues/55636 is unrelated as well, but it's worth additional investigation on its own...

@ljflores
Contributor

Rados suite results: https://pulpito.ceph.com/?branch=wip-yuri5-testing-2022-06-22-0914-octopus

One unrelated dead cephadm job, which passed in the rerun.
All other failures were caused by a bug in teuthology, which has been fixed.

@yuriw yuriw merged commit 8477df9 into ceph:octopus Jun 23, 2022
@ifed01 ifed01 deleted the wip-ifed-fix-54288-oct branch June 24, 2022 21:17
@ljflores
Contributor

ljflores commented Jul 1, 2022

Hi @ifed01, please see https://tracker.ceph.com/issues/55636#note-2. I suspect that this commit actually did cause the bug from https://tracker.ceph.com/issues/55636. The reason it was a tricky catch is that it appears to fail only on certain operating systems.

Let me know what you think.
