
octopus: rocksdb: do not use non-zero recycle_log_file_num setting #45040

Merged
Merged 1 commit into ceph:octopus on Jun 23, 2022

Conversation

ifed01
Contributor

@ifed01 ifed01 commented Feb 15, 2022

This forces RocksDB to use the less reliable kTolerateCorruptedTailRecords
mode for WAL recovery.

Fixes: https://tracker.ceph.com/issues/54288
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
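The constraint behind this change can be sketched as follows. This is an illustrative model, not Ceph's or RocksDB's actual code (`DBOptions`, `effective_recovery_mode`, and the enum here are simplified stand-ins): recycled WAL files keep stale record payloads past the logical end of the log, so a non-zero recycle_log_file_num forces the weaker tail-record-tolerant recovery mode described in this PR.

```cpp
#include <cstdint>

// Simplified stand-in for RocksDB's WAL recovery modes.
enum class WALRecoveryMode {
  kTolerateCorruptedTailRecords,  // weaker guarantee: tail corruption is skipped
  kPointInTimeRecovery,           // safer: replay stops at first inconsistency
};

// Simplified stand-in for the relevant RocksDB options.
struct DBOptions {
  uint64_t recycle_log_file_num = 0;
  WALRecoveryMode wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
};

// Sketch of the interaction this PR avoids: with recycled WAL files,
// old records beyond the logical end of log are indistinguishable from
// corruption, so the safer point-in-time mode cannot be used.
WALRecoveryMode effective_recovery_mode(const DBOptions& opts) {
  if (opts.recycle_log_file_num > 0)
    return WALRecoveryMode::kTolerateCorruptedTailRecords;
  return opts.wal_recovery_mode;
}
```

Setting recycle_log_file_num back to 0 (as this PR does) therefore lets RocksDB keep its safer default recovery mode.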

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@github-actions github-actions bot added this to the octopus milestone Feb 15, 2022
@neha-ojha neha-ojha changed the title rocksdb: do not use non-zero recycle_log_file_num setting octopus: rocksdb: do not use non-zero recycle_log_file_num setting Feb 15, 2022
@neha-ojha
Member

@ifed01 should try to get this into 15.2.16? though it might be a bit late

@ifed01
Contributor Author

ifed01 commented Feb 15, 2022

@ifed01 should try to get this into 15.2.16? though it might be a bit late

I don't think that's required; there is a workaround: one can adjust the setting manually if needed.
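For reference, the manual workaround mentioned above could look roughly like the following. This is a sketch only: `bluestore_rocksdb_options` is the Ceph option that carries RocksDB settings, but the full option string varies by release, so the elided part should be taken from your release's defaults rather than copied from here.

```ini
[osd]
# Sketch only: append recycle_log_file_num=0 to your release's existing
# default bluestore_rocksdb_options string (elided here as "...").
bluestore_rocksdb_options = ...,recycle_log_file_num=0
```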

Member

@neha-ojha neha-ojha left a comment


makes sense to me

@ljflores
Contributor

ljflores commented May 12, 2022

@ifed01 does this failure look familiar to you? It came up twice in teuthology runs, and I don't see it tracked anywhere. http://pulpito.front.sepia.ceph.com/yuriw-2022-05-09_21:49:19-rados-wip-yuri6-testing-2022-05-09-0734-octopus-distro-default-smithi/6829109/

2022-05-10T00:31:01.602 INFO:tasks.workunit.client.0.smithi032.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:382: TEST_bluestore2:  ceph osd down 0
2022-05-10T00:31:02.106 INFO:tasks.workunit.client.0.smithi032.stderr:osd.0 is already down.
2022-05-10T00:31:02.113 INFO:tasks.workunit.client.0.smithi032.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/osd-bluefs-volume-ops.sh:384: TEST_bluestore2:  ceph-bluestore-tool --path td/osd-bluefs-volume-ops/0 --devs-source td/osd-bluefs-volume-ops/0/block.db --dev-target td/osd-bluefs-volume-ops/0/block --command bluefs-bdev-migrate
2022-05-10T00:31:02.123 INFO:tasks.workunit.client.0.smithi032.stdout:inferring bluefs devices from bluestore path
2022-05-10T00:31:02.184 INFO:tasks.workunit.client.0.smithi032.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.16-101-g1fb36ae9/rpm/el8/BUILD/ceph-15.2.16-101-g1fb36ae9/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_mount_for_bluefs()' thread 7f4fab54c240 time 2022-05-10T00:31:02.184091+0000
2022-05-10T00:31:02.184 INFO:tasks.workunit.client.0.smithi032.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.16-101-g1fb36ae9/rpm/el8/BUILD/ceph-15.2.16-101-g1fb36ae9/src/os/bluestore/BlueStore.cc: 6894: FAILED ceph_assert(r == 0)
2022-05-10T00:31:02.185 INFO:tasks.workunit.client.0.smithi032.stderr:2022-05-10T00:31:02.183+0000 7f4fab54c240 -1 bluefs _check_new_allocations invalid extent 1: 0xe00000~400000: duplicate reference, ino 22
2022-05-10T00:31:02.185 INFO:tasks.workunit.client.0.smithi032.stderr:2022-05-10T00:31:02.183+0000 7f4fab54c240 -1 bluefs mount failed to replay log: (14) Bad address
2022-05-10T00:31:02.185 INFO:tasks.workunit.client.0.smithi032.stderr:2022-05-10T00:31:02.183+0000 7f4fab54c240 -1 bluestore(td/osd-bluefs-volume-ops/0) _open_bluefs failed bluefs mount: (14) Bad address
2022-05-10T00:31:02.188 INFO:tasks.workunit.client.0.smithi032.stderr: ceph version 15.2.16-101-g1fb36ae9 (1fb36ae9234fd4c69ca633c9660cdd632313b2a8) octopus (stable)
2022-05-10T00:31:02.188 INFO:tasks.workunit.client.0.smithi032.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f4fa16565ea]
2022-05-10T00:31:02.189 INFO:tasks.workunit.client.0.smithi032.stderr: 2: (()+0x279804) [0x7f4fa1656804]
2022-05-10T00:31:02.190 INFO:tasks.workunit.client.0.smithi032.stderr: 3: (BlueStore::_mount_for_bluefs()+0x68) [0x5555c7af1ec8]
2022-05-10T00:31:02.190 INFO:tasks.workunit.client.0.smithi032.stderr: 4: (BlueStore::migrate_to_existing_bluefs_device(std::set<int, std::less<int>, std::allocator<int> > const&, int)+0x1a0) [0x5555c7af57d0]
2022-05-10T00:31:02.191 INFO:tasks.workunit.client.0.smithi032.stderr: 5: (main()+0x4144) [0x5555c79e8204]
2022-05-10T00:31:02.191 INFO:tasks.workunit.client.0.smithi032.stderr: 6: (__libc_start_main()+0xf3) [0x7f4f9ee827b3]
2022-05-10T00:31:02.191 INFO:tasks.workunit.client.0.smithi032.stderr: 7: (_start()+0x2e) [0x5555c7a083ae]

@ljflores
Contributor

ljflores commented May 12, 2022

Current analysis of the test run. @ifed01 I opened https://tracker.ceph.com/issues/49287 to track the BlueFS failure. Let me know what you think of it.

http://pulpito.front.sepia.ceph.com/?branch=wip-yuri6-testing-2022-05-09-0734-octopus

A few jobs failed due to problems in infrastructure, but passed in a rerun.

Failures:
1. https://tracker.ceph.com/issues/49287
2. https://tracker.ceph.com/issues/55636 --> opened during this run

Details:
1. podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Ceph - Orchestrator
2. octopus: osd-bluefs-volume-ops.sh: TEST_bluestore2 fails with "FAILED ceph_assert(r == 0)" - Ceph - BlueStore

@ifed01
Contributor Author

ifed01 commented Jun 22, 2022

Current analysis of the test run. @ifed01 I opened https://tracker.ceph.com/issues/49287 to track the BlueFS failure. Let me know what you think of it.

http://pulpito.front.sepia.ceph.com/?branch=wip-yuri6-testing-2022-05-09-0734-octopus

A few jobs failed due to problems in infrastructure, but passed in a rerun.

Failures:
1. https://tracker.ceph.com/issues/49287
2. https://tracker.ceph.com/issues/55636 --> opened during this run

Details:
1. podman: setting cgroup config for procHooks process caused: Unit libpod-$hash.scope not found - Ceph - Orchestrator
2. octopus: osd-bluefs-volume-ops.sh: TEST_bluestore2 fails with "FAILED ceph_assert(r == 0)" - Ceph - BlueStore

@ljflores - sorry for the late response.
I'm absolutely sure https://tracker.ceph.com/issues/49287 is unrelated to this PR.
And IMO https://tracker.ceph.com/issues/55636 is unrelated as well, but it's worth additional investigation on its own...

@ljflores
Contributor

Rados suite results: https://pulpito.ceph.com/?branch=wip-yuri5-testing-2022-06-22-0914-octopus

One unrelated dead cephadm job, which passed in the rerun.
All other failures were caused by a bug in teuthology, which has been fixed.

@yuriw yuriw merged commit 8477df9 into ceph:octopus Jun 23, 2022
@ifed01 ifed01 deleted the wip-ifed-fix-54288-oct branch June 24, 2022 21:17
@ljflores
Contributor

ljflores commented Jul 1, 2022

Hi @ifed01, please see https://tracker.ceph.com/issues/55636#note-2. I suspect that this commit actually did cause the bug from https://tracker.ceph.com/issues/55636. The reason it was a tricky catch is that it appears to fail only on certain operating systems.

Let me know what you think.
