osd: refcounting chunks for snapshotted manifest object #29283

Merged
merged 13 commits into ceph:master on Jul 14, 2020

Conversation

Member

@myoungwon myoungwon commented Jul 24, 2019

https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/7G222W5F7MD7X5MJS3BF6YDJ76RKJGN5/

To prevent removing a manifest object that is still in use,
chunk maps are introduced in clone objects.

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

@myoungwon myoungwon added the core label Jul 24, 2019
@myoungwon myoungwon force-pushed the wip-refcount-snap branch 5 times, most recently from e57060f to b496bf3 Compare July 29, 2019 08:37
@myoungwon
Member Author

@liewegas Before starting the scrub work for snapshotted manifest objects, I would like some feedback to clear up any misunderstanding about the concept.

These commits implement reference tracking based on clone info, as suggested in @gregsfortytwo's comment.
The refcount is incremented when a clone or head object is created and decremented when a clone or head object is deleted, through each object_info_t's refcount.

Note that the key idea behind the reference counting is to allow only false positives: the state (manifest object (no ref), chunk object (has ref)) is possible, whereas (manifest object (has ref), chunk object (no ref)) is not.
If such an inconsistency occurs, it will be fixed by scrub processing (ceph-dedup-tool).
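
The ordering argument above can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: `ChunkStore`, `Manifest`, `add_chunk_ref`, `drop_chunk_ref`, and `manifest_is_safe` are hypothetical names. The chunk's count is bumped before the manifest records the reference, and decremented only after the manifest forgets it, so an interruption between the two steps can leave a chunk with an extra reference (which scrub can repair) but not a manifest pointing at an unreferenced chunk.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the chunk pool: chunk id -> reference count.
struct ChunkStore {
  std::map<std::string, uint64_t> refcount;
};

// Hypothetical stand-in for a manifest object: the chunks it points at.
struct Manifest {
  std::set<std::string> chunks;
};

// Take a reference: bump the chunk first, then record it in the manifest.
inline void add_chunk_ref(ChunkStore& store, Manifest& m, const std::string& chunk) {
  if (m.chunks.count(chunk)) return;  // this manifest already holds the chunk
  ++store.refcount[chunk];            // step 1: chunk side
  m.chunks.insert(chunk);             // step 2: manifest side
}

// Drop a reference: forget it in the manifest first, then decrement the chunk.
inline void drop_chunk_ref(ChunkStore& store, Manifest& m, const std::string& chunk) {
  if (m.chunks.erase(chunk) == 0) return;  // manifest never held this chunk
  auto it = store.refcount.find(chunk);
  if (it != store.refcount.end() && it->second > 0) --it->second;
}

// The invariant that must hold: every chunk named by the manifest has refs.
inline bool manifest_is_safe(const ChunkStore& store, const Manifest& m) {
  for (const auto& c : m.chunks) {
    auto it = store.refcount.find(c);
    if (it == store.refcount.end() || it->second == 0) return false;
  }
  return true;
}
```

If a crash is imagined between step 1 and step 2 of either function, `manifest_is_safe` still holds; only the chunk's count may be too high.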

Am I missing something?

@myoungwon
Member Author

@liewegas Ping.

}
}
}

Member

This part of the design worries me. It means that the first time we touch a snapshotted object and need to make a clone, the op is blocked while we go bump a bunch of ref counts.

@myoungwon myoungwon force-pushed the wip-refcount-snap branch 3 times, most recently from f23ca9c to 54bf4b2 Compare August 9, 2019 10:43
@myoungwon
Member Author

@liewegas Make sense?

@myoungwon
Member Author

@liewegas Ping.

interval_set<uint64_t> clone_overlap = newest_overlap;
interval_set<uint64_t> chunk_range;
chunk_range.insert(p.first, p.second.length);
clone_overlap.intersection_of(chunk_range);
Member

@liewegas liewegas Sep 6, 2019

i think here you can replace the last 4 lines with just if (newest_overlap.intersects(p.first, p.second.length)) { ...
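
To illustrate why the shorter form should be equivalent, here is a sketch using a toy `IntervalSet` (a hypothetical stand-in written for this example, not Ceph's `interval_set`). It assumes the original code goes on to test whether `clone_overlap` is non-empty after the intersection; `overlaps_long` and `overlaps_short` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Toy interval set: start -> length. Callers are assumed to insert disjoint ranges.
class IntervalSet {
 public:
  void insert(uint64_t start, uint64_t len) {
    if (len) ranges_[start] = len;
  }

  bool empty() const { return ranges_.empty(); }

  // True if any stored range overlaps [start, start + len).
  bool intersects(uint64_t start, uint64_t len) const {
    if (len == 0) return false;
    const uint64_t end = start + len;
    for (const auto& r : ranges_) {
      if (r.first < end && start < r.first + r.second) return true;
    }
    return false;
  }

  // Replace *this with the intersection of *this and other.
  void intersection_of(const IntervalSet& other) {
    std::map<uint64_t, uint64_t> out;
    for (const auto& a : ranges_) {
      for (const auto& b : other.ranges_) {
        const uint64_t lo = std::max(a.first, b.first);
        const uint64_t hi = std::min(a.first + a.second, b.first + b.second);
        if (lo < hi) out[lo] = hi - lo;
      }
    }
    ranges_ = std::move(out);
  }

 private:
  std::map<uint64_t, uint64_t> ranges_;
};

// The four-line form from the diff: copy, build a one-range set, intersect.
bool overlaps_long(const IntervalSet& newest_overlap, uint64_t off, uint64_t len) {
  IntervalSet clone_overlap = newest_overlap;
  IntervalSet chunk_range;
  chunk_range.insert(off, len);
  clone_overlap.intersection_of(chunk_range);
  return !clone_overlap.empty();
}

// The suggested one-line form: ask the set directly.
bool overlaps_short(const IntervalSet& newest_overlap, uint64_t off, uint64_t len) {
  return newest_overlap.intersects(off, len);
}
```

The short form also avoids copying the overlap set for every chunk in the loop.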

@myoungwon myoungwon force-pushed the wip-refcount-snap branch 2 times, most recently from 290d83e to 95c0df2 Compare September 8, 2019 11:32
@myoungwon
Member Author

@liewegas Fixed. Could you take a look?

@myoungwon myoungwon changed the title WIP: osd: refcounting chunks for snapshotted manifest object osd: refcounting chunks for snapshotted manifest object Sep 9, 2019
if (!has_reference) {
return false;
}
return true;
Member

just return has_reference;?

Member Author

Fixed

// check if the references is still used
object_info_t& oi = cobc->obs.oi;
if (oi.has_manifest()) {
coi.manifest.build_non_intersection_set(oi.manifest.chunk_map, no_refs, NULL);
Member

The way build_non_intersection_set is written, it will only add things that don't overlap. So calling it inside a loop won't find things that don't intersect with all other clones; instead, it'll find things that don't intersect with at least one other clone, which isn't very useful.

  1. I think this only needs to check the before and after clones.
  2. I think it needs to do the first pass, where it finds the intersection, on both of those clones, and then do the second pass.
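
One reading of the suggestion above, as a sketch: when an object's manifest is removed, a chunk reference is dropped only if neither adjacent clone (the one before and the one after) maps the same offset to the same chunk. `ChunkMap`, `holds`, and `refs_to_drop` are hypothetical names for this example, and the chunk map is simplified to offset -> chunk id.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// offset -> chunk id; a hypothetical simplification of a manifest's chunk_map.
using ChunkMap = std::map<uint64_t, std::string>;

// True if `m` maps `offset` to `chunk`. `m` is null when there is no such neighbour.
static bool holds(const ChunkMap* m, uint64_t offset, const std::string& chunk) {
  if (!m) return false;
  auto it = m->find(offset);
  return it != m->end() && it->second == chunk;
}

// Offsets in `target` whose chunk reference can be dropped when `target` is
// removed: those shared with neither the previous nor the next clone.
std::vector<uint64_t> refs_to_drop(const ChunkMap& target,
                                   const ChunkMap* prev,
                                   const ChunkMap* next) {
  std::vector<uint64_t> out;
  for (const auto& entry : target) {
    const bool shared = holds(prev, entry.first, entry.second) ||
                        holds(next, entry.first, entry.second);
    if (!shared) out.push_back(entry.first);
  }
  return out;
}
```

Both neighbours are consulted before anything is added to the result, so an entry is reported only when it intersects with neither of them.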

@myoungwon
Member Author

@liewegas I addressed your comments. Please review.

object_info_t& coi = cobc->obs.oi;
oi.manifest.build_intersection_set(coi.manifest.chunk_map, refs, &newest_overlap);

if (refs.size() > 0) {
Member

I think this if should go

@liewegas
Member

liewegas commented Sep 9, 2019

other than that if, i think this looks right!

@tchaikov
Contributor

tchaikov commented Jul 5, 2020

@athanatos @myoungwon i will include this PR in my next batch.

@myoungwon
Member Author

myoungwon commented Jul 7, 2020

@athanatos The test results show no failures that appear to be related to this PR. What do you think?

@athanatos
Contributor

athanatos commented Jul 7, 2020

@myoungwon What caused the two rados/test.sh failures? They persisted into the second run and could be related.

@tchaikov
Contributor

tchaikov commented Jul 8, 2020

2020-07-06T12:06:10.597 INFO:teuthology.orchestra.run.smithi192:workunit test rados/test.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=7b60e408aedc30fb1b71a2c6e541618527d6e6d3 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh
...
2020-07-06T18:06:10.641 DEBUG:teuthology.orchestra.run:got remote process result: 124

They timed out.

$ grep -A1 '\[==========\]' /a/kchai-2020-07-06_11:39:50-rados-wip-kefu-testing-2020-07-06-1016-distro-basic-smithi/5202896/teuthology.log
2020-07-06T12:06:10.708 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_asio: [==========] Running 12 tests from 1 test suite.
2020-07-06T12:06:10.708 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_asio: [----------] Global test environment set-up.
--
2020-07-06T12:06:36.503 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_asio: 2020-07-06T12:06:35.935+0000 7fe5167fc700  3 client.4328.objecter handle_osd_map decoding i              api_service: [==========] Running 3 tests from 1 test suite.
2020-07-06T12:06:36.504 INFO:tasks.workunit.client.0.smithi192.stdout:              api_service: [----------] Global test environment set-up.
--
2020-07-06T12:06:36.506 INFO:tasks.workunit.client.0.smithi192.stdout:              api_service: [==========] 3 tests from 1 test suite ran. (25690 ms total)
2020-07-06T12:06:36.506 INFO:tasks.workunit.client.0.smithi192.stdout:              api_service: [  PASSED  ] 3 tests.
--
2020-07-06T12:06:47.337 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_asio: 2020-07-06T12:06:46           api_service_pp: [==========] Running 4 tests from 1 test suite.
2020-07-06T12:06:47.338 INFO:tasks.workunit.client.0.smithi192.stdout:           api_service_pp: [----------] Global test environment set-up.
--
2020-07-06T12:06:47.353 INFO:tasks.workunit.client.0.smithi192.stdout:           api_service_pp: [==========] 4 tests from 1 test suite ran. (36510 ms total)
2020-07-06T12:06:47.363 INFO:tasks.workunit.client.0.smithi192.stdout:           api_service_pp: [  PASSED  ] 4 tests.
--
2020-07-06T12:06:53.049 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_pool: [==========] Running 7 tests from 1 test suite.
2020-07-06T12:06:53.049 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_pool: [----------] Global test environment set-up.
--
2020-07-06T12:06:53.053 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_pool: [==========] 7 tests from 1 test suite ran. (42335 ms total)
2020-07-06T12:06:53.053 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_pool: [  PASSED  ] 7 tests.
--
2020-07-06T12:07:15.841 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_asio: 2020-07-06T12:07:14.743+0000                 api_misc: [==========] Running 11 tests from 4 test suites.
2020-07-06T12:07:15.842 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_misc: [----------] Global test environment set-up.
--
2020-07-06T12:07:43.729 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_asio: [==========] 12 tests from 1 test suite ran. (93020 ms total)
2020-07-06T12:07:43.729 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_asio: [  PASSED  ] 12 tests.
--
2020-07-06T12:08:35.928 INFO:tasks.workunit.client.0.smithi192.stdout:    api_c_read_operations: [==========] Running 16 tests from 1 test suite.
2020-07-06T12:08:35.929 INFO:tasks.workunit.client.0.smithi192.stdout:    api_c_read_operations: [----------] Global test environment set-up.
--
2020-07-06T12:08:35.936 INFO:tasks.workunit.client.0.smithi192.stdout:    api_c_read_operations: [==========] 16 tests from 1 test suite ran. (145085 ms total)
2020-07-06T12:08:35.936 INFO:tasks.workunit.client.0.smithi192.stdout:    api_c_read_operations: [  PASSED  ] 16 tests.
--
2020-07-06T12:08:52.886 INFO:tasks.workunit.client.0.smithi192.stdout:               api_cmd_pp: [==========] Running 3 tests from 1 test suite.
2020-07-06T12:08:52.886 INFO:tasks.workunit.client.0.smithi192.stdout:               api_cmd_pp: [----------] Global test environment set-up.
--
2020-07-06T12:08:52.888 INFO:tasks.workunit.client.0.smithi192.stdout:               api_cmd_pp: [==========] 3 tests from 1 test suite ran. (162085 ms total)
2020-07-06T12:08:52.888 INFO:tasks.workunit.client.0.smithi192.stdout:               api_cmd_pp: [  PASSED  ] 3 tests.
--
2020-07-06T12:08:53.969 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_cmd: [==========] Running 4 tests from 1 test suite.
2020-07-06T12:08:53.969 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_cmd: [----------] Global test environment set-up.
--
2020-07-06T12:08:56.082 INFO:tasks.workunit.client.0.smithi192.stdout:      api_watch_notify_pp: [==========] Running 16 tests from 2 test suites.
2020-07-06T12:08:56.083 INFO:tasks.workunit.client.0.smithi192.stdout:      api_watch_notify_pp: [----------] Global test environment set-up.
--
2020-07-06T12:09:23.024 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_cmd: got: 2020-07-06T12:09:14.073735+0000 mon.c [INF] from='client.? 172.21.15.192:0/1203117768' entity='client.admin' cmd=[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi192-26074-12"}]: d                 api_list: [==========] Running 10 tests from 3 test suites.
2020-07-06T12:09:23.024 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_list: [----------] Global test environment set-up.
--
2020-07-06T12:09:28.884 INFO:tasks.workunit.client.0.smithi192.stdout:      api_watch_notify_pp: [==========] 16 tests from 2 test suites ran. (198104 ms total)
2020-07-06T12:09:28.884 INFO:tasks.workunit.client.0.smithi192.stdout:      api_watch_notify_pp: [  PASSED  ] 16 tests.
--
2020-07-06T12:09:29.907 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_misc: [==========] 11 tests from 4 test suites ran. (199185 ms total)
2020-07-06T12:09:29.907 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_misc: [  PASSED  ] 11 tests.
--
2020-07-06T12:09:29.908 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_stat: [==========] Running 8 tests from 2 test suites.
2020-07-06T12:09:29.908 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_stat: [----------] Global test environment set-up.
--
2020-07-06T12:09:29.922 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_stat: [==========] 8 tests from 2 test suites ran. (199160 ms total)
2020-07-06T12:09:29.922 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_stat: [  PASSED  ] 8 tests.
--
2020-07-06T12:09:29.958 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_lock: [==========] Running 16 tests from 2 test suites.
2020-07-06T12:09:29.958 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_lock: [----------] Global test environment set-up.
--
2020-07-06T12:09:29.967 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_lock: [==========] 16 tests from 2 test suites ran. (199292 ms total)
2020-07-06T12:09:29.967 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_lock: [  PASSED  ] 16 tests.
--
2020-07-06T12:09:30.410 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_cmd: [==========] 4 tests from 1 test suite ran. (199629 ms total)
2020-07-06T12:09:30.411 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_cmd: [  PASSED  ] 4 tests.
--
2020-07-06T12:09:34.151 INFO:tasks.workunit.client.0.smithi192.stdout:   api_c_write_operations: [==========] Running 8 tests from 2 test suites.
2020-07-06T12:09:34.152 INFO:tasks.workunit.client.0.smithi192.stdout:   api_c_write_operations: [----------] Global test environment set-up.
--
2020-07-06T12:09:34.156 INFO:tasks.workunit.client.0.smithi192.stdout:   api_c_write_operations: [==========] 8 tests from 2 test suites ran. (203323 ms total)
2020-07-06T12:09:34.156 INFO:tasks.workunit.client.0.smithi192.stdout:   api_c_write_operations: [  PASSED  ] 8 tests.
--
2020-07-06T12:09:41.134 INFO:tasks.workunit.client.0.smithi192.stdout:              api_stat_pp: [==========] Running 9 tests from 2 test suites.
2020-07-06T12:09:41.134 INFO:tasks.workunit.client.0.smithi192.stdout:              api_stat_pp: [----------] Global test environment set-up.
--
2020-07-06T12:09:41.141 INFO:tasks.workunit.client.0.smithi192.stdout:              api_stat_pp: [==========] 9 tests from 2 test suites ran. (210368 ms total)
2020-07-06T12:09:41.142 INFO:tasks.workunit.client.0.smithi192.stdout:              api_stat_pp: [  PASSED  ] 9 tests.
--
2020-07-06T12:09:47.473 INFO:tasks.workunit.client.0.smithi192.stdout:         api_watch_notify: [==========] Running 11 tests from 2 test suites.
2020-07-06T12:09:47.473 INFO:tasks.workunit.client.0.smithi192.stdout:         api_watch_notify: [----------] Global test environment set-up.
--
2020-07-06T12:09:47.483 INFO:tasks.workunit.client.0.smithi192.stdout:         api_watch_notify: [==========] 11 tests from 2 test suites ran. (216704 ms total)
2020-07-06T12:09:47.483 INFO:tasks.workunit.client.0.smithi192.stdout:         api_watch_notify: [  PASSED  ] 11 tests.
--
2020-07-06T12:09:58.884 INFO:tasks.workunit.client.0.smithi192.stdout:                api_io_pp: [==========] Running 37 tests from 2 test suites.
2020-07-06T12:09:58.884 INFO:tasks.workunit.client.0.smithi192.stdout:                api_io_pp: [----------] Global test environment set-up.
--
2020-07-06T12:10:11.568 INFO:tasks.workunit.client.0.smithi192.stdout:              api_lock_pp: [==========] Running 16 tests from 2 test suites.
2020-07-06T12:10:11.568 INFO:tasks.workunit.client.0.smithi192.stdout:              api_lock_pp: [----------] Global test environment set-up.
--
2020-07-06T12:10:11.579 INFO:tasks.workunit.client.0.smithi192.stdout:              api_lock_pp: [==========] 16 tests from 2 test suites ran. (240895 ms total)
2020-07-06T12:10:11.579 INFO:tasks.workunit.client.0.smithi192.stdout:              api_lock_pp: [  PASSED  ] 16 tests.
--
2020-07-06T12:10:24.253 INFO:tasks.workunit.client.0.smithi192.stdout:                   api_io: [==========] Running 23 tests from 2 test suites.
2020-07-06T12:10:24.253 INFO:tasks.workunit.client.0.smithi192.stdout:                   api_io: [----------] Global test environment set-up.
--
2020-07-06T12:10:24.265 INFO:tasks.workunit.client.0.smithi192.stdout:                   api_io: [==========] 23 tests from 2 test suites ran. (253612 ms total)
2020-07-06T12:10:24.265 INFO:tasks.workunit.client.0.smithi192.stdout:                   api_io: [  PASSED  ] 23 tests.
--
2020-07-06T12:10:38.618 INFO:tasks.workunit.client.0.smithi192.stdout:                api_io_pp: [==========] 37 tests from 2 test suites ran. (267971 ms total)
2020-07-06T12:10:38.618 INFO:tasks.workunit.client.0.smithi192.stdout:                api_io_pp: [  PASSED  ] 37 tests.
--
2020-07-06T12:10:51.373 INFO:tasks.workunit.client.0.smithi192.stdout:            api_snapshots: [==========] Running 12 tests from 4 test suites.
2020-07-06T12:10:51.374 INFO:tasks.workunit.client.0.smithi192.stdout:            api_snapshots: [----------] Global test environment set-up.
--
2020-07-06T12:10:51.389 INFO:tasks.workunit.client.0.smithi192.stdout:            api_snapshots: [==========] 12 tests from 4 test suites ran. (280643 ms total)
2020-07-06T12:10:51.389 INFO:tasks.workunit.client.0.smithi192.stdout:            api_snapshots: [  PASSED  ] 12 tests.
--
2020-07-06T12:11:41.979 INFO:tasks.workunit.client.0.smithi192.stdout:         api_snapshots_pp: [==========] Running 18 tests from 4 test suites.
2020-07-06T12:11:41.979 INFO:tasks.workunit.client.0.smithi192.stdout:         api_snapshots_pp: [----------] Global test environment set-up.
--
2020-07-06T12:11:41.990 INFO:tasks.workunit.client.0.smithi192.stdout:         api_snapshots_pp: [==========] 18 tests from 4 test suites ran. (331247 ms total)
2020-07-06T12:11:41.990 INFO:tasks.workunit.client.0.smithi192.stdout:         api_snapshots_pp: [  PASSED  ] 18 tests.
--
2020-07-06T12:11:49.271 INFO:tasks.workunit.client.0.smithi192.stdout:               api_aio_pp: [==========] Running 53 tests from 4 test suites.
2020-07-06T12:11:49.271 INFO:tasks.workunit.client.0.smithi192.stdout:               api_aio_pp: [----------] Global test environment set-up.
--
2020-07-06T12:11:53.908 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_aio: [==========] Running 40 tests from 2 test suites.
2020-07-06T12:11:53.908 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_aio: [----------] Global test environment set-up.
--
2020-07-06T12:13:24.642 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_list: [==========] 10 tests from 3 test suites ran. (433968 ms total)
2020-07-06T12:13:24.642 INFO:tasks.workunit.client.0.smithi192.stdout:                 api_list: [  PASSED  ] 10 tests.
--
2020-07-06T12:13:52.873 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_aio: [==========] 40 tests from 2 test suites ran. (462232 ms total)
2020-07-06T12:13:52.874 INFO:tasks.workunit.client.0.smithi192.stdout:                  api_aio: [  PASSED  ] 40 tests.
--
2020-07-06T12:15:11.053 INFO:tasks.workunit.client.0.smithi192.stdout:               api_aio_pp: [==========] 53 tests from 4 test suites ran. (540413 ms total)
2020-07-06T12:15:11.053 INFO:tasks.workunit.client.0.smithi192.stdout:               api_aio_pp: [  PASSED  ] 53 tests.
--
2020-07-06T12:15:27.648 INFO:tasks.workunit.client.0.smithi192.stdout:              api_misc_pp: [==========] Running 31 tests from 7 test suites.
2020-07-06T12:15:27.649 INFO:tasks.workunit.client.0.smithi192.stdout:              api_misc_pp: [----------] Global test environment set-up.
--
2020-07-06T12:16:01.177 INFO:tasks.workunit.client.0.smithi192.stdout:              api_misc_pp: [==========] 31 tests from 7 test suites ran. (590461 ms total)
2020-07-06T12:16:01.177 INFO:tasks.workunit.client.0.smithi192.stdout:              api_misc_pp: [  PASSED  ] 31 tests.
--
2020-07-06T12:17:59.992 INFO:tasks.workunit.client.0.smithi192.stdout:              api_tier_pp: [==========] Running 62 tests from 4 test suites.
2020-07-06T12:17:59.992 INFO:tasks.workunit.client.0.smithi192.stdout:              api_tier_pp: [----------] Global test environment set-up.

in which api_tier_pp never finishes.

@tchaikov
Contributor

tchaikov commented Jul 8, 2020

@myoungwon
Member Author

myoungwon commented Jul 8, 2020

This seems like a bug in dec_all_refcount_manifest().

api_tier_pp never finishes because valgrind reports an error:

2020-07-06T12:23:32.593 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsPP.ProxyRead (23522 ms)
2020-07-06T12:23:32.593 INFO:tasks.workunit.client.0.smithi023.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsPP.CachePin
2020-07-06T12:23:36.447 INFO:tasks.ceph.mon.a.smithi023.stderr:==00:00:17:31.848 21858== Warning: unimplemented fcntl command: 1036
2020-07-06T12:23:36.489 INFO:tasks.ceph.mon.a.smithi023.stderr:==00:00:17:31.890 21858== Warning: unimplemented fcntl command: 1036
.......................
2020-07-06T12:24:28.771  INFO:tasks.ceph.osd.3.smithi023.stderr:==00:00:18:02.099 22484== Exit program on first error (--exit-on-first-error=yes) 

This is caused by:

<kind>InvalidRead</kind>
  <what>Invalid read of size 8</what>
  <stack>
    <frame>
      <ip>0x8A25F8</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>__shared_ptr</fn>
      <dir>/usr/include/c++/8/bits</dir>
      <file>shared_ptr_base.h</file>
      <line>1165</line>
    </frame>
    <frame>
      <ip>0x8A25F8</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>shared_ptr</fn>
      <dir>/usr/include/c++/8/bits</dir>
      <file>shared_ptr.h</file>
      <line>129</line>
    </frame>
    <frame>
      <ip>0x8A25F8</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>operator()</fn>
      <dir>/usr/src/debug/ceph-16.0.0-3235.g7b60e408aed.el8.x86_64/src/osd</dir>
      <file>PrimaryLogPG.cc</file>
      <line>3520</line>
    </frame>
    <frame>
      <ip>0x8A25F8</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>std::_Function_handler&lt;void (), PrimaryLogPG::dec_all_refcount_manifest(object_info_t const&amp;, PrimaryLogPG::OpContext*)::{lambda()#2}&gt;::_M_invoke(std::_Any_data const&amp;)</fn>
      <dir>/usr/include/c++/8/bits</dir>
      <file>std_function.h</file>
      <line>297</line>
    </frame>
    <frame>
      <ip>0x864703</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>operator()</fn>
      <dir>/usr/include/c++/8/bits</dir>
      <file>std_function.h</file>
      <line>687</line>
    </frame>
    <frame>
      <ip>0x864703</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>PrimaryLogPG::eval_repop(PrimaryLogPG::RepGather*)</fn>
      <dir>/usr/src/debug/ceph-16.0.0-3235.g7b60e408aed.el8.x86_64/src/osd</dir>
      <file>PrimaryLogPG.cc</file>
      <line>10589</line>
    </frame>
    <frame>

Based on valgrind's report, in dec_all_refcount_manifest(),

    if (!refs.is_empty()) {
      ctx->register_on_commit(
        [ctx, this, refs](){
          dec_refcount(ctx->obc, refs);
        });
    }

dec_refcount is called with a ctx that is no longer valid, because ctx has been moved (move(ctx)) on the AwaitAsyncWork::react() path
(AwaitAsyncWork::react() -> trim_object -> move(ctx)).

So, to fix this, replace the above code with:

    if (!refs.is_empty()) { 
      hobject_t soid = ctx->obc->obs.oi.soid;
      ctx->register_on_commit(
        [soid, this, refs](){
          ObjectContextRef obc = get_object_context(soid, false, NULL);
          ceph_assert(obc);
          dec_refcount(obc, refs);
        });
    }

Make sense?
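
The general pattern behind this fix can be shown in isolation. The sketch below is a simplified model, not Ceph code: `ObjectContext`, `OpContext`, `PG`, `register_dec`, and `run_after_move` are hypothetical stand-ins. The callback captures the object id by value and looks the object up when it runs, so it never dereferences the per-op context, which may already be gone by then.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ObjectContext {  // hypothetical: per-object state
  int refcount = 1;
};

struct OpContext {  // hypothetical: per-op state that owns its callbacks
  std::string soid;
  std::vector<std::function<void()>> on_commit;
};

struct PG {  // hypothetical: owns the object contexts
  std::map<std::string, std::shared_ptr<ObjectContext>> objects;

  std::shared_ptr<ObjectContext> get_object_context(const std::string& soid) {
    auto it = objects.find(soid);
    if (it == objects.end()) return nullptr;
    return it->second;
  }

  // Capture the object id by value; the callback never touches the OpContext.
  // Capturing a pointer to ctx instead would dangle once ctx is destroyed.
  void register_dec(OpContext& ctx) {
    std::string soid = ctx.soid;
    ctx.on_commit.push_back([this, soid]() {
      auto obc = get_object_context(soid);
      if (obc) --obc->refcount;
    });
  }
};

// Simulates the op context going away before its commit callbacks run.
void run_after_move(PG& pg, std::unique_ptr<OpContext> ctx) {
  pg.register_dec(*ctx);
  std::vector<std::function<void()>> callbacks = std::move(ctx->on_commit);
  ctx.reset();                      // the original context no longer exists
  for (auto& cb : callbacks) cb();  // safe: only `soid` and `pg` are used
}
```

The cost of this design is one extra lookup per callback, in exchange for not depending on the lifetime of the op context.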

ctx will be moved, so replace it with the value

Signed-off-by: Myoungwon Oh <ohmyoungwon@gmail.com>
@tchaikov
Contributor

tchaikov commented Jul 8, 2020

@myoungwon probably you could rerun the failed tests with your latest fix to verify it?

@myoungwon
Member Author

@tchaikov I cannot access sepia yet because my request to update my login credentials is still in progress. Once that request is completed, I can rerun this.

@myoungwon
Member Author

retest this please

if (i == manifest.chunk_map.end() || current != i->first) {
return nullptr;
} else {
// We advance the iterator iff we consider the chunk_map on this iteration
Member

This comment was not actually helpful in understanding what's going on here. The way you're advancing through the loop is elegant but "hiding" the iter advancement inside of a lambda function is messy, so you should explain the lambda in light of the overall algorithm, especially since its very clear name only describes a portion of its purpose.
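
As an illustration of the kind of comment being asked for, here is a sketch of a lambda that advances an iterator as a side effect, with that side effect documented in terms of the surrounding loop. It is a toy written for this example, not the PR's code: `ChunkMap`, `only_in_a`, and the offset -> chunk id layout are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using ChunkMap = std::map<uint64_t, std::string>;  // offset -> chunk id (hypothetical)

// Walk two chunk maps in offset order and report the offsets where `a` has a
// chunk that `b` does not share.
std::vector<uint64_t> only_in_a(const ChunkMap& a, const ChunkMap& b) {
  // Returns the chunk stored at offset `current`, or nullptr if this map has
  // no chunk there. Side effect: when a chunk is returned, the iterator is
  // advanced past it. Each map's cursor therefore moves forward only on the
  // iterations that consume one of its entries, which keeps both cursors in
  // step with the merged offset order driven by the loop below.
  auto get_chunk = [](uint64_t current, ChunkMap::const_iterator& i,
                      const ChunkMap& m) -> const std::string* {
    if (i == m.end() || current != i->first) {
      return nullptr;
    }
    return &(i++)->second;
  };

  std::vector<uint64_t> out;
  auto ai = a.begin();
  auto bi = b.begin();
  while (ai != a.end() || bi != b.end()) {
    // The next offset to consider is the smallest one not yet consumed.
    uint64_t current;
    if (bi == b.end() || (ai != a.end() && ai->first <= bi->first)) {
      current = ai->first;
    } else {
      current = bi->first;
    }
    const std::string* ca = get_chunk(current, ai, a);
    const std::string* cb = get_chunk(current, bi, b);
    if (ca && (!cb || *ca != *cb)) out.push_back(current);
  }
  return out;
}
```

Because `current` always equals the key under at least one cursor, every iteration advances at least one iterator and the loop terminates.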

@gregsfortytwo
Member

Those functions look fine @athanatos other than the one bit needing a better comment.

@athanatos
Contributor

@myoungwon Top commit on https://github.com/athanatos/ceph/commits/sjust/wip-fix-comment (athanatos@1c65b0a) has an improvement to address @gregsfortytwo's comment.

Signed-off-by: Samuel Just <sjust@redhat.com>
@athanatos
Contributor

retest this please

@athanatos
Contributor

retest this please

@myoungwon
Member Author

jenkins retest this please

@myoungwon
Member Author

myoungwon commented Jul 14, 2020

@athanatos Can you look over this PR? QA result looks good. Do I need to rerun the tests?

@athanatos athanatos self-requested a review July 14, 2020 22:47
@athanatos athanatos merged commit f88211b into ceph:master Jul 14, 2020