osd: fix to allow inc manifest leaked #42302

myoungwon · 2021-07-13T10:03:39Z

The current dedup code uses a multiset, so the same source can be
recorded more than once. This leads to an inconsistent state in the
following sequence (seen during set_chunk, but not limited to set_chunk).

1. Complete primary write
2. Failure occurs before replication is done
3. Cancel repop
4. Requeue the op

Because of step 4, the inc op is applied one more time. The current
inc op behaves like a ++ operation, so it increases the ref count
once again.
To prevent this, the inc/dec ops need to be idempotent, which can be
done by passing an absolute value.
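
The difference between a relative and an absolute update can be sketched as follows. This is only an illustration: `ChunkRefs`, `inc`, and `set` are hypothetical names and are not taken from the Ceph code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical per-chunk reference table: source object name -> count.
struct ChunkRefs {
  std::map<std::string, int> by_source;

  // Relative update: replaying the same op changes the result.
  void inc(const std::string& src) { ++by_source[src]; }

  // Absolute update: replaying the same op leaves the result unchanged.
  void set(const std::string& src, int count) { by_source[src] = count; }

  int total() const {
    int n = 0;
    for (const auto& kv : by_source) n += kv.second;
    return n;
  }
};

// One logical reference, applied once and then replayed after a requeue.
inline int replay_relative() {
  ChunkRefs refs;
  refs.inc("obj_a");
  refs.inc("obj_a");  // replayed op
  return refs.total();
}

inline int replay_absolute() {
  ChunkRefs refs;
  refs.set("obj_a", 1);
  refs.set("obj_a", 1);  // replayed op
  return refs.total();
}
```

With the relative form the replay leaves a total of 2 for a single real reference; with the absolute form the total stays at 1.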

fixes: https://tracker.ceph.com/issues/51000

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>

Checklist

References tracker ticket
Updates documentation if necessary
Includes tests for new functionality or reproducer for bug

myoungwon · 2021-07-20T10:15:55Z

@athanatos @tchaikov

athanatos · 2021-07-20T16:29:17Z

I'm not sure I understand the scenario. Client->OSD op resubmission will detect and drop duplicates.

myoungwon · 2021-07-21T09:35:21Z

I misunderstood why the increment operation is applied twice. Please look at the following scenario.

1. User issues set_chunk
2. OSD receives the set_chunk and sends an increment message to an object in the low tier (INPROGRESS).
3. OSD map is changed (841 → 843)
4. User re-issues set_chunk with 843
5. OSD receives the duplicated set_chunk, but it cannot tell that the set_chunk is a duplicate because the op has not been logged to disk yet.
6. OSD issues the increment message again to the object in the low tier

The messages sent at steps 2 and 6 have different tids, so the OSD cannot recognize that they are the same. As a result,
the increment operation is executed twice.
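
A simplified model of why the tid does not help here. The `LowTierChunk` type and the tid values are hypothetical; the real OSD code path is more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical low-tier chunk object that drops messages whose tid it
// has already seen.
struct LowTierChunk {
  std::set<std::uint64_t> seen_tids;
  int refcount = 0;

  void handle_increment(std::uint64_t tid) {
    if (!seen_tids.insert(tid).second) return;  // same tid: duplicate
    ++refcount;
  }
};

// The same logical set_chunk produces two increment messages with
// different tids, so the tid-based check cannot match them.
inline int refcount_after_requeue() {
  LowTierChunk chunk;
  chunk.handle_increment(100);  // first handling of set_chunk
  chunk.handle_increment(101);  // second handling, new tid
  return chunk.refcount;
}
```

In this model the refcount ends at 2 even though only one reference exists.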

To resolve this issue, I think there are two options. The first is to use update_log_only to detect dup ops, but set_chunk is not a write operation, so I am not sure the following code is the right way.

  int result = do_osd_ops(ctx, *ctx->ops);
  if (result < 0) {
    if ((ctx->op->may_write() &&
        get_osdmap()->require_osd_release >= ceph_release_t::kraken) ||
        // proposed addition: also log cache ops that return -EINPROGRESS
        (ctx->op->cache() && result == -EINPROGRESS)) {
      // need to save the error code in the pg log, to detect dup ops,
      // but do nothing else
      ctx->update_log_only = true;
    }
    return result;
  }

The second is to make the reference count operation idempotent, as described in the commits. What do you think?

athanatos · 2021-07-21T23:25:24Z

In step 5, why doesn't it see the in progress op?

myoungwon · 2021-07-22T04:17:05Z

    bool got = check_in_progress_op(
      m->get_reqid(), &version, &user_version, &return_code, &op_returns);

For an op to be detected as in progress, it must be recorded in either the projected_log or the recovery state. In this case, however, the first set_chunk op returns EINPROGRESS without updating the log, so check_in_progress_op returns false. Am I wrong?

athanatos · 2021-07-23T22:07:21Z

Ah, right, so the previous op is in progress (returned -EINPROGRESS and will be completed later), but the repop hasn't been submitted yet and therefore it's not in the projected log. I think the fix is to add a registry of requests in that state and check them in check_in_progress_op as well.
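
A rough sketch of what such a registry might look like. Everything here is hypothetical: `InProgressRegistry`, its methods, and the use of a plain integer as a stand-in for the request id are illustration only and do not reflect the actual PrimaryLogPG code.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Simplified stand-in for a request id; only equality and ordering matter.
using ReqId = std::uint64_t;

// Hypothetical registry consulted in addition to the projected log.
class InProgressRegistry {
 public:
  // Op returned -EINPROGRESS; its repop has not been submitted yet.
  void mark_pending(ReqId r) { pending_.insert(r); }

  // Repop submitted: the op is now visible through the log.
  void mark_logged(ReqId r) {
    pending_.erase(r);
    logged_.insert(r);
  }

  bool check_in_progress(ReqId r) const {
    return logged_.count(r) > 0 || pending_.count(r) > 0;
  }

 private:
  std::set<ReqId> logged_;   // stands in for the projected log
  std::set<ReqId> pending_;  // the extra state proposed above
};

// Handle the same set_chunk twice and count the increments sent.
inline int increments_sent_with_registry() {
  InProgressRegistry reg;
  int increments = 0;
  const ReqId set_chunk = 42;
  for (int attempt = 0; attempt < 2; ++attempt) {
    if (reg.check_in_progress(set_chunk)) continue;  // duplicate: skip
    reg.mark_pending(set_chunk);
    ++increments;  // send the increment to the low tier
  }
  return increments;
}
```

In this model the second attempt is recognized as a duplicate as long as the pending set survives between the two attempts.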

myoungwon · 2021-07-28T14:00:44Z

@athanatos Sorry, I missed one thing: step 4 is triggered by requeue_op(), not issued by the user. So, the scenario is as follows.
(from /a/kchai-2021-05-27_09:22:11-rados-wip-kefu-testing-2021-05-27-1528-distro-basic-smithi/6137821/remote/smithi134/log/ceph-osd.1.log.gz)

1. User issues set_chunk
2. OSD receives the set_chunk and sends an increment message to an object in the low tier (INPROGRESS).
3. OSD map is changed (841 → 843)
3.5. on_change() is called
4. The set_chunk op is requeued by requeue_op()
5. OSD handles the duplicated set_chunk, but it cannot tell that the set_chunk is a duplicate because the op has not been logged to disk yet.
6. OSD issues the increment message again to the object in the low tier. (The increment operation is executed twice.)

As a result, I think adding a registry (something like unordered_map<osd_reqid_t, version>) to check_in_progress_op() is not appropriate, because the registry would have to be dropped when on_change is called. What do you think?

athanatos · 2021-07-29T22:22:13Z

I'll take a closer look at this next week.

myoungwon · 2021-08-11T13:00:12Z

@athanatos ping

github-actions · 2021-08-13T17:32:22Z

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

athanatos · 2021-08-13T22:38:08Z

src/cls/cas/cls_cas_client.h

@@ -17,17 +17,20 @@ void cls_cas_chunk_create_or_get_ref(
  librados::ObjectWriteOperation& op,
  const hobject_t& soid,
  const bufferlist& data,
-  bool verify=false);
+  bool verify=false,
+  int count = -1);


The server side asserts that count isn't -1. When is it valid to leave this as the default?

athanatos · 2021-08-13T22:49:56Z

I'm not sure about this approach. The refcount design in general has always explicitly permitted the recorded count in the base pool to be >= the real refcount. I see two ways a replayed operation can go:

1. Increment: This necessarily happens prior to the triggering cache pool operation completing and can therefore only result in leaked refcounts (safe by design).
2. Decrement: This necessarily happens subsequent to the triggering cache pool operation and can therefore only result in the decrement happening 0 or 1 times, also resulting in at worst a leaked refcount.

Neither case to me obviously breaks the contract. Specifying the source object refcount on submission doesn't fix 2) at all, and so also won't work correctly on 1) either as the recorded base refcount can still be an overestimate. Specifying the absolute refcount for a given source object also requires that the base pool actually track them, which probably won't scale in cases where a base pool object has a genuinely large number of incoming references.

As far as I can tell, this behavior is one case of the general refcount leak behavior we've specifically chosen to allow. The remedy is meant to be a background refcount scrubbing process to correct leaked references. Otherwise, we'd need a real intent log, and that would be expensive.

Am I missing something?
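
As an illustration of the contract described above (recorded count >= real count, with scrubbing as the remedy), here is a minimal sketch. The `Manifests` model and the `scrub_chunk` helper are hypothetical and are not the actual scrubbing tool.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: each source object lists the chunk ids it references.
using Manifests = std::map<std::string, std::vector<std::string>>;

// Count how many references to `chunk` really exist in the source pool.
inline int real_refcount(const Manifests& sources, const std::string& chunk) {
  int n = 0;
  for (const auto& kv : sources) {
    n += static_cast<int>(
        std::count(kv.second.begin(), kv.second.end(), chunk));
  }
  return n;
}

// Lower an overestimated recorded count to the real count.
// Returns the number of leaked references that were dropped.
inline int scrub_chunk(const Manifests& sources, const std::string& chunk,
                       int& recorded) {
  const int real = real_refcount(sources, chunk);
  const int leaked = std::max(0, recorded - real);
  recorded -= leaked;
  return leaked;
}
```

A replayed increment only ever pushes `recorded` above the real count, so a pass like this can bring it back down without risking removal of a chunk that is still referenced.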

Current dedup allows multiple identical sources to be recorded using a multiset, which results in an inconsistent situation as follows (during set_chunk, but not confined to set_chunk).

1. User issues set_chunk
2. OSD receives the set_chunk, and sends increment message to an object in the low tier (INPROGRESS).
3. OSD map is changed (841 → 843)
3.5. on_change() is called
4. the set_chunk op is reenqueued by requeue_op()
5. OSD handles the duplicated set_chunk, but it is not able to know the set_chunk is duplicated because it is not logged on the disk yet.
6. OSD issues increment message again to the object in the low tier. (increment operation is executed twice)

To fix this, this commit allows >= the real refcount in test cases

fixes: https://tracker.ceph.com/issues/51000

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>

myoungwon · 2021-08-17T13:28:56Z

@athanatos Yeah, right. It also requires overhead to track the references, and the mismatch can be fixed by the background refcount scrubbing process. The previous commits were only intended to avoid reference mismatches as much as possible in order to save space.

Anyway, if so, the updated commit avoids the false alarm in the tests. Can you take a look?
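
The relaxed check can be pictured like this. `refcount_acceptable` is a hypothetical helper used only for illustration; it is not the function changed in the commit.

```cpp
#include <cassert>

// Hypothetical test helper: a recorded count that is equal to or larger
// than the expected count is accepted, since leaked references are allowed.
inline bool refcount_acceptable(int recorded, int expected) {
  return recorded >= expected;
}
```

A recorded count below the expected one still fails, because that would mean a live reference is missing.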

tchaikov · 2021-08-19T07:41:47Z

I don't think the failures are related.

myoungwon added the bug-fix label Jul 13, 2021

github-actions bot added core tests labels Jul 13, 2021

myoungwon force-pushed the wip-51000 branch from 24bd5df to 67efb7a Compare July 20, 2021 10:14

myoungwon requested review from athanatos and tchaikov July 20, 2021 10:16

github-actions bot added the needs-rebase label Aug 13, 2021

athanatos reviewed Aug 13, 2021

myoungwon force-pushed the wip-51000 branch from 67efb7a to 126df96 Compare August 17, 2021 12:56

github-actions bot removed the needs-rebase label Aug 17, 2021

myoungwon changed the title ~~WIP: osd: fix to make inc/dec manifest operation idempotent~~ osd: fix to make inc/dec manifest operation idempotent Aug 17, 2021

myoungwon changed the title ~~osd: fix to make inc/dec manifest operation idempotent~~ osd: fix to make inc manifest leaked Aug 17, 2021

myoungwon changed the title ~~osd: fix to make inc manifest leaked~~ osd: fix to allow inc manifest leaked Aug 17, 2021

athanatos approved these changes Aug 17, 2021

athanatos added the needs-qa label Aug 17, 2021

tchaikov added the wip-kefu-testing label Aug 18, 2021

tchaikov merged commit eab4b58 into ceph:master Aug 19, 2021

This was referenced Sep 27, 2021

pacific: osd: fix to recover adjacent clone when set_chunk is called #43099

Merged

pacific: osd: fix to allow inc manifest leaked #43306

Merged
