Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: fix to allow inc manifest leaked #42302

Merged
merged 1 commit into from Aug 19, 2021
Merged

Conversation

myoungwon
Copy link
Member

@myoungwon myoungwon commented Jul 13, 2021

Current dedup allows to contain multiple same sources using
multiset, which results in inconsistent situation as follow
(during set_chunk, but not confined in set_chunk).

  1. Complete primary write
  2. Failure occurs before replication is done
  3. Cancel repop
  4. Requeue the op

Due to 4, inc. op is applied one more time. At this time,
current inc op---this is similar to ++ operation--- increases
the ref. count once again.
To prevent this, inc/dec op needs to be idempotent using
absolute value.

fixes: https://tracker.ceph.com/issues/51000

Signed-off-by: Myoungwon Oh myoungwon.oh@samsung.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@myoungwon
Copy link
Member Author

@athanatos @tchaikov

@athanatos
Copy link
Contributor

I'm not sure I understand the scenario. client->osd op re submission will detect and drop duplicates.

@myoungwon
Copy link
Member Author

I misunderstood why increment operation is applied twice. Please look at the following scenario.

  1. User issues set_chunk
  2. OSD receives the set_chunk, and sends increment message to an object in the low tier (INPROGRESS).
  3. OSD map is changed (841 → 843)
  4. User re-issues set_chunk with 843
  5. OSD receives the duplicated set_chunk, but it is not able to know the set_chunk is duplicated because it does not log on the disk yet.
  6. OSD issues increment message again to the object in the low tier

The message sent at 2 and the message sent at 6 have different tid, so OSD can not recognize they are the same. As a result,
increment operation is executed twice.

To resolve this issue, I suppose that there are two options. First is to use update_log_only to detect dup ops, but set_chunk is not write_operations, so I am not sure the following code is a right way.

  int result = do_osd_ops(ctx, *ctx->ops);                                                                                               
  if (result < 0) {
    if ((ctx->op->may_write() && 
        get_osdmap()->require_osd_release >= ceph_release_t::kraken) ||
        **(ctx->op->cache() && result == -EINPROGESS)**) {                                                                   
      // need to save the error code in the pg log, to detect dup ops,                                                                   
      // but do nothing else 
      ctx->update_log_only = true;                                                                                                       
    }
    return result;                                                                                                                       
  }  

Second is to make reference count operation idempotent as described in the commits. What do you think?

@athanatos
Copy link
Contributor

In step 5, why doesn't it see the in progress op?

@myoungwon
Copy link
Member Author

    bool got = check_in_progress_op(
      m->get_reqid(), &version, &user_version, &return_code, &op_returns);

To check whether it is in progress op, the op should be recorded in either projected_log or recovery state. But, in this case, the first set_chunk op returns EINPROGRESS without updating log. So, check_in_progess_op returns false. Am I wrong?

@athanatos
Copy link
Contributor

Ah, right, so the previous op is in progress (returned -EINPROGRESS and will be completed later), but the repop hasn't been submitted yet and therefore it's not in the projected log. I think the fix is to add a registry of requests in that state and check them in check_in_progress_op as well.

@myoungwon
Copy link
Member Author

myoungwon commented Jul 28, 2021

@athanatos Sorry, I missed one thing---step 4 is triggered by requeue_op(), not issued by the user. So, the scenario is as follows.
(from /a/kchai-2021-05-27_09:22:11-rados-wip-kefu-testing-2021-05-27-1528-distro-basic-smithi/6137821/remote/smithi134/log/ceph-osd.1.log.gz)

  1. User issues set_chunk
  2. OSD receives the set_chunk, and sends increment message to an object in the low tier (INPROGRESS).
  3. OSD map is changed (841 → 843)
    3.5. on_change() is called
  4. the set_chunk op is reenqueued by requeue_op()
  5. OSD handles the duplicated set_chunk, but it is not able to know the set_chunk is duplicated because it does not log on the disk yet.
  6. OSD issues increment message again to the object in the low tier. (increment operation is executed twice)

As a result, I think adding registry (something like unorded_map<osd_reqid_t, version>) in check_in_progress_op() is not appropriate because the registry should be dropped when on_change is called. What do you think?

@athanatos
Copy link
Contributor

I'll take a closer look at this next week.

@myoungwon
Copy link
Member Author

@athanatos ping

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@@ -17,17 +17,20 @@ void cls_cas_chunk_create_or_get_ref(
librados::ObjectWriteOperation& op,
const hobject_t& soid,
const bufferlist& data,
bool verify=false);
bool verify=false,
int count = -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server side asserts that count isn't -1. When is it valid to leave this as the default?

@athanatos
Copy link
Contributor

I'm not sure about this approach. The refcount design in general has always explicitly permitted the recorded count in the base pool to be >= the real refcount. I see two ways a replayed operation can go:

  1. Increment: This necessarily happens prior to the triggering cache pool operation completing and can therefore only result in leaked refcounts (safe by design).
  2. Decrement: This necessary happens subsequent to the triggering cache pool operations and can therefore only result in the decrement happening 0 or 1 times also resulting in at worst a leaked refcount.

Neither case to me obviously breaks the contract. Specifying the source object refcount on submission doesn't fix 2) at all, and so also won't work correctly on 1) either as the recorded base refcount can still be an overestimate. Specifying the absolute refcount for a given source object also requires that the base pool actually track them, which probably won't scale in cases where a base pool object has a genuinely large number of incoming references.

As far as I can tell, this behavior is one case of the general refcount leak behavior we've specifically chosen to allow. The remedy is meant to be a background refcount scrubbing process to correct leaked references. Otherwise, we'd need a real intent log, and that would be expensive.

Am I missing something?

Current dedup allow to contain multiple same sources using
multiset, which results in inconsistent situation as follow
(during set_chunk, but not confined in set_chunk).

1. User issues set_chunk
2. OSD receives the set_chunk, and sends increment message
to an object in the low tier (INPROGRESS).
3. OSD map is changed (841 → 843)
3.5. on_change() is called
4. the set_chunk op is reenqueued by requeue_op()
5. OSD handles the duplicated set_chunk, but it is not able to
 know the set_chunk is duplicated because it does not log on the disk yet.
6. OSD issues increment message again to the object
in the low tier. (increment operation is executed twice)

To fix this, this commit allows >= the real refcount in test cases

fixes: https://tracker.ceph.com/issues/51000

Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
@myoungwon
Copy link
Member Author

@athanatos Yeah, right, it also requires the overhead to track the reference and the mismatch can be fixed via the background refcount scrubbing process. The previous commits just intend to avoid reference mismatch as possible to save space.

Anyway, if so, updated commit can fix to avoid a false alarm. Can you take a look?

@myoungwon myoungwon changed the title WIP: osd: fix to make inc/dec manifest operation idempotent osd: fix to make inc/dec manifest operation idempotent Aug 17, 2021
@myoungwon myoungwon changed the title osd: fix to make inc/dec manifest operation idempotent osd: fix to make inc manifest leaked Aug 17, 2021
@myoungwon myoungwon changed the title osd: fix to make inc manifest leaked osd: fix to allow inc manifest leaked Aug 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants