
crimson/osd: don't clone non-existing objects #48373

Merged
merged 2 commits into ceph:main on Nov 29, 2022

Conversation

@rzarzynski (Contributor) commented Oct 5, 2022

The differences

For a write against a head object that has never been written, crimson issues a clone (note "cloning v 0'0" and the clone op in the transaction dump below); the classical OSD handles the same write without cloning.

crimson

DEBUG 2022-10-04 20:10:24,710 [shard 0] osd - client_request(id=37, detail=m=[osd_op(client.4117.0:40 1.6 1:65e2c3d8:::rbd_data.1015fc34d4d2.0000000000000000:head {set-alloc-hint object_size 4194304 write_size 4194304, write 0~80 in=80b} snapc 4={4} ondisk+write+known_if_redirected+supports_pool_eio e9) v8]).0: after await_map stage
...
DEBUG 2022-10-04 20:10:24,710 [shard 0] osd - cloning v 0'0 to 1:65e2c3d8:::rbd_data.1015fc34d4d2.0000000000000000:4 v 9'2 snaps= {4} snapset=0={}:{4={4}}
DEBUG 2022-10-04 20:10:24,710 [shard 0] osd - make_writeable 1:65e2c3d8:::rbd_data.1015fc34d4d2.0000000000000000:head done, snapset=4={}:{4={4}}
...
DEBUG 2022-10-04 20:10:24,711 [shard 0] osd - final snapset 4={}:{4={4}} in 1:65e2c3d8:::rbd_data.1015fc34d4d2.0000000000000000:head
TRACE 2022-10-04 20:10:24,712 [shard 0] bluestore - _dump_transaction transaction dump:
{
    "ops": [
// ...
        {
            "op_num": 1,
            "op_name": "clone",
            "collection": "1.6_head",
            "src_oid": "#1:65e2c3d8:::rbd_data.1015fc34d4d2.0000000000000000:head#",
            "dst_oid": "#1:65e2c3d8:::rbd_data.1015fc34d4d2.0000000000000000:4#"
        },
// ...
        {
            "op_num": 6,
            "op_name": "touch",
            "collection": "1.6_head",
            "oid": "#1:65e2c3d8:::rbd_data.1015fc34d4d2.0000000000000000:head#"
        },
// ...
    ]
}

The classical OSD

2022-09-26T09:46:53.502+0000 7f74c1e7b700 20 osd.0 pg_epoch: 11 pg[1.0( v 11'7 (0'0,11'7] local-lis/les=7/8 n=4 ec=7/7 lis/c=7/7 les/c/f=8/8/0 sis=7) [0] r=0 lpr=7 crt=11'7 lcod 10'6 mlcod 10'6 active+clean ps=[2~1]] do_op: op osd_op(client.4117.0:40 1.0 1:2f0692fe:::rbd_data.10152970f466.0000000000000000:head [set-alloc-hint object_size 4194304 write_size 4194304,write 0~80 in=80b] snapc 4=[4] ondisk+write+known_if_redirected+supports_pool_eio e11) v8
...
2022-09-26T09:46:53.502+0000 7f74c1e7b700 20 osd.0 pg_epoch: 11 pg[1.0( v 11'7 (0'0,11'7] local-lis/les=7/8 n=4 ec=7/7 lis/c=7/7 les/c/f=8/8/0 sis=7) [0] r=0 lpr=7 crt=11'7 lcod 10'6 mlcod 10'6 active+clean ps=[2~1]] make_writeable 1:2f0692fe:::rbd_data.10152970f466.0000000000000000:head snapset=0=[]:{}  snapc=4=[4]
2022-09-26T09:46:53.502+0000 7f74c1e7b700 20 osd.0 pg_epoch: 11 pg[1.0( v 11'7 (0'0,11'7] local-lis/les=7/8 n=4 ec=7/7 lis/c=7/7 les/c/f=8/8/0 sis=7) [0] r=0 lpr=7 crt=11'7 lcod 10'6 mlcod 10'6 active+clean ps=[2~1]]  setting DIRTY flag
2022-09-26T09:46:53.502+0000 7f74c1e7b700 20 osd.0 pg_epoch: 11 pg[1.0( v 11'7 (0'0,11'7] local-lis/les=7/8 n=4 ec=7/7 lis/c=7/7 les/c/f=8/8/0 sis=7) [0] r=0 lpr=7 crt=11'7 lcod 10'6 mlcod 10'6 active+clean ps=[2~1]] make_writeable 1:2f0692fe:::rbd_data.10152970f466.0000000000000000:head done, snapset=4=[]:{}
...
2022-09-26T09:46:53.502+0000 7f74c1e7b700 10 osd.0 pg_epoch: 11 pg[1.0( v 11'7 (0'0,11'7] local-lis/les=7/8 n=4 ec=7/7 lis/c=7/7 les/c/f=8/8/0 sis=7) [0] r=0 lpr=7 crt=11'7 lcod 10'6 mlcod 10'6 active+clean ps=[2~1]]  final snapset 4=[]:{} in 1:2f0692fe:::rbd_data.10152970f466.0000000000000000:head


Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski rzarzynski marked this pull request as ready for review October 5, 2022 14:06
@rzarzynski rzarzynski requested a review from a team as a code owner October 5, 2022 14:06
-  if (head_os.exists && // old obs.exists
+  if (initial_os.existed && !initial_os.is_whiteout && // old obs.exists
      snapc.snaps.size() && // there are snaps
      snapc.snaps[0] > obc->ssc->snapset.seq) { // existing obj is old
Contributor
Is it necessary to create a new initial_os that is used only here? head_os is also used in other places; how about just using head_os here?

@Matan-B (Contributor) Oct 6, 2022

We could add an ObjectState member to the initial_os_t struct instead of bool existed;.
That way we could drop head_os entirely and update all of its users to use initial_os.
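
A minimal sketch of that suggestion (the exact layout is assumed; initial_os_t and ObjectState are the real names):

struct initial_os_t {
  // Freeze the whole head state at transaction start instead of
  // tracking individual bits such as `bool existed;`.
  ObjectState os;
};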

@rzarzynski (Author)

Nope, head_os has a completely different update strategy: it's updated before executing each op. What if there are multiple of them? head_os will hold the state corresponding to op n - 1, not necessarily op 0.

OpsExecuter::interruptible_errorated_future<OpsExecuter::osd_op_errorator>
OpsExecuter::execute_op(OSDOp& osd_op)
{
  head_os = obc->obs;
  return do_execute_op(osd_op).handle_error_interruptible(

Anyway, my intuition is that head_os was forged as a counterpart of the classical old obs. If so, it will likely need a rework. I would also prefer to be more selective when capturing the old state.
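
Concretely (a hypothetical trace for a request carrying three ops; op0..op2 are illustrative names):

// execute_op(op0): head_os = obc->obs;  // state before op0 == the initial state
// execute_op(op1): head_os = obc->obs;  // state after op0 -- no longer initial
// execute_op(op2): head_os = obc->obs;  // state after op1, i.e. after op n - 1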

Contributor

In the classic OSD there is a struct OpContext with two members: const ObjectState *obs; // Old objectstate and ObjectState new_obs; // resulting ObjectState. I think initial_os here matches obs and head_os matches new_obs. Can we rename head_os so it's more easily understood? I also agree with Matan-B about capturing an ObjectState into initial_os instead of the two bool members.

@rzarzynski (Author)

Why copy an entire, complex struct (see object_manifest_t, which is inside) when all we need at the moment is basically 2 bits? I truly doubt we'll ever reach a situation where OpsExecuter cares about more than a mere 5% of the information the old obs carries. The classical OSD has the combo (obs + new_obs) so it doesn't pollute the cached object state when a failure occurs during op processing; crimson-osd solves this problem completely differently: by reloading the object state from the ObjectStore (please see the comment in OpsExecuter::RollbackHelper<Func>::rollback_obc_if_modified).

To be clear: I'm not so terribly worried about the machine's performance. I dislike the concept because inflating the program state to track imposes an extra mental burden on the programmer.
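
A rough sketch of the contrast (pseudo-code; only obs, new_obs and rollback_obc_if_modified are names taken from the actual code):

// classical OSD: keep two copies, so a failed op never dirties the cache
//   const ObjectState *obs;  // old state, never mutated
//   ObjectState new_obs;     // mutated by ops; simply discarded on failure
//
// crimson-osd: mutate obc->obs in place; if an op fails after modifying
// the state, reload it from the ObjectStore instead of keeping a second
// copy (see OpsExecuter::RollbackHelper<Func>::rollback_obc_if_modified).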

@athanatos (Contributor) Oct 13, 2022

@rzarzynski head_os appears to have two users:

  • OpsExecuter::make_writeable (removed by this patch)
  • OpsExecuter::prepare_clone (called from make_writeable)

@Matan-B introduced it with the snap machinery in 7171ca6, replacing head_existed, because the snapshot pathway requires the original object_info_t fields used in object_info_t::copy_user_bits, called from OpsExecuter::prepare_clone.

I think the bug here is simply that the update in execute_op is wrong -- it should be initialized in the constructor at the start of the transaction. I think head_existed had the same problem. OpsExecuter::make_writeable needs the exists state from the start of the transaction, and OpsExecuter::prepare_clone needs to invoke object_info_t::copy_user_bits on the object_info as it existed at the start of the transaction (the logical state that the clone will be in).

I think there are two paths forward.

  1. Retain initial_os_t: Copying the entire ObjectState does capture an unnecessary amount of information. However, in order to handle this correctly, initial_os_t would also need to capture the object_info_t state required for copy_user_bits.
  2. Retain a full ObjectState: Basically, rename head_os to initial_os and capture in the OpsExecutor constructor.

I think 2. is simpler (sketched below). It avoids needing to duplicate logic between object_info_t and initial_os_t (anything added to these "user bits" would need to be reflected back into initial_os_t). It also allows us to remove the logic required to re-read the object context upon error. Later on, once we have an effective test suite working reliably and some performance measurements, we can always revisit this question.

Either way, we shouldn't retain both head_os and initial_os.
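
An illustrative sketch of option 2 (the constructor shape is assumed; only the rename and the capture point come from the discussion):

// Capture the full ObjectState exactly once, at transaction start;
// execute_op() no longer reassigns it per op.
OpsExecuter::OpsExecuter(ObjectContextRef obc_ /* , ... */)
  : obc(std::move(obc_)),
    initial_os(obc->obs)  // assumes the obc member is declared before initial_os
{}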

@rzarzynski (Author)

copy_user_bits() indeed uses a bunch of the old oi:

void object_info_t::copy_user_bits(const object_info_t& other)
{
  // these bits are copied from head->clone.
  size = other.size;
  mtime = other.mtime;
  local_mtime = other.local_mtime;
  last_reqid = other.last_reqid;
  truncate_seq = other.truncate_seq;
  truncate_size = other.truncate_size;
  flags = other.flags;
  user_version = other.user_version;
  data_digest = other.data_digest;
  omap_digest = other.omap_digest;
}

However, I still have the impression something terrible is going on here:

  1. the obs is duplicated to freeze the initial object state, so that
  2. PGBackend::clone() can be executed after all the operations, which in turn
  3. results in the need to reorder the low-level txn to place the clone op at its beginning (in 80103ca; reworked & extended in "os, crimson/osd: reorder ops in os::Transaction to mimic classical PGTransaction" #48361):
void PGBackend::clone(
  object_info_t& snap_oi,
  ObjectState& os,
  ObjectState& d_os,
  ceph::os::Transaction& txn)
{
  // Prepend the cloning operation to txn
  ceph::os::Transaction c_txn;
  c_txn.clone(coll->get_cid(), ghobject_t{os.oi.soid}, ghobject_t{d_os.oi.soid});
  // Operations will be removed from txn while appending
  c_txn.append(txn);
  txn = std::move(c_txn);
}

Maybe a third way exists. Why don't we clone right at the beginning, before executing any op? Changes made to obc::ssc could be dealt with, I think (currently only CEPH_OSD_OP_DELETE takes care of the to-whiteout-or-not-to-whiteout decision; CEPH_OSD_OP_LIST_SNAPS is coming, IIRC).
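
A rough sketch of that third way (hypothetical; needs_clone and clone_oid are illustrative names, not the actual API):

// Issue the clone up front, before any op mutates head; then neither the
// txn reordering nor a frozen copy of the old obs would be needed.
if (needs_clone(snapc, obc->ssc->snapset)) {  // hypothetical predicate
  txn.clone(coll->get_cid(),
            ghobject_t{obc->obs.oi.soid},     // head, still unmodified
            ghobject_t{clone_oid});           // hypothetical clone oid
}
// ... then execute the ops against head as usual.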

@athanatos (Contributor) Oct 14, 2022

Doing the clone at the beginning might work as long as it ends up being simple.

@rzarzynski (Author)

:-)))) E(xperiment)IP!

@rzarzynski (Author)

Sent as a separate PR: #48527.

@Matan-B Matan-B merged commit 57391ab into ceph:main Nov 29, 2022