New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crimson/osd: don't clone non-existing objects #48373
Conversation
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
if (head_os.exists && // old obs.exists | ||
snapc.snaps.size() && // there are snaps | ||
snapc.snaps[0] > obc->ssc->snapset.seq) { // existing obj is old | ||
if (initial_os.existed && !initial_os.is_whiteout && // old obs.exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to create a new initial_os, which is just used here? head_os is also used in other places, how about just use head_os here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add ObjectState
member to initial_os_t
struct instead of bool existed;
.
That way we can drop head_os
entirely and update all of its users to use initial_os
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, head_os
has completely different update strategy. It's updated before executing an op. What if there are multiple of them? head_os
will hold the state corresponding to n - 1
, not necessarily 0
.
OpsExecuter::interruptible_errorated_future<OpsExecuter::osd_op_errorator>
OpsExecuter::execute_op(OSDOp& osd_op)
{
head_os = obc->obs;
return do_execute_op(osd_op).handle_error_interruptible(
Anyway, my intuition is that head_os
was forged as a counterpart of the classical old obs. If so, likely it will need a rework. I would also prefer to be more selective when capturing the old state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in classic osd, there is a struct OpContext and two members const ObjectState *obs; // Old objectstate and ObjectState new_obs; // resulting ObjectState, I think here intial_os match obs and head_os match new_obs. Can we rename head_os to be easily understood? and agree Matan-B to capture ObjectState into intial_os insead of two bool members.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why to copy entire, complex struct (see object_manifest_t
which is inside) while all we need at the moment is basically 2 bits? I truly doubt we'll ever reach to a situation where OpsExecuter
would care about mere 5% of the information the old obs carries. The classical OSD has the combo (obs
+ new_obs
) to not pollute the cached object state when a failure occurs during an Op
processing; crimson-osd solves this problem completely differently: by reloading the obj state from ObjectStore
(please see the comment in OpsExecuter::RollbackHelper<Func>::rollback_obc_if_modified
).
To be clear: I'm not so terribly worried about a machine's performance. I very dislike the concept because inflating the program-state-to-track imposes extra mental burden on programmer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rzarzynski head_os appears to have two users
- OpsExecutor::make_writeable (removed by this patch)
- OpsExecutor::prepare_clone (called from make_writeable)
@Matan-B introduced it with snap machinery in 7171ca6 to replace head_existed because the snapshot pathway requires the original object_info_t fields used in object_info_t::copy_user_bits called from OpsExecutor::prepare_clone.
I think the bug here is simply that the update in execute_op is wrong -- it should be initialized in the constructor at the start of the transaction. I think head_existed had the same problem. OpsExecutor::make_writeable needs the exists state from the start of the transaction and OpsExecutor::prepare_clone needs to invoke object_info_t::copy_user_bits on the object_info as it existed at the start of the transaction (the logical state that the clone will be in).
I think there are two paths forward.
- Retain initial_os_t: Copying the entire ObjectState does capture an unnecessary amount of information. However, in order to handle this correctly, initial_os_t would also need to capture the object_info_t state required for copy_user_bits.
- Retain a full ObjectState: Basically, rename head_os to initial_os and capture in the OpsExecutor constructor.
I think 2. is simpler. It avoids needing to duplicate logic between object_info_t and initial_os_t (anything added to these "user bits" would need to be reflected back into initial_os_t). It also allow us to remove the logic required to re-read the object context upon error. Later on, once we have an effective test suite working reliably and some performance measurements, we can always revisit this question.
Either way, we shouldn't retain both head_os and initial_os.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy_user_bits()
indeed uses a bunch of the old oi:
void object_info_t::copy_user_bits(const object_info_t& other)
{
// these bits are copied from head->clone.
size = other.size;
mtime = other.mtime;
local_mtime = other.local_mtime;
last_reqid = other.last_reqid;
truncate_seq = other.truncate_seq;
truncate_size = other.truncate_size;
flags = other.flags;
user_version = other.user_version;
data_digest = other.data_digest;
omap_digest = other.omap_digest;
}
However, I still have the impression something terrible is going on here:
- the
obs
is duplicated to freeze the initial object state, so PGBackend::clone()
can be executed after all the operations which in turn- results in the need to reorder the low-level
txn
to place clone op at its beginning (in 80103ca; reworked & extended in os, crimson/osd: reoder ops in os::Transaction to mimic classical PGTransaction #48361):
void PGBackend::clone(
object_info_t& snap_oi,
ObjectState& os,
ObjectState& d_os,
ceph::os::Transaction& txn)
{
// Prepend the cloning operation to txn
ceph::os::Transaction c_txn;
c_txn.clone(coll->get_cid(), ghobject_t{os.oi.soid}, ghobject_t{d_os.oi.soid});
// Operations will be removed from txn while appending
c_txn.append(txn);
txn = std::move(c_txn)
Maybe a third way exists. Why we don't clone just at the beginning, before executing any op? Changes made to obc::ssc
could be dealt with, I think (currently only CEPH_OSD_OP_DELETE
takes care for the sake of to-whiteout-or-not-whiteout; CEPH_OSD_OP_LIST_SNAPS
is coming IIRC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing the clone at the beginning might work as long as it ends up being simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:-)))) E(xperiment)IP!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sent as separate PR: #48527.
The differences
crimson
The classical OSD
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Checklist
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows