Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mds/MDLog: manage the lifetime of a LogSegment with shared pointers #53086

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

leonid-s-usov
Copy link
Contributor

  • Make sure that any structure referring to a LogSegment is using a shared pointer to do that
  • Limit the API changes by enhancing the LogSegment class with an enable_shared_from_this base
    • Make LogSegment::LogSegment private and expose a create method to correctly initialize the base class
    • Introduce AutoSharedLogSegment helper class that recovers a shared pointer from a LogSegment*

Fixes: https://tracker.ceph.com/issues/4744

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@github-actions github-actions bot added the cephfs Ceph File System label Aug 22, 2023
@leonid-s-usov leonid-s-usov requested a review from a team August 22, 2023 17:37
@leonid-s-usov leonid-s-usov force-pushed the shared-log-segment branch 9 times, most recently from 430d867 to 57df0f1 Compare August 23, 2023 20:30
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern with this PR is that it doesn't actually address:

"These really ought to be ref-counted in some way to prevent early expiry."

from https://tracker.ceph.com/issues/4744

It's not well laid out what the concern is but I suspect it's something like to prevent tickets like:

https://tracker.ceph.com/issues/59119

You could potentially resolve that ticket with this PR by checking that the LogSegment reference count is 1 before trimming* the segment.

  • I believe Greg actually meant "early trimming" as that's the real concern: that the log segment will be removed from the physical journal.

cc @gregsfortytwo


class MDSRank;
class LogSegment;
class EMetaBlob;
using AutoSharedLogSegment = auto_shared_ptr<LogSegment>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'm mystified by the utility of this class. Why couldn't it be:

using foo = std::shared_ptr<LogSegment>

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in the commit message, the utility of this class is to automatically convert a pointer to an object that inherits from enable_shared_from_this into a shared pointer and also adds an implicit conversion from the shared pointer back to a plain pointer. This helps minimize refactoring.

I'll elaborate on how I believe this helps address the original ticket in a separate comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally we use boost::intrusive_ptr if other accessors have a raw pointer. That's somewhat inappropriate in a code base we fully control, the mds.

Your autoshared class is interesting though. I'll have to think about this more.


[[nodiscard]] static AutoSharedLogSegment create(uint64_t _seq, loff_t off = -1)
{
return std::shared_ptr<LogSegment>(new LogSegment(_seq, off));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_shared is inaccessible here because the constructor is private. See class Best() in the example from the documentation on shared_ptr

Additionally, make_shared comes at the cost of pinning the memory of the whole object when only weak pointers are left since the same memory block is used to hold the shared pointer control and the class data. This is mentioned in the second note on the documentation page you've referenced.
To clarify, I'm not sure this will be a blocker for us, and maybe we could even benefit from it, but I think it's an important point to keep in mind

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make_shared is inaccessible here because the constructor is private. See class Best() in the example from the documentation on shared_ptr

ok

Additionally, make_shared comes at the cost of pinning the memory of the whole object when only weak pointers are left since the same memory block is used to hold the shared pointer control and the class data. This is mentioned in the second note on the documentation page you've referenced. To clarify, I'm not sure this will be a blocker for us, and maybe we could even benefit from it, but I think it's an important point to keep in mind

I don't really think that will be significant but I also wonder if a clever implementation is permitted to realloc the memory block when T is destroyed. What an odd caveat I had the opposite understanding about.

@leonid-s-usov
Copy link
Contributor Author

My main concern with this PR is that it doesn't actually address:

"These really ought to be ref-counted in some way to prevent early expiry."

@batrick I do think that this PR addresses the original ticket. It does so by changing all containers to hold a shared pointer to the LogSegment instead of a plain pointer.

It wouldn't be sufficient to only update the storage semantics, we also needed to correctly create shared pointers to the same original object. To achieve this I decided to utilize the enable_shared_from_this interface. Basically, it embeds the weak pointer to the object in the object itself, so that whoever wants to take shared ownership of it can simply call shared_from_this() and hold on to the result.
I chose this option to minimize the changes to the code since most of the functions that receive a pointer to the LogSegment perform synchronous operations with the object and hence would not benefit from getting a shared pointer instead.

To further reduce the scope of refactoring I've introduced a helper class auto_shared_ptr which can implicitly convert an enabled plain pointer to a shared pointer and back.

You could potentially resolve that ticket with this PR by checking that the LogSegment reference count is 1 before trimming* the segment.

This point needs discussion, as I don't think this requirement is still applicable - though I'll be happy to learn if I'm wrong about this.

At the time the original ticket was created, mds trimming logic was different - and simpler. Back then trimming was done synchronously, but now trimming is staged through

  std::set<AutoSharedLogSegment> expired_segments;
  std::set<AutoSharedLogSegment> expiring_segments;

I haven't yet fully understood the logic, but it looks like there's already an asynchronous process that only trims segments that were reported as unused.

@batrick
Copy link
Member

batrick commented Aug 24, 2023

My main concern with this PR is that it doesn't actually address:
"These really ought to be ref-counted in some way to prevent early expiry."

@batrick I do think that this PR addresses the original ticket. It does so by changing all containers to hold a shared pointer to the LogSegment instead of a plain pointer.

You've refcounted LogSegment but you haven't used that to prevent early expiry. Specifically, to address that I think you need to change:

ceph/src/mds/MDLog.cc

Lines 842 to 851 in ce9771b

dout(20) << __func__ << ": expiring " << *ls2 << dendl;
expired_events -= ls2->num_events;
expired_segments.erase(ls2);
if (pre_segments_size > 0)
pre_segments_size--;
num_events -= ls2->num_events;
logger->inc(l_mdl_evtrm, ls2->num_events);
logger->inc(l_mdl_segtrm);
expire_pos = ls2->end;
delete ls2;

to not trim (i.e. delete) a segment unless its refcount is 1.

@leonid-s-usov
Copy link
Contributor Author

to not trim (i.e. delete) a segment unless its refcount is 1.

Well, I don't need to do anything for the log segment to be deleted only when the last strong reference to it is gone, that's the primary function of a shared pointer. In other words, if the delete operation is what we'd like to delay - that's already the case with the current implementation.

@leonid-s-usov leonid-s-usov changed the title mds: manage the lifetime of a LogSegment with shared pointers mds/MDLog: manage the lifetime of a LogSegment with shared pointers Sep 11, 2023
@leonid-s-usov leonid-s-usov marked this pull request as draft September 11, 2023 09:05
@leonid-s-usov
Copy link
Contributor Author

leonid-s-usov commented Sep 11, 2023

@batrick @vshankar I ended up changing quite a bit. Unfortunately, there is no unittest infrastructure for MDLog, but it's worth investing in it. Most if not all of my changes can be relatively quickly verified with unit tests, and MDLog does not have that many external dependencies compared to other core classes.

I've changed this PR to a draft so that I can upload the current version which compiles, and have you go over it and start giving overall comments. If we want to proceed with this, I will have another commit where I make MDLog constructible in a unit test environment and start adding unit tests to verify the invariants that I have most probably violated

For details about the changes please see the commit message

@gregsfortytwo
Copy link
Member

Specific questions to look at here:

  1. the locking changes around the submit lock, and taking the mds lock in the thread
  2. ambiguity around event counts (live versus expiring versus expired) versus health checks

src/mds/MDLog.cc Outdated
logger->set(l_mdl_wrpos, ls->end);
// acquire the mds lock until the end of the scope
// to update the LS offsets and major offsets
std::lock_guard l(mds->mds_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to go through these changes in more detail but this will not fly. The mds_lock has too much thread contention already. Also, the submit thread is intended to operate completely outside the mds_lock. I do not think we should ever change that.

I believe there may be another way to protect the LogSegment members behind the mds_lock: add the LogSegment to C_MDL_Flushed and update the segment's end/offset in the completion (::finish).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C_MDL_Flushed will take mds_lock, so not much difference in terms of contention. But I agree that we could benefit from performing the update on the finisher thread instead of the submit thread.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you said, the journaler thread already regularly acquires the mds_lock so it's advantageous to use it for this purpose instead of adding a new thread to the contention.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see a1c9d1a, I removed the need for locking.

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@vshankar
Copy link
Contributor

jenkins retest this please

@vshankar vshankar marked this pull request as ready for review November 13, 2023 11:24
@leonid-s-usov
Copy link
Contributor Author

@vshankar apparently, there are some mdlog unit test timeouts that I need to investigate. I will get back to it after the quiesce db

Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@github-actions github-actions bot removed the stale label Jan 30, 2024
Copy link

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

damaged_unlocked is another method in the same class that has
the same semantics but a better name, so now
hande_write_error_unlocked will take the mds lock before
executing handle_write_error.

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
* Make sure that any structure referring to a LogSegment is using a shared pointer to do that
* Limit the API changes by enhancing the LogSegment class with an `enable_shared_from_this` base
  - Make LogSegment::LogSegment private and expose a `create` method to correctly initialize the base class
  - Introduce AutoSharedLogSegment helper class that recovers a shared pointer from a LogSegment*

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Fixes: https://tracker.ceph.com/issues/4744
The main goal of this change is to keep expired segments as weak references,
allowing to spot "zombie" segments which should've been deleted by the time
we decide to trim them.
Since it's not a fatal error to have a dangling log segment given that it was
properly expired, the system will ignore zombies unless a dedicated new
debug flag is enabled:

- name: mds_debug_zombie_log_segments
  type: bool
  level: advanced
  default: false
  services:
  - mds
  fmt_desc: MDS will abort if an expired log segment is still
            strongly referenced (for developers only).

To achieve this it was necessary to change how segments are kept in the log.
Specifically, it was necessary to break up the `segments` map into two:
unexpired_segments and expired_segments, where the latter holds structs of
ExpiredSegmentInfo with a weak ref to the original log segment,
as well as some importany by-value copies for accounting and syncronization.

Since the change already required significant refactoring, it was used to deliver
a number of additional improvements:

* expiring segments don't need a separate container, they are now kept in `unexpired_segments`
  until they complete expiration. The expiration boundary is kept in a separate
  variable `last_expiring_segment_seq` which is sufficient.
* submit_mutex is now used exclusively to protect the `pending_events` when sending them
  to the submit thread. All other internal state is already protected by the mds_lock
* expiration waiters are now added as a single context associated with a segment sequence,
  which will be finished when no more unexpired segments exists with the sequence or below.
* expiration and trimming logic were optimized to make as much use of the O(log N) lookup
  in the std containers as possible.
* methods and fields were renamed to reduce ambiguity

Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Signed-off-by: Leonid Usov <leonid.usov@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants