
rgw multisite: automated trimming for bucket index logs #17761

Merged
merged 35 commits into ceph:master from cbodley:wip-rgw-bilog-trim on Nov 15, 2017

Conversation

@cbodley (Contributor) commented Sep 15, 2017

The bucket index logs (bilogs) used by multisite bucket sync are no longer needed once all peer zones have processed them. This patch set introduces a BucketTrimManager that runs in the existing RGWSyncLogTrimThread of RGWRados, and trims these unneeded bilog entries from all buckets.

The trim logic for a bucket shard is very similar to the existing trim logic for the metadata and data changes logs: query the sync status from each peer zone, calculate the minimum marker position, and trim log entries up to that position. This is implemented by BucketTrimInstanceCR, using a new get_bucket_index_log_status REST API to query the sync status.
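
A minimal synchronous sketch of that per-shard decision (hedged: the PR implements this asynchronously in BucketTrimInstanceCR, and treating the markers as plain lexicographically-ordered strings is an assumption made for illustration):

    #include <algorithm>
    #include <string>
    #include <vector>

    // given the sync-status marker reported by each peer zone for one
    // bucket shard, the bilog can safely be trimmed up to the minimum
    std::string safe_trim_marker(const std::vector<std::string>& peer_markers) {
      if (peer_markers.empty()) {
        return {};  // no peers reported, so trim nothing
      }
      return *std::min_element(peer_markers.begin(), peer_markers.end());
    }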

Unlike metadata and datalog trimming, where there is a single (sharded) log for each, there is a potentially unbounded number of bucket index logs to process. So most of the complexity here comes from deciding which of those many buckets to trim. The algorithm is split into two parts: one that tracks which buckets are most active (and so most likely to benefit from trimming), and one that iterates over all bucket instances (to guarantee that every bucket is trimmed eventually). Each trim interval (20 minutes by default), BucketTrimCR selects a number of active buckets and cold buckets to trim, as sketched below.
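
Reduced to a synchronous sketch, the per-interval selection could look like the following (the parameters are stand-ins for what BucketTrimCR gathers from the bucket counters, the metadata listing, and the recent-trim filter described below; not the actual coroutine code):

    #include <algorithm>
    #include <set>
    #include <string>
    #include <vector>

    // pick the hottest buckets first, then top up with cold buckets from
    // the metadata listing, skipping anything trimmed recently and
    // avoiding duplicates between the two sources
    std::vector<std::string> select_buckets_for_trim(
        const std::vector<std::string>& active,  // from the bucket counters
        const std::vector<std::string>& cold,    // from the metadata listing
        const std::set<std::string>& recent) {   // recently-trimmed buckets
      std::vector<std::string> selected;
      for (const auto& b : active) {
        if (recent.count(b) == 0) {
          selected.push_back(b);
        }
      }
      for (const auto& b : cold) {
        if (recent.count(b) == 0 &&
            std::find(selected.begin(), selected.end(), b) == selected.end()) {
          selected.push_back(b);
        }
      }
      return selected;
    }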

The active buckets are chosen by counting their frequency of entries in the data changes log. Because this datalog is sharded and its processing is spread across the zone's gateways, these bucket counters need to be accumulated over all gateways each trim interval. A watch/notify api served by the BucketTrimWatcher class allows the trimming gateway to query the top bucket counters from its peer gateways. A new BoundedKeyCounter data structure was introduced to manage these counters, while enforcing an upper bound on the number of keys/memory usage.
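
The bound enforcement itself is simple; a minimal sketch (illustrative only, since the real BoundedKeyCounter also maintains a partially-sorted view for top-k queries, which comes up in the review discussion below):

    #include <cstddef>
    #include <map>
    #include <string>

    class BoundedCounterSketch {
      std::map<std::string, int> counters;
      const std::size_t bound;
     public:
      explicit BoundedCounterSketch(std::size_t bound) : bound(bound) {}

      void insert(const std::string& key, int amount = 1) {
        // reject brand-new keys once the map is full so memory stays
        // bounded; existing keys can always be counted
        if (counters.size() >= bound &&
            counters.find(key) == counters.end()) {
          return;
        }
        counters[key] += amount;
      }
    };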

For cold buckets, the RGWMetadataManager interface is used to iterate over bucket instance metadata. The AsyncMetadataList and MetadataListCR classes provide an asynchronous wrapper for this listing. The current marker position in the metadata list is written to rados as a BucketTrimStatus object, so that the listing can be resumed for the next trim interval. Once the listing reaches the end (marker=MAX), it restarts from the beginning.
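
The resume-and-wrap-around control flow, sketched with the asynchronous listing and the rados-backed status object replaced by in-memory stand-ins:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct BucketTrimStatus {
      std::string marker;  // resume point; "MAX" after a completed pass
    };

    // stand-in for the metadata listing: return up to 'count' bucket
    // instance keys that sort after 'marker' (assumes count > 0)
    std::vector<std::string> list_after(const std::vector<std::string>& keys,
                                        const std::string& marker,
                                        std::size_t count, bool* truncated) {
      std::vector<std::string> out;
      *truncated = false;
      for (const auto& key : keys) {
        if (key <= marker) continue;
        if (out.size() == count) { *truncated = true; break; }
        out.push_back(key);
      }
      return out;
    }

    void list_cold_buckets(const std::vector<std::string>& all_instances,
                           BucketTrimStatus& status, std::size_t max_cold) {
      if (status.marker == "MAX") {
        status.marker.clear();  // the previous pass hit the end: restart
      }
      bool truncated = false;
      auto batch = list_after(all_instances, status.marker, max_cold,
                              &truncated);
      // ('batch' would now feed the cold-bucket selection for this interval)
      if (truncated) {
        status.marker = batch.back();  // resume after the last key we saw
      } else {
        status.marker = "MAX";         // full pass complete
      }
      // in the PR, 'status' is then written back to its rados object so
      // the next trim interval resumes from this marker
    }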

Once a bucket has been trimmed successfully, it is added to a RecentEventList. Until that event expires, the bucket is omitted from both the counting of active buckets and the listing of cold buckets.
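
A minimal sketch of that filter, assuming a timestamped queue with lazy expiry (the PR's RecentEventList differs in detail):

    #include <chrono>
    #include <deque>
    #include <string>

    class RecentEventSketch {
      using clock = std::chrono::steady_clock;
      struct Event { std::string bucket; clock::time_point time; };
      std::deque<Event> events;  // ordered oldest-first
      const clock::duration lifetime;
     public:
      explicit RecentEventSketch(clock::duration lifetime)
        : lifetime(lifetime) {}

      void insert(std::string bucket) {
        events.push_back({std::move(bucket), clock::now()});
      }

      // drop expired entries, then report whether 'bucket' was trimmed
      // recently enough to be skipped this interval
      bool contains(const std::string& bucket) {
        const auto cutoff = clock::now() - lifetime;
        while (!events.empty() && events.front().time < cutoff) {
          events.pop_front();
        }
        for (const auto& e : events) {
          if (e.bucket == bucket) {
            return true;
          }
        }
        return false;
      }
    };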

Once all selected buckets are finished trimming, the updated BucketTrimStatus object is written, and the watch/notify api instructs gateways to reset their bucket counters for the next interval.

Fixes: http://tracker.ceph.com/issues/18229

@cbodley cbodley force-pushed the cbodley:wip-rgw-bilog-trim branch 5 times, most recently from 1140181 to 3cce095 Sep 18, 2017

@cbodley cbodley removed the needs-test label Sep 20, 2017

@cbodley (Contributor, Author) commented Sep 20, 2017

added a test that creates two buckets, designating one as 'active' and the other as 'cold', then uses the admin commands bilog autotrim and bilog list to verify that they are both handled correctly

@cbodley (Contributor, Author) commented Sep 20, 2017

jenkins test this please

@cbodley cbodley force-pushed the cbodley:wip-rgw-bilog-trim branch from 3cce095 to 91e47b4 Sep 20, 2017

@cbodley (Contributor, Author) commented Sep 22, 2017

depends on #17916 to fix some test_multi.py failures

@cbodley cbodley requested a review from yehudasa Sep 22, 2017

// update sorted position if necessary. use a binary search for the last
// element in the sorted range that's greater than this counter
sorted_position = std::lower_bound(sorted.begin(), sorted_position,
&*i, &value_greater);

@cbodley (Contributor, Author) commented Sep 28, 2017

@yehudasa this is the line i mentioned in the standup this morning. sorted_position is an iterator that tracks how much of the vector is sorted (i.e. all elements before the iterator)

as an example, assume that we have a vector [a=5 b=4 c=3 #] that is fully-sorted by descending value, with # to indicate that sorted_position is at the end()

if we insert a new key d=4, we call sorted.push_back() and the result is [a=5 b=4 c=3 # d=4]. this call to lower_bound() will move the sorted_position forward past all entries with value <= 4 and end up with [a=5 # b=4 c=3 d=4]

if we instead update an existing key, like b=7, lower_bound() here would result in [# a=5 b=7 c=3], a fully-unsorted vector

in either case, a later call to get_highest() with count=2 would follow the if (sorted_count < count) branch, which advances the sorted_position to index 2 and performs a partial_sort() to guarantee that the range [0, 2) is fully sorted

does that example cover the case that you were concerned about, or did i miss it?

the GetNumSorted test case covers both of these, using get_num_sorted() to infer the sorted_position
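
Condensed into a stand-alone sketch (an index in place of the iterator, and only the pieces discussed here; not the actual BoundedKeyCounter):

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    class SortedCounterSketch {
      using entry = std::map<std::string, int>::value_type;
      std::map<std::string, int> counters;
      std::vector<entry*> sorted;    // pointers into 'counters'
      std::size_t sorted_count = 0;  // [0, sorted_count) sorted descending

      static bool value_greater(const entry* a, const entry* b) {
        return a->second > b->second;
      }
     public:
      void insert(const std::string& key, int amount = 1) {
        auto result = counters.emplace(key, 0);
        if (result.second) {
          sorted.push_back(&*result.first);  // new entries start unsorted
        }
        result.first->second += amount;
        // shrink the sorted prefix: find the first sorted element whose
        // value is <= the updated count, conservatively marking the rest
        // of the prefix as unsorted
        auto pos = std::lower_bound(sorted.begin(),
                                    sorted.begin() + sorted_count,
                                    &*result.first, &value_greater);
        sorted_count = pos - sorted.begin();
      }

      // return the 'count' highest-valued entries, extending the sorted
      // prefix with a partial_sort when it doesn't cover the request
      std::vector<entry*> get_highest(std::size_t count) {
        count = std::min(count, sorted.size());
        if (sorted_count < count) {
          std::partial_sort(sorted.begin(), sorted.begin() + count,
                            sorted.end(), &value_greater);
          sorted_count = count;
        }
        return std::vector<entry*>(sorted.begin(), sorted.begin() + count);
      }
    };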

@yehudasa (Member) commented Oct 23, 2017

@cbodley I don't remember what I had in mind, I think I missed the part where pos is updated after modifying an existing entry. So, iirc, the partial_sort would guarantee that the range is fully sorted, but you might have bigger entries outside of that range?

@cbodley (Contributor, Author) commented Oct 23, 2017

the call to std::partial_sort(sorted.begin(), sorted_position, sorted.end(), &value_greater); passes iterators for the entire range, so it does guarantee that all elements up to sorted_position are >= the elements that come after; it just doesn't guarantee that the elements after sorted_position are in the right order
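
That is exactly the documented postcondition of std::partial_sort; a tiny self-contained check:

    #include <algorithm>
    #include <cassert>
    #include <functional>
    #include <vector>

    int main() {
      std::vector<int> v = {3, 7, 1, 9, 4, 2};
      auto middle = v.begin() + 2;
      std::partial_sort(v.begin(), middle, v.end(), std::greater<int>());
      // the first two elements are the two largest, in order...
      assert(v[0] == 9 && v[1] == 7);
      // ...and everything after 'middle' is <= them, in unspecified order
      for (auto it = middle; it != v.end(); ++it) {
        assert(*it <= v[1]);
      }
      return 0;
    }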

@yehudasa (Member) commented Oct 23, 2017

@cbodley what about a case where, instead of updating an existing key, we insert a new one with the same key?

@cbodley (Contributor, Author) commented Oct 23, 2017

the keys are stored in the counters map, so this insert logic should prevent the sorted vector from containing duplicate keys:

      std::tie(i, inserted) = counters.emplace(key, 0);
      if (inserted) {
        sorted.push_back(&*i);
      }

@cbodley (Contributor, Author) commented Oct 4, 2017

still having trouble getting this through teuthology. shaman is refusing to build packages for trusty, leading to a lot of failures

@cbodley cbodley force-pushed the cbodley:wip-rgw-bilog-trim branch from 91e47b4 to b0a2483 Oct 11, 2017

@yehudasa (Member) left a comment

@cbodley great work. See my comments, mainly minor.

}

int start()
{


@yehudasa (Member) commented Oct 24, 2017

@cbodley I prefer having the scope opened at the end of the method definition line when implementing inline methods


~BucketTrimWatcher()
{
stop();


@yehudasa (Member) commented Oct 24, 2017

@cbodley what if start() was never called, will stop() work correctly?

@cbodley (Contributor, Author) commented Oct 24, 2017

nope, it segfaults. will fix
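
The guard that landed later (commit "rgw: BucketTrimWatcher checks handle in stop") has roughly this shape; the helpers below are illustrative stubs, not the actual rados watch calls:

    #include <cstdint>

    class BucketTrimWatcherSketch {
      uint64_t handle = 0;  // stays 0 until a watch is registered
     public:
      int start() {
        // stand-in for rados watch registration; on success it fills in
        // a nonzero handle
        return register_watch(&handle);
      }
      void stop() {
        if (handle) {  // skip teardown when start() never ran or failed
          unregister_watch(handle);
          handle = 0;
        }
      }
      ~BucketTrimWatcherSketch() { stop(); }
     private:
      int register_watch(uint64_t* h) { *h = 1; return 0; }  // stub
      void unregister_watch(uint64_t) {}                     // stub
    };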

struct TrimCounters {
/// counter for a single bucket
struct BucketCounter {
std::string bucket;


@yehudasa (Member) commented Oct 24, 2017

is this a bucket name? a string representation of a bucket instance?

yield call(new RGWSimpleRadosUnlockCR(store->get_async_rados(), store,
obj, name, cookie));
}
}

@yehudasa (Member) commented Oct 24, 2017

missing return set_cr_done() here. Nevermind, I see it was added later.

@cbodley (Contributor, Author) commented Oct 27, 2017

this is a for (;;) loop that sleeps in between trim attempts, so it doesn't ever return done. the existing MetaTrimPollCR and DataLogTrimPollCR work the same way

// filter out active buckets that we've already selected
auto i = std::find(buckets.begin(), buckets.end(), bucket);
if (i != buckets.end()) {
return true;

@yehudasa (Member) commented Oct 24, 2017

@cbodley how is this working? iirc, returning a value here just affects the cr manager run() caller, and isn't passed to the stack parent. Right?

@cbodley (Contributor, Author) commented Oct 24, 2017

this is inside the callback lambda for MetadataListCR, so returning true here just asks it for the next bucket instance

@yehudasa (Member) commented Oct 24, 2017

ah, right, missed that.

if (marker >= start_marker) {
return 0;
}
if (!callback(std::move(key), std::move(marker))) {

@yehudasa (Member) commented Oct 24, 2017

what if the caller does not exist anymore? E.g., the CR manager was shut down; callback might not be valid, I think.

@cbodley (Contributor, Author) commented Oct 24, 2017

you're right, the lambda in BucketTrimCR is capturing a this pointer that may not be valid for the lifetime of MetadataListCR

it looks like this lambda can safely capture a reference to this via boost::intrusive_ptr. that way, the lambda capture will keep this BucketTrimCR alive until MetadataListCR destructs. because we spawn MetadataListCR here with a call(), it's the parent stack that holds a reference to the MetadataListCR, rather than BucketTrimCR itself - that means we don't have to worry about cyclical references

+        // capture a reference so 'this' remains valid in the callback
+        auto ref = boost::intrusive_ptr<RGWCoroutine>{this};
         // list cold buckets to consider for trim
-        auto cb = [this] (std::string&& bucket, std::string&& marker) {
+        auto cb = [this, ref] (std::string&& bucket, std::string&& marker) {

@cbodley (Contributor, Author) commented Oct 27, 2017

addressed in b861442 and verified with valgrind that it doesn't leak

@@ -4930,6 +4930,18 @@ std::vector<Option> get_rgw_options() {
.set_default(1200)
.set_description(""),

Option("rgw_sync_log_trim_max_buckets", Option::TYPE_INT, Option::LEVEL_ADVANCED)
.set_default(16)
.set_description("Maximum number of buckets to trim per interval"),

@yehudasa (Member) commented Oct 24, 2017

@cbodley maybe add .set_long_description() and .add_see_also()?
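
Both setters exist on the Option type; one hedged example of how the suggestion could look (the wording is illustrative, not the text that eventually landed in "rgw: more documentation for bilog trim config"):

    Option("rgw_sync_log_trim_max_buckets", Option::TYPE_INT, Option::LEVEL_ADVANCED)
    .set_default(16)
    .set_description("Maximum number of buckets to trim per interval")
    .set_long_description(
        "The maximum number of bucket index logs processed in a single "
        "trim interval of the multisite log trim thread.")
    .add_see_also("rgw_sync_log_trim_interval")
    .add_see_also("rgw_sync_log_trim_concurrent_buckets"),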

conf->get_val<int64_t>("rgw_sync_log_trim_min_cold_buckets");
config.concurrent_buckets =
conf->get_val<int64_t>("rgw_sync_log_trim_concurrent_buckets");
config.notify_timeout_ms = 10;

@yehudasa (Member) commented Oct 24, 2017

10ms for notify timeout? I'd go with 2 seconds

@cbodley (Contributor, Author) commented Oct 24, 2017

oops, i intended to use 10 sec here. that _ms suffix was supposed to prevent that kind of mistake :)

@yehudasa (Member) commented Oct 24, 2017

yeah 10 sec is probably better.

@@ -935,6 +984,8 @@ RGWOp *RGWHandler_Log::op_get() {
} else if (type.compare("bucket-index") == 0) {
if (s->info.args.exists("info")) {
return new RGWOp_BILog_Info;
} else if (s->info.args.exists("status")) {

@yehudasa (Member) commented Oct 24, 2017

maybe sync-status?

@cbodley (Contributor, Author) commented Oct 24, 2017

i had used status for datalog and mdlog trim, so i'd rather keep them consistent

return marker;
}
return marker.substr(p + 1);
}

@yehudasa (Member) commented Oct 24, 2017

Surely we defined a similar method elsewhere? if not, then the place for it is wherever BucketIndexShardsManager is defined I think

@cbodley cbodley force-pushed the cbodley:wip-rgw-bilog-trim branch from b0a2483 to 377ba6b Oct 27, 2017

cbodley added some commits Aug 29, 2017

rgw: MetadataManager interface takes const string refs
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: introduce RGWRadosNotifyCR for aio_notify
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add skeleton for BucketTrimManager
Signed-off-by: Casey Bodley <cbodley@redhat.com>
common: introduce BoundedKeyCounter and unit test
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: BucketTrimManager implements BucketChangeObserver
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add BucketTrimWatcher to serve watch/notify apis
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add TrimCounters api to BucketTrimWatcher
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add BucketTrimPollCR for interval and lease logic
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add BucketTrimCR to spawn trim for active buckets
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: BucketTrimManager implements BucketTrimObserver
Signed-off-by: Casey Bodley <cbodley@redhat.com>

cbodley added some commits Sep 6, 2017

rgw: add 'radosgw-admin bilog autotrim'
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add MetadataListCR to loop over bucket instances
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add configure_bucket_trim()
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add /admin/log rest api for bucket sync status
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add json decoders for bucket sync status
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add rgw_bucket_sync_status() to bypass manager
RGWBucketSyncStatusManager::init() is doing a lot of extra work that's
not needed to serve the rest api (spawning an http manager thread,
fetching the bucket instance info, etc)

uses RGWShardCollectCR to limit the number of concurrent reads to 16

Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: RGWGetBucketInstanceInfoCR takes rgw_bucket or metadata key
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add HTTPManager to BucketTrimManager
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add BucketTrimInstanceCR
fetches bucket sync status from each peer, calculates the min markers
for each shard, and trims the bilog shards. calls the TrimObserver on
success

Fixes: http://tracker.ceph.com/issues/18229

Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add RGWRadosBILogTrimCR
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add RGWBucketInfo overload for BucketShard::init
for use by callers that have already read the bucket instance info

Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: start BucketTrimManager in RGWRados
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: RGWDataSyncSingleEntryCR calls BucketChangeObserver
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add TrimComplete to watch/notify api
Signed-off-by: Casey Bodley <cbodley@redhat.com>
test/rgw: add test_bucket_index_log_trim()
Signed-off-by: Casey Bodley <cbodley@redhat.com>
qa/rgw: add kwargs for debug output
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: move shard marker helper into BucketIndexShardsManager
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: add comment for bucket in BucketCounter
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: hold cr reference in MetadataListCR callback
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: BucketTrimWatcher checks handle in stop
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: curly brace style for bilog trim classes
Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: fix notify timeout for BucketTrimWatcher
from seconds to msec

Signed-off-by: Casey Bodley <cbodley@redhat.com>
rgw: more documentation for bilog trim config
Signed-off-by: Casey Bodley <cbodley@redhat.com>

@cbodley cbodley force-pushed the cbodley:wip-rgw-bilog-trim branch from 377ba6b to 4d44421 Nov 10, 2017

@yuriw (Contributor) commented Nov 14, 2017

@yuriw yuriw merged commit c9dbb86 into ceph:master Nov 15, 2017

5 checks passed

Docs: build check OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded