rgw: improve sync status #19573
Conversation
Force-pushed from 9c6f0c0 to 3862dad (compare)
jenkins test this please
@oritwas would you mind taking a look?
src/rgw/rgw_data_sync.cc
Outdated
@@ -1057,7 +1057,7 @@ class RGWDataSyncSingleEntryCR : public RGWCoroutine {
         sync_env->observer->on_bucket_changed(bs.bucket.get_key());
       }
       /* FIXME: what do do in case of error */
-      if (marker_tracker && !entry_marker.empty()) {
+      if (marker_tracker && !entry_marker.empty() && sync_status == 0) {
i think we do still want to update the marker here, so that an error from one bucket shard doesn't prevent progress on bucket shards that come later in the datalog. instead, we write this key to the error_repo, and RGWDataSyncShardCR::incremental_sync() will continue to retry it (see /* process bucket shards that previously failed */), while also making progress on the next bucket shards in the datalog
Thank you for the kind review. To be honest, not updating the marker here has a lot of side effects; I pushed this commit so we can discuss it with all of you. Instead of this implementation, maybe we can judge whether a shard has failed in get_data_sync_status() in rgw_admin.cc?
src/rgw/rgw_admin.cc
Outdated
        shards_behind_str.append(*it + ", ");
      }
      shards_behind_str.erase(shards_behind_str.size()-2);
      shards_behind_str.append("]");
it looks like this could be simplified by using a std::set<int>, and taking advantage of its operator<< defined in include/types.h:
std::set<int> shards = {0,1,2,3};
std::cout << '[' << shards << ']'; // prints [0,1,2,3]
Force-pushed from 81ddd29 to 84f8b06 (compare)
Now it will display the lagging and error shards in radosgw-admin sync status, just as below:
Force-pushed from 84f8b06 to 3c882bb (compare)
Now, when --sync-detail is specified when executing radosgw-admin sync status, it will display the lagging bucket shards related to the behind log shards and the error log shards, just as below:
Force-pushed from 3c882bb to c77de5d (compare)
@cbodley ping
src/rgw/rgw_data_sync.cc
Outdated
      for (auto entry : log_entries_map) {
        list<rgw_data_change_log_entry> log_entries = entry.second;
        for (list<rgw_data_change_log_entry>::iterator log_iter = log_entries.begin(); log_iter != log_entries.end(); ++log_iter) {
          ldout(sync_env->cct, 10) << "TESTLV " << log_iter->entry.key << dendl;
TESTLV should be removed
src/rgw/rgw_data_sync.cc
Outdated
      }
      for (auto entry : log_entries_map) {
        list<rgw_data_change_log_entry> log_entries = entry.second;
        for (list<rgw_data_change_log_entry>::iterator log_iter = log_entries.begin(); log_iter != log_entries.end(); ++log_iter) {
for (const auto& log_iter : log_entries)
src/rgw/rgw_admin.cc
Outdated
@@ -2713,6 +2735,8 @@ int main(int argc, const char **argv)
       // do nothing
     } else if (ceph_argparse_binary_flag(args, i, &sync_stats, NULL, "--sync-stats", (char*)NULL)) {
       // do nothing
+    } else if (ceph_argparse_binary_flag(args, i, &sync_detail, NULL, "--sync-detail", (char*)NULL)) {
/doc/man/8/radosgw-admin.rst should be updated.
Force-pushed from 27037e6 to 08cbb15 (compare)
excellent work overall!
src/rgw/rgw_cr_rados.h
Outdated
@@ -222,6 +238,42 @@ class RGWSimpleRadosReadCR : public RGWSimpleCoroutine {
   }
 };

+class RGWSimpleOmapKeysReadCR : public RGWSimpleCoroutine {
can you use the existing RGWRadosGetOmapKeysCR instead?
@cbodley RGWRadosGetOmapKeysCR will return -EIO rather than -ENOENT if the related rados object does not exist. When we use it in RGWReadDataSyncErrorShardsCR, that CR will fail, since its super-class RGWShardCollectCR always expects a child CR to succeed or to return -ENOENT. So maybe we should use RGWSimpleOmapKeysReadCR here, as its retcode will be -ENOENT, or just make some change to the existing RGWRadosGetOmapKeysCR? The related code is as below:
int RGWShardCollectCR::operate() {
  reenter(this) {
    while (spawn_next()) {
      current_running++;
      while (current_running >= max_concurrent) {
        int child_ret;
        yield wait_for_child();
        if (collect_next(&child_ret)) {
          current_running--;
          if (child_ret < 0 && child_ret != -ENOENT) {
            ldout(cct, 10) << __func__ << ": failed to fetch log status, ret=" << child_ret << dendl;
            status = child_ret;
          }
        }
      }
    }
    while (current_running > 0) {
      int child_ret;
      yield wait_for_child();
      if (collect_next(&child_ret)) {
        current_running--;
        if (child_ret < 0 && child_ret != -ENOENT) {
          ldout(cct, 10) << __func__ << ": failed to fetch log status, ret=" << child_ret << dendl;
          status = child_ret;
        }
      }
    }
    if (status < 0) {
      return set_cr_error(status);
    }
    return set_cr_done();
  }
  return 0;
}
can you tell where EIO is coming from in that case? i don't see much difference between RGWRadosGetOmapKeysCR and the new RGWSimpleOmapKeysReadCR, except that yours is calling the synchronous version of ioctx.omap_get_vals() where RGWRadosGetOmapKeysCR::send_request() is using omap_get_vals2() with aio_operate()
@cbodley both RGWSimpleOmapKeysReadCR and RGWRadosGetOmapKeysCR will call librados::ObjectReadOperation::omap_get_vals2(), but RGWRadosGetOmapKeysCR calls it with rval specified, so if the related rados object does not exist, RGWRadosGetOmapKeysCR will return with that rval. The related code is as below:
// rgw_cr_rados.cc
int RGWRadosGetOmapKeysCR::send_request() {
  int r = store->get_raw_obj_ref(obj, &ref);
  if (r < 0) {
    lderr(store->ctx()) << "ERROR: failed to get ref for (" << obj << ") ret=" << r << dendl;
    return r;
  }
  set_status() << "send request";
  librados::ObjectReadOperation op;
  op.omap_get_vals2(marker, max_entries, entries, nullptr, &rval);
  cn = stack->create_completion_notifier();
  return ref.ioctx.aio_operate(ref.oid, cn->completion(), &op, NULL);
}

// rgw_cr_rados.h
int request_complete() override {
  return rval;
}
but deep in the call chain of RGWSimpleOmapKeysReadCR, we can find it calls omap_get_vals2() with rval not specified; if the related rados object does not exist, the return value of operate(oid, &op, &bl) will be passed instead. The related code is as below:
// librados.cc
int librados::IoCtx::omap_get_vals(const std::string& oid,
                                   const std::string& orig_start_after,
                                   const std::string& filter_prefix,
                                   uint64_t max_return,
                                   std::map<std::string, bufferlist> *out_vals)
{
  bool first = true;
  string start_after = orig_start_after;
  bool more = true;
  while (max_return > 0 && more) {
    std::map<std::string,bufferlist> out;
    ObjectReadOperation op;
    op.omap_get_vals2(start_after, filter_prefix, max_return, &out, &more,
                      nullptr);
    bufferlist bl;
    int ret = operate(oid, &op, &bl);
    if (ret < 0) {
      return ret;
    }
    if (more) {
      if (out.empty()) {
        return -EINVAL; // wth
      }
      start_after = out.rbegin()->first;
    }
    if (out.size() <= max_return) {
      max_return -= out.size();
    } else {
      max_return = 0;
    }
    if (first) {
      out_vals->swap(out);
      first = false;
    } else {
      out_vals->insert(out.begin(), out.end());
      out.clear();
    }
  }
  return 0;
}
in order to analyze the rval passed by the osd, we specified the rval in RGWSimpleOmapKeysReadCR's call chain and displayed it, just as below:
[root@0deae4d9bb38 librados]# git diff librados.cc
diff --git a/src/librados/librados.cc b/src/librados/librados.cc
index e9157ab..d1789d7 100644
--- a/src/librados/librados.cc
+++ b/src/librados/librados.cc
@@ -396,10 +396,12 @@ int librados::IoCtx::omap_get_vals(const std::string& oid,
while (max_return > 0 && more) {
std::map<std::string,bufferlist> out;
ObjectReadOperation op;
+ int val;
op.omap_get_vals2(start_after, filter_prefix, max_return, &out, &more,
- nullptr);
+ &val);
bufferlist bl;
int ret = operate(oid, &op, &bl);
+ std::cout << cpp_strerror(ret) << " : " << cpp_strerror(val) << std::endl;
if (ret < 0) {
return ret;
}
in this way, the output is as below:
[root@0deae4d9bb38 build]# rgw sync status --rgw-zone=zone1 --debug-rgw=20 2>kk
realm d642c3d5-eea7-4a55-8a17-bf6296cf4548 (earth)
zonegroup 38a8d75d-c834-4987-a78f-e615328fa663 (us)
zone 71b1be97-96d1-4e0a-81e2-c5e430a186a7 (zone1)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
....
(2) No such file or directory : (5) Input/output error
(0) Success : (0) Success
(2) No such file or directory : (5) Input/output error
....
data sync source: 9a4a40f0-9b31-4820-a5a8-6be46155f3e6 (zone0)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
1 shards are recovering
recovering shards: [104]
so, if the related rados object does not exist, the rval comes back as (5) Input/output error even though the operate() retcode is -ENOENT (besides this, we can also see it if we display the retcode of RGWRadosGetOmapKeysCR in rgw_data_sync.cc when processing bucket shards that previously failed), so maybe we should fix RGWRadosGetOmapKeysCR so that it returns -ENOENT instead of -EIO
how come the rval of librados::ObjectReadOperation::omap_get_vals2() will be -EIO, we are not exactly clear yet; @liuchang0812 and I will take a deeper look into it later on. but if we use rados listomapkeys on a nonexistent object, the retcode will be -ENOENT, just as below:
[root@0deae4d9bb38 build]# bin/rados -p zone1.rgw.log listomapkeys non-exist-object
error getting omap key set zone1.rgw.log/non-exist-object: (2) No such file or directory
so maybe the reasonable rval in omap_get_vals2() should be -ENOENT, so that it stays in consensus with the osds. @cbodley @yehudasa @oritwas hope you can give some suggestions on it
it looks like that EIO is coming from C_ObjectOperation_decodevals::finish() in Objecter.h. i see the osd replying to the op with ENOENT, but finish() gets called with r=0 and an empty bufferlist. trying to decode that bufferlist results in EIO
please see #19878
src/rgw/rgw_data_sync.h
Outdated
@@ -279,7 +282,8 @@ class RGWRemoteDataLog : public RGWCoroutinesManager {
   int read_log_info(rgw_datalog_info *log_info);
   int read_source_log_shards_info(map<int, RGWDataChangesLogInfo> *shards_info);
   int read_source_log_shards_next(map<int, string> shard_markers, map<int, rgw_datalog_shard_data> *result);
-  int read_sync_status(rgw_data_sync_status *sync_status);
+  int read_sync_status(rgw_data_sync_status *sync_status, bool read_error);
i think it would make more sense to split this into a separate function like read_error_shards(set<int>&), instead of extending read_sync_status() and struct rgw_data_sync_status
src/rgw/rgw_data_sync.h
Outdated
@@ -279,7 +282,8 @@ class RGWRemoteDataLog : public RGWCoroutinesManager {
   int read_log_info(rgw_datalog_info *log_info);
   int read_source_log_shards_info(map<int, RGWDataChangesLogInfo> *shards_info);
   int read_source_log_shards_next(map<int, string> shard_markers, map<int, rgw_datalog_shard_data> *result);
-  int read_sync_status(rgw_data_sync_status *sync_status);
+  int read_sync_status(rgw_data_sync_status *sync_status, bool read_error);
+  int read_lagging_buckets(rgw_data_sync_status *sync_status, map<int, string>& shard_markers, set<int>& error_shards);
the first argument can just be set<string>& lagging_buckets instead of adding it to rgw_data_sync_status
src/rgw/rgw_admin.cc
Outdated
      if (error_shards.size() > 0) {
        push_ss(ss, status, tab) << error_shards.size() << " shards have encountered errors";
        push_ss(ss, status, tab) << "error shards: " << "[" << error_shards << "]";
i'd rather we didn't call these "errors" here, since the majority of them are likely to be from EBUSY and will be retried without admin intervention. can we use different terms, like maybe " shards are recovering from errors" and "recovering shards:"?
src/rgw/rgw_admin.cc
Outdated
      }

      if (sync_detail) {
        ret = sync.read_lagging_buckets(&sync_status, shards_behind, error_shards); // read error shards
i'm not sure how i feel about this 'lagging buckets' part
on the one hand, it's useful to see which buckets could be affected by the datalog shards that are behind. but on the other hand, it's the bucket's sync status that tells whether the bucket is actually behind. so there will be many cases where buckets have finished sync but are reported as laggy here because there are still some datalog entries referring to them (this would often be due to failures in other buckets on the same datalog shard that delay progress). so i worry that false positives here will confuse the admin, and they'll assume that they need to do something to fix all of those buckets
it's also fairly expensive to compute - first, RGWReadRemoteDataLogCR will read up to 1000 datalog entries from each shard (so up to 128000 entries in memory), then RGWReadDataSyncErrorShardsCR reads all of the error entries (which is not bounded in size). in large deployments, this is likely to either crash radosgw-admin, or just take a really long time and produce too much output
thank you for the kind review. maybe just using the bucket's sync status can not get all of the lagging bucket shards (i.e. the related bucket instance hasn't been synced yet). instead of this lagging buckets part, maybe we can do as below?
- get the behind data log shards and error data log shards
- for every behind data log shard, read the related bucket instances, then use the bucket's sync status to judge whether the bucket is actually behind
- for every error data log shard, just read the error bucket shard records in the related *.retry obj
any suggestions are appreciated!
I agree with @cbodley here. I think the proper way to do it would be to flip the responsibility. Instead of having radosgw-admin querying the status of buckets, have radosgw report the status somewhere durable when the state changes. It is already being done for the sync trace node info, just that we currently only keep it in memory.
@Leeshine i would really like to see improvements to the bucket sync status command so it can query the remote zones and compare their markers with the local bucket sync status. i think that would be the ideal way to inspect a single bucket/shard

so maybe the overall approach should be to start with a sync status command that tells which datalog shards are behind (as you've done). then have some kind of data sync status --shard=x command that lists which bucket shards are left to process (this 'lagging buckets' part). then the admin could use bucket sync status to look into the status of each one

that way we don't make the sync status command itself too complicated or verbose, but provide more tools to inspect the status in detail. what do you guys think?
@cbodley sounds good!
cc @vumrao
all right, I'll try working on it. @cbodley
@Leeshine can we limit this pr to just the sync status changes (minus --sync-detail and 'lagging buckets') and follow up with the other commands separately? i think what you've done so far is close to being ready to merge
@cbodley sure
Force-pushed from 08cbb15 to caa8286 (compare)
@cbodley ping (limited this pr to just the sync status changes)
Signed-off-by: lvshanchun <lvshanchun@gmail.com>
Force-pushed from caa8286 to f22a217 (compare)
@cbodley ping
src/rgw/rgw_data_sync.cc
Outdated
public:
  RGWReadDataSyncRecoveringShardsCR(RGWDataSyncEnv *env, uint64_t _max_entries, set<int>& _shards,
                                    map<int, std::set<std::string>>& _entries_map)
since the only caller is always requesting all shards from [0,num_shards), the use of set<int> shards seems like overkill - can you have it take int num_shards instead? see RGWReadDataSyncStatusMarkersCR as an example
src/rgw/rgw_data_sync.cc
Outdated
  int operate() override;
};

int RGWReadRecoveringShardsCoroutine::operate()
since this coroutine is only making one asynchronous call to RGWReadDataSyncRecoveringShardsCR, with some extra logic for setup/teardown, it probably makes sense to move that extra logic into read_recovering_shards() so we only need the one RGWReadDataSyncRecoveringShardsCR coroutine
src/rgw/rgw_data_sync.cc
Outdated
    return set_cr_error(retcode);
  }

  for (auto entry : entries_map) {
use of auto entry here will copy each entry by value, and this is expensive for std::set - prefer const auto& entry
Force-pushed from f22a217 to 517ce71 (compare)
@cbodley ping
one minor comment, but otherwise looks great 👍
src/rgw/rgw_data_sync.cc
Outdated
@@ -672,6 +706,34 @@ int RGWRemoteDataLog::read_sync_status(rgw_data_sync_status *sync_status)
   return ret;
 }

+int RGWRemoteDataLog::read_recovering_shards(rgw_data_sync_status *sync_status, set<int>& recovering_shards)
it looks like this function only uses sync_status->sync_info.num_shards from sync_status, could you just pass int num_shards instead?
Signed-off-by: lvshanchun <lvshanchun@gmail.com>
Force-pushed from 517ce71 to 8c3ea26 (compare)
@cbodley thank you for the kind review, I'll put up a PR about displaying lagging bucket shards and their sync detail later on
great. i've been meaning to create a tracker issue for the next steps, and add some more details about how it should look. i'll try to do that today
@Leeshine @cbodley I think we can use this old tracker - http://tracker.ceph.com/issues/20473
improve sync status in radosgw-admin:
- do not update marker when sync failed
- display error data log shards when executing radosgw-admin sync status
- display lagging bucket shards related to behind shards and error shards

Signed-off-by: lvshanchun <lvshanchun@gmail.com>