New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: fix rgw hang when do RGWRealmReloader::reload after go SIGHUP #16417

Merged
merged 1 commit into from Jul 24, 2017

Conversation

Projects
None yet
4 participants
@fangyuxiangGL
Contributor

fangyuxiangGL commented Jul 19, 2017

pendings in RGWGetBucketStatsContext is initialized to zero when rgw_override_bucket_index_max_shards is zero.

so --pendings will got 0xffffffff, which lead async_refcount not put correctly, then the thread handle RGWRealmReloader::reload will hang, then cause request process worker thread of rgw frontend hang(the write lock was held by RGWRealmReloader::reload handle thread, but request process worker thread need to get read lock)

the procedure of quota fetch is a bit tricky and worse, this fix is not perfect, but is better with minimum changes

http://tracker.ceph.com/issues/20686

@fangyuxiangGL

This comment has been minimized.

Contributor

fangyuxiangGL commented Jul 19, 2017

@cbodley please have a look, thanks!

@mattbenjamin

looks promising, defer to one of the multisite folks for approval

@cbodley

good catch, thanks. can you expand a bit on the other issues with quota fetch?

@cbodley cbodley added the needs-qa label Jul 19, 2017

@fangyuxiangGL

This comment has been minimized.

Contributor

fangyuxiangGL commented Jul 19, 2017

@cbodley

Just issue with low possibility that will cause quota stats incorrect. please have a look at codes below:

int RGWRados::get_bucket_stats_async(RGWBucketInfo& bucket_info,
int shard_id, RGWGetBucketStats_CB *ctx)
{
int num_aio = 0;
RGWGetBucketStatsContext *get_ctx = new
RGWGetBucketStatsContext(ctx, bucket_info.num_shards);
assert(get_ctx);
int r = cls_bucket_head_async(bucket_info, shard_id, get_ctx, &num_aio);
get_ctx->put();
if (r < 0) {
ctx->put();
if (num_aio) {
get_ctx->unset_cb(); may race with the async io callback sent in cls_bucket_head_async
}
}
return r;
}`

@cbodley

This comment has been minimized.

Contributor

cbodley commented Jul 19, 2017

get_ctx->unset_cb(); may race with the async io callback sent in cls_bucket_head_async

thanks, looks like you're right. it also accesses get_ctx after it called get_ctx->put() above

@fangyuxiangGL

This comment has been minimized.

Contributor

fangyuxiangGL commented Jul 19, 2017

you are right, so race result:

  1. if part of async io sent was called back before get_ctx->unset_cb(), bucket quota stats will be incorrect
  2. if all of async io sent was called back before get_ctx->unset_cb(), rgw will crash
@cbodley

This comment has been minimized.

Contributor

cbodley commented Jul 19, 2017

@fangyuxiangGL could you move that call to get_ctx->put() down below the if (r < 0) {} block? we might as well fix and backport that at the same time

@fangyuxiangGL

This comment has been minimized.

Contributor

fangyuxiangGL commented Jul 20, 2017

@cbodley

moved get_ctx->put() down

@fangyuxiangGL

This comment has been minimized.

Contributor

fangyuxiangGL commented Jul 20, 2017

I am wrong with saying that "if part of async io sent was called back before get_ctx->unset_cb(), bucket quota stats will be incorrect"; in fact, every async io only update its temp stats and only calculated totally when (pendings == 0).

But there is an issue when exception happened, considered that pengdings == 0 and retcode < 0(real fail from rados cluster, not early fail) when code reach here:

if (--pendings == 0) { if (!ret_code) { cb->set_response(&stats); } cb->handle_response(ret_code); cb->put(); }

then cb->handle_response follow in:

void BucketAsyncRefreshHandler::handle_response(const int r)
{
if (r < 0) { early return without call async_refcount->put()
ldout(store->ctx(), 20) << "AsyncRefreshHandler::handle_response() r=" << r << dendl;
return; /* nothing to do here */
}

RGWStorageStats bs;

for (const auto& pair : *stats) {
const RGWStorageStats& s = pair.second;

bs.size += s.size;
bs.size_rounded += s.size_rounded;
bs.num_objects += s.num_objects;

}

cache->async_refresh_response(user, bucket, bs);
}

i introduced async_refresh_fail to RGWQuotaCache, and should be called when got error response, it just put the reference without update the quota stats.

@cbodley, now it's very pleased to hear your opinions, thanks!

rgw: fix rgw hang when do RGWRealmReloader::reload after go SIGHUP
Quota async processer reference count err when bucket has no explicit shard

Fixes: http://tracker.ceph.com/issues/20686

Signed-off-by: fang yuxiang fang.yuxiang@eisoo.com
@cbodley

This comment has been minimized.

Contributor

cbodley commented Jul 20, 2017

i introduced async_refresh_fail to RGWQuotaCache, and should be called when got error response, it just put the reference without update the quota stats.

thanks for the explanation, the async_refresh_fail() changes looks good 👍

@cbodley

This comment has been minimized.

Contributor

cbodley commented Jul 20, 2017

jenkins test this please

@yuriw

This comment has been minimized.

@yuriw yuriw merged commit b4c42f3 into ceph:master Jul 24, 2017

3 of 4 checks passed

make check (arm64) make check failed
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
make check make check succeeded
Details

@fangyuxiangGL fangyuxiangGL deleted the fangyuxiangGL:rgw-hang branch Jan 10, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment