
rgw : fix race in RGWCompleteMultipart #16732

Merged
merged 1 commit into ceph:master on Aug 2, 2017

Conversation

@Abhishekvrshny
Contributor

Abhishekvrshny commented Aug 1, 2017

Fixes : http://tracker.ceph.com/issues/20861

Signed-off-by: Abhishek Varshney <abhishek.varshney@flipkart.com>

@Abhishekvrshny

Contributor

Abhishekvrshny commented Aug 1, 2017

The root cause of missing parts of a multipart object is a race between two contending calls to the RGWCompleteMultipart function in rgw_op.cc. The scenario occurs when one thread is still writing the manifest and has not yet deleted the meta object; a second thread can then still read the meta object and proceed, and in RGWRados::update_gc_chain it adds all the existing manifest entries to GC before rewriting the manifest. This leads to parts being GC'ed. The problem is more prominent when the cluster is degraded and writes are latent, which widens the race window.

A proposed fix is to lock the meta object in RGWCompleteMultipart. One problem with this approach is identifying the right timeout for the lock: we have seen requests to RGWCompleteMultipart take a few minutes to complete.
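
Roughly, the idea looks like this (a sketch, not the final patch: pool/ioctx setup is elided, and the duration and error handling shown here follow the review discussion below):

    // serialize completions by taking a cls exclusive lock on the meta object
    librados::ObjectWriteOperation op;
    librados::IoCtx ioctx;  // ioctx for the meta pool; setup elided
    rados::cls::lock::Lock l("RGWCompleteMultipart");
    int max_lock_secs_mp = s->cct->_conf->get_val<int64_t>("rgw_mp_lock_max_time");
    l.set_duration(utime_t(max_lock_secs_mp, 0));  // lock expires if the holder dies
    op.assert_exists();   // fail with -ENOENT if the meta object is already gone
    l.lock_exclusive(&op);
    op_ret = ioctx.operate(raw_meta_oid, &op);
    if (op_ret < 0) {
      // a racing completion (or a client retry) already holds the lock
      op_ret = -ERR_INTERNAL_ERROR;  // reported to the client as 500
      return;
    }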

@mattbenjamin

Contributor

mattbenjamin commented Aug 1, 2017

@Abhishekvrshny are the threads racing to perform the operation already in progress? It feels to me as if the logical complete operation does require atomicity; I am interested in the design parameters that lead to an exclusive lock rather than alternatives (or a compound op of some kind). OTOH, I'd like to understand better how the second and subsequent threads enter the race, and whether they should be waiting for the lock here--i.e., when it is released, will this work be completed?

@varadakari

Contributor

varadakari commented Aug 1, 2017

@mattbenjamin To answer the first question: yes. The tracker has the details about the race. To summarize, this is what led to the problem: in RGWCompleteMultipart::execute(), listing the parts and doing all the work to update the manifest etc. took more than 60 seconds in our case due to a cluster rebalance. Because of the retry mechanism in boto, the same op is retried after 60 seconds. This resulted in the op running twice; the second op marks the multipart objects for GC, treating them as an overwrite. Steps to reproduce are in the ticket. This is essentially an atomicity problem that arises while updating multiple objects, i.e. the manifest and the meta_obj (deletion).

@varadakari

Contributor

varadakari commented Aug 1, 2017

@mattbenjamin to answer the second question: once one of the completions succeeds, the second op becomes void, and we haven't seen any object loss.

@mattbenjamin

Contributor

mattbenjamin commented Aug 1, 2017

@varadakari my second question is not doubting that the loss will be corrected (I believe it :). I'm trying to ask whether there are alternatives to blocking each process that has a rendezvous on the completing meta-object.

@varadakari

Contributor

varadakari commented Aug 1, 2017

@mattbenjamin so far we have not been able to come up with an alternative. One option might be to keep the completion state persistent, but that would block any further completions if the RGW crashes right after it updates/stores that state; we would need a way to roll back the state in case of a failure.

@Abhishekvrshny

Contributor

Abhishekvrshny commented Aug 1, 2017

@mattbenjamin Another approach could be to check the prefixes of the existing objects in the manifest and skip adding them to GC if the current entries in the meta object have the same prefix.

@mattbenjamin

Contributor

mattbenjamin commented Aug 1, 2017

@Abhishekvrshny we discussed this PR in detail in the RGW standup (which you folks are welcome to join, btw :). We are fine with the exclusive lock implementation, but we theorize that subsequent threads trying and failing to get the exclusive lock should by default just return 500 to the client. That will permit the client to retry if desired--but of course, we can't protect the client from retrying after the original request --succeeds-- should that happen.

@mattbenjamin mattbenjamin self-requested a review Aug 1, 2017

@mattbenjamin

approve the basic concept, but requesting a change to the behavior during the retry window (500)

@Abhishekvrshny

Contributor

Abhishekvrshny commented Aug 1, 2017

@mattbenjamin Sure, I'll make the changes as requested. Just to clarify, the desired behaviour here is to just return 500 if op_ret < 0 and not to retry or send any other error code?

@Abhishekvrshny

Contributor

Abhishekvrshny commented Aug 1, 2017

@mattbenjamin : Re-pushed changes as requested. Please review.

    }
    if (op_ret < 0)
      return;
    op_ret = l.lock_exclusive(&ioctx, raw_meta_oid);

@cbodley

cbodley Aug 1, 2017

Contributor

since this cls_lock call is a rados write operation, the osd will create the object if it didn't already exist - that would likely confuse the logic below. we'll need to use the lock_exclusive(librados::ObjectWriteOperation*) overload of this function, along with ObjectWriteOperation::assert_exists() to get -ENOENT in that case instead:

    librados::ObjectWriteOperation op;
    op.assert_exists();
    l.lock_exclusive(&op);
    op_ret = ioctx.operate(raw_meta_oid, &op);
        ldout(store->ctx(), 0) << "WARNING: failed to remove object " << meta_obj << dendl;
      }
      break;
    } while (1);

@cbodley

cbodley Aug 1, 2017

Contributor

does this need to loop anymore, now that we no longer sleep/retry on failure to lock?

    // remove the upload obj
    int r = store->delete_obj(*static_cast<RGWObjectCtx *>(s->obj_ctx),
                              s->bucket_info, meta_obj, 0);
    int r = l.unlock(&ioctx, raw_meta_oid);

@cbodley

cbodley Aug 1, 2017

Contributor

note that when the call to delete_obj() succeeds, this unlock() will fail with -ENOENT. it should probably only be called in the WARNING: failed to remove object block
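
i.e. something like this (a sketch of the suggested placement; the logging on unlock failure is illustrative):

    // remove the upload obj; if this succeeds, the lock goes away with the object
    int r = store->delete_obj(*static_cast<RGWObjectCtx *>(s->obj_ctx),
                              s->bucket_info, meta_obj, 0);
    if (r < 0) {
      ldout(store->ctx(), 0) << "WARNING: failed to remove object " << meta_obj << dendl;
      // the meta object still exists, so release the cls lock explicitly
      r = l.unlock(&ioctx, raw_meta_oid);
      if (r < 0) {
        ldout(store->ctx(), 0) << "WARNING: failed to unlock " << raw_meta_oid << dendl;
      }
    }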

@Abhishekvrshny

Contributor

Abhishekvrshny commented Aug 1, 2017

@cbodley Thanks for the elaborate and clear explanation. Made the changes as desired. Let me know if it looks good now.

    if (op_ret < 0) {
      dout(0) << "RGWCompleteMultipart::execute() failed to acquire lock " << dendl;
      op_ret = -5;

@cbodley

cbodley Aug 1, 2017

Contributor

it looks like we don't have an error mapping for -5 (EIO) in rgw_common.cc - ERR_INTERNAL_ERROR is the only one that currently maps to 500. can you use that instead?

you might also consider setting s->err.message to something descriptive, like This multipart completion is already in progress.
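
put together, roughly (the message text is just the suggestion above):

    if (op_ret < 0) {
      dout(0) << "RGWCompleteMultipart::execute() failed to acquire lock " << dendl;
      op_ret = -ERR_INTERNAL_ERROR;
      s->err.message = "This multipart completion is already in progress";
      return;
    }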

    librados::ObjectWriteOperation op;
    librados::IoCtx ioctx;
    rados::cls::lock::Lock l("RGWCompleteMultipart");
    int max_lock_secs_mp = s->cct->_conf->rgw_mp_lock_max_time;

@cbodley

cbodley Aug 1, 2017

Contributor

we shouldn't be adding new config variables to legacy_config_opts.h. instead, we should rely on the version in options.cc and access it by name with s->cct->_conf->get_val<int64_t>("rgw_mp_lock_max_time")
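
for reference, that looks roughly like an entry in common/options.cc plus the by-name lookup (the default and description below are illustrative, not the values from this PR):

    // common/options.cc (illustrative values)
    Option("rgw_mp_lock_max_time", Option::TYPE_INT, Option::LEVEL_ADVANCED)
    .set_default(600)
    .set_description("maximum time, in seconds, to hold the multipart completion lock"),

    // rgw_op.cc: read the option by name instead of a legacy config member
    int max_lock_secs_mp = s->cct->_conf->get_val<int64_t>("rgw_mp_lock_max_time");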

    @@ -5276,6 +5276,30 @@ void RGWCompleteMultipart::execute()
      meta_obj.set_in_extra_data(true);
      meta_obj.index_hash_source = s->object.name;
      rgw_pool meta_pool;

@cbodley

cbodley Aug 1, 2017

Contributor

could you add a comment above this block? something simple like

take a cls lock on meta_obj to prevent racing completions (or retries) from deleting the parts

@Abhishekvrshny

Contributor

Abhishekvrshny commented Aug 2, 2017

@cbodley Changes re-pushed.

@mattbenjamin

lgtm

    rados::cls::lock::Lock l("RGWCompleteMultipart");
    int max_lock_secs_mp = s->cct->_conf->get_val<int64_t>("rgw_mp_lock_max_time");
    op.assert_exists();

@mattbenjamin

mattbenjamin Aug 2, 2017

Contributor

ack

    if (op_ret < 0) {
      dout(0) << "RGWCompleteMultipart::execute() failed to acquire lock " << dendl;
      op_ret = -ERR_INTERNAL_ERROR;

@mattbenjamin

mattbenjamin Aug 2, 2017

Contributor

ack

@mattbenjamin mattbenjamin self-assigned this Aug 2, 2017

@cbodley

cbodley approved these changes Aug 2, 2017

@mattbenjamin

Contributor

mattbenjamin commented Aug 2, 2017

@Abhishekvrshny I consider this a PASS; we just need to fix a conflict in common/options.cc to merge here, otherwise we can hand-merge. Are you folks able to repush today?

@Abhishekvrshny

Contributor

Abhishekvrshny commented Aug 2, 2017

@mattbenjamin : Sure, let me rebase and repush.

rgw : fix race in RGWCompleteMultipart
Fixes : http://tracker.ceph.com/issues/20861

Signed-off-by: Abhishek Varshney <abhishek.varshney@flipkart.com>
@Abhishekvrshny

Contributor

Abhishekvrshny commented Aug 2, 2017

@mattbenjamin

Contributor

mattbenjamin commented Aug 2, 2017

thanks @Abhishekvrshny !

@mattbenjamin mattbenjamin merged commit 05cec0a into ceph:master Aug 2, 2017

4 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded
@alex-sainer

alex-sainer commented Dec 11, 2017

having a problem here with this fix being merged:
when I do a multipart upload to a resharded bucket, every call to CompleteMultipartUpload fails with "RGWCompleteMultipart::execute() failed to acquire lock" and the corresponding S3 error...

Reproduced on testing servers and production servers.

Any suggestions for this?

@cbodley

Contributor

cbodley commented Dec 11, 2017

hi @alex-sainer, i wasn't able to reproduce that behavior in a simple test:

    s3cmd mb s3://BUCKET
    s3cmd put 256m.iso s3://BUCKET
    radosgw-admin reshard add --bucket BUCKET --num-shards 8
    radosgw-admin reshard process
    s3cmd put 256m.iso s3://BUCKET

are your multipart uploads to new or preexisting objects?

can you please open a tracker issue at http://tracker.ceph.com/projects/rgw/issues/new and include a radosgw log with debug_rgw=20?

cc @oritwas

@kbabioch

kbabioch commented Dec 11, 2017

@cbodley: Here is the log you asked for. Multipart upload is definitely broken with 10.2.10 :-/

@cbodley

Contributor

cbodley commented Dec 11, 2017

thanks @kbabioch, i added a comment there to ask for output with debug_ms=1 as well

@kbabioch

kbabioch commented Dec 11, 2017

@cbodley @alex-sainer I've provided the log, thanks for looking into it ;).
