New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: fix versioned bucket data sync fail when upload is busy #12357

Merged
merged 1 commit into from Jan 18, 2017

Conversation

Projects
None yet
4 participants
@dongbula

dongbula commented Dec 7, 2016

http://tracker.ceph.com/issues/18208

In multisite cluster, upload an file to master zone two times in a short time, you will find the first version sync to slave zone failed, this fix it.

@idealguo

This comment has been minimized.

idealguo commented Dec 8, 2016

Signed-off-by and tracker

@cbodley cbodley self-assigned this Dec 13, 2016

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 13, 2016

@dongbula thanks for taking this on. i added a test for it at #12474, and verified that your fix works as intended

i do still see an issue with this fix though. the squash_map is important here because these sync entries are processed in parallel. so if there's an OP_WRITE entry followed by OP_DEL on the same object, we can't guarantee that the OP_WRITE will sync before the OP_DEL - if the order gets switched, then this zone would up with that object still existing, while it didn't in the source zone

so we use the squash_map to make sure that we only apply the most recent entry for any given object. but by ignoring the squash_map for versioned entries, we wouldn't know to squash a LINK_OLH entry followed by an UNLINK_INSTANCE on the same object instance, and we could end up in the same situation

i think the ideal solution would involve extending the squash_map key to include the object instance along with the object name, so we can track each version separately. something like this:

-  map<string, pair<real_time, RGWModifyOp> > squash_map;
+  map<pair<string, string>, pair<real_time, RGWModifyOp> > squash_map;

then below:

-        auto& squash_entry = squash_map[e.object];
+        auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
         if (squash_entry.first == e.timestamp &&
-            e.op == CLS_RGW_OP_DEL) {
+            (e.op == CLS_RGW_OP_DEL || e.op == CLS_RGW_OP_UNLINK_INSTANCE)) {
           squash_entry.second = e.op;
         } else if (squash_entry.first < e.timestamp) {
           squash_entry = make_pair<>(e.timestamp, e.op);
         }

does that make sense?

@dongbula

This comment has been minimized.

dongbula commented Dec 14, 2016

@cbodley Hi,
I understand the idea that

use the squash_map to make sure that we only apply the most recent entry for any given object

, but I still feel puzzled about these code:

       if (squash_entry.first == e.timestamp &&
              e.op == CLS_RGW_OP_DEL) {
            squash_entry.second = e.op;
         }

Is it means that OP_WRITE and OP_DEL entry may have same timestamp?

Considering that the instances of OP_WRITE entry for one object are usually different(just like write operations for different objects in non-versioned situation), I modify the code to make the squash_map just work in non-versioned bucket. That is my original idea, maybe it's not enough.

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 14, 2016

Is it means that OP_WRITE and OP_DEL entry may have same timestamp?

good point - these are high-resolution timestamps, so that shouldn't be possible (right @yehudasa?). it should also be safe to assume that the bilog entries are already ordered by timestamp, so we could probably replace that whole block with:

-        auto& squash_entry = squash_map[e.object];
+        auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
-        if (squash_entry.first == e.timestamp &&
-            e.op == CLS_RGW_OP_DEL) {
-          squash_entry.second = e.op;
-        } else if (squash_entry.first < e.timestamp) {
-          squash_entry = make_pair<>(e.timestamp, e.op);
-        }
+        squash_entry = make_pair(e.timestamp, e.op);

I modify the code to make the squash_map just work in non-versioned bucket. That is my original idea, maybe it's not enough.

we still need to squash entries for versioned buckets, or we'll run into ordering problems between LINK_OLH and UNLINK_INSTANCE. the squash_map is what prevents them from running in parallel

(http://tracker.ceph.com/issues/15542 has a bit more about the bug that was fixed by adding the squash_map)

@yehudasa

This comment has been minimized.

Member

yehudasa commented Dec 14, 2016

@cbodley iirc for OP_DEL we're storing the removed object's mtime, not when it was removed

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 14, 2016

@yehudasa maybe you're thinking about the obj_tombstone_cache in RGWRados? these are the timestamps in rgw_bi_log_entry that we're using for the squash_map to resolve the create/delete races in bucket sync that @theanalyst had found. we're discussing whether this check is necessary:

         if (squash_entry.first == e.timestamp &&
              e.op == CLS_RGW_OP_DEL) {
            squash_entry.second = e.op;
         }

the timestamps were converted to from utime_t to ceph::real_time, so it's possible that sync is reading old entries that were encoded with 1-second resolution - so the timestamps could compare equal. but my point is that the bilog is ordered, so i think that the squash_map should always take the most recent entry

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 14, 2016

@dongbula i updated the test in #12474 to exercise this squash logic. you can run it from your cmake build directory with:
$ nosetests --nocapture ../src/test/rgw/test_multi.py:test_versioned_object_incremental_sync

@yehudasa

This comment has been minimized.

Member

yehudasa commented Dec 14, 2016

@cbodley no, I'm really talking about bucket index log (also verified it now). The timestamp entries that we keep for the delete operations are the object's mtime. Note that although bilog is ordered, entries can arrive out of order as you might have multiple rgws feeding into it, so need to check timestamps.

@yehudasa

This comment has been minimized.

Member

yehudasa commented Dec 14, 2016

@cbodley ahrm.. ah, reading the cls/rgw code again, committed operations will be written to the bi log in order. That been said, I think we should still check timestamp, bugs may happen.

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 14, 2016

@yehudasa okay, thanks! so it looks like the first diff i posted, that also checks for OP_UNLINK_INSTANCE in that comparison, would be the way to go

@dongbula

This comment has been minimized.

dongbula commented Dec 15, 2016

@cbodley @yehudasa thanks
Well, squashing entries seems to be necessary whether the bucket is versioned or not. And, so, the comparison for OP_UNLINK_INSTANCE/OP_DEL and LINK_OLH/OP_ADD entries with same timestamp is not necessary, isn't it?

I still have a question about:

the timestamps were converted to from utime_t to ceph::real_time, so it's possible that sync is reading old entries that were encoded with 1-second resolution - so the timestamps could compare equal.

When the entry will be encoded with 1-second resolution?

I modify the code to do squash for all entries, and replace CLS_RGW_OP_DEL with CLS_RGW_OP_LINK_OLH to make versioned upload sync normally. I think that is necessary.

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 15, 2016

When the entry will be encoded with 1-second resolution?

the type of this timestamp was changed for jewel i think? so any entries that were written before that upgrade would be decoded in seconds

I modify the code to do squash for all entries, and replace CLS_RGW_OP_DEL with CLS_RGW_OP_LINK_OLH to make versioned upload sync normally. I think that is necessary.

looks good, except i'm not sure why you took the OP_DEL part out of that comparison?

@dongbula

This comment has been minimized.

dongbula commented Dec 15, 2016

@cbodley ah, i see, the comparison for OP_DEL is for compatibility with old entries. I will add that.

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 15, 2016

some extra notes from testing..

uploading an object to a versioned bucket results in two bilog entries with the same timestamp, OP_ADD and OP_LINK_OLH. so in this case, the squash_map needs to overwrite the OP_ADD with OP_LINK_OLH

deleting an object instance results in OP_UNLINK_INSTANCE and OP_DEL, both with a matching timestamp (but different from the ADD/LINK_OLH). so the existing logic for if (squash_entry.first == e.timestamp && e.op == CLS_RGW_OP_DEL) covers this case

in both cases, when timestamps match, we want to apply the latest bilog entry. even in non-versioned buckets with low-resolution timestamps, if there are entries for OP_ADD, OP_DEL, OP_ADD, we want to apply the last OP_ADD. i think that something like this would work best:

-        auto& squash_entry = squash_map[e.object];
+        auto& squash_entry = squash_map[make_pair(e.object, e.instance)];
-        if (squash_entry.first == e.timestamp &&
-            e.op == CLS_RGW_OP_DEL) {
-          squash_entry.second = e.op;
-        } else if (squash_entry.first < e.timestamp) {
+        if (squash_entry.first <= e.timestamp) {
           squash_entry = make_pair<>(e.timestamp, e.op);
         }
@dongbula

This comment has been minimized.

dongbula commented Dec 16, 2016

good, that's concise, one more question

Note that although bilog is ordered, entries can arrive out of order as you might have multiple rgws feeding into it, so need to check timestamps.

Is it safety to assume that the OP_DEL is behind OP_ADD in entry when OP_ADD and OP_DEL own same timestamp which are encoded with 1-second resolution?

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 16, 2016

Is it safety to assume that the OP_DEL is behind OP_ADD in entry when OP_ADD and OP_DEL own same timestamp which are encoded with 1-second resolution?

if the DEL entry comes later in the log, yes

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 19, 2016

looks good to me 👍 it's passing test_multi.py, including the new test_versioned_object_incremental_sync from #12474

could you please squash this into a single commit so we can merge?

@dongbula

This comment has been minimized.

dongbula commented Dec 20, 2016

ok, I have squashed the commits

@cbodley

This comment has been minimized.

Contributor

cbodley commented Dec 20, 2016

@dongbula thank you. could you add the line Fixes: http://tracker.ceph.com/issues/18208 please?

lvshuhua
rgw: fix versioned bucket data sync fail when upload is busy
Fixes: http://tracker.ceph.com/issues/18208

Signed-off-by: lvshuhua <lvshuhua@cmss.chinamobile.com>
@dongbula

This comment has been minimized.

dongbula commented Dec 21, 2016

okey, have added it

@cbodley cbodley merged commit 37ff492 into ceph:master Jan 18, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details
default Build finished.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment