rgw: optimize data sync. Add zones_trace in log to avoid needless sync. #13851

Merged
merged 1 commit into ceph:master on May 30, 2017

Conversation

@zhangsw
Contributor

zhangsw commented Mar 8, 2017

With two zones A and B, if we upload an object obj to A, the data sync process is:

  1. B gets the bi log from A;
  2. B fetches obj from A according to the bi log, and then adds a new bi log entry locally;
  3. A gets the bi log from B;
  4. A fetches obj from B according to the bi log, and B returns 304.

Step 4 isn't needed. Besides, some strange problems may occur because of it. For example, a user uploads obj and then deletes it:
In zone A, the user has deleted obj, but the delete operation hasn't been synced yet.
Then, in step 4, A fetches obj from B. Because A's copy of obj has been deleted, B returns the object's data.
A then adds a new bi log entry saying obj was created, and syncs this bi log to B.
Now B has one delete operation followed by one put operation. If these two operations land in the same squash map, the put overrides the delete.
In the end, obj is never deleted.

To avoid step 4, I add a zones_trace to the bi log entry so that we can decide in step 3 whether an operation still needs to be synced. The zones_trace records the zones that have already applied the operation.

Fixes: http://tracker.ceph.com/issues/19219

Signed-off-by: Zhang Shaowen zhangshaowen@cmss.chinamobile.com

@zhangsw


Contributor

zhangsw commented Mar 8, 2017

@yehudasa @cbodley please help review this, thanks~

@cbodley


Contributor

cbodley commented Mar 8, 2017

hi @zhangsw, thanks for the detailed analysis. @mattbenjamin and i had discussed something like your zones_trace in the past

Now B will have one delete operation followed by one put operation. If these two operations are in the squash map, The put will override delete. At last, the obj will not be deleted.

the squash map and tombstone cache should do the right thing here, in the common case. the tombstone cache remembers the mtime of the deleted object, so we can avoid fetching the old object from other zones in process 4. the tombstone cache isn't perfect though, because it doesn't cover cases where the gateway restarts after the delete, or zones that have multiple gateways

so i agree that your approach here is well-motivated

@cbodley cbodley self-assigned this Mar 8, 2017

@yehudasa


Member

yehudasa commented Mar 8, 2017

@zhangsw this will work well with 2 zones. Not sure whether we'll still have the same issues with more than 2 zones.

@zhangsw


Contributor

zhangsw commented Mar 9, 2017

@cbodley Yeah~ I missed the tombstone cache. Now I see the tombstone cache can prevent that issue if a zone has only one gateway. In my test, I have two zones and each zone has two gateways: one for internal sync and the other for public service. I used cosbench to test put and delete, and the issue occurred sometimes. Like you said, the tombstone cache can't work well in such situations. And unfortunately, our production environment with two zones is really big and we have many gateways in each zone (the two zones are far away from each other), so issues like that may also occur there.

@zhangsw


Contributor

zhangsw commented Mar 9, 2017

@yehudasa I think such issues shouldn't occur with multiple zones if we add the zones_trace. The root cause of these issues is that we have a ring in sync. For example, suppose we have an operation op1 in zone A. Without zones_trace, this operation may be applied again: its trace is A->B->A. We can break this ring with zones_trace.

Now suppose we have more than two zones and do the actual operations op1, op2, ..., opn in A. Because any sync of these operations coming back from another zone will be skipped in A, zone A always ends up in the right state. For simplicity, take three zones: zone B's sync traces are A->B and A->C->B, and each trace delivers the operations op1, op2, ..., opn. Fortunately, what we actually sync is just 'add an object' or 'remove an object', so the final state is decided by the last op. In both traces the last operation is opn, the same as in A, so the other zones also end up in the right state.

@yehudasa


Member

yehudasa commented Mar 9, 2017

@zhangsw with more than 2 zones, an operation may arrive via multiple paths, e.g. A->B->C and A->C->B.

@zhangsw


Contributor

zhangsw commented Mar 10, 2017

@yehudasa Yes. But for a specific zone like B, the operation paths are A->B and A->C->B. The path A->B->C delivers the operation to C and has no effect on B, because we have broken the sync ring (there is no A->B->C->B).

@zhangsw


Contributor

zhangsw commented Mar 30, 2017

@cbodley @yehudasa Any advice about this PR?

@cbodley


Contributor

cbodley commented Mar 30, 2017

hi @zhangsw, i'm keen to do some more review/testing on this

to start, i'm trying to fudge things so i can reproduce one of these sync cycles that we used to see. then, i can apply your changes to see that it properly avoids the cycle. but so far, i'm not seeing any cycles after disabling both the tombstone_cache and squash_map

i'll keep you posted

@zhangsw


Contributor

zhangsw commented Mar 31, 2017

@cbodley Take A->B->A: in the last step, A requests obj from B with an mtime, B compares that mtime with its local copy's mtime, and returns error 304. So B doesn't return the data, but the request is still redundant. I think this part isn't affected by the tombstone_cache or squash_map. As for the other problem, deleting right after putting an object, I think we shouldn't disable the squash_map if we want to reproduce it. Admittedly this problem isn't easy to reproduce: the delete must happen right after the last step mentioned above, so that A fetches real data from B, and the delete and put operations synced to B must land in exactly the same squash map, so that the put overrides the delete.

@zhangsw


Contributor

zhangsw commented Apr 24, 2017

Rebased to solve the conflicts.

```diff
@@ -128,6 +128,9 @@ static int log_index_operation(cls_method_context_t hctx, cls_rgw_obj_key& obj_k
   if (owner_display_name) {
     entry.owner_display_name = *owner_display_name;
   }
+  if (zones_trace) {
+    entry.zones_trace = *zones_trace;
+  }
```


@cbodley

cbodley May 16, 2017

Contributor

can we avoid copy here with entry.zones_trace = std::move(*zones_trace);?


@zhangsw

zhangsw May 18, 2017

Contributor

yes~ I've changed it

```diff
@@ -506,11 +506,12 @@ struct rgw_bi_log_entry {
   uint16_t bilog_flags;
   string owner; /* only being set if it's a delete marker */
   string owner_display_name; /* only being set if it's a delete marker */
+  set<string> zones_trace;
```


@cbodley

cbodley May 16, 2017

Contributor

could you add a rgw_zone_set typedef for this and use it throughout?

using rgw_zone_set = std::set<std::string>;

this would enable a couple of future optimizations, like using a basic_sstring sized to fit uuids (36 + \n), or boost::container::flat_set to cut down on its allocations


@zhangsw

zhangsw May 18, 2017

Contributor

done

@cbodley


Contributor

cbodley commented May 16, 2017

needs a rebase, but it's working well so far in testing 👍

rgw: optimize data sync. Add zones_trace in bi log to avoid needless sync.

Fixes: http://tracker.ceph.com/issues/19219

Signed-off-by: Zhang Shaowen <zhangshaowen@cmss.chinamobile.com>
@zhangsw


Contributor

zhangsw commented May 18, 2017

@cbodley I've rebased this

@cbodley

was seeing some intermittent failures from test_versioned_object_incremental_sync(), but those are unrelated and addressed in #13426

@cbodley cbodley merged commit 6049d97 into ceph:master May 30, 2017

3 checks passed

- Signed-off-by: all commits in this PR are signed
- Unmodified Submodules: submodules for project are unmodified
- default: Build finished.