Pruned clocks and failure of full-sync replication #1864

Open
martinsumner opened this issue Jul 13, 2023 · 2 comments

Comments

@martinsumner
Contributor

martinsumner commented Jul 13, 2023

The vector clock on an object will increase in size for every new vnode_id which attempts to co-ordinate a PUT for that object. The increase is limited by the pruning limits - young_vclock, old_vclock, small_vclock and big_vclock. The primary limit is small_vclock, which will cause the clock of an object hitting this limit (50 unique vnodes by default) to be pruned.
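
For reference, these limits are ordinary bucket properties. A quick way to inspect them from a node's console is sketched below - this assumes riak_core_bucket:get_bucket/1 is available on the node and uses a placeholder bucket name:

%% Read the pruning-related properties for an example bucket
Props = riak_core_bucket:get_bucket(<<"bucket">>).
[{P, proplists:get_value(P, Props)} || P <- [young_vclock, old_vclock, small_vclock, big_vclock]].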

The pruning will normally cut back by one entry only - i.e. a clock that has reached a size of 51 will be pruned back to 50. There are some controls which may prevent this pruning - young_vclock, old_vclock - but by default, with big_vclock also set to 50, clocks will remain at this limit.
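
As an illustration of that prune-by-one behaviour, here is a simplified shell sketch. It models clock entries as {VnodeId, Counter, Timestamp} tuples and ignores the young_vclock/old_vclock age checks - it is not riak_core's vclock code:

%% A clock that has grown to 51 entries is pruned back to BigVclock (50)
%% by dropping its oldest entries - here, just the single oldest one.
BigVclock = 50.
Clock51 = [{VnodeId, 1, VnodeId} || VnodeId <- lists:seq(1, 51)].
PruneFun = fun(Clock) when length(Clock) > BigVclock -> lists:nthtail(length(Clock) - BigVclock, lists:keysort(3, Clock)); (Clock) -> Clock end.
length(PruneFun(Clock51)).   % 50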

In one key situation, pruning is disabled: when an update is a consequence of read_repair, the clock will not be pruned, so objects can be persisted with clocks longer than 50.

In a multi-cluster scenario, after read repairs there can be a circumstance where some objects have clocks of length > small_vclock on one cluster, but of length == small_vclock on another.

If there is now an attempt to full-sync between the clusters, the delta will be detected - but when the object is dispatched from the cluster with the dominant (longer) clock, the receiving cluster will prune the inbound clock ... and so the delta persists. In this case the objects are the same, but the full-sync mechanism cannot determine that they are the same - so it continuously discovers differences.
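
A minimal shell illustration of why the delta never resolves - this assumes, as a simplification, that the full-sync comparison reduces to hashing a key together with its clock; the clocks and key below are invented for the example:

%% The same logical object on both clusters, but the receiving cluster
%% pruned the inbound clock back to 50 entries on receipt.
ClockA = [{VnodeId, 1, VnodeId} || VnodeId <- lists:seq(1, 51)].
ClockB = lists:nthtail(1, ClockA).
erlang:phash2({<<"key">>, ClockA}) =:= erlang:phash2({<<"key">>, ClockB}).
%% false - the hashes never match, so every full-sync run reports this key as a delta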

Clocks being at the pruning length is more likely to occur on very long-lived clusters which have been subject to lots of leaves/joins - but as many users have been running Riak for > 10 years this is increasingly probable.

@martinsumner
Contributor Author

There is a workaround to "touch" (i.e. GET, then PUT back the object) all objects which have exceeded the prune size.

i.e.

%% From a riak attach / remote_console session, obtain a local client:
{ok, CH} = riak:local_client().
%% "Touch" an object - GET it then PUT it straight back, so the co-ordinated PUT prunes the clock:
TouchFun = fun({B, K, _C}) -> {ok, Obj} = riak_client:get(B, K, CH), ok = riak_client:put(Obj, CH) end.
%% Build 512 fetch_clocks_range queries, each covering 128 segments of the small TicTac tree:
QueriesFun = fun(Bucket) -> lists:map(fun(I) -> {fetch_clocks_range, Bucket, all, {segments, lists:seq((128 * I) + 1, 128 * (I + 1)), small}, all} end, lists:seq(0, 511)) end.
%% For each query, touch every object whose clock has more than M entries:
TouchGreaterThanFun = fun(B, M) -> lists:foreach(fun(Q) -> lists:foreach(TouchFun, lists:filter(fun({_B0, _K0, C0}) -> length(C0) > M end, element(2, riak_client:aae_fold(Q, CH)))) end, QueriesFun(B)) end.
TouchGreaterThanFun(<<"clinicalsRecord">>, 50).

This is only possible when tictacaae is enabled, to allow for aae_folds.
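
If it is useful to gauge the scale of the problem before touching anything, the same queries can be used to count rather than touch - a variation on the snippet above, re-using CH and QueriesFun from the same shell session (CountGreaterThanFun is just a name for this example):

%% Count objects whose clock has more than M entries, without modifying them
CountGreaterThanFun = fun(B, M) -> lists:sum(lists:map(fun(Q) -> length(lists:filter(fun({_B0, _K0, C0}) -> length(C0) > M end, element(2, riak_client:aae_fold(Q, CH)))) end, QueriesFun(B))) end.
CountGreaterThanFun(<<"clinicalsRecord">>, 50).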

@martinsumner
Contributor Author

It should be noted that read-repair will also override any sibling limit, so this can also lead to full-sync discrepancies. Hitting a sibling limit is an indication of a broader fault, whereas clock pruning is eventually inevitable, and so pruning is the primary concern here.
