Use Tictac AAE for hints for more repairs #1805

Closed
martinsumner opened this issue Nov 12, 2021 · 0 comments

The Tictac AAE process is deficient when compared with the "legacy" riak_kv_index_hashtree approach to AAE, in that it repairs data much more slowly.

Repairs must be slower because the cost of running the fetch_clocks comparison is relatively high in Tictac AAE: each comparison requires a full scan of the vnode key store (albeit skipping any blocks that don't include relevant segments). The cost of fetch_clocks is proportional to the number of segments in the results, as defined by tictacaae_maxresults:

%% @doc Max number of leaf IDs per exchange
%% To control the length of time for each exchange, only a subset of the
%% conflicting leaves will be compared on each exchange.  If there are issues
%% with query timeouts this may be halved.  Large backlogs may be reduced
%% faster by doubling.  There are 1M segments in a standard tree overall.
{mapping, "tictacaae_maxresults", "riak_kv.tictacaae_maxresults", [
  {datatype, integer},
  {default, 256},
  hidden
]}.

On very large vnodes, checking 256 segments in a fold can be time-consuming, and the AAE process will back off if too much time is consumed - but this limits the number of repairs per exchange to approximately 256, and the back-off then throttles the repair speed even further.
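As a rough sketch of that constraint (illustrative names, not the riak_kv API): only a capped subset of the dirty leaves is carried into each expensive fold, so the remainder must wait for later exchanges:

%% Sketch: cap the conflicting leaf (segment) IDs compared per exchange.
%% Leaves beyond the cap wait for a later exchange, so with the default
%% cap of 256 an exchange can surface at most ~256 segments of repairs.
select_leaves_for_exchange(ConflictingLeaves, MaxResults) ->
    lists:sublist(lists:sort(ConflictingLeaves), MaxResults).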

The proposed enhancement is to add two new workers per node:

  • riak_kv_tictacaae_monitor
  • riak_kv_readrepair_pool

Tictac AAE Monitor

Results from clock comparisons will now be sent to riak_kv_tictacaae_monitor rather than directly to read repair, along with the n-val and partition information pertinent to the exchange. The monitor will prompt the read repairs as now, but will also accumulate a per-bucket and per-nval list of LastModifiedDates for those keys where a repair was required.
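A minimal sketch of that accumulation (the record and function names are assumptions for illustration, not part of riak_kv):

%% Sketch: fold keys needing repair into per-bucket and per-nval
%% lists of last-modified dates. Illustrative names only.
-record(repair_stats, {by_bucket = #{} :: map(),
                       by_nval   = #{} :: map()}).

record_repair({Bucket, _Key}, NVal, LastModDate, Stats) ->
    #repair_stats{by_bucket = ByB, by_nval = ByN} = Stats,
    Stats#repair_stats{
        by_bucket = maps:update_with(Bucket,
                                     fun(Ds) -> [LastModDate|Ds] end,
                                     [LastModDate], ByB),
        by_nval   = maps:update_with(NVal,
                                     fun(Ds) -> [LastModDate|Ds] end,
                                     [LastModDate], ByN)}.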

If there is a bucket with > 25% of the repairs, then a new fetch_clocks comparison should be prompted between the vnodes. This time the comparison should be constrained not by segment but by bucket and by a last-modified date range that covers at least 25% of the discovered repairs (i.e. using fetch_clocks_range).

Otherwise the nval list should be checked, and for the nval with the most required repairs, a last-modified date range for a fetch_clocks_nval should be determined that covers at least 25% of the repairs.

Having discovered a potentially broken part of the store, the riak_kv_tictacaae_monitor should now repeat a fetch_clocks_range/fetch_clocks_nval comparison for the discovered range, and prompt any discrepancies for read repair.
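A sketch of the decision the monitor might make (the 25% threshold is from the text above; the helper and function names are illustrative assumptions):

%% Sketch: choose a prompted comparison from accumulated repair stats.
choose_prompted_query(#repair_stats{by_bucket = ByB, by_nval = ByN},
                      TotalRepairs) ->
    case dominant_entry(ByB, TotalRepairs) of
        {ok, Bucket, Dates} ->
            %% A single bucket held > 25% of repairs: constrain the
            %% next comparison by bucket and last-modified range.
            {fetch_clocks_range, Bucket, date_range(Dates)};
        none ->
            %% Otherwise use the nval with the most required repairs.
            {NVal, Dates} = busiest(ByN),
            {fetch_clocks_nval, NVal, date_range(Dates)}
    end.

%% Return {ok, K, Dates} if some key accounts for > 25% of repairs.
dominant_entry(Map, Total) ->
    case [{K, Ds} || {K, Ds} <- maps:to_list(Map),
                     length(Ds) * 4 > Total] of
        [{K, Ds}|_] -> {ok, K, Ds};
        []          -> none
    end.

busiest(Map) ->
    hd(lists:sort(fun({_, A}, {_, B}) -> length(A) >= length(B) end,
                  maps:to_list(Map))).

%% A last-modified range wide enough to cover the observed repairs.
date_range(Dates) ->
    {lists:min(Dates), lists:max(Dates)}.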

In general we expect AAE to be repairing either data loss from disk, or data loss due to downtime and lack of handoff (e.g. recovery from backup). In the former case, given that each store is ordered either by time (bitcask, leveled journal) or by key (eleveldb, leveled ledger), we would expect a per-bucket or time-range concentration of damage. In the latter case it is likely that a recent last-modified date range has been impacted.

The aim would be to reduce the default tictacaae_maxresults to 64 (to make each exchange fast), and then limit results on the prompted range comparison to a much larger value - perhaps changing fetch_clocks to fetch only the first 8K keys.

Prompted comparisons by the AAE monitor should be controlled like this:

%% @doc Max number of keys for a prompted comparison
%% Following the discovery of a potential delta via tictac aae, the tictac aae
%% monitor should attempt to compare this many keys in any bucket or date
%% range which has been shown by the aae process to have a high density of
%% required repairs.
%% Set to 0 to disable prompted comparisons
{mapping, "tictacaae_maxresults_prompted", "riak_kv.tictacaae_maxresults_prompted", [
  {datatype, integer},
  {default, 8192},
  hidden
]}.

Read Repair Pool

Read repair is not constrained in Riak, other than via the soft-cap/hard-cap dice-rolling shown below -

%% based on what the get_put_monitor stats say, and a random roll, potentially
%% skip read-repair
%% On a very busy system with many writes and many reads, it is possible to
%% get overloaded by read-repairs. By occasionally skipping read_repair we
%% can keep the load more manageable; ie the only load on the system becomes
%% the gets, puts, etc.
maybe_read_repair(Indices, RepairObj, UpdStateData) ->
    HardCap = app_helper:get_env(riak_kv, read_repair_max),
    SoftCap = app_helper:get_env(riak_kv, read_repair_soft, HardCap),
    Dorr = determine_do_read_repair(SoftCap, HardCap),
    if
        Dorr ->
            read_repair(Indices, RepairObj, UpdStateData);
        true ->
            ok = riak_kv_stat:update(skipped_read_repairs),
            skipping
    end.

determine_do_read_repair(_SoftCap, HardCap) when HardCap == undefined ->
    true;
determine_do_read_repair(SoftCap, HardCap) ->
    Actual = riak_kv_util:gets_active(),
    determine_do_read_repair(SoftCap, HardCap, Actual).

determine_do_read_repair(undefined, HardCap, Actual) ->
    determine_do_read_repair(HardCap, HardCap, Actual);
determine_do_read_repair(_SoftCap, HardCap, Actual) when HardCap =< Actual ->
    false;
determine_do_read_repair(SoftCap, _HardCap, Actual) when Actual =< SoftCap ->
    true;
determine_do_read_repair(SoftCap, HardCap, Actual) ->
    Roll = roll_d100(),
    determine_do_read_repair(SoftCap, HardCap, Actual, Roll).

determine_do_read_repair(SoftCap, HardCap, Actual, Roll) ->
    AdjustedActual = Actual - SoftCap,
    AdjustedHard = HardCap - SoftCap,
    Threshold = AdjustedActual / AdjustedHard * 100,
    Threshold =< Roll.

-ifdef(TEST).
roll_d100() ->
    fsm_eqc_util:get_fake_rng(get_fsm_eqc).
-else.
% technically not a d100 as it has a 0
roll_d100() ->
    rand:uniform(101) - 1.
-endif.

With the availability of the flexible riak_core_worker_pool, it would be better to constrain read repair instead by having a fixed-size pool of workers that run the GETs which prompt the read repairs. The queue_time and work_time associated with these read repairs can then be monitored as with other pools post Riak 3.0.9.

The worker_pool should also allow a limit on the size of the queue.
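A minimal sketch of the intended shape (the module, message protocol, and function names are assumptions for illustration; this is not the real riak_core_worker_pool API):

-module(readrepair_pool_sketch).
-export([start/2, submit/2]).

%% Start PoolSize workers that drain one bounded queue process.
start(PoolSize, MaxQueue) ->
    QueuePid = spawn(fun() -> queue_loop(queue:new(), 0, MaxQueue, []) end),
    [spawn(fun() -> worker_loop(QueuePid) end) || _ <- lists:seq(1, PoolSize)],
    QueuePid.

%% Submit a zero-arity fun performing the GET that prompts the repair.
%% Returns {error, queue_full} when the bounded queue sheds load - the
%% moral equivalent of today's skipped_read_repairs.
submit(QueuePid, RepairFun) when is_function(RepairFun, 0) ->
    QueuePid ! {submit, self(), RepairFun},
    receive {QueuePid, Reply} -> Reply after 5000 -> {error, timeout} end.

queue_loop(Q, Len, Max, Idle) ->
    receive
        {submit, From, _Fun} when Len >= Max ->
            %% Bounded queue: shed load rather than grow without limit.
            From ! {self(), {error, queue_full}},
            queue_loop(Q, Len, Max, Idle);
        {submit, From, Fun} when Idle =/= [] ->
            [W | Rest] = Idle,
            W ! {work, Fun},
            From ! {self(), ok},
            queue_loop(Q, Len, Max, Rest);
        {submit, From, Fun} ->
            From ! {self(), ok},
            queue_loop(queue:in(Fun, Q), Len + 1, Max, Idle);
        {ready, W} ->
            case queue:out(Q) of
                {{value, Fun}, Q2} ->
                    W ! {work, Fun},
                    queue_loop(Q2, Len - 1, Max, Idle);
                {empty, _} ->
                    queue_loop(Q, Len, Max, [W | Idle])
            end
    end.

worker_loop(QueuePid) ->
    QueuePid ! {ready, self()},
    receive {work, Fun} -> catch Fun() end,
    worker_loop(QueuePid).

The dequeue point in such a loop is also where queue_time and work_time could be measured, matching the stats taken for other pools.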

For configuration of the pool, the following is proposed:

%% @doc Pool Sizes - sizes for individual node_worker_pools
%% ...
{mapping, "repair_worker_pool_size", "riak_kv.repair_worker_pool_size", [
  {datatype, integer},
  {default, 4}
]}.