Use Tictac AAE for hints for more repairs #1805

Closed
martinsumner opened this issue Nov 12, 2021 · 0 comments

The Tictac AAE process is deficient when compared with the "legacy" riak_kv_index_hashtree approach to AAE, in that it repairs data much more slowly.

Repairs must be slower because the cost of running the fetch_clocks comparison is relatively high in Tictac AAE: each comparison requires a full scan of the vnode key store (albeit skipping any blocks that don't include relevant segments). The cost of fetch_clocks is proportional to the number of segments in the results, as defined by tictacaae_maxresults:

%% @doc Max number of leaf IDs per exchange
%% To control the length of time for each exchange, only a subset of the
%% conflicting leaves will be compared on each exchange.  If there are issues
%% with query timeouts this may be halved.  Large backlogs may be reduced
%% faster by doubling.  There are 1M segments in a standard tree overall.
{mapping, "tictacaae_maxresults", "riak_kv.tictacaae_maxresults", [
  {datatype, integer},
  {default, 256},
  hidden
]}.

On very large vnodes, checking 256 segments in a fold can be time-consuming, and the AAE process will back off if too much time is consumed - but this limits the number of repairs per exchange to approximately 256, and the back-off then throttles the repair speed even further.
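As a rough sketch of that constraint (illustrative names, not the riak_kv API): only a capped subset of the dirty leaves is carried into each expensive fold, so the remainder must wait for later exchanges:

%% Sketch: cap the conflicting leaf (segment) IDs compared per exchange.
%% Leaves beyond the cap wait for a later exchange, so with the default
%% cap of 256 an exchange can surface at most ~256 segments of repairs.
select_leaves_for_exchange(ConflictingLeaves, MaxResults) ->
    lists:sublist(lists:sort(ConflictingLeaves), MaxResults).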

The proposed enhancement is to add two new workers per node:

  • riak_kv_tictacaae_monitor
  • riak_kv_readrepair_pool

Tictac AAE Monitor

Results from clock comparisons will now be sent to riak_kv_tictacaae_monitor rather than directly to read repair, along with the n-val and partition information pertinent to the exchange. The monitor will prompt the read repairs as now, but will also accumulate a per-bucket and per-nval list of LastModifiedDates for those keys where a repair was required.
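A minimal sketch of that accumulation (the record and function names are assumptions for illustration, not part of riak_kv):

%% Sketch: fold keys needing repair into per-bucket and per-nval
%% lists of last-modified dates. Illustrative names only.
-record(repair_stats, {by_bucket = #{} :: map(),
                       by_nval   = #{} :: map()}).

record_repair({Bucket, _Key}, NVal, LastModDate, Stats) ->
    #repair_stats{by_bucket = ByB, by_nval = ByN} = Stats,
    Stats#repair_stats{
        by_bucket = maps:update_with(Bucket,
                                     fun(Ds) -> [LastModDate|Ds] end,
                                     [LastModDate], ByB),
        by_nval   = maps:update_with(NVal,
                                     fun(Ds) -> [LastModDate|Ds] end,
                                     [LastModDate], ByN)}.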

If there is a bucket with > 25% of the repairs, then a new fetch_clocks comparison should be prompted between the vnodes. This time the comparison should be constrained not by segment but by bucket and by a last-modified date range that covers at least 25% of the discovered repairs (i.e. using fetch_clocks_range).

Otherwise the nval list should be checked, and for the nval with the most required repairs, a last-modified date range for a fetch_clocks_nval should be determined that covers at least 25% of the repairs.

Having discovered a potentially broken part of the store, the riak_kv_tictacaae_monitor should now repeat a fetch_clocks_range/fetch_clocks_nval comparison for the discovered range, and prompt any discrepancies for read repair.
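A sketch of the decision the monitor might make (the 25% threshold is from the text above; the helper and function names are illustrative assumptions):

%% Sketch: choose a prompted comparison from accumulated repair stats.
choose_prompted_query(#repair_stats{by_bucket = ByB, by_nval = ByN},
                      TotalRepairs) ->
    case dominant_entry(ByB, TotalRepairs) of
        {ok, Bucket, Dates} ->
            %% A single bucket held > 25% of repairs: constrain the
            %% next comparison by bucket and last-modified range.
            {fetch_clocks_range, Bucket, date_range(Dates)};
        none ->
            %% Otherwise use the nval with the most required repairs.
            {NVal, Dates} = busiest(ByN),
            {fetch_clocks_nval, NVal, date_range(Dates)}
    end.

%% Return {ok, K, Dates} if some key accounts for > 25% of repairs.
dominant_entry(Map, Total) ->
    case [{K, Ds} || {K, Ds} <- maps:to_list(Map),
                     length(Ds) * 4 > Total] of
        [{K, Ds}|_] -> {ok, K, Ds};
        []          -> none
    end.

busiest(Map) ->
    hd(lists:sort(fun({_, A}, {_, B}) -> length(A) >= length(B) end,
                  maps:to_list(Map))).

%% A last-modified range wide enough to cover the observed repairs.
date_range(Dates) ->
    {lists:min(Dates), lists:max(Dates)}.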

In general we expect AAE to be repairing either data loss from disk, or data loss due to downtime and lack of handoff (e.g. recovery from backup). In the former case, given that each store is ordered either by time (bitcask, leveled journal) or by key (eleveldb, leveled ledger), we would expect a per-bucket or time-range concentration of damage. In the latter case it is likely that a recent last-modified date range has been impacted.

The aim would be to reduce the default tictacaae_maxresults to 64 (to make each exchange fast), and then limit results on the prompted range comparison to a much larger value - perhaps changing fetch_clocks to fetch only the first 8K keys.

Prompted comparisons by the AAE monitor should be controlled like this:

%% @doc Max number of keys for a prompted comparison
%% Following the discovery of a potential delta via tictac aae, the tictac aae
%% monitor should attempt to compare this many keys in any bucket or date
%% range which has been shown by the aae process to have a high density of
%% required repairs.
%% Set to 0 to disable prompted comparisons
{mapping, "tictacaae_maxresults_prompted", "riak_kv.tictacaae_maxresults_prompted", [
  {datatype, integer},
  {default, 8192},
  hidden
]}.

Read Repair Pool

Read repair is not constrained in Riak, other than via the soft-cap/hard-cap dice-rolling shown below -

%% based on what the get_put_monitor stats say, and a random roll, potentially
%% skip read-repair
%% On a very busy system with many writes and many reads, it is possible to
%% get overloaded by read-repairs. By occasionally skipping read_repair we
%% can keep the load more manageable; ie the only load on the system becomes
%% the gets, puts, etc.
maybe_read_repair(Indices, RepairObj, UpdStateData) ->
    HardCap = app_helper:get_env(riak_kv, read_repair_max),
    SoftCap = app_helper:get_env(riak_kv, read_repair_soft, HardCap),
    Dorr = determine_do_read_repair(SoftCap, HardCap),
    if
        Dorr ->
            read_repair(Indices, RepairObj, UpdStateData);
        true ->
            ok = riak_kv_stat:update(skipped_read_repairs),
            skipping
    end.

determine_do_read_repair(_SoftCap, HardCap) when HardCap == undefined ->
    true;
determine_do_read_repair(SoftCap, HardCap) ->
    Actual = riak_kv_util:gets_active(),
    determine_do_read_repair(SoftCap, HardCap, Actual).

determine_do_read_repair(undefined, HardCap, Actual) ->
    determine_do_read_repair(HardCap, HardCap, Actual);
determine_do_read_repair(_SoftCap, HardCap, Actual) when HardCap =< Actual ->
    false;
determine_do_read_repair(SoftCap, _HardCap, Actual) when Actual =< SoftCap ->
    true;
determine_do_read_repair(SoftCap, HardCap, Actual) ->
    Roll = roll_d100(),
    determine_do_read_repair(SoftCap, HardCap, Actual, Roll).

determine_do_read_repair(SoftCap, HardCap, Actual, Roll) ->
    AdjustedActual = Actual - SoftCap,
    AdjustedHard = HardCap - SoftCap,
    Threshold = AdjustedActual / AdjustedHard * 100,
    Threshold =< Roll.

-ifdef(TEST).
roll_d100() ->
    fsm_eqc_util:get_fake_rng(get_fsm_eqc).
-else.
% technically not a d100 as it has a 0
roll_d100() ->
    rand:uniform(101) - 1.
-endif.

With the availability of the flexible riak_core_worker_pool, it would be better to constrain read repair instead by having a fixed-size pool of workers that run the GETs which prompt the read repairs. The queue_time and work_time associated with these read repairs can then be monitored as with other pools post Riak 3.0.9.

The worker_pool should also allow a limit on the size of the queue.
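A minimal sketch of the intended shape (the module, message protocol, and function names are assumptions for illustration; this is not the real riak_core_worker_pool API):

-module(readrepair_pool_sketch).
-export([start/2, submit/2]).

%% Start PoolSize workers that drain one bounded queue process.
start(PoolSize, MaxQueue) ->
    QueuePid = spawn(fun() -> queue_loop(queue:new(), 0, MaxQueue, []) end),
    [spawn(fun() -> worker_loop(QueuePid) end) || _ <- lists:seq(1, PoolSize)],
    QueuePid.

%% Submit a zero-arity fun performing the GET that prompts the repair.
%% Returns {error, queue_full} when the bounded queue sheds load - the
%% moral equivalent of today's skipped_read_repairs.
submit(QueuePid, RepairFun) when is_function(RepairFun, 0) ->
    QueuePid ! {submit, self(), RepairFun},
    receive {QueuePid, Reply} -> Reply after 5000 -> {error, timeout} end.

queue_loop(Q, Len, Max, Idle) ->
    receive
        {submit, From, _Fun} when Len >= Max ->
            %% Bounded queue: shed load rather than grow without limit.
            From ! {self(), {error, queue_full}},
            queue_loop(Q, Len, Max, Idle);
        {submit, From, Fun} when Idle =/= [] ->
            [W | Rest] = Idle,
            W ! {work, Fun},
            From ! {self(), ok},
            queue_loop(Q, Len, Max, Rest);
        {submit, From, Fun} ->
            From ! {self(), ok},
            queue_loop(queue:in(Fun, Q), Len + 1, Max, Idle);
        {ready, W} ->
            case queue:out(Q) of
                {{value, Fun}, Q2} ->
                    W ! {work, Fun},
                    queue_loop(Q2, Len - 1, Max, Idle);
                {empty, _} ->
                    queue_loop(Q, Len, Max, [W | Idle])
            end
    end.

worker_loop(QueuePid) ->
    QueuePid ! {ready, self()},
    receive {work, Fun} -> catch Fun() end,
    worker_loop(QueuePid).

The dequeue point in such a loop is also where queue_time and work_time could be measured, matching the stats taken for other pools.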

For configuration of the pool, the following is proposed:

%% @doc Pool Sizes - sizes for individual node_worker_pools
%% ...
{mapping, "repair_worker_pool_size", "riak_kv.repair_worker_pool_size", [
  {datatype, integer},
  {default, 4}
]}.