Add a router tag for all mutations after enabling HA [release-7.3] #11066

Open
wants to merge 1 commit into
base: release-7.3

Commits on Nov 16, 2023

  1. Add a router tag for all mutations after enabling HA

    We found a data corruption bug when switching from a single region to two
    regions, i.e., re-enabling HA. The exact sequence leading to corruption in
    the test is:
    
    1. Epoch 6: 2 regions.
    2. Epoch 8: usable_region is changed to 1; a transaction commits at version V.
    3. Epoch 10: usable_region is now 2 and recoverAt is V, so V is copied to the
       newly recruited tlogs. The remote SS peeked the tlog but had not yet
       persisted V's data.
    4. Restart. Epoch 12: another recovery.
    5. Epoch 14: the remote tlog starts at Unrecovered < V (in fact, Unrecovered ==
       Epoch 8's endVersion). pullAsyncData therefore pulls with the log router
       tag, but the mutations at V don't have router tags.
    
    To reproduce:
    -f ./tests/restarting/from_7.3.0/ConfigureTestRestart-1.toml -b on -s 1855375089
    -f ./tests/restarting/from_7.3.0/ConfigureTestRestart-2.toml -b on -s 1855375090 --restarting
    commit d24a62c, clang build
    
    So the problem is that the tlog data copied from epoch 8 to 10 can be lost
    because it doesn't have a log router tag. At epoch 14, when pullAsyncData
    tries to copy the data from version V, that data is not copied.
    
    The solution is to add a static router tag (-2, 0) to all mutations after HA
    is enabled. Since this transaction will trigger a recovery, the next epoch has
    log router tags, so the copied range will have the proper tag to be pulled from
    the remote side, i.e., by log routers. For the tlog data not copied from the
    old epoch when usable_region is 1, remote-side storage servers will peek from
    the old tlogs.
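
The effect of the missing tag can be illustrated with a small standalone model. This is only a sketch in plain C++, not the actual Flow code from the patch: `Tag`, `Mutation`, `commitMutation`, and `peekByTag` are hypothetical names invented for illustration, and the only value taken from the commit message is the static router tag (-2, 0). It shows that a peek filtered by the router tag skips a mutation at version V that carries only storage tags, and finds it once the static router tag is attached at commit time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Simplified stand-in for a tag: (locality, id). Hypothetical type.
using Tag = std::pair<int, int>;

// The static router tag (-2, 0) named in the commit message.
const Tag kStaticRouterTag{-2, 0};

// A mutation as this toy model sees it: a commit version plus its tags.
struct Mutation {
    int64_t version;
    std::vector<Tag> tags;
};

// Hypothetical model of a commit. When addRouterTag is true (HA has been
// enabled), the static router tag is appended to the storage tags.
Mutation commitMutation(int64_t version, std::vector<Tag> storageTags, bool addRouterTag) {
    if (addRouterTag) {
        storageTags.push_back(kStaticRouterTag);
    }
    return Mutation{version, std::move(storageTags)};
}

// Hypothetical model of a peek: return the versions at or after `begin`
// whose mutations carry `tag`. Mutations without the tag are invisible.
std::vector<int64_t> peekByTag(const std::vector<Mutation>& log, const Tag& tag, int64_t begin) {
    std::vector<int64_t> versions;
    for (const auto& m : log) {
        const bool tagged = std::find(m.tags.begin(), m.tags.end(), tag) != m.tags.end();
        if (m.version >= begin && tagged) {
            versions.push_back(m.version);
        }
    }
    return versions;
}
```

In this model, peeking with `kStaticRouterTag` from version V over a log built with `addRouterTag == false` returns nothing, which mirrors the lost data at epoch 14. With `addRouterTag == true` the same peek returns V.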
    jzhou77 committed Nov 16, 2023
    b9584ee