Skip to content

Commit

Permalink
Add a router tag for all mutations after enabling HA
Browse files Browse the repository at this point in the history
We found a data corruption bug when switching from a single region to two
regions, i.e., re-enabling HA. The exact sequence of corruption for the test is:

1. Epoch 6: 2 regions
2. Epoch 8: change usable_region to 1, txn commit at version V.
3. Epoch 10: usable_region is now 2, recoverAt is V. and V is copied to the
   newly recruited tlogs. Remote SS peeked the tlog, but didn't persist the V's
   data yet.
4. Restart. Epoch 12, another recovery
5. Epoch 14: remote tlog starts at Unrecovered < V , actually Unrecovered ==
   Epoch 8's endVersion. So pullAsyncData is pulling with log router tag, and
   mutations at V don't have router tags.

To reproduce:
-f ./tests/restarting/from_7.3.0/ConfigureTestRestart-1.toml -b on -s 1855375089
-f ./tests/restarting/from_7.3.0/ConfigureTestRestart-2.toml -b on -s 1855375090 --restarting
commit d24a62c, clang build

So the problem is with the tlog data copied from epoch 8 to 10 can be lost,
because they don't have a log router tag. So at epoch 14, when pullAsyncData
tries to copy the data from version V, they are not copied.

The solution is to add a static router tag (-2, 0) to all mutations after HA
is enabled. Since this transaction will trigger a recovery, the next epoch has
log router tags, so the copied range will have the proper tag to be pulled from
remote side, i.e., log routers. For the tlog data not copied from the old epoch
when usable_region is 1, remote side storage servers will peek from old tlogs.
  • Loading branch information
jzhou77 committed Nov 16, 2023
1 parent d24a62c commit b9584ee
Show file tree
Hide file tree
Showing 5 changed files with 16 additions and 1 deletion.
2 changes: 1 addition & 1 deletion fdbclient/SystemData.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -976,7 +976,7 @@ ProcessClass decodeProcessClassValue(ValueRef const& value) {

const KeyRangeRef configKeys("\xff/conf/"_sr, "\xff/conf0"_sr);
const KeyRef configKeysPrefix = configKeys.begin;

const KeyRef configUsableRegionsKey = "\xff/conf/usable_regions"_sr;
const KeyRef perpetualStorageWiggleKey("\xff/conf/perpetual_storage_wiggle"_sr);
const KeyRef perpetualStorageWiggleLocalityKey("\xff/conf/perpetual_storage_wiggle_locality"_sr);
// The below two are there for compatible upgrade and downgrade. After 7.3, the perpetual wiggle related keys should use
Expand Down
1 change: 1 addition & 0 deletions fdbclient/include/fdbclient/SystemData.h
Original file line number Diff line number Diff line change
Expand Up @@ -308,6 +308,7 @@ UID decodeProcessClassKeyOld(KeyRef const& key);
// See DatabaseConfiguration.cpp ::setInternal for more examples.
extern const KeyRangeRef configKeys;
extern const KeyRef configKeysPrefix;
extern const KeyRef configUsableRegionsKey;

extern const KeyRef perpetualStorageWiggleKey;
extern const KeyRef perpetualStorageWiggleLocalityKey;
Expand Down
4 changes: 4 additions & 0 deletions fdbserver/ApplyMetadataMutation.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -373,6 +373,10 @@ class ApplyMetadataMutationsImpl {
.detail("ToCommit", toCommit != nullptr)
.detail("InitialCommit", initialCommit);
confChange = true;
if (m.param1 == configUsableRegionsKey && m.param2 == "2"_sr && toCommit) {
TraceEvent("MutationRequiresRestartEnableHA", dbgid);
toCommit->setChangedToHA();
}
}
}
if (!initialCommit)
Expand Down
2 changes: 2 additions & 0 deletions fdbserver/LogSystem.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -300,6 +300,8 @@ void LogPushData::writeMessage(StringRef rawMessageWithoutLength, bool usePrevio
prev_tags.clear();
if (logSystem->hasRemoteLogs()) {
prev_tags.push_back(logSystem->getRandomRouterTag());
} else if (changedToHA) {
prev_tags.emplace_back(-2, 0);
}
for (auto& tag : next_message_tags) {
prev_tags.push_back(tag);
Expand Down
8 changes: 8 additions & 0 deletions fdbserver/include/fdbserver/LogSystem.h
Original file line number Diff line number Diff line change
Expand Up @@ -813,6 +813,11 @@ struct LogPushData : NonCopyable {
// getAllMessages() and is used before writing any other mutations.
void setMutations(uint32_t totalMutations, VectorRef<StringRef> mutations);

// After HA is enabled, we always add log router tag (-2, 0) to all mutations
// to ensure after the recovery, the mutations can be sent to the log routers
// for the remote region to consume.
void setChangedToHA() { changedToHA = true; }

private:
Reference<ILogSystem> logSystem;
std::vector<Tag> next_message_tags;
Expand All @@ -828,6 +833,7 @@ struct LogPushData : NonCopyable {
uint32_t subsequence;
SpanContext spanContext;
bool shardChanged = false; // if keyServers has any changes, i.e., shard boundary modifications.
bool changedToHA = false; // if DB configuration usable_regions changed from 1 to 2

// Writes transaction info to the message stream at the given location if
// it has not already been written (for the current transaction). Returns
Expand All @@ -841,6 +847,8 @@ void LogPushData::writeTypedMessage(T const& item, bool metadataMessage, bool al
prev_tags.clear();
if (logSystem->hasRemoteLogs()) {
prev_tags.push_back(logSystem->getRandomRouterTag());
} else if (changedToHA) {
prev_tags.emplace_back(-2, 0);
}
for (auto& tag : next_message_tags) {
prev_tags.push_back(tag);
Expand Down

0 comments on commit b9584ee

Please sign in to comment.