rewrote tlog recruitment logic so that it is deterministic #4695

Merged
sfc-gh-mpilman merged 12 commits into apple:master on Apr 27, 2021

@sfc-gh-etschannen (Collaborator) commented Apr 21, 2021

The goal of this rewrite is to prevent the "better master exists" check from triggering spuriously. Previously, tlogs could be placed on different processes each time the recruitment function was called, so two calls on an unchanged cluster could produce different recruitments.

Changing the recruitment logic has an additional benefit: the cluster controller now ensures that Log class processes are recruited before Transaction class processes.
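
For readers unfamiliar with the idea, here is a minimal sketch of what deterministic recruitment with a class preference can look like. It is not the code in this PR: the `Worker` type, the `recruit_tlogs` function, the numeric ranks, and the `roles_in_use` tie-breaker are all hypothetical names chosen for illustration. The only property taken from the description above is that Log class processes come before Transaction class processes and that repeated calls give the same answer.

```python
from dataclasses import dataclass

# Hypothetical ranking, lower is better. The PR only states that Log class
# processes are recruited before Transaction class processes; the numeric
# values and the catch-all rank for other classes are assumptions.
CLASS_RANK = {"log": 0, "transaction": 1}
OTHER_RANK = 2


@dataclass(frozen=True)
class Worker:
    process_id: str        # stable identifier, used as the final tie-breaker
    process_class: str     # e.g. "log", "transaction", "stateless"
    roles_in_use: int = 0  # assumed secondary criterion: roles already hosted


def recruit_tlogs(workers, count):
    """Return `count` workers chosen by a total order, never by chance."""
    if count > len(workers):
        raise ValueError("not enough workers to recruit from")
    ranked = sorted(
        workers,
        key=lambda w: (
            CLASS_RANK.get(w.process_class, OTHER_RANK),
            w.roles_in_use,
            w.process_id,
        ),
    )
    return ranked[:count]
```

Because the sort key ends in a unique `process_id`, the order is total, so the result does not depend on the order in which workers are passed in. A later "is there a better recruitment?" comparison then differs from the current one only when the cluster has actually changed.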

Passed: 934930
Failed: 10

Failing tests:

RandomSeed="994272039" BuggifyEnabled="1" TestFile="tests/slow/SwizzledCycleTest.toml"
TooManyFiles

RandomSeed="325685976" BuggifyEnabled="1" TestFile="tests/slow/FastTriggeredWatches.toml"
Test timed out

RandomSeed="58742398" BuggifyEnabled="1" TestFile="tests/fast/SwizzledRollbackSideband.toml"
Error starting workload

JoshuaMessage Severity="40" Error="JoshuaTimeout" TimeoutCommandRun="true" PythonError="Command './joshua_timeout' timed out after 180 seconds"

RandomSeed="199525603" BuggifyEnabled="1" TestFile="tests/rare/CloggedCycleWithKills.toml"
RecoveryDelayedTooManyOldGenerations

RandomSeed="228055023" BuggifyEnabled="1" OldBinary="fdbserver" TestFile="tests/restarting/from_7.0.0/ConfigureTestRestart-1.txt"

RandomSeed="228055024" BuggifyEnabled="1" OldBinary="fdbserver" TestFile="tests/restarting/from_7.0.0/ConfigureTestRestart-2.txt"
TesterRecruitmentTimeout

RandomSeed="224073052" BuggifyEnabled="0" OldBinary="fdbserver" TestFile="tests/restarting/from_6.2.29/SnapTestAttrition-1.txt"
SetupAndRunError Error="timed_out"

RandomSeed="5142395" BuggifyEnabled="1" OldBinary="fdbserver" TestFile="tests/restarting/from_7.0.0/SnapIncrementalRestore-1.toml"
TestFailure Error="timed_out"

RandomSeed="699167286" BuggifyEnabled="1" TestFile="tests/fast/TxnStateStoreCycleTest.toml"
TooManyFiles

RandomSeed="854128356" BuggifyEnabled="1" OldBinary="fdbserver" TestFile="tests/restarting/from_7.0.0/SnapIncrementalRestore-1.toml"
TestFailure Error="timed_out"
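
To retry one of the failures above locally, the attributes on each line can be parsed mechanically. The sketch below is an illustration only: `parse_failure` and `simulation_command` are hypothetical helpers, and the command line assumes the usual `fdbserver -r simulation -f <test file> -s <seed> -b on|off` invocation. It ignores `OldBinary`, so it does not cover the restarting tests, which need the older binary for their first phase.

```python
import re

# Matches attribute pairs such as RandomSeed="994272039".
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_failure(line):
    """Turn one failing-test line into a dict of its attributes."""
    return dict(ATTR.findall(line))


def simulation_command(attrs, binary="fdbserver"):
    """Build an argv list for rerunning the test in simulation.

    Assumes the -r/-f/-s/-b flags described in the text above.
    """
    buggify = "on" if attrs.get("BuggifyEnabled") == "1" else "off"
    return [
        binary,
        "-r", "simulation",
        "-f", attrs["TestFile"],
        "-s", attrs["RandomSeed"],
        "-b", buggify,
    ]
```

For example, passing the `SwizzledCycleTest.toml` line through `parse_failure` and then `simulation_command` yields an argv list carrying seed 994272039 with buggify on.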

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or master if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

…t better master exists from triggering spuriously
@foundationdb-ci
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: e18c996
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

…max commit latency, because it can be spuriously triggered by dummy transactions that take 5+ seconds each
@foundationdb-ci
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: b61a911
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

…mulation because it is the recommended process class, and the others are not deterministic when recruited in a constrained process situation
… amount of time a commit takes because of long commit times
…ys share with the same other role when everything else is equal
… just the worst one. Since this is a behavior change from the backup recruitment, we cannot compare degraded between the two recruitments
fix: tlog recruitment did not attempt to avoid longLivedStateless processes
@sfc-gh-mpilman sfc-gh-mpilman merged commit 340f012 into apple:master Apr 27, 2021