Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add trace event to recovery_transaction step in recovery #5015

Merged
merged 1 commit into from Jun 21, 2021
Merged

Add trace event to recovery_transaction step in recovery #5015

merged 1 commit into from Jun 21, 2021

Conversation

kakaiu
Copy link
Member

@kakaiu kakaiu commented Jun 18, 2021

Problem:
(1) When a recovery blocks at recovery_transaction phase and we do root cause analysis of it, it is often hard to distinguish which step in the phase blocks the recovery --- The recovery may block at the special transaction, or block at sending TxnStateStore to proxies, or block at initializing resolvers.
(2) In the root cause analysis, we may want to see which proxy is the proxy 0. Proxy 0 is the proxy that accepts special transactions that must keep serial order. However, the current way to identify the proxy 0 is somewhat tricky. We want to clearly see this information in the trace event.

Solution:
(1) Add two trace events --- SentTxnStateStoreToCommitProxies and InitializedAllResolvers to sendInitialCommitToResolvers() in masterserver.actor.cpp.
In recovery_transaction phase, the time order is:
The master sends TxnStateStore to proxies meanwhile issues the special transaction to proxy 0, then the master initializes resolvers, then the special transaction at the proxy 0 gets resolution replied from resolver.
When the recovery blocks at recovery_transaction phase, (a) if both events appear, then we can conclude that the special transaction must be blocked. (b) If only SentTxnStateStoreToCommitProxies appears, then at least initializing resolvers is blocked, and the special transaction may be blocked (at no later than when the proxy sending resolution request to resolver) (c) If none of the two events appears, then at least sending TxnStateStore to proxies is blocked, and the special transaction may be blocked (at no later than when the proxy sending resolution request to resolver).
Noting that in case (b) and (c), it is hard to determine whether the special transaction is blocked when the master issues transaction to proxy.
This is basically due to that the special transaction and initializing resolvers happen concurrently, but we have to check them in a serialized way.

(2) Add "FirstProxy" to CommitProxyReplies events in newCommitProxies().

Verified the function via simulation.
Passed Joshua 100k test: 20210619-210933-zhewang-846f7db1a7ac57e1

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or master if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 32920f3
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@@ -965,6 +968,7 @@ ACTOR Future<Void> sendInitialCommitToResolvers(Reference<MasterData> self) {
}

wait(waitForAll(replies));
TraceEvent("SentTxnStateStoreToResolvers", self->dbgid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Master doesn't send txnStateStore to resolvers.

Suggested change
TraceEvent("SentTxnStateStoreToResolvers", self->dbgid);
TraceEvent("InitializedAllResolvers", self->dbgid);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It sends LastEpochEnd version to resolvers.

@apple apple deleted a comment from foundationdb-ci Jun 19, 2021
@apple apple deleted a comment from foundationdb-ci Jun 19, 2021
@apple apple deleted a comment from foundationdb-ci Jun 19, 2021
@kakaiu kakaiu requested a review from jzhou77 June 19, 2021 01:17
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 250eaf2
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: 80aabcc
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: fff0593
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: d00d679
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@@ -294,7 +294,9 @@ ACTOR Future<Void> newCommitProxies(Reference<MasterData> self, RecruitFromConfi
req.recoveryCount = self->cstate.myDBState.recoveryCount + 1;
req.recoveryTransactionVersion = self->recoveryTransactionVersion;
req.firstProxy = i == 0;
TraceEvent("CommitProxyReplies", self->dbgid).detail("WorkerID", recr.commitProxies[i].id());
TraceEvent("CommitProxyReplies", self->dbgid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is better named as "InitializeCommitProxy", since there is no reply at this point.

Copy link
Member Author

@kakaiu kakaiu Jun 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if I rename this existing event name, I am afraid this may break our existing queries for others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG. Didn't realize that it already exists.

@@ -953,6 +955,7 @@ ACTOR Future<Void> sendInitialCommitToResolvers(Reference<MasterData> self) {
wait(yield());
}
wait(waitForAll(txnReplies));
TraceEvent("SentTxnStateStoreToCommitProxies", self->dbgid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds too general. I think you want to add word "Recovery" in it since it is only used in recovery.

@halfprice
Copy link
Contributor

Love your PR description!

@xumengpanda
Copy link
Contributor

Love your PR description!
Clear and descriptive!

@kakaiu kakaiu requested a review from halfprice June 19, 2021 21:07
@foundationdb-ci
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: foundationdb-pull-request-build
  • Commit ID: b685691
  • Result: SUCCEEDED
  • Build Logs (available for 7 days)

@kakaiu kakaiu marked this pull request as ready for review June 20, 2021 02:29
@@ -294,7 +294,9 @@ ACTOR Future<Void> newCommitProxies(Reference<MasterData> self, RecruitFromConfi
req.recoveryCount = self->cstate.myDBState.recoveryCount + 1;
req.recoveryTransactionVersion = self->recoveryTransactionVersion;
req.firstProxy = i == 0;
TraceEvent("CommitProxyReplies", self->dbgid).detail("WorkerID", recr.commitProxies[i].id());
TraceEvent("CommitProxyReplies", self->dbgid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SG. Didn't realize that it already exists.

@kakaiu kakaiu merged commit ab1404b into apple:master Jun 21, 2021
@kakaiu kakaiu deleted the zhewang-add-traceevent-in-recovery branch June 26, 2021 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants