New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add trace event to recovery_transaction step in recovery #5015
Add trace event to recovery_transaction step in recovery #5015
Conversation
AWS CodeBuild CI Report
|
fdbserver/masterserver.actor.cpp
Outdated
@@ -965,6 +968,7 @@ ACTOR Future<Void> sendInitialCommitToResolvers(Reference<MasterData> self) { | |||
} | |||
|
|||
wait(waitForAll(replies)); | |||
TraceEvent("SentTxnStateStoreToResolvers", self->dbgid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Master doesn't send txnStateStore to resolvers.
TraceEvent("SentTxnStateStoreToResolvers", self->dbgid); | |
TraceEvent("InitializedAllResolvers", self->dbgid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It sends LastEpochEnd version to resolvers.
AWS CodeBuild CI Report
|
AWS CodeBuild CI Report
|
AWS CodeBuild CI Report
|
AWS CodeBuild CI Report
|
@@ -294,7 +294,9 @@ ACTOR Future<Void> newCommitProxies(Reference<MasterData> self, RecruitFromConfi | |||
req.recoveryCount = self->cstate.myDBState.recoveryCount + 1; | |||
req.recoveryTransactionVersion = self->recoveryTransactionVersion; | |||
req.firstProxy = i == 0; | |||
TraceEvent("CommitProxyReplies", self->dbgid).detail("WorkerID", recr.commitProxies[i].id()); | |||
TraceEvent("CommitProxyReplies", self->dbgid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is better named as "InitializeCommitProxy", since there is no reply at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if I rename this existing event name, I am afraid this may break our existing queries for others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG. Didn't realize that it already exists.
fdbserver/masterserver.actor.cpp
Outdated
@@ -953,6 +955,7 @@ ACTOR Future<Void> sendInitialCommitToResolvers(Reference<MasterData> self) { | |||
wait(yield()); | |||
} | |||
wait(waitForAll(txnReplies)); | |||
TraceEvent("SentTxnStateStoreToCommitProxies", self->dbgid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds too general. I think you want to add word "Recovery" in it since it is only used in recovery.
Love your PR description! |
Love your PR description! |
AWS CodeBuild CI Report
|
@@ -294,7 +294,9 @@ ACTOR Future<Void> newCommitProxies(Reference<MasterData> self, RecruitFromConfi | |||
req.recoveryCount = self->cstate.myDBState.recoveryCount + 1; | |||
req.recoveryTransactionVersion = self->recoveryTransactionVersion; | |||
req.firstProxy = i == 0; | |||
TraceEvent("CommitProxyReplies", self->dbgid).detail("WorkerID", recr.commitProxies[i].id()); | |||
TraceEvent("CommitProxyReplies", self->dbgid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG. Didn't realize that it already exists.
Problem:
(1) When a recovery blocks at recovery_transaction phase and we do root cause analysis of it, it is often hard to distinguish which step in the phase blocks the recovery --- The recovery may block at the special transaction, or block at sending TxnStateStore to proxies, or block at initializing resolvers.
(2) In the root cause analysis, we may want to see which proxy is the proxy 0. Proxy 0 is the proxy that accepts special transactions that must keep serial order. However, the current way to identify the proxy 0 is somewhat tricky. We want to clearly see this information in the trace event.
Solution:
(1) Add two trace events --- SentTxnStateStoreToCommitProxies and InitializedAllResolvers to sendInitialCommitToResolvers() in masterserver.actor.cpp.
In recovery_transaction phase, the time order is:
The master sends TxnStateStore to proxies meanwhile issues the special transaction to proxy 0, then the master initializes resolvers, then the special transaction at the proxy 0 gets resolution replied from resolver.
When the recovery blocks at recovery_transaction phase, (a) if both events appear, then we can conclude that the special transaction must be blocked. (b) If only SentTxnStateStoreToCommitProxies appears, then at least initializing resolvers is blocked, and the special transaction may be blocked (at no later than when the proxy sending resolution request to resolver) (c) If none of the two events appears, then at least sending TxnStateStore to proxies is blocked, and the special transaction may be blocked (at no later than when the proxy sending resolution request to resolver).
Noting that in case (b) and (c), it is hard to determine whether the special transaction is blocked when the master issues transaction to proxy.
This is basically due to that the special transaction and initializing resolvers happen concurrently, but we have to check them in a serialized way.
(2) Add "FirstProxy" to CommitProxyReplies events in newCommitProxies().
Verified the function via simulation.
Passed Joshua 100k test: 20210619-210933-zhewang-846f7db1a7ac57e1
Code-Reviewer Section
The general guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch
ormaster
if this is the youngest branch)