Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PrepareProposal/ProcessProposal invocation during replay inconsistent with spec #1018

Closed
kostko opened this issue Jun 23, 2023 · 11 comments · Fixed by #1033
Closed

PrepareProposal/ProcessProposal invocation during replay inconsistent with spec #1018

kostko opened this issue Jun 23, 2023 · 11 comments · Fixed by #1033
Assignees
Labels
bug Something isn't working

Comments

@kostko
Copy link

kostko commented Jun 23, 2023

During our E2E fuzz testing using the CometBFT 0.37.x branch, where one of the tests is randomly killing and restarting nodes, it seems that PrepareProposal/ProcessProposal invocation during replay is inconsistent with the spec.

The spec currently says:

For every round, if the local process is the proposer of the current round, CometBFT calls PrepareProposal, followed by ProcessProposal. These two always come together because they reflect the same proposal that the process also delivers to itself.

However, it seems that there is an edge case with the following scenario:

  • A validator node becomes the proposer, calls PrepareProposal, generates and broadcasts the proposal (proposal A). Then the node crashes/is killed.
  • When the node restarts, it enters the consensus replay mode.
  • CometBFT replays the timeout event, triggering a new round where the validator node is again the proposer (as this is a replay, the height/round is the same). This causes the recovering validator node to run PrepareProposal again, generating a new proposal B. Since signing this proposal would be an equivocation, the proposal is dropped.
  • CometBFT then replays proposal A, which causes it to be processed by the app via ProcessProposal.

This seems to contradict the specification, because based on it the following sequence would be expected in this scenario:

  • B := PrepareProposal()
  • ProcessProposal(B)
  • ProcessProposal(A)

However, the actual sequence is:

  • B := PrepareProposal()
  • ProcessProposal(A)

So either the specification needs to be updated to clarify that these other sequences are possible or the implementation should be fixed.

@kostko kostko added bug Something isn't working needs-triage This issue/PR has not yet been triaged by the team. labels Jun 23, 2023
@sergio-mena sergio-mena self-assigned this Jun 23, 2023
@lasarojc lasarojc removed the needs-triage This issue/PR has not yet been triaged by the team. label Jun 23, 2023
@cason
Copy link
Contributor

cason commented Jun 26, 2023

CometBFT then replays proposal A, which causes it to be processed by the app via PrepareProposal.

Are you sure that this happens? Or are you referring to ProcessProposal instead?

@kostko
Copy link
Author

kostko commented Jun 26, 2023

Yes, we have observed the replay process producing a timeout event which causes CometBFT to invoke PrepareProposal. Only then the replay process replays proposal A itself which triggers an invocation of ProcessProposal.

@cason
Copy link
Contributor

cason commented Jun 26, 2023

So, during the regular operation the application produces A as a response for PrepareProposal. The node broadcasts A block then crashes.

When the node is restarted, it replays the conditions for starting the same round, therefore it invokes PrepareProposal again, to which the application, for instance, replies with B.

The problem, I assume, is that since the proposal for B cannot be signed, there is no ProcessProposal for B. And once block A was persisted to disk, it is replayed, causing ProcessProposal to be invoked for A.

I cannot see, however, a situation in which ProcessProposal can be called for B, since B cannot be signed, therefore it is not processed as a "received" message.

@kostko
Copy link
Author

kostko commented Jun 26, 2023

Yeah in that case the best thing would be to update the spec to explicitly mention that this can happen.

@cason
Copy link
Contributor

cason commented Jun 26, 2023

CometBFT then replays proposal A, which causes it to be processed by the app via PrepareProposal.

I still think the wording here is wrong. The replayed proposal is passed to ProcessProposal, not PrepareProposal.

@cason
Copy link
Contributor

cason commented Jun 26, 2023

However, the actual sequence is:

B := PrepareProposal()
ProcessProposal(A)

The actual sequence should include the event that happened before the crash, i.e., the first step here should be A := PrepareProposal(), then crash, then the sequence.

@cason
Copy link
Contributor

cason commented Jun 26, 2023

In other words, we have two invocations of PrepareProposal, but only the output of one of them is passed to ProcessProposal.

@kostko
Copy link
Author

kostko commented Jun 26, 2023

CometBFT then replays proposal A, which causes it to be processed by the app via PrepareProposal.

I still think the wording here is wrong. The replayed proposal is passed to ProcessProposal, not PrepareProposal.

Oops sorry, yeah that was supposed to say ProcessProposal.

@cason
Copy link
Contributor

cason commented Jun 27, 2023

If we introduce rounds on PrepareProposal, the spec should be updated accordingly. I am afraid, however, that with the current design we cannot ensure that we call PrepareProposal at most once per each round of consensus.

#882

@sergio-mena
Copy link
Contributor

Some info on the reason for closing this issue for now.

After some internal discussions on the best way to tackle this, we decided that the best alternative was to fix the specification so that the replay case was properly covered (the spec as it was when this issue was created was inconsistent with the implementation). This work was done in #1033.

The main reason we went for this option in the short term is that fixing this problem in the Consensus replay implementation (which would allow us to keep the spec as it was) is far from trivial as you can appreciate from #1035.

@kostko
Copy link
Author

kostko commented Jun 30, 2023

Sounds like the right solution to me. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
Status: Done
4 participants