Member

@masih masih commented Jun 10, 2024

When the current instance has not progressed after some time, rebroadcast the last round of messages. The rebroadcast timeout is configurable and follows a bounded exponential backoff that starts once the phase timeout expires. The rebroadcast timeout is offset by the phase timeout while that has not yet expired, and by the latest rebroadcast time after that.

Once the first rebroadcast is triggered, successive rebroadcasts use the `Clock` alarm mechanism to daisy-chain the triggers one after another.

Introduce the `Drop` adversary to simulate scenarios where messages for a given set of target participants are dropped according to a configured message loss probability. Run simulation tests using the `Drop` adversary and assert that, despite stochastic message loss, the targeted participant reaches the expected consensus.
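The idea behind such an adversary can be sketched in a few lines of Go. This is only an illustration under stated assumptions: `dropper`, `newDropper`, and `allow` are hypothetical names, participants are identified by plain `uint64` values, and only messages addressed to a target are subject to loss. The actual `Drop` adversary in this PR may differ on all of these points.

```go
package main

import "math/rand"

// dropper is a hypothetical message filter that loses messages addressed to
// a fixed set of target participants with a configured probability.
type dropper struct {
	targets         map[uint64]struct{}
	lossProbability float64
	rng             *rand.Rand
}

// newDropper seeds its own random source so that simulations are repeatable.
func newDropper(seed int64, lossProbability float64, targets ...uint64) *dropper {
	set := make(map[uint64]struct{}, len(targets))
	for _, t := range targets {
		set[t] = struct{}{}
	}
	return &dropper{
		targets:         set,
		lossProbability: lossProbability,
		rng:             rand.New(rand.NewSource(seed)),
	}
}

// allow reports whether a message addressed to participant `to` is delivered.
// Messages to non-targets always pass through.
func (d *dropper) allow(to uint64) bool {
	if _, targeted := d.targets[to]; !targeted {
		return true
	}
	// Float64 returns a value in [0, 1), so probability 0 never drops
	// and probability 1 always drops.
	return d.rng.Float64() >= d.lossProbability
}
```

Seeding the random source explicitly keeps a failing simulation reproducible, which matters when the assertion is about behaviour under stochastic loss.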

Fixes #243


codecov bot commented Jun 10, 2024

Codecov Report

Attention: Patch coverage is 78.48101% with 17 lines in your changes missing coverage. Please review.

Project coverage is 83.09%. Comparing base (ad199a8) to head (a4846f2).

Current head a4846f2 differs from pull request most recent head e6965c3

Please upload reports for the commit e6965c3 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #326      +/-   ##
==========================================
+ Coverage   82.89%   83.09%   +0.20%     
==========================================
  Files          15       15              
  Lines        1695     1769      +74     
==========================================
+ Hits         1405     1470      +65     
- Misses        169      174       +5     
- Partials      121      125       +4     

| Files | Coverage Δ |
| --- | --- |
| gpbft/options.go | 76.74% <73.33%> (-1.83%) ⬇️ |
| gpbft/gpbft.go | 85.04% <79.68%> (+0.27%) ⬆️ |

... and 1 file with indirect coverage changes

@masih masih force-pushed the masih/rebroadcast-last-round branch from 6a44d48 to 5d3c58a Compare June 10, 2024 20:48
Member Author

masih commented Jun 10, 2024

@anorth as always I'd appreciate your early feedback please. 🙏

Member

@anorth anorth left a comment

This is more complicated than I was expecting. It's following the FIP quite closely, but I think the FIP is doing us a great disservice by giving a procedural-style specification rather than just describing the properties that we need to meet. I think we should make some effort to simplify.

For recording the messages that we might possibly need to rebroadcast, I think we should start out by simply recording all messages that are sent (in the associated round state, or another round-indexed structure). I wouldn't bother indexing them by anything else. Removing messages from old rounds is an optimization, and if we're doing it, we should just remove the entire round state. When resending the messages for some round later we could filter by phase (but it should also be ok to just resend them all).
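A round-indexed record of sent messages along these lines might look like the Go sketch below. All names here (`sentMessage`, `sentLog`, and its methods) are hypothetical and chosen for illustration; phases are plain strings to keep the example self-contained.

```go
package main

// sentMessage is a hypothetical, simplified record of a broadcast message.
type sentMessage struct {
	round   uint64
	phase   string
	payload string
}

// sentLog keeps every sent message, indexed by round and nothing else.
type sentLog struct {
	byRound map[uint64][]sentMessage
}

func newSentLog() *sentLog {
	return &sentLog{byRound: make(map[uint64][]sentMessage)}
}

// record appends a message to the state of its round.
func (l *sentLog) record(m sentMessage) {
	l.byRound[m.round] = append(l.byRound[m.round], m)
}

// forRound returns the messages of a round. With no phases given it returns
// all of them; otherwise only messages whose phase matches one of the filters.
func (l *sentLog) forRound(round uint64, phases ...string) []sentMessage {
	all := l.byRound[round]
	if len(phases) == 0 {
		return all
	}
	var out []sentMessage
	for _, m := range all {
		for _, p := range phases {
			if m.phase == p {
				out = append(out, m)
				break
			}
		}
	}
	return out
}

// dropRoundsBefore removes the entire state of every round older than round.
func (l *sentLog) dropRoundsBefore(round uint64) {
	for r := range l.byRound {
		if r < round {
			delete(l.byRound, r)
		}
	}
}
```

Dropping whole rounds rather than individual messages keeps pruning an optional optimization that cannot leave a round partially recorded.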

For triggering, my (possibly naive) thought is that in tryPrepare/tryCommit, if the phase timeout has expired but the participant has not received from a strong quorum, then set another alarm for the remaining rebroadcast timeout. If we end up back in tryPrepare with that second timeout having expired, then rebroadcast and set a new rebroadcast alarm. When progress is made and we set a new phase timeout, zero the rebroadcast timeout. Similar but simpler in tryDecide, where there is no concurrent phase timeout.

@masih masih force-pushed the masih/rebroadcast-last-round branch 2 times, most recently from ead2809 to ac18c4e Compare June 13, 2024 14:56
@github-actions

Fuzz test failed on commit ac18c4e. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9501953317 -n testdata

Alternatively, download directly from here.

@masih masih force-pushed the masih/rebroadcast-last-round branch from ac18c4e to 409c2b8 Compare June 13, 2024 16:07
@masih masih marked this pull request as ready for review June 13, 2024 19:49
@masih masih requested a review from anorth June 13, 2024 19:49
Member

@anorth anorth left a comment

This is much tighter now, thanks.

The separation in code of setting the timeout from setting the alarm for that timeout makes it a bit difficult to mentally trace, leading to what I think is a timing error.

@masih masih force-pushed the masih/rebroadcast-last-round branch 3 times, most recently from 6a243d7 to b3f2ecf Compare June 14, 2024 12:39
Member Author

@masih masih left a comment

Self-review.

@masih masih force-pushed the masih/rebroadcast-last-round branch from b3f2ecf to a372aab Compare June 14, 2024 12:51
@masih masih requested review from Kubuxu and anorth June 14, 2024 13:01
Member

anorth commented Jun 17, 2024

@masih I pushed a commit to resolve the only nontrivial issue I noticed.

@masih masih force-pushed the masih/rebroadcast-last-round branch from 4c661c6 to f4a15a2 Compare June 18, 2024 19:26
@github-actions

Fuzz test failed on commit f4a15a2. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9571025648 -n testdata

Alternatively, download directly from here.

Member Author

masih commented Jun 18, 2024

#349 should resolve the failing fuzz tests. It is submitted as a separate PR to capture the context for the change in the commit history.

@masih masih force-pushed the masih/rebroadcast-last-round branch 2 times, most recently from 00dc2da to 02ff84d Compare June 18, 2024 21:45
Member

@anorth anorth left a comment

I'm happy for this to land as is, but I've made one suggestion that I think could make it easier to mentally follow.

@github-actions

Fuzz test failed on commit 02ff84d. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9572644922 -n testdata

Alternatively, download directly from here.

@masih masih force-pushed the masih/rebroadcast-last-round branch from 02ff84d to a4846f2 Compare June 18, 2024 22:08
@github-actions

Fuzz test failed on commit a4846f2. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9572870133 -n testdata

Alternatively, download directly from here.

@masih masih force-pushed the masih/rebroadcast-last-round branch from a4846f2 to 920c316 Compare June 18, 2024 22:21
@masih masih enabled auto-merge June 18, 2024 22:22
@masih masih force-pushed the masih/rebroadcast-last-round branch from 920c316 to e6965c3 Compare June 19, 2024 11:20
@masih masih added this pull request to the merge queue Jun 19, 2024
Merged via the queue into main with commit e4ef2ea Jun 19, 2024
@masih masih deleted the masih/rebroadcast-last-round branch June 19, 2024 11:27
Member Author

masih commented Jun 19, 2024

@anorth A quick note:

Since your latest review I fixed an issue that was causing test failures here. When the first rebroadcast becomes necessary in the DECIDE phase, we cannot rely on phaseTimeout as the offset for the rebroadcast alarm, because the DECIDE phase does not update the phase timeout.

I have resolved this by using the current time as the offset of the first rebroadcast alarm when it is first triggered in the DECIDE phase.

Thank you again for all your insightful reviews and comments 🙏

@github-actions

Fuzz test failed on commit e6965c3. To troubleshoot locally, download the seed corpus using GitHub CLI by running:

gh run download 9581101741 -n testdata

Alternatively, download directly from here.

Development

Successfully merging this pull request may close these issues.

Rebroadcast last round messages when stuck at step longer than timeout