-
Notifications
You must be signed in to change notification settings - Fork 13
Rebroadcast last round of messages if instance has not progressed #326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #326 +/- ##
==========================================
+ Coverage 82.89% 83.09% +0.20%
==========================================
Files 15 15
Lines 1695 1769 +74
==========================================
+ Hits 1405 1470 +65
- Misses 169 174 +5
- Partials 121 125 +4
|
6a44d48 to
5d3c58a
Compare
|
@anorth as always I'd appreciate your early feedback please. 🙏 |
anorth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more complicated than I was expecting. It's following the FIP quite closely, but I think the FIP is doing us a great disservice by giving a procedural style specification rather than just describing the properties that we need to meet. I think we should make some effort to simplify.
For recording the messages that we might possibly need to rebroadcast, I think we should start out by simply recording all messages that are sent (in the associated round state, or another round-indexed structure). I wouldn't bother indexing them by anything else. Removing messages from old rounds is an optimization, and if we're doing it, we should just remove the entire round state. When resending the messages for some round later we could filter by phase, (but it should also be ok to just resend them all).
For triggering, my (possibly naive) thought is that in tryPrepare/tryCommit, if the phase timeout has expired but the participant has not received from a strong quorum, then set another alarm for the remaining rebroadcast timeout. If we end up back in tryPrepare with that second timeout having expired, then rebroadcast and set a new rebroadcast alarm. When progress is made and we set a new phase timeout, zero the rebroadcast timeout. Similar but simpler in tryDecide, where there is no concurrent phase timeout.
ead2809 to
ac18c4e
Compare
|
Fuzz test failed on commit ac18c4e. To troubleshoot locally, download the seed corpus using GitHub CLI by running: gh run download 9501953317 -n testdataAleternatively, download directly from here. |
ac18c4e to
409c2b8
Compare
anorth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much tighter now, thanks.
The separation in code of setting the timeout from setting the alarm for that timeout makes it a bit difficult to mentally trace, leading to what I think is a timing error.
6a243d7 to
b3f2ecf
Compare
masih
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self-review.
b3f2ecf to
a372aab
Compare
|
@masih I pushed a commit to resolve the only nontrivial issue I noticed. |
4c661c6 to
f4a15a2
Compare
|
Fuzz test failed on commit f4a15a2. To troubleshoot locally, download the seed corpus using GitHub CLI by running: gh run download 9571025648 -n testdataAleternatively, download directly from here. |
|
#349 should resolve the failing fuzz/tests. It is submitted as a separate PR to capture the context for change in commit history. |
00dc2da to
02ff84d
Compare
anorth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy for this to land as is, but I've made one suggestion that I think could make it easier to mentally follow.
|
Fuzz test failed on commit 02ff84d. To troubleshoot locally, download the seed corpus using GitHub CLI by running: gh run download 9572644922 -n testdataAleternatively, download directly from here. |
02ff84d to
a4846f2
Compare
|
Fuzz test failed on commit a4846f2. To troubleshoot locally, download the seed corpus using GitHub CLI by running: gh run download 9572870133 -n testdataAleternatively, download directly from here. |
a4846f2 to
920c316
Compare
When the current instance has not progressed after some time rebroadcast the last round of messages. The rebroadcast time is made configurable using a bounded exponential backoff after phase timeout expires.The rebroadcast timeout is offset by phase timeout when not expired, and by latest rebroadcast time after that. Once the first rebroadcast is triggered successive rebroadcasts use the `Clock` alarm mechanism to daisy-chain the triggers one after another. Introduce `Drop` adversary to simulate scenarios where for a given set of target participants messages are dropped based on some configured message loss probability. Simulate tests using the `Drop` adversary and assert that despite stochastic message loss the targeted participant reaches the expected consensus. Fixes #243
920c316 to
e6965c3
Compare
|
@anorth A quick note: Since your latest review I fixed an issue that was causing test failures here. The issue was that when the first rebroadcast becomes necessary in DECIDE phase, we cannot rely on I have resolved this by using current time as the offset of first rebroadcast alarm if it is triggered in DECIDE phase for the first time. Thank you again for all your insightful reviews and comments 🙏 |
|
Fuzz test failed on commit e6965c3. To troubleshoot locally, download the seed corpus using GitHub CLI by running: gh run download 9581101741 -n testdataAleternatively, download directly from here. |
When the current instance has not progressed after some time rebroadcast the last round of messages. The rebroadcast time is made configurable using a bounded exponential backoff after phase timeout expires.The rebroadcast timeout is offset by phase timeout when not expired, and by latest rebroadcast time after that.
Once the first rebroadcast is triggered successive rebroadcasts use the
Clockalarm mechanism to daisy-chain the triggers one after another.Introduce
Dropadversary to simulate scenarios where for a given set of target participants messages are dropped based on some configured message loss probability. Simulate tests using theDropadversary and assert that despite stochastic message loss the targeted participant reaches the expected consensus.Fixes #243