
Fix amendment majority flapping: use a more stable threshold for the number of votes required; when missing STValidation, use the last vote seen #4410

Merged
merged 7 commits into XRPLF:develop on Sep 28, 2023

Conversation

scottschurr
Collaborator

High Level Overview of Change

This pull request is being offered as an alternative approach to #4364. Both pull requests seek to limit amendment flapping. Only one of these pull requests should be merged.

This pull request specifically intends to address a comment by @mDuo13: #4364 (review)

Context of Change

Amendment flapping usually occurs when an amendment is on the verge of gaining majority, and a validator not in favor of the amendment goes offline.

This pull request addresses this problem by making three changes:

  1. Prior to this pull request the AmendmentTable was unaware of which validators are trusted. With this change the AmendmentTable keeps a record of which validators are currently on the UNL of this particular server. That record must be kept up to date by the server.

  2. The number of votes required for an amendment to attain majority is based on the number of trusted validators, not on the number of STValidations received from trusted validators prior to the flag ledger.

  3. The AmendmentTable now keeps a record of the last amendment votes seen from each of the validators on the current UNL. If, for some reason, the AmendmentTable does not receive an STValidation from a validator on the UNL prior to the flag ledger, then the AmendmentTable assumes that the missing validator(s) did not change their vote. It uses the last vote seen from that validator.

These changes address amendment flapping in the following ways.

  1. While it's not uncommon for trusted validators to sporadically go offline, it's much less common for the UNL of a validator to change. With this change we should only see flapping when the UNL changes. Flapping around a UNL change is inevitable, since the voting participants change at that point.

  2. The threshold for an amendment to attain majority is more stable. The threshold only changes when the size of the UNL changes. Previously a missing STValidation from a trusted validator would change the threshold.

  3. If a trusted validator's STValidation is missing during a vote, the AmendmentTable assumes that the vote from that validator has not changed. This provides additional stability of the vote while reducing requirements on trusted validators to remain online all the time.
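
To make the mechanics concrete, below is a minimal sketch of the bookkeeping these changes add. All names and types are simplified stand-ins for illustration, not the actual AmendmentTable interface:

```cpp
// Illustrative sketch only: names and types are simplified stand-ins,
// not rippled's actual AmendmentTable interface.
#include <chrono>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>

using ValidatorKey = std::string;  // stand-in for a validator's public key
using AmendmentID = std::string;   // stand-in for an amendment hash
using Clock = std::chrono::steady_clock;

struct CachedVote
{
    std::set<AmendmentID> yes;   // amendments this validator voted for
    Clock::time_point timeSeen;  // when the STValidation arrived
};

class AmendmentVoteTracker
{
    std::set<ValidatorKey> trusted_;                // change 1: current UNL
    std::map<ValidatorKey, CachedVote> lastVotes_;  // change 3: last votes seen

public:
    // Change 1: the server must keep this in sync with its UNL.
    void setTrusted(std::set<ValidatorKey> unl)
    {
        trusted_ = std::move(unl);
    }

    // Change 3: record the amendment votes carried by a trusted
    // validator's STValidation.
    void recordVote(ValidatorKey const& v, std::set<AmendmentID> yes)
    {
        if (trusted_.count(v))
            lastVotes_[v] = CachedVote{std::move(yes), Clock::now()};
    }

    // Change 2: the majority threshold is >80% of the UNL size,
    // independent of how many STValidations arrived this round.
    std::size_t threshold() const
    {
        return trusted_.size() * 4 / 5 + 1;
    }

    // Tally a flag-ledger vote. A trusted validator with no fresh
    // STValidation falls back to its last cached vote instead of
    // implicitly counting as a "no".
    std::size_t tally(AmendmentID const& a) const
    {
        std::size_t count = 0;
        for (auto const& v : trusted_)
        {
            auto it = lastVotes_.find(v);
            if (it != lastVotes_.end() && it->second.yes.count(a) != 0)
                ++count;
        }
        return count;
    }
};
```

In this sketch an amendment attains majority when `tally(a) >= threshold()`, and a validator that misses a round keeps counting for whatever it last voted.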

Type of Change

Testing

Unit tests have been added to exercise the new code. However, with a change like this, it would also be good to exercise the code on a small test network to ensure reasonable amendment voting behavior when the UNL changes on that network.

@nbougalis
Contributor

I think that there's a subtle problem with this: namely, that the "effective vote" that this creates will cause convergence issues.

Consider, inter alia, that this vote isn't saved across restarts. Imagine that some validator X goes offline. All other servers store its previously recorded vote. Now, half of those servers restart. They forget about X's vote and on the next voting round the half that didn't restart think that X is voting YES and the other half think that X is voting NO. There's a similar problem if a validator is offline during the vote: not only does it not get to vote, but it doesn't know what the other validators voted, so it can't update its own table; now, on the next round, it might rely on a stale vote.

This is the problem I was describing in the other PR: there's no reasonable way to save "voting state" unless you do it on the ledger and you can't do it on the ledger because whose votes are you going to save? You can't save all the votes, because that opens up an attack vector, and you can't save those that are in the UNL, because there's no "the UNL" (i.e. my UNL can be different than yours).

As it is now, if a single validator fails to vote in a single round, you can flap. That's just one missed validation out of 256: a validator with an "uptime" of 99.6% can still cause flapping. The hysteresis added by the other amendment means that you can afford to miss one vote, and as long as you're back in time for the next, all is well.

IMO, the flapping fix is only meant to address a short-term issue and not long-term outage.

Just my $0.02.

@scottschurr
Collaborator Author

Thanks for looking @nbougalis. I'm not following 100% what your concern is. I'll describe my understanding of the concern, then you can refine the description.

I think your concern is twofold. There are two consequences of storing the vote in memory that I'm seeing raised:

  1. The stored vote is preserved for too long. If the validator that goes offline is offline for repeated flag ledgers, then we should treat that validator as voting no on the second consecutive time that it is missing. We should not preserve the vote of an absent validator indefinitely, which is the current behavior.

  2. If the validator restarts then its initial historical record is that all validators on the UNL are voting no on everything. So a validator restart can cause flapping.

If I got those concerns right, then let me see if I can address them.

  1. If the longevity of the saved vote is a concern, it's easy to set a timeout of how long the vote is saved locally. An STValidation carries a timeSeen_ member that could be associated with the saved vote. If the saved vote is held in memory for too long, then it could be removed.

  2. A restarted validator does indeed start out with a state of no saved votes. However, every validator is expected to issue their current amendment vote on the ledger just prior to the flag ledger. So the rebooted validator would need to sync with the network in the time interval immediately after the votes arrived but before the flag ledger in order to actually vote as though all of the validators vote no on everything. That's an interval of 8 seconds every 17 minutes or so. So it will happen only about 1 out of every 128 times that a validator reboots.
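
For reference, the rough arithmetic behind that estimate (assuming roughly 4 seconds per ledger): a flag ledger occurs every 256 ledgers, so one voting cycle lasts about 256 × 4 s ≈ 1024 s ≈ 17 minutes, and an 8 second vulnerable window is 8 / 1024 ≈ 1/128 of each cycle.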

Additionally, there are some advantages of the approach taken here that should not be overlooked. @mDuo13 noted:

The correct fix for amendment flapping would be to change the activation threshold calculation so that it requires >80% of all validators on the node's UNL (ignoring Negative UNL) rather than >80% of validators currently participating in consensus.

That is precisely what is provided by the approach taken. The threshold is determined by the number of entries in the local node's UNL. That threshold only changes when the number of entries in the local UNL changes. That is an improvement on the current situation where the threshold changes whenever any validator's Amendment vote does not arrive in the correct time window.
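
Concretely, the difference is just the denominator of the threshold computation. A hypothetical before/after (illustrative arithmetic, not the actual code):

```cpp
#include <cstddef>

// Before: the denominator shrank whenever a trusted validator's
// STValidation went missing, so the threshold itself moved.
std::size_t oldThreshold(std::size_t validationsReceived)
{
    return validationsReceived * 4 / 5 + 1;  // >80% of the votes that arrived
}

// After: the denominator is the UNL size, so the threshold moves only
// when the UNL itself changes.
std::size_t newThreshold(std::size_t unlSize)
{
    return unlSize * 4 / 5 + 1;  // >80% of all trusted validators
}
```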

In terms of the frequency of flapping, with this proposal flapping might occur when the validator is restarted. This occurs much less frequently than validators losing sync with the network. So flapping should be considerably reduced from where it is now as well as from the approach taken by #4364.

Additionally, the approach taken here adds no fields to the protocol. The approach in #4364 adds sfFlappingAmendments to Amendments.

I look forward to your further thoughts and corrections.

@nbougalis
Contributor

nbougalis commented Feb 8, 2023

I think I didn’t communicate the concern well enough.

Assume a network with, say, 10 validators. There’s some amendment A which has 100% support.

Now, three servers go offline—maybe it’s because they restarted or they simply just had connectivity issues; the important thing is that they missed a round of voting.

Those servers have no way of knowing if the votes they still have cached are valid.

Perhaps one of the validators on the UNL shifted their vote, for example. And you can’t assume that will be fixed in the next round: the validator(s) that shifted their vote may now be offline.

The net result is that servers (not just validators!) doing the tally may actually be tallying different sets of ballots!

This isn’t necessarily a disaster, mind you; servers will just build different proposed ledgers, and validators will make different proposals, which will result in a desync and then the normal resync will happen. And in many cases that is what will happen, although I can think of a few corner cases that deserve more careful consideration. For example, a validator desyncing on the voting ledger will issue a partial validation in the next round, and the implications of that must be thought through.

“But” you might say “it’s possible for a server to receive a vote too late to consider it, and how is that different?” It’s true, the scenario is similar but not identical and the important thing to keep in mind is that in that case the server’s behavior is deterministic.

The other PR implements hysteresis by storing minimal state on the ledger (specifically, by storing which amendments would have lost majority in the past round but were granted a one round reprieve).

Your solution of caching the votes can achieve the same thing (if you cache them only for one “election”) but it suffers from the problem described above: that the caching isn’t consistent across servers.

My position continues to be that votes shouldn’t be sticky and that the intent wasn’t to prevent flapping for long-term outages, but only for short-term ones, which (in my mind, at least) last less than 255 ledgers.

If the consensus is that the XRPL should have sticky votes, then we need a mechanism to allow a validator to make their votes sticky on ledger, with all the complications that introduces.

@intelliot intelliot assigned scottschurr and unassigned nbougalis Feb 8, 2023
@intelliot intelliot changed the title The fixAmendmentFlapping amendment fixAmendmentFlapping: use a more stable threshold for the number of votes required; when missing STValidation, use the last vote seen Feb 8, 2023
@scottschurr
Collaborator Author

Thanks for the clarification @nbougalis. I feel like there's something that is obvious to you that I have not yet caught up on. So please bear with me.

Assume a network with, say, 10 validators. There’s some amendment A which has 100% support.

Now, three servers go offline—maybe it’s because they restarted or they simply just had connectivity issues; the important thing is that they missed a round of voting.

Those servers have no way of knowing if the votes they still have cached are valid.

Right. But if they don't cache the votes, they use a default. Currently that default is to act as though the votes that they didn't receive are missing. That reduces their voting threshold to zero. So they vote strictly according to their local amendment votes.

We have observed empirically that validators change their votes very seldom, much less often than every flag ledger. If we were to go ahead with the caching performed by this pull request there are two possible outcomes:

  1. If the servers went offline because they restarted, then they have no record of previous votes in their cache. The resulting behavior is to assume that the votes they missed are no votes. This may result in flapping if the validators regain sync between when amendment votes arrive and when the flag ledger occurs. This will occur approximately 1 out of every 128 restarts.

  2. If the servers went offline because of lost connectivity -- the more common case -- then the servers will assume that the position of the votes they missed did not change. We observe, empirically, that validators change their amendment vote much less often than every flag ledger. So assuming that the vote did not change is the best guess we can make. It is, as you say, just a guess. But it is strictly a better guess than assuming that all of the missing validators were voting no.

Perhaps one of the validators on the UNL shifted their vote, for example. And you can’t assume that will be fixed in the next round: the validator(s) that shifted their vote may now be offline.

The net result is that servers (not just validators!) doing the tally may actually be tallying different sets of ballots!

Right. But a validator that went offline may currently be tallying different sets of votes from a validator that did not go offline. This is unavoidable. Some validators are simply missing information. The code in this pull request does not make the situation any worse.

What this pull request does do is, if information goes missing, make the validator's guess more likely to be correct. Since validators seldom change their amendment votes, assuming that their vote did not change is the best we can do.

“But” you might say “it’s possible for a server to receive a vote too late to consider it, and how is that different?” It’s true, the scenario is similar but not identical and the important thing to keep in mind is that in that case the server’s behavior is deterministic.

That description is at odds with my understanding of the code in this pull request. There are no dice in this pull request. Behavior of the validator votes remains as deterministic as possible given the vagaries of communication and reboots. I will agree that the deterministic behavior is local to each validator. But I don't think there's anything new there. Each validator has always voted using the information that it has locally.

Your solution of caching the votes can achieve the same thing (if you cache them only for one “election”) but it suffers from the problem described above: that the caching isn’t consistent across servers.

Maybe we're at the heart of the matter now. Yes, in this pull request the caching is not consistent across servers. But I'm not convinced that matters. The arrival of amendment votes is not guaranteed to be consistent across servers at all times. Heck, even the UNL is local to each separate server. We try to encourage the UNL to be the same, but it is not guaranteed to be the same. Even two servers that share a remote UNL file may be different for some duration when that file changes and the servers update at different times.

My understanding is that each validator votes independently of other validators based on the best information that it has. Those votes are then shared. Sometimes validators disagree because they get different information. And the disagreement self-heals when communication is reestablished. That's still true with this pull request. But I also don't think the situation is worse with this pull request. In fact, I think the situation is improved because the local information is more likely to represent the intention of the missing validator(s) than before the local cache was added.

Here's a different way to think about it. As you say, “it’s possible for a server to receive a vote too late to consider it." But it's easy for that lateness to be non-uniform. The vote will arrive late for some servers but not for others. The consensus algorithm is designed to deal with non-uniformity of information. What this pull request does is improve the information each individual server has (locally) so each is more likely to vote correctly if some information goes temporarily missing or misses a deadline.

Your solution of caching the votes can achieve the same thing (if you cache them only for one “election”) but it suffers from the problem described above: that the caching isn’t consistent across servers.

I'm happy to make the change so the vote is only cached for one flag ledger. I'm not convinced that's necessarily an improvement, but I bow to your greater understanding of the ledger.

Other than that, I'm also okay with closing this pull request. Just because I don't understand your objection(s) doesn't mean that I'm right.

Further thoughts? Thanks for taking the time.

@thejohnfreeman
Collaborator

I think we're forgetting that an amendment officially gaining or losing majority is still governed by the consensus process. The pseudo-transaction (PT) that records the change in status must still be included in a ledger validated by a quorum of validators.

Let's compare the before-and-after of some of the scenarios Nik laid out, with respect to this amendment:

Validator X has been voting YES for an amendment, and goes offline shortly before a flag ledger L1. Half of the remaining validators restart at the same time. At L1, X's vote is missing. Later, X reconnects, and votes YES at the next flag ledger, L2.

  • Before: At L1, if the amendment barely had majority, the online validators all propose a tfLostMajority PT, and it is validated. At L2, the validators all propose a tfGotMajority PT, and it is validated.
  • After: At L1, the validators that restarted fill X's missing vote as a NO vote; the rest fill it as a YES vote. If the amendment barely had majority, the validators that restarted propose a tfLostMajority PT, but it is not validated. At L2, everyone fixes their cache of X's vote and no one proposes a PT.

10 validators are voting YES for an amendment. 3 of them go offline shortly before flag ledger L1. Later, they restart, reconnect, and vote YES at the next flag ledger, L2.

  • Before: At L1, the online validators all propose a tfLostMajority PT, and it is validated. At L2, the validators all propose a tfGotMajority PT, and it is validated.
  • After: At L1, the online validators fill the missing votes as YES votes, and no one proposes a PT. At L2, everyone fixes their cache, and no one proposes a PT.

Even when the caches are incorrect, at worst we'll get a discordant set of PTs, and it will be resolved according to consensus. As a bonus, it will tend to resolve to "no change" (i.e. no flapping), unless every validator restarts, in which case we're back to the situation we have today where they all assume the missing vote is a NO.

@intelliot
Collaborator

@thejohnfreeman @mDuo13 could you have a look at this?

If you don't think this is worth merging at this time, then go ahead and comment saying so. We can close this PR to reduce noise as we try to keep PRs from getting too old. The same solution could be proposed again in the future (potentially even in a new PR).

Also, @scottschurr if you happen to have any suggestions for moving this PR forward (or if you think it can be closed), please let such suggestions be heard.

@scottschurr
Collaborator Author

@intelliot, thanks for asking. As far as I can see there is a stalemate on this pull request.

As it stands I, personally, will not approve #4364. I think this pull request (4410) is strictly an improvement over 4364. But, potentially, you could get other folks to approve 4364 and close 4410.

Another possible outcome would be for someone to explain to me the problems in this pull request (4410) in a way that I can understand. If someone can explain the concerns in a way that I can understand then I can either tweak 4410 to fix the problem, or close it in deference to 4364.

@intelliot intelliot requested a review from ChronusZ April 26, 2023 16:30
@ChronusZ
Contributor

I don't have strong opinions between #4410 and #4364, but hopefully I can help clarify the argument. It sounds to me like @nbougalis is concerned that #4410 will make it easier for nodes to disagree about whether or not some validator supports the amendment. I'll try to elaborate a bit on why this might be true in some cases. Sorry Nik if I'm misunderstanding your critique!

With #4410, suppose validator A votes yes on some amendment in their validation for voting ledger 256n-1, but then goes offline or loses sync at e.g. ledger 256n+200. Nodes that stay alive will continue to assume A voted yes in ledger 256(n+1)-1. But any node that restarts between ledgers 256n and 256(n+1) will perceive A as voting no.

The main contrast in this situation with the current design or #4364, is that it doesn't matter where in this interval the node restarts. With the current design or #4364, it's guaranteed that if messages are delivered quickly then all nodes that are alive for ledger 256(n+1)-1 will agree on A's vote, even if they restart anywhere in the interval 256n ~ 256(n+1)-2. Since it's more likely that nodes will restart across a long span of time than it is they will restart at a specific ledger, one could expect it's more likely for disagreement to occur with #4410. The interval within which restarts can cause disagreement becomes larger the longer we keep saved votes alive for.
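
For concreteness, the ledger indices above follow from flag ledgers occurring every 256 ledgers, with amendment votes carried in validations of the ledger just before. A throwaway sketch of that arithmetic:

```cpp
#include <cstdint>

// Flag ledgers occur every 256 ledgers; amendment votes are carried in
// validations of the preceding ledger (the "256n-1" above).
bool isFlagLedger(std::uint32_t seq)
{
    return seq % 256 == 0;
}

bool isVotingLedger(std::uint32_t seq)
{
    return (seq + 1) % 256 == 0;
}
```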

@JoelKatz
Collaborator

I can imagine no scenario where this change makes things worse and lots of scenarios where it makes it better. No amendment is needed, and one should not be used, as this doesn't change any behavior related to consensus or ledger contents.

Some possible alternate designs that some might prefer:

  1. Treat the absence of a vote as a vote against changing the amendment's state.
  2. Expire entries in the cache after some period of time (4 hours?).
  3. Don't include any got/lost majority pseudo-transactions in your initial proposal until your server has been operating for at least one hour.

I see no reason to oppose this change if it's not associated with an amendment. There are cases where it makes things better, I can think of no cases where it makes things worse. It makes it easier to evolve the behavior to make things even better. It doesn't risk gaming the 80% threshold as proposals that change the threshold or add hysteresis might do. Though prone to sub-optimal behavior in some edge cases (like around validator restarts) it significantly reduces the sub-optimal behavior around very common cases.

@intelliot intelliot removed the request for review from RichardAH May 26, 2023 13:56
@intelliot intelliot requested a review from ckeshava May 31, 2023 21:51
Amendment flapping usually occurs when an amendment is on
the verge of gaining majority, and a validator not in favor
of the amendment goes offline.

If fixAmendmentFlapping activates then two changes occur:

1. The number of validators in the UNL determines the required
threshold for an amendment to gain majority.

2. The AmendmentTable keeps a record of the most recent
Amendment vote received from each trusted validator in the
UNL.  If no validation arrives from a given validator, then
the AmendmentTable assumes that the previously received vote
has not changed.

Fixes XRPLF#4350
@scottschurr
Collaborator Author

The top two commits address @JoelKatz's concerns:

  • There's a 24 hour timeout on saved validator votes. Any saved votes that are older than 24 hours are removed.
  • The amendment enabling the feature is removed.

Additionally, the pull request is now rebased to the current develop branch.
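
Roughly, the expiry works like this (illustrative names and shapes, reusing the CachedVote idea sketched in the description, not the actual code):

```cpp
#include <chrono>
#include <map>

constexpr auto voteTimeout = std::chrono::hours{24};

// Drop cached votes older than 24 hours. Once a stale vote is erased,
// the validator is tallied as voting no until a fresh STValidation
// arrives.
template <class Key, class Vote>
void pruneStaleVotes(
    std::map<Key, Vote>& lastVotes,
    std::chrono::steady_clock::time_point now)
{
    for (auto it = lastVotes.begin(); it != lastVotes.end();)
    {
        if (now - it->second.timeSeen > voteTimeout)
            it = lastVotes.erase(it);
        else
            ++it;
    }
}
```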

@intelliot
Collaborator

@ChronusZ thank you for the clarification above. Will you be able to review this PR?

@intelliot intelliot requested a review from JoelKatz June 30, 2023 05:50
@intelliot intelliot removed the request for review from ChronusZ September 13, 2023 20:59
@intelliot intelliot changed the title fixAmendmentFlapping: use a more stable threshold for the number of votes required; when missing STValidation, use the last vote seen Fix amendment flapping: use a more stable threshold for the number of votes required; when missing STValidation, use the last vote seen Sep 21, 2023
@intelliot intelliot changed the title Fix amendment flapping: use a more stable threshold for the number of votes required; when missing STValidation, use the last vote seen Fix amendment majority flapping: use a more stable threshold for the number of votes required; when missing STValidation, use the last vote seen Sep 21, 2023
@Bronek
Collaborator

Bronek commented Sep 21, 2023

I think the approach in this PR will lead to voting results which are easier to explain in the face of network disconnects, node restarts, and other failures, as opposed to the approach presented in #4364. The reason I think so is that in this PR there is no extra information recorded in the ledger, and everything can be explained by information available locally on the nodes.

I think explainability of voting is a very important property that we must not lose sight of, and for that reason I would vote for this PR and not for the other, even if this PR does not appear to be robust in the face of all hypothetical failures.

@Bronek
Collaborator

I noticed that we still count a validator's vote towards the quorum after the 24h time limit has passed, only as "all noes". It's a simple mental model which I think is correct, given that if we wanted to drop that (or any other) validator from the quorum we could do that explicitly, as that's handled elsewhere.

@scottschurr scottschurr added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Sep 26, 2023
@intelliot intelliot linked an issue Sep 27, 2023 that may be closed by this pull request
@intelliot intelliot merged commit 2bb8de0 into XRPLF:develop Sep 28, 2023
15 checks passed
@intelliot intelliot mentioned this pull request Oct 17, 2023
@intelliot intelliot added this to the 2.0 milestone Oct 26, 2023
scottschurr added a commit to scottschurr/rippled that referenced this pull request Nov 3, 2023
…ism (XRPLF#4410)"

This reverts commit 2bb8de0.

Manual testing showed some unexpected behavior.  We're removing the
commit since the release date is close and fixing the problem is
risky.  A fixed commit will be put into a later release.
@sgramkumar
Collaborator

sgramkumar commented Nov 7, 2023

Executed tests to verify this change and they work as expected. 👍

Successfully merging this pull request may close these issues.

Prevent amendment majority "flapping" (Version: 1.9.4)