btcstaking: improved algorithm for voting power update #486
Conversation
```go
	NewState: types.BTCDelegationStatus_PENDING,
}
if err := sdkCtx.EventManager().EmitTypedEvent(event); err != nil {
	panic(fmt.Errorf("failed to emit EventBTCDelegationStateUpdate for the new pending BTC delegation: %w", err))
```
IMO it is a little too harsh to panic if emitting the event fails; maybe just log or return the error.
Given that events are part of the consensus, if some validators successfully emit events while others do not, then there will be consensus failure. Panicking here allows the validators with event issues to stop immediately, resolve the issues, and restart. Wdyt?
Events do not affect the appHash. The error that can happen in `EmitTypedEvent` comes from marshaling the typed event, not from the `EmitEvent` part. So if one validator emits the event and another doesn't, it does not affect consensus.

My opinion is that panics should be avoided in almost all cases (apart from rare bad cases like consensus failures).
It's not part of the app hash, but it is part of Tendermint consensus (i.e., an input to some hash values in Tendermint headers). There was a decision that Tendermint makes events part of consensus to make them deterministic (here and here). So if many validators fail to emit events, the Cosmos SDK app hash won't diverge, but Tendermint consensus will stall.
I kinda agree with @RafilxTenfen in the sense that if the panic indeed happens, then it happens in all the validators and the chain will completely stall until a fix is deployed to >2/3 of the validators. But if we log an error/alert here, we won't have a stall.
As we have used panics in all event-emitting places, we can revisit this problem later.
> then it happens in all the validators and the chain will completely stall until a fix is deployed over >2/3 validators. But if we log error/alert here, we won't have a stall.
This is not necessarily the only possibility -- another possible scenario (which I think is more likely) is that some validators did something wrong in their configurations and failed to emit events, while other validators emitted events without problems. In this case, if validators with invalid configurations return an error while other validators go through the process, then the two sets of validators will end up with different app hashes. If we panic here, these validators still have a chance to fix the configuration and restart the node to catch up, without doing any rollback.
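To make the trade-off concrete, here is a minimal sketch of the panic-on-failure pattern being discussed. `EventManager`, `mustEmitTypedEvent`, and the emitter types are hypothetical stand-ins for illustration, not the Cosmos SDK API:

```go
package main

import "fmt"

// EventManager is a hypothetical stand-in for an SDK event manager whose
// EmitTypedEvent can fail (e.g., when marshaling the typed event fails).
type EventManager interface {
	EmitTypedEvent(event any) error
}

// mustEmitTypedEvent sketches the pattern under discussion: halt on emission
// failure so a misconfigured validator can fix its setup and restart,
// instead of silently diverging from its peers.
func mustEmitTypedEvent(em EventManager, event any) {
	if err := em.EmitTypedEvent(event); err != nil {
		panic(fmt.Errorf("failed to emit typed event: %w", err))
	}
}

type okEmitter struct{}

func (okEmitter) EmitTypedEvent(event any) error { return nil }

type failEmitter struct{}

func (failEmitter) EmitTypedEvent(event any) error {
	return fmt.Errorf("marshaling error")
}
```

The alternative position (log and continue) would replace the `panic` with a logger call, accepting the risk described above.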
A couple of suggestions, but no impediments
utACK
(force-pushed from 1716aae to 7d5b82b, then from 176697b to e1a40d2)
Sorry for the delay. Spent some time getting my head around it. In general, both the algorithm and implementation lgtm. Awesome work!
One question regarding the necessity of introducing the `ActiveFinalityProviders` field: if we always store `FinalityProviders` in the order of voting power, it would be easy to get the active set out of this. So do we really need `ActiveFinalityProviders`?
Some other comments:
```go
// for each finality provider the delegation restakes to, update its index
for _, fpBTCPK := range btcDel.FpBtcPkList {
	var btcDelIndex = types.NewBTCDelegatorDelegationIndex()
```
Suggested change:
```suggestion
var btcDelIndex *types.BTCDelegatorDelegationIndex
```
Is `NewBTCDelegatorDelegationIndex` needed?
I think so, since later on we invoke `btcDelIndex.Add`. If it's nil then the function invocation will panic.
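For illustration, here is a minimal, self-contained sketch of why the index must be non-nil before `Add` is called. The types below are simplified stand-ins for the real module types, not the actual implementation:

```go
package main

// BTCDelegatorDelegationIndex is a simplified stand-in for the real type:
// it keeps staking tx hashes in order of arrival.
type BTCDelegatorDelegationIndex struct {
	StakingTxHashList []string
}

func NewBTCDelegatorDelegationIndex() *BTCDelegatorDelegationIndex {
	return &BTCDelegatorDelegationIndex{StakingTxHashList: []string{}}
}

// Add appends a staking tx hash. Calling it on a nil receiver would panic
// when it writes the field, which is why the index must be initialized
// before the first use.
func (i *BTCDelegatorDelegationIndex) Add(stakingTxHash string) error {
	i.StakingTxHashList = append(i.StakingTxHashList, stakingTxHash)
	return nil
}

// getOrCreateIndex sketches the get-or-create pattern: reuse the stored
// index if present, otherwise start from a fresh, non-nil one.
func getOrCreateIndex(store map[string]*BTCDelegatorDelegationIndex, fpPK string) *BTCDelegatorDelegationIndex {
	if idx, ok := store[fpPK]; ok {
		return idx
	}
	idx := NewBTCDelegatorDelegationIndex()
	store[fpPK] = idx
	return idx
}
```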
```go
// if the previous height is not indexed (which might happen at genesis),
// use the base header
```
I think genesis is the only case where the BTC height is not indexed, right? Maybe we can check whether `babylonHeight == 0` and panic if not (actually Cosmos forbids access to height 0, so maybe we can panic either way).
I'm not sure if genesis is the only case where this might happen. Since the BTC light client is initialised earlier than the BTC staking module, I would say using the base header is the safest solution. Wdyt?
Good point, you are right! Will revise
```go
/*
	filter and classify all events into new/expired BTC delegations and slashed FPs
*/
for _, event := range events {
```
This could probably be a separate function called `filterAndClassifyAllEvents`.
On the other hand, why do we need this classification step? Wouldn't it be enough to:
- create a map `fpBtcPkHex -> FinalityProviderDistInfo` based on the old distribution cache
- iterate over all events and apply the changes to this map based on the events
- create the new distribution cache based on this new map

This flow maps a bit more nicely to the flow I had in mind for the whole thing, i.e. `newDistCache = applyEvents(oldDistCache, events)`, but maybe I am missing some detail here.
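A minimal sketch of the `applyEvents`-style flow suggested above, with a deliberately simplified cache model (a map from finality provider PK hex to voting power). All names here are illustrative, not the real module types:

```go
package main

// PowerEvent is a hypothetical, simplified voting power update event:
// a positive delta for activation, a negative one for unbonding.
type PowerEvent struct {
	FpBtcPkHex string
	PowerDelta int64
}

// applyEvents builds a new cache from the old one plus this block's events,
// touching only the providers mentioned by events. Providers whose power
// drops to zero or below are dropped from the new cache.
func applyEvents(oldCache map[string]int64, events []PowerEvent) map[string]int64 {
	newCache := make(map[string]int64, len(oldCache))
	for pk, power := range oldCache {
		newCache[pk] = power
	}
	for _, ev := range events {
		newCache[ev.FpBtcPkHex] += ev.PowerDelta
		if newCache[ev.FpBtcPkHex] <= 0 {
			delete(newCache, ev.FpBtcPkHex)
		}
	}
	return newCache
}
```

The point of the sketch is the shape of the API, not the data model: the new cache is a pure function of the old cache and the block's events.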
This is because the distribution cache uses lists for finality providers and BTC delegations, and thus cannot support efficient queries. Say a BTC delegation expired: we would need an O(n) operation to find that BTC delegation and remove it, and this would have to be done for every BTC delegation.
Right now we classify these events and assign them to maps, such that we only need to iterate over the distribution cache once.
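The classification step described above can be sketched as follows, using a simplified hypothetical event type (the real code classifies typed events from the event bus). Bucketing into maps turns the later membership checks into O(1) lookups:

```go
package main

// DelEvent is a hypothetical flattened event for illustration only.
type DelEvent struct {
	Kind          string // "active", "unbonded", or "slashed_fp"
	StakingTxHash string
	FpBtcPkHex    string
}

// classifyEvents buckets events into lookup maps so the later single pass
// over the distribution cache can check membership in O(1) instead of
// scanning delegation lists.
func classifyEvents(events []DelEvent) (slashedFPs map[string]struct{}, unbondedDels map[string]struct{}, activeDels map[string][]string) {
	slashedFPs = map[string]struct{}{}
	unbondedDels = map[string]struct{}{}
	activeDels = map[string][]string{}
	for _, ev := range events {
		switch ev.Kind {
		case "slashed_fp":
			slashedFPs[ev.FpBtcPkHex] = struct{}{}
		case "unbonded":
			unbondedDels[ev.StakingTxHash] = struct{}{}
		case "active":
			activeDels[ev.FpBtcPkHex] = append(activeDels[ev.FpBtcPkHex], ev.StakingTxHash)
		}
	}
	return slashedFPs, unbondedDels, activeDels
}
```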
Hmm, I think I understand.
On the other hand, with this map-based approach you need to iterate over all finality providers and all their delegations in this middle loop:
```go
// iterate over all finality providers and apply all events
for i := range dc.FinalityProviders {
	// create a copy of the finality provider
	fp := *dc.FinalityProviders[i]
	fp.TotalVotingPower = 0
	fp.BtcDels = []*types.BTCDelDistInfo{}
	fpBTCPKHex := fp.BtcPk.MarshalHex()
	// if this finality provider is slashed, continue to avoid recording it
	if _, ok := slashedFPs[fpBTCPKHex]; ok {
		continue
	}
	// add all BTC delegations that are not unbonded to the new finality provider
	for j := range dc.FinalityProviders[i].BtcDels {
		btcDel := *dc.FinalityProviders[i].BtcDels[j]
		if _, ok := unbondedBTCDels[btcDel.StakingTxHash]; !ok {
			fp.AddBTCDelDistInfo(&btcDel)
		}
	}
	// process all new BTC delegations under this finality provider
	if fpActiveBTCDels, ok := activeBTCDels[fpBTCPKHex]; ok {
		// handle new BTC delegations for this finality provider
		for _, d := range fpActiveBTCDels {
			fp.AddBTCDel(d)
		}
		// remove the finality provider entry in activeBTCDels map, so that
		// after the for loop the rest entries in activeBTCDels belongs to new
		// finality providers with new BTC delegations
		delete(activeBTCDels, fpBTCPKHex)
	}
	// add this finality provider to the new cache if it has voting power
	if fp.TotalVotingPower > 0 {
		newDc.AddFinalityProviderDistInfo(&fp)
	}
}
```
which seems wasteful. I think the main advantage of having the event-based model here is that we are able to calculate per-block reward and power distribution diffs.
In most cases the differences in rewards/power between Babylon blocks will be minimal, like a few new delegations and a few unbondings. Given that, it seems to me that we should avoid iterating over all data from the past distribution.
Not a blocker for me for now, we can improve it after adding more tests to this 👍
> In most cases the differences in rewards/power between Babylon blocks will be minimal like few new delegations and few unbondings. Given that it seems to me that it we should avoid iteration over all data from past distribution.
Yep, agreed. Roughly speaking, the current version has complexity O(n*x), while the optimal complexity is O(n log n), where `n` is the number of active finality providers and `x` is the number of active BTC delegations under each finality provider. To achieve the optimal value we need to use a map to store every BTC delegation's voting power, which is not straightforward given that we cannot use Go's `map`, whose iteration order is non-deterministic. Added a TODO for now.
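For reference, the usual way to keep a Go map for O(1) lookups while still iterating deterministically (a requirement for consensus code) is to sort the keys first. A small sketch:

```go
package main

import "sort"

// sortedKeys returns a map's keys in lexicographic order, so that iterating
// over the map via these keys is deterministic across nodes, which plain
// Go map iteration is not.
func sortedKeys(m map[string]uint64) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

The extra sort is the source of the O(n log n) term mentioned above.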
Thanks all for the great comments! I have addressed or answered all comments.
Please feel free to take a look again.
Great work! Left only some potential suggestions for future improvements :)
```go
delEvent := typedEvent.BtcDelStateUpdate
if delEvent.NewState == types.BTCDelegationStatus_ACTIVE {
	// newly active BTC delegation
	btcDel, err := k.GetBTCDelegation(ctx, delEvent.StakingTxHash)
```
Hmm, if our activation event would contain all the necessary data to build:
```go
type BTCDelDistInfo struct {
	// btc_pk is the Bitcoin secp256k1 PK of this BTC delegation
	// the PK follows encoding in BIP-340 spec
	BtcPk *github_com_babylonchain_babylon_types.BIP340PubKey `protobuf:"bytes,1,opt,name=btc_pk,json=btcPk,proto3,customtype=github.com/babylonchain/babylon/types.BIP340PubKey" json:"btc_pk,omitempty"`
	// babylon_pk is the Babylon public key of the BTC delegation
	BabylonPk *secp256k1.PubKey `protobuf:"bytes,2,opt,name=babylon_pk,json=babylonPk,proto3" json:"babylon_pk,omitempty"`
	// staking_tx_hash is the staking tx hash of the BTC delegation
	StakingTxHash string `protobuf:"bytes,3,opt,name=staking_tx_hash,json=stakingTxHash,proto3" json:"staking_tx_hash,omitempty"`
	// voting_power is the voting power of the BTC delegation
	VotingPower uint64 `protobuf:"varint,4,opt,name=voting_power,json=votingPower,proto3" json:"voting_power,omitempty"`
}
```
we could avoid this random db read. Maybe a `todo`?
This is not straightforward and in some sense unavoidable. A BTC delegation can stake to multiple finality providers and thus has the field `FpBtcPkList` for a list of finality providers' PKs. Meanwhile, `BTCDelDistInfo` does not care about finality providers' PKs.
Given that we limit the number of `GetBTCDelegation` invocations (the hotspot) to O(# of newly active BTC delegations) (which is optimal), maybe this is fine. Wdyt?
Hmm, maybe we will see when we do load testing again :)
I'm wondering, as we will launch mainnet without restaking, maybe we should just have a 1-to-1 mapping between delegation and finality provider? I really do not like having features just in case :)
```go
}

// NOTE: we don't need to record events for pending BTC delegations since these
// do not affect voting power distribution
```
Just food for thought.
After thinking about it a bit longer, maybe it is useful to have pending delegations in our `VotingPowerDistCache`?
If we did that, then we could make the order of `BTCDelDistInfo` under each `FinalityProviderDistInfo` the same as the order of delegations in the finality provider's delegation index. This means that the whole `VotingPowerDistCache` would be a bit like a current snapshot of our whole db.
Having such a setup, we could add a `delegationIndex` under each finality provider to each activation/unbonding event. Then during the voting power update, we could use these indexes to avoid linear scans.
> If we would do that then we could make it that the order of BTCDelDistInfo under each FinalityProviderDistInfo will be the same as the order of delegations in finality provider delegations index. This mean that whole VotingPowerDistCache will be a bit like current snapshot of our whole db.
There is a problem: delegations in the finality provider delegation index in the KV store are out of order -- BTC delegations are ordered by their staking tx hash, and finality provider delegations are ordered by the finality provider's BTC PK. So their orders are different from those in `VotingPowerDistCache` here, and this does not help with avoiding linear scans, unfortunately.
Hmm, from what I see, delegations in the index are ordered by time of arrival, i.e., with each new delegation we just append it to the index:
```go
if err := btcDelIndex.Add(stakingTxHash); err != nil {
	return types.ErrInvalidStakingTx.Wrapf(err.Error())
}
```
So if we updated `BtcDels []*BTCDelDistInfo` just by processing events in order, then this field would have the same order as the delegation index of the validator.
What am I missing?
Ah you are right, yes in the BTC delegation index the BTC delegations are ordered by time. However I do not see how this helps with avoiding linear scan. The BTC delegation index is keyed by finality provider PK and BTC staker PK, while the events are only keyed by the Babylon height. It needs some thought if we want to use BTC delegation index to track state updates.
Very nice work!
Merged 32ef3f5 into base/index-optimise-btcstaking
This PR introduces the improved algorithm for updating the voting power table and distribution cache. Instead of iterating over all finality providers and BTC delegations, the algorithm reconciles the previous voting power distribution with all new voting power update events. This reduces the complexity of `BeginBlocker` from O(# of all BTC delegations) to O(# of BTC delegations whose states are updated at this height).

The algorithm works as follows:
- find the BTC tip height `h` at the current Babylon height and the BTC tip height `h'` at the prior Babylon height (i.e., current - 1)
- reconcile the previous voting power distribution with all voting power update events during `[h', h]`

Along the way, this PR also simplifies and fixes all existing fuzz tests, which have covered most of the scenarios.
TODO (that can be in subsequent PRs):
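The reconciliation described above can be sketched in a self-contained way as follows. All types and names here are hypothetical simplifications; the real implementation operates on the module's event store and distribution cache:

```go
package main

// heightEvents models events recorded per BTC height; each event is a
// mutation applied to a simplified distribution (FP PK hex -> voting power).
type heightEvents map[uint64][]func(dist map[string]int64)

// reconcile copies the previous distribution and applies only the events
// recorded for BTC heights in [hPrev, hCur], matching the described range,
// so the work done is proportional to the number of state updates rather
// than the total number of delegations.
func reconcile(dist map[string]int64, evs heightEvents, hPrev, hCur uint64) map[string]int64 {
	out := make(map[string]int64, len(dist))
	for k, v := range dist {
		out[k] = v
	}
	for h := hPrev; h <= hCur; h++ {
		for _, apply := range evs[h] {
			apply(out)
		}
	}
	return out
}
```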