feat: broadcast barrier #4207
Conversation
PRO-949 Broadcast Barrier
We want to prevent validators from being punished for failing to submit Polkadot transactions. However, with optimistic rotation, it's possible that we sign and broadcast transfers using the future key before that key has any available funds. In this case, the transaction would fail validation. There are two undesirable consequences of this.
The solution is to add a request barrier, similar to the threshold key lock. We can either add an option to the broadcaster, or add a separate method. A barrier broadcast must succeed before any subsequent broadcasts are attempted; previous broadcasts can proceed and be retried etc. as normal. For the queue of waiting broadcasts we can use the existing retry queue mechanism. Before we trigger a broadcast attempt, we first check whether there is a barrier: the broadcast only proceeds if its broadcast_id is less than or equal to the barrier. When the barrier broadcast succeeds, it removes the barrier.

Note that it's important that threshold signature requests continue to be processed in the meantime: we want to avoid the possibility of a backlog of signature requests building up that might overload the network.

The motivating example for this is Polkadot, but it should be possible to configure this option generically for any chain.
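A minimal sketch of the gating rule described above, assuming the barrier is kept as an `Option<BroadcastId>` in a `BroadcastBarrier` storage item (the names here are placeholders, not necessarily what the PR uses):

```rust
// Sketch only: gate each broadcast attempt on the barrier.
fn may_broadcast<T: Config<I>, I: 'static>(broadcast_id: BroadcastId) -> bool {
	match BroadcastBarrier::<T, I>::get() {
		// Only broadcasts requested at or before the barrier may proceed.
		Some(barrier_id) => broadcast_id <= barrier_id,
		// No barrier: everything proceeds as normal.
		None => true,
	}
}

// When the barrier broadcast itself succeeds, lift the barrier so queued
// broadcasts can be picked up again from the existing retry queue.
fn on_barrier_broadcast_success<T: Config<I>, I: 'static>(broadcast_id: BroadcastId) {
	if BroadcastBarrier::<T, I>::get() == Some(broadcast_id) {
		BroadcastBarrier::<T, I>::kill();
	}
}
```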
Codecov Report (Attention)

Additional details and impacted files:

```
@@           Coverage Diff            @@
##            main    #4207    +/-   ##
========================================
- Coverage     72%      71%     -0%
========================================
  Files        385      384      -1
  Lines      63554    63033    -521
========================================
- Hits       45448    44996    -452
+ Misses     15749    15707     -42
+ Partials    2357     2330     -27
========================================
```
```diff
@@ -152,38 +152,20 @@ impl<T: Config<I>, I: 'static> VaultRotator for Pallet<T, I> {
 	if let Some(VaultRotationStatus::<T, I>::KeyHandoverComplete { new_public_key }) =
 		PendingVaultRotation::<T, I>::get()
 	{
-		if let Some(EpochKey { key, key_state, .. }) = Self::active_epoch_key() {
+		if let Some(EpochKey { key, .. }) = Self::active_epoch_key() {
			match <T::SetAggKeyWithAggKey as SetAggKeyWithAggKey<_>>::new_unsigned(
```
It would be nice to move towards non-Ethereum-based naming here; SetAggKeyWithAggKey is specific to Ethereum. Can be another PR.
This code is a bit easier to follow without the key locking stuff now 👌
```diff
 if retries.len() > num_retries_that_fit {
-	BroadcastRetryQueue::<T, I>::put(retries.split_off(num_retries_that_fit));
+	paused_broadcasts.append(&mut retries.split_off(num_retries_that_fit));
 }
```
death to if statements!

```rust
paused_broadcasts.append(&mut retries.split_off(min(retries.len(), num_retries_that_fit)));
```
why? I love if statements :)
More branches = slower. In most cases (like this one) the difference is negligible (or none, with a clever compiler), but the compiler will have an easier time optimising a min/max than branching through an if. It also just tells the story a bit better: min acting as a cap for how many we can process.
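A standalone illustration of the capping behaviour (plain Vec, not the pallet code):

```rust
// Capping the split index with min keeps split_off in bounds without a
// branch at the call site. When everything fits, split_off returns an
// empty "paused" Vec instead of panicking on an out-of-range index.
let mut retries = vec![1, 2, 3];
let num_retries_that_fit = 5;
let paused = retries.split_off(retries.len().min(num_retries_that_fit));
assert!(paused.is_empty());
assert_eq!(retries, vec![1, 2, 3]);
```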
IMO the following reads quite nicely: there is no mutable Vec, no if. It reads like what we want: from the retry queue, extract any value below the limit, taking at most num_retries_that_fit.
```rust
let retries = BroadcastRetryQueue::<T, I>::mutate(|retry_queue| {
	let id_limit = BroadcastPause::<T, I>::get().unwrap_or(BroadcastId::max_value());
	retry_queue
		.extract_if(|broadcast| broadcast.broadcast_attempt_id.broadcast_id < id_limit)
		.take(num_retries_that_fit)
		.collect::<Vec<_>>()
});
```
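(Worth noting that Vec::extract_if was still unstable at the time this PR was written. A stable sketch with the same semantics, i.e. at most num_retries_that_fit matching entries removed and everything else left queued, could look like this, reusing the names from the snippet above:)

```rust
// Stable-Rust equivalent of extract_if(..).take(num_retries_that_fit):
// remove at most `num_retries_that_fit` entries below the limit, leaving
// the rest of the queue (including later matches) untouched.
let mut retries = Vec::new();
let mut i = 0;
while i < retry_queue.len() && retries.len() < num_retries_that_fit {
	if retry_queue[i].broadcast_attempt_id.broadcast_id < id_limit {
		retries.push(retry_queue.remove(i));
	} else {
		i += 1;
	}
}
```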
☝️ no?
Looks cleaner, will make the change.
As a discussion point though, I don't see how in this case a min/max is a better choice over a simple if statement. In its implementation, at the innermost level, the min/max function would have an if statement to do the comparison, and the compiler would still have to compile that.
```rust
	},
));

let (broadcast_id_3, _) =
```
nit: we could make this test a bit more valuable by making this a loop and sending 5 transactions after it's paused
Looking good.
This should work for Polkadot, but I think we may have overlooked an edge case with Ethereum, where a tx signed with an old aggkey is delayed until after the aggkey update. The tx would be broadcast but would revert.
For Ethereum we actually want to do it in multiple steps:
- Pause all retries, wait until all broadcasts before the setAggKey broadcast succeed or fail (not sure how best to achieve this).
- Broadcast the setAggKey transaction, allow retries only for this tx.
- When it succeeds, open up retries to be processed again (threshold sig should be refreshed as needed).
Not sure how best to deal with this. Maybe instead of a boolean condition, we would need an enum of different behaviours...
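Purely as an illustration of that last point (the names are hypothetical, not from this PR), the boolean pause could become something like:

```rust
/// Hypothetical sketch: per-barrier behaviour instead of a single
/// boolean pause flag.
pub enum BarrierBehaviour {
	/// Polkadot-style: pause only broadcasts requested after the barrier;
	/// earlier broadcasts can still be retried as normal.
	PauseSubsequent,
	/// Ethereum-style: additionally wait for all broadcasts before the
	/// barrier tx to succeed or fail, broadcast the barrier tx alone, and
	/// only then reopen retries (re-signing as needed).
	DrainThenPause,
}
```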
There are some merge conflicts, and it seems the tests don't pass: at least when I run the broadcast_pause test (which should probably be renamed now), it panics.
A note for future readers: this does interact with the old reporting mechanism a little. Because we report based on broadcast id, we could potentially report people unfairly if they failed to broadcast due to an invalid signature. However, since the barrier should stop this case, it's not a big concern.
```diff
 if retries.len() > num_retries_that_fit {
-	BroadcastRetryQueue::<T, I>::put(retries.split_off(num_retries_that_fit));
+	paused_broadcasts.append(&mut retries.split_off(num_retries_that_fit));
 }
```
☝️ no?
```rust
if let Some(broadcast_barrier_id) = BroadcastBarriers::<T, I>::get().last() {
	if *broadcast_barrier_id == broadcast_id {
		BroadcastBarriers::<T, I>::mutate(|broadcast_barriers| {
```
We can write this more efficiently using a single try_mutate.
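For example (a sketch only, assuming BroadcastBarriers is a StorageValue holding a Vec<BroadcastId>, as the snippet above suggests):

```rust
// Single storage access: pop the barrier only if it matches, and abort
// the mutation (leaving storage untouched) otherwise.
let _ = BroadcastBarriers::<T, I>::try_mutate(|broadcast_barriers| {
	if broadcast_barriers.last() == Some(&broadcast_id) {
		broadcast_barriers.pop();
		Ok(())
	} else {
		Err(())
	}
});
```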
```rust
let initiated_at = T::ChainTracking::get_block_height();

let threshold_signature_payload = api_call.threshold_signature_payload();
T::ThresholdSigner::request_signature_with_callback(
	threshold_signature_payload.clone(),
	|threshold_request_id| {
		Call::on_signature_ready {
			threshold_request_id,
			threshold_signature_payload,
			api_call: Box::new(api_call),
			broadcast_attempt_id: next_broadcast_attempt_id,
			initiated_at,
		}
		.into()
	},
```
This is almost identical to what is called in threshold_sign_and_broadcast. There must be a way to deduplicate this.
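One possible shape, purely as a sketch: a shared helper that both call sites invoke. The parameter types here are guesses at the pallet's aliases, not its actual code.

```rust
// Hypothetical shared helper; both threshold_sign_and_broadcast and the
// re-signing path could funnel through it.
fn request_signature_for_broadcast<T: Config<I>, I: 'static>(
	api_call: <T as Config<I>>::ApiCall,
	broadcast_attempt_id: BroadcastAttemptId,
	initiated_at: ChainBlockNumberFor<T, I>,
) {
	let threshold_signature_payload = api_call.threshold_signature_payload();
	T::ThresholdSigner::request_signature_with_callback(
		threshold_signature_payload.clone(),
		|threshold_request_id| {
			Call::on_signature_ready {
				threshold_request_id,
				threshold_signature_payload,
				api_call: Box::new(api_call),
				broadcast_attempt_id,
				initiated_at,
			}
			.into()
		},
	);
}
```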
```rust
// We update the initiated_at here: since the tx is re-signed and
// broadcast, it is not possible for it to have been successfully
// broadcast before this point.
let initiated_at = T::ChainTracking::get_block_height();
```
What does this mean for witnessing? For example, if one of the previous broadcasts is still pending in the witnesser, does this have any effect?
@kylezs maybe you can answer this - will this cause any side effects in the engine?
It's possible it does impact witnessing, yeah; will have to look a bit deeper to confirm. Relevant code is in engine/src/witness/common/chunked_chain_source/chunked_by_vault/deposit_addresses.rs. Otherwise I'll have a look in a bit.
We can just inline this; it looks like it needs to be called before the deposit event, but it doesn't. A little misleading.
Ok, so there might be an issue here - if we're resetting the initiated_at, this scenario could occur:
- sign tx with initiated_at 20
- finalised witnessing is at block 22
- tx gets into block 24 - we haven't witnessed it, since we're slightly behind
- timeout on the SC occurs, sig verification fails
- block head is at 26, so we re-sign and set initiated_at to block 26
- CFE gets to finalised witnessing at block 24; however, when it queries storage for what it should witness, it filters out the TransactionOutId it should witness, because it sees that it's only potentially valid from block 26 - so we don't witness it.

So the correct thing to do would be: if there already exists a TransactionOutId of this type, use the lower initiated_at. Deleting the TransactionOutIdToBroadcastId as is done above will mean we won't witness it if the transaction lands on the external chain in the meantime, which is basically the same issue.
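In code terms, that fix might look something like this (sketch only; existing_initiated_at stands in for whatever is already stored against this TransactionOutId):

```rust
// Keep the earliest plausible start of validity so the CFE never filters
// out a TransactionOutId that may already be on the external chain.
let initiated_at = match existing_initiated_at {
	Some(previous) => core::cmp::min(previous, T::ChainTracking::get_block_height()),
	None => T::ChainTracking::get_block_height(),
};
```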
also this is a decent chunk of dup'd code
So basically we just don't need to do this resetting of the initiated_at, right?
Yep. But we set it for a new one ofc.
Either way, what you've described above is problematic. It implies we duplicated a broadcast? AFAIK the only scenario where this can still happen is if there's a Polkadot runtime upgrade. I guess that in order to witness this, we need to have witnessed all broadcasts up until the runtime upgrade block. And if we've seen all blocks for which the old tx would have been valid, then this implies it didn't succeed? So it's ok to re-broadcast?
The case you are describing is similar to the one discussed in #4262. This can only happen in the unlikely case of the tx witness delay anomaly and the Polkadot runtime version update happening at the same time, which are themselves rare events on their own. Moreover, as Dan said, if in that case we do witness the Polkadot runtime version update, it means the witnessing has been done up until that point, and if the tx hasn't gone through by then, it will fail.
I think it is safe to reset the initiated_at when re-signing the tx.
```rust
// We update the initiated_at here: since the tx is re-signed and
// broadcast, it is not possible for it to have been successfully
// broadcast before this point.
let initiated_at = T::ChainTracking::get_block_height();
```
@kylezs maybe you can answer this - will this cause any side effects in the engine?
```rust
#[cfg(feature = "try-runtime")]
fn post_upgrade(state: Vec<u8>) -> Result<(), DispatchError> {
	if System::runtime_version().spec_version == SPEC_VERSION {
```
Is this correct (System::runtime_version().spec_version == SPEC_VERSION in each of these methods)? Shouldn't some of them be < or >?
Should be correct. This migration will only run if the post-runtime-upgrade spec version exactly matches the SPEC_VERSION this migration was created for, as the comment says. We then initialise this versioned migration with the spec version we want it to apply to (one plus the current) in the runtime lib file.
```rust
	RequestType::CurrentKey
};
let request_type = T::KeyProvider::active_epoch_key().defensive_map_or_else(
	|| RequestType::SpecificKey(Default::default(), Default::default()),
```
If we use RequestType::CurrentKey it will fail more gracefully (it will emit CurrentKeyUnavailable and retry with CurrentKey until one is available).
I have removed the CurrentKey variant from RequestType since we don't use it any more. Ceremonies will always be signed with the key active at the time of the threshold signature request, since we do optimistic rotation for all chains now and every ceremony has a key associated with it.
Pull Request
Closes: PRO-949

Checklist
Please conduct a thorough self-review before opening the PR.

Summary
This includes the following changes:
- The vault_key_rotated extrinsic is removed, since there are no tasks that need to be done on the successful broadcast of the key activation tx.