
feat: broadcast barrier #4207

Merged: 38 commits merged into main from feat/broadcast-pause on Dec 5, 2023

Conversation

@ramizhasan111 (Contributor) commented Nov 6, 2023

Pull Request

Closes: PRO-949

Checklist

Please conduct a thorough self-review before opening the PR.

  • I am confident that the code works.
  • I have updated documentation where appropriate.

Summary

This includes the following changes:

  • A broadcast pause is introduced with the broadcast of the rotation tx: it pauses and queues all new broadcast requests, and only allows retries of already-initiated broadcasts up to and including the rotation tx.
  • Removes the key locking mechanism from the threshold signature pallet. The new key is immediately active and unlocked, and the vault is rotated as soon as key activation is initiated (optimistic rotation now applies to all chains).
  • The vault_key_rotated extrinsic is removed, since no tasks remain to be done on successful broadcast of the key activation tx.

linear bot commented Nov 6, 2023

PRO-949 Broadcast Barrier

We want to prevent validators from being punished for failing to submit Polkadot transactions.

However, with optimistic rotation, it's possible that we sign and broadcast transfers using the future key before that key has any available funds. In this case, the transaction would fail validation. There are two undesirable consequences of this:

  1. The broadcaster of the transaction might be slashed/suspended.
  2. If the effect lasts too long, the transaction itself might be banned.

The solution is to add a request barrier, similar to the threshold key lock.

We can either add an option to the broadcaster, or add a separate method set_barrier(BroadcastId), to mark a broadcast as a 'barrier broadcast'.

A barrier broadcast must succeed before any subsequent broadcasts are attempted. Previous broadcasts can proceed and be retried etc. as normal.

For the queue of waiting broadcasts we can use the existing retry queue mechanism.

Before we trigger a broadcast attempt we first need to check if there is a barrier.

The broadcast only proceeds if its broadcast_id is less than or equal to the barrier.

When the barrier broadcast succeeds it removes the barrier.

Note it's important that threshold signature requests continue to be processed in the meantime: we want to avoid the possibility of a backlog of signature requests building up that might overload the network.

The motivating example for this is Polkadot but it should be possible to configure this option generically for any chain.
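
A minimal sketch of the barrier logic described above, in Rust. All names here (Broadcaster, may_proceed, on_broadcast_success) are illustrative, not the pallet's actual API:

	type BroadcastId = u32;

	struct Broadcaster {
		// The highest broadcast id allowed to proceed; None means no barrier is set.
		barrier: Option<BroadcastId>,
		// Queued broadcasts awaiting the barrier (the real pallet queues full
		// broadcast attempts via the existing retry queue, not bare ids).
		retry_queue: Vec<BroadcastId>,
	}

	impl Broadcaster {
		/// A broadcast may only proceed if its id is at or below the barrier.
		fn may_proceed(&self, id: BroadcastId) -> bool {
			self.barrier.map_or(true, |barrier| id <= barrier)
		}

		fn attempt_broadcast(&mut self, id: BroadcastId) {
			if self.may_proceed(id) {
				// ... sign and broadcast ...
			} else {
				// Queue it; it becomes eligible again once the barrier clears.
				self.retry_queue.push(id);
			}
		}

		/// When the barrier broadcast itself succeeds, the barrier is removed.
		fn on_broadcast_success(&mut self, id: BroadcastId) {
			if self.barrier == Some(id) {
				self.barrier = None;
			}
		}
	}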

codecov bot commented Nov 8, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (e6ad83e) 72% compared to head (b42ef29) 71%.
Report is 35 commits behind head on main.

❗ Current head b42ef29 differs from pull request most recent head cd84ee0. Consider uploading reports for the commit cd84ee0 to get more accurate results

Files                                            Patch %   Lines
state-chain/pallets/cf-environment/src/mock.rs   0%        0 Missing and 3 partials ⚠️
state-chain/pallets/cf-emissions/src/mock.rs     75%       0 Missing and 1 partial ⚠️
state-chain/pallets/cf-funding/src/mock.rs       75%       0 Missing and 1 partial ⚠️
state-chain/pallets/cf-vaults/src/lib.rs         83%       1 Missing ⚠️
state-chain/pallets/cf-vaults/src/mock.rs        75%       0 Missing and 1 partial ⚠️
state-chain/runtime/src/chainflip.rs             0%        1 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##            main   #4207    +/-   ##
======================================
- Coverage     72%     71%    -0%     
======================================
  Files        385     384     -1     
  Lines      63554   63033   -521     
  Branches   63554   63033   -521     
======================================
- Hits       45448   44996   -452     
+ Misses     15749   15707    -42     
+ Partials    2357    2330    -27     


@ramizhasan111 mentioned this pull request Nov 9, 2023
@ramizhasan111 marked this pull request as ready for review Nov 9, 2023 18:36
state-chain/pallets/cf-vaults/src/tests.rs (resolved)
state-chain/pallets/cf-vaults/src/vault_rotator.rs (resolved)
@@ -152,38 +152,20 @@ impl<T: Config<I>, I: 'static> VaultRotator for Pallet<T, I> {
 	if let Some(VaultRotationStatus::<T, I>::KeyHandoverComplete { new_public_key }) =
 		PendingVaultRotation::<T, I>::get()
 	{
-		if let Some(EpochKey { key, key_state, .. }) = Self::active_epoch_key() {
+		if let Some(EpochKey { key, .. }) = Self::active_epoch_key() {
 			match <T::SetAggKeyWithAggKey as SetAggKeyWithAggKey<_>>::new_unsigned(
Contributor:

It would be nice to move towards non-Ethereum-based naming here; SetAggKeyWithAggKey is specific to Ethereum. Can be another PR.

state-chain/runtime/src/chainflip/broadcaster.rs (resolved)
state-chain/pallets/cf-vaults/src/lib.rs (resolved)
@kylezs (Contributor) left a comment:

This code is a bit easier to follow without the key locking stuff now 👌

state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/tests.rs (resolved)
state-chain/pallets/cf-vaults/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
Comment on lines 430 to 432:

 	if retries.len() > num_retries_that_fit {
-		BroadcastRetryQueue::<T, I>::put(retries.split_off(num_retries_that_fit));
+		paused_broadcasts.append(&mut retries.split_off(num_retries_that_fit));
 	}
Contributor:
death to if statements!
paused_broadcasts.append(&mut retries.split_off(min(retries.len(), num_retries_that_fit)));

Contributor (author):
why? I love if statements :)

Contributor:

More branches = slower. In most cases (like this) the difference is negligible (or zero with a clever compiler), but the compiler will have an easier time optimising a min/max than an if branch. It also tells the story a bit better: min acts as a cap on how many we can process.

Collaborator:

IMO the following reads quite nicely: there is no mutable Vec, no if. It reads like we want: from the retry queue, extract any value below the limit, take at most num_retries_that_fit.

				let retries = BroadcastRetryQueue::<T, I>::mutate(|retry_queue| {
					let id_limit =
						BroadcastPause::<T, I>::get().unwrap_or(BroadcastId::max_value());
					retry_queue
						.extract_if(|broadcast| {
							broadcast.broadcast_attempt_id.broadcast_id < id_limit
						})
						.take(num_retries_that_fit)
						.collect::<Vec<_>>()
				});

Collaborator:

☝️ no?

Contributor (author), replying to the extract_if suggestion above:

Looks cleaner, will make the change.

Contributor (author):

As a discussion point, though, I don't see how a min/max is a better choice here than a simple if statement. At the innermost level of its implementation, min/max still contains a comparison and branch to find the min/max, and the compiler still has to compile that.

state-chain/pallets/cf-broadcast/src/tests.rs (resolved)
},
));

let (broadcast_id_3, _) =
Contributor:

nit: we could make this test a bit more valuable by making it a loop, sending 5 transactions after it's paused.

@dandanlen (Collaborator) left a comment:

Looking good.

This should work for Polkadot, but I think we may have overlooked an edge case with Ethereum, where a tx signed with an old aggkey is delayed until after the aggkey update. The tx would be broadcast but would revert.

For Ethereum we actually want to do it in multiple steps:

  1. Pause all retries, wait until all broadcasts before the setAggKey broadcast succeed or fail (not sure how best to achieve this).
  2. Broadcast the setAggKey transaction, allow retries only for this tx.
  3. When it succeeds, open up retries to be processed again (threshold sig should be refreshed as needed).

Not sure how best to deal with this. Maybe instead of a boolean condition, we would need an enum of different behaviours...
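
A hedged sketch of what such an enum might look like; the name and variants are hypothetical, not necessarily what the PR ended up implementing:

	/// Hypothetical per-chain barrier behaviour.
	enum BarrierBehaviour {
		/// Polkadot-style: only block broadcasts initiated after the barrier;
		/// earlier broadcasts can still be retried as normal.
		BarrierOnly,
		/// Ethereum-style: first drain all broadcasts initiated before the barrier
		/// (they were signed with the old aggkey and would revert after rotation),
		/// then broadcast the barrier tx, then release the queue.
		DrainThenBarrier,
	}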

state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-vaults/src/lib.rs (resolved)
state-chain/pallets/cf-vaults/src/vault_rotator.rs (resolved)
@kylezs (Contributor) left a comment:

There are some merge conflicts, and it seems the tests don't pass: at least when I run the broadcast_pause test (which should probably be renamed now), it panics.

A note for future readers: this does interact with the old reporting mechanism a little. Because we report based on broadcast id, we could potentially report people unfairly if they failed to broadcast due to an invalid signature; however, since the barrier should prevent this case, it's not a big concern.

state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/tests.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)

state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
Comment on lines 574 to 576:

	if let Some(broadcast_barrier_id) = BroadcastBarriers::<T, I>::get().last() {
		if *broadcast_barrier_id == broadcast_id {
			BroadcastBarriers::<T, I>::mutate(|broadcast_barriers| {
Collaborator:
We can write this more efficiently using a single try_mutate.
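
A minimal sketch of the single-try_mutate version, assuming (as the snippet above implies) that BroadcastBarriers is a StorageValue holding an ordered list of BroadcastIds:

	let _ = BroadcastBarriers::<T, I>::try_mutate(|broadcast_barriers| {
		if broadcast_barriers.last() == Some(&broadcast_id) {
			// The barrier broadcast succeeded: remove the barrier.
			broadcast_barriers.pop();
			Ok(())
		} else {
			// Returning Err aborts the closure without writing back to storage,
			// so this is a single read plus a conditional write.
			Err(())
		}
	});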

Comment on lines 826 to 840:

	let initiated_at = T::ChainTracking::get_block_height();

	let threshold_signature_payload = api_call.threshold_signature_payload();
	T::ThresholdSigner::request_signature_with_callback(
		threshold_signature_payload.clone(),
		|threshold_request_id| {
			Call::on_signature_ready {
				threshold_request_id,
				threshold_signature_payload,
				api_call: Box::new(api_call),
				broadcast_attempt_id: next_broadcast_attempt_id,
				initiated_at,
			}
			.into()
		},
Collaborator:

This is almost identical to what is called in threshold_sign_and_broadcast. There must be a way to deduplicate this.
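
One hedged way to deduplicate, as a sketch: pull the shared logic into a private helper on the pallet (inside the impl<T: Config<I>, I: 'static> Pallet<T, I> block; the helper name is an assumption):

	fn request_signature_for_attempt(
		api_call: <T as Config<I>>::ApiCall,
		broadcast_attempt_id: BroadcastAttemptId,
	) {
		let initiated_at = T::ChainTracking::get_block_height();
		let threshold_signature_payload = api_call.threshold_signature_payload();
		T::ThresholdSigner::request_signature_with_callback(
			threshold_signature_payload.clone(),
			|threshold_request_id| {
				Call::on_signature_ready {
					threshold_request_id,
					threshold_signature_payload,
					api_call: Box::new(api_call),
					broadcast_attempt_id,
					initiated_at,
				}
				.into()
			},
		);
	}

Both threshold_sign_and_broadcast and the re-sign path could then call this helper.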

state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)

	// We update the initiated_at here since, as the tx is re-signed and broadcast,
	// it cannot have been successfully broadcast before this point.
	let initiated_at = T::ChainTracking::get_block_height();
Collaborator:
What does this mean for witnessing? For example, if one of the previous broadcasts is still pending in the witnesser, does this have any effect?

Collaborator:
@kylezs maybe you can answer this - will this cause any side effects in the engine?

Contributor:

It's possible it does impact witnessing, yeah; I'll have to look a bit deeper to confirm. Relevant code is in engine/src/witness/common/chunked_chain_source/chunked_by_vault/deposit_addresses.rs. Otherwise I'll have a look in a bit.

Contributor:

We can just inline this. It looks like it needs to be called before the deposit event, but it doesn't; a little misleading.

@kylezs (Contributor) commented Nov 29, 2023:

OK, so there might be an issue here. If we're resetting the initiated_at, this scenario could occur:

  1. sign tx with initiated_at of 20
  2. finalised witnessing is at block 22
  3. tx gets into block 24; we haven't witnessed it, since we're slightly behind
  4. timeout on the SC occurs, sig verification fails
  5. block head is at 26, so we re-sign and set initiated_at to block 26
  6. the CFE gets to finalised witnessing at block 24, but when it queries storage for what it should witness, it filters out the TransactionOutId it should witness, because it sees that it's only potentially valid from block 26; so we don't witness it.

So the correct thing to do would be: if there already exists a TransactionOutId of this type, we use the lower initiated_at.

Deleting the TransactionOutIdToBroadcastId entry, as is done above, will mean we won't witness the tx if it lands on the external chain in the meantime, which is basically the same issue.
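
A hedged sketch of that fix, assuming for illustration that TransactionOutIdToBroadcastId maps a transaction out-id to (BroadcastId, initiated_at) and that the chain's block number type is Ord:

	let current_height = T::ChainTracking::get_block_height();
	// If this TransactionOutId is already being tracked, keep the earlier of the
	// two heights, so a tx that landed before the re-sign is still witnessed.
	let initiated_at = TransactionOutIdToBroadcastId::<T, I>::get(&transaction_out_id)
		.map_or(current_height, |(_, existing_initiated_at)| {
			existing_initiated_at.min(current_height)
		});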

Contributor:

Also, this is a decent chunk of duplicated code.

Collaborator:
So basically we just don't need to do this resetting of the initiated_at, right?

Contributor:

Yep. But we set it for a new one, of course.

Collaborator:

Either way, what you've described above is problematic. It implies we duplicated a broadcast? AFAIK the only scenario when this can still happen is if there's a Polkadot runtime upgrade. I guess that in order to witness this, we need to have witnessed all broadcasts up until the runtime upgrade block. And if we've seen all blocks for which the old tx would have been valid, then this implies it didn't succeed? So it's OK to re-broadcast?

Contributor (author):

The case you're describing is similar to the one discussed in #4262. It can only happen in the unlikely case of the tx witness delay anomaly and the Polkadot runtime version update happening at the same time, which are themselves rare events on their own. Moreover, as Dan said, if in that case we do witness the dot runtime version update, it means witnessing has been done up to that point, and if the tx hasn't gone through by then, it will fail.

I think it is safe to reset the initiated_at when re-signing the tx.

state-chain/pallets/cf-broadcast/src/lib.rs (resolved)
state-chain/pallets/cf-broadcast/src/lib.rs (resolved)


#[cfg(feature = "try-runtime")]
fn post_upgrade(state: Vec<u8>) -> Result<(), DispatchError> {
	if System::runtime_version().spec_version == SPEC_VERSION {
Collaborator:
Is this correct (System::runtime_version().spec_version == SPEC_VERSION in each of these methods)? Shouldn't some of them be < or > ?

Contributor (author):

Should be correct. This migration will only run if the post-runtime-upgrade spec version exactly matches the SPEC_VERSION the migration was created for, as the comment says. We then initialise this versioned migration with the spec version we want it to apply to (one plus the current) in the runtime lib file.
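
A minimal sketch of that gating, using T::Version::get() from frame_system; the wrapper shape is an assumption for illustration, not Chainflip's actual migration framework:

	use core::marker::PhantomData;
	use frame_support::{traits::{Get, OnRuntimeUpgrade}, weights::Weight};

	/// Runs `Inner` only when the runtime's spec_version is exactly `SPEC_VERSION`.
	pub struct VersionedMigration<T, Inner, const SPEC_VERSION: u32>(PhantomData<(T, Inner)>);

	impl<T: frame_system::Config, Inner: OnRuntimeUpgrade, const SPEC_VERSION: u32> OnRuntimeUpgrade
		for VersionedMigration<T, Inner, SPEC_VERSION>
	{
		fn on_runtime_upgrade() -> Weight {
			if T::Version::get().spec_version == SPEC_VERSION {
				Inner::on_runtime_upgrade()
			} else {
				// Not the targeted spec version: skip the inner migration.
				Weight::zero()
			}
		}
	}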

state-chain/runtime/src/chainflip/broadcaster.rs (resolved)
		RequestType::CurrentKey
	};
	let request_type = T::KeyProvider::active_epoch_key().defensive_map_or_else(
		|| RequestType::SpecificKey(Default::default(), Default::default()),
Collaborator:
If we use RequestType::CurrentKey it will fail more gracefully (it will emit CurrentKeyUnavailable and retry with CurrentKey until one is available).

Contributor (author):

I have removed the CurrentKey variant from RequestType since we don't use it anymore. Ceremonies will always be signed with the key active at the time of the threshold signature request, since we do optimistic rotation for all chains now and every ceremony has a key associated with it.

@dandanlen mentioned this pull request Nov 29, 2023
@dandanlen changed the title from "feat: broadcast pause" to "feat: broadcast barrier" Dec 5, 2023
@dandanlen enabled auto-merge (squash) December 5, 2023 09:52
@dandanlen merged commit 5264c7e into main Dec 5, 2023
40 checks passed
@dandanlen deleted the feat/broadcast-pause branch December 5, 2023 09:52
3 participants