feat: broadcast barrier #4207
Conversation
PRO-949 Broadcast Barrier
We want to prevent validators from being punished for failing to submit Polkadot transactions. However, with optimistic rotation, it's possible that we sign and broadcast transfers using the future key before that key has any available funds. In this case, the transaction would fail validation. There are two undesirable consequences of this.
The solution is to add a request barrier, similar to the threshold key lock. We can either add an option to the broadcaster, or add a separate method. A barrier broadcast must succeed before any subsequent broadcasts are attempted; previous broadcasts can proceed and be retried etc. as normal. For the queue of waiting broadcasts we can use the existing retry queue mechanism. Before we trigger a broadcast attempt, we first check whether there is a barrier: the broadcast only proceeds if its broadcast_id is less than or equal to the barrier. When the barrier broadcast succeeds, it removes the barrier.

Note that it's important that threshold signature requests continue to be processed in the meantime: we want to avoid the possibility of a backlog of signature requests building up that might overload the network.

The motivating example for this is Polkadot, but it should be possible to configure this option generically for any chain.
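A minimal sketch of the gating rule described above, assuming the barrier is kept as an `Option<BroadcastId>` in a `BroadcastBarrier` storage item (the names here are placeholders, not necessarily what the PR uses):

```rust
// Sketch only: gate each broadcast attempt on the barrier.
fn may_broadcast<T: Config<I>, I: 'static>(broadcast_id: BroadcastId) -> bool {
	match BroadcastBarrier::<T, I>::get() {
		// Only broadcasts requested at or before the barrier may proceed.
		Some(barrier_id) => broadcast_id <= barrier_id,
		// No barrier: everything proceeds as normal.
		None => true,
	}
}

// When the barrier broadcast itself succeeds, lift the barrier so queued
// broadcasts can be picked up again from the existing retry queue.
fn on_barrier_broadcast_success<T: Config<I>, I: 'static>(broadcast_id: BroadcastId) {
	if BroadcastBarrier::<T, I>::get() == Some(broadcast_id) {
		BroadcastBarrier::<T, I>::kill();
	}
}
```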
Codecov Report (Attention)

Additional details and impacted files:

```
@@           Coverage Diff            @@
##            main    #4207    +/-   ##
========================================
- Coverage     72%      71%     -0%
========================================
  Files        385      384      -1
  Lines      63554    63033    -521
========================================
- Hits       45448    44996    -452
+ Misses     15749    15707     -42
+ Partials    2357     2330     -27
========================================
```
```diff
@@ -152,38 +152,20 @@ impl<T: Config<I>, I: 'static> VaultRotator for Pallet<T, I> {
 	if let Some(VaultRotationStatus::<T, I>::KeyHandoverComplete { new_public_key }) =
 		PendingVaultRotation::<T, I>::get()
 	{
-		if let Some(EpochKey { key, key_state, .. }) = Self::active_epoch_key() {
+		if let Some(EpochKey { key, .. }) = Self::active_epoch_key() {
			match <T::SetAggKeyWithAggKey as SetAggKeyWithAggKey<_>>::new_unsigned(
```
It would be nice to move towards non-Ethereum-based naming here; SetAggKeyWithAggKey is specific to Ethereum. Can be another PR.
This code is a bit easier to follow without the key locking stuff now 👌
```diff
 if retries.len() > num_retries_that_fit {
-	BroadcastRetryQueue::<T, I>::put(retries.split_off(num_retries_that_fit));
+	paused_broadcasts.append(&mut retries.split_off(num_retries_that_fit));
 }
```
death to if statements!

```rust
paused_broadcasts.append(&mut retries.split_off(min(retries.len(), num_retries_that_fit)));
```
why? I love if statements :)
More branches = slower. In most cases (like this one) the difference is negligible (or none, with a clever compiler), but the compiler will have an easier time optimising a min/max than branching through an if. It also just tells the story a bit better: min acting as a cap for how many we can process.
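A standalone illustration of the capping behaviour (plain Vec, not the pallet code):

```rust
// Capping the split index with min keeps split_off in bounds without a
// branch at the call site. When everything fits, split_off returns an
// empty "paused" Vec instead of panicking on an out-of-range index.
let mut retries = vec![1, 2, 3];
let num_retries_that_fit = 5;
let paused = retries.split_off(retries.len().min(num_retries_that_fit));
assert!(paused.is_empty());
assert_eq!(retries, vec![1, 2, 3]);
```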
IMO the following reads quite nicely: there is no mutable Vec, no if. It reads like what we want: from the retry queue, extract any value below the limit, taking at most num_retries_that_fit.
```rust
let retries = BroadcastRetryQueue::<T, I>::mutate(|retry_queue| {
	let id_limit = BroadcastPause::<T, I>::get().unwrap_or(BroadcastId::max_value());
	retry_queue
		.extract_if(|broadcast| broadcast.broadcast_attempt_id.broadcast_id < id_limit)
		.take(num_retries_that_fit)
		.collect::<Vec<_>>()
});
```
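(Worth noting that Vec::extract_if was still unstable at the time this PR was written. A stable sketch with the same semantics, i.e. at most num_retries_that_fit matching entries removed and everything else left queued, could look like this, reusing the names from the snippet above:)

```rust
// Stable-Rust equivalent of extract_if(..).take(num_retries_that_fit):
// remove at most `num_retries_that_fit` entries below the limit, leaving
// the rest of the queue (including later matches) untouched.
let mut retries = Vec::new();
let mut i = 0;
while i < retry_queue.len() && retries.len() < num_retries_that_fit {
	if retry_queue[i].broadcast_attempt_id.broadcast_id < id_limit {
		retries.push(retry_queue.remove(i));
	} else {
		i += 1;
	}
}
```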
☝️ no?
Looks cleaner, will make the change.
As a discussion point though, I don't see how in this case a min/max is a better choice over a simple if statement. In its implementation, at the innermost level, the min/max function would have an if statement to do the comparison, and the compiler would still have to compile that.
```rust
	},
));

let (broadcast_id_3, _) =
```
nit: we could make this test a bit more valuable by making this a loop and sending 5 transactions after it's paused
Looking good.
This should work for Polkadot, but I think we may have overlooked an edge case with Ethereum, where a tx signed with an old aggkey is delayed until after the aggkey update. The tx would be broadcast but would revert.
For Ethereum we actually want to do it in multiple steps:
- Pause all retries, wait until all broadcasts before the setAggKey broadcast succeed or fail (not sure how best to achieve this).
- Broadcast the setAggKey transaction, allow retries only for this tx.
- When it succeeds, open up retries to be processed again (threshold sig should be refreshed as needed).
Not sure how best to deal with this. Maybe instead of a boolean condition, we would need an enum of different behaviours...
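Purely as an illustration of that last point (the names are hypothetical, not from this PR), the boolean pause could become something like:

```rust
/// Hypothetical sketch: per-barrier behaviour instead of a single
/// boolean pause flag.
pub enum BarrierBehaviour {
	/// Polkadot-style: pause only broadcasts requested after the barrier;
	/// earlier broadcasts can still be retried as normal.
	PauseSubsequent,
	/// Ethereum-style: additionally wait for all broadcasts before the
	/// barrier tx to succeed or fail, broadcast the barrier tx alone, and
	/// only then reopen retries (re-signing as needed).
	DrainThenPause,
}
```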
There are some merge conflicts, and it seems the tests don't pass: at least when I run the broadcast_pause test (which should probably be renamed now), it panics.
A note for future readers: this does interact with the old reporting mechanism a little. Because we report based on broadcast id, we could potentially report people unfairly if they failed to broadcast due to an invalid signature. However, since the barrier should stop this case, it's not a big concern.
```diff
 if retries.len() > num_retries_that_fit {
-	BroadcastRetryQueue::<T, I>::put(retries.split_off(num_retries_that_fit));
+	paused_broadcasts.append(&mut retries.split_off(num_retries_that_fit));
 }
```
☝️ no?
```rust
if let Some(broadcast_barrier_id) = BroadcastBarriers::<T, I>::get().last() {
	if *broadcast_barrier_id == broadcast_id {
		BroadcastBarriers::<T, I>::mutate(|broadcast_barriers| {
```
We can write this more efficiently using a single try_mutate.
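For example (a sketch only, assuming BroadcastBarriers is a StorageValue holding a Vec<BroadcastId>, as the snippet above suggests):

```rust
// Single storage access: pop the barrier only if it matches, and abort
// the mutation (leaving storage untouched) otherwise.
let _ = BroadcastBarriers::<T, I>::try_mutate(|broadcast_barriers| {
	if broadcast_barriers.last() == Some(&broadcast_id) {
		broadcast_barriers.pop();
		Ok(())
	} else {
		Err(())
	}
});
```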
```rust
let initiated_at = T::ChainTracking::get_block_height();

let threshold_signature_payload = api_call.threshold_signature_payload();
T::ThresholdSigner::request_signature_with_callback(
	threshold_signature_payload.clone(),
	|threshold_request_id| {
		Call::on_signature_ready {
			threshold_request_id,
			threshold_signature_payload,
			api_call: Box::new(api_call),
			broadcast_attempt_id: next_broadcast_attempt_id,
			initiated_at,
		}
		.into()
	},
```
This is almost identical to what is called in threshold_sign_and_broadcast. There must be a way to deduplicate this.
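One possible shape, purely as a sketch: a shared helper that both call sites invoke. The parameter types here are guesses at the pallet's aliases, not its actual code.

```rust
// Hypothetical shared helper; both threshold_sign_and_broadcast and the
// re-signing path could funnel through it.
fn request_signature_for_broadcast<T: Config<I>, I: 'static>(
	api_call: <T as Config<I>>::ApiCall,
	broadcast_attempt_id: BroadcastAttemptId,
	initiated_at: ChainBlockNumberFor<T, I>,
) {
	let threshold_signature_payload = api_call.threshold_signature_payload();
	T::ThresholdSigner::request_signature_with_callback(
		threshold_signature_payload.clone(),
		|threshold_request_id| {
			Call::on_signature_ready {
				threshold_request_id,
				threshold_signature_payload,
				api_call: Box::new(api_call),
				broadcast_attempt_id,
				initiated_at,
			}
			.into()
		},
	);
}
```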
```rust
// We update the initiated_at here: since the tx is re-signed and
// broadcast, it is not possible for it to have been successfully
// broadcast before this point.
let initiated_at = T::ChainTracking::get_block_height();
```
What does this mean for witnessing? For example, if one of the previous broadcasts is still pending in the witnesser, does this have any effect?
@kylezs maybe you can answer this - will this cause any side effects in the engine?
It's possible it does impact witnessing, yeah; will have to look a bit deeper to confirm. Relevant code is in engine/src/witness/common/chunked_chain_source/chunked_by_vault/deposit_addresses.rs. Otherwise I'll have a look in a bit.
We can just inline this; it looks like it needs to be called before the deposit event, but it doesn't. A little misleading.
Ok, so there might be an issue here - if we're resetting the initiated_at, this scenario could occur:
- sign tx with initiated_at 20
- finalised witnessing is at block 22
- tx gets into block 24 - we haven't witnessed it, since we're slightly behind
- timeout on the SC occurs, sig verification fails
- block head is at 26, so we re-sign and set initiated_at to block 26
- CFE gets to finalised witnessing at block 24; however, when it queries storage for what it should witness, it filters out the TransactionOutId it should witness, because it sees that it's only potentially valid from block 26 - so we don't witness it.

So the correct thing to do would be: if there already exists a TransactionOutId of this type, use the lower initiated_at. Deleting the TransactionOutIdToBroadcastId as is done above will mean we won't witness it if the transaction lands on the external chain in the meantime, which is basically the same issue.
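In code terms, that fix might look something like this (sketch only; existing_initiated_at stands in for whatever is already stored against this TransactionOutId):

```rust
// Keep the earliest plausible start of validity so the CFE never filters
// out a TransactionOutId that may already be on the external chain.
let initiated_at = match existing_initiated_at {
	Some(previous) => core::cmp::min(previous, T::ChainTracking::get_block_height()),
	None => T::ChainTracking::get_block_height(),
};
```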
also this is a decent chunk of dup'd code
So basically we just don't need to do this resetting of the initiated_at, right?
Yep. But we set it for a new one ofc.
Either way, what you've described above is problematic. It implies we duplicated a broadcast? AFAIK the only scenario where this can still happen is if there's a Polkadot runtime upgrade. I guess that in order to witness this, we need to have witnessed all broadcasts up until the runtime upgrade block. And if we've seen all blocks for which the old tx would have been valid, then this implies it didn't succeed? So it's ok to re-broadcast?
The case you are describing is similar to the one discussed in #4262. This can only happen in the unlikely case of the tx witness delay anomaly and the Polkadot runtime version update happening at the same time, which are themselves rare events on their own. Moreover, as Dan said, if in that case we do witness the Polkadot runtime version update, it means the witnessing has been done up until that point, and if the tx hasn't gone through by then, it will fail.
I think it is safe to reset the initiated_at when re-signing the tx.
```rust
// We update the initiated_at here: since the tx is re-signed and
// broadcast, it is not possible for it to have been successfully
// broadcast before this point.
let initiated_at = T::ChainTracking::get_block_height();
```
@kylezs maybe you can answer this - will this cause any side effects in the engine?
```rust
#[cfg(feature = "try-runtime")]
fn post_upgrade(state: Vec<u8>) -> Result<(), DispatchError> {
	if System::runtime_version().spec_version == SPEC_VERSION {
```
Is this correct (System::runtime_version().spec_version == SPEC_VERSION in each of these methods)? Shouldn't some of them be < or >?
Should be correct. This migration will only run if the post-runtime-upgrade spec version exactly matches the SPEC_VERSION this migration was created for, as the comment says. We then initialise this versioned migration with the spec version we want it to apply to (one plus the current) in the runtime lib file.
```rust
	RequestType::CurrentKey
};
let request_type = T::KeyProvider::active_epoch_key().defensive_map_or_else(
	|| RequestType::SpecificKey(Default::default(), Default::default()),
```
If we use RequestType::CurrentKey it will fail more gracefully (it will emit CurrentKeyUnavailable and retry with CurrentKey until one is available).
I have removed the CurrentKey variant from RequestType since we don't use it any more. Ceremonies will always be signed with the key active at the time of the threshold signature request, since we do optimistic rotation for all chains now and every ceremony has a key associated with it.
Pull Request
Closes: PRO-949

Checklist
Please conduct a thorough self-review before opening the PR.

Summary
This includes the following changes:
- The vault_key_rotated extrinsic is removed, since there are no tasks that need to be done on the successful broadcast of the key activation tx.