feat: CON-1445 Start remote DKG as soon as request appears in state #10055

Merged
eichhorl merged 166 commits into master from eichhorl/early-dkg-configs
May 12, 2026


@eichhorl eichhorl commented Apr 29, 2026

Background

The NiDkgConfig contains the data necessary for a subnet to run one instance of the NiDKG protocol. For remote DKG (where a subnet generates key material for a different subnet), these configs are derived from request contexts stored in the replicated state.

Problem

Before this PR, these remote NiDkgConfigs were only created and included as part of summary blocks. This means a remote DKG request had to wait in the replicated state until consensus reached the next summary block, at which point the config was created and the remote DKG protocol was started. This increased the worst-case latency of remote DKG requests (i.e. subnet/engine creations) by up to one DKG interval.

Proposed Changes

Instead of waiting until the next summary, we can create the NiDkgConfigs on demand, as soon as the corresponding request appears in the replicated state. We continue to use the start_height of the most recent summary block for these configs, essentially treating them "as if" they had already existed when the last summary was created. The only difference is that the number of remaining blocks to complete the remote DKG may be reduced, since it may be started at any point throughout the interval.
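A minimal sketch of this idea, using hypothetical simplified types rather than the actual ones in rs/consensus/dkg. The names `Height`, `RemoteConfigSketch`, `config_on_demand` and `remaining_blocks` are illustrative, and the sketch assumes the next summary block sits at `last_summary_height + dkg_interval_length + 1`:

```rust
/// Hypothetical stand-in for a block height.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Height(pub u64);

/// Hypothetical, heavily simplified remote config.
#[derive(Debug, PartialEq, Eq)]
pub struct RemoteConfigSketch {
    /// Identifies the request context in the replicated state.
    pub callback_id: u64,
    /// Always the height of the most recent summary block,
    /// regardless of when the request was first seen.
    pub start_height: Height,
}

/// Builds a config for a request that just appeared in the state,
/// "as if" it had existed when the last summary was created.
pub fn config_on_demand(callback_id: u64, last_summary_height: Height) -> RemoteConfigSketch {
    RemoteConfigSketch {
        callback_id,
        start_height: last_summary_height,
    }
}

/// Number of data blocks that can still follow `current` before the next
/// summary (assumption: next summary = last summary + interval length + 1).
pub fn remaining_blocks(last_summary: Height, dkg_interval_length: u64, current: Height) -> u64 {
    let next_summary = last_summary.0 + dkg_interval_length + 1;
    next_summary.saturating_sub(current.0 + 1)
}
```

The later in the interval a request shows up, the smaller `remaining_blocks` becomes, while `start_height` stays the same.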

For simplicity, this PR removes the handling of remote DKG as part of summary blocks entirely, which implies that remote NiDkgConfigs are no longer stored in summary blocks. Therefore, when creating/validating dealings and transcripts, we now iterate over the union of local configs in the summary and remote configs derived from the replicated state.
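Conceptually, the set of configs to iterate over can be sketched as below. All types and the function `union_of_configs` are hypothetical simplifications for illustration, not the real signatures in the codebase:

```rust
use std::collections::BTreeMap;

/// Hypothetical: whether a DKG targets the local subnet or a remote request.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum TargetSketch {
    Local,
    Remote(u64),
}

/// Hypothetical, simplified DKG identifier.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct DkgIdSketch {
    pub start_height: u64,
    pub target: TargetSketch,
}

/// Hypothetical, simplified config carrying only its identifier.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ConfigSketch {
    pub dkg_id: DkgIdSketch,
}

/// Union of the local configs stored in the summary and the remote configs
/// derived from the replicated state, keyed by DKG id.
pub fn union_of_configs(
    local_from_summary: &[ConfigSketch],
    remote_from_state: &[ConfigSketch],
) -> BTreeMap<DkgIdSketch, ConfigSketch> {
    local_from_summary
        .iter()
        .chain(remote_from_state.iter())
        .map(|config| (config.dkg_id.clone(), config.clone()))
        .collect()
}
```

Dealing and transcript handling can then treat both kinds of configs uniformly, independent of where they came from.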

Additionally, this means that config creation errors and timeouts are now also handled by data blocks:

  • Config creation errors are included in the data block, which will generate a reject response to the context.
  • A remote DKG request context increases its attempt counter (stored in the summary) whenever a new summary block is created. Once this counter exceeds the maximum number of attempts, the next data block will include a timeout response.
  • Completed requests of the current interval (successful or unsuccessful) are detected by traversing previous data blocks, such that the same request isn't handled multiple times.
  • Completed requests of the past interval (successful or unsuccessful) are tracked by the summary with an attempt counter of 0. These requests are ignored when creating future NiDkgConfigs, such that the same request isn't handled multiple times.
  • Note that we consider a request completed if any one of its transcripts was completed.
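The rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the names `Action`, `decide`, `attempts_for_next_summary` and the constant `MAX_ATTEMPTS_SKETCH` are hypothetical, request contexts are identified by a plain `u64`, and a request missing from the summary's counter map is assumed to be new:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical limit; the real maximum is not stated in this PR description.
pub const MAX_ATTEMPTS_SKETCH: u32 = 5;

/// What a data block does for one remote DKG request context.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Already completed (in this or a past interval): do nothing.
    Skip,
    /// Too many attempts: include a timeout response.
    Timeout,
    /// Config creation failed: include a reject response.
    Reject(String),
    /// Config created: work on dealings and transcripts.
    Start,
}

pub fn decide(
    callback_id: u64,
    attempts_in_summary: &BTreeMap<u64, u32>,
    completed_in_current_interval: &BTreeSet<u64>,
    config_result: Result<(), String>,
) -> Action {
    // Found by traversing previous data blocks of the current interval.
    if completed_in_current_interval.contains(&callback_id) {
        return Action::Skip;
    }
    match attempts_in_summary.get(&callback_id).copied() {
        // A counter of 0 marks a request completed in a past interval.
        Some(0) => Action::Skip,
        Some(n) if n > MAX_ATTEMPTS_SKETCH => Action::Timeout,
        _ => match config_result {
            Ok(()) => Action::Start,
            Err(e) => Action::Reject(e),
        },
    }
}

/// Counters stored in the next summary for all requests still in the state.
pub fn attempts_for_next_summary(
    previous: &BTreeMap<u64, u32>,
    open_requests: &[u64],
    completed_in_past_interval: &BTreeSet<u64>,
) -> BTreeMap<u64, u32> {
    open_requests
        .iter()
        .map(|id| {
            let prev = previous.get(id).copied();
            let next = if completed_in_past_interval.contains(id) || prev == Some(0) {
                0 // completed: keep ignoring it
            } else {
                prev.unwrap_or(0) + 1 // still pending: one more attempt
            };
            (*id, next)
        })
        .collect()
}
```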

Future Work

  • Currently, we may start a remote DKG request near the end of the interval, without enough data blocks left to include enough dealings to finish all required transcripts. This means the request will be retried at the start of the next interval. To improve efficiency, we should not create and include dealings for remote DKGs that were received with an insufficient number of data blocks left. Instead, we should only start working on the request at the start of the next interval.

  • Consider increasing the number of dealings per block and the number of prioritized remote DKGs per interval.
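For the first item, one possible check could look like the sketch below. It is not part of this PR; the function `enough_blocks_left` and all of its parameters are hypothetical, and it assumes each data block can carry at most a fixed number of dealings:

```rust
/// Returns true if the remaining data blocks of the interval can hold all
/// dealings still required for a remote DKG request (illustrative only).
pub fn enough_blocks_left(
    remaining_data_blocks: u64,
    dealings_required: u64,
    max_dealings_per_block: u64,
) -> bool {
    if max_dealings_per_block == 0 {
        // No dealings fit into any block, so only a zero requirement succeeds.
        return dealings_required == 0;
    }
    // Ceiling division: blocks needed to include all required dealings.
    let blocks_needed =
        (dealings_required + max_dealings_per_block - 1) / max_dealings_per_block;
    remaining_data_blocks >= blocks_needed
}
```

If such a check returned false, the request would be left untouched until the start of the next interval instead of producing dealings that cannot be completed in time.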

@pierugo-dfinity pierugo-dfinity left a comment

Something that crosses my mind when reading

> remote NiDkgConfigs are no longer stored in summary blocks

Can something break on upgrading the subnet if the old replica version included a NiDkgConfig in the CUP's summary block? From what I can tell, it looks like we do not re-validate the CUP's block when starting up (which would fail on the new replica version) but directly insert it into our validated pool instead. So it appears to look fine. Also from NiDKG's PoV, it does not make a difference whether the remote config comes from the state or the summary since merge_configs unifies them.
But more generally, couldn't having an invalid block in our validated pool break stuff elsewhere?

Comment thread rs/consensus/dkg/src/lib.rs Outdated
Comment thread rs/consensus/dkg/src/lib.rs Outdated
Comment thread rs/consensus/dkg/src/lib.rs
Comment thread rs/consensus/dkg/src/lib.rs
Comment thread rs/consensus/dkg/src/lib.rs
Comment thread rs/consensus/dkg/src/payload_builder.rs Outdated
Comment thread rs/consensus/dkg/src/payload_builder.rs Outdated
Comment thread rs/consensus/src/consensus/batch_delivery.rs Outdated
Comment thread rs/tests/consensus/subnet_splitting_test.rs
Comment thread rs/types/types/src/consensus/dkg.rs
@eichhorl eichhorl requested a review from a team as a code owner May 8, 2026 14:13
eichhorl commented May 8, 2026

> Can something break on upgrading the subnet if the old replica version included a NiDkgConfig in the CUP's summary block? From what I can tell, it looks like we do not re-validate the CUP's block when starting up (which would fail on the new replica version) but directly insert it into our validated pool instead. So it appears to look fine. Also from NiDKG's PoV, it does not make a difference whether the remote config comes from the state or the summary since merge_configs unifies them.
> But more generally, couldn't having an invalid block in our validated pool break stuff elsewhere?

This is not something we usually pay attention to. Technically, I guess every upgrade summary is "invalid" on the new version, because it contains the wrong replica version.

The only place where the upgrade CUP is validated again after the upgrade should be when it is fetched by an orchestrator of a different node that is still on the old version. But in that case we only validate the threshold signature. There, what's more important is that the new replica still returns the exact same bytes that the CUP was created with on the old version, which is why we store the original bytes separately in the pool. Otherwise the byte representation could change if an old CUP was deserialized and serialized again using the new version.

@pierugo-dfinity pierugo-dfinity left a comment

Nice! 🚀 🚀

Comment thread rs/consensus/dkg/src/lib.rs Outdated
Comment thread rs/consensus/dkg/src/lib.rs
Comment thread rs/consensus/dkg/src/lib.rs Outdated
Comment thread rs/consensus/dkg/src/lib.rs
Comment thread rs/consensus/dkg/src/lib.rs Outdated
Comment thread rs/consensus/dkg/src/payload_builder.rs Outdated
Comment thread rs/consensus/dkg/src/payload_builder.rs Outdated
Comment thread rs/types/types/src/consensus/dkg.rs
Comment thread rs/consensus/dkg/src/payload_builder.rs
@pierugo-dfinity

> every upgrade summary is "invalid" on the new version, because it contains the wrong replica version.

Good point. Then, this addresses my concern.

> There, what's more important is that the new replica still returns the exact same bytes that the CUP was created with

Good to know, right, thanks!

@eichhorl eichhorl removed the CI_ALL_BAZEL_TARGETS label May 12, 2026
@eichhorl eichhorl enabled auto-merge May 12, 2026 08:03
@eichhorl eichhorl disabled auto-merge May 12, 2026 09:23
@eichhorl eichhorl added this pull request to the merge queue May 12, 2026
Merged via the queue into master with commit 3af058d May 12, 2026
89 of 91 checks passed
@eichhorl eichhorl deleted the eichhorl/early-dkg-configs branch May 12, 2026 09:43
pull Bot pushed a commit to bit-cook/ic that referenced this pull request May 19, 2026
…ty#10250)

Instead of the verbose `(NiDkgId, CallbackId, Result<NiDkgTranscript,
String>)` triple, this PR introduces a dedicated struct for remote NiDKG
transcript creation results.

Additionally, we remove all mentions of "early" remote transcripts.
After dfinity#10055, "early" transcript
creation (i.e. as part of data blocks) is the only way for remote
transcripts to be created, since they can no longer be part of summary
blocks. Therefore, we can drop the "early" prefix.

Note that, although we change the type of the summary payload, this
change is backward/forward compatible (see passing
`cup_compatibility_test`). This is because the hash function of the
triple is equal to the one of the struct, therefore no migration is
necessary.

4 participants