
perf(crypto): CON-1650 Use Rayon in NIDKG #9007

Merged
randombit merged 50 commits into master from jack/crp-2550 on Apr 21, 2026

Conversation

@randombit
Contributor

@randombit randombit commented Feb 23, 2026

This substantially improves NIDKG performance. The following table of results is from the crypto benchmarks (so no network/consensus) on my devenv, which has 16 cores, similar to the Gen2 machines.

All times are for the 40-node, 40-dealer, high-threshold setting. The improvements for other settings are similar.

| Operation | master | rayon |
| --- | --- | --- |
| create_initial_dealing | 283 ms | 101 ms |
| create_reshare_dealing | 272 ms | 94 ms |
| verify_dealing | 310 ms | 84 ms |
| create_transcript | 152 ms | 14 ms |
| load_transcript | 1800 ms | 248 ms |
| retain_active_keys/0 | 3640 ms | 723 ms |
| retain_active_keys/1 | 3770 ms | 744 ms |
| retain_active_keys/5 | 3870 ms | 762 ms |
| retain_active_keys/15 | 4400 ms | 853 ms |
| retain_active_keys/30 | 5200 ms | 1080 ms |

@github-actions github-actions Bot added the perf label Feb 23, 2026
@randombit
Contributor Author

randombit commented Feb 23, 2026

Todo

  • There is a stability test for key update, but I think just one. I'd like to add several more to cover different scenarios, since the update process depends on the epochs in use.
  • Verify proofs in parallel
  • Investigate BSGS
  • Table lookup for honest dealer
  • Update CON-1444 doc

@randombit
Contributor Author

To set some number of CPUs:

    // Configure the global Rayon pool before any parallel work runs;
    // build_global() errors if the pool was already initialized.
    ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .expect("failed to configure global Rayon pool");

I'm not sure what to do here; the default is the number of cores, but this may oversubscribe us a bit. OTOH there are points where the replica is pretty much just blocked on us, and the faster the vault completes the work the better.
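One hedged option, rather than hard-coding a thread count, is to size the pool relative to the core count. This is a sketch of one possible policy (an assumption, not what the PR settled on); `pool_size` is a hypothetical helper:

```rust
// Hypothetical sizing policy (an assumption, not the PR's final choice):
// give the Rayon pool half the cores, but never fewer than two, leaving
// headroom for latency-sensitive vault work such as TLS/consensus signing.
// The core count would come from std::thread::available_parallelism(), and
// the result would feed rayon's ThreadPoolBuilder::new().num_threads(..).
fn pool_size(cores: usize) -> usize {
    (cores / 2).max(2)
}
```

With 16 cores this yields 8 threads; on a 1- or 2-core machine it still keeps a floor of 2.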

@randombit
Contributor Author

randombit commented Feb 23, 2026

I also checked the 40-dealer, 13-recipient NIDKG setting, since that's relevant for engines; the numbers look quite good to me:

crypto_nidkg_13_nodes_40_dealers_high/create_initial_dealing
                        time:   [59.307 ms 63.587 ms 69.395 ms]
crypto_nidkg_13_nodes_40_dealers_high/create_reshare_dealing
                        time:   [58.164 ms 63.475 ms 69.890 ms]
crypto_nidkg_13_nodes_40_dealers_high/verify_dealing
                        time:   [50.716 ms 50.977 ms 51.259 ms]
crypto_nidkg_13_nodes_40_dealers_high/create_transcript
                        time:   [7.0596 ms 7.1216 ms 7.1903 ms]
crypto_nidkg_13_nodes_40_dealers_high/load_transcript
                        time:   [117.18 ms 118.20 ms 119.22 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/0
                        time:   [338.97 ms 342.47 ms 345.78 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/1
                        time:   [355.97 ms 358.14 ms 360.47 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/5
                        time:   [370.10 ms 373.88 ms 377.47 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/15
                        time:   [429.25 ms 431.44 ms 433.05 ms]

@randombit randombit changed the title perf(crypto): CRP-2550 Use Rayon in NIDKG perf(crypto): CON-1650 Use Rayon in NIDKG Feb 24, 2026
@randombit randombit marked this pull request as ready for review March 9, 2026 13:28
@randombit randombit requested a review from a team as a code owner March 9, 2026 13:28
@basvandijk
Collaborator

basvandijk commented Mar 10, 2026

I noticed //rs/tests/nns:nns_token_balance_test was flaky on this PR:

$ bazel run //ci/githubstats:query -- last --flaky //rs/tests/nns:nns_token_balance_test
...
Downloading logs to: /ic/logs/nns_token_balance_test/2026-03-09T22:11:14
...
╒════╤═════════════════════════╤════════════╤══════════╤═══════════════╤═══════╤══════════╤══════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │   duration │ status   │ branch        │ PR    │ commit   │ buildbuddy   │ errors per attempt                                                                            │
╞════╪═════════════════════════╪════════════╪══════════╪═══════════════╪═══════╪══════════╪══════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Mon 2026-03-09 17:24:03 │       2:37 │ FLAKY    │ jack/crp-2550 │ #9007 │ fe35a78  │ logs         │ 1: setup: called `Result::unwrap()` on an `Err` value: Func="await_status_is_healthy of ht... │
│    │                         │            │          │               │       │          │              │    test: Failed to find SetupResult attribute after setup. Cancelling test function.          │
├────┼─────────────────────────┼────────────┼──────────┼───────────────┼───────┼──────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ Mon 2026-03-09 13:59:50 │       2:26 │ FLAKY    │ jack/crp-2550 │ #9007 │ c0ea430  │ logs         │ 1: setup: called `Result::unwrap()` on an `Err` value: Func="await_status_is_healthy of ht... │
│    │                         │            │          │               │       │          │              │    test: Failed to find SetupResult attribute after setup. Cancelling test function.          │
╘════╧═════════════════════════╧════════════╧══════════╧═══════════════╧═══════╧══════════╧══════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════╛

So I asked Claude Opus 4.6 about it:

Prompt

The //rs/tests/nns:nns_token_balance_test as implemented in rs/tests/nns/nns_token_balance_test.rs has a simple setup() function which sets up an IC testnet consisting of a single node NNS subnet and a single node application subnet:

    InternetComputer::new()
        .add_fast_single_node_subnet(SubnetType::System)
        .add_fast_single_node_subnet(SubnetType::Application)
        .setup_and_start(&env)
        .expect("failed to setup IC under test");

However, as observed in some flaky runs of this test, the application node doesn't become healthy because it can't fetch an NNS delegation from the NNS node: its own firewall blocks incoming packets from that node.

An example of this can be found in the following log of a flaky run: logs/nns_token_balance_test/2026-03-09T22:11:14/2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6/1/FAILED.log. Specifically, in the log of the application node: logs/nns_token_balance_test/2026-03-09T22:11:14/2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6/1/ic_logs/2602:fb2b:110:10:50a2:78ff:fe84:6fdc.log, where you see it successfully establish a TCP connection to the NNS node at IPv6 2602:fb2b:110:10:502f:52ff:fe34:668a but then fail to establish a TLS stream to that same node because its own firewall blocks it:

2026-03-09 17:23:44.428001 Establishing TCP connection to g4zag-dfsbp-vyukk-bqngo-f53lb-es6h4-czfup-kdnu2-qighy-gcx32-fae @ [2602:fb2b:110:10:502f:52ff:fe34:668a]:8080
2026-03-09 17:23:44.428001 Establishing TLS stream to g4zag-dfsbp-vyukk-bqngo-f53lb-es6h4-czfup-kdnu2-qighy-gcx32-fae. Tcp stream: TcpStream { addr: [2602:fb2b:110:10:50a2:78ff:fe84:6fdc]:58544, peer: [2602:fb2b:110:10:502f:52ff:fe34:668a]:8080, fd: 39 }
2026-03-09 17:23:47.248279 Drop - default policy: IN=enp1s0 OUT= MAC=52:a2:78:84:6f:dc:52:2f:52:34:66:8a:86:dd SRC=2602:fb2b:0110:0010:502f:52ff:fe34:668a DST=2602:fb2b:0110:0010:50a2:78ff:fe84:6fdc LEN=205 TC=0 HOPLIMIT=64 FLOWLBL=3982 PROTO=TCP SPT=8080 DPT=60338 WINDOW=16 RES=0x00 ACK PSH URGP=0

Why is the firewall of the application node dropping packets from other nodes in the testnet?

Root Cause Analysis: Application Node Firewall Drops in nns_token_balance_test

Test: //rs/tests/nns:nns_token_balance_test
Date: 2026-03-09
Run: 2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6

Summary

The application node's firewall drops packets from the NNS node, preventing the app node from fetching an NNS delegation and causing the test to fail. The drops are a symptom, not the root cause. The true cause is the NNS node's CSP vault being overloaded during DKG/consensus initialization, which prevents TLS handshakes from completing.

Environment

| Role | IPv6 address |
| --- | --- |
| NNS node (System subnet) | 2602:fb2b:110:10:502f:52ff:fe34:668a |
| APP node (Application subnet) | 2602:fb2b:110:10:50a2:78ff:fe84:6fdc |

Ports involved: 8080 (HTTP handler), 2497 (XNet), 4100 (transport)

Timeline

| Time | Event |
| --- | --- |
| 17:15:31 | Orchestrator writes firewall rules — both nodes correctly whitelisted on ports 22, 2497, 4100, 8080 (TCP) and 4100 (UDP) |
| 17:15:31–17:20:34 | ~5 minutes of normal operation, no drops |
| 17:20:34.878 | First NNS TLS handshake timeout (CspServerEd25519Signer: TransientInternalError: "the request exceeded its deadline") |
| 17:20:34.879 | First APP firewall drop logged — 0.5 ms after the NNS error |
| 17:20:34–17:23:24 | Drops repeat every ~5 seconds on ports 2497 and 8080, each correlated sub-millisecond with an NNS TLS error |

Mechanism

APP node                          NNS node
   |                                 |
   |------- TLS ClientHello -------->|  (conntrack: NEW -> ESTABLISHED)
   |                                 |  NNS CSP vault busy with DKG signing
   |                                 |  ...deadline exceeded after ~5s...
   |<-- [timeout] client gives up    |
   |  sends RST, conntrack entry     |
   |  destroyed                      |
   |                                 |
   |                                 |  CSP vault finally returns error
   |<------- TLS error response -----|  arrives ~0.5 ms later
   |                                 |
   |  No conntrack entry found       |
   |  -> ct state = NEW              |
   |  -> no rule accepts NEW TCP     |
   |     to ephemeral dest port      |
   |  -> DEFAULT POLICY: DROP        |
   |  -> logged as "Drop - default   |
   |     policy"                     |

  1. The APP node opens a TLS connection to the NNS node (ports 2497 or 8080).
  2. The NNS node's CspServerEd25519Signer is saturated by DKG/consensus work and cannot complete the TLS handshake within the deadline (~5 seconds).
  3. The APP node's TLS client times out and sends a TCP RST, destroying the conntrack entry.
  4. ~0.5 ms later, the NNS node's error response packet arrives at the APP node.
  5. With no conntrack entry, nftables classifies it as ct state new.
  6. No rule accepts new inbound TCP on random ephemeral destination ports → dropped by the default drop policy.
  7. The drop is logged with prefix "Drop - default policy: ".

Evidence: Sub-millisecond Timing Correlation

Every NNS TLS error matches an APP firewall drop within < 1 ms:

| NNS TLS error | APP firewall drop | Delta |
| --- | --- | --- |
| 17:20:34.878498 | 17:20:34.879075 | +0.6 ms |
| 17:20:39.922932 | 17:20:39.922915 | −0.0 ms |
| 17:20:44.937072 | 17:20:44.936623 | −0.4 ms |
| 17:20:49.944747 | 17:20:49.944717 | −0.0 ms |
| 17:20:54.964523 | 17:20:54.964530 | +0.0 ms |
| 17:20:59.976485 | 17:20:59.976427 | −0.1 ms |
| 17:21:04.987832 | 17:21:04.987796 | −0.0 ms |
| 17:21:09.997946 | 17:21:09.998127 | +0.2 ms |
| 17:21:15.011556 | 17:21:15.011534 | −0.0 ms |
| 17:21:20.019024 | 17:21:20.019048 | +0.0 ms |

Firewall Configuration (Verified Correct)

The orchestrator correctly applied the following whitelisting rules at 17:15:31:

    # IPv6 node whitelisting
    tcp dport { 22, 2497, 4100, 8080 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept
    udp dport { 4100 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept

The dropped packets have destination ports that differ from the whitelisted ports (they are ephemeral ports from closed connections), confirming the firewall rules are not at fault.

Root Cause

The NNS node's CspServerEd25519Signer is overloaded during initial DKG/consensus bootstrapping, unable to service TLS handshake signing requests within their deadline. This prevents the APP node from establishing any TLS connection to the NNS node, blocking NNS delegation fetches and causing the test to fail.

The firewall drops are a harmless secondary effect of stale TCP response packets arriving after connections have already been reset.

Potential Fixes

  1. Increase CSP vault signing deadline during the bootstrap phase to tolerate DKG load.
  2. Prioritize TLS handshake signing requests over DKG signing in the CSP vault.
  3. Add more resilient retry logic in the APP node's NNS delegation fetcher to tolerate prolonged TLS failures.
  4. Ensure test waits for NNS node readiness (post-DKG stabilization) before starting the application subnet.
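Fix 3 above amounts to retrying the delegation fetch with backoff. A minimal sketch of that shape, assuming a generic fallible operation (the helper name and backoff policy are hypothetical, not the delegation fetcher's actual code):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hedged sketch of fix 3: retry a fallible operation with linear backoff.
// `attempt` stands in for the real NNS-delegation fetch; only the retry
// shape is the point here.
fn retry_with_backoff<T, E>(
    mut attempt: impl FnMut() -> Result<T, E>,
    max_tries: u32,
    base_delay: Duration,
) -> Result<T, E> {
    let mut last_err = None;
    for try_no in 0..max_tries {
        match attempt() {
            Ok(v) => return Ok(v),
            Err(e) => {
                last_err = Some(e);
                // Linear backoff: wait longer after each failed attempt.
                sleep(base_delay * (try_no + 1));
            }
        }
    }
    Err(last_err.expect("max_tries must be > 0"))
}
```

A real fetcher would also want a jittered or exponential policy and an overall deadline; this only illustrates tolerating a bounded window of failures.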

@randombit
Contributor Author

Thanks @basvandijk, definitely an issue - I think none of the solutions Claude lays out really work, since they are more or less specific to this one test, while resource starvation in the vault could probably apply in all sorts of scenarios. I guess as a starting point I'll configure Rayon's thread pool to use only a portion of the available core count.

@basvandijk
Collaborator

Sounds good. Also note I observed some other tests flaking as well on this PR. Just making sure that we're not introducing a new source of flakiness.

@andreacerulli andreacerulli self-requested a review March 16, 2026 15:04
@basvandijk
Collaborator

basvandijk commented Mar 17, 2026

I was debugging a flaky run of //rs/tests/idx:basic_health_test and didn't immediately realise it was coming from this PR. However, I just noticed my agent had already automatically posted the following comment (let me know if it's just noise and I'll delete it):

Flake analysis: //rs/tests/idx:basic_health_test

I investigated the flake of //rs/tests/idx:basic_health_test on commit 3dd693a (this PR) on 2026-03-17. Here is a detailed analysis and a proposed fix.

Symptoms

  • Attempt 1 (FAILED): The Application subnet never finalized blocks beyond genesis. All 4 nodes were stuck at the genesis height (16299310772172839798, set by with_random_height()). After 504s of retrying, installing a UniversalCanister failed with: "Certificate is stale (over 270s). Is the computer's clock synchronized?"
  • Attempt 2 (PASSED): With fresh VMs, the Application subnet started normally and UC installation succeeded in 2.5s.

Root cause: vault thread pool starvation

The NIDKG code now uses rayon::par_iter() extensively. In the ic-crypto-csp vault server, these operations run on the vault's own Rayon thread pool (built via ThreadPoolBuilder::new() in tarpc_csp_vault_server.rs:645), which has available_parallelism() threads (= 6 on test VMs with 6 vCPUs).

When NIDKG calls par_iter() from a vault thread, Rayon recruits all idle threads in that same pool to do the parallel work. During DKG operations (especially at startup when initial transcripts are loaded), this can consume all 6 vault threads, leaving none available for latency-sensitive operations like signing (needed for consensus notarization and block proposals).

On master (without this PR), NIDKG runs sequentially on 1 vault thread, leaving 5 threads free for signing RPCs. With this PR, NIDKG's par_iter() can temporarily monopolize all 6 threads. If consensus can't get timely signatures from the vault, it stalls — exactly what happened: no blocks finalized beyond genesis.

Why the rayon_worker_threads() fix in main() doesn't help

The "Tone down the Rayon thread pool size" commit configures the global Rayon pool. But vault operations run on the vault's custom Rayon pool, not the global one. When par_iter() is called from a thread that belongs to a custom Rayon pool, Rayon uses that pool, ignoring the global settings. So the rayon_worker_threads() configuration has no effect on vault-dispatched NIDKG work.

Proposed fix

Create a dedicated, smaller Rayon thread pool for NIDKG operations, so they don't starve signing operations on the vault's main pool:

    // In the NIDKG library (e.g., forward_secure.rs or a shared module)
    use std::sync::LazyLock;
    use rayon::{ThreadPool, ThreadPoolBuilder};

    static NIDKG_POOL: LazyLock<ThreadPool> = LazyLock::new(|| {
        let num_threads = std::thread::available_parallelism()
            .map(|n| n.get().max(2) / 2)
            .unwrap_or(2)
            .max(2);

        ThreadPoolBuilder::new()
            .num_threads(num_threads)
            .thread_name(|i| format!("nidkg-rayon-{i}"))
            .build()
            .expect("failed to create NIDKG thread pool")
    });

Then wrap NIDKG entry points with NIDKG_POOL.install(|| { ... }). Code inside install() will have its par_iter() calls use NIDKG_POOL instead of the vault's pool, ensuring that:

  1. NIDKG still benefits from parallelism (3 threads on 6 vCPU VMs)
  2. The vault's main pool always has threads available for signing RPCs
  3. Consensus can't be starved by DKG operations

As a secondary benefit, the rayon_worker_threads() configuration in main() can be removed since it has no effect.

@randombit
Contributor Author

randombit commented Mar 18, 2026

There does seem to be a starvation issue here still. This comment is incoherent though:

> The "Tone down the Rayon thread pool size" commit configures the global Rayon pool. But vault operations run on the vault's custom Rayon pool, not the global one.

There is no "custom" Rayon pool in use, only the global pool - Claude's suggestion is exactly to create such a pool.

Edit: oh fuck, there is such a pool - our generic tarpc RPCs are being dispatched over a Rayon thread pool (since CRP-2263). Apparently I was out the week that change was made and had no idea we did this. Seems wrong to me tbh

Contributor

@fspreiss fspreiss left a comment


Thanks for addressing my comments, @randombit!

@randombit randombit enabled auto-merge April 16, 2026 12:57
@randombit randombit added this pull request to the merge queue Apr 16, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 16, 2026
@randombit randombit added this pull request to the merge queue Apr 16, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 16, 2026
Previously we only installed a thread pool for NIDKG work in
the remote vault. For tests that use an in-replica vault, NIDKG
would end up using the global thread pool, which is by default
sized to the number of cores. This caused massive over-subscription.
This reverts commit 0bee23a.

Shouldn't be needed
@randombit
Contributor Author

randombit commented Apr 17, 2026

@fspreiss Rerequesting review due to the addition of 92110d7, which seems to have addressed the test flakes. The main issue there is that previously, if the vault was in-replica, we did not set up any kind of explicit thread pool, so the Rayon par_iter/etc. went straight to the default global Rayon pool, which is sized by default to the number of cores. In addition, many of the tests (//rs/registry/canister:registry_canister_integration_test_tests seeming a very bad offender here, and a commonly flaking one) are set up such that all of the tests run in parallel (as with usual Rust #[test]s) and each one sets up one or several subnets. This causes both oversubscription (due to Rayon allocating all possible cores just for NIDKG) and also starvation (because the same thread pool is used to satisfy all NIDKG operations across all of the tests, and in some circumstances one test steals all the computational resources from the others, eventually resulting in timeouts).

@randombit randombit requested a review from fspreiss April 17, 2026 21:43
@randombit
Contributor Author

No flakes in the last several merge-master commits, so I think 92110d7 did address that.

@randombit randombit added this pull request to the merge queue Apr 21, 2026
Merged via the queue into master with commit dd1ef70 Apr 21, 2026
37 checks passed
@randombit randombit deleted the jack/crp-2550 branch April 21, 2026 20:04
basvandijk added a commit that referenced this pull request Apr 22, 2026
This reverts commit dd1ef70 since it's
causing massive failures and flakiness during [Kickoff / CI Main / Bazel
Test All
(push)](https://github.com/dfinity/ic/actions/runs/24743705235/job/72389924714)
and [Release
Testing](https://github.com/dfinity/ic/actions/runs/24759898494).
