
perf(crypto): CON-1650 Use Rayon in NIDKG #9007

Merged
randombit merged 50 commits into master from jack/crp-2550 on Apr 21, 2026

Conversation

@randombit
Contributor

@randombit randombit commented Feb 23, 2026

This substantially improves NIDKG performance. The following table of results is from the crypto benchmarks (so no network/consensus) on my devenv, which has 16 cores, similar to the Gen2 machines.

All times are for the 40-node, 40-dealer, high-threshold setting. The improvements for other settings are similar.

| Operation | master | rayon |
| --- | --- | --- |
| create_initial_dealing | 283 ms | 101 ms |
| create_reshare_dealing | 272 ms | 94 ms |
| verify_dealing | 310 ms | 84 ms |
| create_transcript | 152 ms | 14 ms |
| load_transcript | 1800 ms | 248 ms |
| retain_active_keys/0 | 3640 ms | 723 ms |
| retain_active_keys/1 | 3770 ms | 744 ms |
| retain_active_keys/5 | 3870 ms | 762 ms |
| retain_active_keys/15 | 4400 ms | 853 ms |
| retain_active_keys/30 | 5200 ms | 1080 ms |

@github-actions github-actions Bot added the perf label Feb 23, 2026
@randombit
Contributor Author

randombit commented Feb 23, 2026

Todo

  • There is a stability test for key update, but I think just one. I'd like to add several more to cover different scenarios, since the update process depends on the epochs in use.
  • Verify proofs in parallel
  • Investigate BSGS
  • Table lookup for honest dealer
  • Update CON-1444 doc

@randombit
Contributor Author

To set some number of CPUs:

    // Configure the global Rayon pool before any parallel work runs;
    // build_global() errors if the pool was already initialized.
    ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .expect("failed to configure global Rayon pool");

I'm not sure what to do here; the default is the number of cores, but this may oversubscribe us a bit. OTOH there are points where the replica is pretty much just blocked on us, and the faster the vault completes the work the better.
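One hedged option, rather than hard-coding a thread count, is to size the pool relative to the core count. This is a sketch of one possible policy (an assumption, not what the PR settled on); `pool_size` is a hypothetical helper:

```rust
// Hypothetical sizing policy (an assumption, not the PR's final choice):
// give the Rayon pool half the cores, but never fewer than two, leaving
// headroom for latency-sensitive vault work such as TLS/consensus signing.
// The core count would come from std::thread::available_parallelism(), and
// the result would feed rayon's ThreadPoolBuilder::new().num_threads(..).
fn pool_size(cores: usize) -> usize {
    (cores / 2).max(2)
}
```

With 16 cores this yields 8 threads; on a 1- or 2-core machine it still keeps a floor of 2.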

@randombit
Contributor Author

randombit commented Feb 23, 2026

I also checked the 40-dealer, 13-recipient NIDKG setting, since that's relevant for engines; the numbers look quite good to me:

crypto_nidkg_13_nodes_40_dealers_high/create_initial_dealing
                        time:   [59.307 ms 63.587 ms 69.395 ms]
crypto_nidkg_13_nodes_40_dealers_high/create_reshare_dealing
                        time:   [58.164 ms 63.475 ms 69.890 ms]
crypto_nidkg_13_nodes_40_dealers_high/verify_dealing
                        time:   [50.716 ms 50.977 ms 51.259 ms]
crypto_nidkg_13_nodes_40_dealers_high/create_transcript
                        time:   [7.0596 ms 7.1216 ms 7.1903 ms]
crypto_nidkg_13_nodes_40_dealers_high/load_transcript
                        time:   [117.18 ms 118.20 ms 119.22 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/0
                        time:   [338.97 ms 342.47 ms 345.78 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/1
                        time:   [355.97 ms 358.14 ms 360.47 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/5
                        time:   [370.10 ms 373.88 ms 377.47 ms]
crypto_nidkg_13_nodes_40_dealers_high/retain_active_keys/15
                        time:   [429.25 ms 431.44 ms 433.05 ms]

@randombit randombit changed the title perf(crypto): CRP-2550 Use Rayon in NIDKG perf(crypto): CON-1650 Use Rayon in NIDKG Feb 24, 2026
@randombit randombit marked this pull request as ready for review March 9, 2026 13:28
@randombit randombit requested a review from a team as a code owner March 9, 2026 13:28
@basvandijk
Collaborator

basvandijk commented Mar 10, 2026

I noticed //rs/tests/nns:nns_token_balance_test was flaky on this PR:

$ bazel run //ci/githubstats:query -- last --flaky //rs/tests/nns:nns_token_balance_test
...
Downloading logs to: /ic/logs/nns_token_balance_test/2026-03-09T22:11:14
...
╒════╤═════════════════════════╤════════════╤══════════╤═══════════════╤═══════╤══════════╤══════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════╕
│    │   last started at (UTC) │   duration │ status   │ branch        │ PR    │ commit   │ buildbuddy   │ errors per attempt                                                                            │
╞════╪═════════════════════════╪════════════╪══════════╪═══════════════╪═══════╪══════════╪══════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ Mon 2026-03-09 17:24:03 │       2:37 │ FLAKY    │ jack/crp-2550 │ #9007 │ fe35a78  │ logs         │ 1: setup: called `Result::unwrap()` on an `Err` value: Func="await_status_is_healthy of ht... │
│    │                         │            │          │               │       │          │              │    test: Failed to find SetupResult attribute after setup. Cancelling test function.          │
├────┼─────────────────────────┼────────────┼──────────┼───────────────┼───────┼──────────┼──────────────┼───────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ Mon 2026-03-09 13:59:50 │       2:26 │ FLAKY    │ jack/crp-2550 │ #9007 │ c0ea430  │ logs         │ 1: setup: called `Result::unwrap()` on an `Err` value: Func="await_status_is_healthy of ht... │
│    │                         │            │          │               │       │          │              │    test: Failed to find SetupResult attribute after setup. Cancelling test function.          │
╘════╧═════════════════════════╧════════════╧══════════╧═══════════════╧═══════╧══════════╧══════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════╛

So I asked Claude Opus 4.6 about it:

Prompt

The //rs/tests/nns:nns_token_balance_test as implemented in rs/tests/nns/nns_token_balance_test.rs has a simple setup() function which sets up an IC testnet consisting of a single node NNS subnet and a single node application subnet:

    InternetComputer::new()
        .add_fast_single_node_subnet(SubnetType::System)
        .add_fast_single_node_subnet(SubnetType::Application)
        .setup_and_start(&env)
        .expect("failed to setup IC under test");

However, as observed in some flaky runs of this test, the application node doesn't become healthy because it can't fetch an NNS delegation from the NNS node: its own firewall blocks incoming packets from that node.

An example of this can be found in the following log of a flaky run: logs/nns_token_balance_test/2026-03-09T22:11:14/2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6/1/FAILED.log. Specifically, in the log of the application node: logs/nns_token_balance_test/2026-03-09T22:11:14/2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6/1/ic_logs/2602:fb2b:110:10:50a2:78ff:fe84:6fdc.log, where you see it successfully establish a TCP connection to the NNS node at IPv6 2602:fb2b:110:10:502f:52ff:fe34:668a but then fail to establish a TLS stream to that same node because its own firewall blocks it:

2026-03-09 17:23:44.428001 Establishing TCP connection to g4zag-dfsbp-vyukk-bqngo-f53lb-es6h4-czfup-kdnu2-qighy-gcx32-fae @ [2602:fb2b:110:10:502f:52ff:fe34:668a]:8080
2026-03-09 17:23:44.428001 Establishing TLS stream to g4zag-dfsbp-vyukk-bqngo-f53lb-es6h4-czfup-kdnu2-qighy-gcx32-fae. Tcp stream: TcpStream { addr: [2602:fb2b:110:10:50a2:78ff:fe84:6fdc]:58544, peer: [2602:fb2b:110:10:502f:52ff:fe34:668a]:8080, fd: 39 }
2026-03-09 17:23:47.248279 Drop - default policy: IN=enp1s0 OUT= MAC=52:a2:78:84:6f:dc:52:2f:52:34:66:8a:86:dd SRC=2602:fb2b:0110:0010:502f:52ff:fe34:668a DST=2602:fb2b:0110:0010:50a2:78ff:fe84:6fdc LEN=205 TC=0 HOPLIMIT=64 FLOWLBL=3982 PROTO=TCP SPT=8080 DPT=60338 WINDOW=16 RES=0x00 ACK PSH URGP=0

Why is the firewall of the application node dropping packets from other nodes in the testnet?

Root Cause Analysis: Application Node Firewall Drops in nns_token_balance_test

Test: //rs/tests/nns:nns_token_balance_test
Date: 2026-03-09
Run: 2026-03-09T17:24:03_dc9adb68-14be-423e-88ce-b71a4d3034f6

Summary

The application node's firewall drops packets from the NNS node, preventing the app node from fetching an NNS delegation and causing the test to fail. The drops are a symptom, not the root cause. The true cause is the NNS node's CSP vault being overloaded during DKG/consensus initialization, which prevents TLS handshakes from completing.

Environment

| Role | IPv6 address |
| --- | --- |
| NNS node (System subnet) | 2602:fb2b:110:10:502f:52ff:fe34:668a |
| APP node (Application subnet) | 2602:fb2b:110:10:50a2:78ff:fe84:6fdc |

Ports involved: 8080 (HTTP handler), 2497 (XNet), 4100 (transport)

Timeline

| Time | Event |
| --- | --- |
| 17:15:31 | Orchestrator writes firewall rules — both nodes correctly whitelisted on ports 22, 2497, 4100, 8080 (TCP) and 4100 (UDP) |
| 17:15:31–17:20:34 | ~5 minutes of normal operation, no drops |
| 17:20:34.878 | First NNS TLS handshake timeout (CspServerEd25519Signer: TransientInternalError: "the request exceeded its deadline") |
| 17:20:34.879 | First APP firewall drop logged — 0.5 ms after the NNS error |
| 17:20:34–17:23:24 | Drops repeat every ~5 seconds on ports 2497 and 8080, each correlated sub-millisecond with an NNS TLS error |

Mechanism

APP node                          NNS node
   |                                 |
   |------- TLS ClientHello -------->|  (conntrack: NEW -> ESTABLISHED)
   |                                 |  NNS CSP vault busy with DKG signing
   |                                 |  ...deadline exceeded after ~5s...
   |<-- [timeout] client gives up    |
   |  sends RST, conntrack entry     |
   |  destroyed                      |
   |                                 |
   |                                 |  CSP vault finally returns error
   |<------- TLS error response -----|  arrives ~0.5 ms later
   |                                 |
   |  No conntrack entry found       |
   |  -> ct state = NEW              |
   |  -> no rule accepts NEW TCP     |
   |     to ephemeral dest port      |
   |  -> DEFAULT POLICY: DROP        |
   |  -> logged as "Drop - default   |
   |     policy"                     |

  1. The APP node opens a TLS connection to the NNS node (ports 2497 or 8080).
  2. The NNS node's CspServerEd25519Signer is saturated by DKG/consensus work and cannot complete the TLS handshake within the deadline (~5 seconds).
  3. The APP node's TLS client times out and sends a TCP RST, destroying the conntrack entry.
  4. ~0.5 ms later, the NNS node's error response packet arrives at the APP node.
  5. With no conntrack entry, nftables classifies it as ct state new.
  6. No rule accepts new inbound TCP on random ephemeral destination ports → dropped by the default drop policy.
  7. The drop is logged with prefix "Drop - default policy: ".

Evidence: Sub-millisecond Timing Correlation

Every NNS TLS error matches an APP firewall drop within < 1 ms:

| NNS TLS error | APP firewall drop | Delta |
| --- | --- | --- |
| 17:20:34.878498 | 17:20:34.879075 | +0.6 ms |
| 17:20:39.922932 | 17:20:39.922915 | −0.0 ms |
| 17:20:44.937072 | 17:20:44.936623 | −0.4 ms |
| 17:20:49.944747 | 17:20:49.944717 | −0.0 ms |
| 17:20:54.964523 | 17:20:54.964530 | +0.0 ms |
| 17:20:59.976485 | 17:20:59.976427 | −0.1 ms |
| 17:21:04.987832 | 17:21:04.987796 | −0.0 ms |
| 17:21:09.997946 | 17:21:09.998127 | +0.2 ms |
| 17:21:15.011556 | 17:21:15.011534 | −0.0 ms |
| 17:21:20.019024 | 17:21:20.019048 | +0.0 ms |

Firewall Configuration (Verified Correct)

The orchestrator correctly applied the following whitelisting rules at 17:15:31:

    # IPv6 node whitelisting
    tcp dport { 22, 2497, 4100, 8080 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept
    udp dport { 4100 } ip6 saddr { 2602:fb2b:110:10:502f:52ff:fe34:668a, 2602:fb2b:110:10:50a2:78ff:fe84:6fdc } accept

The dropped packets have destination ports that differ from the whitelisted ports (they are ephemeral ports from closed connections), confirming the firewall rules are not at fault.

Root Cause

The NNS node's CspServerEd25519Signer is overloaded during initial DKG/consensus bootstrapping, unable to service TLS handshake signing requests within their deadline. This prevents the APP node from establishing any TLS connection to the NNS node, blocking NNS delegation fetches and causing the test to fail.

The firewall drops are a harmless secondary effect of stale TCP response packets arriving after connections have already been reset.

Potential Fixes

  1. Increase CSP vault signing deadline during the bootstrap phase to tolerate DKG load.
  2. Prioritize TLS handshake signing requests over DKG signing in the CSP vault.
  3. Add more resilient retry logic in the APP node's NNS delegation fetcher to tolerate prolonged TLS failures.
  4. Ensure test waits for NNS node readiness (post-DKG stabilization) before starting the application subnet.
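Fix 3 above amounts to retrying the delegation fetch with backoff. A minimal sketch of that shape, assuming a generic fallible operation (the helper name and backoff policy are hypothetical, not the delegation fetcher's actual code):

```rust
use std::thread::sleep;
use std::time::Duration;

// Hedged sketch of fix 3: retry a fallible operation with linear backoff.
// `attempt` stands in for the real NNS-delegation fetch; only the retry
// shape is the point here.
fn retry_with_backoff<T, E>(
    mut attempt: impl FnMut() -> Result<T, E>,
    max_tries: u32,
    base_delay: Duration,
) -> Result<T, E> {
    let mut last_err = None;
    for try_no in 0..max_tries {
        match attempt() {
            Ok(v) => return Ok(v),
            Err(e) => {
                last_err = Some(e);
                // Linear backoff: wait longer after each failed attempt.
                sleep(base_delay * (try_no + 1));
            }
        }
    }
    Err(last_err.expect("max_tries must be > 0"))
}
```

A real fetcher would also want a jittered or exponential policy and an overall deadline; this only illustrates tolerating a bounded window of failures.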

@randombit
Contributor Author

Thanks @basvandijk, definitely an issue - I think none of the solutions Claude lays out really work, since they are more or less specific to this one test, while resource starvation in the vault could probably apply in all sorts of scenarios. I guess as a starting point I'll configure Rayon's thread pool to use only a portion of the available core count.

@basvandijk
Collaborator

Sounds good. Also note I observed some other tests flaking as well on this PR. Just making sure that we're not introducing a new source of flakiness.

@andreacerulli andreacerulli self-requested a review March 16, 2026 15:04
@basvandijk
Collaborator

basvandijk commented Mar 17, 2026

I was debugging a flaky run of //rs/tests/idx:basic_health_test and didn't immediately realise it was coming from this PR. However, I just noticed my agent had already automatically posted the following comment (let me know if it's just noise and I'll delete it):

Flake analysis: //rs/tests/idx:basic_health_test

I investigated the flake of //rs/tests/idx:basic_health_test on commit 3dd693a (this PR) on 2026-03-17. Here is a detailed analysis and a proposed fix.

Symptoms

  • Attempt 1 (FAILED): The Application subnet never finalized blocks beyond genesis. All 4 nodes were stuck at the genesis height (16299310772172839798, set by with_random_height()). After 504s of retrying, installing a UniversalCanister failed with: "Certificate is stale (over 270s). Is the computer's clock synchronized?"
  • Attempt 2 (PASSED): With fresh VMs, the Application subnet started normally and UC installation succeeded in 2.5s.

Root cause: vault thread pool starvation

The NIDKG code now uses rayon::par_iter() extensively. In the ic-crypto-csp vault server, these operations run on the vault's own Rayon thread pool (built via ThreadPoolBuilder::new() in tarpc_csp_vault_server.rs:645), which has available_parallelism() threads (= 6 on test VMs with 6 vCPUs).

When NIDKG calls par_iter() from a vault thread, Rayon recruits all idle threads in that same pool to do the parallel work. During DKG operations (especially at startup when initial transcripts are loaded), this can consume all 6 vault threads, leaving none available for latency-sensitive operations like signing (needed for consensus notarization and block proposals).

On master (without this PR), NIDKG runs sequentially on 1 vault thread, leaving 5 threads free for signing RPCs. With this PR, NIDKG's par_iter() can temporarily monopolize all 6 threads. If consensus can't get timely signatures from the vault, it stalls — exactly what happened: no blocks finalized beyond genesis.

Why the rayon_worker_threads() fix in main() doesn't help

The "Tone down the Rayon thread pool size" commit configures the global Rayon pool. But vault operations run on the vault's custom Rayon pool, not the global one. When par_iter() is called from a thread that belongs to a custom Rayon pool, Rayon uses that pool, ignoring the global settings. So the rayon_worker_threads() configuration has no effect on vault-dispatched NIDKG work.

Proposed fix

Create a dedicated, smaller Rayon thread pool for NIDKG operations, so they don't starve signing operations on the vault's main pool:

    // In the NIDKG library (e.g., forward_secure.rs or a shared module)
    use std::sync::LazyLock;
    use rayon::{ThreadPool, ThreadPoolBuilder};

    static NIDKG_POOL: LazyLock<ThreadPool> = LazyLock::new(|| {
        let num_threads = std::thread::available_parallelism()
            .map(|n| n.get().max(2) / 2)
            .unwrap_or(2)
            .max(2);

        ThreadPoolBuilder::new()
            .num_threads(num_threads)
            .thread_name(|i| format!("nidkg-rayon-{i}"))
            .build()
            .expect("failed to create NIDKG thread pool")
    });

Then wrap NIDKG entry points with NIDKG_POOL.install(|| { ... }). Code inside install() will have its par_iter() calls use NIDKG_POOL instead of the vault's pool, ensuring that:

  1. NIDKG still benefits from parallelism (3 threads on 6 vCPU VMs)
  2. The vault's main pool always has threads available for signing RPCs
  3. Consensus can't be starved by DKG operations

As a secondary benefit, the rayon_worker_threads() configuration in main() can be removed since it has no effect.

@randombit
Contributor Author

randombit commented Mar 18, 2026

There does seem to be a starvation issue here still. This comment is incoherent though:

> The "Tone down the Rayon thread pool size" commit configures the global Rayon pool. But vault operations run on the vault's custom Rayon pool, not the global one.

There is no "custom" Rayon pool in use, only the global pool - Claude's suggestion is exactly to create such a pool.

Edit: oh fuck, there is such a pool - our generic tarpc RPCs are being dispatched over a Rayon thread pool (since CRP-2263). Apparently I was out the week that change was made and had no idea we did this. Seems wrong to me tbh

Contributor

@fspreiss fspreiss left a comment


Thanks for addressing my comments, @randombit!

@randombit randombit enabled auto-merge April 16, 2026 12:57
@randombit randombit added this pull request to the merge queue Apr 16, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 16, 2026
@randombit randombit added this pull request to the merge queue Apr 16, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 16, 2026
Previously we only installed a thread pool for NIDKG work in
the remote vault. For tests that use an in-replica vault, NIDKG
would end up using the global thread pool, which is by default
sized to the number of cores. This caused massive over-subscription.
This reverts commit 0bee23a.

Shouldn't be needed
@randombit
Contributor Author

randombit commented Apr 17, 2026

@fspreiss Rerequesting review due to the addition of 92110d7, which seems to have addressed the test flakes. The main issue there is that previously, if the vault was in-replica, we did not set up any kind of explicit thread pool, so the Rayon par_iter/etc. went straight to the default global Rayon pool, which is sized by default to the number of cores. In addition, many of the tests (//rs/registry/canister:registry_canister_integration_test_tests seeming a very bad offender here, and a commonly flaking one) are set up such that all of the tests run in parallel (as with usual Rust #[test]s) and each one sets up one or several subnets. This causes both oversubscription (due to Rayon allocating all possible cores just for NIDKG) and also starvation (because the same thread pool is used to satisfy all NIDKG operations across all of the tests, and in some circumstances one test steals all the computational resources from the others, eventually resulting in timeouts).

@randombit randombit requested a review from fspreiss April 17, 2026 21:43
@randombit
Contributor Author

No flakes in the last several merge-master commits, so I think 92110d7 did address that.

@randombit randombit added this pull request to the merge queue Apr 21, 2026
Merged via the queue into master with commit dd1ef70 Apr 21, 2026
37 checks passed
@randombit randombit deleted the jack/crp-2550 branch April 21, 2026 20:04
basvandijk added a commit that referenced this pull request Apr 22, 2026
This reverts commit dd1ef70 since it's
causing massive failures and flakiness during [Kickoff / CI Main / Bazel
Test All
(push)](https://github.com/dfinity/ic/actions/runs/24743705235/job/72389924714)
and [Release
Testing](https://github.com/dfinity/ic/actions/runs/24759898494).
