@pierugo-dfinity
Contributor

@pierugo-dfinity pierugo-dfinity commented Jan 27, 2026

`nr_large` is still flaky; the last fix was not enough. Rebuilding the socket from scratch after a timeout appears to have been the correct approach (see #6823), so I reverted most of the changes from the last fix and instead wrapped the blocking operation in `spawn_blocking`. I suspect the test became flaky because some tasks started to block others.

@github-actions github-actions bot added the fix label Jan 27, 2026
@pierugo-dfinity
Contributor Author

I think this solution should be reliable as long as there are enough blocking threads, i.e. more blocking threads than nodes to connect to in parallel. However, looking at the definition of `block_on`, I see that we use a maximum of 16 blocking threads, even though the default is 512 and the docs explicitly say not to set this number too low. Is there a reason why we chose such a small number?

@pierugo-dfinity pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets and uploads them to S3 label Jan 27, 2026
@pierugo-dfinity pierugo-dfinity marked this pull request as ready for review January 27, 2026 16:55
@pierugo-dfinity pierugo-dfinity requested a review from a team as a code owner January 27, 2026 16:55
@github-actions github-actions bot added the @idx label Jan 27, 2026
Collaborator

@basvandijk basvandijk left a comment

> Though, when looking at the definition of block_on, I see that we use a maximum of 16 blocking threads, even though the default is 512 and the docs explicitly say not to set this number too low. Is there a reason why we chose such a small number?

I don't know but I see it was introduced in 82380b3.

@pierugo-dfinity
Contributor Author

> it was introduced in 82380b3

Thank you. Do you think it could be acceptable to bump this number to something like 64 or 128? Again, the default is 512.

@pierugo-dfinity
Contributor Author

> Thank you. Do you think it could be acceptable to bump this number to something like 64 or 128? Again, the default is 512.

Done in a separate PR

@pierugo-dfinity pierugo-dfinity added this pull request to the merge queue Jan 29, 2026
Merged via the queue into master with commit 96cc26d Jan 29, 2026
40 checks passed
@pierugo-dfinity pierugo-dfinity deleted the pierugo/nr_large/spawn_blocking_sync_sockets branch January 29, 2026 09:12
github-merge-queue bot pushed a commit that referenced this pull request Jan 29, 2026
As mentioned in #8547 (comment), the `tokio`
[docs](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads:~:text=It%E2%80%99s%20recommended%20to,.)
explicitly say not to set the `max_blocking_threads` of a Runtime too
low. So let's just use the default of 512. Considering that these
threads are not always active it's fine to have more of them in case
they're needed.
github-merge-queue bot pushed a commit that referenced this pull request Jan 29, 2026
…8595)

Similarly to #8547, this PR spawns the execution of the bash script
itself as a blocking task since it also uses blocking I/O.

Unfortunately, I could not simply write
```rust
tokio::task::spawn_blocking(move || self.block_on_bash_script_from_session(&session, &script))
    .await
    .expect("Executing bash script task panicked")
```
because I would also have had to clone `self`, which is not `Clone`.
Removing the unused `self` parameter from
`block_on_bash_script_from_session` would not be practical either, as we
would have to adapt every call site.