-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use quinn 0.11.x #1641
Use quinn 0.11.x #1641
Conversation
cd01252
to
657ec2e
Compare
7edaeda
to
9e28f96
Compare
d885456
to
33897f9
Compare
@@ -305,7 +305,7 @@ reqwest-middleware = "0.2.5" | |||
rolling-file = "0.2.0" | |||
rpassword = "7.3" | |||
rustc_version = "0.4" | |||
rustls = { version = "0.21.12", default-features = false, features = ["quic"] } | |||
rustls = { version = "0.23.9", default-features = false } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Required by the new quinn lib version
9d2e8b0
to
6443eda
Compare
let mut config = rustls::ClientConfig::builder() | ||
.with_safe_defaults() | ||
.dangerous() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to the rustls 0.23.12 change. It made the unsafe interfaces more explicit and put under "dangerous"
@@ -564,7 +589,7 @@ async fn send_request( | |||
const READ_TIMEOUT_DURATION: Duration = Duration::from_secs(10); | |||
let (mut send_stream, mut recv_stream) = connection.open_bi().await?; | |||
send_stream.write_all(&bytes).await?; | |||
send_stream.finish().await?; | |||
send_stream.finish()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because finish is made a synchronous call in Quinn 0.11.x.
Error::WriteError(WriteError::ZeroRttRejected) => { | ||
add_metric!(stats.write_error_zero_rtt_rejected) | ||
} | ||
Error::ConnectError(ConnectError::CidsExhausted) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to new error code introduced in Quinn 0.11.x
@@ -3032,6 +3061,26 @@ version = "1.0.9" | |||
source = "registry+https://github.com/rust-lang/crates.io-index" | |||
checksum = "af150ab688ff2122fcef229be89cb50dd66af9e01a4ff320cc137eecc9bacc38" | |||
|
|||
[[package]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cascading changes are all because of Quinn version update which requires rustls update.
@@ -1274,6 +1274,12 @@ dependencies = [ | |||
"libc", | |||
] | |||
|
|||
[[package]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cascading changes due to Quinn version update which triggers rustls version update.
@@ -924,6 +924,12 @@ dependencies = [ | |||
"libc", | |||
] | |||
|
|||
[[package]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cascading changes due to rustls version update.
programs/sbf/Cargo.lock
Outdated
@@ -4458,9 +4576,9 @@ dependencies = [ | |||
|
|||
[[package]] | |||
name = "slab" | |||
version = "0.4.2" | |||
version = "0.4.6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new Quinn 0.11.x requires slab 0.4.6 or later to work
programs/sbf/Cargo.lock
Outdated
"sct", | ||
] | ||
|
||
[[package]] | ||
name = "rustls" | ||
version = "0.23.9" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why rustls version here is 0.23.9 while it is 0.23.12 in the main Cargo.lock?
Arc::new(Self) | ||
} | ||
} | ||
|
||
impl rustls::client::ServerCertVerifier for SkipServerVerification { | ||
impl rustls::client::danger::ServerCertVerifier for SkipServerVerification { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
due to rustls interface change. it is moved under "danger"
@@ -69,6 +69,7 @@ anchor() { | |||
patch_crates_io_solana Cargo.toml "$solana_dir" | |||
patch_spl_crates . Cargo.toml "$spl_dir" | |||
|
|||
sed -i '/\[patch.crates-io\]/a curve25519-dalek = { git = "https://github.com/anza-xyz/curve25519-dalek.git", rev = "b500cdc2a920cd5bff9e2dd974d7b97349d61464" }' ./Cargo.toml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required to build downstream to ensure Anza's curve25519-dalek is used
|
||
let mut server_config = ServerConfig::with_crypto(Arc::new(server_tls_config)); | ||
server_config.concurrent_connections(max_concurrent_connections as u32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
concurrent_connections configuration has been removed from the Quinn ServerConfig. It can be enforced by the application layer now. To be improved in the following PR.
@@ -91,12 +96,14 @@ pub fn new_dummy_x509_certificate(keypair: &Keypair) -> (rustls::Certificate, ru | |||
]); | |||
|
|||
( | |||
rustls::Certificate(cert_der), | |||
rustls::PrivateKey(key_pkcs8_der), | |||
rustls::pki_types::CertificateDer::from(cert_der), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
due to rustls version update
streamer/src/nonblocking/quic.rs
Outdated
@@ -331,6 +331,7 @@ async fn run_server( | |||
stats | |||
.connection_rate_limited_across_all | |||
.fetch_add(1, Ordering::Relaxed); | |||
connection.ignore(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use connection.ignore which can more efficiently drop connection without notifying the clients.
e5169f2
to
25c5322
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this change be broken to any smaller parts?
For example, there are multiple dependencies updated zeroize
, rustls
, curve25519-dalek
, etc. Can any of these be updated separately?
Also new error types (and their respective metrics) in {repair,turbine}/quic_endpoint
can be ignore for now, and added in a followup PR. (Or I can add them myself if you prefer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just leaving some thoughts here. I know the PR isn't ready for review yet but maybe some of it is useful. :)
config | ||
.transport_config(Arc::new(new_transport_config())) | ||
.use_retry(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why disable retry? (Retries are cheap and don't require any X25519 or Ed25519 cryptography)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use_retry is removed from the Config, it is now implemented by the application layer.
The curve25519-dalek change and zeroize changes are due to rustls change. Unfortunately -- the rustl version update has to be done with the new Quinn version update. The older Quinn (0.11.x) does not work with the newer rustls: 0.23.12. I have annotated the changes in the PR explaining why they were done. |
7cb7c7f
to
3669f10
Compare
e7c65b0
to
040b434
Compare
use quinn 0.11.x Fixed additional comp errors Try change zeroize version zeroize depdency issues Fixed slab dependency issue revert change to zk-sdk/Cargo.toml revert change to sdk/program/Cargo.toml format cargo.toml Fixed unit tests due to rustls change and stream finish interface change Updated changes downstream test fix on curve25519-dalek Try patching curve25519-dalek for anchor tests Use anza-xyz for curve25519-dalek Fixed achor down stream tests use anza's curve25519-dalek regenerate Cargo.lock files Ignore excessive connections use debug version of quinn use debug version of quinn in Endpoint::accept use b8d9a8762deb848a17ccc1651cdb09519d226fb3 -- reduce noise on logging of quinn use 9a2a51efddc8000482f74b15be09d8b2cd58ee5e use crude filtering local address to check if the rate limiting itself is impacting performance of polling socket use governor rate limiter removed naive rate limter implementations fix test failures use release quinn 0.11.x removed local git dependency for quinn update variable names to show incoming, connecting rebuild with the latest quinn 0.11.x release Fix slab version to 0.4.9 Changed SkipServerVerification and SkipClientVerification to use default CryptoProvider to verify signatures use retry disable retry logic to investigate a panic in dropping Endpoint in local cluster test reenable retry to investigate the panic Updated quinn to fix the stateless retry connection corruption Fixed cargo.lock in programs/sbf downstream dalekcurve25519-dalek comp error fmt issues fmt issues
040b434
to
da46233
Compare
@@ -2032,7 +2048,7 @@ version = "1.0.1" | |||
source = "registry+https://github.com/rust-lang/crates.io-index" | |||
checksum = "c762bae6dcaf24c4c84667b8579785430908723d5c889f469d76a41d59cc7a9d" | |||
dependencies = [ | |||
"curve25519-dalek 3.2.1", | |||
"curve25519-dalek 3.2.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of zeroize dependency. curve25519-dalek 3.2.1 requires zeroize < 1.4 which 3.2.0 does not requires. downgrade done by cargo
Test results show with 0.11.x the UDP kernel queue length is more quickly processed and better bench-tps results than the results of maser based off Quinn 0.10.x Old vs Quinn 0.11.x New [2024-09-10T16:53:47.476072852Z INFO solana_bench_tps::bench] ---------------------+---------------+-------------------- [2024-09-10T17:16:34.543049632Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 74028.56 | 1073732 [2024-09-10T17:19:59.476841034Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions Under QUIC connection load tool: UDP port queue length UNCONN 0 0 0.0.0.0:8009 0.0.0.0:* Bench-tps results: [2024-09-10T17:25:17.685245874Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions [2024-09-10T17:31:46.298186698Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 76908.80 | 607075 [2024-09-10T17:38:07.613480246Z INFO solana_bench_tps::bench] Token balance: 33858885150 Before: (98c8853) Without QUIC connection load test: [2024-09-10T18:05:20.898492879Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions [2024-09-10T19:04:58.595399026Z INFO solana_bench_tps::bench] Node address | Max TPS | Total Transactions [2024-09-10T19:24:53.173200122Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 66418.59 | 1113740 The queue under attack: see persistent high Recv-Q length State Recv-Q Send-Q Local Address:Port Peer Address:Port Process Under attack bench-tps results; [2024-09-10T19:38:19.596617604Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 40387.21 | 331225 [2024-09-10T19:42:07.547202338Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 55878.07 | 487310 [2024-09-10T19:45:23.950704693Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 53402.19 | 413426 [2024-09-10T20:09:48.869452792Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 50056.67 | 489929 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change lgtm,
but please let @ripatel-fd to also do a review.
core/src/repair/quic_endpoint.rs
Outdated
} | ||
Err(error) => { | ||
debug!( | ||
"Error while accepting incoming connection: {error:?} from {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
s2.finish() | ||
.await | ||
.expect_err("shouldn't be able to open 2 connections"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Quinn 0.11.x finish is made synchronous, it does no longer return error under this condition.
@@ -1712,7 +1720,6 @@ pub mod test { | |||
// Test that more writes to the stream will fail (i.e. the stream is no longer writable | |||
// after the timeouts) | |||
assert!(s1.write_all(&[0u8]).await.is_err()); | |||
assert!(s1.finish().await.is_err()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finish is made a sync function. it is no longer returning error in this in 0.11.x under this case.
Also it would be great if you can please run a staked node on testnet and confirm that both the client and the server are compatible. |
I have tested the different server/client combination: server running the code with this branch and client of the master (without Quinn and rustls update) and server with master and client built from this branch. They all work fine running bench-tps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read through this PR, here are my thoughts.
There is one high priority issue.
- The list of allowed signature verification algorithms for rustls is unrestricted. These should be limited to Ed25519 before merging. I will send a PR to this PR with the fix. EDIT: To clarify, this is the same before and after the PR (not a regression).
Other than that, just some general observations:
- The new cert verification policy otherwise seems fine. SkipServerVerification and SkipClientVerification accept any certificate but still verify the signature regardless of what the pubkey is. (So the pubkey can be trusted later on)
- In the future, it would be nice to replace the cert verify algorithm with ed25519_dalek. But that's probably out of scope for this version bump PR
- QUIC retries were disabled. Is this intentional?
- verify_tls12_signature should always return an error. All TPU peers today support TLS 1.3. Any TLS 1.2 logic is dead code.
- Using AtomicU64 for high-frequency metrics could cause excessive cache traffic when these counters get updated from multiple CPUs. (But probably out of scope for this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, @alessandrod could you also take a look?
Confirmed that fd_quic is compatible with this change (both client and server). The throughput in my simple "Agave client => fd_quic server" spam test over localhost decreased from ~80-120k TPS (Agave 2.0.4) to 70k-85k TPS (this branch). There is definitely a performance regression somewhere. I haven't bisected whether the regression is caused by this PR, something else, or some weird interaction that's fd_quic's fault. This is the test code: const BUF: [u8; 1232] = [0u8; 1232];
let conn_cache = ConnectionCache::new_quic("test", 16);
let conn = conn_cache.get_connection(&SocketAddr::new(
IpAddr::V4(Ipv4Addr::new(127, 0, 0, 1)),
listen_port,
));
let mut batch = Vec::<Vec<u8>>::with_capacity(1024);
let mut rng = rand::thread_rng();
loop {
let cnt: usize = rng.gen_range(1..batch.capacity());
batch.clear();
for _ in 0..cnt {
batch.push(BUF[0..rng.gen_range(1..BUF.len())].to_vec());
}
if let Err(err) = conn.send_data_batch(&batch) {
eprintln!("{:?}", err);
}
} |
Some results of bench TPS: New Server: Quinn 0.11.x based server (this branch) Old Server: Quinn 0.10.x based server (2f8f910) Old Server/ New Client [2024-09-13T04:09:54.104627665Z INFO solana_bench_tps::bench] Token balance: 28871270150 [2024-09-13T04:11:32.688330444Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 10110.44 | 80240 [2024-09-13T04:12:50.993465813Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 11216.83 | 81288 [2024-09-13T04:17:05.360114465Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 7769.16 | 78252 Old Server/Old Client [2024-09-13T04:14:04.560341893Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 14687.67 | 101264 [2024-09-13T04:15:11.696875557Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 8072.19 | 100284 [2024-09-13T04:16:10.595767307Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=29168640150i [2024-09-13T04:18:34.354558202Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 13677.37 | 104204 New Server/New Client [2024-09-13T16:59:48.232051611Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 23692.68 | 139309 [2024-09-13T17:01:12.933942408Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 76041.14 | 269543 [2024-09-13T17:45:48.114628190Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 40676.90 | 165451 [2024-09-13T17:53:34.316832096Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 60015.69 | 248985 [2024-09-13T18:54:22.329161126Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 33567.08 | 166174 [2024-09-13T18:54:22.329161126Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 33567.08 | 166174 New Server/Old Client [2024-09-13T18:59:06.846784990Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 38924.14 | 150891 [2024-09-13T19:07:35.783540809Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 33126.60 | 226337 [2024-09-13T19:21:07.516963576Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 46994.12 | 271263 [2024-09-13T19:29:15.004030839Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 59456.17 | 330006 [2024-09-13T19:30:06.065923915Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 39646.18 | 255931 I am actually seeing better performance results with server running with this branch. Interestingly on the bench TPS results to the old server, the old client seems to have better results than the new client. |
I had a staked node running with this branch against test net, it shows connections working in both way: |
@@ -331,6 +331,7 @@ async fn run_server( | |||
stats | |||
.connection_rate_limited_across_all | |||
.fetch_add(1, Ordering::Relaxed); | |||
incoming.ignore(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure that this should be ignore
and not refuse
? Won't ignore trigger
an automatic retry from the client making things worse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. I was consciously using ignore feature for two reasons: 1. it is less costly for the server to do so -- we do not have to schedule outgoing reply packets when it is already under load -- we have in the past excessive outgoing packets for under initial packets attacks. 2; the timeout will hopefully make the client taking longer time to do the retry -- it need to wait for the timeout before another attack. I think a non-collaborative client could do that regardless we do refuse or silently drop it. Maybe we could improve the logic in follow on PR to initially do the refuse and when it keep violating the limit we could just ignore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an issue because it makes well behaved clients retry, as to them this is effectively packet loss. The default packet loss timeout in quinn iirc is 1.3 * RTT. But I'm ok fixing this in a follow up!
@@ -349,26 +350,35 @@ async fn run_server( | |||
stats | |||
.connection_rate_limited_per_ipaddr | |||
.fetch_add(1, Ordering::Relaxed); | |||
incoming.ignore(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
Problem
Update Quinn 0.11.x release. There is important improvement in Quinn 0.11.x which enables the application layer to filter QUIC connections earlier before crypto handshaking efficiently. This is proved by running the integration against a connection load test tool which shows it can much more effectively shedding the spamming connection loads.
Summary of Changes
The change is mostly due to the Quinn 0.11.x requires newer rustls version 0.23.12 which make it more explicit about using "dangerous" interfaces.
I have annotated the PR with comments explaining why the change is done.
Tests:
bench-tps with quic connection load tool.
Fixes #