Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection migration support #1223

Merged
merged 1 commit into from Jun 21, 2022

Conversation

qdeconinck
Copy link
Contributor

This PR provides provides connection migration support to quiche. It is composed of 3 incremental parts (the current 3 commits), starting from CID handling support, then path validation and finally connection migration. See the commit messages for further details about the changes introduced by each of them.

Note that while most of the changes are additive, there are a few breaking API changes introduced by the path validation support. Namely,

  • RecvInfo now contains the to field, indicating the local host address on which the packet was received.
  • SendInfo now contains the from field, indicating from which local host address the packet should be sent.
  • Both connect and accept now takes an owned ConnectionId for the source CID and an additional SocketAddr parameter corresponding to the (initial) local host address that is used by the connection.

@qdeconinck qdeconinck requested a review from a team as a code owner May 4, 2022 08:52
@normanmaurer
Copy link
Contributor

@qdeconinck can you check the test-failure ?

@qdeconinck
Copy link
Contributor Author

The integration indicates a failure with tests::reset_stream_data_recvdwith the target quiche_multiarch (armv7-unknown-linux-gnueabihf). However, I'm not able to reproduce it.

qdeconinck@server:~/git/quiche$ cross test --target=armv7-unknown-linux-gnueabihf --verbose --all-targets --features=ffi,qlog
running 20 tests
test tests::get_bytes ... ok
test tests::get_bytes_mut ... ok
test tests::get_u ... ok
test tests::get_u_mut ... ok
test tests::get_varint ... ok
test tests::get_varint_mut ... ok
test tests::peek_bytes ... ok
test tests::peek_bytes_mut ... ok
test tests::peek_u ... ok
test tests::peek_u_mut ... ok
test tests::put_bytes ... ok
test tests::put_u ... ok
test tests::put_varint ... ok
test tests::slice ... ok
test tests::slice_last ... ok
test tests::slice_last_mut ... ok
test tests::slice_mut ... ok
test tests::split ... ok
test tests::split_at ... ok
test tests::varint_too_large - should panic ... ok

test result: ok. 20 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.03s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/deps/qlog-8e3d1a131b8c1e3a`

running 7 tests
test events::quic::tests::packet_header ... ok
test streamer::tests::serialization_states ... ok
test tests::packet_sent_event_no_frames ... ok
test tests::packet_sent_event_some_frames ... ok
test tests::trace_no_events ... ok
test tests::trace_seq_no_events ... ok
test tests::trace_single_transport_event ... ok

test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.07s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/deps/quiche-7f38ad5740b58ee6`

running 388 tests
[...]
test result: ok. 388 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 82.75s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/examples/client-4fdb7e1e89d4d433`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/examples/http3_client-5b3b46fc488749cb`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/examples/http3_server-4fca75b077c783ca`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/examples/qpack_decode-f95ee9ee3a677544`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/examples/qpack_encode-41eade3245482513`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/examples/server-cd750cae0dd058b8`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/deps/quiche_apps-53201971cdab96a0`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/deps/quiche_client-f5857d3a376ae482`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running `/linux-runner armv7 /target/armv7-unknown-linux-gnueabihf/debug/deps/quiche_server-807fbaebbba3012c`

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

@qdeconinck
Copy link
Contributor Author

After investigation, it seems that the failing test was due to the slowness of the quiche_multiarch containers, sometimes increasing the value of the ACK delay above 63 and thus increasing by one byte the size of the packet. After adapting the test, all the checks should now pass.

apps/src/args.rs Outdated
@@ -68,6 +70,8 @@ pub struct CommonArgs {
/// --dgram-proto PROTO DATAGRAM application protocol.
/// --dgram-count COUNT Number of DATAGRAMs to send.
/// --dgram-data DATA DATAGRAM data to send.
/// --max-active-cids NUM Maximum number of active Connection IDs.
/// --migrate Enable active connection migration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of this flag name since it can be overloaded. How about --enable-active-migration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming is complicated :) but yes, --enable-active-migration actually better describes what this option does. Will update that and the following occurrences.

apps/src/args.rs Outdated
@@ -216,6 +229,9 @@ Options:
--no-grease Don't send GREASE.
--cc-algorithm NAME Specify which congestion control algorithm to use [default: cubic].
--disable-hystart Disable HyStart++.
--max-active-cids NUM The maximum number of active Connection IDs we can support [default: 2].
--migrate Enable active connection migration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here regarding the name.

apps/src/args.rs Outdated
@@ -360,6 +381,8 @@ Options:
--dgram-data DATA Data to send for certain types of DATAGRAM application protocol [default: brrr].
--cc-algorithm NAME Specify which congestion control algorithm to use [default: cubic].
--disable-hystart Disable HyStart++.
--max-active-cids NUM The maximum number of active Connection IDs we can support [default: 2].
--migrate Enable active connection migration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here :-)

.conn
.probe_path(local_addr, peer_addr)
.expect("cannot probe");
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: house style prefers blank line between match cases. Given this is a bit of a sizeable block of code, please consider putting it in a method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated that, also integrated the trace_id of the client's connection for each received event.


Err(e) => {
// There are no more UDP packets to read, so end the read
// loop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment true? Isn't there a possibility that there are other packets to read from other sockets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Indeed, we should instead iterate over the events and loop on a given socket until there is nothing more to read on that one. Will update the code.

// There are no more UDP packets to read, so end the read
// loop.
if e.kind() == std::io::ErrorKind::WouldBlock {
trace!("recv() would block");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably want to add detail to this line to say which socket would block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I added in the log the local address on which the socket is bound.


trace!("got {} bytes", len);
trace!("got {} bytes", len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from experience, it helps to be as descriptive in these lines as possible. So like above let's consider if there is any way for the log line to discriminate exactly what received the bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did the same as above.

while let Ok(qe) = conn.poll_path() {
match qe {
quiche::PathEvent::New(..) => {
panic!("Should not see new path");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate this is an semantic error but it would be great to avoid the explicit panic. This makes me wonder, does the library event return such an event?

Perhaps a clearer thing to do would be to use unreachable!() ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a client-side connection, it should not. I replaced that to unreachable!().

@@ -78,14 +78,24 @@ pub fn connect(
std::net::SocketAddr::V6(_) => "[::]:0",
};

let mut sockets =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are only going to use at most 2 sockets in the migration, I think it would be better to make that explicit (i.e. just name the objects) rather than have a potentially unbounded vector that requires magic indexes, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no objection doing explicit naming instead of the current sockets solution. My original motivation to do so was to handle the variable number of sockets with slightly less code. I now have socket and migrate_socket, the second being an Option depending whether the application wants to perform migration or not.

@@ -446,3 +553,16 @@ pub fn connect(

Ok(())
}

/// Generate a new pair of Source Connection ID and reset token.
fn generate_cid_and_reset_token<T: SecureRandom>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two of these and they seem identical?

If so, please put common code in apps/src/common.rs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, moved the duplicate codes into common.rs

@@ -89,6 +89,8 @@ pub struct Client {

pub http_conn: Option<Box<dyn HttpConn>>,

pub id: u64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more descriptive term than just id that we could use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe client_id? It is used as an intermediate identifier between Connection IDs and actual client's connections.

@@ -520,6 +621,12 @@ typedef struct {

// DATAGRAM frame extension parameter, if any.
ssize_t peer_max_datagram_frame_size;

// The stats of the connection's paths.
quiche_path_stats paths[8];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does 8 come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually a magic number to ensure the structure can hold up to 8 path's statistics. I could instead put a simple pointer along with a length, but it is unclear to me whether the caller should allocate the required space or the ffi should instead do this.

@qdeconinck qdeconinck force-pushed the connection-migration branch 4 times, most recently from 09438b7 to 21ce1c7 Compare May 25, 2022 06:18
@ghedo
Copy link
Member

ghedo commented Jun 10, 2022

I pushed a commit that does a bit of refactoring and clean-up. For the most part there should be no functional change, however I did rename or otherwise change some of the new public APIs as follows:

  • Changed back accept() and connect() to take a &ConnectionId instead of ConnectionId<'static>.

  • Removed Config::enable_events(). Rationale being that an application is probably going to be broken without events as there's no way to actually disable migration in general.

  • Renamed path_send() and path_send_quantum() to send_on_path() and * send_quantum_on_path().

  • Renamed poll_path() to path_event_next() and changed it to return Option<> instead of Result<>. In retrospect calling the HTTP/3 method "poll()" wasn't such a good idea due to clash and potential confusion with the standard Future::poll() method, so we might have to change it in the future.

  • Renamed retired_scids() to retired_scid_next() and changed it to return a single ConnectionId wrapped in Option<>. This way we don't have to pre-allocate a whole vector of CIDs and we can also remove the custom iterator type.

  • Renamed sendable_paths() to paths_iter().

  • Renamed spare_scid_spots() to source_cids_left().

  • Removed FFI API for now as it was incomplete and there were some issues. We'll add it back at a later time.

(I might have missed something in the list above)

quiche/src/lib.rs Outdated Show resolved Hide resolved
@ghedo ghedo force-pushed the connection-migration branch 3 times, most recently from 92b3d16 to 3f96d4d Compare June 16, 2022 12:08
@qdeconinck
Copy link
Contributor Author

The proposed refactoring looks good to me.

This commit adds connection migration support, as well as related
features such as support for changing connection IDs and tracking
multiple network paths.

Network path changes are exposed to applications through the
`path_event_next()` API, and migration is done either passively as a
server (raising a `PathEvent::PeerMigrated` event to the app), or
actively as a client through the `migrate()` API.

Note that after a migration, the connection might lose connectivity,
e.g. because the client uses a non-working 4-tuple or because there was
some form of address spoofing attack seen as a migration by the server.
quiche includes a fall-back support when the newly migrated path is
detected as non-working.

This required introducing full 4-tuple awareness to quiche, i.e. both
the peer and local addresses (while before only the peer's address was
tracked), which required changing existing public APIs, such as
`SendInfo`, `RecvInfo`, `connect()` and `accept()`, but applications can
be easily updated as the local address can be typically retrieved using
`socket.local_addr().unwrap()`. In addition, this adds APIs for
considering only a specific 4-tuple using the `send_on_path()` method
(implementers might also be interested in the `paths_iter()` method).

This also adds tracking of multiple network paths, via the `Path`
structure (located in its dedicated `path` module). Note that the
`Recovery` structure is now relative to a path and not on the whole
connection anymore. This enables the stack to easily maintain sent
packets, RTT estimates, congestion control state,... on each existing
path. All these network paths are handled by a dedicated structure, the
`PathMap`, also located in the `path` module.

The stack is now able to probe network paths in order to validate them
through PATH_CHALLENGE/PATH_RESPONSE exchange. Such challenge process
can be triggered by the application thanks to the `probe_path()` method.
While the client can probe new 4-tuples, the server first needs to see a
peer's packet originating from the 4-tuple before probing it. Several
new `PathEvent` events returned by the `path_event_next()` method (`New`,
`Validated`, `FailedValidation`,...) enables the application to handle
such cases.

Finally, support for Connection ID changes was also added. A connection
is not bound to a single SCID-DCID pair and can now advertise and use
other Connection IDs through NEW_CONNECTION_ID and RETIRE_CONNECTION_ID
frames. From an API viewpoint, a host can indicate in its QUIC config
the maximum number of active connection IDs it can support
(`set_active_connection_id_limit()`). Then, it can actively provide
Connection IDs to its peer (`new_source_cid()`) and retire peer-provided
Connection IDs (`retire_destination_cid()`). Retiring a source
Connection ID are monitored through the `retired_scid_next()` method.
Connection IDs are handled by a dedicated structure,
`ConnectionIdentifiers`, located in the newly introduced `cid` module.
@ghedo ghedo merged commit 3e719a3 into cloudflare:master Jun 21, 2022
@ghedo
Copy link
Member

ghedo commented Jun 21, 2022

Merged 🎉, thanks @qdeconinck for all your work and @ehaydenr for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants