Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: alpn upstream #13922

Merged
merged 29 commits into from Dec 15, 2020
Merged

http: alpn upstream #13922

merged 29 commits into from Dec 15, 2020

Conversation

alyssawilk
Copy link
Contributor

Allows configuring both HTTP/1 and HTTP/2 options for a cluster, which triggers TLS alpn negotiation.

Risk Level: Medium (a bunch of little refactors)
Testing: integration tests. NEEDS UNIT TESTS
Docs Changes: n.a
Release Notes: inline
Fixes #3431

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy[\w/]*/(v1alpha\d?|v1|v2alpha\d?|v2))|(api/envoy/type/(matcher/)?\w+.proto).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #13922 was opened by alyssawilk.

see: more, trace.

@alyssawilk
Copy link
Contributor Author

Unit tests are expected to fail - there's some mock clean up around timers and connection.connected() to do, but it's in pretty good shape stability-wise (beyond the new integration test, I've run the full http2_upstream_integration_test and http_integration_test with alpn forced true and they pass)

I wanted to get include-API thumbs up before I spent too much time reordering mocks and writing the new nuit tests.

@mattklein123 mattklein123 self-assigned this Nov 6, 2020
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk alyssawilk marked this pull request as ready for review November 9, 2020 15:10
@alyssawilk
Copy link
Contributor Author

Whoops, meant to cc @mattklein123 and @snowp last week. Still got 1 of 2, not bad :-P

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@yanavlasov yanavlasov self-assigned this Nov 10, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this long requested feature. I have some high level API/usability comments to get started. I have major regrets about how upstream h2 is configured and I fear this is only going to dig the whole deeper so want to discuss if there are other options we should consider.

/wait-any

api/envoy/config/cluster/v3/cluster.proto Outdated Show resolved Hide resolved
// configured, Envoy will attempt to do ALPN negotiation for TLS connections, failing
// over to HTTP/1.1 if ALPN negotiation fails.
// If only one protocol option is present it will be used as the hard-coded
// protocol. If neither is present, HTTP/1.1 will be used.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat orthogonal, while talking config though, I was of the mind to refuse to allow the new config unless TLS was explicitly configured - you can't do ALPN without TLS and given the ALPN pool "needs" to fail over to HTTP/1 I think it'd be easy to accidentally configure ALPN, forget TLS, and get locked into HTTP/1

We can't require TLS though, because there's other ALPN (say ALTS ALPN), which we use internally.
I was thinking we could make transport sockets register if they do ALPN, and reject config which enables H1/H2 without ALPN. That doesn't extend well to HTTP/3 which AFIK requires TLS/ALTS but doesn't actually do ALPN.
Worst case we could just comment a warning, and increment a stat of ALPN fails, but I'm wondering if you have ideas to make borked configs more obvious here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above. This is what I was going to suggest and I think we should do this.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this LGTM at a high level. It's going to need a ton of fixes around docs. We need to be super clear on how to configure things for the average user, so we will need to update all the sandboxes, probably ref link from the typed_protocol_config field, etc.

But if you are up for all of this, I think this will be a massive improvement. Thank you!

/wait


config.core.v3.UpstreamHttpProtocolOptions upstream_http_protocol_options = 2;

oneof upstream_protocol_options {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one can have a required validation.

@alyssawilk
Copy link
Contributor Author

Ok, I'll go beat the config loading into shape, and see how much of the config the deprecated CI gets, then ping for another pass. So waiting w.r.t Matt, but I would appreciate Snow's toughts w.r.t. alpn decoration. I'm not sure of the motivation for explicitly configured transport ALPN vs cluster failover ALPN so I'm not sure how to address Matt's question there.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Copy link
Contributor Author

new conflicts are just two build file additions and two include additions. Will push another update but it should have no impact on tests.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool. Flushing out some comments to get started. Thank you!

/wait

// protocol is negotiated by ALPN with the upstream.
// Clusters configured with *AutoHttpConfig* will use the highest available
// protocol; HTTP/2 if supported, otherwise HTTP/1.
// If the upstream does not support ALPN, *AutoHttpConfig* will will fail over to HTTP/1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If the upstream does not support ALPN, *AutoHttpConfig* will will fail over to HTTP/1.
// If the upstream does not support ALPN, *AutoHttpConfig* will fail over to HTTP/1.

@@ -64,6 +64,7 @@ New Features
* health_check: added option to use :ref:`no_traffic_healthy_interval <envoy_v3_api_field_config.core.v3.HealthCheck.no_traffic_healthy_interval>` which allows a different no traffic interval when the host is healthy.
* http: added HCM :ref:`timeout config field <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.request_headers_timeout>` to control how long a downstream has to finish sending headers before the stream is cancelled.
* http: added frame flood and abuse checks to the upstream HTTP/2 codec. This check is off by default and can be enabled by setting the `envoy.reloadable_features.upstream_http2_flood_checks` runtime key to true.
* http: alpn is now supported upstream, configurable via `alpn_config <envoy_v3_api_field_extensions.upstreams.http.v3.HttpProtocolOptions.alpn_config>` in the :ref:`http_protocol_options <envoy_v3_api_msg_extensions.upstreams.http.v3.HttpProtocolOptions>` message.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: alpn was already supported upstream, just manually configured. Rephrase slightly? Also, I think the field names here are out of date so I think you are also missing a :ref.

@@ -711,6 +711,10 @@ class ClusterInfo {
static const uint64_t USE_DOWNSTREAM_PROTOCOL = 0x2;
// Whether connections should be immediately closed upon health failure.
static const uint64_t CLOSE_CONNECTIONS_ON_HOST_HEALTH_FAILURE = 0x4;
// If HTTP2 is true, the upstream protocol will be negotiated using ALPN.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this true? Don't we still support H2 with prior knowledge explicitly configured?

source/common/conn_pool/conn_pool_base.h Show resolved Hide resolved
source/common/http/codec_client.cc Show resolved Hide resolved
Comment on lines 27 to 28
// When we upgrade from a TCP client to non-TCP we get a spurious onConnected
// from the new client. Ignore that.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems broken. TODO somewhere to not raise onConnected() multiple times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens here is that the client is subscribed to network callbacks. When under the call stack of the network raising the connection event, we detach the TCP client from the network callbacks, and attach the HTTP client.

I don't think it's safe to delay subscribing the HTTP client to the network callbacks until not under the stack of raising the connected event. We could avoid the detach and reattach by having a shim event handler in the connection class which originally pass onEvent to the TCP client, and is std::moved over to pass onEvent to the HTTP client. But the problem with that is that for TCP, the connection pool client subscribes directly to the network's callbacks. For HTTP, the connection pool client asks the codec to add it to network callbacks, and doesn't do so to the connection directly. Today that's the same code, but if we decide to do custom event work in the codec client, we could end up introducing some really weird bugs.

Long story short I actually think this is cleaner than any other option. If I can sell you on that I'd be happy to add inline comments on why it's safer than other options. Otherwise I'd prefer nailing down something which isn't uglier before I add a TODO :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry are you saying that because we are under the same call stack the event gets added and then it gets iterated to and called? I'm surprised that actually works but I forget what data structure is used. I believe you that this is the cleanest way so yeah more comments would be good.

// If an old TLS stack does not negotiate alpn, it likely does not support
// HTTP/2. Fail over to HTTP/1.
protocol_ = Protocol::Http11;
auto tcp_client = static_cast<Tcp::ActiveTcpClient*>(&client);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there by a dynamic_cast ASSERT here? Is there any way to avoid the effective dynamic_cast? Seems like this could be an interface function of some type?

source/common/http/mixed_conn_pool.cc Show resolved Hide resolved
!raw_factory_pointer->supportsAlpn()) {
throw EnvoyException(
fmt::format("ALPN configured for a cluster which has a non-ALPN transport socket: {}",
cluster.DebugString()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would probably print the cluster name here vs. debug string but up to you.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@alyssawilk
Copy link
Contributor Author

PTAL?

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very cool. Some random comments but generally LGTM.

/wait

include/envoy/upstream/upstream.h Outdated Show resolved Hide resolved
source/common/conn_pool/conn_pool_base.cc Outdated Show resolved Hide resolved
Http::Protocol protocol() { return protocol_; }

private:
bool connected_{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used? Do you maybe mean to guard re-checking protocol if we previously had a connection? If so should this be saw_first_connection_ or something like that?

// If an old TLS stack does not negotiate alpn, it likely does not support
// HTTP/2. Fail over to HTTP/1.
protocol_ = Protocol::Http11;
auto tcp_client = dynamic_cast<Tcp::ActiveTcpClient*>(&client);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For perf do you want to static cast this and then assert the dynamic cast? Also, I think I asked this before, but is it possible to not have casts here by just having an interface function that returns nextProtocol()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so - we still need to move the connection to the new class. We could have an interface function for "detach and return the network::connection" if you prefer that to the cast?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have an interface function for "detach and return the network::connection" if you prefer that to the cast?

Personally yes but up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bah, checked this branch back out to start on that, but of course the HTTP and HTTP2 clients can't have a function that rips off the network::connection because it's owned by the codec. given it was optional I'm going to leave as is since I think the cast is less ugly than a "hand off if possible but mostly it isn't"

Comment on lines +1510 to +1512
if (protocols.size() == 2 &&
((protocols[0] == Http::Protocol::Http2 && protocols[1] == Http::Protocol::Http11) ||
(protocols[1] == Http::Protocol::Http2 && protocols[0] == Http::Protocol::Http11))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems kind of fragile and also won't obviously support H3. Can we at least have a TODO about H3 here. I wonder also is there some way of pre-selecting the conn pool somewhere else and then passing an enum in here? I'm not sure what is possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is limited by design. I don't think we know if when we add HTTP/3 it'll be into the mixed pool - I suspect we'd have a wrapper pool around a mixed pool and a probing-h3 pool but not sure yet. Whoever adds that (which may yet be me) will surely be adding code and tests, so can add the relevant extra code then, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK that's fine maybe add a TODO but up to you.

Comment on lines +27 to +32
// onConnected is called under the stack of the Network::Connection raising
// the Connected event. The first time it is called, it's called for a TCP
// client, the TCP client is detached from the connection and discarded, and an
// HTTP client is associated with that connection. When the first call returns, the
// Network::Connection will inform the new callback (the HTTP client) that it
// is connected. The early return is to ignore that second call.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK this makes sense to me now. My only concern is this assumes that adding/removing callbacks is safe in the context of callbacks being called. This is true with the current implementation inside ConnectionImpl, but do you know if we have explicit unit tests there about that? I'm mainly concerned about this being somewhat fragile and we might break it in a non-obvious way. If you think we have good coverage that's fine, but maybe update this comment to mention the implementation assumption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's fragile given it's pretty deterministic and regression tested. The only bit that could change is if someone fixed the extra callback under the hood and that'd be a no-op (just make the comment obsolete). Added some unit tests of unregister/reregister mid-callback as that's totally a good thing to test.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
mattklein123
mattklein123 previously approved these changes Dec 15, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo remaining optional comments.

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>
@mattklein123 mattklein123 merged commit 93ee668 into envoyproxy:master Dec 15, 2020
mpuncel added a commit to mpuncel/envoy that referenced this pull request Dec 16, 2020
* master: (49 commits)
  sds: allow multiple init managers share sds target (envoyproxy#14357)
  [http] Remove legacy codecs (envoyproxy#14381)
  http2: Add integration tests for METADATA and RST_STREAM frame flood mitigation for upstream servers (envoyproxy#14365)
  test: start dissolving :printers_include rule. (envoyproxy#14429)
  integration tests: re-enable set_node_on_first_message_only (envoyproxy#14270)
  formatter: add a formatter that returns a google::protobuf::Struct rather than a string (envoyproxy#14258)
  ratelimit: support returning custom response bodies for non-OK responses from the external ratelimit service (envoyproxy#14189)
  deps: update protobuf to 3.14 (envoyproxy#14253)
  stream_info: add setResponseCode and update local_reply to take a normal StreamInfo (envoyproxy#14402)
  http: alpn upstream (envoyproxy#13922)
  Moved starttls integration test to test/extensions/transport_sockets/starttls. (envoyproxy#14425)
  generic conn pool: directly use thread local cluster (envoyproxy#14423)
  wasm: add mathetake to CODEOWNERS (envoyproxy#14427)
  wasm: clear route cache when modifying HTTP request headers. (envoyproxy#14318)
  tls: disable TLS inspector injection (envoyproxy#14404)
  aggregate cluster: cleanups (envoyproxy#14411)
  Mark starttls_integration_test flaky on Windows (envoyproxy#14419)
  tcp: improved unit testing (envoyproxy#14415)
  config: making protocol config explicit (envoyproxy#14362)
  wasm: dead code (envoyproxy#14407)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
@alyssawilk alyssawilk deleted the h1_h2 branch June 10, 2021 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Client side ALPN support (upstream cluster)
3 participants