Allow configuring idle_timeout Envoy parameter on HTTP Connection Manager #2164

Open
wants to merge 5 commits into
base: master
from


neufeldtech commented Jan 9, 2020

Description

We currently have Ambassador deployed on-premises behind a firewall and fronted by a CDN. On low-traffic services, our clients were receiving some 503 first-byte timeouts.

The issue occurs because when the firewall removes idle TCP sessions from its session table, it does not send a FIN or RST to the CDN. On the next request, the CDN believes it still has an open TCP connection and attempts to reuse it. The request is 'black-holed', which results in an HTTP 503.

Because Envoy never attempts to close idle TCP connections by default, Envoy believes the sessions are still open even though our firewall has dropped them after its 30-minute timeout (without notifying the CDN that the connection is closed), and the CDN also still believes the connection is open.

Our options for remediating this scenario were:

  • Configure a TCP idle timeout on the CDN side to control connection idle time
    • This turned out not to be something we can readily control on our CDN. They have 'dynamic' connection times.
  • Make the firewall send a FIN or RST to the CDN when it removes a session from its session table
    • We investigated this as well, but this is not configurable on our firewall.
  • Have Envoy close idle TCP connections gracefully after a configurable timeout (in our case, a timer that is lower than our firewall's idle session timeout)
    • We believe this is the most appropriate way to solve the issue, and it gives us control when tuning idle timeouts across each network device in the path. This PR therefore makes the idle_timeout Envoy setting configurable on the Envoy HTTP listener.
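
The ordering constraint behind the third option (the proxy must give up on an idle connection before any device that drops sessions silently) can be written as a small check. This is only an illustrative sketch: the function name `proxy_closes_first` and the example values in the usage note are hypothetical and not part of this PR. Only the 30-minute firewall timeout comes from the description above.

```python
from datetime import timedelta
from typing import Mapping


def proxy_closes_first(
    proxy_idle: timedelta, other_hops: Mapping[str, timedelta]
) -> bool:
    """Return True if the proxy's idle timeout is strictly shorter than the
    idle timeout of every other hop in the path.

    When this holds, the proxy closes an idle connection (with a FIN) before
    any other device can drop it without telling the peer.
    """
    return all(proxy_idle < hop_idle for hop_idle in other_hops.values())
```

For example, `proxy_closes_first(timedelta(seconds=30), {"firewall": timedelta(minutes=30)})` evaluates to `True`, while a proxy timeout of one hour against the same firewall evaluates to `False`.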

Related Issues

#2155
#1738
#2126

Testing

  • Manual testing via netcat with Ambassador running locally in a docker-for-desktop Kubernetes cluster.
  • Will run in our staging environment once we upgrade our stack to Ambassador v1.0

Manual testing:

Default configuration: the TCP connection can remain open for at least 2 minutes and 30 seconds. I sent a request, waited a few minutes, and was then able to send a second request over the same connection.

$ time nc localhost 80
GET / HTTP/1.1
Host: localhost
HTTP/1.1 200 OK
content-length: 2002
content-disposition: inline; filename="index.html"
accept-ranges: bytes
etag: "334cde9c1e9a17acbc4b7fc05810191319a69eec"
content-type: text/html; charset=utf-8
vary: Accept-Encoding
date: Thu, 09 Jan 2020 20:15:36 GMT
x-envoy-upstream-service-time: 2
server: envoy
<!doctype html>....</html>
GET / HTTP/1.1
Host: localhost
HTTP/1.1 200 OK
content-length: 2002
content-disposition: inline; filename="index.html"
accept-ranges: bytes
etag: "334cde9c1e9a17acbc4b7fc05810191319a69eec"
content-type: text/html; charset=utf-8
vary: Accept-Encoding
date: Thu, 09 Jan 2020 20:18:03 GMT
x-envoy-upstream-service-time: 4
server: envoy
<!doctype html>....</html>

nc localhost 80  0.00s user 0.01s system 0% cpu 2:31.62 total

With idle_timeout: 30s configured, Envoy closes the TCP connection after 30 seconds of inactivity. I sent one request, and the connection was gracefully closed about 30 seconds later because there was no further activity.

$ time nc localhost 80
GET / HTTP/1.1
Host: localhost
HTTP/1.1 200 OK
content-length: 2002
content-disposition: inline; filename="index.html"
accept-ranges: bytes
etag: "334cde9c1e9a17acbc4b7fc05810191319a69eec"
content-type: text/html; charset=utf-8
vary: Accept-Encoding
date: Thu, 09 Jan 2020 20:21:05 GMT
x-envoy-upstream-service-time: 1
server: envoy
<!doctype html>....</html>
nc localhost 80  0.00s user 0.01s system 0% cpu 37.362 total
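
The netcat check above could also be scripted, which may help when the Tests todo is picked up. The following is a minimal sketch using only the Python standard library. The function name `seconds_until_server_closes` and its parameters are hypothetical, and it assumes a plain-HTTP listener like the one used in the manual test.

```python
import socket
import time


def seconds_until_server_closes(host, port, max_wait=120.0):
    """Send one HTTP/1.1 request, then time how long the idle connection
    stays open.

    Returns the number of idle seconds before the peer closed (or reset)
    the connection, or None if it was still open after max_wait idle seconds.
    """
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(request)
        sock.settimeout(1.0)  # poll in 1-second steps
        last_activity = time.monotonic()
        while time.monotonic() - last_activity < max_wait:
            try:
                chunk = sock.recv(4096)
            except socket.timeout:
                continue  # nothing to read; the connection is still open
            except ConnectionResetError:
                return time.monotonic() - last_activity  # closed with RST
            if chunk == b"":
                return time.monotonic() - last_activity  # closed with FIN
            last_activity = time.monotonic()  # response bytes still arriving
    return None
```

Calling `seconds_until_server_closes("localhost", 80)` against a listener with a 30s idle timeout should return a value close to 30, and `None` when the connection outlives `max_wait`. Unlike the netcat timing, the result excludes the time spent typing the request.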

Todos

  • Tests
  • Documentation

Other

@neufeldtech neufeldtech mentioned this pull request Jan 9, 2020
1 of 2 tasks complete
@neufeldtech neufeldtech force-pushed the neufeldtech:idle_timeout_envoy branch from de8f11f to 570d56c Jan 10, 2020
neufeldtech commented Jan 10, 2020

@kflynn Please review when you are able

@neufeldtech neufeldtech force-pushed the neufeldtech:idle_timeout_envoy branch 2 times, most recently from aafe90c to 95a01d1 Jan 14, 2020
@neufeldtech neufeldtech requested a review from kflynn Jan 14, 2020
@neufeldtech neufeldtech force-pushed the neufeldtech:idle_timeout_envoy branch from 544b020 to c6b81a2 Jan 15, 2020