
Feature request: a means to refuse subsequent TCP connections while allowing current connections enough time to drain #2920

Open
rosenhouse opened this issue Mar 28, 2018 · 35 comments
Labels: enhancement (Feature requests. Not bugs or questions.), help wanted (Needs help!)

Comments

@rosenhouse
Contributor

rosenhouse commented Mar 28, 2018

update

We've edited this issue to be less prescriptive about the solution. It now presents the three possible approaches that we can see.

summary

Given I've configured Envoy with LDS serving a TCP proxy listener on some port
and there are connections in flight
I would like a way to refuse subsequent TCP connections to that port while allowing current established connections to drain

We tried the following approaches, but none of them achieve our goals:

  1. update the LDS to remove the listener
  2. signal Envoy with SIGTERM
  3. send a GET request to /healthcheck/fail

steps to reproduce

write a bootstrap.yaml like

---
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

node:
  id: some-envoy-node
  cluster: some-envoy-cluster

dynamic_resources:
  lds_config:
    path: /cfg/lds-current.yaml

static_resources:
  clusters:
  - name: example_cluster
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address:
        address: 93.184.216.3    # IP address of example.com
        port_value: 80

write a lds-current.yaml file like

version_info: "0"
resources:
- "@type": type.googleapis.com/envoy.api.v2.Listener
  name: listener_0
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8080
  filter_chains:
    - filters:
        - name: envoy.tcp_proxy
          config:
            stat_prefix: ingress_tcp
            cluster: example_cluster

launch envoy (I'm using v1.6.0)

envoy -c /cfg/bootstrap.yaml --v2-config-only --drain-time-s 30

confirm that the TCP proxy is working

curl -v -H 'Host: example.com' 127.0.0.1:8080

Possible approach 1: remove the listener

update the LDS to return an empty set of listeners. this is a two-step process. first, write an empty LDS response file, lds-empty.yaml

version_info: "1"
resources: []

second, move that file on top of the file being watched:

mv lds-empty.yaml lds-current.yaml

in the Envoy stdout logs you'll see a line

source/server/lds_api.cc:68] lds: remove listener 'listener_0'

attempt to connect to the port where the listener used to be:

curl -v -H 'Host: example.com' 127.0.0.1:8080

expected behavior

I would like to see all new TCP connections refused immediately, as if the listener had never been added in the first place. Existing TCP connections should continue to be serviced.

actual behavior

the port is still open, even after the LDS update occurs

lsof -i
COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
envoy     1 root    9u  IPv4  30166      0t0  TCP *:9901 (LISTEN)
envoy     1 root   22u  IPv4  30171      0t0  TCP *:8080 (LISTEN)

clients can connect to the port, but the TCP proxying seems to hang (can't tell where)

curl -H 'Host: example.com' -v 127.0.0.1:8080
* Rebuilt URL to: 127.0.0.1:8080/
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/7.47.0
> Accept: */*
>
^C

this state persists until --drain-time-s has elapsed (30 seconds in this example). At that point the port is finally closed, so you see

curl 127.0.0.1:8080
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused

Possible approach 2: pkill -SIGTERM envoy

If instead of removing the listeners we signaled Envoy

pkill -SIGTERM envoy

Envoy exits immediately without allowing current connections to drain

[2018-03-28 17:42:06.995][3563][warning][main] source/server/server.cc:312] caught SIGTERM
[2018-03-28 17:42:06.995][3563][info][main] source/server/server.cc:357] main dispatch loop exited
[2018-03-28 17:42:07.004][3563][info][main] source/server/server.cc:392] exiting

EDITED to remove incorrect bit about listeners staying open after SIGTERM.

Possible approach 3: admin healthcheck fail

We could instead GET /healthcheck/fail to trigger this behavior. As above, we would expect that new TCP connections should be refused while existing TCP connections are serviced.
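
For reference, against the admin port configured in the bootstrap above (9901), the call would look like the following. This is a sketch: recent Envoy releases require POST for mutating admin endpoints, while older releases such as the v1.6.0 used here also accepted GET, and /healthcheck/ok reverses the state.

curl -X POST 127.0.0.1:9901/healthcheck/fail
curl -X POST 127.0.0.1:9901/healthcheck/ok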

background

In Cloud Foundry, we have the following setup currently:

       Router         =====>       Envoy    ---->    App
(shared ingress)      (TLS)                 TCP

Each application instance has a sidecar Envoy which terminates TLS connections from the shared ingress router. Applications may not speak HTTP, so we use basic TCP connectivity checks from the shared Router to the Envoy in order to infer application health and determine if a client connection should be load-balanced to that Envoy. When the upstream Envoy accepts the TCP connection, the Router considers that upstream healthy. When the upstream refuses the TCP connection, the Router considers that upstream unhealthy.

During a graceful shutdown, the scheduler ought to be able to drain the Envoy before terminating the application. This would mean that the Envoy ought to service any in-flight TCP connections without accepting any new ones.

acknowledgements

h/t @jvshahid and @emalm for investigation and edits

@alyssawilk added the enhancement and help wanted labels on Mar 28, 2018
@rosenhouse
Contributor Author

cc @rshriram

@rosenhouse changed the title from "Feature request: after removing a TCP proxy listener, subsequent TCP connections should be refused" to "Feature request: a means to refuse subsequent TCP connections while allowing current connections enough time to drain" on Mar 28, 2018
@mattklein123
Member

Yeah I think we can probably add listener drain configuration to better determine what happens when a listener enters drain mode. I.e., an option can be to just start refusing connections immediately when draining starts.

@wrowe
Contributor

wrowe commented Apr 4, 2018

In approach 2 above, a different signal is typically used for a graceful stop, such as SIGHUP; the well-known behavior of SIGTERM should not be altered.

@wrowe
Contributor

wrowe commented Apr 5, 2018

Note also, in the observations for approach 2 above, that keeping the listener alive after SIGTERM is clearly erroneous and a bug in its own right; that is unrelated to the desire for TCP drain functionality.

@rosenhouse
Contributor Author

rosenhouse commented Apr 5, 2018

That may be a formatting oversight in the issue. I'm pretty sure Envoy just quits on SIGTERM. Let me fix that in the issue description.

@jukylin

jukylin commented Apr 30, 2018

Does Envoy support graceful shutdown now?

@wrowe
Contributor

wrowe commented May 4, 2018

To sum up my research to date: all three pathways above ought to work, but none of them will, due to the underlying libevent listener implementation in 2.1.8, released in February 2017.

It appears deliberate that disabling a listener simply removes the socket from the poll queue, collecting connections in the backlog queue until the listener is re-enabled.

Work on the listener implementation has been merged to the 2.2.0 dev master, so I am working on a minimal repro case to discuss in a libevent ticket, solicit workarounds effective for 2.0/2.1/2.2, and determine whether this represents a defect or an enhancement.

@ggreenway
Contributor

Can you elaborate on what issues you're having with libevent?

I'm working on implementing something similar right now. I'm trying to solve a slightly different problem, but it has a lot of overlap. But I have envoy closing its listener (as reported by netstat).

@ggreenway
Contributor

See #3307. Would that resolve this issue for you?

@rosenhouse
Contributor Author

rosenhouse commented May 8, 2018

@ggreenway Yes, thank you.

In the future, we may want a means for certain listeners to opt-out of this behavior. For example, in a sidecar proxy scenario, during the shutdown & drain time, we'd want to close the ingress listeners but leave any egress listeners open until the very end. But we can wait on that enhancement. For now, #3307 would be great.

@mattklein123
Member

@ggreenway @rosenhouse note that we already support selective per-listener drain options via https://github.com/envoyproxy/envoy/blob/master/api/envoy/api/v2/lds.proto#L121 (and use it for the same reasons at Lyft). @ggreenway please also consider ^ when you are looking into the final solution.
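
For reference, a minimal LDS sketch of that per-listener option (the listener name, port, and cluster below are placeholders, not taken from this issue). drain_type: MODIFY_ONLY keeps a listener open through /healthcheck/fail and hot-restart draining and only drains it when the listener itself is modified or removed:

- "@type": type.googleapis.com/envoy.api.v2.Listener
  name: egress_listener
  drain_type: MODIFY_ONLY
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 15001
  filter_chains:
    - filters:
        - name: envoy.tcp_proxy
          config:
            stat_prefix: egress_tcp
            cluster: some_egress_cluster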

@ggreenway
Contributor

@rosenhouse For clarity, in your ideal scenario, can you write a timeline (or ordered list) of when ingress and egress listeners would be drained and closed?

@rosenhouse
Contributor Author

rosenhouse commented May 8, 2018

  1. ingress listeners stop accepting new connections (clients see TCP connection refused) but continue to service existing connections. egress listeners are completely unaffected.
  2. configurable delay to allow workload to finish servicing existing connections.
  3. envoy (and workload) both terminate

@emalm does that look right?

@ggreenway
Contributor

Thanks

@emalm

emalm commented May 8, 2018

Thanks, @rosenhouse, that sequence looks right to me!

@ggreenway
Contributor

Not sure if this is important to you or not, but 1 caveat to keep in mind: if you have no way to tell your clients to stop connecting before you close listeners (to cause TCP clients to see connection refused), the listener may have some connections in the listen() backlog when the listener is closed. There is no race-free way to avoid this with the socket API. Those connections will get RST, but they may have already sent some request data.

@rosenhouse
Contributor Author

@ggreenway interesting, I wasn't aware of that. Do you have any references you could point to that describe this?

@ggreenway
Contributor

The documentation is scarce, and results may be OS-dependent. I did testing on linux, using a simple test client and server application. The server calls listen() but never calls accept(). The client connects (using blocking socket calls), then sends some bytes. tcpdump shows the client SYN, server SYN/ACK, client ACK, and the client sending data, and the server ACK'ing the data.

The reason is that, at least on linux, accept() will always return a fully-established connection socket, not a half-open socket. So to meet that requirement, the server OS must tell the client that it is connected. Once it is connected, the client may send at least one TCP window's worth of data.

@wrowe
Contributor

wrowe commented May 9, 2018

@ggreenway I see we've been mulling the same issue.

I'm also convinced libevent is wrong;
http://www.wangafu.net/~nickm/libevent-book/Ref8_listener.html
Enabling and disabling an evconnlistener
int evconnlistener_disable(struct evconnlistener *lev);
int evconnlistener_enable(struct evconnlistener *lev);
"These functions temporarily disable or reenable listening for new connections."

They do no such thing, of course; the listening socket remains open and the kernel keeps completing connections into the backlog, while the event loop merely stops accepting them during the disabled period.

@rosenhouse Greg's answer above is good; James Briggs describes how this can be accomplished with iptables: http://www.jebriggs.com/blog/2016/10/linux-graceful-service-shutdown-techniques/

Stevens' Unix Network Programming, vol. 1, section 4.5 goes into quite a bit of detail across the variety of network stack implementations, and section 30.7 covers the two queue lengths. Linux does not implement changing the backlog queue length to 0 (queueing incoming SYNs to be dropped later).

I'm beginning to look at whether packet filtering might help us here, giving us the choice of dropping the SYN on the floor or answering it with a RST.

@mattklein123
Member

@wrowe technically I understand why all of this is being discussed, but practically, it's really hard for me to understand why this makes operational sense for any deployment. This is exactly why Envoy supports health check draining. In what situation are you not able to drain traffic via failing health checks of some type?

@ggreenway
Contributor

I understand your point about libevent. I imagine that was written from the point of view of a library that provides events for things (and also happens to manage sockets sometimes). I think it could be rephrased more correctly as "These functions temporarily disable or reenable listening for events for new connections." But I think that's beside the point, because the socket API doesn't provide for what you want to do (at least on linux). As you pointed out, you'll probably need to do some kind of packet filtering first.

If you're going down the road of packet filtering, you may not need any change to envoy. You can add a rule to drop all SYNs, wait until connections close or you've waited as long as you want to, then kill envoy.
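
As a rough sketch of that approach for the 8080 listener from the repro above (this assumes plain iptables and that rule placement suits your existing chains):

# refuse new inbound connections with a RST; established flows are untouched because only SYNs are matched
iptables -I INPUT -p tcp --dport 8080 --syn -j REJECT --reject-with tcp-reset
# ...or silently drop the SYNs instead:
# iptables -I INPUT -p tcp --dport 8080 --syn -j DROP

# once connections have drained (or the deadline passes), clean up and stop Envoy
iptables -D INPUT -p tcp --dport 8080 --syn -j REJECT --reject-with tcp-reset
pkill -SIGTERM envoy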

@ggreenway
Contributor

@mattklein123 front-end health check failing doesn't really work with TcpProxy, does it?

@ggreenway
Contributor

But in the absence of /healthcheck/fail (like for TcpProxy), you need some way external to envoy to drain traffic. IPTables is one such way.

@mattklein123
Member

@mattklein123 front-end health check failing doesn't really work with TcpProxy, does it?

In every deployment I've ever done it does. Basically, HTTP health check can be used to fail an Envoy out of some rotation, whether that be a load balancer, DNS, BGP, etc. Then after some period of time the process can be terminated/restarted/etc. I've just never seen a need for this type of control so it's kind of surprising to me.

@ggreenway
Contributor

Yeah, I see what you mean. I meant specifically /healthcheck/fail. But yes, there must be some other mechanism for doing the same thing. I think you and I are in agreement here.

@jvshahid
Contributor

jvshahid commented May 9, 2018

I'm having trouble following this thread starting from this comment. Up to that comment, the conversation had been about making the current Linux socket API drain in-flight connections (i.e., connections that have finished the 3-way handshake but haven't yet been accepted by the application). @mattklein123 can you explain what you mean by this comment and this one?

@rosenhouse
Contributor Author

rosenhouse commented May 9, 2018

@jvshahid My understanding from the conversation is that this feature request (OP) cannot be achieved in a non-racy (i.e. correct) way using the sockets API alone.

However, as you said on our call earlier, the race condition may be unlikely enough that it meets our SLOs, especially when combined with our service discovery deregistration mechanism (NATS deregistration message to Cloud Foundry ingress router). End-users would see failures only when the NATS deregistration message is lost and when this TCP race condition is hit.

For that reason, I think we'd still find this feature valuable.

The alternative would be for us to do active healthchecking from the ingress router to the backend Envoy. That would be a lot of work in gorouter (and elsewhere in cloud foundry), and I'd have concerns about how it scales when a single ingress router is proxying to >10^5 backends (as we have in large Cloud Foundry deployments).

@rosenhouse
Contributor Author

But I would also like to better understand the health-checking approach that @mattklein123 and @ggreenway seem to be discussing, in light of the scale targets we have in mind for CF.

-----------------                 ---------------------------------
| ingress router |     ---->     |  sidecar Envoy ---> application |
-----------------                 ---------------------------------
    N of these                               M of these

Our setup is N ingress proxies, each forwarding to M sidecar'd application instances. We constrain the drain duration to T seconds. With active health-checking from each ingress to each sidecar, that means each ingress router needs to average M/T health-checks per-second.

A large Cloud Foundry deployment has:
N = 20
M = 200000
with T=10 seconds

So each ingress router is averaging 20k health-checks per second.

Is this realistic? Are we thinking about this wrong?

cc @emalm @shalako

@mattklein123
Member

@rosenhouse without knowing all the details of your architecture it's hard to comment, but the way I would likely approach this would be to utilize a workflow in which you:

  1. Know that an application is being destroyed (however you would invoke the close listener functionality)
  2. Indicate via EDS response that the backend in question is not healthy, so no new requests/connections will be sent to it (see the sketch after this list).
  3. Wait some period of time.
  4. Kill the application/remove from EDS.
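
A minimal EDS sketch of step 2 (the cluster name and endpoint address below are placeholders). Marking the endpoint UNHEALTHY in the ClusterLoadAssignment means no new requests/connections are load-balanced to it, while connections that are already established continue:

- "@type": type.googleapis.com/envoy.api.v2.ClusterLoadAssignment
  cluster_name: example_cluster
  endpoints:
  - lb_endpoints:
    - endpoint:
        address:
          socket_address:
            address: 10.0.0.5
            port_value: 8080
      health_status: UNHEALTHY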

@rosenhouse
Contributor Author

rosenhouse commented May 16, 2018

Thanks, this sounds like what we already do in Cloud Foundry.


EDIT: We don't quite do this in Cloud Foundry

@wrowe
Contributor

wrowe commented Oct 16, 2018

Coming back to this ticket (sorry for the long pause), I think we want to take approach 3 (a GET request to /healthcheck/fail) out of consideration for this behavior change.

While cases 1 and 2 strongly suggest closing most listening sockets immediately, which I'll get into in a follow-up post, case 3 doesn't fit this pattern.

If we consider the note attached to this documented feature, the listening endpoint would no longer respond with the healthcheck failure status that it promises to deliver; see:

https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/health_check_filter#config-http-filters-health-check

[Edit to add] the endpoint is further unable to respond to the ok filter ("The /healthcheck/ok admin endpoint reverses this behavior."). Given these contradictions, the entire healthcheck facility should be considered "advisory" and have no effect on polling the listener.

My follow-ups will focus on the first two cases, removing a configured listener or taking the process out of service.

@naftulikay

I came here from #7841.

I built an internal tool at my company for handling graceful Envoy shutdown along with hot restarts. It was written in Rust and does the following:

  • Handles the hot-reload functionality that the Python wrapper provides.
  • Implements graceful shutdown by doing the following, in order:
    • Upon a shutdown signal, POSTs to /healthcheck/fail to cause Envoy to fail healthchecks. Configure an Envoy health check endpoint in your Envoy configuration that your load-balancer will use to determine Envoy health.
    • Sleep for a configurable eviction period to allow the Envoy instance to be removed from the load balancer so that new connections aren't routed to it. If interrupted, jump to the shutdown scheme.
    • Query the statistics interface to see if there are active requests. Once this number is zero and confirmed as such by three subsequent queries, proceed to shutdown. If this does not happen within a drain timeout, then shut down and exit 1.
    • Send a SIGTERM to all child processes with a kill timeout: if processes don't exit within the kill timeout, SIGKILL and exit 1.

We provide the following knobs:

  • Eviction period: the amount of time to sleep before querying the stats interface.
  • Drain timeout: the maximum amount of time to wait for draining to occur while querying the stats interface.
  • Drain interval: the amount of time to sleep between drain query requests.
  • Drain attempts: the number of successful drain query requests required in a row to mark Envoy as successfully drained.
  • Kill timeout: the maximum amount of time to wait for child processes to exit after SIGTERM before sending SIGKILL.

We're using systemd, so we set TimeoutStopSec= to the maximum sum of all periods to give the service ample time to shut down and drain.
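
A minimal shell sketch of that sequence (the admin address, stat-name pattern, and timeout values are assumptions for illustration; the consecutive-confirmation and SIGKILL-escalation details are omitted):

#!/bin/sh
ADMIN=127.0.0.1:9901                               # assumed Envoy admin address
EVICTION_PERIOD=30; DRAIN_TIMEOUT=60; DRAIN_INTERVAL=2

curl -s -X POST "http://$ADMIN/healthcheck/fail"   # start failing load-balancer health checks
sleep "$EVICTION_PERIOD"                           # wait for the load balancer to evict this instance

deadline=$(( $(date +%s) + DRAIN_TIMEOUT ))
while [ "$(date +%s)" -lt "$deadline" ]; do
  # sum the active-request gauges exposed by the stats interface (hypothetical stat-name match)
  active=$(curl -s "http://$ADMIN/stats" | awk -F': ' '/downstream_rq_active/ {sum += $2} END {print sum + 0}')
  [ "$active" -eq 0 ] && break
  sleep "$DRAIN_INTERVAL"
done

pkill -TERM envoy                                  # escalate to SIGKILL after a kill timeout if it does not exit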

Do note that we are not doing anything with TCP; rather, we are relying on load-balancer health checks to ensure draining occurs. We set the eviction timeout to the maximum allowed request duration, but this could be set more aggressively.

It would be possible for us to add a hook into this system in order to use iptables to stop accepting new TCP connections, but we aren't doing that presently.

I'm not sure if I will be able to open source this effort, but wanted to provide some breadcrumbs for others if they need a solution to this.

@auni53
Contributor

auni53 commented Dec 12, 2019

We're implementing lame duck mode for our project, so I may be taking this issue on. Not sure yet though, still scoping.

@kyessenov
Contributor

Given that Envoy sets REUSE_PORT by default now, it seems there is another reason to stop accepting connections in draining listeners: this would help shift traffic to another instance of Envoy bound to the same port. We're thinking specifically of outbound connections that the sidecar Envoy intercepts (hence, client "lame duck mode" does not help). We're aware of the inherent race in the Linux TCP accept queue implementation that @ggreenway mentioned.

@villainb-dg

I stumbled across this feature request while working on the shutdown behavior of Envoy used as a sidecar proxy for ingress traffic, and was surprised not to see any mention of the drain_listeners?graceful admin operation. It has worked wonders for me, so I'll document the setup for feedback/comments.

We have gRPC service endpoints fronted by Envoy sidecars running on Kubernetes. During application shutdown, we somewhat control the sequence of POSIX signals sent to both Envoy and gRPC but are bound to terminate within a 30 second window. The current implementation uses active health checks and the /healthcheck/fail API to make clients evict the endpoint, but this introduces challenges given the number of clients we have (who all perform active health checks towards all the endpoints they connect to).

When using the drain_listeners?graceful call, Envoy continues to listen on its socket but starts sending GOAWAY frames on every active connection while continuing to serve traffic. After the drain time period (configured on startup with --drain-time-s), Envoy closes the listening socket but keeps serving traffic on active connections. By setting a short --drain-time-s (e.g. 0), Envoy effectively starts a draining sequence that prevents new connections from being accepted while allowing current connections enough time to drain.

After the application drain time period has expired, a SIGTERM can be sent to Envoy. All active connections are then reset, but since we are past the grace period, that is the expected behavior of the system.
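
For reference, the sequence described above looks roughly like this against the admin interface (the admin port and sleep value are assumptions; the endpoint is spelled /drain_listeners in current admin documentation):

# Envoy was started with --drain-time-s 0, as described above; on shutdown:
curl -X POST 'http://127.0.0.1:9901/drain_listeners?graceful'   # begin graceful drain: GOAWAY on active connections, listener closes after drain-time-s
sleep 25                                                        # application-level drain window within the 30 second budget
pkill -TERM envoy                                               # remaining connections are reset, which is expected after the grace period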
