
Etcd watch stream starvation under high read response load when sharing same connection and TLS is enabled #15402

Closed
serathius opened this issue Mar 3, 2023 · 28 comments · Fixed by #16735
Assignees: serathius
Labels: priority/important-soon, release/v3.4, release/v3.5, release/v3.6, stage/tracked, type/bug

Comments

serathius (Member) commented Mar 3, 2023

What happened?

When an etcd client generates high read response load, the watch response stream on the same connection can be starved. For example, a client with an open watch that runs 10 concurrent Range requests, each returning around 10 MB, might see the watch starved and not receive any response for tens of seconds.

The problem does not occur when TLS is disabled, nor when the watch is created on a separate client/connection.

This also affects Kubernetes (any version), as Kubernetes uses one client per resource. The problem triggers when a single K8s resource holds a lot of data and 10+ concurrent LIST requests for that resource are sent to the apiserver, for example 10k pods. This can cause serious correctness issues in K8s, such as controllers doing no work because they depend on watch for updates, for example the scheduler not scheduling any pods.

We tested and confirmed that all v3.4+ versions are affected.

Issue affects any watch response type:

  • Event
  • Periodic watch progress notification
  • Manual watch progress notification

What did you expect to happen?

The watch stream should never get blocked, no matter the load.

How can we reproduce it (as minimally and precisely as possible)?

Repro for progress notify:

Start etcd with TLS and progress notify, for example:

./bin/etcd   --cert-file ./tests/fixtures/server2.crt --key-file ./tests/fixtures/server2.key.insecure --listen-client-urls https://127.0.0.1:2379 --advertise-client-urls https://127.0.0.1:2379 --experimental-watch-progress-notify-interval=100ms

Insert a lot of data into etcd, for example by running the command below a couple of times:

go run ./tools/benchmark/main.go  put --total 10000 --val-size=10000 --key-space-size 10000 --key ./tests/fixtures/server.key.insecure --cacert ./tests/fixtures/ca.crt --cert ./tests/fixtures/server.crt

Run the program below:

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	c, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"127.0.0.1:2379"},
		TLS:       &tls.Config{InsecureSkipVerify: true},
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()
	for i := 0; i < 10; i++ {
		go func() {
			for {
				start := time.Now()
				resp, err := c.Get(context.Background(), "", clientv3.WithPrefix())
				if err != nil {
					fmt.Printf("Error: %v\n", err)
					continue // resp is nil on error; retry instead of dereferencing it
				}
				fmt.Printf("Got %d, took %s\n", len(resp.Kvs), time.Since(start))
			}
		}()
	}
	start := time.Now()
	for resp := range c.Watch(context.Background(), "a", clientv3.WithProgressNotify()) {
		fmt.Printf("Got notify, rev: %d, since last: %s\n", resp.Header.Revision, time.Since(start))
		start = time.Now()
	}
}

Instead of getting logs like:

Got notify, rev: 26971, since last: 110.947114ms
Got notify, rev: 26971, since last: 109.55032ms
Got notify, rev: 26971, since last: 109.20894ms
Got notify, rev: 26971, since last: 110.621913ms
Got notify, rev: 26971, since last: 107.954765ms
Got notify, rev: 26971, since last: 110.530701ms
Got notify, rev: 26971, since last: 107.704987ms
Got notify, rev: 26971, since last: 111.342234ms
Got notify, rev: 26971, since last: 107.9304ms
Got notify, rev: 26971, since last: 111.170052ms
Got notify, rev: 26971, since last: 108.656011ms
Got notify, rev: 26971, since last: 107.874665ms
Got notify, rev: 26971, since last: 108.476916ms
Got notify, rev: 26971, since last: 111.891616ms
Got notify, rev: 26971, since last: 107.340694ms
Got notify, rev: 26971, since last: 109.147243ms
Got notify, rev: 26971, since last: 111.340591ms
Got notify, rev: 26971, since last: 106.833842ms
Got notify, rev: 26971, since last: 109.345355ms
Got notify, rev: 26971, since last: 112.012011ms
Got 3783, took 2.237987879s
Got notify, rev: 26971, since last: 109.694461ms
Got 3783, took 2.324323811s
Got notify, rev: 26971, since last: 105.763332ms
Got 3783, took 2.41313437s
Got 3783, took 2.462624343s
Got 3783, took 2.494911293s
Got 3783, took 2.499409447s
Got 3783, took 2.500700171s
Got 3783, took 2.508462692s
Got notify, rev: 26971, since last: 109.242693ms
Got 3783, took 2.515227452s
Got 3783, took 2.534498037s
Got notify, rev: 26971, since last: 121.741006ms
Got notify, rev: 26971, since last: 112.175327ms
Got notify, rev: 26971, since last: 107.779196ms
Got notify, rev: 26971, since last: 98.25659ms
Got notify, rev: 26971, since last: 107.626991ms
Got notify, rev: 26971, since last: 108.487155ms

We get logs like:

Got notify, rev: 26971, since last: 113.603618ms
Got notify, rev: 26971, since last: 107.494278ms
Got notify, rev: 26971, since last: 106.061371ms
Got 3783, took 1.425461398s
Got 3783, took 1.010333393s
Got 3783, took 1.040919851s
Got 3783, took 986.415707ms
Got 3783, took 4.690286601s
Got 3783, took 1.029349011s
Got 3783, took 983.650976ms
Got 3783, took 1.012731026s
Got 3783, took 1.054258714s
Got 3783, took 4.312497715s
Got 3783, took 1.068067416s
Got 3783, took 986.65717ms
Got 3783, took 1.055793164s
Got 3783, took 992.97395ms
Got 3783, took 1.066242731s
Got 3783, took 5.192239047s
Got 3783, took 1.065642174s
Got 3783, took 1.015330689s
Got 3783, took 1.040866735s
Got 3783, took 1.023127374s

And the next notify response never comes.

Alternatively, run the ready e2e test:
5c26ded

Anything else we need to know?

With the help of @mborsz I managed to find the root cause.

I have a fix in progress in release-3.4...serathius:etcd:fix-watch-starving.
Will send a PR when ready.

Etcd version (please run commands below)

All etcd versions

Etcd configuration (command line flags or environment variables)

Any config with TLS enabled

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

N/A

Relevant log output

No response

@serathius serathius added type/bug priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Mar 3, 2023
@serathius serathius self-assigned this Mar 3, 2023
@serathius serathius pinned this issue Mar 3, 2023
@serathius serathius changed the title Etcd watch stream starvation under high read response load when TLS is enabled Etcd watch stream starvation under high read response load when sharing same connection and TLS is enabled Mar 3, 2023
serathius added commits to serathius/etcd that referenced this issue on Mar 3 and Mar 7, 2023.
serathius (Member, Author) commented Mar 8, 2023

Problem definition

There are two issues we want to address related to etcd endpoint setup:

  • Confirmed issue: watch stream starvation. Severity: Critical
  • Potential issues related to running the grpc server under the http server. Severity: Important

Watch starvation

Caused by running the grpc server under the http server, which is affected by golang/go#58804.

Mitigation options for the watch starvation are listed in the proposal below.

Running grpc under http server

Serving the grpc server under the http server is an experimental API with different performance characteristics and features.
It should not be used in production.
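For context, the setup under discussion looks roughly like the sketch below: a single TLS listener whose handler dispatches each request either to the grpc server or to the plain http mux. This is a simplified illustration of the pattern, not etcd's exact serve code; it relies on (*grpc.Server).ServeHTTP, which grpc-go documents as experimental, and the certificate paths are placeholders.

package main

import (
	"net/http"
	"strings"

	"google.golang.org/grpc"
)

// grpcAndHTTPHandler serves grpc and plain http traffic on the same listener
// by dispatching on the protocol version and content type.
func grpcAndHTTPHandler(grpcServer *grpc.Server, httpMux http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
			grpcServer.ServeHTTP(w, r) // grpc responses now share the http server's HTTP/2 frame scheduler
			return
		}
		httpMux.ServeHTTP(w, r) // /metrics, /health, /v3/*, etc.
	})
}

func main() {
	grpcServer := grpc.NewServer()
	mux := http.NewServeMux()
	srv := &http.Server{Addr: ":2379", Handler: grpcAndHTTPHandler(grpcServer, mux)}
	// TLS is required for HTTP/2; certificate paths are placeholders.
	panic(srv.ListenAndServeTLS("server.crt", "server.key"))
}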

Options for no longer running the grpc server under the http server are listed in the proposal below.

Requirements

  • No backport of breaking changes.
  • No backport of code not ready for production.

Proposal

Implement changes and execute backports in the order below:

  1. [v3.4+] Change the http server frame scheduler to random (see the sketch after this list).
  2. [v3.4+] Move the http server to a separate port by passing --listen-client-http-urls.
  3. [v3.6+] Use the customised connection multiplexer for etcd by @aojea.
  4. [v3.4+] When it is ready for production use, change to the new round-robin frame scheduler algorithm.
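To illustrate step 1, here is a minimal sketch of switching an http server's HTTP/2 frame scheduler to the random one via golang.org/x/net/http2. This is an assumption about the general approach, not a copy of etcd's change, and the certificate paths are placeholders.

package main

import (
	"log"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	srv := &http.Server{Addr: ":2379", Handler: http.DefaultServeMux}
	// Replace the default priority-based frame scheduler with the random one,
	// which spreads writes across streams and is less prone to starving one of them.
	if err := http2.ConfigureServer(srv, &http2.Server{
		NewWriteScheduler: http2.NewRandomWriteScheduler,
	}); err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.ListenAndServeTLS("server.crt", "server.key")) // placeholder cert paths
}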

Motivation

Background

List of etcd http paths:

  • /debug/vars - Access to etcd config. Available only via --listen-client-urls
  • /config/local/log - Remote configuration of global log level. Removed in v3.5. Available only via --listen-client-urls
  • /version - Version endpoint. Available only via --listen-client-urls
  • /metrics - Metrics endpoint. Available via both --listen-client-urls and --listen-metrics-urls
  • /health - Health endpoint (same as etcdctl endpoint health). Available only via --listen-client-urls.
  • /v2/* - etcd V2 API. Available only via --listen-client-urls (removed in v3.6)
  • /v3/* - grpc gateway for the V3 API. Allows calling the V3 API using HTTP and JSON (see the example below). Available only via --listen-client-urls
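As an example of what the grpc gateway allows, the sketch below issues a Range request for key "a" as plain HTTP/JSON. It assumes the /v3 gateway prefix used by recent etcd versions, base64-encodes the key per the gateway convention, and reuses the test setup's insecure TLS config; none of this is a production configuration.

package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// "YQ==" is base64 for "a"; the gateway expects base64-encoded keys and values.
	body := bytes.NewBufferString(`{"key": "YQ=="}`)
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // test fixture certs only
	}}
	resp, err := client.Post("https://127.0.0.1:2379/v3/kv/range", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // JSON-encoded RangeResponse
}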

EDIT 1: Based on @ptabor's feedback, I removed the --client-no-http flag and proposed backporting --listen-client-http-urls to avoid deprecation.
EDIT 2: Updated to use @aojea's custom connection multiplexer as the v3.6+ long-term solution.

serathius (Member, Author):

cc @ptabor @ahrtr @spzala @mitake

serathius (Member, Author):

cc @wojtek-t @liggitt

fuweid (Member) commented Mar 9, 2023

1. [v3.4+] Change the default write scheduler from Priority to Random, which in my testing improves resilience to starvation (not full immunity).

Hi @serathius, is it possible to maintain a round-robin write scheduler? I think it would be good for latency metrics, since latency becomes more predictable given the max-concurrent-streams limit. The write scheduler runs without a lock, so maybe it could be easier~.

serathius (Member, Author) commented Mar 9, 2023

1. [v3.4+] Change the default write scheduler from Priority to Random, which in my testing improves resilience to starvation (not full immunity).

Hi @serathius, is it possible to maintain a round-robin write scheduler? I think it would be good for latency metrics, since latency becomes more predictable given the max-concurrent-streams limit. The write scheduler runs without a lock, so maybe it could be easier~.

I'd prefer to avoid backporting untested code, especially if it would be written by me :P. I would wait for the grpc experts to implement and test it.

ptabor (Contributor) commented Mar 9, 2023

LGTM. I would also consider:

  • Logging a WARNING in v3.4 and v3.5 when a user accesses any HTTP (non-grpc) service on a listener not defined by --listen-client-http-urls. This will proactively drive users to prepare for v3.6.

serathius (Member, Author):

Discussed @fuweid's idea with @mborsz. Implementing it based on the Random scheduler should be much simpler than fixing golang/go#58804, which would require modifying the priority scheduler.

I will try to implement round-robin scheduling. Thanks for the suggestion, @fuweid.

ahrtr (Member) commented Mar 9, 2023

[v3.4+] Add flag --listen-client-http-urls that exposes the http server on a separate endpoint and disables non-grpc handlers on --listen-client-urls. This way we expose a pure grpc server, fully mitigating the issue, but it requires user action.

  1. We already have some confusing user-facing URLs (e.g. client/peer, listen/advertise, etc.); I don't think it's a good idea to add more.
  2. I'm inclined to refactor (*serveCtx) serve to resolve this issue.

fuweid (Member) commented Mar 9, 2023

@serathius thanks!!!

serathius (Member, Author) commented Mar 10, 2023

A quick look into the implementation of the random write scheduler shows that it isn't so random :P. It prioritizes writing control frames, which the priority scheduler does as well. Unfortunately, the method used to check whether a frame is a control frame is not public: https://go.googlesource.com/net/+/master/http2/writesched_random.go#48 (thanks Go).

This means a custom scheduler implementation will already be subpar. I can implement a scheduler that treats all frames the same (rough sketch below), however I'm not an expert in http2, so it's hard to guess what that could break.
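For reference, a rough sketch of what such a round-robin scheduler could look like against the WriteScheduler interface from golang.org/x/net/http2, approximating "control frame" as anything queued on stream 0. This is an illustration only, with a hypothetical constructor name; it is not the implementation that eventually landed in x/net.

package wsched

import "golang.org/x/net/http2"

// roundRobin is a sketch of a write scheduler that serves stream frames in
// round-robin order. Frames queued on stream 0 (connection-level frames such
// as SETTINGS and PING) are written first, approximating the "control frame"
// priority of the schedulers shipped with x/net.
type roundRobin struct {
	control []http2.FrameWriteRequest            // stream-0 frames
	queues  map[uint32][]http2.FrameWriteRequest // per-stream frame queues
	order   []uint32                             // stream IDs in round-robin order
	next    int                                  // index into order for the next Pop
}

// NewRoundRobinWriteScheduler is a hypothetical constructor for the sketch.
func NewRoundRobinWriteScheduler() http2.WriteScheduler {
	return &roundRobin{queues: map[uint32][]http2.FrameWriteRequest{}}
}

func (s *roundRobin) OpenStream(streamID uint32, _ http2.OpenStreamOptions) {}

func (s *roundRobin) CloseStream(streamID uint32) {
	delete(s.queues, streamID)
	for i, id := range s.order {
		if id == streamID {
			s.order = append(s.order[:i], s.order[i+1:]...)
			if s.next > i {
				s.next--
			}
			return
		}
	}
}

// AdjustStream ignores HTTP/2 priorities entirely: every stream gets an equal turn.
func (s *roundRobin) AdjustStream(uint32, http2.PriorityParam) {}

func (s *roundRobin) Push(wr http2.FrameWriteRequest) {
	id := wr.StreamID()
	if id == 0 {
		s.control = append(s.control, wr)
		return
	}
	if _, ok := s.queues[id]; !ok {
		s.order = append(s.order, id)
	}
	s.queues[id] = append(s.queues[id], wr)
}

func (s *roundRobin) Pop() (http2.FrameWriteRequest, bool) {
	if len(s.control) > 0 {
		wr := s.control[0]
		s.control = s.control[1:]
		return wr, true
	}
	// Give each stream with pending frames a turn, starting after the last served one.
	for i := 0; i < len(s.order); i++ {
		idx := (s.next + i) % len(s.order)
		if q := s.queues[s.order[idx]]; len(q) > 0 {
			s.queues[s.order[idx]] = q[1:]
			s.next = (idx + 1) % len(s.order)
			return q[0], true
		}
	}
	return http2.FrameWriteRequest{}, false
}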

EDIT: Or we can guess what it breaks. https://go.googlesource.com/net/+/04296fa82e83b85317bd93ad50dd00460d6d7940%5E%21/http2/writesched_random.go. Nice to see familiar people.

@aojea would love to get your thoughts on this issue.

aojea commented Mar 10, 2023

IIUIC the problem here is with DATA frames; those are the ones that can cause starvation.

Control frames do not carry much data; they are used for signaling, and per the RFC they MUST NOT be considered for flow control:

https://www.rfc-editor.org/rfc/rfc7540

  1. The frame type determines whether flow control applies to a frame. Of the frames specified in this document, only DATA frames are subject to flow control; all other frame types do not consume space in the advertised flow-control window. This ensures that important control frames are not blocked by flow control.

One interesting thing: if you were using the http2 implementation from golang.org/x/net instead of the one in the standard library, you were probably using the Random scheduler until last year: https://go-review.googlesource.com/c/net/+/382015

Based on this comment from Brad Fitzpatrick, https://go-review.googlesource.com/c/net/+/382015/comments/16b761d4_4c27bf6f, and the fact that the problem here is DATA frames, I'm +1 on the Random scheduler if it demonstrably solves the problem.

brandond added a commit to k3s-io/k3s that referenced this issue Sep 25, 2023: "Fixes performance issue under load, ref: etcd-io/etcd#15402 and kubernetes/kubernetes#118460"
serathius (Member, Author) commented Oct 10, 2023

Returning to the issue, as the fix in golang.org/x/net is now available.

  • On the main branch we can simply remove the override and start using the default scheduler, which is no longer affected.
  • On v3.5 we are using golang.org/x/net v0.7.0; we would need to bump it to at least v0.11.0 and then remove the override.
  • Same for v3.4.

I think the last question is: should/can we backport the golang.org/x/net bump and change the scheduler on old releases?

@serathius serathius reopened this Oct 11, 2023
brandond added commits to brandond/k3s and k3s-io/k3s that referenced this issue on Oct 12–13, 2023 (cherry-picks of the fix above).
serathius (Member, Author):

Closing as the fix is now available by default on all supported minor versions.

brandond added commits to brandond/kine that referenced this issue on Oct 26–27, 2023: "Upstream etcd has seen issues with grpc muxed with http and recommends against it. Ref: etcd-io/etcd#15402"
@serathius serathius unpinned this issue Oct 27, 2023
Further commits referencing this issue were added to k3s-io/kine, gshilei/kine, and xiaods/k8e between Oct and Nov 2023.
zhouhaibing089 commented:

For people who come to this thread: this can also happen when there are a lot of huge PUT requests (for instance, when there is a lot of resource creation in k8s).

Below is the "Client Traffic In" panel (screenshot omitted):

rate(etcd_network_client_grpc_received_bytes_total{...} [3m])

From the k8s perspective (screenshot omitted):

sum(rate(apiserver_watch_cache_events_received_total{...}[2m])) by (pod)

On the left side, the red line lags significantly behind the other two apiserver pods (meaning its watch cache accumulated a delay in receiving new watch events) until I stopped the test, while the right side was fine after we upgraded etcd to 3.5.10.
