
All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU. #15247

Closed
aaronjzhang opened this issue Feb 6, 2023 · 44 comments · Fixed by #16822
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. stage/tracked type/bug

Comments

@aaronjzhang

aaronjzhang commented Feb 6, 2023

What happened?

We have 3 etcd nodes running on k8s, and the etcd pods use Portworx storage to save the db. We've found that rebooting the Portworx node causes etcd to get stuck in fdatasync; if that etcd node is the leader, it causes the 2 lease keep-alive issues below.

  1. The old leader still revokes leases after it stepped down to follower; this is the same issue as #12528.

Some clients connected to other etcd nodes and sent keep-alives to them. After the old leader stepped down to follower, the new leader handled the lease keep-alives, but the old leader's lessor still considered itself the primary and revoked all leases whose TTL had expired in its view. After the old leader returned from fdatasync, these revoke requests were still sent out when handling the next Ready.

  2. The old leader still handles lease keep-alive requests from the connected clients.

Some clients connected to the old leader etcd node. When the old leader got stuck in fdatasync, the raft layer stepped down to follower, but the lessor was not demoted, so the connected clients could still send keep-alive requests to the old leader. Meanwhile, the other two etcd nodes elected a new leader but never received the lease keep-alive messages that should have gone to them (they were still being sent to the old leader), so those leases were revoked after their TTL expired.

Essentially, both issues happen because the lessor's primary status is not synced up with the raft leader: there is a big gap between the lessor primary and the raft leader when the etcd server is stuck processing a raft Ready.
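
To make the gap concrete, here is a minimal, self-contained Go sketch (the type and function names are invented for illustration; this is not etcd's actual code). Because persisting the raft Ready and reacting to the leadership change happen in the same loop, a stalled fdatasync also stalls the lessor's demotion:

package main

import (
	"fmt"
	"sync"
	"time"
)

// ready is a stand-in for a raft Ready: entries to persist plus the node's
// current view of whether it is still the leader.
type ready struct {
	entries  []string
	isLeader bool
}

// lessor is a stand-in for etcd's lessor; only the "primary" lessor revokes
// expired leases.
type lessor struct {
	mu      sync.Mutex
	primary bool
}

func (le *lessor) demote() {
	le.mu.Lock()
	le.primary = false
	le.mu.Unlock()
}

// applyLoop mimics the single loop that both persists entries (WAL fdatasync)
// and reacts to leadership changes. A stalled fdatasync therefore also stalls
// the demotion of the lessor.
func applyLoop(le *lessor, readyc <-chan ready, fsyncDelay time.Duration) {
	for rd := range readyc {
		time.Sleep(fsyncDelay) // simulated slow fdatasync
		if !rd.isLeader {
			le.demote()
		}
	}
}

func main() {
	le := &lessor{primary: true}
	readyc := make(chan ready, 1)
	go applyLoop(le, readyc, 3*time.Second)

	// Raft has already decided this node is no longer the leader ...
	readyc <- ready{isLeader: false}

	time.Sleep(time.Second)
	le.mu.Lock()
	// ... but the lessor still believes it is primary while fdatasync stalls.
	fmt.Println("lessor still primary:", le.primary)
	le.mu.Unlock()
}

Running this prints that the lessor still considers itself primary well after raft has stepped down, which is exactly the window in which keep-alives are mishandled and leases get revoked.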

What did you expect to happen?

  1. The lease keep-alive on the old leader should return an error when its raft steps down to follower, so that the clients can connect to other etcd nodes.
  2. The old leader should not revoke the other leases.

How can we reproduce it (as minimally and precisely as possible)?

  1. Add code to the func (w *WAL) sync() function to simulate a stuck fdatasync:
func (w *WAL) sync() error {
	if w.encoder != nil {
		if err := w.encoder.flush(); err != nil {
			return err
		}
	}
	start := time.Now()
	err := fileutil.Fdatasync(w.tail().File)

	// Simulate a stuck fdatasync (requires adding "io/ioutil", "strconv"
	// and "strings" to this file's imports).
	if fileutil.Exist("/run/etcd/fdatasync_block") {
		f, err := ioutil.ReadFile("/run/etcd/fdatasync_block")
		if err == nil {
			strNum := strings.Trim(string(f), "\n")
			blockDuration, err := strconv.Atoi(strNum)
			if err == nil {
				time.Sleep(time.Duration(blockDuration) * time.Second)
			} else if w.lg != nil {
				w.lg.Warn("Atoi error", zap.Error(err))
			}
		}
	}

	took := time.Since(start)
	if took > warnSyncDuration {
		if w.lg != nil {
			w.lg.Warn(
				"slow fdatasync",
				zap.Duration("took", took),
				zap.Duration("expected-duration", warnSyncDuration),
			)
		} else {
			plog.Warningf("sync duration of %v, expected less than %v", took, warnSyncDuration)
		}
	}
	walFsyncSec.Observe(took.Seconds())

	return err
}
  2. Compile etcd with the above change.
  3. Apply the patched etcd in a cluster and restart the members.
  4. Grant 3 leases with a 10-second TTL.
  5. Keep a different lease alive on each etcd node.
  6. Start simulating fdatasync blocking on the leader etcd with the command below:
mkdir /run/etcd; echo 60 > /run/etcd/fdatasync_block
  7. Put a kv to trigger fdatasync.
  8. Wait for the etcd leader to change.
  9. Check the time-to-live of the lease on the new etcd leader that is being kept alive on the old leader; you will see it has expired while the keep-alive is still ongoing on the old leader.
  10. Wait for about 60 seconds; the other two leases will be revoked after the old leader recovers.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.4.3
Git SHA: 3cf2f69b5
Go Version: go1.14
Go OS/Arch: linux/amd64




$ etcdctl version
etcdctl version: 3.4.3
API version: 3.4

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

Solution and analysis (updated by @ahrtr)

@ahrtr ahrtr added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 7, 2023
@ahrtr
Member

ahrtr commented Feb 7, 2023

Thanks @aaronjzhang for raising this issue.

It looks like a major design issue to me. When the etcdserver is blocked on fdatasync, there is no chance for raft to notify the etcdserver of the leader change.

I haven't had time to deep dive into the new feature "raft: support asynchronous storage writes"; my immediate feeling is that it should be able to resolve such issues. @tbg @nvanbenschoten Could you double confirm this? But we also need to resolve it for etcd 3.5, which still depends on the old raft package.

Note that you can reproduce this issue using the failpoint walBeforeSync or walAfterSync added in #15252. The steps to reproduce are as follows:

$ make gofail-enable
$ make build
$ goreman start   # Use the Procfile below
$ etcdctl lease grant 20
$ etcdctl lease keep-alive 3269862a9fd99a09  # Run this command in another terminal. The leaseid is returned in above command
$ curl http://127.0.0.1:2003/raftBeforeSave -XPUT -d'sleep("100s")'  # Note that you need to enable failpoint on the leader, assuming etcd3 is the leader
$ etcdctl put k1 v1   # wait about 100s, then you will see that the lease will be revoked

Procfile

etcd1: GOFAIL_HTTP="127.0.0.1:2001" bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr
etcd2: GOFAIL_HTTP="127.0.0.1:2002" bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr
etcd3: GOFAIL_HTTP="127.0.0.1:2003" bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-outputs=stderr

@ahrtr
Member

ahrtr commented Feb 7, 2023

Run make gofail-disable to clean up the codebase.

@tbg
Contributor

tbg commented Feb 9, 2023

do not get time to deep dive into the new feature etcd-io/raft#8, my immediate feeling is that it should can resolve such issue. @tbg @nvanbenschoten Could you double confirm this? But we also need to resolve it for etcd 3.5, which still depends on the old raft package.

It could help be more resilient to disk stalls, but it wasn't designed to do just that, and so this is just speculation. In CockroachDB we only use async log appends, but the log application is still in the same loop, so we don't have a ton of experience there either. Plus, we haven't shipped it. I don't understand exactly what the problem in etcd is that is described above, but it's probably similar to issues we have in CockroachDB when the disk stalls and turns into a grey failure. What we do there is detect disk stalls and fatal the process. We don't try to handle it gracefully.

@ahrtr
Member

ahrtr commented Feb 9, 2023

What we do there is detect disk stalls and fatal the process.

Thanks @tbg . This seems to be the key point. Could you be a little more specific about this?

@tbg
Contributor

tbg commented Feb 9, 2023

Sure - here's a small tour. I hope I'm not missing anything but basically

  • all logging is fallible: it is buffered, and when it can't flush in time, we exit the process [1]. This includes network logging (fluentd). We have observed grey failures due to all of these stalling.
  • our storage engine pebble detects stalled writes and exits the process [2].
  • additionally, in the liveness layer in CRDB, we also write to our disks before extending a lease [3]; this is so that we are a bit snappier in detecting badness, but it doesn't fundamentally change the behavior (a minimal sketch of this pattern follows the footnotes).
  • we perform end-to-end testing of disk stalls [4][5].

Footnotes

  1. https://github.com/cockroachdb/cockroach/blob/5024b24907e9782b2c3b8eb9e097467b2570c671/pkg/util/log/file.go#L251-L268

  2. https://github.com/cockroachdb/cockroach/blob/844a370ceced8e02fd5293a8854de964edae0794/pkg/storage/pebble.go#L1124-L1153

  3. https://github.com/cockroachdb/cockroach/blob/a28aa6cbb4b0de3ff92b4dcc74e319d2e2a220c2/pkg/kv/kvserver/liveness/liveness.go#L1278-L1295

  4. https://github.com/cockroachdb/cockroach/blob/6ed2fb0f1e98b0bdbf6d0f100bef2c1bf5dbf6de/pkg/cmd/roachtest/tests/disk_stall.go

  5. https://github.com/cockroachdb/cockroach/blob/eec5a47f4762922674feaeac74666e4d63d2a247/pkg/cmd/roachtest/tests/failover.go#L631-L638
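
For the third point above ("write to our disks before extending a lease"), a minimal generic sketch of the pattern could look like the following — this is not CockroachDB's code, and the probe path and timeout are assumptions:

package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// probeDisk writes and fsyncs a small file, failing if that does not complete
// within timeout. A generic illustration of the "probe the disk before
// extending a lease" pattern.
func probeDisk(path string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o600)
		if err != nil {
			done <- err
			return
		}
		defer f.Close()
		if _, err := f.Write([]byte("probe")); err != nil {
			done <- err
			return
		}
		done <- f.Sync()
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errors.New("disk probe timed out")
	}
}

func main() {
	if err := probeDisk("/tmp/lease-probe", 500*time.Millisecond); err != nil {
		fmt.Println("not extending the lease, disk looks unhealthy:", err)
		return
	}
	fmt.Println("disk healthy, extend the lease")
}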

@ahrtr
Member

ahrtr commented Feb 10, 2023

Thank you @tbg ! Will take a look later.

@chaochn47
Member

#13527 is related.

@ahrtr
Member

ahrtr commented Feb 22, 2023

Thanks @tbg for the input.

When there is something wrong with the storage, etcd may get stuck syncing data to the storage files. When running into this situation, the etcd member may not be able to process any requests, and may even become unresponsive while it is being removed from the cluster, e.g. issues/14338. If the blocked member is the leader, it may cause two issues:

  • Users may see two leaders, but actually there is only one valid leader. The old leader is stuck on stalled writes, so raft has no chance to notify it that the leader has changed. See 14338#issuecomment-1216006545.
  • The old leader may unexpectedly revoke all leases when it recovers from the stalled writes, and accordingly remove all keys bound to those leases. See issues/15247.

I am thinking of proposing to terminate the etcd process when it doesn't finish syncing the data (WAL or boltdb) within a configured max duration (we can define two duration thresholds, one for the WAL and the other for boltdb). When the member starts again, it will not be elected as the leader anymore if the storage issue isn't resolved, so the two issues mentioned above will not happen.
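
A minimal sketch of what such a watchdog could look like (the function name and threshold here are assumptions, not a concrete design):

package main

import (
	"log"
	"os"
	"time"
)

// syncWithWatchdog runs a storage sync (e.g. a WAL fdatasync or a boltdb
// commit) and terminates the process if it does not finish within maxWait.
// This only illustrates the proposal; it is not etcd's implementation.
func syncWithWatchdog(sync func() error, maxWait time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- sync() }()

	select {
	case err := <-done:
		return err
	case <-time.After(maxWait):
		log.Printf("storage sync did not finish within %v, terminating so a healthy member can lead", maxWait)
		os.Exit(1)
	}
	return nil // unreachable
}

func main() {
	// A sync that stalls longer than the configured threshold triggers termination.
	_ = syncWithWatchdog(func() error {
		time.Sleep(10 * time.Second) // simulated stalled fdatasync
		return nil
	}, 2*time.Second)
}

In a real design, the WAL and boltdb syncs would need separate, configurable thresholds, as proposed above.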

any immediate comments or concerns?
@mitake @ptabor @serathius @spzala @tbg
@liggitt @dims @lavalamp @wojtek-t @nikhita @neolit123 @chaochn47 @tjungblu @fuweid

If it looks overall good, then I will provide a detailed design doc.

@chaochn47
Member

chaochn47 commented Feb 22, 2023

+1 on the proposal

The issue we saw was that a kube-apiserver watch still connected to the old member, which was stuck in a disk write and serving no updates. Terminating the "stuck in disk writing" etcd process is the right thing to do IMO. It is currently fulfilled by a nanny agent but could have been done in etcd.

For data consistency risks:

  1. boltdb is less concerning, as long as the consistent_index was not persisted before terminating: the WAL replay will reapply the entries to boltdb after restart.
  2. Not sure what the side effect is if "meant to be persisted" WAL entries are lost, especially the hard state (commit index). I think the safest way to mitigate this uncertainty is to fail fast after restart. This could be achieved by removing this "bad" member permanently.

On the other hand, the linearizability test already has equivalent failure injection points (raftBeforeSave and beforeCommit) to verify correctness.

@ahrtr
Member

ahrtr commented Feb 23, 2023

I don't see any data consistency issue. I will clearly clarify & explain the details in the design doc.

Note that it's a critical issue if we see any data inconsistency in whatever situation, unless the storage hardware is broken.

@fuweid
Contributor

fuweid commented Feb 23, 2023

+1 for the proposal. Let the old leader step down and raise an alert. The admin operator can then notice that the cluster is not healthy, even if the old leader process is in D state and waiting for the PID reaper.

@mitake
Contributor

mitake commented Feb 23, 2023

+1 for @ahrtr's proposal. I think the root cause is that some part of the lease-related state transition isn't controlled by Raft, so "the old leader still handles lease keep-alive requests from the connected client" can happen, I think. This is a related discussion: #7320 (note that the related behavior is quite subtle, I'm still checking). It might be valuable to move the entire state transition to the Raft layer?

@ahrtr
Member

ahrtr commented Feb 27, 2023

Thanks all for the feedback.

The first draft design doc: https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit#

@tjungblu
Contributor

tjungblu commented Mar 3, 2023

Great proposal. I see there are a lot of comments in the doc around the specificity of this problem. I agree with @serathius, though: we should look at this more from a general failure PoV.

If it's not disk stalls, it's a slow network. Therefore I do like the split between detection and action (or remediation). Clearly we don't need to implement all of it to fix the issue at hand, but having something more generic in place will allow us to add more over time. Just to summarize some strategies that were proposed here or in the doc:

Stall Detection

Stall Remediation

  • backoff
  • step down leader, wait for next election result
  • fatal exit (supervised via liveness probes / process suicide)

@mitake
Contributor

mitake commented Apr 16, 2023

I thought about another lightweight mitigation for the issue: let LeaseRevokeRequest have a field of Raft term when it was issued. This branch has a rough implementation: https://github.com/mitake/etcd/commits/lease-revoke-term

With the change, a node (which thinks it's a leader) issues a LeaseRevokeRequest including its term information. When other nodes apply the request, they compare the term in the request with their own term. If the term from the request is older, they simply discard the request. I think it can be an ad hoc mitigation for the problem.
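
A rough sketch of the proposed check at apply time (the field and function names are invented; the real change would extend LeaseRevokeRequest in the protobuf definition and therefore the WAL format, as noted in the cons below):

package main

import "fmt"

// leaseRevokeRequest is a stand-in for a LeaseRevokeRequest extended with the
// issuing node's raft term.
type leaseRevokeRequest struct {
	LeaseID int64
	Term    uint64
}

// applyLeaseRevoke discards revoke requests that were issued under an older
// term, i.e. by a node that had already lost leadership when it proposed them.
func applyLeaseRevoke(req leaseRevokeRequest, currentTerm uint64, revoke func(int64) error) error {
	if req.Term < currentTerm {
		return nil // stale leader's revoke: ignore it
	}
	return revoke(req.LeaseID)
}

func main() {
	var revoked []int64
	revoke := func(id int64) error {
		revoked = append(revoked, id)
		return nil
	}

	_ = applyLeaseRevoke(leaseRevokeRequest{LeaseID: 1, Term: 3}, 5, revoke) // stale term: ignored
	_ = applyLeaseRevoke(leaseRevokeRequest{LeaseID: 2, Term: 5}, 5, revoke) // current term: applied

	fmt.Println("revoked leases:", revoked) // [2]
}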

I briefly tested the above branch with the failpoint testing (#15247 (comment)) and it seems to be working.

Pros:

  • Implementation is quite simple

Cons:

  • It will change the WAL file format because of the new field.
  • Using term information inside the state machine is unusual; we need to be careful about whether it's actually OK or not.

I think a similar thing can be done for lease renewal by adding term information to LeaseKeepAliveRequest.


This approach is a very ad hoc and dirty change, so I'm not sure it's worth trying. I think the watchdog approach will be good for defending against other issues caused by slow I/O (and this approach doesn't conflict with the watchdog).

If you have comments, please let me know.

@serathius
Member

Looking at the scope this has become a whole new feature for etcd to detect getting stuck on disk.

@ahrtr
Member

ahrtr commented May 10, 2023

Looking at the scope this has become a whole new feature for etcd to detect getting stuck on disk.

It isn't a blocker for 3.4.26 and 3.5.9. The big challenge is to sort out how to prevent leases from being mistakenly revoked. I will do a deep dive into it once we have more confidence in bbolt.

@mitake's comment above (#15247 (comment)) makes some sense, but there might be some potential compatibility issues. Will dig into it later as well.

@ahrtr
Member

ahrtr commented Dec 19, 2023

Could anyone help to backport #16822 to 3.5 and 3.4?

@amit-rastogi

Hi, I was looking for some good first issues to start with. I would like to pick this up if no one else is currently working on this. Please let me know.

@jmhbnz
Member

jmhbnz commented Dec 29, 2023

/assign @amit-rastogi

Thanks Amit - Assigned to you to complete the backports, please let us know if you have any questions 🙏🏻

@yitian108

yitian108 commented Jan 26, 2024

Not sure if it is related to #17305, @ahrtr? But my k8s is standalone, not a cluster.
PS: the etcd version is 3.4.13

@ahrtr
Member

ahrtr commented Jan 26, 2024

@yitian108

  • Both 3.4 and 3.5 have this issue as well, and we need to backport the fix to both of them.
  • What's your lease's TTL?
  • Could you double check whether you can reproduce the issue using the main branch? We have already fixed this issue on the main branch in "Ignore old leader's leases revoking request" #16822; a rough sketch of the idea follows this list.
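
A rough, simplified sketch of the idea behind that fix (the names are invented and this is not etcd's actual code): before proposing the revocations collected by the local lessor's expiry loop, the member verifies that it is still the leader and otherwise drops them — matching the "Ignore the lease revoking request because current member isn't a leader" warning visible in the server logs later in this thread:

package main

import "log"

// expiredLease is a stand-in for a lease that the local lessor considers expired.
type expiredLease struct{ ID int64 }

// revokeExpiredLeases only proposes revocations while this member is still the
// leader; checkLeadership stands in for a leadership check (e.g. via ReadIndex).
func revokeExpiredLeases(expired []expiredLease, checkLeadership func() bool, propose func(int64)) {
	if !checkLeadership() {
		log.Println("ignore the lease revoking request because current member isn't a leader")
		return
	}
	for _, l := range expired {
		propose(l.ID)
	}
}

func main() {
	expired := []expiredLease{{ID: 10}, {ID: 11}}
	// An old leader recovering from a stalled fdatasync fails the leadership
	// check, so no revocations are proposed and the leases survive.
	revokeExpiredLeases(expired, func() bool { return false }, func(id int64) {
		log.Println("proposing revoke for lease", id)
	})
}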

@yitian108

yitian108 commented Jan 29, 2024

Hi @ahrtr , sorry for the delay. Our lease's TTL is 15s. Sorry for the confusion, but the etcd server version is actually based on the image: FROM k8s.gcr.io/etcd:3.5.0. I will try to replace the image with gcr.io/etcd-development/etcd:v3.5.11 and test the case.

Unfortunately, when I use gcr.io/etcd-development/etcd:v3.5.11 as the base image, the docker build command reports:
runc run failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory
Perhaps I should consider using another image that includes the fixed functionality from the main branch.

@ahrtr
Member

ahrtr commented Jan 29, 2024

exec: "/bin/sh": stat /bin/sh: no such file or directory

Please read etcd-io/website#640

I will try to replace the image with gcr.io/etcd-development/etcd:v3.5.11 and test the case.

We haven't backported the fix to 3.5 and 3.4 yet. Could you try the main branch?

@yitian108

Thank you @ahrtr for the guidance. For the main branch, can I follow the steps below to build the images?

  1. Clone the main branch:
    git clone https://github.com/etcd-io/etcd.git

  2. Build Amd64 architecture:
    ARCH=amd64 GOOS=linux make build

  3. Build the image:
    cd bin
    cp "../Dockerfile-release.amd64" Dockerfile
    docker build -f Dockerfile -t my/etcd:3.5.11 .

Alternatively, after finishing the first step and building the source code, I will generate the binary program 'etcd'. Then, I will replace the old binary program 'etcd' in the existing image 'k8s.gcr.io/etcd:3.5.0'.

These are the two ways that could be used for the main branch. If I am wrong, please correct me.

In general, where can we obtain the official image after backporting the fix to 3.5 and 3.4?

@ahrtr
Member

ahrtr commented Jan 29, 2024

  • Clone the main branch:
    git clone https://github.com/etcd-io/etcd.git
  • Build Amd64 architecture:
    ARCH=amd64 GOOS=linux make build
  • Build the image:
    cd bin
    cp "../Dockerfile-release.amd64" Dockerfile
    docker build -f Dockerfile -t my/etcd:3.5.11 .

Yes, it should work.

In general, where can we obtain the official image after backporting the fix to 3.5 and 3.4?

etcd uses gcr.io/etcd-development/etcd as a primary container registry, and quay.io/coreos/etcd as secondary.

Please refer to https://github.com/etcd-io/etcd/releases

@yitian108

Hi @ahrtr , unfortunately, when executing the build command ARCH=amd64 GOOS=linux make build, it shows:
which: no yamllint in (/usr/local/go/bin:/usr/local/mysql/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/czz/FlameGraph:/usr/local/sbin:/usr/local/libexec)
which: no yamlfmt in (/usr/local/go/bin:/usr/local/mysql/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/czz/FlameGraph:/usr/local/sbin:/usr/local/libexec)
which: no golangci-lint in (/usr/local/go/bin:/usr/local/mysql/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/czz/FlameGraph:/usr/local/sbin:/usr/local/libexec)
GO_BUILD_FLAGS=" -v -mod=readonly" ./scripts/build.sh
Running etcd_build
% rm -f bin/etcd
% (cd server && env CGO_ENABLED=0 GO_BUILD_FLAGS= -v -mod=readonly GOOS=linux GOARCH=amd64 go build -v -mod=readonly -trimpath -installsuffix=cgo -ldflags=-X=go.etcd.io/etcd/api/v3/version.GitSHA=58846bd8f -o=../bin/etcd .)
stderr: ../client/v3/example_auth_test.go:1:1: expected 'package', found '.'
stderr: ../client/v3/concurrency/example_election_test.go:1:1: expected 'package', found '.'
FAIL: (code:1): % (cd server && env CGO_ENABLED=0 GO_BUILD_FLAGS= -v -mod=readonly GOOS=linux GOARCH=amd64 go build -v -mod=readonly -trimpath -installsuffix=cgo -ldflags=-X=go.etcd.io/etcd/api/v3/version.GitSHA=58846bd8f -o=../bin/etcd .)
FAIL: etcd_build (GOARCH=amd64)
make: *** [build] Error 2

It seems that some package versions are mismatched; as the following picture shows, the Go file is missing a package declaration.
PS: the Go version is 1.18.

[screenshot of the build errors showing "expected 'package', found '.'"]

@yitian108

Hi @ahrtr, the above errors may have come from the old Go version; after I updated the Go version, the build appears to work.

To test the issue further, I reproduced it with the steps below:

  1. clone the main branch
    git clone https://github.com/etcd-io/etcd.git
  2. ARCH=amd64 GOOS=linux make build
  3. cd bin && cp ../Dockerfile .
  4. docker build -t my/etcd:3.5.11 .

Going through the above 4 steps, the image my/etcd:3.5.11 will be created. Since /bin/sh was removed from the image in #15173, I couldn't build a new image based on my/etcd:3.5.11 with a customized shell script file. So I just copied the 3 files etcd/etcdctl/etcdutl into my old image k8s.gcr.io/etcd:3.5.0, built a new image based on it (suppose the new image name is k8s.gcr.io/etcd:3.5.11), and then started to test the case again.

To test the new image, I prepared some test tasks:

  1. Edit the shell script for writing etcd keys:
#!/bin/bash

for i in {1..100000}
do
etcdctl --endpoints=xx.xxx.xxx.xx:14592 put "t$i" "This value from the key $i zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"
sleep 0.01
done

  2. Edit the shell script for reading etcd keys:
#!/bin/bash

for i in {1..10000}
do
etcdctl --endpoints=xx.xxx.xxx.xx:14592 get "t$i"
sleep 0.01
done
  3. Prepare random writes with the fio command:
    fio -output=/tmp/100R100W -name=100R100W -filename=/dev/sda -ioengine=sync -direct=1 -blocksize=2M -size=10GB -rw=randwrite -iodepth=8 -numjobs=1
  4. Prepare mixed random reads and writes with the fio command:
    fio -filename=/dev/sda -direct=1 -iodepth 1 -thread -rw=randrw -rwmixread=70 -ioengine=psync -bs=1m -size=10G -numjobs=10 -runtime=60 -group_reporting -name=mytest -ioscheduler=noop
  5. Prepare to watch the etcd log.

To start the test:

  1. Open 5 terminal windows according to the preparation tasks.
  2. Execute the commands of steps 3 & 4 almost at the same time.
  3. Execute the commands of steps 1 & 2 almost at the same time.
  4. Watch the output of steps 1 & 2 and watch the logs of the etcd server.
    [Set ETCD Key Log]
OK
OK
OK
OK
OK
{"level":"warn","ts":"2024-01-31T16:54:29.302+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-d334eb87-ef03-4df8-878a-e2cbe0fa26a5/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
OK
{"level":"warn","ts":"2024-01-31T16:54:36.650+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-6cdc8b1b-c76d-4f08-a080-1b19acb8766d/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
OK
{"level":"warn","ts":"2024-01-31T16:54:42.026+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-41a874b2-4f28-4b4b-bd26-0ec691d0f6a1/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:47.054+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-13d7317a-b213-41d8-be66-de6bcc5e03d3/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:52.115+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-302264df-fc43-437c-9736-8ceb464ad668/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:57.144+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-08d31c7b-2023-4dd5-91e8-7e4e641bc1d6/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

[Get ETCD Key Log]

t100
This value from the key 100 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
{"level":"warn","ts":"2024-01-31T16:54:29.292+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-040ed215-c525-4a4f-9273-43b3568ac8c8/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
t102
This value from the key 102 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
{"level":"warn","ts":"2024-01-31T16:54:36.649+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-0aa2d963-61de-4985-9094-136e392faa14/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:41.675+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2546fd54-7af0-4083-9b3f-8b07283699d4/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:46.700+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-de4ee4b3-720e-4b6e-8303-6f9f6176e627/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:51.725+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e8d38d3b-abad-45b6-9a4b-8f892a59704b/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
{"level":"warn","ts":"2024-01-31T16:54:56.747+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-f0b9361c-fa16-46b2-87f2-e6e572201bbe/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded

[ETCD Server Log]

{"level":"warn","ts":"2024-01-31T16:54:24.855254+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:25.356242+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:25.856359+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:26.357513+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:26.857696+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:27.358825+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:27.859296+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:28.359429+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:29.303137+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:24.304546+0800","time spent":"4.998551022s","remote":"xx.xxx.xxx.xx:61995","response type":"/etcdserverpb.KV/Put","request count":1,"request size":829,"response count":0,"response size":0,"request content":"key:\"t6\" value_size:822 "}
{"level":"warn","ts":"2024-01-31T16:54:31.018713+0800","caller":"wal/wal.go:816","msg":"slow fdatasync","took":"6.714036315s","expected-duration":"1s"}
{"level":"warn","ts":"2024-01-31T16:54:31.040403+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"6.746322608s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"t101\" ","response":"","error":"rangeKeys: context cancelled: context canceled"}
{"level":"info","ts":"2024-01-31T16:54:31.604168+0800","caller":"traceutil/trace.go:172","msg":"trace[1927508231] range","detail":"{range_begin:t101; range_end:; }","duration":"7.310088226s","start":"2024-01-31T16:54:24.294057+0800","end":"2024-01-31T16:54:31.604145+0800","steps":["trace[1927508231] 'range keys from in-memory index tree'  (duration: 6.746222684s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:31.604277+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:24.29402+0800","time spent":"7.310222566s","remote":"xx.xxx.xxx.xx:61993","response type":"/etcdserverpb.KV/Range","request count":0,"request size":6,"response count":0,"response size":0,"request content":"key:\"t101\" "}
{"level":"warn","ts":"2024-01-31T16:54:31.223615+0800","caller":"etcdserver/v3_server.go:897","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947687828599055,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-01-31T16:54:31.604363+0800","caller":"etcdserver/v3_server.go:909","msg":"timed out waiting for read index response (local node might have slow network)","timeout":"7s"}
{"level":"warn","ts":"2024-01-31T16:54:31.604412+0800","caller":"etcdserver/v3_server.go:874","msg":"ignored out-of-date read index response; local node read indexes queueing up and waiting to be in sync with leader","sent-request-id":4822947687828599058,"received-request-id":4822947687828599055}
{"level":"warn","ts":"2024-01-31T16:54:31.355484+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:31.604447+0800","caller":"etcdserver/server.go:862","msg":"Ignore the lease revoking request because current member isn't a leader","local-member-id":8834979854806041326}
{"level":"info","ts":"2024-01-31T16:54:31.618665+0800","caller":"traceutil/trace.go:172","msg":"trace[1553294244] put","detail":"{key:t7; req_size:829; response_revision:9641; }","duration":"2.289589352s","start":"2024-01-31T16:54:29.329056+0800","end":"2024-01-31T16:54:31.618646+0800","steps":["trace[1553294244] 'process raft request'  (duration: 2.289500465s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:31.618762+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:29.329036+0800","time spent":"2.289699725s","remote":"xx.xxx.xxx.xx:35067","response type":"/etcdserverpb.KV/Put","request count":1,"request size":829,"response count":0,"response size":28,"request content":"key:\"t7\" value_size:822 "}
{"level":"warn","ts":"2024-01-31T16:54:31.61895+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"2.302181338s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"t102\" ","response":"range_response_count:1 size:851"}
{"level":"warn","ts":"2024-01-31T16:54:31.618989+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"3.992991745s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/ms/group/1\" limit:32533 ","response":"range_response_count:1 size:63"}
{"level":"info","ts":"2024-01-31T16:54:31.619038+0800","caller":"traceutil/trace.go:172","msg":"trace[2020392476] range","detail":"{range_begin:/ms/group/1; range_end:; response_count:1; response_revision:9641; }","duration":"3.993042259s","start":"2024-01-31T16:54:27.625985+0800","end":"2024-01-31T16:54:31.619027+0800","steps":["trace[2020392476] 'agreement among raft nodes before linearized reading'  (duration: 3.992941947s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:31.619089+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:27.625961+0800","time spent":"3.993113588s","remote":"xx.xxx.xxx.xx:51975","response type":"/etcdserverpb.KV/Range","request count":0,"request size":17,"response count":1,"response size":86,"request content":"key:\"/ms/group/1\" limit:32533 "}
{"level":"warn","ts":"2024-01-31T16:54:31.619208+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"3.993586072s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/ms/group/2\" limit:32575 ","response":"range_response_count:1 size:63"}
{"level":"info","ts":"2024-01-31T16:54:31.619001+0800","caller":"traceutil/trace.go:172","msg":"trace[1202781310] range","detail":"{range_begin:t102; range_end:; response_count:1; response_revision:9641; }","duration":"2.302243319s","start":"2024-01-31T16:54:29.316748+0800","end":"2024-01-31T16:54:31.618991+0800","steps":["trace[1202781310] 'agreement among raft nodes before linearized reading'  (duration: 2.30210135s)"],"step_count":1}
{"level":"info","ts":"2024-01-31T16:54:31.619243+0800","caller":"traceutil/trace.go:172","msg":"trace[2121909998] range","detail":"{range_begin:/ms/group/2; range_end:; response_count:1; response_revision:9641; }","duration":"3.993623372s","start":"2024-01-31T16:54:27.625612+0800","end":"2024-01-31T16:54:31.619235+0800","steps":["trace[2121909998] 'agreement among raft nodes before linearized reading'  (duration: 3.993537659s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:31.61928+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:29.316717+0800","time spent":"2.302540326s","remote":"xx.xxx.xxx.xx:35056","response type":"/etcdserverpb.KV/Range","request count":0,"request size":6,"response count":1,"response size":874,"request content":"key:\"t102\" "}
{"level":"warn","ts":"2024-01-31T16:54:31.619292+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:27.625568+0800","time spent":"3.993714699s","remote":"xx.xxx.xxx.xx:55357","response type":"/etcdserverpb.KV/Range","request count":0,"request size":17,"response count":1,"response size":86,"request content":"key:\"/ms/group/2\" limit:32575 "}
{"level":"info","ts":"2024-01-31T16:54:36.496593+0800","caller":"fileutil/purge.go:96","msg":"purged","path":"/data/etcd/member/snap/0000000000000002-0000000000004bc8.snap"}
{"level":"warn","ts":"2024-01-31T16:54:36.650245+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"4.99877509s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"t103\" ","response":"","error":"context canceled"}
{"level":"info","ts":"2024-01-31T16:54:36.650336+0800","caller":"traceutil/trace.go:172","msg":"trace[948427076] range","detail":"{range_begin:t103; range_end:; }","duration":"4.998869058s","start":"2024-01-31T16:54:31.651442+0800","end":"2024-01-31T16:54:36.650311+0800","steps":["trace[948427076] 'agreement among raft nodes before linearized reading'  (duration: 4.998770603s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:36.650394+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:31.651411+0800","time spent":"4.998969357s","remote":"xx.xxx.xxx.xx:46974","response type":"/etcdserverpb.KV/Range","request count":0,"request size":6,"response count":0,"response size":0,"request content":"key:\"t103\" "}
{"level":"warn","ts":"2024-01-31T16:54:36.650946+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:31.651846+0800","time spent":"4.999080243s","remote":"xx.xxx.xxx.xx:46976","response type":"/etcdserverpb.KV/Put","request count":1,"request size":829,"response count":0,"response size":0,"request content":"key:\"t8\" value_size:822 "}
{"level":"warn","ts":"2024-01-31T16:54:36.998188+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"5.377522133s","expected-duration":"100ms","prefix":"","request":"header:<ID:4822947687828599060 > lease_revoke:<id:42ee8d5e77509cdf>","response":"size:28"}
{"level":"info","ts":"2024-01-31T16:54:36.998396+0800","caller":"traceutil/trace.go:172","msg":"trace[900412583] linearizableReadLoop","detail":"{readStateIndex:19611; appliedIndex:19610; }","duration":"5.377666816s","start":"2024-01-31T16:54:31.620686+0800","end":"2024-01-31T16:54:36.998352+0800","steps":["trace[900412583] 'read index received'  (duration: 39.87µs)","trace[900412583] 'applied index is now lower than readState.Index'  (duration: 5.377625784s)"],"step_count":2}
{"level":"info","ts":"2024-01-31T16:54:36.998599+0800","caller":"traceutil/trace.go:172","msg":"trace[764521200] put","detail":"{key:t9; req_size:829; response_revision:9643; }","duration":"321.174018ms","start":"2024-01-31T16:54:36.677411+0800","end":"2024-01-31T16:54:36.998585+0800","steps":["trace[764521200] 'process raft request'  (duration: 321.119933ms)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:36.998652+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"5.37867209s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/ms/group/\" range_end:\"/ms/group0\" ","response":"range_response_count:2 size:119"}
{"level":"info","ts":"2024-01-31T16:54:36.998702+0800","caller":"traceutil/trace.go:172","msg":"trace[1755538231] range","detail":"{range_begin:/ms/group/; range_end:/ms/group0; response_count:2; response_revision:9643; }","duration":"5.378739001s","start":"2024-01-31T16:54:31.619954+0800","end":"2024-01-31T16:54:36.998693+0800","steps":["trace[1755538231] 'agreement among raft nodes before linearized reading'  (duration: 5.378594853s)"],"step_count":1}
{"level":"warn","ts":"2024-01-31T16:54:36.998745+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:31.619903+0800","time spent":"5.378829516s","remote":"xx.xxx.xxx.xx:51975","response type":"/etcdserverpb.KV/Range","request count":0,"request size":24,"response count":2,"response size":142,"request content":"key:\"/ms/group/\" range_end:\"/ms/group0\" "}
{"level":"warn","ts":"2024-01-31T16:54:36.998663+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:36.677403+0800","time spent":"321.241496ms","remote":"xx.xxx.xxx.xx:56786","response type":"/etcdserverpb.KV/Put","request count":1,"request size":829,"response count":0,"response size":28,"request content":"key:\"t9\" value_size:822 "}
{"level":"warn","ts":"2024-01-31T16:54:39.430229+0800","caller":"wal/wal.go:816","msg":"slow fdatasync","took":"2.400318935s","expected-duration":"1s"}
{"level":"warn","ts":"2024-01-31T16:54:41.675186+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"4.99863799s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"t104\" ","response":"","error":"context canceled"}

From the server log, we can see "slow fdatasync" and "waiting for ReadIndex response took too long", etc. Does that mean the issue still exists?

PS: I am not sure whether copying only the 3 files etcd/etcdctl/etcdutl is enough for the test; the etcd pod's info is as below:

# kubectl exec -it -n puc etcd-25-5556ccfc8f-8rc89 -- etcd --version
etcd Version: 3.6.0-alpha.0
Git SHA: 58846bd8f
Go Version: go1.21.6
Go OS/Arch: linux/amd64

@ahrtr
Member

ahrtr commented Jan 31, 2024

From the server log, we can see the slow fdatasync and waiting for ReadIndex response took too long, etc. Does it mean the issue still exists?

No, it doesn't reproduce the issue. Please refer to

@yitian108

Does that mean what I raised is another issue?

@ahrtr
Member

ahrtr commented Feb 1, 2024

From the server log, we can see the slow fdatasync and waiting for ReadIndex response took too long,

It isn't an issue. It just complained about the performance of the disk and/or the network communication between peers.

@yitian108

yitian108 commented Feb 2, 2024

Hi @ahrtr , we can also see the logs:

{"level":"warn","ts":"2024-01-31T16:54:54.492078+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:47.491377+0800","time spent":"7.000695668s","remote":"100.103.215.183:43860","response type":"/etcdserverpb.Lease/LeaseRevoke","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:54.525051+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:47.524069+0800","time spent":"7.000975915s","remote":"100.103.215.201:40578","response type":"/etcdserverpb.Lease/LeaseRevoke","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:54.553334+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:47.552291+0800","time spent":"7.00103091s","remote":"100.103.215.163:52356","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:54.988081+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:47.986846+0800","time spent":"7.001222065s","remote":"xx.xxx.xxx.xx:45079","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:55.039296+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:48.038347+0800","time spent":"7.000941899s","remote":"100.103.215.202:58456","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
2024/01/31 16:54:58 WARNING: [core] [Server #3] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
{"level":"warn","ts":"2024-01-31T16:54:55.086478+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:48.085406+0800","time spent":"7.00106384s","remote":"100.103.215.205:56334","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:55.716188+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:48.71513+0800","time spent":"7.001051226s","remote":"xx.xxx.xxx.xx:53463","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"warn","ts":"2024-01-31T16:54:56.350418+0800","caller":"v3rpc/interceptor.go:200","msg":"request stats","start time":"2024-01-31T16:54:51.792424+0800","time spent":"4.557982024s","remote":"100.103.215.202:44474","response type":"/etcdserverpb.Lease/LeaseGrant","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
2024/01/31 16:54:58 WARNING: [core] [Server #3] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
{"level":"warn","ts":"2024-01-31T16:54:56.361035+0800","caller":"wal/wal.go:816","msg":"slow fdatasync","took":"6.81862586s","expected-duration":"1s"}
{"level":"warn","ts":"2024-01-31T16:54:56.366378+0800","caller":"txn/util.go:93","msg":"apply request took too long","took":"4.372258612s","expected-duration":"100ms","prefix":"","request":"header:<ID:4822947687828599072 > lease_revoke:<id:42ee8d5e774db0c7>","response":"size:28"}
{"level":"warn","ts":"2024-01-31T16:54:56.543459+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543452+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.54347+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543483+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543502+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543505+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543505+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543523+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543517+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543524+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}
{"level":"warn","ts":"2024-01-31T16:54:56.543529+0800","caller":"etcdserver/server.go:906","msg":"Failed to check current member's leadership","error":"context deadline exceeded"}

During this period, all services that depend on the etcd service were restarted. I picked the log from one of these services:

{"level":"warn","ts":"2024-01-31T16:54:44.043+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-33a5a824-a274-4df9-ad14-64a2e526490f/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
{"level":"warn","ts":"2024-01-31T16:54:45.720+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-51a48217-3b7d-43be-a7e8-5ea185f5b18f/xx.xxx.xxx.xx:14592","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
{"level":"warn","ts":"2024-01-31T16:54:51.070+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-33a5a824-a274-4df9-ad14-64a2e526490f/xx.xxx.xxx.xx:14592","attempt":1,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
{"level":"warn","ts":"2024-01-31T16:54:52.043+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-33a5a824-a274-4df9-ad14-64a2e526490f/xx.xxx.xxx.xx:14592","attempt":2,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2024-01-31 16:54:52.043647 I | role callback leader false
2024-01-31 16:54:52.043689 I | recv service active exit

I can understand that it was related to disk performance, but since etcd and all the services are deployed on the same server, there should not be a network issue. This should be a problem caused by high I/O, which led to etcd missing the lease and consequently caused all the services to be restarted.

@ahrtr
Member

ahrtr commented Feb 22, 2024

@ivanvc Please add a change for both 3.4 and 3.5; once it gets merged, we can close this ticket. Thx.

Regarding #15247 (comment), I don't see an easy & cheap solution for now. I expect the Livez feature can work around it.

@ahrtr
Member

ahrtr commented Feb 23, 2024

All done. Thanks @ivanvc for backporting the fix to 3.5 and 3.4!

@ahrtr ahrtr closed this as completed Feb 23, 2024
ti-chi-bot bot pushed a commit to tikv/pd that referenced this issue Apr 7, 2024
ref #5520, close #8018

Upgrade etcd to v3.4.31 to fix the etcd issue: etcd-io/etcd#15247

Signed-off-by: dongmen <414110582@qq.com>
ti-chi-bot bot pushed a commit to tikv/pd that referenced this issue May 22, 2024
ref #5520, close #8018

Upgrade etcd to v3.4.31 to fix the etcd issue: etcd-io/etcd#15247

Signed-off-by: husharp <jinhao.hu@pingcap.com>

Co-authored-by: husharp <jinhao.hu@pingcap.com>
Co-authored-by: lhy1024 <liuhanyang@pingcap.com>