Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: ignore old leader's request to revoke lease #12531

Closed
wants to merge 2 commits into from

Conversation

tangcong
Copy link
Contributor

@tangcong tangcong commented Dec 8, 2020

No description provided.

@tangcong tangcong marked this pull request as draft December 8, 2020 16:50
@tangcong tangcong marked this pull request as ready for review December 23, 2020 09:22
@tangcong tangcong changed the title [DO NOT REVIEW] etcdserver: ignore old leader's request to revoke lease etcdserver: ignore old leader's request to revoke lease Dec 23, 2020
@tangcong
Copy link
Contributor Author

tangcong commented Dec 23, 2020

Fix #12528 (old leader still revokes lease after it steped to follower if cpu or disk io latency is high) @jpbetz @gyuho

+16:57:52 etcd1 | {"level":"warn","ts":"2020-12-14T16:57:52.831+0800","caller":"etcdserver/server.go:1055","msg":"ignore older leader revoke lease request","lease-id":"32697660774ba905","isLeader":true,"isLeaderInRaftStateMachine":false}
+16:57:54 etcd1 | {"level":"warn","ts":"2020-12-14T16:57:54.686+0800","caller":"etcdserver/raft.go:376","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"fd422379fda50e48","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"4.889343035s"}
+16:57:54 etcd1 | {"level":"warn","ts":"2020-12-14T16:57:54.686+0800","caller":"etcdserver/raft.go:376","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"91bc3c398fb3c146","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"4.889457147s"}

server/etcdserver/server.go Outdated Show resolved Hide resolved
@tangcong
Copy link
Contributor Author

updated. thanks. @gyuho semaphores ci is failed, It seems that the failed test case is not related to this PR. my local e2e test is ok.

Copy link

@jiapeish jiapeish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. We've verified this PR and it takes effect.

server/etcdserver/server.go Show resolved Hide resolved
@jiapeish
Copy link

I noticed this PR is here for almost 1 month... Are there any more reviewers?

Copy link
Contributor

@ptabor ptabor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. PRs that are not green on the list of PRs have lower chances to get attention.
    I know its due to test flakiness, but still recommend running 'git commit --amend; git push' to trigger the re-test.

  2. It's hard here, but it would be good to have a test-case (at least a unit one) that shows this scenario.

  3. I'm new to this code, but it seems unnatural that the decision to expire lease is not just propagated by RAFT... Alternatively the decision to revoke lease might contains term and its only executed (accepted by current leader in ) if the senders term matches the current term. For my education: were this option considered and refused for some reasons ?

Nit: 4. Please squash the commits.

server/etcdserver/server.go Show resolved Hide resolved
@tangcong tangcong closed this Feb 26, 2021
@tangcong tangcong deleted the fix-old-leader-revoker branch February 26, 2021 18:57
@tangcong tangcong restored the fix-old-leader-revoker branch February 26, 2021 18:57
@tangcong tangcong deleted the fix-old-leader-revoker branch February 26, 2021 19:00
@tangcong tangcong restored the fix-old-leader-revoker branch April 22, 2021 02:46
@tangcong tangcong reopened this Apr 22, 2021
@tangcong tangcong force-pushed the fix-old-leader-revoker branch 2 times, most recently from 6b84b7b to 714c68c Compare April 22, 2021 04:16
@codecov-commenter
Copy link

codecov-commenter commented Apr 22, 2021

Codecov Report

Merging #12531 (147050f) into master (344c9f3) will decrease coverage by 12.94%.
The diff coverage is 20.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #12531       +/-   ##
===========================================
- Coverage   70.75%   57.80%   -12.95%     
===========================================
  Files         429      426        -3     
  Lines       34111    34020       -91     
===========================================
- Hits        24135    19666     -4469     
- Misses       8076    12542     +4466     
+ Partials     1900     1812       -88     
Flag Coverage Δ
all 57.80% <20.00%> (-12.95%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/etcdserver/server.go 67.70% <20.00%> (-12.26%) ⬇️
client/v3/txn.go 0.00% <0.00%> (-100.00%) ⬇️
client/v3/compact_op.go 0.00% <0.00%> (-100.00%) ⬇️
client/v3/ordering/util.go 0.00% <0.00%> (-100.00%) ⬇️
client/v3/namespace/util.go 0.00% <0.00%> (-100.00%) ⬇️
client/v3/naming/endpoints/endpoints.go 0.00% <0.00%> (-100.00%) ⬇️
...ver/proxy/grpcproxy/adapter/lock_client_adapter.go 0.00% <0.00%> (-100.00%) ⬇️
client/v3/leasing/util.go 0.00% <0.00%> (-98.04%) ⬇️
pkg/report/report.go 0.00% <0.00%> (-95.58%) ⬇️
...er/proxy/grpcproxy/adapter/watch_client_adapter.go 7.14% <0.00%> (-92.86%) ⬇️
... and 183 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 344c9f3...147050f. Read the comment docs.

@xiang90
Copy link
Contributor

xiang90 commented Apr 23, 2021

@tangcong can you push again to re-trigger travis ci?

@tangcong
Copy link
Contributor Author

Procfile.gofail file

# Use goreman to run `go get github.com/mattn/goreman`
etcd1: GOFAIL_HTTP="127.0.0.1:1111" bin/etcd --name infra1 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:12380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-level=debug
etcd2: GOFAIL_HTTP="127.0.0.1:1112" bin/etcd --name infra2 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-level=debug
etcd3: GOFAIL_HTTP="127.0.0.1:1113" bin/etcd --name infra3 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:12380,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof --logger=zap --log-level=debug

reproduce.sh

#!/bin/bash
# gofail enable,build etcd
cd ./server && go get github.com/etcd-io/gofail/runtime && cd ../ 
cd ./etcdctl && go get github.com/etcd-io/gofail && cd ../
FAILPOINTS="true" make build

# start etcd and enable the FAILPOINT HTTP server
nohup goreman -f Procfile.gofail start &

# waiting for etcd ready
members=( 2379 22379 32379)
for port in ${members[@]}
do
echo $port
echo "Waiting for etcd $port ready..."
while ! nc -z 127.0.0.1 $port; do
  sleep 1
done
done

# grant lease,put key,keep-alive
leaseid=`etcdctl lease grant 3 | awk -F" " '{ print $2}'`
echo $leaseid
etcdctl put hello world --lease $leaseid
members=( 2379 22379 32379)
for port in ${members[@]}
do
nohup etcdctl --endpoints=http://127.0.0.1:$port lease keep-alive $leaseid > lease.log 2>&1 &
done

# inject disk-io delay failure into leader
members=( 2379 22379 32379)
gofailport=1111
for port in ${members[@]}
do
leader=`curl -s 127.0.0.1:$port/metrics | grep etcd_server_is_leader | tail -1 | grep 1`
echo $port,$leader
if [ -n "$leader" ]; then
    echo "127.0.0.1:$port is leader","gofail http addr is",$gofailport
    echo "inject disk-io delay failure into leader"
    curl http://127.0.0.1:$gofailport/etcdserver/raftAfterSave -XPUT -d'sleep(4000)'
    curl http://127.0.0.1:$gofailport
    etcdctl put hello1 world2
    break
fi
gofailport=$(($gofailport+1))
done

# wait result
echo "wait for result:"
sleep 5
grep -rn leader-in-raft-state-machine nohup.out

#{"level":"warn","ts":"2021-04-23T16:21:09.897+0800","caller":"etcdserver/server.go:1091",
# "msg":"ignore older leader revoke lease request",
# "lease-id":"326978fdcba0b504","is-leader":true,"is-leader-in-raft-state-machine":false}

reproduce the issue, verify the pr

./reproduce.sh

@ptabor it is hard to add integration and unit test case, I provide a simple script to inject disk-io delay failure into the leader, it can reproduce the issue verify the pr correctness.
if necessary,maybe we can add a case to simulate this problem in the functional test in another pr.

@tangcong
Copy link
Contributor Author

@tangcong can you push again to re-trigger travis ci?

done, travis ci is green. @xiang90 @gyuho

@xiang90
Copy link
Contributor

xiang90 commented Apr 23, 2021

@tangcong

Adding it into the functional test should be good enough (translating your shell script into a Go program basically).

@jiapeish
Copy link

From a user's perspective, it would be better to solve this issue first and then put a unit-test later in an independent PR.

@tangcong
Copy link
Contributor Author

@tangcong

Adding it into the functional test should be good enough (translating your shell script into a Go program basically).

done. please see pr #12898. thanks. @xiang90

@xiang90
Copy link
Contributor

xiang90 commented May 6, 2021

@tangcong Resolve the conflict? Also, why does the test still pass after adding #12898?

@tangcong
Copy link
Contributor Author

tangcong commented May 7, 2021

@xiang90 FAILPOINT TEST CASE is not enabled by default, I guess it may be too flaky, so my new FAILPOINT test case is also disabled. I'm not sure if etcd release team ran FAILPOINT TEST CASE in other environments when they released the new version of etcd. There was a bug( fixed by #12898) in it before and it seemed to be unable to run.

@tangcong
Copy link
Contributor Author

@xiang90 @gyuho PTAL. thanks. May 17th as code freeze for the 3.5 release.

@jiapeish
Copy link

jiapeish commented May 21, 2021

It seems that we missed the v3.5 code freeze...
@xiang90 @gyuho I was wondering if you can have a review of this PR again? It's important for users like us who have a strict production environment.

@xiang90
Copy link
Contributor

xiang90 commented May 26, 2021

@tangcong @jiapeish

If the raft routine also blocks for some time, the isLeader might still return true while a new leader is already elected?

I think it is safter to:

  1. only apply lease revoke request from the current term. if a revoke is from a previous leader, just ignore it.
    or 2. not forward lease revoke request to raft leader (I am not sure if this is easy to implement or not)

@xiang90
Copy link
Contributor

xiang90 commented May 26, 2021

@tangcong @jiapeish

This is a bug fix rahter than a new feature. So we do not need to worry about the releases.

@stale
Copy link

stale bot commented Aug 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

None yet

6 participants