Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in warnOfExpensiveReadOnlyTxnRequest due to process pause #14891

Closed
aphyr opened this issue Dec 5, 2022 · 2 comments · Fixed by #14895
Closed

Segfault in warnOfExpensiveReadOnlyTxnRequest due to process pause #14891

aphyr opened this issue Dec 5, 2022 · 2 comments · Fixed by #14895
Assignees
Labels

Comments

@aphyr
Copy link

aphyr commented Dec 5, 2022

What happened?

In Jepsen tests of etcd 3.5.6, process pauses can occasionally induce etcd nodes to segfault, logging

{"level":"error","ts":"2022-12-05T00:17:39.650-0500","caller":"etcdserver/apply.go:487","msg":"unexpected error during readonly txn","error":"applyTxn: failed Range: rangeKeys: context cancelled: context canceled","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Txn\n\tgo.etcd.io/etcd/server/v3/etcdserver/apply.go:487\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func3\n\tgo.etcd.io/etcd/server/v3/etcdserver/v3_server.go:181\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).doSerialize\n\tgo.etcd.io/etcd/server/v3/etcdserver/v3_server.go:659\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn\n\tgo.etcd.io/etcd/server/v3/etcdserver/v3_server.go:182\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn\n\tgo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/key.go:95\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn\n\tgo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/quota.go:70\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1\n\tgo.etcd.io/etcd/api/v3@v3.5.6/etcdserverpb/rpc.pb.go:6617\ngithub.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1\n\tgo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/interceptor.go:71\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1\n\tgo.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/interceptor.go:78\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\tgithub.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\ngo.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler\n\tgo.etcd.io/etcd/api/v3@v3.5.6/etcdserverpb/rpc.pb.go:6619\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\tgoogle.golang.org/grpc@v1.41.0/server.go:1279\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/grpc@v1.41.0/server.go:1608\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\tgoogle.golang.org/grpc@v1.41.0/server.go:923"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xd82e02]

goroutine 170836 [running]:
go.etcd.io/etcd/server/v3/etcdserver.warnOfExpensiveReadOnlyTxnRequest(0xc00012c000, 0x5f5e100, 0xc0db7d9b2d5e9b9d, 0x305e55b8e9, 0x1aa0ac0, 0xc001ded650, 0xc003d4db30, 0x13a2780, 0xc003f76e40)
	go.etcd.io/etcd/server/v3/etcdserver/util.go:143 +0x182
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn.func2(0xc0005ae300, 0xc001ded650, 0xc004b5d470, 0xc004b5d498, 0xc001ded6c0, 0xc0db7d9b2d5e9b9d, 0x305e55b8e9, 0x1aa0ac0)
	go.etcd.io/etcd/server/v3/etcdserver/v3_server.go:177 +0xa5
go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).Txn(0xc0005ae300, 0x13bebd8, 0xc005980690, 0xc001ded650, 0xc003d4db30, 0x13a2780, 0xc003f76e40)
	go.etcd.io/etcd/server/v3/etcdserver/v3_server.go:185 +0x56a
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*kvServer).Txn(0xc000a025c0, 0x13bebd8, 0xc005980510, 0xc001ded650, 0xc001ded650, 0x0, 0x0)
	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/key.go:95 +0xfc
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.(*quotaKVServer).Txn(0xc000a02600, 0x13bebd8, 0xc005980510, 0xc001ded650, 0xc000a02600, 0x13c5868, 0xc0005a87e0)
	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/quota.go:70 +0xc2
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler.func1(0x13bebd8, 0xc005980510, 0x11c81c0, 0xc001ded650, 0x14, 0xc005790d20, 0x2, 0xc0014f4818)
	go.etcd.io/etcd/api/v3@v3.5.6/etcdserverpb/rpc.pb.go:6617 +0x89
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1(0x13bebd8, 0xc005980510, 0x11c81c0, 0xc001ded650, 0xc005cbb000, 0xc005c3c708, 0xc00480fb80, 0x0, 0x0, 0x0)
	github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0xb0
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13bebd8, 0xc005980510, 0x11c81c0, 0xc001ded650, 0x1ad2f80, 0xc001dbf320, 0x8b336c, 0x13a4c40)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newUnaryInterceptor.func1(0x13bebd8, 0xc005980510, 0x11c81c0, 0xc001ded650, 0xc005cbb000, 0xc005cbb020, 0x493be6, 0x638d7eec, 0x2d531174, 0xc48e0df1be7f)
	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/interceptor.go:71 +0x139
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13bebd8, 0xc005980510, 0x11c81c0, 0xc001ded650, 0x0, 0x203001, 0x13cebf0, 0xc0051905f0)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1(0x13bebd8, 0xc005980510, 0x11c81c0, 0xc001ded650, 0xc005cbb000, 0xc005cbb040, 0x0, 0x0, 0x0, 0x0)
	go.etcd.io/etcd/server/v3/etcdserver/api/v3rpc/interceptor.go:78 +0xd9
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1(0x13bebd8, 0xc005980510, 0x11c81c0, 0xc001ded650, 0xc0030a4400, 0x0, 0xc0014f4b30, 0x40e318)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x63
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1(0x13bebd8, 0xc005980510, 0x11c81c0, 0xc001ded650, 0xc005cbb000, 0xc005c3c708, 0xc0014f4ba0, 0x573f66, 0x1170200, 0xc005980510)
	github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xd7
go.etcd.io/etcd/api/v3/etcdserverpb._KV_Txn_Handler(0x1113000, 0xc000a02600, 0x13bebd8, 0xc005980510, 0xc0041ff0e0, 0xc0024001b0, 0x13bebd8, 0xc005980510, 0xc005d14b58, 0x16)
	go.etcd.io/etcd/api/v3@v3.5.6/etcdserverpb/rpc.pb.go:6619 +0x150
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000156700, 0x13cb078, 0xc005d56000, 0xc002420ea0, 0xc002400300, 0x1a8cc88, 0x0, 0x0, 0x0)
	google.golang.org/grpc@v1.41.0/server.go:1279 +0x52b
google.golang.org/grpc.(*Server).handleStream(0xc000156700, 0x13cb078, 0xc005d56000, 0xc002420ea0, 0x0)
	google.golang.org/grpc@v1.41.0/server.go:1608 +0xd0c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc005238090, 0xc000156700, 0x13cb078, 0xc005d56000, 0xc002420ea0)
	google.golang.org/grpc@v1.41.0/server.go:923 +0xab
created by google.golang.org/grpc.(*Server).serveStreams.func1
	google.golang.org/grpc@v1.41.0/server.go:921 +0x1fd

What did you expect to happen?

Processes probably shouldn't segfault when things get slow.

How can we reproduce it (as minimally and precisely as possible)?

With jepsen-etcd 8a76f41ad77a5badcfb2b06734374849210c4d23, run

lein run test-all --workload wr --time-limit 300 --rate 10000 --concurrency 10n --test-count 200 --nemesis pause --version 3.5.6 --debug

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.5
Git SHA: 19002cfc6
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.5
API version: 3.5

Etcd configuration (command line flags or environment variables)

2022-12-04 18:11:03 Jepsen starting etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

```console $ etcdctl member list -w table root@n1:/opt/etcd# ./etcdctl --endpoints=n1:2379 member list -w table +------------------+---------+------+-----------------------------+-----------------------------+------------+ | ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER | +------------------+---------+------+-----------------------------+-----------------------------+------------+ | 1153c9690d2b2284 | started | n3 | http://192.168.122.103:2380 | http://192.168.122.103:2379 | false | | 4824313a421b2502 | started | n5 | http://192.168.122.105:2380 | http://192.168.122.105:2379 | false | | 4d6e27d122507e9c | started | n4 | http://192.168.122.104:2380 | http://192.168.122.104:2379 | false | | a1ffd5acd6a88a6a | started | n2 | http://192.168.122.102:2380 | http://192.168.122.102:2379 | false | | afa39e55dee6dc2e | started | n1 | http://192.168.122.101:2380 | http://192.168.122.101:2379 | false | +------------------+---------+------+-----------------------------+-----------------------------+------------+

$ etcdctl --endpoints= endpoint status -w table
root@n1:/opt/etcd# ./etcdctl --endpoints=n1:2379 endpoint status -w table
+----------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| n1:2379 | afa39e55dee6dc2e | 3.5.5 | 4.3 MB | true | false | 4 | 56100 | 56100 | |
+----------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

</details>


### Relevant log output

_No response_
@aphyr aphyr added the type/bug label Dec 5, 2022
@ahrtr ahrtr self-assigned this Dec 5, 2022
@ahrtr
Copy link
Member

ahrtr commented Dec 5, 2022

The root cause is clear now. When the range request fails for whatever reason (e.g timeout), then the ResponseRange is not set, accordingly nil. Eventually it causes nil pointer panic at util.go#L143.

The main branch has this issue as well. Will deliver a PR this this tomorrow.

ahrtr added a commit to ahrtr/etcd that referenced this issue Dec 5, 2022
FYI. etcd-io#14891 (comment)

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr added a commit to ahrtr/etcd that referenced this issue Dec 5, 2022
FYI. etcd-io#14891 (comment)

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr added a commit to ahrtr/etcd that referenced this issue Dec 6, 2022
FYI. etcd-io#14891 (comment)

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr added a commit to ahrtr/etcd that referenced this issue Dec 6, 2022
FYI. etcd-io#14891 (comment)

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr added a commit to ahrtr/etcd that referenced this issue Dec 6, 2022
FYI. etcd-io#14891 (comment)

Signed-off-by: Benjamin Wang <wachao@vmware.com>
@ahrtr
Copy link
Member

ahrtr commented Dec 6, 2022

The fix to this issue will be included in etcd v3.5.7 and v3.4.23.

Thanks @aphyr for raising this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

2 participants