etcd client v3 hung at kv.Do #8179

Closed
keyingliu opened this Issue Jun 27, 2017 · 7 comments


keyingliu commented Jun 27, 2017

In our kubernetes cluster, when there are many concurrent get requests for the same resource, the apiserver hangs on operations for that resource (e.g. service). We found that the apiserver keeps one connection to etcd per resource. The following code detects halt errors, but it does not appear to handle all errors:

func (kv *kv) Do(ctx context.Context, op Op) (OpResponse, error) {
	for {
		resp, err := kv.do(ctx, op)
		if err == nil {
			return resp, nil
		}

		if isHaltErr(ctx, err) {
			return resp, toErr(ctx, err)
		}
		// do not retry on modifications
		if op.isWrite() {
			return resp, toErr(ctx, err)
		}
	}
}

The hang is at https://github.com/kubernetes/kubernetes/blob/release-1.5/vendor/google.golang.org/grpc/transport/http2_client.go#L557

and the error is: rpc error: code = 13 desc = stream terminated by RST_STREAM with error code: 1

Checking the meaning of code 13 in the gRPC codes definition:

	// Internal errors.  Means some invariants expected by underlying
	// system has been broken.  If you see one of these errors,
	// something is very broken.
	Internal Code = 13

Given this error, is the connection still usable, or should the client reconnect?

Since kv.Do is an infinite loop, when code 13 appears it simply keeps retrying the call rather than handling the code; could that make the outage worse?
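For what it's worth, the retry can be bounded from the caller side with a context deadline: isHaltErr treats an expired context as fatal, so Do gives up once the deadline passes instead of looping forever. A minimal sketch under assumptions (local etcd endpoint, placeholder key, coreos/etcd clientv3 import path):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// The deadline bounds kv.Do's retry loop: once the context expires,
	// isHaltErr(ctx, err) returns true and the Get returns an error
	// instead of retrying indefinitely.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resp, err := cli.Get(ctx, "/registry/services/specs/default/kubernetes") // placeholder key
	if err != nil {
		fmt.Println("get failed:", err)
		return
	}
	fmt.Println("kvs:", len(resp.Kvs))
}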


xiang90 commented Jun 27, 2017

Can you write a script to reproduce the issue?
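Even a skeleton along these lines would help (endpoint, key, and worker counts here are placeholders, and this is only a sketch, not a confirmed reproducer): many goroutines issuing Gets for one key over a single clientv3 connection, roughly how the apiserver shares a connection per resource.

package main

import (
	"context"
	"log"
	"sync"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local etcd endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	const workers = 500 // arbitrary concurrency level
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
				_, err := cli.Get(ctx, "/hot/key") // every goroutine hits the same key
				cancel()
				if err != nil {
					log.Println("get error:", err)
				}
			}
		}()
	}
	wg.Wait()
}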

keyingliu commented Jun 27, 2017

We can reproduce it every time we run an LNP test against our apiserver, and checking the goroutine stacks via http://127.0.0.1:8080/debug/pprof/goroutine?debug=2 shows it is hung in the wait function.
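(For context, the debug endpoint above is the standard net/http/pprof handler; the sketch below shows how such an endpoint is typically exposed and is not the apiserver's actual wiring.)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// http://127.0.0.1:8080/debug/pprof/goroutine?debug=2 then returns
	// full goroutine stacks, which is how the hang in wait() shows up.
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}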

xiang90 commented Jun 27, 2017

@keyingliu Please provide a test script/program and steps to reproduce the issue, as we mentioned here: https://github.com/coreos/etcd/blob/master/Documentation/reporting_bugs.md.

We can help you identify and resolve the problem faster if you do. It is not convenient for etcd developers to set up k8s and dig into it.

keyingliu commented Jun 27, 2017

@xiang90 I will try to put one together.
But I am wondering: when code = 13 desc = stream terminated by RST_STREAM is returned, what happened on the etcd server? And if the client sees code 13, should it handle it more gracefully? What is your opinion on this kind of error? Thanks.
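For reference, a sketch of how a caller can pull the gRPC code out of such an error; it assumes the google.golang.org/grpc/status and codes packages (older grpc releases exposed grpc.Code(err) instead):

package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func main() {
	// Simulate the error reported above.
	err := status.Error(codes.Internal, "stream terminated by RST_STREAM with error code: 1")

	if s, ok := status.FromError(err); ok {
		switch s.Code() {
		case codes.Internal: // code 13
			fmt.Println("internal error:", s.Message())
		case codes.Unavailable:
			fmt.Println("transport unavailable; reconnecting and retrying usually makes sense")
		default:
			fmt.Println("code:", s.Code(), "message:", s.Message())
		}
	}
}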

xiang90 commented Jun 27, 2017

@keyingliu

check https://github.com/coreos/etcd/blob/5fedaf2dd78aca3eeae723f4dfe9d37ea6edd2fa/clientv3/client.go#L481.

The assumption was that an internal error is not fatal, but it seems you hit one that is. Retrying on it won't work, so I would like your help reproducing it so that we can understand the behavior better.
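Paraphrasing the linked check (a sketch, not the exact client.go source): context expiry halts the retry loop, while Unavailable and Internal are treated as transient, which is why kv.Do keeps retrying a code 13 error.

package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isHaltErrSketch mirrors the retry assumption: stop only when the caller's
// context is done or the code is neither Unavailable nor Internal.
func isHaltErrSketch(ctx context.Context, err error) bool {
	if ctx != nil && ctx.Err() != nil {
		return true
	}
	if err == nil {
		return false
	}
	code := codes.Unknown
	if s, ok := status.FromError(err); ok {
		code = s.Code()
	}
	return code != codes.Unavailable && code != codes.Internal
}

func main() {
	err := status.Error(codes.Internal, "stream terminated by RST_STREAM with error code: 1")
	// Prints false: kv.Do would keep retrying this error until the context is cancelled.
	fmt.Println(isHaltErrSketch(context.Background(), err))
}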

keyingliu commented Jun 27, 2017

@xiang90 We found it hard to reproduce until we realized the test client we were using was built against a newer version of grpc. With the grpc vendored in kubernetes 1.5 it is easy to reproduce. In any case, it looks like a grpc issue: after rebuilding our apiserver with the newer grpc version, the hang is gone.

Thanks.

xiang90 commented Jun 27, 2017

ok. closing.

xiang90 closed this Jun 27, 2017
