
clientv3: respect up/down notifications from grpc #5845

Merged: 4 commits from clientv3-ignore-dead-eps into etcd-io:master on Aug 16, 2016

Conversation

@heyitsanthony (Contributor):

Partial patch; will need to revendor grpc once the fix on that end is merged.

Fixes #5842

numGets uint32
// mu protects upEps, downEps, and numGets
Contributor:

Maybe put mu sync.Mutex directly above upEps and numGets?

And what is downEps? I don't see it among the struct fields.
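
For illustration, a minimal sketch of the field ordering being suggested here (the mutex placed directly above the fields it guards); the names mu, upEps, numGets, and eps come from the diff context, while the struct name and everything else are assumptions:

package balancer

import "sync"

// simpleBalancer is a hypothetical, cut-down version of the struct under
// review, shown only to illustrate the mutex-above-guarded-fields convention.
type simpleBalancer struct {
	// eps holds the configured endpoints; set once at construction.
	eps []string

	// mu protects upEps (numGets is updated atomically in Get).
	mu      sync.Mutex
	upEps   map[string]struct{} // endpoints grpc has reported as up
	numGets uint32              // round-robin counter used by Get
}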

Contributor Author:

The comment was outdated; fixed.

@heyitsanthony (Contributor Author):

Fixed up to use the balancer for this functionality, since the WithBlock patch was rejected. PTAL /cc @xiang90

v := atomic.AddUint32(&b.numGets, 1)
ep := b.eps[v%uint32(len(b.eps))]
return grpc.Address{Addr: getHost(ep)}, func() {}, nil
b.mu.Lock()
Contributor:

It seems that we need to do more work here?

See the comments on Get at https://godoc.org/google.golang.org/grpc#Balancer

Also: https://github.com/grpc/grpc-go/blob/master/balancer.go#L272-L364

Contributor Author:

I think most of that complication comes from making it as general as possible. It's safe to assume FailFast is false, so there's no need to implement the suggested convoluted blocking logic.
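
To make that concrete, here is a rough sketch of what Get can look like under that assumption, using the old grpc.Balancer API this PR targets (imports assumed: context, sync/atomic, google.golang.org/grpc): block until some connection has come up, then round-robin over the configured endpoints as in the diff above. It builds on the struct sketched earlier plus an assumed readyc channel closed on the first up notification; it is not the PR's final implementation.

// Get blocks until the balancer has seen at least one up notification, then
// hands grpc a round-robin pick of the configured endpoints. With FailFast
// disabled, grpc is willing to wait here instead of failing the RPC outright.
func (b *simpleBalancer) Get(ctx context.Context, opts grpc.BalancerGetOptions) (grpc.Address, func(), error) {
	select {
	case <-b.readyc: // assumed field: closed once the first connection is up
	case <-ctx.Done():
		return grpc.Address{}, nil, ctx.Err()
	}
	v := atomic.AddUint32(&b.numGets, 1)
	ep := b.eps[v%uint32(len(b.eps))]
	return grpc.Address{Addr: ep}, func() {}, nil
}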

Contributor:

Do we expose the FailFast option to the user now?

Contributor Author:

No.

Contributor:

OK, good.

@heyitsanthony force-pushed the clientv3-ignore-dead-eps branch 5 times, most recently from 8755254 to 8b53cea, on August 2, 2016 04:34
defer b.mu.Unlock()

if b.pinAddr != nil {
	if _, ok := b.upEps[b.pinAddr.Addr]; ok || time.Since(b.pinTime) < b.pinWait {
Contributor:

I am not very clear on this. Why do we need a pinWait? If a pinned address is still up, shouldn't we use it until it fails? The current pinWait is 500ms, so a new RPC coming in after 500ms will now choose a new endpoint?

Contributor Author:

OK, this code is bad; the dial timeout is already handled in the grpc dialer, and there should only be one connecting/up endpoint at a time. I can simplify it.
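
A sketch of that simplification, under the assumption that the balancer just pins the first address grpc reports up and keeps it until the corresponding down callback fires, with no pinWait timer at all. pinAddr is shown as a plain string for brevity (the diff stores an address value), and the remaining names follow the earlier sketch:

// Up is called by grpc when a connection to addr becomes usable; the returned
// function is invoked by grpc when that connection goes down.
func (b *simpleBalancer) Up(addr grpc.Address) func(error) {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.upEps[addr.Addr] = struct{}{}
	if b.pinAddr == "" {
		// nothing pinned yet; stick with this endpoint until it fails
		b.pinAddr = addr.Addr
	}

	return func(err error) {
		b.mu.Lock()
		defer b.mu.Unlock()
		delete(b.upEps, addr.Addr)
		if b.pinAddr == addr.Addr {
			// the pinned endpoint went down; the next Get picks a new one
			b.pinAddr = ""
		}
	}
}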

@heyitsanthony force-pushed the clientv3-ignore-dead-eps branch 3 times, most recently from c78bd3f to 83e7ddb, on August 4, 2016 06:26
@heyitsanthony (Contributor Author):

@xiang90 all fixed, PTAL.

b.mu.Unlock()
// notify client that a connection is up
select {
case b.upc <- struct{}{}:
Contributor:

It seems like we only need to send this once, when we create the client? Should we rename it to ready and wrap it with sync.Once?
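
A minimal sketch of that suggestion, with readyc and readyOnce as illustrative names: close a channel exactly once when the first connection comes up, instead of sending on upc for every up notification.

package balancer

import "sync"

// readyNotifier signals client readiness exactly once by closing a channel.
type readyNotifier struct {
	readyOnce sync.Once
	readyc    chan struct{} // closed after the first connection comes up
}

func newReadyNotifier() *readyNotifier {
	return &readyNotifier{readyc: make(chan struct{})}
}

// connectionUp is called from the balancer's Up callback; only the first call
// has any effect.
func (n *readyNotifier) connectionUp() {
	n.readyOnce.Do(func() { close(n.readyc) })
}

// ready returns a channel that waiters can block on with <-n.ready().
func (n *readyNotifier) ready() <-chan struct{} { return n.readyc }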

Contributor Author:

OK.

@xiang90 (Contributor) commented Aug 4, 2016:

LGTM

@heyitsanthony (Contributor Author):

There may be a subtle bug in this if connectingAddr's host is down but other endpoints are available; I'll see if I can trigger it with a test case.

@heyitsanthony (Contributor Author):

Added a test for the failover case. Now blocked on grpc/grpc-go#810
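
For intuition, the failover scenario looks roughly like this from the client's side; the addresses are placeholders, the second endpoint is assumed unreachable, the import path is the one etcd used at the time, and the PR's actual test is more involved:

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// With the balancer fix, a dead endpoint in the list should no longer
	// prevent the client from connecting through a live one.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379", "127.0.0.1:22379"}, // second assumed down
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if _, err := cli.Get(ctx, "foo"); err != nil {
		log.Fatal(err)
	}
}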

@heyitsanthony force-pushed the clientv3-ignore-dead-eps branch 2 times, most recently from 7de19f7 to 11f2b99, on August 16, 2016 16:49
return func(rpcCtx context.Context, f rpcFunc) {
	for {
		err := f(rpcCtx)
		// ignore grpc conn closing on fail-fast calls; they are transient errors
Contributor:

Is there a dial-failure error? Can connClosing happen after writing the request?

Contributor Author:

There is no explicit dial-failure error; the closest thing is the helpfully unexported errConnClosing (which gets grpc.Errorf()'d into a grpc-formatted error). Transport errors seem to either go through ConnectionErrorf or be prefixed with "transport:". I guess the safest policy is to retry only on isConnClosing(err) and bail out otherwise?
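
A sketch of that policy, shaped like the wrapper excerpted above: keep retrying only while the failure looks like grpc's internal connection-closing error and surface everything else. isConnClosing is an assumed helper that matches the unexported errConnClosing text, and the wrapper returns an error here for clarity even though the excerpt above does not.

package retry

import "context"

// rpcFunc is a single attempt of an RPC against the currently pinned endpoint.
type rpcFunc func(ctx context.Context) error

// newRetryWrapper returns a wrapper that retries f only on "connection is
// closing" failures, which are transient while the balancer repins.
func newRetryWrapper(isConnClosing func(error) bool) func(context.Context, rpcFunc) error {
	return func(rpcCtx context.Context, f rpcFunc) error {
		for {
			err := f(rpcCtx)
			if err == nil {
				return nil
			}
			if !isConnClosing(err) {
				// not the transient conn-closing case; give up and report it
				return err
			}
			// transient failure: stop if the caller's context expired,
			// otherwise try again against the (re)pinned endpoint
			if rpcCtx.Err() != nil {
				return rpcCtx.Err()
			}
		}
	}
}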

@heyitsanthony (Contributor Author):

OK, changed the retry logic to bail if err != closing. CI seems to be happy. PTAL /cc @xiang90

@xiang90 (Contributor) commented Aug 16, 2016:

lgtm

@heyitsanthony merged commit 8d77035 into etcd-io:master on Aug 16, 2016
@heyitsanthony deleted the clientv3-ignore-dead-eps branch on August 16, 2016 20:52
Linked issue: clientv3: won't connect if any endpoint is down (#5842)