
clientv3: add SetEndpoints method #6330

Merged: 2 commits merged into etcd-io:master from balancer-sync on Sep 19, 2016

Conversation

@gyuho (Contributor) commented Sep 1, 2016

Fix #5491.
Related to #5920.
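
For context, the result of this PR is a public way to swap a live client's endpoints. A hedged usage sketch (the SetEndpoints name comes from the PR title; the MemberList-driven refresh shown here is just one plausible caller, roughly what AutoSync automates, and is not code from this PR):

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/coreos/etcd/clientv3"
    "golang.org/x/net/context"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"http://127.0.0.1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    // Ask the cluster for its current members, then feed their client
    // URLs back into the balancer via the new method.
    resp, err := cli.MemberList(context.Background())
    if err != nil {
        log.Fatal(err)
    }
    var eps []string
    for _, m := range resp.Members {
        eps = append(eps, m.ClientURLs...)
    }
    cli.SetEndpoints(eps...)
    fmt.Println("client endpoints updated to:", eps)
}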

eps = append(eps, ep)
}

c.balancer.mu.Lock()
Contributor:

Only balancer methods should lock the balancer, so there's a clear division of responsibility.
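
That is, instead of client code grabbing balancer.mu directly, the mutation would go through a balancer method that does its own locking. A minimal sketch of that division (the type and method names here are illustrative, not the actual etcd code):

package sketch

import "sync"

// balancer is illustrative; the real type is clientv3's simpleBalancer.
type balancer struct {
    mu    sync.Mutex
    addrs []string
}

// updateAddrs is the only code that touches balancer state, so callers
// (e.g. the client's endpoint-sync path) never take b.mu themselves.
func (b *balancer) updateAddrs(addrs []string) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.addrs = addrs
}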

@xiang90 (Contributor) commented Sep 1, 2016

@heyitsanthony @gyuho Shouldn't the first step be making users able to change the endpoints? Then the auto-membership-sync would be a wrapper around it?

@@ -62,6 +63,39 @@ func TestMemberAdd(t *testing.T) {
}
}

func TestMemberRemoveWithAutoSync(t *testing.T) {
Contributor:

This is only testing that the client is getting the member list, when the important part is really client connectivity. A good test of the functionality would be something like the following (see the sketch after this list):

  1. boot a cluster with one member
  2. attach an autosync client
  3. add two more members
  4. wait for sync to grab the new members
  5. stop the original member
  6. use the client to access the cluster (the rpc will only go through if the sync worked)
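
A hedged sketch of that six-step flow against etcd's integration helpers (helper names such as NewClusterV3, AddMember, GRPCAddr, and the AutoSyncInterval config field are written from memory and may not match the tree exactly):

package integration

import (
    "testing"
    "time"

    "github.com/coreos/etcd/clientv3"
    "golang.org/x/net/context"
)

func TestEndpointAutoSync(t *testing.T) {
    // 1. boot a cluster with one member
    clus := NewClusterV3(t, &ClusterConfig{Size: 1})
    defer clus.Terminate(t)

    // 2. attach an autosync client pointed at the first member
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:        []string{clus.Members[0].GRPCAddr()},
        AutoSyncInterval: time.Second,
        DialTimeout:      5 * time.Second,
    })
    if err != nil {
        t.Fatal(err)
    }
    defer cli.Close()

    // 3. add two more members
    clus.AddMember(t)
    clus.AddMember(t)

    // 4. wait for the sync to grab the new members
    time.Sleep(3 * time.Second)

    // 5. stop the original member
    clus.Members[0].Stop(t)
    clus.WaitLeader(t)

    // 6. the RPC only goes through if the sync updated the endpoints
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if _, err := cli.Get(ctx, "foo"); err != nil {
        t.Fatal(err)
    }
}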

@gyuho (Author):

OK, will rework this. Thanks.

@gyuho force-pushed the balancer-sync branch 2 times, most recently from 1e6fa9f to f010129, on September 4, 2016 12:05
clus.Members[0].Stop(t)
clus.WaitLeader(t)

// rpc will only go through if the sync worked
@gyuho (Author), Sep 4, 2016:

This will fail because adding more endpoints doesn't update the unavailable pinned endpoint.

Do we expect users to detect the unavailable endpoint and change the pinned one manually? How should we test this?

Contributor:

The balancer should automatically choose another endpoint to pin when it detects the old pinned endpoint is down. For a non-mutable RPC like Get or MemberList, the user shouldn't notice the failover.
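
Concretely, gRPC calls the balancer's Up for each address it connects to, and the returned "down" closure fires when that connection is lost; clearing the pin there lets the next Up call pin a healthy endpoint. A rough sketch under those assumptions (the struct fields and logic are illustrative, not the exact etcd code):

package sketch

import (
    "sync"

    "google.golang.org/grpc"
)

type simpleBalancer struct {
    mu      sync.Mutex
    pinAddr string
    upc     chan struct{} // closed once some endpoint is pinned
}

// Up is invoked by gRPC when it establishes a connection to addr; the
// returned function is invoked when that connection goes down.
func (b *simpleBalancer) Up(addr grpc.Address) func(error) {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.pinAddr == "" {
        // pin the first endpoint that comes up and unblock waiters
        b.pinAddr = addr.Addr
        close(b.upc)
    }
    return func(err error) {
        b.mu.Lock()
        defer b.mu.Unlock()
        if b.pinAddr == addr.Addr {
            // the pinned endpoint went down: clear it so the next
            // Up() from gRPC pins a different, healthy endpoint
            b.upc = make(chan struct{})
            b.pinAddr = ""
        }
    }
}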

@gyuho (Author):

I think our clientv3 does not yet implement the Start(target string) method of the grpc.Balancer interface?

func (b *simpleBalancer) Start(target string) error { return nil }

So there's no routine that watches unavailable endpoints and changes the pinned one by calling Up(addr).
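
For reference, the grpc.Balancer interface being discussed looked roughly like this in the grpc-go version of that era (reproduced from memory, so treat the exact signatures as approximate):

// From google.golang.org/grpc (circa 2016); approximate.
type Balancer interface {
    // Start is called once, when grpc.Dial is invoked.
    Start(target string) error
    // Up is called when gRPC establishes a connection to addr; the
    // returned down func is called when that connection is lost.
    Up(addr Address) (down func(error))
    // Get picks the address an RPC should be sent to (this is where
    // clientv3 returns the pinned endpoint).
    Get(ctx context.Context, opts BalancerGetOptions) (addr Address, put func(), err error)
    // Notify returns a channel over which the balancer tells gRPC
    // which addresses it should keep connections to.
    Notify() <-chan []Address
    Close() error
}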

Contributor:

The balancer shouldn't call Up; only grpc does that, since that's how it signals to the balancer that it established a connection. I don't think implementing Start will help: it looks like it's only called on Dial, and the endpoint updates from sync shouldn't trigger the dial path.

@gyuho (Author):

You are right. grpc internally calls Up via transportMonitor and resetAddrConn, but it is not being called in our test case, for some reason. I will look into it more.

@gyuho (Author) commented Sep 7, 2016

@heyitsanthony This needs more work than I thought.

The problems with this PR are:

  1. The MemberList API returns client URLs, but in our integration tests we use nodeName + a random number as the gRPC endpoint, so member-list-based sync doesn't work in integration tests.
  2. If a client with only one endpoint cannot connect to that endpoint, the periodic MemberList call would also fail. I am not sure how we should handle this while separating the sync logic into the balancer.
  3. I notice that when the etcd client tries to send an RPC to an unavailable server, grpc.Balancer.Up(addr) never gets called, because transport.NewClientTransport here always fails as below:
2016-09-07 14:04:43.654328 W | etcdserver: failed to reach the peerURL(unix://127.0.0.1:21001.31686.sock) of member 90eca40574dc4fec (Get http://127.0.0.1:21001.31686.sock/version: dial unix 127.0.0.1:21001.31686.sock: connect: no such file or directory)
2016-09-07 14:04:43.654337 W | etcdserver: cannot get the version of member 90eca40574dc4fec (Get http://127.0.0.1:21001.31686.sock/version: dial unix 127.0.0.1:21001.31686.sock: connect: no such file or directory)
2016-09-07 14:04:43.657548 W | rafthttp: lost the TCP streaming connection with peer 90eca40574dc4fec (stream Message writer)
2016-09-07 14:04:44.280331 W | rafthttp: lost the TCP streaming connection with peer 90eca40574dc4fec (stream MsgApp v2 writer)
2016-09-07 14:04:44.411989 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node4420673713181363807.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node4420673713181363807.sock.bridge" <nil>}
2016-09-07 14:04:46.254541 I | v3rpc/grpc: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix localhost:node4420673713181363807.sock.bridge: connect: no such file or directory"; Reconnecting to {"localhost:node4420673713181363807.sock.bridge" <nil>}
2016-09-07 14:04:46.713559 W | rafthttp: lost the TCP streaming connection with peer 90eca40574dc4fec (stream Message writer)
2016-09-07 14:04:47.654631 W | etcdserver: failed to reach the peerURL(unix://127.0.0.1:21001.31686.sock) of member 90eca40574dc4fec (Get http://127.0.0.1:21001.31686.sock/version: dial unix 127.0.0.1:21001.31686.sock: connect: no such file or directory)
2016-09-07 14:04:47.654649 W | etcdserver: cannot get the version of member 90eca40574dc4fec (Get http://127.0.0.1:21001.31686.sock/version: dial unix 127.0.0.1:21001.31686.sock: connect: no such file or directory)

Any suggestions? Thanks!

@heyitsanthony (Contributor):
  1. Override the client.Cluster with a mock implementation that returns the right values (see the sketch after this list)
  2. I think it's OK to just ignore that failure
  3. Why would Up be called if the server is unavailable? Up is only called if grpc can establish a connection.
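
A hedged sketch of suggestion 1: embed the real clientv3.Cluster and override MemberList so that sync resolves to addresses the integration harness actually listens on (mockCluster and grpcAddrs are names made up for illustration):

package sketch

import (
    "github.com/coreos/etcd/clientv3"
    "golang.org/x/net/context"
)

// mockCluster wraps the real Cluster API but rewrites the members'
// client URLs to the integration test's gRPC addresses.
type mockCluster struct {
    clientv3.Cluster          // embed the real implementation
    grpcAddrs        []string // addresses the test harness listens on
}

func (m *mockCluster) MemberList(ctx context.Context) (*clientv3.MemberListResponse, error) {
    resp, err := m.Cluster.MemberList(ctx)
    if err != nil {
        return nil, err
    }
    // Replace each member's client URLs so endpoint sync ends up with
    // something the client can actually dial in the test environment.
    for i := range resp.Members {
        if i < len(m.grpcAddrs) {
            resp.Members[i].ClientURLs = []string{m.grpcAddrs[i]}
        }
    }
    return resp, nil
}

The test could then install it with something like cli.Cluster = &mockCluster{Cluster: cli.Cluster, grpcAddrs: eps} before the sync interval elapses.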

@gyuho force-pushed the balancer-sync branch 4 times, most recently from 5fd6a0c to 096ff3b, on September 7, 2016 18:09
addrs = append(addrs, grpc.Address{Addr: getHost(ep)})
}
b.addrs = addrs
b.notifyCh <- addrs
@gyuho (Author):

@heyitsanthony So I see that gRPC is being notified of these new addrs via Notify and lbWatchers, but I am still seeing the same error in tests. Do you see anything else I am doing wrong?

Contributor:

I'm not really familiar with the semantics of the notify channel. I don't think this function should be resetting readyc/upEps/upc/pinAddr though; anything that's waiting on upc or readyc will break for sure.
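
Put differently, the endpoint update could be limited to swapping the address list and sending on Notify, leaving readyc/upc/upEps/pinAddr to the connection lifecycle; a hedged sketch in the shape of the snippet above (getHost is the helper already used there, the rest is assumed):

func (b *simpleBalancer) updateAddrs(eps []string) {
    addrs := make([]grpc.Address, 0, len(eps))
    for _, ep := range eps {
        addrs = append(addrs, grpc.Address{Addr: getHost(ep)})
    }

    b.mu.Lock()
    b.addrs = addrs
    b.mu.Unlock()

    // Tell gRPC about the new address set. Do not reset readyc, upc,
    // upEps, or pinAddr here; anything waiting on those channels would
    // break, and reconnects re-pin through Up()/down as usual.
    b.notifyCh <- addrs
}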

@gyuho (Author):

Yeah, I need to read more of the gRPC code. Will rework this. Thanks.

@gyuho changed the title from "*: autoSync in balancer" to "*: AutoSync in balancer" on Sep 9, 2016
@gyuho changed the title from "*: AutoSync in balancer" to "*: AutoSync client endpoints" on Sep 9, 2016
@gyuho changed the title from "*: AutoSync client endpoints" to "clientv3: add UpdateEndpoints method" on Sep 11, 2016
@gyuho force-pushed the balancer-sync branch 2 times, most recently from 274c1b1 to e4d6312, on September 16, 2016 15:52
@gyuho (Author) commented Sep 16, 2016

I believe the test failures are not related to this.

@gyuho force-pushed the balancer-sync branch 11 times, most recently from 95d5a26 to 4a51719, on September 17, 2016 02:07
@gyuho (Author) commented Sep 17, 2016

@heyitsanthony PTAL.

Test failures were related to the changes. Now fixed.

I made processCreds return early if the scheme (://) is not present.

Thanks.

@gyuho force-pushed the balancer-sync branch 5 times, most recently from 95d2a15 to dfc8c73, on September 18, 2016 16:34
@@ -171,7 +183,8 @@ func (c *Client) dialSetupOpts(endpoint string, dopts ...grpc.DialOption) (opts
}
opts = append(opts, grpc.WithDialer(f))

- _, _, creds := c.dialTarget(endpoint)
+ proto, _, scheme := parseEndpoint(endpoint)
+ creds := c.processCreds(proto, scheme)
Contributor:

I think processCreds can do without the bool argument:

creds := c.creds
if proto, _, scheme := parseEndpoint(endpoint); scheme {
    creds = c.processCreds(proto)
}
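
Downstream in dialSetupOpts, the result would then feed into the gRPC dial options in the usual way; a hedged sketch of the surrounding lines (WithTransportCredentials and WithInsecure are standard grpc-go options, everything else just follows the reviewer's snippet above and may differ from the merged code):

creds := c.creds
if proto, _, scheme := parseEndpoint(endpoint); scheme {
    creds = c.processCreds(proto)
}
if creds != nil {
    opts = append(opts, grpc.WithTransportCredentials(*creds))
} else {
    opts = append(opts, grpc.WithInsecure())
}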

@heyitsanthony (Contributor):
lgtm following processCreds args fixup. Thanks!

@gyuho merged commit c9e06fa into etcd-io:master on Sep 19, 2016
@gyuho deleted the balancer-sync branch on September 19, 2016