client: renew index on watch timeouts #1292

Closed
yichengq opened this issue Oct 12, 2014 · 11 comments

@yichengq
Contributor

When a watch times out, the client receives an etcd timeout error. It should then renew the index it watches from, based on the X-Etcd-Index header of the response.

@yichengq yichengq added this to the v0.5.0 milestone Oct 12, 2014
@jonboulle
Contributor

Do we need this in etcd proper for 0.5? Does that mean we're pushing client/ as the recommended client?

@kelseyhightower
Contributor

I must be hitting this bug now. Currently the go-etcd library does not produce an error when timeouts happen and watches seem to be broken after the timeout. Maybe I'm just doing it wrong: https://github.com/kelseyhightower/flannel-route-manager/blob/master/server/server.go#L126

@yichengq
Contributor Author

@jonboulle More context for this issue:
The current discovery service depends on the watch mechanism. When a watch times out, it retries. On the third retry, the request fails because the index being watched has fallen out of the history window, and etcd prints a bad request error.

@jonboulle jonboulle changed the title from "renew index when recursive watch timeouts" to "renew index when watch timeouts" Oct 13, 2014
@jonboulle
Contributor

Capturing OOB discussion:

We only send X-Etcd-Index once at the start of a response. In the case of a longstanding watch that's aborted after a timeout (e.g. a Gateway Timeout after 10 minutes), even if it's retried immediately with that X-Etcd-Index value there's a reasonable chance (particularly on a busy cluster) that the index has fallen out of the history window already. So, any good client must really incorporate "catch-up" behaviour into its watch mechanism to get back into the index window.

@jonboulle jonboulle changed the title from "renew index when watch timeouts" to "client: renew index on watch timeouts" Oct 13, 2014
@jonboulle
Contributor

Also to be clear there are two timeouts (at least) that we need to deal with:

  • 504 Gateway Timeout (e.g. in the case of discovery.etcd.io, this is what the load balancer in front of it will return)
  • The etcd server timeout (introduced @ 084dcb5), which simply closes HTTP connections (semi-gracefully: per chunked transfer encoding, we send a final chunk length of 0)

@jonboulle
Contributor

@unihorn After thinking about it a bit more and looking at the chunked transfer encoding spec, I am wondering if we should try to send another X-Etcd-Index in the trailer in the case of etcdserver timeouts, as a hint to the user.

(Still doesn't help with 504s, but I'm anticipating that they're dramatically less common)

@yichengq
Contributor Author

@jonboulle
As with the discovery service, a 504 is a general case that our proxy should handle too.
If the connection is closed accidentally, the client will miss the trailer, so we need to handle the bad path too.
As a first step, I think we should define how strong a guarantee etcd should provide for index renewal. I would say that as long as the client is not disconnected from the server for more than 5s, etcd should be able to keep watching.

@jonboulle
Contributor

@unihorn you mean, resume a watch? Don't we need to track sessions then?

@yichengq
Contributor Author

@jonboulle I mean resume/relaunch watch.
Personally, I dislike the session idea because I think the server should not record client info, which may complicate etcd and limit the number of clients.

@jonboulle
Contributor

@unihorn how do you propose for etcd to do resume/relaunch without sessions?

@xiang90
Contributor

xiang90 commented Dec 15, 2014

@yichengq @jonboulle For 0.4.x, when a watch times out the client can:

  1. try to watch from the last known index
    1.1 the watch succeeds
    1.2 the watch returns "out of window" -> do a recursive get of all the content and watch from the index of the get

We will introduce a more reliable watch mechanism in the new API.

@xiang90 xiang90 closed this as completed Dec 15, 2014