Some Elasticsearch API endpoints (/_cluster/settings, /_cluster/voting_config_exclusions) have a timeout query parameter. If not set, it defaults to 30 seconds.
There might be cases where an overloaded cluster needs more than 30 seconds to respond.
For those calls, our Elasticsearch client timeout (default 3min) does not matter much, since the Elasticsearch timeout will be reached earlier. Even though users can override our client timeout through an annotation on the Elasticsearch resource, that override is effectively ignored in this case.
We should probably make sure that every API call we make that supports the timeout query parameter actually sets it to a value close to our client timeout?
Conceptually, we could set the API timeout parameter to a lower value than the client timeout, so the client does not time out before Elasticsearch. By how much though? A few seconds, likely. In practice, I don't think that makes a big difference: if the request times out on our end, we'll retry it at the next reconciliation. Our reconciliation loop should be idempotent: there should be no case where a call that times out for us but succeeds on the Elasticsearch side would cause trouble.
I agree with your comments about the reconciliation loop being idempotent, but I think we should have some mechanical sympathy here as well. If the timeout parameter is exactly the same as the client timeout, there's a very good chance that the client will hit its request timeout before the Elasticsearch server times out. Of course, since the call is supposedly idempotent, there's no harm in retrying (even though the call may have succeeded unbeknownst to us). The problem is that if the Elasticsearch cluster is so overloaded that it can't even respond to a request within a reasonable amount of time, blindly retrying calls is not going to make things better. Ideally, in cases like this we should at least be backing off exponentially to avoid overloading the cluster. The requeue mechanism probably does this for us; however, we then end up running the entire reconciliation loop again, doing a lot of unnecessary work on a struggling cluster.
I'm fine with setting a slightly lower timeout for the API call than for the client. It's not much extra work :)
What do you think would be a good delta value between both timeouts? ~5sec?
It shouldn't be too hard to derive the timeout parameter based on the client timeout.
Ideally the difference would be calculated by measuring the network latency but that's probably going overboard for this. Since we are communicating within the Kubernetes cluster, I think 2-3 seconds would be reasonable enough to cover most cases.