Hedge against slow or failed datastore requests #19

Closed
jzelinskie opened this issue Aug 19, 2021 · 2 comments · Fixed by #187
Labels
area/datastore Affects the storage system area/perf Affects performance or scalability priority/2 medium This needs to be done

Comments

@jzelinskie
Member

When bad network performance, cache misses, or invalid assumptions cause latency, the distributed dispatcher should also dispatch the request to another instance as a "hedged bet", so that a response comes back with as little latency as possible.

@jzelinskie jzelinskie added priority/2 medium This needs to be done area/perf Affects performance or scalability area/dispatch Affects dispatching of requests labels Aug 19, 2021
@jzelinskie
Member Author

Here's an excerpt from the Zanzibar paper:

Zanzibar’s distributed processing requires measures to accommodate slow tasks. For calls to Spanner and to the Leopard index we rely on request hedging (i.e. we send the same request to multiple servers, use whichever response comes back first, and cancel the other requests). To reduce round-trip times, we try to place at least two replicas of these backend services in every geographical region where we have Zanzibar servers. To avoid unnecessarily multiplying load, we first send one request and defer sending hedged requests until the initial request is known to be slow.
To determine the appropriate hedging delay threshold, each server maintains a delay estimator that dynamically computes an Nth percentile latency based on recent measurements. This mechanism allows us to limit the additional traffic incurred by hedging to a small fraction of total traffic.
Effective hedging requires the requests to have similar costs. In the case of Zanzibar’s authorization checks, some checks are inherently more time-consuming than others because they require more work. Hedging check requests would result in duplicating the most expensive workloads and, ironically, worsening latency. Therefore we do not hedge requests between Zanzibar servers, but rely on the previously discussed sharding among multiple replicas and on monitoring mechanisms to detect and avoid slow servers.

@jakedt jakedt mentioned this issue Oct 12, 2021
@jakedt jakedt changed the title Hedge against slow or failed request dispatching Hedge against slow or failed datastore dispatching Oct 12, 2021
@jakedt jakedt changed the title Hedge against slow or failed datastore dispatching Hedge against slow or failed datastore requests Oct 12, 2021
@jakedt
Member

jakedt commented Oct 12, 2021

Changed the title. The paper has this to say explicitly about hedging redispatch:

In the case of Zanzibar’s authorization checks, some checks are inherently more time-consuming than others because they require more work. Hedging check requests would result in duplicating the most expensive workloads and, ironically, worsening latency. Therefore we do not hedge requests between Zanzibar servers, but rely on the previously discussed sharding among multiple replicas and on monitoring mechanisms to detect and avoid slow servers.

@jzelinskie jzelinskie added area/datastore Affects the storage system and removed area/dispatch Affects dispatching of requests labels Oct 15, 2021
2 participants