
Request hedging #187

Merged: jakedt merged 5 commits into main from request-hedging on Oct 15, 2021
Conversation

jakedt (Member) commented Oct 12, 2021

From the paper:

For calls to Spanner and to the Leopard index we rely on request hedging [16] (i.e. we send the same request to multiple servers, use whichever response comes back first, and cancel the other requests). To reduce round-trip times, we try to place at least two replicas of these backend services in every geographical region where we have Zanzibar servers. To avoid unnecessarily multiplying load, we first send one request and defer sending hedged requests until the initial request is known to be slow.

To determine the appropriate hedging delay threshold, each server maintains a delay estimator that dynamically computes an Nth percentile latency based on recent measurements. This mechanism allows us to limit the additional traffic incurred by hedging to a small fraction of total traffic.

Fixes #19
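
As a rough sketch of that pattern (illustrative names, not this PR's actual API):

```go
package hedging

import (
	"context"
	"time"
)

// hedge issues req once, then issues a second copy of it if the first
// has not answered within threshold (the dynamically estimated Nth
// percentile latency). Whichever response arrives first wins; the
// deferred cancel tears down the loser.
func hedge(ctx context.Context, threshold time.Duration, req func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	// Buffered so the losing goroutine can always send and exit.
	results := make(chan error, 2)
	go func() { results <- req(ctx) }()

	select {
	case err := <-results:
		return err
	case <-time.After(threshold):
		// The initial request is known to be slow; send the hedge.
		go func() { results <- req(ctx) }()
		return <-results
	}
}
```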

@jakedt jakedt marked this pull request as draft October 12, 2021 22:37
@github-actions github-actions bot added the area/datastore (Affects the storage system), area/dependencies (Affects dependencies), and area/tooling (Affects the dev or user toolchain, e.g. tests, ci, build tools) labels on Oct 12, 2021
internal/datastore/proxy/hedging.go (review thread, outdated, resolved)
internal/datastore/proxy/hedging.go (review thread, resolved)
Comment on lines 31 to 54:

```go
var digestLock sync.Mutex

digests := []*tdigest.TDigest{
	tdigest.NewWithCompression(1000),
	tdigest.NewWithCompression(1000),
}
```
Member:
I wonder if it's overkill to pull this out into its own type that has thread-safe methods like Quantile().

jakedt (author):

I'm honestly not sure; we would still need a lock around any mutations to the slice, and we're reading from the digests just as often as we are writing to them.

Member:

But if we move the writes into their own routine (see below), we could in theory either switch to a RWLock or, alternatively, have the "write" call also store the current first quantile value, and have reads just use whatever is stored?

jakedt (author):

Is there a change to make here?

Member:

I'd see if we could avoid the lock entirely on the read, only take it on the write (ideally), and do so asynchronously from the critical path.
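
A minimal sketch of that suggestion, assuming the influxdata-style tdigest API used in the diff and ignoring the two-digest rotation for brevity; the type and channel names here are hypothetical:

```go
package proxy

import (
	"sync/atomic"
	"time"

	"github.com/influxdata/tdigest"
)

// thresholdCache funnels all digest mutations through a single writer
// goroutine, so the digest itself needs no lock; readers only load the
// most recently computed quantile.
type thresholdCache struct {
	cached atomic.Value // holds a time.Duration
}

func newThresholdCache(quantile float64, observations <-chan time.Duration) *thresholdCache {
	c := &thresholdCache{}
	c.cached.Store(time.Duration(0))
	go func() {
		digest := tdigest.NewWithCompression(1000)
		for d := range observations {
			digest.Add(d.Seconds(), 1)
			// Recompute and publish the quantile on every write so
			// reads never touch the digest.
			c.cached.Store(time.Duration(digest.Quantile(quantile) * float64(time.Second)))
		}
	}()
	return c
}

// Threshold is lock-free on the read path.
func (c *thresholdCache) Threshold() time.Duration {
	return c.cached.Load().(time.Duration)
}
```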

internal/datastore/test/mock.go (review thread, outdated, resolved)
@jakedt jakedt force-pushed the request-hedging branch 5 times, most recently from 10303bf to bb6c13d on October 14, 2021 00:42
@jakedt jakedt marked this pull request as ready for review October 14, 2021 00:48

```go
return hedgingProxy{
	delegate,
	newHedger(timeSource, initialSlowRequestThreshold, maxSampleCount, hedgingQuantile, 1000),
```
Member:

Place the compression into a const?
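
For example, a hypothetical named constant in place of the magic number:

```go
// defaultTDigestCompression is a hypothetical name for the 1000 literal.
const defaultTDigestCompression = 1000

newHedger(timeSource, initialSlowRequestThreshold, maxSampleCount, hedgingQuantile, defaultTDigestCompression)
```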

```go
go req(ctx, responseReady)

select {
case <-responseReady:
```
Member:

Maybe put a comment here? I know Go doesn't auto-fallthrough, but it reads that way at first.

jakedt (author):

This is a select, not a switch...

Member:

I know, but it still looks that way at first read-through.

Member:

idk about that, and in most cases you're unlikely to even want fallthrough, since you can list multiple cases separated by commas (e.g. case This == false, That == false:).

```go
	<-responseReady
}

// Compute how long it took for us to get any answer
```
Member:

Does order matter at all here? If not, I'd say we update the digest in a goroutine to prevent this from being on the critical call path

jakedt (author):

They take microseconds to query and update, according to the original author of the algorithm.
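
As an aside, the clarifying comment requested above might look something like this sketch (the timer branch is assumed from context, not taken from the diff):

```go
select {
case <-responseReady:
	// The initial request answered before the hedging threshold
	// elapsed, so no hedged request is sent. (select cases in Go
	// never fall through.)
case <-timer.C:
	// The threshold elapsed: fire the hedged copy and wait for
	// whichever of the two requests answers first.
	go req(ctx, responseReady)
	<-responseReady
}
```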


Signed-off-by: Jake Moshenko <jacob.moshenko@gmail.com>
Signed-off-by: Jake Moshenko <jacob.moshenko@gmail.com>
"github.com/authzed/spicedb/internal/datastore"
)

var hedgeableCount = prometheus.NewCounter(prometheus.CounterOpts{
Member:

you should just use promauto
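
For reference, a promauto-based sketch (the metric name and help text are assumptions, not the PR's actual values):

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// promauto.NewCounter registers the counter with the default
// registerer as a side effect, so no separate MustRegister call
// is needed.
var hedgeableCount = promauto.NewCounter(prometheus.CounterOpts{
	Name: "spicedb_datastore_hedgeable_requests_total", // hypothetical name
	Help: "Count of datastore requests eligible for hedging.",
})
```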

internal/datastore/proxy/hedging.go (review thread, resolved)
Signed-off-by: Jake Moshenko <jacob.moshenko@gmail.com>
Signed-off-by: Jake Moshenko <jacob.moshenko@gmail.com>
Signed-off-by: Jake Moshenko <jacob.moshenko@gmail.com>
@jakedt jakedt merged commit ca82b60 into main Oct 15, 2021
@jakedt jakedt deleted the request-hedging branch October 15, 2021 16:49
Linked issue closed by this pull request: Hedge against slow or failed datastore requests (#19)