Return data fetched from a subset of store-gateways instead of returning error if a single store-gateway fails #4532

Merged

Conversation

roystchiang
Contributor

@roystchiang roystchiang commented Oct 15, 2021

What this PR does:

Previously, if a single store-gateway failed to return data, the whole function failed. This caused queries to fail without retrying on other available store-gateways.

Now, the function responsible for fetching data returns whatever was retrieved, and relies on the caller to determine whether all the necessary blocks were gathered.
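
A minimal sketch of that contract, for illustration only. It is not the code in this PR: `storeClient`, `fetchFromStores`, and `missingBlocks` are hypothetical names, and the store-gateways are simulated in memory. The fetch function skips a failing store and returns what it got, and the caller compares the result against the blocks it wanted before retrying elsewhere.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// storeClient is a hypothetical stand-in for a store-gateway client.
type storeClient struct {
	addr   string
	blocks []string // block IDs this store can serve
	down   bool     // simulates an unavailable store-gateway
}

// fetch returns the wanted blocks this store holds, or an error if it is down.
func (c storeClient) fetch(want []string) ([]string, error) {
	if c.down {
		return nil, errors.New("store-gateway unavailable: " + c.addr)
	}
	wanted := make(map[string]bool, len(want))
	for _, b := range want {
		wanted[b] = true
	}
	var got []string
	for _, b := range c.blocks {
		if wanted[b] {
			got = append(got, b)
		}
	}
	return got, nil
}

// fetchFromStores queries every client and returns the blocks that were
// actually retrieved. A failing client is logged and skipped instead of
// failing the whole call.
func fetchFromStores(clients []storeClient, want []string) []string {
	var queried []string
	for _, c := range clients {
		got, err := c.fetch(want)
		if err != nil {
			fmt.Printf("warn: %v\n", err)
			continue
		}
		queried = append(queried, got...)
	}
	return queried
}

// missingBlocks returns the wanted blocks that are absent from queried, sorted.
func missingBlocks(want, queried []string) []string {
	seen := make(map[string]bool, len(queried))
	for _, b := range queried {
		seen[b] = true
	}
	var missing []string
	for _, b := range want {
		if !seen[b] {
			missing = append(missing, b)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	want := []string{"block-1", "block-2", "block-3"}
	firstAttempt := []storeClient{
		{addr: "sg-1", blocks: []string{"block-1", "block-2"}},
		{addr: "sg-2", blocks: []string{"block-3"}, down: true},
	}
	queried := fetchFromStores(firstAttempt, want)
	missing := missingBlocks(want, queried)
	fmt.Println("missing after first attempt:", missing)

	// The caller retries only the missing blocks on another store.
	replica := []storeClient{{addr: "sg-3", blocks: []string{"block-3"}}}
	queried = append(queried, fetchFromStores(replica, missing)...)
	fmt.Println("missing after retry:", missingBlocks(want, queried))
}
```

The point of the split is that the fetch function no longer decides whether a partial result is acceptable; only the caller knows which blocks are still outstanding.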

Which issue(s) this PR fixes:
Fixes #4529

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@roystchiang roystchiang marked this pull request as draft October 16, 2021 04:18
@roystchiang roystchiang force-pushed the store-gateway-partial-failure branch 2 times, most recently from 492feae to 412eb6a Compare October 19, 2021 17:42
@pull-request-size pull-request-size bot added size/L and removed size/M labels Oct 19, 2021
@roystchiang roystchiang marked this pull request as ready for review October 19, 2021 21:52
@alanprot
Member

alanprot commented Jan 6, 2022

Hi,
Is it possible to take a look at this? This seems like a very important improvement.

@@ -654,12 +659,14 @@ func (q *blocksStoreQuerier) fetchSeriesFromStores(
  if h := resp.GetHints(); h != nil {
      hints := hintspb.SeriesResponseHints{}
      if err := types.UnmarshalAny(h, &hints); err != nil {
-         return errors.Wrapf(err, "failed to unmarshal series hints from %s", c.RemoteAddress())
+         level.Warn(spanLog).Log("err", errors.Wrapf(err, "failed to unmarshal series hints from %s", c.RemoteAddress()))
Member

Does it make sense to keep the change smaller for now and just ignore the network errors on line and line?

@@ -590,6 +590,9 @@ func (q *blocksStoreQuerier) fetchSeriesFromStores(
// TODO(goutham): we should ideally be passing the hints down to the storage layer
// and let the TSDB return us data with no chunks as in prometheus#8050.
// But this is an acceptable workaround for now.

// Only fail the function if we have validation error. We should return blocks that were successfully
Contributor

We should also fail the function when store-gateways return a limit error. It'll be good to avoid unnecessary retries in that case.

@roystchiang roystchiang force-pushed the store-gateway-partial-failure branch 3 times, most recently from 3aea3af to 8ebebae Compare July 8, 2022 00:29
@roystchiang roystchiang requested review from alanprot and removed request for pracucci July 8, 2022 03:11
@alanprot
Member

alanprot commented Jul 8, 2022

Nice! LGTM

@@ -595,6 +602,10 @@ func (q *blocksStoreQuerier) fetchSeriesFromStores(

stream, err := c.Series(gCtx, req)
if err != nil {
    if isRetryableError(err) {
        level.Warn(spanLog).Log("err", errors.Wrapf(err, "failed to fetch series from %s", c.RemoteAddress()))
Member

Can we make it clear that this is a retriable error in the log message? Something like "failed to fetch series from %s due to retriable error".

…ing error if a single store-gateway fails

Signed-off-by: Roy Chiang <roychi@amazon.com>
@roystchiang roystchiang force-pushed the store-gateway-partial-failure branch from 546b41c to d1f10b2 Compare July 13, 2022 17:48
Successfully merging this pull request may close these issues.

Unable to complete query with a single unavailable store-gateway, with shuffle-sharding and zone-awareness
4 participants