Retry query automatically for failed historicals #13498

Closed
churromorales wants to merge 5 commits into apache:master from churromorales:retry_failed_historical

Conversation

@churromorales
Contributor

Should address: #5709

This is a problem we experienced. When nodes go down unexpectedly, in-flight queries fail. We want Druid to retry the segments served by the failed historicals, since we have replication > 1.

This change adds the missing segments to the context when a channel disconnect happens, so the RetryQueryRunner can retry the query for them.
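To make the mechanism concrete, here is a minimal toy sketch of the idea: on a channel disconnect, the segments that were being served by the failed historical get recorded in the query's response context, and a retry runner later reads that list to decide what to re-query. All names here (`ResponseContextSketch`, `MISSING_SEGMENTS`, the method signatures) are illustrative assumptions, not Druid's actual classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Druid code: record segments lost to a
// disconnect so a retry runner can re-issue the query for just those.
public class ResponseContextSketch {
    public static final String MISSING_SEGMENTS = "missingSegments";

    private final Map<String, List<String>> context = new HashMap<>();

    // Called when the connection to a historical drops mid-query.
    public void onChannelDisconnect(List<String> segmentsOnFailedServer) {
        context.computeIfAbsent(MISSING_SEGMENTS, k -> new ArrayList<>())
               .addAll(segmentsOnFailedServer);
    }

    // The retry runner consults this list to decide what to re-query.
    public List<String> segmentsToRetry() {
        return context.getOrDefault(MISSING_SEGMENTS, List.of());
    }
}
```

In real Druid the missing-segment list travels back to the broker in the `X-Druid-Response-Context` header; this sketch only models the bookkeeping, not the transport.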

I wanted to put up this PR and get some eyes on it first, as testing this is a bit tricky and will require some refactoring. Before I go down the refactoring path, I want to make sure I'm not doing something outlandish. If this looks good to folks, I can try to add some tests.

@rohangarg
Member

@churromorales : Thanks a lot for taking a stab at this! I wanted to confirm whether this change handles a problem Gian mentioned in the issue tagged in the description, #5709 (I have copied the relevant parts):

Another thing to think about is that it is possible for results to be partially retrieved (and partially processed) and then for the subquery to fail midway through (before all results have come in). In this case, it's probably not possible for the broker to recover, since subquery results have already been mixed into the overall query results. The query may need to be retried from scratch. (ref : #5709 (comment))

IIRC, RetryQueryRunner only retries in a specific situation: when a segment has moved to another server since the time the broker made the query. Specifically it looks at the X-Druid-Response-Context header for the key that lists missing segments (as reported by ReportTimelineMissingSegmentQueryRunner). I don't think it retries on errors or anything else like that. So it shouldn't have those issues I mentioned, because it's not as general as what this issue aims to accomplish. (ref : #5709 (comment))

I think channelDisconnect can also arrive after some results have been consumed from the network stream, so the broker may already have consumed those partial results and merged them with results from other historicals. In that case, I'm not sure we can identify which segments to retry the query for; as Gian mentioned, the whole query would possibly need to be retried in that case. Please let me know your thoughts!
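The hazard described above can be shown with a toy model: if the broker merged some rows from a historical before the disconnect, a naive segment-level retry replays the whole segment and double-counts the rows already consumed. This is purely an illustration of the failure mode, not Druid code; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the partial-results hazard: re-querying a segment whose
// rows were partially consumed before a disconnect duplicates those rows.
public class PartialRetryHazard {
    public static List<Integer> mergeWithNaiveRetry(
            List<Integer> rowsConsumedBeforeDisconnect,
            List<Integer> fullSegmentRows) {
        List<Integer> merged = new ArrayList<>(rowsConsumedBeforeDisconnect);
        // A naive segment-level retry replays the entire segment, so the
        // rows consumed before the disconnect appear a second time.
        merged.addAll(fullSegmentRows);
        return merged;
    }
}
```

With two of three rows already consumed, the merged result holds five rows instead of three, which is exactly the incorrectness being discussed.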

@churromorales
Contributor Author

@rohangarg I see your point. If a historical returns partial data and then disconnects, re-querying its segments could return incorrect results. That is no good. I have another potential fix, but wanted to go over it with you first.

I think a full query retry is okay in this scenario.
On a channel disconnect we could simply re-query everything. I can add a special flag to the context, and when it is set, the RetryQueryRunner re-queries the missing intervals rather than the missing segments.

What do you think about that?
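The proposed flag might look something like the sketch below: a context entry that, when set after a disconnect, tells the retry logic to discard partial results and re-run the full query instead of retrying individual segments. The flag name and helper are hypothetical, not an actual Druid context key.

```java
import java.util.Map;

// Hypothetical sketch of the proposed context flag. When a disconnect is
// detected the flag is set, and the retry runner checks it to decide
// between segment-level retry and a full re-query of the intervals.
public class FullRetryFlagSketch {
    public static final String RETRY_FULL_QUERY = "retryFullQuery";

    public static boolean shouldRetryFullQuery(Map<String, Object> queryContext) {
        return Boolean.TRUE.equals(queryContext.get(RETRY_FULL_QUERY));
    }
}
```

Keeping the decision in the context keeps the retry policy out of the per-segment bookkeeping, which matches the shape of the suggestion above.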

@rohangarg
Member

@churromorales Thanks for the response!

On a channel disconnect we could just do a re-query of everything. I think I can add a special flag in the context, and when that gets set the RetryRunner queries the missing intervals, not the missing segments again.

Yes, a full query retry seems like the only option on a disconnect. Are you thinking of doing it on the server side, or from the client by returning a clear, specific error?
Regarding the missing-intervals part, do you mean all the intervals or only some of them? Sorry, I haven't fully understood that part yet.

@churromorales
Contributor Author

@rohangarg Check out the latest PR. I updated it to retry the entire query (good point on the partial results), starting at the RetryQueryRunner and going from there. Let me know if you have any questions. Thanks for getting back to me on this.

@rohangarg
Member

@churromorales : Thanks for the context! I skimmed through the implementation, and it looks like you're implementing full query retry through the RetryQueryRunner. Is that right? If so, one question: just as the broker can consume partial results from a historical and then see a disconnect, the client may also have read partial results from the broker before a historical disconnects. If we then retry the whole query, the client may receive duplicate results, or an error due to the truncated results from the first failed attempt. Is that something you've considered in your solution?

@rohangarg
Member

@churromorales : @imply-cheddar and I were also discussing this today. We think that if you really want to implement server-side retry of queries in Druid, you'd need to buffer the results somewhere (perhaps the router, since that's the query entry point) and hold them until the query finishes. That allows the buffering daemon to retry the query, because the downstream client hasn't received any results yet.
But buffering also introduces the need to manage memory to avoid OOMs on the router/buffering daemon.
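The buffering idea, together with the memory concern, can be sketched as follows: a daemon holds all result rows until the query completes, discards them and retries from scratch on a mid-query failure, and refuses to buffer past a fixed budget. Everything here (class name, row type, the cap) is a hypothetical simplification, not an existing Druid component.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of server-side retry via buffering at the query entry point.
// Rows are held until the query finishes; a failure before completion
// clears the buffer so the query can be retried transparently, since the
// client has seen nothing yet. The cap models the OOM concern.
public class BufferingRetrySketch {
    private final List<String> buffer = new ArrayList<>();
    private final int maxRows;

    public BufferingRetrySketch(int maxRows) {
        this.maxRows = maxRows;
    }

    // Returns false once the memory budget is exhausted, at which point
    // transparent retry is no longer possible without spilling or failing.
    public boolean offer(String row) {
        if (buffer.size() >= maxRows) {
            return false;
        }
        buffer.add(row);
        return true;
    }

    // On a mid-query failure, drop everything and retry from scratch.
    public void resetForRetry() {
        buffer.clear();
    }

    public int buffered() {
        return buffer.size();
    }
}
```

The cap makes the trade-off visible: the larger the budget, the bigger the queries that can be retried transparently, and the greater the memory pressure on the buffering daemon.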

@churromorales
Contributor Author

@rohangarg Sorry for the late reply, and happy new year. Honestly, I don't think this is worth pursuing if we stream results back to the client. I hadn't realized that data is streamed not only from historicals to brokers, but also from the broker back to the client. I'll go ahead and close this PR if you agree.

@rohangarg
Member

@churromorales : Thanks for the response, and happy new year to you too! Yes, I agree that buffering results for server-side retry is generally not worth pursuing; it is a last-resort solution. We can close this PR if you think server-side retry doesn't provide enough value for the problem you're facing.
