Retry query automatically for failed historicals #13498

Closed
churromorales wants to merge 5 commits into apache:master from churromorales:retry_failed_historical

Conversation

@churromorales
Contributor

Should address: #5709

This is a problem we experienced. When nodes go down unexpectedly, in-flight queries fail. We want Druid to retry the segments served by the failed historicals, since we have replication > 1.

This change adds the missing segments to the context when a channel disconnect happens, so the RetryQueryRunner can retry the query for them.
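To make the mechanism concrete, here is a minimal toy sketch of the idea: on a channel disconnect, the segments that were being served by the failed historical get recorded in the query's response context, and a retry runner later reads that list to decide what to re-query. All names here (`ResponseContextSketch`, `MISSING_SEGMENTS`, the method signatures) are illustrative assumptions, not Druid's actual classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Druid code: record segments lost to a
// disconnect so a retry runner can re-issue the query for just those.
public class ResponseContextSketch {
    public static final String MISSING_SEGMENTS = "missingSegments";

    private final Map<String, List<String>> context = new HashMap<>();

    // Called when the connection to a historical drops mid-query.
    public void onChannelDisconnect(List<String> segmentsOnFailedServer) {
        context.computeIfAbsent(MISSING_SEGMENTS, k -> new ArrayList<>())
               .addAll(segmentsOnFailedServer);
    }

    // The retry runner consults this list to decide what to re-query.
    public List<String> segmentsToRetry() {
        return context.getOrDefault(MISSING_SEGMENTS, List.of());
    }
}
```

In real Druid the missing-segment list travels back to the broker in the `X-Druid-Response-Context` header; this sketch only models the bookkeeping, not the transport.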

I wanted to put up this PR and get some eyes on it first, as testing this is a bit tricky and will require some refactoring. Before I go down the refactoring path, I want to make sure I'm not doing something outlandish. If this looks good to folks, I can try to add some tests.

@rohangarg
Member

@churromorales : Thanks a lot for taking a stab at this! I wanted to confirm whether this change handles a problem Gian mentioned in the issue tagged in the description, #5709 (I have copied the relevant parts):

Another thing to think about is that it is possible for results to be partially retrieved (and partially processed) and then for the subquery to fail midway through (before all results have come in). In this case, it's probably not possible for the broker to recover, since subquery results have already been mixed into the overall query results. The query may need to be retried from scratch. (ref : #5709 (comment))

IIRC, RetryQueryRunner only retries in a specific situation: when a segment has moved to another server since the time the broker made the query. Specifically it looks at the X-Druid-Response-Context header for the key that lists missing segments (as reported by ReportTimelineMissingSegmentQueryRunner). I don't think it retries on errors or anything else like that. So it shouldn't have those issues I mentioned, because it's not as general as what this issue aims to accomplish. (ref : #5709 (comment))

I think channelDisconnect can also arrive after some results have been consumed from the network stream, so the broker may already have consumed those partial results and merged them with results from other historicals. In that case, I'm not sure we can identify which segments to retry the query for; as Gian mentioned, the whole query would possibly need to be retried in that case. Please let me know your thoughts!
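The hazard described above can be shown with a toy model: if the broker merged some rows from a historical before the disconnect, a naive segment-level retry replays the whole segment and double-counts the rows already consumed. This is purely an illustration of the failure mode, not Druid code; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the partial-results hazard: re-querying a segment whose
// rows were partially consumed before a disconnect duplicates those rows.
public class PartialRetryHazard {
    public static List<Integer> mergeWithNaiveRetry(
            List<Integer> rowsConsumedBeforeDisconnect,
            List<Integer> fullSegmentRows) {
        List<Integer> merged = new ArrayList<>(rowsConsumedBeforeDisconnect);
        // A naive segment-level retry replays the entire segment, so the
        // rows consumed before the disconnect appear a second time.
        merged.addAll(fullSegmentRows);
        return merged;
    }
}
```

With two of three rows already consumed, the merged result holds five rows instead of three, which is exactly the incorrectness being discussed.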

@churromorales
Contributor Author

@rohangarg I see your point. If a historical returns partial data and then disconnects, re-querying its segments could return incorrect results. That is no good. I have another potential fix, but wanted to go over it with you first.

I think a full query retry is okay in this scenario.
On a channel disconnect we could simply re-query everything. I can add a special flag to the context, and when it is set, the RetryQueryRunner re-queries the missing intervals rather than the missing segments.

What do you think about that?
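The proposed flag might look something like the sketch below: a context entry that, when set after a disconnect, tells the retry logic to discard partial results and re-run the full query instead of retrying individual segments. The flag name and helper are hypothetical, not an actual Druid context key.

```java
import java.util.Map;

// Hypothetical sketch of the proposed context flag. When a disconnect is
// detected the flag is set, and the retry runner checks it to decide
// between segment-level retry and a full re-query of the intervals.
public class FullRetryFlagSketch {
    public static final String RETRY_FULL_QUERY = "retryFullQuery";

    public static boolean shouldRetryFullQuery(Map<String, Object> queryContext) {
        return Boolean.TRUE.equals(queryContext.get(RETRY_FULL_QUERY));
    }
}
```

Keeping the decision in the context keeps the retry policy out of the per-segment bookkeeping, which matches the shape of the suggestion above.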

@rohangarg
Member

@churromorales Thanks for the response!

On a channel disconnect we could just do a re-query of everything. I think I can add a special flag in the context, and when that gets set the RetryRunner queries the missing intervals, not the missing segments again.

Yes, a full query retry seems like the only option on a disconnect. Are you thinking of doing it on the server side, or from the client by returning a clear, specific error?
Regarding the missing-intervals part, do you mean all the intervals or only some of them? Sorry, I haven't fully understood that part yet.

@churromorales
Contributor Author

@rohangarg Check out the latest PR. I updated it to retry the entire query (good point on the partial results), starting at the RetryQueryRunner and going from there. Let me know if you have any questions. Thanks for getting back to me on this.

@rohangarg
Member

@churromorales : Thanks for the context! I skimmed through the implementation, and it looks like you're implementing full query retry through the RetryQueryRunner. Is that right? If so, one question: just as the broker can consume partial results from a historical and then see a disconnect, the client may also have read partial results from the broker before a historical disconnects. If we then retry the whole query, the client may receive duplicate results, or an error due to the truncated results from the first failed attempt. Is that something you've considered in your solution?

@rohangarg
Member

@churromorales : @imply-cheddar and I were also discussing this today. We think that if you really want to implement server-side retry of queries in Druid, you'd need to buffer the results somewhere (perhaps the router, since that's the query entry point) and hold them until the query finishes. That allows the buffering daemon to retry the query, because the downstream client hasn't received any results yet.
But buffering also introduces the need to manage memory to avoid OOMs on the router/buffering daemon.
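The buffering idea, together with the memory concern, can be sketched as follows: a daemon holds all result rows until the query completes, discards them and retries from scratch on a mid-query failure, and refuses to buffer past a fixed budget. Everything here (class name, row type, the cap) is a hypothetical simplification, not an existing Druid component.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of server-side retry via buffering at the query entry point.
// Rows are held until the query finishes; a failure before completion
// clears the buffer so the query can be retried transparently, since the
// client has seen nothing yet. The cap models the OOM concern.
public class BufferingRetrySketch {
    private final List<String> buffer = new ArrayList<>();
    private final int maxRows;

    public BufferingRetrySketch(int maxRows) {
        this.maxRows = maxRows;
    }

    // Returns false once the memory budget is exhausted, at which point
    // transparent retry is no longer possible without spilling or failing.
    public boolean offer(String row) {
        if (buffer.size() >= maxRows) {
            return false;
        }
        buffer.add(row);
        return true;
    }

    // On a mid-query failure, drop everything and retry from scratch.
    public void resetForRetry() {
        buffer.clear();
    }

    public int buffered() {
        return buffer.size();
    }
}
```

The cap makes the trade-off visible: the larger the budget, the bigger the queries that can be retried transparently, and the greater the memory pressure on the buffering daemon.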

@churromorales
Contributor Author

@rohangarg Sorry for the late reply, and happy new year. Honestly, I don't think this is worth pursuing if we stream results back to the client. I hadn't realized that data is streamed not only from historicals to brokers, but also from the broker back to the client. I'll go ahead and close this PR if you agree.

@rohangarg
Member

@churromorales : Thanks for the response, and happy new year to you too! Yes, I agree that buffering results for server-side retry is generally not worth pursuing; it is a last-resort solution. We can close this PR if you think server-side retry doesn't provide enough value for the problem you're facing.
