Retry query automatically for failed historicals #13498
churromorales wants to merge 5 commits into apache:master
|
@churromorales : Thanks a lot for taking a stab at this! I wanted to confirm, though: does this change handle the problem Gian mentioned in #5709, the issue tagged in the description? (I have copied the relevant parts):
I think |
|
@rohangarg I see your point. If a historical gives partial data back and then disconnects, you will end up re-querying segments whose results may already have been returned, which can produce incorrect results. That is no good. I have another potential fix, but wanted to go over it with you first. I think a full query retry is okay in this scenario. What do you think about that?
…etryOnDisconnect set in the Query context
|
@churromorales Thanks for the response!
Yes, a full query retry seems like the only option on a disconnect. Are you thinking of doing it on the server side, or from the client by returning a clear, specific error?
|
@rohangarg check out the latest PR - I updated it to retry the entire query (good point on the partial results), namely start at the |
|
@churromorales : Thanks for the context! I skimmed through the implementation and I think you're trying to implement full query retry through the |
|
@churromorales : @imply-cheddar and I were also discussing this today. We think that if you really want to implement server-side retry of queries in Druid, you'd need to buffer the results somewhere (maybe the router, since that's the query entry point) and hold them until the query finishes. That lets you retry the query from the buffering daemon, because the downstream client hasn't received any results yet.
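To make the buffering idea above concrete, here is a minimal sketch, not Druid code. `BufferedRetrySketch`, `StreamingQuery`, and `runWithRetry` are hypothetical names invented for illustration. It assumes a query that pushes rows to a sink and may fail partway through; rows are held in a per-attempt buffer and only handed to the caller once an attempt completes, so a failed attempt can be retried without the client ever seeing partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BufferedRetrySketch {
    /** Hypothetical query: pushes rows to a sink and may throw partway through. */
    public interface StreamingQuery {
        void execute(Consumer<String> sink);
    }

    /**
     * Runs the query up to maxAttempts times. Rows are collected in a fresh
     * buffer on every attempt and returned only when an attempt finishes
     * without an exception, so rows from a failed attempt are never released.
     */
    public static List<String> runWithRetry(StreamingQuery query, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            List<String> buffer = new ArrayList<>(); // discard rows from earlier attempts
            try {
                query.execute(buffer::add);
                return buffer; // the whole result is available; safe to release
            } catch (RuntimeException e) {
                last = e; // e.g. a server disconnected mid-stream
            }
        }
        throw last != null ? last : new IllegalArgumentException("maxAttempts must be >= 1");
    }
}
```

The cost is visible in the sketch: memory grows with the full result size, and the caller sees no rows until the query has finished.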
|
@rohangarg sorry for the late reply, and happy new year. I didn't realize that data is not only streamed from historicals to brokers, but also streamed back to the client. Given that, I honestly don't think this is worth pursuing, and I'll go ahead and close this PR if you agree.
|
@churromorales : Thanks for the response, and happy new year to you too! Yes, I agree that buffering results for server-side retry is generally not worth pursuing. It is a last-resort solution, so we can close this PR if you think server-side retry doesn't provide enough value for the problem you're facing.
Should address: #5709
This is a problem we experienced. When nodes go down unexpectedly, in-flight queries fail. We want Druid to retry these segments for the failed historicals, since we have replication > 1.
This adds the failed historical's segments to the context as missing segments when a channel disconnect happens, so that the RetryQueryRunner can retry the query for them.
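The idea can be sketched roughly as follows. This is a simplified illustration and not the code in this PR or Druid's actual API: `MissingSegmentRetrySketch`, `ChannelDisconnected`, `run`, and the `fetch` function are all hypothetical. It assumes each server is assigned a list of segments, that a disconnect surfaces as an exception, and that every segment has a replica on another server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class MissingSegmentRetrySketch {
    /** Hypothetical signal that a server's channel dropped mid-query. */
    public static class ChannelDisconnected extends RuntimeException {}

    /**
     * Sums per-segment results across servers.
     * assignments: server -> segment ids it should serve.
     * replicas:    segment id -> fallback server holding a replica.
     * fetch:       (server, segment) -> result for that segment (hypothetical).
     */
    public static long run(Map<String, List<String>> assignments,
                           Map<String, String> replicas,
                           BiFunction<String, String, Long> fetch) {
        long total = 0;
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : assignments.entrySet()) {
            long serverTotal = 0;
            try {
                for (String segment : entry.getValue()) {
                    serverTotal += fetch.apply(entry.getKey(), segment);
                }
                total += serverTotal;
            } catch (ChannelDisconnected e) {
                // Drop everything this server returned and report all of its
                // segments as missing so they can be retried elsewhere.
                missing.addAll(entry.getValue());
            }
        }
        // Retry step: re-query each missing segment on its replica.
        for (String segment : missing) {
            total += fetch.apply(replicas.get(segment), segment);
        }
        return total;
    }
}
```

Note that the sketch only stays correct because the failed server's partial results are discarded before anything is merged; if some of them had already been merged or streamed onward, retrying the same segments would count them twice.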
I wanted to put up this PR and get some eyes on it first, since testing this is a bit tricky and will require some refactoring. Before I go down the refactoring path, I want to make sure I'm not doing something outlandish. If this looks good to folks, I can try to add some tests.