Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add exception to broker response when not all segments are available (partial response) #7264

Closed
Jackie-Jiang opened this issue Aug 9, 2021 · 22 comments · Fixed by #11592
Closed
Labels
beginner-task Small task for new contributors to ramp up

Comments

@Jackie-Jiang
Copy link
Contributor

There are 3 scenarios when the broker response might be partial:

  1. There are some segments without online server in external view - numUnavailableSegments in BaseBrokerRequestHandler line 718
  2. Some segments are not available on server - numSegmentsQueried > numSegmentsAcquired in ServerQueryExecutorV1Impl line 170
  3. Some servers not responded - numServersQueried > numServersResponded in SingleConnectionBrokerRequestHandler line 128

Currently these partial responses are only tracked by metrics/query stats, but not modeled as an exception. We should add an exception to the broker response to inform the users that the response might be partial

@Jackie-Jiang Jackie-Jiang added the beginner-task Small task for new contributors to ramp up label Aug 9, 2021
@mcvsubbu
Copy link
Contributor

mcvsubbu commented Aug 9, 2021

There is an indication that the response is partial. Why add an exception?
If there is a need for the reason, then we can add a "reason" field ?

@Jackie-Jiang
Copy link
Contributor Author

We don't really have that for case 1 and 2. For case 3, we should not expect normal users to check numServersQueried and numServersResponded and compare them to tell whether the result is partial. (The partialResponse flag is in LinkedIn internal code IIRC)

@mcvsubbu
Copy link
Contributor

mcvsubbu commented Aug 9, 2021

We don't really have that for case 1 and 2. For case 3, we should not expect normal users to check numServersQueried and numServersResponded and compare them to tell whether the result is partial. (The partialResponse flag is in LinkedIn internal code IIRC)

If partial response flag is not set in all cases, then that is a bug. Let us definitely fix it.

I think applications check for exceptions and assume that there is no result. So, exceptions when there are partial results may not be the right answer.

@Jackie-Jiang
Copy link
Contributor Author

We don't have that flag in OSS. IIRC the logic for that flag is check if numServersQueried matches numServersResponded and whether there are exceptions.

We don't model exceptions this way. Having exceptions does not indicate no result. It actually indicates the results are not complete (e.g. one segment throws exception, others work fine). When broker failed to route the query to a server, it also adds an exception.

@yupeng9
Copy link
Contributor

yupeng9 commented Aug 11, 2021

+1

I also suggest we change the default behavior to disallow partial response by default, and provide the flag to turn it on.
In many use cases, data quality and correctness is a serious issue. So it's ok to fail the query but not so to return a partial result silently.

@suddendust
Copy link
Contributor

I think the fundamental concern here is that partial results might be returned without the users taking cognisance of them. While having a flag like partialResponse solves to an extent, it is still prone to being missed by the user (many users don't even look at the response stats if the query was fast enough). I agree with @yupeng9 that the default behaviour in such cases should be failing the query. If the user has explicitly specified that he is okay with partial responses, then it can be indicated in a flag to the user.

So this entails three changes:

  1. Throwing exception from the broker in all the three cases.
  2. Indicating the user about partial response in all the three cases.
  3. Changing the default query behaviour.

@suddendust
Copy link
Contributor

@mcvsubbu @Jackie-Jiang do we have an agreement on this requirement? If yes, I can take it up.

@amrishlal
Copy link
Contributor

Partial responses and approximate results would be problematic in OLTP databases, but they make full sense in an analytical database such as Pinot and should be fully supported as first class citizens without throwing error or exception conditions. As an analytical database, there is value in providing partial responses or approximate results when full responses or accurate results are either missing or will take too long to generate provided that we can quantify how partial or approximate the results are. We already do a lot of this with approximation functions such as DISTINCTCOUNTHLL, high-cardinality group by estimates, so there shouldn't be anything wrong with providing partial responses as long as it can be quantified as to how partial the responses are. These should be treated different from exceptions, errors, and warnings.

@Jackie-Jiang
Copy link
Contributor Author

@amrishlal Partial and approximation are different, where approximation usually gives bounded error rate. The cases we want to notify the user here is when some segments are not available (e.g. all replica down), or some servers fail to respond (e.g. timeout). We will still return the response, but put some message in the query response so that user knows the result is not complete.
I don't see how this kind of problem is different from exceptions though. If some segments throw exception on a given query, we can also treat it as partial because it is essentially the same as all replicas down for these segments. User can choose to ignore the exceptions and use whatever returned in the response.

@amrishlal
Copy link
Contributor

The cases we want to notify the user here is when some segments are not available (e.g. all replica down), or some servers fail to respond (e.g. timeout).

I can easily see a case where a user may specify a timeout for a long running query as a way to get partial results early without waiting for those one or two servers that may take extra long time to process the results. In cases like these, one could set a warning in query results and try to quantify how accurate the results might be (i.e result is based on responses of 9 out of 11 segments, etc).

I don't see how this kind of problem is different from exceptions though. If some segments throw exception on a given query, we can also treat it as partial because it is essentially the same as all replicas down for these segments.

I think it's worth distinguishing between success, warning, and errors states (some databases do this quite well with large list of success, error, and warning codes that are returned along with results). The problem with throwing exception is that it is commonly used for error conditions, tends to be thrown arbitrarily at times so one can't usually distinguish between warning and errors, and requires special handling similar to errors conditions, so users would end up treating warnings as exceptions. Even in case of approximate functions such as DISTINCTCOUNTHLL it may be a good idea to set a warning saying that results are approximate and not definite.

@suddendust
Copy link
Contributor

suddendust commented Aug 17, 2021

Purely from the user's perspective, this certainly looks like a warning than an error ("Here are your results but be warned that these are partial"). I think regardless of how we do it in code, a flag indicating this for all the three cases above would be useful. Even now as @Jackie-Jiang pointed out, the user has to check numServersQueried and numServersResponded and compare them to tell if the result is partial. This is unnecessarily complex.

@yupeng9
Copy link
Contributor

yupeng9 commented Aug 17, 2021

I think the key here is to inform users and give them the options to choose the desirable behavior. For your example, users can choose DISTINCTCOUNT vs DISTCINTCOUNTHLL for the exact result or approximation. Similarly, users can also choose not to take partial results. For analytical databases, there are use cases that demand exact results, cannot accept partial results. Just to give one example, there are use cases at Uber that pinot results are used in the billing processing and accounting.

@suddendust
Copy link
Contributor

suddendust commented Aug 19, 2021

@Jackie-Jiang shall we start with adding a partial response flag here?

@Jackie-Jiang
Copy link
Contributor Author

I'd suggest adding different errors for each scenario described above to the QueryException, and users can decide on how to proceed based on the error type (error code). We already have similar errors such as BROKER_REQUEST_SEND_ERROR_CODE which is added when a server cannot be reached

@jackjlli
Copy link
Contributor

Not all the partial response is useless or harmful. There is another scenario that falls into partial response, which is that isNumGroupsLimitReached() from brokerResponse returns true. In this case, user can still leverage the response of the query as it still provides insights from the response.

@suddendust
Copy link
Contributor

suddendust commented Aug 20, 2021

@Jackie-Jiang Yes in addition to the specific error codes that you're suggesting to add, do you think an explicit partialResponse = true flag would be helpful for the users? Especially for users who do not want partial responses, it'll be very easy for them to discard the results of a query just by looking at this flag.

@Jackie-Jiang
Copy link
Contributor Author

We need to think about how to set this partialResponse flag. What scenarios should be count as partial

@yupeng9
Copy link
Contributor

yupeng9 commented Aug 21, 2021

Can we add it as a query option (allowPartial)?

@Jackie-Jiang
Copy link
Contributor Author

@yupeng9 Current pinot behavior will always return the available results, and set exceptions for unexpected scenarios such as unable to reach to a server. For now I would suggest following it, and the client can decide whether to retry based on the exceptions received.
This issue is mainly to discuss the cases not captured

@suddendust
Copy link
Contributor

I have taken up the exceptions part @Jackie-Jiang. Let us think about the partialResponse flag post this.

@suddendust
Copy link
Contributor

I have been a bit busy throughout last week, will raise a PR for this soon.

@suddendust
Copy link
Contributor

PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beginner-task Small task for new contributors to ramp up
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants