Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the bug of hybrid table request using the same request id #9443

Merged
merged 1 commit into from Sep 21, 2022

Conversation

Jackie-Jiang
Copy link
Contributor

Currently when querying a hybrid table, 2 requests (one for OFFLINE and one for REALTIME) are sent to the servers, but with the same request id. It can cause problem for 2 modules:

  • Tracing
  • Query cancellation

This PR fixes the bug by using negative request id for REALTIME request in hybrid table.

LOGGER.debug("Keep track of running query: {}", requestId);
queryServers.addServers(offlineRoutingTable, realtimeRoutingTable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one reason to use addServers() was to accumulate all servers involved when running a query, which might do some subqueries (thus recursively calling this handleRequest() method (where I assumed different routing tables might be created for some subqueries...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and that is the reason why I also moved the _queriesById.remove(requestId) into the handleRequest() to avoid this scenario

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense!

if (queryServers._realtimeRoutingTable != null) {
// NOTE: When the query is sent to both OFFLINE and REALTIME table, the REALTIME one has negative request id to
// differentiate from the OFFLINE one
long realtimeRequestId = queryServers._offlineRoutingTable == null ? requestId : -requestId;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🆒

@codecov-commenter
Copy link

codecov-commenter commented Sep 21, 2022

Codecov Report

Merging #9443 (ac72549) into master (7be06af) will increase coverage by 0.10%.
The diff coverage is 58.00%.

@@             Coverage Diff              @@
##             master    #9443      +/-   ##
============================================
+ Coverage     69.76%   69.86%   +0.10%     
  Complexity     5098     5098              
============================================
  Files          1890     1890              
  Lines        100934   100945      +11     
  Branches      15347    15352       +5     
============================================
+ Hits          70420    70529     +109     
+ Misses        25541    25447      -94     
+ Partials       4973     4969       -4     
Flag Coverage Δ
integration1 26.07% <20.00%> (+0.17%) ⬆️
integration2 24.65% <32.00%> (-0.09%) ⬇️
unittests1 67.10% <44.44%> (+<0.01%) ⬆️
unittests2 15.34% <32.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...che/pinot/core/query/scheduler/QueryScheduler.java 85.90% <ø> (ø)
...e/pinot/core/transport/InstanceRequestHandler.java 64.84% <0.00%> (ø)
...c/main/java/org/apache/pinot/spi/trace/Tracer.java 0.00% <0.00%> (ø)
...roker/requesthandler/BaseBrokerRequestHandler.java 69.96% <51.42%> (-0.29%) ⬇️
...roker/requesthandler/GrpcBrokerRequestHandler.java 85.10% <83.33%> (-1.86%) ⬇️
...he/pinot/common/utils/grpc/GrpcRequestBuilder.java 90.90% <100.00%> (+12.12%) ⬆️
...che/pinot/core/query/reduce/BaseReduceService.java 97.39% <100.00%> (+2.08%) ⬆️
...a/org/apache/pinot/core/transport/QueryRouter.java 82.14% <100.00%> (+0.43%) ⬆️
.../common/request/context/predicate/EqPredicate.java 92.30% <0.00%> (-7.70%) ⬇️
...ion/function/DistinctCountAggregationFunction.java 50.22% <0.00%> (-4.04%) ⬇️
... and 35 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@jackjlli jackjlli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Do we also need the code change on canceling the request, since after this PR realtime servers may start to serve some queries with negative requestId?

@Jackie-Jiang
Copy link
Contributor Author

Good catch! Do we also need the code change on canceling the request, since after this PR realtime servers may start to serve some queries with negative requestId?

@jackjlli Yes, the query cancellation logic is also revised to cancel the correct request id on the server

@siddharthteotia siddharthteotia merged commit b6f3331 into apache:master Sep 21, 2022
@Jackie-Jiang Jackie-Jiang deleted the unique_request_id branch September 21, 2022 08:12
61yao pushed a commit to 61yao/pinot that referenced this pull request Oct 3, 2022
@jadami10
Copy link
Contributor

Is there a reason we needed to do it this way (negative ids)? Most log search technologies folks use will expect KV filter like requestId=123. This will by default filter out requestId=-123 which I think is a pretty big regression. Why not add a new dimension to the log like tableType. That way folks know whether this was the realtime/offline side, we filter in by default, and now you can filter/stratify by offline/realtime as needed.

@Jackie-Jiang
Copy link
Contributor Author

@jadami10 Good point. The reason why we choose to fix the issue by using negative id is because that won't change the transport object between brokers and servers, which is backward-incompatible change. On the server side, we may check whether the request-id is positive and log it accordingly.

@jadami10
Copy link
Contributor

that won't change the transport object between brokers and servers

does that matter since it was already working before? The description states the components it's trying to fix but not really what was broken, so it's a little tough for me to understand what the problem is in the first place.

@Jackie-Jiang
Copy link
Contributor Author

@jadami10 The problem is that for tracing and query cancellation, we use request id as the key to track the queries running on the server. For hybrid table, 2 queries are sent out, one for the OFFLINE table and one for the REALTIME table. If both tables are served on the same server, the 2 queries will override the entry for each other, and only one query can be tracked.
You didn't run into this issue probably because the OFFLINE table and REALTIME table is never running on the same server.

@jadami10
Copy link
Contributor

That makes sense. In the case where we have 1 server serving both parts, does it/can it know which query is which part without passing more info from the broker? That way we could append tableType in the logs directly in the server rather than doing anything to the requestId.

Is there actually a use case where someone wants to cancel a request id for just 1 part of the query? That feels strange since the broker will then return an exception anyway.

@Jackie-Jiang
Copy link
Contributor Author

That is possible by checking the table name. Let me think if we can solve it on server itself.

Currently we don't allow cancelling only one request. From endpoint perspective, it will be the same, where user still always give the positive request id, and broker automatically re-write it to negative one if necessary. The issue is that the logger might log the negative request id on the server side.

@Jackie-Jiang
Copy link
Contributor Author

@jadami10 Filed #9648 to handle this on the server side

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants