New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix DISTINCT + LIMIT queries #1390
Fix DISTINCT + LIMIT queries #1390
Conversation
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
8a371c0
to
45f5602
Compare
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
45f5602
to
e57ebc2
Compare
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
c85d9b7
to
fbd2e8f
Compare
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
auto planNodeIdGenerator = std::make_shared<PlanNodeIdGenerator>(); | ||
core::PlanNodeId scanNodeId; | ||
|
||
CursorParameters params; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set params.maxDrivers = 4;
and it doesn't seem to terminate. Is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set params.maxDrivers, then you need to set params.numResultDrivers = 1. Would you try that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setting params.numResultDrivers = 1;
works.
numRead += vector->size(); | ||
} | ||
|
||
// Do not send no-more-splits message. Expect the task to finish without |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we should have a timeout here that throws an error. The current behavior does not terminate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice for tests to have a generic timeout functionality, but looks like GoggleTest doesn't offer that. Some folks suggested using timeout support in ctest.
https://cmake.org/cmake/help/v3.8/prop_test/TIMEOUT.html
Should we look into these options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried setting set_tests_properties(velox_exec_test PROPERTIES TIMEOUT 1)
and without params.numResultDrivers = 1;
. Strangely it did not timeout.
Probably adding a timeout can be a separate PR.
Summary: Limit operator finishes early if the limit is reached before the operator received all the input. In this case, the Driver closes all upstream operators and finished early. If this is the only Driver in the Task, the Task finishes early as well. However, when there are multiple pipelines, upstream pipelines need to be explicitly closed. This change uses Task::terminate to finish upstream pipelines early when output pipeline is finished and output has been consumed. This change applies only to un-grouped execution. With this change `SELECT DISTINCT c FROM t LIMIT 10` Presto queries finish quickly even when 't' is a very large table. Before this change such queries would finish only after all splits for 't' are enumerated and sent to workers, which could take minutes. A more generic change would be to introduce a mechanism similar to Task::terminate which can finish early only a subset of pipelines in a Task, not all pipelines. This mechanism then could be used with grouped execution to finish early pipelines only for a given split group. It could also be used to finish early upstream pipelines when a pipeline other than output pipeline finishes early. Pull Request resolved: facebookincubator#1390 Reviewed By: oerling Differential Revision: D35518213 Pulled By: mbasmanova fbshipit-source-id: 5b3bc695a1ce6abb6dfd41a665bc77f5ee8d4591
This pull request was exported from Phabricator. Differential Revision: D35518213 |
fbd2e8f
to
8ce382e
Compare
Summary: Limit operator finishes early if the limit is reached before the operator received all the input. In this case, the Driver closes all upstream operators and finished early. If this is the only Driver in the Task, the Task finishes early as well. However, when there are multiple pipelines, upstream pipelines need to be explicitly closed. This change uses Task::terminate to finish upstream pipelines early when output pipeline is finished and output has been consumed. This change applies only to un-grouped execution. With this change `SELECT DISTINCT c FROM t LIMIT 10` Presto queries finish quickly even when 't' is a very large table. Before this change such queries would finish only after all splits for 't' are enumerated and sent to workers, which could take minutes. A more generic change would be to introduce a mechanism similar to Task::terminate which can finish early only a subset of pipelines in a Task, not all pipelines. This mechanism then could be used with grouped execution to finish early pipelines only for a given split group. It could also be used to finish early upstream pipelines when a pipeline other than output pipeline finishes early. Pull Request resolved: facebookincubator#1390 Reviewed By: oerling Differential Revision: D35518213 Pulled By: mbasmanova fbshipit-source-id: 1cd754b315605d0b959440dff5d34508d0d6668b
…ng issue (facebookincubator#1390) What changes were proposed in this pull request? (Please fill in changes proposed in this fix) (Fixes: facebookincubator#1389) How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Limit operator finishes early if the limit is reached before the operator
received all the input. In this case, the Driver closes all upstream operators
and finished early. If this is the only Driver in the Task, the Task finishes early
as well. However, when there are multiple pipelines, upstream pipelines need
to be explicitly closed.
This change uses Task::terminate to finish upstream pipelines early when
output pipeline is finished and output has been consumed. This change
applies only to un-grouped execution. With this change
SELECT DISTINCT c FROM t LIMIT 10
Presto queries finish quickly evenwhen 't' is a very large table. Before this change such queries would finish
only after all splits for 't' are enumerated and sent to workers, which could
take minutes.
A more generic change would be to introduce a mechanism similar to
Task::terminate which can finish early only a subset of pipelines in a Task,
not all pipelines. This mechanism then could be used with grouped execution
to finish early pipelines only for a given split group. It could also be used to
finish early upstream pipelines when a pipeline other than output pipeline
finishes early.