ARROW-17556: [C++] Unbound scan projection expression leads to all fields being loaded #14264
Conversation
cc @westonpace this is still WIP, but I'd appreciate some comments from you on the core change made in this PR.
@@ -1990,5 +2021,77 @@ TEST(ScanNode, MinimalGroupedAggEndToEnd) {
  AssertTablesEqual(*expected, *sorted.table(), /*same_chunk_layout=*/false);
}

TEST(ScanNode, DiskScanIssue) {
We should be clearer in this test case, either in the naming or in comments, about what the purpose is. DiskScanIssue is vague. The goal here is to prove that the scan node doesn't read in columns that are not included in the project expression.
I agree; the name was chosen without much thought. I have updated it.
cpp/src/arrow/dataset/scanner.cc
// IsName() to be true).

// process resultant dataset_schema after projection
std::shared_ptr<Schema> projected_schema;
There is a lot of duplication with the path above (e.g. when the schema is bound). I wonder if there is some way to simplify the two paths. Right now it looks like:
"If we have a bound expression we use the types and names from the expression nodes to form the schema"
and
"If we have an unbound expression we use the names from the expression to find fields in the dataset schema and get the types from there."
Perhaps the second approach would work in both cases (e.g. we could grab fields from the dataset schema even when the expression is bound)?
Yes, that works, at least for the test cases in C++, R, and Python on my machine. I have updated the code. Let's see how it goes in the CIs.
@westonpace updated the PR. Am I missing any corner cases? Should we include more tests?
Ok, I took some time to look through this today. I think this is a good approach until we get the new scan node. Thanks for figuring out what works. I have a few cleanup suggestions.
@westonpace thank you for the suggestions. I will complete this today.
Ok, I took some time to look through this today. I think this is a good approach until we get the new scan node. Thanks for figuring out what works. I have a few cleanup suggestions.
Sorry for the double post. Got my tabs confused :)
That's okay 👍
@westonpace I updated the PR, let's wait for the CIs.
CI is green, can we merge?
@westonpace should we take another look at this? WDYT?
Benchmark runs are scheduled for baseline = 82c26c8 and contender = 8972ebd. 8972ebd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
This PR is still a work in progress, but the initial idea is ready for review, to gather feedback and streamline development of any missing pieces.