Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17556: [C++] Unbound scan projection expression leads to all fields being loaded #14264

Merged
merged 8 commits into from
Oct 14, 2022

Conversation

vibhatha
Copy link
Collaborator

This PR is still working in progress, but the initial idea is ready for a review to get some feedback to streamline developing possible missing pieces.

@vibhatha
Copy link
Collaborator Author

cc @westonpace this is still WIP, but appreciate some comments from you on the core change made in this PR.

@@ -1990,5 +2021,77 @@ TEST(ScanNode, MinimalGroupedAggEndToEnd) {
AssertTablesEqual(*expected, *sorted.table(), /*same_chunk_layout=*/false);
}

TEST(ScanNode, DiskScanIssue) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be clearer in this test case, either in the naming or in comments, about what the purpose is. DiskScanIssue is vague. The goal here is to prove that the scan node doesn't read in columns that are not included in the project expression.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this was just a name selected without giving much thought. And I have updated the name.

// IsName() to be true).

// process resultant dataset_schema after projection
std::shared_ptr<Schema> projected_schema;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of duplication with the path above (e.g. when the schema is bound). I wonder if there is some way to simplify the two paths. Right now it looks like:

"If we have a bound expression we use the types and names from the expression nodes to form the schema"

and

"If we have an unbound expression we use the names from the expression to find fields in the dataset schema and get the types from there."

Perhaps the second approach would work in both cases (e.g. we could grab fields from the dataset schema even when the expression is bound)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was possible at least in the test cases in C++, R and Python on my machine. I have updated the code. Let's see how it goes in the CIs.

@vibhatha vibhatha marked this pull request as ready for review October 5, 2022 13:57
@vibhatha
Copy link
Collaborator Author

vibhatha commented Oct 5, 2022

@westonpace updated the PR. Missing any corners? Should we include more tests?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I took some time to look through this today. I think this is a good approach until we get the new scan node. Thanks for figuring out what works. I have a few cleanup suggestions.

cpp/src/arrow/dataset/scanner.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/scanner_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/scanner_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/scanner_test.cc Outdated Show resolved Hide resolved
@vibhatha
Copy link
Collaborator Author

@westonpace thank you for the suggestions. I will complete this today.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I took some time to look through this today. I think this is a good approach until we get the new scan node. Thanks for figuring out what works. I have a few cleanup suggestions.

@westonpace
Copy link
Member

Sorry for the double post. Got my tabs confused :)

@vibhatha
Copy link
Collaborator Author

Sorry for the double post. Got my tabs confused :)

That’s okay 👍

@vibhatha
Copy link
Collaborator Author

@westonpace I updated the PR, let's wait for the CIs.

@nealrichardson
Copy link
Member

CI is green, can we merge?

@vibhatha
Copy link
Collaborator Author

CI is green, can we merge?

@westonpace should we take another look at this? WDYT?

@westonpace westonpace merged commit 8972ebd into apache:master Oct 14, 2022
@ursabot
Copy link

ursabot commented Oct 16, 2022

Benchmark runs are scheduled for baseline = 82c26c8 and contender = 8972ebd. 8972ebd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️2.78% ⬆️15.56%] test-mac-arm
[Failed ⬇️1.36% ⬆️7.07%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 8972ebd8 ec2-t3-xlarge-us-east-2
[Failed] 8972ebd8 test-mac-arm
[Finished] 8972ebd8 ursa-i9-9960x
[Finished] 8972ebd8 ursa-thinkcentre-m75q
[Finished] 82c26c8e ec2-t3-xlarge-us-east-2
[Failed] 82c26c8e test-mac-arm
[Failed] 82c26c8e ursa-i9-9960x
[Finished] 82c26c8e ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Oct 16, 2022

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants