ARROW-17556: [C++] Unbound scan projection expression leads to all fields being loaded #14264
Conversation
cc @westonpace this is still WIP, but I'd appreciate some comments from you on the core change made in this PR.
@@ -1990,5 +2021,77 @@ TEST(ScanNode, MinimalGroupedAggEndToEnd) {
  AssertTablesEqual(*expected, *sorted.table(), /*same_chunk_layout=*/false);
}

TEST(ScanNode, DiskScanIssue) {
We should be clearer in this test case, either in the naming or in comments, about what the purpose is. DiskScanIssue is vague. The goal here is to prove that the scan node doesn't read in columns that are not included in the project expression.
I agree; the name was chosen without much thought. I have updated it.
cpp/src/arrow/dataset/scanner.cc
// IsName() to be true).

// process resultant dataset_schema after projection
std::shared_ptr<Schema> projected_schema;
There is a lot of duplication with the path above (e.g. when the schema is bound). I wonder if there is some way to simplify the two paths. Right now it looks like:
"If we have a bound expression we use the types and names from the expression nodes to form the schema"
and
"If we have an unbound expression we use the names from the expression to find fields in the dataset schema and get the types from there."
Perhaps the second approach would work in both cases (e.g. we could grab fields from the dataset schema even when the expression is bound)?
Yes, that works, at least for the test cases in C++, R, and Python on my machine. I have updated the code. Let's see how it goes in the CIs.
@westonpace updated the PR. Am I missing any corner cases? Should we include more tests?
Ok, I took some time to look through this today. I think this is a good approach until we get the new scan node. Thanks for figuring out what works. I have a few cleanup suggestions.
@westonpace thank you for the suggestions. I will complete this today.
Ok, I took some time to look through this today. I think this is a good approach until we get the new scan node. Thanks for figuring out what works. I have a few cleanup suggestions.
Sorry for the double post. Got my tabs confused :)
That's okay 👍
@westonpace I updated the PR, let's wait for the CIs.
CI is green, can we merge?
@westonpace should we take another look at this? WDYT?
Benchmark runs are scheduled for baseline = 82c26c8 and contender = 8972ebd. 8972ebd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
This PR is still a work in progress, but the initial idea is ready for review, to gather feedback and streamline development of any missing pieces.