Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-14939: [C++] Support Table lookups in FieldRef and FieldPath #34537

Merged

Conversation

dinimar
Copy link
Contributor

@dinimar dinimar commented Mar 11, 2023

Rationale for this change

Described in the issue

What changes are included in this PR?

  • added implementations for FieldPath::Get(const Table& table) and FindAll(const Table& table)
  • added unit tests for functions mentioned above both for Table and RecordBatch classes

Are these changes tested?

Yes, by unit tests

Are there any user-facing changes?

Most probably, no

- add overloaded implementations of FieldRef::FindAll and FieldPath::Get for Table type
- add unit tests for FieldRef::FindAll and FieldPath::Get both for RecordBatch and Table types
@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@dinimar dinimar changed the title GH014939: [C++] Support Table lookups in FieldRef and FieldPath GH-14939: [C++] Support Table lookups in FieldRef and FieldPath Mar 11, 2023
@github-actions
Copy link

@dinimar dinimar marked this pull request as ready for review March 11, 2023 11:47
@dinimar
Copy link
Contributor Author

dinimar commented Mar 11, 2023

@benibus please review

@dinimar
Copy link
Contributor Author

dinimar commented Mar 12, 2023

cc @rok @pitrou @benibus

Copy link
Collaborator

@benibus benibus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dinimar! I just did a first-pass and left some comments.

cpp/src/arrow/type.cc Outdated Show resolved Hide resolved
cpp/src/arrow/type.cc Outdated Show resolved Hide resolved
cpp/src/arrow/type.cc Outdated Show resolved Hide resolved
cpp/src/arrow/type_test.cc Show resolved Hide resolved
cpp/src/arrow/type_test.cc Outdated Show resolved Hide resolved
@benibus
Copy link
Collaborator

benibus commented Mar 23, 2023

@dinimar I pushed a commit that reverts some of the formatting. I'm not entirely sure why the auto-formatter is still picking up arrow/compute/kernels/scalar_string_test.cc - probably best to leave it for now.

I'm going to add some more commits addressing the nested-field tests soon. Then I'll do another round of review :-)

@benibus
Copy link
Collaborator

benibus commented Mar 23, 2023

@zeroshade @westonpace Looking at this a little deeper... do you know if FieldPath::Get is meant to be zero-copy? I'm not entirely sure if my suggestion to use ChunkedArray::Flatten here was appropriate since I believe it collapses top-level nulls. The Get implementations for Array and RecordBatch both use their underlying ArrayData exclusively without modifying/copying anything (regardless of nesting level).

Achieving the same thing for ChunkedArrays would be more advanced, of course - but I can help with the implementation.

@dinimar
Copy link
Contributor Author

dinimar commented Mar 24, 2023

@dinimar I pushed a commit that reverts some of the formatting. I'm not entirely sure why the auto-formatter is still picking up arrow/compute/kernels/scalar_string_test.cc - probably best to leave it for now.

I'm going to add some more commits addressing the nested-field tests soon. Then I'll do another round of review :-)

About formatting it's possible to fix it by rebasing on the last successful commit (of course after all work is done). For example, yours 4487be. It passed all workflows including formatting check

@westonpace
Copy link
Member

@zeroshade @westonpace Looking at this a little deeper... do you know if FieldPath::Get is meant to be zero-copy? I'm not entirely sure if my suggestion to use ChunkedArray::Flatten here was appropriate since I believe it collapses top-level nulls. The Get implementations for Array and RecordBatch both use their underlying ArrayData exclusively without modifying/copying anything (regardless of nesting level).

Yes, Get is meant to be zero-copy. This came up in discussion a little while back. See #14946 I believe the consensus was that FieldRef::Get has been around long enough that changing the definition now (to mean a flattened get) would be a breaking change and a better approach would be to introduce a GetFlattened alongside Get.

Copy link
Collaborator

@benibus benibus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed commits that address the tests and zero-copy requirement (the latter required some implementation changes, but it's similar in principle). I also added overloads for ChunkedArrays for good measure.

Hopefully the tests better-clarify what I was getting at regarding nested fields, but let me know if you have any questions!

cpp/src/arrow/type_test.cc Show resolved Hide resolved
cpp/src/arrow/type_test.cc Show resolved Hide resolved
@dinimar dinimar requested a review from benibus March 28, 2023 07:17
@benibus benibus requested a review from westonpace March 29, 2023 19:34
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor suggestions. Marking Request Changes because it isn't clear to me that a follow-up issue for creating generic tests has yet been created. Once we do that I think we can move forward with this.

Result<std::shared_ptr<ChunkedArray>> FieldPath::Get(
const ChunkedArray& chunked_array) const {
if (chunked_array.num_chunks() < 1) {
return Status::Invalid("Chunked array must have at least one chunk");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that it would be considered invalid since ChunkedArray::MakeEmpty always allocates one chunk. https://github.com/dinimar/arrow/blob/ec6caeff8477af08679de7a7f12b90d4ba302457/cpp/src/arrow/chunked_array.h#L95-L96

Whether that's reasonable or not, I'm open to just returning an empty ChunkedArray here (based on its type or storage_type) if that makes more sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. I very rarely work with chunked arrays and so you may be right. It is possible to create chunks arrays with 0 chunks though:

>>> pa.chunked_array([], pa.int8()).num_chunks
0

My comment was mostly for understanding. I thought maybe a chunk was needed for some kind of inference.

I think I would slightly prefer returning an empty chunked array (with 1 chunk for consistency with ChunkedArray::MakeEmpty).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benibus please resolve this converstion. At the moment, it's the last thing to be done before second review from Weston Pace

cpp/src/arrow/type_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/type_test.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 31, 2023
@benibus
Copy link
Collaborator

benibus commented Mar 31, 2023

I've created an issue for the tests: #34830

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Apr 2, 2023
@benibus benibus requested a review from westonpace April 3, 2023 20:35
@dinimar
Copy link
Contributor Author

dinimar commented Apr 8, 2023

@westonpace please review

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you both for working on this.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Apr 11, 2023
@westonpace westonpace merged commit 1c06854 into apache:main Apr 11, 2023
27 of 33 checks passed
@ursabot
Copy link

ursabot commented Apr 13, 2023

Benchmark runs are scheduled for baseline = 1bda003 and contender = 1c06854. 1c06854 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.82% ⬆️1.05%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 1c06854f ec2-t3-xlarge-us-east-2
[Failed] 1c06854f test-mac-arm
[Finished] 1c06854f ursa-i9-9960x
[Finished] 1c06854f ursa-thinkcentre-m75q
[Finished] 1bda003b ec2-t3-xlarge-us-east-2
[Failed] 1bda003b test-mac-arm
[Finished] 1bda003b ursa-i9-9960x
[Finished] 1bda003b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…apache#34537)

### Rationale for this change
Described in the issue

### What changes are included in this PR?

- added implementations for `FieldPath::Get(const Table& table)` and `FindAll(const Table& table)`
- added unit tests for functions mentioned above both for `Table` and `RecordBatch` classes

### Are these changes tested?

Yes, by unit tests

### Are there any user-facing changes?

Most probably, no

* Closes: apache#14939

Lead-authored-by: Dinir Imameev <dinir.imameev@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…apache#34537)

### Rationale for this change
Described in the issue

### What changes are included in this PR?

- added implementations for `FieldPath::Get(const Table& table)` and `FindAll(const Table& table)`
- added unit tests for functions mentioned above both for `Table` and `RecordBatch` classes

### Are these changes tested?

Yes, by unit tests

### Are there any user-facing changes?

Most probably, no

* Closes: apache#14939

Lead-authored-by: Dinir Imameev <dinir.imameev@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
…apache#34537)

### Rationale for this change
Described in the issue

### What changes are included in this PR?

- added implementations for `FieldPath::Get(const Table& table)` and `FindAll(const Table& table)`
- added unit tests for functions mentioned above both for `Table` and `RecordBatch` classes

### Are these changes tested?

Yes, by unit tests

### Are there any user-facing changes?

Most probably, no

* Closes: apache#14939

Lead-authored-by: Dinir Imameev <dinir.imameev@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Support Table lookups in FieldRef and FieldPath
5 participants