Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35579: [C++] Support non-named FieldRefs in Parquet scanner #35798

Merged
merged 4 commits into from
Jun 24, 2023

Conversation

benibus
Copy link
Collaborator

@benibus benibus commented May 26, 2023

Rationale for this change

When setting projections/filters for the file system scanner, the Parquet implementation requires that all materialized FieldRefs be position-independent (containing only names). However, it may be useful to support index-based field lookups as well - assuming the dataset schema is known.

What changes are included in this PR?

Adds a translation step for field refs prior to looking them up in the fragment schema. A known dataset schema is required to do this reliably, however (since the fragment schema may be a sub/superset of the dataset schema) - so in the absence of one, we fall back to the existing behavior.

Are these changes tested?

Yes (tests are included)

Are there any user-facing changes?

Yes

@github-actions
Copy link

@benibus benibus marked this pull request as ready for review May 26, 2023 23:42
@benibus benibus requested a review from westonpace as a code owner May 26, 2023 23:42
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen if manifest not matches input in nested schema? Would extra checking is required here?
Assume we have schema evolution in some of the dataset file, how to handle these file with mismatched Field?

// names) based on the dataset schema. Returns `false` if no conversion was needed.
Result<bool> ValidateFieldRef(const FieldRef& ref, const Schema& dataset_schema,
FieldRef* out) {
if (ARROW_PREDICT_TRUE(IsValidFieldRef(ref))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you support LookUp by index, why PREDICT_TRUE is used here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly because all existing user code uses named lookups. Plus, I'd imagine indexed lookups would be fairly rare in practice since it requires knowing the exact structure (i.e. field order) of the dataset schema upfront, which may not be predictable if it was inferred from multiple files.

(It's probably not very consequential though)

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 29, 2023
@benibus
Copy link
Collaborator Author

benibus commented May 30, 2023

What would happen if manifest not matches input in nested schema? Would extra checking is required here?
Assume we have schema evolution in some of the dataset file, how to handle these file with mismatched Field?

We shouldn't need any additional checks there, I don't think. The routine for resolving discrepancies between the dataset schema and file manifest (ResolveOneFieldRef) is unchanged - and the logic is still in terms of field names exclusively. All this PR really does is convert indexed refs into named refs (using the dataset schema) before those checks occur.

That case should probably be reflected in the tests though... (anecdotally, it did work in my ad hoc testing)

@mapleFU
Copy link
Member

mapleFU commented May 31, 2023

Yeah, I know what you mean, however, Parquet files in a dataset might has different schema, the most typical case is at: https://iceberg.apache.org/docs/latest/evolution/

Assume user insert a column, name might be better than FieldIndex, because it can maintain some consistency.

If we're sure we don't need to support that case, or user can make sure that file has same schema, then I'm +1 on this patch

cpp/src/arrow/dataset/file_parquet.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_parquet.cc Outdated Show resolved Hide resolved
@lidavidm
Copy link
Member

lidavidm commented Jun 1, 2023

I'm not sure what the problem is @mapleFU? We still support named refs even with this PR. This just allows the user to also provide indices, and we resolve them into names against the overall dataset schema, so that should actually allow for schema evolution if the positions of those fields changes.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Jun 1, 2023
@lidavidm
Copy link
Member

lidavidm commented Jun 1, 2023

I think right now, of course, we can't yet unify files of different schemas into a consistent schema. But this doesn't affect that either way.

@mapleFU
Copy link
Member

mapleFU commented Jun 2, 2023

OK, thanks for your explanition @lidavidm .
I think when schema evolution happens, seek by index might provide unconsistent result, and seek by name doesn't has that problem. But if we we can't yet unify files of different schemas into a consistent schema, I'm +1 on this patch

@lidavidm
Copy link
Member

lidavidm commented Jun 2, 2023

What (would) happen is we resolve the file schemas into an overall dataset schema, then resolve any indices against that unified schema back into names, so that issue shouldn't come up

@mapleFU
Copy link
Member

mapleFU commented Jun 5, 2023

Okay, I think currently parquet::SchemaManifest can build the bridge from arrow to parquet and parquet to arrow, but for file with different schema, it need to follow some rules, maybe by "PARQUET:field_id" or others. I'm ok on this patch now!

@benibus benibus force-pushed the GH-35579-parquet-dataset-field-refs branch from cf618ae to 4d845c5 Compare June 7, 2023 14:37
@lidavidm
Copy link
Member

lidavidm commented Jun 7, 2023

There's build failures, but I think they're unrelated?

@benibus
Copy link
Collaborator Author

benibus commented Jun 7, 2023

There's build failures, but I think they're unrelated?

I think so... I'm seeing similar macos failures elsewhere after a fresh rebase

@lidavidm
Copy link
Member

lidavidm commented Jun 7, 2023

Probably what happened in conda-forge/cpp-opentelemetry-sdk-feedstock#29

We need to set WITH_STL=ON for the bundled OpenTelemetry build

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, just a few thoughts.

Comment on lines 227 to 236
bool IsNamedFieldRef(const FieldRef& ref) {
if (ref.IsName()) return true;
if (const auto* nested_refs = ref.nested_refs()) {
for (const auto& nested_ref : *nested_refs) {
if (!nested_ref.IsName()) return false;
}
return true;
}
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor thing but I wonder if we might want to add this directly to FieldRef?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved it into FieldRef in the update. Looking at it now though, I suspect it may be too niche to justify its place there - at least on its own.

Methods that transform a ref into a flat vector of names or indices might be more useful in general (but less trivial, of course).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's too niche. I feel like I have run into situations a few times now in the scanner where I've needed to know if a ref is all-names, all-indices, or mixed (a lot of the new scanner stuff normalizes to all-indices). We do have FieldPath already which is a flat vector of indices.

cpp/src/arrow/dataset/file_parquet.cc Show resolved Hide resolved
cpp/src/arrow/dataset/file_parquet.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/file_parquet_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jun 13, 2023
@benibus benibus force-pushed the GH-35579-parquet-dataset-field-refs branch from 4d845c5 to 56240a3 Compare June 22, 2023 20:20
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 22, 2023
@benibus benibus requested a review from westonpace June 23, 2023 18:22
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 24, 2023
@westonpace
Copy link
Member

CI issues seem unrelated

@westonpace westonpace merged commit 10eedbe into apache:main Jun 24, 2023
34 of 36 checks passed
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jun 24, 2023
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit 10eedbe6.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details.

lidavidm added a commit that referenced this pull request Sep 20, 2023
…ilter as a Substrait proto extended expression (#35570)

### Rationale for this change

To close #34252

### What changes are included in this PR?

This is a proposal to try to solve:
1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project 
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by #35798

This PR needs/use this PRs/Issues:
- #34834
- #34227
- #35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No
* Closes: #34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…der::Filter as a Substrait proto extended expression (apache#35570)

### Rationale for this change

To close apache#34252

### What changes are included in this PR?

This is a proposal to try to solve:
1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project 
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache#35798

This PR needs/use this PRs/Issues:
- apache#34834
- apache#34227
- apache#35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No
* Closes: apache#34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…der::Filter as a Substrait proto extended expression (apache#35570)

### Rationale for this change

To close apache#34252

### What changes are included in this PR?

This is a proposal to try to solve:
1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project 
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache#35798

This PR needs/use this PRs/Issues:
- apache#34834
- apache#34227
- apache#35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No
* Closes: apache#34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Enable support on field_ref compute expression for also Column Indice
4 participants