GH-35579: [C++] Support non-named FieldRefs in Parquet scanner #35798
Conversation
What would happen if the manifest doesn't match the input in a nested schema? Is extra checking required here?
Also, assuming some of the dataset's files have undergone schema evolution, how do we handle files with mismatched fields?
// names) based on the dataset schema. Returns `false` if no conversion was needed.
Result&lt;bool&gt; ValidateFieldRef(const FieldRef&amp; ref, const Schema&amp; dataset_schema,
                               FieldRef* out) {
  if (ARROW_PREDICT_TRUE(IsValidFieldRef(ref))) {
Since you support lookup by index, why is PREDICT_TRUE used here?
Mostly because all existing user code uses named lookups. Plus, I'd imagine indexed lookups would be fairly rare in practice, since they require knowing the exact structure (i.e. field order) of the dataset schema upfront, which may not be predictable if it was inferred from multiple files.
(It's probably not very consequential though)
We shouldn't need any additional checks there, I don't think. The routine for resolving discrepancies between the dataset schema and the file manifest (…) That case should probably be reflected in the tests though... (anecdotally, it did work in my ad hoc testing)
Yeah, I know what you mean. However, Parquet files in a dataset might have different schemas; the most typical case is schema evolution: https://iceberg.apache.org/docs/latest/evolution/ Assume a user inserts a column: a name might be better than a FieldIndex, because it can maintain some consistency. If we're sure we don't need to support that case, or the user can make sure the files have the same schema, then I'm +1 on this patch.
I'm not sure what the problem is @mapleFU? We still support named refs even with this PR. This just allows the user to also provide indices, and we resolve them into names against the overall dataset schema, so that should actually allow for schema evolution if the positions of those fields change.
I think right now, of course, we can't yet unify files of different schemas into a consistent schema. But this doesn't affect that either way.
OK, thanks for your explanation @lidavidm.
What (would) happen is we resolve the file schemas into an overall dataset schema, then resolve any indices against that unified schema back into names, so that issue shouldn't come up |
Okay, I think currently
Force-pushed from cf618ae to 4d845c5
There are build failures, but I think they're unrelated?
I think so... I'm seeing similar macOS failures elsewhere after a fresh rebase.
Probably what happened in conda-forge/cpp-opentelemetry-sdk-feedstock#29. We need to set WITH_STL=ON for the bundled OpenTelemetry build.
This looks good, just a few thoughts.
bool IsNamedFieldRef(const FieldRef&amp; ref) {
  if (ref.IsName()) return true;
  if (const auto* nested_refs = ref.nested_refs()) {
    for (const auto&amp; nested_ref : *nested_refs) {
      if (!nested_ref.IsName()) return false;
    }
    return true;
  }
  return false;
}
A minor thing, but I wonder if we might want to add this directly to FieldRef?
Moved it into FieldRef in the update. Looking at it now though, I suspect it may be too niche to justify its place there, at least on its own.
Methods that transform a ref into a flat vector of names or indices might be more useful in general (but less trivial, of course).
I don't think it's too niche. I feel like I have run into situations a few times now in the scanner where I've needed to know if a ref is all-names, all-indices, or mixed (a lot of the new scanner stuff normalizes to all-indices). We do have FieldPath already which is a flat vector of indices.
Force-pushed from 4d845c5 to 56240a3
CI issues seem unrelated
Conbench analyzed the 6 benchmark runs on this commit. There were 5 benchmark results indicating a performance regression:
The full Conbench report has more details. |
…ilter as a Substrait proto extended expression (#35570)

### Rationale for this change

To close #34252.

### What changes are included in this PR?

This is a proposal to try to solve:

1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
   - [x] Draft a Substrait Extended Expression to test (this will be generated by a 3rd-party project such as Isthmus)
   - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
   - [x] Create JNI Wrapper for ScannerBuilder::Project
   - [x] Create JNI API
   - [x] Testing coverage
   - [x] Documentation

   The current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. We are not able to infer by column position, but we are able to infer by column name. This problem is solved by #35798.

   This PR needs/uses these PRs/Issues:
   - #34834
   - #34227
   - #35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
   - [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No

* Closes: #34252

Lead-authored-by: david dali susanibar arce &lt;davi.sarces@gmail.com&gt;
Co-authored-by: Weston Pace &lt;weston.pace@gmail.com&gt;
Co-authored-by: benibus &lt;bpharks@gmx.com&gt;
Co-authored-by: David Li &lt;li.davidm96@gmail.com&gt;
Co-authored-by: Dane Pitkin &lt;48041712+danepitkin@users.noreply.github.com&gt;
Signed-off-by: David Li &lt;li.davidm96@gmail.com&gt;
Rationale for this change
When setting projections/filters for the file system scanner, the Parquet implementation requires that all materialized FieldRefs be position-independent (containing only names). However, it may be useful to support index-based field lookups as well, assuming the dataset schema is known.

What changes are included in this PR?

Adds a translation step for field refs prior to looking them up in the fragment schema. A known dataset schema is required to do this reliably (since the fragment schema may be a sub/superset of the dataset schema), so in the absence of one, we fall back to the existing behavior.
Are these changes tested?
Yes (tests are included)
Are there any user-facing changes?
Yes