GH-35579: [C++] Support non-named FieldRefs in Parquet scanner #35798

benibus · 2023-05-26T21:47:13Z

Rationale for this change

When setting projections/filters for the file system scanner, the Parquet implementation requires that all materialized FieldRefs be position-independent (containing only names). However, it may be useful to support index-based field lookups as well - assuming the dataset schema is known.

What changes are included in this PR?

Adds a translation step for field refs prior to looking them up in the fragment schema. A known dataset schema is required to do this reliably, however (since the fragment schema may be a sub/superset of the dataset schema) - so in the absence of one, we fall back to the existing behavior.
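
To illustrate the translation step described above, here is a minimal sketch that uses only standard-library types. `SimpleRef`, `SimpleSchema`, and `ToNamedRef` are hypothetical names invented for this example. They are not Arrow's `FieldRef` or `Schema` classes, and the sketch only covers top-level fields. It assumes that an indexed ref can be rewritten as a name only when a dataset schema is available.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT Arrow's
// FieldRef or Schema classes.
using SimpleRef = std::variant<std::string, std::size_t>;  // name or position
using SimpleSchema = std::vector<std::string>;             // top-level field names

// Rewrite a ref as a name using the dataset schema, if one is known.
// Returns std::nullopt when the ref cannot be expressed as a name (no dataset
// schema, or an out-of-range index), leaving the caller to fall back to its
// name-only behavior.
std::optional<std::string> ToNamedRef(const SimpleRef& ref,
                                      const SimpleSchema* dataset_schema) {
  if (const auto* name = std::get_if<std::string>(&ref)) {
    return *name;  // already position-independent
  }
  const std::size_t index = std::get<std::size_t>(ref);
  if (dataset_schema == nullptr || index >= dataset_schema->size()) {
    return std::nullopt;
  }
  return (*dataset_schema)[index];
}
```

With a dataset schema of `{"a", "b", "c"}`, an indexed ref to position 1 becomes the name `"b"`, while the same ref with no dataset schema yields no translation.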

Are these changes tested?

Yes (tests are included)

Are there any user-facing changes?

Yes

Closes: [C++] Enable support on field_ref compute expression for also Column Indice #35579

github-actions · 2023-05-26T21:47:39Z

Closes: [C++] Enable support on field_ref compute expression for also Column Indice #35579

mapleFU

What would happen if the manifest does not match the input in a nested schema? Would extra checking be required here?
Assuming we have schema evolution in some of the dataset files, how do we handle files with mismatched fields?

mapleFU · 2023-05-29T15:52:54Z

cpp/src/arrow/dataset/file_parquet.cc

+// names) based on the dataset schema. Returns `false` if no conversion was needed.
+Result<bool> ValidateFieldRef(const FieldRef& ref, const Schema& dataset_schema,
+                              FieldRef* out) {
+  if (ARROW_PREDICT_TRUE(IsValidFieldRef(ref))) {


Since you support lookup by index, why is PREDICT_TRUE used here?

Mostly because all existing user code uses named lookups. Plus, I'd imagine indexed lookups would be fairly rare in practice since it requires knowing the exact structure (i.e. field order) of the dataset schema upfront, which may not be predictable if it was inferred from multiple files.

(It's probably not very consequential though)

benibus · 2023-05-30T17:07:51Z

> What would happen if the manifest does not match the input in a nested schema? Would extra checking be required here?
> Assuming we have schema evolution in some of the dataset files, how do we handle files with mismatched fields?

We shouldn't need any additional checks there, I don't think. The routine for resolving discrepancies between the dataset schema and file manifest (ResolveOneFieldRef) is unchanged - and the logic is still in terms of field names exclusively. All this PR really does is convert indexed refs into named refs (using the dataset schema) before those checks occur.

That case should probably be reflected in the tests though... (anecdotally, it did work in my ad hoc testing)

mapleFU · 2023-05-31T07:59:28Z

Yeah, I know what you mean. However, Parquet files in a dataset might have different schemas; the most typical case is described at: https://iceberg.apache.org/docs/latest/evolution/

Suppose a user inserts a column: a name might be better than a field index, because it can maintain some consistency.

If we're sure we don't need to support that case, or users can make sure that the files have the same schema, then I'm +1 on this patch.

lidavidm · 2023-06-01T14:54:41Z

I'm not sure what the problem is @mapleFU? We still support named refs even with this PR. This just allows the user to also provide indices, and we resolve them into names against the overall dataset schema, so that should actually allow for schema evolution if the positions of those fields change.

lidavidm · 2023-06-01T14:55:20Z

I think right now, of course, we can't yet unify files of different schemas into a consistent schema. But this doesn't affect that either way.

mapleFU · 2023-06-02T18:06:11Z

OK, thanks for your explanation @lidavidm.
I think when schema evolution happens, seeking by index might give inconsistent results, and seeking by name doesn't have that problem. But if we can't yet unify files of different schemas into a consistent schema, I'm +1 on this patch.

lidavidm · 2023-06-02T19:07:05Z

What (would) happen is we resolve the file schemas into an overall dataset schema, then resolve any indices against that unified schema back into names, so that issue shouldn't come up
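
The resolution order described above can be sketched with plain standard-library types. This is only an illustration under stated assumptions: `FieldNames`, `FindInFragment`, and `ResolveViaName` are hypothetical names, schemas are reduced to flat lists of field names, and none of this is the actual scanner code. The point is that an index is interpreted against the dataset schema only, and each file is then searched by name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema: just the ordered top-level field names.
using FieldNames = std::vector<std::string>;

// Position of `name` within one fragment's schema, or std::nullopt if the
// fragment does not contain that field.
std::optional<std::size_t> FindInFragment(const FieldNames& fragment,
                                          const std::string& name) {
  const auto it = std::find(fragment.begin(), fragment.end(), name);
  if (it == fragment.end()) return std::nullopt;
  return static_cast<std::size_t>(it - fragment.begin());
}

// Resolve a dataset-level index to a per-fragment position by going through
// the field name, so the fragment's own column order does not matter.
std::optional<std::size_t> ResolveViaName(const FieldNames& dataset,
                                          std::size_t dataset_index,
                                          const FieldNames& fragment) {
  if (dataset_index >= dataset.size()) return std::nullopt;
  return FindInFragment(fragment, dataset[dataset_index]);
}
```

For a dataset schema `{"id", "price", "qty"}`, index 2 always means `qty`: it is found at position 1 in an older file `{"id", "qty"}` and at position 2 in a newer file `{"price", "id", "qty"}`.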

mapleFU · 2023-06-05T11:14:45Z

Okay, I think currently parquet::SchemaManifest can build the bridge from Arrow to Parquet and from Parquet to Arrow, but for files with different schemas, it needs to follow some rules, maybe by "PARQUET:field_id" or others. I'm OK with this patch now!

lidavidm · 2023-06-07T16:54:27Z

There's build failures, but I think they're unrelated?

benibus · 2023-06-07T17:02:10Z

> There's build failures, but I think they're unrelated?

I think so... I'm seeing similar macos failures elsewhere after a fresh rebase

lidavidm · 2023-06-07T17:12:50Z

Probably what happened in conda-forge/cpp-opentelemetry-sdk-feedstock#29

We need to set WITH_STL=ON for the bundled OpenTelemetry build

westonpace

This looks good, just a few thoughts.

westonpace · 2023-06-13T14:16:33Z

cpp/src/arrow/dataset/file_parquet.cc

+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}


A minor thing but I wonder if we might want to add this directly to FieldRef?

Moved it into FieldRef in the update. Looking at it now though, I suspect it may be too niche to justify its place there - at least on its own.

Methods that transform a ref into a flat vector of names or indices might be more useful in general (but less trivial, of course).

I don't think it's too niche. I feel like I have run into situations a few times now in the scanner where I've needed to know if a ref is all-names, all-indices, or mixed (a lot of the new scanner stuff normalizes to all-indices). We do have FieldPath already which is a flat vector of indices.
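
As a rough sketch of the three-way distinction mentioned above (all-names, all-indices, or mixed), the following uses standard-library types only. `Segment`, `RefKind`, and `Classify` are hypothetical names for this example and are not part of Arrow's `FieldRef` API. Classifying an empty path as all-names is an arbitrary choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for one step of a nested ref: a name or a position.
using Segment = std::variant<std::string, std::size_t>;

enum class RefKind { kAllNames, kAllIndices, kMixed };

// Classify a flattened ref path by counting how many segments are names.
// An empty path is reported as kAllNames (vacuously true in this sketch).
RefKind Classify(const std::vector<Segment>& path) {
  std::size_t names = 0;
  for (const auto& segment : path) {
    if (std::holds_alternative<std::string>(segment)) ++names;
  }
  if (names == path.size()) return RefKind::kAllNames;
  if (names == 0) return RefKind::kAllIndices;
  return RefKind::kMixed;
}
```

A caller could use the result to decide whether a ref needs to be normalized to names, normalized to indices, or left alone.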

westonpace · 2023-06-24T00:04:02Z

CI issues seem unrelated

conbench-apache-arrow · 2023-06-28T00:57:23Z

Conbench analyzed the 6 benchmark runs on commit 10eedbe6.

There were 5 benchmark results indicating a performance regression:

Commit Run on ursa-thinkcentre-m75q at 2023-06-26 23:32:20Z
- params=<Repetition::REQUIRED, Compression::LZ4>/65536/1024, source=cpp-micro, suite=parquet-column-io-benchmark
Commit Run on ursa-i9-9960x at 2023-06-27 21:57:10Z
- engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-04, scale_factor=1
and 3 more (see the report linked below)

The full Conbench report has more details.

…ilter as a Substrait proto extended expression (#35570)

### Rationale for this change

To close #34252

### What changes are included in this PR?

This is a proposal to try to solve:

1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by #35798

This PR needs/use this PRs/Issues:
- #34834
- #34227
- #35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No

* Closes: #34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>

github-actions bot added Component: C++ awaiting review Awaiting review labels May 26, 2023

benibus marked this pull request as ready for review May 26, 2023 23:42

benibus requested a review from westonpace as a code owner May 26, 2023 23:42

mapleFU reviewed May 29, 2023

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 29, 2023

davisusanibar mentioned this pull request May 30, 2023

GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression #35570

Merged

7 tasks

lidavidm approved these changes Jun 1, 2023

github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Jun 1, 2023

mapleFU approved these changes Jun 2, 2023

benibus force-pushed the GH-35579-parquet-dataset-field-refs branch from cf618ae to 4d845c5 Compare June 7, 2023 14:37

westonpace reviewed Jun 13, 2023

github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jun 13, 2023

benibus added 3 commits June 22, 2023 16:10

Support index-based FieldRefs

4de881e

Address review points

5ba2dcc

Add FieldRef::IsNameSequence method

9d57fef

Address review points

56240a3

benibus force-pushed the GH-35579-parquet-dataset-field-refs branch from 4d845c5 to 56240a3 Compare June 22, 2023 20:20

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 22, 2023

benibus requested a review from westonpace June 23, 2023 18:22

westonpace approved these changes Jun 24, 2023

github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jun 24, 2023

westonpace merged commit 10eedbe into apache:main Jun 24, 2023
34 of 36 checks passed

github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels Jun 24, 2023
