[C++] Enable support on field_ref compute expression for also Column Indice #35579

Closed
davisusanibar opened this issue May 13, 2023 · 1 comment · Fixed by #35798

davisusanibar commented May 13, 2023

Describe the enhancement requested

The field_ref compute expression currently supports references by column name but not by column index.

It would be useful to also support column indices, in case an integration needs to pass a column index instead of a column name.

Code to reproduce the error:

#include <iostream>
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/api.h>

namespace compute = arrow::compute;

void reproduceInferringColumnProjection() {
    // Placeholder: directory that holds the Parquet file(s) of the TPC-H nation table.
    const std::string directory_base = "/directory_of_your_parquet_file/nation_tpch/";
    auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
    // Discover every file under the base directory.
    arrow::fs::FileSelector selector;
    selector.base_dir = directory_base;
    selector.recursive = true;
    auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
    arrow::dataset::FileSystemFactoryOptions options;
    std::shared_ptr<arrow::dataset::DatasetFactory> dataset_factory =
            arrow::dataset::FileSystemDatasetFactory::Make(fs, selector, format, options).ValueOrDie();
    std::shared_ptr<arrow::dataset::Dataset> dataset = dataset_factory->Finish().ValueOrDie();
    arrow::dataset::ScannerBuilder scanner_builder(dataset);
    // Error: NotImplemented: Inferring column projection from FieldRef FieldRef.FieldPath(0)
    scanner_builder.Project({compute::call("add", {compute::field_ref(0), compute::literal(10)})}, {"column_0"});
    // OK: this works when a column name is used instead of a column index
    // scanner_builder.Project({compute::call("add", {compute::field_ref("n_nationkey"), compute::literal(10)})}, {"column_0"});
    std::shared_ptr<arrow::dataset::Scanner> scanner = scanner_builder.Finish().ValueOrDie();
    std::shared_ptr<arrow::Table> table = scanner->ToTable().ValueOrDie();
    std::cout << "Table with " << table->num_rows() << " rows and " << table->num_columns() << " columns" << std::endl;
    std::cout << table->ToString() << std::endl;
}

Error message:

NotImplemented: Inferring column projection from FieldRef FieldRef.FieldPath(0)

Component(s)

C++

@westonpace
Member

This seems like a reasonable request. We should support both named and index lookups.

@benibus benibus self-assigned this May 24, 2023
westonpace pushed a commit that referenced this issue Jun 24, 2023
### Rationale for this change

When setting projections/filters for the file system scanner, the Parquet implementation requires that all materialized `FieldRef`s be position-independent (containing only names). However, it may be useful to support index-based field lookups as well - assuming the dataset schema is known.

### What changes are included in this PR?

Adds a translation step for field refs prior to looking them up in the fragment schema. A known dataset schema is required to do this reliably, however (since the fragment schema may be a sub/superset of the dataset schema) - so in the absence of one, we fall back to the existing behavior.
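
As a rough illustration of the translation step described above, the following self-contained sketch models the idea with plain standard-library types. It does not use Arrow's classes: `FieldRefModel`, `SchemaModel`, `TranslateToName`, and `FindInFragment` are hypothetical names chosen for this example, and the logic is an assumption about the general approach, not the code in this PR.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for a field reference: a column index or a column name.
using FieldRefModel = std::variant<int, std::string>;

// Hypothetical stand-in for a schema: an ordered list of column names.
using SchemaModel = std::vector<std::string>;

// Translate an index-based reference into a name-based one using the dataset schema.
// - No dataset schema (nullptr): return the reference unchanged (existing behavior).
// - Name-based reference: return it unchanged.
// - Index out of range: return std::nullopt.
std::optional<FieldRefModel> TranslateToName(const FieldRefModel& ref,
                                             const SchemaModel* dataset_schema) {
  if (dataset_schema == nullptr) {
    return ref;
  }
  if (const int* index = std::get_if<int>(&ref)) {
    if (*index < 0 || static_cast<std::size_t>(*index) >= dataset_schema->size()) {
      return std::nullopt;
    }
    return FieldRefModel{(*dataset_schema)[static_cast<std::size_t>(*index)]};
  }
  return ref;
}

// Look up a column name in a fragment schema, whose columns may be a subset or
// superset of the dataset schema and may appear in a different order.
std::optional<std::size_t> FindInFragment(const std::string& name,
                                          const SchemaModel& fragment_schema) {
  for (std::size_t i = 0; i < fragment_schema.size(); ++i) {
    if (fragment_schema[i] == name) {
      return i;
    }
  }
  return std::nullopt;
}
```

In this model, index 0 against a dataset schema of `n_nationkey, n_name, n_regionkey` becomes the name `n_nationkey`, which can then be found in a fragment whose columns are ordered differently. Resolving the index directly against the fragment schema could pick the wrong column, which is why the dataset schema is needed.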

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

Yes

* Closes: #35579

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 13.0.0 milestone Jun 24, 2023
lidavidm added a commit that referenced this issue Sep 20, 2023
…ilter as a Substrait proto extended expression (#35570)

### Rationale for this change

To close #34252

### What changes are included in this PR?

This is a proposal to try to solve:
1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project 
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

The current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. The column projection cannot be inferred from a column position, although it can be inferred from a column name. This problem is solved by #35798.

This PR depends on the following PRs/issues:
- #34834
- #34227
- #35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No
* Closes: #34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024