Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression #35570

Merged
merged 90 commits into from
Sep 20, 2023

Conversation

davisusanibar
Copy link
Contributor

@davisusanibar davisusanibar commented May 12, 2023

Rationale for this change

To close #34252

What changes are included in this PR?

This is a proposal to try to solve:

  1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
  • Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
  • Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
  • Create JNI Wrapper for ScannerBuilder::Project
  • Create JNI API
  • Testing coverage
  • Documentation

Current problem is: java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0). Not able to infer by column position by able to infer by colum name. This problem is solved by #35798

This PR needs/use this PRs/Issues:

  1. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
  • Working to identify activities

Are these changes tested?

Initial unit test added.

Are there any user-facing changes?

No

davisusanibar and others added 30 commits March 13, 2023 11:36
Co-authored-by: David Li <li.davidm96@gmail.com>
…SubstraitConsumer.java

Co-authored-by: David Li <li.davidm96@gmail.com>

private static ByteBuffer getByteBuffer(String base64EncodedSubstrait) {
byte[] substraitFilter = Base64.getDecoder().decode(base64EncodedSubstrait);
ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(substraitFilter.length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget about the rest of this function! e.g. substraitExpressionFilter and substraitFilter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

* @param columns Projected columns. Empty for scanning all columns.
* @return the ScanOptions configured.
*/
public Builder columns(Optional<String[]> columns) {
Copy link
Member

@danepitkin danepitkin Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing: We don't need Optional<> parameters for the builder APIs. We should expect the user to pass us a valid object. Same with substraitProjection and substraitFilter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values for substraitProjection and substraitFilter have been changed.

The rule definition for columns mentions that empty means scanning all columns, so that is how it works.

Copy link
Member

@danepitkin danepitkin Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. I assumed a user would prefer to only use the columns API when they want to project a subset of columns because, if left blank, the builder will build an empty Optional<> columns object automatically. I'm okay with leaving as-is. Thanks for the udpates!

Comment on lines +239 to +241
assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]"));
assertTrue(reader.getVectorSchemaRoot().getVector("name").toString()
.equals("[value_19, value_1, value_11]"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should consider what to do here; possibly an Iterator/Iterable/Stream that gives Java objects would be sufficient (and you could collect into a Java collection and then use standard assertions). Can you file a follow-up task?

java/dataset/src/main/cpp/jni_wrapper.cc Outdated Show resolved Hide resolved
Comment on lines 509 to 518
for(arrow::engine::NamedExpression& named_expression :
bounded_expression.named_expressions) {
if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
if (filter_expr.has_value()) {
JniThrow("Only one filter expression may be provided");
}
filter_expr = named_expression.expression;
}
}
JniAssertOkOrThrow(scanner_builder->Filter(*filter_expr));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will crash if you provide an empty list of expressions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Comment on lines 511 to 516
if (named_expression.expression.type()->id() == arrow::Type::BOOL) {
if (filter_expr.has_value()) {
JniThrow("Only one filter expression may be provided");
}
filter_expr = named_expression.expression;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw if the expression is not of type BOOL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Comment on lines 182 to 184
} catch (Exception e) {
e.printStackTrace();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just declare everything as throws Exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Comment on lines 187 to 199
private static ByteBuffer getSubstraitExpressionProjection() {
// Expression: N_REGIONKEY + 10 = col 3 + 10
Expression.Builder selectionBuilderProjectOne = Expression.newBuilder().
setSelection(
Expression.FieldReference.newBuilder().
setDirectReference(
Expression.ReferenceSegment.newBuilder().
setStructField(
Expression.ReferenceSegment.StructField.newBuilder().setField(
2)
)
)
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: how are you formatting examples? I think it would make sense to just use google-java-format which has a more compact style. In our docs, going too far to the right is unreadable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 14, 2023
Co-authored-by: David Li <li.davidm96@gmail.com>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 14, 2023
Comment on lines 163 to 164
//String uri = "file:///Users/dsusanibar/data/tpch_parquet/nation.parquet";
String uri = "file:////Users/dsusanibar/voltron/fork/consumer-testing/tests/data/tpch_parquet/nation.parquet";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not put company names in strings? And remove the redundant comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, second time same error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just updated

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 15, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 15, 2023
@lidavidm lidavidm merged commit 00481a2 into apache:main Sep 20, 2023
17 checks passed
@lidavidm lidavidm removed the awaiting change review Awaiting change review label Sep 20, 2023
@github-actions github-actions bot added the awaiting merge Awaiting merge label Sep 20, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 00481a2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…der::Filter as a Substrait proto extended expression (apache#35570)

### Rationale for this change

To close apache#34252

### What changes are included in this PR?

This is a proposal to try to solve:
1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project 
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache#35798

This PR needs/use this PRs/Issues:
- apache#34834
- apache#34227
- apache#35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No
* Closes: apache#34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…der::Filter as a Substrait proto extended expression (apache#35570)

### Rationale for this change

To close apache#34252

### What changes are included in this PR?

This is a proposal to try to solve:
1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project 
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache#35798

This PR needs/use this PRs/Issues:
- apache#34834
- apache#34227
- apache#35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No
* Closes: apache#34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Java] Filter and project Datasets with Substrait expressions
6 participants