-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34252: [Java] Support ScannerBuilder::Project or ScannerBuilder::Filter as a Substrait proto extended expression #35570
Conversation
…on the class path
Co-authored-by: David Li <li.davidm96@gmail.com>
…SubstraitConsumer.java Co-authored-by: David Li <li.davidm96@gmail.com>
|
||
private static ByteBuffer getByteBuffer(String base64EncodedSubstrait) { | ||
byte[] substraitFilter = Base64.getDecoder().decode(base64EncodedSubstrait); | ||
ByteBuffer substraitExpressionFilter = ByteBuffer.allocateDirect(substraitFilter.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget about the rest of this function! e.g. substraitExpressionFilter
and substraitFilter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
* @param columns Projected columns. Empty for scanning all columns. | ||
* @return the ScanOptions configured. | ||
*/ | ||
public Builder columns(Optional<String[]> columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more thing: We don't need Optional<>
parameters for the builder APIs. We should expect the user to pass us a valid object. Same with substraitProjection
and substraitFilter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values for substraitProjection and substraitFilter have been changed.
The rule definition for columns mentions that empty means scanning all columns, so that is how it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. I assumed a user would prefer to only use the columns
API when they want to project a subset of columns because, if left blank, the builder will build an empty Optional<> columns
object automatically. I'm okay with leaving as-is. Thanks for the udpates!
assertTrue(reader.getVectorSchemaRoot().getVector("id").toString().equals("[19, 1, 11]")); | ||
assertTrue(reader.getVectorSchemaRoot().getVector("name").toString() | ||
.equals("[value_19, value_1, value_11]")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should consider what to do here; possibly an Iterator/Iterable/Stream that gives Java objects would be sufficient (and you could collect into a Java collection and then use standard assertions). Can you file a follow-up task?
for(arrow::engine::NamedExpression& named_expression : | ||
bounded_expression.named_expressions) { | ||
if (named_expression.expression.type()->id() == arrow::Type::BOOL) { | ||
if (filter_expr.has_value()) { | ||
JniThrow("Only one filter expression may be provided"); | ||
} | ||
filter_expr = named_expression.expression; | ||
} | ||
} | ||
JniAssertOkOrThrow(scanner_builder->Filter(*filter_expr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will crash if you provide an empty list of expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
if (named_expression.expression.type()->id() == arrow::Type::BOOL) { | ||
if (filter_expr.has_value()) { | ||
JniThrow("Only one filter expression may be provided"); | ||
} | ||
filter_expr = named_expression.expression; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throw if the expression is not of type BOOL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
docs/source/java/substrait.rst
Outdated
} catch (Exception e) { | ||
e.printStackTrace(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just declare everything as throws Exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
docs/source/java/substrait.rst
Outdated
private static ByteBuffer getSubstraitExpressionProjection() { | ||
// Expression: N_REGIONKEY + 10 = col 3 + 10 | ||
Expression.Builder selectionBuilderProjectOne = Expression.newBuilder(). | ||
setSelection( | ||
Expression.FieldReference.newBuilder(). | ||
setDirectReference( | ||
Expression.ReferenceSegment.newBuilder(). | ||
setStructField( | ||
Expression.ReferenceSegment.StructField.newBuilder().setField( | ||
2) | ||
) | ||
) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: how are you formatting examples? I think it would make sense to just use google-java-format
which has a more compact style. In our docs, going too far to the right is unreadable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
Co-authored-by: David Li <li.davidm96@gmail.com>
docs/source/java/substrait.rst
Outdated
//String uri = "file:///Users/dsusanibar/data/tpch_parquet/nation.parquet"; | ||
String uri = "file:////Users/dsusanibar/voltron/fork/consumer-testing/tests/data/tpch_parquet/nation.parquet"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not put company names in strings? And remove the redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, second time same error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just updated
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 00481a2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
…der::Filter as a Substrait proto extended expression (apache#35570) ### Rationale for this change To close apache#34252 ### What changes are included in this PR? This is a proposal to try to solve: 1. Receive a list of Substrait scalar expressions and use them to Project a Dataset - [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus) - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages - [x] Create JNI Wrapper for ScannerBuilder::Project - [x] Create JNI API - [x] Testing coverage - [x] Documentation Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache#35798 This PR needs/use this PRs/Issues: - apache#34834 - apache#34227 - apache#35579 2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset - [x] Working to identify activities ### Are these changes tested? Initial unit test added. ### Are there any user-facing changes? No * Closes: apache#34252 Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: benibus <bpharks@gmx.com> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…der::Filter as a Substrait proto extended expression (apache#35570) ### Rationale for this change To close apache#34252 ### What changes are included in this PR? This is a proposal to try to solve: 1. Receive a list of Substrait scalar expressions and use them to Project a Dataset - [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus) - [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages - [x] Create JNI Wrapper for ScannerBuilder::Project - [x] Create JNI API - [x] Testing coverage - [x] Documentation Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Not able to infer by column position by able to infer by colum name. This problem is solved by apache#35798 This PR needs/use this PRs/Issues: - apache#34834 - apache#34227 - apache#35579 2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset - [x] Working to identify activities ### Are these changes tested? Initial unit test added. ### Are there any user-facing changes? No * Closes: apache#34252 Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com> Co-authored-by: Weston Pace <weston.pace@gmail.com> Co-authored-by: benibus <bpharks@gmx.com> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Rationale for this change
To close #34252
What changes are included in this PR?
This is a proposal to try to solve:
Current problem is:
java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)
. Not able to infer by column position by able to infer by colum name. This problem is solved by #35798This PR needs/use this PRs/Issues:
Are these changes tested?
Initial unit test added.
Are there any user-facing changes?
No