GH-34223: [Java] Java Substrait Consumer JNI call to ACERO C++ #34227

Merged
merged 53 commits into from May 24, 2023

davisusanibar
Contributor

@davisusanibar davisusanibar commented Feb 16, 2023

The purpose of this PR is to implement:

  1. JNI Wrappers to consume Acero capabilities that execute Substrait Plans.
  2. Java base code offering an API that consumes Substrait Plans.
  3. Initial Substrait documentation.

@github-actions

⚠️ GitHub issue #34223 has been automatically assigned in GitHub to PR creator.

@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

Revision: 7500dd7

Submitted crossbow builds: ursacomputing/crossbow @ actions-1555224742

Task Status
java-jars Github Actions

@github-actions

Revision: 7500dd7

Submitted crossbow builds: ursacomputing/crossbow @ actions-c5c0135c13

Task Status
java-jars Github Actions

/**
* Java binding of the C++ ExecuteSerializedPlan.
*/
public class SubstraitConsumer implements AutoCloseable {
Member

Seems like this should be an interface?

Contributor Author

changed

* @param substraitPlan the JSON Substrait plan.
* @return the ArrowReader to iterate for record batches.
*/
public ArrowReader runQuery(String substraitPlan) {
Member

Plans should be byte[] or ByteBuffer?

Member

It seems this assumes JSON plans which is probably not what people will be working with

Contributor Author

Could you help me, please? What are my options for converting a Substrait plan (Java bytes) into the std::shared_ptr<arrow::Buffer> needed by the JNI call?

Member

JNI functions can read byte arrays, you can then copy into an Arrow buffer (this is probably good anyways to avoid too many cross-boundary dependencies)

Contributor Author

I just created this ByteBuffer holding the Substrait plan:

ByteBuffer directByteBuffer = ByteBuffer.allocateDirect(64);
directByteBuffer.put("DEMO_SUBSTRAIT_PLAN".getBytes(StandardCharsets.UTF_8)); // protoPlan.toByteArray();

In the JNI wrapper, I recover the bytes with:

jbyte *buff = (jbyte *) env->GetDirectBufferAddress(plan);

How could I transform that jbyte* into a std::shared_ptr<arrow::Buffer>?

Member

Get the capacity, allocate a mutable buffer, memcpy the result

Contributor Author

changed
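
For the Java side of the exchange above, here is a minimal sketch (not the PR's actual code) of packing serialized plan bytes into a direct ByteBuffer. It assumes the native side sizes its copy from the buffer's capacity, as suggested in the thread, so the buffer is allocated at exactly the plan's length rather than a fixed 64 bytes. `PlanBuffers` and `toDirectBuffer` are hypothetical names used only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class PlanBuffers {
  /** Copies the serialized plan into a direct buffer whose capacity equals the plan length. */
  static ByteBuffer toDirectBuffer(byte[] plan) {
    ByteBuffer buffer = ByteBuffer.allocateDirect(plan.length);
    buffer.put(plan);
    buffer.flip(); // position = 0, limit = plan.length, ready for reading
    return buffer;
  }

  public static void main(String[] args) {
    // Stand-in bytes; a real caller would pass the serialized protobuf plan.
    byte[] plan = "DEMO_SUBSTRAIT_PLAN".getBytes(StandardCharsets.UTF_8);
    ByteBuffer buffer = toDirectBuffer(plan);
    System.out.println(buffer.isDirect() + " " + buffer.capacity() + " " + buffer.remaining());
  }
}
```

With an exactly sized buffer, a native copy based on the capacity picks up no trailing zero bytes after the plan.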

@@ -45,6 +45,7 @@ export ARROW_ORC
: ${ARROW_PLASMA:=ON}
export ARROW_PLASMA
: ${ARROW_S3:=ON}
: ${ARROW_SUBSTRAIT:=ON}
Member

This has to always match ARROW_DATASET so a separate variable won't make sense

Contributor Author

changed

@davisusanibar davisusanibar changed the title from "GH-34223: [Java] Proposal for Java Substrait Consumer" to "GH-34223: [Java] Java Substrait Consumer JNI call to ACERO C++" on Mar 15, 2023
@davisusanibar davisusanibar marked this pull request as ready for review March 15, 2023 22:14
@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

Revision: e5594f8

Submitted crossbow builds: ursacomputing/crossbow @ actions-bf821d607d

Task Status
java-jars Github Actions

@davisusanibar
Contributor Author

@github-actions crossbow submit java-jars

@github-actions

Revision: 223ddef

Submitted crossbow builds: ursacomputing/crossbow @ actions-17d27841d2

Task Status
java-jars Github Actions

@kou
Member

kou commented Mar 16, 2023

Could you rebase on main to use #34480?

@github-actions

Revision: ce7800b

Submitted crossbow builds: ursacomputing/crossbow @ actions-02cc882fff

Task Status
java-jars Github Actions

Comment on lines +81 to +83
Map<String, String> metadataName = new HashMap<>();
metadataName.put("ARROW:extension:name", "varchar");
metadataName.put("ARROW:extension:metadata", "varchar{length:150}");
Member

Is the metadata actually necessary?

Contributor Author

Yes, it is needed: in the case of local tables, this metadata appears in the response.

* @param planInput the JSON Substrait plan.
* @param memoryAddressOutput the memory address where RecordBatchReader is exported.
*/
public native void executeSerializedPlanLocalFiles(String planInput, long memoryAddressOutput);
Member

This was never addressed?

Comment on lines 125 to 131
try {
AutoCloseables.close(arrowArrayStream);
} catch (RuntimeException e) {
throw e;
} catch (Exception e) {
throw new RuntimeException(e);
}
Member

Why the extra catches?

Contributor Author

Deleted

java/dataset/src/main/cpp/jni_wrapper.cc (outdated, resolved)
java/dataset/src/main/cpp/jni_wrapper.cc (outdated, resolved)
@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting merge" label on May 12, 2023
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on May 23, 2023
try {
AutoCloseables.close(arrowArrayStream);
} catch (Exception e) {
throw new RuntimeException(e);
Member

This method is declared to throw Exception already. Why are we catching and re-throwing this?

Contributor Author

Changed

Member

Why are we catching it in the first place? Let it propagate.

Contributor Author

ok, changed
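
The review point above can be illustrated with a minimal, self-contained sketch (not the PR's code): when the enclosing method already declares `throws Exception`, closing a resource in `finally` needs no catch-and-rethrow, because the exception from `close()` propagates unchanged. `FailingResource` and `run` are hypothetical stand-ins for `arrowArrayStream` and `execute`.

```java
class PropagateOnClose {
  // Hypothetical resource whose close() fails with a checked exception.
  static class FailingResource implements AutoCloseable {
    @Override
    public void close() throws Exception {
      throw new Exception("close failed");
    }
  }

  // Declared to throw Exception, so close() failures need no wrapping.
  static String run(AutoCloseable resource) throws Exception {
    try {
      return "result";
    } finally {
      resource.close(); // propagates as-is; no catch block required
    }
  }

  public static void main(String[] args) {
    try {
      run(new FailingResource());
    } catch (Exception e) {
      // The caller sees the original checked exception, not a RuntimeException wrapper.
      System.out.println(e.getMessage()); // prints "close failed"
    }
  }
}
```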

@@ -124,8 +126,6 @@ private ArrowReader execute(ByteBuffer plan, Map<String, ArrowReader> namedTable
} finally {
try {
AutoCloseables.close(arrowArrayStream);
} catch (RuntimeException e) {
throw e;
} catch (Exception e) {
throw new RuntimeException(e);
Member

Same here.

Contributor Author

Changed

- builder.replace(builder.indexOf("FILENAME_PLACEHOLDER"),
-     builder.indexOf("FILENAME_PLACEHOLDER") + "FILENAME_PLACEHOLDER".length(), uri);
- return builder.toString();
+ return plan.replace("FILENAME_PLACEHOLDER", uri);
Member

Again: why is this a whole method? Just inline it; it's only used once.

Contributor Author

Inline
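
A small sketch of the simplification discussed in this thread, assuming a plan template string that contains the `FILENAME_PLACEHOLDER` token. One behavioural difference is worth noting: `String.replace` substitutes every occurrence, whereas the earlier `StringBuilder` version substituted only the first. The class name, template text, and URI below are made up for illustration.

```java
class PlaceholderDemo {
  static final String PLACEHOLDER = "FILENAME_PLACEHOLDER";

  // Earlier approach from the diff: replaces only the first occurrence.
  static String replaceFirstWithBuilder(String plan, String uri) {
    StringBuilder builder = new StringBuilder(plan);
    int start = builder.indexOf(PLACEHOLDER);
    builder.replace(start, start + PLACEHOLDER.length(), uri);
    return builder.toString();
  }

  public static void main(String[] args) {
    // Illustrative template and URI, not taken from the PR's test resources.
    String plan = "{\"uri_file\": \"FILENAME_PLACEHOLDER\"}";
    String uri = "file:///tmp/data.parquet";
    // Inlined one-liner: no helper method needed at the single call site.
    System.out.println(plan.replace(PLACEHOLDER, uri));
    System.out.println(replaceFirstWithBuilder(plan, uri));
  }
}
```

For a template with a single placeholder, both forms produce the same string, which is why the one-line call can simply be inlined.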

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on May 24, 2023
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on May 24, 2023
Member

@lidavidm lidavidm left a comment

The last comments are still unaddressed.

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting change review" label on May 24, 2023
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on May 24, 2023
@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label on May 24, 2023
@lidavidm lidavidm merged commit 95c33d8 into apache:main May 24, 2023
22 checks passed
@ursabot

ursabot commented May 30, 2023

Benchmark runs are scheduled for baseline = c4ea194 and contender = 95c33d8. 95c33d8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.84% ⬆️0.0%] test-mac-arm
[Failed ⬇️3.22% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 95c33d82 ec2-t3-xlarge-us-east-2
[Failed] 95c33d82 test-mac-arm
[Failed] 95c33d82 ursa-i9-9960x
[Failed] 95c33d82 ursa-thinkcentre-m75q
[Finished] c4ea194c ec2-t3-xlarge-us-east-2
[Failed] c4ea194c test-mac-arm
[Finished] c4ea194c ursa-i9-9960x
[Failed] c4ea194c ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented May 31, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

lidavidm added a commit that referenced this pull request Sep 20, 2023
…ilter as a Substrait proto extended expression (#35570)

### Rationale for this change

To close #34252

### What changes are included in this PR?

This is a proposal to try to solve:
1. Receive a list of Substrait scalar expressions and use them to Project a Dataset
- [x] Draft a Substrait Extended Expression to test (this will be generated by 3rd party project such as Isthmus)
- [x] Use C++ draft PR to Serialize/Deserialize Extended Expression proto messages
- [x] Create JNI Wrapper for ScannerBuilder::Project 
- [x] Create JNI API
- [x] Testing coverage
- [x] Documentation

Current problem is: `java.lang.RuntimeException: Inferring column projection from FieldRef FieldRef.FieldPath(0)`. Columns cannot be inferred by position, but they can be inferred by name. This problem is solved by #35798

This PR needs/uses these PRs/issues:
- #34834
- #34227
- #35579

2. Receive a Boolean-valued Substrait scalar expression and use it to filter a Dataset
- [x] Working to identify activities

### Are these changes tested?

Initial unit test added.

### Are there any user-facing changes?

No
* Closes: #34252

Lead-authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…der::Filter as a Substrait proto extended expression (apache#35570)
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…der::Filter as a Substrait proto extended expression (apache#35570)
Development

Successfully merging this pull request may close these issues.

[Java] Proposal for Java Substrait Consumers
5 participants