Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16913: [Java] Implement ArrowArrayStream #13465

Merged
merged 5 commits into from
Jul 5, 2022

Conversation

lidavidm
Copy link
Member

Implements ArrowArrayStream for Java. The equivalent Java-side interface chosen is ArrowReader.

Also:

  • Fixes a couple of JDK9 compatibility issues I ran into. I think these will not normally affect people except during development (I think because I was mixing IntelliJ and Maven).
  • Manually clang-format the C++ code. Clean up some things to match Arrow convention and remove some unused declarations.
  • Extends the DictionaryProvider interface. This is a potentially breaking change; we could make the method default (and raise an exception) instead.

@github-actions
Copy link

ThrowPendingException(message);
}
jclass global_class = (jclass)env->NewGlobalRef(local_class);
if (!local_class) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a mistake?

Suggested change
if (!local_class) {
if (!global_class) {

const int err_code = env->CallIntMethod(private_data->j_private_data_,
kPrivateDataGetSchemaMethod, out_addr);
if (env->ExceptionCheck()) {
env->ExceptionDescribe();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's an exception, should it perhaps participate in last_error_?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally the JNI side sets the last error, the check here is just a last-resort safeguard. I suppose this can be refactored though: copy the Java-side error to the C++ side after get_next/get_stream, and get_last_error only has to return the C++-side error; then get_next/get_stream can also update last_error_ if it ends up catching a stray error.

if (env->ExceptionCheck()) {
env->ExceptionDescribe();
env->ExceptionClear();
ThrowPendingException("Error calling close of private data");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? The release callback could be called from any context, such as a Python thread or R interpreter. In those contexts, a C++ exception would probably crash the process (or silently exit the thread)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, you're right. The existing handler has this issue too. I'll remove the throw. (Actually here I suppose we should do our best to free resources in C++/Java regardless.)

JNIEnvGuard guard(private_data->vm_);
JNIEnv* env = guard.env();

const long out_addr = static_cast<long>(reinterpret_cast<uintptr_t>(out));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this doesn't work on 64-bit Windows? long is 32 bits there...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, according to the JNI spec, a jlong is always 64 bits, so perhaps we should use jlong or simply int64_t here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks.

* @param stream C stream interface struct to import.
* @return Imported reader
*/
public static ArrowReader importStream(BufferAllocator allocator, ArrowArrayStream stream) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for the naming discrepancy (importStream vs. exportArrayStream)?

static class ExportedArrayStreamPrivateData implements PrivateData {
final BufferAllocator allocator;
final ArrowReader reader;
int nextDictionary;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This member doesn't seem used anymore, or am I missing something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah it's not used. I missed this when backing out a change.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. For the record, did you try to use this to communicate with e.g. PyArrow?

@pitrou
Copy link
Member

pitrou commented Jun 30, 2022

@lwhite1 Would you like to take a look?

@lidavidm
Copy link
Member Author

For the record, did you try to use this to communicate with e.g. PyArrow?

I have not yet, I need to give this a try: https://arrow.apache.org/docs/dev/python/integration/python_java.html#java-to-python-communication-using-the-c-data-interface

and actually, I'll extend the doc page there as well.

@lidavidm lidavidm closed this Jun 30, 2022
@lidavidm lidavidm reopened this Jun 30, 2022
@lidavidm
Copy link
Member Author

…wow, whatever GitHub did to their UI is rather frustrating.

@lidavidm
Copy link
Member Author

Hmm, there's a possible minor bug between PyArrow/C++/Java: Python can keep a reference to the reader until interpreter shutdown (at which point the JVM has been shut down), and then collects the reader. This frees the shared_ptr<RecordBatchReader>; the implementation is ArrayStreamBatchReader whose destructor unconditionally calls ArrowArrayStreamRelease. (It doesn't implement Close() since this was just added). This calls into the JNI code, which tries to attach the JVM, fails, throws an uncaught exception, and dies.

Changes needed:

  • Implement Close() properly so we can explicitly free
  • Don't unconditionally call ArrowArrayStreamRelease
  • In the JNI side, fail gracefully if we can't attach the JVM (warn instead of abort?)

@pitrou
Copy link
Member

pitrou commented Jun 30, 2022

Python can keep a reference to the reader until interpreter shutdown (at which point the JVM has been shut down), and then collects the reader

Can Python perhaps release that reference once close() is called?

@lidavidm
Copy link
Member Author

Well, the Python-side reference is the Python reader object itself. But close() should be wired up to call the new RecordBatchReader::Close() so we can at least explicitly call the release callback at a suitable time.

@lidavidm lidavidm closed this Jun 30, 2022
@pitrou
Copy link
Member

pitrou commented Jun 30, 2022

Though the Java improvements are welcome as well. We should probably try to do both.

@lidavidm
Copy link
Member Author

@lidavidm lidavidm reopened this Jun 30, 2022
@pitrou pitrou reopened this Jun 30, 2022
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question, this is great otherwise.

@pitrou
Copy link
Member

pitrou commented Jun 30, 2022

@amol- Do you want to take a look at the doc additions?

@amol- amol- merged commit cd3c6ea into apache:master Jul 5, 2022
@lidavidm lidavidm deleted the arrow-16913 branch July 5, 2022 14:00
drin pushed a commit to drin/arrow that referenced this pull request Jul 5, 2022
Implements ArrowArrayStream for Java. The equivalent Java-side interface chosen is ArrowReader.

Also:
- Fixes a couple of JDK9 compatibility issues I ran into. I _think_ these will not normally affect people except during development (I think because I was mixing IntelliJ and Maven).
- Manually clang-format the C++ code. Clean up some things to match Arrow convention and remove some unused declarations.
- Extends the DictionaryProvider interface. This is a potentially breaking change; we could make the method default (and raise an exception) instead.

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Alessandro Molina <amol@turbogears.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants