[Python, Java] UnionArray round trip not working #17700

Closed
asfimport opened this issue Oct 20, 2017 · 8 comments

asfimport commented Oct 20, 2017

I'm currently working on making pyarrow.serialization data available from the Java side. One problem I ran into is that the Java implementation seems unable to read UnionArrays generated from C++. To make this easy to reproduce, I created a clean Python implementation for creating UnionArrays: #1216

The data is generated with the following script:

import pyarrow as pa

binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

result = pa.UnionArray.from_arrays([binary, int64], types, value_offsets)

batch = pa.RecordBatch.from_arrays([result], ["test"])

sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)

writer.write_batch(batch)

sink.close()

b = sink.get_result()

with open("union_array.arrow", "wb") as f:
    f.write(b)

# Sanity check: Read the batch in again

with open("union_array.arrow", "rb") as f:
    b = f.read()
    reader = pa.RecordBatchStreamReader(pa.BufferReader(b))

batch = reader.read_next_batch()

print("union array is", batch.column(0))
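
For reference, the logical values that this union should contain can be worked out by hand. The sketch below is plain Python and does not use pyarrow; `decode_dense_union` is a hypothetical helper name, not an Arrow API. It assumes the dense-union rule that slot i takes its value from child `types[i]` at position `value_offsets[i]`, and that type ids coincide with child indices, as they do in the script above.

```python
def decode_dense_union(children, types, value_offsets):
    """Resolve each union slot to children[type_id][offset].

    Assumes type ids are the same as child indices (0 and 1 here).
    """
    if len(types) != len(value_offsets):
        raise ValueError("types and value_offsets must have the same length")
    return [children[t][o] for t, o in zip(types, value_offsets)]


# Same data as in the script above, as plain Python lists.
binary = [b'a', b'b', b'c', b'd']
int64 = [1, 2, 3]
types = [0, 1, 0, 0, 1, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]

decoded = decode_dense_union([binary, int64], types, value_offsets)
print(decoded)
```

Any reader that handles the file correctly should report seven slots that resolve to these values in this order.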

I attached the file generated by that script. Then when I run the following code in Java:

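import java.io.ByteArrayInputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.arrow.memory.RootAllocator;
// ArrowStreamReader must be imported too; the later 0.8 session shows it in org.apache.arrow.vector.ipc.

RootAllocator allocator = new RootAllocator(1000000000);

ByteArrayInputStream in = new ByteArrayInputStream(Files.readAllBytes(Paths.get("union_array.arrow")));

ArrowStreamReader reader = new ArrowStreamReader(in, allocator);

reader.loadNextBatch();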

I get the following error:

|  java.lang.IllegalArgumentException thrown: Could not load buffers for field test: Union(Sparse, [22, 5])<0: Binary, 1: Int(64, true)>. error message: can not truncate buffer to a larger size 7: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#7:1)

It seems like Java is not picking up that the UnionArray is Dense instead of Sparse. After changing the default in java/vector/src/main/codegen/templates/UnionVector.java from Sparse to Dense, I get this:

jshell> reader.getVectorSchemaRoot().getSchema()
$9 ==> Schema<list: Union(Dense, [0])<: Struct<list: List<item: Union(Dense, [0])<: Int(64, true)>>>>>

but then reading doesn't work:

jshell> reader.loadNextBatch()
|  java.lang.IllegalArgumentException thrown: Could not load buffers for field list: Union(Dense, [1])<: Struct<list: List<$data$: Union(Dense, [5])<: Int(64, true)>>>>. error message: can not truncate buffer to a larger size 1: 0
|        at VectorLoader.loadBuffers (VectorLoader.java:83)
|        at VectorLoader.load (VectorLoader.java:62)
|        at ArrowReader$1.visit (ArrowReader.java:125)
|        at ArrowReader$1.visit (ArrowReader.java:111)
|        at ArrowRecordBatch.accepts (ArrowRecordBatch.java:128)
|        at ArrowReader.loadNextBatch (ArrowReader.java:137)
|        at (#8:1)
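
For context on the Dense/Sparse distinction mentioned above: in the Arrow format, a dense union keeps compact children plus a value offsets buffer, while a sparse union has no offsets buffer and every child is as long as the union itself. The sketch below is plain Python with a hypothetical helper, `dense_to_sparse`, and uses `None` as a stand-in for the unused slots. It only illustrates why the two layouts have different child lengths; it is not a diagnosis of the Java error.

```python
def dense_to_sparse(children, types, value_offsets):
    """Expand compact dense-union children into sparse-union children.

    Every sparse child gets one slot per union slot. Slots that belong
    to a different child are filled with None as a placeholder.
    """
    n = len(types)
    sparse = [[None] * n for _ in children]
    for i, (t, o) in enumerate(zip(types, value_offsets)):
        sparse[t][i] = children[t][o]
    return sparse


# Same data as in the Python script from the report.
binary = [b'a', b'b', b'c', b'd']
int64 = [1, 2, 3]
types = [0, 1, 0, 0, 1, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]

sparse_binary, sparse_int64 = dense_to_sparse([binary, int64], types, value_offsets)

# Dense children have lengths 4 and 3; sparse children both have length 7.
print(len(binary), len(int64), len(sparse_binary), len(sparse_int64))
```

A reader that assumes the sparse layout would expect both children to have seven slots, whereas the dense file stores only four and three.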

Any help with this is appreciated!

Reporter: Philipp Moritz / @pcmoritz
Assignee: Ryan Murray / @rymurr

Related issues:

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-1692. Please see the migration documentation for further details.


Wes McKinney / @wesm:
We have yet to complete integration tests for unions, so it does not surprise me that there are some small issues.

See open PR #987


Wes McKinney / @wesm:
cc @icexelloss


Li Jin / @icexelloss:
Yeah, the Union type doesn't work between Java and C++ because they have different representations. (The Java one is incorrect, I think.)


Li Jin / @icexelloss:
We can probably do the integration for Union in 0.8 if the refactoring work finishes ahead of schedule; otherwise, I'd suggest we prioritize the refactoring work and ensure its quality.


Philipp Moritz / @pcmoritz:
Thanks for your help! I tried to make it work on top of #987, but the Dense Union integration there is also not compatible with C++, and if this code will be deprecated soon, it probably doesn't make much sense to fix it.

If there is anything I can do to speed up getting Dense Union support in Java that interoperates with C++, let me know!


Philipp Moritz / @pcmoritz:
I tried this again with 0.8 and it gives the following error:

jshell> ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
reader ==> org.apache.arrow.vector.ipc.ArrowStreamReader@55cf0d14

jshell> ByteArrayInputStream in = new ByteArrayInputStream(Files.readAllBytes(Paths.get("/Users/pcmoritz/arrow/python/union_array.arrow")));
in ==> java.io.ByteArrayInputStream@3b74ac8

jshell> ArrowStreamReader reader = new ArrowStreamReader(in, allocator);
reader ==> org.apache.arrow.vector.ipc.ArrowStreamReader@27adc16e

jshell> reader.loadNextBatch()
|  java.lang.IndexOutOfBoundsException thrown: 
|        at Buffer.checkIndex (Buffer.java:675)
|        at HeapByteBuffer.getInt (HeapByteBuffer.java:405)
|        at Table.__string (Table.java:50)
|        at KeyValue.key (KeyValue.java:21)
|        at Field.convertField (Field.java:126)
|        at Field.convertField (Field.java:118)
|        at Schema.convertSchema (Schema.java:85)
|        at MessageSerializer.deserializeSchema (MessageSerializer.java:112)
|        at ArrowStreamReader.readSchema (ArrowStreamReader.java:128)
|        at ArrowReader.initialize (ArrowReader.java:181)
|        at ArrowReader.ensureInitialized (ArrowReader.java:172)
|        at ArrowReader.prepareLoadNextBatch (ArrowReader.java:211)
|        at ArrowStreamReader.loadNextBatch (ArrowStreamReader.java:103)
|        at (#12:1)


Wes McKinney / @wesm:
I hope we can resolve this in the 0.11 release cycle


Wes McKinney / @wesm:
Issue resolved by pull request 7290
#7290
