
IndexOutOfBoundsException when loading compressed IPC format #33384

Open
asfimport opened this issue Oct 31, 2022 · 10 comments

Comments

@asfimport
I encountered this bug when I loaded a dataframe stored in the Arrow IPC format.

 

// Java code from the "Apache Arrow Java Cookbook"
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

File file = new File("example.arrow");
try (
        BufferAllocator rootAllocator = new RootAllocator();
        FileInputStream fileInputStream = new FileInputStream(file);
        ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
) {
    System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
        System.out.print(vectorSchemaRootRecover.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}

Call stack:


Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
    at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
    at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
    at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
    at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
    at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
    at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)

This bug can be reproduced with a simple dataframe created by pandas:

 

import pandas as pd

pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')

Pandas compresses the dataframe by default. If compression is turned off, Java can load the dataframe, so I suspect the bounds-checking code is buggy when loading compressed files.

 

That dataframe can be loaded in polars, pandas and pyarrow, so it's unlikely to be a pandas bug.

 

 

Environment: Linux and Windows.
Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)
Reporter: Georeth Zhou

Note: This issue was originally created as ARROW-18198. Please see the migration documentation for further details.

@asfimport

David Li / @lidavidm:
CC @davisusanibar

@asfimport

Georeth Zhou:
Any updates?

@asfimport

David Dali Susanibar Arce / @davisusanibar:
Hi [~georeth], let me check that.

@asfimport

David Dali Susanibar Arce / @davisusanibar:
There is no problem for binary files with rowCount <= 2048.

There is a problem with the validity buffer: for example, with 2049 rows the buffer is initially allocated at 504 bytes, but 512 bytes are requested at the end.

I need to continue reviewing what changes are needed.

 

@asfimport

David Dali Susanibar Arce / @davisusanibar:
 

Based on the current implementation, the default compression codec is no compression.

 

 

 

@asfimport

David Dali Susanibar Arce / @davisusanibar:
@lidavidm, could you please help me with this question:

Was the Vector module designed to support compression codecs (Lz4/Zstd)? I only see the abstract class AbstractCompressionCodec; doDecompress is only implemented in the Compression module, and if I try to use that, it causes a cyclic dependency Vector <-> Compression.

Could you suggest a way to implement compression in the Vector module?

@asfimport

David Li / @lidavidm:
@davisusanibar I don't see the problem: compression is implemented. Just add dependencies on both modules from your application.

In any case, the first issue here is that Java should detect the file is compressed and error if it doesn't support the codec.
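The dependency setup described above might look like the following hypothetical Maven fragment; the artifact names come from the Arrow Java distribution, and the version is an assumption matching the release mentioned in this report:

```xml
<!-- arrow-vector provides the readers; arrow-compression provides
     CommonsCompressionFactory (LZ4/ZSTD); arrow-memory-netty supplies
     a concrete allocator implementation at runtime. -->
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>10.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-compression</artifactId>
  <version>10.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-memory-netty</artifactId>
  <version>10.0.0</version>
  <scope>runtime</scope>
</dependency>
```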

@asfimport

David Li / @lidavidm:
The ArrowFileReader/ArrowStreamReader constructors take an optional codec factory instance, so that's probably the underlying issue (the modules are decoupled, so by default you can't read a compressed file), but we should still fix the error message for when no factory is passed in.

@asfimport

David Dali Susanibar Arce / @davisusanibar:
Hi [~georeth],

Please consider this PR, which adds a cookbook recipe for reading compressed files:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.arrow.compression.CommonsCompressionFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

File file = new File("src/main/resources/compare/lz4.arrow");
try (
    BufferAllocator rootAllocator = new RootAllocator();
    FileInputStream fileInputStream = new FileInputStream(file);
    // Pass CommonsCompressionFactory so the reader can decompress
    // LZ4/ZSTD record batches; the two-argument constructor cannot.
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(),
        rootAllocator, CommonsCompressionFactory.INSTANCE)
) {
    System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
        System.out.println("Size: --> " + vectorSchemaRootRecover.getRowCount());
        System.out.print(vectorSchemaRootRecover.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}

@asfimport

Georeth Zhou:
@davisusanibar thank you.

It works now.
