Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java][C][Dataset] Exception when importing a vector with empty data array from c-data #37056

Closed
janosik47 opened this issue Aug 8, 2023 · 2 comments · Fixed by #37531
Closed

Comments

@janosik47
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

Java import from c-data arrays throws an exception when attempting to construct a vector for which the data buffer is empty.
Example: importing an empty list of Int32 primitives throws the following

Exception in thread "main" java.lang.IllegalArgumentException: Could not load buffers for field $data$: Int(32, true). error message: Buffer 1 for type Int(32, true) cannot be null
	at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:131)
	at org.apache.arrow.c.ArrayImporter.importChild(ArrayImporter.java:84)
	at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:97)
	at org.apache.arrow.c.ArrayImporter.importChild(ArrayImporter.java:84)
	at org.apache.arrow.c.ArrayImporter.doImport(ArrayImporter.java:97)
	at org.apache.arrow.c.ArrayImporter.importArray(ArrayImporter.java:71)
	at org.apache.arrow.c.Data.importIntoVector(Data.java:289)
	at org.apache.arrow.c.Data.importIntoVectorSchemaRoot(Data.java:332)
	at org.apache.arrow.dataset.jni.NativeScanner$NativeReader.loadNextBatch(NativeScanner.java:151)

How to reproduce

Creation of a sample document (Jupyter notebook) :

python: 3.10.4
pyarrow: 12.0.1
pandas: 2.0.3

import pyarrow as pa
import pyarrow.feather as pf
import pandas as pd

schema = pa.schema([
    pa.field("a", pa.list_(pa.int32()), True),
])
df = pd.DataFrame(columns=["a"],index=range(1))
df.iloc[0] = [[]]
table = pa.table(df, schema)
pf.write_feather(table, "/tmp/sample.feather", compression="uncompressed")

Access the document via Java DataSet API (kotlin):

JVM: openjdk/20.0.1
arrow: 12.0.1
kotlin: 1.9.0

    val allocator = RootAllocator()
    val nativeMemoryPool = NativeMemoryPool.getDefault()
    
    val factory = FileSystemDatasetFactory(allocator, nativeMemoryPool, FileFormat.ARROW_IPC, "file:/tmp/sample.feather")
    factory.finish().use { dataset ->
        dataset.newScan(ScanOptions(10L)).use { scanner ->
            scanner.scanBatches().use { reader ->
                println("$path: schema=${reader.vectorSchemaRoot.schema.toJson()}")

                while (reader.loadNextBatch()) {
                    println(reader.vectorSchemaRoot.contentToTSVString())
                }
            }
        }
    }

Comment

Quick look at the Java code shows that the ArrayImporter class uses an instance of BufferImportTypeVisitor which performs import vector's buffers based on the knowledge of the field data type.

In this case the visit(ArrowType.Int type) method is called which accepts nullable bit mask buffer (here) but demands non-nullable data buffer (here & then here).

As my understanding is from the Vector perspective the data buffer must not be null hence the visitor enforces it, however according to the C data ArrowArray spec it can hold null buffers:

The buffer pointers MAY be null only in two situations:

  1. for the null bitmap buffer, if ArrowArray.null_count is 0;
  2. for any buffer, if the size in bytes of the corresponding buffer would be 0.

Based on the above, seems to me, the BufferImportTypeVisitor could create an empty data buffer if the corresponding c data one is null and the filed is empty (fieldNode.length == 0) or throw if the field is not empty.

Component(s)

Java

@kou kou changed the title [Java] [c-data] [dataset] Exception when importing a vector with empty data array from c-data [Java][C][Dataset] Exception when importing a vector with empty data array from c-data Aug 8, 2023
@pitrou
Copy link
Member

pitrou commented Aug 22, 2023

@janosik47 Thanks a lot for the analysis. Would you like to try contributing a PR for this?

Also cc @lidavidm @roee88

@janosik47
Copy link
Contributor Author

@pitrou could you have a look at the above PR please?

pitrou pushed a commit to janosik47/arrow that referenced this issue Sep 7, 2023
pitrou pushed a commit that referenced this issue Sep 7, 2023
### Rationale for this change

Fix java lib bug that prevents from importing specific data via c-data interface.
Currently an attempt to load a vector with empty data buffer results in an IllegalStateException error.

### What changes are included in this PR?
Updated BufferImportTypeVisitor to correctly handle a situation when underlying c-data array pointer is NULL (0) and the expected length of data is zero (0). 

### Are these changes tested?
Yes, updated the existing unit tests

### Are there any user-facing changes?
No

* Closes: #37056

Lead-authored-by: Jacek Stania <janosik47@gmail.com>
Co-authored-by: Jacek Stania <38670505+janosik47@users.noreply.github.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 14.0.0 milestone Sep 7, 2023
dongjoon-hyun pushed a commit to apache/spark that referenced this issue Nov 4, 2023
### What changes were proposed in this pull request?
This pr upgrade Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.
‎
In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable length vectors([GH-37829](apache/arrow#37829)) and includes C++ libraries for MacOS AARCH 64 in Java-Jars([GH-38076](apache/arrow#38076)).
‎
The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class([GH-37246](apache/arrow#37246)).
‎
The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21([GH-36994](apache/arrow#36994)). In addition, the release introduces vector validation consistent with C++, ensuring consistency across different languages([GH-37702](apache/arrow#37702)).
‎
Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String`([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, improving the ease of accessing data([GH-37863](apache/arrow#37863)).

The full release notes as follows:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…pache#37531)

### Rationale for this change

Fix java lib bug that prevents from importing specific data via c-data interface.
Currently an attempt to load a vector with empty data buffer results in an IllegalStateException error.

### What changes are included in this PR?
Updated BufferImportTypeVisitor to correctly handle a situation when underlying c-data array pointer is NULL (0) and the expected length of data is zero (0). 

### Are these changes tested?
Yes, updated the existing unit tests

### Are there any user-facing changes?
No

* Closes: apache#37056

Lead-authored-by: Jacek Stania <janosik47@gmail.com>
Co-authored-by: Jacek Stania <38670505+janosik47@users.noreply.github.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…pache#37531)

### Rationale for this change

Fix java lib bug that prevents from importing specific data via c-data interface.
Currently an attempt to load a vector with empty data buffer results in an IllegalStateException error.

### What changes are included in this PR?
Updated BufferImportTypeVisitor to correctly handle a situation when underlying c-data array pointer is NULL (0) and the expected length of data is zero (0). 

### Are these changes tested?
Yes, updated the existing unit tests

### Are there any user-facing changes?
No

* Closes: apache#37056

Lead-authored-by: Jacek Stania <janosik47@gmail.com>
Co-authored-by: Jacek Stania <38670505+janosik47@users.noreply.github.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants