[C++] Validation of ExtensionType with null storage type failing (Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType) #30077

Closed
asfimport opened this issue Oct 29, 2021 · 3 comments

Comments

@asfimport

Here's a corner case: suppose that I have an array whose type is null. Since it can only hold missing values, the whole array consists of nothing but nulls. In real life, this might only happen inside a nested data structure, at some level where an untyped data source (e.g. nested Python lists) had no entries, so a type could not be determined. We expect to be able to write this data to Parquet and read it back, and we can, as long as it doesn't have an ExtensionType.

Here's an example that works, without ExtensionType:

>>> import json
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> 
>>> validbits = np.packbits(np.ones(14, dtype=np.uint8), bitorder="little")
>>> empty_but_for_nulls = pa.Array.from_buffers(
...     pa.null(), 14, [pa.py_buffer(validbits)], null_count=14
... )
>>> empty_but_for_nulls
<pyarrow.lib.NullArray object at 0x7fb1560bbd00>
14 nulls
>>> 
>>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp.parquet")
>>> pa.parquet.read_table("tmp.parquet")
pyarrow.Table
: null
----
: [14 nulls]

And here's a continuation of that example, which doesn't work because the type pa.null() is replaced by AnnotatedType(pa.null(), {"cool": "beans"}):

>>> class AnnotatedType(pa.ExtensionType):
...     def __init__(self, storage_type, annotation):
...         self.annotation = annotation
...         super().__init__(storage_type, "my:app")
...     def __arrow_ext_serialize__(self):
...         return json.dumps(self.annotation).encode()
...     @classmethod
...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
...         annotation = json.loads(serialized.decode())
...         return cls(storage_type, annotation)
... 
>>> pa.register_extension_type(AnnotatedType(pa.null(), None))
>>> 
>>> empty_but_for_nulls = pa.Array.from_buffers(
...     AnnotatedType(pa.null(), {"cool": "beans"}),
...     14,
...     [pa.py_buffer(validbits)],
...     null_count=14,
... )
>>> empty_but_for_nulls
<pyarrow.lib.ExtensionArray object at 0x7fb14b5e1ca0>
14 nulls
>>> 
>>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp2.parquet")
>>> pa.parquet.read_table("tmp2.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Array of type extension<my:app<AnnotatedType>> has 14 nulls but no null bitmap

If "nullable type null" were outside the set of types that should be writable to Parquet, then either it would not work in the non-ExtensionType case, or it would fail on writing rather than on reading, so I'm quite sure this is a bug.

Reporter: Jim Pivarski / @jpivarski
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-14522. Please see the migration documentation for further details.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
@jpivarski Thanks for the report!

While trying to replicate this with simplified code (you can create a null array more easily, and an ExtensionArray from its storage array), it seems just creating this array already fails that way:

>>> null_array = pa.array([None] * 14)
>>> ext_type = AnnotatedType(pa.null(), {"cool": "beans"})
>>> arr = pa.ExtensionArray.from_storage(ext_type, null_array)
Traceback (most recent call last)
~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.ExtensionArray.from_storage()
~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.validate()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Array of type extension<my:app<AnnotatedType>> has 14 nulls but no null bitmap

Also doing a full validation of the null extension array you created with from_buffers seems to fail (although without full=True it does not fail, which is a bit strange):

>>> empty_but_for_nulls.validate(full=True)
Traceback (most recent call last)
~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.validate()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: null_count value (14) doesn't match actual number of nulls in array (0)

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
There is a second issue here: the way you are creating the null array uses a buffer, while a NullArray is expected to have no buffers allocated at all (it only stores the length). That trips up the full validation (the second code example in the comment above).

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 11650
#11650
