I've been happily making ExtensionArrays, but recently noticed that they aren't preserved by round-trips through Parquet files when {}use_compliant_nested_type=True{}.
Consider this writer.py:
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class AnnotatedType(pa.ExtensionType):
def __init__(self, storage_type, annotation):
self.annotation = annotation
super().__init__(storage_type, "my:app")
def __arrow_ext_serialize__(self):
return json.dumps(self.annotation).encode()
@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
annotation = json.loads(serialized.decode())
return cls(storage_type, annotation)
@property
def num_buffers(self):
return self.storage_type.num_buffers
@property
def num_fields(self):
return self.storage_type.num_fields
pa.register_extension_type(AnnotatedType(pa.null(), None))
array = pa.Array.from_buffers(
AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}),
3,
[None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))],
children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])],
)
table = pa.table({"": array})
print(table)
pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True)
And this reader.py:
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class AnnotatedType(pa.ExtensionType):
def __init__(self, storage_type, annotation):
self.annotation = annotation
super().__init__(storage_type, "my:app")
def __arrow_ext_serialize__(self):
return json.dumps(self.annotation).encode()
@classmethod
def __arrow_ext_deserialize__(cls, storage_type, serialized):
annotation = json.loads(serialized.decode())
return cls(storage_type, annotation)
@property
def num_buffers(self):
return self.storage_type.num_buffers
@property
def num_fields(self):
return self.storage_type.num_fields
pa.register_extension_type(AnnotatedType(pa.null(), None))
table = pq.read_table("tmp.parquet")
print(table)
(The AnnotatedType is the same; I wrote it twice for explicitness.)
When the writer.py has {}use_compliant_nested_type=False{}, the output is
% python writer.py
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
In other words, the AnnotatedType is preserved. When {}use_compliant_nested_type=True{}, however,
% rm tmp.parquet
rm: remove regular file 'tmp.parquet'? y
% python writer.py
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py
pyarrow.Table
: list<element: double>
child 0, element: double
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
The issue doesn't seem to be in the writing, but in the reading: regardless of whether use_compliant_nested_type is True or {}False{}, I can see the extension metadata in the Parquet → Arrow converted schema.
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<item: double>
child 0, item: double
-- field metadata --
ARROW:extension:metadata: '{"cool": "beans"}'
ARROW:extension:name: 'my:app'
versus
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<element: double>
child 0, element: double
-- field metadata --
ARROW:extension:metadata: '{"cool": "beans"}'
ARROW:extension:name: 'my:app'
Note that the first has "{}item: double{}" and the second has "{}element: double{}".
(I'm also rather surprised that use_compliant_nested_type=False is an option. Wouldn't you want the Parquet files to always be written with compliant lists? I noticed this when I was having trouble getting the data into BigQuery.)
Environment: pyarrow 7.0.0 installed via pip.
Reporter: Jim Pivarski / @jpivarski
Note: This issue was originally created as ARROW-16348. Please see the migration documentation for further details.
I've been happily making ExtensionArrays, but recently noticed that they aren't preserved by round-trips through Parquet files when
{}use_compliant_nested_type=True{}.Consider this writer.py:
And this reader.py:
(The AnnotatedType is the same; I wrote it twice for explicitness.)
When the writer.py has
{}use_compliant_nested_type=False{}, the output isIn other words, the AnnotatedType is preserved. When
{}use_compliant_nested_type=True{}, however,The issue doesn't seem to be in the writing, but in the reading: regardless of whether
use_compliant_nested_typeisTrueor{}False{}, I can see the extension metadata in the Parquet → Arrow converted schema.versus
Note that the first has "
{}item: double{}" and the second has "{}element: double{}".(I'm also rather surprised that
use_compliant_nested_type=Falseis an option. Wouldn't you want the Parquet files to always be written with compliant lists? I noticed this when I was having trouble getting the data into BigQuery.)Environment: pyarrow 7.0.0 installed via pip.
Reporter: Jim Pivarski / @jpivarski
Note: This issue was originally created as ARROW-16348. Please see the migration documentation for further details.