We can make nested columns in a Parquet file by putting a pa.StructArray in a pa.Table and writing that Table to Parquet. We can selectively read back that nested column by specifying it with dot syntax.
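For example, a minimal sketch of that pattern (the file and field names come from the dot-syntax call quoted later in this report; the data itself is illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A struct column with a nested field, written to Parquet.
struct_array = pa.StructArray.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["struct_field", "other_field"],
)
table = pa.Table.from_arrays([struct_array], names=["table_column"])
pq.write_table(table, "f.parquet")

# Selectively read back only the nested field using dot syntax.
pq.ParquetFile("f.parquet").read_row_groups([0], ["table_column.struct_field"])
```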
But if the Arrow types are ExtensionTypes, then the above causes a segfault. The segfault depends both on the nested struct field and the ExtensionTypes.
Here is a minimal reproducing example of reading a nested struct field without extension types, which does not segfault. (I'm building the pa.StructArray manually with from_buffers because I'll have to add the ExtensionTypes in the next example.)
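A sketch of what that first example could look like (the field names follow the `column.one` path referenced later in the thread; the values and file name are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Child arrays for the struct's fields.
one = pa.array([1, 2, 3], type=pa.int64())
two = pa.array([1.1, 2.2, 3.3], type=pa.float64())

# Build the StructArray manually with from_buffers: a struct array has a single
# validity buffer (which may be None) plus its child arrays.
struct_type = pa.struct([("one", pa.int64()), ("two", pa.float64())])
struct_array = pa.Array.from_buffers(struct_type, 3, [None], children=[one, two])

table = pa.Table.from_arrays([struct_array], names=["column"])
pq.write_table(table, "record.parquet")

# Selecting only the nested struct field works fine without extension types.
print(pq.ParquetFile("record.parquet").read_row_groups([0], ["column.one"]))
```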
Update on this issue: we encountered it "in the wild" in dask-contrib/dask-awkward#140. (As a work-around, the user has turned off ExtensionArray/ExtensionType, but that's not a long-term solution because it drops metadata in the Awkward ←→ Arrow conversion.)
So I've found the problem but haven't yet worked out the solution. The segmentation fault occurs in GetReader in reader.cc. The handling for an extension type recursively calls GetReader for the storage type and then wraps the result in an ExtensionReader. However, if the nested field is not loaded, the recursive GetReader call sets `out` to nullptr, and this code then creates an ExtensionReader with a null storage reader, which crashes later.
The fix is, unfortunately, not as simple as returning null. The problem is that the Parquet reader is trying to maintain the nested structure. As you see in your example that works, `column.one` yields a partial struct.
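With the illustrative names from the sketch above, that partial result looks roughly like this:

```python
import pyarrow.parquet as pq

# The selected leaf comes back still wrapped in its parent struct: the schema is
# column: struct<one: int64>, not a bare int64 column named "one".
partial = pq.ParquetFile("record.parquet").read_row_groups([0], ["column.one"])
print(partial.schema)
```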
However, it is not clear that a partial "extension type" is a valid thing. For example, imagine your extension type was a 2DPoint with "x" and "y". What should be returned if the user loads points.x? We can't maintain structure in that case.
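To make that concrete, here is a rough sketch of such an extension type in Python (the class name, extension name, and values are illustrative):

```python
import pyarrow as pa

class Point2DType(pa.ExtensionType):
    """Extension type whose storage is struct<x: int32, y: int32>."""

    def __init__(self):
        super().__init__(pa.struct([("x", pa.int32()), ("y", pa.int32())]), "example.point2d")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(Point2DType())

points = pa.ExtensionArray.from_storage(
    Point2DType(),
    pa.array([{"x": 1, "y": 2}, {"x": 3, "y": 4}],
             type=pa.struct([("x", pa.int32()), ("y", pa.int32())])),
)
```

A selective read of points.x could prune the storage to struct<x: int32>, but that pruned struct is no longer a valid storage type for Point2DType, so there is no obvious way to keep the extension wrapper around a partial result.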
I'm a little new to parquet and nested references so I don't know if there is a syntax we can use to ask for the nested columns without structure. In that case you would get just the leaf values (e.g. an x column on its own), without the surrounding struct.
…3634)
When the parquet reader is performing a partial read it will try to maintain field structure. So, for example, given a schema of `points: struct<x: int32, y: int32>` and a load of `points.x` it will return `points: struct<x: int32>`. However, if there is an extension type `points: Point` where `Point` has a storage type of `struct<x: int32, y: int32>`, then suddenly the reference `points.x` no longer makes sense.
* Closes: #20385
Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
We can make nested columns in a Parquet file by putting a pa.StructArray in a pa.Table and writing that Table to Parquet. We can selectively read back that nested column by specifying it with dot syntax: `pq.ParquetFile("f.parquet").read_row_groups([0], ["table_column.struct_field"])`
But if the Arrow types are ExtensionTypes, then the above causes a segfault. The segfault depends both on the nested struct field and the ExtensionTypes.
Here is a minimal reproducing example of reading a nested struct field without extension types, which does not segfault. (I'm building the pa.StructArray manually with from_buffers because I'll have to add the ExtensionTypes in the next example.) So far, so good; no segfault. Next, we define and register an ExtensionType, build the pa.StructArray again, and write and read it back; now there's a segfault.
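A rough sketch of that second stage, assuming an illustrative annotation-carrying extension type (the class, the extension name, and the annotation strings are invented here, and ExtensionArray.from_storage is used instead of raw from_buffers for brevity; record_annotated.parquet is the file name mentioned below):

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

class AnnotatedType(pa.ExtensionType):
    """Illustrative extension type wrapping an arbitrary storage type with an annotation."""

    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "example.annotated")

    def __arrow_ext_serialize__(self):
        return json.dumps({"annotation": self.annotation}).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())["annotation"]
        print("deserializing annotation:", annotation)
        return cls(storage_type, annotation)

pa.register_extension_type(AnnotatedType(pa.int64(), "registered"))

# Wrap both the child fields and the struct itself in the extension type.
one = pa.ExtensionArray.from_storage(
    AnnotatedType(pa.int64(), "inner one"), pa.array([1, 2, 3], type=pa.int64())
)
two = pa.ExtensionArray.from_storage(
    AnnotatedType(pa.float64(), "inner two"), pa.array([1.1, 2.2, 3.3], type=pa.float64())
)
storage = pa.StructArray.from_arrays([one, two], names=["one", "two"])
struct_array = pa.ExtensionArray.from_storage(AnnotatedType(storage.type, "outer"), storage)

table = pa.Table.from_arrays([struct_array], names=["column"])
pq.write_table(table, "record_annotated.parquet")

# Reading the whole table back is fine, but selecting the nested field with the
# extension types registered segfaults on affected pyarrow versions.
pq.ParquetFile("record_annotated.parquet").read_row_groups([0], ["column.one"])
```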
The output prints each annotation as the ExtensionType is deserialized and then ends in a segmentation fault.
Note that if we read back that file, record_annotated.parquet, without the ExtensionType, everything is fine; and if we register the ExtensionType but don't select a column, everything is fine.
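Roughly, the two non-crashing cases look like this (same illustrative file and column names as above):

```python
import pyarrow.parquet as pq

# 1. Without the ExtensionType registered (e.g. in a fresh process that never
#    registered "example.annotated"), the columns come back as their plain
#    storage types and the selective read is fine.
pq.ParquetFile("record_annotated.parquet").read_row_groups([0], ["column.one"])

# 2. With the ExtensionType registered but no column selection, reading the
#    full table back is also fine.
pq.ParquetFile("record_annotated.parquet").read_row_groups([0])
```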
It's just the case of doing both that causes the segfault.
Reporter: Jim Pivarski / @jpivarski
Watchers: Rok Mihevc / @rok
Note: This issue was originally created as ARROW-17539. Please see the migration documentation for further details.