Add support for persisting big-endian arrays to Parquet by byte-swapping on write. #14373

Merged: 3 commits into astropy:main on Apr 18, 2023

Conversation

@erykoff (Contributor) commented on Feb 8, 2023

Description

This PR adds support to the Parquet writer for big-endian columns. This is particularly relevant because columns round-tripped through FITS come back as big-endian arrays, which previously could not be serialized with Parquet. For example, with the current Parquet code this fails:

from astropy.table import Table
import numpy as np

table = Table(data=np.zeros(10, dtype=[("a", np.float64)]))
table.write("test.fits", overwrite=True)
table2 = Table.read("test.fits")
table2.write("test.parq", overwrite=True)

with

ArrowNotImplementedError: Byte-swapped arrays not supported

This PR now checks for this error and does the byte-swapping on write where possible. Note that after reading the file back, all columns will be in little-endian order. This behavior is analogous to that of the FITS I/O, where little-endian columns are switched to big-endian after reading (as in the failing example above).
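
In essence, the conversion amounts to something like the sketch below, applied before each column is handed to pyarrow. This is illustrative only and not the exact code in this PR; column_to_arrow is a hypothetical helper name.

import pyarrow as pa

def column_to_arrow(val):
    # Hypothetical helper: pyarrow rejects byte-swapped (non-native-endian)
    # numpy buffers, so convert to native byte order first.  astype returns a
    # new array here, so the column in the original table keeps its dtype.
    if not val.dtype.isnative:
        val = val.astype(val.dtype.newbyteorder("="))
    return pa.array(val)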

Fixes #

@github-actions (bot) commented on Feb 8, 2023

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@erykoff (Contributor, Author) commented on Feb 8, 2023

Special thanks to @leeskelvin for catching this!

@pllim added this to the v5.3 milestone on Feb 8, 2023
@pllim added the Bug label on Feb 8, 2023
@pllim (Member) commented on Feb 8, 2023

Thanks! Does this need backporting?

Review thread on the diff hunk below:

    not np.little_endian and val.dtype.byteorder == "="
):
    # We need to convert the array to little-endian.
    val2 = val.byteswap()

@mhvk (Contributor) commented on Feb 9, 2023:

The standard way to do this is val.astype(new_dtype) -- this will set the dtype as well.

Note that val.astype(dtype, copy=False) is very fast if the dtype is identical, so if you know the dtype it should become, you can do that upfront, and not have to worry about the try/except.
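
(As an aside for readers, a small illustration of that suggestion; this is not part of the PR, and the "<f8" target dtype is just an example.)

import numpy as np

big = np.arange(3, dtype=">f8")    # big-endian values, as a FITS-backed column would hold
little = big.astype("<f8")         # one call: swaps the bytes and records the new dtype
print(little.dtype.str, np.array_equal(big, little))    # prints: <f8 True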

@erykoff (Contributor, Author) replied:

Oooh, I didn't realize that. However, one of the reasons that I did it this way (with the copy) is that I didn't want persisting the table to change the datatype (even endianness) of the columns in the table. What are your thoughts about that?

@mhvk (Contributor) replied:

If you always want a copy, then .astype is even better. But I'm not familiar with how pa.array works, so cannot really comment on that part. Regardless, I would use .astype even for your stanza here, as it is meant to include everything one needs to get to a new dtype, including byte swapping.

@erykoff (Contributor, Author) replied:

I don't always want a copy (which would be slower and waste memory), but I do want a copy when a byte swap is needed, because I think it's bad behavior for writing a table to change the datatypes of its columns. So there still needs to be a check for the endianness. I'll update the code to use astype rather than byteswap plus newbyteorder. But there's still the question of whether it should be a try/except (which would only trigger in the hopefully rarer case where a column does need a byte swap) vs. checking the byte order of every array every time.

@mhvk (Contributor) replied:

Sorry for the confusion: .astype(..., copy=False) will make a copy if a byteswap is needed; the flag just asks not to copy when one is not needed. So it should be just right.
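
(Again as an aside, a small check of those copy semantics; illustrative only, not from the PR.)

import numpy as np

native_dt = np.dtype("=f8")
swapped_dt = native_dt.newbyteorder()    # the non-native byte order for this machine

native = np.zeros(4, dtype=native_dt)
swapped = np.zeros(4, dtype=swapped_dt)

a = native.astype(native_dt, copy=False)     # dtype already matches: the input array is returned as-is
b = swapped.astype(native_dt, copy=False)    # byteswap needed: a new, swapped copy is made
print(a is native, np.shares_memory(b, swapped))    # prints: True False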

@pllim (Member) commented on Apr 18, 2023

@mhvk , is this ready to go in? If so, can you please approve? Thanks!

@mhvk (Contributor) left a review:

So sorry I missed this for so long. It looks all great!

@mhvk merged commit d0cb3ed into astropy:main on Apr 18, 2023