Add support for persisting big-endian arrays to Parquet by byte-swapping on write. #14373
Conversation
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
Special thanks to @leeskelvin for catching this!
Thanks! Does this need backporting?
astropy/io/misc/parquet.py (outdated):

```python
    not np.little_endian and val.dtype.byteorder == "="
):
    # We need to convert the array to little-endian.
    val2 = val.byteswap()
```
The standard way to do this is `val.astype(new_dtype)` -- this will set the dtype as well. Note that `val.astype(dtype, copy=False)` is very fast if the dtype is identical, so if you know the dtype it should become, you can do that upfront, and not have to worry about the try/except.
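To illustrate the point being made here, a small NumPy sketch (the array name `val` and the dtypes are illustrative, not taken from the actual patch):

```python
import numpy as np

# Hypothetical column data; ">i8" is big-endian 64-bit int.
val = np.arange(4, dtype=">i8")

# astype converts values AND sets the new dtype in one call,
# unlike byteswap(), which swaps bytes but keeps the old dtype.
le = val.astype("<i8")
assert le.dtype.byteorder in ("<", "=")  # native order normalizes to "="
assert (le == val).all()  # same values, only the byte order changed

# With copy=False, no copy is made when the dtype already matches.
same = le.astype("<i8", copy=False)
assert same is le
```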
Oooh, I didn't realize that. However, one of the reasons that I did it this way (with the copy) is that I didn't want persisting the table to change the datatype (even endianness) of the columns in the table. What are your thoughts about that?
If you always want a copy, then `.astype` is even better. But I'm not familiar with how `pa.array` works, so cannot really comment on that part. Regardless, I would use `.astype` even for your stanza here, as it is meant to include everything one needs to get to a new dtype, including byte swapping.
I don't always want a copy (which would be slower and waste memory), but I do want a copy when we need to do a byte swap, because I think it's bad behavior for writing a table to change the datatypes in the table. So there still needs to be a check for the endianness. I'll update the code to do an `astype` rather than a `byteswap` and `newbyteorder`. But there's still a question of whether it should be a try/except (which would only trigger on the hopefully rare occasions when a column does need a byte swap) vs. checking the byte order of every array every time.
Sorry for the confusion: `.astype(..., copy=False)` will still make a copy if a byteswap is needed; the flag just asks not to copy when one is not needed. So it should be just right.
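A quick check confirms this behavior (assuming a little-endian host, which is where `"<f8"` is the native order):

```python
import numpy as np

big = np.arange(3, dtype=">f8")     # big-endian column
little = np.arange(3, dtype="<f8")  # already in native order

# copy=False still copies when a byteswap is required...
swapped = big.astype("<f8", copy=False)
assert swapped is not big            # a new array was made
assert swapped.dtype != big.dtype    # byte order now differs
assert (swapped == big).all()        # values are unchanged

# ...but returns the original array untouched when it is not.
assert little.astype("<f8", copy=False) is little
```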
@mhvk, is this ready to go in? If so, can you please approve? Thanks!
So sorry I missed this for so long. It looks all great!
Description
This PR adds support to the Parquet writer for big-endian columns. This is particularly relevant because arrays serialized with FITS will be converted to big-endian arrays, which then could not be serialized with Parquet. For example, with the current Parquet code this fails:
with
This PR now checks for this error and does the byte swapping on write where possible. Note that the table, after reading, will have all columns in little-endian order. This behavior is analogous to that of the FITS I/O, where little-endian columns are switched to big-endian after reading (as in the failing example above).
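A minimal pure-NumPy sketch of the conversion the writer now performs (the helper name `to_native_for_parquet` is hypothetical, chosen here for illustration; the real code lives in `astropy/io/misc/parquet.py`):

```python
import numpy as np

def to_native_for_parquet(col: np.ndarray) -> np.ndarray:
    """Return a little-endian version of `col` suitable for pyarrow.

    Copies only when a byteswap is actually needed, so the column
    in the original table is never modified. (Hypothetical helper
    name, for illustration only.)
    """
    return col.astype(col.dtype.newbyteorder("<"), copy=False)

fits_col = np.arange(5, dtype=">i4")     # FITS data is big-endian
out = to_native_for_parquet(fits_col)
assert fits_col.dtype.byteorder == ">"   # original column untouched
assert out.dtype.byteorder in ("<", "=") # ready for Parquet
assert (out == fits_col).all()           # values preserved
```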
Fixes #