Add support for persisting big-endian arrays to Parquet by byte-swapping on write. #14373

Merged: 3 commits into astropy:main on Apr 18, 2023

Conversation

@erykoff (Contributor) commented on Feb 8, 2023

Description

This PR adds support to the Parquet writer for big-endian columns. This is particularly relevant because columns round-tripped through FITS come back as big-endian arrays, which previously could not be serialized with Parquet. For example, with the current Parquet code this fails:

from astropy.table import Table
import numpy as np

table = Table(data=np.zeros(10, dtype=[("a", np.float64)]))
table.write("test.fits", overwrite=True)
table2 = Table.read("test.fits")
table2.write("test.parq", overwrite=True)

with

ArrowNotImplementedError: Byte-swapped arrays not supported

This PR now checks for this error and does the byte-swapping on write where possible. Note that after reading the file back, all columns will be in little-endian order. This behavior is analogous to that of the FITS I/O, where little-endian columns are switched to big-endian after reading (as in the failing example above).
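
In essence, the conversion amounts to something like the sketch below, applied before each column is handed to pyarrow. This is illustrative only and not the exact code in this PR; column_to_arrow is a hypothetical helper name.

import pyarrow as pa

def column_to_arrow(val):
    # Hypothetical helper: pyarrow rejects byte-swapped (non-native-endian)
    # numpy buffers, so convert to native byte order first.  astype returns a
    # new array here, so the column in the original table keeps its dtype.
    if not val.dtype.isnative:
        val = val.astype(val.dtype.newbyteorder("="))
    return pa.array(val)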

Fixes #

@github-actions (bot) commented on Feb 8, 2023

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@erykoff (Contributor, Author) commented on Feb 8, 2023

Special thanks to @leeskelvin for catching this!

@pllim added this to the v5.3 milestone on Feb 8, 2023
@pllim added the Bug label on Feb 8, 2023
@pllim (Member) commented on Feb 8, 2023

Thanks! Does this need backporting?

Review thread on the diff hunk below:

    not np.little_endian and val.dtype.byteorder == "="
):
    # We need to convert the array to little-endian.
    val2 = val.byteswap()

@mhvk (Contributor) commented on Feb 9, 2023:

The standard way to do this is val.astype(new_dtype) -- this will set the dtype as well.

Note that val.astype(dtype, copy=False) is very fast if the dtype is identical, so if you know the dtype it should become, you can do that upfront, and not have to worry about the try/except.
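
(As an aside for readers, a small illustration of that suggestion; this is not part of the PR, and the "<f8" target dtype is just an example.)

import numpy as np

big = np.arange(3, dtype=">f8")    # big-endian values, as a FITS-backed column would hold
little = big.astype("<f8")         # one call: swaps the bytes and records the new dtype
print(little.dtype.str, np.array_equal(big, little))    # prints: <f8 True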

@erykoff (Contributor, Author) replied:

Oooh, I didn't realize that. However, one of the reasons that I did it this way (with the copy) is that I didn't want persisting the table to change the datatype (even endianness) of the columns in the table. What are your thoughts about that?

@mhvk (Contributor) replied:

If you always want a copy, then .astype is even better. But I'm not familiar with how pa.array works, so cannot really comment on that part. Regardless, I would use .astype even for your stanza here, as it is meant to include everything one needs to get to a new dtype, including byte swapping.

@erykoff (Contributor, Author) replied:

I don't always want a copy (which would be slower and waste memory), but I do want a copy when a byte swap is needed, because I think it's bad behavior for writing a table to change the datatypes of its columns. So there still needs to be a check for the endianness. I'll update the code to use astype rather than byteswap plus newbyteorder. But there's still the question of whether it should be a try/except (which would only trigger in the hopefully rarer case where a column does need a byte swap) vs. checking the byte order of every array every time.

@mhvk (Contributor) replied:

Sorry for the confusion: .astype(..., copy=False) will make a copy if a byteswap is needed; the flag just asks not to copy when one is not needed. So it should be just right.
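
(Again as an aside, a small check of those copy semantics; illustrative only, not from the PR.)

import numpy as np

native_dt = np.dtype("=f8")
swapped_dt = native_dt.newbyteorder()    # the non-native byte order for this machine

native = np.zeros(4, dtype=native_dt)
swapped = np.zeros(4, dtype=swapped_dt)

a = native.astype(native_dt, copy=False)     # dtype already matches: the input array is returned as-is
b = swapped.astype(native_dt, copy=False)    # byteswap needed: a new, swapped copy is made
print(a is native, np.shares_memory(b, swapped))    # prints: True False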

@pllim (Member) commented on Apr 18, 2023

@mhvk , is this ready to go in? If so, can you please approve? Thanks!

@mhvk (Contributor) left a review:

So sorry I missed this for so long. It looks all great!

@mhvk merged commit d0cb3ed into astropy:main on Apr 18, 2023