-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve non-float dtypes in Quantity columns from serialized mixins #12505
base: main
Are you sure you want to change the base?
Conversation
👋 Thank you for your draft pull request! Do you know that you can use |
There's a bit of a problem with this:
For mixin classes like I think a better approach is to change how Quantity serializes itself to add a
|
This seems to work:
|
astropy/io/fits/connect.py
Outdated
if 'datatype' in col: | ||
if col['name'] in tbl.meta['__serialized_columns__']: | ||
tbl.meta['__serialized_columns__'][col['name']]['dtype'] = DATA_TYPES.get( | ||
col['datatype'], col['datatype']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the other fix, this is no longer necessary:
In [2]: qt = QTable([u.Quantity([1,2], unit=u.m, dtype=int)])
In [3]: qt.write('junk.fits', overwrite=True)
In [4]: qt2 = QTable.read('junk.fits')
In [5]: qt2
Out[5]:
<QTable length=2>
col0
m
int64
-----
1
2
astropy/table/table.py
Outdated
@@ -3963,7 +3963,7 @@ def _convert_col_for_table(self, col): | |||
# Quantity subclasses identified in the unit (such as u.mag()). | |||
q_cls = Masked(Quantity) if isinstance(col, MaskedColumn) else Quantity | |||
try: | |||
qcol = q_cls(col.data, col.unit, copy=False, subok=True) | |||
qcol = q_cls(col.data, col.unit, copy=False, subok=True, dtype=col.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine but this should have an explicit test (even if it is implicitly tested in the mixin handling stuff).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure; was just waiting whether to test QTable(tab)
or only QTable(tab, dtype=tab.dtype)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added under TestNewFromColumns
Certainly a bug that the dtype does not round-trip. I like @taldcroft's solution! I wondered whether it is worth not storing the |
Thanks, this looks much neater and more concise than my approach. And I guess keeping out['wrap_angle'] = {'value': q.wrap_angle.value, 'unit': q.wrap_angle.unit} fails on decoding with a
but instead outputs
i.e. the But I do see several more failures in
(all with the same |
e4500cc
to
0333c0f
Compare
Stealing the |
astropy/units/quantity.py
Outdated
def _represent_as_dict(self, attrs=None): | ||
q = self._parent | ||
out = super()._represent_as_dict(attrs) | ||
out['dtype'] = q.dtype.name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work for structured quantities; do we need to overwrite _represent_as_dict
at all?
I actually wonder why dtype
is needed at all; in principle, value
has the dtype information, so that should be propagated correctly if it isn't already...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p.s. For the structured types, would be good to include #12509.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs some instruction on how to use dtype.name
to store the dtype
information, apparently. value
is simply written as something like
value: !astropy.table.SerializedColumn {name: lon}
so the dtype
will have to be passed on to the Quantity
constructor through some channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks, I now see the problem: the .value
does have a dtype
and that gets used, but then the result is passed through u.Quantity(value, unit)
and the dtype
is lost at that point. What perhaps we could do is only pass the dtype
that actually is converted, i.e.,
if q.dtype.kind in 'iu':
out['dtype'] = q.dtype.name
This also solves the problem that what you have now would fail for structured dtype (which probably means there should be a test for that...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes; I thought about something like that for the first implementation to resolve the Time
regression either intercepting object
or other complex dtypes, or restricting explicitly to integers. The latter choice is probably more prudent; this only leaves the corner case of wrap_angle
in Longitude
, which is converted to a float64
default; also floats will no longer be returned in native endianness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think currently scalars are treated differently from arrays, as they are typeset as string instead of a binary blob. Not sure if that is worth changing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
astropy/table/tests/conftest.py
Outdated
@@ -69,7 +69,7 @@ class MyTable(table.Table): | |||
# and masked (MaskedArray) column. | |||
|
|||
|
|||
@pytest.fixture(params=['unmasked', 'masked', 'subclass']) | |||
@pytest.fixture(params=['unmasked', 'masked', 'subclass', 'quantity']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds a new case to a lot (hundreds?) of tests since this fixture is used a lot. While on principle it never hurts to add more tests, I'm not sure if this is global addition is helping our coverage or just burning CI time and carbon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, initially I was only looking for a simple way to add QTable
coverage to some tests...
This is adding 339 tests to the existing 2079 tests in table
– I found about 0.8 s or 5 % longer runtime for that.
Don't know if that is worth it, or QTable
is just not that critical compared to MaskedTable
or the subclass.
More general comment: there have been quite a few issues about Given this, I wonder if it isn't better to use some kind of option for the cases where one really wants to preserve the data type with which the data are stored internally. After all, conversion to float is not that different from applying |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good to me apart from the one comment about the conftest update.
Yes, though to me the biggest concern seemed that many in-place operations would simply fail on integer quantities. Put it another way, operations that may automatically cast to
where the last option would offer a yet to be implemented option analogous to |
@dhomeier - good to see your summary - I agree with the order, and that (4) and (5) for sure should respect the Also, stopping the auto-conversion to float (i.e., (3)) may well break code that relies on |
This version implements (3) - (5) now. There are pros and cons for adding 'f' (and 'c'?) to the dtype kinds to be serialised; the
That's probably still worth discussing; I'd think that since creating a QTable with integer quantities already requires some very explicit steps, people would generally be aware of the format. But it might be safer to keep the old behaviour as default. I've checked locally this works with #12509, so this should be safe to rebase once either PR is merged. Need to revert the |
Merged! So good to rebase.
I have more mixed feelings about this. When wearing my science hat I agree completely: rarely is the
When wearing my comp-sci hat, the thought of auto-upcasting is less attractive.
I do think there might be a comfortable middle ground: For a list
The Ellipsis option would essentially mean "pass it on". Numpy would be the end-point here, so a The advantage of this approach is that there is a simple means of specifying the dtype without knowing what it is. It's essentially equivalent to
but allows the user to skip the explicit construction of an intermediate array. If we apply this consistently, I hope it can solve (or at least alleviate) @dhomeier's list of upcast situations (#12505 (comment)). 🤞 |
85dfc64
to
3b240a8
Compare
Rebased (it works!) to get the discussion going again.
The unexpected upcasting was also the initial issue triggering this, so the use case is certainly relevant enough.
Just wondering how many in places code needs to be changed to handle the And for |
I'm going to close this pull request as per my previous message. If you think what is being added/fixed here is still important, please remember to open an issue to keep track of it. Thanks! If this is the first time I am commenting on this issue, or if you believe I closed this issue incorrectly, please report this here. |
One point from #15447 which I opened and then learned that this PR already exists: I'm looking at specific fits tables that are the standard for all of X-ray astronomy, a so-called ARF/RMF/PHA files. All three of those file types are tables where most columns have good, physical fits units, e.g. keV or counts or counts/s or cm**2; thus I'd like to read those files as a Without explaining the format in more detail here, there is no way to represent this as a QTable without the ability to have some of the columns as an int. |
@hamogu - yes, this PR was a good one that we should try to revive. One question: does the channel column have a unit? Also, for a possible rebase: what @nstarman ended up implementing was that p.s. Apologies, but I won't be able to spend time here before feature-freeze: I'm stuck writing my next 5-year research grant until Nov 1st. |
It has a unit set in the fits header that does not parse with astropy, probably because it's not part of the main fits standard, but only defined by NASA's HEASARC fits working group (formerly "Office of guest investigator programs") (e.g. https://heasarc.gsfc.nasa.gov/docs/heasarc/ofwg/docs/spectra/ogip_92_007/node7.html for PHA type I files)
(Links given for future reference, not because it's needed to follow all of them to understand what I need Quantity to do to make this work.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked briefly at this PR again and am not sure it is quite complete: missing is some kind of option to tell QTable.read()
to use it. Indeed, it may be that just adjusting the reading is the right thing to do.
Nevertheless a few comments on what is here too.
@@ -3948,6 +3948,13 @@ class QTable(Table): | |||
|
|||
""" | |||
|
|||
def __init__(self, data=None, masked=False, names=None, dtype=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick look through the PR: here, I think one could write quantity_dtype=np.inexact
- and then allow the option None
.
We cannot use dtype
since that is supposed to give the separate dtype for every column.
try: | ||
qcol = q_cls(col.data, col.unit, copy=False, subok=True) | ||
qcol = q_cls(col.data, col.unit, copy=False, subok=True, dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we can then just pass on dtype=self._quantity_dtype
astropy/table/tests/test_info.py
Outdated
cls = t.ColumnClass.__name__ | ||
assert np.all(tinfo['class'] == [cls, cls, cls, cls, 'Time', 'SkyCoord']) | ||
# QTable does not preserve 'quantity' description - should it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should...
Co-authored-by: Tom Aldcroft <taldcroft@noreply.users.github.com>
3b240a8
to
af8f1d3
Compare
Just rebasing this with the black/merge conflicts resolved; I'll need more time to look into @mhvk's last round of comments, too. |
Description
Columns of integer types that are serialised with Quantity information in FITS tables are presently automatically "upcast" when read back in, since
Quantity(data)
is called with its default settings, converting all numerical input to float[64]-type.Fixes #12494, enabling both
Table.read()
andQTable.read()
to recover the original (and stored) dtypes from a FITS table.Fixes #15447 also.
The second commit addresses an additional issue that came up with the above: instantiating a
QTable
from integer columns also casts every non-float column tofloat64
, overriding any types set with thedtype
option as well.Since the docstring states that
dtype=[...]
will set the column dtypes, I consider that latter part a bug.The commit here implements the simple fix of adopting all original dtypes from the column, so creating a
QTable
from aTable
or list ofColumn
s will always preserve their dtypes, whether explicitly set via the option or not (tested interactively to work withMasked
as well, but proper tests are to follow when we have agreement on the desired functionality).The above may be reasonable behaviour and is not breaking any existing tests, but of course changes results; in particular it is not consistent with the default behaviour of
Quantity()
, so I am leaving this to further discussion.Honouring only an explicitly set list of
dtype=[...]
would be a bit lengthier, probably involving replicating a good bit of heTable.__init__
setup inQTable
.Checklist for package maintainer(s)
This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.
Extra CI
label.no-changelog-entry-needed
label. If this is a manual backport, use theskip-changelog-checks
label unless special changelog handling is necessary.astropy-bot
check might be missing; do not let the green checkmark fool you.backport-X.Y.x
label(s) before merge.