Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve non-float dtypes in Quantity columns from serialized mixins #12505

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

dhomeier
Copy link
Contributor

@dhomeier dhomeier commented Nov 19, 2021

Description

Columns of integer types that are serialised with Quantity information in FITS tables are presently automatically "upcast" when read back in, since Quantity(data) is called with its default settings, converting all numerical input to float[64]-type.

Fixes #12494, enabling both Table.read() and QTable.read() to recover the original (and stored) dtypes from a FITS table.

Fixes #15447 also.

The second commit addresses an additional issue that came up with the above: instantiating a QTable from integer columns also casts every non-float column to float64, overriding any types set with the dtype option as well.
Since the docstring states that dtype=[...] will set the column dtypes, I consider that latter part a bug.
The commit here implements the simple fix of adopting all original dtypes from the column, so creating a QTable from a Table or list of Columns will always preserve their dtypes, whether explicitly set via the option or not (tested interactively to work with Masked as well, but proper tests are to follow when we have agreement on the desired functionality).
The above may be reasonable behaviour and is not breaking any existing tests, but of course changes results; in particular it is not consistent with the default behaviour of Quantity(), so I am leaving this to further discussion.
Honouring only an explicitly set list of dtype=[...] would be a bit lengthier, probably involving replicating a good bit of he Table.__init__ setup in QTable.

Checklist for package maintainer(s)

This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the Extra CI label.
  • Is a change log needed? If yes, did the change log check pass? If no, add the no-changelog-entry-needed label. If this is a manual backport, use the skip-changelog-checks label unless special changelog handling is necessary.
  • Is a milestone set? Milestone must be set but astropy-bot check might be missing; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate backport-X.Y.x label(s) before merge.

@github-actions
Copy link

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@taldcroft
Copy link
Member

There's a bit of a problem with this:

In [5]: t = Table([Time([1,2], format='cxcsec')])
In [6]: t.write('junk.ecsv', overwrite=True)
In [7]: t2 = Table.read('junk.ecsv')
In [8]: t2['col0'].info.dtype
Out[8]: dtype('float64')   # <== dtype('O') in main

For mixin classes like Time, the dtype is left unset which then defaults to object.

I think a better approach is to change how Quantity serializes itself to add a dtype object attribute:

Out[12]: 
<QTable length=2>
 col0
  m  
int64
-----
    1
    2

In [13]: ascii.write(qt, format='ecsv')
# %ECSV 1.0
# ---
# datatype:
# - {name: col0, unit: m, datatype: int64}
# meta: !!omap
# - __serialized_columns__:
#     col0:
#       __class__: astropy.units.quantity.Quantity
#       unit: !astropy.units.Unit {unit: m}
#       value: !astropy.table.SerializedColumn {name: col0}
#       dtype: int64                                         <=== ADD THIS
# schema: astropy-2.0
col0
1
2

@taldcroft
Copy link
Member

This seems to work:

diff --git a/astropy/table/serialize.py b/astropy/table/serialize.py
index 5dfa402818..62d69ccf44 100644
--- a/astropy/table/serialize.py
+++ b/astropy/table/serialize.py
@@ -342,7 +342,6 @@ def _construct_mixin_from_columns(new_name, obj_attrs, out):
 
         # Now copy the relevant attributes
         for attr, nontrivial in (('unit', lambda x: x not in (None, '')),
-                                 ('dtype', lambda x: x is not None),
                                  ('format', lambda x: x is not None),
                                  ('description', lambda x: x is not None),
                                  ('meta', lambda x: x)):
diff --git a/astropy/units/quantity.py b/astropy/units/quantity.py
index 2c60005a84..35b01159e5 100644
--- a/astropy/units/quantity.py
+++ b/astropy/units/quantity.py
@@ -171,10 +171,16 @@ class QuantityInfo(QuantityInfoBase):
     required when the object is used as a mixin column within a table, but can
     be used as a general way to store meta information.
     """
-    _represent_as_dict_attrs = ('value', 'unit')
     _construct_from_dict_args = ['value']
     _represent_as_dict_primary_data = 'value'
 
+    def _represent_as_dict(self, attrs=None):
+        q = self._parent
+        out = {'value': q.value,
+               'unit': q.unit,
+               'dtype': q.dtype.name}
+        return out
+
     def new_like(self, cols, length, metadata_conflicts='warn', name=None):
         """
         Return a new Quantity instance which is consistent with the
In [3]: qt = QTable([u.Quantity([1,2], unit=u.m, dtype=int)])

In [4]: qt.write('junk.ecsv', overwrite=True)

In [5]: qt2 = Table.read('junk.ecsv')

In [6]: qt2
Out[6]: 
<Table length=2>
 col0
  m  
int64
-----
    1
    2
(astropy) ➜  astropy git:(decode-dtype-mixins) ✗ cat junk.ecsv
# %ECSV 1.0
# ---
# datatype:
# - {name: col0, unit: m, datatype: int64}
# meta: !!omap
# - __serialized_columns__:
#     col0:
#       __class__: astropy.units.quantity.Quantity
#       dtype: int64
#       unit: !astropy.units.Unit {unit: m}
#       value: !astropy.table.SerializedColumn {name: col0}
# schema: astropy-2.0
col0
1
2

Comment on lines 111 to 114
if 'datatype' in col:
if col['name'] in tbl.meta['__serialized_columns__']:
tbl.meta['__serialized_columns__'][col['name']]['dtype'] = DATA_TYPES.get(
col['datatype'], col['datatype'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the other fix, this is no longer necessary:

In [2]: qt = QTable([u.Quantity([1,2], unit=u.m, dtype=int)])

In [3]: qt.write('junk.fits', overwrite=True)

In [4]: qt2 = QTable.read('junk.fits')

In [5]: qt2
Out[5]: 
<QTable length=2>
 col0
  m  
int64
-----
    1
    2

@@ -3963,7 +3963,7 @@ def _convert_col_for_table(self, col):
# Quantity subclasses identified in the unit (such as u.mag()).
q_cls = Masked(Quantity) if isinstance(col, MaskedColumn) else Quantity
try:
qcol = q_cls(col.data, col.unit, copy=False, subok=True)
qcol = q_cls(col.data, col.unit, copy=False, subok=True, dtype=col.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine but this should have an explicit test (even if it is implicitly tested in the mixin handling stuff).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure; was just waiting whether to test QTable(tab) or only QTable(tab, dtype=tab.dtype).

Copy link
Contributor Author

@dhomeier dhomeier Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added under TestNewFromColumns

@mhvk
Copy link
Contributor

mhvk commented Nov 20, 2021

Certainly a bug that the dtype does not round-trip. I like @taldcroft's solution! I wondered whether it is worth not storing the dtype if it float64 (i.e., the default), but checking quickly, I saw that Column also stores its dtype in the header. So, I think this is better generally - really seems more like an oversight that we did not do this already!

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 21, 2021

This seems to work:

+    def _represent_as_dict(self, attrs=None):
+        q = self._parent
+        out = {'value': q.value,
+               'unit': q.unit,
+               'dtype': q.dtype.name}
+        return out
+

Thanks, this looks much neater and more concise than my approach. And I guess keeping datatype to record the format the columns were written in, while meta stores the format and dtype they are supposed to be returned in, has some logic to it. The only drawback would be that this does not decode files written with earlier versions.
And I found one test failure on the Longitude mixin, due to the wrap_angle not being included with the definition above.
A naïve attempt to just add it to the dict as

             out['wrap_angle'] = {'value': q.wrap_angle.value, 'unit': q.wrap_angle.unit}

fails on decoding with a ValueError: Invalid character at col 0 in angle 'unit', since this is not fully reproducing the previous description

COMMENT     lon:
COMMENT       __class__: astropy.coordinates.angles.Longitude
COMMENT       unit: &id001 !astropy.units.Unit {unit: deg}
COMMENT       value: !astropy.table.SerializedColumn {name: lon}
COMMENT       wrap_angle: !astropy.coordinates.Angle
COMMENT         unit: *id001
COMMENT         value: 360.0

but instead outputs

COMMENT     lon:
COMMENT       __class__: astropy.coordinates.angles.Longitude
COMMENT       dtype: float32
COMMENT       unit: &id001 !astropy.units.Unit {unit: deg}
COMMENT       value: !astropy.table.SerializedColumn {name: lon}
COMMENT       wrap_angle:
COMMENT         unit: *id001
COMMENT         value: 360.0

i.e. the !astropy.coordinates.Angle still needs to be reconstructed, but how? Also, need to make sure there are no other attributes missed. I did not find any failures pointing to others in the io, coordinates or units tests though.

But I do see several more failures in table I cannot really make sense of:

___________________________________________________________________________________ test_vstack ____________________________________________________________________________________

    def test_vstack():
        """
        Vstack tables with mixin cols.
        """
        t1 = QTable(MIXIN_COLS)
        t2 = QTable(MIXIN_COLS)
        with pytest.raises(NotImplementedError):
>           vstack([t1, t2])

opt/lib/python3.10/site-packages/astropy/table/tests/test_mixin.py:504: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
opt/lib/python3.10/site-packages/astropy/table/operations.py:651: in vstack
    out = _vstack(tables, join_type, col_name_map, metadata_conflicts)
opt/lib/python3.10/site-packages/astropy/table/operations.py:1396: in _vstack
    col = col_cls.info.new_like(cols, n_rows, metadata_conflicts, out_name)
opt/lib/python3.10/site-packages/astropy/units/quantity.py:224: in new_like
    out = self._construct_from_dict(map)
opt/lib/python3.10/site-packages/astropy/utils/data_info.py:387: in _construct_from_dict
    args = [map.pop(attr) for attr in self._construct_from_dict_args]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

.0 = <list_iterator object at 0x12adeda20>

>   args = [map.pop(attr) for attr in self._construct_from_dict_args]
E   KeyError: 'value'

opt/lib/python3.10/site-packages/astropy/utils/data_info.py:387: KeyError
...
============================================================================= short test summary info ==============================================================================
FAILED opt/lib/python3.10/site-packages/astropy/table/tests/test_mixin.py::test_vstack - KeyError: 'value'
FAILED opt/lib/python3.10/site-packages/astropy/table/tests/test_operations.py::TestJoin::test_col_meta_merge[QTable] - KeyError: 'value'
FAILED opt/lib/python3.10/site-packages/astropy/table/tests/test_operations.py::TestVStack::test_col_meta_merge_inner[QTable] - KeyError: 'value'
FAILED opt/lib/python3.10/site-packages/astropy/table/tests/test_operations.py::TestVStack::test_mixin_functionality[latitude] - KeyError: 'value'
FAILED opt/lib/python3.10/site-packages/astropy/table/tests/test_operations.py::TestVStack::test_mixin_functionality[longitude] - KeyError: 'value'
FAILED opt/lib/python3.10/site-packages/astropy/table/tests/test_operations.py::TestVStack::test_mixin_functionality[quantity] - KeyError: 'value'
FAILED opt/lib/python3.10/site-packages/astropy/table/tests/test_operations.py::test_stack_columns - KeyError: 'value'
FAILED opt/lib/python3.10/site-packages/astropy/table/tests/test_operations.py::test_mixin_join_regression - KeyError: 'value'

(all with the same KeyError in map.pop(attr) for attr in self._construct_from_dict_args, but why is 'value' not present in those cases???)

@dhomeier dhomeier force-pushed the decode-dtype-mixins branch 2 times, most recently from e4500cc to 0333c0f Compare November 22, 2021 16:47
@dhomeier dhomeier marked this pull request as ready for review November 22, 2021 17:19
@dhomeier
Copy link
Contributor Author

Stealing the _represent_as_dict implementation from representations seems to do.

def _represent_as_dict(self, attrs=None):
q = self._parent
out = super()._represent_as_dict(attrs)
out['dtype'] = q.dtype.name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work for structured quantities; do we need to overwrite _represent_as_dict at all?

I actually wonder why dtype is needed at all; in principle, value has the dtype information, so that should be propagated correctly if it isn't already...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. For the structured types, would be good to include #12509.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs some instruction on how to use dtype.name to store the dtype information, apparently. value is simply written as something like
value: !astropy.table.SerializedColumn {name: lon}
so the dtype will have to be passed on to the Quantity constructor through some channel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks, I now see the problem: the .value does have a dtype and that gets used, but then the result is passed through u.Quantity(value, unit) and the dtype is lost at that point. What perhaps we could do is only pass the dtype that actually is converted, i.e.,

if q.dtype.kind in 'iu':
    out['dtype'] = q.dtype.name

This also solves the problem that what you have now would fail for structured dtype (which probably means there should be a test for that...)

Copy link
Contributor Author

@dhomeier dhomeier Nov 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; I thought about something like that for the first implementation to resolve the Time regression either intercepting object or other complex dtypes, or restricting explicitly to integers. The latter choice is probably more prudent; this only leaves the corner case of wrap_angle in Longitude, which is converted to a float64 default; also floats will no longer be returned in native endianness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think currently scalars are treated differently from arrays, as they are typeset as string instead of a binary blob. Not sure if that is worth changing...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. For the structured types, would be good to include #12509.

@dhomeier @mhvk, merged!

@@ -69,7 +69,7 @@ class MyTable(table.Table):
# and masked (MaskedArray) column.


@pytest.fixture(params=['unmasked', 'masked', 'subclass'])
@pytest.fixture(params=['unmasked', 'masked', 'subclass', 'quantity'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a new case to a lot (hundreds?) of tests since this fixture is used a lot. While on principle it never hurts to add more tests, I'm not sure if this is global addition is helping our coverage or just burning CI time and carbon.

Copy link
Contributor Author

@dhomeier dhomeier Nov 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, initially I was only looking for a simple way to add QTable coverage to some tests...
This is adding 339 tests to the existing 2079 tests in table – I found about 0.8 s or 5 % longer runtime for that.
Don't know if that is worth it, or QTable is just not that critical compared to MaskedTable or the subclass.

@mhvk
Copy link
Contributor

mhvk commented Nov 22, 2021

More general comment: there have been quite a few issues about Quantity automatically converting to float, and whether or not that is desired. My sense remains that generally it is desired, as an integer quantity makes little physical sense (apart from spin!). I think there is a lot to be said for avoiding silent truncation errors from, e.g., in-place operations.

Given this, I wonder if it isn't better to use some kind of option for the cases where one really wants to preserve the data type with which the data are stored internally. After all, conversion to float is not that different from applying BZERO and BSCALE to integer image data.

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me apart from the one comment about the conftest update.

@dhomeier
Copy link
Contributor Author

More general comment: there have been quite a few issues about Quantity automatically converting to float, and whether or not that is desired. My sense remains that generally it is desired, as an integer quantity makes little physical sense (apart from spin!). I think there is a lot to be said for avoiding silent truncation errors from, e.g., in-place operations.

Yes, though to me the biggest concern seemed that many in-place operations would simply fail on integer quantities.

Put it another way, operations that may automatically cast to float, in order of decreasing desirability, to me would be

  1. u.Quantity(arr)
  2. QTable(tab), QTable([columns])
  3. QTable.read(file)
  4. QTable(tab, dtype=tab.dtype)
  5. QTable.read(file, do_not_scale_table_data=True)

where the last option would offer a yet to be implemented option analogous to do_not_scale_image_data for ensuring that the data are read in as stored. That was actually a kind of minimal solution I should have pointed out a bit more in the initial comment. But the original specifically issue arose from the large memory footprint of data read in from file, so there should be some form of control at that step.

@mhvk
Copy link
Contributor

mhvk commented Nov 23, 2021

@dhomeier - good to see your summary - I agree with the order, and that (4) and (5) for sure should respect the dtype. I think I'm actually also on-board with (3) , though maybe prefer (5) with the keyword argument to control it (as in the difference between (2) and (4)).

Also, stopping the auto-conversion to float (i.e., (3)) may well break code that relies on QTable doing that...

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 23, 2021

This version implements (3) - (5) now.

There are pros and cons for adding 'f' (and 'c'?) to the dtype kinds to be serialised; the wrap_angle case pointed out above is probably not very relevant, and both keeping read-in float columns in big-endian format and automatically converting them to native has its own points in favour.

Also, stopping the auto-conversion to float (i.e., (3)) may well break code that relies on QTable doing that...

That's probably still worth discussing; I'd think that since creating a QTable with integer quantities already requires some very explicit steps, people would generally be aware of the format. But it might be safer to keep the old behaviour as default.
On adding an option to QTable.read() for controlling this: both outcomes can already be effected relatively easily by forcing conversion with QTable(Table.read(file)) in the current version, or
t = Table.read(file); qt = QTable(t, dtype=t.dtype) if conversion is made the default.

I've checked locally this works with #12509, so this should be safe to rebase once either PR is merged. Need to revert the yaml docstring changes anyway.
Pinging @saimn and @nstarman for opinions.

@nstarman
Copy link
Member

nstarman commented Nov 29, 2021

I've checked locally this works with #12509, so this should be safe to rebase once either PR is merged. Need to revert the yaml docstring changes anyway.

Merged! So good to rebase.

More general comment: there have been quite a few issues about Quantity automatically converting to float, and whether or not that is desired. My sense remains that generally it is desired, as an integer quantity makes little physical sense (apart from spin!). I think there is a lot to be said for avoiding silent truncation errors from, e.g., in-place operations.

I have more mixed feelings about this. When wearing my science hat I agree completely: rarely is the int dtype desirable and if it's necessary should be explicitly stated.

  • stays int: Quantity([1, 2, 3], dtype=int)
  • upcast to float: Quantity([1, 2, 3])

When wearing my comp-sci hat, the thought of auto-upcasting is less attractive. Quantity is an extension of ndarray and this is a departure from their treatment of dtype. At the very least we should more explicitly document this. I don't feel I was sufficiently aware of these issues until reading the discussion in this PR.

Given this, I wonder if it isn't better to use some kind of option for the cases where one really wants to preserve the data type with which the data are stored internally. After all, conversion to float is not that different from applying BZERO and BSCALE to integer image data.

I do think there might be a comfortable middle ground:

For a list x = [1, 2, 3], using the following constructor

dtype= np.array() Quantity() comment
float float float specified
int int int specified
None int float upcast
... TypeError int passed to numpy

The Ellipsis option would essentially mean "pass it on". Numpy would be the end-point here, so a Longitude(x, dtype=...) -> Angle -> SpecificTypeQuantity -> Quantity -> ndarray, and only all the Quantity subclasses understand to delegate dtype to super, until it hits numpy. When passed to numpy the Ellipsis would be changed to None to use numpy's dtype detection machinery.

The advantage of this approach is that there is a simple means of specifying the dtype without knowing what it is. It's essentially equivalent to

x = [1, 2, 3]
arr = np.array(x)
q = Quantity(arr, dtype=arr.dtype)

but allows the user to skip the explicit construction of an intermediate array.

If we apply this consistently, I hope it can solve (or at least alleviate) @dhomeier's list of upcast situations (#12505 (comment)). 🤞
Each class can have appropriate (& well documented) default rules for upcasting, and allow for an easy override to force the dtypes to be kept unchanged.

@dhomeier
Copy link
Contributor Author

Rebased (it works!) to get the discussion going again.

When wearing my comp-sci hat, the thought of auto-upcasting is less attractive. Quantity is an extension of ndarray and this is a departure from their treatment of dtype. At the very least we should more explicitly document this. I don't feel I was sufficiently aware of these issues until reading the discussion in this PR.

The unexpected upcasting was also the initial issue triggering this, so the use case is certainly relevant enough.
I do quite like that proposal @nstarman!

The Ellipsis option would essentially mean "pass it on". Numpy would be the end-point here, so a Longitude(x, dtype=...) -> Angle -> SpecificTypeQuantity -> Quantity -> ndarray, and only all the Quantity subclasses understand to delegate dtype to super, until it hits numpy. When passed to numpy the Ellipsis would be changed to None to use numpy's dtype detection machinery.

The advantage of this approach is that there is a simple means of specifying the dtype without knowing what it is. It's essentially equivalent to

x = [1, 2, 3]
arr = np.array(x)
q = Quantity(arr, dtype=arr.dtype)

but allows the user to skip the explicit construction of an intermediate array.

Just wondering how many in places code needs to be changed to handle the dtype='...' setting...

And for Quantity that would need a decision whether '...' should become the new default (still a change in behaviour), or keep that at None+upcast. But even being able to simply call Quantity(array_like, dtype='...') without having to bother what the actual dtype is would be a great advantage IMO.

@github-actions
Copy link

I'm going to close this pull request as per my previous message. If you think what is being added/fixed here is still important, please remember to open an issue to keep track of it. Thanks!

If this is the first time I am commenting on this issue, or if you believe I closed this issue incorrectly, please report this here.

@hamogu
Copy link
Member

hamogu commented Oct 16, 2023

One point from #15447 which I opened and then learned that this PR already exists: I'm looking at specific fits tables that are the standard for all of X-ray astronomy, a so-called ARF/RMF/PHA files. All three of those file types are tables where most columns have good, physical fits units, e.g. keV or counts or counts/s or cm**2; thus I'd like to read those files as a QTable. However, there are other columns that need to remain integers; in particular the "channel". A channel is by definition a discrete quantity and downstream software will fail with floating point numbers of channels; also channel numbers are frequently used as indexes into the other vector-valued columns (e.g. in an RMF, there a "response matrix". Since this matrix is very sparse, it's not stored in a rectangular form, but instead only the non-zero elements of a row are stored in the file and other columns give the integer indices for that, e.g. a row in the matrix might be non-zero for elements 567 to 681, and then there would be a column that says N_start=567 - obviously, that column should be an integer).

Without explaining the format in more detail here, there is no way to represent this as a QTable without the ability to have some of the columns as an int.

@mhvk
Copy link
Contributor

mhvk commented Oct 16, 2023

@hamogu - yes, this PR was a good one that we should try to revive.

One question: does the channel column have a unit?

Also, for a possible rebase: what @nstarman ended up implementing was that u.Quantity(..., dtype=None) is what is needed to avoid the upcast (not the ... mentioned in the comments above; the change that we ended up with is to change the default to dtype=np.inexact to indicate auto-convert to float).

p.s. Apologies, but I won't be able to spend time here before feature-freeze: I'm stuck writing my next 5-year research grant until Nov 1st.

@hamogu
Copy link
Member

hamogu commented Oct 16, 2023

It has a unit set in the fits header that does not parse with astropy, probably because it's not part of the main fits standard, but only defined by NASA's HEASARC fits working group (formerly "Office of guest investigator programs") (e.g. https://heasarc.gsfc.nasa.gov/docs/heasarc/ofwg/docs/spectra/ogip_92_007/node7.html for PHA type I files)

WARNING: UnitsWarning: 'channel' did not parse as fits unit: At col 0, Unit 'channel' not supported by the FITS standard.

(Links given for future reference, not because it's needed to follow all of them to understand what I need Quantity to do to make this work.)

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked briefly at this PR again and am not sure it is quite complete: missing is some kind of option to tell QTable.read() to use it. Indeed, it may be that just adjusting the reading is the right thing to do.

Nevertheless a few comments on what is here too.

@@ -3948,6 +3948,13 @@ class QTable(Table):

"""

def __init__(self, data=None, masked=False, names=None, dtype=None, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick look through the PR: here, I think one could write quantity_dtype=np.inexact - and then allow the option None.
We cannot use dtype since that is supposed to give the separate dtype for every column.

try:
qcol = q_cls(col.data, col.unit, copy=False, subok=True)
qcol = q_cls(col.data, col.unit, copy=False, subok=True, dtype=dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we can then just pass on dtype=self._quantity_dtype

cls = t.ColumnClass.__name__
assert np.all(tinfo['class'] == [cls, cls, cls, cls, 'Time', 'SkyCoord'])
# QTable does not preserve 'quantity' description - should it?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should...

@dhomeier
Copy link
Contributor Author

Just rebasing this with the black/merge conflicts resolved; I'll need more time to look into @mhvk's last round of comments, too.

@saimn saimn modified the milestones: v6.0, v6.1 Nov 3, 2023
@astrofrog astrofrog modified the milestones: v6.1.0, v7.0.0 Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants