Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table cannot read numpy void dtype #12577

Closed
nstarman opened this issue Dec 9, 2021 · 16 comments
Closed

Table cannot read numpy void dtype #12577

nstarman opened this issue Dec 9, 2021 · 16 comments

Comments

@nstarman
Copy link
Member

nstarman commented Dec 9, 2021

Description

Cannot read a Table from a file with a numpy.void dtype column.

Expected behavior

Roundtripping for all numpy array types.

Actual behavior

A ValueError is raised that void is not a recognized dtype

Steps to Reproduce

import tempfile
import os

with tempfile.TemporaryDirectory() as tempdir:

    tbl = QTable({"a": [u.Quantity((0, 0, 0.6), unit="(eV, eV, eV)")]})
    tbl.write(os.path.join(tempdir, "temp.ecsv"))

    QTable.read(os.path.join(tempdir, "temp.ecsv"))

ValueError: datatype 'void192' of column 'a' is not in allowed values ('bool', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64', 'float128', 'string')

System Details

macOS-10.16-x86_64-i386-64bit
Python 3.9.5 (default, May 18 2021, 12:31:01)
[Clang 10.0.0 ]
Numpy 1.21.4
pyerfa 2.0.0.1
astropy 5.1.dev207+gbfb9252df.d20211130
Scipy 1.7.1
Matplotlib 3.3.4

@nstarman
Copy link
Member Author

nstarman commented Dec 9, 2021

@mhvk, I'm not sure if ECSV is saving enough info to reconstruct the full dtype. It might be, because the type isn't just void but void192, but I'm not sure what's the difference...

@mhvk
Copy link
Contributor

mhvk commented Dec 9, 2021

Yikes, I thought we had just solved that... Though it may just be to include 'void' in the first place in that list (which I'm guessing is not in the same io.misc.yaml place... but I haven't looked yet)

@nstarman
Copy link
Member Author

Unfortunately it doesn't appear that easy a fix. Adding void or void192 to ECSV_DATATYPES in ecsv.py does not fix the problem.

@mhvk
Copy link
Contributor

mhvk commented Dec 10, 2021

Probably good to get @taldcroft to look at this case. Let me also give a more minimal example, since it is not related to Quantity but just to the structured dtype (I add a string column as that is special-cased as well).

import numpy as np
from astropy.table import Table, Column

tbl = Table({"a": Column([(0, 0, 0.6)], dtype='f8,f8,f8'), "b": ['abcdef']})
tbl.write('temp.ecsv', overwrite=True)
Table.read('temp.ecsv')
ValueError: datatype 'void192' of column 'a' is not in allowed values ('bool', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64', 'float128', 'string')

!cat temp.ecsv
# %ECSV 1.0
# ---
# datatype:
# - {name: a, datatype: void192}
# - {name: b, datatype: string}
# schema: astropy-2.0
a b
"(0., 0., 0.6)" abcdef

So, I think the basic first step is to represent the data type correctly, as a structured one (which of course we just did, but may need to be careful about strings inside structured data types...)

@taldcroft
Copy link
Member

There is no support in ECSV for structured data like dtype='f8,f8,f8'. That would be a whole new can of worms after the extension to allow multidimensional data. I suppose one could imagine subtype: 'float64,float64,int32' or similar to map to the structured dtype. (See https://github.com/astropy/astropy-APEs/blob/main/APE6.rst#multidimensional-array-data).

I think for now this could be called a bug in the ECSV writer that it doesn't detect such a datatype and stop with a good exception message.

@taldcroft taldcroft added io.ascii and removed table labels Dec 10, 2021
@mhvk
Copy link
Contributor

mhvk commented Dec 10, 2021

Indeed, I think that subtype trick could work! Though it would loose the names of the structure elements.

Alternatively, and perhaps easier, would treat this case more like a Representation and just write out individual columns according to the names of the dtype - dealing with it would then become something for ColumnInfo and QuantityInfo, and we don't need to change the ECSV standard.

@nstarman
Copy link
Member Author

nstarman commented Dec 11, 2021

@taldcroft @mhvk, I think the Representation approach may be the optimal route since a structured dtype can mix scalars and non-scalars, and the latter would require ECSV's multidimensional capability

>>> x = np.array([(1, np.ones((2, 2))), (2, np.zeros((2, 2)))],
...              dtype=[("a", "i4"), ("b", "float64", (2, 2))])
>>> x["b"]
array([[[1., 1.],
        [1., 1.]],

       [[0., 0.],
        [0., 0.]]])

Making a problematic Table

>>> QTable({"name": [x[0][()]]})
<QTable length=1>
          name            
         void288         
-------------------------
(1, [[1., 1.], [1., 1.]])

Saving similarly to Representation, the void can be saved to look something approximately like this:

>>> print(repr(QTable(x)[0]))
<Row index=0>
  a    b [2,2]  
int32  float64  
----- ----------
    1 1.0 .. 1.0

though of course be read back to look like the void Table, above.

@nstarman
Copy link
Member Author

nstarman commented Dec 11, 2021

Continuing the above comment, building support for this in ColumnInfo (or wherever appropriate) would presumably serve as the basis / base class for saving all structured data, like Representation.

Alternatively, I recall there was discussion in an old Issue/PR of allowing Table nesting. Would that be a similarly viable solution?
Found it: #3963

@mhvk
Copy link
Contributor

mhvk commented Dec 11, 2021

I'm playing around with ColumnInfo - I think that can work. An additional benefit is that one gets writing of structured data for FITS, etc., for free.

@mhvk
Copy link
Contributor

mhvk commented Dec 13, 2021

See #12589 for an idea.

@mhvk
Copy link
Contributor

mhvk commented Dec 13, 2021

Am reconsidering the approach. Right now, the following actually works!

In [19]: q = u.Quantity([(0.1, 0.3, 0.6)], dtype='f8,f8,f8', unit='m,cm,s')

In [20]: QTable(q)
Out[20]: 
<QTable length=1>
   f0      f1      f2  
   m       cm      s   
float64 float64 float64
------- ------- -------
    0.1     0.3     0.6

In [21]: u.Quantity(QTable(q))
Out[21]: <Quantity [(0.1, 0.3, 0.6)]>

@taldcroft
Copy link
Member

Am reconsidering the approach. Right now, the following actually works!

Humm. I'm not sure if I think that is a feature or a bug. The result is surprising to me, somewhat like how people are often surprised that list('abc') gives ['a', 'b', 'c']. But I guess there is some precedent since you can put in a 3x2 float array as the arg to Table and get some columns.

(I was wondering what Table('abc') would give, and it does the right thing of ValueError: Data type <class 'str'> not allowed to init Table.)

@mhvk
Copy link
Contributor

mhvk commented Dec 13, 2021

The direction from Table to Column makes some sense - it calls table.__array__() (EDIT: that is almost certainly why the unit is lost). The other direction, there is a specific stanza:

elif isinstance(data, np.ndarray):
if data.dtype.names:
init_func = self._init_from_ndarray # _struct
n_cols = len(data.dtype.names)
default_names = data.dtype.names

@nstarman
Copy link
Member Author

@mhvk, will #12589 close this Issue?

@mhvk
Copy link
Contributor

mhvk commented Dec 15, 2021

Yes, it should - but need to decide on implementation!

@nstarman
Copy link
Member Author

This appears to have been fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants