Support record array columns in astropy.table.Table #3759

mdboom · 2015-05-07T15:11:55Z

In the interest of generality over in ASDF (see asdf-format/asdf-standard#83), I thought it would be worth supporting table columns where the data type is a record array. This is not the same as nested Tables, obviously, which would be quite a bit more work, but it seemed like only a couple of minor changes were necessary to have a Column dtype like "<i4,S1".

mdboom · 2015-05-11T15:05:51Z

@taldcroft: Any objections to this in principle?

taldcroft · 2015-05-11T15:32:21Z

astropy/table/table.py

@@ -44,7 +44,7 @@ def descr(col):
    This returns a 3-tuple (name, type, shape) that can always be
    used in a structured array dtype definition.
    """
-    col_dtype_str = col.dtype.str if hasattr(col, 'dtype') else 'O'
+    col_dtype_str = col.dtype if hasattr(col, 'dtype') else 'O'


Just a niggle, but the var name should probably be called col_dtype now to correspond to the actual contents.

taldcroft · 2015-05-11T15:37:26Z

@mdboom - sorry, I looked at this a few days ago and was going to say "great!".

The initial questions I had were:

Is the table representation OK with these structured-array columns?
What happens if you try to use one as a key column in a high-level operation?
Some basic tests of slicing and element access would be good. Not likely to be a problem but...

taldcroft · 2015-05-11T15:53:44Z

One idea which would be to add a recarray column into the list of MIXIN_COLS here. Sort of re-purpose the mixin tests to get better coverage of basic table interface functionality. There might be some things that don't initially pass but you might special-case them in the tests. Anyway, just a thought.

taldcroft · 2015-05-11T16:06:23Z

astropy/table/tests/test_masked.py

@@ -19,6 +19,8 @@ def setup_method(self, method):
        self.c = MaskedColumn(name='c', data=[7, 8, 9], mask=False)
        self.d_mask = np.array([False, True, False])
        self.d = MaskedColumn(name='d', data=[7, 8, 7], mask=self.d_mask)
+        self.e = MaskedColumn(name='e', data=[(7, 'a'), (8, 'b'), (7, 'c')],
+                              mask=[[False, True], [True, False], [False, False]])


Do you need a dtype=.. here? The dtype that comes out seems to be string:

In [20]: e = MaskedColumn(name='e', data=[(7, 'a'), (8, 'b'), (7, 'c')], mask=[[False, True], [True, False], [False, False]]) In [21]: e Out[21]: <MaskedColumn name='e' dtype='string168' shape=(2,) length=3> 7 .. -- -- .. b 7 .. c

Oops. Good catch.

mdboom · 2015-05-11T17:58:52Z

I suspect table reading/writing with io.ascii may also not work (though strictly speaking it doesn't have to for this to be useful).

taldcroft · 2015-05-12T15:22:04Z

@mdboom - I've been playing around with additional tests etc but got pulled away. I will try to have a PR for your branch this week.

embray · 2015-05-12T16:09:47Z

I guess we may have to put something in the io.ascii I/O connector for now to prevent writing such tables. I could imagine supporting this though in ECSV at the very least. This is a great thing to support, in principle though.

hamogu · 2015-05-13T12:38:26Z

May I ask what the use case for this feature is?
I looked at the ADSF thread, where this seems to be motivated by (RA, DEC) pairs. However, the proper way to express that in astropy should be a column of coordinates objects. Another thing I can come up with are tuples of (value, error, unit, literature reference). However, in that case I would strongly argue that that should be expressed as four separate columns and not a a single column of record arrays, so that I can actually use that in calculations as tab['V_mag'] + 2.7 * tab['V_mag_err'] < upper_limit.

[\rant on]
I'm not so much opposed to adding features in astropy.tables in general, but every now and then we have to step back and check that this actually adds value for our users. I bet that > 90 % of all users would never miss having record array columns (particularly since mixin columns cover special use cases like times and coordinates) and 90 % or the remaining 10% look for record arrays, but would be better served using several different single-valued columns.

Is that really worth the added code complexity, added test cases, added developer time for maintenance and, most importantly, added complexity that the user sees when he/she reads te documentation?

astropy.table started of as a data table, then developed into a database system (merge, join) and is now on its way to become a database development framework (mixin columns, arbitrary data types).
Do we really, really want that? Is that serving the astronomical community best with our limited resources at a time, when e.g. we cannot even plot a spectrum properly?
[\rant off]

@mdboom: Sorry, feel free to ignore the rant. This is not particularly directed against your specific feature; I'm just warning of feature-creep in general.

mdboom · 2015-05-13T13:41:33Z

This is a way forward to serialize the kind of mixin columns (coordinates etc.) you describe to/from disk in a packed binary format.

I wouldn't necessarily point to this as an example of feature creep -- outside of the testing and the pretty-printing, this is a 3-line bugfix to use numpy to properly use its generality.

taldcroft · 2015-05-13T16:55:47Z

@hamogu - the biggest long-term worry about feature creep is that it grows the code complexity to the point where it becomes unmaintainable. So we need to manage that carefully, and I don't think this change impacts overall complexity.

(But I do note that this PR may be currently incomplete in terms of preventing / warning the user from doing things like writing an ASCII table that won't work quite as expected, but that's something I'm still reviewing.)

The other worry about feature-creep is draining resources, as you mentioned. That's a valid point, but with open-source software you often get features in an area where people are excited to develop. I'm excited about table functionality (and we have a GSoC student that is similarly motivated). Alas I don't personally care about plotting a spectrum. 😄

taldcroft · 2015-05-17T18:04:43Z

@mdboom - check out the taldcroft/table-recarray-columns branch. This has a commit that shows a few things that fail in ways that users would likely find confusing. So the easiest thing in the short term would just be catching these cases and raising informative errors. The harder thing is to make them work.

BTW, this relies on a change that is outside the scope of this PR, namely allowing initialization of Table with a Column object that doesn't have a defined name. I don't honestly remember why we put in this limitation because it is just a 2-word fix or def_name. This makes Column behave more like an ndarray or any other table input that doesn't require a name.

taldcroft · 2015-05-17T18:05:52Z

(So I'm likely going to put in a PR to make the or def_name change and see if that flies with the gang).

mdboom · 2015-05-18T14:43:50Z

Thanks for drilling down on the tests. I see three classes of failures on taldcroft/table-recarray-columns:

Reading/writing these things. Of course, the failure suggests a possible solution. Would it be better to implement recarray columns as mixins? Then the readers/writers could continue to reject them as they already do. (Of course, speaking without a deep understanding of mixin columns and how they are implemented).

>           assert 'cannot write table with mixin column(s)' in str(err.value)
E           assert 'cannot write table with mixin column(s)' in "Illegal format [('f0', '<i4'), ('f1', 'S1')]."
E            +  where "Illegal format [('f0', '<i4'), ('f1', 'S1')]." = str(ValueError("Illegal format [('f0', '<i4'), ('f1', 'S1')].",))
E            +    where ValueError("Illegal format [('f0', '<i4'), ('f1', 'S1')].",) = <ExceptionInfo ValueError tblen=10>.value

Masked array stuff. Numpy masked arrays with record arrays is just really messy in Numpy itself. There's a whole bunch of hacks around this in astropy.io.votable, none of which are terribly great -- mainly involving shoving arrays inside object arrays so that the mask can match the data array. That's getting deep down in a pretty remote corner case, so I'm just as happy to exclude it. Is it the case that masked columns can be mixed with non-masked columns in the same table? If so, I think this limitation is not terribly limiting.
The "nameless" column initializer stuff you mention. I'm not quite following why supporting nameless columns is necessary wrt this feature...

taldcroft · 2015-05-18T18:27:13Z

Implementing this as a mixin column is an interesting idea. That would effectively solve the first point about problems in read/write. For masked arrays the current mixin implementation has a pseudo-mask which is a 1-d mask that is there for API compatibility but which raises an exception if any element is set to True. So again this would only allow operations that are known to work or else raise an informative exception.

The downside is that in order to make it a mixin you need to be able to set attributes, which (I think) means taking the input ndarray column and making an ndarray-subclass that has the _astropy_column_attrs attribute. I think this can work but it's a bigger footprint. Wherever an ndarray is used in creating a new column then you need to check if it's a structured array first and make a new view as the subclass.

In the end it might be easier to make a new Table method has_structured_columns that gets used to check for method calls that are not supported.

The business about the nameless columns was mostly an artifact of a particular mixin test, so you're right that it is not really directly relevant here. Anyway though, #3781 is merged.

taldcroft · 2015-05-21T11:58:51Z

@mdboom - here is a proof of concept for viewing any structured ndarray as a mixin column. It works for the simple test of adding a structured array column and confirming that it is indeed a mixin. Unfortunately it fails 21 tests in table. I don't have time today to look further and see if it is simple or a deep problem with the new failing tests.

taldcroft@56f44a7

mdboom · 2015-05-21T13:23:40Z

Thanks, @taldcroft. No rush on this -- I'll let you know if I find the time to look into this and find anything.

mhvk · 2015-05-21T19:01:15Z

I share a bit of @hamogu's concern, especially with Table originally having been a recarray, and with me not seeing that much use for it. But it does seem this PR is particularly simple and it moves somewhat closer to a state where Table will just take anything that has a length and is subscriptable. So, still +1 from me. (And I also am quite happy with astropy providing useful infrastructure; I'm not sure plotting a spectrum necessarily part of that (though ensuring that units show up in labels might well be).

mdboom · 2015-06-03T14:55:19Z

@taldcroft: Stepped away from this for a while. Have you done any work on the remaining details here, or should I? (No problem either way, just don't want to duplicate work if possible).

taldcroft · 2015-06-03T16:45:31Z

@mdboom - I never got a chance to dig into this and see why so many tests were failing. If you have time to take a look and see if you understand the problem that would be good. It might be worth trying this in the context of #3731, which is a fairly significant overhaul of how mixins are handled. I'm hoping that this will get merged in a week (or two at the latest).

taldcroft · 2015-06-12T21:37:29Z

@mdboom - I took a few minutes to rebase my table-recarray-columns branch on #3731, which I'm praying will be merged soon. There is the philosophical question of whether to support this capability via mixins, but regardless I want to see if I can get this working as a POC for real-world use of mixins.

taldcroft · 2015-06-17T13:09:17Z

@mdboom @mhvk - I have something which supports using an arbitrary ndarray as a Mixin column and passes tests. See: master...taldcroft:table-recarray-columns

This is a mash of the original implementation by @mdboom and mixins. Note that the fixes that @mdboom made related to pprint'ing the dtype should all be moved into NdarrayMixin now (as the default_format method of a new NdarrayInfo class) but I haven't done that yet. It also includes a new subclass MixinInfo that distinguishes a Table mixin info class from the base DataInfo.

So the question now is whether this is the right direction. I think so, but welcome discussion. This is a variation on another useful trick, namely embedding a Table as a column. Having the info object to control representation in the table is crucial here.

mdboom · 2015-06-17T15:02:20Z

Your branch seems to make sense, and I think it clearly better than what's in this PR.

mhvk · 2015-06-17T15:27:10Z

@taldcroft - i like the idea too, and the power of info is really becoming obvious. I hope that soon we can add a method that, e.g., produces a fits table column!

But your PR does pose a question about direction, since in some way Column is not that different from NdarrayMixin, except that it provides more initialisation options and direct access to info attributes. Is the idea to let Column perhaps be a subclass of NdarrayMixin eventually? (and have a MaskedArrayInfo? For those, we can stick info in directly...)

More specifically: I'm a bit confused about the new MixinInfo. Is there a specific reason not to keep DataInfo the one that is used? I like the name better... Maybe rename DataInfo to just Info (or BaseInfo?), and MixinInfo to DataInfo?

taldcroft · 2015-06-17T15:56:53Z

Is there a specific reason not to keep DataInfo the one that is used?

This was driven by the need for a parent_table attribute in info for table columns, while the general DataInfo does not have a parent table. Originally I just accepted that DataInfo had some residual connection to Table columns, but there was a problem that came up in testing that required the parent_table attribute to be handled specially and not be copied when an Info object is copied. Hence the _attrs_no_copy addition.

I think that MixinInfo is appropriate since it specifically targets objects that are going to be a mixin column in Table. In some sense that should be the test for whether an object is (or should be) a mixin column. (notwithstanding the special case of Quantity).

I'm not sure of the real benefit to making Column eventually be a subclass of NdarrayMixin. The key thing to remember is that Column and its subclasses have a larger set of guaranteed behaviors than a mixin. We may eventually have a hierarchy of mixin types where some have more compliance with the full Column behaviors. For instance Quantity supports insertion while Time does not. The nice thing about the info object is that we can now use the class of that to know these things. So ColumnInfo would be at the top of this chain as a type that supports all behaviors.

Having said all this, having a big picture that works toward convergence of Column and mixins is a good thing to make things more and more seamless.

taldcroft · 2015-06-17T16:01:20Z

As a side note, it occurred to me that an object with a MixinInfo info attribute can change its repr based on whether it is part of a table or not. You know that if its parent_table is not None. I'm not necessarily saying that we want to do that, but it would certainly make mixin columns seem more like regular columns if they printed out as a vertical column when part of a table. With any kind of slicing or copying the parent_table goes away.

mhvk · 2015-06-17T16:16:44Z

Am not completely sold on the names still, but it no big deal, and I like the noting that info helps determine what exactly a column can do in a table (and since this is table-specific, the mixin nature is more logical).

On the repr -- we've started already a bit that way by using default_format to remove the unit from a Quantity, which will only happen if it is in a table, and it may be possible to extend this (though, like you, I'm not sure we'd necessarily want to).

But, as said, what I really like is that we now can start using info to help interact with io.fits, io.hdf5, etc.!

taldcroft · 2015-06-26T18:51:04Z

@mdboom - I'll take this over now with my implementation. We should leave this open until I get some new PR's ready. My current branch is a messy mix right now.

taldcroft · 2015-07-08T21:52:20Z

Superceded and closed by #3925.

mdboom added the table label May 7, 2015

mdboom assigned taldcroft May 7, 2015

mdboom added this to the v1.1.0 milestone May 7, 2015

mdboom mentioned this pull request May 7, 2015

Add table support asdf-format/asdf#123

Merged

taldcroft reviewed May 11, 2015
View reviewed changes

embray added Affects-release Enhancement labels May 12, 2015

taldcroft mentioned this pull request May 17, 2015

Allow initialization of Table from Column with no name #3781

Merged

mdboom added 4 commits June 3, 2015 11:03

Support record array columns in astropy.table.Table

f1854da

Add CHANGES entry.

a596448

Improve tests

b1417f9

Pretty-print recarray dtypes

0884543

mdboom added 4 commits June 3, 2015 11:03

Fix variable name

cd84d03

Fix test

422524d

Fix Python 3

9c6e7bf

Fix Windows test

93dc14c

mdboom force-pushed the table/recarray-columns branch from cbbbae2 to 93dc14c Compare June 3, 2015 15:04

This was referenced Jul 7, 2015

Follow-on to mixin changes - tidying and simplifying #3898

Merged

Support structured array, recarray, or just ndarray as mixin table columns #3925

Merged

taldcroft closed this Jul 8, 2015

mhvk mentioned this pull request Dec 30, 2021

Serialized structured column via SerializedColumn #12644

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support record array columns in astropy.table.Table #3759

Support record array columns in astropy.table.Table #3759

mdboom commented May 7, 2015

mdboom commented May 11, 2015

taldcroft May 11, 2015

taldcroft commented May 11, 2015

taldcroft commented May 11, 2015

taldcroft May 11, 2015

mdboom May 11, 2015

mdboom commented May 11, 2015

taldcroft commented May 12, 2015

embray commented May 12, 2015

hamogu commented May 13, 2015

mdboom commented May 13, 2015

taldcroft commented May 13, 2015

taldcroft commented May 17, 2015

taldcroft commented May 17, 2015

mdboom commented May 18, 2015

taldcroft commented May 18, 2015

taldcroft commented May 21, 2015

mdboom commented May 21, 2015

mhvk commented May 21, 2015

mdboom commented Jun 3, 2015

taldcroft commented Jun 3, 2015

taldcroft commented Jun 12, 2015

taldcroft commented Jun 17, 2015

mdboom commented Jun 17, 2015

mhvk commented Jun 17, 2015

taldcroft commented Jun 17, 2015

taldcroft commented Jun 17, 2015

mhvk commented Jun 17, 2015

taldcroft commented Jun 26, 2015

taldcroft commented Jul 8, 2015

Support record array columns in astropy.table.Table #3759

Support record array columns in astropy.table.Table #3759

Conversation

mdboom commented May 7, 2015

mdboom commented May 11, 2015

taldcroft May 11, 2015

Choose a reason for hiding this comment

taldcroft commented May 11, 2015

taldcroft commented May 11, 2015

taldcroft May 11, 2015

Choose a reason for hiding this comment

mdboom May 11, 2015

Choose a reason for hiding this comment

mdboom commented May 11, 2015

taldcroft commented May 12, 2015

embray commented May 12, 2015

hamogu commented May 13, 2015

mdboom commented May 13, 2015

taldcroft commented May 13, 2015

taldcroft commented May 17, 2015

taldcroft commented May 17, 2015

mdboom commented May 18, 2015

taldcroft commented May 18, 2015

taldcroft commented May 21, 2015

mdboom commented May 21, 2015

mhvk commented May 21, 2015

mdboom commented Jun 3, 2015

taldcroft commented Jun 3, 2015

taldcroft commented Jun 12, 2015

taldcroft commented Jun 17, 2015

mdboom commented Jun 17, 2015

mhvk commented Jun 17, 2015

taldcroft commented Jun 17, 2015

taldcroft commented Jun 17, 2015

mhvk commented Jun 17, 2015

taldcroft commented Jun 26, 2015

taldcroft commented Jul 8, 2015