Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support record array columns in astropy.table.Table #3759

Closed
wants to merge 8 commits into from

Conversation

mdboom
Copy link
Contributor

@mdboom mdboom commented May 7, 2015

In the interest of generality over in ASDF (see asdf-format/asdf-standard#83), I thought it would be worth supporting table columns where the data type is a record array. This is not the same as nested Tables, obviously, which would be quite a bit more work, but it seemed like only a couple of minor changes were necessary to have a Column dtype like "<i4,S1".

@mdboom mdboom added the table label May 7, 2015
@mdboom mdboom added this to the v1.1.0 milestone May 7, 2015
@mdboom
Copy link
Contributor Author

mdboom commented May 11, 2015

@taldcroft: Any objections to this in principle?

@@ -44,7 +44,7 @@ def descr(col):
This returns a 3-tuple (name, type, shape) that can always be
used in a structured array dtype definition.
"""
col_dtype_str = col.dtype.str if hasattr(col, 'dtype') else 'O'
col_dtype_str = col.dtype if hasattr(col, 'dtype') else 'O'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a niggle, but the var name should probably be called col_dtype now to correspond to the actual contents.

@taldcroft
Copy link
Member

@mdboom - sorry, I looked at this a few days ago and was going to say "great!".

The initial questions I had were:

  • Is the table representation OK with these structured-array columns?
  • What happens if you try to use one as a key column in a high-level operation?
  • Some basic tests of slicing and element access would be good. Not likely to be a problem but...

@taldcroft
Copy link
Member

One idea which would be to add a recarray column into the list of MIXIN_COLS here. Sort of re-purpose the mixin tests to get better coverage of basic table interface functionality. There might be some things that don't initially pass but you might special-case them in the tests. Anyway, just a thought.

@@ -19,6 +19,8 @@ def setup_method(self, method):
self.c = MaskedColumn(name='c', data=[7, 8, 9], mask=False)
self.d_mask = np.array([False, True, False])
self.d = MaskedColumn(name='d', data=[7, 8, 7], mask=self.d_mask)
self.e = MaskedColumn(name='e', data=[(7, 'a'), (8, 'b'), (7, 'c')],
mask=[[False, True], [True, False], [False, False]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need a dtype=.. here? The dtype that comes out seems to be string:

In [20]: e = MaskedColumn(name='e', data=[(7, 'a'), (8, 'b'), (7, 'c')],                             
                              mask=[[False, True], [True, False], [False, False]])

In [21]: e
Out[21]: 
<MaskedColumn name='e' dtype='string168' shape=(2,) length=3>
7 .. --
-- .. b
 7 .. c

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Good catch.

@mdboom
Copy link
Contributor Author

mdboom commented May 11, 2015

I suspect table reading/writing with io.ascii may also not work (though strictly speaking it doesn't have to for this to be useful).

@taldcroft
Copy link
Member

@mdboom - I've been playing around with additional tests etc but got pulled away. I will try to have a PR for your branch this week.

@embray
Copy link
Member

embray commented May 12, 2015

I guess we may have to put something in the io.ascii I/O connector for now to prevent writing such tables. I could imagine supporting this though in ECSV at the very least. This is a great thing to support, in principle though.

@hamogu
Copy link
Member

hamogu commented May 13, 2015

May I ask what the use case for this feature is?
I looked at the ADSF thread, where this seems to be motivated by (RA, DEC) pairs. However, the proper way to express that in astropy should be a column of coordinates objects. Another thing I can come up with are tuples of (value, error, unit, literature reference). However, in that case I would strongly argue that that should be expressed as four separate columns and not a a single column of record arrays, so that I can actually use that in calculations as tab['V_mag'] + 2.7 * tab['V_mag_err'] < upper_limit.

[\rant on]
I'm not so much opposed to adding features in astropy.tables in general, but every now and then we have to step back and check that this actually adds value for our users. I bet that > 90 % of all users would never miss having record array columns (particularly since mixin columns cover special use cases like times and coordinates) and 90 % or the remaining 10% look for record arrays, but would be better served using several different single-valued columns.

Is that really worth the added code complexity, added test cases, added developer time for maintenance and, most importantly, added complexity that the user sees when he/she reads te documentation?

astropy.table started of as a data table, then developed into a database system (merge, join) and is now on its way to become a database development framework (mixin columns, arbitrary data types).
Do we really, really want that? Is that serving the astronomical community best with our limited resources at a time, when e.g. we cannot even plot a spectrum properly?
[\rant off]

@mdboom: Sorry, feel free to ignore the rant. This is not particularly directed against your specific feature; I'm just warning of feature-creep in general.

@mdboom
Copy link
Contributor Author

mdboom commented May 13, 2015

This is a way forward to serialize the kind of mixin columns (coordinates etc.) you describe to/from disk in a packed binary format.

I wouldn't necessarily point to this as an example of feature creep -- outside of the testing and the pretty-printing, this is a 3-line bugfix to use numpy to properly use its generality.

@taldcroft
Copy link
Member

@hamogu - the biggest long-term worry about feature creep is that it grows the code complexity to the point where it becomes unmaintainable. So we need to manage that carefully, and I don't think this change impacts overall complexity.

(But I do note that this PR may be currently incomplete in terms of preventing / warning the user from doing things like writing an ASCII table that won't work quite as expected, but that's something I'm still reviewing.)

The other worry about feature-creep is draining resources, as you mentioned. That's a valid point, but with open-source software you often get features in an area where people are excited to develop. I'm excited about table functionality (and we have a GSoC student that is similarly motivated). Alas I don't personally care about plotting a spectrum. 😄

@taldcroft
Copy link
Member

@mdboom - check out the taldcroft/table-recarray-columns branch. This has a commit that shows a few things that fail in ways that users would likely find confusing. So the easiest thing in the short term would just be catching these cases and raising informative errors. The harder thing is to make them work.

BTW, this relies on a change that is outside the scope of this PR, namely allowing initialization of Table with a Column object that doesn't have a defined name. I don't honestly remember why we put in this limitation because it is just a 2-word fix or def_name. This makes Column behave more like an ndarray or any other table input that doesn't require a name.

@taldcroft
Copy link
Member

(So I'm likely going to put in a PR to make the or def_name change and see if that flies with the gang).

@mdboom
Copy link
Contributor Author

mdboom commented May 18, 2015

Thanks for drilling down on the tests. I see three classes of failures on taldcroft/table-recarray-columns:

  1. Reading/writing these things. Of course, the failure suggests a possible solution. Would it be better to implement recarray columns as mixins? Then the readers/writers could continue to reject them as they already do. (Of course, speaking without a deep understanding of mixin columns and how they are implemented).
>           assert 'cannot write table with mixin column(s)' in str(err.value)
E           assert 'cannot write table with mixin column(s)' in "Illegal format [('f0', '<i4'), ('f1', 'S1')]."
E            +  where "Illegal format [('f0', '<i4'), ('f1', 'S1')]." = str(ValueError("Illegal format [('f0', '<i4'), ('f1', 'S1')].",))
E            +    where ValueError("Illegal format [('f0', '<i4'), ('f1', 'S1')].",) = <ExceptionInfo ValueError tblen=10>.value
  1. Masked array stuff. Numpy masked arrays with record arrays is just really messy in Numpy itself. There's a whole bunch of hacks around this in astropy.io.votable, none of which are terribly great -- mainly involving shoving arrays inside object arrays so that the mask can match the data array. That's getting deep down in a pretty remote corner case, so I'm just as happy to exclude it. Is it the case that masked columns can be mixed with non-masked columns in the same table? If so, I think this limitation is not terribly limiting.
  2. The "nameless" column initializer stuff you mention. I'm not quite following why supporting nameless columns is necessary wrt this feature...

@taldcroft
Copy link
Member

Implementing this as a mixin column is an interesting idea. That would effectively solve the first point about problems in read/write. For masked arrays the current mixin implementation has a pseudo-mask which is a 1-d mask that is there for API compatibility but which raises an exception if any element is set to True. So again this would only allow operations that are known to work or else raise an informative exception.

The downside is that in order to make it a mixin you need to be able to set attributes, which (I think) means taking the input ndarray column and making an ndarray-subclass that has the _astropy_column_attrs attribute. I think this can work but it's a bigger footprint. Wherever an ndarray is used in creating a new column then you need to check if it's a structured array first and make a new view as the subclass.

In the end it might be easier to make a new Table method has_structured_columns that gets used to check for method calls that are not supported.

The business about the nameless columns was mostly an artifact of a particular mixin test, so you're right that it is not really directly relevant here. Anyway though, #3781 is merged.

@taldcroft
Copy link
Member

@mdboom - here is a proof of concept for viewing any structured ndarray as a mixin column. It works for the simple test of adding a structured array column and confirming that it is indeed a mixin. Unfortunately it fails 21 tests in table. I don't have time today to look further and see if it is simple or a deep problem with the new failing tests.

taldcroft@56f44a7

@mdboom
Copy link
Contributor Author

mdboom commented May 21, 2015

Thanks, @taldcroft. No rush on this -- I'll let you know if I find the time to look into this and find anything.

@mhvk
Copy link
Contributor

mhvk commented May 21, 2015

I share a bit of @hamogu's concern, especially with Table originally having been a recarray, and with me not seeing that much use for it. But it does seem this PR is particularly simple and it moves somewhat closer to a state where Table will just take anything that has a length and is subscriptable. So, still +1 from me. (And I also am quite happy with astropy providing useful infrastructure; I'm not sure plotting a spectrum necessarily part of that (though ensuring that units show up in labels might well be).

@mdboom
Copy link
Contributor Author

mdboom commented Jun 3, 2015

@taldcroft: Stepped away from this for a while. Have you done any work on the remaining details here, or should I? (No problem either way, just don't want to duplicate work if possible).

@taldcroft
Copy link
Member

@mdboom - I never got a chance to dig into this and see why so many tests were failing. If you have time to take a look and see if you understand the problem that would be good. It might be worth trying this in the context of #3731, which is a fairly significant overhaul of how mixins are handled. I'm hoping that this will get merged in a week (or two at the latest).

@taldcroft
Copy link
Member

@mdboom - I took a few minutes to rebase my table-recarray-columns branch on #3731, which I'm praying will be merged soon. There is the philosophical question of whether to support this capability via mixins, but regardless I want to see if I can get this working as a POC for real-world use of mixins.

@taldcroft
Copy link
Member

@mdboom @mhvk - I have something which supports using an arbitrary ndarray as a Mixin column and passes tests. See: master...taldcroft:table-recarray-columns

This is a mash of the original implementation by @mdboom and mixins. Note that the fixes that @mdboom made related to pprint'ing the dtype should all be moved into NdarrayMixin now (as the default_format method of a new NdarrayInfo class) but I haven't done that yet. It also includes a new subclass MixinInfo that distinguishes a Table mixin info class from the base DataInfo.

So the question now is whether this is the right direction. I think so, but welcome discussion. This is a variation on another useful trick, namely embedding a Table as a column. Having the info object to control representation in the table is crucial here.

@mdboom
Copy link
Contributor Author

mdboom commented Jun 17, 2015

Your branch seems to make sense, and I think it clearly better than what's in this PR.

@mhvk
Copy link
Contributor

mhvk commented Jun 17, 2015

@taldcroft - i like the idea too, and the power of info is really becoming obvious. I hope that soon we can add a method that, e.g., produces a fits table column!

But your PR does pose a question about direction, since in some way Column is not that different from NdarrayMixin, except that it provides more initialisation options and direct access to info attributes. Is the idea to let Column perhaps be a subclass of NdarrayMixin eventually? (and have a MaskedArrayInfo? For those, we can stick info in directly...)

More specifically: I'm a bit confused about the new MixinInfo. Is there a specific reason not to keep DataInfo the one that is used? I like the name better... Maybe rename DataInfo to just Info (or BaseInfo?), and MixinInfo to DataInfo?

@taldcroft
Copy link
Member

Is there a specific reason not to keep DataInfo the one that is used?

This was driven by the need for a parent_table attribute in info for table columns, while the general DataInfo does not have a parent table. Originally I just accepted that DataInfo had some residual connection to Table columns, but there was a problem that came up in testing that required the parent_table attribute to be handled specially and not be copied when an Info object is copied. Hence the _attrs_no_copy addition.

I think that MixinInfo is appropriate since it specifically targets objects that are going to be a mixin column in Table. In some sense that should be the test for whether an object is (or should be) a mixin column. (notwithstanding the special case of Quantity).

I'm not sure of the real benefit to making Column eventually be a subclass of NdarrayMixin. The key thing to remember is that Column and its subclasses have a larger set of guaranteed behaviors than a mixin. We may eventually have a hierarchy of mixin types where some have more compliance with the full Column behaviors. For instance Quantity supports insertion while Time does not. The nice thing about the info object is that we can now use the class of that to know these things. So ColumnInfo would be at the top of this chain as a type that supports all behaviors.

Having said all this, having a big picture that works toward convergence of Column and mixins is a good thing to make things more and more seamless.

@taldcroft
Copy link
Member

As a side note, it occurred to me that an object with a MixinInfo info attribute can change its repr based on whether it is part of a table or not. You know that if its parent_table is not None. I'm not necessarily saying that we want to do that, but it would certainly make mixin columns seem more like regular columns if they printed out as a vertical column when part of a table. With any kind of slicing or copying the parent_table goes away.

@mhvk
Copy link
Contributor

mhvk commented Jun 17, 2015

Am not completely sold on the names still, but it no big deal, and I like the noting that info helps determine what exactly a column can do in a table (and since this is table-specific, the mixin nature is more logical).

On the repr -- we've started already a bit that way by using default_format to remove the unit from a Quantity, which will only happen if it is in a table, and it may be possible to extend this (though, like you, I'm not sure we'd necessarily want to).

But, as said, what I really like is that we now can start using info to help interact with io.fits, io.hdf5, etc.!

@taldcroft
Copy link
Member

@mdboom - I'll take this over now with my implementation. We should leave this open until I get some new PR's ready. My current branch is a messy mix right now.

@taldcroft
Copy link
Member

Superceded and closed by #3925.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants