Stack tables horizontally (by columns) or vertically (by rows, i.e. append) #937

Merged
merged 36 commits into from May 10, 2013

Conversation

Projects
None yet
6 participants
@taldcroft
Owner

taldcroft commented Apr 3, 2013

This PR provides two new methods hstack and vstack that allow stacking tables by rows or columns.

This also addressses two post-merge comments by @iguananaut in #903 (meta.py => metadata.py and use of deepcopy()).

Docs and testing are still to be done.

@ghost ghost assigned taldcroft Apr 3, 2013

@eteq

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 3, 2013

Owner

Quick broad-brush comment (before you start doing docs/testing, which presumably is why you posted it): how attached are you to the names hstack and vstack? Personally, I find that horribly confusing in numpy because I am never entirely sure if it means "stack along the horizontal axis" or "stack things that are horizontal". In a table context it's also a bit ambiguous because occasionally you see tables where the "columns" are actually shown horizontally. (Although that might be partly a thing with me because I get the words "row" and "column" often confused for some reason...)

So perhaps instead use something like append_rows and append_cols?

That leads to another question, though: what's the difference between vstack (or hstack? whichever one is adding new columns) and add_columns?

Owner

eteq commented Apr 3, 2013

Quick broad-brush comment (before you start doing docs/testing, which presumably is why you posted it): how attached are you to the names hstack and vstack? Personally, I find that horribly confusing in numpy because I am never entirely sure if it means "stack along the horizontal axis" or "stack things that are horizontal". In a table context it's also a bit ambiguous because occasionally you see tables where the "columns" are actually shown horizontally. (Although that might be partly a thing with me because I get the words "row" and "column" often confused for some reason...)

So perhaps instead use something like append_rows and append_cols?

That leads to another question, though: what's the difference between vstack (or hstack? whichever one is adding new columns) and add_columns?

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 3, 2013

Owner

I'm not too keen on append_rows or append_cols exactly because of the confusion you just experienced. It leads you to believe that you are appending rows or appending columns, when you are really appending tables in a row-wise or column-wise direction. That should answer your question about the difference between hstack and add_columns. Other differences: hstack allows for inner and outer merging and handles column name conflicts.

Everyone's mileage varies, but I personally find vstack and hstack intuitive. I think what's good about "horizontal" and "vertical" is this: if I have two tables and tell you to "append" or "concatenate" the columns of those tables, that's quite ambiguous. That could easily mean appending columns end-to-end. I agree that hstack/vstack depends on the convention that table columns are represented as vertical, but I would say this is the case the vast majority of the time. (Ask 100 astronomers to write a data table on a piece of paper and I bet at least 95 will orient columns vertically). With this in mind, when I say to stack a list of tables vertically this is entirely clear and visually compelling.

Owner

taldcroft commented Apr 3, 2013

I'm not too keen on append_rows or append_cols exactly because of the confusion you just experienced. It leads you to believe that you are appending rows or appending columns, when you are really appending tables in a row-wise or column-wise direction. That should answer your question about the difference between hstack and add_columns. Other differences: hstack allows for inner and outer merging and handles column name conflicts.

Everyone's mileage varies, but I personally find vstack and hstack intuitive. I think what's good about "horizontal" and "vertical" is this: if I have two tables and tell you to "append" or "concatenate" the columns of those tables, that's quite ambiguous. That could easily mean appending columns end-to-end. I agree that hstack/vstack depends on the convention that table columns are represented as vertical, but I would say this is the case the vast majority of the time. (Ask 100 astronomers to write a data table on a piece of paper and I bet at least 95 will orient columns vertically). With this in mind, when I say to stack a list of tables vertically this is entirely clear and visually compelling.

@eteq

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 4, 2013

Owner

Ahhh, ok, I understand why append is ambiguous now. I still think hstack and vstack are equally ambiguous in that sense, though. That is, I'm ass likely to confuse add_columns with vstack as I am to confuse it with append_columns.

And I still think that hstack/vstack is ambiguous in which sense it means, even if the "visual" orientation is the more common one. That is, even if rows are horizontal and columns are vertical, vstack could mean "stack these tables vertically" or "stack these tables by joining on the vertical axis". And indeed, in numpy, I've encountered this confusion among multiple people (myself included). Although, part of this in numpy comes from the fact that there's no intrinsic distinction between the first two dimensions, and you're right that there is in table.

So given that in the face of ambiguity, I'm supposed to refuse the temptation to guess, how about stack_tables_by_row/stack_tables_by_col? Or if too verbose, stack_by_row/stack_by_col?

Owner

eteq commented Apr 4, 2013

Ahhh, ok, I understand why append is ambiguous now. I still think hstack and vstack are equally ambiguous in that sense, though. That is, I'm ass likely to confuse add_columns with vstack as I am to confuse it with append_columns.

And I still think that hstack/vstack is ambiguous in which sense it means, even if the "visual" orientation is the more common one. That is, even if rows are horizontal and columns are vertical, vstack could mean "stack these tables vertically" or "stack these tables by joining on the vertical axis". And indeed, in numpy, I've encountered this confusion among multiple people (myself included). Although, part of this in numpy comes from the fact that there's no intrinsic distinction between the first two dimensions, and you're right that there is in table.

So given that in the face of ambiguity, I'm supposed to refuse the temptation to guess, how about stack_tables_by_row/stack_tables_by_col? Or if too verbose, stack_by_row/stack_by_col?

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 5, 2013

Owner

@eteq - after thinking about this more, I had a couple of thoughts:

  • Starting the method names with the same stack_ is very good because it will put the two methods right next to each other in tab-completion or API docs. This will make it more clear which is which, because for some reason stacking rows is more obviously unique while the columns is where the problems seem to be.
  • Hopefully the stack word will be enough to distinguish stacking tables from adding rows or columns.

With that, what about stack_rows and stack_columns. (For consistency we should use columns not cols here).

Owner

taldcroft commented Apr 5, 2013

@eteq - after thinking about this more, I had a couple of thoughts:

  • Starting the method names with the same stack_ is very good because it will put the two methods right next to each other in tab-completion or API docs. This will make it more clear which is which, because for some reason stacking rows is more obviously unique while the columns is where the problems seem to be.
  • Hopefully the stack word will be enough to distinguish stacking tables from adding rows or columns.

With that, what about stack_rows and stack_columns. (For consistency we should use columns not cols here).

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 5, 2013

Owner

Oh, and not that it matters that much, but the tabular package uses colstack and rowstack, so there is some precedence. But I do like have the stack verb in front to indicate a common action.

Owner

taldcroft commented Apr 5, 2013

Oh, and not that it matters that much, but the tabular package uses colstack and rowstack, so there is some precedence. But I do like have the stack verb in front to indicate a common action.

@astrofrog

This comment has been minimized.

Show comment Hide comment
@astrofrog

astrofrog Apr 5, 2013

Owner

+1 to stack_rows and stack_columns!

Owner

astrofrog commented Apr 5, 2013

+1 to stack_rows and stack_columns!

@banados

This comment has been minimized.

Show comment Hide comment
@banados

banados Apr 5, 2013

I also like stack_ rows and stack_columns 👍

banados commented Apr 5, 2013

I also like stack_ rows and stack_columns 👍

@eteq

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 5, 2013

Owner

I still think stack_rows / stack_columnsis ambiguous in that it's not obvious from reading the names that it means "stack the tables by rows" vs "stack individual rows onto this one table" (which is precisely the confusion I experienced that you rightly pointed out, @taldcroft). If a user already knows about the add_row method, you're right that there's a distinction. But I don't like counting on the user to know about a different method to be able to recognize what this method does.

One hopes that it's somewhat mitigated by the fact that if you actually read this in code, it would be something like table1.stack_rows(table2) rather than table1.stack_rows(row). But that counts on astronomers to make sensible variable names... and that's something I empirically have found to not be too valid...

But I can accept being outvoted if no one else is convinced by any of this :)

Owner

eteq commented Apr 5, 2013

I still think stack_rows / stack_columnsis ambiguous in that it's not obvious from reading the names that it means "stack the tables by rows" vs "stack individual rows onto this one table" (which is precisely the confusion I experienced that you rightly pointed out, @taldcroft). If a user already knows about the add_row method, you're right that there's a distinction. But I don't like counting on the user to know about a different method to be able to recognize what this method does.

One hopes that it's somewhat mitigated by the fact that if you actually read this in code, it would be something like table1.stack_rows(table2) rather than table1.stack_rows(row). But that counts on astronomers to make sensible variable names... and that's something I empirically have found to not be too valid...

But I can accept being outvoted if no one else is convinced by any of this :)

@embray

This comment has been minimized.

Show comment Hide comment
@embray

embray Apr 8, 2013

Member

I have no problem with hstack and vstack. The only time I've ever had to pause to thinking about it was when using hstack with two 1-D arrays. But maybe my brain just works the same way as whoever came up with those functions in the first place. At first I liked the stack_rows and stack_columns thing but I'm with @eteq that this is confusing, eep :)

What about extend_rows and extend_columns? I like the "extend" verb for analogy with Python lists, though I think it might suffer from some of the same ambiguity. I would just keep hstack and vstack since most users already know from Numpy what that means.

Member

embray commented Apr 8, 2013

I have no problem with hstack and vstack. The only time I've ever had to pause to thinking about it was when using hstack with two 1-D arrays. But maybe my brain just works the same way as whoever came up with those functions in the first place. At first I liked the stack_rows and stack_columns thing but I'm with @eteq that this is confusing, eep :)

What about extend_rows and extend_columns? I like the "extend" verb for analogy with Python lists, though I think it might suffer from some of the same ambiguity. I would just keep hstack and vstack since most users already know from Numpy what that means.

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 8, 2013

Owner

@iguananaut - my brain is wired the same way and I find hstack / vstack intuitive. I was trying to come to a compromise with @eteq to move things forward, but I'm happy to re-open this. I'll say that in my dev version I've made the change to stack_rows/columns and I had to think to make sure it was right, while with hstack/vstack I never had to think. Brain wiring I guess? I also like the connection to numpy and trying to re-use existing method names that mean the same thing. Who knows, maybe @eteq could finally get hstack/vstack clear in his head and start using them confidently in numpy. 😃

As mitigation to confusion, note that the first line of doc strings will (in any case) contain the pair of words (rows, vertical) or (columns, horizontal). I.e. Stack the tables by columns (horizontally) for stack_columns or Stack the tables horizontally (by columns).

I think extend suffers the same problem, but there is a new problem which is that for join/hstack/vstack I want to leave open the possibility of table module methods of the same name, i.e. t = astropy.table.hstack(tables). This fits neatly with the numpy analogy and somehow if you have a list of tables then it would be weird to have to do:

t = tables[0].hstack(tables[1:])

With extend then it wouldn't make sense to have the same name for the module function.

Owner

taldcroft commented Apr 8, 2013

@iguananaut - my brain is wired the same way and I find hstack / vstack intuitive. I was trying to come to a compromise with @eteq to move things forward, but I'm happy to re-open this. I'll say that in my dev version I've made the change to stack_rows/columns and I had to think to make sure it was right, while with hstack/vstack I never had to think. Brain wiring I guess? I also like the connection to numpy and trying to re-use existing method names that mean the same thing. Who knows, maybe @eteq could finally get hstack/vstack clear in his head and start using them confidently in numpy. 😃

As mitigation to confusion, note that the first line of doc strings will (in any case) contain the pair of words (rows, vertical) or (columns, horizontal). I.e. Stack the tables by columns (horizontally) for stack_columns or Stack the tables horizontally (by columns).

I think extend suffers the same problem, but there is a new problem which is that for join/hstack/vstack I want to leave open the possibility of table module methods of the same name, i.e. t = astropy.table.hstack(tables). This fits neatly with the numpy analogy and somehow if you have a list of tables then it would be weird to have to do:

t = tables[0].hstack(tables[1:])

With extend then it wouldn't make sense to have the same name for the module function.

@astrofrog

This comment has been minimized.

Show comment Hide comment
@astrofrog

astrofrog Apr 8, 2013

Owner

We could just use vstack and hstack and make sure the docstrings make it very explicit. I think we're starting to agree there's no good name anyway :-)

Owner

astrofrog commented Apr 8, 2013

We could just use vstack and hstack and make sure the docstrings make it very explicit. I think we're starting to agree there's no good name anyway :-)

@eteq

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 8, 2013

Owner

Well, part of my point was that hstack/vstack sound simple, so I think lots of people will just assume they know what they do, and never look at the docstrings to make sure. That's the problem I've seen when people read/use numpy.*stack - it seems obvious what it's doing... except that sometimes it's the opposite. That strongly suggests that @taldcroft and @astrofrog are right that there's no simple answer that works for everyone :(

Just to muddy the waters further, though, I do think stack_rows/stack_columns is superior to hstack/vstack. I think they both feature the ambiguity wrt add_*, but at least stack_rows/stack_columns don't have the horizontal/vertical vs. rows/columns ambiguity. I also liked extend_* ... until @taldcroft pointed out the problem :)

I still think both ambiguities (h/v vs. row/col and "extend" vs. add) are resolved with a sufficiently verbose option (something like stack_tables_along_rows/stack_tables_along_columns) - am I correct in assuming that you want to avoid that because it's too long, @taldcroft ?

Owner

eteq commented Apr 8, 2013

Well, part of my point was that hstack/vstack sound simple, so I think lots of people will just assume they know what they do, and never look at the docstrings to make sure. That's the problem I've seen when people read/use numpy.*stack - it seems obvious what it's doing... except that sometimes it's the opposite. That strongly suggests that @taldcroft and @astrofrog are right that there's no simple answer that works for everyone :(

Just to muddy the waters further, though, I do think stack_rows/stack_columns is superior to hstack/vstack. I think they both feature the ambiguity wrt add_*, but at least stack_rows/stack_columns don't have the horizontal/vertical vs. rows/columns ambiguity. I also liked extend_* ... until @taldcroft pointed out the problem :)

I still think both ambiguities (h/v vs. row/col and "extend" vs. add) are resolved with a sufficiently verbose option (something like stack_tables_along_rows/stack_tables_along_columns) - am I correct in assuming that you want to avoid that because it's too long, @taldcroft ?

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 8, 2013

Owner

Even the excessively verbose stack_tables_along_columns isn't without ambiguity. In your brain you are preconditioned to intepret "along" in a certain way, but that's not the only direction that "along" could mean (and in fact I would interpret "along columns" to be along their long direction). Only stack_tables_along_the_direction_going_from_the_leftmost_to_the_rightmost_column() would really do the trick.

So two people have a clear preference for *stack, and we haven't seen any succinct suggestion that is obviously better and clearly less ambiguous. Given heritage from numpy (thus not inventing new names for the same functions), I'm going to say that *stack is "good enough" (in the words of Alex Martelli).

Owner

taldcroft commented Apr 8, 2013

Even the excessively verbose stack_tables_along_columns isn't without ambiguity. In your brain you are preconditioned to intepret "along" in a certain way, but that's not the only direction that "along" could mean (and in fact I would interpret "along columns" to be along their long direction). Only stack_tables_along_the_direction_going_from_the_leftmost_to_the_rightmost_column() would really do the trick.

So two people have a clear preference for *stack, and we haven't seen any succinct suggestion that is obviously better and clearly less ambiguous. Given heritage from numpy (thus not inventing new names for the same functions), I'm going to say that *stack is "good enough" (in the words of Alex Martelli).

@eteq

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 9, 2013

Owner

Well, stack_tables_along_the_direction_going_from_the_leftmost_to_the_rightmost_column is exactly 80 characters, so still PEP8 compliant, right? :) Well it wasn't so clear, what I was trying to say in the earlier post was that I was concerned as much about the "add" vs. "extend"/"tables" ambiguity as row/col vs. h/v. I think your earlier point about why stack_* is good was quite valid, @taldcroft - it makes it clearly distinct from add_* in a parallel structure way. Anyway, by my count we had three (four counting you, originally, @taldcroft :) in favor of stack_rows/stack_columns over *stack.

Admittedly this is really getting into over-analysis land, so I'll stop now until you've gotten the tests/docs. I obviously agree the actual idea is a very good one.

Owner

eteq commented Apr 9, 2013

Well, stack_tables_along_the_direction_going_from_the_leftmost_to_the_rightmost_column is exactly 80 characters, so still PEP8 compliant, right? :) Well it wasn't so clear, what I was trying to say in the earlier post was that I was concerned as much about the "add" vs. "extend"/"tables" ambiguity as row/col vs. h/v. I think your earlier point about why stack_* is good was quite valid, @taldcroft - it makes it clearly distinct from add_* in a parallel structure way. Anyway, by my count we had three (four counting you, originally, @taldcroft :) in favor of stack_rows/stack_columns over *stack.

Admittedly this is really getting into over-analysis land, so I'll stop now until you've gotten the tests/docs. I obviously agree the actual idea is a very good one.

@astrofrog

This comment has been minimized.

Show comment Hide comment
@astrofrog

astrofrog Apr 16, 2013

Owner

I vote for stack_rows and stack_columns - so just to be clear, stack_columns means the table will be expanded in the column direction, and the number of rows stays the same?

Owner

astrofrog commented Apr 16, 2013

I vote for stack_rows and stack_columns - so just to be clear, stack_columns means the table will be expanded in the column direction, and the number of rows stays the same?

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 16, 2013

Owner

so just to be clear, stack_columns means the table will be expanded in the column direction, and the number of rows stays the same?

That's exactly the point, after all this discussion it is still not obvious what those names mean. 😉

If I showed you the tables below and said to horizontally or vertically stack the tables t1, t2, and t3, would you need clarification?

image

Owner

taldcroft commented Apr 16, 2013

so just to be clear, stack_columns means the table will be expanded in the column direction, and the number of rows stays the same?

That's exactly the point, after all this discussion it is still not obvious what those names mean. 😉

If I showed you the tables below and said to horizontally or vertically stack the tables t1, t2, and t3, would you need clarification?

image

@astrofrog

This comment has been minimized.

Show comment Hide comment
@astrofrog

astrofrog Apr 16, 2013

Owner

Er well now I'm confused if you say vertical and horizontal. But if you said stack t1 and t2 by rows, I'd say that means it's a table with 4 rows and 2 columns. If I stacked them by columns I'd get tables with 4 columns and 2 rows. To be fair though, I'd probably just check the docstring, which should include a graphical example in ASCII art or Markdown.

Owner

astrofrog commented Apr 16, 2013

Er well now I'm confused if you say vertical and horizontal. But if you said stack t1 and t2 by rows, I'd say that means it's a table with 4 rows and 2 columns. If I stacked them by columns I'd get tables with 4 columns and 2 rows. To be fair though, I'd probably just check the docstring, which should include a graphical example in ASCII art or Markdown.

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 16, 2013

Owner

Even though I prefer hstack/vstack, if everyone wants stack_rows/columns then so be it. @iguananaut had an opinion however, and the commit history will show hstack -> stack_columns -> hstack -> stack_columns. :-)

Owner

taldcroft commented Apr 16, 2013

Even though I prefer hstack/vstack, if everyone wants stack_rows/columns then so be it. @iguananaut had an opinion however, and the commit history will show hstack -> stack_columns -> hstack -> stack_columns. :-)

@astrofrog

This comment has been minimized.

Show comment Hide comment
@astrofrog

astrofrog Apr 16, 2013

Owner

Actually I don't really care :) I'll use whatever is decided, and I'll check the docstrings the first few times anyway :)

Owner

astrofrog commented Apr 16, 2013

Actually I don't really care :) I'll use whatever is decided, and I'll check the docstrings the first few times anyway :)

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 16, 2013

Owner

I say that @mdboom flips a coin for us (real or virtual). Heads is hstack/vstack, tails is stack_rows/columns.

Owner

taldcroft commented Apr 16, 2013

I say that @mdboom flips a coin for us (real or virtual). Heads is hstack/vstack, tails is stack_rows/columns.

@mdboom

This comment has been minimized.

Show comment Hide comment
@mdboom

mdboom Apr 16, 2013

Contributor

Wow -- why have I been given all the power? Because I haven't commented yet? 😉

I've been watching from the sidelines and been on the side of hstack/vstack from the beginning. It's never confused me in Numpy, and I think it's +1 to be consistent with it.

(Ok, so there was nothing random about that. If you want, I can go see what random.choose(['hstack', 'stack_rows']) returns in my console...)

Contributor

mdboom commented Apr 16, 2013

Wow -- why have I been given all the power? Because I haven't commented yet? 😉

I've been watching from the sidelines and been on the side of hstack/vstack from the beginning. It's never confused me in Numpy, and I think it's +1 to be consistent with it.

(Ok, so there was nothing random about that. If you want, I can go see what random.choose(['hstack', 'stack_rows']) returns in my console...)

@astrofrog

This comment has been minimized.

Show comment Hide comment
@astrofrog

astrofrog Apr 16, 2013

Owner

I think the reason vstack and hstack confused me in the past is because of 1D arrays (as @iguananaut mentioned) but maybe it's more obvious for 2d arrays. I mean if I stack LEGOs vertically or horizontally I can picture what that means. So maybe this is not really an issue to be consistent with Numpy since we don't have to deal with the Numpy case?

Owner

astrofrog commented Apr 16, 2013

I think the reason vstack and hstack confused me in the past is because of 1D arrays (as @iguananaut mentioned) but maybe it's more obvious for 2d arrays. I mean if I stack LEGOs vertically or horizontally I can picture what that means. So maybe this is not really an issue to be consistent with Numpy since we don't have to deal with the Numpy case?

@eteq

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 18, 2013

Owner

👍 in general to the approach of solving our "this seems natural to me!" "but this seems natural to me!" debates by having @mbdoom flip a coin :)

Owner

eteq commented Apr 18, 2013

👍 in general to the approach of solving our "this seems natural to me!" "but this seems natural to me!" debates by having @mbdoom flip a coin :)

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 21, 2013

Owner

The python 3.3 travis tests failed in a weird way, in what appears to be a hard abort prior to finishing all the tests:

https://travis-ci.org/astropy/astropy/jobs/6494187

...
astropy/table/tests/test_table.py ..............................................................................................................................................
astropy/table/tests/test_utils.py .....................
The command "python setup.py $SETUP_CMD $OPEN_FILES" exited with 245.

Done. Your build exited with 1.
Owner

taldcroft commented Apr 21, 2013

The python 3.3 travis tests failed in a weird way, in what appears to be a hard abort prior to finishing all the tests:

https://travis-ci.org/astropy/astropy/jobs/6494187

...
astropy/table/tests/test_table.py ..............................................................................................................................................
astropy/table/tests/test_utils.py .....................
The command "python setup.py $SETUP_CMD $OPEN_FILES" exited with 245.

Done. Your build exited with 1.
@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 21, 2013

Owner

This is ready for "final" review (notwithstanding the unexplained travis failure on 3.3 above). You can see docs at:
http://hea-www.harvard.edu/~aldcroft/tmp/astropy-stack/html/table/operations.html

The only outstanding action is to include module-level functions table.hstack(tables, ...) and table.vstack(tables, ..). So t1.hstack([t2, t3]) would be equivalent to table.hstack([t1, t2, t3]). I might do some refactoring to try reducing the size of table.py, but that will be transparent from the code functionality viewpoint.

Owner

taldcroft commented Apr 21, 2013

This is ready for "final" review (notwithstanding the unexplained travis failure on 3.3 above). You can see docs at:
http://hea-www.harvard.edu/~aldcroft/tmp/astropy-stack/html/table/operations.html

The only outstanding action is to include module-level functions table.hstack(tables, ...) and table.vstack(tables, ..). So t1.hstack([t2, t3]) would be equivalent to table.hstack([t1, t2, t3]). I might do some refactoring to try reducing the size of table.py, but that will be transparent from the code functionality viewpoint.

taldcroft added some commits Apr 22, 2013

Fix issue with empty string values in Python 3.3
In Python 3.3 if string (unicode) table elements are left uninitialized
then subsequent attempts to get the string repr can produce a seg
fault.  This can occur even if the value is masked out.   See:
numpy/numpy#3276

This patches initializes the values to zero instead of empty for
masked arrays.  For regular numpy arrays all values get filled
with something meaningful.

Note that the join() function works a bit differently and doesn't
suffer from this problem.
@@ -1,5 +1,7 @@
# Licensed under a 3-clause BSD style license - see LICENSE.rst
from .table import Column, Table, TableColumns, Row, MaskedColumn
+from .np_utils import TableMergeError

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 30, 2013

Owner

Given how it's now more likely to come up in regular use, perhaps TableMergeError should be moved to astropy.table.table? np_utils is otherwise all "internal use" stuff, right?

@eteq

eteq Apr 30, 2013

Owner

Given how it's now more likely to come up in regular use, perhaps TableMergeError should be moved to astropy.table.table? np_utils is otherwise all "internal use" stuff, right?

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 2, 2013

Owner

This exception is always raised within np_utils so to me it makes sense to define it there.

The intent with np_utils was to keep this module essentially "pure numpy" (no astropy dependencies), with the idea that I might eventually ask numpy whether they are interested in putting any of the routines into numpy.

(np_utils.py does currently depend on the astropy backport of OrderDict, but that's not a real astropy dependence).

@taldcroft

taldcroft May 2, 2013

Owner

This exception is always raised within np_utils so to me it makes sense to define it there.

The intent with np_utils was to keep this module essentially "pure numpy" (no astropy dependencies), with the idea that I might eventually ask numpy whether they are interested in putting any of the routines into numpy.

(np_utils.py does currently depend on the astropy backport of OrderDict, but that's not a real astropy dependence).

This comment has been minimized.

Show comment Hide comment
@eteq

eteq May 2, 2013

Owner

Ah, ok, that makes sense then. I think I was thinking it would appear in more places because I was getting some of the various hstacks and vstacks confused when quickly jumping around in the code.

@eteq

eteq May 2, 2013

Owner

Ah, ok, that makes sense then. I think I was thinking it would appear in more places because I was getting some of the various hstacks and vstacks confused when quickly jumping around in the code.

astropy/table/np_utils.py
+ Parameters
+ ----------
+
+ tables : Table or list of Table objects

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 30, 2013

Owner

This says Table objects, but above it says structured arrays. Does that mean it will actually work with both, or is one of these left over from a previous version of something?

@eteq

eteq Apr 30, 2013

Owner

This says Table objects, but above it says structured arrays. Does that mean it will actually work with both, or is one of these left over from a previous version of something?

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 2, 2013

Owner

Fixed in 5f34591.

astropy/table/np_utils.py
+def hstack(arrays, join_type='exact', uniq_col_name='{col_name}_{table_name}',
+ table_names=None, col_name_map=None):
+ """
+ Stack tables by horizontally (by columns)

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 30, 2013

Owner

Does "tables" mean Table objects here, or structured arrays? The parameter name arrays makes me think the latter, but it says Table objects below.

(I realize these are internal functions, but it might be useful to clear up for other people reading the code, given that there are actually three distinct def hstacks here.)

@eteq

eteq Apr 30, 2013

Owner

Does "tables" mean Table objects here, or structured arrays? The parameter name arrays makes me think the latter, but it says Table objects below.

(I realize these are internal functions, but it might be useful to clear up for other people reading the code, given that there are actually three distinct def hstacks here.)

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 2, 2013

Owner

Fixed in 5f34591.

+
+class TestHStack():
+
+ def setup_method(self, method):

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 30, 2013

Owner

I see you're using the "xunit style" tests here. Are you familiar with the pytests fixtures that can be used to acheive the same goal (http://pytest.org/latest/fixture.html ) ? I haven't really tested this, but the docs there seem to indicate the fixtures can be substantially faster because you can generate the fixture once per module (or class, or whatever).

If you know about them and just don't like them for some reason, that's fine. Just thought I'd mention in case you hadn't looked seen them.

@eteq

eteq Apr 30, 2013

Owner

I see you're using the "xunit style" tests here. Are you familiar with the pytests fixtures that can be used to acheive the same goal (http://pytest.org/latest/fixture.html ) ? I haven't really tested this, but the docs there seem to indicate the fixtures can be substantially faster because you can generate the fixture once per module (or class, or whatever).

If you know about them and just don't like them for some reason, that's fine. Just thought I'd mention in case you hadn't looked seen them.

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 2, 2013

Owner

I've used pytest fixtures quite a bit in the rest of table tests, mostly to do tests on masked and non-masked objects with the same code.

In this case and other similar ones, I feel like the xunit setup is natural because I'm making a bunch of objects that I might be modifying in place during tests, and I want to be totally sure the tests are independent. All the table setup is cheap and I'm not worried about speed.

@taldcroft

taldcroft May 2, 2013

Owner

I've used pytest fixtures quite a bit in the rest of table tests, mostly to do tests on masked and non-masked objects with the same code.

In this case and other similar ones, I feel like the xunit setup is natural because I'm making a bunch of objects that I might be modifying in place during tests, and I want to be totally sure the tests are independent. All the table setup is cheap and I'm not worried about speed.

This comment has been minimized.

Show comment Hide comment
@eteq

eteq May 2, 2013

Owner

Ok, that's fine - just wanted to make sure you knew about it (and it didn't occur to me to check other tests you wrote...)

@eteq

eteq May 2, 2013

Owner

Ok, that's fine - just wanted to make sure you knew about it (and it didn't occur to me to check other tests you wrote...)

docs/table/operations.rst
+
+In order to accomodate different use cases, each of the table operations can be
+called in two ways. The first is using a |Table| object *method*, and the second
+is using a *function* within the `~astropy.table` sub-package. For example::

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 30, 2013

Owner

I think it might be worth throwing in just a little bit more stating the distinction. E.g.:

The first is using a |Table| object *method*, which combines the object it is called from with other `Table`s.  The second
+is using a *function* within the `~astropy.table` sub-package, which stacks all of the tables it is called on.  For example::

I say this (or some similar wording) just because I suspect some people reading this won't really know what the word "method" means relative to "function". Tables are a relatively "accessible" concept, so I think somewhat more novices would come through here than some of the other subpackages.

@eteq

eteq Apr 30, 2013

Owner

I think it might be worth throwing in just a little bit more stating the distinction. E.g.:

The first is using a |Table| object *method*, which combines the object it is called from with other `Table`s.  The second
+is using a *function* within the `~astropy.table` sub-package, which stacks all of the tables it is called on.  For example::

I say this (or some similar wording) just because I suspect some people reading this won't really know what the word "method" means relative to "function". Tables are a relatively "accessible" concept, so I think somewhat more novices would come through here than some of the other subpackages.

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 2, 2013

Owner

Let's just remove the methods and this whole section can be cut. Yay, less is more.

@taldcroft

taldcroft May 2, 2013

Owner

Let's just remove the methods and this whole section can be cut. Yay, less is more.

astropy/table/operations.py
+def hstack(tables, join_type='outer',
+ uniq_col_name='{col_name}_{table_name}', table_names=None):
+ """
+ Stack tables by columns (horizontally)

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 30, 2013

Owner

How do you feel about changing this (and all similar references) to "stack tables along columns"? For some reason that minor word change makes it quite a bit more apparent to me.

@eteq

eteq Apr 30, 2013

Owner

How do you feel about changing this (and all similar references) to "stack tables along columns"? For some reason that minor word change makes it quite a bit more apparent to me.

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 2, 2013

Owner

Done in f7e35c3.

astropy/table/operations.py
+ To stack two tables horizontally (by columns) do::
+
+ >>> from astropy.table import Table, hstack
+ >>> t1 = Table({'a': [1, 2], 'b': [3, 4]}, names=('a', 'b'))

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 30, 2013

Owner

Here and in all similar examples, I think it would help to add in something like:

>>> t1
       a   b
      --- ---
        1   3
        2   4

and the same for t2. Then just by inspection the user can see what both the inputs and the outputs in the same format, rather than having to figure out what sort of table actually comes out of this line.

@eteq

eteq Apr 30, 2013

Owner

Here and in all similar examples, I think it would help to add in something like:

>>> t1
       a   b
      --- ---
        1   3
        2   4

and the same for t2. Then just by inspection the user can see what both the inputs and the outputs in the same format, rather than having to figure out what sort of table actually comes out of this line.

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 2, 2013

Owner

Done in 2b09017.

@eteq

This comment has been minimized.

Show comment Hide comment
@eteq

eteq Apr 30, 2013

Owner

One larger-scale question/suggestion came to mind from the most recent commits. I like the ?stack functions. My question is: should the methods be dropped (or made private) in favor of only the functions? I know this is intended to mirror numpy, but it's not clear to me there's much to be gained by having both. Put another way, is there a compelling case for definitely wanting t1.hstack(t2) over hstack([t1, t2])?

I can see the case for the method when something of the same shape/type comes out (to support chaining), as well as the max or similar methods that can be used conveniently "inline", but for this it seems like one almost always is taking two rather complicated things and getting out a third different but at least as complicated thing.

Owner

eteq commented Apr 30, 2013

One larger-scale question/suggestion came to mind from the most recent commits. I like the ?stack functions. My question is: should the methods be dropped (or made private) in favor of only the functions? I know this is intended to mirror numpy, but it's not clear to me there's much to be gained by having both. Put another way, is there a compelling case for definitely wanting t1.hstack(t2) over hstack([t1, t2])?

I can see the case for the method when something of the same shape/type comes out (to support chaining), as well as the max or similar methods that can be used conveniently "inline", but for this it seems like one almost always is taking two rather complicated things and getting out a third different but at least as complicated thing.

@astrofrog

This comment has been minimized.

Show comment Hide comment
@astrofrog

astrofrog Apr 30, 2013

Owner

I personally plan to use the functions, not the methods, but the methods have the advantage they will be found by tab-completing the Table attributes.

Owner

astrofrog commented Apr 30, 2013

I personally plan to use the functions, not the methods, but the methods have the advantage they will be found by tab-completing the Table attributes.

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft Apr 30, 2013

Owner

I'm +0.25 on dropping the methods because it does simplify things (docs and code) all around and make there be just one way to do things. The only reservation is exactly what @astrofrog said about method discovery.

Owner

taldcroft commented Apr 30, 2013

I'm +0.25 on dropping the methods because it does simplify things (docs and code) all around and make there be just one way to do things. The only reservation is exactly what @astrofrog said about method discovery.

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 2, 2013

Owner

@iguananaut @mdboom - do you have any opinion on removing the join, hstack, and vstack methods from Table and leaving only the function forms within the table module? I'm ready to do this since it does simplify things overall. See the comment from @eteq above.

Updating the docs and tests is a bit of work so I don't want to go ahead with this if there might be disagreement.

Owner

taldcroft commented May 2, 2013

@iguananaut @mdboom - do you have any opinion on removing the join, hstack, and vstack methods from Table and leaving only the function forms within the table module? I'm ready to do this since it does simplify things overall. See the comment from @eteq above.

Updating the docs and tests is a bit of work so I don't want to go ahead with this if there might be disagreement.

@taldcroft

This comment has been minimized.

Show comment Hide comment
@taldcroft

taldcroft May 9, 2013

Owner

OK gang, the hstack and vstack functions are really really ready for merge. Now we have our fast parallel tests (#1040) waiting on this by kind agreement from @mdboom. If there are no objections or requests for additional time by Friday morning then I'll merge.

Owner

taldcroft commented May 9, 2013

OK gang, the hstack and vstack functions are really really ready for merge. Now we have our fast parallel tests (#1040) waiting on this by kind agreement from @mdboom. If there are no objections or requests for additional time by Friday morning then I'll merge.

@eteq

This comment has been minimized.

Show comment Hide comment
@eteq

eteq May 9, 2013

Owner

looks good to me!

Owner

eteq commented May 9, 2013

looks good to me!

taldcroft added a commit that referenced this pull request May 10, 2013

Merge pull request #937 from taldcroft/table/stack
Stack tables horizontally (by columns) or vertically (by rows, i.e. append)

@taldcroft taldcroft merged commit e07f417 into astropy:master May 10, 2013

1 check passed

default The Travis CI build passed
Details

@taldcroft taldcroft deleted the taldcroft:table/stack branch May 10, 2013

@astrofrog

This comment has been minimized.

Show comment Hide comment
@astrofrog

astrofrog May 10, 2013

Owner

This is awesome - thanks for all your hard work on astropy.table, @taldcroft!

Owner

astrofrog commented May 10, 2013

This is awesome - thanks for all your hard work on astropy.table, @taldcroft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment