No way to efficiently add many rows to astropy.table.Table #9212

Open
aarchiba opened this issue Sep 8, 2019 · 4 comments

@aarchiba
Contributor

aarchiba commented Sep 8, 2019

If you want to modify a Table in-place, so that anyone who holds the Table object will see your changes, you can use add_row to add a single row. If you want to add multiple rows, for example in astropy.utils.iers.IERS_Auto when a new table is available, you have to call add_row multiple times, producing a reallocation every time. The same is true for insert_row, but there is remove_rows that can remove multiple rows at once. I suggest new methods insert_rows and add_rows (which just calls insert_rows).

As of numpy 1.8, insert can do multiple insertions at once, but even before that the code is just a matter of input validation; Columns may be freely reallocated, and are in insert_row, so users already cannot hang on to Column objects and expect to see changes.
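As a small illustration of the multiple-insertion behaviour mentioned above, here is a sketch using plain NumPy arrays rather than Table columns. The array contents are made up for the example.

```python
import numpy as np

col = np.array([1, 2, 3])

# Append three values with one call: a single new allocation, not three.
appended = np.insert(col, len(col), [4, 5, 6])

# Insert values at several positions at once; the indices refer to
# positions in the original array.
scattered = np.insert(col, [0, 2], [10, 20])

print(appended)   # [1 2 3 4 5 6]
print(scattered)  # [10  1  2 20  3]
```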

Currently if you want to make major changes to a table without creating a new table object, you're probably best removing all its columns and then adding a new set of columns at the new, longer length.

@taldcroft
Member

If this is for use within the astropy iers module, one could use an internal API available now in master to do this efficiently and with minimal memory. The validated flag lets you temporarily make a table with different length columns. Obviously this needs to be used with care.

In [5]: t = simple_table()
In [6]: t.columns.__setitem__('a', t['a'].insert(len(t['a']), [4, 5, 6]), validated=True)
In [7]: t.columns.__setitem__('b', t['b'].insert(len(t['b']), [4.5, 5.5, 6.5]), validated=True)
In [8]: t.columns.__setitem__('c', t['c'].insert(len(t['c']), ['x', 'y', 'z']), validated=True)

In [9]: t
Out[9]: 
<Table length=6>
  a      b     c  
int64 float64 str1
----- ------- ----
    1     1.0    c
    2     2.0    d
    3     3.0    e
    4     4.5    x
    5     5.5    y
    6     6.5    z

Thinking about the idea of add_rows and insert_rows, this seems reasonable. The trick is the API and the flexibility of add_row / insert_row w/r/t input data structures. But if you restrict things slightly and say the input rows must be something that can be used in QTable(rows=rows, names=self.names), that might be reasonable. This does all your data validation. Likewise maybe QTable(rows=mask, names=self.names) coerces a mask arg into something reasonable. From there doing the column manipulation should be a straightforward extension of existing code.

@aarchiba
Contributor Author

aarchiba commented Sep 9, 2019

For astropy.utils.iers this isn't strictly necessary, or even (arguably) a good idea - we have a new table with the right columns already, so I just delete all the columns and add the new columns all at once. No need to have incompatible column lengths at any point. Plus it's not guaranteed that the old data won't change. (But it is what got me looking for insert_rows.)

But yes, providing the flexibility of insert_row is cumbersome and maybe not essential. Are tables necessarily one-dimensional, or can a "row" be an array of values per column? This shape validation, including broadcasting, definitely complicates numpy's insert function.

@taldcroft
Member

A table is essentially two-dimensional in the sense that the table "shape" must be M rows by N columns where M and N are scalar integers. Columns can have any shape so long as col.shape[0] is M. In addition there are mixin columns (e.g. Time) which are not numpy subclass arrays but do provide an insert method.

@aarchiba
Contributor Author

> A table is essentially two-dimensional in the sense that the table "shape" must be M rows by N columns where M and N are scalar integers. Columns can have any shape so long as col.shape[0] is M. In addition there are mixin columns (e.g. Time) which are not numpy subclass arrays but do provide an insert method.

Ah, that does complicate things - if a column is higher-dimensional we're going to have to be careful with broadcasting and input validation. If it's an arbitrary subtype, we might have to fall back on the present method (use insert() a lot), though duck typing might save the day.
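A rough sketch of the kind of shape check this would involve for a higher-dimensional column, using plain NumPy arrays. The helper `insert_block` is hypothetical, and it deliberately refuses to broadcast: values must be either one row or a stack of rows whose trailing shape matches the column.

```python
import numpy as np


def insert_block(col, index, values):
    """Hypothetical helper: insert rows into `col` along axis 0 after a shape check."""
    values = np.asarray(values)
    row_shape = col.shape[1:]
    if values.shape == row_shape:
        # A single row: add a leading axis of length 1.
        values = values[np.newaxis]
    elif values.shape[1:] != row_shape:
        raise ValueError(
            f"expected rows of shape {row_shape}, got values of shape {values.shape}"
        )
    # One call inserts all rows at `index`.
    return np.insert(col, index, values, axis=0)


col = np.zeros((3, 2))
two_rows = insert_block(col, len(col), [[1, 2], [3, 4]])  # shape (5, 2)
one_row = insert_block(col, 0, [9, 9])                    # shape (4, 2)
```

Mixin columns such as Time would not go through `np.insert` at all, so a real implementation would presumably call each column's own `insert` method instead.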

