Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrieving rows in a table via a multi-column index fails (with possible fix) #13176

Open
taldcroft opened this issue Apr 26, 2022 · 7 comments · May be fixed by #15826
Open

Retrieving rows in a table via a multi-column index fails (with possible fix) #13176

taldcroft opened this issue Apr 26, 2022 · 7 comments · May be fixed by #15826

Comments

@taldcroft
Copy link
Member

Description

The current code for table indexing precludes using .loc[item] for a multi-column index where item is a tuple. See this slack thread for more context.

Expected behavior

It should just work.

Actual behavior

You get ValueError("Cannot use .loc on multi-column indices").

Steps to Reproduce

In [2]: t = Table(
   ...:     {
   ...:         "obs_id": [1, 1, 1, 1, 2, 2, 2, 2, 2],
   ...:         "event_id": [1, 1, 2, 2, 1, 1, 1, 2, 2],
   ...:         "tel_id": [1, 2, 1, 2, 1, 2, 3, 1, 2],
   ...:         "value": [1, 2, 3, 4, 5, 6, 7, 8, 9],
   ...:     }
   ...: )

In [3]: keys = ["obs_id", "event_id"]
In [4]: t.add_index(keys)
In [5]: t.loc[(1, 2)]

Possible solution

The solution below appears (at first glance) to allow using a multi-column index using t.loc[keys, (1, 2)]. I have not tried running tests with this.

(astropy) ➜  astropy git:(main) ✗ git diff
diff --git a/astropy/table/index.py b/astropy/table/index.py
index 04d2c38678..6f47117d24 100644
--- a/astropy/table/index.py
+++ b/astropy/table/index.py
@@ -824,8 +824,6 @@ class TableLoc:
             key = self.table.primary_key
 
         index = self.indices[key]
-        if len(index.columns) > 1:
-            raise ValueError("Cannot use .loc on multi-column indices")
 
         if isinstance(item, slice):
             # None signifies no upper/lower bound
@@ -838,7 +836,7 @@ class TableLoc:
             # item should be a list or ndarray of values
             rows = []
             for key in item:
-                p = index.find((key,))
+                p = index.find(key if isinstance(key, tuple) else (key,))
                 if len(p) == 0:
                     raise KeyError(f'No matches found for key {key}')
                 else:
@rbuehler
Copy link

rbuehler commented Nov 1, 2022

I am just a user of astropy and in need of the functionality described above (as a temporary solution I am relying on pandas MultiIndex).
Any news on the inclusion of the above solution to the code? Thank you.

@neutrinoceros
Copy link
Contributor

I had a quick look and I report that the error produced is different on astropy 6.0:

Traceback (most recent call last):
  File "/Users/clm/dev/astropy-project/astropy/bugs/13176/t.py", line 14, in <module>
    t.loc[(1, 2)]
    ~~~~~^^^^^^^^
  File "/Users/clm/dev/astropy-project/astropy/astropy/table/index.py", line 878, in __getitem__
    rows = self._get_rows(item)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/clm/dev/astropy-project/astropy/astropy/table/index.py", line 843, in _get_rows
    index = self.indices[key]
            ~~~~~~~~~~~~^^^^^
  File "/Users/clm/dev/astropy-project/astropy/astropy/table/index.py", line 814, in __getitem__
    return super().__getitem__(item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

I don't think the proposed patch is sufficient now, since the raise statement it proposed to remove is not even reached anymore. I'll try to come back to this when I have a better understanding of astropy.table !

@neutrinoceros
Copy link
Contributor

@taldcroft just to be sure, can I ask what behaviour you'd expect here ?

@taldcroft
Copy link
Member Author

The expected output would be below, i.e. the rows with (obs_id, event_id) == (1, 2).

<Table length=2>
obs_id event_id tel_id value
int64   int64   int64  int64
------ -------- ------ -----
     1        2      1     3
     1        2      2     4

For reference here is the table in the original example:

In [13]: t
Out[13]: 
<Table length=9>
obs_id event_id tel_id value
int64   int64   int64  int64
------ -------- ------ -----
     1        1      1     1
     1        1      2     2
     1        2      1     3
     1        2      2     4
     2        1      1     5
     2        1      2     6
     2        1      3     7
     2        2      1     8
     2        2      2     9

@taldcroft
Copy link
Member Author

@neutrinoceros - the table indexing code was written a while ago by a GSoC student. I've always found the code a bit difficult to understand... anyway this is a fine opportunity to dig into Table and potentially improve that section of code from a readability / maintenance perspective.

@neutrinoceros
Copy link
Contributor

Yes, it may take me more time than I first thought but I'll give this a real try soon. Thank you very much for your replies !

@neutrinoceros
Copy link
Contributor

Just a quick note for now; I think that I was misguided when I wrote #13176 (comment) : I don't think the raise statement that was pointed to in the original report was actually ever reached with the provided example. In fact, even revisions of the codebase that were current when the report was written produce a traceback similar to the one I posted. I guess that is why the simple suggested patch never made it to a PR yet; it was actually never sufficient.
This also means we can't just bisect for when a secondary bug was checked in.

I'll study the logic more deeply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants