Change regtap to only make 2 calls instead of N+1 for get_tables #750
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main     #750      +/-   ##
==========================================
+ Coverage   79.57%   79.80%   +0.23%
==========================================
  Files          91       91
  Lines       10294    10297       +3
==========================================
+ Hits         8191     8218      +27
+ Misses       2103     2079      -24
|
I see that codecov is sad, but I'll hold off adding any unit tests until someone has a chance to review and confirms whether the approach is reasonable. |
|
On Fri, May 15, 2026 at 06:05:45PM -0700, stvoutsin wrote:
> ## Description
> `get_tables` issues one TAP query to list a resource's tables,
> then a separate query *per table* to fetch its columns which leads
> to O(N) round-trips for a resource with N tables.
Uh. What was I thinking?
> This PR replaces that with two queries:
> - one for table metadata
> - one for all columns of the resource at once
> We then group in Python by `table_index`.
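
The grouping step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `group_columns_by_table` and the dict-shaped rows are assumptions standing in for the results of the two TAP queries.

```python
from collections import defaultdict


def group_columns_by_table(table_rows, column_rows):
    """Attach each column row to its table via the shared table_index."""
    columns_by_index = defaultdict(list)
    for col in column_rows:
        columns_by_index[col["table_index"]].append(col)
    # Tables that have no column rows still get an (empty) list.
    return {
        t["table_name"]: columns_by_index.get(t["table_index"], [])
        for t in table_rows
    }


# Made-up rows standing in for the two query results.
tables = [
    {"table_index": 1, "table_name": "demo.main"},
    {"table_index": 2, "table_name": "demo.aux"},
]
columns = [
    {"table_index": 1, "name": "ra"},
    {"table_index": 1, "name": "dec"},
]
grouped = group_columns_by_table(tables, columns)
```

The point of the single pass over `column_rows` is that the cost no longer depends on the number of round-trips, only on the number of rows returned.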
> ## Approach
> It would also be possible to do this in a single query with an
> `LEFT OUTER JOIN` across the two queries, but I wasn't sure if
> that is guaranteed to be supported across all RegTAP service
> implementations.
It is. But I think the normalisation is worth it, in particular for
larger services.
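
To illustrate the trade-off, here is a small sketch that uses an in-memory SQLite database as a stand-in for a RegTAP service. The table and column names are simplified assumptions modelled on `rr.res_table` and `rr.table_column`, and the identifier is made up. It shows that the single-query join repeats the table metadata on every column row, which is the redundancy the two-query approach avoids.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE res_table (ivoid TEXT, table_index INTEGER, table_name TEXT);
    CREATE TABLE table_column (ivoid TEXT, table_index INTEGER, name TEXT);
    INSERT INTO res_table VALUES
        ('ivo://x/demo', 1, 'demo.main'),
        ('ivo://x/demo', 2, 'demo.empty');
    INSERT INTO table_column VALUES
        ('ivo://x/demo', 1, 'ra'),
        ('ivo://x/demo', 1, 'dec');
""")

# One row per (table, column) pair; tables without columns survive
# thanks to the outer join, with NULL in the column fields.
joined = con.execute("""
    SELECT t.table_name, c.name
    FROM res_table AS t
    LEFT OUTER JOIN table_column AS c
      ON t.ivoid = c.ivoid AND t.table_index = c.table_index
    WHERE t.ivoid = ?
    ORDER BY t.table_index, c.name
""", ("ivo://x/demo",)).fetchall()
```

With many columns per table and wide table metadata, the repeated `table_name` (and title, description, and so on in a real query) is transferred once per column.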
Given things are going to be much faster this way, I'd raise the
table limit to 500 at least, which covers all but the most humungous
services unless this results in excessive runtimes or timeouts (can
you try with VizieR?)
> - I haven't added any tests, but I think the existing suite
> should validate / test this correctly as far as I can see
I agree. And don't worry about codecov being unhappy. I think it's
running things with network accesses turned off, and then this code
won't be exercised.
> Any thoughts on whether this approach seems reasonable?
This is obviously an improvement. Looking at things now I have to
say that I don't particularly like the interface of get_tables; the
way this should have been is a mapping from table metadata to opaque
column objects that do the column queries when a user actually
requests the columns. We *could* still try this with some descriptor
magic in what _build_vosi_table returns. But I'd only look into this
if fetching the columns with the table metadata really turns out to
be a drag.
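
A rough sketch of what such descriptor-based lazy loading could look like. Everything here is hypothetical (`LazyColumns`, `TableStub`, and `fake_fetch` are illustrative names, not pyVO API); it only demonstrates deferring the column query until the attribute is first read.

```python
class LazyColumns:
    """Non-data descriptor that fetches columns on first access."""

    def __set_name__(self, owner, name):
        self._name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        columns = instance._fetch_columns(instance.table_index)
        # Cache the result in the instance dict. Because this class
        # defines no __set__, later lookups find the instance attribute
        # first and never reach __get__ again.
        instance.__dict__[self._name] = columns
        return columns


class TableStub:
    columns = LazyColumns()

    def __init__(self, name, table_index, fetch_columns):
        self.name = name
        self.table_index = table_index
        self._fetch_columns = fetch_columns


calls = []


def fake_fetch(table_index):
    # Stands in for a per-table column query against the registry.
    calls.append(table_index)
    return ["ra", "dec"]


table = TableStub("demo.main", 1, fake_fetch)
```

Creating `table` triggers no query; the first read of `table.columns` calls `fake_fetch` once, and later reads reuse the cached list.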
Thanks!
|
@msdemlei Thanks for the feedback. As you suggested, I've updated the table limit to 500 and ran some tests with external services.

What do you think is the best way to handle cases like that? Should we add an arbitrarily large Or perhaps adding a

Lazy loading of the columns does seem like a good way around this, but I'm not very familiar with the usual use-cases here. Do we know if most users would be doing "get all tables and inspect all their columns" (which would take us back to N+1), or mostly just targeting specific tables? |
|
Codecov: looking at the diff and codecov report it looks like these affected lines were not covered previously either and thus you get the failing report. I would say it's not a blocker for the PR as it isn't introducing any actual degradation of coverage, but getting these lines covered would be nice. |
|
On Mon, May 18, 2026 at 11:37:26AM -0700, stvoutsin wrote:
> I tested with Vizier, it seems that their catalogs are registered
> as separate resources, with the largest having 126 tables (this
> completes within 2s).
Ohright. VizieR's good then. Thanks for looking into this.
> The resource with the largest table count that I found was WFAU's
> vsa-tap which has 2895 tables. This completed within 5s, though the
> column query did hit the row limit and we got truncated results.
Oh dang, yes, we'll need to think about this.
> What do you think is the best way to handle cases like that? Should
> we add an arbitrarily large `maxrec`, perhaps based on the
> `table_limit` (`table_limit` * 500)? Or `maxrec=-1`? Do we know if
> `maxrec -1` is adopted everywhere?
I'm not aware that we're giving MAXREC=-1 a special meaning anywhere.
Do we?
At least current DALI doesn't say anything about it, and DaCHS
doesn't interpret it either. Whether I think giving an option to say
"use your maximum" is a good idea I'm not sure... Anyway, DALI is in
RFC right now, so if you think this would be a good thing, do chime in
at https://wiki.ivoa.net/twiki/bin/view/IVOA/DALIV12RFC.
Without MAXREC=-1, I'd say we should just have a more or less
arbitrarily high limit. For reference, there are currently 1.7e6
records in rr.table_column. As a rule of thumb, I'd say pulling more
than half of this would probably indicate someone is doing it wrong.
What about MAXREC=1000000 just so we say *something*? I also don't
think this would cause "weird" (resource exhaustion) crashes on real
hardware, so it'd still be safe, I guess.
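
As a sketch of how the fixed limit could be passed along: the call mirrors pyVO's `TAPService.run_sync(query, maxrec=...)`, while the helper name `fetch_all_columns`, the selected columns, and the `RecordingService` stand-in are assumptions for illustration, not the code in this PR.

```python
COLUMN_QUERY_MAXREC = 1_000_000  # the "just so we say something" value


def fetch_all_columns(service, ivoid, maxrec=COLUMN_QUERY_MAXREC):
    """Fetch all column rows of one resource with an explicit MAXREC."""
    # Double any single quotes so the identifier is a valid ADQL literal.
    literal = ivoid.replace("'", "''")
    query = (
        "SELECT table_index, name, ucd, unit, datatype "
        "FROM rr.table_column "
        "WHERE ivoid = '{}'".format(literal)
    )
    return service.run_sync(query, maxrec=maxrec)


class RecordingService:
    """Stand-in for a TAP service; records what it was asked to run."""

    def run_sync(self, query, maxrec=None):
        self.seen = (query, maxrec)
        return []


service = RecordingService()
rows = fetch_all_columns(service, "ivo://x/demo")
```

Keeping the limit in one named constant makes it easy to revisit should a later DALI version define a way to ask for the service maximum.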
> Lazy loading of the columns does seem like a good way around this,
> but I'm not very familiar with the usual use-cases here. Do we know
> if most users would be doing "get all tables and inspect all their
> columns" (which would take us back to N+1), or mostly just
> targeting specific tables?
I don't think too many people inspect table metadata at all in pyVO,
let alone through the registry interface at this point (rather than
from TAP's VOSI endpoints or TAP_SCHEMA). Once they do, I'm pretty
sure they would be inspecting individual tables (O(10): "Is this the
main table that has the sizes of the things?") rather than going
through all the tables.
Having a global look might be required for something like "give me
all photometry columns", and that they should really be doing with
queries against rr.table_column directly. So, my take is that for
large services, lazy loading would almost always be a win. But if I
have spare cycles for pyVO, there'd be other, more urgent matters
(EPN-TAP and spectra global discovery, say), so I won't put it in and
I'm happy with this PR and MAXREC=1000000.
|
Oh right, we use it for some of our services, but you're right, I don't see it documented, so never mind. For cases like this, though, I can see the usefulness of being able to say "give me your max records for this query (no truncation warning)" without having to guess an arbitrarily large number for the maxrec; that looks like a potential gap in the current spec. I'll have a look at the RFC.
Ok sounds good!
I've added just a couple of unit tests for the relevant method. |
## Description
`get_tables` issues one TAP query to list a resource's tables, then a separate query *per table* to fetch its columns, which leads to O(N) round-trips for a resource with N tables.

This PR replaces that with two queries:
- one for table metadata
- one for all columns of the resource at once

We then group in Python by `table_index`.

## Approach
It would also be possible to do this in a single query with a `LEFT OUTER JOIN` across the two queries, but I wasn't sure if that is guaranteed to be supported across all RegTAP service implementations.

## Tests
- I haven't added any tests, but I think the existing suite should validate / test this correctly as far as I can see

Any thoughts on whether this approach seems reasonable?