Batched DataLink calls #218
Conversation
First attempt. I've tried to make it a drop-in, transparent change. A bit tricky since
On Wed, Feb 26, 2020 at 08:59:36PM -0800, andamian wrote:
@msdemlei & @funbaker or anyone else, please let me know if you can
suggest a better approach. I wouldn't leave the decision on the
batch size on the user, but the variable can be tweaked. This is
the first version so it's not entirely optimized but it works with
a few real examples I've tried.
I suppose given the iter_datalinks approach this is about how it
should be done. There is a bit of a problem with this approach
because datalink services can have match limits. 10000 for such a
limit is plausible, and 10 datalinks per id is also not unheard-of (I
have some datasets with ~100 datalinks).
Hence, for DATALINK_BATCH_SIZE=1000 you might well run into a match
limit. That, I think, shouldn't pass unnoticed. The way the spec is
written, it's relatively simple to catch such a situation: After
doing the batched query, you could do the equivalent (I've not looked
at what sort of thing _get_datalinks actually returns) of
```
if set(ids) != set(r["ID"] for r in self._cached_dl):
    raise WhateverError("Datalink batched query overflowed match limit;"
                        " decrease DATALINK_BATCH_SIZE.")
```
To make the code a bit more transparent, I'd pull out the cache
filling into a separate method.
You can also pass MAXREC to your datalink call, but since we have no
way of determining the hard limit on the service at this point, I
don't think that's useful until we've thought a bit more about
additional service metadata (which brings us back to caproles...).
-- Markus
@msdemlei the batch size is in the input, but the server limitation is on the output. The client can't come up with the optimal batch size, since it doesn't know how many data links each ID would generate: different calls with the same batch size can produce different numbers of output rows. I was thinking of a different approach: send all the IDs at once, and if the server returns OVERFLOW status, remove the processed IDs from the list before sending the rest again, continuing until the status is OK. Do you think this will put unnecessary burden on the service? What do you think?
BTW, one of the scenarios that astronomers have asked about is:
On Thu, Feb 27, 2020 at 11:45:51AM -0800, andamian wrote:
I was thinking of a different approach: send all the IDs at once,
and if the server returns OVERFLOW status remove the processed ones
from the list before sending them again. Continue until status OK.
Do you think this will put unnecessary burden on the service?
I think that's a totally reasonable thing to do. If you're going to
iter_datalinks fairly exhaustively anyway, the server will certainly
be grateful if you're firing off a few large requests rather than a
couple of thousand small requests.
Perhaps one needs to think about really large queries, as in several
1e5 or so; a datalink record with an attached service block may be a
few k, so the results may work out to be in the Gigabyte range.
But then that should probably be done on the server side by
introducing a reasonable match limit there.
To accommodate a large number of IDs we might need to change the
submission from the current GET to a POST, but I think the spec
supports that.
It should definitely be POST either way; Datalink sect. 2.1 requires
support for both, and the only reason I see for having GET in the
first place is so people can pass around canned queries as URL
literals. For dynamic queries, POST is a lot more robust all around.
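The POST form of such a query repeats the ID parameter once per identifier in the request body. A minimal sketch of the body encoding (the endpoint and identifiers here are hypothetical; this is only the stdlib encoding step, not a real pyvo call):

```python
# Sketch: encoding a multi-ID datalink query for POST rather than GET.
# DataLink sect. 2.1 requires services to accept the ID parameter
# repeated; urlencode with a list of pairs produces exactly that shape.
from urllib.parse import urlencode

def datalink_post_body(ids):
    """Build an application/x-www-form-urlencoded body, one ID field per identifier."""
    return urlencode([("ID", i) for i in ids])

body = datalink_post_body(["ivo://example/ds1", "ivo://example/ds2"])
print(body)  # ID=ivo%3A%2F%2Fexample%2Fds1&ID=ivo%3A%2F%2Fexample%2Fds2
```

The same body could then be handed to any HTTP client as the POST payload; unlike a GET URL, it is not subject to URL-length limits, which is what makes POST the robust choice for dynamic queries.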
Codecov Report
```
@@            Coverage Diff             @@
##           master     #218      +/-   ##
==========================================
+ Coverage   72.08%   72.19%   +0.11%
==========================================
  Files          40       42       +2
  Lines        4402     4478      +76
==========================================
+ Hits         3173     3233      +60
- Misses       1229     1245      +16
```
Continue to review full report at Codecov.
@msdemlei @funbaker I've pushed another version. Initially, the client sends all the IDs with a POST. If the result is incomplete, the size of the returned result represents the size of the batch. Subsequent calls are initiated until all the IDs are resolved.
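The loop described above can be sketched as follows. `fetch_links` and its 3-row cutoff are hypothetical stand-ins for the real datalink POST and the service's match limit; the point is only the drain-the-remaining-IDs control flow:

```python
# Sketch of the batching loop: POST all remaining IDs, drop the ones
# the (possibly truncated) response covered, repeat until none remain.

def fetch_links(ids, server_limit=3):
    # Stand-in for the datalink call: pretend the service truncates
    # its response at `server_limit` rows.
    return [{"ID": i, "access_url": "http://example.org/data/" + i}
            for i in ids[:server_limit]]

def resolve_all(ids):
    remaining = list(ids)
    links = []
    while remaining:
        rows = fetch_links(remaining)
        if not rows:
            raise RuntimeError("service returned no rows; cannot make progress")
        covered = {row["ID"] for row in rows}
        remaining = [i for i in remaining if i not in covered]
        links.extend(rows)
    return links

print(len(resolve_all(["a", "b", "c", "d", "e"])))  # 5, fetched in two batches
```

Note the guard against an empty response: without it, a service that returns nothing for the remaining IDs would loop forever.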
Fixed tests broken due to upstream changes (astropy/astropy#9505 )
The immediate changes look fine to me, and the new functionality is a win. The (only loosely related) way we deal with bytes/string here is, of course, painful, but that probably can't be helped for now.
```
        self._remaining_ids.remove(id)
        yield self._current_batch.clone_byid(id)
    else:
        yield DatalinkResults.from_result_url(row.getdataurl())
```
Umm. No. Please stick to getdatalink, because of parameters which are described in the RESOURCE element.
I see. Since this only applies when there is no resource, it can also be moved into the except clause above.
That was wrong. I've replaced it with: `elif row.access_format == 'application/x-votable+xml;content=datalink':`
(according to the specs)
Hmm. I think there should be an `else` clause, even if it just emits warnings.
`else: yield None`, maybe, since there is no info to determine the corresponding datalink for that row?
tbh I'm not sure what's the best way here. There shouldn't be empty rows, but they also shouldn't go unnoticed.
The iterator is returning Datalink resources associated with each row. The question is what to do when no such resources are available (access_url is probably a direct URL in that case). This case is valid, so I don't think the method should err or warn. The question IMO is whether to return `None` or continue, and now that I'm thinking more about it I'm more inclined toward the latter (`None` in itself is not useful info). The contract of the method is to return corresponding Datalink resources, so skipping the rows that don't have such resources is OK, no?
This is the relevant section of SIA spec:
If the SIA service is only dealing with simple data (one file per result), the
access_url column may be a link directly to that file, in which case the
access_format column should specify the file format (e.g. application/fits).
If the data provider implements a DataLink service for the data being found
via the SIA {query} capability, they may put a URL to invoke the DataLink
{links} capability (with ID parameter and value) in the access_url column; if
they do this, they must also put the standard DataLink MIME type [9] in the
access_format column.
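The dispatch the SIA text implies can be sketched like this. The dict-based rows and `iter_datalink_rows` are hypothetical stand-ins for PyVO record objects, illustrating the skip-rather-than-yield-None choice discussed above:

```python
# Sketch: only rows whose access_format is the DataLink MIME type point
# at a {links} capability; anything else (e.g. application/fits) is a
# direct download and is skipped rather than yielded as None.
DATALINK_MIME = "application/x-votable+xml;content=datalink"

def iter_datalink_rows(rows):
    for row in rows:
        if row.get("access_format") == DATALINK_MIME:
            # in pyvo this row's access_url would be resolved to a
            # DatalinkResults instance
            yield row

rows = [
    {"access_format": DATALINK_MIME, "access_url": "http://example.org/links?ID=1"},
    {"access_format": "application/fits", "access_url": "http://example.org/f.fits"},
]
print(len(list(iter_datalink_rows(rows))))  # 1: the FITS row is skipped
```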
So, better to return `None` instead of `continue`, you think?
This doesn't work for the HEASARC TAP service's DataLinks. The reason is that our service uses the integer row number as the ID. So in the VOTable, there's:
Then in adhoc.py:213, the following code
So Tom McGlynn sent a message to the DAL group asking about this, since the standard document doesn't make clear that IDs have to be strings. So which needs to be fixed: our TAP service, to use datatype="char" for the row, or PyVO, to convert in the above check? I found two places where a simple type conversion would fix this in PyVO. (The other is adhoc.py:556, for the same reason.) Of course, it could also be fixed in our service. Thoughts?
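The mismatch boils down to Python's strict typing of equality: an integer ID column never compares equal to a string ID value. A tiny illustration with hypothetical values (213 standing in for the HEASARC row-number ID):

```python
# The service declares the datalink ID column as an integer, while the
# referencing side carries strings, so a naive equality check fails.
service_id = 213          # row number, integer datatype in the VOTable
referencing_id = "213"    # value taken from the referencing column

print(service_id == referencing_id)              # False: the naive check fails

# The "simple type conversion" workaround mentioned above: normalise
# both sides to str before comparing.
print(str(service_id) == str(referencing_id))    # True
```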
summoning @msdemlei for a standard related question |
On Wed, Apr 01, 2020 at 10:30:18AM -0700, trjaffe wrote:
So Tom McGlynn sent a message to the DAL group asking about this,
since the standard document doesn't make clear that IDs have to be
strings. So which needs to be fixed, our TAP service to use
datatype="char" for the row or PyVO to convert in the above check?
I found two places that a simple type conversion would fix this in
PyVO. (The other is adhoc.py:556 for the same reason.) Of course,
it could also be fixed in our service.
Thoughts?
I'm not so sure about the "simple type conversion" -- type
conversions have a way of eventually turning out hairy.
Anyway, this is clearly a lacuna in the specs; actually, both VOTable
(which is hardly to blame) and Datalink, so I don't think we can
simply decide this here. My current position is in
http://mail.ivoa.net/pipermail/dal/2020-April/008322.html
-- where I don't think any amount of discussion will change my stance
on avoiding different type/arraysize/xtype tuples on the two ends of
a ./@ref relationship.
I see two possibilities:
A belated update: the DAL group is going to clarify in the standard that the ID should be char. We haven't yet finished fixing our service, but this mod works with others and looks fine to everybody else, and I haven't had any other comments. So I think this PR is good to go.
Thank you @trjaffe for looking into it. |
@funbaker - I think I've now addressed all the concerns. Please let me know if that's not the case. Thanks |
Aye, should be fine for now. |
#217