Batched DataLink calls #218
Conversation
First attempt. I've tried to make it a drop-in, transparent change. A bit tricky since
On Wed, Feb 26, 2020 at 08:59:36PM -0800, andamian wrote:
@msdemlei & @funbaker or anyone else, please let me know if you can
suggest a better approach. I wouldn't leave the decision on the
batch size on the user, but the variable can be tweaked. This is
the first version so it's not entirely optimized but it works with
a few real examples I've tried.
I suppose given the iter_datalinks approach this is about how it
should be done. There is a bit of a problem with this approach
because datalink services can have match limits. 10000 for such a
limit is plausible, and 10 datalinks per id is also not unheard-of (I
have some datasets with ~100 datalinks).
Hence, for DATALINK_BATCH_SIZE=1000 you might well run into a match
limit. That, I think, shouldn't pass unnoticed. The way the spec is
written, it's relatively simple to catch such a situation: After
doing the batched query, you could do the equivalent (I've not looked
at what sort of thing _get_datalinks actually returns) of
```
if set(ids) != set(r["ID"] for r in self._cached_dl):
    raise WhateverError("Datalink batched query overflowed match limit;"
                        " decrease DATALINK_BATCH_SIZE.")
```
To make the code a bit more transparent, I'd pull out the cache
filling into a separate method.
You can also pass MAXREC to your datalink call, but since we have no
way of determining the hard limit on the service at this point, I
don't think that's useful until we've thought a bit more about
additional service metadata (which brings us back to caproles...).
-- Markus
@msdemlei the batch size is in the input, but the server limitation is on the output. The client can't come up with the optimal batch size, since it doesn't know how many data links each ID would generate: different calls with the same batch size can produce different numbers of output rows. I was thinking of a different approach: send all the IDs at once, and if the server returns OVERFLOW status, remove the processed IDs from the list before sending the rest again, continuing until the status is OK. Do you think this will put unnecessary burden on the service? What do you think?
BTW, one of the scenarios that astronomers have asked about is:
On Thu, Feb 27, 2020 at 11:45:51AM -0800, andamian wrote:
I was thinking of a different approach: send all the IDs at once,
and if the server returns OVERFLOW status remove the processed ones
from the list before sending them again. Continue until status OK.
Do you think this will put unnecessary burden on the service?
I think that's a totally reasonable thing to do. If you're going to
iter_datalinks fairly exhaustively anyway, the server will certainly
be grateful if you're firing off a few large requests rather than a
couple of thousand small requests.
Perhaps one needs to think about really large queries, as in several
1e5 or so; a datalink record with an attached service block may be a
few k, so the results may work out to be in the Gigabyte range.
But then that should probably be done on the server side by
introducing a reasonable match limit there.
To accommodate a large number of IDs we might need to change the
submission from the current GET to a POST, but I think the spec
supports that.
It should definitely be POST either way; Datalink sect. 2.1 requires
support for both, and the only reason I see for having GET in the
first place is so people can pass around canned queries as URL
literals. For dynamic queries, POST is a lot more robust all around.
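The POST form of such a query repeats the ID parameter once per identifier in the request body. A minimal sketch of the body encoding (the endpoint and identifiers here are hypothetical; this is only the stdlib encoding step, not a real pyvo call):

```python
# Sketch: encoding a multi-ID datalink query for POST rather than GET.
# DataLink sect. 2.1 requires services to accept the ID parameter
# repeated; urlencode with a list of pairs produces exactly that shape.
from urllib.parse import urlencode

def datalink_post_body(ids):
    """Build an application/x-www-form-urlencoded body, one ID field per identifier."""
    return urlencode([("ID", i) for i in ids])

body = datalink_post_body(["ivo://example/ds1", "ivo://example/ds2"])
print(body)  # ID=ivo%3A%2F%2Fexample%2Fds1&ID=ivo%3A%2F%2Fexample%2Fds2
```

The same body could then be handed to any HTTP client as the POST payload; unlike a GET URL, it is not subject to URL-length limits, which is what makes POST the robust choice for dynamic queries.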
Codecov Report
```
@@            Coverage Diff             @@
##           master     #218      +/-   ##
==========================================
+ Coverage   72.08%   72.19%   +0.11%
==========================================
  Files          40       42       +2
  Lines        4402     4478      +76
==========================================
+ Hits         3173     3233      +60
- Misses       1229     1245      +16
```
Continue to review full report at Codecov.
@msdemlei @funbaker I've pushed another version. Initially, the client sends all the IDs with a POST. If the result is incomplete, the size of the returned result represents the size of the batch. Subsequent calls are initiated until all the IDs are resolved.
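The loop described above can be sketched as follows. `fetch_links` and its 3-row cutoff are hypothetical stand-ins for the real datalink POST and the service's match limit; the point is only the drain-the-remaining-IDs control flow:

```python
# Sketch of the batching loop: POST all remaining IDs, drop the ones
# the (possibly truncated) response covered, repeat until none remain.

def fetch_links(ids, server_limit=3):
    # Stand-in for the datalink call: pretend the service truncates
    # its response at `server_limit` rows.
    return [{"ID": i, "access_url": "http://example.org/data/" + i}
            for i in ids[:server_limit]]

def resolve_all(ids):
    remaining = list(ids)
    links = []
    while remaining:
        rows = fetch_links(remaining)
        if not rows:
            raise RuntimeError("service returned no rows; cannot make progress")
        covered = {row["ID"] for row in rows}
        remaining = [i for i in remaining if i not in covered]
        links.extend(rows)
    return links

print(len(resolve_all(["a", "b", "c", "d", "e"])))  # 5, fetched in two batches
```

Note the guard against an empty response: without it, a service that returns nothing for the remaining IDs would loop forever.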
Fixed tests broken due to upstream changes (astropy/astropy#9505 )
The immediate changes look fine to me, and the new functionality is a win. The (only loosely related) way we deal with bytes/string here is, of course, painful, but that probably can't be helped for now.
```
        self._remaining_ids.remove(id)
        yield self._current_batch.clone_byid(id)
    else:
        yield DatalinkResults.from_result_url(row.getdataurl())
```
Umm. No. Please stick to getdatalink, because of parameters which are described in the RESOURCE element.
I see. Since this only applies when there is no resource, it can also be moved into the except clause above.
That was wrong. I've replaced it with: `elif row.access_format == 'application/x-votable+xml;content=datalink':`
(according to the specs)
Hmm. I think there should be an `else` clause, even if it just emits warnings.
`else: yield None`, maybe, since there is no info to determine the corresponding datalink for that row?
tbh I'm not sure what's the best way here. There shouldn't be empty rows, but they also shouldn't go unnoticed.
The iterator is returning Datalink resources associated with each row. The question is what to do when no such resources are available (access_url is probably a direct URL in that case). This case is valid, so I don't think the method should err or warn. The question IMO is whether to return `None` or continue, and now that I'm thinking more about it I'm more inclined toward the latter (`None` in itself is not useful info). The contract of the method is to return corresponding Datalink resources, so skipping the rows that don't have such resources is OK, no?
This is the relevant section of SIA spec:
If the SIA service is only dealing with simple data (one file per result), the
access_url column may be a link directly to that file, in which case the
access_format column should specify the file format (e.g. application/fits).
If the data provider implements a DataLink service for the data being found
via the SIA {query} capability, they may put a URL to invoke the DataLink
{links} capability (with ID parameter and value) in the access_url column; if
they do this, they must also put the standard DataLink MIME type [9] in the
access_format column.
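The dispatch the SIA text implies can be sketched like this. The dict-based rows and `iter_datalink_rows` are hypothetical stand-ins for PyVO record objects, illustrating the skip-rather-than-yield-None choice discussed above:

```python
# Sketch: only rows whose access_format is the DataLink MIME type point
# at a {links} capability; anything else (e.g. application/fits) is a
# direct download and is skipped rather than yielded as None.
DATALINK_MIME = "application/x-votable+xml;content=datalink"

def iter_datalink_rows(rows):
    for row in rows:
        if row.get("access_format") == DATALINK_MIME:
            # in pyvo this row's access_url would be resolved to a
            # DatalinkResults instance
            yield row

rows = [
    {"access_format": DATALINK_MIME, "access_url": "http://example.org/links?ID=1"},
    {"access_format": "application/fits", "access_url": "http://example.org/f.fits"},
]
print(len(list(iter_datalink_rows(rows))))  # 1: the FITS row is skipped
```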
So, better to return `None` instead of `continue`, you think?
This doesn't work for the HEASARC TAP service's DataLinks. The reason is that our service uses the integer row number as the ID. So in the VOTable, there's:
Then in adhoc.py:213, the following code
So Tom McGlynn sent a message to the DAL group asking about this, since the standard document doesn't make clear that IDs have to be strings. So which needs to be fixed: our TAP service, to use datatype="char" for the row, or PyVO, to convert in the above check? I found two places where a simple type conversion would fix this in PyVO. (The other is adhoc.py:556, for the same reason.) Of course, it could also be fixed in our service. Thoughts?
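The mismatch boils down to Python's strict typing of equality: an integer ID column never compares equal to a string ID value. A tiny illustration with hypothetical values (213 standing in for the HEASARC row-number ID):

```python
# The service declares the datalink ID column as an integer, while the
# referencing side carries strings, so a naive equality check fails.
service_id = 213          # row number, integer datatype in the VOTable
referencing_id = "213"    # value taken from the referencing column

print(service_id == referencing_id)              # False: the naive check fails

# The "simple type conversion" workaround mentioned above: normalise
# both sides to str before comparing.
print(str(service_id) == str(referencing_id))    # True
```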
summoning @msdemlei for a standard related question |
On Wed, Apr 01, 2020 at 10:30:18AM -0700, trjaffe wrote:
So Tom McGlynn sent a message to the DAL group asking about this,
since the standard document doesn't make clear that IDs have to be
strings. So which needs to be fixed, our TAP service to use
datatype="char" for the row or PyVO to convert in the above check?
I found two places that a simple type conversion would fix this in
PyVO. (The other is adhoc.py:556 for the same reason.) Of course,
it could also be fixed in our service.
Thoughts?
I'm not so sure about the "simple type conversion" -- type
conversions have a way of eventually turning out hairy.
Anyway, this is clearly a lacuna in the specs; actually, both VOTable
(which is hardly to blame) and Datalink, so I don't think we can
simply decide this here. My current position is in
http://mail.ivoa.net/pipermail/dal/2020-April/008322.html
-- where I don't think any amount of discussion will change my stance
on avoiding different type/arraysize/xtype tuples on the two ends of
a ./@ref relationship.
I see two possibilities:
A belated update: the DAL group is going to clarify in the standard that the ID should be char. We haven't yet finished fixing our service, but this mod works with others and looks fine to everybody else, and I haven't had any other comments. So I think this PR is good to go.
Thank you @trjaffe for looking into it. |
@funbaker - I think I've now addressed all the concerns. Please let me know if that's not the case. Thanks |
Aye, should be fine for now. |
#217