Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batched DataLink calls #218

Merged
merged 15 commits into from
Apr 23, 2020
Merged

Batched DataLink calls #218

merged 15 commits into from
Apr 23, 2020

Conversation

andamian
Copy link
Contributor

@andamian
Copy link
Contributor Author

First attempt. I've tried to make it a drop in change, transparent change. A bit tricky since DatalinkResults was designed to hold entries for a single ID. The work around was to cache multiple calls to datalink but return only rows that belong to a specific ID. Not that clean, I admit but it seems to work. I kept the old DatalinkResultsMixin.iter_datalinks method for now just a bench mark for correctness and performance but it can be replaced by the corresponding iter_datalinks without any effect (hopefully) on the calling code.
@msdemlei & @funbaker or anyone else, please let me know if you can suggest a better approach. I wouldn't leave the decision on the batch size on the user, but the variable can be tweaked. This is the first version so it's not entirely optimized but it works with a few real examples I've tried.

@msdemlei
Copy link
Contributor

msdemlei commented Feb 27, 2020 via email

@andamian
Copy link
Contributor Author

@msdemlei the batch size is in the input but the server limitation is in the output. The client can't come up with the optimal batch size since it doesn't know how many data links it would generate. Each different call with the same batch size can generate a different number of output rows.

I was thinking of a different approach: send all the IDs at once, and if the server returns OVERFLOW status remove the processed ones from the list before sending them again. Continue until status OK. Do you think this will put unnecessary burden on the service?
To accommodate large number of IDs we might need to change the submission from the current GET to a POST but I think the spec supports that.

What do you think?

@andamian
Copy link
Contributor Author

BTW, one of the scenarios that astronomers have asked is:

  1. Perform SIA2 query
  2. Process results which typically is just some further filtering that cannot be achieved with the query.
  3. Download data
    How can step 2 be performed with the SIAResults object? The modified SIAResults object is required in step 3 since it stores information about the DataLink step. Users should be able to easily perform record deletes from an SIAResults, no? Maybe the functionality is there but I'm missing it.

@msdemlei
Copy link
Contributor

msdemlei commented Feb 28, 2020 via email

@codecov
Copy link

codecov bot commented Mar 5, 2020

Codecov Report

Merging #218 into master will increase coverage by 0.11%.
The diff coverage is 84.94%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #218      +/-   ##
==========================================
+ Coverage   72.08%   72.19%   +0.11%     
==========================================
  Files          40       42       +2     
  Lines        4402     4478      +76     
==========================================
+ Hits         3173     3233      +60     
- Misses       1229     1245      +16     
Impacted Files Coverage Δ
pyvo/dal/params.py 86.29% <66.66%> (-0.30%) ⬇️
pyvo/dal/adhoc.py 62.16% <82.19%> (+3.06%) ⬆️
pyvo/dal/query.py 85.06% <100.00%> (+0.12%) ⬆️
pyvo/dal/tap.py 71.19% <100.00%> (ø)
pyvo/utils/__init__.py 100.00% <100.00%> (ø)
pyvo/utils/compat.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c59cea...39b7dfc. Read the comment docs.

@andamian
Copy link
Contributor Author

andamian commented Mar 5, 2020

@msdemlei @funbaker I've pushed another version. Initially, the client sends all the IDs with a POST. If the result is incomplete, the size of the returned result represents the size of the batch. Subsequent calls are initiated until all the IDs are resolved.
Please review at your earliest convenience. Thanks

@andamian
Copy link
Contributor Author

andamian commented Mar 6, 2020

Fixed tests broken due to upstream changes (astropy/astropy#9505 )

pyvo/dal/adhoc.py Outdated Show resolved Hide resolved
Copy link
Contributor

@msdemlei msdemlei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The immediate changes look fine to me, and the new functionality is a win. The (only loosely related) way we deal with bytes/string here is, of course, painful, but that probably can't be helped for now.

self._remaining_ids.remove(id)
yield self._current_batch.clone_byid(id)
else:
yield DatalinkResults.from_result_url(row.getdataurl())
Copy link
Contributor

@funbaker funbaker Mar 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm. No. Please stick to getdatalink because of parameters which are described in the RESOURCE element.

I see. Since this only applies when there is no resource, it can also be moved into the except clause above.

Copy link
Contributor Author

@andamian andamian Apr 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was wrong. I've replaced it with: elif row.access_format == 'application/x-votable+xml;content=datalink': (according to the specs)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I think there should be an else clause, even if it just emits warnings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else: yield None maybe since there is no info to determine the corresponding datalink for that row?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh I'm not sure whats the best way here. There shouldn't be empty rows but they also shouldn't go unnoticed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iterator is returning Datalink resources associated with each row. Question is what to do when no such resources are available (access_url is probably a direct url in that case). This case is valid so I don't think the method should err or warn. The question IMO is whether to return None or continue and now that I'm thinking more about it I'm more inclined for the latter (None in itself is not a useful info). The contract of the method is to return corresponding Datalink resources so skipping the rows that don't have such resources is OK, no?

This is the relevant section of SIA spec:

If the SIA service is only dealing with simple data (one file per result), the 
access_url column may be a link directly to that file, in which case the 
access_format column should specify the file format (e.g. application/fits).

If the data provider implements a DataLink service for the data being found 
via the SIA {query} capability, they may put a URL to invoke the DataLink 
{links} capability (with ID parameter and value) in the access_url column; if 
they do this, they must also put the standard DataLink MIME type [9] in the 
access_format column.```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, better to return None instead of continue you think?

pyvo/dal/adhoc.py Show resolved Hide resolved
pyvo/dal/adhoc.py Show resolved Hide resolved
pyvo/dal/adhoc.py Show resolved Hide resolved
pyvo/dal/adhoc.py Show resolved Hide resolved
pyvo/utils/__init__.py Outdated Show resolved Hide resolved
@trjaffe
Copy link
Contributor

trjaffe commented Apr 1, 2020

This doesn't work for the HEASARC TAP service's DataLinks. The reason is that our service uses the integer row number as the ID. So in the VOTable, there's:

<FIELD ID="DataLinkID" datatype="int" name="__row">
...
<RESOURCE type="meta" utype="adhoc:service">
<PARAM ID="standardID" arraysize="" datatype="char" name="standardID" value="ivo://ivoa.net/std/DataLink#links-1.0"/>
<PARAM ID="accessURL" arraysize="
" datatype="char" name="accessURL" value="https://heasarc.gsfc.nasa.gov/xamin/vo/datalink/chanmaster"/>
</RESOURCE>

Then in adhoc.py:213, the following code

self._current_ids = list(OrderedDict.fromkeys( [_ for _ in self._current_batch.to_table()['ID']]))
gets the IDs from the DataLinkResults as a list of strings, while the self._remaining_ids are from the TAPResults votable as integers. So then the remove() further down doesn't work.

So Tom McGlynn sent a message to the DAL group asking about this, since the standard document doesn't make clear that IDs have to be strings. So which needs to be fixed, our TAP service to use datatype="char" for the row or PyVO to convert in the above check? I found two places that a simple type conversion would fix this in PyVO. (The other is adhoc.py:556 for the same reason.) Of course, it could also be fixed in our service.

Thoughts?

@funbaker
Copy link
Contributor

funbaker commented Apr 1, 2020

summoning @msdemlei for a standard related question

@msdemlei
Copy link
Contributor

msdemlei commented Apr 2, 2020 via email

@funbaker
Copy link
Contributor

funbaker commented Apr 2, 2020

I see two possibilities:

  1. make everything char
  2. do implicit conversion but raise a warning if it's with non-trivial types in terms of conversion.

@trjaffe
Copy link
Contributor

trjaffe commented Apr 15, 2020

A belated update: the DAL group is going to clarify the standard that the ID should be char. We haven't yet finished fixing our service, but this mod works with others and looks fine to everybody else. Nor did I have any other comments. So I think this PR is good to go.

@andamian
Copy link
Contributor Author

A belated update: the DAL group is going to clarify the standard that the ID should be char. We haven't yet finished fixing our service, but this mod works with others and looks fine to everybody else. Nor did I have any other comments. So I think this PR is good to go.

Thank you @trjaffe for looking into it.

pyvo/dal/params.py Outdated Show resolved Hide resolved
@andamian
Copy link
Contributor Author

@funbaker - I think I've now addressed all the concerns. Please let me know if that's not the case. Thanks

pyvo/dal/adhoc.py Outdated Show resolved Hide resolved
@funbaker
Copy link
Contributor

@funbaker - I think I've now addressed all the concerns. Please let me know if that's not the case. Thanks

Aye, should be fine for now.

@andamian andamian merged commit 9d527e1 into astropy:master Apr 23, 2020
@andamian andamian deleted the datalink branch April 23, 2020 04:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants