
download-url: Set up datalad special remote if needed #5648

Merged
merged 5 commits into datalad:maint on May 12, 2021

Conversation

kyleam
Contributor

@kyleam kyleam commented May 7, 2021

As suggested in gh-5510, download-url should set up the datalad special remote if needed to avoid non-functional URLs.

  • The added code should be covered by existing tests, but an explicit test of the new behavior would be good.

    It seems tricky to test this in test_download_url, though, assuming we don't want to hit a real endpoint. For example, if we were to use shub://, patching the initial download to point at serve_path_via_http works, but anything that involves the special remote (registering the URL, downloading it from another clone) wouldn't, because that of course runs in a separate process.

kyleam added 3 commits May 7, 2021 13:08
The datalad special remote is needed for a good amount of
functionality.  Add a helper that makes it easier for callers to
ensure that it's available.
download_url() downloads through DataLad's providers.  For URLs other
than unauthenticated ones, the URL won't be functional later if the
datalad special remote isn't enabled when the URL is registered with
git-annex.

If it looks like the datalad special remote is needed, handle the
setup automatically for the caller.

Closes datalad#5510.
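The decision the commit message describes can be sketched as a small helper. This is an illustration only, assuming hypothetical function names and a simplified notion of "plain" URLs; it is not DataLad's actual API:

```python
# Illustrative sketch -- the function names here are hypothetical, not
# DataLad's actual API.  A plain unauthenticated http(s) URL can be
# served by git-annex's built-in web remote; other schemes (s3://,
# shub://, authenticated providers) need the datalad special remote
# enabled when the URL is registered, or the URL is non-functional later.

PLAIN_SCHEMES = ("http://", "https://")

def needs_datalad_remote(url: str) -> bool:
    """True if registering this URL requires the datalad special remote."""
    return not url.startswith(PLAIN_SCHEMES)

def setup_action(url: str, configured: set, enabled: set) -> str:
    """Pick the setup step given the repo's configured and enabled remotes.

    Returns "none", "enableremote" (configured earlier but inactive in
    this clone), or "initremote" (never configured in this repository).
    """
    if not needs_datalad_remote(url) or "datalad" in enabled:
        return "none"
    if "datalad" in configured:
        return "enableremote"
    return "initremote"
```

The point of the helper is that callers such as download_url() can run it once before registering URLs with git-annex, instead of each caller re-deriving the initremote/enableremote distinction.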
@codecov

codecov bot commented May 7, 2021

Codecov Report

Merging #5648 (1adf05f) into maint (8dc1c56) will decrease coverage by 5.43%.
The diff coverage is 100.00%.

❗ Current head 1adf05f differs from pull request most recent head 4c303af. Consider uploading reports for the commit 4c303af to get more accurate results

@@            Coverage Diff             @@
##            maint    #5648      +/-   ##
==========================================
- Coverage   90.21%   84.77%   -5.44%     
==========================================
  Files         299      296       -3     
  Lines       42256    42294      +38     
==========================================
- Hits        38121    35855    -2266     
- Misses       4135     6439    +2304     
Impacted Files Coverage Δ
datalad/core/local/tests/test_save.py 84.53% <100.00%> (-12.00%) ⬇️
datalad/customremotes/base.py 83.20% <100.00%> (+1.04%) ⬆️
datalad/customremotes/tests/test_base.py 97.89% <100.00%> (+0.87%) ⬆️
datalad/interface/download_url.py 98.09% <100.00%> (+0.17%) ⬆️
datalad/interface/tests/test_download_url.py 100.00% <100.00%> (ø)
datalad/version.py 40.90% <100.00%> (ø)
datalad/support/tests/utils.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/metadata/extractors/tests/test_audio.py 19.35% <0.00%> (-80.65%) ⬇️
datalad/metadata/extractors/xmp.py 12.96% <0.00%> (-79.63%) ⬇️
datalad/metadata/extractors/exif.py 18.75% <0.00%> (-78.13%) ⬇️
... and 97 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@kyleam kyleam marked this pull request as ready for review May 7, 2021 19:54
@yarikoptic
Member

  • It seems tricky to test this in test_download_url, though, assuming we don't want to hit a real endpoint.

I think it is ok to hit a "quick" endpoint... we already have

datalad/downloaders/tests/test_s3.py:url_2versions_nonversioned1 = 's3://datalad-test0-versioned/2versions-nonversioned1.txt'
datalad/downloaders/tests/test_s3.py:url_1version_bucketwithdot = 's3://datalad.test1/version1.txt'

and use them in some VCR'ed tests (I believe that bucket is under my personal account, so I might go broke if someone starts hammering it for millions of copies of that tiny file). So maybe just wrap it with @use_cassette('<name of the test>') and commit that tape, as was done e.g. for the GitHub tests (datalad/distribution/tests/vcr_cassettes/github_yarikoptic.yaml). Then we would not really be hitting the endpoint while still testing the logic/code.
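The cassette idea — record the HTTP exchange once, then replay it from a committed tape so later runs never touch the endpoint — can be sketched in plain stdlib terms. This is an illustration of the mechanism, not vcrpy's actual API, and `fetch_with_cassette` is a hypothetical name:

```python
import json
import os
import urllib.request

def fetch_with_cassette(url, cassette_path, opener=None):
    """Record/replay sketch of a VCR-style cassette (hypothetical helper).

    The first call for a URL hits the network (via `opener`, defaulting
    to urllib) and records the response body in the tape file; later
    calls replay the recorded body without any network access.
    """
    tape = {}
    if os.path.exists(cassette_path):
        with open(cassette_path) as f:
            tape = json.load(f)
    if url in tape:
        return tape[url]          # replay: no network access
    opener = opener or urllib.request.urlopen
    body = opener(url).read().decode()
    tape[url] = body              # record for subsequent runs
    with open(cassette_path, "w") as f:
        json.dump(tape, f)
    return body
```

Committing the tape file alongside the test is what lets CI exercise the code path without credentials or a live endpoint, which is the trade-off being discussed here.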

@yarikoptic
Member

Of course, we could improve our @serve_path_via_http with some auth, like https://gist.github.com/fxsjy/5465353, and then test "fully", but that would probably be overkill ;)
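A minimal sketch of such an authenticated variant, using only the stdlib: the helper name is hypothetical, and this only mirrors the Basic-auth check from the linked gist, not an existing DataLad utility.

```python
import base64
from http.server import SimpleHTTPRequestHandler

def make_basic_auth_handler(user, password):
    """Hypothetical helper: build a SimpleHTTPRequestHandler subclass
    that rejects any GET lacking the right HTTP Basic credentials."""
    expected = "Basic " + base64.b64encode(
        f"{user}:{password}".encode()).decode()

    class AuthHandler(SimpleHTTPRequestHandler):
        expected_auth = expected  # exposed for inspection in tests

        def do_GET(self):
            if self.headers.get("Authorization") != self.expected_auth:
                self.send_response(401)
                self.send_header("WWW-Authenticate", 'Basic realm="test"')
                self.end_headers()
                return
            super().do_GET()  # fall through to normal file serving

    return AuthHandler
```

Passing the returned class to `http.server.HTTPServer(("localhost", 0), ...)` would then let a test exercise an authenticated-download path end to end, which is the "fully" testing being weighed here.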

This test fails without the previous commit, which taught download-url
to initialize the datalad special remote if it is needed.

It requires credentials for s3://datalad-test0-versioned, so it will
not run in the CI builds.
@kyleam
Contributor Author

kyleam commented May 11, 2021

@yarikoptic Thanks for your comments. I've added a regression test that uses s3://datalad-test0-versioned/. That depends on me having the credentials set up locally. As far as I can tell, this puts us in the same place as the other s3://datalad-test* tests, which are skipped on the CI builds. In this case, it doesn't look like using a cassette would bypass the need to have credentials/providers set up on the machine.

Please feel free to tweak the test if I'm missing an obvious way that a cassette would allow this to work without credentials.

This file is also used by h5py for testing their support of the ROS3 driver.  The bucket is public and would not incur any cost.  The file is tiny (4 KB).

We could use a persistent VCR tape, but it would have shaved only about 2
seconds off the total 14-second run time (hence adding @slow), and to make
that happen we would have needed to either copy helper code from
datalad/distribution/tests/test_create_github.py or craft the full path here.
So maybe later, if more use cases come up.

I have also harmonized the imports a little while touching them.
@yarikoptic
Member

Thank you @kyleam. I have pushed 4c303af, where I switched to another s3 URL pointing to a tiny file in the dandiarchive bucket. See that commit for more description (I decided not to bother with VCR for it, although I did at first); feel welcome to drop it if you do not like it.

@kyleam
Contributor Author

kyleam commented May 11, 2021

I have pushed 4c303af where I switched to use another s3 url pointing to a tiny file in dandiarchive bucket.

Sounds good, thanks.

@yarikoptic
Member

@satra - I am merging this fixup into maint, so if you are still interested in giving addurls a shot (without needing a setup procedure, unless you have some other needs), that would be great.

@yarikoptic yarikoptic merged commit 2c60c53 into datalad:maint May 12, 2021
@kyleam kyleam deleted the download-url-init-datalad branch May 12, 2021 13:09
@satra

satra commented May 12, 2021

thanks @yarikoptic, i will try master for the next participant.

some misc updates:
with my procedure, i just did one participant and it took almost a day to add 15K urls, so i'm not super hopeful that i can do this for 200 participants in a meaningful time. i wonder if --fast could work by just adding without checking, and checking only if someone decides to retrieve. it would also be nice if addurls could do more magic if i also gave it size/checksum, to cut down the time required to create a datalad dataset/subdatasets.

@kyleam
Copy link
Contributor Author

kyleam commented May 12, 2021

Why would this PR affect addurls?

@yarikoptic yarikoptic added the semver-patch Increment the patch version when merged label Jun 2, 2021
@yarikoptic yarikoptic mentioned this pull request Jun 3, 2021