
download-url: Set up datalad special remote if needed #5648

Merged 5 commits into maint from download-url-init-datalad May 12, 2021



@kyleam kyleam commented May 7, 2021

As suggested in gh-5510, download-url should set up the datalad special remote if needed to avoid non-functional URLs.

  • The added code should be covered by existing tests, but an explicit test of the new behavior would be good.

    It seems tricky to test this in test_download_url, though, assuming we don't want to hit a real endpoint. For example, if we were to use shub://, patching the initial download to point to serve_path_via_http works, but anything that involves the special remote (registering the URL, downloading it from another clone) wouldn't work, because that of course happens in a separate process.
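The limitation described in the bullet above is generic Python behavior rather than anything DataLad-specific: a monkeypatch exists only in the interpreter that applies it, so a special remote running as a separate process never sees it. A minimal stdlib demonstration of that, unrelated to DataLad's own test helpers:

```python
import subprocess
import sys
from unittest import mock

# Patching an attribute affects only the current interpreter.
with mock.patch.object(sys, "platform", "patched"):
    in_process = sys.platform  # sees the patched value
    # A child process imports sys afresh and sees the real value.
    child = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.platform)"],
        capture_output=True, text=True,
    ).stdout.strip()

print(in_process)  # "patched"
print(child)       # the real platform, e.g. "linux"
```

This is why patching serve_path_via_http into the download path can cover the initial fetch but not the special remote's later use of the URL.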

kyleam added 3 commits May 7, 2021 13:08
The datalad special remote is needed for a good amount of
functionality.  Add a helper that makes it easier for callers to
ensure that it's available.
download_url() downloads through DataLad's providers.  For URLs other
than unauthenticated ones, the URL won't be functional later if the
datalad special remote isn't enabled when the URL is registered with
git-annex.

If it looks like the datalad special remote is needed, handle the
setup automatically for the caller.

Closes datalad#5510.
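As a rough illustration of the decision such a helper has to make (a hypothetical sketch, not the code this PR adds): check whether a remote named datalad already exists, and if not, produce the git-annex invocation that would initialize it. The initremote parameters below follow the usual configuration of the datalad external special remote.

```python
# Hypothetical sketch, not DataLad's actual helper: given the names of
# the repository's existing remotes, return the git-annex command that
# would initialize the "datalad" special remote, or None if it is
# already present.
def datalad_initremote_cmd(existing_remotes):
    if "datalad" in existing_remotes:
        return None
    return [
        "git", "annex", "initremote", "datalad",
        "type=external", "externaltype=datalad",
        "encryption=none", "autoenable=true",
    ]

print(datalad_initremote_cmd(["origin"]))
print(datalad_initremote_cmd(["origin", "datalad"]))  # None
```

A real helper would also have to distinguish a remote that exists but is not enabled in the current clone (where `git annex enableremote datalad` applies) from one that was never initialized.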

codecov bot commented May 7, 2021

Codecov Report

Merging #5648 (1adf05f) into maint (8dc1c56) will decrease coverage by 5.43%.
The diff coverage is 100.00%.

❗ Current head 1adf05f differs from pull request most recent head 4c303af. Consider uploading reports for the commit 4c303af to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #5648      +/-   ##
- Coverage   90.21%   84.77%   -5.44%     
  Files         299      296       -3     
  Lines       42256    42294      +38     
- Hits        38121    35855    -2266     
- Misses       4135     6439    +2304     
Impacted Files Coverage Δ
datalad/core/local/tests/ 84.53% <100.00%> (-12.00%) ⬇️
datalad/customremotes/ 83.20% <100.00%> (+1.04%) ⬆️
datalad/customremotes/tests/ 97.89% <100.00%> (+0.87%) ⬆️
datalad/interface/ 98.09% <100.00%> (+0.17%) ⬆️
datalad/interface/tests/ 100.00% <100.00%> (ø)
datalad/ 40.90% <100.00%> (ø)
datalad/support/tests/ 0.00% <0.00%> (-100.00%) ⬇️
datalad/metadata/extractors/tests/ 19.35% <0.00%> (-80.65%) ⬇️
datalad/metadata/extractors/ 12.96% <0.00%> (-79.63%) ⬇️
datalad/metadata/extractors/ 18.75% <0.00%> (-78.13%) ⬇️
... and 97 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8dc1c56...4c303af. Read the comment docs.

@kyleam kyleam marked this pull request as ready for review May 7, 2021 19:54

  • It seems tricky to test this in test_download_url, though, assuming we don't want to hit a real endpoint.

I think it is ok to hit a "quick" endpoint... we already have

datalad/downloaders/tests/ = 's3://datalad-test0-versioned/2versions-nonversioned1.txt'
datalad/downloaders/tests/ = 's3://datalad.test1/version1.txt'

and use them in some VCR'ed tests (I believe that bucket is under my personal account, so I might end up paying for it if someone starts hammering that bucket for millions of copies of that tiny file). So maybe just wrap the test in @use_cassette('<name of the test>') and commit that tape, as was done e.g. for the GitHub ones (datalad/distribution/tests/vcr_cassettes/github_yarikoptic.yaml). Then we would not really be hitting the endpoint while still testing the logic/code.
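The record/replay idea behind such a cassette can be sketched with the stdlib alone (illustrative only — the tests here use a vcr-based @use_cassette decorator, and the file name and JSON layout below are made up):

```python
import json
import os
import urllib.request

# Record/replay sketch: the first call fetches the URL and records the
# body to a "cassette" file; later calls replay from the file without
# touching the network.
def fetch_with_cassette(url, cassette):
    if os.path.exists(cassette):            # replay: no network needed
        with open(cassette) as f:
            return json.load(f)["body"]
    body = urllib.request.urlopen(url).read().decode()  # record once
    with open(cassette, "w") as f:
        json.dump({"url": url, "body": body}, f)
    return body
```

Committing the recorded file alongside the test is what lets CI replay the interaction without credentials or network access, provided the code path being tested stays in-process.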


Of course we could improve our @serve_path_via_http with some auth layer and then test "fully", but that would probably be overkill ;)

This test fails without the previous commit, which taught download-url
to initialize the datalad special remote if it is needed.

It requires credentials for s3://datalad-test0-versioned, so it will
not run in the CI builds.

kyleam commented May 11, 2021

@yarikoptic Thanks for your comments. I've added a regression test that uses s3://datalad-test0-versioned/. That depends on me having the credentials set up locally. As far as I can tell, this puts us in the same place as the other s3://datalad-test* tests, which are skipped on the CI builds. In this case, it doesn't look like using a cassette would bypass the need to have credentials/providers set up on the machine.

Please feel free to tweak the test if I'm missing an obvious way that a cassette would allow this to work without credentials.

This is a file that is also used by h5py for testing their support of
the ROS3 driver.  The bucket is public and would not incur any cost.  The file is tiny (4KB).

We could use a persistent VCR tape, but it would have shaved only about 2
seconds off the total 14-second run time (hence adding @slow), and to make
it happen we would have needed to either copy helper code from
datalad/distribution/tests/ or just craft the full path here.
So maybe later if more use cases come up.

I have also harmonized the imports a little while touching them.

Thank you @kyleam. I have pushed 4c303af, where I switched to another S3 URL pointing to a tiny file in the dandiarchive bucket. See that commit for more description (I decided not to bother with VCR for it, although I did at first); feel welcome to drop it if you do not like it.


kyleam commented May 11, 2021

I have pushed 4c303af where I switched to use another s3 url pointing to a tiny file in dandiarchive bucket.

Sounds good, thanks.


@satra - I am merging this fixup into maint, so if you are still interested in giving addurls a shot (without needing a setup procedure, unless you have some other needs) -- that would be great.

@yarikoptic yarikoptic merged commit 2c60c53 into datalad:maint May 12, 2021
@kyleam kyleam deleted the download-url-init-datalad branch May 12, 2021 13:09

satra commented May 12, 2021

Thanks @yarikoptic, I will try master for the next participant.

Some misc updates: with my procedure, I just did one participant and it took almost a day to add 15K URLs, so I'm not super hopeful that I can do this for 200 participants in a meaningful time. I wonder if --fast could work by just adding URLs without checking, and checking only if someone decides to retrieve. It would also be nice if addurls could do more magic if I also gave it size/checksum, to completely reduce the time required to create a DataLad dataset/subdatasets.


kyleam commented May 12, 2021

Why would this PR affect addurls?

@yarikoptic yarikoptic added the semver-patch Increment the patch version when merged label Jun 2, 2021
@yarikoptic yarikoptic mentioned this pull request Jun 3, 2021