Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Give clone candidates a priority #4619

Merged
merged 10 commits into from Jun 15, 2020
Merged

ENH: Give clone candidates a priority #4619

merged 10 commits into from Jun 15, 2020

Conversation

mih
Copy link
Member

@mih mih commented Jun 9, 2020

This is achieved by prefixing their existing label with three digits and sorting them before processing them.

This implementation makes no change to the previously established order, it just makes that order explicit in the labels.

The default setup uses the priorities in the second half of the spectrum:

  • 500 for remote URL + submodule path
  • 600 for the configured submodule URL
  • 700 for any unprioritized 'datalad.get.subdataset-source-candidate' config
  • 900 for the local subdataset path

The first half of the spectrum and the spaces inbetween are available for re-prioritizing clone attempts. The idea is to be able to configure

datalad.get.subdataset-source-candidate-000takemefirst = http://some

and have this be processed first.

I have confirmed manually that this solution enables successful clones on first attempt, and substantially speeds up dataset installation. Likewise using '999label' it is possible to configure a fallback source candidate.

Towards a resolution of gh-4613

TODO:

  • adjust clone to use this feature
  • adjust documentation
  • add test
  • potentially prevent "flexibilizing" of explicitly configured URLs (e.g. /.git appending)

This change has no effect, because the final unique() immediately
takes them out again (and always did), but it is still wrong and
confusing.
@codecov
Copy link

codecov bot commented Jun 9, 2020

Codecov Report

Merging #4619 into master will increase coverage by 39.76%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4619       +/-   ##
===========================================
+ Coverage   49.82%   89.59%   +39.76%     
===========================================
  Files         286      288        +2     
  Lines       38822    39016      +194     
===========================================
+ Hits        19344    34957    +15613     
+ Misses      19478     4059    -15419     
Impacted Files Coverage Δ
datalad/core/distributed/clone.py 89.26% <ø> (+43.86%) ⬆️
datalad/core/distributed/tests/test_clone.py 93.41% <100.00%> (+87.59%) ⬆️
datalad/distribution/get.py 96.22% <100.00%> (+17.20%) ⬆️
datalad/distribution/tests/test_get.py 100.00% <100.00%> (+93.04%) ⬆️
datalad/tests/utils_cached_dataset.py 92.30% <0.00%> (ø)
datalad/tests/test_utils_cached_dataset.py 100.00% <0.00%> (ø)
datalad/core/local/tests/test_create.py 100.00% <0.00%> (+0.44%) ⬆️
datalad/local/subdatasets.py 95.65% <0.00%> (+0.86%) ⬆️
datalad/core/local/diff.py 95.29% <0.00%> (+1.17%) ⬆️
datalad/support/external_versions.py 96.07% <0.00%> (+1.30%) ⬆️
... and 181 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0adaed...a30b7c6. Read the comment docs.

@mih
Copy link
Member Author

mih commented Jun 11, 2020

Crippled failure, not yet replicatable locally...

======================================================================
FAIL: datalad.core.distributed.tests.test_push.test_push(True,)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.5.9/x64/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/opt/hostedtoolcache/Python/3.5.9/x64/lib/python3.5/site-packages/datalad/tests/utils.py", line 714, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/opt/hostedtoolcache/Python/3.5.9/x64/lib/python3.5/site-packages/datalad/tests/utils.py", line 714, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/opt/hostedtoolcache/Python/3.5.9/x64/lib/python3.5/site-packages/datalad/core/distributed/tests/test_push.py", line 131, in check_push
    assert_result_count(res, 2 if annex else 1)
  File "/opt/hostedtoolcache/Python/3.5.9/x64/lib/python3.5/site-packages/datalad/tests/utils.py", line 1305, in assert_result_count
    _format_res(results)))
AssertionError: Got 3 instead of 2 expected results matching
  {}
...

Because it is spurious... :(

@mih
Copy link
Member Author

mih commented Jun 11, 2020

I will not add the ability to disable automatically appending /.git to all relevant URLs. It is unclear to me when that is possible without impairing legitimate use cases. The sorting already enables to get in front of the list of candidate and provide a URL that works as the first one to be tried.

mih added 2 commits Jun 11, 2020
This is achieved by simply prefixing their existing label with two digits
and sorting them before processing them.

This implementation makes no change to the previously established order,
it just makes that order explicit in the labels.

The default setup uses the priorities in the second half of the
spectrum:

50 for remote URL + submodule path
60 for the configured submodule URL
70 for any unprioritized 'datalad.get.subdataset-source-candidate' config
90 for the local subdataset path

The first half of the spectrum and the spaces inbetween are available for
re-prioritizing clone attempts. The idea is to be able to configure

datalad.get.subdataset-source-candidate-00takemefirst = http://some

and have this be processed first. Likewise using '99label' it is
possible to configure a fallback source candidate.
… clones

This avoid 3-5 unsuccessful clone attempts per subdataset in all cases
with subdataset can be found in the same store. It added 1-2 failed
attempts when this is not the case. However, those were already present
before.
Copy link
Member

@yarikoptic yarikoptic left a comment

Left some ideas to make it work with already existing clones, and making it more explicit than embedding into a variable name, and a typo fix.
Also, I think looking forward it might be better to make priorities (or costs) to align with git-annex which uses three digits level (as e.g. debian does for apt priorities). That might allow later to align priorities of remotes with e.g. priorities for annex remotes and/or URLs (based on regexes).

datalad/distribution/get.py Outdated Show resolved Hide resolved
datalad/distribution/get.py Outdated Show resolved Hide resolved
# we pick a priority of 20 to sort it before datalad's default candidates
# for non-RIA URLs, because they prioritize hierarchical layouts that
# cannot be found in a RIA store
'datalad.get.subdataset-source-candidate-20origin',
Copy link
Member

@yarikoptic yarikoptic Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered if it might make sense to separate out priority from the variable: i.e. introduce e.g.datalad.get.subdataset-source-candidate-origin-priority which if not set would get 20. That would

  • make existing clones work without any changes
  • make it more explicit
  • internal implementation could do whatever it does now (I would have preferred to have priority an explicit variable/tuple element to avoid all the splitting etc) and could be refactored later if desired/needed

Copy link
Member Author

@mih mih Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a second variable is introduced, one would also need to make sure that they stay in sync across all sources we read config from. I do not know how to achieve that, and how to test whether I have achieved that. It is perfectly possible that datalad.get.subdataset-source-candidate-origin-priority comes from user config and datalad.get.subdataset-source-candidate-origin does not.

I am not much concerned about existing clones -- I have all of them in front of me.

Copy link
Member Author

@mih mih Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed the topic of the messy internal implementation in 05e4d8f and it also switched to a three-digit priority setup.

Copy link
Member

@yarikoptic yarikoptic Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is perfectly possible that datalad.get.subdataset-source-candidate-origin-priority comes from user config and datalad.get.subdataset-source-candidate-origin does not.

Wouldn't that be a feature - I could boost or penalize priority of all ria stores with one config setting?

Copy link
Member

@yarikoptic yarikoptic Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides I would follow annex and call it cost and not priority (ambiguous - is higher number a higher priority? More obvious with cost, and is taking the one with lower cost first).

Copy link
Member

@yarikoptic yarikoptic Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in current approach it would be easy only to raise priority (lower cost) but not to lower it since the original variable would still be available somewhere with higher cost?

Copy link
Member Author

@mih mih Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that be a feature - I could boost or penalize priority of all ria stores with one config setting?

That may be, but a feature that I do not comprehend enough to be the one implementing it, and there wasn't a use case for it either.

Besides I would follow annex and call it cost and not priority (ambiguous - is higher number a higher priority? More obvious with cost, and is taking the one with lower cost first).

I can do this, if needed (but it may not be enough, see below).

And in current approach it would be easy only to raise priority (lower cost) but not to lower it since the original variable would still be available somewhere with higher cost?

There is no concept of raising or lowering, there are just candidates with (default) priorities.

I suspect that you want to aim for a bigger solution. If you insist, I can obscure the fact that the sorting can be used to achieve a prioritizing and just talk about sorting. In this case the space would be open for the feature that you have in mind. I just want to get a handle on gh-4613, have that quickly, and then I will detach from this topic.

Copy link
Member

@yarikoptic yarikoptic Jun 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no concept of raising or lowering, there are just candidates with (default) priorities.

ah -- then there should be no need for any config changes? just RF code (if still needed) so priority for RIA remotes is higher and be done?

I suspect that you want to aim for a bigger solution. If you insist, ...

oh no -- I am not insisting or even suggesting to implement anything bigger solution here. I just want this solution not require changes to user visible parts (e.g. config) later if we decide to go with some global user visible approach.

Copy link
Member Author

@mih mih Jun 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an existing unprioritized candidate configuration is found, it automatically gets priority 700 such that identical behavior is maintained. This doesn't fix gh-4613, but I see no reason for doing such intervention in repos where this wasn't a problem so far (and I believe there are just a handful). All new clones will get the new behavior.

mih and others added 6 commits Jun 12, 2020
Follows up on a aspected of datalad#4619 (comment)

Stop merging and splitting strings, but use more appropriate data
types (dicts for records, int for priorities).

Also switch to a three-digit priority to align with other setups
(datalad#4619 (review)).
This .split() call was dropped with the switch to structured records.

[ci skip]
This no longer applies after the switch to structured records.

[ci skip]
kyleam
kyleam approved these changes Jun 12, 2020
Copy link
Collaborator

@kyleam kyleam left a comment

The approach looks good to me.

To resolve the polarity issue pointed out in
datalad#4619 (comment)
@mih
Copy link
Member Author

mih commented Jun 15, 2020

Thanks for the reviews and for the approval @kyleam

I took the liberty to still rename priority to cost to resolve interpretation issue pointed out in #4619 (comment)

Will merge when green.

@mih mih merged commit 896aa20 into datalad:master Jun 15, 2020
12 checks passed
@mih mih deleted the bf-4613 branch Jun 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants