ENH: Give clone candidates a priority #4619
This change has no effect, because the final unique() immediately takes them out again (and always did), but it is still wrong and confusing.
Codecov Report
@@ Coverage Diff @@
## master #4619 +/- ##
===========================================
+ Coverage 49.82% 89.59% +39.76%
===========================================
Files 286 288 +2
Lines 38822 39016 +194
===========================================
+ Hits 19344 34957 +15613
+ Misses 19478 4059 -15419
Crippled failure, not yet reproducible locally...
Because it is spurious... :(
I will not add the ability to disable automatically appending
This is achieved by simply prefixing their existing label with two digits and sorting them before processing them. This implementation makes no change to the previously established order; it just makes that order explicit in the labels.
The default setup uses the priorities in the second half of the spectrum:
- 50 for remote URL + submodule path
- 60 for the configured submodule URL
- 70 for any unprioritized 'datalad.get.subdataset-source-candidate' config
- 90 for the local subdataset path
The first half of the spectrum and the spaces in between are available for re-prioritizing clone attempts. The idea is to be able to configure
datalad.get.subdataset-source-candidate-00takemefirst = http://some
and have this be processed first. Likewise, using '99label' it is possible to configure a fallback source candidate.
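The prefix-and-sort scheme from the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `sort_candidates`, the example URLs, and the use of a plain dict in place of a real config manager are all assumptions.

```python
import re

# Config prefix as quoted in the commit message; everything else here
# (helper name, example URLs, plain-dict config) is hypothetical.
PREFIX = "datalad.get.subdataset-source-candidate-"
DEFAULT_PRIORITY = "70"  # assigned to labels that carry no two-digit prefix


def sort_candidates(config):
    """Return (label, url) pairs ordered by their two-digit label prefix."""
    candidates = []
    for key, url in config.items():
        if not key.startswith(PREFIX):
            continue
        label = key[len(PREFIX):]
        if not re.match(r"\d{2}", label):
            # make the previously implicit order explicit in the label
            label = DEFAULT_PRIORITY + label
        candidates.append((label, url))
    # a plain string sort suffices because the digit prefix has fixed width
    return sorted(candidates)


config = {
    PREFIX + "90local": "/path/to/subdataset",
    PREFIX + "mirror": "http://example.com/mirror",
    PREFIX + "00takemefirst": "http://some",
    PREFIX + "50origin": "http://example.com/origin",
}
ordered = sort_candidates(config)
for label, url in ordered:
    print(label, url)
```

In this sketch the unprioritized `mirror` entry lands between the 50 and 90 defaults, while `00takemefirst` is tried before everything else.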
… clones This avoids 3-5 unsuccessful clone attempts per subdataset in all cases where the subdataset can be found in the same store. It adds 1-2 failed attempts when this is not the case. However, those were already present before.
Left some ideas on making it work with already existing clones and on making it more explicit than embedding the priority into a variable name, plus a typo fix.
Also, looking forward, I think it might be better to align priorities (or costs) with git-annex, which uses three-digit levels (as e.g. Debian does for apt priorities). That might later allow aligning priorities of remotes with e.g. priorities for annex remotes and/or URLs (based on regexes).
datalad/core/distributed/clone.py
Outdated
# we pick a priority of 20 to sort it before datalad's default candidates
# for non-RIA URLs, because they prioritize hierarchical layouts that
# cannot be found in a RIA store
'datalad.get.subdataset-source-candidate-20origin',
I wondered if it might make sense to separate out the priority from the variable, i.e. introduce e.g. datalad.get.subdataset-source-candidate-origin-priority, which if not set would get 20. That would
- make existing clones work without any changes
- make it more explicit
- internal implementation could do whatever it does now (I would have preferred to have priority an explicit variable/tuple element to avoid all the splitting etc) and could be refactored later if desired/needed
If a second variable is introduced, one would also need to make sure that they stay in sync across all sources we read config from. I do not know how to achieve that, and how to test whether I have achieved that. It is perfectly possible that datalad.get.subdataset-source-candidate-origin-priority comes from user config and datalad.get.subdataset-source-candidate-origin does not.
I am not much concerned about existing clones -- I have all of them in front of me.
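The sync concern raised in this comment can be illustrated with a small sketch. It assumes the proposed (and not implemented) `-priority` companion key, models each config source as a plain dict, and uses hypothetical helper names; it only shows how a priority setting can end up without a matching candidate once sources are merged.

```python
PREFIX = "datalad.get.subdataset-source-candidate-"
SUFFIX = "-priority"  # hypothetical companion key from the review suggestion


def merge_scopes(*scopes):
    """Merge config sources; later sources override earlier ones."""
    merged = {}
    for scope in scopes:
        merged.update(scope)
    return merged


def dangling_priorities(config):
    """Return priority keys whose candidate key is missing from the config."""
    return sorted(
        key
        for key in config
        if key.startswith(PREFIX)
        and key.endswith(SUFFIX)
        and key[: -len(SUFFIX)] not in config
    )


dataset_config = {}  # the candidate itself was never configured here
user_config = {PREFIX + "origin-priority": "20"}

merged = merge_scopes(dataset_config, user_config)
orphans = dangling_priorities(merged)
print(orphans)
```

A check like `dangling_priorities` would have to run over every combination of sources to guarantee consistency, which is the testing burden the comment points at.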
I addressed the topic of the messy internal implementation in 05e4d8f and it also switched to a three-digit priority setup.
> It is perfectly possible that datalad.get.subdataset-source-candidate-origin-priority comes from user config and datalad.get.subdataset-source-candidate-origin does not.

Wouldn't that be a feature - I could boost or penalize priority of all ria stores with one config setting?
Besides I would follow annex and call it cost and not priority (ambiguous - is higher number a higher priority? More obvious with cost, and is taking the one with lower cost first).
And in current approach it would be easy only to raise priority (lower cost) but not to lower it since the original variable would still be available somewhere with higher cost?
> Wouldn't that be a feature - I could boost or penalize priority of all ria stores with one config setting?

That may be, but a feature that I do not comprehend enough to be the one implementing it, and there wasn't a use case for it either.

> Besides I would follow annex and call it cost and not priority (ambiguous - is higher number a higher priority? More obvious with cost, and is taking the one with lower cost first).

I can do this, if needed (but it may not be enough, see below).

> And in current approach it would be easy only to raise priority (lower cost) but not to lower it since the original variable would still be available somewhere with higher cost?

There is no concept of raising or lowering, there are just candidates with (default) priorities.
I suspect that you want to aim for a bigger solution. If you insist, I can obscure the fact that the sorting can be used to achieve a prioritizing and just talk about sorting. In this case the space would be open for the feature that you have in mind. I just want to get a handle on gh-4613, have that quickly, and then I will detach from this topic.
> There is no concept of raising or lowering, there are just candidates with (default) priorities.

Ah -- then there should be no need for any config changes? Just RF the code (if still needed) so that the priority for RIA remotes is higher, and be done?

> I suspect that you want to aim for a bigger solution. If you insist, ...

Oh no -- I am not insisting on or even suggesting implementing any bigger solution here. I just want this solution not to require changes to user-visible parts (e.g. config) later if we decide to go with some global user-visible approach.
If an existing unprioritized candidate configuration is found, it automatically gets priority 700 such that identical behavior is maintained. This doesn't fix gh-4613, but I see no reason for doing such intervention in repos where this wasn't a problem so far (and I believe there are just a handful). All new clones will get the new behavior.
Follows up on an aspect of datalad#4619 (comment). Stop merging and splitting strings, but use more appropriate data types (dicts for records, int for priorities). Also switch to a three-digit priority to align with other setups (datalad#4619 (review)).
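The switch to structured records described in this commit message might look roughly like the sketch below. It is an assumption-laden illustration rather than the code from commit 05e4d8f: the function name `parse_candidates`, the record keys, and the example URLs are hypothetical, while the three-digit prefix and the default of 700 for unprioritized candidates come from this conversation.

```python
import re

PREFIX = "datalad.get.subdataset-source-candidate-"
DEFAULT_PRIORITY = 700  # applied to candidates without a three-digit prefix


def parse_candidates(config):
    """Turn matching config items into dict records sorted by int priority."""
    records = []
    for key, url in config.items():
        if not key.startswith(PREFIX):
            continue
        label = key[len(PREFIX):]
        match = re.match(r"(\d{3})(.*)", label)
        if match:
            priority, name = int(match.group(1)), match.group(2)
        else:
            priority, name = DEFAULT_PRIORITY, label
        # hypothetical record layout: priority is a real int, not a substring
        records.append({"priority": priority, "name": name, "url": url})
    # lower numbers are tried first; sorted() is stable, so ties keep
    # their original config order
    return sorted(records, key=lambda rec: rec["priority"])


config = {
    PREFIX + "origin": "http://example.com/legacy",
    PREFIX + "999fallback": "http://example.com/fallback",
    PREFIX + "000takemefirst": "http://some",
    "unrelated.key": "ignored",
}
records = parse_candidates(config)
for rec in records:
    print(rec["priority"], rec["name"], rec["url"])
```

Keeping the priority as its own integer field means later code never has to split labels again, and renaming the field (for example to a cost) would not affect the parsing step.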
This .split() call was dropped with the switch to structured records. [ci skip]
This no longer applies after the switch to structured records. [ci skip]
To resolve the polarity issue pointed out in datalad#4619 (comment)
Thanks for the reviews and for the approval, @kyleam. I took the liberty to still rename Will merge when green.
This is achieved by prefixing their existing label with three digits and sorting them before processing them.
This implementation makes no change to the previously established order, it just makes that order explicit in the labels.
The default setup uses the priorities in the second half of the spectrum:
- 500 for remote URL + submodule path
- 600 for the configured submodule URL
- 700 for any unprioritized 'datalad.get.subdataset-source-candidate' config
- 900 for the local subdataset path
The first half of the spectrum and the spaces in between are available for re-prioritizing clone attempts. The idea is to be able to configure
datalad.get.subdataset-source-candidate-000takemefirst = http://some
and have this be processed first. Likewise, using '999label' it is possible to configure a fallback source candidate.
I have confirmed manually that this solution enables successful clones on the first attempt and substantially speeds up dataset installation.
Towards a resolution of gh-4613
TODO:
- clone to use this feature
- potentially prevent "flexibilizing" of explicitly configured URLs (e.g. /.git appending)