-
Notifications
You must be signed in to change notification settings - Fork 113
Make clone_dataset() patchable and less monolithic #7017
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change is equivalent to the commuluative effort posted at datalad/datalad-next#99. Its description is included below. In addition to this original changeset: - RIA-related helper are moved into `clone_ria.py` too - internal patches `ria` and `ephemeral` are applied via functions that can be easily patched-out, in order to turn the patches off externally -- for example when an extension starts to provide a superior replacement. The (adjusted for filenames) description of the original PR follows: The series provides a patch to replace `clone_dataset()` the key workhorse of datalad-core's `clone` command. The patch causes no change to any functionality, but refactors the code from large monolithic functions into significantly smaller, more comprehensible units. Emphasis is made on separating function that control the flow of processing (`clone.py`) from utilities that perform focused tasks (`clone_utils.py`). Key feature of the refactoring is the provisioning of a set of dedicated function that implement particular stages of the clone and post-clone setup processing: - `_try_clone_candidates()` (the main clone attempt loop) - `_try_clone_candidate()` (handle a single clone attempt) - `_try_git_clone_candidate()` (only workhorse ATM with all git-clone specifics) - `_post_gitclone_processing_()` (implements the logic to call all others) - `_post_git_init_processing_()` - `_pre_annex_init_processing_()` - `_annex_init()` - `_post_annex_init_processing_()` - `_pre_final_processing_()` All functions implement a keyword-argument-only API. All functions ending with the name `processing_` are implemented as generators that yield standard DataLad result records. Any of these functions can be "patched" following the pattern used in `datalad-next` for any kind of modifications of datalad-core. This capability is demonstrated by pulling out two optional features, originally intermingled with essential code: - `clone_ria.py` - `clone_ephemeral.py` (related: datalad#6961) (Note that for both patches, the documentation is not patched right now, because it seems unnecessary, and can be left to when the patch is turned off externally) These document how it is possible, with this new implementation, to provide such add-on features externally, without having to introduce them in datalad-core directly. They also show, how the patches can benefit from the keyword-argument-only API. They should continue to work as designed, as long as the particular keyword-arguments the require continue to be provided with the same semantics. Taken together, this series provides a solution to a number of issues that are linked below with a brief comment: - datalad#4903 -- more-or-less arbitrary modifications can be implemented following `git-clone`. The latter is itself encapsulated in the `_try_clone_candidate()` utility, so it would even be possible to patch-in means to establish repository clones that do not involve calling `git-clone` - This is related to datalad/datalad-registry#7 -- while we already have flexibility to add handling of new URL-types (e.g., URL mapping, subdataset candidate URL configurations), now, either `_try_clone_candidates()` or another utility (`_generate_candidate_clone_sources()`) can be patched to query an external resolver of some kind, including a DataLad registry - datalad#5289 -- the datalad-next style patching has proven to be somewhat reliable, and continue to grow in usage. This series modifies datalad-core to enable more surgical patching for add-on functionality. Continuing this for other parts of the code base would lessen the need for (possibly expensive) hooks that may require additional setup and maintenance. The patch-in-a-clone-feature approach should be applicable to the growling list of open issues asking for additional (special interest) capabilities. Examples: - datalad#6408 - datalad#5094 - datalad#4142 Lastly, this approach can help to expedite work on rejuvenating datalad components that struggle to keep up within the corset of datalad-core, with its ambition to remain as stable as possible. One example is the RIA functionality that alone is associated with 7% of the open issues, and would benefit from a few faster, less constrained development cycles. The previous impossibility of `clone` to be customized externally (like done here), was one of the main reasons for having to introduce RIA in datalad-core directly, rather than having it prove itself in a dedicated extension.
Code Climate has analyzed commit 3b38dae and detected 19 issues on this pull request. Here's the issue category breakdown:
View more on Code Climate. |
Doc build failure is fixed in #7018 |
Codecov Report
@@ Coverage Diff @@
## master #7017 +/- ##
===========================================
- Coverage 90.13% 76.13% -14.01%
===========================================
Files 354 357 +3
Lines 46352 59112 +12760
Branches 6610 6615 +5
===========================================
+ Hits 41781 45002 +3221
- Misses 4555 14094 +9539
Partials 16 16
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Container extension test failures a I think this is ready to go |
This changeset is part of an effort to strip datalad-core off non-essential components in order to reduce technical debt and improve maintainability. RIA (the concept of a dataset store, the git-annex special remote the URL handling) is not an essential component, but an application built on the datalad core. datalad#7017 shows how it can be completely separated from the core, because it is only integrated in within the `clone()` command -- all other code is located in pieces that can be installed in parallel. At the same time RIA is large (~6k lines of code), and has accumulated as significant number of unaddressed technical and conceptual issues. The amount to more than 7% of the total number of currently open issues in datalad-core. Moreoever, it is severly underdocumented. This change documents the code pieces that make up RIA, and the impact of their removal from core. They would move into an extension package, and remain available to users. The most significant impact of this removal is the removal of a set of tests of core functionality that made use of RIA in one form or another. This loss could trun out to be substantial and require compensatory actions. However, these would have been needed anyways (and or simply harder to ignore now), because the RIA tooling did not work across all target platforms, and the majority of these tests were skipped on one or more platform.
This changeset is part of an effort to strip datalad-core off non-essential components in order to reduce technical debt and improve maintainability. RIA (the concept of a dataset store, the git-annex special remote the URL handling) is not an essential component, but an application built on the datalad core. datalad#7017 shows how it can be completely separated from the core, because it is only integrated in within the `clone()` command -- all other code is located in pieces that can be installed in parallel. At the same time RIA is large (~6k lines of code), and has accumulated as significant number of unaddressed technical and conceptual issues. The amount to more than 7% of the total number of currently open issues in datalad-core. Moreoever, it is severly underdocumented. This change documents the code pieces that make up RIA, and the impact of their removal from core. They would move into an extension package, and remain available to users. The most significant impact of this removal is the removal of a set of tests of core functionality that made use of RIA in one form or another. This loss could trun out to be substantial and require compensatory actions. However, these would have been needed anyways (and or simply harder to ignore now), because the RIA tooling did not work across all target platforms, and the majority of these tests were skipped on one or more platform.
This changeset is part of an effort to strip datalad-core off non-essential components in order to reduce technical debt and improve maintainability. RIA (the concept of a dataset store, the git-annex special remote the URL handling) is not an essential component, but an application built on the datalad core. datalad#7017 shows how it can be completely separated from the core, because it is only integrated in within the `clone()` command -- all other code is located in pieces that can be installed in parallel. At the same time RIA is large (~6k lines of code), and has accumulated as significant number of unaddressed technical and conceptual issues. The amount to more than 7% of the total number of currently open issues in datalad-core. Moreoever, it is severly underdocumented. This change documents the code pieces that make up RIA, and the impact of their removal from core. They would move into an extension package, and remain available to users. The most significant impact of this removal is the removal of a set of tests of core functionality that made use of RIA in one form or another. This loss could trun out to be substantial and require compensatory actions. However, these would have been needed anyways (and or simply harder to ignore now), because the RIA tooling did not work across all target platforms, and the majority of these tests were skipped on one or more platform.
docs - unrelated, fixed AFAIK elsewhere. |
Thx! |
PR released in |
This change is equivalent to the commuluative effort posted at datalad/datalad-next#99. Its description is included below.
In addition to this original changeset:
clone_ria.py
tooria
andephemeral
are applied via functions that can be easily patched-out, in order to turn the patches off externally -- for example when an extension starts to provide a superior replacement.The (adjusted for filenames) description of the original PR follows:
The series provides a patch to replace
clone_dataset()
the key workhorse of datalad-core'sclone
command. The patch causes no change to any functionality, but refactors the code from large monolithic functions into significantly smaller, more comprehensible units. Emphasis is made on separating function that control the flow of processing (clone.py
) from utilities that perform focused tasks (clone_utils.py
).Key feature of the refactoring is the provisioning of a set of dedicated function that implement particular stages of the clone and post-clone setup processing:
_try_clone_candidates()
(the main clone attempt loop)_try_clone_candidate()
(handle a single clone attempt)_try_git_clone_candidate()
(only workhorse ATM with all git-clone specifics)_post_gitclone_processing_()
(implements the logic to call all others)_post_git_init_processing_()
_pre_annex_init_processing_()
_annex_init()
_post_annex_init_processing_()
_pre_final_processing_()
All functions implement a keyword-argument-only API. All functions ending with the name
processing_
are implemented as generators that yield standard DataLad result records.Any of these functions can be "patched" following the pattern used in
datalad-next
for any kind of modifications of datalad-core. This capability is demonstrated by pulling out two optional features, originally intermingled with essential code:clone_ria.py
clone_ephemeral.py
(related: Removeclone --reckless ephemeral
annex-symlinking feature #6961)(Note that for both patches, the documentation is not patched right now, because it seems unnecessary, and can be left to when the patch is turned off externally)
These document how it is possible, with this new implementation, to provide such add-on features externally, without having to introduce them in datalad-core directly. They also show, how the patches can benefit from the keyword-argument-only API. They should continue to work as designed, as long as the particular keyword-arguments the require continue to be provided with the same semantics.
Taken together, this series provides a solution to a number of issues that are linked below with a brief comment:
post-clone-config hooks #4903 -- more-or-less arbitrary modifications can be implemented following
git-clone
. The latter is itself encapsulated in the_try_clone_candidate()
utility, so it would even be possible to patch-in means to establish repository clones that do not involve callinggit-clone
This is related to Implement support within datalad "core" for registering new urls for a dataset datalad-registry#7 -- while we already have flexibility to add handling of new URL-types (e.g., URL mapping, subdataset candidate URL configurations), now, either
_try_clone_candidates()
or another utility (_generate_candidate_clone_sources()
) can be patched to query an external resolver of some kind, including a DataLad registry"python-space" installable hooks #5289 -- the datalad-next style patching has proven to be somewhat reliable, and continue to grow in usage. This series modifies datalad-core to enable more surgical patching for add-on functionality. Continuing this for other parts of the code base would lessen the need for (possibly expensive) hooks that may require additional setup and maintenance.
The patch-in-a-clone-feature approach should be applicable to the growling list of open issues asking for additional (special interest) capabilities. Examples:
clone/create
could/should writedatalad.dataset.id
into local git config #6408Lastly, this approach can help to expedite work on rejuvenating datalad components that struggle to keep up within the corset of datalad-core, with its ambition to remain as stable as possible. One example is the RIA functionality that alone is associated with 7% of the open issues, and would benefit from a few faster, less constrained development cycles. The previous impossibility of
clone
to be customized externally (like done here), was one of the main reasons for having to introduce RIA in datalad-core directly, rather than having it prove itself in a dedicated extension.