Skip to content

Make clone_dataset() patchable and less monolithic #7017

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 13, 2022
Merged

Conversation

mih
Copy link
Member

@mih mih commented Sep 7, 2022

This change is equivalent to the commuluative effort posted at datalad/datalad-next#99. Its description is included below.

In addition to this original changeset:

  • RIA-related helper are moved into clone_ria.py too
  • internal patches ria and ephemeral are applied via functions that can be easily patched-out, in order to turn the patches off externally -- for example when an extension starts to provide a superior replacement.

The (adjusted for filenames) description of the original PR follows:

The series provides a patch to replace clone_dataset() the key workhorse of datalad-core's clone command. The patch causes no change to any functionality, but refactors the code from large monolithic functions into significantly smaller, more comprehensible units. Emphasis is made on separating function that control the flow of processing (clone.py) from utilities that perform focused tasks (clone_utils.py).

Key feature of the refactoring is the provisioning of a set of dedicated function that implement particular stages of the clone and post-clone setup processing:

  • _try_clone_candidates() (the main clone attempt loop)
  • _try_clone_candidate() (handle a single clone attempt)
  • _try_git_clone_candidate() (only workhorse ATM with all git-clone specifics)
  • _post_gitclone_processing_() (implements the logic to call all others)
  • _post_git_init_processing_()
  • _pre_annex_init_processing_()
  • _annex_init()
  • _post_annex_init_processing_()
  • _pre_final_processing_()

All functions implement a keyword-argument-only API. All functions ending with the name processing_ are implemented as generators that yield standard DataLad result records.

Any of these functions can be "patched" following the pattern used in datalad-next for any kind of modifications of datalad-core. This capability is demonstrated by pulling out two optional features, originally intermingled with essential code:

(Note that for both patches, the documentation is not patched right now, because it seems unnecessary, and can be left to when the patch is turned off externally)

These document how it is possible, with this new implementation, to provide such add-on features externally, without having to introduce them in datalad-core directly. They also show, how the patches can benefit from the keyword-argument-only API. They should continue to work as designed, as long as the particular keyword-arguments the require continue to be provided with the same semantics.

Taken together, this series provides a solution to a number of issues that are linked below with a brief comment:

  • post-clone-config hooks #4903 -- more-or-less arbitrary modifications can be implemented following git-clone. The latter is itself encapsulated in the _try_clone_candidate() utility, so it would even be possible to patch-in means to establish repository clones that do not involve calling git-clone

  • This is related to Implement support within datalad "core" for registering new urls for a dataset datalad-registry#7 -- while we already have flexibility to add handling of new URL-types (e.g., URL mapping, subdataset candidate URL configurations), now, either _try_clone_candidates() or another utility (_generate_candidate_clone_sources()) can be patched to query an external resolver of some kind, including a DataLad registry

  • "python-space" installable hooks  #5289 -- the datalad-next style patching has proven to be somewhat reliable, and continue to grow in usage. This series modifies datalad-core to enable more surgical patching for add-on functionality. Continuing this for other parts of the code base would lessen the need for (possibly expensive) hooks that may require additional setup and maintenance.

The patch-in-a-clone-feature approach should be applicable to the growling list of open issues asking for additional (special interest) capabilities. Examples:

Lastly, this approach can help to expedite work on rejuvenating datalad components that struggle to keep up within the corset of datalad-core, with its ambition to remain as stable as possible. One example is the RIA functionality that alone is associated with 7% of the open issues, and would benefit from a few faster, less constrained development cycles. The previous impossibility of clone to be customized externally (like done here), was one of the main reasons for having to introduce RIA in datalad-core directly, rather than having it prove itself in a dedicated extension.

@mih mih added cmd-clone semver-minor Increment the minor version when merged labels Sep 7, 2022
This change is equivalent to the commuluative effort posted at
datalad/datalad-next#99. Its description
is included below.

In addition to this original changeset:

- RIA-related helper are moved into `clone_ria.py` too
- internal patches `ria` and `ephemeral` are applied via functions
  that can be easily patched-out, in order to turn the patches off
  externally -- for example when an extension starts to provide a
  superior replacement.

The (adjusted for filenames) description of the original PR follows:

The series provides a patch to replace `clone_dataset()` the key
workhorse of datalad-core's `clone` command. The patch causes no change
to any functionality, but refactors the code from large monolithic
functions into significantly smaller, more comprehensible units.
Emphasis is made on separating function that control the flow of
processing (`clone.py`) from utilities that perform focused
tasks (`clone_utils.py`).

Key feature of the refactoring is the provisioning of a set of dedicated
function that implement particular stages of the clone and post-clone
setup processing:

- `_try_clone_candidates()` (the main clone attempt loop)
- `_try_clone_candidate()` (handle a single clone attempt)
- `_try_git_clone_candidate()` (only workhorse ATM with all git-clone specifics)
- `_post_gitclone_processing_()` (implements the logic to call all others)
- `_post_git_init_processing_()`
- `_pre_annex_init_processing_()`
- `_annex_init()`
- `_post_annex_init_processing_()`
- `_pre_final_processing_()`

All functions implement a keyword-argument-only API. All functions
ending with the name `processing_` are implemented as generators that
yield standard DataLad result records.

Any of these functions can be "patched" following the pattern used in
`datalad-next` for any kind of modifications of datalad-core. This
capability is demonstrated by pulling out two optional features,
originally intermingled with essential code:

- `clone_ria.py`
- `clone_ephemeral.py`
  (related: datalad#6961)

(Note that for both patches, the documentation is not patched right now,
because it seems unnecessary, and can be left to when the patch is
turned off externally)

These document how it is possible, with this new implementation, to
provide such add-on features externally, without having to introduce
them in datalad-core directly. They also show, how the patches can
benefit from the keyword-argument-only API. They should continue to work
as designed, as long as the particular keyword-arguments the require
continue to be provided with the same semantics.

Taken together, this series provides a solution to a number of issues
that are linked below with a brief comment:

- datalad#4903 -- more-or-less
  arbitrary modifications can be implemented following `git-clone`. The
  latter is itself encapsulated in the `_try_clone_candidate()` utility,
  so it would even be possible to patch-in means to establish repository
  clones that do not involve calling `git-clone`

- This is related to
  datalad/datalad-registry#7 -- while we
  already have flexibility to add handling of new URL-types (e.g., URL
  mapping, subdataset candidate URL configurations), now, either
  `_try_clone_candidates()` or another utility
  (`_generate_candidate_clone_sources()`) can be patched to query an
  external resolver of some kind, including a DataLad registry

- datalad#5289 -- the datalad-next
  style patching has proven to be somewhat reliable, and continue to
  grow in usage. This series modifies datalad-core to enable more surgical
  patching for add-on functionality. Continuing this for other parts of
  the code base would lessen the need for (possibly expensive) hooks that
  may require additional setup and maintenance.

The patch-in-a-clone-feature approach should be applicable to the
growling list of open issues asking for additional (special interest)
capabilities. Examples:

- datalad#6408
- datalad#5094
- datalad#4142

Lastly, this approach can help to expedite work on rejuvenating datalad
components that struggle to keep up within the corset of datalad-core,
with its ambition to remain as stable as possible. One example is the
RIA functionality that alone is associated with 7% of the open issues,
and would benefit from a few faster, less constrained development
cycles. The previous impossibility of `clone` to be customized
externally (like done here), was one of the main reasons for having to
introduce RIA in datalad-core directly, rather than having it prove
itself in a dedicated extension.
@codeclimate
Copy link

codeclimate bot commented Sep 7, 2022

Code Climate has analyzed commit 3b38dae and detected 19 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 19

View more on Code Climate.

@mih
Copy link
Member Author

mih commented Sep 7, 2022

Doc build failure is fixed in #7018

@codecov
Copy link

codecov bot commented Sep 7, 2022

Codecov Report

Merging #7017 (3b38dae) into master (85102f6) will decrease coverage by 14.00%.
The diff coverage is 87.50%.

@@             Coverage Diff             @@
##           master    #7017       +/-   ##
===========================================
- Coverage   90.13%   76.13%   -14.01%     
===========================================
  Files         354      357        +3     
  Lines       46352    59112    +12760     
  Branches     6610     6615        +5     
===========================================
+ Hits        41781    45002     +3221     
- Misses       4555    14094     +9539     
  Partials       16       16               
Impacted Files Coverage Δ
datalad/core/distributed/clone.py 65.39% <82.51%> (-25.86%) ⬇️
datalad/core/distributed/clone_ria.py 88.23% <88.23%> (ø)
datalad/core/distributed/clone_utils.py 89.57% <89.57%> (ø)
datalad/core/distributed/clone_ephemeral.py 91.30% <91.30%> (ø)
datalad/core/distributed/tests/test_clone.py 75.48% <100.00%> (-22.12%) ⬇️
datalad/tests/test_misc.py 48.00% <0.00%> (-52.00%) ⬇️
datalad/core/local/tests/test_results.py 58.33% <0.00%> (-41.67%) ⬇️
datalad/tests/test_api.py 56.14% <0.00%> (-40.83%) ⬇️
datalad/local/tests/test_configuration.py 62.12% <0.00%> (-37.88%) ⬇️
datalad/runner/tests/test_exception.py 63.76% <0.00%> (-36.24%) ⬇️
... and 166 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@mih
Copy link
Member Author

mih commented Sep 7, 2022

Container extension test failures a
look same as in #7018

I think this is ready to go

mih added a commit to mih/datalad that referenced this pull request Sep 8, 2022
This changeset is part of an effort to strip datalad-core off
non-essential components in order to reduce technical debt and
improve maintainability.

RIA (the concept of a dataset store, the git-annex special remote
the URL handling) is not an essential component, but an application
built on the datalad core.

datalad#7017 shows how it can be
completely separated from the core, because it is only integrated
in within the `clone()` command -- all other code is located in
pieces that can be installed in parallel.

At the same time RIA is large (~6k lines of code), and has accumulated
as significant number of unaddressed technical and conceptual issues.
The amount to more than 7% of the total number of currently open issues
in datalad-core. Moreoever, it is severly underdocumented.

This change documents the code pieces that make up RIA, and the impact
of their removal from core. They would move into an extension package,
and remain available to users.

The most significant impact of this removal is the removal of a set
of tests of core functionality that made use of RIA in one form or
another. This loss could trun out to be substantial and require
compensatory actions. However, these would have been needed anyways
(and or simply harder to ignore now), because the RIA tooling did not
work across all target platforms, and the majority of these tests
were skipped on one or more platform.
@mih mih mentioned this pull request Sep 8, 2022
mih added a commit to mih/datalad that referenced this pull request Sep 8, 2022
This changeset is part of an effort to strip datalad-core off
non-essential components in order to reduce technical debt and
improve maintainability.

RIA (the concept of a dataset store, the git-annex special remote
the URL handling) is not an essential component, but an application
built on the datalad core.

datalad#7017 shows how it can be
completely separated from the core, because it is only integrated
in within the `clone()` command -- all other code is located in
pieces that can be installed in parallel.

At the same time RIA is large (~6k lines of code), and has accumulated
as significant number of unaddressed technical and conceptual issues.
The amount to more than 7% of the total number of currently open issues
in datalad-core. Moreoever, it is severly underdocumented.

This change documents the code pieces that make up RIA, and the impact
of their removal from core. They would move into an extension package,
and remain available to users.

The most significant impact of this removal is the removal of a set
of tests of core functionality that made use of RIA in one form or
another. This loss could trun out to be substantial and require
compensatory actions. However, these would have been needed anyways
(and or simply harder to ignore now), because the RIA tooling did not
work across all target platforms, and the majority of these tests
were skipped on one or more platform.
mih added a commit to mih/datalad that referenced this pull request Sep 8, 2022
This changeset is part of an effort to strip datalad-core off
non-essential components in order to reduce technical debt and
improve maintainability.

RIA (the concept of a dataset store, the git-annex special remote
the URL handling) is not an essential component, but an application
built on the datalad core.

datalad#7017 shows how it can be
completely separated from the core, because it is only integrated
in within the `clone()` command -- all other code is located in
pieces that can be installed in parallel.

At the same time RIA is large (~6k lines of code), and has accumulated
as significant number of unaddressed technical and conceptual issues.
The amount to more than 7% of the total number of currently open issues
in datalad-core. Moreoever, it is severly underdocumented.

This change documents the code pieces that make up RIA, and the impact
of their removal from core. They would move into an extension package,
and remain available to users.

The most significant impact of this removal is the removal of a set
of tests of core functionality that made use of RIA in one form or
another. This loss could trun out to be substantial and require
compensatory actions. However, these would have been needed anyways
(and or simply harder to ignore now), because the RIA tooling did not
work across all target platforms, and the majority of these tests
were skipped on one or more platform.
@yarikoptic
Copy link
Member

docs - unrelated, fixed AFAIK elsewhere.

@mih
Copy link
Member Author

mih commented Sep 13, 2022

Thx!

@mih mih merged commit 03f027b into datalad:master Sep 13, 2022
@mih mih deleted the rf-clone branch September 13, 2022 13:42
@mih mih mentioned this pull request Sep 21, 2022
@yarikoptic-gitmate
Copy link
Collaborator

PR released in 0.18.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmd-clone semver-minor Increment the minor version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants