
RF: Bypass git-submodule update --init when getting subdatasets #4853

Merged: 6 commits into datalad:master from mih:opt-subdsupdate on Sep 5, 2020

Conversation

@mih (Member) commented Sep 1, 2020

Doing it with config settings has two advantages.

  • benefits from interprocess locking of datalad ConfigManager, such that
    we can "get" in parallel without running into locking issues.
    (datalad -f '{path}' subdatasets | xargs -n 1 -P10 datalad get -n)

  • we can set the real source URL of the successful clone, whereas
    GitRepo.update_submodule() is what causes a guess to be made.

    For example, cloning a superdataset from

    git@github.com:datalad-datasets/human-connectome-project-openaccess.git

    would lead to invalid submodule URLs like:

    git@github.com:datalad-datasets/human-connectome-project-openaccess.git/HCP1200/100307

    whereas the actual clone came from

    http://store.datalad.org/8fd/bc694-2c2b-11ea-b467-0025904abcb0

This change has a high potential for missed corner-cases. It would be good to get it into master well before a planned release.

For example, we are resetting the existing master branch to the recorded commit. Before, we might have had to fix a detached state that was created by git-submodule update. I think this is a net win, but maybe we are missing out on other standardizations.
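The config-based registration described above can be sketched roughly as follows. This is a minimal illustration, not datalad's actual implementation: the helper name `register_subdataset_clone` and the use of plain `git config` calls are assumptions.

```python
# Sketch of the config-based approach: clone the subdataset ourselves,
# then record the URL the clone *actually* came from in the
# superdataset's .git/config, instead of letting
# `git submodule update --init` derive a possibly invalid URL from the
# superdataset's own origin.
import subprocess
from pathlib import Path


def register_subdataset_clone(superds: Path, subpath: str, real_source: str) -> None:
    """Record the real clone source for submodule `subpath` (hypothetical helper)."""
    cfg = str(superds / ".git" / "config")
    # mark the submodule as active/initialized ...
    subprocess.run(
        ["git", "config", "--file", cfg, f"submodule.{subpath}.active", "true"],
        check=True)
    # ... and point its URL at the location the clone really came from
    subprocess.run(
        ["git", "config", "--file", cfg, f"submodule.{subpath}.url", real_source],
        check=True)
```

Because datalad's ConfigManager serializes such writes with interprocess locking, several `datalad get -n` processes can register their clones concurrently without clobbering each other's config edits.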

@mih mih added the performance Improve performance of an existing feature label Sep 1, 2020
codecov bot commented Sep 1, 2020

Codecov Report

Merging #4853 into master will increase coverage by 0.00%.
The diff coverage is 95.91%.


@@           Coverage Diff            @@
##           master    #4853    +/-   ##
========================================
  Coverage   89.69%   89.70%            
========================================
  Files         289      289            
  Lines       40480    40675   +195     
========================================
+ Hits        36310    36486   +176     
- Misses       4170     4189    +19     
Impacted Files Coverage Δ
datalad/distribution/tests/test_get.py 99.69% <95.83%> (-0.31%) ⬇️
datalad/distribution/get.py 97.45% <96.00%> (+1.23%) ⬆️
datalad/downloaders/credentials.py 81.87% <0.00%> (-4.80%) ⬇️
datalad/downloaders/tests/test_http.py 89.92% <0.00%> (-1.23%) ⬇️
datalad/downloaders/tests/test_s3.py 72.09% <0.00%> (-1.08%) ⬇️
datalad/downloaders/s3.py 57.66% <0.00%> (-0.86%) ⬇️
datalad/cmd.py 91.52% <0.00%> (-0.76%) ⬇️
datalad/support/s3.py 22.98% <0.00%> (-0.69%) ⬇️
datalad/utils.py 82.46% <0.00%> (-0.42%) ⬇️
datalad/downloaders/http.py 84.55% <0.00%> (-0.39%) ⬇️
... and 21 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7b416b4...816bd53.

@mih mih added the merge-if-ok OP considers this work done, and requests review/merge label Sep 1, 2020
@bpoldrack (Member) left a comment

Not sure about any implications tbh, but going through the code it looks right.

@kyleam (Contributor) left a comment

Doing it with config settings has two advantages.

  • benefits from interprocess locking of datalad ConfigManager, such that
    we can "get" in parallel without running into locking issues.
    (datalad -f '{path}' subdatasets | xargs -n 1 -P10 datalad get -n)

Makes sense.

  • we can set the real source URL of the successful clone, whereas
    GitRepo.update_submodule() is what causes a guess to be made.

    For example, cloning a superdataset from

That should be subdataset, right?

git@github.com:datalad-datasets/human-connectome-project-openaccess.git

would lead to invalid submodule URLs like:

git@github.com:datalad-datasets/human-connectome-project-openaccess.git/HCP1200/100307

whereas the actual clone came from

http://store.datalad.org/8fd/bc694-2c2b-11ea-b467-0025904abcb0

If I'm thinking about this correctly, the main advantage here, beyond .git/config being more informative to someone inspecting it, is that direct git callers will be able to update from the correct place. That sounds good. Or are there other places where this currently comes into play?

# TODO we could simultaneously check out a configured branch
# and set up a tracking branch. This should be fine, as the implied
# config manipulation is local to the subdataset.
sub.call_git(['reset', '--hard', sm['gitshasum']])
@kyleam (Contributor) commented:

Hmm, so this is a bit more aggressive than the logic in update_submodule. We could end up resetting an unrelated branch. What about checking out gitshasum and then using the same get_merge_base check from update_submodule? (It looks like it could probably be pulled out into a helper fairly cleanly, but perhaps I'm missing some complication.)

Another potential problem here is that gitshasum might not have been pulled in with the clone. git submodule update deals with this by following up with a direct fetch of the commit to try to retrieve it. We could do the same, but at the very least I think we should translate the failed reset into an error result.
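The checkout-plus-merge-base approach suggested above could look roughly like this. It is a hypothetical sketch using plain git calls; `safe_update_worktree` is not a real datalad function, and datalad's actual code works through GitRepo methods.

```python
import subprocess


def safe_update_worktree(repo: str, target_commit: str, branch: str) -> bool:
    """Move `branch` to `target_commit` only if that is a fast-forward;
    otherwise leave a detached HEAD at the target, mirroring the
    get_merge_base()-style check in GitRepo.update_submodule().
    Returns True if the branch itself was updated."""
    def git(*args):
        subprocess.run(["git", "-C", repo, *args],
                       check=True, capture_output=True, text=True)

    # is the branch tip an ancestor of the target (i.e. a fast-forward)?
    is_ff = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", branch, target_commit],
        capture_output=True).returncode == 0
    if is_ff:
        git("checkout", branch)
        git("merge", "--ff-only", target_commit)
        return True
    # unrelated branch state: do not reset the branch, just detach
    git("checkout", "--detach", target_commit)
    return False
```

The key difference from a bare `reset --hard` is that a branch pointing somewhere unrelated is left untouched rather than silently rewritten.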

demo

Patch to make cloning a local repo behave like cloning a remote one.

diff --git a/datalad/core/distributed/clone.py b/datalad/core/distributed/clone.py
index 4b3dd898d..f11f99d6e 100644
--- a/datalad/core/distributed/clone.py
+++ b/datalad/core/distributed/clone.py
@@ -426,7 +426,7 @@ def clone_dataset(
             update=1,
             increment=True)

-        clone_opts = {}
+        clone_opts = {"no_local": True}

         if cand.get('version', None):
             clone_opts['branch'] = cand['version']
#!/bin/sh

set -eu

cd "$(mktemp -d "${TMPDIR:-/tmp}"/dl-XXXXXXX)"

git init a
(
    cd a
    git commit --allow-empty -ma
    git init sub
    (
        cd sub
        git commit --allow-empty -msub
        git checkout master^0
        git commit --allow-empty -mdetached
        git config uploadpack.allowAnySHA1InWant true
    )
    git submodule add ./sub
    git commit -m'add sub'
    git -C sub checkout master
)

git clone a b
(
    cd b
    datalad get sub
)

@mih (Member, Author) commented Sep 2, 2020

Thanks for the comments and pointers. I pushed 7f10a4b that hopefully addresses them.

@mih (Member, Author) commented Sep 2, 2020

  • we can set the real source URL of the successful clone, whereas
    GitRepo.update_submodule() is what causes a guess to be made.
    For example, cloning a superdataset from

That should be subdataset, right?

No, the first URL below is for the superdataset, but the differences reported next are for the subdatasets of it that were obtained via "get".

If I'm thinking about this correctly, the main advantage of here, beyond .git/config being more informative to someone inspecting it, is that direct git callers will be able to update from the correct place. That sounds good. Or are there other places where this currently comes into play?

ATM I am not aware of any additional (side-)effects.

mih added a commit to mih/datalad that referenced this pull request Sep 2, 2020
It now replicates what was done in GitRepo.update_submodule(), but
uses information that is already available in the present scope.

Thanks to @kyleam for pointing out the issues with the previous
approach
datalad#4853 (comment)

Includes a TODO for better integration with git-submodule branch
configuration.
# if we are on a branch this hexsha will be the tip of that branch
sub_orig_hexsha = sub.get_hexsha()
# make sure we have the desired commit locally
sub.fetch(remote='origin', refspec=target_commit)
@kyleam (Contributor) commented:

Thanks for the update.

I think we should guard this fetch with a commit_exists call because (1) in the common case the commit will already exist, and that local check is less expensive than an unnecessary fetch, and (2) a direct fetch of a commit may fail depending on the remote's configuration.

If the commit isn't present locally and the fetch fails, it seems like it'd be good to yield an error result so the operation can respect --on-failure.
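A guarded fetch along these lines might look like the following sketch using plain git. It is illustrative only: `ensure_commit` is a hypothetical name, and datalad's actual code would go through GitRepo's commit_exists and fetch methods.

```python
import subprocess


def ensure_commit(repo: str, commit: str) -> bool:
    """Make sure `commit` is present locally, fetching it from origin
    only when needed (hypothetical stand-in for the suggested guard)."""
    # cheap local check first: in the common case the clone already
    # brought the commit along
    have = subprocess.run(
        ["git", "-C", repo, "cat-file", "-e", commit + "^{commit}"],
        capture_output=True).returncode == 0
    if have:
        return True
    # direct fetch of a single commit; this can fail unless the remote
    # permits it (e.g. uploadpack.allowAnySHA1InWant=true)
    res = subprocess.run(["git", "-C", repo, "fetch", "origin", commit],
                         capture_output=True)
    return res.returncode == 0
```

A False return here is the case that should be turned into an error result, so that get can honor --on-failure.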

@mih (Member, Author) commented:

Done. Thx for the analysis. I fully agree.

@kyleam (Contributor) commented:

Thanks. I've tweaked it a bit more to catch the CommandError and added a test. Of course, feel free to drop any of the changes.

@mih (Member, Author) commented:

Thx very much, in particular for the test!

mih and others added 3 commits September 4, 2020 09:39
A failed fetch will raise a CommandError, so it needs to be caught to
get to the downstream processing.
This series drops get's use of 'git submodule update --init'.  Like
'git submodule update --init', our custom logic tries to fetch a
subdataset commit directly if it's not present.  Add a test for that.

Note that SSH is needed because, for a local path, git-clone's --local
behavior would kick in and the commit would be available.
@yarikoptic (Member) left a comment

I am not thrilled about re-implementing git functionality, given the stated "high potential for missed corner-cases".
Also, this PR largely re-implements logic already provided within update_submodule for resetting HEAD upon ending up in a detached state, so we might end up needing to change both code places (if we remember that both exist) if we decide to change the logic. IMHO that portion should have been extracted into a helper used by both implementations.

But ok -- let's proceed as is. I left only one request to future-proof the code a bit.

'a fetch that commit from origin failed',
target_commit[:8]),
)
yield res
@yarikoptic (Member) commented:

I believe this yield should be removed, breaking out of the loop with the break below and yielding res at the end of the function.
ATM the break line below is not reached according to the coverage annotation, because the outside code simply stops iterating over this generator. But that outside behavior might change -- then we would reach the break and re-yield the same res, producing duplicate records outside and possibly causing confusion or incorrect behavior.

@mih (Member, Author) commented:

Ah, no. But thanks for raising the issue. The fix, however, is to replace the break immediately below with a continue. Whether or not processing should stop here must be left for the caller to decide (on_failure=stop vs. on_failure=continue). Given that the last break is still inside the loop, the second part of your concern should be addressed as well.
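The control flow being settled here can be illustrated with a toy generator. The names and record shapes below are simplified stand-ins, not datalad's actual result records or API.

```python
# The worker yields one result record per subdataset and uses
# `continue` on failure, so the *caller* decides whether to stop
# consuming (on_failure='stop') or keep going (on_failure='ignore').
def get_subdatasets(subs):
    for sub in subs:
        ok = sub.get("available", True)  # stand-in for the real fetch/checkout
        if not ok:
            yield {"path": sub["path"], "status": "error"}
            continue  # not `break`: stopping is the caller's decision
        yield {"path": sub["path"], "status": "ok"}


def run(subs, on_failure="ignore"):
    results = []
    for res in get_subdatasets(subs):
        results.append(res)
        if res["status"] == "error" and on_failure == "stop":
            break  # caller stops iterating the generator
    return results
```

With `continue`, each record is yielded exactly once, so a consumer that keeps iterating past a failure sees no duplicates.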

@yarikoptic (Member) commented:

Wouldn't it yield res twice if the last loop iteration ends up here?

@mih (Member, Author) commented:

A continue in the last iteration should still skip the rest of the loop body. How could it reach that second yield?

@yarikoptic (Member) commented Sep 5, 2020:

It is too much of a spaghetti now to see clearly what is happening. I think I miscounted the indentation of line 406 and thought it was outside the loop body; sorry for the noise.

@mih (Member, Author) commented Sep 5, 2020

Also, this PR largely re-implements logic already provided within update_submodule for resetting HEAD upon ending up in a detached state, so we might end up needing to change both code places (if we remember that both exist) if we decide to change the logic. IMHO that portion should have been extracted into a helper used by both implementations.

I share the sentiment. Once this is merged, I will propose a PR to rip out large parts of the submodule functionality from GitRepo. At that point it will be a largely (or completely) unused part of the codebase that is less efficient to use. But we will have the chance to discuss that, when there is a concrete proposal.

The previous `break` would forcibly cancel all processing on failure,
even with `on_failure=ignore`.
@mih (Member, Author) commented Sep 5, 2020

All green. Will merge.

@mih mih merged commit 105e1ff into datalad:master Sep 5, 2020
@mih mih deleted the opt-subdsupdate branch September 5, 2020 09:31
kyleam added a commit to kyleam/datalad that referenced this pull request Sep 25, 2020
datalad#4853 reworked get() to avoid using repo.update_submodule().
_install_subds_from_flexible_source() now has logic that

  * tries a direct fetch if the target commit isn't available locally
  * unconditionally checks out the target commit
  * tries to avoid a detached HEAD by moving the originally checked
    out branch back to HEAD if the branch contains HEAD

It does all of these even if the starting commit matches the target
commit.  The unconditional checkout means that we always end up in a
detached HEAD, leading to confusing messages like

    [INFO ] Reset subdataset branch 'master' to b292d7e3 (from
    b292d7e3) to avoid a detached HEAD

Make all of the above logic happen only when the current commit
doesn't match the target.
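The fix described in this follow-up amounts to an early return when HEAD already matches the recorded commit. A minimal sketch, with a hypothetical helper name and plain git calls standing in for the real implementation:

```python
import subprocess


def align_worktree(repo: str, target_commit: str) -> None:
    """Only touch the worktree when HEAD differs from the recorded commit."""
    head = subprocess.run(
        ["git", "-C", repo, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True).stdout.strip()
    if head == target_commit:
        # already at the recorded commit: no checkout, no detached HEAD,
        # no confusing "Reset subdataset branch ..." message
        return
    subprocess.run(
        ["git", "-C", repo, "checkout", "--detach", target_commit],
        check=True)
```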
Labels
merge-if-ok OP considers this work done, and requests review/merge performance Improve performance of an existing feature
4 participants