Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generator-based GitRepo.call_git_items_() #6278

Merged
merged 6 commits into from
Dec 7, 2021

Conversation

christian-monch
Copy link
Contributor

This PR uses a generator-based runner to, i.e. a runner with a GeneratorMixIn-Protocol subclass, to implement GitRepo.call_git_items_().

This is a follow up to PR #6244, especially the following reviewer comment: #6244 (comment)

The interface of GitRepo.call_git_items_() does not change, but results will be yielded as soon as they are available.

@christian-monch christian-monch added the semver-internal Changes only affect the internal API label Dec 3, 2021
@christian-monch christian-monch requested review from mih and bpoldrack and removed request for bpoldrack December 3, 2021 04:54
@codecov
Copy link

codecov bot commented Dec 3, 2021

Codecov Report

Merging #6278 (16668da) into master (6fae464) will decrease coverage by 1.30%.
The diff coverage is 91.32%.

❗ Current head 16668da differs from pull request most recent head d44f389. Consider uploading reports for the commit d44f389 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6278      +/-   ##
==========================================
- Coverage   89.84%   88.53%   -1.31%     
==========================================
  Files         329      329              
  Lines       42867    42936      +69     
==========================================
- Hits        38512    38013     -499     
- Misses       4355     4923     +568     
Impacted Files Coverage Δ
datalad/dataset/tests/test_gitrepo.py 88.48% <ø> (ø)
datalad/runner/utils.py 100.00% <ø> (ø)
datalad/support/annexrepo.py 89.02% <0.00%> (-2.13%) ⬇️
datalad/support/tests/test_gitrepo.py 99.77% <ø> (ø)
datalad/distribution/uninstall.py 63.26% <12.50%> (-8.17%) ⬇️
datalad/interface/base.py 89.22% <50.00%> (-0.44%) ⬇️
datalad/cmdline/helpers.py 74.07% <90.00%> (+0.64%) ⬆️
datalad/cmdline/tests/test_main.py 83.06% <95.65%> (-1.77%) ⬇️
datalad/dataset/gitrepo.py 96.69% <95.89%> (-0.97%) ⬇️
datalad/__init__.py 78.88% <100.00%> (-5.91%) ⬇️
... and 98 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6fae464...d44f389. Read the comment docs.

@mih
Copy link
Member

mih commented Dec 3, 2021

Are you planning to re-consolidate the additions to call_git_items_() that are modified from _call_git()?

@christian-monch
Copy link
Contributor Author

Are you planning to re-consolidate the additions to call_git_items_() that are modified from _call_git()?

Yes, I will do that now.

@christian-monch
Copy link
Contributor Author

christian-monch commented Dec 3, 2021

@mih: The latest commit consolidates _call_git, call_git, and call_git_items in GitRepo. All three methodes are now based on _generator_call_git.

The changes implement a drop-in replacement in order to keep the PR small, i.e. only very few changes outside of GitRepo. The drop-in replacement capability limits the implementation space:

  1. careful handling of line-endings is required, which might seem strange in some places,
  2. generators might have to be unfolded earlier than one would wish for.

Having said that, I will create an issue to clean up the usage of call_git and _call_git. The latter returns a two-tuple, but only the first component is ever used, if it is used at all. That might also allow us to pass generators one level higher

@mih
Copy link
Member

mih commented Dec 6, 2021

I do not fully understand your last post. I can confirm that this code works, and is at least as fast as it was before. So all these signs point to a quick merge.

What I do not understand yet (and I assume this is what your last post was about), it why we have _call_git() and call_git_items_() being near-copies of each other.

This leads to possibly undesirable situations, like call_git_success() calling _call_git(), which now executes _generator_call_git(), but will only ever return a boolean. I assume this must be a more complex operation now than it was before.

FTR: unlike the other methods, _call_git() need not maintain its behavior, it is specifically not part of the public API -- and only used in 7 places throughout the code base.

@mih mih removed their request for review December 6, 2021 07:37
datalad/support/gitrepo.py Outdated Show resolved Hide resolved
Use context managers to factor out common code
from GitRepo._call_git and GitRepo.call_git_items_()
@christian-monch
Copy link
Contributor Author

christian-monch commented Dec 6, 2021

I do not fully understand your last post. I can confirm that this code works, and is at least as fast as it was before. So all these signs point to a quick merge.

I should have been clearer. Top priority in this PR was to keep it small and therefore keep internal and external APIs unmodified.

What I do not understand yet (and I assume this is what your last post was about), it why we have _call_git() and call_git_items_() being near-copies of each other.

With the aim of keeping _call_git() (due to the priority stated above), the similarities stem from the necessity to release the write_lock and from the GitIgnoreError-detection. The former is relatively easy to achieve with the try-finally-clause. Therefore it is present within _call_git and call_git_items_() (and could have been done simpler with context managers, as pushed in b6ff536).

This leads to possibly undesirable situations, like call_git_success() calling _call_git(), which now executes _generator_call_git(), but will only ever return a boolean. I assume this must be a more complex operation now than it was before.

Although the general complexity (O-Notation) remains the same for the same inputs, the additional method call might very well introduce some delay, i.e. increase a constant factor. That can be mitigated by cutting out _call_git() and using _generator_call_git(), which is probably what you had in mind (I did not do that in the first approach due to the small-diff-goal, but it is pushed now in ).

FTR: unlike the other methods, _call_git() need not maintain its behavior, it is specifically not part of the public API -- and only used in 7 places throughout the code base.

Its return value is directly returned by datalad.support.gitrepo.GitRepo.add_remote, although the return values are not documented, I did not go for changing it. If we can change the signature of add_remote, I will remove it.

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty cool pattern to hide-away the try-except complexities! Awesome!

I left a suggestion to generalize the lock-helper a bit further to be able to take it out of the context of GitRepo. WDYT?

datalad/dataset/gitrepo.py Outdated Show resolved Hide resolved
Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my POV we can run with this! Thx!

@mih
Copy link
Member

mih commented Dec 7, 2021

The MacOS failure is more fundamental #6285

@mih mih merged commit 50ecb5c into datalad:master Dec 7, 2021
@adswa
Copy link
Member

adswa commented Jun 8, 2022

Hey @christian-monch waving for attention 👋
The failures reported in datalad/datalad-ukbiobank#82 and the wrong fix and subsequent failure in PR datalad/datalad-ukbiobank#85 revealed that the output of any repo.call_git() call is stripped of the last line's trailing slash. This seems to have been introduced by you in this PR, and we debated in the chat that using call_git_() would fix this issue. But I'm not familiar enough with this code base to touch it, and as you changed it last and probably had good reasons to switch, could you take a look at how we can best fix this?
FTR, this patch (thx to @yarikoptic) would fix it for me, but unsure of side effects or better approaches:

diff --git a/datalad/dataset/gitrepo.py b/datalad/dataset/gitrepo.py
index a1e32b450..d7b831bca 100644
--- a/datalad/dataset/gitrepo.py
+++ b/datalad/dataset/gitrepo.py
@@ -429,12 +429,12 @@ class GitRepo(RepoInterface, metaclass=PathBasedFlyweight):
         ------
         CommandError if the call exits with a non-zero status.
         """
-        return "\n".join(
-            self.call_git_items_(args,
-                                 files,
-                                 expect_stderr=expect_stderr,
-                                 expect_fail=expect_fail,
-                                 read_only=read_only))
+
+        return self._call_git(args,
+                 files,
+                 expect_stderr=expect_stderr,
+                 expect_fail=expect_fail,
+                 read_only=read_only)[0]
 
     def call_git_items_(self,
                         args,

There were also a few comments by @yarikoptic in the chat:

BTW _call_git docstring needs fixing:

        The parameters, return value, and raised exceptions match those
        documented for `call_git`.

and

that is odd -- worth checking if call_git atm just passes stderr to screen?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-internal Changes only affect the internal API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants