Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gitrepo: Expose methods for direct git calls #3791

Merged
merged 2 commits into from Oct 19, 2019
Merged

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Oct 15, 2019

This is a draft PR with the initial implementation of the methods mentioned in gh-3789. There's a lot of wiggle room in terms of how to lump or split (you'll notice that I merged a couple of the methods from gh-3789 together), so any feedback on that would be welcome.

@codecov
Copy link

@codecov codecov bot commented Oct 15, 2019

Codecov Report

Merging #3791 into master will decrease coverage by <.01%.
The diff coverage is 95.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3791      +/-   ##
==========================================
- Coverage   80.77%   80.76%   -0.01%     
==========================================
  Files         273      273              
  Lines       35944    35974      +30     
==========================================
+ Hits        29033    29055      +22     
- Misses       6911     6919       +8
Impacted Files Coverage Δ
datalad/core/local/tests/test_create.py 100% <ø> (ø) ⬆️
datalad/support/tests/test_annexrepo.py 96.38% <ø> (ø) ⬆️
datalad/local/subdatasets.py 81.19% <0%> (ø) ⬆️
datalad/interface/rerun.py 62.18% <0%> (ø) ⬆️
datalad/interface/tests/test_save.py 100% <100%> (ø) ⬆️
datalad/interface/tests/test_unlock.py 100% <100%> (ø) ⬆️
datalad/interface/tests/test_rerun.py 100% <100%> (ø) ⬆️
datalad/support/repodates.py 95.87% <100%> (+0.04%) ⬆️
datalad/support/gitrepo.py 84.37% <100%> (+0.19%) ⬆️
datalad/interface/tests/test_rerun_merges.py 100% <100%> (ø) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f05760...7649f1b. Read the comment docs.

@bpoldrack
Copy link
Member

@bpoldrack bpoldrack commented Oct 16, 2019

That looks very neat and nice to me so far.

Copy link
Member

@mih mih left a comment

I left a few comment, one re the method signatures that prevents me from approving immediately. But in general, this looks great and we should do this IMHO.

Let's just quickly discuss, if the fake_dates thing is critically needed in this exact fashion.

@@ -153,7 +153,7 @@ def check_renamed_file(recursive, no_annex, path):
ds = Dataset(path).create(no_annex=no_annex)
create_tree(path, {'old': ''})
ds.repo.add('old')
ds.repo._git_custom_command(['old', 'new'], ['git', 'mv'])
ds.repo.call_git(["mv"], files=["old", "new"])
Copy link
Member

@mih mih Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I derive intense pleasure from seeing the files argument being second to the actual command!

expect_fail : bool, optional
A non-zero exit is expected and should not be elevated above the
DEBUG level.
check_fake_dates : boolean, optional
Copy link
Member

@mih mih Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't played with fake dates yet, but I wonder whether it is better (or needed) to expose this on a method-level, rather than at repo-level. In other words, would be be sufficient to put a repo instance into "fake dates modes" and have all methods act accordingly by querying an internal flag, rather than having one in each relevant method signature?

Copy link
Contributor Author

@kyleam kyleam Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up. I also don't like the idea of exposing check_fake_dates here. I'll have to take a closer look at dropping it.

Copy link
Contributor Author

@kyleam kyleam Oct 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be be sufficient to put a repo instance into "fake dates modes" and have all methods act accordingly by querying an internal flag

So this is essentially already the case. The .fake_dates_enabled property involves an initial .config.getbool call and then the value is cached at _fake_dates_enabled. I ended up just using _git_custom_command(..., check_fake_dates=True) for all the calls. Here's the relevant bit from the commit message:

All the methods call _git_custom_command() with check_fake_dates=True.
This will be unnecessary for nearly all repositories, but not doing so
risks leaking dates in repositories configured to use fake dates.  The
other option would be to add a check_fake_dates parameter to these
methods, which is ugly because this is an obscure parameter that most
callers should not have to worry about.  Unconditionally using
check_fake_dates=True costs an attribute lookup and a
.config.getbool() call on the first use and then a _fake_dates_enabled
attribute lookup on all remaining uses for that instance.  Here are
times of an example command on the datalad repo:

  % python -m timeit -n100 \
    -s "from datalad.support.gitrepo import GitRepo" \
    -s "repo = GitRepo('.')" --  "repo.call_git(['ls-files'])"

  Results of two runs with `check_fake_dates=False`:

  1) 100 loops, best of 5: 3.87 msec per loop
  2) 100 loops, best of 5: 3.87 msec per loop

  Results of two runs with `check_fake_dates=True`:

  1) 100 loops, best of 5: 3.97 msec per loop
  2) 100 loops, best of 5: 3.96 msec per loop

Copy link
Member

@mih mih Oct 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thx.

Returns
-------
Generator that yields output items.
Copy link
Member

@mih mih Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q (without any hidden messages): Should we do tandems of functions (list and generator) like we did elsewhere, for reasons of consistency? Or should we not?

I think the issue back than was a report on confusion that some processing didn't happen, because the generator wasn't consumed...

However, I am very open to break with this "tradition". I have not found myself switching between these often (or ever), and typically it is the generator that is ultimately the right choice.

Copy link
Contributor Author

@kyleam kyleam Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally haven't found the convention useful, but I didn't think too deeply about it here. I'm ok doing that to be consistent if others have a preference.

But before thinking about that, there is one to-do I left for myself in the commit message that I wanted to bring up:

todo: think more about whether using generator _items is beneficial

While I like generators, in the case of .call_git_items() I don't see what the advantage is. The methods depend on .splitlines() and .split(), both of which already return a list. So why not just return the list? My intuition was that using the generator in the case doesn't save any memory. Running quick tests with vprof -c m seem to support that the generator doesn't do better memory-wise for the "return .splitlines() result" scenario. What am I missing?

Copy link
Member

@mih mih Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the present context you aren't missing anything (given the limitations of the present Runner class, etc). I was thinking more like: If we ever can or need to handle "long" running processes that yield output line-by-line that could be acted upon, do we want to remain limited by this when installing this new API part now. I have no strong feelings, but it seems now is the time to bring it up.

Copy link
Contributor Author

@kyleam kyleam Oct 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. That's convincing enough for me. Thanks.

So, regarding the tandem, my current plan is to rename _items to have a trailing underscore and punt on adding the non-generator form until there's a need. That way at least the naming is consistent with the new style.

datalad/support/gitrepo.py Show resolved Hide resolved
@mih mih mentioned this pull request Oct 16, 2019
3 tasks
kyleam added 2 commits Oct 19, 2019
Our approach of providing lots of methods that wrap different Git
subcommands leads to us adding one-off methods, many of which do the
bare minimum for the use case we have at that time.  The minimal
implementation is a good thing if the code is meant to be used
internally because there's no point in trying to support and test
knobs we don't need, but it doesn't make for a nice external
interface, and it leads to churn and incompatibilities when we later
need to extend the method for another use case.

Let's instead expose a set of methods wrap simple calls based on the
return value.  Introducing these will hopefully have the following
benefits:

  * It is a set of commands that we can offer to third-party callers
    if they need a command that we don't have but still want to call
    git through us.

  * Likewise, it is a set of commands we can use to avoid adding
    one-off methods.

  * We already use _git_custom_command() in many spots outside of
    gitrepo.py to avoid adding more methods to GitRepo.  These new
    methods would simplify these calls.

  * If we decide to deprecate a method or change it in an incompatible
    way, we can point to these methods as an alternative.

The arguments of these new methods are pretty minimal compared to what
_git_custom_command() accepts.  The idea is to keep them simple until
we need to expose more options, given we can extend the keyword
arguments in a backward compatible way.  Between the new methods, the
arguments are largely consistent, but call_git() has `expect_fail`
while call_git_{oneline,items_} do not because we don't have a spot
where that'd be used yet in our code base.

Note that call_git_items_() is a generator.  While this isn't
particularly useful at the moment because str.split{,lines} already
put the entire list into memory, it keeps open the possibility of
changing the internal implementation to an approach the doesn't load
the entire output in memory.

All the methods call _git_custom_command() with check_fake_dates=True.
This will be unnecessary for nearly all repositories, but not doing so
risks leaking dates in repositories configured to use fake dates.  The
other option would be to add a check_fake_dates parameter to these
methods, which is ugly because this is an obscure parameter that most
callers should not have to worry about.  Unconditionally using
check_fake_dates=True costs an attribute lookup and a
.config.getbool() call on the first use and then a _fake_dates_enabled
attribute lookup on all remaining uses for that instance.  Here are
times of an example command on the datalad repo:

  % python -m timeit -n100 \
    -s "from datalad.support.gitrepo import GitRepo" \
    -s "repo = GitRepo('.')" --  "repo.call_git(['ls-files'])"

  Results of two runs with `check_fake_dates=False`:

  1) 100 loops, best of 5: 3.87 msec per loop
  2) 100 loops, best of 5: 3.87 msec per loop

  Results of two runs with `check_fake_dates=True`:

  1) 100 loops, best of 5: 3.97 msec per loop
  2) 100 loops, best of 5: 3.96 msec per loop

Closes datalad#3789.
This catches most _git_custom_command() usage outside of
{git,annex}repo.py.  utils_testrepos.py still has some.
@kyleam kyleam marked this pull request as ready for review Oct 19, 2019
@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Oct 19, 2019

Pushed an update. Here's the relevant diff:

diff
diff --git a/datalad/core/local/tests/test_create.py b/datalad/core/local/tests/test_create.py
index 27f4eb822..bdc601c58 100644
--- a/datalad/core/local/tests/test_create.py
+++ b/datalad/core/local/tests/test_create.py
@@ -200,8 +200,8 @@ def test_create_sub(path):
     assert_in(
         'submodule.some/what/deeper.datalad-id={}'.format(
             subds.id),
-        list(ds.repo.call_git_items(['config', '--file', '.gitmodules',
-                                     '--list']))
+        list(ds.repo.call_git_items_(['config', '--file', '.gitmodules',
+                                      '--list']))
     )
 
     # subdataset is known to superdataset:
diff --git a/datalad/interface/rerun.py b/datalad/interface/rerun.py
index 1de10e9c7..75cc20095 100644
--- a/datalad/interface/rerun.py
+++ b/datalad/interface/rerun.py
@@ -409,8 +409,7 @@ def _rerun(dset, results, explicit=False):
                         dset.repo.call_git(
                             ["merge", "-m", msg,
                              "--no-ff", "--allow-unrelated-histories"] +
-                            new_parents[1:],
-                            check_fake_dates=True)
+                            new_parents[1:])
                     head = dset.repo.get_hexsha()
                     new_bases[res_hexsha] = head
             yield res
diff --git a/datalad/interface/tests/test_rerun_merges.py b/datalad/interface/tests/test_rerun_merges.py
index fa313c84f..c993b9bc2 100644
--- a/datalad/interface/tests/test_rerun_merges.py
+++ b/datalad/interface/tests/test_rerun_merges.py
@@ -708,8 +708,7 @@ def test_rerun_octopus(path):
     ds.run("echo baz >baz")
     ds.repo.checkout("master")
     ds.repo.call_git(
-        ["merge", "-m", "Merge octopus", "topic-1", "topic-2"],
-        check_fake_dates=True)
+        ["merge", "-m", "Merge octopus", "topic-1", "topic-2"])
     # o-.               f_M
     # |\ \
     # | | o             e_r
diff --git a/datalad/interface/tests/test_save.py b/datalad/interface/tests/test_save.py
index b65d0e86e..6d85b9539 100644
--- a/datalad/interface/tests/test_save.py
+++ b/datalad/interface/tests/test_save.py
@@ -291,7 +291,7 @@ def check_renamed_file(recursive, no_annex, path):
         ds = Dataset(path).create(no_annex=no_annex)
         create_tree(path, {'old': ''})
         ds.add('old')
-        ds.repo.call_git(['git', 'mv'], files=['old', 'new'])
+        ds.repo.call_git(['mv'], files=['old', 'new'])
         ds._save(recursive=recursive)
         ok_clean_git(path)
 
diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
index 23ec958f1..a7837b35a 100644
--- a/datalad/support/gitrepo.py
+++ b/datalad/support/gitrepo.py
@@ -1911,8 +1911,7 @@ def _run_command_files_split(
     # processing or error handling.
 
     def call_git(self, args, files=None,
-                 expect_stderr=False, expect_fail=False,
-                 check_fake_dates=False):
+                 expect_stderr=False, expect_fail=False):
         """Call git and return standard output.
 
         Parameters
@@ -1929,10 +1928,6 @@ def call_git(self, args, files=None,
         expect_fail : bool, optional
           A non-zero exit is expected and should not be elevated above the
           DEBUG level.
-        check_fake_dates : boolean, optional
-          DataLad supports faking the dates stored in repository objects. If
-          this call isn't purely for inspection (e.g., it can lead to a new
-          commit), this flag should be set so the dates are overridden.
 
         Returns
         -------
@@ -1940,15 +1935,15 @@ def call_git(self, args, files=None,
 
         Raises
         ------
-        GitCommandError if the call exits with a non-zero status.
+        CommandError if the call exits with a non-zero status.
         """
         out, _ = self._git_custom_command(files, ["git"] + args,
                                           expect_stderr=expect_stderr,
                                           expect_fail=expect_fail,
-                                          check_fake_dates=check_fake_dates)
+                                          check_fake_dates=True)
         return out
 
-    def call_git_items(self, args, files=None, expect_stderr=False, sep=None):
+    def call_git_items_(self, args, files=None, expect_stderr=False, sep=None):
         """Call git, splitting output on `sep`.
 
         Parameters
@@ -1971,10 +1966,11 @@ def call_git_items(self, args, files=None, expect_stderr=False, sep=None):
 
         Raises
         ------
-        GitCommandError if the call exits with a non-zero status.
+        CommandError if the call exits with a non-zero status.
         """
         out, _ = self._git_custom_command(files, ["git"] + args,
-                                          expect_stderr=expect_stderr)
+                                          expect_stderr=expect_stderr,
+                                          check_fake_dates=True)
         yield from (out.split(sep) if sep else out.splitlines())
 
     def call_git_oneline(self, args, files=None, expect_stderr=False):
@@ -1996,11 +1992,11 @@ def call_git_oneline(self, args, files=None, expect_stderr=False):
 
         Raises
         ------
-        GitCommandError if the call exits with a non-zero status.
+        CommandError if the call exits with a non-zero status.
         AssertionError if there is more than one line of output.
         """
-        lines = list(self.call_git_items(args, files=files,
-                                         expect_stderr=expect_stderr))
+        lines = list(self.call_git_items_(args, files=files,
+                                          expect_stderr=expect_stderr))
         if len(lines) > 1:
             raise AssertionError(
                 "Expected {} to return single line, but it returned {}"
@@ -2029,7 +2025,8 @@ def call_git_success(self, args, files=None, expect_stderr=False):
         try:
             self._git_custom_command(files, ["git"] + args,
                                      expect_fail=True,
-                                     expect_stderr=expect_stderr)
+                                     expect_stderr=expect_stderr,
+                                     check_fake_dates=True)
         except CommandError:
             return False
         return True
diff --git a/datalad/support/repodates.py b/datalad/support/repodates.py
index fa0040ffb..c4a9236f2 100644
--- a/datalad/support/repodates.py
+++ b/datalad/support/repodates.py
@@ -68,7 +68,7 @@ def branch_blobs(repo, branch):
     """
     # Note: This might be nicer with rev-list's --filter and
     # --filter-print-omitted, but those aren't available until Git v2.16.
-    lines = repo.call_git_items(["rev-list", "--objects"] + [branch])
+    lines = repo.call_git_items_(["rev-list", "--objects"] + [branch])
     # Trees and blobs have an associated path printed.
     objects = (ln.split() for ln in lines)
     blob_trees = [obj for obj in objects if len(obj) == 2]
@@ -107,8 +107,8 @@ def branch_blobs_in_tree(repo, branch):
     entry per blob is yielded).
     """
     seen_blobs = set()
-    lines = list(repo.call_git_items(["ls-tree", "-z", "-r", branch],
-                                     sep="\0"))
+    lines = list(repo.call_git_items_(["ls-tree", "-z", "-r", branch],
+                                      sep="\0"))
     if lines:
         num_lines = len(lines)
         log_progress(lgr.info,
diff --git a/datalad/support/tests/test_gitrepo.py b/datalad/support/tests/test_gitrepo.py
index 94032e975..731b9fa31 100644
--- a/datalad/support/tests/test_gitrepo.py
+++ b/datalad/support/tests/test_gitrepo.py
@@ -1553,3 +1553,38 @@ def test_gitrepo_add_to_git_with_annex_v7(path):
     gr.add("foo")
     gr.commit(msg="c1")
     assert_false(ar.is_under_annex("foo"))
+
+
+@with_tree({"foo": "foo", "bar": "bar"})
+def test_gitrepo_call_git_methods(path):
+    gr = GitRepo(path)
+    gr.add(["foo", "bar"])
+    gr.commit(msg="foobar")
+    gr.call_git(["mv"], files=["foo", "foo.txt"])
+    ok_(op.exists(op.join(gr.path, 'foo.txt')))
+
+    for expect_fail, check in [(False, assert_in),
+                               (True, assert_not_in)]:
+        with swallow_logs(new_level=logging.DEBUG) as cml:
+            with assert_raises(CommandError):
+                gr.call_git(["mv"], files=["notthere", "dest"],
+                            expect_fail=expect_fail)
+            check("notthere", cml.out)
+
+    eq_(list(gr.call_git_items_(["ls-files"])),
+        ["bar", "foo.txt"])
+    eq_(list(gr.call_git_items_(["ls-files", "-z"], sep="\0")),
+        # Note: The custom separator has trailing empty item, but this is an
+        # arbitrary command with unknown output it isn't safe to trim it.
+        ["bar", "foo.txt", ""])
+
+    with assert_raises(AssertionError):
+        gr.call_git_oneline(["ls-files"])
+
+    eq_(gr.call_git_oneline(["ls-files"], files=["bar"]),
+        "bar")
+
+    ok_(gr.call_git_success(["rev-parse", "HEAD^{commit}"]))
+    with swallow_logs(new_level=logging.DEBUG) as cml:
+        assert_false(gr.call_git_success(["rev-parse", "HEAD^{blob}"]))
+        assert_not_in("blob", cml.out)

mih
mih approved these changes Oct 19, 2019
Copy link
Member

@mih mih left a comment

I think this is great, and will merge. Thx much for the initiative @kyleam !

@mih mih merged commit bcfcefe into datalad:master Oct 19, 2019
16 checks passed
@kyleam kyleam deleted the gitrepo-callgit branch Feb 11, 2020
yarikoptic added a commit to yarikoptic/datalad that referenced this issue Nov 20, 2020
While looking at py-spy stack from long running status I noticed that majority
of time is spent on git config call, triggered by checking for fake dates
config setting. AFAIK for read-only operations, which cannot produce a commit,
it might benefit to skip that check. And _call_git already has that kwarg,
but it is not proxied by "higher-level" helpers.  To avoid breeding workarounds
(like the one RFed within this commit) I decided that we might benefit from
exposing that option in higher-level helpers as well. Then "read-only" invocations
could set it to False and we might avoid needless  config read in some cases
(I have not done any timing on any prototypical use case which might benefit
from this, but I would assume that status or diff might).

We had an original discussion on either to expose check_fake_dates in higher
level *Repo interfaces before:
datalad#3791 (comment) and decided
to not do it at that point.

As for 'subdatasets' call, performance issue was now addressed with
datalad#5076 , so this change is not strictly
necessary to optimize "subdatasets", but would still be generally benefitial
for possible other invocations of *Repo methods across many instances without
unnecessarily triggering loading of the config, typically in the case of
read-only git operations (git-annex might need to merge git-annex branch, so
typically annex calls should not disable fake dates).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants