gitrepo: Expose methods for direct git calls #3791
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #3791      +/-   ##
==========================================
- Coverage   80.77%   80.76%   -0.01%
==========================================
  Files         273      273
  Lines       35944    35974      +30
==========================================
+ Hits        29033    29055      +22
- Misses       6911     6919       +8
Continue to review full report at Codecov.
That looks very neat and nice to me so far.
I left a few comments, one regarding the method signatures that prevents me from approving immediately. But in general, this looks great and we should do this IMHO.
Let's just quickly discuss whether the `fake_dates` thing is critically needed in this exact fashion.
@@ -153,7 +153,7 @@ def check_renamed_file(recursive, no_annex, path):
     ds = Dataset(path).create(no_annex=no_annex)
     create_tree(path, {'old': ''})
     ds.repo.add('old')
-    ds.repo._git_custom_command(['old', 'new'], ['git', 'mv'])
+    ds.repo.call_git(["mv"], files=["old", "new"])
I derive intense pleasure from seeing the `files` argument being second to the actual command!
datalad/support/gitrepo.py (Outdated)
expect_fail : bool, optional
    A non-zero exit is expected and should not be elevated above the
    DEBUG level.
check_fake_dates : boolean, optional
I haven't played with fake dates yet, but I wonder whether it is better (or needed) to expose this at the method level rather than at the repo level. In other words, would it be sufficient to put a repo instance into a "fake dates mode" and have all methods act accordingly by querying an internal flag, rather than having one in each relevant method signature?
Thanks for bringing this up. I also don't like the idea of exposing `check_fake_dates` here. I'll have to take a closer look at dropping it.
> would it be sufficient to put a repo instance into a "fake dates mode" and have all methods act accordingly by querying an internal flag
So this is essentially already the case. The `.fake_dates_enabled` property involves an initial `.config.getbool` call, and then the value is cached at `_fake_dates_enabled`. I ended up just using `_git_custom_command(..., check_fake_dates=True)` for all the calls. Here's the relevant bit from the commit message:
All the methods call _git_custom_command() with check_fake_dates=True.
This will be unnecessary for nearly all repositories, but not doing so
risks leaking dates in repositories configured to use fake dates. The
other option would be to add a check_fake_dates parameter to these
methods, which is ugly because this is an obscure parameter that most
callers should not have to worry about. Unconditionally using
check_fake_dates=True costs an attribute lookup and a
.config.getbool() call on the first use and then a _fake_dates_enabled
attribute lookup on all remaining uses for that instance. Here are
times of an example command on the datalad repo:
% python -m timeit -n100 \
-s "from datalad.support.gitrepo import GitRepo" \
-s "repo = GitRepo('.')" -- "repo.call_git(['ls-files'])"
Results of two runs with `check_fake_dates=False`:
1) 100 loops, best of 5: 3.87 msec per loop
2) 100 loops, best of 5: 3.87 msec per loop
Results of two runs with `check_fake_dates=True`:
1) 100 loops, best of 5: 3.97 msec per loop
2) 100 loops, best of 5: 3.96 msec per loop
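To make the lazy-caching pattern described above concrete, here is a minimal sketch. It is not the actual DataLad implementation; the config key and the `.getbool()` call are assumptions for illustration only.

```python
# Minimal sketch of the lazy-caching pattern described above; this is not
# the actual DataLad code, and the config key is an assumption.
class FakeDatesCacheSketch:
    def __init__(self, config):
        self.config = config          # assumed to provide .getbool()
        self._fake_dates_enabled = None

    @property
    def fake_dates_enabled(self):
        # The first access pays for the config read; later accesses return
        # the value cached on the instance.
        if self._fake_dates_enabled is None:
            self._fake_dates_enabled = self.config.getbool(
                "datalad", "fake-dates", default=False)
        return self._fake_dates_enabled
```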
This looks good to me, thx.
Returns
-------
Generator that yields output items.
Q (without any hidden messages): Should we do tandems of functions (list and generator) like we did elsewhere, for reasons of consistency? Or should we not?
I think the issue back then was a report of confusion that some processing didn't happen because the generator wasn't consumed...
However, I am very open to breaking with this "tradition". I have not found myself switching between these often (or ever), and typically it is the generator that is ultimately the right choice.
I personally haven't found the convention useful, but I didn't think too deeply about it here. I'm ok doing that to be consistent if others have a preference.
But before thinking about that, there is one to-do I left for myself in the commit message that I wanted to bring up:
todo: think more about whether using generator _items is beneficial
While I like generators, in the case of `.call_git_items()` I don't see what the advantage is. The methods depend on `.splitlines()` and `.split()`, both of which already return a list. So why not just return the list? My intuition was that using the generator in this case doesn't save any memory. Quick tests with `vprof -c m` seem to support that the generator doesn't do better memory-wise for the "return `.splitlines()` result" scenario. What am I missing?
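To spell the comparison out, both of the simplified, hypothetical bodies below (not the actual methods) start from the complete output string, so neither reduces peak memory.

```python
# Hypothetical simplified bodies, not the actual GitRepo methods.
def items_as_list(out, sep=None):
    # `out` is the full stdout string, so the whole list exists at once.
    return out.split(sep) if sep else out.splitlines()

def items_as_generator(out, sep=None):
    # Still starts from the full stdout string; the generator only delays
    # handing items to the caller, it does not reduce peak memory.
    yield from (out.split(sep) if sep else out.splitlines())
```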
I think in the present context you aren't missing anything (given the limitations of the present `Runner` class, etc.). I was thinking more along the lines of: if we ever can or need to handle long-running processes that yield output line by line that could be acted upon, do we want to remain limited by this when introducing this new API part now? I have no strong feelings, but it seems now is the time to bring it up.
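The kind of streaming implementation that a generator return type would leave room for might look roughly like the sketch below; it uses `subprocess` directly rather than the project's `Runner`, and the error handling is only indicative.

```python
import subprocess

def iter_git_lines(args, cwd=None):
    # Sketch only: yield stdout lines as git produces them instead of
    # collecting the full output first. A real version would need the same
    # logging, error handling, and fake-dates logic as the existing helpers.
    proc = subprocess.Popen(["git"] + list(args), cwd=cwd,
                            stdout=subprocess.PIPE, universal_newlines=True)
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        if proc.wait() != 0:
            raise RuntimeError("git {} exited non-zero".format(" ".join(args)))
```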
I see. That's convincing enough for me. Thanks.
So, regarding the tandem, my current plan is to rename `_items` to have a trailing underscore and punt on adding the non-generator form until there's a need. That way at least the naming is consistent with the new style.
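For reference, the tandem convention being discussed amounts to pairing a generator (trailing underscore) with an eager wrapper, roughly like this hypothetical illustration:

```python
# Hypothetical illustration of the list/generator "tandem" convention.
def tracked_files_(repo):
    # Generator form: nothing runs until the caller consumes it.
    yield from repo.call_git_items_(["ls-files"])

def tracked_files(repo):
    # Eager form: materializes the generator into a list.
    return list(tracked_files_(repo))
```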
Our approach of providing lots of methods that wrap different Git subcommands leads to us adding one-off methods, many of which do the bare minimum for the use case we have at that time. The minimal implementation is a good thing if the code is meant to be used internally, because there's no point in trying to support and test knobs we don't need, but it doesn't make for a nice external interface, and it leads to churn and incompatibilities when we later need to extend the method for another use case. Let's instead expose a set of methods that wrap simple calls based on the return value.

Introducing these will hopefully have the following benefits:

* It is a set of commands that we can offer to third-party callers if they need a command that we don't have but still want to call git through us.
* Likewise, it is a set of commands we can use to avoid adding one-off methods.
* We already use _git_custom_command() in many spots outside of gitrepo.py to avoid adding more methods to GitRepo. These new methods would simplify these calls.
* If we decide to deprecate a method or change it in an incompatible way, we can point to these methods as an alternative.

The arguments of these new methods are pretty minimal compared to what _git_custom_command() accepts. The idea is to keep them simple until we need to expose more options, given we can extend the keyword arguments in a backward-compatible way. Between the new methods, the arguments are largely consistent, but call_git() has `expect_fail` while call_git_{oneline,items_} do not, because we don't have a spot where that'd be used yet in our code base.

Note that call_git_items_() is a generator. While this isn't particularly useful at the moment because str.split{,lines} already put the entire list into memory, it keeps open the possibility of changing the internal implementation to an approach that doesn't load the entire output in memory.

All the methods call _git_custom_command() with check_fake_dates=True. This will be unnecessary for nearly all repositories, but not doing so risks leaking dates in repositories configured to use fake dates. The other option would be to add a check_fake_dates parameter to these methods, which is ugly because this is an obscure parameter that most callers should not have to worry about. Unconditionally using check_fake_dates=True costs an attribute lookup and a .config.getbool() call on the first use and then a _fake_dates_enabled attribute lookup on all remaining uses for that instance. Here are times of an example command on the datalad repo:

% python -m timeit -n100 \
    -s "from datalad.support.gitrepo import GitRepo" \
    -s "repo = GitRepo('.')" -- "repo.call_git(['ls-files'])"

Results of two runs with `check_fake_dates=False`:

1) 100 loops, best of 5: 3.87 msec per loop
2) 100 loops, best of 5: 3.87 msec per loop

Results of two runs with `check_fake_dates=True`:

1) 100 loops, best of 5: 3.97 msec per loop
2) 100 loops, best of 5: 3.96 msec per loop

Closes datalad#3789.
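As a rough usage illustration of the four helpers introduced here (adapted from the test added in this PR; it assumes the current directory is a Git repository and omits error handling):

```python
from datalad.support.gitrepo import GitRepo

repo = GitRepo(".")

# Standard output as a single string; raises CommandError on non-zero exit.
out = repo.call_git(["rev-parse", "HEAD"])

# Generator of output items, split on newlines by default (or on `sep`).
tracked = list(repo.call_git_items_(["ls-files"]))

# Exactly one line of output expected; an AssertionError otherwise.
head = repo.call_git_oneline(["rev-parse", "HEAD"])

# True/False instead of raising, useful for probing.
has_commit = repo.call_git_success(["rev-parse", "HEAD^{commit}"])
```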
This catches most _git_custom_command() usage outside of {git,annex}repo.py. utils_testrepos.py still has some.
Pushed an update. Here's the relevant diff:
diff --git a/datalad/core/local/tests/test_create.py b/datalad/core/local/tests/test_create.py
index 27f4eb822..bdc601c58 100644
--- a/datalad/core/local/tests/test_create.py
+++ b/datalad/core/local/tests/test_create.py
@@ -200,8 +200,8 @@ def test_create_sub(path):
assert_in(
'submodule.some/what/deeper.datalad-id={}'.format(
subds.id),
- list(ds.repo.call_git_items(['config', '--file', '.gitmodules',
- '--list']))
+ list(ds.repo.call_git_items_(['config', '--file', '.gitmodules',
+ '--list']))
)
# subdataset is known to superdataset:
diff --git a/datalad/interface/rerun.py b/datalad/interface/rerun.py
index 1de10e9c7..75cc20095 100644
--- a/datalad/interface/rerun.py
+++ b/datalad/interface/rerun.py
@@ -409,8 +409,7 @@ def _rerun(dset, results, explicit=False):
dset.repo.call_git(
["merge", "-m", msg,
"--no-ff", "--allow-unrelated-histories"] +
- new_parents[1:],
- check_fake_dates=True)
+ new_parents[1:])
head = dset.repo.get_hexsha()
new_bases[res_hexsha] = head
yield res
diff --git a/datalad/interface/tests/test_rerun_merges.py b/datalad/interface/tests/test_rerun_merges.py
index fa313c84f..c993b9bc2 100644
--- a/datalad/interface/tests/test_rerun_merges.py
+++ b/datalad/interface/tests/test_rerun_merges.py
@@ -708,8 +708,7 @@ def test_rerun_octopus(path):
ds.run("echo baz >baz")
ds.repo.checkout("master")
ds.repo.call_git(
- ["merge", "-m", "Merge octopus", "topic-1", "topic-2"],
- check_fake_dates=True)
+ ["merge", "-m", "Merge octopus", "topic-1", "topic-2"])
# o-. f_M
# |\ \
# | | o e_r
diff --git a/datalad/interface/tests/test_save.py b/datalad/interface/tests/test_save.py
index b65d0e86e..6d85b9539 100644
--- a/datalad/interface/tests/test_save.py
+++ b/datalad/interface/tests/test_save.py
@@ -291,7 +291,7 @@ def check_renamed_file(recursive, no_annex, path):
ds = Dataset(path).create(no_annex=no_annex)
create_tree(path, {'old': ''})
ds.add('old')
- ds.repo.call_git(['git', 'mv'], files=['old', 'new'])
+ ds.repo.call_git(['mv'], files=['old', 'new'])
ds._save(recursive=recursive)
ok_clean_git(path)
diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
index 23ec958f1..a7837b35a 100644
--- a/datalad/support/gitrepo.py
+++ b/datalad/support/gitrepo.py
@@ -1911,8 +1911,7 @@ def _run_command_files_split(
# processing or error handling.
def call_git(self, args, files=None,
- expect_stderr=False, expect_fail=False,
- check_fake_dates=False):
+ expect_stderr=False, expect_fail=False):
"""Call git and return standard output.
Parameters
@@ -1929,10 +1928,6 @@ def call_git(self, args, files=None,
expect_fail : bool, optional
A non-zero exit is expected and should not be elevated above the
DEBUG level.
- check_fake_dates : boolean, optional
- DataLad supports faking the dates stored in repository objects. If
- this call isn't purely for inspection (e.g., it can lead to a new
- commit), this flag should be set so the dates are overridden.
Returns
-------
@@ -1940,15 +1935,15 @@ def call_git(self, args, files=None,
Raises
------
- GitCommandError if the call exits with a non-zero status.
+ CommandError if the call exits with a non-zero status.
"""
out, _ = self._git_custom_command(files, ["git"] + args,
expect_stderr=expect_stderr,
expect_fail=expect_fail,
- check_fake_dates=check_fake_dates)
+ check_fake_dates=True)
return out
- def call_git_items(self, args, files=None, expect_stderr=False, sep=None):
+ def call_git_items_(self, args, files=None, expect_stderr=False, sep=None):
"""Call git, splitting output on `sep`.
Parameters
@@ -1971,10 +1966,11 @@ def call_git_items(self, args, files=None, expect_stderr=False, sep=None):
Raises
------
- GitCommandError if the call exits with a non-zero status.
+ CommandError if the call exits with a non-zero status.
"""
out, _ = self._git_custom_command(files, ["git"] + args,
- expect_stderr=expect_stderr)
+ expect_stderr=expect_stderr,
+ check_fake_dates=True)
yield from (out.split(sep) if sep else out.splitlines())
def call_git_oneline(self, args, files=None, expect_stderr=False):
@@ -1996,11 +1992,11 @@ def call_git_oneline(self, args, files=None, expect_stderr=False):
Raises
------
- GitCommandError if the call exits with a non-zero status.
+ CommandError if the call exits with a non-zero status.
AssertionError if there is more than one line of output.
"""
- lines = list(self.call_git_items(args, files=files,
- expect_stderr=expect_stderr))
+ lines = list(self.call_git_items_(args, files=files,
+ expect_stderr=expect_stderr))
if len(lines) > 1:
raise AssertionError(
"Expected {} to return single line, but it returned {}"
@@ -2029,7 +2025,8 @@ def call_git_success(self, args, files=None, expect_stderr=False):
try:
self._git_custom_command(files, ["git"] + args,
expect_fail=True,
- expect_stderr=expect_stderr)
+ expect_stderr=expect_stderr,
+ check_fake_dates=True)
except CommandError:
return False
return True
diff --git a/datalad/support/repodates.py b/datalad/support/repodates.py
index fa0040ffb..c4a9236f2 100644
--- a/datalad/support/repodates.py
+++ b/datalad/support/repodates.py
@@ -68,7 +68,7 @@ def branch_blobs(repo, branch):
"""
# Note: This might be nicer with rev-list's --filter and
# --filter-print-omitted, but those aren't available until Git v2.16.
- lines = repo.call_git_items(["rev-list", "--objects"] + [branch])
+ lines = repo.call_git_items_(["rev-list", "--objects"] + [branch])
# Trees and blobs have an associated path printed.
objects = (ln.split() for ln in lines)
blob_trees = [obj for obj in objects if len(obj) == 2]
@@ -107,8 +107,8 @@ def branch_blobs_in_tree(repo, branch):
entry per blob is yielded).
"""
seen_blobs = set()
- lines = list(repo.call_git_items(["ls-tree", "-z", "-r", branch],
- sep="\0"))
+ lines = list(repo.call_git_items_(["ls-tree", "-z", "-r", branch],
+ sep="\0"))
if lines:
num_lines = len(lines)
log_progress(lgr.info,
diff --git a/datalad/support/tests/test_gitrepo.py b/datalad/support/tests/test_gitrepo.py
index 94032e975..731b9fa31 100644
--- a/datalad/support/tests/test_gitrepo.py
+++ b/datalad/support/tests/test_gitrepo.py
@@ -1553,3 +1553,38 @@ def test_gitrepo_add_to_git_with_annex_v7(path):
gr.add("foo")
gr.commit(msg="c1")
assert_false(ar.is_under_annex("foo"))
+
+
+@with_tree({"foo": "foo", "bar": "bar"})
+def test_gitrepo_call_git_methods(path):
+ gr = GitRepo(path)
+ gr.add(["foo", "bar"])
+ gr.commit(msg="foobar")
+ gr.call_git(["mv"], files=["foo", "foo.txt"])
+ ok_(op.exists(op.join(gr.path, 'foo.txt')))
+
+ for expect_fail, check in [(False, assert_in),
+ (True, assert_not_in)]:
+ with swallow_logs(new_level=logging.DEBUG) as cml:
+ with assert_raises(CommandError):
+ gr.call_git(["mv"], files=["notthere", "dest"],
+ expect_fail=expect_fail)
+ check("notthere", cml.out)
+
+ eq_(list(gr.call_git_items_(["ls-files"])),
+ ["bar", "foo.txt"])
+ eq_(list(gr.call_git_items_(["ls-files", "-z"], sep="\0")),
+ # Note: The custom separator leaves a trailing empty item, but this is an
+ # arbitrary command with unknown output, so it isn't safe to trim it.
+ ["bar", "foo.txt", ""])
+
+ with assert_raises(AssertionError):
+ gr.call_git_oneline(["ls-files"])
+
+ eq_(gr.call_git_oneline(["ls-files"], files=["bar"]),
+ "bar")
+
+ ok_(gr.call_git_success(["rev-parse", "HEAD^{commit}"]))
+ with swallow_logs(new_level=logging.DEBUG) as cml:
+ assert_false(gr.call_git_success(["rev-parse", "HEAD^{blob}"]))
+ assert_not_in("blob", cml.out)
While looking at a py-spy stack from a long-running status call, I noticed that the majority of the time is spent on a git config call, triggered by checking for the fake-dates config setting. AFAIK, for read-only operations, which cannot produce a commit, it could be beneficial to skip that check. `_call_git` already has that kwarg, but it is not proxied by the "higher-level" helpers. To avoid breeding workarounds (like the one RFed within this commit), I decided that we might benefit from exposing that option in the higher-level helpers as well. Then "read-only" invocations could set it to False, and we might avoid a needless config read in some cases (I have not done any timing on a prototypical use case which might benefit from this, but I would assume that status or diff might).

We had an original discussion on whether to expose check_fake_dates in the higher-level *Repo interfaces before: datalad#3791 (comment), and decided not to do it at that point. As for the 'subdatasets' call, the performance issue was addressed with datalad#5076, so this change is not strictly necessary to optimize "subdatasets", but it would still be generally beneficial for other invocations of *Repo methods across many instances without unnecessarily triggering loading of the config, typically in the case of read-only git operations (git-annex might need to merge the git-annex branch, so annex calls typically should not disable fake dates).
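A minimal sketch of the kind of proxying being proposed, written against the `_git_custom_command()` signature shown earlier in this PR (the newer `_call_git` helper mentioned above would be the actual target); the `read_only` parameter name, its default, and the subclass are assumptions, not the final API:

```python
from datalad.support.gitrepo import GitRepo

class GitRepoSketch(GitRepo):
    # Sketch only: a call_git() variant that lets read-only callers skip the
    # fake-dates config lookup. Names and defaults are assumptions.
    def call_git(self, args, files=None, expect_stderr=False,
                 expect_fail=False, read_only=False):
        out, _ = self._git_custom_command(
            files, ["git"] + list(args),
            expect_stderr=expect_stderr,
            expect_fail=expect_fail,
            # Only pay for the config read when the command could create
            # repository objects.
            check_fake_dates=not read_only)
        return out
```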
This is a draft PR with the initial implementation of the methods mentioned in gh-3789. There's a lot of wiggle room in terms of how to lump or split (you'll notice that I merged a couple of the methods from gh-3789 together), so any feedback on that would be welcome.