Investigate performance of next-status with many subdatasets #606
All of the runtime is coming from the call that is commented out in this test patch:
diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..6fe6e71 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -289,7 +289,7 @@ def _yield_repo_items(
# TODO others?
)
# TODO possibly trim eval_submodule_state
- _eval_submodule(path, item, eval_submodule_state)
+ #_eval_submodule(path, item, eval_submodule_state)
if item.status:
yield item
# with patch
❯ time datalad next-status
nothing to save, working tree clean
datalad next-status 0.84s user 0.13s system 101% cpu 0.962 total
# without the patch
❯ time datalad next-status
nothing to save, working tree clean
datalad next-status 95.98s user 26.99s system 110% cpu 1:51.23 total
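The culprit is the time it takes to detect that a submodule is absent. The following patch tried swapping out the helper that runs `git rev-parse` (in the diff, `iter_git_subproc` is replaced with `call_git_lines`):
diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py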
index d3f4dd5..9f0f7a8 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -13,6 +13,7 @@ from typing import Generator
from datalad_next.runners import (
CommandError,
+ call_git_lines,
iter_git_subproc,
)
from datalad_next.itertools import (
@@ -414,18 +415,16 @@ def _get_submod_worktree_head(path: Path) -> tuple[bool, str | None, bool]:
# its basis. it is not meaningful to track the managed branch in
# a superdataset
HEAD = corresponding_head
- with iter_git_subproc(
- ['rev-parse', '--path-format=relative',
- '--show-toplevel', HEAD],
+ res = call_git_lines(
+ ['rev-parse', '--path-format=relative', '--show-toplevel', HEAD],
cwd=path,
- ) as r:
- res = tuple(decode_bytes(itemize(r, sep=None, keep_ends=False)))
- assert len(res) == 2
- if res[0].startswith('..'):
- # this is not a report on a submodule at this location
- return False, None, adjusted
- else:
- return True, res[1], adjusted
+ )
+ assert len(res) == 2
+ if res[0].startswith('..'):
+ # this is not a report on a submodule at this location
+ return False, None, adjusted
+ else:
+ return True, res[1], adjusted
def _eval_submodule(basepath, item, eval_mode) -> None:
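The decision made in `_get_submod_worktree_head` in the diff above can be isolated as a small pure function. This is a minimal sketch, not the datalad-next implementation: `interpret_rev_parse` is a hypothetical name, and it assumes the two output lines of `git rev-parse --path-format=relative --show-toplevel HEAD` have already been collected by whichever runner is used.

```python
from __future__ import annotations


def interpret_rev_parse(lines: list[str]) -> tuple[bool, str | None]:
    """Interpret the two lines reported by
    `git rev-parse --path-format=relative --show-toplevel HEAD`.

    Hypothetical helper for illustration only. `lines` is expected to
    hold the relative path to the worktree root, then the commit of HEAD.
    """
    if len(lines) != 2:
        raise ValueError(f'expected 2 lines, got {len(lines)}')
    toplevel, head = lines
    if toplevel.startswith('..'):
        # the reported worktree root is above the queried directory,
        # so git answered for an enclosing repository, not a submodule
        return False, None
    return True, head


# sample inputs (made up for illustration, not captured git output)
print(interpret_rev_parse(['./', 'abc123']))
print(interpret_rev_parse(['../', 'abc123']))
```

Separating the interpretation from the subprocess call makes it clear that the result is the same for either runner; only the cost of spawning `git` differs.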
Here is the patch:
diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..5e4a980 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -437,6 +436,14 @@ def _eval_submodule(basepath, item, eval_mode) -> None:
return
item_path = basepath / item.path
+
+ # this is the cheapest test for the theoretical chance that a submodule
+ # is present at `item_path`. This is beneficial even when we would only
+ # run a single call to `git rev-parse`
+ # https://github.com/datalad/datalad-next/issues/606
+ if not (item_path / '.git').exists():
+ return
+
# get head commit, and whether a submodule is actually present,
# and/or in adjusted mode
subds_present, head_commit, adjusted = _get_submod_worktree_head(item_path)
An 80x speedup for this extreme use case.
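The idea behind the patch, skipping the expensive per-submodule evaluation whenever no `.git` entry exists at the path, can be tried in isolation. The following is a minimal sketch under stated assumptions: `may_contain_repo` and `candidates` are hypothetical names, and the directory layout is simulated in a temporary directory rather than taken from a real dataset.

```python
from __future__ import annotations

from pathlib import Path
from tempfile import TemporaryDirectory


def may_contain_repo(path: Path) -> bool:
    """Cheap pre-check: is there a `.git` file or directory at `path`?"""
    return (path / '.git').exists()


def candidates(basepath: Path, relpaths: list[str]) -> list[str]:
    """Keep only the paths that are worth an expensive git call."""
    return [p for p in relpaths if may_contain_repo(basepath / p)]


with TemporaryDirectory() as tmp:
    base = Path(tmp)
    # simulate one installed and two absent submodules
    (base / 'installed' / '.git').mkdir(parents=True)
    (base / 'absent1').mkdir()
    (base / 'absent2').mkdir()
    remaining = candidates(base, ['installed', 'absent1', 'absent2'])
    print(remaining)  # ['installed']
```

`Path.exists()` is true for both a `.git` directory and a `.git` file, so the check does not depend on how the submodule stores its repository. A positive result only means a repository may be present; the full evaluation still has to confirm it.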
mih added a commit to mih/datalad-next that referenced this issue on Jan 26, 2024
This can dramatically boost performance with many submodules. Checking the filesystem for a file is much cheaper than running `git rev-parse`. When a subdataset is present, this obviously means an additional cost. However, in comparison to the then following state evaluation, it should still be cheap. Worth it, IMHO. Closes: datalad#606
It takes twice as long as a plain listing of present submodules.
Sidenote: Listing only absent submodules is much faster.
The timings above do not depend on a "cold start"; they are reproducible (more or less) on repeated runs.
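One way to check that a measurement is not a cold-start artifact is to repeat it and compare the individual runs. A minimal sketch: `time_runs` is a hypothetical helper, not part of DataLad, and a trivial Python invocation stands in for the real command so that the example runs anywhere.

```python
from __future__ import annotations

import subprocess
import sys
import time


def time_runs(cmd: list[str], repeats: int = 3) -> list[float]:
    """Run `cmd` `repeats` times and return wall-clock seconds per run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


# stand-in command; to measure the case from this issue, replace it with
# ['datalad', 'next-status'] and run inside the dataset in question
timings = time_runs([sys.executable, '-c', 'pass'])
print([f'{t:.3f}s' for t in timings])
```

If the first run is much slower than the following ones, filesystem caching is likely involved; similar values across runs suggest the cost is inherent to the command.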