Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate performance of next-status with many subdatasets #606

Closed
mih opened this issue Jan 26, 2024 · 3 comments · Fixed by #607
Closed

Investigate performance of next-status with many subdatasets #606

mih opened this issue Jan 26, 2024 · 3 comments · Fixed by #607
Milestone

Comments

@mih
Copy link
Member

mih commented Jan 26, 2024

It takes twice as long as a plain listing of present submodules.

% datalad next-status
untracked: somelog
datalad next-status  117.60s user 62.30s system 63% cpu 4:41.11 total
% datalad subdatasets --state present
datalad subdatasets --state present  16.54s user 2.94s system 11% cpu 2:43.88 total

Sidenote: Listing absent submodules only is much faster.

% datalad -f json subdatasets --state absent | wc -l
42715
datalad -f json subdatasets --state absent  17.78s user 1.54s system 149% cpu 12.952 total

Timings above are not depending on a "cold start", but are reproducible on repeated runs (more or less).

@mih
Copy link
Member Author

mih commented Jan 26, 2024

All of the runtime is coming from _eval_submodule():

Test patch

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..6fe6e71 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -289,7 +289,7 @@ def _yield_repo_items(
             # TODO others?
         )
         # TODO possibly trim eval_submodule_state
-        _eval_submodule(path, item, eval_submodule_state)
+        #_eval_submodule(path, item, eval_submodule_state)
         if item.status:
             yield item
# with patchtime datalad next-status
nothing to save, working tree clean
datalad next-status  0.84s user 0.13s system 101% cpu 0.962 total

# without the patchtime datalad next-status
nothing to save, working tree clean
datalad next-status  95.98s user 26.99s system 110% cpu 1:51.23 total

@mih
Copy link
Member Author

mih commented Jan 26, 2024

The culprit is the timing of detection that a submodule is absent.

The following patch tried swapping out the iter_subproc() method for a simpler subprocess.run(). Marginal difference -- big compliment to iter_subproc()!

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..9f0f7a8 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -13,6 +13,7 @@ from typing import Generator
 
 from datalad_next.runners import (
     CommandError,
+    call_git_lines,
     iter_git_subproc,
 )
 from datalad_next.itertools import (
@@ -414,18 +415,16 @@ def _get_submod_worktree_head(path: Path) -> tuple[bool, str | None, bool]:
         # its basis. it is not meaningful to track the managed branch in
         # a superdataset
         HEAD = corresponding_head
-    with iter_git_subproc(
-        ['rev-parse', '--path-format=relative',
-         '--show-toplevel', HEAD],
+    res = call_git_lines(
+        ['rev-parse', '--path-format=relative', '--show-toplevel', HEAD],
         cwd=path,
-    ) as r:
-        res = tuple(decode_bytes(itemize(r, sep=None, keep_ends=False)))
-        assert len(res) == 2
-        if res[0].startswith('..'):
-            # this is not a report on a submodule at this location
-            return False, None, adjusted
-        else:
-            return True, res[1], adjusted
+    )
+    assert len(res) == 2
+    if res[0].startswith('..'):
+        # this is not a report on a submodule at this location
+        return False, None, adjusted
+    else:
+        return True, res[1], adjusted
 
 
 def _eval_submodule(basepath, item, eval_mode) -> None:

@mih
Copy link
Member Author

mih commented Jan 26, 2024

Here is the patch

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..5e4a980 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -437,6 +436,14 @@ def _eval_submodule(basepath, item, eval_mode) -> None:
         return
 
     item_path = basepath / item.path
+
+    # this is the cheapest test for the theoretical chance that a submodule
+    # is present at `item_path`. This is beneficial even when we would only
+    # run a single call to `git rev-parse`
+    # https://github.com/datalad/datalad-next/issues/606
+    if not (item_path / '.git').exists():
+        return
+
     # get head commit, and whether a submodule is actually present,
     # and/or in adjusted mode
     subds_present, head_commit, adjusted = _get_submod_worktree_head(item_path)
❯ time datalad next-status
nothing to save, working tree clean
datalad next-status  1.17s user 0.21s system 100% cpu 1.372 total

A 80x speedup for this extreme use case.

@mih mih added this to the 1.1.1 milestone Jan 26, 2024
mih added a commit to mih/datalad-next that referenced this issue Jan 26, 2024
This can dramatically boost performance with many submodules. Checking
the filesystem for a file is much cheaper than running `git rev-parse`.

When a subdataset is present, this obviously means an additional cost.
However, in comparison to the then following state evaluation, it should
still be cheap.

Worth it, IMHO.

Closes: datalad#606
@mih mih closed this as completed in #607 Jan 26, 2024
@mih mih modified the milestones: 1.1.1, 1.2 Feb 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant