Investigate performance of `next-status` with many subdatasets #606

mih · 2024-01-26T09:22:53Z

It takes twice as long as a plain listing of present submodules.

% datalad next-status
untracked: somelog
datalad next-status  117.60s user 62.30s system 63% cpu 4:41.11 total

% datalad subdatasets --state present
datalad subdatasets --state present  16.54s user 2.94s system 11% cpu 2:43.88 total

Sidenote: Listing absent submodules only is much faster.

% datalad -f json subdatasets --state absent | wc -l
42715
datalad -f json subdatasets --state absent  17.78s user 1.54s system 149% cpu 12.952 total

Timings above are not depending on a "cold start", but are reproducible on repeated runs (more or less).

The text was updated successfully, but these errors were encountered:

mih · 2024-01-26T09:49:02Z

All of the runtime is coming from _eval_submodule():

Test patch

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..6fe6e71 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -289,7 +289,7 @@ def _yield_repo_items(
             # TODO others?
         )
         # TODO possibly trim eval_submodule_state
-        _eval_submodule(path, item, eval_submodule_state)
+        #_eval_submodule(path, item, eval_submodule_state)
         if item.status:
             yield item

# with patch
❯ time datalad next-status
nothing to save, working tree clean
datalad next-status  0.84s user 0.13s system 101% cpu 0.962 total

# without the patch
❯ time datalad next-status
nothing to save, working tree clean
datalad next-status  95.98s user 26.99s system 110% cpu 1:51.23 total

mih · 2024-01-26T10:02:41Z

The culprit is the timing of detection that a submodule is absent.

The following patch tried swapping out the iter_subproc() method for a simpler subprocess.run(). Marginal difference -- big compliment to iter_subproc()!

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..9f0f7a8 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -13,6 +13,7 @@ from typing import Generator
 
 from datalad_next.runners import (
     CommandError,
+    call_git_lines,
     iter_git_subproc,
 )
 from datalad_next.itertools import (
@@ -414,18 +415,16 @@ def _get_submod_worktree_head(path: Path) -> tuple[bool, str | None, bool]:
         # its basis. it is not meaningful to track the managed branch in
         # a superdataset
         HEAD = corresponding_head
-    with iter_git_subproc(
-        ['rev-parse', '--path-format=relative',
-         '--show-toplevel', HEAD],
+    res = call_git_lines(
+        ['rev-parse', '--path-format=relative', '--show-toplevel', HEAD],
         cwd=path,
-    ) as r:
-        res = tuple(decode_bytes(itemize(r, sep=None, keep_ends=False)))
-        assert len(res) == 2
-        if res[0].startswith('..'):
-            # this is not a report on a submodule at this location
-            return False, None, adjusted
-        else:
-            return True, res[1], adjusted
+    )
+    assert len(res) == 2
+    if res[0].startswith('..'):
+        # this is not a report on a submodule at this location
+        return False, None, adjusted
+    else:
+        return True, res[1], adjusted
 
 
 def _eval_submodule(basepath, item, eval_mode) -> None:

mih · 2024-01-26T10:08:12Z

Here is the patch

diff --git a/datalad_next/iter_collections/gitstatus.py b/datalad_next/iter_collections/gitstatus.py
index d3f4dd5..5e4a980 100644
--- a/datalad_next/iter_collections/gitstatus.py
+++ b/datalad_next/iter_collections/gitstatus.py
@@ -437,6 +436,14 @@ def _eval_submodule(basepath, item, eval_mode) -> None:
         return
 
     item_path = basepath / item.path
+
+    # this is the cheapest test for the theoretical chance that a submodule
+    # is present at `item_path`. This is beneficial even when we would only
+    # run a single call to `git rev-parse`
+    # https://github.com/datalad/datalad-next/issues/606
+    if not (item_path / '.git').exists():
+        return
+
     # get head commit, and whether a submodule is actually present,
     # and/or in adjusted mode
     subds_present, head_commit, adjusted = _get_submod_worktree_head(item_path)

❯ time datalad next-status
nothing to save, working tree clean
datalad next-status  1.17s user 0.21s system 100% cpu 1.372 total

A 80x speedup for this extreme use case.

This can dramatically boost performance with many submodules. Checking the filesystem for a file is much cheaper than running `git rev-parse`. When a subdataset is present, this obviously means an additional cost. However, in comparison to the then following state evaluation, it should still be cheap. Worth it, IMHO. Closes: datalad#606

mih added this to the 1.1.1 milestone Jan 26, 2024

mih mentioned this issue Jan 26, 2024

Introduce early check for submodule presence in status evaluation #607

Merged

mih closed this as completed in #607 Jan 26, 2024

mih modified the milestones: 1.1.1, 1.2 Feb 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate performance of `next-status` with many subdatasets #606

Investigate performance of `next-status` with many subdatasets #606

mih commented Jan 26, 2024 •

edited

Loading

mih commented Jan 26, 2024

mih commented Jan 26, 2024

mih commented Jan 26, 2024

Investigate performance of next-status with many subdatasets #606

Investigate performance of next-status with many subdatasets #606

Comments

mih commented Jan 26, 2024 • edited Loading

mih commented Jan 26, 2024

mih commented Jan 26, 2024

mih commented Jan 26, 2024

Investigate performance of `next-status` with many subdatasets #606

Investigate performance of `next-status` with many subdatasets #606

mih commented Jan 26, 2024 •

edited

Loading