subdatasets slow for no apparent reason #6940

Closed
mih opened this issue Aug 13, 2022 · 1 comment
Labels
cmd-subdatasets, performance (Improve performance of an existing feature)

Comments

mih commented Aug 13, 2022

Scenario: a dataset with two non-installed subdatasets and 40k top-level subdirectories:

% datalad -f disabled subdatasets 
datalad -f disabled subdatasets  91.01s user 11.94s system 100% cpu 1:42.54 total
% time git submodule
-dfa6d975ea8888ed33bf714c674371e1ab54e834 code/pipeline
-0c7f0b45140dde1d7291b15725e2016231a670b0 inputs/ukb
git submodule  0.02s user 0.02s system 105% cpu 0.035 total
% cat .gitmodules
[submodule "code/pipeline"]
        path = code/pipeline
        url = http://containers.ds.inm7.de/alias/cat
        datalad-id = 99300fb0-ea39-4fd2-b36d-2afc1e9ab7b2
[submodule "inputs/ukb"]
        path = inputs/ukb
        url = http://ukb.ds.inm7.de/alias/bids
        datalad-id = 14a5e5b6-d30d-11ea-8836-b4969157768c
% git submodule
-dfa6d975ea8888ed33bf714c674371e1ab54e834 code/pipeline
-0c7f0b45140dde1d7291b15725e2016231a670b0 inputs/ukb

So 35ms (git) vs 1m42s (datalad). 🤦

And the reason seems to be in GitRepo.get_submodules_():

modinfo = self._parse_gitmodules()
for path, props in self.get_content_info(
        paths=paths,
        ref=None,
        untracked='no').items():
    if props.get('type', None) != 'dataset':
        # make sure this method never talks about non-dataset
        # content
        continue
    props["path"] = path
    props.update(modinfo.get(path, {}))
    yield props

It calls get_content_info() to query everything in the dataset, only to discard any record that is not of type='dataset'. With just 2 subdatasets but 80k directories, this is highly inefficient.

It seems to make sense to limit the call to get_content_info() to the paths already obtained from _parse_gitmodules() (or even to use a dedicated git submodule call).
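A minimal sketch of that idea (purely illustrative, not the eventual change; it assumes get_content_info() accepts the submodule paths reported by _parse_gitmodules() via its paths argument):

modinfo = self._parse_gitmodules()
# hypothetical: only query the (few) paths that .gitmodules already names,
# instead of asking get_content_info() about the entire worktree
# (the caller-supplied `paths` argument would still need to be intersected here)
for path, props in self.get_content_info(
        paths=list(modinfo),
        ref=None,
        untracked='no').items():
    if props.get('type', None) != 'dataset':
        continue
    props["path"] = path
    props.update(modinfo.get(path, {}))
    yield props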

get_submodules_() currently yields records like this:

 {'gitmodule_datalad-id': '14a5e5b6-d30d-11ea-8836-b4969157768c',
  'gitmodule_name': 'inputs/ukb',
  'gitmodule_url': 'http://ukb.ds.inm7.de/alias/bids',
  'gitshasum': '0c7f0b45140dde1d7291b15725e2016231a670b0',
  'path': PosixPath('/home/mih/ukb-gource/ukb-derivatives/inputs/ukb'),
  'type': 'dataset'}

All properties prefixed with gitmodule_, as well as the path, are provided directly by _parse_gitmodules(). The type='dataset' is implied, and the gitshasum could come from get_content_info(), parameterized with the "intersection" of the reported submodule paths and the paths argument of get_submodules_().

However, #6941 points out that get_submodules_() and get_content_info() actually form a circular dependency. So it would probably be better to implement the gitshasum query with a plain ls-files --stage call, instead of going through the full complexity of get_content_info().
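For illustration, a standalone sketch (my own, not the datalad code; the helper name and the use of subprocess are assumptions) of how such a plain ls-files --stage query can be parsed — submodule entries are the ones with mode 160000:

import subprocess

def submodule_shasums(repo_path, submodule_paths):
    """Map recorded submodule paths to their gitshasum via `git ls-files --stage`."""
    out = subprocess.run(
        ['git', 'ls-files', '--stage', '-z', '--'] + list(submodule_paths),
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    shasums = {}
    # each NUL-terminated record reads: "<mode> <sha> <stage>\t<path>"
    for record in out.split('\0'):
        if not record.startswith('160000'):
            # mode 160000 marks gitlink (submodule) entries; skip everything else
            continue
        props, path = record.split('\t', 1)
        shasums[path] = props.split(' ')[1]
    return shasums

# e.g. submodule_shasums('.', ['code/pipeline', 'inputs/ukb'])
# -> {'code/pipeline': 'dfa6d975...', 'inputs/ukb': '0c7f0b45...'}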

mih commented Aug 13, 2022

The following patch turns the 1min42s runtime into 270ms, and it also removes the circular dependency, as well as the implied performance penalty (#6941):

diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
index 1159af3f8..e6a4bfc29 100644
--- a/datalad/support/gitrepo.py
+++ b/datalad/support/gitrepo.py
@@ -2355,17 +2358,37 @@ class GitRepo(CoreGitRepo):
             return
 
         modinfo = self._parse_gitmodules()
-        for path, props in self.get_content_info(
-                paths=paths,
-                ref=None,
-                untracked='no').items():
-            if props.get('type', None) != 'dataset':
+        if paths:
+            # ease comparison
+            paths = [self.pathobj / p for p in paths]
+            # contrain the report by the given paths
+            modinfo = {
+                # modpath is absolute
+                modpath: modprobs
+                for modpath, modprobs in modinfo.items()
+                # is_relative_to() also match equal paths
+                if any(modpath.is_relative_to(p) for p in paths)
+            }
+        for r in self.call_git_items_(
+            ['ls-files', '--stage', '-z'],
+            sep='\0',
+            files=[str(p.relative_to(self.pathobj)) for p in modinfo.keys()],
+            read_only=True,
+            keep_ends=True,
+        ):
+            if not r.startswith('160000'):
                 # make sure this method never talks about non-dataset
                 # content
                 continue
-            props["path"] = path
-            props.update(modinfo.get(path, {}))
-            yield props
+            props, rpath = r.split('\t')
+            # remove the expected line separator from the path
+            path = self.pathobj / PurePosixPath(rpath[:-1])
+            yield dict(
+                path=path,
+                type='dataset',
+                gitshasum=props.split(' ')[1],
+                **modinfo.get(path, {})
+            )
 
     def get_submodules(self, sorted_=True, paths=None):
         """Return list of submodules.

In my particular test case this yields a speed-up by a factor of 350+.
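A hedged usage sketch of the patched generator (the repository path is the one from the record above; the yielded fields are those listed earlier):

from datalad.support.gitrepo import GitRepo

repo = GitRepo('/home/mih/ukb-gource/ukb-derivatives')
for sm in repo.get_submodules_(paths=['inputs/ukb']):
    # records keep path, type='dataset', gitshasum, plus the
    # gitmodule_* properties parsed from .gitmodules
    print(sm['path'], sm['gitshasum'])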

mih added a commit to mih/datalad that referenced this issue Aug 14, 2022
The previous implementation resulted in a circular dependency after
a check for Git-version-dependent reporting behavior was introduced
with 8ff8613.

Moreover, the previous implementation also queried a repository for
all content, merely to discard any record that is not a subdataset.
This could slow responses dramatically for datasets with few subdatasets
but many files.

This change reimplements `get_submodules_()` with a plain call to
`git ls-files` parameterized with the paths of recorded submodules
taken from `.gitmodules`.

This changes the behavior to be more in line with `git submodule`
by refusing to report on submodules that are not recorded in
.gitmodules.

Fixes datalad#6940
Fixes datalad#6941
mih added a commit to mih/datalad that referenced this issue Aug 15, 2022
mih added a commit to mih/datalad that referenced this issue Aug 15, 2022
yarikoptic pushed a commit to yarikoptic/datalad that referenced this issue Aug 22, 2022