`subdatasets` slow for no apparent reason #6940
The following patch would turn a 1min42s runtime into 270ms, and it would also remove the circular dependency, as well as the implied performance penalty (#6941):

diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
index 1159af3f8..e6a4bfc29 100644
--- a/datalad/support/gitrepo.py
+++ b/datalad/support/gitrepo.py
@@ -2355,17 +2358,37 @@ class GitRepo(CoreGitRepo):
             return
 
         modinfo = self._parse_gitmodules()
-        for path, props in self.get_content_info(
-                paths=paths,
-                ref=None,
-                untracked='no').items():
-            if props.get('type', None) != 'dataset':
+        if paths:
+            # ease comparison
+            paths = [self.pathobj / p for p in paths]
+            # constrain the report by the given paths
+            modinfo = {
+                # modpath is absolute
+                modpath: modprobs
+                for modpath, modprobs in modinfo.items()
+                # is_relative_to() also matches equal paths
+                if any(modpath.is_relative_to(p) for p in paths)
+            }
+        for r in self.call_git_items_(
+                ['ls-files', '--stage', '-z'],
+                sep='\0',
+                files=[str(p.relative_to(self.pathobj)) for p in modinfo.keys()],
+                read_only=True,
+                keep_ends=True,
+        ):
+            if not r.startswith('160000'):
                 # make sure this method never talks about non-dataset
                 # content
                 continue
-            props["path"] = path
-            props.update(modinfo.get(path, {}))
-            yield props
+            props, rpath = r.split('\t')
+            # remove the expected line separator from the path
+            path = self.pathobj / PurePosixPath(rpath[:-1])
+            yield dict(
+                path=path,
+                type='dataset',
+                gitshasum=props.split(' ')[1],
+                **modinfo.get(path, {})
+            )
 
     def get_submodules(self, sorted_=True, paths=None):
         """Return list of submodules.

In my particular test case this yields a speed-up of a factor of 350+.
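For illustration, here is the same ls-files parsing idea as a standalone sketch, using subprocess directly instead of `GitRepo.call_git_items_()`; the helper name and the direct git invocation are assumptions made for this sketch, not DataLad API:

import subprocess
from pathlib import Path, PurePosixPath

def iter_submodule_records(repo_path):
    """Yield (absolute path, gitshasum) for every gitlink in the index.

    Mirrors the patch above: parse `git ls-files --stage -z`, whose records
    look like "<mode> <sha> <stage>", a tab, then the path, and keep only
    mode-160000 (gitlink, i.e. submodule) entries.
    """
    repo_path = Path(repo_path)
    out = subprocess.run(
        ['git', 'ls-files', '--stage', '-z'],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    for record in out.split('\0'):
        if not record.startswith('160000'):
            # skips regular content (modes 100644/100755/120000) and the
            # trailing empty string produced by the final NUL
            continue
        props, rpath = record.split('\t', 1)
        yield repo_path / PurePosixPath(rpath), props.split(' ')[1]

In the patch itself the call is additionally restricted via files= to the submodule paths taken from .gitmodules, so git only has to consider a handful of entries rather than the full worktree.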
mih added a commit to mih/datalad that referenced this issue on Aug 14, 2022:

The previous implementation resulted in a circular dependency after a check for a Git-version-dependent reporting behavior was introduced with 8ff8613. Moreover, the previous implementation also queried a repository for all content, merely to discard any record that is not a subdataset. This could slow responses dramatically for a dataset with few subdatasets but many files. This change reimplements `get_submodules_()` with a plain call to `git ls-files`, parameterized with the paths of recorded submodules taken from `.gitmodules`. This changes the behavior to be more in line with `git submodule` by refusing to report on submodules that are not recorded in .gitmodules. Fixes datalad#6940. Fixes datalad#6941.
Scenario: a dataset with two non-installed subdatasets and 40k top-level subdirectories: 35ms (git) vs 1m42s (datalad). 🤦
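A rough way to reproduce a comparison of this kind could look as follows; the commands and the timing helper are assumptions for illustration, not the original benchmark:

import subprocess
import time

def timed(cmd, cwd='.'):
    """Return the wall-clock runtime of a command in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True, capture_output=True)
    return time.perf_counter() - start

# plain git only consults the submodule records in the index/.gitmodules
git_s = timed(['git', 'submodule', 'status'])
# datalad goes through GitRepo.get_submodules_(), discussed below
datalad_s = timed(['datalad', 'subdatasets'])
print(f'git: {git_s:.3f}s  datalad: {datalad_s:.3f}s')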
And the reason seems to be in `GitRepo.get_submodules_()` (datalad/datalad/support/gitrepo.py, lines 2357 to 2368 in 6b7a2f7). It calls `get_content_info()` to get everything about the dataset, only to discard anything that is not of `type='dataset'`. In this case of 2 subdatasets and 80k directories this is highly inefficient. It seems to make sense to limit the call to `get_content_info()` to the paths already obtained from `_parse_gitmodules()` (or even to use a dedicated `git submodule` call).

`get_submodules_()` currently yields records in which all properties prefixed with `gitmodule_`, as well as the path, are provided instantaneously by `_parse_gitmodules()`. The `type='dataset'` is pretty much implied, and the `gitshasum` could come from `get_content_info()`, parameterized with the "intersection" of the reported submodule paths and the `paths` argument of `get_submodules_()`.
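For orientation, such a record might look roughly like the following; the concrete values and the exact set of gitmodule_* keys are illustrative assumptions (they depend on the .gitmodules content), not output captured from a real dataset:

from pathlib import Path

# hypothetical record as yielded by GitRepo.get_submodules_()
record = {
    'path': Path('/tmp/ds/subds-0'),   # absolute path of the subdataset
    'type': 'dataset',                 # implied for every submodule entry
    # gitlink recorded in the superdataset (made-up value)
    'gitshasum': '1569eeaa0f34c76fe88dbd413ca51f808e5cf97b',
    # properties taken verbatim from .gitmodules via _parse_gitmodules()
    'gitmodule_name': 'subds-0',
    'gitmodule_url': './subds-0',
}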
However, #6941 points out that `get_submodules_()` and `get_content_info()` actually form a circular dependency. So it would probably be better to implement the `gitshasum` query with a plain `ls-files --stage` call, instead of going through the full complexity of `get_content_info()`.