-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplement GitRepo.get_submodules_() without get_content_info() #6942
Conversation
The previous implementation resulted in a circular dependency after a check for a Git-version dependent reporting behavior was introduced with 8ff8613 Moreover, the previous implementation also queried a repository for all content, merely to discard any record that is not a subdataset. This could slow a responds dramatically for dataset with few subdatasets but many files. This change reimplements `get_submodules_()` with a plain call to `git ls-files` parameterized with the paths of recorded submodules taken from `.gitmodules`. This changes the behavior to be more in-line with `git submodule` by refusing to report on submodules that are not recorded in .gitmodules. Fixes datalad#6940 Fixes datalad#6941
Previously we ignored them in values as mandatory reasons for quoting. This change expands the `quote_config()` helper to perform this additional test. Fixes datalad#6943
Crawler test failure is unrelated and occurs elsewhere too. |
Codecov Report
@@ Coverage Diff @@
## master #6942 +/- ##
===========================================
+ Coverage 78.29% 90.89% +12.59%
===========================================
Files 354 354
Lines 46259 46291 +32
===========================================
+ Hits 36219 42074 +5855
+ Misses 10040 4217 -5823
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some performance observations/recommendations.
|
||
# taken from 3.9's pathlib, can be removed once the minimally | ||
# supported python version | ||
def is_relative_to(self, *other): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: I was curious if we have other uses, grepped to find added in 51e0611
datalad/support/gitrepo.py- # this could potentially be expensive if lots of files become
datalad/support/gitrepo.py- # directories, but it is unlikely to happen often
datalad/support/gitrepo.py: # Note: PurePath.is_relative_to was added in 3.9 and seems slowish
datalad/support/gitrepo.py- # path_is_subpath faster, also if comparing to "in f.parents"
datalad/support/gitrepo.py- f_str = str(f)
datalad/support/gitrepo.py- if any(path_is_subpath(str(f2), f_str) for f2 in status_state['added']):
so if speed is of concern, might be worth timing and extracting into a reusable helper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not interested in improving upon this function. I was planing on removing it whenever we reach PY3.9. Are you saying that we should avoid it? path_is_subpath()
is practically unused outside the tests, and there is #2564 -- which is why I am removing it from all places I see it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
path_is_subpath() is practically unused outside the tests,
$> git grep -p 'path_is_subpath(' | grep -v test_
datalad/interface/results.py=def results_from_annex_noinfo(ds, requested_paths, respath_by_status, dir_fail_msg,
datalad/interface/results.py: if path_is_subpath(fp, p)]
datalad/interface/results.py: if path_is_subpath(fp, p)]
datalad/interface/utils.py=def discover_dataset_trace_to_targets(basepath, targetpaths, current_trace,
datalad/interface/utils.py: path_is_subpath(t, current_trace[-1]) and
datalad/metadata/aggregate.py=def _get_dsinfo_from_aggmetadata(ds_path, path, recursive, db):
datalad/metadata/aggregate.py: if path_is_subpath(agginfo_path, seed_ds):
datalad/metadata/aggregate.py: elif any(path_is_subpath(ep, rp) for ep in exclude):
datalad/metadata/metadata.py=def _get_containingds_from_agginfo(info, rpath):
datalad/metadata/metadata.py: if path_is_subpath(rpath, subds)],
datalad/metadata/metadata.py=def legacy_query_aggregated_metadata(reporton, ds, aps, recursive=False,
datalad/metadata/metadata.py: path_is_subpath(sub, rpath)]
datalad/metadata/metadata.py=class Metadata(Interface):
datalad/metadata/metadata.py: if parentds and not path_is_subpath(dspath, parentds[-1]):
datalad/support/gitrepo.py=class GitRepo(CoreGitRepo):
datalad/support/gitrepo.py: if any(path_is_subpath(str(f2), f_str) for f2 in status_state['added']):
datalad/support/parallel.py=def no_parentds_in_futures(futures, path, skip=tuple()):
datalad/support/parallel.py: return all(not path_is_subpath(path, p) or p in skip for p in futures)
datalad/support/parallel.py=def no_subds_in_futures(futures, path, skip=tuple()):
datalad/support/parallel.py: return all(not path_is_subpath(p, path) or p in skip for p in futures)
datalad/utils.py=def path_startswith(path, prefix):
datalad/utils.py:def path_is_subpath(path, prefix):
datalad/utils.py=def import_module_from_file(modpath, pkg=None, log=lgr.debug):
datalad/utils.py: if path_is_subpath(modpath, pkgpath):
and I think it is just a matter of fixing the tests, so submitted #6955 . Let's see how it behaves on windows
try: | ||
self.relative_to(*other) | ||
return True | ||
except ValueError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a guarantee that ValueError would be raised only if not relative, and not for some other (malformatted value etc) cases, thus possibly making it hard to debug. I hope stdlib has lots of typing annotation and testing to guarantee that. Aforementioned code snippet might provide remedy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the implementation shipped with Python 3.9. What can be done to resolve your concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
concern is minor, but could be resolved through not using it, e.g. as proposed in https://github.com/datalad/datalad/pull/6942/files#r948226105 below
paths = [self.pathobj / p for p in paths] | ||
# contrain the report by the given paths | ||
modinfo = { | ||
# modpath is absolute | ||
modpath: modprobs | ||
for modpath, modprobs in modinfo.items() | ||
# is_relative_to() also match equal paths | ||
if any(is_relative_to(modpath, p) for p in paths) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mutating input variables often makes it harder to debug that code later since causes confusion (docstring says paths
is relative -- how they are absolute?), so I would have preferred:
paths = [self.pathobj / p for p in paths] | |
# contrain the report by the given paths | |
modinfo = { | |
# modpath is absolute | |
modpath: modprobs | |
for modpath, modprobs in modinfo.items() | |
# is_relative_to() also match equal paths | |
if any(is_relative_to(modpath, p) for p in paths) | |
abspaths = [self.pathobj / p for p in paths] | |
# contrain the report by the given paths | |
modinfo = { | |
# modpath is absolute | |
modpath: modprobs | |
for modpath, modprobs in modinfo.items() | |
# is_relative_to() also match equal paths | |
if any(is_relative_to(modpath, p) for p in abspaths) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that this is potentially O(N^2)
operation (becomes expensive) if paths provided for all submodules (e.g smth like subdatasets sub-*
where each sub- is a subdataset). It reminded me about get_parent_paths solving similar problem -- please see if it could be used here as well. If not -- we should be able to come up with at least O(N*log(N))
solution here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't check if that is due to this particular code path but here is comparative timing of master (<.3sec) and this branch (>3 sec):
(git)lena:~datalad/datalad-master[master]git
$> datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
datalad 0.17.2+102.g2c0073395
1113
python -c 0.26s user 0.04s system 107% cpu 0.277 total
wc -l 0.00s user 0.00s system 0% cpu 0.277 total
$> git co -
Switched to branch 'bf-6940'
Your branch is up to date with 'gh-mih/bf-6940'.
$> datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
datalad 0.17.2+101.gb36148675
1113
python -c 3.78s user 0.02s system 100% cpu 3.777 total
wc -l 0.00s user 0.00s system 0% cpu 3.777 total
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can replicate this. I have not looked into the reasons, but I will state upfront that "give me all subdatasets" is a more common use case than "give my all subdatasets for the paths of all subdatasets". Even if the slow-down for this use case is unavoidable it does not outweigh the substantial speed-up for the common case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm that applying the path-constraint is the source of the slowness. I have no idea how to fix this in a reasonable time frame. Please advice on how to proceed.
The remaining aspect in this conversation is
docstring says paths is relative -- how they are absolute?
They are not absolute, as the docs say. But they have to become absolute to enable the comparison with the paths reported by _parse_gitmodules()
.
The path conversion has no significant impact on the timing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so it read from your statement that this PR is blocked until someone investigated whether this edge case can be made faster to some degree.
FTR: Here are the timings on the HCP test case without the path constraints (only subdatasets; different from the model case with almost no subdatasets):
# this PR 0.17.2+102.g8b49a71d7
>>> timeit list(GitRepo('.').get_submodules_())
49.6 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# master 0.17.2+99.gbacdc8e8f
>>> timeit list(GitRepo('.').get_submodules_())
46.8 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Together they should give a feel for the performance change spectrum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is a patch which seems to work as desired
diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
index a437fc5a2..0776f5805 100644
--- a/datalad/support/gitrepo.py
+++ b/datalad/support/gitrepo.py
@@ -59,6 +59,7 @@ from datalad.dataset.gitrepo import (
path_based_str_repr,
)
from datalad.log import log_progress
+from datalad.support.path import get_parent_paths
from datalad.support.due import (
Doi,
due,
@@ -2367,32 +2368,14 @@ class GitRepo(CoreGitRepo):
# below
return
- # taken from 3.9's pathlib, can be removed once the minimally
- # supported python version
- def is_relative_to(self, *other):
- """Return True if the path is relative to another path or False.
- """
- try:
- self.relative_to(*other)
- return True
- except ValueError:
- return False
-
+ mod_relpaths = [str(m.relative_to(self.pathobj)) for m in modinfo]
if paths:
- # ease comparison
- paths = [self.pathobj / p for p in paths]
- # contrain the report by the given paths
- modinfo = {
- # modpath is absolute
- modpath: modprobs
- for modpath, modprobs in modinfo.items()
- # is_relative_to() also match equal paths
- if any(is_relative_to(modpath, p) for p in paths)
- }
+ # constrain the report by the given paths
+ mod_relpaths = get_parent_paths(paths, mod_relpaths, only_with_parents=True)
for r in self.call_git_items_(
['ls-files', '--stage', '-z'],
sep='\0',
- files=[str(p.relative_to(self.pathobj)) for p in modinfo.keys()],
+ files=mod_relpaths,
read_only=True,
keep_ends=True,
):
and brings it down to
$> datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l ; )
datalad 0.17.2+102.g8b49a71d7.dirty
1113
python -c 0.18s user 0.03s system 110% cpu 0.194 total
wc -l 0.00s user 0.00s system 0% cpu 0.194 total
so as well faster than current master and removing the need for borrowed is_relative
I questioned above. Known annotated tests for get_submodules pass
$> DATALAD_TESTS_SSH=1 python -m pytest -s -v -k get_submodule datalad
======================================================================= test session starts =======================================================================
platform linux -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0 -- /home/yoh/proj/datalad/datalad-master/venvs/dev3/bin/python
cachedir: .pytest_cache
rootdir: /home/yoh/proj/datalad/datalad-master, configfile: tox.ini
plugins: fail-slow-0.2.0, cov-3.0.0
collected 1304 items / 1301 deselected / 3 selected
datalad/support/tests/test_annexrepo.py::test_AnnexRepo_get_submodules SKIPPED (TODO)
datalad/support/tests/test_gitrepo.py::test_GitRepo_get_submodules PASSED
datalad/support/tests/test_gitrepo.py::test_get_submodules_parent_on_unborn_branch PASSED
but I feel that we might benefit from more tests here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would not work on Windows. At least the docs of that function say that it needs posix paths. So it seems an explicit conversion to posix is necessary, and in a way that is consistent with what git does for the report from ls-files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB note that appveyor tests on windows pass. I feel that there might be a need for a dedicated test here for any type of fix.
indeed -- there is that docstring! and more confusing:
- at https://github.com/datalad/datalad/blob/master/datalad/support/path.py#L183 we use hardcoded
'/'
- in
_get_parent_paths_check
at https://github.com/datalad/datalad/blob/master/datalad/support/path.py#L202 we useos.sep
and both cases are on path
of paths
, and both are added within the same commit where that function was added 5fefda4 .
Looking at the code, I do not see any "POSIX-specifics" of any kind, so I think that the comment and hardcoded use of /
could go and make POSIX assumption optional (so use on git output could still be used). I will propose a PR for that because I do feel that currently that function could indeed be buggy.
proposed patch could do what we do in the other use of that function within get_content_info
-- paths = map(ut.PurePosixPath, paths)
first to ensure that they are POSIX, after all we are to compare to modules paths which should be POSIX right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
submitted #6963 . While looking at the code realized that above I am pretty much suggesting what get_content_info
was doing when called from get_submodules_()
so to some effect we are just moving that logic out. Indeed would need aforementioned map(ut.PurePosixPath, paths)
to be added to the patch but otherwise -- should be fine.
What is the question? Why it is a BF PR against |
Thanks to @yarikoptic for pointing this out.
@@ -50,6 +50,7 @@ | |||
from datalad.utils import ( | |||
Path, | |||
get_home_envvars, | |||
on_windows, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no longer needed AFAIK
on_windows, |
I was not asking about |
travis - unrelated stall on is_available - known/not yet resolved. appveyor is all green. |
Restrict submodules to those under `paths`. Paths must be relative | ||
to the resolved repository root, and must be normed to match the | ||
reporting done by Git, i.e. no parent dir components | ||
(ala "some/../this"). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if below for Returns
we could clarify that paths below for submodules are Posix paths, or they aren't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no definitive answer. I took this description from the documentation. I did not plan to make changes to the reporting or parameterization.
…e / as path separator Code review of this function came up while reviewing the datalad#6942 and where get_parent_paths, which was pretty much "avoided" while resolving infinite recursion of `get_submodules_ -> get_content_info -> get_submodules_`, was re-introduced specifically within get_submodules_ . BF: use consistently "sep" instead of / for check within the loop, and then os.sep to check if points to up/down folders. ENH: - sep became an option so we could use it also with os.sep if so desired on paths which are not POSIXed. By default it would remain "POSIX" (although checks were not checking for the POSIX endings before). - The test became parametric to test for various scenarios of using \ vs / vs "the default" (/ - POSIX).
@yarikoptic feel free to take over here, you seem to have plans for a deeper dive. I was just here for a quick contribution. |
Not much deeper than I suggested above . Thanks for the blessing - I will push some commits along those lines later. |
submitted #6974 but made a "mistake" of adding a unittest which unearthed a problem needed fixing ( |
Both tests pass on master and fail here ATM. The 2nd test, matching for absent s_unknown result returns all submodules but even without gitmodule_name.
Code Climate has analyzed commit ffa77a0 and detected 1 issue on this pull request. Here's the issue category breakdown:
View more on Code Climate. |
@mih - I ran into need for more work on so I am ok with you solution, even if later to be improved later. But the tests I added in #6974 fails here and passes on |
Closing due to personal overload. |
- Fix a typo - Clean up an unused import statement - check if paths from .gitmodules and paths passed to the command are relative to eachother from both directions - because we might recurse deeper into a dataset hierarchy with a subdatasets call, the test test_get_subdatasets failed when we soley checked from one direction.
The previous implementation resulted in a circular dependency after
a check for a Git-version dependent reporting behavior was introduced
with 8ff8613
Moreover, the previous implementation also queried a repository for
all content, merely to discard any record that is not a subdataset.
This could slow a responds dramatically for dataset with few subdatasets
but many files.
This change reimplements
get_submodules_()
with a plain call togit ls-files
parameterized with the paths of recorded submodulestaken from
.gitmodules
.This changes the behavior to be more in-line with
git submodule
by refusing to report on submodules that are not recorded in
.gitmodules.
Fixes #6940
Fixes #6941
Fixes #6943
This PR goes against
master
, because it changes behavior, even if only in corner-cases. The fix for #6941 is also cherry-picked into a PR againstmaint
#6944The speed-up is substantial for datasets with many files and at least one subdataset. I have a real-world test cased with a dataset with 160k files and 2 subdatasets showing a speed-up of 350x.
See here for another real-world performance change estimate datalad/datalad-next#92 (comment)