Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reimplement GitRepo.get_submodules_() without get_content_info() #6942

Closed
wants to merge 4 commits into from

Conversation

mih
Copy link
Member

@mih mih commented Aug 14, 2022

The previous implementation resulted in a circular dependency after
a check for a Git-version dependent reporting behavior was introduced
with 8ff8613

Moreover, the previous implementation also queried a repository for
all content, merely to discard any record that is not a subdataset.
This could slow a responds dramatically for dataset with few subdatasets
but many files.

This change reimplements get_submodules_() with a plain call to
git ls-files parameterized with the paths of recorded submodules
taken from .gitmodules.

This changes the behavior to be more in-line with git submodule
by refusing to report on submodules that are not recorded in
.gitmodules.

Fixes #6940
Fixes #6941
Fixes #6943

This PR goes against master, because it changes behavior, even if only in corner-cases. The fix for #6941 is also cherry-picked into a PR against maint #6944

The speed-up is substantial for datasets with many files and at least one subdataset. I have a real-world test cased with a dataset with 160k files and 2 subdatasets showing a speed-up of 350x.

See here for another real-world performance change estimate datalad/datalad-next#92 (comment)

@mih mih added performance Improve performance of an existing feature semver-minor Increment the minor version when merged labels Aug 14, 2022
The previous implementation resulted in a circular dependency after
a check for a Git-version dependent reporting behavior was introduced
with 8ff8613

Moreover, the previous implementation also queried a repository for
all content, merely to discard any record that is not a subdataset.
This could slow a responds dramatically for dataset with few subdatasets
but many files.

This change reimplements `get_submodules_()` with a plain call to
`git ls-files` parameterized with the paths of recorded submodules
taken from `.gitmodules`.

This changes the behavior to be more in-line with `git submodule`
by refusing to report on submodules that are not recorded in
.gitmodules.

Fixes datalad#6940
Fixes datalad#6941
Previously we ignored them in values as mandatory reasons for quoting.
This change expands the `quote_config()` helper to perform this
additional test.

Fixes datalad#6943
@mih
Copy link
Member Author

mih commented Aug 15, 2022

Crawler test failure is unrelated and occurs elsewhere too.

@codecov
Copy link

codecov bot commented Aug 15, 2022

Codecov Report

Merging #6942 (8b49a71) into master (bacdc8e) will increase coverage by 12.59%.
The diff coverage is 100.00%.

❗ Current head 8b49a71 differs from pull request most recent head ffa77a0. Consider uploading reports for the commit ffa77a0 to get more accurate results

@@             Coverage Diff             @@
##           master    #6942       +/-   ##
===========================================
+ Coverage   78.29%   90.89%   +12.59%     
===========================================
  Files         354      354               
  Lines       46259    46291       +32     
===========================================
+ Hits        36219    42074     +5855     
+ Misses      10040     4217     -5823     
Impacted Files Coverage Δ
datalad/config.py 97.26% <100.00%> (+0.23%) ⬆️
datalad/support/gitrepo.py 92.37% <100.00%> (+2.02%) ⬆️
datalad/tests/test_config.py 99.73% <100.00%> (+<0.01%) ⬆️
datalad/customremotes/tests/test_datalad.py 95.34% <0.00%> (-2.28%) ⬇️
datalad/cli/common_args.py 100.00% <0.00%> (ø)
datalad/tests/test_dochelpers.py 100.00% <0.00%> (ø)
datalad/support/tests/test_network.py 100.00% <0.00%> (ø)
datalad/support/tests/test_locking.py 95.87% <0.00%> (+0.04%) ⬆️
datalad/customremotes/tests/test_archives.py 89.36% <0.00%> (+0.07%) ⬆️
datalad/local/tests/test_addurls.py 99.57% <0.00%> (+0.21%) ⬆️
... and 139 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@yarikoptic
Copy link
Member

The fix for #6941 is also cherry-picked into a PR against maint #6944

#6941 is about circular dependency and #6944 about comment symbols quotation... although could be related - I do not see how. Could you please expain?

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some performance observations/recommendations.


# taken from 3.9's pathlib, can be removed once the minimally
# supported python version
def is_relative_to(self, *other):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I was curious if we have other uses, grepped to find added in 51e0611

datalad/support/gitrepo.py-                # this could potentially be expensive if lots of files become
datalad/support/gitrepo.py-                # directories, but it is unlikely to happen often
datalad/support/gitrepo.py:                # Note: PurePath.is_relative_to was added in 3.9 and seems slowish
datalad/support/gitrepo.py-                # path_is_subpath faster, also if comparing to "in f.parents"
datalad/support/gitrepo.py-                f_str = str(f)
datalad/support/gitrepo.py-                if any(path_is_subpath(str(f2), f_str) for f2 in status_state['added']):

so if speed is of concern, might be worth timing and extracting into a reusable helper.

Copy link
Member Author

@mih mih Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not interested in improving upon this function. I was planing on removing it whenever we reach PY3.9. Are you saying that we should avoid it? path_is_subpath() is practically unused outside the tests, and there is #2564 -- which is why I am removing it from all places I see it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

path_is_subpath() is practically unused outside the tests,

$> git grep -p 'path_is_subpath(' | grep -v test_
datalad/interface/results.py=def results_from_annex_noinfo(ds, requested_paths, respath_by_status, dir_fail_msg,
datalad/interface/results.py:                if path_is_subpath(fp, p)]
datalad/interface/results.py:                    if path_is_subpath(fp, p)]
datalad/interface/utils.py=def discover_dataset_trace_to_targets(basepath, targetpaths, current_trace,
datalad/interface/utils.py:                       path_is_subpath(t, current_trace[-1]) and
datalad/metadata/aggregate.py=def _get_dsinfo_from_aggmetadata(ds_path, path, recursive, db):
datalad/metadata/aggregate.py:        if path_is_subpath(agginfo_path, seed_ds):
datalad/metadata/aggregate.py:            elif any(path_is_subpath(ep, rp) for ep in exclude):
datalad/metadata/metadata.py=def _get_containingds_from_agginfo(info, rpath):
datalad/metadata/metadata.py:             if path_is_subpath(rpath, subds)],
datalad/metadata/metadata.py=def legacy_query_aggregated_metadata(reporton, ds, aps, recursive=False,
datalad/metadata/metadata.py:                              path_is_subpath(sub, rpath)]
datalad/metadata/metadata.py=class Metadata(Interface):
datalad/metadata/metadata.py:                if parentds and not path_is_subpath(dspath, parentds[-1]):
datalad/support/gitrepo.py=class GitRepo(CoreGitRepo):
datalad/support/gitrepo.py:                if any(path_is_subpath(str(f2), f_str) for f2 in status_state['added']):
datalad/support/parallel.py=def no_parentds_in_futures(futures, path, skip=tuple()):
datalad/support/parallel.py:    return all(not path_is_subpath(path, p) or p in skip for p in futures)
datalad/support/parallel.py=def no_subds_in_futures(futures, path, skip=tuple()):
datalad/support/parallel.py:    return all(not path_is_subpath(p, path) or p in skip for p in futures)
datalad/utils.py=def path_startswith(path, prefix):
datalad/utils.py:def path_is_subpath(path, prefix):
datalad/utils.py=def import_module_from_file(modpath, pkg=None, log=lgr.debug):
datalad/utils.py:            if path_is_subpath(modpath, pkgpath):

and I think it is just a matter of fixing the tests, so submitted #6955 . Let's see how it behaves on windows

try:
self.relative_to(*other)
return True
except ValueError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a guarantee that ValueError would be raised only if not relative, and not for some other (malformatted value etc) cases, thus possibly making it hard to debug. I hope stdlib has lots of typing annotation and testing to guarantee that. Aforementioned code snippet might provide remedy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the implementation shipped with Python 3.9. What can be done to resolve your concern?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concern is minor, but could be resolved through not using it, e.g. as proposed in https://github.com/datalad/datalad/pull/6942/files#r948226105 below

Comment on lines +2383 to +2390
paths = [self.pathobj / p for p in paths]
# contrain the report by the given paths
modinfo = {
# modpath is absolute
modpath: modprobs
for modpath, modprobs in modinfo.items()
# is_relative_to() also match equal paths
if any(is_relative_to(modpath, p) for p in paths)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutating input variables often makes it harder to debug that code later since causes confusion (docstring says paths is relative -- how they are absolute?), so I would have preferred:

Suggested change
paths = [self.pathobj / p for p in paths]
# contrain the report by the given paths
modinfo = {
# modpath is absolute
modpath: modprobs
for modpath, modprobs in modinfo.items()
# is_relative_to() also match equal paths
if any(is_relative_to(modpath, p) for p in paths)
abspaths = [self.pathobj / p for p in paths]
# contrain the report by the given paths
modinfo = {
# modpath is absolute
modpath: modprobs
for modpath, modprobs in modinfo.items()
# is_relative_to() also match equal paths
if any(is_relative_to(modpath, p) for p in abspaths)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that this is potentially O(N^2) operation (becomes expensive) if paths provided for all submodules (e.g smth like subdatasets sub-* where each sub- is a subdataset). It reminded me about get_parent_paths solving similar problem -- please see if it could be used here as well. If not -- we should be able to come up with at least O(N*log(N)) solution here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't check if that is due to this particular code path but here is comparative timing of master (<.3sec) and this branch (>3 sec):

(git)lena:~datalad/datalad-master[master]git
$> datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
datalad 0.17.2+102.g2c0073395
1113
python -c   0.26s user 0.04s system 107% cpu 0.277 total
wc -l  0.00s user 0.00s system 0% cpu 0.277 total

$> git co -
Switched to branch 'bf-6940'
Your branch is up to date with 'gh-mih/bf-6940'.

$> datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l; )
datalad 0.17.2+101.gb36148675
1113
python -c   3.78s user 0.02s system 100% cpu 3.777 total
wc -l  0.00s user 0.00s system 0% cpu 3.777 total

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can replicate this. I have not looked into the reasons, but I will state upfront that "give me all subdatasets" is a more common use case than "give my all subdatasets for the paths of all subdatasets". Even if the slow-down for this use case is unavoidable it does not outweigh the substantial speed-up for the common case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm that applying the path-constraint is the source of the slowness. I have no idea how to fix this in a reasonable time frame. Please advice on how to proceed.

The remaining aspect in this conversation is

docstring says paths is relative -- how they are absolute?

They are not absolute, as the docs say. But they have to become absolute to enable the comparison with the paths reported by _parse_gitmodules().

The path conversion has no significant impact on the timing.

Copy link
Member Author

@mih mih Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so it read from your statement that this PR is blocked until someone investigated whether this edge case can be made faster to some degree.

FTR: Here are the timings on the HCP test case without the path constraints (only subdatasets; different from the model case with almost no subdatasets):

# this PR 0.17.2+102.g8b49a71d7
>>> timeit list(GitRepo('.').get_submodules_())
49.6 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master 0.17.2+99.gbacdc8e8f
>>> timeit list(GitRepo('.').get_submodules_())
46.8 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Together they should give a feel for the performance change spectrum.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is a patch which seems to work as desired

diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
index a437fc5a2..0776f5805 100644
--- a/datalad/support/gitrepo.py
+++ b/datalad/support/gitrepo.py
@@ -59,6 +59,7 @@ from datalad.dataset.gitrepo import (
     path_based_str_repr,
 )
 from datalad.log import log_progress
+from datalad.support.path import get_parent_paths
 from datalad.support.due import (
     Doi,
     due,
@@ -2367,32 +2368,14 @@ class GitRepo(CoreGitRepo):
             # below
             return
 
-        # taken from 3.9's pathlib, can be removed once the minimally
-        # supported python version
-        def is_relative_to(self, *other):
-            """Return True if the path is relative to another path or False.
-            """
-            try:
-                self.relative_to(*other)
-                return True
-            except ValueError:
-                return False
-
+        mod_relpaths = [str(m.relative_to(self.pathobj)) for m in modinfo]
         if paths:
-            # ease comparison
-            paths = [self.pathobj / p for p in paths]
-            # contrain the report by the given paths
-            modinfo = {
-                # modpath is absolute
-                modpath: modprobs
-                for modpath, modprobs in modinfo.items()
-                # is_relative_to() also match equal paths
-                if any(is_relative_to(modpath, p) for p in paths)
-            }
+            # constrain the report by the given paths
+            mod_relpaths = get_parent_paths(paths, mod_relpaths, only_with_parents=True)
         for r in self.call_git_items_(
             ['ls-files', '--stage', '-z'],
             sep='\0',
-            files=[str(p.relative_to(self.pathobj)) for p in modinfo.keys()],
+            files=mod_relpaths,
             read_only=True,
             keep_ends=True,
         ):

and brings it down to

$> datalad --version; (builtin cd /home/yoh/datalad/hcp-openaccess; time python -c "from datalad.support.gitrepo import *; from glob import glob; print('\n'.join(map(str,GitRepo('.').get_submodules_(glob('HCP1200/*')))))" | wc -l ; )
datalad 0.17.2+102.g8b49a71d7.dirty
1113
python -c   0.18s user 0.03s system 110% cpu 0.194 total
wc -l  0.00s user 0.00s system 0% cpu 0.194 total

so as well faster than current master and removing the need for borrowed is_relative I questioned above. Known annotated tests for get_submodules pass

$> DATALAD_TESTS_SSH=1 python -m pytest -s -v -k get_submodule datalad
======================================================================= test session starts =======================================================================
platform linux -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0 -- /home/yoh/proj/datalad/datalad-master/venvs/dev3/bin/python
cachedir: .pytest_cache
rootdir: /home/yoh/proj/datalad/datalad-master, configfile: tox.ini
plugins: fail-slow-0.2.0, cov-3.0.0
collected 1304 items / 1301 deselected / 3 selected                                                                                                               

datalad/support/tests/test_annexrepo.py::test_AnnexRepo_get_submodules SKIPPED (TODO)
datalad/support/tests/test_gitrepo.py::test_GitRepo_get_submodules PASSED
datalad/support/tests/test_gitrepo.py::test_get_submodules_parent_on_unborn_branch PASSED

but I feel that we might benefit from more tests here!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would not work on Windows. At least the docs of that function say that it needs posix paths. So it seems an explicit conversion to posix is necessary, and in a way that is consistent with what git does for the report from ls-files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB note that appveyor tests on windows pass. I feel that there might be a need for a dedicated test here for any type of fix.

indeed -- there is that docstring! and more confusing:

and both cases are on path of paths, and both are added within the same commit where that function was added 5fefda4 .

Looking at the code, I do not see any "POSIX-specifics" of any kind, so I think that the comment and hardcoded use of / could go and make POSIX assumption optional (so use on git output could still be used). I will propose a PR for that because I do feel that currently that function could indeed be buggy.

proposed patch could do what we do in the other use of that function within get_content_info -- paths = map(ut.PurePosixPath, paths) first to ensure that they are POSIX, after all we are to compare to modules paths which should be POSIX right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submitted #6963 . While looking at the code realized that above I am pretty much suggesting what get_content_info was doing when called from get_submodules_() so to some effect we are just moving that logic out. Indeed would need aforementioned map(ut.PurePosixPath, paths) to be added to the patch but otherwise -- should be fine.

datalad/support/gitrepo.py Show resolved Hide resolved
@mih
Copy link
Member Author

mih commented Aug 17, 2022

The fix for #6941 is also cherry-picked into a PR against maint #6944

#6941 is about circular dependency and #6944 about comment symbols quotation... although could be related - I do not see how. Could you please expain?

What is the question? Why it is a BF PR against maint? Why it is also here? If the latter: The roundtrip until a patch against maint becomes available in master is often taking quite a bit. If the patch is removed tests will fail, because they exploit #6941

@@ -50,6 +50,7 @@
from datalad.utils import (
Path,
get_home_envvars,
on_windows,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer needed AFAIK

Suggested change
on_windows,

@yarikoptic
Copy link
Member

The fix for #6941 is also cherry-picked into a PR against maint #6944

#6941 is about circular dependency and #6944 about comment symbols quotation... although could be related - I do not see how. Could you please expain?

What is the question? Why it is a BF PR against maint? Why it is also here? If the latter: The roundtrip until a patch against maint becomes available in master is often taking quite a bit. If the patch is removed tests will fail, because they exploit #6941

I was not asking about maint per se. You said that "The fix for #6941" (where #6941 is about circular dependency) "is also cherry-picked into a PR against maint" as "#6944" which is merely about comments parsing. So I could not understand how comments parsing could resolve circular dependency. That is why was asking. So - is comments parsing resolving circular dependency issue somehow? or what this sentence was supposed to mean?

@yarikoptic
Copy link
Member

travis - unrelated stall on is_available - known/not yet resolved. appveyor is all green.

Restrict submodules to those under `paths`. Paths must be relative
to the resolved repository root, and must be normed to match the
reporting done by Git, i.e. no parent dir components
(ala "some/../this").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if below for Returns we could clarify that paths below for submodules are Posix paths, or they aren't?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no definitive answer. I took this description from the documentation. I did not plan to make changes to the reporting or parameterization.

yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Aug 18, 2022
…e / as path separator

Code review of this function came up while reviewing the
datalad#6942 and where get_parent_paths,
which was pretty much "avoided" while resolving infinite recursion of
`get_submodules_ -> get_content_info -> get_submodules_`, was re-introduced
specifically within get_submodules_ .

BF: use consistently "sep" instead of / for check within the loop, and then
os.sep to check if points to up/down folders.

ENH:

- sep became an option so we could use it also with os.sep if so desired
  on paths which are not POSIXed.  By default it would remain "POSIX"
  (although checks were not checking for the POSIX endings before).

- The test became parametric to test for various scenarios of using \ vs /
  vs "the default" (/ - POSIX).
@mih
Copy link
Member Author

mih commented Aug 20, 2022

@yarikoptic feel free to take over here, you seem to have plans for a deeper dive. I was just here for a quick contribution.

@yarikoptic
Copy link
Member

@yarikoptic feel free to take over here, you seem to have plans for a deeper dive. I was just here for a quick contribution.

Not much deeper than I suggested above . Thanks for the blessing - I will push some commits along those lines later.

@yarikoptic
Copy link
Member

@yarikoptic feel free to take over here, you seem to have plans for a deeper dive. I was just here for a quick contribution.

Not much deeper than I suggested above . Thanks for the blessing - I will push some commits along those lines later.

submitted #6974 but made a "mistake" of adding a unittest which unearthed a problem needed fixing (paths=[] is not None but logic implied otherwise) to make my proposed solution work properly (I checked maint -- was ok without it, didn't check this PR but most likely it was ok too).

Both tests pass on master and fail here ATM.  The 2nd test, matching for absent s_unknown
result returns all submodules but even without gitmodule_name.
@codeclimate
Copy link

codeclimate bot commented Aug 23, 2022

Code Climate has analyzed commit ffa77a0 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 1

View more on Code Climate.

@yarikoptic
Copy link
Member

@mih - I ran into need for more work on get_parent_paths since its goal is a bit different for matching paths within submodules not limiting some paths by other paths... needs a bit more work.

so I am ok with you solution, even if later to be improved later. But the tests I added in #6974 fails here and passes on master, so we would need to at least fix it up to not introduce regression.

@mih
Copy link
Member Author

mih commented Aug 31, 2022

Closing due to personal overload.

@mih mih closed this Aug 31, 2022
adswa added a commit to adswa/datalad that referenced this pull request Nov 21, 2022
- Fix a typo
- Clean up an unused import statement
- check if paths from .gitmodules and paths passed to the command are
  relative to eachother from both directions - because we might recurse
  deeper into a dataset hierarchy with a subdatasets call, the test
  test_get_subdatasets failed when we soley checked from one direction.
@yarikoptic yarikoptic deleted the bf-6940 branch January 20, 2023 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improve performance of an existing feature semver-minor Increment the minor version when merged stale-PR-closed-without-merge
Projects
None yet
3 participants