New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RF: ls-files only on provided paths (if given) #3508
Conversation
While you're waiting for Travis to start, here's a failure it will report:
|
and here is seems only 1 more https://travis-ci.org/datalad/datalad/jobs/553440611
|
quick summary for the last one (possibly also relates to the previous one on submodules) -- (git-annex)hopa:~/.tmp/datalad_temp_test_subds_pathyh6f2zwi[master]
$> git ls-files --stage -d -m --exclude-standard sub/some.txt
$> git ls-tree HEAD -r --full-tree -l sub/some.txt
160000 commit ca339d7604489ad84ba6a3086c7a3846b1869a72 - sub and even though
so we end up with the file within submodule (now we do not sort into submodules first I guess) being considered removed since "known" to HEAD and not to the worktree |
$> git ls-files --stage -d -m --exclude-standard sub/
160000 ca339d7604489ad84ba6a3086c7a3846b1869a72 0 sub |
Codecov Report
@@ Coverage Diff @@
## master #3508 +/- ##
===========================================
- Coverage 58.3% 43.45% -14.86%
===========================================
Files 269 269
Lines 34947 34945 -2
===========================================
- Hits 20377 15186 -5191
- Misses 14570 19759 +5189
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #3508 +/- ##
==========================================
+ Coverage 70.22% 76.98% +6.75%
==========================================
Files 272 272
Lines 35235 35287 +52
==========================================
+ Hits 24745 27164 +2419
+ Misses 10490 8123 -2367
Continue to review full report at Codecov.
|
wow -- 3 "interesting" failures resurfaced, related to submodules, whenever I started to use `get_submodules` in the `get_content_info`
|
filed an issue gitpython-developers/GitPython#890 for |
At a first glance I think this should move to |
Need to get back to it asap since could not realistically run |
The save import is no longer necessary as of 4b056a2 (BF/RF: Automagically find and import a datasetmethod if not yet bound, 2019-02-10).
In addition to keeping with our general direction of moving away from GitPython, this avoids an unresolved issue in GitPython's handling of submodules [0]. [0]: datalad#3508 (comment)
I've pushed the following updates to this PR:
I haven't done a close review of the original commits in this PR, but I'm pushing this now so that we can see where we are with the tests. (The problematic tests from the previous run pass for me locally.) range-diff
|
@@ -2362,6 +2364,62 @@ def gc(self, allow_background=False, auto=False): | |||
cmd_options += ['--auto'] | |||
self._git_custom_command('', cmd_options) | |||
|
|||
def _parse_gitmodules(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes from this commit are almost purely code movement and are probably easiest to review with something like git show --color-moved=dimmed_zebra --color-moved-ws=allow-indentation-change 23f30ba3c
.
In addition to keeping with our general direction of moving away from GitPython, this avoids an unresolved issue in GitPython's handling of submodules [0]. [0]: datalad#3508 (comment)
THANK YOU! @kyleam |
Travis is still running, but there's widespread failure on appveyor. I'm guessing that's due to my changes, but I haven't looked deeper yet. |
I said:
It seems that the original PR (tip at b8af9f5) has a similar set of failures. I didn't check them one by one, so it's possible that my update introduces other errors, but it seems the original change has windows-compatibility issues that we need to sort out. |
I said:
Never mind, it seems we're about to get hit with a set of AppVeyor failures on master. Here's a run with a noop commit on 5e410c2 (CHANGELOG.md: Second batch for 0.12.0rc5, 2019-07-26): https://ci.appveyor.com/project/mih/datalad/build/job/0ogacr7kj8ol05am The run for 5e410c2 five days ago was fine: https://ci.appveyor.com/project/mih/datalad/builds/26278166 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left comments about original (moved) code which might be worth some discussion may be.
Meanwhile, since tests pass, I will try to benchmark "manually" and report back. Otherwise it might be ready -- it would be great to see this merged ;-)
for k, v in iteritems(db): | ||
if not k.startswith('submodule.'): | ||
# we don't know what this is | ||
lgr.debug("Skip unrecognized .gitmodule specification: %s=%s", k, v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if, while at it, we should raise to to WARNING level?! Hiding possible problems and unexpected situations could just cost us in the long run
datalad/support/gitrepo.py
Outdated
A generator that yields a dictionary with information for each | ||
submodule. | ||
""" | ||
if not ((self.pathobj / ".gitmodules").exists() and self.get_hexsha()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re self.get_hexsha()
... With this changes we will be returning submodules are known to .gitmodules
, either dirty or not . Restricting by demanding also having a commit, although very unlikely and impossible when working with datalad datasets, seems to be not necessary. Or what am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restricting by demanding also having a commit, although very unlikely and impossible when working with datalad datasets, seems to be not necessary.
Good point. It's not necessary. I was porting the self.repo.head.is_valid()
check from GitRepo.get_submodules()
, but that indeed changes the behavior of datalad subdatasets
when there is no commit checked out. So GitRepo.get_submodules()
and subdatasets()
differ in their treatment of this case, and for the rewrite of get_submodules()
I agree we should use the less restrictive case. Updated.
range-diff: v2 vs v3
1: ebadb1c42 = 1: ebadb1c42 CLN: subdatasets: Drop unused imports
-: --------- > 2: 71e5c8ea6 TST: subdatasets: Test for "no commit checked out" case
2: 23f30ba3c ! 3: 7e9940945 MV: gitrepo: Absorb custom submodule parser from subdatasets()
@@ -170,7 +170,7 @@
+ return out
+
+ def get_submodules_(self, paths=None):
-+ if not ((self.pathobj / ".gitmodules").exists() and self.get_hexsha()):
++ if not (self.pathobj / ".gitmodules").exists():
+ return
+
+ modinfo = self._parse_gitmodules()
3: b488de78c ! 4: 1f8b99dff DOC: gitrepo: Add docstring for get_submodules_()
@@ -21,6 +21,6 @@
+ A generator that yields a dictionary with information for each
+ submodule.
+ """
- if not ((self.pathobj / ".gitmodules").exists() and self.get_hexsha()):
+ if not (self.pathobj / ".gitmodules").exists():
return
4: 9392e2113 ! 5: bea8551d7 RF: gitrepo: Rewrite get_submodules() to avoid GitPython
@@ -6,6 +6,13 @@
GitPython, this avoids an unresolved issue in GitPython's handling of
submodules [0].
+ As documented by the new test, there is a change in behavior when
+ get_submodules() is called in a repository that doesn't have a commit
+ checked out: it now returns any registered submodules instead of an
+ empty list. This is consistent with the behavior of 'git submodule'
+ and 'datalad subdataset', and there doesn't seem to be an obvious
+ reason not to support it.
+
[0]: https://github.com/datalad/datalad/pull/3508#issuecomment-508462763
diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
@@ -86,3 +93,24 @@
def is_submodule_modified(self, name, options=[]):
"""Whether a submodule has new commits
+
+ diff --git a/datalad/support/tests/test_gitrepo.py b/datalad/support/tests/test_gitrepo.py
+ --- a/datalad/support/tests/test_gitrepo.py
+ +++ b/datalad/support/tests/test_gitrepo.py
+@@
+ raise SkipTest("TODO")
+
+
++@with_tempfile
++def test_get_submodules_parent_on_unborn_branch(path):
++ repo = GitRepo(path, create=True)
++ subrepo = GitRepo(op.join(path, "sub"), create=True)
++ subrepo.commit(msg="s", options=["--allow-empty"])
++ repo.add_submodule(path="sub")
++ eq_([s.name for s in repo.get_submodules()],
++ ["sub"])
++
++
+ def test_kwargs_to_options():
+
+ class Some(object):
5: 52737fdea = 6: 60f889d7d RF: request status only on provided paths (if given)
6: dba239dec = 7: 5fefda437 NF: get_parent_paths to be able to quickly determine within repo paths if paths within submodules provided
7: ad93ab7d6 = 8: f8c41e0b8 BM: benchmark suite for get_parent_paths
8: fa14633c2 = 9: bfa08474c BF: make use of get_parent_paths to mitigate difference in ls-tree and ls-files behavior on paths within submodules
9: d64ef83e7 = 10: e7f4578de DOC: Use public-inbox for Git mailing list link
10: c7905e74d = 11: 5a77f6ba2 OPT: do not sort, maintain a set of prior hits
# bring into traditional shape | ||
for name, props in iteritems(mods): | ||
if 'path' not in props: | ||
lgr.debug("Failed to get '%s.path', skipping section", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here -- WARNING instead of DEBUG?
modprops = {'gitmodule_{}'.format(k): v | ||
for k, v in iteritems(props) | ||
if not (k.startswith('__') or k == 'path')} | ||
modpath = self.pathobj / PurePosixPath(props['path']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if eventually we should RF this to store just props['path']
instead of constructing full path here with intention later on to convert to relative.
subdatasets() lists registered submodules even if the parent is on an unborn branch. Explicitly test this behavior. Re: datalad#3508 (comment)
Move our custom submodule logic to gitrepo.py so that we can rewrite GitRepo.get_submodules() to use it rather than GitPython.
In addition to keeping with our general direction of moving away from GitPython, this avoids an unresolved issue in GitPython's handling of submodules [0]. As documented by the new test, there is a change in behavior when get_submodules() is called in a repository that doesn't have a commit checked out: it now returns any registered submodules instead of an empty list. This is consistent with the behavior of 'git submodule' and 'datalad subdataset', and there doesn't seem to be an obvious reason not to support it. [0]: datalad#3508 (comment)
Intends to at least partially address datalad#3506 where in a repository with lots of already tracked files, adding more files by providing their paths would lead only to the heavy CPU load due to paths matching then performed on DataLad level instead of restricting initial query to git ls-files only to the paths of interest. Locally I have ran all datalad/core tests and no failures were detected
…s if paths within submodules provided
…d ls-files behavior on paths within submodules see https://www.spinics.net/lists/git/msg362459.html for an example.
The main advantage of using public inbox is that it constructs the URL with the message ID, making it easier to find the message even if public-inbox.org/git is no longer around.
Seems to provide some (~5%) performance benefit x 4.57±0.02ms 4.19±0.1ms 0.92 paths.get_parent_paths.time_allsubmods_toplevel [hopa/virtualenv-py2.7] x 5.52±0.05ms 5.07±0.06ms 0.92 paths.get_parent_paths.time_allsubmods_toplevel [hopa/virtualenv-py3.7] x 3.85±0.06ms 3.79±0.04ms 0.98 paths.get_parent_paths.time_allsubmods_toplevel_only [hopa/virtualenv-py2.7] x 4.82±0.03ms 4.64±0.03ms 0.96 paths.get_parent_paths.time_allsubmods_toplevel_only [hopa/virtualenv-py3.7] x 257±3ns 258±5ns 1.00 paths.get_parent_paths.time_no_submods [hopa/virtualenv-py2.7] x 243±1ns 250±5ns 1.03 paths.get_parent_paths.time_no_submods [hopa/virtualenv-py3.7] x 3.33±0.04ms 3.20±0.01ms 0.96 paths.get_parent_paths.time_one_submod_subdir [hopa/virtualenv-py2.7] x 4.11±0.04ms 4.07±0.02ms 0.99 paths.get_parent_paths.time_one_submod_subdir [hopa/virtualenv-py3.7] x 3.36±0.04ms 3.18±0.01ms 0.95 paths.get_parent_paths.time_one_submod_toplevel [hopa/virtualenv-py2.7] x 4.19±0.2ms 4.04±0.03ms 0.96 paths.get_parent_paths.time_one_submod_toplevel [hopa/virtualenv-py3.7]
Returns | ||
------- | ||
A list of paths (without duplicaates), where some entries replaced with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kyleam - if you were to push more, please fix up this typo in duplicaates
, sorry about that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't plan on pushing more. I've got you around the GitPython failures. Take back your PR :]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great
Then let's take advantage of Germans venturing again to the West and merge it. I will do it locally and fix that typo and then push directlyso we don't stress Travis for no reason
rushed a bit -- travis wasn't yet fully done, but was fully green before. merged locally and pushed to master |
Intends to at least partially address #3506
where in a repository with lots of already tracked files, adding more files by providing
their paths would lead only to the heavy CPU load due to paths matching then performed
on DataLad level instead of restricting initial query to git ls-files only to the paths
of interest.
Locally I have ran all datalad/core tests and no failures were detected. This is a PR to see if it would potentially cause some breakage elsewhere.
TODOs