RF: ls-files only on provided paths (if given) #3508

Closed
wants to merge 11 commits into from

Conversation

yarikoptic
Member

@yarikoptic yarikoptic commented Jul 2, 2019

Intended to at least partially address #3506: in a repository with many
already-tracked files, adding more files by providing their paths leads to
heavy CPU load, because path matching is performed at the DataLad level
instead of restricting the initial git ls-files query to the paths of
interest.
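
As a rough illustration of the idea (not the code in this PR), the difference is between asking Git for every tracked file and filtering in Python, versus handing the paths to git ls-files as pathspecs. The helper name `build_ls_files_cmd` is hypothetical; the only Git features assumed are the `--stage` and `-z` options of `git ls-files` and the `--` pathspec separator.

```python
def build_ls_files_cmd(paths=None):
    """Return an argv list for `git ls-files`, limited to `paths` if given.

    Hypothetical helper for illustration only; DataLad's real call goes
    through its own command runner.
    """
    cmd = ["git", "ls-files", "--stage", "-z"]
    if paths:
        # "--" separates options from pathspecs, so Git itself restricts
        # the listing instead of the caller filtering every tracked file.
        cmd.append("--")
        cmd.extend(str(p) for p in paths)
    return cmd


print(build_ls_files_cmd())
print(build_ls_files_cmd(["sub/some.txt", "data/a.dat"]))
```

With no paths the command is unchanged, so callers that want the full listing are unaffected.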

Locally I have run all datalad/core tests and no failures were detected. This PR is to see whether the change causes breakage elsewhere.

TODOs

  • make sure nothing is broken
  • benchmark

@yarikoptic yarikoptic added the do not merge Not to be merged, will trigger WIP app "not passed" status label Jul 2, 2019
@yarikoptic yarikoptic changed the title RF: request status only on provided paths (if given) RF: ls-files only on provided paths (if given) Jul 2, 2019
@kyleam
Contributor

kyleam commented Jul 2, 2019

make sure nothing is broken

While you're waiting for Travis to start, here's a failure it will report:

 python -m nose -vs datalad/local/tests/test_subdataset.py:test_get_subdatasets
datalad.local.tests.test_subdataset.test_get_subdatasets ... FAIL
[...]

======================================================================
FAIL: datalad.local.tests.test_subdataset.test_get_subdatasets
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/kyle/src/python/venvs/datalad/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/kyle/src/python/datalad/datalad/tests/utils.py", line 783, in newfunc
    t(*(arg + (uri,)), **kw)
  File "/home/kyle/src/python/datalad/datalad/local/tests/test_subdataset.py", line 79, in test_get_subdatasets
    'sub dataset1/sub sub dataset1/subm 1',
AssertionError: [] != ['sub dataset1/2', 'sub dataset1/sub sub dataset1', 'sub dataset1/sub sub dataset1/2', 'sub dataset1/sub sub dataset1/subm 1']
-------------------- >> begin captured logging << --------------------
datalad.utils: Level 5: Importing datalad.utils
datalad.utils: DEBUG: Maximal length of cmdline string (adjusted for safety margin): 1252864
datalad.utils: Level 5: Done importing datalad.utils
datalad.cmd: Level 9: Will use git under '/usr/lib/git-annex.linux' (no adjustments to PATH if empty string)
datalad.cmd: Level 9: Running: ['git', 'version']
datalad.cmd: Level 8: Finished running ['git', 'version'] with status 0
datalad.cmd: Level 9: Running: ['git', 'config', '-z', '-l', '--show-origin']
datalad.cmd: Level 8: Finished running ['git', 'config', '-z', '-l', '--show-origin'] with status 0
datalad.ui: Level 5: Starting importing ui
datalad.ui.dialog: Level 5: Starting importing ui.dialog
datalad.ui.dialog: Level 5: Done importing ui.dialog
datalad.ui: Level 5: Initiating UI switcher
datalad.ui: DEBUG: UI set to DialogUI(out=<TextIOWrapper>)
datalad.ui: Level 5: Done importing ui
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 12.685s

FAILED (failures=1)

@yarikoptic
Member Author

and here it seems there is only 1 more: https://travis-ci.org/datalad/datalad/jobs/553440611

FAIL: datalad.support.tests.test_fileinfo.test_subds_path
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/tests/utils.py", line 607, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/support/tests/test_fileinfo.py", line 185, in test_subds_path
    assert_equal(stat[subds.repo.pathobj]['state'], 'clean')
AssertionError: 'deleted' != 'clean'
- deleted
+ clean

@yarikoptic
Member Author

quick summary for the last one (possibly also related to the previous one on submodules) -- ls-files (used for the work tree) does not look into submodules for a path within them, whereas ls-tree (used for HEAD) does return the submodule entry

(git-annex)hopa:~/.tmp/datalad_temp_test_subds_pathyh6f2zwi[master]
$> git ls-files --stage -d -m --exclude-standard sub/some.txt

$> git ls-tree HEAD -r --full-tree -l sub/some.txt           
160000 commit ca339d7604489ad84ba6a3086c7a3846b1869a72       -	sub

and even though --help for ls-files lists --recurse-submodules, it is not supported in this mode:

$> git ls-files --recurse-submodules --stage -d -m --exclude-standard sub/some.txt
fatal: ls-files --recurse-submodules unsupported mode

$> git --version
git version 2.22.0.455.g172b71a6c5

so the file within the submodule ends up being considered removed, since it is "known" to HEAD but not to the worktree (I guess we no longer sort paths into submodules first)

@yarikoptic
Member Author

ls-files does, though, list the submodule itself if the submodule path is provided:

$> git ls-files --stage -d -m --exclude-standard sub/        
160000 ca339d7604489ad84ba6a3086c7a3846b1869a72 0	sub
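
One way to reconcile the two commands, sketched here under assumptions, is to replace any requested path that lies inside a submodule with the submodule path before querying, since both ls-files and ls-tree report the submodule entry for that path. The function `to_submodule_paths` below is a hypothetical stand-in and not the `get_parent_paths` helper added in this PR; it assumes POSIX-style paths relative to the repository root.

```python
from pathlib import PurePosixPath


def to_submodule_paths(paths, submodules):
    """Replace any path located inside a submodule with the submodule path.

    Hypothetical sketch; `paths` and `submodules` are POSIX-style strings
    relative to the repository root.
    """
    submods = {PurePosixPath(s) for s in submodules}
    seen = set()
    out = []
    for p in map(PurePosixPath, paths):
        # Check the path itself and then each ancestor, nearest first;
        # fall back to the path unchanged if no submodule contains it.
        hit = next((c for c in (p, *p.parents) if c in submods), p)
        if hit not in seen:
            seen.add(hit)
            out.append(str(hit))
    return out


print(to_submodule_paths(["sub/some.txt", "file.txt", "sub/other.txt"], ["sub"]))
```

Several paths within the same submodule collapse into a single query for that submodule, and the set of prior hits keeps the de-duplication cheap.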

@codecov

codecov bot commented Jul 4, 2019

Codecov Report

Merging #3508 into master will decrease coverage by 14.85%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master    #3508       +/-   ##
===========================================
- Coverage    58.3%   43.45%   -14.86%     
===========================================
  Files         269      269               
  Lines       34947    34945        -2     
===========================================
- Hits        20377    15186     -5191     
- Misses      14570    19759     +5189
Impacted Files Coverage Δ
datalad/support/gitrepo.py 65.73% <100%> (-0.06%) ⬇️
datalad/support/tests/test_fileinfo.py 12.24% <0%> (-87.76%) ⬇️
datalad/support/tests/test_stats.py 13.11% <0%> (-86.89%) ⬇️
datalad/support/tests/test_repodates.py 13.46% <0%> (-86.54%) ⬇️
datalad/tests/test_protocols.py 14.81% <0%> (-85.19%) ⬇️
datalad/tests/test_dochelpers.py 15.49% <0%> (-84.51%) ⬇️
datalad/tests/test_config.py 14.88% <0%> (-83.93%) ⬇️
datalad/tests/test_constraints.py 16.77% <0%> (-83.23%) ⬇️
datalad/support/tests/test_locking.py 12.19% <0%> (-82.93%) ⬇️
datalad/support/tests/test_gitrepo.py 17.23% <0%> (-82.65%) ⬇️
... and 57 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2fccc29...1760451. Read the comment docs.

@codecov

codecov bot commented Jul 4, 2019

Codecov Report

Merging #3508 into master will increase coverage by 6.75%.
The diff coverage is 88.52%.

@@            Coverage Diff             @@
##           master    #3508      +/-   ##
==========================================
+ Coverage   70.22%   76.98%   +6.75%     
==========================================
  Files         272      272              
  Lines       35235    35287      +52     
==========================================
+ Hits        24745    27164    +2419     
+ Misses      10490     8123    -2367
Impacted Files Coverage Δ
datalad/support/tests/test_path.py 92.3% <ø> (+4.3%) ⬆️
datalad/support/path.py 72.83% <100%> (+10.83%) ⬆️
datalad/local/subdatasets.py 70.19% <100%> (+29.89%) ⬆️
datalad/support/gitrepo.py 67.31% <87.93%> (+15.61%) ⬆️
datalad/interface/tests/test_ls_webui.py 92.64% <0%> (-7.36%) ⬇️
datalad/tests/test_dochelpers.py 100% <0%> (ø) ⬆️
datalad/interface/base.py 82.17% <0%> (+0.33%) ⬆️
datalad/support/tests/test_annexrepo.py 96.14% <0%> (+0.4%) ⬆️
datalad/core/local/diff.py 35.06% <0%> (+0.44%) ⬆️
datalad/core/local/tests/test_save.py 97.82% <0%> (+0.54%) ⬆️
... and 56 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 041e114...5a77f6b. Read the comment docs.

@yarikoptic
Member Author

wow -- 3 "interesting" failures resurfaced, related to submodules, once I started to use `get_submodules` in `get_content_info`
======================================================================
ERROR: datalad.core.local.tests.test_create.test_create_subdataset_hierarchy_from_top
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/tests/utils.py", line 434, in newfunc
    return t(*(arg + (d,)), **kw)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/core/local/tests/test_create.py", line 272, in test_create_subdataset_hierarchy_from_top
    subsubds = subds.create('subsub', force=True)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/distribution/dataset.py", line 525, in apply_func
    return f(**kwargs)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 491, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 479, in return_func
    results = list(results)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 428, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 520, in _process_results
    for res in results:
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/core/local/create.py", line 249, in __call__
    paths=[check_path.relative_to(parentds_path)])
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3077, in status
    eval_submodule_state=eval_submodule_state)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3132, in diffstatus
    _cache)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3168, in _diffstatus
    eval_file_type=eval_file_type)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 2909, in get_content_info
    submodules = [s.path for s in self.get_submodules()]
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 2337, in get_submodules
    submodules = self.repo.submodules
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/repo/base.py", line 324, in submodules
    return Submodule.list_items(self)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/util.py", line 934, in list_items
    out_list.extend(cls.iter_items(repo, *args, **kwargs))
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/objects/submodule/base.py", line 1159, in iter_items
    pc = repo.commit(parent_commit)         # parent commit instance
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/repo/base.py", line 466, in commit
    return self.rev_parse(text_type(rev) + "^0")
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/repo/fun.py", line 213, in rev_parse
    obj = name_to_object(repo, rev[:start])
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/repo/fun.py", line 147, in name_to_object
    raise BadName(name)
BadName: Ref 'HEAD' did not resolve to an object
======================================================================
ERROR: datalad.core.local.tests.test_create.test_create_raises
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/tests/utils.py", line 607, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/tests/utils.py", line 607, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/core/local/tests/test_create.py", line 89, in test_create_raises
    ds.create(obscure_ds, **raw),
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/distribution/dataset.py", line 525, in apply_func
    return f(**kwargs)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 491, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 479, in return_func
    results = list(results)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 428, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 520, in _process_results
    for res in results:
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/core/local/create.py", line 249, in __call__
    paths=[check_path.relative_to(parentds_path)])
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3077, in status
    eval_submodule_state=eval_submodule_state)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3132, in diffstatus
    _cache)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3168, in _diffstatus
    eval_file_type=eval_file_type)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 2909, in get_content_info
    submodules = [s.path for s in self.get_submodules()]
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 2337, in get_submodules
    submodules = self.repo.submodules
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/repo/base.py", line 324, in submodules
    return Submodule.list_items(self)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/util.py", line 934, in list_items
    out_list.extend(cls.iter_items(repo, *args, **kwargs))
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/objects/submodule/base.py", line 1188, in iter_items
    "Gitmodule path %r did not exist in revision of parent commit %s" % (p, parent_commit))
InvalidGitRepositoryError: Gitmodule path u'"ds- \\"\';a&b&c\u0394\u0419\u05e7\u0645\u0e57\u3042 `| "' did not exist in revision of parent commit HEAD
======================================================================
ERROR: datalad.interface.tests.test_save.test_bf1886
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/tests/utils.py", line 607, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/tests/test_save.py", line 394, in test_bf1886
    sub2 = create(opj(parent.path, 'sub2'))
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 491, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 479, in return_func
    results = list(results)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 428, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/interface/utils.py", line 520, in _process_results
    for res in results:
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/core/local/create.py", line 249, in __call__
    paths=[check_path.relative_to(parentds_path)])
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3077, in status
    eval_submodule_state=eval_submodule_state)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3132, in diffstatus
    _cache)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 3168, in _diffstatus
    eval_file_type=eval_file_type)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 2909, in get_content_info
    submodules = [s.path for s in self.get_submodules()]
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/datalad/support/gitrepo.py", line 2337, in get_submodules
    submodules = self.repo.submodules
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/repo/base.py", line 324, in submodules
    return Submodule.list_items(self)
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/util.py", line 934, in list_items
    out_list.extend(cls.iter_items(repo, *args, **kwargs))
  File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/git/objects/submodule/base.py", line 1193, in iter_items
    sm._name = n
AttributeError: 'Blob' object has no attribute '_name'
Maybe I should start using Michael's code for submodule detection? But it is not part of GitRepo but of `.local.subdatasets`

@yarikoptic
Member Author

filed an issue gitpython-developers/GitPython#890 for AttributeError: 'Blob' object has no attribute '_name'

@bpoldrack
Member

Maybe I should start using Michael's code for submodule detection? But it is not part of GitRepo but of `.local.subdatasets`

At first glance I think this should move to GitRepo eventually. Agree, @mih?

@yarikoptic
Member Author

yarikoptic commented Jul 24, 2019

Need to get back to this asap: I could not realistically run datalad run (and probably later save, I didn't even try) since there were ~200k files in the repo and ~1400 paths provided on the command line. Someone could do the math.
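
Doing that math, under the assumption of a naive scheme in which every tracked file is compared against every provided path (the real per-pair cost and matching strategy may differ):

```python
n_tracked = 200_000   # approximate number of files in the repository
n_paths = 1_400       # approximate number of paths on the command line

# Worst case for matching every tracked file against every requested path.
comparisons = n_tracked * n_paths
print(f"{comparisons:,} path comparisons")  # prints "280,000,000 path comparisons"
```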

The save import is no longer necessary as of 4b056a2 (BF/RF:
Automagically find and import a datasetmethod if not yet bound,
2019-02-10).
kyleam added a commit to yarikoptic/datalad that referenced this pull request Jul 31, 2019
In addition to keeping with our general direction of moving away from
GitPython, this avoids an unresolved issue in GitPython's handling of
submodules [0].

[0]: datalad#3508 (comment)
@kyleam
Contributor

kyleam commented Jul 31, 2019

I've pushed the following updates to this PR:

  • Rebase to deal with conflict.

  • Since we're rebasing anyway, squash some of the existing clean-up commits to their base commit.

  • Use text_type() rather than str() to convert pathlib objects. See 2f98134 (BF(py2): pathlib: Consistently use text_type for conversion, 2019-04-12).

  • Other minor touch-ups.

  • Main change: Move custom submodule logic from subdatasets.py to gitrepo.py, and rewrite GitRepo.get_submodules() to use it. This should get around the GitPython failures we're seeing in this PR.

I haven't done a close review of the original commits in this PR, but I'm pushing this now so that we can see where we are with the tests. (The problematic tests from the previous run pass for me locally.)

range-diff
 -:  --------- >  1:  ebadb1c42 CLN: subdatasets: Drop unused imports
 -:  --------- >  2:  23f30ba3c MV: gitrepo: Absorb custom submodule parser from subdatasets()
 -:  --------- >  3:  b488de78c DOC: gitrepo: Add docstring for get_submodules_()
 -:  --------- >  4:  9392e2113 RF: gitrepo: Rewrite get_submodules() to avoid GitPython
 1:  19a4999bf !  5:  52737fdea RF: request status only on provided paths (if given)
    @@ -22,7 +22,7 @@
     -                # homogenize wrt subdataset content paths across
     -                # ls-files and ls-tree
     -                None,
    -+                list(map(str, paths)) if paths else None,
    ++                list(map(text_type, paths)) if paths else None,
                      cmd,
                      log_stderr=True,
                      log_stdout=True,
 2:  5976254d3 =  6:  dba239dec NF: get_parent_paths to be able to quickly determine within repo paths if paths within submodules provided
 3:  2874f2d82 !  7:  ad93ab7d6 BM: benchmark suite for get_parent_paths
    @@ -2,10 +2,13 @@
     
         BM: benchmark suite for get_parent_paths
     
    - diff --git a/benchmarks/paths.py b/benchmarks/paths.py
    + diff --git a/benchmarks/support/__init__.py b/benchmarks/support/__init__.py
    + new file mode 100644
    +
    + diff --git a/benchmarks/support/path.py b/benchmarks/support/path.py
      new file mode 100644
      --- /dev/null
    - +++ b/benchmarks/paths.py
    + +++ b/benchmarks/support/path.py
     @@
     +# Import functions to be tested with _ suffix and name the suite after the
     +# original function so we could easily benchmark it e.g. by
    @@ -14,7 +17,7 @@
     +
     +from datalad.support.path import get_parent_paths as get_parent_paths_
     +
    -+from .common import SuprocBenchmarks
    ++from ..common import SuprocBenchmarks
     +
     +class get_parent_paths(SuprocBenchmarks):
     +
    @@ -37,13 +40,13 @@
     +        assert get_parent_paths_(self.posixpaths, [], True) == []
     +
     +    def time_one_submod_toplevel(self):
    -+        assert get_parent_paths_(self.posixpaths, ['submod9'], True) == ['submod9']
    ++        get_parent_paths_(self.posixpaths, ['submod9'], True)
     +
     +    def time_one_submod_subdir(self):
    -+        assert get_parent_paths_(self.posixpaths, ['subdir/submod9'], True) == ['subdir/submod9']
    ++        get_parent_paths_(self.posixpaths, ['subdir/submod9'], True)
     +
     +    def time_allsubmods_toplevel_only(self):
    -+        assert get_parent_paths_(self.posixpaths, self.toplevel_submods, True) == self.toplevel_submods
    ++        get_parent_paths_(self.posixpaths, self.toplevel_submods, True)
     +
     +    def time_allsubmods_toplevel(self):
     +        get_parent_paths_(self.posixpaths, self.toplevel_submods)
 4:  176045126 !  8:  fa14633c2 BF: make use of get_parent_paths to mitigate difference in ls-tree and ls-files behavior on paths within submodules
    @@ -19,7 +19,7 @@
                  # convert unconditionally
                  paths = [ut.PurePosixPath(p) for p in paths]
      
    -+        path_strs = list(map(str, paths)) if paths else None
    ++        path_strs = list(map(text_type, paths)) if paths else None
     +
              # this will not work in direct mode, but everything else should be
              # just fine
    @@ -42,7 +42,7 @@
              lgr.debug('Query repo: %s', cmd)
              try:
                  stdout, stderr = self._git_custom_command(
    --                list(map(str, paths)) if paths else None,
    +-                list(map(text_type, paths)) if paths else None,
     +                path_strs,
                      cmd,
                      log_stderr=True,
 -:  --------- >  9:  d64ef83e7 DOC: Use public-inbox for Git mailing list link
 5:  5bd64c58e ! 10:  c7905e74d OPT: do not sort, maintain a set of prior hits
    @@ -15,9 +15,9 @@
         x     3.36±0.04ms      3.18±0.01ms     0.95  paths.get_parent_paths.time_one_submod_toplevel [hopa/virtualenv-py2.7]
         x      4.19±0.2ms      4.04±0.03ms     0.96  paths.get_parent_paths.time_one_submod_toplevel [hopa/virtualenv-py3.7]
     
    - diff --git a/benchmarks/paths.py b/benchmarks/paths.py
    - --- a/benchmarks/paths.py
    - +++ b/benchmarks/paths.py
    + diff --git a/benchmarks/support/path.py b/benchmarks/support/path.py
    + --- a/benchmarks/support/path.py
    + +++ b/benchmarks/support/path.py
     @@
              # and some hierarchy of submodules
              self.nfiles = 40  # per each construct
 6:  b687931a0 <  -:  --------- RF: moved benchmarks/paths.py into benchmarks/support/path.py to reflect main code hierarchy
 7:  b8af9f5ce <  -:  --------- RF: strip assertions from benchmark - they are not tests, painful to maintain while RFing behavior

@@ -2362,6 +2364,62 @@ def gc(self, allow_background=False, auto=False):
cmd_options += ['--auto']
self._git_custom_command('', cmd_options)

def _parse_gitmodules(self):
Contributor

@kyleam kyleam Jul 31, 2019

The changes from this commit are almost purely code movement and are probably easiest to review with something like git show --color-moved=dimmed_zebra --color-moved-ws=allow-indentation-change 23f30ba3c.

@yarikoptic
Member Author

THANK YOU! @kyleam

@kyleam
Contributor

kyleam commented Jul 31, 2019

Travis is still running, but there's widespread failure on appveyor. I'm guessing that's due to my changes, but I haven't looked deeper yet.

@kyleam
Contributor

kyleam commented Jul 31, 2019

I said:

there's widespread failure on appveyor. I'm guessing that's due to my changes, but I haven't looked deeper yet.

It seems that the original PR (tip at b8af9f5) has a similar set of failures. I didn't check them one by one, so it's possible that my update introduces other errors, but it seems the original change has windows-compatibility issues that we need to sort out.

@kyleam
Contributor

kyleam commented Jul 31, 2019

I said:

but it seems the original change has windows-compatibility issues that we need to sort out.

Never mind, it seems we're about to get hit with a set of AppVeyor failures on master. Here's a run with a noop commit on 5e410c2 (CHANGELOG.md: Second batch for 0.12.0rc5, 2019-07-26): https://ci.appveyor.com/project/mih/datalad/build/job/0ogacr7kj8ol05am

The run for 5e410c2 five days ago was fine: https://ci.appveyor.com/project/mih/datalad/builds/26278166

Member Author

@yarikoptic yarikoptic left a comment

Just left comments about the original (moved) code, which might be worth some discussion.
Meanwhile, since tests pass, I will try to benchmark "manually" and report back. Otherwise it might be ready -- it would be great to see this merged ;-)

for k, v in iteritems(db):
    if not k.startswith('submodule.'):
        # we don't know what this is
        lgr.debug("Skip unrecognized .gitmodule specification: %s=%s", k, v)
Member Author

I wonder if, while at it, we should raise this to WARNING level?! Hiding possible problems and unexpected situations could just cost us in the long run

        A generator that yields a dictionary with information for each
        submodule.
        """
        if not ((self.pathobj / ".gitmodules").exists() and self.get_hexsha()):
Member Author

Re self.get_hexsha()... With these changes we will be returning submodules that are known to .gitmodules, whether dirty or not. Restricting by demanding also having a commit, although very unlikely and impossible when working with datalad datasets, seems to be not necessary. Or what am I missing?

Contributor

@kyleam kyleam Aug 1, 2019

Restricting by demanding also having a commit, although very unlikely and impossible when working with datalad datasets, seems to be not necessary.

Good point. It's not necessary. I was porting the self.repo.head.is_valid() check from GitRepo.get_submodules(), but that indeed changes the behavior of datalad subdatasets when there is no commit checked out. So GitRepo.get_submodules() and subdatasets() differ in their treatment of this case, and for the rewrite of get_submodules() I agree we should use the less restrictive case. Updated.
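
The difference between the two guards can be sketched as follows. This is an illustration under assumptions, not the GitRepo code: `lists_submodules` and its `require_commit` flag are hypothetical, and `head_sha` being None stands in for a repository on an unborn branch.

```python
import tempfile
from pathlib import Path


def lists_submodules(repo_path, head_sha, require_commit):
    """Return True if a submodule listing would proceed past the guard.

    Hypothetical sketch: `head_sha` is None on an unborn branch.
    """
    has_gitmodules = (Path(repo_path) / ".gitmodules").exists()
    if require_commit:
        # Stricter variant: also demand a checked-out commit.
        return has_gitmodules and head_sha is not None
    # Relaxed variant: a .gitmodules file is enough.
    return has_gitmodules


with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / ".gitmodules").write_text('[submodule "sub"]\n\tpath = sub\n')
    print(lists_submodules(repo, None, require_commit=True))
    print(lists_submodules(repo, None, require_commit=False))
```

Only the relaxed variant reports registered submodules when no commit is checked out.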

range-diff: v2 vs v3
 1:  ebadb1c42 =  1:  ebadb1c42 CLN: subdatasets: Drop unused imports
 -:  --------- >  2:  71e5c8ea6 TST: subdatasets: Test for "no commit checked out" case
 2:  23f30ba3c !  3:  7e9940945 MV: gitrepo: Absorb custom submodule parser from subdatasets()
    @@ -170,7 +170,7 @@
     +        return out
     +
     +    def get_submodules_(self, paths=None):
    -+        if not ((self.pathobj / ".gitmodules").exists() and self.get_hexsha()):
    ++        if not (self.pathobj / ".gitmodules").exists():
     +            return
     +
     +        modinfo = self._parse_gitmodules()
 3:  b488de78c !  4:  1f8b99dff DOC: gitrepo: Add docstring for get_submodules_()
    @@ -21,6 +21,6 @@
     +        A generator that yields a dictionary with information for each
     +        submodule.
     +        """
    -         if not ((self.pathobj / ".gitmodules").exists() and self.get_hexsha()):
    +         if not (self.pathobj / ".gitmodules").exists():
                  return
      
 4:  9392e2113 !  5:  bea8551d7 RF: gitrepo: Rewrite get_submodules() to avoid GitPython
    @@ -6,6 +6,13 @@
         GitPython, this avoids an unresolved issue in GitPython's handling of
         submodules [0].
     
    +    As documented by the new test, there is a change in behavior when
    +    get_submodules() is called in a repository that doesn't have a commit
    +    checked out: it now returns any registered submodules instead of an
    +    empty list.  This is consistent with the behavior of 'git submodule'
    +    and 'datalad subdataset', and there doesn't seem to be an obvious
    +    reason not to support it.
    +
         [0]: https://github.com/datalad/datalad/pull/3508#issuecomment-508462763
     
      diff --git a/datalad/support/gitrepo.py b/datalad/support/gitrepo.py
    @@ -86,3 +93,24 @@
      
          def is_submodule_modified(self, name, options=[]):
              """Whether a submodule has new commits
    +
    + diff --git a/datalad/support/tests/test_gitrepo.py b/datalad/support/tests/test_gitrepo.py
    + --- a/datalad/support/tests/test_gitrepo.py
    + +++ b/datalad/support/tests/test_gitrepo.py
    +@@
    +     raise SkipTest("TODO")
    + 
    + 
    ++@with_tempfile
    ++def test_get_submodules_parent_on_unborn_branch(path):
    ++    repo = GitRepo(path, create=True)
    ++    subrepo = GitRepo(op.join(path, "sub"), create=True)
    ++    subrepo.commit(msg="s", options=["--allow-empty"])
    ++    repo.add_submodule(path="sub")
    ++    eq_([s.name for s in repo.get_submodules()],
    ++        ["sub"])
    ++
    ++
    + def test_kwargs_to_options():
    + 
    +     class Some(object):
 5:  52737fdea =  6:  60f889d7d RF: request status only on provided paths (if given)
 6:  dba239dec =  7:  5fefda437 NF: get_parent_paths to be able to quickly determine within repo paths if paths within submodules provided
 7:  ad93ab7d6 =  8:  f8c41e0b8 BM: benchmark suite for get_parent_paths
 8:  fa14633c2 =  9:  bfa08474c BF: make use of get_parent_paths to mitigate difference in ls-tree and ls-files behavior on paths within submodules
 9:  d64ef83e7 = 10:  e7f4578de DOC: Use public-inbox for Git mailing list link
10:  c7905e74d = 11:  5a77f6ba2 OPT: do not sort, maintain a set of prior hits

# bring into traditional shape
for name, props in iteritems(mods):
    if 'path' not in props:
        lgr.debug("Failed to get '%s.path', skipping section", name)
Member Author

and here -- WARNING instead of DEBUG?

modprops = {'gitmodule_{}'.format(k): v
            for k, v in iteritems(props)
            if not (k.startswith('__') or k == 'path')}
modpath = self.pathobj / PurePosixPath(props['path'])
Member Author

I wonder if eventually we should RF this to store just props['path'], instead of constructing the full path here with the intention of converting it back to a relative one later on.
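
A minimal sketch of the alternative raised in this comment: keep the relative path from .gitmodules next to the full path, so no later conversion back is needed. The function name, the "relpath" key, and the sample "__line" key are hypothetical and not part of the PR; only the 'gitmodule_' prefixing mirrors the reviewed snippet.

```python
from pathlib import Path, PurePosixPath


def shape_submodule_record(repo_root, props):
    """Return one record per submodule section (hypothetical helper)."""
    # Same filtering as in the reviewed code: drop internal '__' keys and
    # 'path', prefix everything else with 'gitmodule_'.
    modprops = {
        "gitmodule_{}".format(k): v
        for k, v in props.items()
        if not (k.startswith("__") or k == "path")
    }
    relpath = PurePosixPath(props["path"])
    record = {
        "relpath": relpath,                 # as recorded in .gitmodules
        "path": Path(repo_root) / relpath,  # full path, built on demand
    }
    record.update(modprops)
    return record


if __name__ == "__main__":
    rec = shape_submodule_record(
        "/repo", {"path": "sub/ds", "url": "./sub/ds", "__line": 3})
    print(rec)
```

Whether carrying both forms is worth the extra key depends on how often callers need the relative path; this sketch only illustrates the trade-off.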

subdatasets() lists registered submodules even if the parent is on an
unborn branch.  Explicitly test this behavior.

Re: datalad#3508 (comment)
Move our custom submodule logic to gitrepo.py so that we can rewrite
GitRepo.get_submodules() to use it rather than GitPython.
kyleam and others added 7 commits August 1, 2019 09:57
In addition to keeping with our general direction of moving away from
GitPython, this avoids an unresolved issue in GitPython's handling of
submodules [0].

As documented by the new test, there is a change in behavior when
get_submodules() is called in a repository that doesn't have a commit
checked out: it now returns any registered submodules instead of an
empty list.  This is consistent with the behavior of 'git submodule'
and 'datalad subdataset', and there doesn't seem to be an obvious
reason not to support it.

[0]: datalad#3508 (comment)
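
To illustrate why the unborn-branch case can be supported: the submodule registrations live in the .gitmodules file, so listing them does not require a checked-out commit. The sketch below is not the PR's parser. It assumes the input is the text produced by `git config -z -l --file .gitmodules` (records end with NUL, key and value are separated by a newline), and the function name is hypothetical.

```python
def parse_gitmodules_listing(listing):
    """Group 'submodule.<name>.<prop>' records by submodule name.

    `listing` is assumed to be the output of
    `git config -z -l --file .gitmodules`.
    """
    mods = {}
    for record in listing.split("\0"):
        if not record:
            continue
        key, _, value = record.partition("\n")
        if not key.startswith("submodule."):
            continue
        # A submodule name may itself contain dots, so split the
        # property name off from the right.
        name, _, prop = key[len("submodule."):].rpartition(".")
        mods.setdefault(name, {})[prop] = value
    return mods


if __name__ == "__main__":
    sample = "submodule.sub.path\nsub\0submodule.sub.url\n./sub\0"
    print(parse_gitmodules_listing(sample))
```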
Intends to at least partially address datalad#3506,
where in a repository with lots of already tracked files, adding more files by providing
their paths would only lead to heavy CPU load, because path matching was then performed
at the DataLad level instead of restricting the initial git ls-files query to the paths
of interest.

Locally I have run all datalad/core tests and no failures were detected
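
A rough sketch of the idea in this commit message, assuming the query is issued as a plain `git ls-files -z` call: when paths are given, they are passed to git after `--` so that git does the filtering, rather than listing everything and matching paths afterwards. The helper name is hypothetical, and the actual call site in DataLad may pass further options.

```python
def ls_files_cmd(paths=None):
    """Build the argument list for a (possibly path-restricted) query."""
    cmd = ["git", "ls-files", "-z"]
    if paths:
        # '--' separates options from pathspecs, so a path that starts
        # with '-' is not mistaken for an option.
        cmd.append("--")
        cmd.extend(str(p) for p in paths)
    return cmd


if __name__ == "__main__":
    print(ls_files_cmd())
    print(ls_files_cmd(["new/file1.dat", "new/file2.dat"]))
```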
The main advantage of using public inbox is that it constructs the URL
with the message ID, making it easier to find the message even if
public-inbox.org/git is no longer around.
Seems to provide some (~5%) performance benefit

           before            after    ratio  benchmark [environment]
x     4.57±0.02ms       4.19±0.1ms     0.92  paths.get_parent_paths.time_allsubmods_toplevel [hopa/virtualenv-py2.7]
x     5.52±0.05ms      5.07±0.06ms     0.92  paths.get_parent_paths.time_allsubmods_toplevel [hopa/virtualenv-py3.7]
x     3.85±0.06ms      3.79±0.04ms     0.98  paths.get_parent_paths.time_allsubmods_toplevel_only [hopa/virtualenv-py2.7]
x     4.82±0.03ms      4.64±0.03ms     0.96  paths.get_parent_paths.time_allsubmods_toplevel_only [hopa/virtualenv-py3.7]
x         257±3ns          258±5ns     1.00  paths.get_parent_paths.time_no_submods [hopa/virtualenv-py2.7]
x         243±1ns          250±5ns     1.03  paths.get_parent_paths.time_no_submods [hopa/virtualenv-py3.7]
x     3.33±0.04ms      3.20±0.01ms     0.96  paths.get_parent_paths.time_one_submod_subdir [hopa/virtualenv-py2.7]
x     4.11±0.04ms      4.07±0.02ms     0.99  paths.get_parent_paths.time_one_submod_subdir [hopa/virtualenv-py3.7]
x     3.36±0.04ms      3.18±0.01ms     0.95  paths.get_parent_paths.time_one_submod_toplevel [hopa/virtualenv-py2.7]
x      4.19±0.2ms      4.04±0.03ms     0.96  paths.get_parent_paths.time_one_submod_toplevel [hopa/virtualenv-py3.7]
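
For readers unfamiliar with what is being benchmarked, here is a simplified sketch of the "do not sort, maintain a set of prior hits" approach. It is not the PR's implementation of get_parent_paths: the function name, the POSIX-style string paths, and the choice to report the nearest matching parent are all assumptions made for illustration.

```python
def get_parent_paths_sketch(paths, parents):
    """Replace paths lying under a known parent with that parent.

    Input order is preserved and duplicates are dropped by tracking
    prior hits in a set instead of sorting the result.
    """
    parent_set = set(parents)
    seen = set()
    result = []
    for path in paths:
        hit = path
        candidate = path
        # Walk up the path one component at a time until a parent matches.
        while "/" in candidate:
            candidate = candidate.rsplit("/", 1)[0]
            if candidate in parent_set:
                hit = candidate
                break
        if hit not in seen:
            seen.add(hit)
            result.append(hit)
    return result


if __name__ == "__main__":
    print(get_parent_paths_sketch(["a", "sub/x", "sub/y/z", "b/c"], ["sub"]))
```

The set lookup keeps deduplication at constant cost per path, which is the kind of saving the commit message attributes to dropping the sort.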
@yarikoptic yarikoptic added performance Improve performance of an existing feature and removed do not merge Not to be merged, will trigger WIP app "not passed" status labels Aug 1, 2019
Returns
-------
A list of paths (without duplicaates), where some entries replaced with
Member Author

@kyleam - if you were to push more, please fix up this typo in duplicaates, sorry about that

Contributor

I don't plan on pushing more. I've got you around the GitPython failures. Take back your PR :]

Member Author

Great.
Then let's take advantage of Germans venturing again to the West and merge it. I will do it locally, fix that typo, and then push directly so we don't stress Travis for no reason.

@yarikoptic
Member Author

rushed a bit -- travis wasn't yet fully done, but was fully green before. merged locally and pushed to master

Labels
performance Improve performance of an existing feature