New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First batch of GitPython pruning #2902

Merged
merged 15 commits into from Oct 8, 2018

Conversation

Projects
None yet
4 participants
@kyleam
Member

kyleam commented Oct 6, 2018

This eliminates some of the GitPython API calls mentioned in #2879.

  • Eliminates nearly all repo.git.rev_parse calls through use and extension of GitRepo.get_hexsha and through new GitRepo.commit_exists. (ece5229 explains the remaining two.)
  • Removes all repo.git.show calls with use of new GitRepo.format_commit.
  • Removes all repo.git_dir uses.
  • Add GitRepo.is_ancestor and drops repo.git.merge_base.

  • This change is complete

@mih mih referenced this pull request Oct 6, 2018

Open

Brain dump on GitPython removal #2879

0 of 25 tasks complete
@codecov

This comment has been minimized.

codecov bot commented Oct 6, 2018

Codecov Report

Merging #2902 into master will increase coverage by <.01%.
The diff coverage is 97.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2902      +/-   ##
==========================================
+ Coverage   90.28%   90.29%   +<.01%     
==========================================
  Files         246      246              
  Lines       31877    31887      +10     
==========================================
+ Hits        28781    28791      +10     
  Misses       3096     3096
Impacted Files Coverage Δ
datalad/support/tests/test_repodates.py 100% <ø> (ø) ⬆️
datalad/interface/run.py 100% <100%> (ø) ⬆️
datalad/interface/rerun.py 96.26% <100%> (-0.19%) ⬇️
datalad/interface/tests/test_save.py 100% <100%> (ø) ⬆️
datalad/interface/tests/test_run.py 99.83% <100%> (ø) ⬆️
datalad/support/tests/test_gitrepo.py 100% <100%> (ø) ⬆️
datalad/support/gitrepo.py 88.93% <95.83%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a98a016...703552c. Read the comment docs.

@yarikoptic

I wonder if there is a performance impact/benefit. I expect that previous implementation used in memory db

kyleam added some commits Oct 6, 2018

ENH: GitRepo: Add is_ancestor method
rerun currently uses the GitPython api for this.
ENH: GitRepo: Add commit_exists method
This is needed in rerun.py.
RF: gitrepo: Extract more general format_commit from get_hexsha
rerun.py uses GitPython to call 'git show'.  Add a format_commit
method to GitRepo so that we can use that instead.
RF: rerun: Use new GitRepo.commit_exists() instead of GitPython api
The direct use of git.cmd.Git outside of GitRepo is discouraged.
RF: Prefer GitRepo.get_git_dir() to GitPython api
Using GitRepo.get_git_dir is a bit more awkward because it involves
passing ds.repo as an argument to its own static method, but doing so
avoids using git.cmd.Git outside of GitRepo.
RF: Prefer GitRepo.get_hexsha() to GitPython api
'git rev-parse' is arguably the idiomatic Git way to get a commit's
hash, but using get_hexsha() is consistent with other parts of DataLad
and eliminates a use of GitPython outside of GitRepo.
ENH: gitrepo: Teach get_hexsha() how to return an abbreviated hexsha
Doing so will let us eliminate another use of GitPython in rerun.
DOC: Mention in comment why a test uses GitPython api
Despite get_hexsha's docstring claiming to support "any type of Git
object identifier", it doesn't actually support tags.
RF+DOC: gitrepo: Don't claim that get_hexsha() works on any object
'git show --format=%H' will only print the hash of commit objects.
For a tag, it will print the commit hash, but also tag information, so
the downstream assert in get_hexsha() that expects one line will fail.
For tree and blob IDs, --format is simply ignored.

'git rev-parse --verify' could be used for a more general "get this
object's hash", but handling the "no commits yet" case would require
an additional git call, so stay with 'git show', at least for now.
ENH: gitrepo.get_hexsha: Cast tags to commits
As described in the previous message, passing a tag fails because it
outputs multiple lines, and downstream code expects just the single
hash specified by "%H".  Make a tag work as expected by casting it as
a commit.

@kyleam kyleam force-pushed the kyleam:enh-degitpython-rerun branch from 9c000f6 to 703552c Oct 7, 2018

@kyleam

This comment has been minimized.

Member

kyleam commented Oct 7, 2018

I wonder if there is a performance impact/benefit. I expect that previous implementation used in memory db

Not sure. I'm not familiar with how gitpython uses the db internally and am not sure when it comes into play. (Though this seems like a more general concern for #2879.)

The rev-parse call is probably the most relevant one for this pr. A crude test:

$ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.get_hexsha()"
100 loops, best of 3: 8.79 msec per loop

$ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr._git_custom_command('', ['git', 'rev-parse', 'HEAD'])"
100 loops, best of 3: 8.12 msec per loop

$ python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.repo.git.rev_parse('HEAD')"
100 loops, best of 3: 7.71 msec per loop
@mih

This comment has been minimized.

Member

mih commented Oct 7, 2018

@kyleam Thanks for the test. Personally, I would be willing to pay a higher price for not having that DB. This is pretty much the only state information that we are carrying around (besides whether a repo has an annex or not). From my POV we are not benefitting from it much, and we actually had more then one issue with leakage/destructor calls in the past.

My vote is on improving the statelessness of Dataset/Repo.

@mih

mih approved these changes Oct 7, 2018

Apart from the general attitude towards gitpython DB vs. straight git calls (see prev comment), this LGTM.

cmd = ['git', 'show', '--no-patch', "--format=%H"]
if object:
cmd.append(object)
cmd = ['git', 'show', '-z', '--no-patch', '--format=' + fmt]

This comment has been minimized.

@mih

mih Oct 7, 2018

Member

Ah, another one without -z. Thx for catching that!

@yarikoptic

This comment has been minimized.

Member

yarikoptic commented Oct 8, 2018

Re generic "From my POV we are not benefitting from it much,", hopefully @bpoldrack could support my memory of an early analysis where we compared against straight git calls and concluded that performance benefit was notable. But that was in the past, and even git could have become faster, and the way we use it might have changed not requiring as much of git commits structure analysis

@kyleam

This comment has been minimized.

Member

kyleam commented Oct 8, 2018

Stepping back, my current understanding is that there is general agreement that GitPython should not be used outside of GitRepo, and whether we should remove GitPython altogether is a separate issue. If that is true, I fail to see how this peformance discussion pertains to this PR since converting repo.git calls to GitRepo method calls would seem the clear direction to go. Instead, the question would be whether the relevant GitRepo method (here GitRepo.get_hexsha) should use GitPython.

@yarikoptic

This comment has been minimized.

Member

yarikoptic commented Oct 8, 2018

Stepping back, my current understanding is that there is general agreement that GitPython should not be used outside of GitRepo, and whether we should remove GitPython altogether is a separate issue.

True

If that is true, I fail to see how this peformance discussion pertains to this PR since converting repo.git calls to GitRepo method calls would seem the clear direction to go. Instead, the question would be whether the relevant GitRepo method (here GitRepo.get_hexsha) should use GitPython.

yeah, I just wondered if we would have any performance impact since GitPython's .git.rev_parse was replaced with get_hexsha which does git call... FWIW here is timing from my laptop on datalad code repo

$> python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr._git_custom_command('', ['git', 'rev-parse', 'HEAD'])"                                                                              100 loops, best of 3: 16.1 msec per loop
$> python -m timeit -s "from datalad.support.gitrepo import GitRepo; gr = GitRepo('.')" "gr.repo.git.rev_parse('HEAD')"
100 loops, best of 3: 16.5 msec per loop

so , it even gets better a tiny bit (should I finally get a new laptop? ;)), so seems to be a non-issue for this particular change, so we should go ahead with this one.

@yarikoptic yarikoptic merged commit e8f0e44 into datalad:master Oct 8, 2018

5 checks passed

WIP ready for review
Details
codecov/patch 97.61% of diff hit (target 90.28%)
Details
codecov/project 90.29% (+<.01%) compared to a98a016
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@kyleam kyleam deleted the kyleam:enh-degitpython-rerun branch Oct 8, 2018

@bpoldrack

This comment has been minimized.

Member

bpoldrack commented Oct 9, 2018

FTR: First of all, I'm happy to see this change. Second I do remember some performance testing in the past, but the benefit from GitPython was notable only for a very small set of queries. I doubt it will notably affect our actual commands, since we do a lot more than querying for a reference. And most things just lead to direct git calls done by GitPython if you dig into it. Plus: Even where do use features that benefit from in memory DB, I think it's mostly just a single call that hardly compensates for the offset of creating that DB in the first place.

Edit:
To underline the last point: Remember how we benefited from making the instantiation of GitPython Repo lazy?

@yarikoptic yarikoptic added this to the Release 0.10.4 milestone Oct 22, 2018

@yarikoptic yarikoptic modified the milestone: Release 0.10.4 Oct 22, 2018

yarikoptic added a commit that referenced this pull request Oct 24, 2018

Merge tag '0.11.0' into debian
 Upgrade of [git-annex] to the most recent available to your release is
 advisable since a number of issues were resolved at that level.

 ### Major refactoring and deprecations

 - `datalad.consts.LOCAL_CENTRAL_PATH` constant was deprecated in favor
   of `datalad.locations.default-dataset` [configuration] variable
   ([#2835])

 ### Minor refactoring

 - `"notneeded"` messages are no longer reported by default results
   renderer
 - [run] no longer shows commit instructions upon command failure when
   `explicit` is true and no outputs are specified ([#2922])
 - `get_git_dir` moved into GitRepo ([#2886])
 - `_gitpy_custom_call` removed from GitRepo ([#2894])
 - `GitRepo.get_merge_base` argument is now called `commitishes` instead
   of `treeishes` ([#2903])

 ### Fixes

 - [update] should not leave the dataset in non-clean state ([#2858])
   and some other enhancements ([#2859])
 - Fixed chunking of the long command lines to account for decorators
   and other arguments ([#2864])
 - Progress bar should not crash the process on some missing progress
   information ([#2891])
 - Default value for `jobs` set to be `"auto"` (not `None`) to take
   advantage of possible parallel get if in `-g` mode ([#2861])
 - [wtf] must not crash if `git-annex` is not installed etc ([#2865]),
   ([#2865]), ([#2918]), ([#2917])
 - Fixed paths (with spaces etc) handling while reporting annex error
   output ([#2892]), ([#2893])
 - `__del__` should not access `.repo` but `._repo` to avoid attempts
   for reinstantiation etc ([#2901])
 - Fix up submodule `.git` right in `GitRepo.add_submodule` to avoid
   added submodules being non git-annex friendly ([#2909]), ([#2904])
 - [run-procedure] ([#2905])
   - now will provide dataset into the procedure if called within dataset
   - will not crash if procedure is an executable without `.py` or `.sh`
     suffixes
 - Use centralized `.gitattributes` handling while setting annex backend
   ([#2912])
 - `GlobbedPaths.expand(..., full=True)` incorrectly returned relative
    paths when called more than once ([#2921])

 ### Enhancements and new features

 - Report progress on [clone] when installing from "smart" git servers
   ([#2876])
 - Stale/unused `sth_like_file_has_content` was removed ([#2860])
 - Enhancements to [search] to operate on "improved" metadata layouts
   ([#2878])
 - Output of `git annex init` operation is now logged ([#2881])
 - New
   - `GitRepo.cherry_pick` ([#2900])
   - `GitRepo.format_commit` ([#2902])
 - [run-procedure] ([#2905])
   - procedures can now recursively be discovered in subdatasets as well.
     The uppermost has highest priority
   - Procedures in user and system locations now take precedence over
     those in datasets.

* tag '0.11.0':
  Make it a 0.11.0 release since there were some API RFings
  REL: slight tune up to Changelog following the advices
  DOC: v0.10.4: Mention change in procedure precedence (a0cbcba)
  DOC: v0.10.4: Fix description of db715b7
  DOC: v0.10.4: Improve description of 6f615a4
  DOC: v0.10.4: Remove duplicate word

yarikoptic added a commit that referenced this pull request Oct 24, 2018

Merge tag '0.11.0' into debian
 ## 0.11.0 (Oct 23, 2018) -- Soon-to-be-perfect

 [git-annex] 6.20180913 (or later) is now required - provides a number of
 fixes for v6 mode operations etc.

 ### Major refactoring and deprecations

 - `datalad.consts.LOCAL_CENTRAL_PATH` constant was deprecated in favor
   of `datalad.locations.default-dataset` [configuration] variable
   ([#2835])

 ### Minor refactoring

 - `"notneeded"` messages are no longer reported by default results
   renderer
 - [run] no longer shows commit instructions upon command failure when
   `explicit` is true and no outputs are specified ([#2922])
 - `get_git_dir` moved into GitRepo ([#2886])
 - `_gitpy_custom_call` removed from GitRepo ([#2894])
 - `GitRepo.get_merge_base` argument is now called `commitishes` instead
   of `treeishes` ([#2903])

 ### Fixes

 - [update] should not leave the dataset in non-clean state ([#2858])
   and some other enhancements ([#2859])
 - Fixed chunking of the long command lines to account for decorators
   and other arguments ([#2864])
 - Progress bar should not crash the process on some missing progress
   information ([#2891])
 - Default value for `jobs` set to be `"auto"` (not `None`) to take
   advantage of possible parallel get if in `-g` mode ([#2861])
 - [wtf] must not crash if `git-annex` is not installed etc ([#2865]),
   ([#2865]), ([#2918]), ([#2917])
 - Fixed paths (with spaces etc) handling while reporting annex error
   output ([#2892]), ([#2893])
 - `__del__` should not access `.repo` but `._repo` to avoid attempts
   for reinstantiation etc ([#2901])
 - Fix up submodule `.git` right in `GitRepo.add_submodule` to avoid
   added submodules being non git-annex friendly ([#2909]), ([#2904])
 - [run-procedure] ([#2905])
   - now will provide dataset into the procedure if called within dataset
   - will not crash if procedure is an executable without `.py` or `.sh`
     suffixes
 - Use centralized `.gitattributes` handling while setting annex backend
   ([#2912])
 - `GlobbedPaths.expand(..., full=True)` incorrectly returned relative
    paths when called more than once ([#2921])

 ### Enhancements and new features

 - Report progress on [clone] when installing from "smart" git servers
   ([#2876])
 - Stale/unused `sth_like_file_has_content` was removed ([#2860])
 - Enhancements to [search] to operate on "improved" metadata layouts
   ([#2878])
 - Output of `git annex init` operation is now logged ([#2881])
 - New
   - `GitRepo.cherry_pick` ([#2900])
   - `GitRepo.format_commit` ([#2902])
 - [run-procedure] ([#2905])
   - procedures can now recursively be discovered in subdatasets as well.
     The uppermost has highest priority
   - Procedures in user and system locations now take precedence over
     those in datasets.

* tag '0.11.0':
  CHANGELOG adjusted to reflect new minimal version of git-annex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment