New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OPT: get_content_info: Speed up ls-files by dropping unneeded flags #5067
Conversation
GitRepo.get_content_info() calls ls-files with --modified and --deleted. When combined with --stage, those flags don't add any information, just duplicate lines in the output. In addition to redundant processing, these flags result in a slower ls-files call because the working tree needs to be considered. Without --modified and --deleted, the execution time of `ls-files --stage` is comparable to that of `ls-tree -r`. $ git clone https://datasets.datalad.org/openneuro/.git . $ git version git version 2.29.0.311.g38ddec7408 $ git status # warm cache $ multitime -n 5 -q git ls-files --stage ===> multitime results 1: -q git ls-files --stage Mean Std.Dev. Min Median Max real 0.005 0.001 0.003 0.006 0.007 user 0.001 0.002 0.000 0.000 0.005 sys 0.004 0.002 0.000 0.006 0.006 $ multitime -n 5 -q git ls-tree --long --full-tree -r HEAD ===> multitime results 1: -q git ls-tree --long --full-tree -r HEAD Mean Std.Dev. Min Median Max real 0.007 0.002 0.004 0.007 0.010 user 0.003 0.003 0.000 0.003 0.007 sys 0.004 0.002 0.000 0.004 0.007 $ multitime -n 5 -q git ls-files --stage --modified --deleted ===> multitime results 1: -q git ls-files --stage --modified --deleted Mean Std.Dev. Min Median Max real 0.025 0.009 0.021 0.021 0.043 user 0.017 0.007 0.010 0.016 0.031 sys 0.008 0.003 0.004 0.008 0.012 Speed up get_content_info's ls-files call by dropping --modified and --deleted. While adjusting the ls-files arguments, also avoid unnecessarily adding --exclude-standard when --others isn't specified. Note that dataladgh-5061 proposed making get_submodules_() pass ref="HEAD" when it calls get_content_info() in order to trigger the faster ls-tree code path, but, with the changes here, there's no notable difference, at least in the openneuro clone above: $ python -m timeit -n 10 \ -s "from datalad.support.gitrepo import GitRepo" \ -s "gr = GitRepo('.')" \ "list(gr.get_submodules_())" on master (f66dce7): 10 loops, best of 5: 40.5 msec per loop this PR: 10 loops, best of 5: 24.8 msec per loop with ref="HEAD": 10 loops, best of 5: 25.9 msec per loop
In case others want to play around with the timings, here's a script that uses a multitime batch file to do the first timing: script#!/bin/sh
set -eu
cd "$(mktemp -d "${TMPDIR:-/tmp}"/git-ls-XXXXXXX)"
git version
git clone https://datasets.datalad.org/openneuro/.git repo
cat >bf <<'EOF'
-q git ls-files --stage
-q git ls-tree --long --full-tree -r HEAD
-q git ls-files --stage --modified --deleted
EOF
cd repo
git status
multitime -n 20 -b ../bf
|
Codecov Report
@@ Coverage Diff @@
## master #5067 +/- ##
==========================================
- Coverage 89.81% 89.78% -0.04%
==========================================
Files 293 293
Lines 41239 41239
==========================================
- Hits 37038 37025 -13
- Misses 4201 4214 +13
Continue to review full report at Codecov.
|
ALL TESTS ARE GREEN -- so indeed it was just causing some "double work" for both git and datalad. Great find @kyleam ! |
Wonderful analysis! Thx @kyleam |
GitRepo.get_content_info() calls ls-files with --modified and
--deleted. When combined with --stage, those flags don't add any
information, just duplicate lines in the output. In addition to
redundant processing, these flags result in a slower ls-files call
because the working tree needs to be considered. Without --modified
and --deleted, the execution time of
ls-files --stage
is comparableto that of
ls-tree -r
.Speed up get_content_info's ls-files call by dropping --modified and
--deleted. While adjusting the ls-files arguments, also avoid
unnecessarily adding --exclude-standard when --others isn't specified.
Note that gh-5061 proposed making get_submodules_() pass ref="HEAD"
when it calls get_content_info() in order to trigger the faster
ls-tree code path, but, with the changes here, there's no notable
difference, at least in the openneuro clone above: