Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: get_content_info: Speed up ls-files by dropping unneeded flags #5067

Merged
merged 1 commit into from Oct 22, 2020

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Oct 21, 2020

GitRepo.get_content_info() calls ls-files with --modified and
--deleted. When combined with --stage, those flags don't add any
information, just duplicate lines in the output. In addition to
redundant processing, these flags result in a slower ls-files call
because the working tree needs to be considered. Without --modified
and --deleted, the execution time of ls-files --stage is comparable
to that of ls-tree -r.

$ git clone https://datasets.datalad.org/openneuro/.git .
$ git version
git version 2.29.0.311.g38ddec7408
$ git status # warm cache

$ multitime -n 5 -q git ls-files --stage
===> multitime results
1: -q git ls-files --stage
            Mean        Std.Dev.    Min         Median      Max
real        0.005       0.001       0.003       0.006       0.007
user        0.001       0.002       0.000       0.000       0.005
sys         0.004       0.002       0.000       0.006       0.006

$ multitime -n 5 -q git ls-tree --long --full-tree -r HEAD
===> multitime results
1: -q git ls-tree --long --full-tree -r HEAD
            Mean        Std.Dev.    Min         Median      Max
real        0.007       0.002       0.004       0.007       0.010
user        0.003       0.003       0.000       0.003       0.007
sys         0.004       0.002       0.000       0.004       0.007

$ multitime -n 5 -q git ls-files --stage --modified --deleted
===> multitime results
1: -q git ls-files --stage --modified --deleted
            Mean        Std.Dev.    Min         Median      Max
real        0.025       0.009       0.021       0.021       0.043
user        0.017       0.007       0.010       0.016       0.031
sys         0.008       0.003       0.004       0.008       0.012

Speed up get_content_info's ls-files call by dropping --modified and
--deleted. While adjusting the ls-files arguments, also avoid
unnecessarily adding --exclude-standard when --others isn't specified.

Note that gh-5061 proposed making get_submodules_() pass ref="HEAD"
when it calls get_content_info() in order to trigger the faster
ls-tree code path, but, with the changes here, there's no notable
difference, at least in the openneuro clone above:

$ python -m timeit -n 10 \
       -s "from datalad.support.gitrepo import GitRepo" \
       -s "gr = GitRepo('.')" \
       "list(gr.get_submodules_())"

on master (f66dce7e5e):  10 loops, best of 5: 40.5 msec per loop
               this PR:  10 loops, best of 5: 24.8 msec per loop
       with ref="HEAD":  10 loops, best of 5: 25.9 msec per loop

GitRepo.get_content_info() calls ls-files with --modified and
--deleted.  When combined with --stage, those flags don't add any
information, just duplicate lines in the output.  In addition to
redundant processing, these flags result in a slower ls-files call
because the working tree needs to be considered.  Without --modified
and --deleted, the execution time of `ls-files --stage` is comparable
to that of `ls-tree -r`.

    $ git clone https://datasets.datalad.org/openneuro/.git .
    $ git version
    git version 2.29.0.311.g38ddec7408
    $ git status # warm cache

    $ multitime -n 5 -q git ls-files --stage
    ===> multitime results
    1: -q git ls-files --stage
                Mean        Std.Dev.    Min         Median      Max
    real        0.005       0.001       0.003       0.006       0.007
    user        0.001       0.002       0.000       0.000       0.005
    sys         0.004       0.002       0.000       0.006       0.006

    $ multitime -n 5 -q git ls-tree --long --full-tree -r HEAD
    ===> multitime results
    1: -q git ls-tree --long --full-tree -r HEAD
                Mean        Std.Dev.    Min         Median      Max
    real        0.007       0.002       0.004       0.007       0.010
    user        0.003       0.003       0.000       0.003       0.007
    sys         0.004       0.002       0.000       0.004       0.007

    $ multitime -n 5 -q git ls-files --stage --modified --deleted
    ===> multitime results
    1: -q git ls-files --stage --modified --deleted
                Mean        Std.Dev.    Min         Median      Max
    real        0.025       0.009       0.021       0.021       0.043
    user        0.017       0.007       0.010       0.016       0.031
    sys         0.008       0.003       0.004       0.008       0.012

Speed up get_content_info's ls-files call by dropping --modified and
--deleted.  While adjusting the ls-files arguments, also avoid
unnecessarily adding --exclude-standard when --others isn't specified.

Note that dataladgh-5061 proposed making get_submodules_() pass ref="HEAD"
when it calls get_content_info() in order to trigger the faster
ls-tree code path, but, with the changes here, there's no notable
difference, at least in the openneuro clone above:

    $ python -m timeit -n 10 \
           -s "from datalad.support.gitrepo import GitRepo" \
           -s "gr = GitRepo('.')" \
           "list(gr.get_submodules_())"

    on master (f66dce7):  10 loops, best of 5: 40.5 msec per loop
                   this PR:  10 loops, best of 5: 24.8 msec per loop
           with ref="HEAD":  10 loops, best of 5: 25.9 msec per loop
@kyleam
Copy link
Contributor Author

kyleam commented Oct 21, 2020

In case others want to play around with the timings, here's a script that uses a multitime batch file to do the first timing:

script
#!/bin/sh

set -eu

cd "$(mktemp -d "${TMPDIR:-/tmp}"/git-ls-XXXXXXX)"

git version
git clone https://datasets.datalad.org/openneuro/.git repo

cat >bf <<'EOF'
-q git ls-files --stage
-q git ls-tree --long --full-tree -r HEAD
-q git ls-files --stage --modified --deleted
EOF

cd repo
git status
multitime -n 20 -b ../bf
git version 2.29.0.311.g38ddec7408
Cloning into 'repo'...
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
===> multitime results
1: -q git ls-files --stage
            Mean        Std.Dev.    Min         Median      Max
real        0.003       0.002       0.002       0.002       0.007       
user        0.002       0.002       0.000       0.002       0.007       
sys         0.000       0.001       0.000       0.000       0.003       

2: -q git ls-tree --long --full-tree -r HEAD
            Mean        Std.Dev.    Min         Median      Max
real        0.004       0.002       0.003       0.003       0.007       
user        0.002       0.002       0.000       0.003       0.007       
sys         0.001       0.002       0.000       0.000       0.007       

3: -q git ls-files --stage --modified --deleted
            Mean        Std.Dev.    Min         Median      Max
real        0.022       0.009       0.017       0.018       0.050       
user        0.015       0.009       0.004       0.013       0.041       
sys         0.006       0.003       0.000       0.007       0.013       

@codecov
Copy link

codecov bot commented Oct 22, 2020

Codecov Report

Merging #5067 into master will decrease coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5067      +/-   ##
==========================================
- Coverage   89.81%   89.78%   -0.04%     
==========================================
  Files         293      293              
  Lines       41239    41239              
==========================================
- Hits        37038    37025      -13     
- Misses       4201     4214      +13     
Impacted Files Coverage Δ
datalad/support/gitrepo.py 90.47% <100.00%> (ø)
datalad/downloaders/http.py 81.85% <0.00%> (-2.71%) ⬇️
datalad/downloaders/tests/test_http.py 88.52% <0.00%> (-1.88%) ⬇️
datalad/distribution/create_sibling.py 86.11% <0.00%> (+0.56%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f66dce7...72b6a3c. Read the comment docs.

@yarikoptic
Copy link
Member

ALL TESTS ARE GREEN -- so indeed it was just causing some "double work" for both git and datalad. Great find @kyleam !

@mih
Copy link
Member

mih commented Oct 22, 2020

Wonderful analysis! Thx @kyleam

@mih mih merged commit c12224d into datalad:master Oct 22, 2020
@kyleam kyleam deleted the ls-files-simplify branch October 22, 2020 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants