Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use git status for dirty and reference datalad status in {,re}run's dirty message #3460

Merged
merged 4 commits into from May 30, 2019

Conversation

@kyleam
Copy link
Contributor

@kyleam kyleam commented May 26, 2019

This PR implements the proposal in #3342 (comment).

kyleam added 3 commits May 26, 2019
Fill in some to-do's.  This will be helpful for increasing our
confidence that the rewrite of GitRepo.dirty in the next commit
doesn't change the behavior.
While in general it makes sense to use status() as the uniform
interface for querying the repository state, .dirty doesn't benefit
from it.  It takes no parameters and returns a boolean, so using git
status to answer the question is simple and less expensive.

Re: datalad#3342 (comment)
Some callers may not know how to find the unsaved modifications, so
direct them to status.  Of course they also might not know how to deal
with those modifications, but, in the interest of keeping this message
to a somewhat reasonable length, let's not take responsibility for
directing the caller to `datalad save` as one of the many ways to
resolve a dirty state.  We could also mention --explicit, but that
would make the message longer, add one more thing for a caller to be
confused about, and mention an approach that at least I don't want to
encourage.

Re: datalad#3342 (comment)
@codecov
Copy link

@codecov codecov bot commented May 26, 2019

Codecov Report

Merging #3460 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3460      +/-   ##
==========================================
- Coverage   91.27%   91.24%   -0.04%     
==========================================
  Files         272      272              
  Lines       34834    34869      +35     
==========================================
+ Hits        31796    31817      +21     
- Misses       3038     3052      +14
Impacted Files Coverage Δ
datalad/core/local/run.py 100% <ø> (ø) ⬆️
datalad/interface/rerun.py 96.2% <ø> (ø) ⬆️
datalad/support/gitrepo.py 90.19% <100%> (ø) ⬆️
datalad/support/tests/test_gitrepo.py 99.88% <100%> (ø) ⬆️
datalad/support/tests/test_annexrepo.py 96.03% <100%> (+0.01%) ⬆️
datalad/plugin/wtf.py 79.25% <0%> (-2.93%) ⬇️
datalad/downloaders/tests/test_http.py 91.08% <0%> (-1.24%) ⬇️
datalad/interface/tests/test_download_url.py 100% <0%> (ø) ⬆️
datalad/distribution/tests/test_create_github.py 95.89% <0%> (+0.17%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update afdaca2...c14c3a4. Read the comment docs.

["git", "status", "--porcelain",
# Ensure the result isn't influenced by diff.ignoreSubmodules.
"--ignore-submodules=none"])
return bool(stdout.strip())
Copy link
Member

@yarikoptic yarikoptic May 29, 2019

I am curious on how does git status work.
E.g. in a big repo with lots of dirty things -- wouldn't we then first get the entire output (might take awhile?) first to be stripped and then checked for bool value? i wonder if there is an easy way to get only the first line from status output without too much dance with our Runner ... Or that all shouldn't matter much?

NB original implementation should have used return any(p for p, ...) construct to quit without analyzing the entire status list...

Copy link
Contributor Author

@kyleam kyleam May 29, 2019

I don't know of way to do the status check and not get all the output. Suggestions welcome.

That being said, it is more efficient than an ls-{tree,file} approach that lists all content in the repo.

Copy link
Member

@yarikoptic yarikoptic May 29, 2019

yeah, it should be.
I was googling up, apparently the same question ("quick git dirty check") comes often in shell prompt scenarios. In sindresorhus/pure#115 (comment) spotted them using -unormal with their status --porcelain. For me git status --help says

       -u[<mode>], --untracked-files[=<mode>]
           Show untracked files.

           The mode parameter is used to specify the handling of untracked files. It is optional: it defaults
           to all, and if specified, it must be stuck to the option (e.g.  -uno, but not -u no).

           The possible options are:
           •   no - Show no untracked files.
           •   normal - Shows untracked files and directories.
           •   all - Also shows individual files in untracked directories.

so I expected it to show all files, but for me it functions as -unormal:

$> time git status --porcelain | head
?? .datalad/                         
?? .git-aside/
?? .gitattributes
?? logs/
?? sub-000001_ses-01_run-01_task-efnback.feat/
?? sub-000001_ses-01_run-01_task-efnback.fsf
?? sub-000001_ses-01_run-02_task-efnback.feat/
?? sub-000001_ses-01_run-02_task-efnback.fsf
?? sub-000001_ses-02_run-01_task-efnback.feat/
?? sub-000001_ses-02_run-01_task-efnback.fsf
git status --porcelain  0.01s user 0.02s system 98% cpu 0.028 total
head  0.00s user 0.00s system 5% cpu 0.027 total

$> time git status --porcelain -uall | head
?? .datalad/.gitattributes
?? .datalad/config
?? .git-aside/HEAD
?? .git-aside/annex/index
?? .git-aside/annex/index.lck
?? .git-aside/annex/journal.lck

did I misread or the man page is wrong or I have some settings in git config ? ;) (git 2.21.0.593.g511ec345e18)

anyways -- may be it would be wortwhile to hardcode -unormal to avoid possible side-effects from configs etc?

Copy link
Contributor Author

@kyleam kyleam May 29, 2019

might take awhile?

Here are timings for ~30000 modified or untracked files across 3 directories,

script to generate
#!/bin/sh

set -eu

make_tree () {
    parallel -i sh -c "dd if=/dev/urandom of={}.txt count=1 bs=1024 2>>out" -- \
             $(seq 1 10000)
}

random_tree () {
    subdir=$1
    mkdir -p $subdir
    (cd $subdir && make_tree)
}

cd $(mktemp -dt dl-status-XXXXXXX)
git init

random_tree dir-a
random_tree dir-b

git add .
git commit -minitial

# Modify existing.
random_tree dir-a
random_tree dir-b
# Add untracked.
random_tree dir-c

Floor timing for git status --porcelain:

% time git status --porcelain >/dev/null
git status --porcelain >/dev/null  0.19s user 0.12s system 108% cpu 0.289 total

GitRepo.dirty on this PR:

% git -C ~/src/python/datalad rev-parse HEAD
e92689cd02b4a3c41359bcb7e46c80970b7bb395
% python -m timeit -s "import datalad.api as dl" "dl.Dataset('.').repo.dirty"
10 loops, best of 3: 281 msec per loop

GitRepo.dirty on master:

% git -C ~/src/python/datalad rev-parse HEAD
e83d571b0c551a2b8c8c4c1a526365df89f07965
% python -m timeit -s "import datalad.api as dl" "dl.Dataset('.').repo.dirty"
10 loops, best of 3: 2.02 sec per loo

GitRepo.dirty on 0.11.x:

% git -C ~/src/python/datalad rev-parse HEAD
78c72ec23cdd659580166de6b8313bab68decddd
% python -m timeit -s "import datalad.api as dl" "dl.Dataset('.').repo.dirty"
10 loops, best of 3: 312 msec per loop

I think this test repo is very much the extreme of what we could expect for number of modified files (but please modify and test on your end), so I'd say we're OK with this PR's current approach until we actually observe an issue to address.

Copy link
Contributor Author

@kyleam kyleam May 29, 2019

did I misread or the man page is wrong or I have some settings in git config ? ;) (git 2.21.0.593.g511ec345e18)

My understanding is that if -u is given without an argument, it defaults to all, but if -u isn't given at all, then the behavior matches normal (mentioned just below the text you quote and in git-configs description of status.showUntrackedFiles).

anyways -- may be it would be wortwhile to hardcode -unormal to avoid possible side-effects from configs etc?

Oh, good point! I actually didn't know that status.showUntrackedFiles existed, though even if it didn't it is best to be explicit rather than relying on the default behavior being normal (just as I already guard against the influence of diff.ignoreSubmodules by using --ignore-submodules=none). Will update.

When --untracked-files isn't passed to 'git status', the default
behavior is 'normal' (i.e., if a directory has only untracked files,
just the directory is listed).  But this can be influenced by the
configuration variable status.showUntrackedFiles, so pass
--untracked-files explicitly to avoid getting a different answer or
getting unnecessarily heavy output.

Thanks @yarikoptic.

Re: datalad#3460 (comment)
@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented May 29, 2019

Thanks for the script! When I get to the laptop I will try to 1. Add flushing of the fs caches to get cold run readings 2. Adjust original master version with that any() fix, since if status is a generator we might see early completion. Interested to see results then

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented May 30, 2019

When I get to the laptop I will try to [...]

Great, thanks for playing around with it.

since if status is a generator we might see early completion

Sure, early stopping of processing can't hurt, but ls-{files,tree} are still reporting on all content and the initial intake of stdout is the entire git output.

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented May 30, 2019

here is the timings script with those two commands

$> cat /tmp/kyle-timings
#!/bin/bash

function cold () {
    sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
}


datalad --version

cd "$1"

cold
time git status --porcelain >/dev/null

cold
python -m timeit -s "import datalad.api as dl" "dl.Dataset('.').repo.dirty"

master results

$> /tmp/kyle-timings /home/yoh/.tmp/dl-status-GjgCkWZ
datalad 0.12.0rc4.dev90

real	0m0.623s
user	0m0.135s
sys	0m0.376s
1 loop, best of 5: 1.96 sec per loop

0.11.x results

$> /tmp/kyle-timings /home/yoh/.tmp/dl-status-GjgCkWZ
datalad 0.11.5.dev4

real	0m0.623s
user	0m0.129s
sys	0m0.394s
1 loop, best of 5: 321 msec per loop

this PR

$> /tmp/kyle-timings /home/yoh/.tmp/dl-status-GjgCkWZ
datalad 0.12.0rc4.dev95

real	0m0.662s
user	0m0.155s
sys	0m0.401s
1 loop, best of 5: 235 msec per loop

and master with any() probably indeed makes not much sense since the _diffstatus which it all boils down to, would return either a string with a status (!) or a dict with the entries which later would get assessed in dirty so no major benefit from any could come... I though it was a glorious generator ;-)

so, the PR is "state of the art" as far as I can tell ;-) legit

'cannot detect changes by command'))
message=(
'clean dataset required to detect changes from command; '
'use `datalad status` to inspect unsaved changes'))
Copy link
Member

@yarikoptic yarikoptic May 30, 2019

FWIW, in real life scenarios I will not be able to use datalad status in such cases since takes too long. Most likely will run git status, and that is why (since this information was already collected in dirty) it is kinda wasted cycles. But I guess we could go one step at a time

Copy link
Contributor Author

@kyleam kyleam May 30, 2019

Most likely will run git status, and that is why (since this information was already collected in dirty) it is kinda wasted cycles.

My reservations about adding "N-first modified files" to the status (as suggested here):

  • With the .dirty interface, we are committing to a simple no parameters, boolean value interface. If we find a better method to get the same answer (e.g., no need to consume all of status's output), nothing prevents us from changing to it.

  • In terms of saving cycles, the expensive 'git status' calls will be ones with lots of output. If you're only showing the first N items, there will be a need for a follow-up status call anyway.

But I guess we could go one step at a time

I vote to merge and proceed from there

OK, will merge. Thanks for the review.

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented May 30, 2019

I vote to merge and proceed from there

@kyleam kyleam merged commit c14c3a4 into datalad:master May 30, 2019
5 checks passed
kyleam added a commit to kyleam/datalad that referenced this issue May 30, 2019
@kyleam kyleam deleted the dirty-git-status branch May 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants