Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NF: status(eval_subdataset_state={'no'|'commit'|'full'} #3324

Merged
merged 6 commits into from Apr 13, 2019
Merged

Conversation

@mih
Copy link
Member

@mih mih commented Apr 12, 2019

New mode to fully or partially disable subdataset state evaluation (which involves recursion into all present subdatasets to find potential uncommited changes), without disabling subdataset discovery and
reporting.

This addition allows for using status() unconditionally for (fast(er)) discovery operation.

This also replaces the (undocumented) argument and behavior switch 'ignore_submodules' that wasn't used in any sensible way, because it didn't actually provide sensible switching. It could only distinguish
"clever" operation and "needlessly wasteful" operation.

  • final docs

Performance info on the /// dataset:

datalad status --eval-subdataset-state commit -r 744.29s user 601.62s system 102% cpu 21:47.48 total

Looks ridiculous, but a plain git status at the top-level already takes 7min. In contrast this is a recursion across all datasets, with reporting on all datasets. Interestingly, the difference to "full" on the present state of that dataset's working tree (mixture of all kinds of things) is only 2min (~24min total runtime).

So here is an artificial "worst case" example with a deep and clean hierarchy that show maximum performance difference between the modes "commit" and "full":

import datalad.api as dl
ds = dl.rev_create('deep')
path = ds.pathobj  
for i in range(20):       
     path = path / 'down'
     ds.rev_create(path)
%timeit ds.status(eval_subdataset_state='no', result_renderer='disabled')
10 loops, best of 3: 51.8 ms per loop                                                    

%timeit ds.status(eval_subdataset_state='commit', result_renderer='disabled')
10 loops, best of 3: 66.6 ms per loop

%timeit ds.status(eval_subdataset_state='full', result_renderer='disabled')
1 loop, best of 3: 1.07 s per loop
New mode to fully or partially disable subdataset state evaluation
(which involves recursion into all present subdatasets to find potential
uncommited changes), without disabling subdataset discovery and
reporting.

This addition allows for using status() unconditionally for (fast(er))
discovery operation.

This also replaces the (undocumented) argument and behavior switch
'ignore_submodules' that wasn't used in any sensible way, because it
didn't actually provide sensible switching. It could only distinguish
"clever" operation and "needlessly wasteful" operation.
kyleam added 2 commits Apr 12, 2019
Otherwise the exception would show a tuple with the format string and
the value.
@codecov
Copy link

@codecov codecov bot commented Apr 12, 2019

Codecov Report

Merging #3324 into master will decrease coverage by <.01%.
The diff coverage is 93.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3324      +/-   ##
==========================================
- Coverage   91.15%   91.15%   -0.01%     
==========================================
  Files         263      263              
  Lines       34230    34246      +16     
==========================================
+ Hits        31204    31218      +14     
- Misses       3026     3028       +2
Impacted Files Coverage Δ
datalad/support/annexrepo.py 87.31% <ø> (ø) ⬆️
datalad/core/local/status.py 97.91% <100%> (ø) ⬆️
datalad/support/tests/test_repo_save.py 100% <100%> (ø) ⬆️
datalad/support/tests/test_fileinfo.py 100% <100%> (ø) ⬆️
datalad/support/gitrepo.py 89.12% <92%> (+0.01%) ⬆️
datalad/downloaders/http.py 83.73% <0%> (-2.78%) ⬇️
datalad/downloaders/tests/test_http.py 92.57% <0%> (+1.48%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5472821...b2bccf7. Read the comment docs.

Copy link
Contributor

@kyleam kyleam left a comment

I don't have a solid understanding of the underlying code, but the overall idea sounds good to me and the results looked good in the hierarchies I played around with. Pushed a couple fixups (feel free to squash in if you prefer) and left minor comments.

datalad/core/local/status.py Outdated Show resolved Hide resolved
)

eval_subdataset_state=Parameter(
args=("--eval-subdataset-state",),
Copy link
Contributor

@kyleam kyleam Apr 12, 2019

This seems useful enough to get a short flag (-e?). Also, perhaps setting metavar to something shorter (STATE?) would make the short help a bit more readable. Currently it says

[--eval-subdataset-state EVAL_SUBDATASET_STATE]

Copy link
Member Author

@mih mih Apr 12, 2019

True, will do.

Copy link
Member Author

@mih mih Apr 13, 2019

metavar issue is addressed via #3326

@mih
Copy link
Member Author

@mih mih commented Apr 12, 2019

@kyleam Thx for the feedback. I added a performance note to the top comment.

@mih mih changed the title WIP NF: status(eval_subdataset_state={'no'|'commit'|'full'} NF: status(eval_subdataset_state={'no'|'commit'|'full'} Apr 13, 2019
@mih mih merged commit 94aadb2 into datalad:master Apr 13, 2019
5 checks passed
@mih mih deleted the enh-status branch Apr 13, 2019
kyleam added a commit to kyleam/datalad that referenced this issue Apr 13, 2019
Follow up on dataladgh-3324 by removing a stale reference to
ignore_submodules and describing the eval_submodule_state parameter in
the docstrings of diff() and status().

[ci skip]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants