OPT: Delay Dataset.repo access until necessary #5076
Results in substantial speedup for recursive processing, as datasets without any submodules are rejected prior to any Repo instantiation. For a dataset with 42k subdatasets this means going from

```
datalad subdatasets -r  270.45s user 615.64s system 104% cpu 14:07.22 total
```

to

```
datalad subdatasets -r  5.82s user 0.69s system 101% cpu 6.422 total
```

Fixes gh-5075
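The idea of rejecting datasets before touching their repo can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `LazyDataset`, `submodule_paths`, and the instantiation counter are hypothetical names, and the sketch assumes that a missing `.gitmodules` file is a sufficient signal that a dataset has no subdatasets.

```python
from pathlib import Path


class LazyDataset:
    """Hypothetical dataset handle that creates its repo object on demand."""

    repo_instantiations = 0  # class-wide counter, for demonstration only

    def __init__(self, path):
        self.path = Path(path)
        self._repo = None

    @property
    def repo(self):
        # Stand-in for an expensive Repo constructor (config reads, git calls).
        if self._repo is None:
            LazyDataset.repo_instantiations += 1
            self._repo = object()
        return self._repo


def submodule_paths(ds):
    """Return submodule paths recorded in .gitmodules, or [] if there is none."""
    gitmodules = ds.path / ".gitmodules"
    # Cheap filesystem check first: without .gitmodules there is nothing to
    # report, so the repo object is never created for such datasets.
    if not gitmodules.is_file():
        return []
    _ = ds.repo  # only now pay for the repo handle
    paths = []
    for line in gitmodules.read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "path":
            paths.append(value.strip())
    return paths
```

With many leaf datasets in a hierarchy, almost every call takes the early return, which is where a saving of this kind would come from.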
Awesome! Thank you @mih! Now the problem is only that I need to figure out what to do with all that free time ;-)
FWIW, even our tiny benchmarks show the gain, which is great, so we won't miss a pessimisation if one comes along.
Codecov Report
@@ Coverage Diff @@
## master #5076 +/- ##
==========================================
+ Coverage 89.78% 89.82% +0.04%
==========================================
Files 293 293
Lines 41254 41255 +1
==========================================
+ Hits 37039 37057 +18
+ Misses 4215 4198 -17
Only a cursory review of the code was done, but the benefit is too big not to just go forward ;)
While looking at a py-spy stack from a long-running status, I noticed that the majority of the time is spent on a git config call, triggered by checking for the fake dates config setting. AFAIK, read-only operations, which cannot produce a commit, might benefit from skipping that check. _call_git already has that kwarg, but it is not proxied by "higher-level" helpers.

To avoid breeding workarounds (like the one RFed within this commit), I decided that we might benefit from exposing that option in the higher-level helpers as well. "Read-only" invocations could then set it to False, and we might avoid a needless config read in some cases (I have not timed any prototypical use case that might benefit from this, but I would assume that status or diff might).

We had an original discussion on whether to expose check_fake_dates in the higher-level *Repo interfaces before, in datalad#3791 (comment), and decided not to do it at that point.

As for the 'subdatasets' call, the performance issue is now addressed by datalad#5076, so this change is not strictly necessary to optimize "subdatasets". It would still be generally beneficial for other invocations of *Repo methods across many instances, avoiding unnecessary loading of the config, typically in the case of read-only git operations (git-annex might need to merge the git-annex branch, so annex calls should typically not disable fake dates).
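To illustrate what proxying the kwarg could look like, here is a minimal sketch. `SketchRepo`, its method bodies, the `True` default, and the date value are assumptions made for illustration; they do not reproduce datalad's actual `*Repo` classes or signatures. The costly `git config` lookup is simulated with a counter, and no git process is started.

```python
class SketchRepo:
    """Hypothetical repo wrapper; not datalad's actual GitRepo."""

    def __init__(self):
        self.config_reads = 0  # counts simulated `git config` lookups

    def _fake_dates_enabled(self):
        # Stand-in for the config read that showed up in the profile.
        self.config_reads += 1
        return False

    def _call_git(self, args, check_fake_dates=True):
        # Low-level helper: only consult the config when asked to.
        env = {}
        if check_fake_dates and self._fake_dates_enabled():
            # Arbitrary illustrative value for a real git environment variable.
            env["GIT_AUTHOR_DATE"] = "2000-01-01T00:00:00+0000"
        # Return a description of the call instead of running git.
        return {"args": ["git", *args], "env": env}

    def call_git(self, args, check_fake_dates=True):
        # Higher-level helper that simply passes the kwarg through.
        return self._call_git(args, check_fake_dates=check_fake_dates)
```

A read-only caller would then pass `check_fake_dates=False`, e.g. `SketchRepo().call_git(["status"], check_fake_dates=False)`, while anything that may produce a commit keeps the default.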