Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: Delay Dataset.repo access until necessary #5076

Merged
merged 1 commit into from Oct 23, 2020

Conversation

mih
Copy link
Member

@mih mih commented Oct 23, 2020

Results in substantial speedup for recursive processing, as datasets
without any submodules are rejected prior any Repo instantiation.

For a dataset with 42k subdatasets this means going from

datalad subdatasets -r  270.45s user 615.64s system 104% cpu 14:07.22 total

to

datalad subdatasets -r  5.82s user 0.69s system 101% cpu 6.422 total

Fixes gh-5075

Results in substantial speedup for recursive processing, as datasets
without any submodules are rejected prior any Repo instantion.

For a dataset with 42k subdatasets this means going from

```
datalad subdatasets -r  270.45s user 615.64s system 104% cpu 14:07.22 total
```

to

```
datalad subdatasets -r  5.82s user 0.69s system 101% cpu 6.422 total
```

Fixes dataladgh-5075
@mih mih requested a review from yarikoptic October 23, 2020 14:37
@mih mih added the performance Improve performance of an existing feature label Oct 23, 2020
@yarikoptic
Copy link
Member

Awesome! Thank you @mih! Now the problem is only that I need to figure out what to do with all that free time ;-)

@mih
Copy link
Member Author

mih commented Oct 23, 2020

Awesome! Thank you @mih! Now the problem is only that I need to figure out what to do with all that free time ;-)

Hehe.... #4411 could take a creative thought 👀

@yarikoptic
Copy link
Member

FWIW, even our tiny benchmarks show the gain

-         560±7ms          442±4ms     0.79  api.supers.time_subdatasets_recursive

which is great, so we don't miss some pessimisation if any comes

@codecov
Copy link

codecov bot commented Oct 23, 2020

Codecov Report

Merging #5076 into master will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5076      +/-   ##
==========================================
+ Coverage   89.78%   89.82%   +0.04%     
==========================================
  Files         293      293              
  Lines       41254    41255       +1     
==========================================
+ Hits        37039    37057      +18     
+ Misses       4215     4198      -17     
Impacted Files Coverage Δ
datalad/local/subdatasets.py 96.52% <100.00%> (+0.03%) ⬆️
datalad/distribution/create_sibling.py 86.11% <0.00%> (+0.56%) ⬆️
datalad/downloaders/tests/test_http.py 90.39% <0.00%> (+1.87%) ⬆️
datalad/downloaders/http.py 84.55% <0.00%> (+2.70%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d31c35...d486cf4. Read the comment docs.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only cursory review of code is done -- but benefit is too big to not to just go forward ;)

@kyleam kyleam merged commit 9aba263 into datalad:master Oct 23, 2020
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Nov 20, 2020
While looking at py-spy stack from long running status I noticed that majority
of time is spent on git config call, triggered by checking for fake dates
config setting. AFAIK for read-only operations, which cannot produce a commit,
it might benefit to skip that check. And _call_git already has that kwarg,
but it is not proxied by "higher-level" helpers.  To avoid breeding workarounds
(like the one RFed within this commit) I decided that we might benefit from
exposing that option in higher-level helpers as well. Then "read-only" invocations
could set it to False and we might avoid needless  config read in some cases
(I have not done any timing on any prototypical use case which might benefit
from this, but I would assume that status or diff might).

We had an original discussion on either to expose check_fake_dates in higher
level *Repo interfaces before:
datalad#3791 (comment) and decided
to not do it at that point.

As for 'subdatasets' call, performance issue was now addressed with
datalad#5076 , so this change is not strictly
necessary to optimize "subdatasets", but would still be generally benefitial
for possible other invocations of *Repo methods across many instances without
unnecessarily triggering loading of the config, typically in the case of
read-only git operations (git-annex might need to merge git-annex branch, so
typically annex calls should not disable fake dates).
@mih mih deleted the opt-subdataset-2 branch January 15, 2021 07:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improve performance of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

subdatasets -r listing: should be fast(er)
3 participants