OPT: delay full tree traversal in sort_paths_into_subdatasets for the use case of adding a new subds #1407
Conversation
… use-case of adding a submodule
Codecov Report
```diff
@@            Coverage Diff            @@
##           master    #1407      +/-   ##
==========================================
- Coverage   89.47%    89.40%    -0.07%
==========================================
  Files         236       236
  Lines       24844     24853       +9
==========================================
- Hits        22229     22221       -8
- Misses       2615      2632      +17
```
Continue to review full report at Codecov.
ok -- this time I have tried to add a new subdataset at the top level:

```shell
(venv-tests)2 10715.....................................:Thu 23 Mar 2017 12:25:16 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
```

... and now it is 12:30 -- need to go to bed already... will report tomorrow when/if it finishes.

It is exciting to hear that some optimization is being done to get a x3-5 speed-up of the full traversal, but even with a x10 speed-up it would be unnecessarily long to wait for a full traversal to complete for no good reason when just adding a new dataset, which should take a few seconds at most. So I would still appreciate the gurus (@mih, @bpoldrack) having a look at this PR.
Looks good. Will merge. With #1423 further optimizations become possible.
```diff
@@ -309,7 +309,7 @@ def get_subdatasets(self, pattern=None, fulfilled=None, absolute=False,
         None is return if there is not repository instance yet. For an
         existing repository with no subdatasets an empty list is returned.
         """
+        # OPT TODO: make it a generator for a possible early termination?
```
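The generator idea from that TODO can be sketched roughly like this (a hypothetical illustration with made-up names, not DataLad's actual `get_subdatasets`):

```python
import os

def iter_subdatasets(top):
    """Yield directories under ``top`` that look like datasets
    (here: contain a .gitmodules file), depth-first.

    Because this is a generator, a caller that only needs the first
    match can stop consuming it and skip the rest of the tree walk --
    the "early termination" the comment asks about.
    """
    for root, dirs, files in os.walk(top):
        if root != top and '.gitmodules' in files:
            yield root

# Early termination: next() walks only until the first hit.
# first = next(iter_subdatasets('/some/dataset'), None)
```

A caller interested in a single subdataset pays only for the part of the tree walked so far, instead of a full traversal followed by indexing into a list.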
This is implemented in #1423
Thank you @mih

```shell
(venv-tests)2 10715.....................................:Thu 23 Mar 2017 12:25:16 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
Added /mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017/ds000114
Added <Dataset path=/mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017>
datalad add -d . nipype-workshop-2017  32.05s user 58.52s system 16% cpu 8:55.62 total
(venv-tests)2 10716.....................................:Thu 23 Mar 2017 08:40:34 AM EDT:.
```

I will redo it now with the fresh master to see what the change is.
It was "slightly" (only ~30 times) faster ;)

```shell
(venv-tests)2 10728.....................................:Thu 23 Mar 2017 08:45:51 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
Added <Dataset path=/mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017>
(venv-tests)2 10729.....................................:Thu 23 Mar 2017 08:46:03 AM EDT:.
```
BTW: the picture is pretty much the same in the studyforrest mockup -- in case you need something that iterates faster than 8h ;-)
This is with your fix already.
There seem to be even more points to optimize, but I decided to check this one first.
@mih -- feel free to discard (i.e. not merge) this and instead just adjust the code accordingly in the ultimate return-values RF, to avoid conflicts.
Possibly closes #1388
So here is the timing before:

and here is after:

NB: this comparison was done on a "warmed-up" filesystem where I had already traversed those submodules multiple times. I expect the difference to be even greater on a "fresh" one.
Since our test battery doesn't involve any such large hierarchies, it is likely that this would only make the tests slower, because of the additional logic and the extra get_subdatasets call. There might be a better way.
Also note the other `OPT` comments, where I thought the current logic could be improved to avoid delays that are unnecessary for common use-cases. Ideally, it should be possible to turn that `get_subdatasets` + `get_trace` tandem into something a bit smarter, which would build up the "graph" based on the queried paths, and not all at once. It should also be reused/grown from within `save_dataset_hierarchy`, which calls that `sort_paths_into_subdatasets` for every subdataset, since there might be a hierarchy.
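The "grow the graph only where it is queried" idea might look something like the following sketch. This is illustrative only: `sort_paths_into_subdatasets` here is a made-up stand-in for DataLad's helper, and `is_dataset_root` is an assumed predicate supplied by the caller.

```python
import os

def sort_paths_into_subdatasets(paths, is_dataset_root, cache=None):
    """Map each queried path to its closest containing dataset root.

    Instead of enumerating every subdataset up front, walk *upward*
    from each queried path and remember dataset roots as they are
    discovered.  Passing the same ``cache`` dict across calls (e.g.
    from something like save_dataset_hierarchy) lets the known part
    of the hierarchy grow incrementally.
    """
    cache = {} if cache is None else cache
    out = {}
    for p in paths:
        cur = os.path.abspath(p)
        # Climb toward the filesystem root until a dataset root is found.
        while cur != os.path.dirname(cur):
            if cur in cache or is_dataset_root(cur):
                cache[cur] = True  # remember for later queries
                break
            cur = os.path.dirname(cur)
        out.setdefault(cur, []).append(p)
    return out
```

The cost is proportional to the depth of the queried paths rather than to the size of the whole tree, which is exactly the win when adding a single new subdataset to a large hierarchy.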