
OPT: delay full tree traversal in sort_paths_into_subdatasets for the use case of adding a new subds #1407

Merged: 1 commit merged into datalad:master on Mar 23, 2017

Conversation

@yarikoptic (Member) commented Mar 21, 2017

There seem to be even more points to optimize, but I decided to check this one first.

@mih -- feel free to discard (i.e. not merge) this and just adjust the code accordingly in the ultimate return-values RF, to avoid conflicts.

Possibly closes #1388

So here is the timing before:

$> time bash -c 's=NEW6; datalad create $s; datalad add -d . $s'
[INFO   ] Creating a new annex repo at /mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW6 
Created dataset at /mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW6.                                                           
Added <Dataset path=/mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW6>                                                          
bash -c 's=NEW6; datalad create $s; datalad add -d . $s'  7.18s user 3.66s system 78% cpu 13.883 total

And here is the timing after:

$> time bash -c 's=NEW7; datalad create $s; datalad add -d . $s'
[INFO   ] Creating a new annex repo at /mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW7 
Created dataset at /mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW7.                                                           
Added <Dataset path=/mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW7>
bash -c 's=NEW7; datalad create $s; datalad add -d . $s'  2.55s user 1.19s system 86% cpu 4.342 total

NB this comparison was done on a "warmed up" filesystem where I had already traversed those submodules multiple times. I expect the difference to be even greater on a "fresh" one.

Since our test battery doesn't involve any such large hierarchies, it is likely that this change would only make the tests slower, due to the additional logic and the extra get_subdatasets call. There might be a better way.

Also note the other OPT comments, where I thought the current logic could be improved to avoid delays that are unnecessary for common use cases.

Ideally, it should be possible to turn that get_subdatasets + get_trace tandem into something a bit smarter that builds up the "graph" from the queried paths, rather than all at once (see the sketch below). It should also be reused/grown from within save_dataset_hierarchy, which calls sort_paths_into_subdatasets for every subdataset, since there might be a hierarchy.
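For illustration, here is a minimal sketch of the lazy idea (helper names are hypothetical, not the actual datalad/interface/utils.py code): the expensive recursive listing is computed at most once, and only when a queried path actually requires it, so the common case of a freshly added top-level subdataset never triggers the traversal.

import os

def sort_paths_into_subdatasets(superds_path, target_paths, get_subdatasets):
    # Hypothetical sketch: delay the full tree traversal until needed.
    spec = {}      # containing dataset path -> paths it should handle
    subdss = None  # full recursive listing, computed lazily at most once
    for path in target_paths:
        # Cheap check first: a direct subdirectory of the superdataset
        # (e.g. a freshly added subdataset) needs no traversal at all.
        if os.path.dirname(path.rstrip(os.sep)) == superds_path.rstrip(os.sep):
            spec.setdefault(superds_path, []).append(path)
            continue
        # Only now pay for the full recursive traversal, and only once.
        if subdss is None:
            subdss = get_subdatasets(recursive=True)
        # Deepest known subdataset containing the path, else the superdataset.
        containing = max(
            (d for d in subdss if path.startswith(d + os.sep)),
            key=len, default=superds_path)
        spec.setdefault(containing, []).append(path)
    return spec

The same lazily built mapping could then be grown across the repeated calls made by save_dataset_hierarchy, rather than recomputed for every subdataset in the hierarchy.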

Commit: … use-case of adding a submodule

There seems to be even more points to optimize but decided first to check this one
@codecov-io commented Mar 21, 2017

Codecov Report

Merging #1407 into master will decrease coverage by 0.06%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master   #1407      +/-   ##
=========================================
- Coverage   89.47%   89.4%   -0.07%     
=========================================
  Files         236     236              
  Lines       24844   24853       +9     
=========================================
- Hits        22229   22221       -8     
- Misses       2615    2632      +17
Impacted Files Coverage Δ
datalad/distribution/add.py 94.73% <ø> (ø) ⬆️
datalad/distribution/dataset.py 95.23% <ø> (ø) ⬆️
datalad/interface/utils.py 94.98% <100%> (+0.15%) ⬆️
datalad/crawler/nodes/matches.py 83.49% <0%> (-4.86%) ⬇️
datalad/log.py 93.02% <0%> (-4.66%) ⬇️
datalad/tests/utils.py 88.41% <0%> (-1.13%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e9393c...62d08be.

@yarikoptic (Member, Author) commented Mar 23, 2017

OK -- this time I tried to add a new subdataset at the top level:

(venv-tests)2 10715.....................................:Thu 23 Mar 2017 12:25:16 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
... and now it is 12:30 -- I need to go to bed already... will report tomorrow when/if it finishes.

It is exciting to hear that some optimization is being done to speed up the full traversal by 3-5x, but even with a 10x speed-up it would be unnecessarily long to wait for a full traversal to complete, for no good reason, when just adding a new dataset -- something that should take a few seconds at most. So I would still appreciate gurus (@mih, @bpoldrack) having a look at this PR.

@mih (Member) left a comment:

Looks good. Will merge. With #1423 further optimizations become possible.

@@ -309,7 +309,7 @@ def get_subdatasets(self, pattern=None, fulfilled=None, absolute=False,
None is return if there is not repository instance yet. For an
existing repository with no subdatasets an empty list is returned.
"""

# OPT TODO: make it a generator for a possible early termination?

This is implemented in #1423
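
For context, here is a minimal sketch (with assumed names, not the actual #1423 implementation) of what a generator-based get_subdatasets buys: the caller can stop consuming as soon as it has what it needs, so unvisited parts of a large hierarchy are never even parsed.

import os

def _direct_submodules(path):
    # Naive .gitmodules scan, purely for illustration.
    gitmodules = os.path.join(path, '.gitmodules')
    if not os.path.exists(gitmodules):
        return
    with open(gitmodules) as f:
        for line in f:
            line = line.strip()
            if line.startswith('path ='):
                yield os.path.join(path, line.split('=', 1)[1].strip())

def iter_subdatasets(path):
    # Yield subdataset paths lazily, depth-first.
    for sub in _direct_submodules(path):
        yield sub
        # Recursion below happens only if the caller keeps consuming.
        for subsub in iter_subdatasets(sub):
            yield subsub

# Early termination: stop at the first match instead of building
# the complete list up front.
def contains_subdataset(root, target):
    return any(sub == target for sub in iter_subdatasets(root))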

@mih merged commit d6262b3 into datalad:master on Mar 23, 2017
@yarikoptic (Member, Author) commented Mar 23, 2017

Thank you @mih!
FYI, that add did complete right in time for when I got to work (i.e. in 8 hours!):

(venv-tests)2 10715.....................................:Thu 23 Mar 2017 12:25:16 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
Added /mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017/ds000114
Added <Dataset path=/mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017>
datalad add -d . nipype-workshop-2017  32.05s user 58.52s system 16% cpu 8:55.62 total
(venv-tests)2 10716.....................................:Thu 23 Mar 2017 08:40:34 AM EDT:.

I will redo it now with the fresh master to see what the change is.

@yarikoptic (Member, Author) commented Mar 23, 2017

It was "slightly" (only ~30 times) faster ;)

(venv-tests)2 10728.....................................:Thu 23 Mar 2017 08:45:51 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
Added <Dataset path=/mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017>
(venv-tests)2 10729.....................................:Thu 23 Mar 2017 08:46:03 AM EDT:.

@mih (Member) commented Mar 23, 2017

BTW: the picture is pretty much the same in the studyforrest mockup -- in case you need something that iterates faster than 8h ;-)

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   44.085   44.085 profile.py:1(<module>)
        1    0.000    0.000   43.494   43.494 utils.py:561(newfunc)
        1    0.000    0.000   43.384   43.384 profile.py:4(test_make_studyforrest_mockup)
        1    0.001    0.001   43.376   43.376 utils_testdatasets.py:16(make_studyforrest_mockup)
   188/34    0.008    0.000   42.540    1.251 dataset.py:479(apply_func)
   148/36    0.008    0.000   42.396    1.178 utils.py:948(eval_func)
   148/36    0.003    0.000   42.392    1.178 utils.py:1066(return_func)
   367/80    0.004    0.000   42.392    0.530 utils.py:986(generator_func)
       44    0.006    0.000   30.728    0.698 add.py:141(__call__)
     1285    0.104    0.000   26.530    0.021 cmd.py:239(run)
       36    0.003    0.000   21.831    0.606 clone.py:106(__call__)
      414    0.013    0.000   21.592    0.052 cmd.py:507(run)
       28    0.003    0.000   20.224    0.722 create.py:153(__call__)
     1318    0.012    0.000   18.228    0.014 subprocess.py:448(communicate)
       88    0.002    0.000   17.633    0.200 utils.py:253(save_dataset_hierarchy)
      736    0.023    0.000   15.056    0.020 dataset.py:163(repo)
     1596    0.049    0.000   13.587    0.009 utils.py:709(swallow_logs)
      800    0.015    0.000   13.522    0.017 contextlib.py:21(__exit__)
      798    0.010    0.000   13.371    0.017 utils.py:758(cleanup)
      798   13.240    0.017   13.240    0.017 {gc.collect}

This is with your fix already.


Successfully merging this pull request may close these issues: "adding an existing sub-dataset is taking WAY TOO LONG"