
OPT: delay full tree traversal in sort_paths_into_subdatasets for the use case of adding a new subds #1407

Merged: 1 commit merged into datalad:master on Mar 23, 2017

Conversation

@yarikoptic (Member) commented Mar 21, 2017

There seem to be even more points to optimize, but I decided to check this one first.

@mih -- feel free to discard (i.e. not merge) this and just adjust the code accordingly in the ultimate return-values RF, to avoid conflicts.

Possibly closes #1388

So here is the timing before:

$> time bash -c 's=NEW6; datalad create $s; datalad add -d . $s'
[INFO   ] Creating a new annex repo at /mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW6 
Created dataset at /mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW6.                                                           
Added <Dataset path=/mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW6>                                                          
bash -c 's=NEW6; datalad create $s; datalad add -d . $s'  7.18s user 3.66s system 78% cpu 13.883 total

And here is the timing after:

$> time bash -c 's=NEW7; datalad create $s; datalad add -d . $s'
[INFO   ] Creating a new annex repo at /mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW7 
Created dataset at /mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW7.                                                           
Added <Dataset path=/mnt/btrfs/datasets-quicksub/datalad/crawl/openfmri/NEW7>
bash -c 's=NEW7; datalad create $s; datalad add -d . $s'  2.55s user 1.19s system 86% cpu 4.342 total

NB this comparison was done on a "warmed up" filesystem where I had already traversed those submodules multiple times. I expect the difference to be even greater on a "fresh" one.

Since our test battery doesn't involve any such large hierarchies, it is likely that this change would only make the tests slower, due to the additional logic and the extra get_subdatasets call. There might be a better way.

Also note the other OPT comments, where I thought the current logic could be improved to avoid delays that are unnecessary for common use cases.

Ideally, it should be possible to turn that get_subdatasets + get_trace tandem into something a bit smarter that builds up the "graph" from the queried paths, rather than all at once (see the sketch below). It should also be reused/grown from within save_dataset_hierarchy, which calls sort_paths_into_subdatasets for every subdataset, since there might be a hierarchy.
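For illustration, here is a minimal sketch of the lazy idea (helper names are hypothetical, not the actual datalad/interface/utils.py code): the expensive recursive listing is computed at most once, and only when a queried path actually requires it, so the common case of a freshly added top-level subdataset never triggers the traversal.

import os

def sort_paths_into_subdatasets(superds_path, target_paths, get_subdatasets):
    # Hypothetical sketch: delay the full tree traversal until needed.
    spec = {}      # containing dataset path -> paths it should handle
    subdss = None  # full recursive listing, computed lazily at most once
    for path in target_paths:
        # Cheap check first: a direct subdirectory of the superdataset
        # (e.g. a freshly added subdataset) needs no traversal at all.
        if os.path.dirname(path.rstrip(os.sep)) == superds_path.rstrip(os.sep):
            spec.setdefault(superds_path, []).append(path)
            continue
        # Only now pay for the full recursive traversal, and only once.
        if subdss is None:
            subdss = get_subdatasets(recursive=True)
        # Deepest known subdataset containing the path, else the superdataset.
        containing = max(
            (d for d in subdss if path.startswith(d + os.sep)),
            key=len, default=superds_path)
        spec.setdefault(containing, []).append(path)
    return spec

The same lazily built mapping could then be grown across the repeated calls made by save_dataset_hierarchy, rather than recomputed for every subdataset in the hierarchy.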

Commit: … use-case of adding a submodule

There seems to be even more points to optimize but decided first to check this one
@codecov-io commented Mar 21, 2017

Codecov Report

Merging #1407 into master will decrease coverage by 0.06%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master   #1407      +/-   ##
=========================================
- Coverage   89.47%   89.4%   -0.07%     
=========================================
  Files         236     236              
  Lines       24844   24853       +9     
=========================================
- Hits        22229   22221       -8     
- Misses       2615    2632      +17
Impacted Files Coverage Δ
datalad/distribution/add.py 94.73% <ø> (ø) ⬆️
datalad/distribution/dataset.py 95.23% <ø> (ø) ⬆️
datalad/interface/utils.py 94.98% <100%> (+0.15%) ⬆️
datalad/crawler/nodes/matches.py 83.49% <0%> (-4.86%) ⬇️
datalad/log.py 93.02% <0%> (-4.66%) ⬇️
datalad/tests/utils.py 88.41% <0%> (-1.13%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e9393c...62d08be.

@yarikoptic (Member, Author) commented Mar 23, 2017

OK -- this time I tried to add a new subdataset at the top level:

(venv-tests)2 10715.....................................:Thu 23 Mar 2017 12:25:16 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
... and now it is 12:30 -- I need to go to bed already... will report tomorrow when/if it finishes.

It is exciting to hear that some optimization is being done to speed up the full traversal by 3-5x, but even with a 10x speed-up it would be unnecessarily long to wait for a full traversal to complete, for no good reason, when just adding a new dataset -- something that should take a few seconds at most. So I would still appreciate gurus (@mih, @bpoldrack) having a look at this PR.

@mih (Member) left a comment:

Looks good. Will merge. With #1423 further optimizations become possible.

@@ -309,7 +309,7 @@ def get_subdatasets(self, pattern=None, fulfilled=None, absolute=False,
None is return if there is not repository instance yet. For an
existing repository with no subdatasets an empty list is returned.
"""

# OPT TODO: make it a generator for a possible early termination?

This is implemented in #1423
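
For context, here is a minimal sketch (with assumed names, not the actual #1423 implementation) of what a generator-based get_subdatasets buys: the caller can stop consuming as soon as it has what it needs, so unvisited parts of a large hierarchy are never even parsed.

import os

def _direct_submodules(path):
    # Naive .gitmodules scan, purely for illustration.
    gitmodules = os.path.join(path, '.gitmodules')
    if not os.path.exists(gitmodules):
        return
    with open(gitmodules) as f:
        for line in f:
            line = line.strip()
            if line.startswith('path ='):
                yield os.path.join(path, line.split('=', 1)[1].strip())

def iter_subdatasets(path):
    # Yield subdataset paths lazily, depth-first.
    for sub in _direct_submodules(path):
        yield sub
        # Recursion below happens only if the caller keeps consuming.
        for subsub in iter_subdatasets(sub):
            yield subsub

# Early termination: stop at the first match instead of building
# the complete list up front.
def contains_subdataset(root, target):
    return any(sub == target for sub in iter_subdatasets(root))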

@mih merged commit d6262b3 into datalad:master on Mar 23, 2017
@yarikoptic (Member, Author) commented Mar 23, 2017

Thank you @mih!
FYI, that add did complete right in time for when I got to work (i.e. in 8 hours!):

(venv-tests)2 10715.....................................:Thu 23 Mar 2017 12:25:16 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
Added /mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017/ds000114
Added <Dataset path=/mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017>
datalad add -d . nipype-workshop-2017  32.05s user 58.52s system 16% cpu 8:55.62 total
(venv-tests)2 10716.....................................:Thu 23 Mar 2017 08:40:34 AM EDT:.

I will redo it now with the fresh master to see what the change is.

@yarikoptic (Member, Author) commented Mar 23, 2017

It was "slightly" (only ~30 times) faster ;)

(venv-tests)2 10728.....................................:Thu 23 Mar 2017 08:45:51 AM EDT:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl[master]git
$> datalad add -d . nipype-workshop-2017
Added <Dataset path=/mnt/btrfs/datasets/datalad/crawl/nipype-workshop-2017>
(venv-tests)2 10729.....................................:Thu 23 Mar 2017 08:46:03 AM EDT:.

@mih (Member) commented Mar 23, 2017

BTW: the picture is pretty much the same in the studyforrest mockup -- in case you need something that iterates faster than 8h ;-)

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   44.085   44.085 profile.py:1(<module>)
        1    0.000    0.000   43.494   43.494 utils.py:561(newfunc)
        1    0.000    0.000   43.384   43.384 profile.py:4(test_make_studyforrest_mockup)
        1    0.001    0.001   43.376   43.376 utils_testdatasets.py:16(make_studyforrest_mockup)
   188/34    0.008    0.000   42.540    1.251 dataset.py:479(apply_func)
   148/36    0.008    0.000   42.396    1.178 utils.py:948(eval_func)
   148/36    0.003    0.000   42.392    1.178 utils.py:1066(return_func)
   367/80    0.004    0.000   42.392    0.530 utils.py:986(generator_func)
       44    0.006    0.000   30.728    0.698 add.py:141(__call__)
     1285    0.104    0.000   26.530    0.021 cmd.py:239(run)
       36    0.003    0.000   21.831    0.606 clone.py:106(__call__)
      414    0.013    0.000   21.592    0.052 cmd.py:507(run)
       28    0.003    0.000   20.224    0.722 create.py:153(__call__)
     1318    0.012    0.000   18.228    0.014 subprocess.py:448(communicate)
       88    0.002    0.000   17.633    0.200 utils.py:253(save_dataset_hierarchy)
      736    0.023    0.000   15.056    0.020 dataset.py:163(repo)
     1596    0.049    0.000   13.587    0.009 utils.py:709(swallow_logs)
      800    0.015    0.000   13.522    0.017 contextlib.py:21(__exit__)
      798    0.010    0.000   13.371    0.017 utils.py:758(cleanup)
      798   13.240    0.017   13.240    0.017 {gc.collect}

This is with your fix already.


Successfully merging this pull request may close these issues: "adding an existing sub-dataset is taking WAY TOO LONG"