
OPT: Speedup of per-subdataset "contains?" matching #4868

Merged
mih merged 11 commits into master from opt-subdatasets on Oct 22, 2020

Conversation

@mih (Member) commented Sep 4, 2020

Speeds up the rejection of non-containing subdatasets. For the UKB dataset with 42k subdatasets, this shaves off 1s of the previous total runtime of 2.5s.

Related to gh-4859

This also removes a backward-compatibility kludge that we put in place over a year ago (a `revision` property instead of the present `gitshasum` property in the submodule records).

Edit: This also improves the plain reporting performance (same 42k subdatasets):

```
Before: datalad subdatasets  9.49s user 3.41s system 57% cpu 22.284 total
After:  datalad subdatasets  9.81s user 2.49s system 74% cpu 16.399 total
And with a twist (see TODO): datalad subdatasets -d .  6.95s user 2.14s system 66% cpu 13.622 total
```

TODO:

  • GitRepo.get_submodules_() calls GitRepo.get_content_info(). This might be made faster by filtering the path constraints with the contains parameter before calling any of this stack. Right now contains is considered last, and all info is requested without any path constraint, in the context of #4859 (Substantial overhead of get vs direct clone of a subdataset) -> #5063 (GitRepo.get_submodules_() implementation issues)
  • Surprise differences with or without --dataset
    datalad subdatasets -r --contains sub-3507513  2.76s user 0.33s system 103% cpu 2.973 total                                           
    datalad subdatasets -d . -r --contains sub-3507513  1.36s user 0.21s system 104% cpu 1.507 total                                      
    
  • update is broken by the second commit; not clear why yet (likely it is still using the revision property)

@mih mih added the "performance" label (Improve performance of an existing feature) on Sep 4, 2020
@mih mih force-pushed the opt-subdatasets branch 3 times, most recently from 085d9eb to 7a191fc on September 4, 2020 09:54
@mih mih changed the title from 'OPT: ~20% speedup of per-subdataset "contains?" matching' to 'OPT: Speedup of per-subdataset "contains?" matching' on Sep 4, 2020
@mih mih marked this pull request as draft September 4, 2020 11:37
@mih mih force-pushed the opt-subdatasets branch 2 times, most recently from 261492d to 04d896e on September 4, 2020 12:46
@yarikoptic (Member):

FWIW -- no significant impact on existing benchmarks (if anything - slight tendency to slow down). Would be nice to get a dedicated benchmark to demonstrate the effect.

@mih (Member Author) commented Sep 4, 2020

> FWIW -- no significant impact on existing benchmarks (if anything - slight tendency to slow down). Would be nice to get a dedicated benchmark to demonstrate the effect.

Can you clarify what you want? From the times you see above you can decide how big a dataset you need to show an effect. For 1s you need 40k subdatasets; if you are satisfied with less, a few thousand should work. So if, say, you want to see 100ms and go with 4k subdatasets, you need to invest the ~3h it takes to create such a dataset.

Otherwise the benchmark is: `datalad subdatasets -d . -r --contains <subdspath>`

@codecov (bot) commented Sep 4, 2020

Codecov Report

Merging #4868 into master will decrease coverage by 0.02%.
The diff coverage is 100.00%.


```diff
@@            Coverage Diff             @@
##           master    #4868      +/-   ##
==========================================
- Coverage   89.81%   89.78%   -0.03%
==========================================
  Files         293      293
  Lines       41239    41238       -1
==========================================
- Hits        37038    37025      -13
- Misses       4201     4213      +12
```
| Impacted Files | Coverage Δ |
|---|---|
| datalad/support/gitrepo.py | 90.47% <ø> (ø) |
| datalad/distribution/dataset.py | 96.41% <100.00%> (ø) |
| datalad/distribution/update.py | 98.80% <100.00%> (ø) |
| datalad/local/subdatasets.py | 96.49% <100.00%> (+0.83%) ⬆️ |
| datalad/local/tests/test_subdataset.py | 100.00% <100.00%> (ø) |
| datalad/downloaders/http.py | 81.85% <0.00%> (-2.71%) ⬇️ |
| datalad/downloaders/tests/test_http.py | 88.52% <0.00%> (-1.88%) ⬇️ |
| datalad/distribution/create_sibling.py | 86.11% <0.00%> (+0.56%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@yarikoptic (Member):

Are you saying that there should be no notable (but then also no negative) impact on a superdataset with e.g. 20 subdatasets?

Moreover, we already have some benchmarks set up with some sizeable (although maybe not big enough) datasets, e.g. subclasses of https://github.com/datalad/datalad/blob/master/benchmarks/common.py#L104 - it might be worth adding a time_contains in one of those subclasses.
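
For illustration, a rough standalone sketch of such a timing (this is not the existing asv suite; all names and the tiny dataset size are placeholders — per the numbers above, thousands of subdatasets are needed to see a clear effect):

```python
# Rough standalone sketch (not the project's asv suite): create a small
# superdataset with a few subdatasets and time the "contains" query that
# this PR optimizes. A real benchmark would rather extend the suites in
# benchmarks/common.py.
import tempfile
import timeit

from datalad.api import create, subdatasets

root = tempfile.mkdtemp()
superds = create(root)
for i in range(5):
    superds.create(f'sub-{i}')   # register a handful of subdatasets

elapsed = timeit.timeit(
    lambda: subdatasets(dataset=root, recursive=True, contains='sub-3'),
    number=10,
)
print(f'{elapsed / 10:.3f} s per "contains" query')
```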

@mih (Member Author) commented Sep 4, 2020

There should be no negative impact on small datasets. The proposed changes are fairly simple: they all pull processing out of inner loops and move it a layer or two up.
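
As a generic illustration of that kind of change (not the actual datalad code), the pattern is roughly:

```python
# Illustrative only: hoist work that does not depend on the loop variable out
# of the per-submodule loop. 'contains' is a list of pathlib.Path objects.

def match_in_loop(submodules, contains):
    hits = []
    for sm in submodules:
        # recomputed for every submodule although it never changes
        expanded = [[c] + list(c.parents) for c in contains]
        if any(sm['path'] in e for e in expanded):
            hits.append(sm)
    return hits

def match_hoisted(submodules, contains):
    # computed once, a layer above the loop
    expanded = [[c] + list(c.parents) for c in contains]
    return [sm for sm in submodules
            if any(sm['path'] in e for e in expanded)]
```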

@mih (Member Author) commented Sep 7, 2020

OK, this is a bit crazy. If I give --dataset . things get much faster.

```
datalad subdatasets > somelog  11.18s user 2.10s system 76% cpu 17.326 total
datalad subdatasets -d . > somelog  8.42s user 0.74s system 100% cpu 9.102 total
```

This should not make a difference. But now it gets weird. If I do not write the output to a file but to /dev/null, the pattern reverses?!

```
datalad subdatasets > /dev/null  11.00s user 1.17s system 101% cpu 12.001 total
datalad subdatasets -d . > /dev/null  8.69s user 2.04s system 69% cpu 15.361 total
```

And the default result renderer seems to make up 25% of the total runtime?!

```
datalad -f '{path}' subdatasets > /dev/null  8.87s user 1.01s system 101% cpu 9.714 total
datalad -f '{path}' subdatasets -d . > /dev/null  6.55s user 1.84s system 65% cpu 12.799 total
```

@adswa (Member) commented Sep 7, 2020

Here is what I get (on NFS):

In master:

```
datalad subdatasets > somelog 11.98s user 1.08s system 100% cpu 13.023 total
datalad subdatasets -d . > somelog 10.05s user 1.15s system 100% cpu 11.179 total
datalad subdatasets > /dev/null  12.45s user 2.10s system 72% cpu 20.132 total   (???)
datalad subdatasets -d . > /dev/null  9.14s user 0.90s system 100% cpu 10.022 total
datalad -f '{path}' subdatasets > /dev/null  10.39s user 0.90s system 100% cpu 11.241 total
datalad -f '{path}' subdatasets -d . > /dev/null  7.19s user 0.91s system 100% cpu 8.078 total
```

On this branch:

```
SLOWER datalad subdatasets > somelog 11.06s user 1.87s system 76% cpu 16.922 total
FASTER datalad subdatasets -d . > somelog  8.62s user 0.87s system 100% cpu 9.416 total
FASTER datalad subdatasets > /dev/null   11.49s user 1.90s system 77% cpu 17.305 total
FASTER datalad subdatasets -d . > /dev/null   8.61s user 0.84s system 100% cpu 9.427 total
SLOWER datalad -f '{path}' subdatasets > /dev/null  9.09s user 1.95s system 73% cpu 15.015 total
FASTER datalad -f '{path}' subdatasets -d . > /dev/null 6.83s user 0.87s system 100% cpu 7.695 total
```

@mih's stats in comparison (on local drive):

```
datalad subdatasets > somelog  11.18s user 2.10s system 76% cpu 17.326 total
datalad subdatasets -d . > somelog  8.42s user 0.74s system 100% cpu 9.102 total
datalad subdatasets > /dev/null  11.00s user 1.17s system 101% cpu 12.001 total
datalad subdatasets -d . > /dev/null  8.69s user 2.04s system 69% cpu 15.361 total
datalad -f '{path}' subdatasets > /dev/null  8.87s user 1.01s system 101% cpu 9.714 total
datalad -f '{path}' subdatasets -d . > /dev/null  6.55s user 1.84s system 65% cpu 12.799 total
```

@yarikoptic (Member):

Wild guess re the -f '{path}' effect - maybe some .repo access is involved in the default full renderer?

Speeds up the rejection of non-containing subdatasets. For the UKB
dataset with 42k subdatasets, this shaves off 500ms of the previous
total runtime of 2.5s.

Related to datalad gh-4859

For a 42k-subdataset dataset, trying to match a single subdataset,
this shaves off 700ms of the total 2.1s runtime.
This disentangled the ability to switch behavior depending on dataset arg
type from the necessity to repeatedly run `require_dataset()` when
`resolve_path()` is called repeatedly. It is common to call this function
in a loop on every single input argument, so with lots of arguments this
has a performance impact.
…root

datalad#4868 (comment)
reported substantial time differences with and without `--dataset` on
datasets with many subdatasets. The cause was that without `--dataset`,
`os.curdir` was used as a query path, whereas with `--dataset`, no explicit
query path was used. Each query path adds runtime from various safety
checks. However, when we generate the query path ourselves, we can optimize
for that.

This change uses no query path if the to-be-queried dataset has its
root at `PWD`. This yields the same results, but bypasses all
path-related processing. As a consequence, command performance is the
same with or without a `dataset` argument given.
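
A sketch of the idea described above (illustrative only, not the actual implementation):

```python
# Illustrative only: skip generating an implicit os.curdir query path when the
# dataset to be queried has its root at the process working directory, so all
# per-path resolution and safety checks are bypassed.
import os
from pathlib import Path

def query_paths(dataset_root, requested_paths=None):
    if requested_paths:
        # explicit constraints still go through normal path handling
        return [Path(p) for p in requested_paths]
    if Path(dataset_root).resolve() == Path.cwd():
        return None          # "no constraint": same results, no path processing
    # otherwise fall back to constraining the query to the current directory
    return [Path(os.curdir)]
```
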
@mih (Member Author) commented Oct 21, 2020

No difference conditional on --dataset anymore, with or without writing to a log:

```
datalad subdatasets > somelog  7.45s user 2.03s system 100% cpu 9.395 total
datalad subdatasets -d . > somelog  7.57s user 1.94s system 100% cpu 9.434 total
datalad subdatasets > /dev/null  7.48s user 1.98s system 100% cpu 9.396 total
datalad subdatasets -d . > /dev/null  7.44s user 2.10s system 100% cpu 9.477 total
```

Still a substantial impact of the result rendering:

```
datalad -f '{path}' subdatasets > /dev/null  6.19s user 2.10s system 100% cpu 8.219 total
```

But there is nothing subdatasets-specific about this, hence it is not the subject of this PR -> #5060

mih added a commit to mih/datalad that referenced this pull request Oct 21, 2020
Updating it for each result is the cause of the slow-down reported in
datalad#4868 (comment)

With this change we update at most twice per second (2 Hz). For the original
scenario this results in a ~30% speed-up:

```
datalad subdatasets -d .  4.30s user 1.99s system 101% cpu 6.225 total
datalad subdatasets -d .  6.33s user 2.27s system 100% cpu 8.538 total
```
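
A generic sketch of this kind of throttling (not the actual datalad code): refresh the display only when enough time has passed since the previous refresh, here at most roughly twice per second.

```python
# Illustrative only: cap UI refreshes at ~2 Hz instead of once per result.
import time

class ThrottledRenderer:
    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval      # seconds between refreshes
        self._last = float('-inf')

    def render(self, msg):
        now = time.monotonic()
        if now - self._last >= self.min_interval:
            print(msg, end='\r', flush=True)
            self._last = now

renderer = ThrottledRenderer()
for i in range(100_000):
    renderer.render(f'processed {i} results')   # most calls return immediately
```
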
@mih (Member Author) commented Oct 21, 2020

Current benchmark runs seem to indicate no performance penalty for the small-scale dataset we are using in the tests:

https://github.com/datalad/datalad/pull/4868/checks?check_run_id=1286049657

@mih mih marked this pull request as ready for review October 21, 2020 14:26
Remove result filter. GitRepo.get_submodules_() already strips
non-dataset results.
@mih mih requested a review from kyleam October 21, 2020 14:32
@mih (Member Author) commented Oct 21, 2020

FTR: I tried to further improve the situation by using the --contains specification as additional query constraints, but that did not work -- it made things slower. For posterity, here is the patch:

```diff
diff --git a/datalad/local/subdatasets.py b/datalad/local/subdatasets.py
index cc36a074f..5c0d63b9d 100644
--- a/datalad/local/subdatasets.py
+++ b/datalad/local/subdatasets.py
@@ -75,6 +75,14 @@ def _parse_git_submodules(ds_pathobj, repo, paths):
             else:
                 # we had path contraints, but none matched this dataset
                 return
+        else:
+            # filter out all the paths that point outside the repo. they would
+            # throw errors in GitRepo.get_submodules_(). such paths would
+            # make it here, because we mix top-level query path constrains with
+            # --contains specification in order to speed things up. If the latter
+            # are garbage, not reporting on them will be detected and impossible
+            # results are generated
+            paths = [p for p in paths if p.parts and p.parts[0] != os.pardir]
     # can we use the reported as such, or do we need to recode wrt to the
     # query context dataset?
     if ds_pathobj == repo.pathobj:
@@ -289,8 +297,12 @@ def _get_submodules(ds, paths, fulfilled, recursive, recursion_limit,
         [c] + list(c.parents)
         for c in (contains if contains is not None else [])
     ]
+    submodule_path_constraints = (paths or []) + contains if contains else paths
     # put in giant for-loop to be able to yield results before completion
-    for sm in _parse_git_submodules(ds.pathobj, repo, paths):
+    for sm in _parse_git_submodules(
+            ds.pathobj,
+            repo,
+            submodule_path_constraints):
         sm_path = sm['path']
         contains_hits = None
         if contains:
```

@kyleam (Contributor) left a review comment:

Reading up to cbf79db, these changes look good to me aside from my comment about an inaccurate comment.

datalad/local/subdatasets.py (outdated, resolved)
```diff
-contains_hits = [
-    c for c in contains if sm['path'] == c or sm['path'] in c.parents
-]
+contains_hits = [c[0] for c in expanded_contains if sm_path in c]
```
@kyleam (Contributor) commented:

Okay, so as far as I can see, the two potential sources of speed-up are that it 1) drops the repeated 'path' lookup in the sm dict and 2) uses list's __contains__ over _PathParents.__contains__ (which is Sequence.__contains__).

I doubt no. 1 would be measurable in this context, but I think it's still good to do. So I'd guess the performance gains are coming from no. 2.

timeit runs, .parents versus list __contains__
```sh
echo "--- .parents"
python -m timeit \
       -n 3 \
       -s 'from pathlib import Path' \
       -s 'import string' \
       'path = Path(*string.ascii_lowercase)' \
       'for _ in range(int(4e4)): "X" in path.parents'

echo "--- parents, no attribute access"
python -m timeit \
       -n 3 \
       -s 'from pathlib import Path' \
       -s 'import string' \
       -s 'path = Path(*string.ascii_lowercase)' \
       'parents = path.parents' \
       'for _ in range(int(4e4)): "X" in parents'

echo "--- parents, as list"
python -m timeit \
       -n 3 \
       -s 'from pathlib import Path' \
       -s 'import string' \
       -s 'path = Path(*string.ascii_lowercase)' \
       'parents = list(path.parents)' \
       'for _ in range(int(4e4)): "X" in parents'

# --- .parents
# 3 loops, best of 5: 1.31 sec per loop
# --- parents, no attribute access
# 3 loops, best of 5: 1.33 sec per loop
# --- parents, as list
# 3 loops, best of 5: 144 msec per loop
```

sm_path (i.e. sm["path"]) is a path object, and so are the items of each list
in expanded_contains.
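
A tiny, self-contained illustration of that matching logic (values made up):

```python
# Illustrative only: both sides of the membership test are pathlib.Path
# objects, so plain list containment compares them correctly (and avoids the
# slower _PathParents.__contains__).
from pathlib import Path

contains = [Path('/ds/sub-3507513/anat')]
expanded_contains = [[c] + list(c.parents) for c in contains]

sm_path = Path('/ds/sub-3507513')       # a submodule's 'path' value
contains_hits = [c[0] for c in expanded_contains if sm_path in c]
print(contains_hits)   # [PosixPath('/ds/sub-3507513/anat')] on POSIX systems
```
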
cbf79db (RF: Simplify submodule path recoding) placed the yield at
the wrong level and didn't preserve the props["path"] access in the
rewrite.  This is responsible for the failures on the
TMPDIR=/var/tmp/sym\ link/ builds, e.g.
<https://travis-ci.org/github/datalad/datalad/jobs/737733767>.
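
As a generic illustration of that class of bug (not the datalad code):

```python
# Illustrative only: a yield placed one level too deep emits a record once per
# recoded key, partially recoded; moving it out restores one record per input.
def recode_wrong(records, recode):
    for props in records:
        for key in ('path', 'parent'):
            props[key] = recode(props[key])
            yield props        # BUG: yields the same record repeatedly

def recode_right(records, recode):
    for props in records:
        for key in ('path', 'parent'):
            props[key] = recode(props[key])
        yield props            # one fully recoded record per input
```
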
@mih (Member Author) commented Oct 22, 2020

Thx for the fixes @kyleam !

@mih mih merged commit 239bf8e into datalad:master Oct 22, 2020
@mih mih deleted the opt-subdatasets branch October 22, 2020 09:11
Labels
performance Improve performance of an existing feature
4 participants