OPT: Speedup of per-subdataset "contains?" matching #4868
Conversation
FWIW -- no significant impact on existing benchmarks (if anything, a slight tendency to slow down). Would be nice to get a dedicated benchmark to demonstrate the effect.
Can you clarify what you want? From the times you see above you can decide on how big of a dataset you need to show an effect. For 1s you need 40k subdatasets; if you are satisfied with less, a few thousand should work. So let's say you want to see 100ms and do 4k subdatasets -- then you need to invest the ~3h to create such a dataset. Otherwise the benchmark is:
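For concreteness, a minimal sketch of what such a dedicated benchmark could look like, assuming an ASV-style suite like the one already under `benchmarks/`; the class name, fixture setup, and the (deliberately small) subdataset count are made up for illustration:

```python
# Hypothetical ASV-style benchmark; names and the scaled-down subdataset
# count are assumptions, not part of this PR.
import tempfile

from datalad.api import Dataset


class ContainsMatching:
    # far fewer subdatasets than the 4k/40k discussed above, to keep setup cheap
    n_subds = 50

    def setup(self):
        self.ds = Dataset(tempfile.mkdtemp()).create()
        for i in range(self.n_subds):
            # each call registers one subdataset in the superdataset
            self.ds.create('sub{:04d}'.format(i))
        self.target = 'sub{:04d}'.format(self.n_subds - 1)

    def time_contains_single_subdataset(self):
        # the operation this PR optimizes: per-subdataset "contains?" matching
        self.ds.subdatasets(contains=self.target)
```

A run with a few thousand subdatasets would be needed to see effects on the order of 100ms, as estimated above.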
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #4868      +/-   ##
==========================================
- Coverage   89.81%   89.78%   -0.03%
==========================================
  Files         293      293
  Lines       41239    41238       -1
==========================================
- Hits        37038    37025      -13
- Misses       4201     4213      +12
```

Continue to review full report at Codecov.
are you saying that there should be no notable (but thus also no negative) impact on a superdataset with e.g. 20 subdatasets? Moreover, we already have some benchmarks set up with some sizeable (although maybe not big enough) datasets, e.g. subclasses of https://github.com/datalad/datalad/blob/master/benchmarks/common.py#L104 - might be worth adding a
There should be no negative impact on small datasets. The proposed changes are fairly simple: they all pull processing out of inner loops and move it a layer or two up.
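For illustration only, the general shape of that kind of change (hoisting loop-invariant work out of an inner loop); the names are made up and this is not code from the PR:

```python
# Sketch of moving loop-invariant processing out of an inner loop and
# "a layer or two above"; purely illustrative names.
from pathlib import Path


def report_before(records, query_paths):
    for rec in records:
        # resolved anew on every iteration, although it never changes
        resolved = [Path(p).resolve() for p in query_paths]
        if Path(rec['path']) in resolved:
            yield rec


def report_after(records, query_paths):
    # resolved once, before iterating over (potentially tens of thousands of)
    # submodule records
    resolved = [Path(p).resolve() for p in query_paths]
    for rec in records:
        if Path(rec['path']) in resolved:
            yield rec
```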
OK, this is a bit crazy. If I give
This should not make a difference. But now it gets weird. If I do not write the output to a file but to
and the runtime of the default result renderer seems to make up 25% of the total runtime?!
Here is what I get (on NFS): In master:
On this branch:
@mih's stats in comparison (on local drive):
Wild guess re |
Speeds up the rejection of non-containing subdatasets. For the UKB dataset with 42k subdatasets, this shaves off 500ms of the previous total runtime of 2.5s. Related to gh-4859
For a dataset with 42k subdatasets, trying to match a single subdataset, this shaves off 700ms of the total 2.1s runtime.
This disentangles the ability to switch behavior depending on the dataset argument type from the necessity to repeatedly run `require_dataset()` when `resolve_path()` is called repeatedly. It is common to call this function in a loop on every single input argument, so with lots of arguments this has a performance impact.
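A minimal sketch of the caller-side pattern this describes, with hypothetical helper names standing in for the real functions (not datalad's actual signatures):

```python
# Sketch only: require the dataset once, then resolve each path against the
# already-resolved instance. `_require_dataset` and `_resolve_one` are
# made-up stand-ins.
from pathlib import Path


def _require_dataset(dataset_arg):
    # imagine expensive validation/instantiation happening here
    return Path(dataset_arg).resolve()


def _resolve_one(path, ds_resolved):
    # per-path work only; no repeated dataset resolution
    return (ds_resolved / path).resolve()


def resolve_all(paths, dataset_arg):
    ds = _require_dataset(dataset_arg)  # done once, not once per path
    return [_resolve_one(p, ds) for p in paths]
```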
…root datalad#4868 (comment) reported a substantial time difference with and without `--dataset` on datasets with many subdatasets. The cause was that without `--dataset`, `os.curdir` was used as a query path, whereas with `--dataset`, no explicit query path was used. Each query path adds runtime from various safety checks. However, when we generate the query path ourselves, we can optimize for that. This change uses no query path if the root of the to-be-queried dataset is the current directory (`PWD`). This will yield the same results, but bypass all path-related processing. As a consequence, command performance is the same with or without a `dataset` argument given.
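A hedged sketch of the decision logic described above (not the actual datalad code; the helper name is invented):

```python
# Only generate an explicit query path when it adds information: if the
# queried dataset's root is the working directory, an explicit `os.curdir`
# query path only adds per-path safety checks.
import os
from pathlib import Path


def build_query_paths(ds_root, requested_paths):
    if requested_paths:
        # user-provided constraints are passed through unchanged
        return requested_paths
    if Path.cwd() == Path(ds_root):
        # dataset root == PWD: skip the query path entirely
        return None
    return [os.curdir]
```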
No difference conditional on
Still a substantial impact of the result rendering:
But there is nothing
Updating it for each result is the cause of the slow-down reported in datalad#4868 (comment). With this change we only update at max 2Hz. For the original scenario this results in a ~30% speed-up:

```
datalad subdatasets -d .  4.30s user 1.99s system 101% cpu 6.225 total
datalad subdatasets -d .  6.33s user 2.27s system 100% cpu 8.538 total
```
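The throttling itself is generic; a minimal sketch of capping updates at 2Hz (not datalad's actual progress-reporting code) could look like this:

```python
# Rate-limit expensive UI updates to at most `max_hz` per second.
import time


class ThrottledUpdater:
    def __init__(self, max_hz=2.0):
        self._min_interval = 1.0 / max_hz
        self._last = 0.0

    def update(self, render):
        # `render` is the callable doing the actual (expensive) update
        now = time.monotonic()
        if now - self._last >= self._min_interval:
            render()
            self._last = now
```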
Current benchmark runs seem to indicate no performance penalty for the small-scale dataset we are using in the tests: https://github.com/datalad/datalad/pull/4868/checks?check_run_id=1286049657
Remove result filter. `GitRepo.get_submodules_()` already strips non-dataset results.
FTR: I tried to further improve the situation by using the `contains` paths also as path constraints for the submodule query:

```diff
diff --git a/datalad/local/subdatasets.py b/datalad/local/subdatasets.py
index cc36a074f..5c0d63b9d 100644
--- a/datalad/local/subdatasets.py
+++ b/datalad/local/subdatasets.py
@@ -75,6 +75,14 @@ def _parse_git_submodules(ds_pathobj, repo, paths):
             else:
                 # we had path constraints, but none matched this dataset
                 return
+        else:
+            # filter out all the paths that point outside the repo. they would
+            # throw errors in GitRepo.get_submodules_(). such paths would
+            # make it here, because we mix top-level query path constraints with
+            # --contains specification in order to speed things up. If the latter
+            # are garbage, not reporting on them will be detected and impossible
+            # results are generated
+            paths = [p for p in paths if p.parts and p.parts[0] != os.pardir]
     # can we use the reported as such, or do we need to recode wrt to the
     # query context dataset?
     if ds_pathobj == repo.pathobj:
@@ -289,8 +297,12 @@ def _get_submodules(ds, paths, fulfilled, recursive, recursion_limit,
         [c] + list(c.parents)
         for c in (contains if contains is not None else [])
     ]
+    submodule_path_constraints = (paths or []) + contains if contains else paths
     # put in giant for-loop to be able to yield results before completion
-    for sm in _parse_git_submodules(ds.pathobj, repo, paths):
+    for sm in _parse_git_submodules(
+            ds.pathobj,
+            repo,
+            submodule_path_constraints):
         sm_path = sm['path']
         contains_hits = None
         if contains:
```
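For illustration, a small standalone example of the filter added in the `else` branch above; the sample paths are made up:

```python
# Relative paths pointing outside the repository start with '..' and are
# dropped before querying, mirroring the patch above. '.' has no parts and
# is dropped as well.
import os
from pathlib import PurePosixPath

paths = [
    PurePosixPath('sub-01'),
    PurePosixPath('../elsewhere/sub-02'),
    PurePosixPath('.'),
]
kept = [p for p in paths if p.parts and p.parts[0] != os.pardir]
print(kept)  # [PurePosixPath('sub-01')]
```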
Reading up to cbf79db, these changes look good to me aside from my comment about an inaccurate comment.
```diff
-            contains_hits = [
-                c for c in contains if sm['path'] == c or sm['path'] in c.parents
-            ]
+            contains_hits = [c[0] for c in expanded_contains if sm_path in c]
```
Okay, so as far as I can see, the two potential sources of speed-ups are that it 1) drops the repeated `'path'` lookup in the `sm` dict and 2) uses list's `__contains__` over `_PathParents.__contains__` (which is `Sequence.__contains__`).

I doubt no. 1 would be measurable in this context, but I think it's still good to do. So I'd guess the performance gains are coming from no. 2.
timeit runs, `.parents` versus list `__contains__`:

```sh
echo "--- .parents"
python -m timeit \
    -n 3 \
    -s 'from pathlib import Path' \
    -s 'import string' \
    'path = Path(*string.ascii_lowercase)' \
    'for _ in range(int(4e4)): "X" in path.parents'

echo "--- parents, no attribute access"
python -m timeit \
    -n 3 \
    -s 'from pathlib import Path' \
    -s 'import string' \
    -s 'path = Path(*string.ascii_lowercase)' \
    'parents = path.parents' \
    'for _ in range(int(4e4)): "X" in parents'

echo "--- parents, as list"
python -m timeit \
    -n 3 \
    -s 'from pathlib import Path' \
    -s 'import string' \
    -s 'path = Path(*string.ascii_lowercase)' \
    'parents = list(path.parents)' \
    'for _ in range(int(4e4)): "X" in parents'

# --- .parents
# 3 loops, best of 5: 1.31 sec per loop
# --- parents, no attribute access
# 3 loops, best of 5: 1.33 sec per loop
# --- parents, as list
# 3 loops, best of 5: 144 msec per loop
```
sm_path["path"] is a path object, and so are the items for each list in expanded_contains.
cbf79db (RF: Simplify submodule path recoding) placed the `yield` at the wrong level and didn't preserve the `props["path"]` access in the rewrite. This is responsible for the failures on the `TMPDIR=/var/tmp/sym\ link/` builds, e.g. <https://travis-ci.org/github/datalad/datalad/jobs/737733767>.
Thx for the fixes @kyleam !
Speeds up the rejection of non-containing subdatasets. For the UKB dataset with 42k subdatasets, this shaves off 1s of the previous total runtime of 2.5s.
Related to gh-4859
This also removes a backward compatibility kludge that we put in place over a year ago (`revision` instead of the present `gitshasum` property in the submodule records).

Edit: This also ups the plain reporting performance (same 42k subdatasets):
Before:

```
datalad subdatasets  9.49s user 3.41s system 57% cpu 22.284 total
```

After:

```
datalad subdatasets  9.81s user 2.49s system 74% cpu 16.399 total
```

and with a twist:

```
datalad subdatasets -d .  6.95s user 2.14s system 66% cpu 13.622 total
```
(see TODO)

TODO:

- `GitRepo.get_submodules_()` calls `GitRepo.get_content_info()`. This might be made faster by filtering the path constraints with the `contains` parameter before calling any of this stack. Right now `contains` is considered last and all info is requested without any path constraint, in the context of #4859 (Substantial overhead of `get` vs direct clone of a subdataset) -> #5063 (`GitRepo.get_submodules_()` implementation issues)
- `--dataset`
- `update` is broken by the second commit, not clear why yet (likely it is still using the `revision` property)