Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPT: create-sibling do not try to OPT by limiting paths to installed submodules #6528

Merged
merged 3 commits into from
Mar 14, 2022

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Mar 8, 2022

Prior approach did not work well together well with --since specification, where
subdatasets recursively would traverse entire tree for no need. And it was added in
8c673a4 RFing as an OPT to avoid expensive traversal
of subdatasets which are not installed. But for those there should be no extra cost
since we cannot descend into not installed subdatasets. But may be there is cost since
we will be analyzing entire trees (I guess) and not just subdatasets, which is
what matters for create-sibling -- so there could be more OPT I think

Fixes #6527

edit1: nope -- as is it would not be good enough since diff_dataset would return full diff, not only for subdataset states

that seems to be ok/OPT in case whenever there is `--since`, and we just need filter out for not clean datasets
(git)lena:~datalad/datalad-master[bf-rf-create-sibling]git
$> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, "HEAD^^^^^^", "HEAD", path=[sds["path"] for sds in ds.subdatasets(recursive=True, state="present", result_renderer="disabled")], constant_refs=True, annex=None, untracked="no", recursive=False, eval_file_type=False)]; print(Counter(r))'
Counter({'dataset': 47, None: 1})
python -c   3.05s user 1.14s system 108% cpu 3.850 total

$> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, "HEAD^^^^^^", "HEAD", constant_refs=True, annex=None, untracked="no", recursive=True, eval_file_type=False)]; print(Counter(r))'            
Counter({'file': 5253, 'symlink': 3775, 'dataset': 859, None: 1})
python -c   1.73s user 0.37s system 107% cpu 1.952 total
but if there is no --since -- without limiting for subdatasets paths, we are spending too much to do full diff
(git)lena:~datalad/datalad-master[bf-rf-create-sibling]git
$> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, None, "HEAD", path=[sds["path"] for sds in ds.subdatasets(recursive=True, state="present", result_renderer="disabled")], constant_refs=True, annex=None, untracked="no", recursive=False, eval_file_type=False)]; print(Counter(r))'       
Counter({'dataset': 79})
python -c   4.22s user 1.65s system 109% cpu 5.370 total

$> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, None, "HEAD", constant_refs=True, annex=None, untracked="no", recursive=True, eval_file_type=False)]; print(Counter(r))'
Counter({'symlink': 118092, 'file': 14634, 'dataset': 2158})
python -c   83.54s user 14.59s system 101% cpu 1:37.02 total

so it would need more to actually become an OPT ...

Changelog

💫 Enhancements and new features

…submodules

Prior approach did not  work well together well with --since specification, where
subdatasets recursively would traverse entire tree for no need. And it was added in
8c673a4 RFing as an OPT to avoid expensive traversal
of subdatasets which are not installed.  But for those there should be no extra cost
since we cannot descend into not installed subdatasets.  But may be there is cost since
we will be analyzing entire trees (I guess) and not just subdatasets, which is
what matters for create-sibling -- so there could be more OPT I think
@yarikoptic yarikoptic added the semver-performance Changes only improve performance, no impact on version label Mar 8, 2022
@yarikoptic
Copy link
Member Author

eh -- something odd with snapshot.debian.org causing fails on appveyor -- 503 error
[202](https://ci.appveyor.com/project/mih/datalad/builds/42832527/job/kfu2vjm0syg3avyk#L202)2-03-08T22:02:33+0000 [INFO    ] datalad_installer URL: http://snapshot.debian.org/archive/debian/20210906T[204](https://ci.appveyor.com/project/mih/datalad/builds/42832527/job/kfu2vjm0syg3avyk#L204)127Z/pool/main/g/git-annex/git-annex_8.20[210](https://ci.appveyor.com/project/mih/datalad/builds/42832527/job/kfu2vjm0syg3avyk#L210)903-1_amd64.deb
2022-03-08T22:02:33+0000 [INFO    ] datalad_installer Extra args: None
2022-03-08T22:02:33+0000 [INFO    ] datalad_installer Downloading http://snapshot.debian.org/archive/debian/20210906T204127Z/pool/main/g/git-annex/git-annex_8.20210903-1_amd64.deb
Traceback (most recent call last):
  File "/home/appveyor/venv3.7.12/bin/datalad-installer", line 8, in <module>
    sys.exit(main())
  File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1788, in main
    return manager.main(argv)
  File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 657, in main
    self.addcomponent(name=cr.name, **cr.kwargs)
  File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 690, in addcomponent
    component(self).provide(**kwargs)
  File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1032, in provide
    bins = self.get_installer(method).install(self.NAME, **kwargs)
  File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1118, in install
    bindir = self.install_package(package, **kwargs)
  File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1386, in install_package
    download_file(url, debpath)
  File "/home/appveyor/venv3.7.12/lib/python3.7/site-packages/datalad_installer.py", line 1722, in download_file
    with urlopen(req) as r:
  File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line [222](https://ci.appveyor.com/project/mih/datalad/builds/42832527/job/kfu2vjm0syg3avyk#L222), in urlopen
    return opener.open(url, data, timeout)
  File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/home/appveyor/.localpython3.7.12/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Backend fetch failed
Command exited with code 1
Running "on_finish" scripts
while [ -f ~/BLOCK ]; do sleep 5; done

@yarikoptic yarikoptic marked this pull request as ready for review March 9, 2022 13:30
datalad/core/local/diff.py Outdated Show resolved Hide resolved
datalad/core/local/diff.py Outdated Show resolved Hide resolved
datalad/core/local/diff.py Outdated Show resolved Hide resolved
datalad/support/gitrepo.py Outdated Show resolved Hide resolved
to make it possible to efficiently report  only diffs pertinent to datasets
hierarchy while also working efficiently with  "to" kwarg.  We should not
list all subdatasets recursively in "above" logic but rather instruct this
diff_dataset logic to act only on subdatasets

It seems to bring it back to the scale of "with subdatasets in logic above" but
not quite exactly matching it

	(git)lena:~datalad/datalad-master[bf-rf-create-sibling]git
	$> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, None, "HEAD", path=[sds["path"] for sds in ds.subdatasets(recursive=True, state="present", result_renderer="disabled")], constant_refs=True, annex=None, untracked="no", recursive=False, eval_file_type=False)]; print(Counter(r))'
	Counter({'dataset': 79})
	python -c   3.77s user 1.27s system 110% cpu 4.547 total

	$> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, None, "HEAD", constant_refs=True, annex=None, untracked="no", recursive=True, eval_file_type=False, datasets_only=True)]; print(Counter(r))'
	Counter({'dataset': 79})
	python -c   4.82s user 1.69s system 111% cpu 5.827 total

but I think it is ok since gains for other usecase (--since) are of larger
impact with this PR, actually boosts it even further:

	(git)lena:~datalad/datalad-master[bf-rf-create-sibling]git
	$> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, "HEAD^^^^^^", "HEAD", path=[sds["path"] for sds in ds.subdatasets(recursive=True, state="present", result_renderer="disabled")], constant_refs=True, annex=None, untracked="no", recursive=False, eval_file_type=False)]; print(Counter(r))'
	Counter({'dataset': 47, None: 1})

	$> time python -c 'from collections import Counter;from datalad.core.local.diff import diff_dataset, Dataset; ds=Dataset("/home/yoh/datalad"); r=[_.get("type") for _ in diff_dataset(ds, "HEAD^^^^^^", "HEAD", constant_refs=True, annex=None, untracked="no", recursive=True, eval_file_type=False, datasets_only=True)]; print(Counter(r))'
	Counter({'dataset': 47, None: 1})
	python -c   1.46s user 0.40s system 109% cpu 1.697 total
@yarikoptic yarikoptic force-pushed the bf-rf-create-sibling branch from 85c602b to 11aa8d2 Compare March 10, 2022 17:16
@codeclimate
Copy link

codeclimate bot commented Mar 10, 2022

Code Climate has analyzed commit 11aa8d2 and detected 1 issue on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 1

View more on Code Climate.

@yarikoptic
Copy link
Member Author

This PR is ready for your review @datalad/developers . In case if there is no objections, I will merge it next monday

@yarikoptic yarikoptic requested a review from mih March 11, 2022 14:46
@yarikoptic yarikoptic merged commit b231748 into datalad:master Mar 14, 2022
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Mar 30, 2022
…on committed

Initial motivation for this change was fixing

   datalad#6596

where while specifying --since, create-sibling  did not detect needing to
create a sibling for a subdataset.  A stale (missed during datalad#6528 OPT/RF)
comment said that value of constant_refs=True should not matter since it was
not recursive originally. But now here we do recurse and it does matter since
we migth completely skip subtree where such refs are not yet known.  Changing
to False resulted in the test failing which made me realize that sibling was
created even for not yet committed changes to submodules.  Since for push it
would really not work to push something which is not yet saved, I think the
change in behavior for create-sibling is also desired.  And it fixes datalad#6596.
So I have "fixed" and expanded test with the problematic case I had
discovered.
yarikoptic added a commit to yarikoptic/datalad that referenced this pull request Mar 30, 2022
…on committed

Initial motivation for this change was fixing

   datalad#6596

where while specifying --since, create-sibling  did not detect needing to
create a sibling for a subdataset.  A stale (missed during datalad#6528 OPT/RF)
comment said that value of constant_refs=True should not matter since it was
not recursive originally. But now here we do recurse and it does matter since
we migth completely skip subtree where such refs are not yet known.  Changing
to False resulted in the test failing which made me realize that sibling was
created even for not yet committed changes to submodules.  Since for push it
would really not work to push something which is not yet saved, I think the
change in behavior for create-sibling is also desired.  And it fixes datalad#6596.
So I have "fixed" and expanded test with the problematic case I had
discovered.
@yarikoptic yarikoptic deleted the bf-rf-create-sibling branch April 5, 2022 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-performance Changes only improve performance, no impact on version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

create-sibling --since: still traverses too much (full?) tree for no good reason
1 participant