Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use --include=* or --anything instead of --copies 0 to speed up get_content_annexinfo #7230

Merged
merged 3 commits into from
Dec 31, 2022

Conversation

yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Dec 15, 2022

Apparently --copies 0 could result in up to 10 (or more) penalty of running find or findref.

TODO

  • Future proof it -- use --anything which was added in 10.20221212-17-g0b2dd374d on Dec 20 (so we can compare against 10.20221220 version) -- but yet to wait for datalad/git-annex get that build

@yarikoptic yarikoptic added the semver-performance Changes only improve performance, no impact on version label Dec 20, 2022
@adswa
Copy link
Member

adswa commented Dec 21, 2022

the mac os failure is due to an outdated mac build image

…t_annexinfo

Apparently `--copies 0` could result in up to 10 (or more) penalty of running find or findref.

--include=* was recommended by Joey in
https://git-annex.branchable.com/todo/add_--all___40__or_alike__41___to_find_and_findref/

Closes datalad#7038
…support

Since there is still no release interim the dates, I think such comparison

- should be safe

- would allow us to immediately take advantage of  this OPT  while testing
  datalad/git-annex against datalad master.

Here are some timings of all 3 possible options

	❯ pwd
	/home/yoh/datalad/dandi/dandisets/000026
	❯ time git annex find --copies 0 | wc -l
	20575
	git annex find --copies 0  12.67s user 1.17s system 120% cpu 11.513 total
	wc -l  0.01s user 0.10s system 0% cpu 11.513 total
	❯ time git annex find --include='*' | wc -l
	20575
	git annex find --include='*'  1.18s user 0.14s system 134% cpu 0.984 total
	wc -l  0.01s user 0.02s system 3% cpu 0.984 total
	❯ time git annex find --anything | wc -l
	20575
	git annex find --anything  0.71s user 0.18s system 157% cpu 0.567 total
	wc -l  0.02s user 0.01s system 5% cpu 0.566 total

So --anything leads to almost twice faster performance than --include=*, so
worth it.
@yarikoptic
Copy link
Member Author

the mac os failure is due to an outdated mac build image

great -- thanks! I rebased.

The main point is that "it works" so we should proceed with this PR. I also added now support for new --anything option. Its timing even better:

❯ pwd
/home/yoh/datalad/dandi/dandisets/000026
❯ time git annex find --copies 0 | wc -l
20575
git annex find --copies 0  12.67s user 1.17s system 120% cpu 11.513 total
wc -l  0.01s user 0.10s system 0% cpu 11.513 total
❯ time git annex find --include='*' | wc -l
20575
git annex find --include='*'  1.18s user 0.14s system 134% cpu 0.984 total
wc -l  0.01s user 0.02s system 3% cpu 0.984 total
❯ time git annex find --anything | wc -l
20575
git annex find --anything  0.71s user 0.18s system 157% cpu 0.567 total
wc -l  0.02s user 0.01s system 5% cpu 0.566 total

I think we are ready, taking out of the draft -- your feedback is very welcome! I hope it makes cut for 0.18.0.

@yarikoptic yarikoptic added this to the 0.18.0 milestone Dec 22, 2022
@yarikoptic yarikoptic marked this pull request as ready for review December 22, 2022 16:42
@codecov
Copy link

codecov bot commented Dec 22, 2022

Codecov Report

Base: 88.73% // Head: 88.73% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (ef96e8a) compared to base (39eaebf).
Patch coverage: 75.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7230      +/-   ##
==========================================
- Coverage   88.73%   88.73%   -0.01%     
==========================================
  Files         325      325              
  Lines       44124    44184      +60     
  Branches     5867     5880      +13     
==========================================
+ Hits        39154    39205      +51     
- Misses       4955     4964       +9     
  Partials       15       15              
Impacted Files Coverage Δ
datalad/support/annexrepo.py 90.35% <75.00%> (-0.06%) ⬇️
datalad/distribution/utils.py 94.00% <0.00%> (-6.00%) ⬇️
datalad/interface/utils.py 95.28% <0.00%> (-1.05%) ⬇️
datalad/cli/parser.py 86.03% <0.00%> (-0.46%) ⬇️
datalad/support/gitrepo.py 92.57% <0.00%> (-0.21%) ⬇️
datalad/support/tests/test_gitrepo.py 99.77% <0.00%> (+<0.01%) ⬆️
datalad/_version.py 45.68% <0.00%> (+0.27%) ⬆️
datalad/support/tests/test_path.py 98.70% <0.00%> (+0.45%) ⬆️
datalad/support/path.py 96.58% <0.00%> (+1.17%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@yarikoptic
Copy link
Member Author

hm, crippled one freaked out:

=========================== short test summary info ============================
FAILED ../tests/test_rerun_merges.py::test_rerun_fastforwardable_mutator - datalad.runner.exception.CommandError: CommandError: 'git -c diff.ignoreSubmodules=none merge -m 'Merge side
' --no-ff --allow-unrelated-histories 7516b5a47d8cba328857b09bb9d3fd6dc901a4e5' failed with exitcode 128 [err: 'error: Your local changes to the following files would be overwritten by merge:
	foo
Please commit your changes or stash them before you merge.
Aborting']
= 1 failed, 329 passed, 63 skipped, 1 deselected, 1 xpassed, 11 warnings in 1420.99s (0:23:40) =
Error: Process completed with exit code 1.

which has no "track" in our issue tracker besides it used to fail and was disabled on windows: #5126 . rerunning that job now

Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused. The PR title and the diff don't quite fit - there's no --largerthan.
Changed approach or incomplete PR, @yarikoptic ?

Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than that confusion, looks good to me!

So, if you feel no changelog required (which is fine with me), I'm ready to merge this.

@yarikoptic yarikoptic changed the title OPT: use --largerthan -1 instead of --copies 0 to speed up get_content_annexinfo OPT: use --include=* or --anything instead of --copies 0 to speed up get_content_annexinfo Dec 23, 2022
@yarikoptic yarikoptic changed the title OPT: use --include=* or --anything instead of --copies 0 to speed up get_content_annexinfo Use --include=* or --anything instead of --copies 0 to speed up get_content_annexinfo Dec 23, 2022
@yarikoptic yarikoptic added the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Dec 23, 2022
@github-actions github-actions bot removed the CHANGELOG-missing When a PR's description does not contain a changelog item, yet. label Dec 23, 2022
@yarikoptic
Copy link
Member Author

Thank you @bpoldrack . Indeed changed approach and need changelog.

@codeclimate
Copy link

codeclimate bot commented Dec 23, 2022

Code Climate has analyzed commit ef96e8a and detected 0 issues on this pull request.

View more on Code Climate.

@bpoldrack bpoldrack merged commit 357213a into datalad:master Dec 31, 2022
@yarikoptic-gitmate
Copy link
Collaborator

PR released in 0.18.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-performance Changes only improve performance, no impact on version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants