Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF+OPT: allow to save change where file becomes a folder, and avoid multiple traversals in _save #6581

Merged
merged 7 commits into from
Apr 7, 2022

Conversation

yarikoptic
Copy link
Member

reincarnated #6567 against master

The primary goal was to address #6558. Fixes #6558 (attn @jwodder, would be nice if you could check if fixes on the dandisets tests use case, didn't try)

But to arrive there without incurring notable runtime performance I decided to RF code in _save a bit since I noticed it is doing traversals of the same dict (which was also OrderedDict, no longer needed) over and over again to filter out for files with desired status. To avoid multiple traversals, I have first groupped into the groups based on status, thus in effect removing many multiple traversals of the same possibly huge dict . I think it could make any save faster now. Also removed the need for a single OrderedDict, but filed #6566

Changelog

馃悰 Bug Fixes

馃挮 Enhancements and new features

  • save code might operate faster on heavy file trees

Could be of great benefit whenever deletions are only few
among many other changes.  This way we would not need to go
through all N twice, and only once.  If everything is removals,
then we would actually pay the penalty, but assuming that we saving
more often additions and changes and not removals -- we would
gain improvement
For that we group all statuses by their status first, and even
create a combined group "modified_or_untracked" since we query
for that later as well.

Also remove another loop to just filter out  "clean" ones out and
avoid using OrderedDict which no longer needed since python 3.6
(I do not think we use any OrderedDict specific functionality
and dict should be faster than OrderedDict according to
https://realpython.com/python-ordereddict/
accompanied with a test to demonstrate save defect on renaming file into directory.

Prior Commits optimizing the body/data structures of _save should have helped
to also minimize potential run time for this added check/manipulation of status.

Underlying issue is actually within git, see
datalad#6558
and correspondence on git mailing list
https://lore.kernel.org/git/xmqqv8wdk308.fsf@gitster.g/T/#madd83451d39754c044691910f3897659ed291abf
Then we were providing such status  even with "clean" entires to _save_post
leading to e.g. asking to commit empty directories etc which caused tests
failures.

Instead of "filtering out" status dict, I decided to just stop using it
completely and RF _save_post to not be provided with status dict, but with
what it actually needs -- list of files.  And for that list, we just now
create it from groupped dicts, instead of original singular status
* origin/master: (617 commits)
  Channel progress logging by AnnexSpecialRemoteProgressBar to git-annex
  BF: compat with git ~ 2.35.1 where error msg changed
  use WINDIR variable to determine windows dir
  reduce code duplication
  clean up code, comments, and user messages
  Update datalad/support/sshconnector.py
  Update datalad/interface/common_cfg.py
  use 'datalad.ssh.executable' config in sshconnector
  Simplify logger call
  Prevent internal download functionality from logging at INFO level
  Deduplicate top-level CLI exception handler output
  Make datalad.cli.tests.test_main.run_main() also capture UI output
  exclude a Popen call from static security checks
  add windows ssh client var to common config
  change environment variable to config variable
  enforce windows system ssh usage for git on windows
  TST: Reuse test coverage helpers on non-Win Appveyor runs to cover subprocesses
  TST: Enable codecov coverage for subprocesses
  BF(TST): Skip 3 tests if root --  PermissionError not raised
  Stop forcefully disabling keyboard-interactive SSH auth
  ...

Conflict was minor, resolved by adopting master's version changes
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "python3 -m isort -m3 --fgw 2 --tc '{outputs}'",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [
  "datalad/support/gitrepo.py"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
@yarikoptic
Copy link
Member Author

I dunno -- CI just refuses to kick in for this and #6567 after I did merge of master... even after I pushed one extra commit -- what is going on?

@yarikoptic yarikoptic added performance Improve performance of an existing feature semver-patch Increment the patch version when merged labels Mar 22, 2022
@yarikoptic
Copy link
Member Author

appveyor fails are only macs so I would assume they are known issues. @mih - this one against master (instead of maint) is ready for the review

@yarikoptic
Copy link
Member Author

ping @mih and all @datalad/developers -- please review and ideally -- merge.

Copy link
Member

@bpoldrack bpoldrack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left two comments. I'd feel uncomfortable touching this code ever again, if the different representations of status are running out of sync.

Otherwise: Looks as if it should work ;-)

Minor request: Can we have import cleanups in separate commits?

datalad/support/gitrepo.py Show resolved Hide resolved
datalad/support/gitrepo.py Show resolved Hide resolved
@yarikoptic yarikoptic added this to the 0.16.0 milestone Apr 5, 2022
@codecov
Copy link

codecov bot commented Apr 6, 2022

Codecov Report

Merging #6581 (adcc96f) into master (99eeb57) will increase coverage by 0.10%.
The diff coverage is 97.87%.

@@            Coverage Diff             @@
##           master    #6581      +/-   ##
==========================================
+ Coverage   91.06%   91.16%   +0.10%     
==========================================
  Files         349      353       +4     
  Lines       44218    44429     +211     
==========================================
+ Hits        40265    40505     +240     
+ Misses       3953     3924      -29     
Impacted Files Coverage 螖
datalad/support/annexrepo.py 91.19% <酶> (-0.13%) 猬囷笍
datalad/support/gitrepo.py 90.97% <97.50%> (+0.13%) 猬嗭笍
datalad/core/local/tests/test_save.py 97.97% <100.00%> (-0.14%) 猬囷笍
datalad/core/local/tests/test_status.py 97.79% <0.00%> (-0.68%) 猬囷笍
datalad/__init__.py 85.63% <0.00%> (-0.41%) 猬囷笍
datalad/cmd.py 94.09% <0.00%> (-0.40%) 猬囷笍
datalad/coreapi.py 100.00% <0.00%> (酶)
datalad/interface/test.py 63.63% <0.00%> (酶)
... and 26 more

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 99eeb57...adcc96f. Read the comment docs.

could not easily fix up previous commit since this branch had a merge
with a conflict -- didn't want to mess with it again
@yarikoptic
Copy link
Member Author

Left two comments. I'd feel uncomfortable touching this code ever again, if the different representations of status are running out of sync.

there is no status -- it get's deleted in the scope of that function exactly for that reason

Minor request: Can we have import cleanups in separate commits?

isn't it in a separate one? 8b31be8

@codeclimate
Copy link

codeclimate bot commented Apr 7, 2022

Code Climate has analyzed commit adcc96f and detected 0 issues on this pull request.

View more on Code Climate.

@bpoldrack bpoldrack merged commit 67ab711 into datalad:master Apr 7, 2022
@yarikoptic yarikoptic deleted the bf-sve branch July 11, 2022 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improve performance of an existing feature semver-patch Increment the patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

save would make git crash replacing a file with directory with content
2 participants