Skip to content

Speed up saving of removals and improve their reporting #6784

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 19, 2022

Conversation

mih
Copy link
Member

@mih mih commented Jul 1, 2022

The previous implementation in GitRepo._save() used git-rm to
handle the staging of removed files. However, this porcelain command
also handles the actual removal of files from the working tree.
This is unnecessary here, because only items are processed that
have already found to be removed from the working tree. Therefore,
the more lightweight update-index is used now, and provides measurable
speed-ups. I found between 5-10% faster datalad-save runtimes when
removing 2k file entries from a dataset on a lightning-fast NVME drive.
Because GitRepo methods using the normalize_paths decorator are
avoided now, the speed-up should be even more substantial on slower
file systems.

This change also improves the reporting of datalad-save when saving
the removal of a subdataset. Previously, it was reported as

% datalad save -d demo
delete(ok): sub (file)

with the confusing file type property. Now it is properly reported as
a dataset:

% datalad save -d demo
delete(ok): sub (dataset)

Here is some code to help play with this:

datalad create demo
for i in $(seq -w 2000); do echo $i > demo/${i}.txt; done
datalad save -d demo
rm demo/*.txt
multitime -n1 datalad save -d demo

# cleanup for the next run
datalad drop -d demo --reckless kill --what all -r

Changelog

💫 Enhancements and new features

  • Saving removed dataset content was sped-up, and reporting of types of removed content now accurately states dataset for added and removed subdatasets, instead of file. Moreover, saving previously staged deletions is now also reported.

@mih mih added the semver-patch Increment the patch version when merged label Jul 1, 2022
@mih
Copy link
Member Author

mih commented Jul 1, 2022

Test failures are real. Needs an investigation.

@mih mih force-pushed the rf-save branch 2 times, most recently from 6e723a5 to 708639e Compare July 1, 2022 21:03
@mih
Copy link
Member Author

mih commented Jul 2, 2022

Appveyor failures are known and unrelated (new git-annex version breaking macos tests).

Travis failures are different, but also unrelated.

@mih
Copy link
Member Author

mih commented Jul 3, 2022

The changes in this PR are now a precondition to an upcoming PR addressing #6791 -- depending on the timing of review, I will close this PR and integrate it into a substantially larger update that introduces typechange detection.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks LGTM to me and test is added and hopefully no new failures (didn't go through CI failures). But was unlucky to get conflicting change which needs to be resolved and see the fresh CI run (some fails should be gone by now)

@mih mih mentioned this pull request Jul 18, 2022
mih added 6 commits July 19, 2022 07:51
The previous implementation in `GitRepo._save()` used `git-rm` to
handle the staging of removed files. However, this porcelain command
also handles the actual removal of files from the working tree.
This is unnecessary here, because only items are processed that
have already found to be removed from the working tree. Therefore,
the more lightweight `update-index` is used now, and provides measurable
speed-ups. I found between 5-10% faster `datalad-save` runtimes when
removing 2k file entries from a dataset on a lightning-fast NVME drive.
Because `GitRepo` methods using the `normalize_paths` decorator are
avoided not, the speed-up should be even more substantial on slower
file systems.

This change also improves the reporting of `datalad-save` when saving
the removal of a subdataset. Previously, it was reported as

```
% datalad save -d demo
delete(ok): sub (file)
```

with the confusing `file` type property. Now it is properly reported as
a dataset:

```
% datalad save -d demo
delete(ok): sub (dataset)
```

Moreover, `save` now also reports saving the deletion of files, when
the deletion was already staged.
In particular ensure that the `.gitmodule` updates are communicated.
This seems to have only been covered by a test in -deprecated otherwise.
Safe is taking care of it anyways and already. We can simply discontinue
specific action taken in `remove()`.
@codeclimate
Copy link

codeclimate bot commented Jul 19, 2022

Code Climate has analyzed commit 5bec143 and detected 0 issues on this pull request.

View more on Code Climate.

@mih
Copy link
Member Author

mih commented Jul 19, 2022

I rebased it on maint, but given that it is not a straight-up bugfix, the merge target is still the next feature release. This PR is a precondition for #2160

Once the related PRs #6797, #6793 and #6784 are "done" (whatever the end result would look like), I will extract the cumulative patch and release it with datalad-next to avoid the 6-months delay and progressing on this topic.

@codecov
Copy link

codecov bot commented Jul 19, 2022

Codecov Report

Merging #6784 (5bec143) into master (e377e54) will increase coverage by 0.95%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #6784      +/-   ##
==========================================
+ Coverage   90.26%   91.21%   +0.95%     
==========================================
  Files         354      354              
  Lines       46084    46108      +24     
==========================================
+ Hits        41598    42058     +460     
+ Misses       4486     4050     -436     
Impacted Files Coverage Δ
datalad/local/remove.py 96.92% <100.00%> (-0.10%) ⬇️
datalad/support/gitrepo.py 92.89% <100.00%> (+0.01%) ⬆️
datalad/support/tests/test_repo_save.py 100.00% <100.00%> (ø)
datalad/tests/utils_pytest.py 89.74% <100.00%> (+0.01%) ⬆️
datalad/metadata/extractors/image.py 96.77% <0.00%> (-3.23%) ⬇️
datalad/metadata/extractors/exif.py 96.87% <0.00%> (-3.13%) ⬇️
datalad/metadata/extractors/audio.py 97.14% <0.00%> (-2.86%) ⬇️
datalad/support/json_py.py 98.86% <0.00%> (-1.14%) ⬇️
datalad/metadata/metadata.py 90.27% <0.00%> (-0.55%) ⬇️
datalad/tests/utils.py 66.69% <0.00%> (+9.72%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 22b2688...5bec143. Read the comment docs.

@mih
Copy link
Member Author

mih commented Jul 19, 2022

Given the approval I will merge this now.

The failing metalad tests are observable in other PRs too, hence unlikely to be related to these changes.

@mih mih merged commit afc2ecf into datalad:master Jul 19, 2022
@mih mih deleted the rf-save branch July 19, 2022 11:25
mih added a commit to datalad/datalad-next that referenced this pull request Jul 24, 2022
mih added a commit to datalad/datalad-next that referenced this pull request Jul 24, 2022
@yarikoptic-gitmate
Copy link
Collaborator

PR released in 0.18.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-patch Increment the patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants