Speed up saving of removals and improve their reporting #6784
Test failures are real and need investigation.
Force-pushed from 6e723a5 to 708639e.
Appveyor failures are known and unrelated (a new git-annex version is breaking the macOS tests). Travis failures are different, but also unrelated.
The changes in this PR are now a precondition to an upcoming PR addressing #6791 -- depending on the timing of review, I will close this PR and integrate it into a substantially larger update that introduces typechange detection.
Looks good to me: a test is added and hopefully there are no new failures (I didn't go through the CI failures). Unfortunately, a conflicting change landed that needs to be resolved first; then we can see a fresh CI run (some failures should be gone by now).
The previous implementation in `GitRepo._save()` used `git-rm` to handle the staging of removed files. However, this porcelain command also handles the actual removal of files from the working tree. This is unnecessary here, because only items are processed that have already been found to be removed from the working tree. Therefore, the more lightweight `update-index` is used now, and provides measurable speed-ups. I found between 5-10% faster `datalad-save` runtimes when removing 2k file entries from a dataset on a lightning-fast NVMe drive. Because `GitRepo` methods using the `normalize_paths` decorator are avoided now, the speed-up should be even more substantial on slower file systems.

This change also improves the reporting of `datalad-save` when saving the removal of a subdataset. Previously, it was reported as

```
% datalad save -d demo
delete(ok): sub (file)
```

with the confusing `file` type property. Now it is properly reported as a dataset:

```
% datalad save -d demo
delete(ok): sub (dataset)
```

Moreover, `save` now also reports saving the deletion of files, when the deletion was already staged.
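To make the `git-rm` versus `update-index` point concrete, here is a minimal sketch of staging already-deleted paths with a single plumbing call. It is not the code from this PR: the helper names `build_update_index_removal` and `stage_removals` are hypothetical, and the choice of the `--remove -z --stdin` options is an assumption about one reasonable way to drive `git update-index`.

```python
import os
import subprocess
from typing import Iterable, List, Tuple


def build_update_index_removal(paths: Iterable[str]) -> Tuple[List[str], bytes]:
    """Return the git argv and the NUL-separated stdin payload.

    Hypothetical helper for illustration; not part of DataLad.
    """
    # --remove: drop index entries whose files are missing from the worktree
    # -z --stdin: read NUL-terminated paths from stdin (safe for odd filenames)
    argv = ["git", "update-index", "--remove", "-z", "--stdin"]
    payload = b"".join(os.fsencode(p) + b"\0" for p in paths)
    return argv, payload


def stage_removals(repo_path: str, paths: Iterable[str]) -> None:
    """Stage the removal of paths that are already gone from the worktree."""
    argv, payload = build_update_index_removal(paths)
    if not payload:
        return  # nothing to stage, so skip spawning git entirely
    # One process for all paths; the working tree itself is never touched.
    subprocess.run(argv, cwd=repo_path, input=payload, check=True)
```

Calling `stage_removals("/path/to/repo", ["a.txt", "dir/b.txt"])` would then update only the index, which is all that is needed when the files have already vanished from disk.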
In particular, ensure that the `.gitmodules` updates are communicated. This seems to have only been covered by a test in -deprecated otherwise.
`save` is already taking care of it anyway. We can simply discontinue the specific action taken in `remove()`.
Code Climate has analyzed commit 5bec143 and detected 0 issues on this pull request.
I rebased it on Once the related PRs #6797, #6793 and #6784 are "done" (whatever the end result would look like), I will extract the cumulative patch and release it with datalad-next to avoid the 6-month delay and keep progressing on this topic.
Codecov Report
@@            Coverage Diff             @@
##           master    #6784      +/-   ##
==========================================
+ Coverage   90.26%   91.21%   +0.95%
==========================================
  Files         354      354
  Lines       46084    46108      +24
==========================================
+ Hits        41598    42058     +460
+ Misses       4486     4050     -436
Given the approval I will merge this now. The failing metalad tests are observable in other PRs too, hence unlikely to be related to these changes.
PR released in
The previous implementation in `GitRepo._save()` used `git-rm` to handle the staging of removed files. However, this porcelain command also handles the actual removal of files from the working tree. This is unnecessary here, because only items are processed that have already been found to be removed from the working tree. Therefore, the more lightweight `update-index` is used now, and provides measurable speed-ups. I found between 5-10% faster `datalad-save` runtimes when removing 2k file entries from a dataset on a lightning-fast NVMe drive. Because `GitRepo` methods using the `normalize_paths` decorator are avoided now, the speed-up should be even more substantial on slower file systems.

This change also improves the reporting of `datalad-save` when saving the removal of a subdataset. Previously, it was reported as `delete(ok): sub (file)`, with the confusing `file` type property. Now it is properly reported as a dataset: `delete(ok): sub (dataset)`.
Here is some code to help play with this:
Changelog
💫 Enhancements and new features
Reporting now uses the type `dataset` for added and removed subdatasets, instead of `file`. Moreover, saving previously staged deletions is now also reported.
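The reporting change can be illustrated with a small sketch that derives the reported type from an index entry's mode. This is an assumption about how such a distinction could be made, not DataLad's actual logic: git records a subdataset (submodule) as a gitlink with mode `160000`, and the helpers `entry_type` and `format_result` are hypothetical names that only mimic the output format quoted in this PR.

```python
GITLINK_MODE = 0o160000  # index mode git uses for a submodule (gitlink) entry
SYMLINK_MODE = 0o120000  # index mode git uses for a symbolic link


def entry_type(mode: int) -> str:
    """Map a git index mode to a reported type (hypothetical helper)."""
    if mode == GITLINK_MODE:
        return "dataset"
    if mode == SYMLINK_MODE:
        return "symlink"
    return "file"


def format_result(action: str, status: str, path: str, mode: int) -> str:
    """Render a one-line result in the style shown in the PR description."""
    return f"{action}({status}): {path} ({entry_type(mode)})"


# A removed subdataset versus a removed regular file
print(format_result("delete", "ok", "sub", 0o160000))
print(format_result("delete", "ok", "README.md", 0o100644))
```

Keying the type on the recorded index mode means the type can still be determined after the subdataset directory has disappeared from the working tree.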