Proper handling of nested datasets in save/push/clone cycle #5241
This ensures that superdatasets saved on crippled FS do not reference commits in managed branches.
For additional improvement TODOs see the diff. Importantly, this consolidation makes it obsolete to keep the result record consistent with the one yielded by `add()` (which was used on submodule update) -- an unfortunate setup that was criticized long ago in datalad#3398 (comment). Apart from the need to not break the promise of result invariance over time, we are now free to use sensible result records.
In order to do that reliably, postpone the 'modified' judgement in diffstatus() to a point where we already have a submodule instance to inspect.
...and before an `annex init`. Previously this was done in the context of `get` and the installation of submodules. This RF enables the proper adjusting of freshly installed subdatasets. With the previous approach, we would reset an adjusted branch to the recorded state and essentially break the repository setup until the first `save` rectified it. This is more or less just shifting code from `get` to `clone`, plus a bit of renaming and slightly adjusted error handling. Fixes dataladgh-5257
So it seems when a subdataset is already on an adjusted branch after clone, we must check out the corresponding branch first, then reset to the target commit, and then proceed to …
One of the benchmarks reports a slowdown of 22%. Slower but correct is preferable to fast but wrong, but maybe there is an angle to keep it fast in cases where there is no adjusted branch. So far the code intentionally avoids, or at least minimizes, such tests. In particular, it makes no assumption that a superds in adjusted mode makes it more likely that a subds is also adjusted (or should be). FS capability issues are only one of the reasons why one would have (some) datasets in adjusted mode.
A few commits back switched to doing the checkout right after cloning. Unfortunately doing that can put us into a state that causes `git annex init` to fail, a problem also encountered in f4ab9f0 (BF: clone: Avoid adjusted branches when moving from unborn HEAD, 2020-07-31). Instead call postclone_checkout_commit() after `git annex init`. At that point, the repo may be on an adjusted branch, so update the logic in postclone_checkout_commit() to consider the corresponding branch.
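The "corresponding branch" logic mentioned above relies on git-annex's naming scheme for adjusted branches. A minimal sketch of that mapping, with a hypothetical helper name (this is not datalad's actual code):

```python
import re

def corresponding_branch(branch):
    # git-annex names adjusted branches 'adjusted/<branch>(<mode>)',
    # e.g. 'adjusted/master(unlocked)'. The "corresponding branch" is
    # the <branch> part; non-adjusted names pass through unchanged.
    m = re.fullmatch(r'adjusted/(.+)\(([^)]+)\)', branch)
    return m.group(1) if m else branch
```

With this, `corresponding_branch('adjusted/master(unlocked)')` yields `master`, which is the branch to check out before resetting to the commit recorded in the superdataset.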
I've pushed an update that moves the … Please feel free to drop those changes if they don't make sense. My brain was getting pretty turned around while working on this.
Yes, I agree -- valid, and I can replicate. The reason (at least for the …) …
To save any duplicated effort: I've debugged this and have a minimal fix. I'm thinking through a wider change/fix, but will push something soonish.
A couple of upcoming tests in test_save will need the same logic.
_save_add_submodules() does adjusted-branch-aware staging of submodules (via a dedicated update-index call). However, save_() still includes staged submodules (except newly registered ones) in the list of items passed to _save_add(). In the case of AnnexRepo._save_add(), this is unnecessary but harmless because `git annex add` ignores submodules. For GitRepo._save_add(), on the other hand, the call to `git add` overrides the work of _save_add_submodules(), potentially staging the submodule's adjusted branch commit that _save_add_submodules() avoided adding to the index. Filter all submodules, not just newly registered ones, from the items passed to _save_add().
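The filtering described above can be modeled as a simple partition of status records. A sketch with hypothetical names (the real `save_()` works on richer records), routing all submodules to the adjusted-branch-aware path and everything else to plain `git add`/`git annex add`:

```python
def partition_for_staging(status):
    # Route ALL submodules -- not just newly registered ones -- to the
    # adjusted-branch-aware _save_add_submodules() path; only
    # non-submodule items go to _save_add() (i.e. plain `git add`).
    submodules = {p: r for p, r in status.items()
                  if r.get('type') == 'dataset'}
    regular = {p: r for p, r in status.items()
               if r.get('type') != 'dataset'}
    return submodules, regular
```

The point of the commit is precisely that the second set must contain no submodules at all, so `git add` can never override the dedicated `update-index` staging.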
If there are existing staged changes, save_() doesn't stage submodule changes with _save_add_submodules(). This is wrong because all submodules need to go through the adjusted-branch-aware handling of _save_add_submodules(). But it's not just a matter of removing the guard (which this commit does). The handling is broken because partial commits are done with `git commit ... -- FILES`. This form updates the index entries for FILES from the working tree. This undoes any special adjusted branch handling that _save_add_submodules() did. Using a temporary index, like the old AnnexRepo.commit() used to do (removed in bfd9aeb), is likely the only solution to this. Given datalad's "no index" philosophy, it's probably not worth it to address this, but at least add a test case that demonstrates the issue.
This series taught _save_add_submodules() to record the corresponding branch's id in the superdataset when the submodule is on an adjusted branch. `datalad status` is aware of this behavior and ignores submodule ID changes that are due to a submodule having an adjusted branch checked out. This new behavior means that the .dirty optimization from 9d29d13 (OPT: gitrepo.dirty: Use 'git status --porcelain', 2019-05-2) returns the "wrong" answer for an otherwise clean submodule that is on an adjusted branch. Several non-test spots in the code base use .dirty, so it's important that it doesn't conflict with status(). To resolve this discrepancy, confirm a dirty result reported by git with the `datalad status` machinery. When the working tree is clean in git's eyes, this retains the original speedup, but unfortunately this may still be enough of a hit to make dataladgh-3342 an issue again.
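The two-stage check described here can be modeled minimally. This is a toy model with a hypothetical API, not datalad's implementation: trust a *clean* verdict from the fast `git status --porcelain` path, but confirm a *dirty* verdict with the slower, adjusted-branch-aware status() machinery:

```python
class DirtyCheck:
    def __init__(self, git_porcelain_dirty, status_dirty):
        # stand-ins for `git status --porcelain` and for the
        # `datalad status` machinery, respectively
        self.git_porcelain_dirty = git_porcelain_dirty
        self.status_dirty = status_dirty

    @property
    def dirty(self):
        if not self.git_porcelain_dirty:
            # clean per git: trust it, keeping the 9d29d13 speedup
            return False
        # dirty per git: may be only an adjusted-branch commit in a
        # submodule, so confirm with the status() machinery
        return self.status_dirty
```

The fast path is only lost when git reports changes, which is exactly the case where git's answer can no longer be trusted on its own.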
test_base.test_aggregation() does a blanket "aggregate_metadata returned six results" check, but two commits back made GitRepo.save_() go through _save_add_submodules() even when there are staged changes, leading to more results than this check expects. The exact number of results here is unimportant, and checking it is brittle. Loosen the check to assert that there is a successful save result, which seems sufficient given that there's an assert_repo_status() call immediately downstream.
It looks like the test added in the initial commit of this series (…): https://ci.appveyor.com/project/mih/datalad/builds/36887798/job/ifubrpg5f2u623a8 It passes for me locally under …
I think the metalad failure (and core's test_aggregation result failure that I "addressed" with the last commit) point to an issue with … For example, before: … after: …
When save() is processing subdatasets, it overrides their status to type=dataset,status=untracked, except in the situation covered by effa32c (BF: save: Fix handling of explicit paths for nested new subdatasets, 2020-10-16). This override is in effect for clean submodules. Before the recent change in 20b837d (RF: Use the exact same code path for adding and updating a submodule), this resulted in the unnecessary but quiet action of `git annex add` or `git add` being called with the submodule. After that commit, recursively saving a dataset shows an add record for each subdataset, even if it is clean. Stop save() from adjusting the status of clean subdatasets so that _save_add_submodules() is not triggered.
The change in effa32c (BF: save: Fix handling of explicit paths for nested new subdatasets, 2020-10-16) prevented save() from overwriting some of the submodule status records it passed to GitRepo.save_() in order to go down the `git submodule add` code path rather than the `git add` one. But, starting a few commits back in 20b837d (RF: Use the exact same code path for adding and updating a submodule), there is only a single code path for submodules, so this is no longer necessary. Discard the code change from effa32c, keeping the test (test_save_nested_subs_explicit_paths).
That case should be resolved with the latest push.
test_nested_pushclone_cycle_allplatforms() is failing on AppVeyor at the clone step:

    datalad.support.exceptions.CommandError: CommandError: '"datalad" "clone" "ria+file://C/Users/appveyor[...]" "super"' failed with exitcode 1

Try to fix it by cloning from the URL returned by get_local_file_url(…, compatibility='git'). This change is a shot in the dark that's based on the handling in test_ria_postclone_noannex. https://ci.appveyor.com/project/mih/datalad/builds/36898646/job/1jgewda7tfrdcdaq#L1064
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5241      +/-   ##
==========================================
+ Coverage   90.20%   90.28%   +0.08%
==========================================
  Files         297      297
  Lines       41666    41632      -34
==========================================
+ Hits        37583    37588       +5
+ Misses       4083     4044      -39
```

Continue to review full report at Codecov.
Passing with b352f73 (TST: push: Adjust clone url for windows compatibility).
This is amazing! Thx much @kyleam I will look into the failure of the new test tomorrow. It seems we might have built the wonder machine nobody thought to be possible.
Turns out that I had investigated the remaining failure a few days ago already, and just did not push the fix. Done now. I expect greatness!
All groovy now! The remaining issue is the performance of …

20% increase in runtime. That is substantial. And it is also fairly consistent. The only change affecting …
I do not see what can be done to improve (2), but there is probably a margin for tuning the conditional for (1). However... if the aim is to support any state of subdataset, the condition of a superdataset has no predictive value, hence we will not be able to get around testing a subdataset for adjusted mode -- we can only make that test cheaper. And I personally think it is not just valid but also useful to not force entire hierarchies of nested datasets into a consistent state (e.g. think of adjusted subdatasets presenting a specific view). So to me it boils down to the question: can we have a cheaper test than …
or maybe a pre-test that rules out that we have to perform the test above.
Don't see a lot of potential there. In case it's not an annex there's a …
As @mih pointed out in chat -- pure …
Achieved consensus during online meeting. Merge now, handle fallout later, keep whining ;-) |
test_push_recursive() aborts midway through due to unresolved save/status issues with adjusted branches. While those issues exist (and they've been at least partially addressed on master with dataladgh-5241), the only change needed to make this test pass for me on maint under tools/eval_under_testloopfs is switching a --since value away from using "HEAD" (which is of course off-by-one on a synced adjusted branch).
Towards a fix of gh-5137 and various underlying issues.
This is a simplistic dataset workflow that must work across all platforms, as otherwise invalid datasets are being published and consumed.
The included test works on proper filesystems, but fails on crippled ones.
The goal here is to be as unimpacted from test-battery issues as possible. Hence I am going through the cmdline API, not using any testrepos, and only making use of non-high-level API for querying to-be-tested properties.
Issues:

- `save --recursive` records adjusted branch commit in superdataset. This goes through a different code path that more or less calls `git add`, but already just on submodule paths. There should be an angle to re-use `_save_add_submodules()` or RF it to become usable for this.
- While `git status` is expected to report a properly saved subdataset as modified in the superdataset (due to the additional commit in the managed branch), `datalad status` should not report such a constellation as a modification -- but presently it does, and breaks many test assumptions with it. Fixes "Subdataset handling in v7 adjusted branch?" #3818
- Fixes "`datalad get (-n)` of subdataset in adjusted mode yields improper initialization" #5257
- `GitRepo.dirty` is used by `run` to judge whether a dataset is clean. However, it uses `git status`, hence will report subdatasets in adjusted mode as modified, although they are not. We need to replace `GitRepo.dirty` with a more adequate test. It is not sufficient to implement an additional `AnnexRepo.dirty`, because it is possible that a `GitRepo` superdataset contains an adjusted `AnnexRepo` subdataset any number of levels underneath.
- A `GitRepo` super reports an `AnnexRepo` subdataset as modified right after saving.
- `test_nested_pushclone_cycle_allplatforms` fails on AppVeyor.
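The nesting point above (a plain `GitRepo` super containing an adjusted `AnnexRepo` any number of levels down) implies that an adequate dirtiness test must recurse. A toy model with a hypothetical interface, just to illustrate the shape of the check:

```python
class RepoNode:
    # Stand-in for a (sub)dataset: `dirty_here` is the adjusted-branch-
    # aware verdict for this repo alone, `subdatasets` its children.
    def __init__(self, dirty_here=False, subdatasets=()):
        self.dirty_here = dirty_here
        self.subdatasets = list(subdatasets)

def hierarchy_dirty(repo):
    # Dirty if this repo has genuine modifications, or any subdataset
    # does -- at any depth, regardless of the superdataset's repo type.
    if repo.dirty_here:
        return True
    return any(hierarchy_dirty(sub) for sub in repo.subdatasets)
```

This is why implementing only `AnnexRepo.dirty` would not suffice: the recursion has to run for `GitRepo` superdatasets too, since the adjusted repo may sit arbitrarily deep below one.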