
Proper handling of nested datasets in save/push/clone cycle #5241

Merged
merged 21 commits into from Dec 22, 2020

Conversation

@mih (Member) commented Dec 11, 2020

Towards a fix of gh-5137 and various underlying issues.

This is a simplistic dataset workflow that must work across all platforms, as otherwise invalid datasets are being published and consumed.

The included test works on proper filesystems, but fails on crippled ones.

The goal here is to be as unaffected by test-battery issues as possible. Hence I am going through the cmdline API, not using any testrepos, and only making use of non-high-level API calls for querying the to-be-tested properties.

Issues:

  • A superdataset tracks a commit in an adjusted branch for freshly created, empty subdatasets
AssertionError: Desired result
  {
   "gitshasum": "22bcba3c3b058df33d12870782d3cd4d91c8aa94",
   "path": "/media/mih/Samsung_T5/tmp/datalad_temp_test_nested_pushclone_cycle_allplatformswdj92o8g/super/sub"
  }
not found among
  [
   {
    "action": "subdataset",
    "gitmodule_datalad-id": "764306ac-6647-4da5-b4fb-9edcf4977369",
    "gitmodule_name": "sub",
    "gitmodule_url": "./sub",
    "gitshasum": "9794721d94a35243325472f5b7a40eca54662936",
    "parentds": "/media/mih/Samsung_T5/tmp/datalad_temp_test_nested_pushclone_cycle_allplatformswdj92o8g/super",
    "path": "/media/mih/Samsung_T5/tmp/datalad_temp_test_nested_pushclone_cycle_allplatformswdj92o8g/super/sub",
    "refds": "/media/mih/Samsung_T5/tmp/datalad_temp_test_nested_pushclone_cycle_allplatformswdj92o8g/super",
    "status": "ok",
    "type": "dataset"
   }
  ]
  • save --recursive records an adjusted branch commit in the superdataset
    - This goes through a different code path that more or less calls git add, but already just on submodule paths. There should be an angle to re-use _save_add_submodules() or RF it to become usable for this

  • while git status is expected to report a properly saved subdataset as modified in the superdataset (due to the additional commit in the managed branch), a datalad status should not report such a constellation as a modification -- but presently it does, and breaks many test assumptions with it

  • Fixes Subdataset handling in v7 adjusted branch? #3818

  • Fixes datalad get (-n) of subdataset in adjusted mode yield improper initialization #5257.

  • Repo.dirty is used by run to judge whether a dataset is clean. However, it uses git status, hence will report subdatasets in adjusted mode as modified, although they are not. We need to replace GitRepo.dirty with a more adequate test. It is not sufficient to implement an additional AnnexRepo.dirty, because it is possible that a GitRepo superdataset contains an adjusted AnnexRepo subdataset any number of levels underneath.

  • A GitRepo super reports an AnnexRepo subdataset is modified, right after saving.

  • test_nested_pushclone_cycle_allplatforms fails on AppVeyor

@mih mih added adjusted-branches Issues caused by or affected adjusted branch operation severity-critical makes it unusable, causes data loss, introduces security vulnerability labels Dec 11, 2020
@mih mih added this to the 0.14.0 milestone Dec 11, 2020
@mih mih force-pushed the bf-5137 branch 2 times, most recently from e48a5f1 to 9611ef0 on December 16, 2020 09:12
@mih mih marked this pull request as ready for review December 16, 2020 14:40
This ensures that superdatasets saved on crippled FS do not reference
commits in managed branches.
For additional improvement TODOs see the diff.

Importantly, this consolidation makes it obsolete to have the result
record be consistent with the one that is yielded by `add()` (which was
used on submodule update). An unfortunate setup that was criticized
a long time ago already:
datalad#3398 (comment)

Minus the issue of not breaking the promise of result invariance
over time, we are now free to use sensible result records.
In order to do that reliably, postpone the 'modified' judgement in
diffstatus() to a point where we already have a submodule instance
to inspect.
...and before an `annex init`. Previously this was done in the context
of `get` and the installation of submodules.

This RF enables the proper adjusting of freshly installed subdatasets.
In the previous approach, we would miss and reset an adjusted branch to
the recorded state and essentially break the repository setup, until
the first `save` would rectify it.

This is more or less just shifting code from `get` to `clone`. Plus
a bit of renaming and slightly adjusted error handling.

Fixes dataladgh-5257
@mih (Member Author) commented Dec 17, 2020

[INFO   ] Reset branch 'adjusted/dl-test-branch(unlocked)' to a9e19e27 (from b1e5b148) to avoid a detached HEAD 
[INFO   ] Scanning for unlocked files (this may take some time)                                                                            
[INFO   ] git-annex: Unable to clean up from clone of adjusted branch; perhaps you should check out dl-test-branch 
[INFO   ] git-annex: init: 1 failed            

So it seems that when a subdataset is already on an adjusted branch after clone, we must check out the corresponding branch first, then reset it to the target commit, and only then proceed to annex init.
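That order of operations can be sketched as follows. This is an illustrative Python helper, not DataLad's actual API; the function and its arguments are hypothetical names for the sequence described above:

```python
def postclone_recovery_steps(corresponding_branch, target_commit):
    """Sketch of the post-clone sequence for a subdataset that landed on
    an adjusted branch: check out the corresponding branch, reset it to
    the commit recorded in the superdataset, then run `git annex init`.

    All names here are illustrative; this is not DataLad's actual API.
    """
    steps = []
    if corresponding_branch is not None:
        # an adjusted branch is checked out; move to its corresponding
        # branch before touching anything else
        steps.append(["git", "checkout", corresponding_branch])
    # reset to the commit recorded in the superdataset
    steps.append(["git", "reset", "--hard", target_commit])
    # annex init comes last; it may re-enter adjusted mode on its own
    steps.append(["git", "annex", "init"])
    return steps
```

On a clone that did not land on an adjusted branch (`corresponding_branch=None`), the checkout step is simply skipped.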

@mih (Member Author) commented Dec 17, 2020

One of the benchmarks reports a slowdown of 22%. Slower but correct is preferable to fast but wrong, but maybe there is an angle to keep it fast in cases where no adjusted branch is involved. So far the code intentionally avoids, or at least minimizes, such tests. In particular, it avoids assuming that a superds in adjusted mode makes it more likely that a subds is also adjusted (or should be). FS capability issues are only one of the reasons why one would have (some) datasets in adjusted mode.

A few commits back switched to doing the checkout right after cloning.
Unfortunately doing that can put us into a state that causes `git
annex init` to fail, a problem also encountered in f4ab9f0 (BF:
clone: Avoid adjusted branches when moving from unborn HEAD,
2020-07-31).

Instead call postclone_checkout_commit() after `git annex init`.  At
that point, the repo may be on an adjusted branch, so update the logic
in postclone_checkout_commit() to consider the corresponding branch.
@kyleam (Collaborator) commented Dec 18, 2020

I've pushed an update that moves the postclone_checkout_commit downstream of the git annex init and seems to address the test_merge_follow_parentds_subdataset_detached failure (at least for me locally under tools/eval_under_testloopfs) while also keeping test_nested_pushclone_cycle_allplatforms passing.

Please feel free to drop those changes if they don't make sense. My brain was getting pretty turned around while working on this.

@kyleam (Collaborator) commented Dec 18, 2020

The datalad.core.local.tests.test_save:test_save and datalad.core.local.tests.test_run:test_run_from_subds_gh3551 failures on the crippledfs build look valid and start with ca57e81 (BF: Record the state of the corresponding branch in a superdataset).

@mih (Member Author) commented Dec 18, 2020

The datalad.core.local.tests.test_save:test_save and datalad.core.local.tests.test_run:test_run_from_subds_gh3551 failures on the crippledfs build look valid and start with ca57e81 (BF: Record the state of the corresponding branch in a superdataset).

Yes, I agree. Valid, and they replicate. The reason (at least for the run failures) is that with the new setup, the GitRepo.dirty test is now inadequate, because it uses git status to make the judgement. That rightfully fails, but run should not consider this special kind of 'modified' a relevant modification. I added a TODO at the top.

The test_save failure is likely a bug. The situation is a GitRepo super that contains an AnnexRepo subdataset. I think this situation is not handled correctly, and the test failure doesn't show up elsewhere because the test is disabled on AppVeyor.

@kyleam (Collaborator) commented Dec 18, 2020

The test_save failure is likely a bug

To save any duplicated effort, I've debugged this and have a minimal fix. I'm thinking through a wider change/fix, but will push something soonish.

A couple of upcoming tests in test_save will need the same logic.
_save_add_submodules() does adjusted-branch-aware staging of
submodules (via a dedicated update-index call).  However, save_()
still includes staged submodules (except newly registered ones) in the
list of items passed to _save_add().

In the case of AnnexRepo._save_add(), this is unnecessary but harmless
because `git annex add` ignores submodules.  For GitRepo._save_add(),
on the other hand, the call to `git add` overrides the work of
_save_add_submodules(), potentially staging the submodule's adjusted
branch commit that _save_add_submodules() avoided adding to the index.

Filter all submodules, not just newly registered ones, from the items
passed to _save_add().
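The filtering this commit describes can be sketched roughly like this. The function name and the shape of the status mapping are illustrative; this is not the actual `save_()` code:

```python
def partition_save_items(status):
    """Split a {path: properties} status mapping into submodule paths
    (to be staged via the adjusted-branch-aware _save_add_submodules()
    handling) and the remaining paths (to be passed to a plain add).

    Illustrative sketch only; names do not match DataLad internals.
    """
    submodules = {
        path for path, props in status.items()
        if props.get("type") == "dataset"
    }
    # keep all submodules -- not only newly registered ones -- away from
    # the plain `git add` path, so its staging cannot override the
    # adjusted-branch-aware work done elsewhere
    to_add = [path for path in status if path not in submodules]
    return submodules, to_add
```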
If there are existing staged changes, save_() doesn't stage submodule
changes with _save_add_submodules().  This is wrong, because all
submodules need to go through the adjusted-branch-aware handling of
_save_add_submodules().

But it's not just a matter of removing the guard (which this commit
does).  The handling is broken because partial commits are done with
`git commit ... -- FILES`.  This form updates the index entries for
FILES from the working tree.  This undoes any special adjusted branch
handling that _save_add_submodules() did.

Using a temporary index, like the old AnnexRepo.commit() used to do
(removed in bfd9aeb), is likely the only solution to this.  Given
datalad's "no index" philosophy, it's probably not worth it to address
this, but at least add a test case that demonstrates the issue.
This series taught _save_add_submodules() to record the corresponding
branch's id in the superdataset when the submodule is on an adjusted
branch.  `datalad status` is aware of this behavior and ignores
submodule ID changes that are due to a submodule having an adjusted
branch checked out.

This new behavior means that the .dirty optimization from 9d29d13
(OPT: gitrepo.dirty: Use 'git status --porcelain', 2019-05-2) returns
the "wrong" answer for an otherwise clean submodule that is on an
adjusted branch.  Several non-test spots in the code base use .dirty,
so it's important that it doesn't conflict with status().

To resolve this discrepancy, confirm a dirty result reported by git
with the `datalad status` machinery.  When the working tree is clean
in git's eyes, this retains the original speedup, but unfortunately
this may still be enough of a hit to make dataladgh-3342 an issue again.
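The two-stage check described above can be sketched as follows, with the two probes injected as callables for illustration (the real code calls git and the status machinery directly; these parameter names are stand-ins):

```python
def is_dirty(git_status_porcelain, datalad_status):
    """Sketch of the two-stage dirtiness check.

    `git_status_porcelain` stands in for running `git status
    --porcelain` and returns its raw output as a string;
    `datalad_status` stands in for the `datalad status` machinery and
    returns an iterable of result records. Illustrative sketch only.
    """
    if not git_status_porcelain().strip():
        # clean in git's eyes: trust the fast answer, keep the speedup
        return False
    # git says dirty, but that may just be an otherwise-clean submodule
    # on an adjusted branch; confirm with the (slower) status machinery
    return any(r.get("state") != "clean" for r in datalad_status())
```

The design point is that the fast path is only trusted for a negative answer; any positive from git is re-checked by the adjusted-branch-aware machinery.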
test_base.test_aggregation() does a blanket "aggregate_metadata
returned six results" check, but two commits back made GitRepo.save_()
go through _save_add_submodules() even when there are staged changes,
leading to more results than this check expects.

The exact number of results here is unimportant, and checking it is
brittle.  Loosen the check to assert that there is a successful save
result, which seems sufficient given that there's an
assert_repo_status() call immediately downstream.
@kyleam (Collaborator) commented Dec 18, 2020

It looks like the test added in the initial commit of this series (test_nested_pushclone_cycle_allplatforms) is failing on appveyor. I thought that might be due to one of my recent pushes, but it looks like it was around with 93d90c7 (BF: Use correct helper to get path property into the result) if not earlier.

https://ci.appveyor.com/project/mih/datalad/builds/36887798/job/ifubrpg5f2u623a8

It passes for me locally under tools/eval_under_testloopfs.

@kyleam (Collaborator) commented Dec 18, 2020

I think the metalad failure (and core's test_aggregation result failure that I "addressed" with the last commit) points to an issue with _save_add_submodules behavior. On a clean repo (even without adjusted branches involved), we now get results for each subdataset on recursive saves:

example
datalad create
datalad create -d. sub
datalad -f json_pp save -r

before

{                                                                                                        
  "action": "save",
  "path": "/tmp/dl-6SpvWM4/sub",
  "refds": "/tmp/dl-6SpvWM4",
  "status": "notneeded",
  "type": "dataset"
}
{                                                                                                        
  "action": "save",                                                                                      
  "path": "/tmp/dl-6SpvWM4",
  "refds": "/tmp/dl-6SpvWM4",
  "status": "notneeded",
  "type": "dataset"
}

after


{                                                                                                        
  "action": "save",
  "path": "/tmp/dl-6SpvWM4/sub",
  "refds": "/tmp/dl-6SpvWM4",
  "status": "notneeded",
  "type": "dataset"
}
{                                                                                                        
  "action": "add",
  "key": null,
  "path": "/tmp/dl-6SpvWM4/sub",
  "refds": "/tmp/dl-6SpvWM4",
  "status": "ok",
  "type": "file"
}
{                                                                                                        
  "action": "save",                                                                                      
  "path": "/tmp/dl-6SpvWM4",
  "refds": "/tmp/dl-6SpvWM4",
  "status": "notneeded",
  "type": "dataset"
}

When save() is processing subdatasets, it overrides their status to
type=dataset,status=untracked, except in the situation covered by
effa32c (BF: save: Fix handling of explicit paths for nested new
subdatasets, 2020-10-16).  This override is in effect for clean
submodules.

Before the recent change in 20b837d (RF: Use the exact same code
path for adding and updating a submodule), this resulted in the
unnecessary but quiet action of `git annex add` or `git add` being
called with the submodule.  After that commit, recursively saving a
dataset shows an add record for each subdataset, even if it is clean.

Stop save() from adjusting the status of clean subdatasets so that
_save_add_submodules() is not triggered.
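The guard described in this commit amounts to something like the following sketch (a hypothetical function; the real logic lives inside save() and uses different names):

```python
def status_for_saving(props):
    """Decide how save() should treat one status record.

    Clean subdatasets are passed through untouched, so they never reach
    the submodule-staging machinery; only subdatasets with actual
    changes get the type=dataset/state=untracked override. Illustrative
    sketch only; not the actual save() code.
    """
    if props.get("type") != "dataset":
        # regular files are untouched by this logic
        return props
    if props.get("state") == "clean":
        # do not force an add on a clean submodule
        return props
    # anything else still goes down the submodule-saving path
    return dict(props, state="untracked")
```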
The change in effa32c (BF: save: Fix handling of explicit paths for
nested new subdatasets, 2020-10-16) prevented save() from overwriting
some of the submodule status records it passed to GitRepo.save_() in
order to go down the `git submodule add` code path rather than the
`git add` one.  But, starting a few commits back in 20b837d (RF:
Use the exact same code path for adding and updating a submodule),
there is only a single code path for submodules, so this is no longer
necessary.

Discard the code change from effa32c, keeping the test
(test_save_nested_subs_explicit_paths).
@kyleam (Collaborator) commented Dec 21, 2020

On a clean repo (even without adjusted branches involved), we now get results for each subdataset on recursive saves

That case should be resolved with the latest push. test_save didn't reveal any wider issues with that change, but perhaps the full test suite run will.

test_nested_pushclone_cycle_allplatforms() is failing on AppVeyor at
the clone step:

  datalad.support.exceptions.CommandError: CommandError: '"datalad" "clone"
  "ria+file://C/Users/appveyor[...]"
  "super"' failed with exitcode 1

Try to fix it by cloning from the URL returned by
get_local_file_url(, compatibility='git').  This change is a shot in
the dark that's based on the handling in test_ria_postclone_noannex.

https://ci.appveyor.com/project/mih/datalad/builds/36898646/job/1jgewda7tfrdcdaq#L1064
@codecov (bot) commented Dec 21, 2020

Codecov Report

Merging #5241 (9a1b3fc) into master (7d0542f) will increase coverage by 0.08%.
The diff coverage is 92.46%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5241      +/-   ##
==========================================
+ Coverage   90.20%   90.28%   +0.08%     
==========================================
  Files         297      297              
  Lines       41666    41632      -34     
==========================================
+ Hits        37583    37588       +5     
+ Misses       4083     4044      -39     
Impacted Files Coverage Δ
datalad/distribution/get.py 97.88% <ø> (+0.24%) ⬆️
datalad/core/local/tests/test_save.py 97.26% <57.14%> (-1.86%) ⬇️
datalad/support/gitrepo.py 91.95% <96.00%> (+0.19%) ⬆️
datalad/core/distributed/clone.py 90.54% <96.77%> (+0.57%) ⬆️
datalad/config.py 96.73% <100.00%> (ø)
datalad/core/distributed/tests/test_clone.py 96.98% <100.00%> (+0.07%) ⬆️
datalad/core/distributed/tests/test_push.py 99.54% <100.00%> (+0.04%) ⬆️
datalad/core/local/save.py 98.82% <100.00%> (ø)
datalad/distribution/tests/test_update.py 100.00% <100.00%> (ø)
datalad/metadata/tests/test_base.py 99.24% <100.00%> (-0.01%) ⬇️
... and 18 more


@kyleam (Collaborator) commented Dec 21, 2020

the test added in the initial commit of this series (test_nested_pushclone_cycle_allplatforms) is failing on appveyor

Passing with b352f73 (TST: push: Adjust clone url for windows compatibility).

@mih (Member Author) commented Dec 21, 2020

This is amazing! Thx much @kyleam

I will look into the failure of the new test tomorrow. It seems we might have built the wonder machine nobody thought to be possible.

🙃

@mih (Member Author) commented Dec 22, 2020

Turns out that I had investigated the remaining failure a few days ago already, and just did not push the fix. Done now. I expect greatness!

@mih (Member Author) commented Dec 22, 2020

All groovy now!

The remaining issue is the performance of status:

+         1.05±0s          1.24±0s     1.18  api.supers.time_status
+         1.80±0s       2.17±0.01s     1.20  api.supers.time_status_recursive

20% increase in runtime. That is substantial. And it is also fairly consistent.

The only change affecting status is the following:

  1. there are now more conditions which lead to a closer inspection of a subdataset (i.e. we are not assigning a state property without creating a repo instance for a subdataset and at least asking for the hexsha); patch at https://github.com/datalad/datalad/pull/5241/files#diff-bfc1c47d032fc72b15b6a79db3d49e4c4334cf96318c5efe48a3403744b32fb1R3897
  2. what we do specifically to inspect a subdataset's status (patch at https://github.com/datalad/datalad/pull/5241/files#diff-bfc1c47d032fc72b15b6a79db3d49e4c4334cf96318c5efe48a3403744b32fb1R3788 )

I do not see what can be done to improve (2), but there is probably a margin for tuning the conditional for (1). However ....

...if the aim is to support any state of subdataset, the condition of a superdataset has no predictive value, hence we will not be able to get around testing a subdataset for adjusted mode -- we can only make that test cheaper. And I personally think it is not just valid, but also useful to not force entire hierarchies of nested datasets into a consistent state (e.g. think of adjusted subdatasets presenting a specific view).

So to me it boils down to the question: can we have a cheaper test than

subrepo = repo_from_path(path)
actual_state = subrepo.get_hexsha(subrepo.get_corresponding_branch())

or maybe a pre-test that rules out that we have to perform the test above.
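One such pre-test could look like the sketch below: only when the subdataset's HEAD disagrees with the recorded commit would the expensive repo_from_path()/get_corresponding_branch() probe be needed at all. All names are illustrative, and the expensive probe is modelled as an injected callable:

```python
def subdataset_modified(recorded_hexsha, head_hexsha, get_corresponding_hexsha):
    """Sketch of a cheap pre-test for subdataset modification.

    `get_corresponding_hexsha` stands in for the expensive probe
    (instantiating a repo and resolving the corresponding branch); it
    returns the corresponding branch's hexsha, or None if the
    subdataset is not in adjusted mode. Illustrative sketch only.
    """
    if head_hexsha == recorded_hexsha:
        # HEAD matches the recorded state: clean, no adjusted-mode
        # probe needed at all
        return False
    # HEAD differs; an adjusted checkout is still clean if its
    # corresponding branch matches the recorded commit
    return get_corresponding_hexsha() != recorded_hexsha
```

The cheap path handles the common clean, non-adjusted case without ever constructing a repo instance for the subdataset.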

@bpoldrack (Member) commented:

So to me it boils down to the question: can we have a cheaper test than

subrepo = repo_from_path(path)
actual_state = subrepo.get_hexsha(subrepo.get_corresponding_branch())

or maybe a pre-test that rules out that we have to perform the test above.

I don't see a lot of potential there. In case it's not an annex, there is a GitRepo.is_valid_repo() check that is superfluous (it runs once within repo_from_path when we see it's not a valid annex, and once before the repo_from_path call). So one could skip repo_from_path -- we already know it's a valid Git repo at that point. AnnexRepo.is_valid_repo is the only thing we don't know yet, AFAICT. Testing for that and instantiating directly can be cheaper if it's not an annex. I don't think that accounts for much, though; probably not worth it in that fashion. However, I don't currently see why we would be interested in an AnnexRepo at all. I think we can spare that test altogether. We only need Git functionality: subrepo = GitRepo(path) should do, and should be cheaper (and we know it's valid already).

get_corresponding_branch currently does a superfluous string comparison, too (originating from when managed branches also covered direct mode), but that's even more insignificant in this context.

@bpoldrack (Member) commented:

As @mih pointed out in chat -- a pure GitRepo would lack get_corresponding_branch. So that approach doesn't work ATM, although one could RF this: since it's a simple string operation on branch names, it doesn't really require an AnnexRepo instance.
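That string operation could be factored out roughly as below. The naming scheme follows what git-annex uses for adjusted branches (e.g. `adjusted/dl-test-branch(unlocked)` in the log output earlier in this thread); the function name and the exact set of adjustment modes matched by the regex are assumptions of this sketch:

```python
import re

# git-annex names adjusted branches like "adjusted/<branch>(<mode>)",
# e.g. "adjusted/dl-test-branch(unlocked)"; lowercase mode names assumed
_ADJUSTED_BRANCH = re.compile(r"^adjusted/(.+)\([a-z]+\)$")

def corresponding_branch_name(branch):
    """Return the corresponding branch for an adjusted branch name, or
    None if `branch` is not an adjusted branch.

    A pure string operation -- no AnnexRepo instance needed.
    Illustrative sketch only.
    """
    if branch is None:
        return None
    m = _ADJUSTED_BRANCH.match(branch)
    return m.group(1) if m else None
```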

@mih (Member Author) commented Dec 22, 2020

Achieved consensus during online meeting. Merge now, handle fallout later, keep whining ;-)

@mih mih merged commit 3fbf486 into datalad:master Dec 22, 2020
1 check passed
@mih mih deleted the bf-5137 branch December 22, 2020 14:34
kyleam added a commit to kyleam/datalad that referenced this pull request Jan 13, 2021
test_push_recursive() aborts midway through due to unresolved
save/status issues with adjusted branches.  While there are those (and
they've at least been partially addressed on master with dataladgh-5241), the
only change needed to make this test pass for me on maint under
tools/eval_under_testloopfs is switching a --since value away from
using "HEAD" (which is of course off-by-one on a synced adjusted
branch).