Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Record state of a submodule's corresponding branch in the parent #4275

Closed
wants to merge 10 commits into from

Conversation

mih
Copy link
Member

@mih mih commented Mar 10, 2020

Otherwise any subdataset in adjusted mode will have a state recorded
that belongs to a branch that is continuously rebased.

Also adjust DataLad's own diff-functionality to detect this intentional
'modified' state and report it as 'clean'.

This is a major step toward a first serious attempt to support nested
dataset in adjusted mode

Sitting on top of #4273 #4274 #4252 (which fixes #4227)

Conceptually this is done. A number of additional core tests run on crippled FS. Please let me know what you think.

TODO:

Fixes #3818 and fixes #3969

lgr.debug(
'Sync corresponding branch %s at %s prior repository '
'state evaluation', subm_cbranch, f)
subrepo.localsync(managed_only=True)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be cheaper than this, especially if we just had sync'ed before? Should we make an attempt to count the commits on the adjusted branch, and if there is only one not go for a sync. This would only miss cases, where someone amends commits on the adjusted branch, but that would be problematic behavior anyways, so maybe that is good enough. OTOH, I dunno how clever git-annex is in figuring out that nothing needs to be done....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd vote for sticking with the simpler sync call for now.

OTOH, I dunno how clever git-annex is in figuring out that nothing needs to be done....

Quickly testing suggests that annex will not sync back the changes if you amend the initial adjusted branch commit and have no other commits.

@mih mih marked this pull request as ready for review March 10, 2020 13:38
@mih mih requested a review from kyleam March 10, 2020 13:39
Copy link
Contributor

@kyleam kyleam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, in a world that goes through datalad status and friends, this seems like it could work. I really have no sense how this will play out in the wild, but, given the current state of adjusted branches and submodules, I'm for any improvement that has a minimum impact on non-adjusted branches.

In addition to my inline comments:

  • It'd be nice to see dedicated tests in addition to the modification to test_subdataset_save.
  • I think .dirty will need to be updated to go through .status when on an adjusted branch.


if subm is None:
# in case it did not happen above
subm = repo_from_path(self.pathobj / path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just replace subm = None with this (and remove the assignment from under if url is None)?

# MIH: this looks strange, why would we want to check
# out a branch that matches the name of a branch
# in the parentds, we don't even know if there is
# such a branch...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll have to ask yourself that: 2fda83e (ENH: Record the active branch of a subdataset in the parent, 2019-10-20). Based on that subject and the fact that self.get_active_branch() is evaluated for each cand_sm, I'd guess you were intending to specify the currently checked out branch in cand_sm, not the parent.

@@ -3975,6 +4037,10 @@ def save_(self, message=None, paths=None, _status=None, **kwargs):
self,
to_stage_submodules,
git_opts=None):
# _save_add doesn't know better, but we know that
# we only passed submodule records
r['type'] = 'dataset'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you're taking the public API exposed via result hooks seriously :]

@@ -4011,6 +4077,19 @@ def save_(self, message=None, paths=None, _status=None, **kwargs):
else tuple())}):
yield r

# and lastly fixup subdataset states
for sub in [f for f, props in status.items()
Copy link
Contributor

@kyleam kyleam Mar 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oy, so I think we're looping over all tracked items in the repo again. Can this be guarded by an "on adjusted branch" condition? Also, conceptually this seems a better fit for AnnexRepo._save_post.

@@ -252,7 +252,7 @@ def test_run_from_subds_gh3551(path):
assert_repo_status(ds.path)
subds = Dataset(op.join(ds.path, subds_path))
ok_exists(op.join(subds.path, "f"))
if not on_windows: # FIXME
if ds.repo.is_managed_branch(): # FIXME
Copy link
Contributor

@kyleam kyleam Mar 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tested, but shouldn't the not stay?

# have a changed state, we cannot rely on this judgement
# with subdatasets that are potentially in adjusted mode.
# Setting the `state` to None will queue this record for
# further processing below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I find this and related functions hard to grok, so your comments are very much appreciated.

'update-index', '--add', '--replace', '--cacheinfo',
'160000',
subm.get_hexsha(subm_cbranch),
path])
Copy link
Contributor

@kyleam kyleam Mar 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, whoever came up with that invocation is a very clever person :]

kyleam added a commit that referenced this pull request Mar 12, 2020
b95bff1 (TST: Robustify tests for non-windows crippled FS,
2020-03-10) changed this condition from `not on_windows` to
`ds.repo.is_managed_branch()`.  This was merged in with gh-4276, but
as mentioned in review of gh-4275 [0], it seems very likely that a
`not` was unintentionally dropped, because it'd be strange to switch
from "not windows" to "ok managed branch, including windows" and
because the test now fails on the CrippledFS and Win2019 builds [1].

[0]: #4275 (comment)
[1]: https://github.com/datalad/datalad/runs/498868341#step:8:217
     https://github.com/datalad/datalad/runs/500836364?check_suite_focus=true#step:8:304
mih added 10 commits March 13, 2020 12:56
Now there will be the corresponding/real branch checkout out, even if
the clone happened from a repository in adjusted mode
Clone of adjusted dataset is not itself adjusted by default, if
filesystem permits this.
Otherwise, that state is liekly to be gone shortly.
Otherwise any subdataset in adjusted mode will have a state recorded
that belongs to a branch that is continuously rebased.

Also adjust DataLad's own diff-functionality to detect this intentional
'modified' state and report it as 'clean'.

This is a major step toward a first serious attempt to support nested
dataset in adjusted mode
@codecov
Copy link

codecov bot commented Mar 13, 2020

Codecov Report

Merging #4275 into master will decrease coverage by 44.15%.
The diff coverage is 60%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4275       +/-   ##
===========================================
- Coverage    88.8%   44.64%   -44.16%     
===========================================
  Files         285      283        -2     
  Lines       37419    36969      -450     
===========================================
- Hits        33230    16505    -16725     
- Misses       4189    20464    +16275
Impacted Files Coverage Δ
datalad/core/distributed/tests/test_clone.py 29.53% <0%> (-62.96%) ⬇️
datalad/support/tests/test_annexrepo.py 14.42% <0%> (-80.69%) ⬇️
datalad/core/local/tests/test_run.py 26.63% <0%> (-72.06%) ⬇️
datalad/distribution/tests/test_update.py 31.31% <0%> (-66.42%) ⬇️
datalad/core/local/tests/test_save.py 11.77% <0%> (-84.98%) ⬇️
datalad/core/local/tests/test_status.py 20.68% <0%> (-78.17%) ⬇️
datalad/core/local/tests/test_diff.py 14% <0%> (-85.5%) ⬇️
datalad/core/local/tests/test_resulthooks.py 15.38% <0%> (-84.62%) ⬇️
datalad/core/local/repo.py 90.9% <100%> (+0.43%) ⬆️
datalad/core/local/tests/test_create.py 16.59% <33.33%> (-83.41%) ⬇️
... and 238 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a2da274...15d1554. Read the comment docs.

kyleam added a commit to kyleam/datalad that referenced this pull request Apr 2, 2020
As of v0.12.0, specifically 2fda83e (ENH: Record the active branch
of a subdataset in the parent, 2019-10-20, dataladgh-3817), we record the
current branch in the _parent_ repository as the value for
`submodule.<name>.branch` in .gitmodules when saving a new submodule.

There are a few problems with this:

  * The current branch in the parent is recorded.  It seems unlikely
    that that was the intent because there is no reason to assume a
    branch with that name exists in the submodule repository or, if it
    does, to assume that the parent and submodule branches are
    necessarily coupled.  2fda83e mentions dataladgh-1424
    (`Datalad.recall_state() -> `load` command) as the motivation,
    which makes it seem likely the current branch in the submodule,
    not the parent, was supposed be recorded.

    This was mentioned in a currently open PR:
    <datalad#4275 (comment)>

  * Discussion of recording the branch at dataladgh-1424 suggests that the
    idea is to record this information with every save, but 2fda83e
    records it only when the submodule is initially added.  2fda83e
    doesn't say explicitly that its intent was to do it with every
    save (just that the change moves in the direction of dataladgh-1424), but
    still it doesn't seem a particularly useful incremental step to
    have a one-shot record of the current branch at the time the
    submodule is added.

  * It's not clear that `submodule.<name>.branch` is a good spot to
    record the branch information required by dataladgh-1424.
    `submodule.<name>.branch` is about which _remote_ branch is used
    when `--remote` is passed to `submodule update`.  The goal in
    dataladgh-1424 seems to be tracking what the current _local_ branch was
    at the time of a save.  Given these different purposes, it seems
    like it'd be a good idea to track this information in a different
    way (perhaps an entry in .gitmodules with a different key).

Given these issues, let's revert the changes from 2fda83e, along
with the changes from the follow-up commit ee50107 (BF: Do not
record adjust branch in submodule config, 2019-10-21).

Fixes datalad#4373.
kyleam added a commit to kyleam/datalad that referenced this pull request Apr 2, 2020
As of v0.12.0, specifically 2fda83e (ENH: Record the active branch
of a subdataset in the parent, 2019-10-20, dataladgh-3817), we record the
current branch in the _parent_ repository as the value for
`submodule.<name>.branch` in .gitmodules when saving a new submodule.

There are a few problems with this:

  * The current branch in the parent is recorded.  It seems unlikely
    that that was the intent because there is no reason to assume a
    branch with that name exists in the submodule repository or, if it
    does, to assume that the parent and submodule branches are
    necessarily coupled.  2fda83e mentions dataladgh-1424
    (`Datalad.recall_state() -> `load` command) as the motivation,
    which makes it seem likely the current branch in the submodule,
    not the parent, was supposed be recorded.

    This was mentioned in a currently open PR:
    <datalad#4275 (comment)>

  * Discussion of recording the branch at dataladgh-1424 suggests that the
    idea is to record this information with every save, but 2fda83e
    records it only when the submodule is initially added.  2fda83e
    doesn't say explicitly that its intent was to do it with every
    save (just that the change moves in the direction of dataladgh-1424), but
    still it doesn't seem a particularly useful incremental step to
    have a one-shot record of the current branch at the time the
    submodule is added.

  * It's not clear that `submodule.<name>.branch` is a good spot to
    record the branch information required by dataladgh-1424.
    `submodule.<name>.branch` is about which _remote_ branch is used
    when `--remote` is passed to `submodule update`.  The goal in
    dataladgh-1424 seems to be tracking what the current _local_ branch was
    at the time of a save.  Given these different purposes, it seems
    like it'd be a good idea to track this information in a different
    way (perhaps an entry in .gitmodules with a different key).

Given these issues, let's revert the changes from 2fda83e, along
with the changes from the follow-up commit ee50107 (BF: Do not
record adjust branch in submodule config, 2019-10-21).

Fixes datalad#4373.
@mih
Copy link
Member Author

mih commented Apr 3, 2020

It seems the premise of this PR has changed. I dont have the resources to work on this at the moment. I will close it and reassess the state of things when I get to it next.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants