Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update: Support ff-only merges and merging parent revision #4167

Merged
merged 10 commits into from Feb 22, 2020

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Feb 19, 2020

This series is based on the discussion in gh-4148. On top of the existing "merge in the chosen branch from the sibling" mode, it adds a "follow parent" mode for merging in the registered revision for updated subdatasets. It replaces the approach proposed in gh-4118, which was a hybrid approach that merged in the subdataset revision directly only if the to-be-merged sibling branch didn't contain the registered revision. It think the approach here is simpler and will make it easier for the caller to anticipate the result.

Also, --merge is expanded so that --merge=ff-only limits the allowed merges to fast-forwards, while --merge retains the same behavior (and is now a shorthand for --merge=any).

Patch 1-4 follow patches from gh-4118 pretty closely.

  [ 1/10] RF: update: Break apart _update_repo() helper
  [ 2/10] ENH: update: Use same merge approach for plain and annex repos
  [ 3/10] ENH: update: Avoid calling 'git pull' to merge changes
  [ 4/10] ENH: update: Don't set merge target to a guess that doesn't exist
  [ 5/10] TST: update: Make more specific assertions about result counts
  [ 6/10] ENH: update: Yield result for merge event
  [ 7/10] NF: update: Add option to merge in recorded commit
  [ 8/10] ENH: update: Try to fetch submodule commit explicitly if needed
  [ 9/10] DOC: update: Explain how sibling is chosen for merge
  [10/10] NF: update: Support restricting merges to fast-forwards

kyleam added 6 commits Feb 19, 2020
The _update_repo() helper is responsible for choosing how to merge in
changes (via 'git pull ...'  or 'git annex sync ...'), doing the
merge, and re-obtaining git-annex data.  Put these bits into separate
functions, mostly to make upcoming changes to the first two
functionalities clearer.
update(..., merge=True) merges in changes differently depending on
whether the repository is an annex repo.  For plain git repos, it
brings in the changes with 'git pull'.  For annex repos, it pulls with
'git annex sync', disabling committing and pushing.

This isn't problematic by itself, but it does mean that plain and
annex repos have a non-obvious discrepancy in behavior.  It also
prevents us from having finer control of what gets merged in for annex
repos, something that's needed in the upcoming commits.

Use 'git pull' for annex repos as well, except for repos that are on
adjusted branches.  Adjusted branches are tricky with submodules (as
touched on a bit in dataladgh-3818) because an adjusted branch is, by design,
frequently rebased on top of its main branch, which means the
submodule ID that is being tracked is not stable.  There is likely no
good way to handle it, but merging an adjusted branch (or even its
main branch) ourselves will make things worse, so continue to use 'git
annex sync' [*].

[*] Another option would be to use 'git annex merge <revision>' for
    git-annex versions that support it (>= 7.20190819).  However, when
    testing that locally, I noticed that it failed with submodule
    conflicts in cases where 'git annex sync' did not.
On update(..., merge=True) with a non-annex repo, we bring in changes
with (1) 'git pull <remote>' or (2) 'git pull <remote> <current
branch>', where <remote> is the resolved sibling.  We decide to use
the first if the current branch has a configured upstream remote that
matches the sibling.

This approach has the following issues:

  * it makes another remote connection even though we just fetched
    changes

  * it repeats inspection of some information that we already gathered
    (at least indirectly) in upstream code

Rework the git-pull-based merge functions to take a merge target and
call 'git merge' with the appropriate target to preserve the previous
behavior.
If the resolved sibling isn't a configured upstream, we fall back to
<sibling>/<current local branch> as the target.  Don't do that if that
branch doesn't exist locally.
Many spots in the update() tests assert that there are a certain
number of results of _any_ action type, sometimes with the "type" and
"status" specified.  This is fragile because it needs to be changed if
update() starts to yield other results (as it will in the next
commit).

Update these spots to specify that the action of interest is "update".
When merging, update() doesn't provide any information to the callers
aside from an "Applying updates ..." log message.  And if a merge
fails, the call to update() aborts.  Instead let's provide a result
record for the merge event, and, for a failing merge, translate the
CommandError to a "status=error" result.  This will allow callers to
control it with `on_failure`.

Note that, if a subdataset sees a failure, it is not added to the
paths to save because the user will need to deal with the conflicts.
This will often lead to the top-level dataset also having a conflict,
in which case no save at all will happen.  But see the new
test_merge_conflict_in_subdataset_only for an example where the parent
dataset doesn't have a conflict despite its subdataset having one.
@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Feb 19, 2020

Here's @yarikoptic's script from gh-4148, updated for this PR:

@@ -19,5 +19,5 @@ datalad save -d rem -r

 # Intent is to update full hierarchy to reflect state of rem
 # but we would then would be merging changes in 3rd not present in rem
-datalad -C ds update -r --merge
-git -C ds diff origin/master..
+datalad -C ds update -r --merge=ff-only --follow=parent
+git -C ds diff --exit-code origin/master.. && echo "no differences"
[...]
action summary:
  merge (ok: 2)
  save (notneeded: 1)
  update (ok: 2)
no differences

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Feb 19, 2020

I was trying to stay away from "parent" when referring to "superdataset", but I guess with all the siblings, and "inherit" it would be ok to use "parent" term. Especially since we use it in quite a few places already http://docs.datalad.org/en/latest/search.html?q=Parent&check_keywords=yes&area=default .

@codecov
Copy link

@codecov codecov bot commented Feb 19, 2020

Codecov Report

Merging #4167 into master will increase coverage by 4.23%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4167      +/-   ##
==========================================
+ Coverage   85.04%   89.28%   +4.23%     
==========================================
  Files         275      275              
  Lines       35819    36036     +217     
==========================================
+ Hits        30462    32173    +1711     
+ Misses       5357     3863    -1494     
Impacted Files Coverage Δ
datalad/support/due.py 48.00% <0.00%> (-32.00%) ⬇️
datalad/ui/utils.py 46.87% <0.00%> (-25.00%) ⬇️
datalad/support/archive_utils_patool.py 29.87% <0.00%> (-10.39%) ⬇️
datalad/tests/test_misc.py 90.00% <0.00%> (-10.00%) ⬇️
datalad/metadata/tests/test_extract_metadata.py 90.47% <0.00%> (-9.53%) ⬇️
datalad/tests/test_archives.py 90.57% <0.00%> (-4.35%) ⬇️
datalad/core/local/tests/test_diff.py 96.90% <0.00%> (-2.58%) ⬇️
datalad/utils.py 84.12% <0.00%> (-2.47%) ⬇️
datalad/distribution/dataset.py 93.82% <0.00%> (-2.06%) ⬇️
datalad/metadata/tests/test_search.py 92.85% <0.00%> (-1.59%) ⬇️
... and 51 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c02b19b...3a8edde. Read the comment docs.

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Feb 19, 2020

I was trying to stay away from "parent" when referring to "superdataset"

Can you elaborate on why? I wasn't aware of any concerted effort to stay away from the term, and it seems like a natural fit here given that update calls remotes "siblings" .

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Feb 19, 2020

There is no concerted effort one way or another.
I was trying to stay away just to not "introduce" it, but as I have checked, we already use that term, so should be ok. Although that we also use it for parent directory, but I guess it doesn't make sense to follow parent directory in this context so all should be ok. And we might better introduce it to the http://docs.datalad.org/en/latest/glossary.html to make it concerted effort ;-)

mih
mih approved these changes Feb 20, 2020
Copy link
Member

@mih mih left a comment

LGTM! Thx.

@yarikoptic yarikoptic added this to the 0.13.0 milestone Feb 20, 2020
Copy link
Member

@yarikoptic yarikoptic left a comment

minor typos/wording + more of "parent" discussion.

configured remote for the current branch. By default, changes are
fetched from the sibling but not merged into the current branch.
With [CMD: --merge or --merge=any CMD][PY: merge=True or
merge="any" PY], the changes will merged into the current branch. A
Copy link
Member

@yarikoptic yarikoptic Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor typo:

Suggested change
merge="any" PY], the changes will merged into the current branch. A
merge="any" PY], the changes will be merged into the current branch. A

Copy link
Contributor Author

@kyleam kyleam Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will fix.

branch's configured branch, if it points to a branch that belongs
to the sibling, or a branch on the sibling branch with a name that
matches the current branch. For 'parent', the revision registered
in the parent of the subdataset is merged in. Note that the current
Copy link
Member

@yarikoptic yarikoptic Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I remembered why I also disliked "parent". "Parent" is a term you see in git to describe "parent commit". So, --follow=parent could be understood (if not reading this doc) pretty-much as --ff-only.
I think this notion of "parent" been a super-dataset is also too implicit in the above sentence in "parent of the subdataset", especially if we consider that "parent" could indeed be a "parent commit".
So if we are to stay with "parent" (I also have dislike of "superdataset" since it is not just "an immediate superdataset"), doc might better be adjusted ".. in the superdataset, immediate parent of the subdataset, ..." . A bit mouthful but removes that "implicit" meaning.

But may be, it should become parent-superdataset or parent-dataset to be 100% explicit in its meaning?

Copy link
Contributor Author

@kyleam kyleam Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I remembered why I also disliked "parent".

I'm left wondering what the first thing you disliked was :]

"Parent" is a term you see in git to describe "parent commit".

I see. I agree that a parent is a very context-dependent thing, and the question is "parent of what?". You have the parent (dataset) of the dataset, parent (commit or commits) of the commit, parent (directory) of a directory... So yeah, perhaps it's best to stay away from an option value that is open to confusion.

(All of the above can apply to "sibling" too, but there's less potential for confusion because user-facing git documentation doesn't use the term currently.)

So, --follow=parent could be understood (if not reading this doc) pretty-much as --ff-only.

Hrm, I was with you on --follow=parent leading to a user being not quite sure what "parent" was referring to, but you've lost me here. Merging the parent commit is always a no-op. I'm not seeing how a user would get to the --ff-only conclusion.

I think this notion of "parent" been a super-dataset is also too implicit in the above sentence in "parent of the subdataset", especially if we consider that "parent" could indeed be a "parent commit".

What would be the "parent commit of a subdataset"? I'm of course biased because I wrote the sentence in question, but I think the natural and clear reading of "parent of X" is that the parent of the same kind (here "dataset") as X.

So if we are to stay with "parent" (I also have dislike of "superdataset" since it is not just "an immediate superdataset"), doc might better be adjusted ".. in the superdataset, immediate parent of the subdataset, ..." . A bit mouthful but removes that "implicit" meaning.

My view is that you're overthinking this. While it's of course best to remove ambiguity, I think "parent of X" can safely be assumed to be first-level. Otherwise, the writer would say it. As an example, I haven't heard of any Git users being confused by "parent commit", but git doesn't ever qualify parent with "immediate". (OK, there's one code comment in revision.c that uses the term :].)

But may be, it should become parent-superdataset or parent-dataset to be 100% explicit in its meaning?

Those are fine by me, with a preference toward "parent-dataset". I like that it (almost) matches the key that datalad-subdatasets uses, "parentds". I could be persuaded to use that directly, though I think it makes sense to avoid an unnecessary abbreviation here.

Any other thoughts on what this option value should be?

Copy link
Contributor Author

@kyleam kyleam Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adswa It'd be nice to hear your thoughts on this too. That way I'm not to blame when you're struggling to explain a confusingly named option in the handbook :]

Copy link
Member

@yarikoptic yarikoptic Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to keep it short: I like parentds since it is already used with the same meaning and explicit.

will be done by merging in a branch from the (specified or
inferred) sibling. The branch brought in will either be the current
branch's configured branch, if it points to a branch that belongs
to the sibling, or a branch on the sibling branch with a name that
Copy link
Contributor Author

@kyleam kyleam Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"branch on the sibling branch"

One too many branches.

kyleam added 4 commits Feb 21, 2020
`update --merge --recursive` tries to match the current branch with a
branch on the sibling and merge in those changes.  That is, it's
similar in spirit to

    git pull --no-recurse-submodules
    git submodule update --remote --merge --recursive

However, there are various situations where the selected merge target
does not point to the revision recorded in the parent (e.g., when the
sibling has switched to another branch or the scenario described at
dataladgh-4148).

In these cases, if the caller wants to reflect the state of the remote
dataset hierarchy, the current behavior of update() won't do.  Add a
new option, --follow=parentds, that is more suited to the above
scenarios.  The new behavior is similar to

    git pull --no-recurse-submodules
    git submodule update --merge --recursive

The default value, --follow=sibling, is not intended to have any
change in behavior.

Rather than add a new option, we could extend --merge to take a value
that specifies the behavior (e.g., with --merge translating to
--merge=sibling).  But an upcoming commit will make a similar change
to add a "fast-forward only" option.  That's orthogonal to the
behavior being added in this commit, so use a separate option to avoid
needing to come up with a way for --merge to take multiple values
(e.g., "--merge=parentds,ff-only").

Note that --follow=parentds is marked as incompatible with adjusted
branches, where changes are brought in with 'git annex sync <remote>'
rather than `git merge <target>`.

Closes datalad#4148.

Supersedes datalad#4118 (a hybrid approach where the recorded subdataset
revision would be taken as the merge target if the selected sibling
branch did not contain the recorded revision)
update() calls either 'git fetch <resolved sibling>' or (a rough
approximation of) 'git fetch --all', depending on whether a sibling is
specified, what upstream is configured for the current branch, and
whether there is only a single remote.  In the case of a subdataset,
either form may fail to bring down the revision that's recorded in the
parent.  This can happen based on the fetch configuration or because
the revision is not within a branch on the remote.  The latter is an
easy state to get into due to Git's tendency to detach in submodules.

When a merge with --follow=parentds is requested, the update will fail
because the merge target (the revision) isn't available locally.  Deal
with this the same way that 'git submodule update' does: if the first
fetch doesn't bring in the revision, try to fetch the revision
explicitly from the remote.

When on an adjusted branch, don't "merge" the explicitly fetch
revision because we use git-annex sync (for reasons described a few
commits back), which can't be expected to bring the revision in.
This is useful for cases where the caller wants to update their
dataset hierarchy and doesn't think there are any changes locally (and
wants to be alerted if there are).  When used alongside
--follow=parentds, this can handle simple cases in a way that produces
the same result as

    git pull --ff-only --no-recurse-submodules
    git submodule update --recursive

but advances the current branch rather than getting into a detached
state.

Closes datalad#3079.
@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Feb 21, 2020

I've pushed an update that changes --follow=parent to --follow=parentds and that makes some minor docstring (including the two noted typos).

range-diff
 1:  298a459aa =  1:  298a459aa RF: update: Break apart _update_repo() helper
 2:  02ce370da =  2:  02ce370da ENH: update: Use same merge approach for plain and annex repos
 3:  76a728351 =  3:  76a728351 ENH: update: Avoid calling 'git pull' to merge changes
 4:  2d6923e71 =  4:  2d6923e71 ENH: update: Don't set merge target to a guess that doesn't exist
 5:  3bf0ea157 =  5:  3bf0ea157 TST: update: Make more specific assertions about result counts
 6:  913d5eb55 =  6:  913d5eb55 ENH: update: Yield result for merge event
 7:  1abd042fc !  7:  94efa8637 NF: update: Add option to merge in recorded commit
    @@ Metadata
     Author: Kyle Meyer <kyle@kyleam.com>
     
      ## Commit message ##
    -    NF: update: Add option to merge in recorded commit
    +    NF: update: Add option to merge in recorded submodule commit
     
         `update --merge --recursive` tries to match the current branch with a
         branch on the sibling and merge in those changes.  That is, it's
    @@ Commit message
     
         In these cases, if the caller wants to reflect the state of the remote
         dataset hierarchy, the current behavior of update() won't do.  Add a
    -    new option, --follow=parent, that is more suited to the above
    +    new option, --follow=parentds, that is more suited to the above
         scenarios.  The new behavior is similar to
     
             git pull --no-recurse-submodules
    @@ Commit message
         to add a "fast-forward only" option.  That's orthogonal to the
         behavior being added in this commit, so use a separate option to avoid
         needing to come up with a way for --merge to take multiple values
    -    (e.g., "--merge=parent,ff-only").
    +    (e.g., "--merge=parentds,ff-only").
     
    -    Note that --follow=parent is marked as incompatible with adjusted
    +    Note that --follow=parentds is marked as incompatible with adjusted
         branches, where changes are brought in with 'git annex sync <remote>'
         rather than `git merge <target>`.
     
    @@ datalad/distribution/tests/test_update.py: def test_merge_conflict_in_subdataset
     +
     +
     +@with_tempfile(mkdir=True)
    -+def test_merge_follow_parent_subdataset_other_branch(path):
    ++def test_merge_follow_parentds_subdataset_other_branch(path):
     +    path = Path(path)
     +    ds_src = Dataset(path / "source").create()
     +    on_adjusted = ds_src.repo.is_managed_branch()
    @@ datalad/distribution/tests/test_update.py: def test_merge_conflict_in_subdataset
     +    ds_src.save(recursive=True)
     +    assert_repo_status(ds_src.path)
     +
    -+    res = ds_clone.update(merge=True, follow="parent", recursive=True,
    ++    res = ds_clone.update(merge=True, follow="parentds", recursive=True,
     +                          on_failure="ignore")
     +    if on_adjusted:
     +        # Our git-annex sync based on approach on adjusted branches is
    -+        # incompatible with follow='parent'.
    ++        # incompatible with follow='parentds'.
     +        assert_in_results(res, action="update", status="impossible")
     +        return
     +    else:
    @@ datalad/distribution/tests/test_update.py: def test_merge_conflict_in_subdataset
     +    (ds_src_subds.pathobj / "bar").write_text("bar content")
     +    ds_src.save(recursive=True)
     +    ds_clone_subds.repo.checkout("master", options=["-bnew"])
    -+    ds_clone.update(merge=True, follow="parent", recursive=True)
    ++    ds_clone.update(merge=True, follow="parentds", recursive=True)
     +    if not on_adjusted:
     +        eq_(ds_clone.repo.get_hexsha(), ds_src.repo.get_hexsha())
     
    @@ datalad/distribution/update.py: class Update(Interface):
                   code_cmd="datalad update -s <siblingname>"),
     -        dict(text="Update from a particular sibling and merge the obtained changes",
     +        dict(text="Update from a particular sibling and merge the changes "
    -+                  "from a configured or matching branch on the sibling "
    ++                  "from a configured or matching branch from the sibling "
     +                  "(see [CMD: --follow CMD][PY: `follow` PY] for details)",
                   code_py="update(sibling='siblingname', merge=True)",
                   code_cmd="datalad update --merge -s <siblingname>"),
    @@ datalad/distribution/update.py: class Update(Interface):
     +                  "subdatasets. For subdatasets, merge the revision "
     +                  "registered in the parent dataset into the current branch",
     +             code_py="update(sibling='origin', merge=True, "
    -+                     "follow='parent', recursive=True)",
    ++                     "follow='parentds', recursive=True)",
     +             code_cmd="datalad update -s origin --merge "
    -+                      "--follow=parent --recursive"),
    ++                      "--follow=parentds --recursive"),
          ]
      
          _params_ = dict(
    @@ datalad/distribution/update.py: class Update(Interface):
                  default sibling""", ),
     +        follow=Parameter(
     +            args=("--follow",),
    -+            constraints=EnsureChoice("sibling", "parent"),
    ++            constraints=EnsureChoice("sibling", "parentds"),
     +            doc="""source of updates for subdatasets. For 'sibling', the update
     +            will be done by merging in a branch from the (specified or
     +            inferred) sibling. The branch brought in will either be the current
     +            branch's configured branch, if it points to a branch that belongs
    -+            to the sibling, or a branch on the sibling branch with a name that
    -+            matches the current branch. For 'parent', the revision registered
    -+            in the parent of the subdataset is merged in. Note that the current
    -+            dataset is always updated according to 'sibling'. This option has
    -+            no effect unless a merge is requested and [CMD: --recursive
    -+            CMD][PY: recursive=True PY] is specified.""", ),
    ++            to the sibling, or a sibling branch with a name that matches the
    ++            current branch. For 'parentds', the revision registered in the
    ++            parent dataset of the subdataset is merged in. Note that the
    ++            current dataset is always updated according to 'sibling'. This
    ++            option has no effect unless a merge is requested and [CMD:
    ++            --recursive CMD][PY: recursive=True PY] is specified.""", ),
              recursive=recursion_flag,
              recursion_limit=recursion_limit,
              fetch_all=Parameter(
    @@ datalad/distribution/update.py: def __call__(
                  # a GitRepo to an AnnexRepo
                  repo = ds.repo
      
    -+            follow_parent = revision and follow == "parent"
    ++            follow_parent = revision and follow == "parentds"
     +            if follow_parent and not repo.commit_exists(revision):
     +                yield dict(
     +                    res, status="impossible",
    @@ datalad/distribution/update.py: def __call__(
     +                    if follow_parent:
     +                        yield dict(
     +                            res, status="impossible",
    -+                            message=("follow='parent' is incompatible "
    ++                            message=("follow='parentds' is incompatible "
     +                                     "with adjusted branches"))
     +                        continue
     +                elif merge_target is None:
 8:  a8446fb98 !  8:  dbd84e566 ENH: update: Try to fetch submodule commit explicitly if needed
    @@ Commit message
         the revision is not within a branch on the remote.  The latter is an
         easy state to get into due to Git's tendency to detach in submodules.
     
    -    When a merge with --follow=parent is requested, the update will fail
    +    When a merge with --follow=parentds is requested, the update will fail
         because the merge target (the revision) isn't available locally.  Deal
         with this the same way that 'git submodule update' does: if the first
         fetch doesn't bring in the revision, try to fetch the revision
    @@ Commit message
         commits back), which can't be expected to bring the revision in.
     
      ## datalad/distribution/tests/test_update.py ##
    -@@ datalad/distribution/tests/test_update.py: def test_merge_follow_parent_subdataset_other_branch(path):
    -     ds_clone.update(merge=True, follow="parent", recursive=True)
    +@@ datalad/distribution/tests/test_update.py: def test_merge_follow_parentds_subdataset_other_branch(path):
    +     ds_clone.update(merge=True, follow="parentds", recursive=True)
          if not on_adjusted:
              eq_(ds_clone.repo.get_hexsha(), ds_src.repo.get_hexsha())
     +
    @@ datalad/distribution/tests/test_update.py: def test_merge_follow_parent_subdatas
     +
     +
     +@with_tempfile(mkdir=True)
    -+def test_merge_follow_parent_subdataset_adjusted_warning(path):
    ++def test_merge_follow_parentds_subdataset_adjusted_warning(path):
     +    path = Path(path)
     +
     +    ds_src = Dataset(path / "source").create()
    @@ datalad/distribution/tests/test_update.py: def test_merge_follow_parent_subdatas
     +    assert_repo_status(ds_src.path)
     +
     +    assert_in_results(
    -+        ds_clone.update(merge=True, recursive=True, follow="parent",
    ++        ds_clone.update(merge=True, recursive=True, follow="parentds",
     +                        on_failure="ignore"),
     +        status="impossible",
     +        path=ds_clone_subds.path,
    @@ datalad/distribution/tests/test_update.py: def test_merge_follow_parent_subdatas
     +
     +
     +@with_tempfile(mkdir=True)
    -+def check_merge_follow_parent_subdataset_detached(on_adjusted, path):
    ++def check_merge_follow_parentds_subdataset_detached(on_adjusted, path):
     +    # Note: For the adjusted case, this is not much more than a smoke test that
     +    # on an adjusted branch we fail sensibly. The resulting state is not easy
     +    # to reason about nor desirable.
    @@ datalad/distribution/tests/test_update.py: def test_merge_follow_parent_subdatas
     +    ds_src.save(recursive=True)
     +    assert_repo_status(ds_src.path)
     +
    -+    res = ds_clone.update(merge=True, recursive=True, follow="parent",
    ++    res = ds_clone.update(merge=True, recursive=True, follow="parentds",
     +                          on_failure="ignore")
     +    if on_adjusted:
     +        # The top-level update is okay because there is no parent revision to
    @@ datalad/distribution/tests/test_update.py: def test_merge_follow_parent_subdatas
     +    # This is the default, but just in case:
     +    ds_src_s1.repo.config.set("uploadpack.allowAnySHA1InWant", "false",
     +                              where="local")
    -+    res = ds_clone.update(merge=True, recursive=True, follow="parent",
    ++    res = ds_clone.update(merge=True, recursive=True, follow="parentds",
     +                          on_failure="ignore")
     +    # The fetch with the explicit ref fails because it isn't advertised.
     +    assert_in_results(
    @@ datalad/distribution/tests/test_update.py: def test_merge_follow_parent_subdatas
     +    ds_clone_s1.repo.call_git(["branch", "--unset-upstream"])
     +    ds_clone_s1.config.reload(force=True)
     +    ds_clone_s1.repo.call_git(["remote", "add", "other", ds_src_s1.path])
    -+    res = ds_clone.update(recursive=True, follow="parent",
    ++    res = ds_clone.update(recursive=True, follow="parentds",
     +                          on_failure="ignore")
     +    # In this case, update() won't abort if we call with merge=False, but
     +    # it does if the revision wasn't brought down in the `fetch(all_=True)`
    @@ datalad/distribution/tests/test_update.py: def test_merge_follow_parent_subdatas
     +        action="update")
     +
     +
    -+def test_merge_follow_parent_subdataset_detached():
    -+    yield check_merge_follow_parent_subdataset_detached, True
    -+    yield check_merge_follow_parent_subdataset_detached, False
    ++def test_merge_follow_parentds_subdataset_detached():
    ++    yield check_merge_follow_parentds_subdataset_detached, True
    ++    yield check_merge_follow_parentds_subdataset_detached, False
     
      ## datalad/distribution/update.py ##
     @@ datalad/distribution/update.py: def __call__(
      
    -             follow_parent = revision and follow == "parent"
    +             follow_parent = revision and follow == "parentds"
                  if follow_parent and not repo.commit_exists(revision):
     -                yield dict(
     -                    res, status="impossible",
 9:  68d86ddd7 !  9:  eddf9be9f DOC: update: Explain how sibling is chosen for merge
    @@ datalad/distribution/update.py: class Update(Interface):
     +            configured remote for the current branch."""),
              follow=Parameter(
                  args=("--follow",),
    -             constraints=EnsureChoice("sibling", "parent"),
    +             constraints=EnsureChoice("sibling", "parentds"),
10:  3f810398c ! 10:  3a8edde4d NF: update: Support restricting merges to fast-forwards
    @@ Commit message
         This is useful for cases where the caller wants to update their
         dataset hierarchy and doesn't think there are any changes locally (and
         wants to be alerted if there are).  When used alongside
    -    --follow=parent, this can handle simple cases in a way that produces
    +    --follow=parentds, this can handle simple cases in a way that produces
         the same result as
     
             git pull --ff-only --no-recurse-submodules
    @@ datalad/distribution/tests/test_update.py: def test_merge_conflict_in_subdataset
     +
     +
      @with_tempfile(mkdir=True)
    - def test_merge_follow_parent_subdataset_other_branch(path):
    + def test_merge_follow_parentds_subdataset_other_branch(path):
          path = Path(path)
     
      ## datalad/distribution/update.py ##
    @@ datalad/distribution/update.py: class Update(Interface):
     +            configured remote for the current branch. By default, changes are
     +            fetched from the sibling but not merged into the current branch.
     +            With [CMD: --merge or --merge=any CMD][PY: merge=True or
    -+            merge="any" PY], the changes will merged into the current branch. A
    -+            value of 'ff-only' restricts the allowed merges to
    ++            merge="any" PY], the changes will be merged into the current
    ++            branch. A value of 'ff-only' restricts the allowed merges to
     +            fast-forwards."""),
              follow=Parameter(
                  args=("--follow",),
    -             constraints=EnsureChoice("sibling", "parent"),
    +             constraints=EnsureChoice("sibling", "parentds"),
     @@ datalad/distribution/update.py: def __call__(
                          is_annex=is_annex,
                          adjusted=is_annex and repo.is_managed_branch(curr_branch))

Copy link
Member

@yarikoptic yarikoptic left a comment

I don't think Windows failures are related -- but who knows, didn't analyze, we are just to turn a blind eye on those and keep merging ;-)

@kyleam kyleam merged commit 2342cb4 into datalad:master Feb 22, 2020
14 of 17 checks passed
@kyleam kyleam deleted the update-follow branch Feb 22, 2020
kyleam added a commit to kyleam/datalad that referenced this issue Apr 1, 2020
If update() is called while on an unborn branch, trying to merge in
changes will fail with a confusing "not something we can merge" error
from git, relayed by git-annex (see dataladgh-4347).  This isn't a situation
a caller is likely to be in knowingly, but it can happen easily if a
dataset was cloned from a repo where HEAD points to an unborn branch.
That, in turn, can happen easily because create_sibling(), when called
with a branch other than master checked out, will leave HEAD pointing
to master, which may not exist and, even it does, may never be
published to the sibling (dataladgh-4349).

Note that this is a maint-specific fix.  On master update() fails with
a reasonable "cannot determine merge target" (*), a change that came
with 2342cb4 (Merge pull request datalad#4167 from kyleam/update-follow,
2020-02-21).

Fixes dataladgh-4347, in the sense that a more understandable error is given.
The situation will be further improved in the rest of this series by
(1) teaching create_sibling() to point the sibling's HEAD to the ref
that is checked out in the local repo and (2) teaching clone() and
friends to try to check out another ref if the one pointed at by HEAD
doesn't exist.

(*) Unsurprisingly, this commit has non-trivial conflicts with master.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants