Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Teach rerun how to handle merges #2754

Merged
merged 14 commits into from Sep 6, 2019
Merged

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Aug 9, 2018

Tagging as WIP because I want to test it out a bit more.

- [ ] how should we handle cases where a merge is in the revision range but only one of the parents is?
- [ ] handle merges of unrelated branches

Update: Lots to do in a reroll. See #2754 (comment)

@kyleam kyleam added the WIP label Aug 9, 2018
@codecov
Copy link

@codecov codecov bot commented Aug 9, 2018

Codecov Report

Merging #2754 into master will increase coverage by 0.3%.
The diff coverage is 85.34%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #2754     +/-   ##
=========================================
+ Coverage   82.85%   83.16%   +0.3%     
=========================================
  Files         273      274      +1     
  Lines       35577    35955    +378     
=========================================
+ Hits        29479    29903    +424     
+ Misses       6098     6052     -46
Impacted Files Coverage Δ
datalad/interface/tests/test_rerun_merges.py 100% <100%> (ø)
datalad/interface/tests/test_rerun.py 100% <100%> (ø) ⬆️
datalad/interface/rerun.py 62.18% <35.41%> (-11.41%) ⬇️
datalad/support/gitrepo.py 76.56% <0%> (+0.46%) ⬆️
datalad/support/network.py 82.52% <0%> (+0.68%) ⬆️
datalad/interface/results.py 92.74% <0%> (+0.8%) ⬆️
datalad/distribution/get.py 79.11% <0%> (+1.2%) ⬆️
datalad/core/local/run.py 96.15% <0%> (+1.44%) ⬆️
datalad/distribution/remove.py 73.22% <0%> (+1.57%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d0e81fb...9521691. Read the comment docs.

Copy link
Member

@yarikoptic yarikoptic left a comment

Exciting!!! Thanks for pushing this one forward. Left some minor comments, yet to try

datalad/interface/rerun.py Outdated Show resolved Hide resolved
datalad/interface/tests/test_run.py Outdated Show resolved Hide resolved
res["commit"], "--no-patch", "--format=%B")
dset.repo._git_custom_command(
None,
["git", "merge", "--no-ff", "-m", msg] + new_parents[1:],
Copy link
Member

@yarikoptic yarikoptic Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes sense right away to reserve some options to allow to pass into merge eg to instruct on conflicts resolution strategy

Copy link
Contributor Author

@kyleam kyleam Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar comment could be made about cherry picking non-run commits. I don't plan to do that until the need arises and a convincing case is made.

# We uniquify because content may not change (depending on the
# value of --onto).
new_parents = unique(
filter(None, (new_bases.get(p) for p in res["parents"])))
Copy link
Member

@yarikoptic yarikoptic Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, didn't know about None-- would have used bool here. Good to know I guess

Copy link
Contributor Author

@kyleam kyleam Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you guess?

Copy link
Member

@yarikoptic yarikoptic Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is not a special behavior which isn't as clear (requires reading the function documentation) as just using bool as the comparison/casting function here

Copy link
Contributor Author

@kyleam kyleam Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will change.

lgr.warning(
"Skipping merge %s because only one parent "
"is in the specified revision range",
res["commit"])
Copy link
Member

@yarikoptic yarikoptic Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, if I was to specify a range for rerunning - don't we still want to merge previous histories (as is, without rerunning) if that was done on original history?

Copy link
Contributor Author

@kyleam kyleam Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Maybe? Another thing I need to think about more.

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Aug 10, 2018

Time conservation notice: These are my notes from exploring (shortcomings of) the current implementation. You probably don't want to bother reading anything besides the summary below.

Summary of problems with current implementation

  • We need to consider the base when cherry picking.

  • We drop leading commits.

    This is sensible in the non-merge case, but we can't do it unconditionally in the merge case.

  • We don't detect unrelated histories. Instead they get recreated as though they forked from a common ancestor.

  • If a merge is included in the revision range but a parent isn't, we should substitute the parent as is when we merge the reran parents.

  • If we're rerunning commits on top of themselves, we don't find the proper base when rerunning commits on two different lines results in changes.

Details

In the diagrams below, the following notation is used:

  • x_n: commit x without a run command
  • x_r: commit x with a run command
  • x_C: commit x_n was cherry picked on rerun
  • x_M: merge commit x_n was remerged on rerun
  • x_R: run commit x_r was reran

The commits in the diagrams are ordered by the commit date (i.e. how they would be ordered by git rev-list by default and unlike the default --topo-order implied by git log --graph.)

Replay cases

Below are several cases of replaying a sequence of commits on a common base with datalad rerun --since= --onto=. This isn't an exhaustive list of cases, but I've tried to boil it down to ones that show different core problems of the current implementation. They're labeled with "[OK]" if the current implementation handles them "correctly" and "[BAD]" otherwise.

[OK] No commits on left, run commit on right

o                 c_n
|\
| o               b_r
|/
o                 a_n

rerun recreates this fine:

o                 c_M
|\
| o               b_R
|/
o                 a_n

Also recreated OK if there are stem run commits.

[OK] Run commits left and right

Each side has a run commit.

o                 d_n
|\
o |               c_r
| o               b_r
|/
o                 a_n

Result:

o                 d_M
|\
o |               c_R
| o               b_R
|/
o                 a_n

[BAD] Non-run commit on left then run commit right

"left" and "right" don't matter here. What is important is that the non-run commit is encountered first when walking the history with git rev-list --reverse.

o                 d_n
|\
| o               c_r
o |               b_n
|/
o                 a_n

Rerunning this gives

o                 c_R
o                 a_n

We want a merge, but b_n is dropped because of the "drop non-run stem commits" step. If there were a rerun commit before the fork, the merge would be replayed fine.

[BAD] Run commit on left then non-run commit on right

Again, "left" and "right" don't matter. What does it that we encounter the run commit before the other line's non-run commit.

o                 d_n
|\
o |               c_n
| o               b_r
|/
o                 a_n

Result:

o                 c_C
o                 b_R
o                 a_n

We want a merge. Cherry pick needs to consider what its base should be (b_R). Unlike the previous case, stem run commits wouldn't help here.

[BAD] Non-run commits on left and right

o                 e_n
|\
o |               d_n
| o               c_n
|/
o                 b_r
o                 a_n

Result:

o                 d_C
o                 c_C
o                 b_R
o                 a_n

This fails for the same reason as the previous one. We should identify the base for cherry pickng as b_R.

Unrelated histories

It'd be convenient if we could just barf if we hit orphan branches. Unfortunately, the crawler uses them, so it isn't unlikely that we'll be in a repo that has them.

[BAD] Run commit on left then non-run commit on right
o                 d_n
|\
| o               c_n
o                 b_r
o                 a_n

Result:

o                 d_M
|\
| o               c_C
|/
o                 b_R
o                 a_n

We need to recognize that the histories are unrelated and not fork c_C from b_R.

[BAD] Non-run commit on left then run commit on right
o                 e_n
|\
| o               d_r
| o               c_n
o                 b_n
o                 a_n

Result:

o                 d_R
o                 c_n

We need to avoid unconditionally dropping stem non-run commits.

Parents outside of the revision range

It's possible to specify a revision range to rerun that doesn't include all parents of the merge.

There are two possible behaviors in this case:

  • Don't do the merge (or in the case of an octopus merge, drop the excluded parents from the merge).

  • Merge in the original parent.

Looking at the simplest case

o                 c_n
|\
| o               b_r
|/
o                 a_n

say we call datalad rerun --onto= --since=b_r c_n. That translates to running b_r..c_c_n (i.e. [a_n, c_n]). Neither of those commits are run commits, so the only option is to report that no run commands were found.

The next simplest case is

o                 d_n
|\
| o               c_r
o |               b_r
|/
o                 a_n

Here c_n..d_n would include a run commit, b_r. Going with the first behavior (dropping the c_r line) seems like it'd be a mistake because a commit after d_n may depend on something in the history of c_r. The second behavior (merging in c_r) avoids this. Also, doing the merge doesn't seem to go against the revision range restriction, since that can be viewed as a specification of which commits to rerun, not which commits to include in the final history.

In diagrams, favor

o                 d_M
|\
| o               c_r
o |               b_R
|/
o                 a_n

over

o                 b_R
o                 a_n

Rerun on top

Running commits on top of each other isn't as straightforward conceptually as --onto=''. We can't expect the history to always be recreated because, if rerunning the commit doesn't change the content, there's nothing to commit.

o                 d_c
|\
| o               c_r
o |               b_r
|/
o                 a_n

If we run datalad rerun --since= and both b_r and c_r don't have any changes, we'll remain on d_c with no changes. Trying to merge c_r would be a noop. (The current implementation gives a misleading message, but is otherwise OK here.)

Instead, if c_r has a change on rerun, the current implementation ends up with

o                 d_M
|\
| o               c_R
|/
o                 d_c
|\
| o               c_r
o |               b_r
|/
o                 a_n

That seems like the most reasonable thing to do. Likewise, if just b_r changes on rerun, we get something reasonable:

o                 b_R
o                 d_c
|\
| o               c_r
o |               b_r
|/
o                 a_n

However, if the results from both b_r and c_r change on rerun, we do the wrong thing:

o                 d_M
|\
| o               c_R
|/
o                 b_R
o                 d_c
|\
| o               c_r
o |               b_r
|/
o                 a_n

Notice that the fork happens at the wrong base.

Octopus merges

Octopus merges have the same issues as standard merges, but I don't see any obvious additional issues. If they end up being a headache, I plan to drop support for them. But, at the moment, they don't seem to add much complexity.

Merge conflicts

If a rerun hits a merge conflict, it will fail (in the same way that a rerun already fails if there's conflict in a cherry pick). We should be able to address this by dumping the current state (the remaining results list) to .git/datalad-rerun-merge or similar and then exiting. Once the user has resolved the merge, they could signal to continue with datalad rerun --continue (related issue).

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Aug 10, 2018

Kudos @kyleam

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Aug 14, 2018

I've pushed an update. Things are at a state where I'd appreciate people trying it out and letting me know what they think.

The Travis runs are failing across the board. I haven't looked at it in detail, though I'll boldly claim it is unrelated (update: see #2769).

@kyleam kyleam force-pushed the enh-run-merges branch 2 times, most recently from 3e82b38 to 63c9384 Compare Sep 17, 2018
@kyleam kyleam removed the WIP label Sep 28, 2018
@mih
Copy link
Member

@mih mih commented Oct 7, 2018

Could you please address the conflicts? Thx.

Copy link
Member

@mih mih left a comment

I had a walk through the code and tests and it looks good to me. Thx for anticipating this need, it will likely be in the spotlight of our activities a few months from now. I am afraid that until then my concept of what we would need to be able to support will remain somewhat limited and theoretical. From this POV, however, this PR is exactly what is needed ;-)

The only criticism I have is that in the light of the movement away from GitPy, the tests should rather not introduce additional usage of the GitPy API.

@mih mih removed their assignment Jul 17, 2019
@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Jul 17, 2019

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Jul 17, 2019

I'll deal with those when I rebase to resolve the recently emerged conflicts.

Done.

range-diff
 1:  76b25a9bc !  1:  ebe8d7f2b RF: rerun: Move intermediate variable into helper function
    @@ -15,7 +15,7 @@
      
     -def _revs_as_results(dset, revs):
     +def _revrange_as_results(dset, revrange):
    -+    revs = dset.repo.repo.git.rev_list("--reverse", revrange, "--").split()
    ++    revs = dset.repo.get_revisions(revrange, options=["--reverse"])
          for rev in revs:
              res = get_status_dict("run", ds=dset, commit=rev)
              full_msg = dset.repo.format_commit("%B", rev)
    @@ -23,7 +23,7 @@
          In the standard case, the information in these results will be used to
          actually re-execute the commands.
          """
    --    revs = dset.repo.repo.git.rev_list("--reverse", revrange, "--").split()
    +-    revs = dset.repo.get_revisions(revrange, options=["--reverse"])
     +
          try:
     -        results = _revs_as_results(dset, revs)
 2:  20c01d444 !  2:  8433edf3d RF: rerun: Add parents to the revision results
    @@ -12,15 +12,19 @@
      
      
      def _revrange_as_results(dset, revrange):
    --    revs = dset.repo.repo.git.rev_list("--reverse", revrange, "--").split()
    +-    revs = dset.repo.get_revisions(revrange, options=["--reverse"])
     -    for rev in revs:
     -        res = get_status_dict("run", ds=dset, commit=rev)
    -+    out = dset.repo.repo.git.rev_list("--reverse", "--parents", revrange, "--")
    -+    if not out:
    ++    rev_lines = dset.repo.get_revisions(
    ++        revrange, fmt="%H %P", options=["--reverse"])
    ++    if not rev_lines:
     +        return
     +
    -+    for rev_line in out.splitlines():
    -+        fields = rev_line.split(" ")
    ++    for rev_line in rev_lines:
    ++        # The strip() below is necessary because, with the format above, a
    ++        # commit without any parent has a trailing space. (We could also use a
    ++        # custom `rev-list --parents ...` call to avoid this.)
    ++        fields = rev_line.strip().split(" ")
     +        rev, parents = fields[0], fields[1:]
     +        res = get_status_dict("run", ds=dset, commit=rev, parents=parents)
              full_msg = dset.repo.format_commit("%B", rev)
 3:  438483951 !  3:  0a9250f64 RF: rerun: List revisions with --topo-order
    @@ -13,11 +13,10 @@
      +++ b/datalad/interface/rerun.py
     @@
      
    - 
      def _revrange_as_results(dset, revrange):
    --    out = dset.repo.repo.git.rev_list("--reverse", "--parents", revrange, "--")
    -+    out = dset.repo.repo.git.rev_list("--reverse", "--parents", "--topo-order",
    -+                                      revrange, "--")
    -     if not out:
    +     rev_lines = dset.repo.get_revisions(
    +-        revrange, fmt="%H %P", options=["--reverse"])
    ++        revrange, fmt="%H %P", options=["--reverse", "--topo-order"])
    +     if not rev_lines:
              return
      
 4:  6faaa305c =  4:  0f3a05a52 RF: rerun: Ignore multi-parent run commits
 5:  9692e2551 !  5:  32e5abd13 RF: rerun: Refuse to run root commits
    @@ -55,17 +55,17 @@
     -    # Check out an orphan branch so that we can test the "one commit
     -    # in a repo" case.
          ds.repo.checkout("orph", options=["--orphan"])
    -     ds.repo.repo.git.reset("--hard")
    +     ds.repo._git_custom_command(None, ["git", "reset", "--hard"])
          ds.repo.config.reload()
      
          ds.run('echo static-content > static')
    -     assert_result_count(ds.repo.repo.git.rev_list("HEAD").split(), 1)
    +     eq_(len(ds.repo.get_revisions("HEAD")), 1)
     -
     -    # Rerunning with just one commit doesn't raise an error ...
     -    ds.rerun()
     -    # ... but we're still at one commit because the content didn't
     -    # change.
    --    assert_result_count(ds.repo.repo.git.rev_list("HEAD").split(), 1)
    +-    eq_(len(ds.repo.get_revisions("HEAD")), 1)
     -
     -    # We abort rather than trying to do anything when --onto='' and
     -    # --since='' are given together and the first commit contains a
 6:  8d4855d7c =  6:  709e58677 DOC: rerun: Redefine --onto=''
 7:  0b07c2358 =  7:  e975b5639 RF: rerun: Unconditionally drop stem non-run commits
 8:  b99f6c7c3 =  8:  e0212d7f8 RF: rerun: Resolve start point in "checkout" result
 9:  a6c0e232a =  9:  abfb98411 RF: rerun: Rearrange a conditional
10:  9306b3a7f = 10:  57dce8be4 RF: rerun: Restructure an if-elif chain
11:  6b6cb6ee1 = 11:  274200267 RF: rerun: Pull out the result's hexsha into a variable sooner
12:  e4a590307 = 12:  26f091f2c RF: rerun: Rename a variable
13:  6d6192598 = 13:  7fbc6e460 RF: rerun: Use a more general approach to decide "skip" or "pick"
14:  6e0b91e79 ! 14:  a276c6410 ENH: rerun: Support rerunning history with merges
    @@ -29,7 +29,7 @@
              else:
                  revrange = "{}..{}".format(since, revision)
      
    --        if ds.repo.repo.git.rev_list("--merges", revrange, "--"):
    +-        if ds.repo.get_revisions(revrange, options=["--merges"]):
     -            yield get_status_dict(
     -                "run", ds=ds, status="error",
     -                message="cannot rerun history with merge commits")
    @@ -158,7 +158,7 @@
     +    if branch_to_restore:
     +        # The user asked us to replay the sequence onto a branch, but the
     +        # history had merges, so we're in a detached state.
    -+        dset.repo.repo.git.update_ref("refs/heads/" + branch_to_restore,
    ++        dset.repo.update_ref("refs/heads/" + branch_to_restore,
     +                             "HEAD")
     +        dset.repo.checkout(branch_to_restore)
      
    @@ -194,27 +194,25 @@
     +    ds = Dataset(path).create()
     +    ds.run("echo foo >>foo")
     +    ds.run("echo invalid >>invalid")
    -+    run_msg = ds.repo.repo.git.show(
    -+        "--no-patch", "--format=%B", "HEAD")
    ++    run_msg = ds.repo.format_commit("%B")
     +    run_hexsha = ds.repo.get_hexsha()
    -+    ds.repo.repo.git.reset("--hard", "master~")
    ++    ds.repo._git_custom_command(None, ["git", "reset", "--hard", "master~"])
     +    with open(op.join(ds.path, "non-run"), "w") as nrfh:
     +        nrfh.write("non-run")
     +    ds.save()
     +    # Assign two parents to the invalid run commit.
    -+    commit = ds.repo.repo.git.commit_tree(
    -+        run_hexsha + "^{tree}",
    -+        "-m", run_msg,
    ++    commit = ds.repo._git_custom_command(
    ++        None,
    ++        ["git", "commit-tree", run_hexsha + "^{tree}", "-m", run_msg,
     +         "-p", run_hexsha + "^",
    -+        "-p", ds.repo.get_hexsha())
    -+    ds.repo.repo.git.reset("--hard", commit)
    ++         "-p", ds.repo.get_hexsha()])[0].strip()
    ++
    ++    ds.repo._git_custom_command(None, ["git", "reset", "--hard", commit])
     +    hexsha_orig = ds.repo.get_hexsha()
     +    with swallow_logs(new_level=logging.WARN) as cml:
     +        ds.rerun(since="")
     +        assert_in("has run information but is a merge commit", cml.out)
    -+    assert_result_count(
    -+        ds.repo.repo.git.rev_list(hexsha_orig + "..master").split(),
    -+        1)
    ++    eq_(len(ds.repo.get_revisions(hexsha_orig + "..master")), 1)
     +
     +
      @known_failure_windows

mih
mih approved these changes Jul 18, 2019
@mih
Copy link
Member

@mih mih commented Jul 18, 2019

Good to go from my POV. As I said, this looks like a good thing to have and likely needs to be revisited once speculation about features can be replaced empirical evidence.

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Jul 18, 2019

Good to go from my POV.

Thanks. I'll plan to merge this after 0.12.0rc5 unless others raise objections in the meantime.

kyleam added 14 commits Sep 6, 2019
We don't use this outside of this function, and the next commit will
complicate the rev-list return value, so it'll be cleaner to handle
that processing within the helper function.
We'll need this information to rerun revision ranges that include
merges.
In preparation for supporting rerunning merges, use the --topo-order
flag so that commits on the same line are grouped.  This will make it
easier to process the revisions because we know that if we encounter a
commit on a different line, we can switch to processing that line
without keeping track of the previous one.
In order for these to occur, the user would have to do something like
'datalad run git merge topic'.  Don't support that because (1) it
would make it trickier to implement the logic for rerunning a revision
range with merges and (2) it's not clear that recording a merge as a
run command would ever be a sensible thing to do.
We abort if --since='' and --onto='' are given and there is a run
commit at the root of the history.  The motivation is that we're very
unlikely to encounter a repo that has a run as the first commit, so it
isn't worth the complication to support, particularly when a branch
isn't specified.

When rerun is extended to handle merges, it will be possible to have
multiple roots if there is a merge of unrelated branches.  To make the
logic and processing easier, refuse to run all root commits.
In preparation for teaching rerun to handle merges, we need to clarify
what --onto='' means.  The current definition is the value of --since.
Because we don't allow merges, that identifies an appropriate base.
But if the revision range is allowed to include merges, that could
give a less intuitive base.

For example, say we call 'datalad rerun --onto= --since=c d' with the
history below.

    d
    |
    M
    |\
    b c
    |/
    a
    |
    .

The revision list is 'c..d', so [b, M, d] would be replayed.  The
previous definition of --onto would identify c as the base to run b on
top of.

A rule that gives the same answer in the "no merges" case but a better
one in the "merges" case is "use the first parent of the first
revision with a run command".
We drop stem non-run commits when --since= is given, but not if
--since=REV is given.  In all cases, however, these are going to be
skipped, so simplify things and always drop them.
The user can pass in any commit-ish.  Resolve it to a hexsha so
downstream code receives something predictable.
Move "skip" adjacent to its counterpart "pick".
To handle rerunning merges, we're going to need to do processing after
the if/elif block, but that processing will only apply to results
without a rerun action or with an action of "checkout".
There are a few different oid's floating around in this loop.  Make it
clearer that "hexsha" refers to the current result's.
If we assume a linear chain of commits, we can simply partition the
list into those that are an ancestor of the start point ("skip") and
those that aren't ("pick").

This logic breaks down in the non-linear case, so we need another
approach to support rerunning merges.  Instead, skip the commit if it
is reachable from the current head or the start point, and cherry pick
it otherwise.  Note: This approach assumes the commits are yielded
according to --topo-order.

A downside of this approach is that the output of --report is less
precise: we say "skip-or-pick" because we don't know yet which it will
be.
Many of the pieces are in place with the preceding commits.  The main
thing here is to create a map of old bases to new bases so that we can
re-execute a commit on the right base.  We make branch management easy
by not doing it.  Instead, we just work in a detached state and, if
the user specifies a branch, we point that branch to HEAD at the end.

We don't yet handle merge conflicts.  Those, along with cherry pick
conflicts, will be addressed later.

Closes datalad#2093.
@kyleam kyleam merged commit 9521691 into datalad:master Sep 6, 2019
5 checks passed
@kyleam kyleam deleted the enh-run-merges branch Sep 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants