New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Teach rerun how to handle merges #2754
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2754 +/- ##
=========================================
+ Coverage 82.85% 83.16% +0.3%
=========================================
Files 273 274 +1
Lines 35577 35955 +378
=========================================
+ Hits 29479 29903 +424
+ Misses 6098 6052 -46
Continue to review full report at Codecov.
|
Exciting!!! Thanks for pushing this one forward. Left some minor comments, yet to try
datalad/interface/rerun.py
Outdated
res["commit"], "--no-patch", "--format=%B") | ||
dset.repo._git_custom_command( | ||
None, | ||
["git", "merge", "--no-ff", "-m", msg] + new_parents[1:], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it makes sense right away to reserve some options to allow to pass into merge eg to instruct on conflicts resolution strategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A similar comment could be made about cherry picking non-run commits. I don't plan to do that until the need arises and a convincing case is made.
datalad/interface/rerun.py
Outdated
# We uniquify because content may not change (depending on the | ||
# value of --onto). | ||
new_parents = unique( | ||
filter(None, (new_bases.get(p) for p in res["parents"]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, didn't know about None-- would have used bool
here. Good to know I guess
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you guess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is not a special behavior which isn't as clear (requires reading the function documentation) as just using bool
as the comparison/casting function here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will change.
datalad/interface/rerun.py
Outdated
lgr.warning( | ||
"Skipping merge %s because only one parent " | ||
"is in the specified revision range", | ||
res["commit"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, if I was to specify a range for rerunning - don't we still want to merge previous histories (as is, without rerunning) if that was done on original history?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Maybe? Another thing I need to think about more.
Time conservation notice: These are my notes from exploring (shortcomings of) the current implementation. You probably don't want to bother reading anything besides the summary below. Summary of problems with current implementation
DetailsIn the diagrams below, the following notation is used:
The commits in the diagrams are ordered by the commit date (i.e. how they would be ordered by Replay casesBelow are several cases of replaying a sequence of commits on a common base with [OK] No commits on left, run commit on right
Also recreated OK if there are stem run commits. [OK] Run commits left and rightEach side has a run commit.
Result:
[BAD] Non-run commit on left then run commit right"left" and "right" don't matter here. What is important is that the non-run commit is encountered first when walking the history with
Rerunning this gives
We want a merge, but [BAD] Run commit on left then non-run commit on rightAgain, "left" and "right" don't matter. What does it that we encounter the run commit before the other line's non-run commit.
Result:
We want a merge. Cherry pick needs to consider what its base should be ( [BAD] Non-run commits on left and right
Result:
This fails for the same reason as the previous one. We should identify the base for cherry pickng as Unrelated historiesIt'd be convenient if we could just barf if we hit orphan branches. Unfortunately, the crawler uses them, so it isn't unlikely that we'll be in a repo that has them. [BAD] Run commit on left then non-run commit on right
Result:
We need to recognize that the histories are unrelated and not fork [BAD] Non-run commit on left then run commit on right
Result:
We need to avoid unconditionally dropping stem non-run commits. Parents outside of the revision rangeIt's possible to specify a revision range to There are two possible behaviors in this case:
Looking at the simplest case
say we call The next simplest case is
Here In diagrams, favor
over
Rerun on topRunning commits on top of each other isn't as straightforward conceptually as
If we run Instead, if
That seems like the most reasonable thing to do. Likewise, if just
However, if the results from both
Notice that the fork happens at the wrong base. Octopus mergesOctopus merges have the same issues as standard merges, but I don't see any obvious additional issues. If they end up being a headache, I plan to drop support for them. But, at the moment, they don't seem to add much complexity. Merge conflictsIf a rerun hits a merge conflict, it will fail (in the same way that a rerun already fails if there's conflict in a cherry pick). We should be able to address this by dumping the current state (the remaining results list) to .git/datalad-rerun-merge or similar and then exiting. Once the user has resolved the merge, they could signal to continue with |
Kudos @kyleam |
I've pushed an update. Things are at a state where I'd appreciate people trying it out and letting me know what they think. The Travis runs are failing across the board. I haven't looked at it in detail, though I'll boldly claim it is unrelated (update: see #2769). |
3e82b38
to
63c9384
Compare
Could you please address the conflicts? Thx. |
f6e367c
to
b779aae
Compare
33d52f1
to
f5a6f27
Compare
I had a walk through the code and tests and it looks good to me. Thx for anticipating this need, it will likely be in the spotlight of our activities a few months from now. I am afraid that until then my concept of what we would need to be able to support will remain somewhat limited and theoretical. From this POV, however, this PR is exactly what is needed ;-)
The only criticism I have is that in the light of the movement away from GitPy, the tests should rather not introduce additional usage of the GitPy API.
@mih:
I had a walk through the code and tests and it looks good to me.
Thanks for taking a look!
Thx for anticipating this need, it will likely be in the spotlight of our activities a few months from now.
Interesting. Looking forward to seeing (or at least hearing) about what that involves.
I am afraid that until then my concept of what we would need to be able to support will remain somewhat limited and theoretical. From this POV, however, this PR is exactly what is needed ;-)
Yeah, I'm of course working from the same point :/ I was hoping (though not really expecting) that a user would jump in and say "hey, great, I need this".
The only criticism I have is that in the light of the movement away from GitPy, the tests should rather not introduce additional usage of the GitPy API.
Thanks for catching that. I cleaned up some of those in previous iterations, but I see that there are a few spots I missed. I'll deal with those when I rebase to resolve the recently emerged conflicts.
|
Done. range-diff 1: 76b25a9bc ! 1: ebe8d7f2b RF: rerun: Move intermediate variable into helper function
@@ -15,7 +15,7 @@
-def _revs_as_results(dset, revs):
+def _revrange_as_results(dset, revrange):
-+ revs = dset.repo.repo.git.rev_list("--reverse", revrange, "--").split()
++ revs = dset.repo.get_revisions(revrange, options=["--reverse"])
for rev in revs:
res = get_status_dict("run", ds=dset, commit=rev)
full_msg = dset.repo.format_commit("%B", rev)
@@ -23,7 +23,7 @@
In the standard case, the information in these results will be used to
actually re-execute the commands.
"""
-- revs = dset.repo.repo.git.rev_list("--reverse", revrange, "--").split()
+- revs = dset.repo.get_revisions(revrange, options=["--reverse"])
+
try:
- results = _revs_as_results(dset, revs)
2: 20c01d444 ! 2: 8433edf3d RF: rerun: Add parents to the revision results
@@ -12,15 +12,19 @@
def _revrange_as_results(dset, revrange):
-- revs = dset.repo.repo.git.rev_list("--reverse", revrange, "--").split()
+- revs = dset.repo.get_revisions(revrange, options=["--reverse"])
- for rev in revs:
- res = get_status_dict("run", ds=dset, commit=rev)
-+ out = dset.repo.repo.git.rev_list("--reverse", "--parents", revrange, "--")
-+ if not out:
++ rev_lines = dset.repo.get_revisions(
++ revrange, fmt="%H %P", options=["--reverse"])
++ if not rev_lines:
+ return
+
-+ for rev_line in out.splitlines():
-+ fields = rev_line.split(" ")
++ for rev_line in rev_lines:
++ # The strip() below is necessary because, with the format above, a
++ # commit without any parent has a trailing space. (We could also use a
++ # custom `rev-list --parents ...` call to avoid this.)
++ fields = rev_line.strip().split(" ")
+ rev, parents = fields[0], fields[1:]
+ res = get_status_dict("run", ds=dset, commit=rev, parents=parents)
full_msg = dset.repo.format_commit("%B", rev)
3: 438483951 ! 3: 0a9250f64 RF: rerun: List revisions with --topo-order
@@ -13,11 +13,10 @@
+++ b/datalad/interface/rerun.py
@@
-
def _revrange_as_results(dset, revrange):
-- out = dset.repo.repo.git.rev_list("--reverse", "--parents", revrange, "--")
-+ out = dset.repo.repo.git.rev_list("--reverse", "--parents", "--topo-order",
-+ revrange, "--")
- if not out:
+ rev_lines = dset.repo.get_revisions(
+- revrange, fmt="%H %P", options=["--reverse"])
++ revrange, fmt="%H %P", options=["--reverse", "--topo-order"])
+ if not rev_lines:
return
4: 6faaa305c = 4: 0f3a05a52 RF: rerun: Ignore multi-parent run commits
5: 9692e2551 ! 5: 32e5abd13 RF: rerun: Refuse to run root commits
@@ -55,17 +55,17 @@
- # Check out an orphan branch so that we can test the "one commit
- # in a repo" case.
ds.repo.checkout("orph", options=["--orphan"])
- ds.repo.repo.git.reset("--hard")
+ ds.repo._git_custom_command(None, ["git", "reset", "--hard"])
ds.repo.config.reload()
ds.run('echo static-content > static')
- assert_result_count(ds.repo.repo.git.rev_list("HEAD").split(), 1)
+ eq_(len(ds.repo.get_revisions("HEAD")), 1)
-
- # Rerunning with just one commit doesn't raise an error ...
- ds.rerun()
- # ... but we're still at one commit because the content didn't
- # change.
-- assert_result_count(ds.repo.repo.git.rev_list("HEAD").split(), 1)
+- eq_(len(ds.repo.get_revisions("HEAD")), 1)
-
- # We abort rather than trying to do anything when --onto='' and
- # --since='' are given together and the first commit contains a
6: 8d4855d7c = 6: 709e58677 DOC: rerun: Redefine --onto=''
7: 0b07c2358 = 7: e975b5639 RF: rerun: Unconditionally drop stem non-run commits
8: b99f6c7c3 = 8: e0212d7f8 RF: rerun: Resolve start point in "checkout" result
9: a6c0e232a = 9: abfb98411 RF: rerun: Rearrange a conditional
10: 9306b3a7f = 10: 57dce8be4 RF: rerun: Restructure an if-elif chain
11: 6b6cb6ee1 = 11: 274200267 RF: rerun: Pull out the result's hexsha into a variable sooner
12: e4a590307 = 12: 26f091f2c RF: rerun: Rename a variable
13: 6d6192598 = 13: 7fbc6e460 RF: rerun: Use a more general approach to decide "skip" or "pick"
14: 6e0b91e79 ! 14: a276c6410 ENH: rerun: Support rerunning history with merges
@@ -29,7 +29,7 @@
else:
revrange = "{}..{}".format(since, revision)
-- if ds.repo.repo.git.rev_list("--merges", revrange, "--"):
+- if ds.repo.get_revisions(revrange, options=["--merges"]):
- yield get_status_dict(
- "run", ds=ds, status="error",
- message="cannot rerun history with merge commits")
@@ -158,7 +158,7 @@
+ if branch_to_restore:
+ # The user asked us to replay the sequence onto a branch, but the
+ # history had merges, so we're in a detached state.
-+ dset.repo.repo.git.update_ref("refs/heads/" + branch_to_restore,
++ dset.repo.update_ref("refs/heads/" + branch_to_restore,
+ "HEAD")
+ dset.repo.checkout(branch_to_restore)
@@ -194,27 +194,25 @@
+ ds = Dataset(path).create()
+ ds.run("echo foo >>foo")
+ ds.run("echo invalid >>invalid")
-+ run_msg = ds.repo.repo.git.show(
-+ "--no-patch", "--format=%B", "HEAD")
++ run_msg = ds.repo.format_commit("%B")
+ run_hexsha = ds.repo.get_hexsha()
-+ ds.repo.repo.git.reset("--hard", "master~")
++ ds.repo._git_custom_command(None, ["git", "reset", "--hard", "master~"])
+ with open(op.join(ds.path, "non-run"), "w") as nrfh:
+ nrfh.write("non-run")
+ ds.save()
+ # Assign two parents to the invalid run commit.
-+ commit = ds.repo.repo.git.commit_tree(
-+ run_hexsha + "^{tree}",
-+ "-m", run_msg,
++ commit = ds.repo._git_custom_command(
++ None,
++ ["git", "commit-tree", run_hexsha + "^{tree}", "-m", run_msg,
+ "-p", run_hexsha + "^",
-+ "-p", ds.repo.get_hexsha())
-+ ds.repo.repo.git.reset("--hard", commit)
++ "-p", ds.repo.get_hexsha()])[0].strip()
++
++ ds.repo._git_custom_command(None, ["git", "reset", "--hard", commit])
+ hexsha_orig = ds.repo.get_hexsha()
+ with swallow_logs(new_level=logging.WARN) as cml:
+ ds.rerun(since="")
+ assert_in("has run information but is a merge commit", cml.out)
-+ assert_result_count(
-+ ds.repo.repo.git.rev_list(hexsha_orig + "..master").split(),
-+ 1)
++ eq_(len(ds.repo.get_revisions(hexsha_orig + "..master")), 1)
+
+
@known_failure_windows |
Good to go from my POV. As I said, this looks like a good thing to have and likely needs to be revisited once speculation about features can be replaced empirical evidence. |
Thanks. I'll plan to merge this after 0.12.0rc5 unless others raise objections in the meantime. |
We don't use this outside of this function, and the next commit will complicate the rev-list return value, so it'll be cleaner to handle that processing within the helper function.
We'll need this information to rerun revision ranges that include merges.
In preparation for supporting rerunning merges, use the --topo-order flag so that commits on the same line are grouped. This will make it easier to process the revisions because we know that if we encounter a commit on a different line, we can switch to processing that line without keeping track of the previous one.
In order for these to occur, the user would have to do something like 'datalad run git merge topic'. Don't support that because (1) it would make it trickier to implement the logic for rerunning a revision range with merges and (2) it's not clear that recording a merge as a run command would ever be a sensible thing to do.
We abort if --since='' and --onto='' are given and there is a run commit at the root of the history. The motivation is that we're very unlikely to encounter a repo that has a run as the first commit, so it isn't worth the complication to support, particularly when a branch isn't specified. When rerun is extended to handle merges, it will be possible to have multiple roots if there is a merge of unrelated branches. To make the logic and processing easier, refuse to run all root commits.
In preparation for teaching rerun to handle merges, we need to clarify what --onto='' means. The current definition is the value of --since. Because we don't allow merges, that identifies an appropriate base. But if the revision range is allowed to include merges, that could give a less intuitive base. For example, say we call 'datalad rerun --onto= --since=c d' with the history below. d | M |\ b c |/ a | . The revision list is 'c..d', so [b, M, d] would be replayed. The previous definition of --onto would identify c as the base to run b on top of. A rule that gives the same answer in the "no merges" case but a better one in the "merges" case is "use the first parent of the first revision with a run command".
We drop stem non-run commits when --since= is given, but not if --since=REV is given. In all cases, however, these are going to be skipped, so simplify things and always drop them.
The user can pass in any commit-ish. Resolve it to a hexsha so downstream code receives something predictable.
Move "skip" adjacent to its counterpart "pick".
To handle rerunning merges, we're going to need to do processing after the if/elif block, but that processing will only apply to results without a rerun action or with an action of "checkout".
There are a few different oid's floating around in this loop. Make it clearer that "hexsha" refers to the current result's.
If we assume a linear chain of commits, we can simply partition the list into those that are an ancestor of the start point ("skip") and those that aren't ("pick"). This logic breaks down in the non-linear case, so we need another approach to support rerunning merges. Instead, skip the commit if it is reachable from the current head or the start point, and cherry pick it otherwise. Note: This approach assumes the commits are yielded according to --topo-order. A downside of this approach is that the output of --report is less precise: we say "skip-or-pick" because we don't know yet which it will be.
Many of the pieces are in place with the preceding commits. The main thing here is to create a map of old bases to new bases so that we can re-execute a commit on the right base. We make branch management easy by not doing it. Instead, we just work in a detached state and, if the user specifies a branch, we point that branch to HEAD at the end. We don't yet handle merge conflicts. Those, along with cherry pick conflicts, will be addressed later. Closes datalad#2093.
Tagging as WIP because I want to test it out a bit more.
- [ ] how should we handle cases where a merge is in the revision range but only one of the parents is?- [ ] handle merges of unrelated branchesUpdate: Lots to do in a reroll. See #2754 (comment)