[RFC] diff: add diff.<driver>.process for external hunk providers#2120
Open
mmontalbo wants to merge 5 commits into
Open
[RFC] diff: add diff.<driver>.process for external hunk providers#2120mmontalbo wants to merge 5 commits into
diff.<driver>.process for external hunk providers#2120mmontalbo wants to merge 5 commits into
Conversation
Add two new xpparam_t fields (external_hunks, external_hunks_nr) that let callers supply pre-computed hunks. When set, xdl_diff() populates the changed[] arrays from these hunks instead of running the diff algorithm, then continues through compaction and emission as usual. Validate supplied hunks before use: reject out-of-bounds line numbers, overlapping or out-of-order hunks, negative counts, and violations of the synchronization invariant (unchanged line counts must match between files). On validation failure, fall back to the builtin diff algorithm. Skip trim_common_tail() in xdi_diff() when external hunks are present, since external hunks reference line numbers in the original content. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Add a new per-driver configuration key that specifies the command for a long-running diff process. The name follows filter.<driver>.process: a long-running subprocess that stays alive across files within a single git invocation. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Add support for external diff processes that communicate via the
long-running process protocol (pkt-line over stdin/stdout).
A diff process is configured per userdiff driver:
[diff "cdiff"]
process = /path/to/diff-tool
The tool receives file pairs and returns hunks describing which
lines changed. Git feeds these hunks into the standard xdiff
pipeline, so all output features (word diff, function context,
color) work normally.
The handshake negotiates version=1 and capability=hunks. Per-file
requests send command=hunks, pathname, and both file contents as
packetized data. The tool responds with hunk lines and a status
packet. On error, git falls back to the builtin diff algorithm.
Zero hunks with status=success means the tool considers the files
equivalent. Git skips diff output for that file entirely.
Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
When a diff process is configured via diff.<driver>.process, consult it during blame's per-commit diffing. If the process returns zero hunks for a commit's changes to a file, treat the commit as having no changes, causing blame to attribute lines to earlier commits. The subprocess is long-running (one startup cost amortized across the blame traversal), but each commit in the file's history incurs a round-trip to the tool. Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Author
|
/preview |
|
Preview email sent as pull.2120.git.1779415095.gitgitgadget@gmail.com |
diff.<driver>.process for external hunk providers
Author
|
/preview |
|
Preview email sent as pull.2120.git.1779415483.gitgitgadget@gmail.com |
Add git diff-process-normalize, a built-in diff process that
detects whitespace-only changes. It compares files line by line
using xdiff_compare_lines() with XDF_IGNORE_WHITESPACE (same
logic as "git diff -w"). If all lines match, it returns zero
hunks; otherwise it returns an error so git falls back to the
builtin diff algorithm.
[diff "cdiff"]
process = git diff-process-normalize
Update documentation to describe zero-hunk behavior for diff
and blame, and document the built-in normalize tool.
Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
dfcb1c3 to
8c7359b
Compare
Author
|
/submit |
|
Submitted as pull.2120.git.1779415884.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): "Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
> This series adds diff.<driver>.process, a long-running subprocess protocol
> that lets external tools provide hunks to git's diff and blame pipelines.
>
> Over the past 18 years, git's diff pipeline accumulated many features that
> operate on hunks: word diff, function context, color-moved, indent
> heuristic, blame. External tools can replace the pipeline entirely
> (diff.<driver>.command) or select among builtin algorithms
> (diff.<driver>.algorithm), but there is no way for a tool to provide
> line-change information into the pipeline. Tools that understand code
> structure (tree-sitter parsers, format-aware analyzers, tools like
> Difftastic and Mergiraf) must bypass git's pipeline and lose access to
> everything downstream.
>
> The protocol follows filter.<driver>.process: pkt-line over stdin/stdout,
> capability negotiation, one tool invocation per git command. The tool
> receives file pairs and returns hunk descriptors that git feeds into the
> standard xdiff pipeline. All output features work normally.
>
> Zero hunks with status=success means the tool considers the files
> equivalent. git diff shows no output for the file, and git blame skips the
> commit, attributing lines to earlier commits.
>
> On error or tool crash, git falls back silently to the builtin diff
> algorithm. The feature is opt-in via diff.<driver>.process and
> .gitattributes; unconfigured files are unaffected.
>
> The series includes git diff-process-normalize, a built-in tool that
> compares files line by line ignoring whitespace (same logic as "git diff -w"
> via xdiff_compare_lines):
Interesting.
If the goal is purely to normalize content before comparison
(e.g. stripping comments or canonicalizing formatting), we already
have the `textconv` mechanism. While `textconv` is a "one-shot"
per-file process, it is significantly simpler.
I suspect, however, that the primary focus here is to allow external
tools to provide structural alignment (e.g. for AST- aware diffs
like Difftastic or Mergiraf) without losing the original content in
the display. Unlike `textconv`, which transforms the text the user
sees, this protocol lets the display remain identical to the source
while using a custom engine for the line-matching logic.
If that is the intent, it should be stated more explicitly in the
documentation and commit messages. The "whitespace-normalize"
demonstration in [PATCH 5/5] is misleading because it's exactly the
case where `textconv` would be sufficient.
I am afraid that the use of a long-running subprocess for every
diff/blame invocation adds significant complexity and overhead. In
particular, wouldn't the `blame` implementation performs a
round-trip to the subprocess for every commit in the history? Even
with a persistent process, the overhead of serializing and
deserializing the entire file content twice (old and new) for every
commit could be prohibitive for large files or deep histories.
So, I dunno. |
| @@ -124,7 +124,12 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co | |||
| if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE) | |||
There was a problem hiding this comment.
Junio C Hamano wrote on the Git mailing list (how to reply to this email):
"Michael Montalbo via GitGitGadget" <gitgitgadget@gmail.com> writes:
> +/*
> + * Populate the changed[] arrays from externally supplied hunks,
> + * bypassing the diff algorithm. Validates that hunks are in order,
> + * non-overlapping, and within bounds.
> + *
> + * Returns 0 on success, -1 on validation failure.
> + */
> +static int xdl_populate_hunks_from_external(xdfenv_t *xe,
> + const struct xdl_hunk *hunks,
> + size_t nr_hunks)
> +{
> + size_t i;
> + long j, prev_old_end = 0, prev_new_end = 0;
> + long total_old = 0, total_new = 0;
> +
> + /*
> + * Clear changed[] arrays. xdl_prepare_env() may have dirtied
> + * them via xdl_cleanup_records(). The allocation is nrec + 2
> + * elements; changed points one past the start (see xprepare.c).
> + */
> + memset(xe->xdf1.changed - 1, 0,
> + (xe->xdf1.nrec + 2) * sizeof(bool));
> + memset(xe->xdf2.changed - 1, 0,
> + (xe->xdf2.nrec + 2) * sizeof(bool));
This, especially the starting offset of -1, looks horrible. The
internal layout of xdfenv_t might happen to match the way the above
code expects, which is how xdl_prepare_ctx() may have give you, but
it somehow feels brittle. I guess the assumption that changed[]
does not point at the beginning of the allocated area (e.g., it is a
no-no to free(xe->xdf1.changed) or realloc() it) is so pervasive that
it cannot be helped. Sigh.
> int xdl_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp,
> xdemitconf_t const *xecfg, xdemitcb_t *ecb) {
> xdchange_t *xscr;
> xdfenv_t xe;
> emit_func_t ef = xecfg->hunk_func ? xdl_call_hunk_func : xdl_emit_diff;
>
> - if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0) {
> -
> - return -1;
> + if (xpp->external_hunks) {
> + if (xdl_prepare_env(mf1, mf2, xpp, &xe) < 0)
> + return -1;
> + if (xdl_populate_hunks_from_external(&xe,
> + xpp->external_hunks,
> + xpp->external_hunks_nr) < 0) {
> + /*
> + * Invalid external hunks; fall back to the
> + * builtin diff algorithm. Re-runs
> + * xdl_prepare_env() via xdl_do_diff().
> + */
> + xdl_free_env(&xe);
> + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> + return -1;
If the external tool keeps sending bogus hunks, silently falling
back to what we would have done if there weren't any external stuff
may be necessary to pleasantly keep using Git, but two and a half
short comments here.
(1) "What we would have done" is exactly the same as what appears
in the corresponding "else" block. Can we make sure that we do
not have to keep updating both copies in the future with some
code rearrangement?
(2) The writer of the external tool may want to see some trace of
warning under certain flags when a failure of the tool forces
the receiving end to fallback.
(3) If the tool throws too many broken replies, perhaps we want to
disable it automatically?
> + }
> + } else {
> + if (xdl_do_diff(mf1, mf2, xpp, &xe) < 0)
> + return -1;
> }
> +
> if (xdl_change_compact(&xe.xdf1, &xe.xdf2, xpp->flags) < 0 ||
> xdl_change_compact(&xe.xdf2, &xe.xdf1, xpp->flags) < 0 ||
> xdl_build_script(&xe, &xscr) < 0) {
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This series adds
diff.<driver>.process, a long-running subprocessprotocol that lets external tools provide hunks to git's diff and
blame pipelines.
Over the past 18 years, git's diff pipeline accumulated many
features that operate on hunks: word diff, function context,
color-moved, indent heuristic, blame. External tools can replace
the pipeline entirely (
diff.<driver>.command) or select amongbuiltin algorithms (
diff.<driver>.algorithm), but there is no wayfor a tool to provide line-change information into the pipeline.
Tools that understand code structure (tree-sitter parsers,
format-aware analyzers, tools like Difftastic and Mergiraf) must
bypass git's pipeline and lose access to everything downstream.
The protocol follows
filter.<driver>.process: pkt-line overstdin/stdout, capability negotiation, one tool invocation per
git command. The tool receives file pairs and returns hunk
descriptors that git feeds into the standard xdiff pipeline.
All output features work normally.
Zero hunks with status=success means the tool considers the
files equivalent. git diff shows no output for the file, and
git blame skips the commit, attributing lines to earlier
commits.
On error or tool crash, git falls back silently to the builtin
diff algorithm. The feature is opt-in via
diff.<driver>.processand .gitattributes; unconfigured files are unaffected.
The series includes git diff-process-normalize, a built-in
tool that compares files line by line ignoring whitespace
(same logic as "git diff -w" via xdiff_compare_lines):
A whitespace-only boolean flag could serve this specific case.
The subprocess protocol is more general, allowing any tool to
participate without further changes to git.