Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Teach diff to honor diff algorithms set through git attributes #1452

Closed
wants to merge 2 commits into from

Conversation

john-cai
Copy link
Contributor

@john-cai john-cai commented Feb 4, 2023

When a repository contains different kinds of files, it may be desirable to use different algorithms based on file type. This is currently not feasible through the command line or using git configs. However, we can leverage the fact that gitattributes are path aware.

Teach the diff machinery to check gitattributes when diffing files by using the existing diff. scheme, and add an "algorithm" type to the external driver config.

Change since V3:

  • cleaned up documentation, typos
  • minor cleanup such as if statement ordering, and overly long lines

Changes since V2:

  • minor clean up and variable renaming
  • avoid parsing attribute files for the driver if the diff algorithm is set through the command line

Changes since V1:

  • utilize the existing diff.<driver>.* scheme where the driver is defined in gitattributes, but the algorithm is defined in the gitconfig.

To address some of the performance concerns in the previous series, a benchmark shows that now only a minor performance penalty is incurred, now that we are no longer adding an additional attributes parsing call:

$ echo "*.[ch] diff=other" >> .gitattributes
$ hyperfine -r 10 -L a git-bin-wrapper,git '{a} -c diff.other.algorithm=myers diff v2.0.0 v2.28.0'
Benchmark 1: git-bin-wrapper -c diff.other.algorithm=myers diff v2.0.0 v2.28.0
Time (mean ± σ): 716.3 ms ± 3.8 ms [User: 660.2 ms, System: 50.8 ms]
Range (min … max): 709.8 ms … 720.6 ms 10 runs

Benchmark 2: git -c diff.other.algorithm=myers diff v2.0.0 v2.28.0
Time (mean ± σ): 704.3 ms ± 2.9 ms [User: 656.6 ms, System: 44.3 ms]
Range (min … max): 700.1 ms … 708.6 ms 10 runs

Summary
'git -c diff.other.algorithm=myers diff v2.0.0 v2.28.0' ran
1.02 ± 0.01 times faster than 'git-bin-wrapper -c diff.other.algorithm=myers diff v2.0.0 v2.28.0'

cc: Eric Sunshine sunshine@sunshineco.com
cc: Phillip Wood phillip.wood123@gmail.com
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Jeff King peff@peff.net
cc: Elijah Newren newren@gmail.com

@gitgitgadget-git
Copy link

There are issues in commit 8e8fee8:
diff: Teach diff to read attribute diff-algorithm
Prefixed commit message must be in lower case

@john-cai
Copy link
Contributor Author

john-cai commented Feb 5, 2023

/preview

@gitgitgadget-git
Copy link

Preview email sent as pull.1452.git.git.1675567956.gitgitgadget@gmail.com

@john-cai
Copy link
Contributor Author

john-cai commented Feb 5, 2023

/submit

@gitgitgadget-git
Copy link

Submitted as pull.1452.git.git.1675568781.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1452/john-cai/jc/attr-diff-algo-v1

To fetch this version to local tag pr-git-1452/john-cai/jc/attr-diff-algo-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1452/john-cai/jc/attr-diff-algo-v1

@@ -736,6 +736,29 @@ String::
by the configuration variables in the "diff.foo" section of the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Sat, Feb 4, 2023 at 11:47 PM John Cai via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> It can be useful to specify diff algorithms per file type. For example,
> one may want to use the minimal diff algorithm for .json files, another
> for .c files, etc.
>
> Teach the diff machinery to check attributes for a diff algorithm.
> Enforce precedence by favoring the command line option, then looking at
> attributes, then finally the config.
>
> To enforce precedence order, set the `xdl_opts_command_line` member
> during options pasing to indicate the diff algorithm was set via command
> line args.
>
> Signed-off-by: John Cai <johncai86@gmail.com>
> ---
> diff --git a/diff.c b/diff.c
> @@ -3652,6 +3652,27 @@ static void builtin_diff(const char *name_a,
> +               if (!o->xdl_opts_command_line) {
> +                       static struct attr_check *check;

`check` is declared static...

> +                       const char *one_diff_algo;
> +                       const char *two_diff_algo;
> +
> +                       check = attr_check_alloc();

... is allocated here...

> +                       attr_check_append(check, git_attr("diff-algorithm"));
> +
> +                       git_check_attr(the_repository->index, NULL, one->path, check);
> +                       one_diff_algo = check->items[0].value;
> +                       git_check_attr(the_repository->index, NULL, two->path, check);
> +                       two_diff_algo = check->items[0].value;
> +
> +                       if (!ATTR_UNSET(one_diff_algo) && !ATTR_UNSET(two_diff_algo) &&
> +                               !strcmp(one_diff_algo, two_diff_algo))
> +                               set_diff_algorithm(o, one_diff_algo);
> +
> +                       attr_check_free(check);

... and freed here...

> +               }

... so the reason for the `static` declaration is not clear. Am I
missing something obvious?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, John Cai wrote (reply to this):

Hi Eric,

On 5 Feb 2023, at 12:50, Eric Sunshine wrote:

> On Sat, Feb 4, 2023 at 11:47 PM John Cai via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> It can be useful to specify diff algorithms per file type. For example,
>> one may want to use the minimal diff algorithm for .json files, another
>> for .c files, etc.
>>
>> Teach the diff machinery to check attributes for a diff algorithm.
>> Enforce precedence by favoring the command line option, then looking at
>> attributes, then finally the config.
>>
>> To enforce precedence order, set the `xdl_opts_command_line` member
>> during options pasing to indicate the diff algorithm was set via command
>> line args.
>>
>> Signed-off-by: John Cai <johncai86@gmail.com>
>> ---
>> diff --git a/diff.c b/diff.c
>> @@ -3652,6 +3652,27 @@ static void builtin_diff(const char *name_a,
>> +               if (!o->xdl_opts_command_line) {
>> +                       static struct attr_check *check;
>
> `check` is declared static...
>
>> +                       const char *one_diff_algo;
>> +                       const char *two_diff_algo;
>> +
>> +                       check = attr_check_alloc();
>
> ... is allocated here...
>
>> +                       attr_check_append(check, git_attr("diff-algorithm"));
>> +
>> +                       git_check_attr(the_repository->index, NULL, one->path, check);
>> +                       one_diff_algo = check->items[0].value;
>> +                       git_check_attr(the_repository->index, NULL, two->path, check);
>> +                       two_diff_algo = check->items[0].value;
>> +
>> +                       if (!ATTR_UNSET(one_diff_algo) && !ATTR_UNSET(two_diff_algo) &&
>> +                               !strcmp(one_diff_algo, two_diff_algo))
>> +                               set_diff_algorithm(o, one_diff_algo);
>> +
>> +                       attr_check_free(check);
>
> ... and freed here...
>
>> +               }
>
> ... so the reason for the `static` declaration is not clear. Am I
> missing something obvious?

No, you are correct. No reason for the static declaration. `check` is not used outside of the scope of this
conditional. I think this made it in from an earlier iteration and I didn't catch the oversight.

thanks
John

@gitgitgadget-git
Copy link

User Eric Sunshine <sunshine@sunshineco.com> has been added to the cc: list.

@@ -3437,6 +3437,22 @@ static int diff_filepair_is_phoney(struct diff_filespec *one,
return !DIFF_FILE_VALID(one) && !DIFF_FILE_VALID(two);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi John

On 05/02/2023 03:46, John Cai via GitGitGadget wrote:
> From: John Cai <jcai@gitlab.com>
> > The diff option parsing for --minimal, --patience, --histgoram can all
> be consolidated into one function. This is a preparatory step for the
> subsequent commit which teaches diff to keep track of whether or not a
> diff algorithm has been set via the command line.
> > While we're at it, the logic that sets the diff algorithm in
> diff_opt_diff_algorithm() can be refactored into a helper that will
> allow multiple callsites to set the diff algorithm.

You say "while  we're at it" but isn't it a wholly necessary change for what you want to do?

This patch basically looks good, I've left a couple of comments below, thanks for separating it out as a preparatory step

> Signed-off-by: John Cai <johncai86@gmail.com>
> ---
>   diff.c | 87 ++++++++++++++++++++++++++++++++++++----------------------
>   1 file changed, 54 insertions(+), 33 deletions(-)
> > diff --git a/diff.c b/diff.c
> index 329eebf16a0..a8a31c81fe7 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3437,6 +3437,22 @@ static int diff_filepair_is_phoney(struct diff_filespec *one,
>   	return !DIFF_FILE_VALID(one) && !DIFF_FILE_VALID(two);
>   }
>   > +static int set_diff_algorithm(struct diff_options *opts,
> +			      const char *alg)
> +{
> +	long value = parse_algorithm_value(alg);
> +
> +	if (value < 0)
> +		return 1;
> +
> +	/* clear out previous settings */
> +	DIFF_XDL_CLR(opts, NEED_MINIMAL);
> +	opts->xdl_opts &= ~XDF_DIFF_ALGORITHM_MASK;
> +	opts->xdl_opts |= value;
> +
> +	return 0;
> +}
> +
>   static void builtin_diff(const char *name_a,
>   			 const char *name_b,
>   			 struct diff_filespec *one,
> @@ -5107,17 +5123,40 @@ static int diff_opt_diff_algorithm(const struct option *opt,
>   				   const char *arg, int unset)
>   {
>   	struct diff_options *options = opt->value;
> -	long value = parse_algorithm_value(arg);
>   >   	BUG_ON_OPT_NEG(unset);
> -	if (value < 0)
> +
> +	if (set_diff_algorithm(options, arg))
>   		return error(_("option diff-algorithm accepts \"myers\", "
>   			       "\"minimal\", \"patience\" and \"histogram\""));
>   > -	/* clear out previous settings */
> -	DIFF_XDL_CLR(options, NEED_MINIMAL);
> -	options->xdl_opts &= ~XDF_DIFF_ALGORITHM_MASK;
> -	options->xdl_opts |= value;
> +	return 0;
> +}
> +
> +static int diff_opt_diff_algorithm_no_arg(const struct option *opt,
> +				   const char *arg, int unset)
> +{
> +	struct diff_options *options = opt->value;
> +
> +	BUG_ON_OPT_NEG(unset);
> +	BUG_ON_OPT_ARG(arg);
> +
> +	if (!strcmp(opt->long_name, "patience")) {
> +		int i;

This is copied from the existing code but as `options->anchors_nr` is a size_t it is probably worth converting `i` to a size_t here.

> +		/*
> +		 * Both --patience and --anchored use PATIENCE_DIFF
> +		 * internally, so remove any anchors previously
> +		 * specified.
> +		 */
> +		for (i = 0; i < options->anchors_nr; i++)
> +			free(options->anchors[i]);
> +		options->anchors_nr = 0;
> +	}
> +
> +	if (set_diff_algorithm(options, opt->long_name))
> +		return error(_("available diff algorithms include \"myers\", "
> +			       "\"minimal\", \"patience\" and \"histogram\""));

I think this should be a BUG() as it is a programming error if we reach this point.

Best Wishes

Phillip

> +
>   	return 0;
>   }
>   > @@ -5242,26 +5281,6 @@ static enum parse_opt_result diff_opt_output(struct parse_opt_ctx_t *ctx,
>   	return 0;
>   }
>   > -static int diff_opt_patience(const struct option *opt,
> -			     const char *arg, int unset)
> -{
> -	struct diff_options *options = opt->value;
> -	int i;
> -
> -	BUG_ON_OPT_NEG(unset);
> -	BUG_ON_OPT_ARG(arg);
> -	options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
> -	/*
> -	 * Both --patience and --anchored use PATIENCE_DIFF
> -	 * internally, so remove any anchors previously
> -	 * specified.
> -	 */
> -	for (i = 0; i < options->anchors_nr; i++)
> -		free(options->anchors[i]);
> -	options->anchors_nr = 0;
> -	return 0;
> -}
> -
>   static int diff_opt_ignore_regex(const struct option *opt,
>   				 const char *arg, int unset)
>   {
> @@ -5562,9 +5581,10 @@ struct option *add_diff_options(const struct option *opts,
>   			    N_("prevent rename/copy detection if the number of rename/copy targets exceeds given limit")),
>   >   		OPT_GROUP(N_("Diff algorithm options")),
> -		OPT_BIT(0, "minimal", &options->xdl_opts,
> -			N_("produce the smallest possible diff"),
> -			XDF_NEED_MINIMAL),
> +		OPT_CALLBACK_F(0, "minimal", options, NULL,
> +			       N_("produce the smallest possible diff"),
> +			       PARSE_OPT_NONEG | PARSE_OPT_NOARG,
> +			       diff_opt_diff_algorithm_no_arg),
>   		OPT_BIT_F('w', "ignore-all-space", &options->xdl_opts,
>   			  N_("ignore whitespace when comparing lines"),
>   			  XDF_IGNORE_WHITESPACE, PARSE_OPT_NONEG),
> @@ -5589,10 +5609,11 @@ struct option *add_diff_options(const struct option *opts,
>   		OPT_CALLBACK_F(0, "patience", options, NULL,
>   			       N_("generate diff using the \"patience diff\" algorithm"),
>   			       PARSE_OPT_NONEG | PARSE_OPT_NOARG,
> -			       diff_opt_patience),
> -		OPT_BITOP(0, "histogram", &options->xdl_opts,
> -			  N_("generate diff using the \"histogram diff\" algorithm"),
> -			  XDF_HISTOGRAM_DIFF, XDF_DIFF_ALGORITHM_MASK),
> +			       diff_opt_diff_algorithm_no_arg),
> +		OPT_CALLBACK_F(0, "histogram", options, NULL,
> +			       N_("generate diff using the \"histogram diff\" algorithm"),
> +			       PARSE_OPT_NONEG | PARSE_OPT_NOARG,
> +			       diff_opt_diff_algorithm_no_arg),
>   		OPT_CALLBACK_F(0, "diff-algorithm", options, N_("<algorithm>"),
>   			       N_("choose a diff algorithm"),
>   			       PARSE_OPT_NONEG, diff_opt_diff_algorithm),

@gitgitgadget-git
Copy link

User Phillip Wood <phillip.wood123@gmail.com> has been added to the cc: list.

@@ -736,6 +736,29 @@ String::
by the configuration variables in the "diff.foo" section of the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi John

On 05/02/2023 03:46, John Cai via GitGitGadget wrote:
> From: John Cai <johncai86@gmail.com>
> > It can be useful to specify diff algorithms per file type. For example,
> one may want to use the minimal diff algorithm for .json files, another
> for .c files, etc.

Have you got any examples of why this is useful? I find myself occasionally changing the algorithm when the default gives a sub-optimal diff but I've not really noticed any pattern with respect to file types.

> Teach the diff machinery to check attributes for a diff algorithm.
> Enforce precedence by favoring the command line option, then looking at
> attributes, then finally the config.
> > To enforce precedence order, set the `xdl_opts_command_line` member
> during options pasing to indicate the diff algorithm was set via command
> line args.

I've only commented on the tests as it looks like Eric and had a careful look at the code

> diff --git a/t/lib-diff-alternative.sh b/t/lib-diff-alternative.sh
> index 8d1e408bb58..630c98ea65a 100644
> --- a/t/lib-diff-alternative.sh
> +++ b/t/lib-diff-alternative.sh
> @@ -107,8 +107,27 @@ EOF
>   >   	STRATEGY=$1
>   > +	test_expect_success "$STRATEGY diff from attributes" '
> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
> +		test_must_fail git diff --no-index file1 file2 > output &&
> +		test_cmp expect output
> +	'
> +
>   	test_expect_success "$STRATEGY diff" '
> -		test_must_fail git diff --no-index "--$STRATEGY" file1 file2 > output &&
> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
> +		test_cmp expect output
> +	'
> +
> +	test_expect_success "$STRATEGY diff command line precedence before attributes" '
> +		echo "file* diff-algorithm=meyers" >.gitattributes &&
> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
> +		test_cmp expect output
> +	'
> +
> +	test_expect_success "$STRATEGY diff attributes precedence before config" '
> +		git config diff.algorithm default &&
> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&

This test passes with or without the code changes in this patch. I think you need to drop --diff-algorithm=$STRATEGY from the diff command.

>   		test_cmp expect output
>   	'
>   > @@ -166,5 +185,11 @@ EOF
>   		test_must_fail git diff --no-index "--$STRATEGY" uniq1 uniq2 > output &&
>   		test_cmp expect output
>   	'
> +
> +	test_expect_success "$STRATEGY diff from attributes" '
> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
> +		test_must_fail git diff --no-index uniq1 uniq2 > output &&
> +		test_cmp expect output

This test also passes with or without the code changes in this patch. It   is the same as the first test added above but with files that give the same diff irrespective of the algorithm chosen so I don't think it is doing anything useful. Unless I've missed something it should be dropped.

Best Wishes

Phillip

> +	'
>   }
>   

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Mon, Feb 6, 2023 at 11:46 AM Phillip Wood <phillip.wood123@gmail.com> wrote:
> On 05/02/2023 03:46, John Cai via GitGitGadget wrote:
> > Teach the diff machinery to check attributes for a diff algorithm.
> > Enforce precedence by favoring the command line option, then looking at
> > attributes, then finally the config.
> >
> > To enforce precedence order, set the `xdl_opts_command_line` member
> > during options pasing to indicate the diff algorithm was set via command
> > line args.
>
> I've only commented on the tests as it looks like Eric and had a careful
> look at the code

I only very quickly ran my eye over the code; didn't even really read it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, John Cai wrote (reply to this):

Hi Phillip,

On 6 Feb 2023, at 11:27, Phillip Wood wrote:

> Hi John
>
> On 05/02/2023 03:46, John Cai via GitGitGadget wrote:
>> From: John Cai <johncai86@gmail.com>
>>
>> It can be useful to specify diff algorithms per file type. For example,
>> one may want to use the minimal diff algorithm for .json files, another
>> for .c files, etc.
>
> Have you got any examples of why this is useful? I find myself occasionally changing the algorithm when the default gives a sub-optimal diff but I've not really noticed any pattern with respect to file types.

At $DAYJOB, there has been a discussion and request for a feature like this [1].
One use case that came up was to be able to set a different diff algorithm for
.json files.

1. https://gitlab.com/gitlab-org/gitaly/-/issues/2591

>
>> Teach the diff machinery to check attributes for a diff algorithm.
>> Enforce precedence by favoring the command line option, then looking at
>> attributes, then finally the config.
>>
>> To enforce precedence order, set the `xdl_opts_command_line` member
>> during options pasing to indicate the diff algorithm was set via command
>> line args.
>
> I've only commented on the tests as it looks like Eric and had a careful look at the code
>
>> diff --git a/t/lib-diff-alternative.sh b/t/lib-diff-alternative.sh
>> index 8d1e408bb58..630c98ea65a 100644
>> --- a/t/lib-diff-alternative.sh
>> +++ b/t/lib-diff-alternative.sh
>> @@ -107,8 +107,27 @@ EOF
>>    	STRATEGY=$1
>>  +	test_expect_success "$STRATEGY diff from attributes" '
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index file1 file2 > output &&
>> +		test_cmp expect output
>> +	'
>> +
>>   	test_expect_success "$STRATEGY diff" '
>> -		test_must_fail git diff --no-index "--$STRATEGY" file1 file2 > output &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>> +		test_cmp expect output
>> +	'
>> +
>> +	test_expect_success "$STRATEGY diff command line precedence before attributes" '
>> +		echo "file* diff-algorithm=meyers" >.gitattributes &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>> +		test_cmp expect output
>> +	'
>> +
>> +	test_expect_success "$STRATEGY diff attributes precedence before config" '
>> +		git config diff.algorithm default &&
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>
> This test passes with or without the code changes in this patch. I think you need to drop --diff-algorithm=$STRATEGY from the diff command.

an oversight indeed, thanks for catching this.

>
>>   		test_cmp expect output
>>   	'
>>  @@ -166,5 +185,11 @@ EOF
>>   		test_must_fail git diff --no-index "--$STRATEGY" uniq1 uniq2 > output &&
>>   		test_cmp expect output
>>   	'
>> +
>> +	test_expect_success "$STRATEGY diff from attributes" '
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index uniq1 uniq2 > output &&
>> +		test_cmp expect output
>
> This test also passes with or without the code changes in this patch. It  is the same as the first test added above but with files that give the same diff irrespective of the algorithm chosen so I don't think it is doing anything useful. Unless I've missed something it should be dropped.

I should have been more thorough with this one as well, thanks.

>
> Best Wishes
>
> Phillip
>
>> +	'
>>   }
>>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

Hi John,

On Mon, Feb 6, 2023 at 12:02 PM John Cai <johncai86@gmail.com> wrote:
>
> Hi Phillip,
>
> On 6 Feb 2023, at 11:27, Phillip Wood wrote:
>
> > Hi John
> >
> > On 05/02/2023 03:46, John Cai via GitGitGadget wrote:
> >> From: John Cai <johncai86@gmail.com>
> >>
> >> It can be useful to specify diff algorithms per file type. For example,
> >> one may want to use the minimal diff algorithm for .json files, another
> >> for .c files, etc.
> >
> > Have you got any examples of why this is useful? I find myself occasionally changing the algorithm when the default gives a sub-optimal diff but I've not really noticed any pattern with respect to file types.
>
> At $DAYJOB, there has been a discussion and request for a feature like this [1].
> One use case that came up was to be able to set a different diff algorithm for
> .json files.
>
> 1. https://gitlab.com/gitlab-org/gitaly/-/issues/2591

A couple points:

First, there seems to be a misunderstanding in that issue.  In
particular, the merge algorithm does call into the xdiff library to do
the three-way content merge of individual files, and when it does so,
it has to specify the diff algorithm (or take the default, currently
myers).  merge-recursive allows the diff algorithm to be specified by
the user (there are
-Xdiff-algorithm={histogram,minimal,patience,myers} flags to
merge/rebase for it), while merge-ort uses histogram (though it uses
the same parser as merge-recursive and thus gets the variables set
from the -Xdiff-algorithm flag, it just ignores those values and
hardcodes histogram).

Second, I also think the user request got converted to a particular
solution without looking at the wider problem space:  The idea seemed
to assume "myers" is default for a good reason, and thus asked for an
option to use something else.  I'm not sure the assumption is valid; I
think "myers" is default for historical reasons and histogram is
better not just for special Salesforce xml files, but code files too.
The output makes more sense to users.  So much so that even though my
simple testing suggested it had a 2% performance penalty compared to
myers, I forced ort to use it[1] even though I designed  everything
else in that algorithm around eking out maximum performance.  Others
who have tested the diff algorithms have also found histogram has very
similar performance to myers, and oftentimes even beats it[2][3].
Also, worries about invalidating rerere caches[4] was real, but we
already paid that price when we switched to ort.  And if performance
is still a worry, [3] gives me reason to believe we can make our
histogram implementation faster.  Finally, for the period of time when
Palantir was letting me make an internal git distribution (mostly for
testing ort), I also carried a patch that changed the default diff
algorithm to histogram (not just for ort, but for diff/log/etc. as
well).  Never had any complaints from the users from it.  Perhaps you
could do the same in your local version of git used by gitaly?

[1] See c8017176ac ("merge-ort: use histogram diff", 2020-12-13)
[2] From 85551232b5 ("perf: compare diff algorithms", 2012-03-06):
"This does indeed show that histogram diff slightly beats Myers, while
patience is much slower than the others."
[3] https://github.com/pascalkuthe/imara-diff
[4] https://lore.kernel.org/git/20120307114714.GA14990@sigill.intra.peff.net/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Thu, Feb 09 2023, Elijah Newren wrote:

> [...] I
> think "myers" is default for historical reasons and histogram is
> better not just for special Salesforce xml files, but code files too.
> The output makes more sense to users.  So much so that even though my
> simple testing suggested it had a 2% performance penalty compared to
> myers, I forced ort to use it[1] even though I designed  everything
> else in that algorithm around eking out maximum performance.  Others
> who have tested the diff algorithms have also found histogram has very
> similar performance to myers, and oftentimes even beats it[2][3].
> Also, worries about invalidating rerere caches[4] was real, but we
> already paid that price when we switched to ort.

FWIW as someone who went through that one-time pain for my many git.git
topics it wasn't even a price, it was paying me!

Maybe it's just confirmation bias, or looking at the conflicts with
fresh eyes, but I found I mis-solved some of them seemingly because the
output from the old conflicts was so confusing, but much better with
"ort".

> And if performance
> is still a worry, [3] gives me reason to believe we can make our
> histogram implementation faster.  Finally, for the period of time when
> Palantir was letting me make an internal git distribution (mostly for
> testing ort), I also carried a patch that changed the default diff
> algorithm to histogram (not just for ort, but for diff/log/etc. as
> well).  Never had any complaints from the users from it.  Perhaps you
> could do the same in your local version of git used by gitaly?

I think that might be worth considering for GitLab, John? Although the
bias has definitely been to go with vanilla git semantics, but it sounds
like you might be in favor of endorsing a patch to change the default,
so if that happens to solve the problem...

> I also think the user request got converted to a particular
> solution without looking at the wider problem space:  The idea seemed
> to assume "myers" is default for a good reason, and thus asked for an
> option to use something else.  I'm not sure the assumption is valid; 

Just on the "wider problem", I've also looked at a lot of "bad diffs"
and there's interesting cases where histogram does equally bad as the
others *from most user's POV*, but from a "let's make a small diff" for
a computer it's perfect.

I've seen this most commonly with repetative file formats, e.g JSON-like
ones are a good example. You've probably looked at this exact thing N
times, but in case it's useful consider:
	
	$ cat a
	{
	        thing => 'foo',
	        a => 'b',
	},
	{
	        c => 'd',
	        thing => 'bar',
	}
	$ cat b
	{
	        thing => 'foo',
	        a => 'b',
	},
	{
	        c => 'd',
	},
	{
	        e => 'f',
	        thing => 'bar',
	}

Here all of our diff algorithms generate this equally terrible diff, or
a great diff, depending on your POV :):
	
	diff --git a/a b/b
	index 186017c1f38..5afadde1800 100644
	--- a/a
	+++ b/b
	@@ -4,5 +4,8 @@
	 },
	 {
	        c => 'd',
	+},
	+{
	+       e => 'f',
	        thing => 'bar',
	 }

The problem here is that from the user's POV they didn't add a closing
bracket, comma, open bracket etc. They added a whole new copy/pasted
block, and then modified a key-value in the subsequent one.

But the diff engine doesn't know about any of that, and will "helpfully"
proceed to "steal" parts of the previous block.

All of the cases where users have asked me (in person) about some bad
diffs have pretty much come down to this sort of thing.

In those cases one of the algorithms sometimes *happened* to find a
"better" diff, but in my experience it's been a
wrong-clock-is-right-twice-a-day sort of thing.

I've wondered if we couldn't have some much more stupid but effective
solution to these common cases. All of the ones I remember could
basically be expressed as a hypothetical:

	[diff "c"]
        balanceBrackets = true

Where we'd try as hard as we could not to produce diffs that had
un-balanced brackets. I.e. in this case (I manually produced this by
converting the brackets[1] to []'s, then changing them back:
	
	diff --git a/a b/b
	index 186017c1f38..05cdd03bfa4 100644
	--- a/a
	+++ b/b
	@@ -2,7 +2,10 @@
	        thing => 'foo',
	        a => 'b',
	 },
	+{
	+       x => 'y',
	+},
	 {
	-       c => 'd',
	+       c => 'f',
	        thing => 'bar',
	 }

That's a much worse diff to a computer (now 11 lines, v.s. 8 lines
before), but I'd think to most users that's *much* more understandable,
just by knowing just a bit about the language (although I'd argue we
could go as far as assuming this in general, with how common balanced
brackets[1] are across languages).

P.S.: Funny story: at <pastjob> I once helped a user who'd been
      struggling for hours to turn their already working change into
      something that made more sense with "git diff".

      We eventually managed to come up with something that looked
      "right", I can't remember how, probably some mixture of -U<n>,
      diff algorithm etc.

      Their next question was "Ok, so how do I commit this?", referring
      to "this particular version of the diff".

      Which, having already spent more time than I'd like to admit in
      trying to "help" them was a good reminder to first ask what
      problem we're trying to solve :)

1. By "balanced bracket" I'm referring not just to "{}", but what's
   considered a "mirrored" character in Unicode. I.e. not just ()[]{}<>
   etc., but also ∈∋ and the like (see
   e.g. https://www.compart.com/en/unicode/U+2208)

   For better or worse the Perl 6 language has this as part of its
   grammar, see e.g.:
   https://andrewshitov.com/2018/01/23/embedded-comment-delimiters-in-perl-6/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, John Cai wrote (reply to this):

On 23/02/09 12:26AM, Elijah Newren wrote:
> Hi John,
> 
> On Mon, Feb 6, 2023 at 12:02 PM John Cai <johncai86@gmail.com> wrote:
> >
> > Hi Phillip,
> >
> > On 6 Feb 2023, at 11:27, Phillip Wood wrote:
> >
> > > Hi John
> > >
> > > On 05/02/2023 03:46, John Cai via GitGitGadget wrote:
> > >> From: John Cai <johncai86@gmail.com>
> > >>
> > >> It can be useful to specify diff algorithms per file type. For example,
> > >> one may want to use the minimal diff algorithm for .json files, another
> > >> for .c files, etc.
> > >
> > > Have you got any examples of why this is useful? I find myself occasionally changing the algorithm when the default gives a sub-optimal diff but I've not really noticed any pattern with respect to file types.
> >
> > At $DAYJOB, there has been a discussion and request for a feature like this [1].
> > One use case that came up was to be able to set a different diff algorithm for
> > .json files.
> >
> > 1. https://gitlab.com/gitlab-org/gitaly/-/issues/2591
> 
> A couple points:
> 
> First, there seems to be a misunderstanding in that issue.  In
> particular, the merge algorithm does call into the xdiff library to do
> the three-way content merge of individual files, and when it does so,
> it has to specify the diff algorithm (or take the default, currently
> myers).  merge-recursive allows the diff algorithm to be specified by
> the user (there are
> -Xdiff-algorithm={histogram,minimal,patience,myers} flags to
> merge/rebase for it), while merge-ort uses histogram (though it uses
> the same parser as merge-recursive and thus gets the variables set
> from the -Xdiff-algorithm flag, it just ignores those values and
> hardcodes histogram).
> 
> Second, I also think the user request got converted to a particular
> solution without looking at the wider problem space:  The idea seemed
> to assume "myers" is default for a good reason, and thus asked for an
> option to use something else.  I'm not sure the assumption is valid; I
> think "myers" is default for historical reasons and histogram is
> better not just for special Salesforce xml files, but code files too.
> The output makes more sense to users.  So much so that even though my
> simple testing suggested it had a 2% performance penalty compared to
> myers, I forced ort to use it[1] even though I designed  everything
> else in that algorithm around eking out maximum performance.  Others
> who have tested the diff algorithms have also found histogram has very
> similar performance to myers, and oftentimes even beats it[2][3].
> Also, worries about invalidating rerere caches[4] was real, but we
> already paid that price when we switched to ort.  And if performance
> is still a worry, [3] gives me reason to believe we can make our
> histogram implementation faster.  Finally, for the period of time when
> Palantir was letting me make an internal git distribution (mostly for
> testing ort), I also carried a patch that changed the default diff
> algorithm to histogram (not just for ort, but for diff/log/etc. as
> well).  Never had any complaints from the users from it.  Perhaps you
> could do the same in your local version of git used by gitaly?

Thanks for that suggestion, Elijah. Changing the diff alg to histogram has been
something we've considered doing as well. However, we've gotten customer
requests to also have the option to view their diffs with patience. Seeing as
the diff algorithm is sometimes a matter of preference, we thought giving users
the control through the repository would be nice.

thanks
John

> 
> [1] See c8017176ac ("merge-ort: use histogram diff", 2020-12-13)
> [2] From 85551232b5 ("perf: compare diff algorithms", 2012-03-06):
> "This does indeed show that histogram diff slightly beats Myers, while
> patience is much slower than the others."
> [3] https://github.com/pascalkuthe/imara-diff
> [4] https://lore.kernel.org/git/20120307114714.GA14990@sigill.intra.peff.net/

@@ -736,6 +736,29 @@ String::
by the configuration variables in the "diff.foo" section of the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:

> From: John Cai <johncai86@gmail.com>
> [...]
> +
> +		if (!o->xdl_opts_command_line) {
> +			static struct attr_check *check;
> +			const char *one_diff_algo;
> +			const char *two_diff_algo;
> +
> +			check = attr_check_alloc();
> +			attr_check_append(check, git_attr("diff-algorithm"));
> +
> +			git_check_attr(the_repository->index, NULL, one->path, check);
> +			one_diff_algo = check->items[0].value;
> +			git_check_attr(the_repository->index, NULL, two->path, check);
> +			two_diff_algo = check->items[0].value;
> +
> +			if (!ATTR_UNSET(one_diff_algo) && !ATTR_UNSET(two_diff_algo) &&
> +				!strcmp(one_diff_algo, two_diff_algo))
> +				set_diff_algorithm(o, one_diff_algo);
> +
> +			attr_check_free(check);

This is a bit nitpicky, but I for one would find this much easier to
read with some shorter variables, here just with "a" rather than
"one_diff_algo", "b" instead of "two_diff_algo", and splitting
"the_repository->index" into "istate" (untested):
	
	+		if (!o->xdl_opts_command_line) {
	+			static struct attr_check *check;
	+			const char *a;
	+			const char *b;
	+			struct index_state *istate = the_repository->index;
	+
	+			check = attr_check_alloc();
	+			attr_check_append(check, git_attr("diff-algorithm"));
	+
	+			git_check_attr(istate, NULL, one->path, check);
	+			a = check->items[0].value;
	+			git_check_attr(istate, NULL, two->path, check);
	+			b = check->items[0].value;
	+
	+			if (!ATTR_UNSET(a) && !ATTR_UNSET(b) && !strcmp(a, b))
	+				set_diff_algorithm(o, a);
	+
	+			attr_check_free(check);
	+		}

That also nicely keeps the line length shorter.

> @@ -333,6 +333,8 @@ struct diff_options {
>  	int prefix_length;
>  	const char *stat_sep;
>  	int xdl_opts;
> +	/* If xdl_opts has been set via the command line. */
> +	int xdl_opts_command_line;
>  
>  	/* see Documentation/diff-options.txt */
>  	char **anchors;
> diff --git a/t/lib-diff-alternative.sh b/t/lib-diff-alternative.sh
> index 8d1e408bb58..630c98ea65a 100644
> --- a/t/lib-diff-alternative.sh
> +++ b/t/lib-diff-alternative.sh
> @@ -107,8 +107,27 @@ EOF
>  
>  	STRATEGY=$1
>  
> +	test_expect_success "$STRATEGY diff from attributes" '
> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
> +		test_must_fail git diff --no-index file1 file2 > output &&
> +		test_cmp expect output
> +	'
> +
>  	test_expect_success "$STRATEGY diff" '
> -		test_must_fail git diff --no-index "--$STRATEGY" file1 file2 > output &&
> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&

Nit: The usual style is ">output", not "> output".

> +		test_cmp expect output
> +	'
> +
> +	test_expect_success "$STRATEGY diff command line precedence before attributes" '
> +		echo "file* diff-algorithm=meyers" >.gitattributes &&
> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
> +		test_cmp expect output
> +	'
> +
> +	test_expect_success "$STRATEGY diff attributes precedence before config" '
> +		git config diff.algorithm default &&
> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>  		test_cmp expect output
>  	'
>  
> @@ -166,5 +185,11 @@ EOF
>  		test_must_fail git diff --no-index "--$STRATEGY" uniq1 uniq2 > output &&
>  		test_cmp expect output
>  	'
> +
> +	test_expect_success "$STRATEGY diff from attributes" '
> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
> +		test_must_fail git diff --no-index uniq1 uniq2 > output &&
> +		test_cmp expect output
> +	'
>  }

For some non-nitpicking, I do worry about exposing this as a DoS vector,
e.g. here's a diff between two distant points in git.git with the
various algorithms:

	$ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
	Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
	  Time (abs ≡):        42.121 s               [User: 41.879 s, System: 0.144 s]
	
	Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
	  Time (abs ≡):        35.634 s               [User: 35.473 s, System: 0.160 s]
	
	Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
	  Time (abs ≡):        46.912 s               [User: 46.657 s, System: 0.228 s]
	
	Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
	  Time (abs ≡):        33.233 s               [User: 33.072 s, System: 0.160 s]
	
	Summary
	  'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
	    1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
	    1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
	    1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'

Now, all of those are very slow overall, but some much more than
others. I seem to recall that the non-default ones also had some
pathological cases.

Another thing to think about is that we've so far considered the diff
algorithm to be purely about presentation, with some notable exceptions
such as "patch-id".

I've advocated for us getting to the point of having an in-repo
.gitconfig or .gitattributes before with a whitelist of settings like
diff.context for certain paths, or a diff.orderFile.

But those seem easy to promise future behavior for, v.s. an entire diff
algorithm (which we of course had before, but now we'd have it in
repository data).

Maybe that's not a distinction worth worrying about, just putting that
out there.

I think if others are concerned about the above something that would
neatly side-step those is to have it opt-in via the .git/config somehow,
similar to e.g. how you can commit *.gpg content, put this in
.gitattributes:

	*.gpg diff=gpg

But not have it do anything until this is in the repo's .git/config (or
similar):

	[diff "gpg"]
        	textconv = gpg --no-tty --decrypt

For that you could still keep the exact .gitattributes format you have
here, i.e.:

	file* diff-algorithm=$STRATEGY

But we to pick it up we'd need either:

	[diff-algorithm]
        	histogram = myers

Or:

	[diff-algorithm "histogram"]
        	allow = true

The former form being one that would allow you to map the .gitattributes
of the repo (but maybe that would be redundant to
.git/info/attributes)...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, John Cai wrote (reply to this):

Hi Ævar,

On 6 Feb 2023, at 11:39, Ævar Arnfjörð Bjarmason wrote:

> On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:
>
>> From: John Cai <johncai86@gmail.com>
>> [...]
>> +
>> +		if (!o->xdl_opts_command_line) {
>> +			static struct attr_check *check;
>> +			const char *one_diff_algo;
>> +			const char *two_diff_algo;
>> +
>> +			check = attr_check_alloc();
>> +			attr_check_append(check, git_attr("diff-algorithm"));
>> +
>> +			git_check_attr(the_repository->index, NULL, one->path, check);
>> +			one_diff_algo = check->items[0].value;
>> +			git_check_attr(the_repository->index, NULL, two->path, check);
>> +			two_diff_algo = check->items[0].value;
>> +
>> +			if (!ATTR_UNSET(one_diff_algo) && !ATTR_UNSET(two_diff_algo) &&
>> +				!strcmp(one_diff_algo, two_diff_algo))
>> +				set_diff_algorithm(o, one_diff_algo);
>> +
>> +			attr_check_free(check);
>
> This is a bit nitpicky, but I for one would find this much easier to
> read with some shorter variables, here just with "a" rather than
> "one_diff_algo", "b" instead of "two_diff_algo", and splitting
> "the_repository->index" into "istate" (untested):
> 	
> 	+		if (!o->xdl_opts_command_line) {
> 	+			static struct attr_check *check;
> 	+			const char *a;
> 	+			const char *b;
> 	+			struct index_state *istate = the_repository->index;
> 	+
> 	+			check = attr_check_alloc();
> 	+			attr_check_append(check, git_attr("diff-algorithm"));
> 	+
> 	+			git_check_attr(istate, NULL, one->path, check);
> 	+			a = check->items[0].value;
> 	+			git_check_attr(istate, NULL, two->path, check);
> 	+			b = check->items[0].value;
> 	+
> 	+			if (!ATTR_UNSET(a) && !ATTR_UNSET(b) && !strcmp(a, b))
> 	+				set_diff_algorithm(o, a);
> 	+
> 	+			attr_check_free(check);
> 	+		}
>
> That also nicely keeps the line length shorter.

Thanks, I think this does look better.

>
>> @@ -333,6 +333,8 @@ struct diff_options {
>>  	int prefix_length;
>>  	const char *stat_sep;
>>  	int xdl_opts;
>> +	/* If xdl_opts has been set via the command line. */
>> +	int xdl_opts_command_line;
>>
>>  	/* see Documentation/diff-options.txt */
>>  	char **anchors;
>> diff --git a/t/lib-diff-alternative.sh b/t/lib-diff-alternative.sh
>> index 8d1e408bb58..630c98ea65a 100644
>> --- a/t/lib-diff-alternative.sh
>> +++ b/t/lib-diff-alternative.sh
>> @@ -107,8 +107,27 @@ EOF
>>
>>  	STRATEGY=$1
>>
>> +	test_expect_success "$STRATEGY diff from attributes" '
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index file1 file2 > output &&
>> +		test_cmp expect output
>> +	'
>> +
>>  	test_expect_success "$STRATEGY diff" '
>> -		test_must_fail git diff --no-index "--$STRATEGY" file1 file2 > output &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>
> Nit: The usual style is ">output", not "> output".

noted!

>
>> +		test_cmp expect output
>> +	'
>> +
>> +	test_expect_success "$STRATEGY diff command line precedence before attributes" '
>> +		echo "file* diff-algorithm=meyers" >.gitattributes &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>> +		test_cmp expect output
>> +	'
>> +
>> +	test_expect_success "$STRATEGY diff attributes precedence before config" '
>> +		git config diff.algorithm default &&
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index "--diff-algorithm=$STRATEGY" file1 file2 > output &&
>>  		test_cmp expect output
>>  	'
>>
>> @@ -166,5 +185,11 @@ EOF
>>  		test_must_fail git diff --no-index "--$STRATEGY" uniq1 uniq2 > output &&
>>  		test_cmp expect output
>>  	'
>> +
>> +	test_expect_success "$STRATEGY diff from attributes" '
>> +		echo "file* diff-algorithm=$STRATEGY" >.gitattributes &&
>> +		test_must_fail git diff --no-index uniq1 uniq2 > output &&
>> +		test_cmp expect output
>> +	'
>>  }
>
> For some non-nitpicking, I do worry about exposing this as a DoS vector,
> e.g. here's a diff between two distant points in git.git with the
> various algorithms:
>
> 	$ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
> 	Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
> 	  Time (abs ≡):        42.121 s               [User: 41.879 s, System: 0.144 s]
> 	
> 	Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
> 	  Time (abs ≡):        35.634 s               [User: 35.473 s, System: 0.160 s]
> 	
> 	Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
> 	  Time (abs ≡):        46.912 s               [User: 46.657 s, System: 0.228 s]
> 	
> 	Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
> 	  Time (abs ≡):        33.233 s               [User: 33.072 s, System: 0.160 s]
> 	
> 	Summary
> 	  'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
> 	    1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
> 	    1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
> 	    1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'

Thanks for this analysis. To clarify, .gitconfig's diff.algorithm setting is
already an attack vector right? I see how this would be adding another one.

That being said, here's a separate issue. I benchmarked the usage of
.gitattributes as introduced in this patch series, and indeed it does look like
there is additional latency:

$ echo "* diff-algorithm=patience >> .gitattributes
$ hyperfine -r 5 'git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0'                      ✭
Benchmark 1: git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0
  Time (mean ± σ):     889.4 ms ± 113.8 ms    [User: 715.7 ms, System: 65.3 ms]
  Range (min … max):   764.1 ms … 1029.3 ms    5 runs

$ hyperfine -r 5 'git-bin-wrapper diff v2.0.0 v2.28.0'                                                ✭
Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0
  Time (mean ± σ):      2.146 s ±  0.368 s    [User: 0.827 s, System: 0.243 s]
  Range (min … max):    1.883 s …  2.795 s    5 runs

and I imagine the latency scales with the size of .gitattributes. Although I'm
not familiar with other parts of the codebase and how it deals with the latency
introduced by reading attributes files.

>
> Now, all of those are very slow overall, but some much more than
> others. I seem to recall that the non-default ones also had some
> pathological cases.
>
> Another thing to think about is that we've so far considered the diff
> algorithm to be purely about presentation, with some notable exceptions
> such as "patch-id".
>
> I've advocated for us getting to the point of having an in-repo
> .gitconfig or .gitattributes before with a whitelist of settings like
> diff.context for certain paths, or a diff.orderFile.
>
> But those seem easy to promise future behavior for, v.s. an entire diff
> algorithm (which we of course had before, but now we'd have it in
> repository data).
>
> Maybe that's not a distinction worth worrying about, just putting that
> out there.
>
> I think if others are concerned about the above something that would
> neatly side-step those is to have it opt-in via the .git/config somehow,
> similar to e.g. how you can commit *.gpg content, put this in
> .gitattributes:
>
> 	*.gpg diff=gpg
>
> But not have it do anything until this is in the repo's .git/config (or
> similar):
>
> 	[diff "gpg"]
>         	textconv = gpg --no-tty --decrypt
>
> For that you could still keep the exact .gitattributes format you have
> here, i.e.:
>
> 	file* diff-algorithm=$STRATEGY
>
> But we to pick it up we'd need either:
>
> 	[diff-algorithm]
>         	histogram = myers
>
> Or:
>
> 	[diff-algorithm "histogram"]
>         	allow = true

This would help address slowness from the diff algorithm itself. I'm not opposed
to adding this config if this attack vector is concerning to people.

However, it wouldn't help address the additional latency of scanning
.gitattributes to find the diff algorithm.

Would a separate config to allow gitattributes be helpful here?

[diff-algorithm]
	attributes = true

>
> The former form being one that would allow you to map the .gitattributes
> of the repo (but maybe that would be redundant to
> .git/info/attributes)...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi John

On 06/02/2023 20:37, John Cai wrote:
> On 6 Feb 2023, at 11:39, Ævar Arnfjörð Bjarmason wrote:
>> On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:
>> For some non-nitpicking, I do worry about exposing this as a DoS vector,
>> e.g. here's a diff between two distant points in git.git with the
>> various algorithms:
>>
>> 	$ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
>> 	Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
>> 	  Time (abs ≡):        42.121 s               [User: 41.879 s, System: 0.144 s]
>> 	
>> 	Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
>> 	  Time (abs ≡):        35.634 s               [User: 35.473 s, System: 0.160 s]
>> 	
>> 	Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
>> 	  Time (abs ≡):        46.912 s               [User: 46.657 s, System: 0.228 s]
>> 	
>> 	Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
>> 	  Time (abs ≡):        33.233 s               [User: 33.072 s, System: 0.160 s]
>> 	
>> 	Summary
>> 	  'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
>> 	    1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
>> 	    1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
>> 	    1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'
> > Thanks for this analysis. To clarify, .gitconfig's diff.algorithm setting is
> already an attack vector right?

.gitconfig is under the user's control though whereas .gitattributes is attacker controlled if one clones a malicious repository. Having said the worst results above are for the historgram algorithm that merge-ort uses internally and no one has complained about ort's performance.

> I see how this would be adding another one.
> > That being said, here's a separate issue. I benchmarked the usage of
> .gitattributes as introduced in this patch series, and indeed it does look like
> there is additional latency:
> > $ echo "* diff-algorithm=patience >> .gitattributes
> $ hyperfine -r 5 'git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0'                      ✭
> Benchmark 1: git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0
>    Time (mean ± σ):     889.4 ms ± 113.8 ms    [User: 715.7 ms, System: 65.3 ms]
>    Range (min … max):   764.1 ms … 1029.3 ms    5 runs
> > $ hyperfine -r 5 'git-bin-wrapper diff v2.0.0 v2.28.0'                                                ✭
> Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0
>    Time (mean ± σ):      2.146 s ±  0.368 s    [User: 0.827 s, System: 0.243 s]
>    Range (min … max):    1.883 s …  2.795 s    5 runs
> > and I imagine the latency scales with the size of .gitattributes. Although I'm
> not familiar with other parts of the codebase and how it deals with the latency
> introduced by reading attributes files.

Ouch! Thanks for benchmarking that is a suspiciously large slow down. I've a feeling the attributes code parses all the .gitattribute files from scratch for each path that's queried, so there may be scope for making it a bit smarter. I see some slow down but no where near as much


$ hyperfine -r 3 'bin-wrappers/git diff --diff-algorithm=patience --no-color --no-color-moved v2.0.0 v2.28.0'
Benchmark 1: bin-wrappers/git diff --diff-algorithm=patience --no-color --no-color-moved v2.0.0 v2.28.0
  Time (mean ± σ):      1.996 s ±  0.008 s    [User: 1.706 s, System: 0.286 s]
  Range (min … max):    1.989 s …  2.004 s    3 runs

$ hyperfine -r 5 'bin-wrappers/git diff --no-color --no-color-moved v2.0.0 v2.28.0'
Benchmark 1: bin-wrappers/git diff --no-color --no-color-moved v2.0.0 v2.28.0
  Time (mean ± σ):      2.238 s ±  0.037 s    [User: 1.880 s, System: 0.350 s]
  Range (min … max):    2.216 s …  2.303 s    5 runs


Perhaps I'm over simplifying but having read the issue you linked to I couldn't help feeling that the majority of users might be satisfied by just changing gitlab to use the patience algorithm when generating diffs. The idea to use .gitattributes seems to have come from Patrick rather than a user request.

This is slightly off topic but one thing I'd really like is a way to tell diff use automatically use --diff-words on some files (e.g. Documentation/*)

Best Wishes

Phillip

>> Now, all of those are very slow overall, but some much more than
>> others. I seem to recall that the non-default ones also had some
>> pathological cases.
>>
>> Another thing to think about is that we've so far considered the diff
>> algorithm to be purely about presentation, with some notable exceptions
>> such as "patch-id".
>>
>> I've advocated for us getting to the point of having an in-repo
>> .gitconfig or .gitattributes before with a whitelist of settings like
>> diff.context for certain paths, or a diff.orderFile.
>>
>> But those seem easy to promise future behavior for, v.s. an entire diff
>> algorithm (which we of course had before, but now we'd have it in
>> repository data).
>>
>> Maybe that's not a distinction worth worrying about, just putting that
>> out there.
>>
>> I think if others are concerned about the above something that would
>> neatly side-step those is to have it opt-in via the .git/config somehow,
>> similar to e.g. how you can commit *.gpg content, put this in
>> .gitattributes:
>>
>> 	*.gpg diff=gpg
>>
>> But not have it do anything until this is in the repo's .git/config (or
>> similar):
>>
>> 	[diff "gpg"]
>>          	textconv = gpg --no-tty --decrypt
>>
>> For that you could still keep the exact .gitattributes format you have
>> here, i.e.:
>>
>> 	file* diff-algorithm=$STRATEGY
>>
>> But we to pick it up we'd need either:
>>
>> 	[diff-algorithm]
>>          	histogram = myers
>>
>> Or:
>>
>> 	[diff-algorithm "histogram"]
>>          	allow = true
> > This would help address slowness from the diff algorithm itself. I'm not opposed
> to adding this config if this attack vector is concerning to people.
> > However, it wouldn't help address the additional latency of scanning
> .gitattributes to find the diff algorithm.
> > Would a separate config to allow gitattributes be helpful here?
> > [diff-algorithm]
> 	attributes = true
> >>
>> The former form being one that would allow you to map the .gitattributes
>> of the repo (but maybe that would be redundant to
>> .git/info/attributes)...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, John Cai wrote (reply to this):

Hi Phillip,

On 7 Feb 2023, at 9:55, Phillip Wood wrote:

> Hi John
>
> On 06/02/2023 20:37, John Cai wrote:
>> On 6 Feb 2023, at 11:39, Ævar Arnfjörð Bjarmason wrote:
>>> On Sun, Feb 05 2023, John Cai via GitGitGadget wrote:
>>> For some non-nitpicking, I do worry about exposing this as a DoS vector,
>>> e.g. here's a diff between two distant points in git.git with the
>>> various algorithms:
>>>
>>> 	$ hyperfine -r 1 -L a patience,minimal,histogram,myers 'git diff --diff-algorithm={a} v2.0.0 v2.28.0'
>>> 	Benchmark 1: git diff --diff-algorithm=patience v2.0.0 v2.28.0
>>> 	  Time (abs ≡):        42.121 s               [User: 41.879 s, System: 0.144 s]
>>> 	
>>> 	Benchmark 2: git diff --diff-algorithm=minimal v2.0.0 v2.28.0
>>> 	  Time (abs ≡):        35.634 s               [User: 35.473 s, System: 0.160 s]
>>> 	
>>> 	Benchmark 3: git diff --diff-algorithm=histogram v2.0.0 v2.28.0
>>> 	  Time (abs ≡):        46.912 s               [User: 46.657 s, System: 0.228 s]
>>> 	
>>> 	Benchmark 4: git diff --diff-algorithm=myers v2.0.0 v2.28.0
>>> 	  Time (abs ≡):        33.233 s               [User: 33.072 s, System: 0.160 s]
>>> 	
>>> 	Summary
>>> 	  'git diff --diff-algorithm=myers v2.0.0 v2.28.0' ran
>>> 	    1.07 times faster than 'git diff --diff-algorithm=minimal v2.0.0 v2.28.0'
>>> 	    1.27 times faster than 'git diff --diff-algorithm=patience v2.0.0 v2.28.0'
>>> 	    1.41 times faster than 'git diff --diff-algorithm=histogram v2.0.0 v2.28.0'
>>
>> Thanks for this analysis. To clarify, .gitconfig's diff.algorithm setting is
>> already an attack vector right?
>
> .gitconfig is under the user's control though whereas .gitattributes is attacker controlled if one clones a malicious repository. Having said the worst results above are for the historgram algorithm that merge-ort uses internally and no one has complained about ort's performance.
>
>> I see how this would be adding another one.
>>
>> That being said, here's a separate issue. I benchmarked the usage of
>> .gitattributes as introduced in this patch series, and indeed it does look like
>> there is additional latency:
>>
>> $ echo "* diff-algorithm=patience >> .gitattributes
>> $ hyperfine -r 5 'git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0'                      ✭
>> Benchmark 1: git-bin-wrapper diff --diff-algorithm=patience v2.0.0 v2.28.0
>>    Time (mean ± σ):     889.4 ms ± 113.8 ms    [User: 715.7 ms, System: 65.3 ms]
>>    Range (min … max):   764.1 ms … 1029.3 ms    5 runs
>>
>> $ hyperfine -r 5 'git-bin-wrapper diff v2.0.0 v2.28.0'                                                ✭
>> Benchmark 1: git-bin-wrapper diff v2.0.0 v2.28.0
>>    Time (mean ± σ):      2.146 s ±  0.368 s    [User: 0.827 s, System: 0.243 s]
>>    Range (min … max):    1.883 s …  2.795 s    5 runs
>>
>> and I imagine the latency scales with the size of .gitattributes. Although I'm
>> not familiar with other parts of the codebase and how it deals with the latency
>> introduced by reading attributes files.
>
> Ouch! Thanks for benchmarking that is a suspiciously large slow down. I've a feeling the attributes code parses all the .gitattribute files from scratch for each path that's queried, so there may be scope for making it a bit smarter. I see some slow down but no where near as much

Yes, definitely worth looking into how this can be sped up.

>
>
> $ hyperfine -r 3 'bin-wrappers/git diff --diff-algorithm=patience --no-color --no-color-moved v2.0.0 v2.28.0'
> Benchmark 1: bin-wrappers/git diff --diff-algorithm=patience --no-color --no-color-moved v2.0.0 v2.28.0
>   Time (mean ± σ):      1.996 s ±  0.008 s    [User: 1.706 s, System: 0.286 s]
>   Range (min … max):    1.989 s …  2.004 s    3 runs
>
> $ hyperfine -r 5 'bin-wrappers/git diff --no-color --no-color-moved v2.0.0 v2.28.0'
> Benchmark 1: bin-wrappers/git diff --no-color --no-color-moved v2.0.0 v2.28.0
>   Time (mean ± σ):      2.238 s ±  0.037 s    [User: 1.880 s, System: 0.350 s]
>   Range (min … max):    2.216 s …  2.303 s    5 runs
>
>
> Perhaps I'm over simplifying but having read the issue you linked to I couldn't help feeling that the majority of users might be satisfied by just changing gitlab to use the patience algorithm when generating diffs.

Right, I recognize this is a judgment call that may be best left up to the list.

We don't have a way in GitLab to change the diff algorithm currently. Of course
that can be implemented outside of Git, but having it as part of Git may have a
wider benefit for users who would appreciate the convenience of automatically
having certain files use one diff algorithm and other files use another, without
having to pass in the diff algorithm through the command line each time.

>
> The idea to use .gitattributes seems to have come from Patrick rather than a user request.
>
> This is slightly off topic but one thing I'd really like is a way to tell diff use automatically use --diff-words on some files (e.g. Documentation/*)

I think it's a bit similar to the spirit of this desired feature.

thanks
John

>
> Best Wishes
>
> Phillip
>
>>> Now, all of those are very slow overall, but some much more than
>>> others. I seem to recall that the non-default ones also had some
>>> pathological cases.
>>>
>>> Another thing to think about is that we've so far considered the diff
>>> algorithm to be purely about presentation, with some notable exceptions
>>> such as "patch-id".
>>>
>>> I've advocated for us getting to the point of having an in-repo
>>> .gitconfig or .gitattributes before with a whitelist of settings like
>>> diff.context for certain paths, or a diff.orderFile.
>>>
>>> But those seem easy to promise future behavior for, v.s. an entire diff
>>> algorithm (which we of course had before, but now we'd have it in
>>> repository data).
>>>
>>> Maybe that's not a distinction worth worrying about, just putting that
>>> out there.
>>>
>>> I think if others are concerned about the above something that would
>>> neatly side-step those is to have it opt-in via the .git/config somehow,
>>> similar to e.g. how you can commit *.gpg content, put this in
>>> .gitattributes:
>>>
>>> 	*.gpg diff=gpg
>>>
>>> But not have it do anything until this is in the repo's .git/config (or
>>> similar):
>>>
>>> 	[diff "gpg"]
>>>          	textconv = gpg --no-tty --decrypt
>>>
>>> For that you could still keep the exact .gitattributes format you have
>>> here, i.e.:
>>>
>>> 	file* diff-algorithm=$STRATEGY
>>>
>>> But we to pick it up we'd need either:
>>>
>>> 	[diff-algorithm]
>>>          	histogram = myers
>>>
>>> Or:
>>>
>>> 	[diff-algorithm "histogram"]
>>>          	allow = true
>>
>> This would help address slowness from the diff algorithm itself. I'm not opposed
>> to adding this config if this attack vector is concerning to people.
>>
>> However, it wouldn't help address the additional latency of scanning
>> .gitattributes to find the diff algorithm.
>>
>> Would a separate config to allow gitattributes be helpful here?
>>
>> [diff-algorithm]
>> 	attributes = true
>>
>>>
>>> The former form being one that would allow you to map the .gitattributes
>>> of the repo (but maybe that would be redundant to
>>> .git/info/attributes)...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Tue, Feb 07 2023, Phillip Wood wrote:

> This is slightly off topic but one thing I'd really like is a way to
> tell diff use automatically use --diff-words on some files
> (e.g. Documentation/*)

Unlike changing the algorithm, -U options, diff.orderFile etc. doing
that would give you a diff that can't be applied with "git apply" or
other tools that expect a valid unified diff.

So I can imagine that it would be neat in some contexts, but such a
change would have much wider implications than options that tweak how a
valid unified diff looks, or is generated.

We'd need some way to mark a diff as "for ad-hoc viewing only".

But as it sounds like you want this for git.git the
Documentation/doc-diff script is much better than anything word-diff
could spew out, as it diffs the resulting generated docs.

I wonder (but haven't tried) whether you can't "diff" that using the
same method that can be used to diff binary files using a custom driver.

Hrm, except that in that case (with includes etc) there isn't really a
1=1 mapping between files within Documentation/ and generated docs (due
to includes etc.). But I suppose it could be used only for those files
that 1=1 correspond to the generated manpages.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Feb 14, 2023 at 06:35:00PM -0800, Elijah Newren wrote:

> > Just a small counter-point, since I happened to be looking at myers vs
> > patience for something elsewhere in the thread, but:
> >
> >   git show 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f builtin/add.c
> 
> "fatal: bad object 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f"
> 
> Is that a local commit of yours?

Oh, sorry. It's not my commit, but it may have been something I picked
up off the list. The version from Junio is 20b813d7d3d.

>    * They had two people annotating the diffs and independently
> scoring them, and then checked for agreement between their answers
> afterwards.  (No, they didn't always agree, but they did have
> substantial agreement.)

Kind of a weird methodology. I'd think you'd do better to put those
diffs in front of a lot of people to get a consensus. But OK...

> For the (again, non-identical) diffs modifying non-code, they found
> (see table 11) that:
>    * 14.9% of the myers diffs are better
>    * 13.4% of the histogram diffs are better
>    * 71.6% of the diffs have equal quality
> 
> For the (non-identical) diffs modifying code, they found (again, see
> table 11) that:
>    * 16.9% of the myers diffs are better
>    * 62.6% of the histogram diffs are better
>    * 20.6% of the diffs have equal quality
> 
> A ratio of 4 to 1 for histogram being better on code diffs is pretty
> weighty to me.

Interesting. They do slightly worse on non-code, but probably not enough
to worry about.

Despite my complaint above, I do find those results compelling. Or at
least, it is much closer to being real data than anything I have seen
before, so it seems like the best guide we have. :)

Like I said, I don't have a very strong opinion myself. Mostly I wanted
to note that while it is easy for us to remember the times we saw a
Myers diff, said "yuck", and tried patience and it was better, most of
us may be blind to the opposite case. If patience is not the default,
most of us would not have hit "yuck" case for it at all.

FWIW, I coincidentally hit this case earlier today where patience does a
_much_ better job than myers:

  https://lore.kernel.org/git/Y+vV8Ifkj1QV7KF0@coredump.intra.peff.net/

So don't count me as an objection, but just a curious observer. ;)

-Peff

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

>> >   git show 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f builtin/add.c
>> 
>> "fatal: bad object 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f"
>> 
>> Is that a local commit of yours?
>
> Oh, sorry. It's not my commit, but it may have been something I picked
> up off the list. The version from Junio is 20b813d7d3d.

$ git show -s --notes=amlog --format='%N %s' 20b813d7d3d
Message-Id: <patch-v2-1.3-71c7922b25f-20230206T225639Z-avarab@gmail.com>
 add: remove "add.interactive.useBuiltin" & Perl "git add--interactive"

> FWIW, I coincidentally hit this case earlier today where patience does a
> _much_ better job than myers:
>
>   https://lore.kernel.org/git/Y+vV8Ifkj1QV7KF0@coredump.intra.peff.net/
>
> So don't count me as an objection, but just a curious observer. ;)

Yeah, I know some people consider that Patience is the best thing
since sliced bread, and it does produce more readable diffs often.
I haven't used it often enough to cite/remember a bad case, though.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi Peff

Thanks for the counter example

On 11/02/2023 01:59, Jeff King wrote:
> Just a small counter-point, since I happened to be looking at myers vs
> patience for something elsewhere in the thread, but:
> >    git show 35bd13fcd2caa4185bf3729655ca20b6a5fe9b6f builtin/add.c
> > looks slightly better to me with myers, even though it is 2 lines
> longer. The issue is that patience and histogram are very eager to use
> blank lines as anchor points, so a diff like:
> >    -some words
>    -
>    -and some more
>    +unrelated content
>    +
>    +but it happens to also be two paragraphs
> > in myers becomes:
> >    -some words
>    +unrelated content
>    >    -and some more
>    +but it happens to also be two paragraphs
> > in patience (here I'm using single lines, but in practice these may be
> paragraphs, or stanzas of code). I think that's also the _strength_ of
> patience in many cases, but it really depends on the content.

Indeed. Ironically as there are no unique context lines in that example the blank lines are being matched by patience implementation falling back to the myers algorithm. Normally the myers implementation tries to avoid matching common context lines between two blocks of changed lines but I think because in this case it is only called on a small part of the file the blank lines are not common enough to trigger that heuristic. I've got a patch[1] that stops the patience implementation falling back to the myers algorithm and just trims any leading and trailing context. On the whole it I think it gives more readable diffs but I've not got any systematic data to back that up. I also suspect there are pathological cases such as each line in the file being duplicated where the falling back to the myers algorithm gives a much better result.

> Replacing
> a multi-stanza block with another one may be the best explanation for
> what happened. Or the two stanzas may be independent, and showing the
> change for each one may be better.
>
> I'm not sure which one happens more often. And you'd probably want to
> weight it by how good/bad the change is. In the example I showed I don't
> find patience very much worse, since it's already a pretty ugly diff.
> But in cases where patience shines, it may be making things
> significantly more readable.

I agree that having some data would be useful if we're going to change the default but collecting it would entail quite a bit of work and as the scoring is subjective we'd want a few people doing it. It's great that someone has done that for the histogram algorithm in the paper Elijah cited.

> I don't have a super strong opinion, but I just wanted to chime in that
> it is not clear to me that patience/histogram is always a win over myers
> (yes, I know your examples were comparing patience vs histogram, but the
> larger thread is discussing the other).

Agreed, there are definitely cases where myers gives more readable diffs, I think if we're going to change the default the question we need to answer is which algorithm gives the best result most of the time.

Best Wishes

Phillip

[1] https://github.com/phillipwood/git/commits/pure-patience-diff

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi Ævar

On 07/02/2023 17:27, Ævar Arnfjörð Bjarmason wrote:
> > On Tue, Feb 07 2023, Phillip Wood wrote:
> >> This is slightly off topic but one thing I'd really like is a way to
>> tell diff use automatically use --diff-words on some files
>> (e.g. Documentation/*)
> > Unlike changing the algorithm, -U options, diff.orderFile etc. doing
> that would give you a diff that can't be applied with "git apply" or
> other tools that expect a valid unified diff.

That's a good point, we'd probably would want to guard it by checking if the output is going to a tty.

> So I can imagine that it would be neat in some contexts, but such a
> change would have much wider implications than options that tweak how a
> valid unified diff looks, or is generated.
> > We'd need some way to mark a diff as "for ad-hoc viewing only".
> > But as it sounds like you want this for git.git the
> Documentation/doc-diff script is much better than anything word-diff
> could spew out, as it diffs the resulting generated docs.

It's true that I should look at the doc-diff script for git, but it is a wider wish for text files in other projects as well.

Best Wishes

Phillip

> I wonder (but haven't tried) whether you can't "diff" that using the
> same method that can be used to diff binary files using a custom driver.
> > Hrm, except that in that case (with includes etc) there isn't really a
> 1=1 mapping between files within Documentation/ and generated docs (due
> to includes etc.). But I suppose it could be used only for those files
> that 1=1 correspond to the generated manpages.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Wed, Feb 15, 2023 at 02:44:59PM +0000, Phillip Wood wrote:

> Indeed. Ironically as there are no unique context lines in that example the
> blank lines are being matched by patience implementation falling back to the
> myers algorithm. Normally the myers implementation tries to avoid matching
> common context lines between two blocks of changed lines but I think because
> in this case it is only called on a small part of the file the blank lines
> are not common enough to trigger that heuristic. I've got a patch[1] that
> stops the patience implementation falling back to the myers algorithm and
> just trims any leading and trailing context. On the whole it I think it
> gives more readable diffs but I've not got any systematic data to back that
> up. I also suspect there are pathological cases such as each line in the
> file being duplicated where the falling back to the myers algorithm gives a
> much better result.

Ah, I should have suspected it was something like that (since one of the
purposes of patience is trying not to key on meaningless lines).

I tried your patch on my test case, and the result is even more readable
than the myers output, because it really was effectively a complete
rewrite of the function. It is, of course, not the minimal diff. I'm not
sure if there would be cases where you'd prefer the minimal. I guess if
each stanza's change really was independent of the others. But if there
is no commonality except for blank lines, I find it hard to imagine that
it's much worse to just treat the whole thing as a block.

Anyway, thank you (and Elijah) for explaining. I'm getting more
comfortable with the idea of switching the default.

-Peff

@gitgitgadget-git
Copy link

User Ævar Arnfjörð Bjarmason <avarab@gmail.com> has been added to the cc: list.

@@ -736,6 +736,29 @@ String::
by the configuration variables in the "diff.foo" section of the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Sun, Feb 05, 2023 at 03:46:21AM +0000, John Cai via GitGitGadget wrote:

> +`diff-algorithm`
> +^^^^^^^^^^^^^^^^
> +
> +The attribute `diff-algorithm` affects which algorithm Git uses to generate
> +diffs. This allows defining diff algorithms per file extension. Precedence rules
> +are as follows, in order from highest to lowest:
> +
> +*Command line option*
> +
> +Pass in the `--diff-algorithm` command line option int git-diff(1)
> +
> +*Git attributes*
> +
> +------------------------
> +*.json	diff-algorithm=histogram
> +------------------------
> +
> +*Git config*
> +
> +----------------------------------------------------------------
> +[diff]
> +	algorithm = histogram
> +----------------------------------------------------------------

From the user's perspective, this is weirdly inconsistent with the
existing diff attributes, which would be more like:

  # in .gitattributes
  *.json diff=json 

  # in config
  [diff "json"]
  algorithm = histogram

I know why one might choose the scheme you did; it kicks in if the repo
sets the algorithm, without users having to set up any extra config.
Which is sort of nice, if we assume that malicious actors don't have any
incentive to pick the algorithm. In theory they don't, though I saw Ævar
mention possible DoS elsewhere in the thread.

  Side note: It's also possible that algorithm selection could be
  required to trigger a separate security bug (say, a buffer overflow in
  the patience code or something), so restricting that works in a
  belt-and-suspenders way. But that somehow feels like like the wrong
  side of the paranoia-vs-feature line.

So I dunno. I recognize that this scheme fulfills your immediate needs
better, but I fear that we'll be stuck with a weird split between "diff"
and "diff-*" attributes forever. In the long run, having a way for the
repo to say "and here is some config I recommend to you" would give you
the best of both, but that is a challenging topic that has been
discussed and punted on for many years.

-Peff

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Tue, Feb 07 2023, Jeff King wrote:

> On Sun, Feb 05, 2023 at 03:46:21AM +0000, John Cai via GitGitGadget wrote:

> From the user's perspective, this is weirdly inconsistent with the
> existing diff attributes, which would be more like:
>
>   # in .gitattributes
>   *.json diff=json 
>
>   # in config
>   [diff "json"]
>   algorithm = histogram

That does look more elegant.

> I know why one might choose the scheme you did; it kicks in if the repo
> sets the algorithm, without users having to set up any extra config.
> Which is sort of nice, if we assume that malicious actors don't have any
> incentive to pick the algorithm. In theory they don't, though I saw Ævar
> mention possible DoS elsewhere in the thread.
>
>   Side note: It's also possible that algorithm selection could be
>   required to trigger a separate security bug (say, a buffer overflow in
>   the patience code or something), so restricting that works in a
>   belt-and-suspenders way. But that somehow feels like like the wrong
>   side of the paranoia-vs-feature line.
>
> So I dunno. I recognize that this scheme fulfills your immediate needs
> better, but I fear that we'll be stuck with a weird split between "diff"
> and "diff-*" attributes forever. In the long run, having a way for the
> repo to say "and here is some config I recommend to you" would give you
> the best of both, but that is a challenging topic that has been
> discussed and punted on for many years.

If (and I'm not taking a stance on whether this is the case here) we
think that a hypothetical .gitconfig in-repo combined with
.gitattributes would be more elegant for this or other cases, I don't
see why the path to some very limited version of that where we read
literally one whitelisted config variable from such a .gitconfig, and
ignore everything else, wouldn't be OK.

I.e. the more general in-repo .gitconfig discussion (of which there was
some discussion this past Git Merge in Chicago) includes potentially
much harder things

Like what to do with changing genreal in-repo config, if and how to
compare that against some local whitelist of what sort of config we
should trust (or none) etc. etc.

But if we agreed that we'd be willing to trust the remote with some
config-by-another-name unconditionally, as is being proposed here, we
could easily read that one thing from an in-repo .gitconfig, ignore
everything else, and then just document that interface as a
special-case.

We have several such "limited config" readers already, and the relevant
bits of the config parser already have to handle arbitrary "git config"
files in-tree, due to .gitmodules.

We even have special-snowflake config in the configspace already that
doesn't obey the usual rules: The trace2.* config is only read from the
--system config, as it has an inherent bootsrtapping problem in wanting
to log as early as possible.

The advantage of the limited in-repo .gitconfig would be that if we
think the interface was more elegant this would be a way to start small
in a way that would be future-proof as far as the permanent UX goes.

If there's interest the read_early_config(), read_very_early_config()
and gitmodules_config_oid() functions are good places to look.

The last one in particular could be generalized pretty easily to read a
limited in-tree .gitconfig.

Although plugging it into the "config order" might be tricky, I haven't
tried.

Even a minimal implementation of such a thing would probably want a "git
config --repository" (or whatever we call the flag) to go with
"--local", "--system" etc, to spew out something sensible from
"--show-origin" etc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> From the user's perspective, this is weirdly inconsistent with the
>> existing diff attributes, which would be more like:
>>
>>   # in .gitattributes
>>   *.json diff=json 
>>
>>   # in config
>>   [diff "json"]
>>   algorithm = histogram
>
> That does look more elegant.

We use attributes to define what it is, while configurations are
used to define what to do on different things.  The separation of
attributes and configuration came not from "elegance" or "security"
but from a lot more practical reasons.

For a tracked file, the fact that it contains JSON text as payload
does not change per user who cloned the project, or per platform the
user used to do so.  In-tree .gitattributes that the project
controls is a perfect way to define what it is for each file.

On the other hand, the diff program suitable to compare two JSON
files may vary per platform (your favorite Windows program may not
be available to me) and per user (a platform may support more than
one and the choice becomes the matter of personal taste).

The security aspect of giving users tighter control over which exact
programs are to be run by not allowing the attributes or so called
in-tree configuration mechansim is a small bonus that fell out as a
consequence.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Tue, Feb 07 2023, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> From the user's perspective, this is weirdly inconsistent with the
>>> existing diff attributes, which would be more like:
>>>
>>>   # in .gitattributes
>>>   *.json diff=json 
>>>
>>>   # in config
>>>   [diff "json"]
>>>   algorithm = histogram
>>
>> That does look more elegant.
>
> We use attributes to define what it is, while configurations are
> used to define what to do on different things.  The separation of
> attributes and configuration came not from "elegance" or "security"
> but from a lot more practical reasons.
>
> For a tracked file, the fact that it contains JSON text as payload
> does not change per user who cloned the project, or per platform the
> user used to do so.  In-tree .gitattributes that the project
> controls is a perfect way to define what it is for each file.
>
> On the other hand, the diff program suitable to compare two JSON
> files may vary per platform (your favorite Windows program may not
> be available to me) and per user (a platform may support more than
> one and the choice becomes the matter of personal taste).
>
> The security aspect of giving users tighter control over which exact
> programs are to be run by not allowing the attributes or so called
> in-tree configuration mechansim is a small bonus that fell out as a
> consequence.

To clarify, I'm not suggesting that we ever read arbitrary parts of the
"diff.<driver>.<key>" config space, but that we could whitelist one set
of "diff.<driver>.<known-key>"="<known-values>".

The reason to do it being that, as Jeff points out, that config
mechanism is already established, and arguably more elegant. I.e. that
this is a natural fit for "diff=<driver>" in .gitattributes), and that
mechanism is already tied to config as Jeff's example shows.

Some of your reply seems like it assumed that I was suggesting that we
read "diff.algorithm" (i.e. a config setting to apply for all paths)
from an in-repo .gitconfig.

I wasn't suggesting that, nor that we open Pandora's box on starting a
limited in-repo .gitconfig support with anything remotely to do with
executing arbitrary commands (which the full "diff.<driver>.<key>" space
does support).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> To clarify, I'm not suggesting that we ever read arbitrary parts of the
> "diff.<driver>.<key>" config space, but that we could whitelist one set
> of "diff.<driver>.<known-key>"="<known-values>".

When the value names the path to an executable or the command line
to invoke a program, there is no "portable" value that is useful.
Whitelisting macOS only program only because its pathname is one of
the known values does not help me running something else.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Tue, Feb 07 2023, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> To clarify, I'm not suggesting that we ever read arbitrary parts of the
>> "diff.<driver>.<key>" config space, but that we could whitelist one set
>> of "diff.<driver>.<known-key>"="<known-values>".
>
> When the value names the path to an executable or the command line
> to invoke a program, there is no "portable" value that is useful.
> Whitelisting macOS only program only because its pathname is one of
> the known values does not help me running something else.

We're not talking about invoking an executable of any kind, but a
mechanism to pick up a "diff.algorithm" setting referring to one of our
built-in algorithms, either via the in-repo .gitattributes, as in John's
original proposal:

	*.json diff-algorithm=histogram

Or via some hybrid .gitattributes/.gitconfig mechanism, where (if I
understand Jeff's suggestion correctly) the two would contain:

	# In .gitattributes
	*.json diff=json 
	# In .gitconfig
	[diff "json"]
	algorithm = histogram

In terms of implementation this would be pretty much what we do with
"submodule.<name>.update", where "Allowed values here are checkout,
rebase, merge or none.".

I don't see where you're getting this suggestion of "[a] diff program
suitable to compare two JSON files" from something I said.

That is a thing we generally support, but I've only mentioned that
arbitrary command execution (e.g. in [1]) as something we explicitly
*don't* want to support in this context.

The "submodule.<name>.update" example is particularly relevant because
it also supports arbitrary command execution with "!"-prefixed values
(the rest being the arbitrary command).

But that's something we specifically exclude when reading the in-repo
version, it's only allowed in the user-controlled config.

I'm not saying that we should be going for the read-just-the-one-key
in-repo .gitconfig approach here, just *if* we like that interface
better the implementation should be relatively straightforward, assuming
that we like the proposal in principle, and are just bikeshedding about
what the mecanhism to enable it would be.

The advantage of that being that it more naturally slots into existing
config, e.g. "diff.<driver>.binary=<bool>", this being just another
"diff.<driver>.<known-key>".

And that the way to do that securely is something we're doing for
several "submodule.*" keys, so if we pick that approach tweaking or
stealing ideas from that code should be relatively straightforward.

1. https://lore.kernel.org/git/230206.865yce7n1w.gmgdl@evledraar.gmail.com/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, John Cai wrote (reply to this):

On 23/02/07 12:56PM, Jeff King wrote:
> On Sun, Feb 05, 2023 at 03:46:21AM +0000, John Cai via GitGitGadget wrote:
> 
> > +`diff-algorithm`
> > +^^^^^^^^^^^^^^^^
> > +
> > +The attribute `diff-algorithm` affects which algorithm Git uses to generate
> > +diffs. This allows defining diff algorithms per file extension. Precedence rules
> > +are as follows, in order from highest to lowest:
> > +
> > +*Command line option*
> > +
> > +Pass in the `--diff-algorithm` command line option int git-diff(1)
> > +
> > +*Git attributes*
> > +
> > +------------------------
> > +*.json	diff-algorithm=histogram
> > +------------------------
> > +
> > +*Git config*
> > +
> > +----------------------------------------------------------------
> > +[diff]
> > +	algorithm = histogram
> > +----------------------------------------------------------------
> 
> From the user's perspective, this is weirdly inconsistent with the
> existing diff attributes, which would be more like:
> 
>   # in .gitattributes
>   *.json diff=json 
> 
>   # in config
>   [diff "json"]
>   algorithm = histogram

Thanks for this suggestion, Peff. What I like about this is that it builds off
of the existing diff.<driver> scheme rather than inventing another one.
Additionally, we won't get hit with a performance penalty since we already read
gitattrbitues to see if a driver has been set or not.

Thinking out loud, if we add "algorithm" as a key for diff.<driver>, it would be
mutually exclusive with "command" where "command" takes precedence, correct?

> 
> I know why one might choose the scheme you did; it kicks in if the repo
> sets the algorithm, without users having to set up any extra config.
> Which is sort of nice, if we assume that malicious actors don't have any
> incentive to pick the algorithm. In theory they don't, though I saw Ævar
> mention possible DoS elsewhere in the thread.
> 
>   Side note: It's also possible that algorithm selection could be
>   required to trigger a separate security bug (say, a buffer overflow in
>   the patience code or something), so restricting that works in a
>   belt-and-suspenders way. But that somehow feels like like the wrong
>   side of the paranoia-vs-feature line.
> 
> So I dunno. I recognize that this scheme fulfills your immediate needs
> better, but I fear that we'll be stuck with a weird split between "diff"
> and "diff-*" attributes forever. In the long run, having a way for the
> repo to say "and here is some config I recommend to you" would give you
> the best of both, but that is a challenging topic that has been
> discussed and punted on for many years.
> 
> -Peff

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Thu, Feb 09, 2023 at 11:34:00AM -0500, John Cai wrote:

> > From the user's perspective, this is weirdly inconsistent with the
> > existing diff attributes, which would be more like:
> > 
> >   # in .gitattributes
> >   *.json diff=json 
> > 
> >   # in config
> >   [diff "json"]
> >   algorithm = histogram
> 
> Thanks for this suggestion, Peff. What I like about this is that it builds off
> of the existing diff.<driver> scheme rather than inventing another one.
> Additionally, we won't get hit with a performance penalty since we already read
> gitattrbitues to see if a driver has been set or not.
> 
> Thinking out loud, if we add "algorithm" as a key for diff.<driver>, it would be
> mutually exclusive with "command" where "command" takes precedence, correct?

Yes. I think the documentation would be something like "When generating
a diff internally, use <algorithm> to do so." And external diffs
obviously skip that code path internally.

I didn't check whether your patch does this or not, but should this
feature (however it is engaged) apply to diff-stats, too? I think
--patience, etc, does, which makes sense. But that might be something
worth elaborating in the description, too.

-Peff

@gitgitgadget-git
Copy link

User Jeff King <peff@peff.net> has been added to the cc: list.

@gitgitgadget-git
Copy link

There are issues in commit 94b5ef6:
diff: Teach diff to read attribute diff-algorithm
Prefixed commit message must be in lower case

@gitgitgadget-git
Copy link

There are issues in commit 65c8b2d:
wip: read config
Commit checks stopped - the message is too short
Commit not signed off

@gitgitgadget-git
Copy link

There are issues in commit 5d81c3f:
wip: read config
Commit checks stopped - the message is too short
Commit not signed off

@gitgitgadget-git
Copy link

User Elijah Newren <newren@gmail.com> has been added to the cc: list.

@john-cai john-cai force-pushed the jc/attr-diff-algo branch 5 times, most recently from 6ff3dbc to cb03056 Compare February 14, 2023 21:04
@john-cai
Copy link
Contributor Author

/preview

@gitgitgadget-git
Copy link

Preview email sent as pull.1452.v2.git.git.1676409650.gitgitgadget@gmail.com

@john-cai
Copy link
Contributor Author

/submit

@LoboVerga

This comment was marked as off-topic.

@john-cai
Copy link
Contributor Author

/preview

@gitgitgadget-git
Copy link

Preview email sent as pull.1452.v4.git.git.1676912099.gitgitgadget@gmail.com

@john-cai
Copy link
Contributor Author

/preview

@gitgitgadget-git
Copy link

Preview email sent as pull.1452.v4.git.git.1676926635.gitgitgadget@gmail.com

@john-cai
Copy link
Contributor Author

/submit

@gitgitgadget-git
Copy link

Submitted as pull.1452.v4.git.git.1676927082.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1452/john-cai/jc/attr-diff-algo-v4

To fetch this version to local tag pr-git-1452/john-cai/jc/attr-diff-algo-v4:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1452/john-cai/jc/attr-diff-algo-v4

@gitgitgadget-git
Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

"John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:

> When a repository contains different kinds of files, it may be desirable to
> use different algorithms based on file type. This is currently not feasible
> through the command line or using git configs. However, we can leverage the
> fact that gitattributes are path aware.
> ...
> To address some of the performance concerns in the previous series, a
> benchmark shows that now only a minor performance penalty is incurred, now
> that we are no longer adding an additional attributes parsing call:
>
> $ echo "*.[ch] diff=other" >> .gitattributes $ hyperfine -r 10 -L a
> git-bin-wrapper,git '{a} -c diff.other.algorithm=myers diff v2.0.0 v2.28.0'
> Benchmark 1: git-bin-wrapper -c diff.other.algorithm=myers diff v2.0.0
> v2.28.0 Time (mean ± σ): 716.3 ms ± 3.8 ms [User: 660.2 ms, System: 50.8 ms]
> Range (min … max): 709.8 ms … 720.6 ms 10 runs
>
> Benchmark 2: git -c diff.other.algorithm=myers diff v2.0.0 v2.28.0 Time
> (mean ± σ): 704.3 ms ± 2.9 ms [User: 656.6 ms, System: 44.3 ms] Range (min …
> max): 700.1 ms … 708.6 ms 10 runs
>
> Summary 'git -c diff.other.algorithm=myers diff v2.0.0 v2.28.0' ran 1.02 ±
> 0.01 times faster than 'git-bin-wrapper -c diff.other.algorithm=myers diff
> v2.0.0 v2.28.0'

Hopefully this round can immediately be merged down to 'next'?
Thanks.

@gitgitgadget-git
Copy link

On the Git mailing list, Elijah Newren wrote (reply to this):

On Tue, Feb 21, 2023 at 9:34 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "John Cai via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > When a repository contains different kinds of files, it may be desirable to
> > use different algorithms based on file type. This is currently not feasible
> > through the command line or using git configs. However, we can leverage the
> > fact that gitattributes are path aware.
> > ...
> > To address some of the performance concerns in the previous series, a
> > benchmark shows that now only a minor performance penalty is incurred, now
> > that we are no longer adding an additional attributes parsing call:
> >
> > $ echo "*.[ch] diff=other" >> .gitattributes $ hyperfine -r 10 -L a
> > git-bin-wrapper,git '{a} -c diff.other.algorithm=myers diff v2.0.0 v2.28.0'
> > Benchmark 1: git-bin-wrapper -c diff.other.algorithm=myers diff v2.0.0
> > v2.28.0 Time (mean ± σ): 716.3 ms ± 3.8 ms [User: 660.2 ms, System: 50.8 ms]
> > Range (min … max): 709.8 ms … 720.6 ms 10 runs
> >
> > Benchmark 2: git -c diff.other.algorithm=myers diff v2.0.0 v2.28.0 Time
> > (mean ± σ): 704.3 ms ± 2.9 ms [User: 656.6 ms, System: 44.3 ms] Range (min …
> > max): 700.1 ms … 708.6 ms 10 runs
> >
> > Summary 'git -c diff.other.algorithm=myers diff v2.0.0 v2.28.0' ran 1.02 ±
> > 0.01 times faster than 'git-bin-wrapper -c diff.other.algorithm=myers diff
> > v2.0.0 v2.28.0'
>
> Hopefully this round can immediately be merged down to 'next'?
> Thanks.

I'll leave that up to you and John, but are we risking merging code
that could go unused or that we need to fundamentally change?  I don't
see how to handle the issues over at
https://lore.kernel.org/git/647D3D49-B85B-4B66-A857-695CFF9685EE@gmail.com/
(not that I've spent much thought on it besides asking how things are
going to work), and would be worried that we'd end up needing to
change the mechanism somehow to cover the things John ultimately
wants.

@gitgitgadget-git
Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes:

> I'll leave that up to you and John, but are we risking merging code
> that could go unused or that we need to fundamentally change?  I don't
> see how to handle the issues over at
> https://lore.kernel.org/git/647D3D49-B85B-4B66-A857-695CFF9685EE@gmail.com/

If this is useful enough for desktop users already, then that is a
good enough reason to take it, I would say.

GitLab can easily add WebUI that says "You can define what diff
algorithm is used for files with which suffix" to allow you to
configure a table like this:

	json | histogram
	c    | patience
	*    | myers

and populate the server-side equivalent of .git/info/attributes and
.git/config based on that, and without anything further than the
posted patches, the result should just work, no?

@gitgitgadget-git
Copy link

On the Git mailing list, John Cai wrote (reply to this):

On 21 Feb 2023, at 13:51, Junio C Hamano wrote:

> Elijah Newren <newren@gmail.com> writes:
>
>> I'll leave that up to you and John, but are we risking merging code
>> that could go unused or that we need to fundamentally change?  I don't
>> see how to handle the issues over at
>> https://lore.kernel.org/git/647D3D49-B85B-4B66-A857-695CFF9685EE@gmail.com/
>
> If this is useful enough for desktop users already, then that is a
> good enough reason to take it, I would say.
>
> GitLab can easily add WebUI that says "You can define what diff
> algorithm is used for files with which suffix" to allow you to
> configure a table like this:
>
> 	json | histogram
> 	c    | patience
> 	*    | myers
>
> and populate the server-side equivalent of .git/info/attributes and
> .git/config based on that, and without anything further than the
> posted patches, the result should just work, no?

Yes, that is my thinking too. The goal was to make this user friendly for
everyone on the Git command line, and then GitLab can hook into this accordingly
to make it ussable with the GitLab UI.

thanks
John

@gitgitgadget-git
Copy link

On the Git mailing list, Elijah Newren wrote (reply to this):

On Tue, Feb 21, 2023 at 11:36 AM John Cai <johncai86@gmail.com> wrote:
>
> On 21 Feb 2023, at 13:51, Junio C Hamano wrote:
>
> > Elijah Newren <newren@gmail.com> writes:
> >
> >> I'll leave that up to you and John, but are we risking merging code
> >> that could go unused or that we need to fundamentally change?  I don't
> >> see how to handle the issues over at
> >> https://lore.kernel.org/git/647D3D49-B85B-4B66-A857-695CFF9685EE@gmail.com/
> >
> > If this is useful enough for desktop users already, then that is a
> > good enough reason to take it, I would say.
> >
> > GitLab can easily add WebUI that says "You can define what diff
> > algorithm is used for files with which suffix" to allow you to
> > configure a table like this:
> >
> >       json | histogram
> >       c    | patience
> >       *    | myers
> >
> > and populate the server-side equivalent of .git/info/attributes and
> > .git/config based on that, and without anything further than the
> > posted patches, the result should just work, no?
>
> Yes, that is my thinking too. The goal was to make this user friendly for
> everyone on the Git command line, and then GitLab can hook into this accordingly
> to make it ussable with the GitLab UI.

Ok, sounds good.  Merge away.  :-)

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 14a1eb9.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via e00c58a.

@gitgitgadget-git
Copy link

This patch series was integrated into next via 2375640.

@gitgitgadget-git
Copy link

There was a status update in the "New Topics" section about the branch jc/diff-algo-attribute on the Git mailing list:

The "diff" drivers specified by the "diff" attribute attached to
paths can now specify which algorithm (e.g. histogram) to use.

Will merge to 'master'.
source: <pull.1452.v4.git.git.1676927082.gitgitgadget@gmail.com>

@gitgitgadget-git
Copy link

This patch series was integrated into seen via b32135d.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via d05da27.

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch jc/diff-algo-attribute on the Git mailing list:

The "diff" drivers specified by the "diff" attribute attached to
paths can now specify which algorithm (e.g. histogram) to use.

Will merge to 'master'.
source: <pull.1452.v4.git.git.1676927082.gitgitgadget@gmail.com>

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 01f1ff8.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 5fa284c.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via a0134ff.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via ece8dc9.

@gitgitgadget-git
Copy link

This patch series was integrated into master via ece8dc9.

@gitgitgadget-git
Copy link

This patch series was integrated into next via ece8dc9.

@gitgitgadget-git
Copy link

Closed via ece8dc9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants