Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization batch 7: use file basenames to guide rename detection #843

Closed

Conversation

newren
Copy link

@newren newren commented Jan 24, 2021

This series depends on ort-perf-batch-6. It appears Junio has appended an earlier round of this series on that one and called the combined series en/diffcore-rename. I'm still resubmitting the two separately to preserve the threaded discussion in the archives and because gitgitgadget can provide proper range-diffs that way.

This series uses file basenames (portion of the path after final '/', including extension) in a basic fashion to guide rename detection.

Changes since v4:

  • add wording to make it clearer that we are considering remaining basenames after exact rename detection
  • add three minor optimizations to patch 3. (All three will have to be undone by the next series, but this series is probably clearer with them.)
  • a typo fix or two
  • v2 of ort-perf-batch-6 added some changes around consistency of rename_src_nr; make similar changes in using this variable in find_basename_changes() for consistency
  • fix the testcase so the expected comments about the change in behavior only show up after we change the behavior
  • attempt a rewrite of the commit message for the new testcase, who knows if I'll get it right this time.

CC: Derrick Stolee dstolee@microsoft.com
CC: Jonathan Tan jonathantanmy@google.com
CC: Taylor Blau me@ttaylorr.com
CC: Junio C Hamano gitster@pobox.com
CC: Jeff King peff@peff.net
cc: Elijah Newren newren@gmail.com
cc: Derrick Stolee stolee@gmail.com

@newren
Copy link
Author

newren commented Feb 6, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 6, 2021

Submitted as pull.843.git.1612651937.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v1

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v1

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 7, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series depends on ort-perf-batch-6[1], which has not yet appeared in
> seen despite being reviewed by both Junio and Stolee.

It is because that one depends on something not in, but soon about
to go in, 'master' and I didn't want to add to "this topic depends
on top of that other topic" mess.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 7, 2021

On the Git mailing list, Elijah Newren wrote (reply to this):

On Sat, Feb 6, 2021 at 9:19 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This series depends on ort-perf-batch-6[1], which has not yet appeared in
> > seen despite being reviewed by both Junio and Stolee.
>
> It is because that one depends on something not in, but soon about
> to go in, 'master' and I didn't want to add to "this topic depends
> on top of that other topic" mess.

Ah, makes sense.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 7, 2021

User Elijah Newren <newren@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 7, 2021

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

@newren
Copy link
Author

newren commented Feb 9, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 9, 2021

Submitted as pull.843.v2.git.1612870326.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v2

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v2

@newren
Copy link
Author

newren commented Feb 10, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 10, 2021

Submitted as pull.843.v3.git.1612970140.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v3

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v3:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v3

@newren
Copy link
Author

newren commented Feb 11, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 11, 2021

Submitted as pull.843.v4.git.1613031350.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v4

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v4:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v4

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 12, 2021

This branch is now known as en/diffcore-rename.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 12, 2021

This patch series was integrated into seen via git@84f0c41.

@gitgitgadget gitgitgadget bot added the seen label Feb 12, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Feb 12, 2021

This patch series was integrated into seen via git@72384d7.

@@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
grep "myotherfile.*myfile" actual
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> Add a simple test where a removed file is similar to two different added
> files; one of them has the same basename, and the other has a slightly
> higher content similarity.  Without break detection, filename similarity
> of 100% trumps content similarity for pairing up related files.  For
> any filename similarity less than 100%, the opposite is true -- content
> similarity is all that matters.  Add a testcase that documents this.

I am not sure why it is the "opposite".  When contents are similar
to the same degree of 100%, we tiebreak with the filename.  We never
favor a pair between the same filename over a pair between different
filenames with better content similarity.

And when contents are similar to the same degree of less than 100%,
we do not favor a pair between the same filename over a pair between
different filenames, as long as they are similar to the same degree.

So, I do not think "opposite" is helping readers to understand what
is going on.

> +test_expect_success 'basename similarity vs best similarity' '
> +	mkdir subdir &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			 line6 line7 line8 line9 line10 >subdir/file.txt &&
> +	git add subdir/file.txt &&
> +	git commit -m "base txt" &&
> +
> +	git rm subdir/file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 >file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 line9 >file.md &&
> +	git add file.txt file.md &&
> +	git commit -a -m "rename" &&
> +	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
> +	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
> +	# but since same basenames are checked first...

I am not sure what the second line of this comment wants to imply
with the ellipses here.  Care to finish the sentence?

Or was the second line planned to be added when we start applying
the "check only the same filename first and see if we find a
better-than-reasonable match" heuristics but somehow survived
"rebase -i" and ended up here?

> +	cat >expected <<-\EOF &&
> +	R088	subdir/file.txt	file.md
> +	A	file.txt
> +	EOF
> +	test_cmp expected actual

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Feb 12, 2021 at 5:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > Add a simple test where a removed file is similar to two different added
> > files; one of them has the same basename, and the other has a slightly
> > higher content similarity.  Without break detection, filename similarity
> > of 100% trumps content similarity for pairing up related files.  For
> > any filename similarity less than 100%, the opposite is true -- content
> > similarity is all that matters.  Add a testcase that documents this.
>
> I am not sure why it is the "opposite".  When contents are similar
> to the same degree of 100%, we tiebreak with the filename.  We never
> favor a pair between the same filename over a pair between different
> filenames with better content similarity.

This is not true.  If src/main.c is 99% similar to src/foo.c, and is
0% similar to the src/main.c in the new commit, we match the old
src/main.c to the new src/main.c despite being far more similar
src/foo.c.  Unless break detection is turned on, we do not allow
content similarity to trump (full) filename equality.

> And when contents are similar to the same degree of less than 100%,
> we do not favor a pair between the same filename over a pair between
> different filenames, as long as they are similar to the same degree.

This is also not true; we tiebreak with filenames for inexact renames
just like we do for exact renames (note that basename_same() is called
both from find_identical_files() and from the nested loop where
inexact rename detection is done).

> So, I do not think "opposite" is helping readers to understand what
> is going on.
>
> > +test_expect_success 'basename similarity vs best similarity' '
> > +     mkdir subdir &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                      line6 line7 line8 line9 line10 >subdir/file.txt &&
> > +     git add subdir/file.txt &&
> > +     git commit -m "base txt" &&
> > +
> > +     git rm subdir/file.txt &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                       line6 line7 line8 >file.txt &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                       line6 line7 line8 line9 >file.md &&
> > +     git add file.txt file.md &&
> > +     git commit -a -m "rename" &&
> > +     git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
> > +     # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
> > +     # but since same basenames are checked first...
>
> I am not sure what the second line of this comment wants to imply
> with the ellipses here.  Care to finish the sentence?
>
> Or was the second line planned to be added when we start applying
> the "check only the same filename first and see if we find a
> better-than-reasonable match" heuristics but somehow survived
> "rebase -i" and ended up here?

Oops, indeed; that is precisely what happened.  Will fix.

> > +     cat >expected <<-\EOF &&
> > +     R088    subdir/file.txt file.md
> > +     A       file.txt
> > +     EOF
> > +     test_cmp expected actual
>
> Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes:

> This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> 0% similar to the src/main.c in the new commit, we match the old
> src/main.c to the new src/main.c despite being far more similar
> src/foo.c.  Unless break detection is turned on, we do not allow
> content similarity to trump (full) filename equality.

Absolutely.  And we are talking about a new optimization that kicks
in only when there is no break or no copy detection going on, no?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > 0% similar to the src/main.c in the new commit, we match the old
> > src/main.c to the new src/main.c despite being far more similar
> > src/foo.c.  Unless break detection is turned on, we do not allow
> > content similarity to trump (full) filename equality.
>
> Absolutely.  And we are talking about a new optimization that kicks
> in only when there is no break or no copy detection going on, no?

Yes, precisely, we are only considering cases without break
detection...and thus we are considering cases where for the last 15
years or more, sufficiently large filename similarity (an exact
fullname match) trumps any level of content similarity.  I think it is
useful to note that while my optimization is adding more
considerations that can overrule maximal content similarity, it is not
the first such code choice to do that.

But let me back up a bit...

When I submitted the series, you and Stolee went into a long
discussion about an optimization that I didn't submit, one that feels
looser on "matching" than anything I submitted, and which I think
might counter-intuitively reduce performance rather than aid it.  (The
performance side only comes into view in combination with later
series, but it was why I harped so much since then on only comparing
against at most one other file in the steps before full inexact rename
detection.)  I was quite surprised by the diversion, but it made it
clear to me that my descriptions and commit messages were far too
vague and could be read to imply a completely different algorithm than
I intended.  So, I tried to be far more careful in subsequent
iterations by adding wider context and contrasts.

Further, after I wrote various things to try to clarify the
misunderstandings, I noticed that Stolee picked out one thing and
stated that "This idea of optimizing first for 100% filename
similarity is a good perspective on Git's rename detection algorithm."
(see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
 So, that particular point seemed to help him understand more, and
thus might be useful extra context for others reading along now or in
the future.

Given all the above, I was trying to address earlier misunderstandings
and provide more context.  Perhaps I swung the pendulum too far and
talked too much about other cases, or perhaps I just worded things
poorly again.  All I was attempting to do in the commit message was
point out the multiple basic rules with filename and content
similarity, to lay the groundwork for new rules that do alternative
weightings.

Anyway, I've added a few more tweaks to try to improve the wording for
the next round I'll submit today.  Given my track record so far, it
would not be surprising if it still needed more tweaks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

I do not consider "the same file changed in place" the same as "we
seem to have lost a file in the old tree, ah, we found one that has
the same basename in a different directory" at all, so your argument
still does not make any sense to me, sorry.

2021年2月13日(土) 17:25 Elijah Newren <newren@gmail.com>:
>
> On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > > 0% similar to the src/main.c in the new commit, we match the old
> > > src/main.c to the new src/main.c despite being far more similar
> > > src/foo.c.  Unless break detection is turned on, we do not allow
> > > content similarity to trump (full) filename equality.
> >
> > Absolutely.  And we are talking about a new optimization that kicks
> > in only when there is no break or no copy detection going on, no?
>
> Yes, precisely, we are only considering cases without break
> detection...and thus we are considering cases where for the last 15
> years or more, sufficiently large filename similarity (an exact
> fullname match) trumps any level of content similarity.  I think it is
> useful to note that while my optimization is adding more
> considerations that can overrule maximal content similarity, it is not
> the first such code choice to do that.
>
> But let me back up a bit...
>
> When I submitted the series, you and Stolee went into a long
> discussion about an optimization that I didn't submit, one that feels
> looser on "matching" than anything I submitted, and which I think
> might counter-intuitively reduce performance rather than aid it.  (The
> performance side only comes into view in combination with later
> series, but it was why I harped so much since then on only comparing
> against at most one other file in the steps before full inexact rename
> detection.)  I was quite surprised by the diversion, but it made it
> clear to me that my descriptions and commit messages were far too
> vague and could be read to imply a completely different algorithm than
> I intended.  So, I tried to be far more careful in subsequent
> iterations by adding wider context and contrasts.
>
> Further, after I wrote various things to try to clarify the
> misunderstandings, I noticed that Stolee picked out one thing and
> stated that "This idea of optimizing first for 100% filename
> similarity is a good perspective on Git's rename detection algorithm."
> (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
>  So, that particular point seemed to help him understand more, and
> thus might be useful extra context for others reading along now or in
> the future.
>
> Given all the above, I was trying to address earlier misunderstandings
> and provide more context.  Perhaps I swung the pendulum too far and
> talked too much about other cases, or perhaps I just worded things
> poorly again.  All I was attempting to do in the commit message was
> point out the multiple basic rules with filename and content
> similarity, to lay the groundwork for new rules that do alternative
> weightings.
>
> Anyway, I've added a few more tweaks to try to improve the wording for
> the next round I'll submit today.  Given my track record so far, it
> would not be surprising if it still needed more tweaks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Sat, Feb 13, 2021 at 5:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> I do not consider "the same file changed in place" the same as "we
> seem to have lost a file in the old tree, ah, we found one that has
> the same basename in a different directory" at all, so your argument
> still does not make any sense to me, sorry.

I'm not set on the commit message wording, you asked why I had used
the terms I did and I tried to explain.  I also explained how the
wording seemed to have helped Stolee understand.

If you'd like to suggest an alternative commit message, I'm happy to take it.

> 2021年2月13日(土) 17:25 Elijah Newren <newren@gmail.com>:
> >
> > On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Elijah Newren <newren@gmail.com> writes:
> > >
> > > > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > > > 0% similar to the src/main.c in the new commit, we match the old
> > > > src/main.c to the new src/main.c despite being far more similar
> > > > src/foo.c.  Unless break detection is turned on, we do not allow
> > > > content similarity to trump (full) filename equality.
> > >
> > > Absolutely.  And we are talking about a new optimization that kicks
> > > in only when there is no break or no copy detection going on, no?
> >
> > Yes, precisely, we are only considering cases without break
> > detection...and thus we are considering cases where for the last 15
> > years or more, sufficiently large filename similarity (an exact
> > fullname match) trumps any level of content similarity.  I think it is
> > useful to note that while my optimization is adding more
> > considerations that can overrule maximal content similarity, it is not
> > the first such code choice to do that.
> >
> > But let me back up a bit...
> >
> > When I submitted the series, you and Stolee went into a long
> > discussion about an optimization that I didn't submit, one that feels
> > looser on "matching" than anything I submitted, and which I think
> > might counter-intuitively reduce performance rather than aid it.  (The
> > performance side only comes into view in combination with later
> > series, but it was why I harped so much since then on only comparing
> > against at most one other file in the steps before full inexact rename
> > detection.)  I was quite surprised by the diversion, but it made it
> > clear to me that my descriptions and commit messages were far too
> > vague and could be read to imply a completely different algorithm than
> > I intended.  So, I tried to be far more careful in subsequent
> > iterations by adding wider context and contrasts.
> >
> > Further, after I wrote various things to try to clarify the
> > misunderstandings, I noticed that Stolee picked out one thing and
> > stated that "This idea of optimizing first for 100% filename
> > similarity is a good perspective on Git's rename detection algorithm."
> > (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
> >  So, that particular point seemed to help him understand more, and
> > thus might be useful extra context for others reading along now or in
> > the future.
> >
> > Given all the above, I was trying to address earlier misunderstandings
> > and provide more context.  Perhaps I swung the pendulum too far and
> > talked too much about other cases, or perhaps I just worded things
> > poorly again.  All I was attempting to do in the commit message was
> > point out the multiple basic rules with filename and content
> > similarity, to lay the groundwork for new rules that do alternative
> > weightings.
> >
> > Anyway, I've added a few more tweaks to try to improve the wording for
> > the next round I'll submit today.  Given my track record so far, it
> > would not be surprising if it still needed more tweaks.

@@ -367,6 +367,67 @@ static int find_exact_renames(struct diff_options *options)
return renames;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +MAYBE_UNUSED
> +static int find_basename_matches(struct diff_options *options,
> +				 int minimum_score,
> +				 int num_src)
> +{
> +	int i;
> +	struct strintmap sources;
> +	struct strintmap dests;
> +
> +	/* Create maps of basename -> fullname(s) for sources and dests */
> +	strintmap_init_with_options(&sources, -1, NULL, 0);
> +	strintmap_init_with_options(&dests, -1, NULL, 0);
> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		const char *base;
> +
> +		/* exact renames removed in remove_unneeded_paths_from_src() */
> +		assert(!rename_src[i].p->one->rename_used);
> +
> +		/* Record index within rename_src (i) if basename is unique */
> +		base = get_basename(filename);
> +		if (strintmap_contains(&sources, base))
> +			strintmap_set(&sources, base, -1);
> +		else
> +			strintmap_set(&sources, base, i);
> +	}
> +	for (i = 0; i < rename_dst_nr; ++i) {
> +		char *filename = rename_dst[i].p->two->path;
> +		const char *base;
> +
> +		if (rename_dst[i].is_rename)
> +			continue; /* involved in exact match already. */
> +
> +		/* Record index within rename_dst (i) if basename is unique */
> +		base = get_basename(filename);
> +		if (strintmap_contains(&dests, base))
> +			strintmap_set(&dests, base, -1);
> +		else
> +			strintmap_set(&dests, base, i);
> +	}
> +
> +	/* TODO: Make use of basenames source and destination basenames */

;-)

So at this point sources and dests can be used to quickly look up,
given a filename, if there is a single src among all sources, and a
single dst among all dests, that have the filename.

I wonder if the second loop over destinations can be "optimized"
further by using the sources map, though.  The reason you quash
entries with -1 when you see second instance of the same name is
because you intend to limit the heuristics only to a uniquely named
file among the removed files going to a uniquely named file among
the added files, right?  So even if a name is unique among dests,
if that name has duplicates on the source side, there is no point
recording its location.  i.e.

	/* record index within dst if it is unique in both dst and src */
	base = get_basename(filename);
	if (strintmap_contains(&sources, base) ||
	    strintmap_contains(&dests, base))
		strintmap_set(&dests, base, -1);
	else
		strintmap_set(&dests, base, i);

perhaps?

I guess it depends on what actually will be written in this "TODO"
space how effective such a change would be.  Presumably, you'd
iterate over &sources while skipping entries that record -1, to
learn (basename, i), and use the basename found there to consult
&dests to see if it yields a non-negative integer j, to notice that
rename_src[i] is a good candidate to match rename_dst[j].  If that
is the case, then such a change won't help as an optimization at
all, as we'd need to consult &dests map with the basename anyway,
so let's scratch the above idea.

In any case, after we walk over rename_src[] and rename_dst[] once,
the number of entries in &sources would be smaller than rename_src[]
so iterating over &sources, hunting for entries that record
non-negative index into rename_src[] would hopefully be cheaper than
the naive loop we've been working with.  I like the idea of using the
strintmap for this part of the code.

Thanks.


> +	strintmap_clear(&sources);
> +	strintmap_clear(&dests);
> +
> +	return 0;
> +}
> +
>  #define NUM_CANDIDATE_PER_DST 4
>  static void record_if_better(struct diff_score m[], struct diff_score *o)
>  {

@@ -367,6 +367,156 @@ static int find_exact_renames(struct diff_options *options)
return renames;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	/* Now look for basename matchups and do similarity estimation */
> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		const char *base = NULL;
> +		intptr_t src_index;
> +		intptr_t dst_index;
> +
> +		/* Find out if this basename is unique among sources */
> +		base = get_basename(filename);
> +		src_index = strintmap_get(&sources, base);
> +		if (src_index == -1)
> +			continue; /* not a unique basename; skip it */
> +		assert(src_index == i);
> +
> +		if (strintmap_contains(&dests, base)) {
> +			struct diff_filespec *one, *two;
> +			int score;
> +
> +			/* Find out if this basename is unique among dests */
> +			dst_index = strintmap_get(&dests, base);
> +			if (dst_index == -1)
> +				continue; /* not a unique basename; skip it */

It would be a lot easier to read if "we must have the same singleton
in dests" in a single if condition, I suspect.  I.e.

		if (strintmap_contains(&dests, base) &&
		    0 <= (dst_index = (strintmap_get(&dests, base)))) {

It is a bit sad that we iterate over rename_src[] array, even though
we now have a map that presumably have fewer number of entries than
the original array, though.

> +			/* Ignore this dest if already used in a rename */
> +			if (rename_dst[dst_index].is_rename)
> +				continue; /* already used previously */

Since we will only be matching between unique entries in src and
dst, this "this has been used, so we cannot use it" will not change
during this loop.  I wonder if the preparation done in the previous
step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e.  a
dst that has already been used (in the previous "exact" step) would
not even have to be in &dests map, so that the strintmap_contains()
check can reject it much earlier.

Stepping back a bit, it appears to me that [2/5] and [3/5] considers
a source file having unique basename among the sources even if there
are many such files with the same basename, as long as all the other
files with the same basename have been matched in the previous
"exact" phase.  It probably does the same thing for destination
side.

Intended?

It feels incompatible with the spirit of these two steps aim for
(i.e. only use this optimization on a pair of src/dst with UNIQUE
basenames).  For the purpose of "we only handle unique ones", the
paths that already have matched should participate in deciding if
the files that survived "exact" phase have unique basename among
the original inpu?

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Feb 12, 2021 at 5:48 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +     /* Now look for basename matchups and do similarity estimation */
> > +     for (i = 0; i < num_src; ++i) {
> > +             char *filename = rename_src[i].p->one->path;
> > +             const char *base = NULL;
> > +             intptr_t src_index;
> > +             intptr_t dst_index;
> > +
> > +             /* Find out if this basename is unique among sources */
> > +             base = get_basename(filename);
> > +             src_index = strintmap_get(&sources, base);
> > +             if (src_index == -1)
> > +                     continue; /* not a unique basename; skip it */
> > +             assert(src_index == i);
> > +
> > +             if (strintmap_contains(&dests, base)) {
> > +                     struct diff_filespec *one, *two;
> > +                     int score;
> > +
> > +                     /* Find out if this basename is unique among dests */
> > +                     dst_index = strintmap_get(&dests, base);
> > +                     if (dst_index == -1)
> > +                             continue; /* not a unique basename; skip it */
>
> It would be a lot easier to read if "we must have the same singleton
> in dests" in a single if condition, I suspect.  I.e.
>
>                 if (strintmap_contains(&dests, base) &&
>                     0 <= (dst_index = (strintmap_get(&dests, base)))) {

I can change that.  I can also simplify it further to

        if (0 <= (dst_index = (strintmap_get(&dests, base)))) {

since dests uses a default value of -1.  That will decrease the number
of strmap lookups here from 2 to 1.

> It is a bit sad that we iterate over rename_src[] array, even though
> we now have a map that presumably have fewer number of entries than
> the original array, though.

Oh, interesting; I forgot all about that.  I just looked up my
original implementation from February of last year and indeed I had
done exactly that
(https://github.com/newren/git/commit/43eaec6007c92b6af05e0ef0fcc047c1d1ba1de8).
However, when I added a later optimization that pairs up non-unique
basenames, I had to switch to looping over rename_src.

For various reasons (mostly starting with the fact that I had lots of
experimental ideas that were tried and thrown out but with pieces kept
around for ideas), I wasn't even close to having a clean history in my
original implementation of merge-ort and the diffcore-rename
optimizations.  And it was far, far easier to achieve the goal of a
clean history by picking out chunks of code from the end-state and
creating entirely new commits than attempting to use my existing
history.  But, of course, that method made me lose this intermediate
state.

>
> > +                     /* Ignore this dest if already used in a rename */
> > +                     if (rename_dst[dst_index].is_rename)
> > +                             continue; /* already used previously */
>
> Since we will only be matching between unique entries in src and
> dst, this "this has been used, so we cannot use it" will not change
> during this loop.  I wonder if the preparation done in the previous
> step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e.  a
> dst that has already been used (in the previous "exact" step) would
> not even have to be in &dests map, so that the strintmap_contains()
> check can reject it much earlier.

Good, catch again.  The previous step (v4 2/5) actually did already
check this, so this if-condition will always be false at this point.
Looking at the link above, this if-condition check wasn't there in the
original, but again was added due to altered state introduced by a
later optimization.  So, I should pull this check out of this patch
and add it back in to the later patch.

> Stepping back a bit, it appears to me that [2/5] and [3/5] considers
> a source file having unique basename among the sources even if there
> are many such files with the same basename, as long as all the other
> files with the same basename have been matched in the previous
> "exact" phase.  It probably does the same thing for destination
> side.
>
> Intended?
>
> It feels incompatible with the spirit of these two steps aim for
> (i.e. only use this optimization on a pair of src/dst with UNIQUE
> basenames).  For the purpose of "we only handle unique ones", the
> paths that already have matched should participate in deciding if
> the files that survived "exact" phase have unique basename among
> the original inpu?

Yeah, I should have been more careful with my wording.  Stated a
different way, what confidence can we associate with an exact rename?
Obviously, the confidence is high as we mark them as renames.  But if
the confidence is less than 100%, and enough less than 100% that it
casts a doubt on "related" inexact renames, then yes the basenames of
the exact renames should also be computed so that we can determine
what basenames are truly unique.  By the exact same argument, you
could take this a step further and say that we should calculate the
basenames of *all* files in the tree, not just add/delete pairs, and
only match up the ones via basename that are *truly* unique.  After
all, break detection exists, so perhaps we don't have full confidence
that files with an unchanged fullname are actually related.

From my view, though, both are too cautious and throwing out valuable
heuristics for common cases.  Starting with break detection, it is off
for a reason: we think unchanged filename is a strong enough heuristic
to just match up those files and consider the confidence of the match
in effect 100%.  Similarly, we put a lot of confidence in exact rename
detection.  If there are multiple adds/deletes with the same basename,
and all but one on each side are paired up by exact rename detection,
aren't the remaining two files a (very) likely rename pair?  I think
so, and believe they're worth including in the basename-based rename
detection step.  We do require basename-based matches to meet a much
higher similarity scoring threshold now, which I feel already
adequately adjusts for not doing full content similarity against all
other files.

Also, in the next series, I find an additional way to match up files
by basename when basenames are not unique, and which doesn't involve
pairwise comparing all the files with the same basename.  I only pick
at most one other file to compare to (and the selection is not
random).  So, my overall strategy for these two series is "find which
basenames are likely matches" even if I didn't word it very well.

I do agree, though, that I should add some more careful wording about
this in the series.  I'll include it in a re-roll.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes:

> I can change that.  I can also simplify it further to
>
>         if (0 <= (dst_index = (strintmap_get(&dests, base)))) {
>
> since dests uses a default value of -1.  That will decrease the number
> of strmap lookups here from 2 to 1.

Which would be a real win, unlike what I said in the message you are
responding to.

>> It feels incompatible with the spirit of these two steps aim for
>> (i.e. only use this optimization on a pair of src/dst with UNIQUE
>> basenames).  For the purpose of "we only handle unique ones", the
>> paths that already have matched should participate in deciding if
>> the files that survived "exact" phase have unique basename among
>> the original inpu?
>
> Yeah, I should have been more careful with my wording.  Stated a
> different way, what confidence can we associate with an exact rename?

Suppose you start with a/Makefile, b/Makefile and c/Makefile and
then they all disappeared while a1/Makefile, b1/Makefile and
c1/Makefile now are in the tree.  The contents a/Makefile used to
have appears without any difference in a1/Makefile, the same for b
and b1, but c/Makefile and c1/Makefile are different.  The c vs c1
pair may worth investigating, so it goes through the "same basename"
phase.

Now, in a slightly different situation, a vs a1 are still identical,
but b vs b1 have only one blank line removal but without any other
change.  It looks odd that such a change has to pessimize c vs c1
optimization opportunity, but an interesting part of the story is
that we can only say "such a change", not "such a miniscule change",
because we have just finished the "exact" phase, and we do not know
how big a difference b vs b1 pair actually had.

That makes me feel that this whole "we must treat unique one that
remains specially" is being incoherent.  If "because we have only
small number of removed and added Makefiles spread across the trees,
first full-matrix matching among them without anything else with
higher bar may be worth an optimization" were the optimization, then
I would understand and support the design to omit those that have
already been matched in the "exact" phase.

But IIRC, limiting this "same basename" phase to unique add/del pair
was sold as a way to make it less likely for the heuristics to make
mistakes, yet the definition of "unique", as shown above, is not all
that solid.  That I find it rather unsatisfactory.

In other words, it is not "what confidence do we have in exact
phase?"  "exact" matching may have found perfect matching pair.  But
the found pair should be happy just between themselves, and should
not have undue effect on how _other_ pairs are compared.  Stopping
the "exact" pair from participating in the "uniqueness" definition
is placing "exact" phase too much weight to affect how other filepairs
are found.

> By the exact same argument, you
> could take this a step further and say that we should calculate the
> basenames of *all* files in the tree, not just add/delete pairs, and
> only match up the ones via basename that are *truly* unique.  After
> all, break detection exists, so perhaps we don't have full confidence
> that files with an unchanged fullname are actually related.

Sorry, but you are not making sense.  These optimizations are done
only when we are not using copies and breaks, no?  What _other_
changes that kept the paths the same, or modified in place, have any
effect on matching added and deleted pairs?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Sat, Feb 13, 2021 at 3:55 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > I can change that.  I can also simplify it further to
> >
> >         if (0 <= (dst_index = (strintmap_get(&dests, base)))) {
> >
> > since dests uses a default value of -1.  That will decrease the number
> > of strmap lookups here from 2 to 1.
>
> Which would be a real win, unlike what I said in the message you are
> responding to.

Sadly, while it's a real win, it's very temporary.  The next series
I'll submit needs to separate the two checks back out for other
reasons.

> >> It feels incompatible with the spirit of these two steps aim for
> >> (i.e. only use this optimization on a pair of src/dst with UNIQUE
> >> basenames).  For the purpose of "we only handle unique ones", the
> >> paths that already have matched should participate in deciding if
> >> the files that survived "exact" phase have unique basename among
> >> the original inpu?
> >
> > Yeah, I should have been more careful with my wording.  Stated a
> > different way, what confidence can we associate with an exact rename?
>
> Suppose you start with a/Makefile, b/Makefile and c/Makefile and
> then they all disappeared while a1/Makefile, b1/Makefile and
> c1/Makefile now are in the tree.  The contents a/Makefile used to
> have appears without any difference in a1/Makefile, the same for b
> and b1, but c/Makefile and c1/Makefile are different.  The c vs c1
> pair may worth investigating, so it goes through the "same basename"
> phase.
>
> Now, in a slightly different situation, a vs a1 are still identical,
> but b vs b1 have only one blank line removal but without any other
> change.  It looks odd that such a change has to pessimize c vs c1
> optimization opportunity, but an interesting part of the story is
> that we can only say "such a change", not "such a miniscule change",
> because we have just finished the "exact" phase, and we do not know
> how big a difference b vs b1 pair actually had.
>
> That makes me feel that this whole "we must treat unique one that
> remains specially" is being incoherent.

It's really not that special; the pessimization is not in my mind due
to correctness reasons, but performance reasons.

I need to only compare any given file to at most one other file in the
preliminary steps.  When there are multiple remaining possibilities to
compare, I need a method for selecting which ones to compare.  I have
such a method, but it's a lot more code.  It was easier to submit a
series that was only 3 patches long and only considered the pairs that
just happened to uniquely match up so we could talk about the general
idea of basename matching.  The next series finds ways to match up
more files with similar basenames.

>  If "because we have only
> small number of removed and added Makefiles spread across the trees,
> first full-matrix matching among them without anything else with
> higher bar may be worth an optimization" were the optimization, then

This optimization was indeed considered...and fully implemented.
Let's give it a name, so I can refer to it more below.  How about the
"preliminary-matrix-of-basenames" optimization?

> I would understand and support the design to omit those that have
> already been matched in the "exact" phase.
>
> But IIRC, limiting this "same basename" phase to unique add/del pair
> was sold as a way to make it less likely for the heuristics to make
> mistakes, yet the definition of "unique", as shown above, is not all
> that solid.  That I find it rather unsatisfactory.

No, I never sold it as a way to make it less likely for the heuristics
to make mistakes.  If I implied that anywhere, it was on accident.

I certainly emphasized only doing one comparison per file, but not for
that reason.  I had three reasons for mentioning
one-comparison-per-file: (1) I was trying to contrast with Stolee's
original assumption about what this series was doing, to try to avoid
a repeat of the misunderstandings about the current optimization being
suggested.  (2) The preliminary-matrix-of-basenames optimization has
worst-case performance nearly twice as bad as without such an
optimization.  (For example, with preliminary-matrix-of-basenames, if
nearly all unmatched files have the same basename, we end up basically
doing inexact rename detection on all files twice).  I believe
Stolee's original assumption of what was being proposed also has such
twice-as-slow-as-normal worst-case performance behavior.  Even though
the worst case performance would be fairly rare, making an algorithm
twice as slow by introducing an optimization felt like something I
should avoid.  (3) Despite the theoretical problems with worst-case
performance, I implemented the preliminary-matrix-of-basenames
optimization anyway.  I threw the code away, because even in cases
with a wide variety of basenames, it slowed things down when other
optimizations were also involved.  The one clear way to work well with
other optimizations I was working with was to only allow the
preliminary step to compare any given file to at most one other file.

> In other words, it is not "what confidence do we have in exact
> phase?"  "exact" matching may have found perfect matching pair.  But
> the found pair should be happy just between themselves, and should
> not have undue effect on how _other_ pairs are compared.  Stopping
> the "exact" pair from participating in the "uniqueness" definition
> is placing "exact" phase too much weight to affect how other filepairs
> are found.

I guess I look at this quite a bit differently.  Here's my view:

  * If we have a reasonable and cheap way to determine that two
particular files are likely potential rename pairs,
  * AND checking their similarity confirms they are sufficiently
similar (perhaps with a higher bar)
  * then we've found a way to avoid quadratic comparisons.

We will give up "optimal" matches, but as long as what we provide are
"reasonable" matches I think that should suffice.  I personally
believe "reasonable" at O(N) cost trumps "optimal" at O(N^2).

There are several different ways to find "likely potential rename pairs":
  * The preliminary-matrix-of-basenames is one that I tried (but
interacts badly performance-wise with other optimizations).
  * https://github.com/gitgitgadget/git/issues/519 has multiple ideas.
  * Stolee's misunderstanding of my series is another
  * unique basenames among remaining pairs after exact renames is a
really simple one that lets me introduce "reasonable" matches so we
can discuss
  * my next series adds another

That leaves us with a big question.  Are we happy with higher
sufficient similarity bar being enough of a constraint for
"reasonable" matches?  If so, each of the above ideas might be able to
help us.  If not, we may be able to rule some of them out apriori and
avoid working on them (well, working on them any more; I've already
implemented three, and we have an intern who picked a project to look
at one)

> > By the exact same argument, you
> > could take this a step further and say that we should calculate the
> > basenames of *all* files in the tree, not just add/delete pairs, and
> > only match up the ones via basename that are *truly* unique.  After
> > all, break detection exists, so perhaps we don't have full confidence
> > that files with an unchanged fullname are actually related.
>
> Sorry, but you are not making sense.  These optimizations are done
> only when we are not using copies and breaks, no?  What _other_
> changes that kept the paths the same, or modified in place, have any
> effect on matching added and deleted pairs?

If the optimization is presented to users as "only compare basenames
in a preliminary step when they are unique", which is what I was
understanding you to say, and if the user has a/Makefile and
d/Makefile in the source tree, and a1/Makefile and d/Makefile in the
destination tree, then a/Makefile is not the unique "Makefile" in the
source tree.

I think you're trying to make an argument about uniqueness and why it
matters for correctness, but I'm not following it.

The only reason uniqueness is important to me is because I was using
it with future optimizations in mind, and knew it to be related to an
important performance criteria.  I tried to avoid mentioning
uniqueness at all in the user-facing documentation, though I did try
to explain why some files with the same basename might not be matched
up by that step (and my next series modifies those docs a bit.)

@@ -367,6 +367,155 @@ static int find_exact_renames(struct diff_options *options)
return renames;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> index 797343b38106..bf62537c29a0 100755
> --- a/t/t4001-diff-rename.sh
> +++ b/t/t4001-diff-rename.sh
> @@ -280,8 +280,8 @@ test_expect_success 'basename similarity vs best similarity' '
>  	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
>  	# but since same basenames are checked first...

Here lies the answer to my earlier question ;-)

>  	cat >expected <<-\EOF &&
> -	R088	subdir/file.txt	file.md
> -	A	file.txt
> +	A	file.md
> +	R078	subdir/file.txt	file.txt
>  	EOF
>  	test_cmp expected actual
>  '

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 13, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series depends on ort-perf-batch-6[1].
>
> This series uses file basenames (portion of the path after final '/',
> including extension) in a basic fashion to guide rename detection.
>
> Changes since v3:
>
>  * update documentation as suggested by Junio
>  * NEW: add another patch at the end, to simplify patch series that will be
>    submitted later (please review!)

Sorry, by mistake I somehow read v4 and sent some comments on v3,
but as the above says, they are on the part that hadn't changed at
all, and should still be relevant.

Thanks.

Add a simple test where a removed file is similar to two different added
files; one of them has the same basename, and the other has a slightly
higher content similarity.  In the current test, content similarity is
weighted higher than filename similarity.

Subsequent commits will add a new rule that weighs a mixture of filename
similarity and content similarity in a manner that will change the
outcome of this testcase.

Signed-off-by: Elijah Newren <newren@gmail.com>
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first.  (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.)  Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).

This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names.  Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
  * linux: 76%
  * gcc: 64%
  * gecko: 79%
  * webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.

Signed-off-by: Elijah Newren <newren@gmail.com>
It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories.  We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames.  If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.

This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file.  For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there are no other
device.h files in the remaining sets of added and deleted files after
exact rename detection, then these two files would be compared in the
preliminary step.

This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose.  The next
commit will do the necessary plumbing to make use of it.

Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename.  I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches.  Reason (4) takes a full paragraph to
explain...

If the previous three reasons aren't enough, consider what rename
detection already does.  Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar.  In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity.  Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents.  This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well.  Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.

Signed-off-by: Elijah Newren <newren@gmail.com>
Make use of the new find_basename_matches() function added in the last
two patches, to find renames more rapidly in cases where we can match up
files based on basenames.  As a quick reminder (see the last two commit
messages for more details), this means for example that
docs/extensions.txt and docs/config/extensions.txt are considered likely
renames if there are no remaining 'extensions.txt' files elsewhere among
the added and deleted files, and if a similarity check confirms they are
similar, then they are marked as a rename without looking for a better
similarity match among other files.  This is a behavioral change, as
covered in more detail in the previous commit message.

We do not use this heuristic together with either break or copy
detection.  The point of break detection is to say that filename
similarity does not imply file content similarity, and we only want to
know about file content similarity.  The point of copy detection is to
use more resources to check for additional similarities, while this is
an optimization that uses far less resources but which might also result
in finding slightly fewer similarities.  So the idea behind this
optimization goes against both of those features, and will be turned off
for both.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.815 s ±  0.062 s    13.294 s ±  0.103 s
    mega-renames:   1799.937 s ±  0.493 s   187.248 s ±  0.882 s
    just-one-mega:    51.289 s ±  0.019 s     5.557 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
The last few patches have introduced a new preliminary step when rename
detection is on but both break detection and copy detection are off.
Document this new step.  While we're at it, add a testcase that checks
the new behavior as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
We want to pass additional information to diffcore_rename() (or some
variant thereof) without plumbing that extra information through
diff_tree_oid() and diffcore_std().  Further, since we will need to
gather additional special information related to diffs and are walking
the trees anyway in collect_merge_info(), it seems odd to have
diff_tree_oid()/diffcore_std() repeat those tree walks.  And there may
be times where we can avoid traversing into a subtree in
collect_merge_info() (based on additional information at our disposal),
that the basic diff logic would be unable to take advantage of.  For all
these reasons, just create the add and delete pairs ourself and then
call diffcore_rename() directly.

This change is primarily about enabling future optimizations; the
advantage of avoiding extra tree traversals is small compared to the
cost of rename detection, and the advantage of avoiding the extra tree
traversals is somewhat offset by the extra time spent in
collect_merge_info() collecting the additional data anyway.  However...

For the testcases mentioned in commit 557ac03 ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       13.294 s ±  0.103 s    12.775 s ±  0.062 s
    mega-renames:    187.248 s ±  0.882 s   188.754 s ±  0.284 s
    just-one-mega:     5.557 s ±  0.017 s     5.599 s ±  0.019 s

Signed-off-by: Elijah Newren <newren@gmail.com>
@newren
Copy link
Author

newren commented Feb 14, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 14, 2021

Submitted as pull.843.v5.git.1613289112.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v5

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v5:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v5

@newren newren closed this Mar 5, 2021
@newren newren deleted the ort-perf-batch-7 branch March 5, 2021 05:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant