Optimization batch 7: use file basenames to guide rename detection #843

newren · 2021-01-24T07:02:25Z

This series depends on ort-perf-batch-6. It appears Junio has appended an earlier round of this series on that one and called the combined series en/diffcore-rename. I'm still resubmitting the two separately to preserve the threaded discussion in the archives and because gitgitgadget can provide proper range-diffs that way.

This series uses file basenames (portion of the path after final '/', including extension) in a basic fashion to guide rename detection.

Changes since v4:

add wording to make it clearer that we are considering remaining basenames after exact rename detection
add three minor optimizations to patch 3. (All three will have to be undone by the next series, but this series is probably clearer with them.)
a typo fix or two
v2 of ort-perf-batch-6 added some changes around consistency of rename_src_nr; make similar changes in using this variable in find_basename_changes() for consistency
fix the testcase so the expected comments about the change in behavior only show up after we change the behavior
attempt a rewrite of the commit message for the new testcase, who knows if I'll get it right this time.

CC: Derrick Stolee dstolee@microsoft.com
CC: Jonathan Tan jonathantanmy@google.com
CC: Taylor Blau me@ttaylorr.com
CC: Junio C Hamano gitster@pobox.com
CC: Jeff King peff@peff.net
cc: Elijah Newren newren@gmail.com
cc: Derrick Stolee stolee@gmail.com

newren · 2021-02-06T22:51:08Z

/submit

gitgitgadget · 2021-02-06T22:52:52Z

Submitted as pull.843.git.1612651937.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v1

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v1

gitgitgadget · 2021-02-07T05:22:26Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series depends on ort-perf-batch-6[1], which has not yet appeared in
> seen despite being reviewed by both Junio and Stolee.

It is because that one depends on something not in, but soon about
to go in, 'master' and I didn't want to add to "this topic depends
on top of that other topic" mess.

gitgitgadget · 2021-02-07T06:10:34Z

On the Git mailing list, Elijah Newren wrote (reply to this):

On Sat, Feb 6, 2021 at 9:19 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This series depends on ort-perf-batch-6[1], which has not yet appeared in
> > seen despite being reviewed by both Junio and Stolee.
>
> It is because that one depends on something not in, but soon about
> to go in, 'master' and I didn't want to add to "this topic depends
> on top of that other topic" mess.

Ah, makes sense.

gitgitgadget · 2021-02-07T06:10:36Z

User Elijah Newren <newren@gmail.com> has been added to the cc: list.

diffcore-rename.c

gitgitgadget · 2021-02-07T14:40:39Z

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

newren · 2021-02-09T11:31:00Z

/submit

gitgitgadget · 2021-02-09T11:32:38Z

Submitted as pull.843.v2.git.1612870326.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v2

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v2

Documentation/gitdiffcore.txt

diffcore-rename.c

newren · 2021-02-10T15:14:40Z

/submit

gitgitgadget · 2021-02-10T15:16:04Z

Submitted as pull.843.v3.git.1612970140.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v3

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v3:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v3

Documentation/gitdiffcore.txt

newren · 2021-02-11T08:14:44Z

/submit

gitgitgadget · 2021-02-11T08:16:12Z

Submitted as pull.843.v4.git.1613031350.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v4

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v4:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v4

gitgitgadget · 2021-02-12T22:11:17Z

This branch is now known as en/diffcore-rename.

gitgitgadget · 2021-02-12T22:11:18Z

This patch series was integrated into seen via git@84f0c41.

gitgitgadget · 2021-02-12T22:57:31Z

This patch series was integrated into seen via git@72384d7.

gitgitgadget · 2021-02-13T01:17:56Z

t/t4001-diff-rename.sh

@@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Elijah Newren <newren@gmail.com> > > Add a simple test where a removed file is similar to two different added > files; one of them has the same basename, and the other has a slightly > higher content similarity. Without break detection, filename similarity > of 100% trumps content similarity for pairing up related files. For > any filename similarity less than 100%, the opposite is true -- content > similarity is all that matters. Add a testcase that documents this. I am not sure why it is the "opposite". When contents are similar to the same degree of 100%, we tiebreak with the filename. We never favor a pair between the same filename over a pair between different filenames with better content similarity. And when contents are similar to the same degree of less than 100%, we do not favor a pair between the same filename over a pair between different filenames, as long as they are similar to the same degree. So, I do not think "opposite" is helping readers to understand what is going on. > +test_expect_success 'basename similarity vs best similarity' ' > + mkdir subdir && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 line9 line10 >subdir/file.txt && > + git add subdir/file.txt && > + git commit -m "base txt" && > + > + git rm subdir/file.txt && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 >file.txt && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 line9 >file.md && > + git add file.txt file.md && > + git commit -a -m "rename" && > + git diff-tree -r -M --name-status HEAD^ HEAD >actual && > + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, > + # but since same basenames are checked first... I am not sure what the second line of this comment wants to imply with the ellipses here. Care to finish the sentence? Or was the second line planned to be added when we start applying the "check only the same filename first and see if we find a better-than-reasonable match" heuristics but somehow survived "rebase -i" and ended up here? > + cat >expected <<-\EOF && > + R088 subdir/file.txt file.md > + A file.txt > + EOF > + test_cmp expected actual Thanks.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Feb 12, 2021 at 5:15 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > From: Elijah Newren <newren@gmail.com> > > > > Add a simple test where a removed file is similar to two different added > > files; one of them has the same basename, and the other has a slightly > > higher content similarity. Without break detection, filename similarity > > of 100% trumps content similarity for pairing up related files. For > > any filename similarity less than 100%, the opposite is true -- content > > similarity is all that matters. Add a testcase that documents this. > > I am not sure why it is the "opposite". When contents are similar > to the same degree of 100%, we tiebreak with the filename. We never > favor a pair between the same filename over a pair between different > filenames with better content similarity. This is not true. If src/main.c is 99% similar to src/foo.c, and is 0% similar to the src/main.c in the new commit, we match the old src/main.c to the new src/main.c despite being far more similar src/foo.c. Unless break detection is turned on, we do not allow content similarity to trump (full) filename equality. > And when contents are similar to the same degree of less than 100%, > we do not favor a pair between the same filename over a pair between > different filenames, as long as they are similar to the same degree. This is also not true; we tiebreak with filenames for inexact renames just like we do for exact renames (note that basename_same() is called both from find_identical_files() and from the nested loop where inexact rename detection is done). > So, I do not think "opposite" is helping readers to understand what > is going on. > > > +test_expect_success 'basename similarity vs best similarity' ' > > + mkdir subdir && > > + test_write_lines line1 line2 line3 line4 line5 \ > > + line6 line7 line8 line9 line10 >subdir/file.txt && > > + git add subdir/file.txt && > > + git commit -m "base txt" && > > + > > + git rm subdir/file.txt && > > + test_write_lines line1 line2 line3 line4 line5 \ > > + line6 line7 line8 >file.txt && > > + test_write_lines line1 line2 line3 line4 line5 \ > > + line6 line7 line8 line9 >file.md && > > + git add file.txt file.md && > > + git commit -a -m "rename" && > > + git diff-tree -r -M --name-status HEAD^ HEAD >actual && > > + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, > > + # but since same basenames are checked first... > > I am not sure what the second line of this comment wants to imply > with the ellipses here. Care to finish the sentence? > > Or was the second line planned to be added when we start applying > the "check only the same filename first and see if we find a > better-than-reasonable match" heuristics but somehow survived > "rebase -i" and ended up here? Oops, indeed; that is precisely what happened. Will fix. > > + cat >expected <<-\EOF && > > + R088 subdir/file.txt file.md > > + A file.txt > > + EOF > > + test_cmp expected actual > > Thanks.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes: > This is not true. If src/main.c is 99% similar to src/foo.c, and is > 0% similar to the src/main.c in the new commit, we match the old > src/main.c to the new src/main.c despite being far more similar > src/foo.c. Unless break detection is turned on, we do not allow > content similarity to trump (full) filename equality. Absolutely. And we are talking about a new optimization that kicks in only when there is no break or no copy detection going on, no?

On the Git mailing list, Elijah Newren wrote (reply to this):

On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > Elijah Newren <newren@gmail.com> writes: > > > This is not true. If src/main.c is 99% similar to src/foo.c, and is > > 0% similar to the src/main.c in the new commit, we match the old > > src/main.c to the new src/main.c despite being far more similar > > src/foo.c. Unless break detection is turned on, we do not allow > > content similarity to trump (full) filename equality. > > Absolutely. And we are talking about a new optimization that kicks > in only when there is no break or no copy detection going on, no? Yes, precisely, we are only considering cases without break detection...and thus we are considering cases where for the last 15 years or more, sufficiently large filename similarity (an exact fullname match) trumps any level of content similarity. I think it is useful to note that while my optimization is adding more considerations that can overrule maximal content similarity, it is not the first such code choice to do that. But let me back up a bit... When I submitted the series, you and Stolee went into a long discussion about an optimization that I didn't submit, one that feels looser on "matching" than anything I submitted, and which I think might counter-intuitively reduce performance rather than aid it. (The performance side only comes into view in combination with later series, but it was why I harped so much since then on only comparing against at most one other file in the steps before full inexact rename detection.) I was quite surprised by the diversion, but it made it clear to me that my descriptions and commit messages were far too vague and could be read to imply a completely different algorithm than I intended. So, I tried to be far more careful in subsequent iterations by adding wider context and contrasts. Further, after I wrote various things to try to clarify the misunderstandings, I noticed that Stolee picked out one thing and stated that "This idea of optimizing first for 100% filename similarity is a good perspective on Git's rename detection algorithm." (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/) So, that particular point seemed to help him understand more, and thus might be useful extra context for others reading along now or in the future. Given all the above, I was trying to address earlier misunderstandings and provide more context. Perhaps I swung the pendulum too far and talked too much about other cases, or perhaps I just worded things poorly again. All I was attempting to do in the commit message was point out the multiple basic rules with filename and content similarity, to lay the groundwork for new rules that do alternative weightings. Anyway, I've added a few more tweaks to try to improve the wording for the next round I'll submit today. Given my track record so far, it would not be surprising if it still needed more tweaks.

On the Git mailing list, Junio C Hamano wrote (reply to this):

I do not consider "the same file changed in place" the same as "we seem to have lost a file in the old tree, ah, we found one that has the same basename in a different directory" at all, so your argument still does not make any sense to me, sorry. 2021年2月13日(土) 17:25 Elijah Newren <newren@gmail.com>: > > On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > Elijah Newren <newren@gmail.com> writes: > > > > > This is not true. If src/main.c is 99% similar to src/foo.c, and is > > > 0% similar to the src/main.c in the new commit, we match the old > > > src/main.c to the new src/main.c despite being far more similar > > > src/foo.c. Unless break detection is turned on, we do not allow > > > content similarity to trump (full) filename equality. > > > > Absolutely. And we are talking about a new optimization that kicks > > in only when there is no break or no copy detection going on, no? > > Yes, precisely, we are only considering cases without break > detection...and thus we are considering cases where for the last 15 > years or more, sufficiently large filename similarity (an exact > fullname match) trumps any level of content similarity. I think it is > useful to note that while my optimization is adding more > considerations that can overrule maximal content similarity, it is not > the first such code choice to do that. > > But let me back up a bit... > > When I submitted the series, you and Stolee went into a long > discussion about an optimization that I didn't submit, one that feels > looser on "matching" than anything I submitted, and which I think > might counter-intuitively reduce performance rather than aid it. (The > performance side only comes into view in combination with later > series, but it was why I harped so much since then on only comparing > against at most one other file in the steps before full inexact rename > detection.) I was quite surprised by the diversion, but it made it > clear to me that my descriptions and commit messages were far too > vague and could be read to imply a completely different algorithm than > I intended. So, I tried to be far more careful in subsequent > iterations by adding wider context and contrasts. > > Further, after I wrote various things to try to clarify the > misunderstandings, I noticed that Stolee picked out one thing and > stated that "This idea of optimizing first for 100% filename > similarity is a good perspective on Git's rename detection algorithm." > (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/) > So, that particular point seemed to help him understand more, and > thus might be useful extra context for others reading along now or in > the future. > > Given all the above, I was trying to address earlier misunderstandings > and provide more context. Perhaps I swung the pendulum too far and > talked too much about other cases, or perhaps I just worded things > poorly again. All I was attempting to do in the commit message was > point out the multiple basic rules with filename and content > similarity, to lay the groundwork for new rules that do alternative > weightings. > > Anyway, I've added a few more tweaks to try to improve the wording for > the next round I'll submit today. Given my track record so far, it > would not be surprising if it still needed more tweaks.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Sat, Feb 13, 2021 at 5:32 PM Junio C Hamano <gitster@pobox.com> wrote: > > I do not consider "the same file changed in place" the same as "we > seem to have lost a file in the old tree, ah, we found one that has > the same basename in a different directory" at all, so your argument > still does not make any sense to me, sorry. I'm not set on the commit message wording, you asked why I had used the terms I did and I tried to explain. I also explained how the wording seemed to have helped Stolee understand. If you'd like to suggest an alternative commit message, I'm happy to take it. > 2021年2月13日(土) 17:25 Elijah Newren <newren@gmail.com>: > > > > On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > > > Elijah Newren <newren@gmail.com> writes: > > > > > > > This is not true. If src/main.c is 99% similar to src/foo.c, and is > > > > 0% similar to the src/main.c in the new commit, we match the old > > > > src/main.c to the new src/main.c despite being far more similar > > > > src/foo.c. Unless break detection is turned on, we do not allow > > > > content similarity to trump (full) filename equality. > > > > > > Absolutely. And we are talking about a new optimization that kicks > > > in only when there is no break or no copy detection going on, no? > > > > Yes, precisely, we are only considering cases without break > > detection...and thus we are considering cases where for the last 15 > > years or more, sufficiently large filename similarity (an exact > > fullname match) trumps any level of content similarity. I think it is > > useful to note that while my optimization is adding more > > considerations that can overrule maximal content similarity, it is not > > the first such code choice to do that. > > > > But let me back up a bit... > > > > When I submitted the series, you and Stolee went into a long > > discussion about an optimization that I didn't submit, one that feels > > looser on "matching" than anything I submitted, and which I think > > might counter-intuitively reduce performance rather than aid it. (The > > performance side only comes into view in combination with later > > series, but it was why I harped so much since then on only comparing > > against at most one other file in the steps before full inexact rename > > detection.) I was quite surprised by the diversion, but it made it > > clear to me that my descriptions and commit messages were far too > > vague and could be read to imply a completely different algorithm than > > I intended. So, I tried to be far more careful in subsequent > > iterations by adding wider context and contrasts. > > > > Further, after I wrote various things to try to clarify the > > misunderstandings, I noticed that Stolee picked out one thing and > > stated that "This idea of optimizing first for 100% filename > > similarity is a good perspective on Git's rename detection algorithm." > > (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/) > > So, that particular point seemed to help him understand more, and > > thus might be useful extra context for others reading along now or in > > the future. > > > > Given all the above, I was trying to address earlier misunderstandings > > and provide more context. Perhaps I swung the pendulum too far and > > talked too much about other cases, or perhaps I just worded things > > poorly again. All I was attempting to do in the commit message was > > point out the multiple basic rules with filename and content > > similarity, to lay the groundwork for new rules that do alternative > > weightings. > > > > Anyway, I've added a few more tweaks to try to improve the wording for > > the next round I'll submit today. Given my track record so far, it > > would not be surprising if it still needed more tweaks.

gitgitgadget · 2021-02-13T01:36:02Z

diffcore-rename.c

@@ -367,6 +367,67 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > +MAYBE_UNUSED > +static int find_basename_matches(struct diff_options *options, > + int minimum_score, > + int num_src) > +{ > + int i; > + struct strintmap sources; > + struct strintmap dests; > + > + /* Create maps of basename -> fullname(s) for sources and dests */ > + strintmap_init_with_options(&sources, -1, NULL, 0); > + strintmap_init_with_options(&dests, -1, NULL, 0); > + for (i = 0; i < num_src; ++i) { > + char *filename = rename_src[i].p->one->path; > + const char *base; > + > + /* exact renames removed in remove_unneeded_paths_from_src() */ > + assert(!rename_src[i].p->one->rename_used); > + > + /* Record index within rename_src (i) if basename is unique */ > + base = get_basename(filename); > + if (strintmap_contains(&sources, base)) > + strintmap_set(&sources, base, -1); > + else > + strintmap_set(&sources, base, i); > + } > + for (i = 0; i < rename_dst_nr; ++i) { > + char *filename = rename_dst[i].p->two->path; > + const char *base; > + > + if (rename_dst[i].is_rename) > + continue; /* involved in exact match already. */ > + > + /* Record index within rename_dst (i) if basename is unique */ > + base = get_basename(filename); > + if (strintmap_contains(&dests, base)) > + strintmap_set(&dests, base, -1); > + else > + strintmap_set(&dests, base, i); > + } > + > + /* TODO: Make use of basenames source and destination basenames */ ;-) So at this point sources and dests can be used to quickly look up, given a filename, if there is a single src among all sources, and a single dst among all dests, that have the filename. I wonder if the second loop over destinations can be "optimized" further by using the sources map, though. The reason you quash entries with -1 when you see second instance of the same name is because you intend to limit the heuristics only to a uniquely named file among the removed files going to a uniquely named file among the added files, right? So even if a name is unique among dests, if that name has duplicates on the source side, there is no point recording its location. i.e. /* record index within dst if it is unique in both dst and src */ base = get_basename(filename); if (strintmap_contains(&sources, base) || strintmap_contains(&dests, base)) strintmap_set(&dests, base, -1); else strintmap_set(&dests, base, i); perhaps? I guess it depends on what actually will be written in this "TODO" space how effective such a change would be. Presumably, you'd iterate over &sources while skipping entries that record -1, to learn (basename, i), and use the basename found there to consult &dests to see if it yields a non-negative integer j, to notice that rename_src[i] is a good candidate to match rename_dst[j]. If that is the case, then such a change won't help as an optimization at all, as we'd need to consult &dests map with the basename anyway, so let's scratch the above idea. In any case, after we walk over rename_src[] and rename_dst[] once, the number of entries in &sources would be smaller than rename_src[] so iterating over &sources, hunting for entries that record non-negative index into rename_src[] would hopefully be cheaper than the naive loop we've been working with. I like the idea of using the strintmap for this part of the code. Thanks. > + strintmap_clear(&sources); > + strintmap_clear(&dests); > + > + return 0; > +} > + > #define NUM_CANDIDATE_PER_DST 4 > static void record_if_better(struct diff_score m[], struct diff_score *o) > {

gitgitgadget · 2021-02-13T01:51:00Z

diffcore-rename.c

@@ -367,6 +367,156 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > + /* Now look for basename matchups and do similarity estimation */ > + for (i = 0; i < num_src; ++i) { > + char *filename = rename_src[i].p->one->path; > + const char *base = NULL; > + intptr_t src_index; > + intptr_t dst_index; > + > + /* Find out if this basename is unique among sources */ > + base = get_basename(filename); > + src_index = strintmap_get(&sources, base); > + if (src_index == -1) > + continue; /* not a unique basename; skip it */ > + assert(src_index == i); > + > + if (strintmap_contains(&dests, base)) { > + struct diff_filespec *one, *two; > + int score; > + > + /* Find out if this basename is unique among dests */ > + dst_index = strintmap_get(&dests, base); > + if (dst_index == -1) > + continue; /* not a unique basename; skip it */ It would be a lot easier to read if "we must have the same singleton in dests" in a single if condition, I suspect. I.e. if (strintmap_contains(&dests, base) && 0 <= (dst_index = (strintmap_get(&dests, base)))) { It is a bit sad that we iterate over rename_src[] array, even though we now have a map that presumably have fewer number of entries than the original array, though. > + /* Ignore this dest if already used in a rename */ > + if (rename_dst[dst_index].is_rename) > + continue; /* already used previously */ Since we will only be matching between unique entries in src and dst, this "this has been used, so we cannot use it" will not change during this loop. I wonder if the preparation done in the previous step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e. a dst that has already been used (in the previous "exact" step) would not even have to be in &dests map, so that the strintmap_contains() check can reject it much earlier. Stepping back a bit, it appears to me that [2/5] and [3/5] considers a source file having unique basename among the sources even if there are many such files with the same basename, as long as all the other files with the same basename have been matched in the previous "exact" phase. It probably does the same thing for destination side. Intended? It feels incompatible with the spirit of these two steps aim for (i.e. only use this optimization on a pair of src/dst with UNIQUE basenames). For the purpose of "we only handle unique ones", the paths that already have matched should participate in deciding if the files that survived "exact" phase have unique basename among the original inpu? Thanks.

On the Git mailing list, Elijah Newren wrote (reply to this):

On Fri, Feb 12, 2021 at 5:48 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > + /* Now look for basename matchups and do similarity estimation */ > > + for (i = 0; i < num_src; ++i) { > > + char *filename = rename_src[i].p->one->path; > > + const char *base = NULL; > > + intptr_t src_index; > > + intptr_t dst_index; > > + > > + /* Find out if this basename is unique among sources */ > > + base = get_basename(filename); > > + src_index = strintmap_get(&sources, base); > > + if (src_index == -1) > > + continue; /* not a unique basename; skip it */ > > + assert(src_index == i); > > + > > + if (strintmap_contains(&dests, base)) { > > + struct diff_filespec *one, *two; > > + int score; > > + > > + /* Find out if this basename is unique among dests */ > > + dst_index = strintmap_get(&dests, base); > > + if (dst_index == -1) > > + continue; /* not a unique basename; skip it */ > > It would be a lot easier to read if "we must have the same singleton > in dests" in a single if condition, I suspect. I.e. > > if (strintmap_contains(&dests, base) && > 0 <= (dst_index = (strintmap_get(&dests, base)))) { I can change that. I can also simplify it further to if (0 <= (dst_index = (strintmap_get(&dests, base)))) { since dests uses a default value of -1. That will decrease the number of strmap lookups here from 2 to 1. > It is a bit sad that we iterate over rename_src[] array, even though > we now have a map that presumably have fewer number of entries than > the original array, though. Oh, interesting; I forgot all about that. I just looked up my original implementation from February of last year and indeed I had done exactly that (https://github.com/newren/git/commit/43eaec6007c92b6af05e0ef0fcc047c1d1ba1de8). However, when I added a later optimization that pairs up non-unique basenames, I had to switch to looping over rename_src. For various reasons (mostly starting with the fact that I had lots of experimental ideas that were tried and thrown out but with pieces kept around for ideas), I wasn't even close to having a clean history in my original implementation of merge-ort and the diffcore-rename optimizations. And it was far, far easier to achieve the goal of a clean history by picking out chunks of code from the end-state and creating entirely new commits than attempting to use my existing history. But, of course, that method made me lose this intermediate state. > > > + /* Ignore this dest if already used in a rename */ > > + if (rename_dst[dst_index].is_rename) > > + continue; /* already used previously */ > > Since we will only be matching between unique entries in src and > dst, this "this has been used, so we cannot use it" will not change > during this loop. I wonder if the preparation done in the previous > step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e. a > dst that has already been used (in the previous "exact" step) would > not even have to be in &dests map, so that the strintmap_contains() > check can reject it much earlier. Good, catch again. The previous step (v4 2/5) actually did already check this, so this if-condition will always be false at this point. Looking at the link above, this if-condition check wasn't there in the original, but again was added due to altered state introduced by a later optimization. So, I should pull this check out of this patch and add it back in to the later patch. > Stepping back a bit, it appears to me that [2/5] and [3/5] considers > a source file having unique basename among the sources even if there > are many such files with the same basename, as long as all the other > files with the same basename have been matched in the previous > "exact" phase. It probably does the same thing for destination > side. > > Intended? > > It feels incompatible with the spirit of these two steps aim for > (i.e. only use this optimization on a pair of src/dst with UNIQUE > basenames). For the purpose of "we only handle unique ones", the > paths that already have matched should participate in deciding if > the files that survived "exact" phase have unique basename among > the original inpu? Yeah, I should have been more careful with my wording. Stated a different way, what confidence can we associate with an exact rename? Obviously, the confidence is high as we mark them as renames. But if the confidence is less than 100%, and enough less than 100% that it casts a doubt on "related" inexact renames, then yes the basenames of the exact renames should also be computed so that we can determine what basenames are truly unique. By the exact same argument, you could take this a step further and say that we should calculate the basenames of *all* files in the tree, not just add/delete pairs, and only match up the ones via basename that are *truly* unique. After all, break detection exists, so perhaps we don't have full confidence that files with an unchanged fullname are actually related. From my view, though, both are too cautious and throwing out valuable heuristics for common cases. Starting with break detection, it is off for a reason: we think unchanged filename is a strong enough heuristic to just match up those files and consider the confidence of the match in effect 100%. Similarly, we put a lot of confidence in exact rename detection. If there are multiple adds/deletes with the same basename, and all but one on each side are paired up by exact rename detection, aren't the remaining two files a (very) likely rename pair? I think so, and believe they're worth including in the basename-based rename detection step. We do require basename-based matches to meet a much higher similarity scoring threshold now, which I feel already adequately adjusts for not doing full content similarity against all other files. Also, in the next series, I find an additional way to match up files by basename when basenames are not unique, and which doesn't involve pairwise comparing all the files with the same basename. I only pick at most one other file to compare to (and the selection is not random). So, my overall strategy for these two series is "find which basenames are likely matches" even if I didn't word it very well. I do agree, though, that I should add some more careful wording about this in the series. I'll include it in a re-roll.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Elijah Newren <newren@gmail.com> writes: > I can change that. I can also simplify it further to > > if (0 <= (dst_index = (strintmap_get(&dests, base)))) { > > since dests uses a default value of -1. That will decrease the number > of strmap lookups here from 2 to 1. Which would be a real win, unlike what I said in the message you are responding to. >> It feels incompatible with the spirit of these two steps aim for >> (i.e. only use this optimization on a pair of src/dst with UNIQUE >> basenames). For the purpose of "we only handle unique ones", the >> paths that already have matched should participate in deciding if >> the files that survived "exact" phase have unique basename among >> the original inpu? > > Yeah, I should have been more careful with my wording. Stated a > different way, what confidence can we associate with an exact rename? Suppose you start with a/Makefile, b/Makefile and c/Makefile and then they all disappeared while a1/Makefile, b1/Makefile and c1/Makefile now are in the tree. The contents a/Makefile used to have appears without any difference in a1/Makefile, the same for b and b1, but c/Makefile and c1/Makefile are different. The c vs c1 pair may worth investigating, so it goes through the "same basename" phase. Now, in a slightly different situation, a vs a1 are still identical, but b vs b1 have only one blank line removal but without any other change. It looks odd that such a change has to pessimize c vs c1 optimization opportunity, but an interesting part of the story is that we can only say "such a change", not "such a miniscule change", because we have just finished the "exact" phase, and we do not know how big a difference b vs b1 pair actually had. That makes me feel that this whole "we must treat unique one that remains specially" is being incoherent. If "because we have only small number of removed and added Makefiles spread across the trees, first full-matrix matching among them without anything else with higher bar may be worth an optimization" were the optimization, then I would understand and support the design to omit those that have already been matched in the "exact" phase. But IIRC, limiting this "same basename" phase to unique add/del pair was sold as a way to make it less likely for the heuristics to make mistakes, yet the definition of "unique", as shown above, is not all that solid. That I find it rather unsatisfactory. In other words, it is not "what confidence do we have in exact phase?" "exact" matching may have found perfect matching pair. But the found pair should be happy just between themselves, and should not have undue effect on how _other_ pairs are compared. Stopping the "exact" pair from participating in the "uniqueness" definition is placing "exact" phase too much weight to affect how other filepairs are found. > By the exact same argument, you > could take this a step further and say that we should calculate the > basenames of *all* files in the tree, not just add/delete pairs, and > only match up the ones via basename that are *truly* unique. After > all, break detection exists, so perhaps we don't have full confidence > that files with an unchanged fullname are actually related. Sorry, but you are not making sense. These optimizations are done only when we are not using copies and breaks, no? What _other_ changes that kept the paths the same, or modified in place, have any effect on matching added and deleted pairs?

On the Git mailing list, Elijah Newren wrote (reply to this):

On Sat, Feb 13, 2021 at 3:55 PM Junio C Hamano <gitster@pobox.com> wrote: > > Elijah Newren <newren@gmail.com> writes: > > > I can change that. I can also simplify it further to > > > > if (0 <= (dst_index = (strintmap_get(&dests, base)))) { > > > > since dests uses a default value of -1. That will decrease the number > > of strmap lookups here from 2 to 1. > > Which would be a real win, unlike what I said in the message you are > responding to. Sadly, while it's a real win, it's very temporary. The next series I'll submit needs to separate the two checks back out for other reasons. > >> It feels incompatible with the spirit of these two steps aim for > >> (i.e. only use this optimization on a pair of src/dst with UNIQUE > >> basenames). For the purpose of "we only handle unique ones", the > >> paths that already have matched should participate in deciding if > >> the files that survived "exact" phase have unique basename among > >> the original inpu? > > > > Yeah, I should have been more careful with my wording. Stated a > > different way, what confidence can we associate with an exact rename? > > Suppose you start with a/Makefile, b/Makefile and c/Makefile and > then they all disappeared while a1/Makefile, b1/Makefile and > c1/Makefile now are in the tree. The contents a/Makefile used to > have appears without any difference in a1/Makefile, the same for b > and b1, but c/Makefile and c1/Makefile are different. The c vs c1 > pair may worth investigating, so it goes through the "same basename" > phase. > > Now, in a slightly different situation, a vs a1 are still identical, > but b vs b1 have only one blank line removal but without any other > change. It looks odd that such a change has to pessimize c vs c1 > optimization opportunity, but an interesting part of the story is > that we can only say "such a change", not "such a miniscule change", > because we have just finished the "exact" phase, and we do not know > how big a difference b vs b1 pair actually had. > > That makes me feel that this whole "we must treat unique one that > remains specially" is being incoherent. It's really not that special; the pessimization is not in my mind due to correctness reasons, but performance reasons. I need to only compare any given file to at most one other file in the preliminary steps. When there are multiple remaining possibilities to compare, I need a method for selecting which ones to compare. I have such a method, but it's a lot more code. It was easier to submit a series that was only 3 patches long and only considered the pairs that just happened to uniquely match up so we could talk about the general idea of basename matching. The next series finds ways to match up more files with similar basenames. > If "because we have only > small number of removed and added Makefiles spread across the trees, > first full-matrix matching among them without anything else with > higher bar may be worth an optimization" were the optimization, then This optimization was indeed considered...and fully implemented. Let's give it a name, so I can refer to it more below. How about the "preliminary-matrix-of-basenames" optimization? > I would understand and support the design to omit those that have > already been matched in the "exact" phase. > > But IIRC, limiting this "same basename" phase to unique add/del pair > was sold as a way to make it less likely for the heuristics to make > mistakes, yet the definition of "unique", as shown above, is not all > that solid. That I find it rather unsatisfactory. No, I never sold it as a way to make it less likely for the heuristics to make mistakes. If I implied that anywhere, it was on accident. I certainly emphasized only doing one comparison per file, but not for that reason. I had three reasons for mentioning one-comparison-per-file: (1) I was trying to contrast with Stolee's original assumption about what this series was doing, to try to avoid a repeat of the misunderstandings about the current optimization being suggested. (2) The preliminary-matrix-of-basenames optimization has worst-case performance nearly twice as bad as without such an optimization. (For example, with preliminary-matrix-of-basenames, if nearly all unmatched files have the same basename, we end up basically doing inexact rename detection on all files twice). I believe Stolee's original assumption of what was being proposed also has such twice-as-slow-as-normal worst-case performance behavior. Even though the worst case performance would be fairly rare, making an algorithm twice as slow by introducing an optimization felt like something I should avoid. (3) Despite the theoretical problems with worst-case performance, I implemented the preliminary-matrix-of-basenames optimization anyway. I threw the code away, because even in cases with a wide variety of basenames, it slowed things down when other optimizations were also involved. The one clear way to work well with other optimizations I was working with was to only allow the preliminary step to compare any given file to at most one other file. > In other words, it is not "what confidence do we have in exact > phase?" "exact" matching may have found perfect matching pair. But > the found pair should be happy just between themselves, and should > not have undue effect on how _other_ pairs are compared. Stopping > the "exact" pair from participating in the "uniqueness" definition > is placing "exact" phase too much weight to affect how other filepairs > are found. I guess I look at this quite a bit differently. Here's my view: * If we have a reasonable and cheap way to determine that two particular files are likely potential rename pairs, * AND checking their similarity confirms they are sufficiently similar (perhaps with a higher bar) * then we've found a way to avoid quadratic comparisons. We will give up "optimal" matches, but as long as what we provide are "reasonable" matches I think that should suffice. I personally believe "reasonable" at O(N) cost trumps "optimal" at O(N^2). There are several different ways to find "likely potential rename pairs": * The preliminary-matrix-of-basenames is one that I tried (but interacts badly performance-wise with other optimizations). * https://github.com/gitgitgadget/git/issues/519 has multiple ideas. * Stolee's misunderstanding of my series is another * unique basenames among remaining pairs after exact renames is a really simple one that lets me introduce "reasonable" matches so we can discuss * my next series adds another That leaves us with a big question. Are we happy with higher sufficient similarity bar being enough of a constraint for "reasonable" matches? If so, each of the above ideas might be able to help us. If not, we may be able to rule some of them out apriori and avoid working on them (well, working on them any more; I've already implemented three, and we have an intern who picked a project to look at one) > > By the exact same argument, you > > could take this a step further and say that we should calculate the > > basenames of *all* files in the tree, not just add/delete pairs, and > > only match up the ones via basename that are *truly* unique. After > > all, break detection exists, so perhaps we don't have full confidence > > that files with an unchanged fullname are actually related. > > Sorry, but you are not making sense. These optimizations are done > only when we are not using copies and breaks, no? What _other_ > changes that kept the paths the same, or modified in place, have any > effect on matching added and deleted pairs? If the optimization is presented to users as "only compare basenames in a preliminary step when they are unique", which is what I was understanding you to say, and if the user has a/Makefile and d/Makefile in the source tree, and a1/Makefile and d/Makefile in the destination tree, then a/Makefile is not the unique "Makefile" in the source tree. I think you're trying to make an argument about uniqueness and why it matters for correctness, but I'm not following it. The only reason uniqueness is important to me is because I was using it with future optimizations in mind, and knew it to be related to an important performance criteria. I tried to avoid mentioning uniqueness at all in the user-facing documentation, though I did try to explain why some files with the same basename might not be matched up by that step (and my next series modifies those docs a bit.)

gitgitgadget · 2021-02-13T01:51:01Z

diffcore-rename.c

@@ -367,6 +367,155 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh > index 797343b38106..bf62537c29a0 100755 > --- a/t/t4001-diff-rename.sh > +++ b/t/t4001-diff-rename.sh > @@ -280,8 +280,8 @@ test_expect_success 'basename similarity vs best similarity' ' > # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, > # but since same basenames are checked first... Here lies the answer to my earlier question ;-) > cat >expected <<-\EOF && > - R088 subdir/file.txt file.md > - A file.txt > + A file.md > + R078 subdir/file.txt file.txt > EOF > test_cmp expected actual > '

gitgitgadget · 2021-02-13T01:57:00Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series depends on ort-perf-batch-6[1].
>
> This series uses file basenames (portion of the path after final '/',
> including extension) in a basic fashion to guide rename detection.
>
> Changes since v3:
>
>  * update documentation as suggested by Junio
>  * NEW: add another patch at the end, to simplify patch series that will be
>    submitted later (please review!)

Sorry, by mistake I somehow read v4 and sent some comments on v3,
but as the above says, they are on the part that hadn't changed at
all, and should still be relevant.

Thanks.

Add a simple test where a removed file is similar to two different added files; one of them has the same basename, and the other has a slightly higher content similarity. In the current test, content similarity is weighted higher than filename similarity. Subsequent commits will add a new rule that weighs a mixture of filename similarity and content similarity in a manner that will change the outcome of this testcase. Signed-off-by: Elijah Newren <newren@gmail.com>

We want to make use of unique basenames among remaining source and destination files to help inform rename detection, so that more likely pairings can be checked first. (src/moduleA/foo.txt and source/module/A/foo.txt are likely related if there are no other 'foo.txt' files among the remaining deleted and added files.) Add a new function, not yet used, which creates a map of the unique basenames within rename_src and another within rename_dst, together with the indices within rename_src/rename_dst where those basenames show up. Non-unique basenames still show up in the map, but have an invalid index (-1). This function was inspired by the fact that in real world repositories, files are often moved across directories without changing names. Here are some sample repositories and the percentage of their historical renames (as of early 2020) that preserved basenames: * linux: 76% * gcc: 64% * gecko: 79% * webkit: 89% These statistics alone don't prove that an optimization in this area will help or how much it will help, since there are also unpaired adds and deletes, restrictions on which basenames we consider, etc., but it certainly motivated the idea to try something in this area. Signed-off-by: Elijah Newren <newren@gmail.com>

It is not uncommon in real world repositories for the majority of file renames to not change the basename of the file; i.e. most "renames" are just a move of files into different directories. We can make use of this to avoid comparing all rename source candidates with all rename destination candidates, by first comparing sources to destinations with the same basenames. If two files with the same basename are sufficiently similar, we record the rename; if not, we include those files in the more exhaustive matrix comparison. This means we are adding a set of preliminary additional comparisons, but for each file we only compare it with at most one other file. For example, if there was a include/media/device.h that was deleted and a src/module/media/device.h that was added, and there are no other device.h files in the remaining sets of added and deleted files after exact rename detection, then these two files would be compared in the preliminary step. This commit does not yet actually employ this new optimization, it merely adds a function which can be used for this purpose. The next commit will do the necessary plumbing to make use of it. Note that this optimization might give us different results than without the optimization, because it's possible that despite files with the same basename being sufficiently similar to be considered a rename, there's an even better match between files without the same basename. I think that is okay for four reasons: (1) it's easy to explain to the users what happened if it does ever occur (or even for them to intuitively figure out), (2) as the next patch will show it provides such a large performance boost that it's worth the tradeoff, and (3) it's somewhat unlikely that despite having unique matching basenames that other files serve as better matches. Reason (4) takes a full paragraph to explain... If the previous three reasons aren't enough, consider what rename detection already does. Break detection is not the default, meaning that if files have the same _fullname_, then they are considered related even if they are 0% similar. In fact, in such a case, we don't even bother comparing the files to see if they are similar let alone comparing them to all other files to see what they are most similar to. Basically, we override content similarity based on sufficient filename similarity. Without the filename similarity (currently implemented as an exact match of filename), we swing the pendulum the opposite direction and say that filename similarity is irrelevant and compare a full N x M matrix of sources and destinations to find out which have the most similar contents. This optimization just adds another form of filename similarity comparison, but augments it with a file content similarity check as well. Basically, if two files have the same basename and are sufficiently similar to be considered a rename, mark them as such without comparing the two to all other rename candidates. Signed-off-by: Elijah Newren <newren@gmail.com>

Make use of the new find_basename_matches() function added in the last two patches, to find renames more rapidly in cases where we can match up files based on basenames. As a quick reminder (see the last two commit messages for more details), this means for example that docs/extensions.txt and docs/config/extensions.txt are considered likely renames if there are no remaining 'extensions.txt' files elsewhere among the added and deleted files, and if a similarity check confirms they are similar, then they are marked as a rename without looking for a better similarity match among other files. This is a behavioral change, as covered in more detail in the previous commit message. We do not use this heuristic together with either break or copy detection. The point of break detection is to say that filename similarity does not imply file content similarity, and we only want to know about file content similarity. The point of copy detection is to use more resources to check for additional similarities, while this is an optimization that uses far less resources but which might also result in finding slightly fewer similarities. So the idea behind this optimization goes against both of those features, and will be turned off for both. For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s Signed-off-by: Elijah Newren <newren@gmail.com>

The last few patches have introduced a new preliminary step when rename detection is on but both break detection and copy detection are off. Document this new step. While we're at it, add a testcase that checks the new behavior as well. Signed-off-by: Elijah Newren <newren@gmail.com>

We want to pass additional information to diffcore_rename() (or some variant thereof) without plumbing that extra information through diff_tree_oid() and diffcore_std(). Further, since we will need to gather additional special information related to diffs and are walking the trees anyway in collect_merge_info(), it seems odd to have diff_tree_oid()/diffcore_std() repeat those tree walks. And there may be times where we can avoid traversing into a subtree in collect_merge_info() (based on additional information at our disposal), that the basic diff logic would be unable to take advantage of. For all these reasons, just create the add and delete pairs ourself and then call diffcore_rename() directly. This change is primarily about enabling future optimizations; the advantage of avoiding extra tree traversals is small compared to the cost of rename detection, and the advantage of avoiding the extra tree traversals is somewhat offset by the extra time spent in collect_merge_info() collecting the additional data anyway. However... For the testcases mentioned in commit 557ac03 ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 13.294 s ± 0.103 s 12.775 s ± 0.062 s mega-renames: 187.248 s ± 0.882 s 188.754 s ± 0.284 s just-one-mega: 5.557 s ± 0.017 s 5.599 s ± 0.019 s Signed-off-by: Elijah Newren <newren@gmail.com>

newren · 2021-02-14T07:50:50Z

/submit

gitgitgadget · 2021-02-14T07:52:22Z

Submitted as pull.843.v5.git.1613289112.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v5

To fetch this version to local tag pr-843/newren/ort-perf-batch-7-v5:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-843/newren/ort-perf-batch-7-v5

newren force-pushed the ort-perf-batch-7 branch from 4dada11 to 581dca0 Compare January 25, 2021 16:11

newren force-pushed the temporary/ort-perf-batch-6 branch from 580ba9a to 7ae9460 Compare February 3, 2021 19:03

newren force-pushed the ort-perf-batch-7 branch from 581dca0 to 1d941c3 Compare February 3, 2021 19:03

gitgitgadget bot reviewed Feb 7, 2021

View reviewed changes

diffcore-rename.c Show resolved Hide resolved

newren force-pushed the ort-perf-batch-7 branch from 1d941c3 to a0e75d8 Compare February 9, 2021 10:33

gitgitgadget bot reviewed Feb 9, 2021

View reviewed changes

Documentation/gitdiffcore.txt Show resolved Hide resolved

gitgitgadget bot reviewed Feb 9, 2021

View reviewed changes

diffcore-rename.c Show resolved Hide resolved

gitgitgadget bot reviewed Feb 9, 2021

View reviewed changes

diffcore-rename.c Show resolved Hide resolved

gitgitgadget bot reviewed Feb 9, 2021

View reviewed changes

diffcore-rename.c Show resolved Hide resolved

newren force-pushed the ort-perf-batch-7 branch from a0e75d8 to fc72d24 Compare February 10, 2021 14:08

gitgitgadget bot reviewed Feb 10, 2021

View reviewed changes

Documentation/gitdiffcore.txt Show resolved Hide resolved

newren force-pushed the ort-perf-batch-7 branch from fc72d24 to fedb3d3 Compare February 11, 2021 07:36

gitgitgadget bot added the seen label Feb 12, 2021

gitgitgadget bot reviewed Feb 13, 2021

View reviewed changes

newren force-pushed the temporary/ort-perf-batch-6 branch from 7ae9460 to a0b6dc9 Compare February 14, 2021 03:31

newren force-pushed the ort-perf-batch-7 branch from fedb3d3 to 3c9d4fe Compare February 14, 2021 03:31

newren added 6 commits February 13, 2021 19:36

newren force-pushed the temporary/ort-perf-batch-6 branch from a0b6dc9 to dd6595b Compare February 14, 2021 03:38

newren force-pushed the ort-perf-batch-7 branch from 3c9d4fe to aeca14f Compare February 14, 2021 03:39

newren closed this Mar 5, 2021

newren deleted the ort-perf-batch-7 branch March 5, 2021 05:59

		@@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
		grep "myotherfile.*myfile" actual

		@@ -367,6 +367,67 @@ static int find_exact_renames(struct diff_options *options)
		return renames;

		@@ -367,6 +367,156 @@ static int find_exact_renames(struct diff_options *options)
		return renames;

		@@ -367,6 +367,155 @@ static int find_exact_renames(struct diff_options *options)
		return renames;

Optimization batch 7: use file basenames to guide rename detection #843

Optimization batch 7: use file basenames to guide rename detection #843

Uh oh!

Conversation

newren commented Jan 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

newren commented Feb 6, 2021

Uh oh!

gitgitgadget bot commented Feb 6, 2021

Uh oh!

gitgitgadget bot commented Feb 7, 2021

Uh oh!

gitgitgadget bot commented Feb 7, 2021

Uh oh!

gitgitgadget bot commented Feb 7, 2021

Uh oh!

Uh oh!

gitgitgadget bot commented Feb 7, 2021

Uh oh!

newren commented Feb 9, 2021

Uh oh!

gitgitgadget bot commented Feb 9, 2021

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

newren commented Feb 10, 2021

Uh oh!

gitgitgadget bot commented Feb 10, 2021

Uh oh!

Uh oh!

newren commented Feb 11, 2021

Uh oh!

gitgitgadget bot commented Feb 11, 2021

Uh oh!

gitgitgadget bot commented Feb 12, 2021

Uh oh!

gitgitgadget bot commented Feb 12, 2021

Uh oh!

gitgitgadget bot commented Feb 12, 2021

Uh oh!

gitgitgadget bot Feb 13, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 13, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 14, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 14, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 14, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 14, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 13, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 13, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 13, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 13, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 14, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Feb 13, 2021

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot commented Feb 13, 2021

newren commented Jan 24, 2021 •

edited

Loading