Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization batch 8: use file basenames even more #844

Closed

Conversation

newren
Copy link

@newren newren commented Jan 24, 2021

This series depends on en/diffcore-rename (a concatenation of what I
was calling ort-perf-batch-6 and ort-perf-batch-7).

Changes since v3:

  • Update the commit messages (one was out of date after the rearrangement), and include Stolee's Reviewed-by

cc: Derrick Stolee stolee@gmail.com
cc: Elijah Newren newren@gmail.com
cc: Junio C Hamano gitster@pobox.com
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com

@newren newren force-pushed the ort-perf-batch-8 branch 2 times, most recently from 3bfbd40 to f3cda8d Compare January 28, 2021 00:16
@newren newren force-pushed the ort-perf-batch-8 branch 2 times, most recently from c53920b to ac1fea5 Compare February 3, 2021 20:33
@newren newren force-pushed the ort-perf-batch-8 branch 3 times, most recently from d266fa7 to e546ae2 Compare February 10, 2021 16:35
@newren newren force-pushed the ort-perf-batch-8 branch 3 times, most recently from 81bcfc4 to 46774dc Compare February 12, 2021 21:12
@newren
Copy link
Author

newren commented Feb 14, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 14, 2021

Submitted as pull.844.git.1613289544.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-844/newren/ort-perf-batch-8-v1

To fetch this version to local tag pr-844/newren/ort-perf-batch-8-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-844/newren/ort-perf-batch-8-v1

@newren
Copy link
Author

newren commented Feb 23, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 23, 2021

Submitted as pull.844.v2.git.1614123848.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-844/newren/ort-perf-batch-8-v2

To fetch this version to local tag pr-844/newren/ort-perf-batch-8-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-844/newren/ort-perf-batch-8-v2

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 24, 2021

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 24, 2021

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/23/2021 6:43 PM, Elijah Newren via GitGitGadget wrote:
> This series depends on en/diffcore-rename (a concatenation of what I was
> calling ort-perf-batch-6 and ort-perf-batch-7).
> 
> There are no changes since v1; it's just a resend a week and a half later to
> bump it so it isn't lost.

Thank you for re-sending. I intended to review it before but got redirected
and forgot to pick it up again.

> === Results ===
> 
> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28), the
> changes in just this series improves the performance as follows:
> 
>                      Before Series           After Series
> no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
> mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
> just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s
> 
> 
> As a reminder, before any merge-ort/diffcore-rename performance work, the
> performance results we started with (as noted in the same commit message)
> were:
> 
> no-renames-am:      6.940 s ±  0.485 s
> no-renames:        18.912 s ±  0.174 s
> mega-renames:    5964.031 s ± 10.459 s
> just-one-mega:    149.583 s ±  0.751 s

These are good results.

I reviewed the patches and believe they do the optimizations claimed. I
only found some nits for comments and whitespace things.

You are very careful to create the necessary pieces and connect them
from the bottom-up. However, this leads to one big "now everything is
done" commit with performance improvements. It seems that there are
some smaller performance improvements that could be measured if the
logic was instead built from the top-down with stubs for the complicated
logic.

For example, the final patch links the rename logic with a call to
idx_possible_rename(). But, that could just as well always return -1
and the implementation would be correct. Then, it would be good to see
if the performance changes with that non-functional update. It would
also help me read the series in patch order and understand the context
of the methods a bit better before seeing their implementation.

This is _not_ a recommendation that you rewrite the series. Just food
for thought as we continue with similar enhancements in the future.

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 14, 2021

This patch series was integrated into seen via git@de7e639.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 15, 2021

There was a status update about the branch en/ort-perf-batch-8 on the Git mailing list:

Rename detection rework continues.

Will cook in 'next'.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 15, 2021

This patch series was integrated into seen via git@891cf66.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 15, 2021

This patch series was integrated into seen via git@7d6738c.

A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same.  The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.

Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique.  Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name.  (For example, the linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files.  A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files).

For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to.  Such a task is precisely the job of directory rename
detection.  However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete.  The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames.  In some cases we won't have
exact renames, in which case this extra optimization won't apply.  We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended().  Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.

This commit introduces an idx_possible_rename() function which will
do this directory rename detection for us and give us the index within
rename_dst of the resulting filename.  For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Add a new struct dir_rename_info with various values we need inside our
idx_possible_rename() function introduced in the previous commit.  Add a
basic implementation for this function showing how we plan to use the
variables, but which will just return early with a value of -1 (not
found) when those variables are not set up.

Future commits will do the work necessary to set up those other
variables so that idx_possible_rename() does not always return -1.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Compute a mapping of full filename to the index within rename_dst where
that filename is found, and store it in idx_map.  idx_possible_rename()
needs this to quickly finding an array entry in rename_dst given the
pathname.

While at it, add placeholder initializations for dir_rename_count and
dir_rename_guess; these will be more fully populated in subsequent
commits.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Move the computation of dir_rename_count from merge-ort.c to
diffcore-rename.c, making slight adjustments to the data structures
based on the move.  While the diffstat looks large, viewing this commit
with --color-moved makes it clear that only about 20 lines changed.

With this patch, the computation of dir_rename_count is still only done
after inexact rename detection, but subsequent commits will add a
preliminary computation of dir_rename_count after exact rename
detection, followed by some updates after inexact rename detection.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
As we adjust the usage of dir_rename_count we want to have a function
for clearing, or partially clearing it out.  Add a
partial_clear_dir_rename_count() function for this purpose.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
This continues the migration of the directory rename detection code into
diffcore-rename, now taking the simple step of combining it with the
dir_rename_info struct.  Future commits will then make dir_rename_counts
be computed in stages, and add computation of dir_rename_guess.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
When diffcore_rename_extended() is passed a NULL dir_rename_count, we
will still want to create a temporary one for use by
find_basename_matches(), but have it fully deallocated before
diffcore_rename_extended() returns.  However, when
diffcore_rename_extended() is passed a dir_rename_count, we want to fill
that strmap with appropriate values and return it.  However, for our
interim purposes we may also add entries corresponding to directories
that cannot have been renamed due to still existing on both sides.

Extend cleanup_dir_rename_info() to handle these two different cases,
cleaning up the relevant bits of information for each case.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Compute dir_rename_counts based just on exact renames to start, as that
can provide us useful information in find_basename_matches().  This is
done by moving the code from compute_dir_rename_counts() into
initialize_dir_rename_info(), resulting in it being computed earlier and
based just on exact renames.  Since that's an incomplete result, we
augment the counts via calling update_dir_rename_counts() after each
basename-guide and inexact rename detection match is found.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
We are using dir_rename_counts to count the number of other directories
that files within a directory moved to.  We only need this information
for directories that disappeared, though, so we can return early from
update_dir_rename_counts() for other paths.

If dirs_removed is passed to diffcore_rename_extended(), then it
provides the relevant bits of information for us to limit this counting
to relevant dirs.  If dirs_removed is not passed, we would need to
compute some replacement in order to do this limiting.  Introduce a new
info->relevant_source_dirs variable for this purpose, even though at
this stage we will only set it to dirs_removed for simplicity.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
dir_rename_counts has a mapping of a mapping, in particular, it has
   old_dir => { new_dir => count }
We want a simple mapping of
   old_dir => new_dir
based on which new_dir had the highest count for a given old_dir.
Compute this and store it in dir_rename_guess.

This is the final piece of the puzzle needed to make our guesses at
which directory files have been moved to when basenames aren't unique.

For the testcases mentioned in commit 557ac03 ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
    mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
    just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 15, 2021

This patch series was integrated into seen via git@b382178.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 17, 2021

There was a status update about the branch en/ort-perf-batch-8 on the Git mailing list:

Rename detection rework continues.

Will cook in 'next'.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 19, 2021

This patch series was integrated into seen via git@3f6755e.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 19, 2021

This patch series was integrated into seen via git@feb8f96.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 20, 2021

This patch series was integrated into seen via git@a467314.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 20, 2021

There was a status update about the branch en/ort-perf-batch-8 on the Git mailing list:

Rename detection rework continues.

Will merge to 'master'.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 22, 2021

This patch series was integrated into seen via git@fb339f8.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 22, 2021

This patch series was integrated into seen via git@dd4048d.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 22, 2021

This patch series was integrated into next via git@dd4048d.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 22, 2021

This patch series was integrated into master via git@dd4048d.

@gitgitgadget gitgitgadget bot added the master label Mar 22, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 22, 2021

Closed via dd4048d.

@gitgitgadget gitgitgadget bot closed this Mar 22, 2021
@newren newren deleted the ort-perf-batch-8 branch March 24, 2021 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant