Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerate "git merge-base --is-ancestor" #664

Closed

Conversation

derrickstolee
Copy link

It was recently [1] reported (and not-so-recently [2] reported) that "git merge-base --is-ancestor" can be pretty slow. In fact, it is regularly slower than "git branch --contains" or "git tag --contains", which are answering a "harder" query.

[1] https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/

[2] https://lore.kernel.org/git/87608bawoa.fsf@evledraar.gmail.com/

The root cause is that the in_merge_base() implementation is skipping the fast can_all_from_reach() implementation and using paint_down_to_common() instead. Note that these are equivalent: a commit A is in the set of merge-bases between A and B if and only if B can reach A.

This fixes the issue, and makes the performance degradation reported by Szeder a non-issue.

Thanks,
-Stolee

Cc: szeder.dev@gmail.com, avarab@gmail.com, abhishekkumar8222@gmail.com, me@ttaylorr.com

The next change will make repo_in_merge_bases() depend on the logic in
is_descendant_of(), but we need to make the method independent of
the_repository first.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The repo_is_descendant_of() method is aware of the existence of the
commit-graph file. It checks for generation_numbers_enabled() before
deciding on using can_all_from_reach() or repo_in_merge_bases()
depending on the situation. The reason here is that can_all_from_reach()
uses a depth-first search that is limited by the minimum generation
number of the target commits, and that algorithm can be very slow when
generation numbers are not present. The alternative uses
paint_down_to_common() which will walk the entire merge-base boundary,
which is typically slower.

This method is used by commands like "git tag --contains" and "git
branch --contains" for very fast results when a commit-graph file
exists. Unfortunately, it is _not_ used in commands like "git merge-base
--is-ancestor" which is doing an even simpler request.

This issue was raised recently [1] with respect to a change to how
generation numbers are stored, but was also reported much earlier [2]
before commit-reach.c existed to simplify these reachability queries.

[1] https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/
[2] https://lore.kernel.org/git/87608bawoa.fsf@evledraar.gmail.com/

The root cause is that builtin/merge-base.c has a method
handle_is_ancestor() that calls in_merge_bases(), an older version of
repo_in_merge_bases(). It would be better if we have every caller to
in_merge_bases() use the logic in can_all_from_reach() when possible.

This is where things get a little tricky: repo_is_descendant_of() calls
repo_in_merge_bases() in the non-generation numbers enabled case! If we
simply update repo_in_merge_bases() to call repo_is_descendant_of()
instead of repo_in_merge_bases_many(), then we will get a recursive call
loop. Thankfully, this is caught by the test suite in the default mode
(i.e. GIT_TEST_COMMIT_GRAPH=0).

The trick, then, is to make the non-generation number case for
repo_is_descendant_of() call repo_in_merge_bases_many() directly,
skipping the non-_many version. This allows us to take advantage of this
faster code path, when possible.

The easiest way to measure the performance impact is to test the
following command on the Linux kernel repository:

	git merge-base --is-ancestor <A> <B>

  | A    | B    | Time Before | Time After |
  |------|------|-------------|------------|
  | v3.0 | v5.7 | 0.459s      | 0.028s     |
  | v4.0 | v5.7 | 0.267s      | 0.021s     |
  | v5.0 | v5.7 | 0.074s      | 0.013s     |

Note that each of these samples return success. The old code performed
the same operation when <A> and <B> are swapped. However,
can_all_from_reach() will return immediately if the generation numbers
show that <A> has larger generation number than <B>. Thus, the time for
the swapped case is universally 0.004s in each case.

Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 17, 2020

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 17, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> It was recently [1] reported (and not-so-recently [2] reported) that "git
> merge-base --is-ancestor" can be pretty slow. In fact, it is regularly
> slower than "git branch --contains" or "git tag --contains", which are
> answering a "harder" query.
>
> [1] https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/
>
> [2] https://lore.kernel.org/git/87608bawoa.fsf@evledraar.gmail.com/
>
> The root cause is that the in_merge_base() implementation is skipping the
> fast can_all_from_reach() implementation and using paint_down_to_common()
> instead. Note that these are equivalent: a commit A is in the set of
> merge-bases between A and B if and only if B can reach A.

OK, so in short, this codepath was not taking advantage of the
generation numbers and that was the difference?

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

This branch is now known as ds/merge-base-is-ancestor-optim.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

This patch series was integrated into pu via git@92cfb92.

@gitgitgadget gitgitgadget bot added the pu label Jun 18, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/17/2020 4:46 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> It was recently [1] reported (and not-so-recently [2] reported) that "git
>> merge-base --is-ancestor" can be pretty slow. In fact, it is regularly
>> slower than "git branch --contains" or "git tag --contains", which are
>> answering a "harder" query.
>>
>> [1] https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/
>>
>> [2] https://lore.kernel.org/git/87608bawoa.fsf@evledraar.gmail.com/
>>
>> The root cause is that the in_merge_base() implementation is skipping the
>> fast can_all_from_reach() implementation and using paint_down_to_common()
>> instead. Note that these are equivalent: a commit A is in the set of
>> merge-bases between A and B if and only if B can reach A.
> 
> OK, so in short, this codepath was not taking advantage of the
> generation numbers and that was the difference?

Correct. We now use generation numbers more often, when
available. All callers of in_merge_bases() will become
faster, which includes callers in these builtins:

	- branch
	- fetch
	- log
	- merge-base
	- receive-pack

and these libgit files:

	- fast-import.c
	- http-push.c
	- merge-recursive.c (several!)
	- pack-bitmap-write.c

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

This patch series was integrated into pu via git@3a602fe.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 19, 2020

This patch series was integrated into pu via git@dd64d65.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 19, 2020

This patch series was integrated into next via git@86b34a3.

@gitgitgadget gitgitgadget bot added the next label Jun 19, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 19, 2020

On the Git mailing list, Abhishek Kumar wrote (reply to this):

On Wed, Jun 17, 2020 at 05:24:27PM +0000, Derrick Stolee via GitGitGadget wrote:
> It was recently [1] reported (and not-so-recently [2] reported) that "git
> merge-base --is-ancestor" can be pretty slow. In fact, it is regularly
> slower than "git branch --contains" or "git tag --contains", which are
> answering a "harder" query.
> 
> [1] https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/
> 
> [2] https://lore.kernel.org/git/87608bawoa.fsf@evledraar.gmail.com/
> 
> The root cause is that the in_merge_base() implementation is skipping the
> fast can_all_from_reach() implementation and using paint_down_to_common()
> instead. Note that these are equivalent: a commit A is in the set of
> merge-bases between A and B if and only if B can reach A.
> 
> This fixes the issue, and makes the performance degradation reported by
> Szeder a non-issue.
> 
> Thanks, -Stolee
> 
> Derrick Stolee (2):
>   commit-reach: create repo_is_descendant_of()
>   commit-reach: use fast logic in repo_in_merge_base
> 
>  commit-reach.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> 
> base-commit: b3d7a52fac39193503a0b6728771d1bf6a161464
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-664%2Fderrickstolee%2Fmerge-base-is-ancestor-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-664/derrickstolee/merge-base-is-ancestor-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/664
> -- 
> gitgitgadget

Wow! Thanks for investigating through the issue and following up. The
performance numbers speak for themselves.

By applying this series on the commit-slab patch series, both are now
just as fast (master: 0.048s, commit-slab: 0.050s). 

Regards
Abhishek

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 24, 2020

This patch series was integrated into seen via git@168f87a.

@gitgitgadget gitgitgadget bot added the seen label Jun 24, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 24, 2020

This patch series was integrated into seen via git@d082ecb.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 25, 2020

This patch series was integrated into seen via git@dc4b3cf.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 25, 2020

This patch series was integrated into next via git@dc4b3cf.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 25, 2020

This patch series was integrated into master via git@dc4b3cf.

@gitgitgadget gitgitgadget bot added the master label Jun 25, 2020
@gitgitgadget gitgitgadget bot closed this Jun 25, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 25, 2020

Closed via dc4b3cf.

@@ -283,7 +283,9 @@ struct commit_list *repo_get_merge_bases(struct repository *r,
/*
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Wed, Jun 17, 2020 at 05:24:28PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The next change will make repo_in_merge_bases() depend on the logic in
> is_descendant_of(), but we need to make the method independent of
> the_repository first.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-reach.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/commit-reach.c b/commit-reach.c
> index 4ca7e706a18..13722430aa5 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -283,7 +283,9 @@ struct commit_list *repo_get_merge_bases(struct repository *r,
>  /*
>   * Is "commit" a descendant of one of the elements on the "with_commit" list?
>   */
> -int is_descendant_of(struct commit *commit, struct commit_list *with_commit)
> +static int repo_is_descendant_of(struct repository *r,
> +				 struct commit *commit,
> +				 struct commit_list *with_commit)
>  {
>  	if (!with_commit)
>  		return 1;
> @@ -301,13 +303,18 @@ int is_descendant_of(struct commit *commit, struct commit_list *with_commit)
>
>  			other = with_commit->item;
>  			with_commit = with_commit->next;
> -			if (in_merge_bases(other, commit))
> +			if (repo_in_merge_bases(r, other, commit))
>  				return 1;
>  		}
>  		return 0;
>  	}
>  }
>
> +int is_descendant_of(struct commit *commit, struct commit_list *with_commit)
> +{
> +	return repo_is_descendant_of(the_repository, commit, with_commit);
> +}
> +

I don't think that it makes a big deal either way, but I wonder about
moving 'repo_is_descendant_of' to the header file, and making
'is_descendant_of' be 'static inline int' as you defined it here.

Since this has already graduated up to master already, I don't think
that it's worth going back just to shuffle this code around, but I was
wondering if you had any specific reason for doing it this way.

>  /*
>   * Is "commit" an ancestor of one of the "references"?
>   */
> --
> gitgitgadget
>
Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/29/2020 9:40 AM, Taylor Blau wrote:
> On Wed, Jun 17, 2020 at 05:24:28PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +int is_descendant_of(struct commit *commit, struct commit_list *with_commit)
>> +{
>> +	return repo_is_descendant_of(the_repository, commit, with_commit);
>> +}
>> +
> 
> I don't think that it makes a big deal either way, but I wonder about
> moving 'repo_is_descendant_of' to the header file, and making
> 'is_descendant_of' be 'static inline int' as you defined it here.
> 
> Since this has already graduated up to master already, I don't think
> that it's worth going back just to shuffle this code around, but I was
> wondering if you had any specific reason for doing it this way.

I have good news for you. [1]

[1] https://lore.kernel.org/git/20200623184222.54201-1-carenas@gmail.com/

Thanks,
-Stolee

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant