Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetch: limit shared symref check only for local branches #1266

Closed
wants to merge 1 commit into from

Conversation

orgads
Copy link
Contributor

@orgads orgads commented May 16, 2022

No description provided.

@orgads
Copy link
Contributor Author

orgads commented May 16, 2022

/submit

@gitgitgadget-git
Copy link

Submitted as pull.1266.git.git.1652690275625.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1266/orgads/fetch-perf-v1

To fetch this version to local tag pr-git-1266/orgads/fetch-perf-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1266/orgads/fetch-perf-v1

This check was introduced in 8ee5d73 (Fix fetch/pull when run without
--update-head-ok, 2008-10-13) in order to protect against replacing the ref
of the active branch by mistake, for example by running git fetch origin
master:master.

It was later extended in 8bc1f39 (fetch: protect branches checked out
in all worktrees, 2021-12-01) to scan all worktrees.

This operation is very expensive (takes about 30s in my repository) when
there are many tags or branches, and it is executed on every fetch, even if
no local heads are updated at all.

Limit it to protect only refs/heads/* to improve fetch performance.

Signed-off-by: Orgad Shaneh <orgads@gmail.com>
@orgads
Copy link
Contributor Author

orgads commented May 16, 2022

/submit

@gitgitgadget-git
Copy link

Submitted as pull.1266.v2.git.git.1652690501963.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1266/orgads/fetch-perf-v2

To fetch this version to local tag pr-git-1266/orgads/fetch-perf-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1266/orgads/fetch-perf-v2

@orgads
Copy link
Contributor Author

orgads commented May 16, 2022

@dscho You're the original author. Can you please review?

@gitgitgadget-git
Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Orgad Shaneh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Orgad Shaneh <orgads@gmail.com>
>
> This check was introduced in 8ee5d73137f (Fix fetch/pull when run without
> --update-head-ok, 2008-10-13) in order to protect against replacing the ref
> of the active branch by mistake, for example by running git fetch origin
> master:master.
>
> It was later extended in 8bc1f39f411 (fetch: protect branches checked out
> in all worktrees, 2021-12-01) to scan all worktrees.
>
> This operation is very expensive (takes about 30s in my repository) when
> there are many tags or branches, and it is executed on every fetch, even if
> no local heads are updated at all.
>
> Limit it to protect only refs/heads/* to improve fetch performance.

The point of the check is to prevent the index+working tree in the
worktrees to go out of sync with HEAD, and HEAD by definition can
point only into refs/heads/*, this change should be OK.

It is surprising find_shared_symref() is so expensive, though.  If
you have a dozen worktrees linked to the current repository, there
are at most a dozen HEAD that point at various refs in refs/heads/
namespace.  Even if you need to check a thousand ref_map elements,
it should cost almost nothing if you build a hashmap to find matches
with these dozen HEADs upfront, no?

Another thing that is surprising is that you say this loop is
expensive when there are many tags or branches.  Do you mean it is
expensive when there are many tags and branches that are updated, or
it is expensive to merely have thousands of dormant tags and
branches?  If the latter, I wonder if it is sensible to limit the
check only to the refs that are going to be updated.

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index e3791f09ed5..eeee5ac8f15 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1440,6 +1440,7 @@ static void check_not_current_branch(struct ref *ref_map,
>  	const struct worktree *wt;
>  	for (; ref_map; ref_map = ref_map->next)
>  		if (ref_map->peer_ref &&
> +		    starts_with(ref_map->peer_ref->name, "refs/heads/") &&
>  		    (wt = find_shared_symref(worktrees, "HEAD",
>  					     ref_map->peer_ref->name)) &&
>  		    !wt->is_bare)
>
> base-commit: 277cf0bc36094f6dc4297d8c9cef79df045b735d

@gitgitgadget-git
Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

>> Limit it to protect only refs/heads/* to improve fetch performance.
>
> The point of the check is to prevent the index+working tree in the
> worktrees to go out of sync with HEAD, and HEAD by definition can
> point only into refs/heads/*, this change should be OK.

I am willing to take the above change as-is as a standalone patch,
but independently from that ...

> It is surprising find_shared_symref() is so expensive, though.  If
> you have a dozen worktrees linked to the current repository, there
> are at most a dozen HEAD that point at various refs in refs/heads/
> namespace.  Even if you need to check a thousand ref_map elements,
> it should cost almost nothing if you build a hashmap to find matches
> with these dozen HEADs upfront, no?
>
> Another thing that is surprising is that you say this loop is
> expensive when there are many tags or branches.  Do you mean it is
> expensive when there are many tags and branches that are updated, or
> it is expensive to merely have thousands of dormant tags and
> branches?  If the latter, I wonder if it is sensible to limit the
> check only to the refs that are going to be updated.

... I was wondering if an approach like the attached might be a
better way to proceed in the longer run.  Instead of (or in addition
to) reducing the number of calls to the function, we should see if
we can make the function less expensive.

In short, find_shared_symref(), and is_worktree_being_FUTZed(), is
following a wrong API pattern that encourages a loop to call them
repeatedly by taking 'target' parameter and doing a comparison
itself.  If you want to find if any of your 1000 refs violate what
they are trying to check, you'd need to call them 1000 times.

Instead, you should be able to learn which branch is to be protected
per each worktree and do the asking about your 1000 refs yourself.

If we instead find out what branch, other than the one pointed at by
the HEAD symref, each worktree is working on, just like we find out
what branch is pointed at by the HEAD symref, and store that
findings in the worktree structure, you should be able to walk the
list of worktrees just once to build a hashmap that lets you tell
which branches are not to be messed with before deciding if the
fetch should go through.

The following is not even compile tested, and some details may be
wrong, but I hope it is good enough to illustrate the idea.

 worktree.c | 73 ++++++++++++++++++++++++++++++++++----------------------------
 worktree.h |  5 +++++
 2 files changed, 45 insertions(+), 33 deletions(-)

diff --git c/worktree.c w/worktree.c
index 90fc085f76..2d96bd9dd1 100644
--- c/worktree.c
+++ w/worktree.c
@@ -15,6 +15,7 @@ void free_worktrees(struct worktree **worktrees)
 		free(worktrees[i]->path);
 		free(worktrees[i]->id);
 		free(worktrees[i]->head_ref);
+		free(worktrees[i]->protected_branch);
 		free(worktrees[i]->lock_reason);
 		free(worktrees[i]->prune_reason);
 		free(worktrees[i]);
@@ -22,13 +23,28 @@ void free_worktrees(struct worktree **worktrees)
 	free (worktrees);
 }
 
+int is_worktree_being_rebased(const struct worktree *wt,
+			      const char *target)
+{
+	return ((wt->protected_reason == WT_BRANCH_REBASING) &&
+		(!strcmp(target, wt->protected_branch)));
+}
+
+int is_worktree_being_bisected(const struct worktree *wt,
+			       const char *target)
+{
+	return ((wt->protected_reason == WT_BRANCH_BISECTING) &&
+		(!strcmp(target, wt->protected_branch)));
+}
+
 /**
- * Update head_oid, head_ref and is_detached of the given worktree
+ * Grab HEAD-related information of the given worktree
  */
 static void add_head_info(struct worktree *wt)
 {
 	int flags;
 	const char *target;
+	struct wt_status_state state;
 
 	target = refs_resolve_ref_unsafe(get_worktree_ref_store(wt),
 					 "HEAD",
@@ -41,6 +57,29 @@ static void add_head_info(struct worktree *wt)
 		wt->head_ref = xstrdup(target);
 	else
 		wt->is_detached = 1;
+
+	wt->protected_reason = 0;
+	memset(&state, 0, sizeof(state));
+	if (wt_status_check_rebase(wt, &state) &&
+	    (state.rebase_in_progress ||
+	     state.rebase_interactive_in_progress) &&
+	    state.branch &&
+	    skip_prefix(state.branch, "refs/heads/", &target)) {
+		wt->protected_branch = xstrdup(target);
+		wt->protected_reason = WT_BRANCH_REBASING;
+	}
+	wt_status_state_free_buffers(&state);
+
+	memset(&state, 0, sizeof(state));
+	if (wt_status_check_bisect(wt, &state) && state.branch &&
+	    skip_prefix(state.branch, "refs/heads/", &target)) {
+		if (wt->protected_reason)
+			BUG("branch '%s' being bisected and rebased at the same time?",
+			    wt->protected_branch);
+		wt->protected_branch = xstrdup(target);
+		wt->protected_reason = WT_BRANCH_BISECTING;
+	}
+	wt_status_state_free_buffers(&state);
 }
 
 /**
@@ -365,38 +404,6 @@ void update_worktree_location(struct worktree *wt, const char *path_)
 	strbuf_release(&path);
 }
 
-int is_worktree_being_rebased(const struct worktree *wt,
-			      const char *target)
-{
-	struct wt_status_state state;
-	int found_rebase;
-
-	memset(&state, 0, sizeof(state));
-	found_rebase = wt_status_check_rebase(wt, &state) &&
-		       (state.rebase_in_progress ||
-			state.rebase_interactive_in_progress) &&
-		       state.branch &&
-		       skip_prefix(target, "refs/heads/", &target) &&
-		       !strcmp(state.branch, target);
-	wt_status_state_free_buffers(&state);
-	return found_rebase;
-}
-
-int is_worktree_being_bisected(const struct worktree *wt,
-			       const char *target)
-{
-	struct wt_status_state state;
-	int found_bisect;
-
-	memset(&state, 0, sizeof(state));
-	found_bisect = wt_status_check_bisect(wt, &state) &&
-		       state.branch &&
-		       skip_prefix(target, "refs/heads/", &target) &&
-		       !strcmp(state.branch, target);
-	wt_status_state_free_buffers(&state);
-	return found_bisect;
-}
-
 /*
  * note: this function should be able to detect shared symref even if
  * HEAD is temporarily detached (e.g. in the middle of rebase or
diff --git c/worktree.h w/worktree.h
index e9e839926b..4e9d06c26a 100644
--- c/worktree.h
+++ w/worktree.h
@@ -10,6 +10,11 @@ struct worktree {
 	char *path;
 	char *id;
 	char *head_ref;		/* NULL if HEAD is broken or detached */
+	char *protected_branch; /* being rebased or bisected, othewise NULL */
+	enum {
+		WT_BRANCH_REBASING = 1,
+		WT_BRANCH_BISECTING,
+	} protected_reason;
 	char *lock_reason;	/* private - use worktree_lock_reason */
 	char *prune_reason;     /* private - use worktree_prune_reason */
 	struct object_id head_oid;

@gitgitgadget-git
Copy link

This branch is now known as os/fetch-check-not-current-branch.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via be12624.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via ad7820f.

@gitgitgadget-git
Copy link

There was a status update in the "New Topics" section about the branch os/fetch-check-not-current-branch on the Git mailing list:

The way "git fetch" without "--update-head-ok" ensures that HEAD in
no worktree points at any ref being updated was too wasteful, which
has been optimized a bit.

Will merge to 'next'?
source: <pull.1266.v2.git.git.1652690501963.gitgitgadget@gmail.com>

@gitgitgadget-git
Copy link

On the Git mailing list, Orgad Shaneh wrote (reply to this):

On Mon, May 16, 2022 at 7:01 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Orgad Shaneh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Orgad Shaneh <orgads@gmail.com>
> >
> > This check was introduced in 8ee5d73137f (Fix fetch/pull when run without
> > --update-head-ok, 2008-10-13) in order to protect against replacing the ref
> > of the active branch by mistake, for example by running git fetch origin
> > master:master.
> >
> > It was later extended in 8bc1f39f411 (fetch: protect branches checked out
> > in all worktrees, 2021-12-01) to scan all worktrees.
> >
> > This operation is very expensive (takes about 30s in my repository) when
> > there are many tags or branches, and it is executed on every fetch, even if
> > no local heads are updated at all.
> >
> > Limit it to protect only refs/heads/* to improve fetch performance.
>
> The point of the check is to prevent the index+working tree in the
> worktrees to go out of sync with HEAD, and HEAD by definition can
> point only into refs/heads/*, this change should be OK.
>
> It is surprising find_shared_symref() is so expensive, though.  If
> you have a dozen worktrees linked to the current repository, there
> are at most a dozen HEAD that point at various refs in refs/heads/
> namespace.  Even if you need to check a thousand ref_map elements,
> it should cost almost nothing if you build a hashmap to find matches
> with these dozen HEADs upfront, no?

I also had this idea, but I'm not familiar enough with the codebase to
implement it. I see you already started that.

> Another thing that is surprising is that you say this loop is
> expensive when there are many tags or branches.  Do you mean it is
> expensive when there are many tags and branches that are updated, or
> it is expensive to merely have thousands of dormant tags and
> branches?  If the latter, I wonder if it is sensible to limit the
> check only to the refs that are going to be updated.

It's expensive even when *nothing* is updated. I have a repo with 44K
tags, 13K of the tags are annotated, 134 remote branches and 4
worktrees (except the main repo) with 33 local branches.

I counted the calls to find_shared_symref - it was called 35755 times,
and refs_read_raw_ref was called 357585 times.

- Orgad

@gitgitgadget-git
Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

Orgad Shaneh <orgads@gmail.com> writes:

>> Another thing that is surprising is that you say this loop is
>> expensive when there are many tags or branches.  Do you mean it is
>> expensive when there are many tags and branches that are updated, or
>> it is expensive to merely have thousands of dormant tags and
>> branches?  If the latter, I wonder if it is sensible to limit the
>> check only to the refs that are going to be updated.
>
> It's expensive even when *nothing* is updated. I have a repo with 44K
> tags, 13K of the tags are annotated, 134 remote branches and 4
> worktrees (except the main repo) with 33 local branches.
>
> I counted the calls to find_shared_symref - it was called 35755 times,
> and refs_read_raw_ref was called 357585 times.

That is exactly why I asked, as the above number hints that it could
be a viable optimization to omit calls for refs whose old_ and
new_oid are the same, just like you omit calls for refs that are not
inside refs/heads/ in your patch, perhaps?

@gitgitgadget-git
Copy link

On the Git mailing list, Orgad Shaneh wrote (reply to this):

On Tue, May 17, 2022 at 1:27 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Orgad Shaneh <orgads@gmail.com> writes:
>
> >> Another thing that is surprising is that you say this loop is
> >> expensive when there are many tags or branches.  Do you mean it is
> >> expensive when there are many tags and branches that are updated, or
> >> it is expensive to merely have thousands of dormant tags and
> >> branches?  If the latter, I wonder if it is sensible to limit the
> >> check only to the refs that are going to be updated.
> >
> > It's expensive even when *nothing* is updated. I have a repo with 44K
> > tags, 13K of the tags are annotated, 134 remote branches and 4
> > worktrees (except the main repo) with 33 local branches.
> >
> > I counted the calls to find_shared_symref - it was called 35755 times,
> > and refs_read_raw_ref was called 357585 times.
>
> That is exactly why I asked, as the above number hints that it could
> be a viable optimization to omit calls for refs whose old_ and
> new_oid are the same, just like you omit calls for refs that are not
> inside refs/heads/ in your patch, perhaps?

This would require shuffling the code. check_not_current_branch() is
called by do_fetch before fetch_and_consume_refs (which updates
ref->old_oid and ref->new_oid).

- Orgad

@gitgitgadget-git
Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

Orgad Shaneh <orgads@gmail.com> writes:

> This would require shuffling the code. check_not_current_branch() is
> called by do_fetch before fetch_and_consume_refs (which updates
> ref->old_oid and ref->new_oid).

Oh, that's unfortunate.

We may not be matching them with the current values of the local
copies in ref_map in today's code until fetch_and_consume_refs()
calls store_updated_refs() to update the .new_oid member, but I
would have thought that we learned the equivalent of "git ls-remote"
output a lot earlier, at least most of the time, via a call to
transport_get_remote_refs(), and that was why I thought such an
optimization was doable.

Thanks.  

@gitgitgadget-git
Copy link

This patch series was integrated into seen via a5b3ce7.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 8968fd4.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 29577cb.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 548f7ee.

@gitgitgadget-git
Copy link

This patch series was integrated into next via f48dca6.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via df5a5f8.

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch os/fetch-check-not-current-branch on the Git mailing list:

The way "git fetch" without "--update-head-ok" ensures that HEAD in
no worktree points at any ref being updated was too wasteful, which
has been optimized a bit.

Will merge to 'master'.
source: <pull.1266.v2.git.git.1652690501963.gitgitgadget@gmail.com>

@dscho
Copy link
Member

dscho commented May 21, 2022

dscho You're the original author. Can you please review?

Seems that you already got good reviews.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 6db2344.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 5da424e.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 5ed49a7.

@gitgitgadget-git
Copy link

This patch series was integrated into master via 5ed49a7.

@gitgitgadget-git
Copy link

This patch series was integrated into next via 5ed49a7.

@gitgitgadget-git
Copy link

Closed via 5ed49a7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants