Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write commit-graph on fetch #328

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Sep 3, 2019

Instead of waiting for a non-trivial GC command to write the commit-graph, write one during 'git fetch'. By using the equivalent method for 'git commit-graph write --split --reachable', we create a commit-graph chain that includes all reachable commits. Most of the time, these writes only add the newly-downloaded commits in a small file at the top of the commit-graph chain. Commits that are no longer reachable still exist in the commit-graph, but will be cleaned up by a later GC command that forces the commit-graph to be rewritten completely.

A version of this patch [1] used to be a part of ds/commit-graph-incremental, but was removed to focus on the incremental commit-graph file format. Now that the incremental file format has been shipped in v2.23.0 and some config things have adjusted in ds/feature-macros, I'm reintroducing the idea.

Ævar had mentioned wanting to do something with "incremental maintenance during GC" [2]. I haven't seen any patches towards that aim (please point me in that direction if they have been submitted). I still think it is worth allowing a write at fetch time, as some users have GC disabled. I know for sure that users who only interact with their Git repos via Visual Studio Team Explorer have all Git commands running with GC disabled, and likely other desktop GUI clients have it disabled to avoid blocking processes.

Aside: VFS for Git users have GC disabled, but the commit-graph is being written in the background by a monitoring process. We shipped the incremental commit-graph writes in a recent version and reduced our writes from ~60 seconds each to less than a second on average. Very rarely, the layers of the commit-graph chain collapse and return to the old values. This feature has been performing well with no known issues.

Thanks,
-Stolee

[1] https://public-inbox.org/git/3c52385e5696887c40cab4a6b9b7923d60a0567c.1557330827.git.gitgitgadget@gmail.com/

[2] https://public-inbox.org/git/b1de6af2-c015-098e-a656-e1b68056e037@gmail.com/

Cc: peff@peff.net, avarab@gmail.com, garimasigit@gmail.com

The commit-graph feature is now on by default, and is being
written during 'git gc' by default. Typically, Git only writes
a commit-graph when a 'git gc --auto' command passes the gc.auto
setting to actualy do work. This means that a commit-graph will
typically fall behind the commits that are being used every day.

To stay updated with the latest commits, add a step to 'git
fetch' to write a commit-graph after fetching new objects. The
fetch.writeCommitGraph config setting enables writing a split
commit-graph, so on average the cost of writing this file is
very small. Occasionally, the commit-graph chain will collapse
to a single level, and this could be slow for very large repos.

For additional use, adjust the default to be true when
feature.experimental is enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 3, 2019

Submitted as pull.328.git.gitgitgadget@gmail.com

@@ -17,6 +17,14 @@ which can improve `git push` performance in repos with many files.
+
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 53ce99d2bb..d36a403859 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -23,6 +23,7 @@
>  #include "packfile.h"
>  #include "list-objects-filter-options.h"
>  #include "commit-reach.h"
> +#include "commit-graph.h"
>  
>  #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
>  
> @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>  
>  	string_list_clear(&list, 0);
>  
> +	prepare_repo_settings(the_repository);
> +	if (the_repository->settings.fetch_write_commit_graph) {
> +		int commit_graph_flags = COMMIT_GRAPH_SPLIT;
> +		struct split_commit_graph_opts split_opts;
> +		memset(&split_opts, 0, sizeof(struct split_commit_graph_opts));
> +
> +		if (progress)
> +			commit_graph_flags |= COMMIT_GRAPH_PROGRESS;
> +
> +		write_commit_graph_reachable(get_object_directory(),
> +					     commit_graph_flags,
> +					     &split_opts);
> +	}

As a low-impact change this is good.  

For longer term, it feels a bit unfortunate that this is still a
separate phase of the program, though.  We know what new refs we
added, we know what new objects we received, and we even scanned
each and every one of them while running the index-pack step to
store the .pack and compute the .idx file, i.e. it feels that we
have most of the information already in-core to extend the commit
graph for new parts of the history we just received.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/3/2019 3:05 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> diff --git a/builtin/fetch.c b/builtin/fetch.c
>> index 53ce99d2bb..d36a403859 100644
>> --- a/builtin/fetch.c
>> +++ b/builtin/fetch.c
>> @@ -23,6 +23,7 @@
>>  #include "packfile.h"
>>  #include "list-objects-filter-options.h"
>>  #include "commit-reach.h"
>> +#include "commit-graph.h"
>>  
>>  #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
>>  
>> @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>>  
>>  	string_list_clear(&list, 0);
>>  
>> +	prepare_repo_settings(the_repository);
>> +	if (the_repository->settings.fetch_write_commit_graph) {
>> +		int commit_graph_flags = COMMIT_GRAPH_SPLIT;
>> +		struct split_commit_graph_opts split_opts;
>> +		memset(&split_opts, 0, sizeof(struct split_commit_graph_opts));
>> +
>> +		if (progress)
>> +			commit_graph_flags |= COMMIT_GRAPH_PROGRESS;
>> +
>> +		write_commit_graph_reachable(get_object_directory(),
>> +					     commit_graph_flags,
>> +					     &split_opts);
>> +	}
> 
> As a low-impact change this is good.  
> 
> For longer term, it feels a bit unfortunate that this is still a
> separate phase of the program, though.  We know what new refs we
> added, we know what new objects we received, and we even scanned
> each and every one of them while running the index-pack step to
> store the .pack and compute the .idx file, i.e. it feels that we
> have most of the information already in-core to extend the commit
> graph for new parts of the history we just received.

You're right that we could isolate the new write to the refs we
just received. We could use the more cumbersome write_commit_graph()
method with a list of commit oids as starting points. I'm happy to
make that change if we see a lot of value there.

However, the current patch also gives us a way to add local refs
to the commit graph and "catch up" to the work the user had done.
With this in mind, I do think the simpler code has another functional
advantage.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

>>> +		write_commit_graph_reachable(get_object_directory(),
>>> +					     commit_graph_flags,
>>> +					     &split_opts);
>>> +	}
>> 
>> As a low-impact change this is good.  
>> 
>> For longer term, it feels a bit unfortunate that this is still a
>> separate phase of the program, though.  We know what new refs we
>> added, we know what new objects we received, and we even scanned
>> each and every one of them while running the index-pack step to
>> store the .pack and compute the .idx file, i.e. it feels that we
>> have most of the information already in-core to extend the commit
>> graph for new parts of the history we just received.
>
> You're right that we could isolate the new write to the refs we
> just received. We could use the more cumbersome write_commit_graph()
> method with a list of commit oids as starting points. I'm happy to
> make that change if we see a lot of value there.

Well, that is not the kind of information reuse I am talking about.

I was wondering if "index-pack" has enough information in-core after
it receives and processes the incoming pack data, scanned each and
every object in it, in order to write out the commit graph _without_
having to do a lot of duplicate computation and enumeration of the
objects done in the current commit-graph.c::write_commit_graph(), so
that it can learn a "--write-commit-graph" option that performs much
better than running "git fetch && git commit-graph write".

Perhaps that would require too much refactoring of both index-pack
and commit-graph code and infeasible in short to medium term and
that is why I said "for longer term, it feels a bit unfortunate".

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 02:46:22PM -0700, Junio C Hamano wrote:

> > You're right that we could isolate the new write to the refs we
> > just received. We could use the more cumbersome write_commit_graph()
> > method with a list of commit oids as starting points. I'm happy to
> > make that change if we see a lot of value there.
> 
> Well, that is not the kind of information reuse I am talking about.
> 
> I was wondering if "index-pack" has enough information in-core after
> it receives and processes the incoming pack data, scanned each and
> every object in it, in order to write out the commit graph _without_
> having to do a lot of duplicate computation and enumeration of the
> objects done in the current commit-graph.c::write_commit_graph(), so
> that it can learn a "--write-commit-graph" option that performs much
> better than running "git fetch && git commit-graph write".
> 
> Perhaps that would require too much refactoring of both index-pack
> and commit-graph code and infeasible in short to medium term and
> that is why I said "for longer term, it feels a bit unfortunate".

I think the basic metadata should be easy. We have each commit expanded
in memory at some point, so parsing it and filing away its parents,
trees, etc isn't too hard.

Generation numbers are a little trickier, though, because they imply an
actual topological traversal. It might actually be easier to couple this
with the connectivity check we do after index-pack finishes (though I've
often wondered if we could drop that check in favor of making index-pack
smarter about finding the boundaries).

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

> Generation numbers are a little trickier, though, because they imply an
> actual topological traversal. It might actually be easier to couple this
> with the connectivity check we do after index-pack finishes (though I've
> often wondered if we could drop that check in favor of making index-pack
> smarter about finding the boundaries).

Currently after scanning the incoming objects with the fsck
machinery, we count the number of objects that are pointed at by
these objects in the pack and are not in the pack in the
builtin/index-pack.c::check_object() function; at that point, we no
longer know who points at the object in question, which is needed if
we want to compute the boundary, so we need a bit of work here.

With boundary information, we could be smarter about lazy fetching,
I presume?

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 3, 2019

This branch is now known as ds/commit-graph-on-fetch.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 3, 2019

This patch series was integrated into pu via git@de1e625.

@gitgitgadget gitgitgadget bot added the pu label Sep 3, 2019
@@ -17,6 +17,14 @@ which can improve `git push` performance in repos with many files.
+
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Sep 02, 2019 at 07:22:02PM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The commit-graph feature is now on by default, and is being
> written during 'git gc' by default. Typically, Git only writes
> a commit-graph when a 'git gc --auto' command passes the gc.auto
> setting to actualy do work. This means that a commit-graph will
> typically fall behind the commits that are being used every day.
> 
> To stay updated with the latest commits, add a step to 'git
> fetch' to write a commit-graph after fetching new objects. The
> fetch.writeCommitGraph config setting enables writing a split
> commit-graph, so on average the cost of writing this file is
> very small. Occasionally, the commit-graph chain will collapse
> to a single level, and this could be slow for very large repos.
> 
> For additional use, adjust the default to be true when
> feature.experimental is enabled.

Seems like a good time to do it. We'd eventually want a similar option
for updating it on the receiving side of a push, too. I don't insist
that come at the same time, but we should at least have a plan about how
it will look to the user.

Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph,
and then a master transfer.writeCommitGraph?

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph,
> and then a master transfer.writeCommitGraph?

If anything, it may be good for consistency.

I am not sure if it is a good idea to trigger writing the commit
graph when accepting a push, though.  It tends to be a lot finer
grained than fetching, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/5/2019 4:37 PM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
>> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph,
>> and then a master transfer.writeCommitGraph?
> 
> If anything, it may be good for consistency.
> 
> I am not sure if it is a good idea to trigger writing the commit
> graph when accepting a push, though.  It tends to be a lot finer
> grained than fetching, right?

And I expect a push to include many fewer commits than a fetch.
In a server environment, I would expect to have a separate
maintenance task responsible for updating the commit-graph after
receiving new data, but not in an in-line fashion with the push.

Think about the situation of many pushes that happen in a short
burst: one write after all of the pushes would have close to the
same performance benefits as writing with every push, but does
a lot less work.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 01:00:40PM -0400, Derrick Stolee wrote:

> On 9/5/2019 4:37 PM, Junio C Hamano wrote:
> > Jeff King <peff@peff.net> writes:
> > 
> >> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph,
> >> and then a master transfer.writeCommitGraph?
> > 
> > If anything, it may be good for consistency.
> > 
> > I am not sure if it is a good idea to trigger writing the commit
> > graph when accepting a push, though.  It tends to be a lot finer
> > grained than fetching, right?
> 
> And I expect a push to include many fewer commits than a fetch.
> In a server environment, I would expect to have a separate
> maintenance task responsible for updating the commit-graph after
> receiving new data, but not in an in-line fashion with the push.

That's probably how we'll end up doing it at GitHub, because we run a
big server farm that has a job-queueing system for periodic maintenance,
etc.

But keep in mind the "little guy" who is hosting a few repositories for
themselves or their company. They'd presumably want the benefits here,
too, right? We already have options to trigger auto-gc and
update-server-info for them, so why not this maintenance, too?

> Think about the situation of many pushes that happen in a short
> burst: one write after all of the pushes would have close to the
> same performance benefits as writing with every push, but does
> a lot less work.

Sure, but wouldn't that similarly apply to fetching? What is it that
makes bursts of pushes more likely than bursts of fetches?

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

> Sure, but wouldn't that similarly apply to fetching? What is it that
> makes bursts of pushes more likely than bursts of fetches?

Because people tend to use a repository as a gathering point?  You
may periodically fetch from and push to a repository, and you may
even do so at the same interval, but simply because there are more
"other" people than you alone as a single developer in the project,
your fetch tends to grab work from more people than yoru push that
publish your work alone?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

> I suppose so. But I think the "stock git without any other job
> infrastructure" case would still benefit.

Oh, no question about it.

I did question if an automatic writing would benefit the side that
receives a push when you brought up the usual "fetch.* and receive.*
for separate configuration, transfer.* when want to rule them both"
convention, which is a good idea if only for consistency, but the
question was if it helps the receiving end of a push to the same
degree as it would help the repository that fetches.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/6/2019 4:42 PM, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
>> I suppose so. But I think the "stock git without any other job
>> infrastructure" case would still benefit.
> 
> Oh, no question about it.
> 
> I did question if an automatic writing would benefit the side that
> receives a push when you brought up the usual "fetch.* and receive.*
> for separate configuration, transfer.* when want to rule them both"
> convention, which is a good idea if only for consistency, but the
> question was if it helps the receiving end of a push to the same
> degree as it would help the repository that fetches.

I think the "stock git without any other job infrastructure" is
a very important scenario. Putting the simplest version of
"commit-graph writes in-line with every push" seems to be ripe
for failure under load. I'd rather think deeply about what is
best for this scenario.

Spit-balling: what if we used the same mechanism as running GC
in the background to launch 'git commit-graph write' commands?
And we could have the command give up (without failure) if the
commit-graph.lock file is already created, so concurrent pushes
would not fight each other.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

> On 9/6/2019 4:42 PM, Junio C Hamano wrote:
>> Jeff King <peff@peff.net> writes:
>> 
>>> I suppose so. But I think the "stock git without any other job
>>> infrastructure" case would still benefit.
>> 
>> Oh, no question about it.
>> 
>> I did question if an automatic writing would benefit the side that
>> receives a push when you brought up the usual "fetch.* and receive.*
>> for separate configuration, transfer.* when want to rule them both"
>> convention, which is a good idea if only for consistency, but the
>> question was if it helps the receiving end of a push to the same
>> degree as it would help the repository that fetches.
>
> I think the "stock git without any other job infrastructure" is
> a very important scenario. Putting the simplest version of
> "commit-graph writes in-line with every push" seems to be ripe
> for failure under load. I'd rather think deeply about what is
> best for this scenario.

As to what to do on the push side, I suppose we can afford to let it
simmer in the back of our heads while moving this topic forward.
Whether we'd later decide to have receive.writeCommitGraph (in which
case we would add transfer.writeCommitGraph, too) or not, this
change on the fetch side can independently be used, right?

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 05:04:17PM -0400, Derrick Stolee wrote:

> On 9/6/2019 4:42 PM, Junio C Hamano wrote:
> > Jeff King <peff@peff.net> writes:
> > 
> >> I suppose so. But I think the "stock git without any other job
> >> infrastructure" case would still benefit.
> > 
> > Oh, no question about it.
> > 
> > I did question if an automatic writing would benefit the side that
> > receives a push when you brought up the usual "fetch.* and receive.*
> > for separate configuration, transfer.* when want to rule them both"
> > convention, which is a good idea if only for consistency, but the
> > question was if it helps the receiving end of a push to the same
> > degree as it would help the repository that fetches.
> 
> I think the "stock git without any other job infrastructure" is
> a very important scenario. Putting the simplest version of
> "commit-graph writes in-line with every push" seems to be ripe
> for failure under load. I'd rather think deeply about what is
> best for this scenario.

If it's going to be a problem under load, that makes me worry about the
same thing for fetches. Whether you see a lot of either depends on your
workflow. But as long as neither option is enabled by default (or at
least if it becomes common knowledge to _disable_ them if you have a
high rate), it may be OK.

I do agree that the normal "busy" repos you and I have both seen in
enterprise settings (where people are literally pushing multiple times
per second at peak) would want some kind of throttling. But I think
there's a long tail of "push once a week" repos.

> Spit-balling: what if we used the same mechanism as running GC
> in the background to launch 'git commit-graph write' commands?
> And we could have the command give up (without failure) if the
> commit-graph.lock file is already created, so concurrent pushes
> would not fight each other.

I have mixed feelings. It's nice not to stand in the critical path of
the user, but background tasks have a way of finding funny corner cases
(e.g., packfiles moving around, or the issues we've had with deciding
when to shut down auto-gc for a grace period due to warnings/errors).

I think since this is generally additive (adding new commit-graph
files), it might be less finicky, though.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 02:57:55PM -0700, Junio C Hamano wrote:

> > I think the "stock git without any other job infrastructure" is
> > a very important scenario. Putting the simplest version of
> > "commit-graph writes in-line with every push" seems to be ripe
> > for failure under load. I'd rather think deeply about what is
> > best for this scenario.
> 
> As to what to do on the push side, I suppose we can afford to let it
> simmer in the back of our heads while moving this topic forward.
> Whether we'd later decide to have receive.writeCommitGraph (in which
> case we would add transfer.writeCommitGraph, too) or not, this
> change on the fetch side can independently be used, right?

Yeah, I'm OK with proceeding without the receive-pack side for now.

-Peff

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 5, 2019

This patch series was integrated into pu via git@e56c626.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 6, 2019

This patch series was integrated into pu via git@38c5b72.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 6, 2019

This patch series was integrated into pu via git@f556b99.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 7, 2019

This patch series was integrated into pu via git@c9c5025.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 9, 2019

This patch series was integrated into pu via git@f6803fe.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 10, 2019

This patch series was integrated into pu via git@54b8038.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant