Write commit-graph on fetch #328

derrickstolee · 2019-09-03T00:43:11Z

Instead of waiting for a non-trivial GC command to write the commit-graph, write one during 'git fetch'. By using the equivalent method for 'git commit-graph write --split --reachable', we create a commit-graph chain that includes all reachable commits. Most of the time, these writes only add the newly-downloaded commits in a small file at the top of the commit-graph chain. Commits that are no longer reachable still exist in the commit-graph, but will be cleaned up by a later GC command that forces the commit-graph to be rewritten completely.

A version of this patch [1] used to be a part of ds/commit-graph-incremental, but was removed to focus on the incremental commit-graph file format. Now that the incremental file format has been shipped in v2.23.0 and some config things have adjusted in ds/feature-macros, I'm reintroducing the idea.

Ævar had mentioned wanting to do something with "incremental maintenance during GC" [2]. I haven't seen any patches towards that aim (please point me in that direction if they have been submitted). I still think it is worth allowing a write at fetch time, as some users have GC disabled. I know for sure that users who only interact with their Git repos via Visual Studio Team Explorer have all Git commands running with GC disabled, and likely other desktop GUI clients have it disabled to avoid blocking processes.

Aside: VFS for Git users have GC disabled, but the commit-graph is being written in the background by a monitoring process. We shipped the incremental commit-graph writes in a recent version and reduced our writes from ~60 seconds each to less than a second on average. Very rarely, the layers of the commit-graph chain collapse and return to the old values. This feature has been performing well with no known issues.

Thanks,
-Stolee

[1] https://public-inbox.org/git/3c52385e5696887c40cab4a6b9b7923d60a0567c.1557330827.git.gitgitgadget@gmail.com/

[2] https://public-inbox.org/git/b1de6af2-c015-098e-a656-e1b68056e037@gmail.com/

Cc: peff@peff.net, avarab@gmail.com, garimasigit@gmail.com

The commit-graph feature is now on by default, and is being written during 'git gc' by default. Typically, Git only writes a commit-graph when a 'git gc --auto' command passes the gc.auto setting to actualy do work. This means that a commit-graph will typically fall behind the commits that are being used every day. To stay updated with the latest commits, add a step to 'git fetch' to write a commit-graph after fetching new objects. The fetch.writeCommitGraph config setting enables writing a split commit-graph, so on average the cost of writing this file is very small. Occasionally, the commit-graph chain will collapse to a single level, and this could be slow for very large repos. For additional use, adjust the default to be true when feature.experimental is enabled. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

derrickstolee · 2019-09-03T02:21:13Z

/submit

gitgitgadget · 2019-09-03T02:22:13Z

Submitted as pull.328.git.gitgitgadget@gmail.com

gitgitgadget · 2019-09-03T19:08:14Z

Documentation/config/feature.txt

@@ -17,6 +17,14 @@ which can improve `git push` performance in repos with many files.
 +


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > diff --git a/builtin/fetch.c b/builtin/fetch.c > index 53ce99d2bb..d36a403859 100644 > --- a/builtin/fetch.c > +++ b/builtin/fetch.c > @@ -23,6 +23,7 @@ > #include "packfile.h" > #include "list-objects-filter-options.h" > #include "commit-reach.h" > +#include "commit-graph.h" > > #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000) > > @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix) > > string_list_clear(&list, 0); > > + prepare_repo_settings(the_repository); > + if (the_repository->settings.fetch_write_commit_graph) { > + int commit_graph_flags = COMMIT_GRAPH_SPLIT; > + struct split_commit_graph_opts split_opts; > + memset(&split_opts, 0, sizeof(struct split_commit_graph_opts)); > + > + if (progress) > + commit_graph_flags |= COMMIT_GRAPH_PROGRESS; > + > + write_commit_graph_reachable(get_object_directory(), > + commit_graph_flags, > + &split_opts); > + } As a low-impact change this is good. For longer term, it feels a bit unfortunate that this is still a separate phase of the program, though. We know what new refs we added, we know what new objects we received, and we even scanned each and every one of them while running the index-pack step to store the .pack and compute the .idx file, i.e. it feels that we have most of the information already in-core to extend the commit graph for new parts of the history we just received. Thanks.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/3/2019 3:05 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> diff --git a/builtin/fetch.c b/builtin/fetch.c >> index 53ce99d2bb..d36a403859 100644 >> --- a/builtin/fetch.c >> +++ b/builtin/fetch.c >> @@ -23,6 +23,7 @@ >> #include "packfile.h" >> #include "list-objects-filter-options.h" >> #include "commit-reach.h" >> +#include "commit-graph.h" >> >> #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000) >> >> @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix) >> >> string_list_clear(&list, 0); >> >> + prepare_repo_settings(the_repository); >> + if (the_repository->settings.fetch_write_commit_graph) { >> + int commit_graph_flags = COMMIT_GRAPH_SPLIT; >> + struct split_commit_graph_opts split_opts; >> + memset(&split_opts, 0, sizeof(struct split_commit_graph_opts)); >> + >> + if (progress) >> + commit_graph_flags |= COMMIT_GRAPH_PROGRESS; >> + >> + write_commit_graph_reachable(get_object_directory(), >> + commit_graph_flags, >> + &split_opts); >> + } > > As a low-impact change this is good. > > For longer term, it feels a bit unfortunate that this is still a > separate phase of the program, though. We know what new refs we > added, we know what new objects we received, and we even scanned > each and every one of them while running the index-pack step to > store the .pack and compute the .idx file, i.e. it feels that we > have most of the information already in-core to extend the commit > graph for new parts of the history we just received. You're right that we could isolate the new write to the refs we just received. We could use the more cumbersome write_commit_graph() method with a list of commit oids as starting points. I'm happy to make that change if we see a lot of value there. However, the current patch also gives us a way to add local refs to the commit graph and "catch up" to the work the user had done. With this in mind, I do think the simpler code has another functional advantage. Thanks, -Stolee

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes: >>> + write_commit_graph_reachable(get_object_directory(), >>> + commit_graph_flags, >>> + &split_opts); >>> + } >> >> As a low-impact change this is good. >> >> For longer term, it feels a bit unfortunate that this is still a >> separate phase of the program, though. We know what new refs we >> added, we know what new objects we received, and we even scanned >> each and every one of them while running the index-pack step to >> store the .pack and compute the .idx file, i.e. it feels that we >> have most of the information already in-core to extend the commit >> graph for new parts of the history we just received. > > You're right that we could isolate the new write to the refs we > just received. We could use the more cumbersome write_commit_graph() > method with a list of commit oids as starting points. I'm happy to > make that change if we see a lot of value there. Well, that is not the kind of information reuse I am talking about. I was wondering if "index-pack" has enough information in-core after it receives and processes the incoming pack data, scanned each and every object in it, in order to write out the commit graph _without_ having to do a lot of duplicate computation and enumeration of the objects done in the current commit-graph.c::write_commit_graph(), so that it can learn a "--write-commit-graph" option that performs much better than running "git fetch && git commit-graph write". Perhaps that would require too much refactoring of both index-pack and commit-graph code and infeasible in short to medium term and that is why I said "for longer term, it feels a bit unfortunate". Thanks.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 02:46:22PM -0700, Junio C Hamano wrote: > > You're right that we could isolate the new write to the refs we > > just received. We could use the more cumbersome write_commit_graph() > > method with a list of commit oids as starting points. I'm happy to > > make that change if we see a lot of value there. > > Well, that is not the kind of information reuse I am talking about. > > I was wondering if "index-pack" has enough information in-core after > it receives and processes the incoming pack data, scanned each and > every object in it, in order to write out the commit graph _without_ > having to do a lot of duplicate computation and enumeration of the > objects done in the current commit-graph.c::write_commit_graph(), so > that it can learn a "--write-commit-graph" option that performs much > better than running "git fetch && git commit-graph write". > > Perhaps that would require too much refactoring of both index-pack > and commit-graph code and infeasible in short to medium term and > that is why I said "for longer term, it feels a bit unfortunate". I think the basic metadata should be easy. We have each commit expanded in memory at some point, so parsing it and filing away its parents, trees, etc isn't too hard. Generation numbers are a little trickier, though, because they imply an actual topological traversal. It might actually be easier to couple this with the connectivity check we do after index-pack finishes (though I've often wondered if we could drop that check in favor of making index-pack smarter about finding the boundaries). -Peff

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes: > Generation numbers are a little trickier, though, because they imply an > actual topological traversal. It might actually be easier to couple this > with the connectivity check we do after index-pack finishes (though I've > often wondered if we could drop that check in favor of making index-pack > smarter about finding the boundaries). Currently after scanning the incoming objects with the fsck machinery, we count the number of objects that are pointed at by these objects in the pack and are not in the pack in the builtin/index-pack.c::check_object() function; at that point, we no longer know who points at the object in question, which is needed if we want to compute the boundary, so we need a bit of work here. With boundary information, we could be smarter about lazy fetching, I presume?

gitgitgadget · 2019-09-03T23:49:38Z

This branch is now known as ds/commit-graph-on-fetch.

gitgitgadget · 2019-09-03T23:49:39Z

This patch series was integrated into pu via git@de1e625.

gitgitgadget · 2019-09-04T03:11:33Z

Documentation/config/feature.txt

@@ -17,6 +17,14 @@ which can improve `git push` performance in repos with many files.
 +


On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Sep 02, 2019 at 07:22:02PM -0700, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <dstolee@microsoft.com> > > The commit-graph feature is now on by default, and is being > written during 'git gc' by default. Typically, Git only writes > a commit-graph when a 'git gc --auto' command passes the gc.auto > setting to actualy do work. This means that a commit-graph will > typically fall behind the commits that are being used every day. > > To stay updated with the latest commits, add a step to 'git > fetch' to write a commit-graph after fetching new objects. The > fetch.writeCommitGraph config setting enables writing a split > commit-graph, so on average the cost of writing this file is > very small. Occasionally, the commit-graph chain will collapse > to a single level, and this could be slow for very large repos. > > For additional use, adjust the default to be true when > feature.experimental is enabled. Seems like a good time to do it. We'd eventually want a similar option for updating it on the receiving side of a push, too. I don't insist that come at the same time, but we should at least have a plan about how it will look to the user. Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph, and then a master transfer.writeCommitGraph? -Peff

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes: > Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph, > and then a master transfer.writeCommitGraph? If anything, it may be good for consistency. I am not sure if it is a good idea to trigger writing the commit graph when accepting a push, though. It tends to be a lot finer grained than fetching, right?

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/5/2019 4:37 PM, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph, >> and then a master transfer.writeCommitGraph? > > If anything, it may be good for consistency. > > I am not sure if it is a good idea to trigger writing the commit > graph when accepting a push, though. It tends to be a lot finer > grained than fetching, right? And I expect a push to include many fewer commits than a fetch. In a server environment, I would expect to have a separate maintenance task responsible for updating the commit-graph after receiving new data, but not in an in-line fashion with the push. Think about the situation of many pushes that happen in a short burst: one write after all of the pushes would have close to the same performance benefits as writing with every push, but does a lot less work. Thanks, -Stolee

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 01:00:40PM -0400, Derrick Stolee wrote: > On 9/5/2019 4:37 PM, Junio C Hamano wrote: > > Jeff King <peff@peff.net> writes: > > > >> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph, > >> and then a master transfer.writeCommitGraph? > > > > If anything, it may be good for consistency. > > > > I am not sure if it is a good idea to trigger writing the commit > > graph when accepting a push, though. It tends to be a lot finer > > grained than fetching, right? > > And I expect a push to include many fewer commits than a fetch. > In a server environment, I would expect to have a separate > maintenance task responsible for updating the commit-graph after > receiving new data, but not in an in-line fashion with the push. That's probably how we'll end up doing it at GitHub, because we run a big server farm that has a job-queueing system for periodic maintenance, etc. But keep in mind the "little guy" who is hosting a few repositories for themselves or their company. They'd presumably want the benefits here, too, right? We already have options to trigger auto-gc and update-server-info for them, so why not this maintenance, too? > Think about the situation of many pushes that happen in a short > burst: one write after all of the pushes would have close to the > same performance benefits as writing with every push, but does > a lot less work. Sure, but wouldn't that similarly apply to fetching? What is it that makes bursts of pushes more likely than bursts of fetches? -Peff

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes: > Sure, but wouldn't that similarly apply to fetching? What is it that > makes bursts of pushes more likely than bursts of fetches? Because people tend to use a repository as a gathering point? You may periodically fetch from and push to a repository, and you may even do so at the same interval, but simply because there are more "other" people than you alone as a single developer in the project, your fetch tends to grab work from more people than yoru push that publish your work alone?

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes: > I suppose so. But I think the "stock git without any other job > infrastructure" case would still benefit. Oh, no question about it. I did question if an automatic writing would benefit the side that receives a push when you brought up the usual "fetch.* and receive.* for separate configuration, transfer.* when want to rule them both" convention, which is a good idea if only for consistency, but the question was if it helps the receiving end of a push to the same degree as it would help the repository that fetches.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/6/2019 4:42 PM, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > >> I suppose so. But I think the "stock git without any other job >> infrastructure" case would still benefit. > > Oh, no question about it. > > I did question if an automatic writing would benefit the side that > receives a push when you brought up the usual "fetch.* and receive.* > for separate configuration, transfer.* when want to rule them both" > convention, which is a good idea if only for consistency, but the > question was if it helps the receiving end of a push to the same > degree as it would help the repository that fetches. I think the "stock git without any other job infrastructure" is a very important scenario. Putting the simplest version of "commit-graph writes in-line with every push" seems to be ripe for failure under load. I'd rather think deeply about what is best for this scenario. Spit-balling: what if we used the same mechanism as running GC in the background to launch 'git commit-graph write' commands? And we could have the command give up (without failure) if the commit-graph.lock file is already created, so concurrent pushes would not fight each other. Thanks, -Stolee

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes: > On 9/6/2019 4:42 PM, Junio C Hamano wrote: >> Jeff King <peff@peff.net> writes: >> >>> I suppose so. But I think the "stock git without any other job >>> infrastructure" case would still benefit. >> >> Oh, no question about it. >> >> I did question if an automatic writing would benefit the side that >> receives a push when you brought up the usual "fetch.* and receive.* >> for separate configuration, transfer.* when want to rule them both" >> convention, which is a good idea if only for consistency, but the >> question was if it helps the receiving end of a push to the same >> degree as it would help the repository that fetches. > > I think the "stock git without any other job infrastructure" is > a very important scenario. Putting the simplest version of > "commit-graph writes in-line with every push" seems to be ripe > for failure under load. I'd rather think deeply about what is > best for this scenario. As to what to do on the push side, I suppose we can afford to let it simmer in the back of our heads while moving this topic forward. Whether we'd later decide to have receive.writeCommitGraph (in which case we would add transfer.writeCommitGraph, too) or not, this change on the fetch side can independently be used, right? Thanks.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 05:04:17PM -0400, Derrick Stolee wrote: > On 9/6/2019 4:42 PM, Junio C Hamano wrote: > > Jeff King <peff@peff.net> writes: > > > >> I suppose so. But I think the "stock git without any other job > >> infrastructure" case would still benefit. > > > > Oh, no question about it. > > > > I did question if an automatic writing would benefit the side that > > receives a push when you brought up the usual "fetch.* and receive.* > > for separate configuration, transfer.* when want to rule them both" > > convention, which is a good idea if only for consistency, but the > > question was if it helps the receiving end of a push to the same > > degree as it would help the repository that fetches. > > I think the "stock git without any other job infrastructure" is > a very important scenario. Putting the simplest version of > "commit-graph writes in-line with every push" seems to be ripe > for failure under load. I'd rather think deeply about what is > best for this scenario. If it's going to be a problem under load, that makes me worry about the same thing for fetches. Whether you see a lot of either depends on your workflow. But as long as neither option is enabled by default (or at least if it becomes common knowledge to _disable_ them if you have a high rate), it may be OK. I do agree that the normal "busy" repos you and I have both seen in enterprise settings (where people are literally pushing multiple times per second at peak) would want some kind of throttling. But I think there's a long tail of "push once a week" repos. > Spit-balling: what if we used the same mechanism as running GC > in the background to launch 'git commit-graph write' commands? > And we could have the command give up (without failure) if the > commit-graph.lock file is already created, so concurrent pushes > would not fight each other. I have mixed feelings. It's nice not to stand in the critical path of the user, but background tasks have a way of finding funny corner cases (e.g., packfiles moving around, or the issues we've had with deciding when to shut down auto-gc for a grace period due to warnings/errors). I think since this is generally additive (adding new commit-graph files), it might be less finicky, though. -Peff

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 02:57:55PM -0700, Junio C Hamano wrote: > > I think the "stock git without any other job infrastructure" is > > a very important scenario. Putting the simplest version of > > "commit-graph writes in-line with every push" seems to be ripe > > for failure under load. I'd rather think deeply about what is > > best for this scenario. > > As to what to do on the push side, I suppose we can afford to let it > simmer in the back of our heads while moving this topic forward. > Whether we'd later decide to have receive.writeCommitGraph (in which > case we would add transfer.writeCommitGraph, too) or not, this > change on the fetch side can independently be used, right? Yeah, I'm OK with proceeding without the receive-pack side for now. -Peff

gitgitgadget · 2019-09-05T22:30:09Z

This patch series was integrated into pu via git@e56c626.

gitgitgadget · 2019-09-06T19:46:18Z

This patch series was integrated into pu via git@38c5b72.

gitgitgadget · 2019-09-06T21:26:12Z

This patch series was integrated into pu via git@f556b99.

gitgitgadget · 2019-09-07T17:22:08Z

This patch series was integrated into pu via git@c9c5025.

gitgitgadget · 2019-09-09T20:44:33Z

This patch series was integrated into pu via git@f6803fe.

gitgitgadget · 2019-09-10T00:30:33Z

This patch series was integrated into pu via git@54b8038.

derrickstolee force-pushed the fetch-write-commit-graph branch from a24b773 to 49f877c Compare September 3, 2019 01:07

gitgitgadget bot reviewed Sep 3, 2019

View reviewed changes

gitgitgadget bot added the pu label Sep 3, 2019

gitgitgadget bot reviewed Sep 4, 2019

View reviewed changes

weekly-digest bot mentioned this pull request Sep 8, 2019

Weekly Digest (1 September, 2019 - 8 September, 2019) #336

Closed

derrickstolee closed this Sep 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write commit-graph on fetch #328

Write commit-graph on fetch #328

derrickstolee commented Sep 3, 2019 •

edited

Loading

derrickstolee commented Sep 3, 2019

gitgitgadget bot commented Sep 3, 2019

gitgitgadget bot Sep 3, 2019

gitgitgadget bot Sep 3, 2019

gitgitgadget bot Sep 6, 2019

gitgitgadget bot Sep 7, 2019

gitgitgadget bot Sep 9, 2019

gitgitgadget bot commented Sep 3, 2019

gitgitgadget bot commented Sep 3, 2019

gitgitgadget bot Sep 4, 2019

gitgitgadget bot Sep 5, 2019

gitgitgadget bot Sep 6, 2019

gitgitgadget bot Sep 6, 2019

gitgitgadget bot Sep 6, 2019

gitgitgadget bot Sep 6, 2019

gitgitgadget bot Sep 6, 2019

gitgitgadget bot Sep 6, 2019

gitgitgadget bot Sep 7, 2019

gitgitgadget bot Sep 7, 2019

gitgitgadget bot commented Sep 5, 2019

gitgitgadget bot commented Sep 6, 2019

gitgitgadget bot commented Sep 6, 2019

gitgitgadget bot commented Sep 7, 2019

gitgitgadget bot commented Sep 9, 2019

gitgitgadget bot commented Sep 10, 2019

		@@ -17,6 +17,14 @@ which can improve `git push` performance in repos with many files.
		+

Write commit-graph on fetch #328

Write commit-graph on fetch #328

Conversation

derrickstolee commented Sep 3, 2019 • edited Loading

derrickstolee commented Sep 3, 2019

gitgitgadget bot commented Sep 3, 2019

gitgitgadget bot Sep 3, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 3, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 7, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 9, 2019

Choose a reason for hiding this comment

gitgitgadget bot commented Sep 3, 2019

gitgitgadget bot commented Sep 3, 2019

gitgitgadget bot Sep 4, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 5, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 7, 2019

Choose a reason for hiding this comment

gitgitgadget bot Sep 7, 2019

Choose a reason for hiding this comment

gitgitgadget bot commented Sep 5, 2019

gitgitgadget bot commented Sep 6, 2019

gitgitgadget bot commented Sep 6, 2019

gitgitgadget bot commented Sep 7, 2019

gitgitgadget bot commented Sep 9, 2019

gitgitgadget bot commented Sep 10, 2019

derrickstolee commented Sep 3, 2019 •

edited

Loading