Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maintenance: add pack-refs task #871

Closed

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Feb 8, 2021

This patch series adds a new pack-refs task to the maintenance builtin. This operation already happens within git gc (and hence the gc task) but it is easy to extract. Packing refs does not delete any data, only collects loose objects into a combined file. This makes things faster in subtle ways, especially when a command needs to iterate through refs (especially tags).

Credit for inspiring this goes to Suolong, who asked for this to be added to Scalar [1]. I've been waiting instead to add it directly to Git and its background maintenance. Now is the time!

[1] microsoft/scalar#382

I chose to add it to the incremental maintenance strategy at a weekly cadence. I'm not sure there is significant value to the difference between weekly and daily. It just seems to me that weekly is often enough. Feel free to correct me if you have a different opinion.

My hope is that this patch series could be used as an example for further extracting tasks out of the gc task and making them be full maintenance tasks. Doing more of these extractions could be a good project for a new contributor.

One thing that is not implemented in this series is a notion of the behavior for the pack-refs task during git maintenance run --auto. This could be added in the future, but I wanted to focus on getting this behavior into the incremental maintenance schedule.

Updates in V2

  • Fixed doc typo. Thanks, Eric!
  • Updated commit messages to make it clear that the 'pack-refs' step will still happen within the 'gc' task.
  • Updated the test to check that we run the correct subcommand.
  • maintenance_task_pack_refs() uses MAYBE_UNUSED on its parameter.

Thanks,
-Stolee

Cc: gitster@pobox.com
Cc: sluongng@gmail.com
Cc: martin.agren@gmail.com
Cc: sunshine@sunshineco.com

cc: Taylor Blau me@ttaylorr.com
cc: Eric Sunshine sunshine@sunshineco.com
cc: Derrick Stolee stolee@gmail.com

@derrickstolee derrickstolee changed the title Maintenance/pack refs Maintenance: add pack-refs task Feb 8, 2021
@derrickstolee derrickstolee self-assigned this Feb 8, 2021
@derrickstolee derrickstolee force-pushed the maintenance/pack-refs branch 2 times, most recently from 316f455 to 8012d2d Compare February 8, 2021 14:02
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 8, 2021

Submitted as pull.871.git.1612795943.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-871/derrickstolee/maintenance/pack-refs-v1

To fetch this version to local tag pr-871/derrickstolee/maintenance/pack-refs-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-871/derrickstolee/maintenance/pack-refs-v1

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 8, 2021

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Feb 08, 2021 at 02:52:21PM +0000, Derrick Stolee via GitGitGadget wrote:
> I chose to add it to the incremental maintenance strategy at a weekly
> cadence. I'm not sure there is significant value to the difference between
> weekly and daily. It just seems to me that weekly is often enough. Feel free
> to correct me if you have a different opinion.

I don't have any strong opinion about this. At GitHub we run pack-refs
as part of a maintenance-like job, which varies in frequency between
different repositories based on a number of factors. In our busiest
repository networks, this is almost certainly at least a few times a
day, if not "as often as possible."

I'd suspect that for any laptop-sized repository, anywhere between daily
and weekly is just fine.

I suppose weekly is a good starting point, since we can always change
the default to run more often if enough users complain.

> My hope is that this patch series could be used as an example for further
> extracting tasks out of the gc task and making them be full maintenance
> tasks. Doing more of these extractions could be a good project for a new
> contributor.

That's great :-).

Thanks,
Taylor

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 8, 2021

User Taylor Blau <me@ttaylorr.com> has been added to the cc: list.

@@ -145,6 +145,12 @@ incremental-repack::
which is a special case that attempts to repack all pack-files
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Feb 08, 2021 at 02:52:22PM +0000, Derrick Stolee via GitGitGadget wrote:
> +pack-refs::
> +	The `pack-refs` task collects the loose reference files and
> +	collects them into a single file. This speeds up operations that
> +	need to iterate across many refereences. See linkgit:git-pack-refs[1]
> +	for more information.
> +

Do you think it's worth documenting here or in git-gc(1) that running
this has the effect of disabling the same step during gc?

>  OPTIONS
>  -------
>  --auto::
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 4c40594d660e..8f13c3bb8607 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -54,7 +54,6 @@ static const char *prune_worktrees_expire = "3.months.ago";
>  static unsigned long big_pack_threshold;
>  static unsigned long max_delta_cache_size = DEFAULT_DELTA_CACHE_SIZE;
>
> -static struct strvec pack_refs_cmd = STRVEC_INIT;
>  static struct strvec reflog = STRVEC_INIT;
>  static struct strvec repack = STRVEC_INIT;
>  static struct strvec prune = STRVEC_INIT;
> @@ -163,6 +162,15 @@ static void gc_config(void)
>  	git_config(git_default_config, NULL);
>  }
>
> +struct maintenance_run_opts;
> +static int maintenance_task_pack_refs(struct maintenance_run_opts *opts)

It may be worth calling this "unused", since you explicitly pass NULL
below.

> +{
> +	struct strvec pack_refs_cmd = STRVEC_INIT;
> +	strvec_pushl(&pack_refs_cmd, "pack-refs", "--all", "--prune", NULL);
> +
> +	return run_command_v_opt(pack_refs_cmd.v, RUN_GIT_CMD);
> +}
> +
>  static int too_many_loose_objects(void)
>  {
>  	/*
> @@ -518,8 +526,8 @@ static void gc_before_repack(void)
>  	if (done++)
>  		return;
>
> -	if (pack_refs && run_command_v_opt(pack_refs_cmd.v, RUN_GIT_CMD))
> -		die(FAILED_RUN, pack_refs_cmd.v[0]);
> +	if (pack_refs && maintenance_task_pack_refs(NULL))
> +		die(FAILED_RUN, "pack-refs");

Hmm. Am I missing where opting into the maintenance step disables this
in gc? You suggest that it does in the commit message, but I can't seem
to see it here.

> +test_expect_success 'pack-refs task' '
> +	for n in $(test_seq 1 5)
> +	do
> +		git branch -f to-pack/$n HEAD || return 1
> +	done &&
> +	git maintenance run --task=pack-refs &&
> +	ls .git/refs/heads/ >after &&
> +	test_must_be_empty after

Worth testing that $GIT_DIR/packed-refs exists, too?

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/8/2021 5:53 PM, Taylor Blau wrote:
> On Mon, Feb 08, 2021 at 02:52:22PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +pack-refs::
>> +	The `pack-refs` task collects the loose reference files and
>> +	collects them into a single file. This speeds up operations that
>> +	need to iterate across many refereences. See linkgit:git-pack-refs[1]
>> +	for more information.
>> +
> 
> Do you think it's worth documenting here or in git-gc(1) that running
> this has the effect of disabling the same step during gc?

It doesn't disable the step during gc.

Perhaps I should use a better term than "extract". The 'gc' task shouldn't
change as we make some of its steps also available as independent tasks.

Instead, users can select a subset of its steps by enabling them directly.
Other such tasks could include:

 * prune-reflog
 * full-repack (as opposed to the existing incremental-repack task)

>> +struct maintenance_run_opts;
>> +static int maintenance_task_pack_refs(struct maintenance_run_opts *opts)
> 
> It may be worth calling this "unused", since you explicitly pass NULL
> below.

Good idea.

>> +	if (pack_refs && maintenance_task_pack_refs(NULL))
>> +		die(FAILED_RUN, "pack-refs");
> 
> Hmm. Am I missing where opting into the maintenance step disables this
> in gc? You suggest that it does in the commit message, but I can't seem
> to see it here.

My commit message suggests wrong.
 
>> +test_expect_success 'pack-refs task' '
>> +	for n in $(test_seq 1 5)
>> +	do
>> +		git branch -f to-pack/$n HEAD || return 1
>> +	done &&
>> +	git maintenance run --task=pack-refs &&
>> +	ls .git/refs/heads/ >after &&
>> +	test_must_be_empty after
> 
> Worth testing that $GIT_DIR/packed-refs exists, too?

That would start breaking if we change the backing store, right? I
want to be sure this doesn't create test breakages with reftable.

I _could_ add a 'test_subcommand' check here, though.

Thanks,
-Stolee

@sluongng
Copy link

sluongng commented Feb 8, 2021

Nice work Derrick. Could you please correct my name to 'Son Luong'? Thanks!

@@ -145,6 +145,12 @@ incremental-repack::
which is a special case that attempts to repack all pack-files
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Mon, Feb 8, 2021 at 9:52 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> +pack-refs::
> +       The `pack-refs` task collects the loose reference files and
> +       collects them into a single file. This speeds up operations that
> +       need to iterate across many refereences. See linkgit:git-pack-refs[1]
> +       for more information.

s/refereences/references/

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/8/2021 6:06 PM, Eric Sunshine wrote:
> On Mon, Feb 8, 2021 at 9:52 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +pack-refs::
>> +       The `pack-refs` task collects the loose reference files and
>> +       collects them into a single file. This speeds up operations that
>> +       need to iterate across many refereences. See linkgit:git-pack-refs[1]
>> +       for more information.
> 
> s/refereences/references/
 
Thanks!
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 8, 2021

User Eric Sunshine <sunshine@sunshineco.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 8, 2021

This branch is now known as ds/maintenance-pack-refs.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 8, 2021

This patch series was integrated into seen via git@93756a6.

@gitgitgadget gitgitgadget bot added the seen label Feb 8, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Feb 9, 2021

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

It is valuable to collect loose refs into a more compressed form. This
is typically the packed-refs file, although this could be the reftable
in the future. Having packed refs can be extremely valuable in repos
with many tags or remote branches that are not modified by the local
user, but still are necessary for other queries.

For instance, with many exploded refs, commands such as

	git describe --tags --exact-match HEAD

can be very slow (multiple seconds). This command in particular is used
by terminal prompts to show when a detatched HEAD is pointing to an
existing tag, so having it be slow causes significant delays for users.

Add a new 'pack-refs' maintenance task. It runs 'git pack-refs --all
--prune' to move loose refs into a packed form. For now, that is the
packed-refs file, but could adjust to other file formats in the future.

This is the first of several sub-tasks of the 'gc' task that could be
extracted to their own tasks. In this process, we should not change the
behavior of the 'gc' task since that remains the default way to keep
repositories maintained. Creating a new task for one of these sub-tasks
only provides more customization options for those choosing to not use
the 'gc' task. It is certainly possible to have both the 'gc' and
'pack-refs' tasks enabled and run regularly. While they may repeat
effort, they do not conflict in a destructive way.

The 'auto_condition' function pointer is left NULL for now. We could
extend this in the future to have a condition check if pack-refs should
be run during 'git maintenance run --auto'.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When the 'maintenance.strategy' config option is set to 'incremental',
a default maintenance schedule is enabled. Add the 'pack-refs' task to
that strategy at the weekly cadence.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 9, 2021

Submitted as pull.871.v2.git.1612878149.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-871/derrickstolee/maintenance/pack-refs-v2

To fetch this version to local tag pr-871/derrickstolee/maintenance/pack-refs-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-871/derrickstolee/maintenance/pack-refs-v2

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 9, 2021

This patch series was integrated into seen via git@e371173.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 10, 2021

This patch series was integrated into seen via git@7b2b219.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 10, 2021

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Feb 09, 2021 at 01:42:27PM +0000, Derrick Stolee via GitGitGadget wrote:
> Updates in V2
> =============
>
>  * Fixed doc typo. Thanks, Eric!
>  * Updated commit messages to make it clear that the 'pack-refs' step will
>    still happen within the 'gc' task.
>  * Updated the test to check that we run the correct subcommand.
>  * maintenance_task_pack_refs() uses MAYBE_UNUSED on its parameter.

Thanks, the range-diff looked sane to me.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 10, 2021

This patch series was integrated into seen via git@97d06dd.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 10, 2021

This patch series was integrated into seen via git@befa7f8.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 11, 2021

This patch series was integrated into seen via git@79d2bc6.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 11, 2021

This patch series was integrated into seen via git@d28aa44.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 11, 2021

This patch series was integrated into next via git@36f56bc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants