Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maintenance builtin, allowing 'gc --auto' customization #671

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
62986fd
maintenance: create basic maintenance runner
derrickstolee Jun 26, 2020
51b78c7
maintenance: add --quiet option
derrickstolee Jun 26, 2020
621471d
maintenance: replace run_auto_gc()
derrickstolee Jun 26, 2020
1dbbfd2
maintenance: initialize task array
derrickstolee Jun 26, 2020
ea80307
maintenance: add commit-graph task
derrickstolee Jun 26, 2020
271bc35
maintenance: add --task option
derrickstolee Jul 1, 2020
9cc6fc8
maintenance: take a lock on the objects directory
derrickstolee Jul 1, 2020
d571b1a
maintenance: create maintenance.<task>.enabled config
derrickstolee Jul 2, 2020
cb26dfa
maintenance: use pointers to check --auto
derrickstolee Jul 2, 2020
00697e7
maintenance: add auto condition for commit-graph task
derrickstolee Jul 2, 2020
da495f6
maintenance: add trace2 regions for task execution
derrickstolee Jul 14, 2020
fc68305
fetch: optionally allow disabling FETCH_HEAD update
gitster Jul 28, 2020
8e4513f
maintenance: add prefetch task
derrickstolee Jul 1, 2020
9186e89
maintenance: add loose-objects task
derrickstolee Jul 1, 2020
54cd6e0
maintenance: create auto condition for loose-objects
derrickstolee Jul 2, 2020
7568fcc
midx: enable core.multiPackIndex by default
derrickstolee Jul 30, 2020
ba4c627
midx: use start_delayed_progress()
derrickstolee Jul 5, 2020
eecdf99
maintenance: add incremental-repack task
derrickstolee Jul 2, 2020
bce1308
maintenance: auto-size incremental-repack batch
derrickstolee Jul 2, 2020
a78dbc7
maintenance: add incremental-repack auto condition
derrickstolee Jul 2, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -90,6 +90,7 @@
/git-ls-tree
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Jul 23, 2020 at 05:56:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>

Everything in this patch looks very sensible to me, and I think that
implementing 'git maintenance' in the same file as the current 'git gc'
builtin makes a lot of sense to me.

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Đoàn Trần Công Danh wrote (reply to this):

On 2020-07-23 17:56:23+0000, Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> new file mode 100755
> index 0000000000..d00641c4dd
> --- /dev/null
> +++ b/t/t7900-maintenance.sh
> @@ -0,0 +1,22 @@
> +#!/bin/sh
> +
> +test_description='git maintenance builtin'
> +
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'help text' '
> +	test_must_fail git maintenance -h 2>err &&
> +	test_i18ngrep "usage: git maintenance run" err
> +'

I think this test has been tested better already in t0012,
Anyway, if we would like to check the word "git maintenance run"
explicitly here, it would be clearer to

s/test_must_fail/test_expect_code 129/

> +
> +test_expect_success 'gc [--auto]' '
> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
> +	grep ",\"gc\"]" run-no-auto.txt  &&
> +	grep ",\"gc\",\"--auto\"]" run-auto.txt
> +'
> +
> +test_done
> -- 
> gitgitgadget
> 

-- 
Danh

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Nieder wrote (reply to this):

Hi,

Derrick Stolee wrote:

> [Subject: maintenance: create basic maintenance runner]

Seems sensible, and a good way to set up for the later patches.  Let's
take a look at how it does that.

[...]
> --- /dev/null
> +++ b/Documentation/git-maintenance.txt
> @@ -0,0 +1,57 @@
[...]
> +SUBCOMMANDS
> +-----------
> +
> +run::
> +	Run one or more maintenance tasks.

[jrnieder] How do I supply the tasks on the command line?  Are they
parameters to this subcommand?  If so, it could make sense for this to
say something like

	run <task>...::

What is the exit code convention for "git maintenance run"?  (Many Git
commands don't document this, but since we're making a new command
it seems worth building the habit.)

[...]
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -699,3 +699,62 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
[...]
> +static struct maintenance_opts {
> +	int auto_flag;
> +} opts;

Packing this in a struct feels a bit unusual.  Is the struct going to
be passed somewhere, or could these be individual locals in
cmd_maintenance?

[...]
> +
> +static int maintenance_task_gc(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "gc", NULL);
> +
> +	if (opts.auto_flag)
> +		argv_array_pushl(&cmd, "--auto", NULL);

These are both pushing single strings, so they can use argv_array_push.

[...]
> --- /dev/null
> +++ b/t/t7900-maintenance.sh
> @@ -0,0 +1,22 @@
> +#!/bin/sh
> +
> +test_description='git maintenance builtin'
> +
> +GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0

Why does this disable commit graph and multipack index?  Is that setting
up for something that comes later?

[...]
> +test_expect_success 'gc [--auto]' '
> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
> +	grep ",\"gc\"]" run-no-auto.txt  &&
> +	grep ",\"gc\",\"--auto\"]" run-auto.txt

This feels a little odd in two ways:

- the use of "git gc" isn't a user-facing detail, so this is testing
  implementation instead of the user-facing behavior.  That's okay,
  though --- it can be useful to test internals sometimes.

- the way that this tests for "git gc" feels brittle: if the trace
  emitter changes some day to include a space after the comma, for
  example, then the test would start failing.  I thought that in the
  spirit of fakes, it could make sense to write a custom "git gc"
  command using test_write_script, but that isn't likely to work
  either since gc is a builtin.

Perhaps this is suggesting that we need some central test helper for
parsing traces so we can do this reliably in one place.  What do you
think?

Thanks,
Jonathan

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/29/2020 6:19 PM, Jonathan Nieder wrote:
> Hi,
> 
> Derrick Stolee wrote:
> 
>> [Subject: maintenance: create basic maintenance runner]
> 
> Seems sensible, and a good way to set up for the later patches.  Let's
> take a look at how it does that.
> 
> [...]
>> --- /dev/null
>> +++ b/Documentation/git-maintenance.txt
>> @@ -0,0 +1,57 @@
> [...]
>> +SUBCOMMANDS
>> +-----------
>> +
>> +run::
>> +	Run one or more maintenance tasks.
> 
> [jrnieder] How do I supply the tasks on the command line?  Are they
> parameters to this subcommand?  If so, it could make sense for this to
> say something like
> 
> 	run <task>...::

Hopefully this is documented to your satisfaction when the ability
to customize the tasks is implemented.

> What is the exit code convention for "git maintenance run"?  (Many Git
> commands don't document this, but since we're making a new command
> it seems worth building the habit.)

Is this worth doing? Do we really want every command to document
that "0 means success, everything else means failure, and some of
those exit codes mean a specific kind of failure that is global to
Git"?

> [...]
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -699,3 +699,62 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
> [...]
>> +static struct maintenance_opts {
>> +	int auto_flag;
>> +} opts;
> 
> Packing this in a struct feels a bit unusual.  Is the struct going to
> be passed somewhere, or could these be individual locals in
> cmd_maintenance?

This will grow, and I'd rather have one global struct than many
individual global items. It makes it clearer when I use
"opts.auto_flag" that this corresponds to whether "--auto" was
provided as a command-line option.

> [...]
>> +
>> +static int maintenance_task_gc(void)
>> +{
>> +	int result;
>> +	struct argv_array cmd = ARGV_ARRAY_INIT;
>> +
>> +	argv_array_pushl(&cmd, "gc", NULL);
>> +
>> +	if (opts.auto_flag)
>> +		argv_array_pushl(&cmd, "--auto", NULL);
> 
> These are both pushing single strings, so they can use argv_array_push.

Thanks. I noticed a few of these myself. Luckily, I'll be going
through all of these invocations and replacing them with new
methods soon ;)

[1] https://lore.kernel.org/git/30933a71-3130-5478-cbfd-0ca5bb308cf2@gmail.com/

> [...]
>> --- /dev/null
>> +++ b/t/t7900-maintenance.sh
>> @@ -0,0 +1,22 @@
>> +#!/bin/sh
>> +
>> +test_description='git maintenance builtin'
>> +
>> +GIT_TEST_COMMIT_GRAPH=0
>> +GIT_TEST_MULTI_PACK_INDEX=0
> 
> Why does this disable commit graph and multipack index?  Is that setting
> up for something that comes later?

Yes, these don't need to be here yet.

> [...]
>> +test_expect_success 'gc [--auto]' '
>> +	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run &&
>> +	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
>> +	grep ",\"gc\"]" run-no-auto.txt  &&
>> +	grep ",\"gc\",\"--auto\"]" run-auto.txt
> 
> This feels a little odd in two ways:
> 
> - the use of "git gc" isn't a user-facing detail, so this is testing
>   implementation instead of the user-facing behavior.  That's okay,
>   though --- it can be useful to test internals sometimes.

Consider this a "unit test" of the builtin. I'm testing whether the
command-line arguments had an effect on the child process.

> - the way that this tests for "git gc" feels brittle: if the trace
>   emitter changes some day to include a space after the comma, for
>   example, then the test would start failing.  I thought that in the
>   spirit of fakes, it could make sense to write a custom "git gc"
>   command using test_write_script, but that isn't likely to work
>   either since gc is a builtin.
> 
> Perhaps this is suggesting that we need some central test helper for
> parsing traces so we can do this reliably in one place.  What do you
> think?

I'm specifically using GIT_TRACE2_EVENT, which is intended for
consumption by computer, not humans. Adding whitespace or otherwise
changing the output format would be an unnecessary disruption that
is more likely to have downstream effects with external tools.

In some way, adding extra dependencies like this in our own test
suite can help ensure the output format is more stable.

If there is a better way to ask "Did my command call 'git gc' (with
no arguments|with these arguments)?" then I'm happy to consider it.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Nieder wrote (reply to this):

Derrick Stolee wrote:
> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:

>> [jrnieder] How do I supply the tasks on the command line?  Are they
>> parameters to this subcommand?  If so, it could make sense for this to
>> say something like
>>
>> 	run <task>...::
>
> Hopefully this is documented to your satisfaction when the ability
> to customize the tasks is implemented.

Hm, do you mean that this change is too difficult to squash into the
first patch?

>> What is the exit code convention for "git maintenance run"?  (Many Git
>> commands don't document this, but since we're making a new command
>> it seems worth building the habit.)
>
> Is this worth doing? Do we really want every command to document
> that "0 means success, everything else means failure, and some of
> those exit codes mean a specific kind of failure that is global to
> Git"?

I mean things like "what exit code does it run if there is no
maintenance to do?"

[...]
>>> +static struct maintenance_opts {
>>> +	int auto_flag;
>>> +} opts;
>>
>> Packing this in a struct feels a bit unusual.  Is the struct going to
>> be passed somewhere, or could these be individual locals in
>> cmd_maintenance?
>
> This will grow, and I'd rather have one global struct than many
> individual global items. It makes it clearer when I use
> "opts.auto_flag" that this corresponds to whether "--auto" was
> provided as a command-line option.

That doesn't seem idiomatic within the Git codebase.

Can they be locals instead?

[...]
>> Perhaps this is suggesting that we need some central test helper for
>> parsing traces so we can do this reliably in one place.  What do you
>> think?
>
> I'm specifically using GIT_TRACE2_EVENT, which is intended for
> consumption by computer, not humans. Adding whitespace or otherwise
> changing the output format would be an unnecessary disruption that
> is more likely to have downstream effects with external tools.
>
> In some way, adding extra dependencies like this in our own test
> suite can help ensure the output format is more stable.

Hm, I thought the contract for GIT_TRACE2_EVENT is that it is JSON,
which is a well defined format.  To that end, I would sincerely hope
that external tools are not assuming anything beyond that the format
is JSON.

If that wasn't the intent, I'd rather that we not use a JSON-like
format at all.  That would make the expectations for evolving this
interface clearer.

> If there is a better way to ask "Did my command call 'git gc' (with
> no arguments|with these arguments)?" then I'm happy to consider it.

My proposal was just to factor this out into a function in
test-lib-functions.sh so it's easy to evolve over time in one place.

Thanks and hope that helps,
Jonathan

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Nieder wrote (reply to this):

Derrick Stolee wrote:
> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:
> > Derrick Stolee wrote:

>>> --- /dev/null
>>> +++ b/Documentation/git-maintenance.txt
>>> @@ -0,0 +1,57 @@
>> [...]
>>> +SUBCOMMANDS
>>> +-----------
>>> +
>>> +run::
>>> +	Run one or more maintenance tasks.
>>
>> [jrnieder] How do I supply the tasks on the command line?  Are they
>> parameters to this subcommand?  If so, it could make sense for this to
>> say something like
>>
>> 	run <task>...::
>
> Hopefully this is documented to your satisfaction when the ability
> to customize the tasks is implemented.

I think my last reply to this part missed the point a little, so let
me try again.

In multi-patch series, requests to make improvements to early parts
that would be obsoleted by later parts may seem strange --- isn't this
just wasted work?  Sure, occasionally someone will stumble on the
early part using "git bisect" or "git log", but isn't that rare enough
to not be worth worrying too much about?

And I'd agree with that reaction --- for a documentation nit like this
one, that benefit alone is probably not enough to be worth the fuss of
tweaking things.

When I focus more on early parts of a series, it is for a different
reason: from a release management point of view, it is useful to get
the early part of a series "cooked" as soon as possible so it can be
exposed to users in "next" and then become a foundation for later
patches in "master".  So we try to get as much of the early part of a
series into that state as we can, and to merge the early part even
without the later part as soon as the former is ready and valuable
enough to users on its own.

Relatedly, ensuring each commit is "complete" in this sense on its own
gives us more flexibility in case a problem is discovered later: we
can revert a patch that is causing issues without having to revert the
preceding part.

I think that might be part of the context I didn't convey well, and
why I want to get this in tip-top shape and start using it.

Thanks,
Jonathan

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/30/2020 8:30 PM, Jonathan Nieder wrote:
> Derrick Stolee wrote:
>> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:
> 
>>> [jrnieder] How do I supply the tasks on the command line?  Are they
>>> parameters to this subcommand?  If so, it could make sense for this to
>>> say something like
>>>
>>> 	run <task>...::
>>
>> Hopefully this is documented to your satisfaction when the ability
>> to customize the tasks is implemented.
> 
> Hm, do you mean that this change is too difficult to squash into the
> first patch?

I mean that there is only one task right now. Until the commit-graph
task is implemented, there is no need to have a --task=<task> option.

Creating a --task=<task> option in this patch would unnecessarily bloat
the patch by implementing multiple tasks while filling in the boilerplate
of creating a new builtin.

>>> What is the exit code convention for "git maintenance run"?  (Many Git
>>> commands don't document this, but since we're making a new command
>>> it seems worth building the habit.)
>>
>> Is this worth doing? Do we really want every command to document
>> that "0 means success, everything else means failure, and some of
>> those exit codes mean a specific kind of failure that is global to
>> Git"?
> 
> I mean things like "what exit code does it run if there is no
> maintenance to do?"
> 
> [...]
>>>> +static struct maintenance_opts {
>>>> +	int auto_flag;
>>>> +} opts;
>>>
>>> Packing this in a struct feels a bit unusual.  Is the struct going to
>>> be passed somewhere, or could these be individual locals in
>>> cmd_maintenance?
>>
>> This will grow, and I'd rather have one global struct than many
>> individual global items. It makes it clearer when I use
>> "opts.auto_flag" that this corresponds to whether "--auto" was
>> provided as a command-line option.
> 
> That doesn't seem idiomatic within the Git codebase.
>
> Can they be locals instead?

Which part do you want to be idiomatic? The fact that the parse-options
library typically refers to static global values or the fact that the
data is not organized in a struct?

Turning these into locals is _even worse_ than the current model of
static globals because that then requires updating the prototypes of
all the (static) local methods that operate on the arguments. Every
time a new option is added, that would cascade in method prototypes
all over again.

The natural thing to do to "fix" that situation is to create a struct
that holds the information from the parsed command-line arguments. But
then why is it a local parameter that is passed through all of the
local methods instead of a global (as presented here in this patch)?

> [...]
>>> Perhaps this is suggesting that we need some central test helper for
>>> parsing traces so we can do this reliably in one place.  What do you
>>> think?
>>
>> I'm specifically using GIT_TRACE2_EVENT, which is intended for
>> consumption by computer, not humans. Adding whitespace or otherwise
>> changing the output format would be an unnecessary disruption that
>> is more likely to have downstream effects with external tools.
>>
>> In some way, adding extra dependencies like this in our own test
>> suite can help ensure the output format is more stable.
> 
> Hm, I thought the contract for GIT_TRACE2_EVENT is that it is JSON,
> which is a well defined format.  To that end, I would sincerely hope
> that external tools are not assuming anything beyond that the format
> is JSON.
> 
> If that wasn't the intent, I'd rather that we not use a JSON-like
> format at all.  That would make the expectations for evolving this
> interface clearer.
> 
>> If there is a better way to ask "Did my command call 'git gc' (with
>> no arguments|with these arguments)?" then I'm happy to consider it.
> 
> My proposal was just to factor this out into a function in
> test-lib-functions.sh so it's easy to evolve over time in one place.

This is a valuable suggestion, but this series is already too large
to make such a change in addition to the patches already here. I'd
consider this more if there were more tests that checked "did we
run this subcommand with these arguments?" Better to extract a helper
when this is actually used in multiple places.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Nieder wrote (reply to this):

Derrick Stolee wrote:

> What is seems like you are asking instead is for me to create a tool
> in the test suite that parses each JSON line, extracts a specific
> member from that JSON object, reconstructs a command-line invocation
> from the JSON array, and reports whether that process worked for any
> line in the event output.

No, that isn't what I'm asking.

I'm asking for this patch series to take the existing "grep" lines
and put them in a function in test-lib-functions.sh, so that we can
change them in one place when the trace emitter changes.

[...]
> If this is to truly be a hard requirement for these tests to move
> forward,

Yes, from my point of view it really is.

But that "is this truly a hard requirement?" comes up tells me I have
not done a good job of communicating in this review.  A review is
about participants in the project working together to improve a patch,
not people making demands at each other.

[...]
> If I'm to spend time engineering something more complicated just to
> check "did this subcommand run with these arguments?" then

I don't see why this is more complicated than what is in patch 1.  In
fact, I think it would be a little more simple.

Jonathan

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/4/2020 10:42 AM, Jonathan Nieder wrote:
> Derrick Stolee wrote:
> 
>> What is seems like you are asking instead is for me to create a tool
>> in the test suite that parses each JSON line, extracts a specific
>> member from that JSON object, reconstructs a command-line invocation
>> from the JSON array, and reports whether that process worked for any
>> line in the event output.
> 
> No, that isn't what I'm asking.
> 
> I'm asking for this patch series to take the existing "grep" lines
> and put them in a function in test-lib-functions.sh, so that we can
> change them in one place when the trace emitter changes.

Thanks for clarifying, clearing up my misinterpretation of your
request. I'm coming around to this idea.

> [...]
>> If this is to truly be a hard requirement for these tests to move
>> forward,
> 
> Yes, from my point of view it really is.
> 
> But that "is this truly a hard requirement?" comes up tells me I have
> not done a good job of communicating in this review.  A review is
> about participants in the project working together to improve a patch,
> not people making demands at each other.

It's also about a balance. A patch author and a reviewer can have
disagreements, and it is important to discover exactly how hard each
is holding to their opinion.

> [...]
>> If I'm to spend time engineering something more complicated just to
>> check "did this subcommand run with these arguments?" then
> 
> I don't see why this is more complicated than what is in patch 1.  In
> fact, I think it would be a little more simple.

My interpretation (parsing JSON) _was_ more complicated, hence my
reservations. The helper still requires non-trivial script-fu (for
someone like me who is bad at scripting languages) but centralizing
the grep has value.

Here is my attempt so far:

trace2_subcommand_data () {
	command="$1" &&
	commas=$(echo $command | sed s/\ /\",\"/g) &&
	printf "[\"$commas\"]"
}

test_subcommand_called () {
	data=$(trace2_subcommand_data $1) &&
	grep $data $2
}

test_subcommand_not_called () {
	! test_subcommand_called $1 $2
}

with callers looking like

	test_subcommand_called "gc" run-no-auto.txt &&
	test_subcommand_not_called "gc --auto" run-auto.txt &&
	test_subcommand_called "gc --quiet" run-quiet.txt

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Nieder wrote (reply to this):

Derrick Stolee wrote:

> Here is my attempt so far:
[...]
> 	test_subcommand_called "gc" run-no-auto.txt &&
> 	test_subcommand_not_called "gc --auto" run-auto.txt &&
> 	test_subcommand_called "gc --quiet" run-quiet.txt

I like it.  Some tweaks:
- can simplify by using 'printf' instead of sed
- passing the trace in stdin
- following the convention that helpers like test_i18ngrep use of
  accepting ! as a parameter to negate the exit code

That would make (untested)

# Check that the given command was invoked as part of the
# trace2-format trace on stdin.
#
#	test_subcommand_called [!] <command> <args>... < <trace>
#
# For example, to look for an invocation of "git upload-pack
# /path/to/repo"
#
#	GIT_TRACE2_EVENT=event.log git fetch ... &&
#	test_subcommand_called git upload-pack "$PATH" <event.log
#
# If the first parameter passed is !, this instead checks that
# the given command was not called.
#
test_subcommand_called () {
	local negate=
	if test "$1" = "!"
	then
		negate=t
		shift
	fi

	local expr=$(printf '"%s",' "$@")
	expr="${expr%,}"

	if test -n "$negate"
	then
		! grep "\[$expr\]"
	else
		grep "\[$expr\]"
	fi
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/4/2020 1:02 PM, Jonathan Nieder wrote:
> Derrick Stolee wrote:
> 
>> Here is my attempt so far:
> [...]
>> 	test_subcommand_called "gc" run-no-auto.txt &&
>> 	test_subcommand_not_called "gc --auto" run-auto.txt &&
>> 	test_subcommand_called "gc --quiet" run-quiet.txt
> 
> I like it.  Some tweaks:
> - can simplify by using 'printf' instead of sed
> - passing the trace in stdin
> - following the convention that helpers like test_i18ngrep use of
>   accepting ! as a parameter to negate the exit code
> 
> That would make (untested)
> 
> # Check that the given command was invoked as part of the
> # trace2-format trace on stdin.
> #
> #	test_subcommand_called [!] <command> <args>... < <trace>
> #
> # For example, to look for an invocation of "git upload-pack
> # /path/to/repo"
> #
> #	GIT_TRACE2_EVENT=event.log git fetch ... &&
> #	test_subcommand_called git upload-pack "$PATH" <event.log
> #
> # If the first parameter passed is !, this instead checks that
> # the given command was not called.
> #
> test_subcommand_called () {
> 	local negate=
> 	if test "$1" = "!"
> 	then
> 		negate=t
> 		shift
> 	fi
> 
> 	local expr=$(printf '"%s",' "$@")
> 	expr="${expr%,}"
> 
> 	if test -n "$negate"
> 	then
> 		! grep "\[$expr\]"
> 	else
> 		grep "\[$expr\]"
> 	fi
> }

This works! Thanks. With the ! usage, I think a rename to
'test_subcommand' makes sense. Further, I was somehow wrong
about not including "git" in the beginning, so here is my
example use of this method;

	test_subcommand git gc <run-no-auto.txt &&
	test_subcommand ! git gc --auto <run-auto.txt &&
	test_subcommand git gc --quiet <run-quiet.txt

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/3/2020 1:46 PM, Jonathan Nieder wrote:
> Derrick Stolee wrote:
>> On 7/30/2020 8:30 PM, Jonathan Nieder wrote:
>>> Derrick Stolee wrote:
>>>> On 7/29/2020 6:19 PM, Jonathan Nieder wrote:

Revisiting the first part of this thread, as it got lost in the discussion
about trace2 and JSON parsing...

>>>>> [jrnieder] How do I supply the tasks on the command line?  Are they
>>>>> parameters to this subcommand?  If so, it could make sense for this to
>>>>> say something like
>>>>>
>>>>> 	run <task>...::
>>>>
>>>> Hopefully this is documented to your satisfaction when the ability
>>>> to customize the tasks is implemented.
> [...]
>> I mean that there is only one task right now. Until the commit-graph
>> task is implemented, there is no need to have a --task=<task> option.
> 
> Ah, that wasn't clear to me from the docs.
> 
> You're saying that "git maintenance run" runs the "gc" task?  Can the
> documentation say so?

In this patch, the documentation says

	Run tasks to optimize Git repository data

the plural is a bit incorrect at this point, but not a huge deal.

The run subcommand says "Run one or more maintenance tasks" which again
is a bit misleading since "or more" is not technically possible. This
becomes a bit obvious when we see that the task list only contains a 'gc'
task.

So no, the documentation in this patch doesn't say "Run the 'gc' task"
because that could be gathered from context _and_ it will be incorrect
by the time this patch is integrated as part of the series.

So I struggle with finding a way to improve the documentation in this
patch to satisfy your concerns. Any change made here will just be un-done
by a later patch in the same series.

I'd rather have something vague now and improve the documentation
incrementally along with the implementation.

Another point you made earlier, but seems to be dropped in this email:
you wanted the return code documented. This also seems like wasteful
text, because the best I can think of is

  This command will return 0 unless a maintenance task failed in an
  unrecoverable way, or other typical failure cases such as incorrect
  usage or the command being run outside of a Git repository.

Please recommend something better, because that description suffers from
being too generic/obvious when talking about the maintenance task and
being an obvious repetition of Git's existing error code behavior.

Perhaps this would be easier if there was an established pattern of
documenting the return code. So, I went digging.

The examples I could find are:

 1. 'git cat-file -e'
 2. 'git check-ref-format'
 3. 'git commit-graph --object-dir=<dir>'
 4. 'git cvsimport'
 5. 'git fast-import'
 6. 'git filter-branch'
 7. 'git grep --quiet'
 8. 'git gui citool'
 9. 'git ls-remote --exit-code'
10. 'git merge-base --is-ancestor'
11. 'git push --recurse-submodules'
12. 'git reflog exists'
13. 'git rev-parse --parseopt'
14. 'git update-index --test-untracked-cache' 

Most of these are for specific options, and the exit code is documented
with relation to how that specific option may lead to a non-zero exit
code (1, 3, 7, 8, 9, 10, 11, 12, 13, 14).

The one that is most clearly part of the main description is for
the plumbing command 'git check-ref-format':

	Checks if a given 'refname' is acceptable, and exits with
	a non-zero status if it is not.

This seems appropriate for something intended for use as a helper
in script, with a very simple behavior. The exit code is the only
way it provides any feedback to the caller! This use of the exit
code as the only output is used by some other cases (1, 7, 9, 10,
11, 12, 13, 14).

Of course, we also have examples like this in git-cvsimport.txt:

	Otherwise, success is indicated the Unix way, i.e. by simply
	exiting with a zero exit status.

This phrase should be _assumed_ as part of Git itself, instead of
written in the documentation for specific commands.

The documentation for 'git fast-import' is more useful here:

	If fast-import is supplied invalid input it will terminate
	with a non-zero exit status and create a crash report...

But, you can see that the exit code is not the focus, but the
crash report _is_ important to document.

'git filter-branch' has a special exit code for the case that no
commits were found to update.

'git gui citool' returns an error code if the window is closed
before the command can complete. This is also not a core feature,
so is not a good example.

My only conclusion from this investigation is that the exit code
is documented when it is useful, but otherwise the standard Unix
pattern is assumed and not worth documenting.

If this is something we want to change, then I don't think it
valuable to have that discussion here. Instead, let's start that
discussion elsewhere with a series that updates existing docs to
catch up to a new standard of documenting the exit code.

[1] https://git-scm.com/docs/git-cat-file#Documentation/git-cat-file.txt--e
[2] https://git-scm.com/docs/git-check-ref-format#_description
[3] https://git-scm.com/docs/git-commit-graph#Documentation/git-commit-graph.txt---object-dir
[4] https://git-scm.com/docs/git-cvsimport#_output
[5] https://git-scm.com/docs/git-fast-import#_crash_reports
[6] https://git-scm.com/docs/git-filter-branch#_exit_status
[7] https://git-scm.com/docs/git-grep#Documentation/git-grep.txt---quiet
[8] https://git-scm.com/docs/git-gui#Documentation/git-gui.txt-codegitguicitoolcode
[9] https://git-scm.com/docs/git-ls-remote#Documentation/git-ls-remote.txt---exit-code
[10] https://git-scm.com/docs/git-merge-base#Documentation/git-merge-base.txt---is-ancestor
[11] https://git-scm.com/docs/git-push#Documentation/git-push.txt---recurse-submodulescheckon-demandonlyno
[12] https://git-scm.com/docs/git-reflog
[13] https://git-scm.com/docs/git-rev-parse#_parseopt
[14] https://git-scm.com/docs/git-update-index#Documentation/git-update-index.txt---test-untracked-cache

> [...]
>>>>>> +static struct maintenance_opts {
>>>>>> +	int auto_flag;
>>>>>> +} opts;
>>>>>
>>>>> Packing this in a struct feels a bit unusual.  Is the struct going to
>>>>> be passed somewhere, or could these be individual locals in
>>>>> cmd_maintenance?
>>>>
>>>> This will grow, and I'd rather have one global struct than many
>>>> individual global items. It makes it clearer when I use
>>>> "opts.auto_flag" that this corresponds to whether "--auto" was
>>>> provided as a command-line option.
>>>
>>> That doesn't seem idiomatic within the Git codebase.
>>>
>>> Can they be locals instead?
>>
>> Which part do you want to be idiomatic? The fact that the parse-options
>> library typically refers to static global values or the fact that the
>> data is not organized in a struct?
> 
> parse-options has no requirement about the values being global.  Some
> older code does that, but newer code tends to use locals when
> appropriate.

You're right. Global is not required. The struct needs to be 'static'
or initialized for compilation.

While trying to get rid of globals based on your recommendation, I
see that the callback functions in the option-parsing _do_ require
global state. This is used in "maintenance: add --task option" with
the global tasks array, but it also uses a global count of "how many
tasks have been selected" currently in the struct. Two ways to get
around that issue include (1) making a global int, or (2) looping
through the task list to count how many have been selected so far.

Neither of these solutions is clean. (In the diff below, I chose
option 2.)

> Putting it in a struct is fine as long as that struct is being passed
> around.  What isn't idiomatic in Git is using a global struct for
> namespacing.

In my opinion, using a global struct would be better than the existing
pattern of a bunch of global variables, but I get your point about
precedence. More on this below.

> [...]
>> The natural thing to do to "fix" that situation is to create a struct
>> that holds the information from the parsed command-line arguments. But
>> then why is it a local parameter that is passed through all of the
>> local methods instead of a global (as presented here in this patch)?
> 
> I'm having trouble parsing that last sentence.  You're saying a global
> is preferable over passing around a pointer to a local "opts" struct?

Using a "singleton" like the opts struct here, the code is definitely
much simpler to read. There is only ever one instance, because the struct
definition comes along with the instance declaration.

By making it a local instance, we then need to pass around the struct as
a parameter to all of the static methods in the builtin, even though there
is only ever one instance of this struct per process.

The two points above are the ones that led me to use the global instance
in the first place.

To get back to your point about being idiomatic, these builtins use a
struct to group options, then pass it around as a parameter:

* cat-file (struct batch_options)
* checkout (struct checkout_opts)
* rebase (struct rebase_options)
* tag (struct create_tag_options)
* worktree (struct add_opts)

Unfortunately, there _is_ an example of the behavior I was following in
this series: my own example in the sparse-checkout builtin. So, I'm just
repeating my own mistakes that were not challenged before.

I updated my v4-in-progress branch with that change, and the tip-to-tip
diff is below. It required some non-trivial care around the --task=<task>
callback (which can't see the local options struct) and other cases.

Thanks,
-Stolee

--- 

diff --git a/builtin/gc.c b/builtin/gc.c
index 1f0c519d14..3e473a1293 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -709,11 +709,10 @@ static const char * const builtin_maintenance_usage[] = {
 	NULL
 };
 
-static struct maintenance_opts {
+struct maintenance_opts {
 	int auto_flag;
 	int quiet;
-	int tasks_selected;
-} opts;
+};
 
 /* Remember to update object flag allocation in object.h */
 #define PARENT1		(1u<<16)
@@ -789,7 +788,7 @@ static int should_write_commit_graph(void)
 	return result;
 }
 
-static int run_write_commit_graph(void)
+static int run_write_commit_graph(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
@@ -797,13 +796,13 @@ static int run_write_commit_graph(void)
 	strvec_pushl(&child.args, "commit-graph", "write",
 		     "--split", "--reachable", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	return !!run_command(&child);
 }
 
-static int run_verify_commit_graph(void)
+static int run_verify_commit_graph(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
@@ -811,24 +810,24 @@ static int run_verify_commit_graph(void)
 	strvec_pushl(&child.args, "commit-graph", "verify",
 		     "--shallow", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	return !!run_command(&child);
 }
 
-static int maintenance_task_commit_graph(void)
+static int maintenance_task_commit_graph(struct maintenance_opts *opts)
 {
 	struct repository *r = the_repository;
 	char *chain_path;
 
 	close_object_store(r->objects);
-	if (run_write_commit_graph()) {
+	if (run_write_commit_graph(opts)) {
 		error(_("failed to write commit-graph"));
 		return 1;
 	}
 
-	if (!run_verify_commit_graph())
+	if (!run_verify_commit_graph(opts))
 		return 0;
 
 	warning(_("commit-graph verify caught error, rewriting"));
@@ -840,14 +839,14 @@ static int maintenance_task_commit_graph(void)
 	}
 	free(chain_path);
 
-	if (!run_write_commit_graph())
+	if (!run_write_commit_graph(opts))
 		return 0;
 
 	error(_("failed to rewrite commit-graph"));
 	return 1;
 }
 
-static int fetch_remote(const char *remote)
+static int fetch_remote(const char *remote, struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
@@ -855,7 +854,7 @@ static int fetch_remote(const char *remote)
 	strvec_pushl(&child.args, "fetch", remote, "--prune", "--no-tags",
 		     "--no-write-fetch-head", "--refmap=", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--quiet");
 
 	strvec_pushf(&child.args, "+refs/heads/*:refs/prefetch/%s/*", remote);
@@ -871,7 +870,7 @@ static int fill_each_remote(struct remote *remote, void *cbdata)
 	return 0;
 }
 
-static int maintenance_task_prefetch(void)
+static int maintenance_task_prefetch(struct maintenance_opts *opts)
 {
 	int result = 0;
 	struct string_list_item *item;
@@ -886,23 +885,23 @@ static int maintenance_task_prefetch(void)
 	for (item = remotes.items;
 	     item && item < remotes.items + remotes.nr;
 	     item++)
-		result |= fetch_remote(item->string);
+		result |= fetch_remote(item->string, opts);
 
 cleanup:
 	string_list_clear(&remotes, 0);
 	return result;
 }
 
-static int maintenance_task_gc(void)
+static int maintenance_task_gc(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_push(&child.args, "gc");
 
-	if (opts.auto_flag)
+	if (opts->auto_flag)
 		strvec_push(&child.args, "--auto");
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--quiet");
 	else
 		strvec_push(&child.args, "--no-quiet");
@@ -911,14 +910,14 @@ static int maintenance_task_gc(void)
 	return run_command(&child);
 }
 
-static int prune_packed(void)
+static int prune_packed(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_push(&child.args, "prune-packed");
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--quiet");
 
 	return !!run_command(&child);
@@ -977,7 +976,7 @@ static int write_loose_object_to_stdin(const struct object_id *oid,
 	return ++(d->count) > d->batch_size;
 }
 
-static int pack_loose(void)
+static int pack_loose(struct maintenance_opts *opts)
 {
 	struct repository *r = the_repository;
 	int result = 0;
@@ -996,7 +995,7 @@ static int pack_loose(void)
 	pack_proc.git_cmd = 1;
 
 	strvec_push(&pack_proc.args, "pack-objects");
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&pack_proc.args, "--quiet");
 	strvec_pushf(&pack_proc.args, "%s/pack/loose", r->objects->odb->path);
 
@@ -1027,9 +1026,9 @@ static int pack_loose(void)
 	return result;
 }
 
-static int maintenance_task_loose_objects(void)
+static int maintenance_task_loose_objects(struct maintenance_opts *opts)
 {
-	return prune_packed() || pack_loose();
+	return prune_packed(opts) || pack_loose(opts);
 }
 
 static int incremental_repack_auto_condition(void)
@@ -1061,14 +1060,14 @@ static int incremental_repack_auto_condition(void)
 	return count >= incremental_repack_auto_limit;
 }
 
-static int multi_pack_index_write(void)
+static int multi_pack_index_write(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_pushl(&child.args, "multi-pack-index", "write", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	if (run_command(&child))
@@ -1077,7 +1076,7 @@ static int multi_pack_index_write(void)
 	return 0;
 }
 
-static int rewrite_multi_pack_index(void)
+static int rewrite_multi_pack_index(struct maintenance_opts *opts)
 {
 	struct repository *r = the_repository;
 	char *midx_name = get_midx_filename(r->objects->odb->path);
@@ -1085,17 +1084,18 @@ static int rewrite_multi_pack_index(void)
 	unlink(midx_name);
 	free(midx_name);
 
-	return multi_pack_index_write();
+	return multi_pack_index_write(opts);
 }
 
-static int multi_pack_index_verify(const char *message)
+static int multi_pack_index_verify(struct maintenance_opts *opts,
+				   const char *message)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_pushl(&child.args, "multi-pack-index", "verify", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	if (run_command(&child)) {
@@ -1106,14 +1106,14 @@ static int multi_pack_index_verify(const char *message)
 	return 0;
 }
 
-static int multi_pack_index_expire(void)
+static int multi_pack_index_expire(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_pushl(&child.args, "multi-pack-index", "expire", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	close_object_store(the_repository->objects);
@@ -1164,14 +1164,14 @@ static off_t get_auto_pack_size(void)
 	return result_size;
 }
 
-static int multi_pack_index_repack(void)
+static int multi_pack_index_repack(struct maintenance_opts *opts)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
 
 	child.git_cmd = 1;
 	strvec_pushl(&child.args, "multi-pack-index", "repack", NULL);
 
-	if (opts.quiet)
+	if (opts->quiet)
 		strvec_push(&child.args, "--no-progress");
 
 	strvec_pushf(&child.args, "--batch-size=%"PRIuMAX,
@@ -1185,7 +1185,7 @@ static int multi_pack_index_repack(void)
 	return 0;
 }
 
-static int maintenance_task_incremental_repack(void)
+static int maintenance_task_incremental_repack(struct maintenance_opts *opts)
 {
 	prepare_repo_settings(the_repository);
 	if (!the_repository->settings.core_multi_pack_index) {
@@ -1193,22 +1193,22 @@ static int maintenance_task_incremental_repack(void)
 		return 0;
 	}
 
-	if (multi_pack_index_write())
+	if (multi_pack_index_write(opts))
 		return 1;
-	if (multi_pack_index_verify("after initial write"))
-		return rewrite_multi_pack_index();
-	if (multi_pack_index_expire())
+	if (multi_pack_index_verify(opts, "after initial write"))
+		return rewrite_multi_pack_index(opts);
+	if (multi_pack_index_expire(opts))
 		return 1;
-	if (multi_pack_index_verify("after expire step"))
-		return !!rewrite_multi_pack_index();
-	if (multi_pack_index_repack())
+	if (multi_pack_index_verify(opts, "after expire step"))
+		return !!rewrite_multi_pack_index(opts);
+	if (multi_pack_index_repack(opts))
 		return 1;
-	if (multi_pack_index_verify("after repack step"))
-		return !!rewrite_multi_pack_index();
+	if (multi_pack_index_verify(opts, "after repack step"))
+		return !!rewrite_multi_pack_index(opts);
 	return 0;
 }
 
-typedef int maintenance_task_fn(void);
+typedef int maintenance_task_fn(struct maintenance_opts *opts);
 
 /*
  * An auto condition function returns 1 if the task should run
@@ -1275,9 +1275,9 @@ static int compare_tasks_by_selection(const void *a_, const void *b_)
 	return b->selected_order - a->selected_order;
 }
 
-static int maintenance_run(void)
+static int maintenance_run(struct maintenance_opts *opts)
 {
-	int i;
+	int i, found_selected = 0;
 	int result = 0;
 	struct lock_file lk;
 	struct repository *r = the_repository;
@@ -1291,7 +1291,7 @@ static int maintenance_run(void)
 		 * recursive process stack. Do not report an error in
 		 * that case.
 		 */
-		if (!opts.auto_flag && !opts.quiet)
+		if (!opts->auto_flag && !opts->quiet)
 			error(_("lock file '%s' exists, skipping maintenance"),
 			      lock_path);
 		free(lock_path);
@@ -1299,23 +1299,26 @@ static int maintenance_run(void)
 	}
 	free(lock_path);
 
-	if (opts.tasks_selected)
+	for (i = 0; !found_selected && i < TASK__COUNT; i++)
+		found_selected = tasks[i].selected;
+
+	if (found_selected)
 		QSORT(tasks, TASK__COUNT, compare_tasks_by_selection);
 
 	for (i = 0; i < TASK__COUNT; i++) {
-		if (opts.tasks_selected && !tasks[i].selected)
+		if (found_selected && !tasks[i].selected)
 			continue;
 
-		if (!opts.tasks_selected && !tasks[i].enabled)
+		if (!found_selected && !tasks[i].enabled)
 			continue;
 
-		if (opts.auto_flag &&
+		if (opts->auto_flag &&
 		    (!tasks[i].auto_condition ||
 		     !tasks[i].auto_condition()))
 			continue;
 
 		trace2_region_enter("maintenance", tasks[i].name, r);
-		if (tasks[i].fn()) {
+		if (tasks[i].fn(opts)) {
 			error(_("task '%s' failed"), tasks[i].name);
 			result = 1;
 		}
@@ -1349,17 +1352,15 @@ static void initialize_task_config(void)
 static int task_option_parse(const struct option *opt,
 			     const char *arg, int unset)
 {
-	int i;
+	int i, num_selected = 0;
 	struct maintenance_task *task = NULL;
 
 	BUG_ON_OPT_NEG(unset);
 
-	opts.tasks_selected++;
-
 	for (i = 0; i < TASK__COUNT; i++) {
+		num_selected += tasks[i].selected;
 		if (!strcasecmp(tasks[i].name, arg)) {
 			task = &tasks[i];
-			break;
 		}
 	}
 
@@ -1374,13 +1375,14 @@ static int task_option_parse(const struct option *opt,
 	}
 
 	task->selected = 1;
-	task->selected_order = opts.tasks_selected;
+	task->selected_order = num_selected + 1;
 
 	return 0;
 }
 
 int cmd_maintenance(int argc, const char **argv, const char *prefix)
 {
+	static struct maintenance_opts opts;
 	static struct option builtin_maintenance_options[] = {
 		OPT_BOOL(0, "auto", &opts.auto_flag,
 			 N_("run tasks based on the state of the repository")),
@@ -1408,7 +1410,7 @@ int cmd_maintenance(int argc, const char **argv, const char *prefix)
 
 	if (argc == 1) {
 		if (!strcmp(argv[0], "run"))
-			return maintenance_run();
+			return maintenance_run(&opts);
 	}
 
 	usage_with_options(builtin_maintenance_usage,

/git-mailinfo
/git-mailsplit
/git-maintenance
/git-merge
/git-merge-base
/git-merge-index
Expand Down
2 changes: 2 additions & 0 deletions Documentation/config.txt
Expand Up @@ -396,6 +396,8 @@ include::config/mailinfo.txt[]

include::config/mailmap.txt[]

include::config/maintenance.txt[]

include::config/man.txt[]

include::config/merge.txt[]
Expand Down
4 changes: 2 additions & 2 deletions Documentation/config/core.txt
Expand Up @@ -606,8 +606,8 @@ core.useReplaceRefs::

core.multiPackIndex::
Use the multi-pack-index file to track multiple packfiles using a
single index. See link:technical/multi-pack-index.html[the
multi-pack-index design document].
single index. See linkgit:git-multi-pack-index[1] for more
information. Defaults to true.

core.sparseCheckout::
Enable "sparse checkout" feature. See linkgit:git-sparse-checkout[1]
Expand Down
7 changes: 7 additions & 0 deletions Documentation/config/fetch.txt
Expand Up @@ -91,3 +91,10 @@ fetch.writeCommitGraph::
merge and the write may take longer. Having an updated commit-graph
file helps performance of many Git commands, including `git merge-base`,
`git push -f`, and `git log --graph`. Defaults to false.

fetch.writeFetchHEAD::
Setting it to false tells `git fetch` not to write the list
of remote refs fetched in the `FETCH_HEAD` file directly
under `$GIT_DIR`. Can be countermanded from the command
line with the `--[no-]write-fetch-head` option. Defaults to
true.
32 changes: 32 additions & 0 deletions Documentation/config/maintenance.txt
@@ -0,0 +1,32 @@
maintenance.<task>.enabled::
This boolean config option controls whether the maintenance task
with name `<task>` is run when no `--task` option is specified.
By default, only `maintenance.gc.enabled` is true.

maintenance.commit-graph.auto::
This integer config option controls how often the `commit-graph` task
should be run as part of `git maintenance run --auto`. If zero, then
the `commit-graph` task will not run with the `--auto` option. A
negative value will force the task to run every time. Otherwise, a
positive value implies the command should run when the number of
reachable commits that are not in the commit-graph file is at least
the value of `maintenance.commit-graph.auto`. The default value is
100.

maintenance.loose-objects.auto::
This integer config option controls how often the `loose-objects` task
should be run as part of `git maintenance run --auto`. If zero, then
the `loose-objects` task will not run with the `--auto` option. A
negative value will force the task to run every time. Otherwise, a
positive value implies the command should run when the number of
loose objects is at least the value of `maintenance.loose-objects.auto`.
The default value is 100.

maintenance.incremental-repack.auto::
This integer config option controls how often the `incremental-repack`
task should be run as part of `git maintenance run --auto`. If zero,
then the `incremental-repack` task will not run with the `--auto`
option. A negative value will force the task to run every time.
Otherwise, a positive value implies the command should run when the
number of pack-files not in the multi-pack-index is at least the value
of `maintenance.incremental-repack.auto`. The default value is 10.
16 changes: 14 additions & 2 deletions Documentation/fetch-options.txt
Expand Up @@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
--dry-run::
Show what would be done, without making any changes.

ifndef::git-pull[]
--[no-]write-fetch-head::
Write the list of remote refs fetched in the `FETCH_HEAD`
file directly under `$GIT_DIR`. This is the default unless
the configuration variable `fetch.writeFetchHEAD` is set to
false. Passing `--no-write-fetch-head` from the command
line tells Git not to write the file. Under `--dry-run`
option, the file is never written.
endif::git-pull[]

-f::
--force::
When 'git fetch' is used with `<src>:<dst>` refspec it may
Expand All @@ -86,9 +96,11 @@ ifndef::git-pull[]
Allow several <repository> and <group> arguments to be
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +--[no-]maintenance::
>  --[no-]auto-gc::
> -	Run `git gc --auto` at the end to perform garbage collection
> -	if needed. This is enabled by default.
> +	Run `git maintenance run --auto` at the end to perform garbage
> +	collection if needed. This is enabled by default.

Shouldn't the new synonym be called --auto-maintenance or an
abbreviation thereof?  It is not like we will run the full
maintenance suite when "--no-maintenance" is omitted, which
certainly is not the impression we want to give our readers.

>  These objects may be removed by normal Git operations (such as `git commit`)
> -which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
> -If these objects are removed and were referenced by the cloned repository,
> -then the cloned repository will become corrupt.
> +which automatically call `git maintenance run --auto` and `git gc --auto`.

Hmph.  Perhaps the picture may change in the end of the series but I
got an impression that "gc --auto" would eventually become just part
of "maintenance --auto" and the users won't have to be even aware of
its existence?  Wouldn't we rather want to say something like

	--[no-]auto-maintenance::
	--[no-]auto-gc::
                Run `git maintenance run --auto` at the end to perform
                garbage collection if needed (`--[no-]auto-gc` is a
                synonym).  This is enabled by default.
	
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 82ac4be8a5..49a4d727d4 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
>  	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
>  			N_("report that we have only objects reachable from this object")),
>  	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),

> +	OPT_BOOL(0, "maintenance", &enable_auto_gc,
> +		 N_("run 'maintenance --auto' after fetching")),
>  	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
> +		 N_("run 'maintenance --auto' after fetching")),

OK, so this is truly a backward-compatible synonym at this point.

> diff --git a/run-command.c b/run-command.c
> index 9b3a57d1e3..82ad241638 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -1865,14 +1865,17 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
>  	return result;
>  }
>  
> -int run_auto_gc(int quiet)
> +int run_auto_maintenance(int quiet)
>  {
>  	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
>  	int status;
>  
> -	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
> +	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
>  	if (quiet)
>  		argv_array_push(&argv_gc_auto, "--quiet");
> +	else
> +		argv_array_push(&argv_gc_auto, "--no-quiet");
> +
>  	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
>  	argv_array_clear(&argv_gc_auto);
>  	return status;

Don't we want to replace all _gc_ with _maintenance_ in this
function?  I think the first business before we can do so would be
to rethink if spelling out "maintenance" fully in code is a good
idea in the first space.  It would make names for variables,
structures and fields unnecessarily long without contributing to
ease of understanding an iota, and a easy-to-remember short-form or
an abbreviation may be needed.  Using a short-form/abbreviation
wouldn't worsen the end-user experience, and not the developer
experience for that matter.

If we choose "gc" as the short-hand, most of the change in this step
would become unnecessary.  I also do not mind if we some other words
or word-fragment (perhaps "maint"???) is chosen.

> diff --git a/run-command.h b/run-command.h
> index 191dfcdafe..d9a800e700 100644
> --- a/run-command.h
> +++ b/run-command.h
> @@ -221,7 +221,7 @@ int run_hook_ve(const char *const *env, const char *name, va_list args);
>  /*
>   * Trigger an auto-gc
>   */
> -int run_auto_gc(int quiet);
> +int run_auto_maintenance(int quiet);
>  
>  #define RUN_COMMAND_NO_STDIN 1
>  #define RUN_GIT_CMD	     2	/*If this is to be git sub-command */
> diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
> index a66dbe0bde..9850ecde5d 100755
> --- a/t/t5510-fetch.sh
> +++ b/t/t5510-fetch.sh
> @@ -919,7 +919,7 @@ test_expect_success 'fetching with auto-gc does not lock up' '
>  		git config fetch.unpackLimit 1 &&
>  		git config gc.autoPackLimit 1 &&
>  		git config gc.autoDetach false &&
> -		GIT_ASK_YESNO="$D/askyesno" git fetch >fetch.out 2>&1 &&
> +		GIT_ASK_YESNO="$D/askyesno" git fetch --verbose >fetch.out 2>&1 &&
>  		test_i18ngrep "Auto packing the repository" fetch.out &&
>  		! grep "Should I try again" fetch.out
>  	)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Jul 23, 2020 at 01:21:55PM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +--[no-]maintenance::
> >  --[no-]auto-gc::
> > -	Run `git gc --auto` at the end to perform garbage collection
> > -	if needed. This is enabled by default.
> > +	Run `git maintenance run --auto` at the end to perform garbage
> > +	collection if needed. This is enabled by default.
>
> Shouldn't the new synonym be called --auto-maintenance or an
> abbreviation thereof?  It is not like we will run the full
> maintenance suite when "--no-maintenance" is omitted, which
> certainly is not the impression we want to give our readers.
>
> >  These objects may be removed by normal Git operations (such as `git commit`)
> > -which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
> > -If these objects are removed and were referenced by the cloned repository,
> > -then the cloned repository will become corrupt.
> > +which automatically call `git maintenance run --auto` and `git gc --auto`.
>
> Hmph.  Perhaps the picture may change in the end of the series but I
> got an impression that "gc --auto" would eventually become just part
> of "maintenance --auto" and the users won't have to be even aware of
> its existence?  Wouldn't we rather want to say something like
>
> 	--[no-]auto-maintenance::
> 	--[no-]auto-gc::
>                 Run `git maintenance run --auto` at the end to perform
>                 garbage collection if needed (`--[no-]auto-gc` is a
>                 synonym).  This is enabled by default.
>
> > diff --git a/builtin/fetch.c b/builtin/fetch.c
> > index 82ac4be8a5..49a4d727d4 100644
> > --- a/builtin/fetch.c
> > +++ b/builtin/fetch.c
> > @@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
> >  	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
> >  			N_("report that we have only objects reachable from this object")),
> >  	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
>
> > +	OPT_BOOL(0, "maintenance", &enable_auto_gc,
> > +		 N_("run 'maintenance --auto' after fetching")),
> >  	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
> > +		 N_("run 'maintenance --auto' after fetching")),
>
> OK, so this is truly a backward-compatible synonym at this point.

I wouldn't be opposed to making the 'auto-gc' option an
'OPT_HIDDEN_BOOL', but I realize that users may not want to move as
quickly as that. Perhaps we should visit this in a couple of releases
(or perhaps you are getting to it in a later patch that I haven't read
yet).

> > diff --git a/run-command.c b/run-command.c
> > index 9b3a57d1e3..82ad241638 100644
> > --- a/run-command.c
> > +++ b/run-command.c
> > @@ -1865,14 +1865,17 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
> >  	return result;
> >  }
> >
> > -int run_auto_gc(int quiet)
> > +int run_auto_maintenance(int quiet)
> >  {
> >  	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
> >  	int status;
> >
> > -	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
> > +	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
> >  	if (quiet)
> >  		argv_array_push(&argv_gc_auto, "--quiet");
> > +	else
> > +		argv_array_push(&argv_gc_auto, "--no-quiet");
> > +
> >  	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
> >  	argv_array_clear(&argv_gc_auto);
> >  	return status;
>
> Don't we want to replace all _gc_ with _maintenance_ in this
> function?  I think the first business before we can do so would be
> to rethink if spelling out "maintenance" fully in code is a good
> idea in the first space.  It would make names for variables,
> structures and fields unnecessarily long without contributing to
> ease of understanding an iota, and a easy-to-remember short-form or
> an abbreviation may be needed.  Using a short-form/abbreviation
> wouldn't worsen the end-user experience, and not the developer
> experience for that matter.
>
> If we choose "gc" as the short-hand, most of the change in this step
> would become unnecessary.  I also do not mind if we some other words
> or word-fragment (perhaps "maint"???) is chosen.

Yeah, writing out 'maintenance' every time in the code and in
command-line arguments is kind of a mouthful. I'm more willing to accept
that '--maintenance' is something that users would write or script
around, but 'maint' makes sense to me as a shorthand in the code.

I could go either way on calling the command-line flag '--maint',
though.

>
> > diff --git a/run-command.h b/run-command.h
> > index 191dfcdafe..d9a800e700 100644
> > --- a/run-command.h
> > +++ b/run-command.h
> > @@ -221,7 +221,7 @@ int run_hook_ve(const char *const *env, const char *name, va_list args);
> >  /*
> >   * Trigger an auto-gc
> >   */
> > -int run_auto_gc(int quiet);
> > +int run_auto_maintenance(int quiet);
> >
> >  #define RUN_COMMAND_NO_STDIN 1
> >  #define RUN_GIT_CMD	     2	/*If this is to be git sub-command */
> > diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
> > index a66dbe0bde..9850ecde5d 100755
> > --- a/t/t5510-fetch.sh
> > +++ b/t/t5510-fetch.sh
> > @@ -919,7 +919,7 @@ test_expect_success 'fetching with auto-gc does not lock up' '
> >  		git config fetch.unpackLimit 1 &&
> >  		git config gc.autoPackLimit 1 &&
> >  		git config gc.autoDetach false &&
> > -		GIT_ASK_YESNO="$D/askyesno" git fetch >fetch.out 2>&1 &&
> > +		GIT_ASK_YESNO="$D/askyesno" git fetch --verbose >fetch.out 2>&1 &&
> >  		test_i18ngrep "Auto packing the repository" fetch.out &&
> >  		! grep "Should I try again" fetch.out
> >  	)

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/23/2020 4:21 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +--[no-]maintenance::
>>  --[no-]auto-gc::
>> -	Run `git gc --auto` at the end to perform garbage collection
>> -	if needed. This is enabled by default.
>> +	Run `git maintenance run --auto` at the end to perform garbage
>> +	collection if needed. This is enabled by default.
> 
> Shouldn't the new synonym be called --auto-maintenance or an
> abbreviation thereof?  It is not like we will run the full
> maintenance suite when "--no-maintenance" is omitted, which
> certainly is not the impression we want to give our readers.

Adding "auto-" to the argument can be informative.

I would think that abbreviating the option may make understanding
the argument more difficult for users where English is not their
first language. Of course, I'm a bad judge of that.

I also don't think this option is called often by users at the
command-line and instead is done by scripts and third-party tools.
Abbreviating the word may have less value there.

(This reminds me that is would be good to have a config option that
prevents running this process _at all_. Sure, gc.auto=0 prevents the
command from doing anything, but there is still an extra cost of
starting a process. This is more of a problem on Windows. Making a
note to self here.)

>>  These objects may be removed by normal Git operations (such as `git commit`)
>> -which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
>> -If these objects are removed and were referenced by the cloned repository,
>> -then the cloned repository will become corrupt.
>> +which automatically call `git maintenance run --auto` and `git gc --auto`.
> 
> Hmph.  Perhaps the picture may change in the end of the series but I
> got an impression that "gc --auto" would eventually become just part
> of "maintenance --auto" and the users won't have to be even aware of
> its existence?  Wouldn't we rather want to say something like
> 
> 	--[no-]auto-maintenance::
> 	--[no-]auto-gc::
>                 Run `git maintenance run --auto` at the end to perform
>                 garbage collection if needed (`--[no-]auto-gc` is a
>                 synonym).  This is enabled by default.

I don't completely eliminate 'git gc' at the end of this series, but
instead hope that we can peel it apart slowly in follow-up series.
However, you are correct that I should be more careful to obliterate
it from the documentation instead of keeping references around.

>> diff --git a/builtin/fetch.c b/builtin/fetch.c
>> index 82ac4be8a5..49a4d727d4 100644
>> --- a/builtin/fetch.c
>> +++ b/builtin/fetch.c
>> @@ -196,8 +196,10 @@ static struct option builtin_fetch_options[] = {
>>  	OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
>>  			N_("report that we have only objects reachable from this object")),
>>  	OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
> 
>> +	OPT_BOOL(0, "maintenance", &enable_auto_gc,
>> +		 N_("run 'maintenance --auto' after fetching")),
>>  	OPT_BOOL(0, "auto-gc", &enable_auto_gc,
>> +		 N_("run 'maintenance --auto' after fetching")),
> 
> OK, so this is truly a backward-compatible synonym at this point.
> 
>> diff --git a/run-command.c b/run-command.c
>> index 9b3a57d1e3..82ad241638 100644
>> --- a/run-command.c
>> +++ b/run-command.c
>> @@ -1865,14 +1865,17 @@ int run_processes_parallel_tr2(int n, get_next_task_fn get_next_task,
>>  	return result;
>>  }
>>  
>> -int run_auto_gc(int quiet)
>> +int run_auto_maintenance(int quiet)
>>  {
>>  	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
>>  	int status;
>>  
>> -	argv_array_pushl(&argv_gc_auto, "gc", "--auto", NULL);
>> +	argv_array_pushl(&argv_gc_auto, "maintenance", "run", "--auto", NULL);
>>  	if (quiet)
>>  		argv_array_push(&argv_gc_auto, "--quiet");
>> +	else
>> +		argv_array_push(&argv_gc_auto, "--no-quiet");
>> +
>>  	status = run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
>>  	argv_array_clear(&argv_gc_auto);
>>  	return status;
> 
> Don't we want to replace all _gc_ with _maintenance_ in this
> function?  I think the first business before we can do so would be
> to rethink if spelling out "maintenance" fully in code is a good
> idea in the first space.  It would make names for variables,
> structures and fields unnecessarily long without contributing to
> ease of understanding an iota, and a easy-to-remember short-form or
> an abbreviation may be needed.  Using a short-form/abbreviation
> wouldn't worsen the end-user experience, and not the developer
> experience for that matter.
> 
> If we choose "gc" as the short-hand, most of the change in this step
> would become unnecessary.  I also do not mind if we some other words
> or word-fragment (perhaps "maint"???) is chosen.

Yes, I should have noticed that. Also, with Peff's feedback from
another thread, the method can look a bit simpler this way:

int run_auto_maintenance(int quiet)
{
	struct child_process maint = CHILD_PROCESS_INIT;
	maint.git_cmd = 1;

	argv_array_pushl(&maint.argv, "maintenance", "run", "--auto", NULL);
	if (quiet)
		argv_array_push(&maint.argv, "--quiet");
	else
		argv_array_push(&maint.argv, "--no-quiet");

	return run_command(&maint.argv);
}

(I will update it again to work on Peff's argv_array work, but
hopefully it is clear how this is simpler.)

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/30/2020 9:29 AM, Derrick Stolee wrote:
> On 7/23/2020 4:21 PM, Junio C Hamano wrote:
>> Don't we want to replace all _gc_ with _maintenance_ in this
>> function?  I think the first business before we can do so would be
>> to rethink if spelling out "maintenance" fully in code is a good
>> idea in the first space.  It would make names for variables,
>> structures and fields unnecessarily long without contributing to
>> ease of understanding an iota, and a easy-to-remember short-form or
>> an abbreviation may be needed.  Using a short-form/abbreviation
>> wouldn't worsen the end-user experience, and not the developer
>> experience for that matter.
>>
>> If we choose "gc" as the short-hand, most of the change in this step
>> would become unnecessary.  I also do not mind if we some other words
>> or word-fragment (perhaps "maint"???) is chosen.
> 
> Yes, I should have noticed that. Also, with Peff's feedback from
> another thread, the method can look a bit simpler this way:

It would help if I actually _compile_ code before sending it.
Here is the fixed version:

int run_auto_maintenance(int quiet)
{
	struct child_process maint = CHILD_PROCESS_INIT;
	maint.git_cmd = 1;

	argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
	if (quiet)
		argv_array_push(&maint.args, "--quiet");
	else
		argv_array_push(&maint.args, "--no-quiet");

	return run_command(&maint);
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Thu, Jul 30, 2020 at 9:31 AM Derrick Stolee <stolee@gmail.com> wrote:
> int run_auto_maintenance(int quiet)
> {
>         struct child_process maint = CHILD_PROCESS_INIT;
>         maint.git_cmd = 1;
>
>         argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
>         if (quiet)
>                 argv_array_push(&maint.args, "--quiet");
>         else
>                 argv_array_push(&maint.args, "--no-quiet");

It's subjective, but this might be a good fit for the ternary operator:

    argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
    argv_array_push(&maint.args, quiet ? "--quiet" : "--no-quiet");

Or even:

    argv_array_pushl(&maint.args, "maintenance", "run", "--auto",
        quiet ? "--quiet" : "--no-quiet", NULL);

The latter is a bit less readable to me.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/30/2020 3:00 PM, Eric Sunshine wrote:
> On Thu, Jul 30, 2020 at 9:31 AM Derrick Stolee <stolee@gmail.com> wrote:
>> int run_auto_maintenance(int quiet)
>> {
>>         struct child_process maint = CHILD_PROCESS_INIT;
>>         maint.git_cmd = 1;
>>
>>         argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
>>         if (quiet)
>>                 argv_array_push(&maint.args, "--quiet");
>>         else
>>                 argv_array_push(&maint.args, "--no-quiet");
> 
> It's subjective, but this might be a good fit for the ternary operator:
> 
>     argv_array_pushl(&maint.args, "maintenance", "run", "--auto", NULL);
>     argv_array_push(&maint.args, quiet ? "--quiet" : "--no-quiet");

Good idea! Thanks.

> Or even:
> 
>     argv_array_pushl(&maint.args, "maintenance", "run", "--auto",
>         quiet ? "--quiet" : "--no-quiet", NULL);
> 
> The latter is a bit less readable to me.

I agree.
-Stolee

specified. No <refspec>s may be specified.

--[no-]auto-maintenance::
--[no-]auto-gc::
Run `git gc --auto` at the end to perform garbage collection
if needed. This is enabled by default.
Run `git maintenance run --auto` at the end to perform automatic
repository maintenance if needed. (`--[no-]auto-gc` is a synonym.)
This is enabled by default.

--[no-]write-commit-graph::
Write a commit-graph after fetching. This overrides the config
Expand Down
6 changes: 3 additions & 3 deletions Documentation/git-clone.txt
Expand Up @@ -78,9 +78,9 @@ repository using this option and then delete branches (or use any
other Git command that makes any existing commit unreferenced) in the
source repository, some objects may become unreferenced (or dangling).
These objects may be removed by normal Git operations (such as `git commit`)
which automatically call `git gc --auto`. (See linkgit:git-gc[1].)
If these objects are removed and were referenced by the cloned repository,
then the cloned repository will become corrupt.
which automatically call `git maintenance run --auto`. (See
linkgit:git-maintenance[1].) If these objects are removed and were referenced
by the cloned repository, then the cloned repository will become corrupt.
+
Note that running `git repack` without the `--local` option in a repository
cloned with `--shared` will copy objects from the source repository into a pack
Expand Down
120 changes: 120 additions & 0 deletions Documentation/git-maintenance.txt
@@ -0,0 +1,120 @@
git-maintenance(1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> +static int run_write_commit_graph(struct repository *r)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "-C", r->worktree,
> +			 "commit-graph", "write",
> +			 "--split", "--reachable",
> +			 NULL);

As mentioned in my reply to an earlier patch (sent a few minutes ago),
this won't work if there are environment variables like GIT_DIR present.

Besides that, the overall design looks reasonable.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/8/2020 10:29 PM, Jonathan Tan wrote:
>> +static int run_write_commit_graph(struct repository *r)
>> +{
>> +	int result;
>> +	struct argv_array cmd = ARGV_ARRAY_INIT;
>> +
>> +	argv_array_pushl(&cmd, "-C", r->worktree,
>> +			 "commit-graph", "write",
>> +			 "--split", "--reachable",
>> +			 NULL);
> 
> As mentioned in my reply to an earlier patch (sent a few minutes ago),
> this won't work if there are environment variables like GIT_DIR present.

Do we not pass GIT_DIR to the subcommand? Or does using "-C" override
the GIT_DIR?

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Thu, Jul 09, 2020 at 07:14:41AM -0400, Derrick Stolee wrote:

> On 7/8/2020 10:29 PM, Jonathan Tan wrote:
> >> +static int run_write_commit_graph(struct repository *r)
> >> +{
> >> +	int result;
> >> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> >> +
> >> +	argv_array_pushl(&cmd, "-C", r->worktree,
> >> +			 "commit-graph", "write",
> >> +			 "--split", "--reachable",
> >> +			 NULL);
> > 
> > As mentioned in my reply to an earlier patch (sent a few minutes ago),
> > this won't work if there are environment variables like GIT_DIR present.
> 
> Do we not pass GIT_DIR to the subcommand? Or does using "-C" override
> the GIT_DIR?

We do pass GIT_DIR to the subcommand, and "-C" does not override it. I
think this code would work as long as "r" is the_repository, which it
would be in the current code. But then the "-C" would be doing nothing
useful (it might change to the top of the worktree if we weren't there
for some reason, but I don't think "commit-graph write" would care
either way).

But if "r" is some other repository, "commit-graph" would continue to
operate in the parent process repository because of the inherited
GIT_DIR. Using "--git-dir" would solve that, but as a general practice,
if you're spawning a sub-process that might be in another repository,
you should clear any repo-specific environment variables. The list is in
local_repo_env, which you can feed to the "env" or "env_array" parameter
of a child_process (see the use in connect.c for an example).

Even in the current scheme where "r" is always the_repository, I suspect
this might still be buggy. If we're in a bare repository, presumably
r->worktree would be NULL.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/9/2020 6:52 PM, Jeff King wrote:
> On Thu, Jul 09, 2020 at 07:14:41AM -0400, Derrick Stolee wrote:
> 
>> On 7/8/2020 10:29 PM, Jonathan Tan wrote:
>>>> +static int run_write_commit_graph(struct repository *r)
>>>> +{
>>>> +	int result;
>>>> +	struct argv_array cmd = ARGV_ARRAY_INIT;
>>>> +
>>>> +	argv_array_pushl(&cmd, "-C", r->worktree,
>>>> +			 "commit-graph", "write",
>>>> +			 "--split", "--reachable",
>>>> +			 NULL);
>>>
>>> As mentioned in my reply to an earlier patch (sent a few minutes ago),
>>> this won't work if there are environment variables like GIT_DIR present.
>>
>> Do we not pass GIT_DIR to the subcommand? Or does using "-C" override
>> the GIT_DIR?
> 
> We do pass GIT_DIR to the subcommand, and "-C" does not override it. I
> think this code would work as long as "r" is the_repository, which it
> would be in the current code. But then the "-C" would be doing nothing
> useful (it might change to the top of the worktree if we weren't there
> for some reason, but I don't think "commit-graph write" would care
> either way).
> 
> But if "r" is some other repository, "commit-graph" would continue to
> operate in the parent process repository because of the inherited
> GIT_DIR. Using "--git-dir" would solve that, but as a general practice,
> if you're spawning a sub-process that might be in another repository,
> you should clear any repo-specific environment variables. The list is in
> local_repo_env, which you can feed to the "env" or "env_array" parameter
> of a child_process (see the use in connect.c for an example).
> 
> Even in the current scheme where "r" is always the_repository, I suspect
> this might still be buggy. If we're in a bare repository, presumably
> r->worktree would be NULL.

Ah. I'll investigate this more and work to create a way to
run a subcommand in a given repository. Your pointers will
help a lot.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/hidden/<remote> refs that no
>    longer appear on the remote.

Having a ref path where Git can place commit IDs that it needs persisted
is useful, not only in this case but in other cases (e.g. when fetching
a submodule commit by hash, we might not have a ref name for that commit
but want to persist it anyway), so I look forward to having something
like this.

The name of this special ref path and its specific nature could be
discussed further, but maybe it is sufficient for now to just say that
the refs under this special ref path are controlled by Git, and their
layout is experimental and subject to change (e.g. future versions of
Git could just erase the entire path and rewrite the refs its own way).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 35b0be7d40..9204762e21 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -73,6 +73,10 @@ OPTIONS
>  --quiet::
>  	Do not report progress or other information over `stderr`.
>  
> +--task=<task>::
> +	If this option is specified one or more times, then only run the
> +	specified tasks in the specified order.
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 2cd17398ec..c58dea6fa5 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -710,6 +710,7 @@ static const char * const builtin_maintenance_usage[] = {
>  static struct maintenance_opts {
>  	int auto_flag;
>  	int quiet;
> +	int tasks_selected;
>  } opts;
>  
>  static int run_write_commit_graph(void)
> @@ -804,20 +805,38 @@ typedef int maintenance_task_fn(void);
>  struct maintenance_task {
>  	const char *name;
>  	maintenance_task_fn *fn;
> -	unsigned enabled:1;
> +	int task_order;
> +	unsigned enabled:1,
> +		 selected:1;
>  };
>  
>  static struct maintenance_task *tasks[MAX_NUM_TASKS];
>  static int num_tasks;
>  
> +static int compare_tasks_by_selection(const void *a_, const void *b_)
> +{
> +	const struct maintenance_task *a, *b;
> +	a = (const struct maintenance_task *)a_;
> +	b = (const struct maintenance_task *)b_;
> +
> +	return b->task_order - a->task_order;
> +}

It forces the reader to know intimately that task_order *is*
selection order in order to understand why this is "tasks by
selection".  Perhaps rename the field to match what it is
(i.e. "selection_order")?

>  static int maintenance_run(void)
>  {
>  	int i;
>  	int result = 0;
>  
> +	if (opts.tasks_selected)
> +		QSORT(tasks, num_tasks, compare_tasks_by_selection);
> +
>  	for (i = 0; !result && i < num_tasks; i++) {
> -		if (!tasks[i]->enabled)
> +		if (opts.tasks_selected && !tasks[i]->selected)
> +			continue;
> +
> +		if (!opts.tasks_selected && !tasks[i]->enabled)
>  			continue;

I am not sure about this.  Even if the task <x> is disabled, if the
user says --task=<x>, it is run anyway?  Doesn't make an immediate
sense to me.

As I am bad at deciphering de Morgan, I'd find it easier to read if
the first were written more like so:

		if (!(!opts.tasks_selected || tasks[i]->selected))
			continue;

That is, "do not skip any when not limited, and do not skip the ones
that are selected when limited".  And that would easily extend to

		if (!tasks[i]->enabled ||
		    !(!opts.tasks_selected || tasks[i]->selected))
			continue;
> +
>  		result = tasks[i]->fn();
>  	}

> @@ -842,6 +861,44 @@ static void initialize_tasks(void)
>  	num_tasks++;
>  }
>  
> +static int task_option_parse(const struct option *opt,
> +			     const char *arg, int unset)
> +{
> +	int i;
> +	struct maintenance_task *task = NULL;
> +
> +	BUG_ON_OPT_NEG(unset);
> +
> +	if (!arg || !strlen(arg)) {
> +		error(_("--task requires a value"));
> +		return 1;

There is no need to special case an empty string that was explicitly
given as the value---it will be caught as "'' is not a valid task".

> +	}
> +
> +	opts.tasks_selected++;
> +
> +	for (i = 0; i < MAX_NUM_TASKS; i++) {
> +		if (tasks[i] && !strcasecmp(tasks[i]->name, arg)) {
> +			task = tasks[i];
> +			break;
> +		}
> +	}
> +
> +	if (!task) {
> +		error(_("'%s' is not a valid task"), arg);
> +		return 1;
> +	}
> +
> +	if (task->selected) {
> +		error(_("task '%s' cannot be selected multiple times"), arg);
> +		return 1;
> +	}
> +
> +	task->selected = 1;
> +	task->task_order = opts.tasks_selected;
> +
> +	return 0;
> +}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

>>  	for (i = 0; !result && i < num_tasks; i++) {
>> -		if (!tasks[i]->enabled)
>> +		if (opts.tasks_selected && !tasks[i]->selected)
>> +			continue;
>> +
>> +		if (!opts.tasks_selected && !tasks[i]->enabled)
>>  			continue;
>
> I am not sure about this.  Even if the task <x> is disabled, if the
> user says --task=<x>, it is run anyway?  Doesn't make an immediate
> sense to me.

OK, after seeing the title of 12/18, it does make sense.  Even if it
is disabled by default, you could still choose to run, which makes
sense.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/23/2020 4:21 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +static int compare_tasks_by_selection(const void *a_, const void *b_)
>> +{
>> +	const struct maintenance_task *a, *b;
>> +	a = (const struct maintenance_task *)a_;
>> +	b = (const struct maintenance_task *)b_;
>> +
>> +	return b->task_order - a->task_order;
>> +}
> 
> It forces the reader to know intimately that task_order *is*
> selection order in order to understand why this is "tasks by
> selection".  Perhaps rename the field to match what it is
> (i.e. "selection_order")?

Good idea. I made this fix locally.

>>  static int maintenance_run(void)
>>  {
>>  	int i;
>>  	int result = 0;
>>  
>> +	if (opts.tasks_selected)
>> +		QSORT(tasks, num_tasks, compare_tasks_by_selection);
>> +
>>  	for (i = 0; !result && i < num_tasks; i++) {
>> -		if (!tasks[i]->enabled)
>> +		if (opts.tasks_selected && !tasks[i]->selected)
>> +			continue;
>> +
>> +		if (!opts.tasks_selected && !tasks[i]->enabled)
>>  			continue;
> 
> I am not sure about this.  Even if the task <x> is disabled, if the
> user says --task=<x>, it is run anyway?  Doesn't make an immediate
> sense to me.

You already replied that you figured this out. However, I could make
it easier by adding some foreshadowing in the commit message here.

> As I am bad at deciphering de Morgan, I'd find it easier to read if
> the first were written more like so:
> 
> 		if (!(!opts.tasks_selected || tasks[i]->selected))
> 			continue;
> 
> That is, "do not skip any when not limited, and do not skip the ones
> that are selected when limited".  And that would easily extend to
> 
> 		if (!tasks[i]->enabled ||
> 		    !(!opts.tasks_selected || tasks[i]->selected))
> 			continue;

This isn't quite right, due to the confusing nature of "enabled".
The condition here will _never_ allow selecting a disabled task.

Perhaps it would be better to rename 'enabled' to 'run_by_default'?
That would make it clear that it is choosing which tasks to run unless
specified otherwise with --task=<task> here, the config option
maintenance.<task>.enabled later, and the --auto conditions even later.
Looking even farther down the line (into the next series) there will be
similar checks for auto-conditionschecking time-based schedules.

Since this loop becomes more complicated in the future, I specifically
wanted to group the "skip this task" conditions into their own if
blocks:

	1. If the user didn't specify --task=<task> explicitly and this
 	   task is disabled, then skip this task.

	2. If the user _did_ specify --task=<task> explicitly and this
	   task was not on the list, then skip this task.

	3. If the user specified --auto and the auto condition fails,
	   then skip this task.

	4. (Later) If the user specified --scheduled and the time since
	   the last run is too soon, then skip this task.

With this being the planned future, I'd prefer these be split out as
separate if conditions instead of a giant combined if. And since that
is the plan, then I won't work too hard to combine conditions 1 and 2
into a single condition.

>> +
>>  		result = tasks[i]->fn();
>>  	}
> 
>> @@ -842,6 +861,44 @@ static void initialize_tasks(void)
>>  	num_tasks++;
>>  }
>>  
>> +static int task_option_parse(const struct option *opt,
>> +			     const char *arg, int unset)
>> +{
>> +	int i;
>> +	struct maintenance_task *task = NULL;
>> +
>> +	BUG_ON_OPT_NEG(unset);
>> +
>> +	if (!arg || !strlen(arg)) {
>> +		error(_("--task requires a value"));
>> +		return 1;
> 
> There is no need to special case an empty string that was explicitly
> given as the value---it will be caught as "'' is not a valid task".

Sounds good. No need for this extra message.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

>>>  	for (i = 0; !result && i < num_tasks; i++) {
>>> -		if (!tasks[i]->enabled)
>>> +		if (opts.tasks_selected && !tasks[i]->selected)
>>> +			continue;
>>> +
>>> +		if (!opts.tasks_selected && !tasks[i]->enabled)
>>>  			continue;
>> 
>> I am not sure about this.  Even if the task <x> is disabled, if the
>> user says --task=<x>, it is run anyway?  Doesn't make an immediate
>> sense to me.
>>  ...
>> 		if (!tasks[i]->enabled ||
>> 		    !(!opts.tasks_selected || tasks[i]->selected))
>> 			continue;
>
> This isn't quite right, due to the confusing nature of "enabled".

Yes, in the message you are responding to, I was still assuming that
--task=foo that defeat task.foo.enabled=no would be a bug.  If we
want to run disabled tasks by selection, of course the condition
would need to change.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static int maintenance_task_commit_graph(void)
> +{
> +	struct repository *r = the_repository;
> +	char *chain_path;
> +
> +	/* Skip commit-graph when --auto is specified. */
> +	if (opts.auto_flag)
> +		return 0;

Stepping back a bit, back in "git gc" days, "--auto" had two
distinct meanings rolled into one.  Check if it even needs to be
done, and perform only the lightweight variant if needed.

For this task, there is no "lightweight variant" is possible, so
returning without checking the need to do a lightweight one makes
perfect sense here.

But wouldn't it suggest perhaps we could name "auto" field of the
options struct in a more meaningful way?  Perhaps "quick" (i.e. only
the quicker-variant of the maintenance job) or something?

> +	close_object_store(r->objects);
> +	if (run_write_commit_graph()) {
> +		error(_("failed to write commit-graph"));
> +		return 1;
> +	}
> +
> +	if (!run_verify_commit_graph())
> +		return 0;
> +
> +	warning(_("commit-graph verify caught error, rewriting"));
> +
> +	chain_path = get_commit_graph_chain_filename(r->objects->odb);
> +	if (unlink(chain_path)) {
> +		UNLEAK(chain_path);
> +		die(_("failed to remove commit-graph at %s"), chain_path);

OK.

> +	}
> +	free(chain_path);
> +
> +	if (!run_write_commit_graph())
> +		return 0;
> +
> +	error(_("failed to rewrite commit-graph"));
> +	return 1;
> +}

Error convention is "positive for error, zero for success?"  That is
a bit unusual for our internal API.

>  static int maintenance_task_gc(void)
>  {
>  	int result;
> @@ -768,6 +836,10 @@ static void initialize_tasks(void)
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;
>  	num_tasks++;
> +
> +	tasks[num_tasks]->name = "commit-graph";
> +	tasks[num_tasks]->fn = maintenance_task_commit_graph;
> +	num_tasks++;

Again, I am not sure if we want to keep piling on this.

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index e4e4036e50..216ac0b19e 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -12,7 +12,7 @@ test_expect_success 'help text' '
>  	test_i18ngrep "usage: git maintenance run" err
>  '
>  
> -test_expect_success 'gc [--auto|--quiet]' '
> +test_expect_success 'run [--auto|--quiet]' '

It does not look like this change belongs here.  If "run" is
appropriate title for this test at this step, it must have been so
in the previous step.

>  	GIT_TRACE2_EVENT="$(pwd)/run-no-auto.txt" git maintenance run --no-quiet &&
>  	GIT_TRACE2_EVENT="$(pwd)/run-auto.txt" git maintenance run --auto &&
>  	GIT_TRACE2_EVENT="$(pwd)/run-quiet.txt" git maintenance run --quiet &&

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/23/2020 4:22 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +static int maintenance_task_commit_graph(void)
>> +{
>> +	struct repository *r = the_repository;
>> +	char *chain_path;
>> +
>> +	/* Skip commit-graph when --auto is specified. */
>> +	if (opts.auto_flag)
>> +		return 0;

Now that you point this out, this is actually a stray condition
from an earlier version. We now have the ".enabled" config and
the auto condition function pointer. That handles all of that
"should we run this when --auto is specified?" logic outside of
the task itself.
 
> Stepping back a bit, back in "git gc" days, "--auto" had two
> distinct meanings rolled into one.  Check if it even needs to be
> done, and perform only the lightweight variant if needed.
> 
> For this task, there is no "lightweight variant" is possible, so
> returning without checking the need to do a lightweight one makes
> perfect sense here.
> 
> But wouldn't it suggest perhaps we could name "auto" field of the
> options struct in a more meaningful way?  Perhaps "quick" (i.e. only
> the quicker-variant of the maintenance job) or something?

But you are discussing here how the _behavior_ can change when
--auto is specified. And specifically, "git gc --auto" really
meant "This is running after a foreground command, so only do
work if necessary and do it quickly to minimize blocking time."

I'd be happy to replace "--auto" with "--quick" in the
maintenance builtin.

This opens up some extra design space for how the individual
tasks perform depending on "--quick" being specified or not.
My intention was to create tasks that are already in "quick"
mode:

* loose-objects have a maximum batch size.
* incremental-repack is capped in size.
* commit-graph uses the --split option.

But this "quick" distinction might be important for some of
the tasks we intend to extract from the gc builtin.

>> +	close_object_store(r->objects);
>> +	if (run_write_commit_graph()) {
>> +		error(_("failed to write commit-graph"));
>> +		return 1;
>> +	}
>> +
>> +	if (!run_verify_commit_graph())
>> +		return 0;
>> +
>> +	warning(_("commit-graph verify caught error, rewriting"));
>> +
>> +	chain_path = get_commit_graph_chain_filename(r->objects->odb);
>> +	if (unlink(chain_path)) {
>> +		UNLEAK(chain_path);
>> +		die(_("failed to remove commit-graph at %s"), chain_path);
> 
> OK.
> 
>> +	}
>> +	free(chain_path);
>> +
>> +	if (!run_write_commit_graph())
>> +		return 0;
>> +
>> +	error(_("failed to rewrite commit-graph"));
>> +	return 1;
>> +}
> 
> Error convention is "positive for error, zero for success?"  That is
> a bit unusual for our internal API.

Since the tasks are frequently running subcommands, returning
0 for success and non-zero for error matches the error codes
returned by those subcommands.

Should I instead change the behavior and clearly document that
task functions matching maintenance_task_fn follow this error
pattern?

>> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
>> index e4e4036e50..216ac0b19e 100755
>> --- a/t/t7900-maintenance.sh
>> +++ b/t/t7900-maintenance.sh
>> @@ -12,7 +12,7 @@ test_expect_success 'help text' '
>>  	test_i18ngrep "usage: git maintenance run" err
>>  '
>>  
>> -test_expect_success 'gc [--auto|--quiet]' '
>> +test_expect_success 'run [--auto|--quiet]' '
> 
> It does not look like this change belongs here.  If "run" is
> appropriate title for this test at this step, it must have been so
> in the previous step.

Thanks. Will fix.

-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

> But you are discussing here how the _behavior_ can change when
> --auto is specified. And specifically, "git gc --auto" really
> meant "This is running after a foreground command, so only do
> work if necessary and do it quickly to minimize blocking time."
>
> I'd be happy to replace "--auto" with "--quick" in the
> maintenance builtin.
>
> This opens up some extra design space for how the individual
> tasks perform depending on "--quick" being specified or not.
> My intention was to create tasks that are already in "quick"
> mode:
>
> * loose-objects have a maximum batch size.
> * incremental-repack is capped in size.
> * commit-graph uses the --split option.
>
> But this "quick" distinction might be important for some of
> the tasks we intend to extract from the gc builtin.

Yup.  To be honest, I came to this topic from a completely different
direction.  The field name "auto" alone (and no other field name)
had to have an extra cruft (i.e. "_flag") attached to it, which is
understandable but ugly.  Then I started thinking if 'auto(matic)'
is really the right word to describe what we want out of the option,
and came to the realization that there may be better words.

> Since the tasks are frequently running subcommands, returning
> 0 for success and non-zero for error matches the error codes
> returned by those subcommands.

As long as these will _never_ be called from other helper functions
but from the cmd_foo() top-level and their return values are only
used directly as the top-level's return value, I do not mind too
much.

But whenever I am writing such a code, I find myself not brave
enough to make such a bold promise (I saw other people call the
helpers I wrote in unintended ways and had to adjust the semantics
of them to accomodate the new callers too many times), so I'd rather
see the caller do "return !!helper_fn()" to allow helper_fn() to be
written more naturally (e.g. letting them return error(...)).

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Jul 24, 2020 at 12:47:00PM -0700, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
>
> > But you are discussing here how the _behavior_ can change when
> > --auto is specified. And specifically, "git gc --auto" really
> > meant "This is running after a foreground command, so only do
> > work if necessary and do it quickly to minimize blocking time."
> >
> > I'd be happy to replace "--auto" with "--quick" in the
> > maintenance builtin.
> >
> > This opens up some extra design space for how the individual
> > tasks perform depending on "--quick" being specified or not.
> > My intention was to create tasks that are already in "quick"
> > mode:
> >
> > * loose-objects have a maximum batch size.
> > * incremental-repack is capped in size.
> > * commit-graph uses the --split option.
> >
> > But this "quick" distinction might be important for some of
> > the tasks we intend to extract from the gc builtin.
>
> Yup.  To be honest, I came to this topic from a completely different
> direction.  The field name "auto" alone (and no other field name)
> had to have an extra cruft (i.e. "_flag") attached to it, which is
> understandable but ugly.  Then I started thinking if 'auto(matic)'
> is really the right word to describe what we want out of the option,
> and came to the realization that there may be better words.

I wonder what the quick and slow paths are here. For the commit-graph
code, what you wrote here seems to match what I'd expect with passing
'--auto' in the sense of running 'git gc'. That is, I'm leaving it up to
the commit-graph machinery's idea of the normal '--split' rules to
figure out when to roll up layers of a commit-graph, as opposed to
creating a new layer and extending the chain.

So, I think that makes sense if the caller gave '--auto'. But, I'm not
sure that it makes sense if they didn't, in which case I'd imagine
something quicker to happen. There, I'd expect something more like:

  1. Run 'git commit-graph write --reachable --split=no-merge'.
  2. Run 'git commit-graph verify'.
  3. If 'git commit-graph verify' failed, drop the existing commit graph
     and rebuild it with 'git commit-graph --reachable --split=replace'.
  4. Otherwise, do nothing.

I'm biased, of course, but I think that that matches roughly what I'd
expect to happen in the fast/slow path. Granted, the steps to rebuild
the commit graph are going to be slow no matter what (depending on the
size of the repository), and so in that case maybe the commit-graph
should just be dropped. I'm not really sure what to do about that...

> > Since the tasks are frequently running subcommands, returning
> > 0 for success and non-zero for error matches the error codes
> > returned by those subcommands.
>
> As long as these will _never_ be called from other helper functions
> but from the cmd_foo() top-level and their return values are only
> used directly as the top-level's return value, I do not mind too
> much.
>
> But whenever I am writing such a code, I find myself not brave
> enough to make such a bold promise (I saw other people call the
> helpers I wrote in unintended ways and had to adjust the semantics
> of them to accomodate the new callers too many times), so I'd rather
> see the caller do "return !!helper_fn()" to allow helper_fn() to be
> written more naturally (e.g. letting them return error(...)).
>
> Thanks.

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/24/2020 9:52 PM, Taylor Blau wrote:
> On Fri, Jul 24, 2020 at 12:47:00PM -0700, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> But you are discussing here how the _behavior_ can change when
>>> --auto is specified. And specifically, "git gc --auto" really
>>> meant "This is running after a foreground command, so only do
>>> work if necessary and do it quickly to minimize blocking time."
>>>
>>> I'd be happy to replace "--auto" with "--quick" in the
>>> maintenance builtin.
>>>
>>> This opens up some extra design space for how the individual
>>> tasks perform depending on "--quick" being specified or not.
>>> My intention was to create tasks that are already in "quick"
>>> mode:
>>>
>>> * loose-objects have a maximum batch size.
>>> * incremental-repack is capped in size.
>>> * commit-graph uses the --split option.
>>>
>>> But this "quick" distinction might be important for some of
>>> the tasks we intend to extract from the gc builtin.
>>
>> Yup.  To be honest, I came to this topic from a completely different
>> direction.  The field name "auto" alone (and no other field name)
>> had to have an extra cruft (i.e. "_flag") attached to it, which is
>> understandable but ugly.  Then I started thinking if 'auto(matic)'
>> is really the right word to describe what we want out of the option,
>> and came to the realization that there may be better words.
> 
> I wonder what the quick and slow paths are here. For the commit-graph
> code, what you wrote here seems to match what I'd expect with passing
> '--auto' in the sense of running 'git gc'. That is, I'm leaving it up to
> the commit-graph machinery's idea of the normal '--split' rules to
> figure out when to roll up layers of a commit-graph, as opposed to
> creating a new layer and extending the chain.

I had intended all of my new tasks to be the "quick" version of their
operations. The "slow" version would abandon hope of doing a small
amount of work to create the best possible world for the repository.
This would include:

 * The commit-graph would collapse all layers into one file.
 * The multi-pack-index repack would rewrite all object data into one
   pack-file.
 * The loose-objects task would not stop at a maximum number of loose
   objects (and would probably want to repack everything, anyway).

I'm open to making this possibility more explicit by renaming "--auto"
and just performing a translation to 'git gc --auto'. So, what should
the name be? Here are a few options to consider:

 --quick
 --fast
 --limited
 --incremental
 -O[0|1|2...] (think GCC optimization flags, exposing granularity)
 --[non-]aggressive

Regardless, this makes me rethink that the --[no-]maintenance option
from PATCH 03/18 is better than --[no-]auto-maintenance, since we are
really saying "run _some_ maintenance or _no_ maintenance" and the "how"
of the maintenance is left intentionally vague. I've already made the
change locally to add "auto-" so I'll wait for confirmation before
reverting that change.

> So, I think that makes sense if the caller gave '--auto'. But, I'm not
> sure that it makes sense if they didn't, in which case I'd imagine
> something quicker to happen. There, I'd expect something more like:
> 
>   1. Run 'git commit-graph write --reachable --split=no-merge'.
>   2. Run 'git commit-graph verify'.
>   3. If 'git commit-graph verify' failed, drop the existing commit graph
>      and rebuild it with 'git commit-graph --reachable --split=replace'.
>   4. Otherwise, do nothing.
> 
> I'm biased, of course, but I think that that matches roughly what I'd
> expect to happen in the fast/slow path. Granted, the steps to rebuild
> the commit graph are going to be slow no matter what (depending on the
> size of the repository), and so in that case maybe the commit-graph
> should just be dropped. I'm not really sure what to do about that...

I think this approach is the best we can do given the current behavior
inside the commit-graph builtin. Perhaps in the future we could change
the commit-graph builtin to include a "--verify" option so it could do
the "git commit-graph verify --shallow" on the new layer before committing
the new commit-graph-chain file and expiring old layers. That way, we
would not need to delete and rewrite the whole thing when there is a
problem writing the top layer.

>>> Since the tasks are frequently running subcommands, returning
>>> 0 for success and non-zero for error matches the error codes
>>> returned by those subcommands.
>>
>> As long as these will _never_ be called from other helper functions
>> but from the cmd_foo() top-level and their return values are only
>> used directly as the top-level's return value, I do not mind too
>> much.
>>
>> But whenever I am writing such a code, I find myself not brave
>> enough to make such a bold promise (I saw other people call the
>> helpers I wrote in unintended ways and had to adjust the semantics
>> of them to accomodate the new callers too many times), so I'd rather
>> see the caller do "return !!helper_fn()" to allow helper_fn() to be
>> written more naturally (e.g. letting them return error(...)).

I will try to be consistent here with the behavior:

 * 0 is success
 * 1 is failure

Which is what I think you are implying by "return !!helper_fn()".

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 9204762e21..0927643247 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +prefetch::
> +	The `fetch` task updates the object directory with the latest objects

s/fetch/prefetch/ most likely.

> +	from all registered remotes. For each remote, a `git fetch` command
> +	is run. The refmap is custom to avoid updating local or remote

s/remote/remote-tracking/ definitely.  Do not forget the hyphen
between the two words.

I think it made the above unnecessarily confusing that you ended a
sentence after "is run".  It gives a wrong impression that you'd be
doing a "real fetch", which you need to dispel with a follow up
description of the refmap.

	For each remote, a `git fetch` command is run with a refspec
	to fetch their branches (those in their `refs/heads`) into
	our `refs/prefetch/<remote>/` hierarchy and without auto
	following tags (the configured refspec in the repository is
	ignored).

> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
> +	not updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are

s/is notified/are notified/???

> +updated on the remote.

Often, when one needs to say "X.  This means Y.", X is a suboptimal
way to explain what needs to be conveyed to the readers.  But this
is not such a case.  Unlike the "This means" that is often an
attempt to rephrasing a poor explanation given first, this gives an
implication.

But let's not start with a negative impression (i.e. even with
prefetch, I still have to do X?  What's the point???), but let them
feel why it is a good thing.  Perhaps (continuing my previous
rewrite):

	This is done to avoid disrupting the remote-tracking
	branches---the end users expect them to stay unmoved unless
	they initiate a fetch.  With prefetch task, however, the
	objects necessary to complete a later real fetch would
	already be obtained, so the real fetch would go faster.  In
	the ideal case, it will just become an update to bunch of
	remote-tracking branches without any object transfer.

or something like that?  

> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);

The command line looks somewhat fishy, but I think it is correct.
At first glance it looks like a mistake to pass "--refmap=" and the
refspec "+refs/heads/*:refs/prefetch/origin/*" as separate arguments,
but I think that is exactly what you want here, i.e.

 - defeat any refspec in the configuration with "--refmap=<empty>"

 - give explicit refspec "+refs/heads/*:...", with "--no-tags" to
   decline auto-following, to tell what exactly are to be fetched
   and stored where.

The description in the log message about refmap needs to be
clarified, though (I've already done so in the above suggested
rewrite).

> +static int maintenance_task_prefetch(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (for_each_remote(fill_each_remote, &remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */

The loop that runs different tasks abort at the first failure,
though.  Perhaps that loop needs to be rethought as well?

> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  static int maintenance_task_gc(void)
>  {
>  	int result;
> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>  
> +	tasks[num_tasks]->name = "prefetch";
> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;

Two things.

 - As I said upfront, I do not see the point of preparing the table
   with code.

 - The reason why prefetch is placed in front is probably because
   you do not want to repack before you add more objects to the
   object store.  But doesn't that imply that there is an inherent
   ordering that we, as those who are more expert on Git than the
   end users, prefer?  Is it a wise decision to let the users affect
   the order of the tasks run by giving command line options in
   different order in the previous step?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/23/2020 4:53 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
>> index 9204762e21..0927643247 100644
>> --- a/Documentation/git-maintenance.txt
>> +++ b/Documentation/git-maintenance.txt
>> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>>  `commit-graph-chain` file. They will be deleted by a later run based on
>>  the expiration delay.
>>  
>> +prefetch::
>> +	The `fetch` task updates the object directory with the latest objects
> 
> s/fetch/prefetch/ most likely.
> 
>> +	from all registered remotes. For each remote, a `git fetch` command
>> +	is run. The refmap is custom to avoid updating local or remote
> 
> s/remote/remote-tracking/ definitely.  Do not forget the hyphen
> between the two words.
> 
> I think it made the above unnecessarily confusing that you ended a
> sentence after "is run".  It gives a wrong impression that you'd be
> doing a "real fetch", which you need to dispel with a follow up
> description of the refmap.
> 
> 	For each remote, a `git fetch` command is run with a refspec
> 	to fetch their branches (those in their `refs/heads`) into
> 	our `refs/prefetch/<remote>/` hierarchy and without auto
> 	following tags (the configured refspec in the repository is
> 	ignored).
> 
>> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
>> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
>> +	not updated.
>> ++
>> +This means that foreground fetches are still required to update the
>> +remote refs, but the users is notified when the branches and tags are
> 
> s/is notified/are notified/???
> 
>> +updated on the remote.
> 
> Often, when one needs to say "X.  This means Y.", X is a suboptimal
> way to explain what needs to be conveyed to the readers.  But this
> is not such a case.  Unlike the "This means" that is often an
> attempt to rephrasing a poor explanation given first, this gives an
> implication.
> 
> But let's not start with a negative impression (i.e. even with
> prefetch, I still have to do X?  What's the point???), but let them
> feel why it is a good thing.  Perhaps (continuing my previous
> rewrite):
> 
> 	This is done to avoid disrupting the remote-tracking
> 	branches---the end users expect them to stay unmoved unless
> 	they initiate a fetch.  With prefetch task, however, the
> 	objects necessary to complete a later real fetch would
> 	already be obtained, so the real fetch would go faster.  In
> 	the ideal case, it will just become an update to bunch of
> 	remote-tracking branches without any object transfer.
> 
> or something like that?  

I like this clarification and have adapted it with minimal edits.

>> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
>> +			 "--no-tags", "--refmap=", NULL);
>> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
>> +	argv_array_push(&cmd, refmap.buf);
> 
> The command line looks somewhat fishy, but I think it is correct.
> At first glance it looks like a mistake to pass "--refmap=" and the
> refspec "+refs/heads/*:refs/prefetch/origin/*" as separate arguments,
> but I think that is exactly what you want here, i.e.
> 
>  - defeat any refspec in the configuration with "--refmap=<empty>"
> 
>  - give explicit refspec "+refs/heads/*:...", with "--no-tags" to
>    decline auto-following, to tell what exactly are to be fetched
>    and stored where.
> 
> The description in the log message about refmap needs to be
> clarified, though (I've already done so in the above suggested
> rewrite).

I could have made your life easier by referring to b40a50264ac
(fetch: document and test --refmap="", 2020-01-21) in my commit
message. It includes this sentence in Documentation/fetch-options.txt:

  Providing an empty `<refspec>` to the `--refmap` option causes
  Git to ignore the configured refspecs and rely entirely on the
  refspecs supplied as command-line arguments.

>> +static int maintenance_task_prefetch(void)
>> +{
>> +	int result = 0;
>> +	struct string_list_item *item;
>> +	struct string_list remotes = STRING_LIST_INIT_DUP;
>> +
>> +	if (for_each_remote(fill_each_remote, &remotes)) {
>> +		error(_("failed to fill remotes"));
>> +		result = 1;
>> +		goto cleanup;
>> +	}
>> +
>> +	/*
>> +	 * Do not modify the result based on the success of the 'fetch'
>> +	 * operation, as a loss of network could cause 'fetch' to fail
>> +	 * quickly. We do not want that to stop the rest of our
>> +	 * background operations.
>> +	 */
> 
> The loop that runs different tasks abort at the first failure,
> though.  Perhaps that loop needs to be rethought as well?

You're right. These maintenance tasks are intended to be
independent of each other, so let's try all of them and
report a failure after all have been given an opportunity
to run. That makes this failure behavior unnecessary.

>> +	for (item = remotes.items;
>> +	     item && item < remotes.items + remotes.nr;
>> +	     item++)
>> +		fetch_remote(item->string);
>> +
>> +cleanup:
>> +	string_list_clear(&remotes, 0);
>> +	return result;
>> +}
>> +
>>  static int maintenance_task_gc(void)
>>  {
>>  	int result;
>> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>>  
>> +	tasks[num_tasks]->name = "prefetch";
>> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
>> +	num_tasks++;
>> +
>>  	tasks[num_tasks]->name = "gc";
>>  	tasks[num_tasks]->fn = maintenance_task_gc;
>>  	tasks[num_tasks]->enabled = 1;
> 
> Two things.
> 
>  - As I said upfront, I do not see the point of preparing the table
>    with code.
> 
>  - The reason why prefetch is placed in front is probably because
>    you do not want to repack before you add more objects to the
>    object store.  But doesn't that imply that there is an inherent
>    ordering that we, as those who are more expert on Git than the
>    end users, prefer?  Is it a wise decision to let the users affect
>    the order of the tasks run by giving command line options in
>    different order in the previous step?

I don't anticipate users specifying --task=<task> very often, as
it requires deep knowledge of the tasks. If a user _does_ use the
option, then we should trust their order as they might have a
good reason to choose that order.

Generally, my philosophy is to provide expert users with flexible
choices while creating sensible defaults for non-expert users.

That said, if this sort is more of a problem than the value it
provides, then I can drop the sort and the duplicate --task=<task>
check.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

> I don't anticipate users specifying --task=<task> very often, as
> it requires deep knowledge of the tasks. If a user _does_ use the
> option, then we should trust their order as they might have a
> good reason to choose that order.
>
> Generally, my philosophy is to provide expert users with flexible
> choices while creating sensible defaults for non-expert users.

Sounds sensible.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Create a 'loose-objects' task for the 'git maintenance run' command.
> This helps clean up loose objects without disrupting concurrent Git
> commands using the following sequence of events:
>
> 1. Run 'git prune-packed' to delete any loose objects that exist
>    in a pack-file. Concurrent commands will prefer the packed
>    version of the object to the loose version. (Of course, there
>    are exceptions for commands that specifically care about the
>    location of an object. These are rare for a user to run on
>    purpose, and we hope a user that has selected background
>    maintenance will not be trying to do foreground maintenance.)

OK.  That would make sense.

> 2. Run 'git pack-objects' on a batch of loose objects. These
>    objects are grouped by scanning the loose object directories in
>    lexicographic order until listing all loose objects -or-
>    reaching 50,000 objects. This is more than enough if the loose
>    objects are created only by a user doing normal development.

I haven't seen this in action, but my gut feeling is that this would
result in horrible locality and deltification in the resulting
packfile.  The order you feed the objects to pack-objects and the
path hint you attach to each object matters quite a lot.

I do agree that it would be useful to have a task to deal with only
loose objects without touching existing packfiles.  I just am not
sure if 2. is a worthwhile thing to do.  A poorly constructed pack
will also contaminate later packfiles made without "-f" option to
"git repack".

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/23/2020 4:59 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Create a 'loose-objects' task for the 'git maintenance run' command.
>> This helps clean up loose objects without disrupting concurrent Git
>> commands using the following sequence of events:
>>
>> 1. Run 'git prune-packed' to delete any loose objects that exist
>>    in a pack-file. Concurrent commands will prefer the packed
>>    version of the object to the loose version. (Of course, there
>>    are exceptions for commands that specifically care about the
>>    location of an object. These are rare for a user to run on
>>    purpose, and we hope a user that has selected background
>>    maintenance will not be trying to do foreground maintenance.)
> 
> OK.  That would make sense.
> 
>> 2. Run 'git pack-objects' on a batch of loose objects. These
>>    objects are grouped by scanning the loose object directories in
>>    lexicographic order until listing all loose objects -or-
>>    reaching 50,000 objects. This is more than enough if the loose
>>    objects are created only by a user doing normal development.
> 
> I haven't seen this in action, but my gut feeling is that this would
> result in horrible locality and deltification in the resulting
> packfile.  The order you feed the objects to pack-objects and the
> path hint you attach to each object matters quite a lot.
> 
> I do agree that it would be useful to have a task to deal with only
> loose objects without touching existing packfiles.  I just am not
> sure if 2. is a worthwhile thing to do.  A poorly constructed pack
> will also contaminate later packfiles made without "-f" option to
> "git repack".

There are several factors going on here:

 * In a partial clone, it is likely that we get loose objects only
   due to a command like "git log -p" that downloads blobs
   one-by-one. In such a case, this step coming in later and picking
   up those blobs _will_ find good deltas because they are present
   in the same batch.

 * (I know this case isn't important to core Git, but please indulge
   me) In a VFS for Git repo, the loose objects correspond to blobs
   that were faulted in by a virtual filesystem read. In this case,
   the blobs are usually from a single commit in history, so good
   deltas between the blobs don't actually exist!

 * My experience indicates that the packs created by the
   loose-objects task are rather small (when created daily). This
   means that they get selected by the incremental-repack task to
   repack into a new pack-file where deltas are recomputed with modest
   success. As mentioned in that task, we saw a significant compression
   factor using that step for users of the Windows OS repo, mostly due
   to recomputing tree deltas.

 * Some amount of "extra" space is expected with this incremental
   repacking scheme. The most space-efficient thing to do is a full
   repack along with a tree walk that detects the paths used for each
   blob, allowing better hints for delta compression. However, that
   operation is very _time_ consuming. The trade-off here is something
   I should make more explicit. In my experience, disk space is cheap
   but CPU time is expensive. Most repositories could probably do a
   daily repack without being a disruption to the user. These steps
   enable maintenance for repositories where a full repack is too
   disruptive.

I hope this adds some context. I would love if someone who knows more
about delta compression could challenge my assumptions. Sharing that
expertise can help create better maintenance strategies. Junio's
initial concern here is a good first step in that direction.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

>  * In a partial clone, it is likely that we get loose objects only
>    due to a command like "git log -p" that downloads blobs
>    one-by-one. In such a case, this step coming in later and picking
>    up those blobs _will_ find good deltas because they are present
>    in the same batch.
>
>  * (I know this case isn't important to core Git, but please indulge
>    me) In a VFS for Git repo, the loose objects correspond to blobs
>    that were faulted in by a virtual filesystem read. In this case,
>    the blobs are usually from a single commit in history, so good
>    deltas between the blobs don't actually exist!

Let me stop here by saying that I am now starting to worry about
overfitting the repacking strategy to lazy clone repositories.  I am
perfectly fine with the plan to start with just one strategy overfit
for partially cloned repositories, as long as we make sure that we
can be extended it to suit other access/object acquisition patterns.



Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.
>
>    When the multi-pack-index sees two copies of the same object, it
>    stores the offset data into the newer pack-file. This means that
>    some old pack-files could become "unreferenced" which I will use
>    to mean "a pack-file that is in the pack-file list of the
>    multi-pack-index but none of the objects in the multi-pack-index
>    reference a location inside that pack-file."

An obvious alternative is to favor the copy in the older pack,
right?  Is the expectation that over time, most of the objects that
are relevant would reappear in newer packs, so that eventually by
favoring the copies in the newer packs, we can retire and remove the
old pack, keeping only the newer ones?

But would that assumption hold?  The old packs hold objects that are
necessary for the older parts of the history, so unless you are
cauterizing away the old history, these objects in the older packs
are likely to stay with us longer than those used by the newer parts
of the history, some of which may not even have been pushed out yet
and can be rebased away?

> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>    and updaes the multi-pack-index to drop those pack-files from the
>    list. This is safe to do as concurrent Git processes will see the
>    multi-pack-index and not open those packs when looking for object
>    contents. (Similar to the 'loose-objects' job, there are some Git
>    commands that open pack-files regardless of the multi-pack-index,
>    but they are rarely used. Further, a user that self-selects to
>    use background operations would likely refrain from using those
>    commands.)

OK.

> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
>    of pack-files that are listed in the multi-pack-index and creates
>    a new pack-file containing the objects whose offsets are listed
>    by the multi-pack-index to be in those objects. The set of pack-
>    files is selected greedily by sorting the pack-files by modified
>    time and adding a pack-file to the set if its "expected size" is
>    smaller than the batch size until the total expected size of the
>    selected pack-files is at least the batch size. The "expected
>    size" is calculated by taking the size of the pack-file divided
>    by the number of objects in the pack-file and multiplied by the
>    number of objects from the multi-pack-index with offset in that
>    pack-file. The expected size approximats how much data from that

approximates.

>    pack-file will contribute to the resulting pack-file size. The
>    intention is that the resulting pack-file will be close in size
>    to the provided batch size.

> +static int maintenance_task_incremental_repack(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}

Hmph, I wonder if these warning should come from each helper
functions that are static to this function anyway.

It also makes it easier to reason about this function by eliminating
the need for having a different pattern only for the verify helper.
Instead, verify could call rewrite internally when it notices a
breakage.  I.e.

	if (multi_pack_index_write())
		return 1;
	if (multi_pack_index_verify("after initial write"))
		return 1;
	if (multi_pack_index_exire())
		return 1;
	...

Also, it feels odd, compared to our internal API convention, that
positive non-zero is used as an error here.

> +	return 0;
> +}
> +
>  typedef int maintenance_task_fn(void);
>  
>  struct maintenance_task {
> @@ -1037,6 +1152,10 @@ static void initialize_tasks(void)
>  	tasks[num_tasks]->fn = maintenance_task_loose_objects;
>  	num_tasks++;
>  
> +	tasks[num_tasks]->name = "incremental-repack";
> +	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;

Exactly the same comment as 08/18 about natural/inherent ordering
applies here as well.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/23/2020 6:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>>    one did not exist, and otherwise will update the multi-pack-index
>>    with any new pack-files that appeared since the last write. This
>>    is particularly relevant with the background fetch job.
>>
>>    When the multi-pack-index sees two copies of the same object, it
>>    stores the offset data into the newer pack-file. This means that
>>    some old pack-files could become "unreferenced" which I will use
>>    to mean "a pack-file that is in the pack-file list of the
>>    multi-pack-index but none of the objects in the multi-pack-index
>>    reference a location inside that pack-file."
> 
> An obvious alternative is to favor the copy in the older pack,
> right?  Is the expectation that over time, most of the objects that
> are relevant would reappear in newer packs, so that eventually by
> favoring the copies in the newer packs, we can retire and remove the
> old pack, keeping only the newer ones?
> 
> But would that assumption hold?  The old packs hold objects that are
> necessary for the older parts of the history, so unless you are
> cauterizing away the old history, these objects in the older packs
> are likely to stay with us longer than those used by the newer parts
> of the history, some of which may not even have been pushed out yet
> and can be rebased away?

If we created a new pack-file containing an already-packed object,
then shouldn't we assume that the new pack-file does a _better_ job
of compressing that object? Or at least, doesn't make it worse?

For example, if we use 'git multi-pack-index repack --batch-size=0',
then this creates a new pack-file containing every previously-packed
object. This new pack-file should have better delta compression than
the previous setup across multiple pack-files. We want this new
pack-file to be used, not the old ones.

This "pick the newest pack" strategy is also what allows us to safely
use the 'expire' option to drop old pack-files. If we always keep the
old copies, then when we try to switch the new pack-files, we cannot
delete the old packs safely because a concurrent Git process could
be trying to reference it. One race is as follows:

 * Process A opens the multi-pack-index referring to old pack P. It
   doesn't open the pack-file as it hasn't tried to parse objects yet.

 * Process B is 'git multi-pack-index expire'. It sees that pack P can
   be dropped because all objects appear in newer pack-files. It deletes
   P.

 * Process A tries to read from P, and this fails. A must reprepare its
   representation of the pack-files.

This is the less disruptive race since A can recover with a small cost
to its performance. The worse race (on Windows) is this:

 * Process A loads the multi-pack-index and tries to parse an object by
   loading "old" pack P.

 * Process B tries to delete P. However, on Windows the handle to P by
   A prevents the deletion.

At this point, there could be two resolutions. The first is to have
the 'expire' fail because we can't delete A. This means we might never
delete A in a busy repository. The second is that the 'expire' command
continues and drops A from the list in the multi-pack-index. However,
now all Git processes add A to the packed_git list because it isn't
referenced by the multi-pack-index.

In summary: the decision to pick the newer copies is a fundamental
part of how the write->expire->repack loop was designed.

>> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>>    and updaes the multi-pack-index to drop those pack-files from the
>>    list. This is safe to do as concurrent Git processes will see the
>>    multi-pack-index and not open those packs when looking for object
>>    contents. (Similar to the 'loose-objects' job, there are some Git
>>    commands that open pack-files regardless of the multi-pack-index,
>>    but they are rarely used. Further, a user that self-selects to
>>    use background operations would likely refrain from using those
>>    commands.)
> 
> OK.
> 
>> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
>>    of pack-files that are listed in the multi-pack-index and creates
>>    a new pack-file containing the objects whose offsets are listed
>>    by the multi-pack-index to be in those objects. The set of pack-
>>    files is selected greedily by sorting the pack-files by modified
>>    time and adding a pack-file to the set if its "expected size" is
>>    smaller than the batch size until the total expected size of the
>>    selected pack-files is at least the batch size. The "expected
>>    size" is calculated by taking the size of the pack-file divided
>>    by the number of objects in the pack-file and multiplied by the
>>    number of objects from the multi-pack-index with offset in that
>>    pack-file. The expected size approximats how much data from that
> 
> approximates.
> 
>>    pack-file will contribute to the resulting pack-file size. The
>>    intention is that the resulting pack-file will be close in size
>>    to the provided batch size.
> 
>> +static int maintenance_task_incremental_repack(void)
>> +{
>> +	if (multi_pack_index_write()) {
>> +		error(_("failed to write multi-pack-index"));
>> +		return 1;
>> +	}
>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after initial write"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +
>> +	if (multi_pack_index_expire()) {
>> +		error(_("multi-pack-index expire failed"));
>> +		return 1;
>> +	}
>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after expire"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +	if (multi_pack_index_repack()) {
>> +		error(_("multi-pack-index repack failed"));
>> +		return 1;
>> +	}
> 
> Hmph, I wonder if these warning should come from each helper
> functions that are static to this function anyway.
> 
> It also makes it easier to reason about this function by eliminating
> the need for having a different pattern only for the verify helper.
> Instead, verify could call rewrite internally when it notices a
> breakage.  I.e.
> 
> 	if (multi_pack_index_write())
> 		return 1;
> 	if (multi_pack_index_verify("after initial write"))
> 		return 1;
> 	if (multi_pack_index_exire())
> 		return 1;
> 	...

This is a cleaner model. I'll work on that.

> Also, it feels odd, compared to our internal API convention, that
> positive non-zero is used as an error here.
...
> Exactly the same comment as 08/18 about natural/inherent ordering
> applies here as well.

I'll leave these to be resolved in the earlier messages.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Đoàn Trần Công Danh wrote (reply to this):

On 2020-07-23 17:56:30+0000, Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> The task is called "prefetch" because it is work done in advance
> of a foreground fetch to make that 'git fetch' command much faster.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foregroudn 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
> 
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:

Does this job interfere with FETCH_HEAD?
From my quick test (by applying 01-08 on top of rc1, and messing with t7900),
it looks like yes.

I (and some other people, probably) rely on FETCH_HEAD for our scripts.
Hence, it would be nice to not touch FETCH_HEAD with prefetch job.

Thanks,
-Danh

> 
> 1. --no-tags prevents getting a new tag when a user wants to see
>    the new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>    refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/prefetch/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/prefetch/<remote> refs that no
>    longer appear on the remote.
> 
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paried with --no-show-forced-udpates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-maintenance.txt | 12 ++++++
>  builtin/gc.c                      | 64 ++++++++++++++++++++++++++++++-
>  t/t7900-maintenance.sh            | 24 ++++++++++++
>  3 files changed, 99 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
> index 9204762e21..0927643247 100644
> --- a/Documentation/git-maintenance.txt
> +++ b/Documentation/git-maintenance.txt
> @@ -53,6 +53,18 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +prefetch::
> +	The `fetch` task updates the object directory with the latest objects
> +	from all registered remotes. For each remote, a `git fetch` command
> +	is run. The refmap is custom to avoid updating local or remote
> +	branches (those in `refs/heads` or `refs/remotes`). Instead, the
> +	remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
> +	not updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are
> +updated on the remote.
> +
>  gc::
>  	Cleanup unnecessary files and optimize the local repository. "GC"
>  	stands for "garbage collection," but this task performs many
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 5d99b4b805..969c127877 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -28,6 +28,7 @@
>  #include "blob.h"
>  #include "tree.h"
>  #include "promisor-remote.h"
> +#include "remote.h"
>  
>  #define FAILED_RUN "failed to run %s"
>  
> @@ -700,7 +701,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	return 0;
>  }
>  
> -#define MAX_NUM_TASKS 2
> +#define MAX_NUM_TASKS 3
>  
>  static const char * const builtin_maintenance_usage[] = {
>  	N_("git maintenance run [<options>]"),
> @@ -781,6 +782,63 @@ static int maintenance_task_commit_graph(void)
>  	return 1;
>  }
>  
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/prefetch/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	if (opts.quiet)
> +		argv_array_push(&cmd, "--quiet");
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_each_remote(struct remote *remote, void *cbdata)
> +{
> +	struct string_list *remotes = (struct string_list *)cbdata;
> +
> +	string_list_append(remotes, remote->name);
> +	return 0;
> +}
> +
> +static int maintenance_task_prefetch(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (for_each_remote(fill_each_remote, &remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  static int maintenance_task_gc(void)
>  {
>  	int result;
> @@ -871,6 +929,10 @@ static void initialize_tasks(void)
>  	for (i = 0; i < MAX_NUM_TASKS; i++)
>  		tasks[i] = xcalloc(1, sizeof(struct maintenance_task));
>  
> +	tasks[num_tasks]->name = "prefetch";
> +	tasks[num_tasks]->fn = maintenance_task_prefetch;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;
> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index c09a9eb90b..8b04a04c79 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -44,4 +44,28 @@ test_expect_success 'run --task duplicate' '
>  	test_i18ngrep "cannot be selected multiple times" err
>  '
>  
> +test_expect_success 'run --task=prefetch with no remotes' '
> +	git maintenance run --task=prefetch 2>err &&
> +	test_must_be_empty err
> +'
> +
> +test_expect_success 'prefetch multiple remotes' '
> +	git clone . clone1 &&
> +	git clone . clone2 &&
> +	git remote add remote1 "file://$(pwd)/clone1" &&
> +	git remote add remote2 "file://$(pwd)/clone2" &&
> +	git -C clone1 switch -c one &&
> +	git -C clone2 switch -c two &&
> +	test_commit -C clone1 one &&
> +	test_commit -C clone2 two &&
> +	GIT_TRACE2_EVENT="$(pwd)/run-prefetch.txt" git maintenance run --task=prefetch &&
> +	grep ",\"fetch\",\"remote1\"" run-prefetch.txt &&
> +	grep ",\"fetch\",\"remote2\"" run-prefetch.txt &&
> +	test_path_is_missing .git/refs/remotes &&
> +	test_cmp clone1/.git/refs/heads/one .git/refs/prefetch/remote1/one &&
> +	test_cmp clone2/.git/refs/heads/two .git/refs/prefetch/remote2/two &&
> +	git log prefetch/remote1/one &&
> +	git log prefetch/remote2/two
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

-- 
Danh

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:

>> When running 'git fetch <remote> <options>' in the background, use
>> the following options for careful updating:
>
> Does this job interfere with FETCH_HEAD?
> From my quick test (by applying 01-08 on top of rc1, and messing with t7900),
> it looks like yes.
>
> I (and some other people, probably) rely on FETCH_HEAD for our scripts.
> Hence, it would be nice to not touch FETCH_HEAD with prefetch job.

Very good point.  For that, Derrick may want to swallow the single
patch from 'jc/no-update-fetch-head' topic into this series and
pass the new command line option.




Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/24/2020 9:48 PM, Junio C Hamano wrote:
> Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:
> 
>>> When running 'git fetch <remote> <options>' in the background, use
>>> the following options for careful updating:
>>
>> Does this job interfere with FETCH_HEAD?
>> From my quick test (by applying 01-08 on top of rc1, and messing with t7900),
>> it looks like yes.
>>
>> I (and some other people, probably) rely on FETCH_HEAD for our scripts.
>> Hence, it would be nice to not touch FETCH_HEAD with prefetch job.
> 
> Very good point.  For that, Derrick may want to swallow the single
> patch from 'jc/no-update-fetch-head' topic into this series and
> pass the new command line option.

Thanks for the pointer! I appreciate the attention to detail here.

I'll rebase onto jc/no-update-fetch-head for the next version, since
that branch is based on v2.28.0-rc0, which is recent enough.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

> I'll rebase onto jc/no-update-fetch-head for the next version, since
> that branch is based on v2.28.0-rc0, which is recent enough.

I do not think it is wise to base a work on top of unfinished "you
could do it this way, perhaps?" demonstration patch the original
author does not have much inclination to finish, though.

When I am really bored, I may go back to the topic to finish it, but
I wouldn't mind if you took ownership of it at all.

Thanks.



Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/27/2020 12:13 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>> that branch is based on v2.28.0-rc0, which is recent enough.
> 
> I do not think it is wise to base a work on top of unfinished "you
> could do it this way, perhaps?" demonstration patch the original
> author does not have much inclination to finish, though.
> 
> When I am really bored, I may go back to the topic to finish it, but
> I wouldn't mind if you took ownership of it at all.

Ah. I didn't understand the status of that branch. I'll pull it in
to this topic.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

> On 7/27/2020 12:13 PM, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>> 
>>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>>> that branch is based on v2.28.0-rc0, which is recent enough.
>> 
>> I do not think it is wise to base a work on top of unfinished "you
>> could do it this way, perhaps?" demonstration patch the original
>> author does not have much inclination to finish, though.
>> 
>> When I am really bored, I may go back to the topic to finish it, but
>> I wouldn't mind if you took ownership of it at all.
>
> Ah. I didn't understand the status of that branch. I'll pull it in
> to this topic.

So here is with one of the two things that I found missing in the
first iteration of the patch: documentation.

The other thing that I found iffy (and still missing from this
version) was what should be done when "git pull" is explicitly given
the "--no-write-fetch-head" option.

I think (but didn't check the recent code) that 'git pull' would
pass only known-to-make-sense command line options to underlying
'git fetch', so it probably will barf with "unknown option", which
is the best case.  We might want to make it sure with a new test in
5521.  On the other hand, if we get anything other than "no such
option", we may want to think if we want to "fix" it or just leave
it inside "if it hurts, don't do it" territory.

Thanks.  

The patch without doc was Reviewed-by: Taylor Blau <me@ttaylorr.com>
but this round has not been.

-- >8 --

If you run fetch but record the result in remote-tracking branches,
and either if you do nothing with the fetched refs (e.g. you are
merely mirroring) or if you always work from the remote-tracking
refs (e.g. you fetch and then merge origin/branchname separately),
you can get away with having no FETCH_HEAD at all.

Teach "git fetch" a command line option "--[no-]write-fetch-head"
and "fetch.writeFetchHEAD" configuration variable.  Without either,
the default is to write FETCH_HEAD, and the usual rule that the
command line option defeats configured default applies.

Note that under "--dry-run" mode, FETCH_HEAD is never written;
otherwise you'd see list of objects in the file that you do not
actually have.  Passing `--fetch-write-head` does not force `git
fetch` to write the file.

Also note that this option is explicitly passed when "git pull"
internally invokes "git fetch", so that those who configured their
"git fetch" not to write FETCH_HEAD would not be able to break the
cooperation between these two commands.  "git pull" must see what
"git fetch" got recorded in FETCH_HEAD to work correctly.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/config/fetch.txt  |  7 ++++++
 Documentation/fetch-options.txt | 10 +++++++++
 builtin/fetch.c                 | 19 +++++++++++++---
 builtin/pull.c                  |  3 ++-
 t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
 5 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/fetch.txt b/Documentation/config/fetch.txt
index b20394038d..0aaa05e8c0 100644
--- a/Documentation/config/fetch.txt
+++ b/Documentation/config/fetch.txt
@@ -91,3 +91,10 @@ fetch.writeCommitGraph::
 	merge and the write may take longer. Having an updated commit-graph
 	file helps performance of many Git commands, including `git merge-base`,
 	`git push -f`, and `git log --graph`. Defaults to false.
+
+fetch.writeFetchHEAD::
+	Setting it to false tells `git fetch` not to write the list
+	of remote refs fetched in the `FETCH_HEAD` file directly
+	under `$GIT_DIR`.  Can be countermanded from the command
+	line with the `--[no-]write-fetch-head` option.  Defaults to
+	true.
diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 6e2a160a47..6775e8499f 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
 --dry-run::
 	Show what would be done, without making any changes.
 
+ifndef::git-pull[]
+--[no-]write-fetch-head::
+	Write the list of remote refs fetched in the `FETCH_HEAD`
+	file directly under `$GIT_DIR`.  This is the default unless
+	the configuration variable `fetch.writeFetchHEAD` is set to
+	false.  Passing `--no-write-fetch-head` from the command
+	line tells Git not to write the file.  Under `--dry-run`
+	option, the file is never written.
+endif::git-pull[]
+
 -f::
 --force::
 	When 'git fetch' is used with `<src>:<dst>` refspec it may
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 82ac4be8a5..3ccf69753f 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
 #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok;
+static int write_fetch_head = 1;
 static int verbosity, deepen_relative, set_upstream;
 static int progress = -1;
 static int enable_auto_gc = 1;
@@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(k, "fetch.writefetchhead")) {
+		write_fetch_head = git_config_bool(k, v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
 		    PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
 	OPT_BOOL(0, "dry-run", &dry_run,
 		 N_("dry run")),
+	OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
+		 N_("write fetched references to the FETCH_HEAD file")),
 	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
 	OPT_BOOL('u', "update-head-ok", &update_head_ok,
 		    N_("allow updating of HEAD ref")),
@@ -893,7 +900,9 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	const char *what, *kind;
 	struct ref *rm;
 	char *url;
-	const char *filename = dry_run ? "/dev/null" : git_path_fetch_head(the_repository);
+	const char *filename = (!write_fetch_head
+				? "/dev/null"
+				: git_path_fetch_head(the_repository));
 	int want_status;
 	int summary_width = transport_summary_width(ref_map);
 
@@ -1327,7 +1336,7 @@ static int do_fetch(struct transport *transport,
 	}
 
 	/* if not appending, truncate FETCH_HEAD */
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		retcode = truncate_fetch_head();
 		if (retcode)
 			goto cleanup;
@@ -1594,7 +1603,7 @@ static int fetch_multiple(struct string_list *list, int max_children)
 	int i, result = 0;
 	struct argv_array argv = ARGV_ARRAY_INIT;
 
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		int errcode = truncate_fetch_head();
 		if (errcode)
 			return errcode;
@@ -1795,6 +1804,10 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	if (depth || deepen_since || deepen_not.nr)
 		deepen = 1;
 
+	/* FETCH_HEAD never gets updated in --dry-run mode */
+	if (dry_run)
+		write_fetch_head = 0;
+
 	if (all) {
 		if (argc == 1)
 			die(_("fetch --all does not take a repository argument"));
diff --git a/builtin/pull.c b/builtin/pull.c
index 8159c5d7c9..e988d92b53 100644
--- a/builtin/pull.c
+++ b/builtin/pull.c
@@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char **refspecs)
 	struct argv_array args = ARGV_ARRAY_INIT;
 	int ret;
 
-	argv_array_pushl(&args, "fetch", "--update-head-ok", NULL);
+	argv_array_pushl(&args, "fetch", "--update-head-ok",
+			 "--write-fetch-head", NULL);
 
 	/* Shared options */
 	argv_push_verbosity(&args);
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index a66dbe0bde..3052c2d8d5 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -539,13 +539,48 @@ test_expect_success 'fetch into the current branch with --update-head-ok' '
 
 '
 
-test_expect_success 'fetch --dry-run' '
-
+test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
 	rm -f .git/FETCH_HEAD &&
 	git fetch --dry-run . &&
 	! test -f .git/FETCH_HEAD
 '
 
+test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success '--write-fetch-head gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --dry-run --write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
+	test -f .git/FETCH_HEAD
+'
+
 test_expect_success "should be able to fetch with duplicate refspecs" '
 	mkdir dups &&
 	(
-- 
2.28.0

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Phillip Wood wrote (reply to this):

On 28/07/2020 17:37, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 7/27/2020 12:13 PM, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@gmail.com> writes:
>>>
>>>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>>>> that branch is based on v2.28.0-rc0, which is recent enough.
>>>
>>> I do not think it is wise to base a work on top of unfinished "you
>>> could do it this way, perhaps?" demonstration patch the original
>>> author does not have much inclination to finish, though.
>>>
>>> When I am really bored, I may go back to the topic to finish it, but
>>> I wouldn't mind if you took ownership of it at all.
>>
>> Ah. I didn't understand the status of that branch. I'll pull it in
>> to this topic.
> 
> So here is with one of the two things that I found missing in the
> first iteration of the patch: documentation.
> 
> The other thing that I found iffy (and still missing from this
> version) was what should be done when "git pull" is explicitly given
> the "--no-write-fetch-head" option.
> 
> I think (but didn't check the recent code) that 'git pull' would
> pass only known-to-make-sense command line options to underlying
> 'git fetch', so it probably will barf with "unknown option", which
> is the best case.  We might want to make it sure with a new test in
> 5521.  On the other hand, if we get anything other than "no such
> option", we may want to think if we want to "fix" it or just leave
> it inside "if it hurts, don't do it" territory.
> 
> Thanks.
> 
> The patch without doc was Reviewed-by: Taylor Blau <me@ttaylorr.com>
> but this round has not been.
> 
> -- >8 --
> 
> If you run fetch but record the result in remote-tracking branches,
> and either if you do nothing with the fetched refs (e.g. you are
> merely mirroring) or if you always work from the remote-tracking
> refs (e.g. you fetch and then merge origin/branchname separately),
> you can get away with having no FETCH_HEAD at all.
> 
> Teach "git fetch" a command line option "--[no-]write-fetch-head"
> and "fetch.writeFetchHEAD" configuration variable.  Without either,
> the default is to write FETCH_HEAD, and the usual rule that the
> command line option defeats configured default applies.
> 
> Note that under "--dry-run" mode, FETCH_HEAD is never written;
> otherwise you'd see list of objects in the file that you do not
> actually have.  Passing `--fetch-write-head` 

Typo, it should be `--write-fetch-head`

>does not force `git
> fetch` to write the file.
> 
> Also note that this option is explicitly passed when "git pull"
> internally invokes "git fetch", so that those who configured their
> "git fetch" not to write FETCH_HEAD would not be able to break the
> cooperation between these two commands.  "git pull" must see what
> "git fetch" got recorded in FETCH_HEAD to work correctly.
> 
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
> ---
>   Documentation/config/fetch.txt  |  7 ++++++
>   Documentation/fetch-options.txt | 10 +++++++++
>   builtin/fetch.c                 | 19 +++++++++++++---
>   builtin/pull.c                  |  3 ++-
>   t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
>   5 files changed, 72 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/config/fetch.txt b/Documentation/config/fetch.txt
> index b20394038d..0aaa05e8c0 100644
> --- a/Documentation/config/fetch.txt
> +++ b/Documentation/config/fetch.txt
> @@ -91,3 +91,10 @@ fetch.writeCommitGraph::
>   	merge and the write may take longer. Having an updated commit-graph
>   	file helps performance of many Git commands, including `git merge-base`,
>   	`git push -f`, and `git log --graph`. Defaults to false.
> +
> +fetch.writeFetchHEAD::
> +	Setting it to false tells `git fetch` not to write the list
> +	of remote refs fetched in the `FETCH_HEAD` file directly
> +	under `$GIT_DIR`.  Can be countermanded from the command
> +	line with the `--[no-]write-fetch-head` option.  Defaults to
> +	true.
> diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
> index 6e2a160a47..6775e8499f 100644
> --- a/Documentation/fetch-options.txt
> +++ b/Documentation/fetch-options.txt
> @@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
>   --dry-run::
>   	Show what would be done, without making any changes.
>   
> +ifndef::git-pull[]
> +--[no-]write-fetch-head::
> +	Write the list of remote refs fetched in the `FETCH_HEAD`
> +	file directly under `$GIT_DIR`.  This is the default unless
> +	the configuration variable `fetch.writeFetchHEAD` is set to
> +	false.  Passing `--no-write-fetch-head` from the command
> +	line tells Git not to write the file.  Under `--dry-run`
> +	option, the file is never written.
> +endif::git-pull[]
> +
>   -f::
>   --force::
>   	When 'git fetch' is used with `<src>:<dst>` refspec it may
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 82ac4be8a5..3ccf69753f 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
>   #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
>   
>   static int all, append, dry_run, force, keep, multiple, update_head_ok;
> +static int write_fetch_head = 1;
>   static int verbosity, deepen_relative, set_upstream;
>   static int progress = -1;
>   static int enable_auto_gc = 1;
> @@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
>   		return 0;
>   	}
>   
> +	if (!strcmp(k, "fetch.writefetchhead")) {
> +		write_fetch_head = git_config_bool(k, v);
> +		return 0;
> +	}
>   	return git_default_config(k, v, cb);
>   }
>   
> @@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
>   		    PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
>   	OPT_BOOL(0, "dry-run", &dry_run,
>   		 N_("dry run")),
> +	OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
> +		 N_("write fetched references to the FETCH_HEAD file")),
>   	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
>   	OPT_BOOL('u', "update-head-ok", &update_head_ok,
>   		    N_("allow updating of HEAD ref")),
> @@ -893,7 +900,9 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
>   	const char *what, *kind;
>   	struct ref *rm;
>   	char *url;
> -	const char *filename = dry_run ? "/dev/null" : git_path_fetch_head(the_repository);
> +	const char *filename = (!write_fetch_head
> +				? "/dev/null"
> +				: git_path_fetch_head(the_repository));

I was suspicious of this as we haven't cleared write_fetch_head in the 
--dry-run case yet but the test below seems to show that we still don't 
write FETCH_HEAD if --dry-run is given. That makes we wonder what the 
point of setting the filename based on the value of write_fetch_head is.

The rest looks good to me, though it might be worth having a test to 
check that pull does indeed reject --no-write-fetch-head

Best Wishes

Phillip

>   	int want_status;
>   	int summary_width = transport_summary_width(ref_map);
>   
> @@ -1327,7 +1336,7 @@ static int do_fetch(struct transport *transport,
>   	}
>   
>   	/* if not appending, truncate FETCH_HEAD */
> -	if (!append && !dry_run) {
> +	if (!append && write_fetch_head) {
>   		retcode = truncate_fetch_head();
>   		if (retcode)
>   			goto cleanup;
> @@ -1594,7 +1603,7 @@ static int fetch_multiple(struct string_list *list, int max_children)
>   	int i, result = 0;
>   	struct argv_array argv = ARGV_ARRAY_INIT;
>   
> -	if (!append && !dry_run) {
> +	if (!append && write_fetch_head) {
>   		int errcode = truncate_fetch_head();
>   		if (errcode)
>   			return errcode;
> @@ -1795,6 +1804,10 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>   	if (depth || deepen_since || deepen_not.nr)
>   		deepen = 1;
>   
> +	/* FETCH_HEAD never gets updated in --dry-run mode */
> +	if (dry_run)
> +		write_fetch_head = 0;
> +
>   	if (all) {
>   		if (argc == 1)
>   			die(_("fetch --all does not take a repository argument"));
> diff --git a/builtin/pull.c b/builtin/pull.c
> index 8159c5d7c9..e988d92b53 100644
> --- a/builtin/pull.c
> +++ b/builtin/pull.c
> @@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char **refspecs)
>   	struct argv_array args = ARGV_ARRAY_INIT;
>   	int ret;
>   
> -	argv_array_pushl(&args, "fetch", "--update-head-ok", NULL);
> +	argv_array_pushl(&args, "fetch", "--update-head-ok",
> +			 "--write-fetch-head", NULL);
>   
>   	/* Shared options */
>   	argv_push_verbosity(&args);
> diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
> index a66dbe0bde..3052c2d8d5 100755
> --- a/t/t5510-fetch.sh
> +++ b/t/t5510-fetch.sh
> @@ -539,13 +539,48 @@ test_expect_success 'fetch into the current branch with --update-head-ok' '
>   
>   '
>   
> -test_expect_success 'fetch --dry-run' '
> -
> +test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
>   	rm -f .git/FETCH_HEAD &&
>   	git fetch --dry-run . &&
>   	! test -f .git/FETCH_HEAD
>   '
>   
> +test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
> +	rm -f .git/FETCH_HEAD &&
> +	git fetch --no-write-fetch-head . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success '--write-fetch-head gets defeated by --dry-run' '
> +	rm -f .git/FETCH_HEAD &&
> +	git fetch --dry-run --write-fetch-head . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
> +	rm -f .git/FETCH_HEAD &&
> +	git -c fetch.writeFetchHEAD=no fetch . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
> +	rm -f .git/FETCH_HEAD &&
> +	git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
> +	rm -f .git/FETCH_HEAD &&
> +	git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
> +	! test -f .git/FETCH_HEAD
> +'
> +
> +test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
> +	rm -f .git/FETCH_HEAD &&
> +	git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
> +	test -f .git/FETCH_HEAD
> +'
> +
>   test_expect_success "should be able to fetch with duplicate refspecs" '
>   	mkdir dups &&
>   	(
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Phillip Wood wrote (reply to this):

On 29/07/2020 10:12, Phillip Wood wrote:
> On 28/07/2020 17:37, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> On 7/27/2020 12:13 PM, Junio C Hamano wrote:
>>>> Derrick Stolee <stolee@gmail.com> writes:
>>>>
>>>>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>>>>> that branch is based on v2.28.0-rc0, which is recent enough.
>>>>
>>>> I do not think it is wise to base a work on top of unfinished "you
>>>> could do it this way, perhaps?" demonstration patch the original
>>>> author does not have much inclination to finish, though.
>>>>
>>>> When I am really bored, I may go back to the topic to finish it, but
>>>> I wouldn't mind if you took ownership of it at all.
>>>
>>> Ah. I didn't understand the status of that branch. I'll pull it in
>>> to this topic.
>>
>> So here is with one of the two things that I found missing in the
>> first iteration of the patch: documentation.
>>
>> The other thing that I found iffy (and still missing from this
>> version) was what should be done when "git pull" is explicitly given
>> the "--no-write-fetch-head" option.
>>
>> I think (but didn't check the recent code) that 'git pull' would
>> pass only known-to-make-sense command line options to underlying
>> 'git fetch', so it probably will barf with "unknown option", which
>> is the best case.  We might want to make it sure with a new test in
>> 5521.  On the other hand, if we get anything other than "no such
>> option", we may want to think if we want to "fix" it or just leave
>> it inside "if it hurts, don't do it" territory.
>>
>> Thanks.
>>
>> The patch without doc was Reviewed-by: Taylor Blau <me@ttaylorr.com>
>> but this round has not been.
>>
>> -- >8 --
>>
>> If you run fetch but record the result in remote-tracking branches,
>> and either if you do nothing with the fetched refs (e.g. you are
>> merely mirroring) or if you always work from the remote-tracking
>> refs (e.g. you fetch and then merge origin/branchname separately),
>> you can get away with having no FETCH_HEAD at all.
>>
>> Teach "git fetch" a command line option "--[no-]write-fetch-head"
>> and "fetch.writeFetchHEAD" configuration variable.  Without either,
>> the default is to write FETCH_HEAD, and the usual rule that the
>> command line option defeats configured default applies.
>>
>> Note that under "--dry-run" mode, FETCH_HEAD is never written;
>> otherwise you'd see list of objects in the file that you do not
>> actually have.  Passing `--fetch-write-head` 
> 
> Typo, it should be `--write-fetch-head`
> 
>> does not force `git
>> fetch` to write the file.
>>
>> Also note that this option is explicitly passed when "git pull"
>> internally invokes "git fetch", so that those who configured their
>> "git fetch" not to write FETCH_HEAD would not be able to break the
>> cooperation between these two commands.  "git pull" must see what
>> "git fetch" got recorded in FETCH_HEAD to work correctly.
>>
>> Signed-off-by: Junio C Hamano <gitster@pobox.com>
>> ---
>>   Documentation/config/fetch.txt  |  7 ++++++
>>   Documentation/fetch-options.txt | 10 +++++++++
>>   builtin/fetch.c                 | 19 +++++++++++++---
>>   builtin/pull.c                  |  3 ++-
>>   t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
>>   5 files changed, 72 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/config/fetch.txt 
>> b/Documentation/config/fetch.txt
>> index b20394038d..0aaa05e8c0 100644
>> --- a/Documentation/config/fetch.txt
>> +++ b/Documentation/config/fetch.txt
>> @@ -91,3 +91,10 @@ fetch.writeCommitGraph::
>>       merge and the write may take longer. Having an updated commit-graph
>>       file helps performance of many Git commands, including `git 
>> merge-base`,
>>       `git push -f`, and `git log --graph`. Defaults to false.
>> +
>> +fetch.writeFetchHEAD::
>> +    Setting it to false tells `git fetch` not to write the list
>> +    of remote refs fetched in the `FETCH_HEAD` file directly
>> +    under `$GIT_DIR`.  Can be countermanded from the command
>> +    line with the `--[no-]write-fetch-head` option.  Defaults to
>> +    true.
>> diff --git a/Documentation/fetch-options.txt 
>> b/Documentation/fetch-options.txt
>> index 6e2a160a47..6775e8499f 100644
>> --- a/Documentation/fetch-options.txt
>> +++ b/Documentation/fetch-options.txt
>> @@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
>>   --dry-run::
>>       Show what would be done, without making any changes.
>> +ifndef::git-pull[]
>> +--[no-]write-fetch-head::
>> +    Write the list of remote refs fetched in the `FETCH_HEAD`
>> +    file directly under `$GIT_DIR`.  This is the default unless
>> +    the configuration variable `fetch.writeFetchHEAD` is set to
>> +    false.  Passing `--no-write-fetch-head` from the command
>> +    line tells Git not to write the file.  Under `--dry-run`
>> +    option, the file is never written.
>> +endif::git-pull[]
>> +
>>   -f::
>>   --force::
>>       When 'git fetch' is used with `<src>:<dst>` refspec it may
>> diff --git a/builtin/fetch.c b/builtin/fetch.c
>> index 82ac4be8a5..3ccf69753f 100644
>> --- a/builtin/fetch.c
>> +++ b/builtin/fetch.c
>> @@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
>>   #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
>>   static int all, append, dry_run, force, keep, multiple, update_head_ok;
>> +static int write_fetch_head = 1;
>>   static int verbosity, deepen_relative, set_upstream;
>>   static int progress = -1;
>>   static int enable_auto_gc = 1;
>> @@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const 
>> char *v, void *cb)
>>           return 0;
>>       }
>> +    if (!strcmp(k, "fetch.writefetchhead")) {
>> +        write_fetch_head = git_config_bool(k, v);
>> +        return 0;
>> +    }
>>       return git_default_config(k, v, cb);
>>   }
>> @@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
>>               PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
>>       OPT_BOOL(0, "dry-run", &dry_run,
>>            N_("dry run")),
>> +    OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
>> +         N_("write fetched references to the FETCH_HEAD file")),
>>       OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
>>       OPT_BOOL('u', "update-head-ok", &update_head_ok,
>>               N_("allow updating of HEAD ref")),
>> @@ -893,7 +900,9 @@ static int store_updated_refs(const char *raw_url, 
>> const char *remote_name,
>>       const char *what, *kind;
>>       struct ref *rm;
>>       char *url;
>> -    const char *filename = dry_run ? "/dev/null" : 
>> git_path_fetch_head(the_repository);
>> +    const char *filename = (!write_fetch_head
>> +                ? "/dev/null"
>> +                : git_path_fetch_head(the_repository));
> 
> I was suspicious of this as we haven't cleared write_fetch_head in the 
> --dry-run case yet but the test below seems to show that we still don't 
> write FETCH_HEAD if --dry-run is given. That makes we wonder what the 
> point of setting the filename based on the value of write_fetch_head is.

Sorry ignore that. I misread the hunk header - we have in fact cleared 
write_fetch_head in the --dry-run case by the time we get here.

> The rest looks good to me, though it might be worth having a test to 
> check that pull does indeed reject --no-write-fetch-head
> 
> Best Wishes
> 
> Phillip
> 
>>       int want_status;
>>       int summary_width = transport_summary_width(ref_map);
>> @@ -1327,7 +1336,7 @@ static int do_fetch(struct transport *transport,
>>       }
>>       /* if not appending, truncate FETCH_HEAD */
>> -    if (!append && !dry_run) {
>> +    if (!append && write_fetch_head) {
>>           retcode = truncate_fetch_head();
>>           if (retcode)
>>               goto cleanup;
>> @@ -1594,7 +1603,7 @@ static int fetch_multiple(struct string_list 
>> *list, int max_children)
>>       int i, result = 0;
>>       struct argv_array argv = ARGV_ARRAY_INIT;
>> -    if (!append && !dry_run) {
>> +    if (!append && write_fetch_head) {
>>           int errcode = truncate_fetch_head();
>>           if (errcode)
>>               return errcode;
>> @@ -1795,6 +1804,10 @@ int cmd_fetch(int argc, const char **argv, 
>> const char *prefix)
>>       if (depth || deepen_since || deepen_not.nr)
>>           deepen = 1;
>> +    /* FETCH_HEAD never gets updated in --dry-run mode */
>> +    if (dry_run)
>> +        write_fetch_head = 0;
>> +
>>       if (all) {
>>           if (argc == 1)
>>               die(_("fetch --all does not take a repository argument"));
>> diff --git a/builtin/pull.c b/builtin/pull.c
>> index 8159c5d7c9..e988d92b53 100644
>> --- a/builtin/pull.c
>> +++ b/builtin/pull.c
>> @@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char 
>> **refspecs)
>>       struct argv_array args = ARGV_ARRAY_INIT;
>>       int ret;
>> -    argv_array_pushl(&args, "fetch", "--update-head-ok", NULL);
>> +    argv_array_pushl(&args, "fetch", "--update-head-ok",
>> +             "--write-fetch-head", NULL);
>>       /* Shared options */
>>       argv_push_verbosity(&args);
>> diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
>> index a66dbe0bde..3052c2d8d5 100755
>> --- a/t/t5510-fetch.sh
>> +++ b/t/t5510-fetch.sh
>> @@ -539,13 +539,48 @@ test_expect_success 'fetch into the current 
>> branch with --update-head-ok' '
>>   '
>> -test_expect_success 'fetch --dry-run' '
>> -
>> +test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
>>       rm -f .git/FETCH_HEAD &&
>>       git fetch --dry-run . &&
>>       ! test -f .git/FETCH_HEAD
>>   '
>> +test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git fetch --no-write-fetch-head . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success '--write-fetch-head gets defeated by --dry-run' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git fetch --dry-run --write-fetch-head . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git -c fetch.writeFetchHEAD=no fetch . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
>> +    ! test -f .git/FETCH_HEAD
>> +'
>> +
>> +test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
>> +    rm -f .git/FETCH_HEAD &&
>> +    git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
>> +    test -f .git/FETCH_HEAD
>> +'
>> +
>>   test_expect_success "should be able to fetch with duplicate refspecs" '
>>       mkdir dups &&
>>       (
>>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/28/2020 12:37 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 7/27/2020 12:13 PM, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@gmail.com> writes:
>>>
>>>> I'll rebase onto jc/no-update-fetch-head for the next version, since
>>>> that branch is based on v2.28.0-rc0, which is recent enough.
>>>
>>> I do not think it is wise to base a work on top of unfinished "you
>>> could do it this way, perhaps?" demonstration patch the original
>>> author does not have much inclination to finish, though.
>>>
>>> When I am really bored, I may go back to the topic to finish it, but
>>> I wouldn't mind if you took ownership of it at all.
>>
>> Ah. I didn't understand the status of that branch. I'll pull it in
>> to this topic.
> 
> So here is with one of the two things that I found missing in the
> first iteration of the patch: documentation.
> 
> The other thing that I found iffy (and still missing from this
> version) was what should be done when "git pull" is explicitly given
> the "--no-write-fetch-head" option.
> 
> I think (but didn't check the recent code) that 'git pull' would
> pass only known-to-make-sense command line options to underlying
> 'git fetch', so it probably will barf with "unknown option", which
> is the best case.  We might want to make it sure with a new test in
> 5521.  On the other hand, if we get anything other than "no such
> option", we may want to think if we want to "fix" it or just leave
> it inside "if it hurts, don't do it" territory.

Here is the version I applied and updated. Please let me know what
you think.

-->8--

From 3f60a0f0fd388447aa9c2b805b5646039df98d0b Mon Sep 17 00:00:00 2001
From: Junio C Hamano <gitster@pobox.com>
Date: Tue, 28 Jul 2020 09:37:59 -0700
Subject: [PATCH] fetch: optionally allow disabling FETCH_HEAD update

If you run fetch but record the result in remote-tracking branches,
and either if you do nothing with the fetched refs (e.g. you are
merely mirroring) or if you always work from the remote-tracking
refs (e.g. you fetch and then merge origin/branchname separately),
you can get away with having no FETCH_HEAD at all.

Teach "git fetch" a command line option "--[no-]write-fetch-head"
and "fetch.writeFetchHEAD" configuration variable.  Without either,
the default is to write FETCH_HEAD, and the usual rule that the
command line option defeats configured default applies.

Note that under "--dry-run" mode, FETCH_HEAD is never written;
otherwise you'd see list of objects in the file that you do not
actually have.  Passing `--write-fetch-head` does not force `git
fetch` to write the file.

Also note that this option is explicitly passed when "git pull"
internally invokes "git fetch", so that those who configured their
"git fetch" not to write FETCH_HEAD would not be able to break the
cooperation between these two commands.  "git pull" must see what
"git fetch" got recorded in FETCH_HEAD to work correctly.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/fetch.txt  |  7 ++++++
 Documentation/fetch-options.txt | 10 +++++++++
 builtin/fetch.c                 | 19 +++++++++++++---
 builtin/pull.c                  |  3 ++-
 t/t5510-fetch.sh                | 39 +++++++++++++++++++++++++++++++--
 t/t5521-pull-options.sh         | 16 ++++++++++++++
 6 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/fetch.txt b/Documentation/config/fetch.txt
index b20394038d1..0aaa05e8c0e 100644
--- a/Documentation/config/fetch.txt
+++ b/Documentation/config/fetch.txt
@@ -91,3 +91,10 @@ fetch.writeCommitGraph::
 	merge and the write may take longer. Having an updated commit-graph
 	file helps performance of many Git commands, including `git merge-base`,
 	`git push -f`, and `git log --graph`. Defaults to false.
+
+fetch.writeFetchHEAD::
+	Setting it to false tells `git fetch` not to write the list
+	of remote refs fetched in the `FETCH_HEAD` file directly
+	under `$GIT_DIR`.  Can be countermanded from the command
+	line with the `--[no-]write-fetch-head` option.  Defaults to
+	true.
diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 495bc8ab5a1..6972ad2522c 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -64,6 +64,16 @@ documented in linkgit:git-config[1].
 --dry-run::
 	Show what would be done, without making any changes.
 
+ifndef::git-pull[]
+--[no-]write-fetch-head::
+	Write the list of remote refs fetched in the `FETCH_HEAD`
+	file directly under `$GIT_DIR`.  This is the default unless
+	the configuration variable `fetch.writeFetchHEAD` is set to
+	false.  Passing `--no-write-fetch-head` from the command
+	line tells Git not to write the file.  Under `--dry-run`
+	option, the file is never written.
+endif::git-pull[]
+
 -f::
 --force::
 	When 'git fetch' is used with `<src>:<dst>` refspec it may
diff --git a/builtin/fetch.c b/builtin/fetch.c
index d8253f66e4c..1d7194aac26 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
 #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok;
+static int write_fetch_head = 1;
 static int verbosity, deepen_relative, set_upstream;
 static int progress = -1;
 static int enable_auto_gc = 1;
@@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(k, "fetch.writefetchhead")) {
+		write_fetch_head = git_config_bool(k, v);
+		return 0;
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
 		    PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
 	OPT_BOOL(0, "dry-run", &dry_run,
 		 N_("dry run")),
+	OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
+		 N_("write fetched references to the FETCH_HEAD file")),
 	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
 	OPT_BOOL('u', "update-head-ok", &update_head_ok,
 		    N_("allow updating of HEAD ref")),
@@ -895,7 +902,9 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	const char *what, *kind;
 	struct ref *rm;
 	char *url;
-	const char *filename = dry_run ? "/dev/null" : git_path_fetch_head(the_repository);
+	const char *filename = (!write_fetch_head
+				? "/dev/null"
+				: git_path_fetch_head(the_repository));
 	int want_status;
 	int summary_width = transport_summary_width(ref_map);
 
@@ -1329,7 +1338,7 @@ static int do_fetch(struct transport *transport,
 	}
 
 	/* if not appending, truncate FETCH_HEAD */
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		retcode = truncate_fetch_head();
 		if (retcode)
 			goto cleanup;
@@ -1596,7 +1605,7 @@ static int fetch_multiple(struct string_list *list, int max_children)
 	int i, result = 0;
 	struct argv_array argv = ARGV_ARRAY_INIT;
 
-	if (!append && !dry_run) {
+	if (!append && write_fetch_head) {
 		int errcode = truncate_fetch_head();
 		if (errcode)
 			return errcode;
@@ -1797,6 +1806,10 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	if (depth || deepen_since || deepen_not.nr)
 		deepen = 1;
 
+	/* FETCH_HEAD never gets updated in --dry-run mode */
+	if (dry_run)
+		write_fetch_head = 0;
+
 	if (all) {
 		if (argc == 1)
 			die(_("fetch --all does not take a repository argument"));
diff --git a/builtin/pull.c b/builtin/pull.c
index 8159c5d7c96..e988d92b535 100644
--- a/builtin/pull.c
+++ b/builtin/pull.c
@@ -527,7 +527,8 @@ static int run_fetch(const char *repo, const char **refspecs)
 	struct argv_array args = ARGV_ARRAY_INIT;
 	int ret;
 
-	argv_array_pushl(&args, "fetch", "--update-head-ok", NULL);
+	argv_array_pushl(&args, "fetch", "--update-head-ok",
+			 "--write-fetch-head", NULL);
 
 	/* Shared options */
 	argv_push_verbosity(&args);
diff --git a/t/t5510-fetch.sh b/t/t5510-fetch.sh
index 9850ecde5df..31c91d0ed2e 100755
--- a/t/t5510-fetch.sh
+++ b/t/t5510-fetch.sh
@@ -539,13 +539,48 @@ test_expect_success 'fetch into the current branch with --update-head-ok' '
 
 '
 
-test_expect_success 'fetch --dry-run' '
-
+test_expect_success 'fetch --dry-run does not touch FETCH_HEAD' '
 	rm -f .git/FETCH_HEAD &&
 	git fetch --dry-run . &&
 	! test -f .git/FETCH_HEAD
 '
 
+test_expect_success '--no-write-fetch-head does not touch FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success '--write-fetch-head gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git fetch --dry-run --write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and FETCH_HEAD' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD gets defeated by --dry-run' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --dry-run . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --no-write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=yes fetch --no-write-fetch-head . &&
+	! test -f .git/FETCH_HEAD
+'
+
+test_expect_success 'fetch.writeFetchHEAD and --write-fetch-head' '
+	rm -f .git/FETCH_HEAD &&
+	git -c fetch.writeFetchHEAD=no fetch --write-fetch-head . &&
+	test -f .git/FETCH_HEAD
+'
+
 test_expect_success "should be able to fetch with duplicate refspecs" '
 	mkdir dups &&
 	(
diff --git a/t/t5521-pull-options.sh b/t/t5521-pull-options.sh
index 159afa7ac81..1acae3b9a4f 100755
--- a/t/t5521-pull-options.sh
+++ b/t/t5521-pull-options.sh
@@ -77,6 +77,7 @@ test_expect_success 'git pull -q -v --no-rebase' '
 	test_must_be_empty out &&
 	test -s err)
 '
+
 test_expect_success 'git pull --cleanup errors early on invalid argument' '
 	mkdir clonedcleanup &&
 	(cd clonedcleanup && git init &&
@@ -85,6 +86,21 @@ test_expect_success 'git pull --cleanup errors early on invalid argument' '
 	test -s err)
 '
 
+test_expect_success 'git pull --no-write-fetch-head fails' '
+	mkdir clonedwfh &&
+	(cd clonedwfh && git init &&
+	test_must_fail git pull --no-write-fetch-head "../parent" >out 2>err &&
+	test_must_be_empty out &&
+	test_i18ngrep "no-write-fetch-head" err)
+'
+
+test_expect_success 'git pull succeeds with fetch.writeFetchHEAD=false' '
+	mkdir clonedwfhconfig &&
+	(cd clonedwfhconfig && git init &&
+	git config fetch.writeFetchHEAD false &&
+	git pull "../parent" >out 2>err &&
+	grep FETCH_HEAD err)
+'
 
 test_expect_success 'git pull --force' '
 	mkdir clonedoldstyle &&
-- 
2.27.0.366.g34746c1d11e.dirty



Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Thu, Jul 23, 2020 at 05:56:27PM +0000, Derrick Stolee via GitGitGadget wrote:

> +static int run_write_commit_graph(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "commit-graph", "write",
> +			 "--split", "--reachable", NULL);
> +
> +	if (opts.quiet)
> +		argv_array_push(&cmd, "--no-progress");
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +	argv_array_clear(&cmd);
> +
> +	return result;
> +}

This is a pretty minor nit, but since I happened to be looking at the
merge of all of the recent argv_array callers today... :)

You can write this a bit more succinctly by reusing the argv_array
provided by the child_process:

  struct child_process cmd = CHILD_PROCESS_INIT;

  cmd.git_cmd = 1;
  argv_array_pushl(&cmd.args, "commit-graph", "write",
                   "--split", "--reachable", NULL);

  if (opts.quiet)
          argv_array_push(&cmd.args, "--no-progress");

  return run_command(&cmd);

Then you don't have to worry about freeing the argv memory, because it's
handled automatically.

Like I said, quite minor, but it looks like this pattern appears in a
few places, so it might be worth tweaking. And it would still work with
the "pushf" people recommended to avoid extra strbufs (I saw another one
in fetch_remote(), too).

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Emily Shaffer wrote (reply to this):

On Thu, Jul 23, 2020 at 05:56:31PM +0000, Derrick Stolee via GitGitGadget wrote:
> +loose-objects::
> +	The `loose-objects` job cleans up loose objects and places them into
> +	pack-files. In order to prevent race conditions with concurrent Git
> +	commands, it follows a two-step process. First, it deletes any loose
> +	objects that already exist in a pack-file; concurrent Git processes
> +	will examine the pack-file for the object data instead of the loose
> +	object. Second, it creates a new pack-file (starting with "loose-")

[jonathan tan + jonathan nieder] If you are going to document this,
probably it should also be tested, so the documentation does not become
stale. Or, just don't document it.

> +static int pack_loose(void)
> +{
> +	struct repository *r = the_repository;
> +	int result = 0;
> +	struct write_loose_object_data data;
> +	struct strbuf prefix = STRBUF_INIT;
> +	struct child_process *pack_proc;
> +
> +	/*
> +	 * Do not start pack-objects process
> +	 * if there are no loose objects.
> +	 */
> +	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
> +					   loose_object_exists,
> +					   NULL, NULL, NULL))

[emily] To me, this is unintuitive - but upon inspection, it's exiting
the foreach early if any loose object is found, so this is cheaper than
actually counting. Maybe a comment would help to understand? Or we could
name the function differently, like "bail_if_loose()" or something?

> +test_expect_success 'loose-objects task' '
> +	# Repack everything so we know the state of the object dir
> +	git repack -adk &&
> +
> +	# Hack to stop maintenance from running during "git commit"
> +	echo in use >.git/objects/maintenance.lock &&
> +	test_commit create-loose-object &&

[jonathan nieder] Does it make sense to use a different git command
which is guaranteed to make a loose object? Is 'git commit' futureproof,
if we decide commits should directly create packs in the future? 'git
unpack-objects' is guaranteed to make a loose object, although it is
clumsy because it needs a packfile to begin with...

[jonathan tan] But, using 'git commit' is easier to understand in this
context. Maybe commenting to say that we assume 'git commit' makes 1 or
more loose objects will be enough to futureproof - then we have a signal
to whoever made a change to make this fail, and that person knows how to
fix this test.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/29/2020 6:21 PM, Emily Shaffer wrote:
> On Thu, Jul 23, 2020 at 05:56:31PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +loose-objects::
>> +	The `loose-objects` job cleans up loose objects and places them into
>> +	pack-files. In order to prevent race conditions with concurrent Git
>> +	commands, it follows a two-step process. First, it deletes any loose
>> +	objects that already exist in a pack-file; concurrent Git processes
>> +	will examine the pack-file for the object data instead of the loose
>> +	object. Second, it creates a new pack-file (starting with "loose-")
> 
> [jonathan tan + jonathan nieder] If you are going to document this,
> probably it should also be tested, so the documentation does not become
> stale. Or, just don't document it.

Adding a condition to the test is a good idea.

>> +static int pack_loose(void)
>> +{
>> +	struct repository *r = the_repository;
>> +	int result = 0;
>> +	struct write_loose_object_data data;
>> +	struct strbuf prefix = STRBUF_INIT;
>> +	struct child_process *pack_proc;
>> +
>> +	/*
>> +	 * Do not start pack-objects process
>> +	 * if there are no loose objects.
>> +	 */
>> +	if (!for_each_loose_file_in_objdir(r->objects->odb->path,
>> +					   loose_object_exists,
>> +					   NULL, NULL, NULL))
> 
> [emily] To me, this is unintuitive - but upon inspection, it's exiting
> the foreach early if any loose object is found, so this is cheaper than
> actually counting. Maybe a comment would help to understand? Or we could
> name the function differently, like "bail_if_loose()" or something?

Sure. "bail_on_loose()" makes more sense to me.

>> +test_expect_success 'loose-objects task' '
>> +	# Repack everything so we know the state of the object dir
>> +	git repack -adk &&
>> +
>> +	# Hack to stop maintenance from running during "git commit"
>> +	echo in use >.git/objects/maintenance.lock &&
>> +	test_commit create-loose-object &&
> 
> [jonathan nieder] Does it make sense to use a different git command
> which is guaranteed to make a loose object? Is 'git commit' futureproof,
> if we decide commits should directly create packs in the future? 'git
> unpack-objects' is guaranteed to make a loose object, although it is
> clumsy because it needs a packfile to begin with...

"unpack-objects" also has the problem that it won't write the loose
object if it exists as a packed object, so it requires moving the
pack-file out of the object directory first. (That's also required
for testing that the loose objects get packed by the maintenance
task, but I'd rather not need to remove the multi-pack-index, too.)

> [jonathan tan] But, using 'git commit' is easier to understand in this
> context. Maybe commenting to say that we assume 'git commit' makes 1 or
> more loose objects will be enough to futureproof - then we have a signal
> to whoever made a change to make this fail, and that person knows how to
> fix this test.

I'll add the comment for now. Thanks.

-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Emily Shaffer wrote (reply to this):

On Thu, Jul 23, 2020 at 05:56:32PM +0000, Derrick Stolee via GitGitGadget wrote:
> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.

[emily shaffer] Will this use midx even if the user has disabled it in
their config?

> diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
> index 94bb493733..3ec813979a 100755
> --- a/t/t7900-maintenance.sh
> +++ b/t/t7900-maintenance.sh
> @@ -103,4 +103,41 @@ test_expect_success 'loose-objects task' '
>  	test_cmp packs-between packs-after
>  '

[emily shaffer] Can we include a test to prove that this task is or is
not run if core.multipackindex is set or unset? That behavior is hard to
deduce from the code... we might want to be cautious.

>  
> +test_expect_success 'incremental-repack task' '
> +	packDir=.git/objects/pack &&
> +	for i in $(test_seq 1 5)
> +	do
> +		test_commit $i || return 1
> +	done &&
> +
> +	# Create three disjoint pack-files with size BIG, small, small.
> +	echo HEAD~2 | git pack-objects --revs $packDir/test-1 &&
> +	test_tick &&
> +	git pack-objects --revs $packDir/test-2 <<-\EOF &&
> +	HEAD~1
> +	^HEAD~2
> +	EOF
> +	test_tick &&
> +	git pack-objects --revs $packDir/test-3 <<-\EOF &&
> +	HEAD
> +	^HEAD~1
> +	EOF
> +	rm -f $packDir/pack-* &&
> +	rm -f $packDir/loose-* &&
> +	ls $packDir/*.pack >packs-before &&
> +	test_line_count = 3 packs-before &&
> +
> +	# the job repacks the two into a new pack, but does not
> +	# delete the old ones.
> +	git maintenance run --task=incremental-repack &&
> +	ls $packDir/*.pack >packs-between &&
> +	test_line_count = 4 packs-between &&
> +
> +	# the job deletes the two old packs, and does not write
> +	# a new one because only one pack remains.
> +	git maintenance run --task=incremental-repack &&
> +	ls .git/objects/pack/*.pack >packs-after &&
> +	test_line_count = 1 packs-after
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

==================

NAME
----
git-maintenance - Run tasks to optimize Git repository data


SYNOPSIS
--------
[verse]
'git maintenance' run [<options>]


DESCRIPTION
-----------
Run tasks to optimize Git repository data, speeding up other Git commands
and reducing storage requirements for the repository.

Git commands that add repository data, such as `git add` or `git fetch`,
are optimized for a responsive user experience. These commands do not take
time to optimize the Git data, since such optimizations scale with the full
size of the repository while these user commands each perform a relatively
small action.

The `git maintenance` command provides flexibility for how to optimize the
Git repository.

SUBCOMMANDS
-----------

run::
Run one or more maintenance tasks. If one or more `--task` options
are specified, then those tasks are run in that order. Otherwise,
the tasks are determined by which `maintenance.<task>.enabled`
config options are true. By default, only `maintenance.gc.enabled`
is true.

TASKS
-----

commit-graph::
The `commit-graph` job updates the `commit-graph` files incrementally,
then verifies that the written data is correct. If the new layer has an
issue, then the chain file is removed and the `commit-graph` is
rewritten from scratch.
+
The incremental write is safe to run alongside concurrent Git processes
since it will not expire `.graph` files that were in the previous
`commit-graph-chain` file. They will be deleted by a later run based on
the expiration delay.

prefetch::
The `prefetch` task updates the object directory with the latest
objects from all registered remotes. For each remote, a `git fetch`
command is run. The refmap is custom to avoid updating local or remote
branches (those in `refs/heads` or `refs/remotes`). Instead, the
remote refs are stored in `refs/prefetch/<remote>/`. Also, tags are
not updated.
+
This is done to avoid disrupting the remote-tracking branches. The end users
expect these refs to stay unmoved unless they initiate a fetch. With prefetch
task, however, the objects necessary to complete a later real fetch would
already be obtained, so the real fetch would go faster. In the ideal case,
it will just become an update to bunch of remote-tracking branches without
any object transfer.

gc::
Cleanup unnecessary files and optimize the local repository. "GC"
stands for "garbage collection," but this task performs many
smaller tasks. This task can be rather expensive for large
repositories, as it repacks all Git objects into a single pack-file.
It can also be disruptive in some situations, as it deletes stale
data.

loose-objects::
The `loose-objects` job cleans up loose objects and places them into
pack-files. In order to prevent race conditions with concurrent Git
commands, it follows a two-step process. First, it deletes any loose
objects that already exist in a pack-file; concurrent Git processes
will examine the pack-file for the object data instead of the loose
object. Second, it creates a new pack-file (starting with "loose-")
containing a batch of loose objects. The batch size is limited to 50
thousand objects to prevent the job from taking too long on a
repository with many loose objects.

incremental-repack::
The `incremental-repack` job repacks the object directory
using the `multi-pack-index` feature. In order to prevent race
conditions with concurrent Git commands, it follows a two-step
process. First, it deletes any pack-files included in the
`multi-pack-index` where none of the objects in the
`multi-pack-index` reference those pack-files; this only happens
if all objects in the pack-file are also stored in a newer
pack-file. Second, it selects a group of pack-files whose "expected
size" is below the batch size until the group has total expected
size at least the batch size; see the `--batch-size` option for
the `repack` subcommand in linkgit:git-multi-pack-index[1]. The
default batch-size is zero, which is a special case that attempts
to repack all pack-files into a single pack-file.

OPTIONS
-------
--auto::
When combined with the `run` subcommand, run maintenance tasks
only if certain thresholds are met. For example, the `gc` task
runs when the number of loose objects exceeds the number stored
in the `gc.auto` config setting, or when the number of pack-files
exceeds the `gc.autoPackLimit` config setting.

--quiet::
Do not report progress or other information over `stderr`.

--task=<task>::
If this option is specified one or more times, then only run the
specified tasks in the specified order.

GIT
---
Part of the linkgit:git[1] suite
1 change: 1 addition & 0 deletions builtin.h
Expand Up @@ -167,6 +167,7 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix);
int cmd_ls_remote(int argc, const char **argv, const char *prefix);
int cmd_mailinfo(int argc, const char **argv, const char *prefix);
int cmd_mailsplit(int argc, const char **argv, const char *prefix);
int cmd_maintenance(int argc, const char **argv, const char *prefix);
int cmd_merge(int argc, const char **argv, const char *prefix);
int cmd_merge_base(int argc, const char **argv, const char *prefix);
int cmd_merge_index(int argc, const char **argv, const char *prefix);
Expand Down
2 changes: 1 addition & 1 deletion builtin/am.c
Expand Up @@ -1795,7 +1795,7 @@ static void am_run(struct am_state *state, int resume)
if (!state->rebasing) {
am_destroy(state);
close_object_store(the_repository->objects);
run_auto_gc(state->quiet);
run_auto_maintenance(state->quiet);
}
}

Expand Down
2 changes: 1 addition & 1 deletion builtin/commit.c
Expand Up @@ -1702,7 +1702,7 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
git_test_write_commit_graph_or_die();

repo_rerere(the_repository, 0);
run_auto_gc(quiet);
run_auto_maintenance(quiet);
run_commit_hook(use_editor, get_index_file(), "post-commit", NULL);
if (amend && !no_post_rewrite) {
commit_post_rewrite(the_repository, current_head, &oid);
Expand Down
25 changes: 20 additions & 5 deletions builtin/fetch.c
Expand Up @@ -56,6 +56,7 @@ static int prune_tags = -1; /* unspecified */
#define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */

static int all, append, dry_run, force, keep, multiple, update_head_ok;
static int write_fetch_head = 1;
static int verbosity, deepen_relative, set_upstream;
static int progress = -1;
static int enable_auto_gc = 1;
Expand Down Expand Up @@ -118,6 +119,10 @@ static int git_fetch_config(const char *k, const char *v, void *cb)
return 0;
}

if (!strcmp(k, "fetch.writefetchhead")) {
write_fetch_head = git_config_bool(k, v);
return 0;
}
return git_default_config(k, v, cb);
}

Expand Down Expand Up @@ -162,6 +167,8 @@ static struct option builtin_fetch_options[] = {
PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules),
OPT_BOOL(0, "dry-run", &dry_run,
N_("dry run")),
OPT_BOOL(0, "write-fetch-head", &write_fetch_head,
N_("write fetched references to the FETCH_HEAD file")),
OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
OPT_BOOL('u', "update-head-ok", &update_head_ok,
N_("allow updating of HEAD ref")),
Expand Down Expand Up @@ -196,8 +203,10 @@ static struct option builtin_fetch_options[] = {
OPT_STRING_LIST(0, "negotiation-tip", &negotiation_tip, N_("revision"),
N_("report that we have only objects reachable from this object")),
OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
OPT_BOOL(0, "auto-maintenance", &enable_auto_gc,
N_("run 'maintenance --auto' after fetching")),
OPT_BOOL(0, "auto-gc", &enable_auto_gc,
N_("run 'gc --auto' after fetching")),
N_("run 'maintenance --auto' after fetching")),
OPT_BOOL(0, "show-forced-updates", &fetch_show_forced_updates,
N_("check for forced-updates on all updated branches")),
OPT_BOOL(0, "write-commit-graph", &fetch_write_commit_graph,
Expand Down Expand Up @@ -893,7 +902,9 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
const char *what, *kind;
struct ref *rm;
char *url;
const char *filename = dry_run ? "/dev/null" : git_path_fetch_head(the_repository);
const char *filename = (!write_fetch_head
? "/dev/null"
: git_path_fetch_head(the_repository));
int want_status;
int summary_width = transport_summary_width(ref_map);

Expand Down Expand Up @@ -1327,7 +1338,7 @@ static int do_fetch(struct transport *transport,
}

/* if not appending, truncate FETCH_HEAD */
if (!append && !dry_run) {
if (!append && write_fetch_head) {
retcode = truncate_fetch_head();
if (retcode)
goto cleanup;
Expand Down Expand Up @@ -1594,7 +1605,7 @@ static int fetch_multiple(struct string_list *list, int max_children)
int i, result = 0;
struct strvec argv = STRVEC_INIT;

if (!append && !dry_run) {
if (!append && write_fetch_head) {
int errcode = truncate_fetch_head();
if (errcode)
return errcode;
Expand Down Expand Up @@ -1795,6 +1806,10 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
if (depth || deepen_since || deepen_not.nr)
deepen = 1;

/* FETCH_HEAD never gets updated in --dry-run mode */
if (dry_run)
write_fetch_head = 0;

if (all) {
if (argc == 1)
die(_("fetch --all does not take a repository argument"));
Expand Down Expand Up @@ -1882,7 +1897,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
close_object_store(the_repository->objects);

if (enable_auto_gc)
run_auto_gc(verbosity < 0);
run_auto_maintenance(verbosity < 0);

return result;
}