
fetch: add repair: full refetch without negotiation (was: "refiltering") #1138

Closed
wants to merge 7 commits

Conversation

rcoup commented Jan 31, 2022

If a filter is changed on a partial clone repository, for example from blob:none to blob:limit=1m, there is currently no straightforward way to bulk-refetch the objects that match the new filter for existing local commits. This is because the client will report commits as "have" during fetch negotiation and any dependent objects won't be included in the transferred pack. Another use case is discussed at [1].

This patch series introduces a --refetch option to fetch & fetch-pack to enable doing a full fetch without performing any commit negotiation with the remote, as a fresh clone does. It builds upon cbe566a ("negotiator/noop: add noop fetch negotiator", 2020-08-18).

  • Using --refetch will produce duplicated objects between the existing and newly fetched packs, but maintenance will clean them up when it runs automatically post-fetch (if enabled).
  • If a user fetches with --refetch applying a more restrictive partial clone filter than previously (e.g. blob:limit=1m then blob:limit=1k), the eventual state is a no-op, since any referenced object already in the local repository is never removed. More advanced repacking that could improve this scenario is currently proposed at [2].

[1] https://lore.kernel.org/git/aa7b89ee-08aa-7943-6a00-28dcf344426e@syntevo.com/
[2] https://lore.kernel.org/git/21ED346B-A906-4905-B061-EDE53691C586@gmail.com/
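
To illustrate the intended workflow (a hypothetical sketch: the repository URL is made up, but --refetch and remote.<name>.partialclonefilter are the options introduced or referenced by this series):

# Originally cloned as a blobless partial clone:
git clone --filter=blob:none https://example.com/repo.git repo

# Later, widen the stored filter to keep blobs up to 1 MiB:
git -C repo config remote.origin.partialclonefilter blob:limit=1m

# Bulk-refetch matching objects for existing local commits,
# skipping commit negotiation entirely:
git -C repo fetch --refetch origin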

Changes since v3:

  • Mention fetch --refetch in the remote.<name>.partialclonefilter documentation.

Changes since v2:

  • Changed the name from "repair" to "refetch". While it's conceivable to use it in some object DB repair situations that's not the focus of these changes.
  • Pass config options to maintenance via GIT_CONFIG_PARAMETERS
  • Split out auto-maintenance to a separate & more robust test
  • Minor fixes/improvements from reviews by Junio & Ævar

Changes since RFC (v1):

  • Changed the name from "refilter" to "repair"
  • Removed dependency between server-side support for filtering and repair
  • Added a test case for a shallow clone
  • Post-fetch auto maintenance now strongly encourages repacking/consolidation

Reviewed-by: Calvin Wan calvinwan@google.com

CC: Jonathan Tan jonathantanmy@google.com, John Cai johncai86@gmail.com, Jeff Hostetler git@jeffhostetler.com, Junio C Hamano gitster@pobox.com, Derrick Stolee derrickstolee@github.com, Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Calvin Wan calvinwan@google.com

gitgitgadget bot commented Jan 31, 2022

Welcome to GitGitGadget

Hi @rcoup, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about,
  • the commit message's body should describe the "why?" of the change, and
  • finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that by adding a comment of the form /allow to your PR. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use Gmail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@derrickstolee

/allow

gitgitgadget bot commented Jan 31, 2022

User rcoup is now allowed to use GitGitGadget.

rcoup commented Feb 1, 2022

/preview

gitgitgadget bot commented Feb 1, 2022

Preview email sent as pull.1138.git.1643730506.gitgitgadget@gmail.com

rcoup commented Feb 1, 2022

/submit

gitgitgadget bot commented Feb 1, 2022

Submitted as pull.1138.git.1643730593.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1138/rcoup/rc-partial-clone-refilter-v1

To fetch this version to local tag pr-1138/rcoup/rc-partial-clone-refilter-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1138/rcoup/rc-partial-clone-refilter-v1
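
As a usage note, a reviewer could then inspect the series on a scratch branch (the branch name here is illustrative):

git fetch https://github.com/gitgitgadget/git/ pr-1138/rcoup/rc-partial-clone-refilter-v1
git checkout -b review-refetch-v1 FETCH_HEAD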

gitgitgadget bot commented Feb 1, 2022

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:

> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> negotiation and any dependent objects won't be included in the transferred
> pack.

It sounds like a useful thing to have such a "refetch things"
option.

A lazy/partial clone is narrower than the full tree in the width
dimension, while a shallow clone is shallower than the full history
in the time dimension.  The latter already has the "--deepen"
support to say "the commits listed in my shallow boundary list may
claim that I already have these, but I actually don't have them;
please stop lying to the other side and refetch what I should have
fetched earlier".  I understand that this works in the other
dimension to "--widen" things?

Makes me wonder how well these two features work together (or if
they are mutually exclusive, that is fine as well as a starting
point).

If you update the filter specification to make it narrower (e.g. you
start from blob:limit=1m down to blob:limit=512k), would we transfer
nothing (which would be ideal), or would we end up refetching
everything that are smaller than 512k?

> This patch series proposes adding a --refilter option to fetch & fetch-pack
> to enable doing a full fetch with a different filter, as if the local has no
> commits in common with the remote. It builds upon cbe566a071
> ("negotiator/noop: add noop fetch negotiator", 2020-08-18).

I guess the answer to the last question is ...

> To note:
>
>  1. This will produce duplicated objects between the existing and newly
>     fetched packs, but gc will clean them up.

... it is not smart enough to tell them to exclude what we _ought_
to have by telling them what the _old_ filter spec was.  That's OK
for a starting point, I guess.  Hopefully, at the end of this
operation, we should garbage collect the duplicated objects by
default (with an option to turn it off)?

>  2. This series doesn't check that there's a new filter in any way, whether
>     configured via config or passed via --filter=. Personally I think that's
>     fine.

In other words, a repository that used to be a partial clone can
become a full clone by using the option _and_ not giving any filter.
I think that is an intuitive enough behaviour and a natural
consequence to the extreme of what the feature is.  Compared to
making a full "git clone", fetching from the old local (and narrow)
repository into it and then discarding the old one, it would not
have any performance or storage advantage, but it probably is more
convenient.

>  3. If a user fetches with --refilter applying a more restrictive filter
>     than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
>     state is a no-op, since any referenced object already in the local
>     repository is never removed. Potentially this could be improved in
>     future by more advanced gc, possibly along the lines discussed at [2].

OK.  That matches my reaction to 1. above.

gitgitgadget bot commented Feb 2, 2022

On the Git mailing list, Robert Coup wrote (reply to this):

Hi Junio

Thanks for your input. Hopefully some of the other partial-clone
interested folks will chime in too.

On Tue, 1 Feb 2022 at 20:13, Junio C Hamano <gitster@pobox.com> wrote:
>
> It sounds like a useful thing to have such a "refetch things"
> option.

Any improved suggestions on the argument name? I thought of
--refetch but `fetch --refetch` seemed more confusing to explain.

> Makes me wonder how well these two features work together (or if
> they are mutually exclusive, that is fine as well as a starting
> point).

I don't see any particular reason they can't work together - as you say,
the filtering is orthogonal to shallow on a conceptual level. I haven't
added a test for that scenario yet but will do for a v1.

> If you update the filter specification to make it narrower (e.g. you
> start from blob:limit=1m down to blob:limit=512k), would we transfer
> nothing (which would be ideal), or would we end up refetching
> everything that are smaller than 512k?

As you spot, the latter. I can't see a straightforward way of telling the
server "I have these trees/blobs already" without generating (one way
or the other) a list of millions of oids, then transferring & negotiating
with it.

> ... it is not smart enough to stell them to exclude what we _ought_
> to have by telling them what the _old_ filter spec was.  That's OK
> for a starting point, I guess.

The client doesn't really know what the local repository *has* —
potentially several filters could have been applied and used for fetches
at different points in the commit history, as well as objects dynamically
fetched in. Even a filter set in the config only applies to subsequent
fetches, and only if --filter isn't used to override it.

> Hopefully, at the end of this
> operation, we should garbage collect the duplicated objects by
> default (with an option to turn it off)?

I haven't explicitly looked into invoking gc yet, but yes, it'd be a bit of
a waste if it wasn't kicked off by default. Maybe reusing gc.auto

> In other words, a repository that used to be a partial clone can
> become a full clone by using the option _and_ not giving any filter.

For that specific case I think you can already do it by removing the
promisor flag in the remote config, potentially adding it back if you
wanted to keep it partial again from that point forward.

> I think that is an intuitive enough behaviour and a natural
> consequence to the extreme of what the feature is.  Compared to
> making a full "git clone", fetching from the old local (and narrow)
> repository into it and then discarding the old one, it would not
> have any performance or storage advantage, but it probably is more
> convenient.

It's certainly cleaner than abusing --deepen, or temporarily moving pack
files out of the way, or starting over with a fresh clone & copying config.

Thanks,

Rob :)
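
A sketch of that conversion using the new option (assuming the remote is named origin; these exact steps are an editorial illustration, not part of the series):

# Clear the stored filter so the refetch is unfiltered:
git config --unset remote.origin.partialclonefilter
# Refetch every object, as a fresh clone would:
git fetch --refetch origin
# Optionally stop treating origin as a promisor remote afterwards:
git config --unset remote.origin.promisor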

gitgitgadget bot commented Feb 2, 2022

On the Git mailing list, Jonathan Tan wrote (reply to this):

"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:
> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> negotiation and any dependent objects won't be included in the transferred
> pack. Another use case is discussed at [1].

Reporting commits as "have" can be solved by forcing the noop
negotiator, but there is another issue (which you seem to have
discovered, glancing through your patches) in that fetch will abort (and
report success) if all wanted commits are present, even if not all
objects directly or indirectly referenced by those commits are present.

> This patch series proposes adding a --refilter option to fetch & fetch-pack
> to enable doing a full fetch with a different filter, as if the local has no
> commits in common with the remote. It builds upon cbe566a071
> ("negotiator/noop: add noop fetch negotiator", 2020-08-18).

Thanks - I think this is a useful feature. This is useful even in a
non-partial-clone repo, to repair objects that were, say, accidentally
deleted from the local object store.

If it's acceptable to have a separate command to configure the new
filter in the repo config (or to delete it, if we want to convert a
partial clone into a regular repo), I think it's clearer to name this
option "--repair" or something like that, and explain it as a fetch that
does not take into account the contents of the local object store (not
as a fetch that changes the filter).

Having said that, the overall feature is worth having. While we decide
on the name, the implementation of this will not change much. I'll try
to review the implementation by the end of this week (but other reviews
are welcome too, needless to say).

> To note:
> 
>  1. This will produce duplicated objects between the existing and newly
>     fetched packs, but gc will clean them up.

Noted. This might be an argument for naming the option "--repair", since
the user would probably understand that there would be duplicate
objects.

>  2. This series doesn't check that there's a new filter in any way, whether
>     configured via config or passed via --filter=. Personally I think that's
>     fine.

Agreed.

>  3. If a user fetches with --refilter applying a more restrictive filter
>     than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
>     state is a no-op, since any referenced object already in the local
>     repository is never removed. Potentially this could be improved in
>     future by more advanced gc, possibly along the lines discussed at [2].

That's true.

gitgitgadget bot commented Feb 2, 2022

On the Git mailing list, Robert Coup wrote (reply to this):

Hi Jonathan,

On Wed, 2 Feb 2022 at 19:00, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Thanks - I think this is a useful feature. This is useful even in a
> non-partial-clone repo, to repair objects that were, say, accidentally
> deleted from the local object store.

I'd

> If it's acceptable to have a separate command to configure the new
> filter in the repo config (or to delete it, if we want to convert a
> partial clone into a regular repo), I think it's clearer to name this
> option "--repair" or something like that, and explain it as a fetch that
> does not take into account the contents of the local object store (not
> as a fetch that changes the filter).

I quite like --repair, since the implementation really has zero to do with
filtering or partial clones beyond that being my use case for it. Specifying
a filter, shallow options, or even using a promisor remote aren't
even slightly necessary for the implementation as it turns out.

And as you say, that makes it easier to explain too:

"fetch --repair will fetch all objects from the remote (applying any filters
or shallow options as requested). Unlike a normal fetch it does not take into
account any content already in the local repository and acts more like an
initial clone. Any duplicate objects will get cleaned up during subsequent
maintenance."

"If you want to update your local repository with a different partial clone
filter, use `fetch --repair` to re-download all matching objects from the
remote."

> I'll try
> to review the implementation by the end of this week (but other reviews
> are welcome too, needless to say).

Thanks!

Rob :)

gitgitgadget bot commented Feb 2, 2022

On the Git mailing list, Robert Coup wrote (reply to this):

On Wed, 2 Feb 2022 at 21:58, Robert Coup <robert@coup.net.nz> wrote:
>
> Hi Jonathan,
>
> On Wed, 2 Feb 2022 at 19:00, Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > Thanks - I think this is a useful feature. This is useful even in a
> > non-partial-clone repo, to repair objects that were, say, accidentally
> > deleted from the local object store.
>
> I'd

I'd be keen to hear the stories here :) Just losing loose object files?

Rob.

@@ -312,19 +312,21 @@ static int find_common(struct fetch_negotiator *negotiator,
const char *remote_hex;

On the Git mailing list, Jonathan Tan wrote (reply to this):

"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:
> @@ -312,19 +312,21 @@ static int find_common(struct fetch_negotiator *negotiator,
>  		const char *remote_hex;
>  		struct object *o;
>  
> -		/*
> -		 * If that object is complete (i.e. it is an ancestor of a
> -		 * local ref), we tell them we have it but do not have to
> -		 * tell them about its ancestors, which they already know
> -		 * about.
> -		 *
> -		 * We use lookup_object here because we are only
> -		 * interested in the case we *know* the object is
> -		 * reachable and we have already scanned it.
> -		 */
> -		if (((o = lookup_object(the_repository, remote)) != NULL) &&
> -				(o->flags & COMPLETE)) {
> -			continue;
> +		if (!args->refilter) {
> +			/*
> +			* If that object is complete (i.e. it is an ancestor of a
> +			* local ref), we tell them we have it but do not have to
> +			* tell them about its ancestors, which they already know
> +			* about.
> +			*
> +			* We use lookup_object here because we are only
> +			* interested in the case we *know* the object is
> +			* reachable and we have already scanned it.
> +			*/
> +			if (((o = lookup_object(the_repository, remote)) != NULL) &&
> +					(o->flags & COMPLETE)) {
> +				continue;
> +			}

The approach that I would have expected is to not call
mark_complete_and_common_ref(), filter_refs(), everything_local(), and
find_common(), but your approach here is to ensure that
mark_complete_and_common_ref() and find_common() do not do anything.
Comparing the two approaches, the advantage of yours is that we only
need to make the change once to support both protocol v0 and v2
(although the change looks more substantial than just skipping function
calls), but it makes the code more difficult to read in that we now have
function calls that do nothing. What do you think about my approach?


On the Git mailing list, Robert Coup wrote (reply to this):

Hi Jonathan,

Thanks for taking a look at this.

On Fri, 4 Feb 2022 at 18:02, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> The approach that I would have expected is to not call
> mark_complete_and_common_ref(), filter_refs(), everything_local(), and
> find_common(), but your approach here is to ensure that
> mark_complete_and_common_ref() and find_common() do not do anything.

v0: find_common() definitely still does something: during refiltering it skips
checking the local object db, but it's still responsible for sending
the "wants".

filter_refs() is necessary under v0 & v2 so the remote refs all get marked as
matched, otherwise we end up erroring after the transfer with
"error: no such remote ref refs/heads/main" etc.

> Comparing the two approaches, the advantage of yours is that we only
> need to make the change once to support both protocol v0 and v2
> (although the change looks more substantial than just skipping function
> calls), but it makes the code more difficult to read in that we now have
> function calls that do nothing. What do you think about my approach?

My original approach was to leave the negotiator in place, and just conditional
around it in do_fetch_pack_v2 — this worked ok for protocol v2 but the v0
implementation didn't work. After that I switched to forcing noop in [1/6]
which made both implementations match (& tidier imo).

To make the test pass and skip those calls I need a patch like the below
— filter_refs() is still required during refiltering for the ref-matching. To me
this looks more complicated, but I'm happy to defer to your thinking.

Thanks,

Rob :)


diff --git a/fetch-pack.c b/fetch-pack.c
index dd67044165..870bfba267 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1125,15 +1125,16 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
        negotiator = &negotiator_alloc;
        if (is_refiltering) {
                fetch_negotiator_init_noop(negotiator);
+               filter_refs(args, &ref, sought, nr_sought);
        } else {
                fetch_negotiator_init(r, negotiator);
-       }

-       mark_complete_and_common_ref(negotiator, args, &ref);
-       filter_refs(args, &ref, sought, nr_sought);
-       if (!is_refiltering && everything_local(args, &ref)) {
-               packet_flush(fd[1]);
-               goto all_done;
+               mark_complete_and_common_ref(negotiator, args, &ref);
+               filter_refs(args, &ref, sought, nr_sought);
+               if (everything_local(args, &ref)) {
+                       packet_flush(fd[1]);
+                       goto all_done;
+               }
        }
        if (find_common(negotiator, args, fd, &oid, ref) < 0)
                if (!args->keep_pack)
@@ -1615,13 +1616,18 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
                        if (args->depth > 0 || args->deepen_since || args->deepen_not)
                                args->deepen = 1;

-                       /* Filter 'ref' by 'sought' and those that aren't local */
-                       mark_complete_and_common_ref(negotiator, args, &ref);
-                       filter_refs(args, &ref, sought, nr_sought);
-                       if (!args->refilter && everything_local(args, &ref))
-                               state = FETCH_DONE;
-                       else
+                       if (args->refilter) {
+                               filter_refs(args, &ref, sought, nr_sought);
                                state = FETCH_SEND_REQUEST;
+                       } else {
+                               /* Filter 'ref' by 'sought' and those that aren't local */
+                               mark_complete_and_common_ref(negotiator, args, &ref);
+                               filter_refs(args, &ref, sought, nr_sought);
+                               if (everything_local(args, &ref))
+                                       state = FETCH_DONE;
+                               else
+                                       state = FETCH_SEND_REQUEST;
+                       }

                        mark_tips(negotiator, args->negotiation_tips);
                        for_each_cached_alternate(negotiator,


On the Git mailing list, Jonathan Tan wrote (reply to this):

Robert Coup <robert@coup.net.nz> writes:
> On Fri, 4 Feb 2022 at 18:02, Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > The approach that I would have expected is to not call
> > mark_complete_and_common_ref(), filter_refs(), everything_local(), and
> > find_common(), but your approach here is to ensure that
> > mark_complete_and_common_ref() and find_common() do not do anything.
> 
> v0: find_common() definitely still does something: during refiltering it skips
> checking the local object db, but it's still responsible for sending
> the "wants".
> 
> filter_refs() is necessary under v0 & v2 so the remote refs all get marked as
> matched, otherwise we end up erroring after the transfer with
> "error: no such remote ref refs/heads/main" etc.
> 
> > Comparing the two approaches, the advantage of yours is that we only
> > need to make the change once to support both protocol v0 and v2
> > (although the change looks more substantial than just skipping function
> > calls), but it makes the code more difficult to read in that we now have
> > function calls that do nothing. What do you think about my approach?
> 
> My original approach was to leave the negotiator in place, and just conditional
> around it in do_fetch_pack_v2 — this worked ok for protocol v2 but the v0
> implementation didn't work. After that I switched to forcing noop in [1/6]
> which made both implementations match (& tidier imo).
> 
> To make the test pass and skip those calls I need a patch like the below
> — filter_refs() is still required during refiltering for the ref-matching. To me
> this looks more complicated, but I'm happy to defer to your thinking.
> 
> Thanks,
> 
> Rob :)

Thanks for investigating this; looks like I was perhaps too optimistic
in thinking that the code could be reorganized in the way I'm
suggesting.

I think that it's worth checking if we could refactor the parts that are
needed with --refilter out of filter_refs() and find_common() (and in
doing so, make these functions do only what their names imply). I
don't have time to do so right now, but I don't want to unnecessarily
block this patch set either, so I'll say that ideally another reviewer
(or you) would do that investigation, but I have no objection to merging
this patch set in this state (other than the change of the argument
name, perhaps to "--repair", and associated documentation).

gitgitgadget bot commented Feb 7, 2022

On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 2/1/22 10:49 AM, Robert Coup via GitGitGadget wrote:
> If a filter is changed on a partial clone repository, for example from
> blob:none to blob:limit=1m, there is currently no straightforward way to
> bulk-refetch the objects that match the new filter for existing local
> commits. This is because the client will report commits as "have" during
> negotiation and any dependent objects won't be included in the transferred
> pack. Another use case is discussed at [1].
> 
> This patch series proposes adding a --refilter option to fetch & fetch-pack
> to enable doing a full fetch with a different filter, as if the local has no
> commits in common with the remote. It builds upon cbe566a071
> ("negotiator/noop: add noop fetch negotiator", 2020-08-18).
> 
> To note:
> 
>   1. This will produce duplicated objects between the existing and newly
>      fetched packs, but gc will clean them up.
>   2. This series doesn't check that there's a new filter in any way, whether
>      configured via config or passed via --filter=. Personally I think that's
>      fine.
>   3. If a user fetches with --refilter applying a more restrictive filter
>      than previously (eg: blob:limit=1m then blob:limit=1k) the eventual
>      state is a no-op, since any referenced object already in the local
>      repository is never removed. Potentially this could be improved in
>      future by more advanced gc, possibly along the lines discussed at [2].
> 

Yes, it would be nice to have a way to efficiently extend a
partial clone with a more inclusive filter.

It would be nice to be able to send the old filter-spec and the
new filter-spec and ask the server to send "new && !old" to keep
from having to resend the objects that the client already has.
But I'm not sure we know enough (on either side) to make that
computation.  And as you say, there is no guarantee that the client
has only used one filter in the past.

Jeff

gitgitgadget bot commented Feb 7, 2022

User Jeff Hostetler <git@jeffhostetler.com> has been added to the cc: list.

gitgitgadget bot commented Feb 16, 2022

On the Git mailing list, Robert Coup wrote (reply to this):

Hi Derrick,

On Wed, 2 Feb 2022 at 15:02, Robert Coup <robert@coup.net.nz> wrote:
> On Tue, 1 Feb 2022 at 20:13, Junio C Hamano <gitster@pobox.com> wrote:
> > "Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > >  1. This will produce duplicated objects between the existing and newly
> > >     fetched packs, but gc will clean them up.
> >
> > Hopefully, at the end of this
> > operation, we should garbage collect the duplicated objects by
> > default (with an option to turn it off)?
>
> I haven't explicitly looked into invoking gc yet, but yes, it'd be a bit of
> a waste if it wasn't kicked off by default. Maybe reusing gc.auto

Just looking into this: after a fetch, run_auto_maintenance() is called, which
invokes `git maintenance run --auto` (being the successor to `git gc --auto`)...

In the refilter (repair) case, we'll have a new pack which substantially
duplicates what's in our existing packs. I'm after some advice here: I'm not
sure whether I want to encourage a gc pack consolidation, an incremental
repack, both, neither, or something else?

My current train of thought is it invokes `git maintenance run --auto` with
gc.autoPackLimit=1 to force a gc pack consolidation if it's not currently
=0 (disabled), and with maintenance.incremental-repack.auto=-1 if it's not
currently =0 (disabled) to force an incremental repack if someone doesn't
want to do pack consolidations.

In all cases the various enabled-ness settings should still be honoured to
skip tasks if the user doesn't want them ever happening automatically.

Implementation-wise I guess we'll need run_auto_maintenance() to be able
to pass some config options through to the maintenance subprocess.

Does this sound like a reasonable approach?

Thanks,

Rob :)
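
Roughly, the approach described above amounts to the following invocation (a sketch; the series ultimately passes these overrides to the child process via GIT_CONFIG_PARAMETERS rather than -c):

# Encourage pack consolidation after a refetch, while still honouring
# gc.autoPackLimit=0 / maintenance.incremental-repack.auto=0 if the
# user has disabled those tasks:
git -c gc.autoPackLimit=1 \
    -c maintenance.incremental-repack.auto=-1 \
    maintenance run --auto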

@rcoup rcoup changed the title [RFC] partial-clone: add ability to refetch with expanded filter fetch: add repair: full refetch without negotiation Feb 16, 2022
@rcoup rcoup force-pushed the rc-partial-clone-refilter branch 5 times, most recently from b80ca59 to cd072de on February 24, 2022 00:25
rcoup commented Feb 24, 2022

/preview

gitgitgadget bot commented Feb 24, 2022

Preview email sent as pull.1138.v2.git.1645715098.gitgitgadget@gmail.com

@rcoup rcoup changed the title fetch: add repair: full refetch without negotiation fetch: add repair: full refetch without negotiation (was: "refiltering") Feb 24, 2022
rcoup commented Feb 24, 2022

/submit

gitgitgadget bot commented Feb 24, 2022

Submitted as pull.1138.v2.git.1645719218.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1138/rcoup/rc-partial-clone-refilter-v2

To fetch this version to local tag pr-1138/rcoup/rc-partial-clone-refilter-v2:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1138/rcoup/rc-partial-clone-refilter-v2

Add a test for doing a refetch to apply a changed partial clone filter
under protocol v0 and v2.

Signed-off-by: Robert Coup <robert@coup.net.nz>
After invoking `fetch --refetch`, the object db will likely contain many
duplicate objects. If auto-maintenance is enabled, invoke it with
appropriate settings to encourage repacking/consolidation.

* gc.autoPackLimit: unless this is set to 0 (disabled), override the
  value to 1 to force pack consolidation.
* maintenance.incremental-repack.auto: unless this is set to 0, override
  the value to -1 to force incremental repacking.

Signed-off-by: Robert Coup <robert@coup.net.nz>
Document it for partial clones as a means to apply a new filter, and
reference it from the remote.<name>.partialclonefilter config parameter.

Signed-off-by: Robert Coup <robert@coup.net.nz>
@rcoup rcoup force-pushed the rc-partial-clone-refilter branch from f923a06 to da1e6de on March 28, 2022 12:47
rcoup commented Mar 28, 2022

/preview

gitgitgadget bot commented Mar 28, 2022

Preview email sent as pull.1138.v4.git.1648475699.gitgitgadget@gmail.com

rcoup commented Mar 28, 2022

/submit

gitgitgadget bot commented Mar 28, 2022

Submitted as pull.1138.v4.git.1648476131.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1138/rcoup/rc-partial-clone-refilter-v4

To fetch this version to local tag pr-1138/rcoup/rc-partial-clone-refilter-v4:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1138/rcoup/rc-partial-clone-refilter-v4

@@ -82,5 +82,7 @@ remote.<name>.promisor::
objects.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  remote.<name>.partialclonefilter::
> -	The filter that will be applied when fetching from this
> -	promisor remote.
> +	The filter that will be applied when fetching from this promisor remote.
> +	Changing or clearing this value will only affect fetches for new commits.
> +	To fetch associated objects for commits already present in the local object
> +	database, use the `--refetch` option of linkgit:git-fetch[1].

Good advice to add.

Will replace.  I think we've seen this topic enough times and it
looked reasonably well done.  Let's mark it for 'next' unless we
hear objections soonish.

gitgitgadget bot commented Mar 28, 2022

This patch series was integrated into seen via git@9304375.

gitgitgadget bot commented Mar 29, 2022

There was a status update in the "Cooking" section about the branch rc/fetch-refetch on the Git mailing list:

"git fetch --refetch" learned to fetch everything without telling
the other side what we already have, which is useful when you
cannot trust what you have in the local object store.

Will merge to 'next'?
source: <pull.1138.v4.git.1648476131.gitgitgadget@gmail.com>

gitgitgadget bot commented Mar 29, 2022

This patch series was integrated into seen via git@2d72be0.

gitgitgadget bot commented Mar 30, 2022

This patch series was integrated into seen via git@229b9ce.

gitgitgadget bot commented Mar 31, 2022

This patch series was integrated into seen via git@6c9ab52.

gitgitgadget bot commented Mar 31, 2022

There was a status update in the "Cooking" section about the branch rc/fetch-refetch on the Git mailing list:

"git fetch --refetch" learned to fetch everything without telling
the other side what we already have, which is useful when you
cannot trust what you have in the local object store.

Will merge to 'next'?
source: <pull.1138.v4.git.1648476131.gitgitgadget@gmail.com>

@@ -312,19 +312,21 @@ static int find_common(struct fetch_negotiator *negotiator,
const char *remote_hex;

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 28 2022, Robert Coup via GitGitGadget wrote:

> From: Robert Coup <robert@coup.net.nz>
>
> Allow a "refetch" where the contents of the local object store are
> ignored and a full fetch is performed, not attempting to find or
> negotiate common commits with the remote.
>
> A key use case is to apply a new partial clone blob/tree filter and
> refetch all the associated matching content, which would otherwise not
> be transferred when the commit objects are already present locally.

FWIW it's not clear to me re earlier comments on earlier iterations
whether the "noop fetch" is per-se wanted for this feature for some
reason, or if it's just much easier to implement than doing what I
suggested in:
https://lore.kernel.org/git/220228.86ee3m39jf.gmgdl@evledraar.gmail.com/

I don't think such a thing should hold this series up, but as it would
be a bit kinder to servers I think it's worth at least noting in the
commit message what's desired per-se here, v.s. what's just needed for
the convenience of implementation.

I.e. when this series was in an earlier iteration the scope was to
repair repository corruption, which I pointed out to you it really
couldn't do without more wider changes to the object store management,
and at that point having it be NOOP definitely makes sense. The object
lookups etc. take shortcuts that "fsck" wouldn't do, so we could be
negotiating on the basis of corrupt content.

But now that it's a "fetch what's missing" wouldn't it make more sense
to descend from our otherwise-negotiated tips, and find the OIDs that
are "complete", if any, and negotiate with those?

Which again, I think it's fine to say "yeah, that would be ideal, but
this is easier". I'm just checking if I'm missing some subtlety here...

> Signed-off-by: Robert Coup <robert@coup.net.nz>
> ---
>  fetch-pack.c | 46 +++++++++++++++++++++++++++++-----------------
>  fetch-pack.h |  1 +
>  2 files changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 87657907e78..4e1e88eea09 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -312,19 +312,21 @@ static int find_common(struct fetch_negotiator *negotiator,
>  		const char *remote_hex;
>  		struct object *o;
>  
> -		/*
> -		 * If that object is complete (i.e. it is an ancestor of a
> -		 * local ref), we tell them we have it but do not have to
> -		 * tell them about its ancestors, which they already know
> -		 * about.
> -		 *
> -		 * We use lookup_object here because we are only
> -		 * interested in the case we *know* the object is
> -		 * reachable and we have already scanned it.
> -		 */
> -		if (((o = lookup_object(the_repository, remote)) != NULL) &&
> -				(o->flags & COMPLETE)) {
> -			continue;
> +		if (!args->refetch) {
> +			/*
> +			* If that object is complete (i.e. it is an ancestor of a
> +			* local ref), we tell them we have it but do not have to
> +			* tell them about its ancestors, which they already know
> +			* about.
> +			*
> +			* We use lookup_object here because we are only
> +			* interested in the case we *know* the object is
> +			* reachable and we have already scanned it.
> +			*/
> +			if (((o = lookup_object(the_repository, remote)) != NULL) &&
> +					(o->flags & COMPLETE)) {
> +				continue;
> +			}

nit: remove the {} here
nit: this double-indented if can just be one if-statement
nit: don't compare against NULL, use ! instead.
>  		}
>  
>  		remote_hex = oid_to_hex(remote);
> @@ -692,6 +694,9 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
>  	int old_save_commit_buffer = save_commit_buffer;
>  	timestamp_t cutoff = 0;
>  
> +	if (args->refetch)
> +		return;
> +
>  	save_commit_buffer = 0;

nit: This function has only two callers, perhaps it's clearer to do
this "early abort" in those calls?

>  	trace2_region_enter("fetch-pack", "parse_remote_refs_and_find_cutoff", NULL);
> @@ -1028,7 +1033,11 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
>  	struct fetch_negotiator *negotiator;
>  
>  	negotiator = &negotiator_alloc;
> -	fetch_negotiator_init(r, negotiator);
> +	if (args->refetch) {
> +		fetch_negotiator_init_noop(negotiator);
> +	} else {
> +		fetch_negotiator_init(r, negotiator);
> +	}

More needless {}

>  	sort_ref_list(&ref, ref_compare_name);
>  	QSORT(sought, nr_sought, cmp_ref_by_name);
> @@ -1121,7 +1130,7 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
>  
>  	mark_complete_and_common_ref(negotiator, args, &ref);
>  	filter_refs(args, &ref, sought, nr_sought);
> -	if (everything_local(args, &ref)) {
> +	if (!args->refetch && everything_local(args, &ref)) {
>  		packet_flush(fd[1]);
>  		goto all_done;
>  	}

Here everything_local() is doing what I suggested for
mark_complete_and_common_ref() above, i.e. we check args->refetch first.

> @@ -1587,7 +1596,10 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
>  	struct strvec index_pack_args = STRVEC_INIT;
>  
>  	negotiator = &negotiator_alloc;
> -	fetch_negotiator_init(r, negotiator);
> +	if (args->refetch)
> +		fetch_negotiator_init_noop(negotiator);
> +	else
> +		fetch_negotiator_init(r, negotiator);

This one doesn't have braces (good), unlike do_fetch_pack()


On the Git mailing list, Robert Coup wrote (reply to this):

Hi Ævar,

< just when I was thinking I'm done... ;-) >

On Thu, 31 Mar 2022 at 16:17, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> FWIW it's not clear to me re earlier comments on earlier iterations
> whether the "noop fetch" is per-se wanted for this feature for some
> reason, or if it's just much easier to implement than doing what I
> suggested in:
> https://lore.kernel.org/git/220228.86ee3m39jf.gmgdl@evledraar.gmail.com/

The noop-fetch is an implementation detail IMO, and can be improved
if/when someone is motivated later.

> I don't think such a thing should hold this series up, but as it would
> be a bit kinder to servers I think it's worth at least noting in the
> commit message what's desired per-se here, v.s. what's just needed for
> the convenience of implementation.

Yes, doing some per-commit negotiation is conceptually nicer. The
user-facing alternative at this point is basically running "clone" in
some form; or temporarily moving aside the object DB and running
fetch. Both those (and the new approach) all end up putting the same
load on the server.

I can note it in the commit message.

> I.e. when this series was in an earlier iteration the scope was to
> repair repository corruption

That was never _my_ motivation, I was a bit eager in reflecting some
comments I received suggesting it would be useful for that as well.

> But now that it's a "fetch what's missing" wouldn't it make more sense
> to descend from our otherwise-negotiated tips, and find the OIDs that
> are "complete", if any, and negotiate with those?
>
> Which again, I think it's fine to say "yeah, that would be ideal, but
> this is easier". I'm just checking if I'm missing some subtlety here...

Yeah, that would be ideal, but this is easier. AFAICS I'm not putting
in place any barriers to making it smarter later.

> nit: remove the {} here
> nit: this double-indented if can just be one if-statement
> nit: don't compare against NULL, use ! instead.

Is the common interpretation for nits "Do another re-roll"; "If you're
doing another re-roll"; or "For future reference"?

> nit: This function has only two callers, perhaps it's clearer to do do
> this "early abort" in those calls?

I discussed similar in
https://lore.kernel.org/git/CACf-nVePhtm_HAzAKzcap0E8kiyyEJPY_+N+bbPcYPVUkjweFg@mail.gmail.com/
but yes, I think you're right about mark_complete_and_common_ref().
Will see what it looks like.

Thanks, Rob.

@@ -163,6 +163,15 @@ endif::git-pull[]
behavior for a remote may be specified with the remote.<name>.tagOpt

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 28 2022, Robert Coup via GitGitGadget wrote:

> From: Robert Coup <robert@coup.net.nz>
>
> Teach fetch and transports the --refetch option to force a full fetch
> without negotiating common commits with the remote. Use when applying a
> new partial clone filter to refetch all matching objects.
>
> [...]
> +ifndef::git-pull[]
> +--refetch::
> +	Instead of negotiating with the server to avoid transferring commits and
> +	associated objects that are already present locally, this option fetches
> +	all objects as a fresh clone would. Use this to reapply a partial clone
> +	filter from configuration or using `--filter=` when the filter
> +	definition has changed.
> +endif::git-pull[]

Re my comment on negotiation specifics in 2/7, this documentation is
really over-promising depending on what the answer to that is:
https://lore.kernel.org/git/220331.86o81mp2w1.gmgdl@evledraar.gmail.com/

I.e. instead of saying that we WILL fetch all objects "just like a
clone" shouldn't we have less focus on implementation details here, and
assure the user that we'll make their object store "complete" as though
--filter hadn't been used, without going into the specifics of what'll
happen over the wire?
>  		return -1;
>  
> +	/*
> +	 * Similarly, if we need to refetch, we always want to perform a full
> +	 * fetch ignoring existing objects.
> +	 */
> +	if (refetch)
> +		return -1;
> +
> +

One too many \n here.


On the Git mailing list, Robert Coup wrote (reply to this):

Hi Ævar,

On Thu, 31 Mar 2022 at 16:20, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> > +--refetch::
> > +     Instead of negotiating with the server to avoid transferring commits and
> > +     associated objects that are already present locally, this option fetches
> > +     all objects as a fresh clone would. Use this to reapply a partial clone
> > +     filter from configuration or using `--filter=` when the filter
> > +     definition has changed.
>
> Re my comment on negotiation specifics in 2/7, this documentation is
> really over-promising depending on what the answer to that is:
> https://lore.kernel.org/git/220331.86o81mp2w1.gmgdl@evledraar.gmail.com/
>
> I.e. instead of saying that we WILL fetch all objects "just like a
> clone" shouldn't we have less focus on implementation details here, and
> assure the user that we'll make their object store "complete" as though
> --filter hadn't been used, without going into the specifics of what'll
> happen over the wire?

That's not quite the case though: we'll make their object db complete
as if a fresh clone with any specified --filter (including none) from
that remote has been done.

IMO explaining what will happen and to expect (particularly as this is
_not_ transferring some sort of object-delta-set, clone transfers can
be big) is a good thing. If it improves later; then change the docs.

Alternatively, rewording along the lines of "it will fix up your
object db to match a changed filter" is implying that only the missing
bits will be fetched.

Thanks, Rob.

@@ -166,6 +166,56 @@ test_expect_success 'manual prefetch of missing objects' '
test_line_count = 0 observed.oids

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 28 2022, Robert Coup via GitGitGadget wrote:

> From: Robert Coup <robert@coup.net.nz>
>
> Add a test for doing a refetch to apply a changed partial clone filter
> under protocol v0 and v2.
>
> Signed-off-by: Robert Coup <robert@coup.net.nz>
> ---
>  t/t5616-partial-clone.sh | 52 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index 34469b6ac10..87ebf4b0b1c 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -166,6 +166,56 @@ test_expect_success 'manual prefetch of missing objects' '
>  	test_line_count = 0 observed.oids
>  '
>  
> +# create new commits in "src" repo to establish a history on file.4.txt
> +# and push to "srv.bare".
> +test_expect_success 'push new commits to server for file.4.txt' '
> +	for x in a b c d e f
> +	do
> +		echo "Mod file.4.txt $x" >src/file.4.txt &&
> +		if list_contains "a,b" "$x"; then
> +			printf "%10000s" X >>src/file.4.txt
> +		fi &&
> +		if list_contains "c,d" "$x"; then
> +			printf "%20000s" X >>src/file.4.txt
> +		fi &&
> +		git -C src add file.4.txt &&
> +		git -C src commit -m "mod $x" || return 1
> +	done &&
> +	git -C src push -u srv main
> +'
> +
> +# Do partial fetch to fetch smaller files; then verify that without --refetch
> +# applying a new filter does not refetch missing large objects. Then use
> +# --refetch to apply the new filter on existing commits. Test it under both
> +# protocol v2 & v0.
> +test_expect_success 'apply a different filter using --refetch' '
> +	git -C pc1 fetch --filter=blob:limit=999 origin &&
> +	git -C pc1 rev-list --quiet --objects --missing=print \
> +		main..origin/main >observed &&
> +	test_line_count = 4 observed &&
> +
> +	git -C pc1 fetch --filter=blob:limit=19999 --refetch origin &&

Is 19999 just "arbitrary big number" here?

> +	git -C pc1 rev-list --quiet --objects --missing=print \
> +		main..origin/main >observed &&
> +	test_line_count = 2 observed &&
> +
> +	git -c protocol.version=0 -C pc1 fetch --filter=blob:limit=29999 \
> +		--refetch origin &&
> +	git -C pc1 rev-list --quiet --objects --missing=print \
> +		main..origin/main >observed &&
> +	test_line_count = 0 observed

Does this test_line_count *really* want to be = 0, or does this mean
test_must_be_empty?

I.e. are we expecting content here, just not ending in a \n, or nothing
at all?

> +'
> +
> +test_expect_success 'fetch --refetch works with a shallow clone' '
> +	git clone --no-checkout --depth=1 --filter=blob:none "file://$(pwd)/srv.bare" pc1s &&
> +	git -C pc1s rev-list --objects --missing=print HEAD >observed &&
> +	test_line_count = 6 observed &&
> +
> +	GIT_TRACE=1 git -C pc1s fetch --filter=blob:limit=999 --refetch origin &&

Why the GIT_TRACE=1? Seems to not be used.


On the Git mailing list, Robert Coup wrote (reply to this):

Hi,

On Thu, 31 Mar 2022 at 16:22, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>

> > +
> > +     git -C pc1 fetch --filter=blob:limit=19999 --refetch origin &&
>
> Is 19999 just "arbitrary big number" here?

A selected big number so we can change it and have different objects match.

>
> > +     git -C pc1 rev-list --quiet --objects --missing=print \
> > +             main..origin/main >observed &&
> > +     test_line_count = 2 observed &&
> > +
> > +     git -c protocol.version=0 -C pc1 fetch --filter=blob:limit=29999 \
> > +             --refetch origin &&
> > +     git -C pc1 rev-list --quiet --objects --missing=print \
> > +             main..origin/main >observed &&
> > +     test_line_count = 0 observed
>
> Does this test_line_count *really* want to be = 0, or does this mean
> test_must_be_empty?
>
> I.e. are we expecting content here, just not ending in a \n, or nothing
> at all?

I'm expecting no missing objects after a refetch with a new filter
that matches all the objects in the repo.

> > +     GIT_TRACE=1 git -C pc1s fetch --filter=blob:limit=999 --refetch origin &&
>
> Why the GIT_TRACE=1? Seems to not be used.

Ah, extraneous debugging on my part.

Thanks, Rob :)

@@ -163,6 +163,16 @@ endif::git-pull[]
behavior for a remote may be specified with the remote.<name>.tagOpt

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 28 2022, Robert Coup via GitGitGadget wrote:

> From: Robert Coup <robert@coup.net.nz>
>
> After invoking `fetch --refetch`, the object db will likely contain many
> duplicate objects. If auto-maintenance is enabled, invoke it with
> appropriate settings to encourage repacking/consolidation.
>
> * gc.autoPackLimit: unless this is set to 0 (disabled), override the
>   value to 1 to force pack consolidation.
> * maintenance.incremental-repack.auto: unless this is set to 0, override
>   the value to -1 to force incremental repacking.
>
> Signed-off-by: Robert Coup <robert@coup.net.nz>
> ---
>  Documentation/fetch-options.txt |  3 ++-
>  builtin/fetch.c                 | 19 ++++++++++++++++++-
>  t/t5616-partial-clone.sh        | 29 +++++++++++++++++++++++++++++
>  3 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
> index d03fce5aae0..622bd84768b 100644
> --- a/Documentation/fetch-options.txt
> +++ b/Documentation/fetch-options.txt
> @@ -169,7 +169,8 @@ ifndef::git-pull[]
>  	associated objects that are already present locally, this option fetches
>  	all objects as a fresh clone would. Use this to reapply a partial clone
>  	filter from configuration or using `--filter=` when the filter
> -	definition has changed.
> +	definition has changed. Automatic post-fetch maintenance will perform
> +	object database pack consolidation to remove any duplicate objects.
>  endif::git-pull[]
>  
>  --refmap=<refspec>::
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index e391a5dbc55..e3791f09ed5 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -2306,8 +2306,25 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>  					     NULL);
>  	}
>  
> -	if (enable_auto_gc)
> +	if (enable_auto_gc) {
> +		if (refetch) {
> +			/*
> +			 * Hint auto-maintenance strongly to encourage repacking,
> +			 * but respect config settings disabling it.
> +			 */
> +			int opt_val;

nit: add a \n after this.

> +			if (git_config_get_int("gc.autopacklimit", &opt_val))
> +				opt_val = -1;
> +			if (opt_val != 0)

nit: don't compare against 0 or null,  just !opt_val

Isn't this whole thing also clearer as:

	int &forget;

        if (git_conf...(..., &forget))
		git_config_push_parameter("gc.autoPackLimit=1");

Maybe I haven't eyeballed this enough, but aren't you ignoring explicit
gc.autoPackLimit=0 configuration? Whereas what you seem to want is "set
this config unlress the user has it set", for which we only need to
check the git_config...(...) return value, no?

> +				git_config_push_parameter("gc.autoPackLimit=1");
> +
> +			if (git_config_get_int("maintenance.incremental-repack.auto", &opt_val))
> +				opt_val = -1;
> +			if (opt_val != 0)
> +				git_config_push_parameter("maintenance.incremental-repack.auto=-1");

hrm, do we really need to set both of these these days (not saying we
don't, just surprised). I.e. both gc.* an maintenance.* config.

*skims the code*

Urgh, yes? too_many_packs() seems to check gc.* only, but
incremental_repack_auto_condition() check this variable... :(

> +test_expect_success 'fetch --refetch triggers repacking' '
> +	GIT_TRACE2_CONFIG_PARAMS=gc.autoPackLimit,maintenance.incremental-repack.auto &&

Nit: Can we use GIT_CONFIG_KEY_* et al for this these days, or do we
still need this trace2 thingy?

> +	export GIT_TRACE2_CONFIG_PARAMS &&
> +


On the Git mailing list, Robert Coup wrote (reply to this):

Hi Ævar,

On Thu, 31 Mar 2022 at 16:33, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:


> > +                     if (git_config_get_int("gc.autopacklimit", &opt_val))
> > +                             opt_val = -1;
> > +                     if (opt_val != 0)
>
> nit: don't compare against 0 or null,  just !opt_val

I did this since 0 has a specific meaning ("Setting this to 0
disables"), it's not just false-y in this context. Tomayto, tomahto?

>
> Isn't this whole thing also clearer as:
>
>         int &forget;
>
>         if (git_conf...(..., &forget))
>                 git_config_push_parameter("gc.autoPackLimit=1");
>
> Maybe I haven't eyeballed this enough, but aren't you ignoring explicit
> gc.autoPackLimit=0 configuration? Whereas what you seem to want is "set
> this config unlress the user has it set", for which we only need to
> check the git_config...(...) return value, no?

What I'm trying to achieve: if the user has not disabled auto-packing
(autoPackLimit=0), then pass autoPackLimit=1 to the subprocess to
encourage repacking.
Context/why: so we don't 2x the object store size and not even attempt
to repack it now, rather than at some unspecified point in the future.
Maybe.

How the code achieves it:
  load autoPackLimit into opt_val
  if autoPackLimit is not specified in config: set opt_val to -1
  if opt_val is not 0: pass autoPackLimit=1 to the subprocess

AFAICT if we just if(git_config_get_int()) then if they haven't set it
at all in config, we wouldn't encourage repacking in the subprocess.
Which isn't what I'm trying to achieve.

> hrm, do we really need to set both of these these days (not saying we
> don't, just surprised). I.e. both gc.* an maintenance.* config.
>
> *skims the code*
>
> Urgh, yes? too_many_packs() seems to check gc.* only, but
> incremental_repack_auto_condition() check this variable... :(

Yes.

>
> > +test_expect_success 'fetch --refetch triggers repacking' '
> > +     GIT_TRACE2_CONFIG_PARAMS=gc.autoPackLimit,maintenance.incremental-repack.auto &&
>
> Nit: Can we use GIT_CONFIG_KEY_* et al for this these days, or do we
> still need this trace2 thingy?

I copied a pattern existing tests are using.

Thanks, Rob.

gitgitgadget bot commented Mar 31, 2022

This patch series was integrated into seen via git@8bcd7d7.

gitgitgadget bot commented Apr 1, 2022

This patch series was integrated into seen via git@649b59b.

gitgitgadget bot commented Apr 1, 2022

This patch series was integrated into next via git@5a99489.

@gitgitgadget gitgitgadget bot added the next label Apr 1, 2022
gitgitgadget bot commented Apr 3, 2022

This patch series was integrated into seen via git@a53786a.

gitgitgadget bot commented Apr 4, 2022

This patch series was integrated into seen via git@0f5e885.

gitgitgadget bot commented Apr 4, 2022

This patch series was integrated into master via git@0f5e885.

gitgitgadget bot commented Apr 4, 2022

This patch series was integrated into next via git@0f5e885.

@gitgitgadget gitgitgadget bot added the master label Apr 4, 2022
@gitgitgadget gitgitgadget bot closed this Apr 4, 2022
gitgitgadget bot commented Apr 4, 2022

Closed via 0f5e885.
