fetch: add repair: full refetch without negotiation (was: "refiltering") #1138
Conversation
Welcome to GitGitGadget

Hi @rcoup, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests. Please make sure that your Pull Request has a good description, as it will be used as the cover letter. It is also a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that by adding a comment of the form `/allow` to your PR.

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a `/submit` PR comment. If you want to see what email(s) would be sent first, add a `/preview` comment instead. After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail). If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive and send it to your own IMAP inbox, e.g.:

    curl -g --user "<EMailAddress>:<Password>" \
        --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title), summarizing the revision by editing the first comment on the PR (i.e. the PR description). To send a new iteration, just add another `/submit` PR comment.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join. You may also be able to find help in real time in the developer IRC channel.
/allow
User rcoup is now allowed to use GitGitGadget.
/preview
Preview email sent as pull.1138.git.1643730506.gitgitgadget@gmail.com
/submit
Submitted as pull.1138.git.1643730593.gitgitgadget@gmail.com
On the Git mailing list, Junio C Hamano wrote (reply to this):
On the Git mailing list, Robert Coup wrote (reply to this):
On the Git mailing list, Jonathan Tan wrote (reply to this):
On the Git mailing list, Robert Coup wrote (reply to this):
On the Git mailing list, Robert Coup wrote (reply to this):
@@ -312,19 +312,21 @@ static int find_common(struct fetch_negotiator *negotiator,
	const char *remote_hex;
On the Git mailing list, Jonathan Tan wrote (reply to this):
"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:
> @@ -312,19 +312,21 @@ static int find_common(struct fetch_negotiator *negotiator,
> const char *remote_hex;
> struct object *o;
>
> - /*
> - * If that object is complete (i.e. it is an ancestor of a
> - * local ref), we tell them we have it but do not have to
> - * tell them about its ancestors, which they already know
> - * about.
> - *
> - * We use lookup_object here because we are only
> - * interested in the case we *know* the object is
> - * reachable and we have already scanned it.
> - */
> - if (((o = lookup_object(the_repository, remote)) != NULL) &&
> - (o->flags & COMPLETE)) {
> - continue;
> + if (!args->refilter) {
> + /*
> + * If that object is complete (i.e. it is an ancestor of a
> + * local ref), we tell them we have it but do not have to
> + * tell them about its ancestors, which they already know
> + * about.
> + *
> + * We use lookup_object here because we are only
> + * interested in the case we *know* the object is
> + * reachable and we have already scanned it.
> + */
> + if (((o = lookup_object(the_repository, remote)) != NULL) &&
> + (o->flags & COMPLETE)) {
> + continue;
> + }
The approach that I would have expected is to not call
mark_complete_and_common_ref(), filter_refs(), everything_local(), and
find_common(), but your approach here is to ensure that
mark_complete_and_common_ref() and find_common() do not do anything.
Comparing the two approaches, the advantage of yours is that we only
need to make the change once to support both protocol v0 and v2
(although the change looks more substantial than just skipping function
calls), but it makes the code more difficult to read in that we now have
function calls that do nothing. What do you think about my approach?
On the Git mailing list, Robert Coup wrote (reply to this):
Hi Jonathan,
Thanks for taking a look at this.
On Fri, 4 Feb 2022 at 18:02, Jonathan Tan <jonathantanmy@google.com> wrote:
>
> The approach that I would have expected is to not call
> mark_complete_and_common_ref(), filter_refs(), everything_local(), and
> find_common(), but your approach here is to ensure that
> mark_complete_and_common_ref() and find_common() do not do anything.
v0: find_common() definitely still does something: during refiltering it skips
checking the local object db, but it's still responsible for sending
the "wants".
filter_refs() is necessary under v0 & v2 so the remote refs all get marked as
matched, otherwise we end up erroring after the transfer with
"error: no such remote ref refs/heads/main" etc.
> Comparing the two approaches, the advantage of yours is that we only
> need to make the change once to support both protocol v0 and v2
> (although the change looks more substantial than just skipping function
> calls), but it makes the code more difficult to read in that we now have
> function calls that do nothing. What do you think about my approach?
My original approach was to leave the negotiator in place, and just conditional
around it in do_fetch_pack_v2 — this worked ok for protocol v2 but the v0
implementation didn't work. After that I switched to forcing noop in [1/6]
which made both implementations match (& tidier imo).
To make the test pass and skip those calls I need a patch like the below
— filter_refs() is still required during refiltering for the ref-matching. To me
this looks more complicated, but I'm happy to defer to your thinking.
Thanks,
Rob :)
diff --git a/fetch-pack.c b/fetch-pack.c
index dd67044165..870bfba267 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1125,15 +1125,16 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
 	negotiator = &negotiator_alloc;
 	if (is_refiltering) {
 		fetch_negotiator_init_noop(negotiator);
+		filter_refs(args, &ref, sought, nr_sought);
 	} else {
 		fetch_negotiator_init(r, negotiator);
-	}
-	mark_complete_and_common_ref(negotiator, args, &ref);
-	filter_refs(args, &ref, sought, nr_sought);
-	if (!is_refiltering && everything_local(args, &ref)) {
-		packet_flush(fd[1]);
-		goto all_done;
+		mark_complete_and_common_ref(negotiator, args, &ref);
+		filter_refs(args, &ref, sought, nr_sought);
+		if (everything_local(args, &ref)) {
+			packet_flush(fd[1]);
+			goto all_done;
+		}
 	}
 	if (find_common(negotiator, args, fd, &oid, ref) < 0)
 		if (!args->keep_pack)
@@ -1615,13 +1616,18 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 	if (args->depth > 0 || args->deepen_since || args->deepen_not)
 		args->deepen = 1;
-	/* Filter 'ref' by 'sought' and those that aren't local */
-	mark_complete_and_common_ref(negotiator, args, &ref);
-	filter_refs(args, &ref, sought, nr_sought);
-	if (!args->refilter && everything_local(args, &ref))
-		state = FETCH_DONE;
-	else
+	if (args->refilter) {
+		filter_refs(args, &ref, sought, nr_sought);
 		state = FETCH_SEND_REQUEST;
+	} else {
+		/* Filter 'ref' by 'sought' and those that aren't local */
+		mark_complete_and_common_ref(negotiator, args, &ref);
+		filter_refs(args, &ref, sought, nr_sought);
+		if (everything_local(args, &ref))
+			state = FETCH_DONE;
+		else
+			state = FETCH_SEND_REQUEST;
+	}
 	mark_tips(negotiator, args->negotiation_tips);
 	for_each_cached_alternate(negotiator,
On the Git mailing list, Jonathan Tan wrote (reply to this):
Robert Coup <robert@coup.net.nz> writes:
> On Fri, 4 Feb 2022 at 18:02, Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > The approach that I would have expected is to not call
> > mark_complete_and_common_ref(), filter_refs(), everything_local(), and
> > find_common(), but your approach here is to ensure that
> > mark_complete_and_common_ref() and find_common() do not do anything.
>
> v0: find_common() definitely still does something: during refiltering it skips
> checking the local object db, but it's still responsible for sending
> the "wants".
>
> filter_refs() is necessary under v0 & v2 so the remote refs all get marked as
> matched, otherwise we end up erroring after the transfer with
> "error: no such remote ref refs/heads/main" etc.
>
> > Comparing the two approaches, the advantage of yours is that we only
> > need to make the change once to support both protocol v0 and v2
> > (although the change looks more substantial than just skipping function
> > calls), but it makes the code more difficult to read in that we now have
> > function calls that do nothing. What do you think about my approach?
>
> My original approach was to leave the negotiator in place, and just conditional
> around it in do_fetch_pack_v2 — this worked ok for protocol v2 but the v0
> implementation didn't work. After that I switched to forcing noop in [1/6]
> which made both implementations match (& tidier imo).
>
> To make the test pass and skip those calls I need a patch like the below
> — filter_refs() is still required during refiltering for the ref-matching. To me
> this looks more complicated, but I'm happy to defer to your thinking.
>
> Thanks,
>
> Rob :)
Thanks for investigating this; looks like I was perhaps too optimistic
in thinking that the code could be reorganized in the way I'm
suggesting.
I think that it's worth checking if we could refactor the parts that are
needed with --refilter out of filter_refs() and find_common() (and in
doing so, make these functions do only what their names imply). I
don't have time to do so right now, but I don't want to unnecessarily
block this patch set either, so I'll say that ideally another reviewer
(or you) would do that investigation, but I have no objection to merging
this patch set in this state (other than the change of the argument
name, perhaps to "--repair", and associated documentation).
On the Git mailing list, Jeff Hostetler wrote (reply to this):
Force-pushed from 6c4d426 to 045c0fb
On the Git mailing list, Robert Coup wrote (reply to this):
Force-pushed from b80ca59 to cd072de
/preview
Preview email sent as pull.1138.v2.git.1645715098.gitgitgadget@gmail.com
Force-pushed from cd072de to 2094256
/submit
Submitted as pull.1138.v2.git.1645719218.gitgitgadget@gmail.com
Add a test for doing a refetch to apply a changed partial clone filter under protocol v0 and v2.

Signed-off-by: Robert Coup <robert@coup.net.nz>

After invoking `fetch --refetch`, the object db will likely contain many duplicate objects. If auto-maintenance is enabled, invoke it with appropriate settings to encourage repacking/consolidation:

* gc.autoPackLimit: unless this is set to 0 (disabled), override the value to 1 to force pack consolidation.
* maintenance.incremental-repack.auto: unless this is set to 0, override the value to -1 to force incremental repacking.

Signed-off-by: Robert Coup <robert@coup.net.nz>

Document it for partial clones as a means to apply a new filter, and reference it from the remote.<name>.partialclonefilter config parameter.

Signed-off-by: Robert Coup <robert@coup.net.nz>
Force-pushed from f923a06 to da1e6de
/preview
Preview email sent as pull.1138.v4.git.1648475699.gitgitgadget@gmail.com
/submit
Submitted as pull.1138.v4.git.1648476131.gitgitgadget@gmail.com
@@ -82,5 +82,7 @@ remote.<name>.promisor::
	objects.
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Robert Coup via GitGitGadget" <gitgitgadget@gmail.com> writes:
> remote.<name>.partialclonefilter::
> - The filter that will be applied when fetching from this
> - promisor remote.
> + The filter that will be applied when fetching from this promisor remote.
> + Changing or clearing this value will only affect fetches for new commits.
> + To fetch associated objects for commits already present in the local object
> + database, use the `--refetch` option of linkgit:git-fetch[1].
Good advice to add.
Will replace. I think we've seen this topic enough times and it
looked reasonably well done. Let's mark it for 'next' unless we
hear objections soonish.
This patch series was integrated into seen via git@9304375.
There was a status update in the "Cooking" section about the branch: "git fetch --refetch" learned to fetch everything without telling the other side what we already have, which is useful when you cannot trust what you have in the local object store. Will merge to 'next'? source: <pull.1138.v4.git.1648476131.gitgitgadget@gmail.com>
This patch series was integrated into seen via git@2d72be0.
This patch series was integrated into seen via git@229b9ce.
This patch series was integrated into seen via git@6c9ab52.
There was a status update in the "Cooking" section about the branch: "git fetch --refetch" learned to fetch everything without telling the other side what we already have, which is useful when you cannot trust what you have in the local object store. Will merge to 'next'? source: <pull.1138.v4.git.1648476131.gitgitgadget@gmail.com>
@@ -312,19 +312,21 @@ static int find_common(struct fetch_negotiator *negotiator,
	const char *remote_hex;
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):
On Mon, Mar 28 2022, Robert Coup via GitGitGadget wrote:
> From: Robert Coup <robert@coup.net.nz>
>
> Allow a "refetch" where the contents of the local object store are
> ignored and a full fetch is performed, not attempting to find or
> negotiate common commits with the remote.
>
> A key use case is to apply a new partial clone blob/tree filter and
> refetch all the associated matching content, which would otherwise not
> be transferred when the commit objects are already present locally.
FWIW it's not clear to me re earlier comments on earlier iterations
whether the "noop fetch" is per-se wanted for this feature for some
reason, or if it's just much easier to implement than doing what I
suggested in:
https://lore.kernel.org/git/220228.86ee3m39jf.gmgdl@evledraar.gmail.com/
I don't think such a thing should hold this series up, but as it would
be a bit kinder to servers I think it's worth at least noting in the
commit message what's desired per-se here, v.s. what's just needed for
the convenience of implementation.
I.e. when this series was in an earlier iteration the scope was to
repair repository corruption, which I pointed out to you it really
couldn't do without more wider changes to the object store management,
and at that point having it be NOOP definitely makes sense. The object
lookups etc. take shortcuts that "fsck" wouldn't do, so we could be
negotiating on the basis of corrupt content.
But now that it's a "fetch what's missing" wouldn't it make more sense
to descend from our otherwise-negotiated tips, and find the OIDs that
are "complete", if any, and negotiate with those?
Which again, I think it's fine to say "yeah, that would be ideal, but
this is easier". I'm just checking if I'm missing some subtlety here...
> Signed-off-by: Robert Coup <robert@coup.net.nz>
> ---
> fetch-pack.c | 46 +++++++++++++++++++++++++++++-----------------
> fetch-pack.h | 1 +
> 2 files changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 87657907e78..4e1e88eea09 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -312,19 +312,21 @@ static int find_common(struct fetch_negotiator *negotiator,
> const char *remote_hex;
> struct object *o;
>
> - /*
> - * If that object is complete (i.e. it is an ancestor of a
> - * local ref), we tell them we have it but do not have to
> - * tell them about its ancestors, which they already know
> - * about.
> - *
> - * We use lookup_object here because we are only
> - * interested in the case we *know* the object is
> - * reachable and we have already scanned it.
> - */
> - if (((o = lookup_object(the_repository, remote)) != NULL) &&
> - (o->flags & COMPLETE)) {
> - continue;
> + if (!args->refetch) {
> + /*
> + * If that object is complete (i.e. it is an ancestor of a
> + * local ref), we tell them we have it but do not have to
> + * tell them about its ancestors, which they already know
> + * about.
> + *
> + * We use lookup_object here because we are only
> + * interested in the case we *know* the object is
> + * reachable and we have already scanned it.
> + */
> + if (((o = lookup_object(the_repository, remote)) != NULL) &&
> + (o->flags & COMPLETE)) {
> + continue;
> + }
nit: remove the {} here
nit: this double-indented if can just be one if-statement
nit: don't compare against NULL, use ! instead.
> }
>
> remote_hex = oid_to_hex(remote);
> @@ -692,6 +694,9 @@ static void mark_complete_and_common_ref(struct fetch_negotiator *negotiator,
> int old_save_commit_buffer = save_commit_buffer;
> timestamp_t cutoff = 0;
>
> + if (args->refetch)
> + return;
> +
> save_commit_buffer = 0;
nit: This function has only two callers, perhaps it's clearer to do
this "early abort" in those calls?
> trace2_region_enter("fetch-pack", "parse_remote_refs_and_find_cutoff", NULL);
> @@ -1028,7 +1033,11 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
> struct fetch_negotiator *negotiator;
>
> negotiator = &negotiator_alloc;
> - fetch_negotiator_init(r, negotiator);
> + if (args->refetch) {
> + fetch_negotiator_init_noop(negotiator);
> + } else {
> + fetch_negotiator_init(r, negotiator);
> + }
More needless {}
> sort_ref_list(&ref, ref_compare_name);
> QSORT(sought, nr_sought, cmp_ref_by_name);
> @@ -1121,7 +1130,7 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
>
> mark_complete_and_common_ref(negotiator, args, &ref);
> filter_refs(args, &ref, sought, nr_sought);
> - if (everything_local(args, &ref)) {
> + if (!args->refetch && everything_local(args, &ref)) {
> packet_flush(fd[1]);
> goto all_done;
> }
Here everything_local() is doing what I suggested for
mark_complete_and_common_ref() above, i.e. we check args->refetch first.
> @@ -1587,7 +1596,10 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
> struct strvec index_pack_args = STRVEC_INIT;
>
> negotiator = &negotiator_alloc;
> - fetch_negotiator_init(r, negotiator);
> + if (args->refetch)
> + fetch_negotiator_init_noop(negotiator);
> + else
> + fetch_negotiator_init(r, negotiator);
This one doesn't have braces (good), unlike do_fetch_pack()
On the Git mailing list, Robert Coup wrote (reply to this):
Hi Ævar,
< just when I was thinking I'm done... ;-) >
On Thu, 31 Mar 2022 at 16:17, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> FWIW it's not clear to me re earlier comments on earlier iterations
> whether the "noop fetch" is per-se wanted for this feature for some
> reason, or if it's just much easier to implement than doing what I
> suggested in:
> https://lore.kernel.org/git/220228.86ee3m39jf.gmgdl@evledraar.gmail.com/
The noop-fetch is an implementation detail IMO, and can be improved
if/when someone is motivated later.
> I don't think such a thing should hold this series up, but as it would
> be a bit kinder to servers I think it's worth at least noting in the
> commit message what's desired per-se here, v.s. what's just needed for
> the convenience of implementation.
Yes, doing some per-commit negotiation is conceptually nicer. The
user-facing alternative at this point is basically running "clone" in
some form; or temporarily moving aside the object DB and running
fetch. Both those (and the new approach) all end up putting the same
load on the server.
I can note it in the commit message.
> I.e. when this series was in an earlier iteration the scope was to
> repair repository corruption
That was never _my_ motivation, I was a bit eager in reflecting some
comments I received suggesting it would be useful for that as well.
> But now that it's a "fetch what's missing" wouldn't it make more sense
> to descend from our otherwise-negotiated tips, and find the OIDs that
> are "complete", if any, and negotiate with those?
>
> Which again, I think it's fine to say "yeah, that would be ideal, but
> this is easier". I'm just checking if I'm missing some subtlety here...
Yeah, that would be ideal, but this is easier. AFAICS I'm not putting
in place any barriers to making it smarter later.
> nit: remove the {} here
> nit: this double-indented if can just be one if-statement
> nit: don't compare against NULL, use ! instead.
Is the common interpretation for nits "Do another re-roll"; "If you're
doing another re-roll"; or "For future reference"?
> nit: This function has only two callers, perhaps it's clearer to do
> this "early abort" in those calls?
I discussed similar in
https://lore.kernel.org/git/CACf-nVePhtm_HAzAKzcap0E8kiyyEJPY_+N+bbPcYPVUkjweFg@mail.gmail.com/
but yes, I think you're right about mark_complete_and_common_ref().
Will see what it looks like.
Thanks, Rob.
@@ -163,6 +163,15 @@ endif::git-pull[]
	behavior for a remote may be specified with the remote.<name>.tagOpt
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):
On Mon, Mar 28 2022, Robert Coup via GitGitGadget wrote:
> From: Robert Coup <robert@coup.net.nz>
>
> Teach fetch and transports the --refetch option to force a full fetch
> without negotiating common commits with the remote. Use when applying a
> new partial clone filter to refetch all matching objects.
>
> [...]
> +ifndef::git-pull[]
> +--refetch::
> + Instead of negotiating with the server to avoid transferring commits and
> + associated objects that are already present locally, this option fetches
> + all objects as a fresh clone would. Use this to reapply a partial clone
> + filter from configuration or using `--filter=` when the filter
> + definition has changed.
> +endif::git-pull[]
Re my comment on negotiation specifics in 2/7, this documentation is
really over-promising depending on what the answer to that is:
https://lore.kernel.org/git/220331.86o81mp2w1.gmgdl@evledraar.gmail.com/
I.e. instead of saying that we WILL fetch all objects "just like a
clone" shouldn't we have less focus on implementation details here, and
assure the user that we'll make their object store "complete" as though
--filter hadn't been used, without going into the specifics of what'll
happen over the wire?
> return -1;
>
> + /*
> + * Similarly, if we need to refetch, we always want to perform a full
> + * fetch ignoring existing objects.
> + */
> + if (refetch)
> + return -1;
> +
> +
One too many \n here.
On the Git mailing list, Robert Coup wrote (reply to this):
Hi Ævar,
On Thu, 31 Mar 2022 at 16:20, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> > +--refetch::
> > + Instead of negotiating with the server to avoid transferring commits and
> > + associated objects that are already present locally, this option fetches
> > + all objects as a fresh clone would. Use this to reapply a partial clone
> > + filter from configuration or using `--filter=` when the filter
> > + definition has changed.
>
> Re my comment on negotiation specifics in 2/7, this documentation is
> really over-promising depending on what the answer to that is:
> https://lore.kernel.org/git/220331.86o81mp2w1.gmgdl@evledraar.gmail.com/
>
> I.e. instead of saying that we WILL fetch all objects "just like a
> clone" shouldn't we have less focus on implementation details here, and
> assure the user that we'll make their object store "complete" as though
> --filter hadn't been used, without going into the specifics of what'll
> happen over the wire?
That's not quite the case though: we'll make their object db complete
as if a fresh clone with any specified --filter (including none) from
that remote has been done.
IMO explaining what will happen and to expect (particularly as this is
_not_ transferring some sort of object-delta-set, clone transfers can
be big) is a good thing. If it improves later; then change the docs.
Alternatively, rewording along the lines of "it will fix up your
object db to match a changed filter" is implying that only the missing
bits will be fetched.
Thanks, Rob.
@@ -166,6 +166,56 @@ test_expect_success 'manual prefetch of missing objects' '
	test_line_count = 0 observed.oids
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):
On Mon, Mar 28 2022, Robert Coup via GitGitGadget wrote:
> From: Robert Coup <robert@coup.net.nz>
>
> Add a test for doing a refetch to apply a changed partial clone filter
> under protocol v0 and v2.
>
> Signed-off-by: Robert Coup <robert@coup.net.nz>
> ---
> t/t5616-partial-clone.sh | 52 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
> index 34469b6ac10..87ebf4b0b1c 100755
> --- a/t/t5616-partial-clone.sh
> +++ b/t/t5616-partial-clone.sh
> @@ -166,6 +166,56 @@ test_expect_success 'manual prefetch of missing objects' '
> test_line_count = 0 observed.oids
> '
>
> +# create new commits in "src" repo to establish a history on file.4.txt
> +# and push to "srv.bare".
> +test_expect_success 'push new commits to server for file.4.txt' '
> + for x in a b c d e f
> + do
> + echo "Mod file.4.txt $x" >src/file.4.txt &&
> + if list_contains "a,b" "$x"; then
> + printf "%10000s" X >>src/file.4.txt
> + fi &&
> + if list_contains "c,d" "$x"; then
> + printf "%20000s" X >>src/file.4.txt
> + fi &&
> + git -C src add file.4.txt &&
> + git -C src commit -m "mod $x" || return 1
> + done &&
> + git -C src push -u srv main
> +'
> +
> +# Do partial fetch to fetch smaller files; then verify that without --refetch
> +# applying a new filter does not refetch missing large objects. Then use
> +# --refetch to apply the new filter on existing commits. Test it under both
> +# protocol v2 & v0.
> +test_expect_success 'apply a different filter using --refetch' '
> + git -C pc1 fetch --filter=blob:limit=999 origin &&
> + git -C pc1 rev-list --quiet --objects --missing=print \
> + main..origin/main >observed &&
> + test_line_count = 4 observed &&
> +
> + git -C pc1 fetch --filter=blob:limit=19999 --refetch origin &&
Is 19999 just "arbitrary big number" here?
> + git -C pc1 rev-list --quiet --objects --missing=print \
> + main..origin/main >observed &&
> + test_line_count = 2 observed &&
> +
> + git -c protocol.version=0 -C pc1 fetch --filter=blob:limit=29999 \
> + --refetch origin &&
> + git -C pc1 rev-list --quiet --objects --missing=print \
> + main..origin/main >observed &&
> + test_line_count = 0 observed
Does this test_line_count *really* want to be = 0, or does this mean
test_must_be_empty?
I.e. are we expecting content here, just not ending in a \n, or nothing
at all?
> +'
> +
> +test_expect_success 'fetch --refetch works with a shallow clone' '
> + git clone --no-checkout --depth=1 --filter=blob:none "file://$(pwd)/srv.bare" pc1s &&
> + git -C pc1s rev-list --objects --missing=print HEAD >observed &&
> + test_line_count = 6 observed &&
> +
> + GIT_TRACE=1 git -C pc1s fetch --filter=blob:limit=999 --refetch origin &&
Why the GIT_TRACE=1? Seems to not be used.
On the Git mailing list, Robert Coup wrote (reply to this):
Hi,
On Thu, 31 Mar 2022 at 16:22, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
>
> > +
> > + git -C pc1 fetch --filter=blob:limit=19999 --refetch origin &&
>
> Is 19999 just "arbitrary big number" here?
A selected big number so we can change it and have different objects match.
>
> > + git -C pc1 rev-list --quiet --objects --missing=print \
> > + main..origin/main >observed &&
> > + test_line_count = 2 observed &&
> > +
> > + git -c protocol.version=0 -C pc1 fetch --filter=blob:limit=29999 \
> > + --refetch origin &&
> > + git -C pc1 rev-list --quiet --objects --missing=print \
> > + main..origin/main >observed &&
> > + test_line_count = 0 observed
>
> Does this test_line_count *really* want to be = 0, or does this mean
> test_must_be_empty?
>
> I.e. are we expecting content here, just not ending in a \n, or nothing
> at all?
I'm expecting no missing objects after a refetch with a new filter
that matches all the objects in the repo.
> > + GIT_TRACE=1 git -C pc1s fetch --filter=blob:limit=999 --refetch origin &&
>
> Why the GIT_TRACE=1? Seems to not be used.
Ah, extraneous debugging on my part.
Thanks, Rob :)
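For readers skimming this exchange, the `test_line_count = 0` vs `test_must_be_empty` distinction Ævar raises comes down to trailing newlines. A plain-shell sketch, using `wc -l` as a stand-in for git's `test_line_count` helper:

```shell
# A file with no trailing newline still has a line count of 0 under
# wc -l, even though it is not empty; an emptiness check catches that.
cd "$(mktemp -d)"
printf '' >empty.txt
printf 'content without newline' >nonl.txt

wc -l <empty.txt    # 0 newlines, and the file really is empty
wc -l <nonl.txt     # also 0 newlines, but the file has content
[ -s nonl.txt ] && echo "nonl.txt is non-empty despite counting 0 lines"
```

`test_must_be_empty` would reject `nonl.txt` while a line count of 0 accepts it. In the test under discussion Rob expects genuinely no missing objects (and `rev-list` output is newline-terminated), so either check passes, but the stricter helper states the intent more clearly.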
@@ -163,6 +163,16 @@ endif::git-pull[]
	behavior for a remote may be specified with the remote.<name>.tagOpt
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):
On Mon, Mar 28 2022, Robert Coup via GitGitGadget wrote:
> From: Robert Coup <robert@coup.net.nz>
>
> After invoking `fetch --refetch`, the object db will likely contain many
> duplicate objects. If auto-maintenance is enabled, invoke it with
> appropriate settings to encourage repacking/consolidation.
>
> * gc.autoPackLimit: unless this is set to 0 (disabled), override the
> value to 1 to force pack consolidation.
> * maintenance.incremental-repack.auto: unless this is set to 0, override
> the value to -1 to force incremental repacking.
>
> Signed-off-by: Robert Coup <robert@coup.net.nz>
> ---
> Documentation/fetch-options.txt | 3 ++-
> builtin/fetch.c | 19 ++++++++++++++++++-
> t/t5616-partial-clone.sh | 29 +++++++++++++++++++++++++++++
> 3 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
> index d03fce5aae0..622bd84768b 100644
> --- a/Documentation/fetch-options.txt
> +++ b/Documentation/fetch-options.txt
> @@ -169,7 +169,8 @@ ifndef::git-pull[]
> associated objects that are already present locally, this option fetches
> all objects as a fresh clone would. Use this to reapply a partial clone
> filter from configuration or using `--filter=` when the filter
> - definition has changed.
> + definition has changed. Automatic post-fetch maintenance will perform
> + object database pack consolidation to remove any duplicate objects.
> endif::git-pull[]
>
> --refmap=<refspec>::
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index e391a5dbc55..e3791f09ed5 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -2306,8 +2306,25 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
> NULL);
> }
>
> - if (enable_auto_gc)
> + if (enable_auto_gc) {
> + if (refetch) {
> + /*
> + * Hint auto-maintenance strongly to encourage repacking,
> + * but respect config settings disabling it.
> + */
> + int opt_val;
nit: add a \n after this.
> + if (git_config_get_int("gc.autopacklimit", &opt_val))
> + opt_val = -1;
> + if (opt_val != 0)
nit: don't compare against 0 or null, just !opt_val
Isn't this whole thing also clearer as:
int forget;
if (git_conf...(..., &forget))
git_config_push_parameter("gc.autoPackLimit=1");
Maybe I haven't eyeballed this enough, but aren't you ignoring explicit
gc.autoPackLimit=0 configuration? Whereas what you seem to want is "set
this config unless the user has it set", for which we only need to
check the git_config...(...) return value, no?
> + git_config_push_parameter("gc.autoPackLimit=1");
> +
> + if (git_config_get_int("maintenance.incremental-repack.auto", &opt_val))
> + opt_val = -1;
> + if (opt_val != 0)
> + git_config_push_parameter("maintenance.incremental-repack.auto=-1");
hrm, do we really need to set both of these these days (not saying we
don't, just surprised). I.e. both gc.* and maintenance.* config.
*skims the code*
Urgh, yes? too_many_packs() seems to check gc.* only, but
incremental_repack_auto_condition() checks this variable... :(
> +test_expect_success 'fetch --refetch triggers repacking' '
> + GIT_TRACE2_CONFIG_PARAMS=gc.autoPackLimit,maintenance.incremental-repack.auto &&
Nit: Can we use GIT_CONFIG_KEY_* et al for this these days, or do we
still need this trace2 thingy?
> + export GIT_TRACE2_CONFIG_PARAMS &&
> +
On the Git mailing list, Robert Coup wrote (reply to this):
Hi Ævar,
On Thu, 31 Mar 2022 at 16:33, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> > + if (git_config_get_int("gc.autopacklimit", &opt_val))
> > + opt_val = -1;
> > + if (opt_val != 0)
>
> nit: don't compare against 0 or null, just !opt_val
I did this since 0 has a specific meaning ("Setting this to 0
disables"), it's not just false-y in this context. Tomayto, tomahto?
>
> Isn't this whole thing also clearer as:
>
> int &forget;
>
> if (git_conf...(..., &forget))
> git_config_push_parameter("gc.autoPackLimit=1");
>
> Maybe I haven't eyeballed this enough, but aren't you ignoring explicit
> gc.autoPackLimit=0 configuration? Whereas what you seem to want is "set
> this config unlress the user has it set", for which we only need to
> check the git_config...(...) return value, no?
What I'm trying to achieve: if the user has not disabled auto-packing
(autoPackLimit=0), then pass autoPackLimit=1 to the subprocess to
encourage repacking.
Context/why: so we don't 2x the object store size and not even attempt
to repack it now, rather than at some unspecified point in the future.
Maybe.
How the code achieves it:
load autoPackLimit into opt_val
if autoPackLimit is not specified in config: set opt_val to -1
if opt_val is not 0: pass autoPackLimit=1 to the subprocess
AFAICT if we just check if (git_config_get_int()), then if they haven't set it
at all in config, we wouldn't encourage repacking in the subprocess.
Which isn't what I'm trying to achieve.
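Robert's three-step description above can be sketched as a tiny shell function — purely illustrative, since the real logic is C in builtin/fetch.c; "unset" here stands in for git_config_get_int() reporting that no value is configured:

```shell
#!/bin/sh
# Sketch of the override decision described above (not the actual C code).
# Argument: the user's configured gc.autoPackLimit, or "unset".
decide_autopacklimit() {
    if [ "$1" = "0" ]; then
        : # explicitly disabled by the user: respect that, pass nothing down
    else
        # unset or any nonzero value: force pack consolidation in the child
        echo "gc.autoPackLimit=1"
    fi
}

decide_autopacklimit unset # -> gc.autoPackLimit=1
decide_autopacklimit 0     # -> (nothing)
decide_autopacklimit 50    # -> gc.autoPackLimit=1
```

The same shape applies to maintenance.incremental-repack.auto, with -1 rather than 1 as the forced value.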
> hrm, do we really need to set both of these these days (not saying we
> don't, just surprised). I.e. both gc.* an maintenance.* config.
>
> *skims the code*
>
> Urgh, yes? too_many_packs() seems to check gc.* only, but
> incremental_repack_auto_condition() check this variable... :(
Yes.
>
> > +test_expect_success 'fetch --refetch triggers repacking' '
> > + GIT_TRACE2_CONFIG_PARAMS=gc.autoPackLimit,maintenance.incremental-repack.auto &&
>
> Nit: Can we use GIT_CONFIG_KEY_* et al for this these days, or do we
> still need this trace2 thingy?
I copied a pattern existing tests are using.
Thanks, Rob.
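The GIT_TRACE2_CONFIG_PARAMS pattern the test copies can be demonstrated on its own. A sketch — the temp paths are arbitrary, and `status` merely stands in for any git subprocess:

```shell
#!/bin/sh
# Sketch: GIT_TRACE2_CONFIG_PARAMS asks git to emit "def_param" trace2
# events for the listed config keys, which is how the test observes what
# config the maintenance subprocess actually received.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
GIT_TRACE2_CONFIG_PARAMS=gc.autoPackLimit \
GIT_TRACE2_EVENT="$tmp/trace.json" \
    git -C "$tmp/repo" -c gc.autoPackLimit=1 status >/dev/null
grep -i 'def_param' "$tmp/trace.json"
```

Each matched event records the key and the value the command saw, so a test can assert on the trace file rather than on the subprocess's behaviour.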
This patch series was integrated into seen via git@8bcd7d7.
This patch series was integrated into seen via git@649b59b.
This patch series was integrated into next via git@5a99489.
This patch series was integrated into seen via git@a53786a.
This patch series was integrated into seen via git@0f5e885.
This patch series was integrated into master via git@0f5e885.
This patch series was integrated into next via git@0f5e885.
Closed via 0f5e885.
If a filter is changed on a partial clone repository, for example from blob:none to blob:limit=1m, there is currently no straightforward way to bulk-refetch the objects that match the new filter for existing local commits. This is because the client will report commits as "have" during fetch negotiation and any dependent objects won't be included in the transferred pack. Another use case is discussed at [1].
This patch series introduces a --refetch option to fetch & fetch-pack to enable doing a full fetch without performing any commit negotiation with the remote, as a fresh clone does. It builds upon cbe566a ("negotiator/noop: add noop fetch negotiator", 2020-08-18).
[1] https://lore.kernel.org/git/aa7b89ee-08aa-7943-6a00-28dcf344426e@syntevo.com/
[2] https://lore.kernel.org/git/21ED346B-A906-4905-B061-EDE53691C586@gmail.com/
Changes since v3:

* Mention fetch --refetch in the remote.<name>.partialclonefilter documentation.

Changes since v2:

Changes since RFC (v1):
Reviewed-by: Calvin Wan <calvinwan@google.com>
CC: Jonathan Tan <jonathantanmy@google.com>, John Cai <johncai86@gmail.com>, Jeff Hostetler <git@jeffhostetler.com>, Junio C Hamano <gitster@pobox.com>, Derrick Stolee <derrickstolee@github.com>, Ævar Arnfjörð Bjarmason <avarab@gmail.com>
cc: Calvin Wan <calvinwan@google.com>