Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial bundles #1159

Closed
wants to merge 13 commits into from
Closed

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Feb 23, 2022

While discussing bundle-URIs [1], it came to my attention that bundles have no way to specify an object filter, so bundles cannot be used with partial clones.

[1] https://lore.kernel.org/git/7fab28bf-54e7-d0e9-110a-53fad6244cc9@gmail.com/

This series provides a way to fix that by adding a 'filter' capability to the bundle file format and allowing one to create a partial bundle with 'git bundle create --filter=blob:none '.

There are a couple things that I want to point out about this implementation that could use some high-level feedback:

  1. I moved the '--filter' parsing into setup_revisions() instead of adding another place to parse it. This works for 'git bundle' but it also allows it to be parsed successfully in commands such as 'git diff' which doesn't make sense. Options such as '--objects' are already being parsed there, and they don't make sense either, so I want some thoughts on this.

  2. If someone uses 'git clone partial.bdl partial' where 'partial.bdl' is a filtered bundle, then the clone will fail with a message such as

fatal: missing blob object '9444604d515c0b162e37e59accd54a0bac50ed2e'
fatal: remote did not send all necessary objects

This might be fine. We don't expect users to clone partial bundles or fetch partial bundles into an unfiltered repo and these failures are expected. It is possible that we could put in custom logic to fail faster by reading the bundle header for a filter.

Generally, the idea is to open this up as a potential way to bootstrap a clone of a partial clone using a set of precomputed partial bundles.

Updates in v4

  • 'struct rev_info' now has 'filter' embedded statically, using filter.choice to indicate if it is an empty filter. This makes the range-diff look really messy, as lots of '&' characters are inserted, especially in the middle patches. The end result looks very similar.

  • To accommodate previous lines that were a pointer copy, create list_objects_filter_copy() to assist with deep copies of filters.

  • Commit message typo fixes.

  • Documentation improvements.

  • Tests use 'sed' over 'head' to be more robust to future changes.

  • Initialization of a 'struct traversal_context' is made more compact.

  • "filter.choice != LOFC_DISABLED" is replaced by "filter.choice", since LOFC_DISABLED is zero by design.

Updates in v3

  • 'struct bundle_header' now has 'filter' embedded statically, using filter.choice to indicate if it is an empty filter.

  • list-objects.c is now more robust to NULL function pointers.

Updates in v2

Thanks for the reviews, Jeff, Junio, and Ævar!

  • Commit message typos and grammar are improved.

  • Grammar in MyFirstObjectWalk.txt is improved.

  • Unnecessary line wrapping is unwrapped.

  • Final test to check unbundled repo is made more rigorous.

  • The new 'filter' capability is added to Documentation/technical/bundle-format.txt

  • Expanded docs for 'git bundle verify'.

  • Moved API docs gently_parse_list_objects_filter() to header.

  • Test name swaps '' with "" to evaluate $filter.

  • Added a new patch that helps git clone <bundle> fail gracefully when <bundle> is has a filter capability.

Thanks,
-Stolee

cc: stolee@gmail.com
cc: avarab@gmail.com
cc: gitster@pobox.com
cc: zhiyou.jx@alibaba-inc.com
cc: jonathantanmy@google.com
cc: Jeff Hostetler git@jeffhostetler.com

The --promisor option of 'git index-pack' was created in 88e2f9e
(introduce fetch-object: fetch one promisor object, 2017-12-05) but was
untested. It is currently unused within the Git codebase, but that will
change in an upcoming change to 'git bundle unbundle' when there is a
filter capability.

For now, add documentation about the option and add a test to ensure it
is working as expected.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
@derrickstolee derrickstolee self-assigned this Feb 23, 2022
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 23, 2022

Submitted as pull.1159.git.1645638911.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1159/derrickstolee/bundle/partial-v1

To fetch this version to local tag pr-1159/derrickstolee/bundle/partial-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1159/derrickstolee/bundle/partial-v1

@dscho dscho force-pushed the seen branch 6 times, most recently from d552efd to 82dd0cb Compare February 26, 2022 07:22
bundle.c Outdated
@@ -451,6 +451,12 @@ struct bundle_prerequisites_info {
int fd;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 2/23/22 12:55 PM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> > Since 'git bundle' uses setup_revisions() to specify the object walk,
> some options do not make sense to include during the pack-objects child
> process. Further, these options are used for a call to
> traverse_commit_list() which would then require a callback which is
> currently NULL.
> > By populating the callback we prevent a segfault in the case of adding
> the --objects flag. This is really a redundant statement because the

Nit: I stumbled over "...because the bundles are constructing..."
Is there a better wording here??

> bundles are constructing a pack-file containing all objects in the
> discovered commit range.
> > Adding --objects to a 'git bundle' command might cause a slower command,
> but at least it will not have a hard failure when the user supplies this
> option. We can also disable walking trees and blobs in advance of this
> walk.
> > Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>   bundle.c               | 10 +++++++++-
>   t/t6020-bundle-misc.sh | 12 ++++++++++++
>   2 files changed, 21 insertions(+), 1 deletion(-)
> > diff --git a/bundle.c b/bundle.c
> index a0bb687b0f4..dc56db9a50a 100644
> --- a/bundle.c
> +++ b/bundle.c
> @@ -451,6 +451,12 @@ struct bundle_prerequisites_info {
>   	int fd;
>   };
>   > +
> +static void ignore_object(struct object *obj, const char *v, void *data)
> +{
> +	/* Do nothing. */
> +}
> +
>   static void write_bundle_prerequisites(struct commit *commit, void *data)
>   {
>   	struct bundle_prerequisites_info *bpi = data;
> @@ -544,7 +550,9 @@ int create_bundle(struct repository *r, const char *path,
>   		die("revision walk setup failed");
>   	bpi.fd = bundle_fd;
>   	bpi.pending = &revs_copy.pending;
> -	traverse_commit_list(&revs, write_bundle_prerequisites, NULL, &bpi);
> +
> +	revs.blob_objects = revs.tree_objects = 0;
> +	traverse_commit_list(&revs, write_bundle_prerequisites, ignore_object, &bpi);
>   	object_array_remove_duplicates(&revs_copy.pending);
>   >   	/* write bundle refs */
> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
> index b13e8a52a93..6522401617d 100755
> --- a/t/t6020-bundle-misc.sh
> +++ b/t/t6020-bundle-misc.sh
> @@ -475,4 +475,16 @@ test_expect_success 'clone from bundle' '
>   	test_cmp expect actual
>   '
>   > +test_expect_success 'unfiltered bundle with --objects' '
> +	git bundle create all-objects.bdl \
> +		--all --objects &&
> +	git bundle create all.bdl \
> +		--all &&
> +
> +	# Compare the headers of these files.
> +	head -11 all.bdl >expect &&
> +	head -11 all-objects.bdl >actual &&
> +	test_cmp expect actual
> +'
> +
>   test_done
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff Hostetler <git@jeffhostetler.com> writes:

> On 2/23/22 12:55 PM, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>
>> Since 'git bundle' uses setup_revisions() to specify the object
>> walk,
>> some options do not make sense to include during the pack-objects child
>> process. Further, these options are used for a call to
>> traverse_commit_list() which would then require a callback which is
>> currently NULL.
>> By populating the callback we prevent a segfault in the case of
>> adding
>> the --objects flag. This is really a redundant statement because the
>
> Nit: I stumbled over "...because the bundles are constructing..."
> Is there a better wording here??

"... because the command is constructing ..." should be sufficient,
I hope?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 3/4/2022 5:58 PM, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
>> On 2/23/22 12:55 PM, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <derrickstolee@github.com>
>>> Since 'git bundle' uses setup_revisions() to specify the object
>>> walk,
>>> some options do not make sense to include during the pack-objects child
>>> process. Further, these options are used for a call to
>>> traverse_commit_list() which would then require a callback which is
>>> currently NULL.
>>> By populating the callback we prevent a segfault in the case of
>>> adding
>>> the --objects flag. This is really a redundant statement because the
>>
>> Nit: I stumbled over "...because the bundles are constructing..."
>> Is there a better wording here??
> 
> "... because the command is constructing ..." should be sufficient,
> I hope?

That works for me. Thanks!
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 28, 2022

User Jeff Hostetler <git@jeffhostetler.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 28, 2022

On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 2/23/22 12:55 PM, Derrick Stolee via GitGitGadget wrote:
> While discussing bundle-URIs [1], it came to my attention that bundles have
> no way to specify an object filter, so bundles cannot be used with partial
> clones.
> > [1]
> https://lore.kernel.org/git/7fab28bf-54e7-d0e9-110a-53fad6244cc9@gmail.com/
> > This series provides a way to fix that by adding a 'filter' capability to
> the bundle file format and allowing one to create a partial bundle with 'git
> bundle create --filter=blob:none '.

Nicely done.  There's a lot of refactoring here to move the
filtering code into a more usable place and get rid of some
of the awkward limitations of my original code.  Sorry that
you had to slog thru all of that.

> > There are a couple things that I want to point out about this implementation
> that could use some high-level feedback:
> >   1. I moved the '--filter' parsing into setup_revisions() instead of adding
>      another place to parse it. This works for 'git bundle' but it also
>      allows it to be parsed successfully in commands such as 'git diff' which
>      doesn't make sense. Options such as '--objects' are already being parsed
>      there, and they don't make sense either, so I want some thoughts on
>      this.

This feels like something that can wait for another task.
Let's keep this series focused on adding partial bundles.

> >   2. If someone uses 'git clone partial.bdl partial' where 'partial.bdl' is a
>      filtered bundle, then the clone will fail with a message such as
> > fatal: missing blob object '9444604d515c0b162e37e59accd54a0bac50ed2e' fatal:
> remote did not send all necessary objects
> > This might be fine. We don't expect users to clone partial bundles or fetch
> partial bundles into an unfiltered repo and these failures are expected. It
> is possible that we could put in custom logic to fail faster by reading the
> bundle header for a filter.
> > Generally, the idea is to open this up as a potential way to bootstrap a
> clone of a partial clone using a set of precomputed partial bundles.

I think this is to be expected.

Would it help to have Git do a no-checkout clone when cloning
from a partial bundle?  Maybe that would give the user a chance to set
a real remote (and maybe set the partial clone/fetch config settings)
and then backfill their local clone??   (That might be functional, but
not very user-friendly....)

Or should we just consider this limitation as a placeholder while we
wait for the Bundle URI effort?

Jeff

> > Thanks, -Stolee
> > Derrick Stolee (11):
>    index-pack: document and test the --promisor option
>    revision: put object filter into struct rev_info
>    pack-objects: use rev.filter when possible
>    pack-bitmap: drop filter in prepare_bitmap_walk()
>    list-objects: consolidate traverse_commit_list[_filtered]
>    MyFirstObjectWalk: update recommended usage
>    bundle: safely handle --objects option
>    bundle: parse filter capability
>    rev-list: move --filter parsing into revision.c
>    bundle: create filtered bundles
>    bundle: unbundle promisor packs
> >   Documentation/MyFirstObjectWalk.txt | 44 ++++++---------
>   Documentation/git-index-pack.txt    |  8 +++
>   builtin/pack-objects.c              |  9 +--
>   builtin/rev-list.c                  | 29 +++-------
>   bundle.c                            | 87 ++++++++++++++++++++++++-----
>   bundle.h                            |  3 +
>   list-objects-filter-options.c       |  2 +-
>   list-objects-filter-options.h       |  5 ++
>   list-objects.c                      | 25 +++------
>   list-objects.h                      | 12 +++-
>   pack-bitmap.c                       | 24 ++++----
>   pack-bitmap.h                       |  2 -
>   reachable.c                         |  2 +-
>   revision.c                          | 11 ++++
>   revision.h                          |  4 ++
>   t/t5300-pack-object.sh              |  4 +-
>   t/t6020-bundle-misc.sh              | 48 ++++++++++++++++
>   17 files changed, 215 insertions(+), 104 deletions(-)
> > > base-commit: 45fe28c951c3e70666ee4ef8379772851a8e4d32
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1159%2Fderrickstolee%2Fbundle%2Fpartial-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1159/derrickstolee/bundle/partial-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1159
> 

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 28, 2022

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/28/2022 12:00 PM, Jeff Hostetler wrote:
> On 2/23/22 12:55 PM, Derrick Stolee via GitGitGadget wrote:
>> While discussing bundle-URIs [1], it came to my attention that bundles have
>> no way to specify an object filter, so bundles cannot be used with partial
>> clones.
>>
>> [1]
>> https://lore.kernel.org/git/7fab28bf-54e7-d0e9-110a-53fad6244cc9@gmail.com/
>>
>> This series provides a way to fix that by adding a 'filter' capability to
>> the bundle file format and allowing one to create a partial bundle with 'git
>> bundle create --filter=blob:none '.
> 
> Nicely done.  There's a lot of refactoring here to move the
> filtering code into a more usable place and get rid of some
> of the awkward limitations of my original code.  Sorry that
> you had to slog thru all of that.
> 
>>
>> There are a couple things that I want to point out about this implementation
>> that could use some high-level feedback:
>>
>>   1. I moved the '--filter' parsing into setup_revisions() instead of adding
>>      another place to parse it. This works for 'git bundle' but it also
>>      allows it to be parsed successfully in commands such as 'git diff' which
>>      doesn't make sense. Options such as '--objects' are already being parsed
>>      there, and they don't make sense either, so I want some thoughts on
>>      this.
> 
> This feels like something that can wait for another task.
> Let's keep this series focused on adding partial bundles.

What do you mean "can wait"? Do you recommend that I _don't_ do this
refactor and instead implement filter parsing directly in bundles?

Or, are you saying that we should not worry about these potential
side-effects of allowing (then ignoring) certain options in other
commands, at least until a later series?
 
>>   2. If someone uses 'git clone partial.bdl partial' where 'partial.bdl' is a
>>      filtered bundle, then the clone will fail with a message such as
>>
>> fatal: missing blob object '9444604d515c0b162e37e59accd54a0bac50ed2e' fatal:
>> remote did not send all necessary objects
>>
>> This might be fine. We don't expect users to clone partial bundles or fetch
>> partial bundles into an unfiltered repo and these failures are expected. It
>> is possible that we could put in custom logic to fail faster by reading the
>> bundle header for a filter.
>>
>> Generally, the idea is to open this up as a potential way to bootstrap a
>> clone of a partial clone using a set of precomputed partial bundles.
> 
> I think this is to be expected.
> 
> Would it help to have Git do a no-checkout clone when cloning
> from a partial bundle?  Maybe that would give the user a chance to set
> a real remote (and maybe set the partial clone/fetch config settings)
> and then backfill their local clone??   (That might be functional, but
> not very user-friendly....)
> 
> Or should we just consider this limitation as a placeholder while we
> wait for the Bundle URI effort?

It would be interesting to have another application of partial
bundles, such as cloning directly from a bundle, then allowing
a remote to be configured. It provides a "build-it-yourself"
approach to bundle URIs for partial clones.

I'm not sure if such an application is required for this series,
or if it could be delayed until after. I'm open to suggestions.

Thanks,
-Stolee

@dscho dscho force-pushed the seen branch 4 times, most recently from 8854f3b to a4fa9bb Compare March 1, 2022 07:28
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 1, 2022

On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 2/28/22 12:54 PM, Derrick Stolee wrote:
> On 2/28/2022 12:00 PM, Jeff Hostetler wrote:
>> On 2/23/22 12:55 PM, Derrick Stolee via GitGitGadget wrote:
>>> While discussing bundle-URIs [1], it came to my attention that bundles have
>>> no way to specify an object filter, so bundles cannot be used with partial
>>> clones.
>>>
>>> [1]
>>> https://lore.kernel.org/git/7fab28bf-54e7-d0e9-110a-53fad6244cc9@gmail.com/
>>>
>>> This series provides a way to fix that by adding a 'filter' capability to
>>> the bundle file format and allowing one to create a partial bundle with 'git
>>> bundle create --filter=blob:none '.
>>
>> Nicely done.  There's a lot of refactoring here to move the
>> filtering code into a more usable place and get rid of some
>> of the awkward limitations of my original code.  Sorry that
>> you had to slog thru all of that.
>>
>>>
>>> There are a couple things that I want to point out about this implementation
>>> that could use some high-level feedback:
>>>
>>>    1. I moved the '--filter' parsing into setup_revisions() instead of adding
>>>       another place to parse it. This works for 'git bundle' but it also
>>>       allows it to be parsed successfully in commands such as 'git diff' which
>>>       doesn't make sense. Options such as '--objects' are already being parsed
>>>       there, and they don't make sense either, so I want some thoughts on
>>>       this.
>>
>> This feels like something that can wait for another task.
>> Let's keep this series focused on adding partial bundles.
> > What do you mean "can wait"? Do you recommend that I _don't_ do this
> refactor and instead implement filter parsing directly in bundles?
> > Or, are you saying that we should not worry about these potential
> side-effects of allowing (then ignoring) certain options in other
> commands, at least until a later series?

Sorry to be confusing here.  I just meant that the refactoring looks
good and we should continue with it.  And for now let's not worry
about those odd non-sense combinations that may now be possible.

That at some point later we can address the non-sense combinations
with some control flags on the call to the common parser or something.
I started thinking about that while looking at the code and just thought
that it would be a bit of a mess.  And that we didn't need to do any
thing about it now.  Such tweaking would be better done in a later
series.

>   >>>    2. If someone uses 'git clone partial.bdl partial' where 'partial.bdl' is a
>>>       filtered bundle, then the clone will fail with a message such as
>>>
>>> fatal: missing blob object '9444604d515c0b162e37e59accd54a0bac50ed2e' fatal:
>>> remote did not send all necessary objects
>>>
>>> This might be fine. We don't expect users to clone partial bundles or fetch
>>> partial bundles into an unfiltered repo and these failures are expected. It
>>> is possible that we could put in custom logic to fail faster by reading the
>>> bundle header for a filter.
>>>
>>> Generally, the idea is to open this up as a potential way to bootstrap a
>>> clone of a partial clone using a set of precomputed partial bundles.
>>
>> I think this is to be expected.
>>
>> Would it help to have Git do a no-checkout clone when cloning
>> from a partial bundle?  Maybe that would give the user a chance to set
>> a real remote (and maybe set the partial clone/fetch config settings)
>> and then backfill their local clone??   (That might be functional, but
>> not very user-friendly....)
>>
>> Or should we just consider this limitation as a placeholder while we
>> wait for the Bundle URI effort?
> > It would be interesting to have another application of partial
> bundles, such as cloning directly from a bundle, then allowing
> a remote to be configured. It provides a "build-it-yourself"
> approach to bundle URIs for partial clones.
> > I'm not sure if such an application is required for this series,
> or if it could be delayed until after. I'm open to suggestions.

I think what you have here is nice and self-contained.  That is,
supporting partial bundles is a nice stop point.  Then have 1 or 2
series that consumes them.

Jeff

@dscho dscho force-pushed the seen branch 9 times, most recently from 4f693b8 to d55ad1d Compare March 3, 2022 23:53
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 4, 2022

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/23/2022 12:55 PM, Derrick Stolee via GitGitGadget wrote:
> While discussing bundle-URIs [1], it came to my attention that bundles have
> no way to specify an object filter, so bundles cannot be used with partial
> clones.
> 
> [1]
> https://lore.kernel.org/git/7fab28bf-54e7-d0e9-110a-53fad6244cc9@gmail.com/
> 
> This series provides a way to fix that by adding a 'filter' capability to
> the bundle file format and allowing one to create a partial bundle with 'git
> bundle create --filter=blob:none '.
> 
> There are a couple things that I want to point out about this implementation
> that could use some high-level feedback:
> 
>  1. I moved the '--filter' parsing into setup_revisions() instead of adding
>     another place to parse it. This works for 'git bundle' but it also
>     allows it to be parsed successfully in commands such as 'git diff' which
>     doesn't make sense. Options such as '--objects' are already being parsed
>     there, and they don't make sense either, so I want some thoughts on
>     this.
> 
>  2. If someone uses 'git clone partial.bdl partial' where 'partial.bdl' is a
>     filtered bundle, then the clone will fail with a message such as
> 
> fatal: missing blob object '9444604d515c0b162e37e59accd54a0bac50ed2e' fatal:
> remote did not send all necessary objects
> 
> This might be fine. We don't expect users to clone partial bundles or fetch
> partial bundles into an unfiltered repo and these failures are expected. It
> is possible that we could put in custom logic to fail faster by reading the
> bundle header for a filter.
> 
> Generally, the idea is to open this up as a potential way to bootstrap a
> clone of a partial clone using a set of precomputed partial bundles.

Thanks Jeff, for providing a review of this series. I hope that at
least one other reviewer could take a look sometime.

Thanks,
-Stolee

@@ -62,7 +62,6 @@ static const char rev_list_usage[] =
static struct progress *progress;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  static int try_bitmap_count(struct rev_info *revs,
> -			    struct list_objects_filter_options *filter,
>  			    int filter_provided_objects)

This makes quite a lot of sense as filter is now available as
revs->filter.

>  {
>  	uint32_t commit_count = 0,
> @@ -436,7 +434,8 @@ static int try_bitmap_count(struct rev_info *revs,
>  	 */
>  	max_count = revs->max_count;
>  
> -	bitmap_git = prepare_bitmap_walk(revs, filter, filter_provided_objects);
> +	bitmap_git = prepare_bitmap_walk(revs, revs->filter,
> +					 filter_provided_objects);

And we should be able to do the same to prepare_bitmap_walk().  It
is OK if such a change comes later and not as part of this commit.

Perhaps it is deliberate.  Unlike the helpers this step touches,
namely, try_bitmap_count(), try_bitmap_traversal(), and
try_bitmap_disk_usage(), prepare_bitmap_walk() is not a file-scope
static helper and updating it will need touching many more places.

> @@ -597,13 +595,17 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
>  		}
>  
>  		if (skip_prefix(arg, ("--" CL_ARG__FILTER "="), &arg)) {

#leftoverbit.  We need to remember to clean this up, now "--filter"
is well established (I am assuming this literal-string pasting is
because we didn't know what the right and final word to be used as
the option name back when this code was originally written), when
the code around here is quiescent.

> -			parse_list_objects_filter(&filter_options, arg);
> -			if (filter_options.choice && !revs.blob_objects)
> +			if (!revs.filter)
> +				CALLOC_ARRAY(revs.filter, 1);
> +			parse_list_objects_filter(revs.filter, arg);
> +			if (revs.filter->choice && !revs.blob_objects)
>  				die(_("object filtering requires --objects"));
>  			continue;

OK.  The original "filter_options" was a structure and not a pointer
to a structure; now we have a pointer to a structure in revs as a
member so we need an on-demand allocation.  CALLOC_ARRAY() instead
of xcalloc(), when we know we are creating one element and not an
array of elements whose size happens to be one, is not wrong but it
does look strange.  Was there a reason why we avoid xcalloc()?

Makes me also wonder how big the filter_options structure is;
because we will not use unbounded many revs structure, it may have
been a simpler conversion to turn a static struct into an embedded
struct member in a struct (instead of a member of a struct that is a
pointer to a struct).  That way, we did not have to ...

>  		}
>  		if (!strcmp(arg, ("--no-" CL_ARG__FILTER))) {
> -			list_objects_filter_set_no_filter(&filter_options);
> +			if (!revs.filter)
> +				CALLOC_ARRAY(revs.filter, 1);

... repeat the on-demand allocation.  If some code used to pass
&filter_options in a parameter to helper functions, and such calling
sites get rewritten to pass the value in the revs.filter pointer,
and if revs hasn't gone through this codepath, these helper functions
will start receiving NULL in their filter_options parameter, which
they may or may not be prepared to take.  This "we get rid of a
global struct and replace it with an on-demand allocated structure,
pointer to which is stored in the revs structure" rewrite somehow
makes me nervous.

> diff --git a/revision.h b/revision.h
> index 3c58c18c63a..1ddb73ab82e 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -81,6 +81,7 @@ struct rev_cmdline_info {
>  
>  struct oidset;
>  struct topo_walk_info;
> +struct list_objects_filter_options;

Is the forward-declaration the only reason why we needed to have a
pointer to a(n opaque) struct, not an embedded struct, as a member?

>  struct rev_info {
>  	/* Starting list */
> @@ -94,6 +95,9 @@ struct rev_info {
>  	/* The end-points specified by the end user */
>  	struct rev_cmdline_info cmdline;
>  
> +	/* Object filter options. NULL for no filtering. */
> +	struct list_objects_filter_options *filter;
> +
>  	/* excluding from --branches, --refs, etc. expansion */
>  	struct string_list *ref_excludes;

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 3/4/2022 5:15 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>>  static int try_bitmap_count(struct rev_info *revs,
>> -			    struct list_objects_filter_options *filter,
>>  			    int filter_provided_objects)
> 
> This makes quite a lot of sense as filter is now available as
> revs->filter.
> 
>>  {
>>  	uint32_t commit_count = 0,
>> @@ -436,7 +434,8 @@ static int try_bitmap_count(struct rev_info *revs,
>>  	 */
>>  	max_count = revs->max_count;
>>  
>> -	bitmap_git = prepare_bitmap_walk(revs, filter, filter_provided_objects);
>> +	bitmap_git = prepare_bitmap_walk(revs, revs->filter,
>> +					 filter_provided_objects);
> 
> And we should be able to do the same to prepare_bitmap_walk().  It
> is OK if such a change comes later and not as part of this commit.
> 
> Perhaps it is deliberate.  Unlike the helpers this step touches,
> namely, try_bitmap_count(), try_bitmap_traversal(), and
> try_bitmap_disk_usage(), prepare_bitmap_walk() is not a file-scope
> static helper and updating it will need touching many more places.

I'm making a note that this cleanup can happen in a follow-up series.
 
>> @@ -597,13 +595,17 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
>>  		}
>>  
>>  		if (skip_prefix(arg, ("--" CL_ARG__FILTER "="), &arg)) {
> 
> #leftoverbit.  We need to remember to clean this up, now "--filter"
> is well established (I am assuming this literal-string pasting is
> because we didn't know what the right and final word to be used as
> the option name back when this code was originally written), when
> the code around here is quiescent.

Good point.

>> -			parse_list_objects_filter(&filter_options, arg);
>> -			if (filter_options.choice && !revs.blob_objects)
>> +			if (!revs.filter)
>> +				CALLOC_ARRAY(revs.filter, 1);
>> +			parse_list_objects_filter(revs.filter, arg);
>> +			if (revs.filter->choice && !revs.blob_objects)
>>  				die(_("object filtering requires --objects"));
>>  			continue;
> 
> OK.  The original "filter_options" was a structure and not a pointer
> to a structure; now we have a pointer to a structure in revs as a
> member so we need an on-demand allocation.  CALLOC_ARRAY() instead
> of xcalloc(), when we know we are creating one element and not an
> array of elements whose size happens to be one, is not wrong but it
> does look strange.  Was there a reason why we avoid xcalloc()?

I think I've been using CALLOC_ARRAY(..., 1) over "... = xcalloc()"
for simplicity's sake for a while. I see quite a few across the
codebase, too, but I can swap the usage here if you feel that is
important.

> Makes me also wonder how big the filter_options structure is;
> because we will not use unbounded many revs structure, it may have
> been a simpler conversion to turn a static struct into an embedded
> struct member in a struct (instead of a member of a struct that is a
> pointer to a struct).  That way, we did not have to ...
>>>  		}
>>  		if (!strcmp(arg, ("--no-" CL_ARG__FILTER))) {
>> -			list_objects_filter_set_no_filter(&filter_options);
>> +			if (!revs.filter)
>> +				CALLOC_ARRAY(revs.filter, 1);
> 
> ... repeat the on-demand allocation.  If some code used to pass
> &filter_options in a parameter to helper functions, and such calling
> sites get rewritten to pass the value in the revs.filter pointer,
> and if revs hasn't gone through this codepath, these helper functions
> will start receiving NULL in their filter_options parameter, which
> they may or may not be prepared to take.  This "we get rid of a
> global struct and replace it with an on-demand allocated structure,
> pointer to which is stored in the revs structure" rewrite somehow
> makes me nervous.

I think the main idea is that the filter being NULL indicates "no
filter is used. Do a full object walk." If we use a static struct,
then we need to instead use revs->filter.filter_spec.nr, but that
is already being used as a BUG() statement:

const char *list_objects_filter_spec(struct list_objects_filter_options *filter)
{
	if (!filter->filter_spec.nr)
		BUG("no filter_spec available for this filter");

so a non-NULL filter with no specs is considered invalid, while
a NULL filter _is_ currently considered valid.

While we _could_ make that switch to using a static struct and
change our checks to allow empty specs, that would be a more
involved change. Maybe we can leave it for a follow up?

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <derrickstolee@github.com> writes:

>> member so we need an on-demand allocation.  CALLOC_ARRAY() instead
>> of xcalloc(), when we know we are creating one element and not an
>> array of elements whose size happens to be one, is not wrong but it
>> does look strange.  Was there a reason why we avoid xcalloc()?
>
> I think I've been using CALLOC_ARRAY(..., 1) over "... = xcalloc()"
> for simplicity's sake for a while. I see quite a few across the
> codebase, too, but I can swap the usage here if you feel that is
> important.

Not at all.  I think it was just me who was confused; CALLOC_ARRAY()
vs xcalloc() is not confusing.  Both are capable of and meant for
allocating an array of these elements, and use of it for a single
element non-array is the same either way.  Nothing to complain about.

>> they may or may not be prepared to take.  This "we get rid of a
>> global struct and replace it with an on-demand allocated structure,
>> pointer to which is stored in the revs structure" rewrite somehow
>> makes me nervous.
>
> I think the main idea is that the filter being NULL indicates "no
> filter is used. Do a full object walk." If we use a static struct,
> then we need to instead use revs->filter.filter_spec.nr, but that
> is already being used as a BUG() statement:

Thanks.  My observation was primarily that it looked deviating from
the original code but that is not an objection unless the original
was without room for improvement.  And in fact in this case, I think
the global variable that was a static struct should have been a
global variable that is a pointer to a struct which is NULL unless
the filtering capability is being used.  So in that sense, the
conversion done in this series is an improvement and going in the
right direction.

> While we _could_ make that switch to using a static struct and
> change our checks to allow empty specs, that would be a more
> involved change. Maybe we can leave it for a follow up?

No, there is no need to.  A pointer that is NULL when unused is the
right thing to do.

Thanks.

@@ -71,6 +71,11 @@ and the Git bundle v2 format cannot represent a shallow clone repository.
== Capabilities
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Mar 08 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
>  static const char v2_bundle_signature[] = "# v2 git bundle\n";
>  static const char v3_bundle_signature[] = "# v3 git bundle\n";
> @@ -33,6 +33,7 @@ void bundle_header_release(struct bundle_header *header)
>  {
>  	string_list_clear(&header->prerequisites, 1);
>  	string_list_clear(&header->references, 1);
> +	list_objects_filter_release(&header->filter);
>  }
>  
>  static int parse_capability(struct bundle_header *header, const char *capability)
> @@ -45,6 +46,10 @@ static int parse_capability(struct bundle_header *header, const char *capability
>  		header->hash_algo = &hash_algos[algo];
>  		return 0;
>  	}
> +	if (skip_prefix(capability, "filter=", &arg)) {
> +		parse_list_objects_filter(&header->filter, arg);
> +		return 0;
> +	}
>  	return error(_("unknown capability '%s'"), capability);
>  }

I haven't tested, but I did wonder if our purely "check reachability"
(or equivalent) "verify" would be slowed down by doing whatever filter
magic we're doing here, but then remembered/saw that we only parse the
header, so it can't be that bad :)

I.e. this is only checking the syntax of the filter, surely, and then
spits it back at us. That makes sense.

I think that this hunk from the subsequent 10/12 is in the wrong place
though, and should be here when we change "verify" (not "create" later):
	
	diff --git a/Documentation/git-bundle.txt b/Documentation/git-bundle.txt
	index 72ab8139052..831c4788a94 100644
	--- a/Documentation/git-bundle.txt
	+++ b/Documentation/git-bundle.txt
	@@ -75,8 +75,8 @@ verify <file>::
	 	cleanly to the current repository.  This includes checks on the
	 	bundle format itself as well as checking that the prerequisite
	 	commits exist and are fully linked in the current repository.
	-	'git bundle' prints a list of missing commits, if any, and exits
	-	with a non-zero status.
	+	'git bundle' prints the bundle's object filter and its list of
	+	missing commits, if any, and exits with a non-zero status.
	 
	 list-heads <file>::
	 	Lists the references defined in the bundle.  If followed by a

I think instead of starting to list every header we might add in the
future there it would make sense to just add after the pre-image
full-stop something like:

    We'll also print out information about any known capabilities, such
    as "object filter". See "Capabilities" in technical/...

Which future proofs it a bit...

> [...]

I think it would also be great to have tests for intentionally
corrupting the header and seeing what the "verify" output is, i.e. do we
die right away, proceed to validate the rest. So just (somewhat
pseudocode):

    git bundle create [...] my.bdl &&
    sed 's/@filter: .*/@filter: bad/' <my.bdl >bad.bdl &&
    test_must_fail git bundle verify bad.bdl 2>actual &&
    [...]
    test_cmp expect actual

Overall this series looks really good at this point, and I'm down to
minor shades of colors on the bikeshedding :)

As we add more embedded members with type 'struct
list_objects_filter_options', it will be important to easily perform a
deep copy across multiple such structs. Create
list_objects_filter_copy() to satisfy this need.

This method is recursive to match the recursive nature of the struct.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
derrickstolee and others added 11 commits March 9, 2022 10:15
Placing a 'struct list_objects_filter_options' within 'struct rev_info'
will assist making some bookkeeping around object filters in the future.

For now, let's use this new member to remove a static global instance of
the struct from builtin/rev-list.c.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
In builtin/pack-objects.c, we use a 'filter_options' global to populate
the --filter=<X> argument. The previous change created a pointer to a
filter option in 'struct rev_info', so we can use that pointer here as a
start to simplifying some usage of object filters.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Now that all consumers of prepare_bitmap_walk() have populated the
'filter' member of 'struct rev_info', we can drop that extra parameter
from the method and access it directly from the 'struct rev_info'.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Now that all consumers of traverse_commit_list_filtered() populate the
'filter' member of 'struct rev_info', we can drop that parameter from
the method prototype to simplify things. In addition, the only thing
different now between traverse_commit_list_filtered() and
traverse_commit_list() is the presence of the 'omitted' parameter, which
is only non-NULL for one caller. We can consolidate these two methods by
having one call the other and use the simpler form everywhere the
'omitted' parameter would be NULL.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
The previous change consolidated traverse_commit_list() and
traverse_commit_list_filtered(). This allows us to simplify the
recommended usage in MyFirstObjectWalk.txt to use this new set of
values.

While here, add some clarification on the difference between the two
methods.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
If a caller to traverse_commit_list() specifies the options for the
--objects flag but does not specify a show_object function pointer, the
result is a segfault. This is currently visible by running 'git bundle
create --objects HEAD'.

We could fix this problem by supplying a no-op callback in
builtin/bundle.c, but that only solves the problem for one builtin,
leaving this segfault open for other callers.

Replace all callers of the show_commit and show_object function pointers
in list-objects.c to call helper functions show_commit() and
show_object() which check that the given context has non-NULL functions
before passing the necessary data. One extra benefit is that it reduces
duplication due to passing ctx->show_data to every caller.

Test that this segfault no longer occurs for 'git bundle'.

Co-authored-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Derrick Stolee <derrickstolee@github.com>
The v3 bundle format has capabilities, allowing newer versions of Git to
create bundles with newer features. Older versions that do not
understand these new capabilities will fail with a helpful warning.

Create a new capability allowing Git to understand that the contained
pack-file is filtered according to some object filter. Typically, this
filter will be "blob:none" for a blobless partial clone.

This change teaches Git to parse this capability, place its value in the
bundle header, and demonstrate this understanding by adding a message to
'git bundle verify'.

Since we will use gently_parse_list_objects_filter() outside of
list-objects-filter-options.c, make it an external method and move its
API documentation to before its declaration.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Now that 'struct rev_info' has a 'filter' member and most consumers of
object filtering are using that member instead of an external struct,
move the parsing of the '--filter' option out of builtin/rev-list.c and
into revision.c.

This use within handle_revision_pseudo_opt() allows us to find the
option within setup_revisions() if the arguments are passed directly. In
the case of a command such as 'git blame', the arguments are first
scanned and checked with parse_revision_opt(), which complains about the
option, so 'git blame --filter=blob:none <file>' does not become valid
with this change.

Some commands, such as 'git diff' gain this option without having it
make an effect. And 'git diff --objects' was already possible, but does
not actually make sense in that builtin.

The key addition that is coming is 'git bundle create --filter=<X>' so
we can create bundles containing promisor packs. More work is required
to make them fully functional, but that will follow.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
A previous change allowed Git to parse bundles with the 'filter'
capability. Now, teach Git to create bundles with this option.

Some rearranging of code is required to get the option parsing in the
correct spot. There are now two reasons why we might need capabilities
(a new hash algorithm or an object filter) so that is pulled out into a
place where we can check both at the same time.

The --filter option is parsed as part of setup_revisions(), but it
expected the --objects flag, too. That flag is somewhat implied by 'git
bundle' because it creates a pack-file walking objects, but there is
also a walk that walks the revision range expecting only commits. Make
this parsing work by setting 'revs.tree_objects' and 'revs.blob_objects'
before the call to setup_revisions().

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
In order to have a valid pack-file after unbundling a bundle that has
the 'filter' capability, we need to generate a .promisor file. The
bundle does not promise _where_ the objects can be found, but we can
expect that these bundles will be unbundled in repositories with
appropriate promisor remotes that can find those missing objects.

Use the 'git index-pack --promisor=<message>' option to create this
.promisor file. Add "from-bundle" as the message to help anyone diagnose
issues with these promisor packs.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Users can create a new repository using 'git clone <bundle-file>'. The
new "@filter" capability for bundles means that we can generate a bundle
that does not contain all reachable objects, even if the header has no
negative commit OIDs.

It is feasible to think that we could make a filtered bundle work with
the command

  git clone --filter=$filter --bare <bundle-file>

or possibly replacing --bare with --no-checkout. However, this requires
having some repository-global config that specifies the specified object
filter and notifies Git about the existence of promisor pack-files.
Without a remote, that is currently impossible.

As a stop-gap, parse the bundle header during 'git clone' and die() with
a helpful error message instead of the current behavior of failing due
to "missing objects".

Most of the existing logic for handling bundle clones actually happens
in fetch-pack.c, but that logic is the same as if the user specified
'git fetch <bundle>', so we want to avoid failing to fetch a filtered
bundle when in an existing repository that has the proper config set up
for at least one remote.

Carefully comment around the test that this is not the desired long-term
behavior of 'git clone' in this case, but instead that we need to do
more work before that is possible.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 9, 2022

Submitted as pull.1159.v4.git.1646841703.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1159/derrickstolee/bundle/partial-v4

To fetch this version to local tag pr-1159/derrickstolee/bundle/partial-v4:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1159/derrickstolee/bundle/partial-v4

@@ -75,8 +75,11 @@ verify <file>::
cleanly to the current repository. This includes checks on the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/Documentation/git-bundle.txt b/Documentation/git-bundle.txt
> index 72ab8139052..ac4c4352aae 100644
> --- a/Documentation/git-bundle.txt
> +++ b/Documentation/git-bundle.txt
> @@ -75,8 +75,11 @@ verify <file>::
>  	cleanly to the current repository.  This includes checks on the
>  	bundle format itself as well as checking that the prerequisite
>  	commits exist and are fully linked in the current repository.
> -	'git bundle' prints a list of missing commits, if any, and exits
> -	with a non-zero status.
> +	Information about additional capabilities, such as "object filter",
> +	is printed. See "Capabilities" in link:technical/bundle-format.html
> +	for more information. Finally, 'git bundle' prints a list of
> +	missing commits, if any. The exit code is zero for success, but
> +	will be nonzero if the bundle file is invalid.

Hmph.  I wasn't expecting this change (not objecting, but mostly am
surprised) relative to the previous round where the filter was
mentioned only when we issue an error message.  I was expecting to
see something like "list-filters <file>", which is analog to the
"list-heads <file>", to help those who want to programatically build
around the "git bundle" command output.  Or "--list-capabilities" to
accomodate the current, this, and future capabilities.  We already
have the object-format thing before this series.  Do we have an
interface to expose that out of a given bundle file?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 3/9/2022 1:41 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> diff --git a/Documentation/git-bundle.txt b/Documentation/git-bundle.txt
>> index 72ab8139052..ac4c4352aae 100644
>> --- a/Documentation/git-bundle.txt
>> +++ b/Documentation/git-bundle.txt
>> @@ -75,8 +75,11 @@ verify <file>::
>>  	cleanly to the current repository.  This includes checks on the
>>  	bundle format itself as well as checking that the prerequisite
>>  	commits exist and are fully linked in the current repository.
>> -	'git bundle' prints a list of missing commits, if any, and exits
>> -	with a non-zero status.
>> +	Information about additional capabilities, such as "object filter",
>> +	is printed. See "Capabilities" in link:technical/bundle-format.html
>> +	for more information. Finally, 'git bundle' prints a list of
>> +	missing commits, if any. The exit code is zero for success, but
>> +	will be nonzero if the bundle file is invalid.
> 
> Hmph.  I wasn't expecting this change (not objecting, but mostly am
> surprised) relative to the previous round where the filter was
> mentioned only when we issue an error message.  I was expecting to
> see something like "list-filters <file>", which is analog to the
> "list-heads <file>", to help those who want to programatically build
> around the "git bundle" command output.  Or "--list-capabilities" to
> accomodate the current, this, and future capabilities.  We already
> have the object-format thing before this series.  Do we have an
> interface to expose that out of a given bundle file?
 
The object format does _not_ appear to be output by 'git bundle verify'.
I just ran t6020 under GIT_TEST_DEFAULT_HASH=sha256 and see the
capability being written, but not output by verify:


The bundle contains these 10 refs:
d519553fbcf280df4448d588c25a51872f2d8dec95ba65a8a1bd3c64a5eec664 refs/heads/main
265b1effb3fdb80e04f7ea64e717f5677ddf57d00145dce7c508ba1f5ddb9081 refs/heads/release
611ac8182ea26d7aad227873b70f584593af1aa584bbdd37b36055e71be6ccd7 refs/heads/topic/1
4251af01ec70cdca692a3d15d78ccb9a6ca92ef344bd2dbc3bac20081347ae9b refs/heads/topic/2
611ac8182ea26d7aad227873b70f584593af1aa584bbdd37b36055e71be6ccd7 refs/pull/1/head
ec7e40a591df46923b25fd44bd86a2a80927f343d141f55ddf295c5d2d57959e refs/pull/2/head
754e9363bbfce179d35ccc48ae3a3c81db95a489cc632fafe5c10b25aed29d74 refs/tags/v1
a96c78650835f2041c49dce964bb759add14cfc4d35af3b7ee2b22289f9ba817 refs/tags/v2
398a930e72d21ea455c982227cca3c8fb5feb88c31f1d42226a3e6c42ff8db8f refs/tags/v3
d519553fbcf280df4448d588c25a51872f2d8dec95ba65a8a1bd3c64a5eec664 HEAD
The bundle uses this filter: blob:none
The bundle records a complete history.
partial.bdl is okay


This output does not seem to be designed for machine parsing, so
this extension of "The bundle uses this filter:" shouldn't be
problematic. A similar addition for object-format should be
possible as a follow-up.

Further, it seems you are asking for something that just reads
the header and supplies valuable data from the header such as
the refs and the capabilities. Perhaps 'git bundle header <file>'?

(We already have 'git bundle list-heads', so maybe we should
leave the refs out of 'git bundle header' output?)

I can add both as follow-ups to my existing list of things.

Thanks,
-Stolee

@@ -62,7 +62,6 @@ static const char rev_list_usage[] =
static struct progress *progress;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/revision.h b/revision.h
> index 3c58c18c63a..b1669a8cc33 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -8,6 +8,7 @@
>  #include "pretty.h"
>  #include "diff.h"
>  #include "commit-slab-decl.h"
> +#include "list-objects-filter-options.h"
>  
>  /**
>   * The revision walking API offers functions to build a list of revisions
> @@ -94,6 +95,12 @@ struct rev_info {
>  	/* The end-points specified by the end user */
>  	struct rev_cmdline_info cmdline;
>  
> +	/*
> +	 * Object filter options. No filtering is specified
> +	 * if and only if filter.choice is zero.
> +	 */
> +	struct list_objects_filter_options filter;

I wondered if s/zero/LOFC_DISABLED/ would make it more helpful, but
seeing changes like the one i the later "parse filter capability"
step, which can just become

-	if (header->filter.choice != LOFC_DISABLED) {
+	if (header->filter.choice) {
		... do things ...

relative to the previous round, I think what is in the posted patch
indeed is a better way to help developers.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 9, 2022

This patch series was integrated into seen via git@455f075.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 9, 2022

This patch series was integrated into seen via git@3b3888b.

@@ -3651,7 +3651,7 @@ static int pack_options_allow_reuse(void)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Wed, Mar 09 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
>
> In builtin/pack-objects.c, we use a 'filter_options' global to populate
> the --filter=<X> argument. The previous change created a pointer to a
> filter option in 'struct rev_info', so we can use that pointer here as a
> start to simplifying some usage of object filters.
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  builtin/pack-objects.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index ba2006f2212..e5b7d015d7d 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3651,7 +3651,7 @@ static int pack_options_allow_reuse(void)
>  
>  static int get_object_list_from_bitmap(struct rev_info *revs)
>  {
> -	if (!(bitmap_git = prepare_bitmap_walk(revs, &filter_options, 0)))
> +	if (!(bitmap_git = prepare_bitmap_walk(revs, &revs->filter, 0)))
>  		return -1;
>  
>  	if (pack_options_allow_reuse() &&
> @@ -3727,6 +3727,7 @@ static void get_object_list(int ac, const char **av)
>  	repo_init_revisions(the_repository, &revs, NULL);
>  	save_commit_buffer = 0;
>  	setup_revisions(ac, av, &revs, &s_r_opt);
> +	list_objects_filter_copy(&revs.filter, &filter_options);
>  
>  	/* make sure shallows are read */
>  	is_repository_shallow(the_repository);
> @@ -3777,7 +3778,7 @@ static void get_object_list(int ac, const char **av)
>  
>  	if (!fn_show_object)
>  		fn_show_object = show_object;
> -	traverse_commit_list_filtered(&filter_options, &revs,
> +	traverse_commit_list_filtered(&revs.filter, &revs,
>  				      show_commit, fn_show_object, NULL,
>  				      NULL);

Re your
https://lore.kernel.org/git/77c8ef4b-5dce-401b-e703-cfa32e18c853@github.com/
I was looking at how to handle the interaction between this and my
revisions_release() series.

This adds a new memory leak, which can be seen with:

    make SANITIZE=leak
    (cd t && ./t5532-fetch-proxy.sh -vixd)

I.e. this part is new:
    
    remote: Direct leak of 1 byte(s) in 1 object(s) allocated from:
    remote:     #0 0x4552f8 in __interceptor_malloc (git+0x4552f8)
    remote:     #1 0x75a089 in do_xmalloc wrapper.c:41:8
    remote:     #2 0x75a046 in xmalloc wrapper.c:62:9
    remote:     #3 0x62c692 in list_objects_filter_copy list-objects-filter-options.c:433:2
    remote:     #4 0x4f70bf in get_object_list builtin/pack-objects.c:3730:2
    remote:     #5 0x4f5e0e in cmd_pack_objects builtin/pack-objects.c:4157:3
    remote:     #6 0x4592ba in run_builtin git.c:465:11
    remote:     #7 0x457d71 in handle_builtin git.c:718:3
    remote:     #8 0x458ca5 in run_argv git.c:785:4
    remote:     #9 0x457b30 in cmd_main git.c:916:19
    remote:     #10 0x563179 in main common-main.c:56:11
    remote:     #11 0x7fddd2da67ec in __libc_start_main csu/../csu/libc-start.c:332:16
    remote:     #12 0x4300e9 in _start (git+0x4300e9)
    
Of course it's not "new" in the sense that we in effect leaked this
before, but it was still reachable, you're just changing it so that
instead of being stored in the static "filter_options" variable in
pack-objects.c we're storing it in "struct rev_info", which has a
different lifetime.

I think instead of me rebasing my series on top of yours and tangling
the two up a better option is to just add a change to this, so after
list_objects_filter_copy() do:

    UNLEAK(revs.filter);

Or, alternatively adding this to the end of the function (in which case
Junio will need to deal with a minor textual conflict):

    list_objects_filter_release(&revs.filter);

Both of those make my series merged with "seen" (which has this change)
pass with SANITIZE=leak + GIT_TEST_PASSING_SANITIZE_LEAK=true again.

You could do the same in your later change adding
list_objects_filter_copy() to verify_bundle(), that one also adds a new
leak, but happens not to cause test failures since the bundle.c code
isn't otherwise marked as passing with SANITIZE=leak, it fails in
various other ways.

Obviously we should do something about the actual leak eventually, but
that can be done in some follow-up work to finish up the missing bits of
release_revisions(), i.e. adding list_objects_filter_release() etc. to
release_revisions().

So I think just adding UNLEAK() here (and optionally, also to the
bundle.c code) is the least invasive thing, if you & Junio are OK with
that approach.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 3/10/2022 8:11 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Mar 09 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> In builtin/pack-objects.c, we use a 'filter_options' global to populate
>> the --filter=<X> argument. The previous change created a pointer to a
>> filter option in 'struct rev_info', so we can use that pointer here as a
>> start to simplifying some usage of object filters.
>>
>> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
>> ---
>>  builtin/pack-objects.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>> index ba2006f2212..e5b7d015d7d 100644
>> --- a/builtin/pack-objects.c
>> +++ b/builtin/pack-objects.c
>> @@ -3651,7 +3651,7 @@ static int pack_options_allow_reuse(void)
>>  
>>  static int get_object_list_from_bitmap(struct rev_info *revs)
>>  {
>> -	if (!(bitmap_git = prepare_bitmap_walk(revs, &filter_options, 0)))
>> +	if (!(bitmap_git = prepare_bitmap_walk(revs, &revs->filter, 0)))
>>  		return -1;
>>  
>>  	if (pack_options_allow_reuse() &&
>> @@ -3727,6 +3727,7 @@ static void get_object_list(int ac, const char **av)
>>  	repo_init_revisions(the_repository, &revs, NULL);
>>  	save_commit_buffer = 0;
>>  	setup_revisions(ac, av, &revs, &s_r_opt);
>> +	list_objects_filter_copy(&revs.filter, &filter_options);
>>  
>>  	/* make sure shallows are read */
>>  	is_repository_shallow(the_repository);
>> @@ -3777,7 +3778,7 @@ static void get_object_list(int ac, const char **av)
>>  
>>  	if (!fn_show_object)
>>  		fn_show_object = show_object;
>> -	traverse_commit_list_filtered(&filter_options, &revs,
>> +	traverse_commit_list_filtered(&revs.filter, &revs,
>>  				      show_commit, fn_show_object, NULL,
>>  				      NULL);
> 
> Re your
> https://lore.kernel.org/git/77c8ef4b-5dce-401b-e703-cfa32e18c853@github.com/
> I was looking at how to handle the interaction between this and my
> revisions_release() series.
> 
> This adds a new memory leak, which can be seen with:
> 
>     make SANITIZE=leak
>     (cd t && ./t5532-fetch-proxy.sh -vixd)
> 
> I.e. this part is new:
>     
>     remote: Direct leak of 1 byte(s) in 1 object(s) allocated from:
>     remote:     #0 0x4552f8 in __interceptor_malloc (git+0x4552f8)
>     remote:     #1 0x75a089 in do_xmalloc wrapper.c:41:8
>     remote:     #2 0x75a046 in xmalloc wrapper.c:62:9
>     remote:     #3 0x62c692 in list_objects_filter_copy list-objects-filter-options.c:433:2
>     remote:     #4 0x4f70bf in get_object_list builtin/pack-objects.c:3730:2
>     remote:     #5 0x4f5e0e in cmd_pack_objects builtin/pack-objects.c:4157:3
>     remote:     #6 0x4592ba in run_builtin git.c:465:11
>     remote:     #7 0x457d71 in handle_builtin git.c:718:3
>     remote:     #8 0x458ca5 in run_argv git.c:785:4
>     remote:     #9 0x457b30 in cmd_main git.c:916:19
>     remote:     #10 0x563179 in main common-main.c:56:11
>     remote:     #11 0x7fddd2da67ec in __libc_start_main csu/../csu/libc-start.c:332:16
>     remote:     #12 0x4300e9 in _start (git+0x4300e9)
>     
> Of course it's not "new" in the sense that we in effect leaked this
> before, but it was still reachable, you're just changing it so that
> instead of being stored in the static "filter_options" variable in
> pack-objects.c we're storing it in "struct rev_info", which has a
> different lifetime.

True, and 'struct rev_info' is not being release in any way, either,
right?

> I think instead of me rebasing my series on top of yours and tangling
> the two up a better option is to just add a change to this, so after
> list_objects_filter_copy() do:
> 
>     UNLEAK(revs.filter);
> 
> Or, alternatively adding this to the end of the function (in which case
> Junio will need to deal with a minor textual conflict):
> 
>     list_objects_filter_release(&revs.filter);
> 
> Both of those make my series merged with "seen" (which has this change)
> pass with SANITIZE=leak + GIT_TEST_PASSING_SANITIZE_LEAK=true again.
> 
> You could do the same in your later change adding
> list_objects_filter_copy() to verify_bundle(), that one also adds a new
> leak, but happens not to cause test failures since the bundle.c code
> isn't otherwise marked as passing with SANITIZE=leak, it fails in
> various other ways.
> 
> Obviously we should do something about the actual leak eventually, but
> that can be done in some follow-up work to finish up the missing bits of
> release_revisions(), i.e. adding list_objects_filter_release() etc. to
> release_revisions().

I understand that you like to "show your work" by marking tests as
safe for leak-check by making the smallest changes possible, but your
series has a lot of small patches that do nothing but add a free() or
release_*() call instead of implementing the "right" release_revisions()
from the start.
 
> So I think just adding UNLEAK() here (and optionally, also to the
> bundle.c code) is the least invasive thing, if you & Junio are OK with
> that approach.

Could you send a patch that does just that, so we are sure to cover
the warnings you are seeing in your tests?

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Thu, Mar 10 2022, Derrick Stolee wrote:

> On 3/10/2022 8:11 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Mar 09 2022, Derrick Stolee via GitGitGadget wrote:
>> 
>>> From: Derrick Stolee <derrickstolee@github.com>
>>>
>>> In builtin/pack-objects.c, we use a 'filter_options' global to populate
>>> the --filter=<X> argument. The previous change created a pointer to a
>>> filter option in 'struct rev_info', so we can use that pointer here as a
>>> start to simplifying some usage of object filters.
>>>
>>> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
>>> ---
>>>  builtin/pack-objects.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
>>> index ba2006f2212..e5b7d015d7d 100644
>>> --- a/builtin/pack-objects.c
>>> +++ b/builtin/pack-objects.c
>>> @@ -3651,7 +3651,7 @@ static int pack_options_allow_reuse(void)
>>>  
>>>  static int get_object_list_from_bitmap(struct rev_info *revs)
>>>  {
>>> -	if (!(bitmap_git = prepare_bitmap_walk(revs, &filter_options, 0)))
>>> +	if (!(bitmap_git = prepare_bitmap_walk(revs, &revs->filter, 0)))
>>>  		return -1;
>>>  
>>>  	if (pack_options_allow_reuse() &&
>>> @@ -3727,6 +3727,7 @@ static void get_object_list(int ac, const char **av)
>>>  	repo_init_revisions(the_repository, &revs, NULL);
>>>  	save_commit_buffer = 0;
>>>  	setup_revisions(ac, av, &revs, &s_r_opt);
>>> +	list_objects_filter_copy(&revs.filter, &filter_options);
>>>  
>>>  	/* make sure shallows are read */
>>>  	is_repository_shallow(the_repository);
>>> @@ -3777,7 +3778,7 @@ static void get_object_list(int ac, const char **av)
>>>  
>>>  	if (!fn_show_object)
>>>  		fn_show_object = show_object;
>>> -	traverse_commit_list_filtered(&filter_options, &revs,
>>> +	traverse_commit_list_filtered(&revs.filter, &revs,
>>>  				      show_commit, fn_show_object, NULL,
>>>  				      NULL);
>> 
>> Re your
>> https://lore.kernel.org/git/77c8ef4b-5dce-401b-e703-cfa32e18c853@github.com/
>> I was looking at how to handle the interaction between this and my
>> revisions_release() series.
>> 
>> This adds a new memory leak, which can be seen with:
>> 
>>     make SANITIZE=leak
>>     (cd t && ./t5532-fetch-proxy.sh -vixd)
>> 
>> I.e. this part is new:
>>     
>>     remote: Direct leak of 1 byte(s) in 1 object(s) allocated from:
>>     remote:     #0 0x4552f8 in __interceptor_malloc (git+0x4552f8)
>>     remote:     #1 0x75a089 in do_xmalloc wrapper.c:41:8
>>     remote:     #2 0x75a046 in xmalloc wrapper.c:62:9
>>     remote:     #3 0x62c692 in list_objects_filter_copy list-objects-filter-options.c:433:2
>>     remote:     #4 0x4f70bf in get_object_list builtin/pack-objects.c:3730:2
>>     remote:     #5 0x4f5e0e in cmd_pack_objects builtin/pack-objects.c:4157:3
>>     remote:     #6 0x4592ba in run_builtin git.c:465:11
>>     remote:     #7 0x457d71 in handle_builtin git.c:718:3
>>     remote:     #8 0x458ca5 in run_argv git.c:785:4
>>     remote:     #9 0x457b30 in cmd_main git.c:916:19
>>     remote:     #10 0x563179 in main common-main.c:56:11
>>     remote:     #11 0x7fddd2da67ec in __libc_start_main csu/../csu/libc-start.c:332:16
>>     remote:     #12 0x4300e9 in _start (git+0x4300e9)
>>     
>> Of course it's not "new" in the sense that we in effect leaked this
>> before, but it was still reachable, you're just changing it so that
>> instead of being stored in the static "filter_options" variable in
>> pack-objects.c we're storing it in "struct rev_info", which has a
>> different lifetime.
>
> True, and 'struct rev_info' is not being release in any way, either,
> right?

Yes, sorry to not be clear there. There's 2x other leaks just in that
one test on "master", my series addresses those, but then the 3rd
"added" here will be left behind.

>> I think instead of me rebasing my series on top of yours and tangling
>> the two up a better option is to just add a change to this, so after
>> list_objects_filter_copy() do:
>> 
>>     UNLEAK(revs.filter);
>> 
>> Or, alternatively adding this to the end of the function (in which case
>> Junio will need to deal with a minor textual conflict):
>> 
>>     list_objects_filter_release(&revs.filter);
>> 
>> Both of those make my series merged with "seen" (which has this change)
>> pass with SANITIZE=leak + GIT_TEST_PASSING_SANITIZE_LEAK=true again.
>> 
>> You could do the same in your later change adding
>> list_objects_filter_copy() to verify_bundle(), that one also adds a new
>> leak, but happens not to cause test failures since the bundle.c code
>> isn't otherwise marked as passing with SANITIZE=leak, it fails in
>> various other ways.
>> 
>> Obviously we should do something about the actual leak eventually, but
>> that can be done in some follow-up work to finish up the missing bits of
>> release_revisions(), i.e. adding list_objects_filter_release() etc. to
>> release_revisions().
>
> I understand that you like to "show your work" by marking tests as
> safe for leak-check by making the smallest changes possible, but your
> series has a lot of small patches that do nothing but add a free() or
> release_*() call instead of implementing the "right" release_revisions()
> from the start.

Yes, another way to do it would be to add the end-state
release_revisions() I have and incrementally add it everywhere.

The way I opted to do it admittedly results in a bit more churn, but was
(and still is) very useful to bisect and debug any changes.

I.e. I can easily pinpoint for any failures what exact rev.* member
being released caused a failure, which I couldn't do if I added the
end-state release_revisions() and then started adding it to code in
various places.

Then a failure due to adding it to say pack-objects wouldn't be easily
distinguishable from a failure due to releasing rev.SOMETHING in
pack-objects.

But in any case, the interaction with tips of these two sets of patches
(this series & mine) would be the same whatever the intra-series
progression I opted for is.

>> So I think just adding UNLEAK() here (and optionally, also to the
>> bundle.c code) is the least invasive thing, if you & Junio are OK with
>> that approach.
>
> Could you send a patch that does just that, so we are sure to cover
> the warnings you are seeing in your tests?

I'm suggesting fixing the following up into this series, the first hunk
is needed for the interaction with the release_revisions() series, the
second is there for completeness, but isn't required currently:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 1a0b0950a28..ffe6197729c 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3987,6 +3987,7 @@ static void get_object_list(int ac, const char **av)
 	save_commit_buffer = 0;
 	setup_revisions(ac, av, &revs, &s_r_opt);
 	list_objects_filter_copy(&revs.filter, &filter_options);
+	UNLEAK(revs.filter);
 
 	/* make sure shallows are read */
 	is_repository_shallow(the_repository);
diff --git a/bundle.c b/bundle.c
index e359370cfcd..90cfea0c984 100644
--- a/bundle.c
+++ b/bundle.c
@@ -226,6 +226,7 @@ int verify_bundle(struct repository *r,
 	setup_revisions(2, argv, &revs, NULL);
 
 	list_objects_filter_copy(&revs.filter, &header->filter);
+	UNLEAK(revs.filter);
 
 	if (prepare_revision_walk(&revs))
 		die(_("revision walk setup failed"));

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 11, 2022

This patch series was integrated into seen via git@6b8bcc1.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 13, 2022

This patch series was integrated into seen via git@39a1b59.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 14, 2022

This patch series was integrated into seen via git@3126a82.

@gitgitgadget
Copy link

gitgitgadget bot commented Mar 14, 2022

This patch series was integrated into next via git@03529ea.

@gitgitgadget gitgitgadget bot added the next label Mar 14, 2022
@gitgitgadget
Copy link

gitgitgadget bot commented Mar 14, 2022

There was a status update in the "Cooking" section about the branch ds/partial-bundles on the Git mailing list:

Bundle file format gets extended to allow a partial bundle,
filtered by similar criteria you would give when making a
partial/lazy clone.

Will merge to 'master'.
source: <pull.1159.v4.git.1646841703.gitgitgadget@gmail.com>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants