Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSOC][RFC] cat-file: reuse ref-filter logic #980

Conversation

adlternative
Copy link

@adlternative adlternative commented Jun 12, 2021

This patch series make cat-file reuse ref-filter logic.

Change from last version:

  1. Amend part of the description of git for-each-ref.txt.
  2. Modify commit messages.
  3. Rebase on zh/cat-file-batch-fix.

cc: Junio C Hamano gitster@pobox.com
cc: Christian Couder christian.couder@gmail.com
cc: Hariom Verma hariom18599@gmail.com
cc: Bagas Sanjaya bagasdotme@gmail.com
cc: Jeff King peff@peff.net
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com

@adlternative
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 12, 2021

Submitted as pull.980.git.1623496458.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-980/adlternative/cat-file-batch-refactor-v1

To fetch this version to local tag pr-980/adlternative/cat-file-batch-refactor-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-980/adlternative/cat-file-batch-refactor-v1

@@ -144,6 +144,7 @@ enum atom_type {
ATOM_BODY,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Christian Couder wrote (reply to this):

On Sat, Jun 12, 2021 at 1:14 PM ZheNing Hu via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: ZheNing Hu <adlternative@gmail.com>
>
> Let `populate_value()`, `get_ref_atom_value()` and
> `format_ref_array_item()` get the return value of `get_value()`
> correctly. This can help us later let `cat-file --batch` get the
> correct error message and return value of `get_value()`.

Is it get_object() or get_value()?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Christian Couder <christian.couder@gmail.com> 于2021年6月13日周日 上午4:09写道:
>
> On Sat, Jun 12, 2021 at 1:14 PM ZheNing Hu via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > Let `populate_value()`, `get_ref_atom_value()` and
> > `format_ref_array_item()` get the return value of `get_value()`
> > correctly. This can help us later let `cat-file --batch` get the
> > correct error message and return value of `get_value()`.
>
> Is it get_object() or get_value()?

Oh, it's get_object().

Thanks for pointing out:)
--
ZheNing Hu

@adlternative
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2021

Submitted as pull.980.v2.git.1623763746.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-980/adlternative/cat-file-batch-refactor-v2

To fetch this version to local tag pr-980/adlternative/cat-file-batch-refactor-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-980/adlternative/cat-file-batch-refactor-v2

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 16, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series make cat-file reuse ref-filter logic, which based on
> 5a5b5f78 ([GSOC] ref-filter: add %(rest) atom)

Hmph, does anybody have 5a5b5f78?

The way to deal with this and avoid resending the same patches
(assuming that this is not a 9-patch series, but only 5 of them are
new) is to rebase your topic on 723bc66d (ref-filter: add %(rest)
atom, 2021-06-09), which will allow you to discard the 4 earlier
patches, and force push, with base set to 723bc66d, I think, but I
am not a GGG user, so there may need an extra step or two on top of
that.

@@ -144,6 +144,7 @@ enum atom_type {
ATOM_BODY,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: ZheNing Hu <adlternative@gmail.com>
>
> Let `populate_value()`, `get_ref_atom_value()` and
> `format_ref_array_item()` get the return value of `get_object()`
> correctly.

The "get" the value correctly, I think.  What you are teaching them
is to pass the return value from get_object() through the callchain
to their callers.

The readers will be helped if you say what kind of errors
get_object() wants to tell its callers, not just "-1" is for error,
which is what populate_value() assumes to be sufficient.  In other
words, which non-zero returns from get_object() are interesting and
why?

> @@ -1997,9 +1997,11 @@ static int populate_value(struct ref_array_item *ref, struct strbuf *err)
>  static int get_ref_atom_value(struct ref_array_item *ref, int atom,
>  			      struct atom_value **v, struct strbuf *err)
>  {
> +	int ret = 0;
> +
>  	if (!ref->value) {
> -		if (populate_value(ref, err))
> -			return -1;
> +		if ((ret = populate_value(ref, err)))
> +			return ret;

The new variable only needs to be in this scope, and does not have
to be shown to the entire function.

> @@ -2573,6 +2575,7 @@ int format_ref_array_item(struct ref_array_item *info,
>  {
>  	const char *cp, *sp, *ep;
>  	struct ref_formatting_state state = REF_FORMATTING_STATE_INIT;
> +	int ret = 0;

This is dubious...

>  	state.quote_style = format->quote_style;
>  	push_stack_element(&state.stack);
> @@ -2585,10 +2588,10 @@ int format_ref_array_item(struct ref_array_item *info,
>  		if (cp < sp)
>  			append_literal(cp, sp, &state);
>  		pos = parse_ref_filter_atom(format, sp + 2, ep, error_buf);
> -		if (pos < 0 || get_ref_atom_value(info, pos, &atomv, error_buf) ||
> +		if (pos < 0 || (ret = get_ref_atom_value(info, pos, &atomv, error_buf)) ||

Here, if "ret" gets assigned any non-zero value, the condition is
satisfied, and ...

>  		    atomv->handler(atomv, &state, error_buf)) {
>  			pop_stack_element(&state.stack);
> -			return -1;
> +			return ret ? ret : -1;

... the control flow will leave this function.  Therefore, ...

>  		}
>  	}
>  	if (*cp) {
> @@ -2610,7 +2613,7 @@ int format_ref_array_item(struct ref_array_item *info,
>  	}
>  	strbuf_addbuf(final_buf, &state.stack->output);
>  	pop_stack_element(&state.stack);
> -	return 0;
> +	return ret;

... at this point, "ret" can never be anything other than zero.  Am
I misreading the patch?

If I am not misreading the patch, then "ret" does not have to be
globally visible in this function---it can have the same scope as
"pos".

>  }
>  
>  void pretty_print_ref(const char *name, const struct object_id *oid,

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Junio C Hamano <gitster@pobox.com> 于2021年6月16日周三 下午3:36写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > Let `populate_value()`, `get_ref_atom_value()` and
> > `format_ref_array_item()` get the return value of `get_object()`
> > correctly.
>
> The "get" the value correctly, I think.  What you are teaching them
> is to pass the return value from get_object() through the callchain
> to their callers.
>

Yes, this is exactly what I meant.

> The readers will be helped if you say what kind of errors
> get_object() wants to tell its callers, not just "-1" is for error,
> which is what populate_value() assumes to be sufficient.  In other
> words, which non-zero returns from get_object() are interesting and
> why?
>

As stated in 765337a, We can just print the error without exiting if the
return value of format_ref_array_item() is greater than 0. Therefore,
the current patch is to make get_object() return a value other than
-1 when an error occurs.

> > @@ -1997,9 +1997,11 @@ static int populate_value(struct ref_array_item *ref, struct strbuf *err)
> >  static int get_ref_atom_value(struct ref_array_item *ref, int atom,
> >                             struct atom_value **v, struct strbuf *err)
> >  {
> > +     int ret = 0;
> > +
> >       if (!ref->value) {
> > -             if (populate_value(ref, err))
> > -                     return -1;
> > +             if ((ret = populate_value(ref, err)))
> > +                     return ret;
>
> The new variable only needs to be in this scope, and does not have
> to be shown to the entire function.
>

Makes sense.

> > @@ -2573,6 +2575,7 @@ int format_ref_array_item(struct ref_array_item *info,
> >  {
> >       const char *cp, *sp, *ep;
> >       struct ref_formatting_state state = REF_FORMATTING_STATE_INIT;
> > +     int ret = 0;
>
> This is dubious...
>
> >       state.quote_style = format->quote_style;
> >       push_stack_element(&state.stack);
> > @@ -2585,10 +2588,10 @@ int format_ref_array_item(struct ref_array_item *info,
> >               if (cp < sp)
> >                       append_literal(cp, sp, &state);
> >               pos = parse_ref_filter_atom(format, sp + 2, ep, error_buf);
> > -             if (pos < 0 || get_ref_atom_value(info, pos, &atomv, error_buf) ||
> > +             if (pos < 0 || (ret = get_ref_atom_value(info, pos, &atomv, error_buf)) ||
>
> Here, if "ret" gets assigned any non-zero value, the condition is
> satisfied, and ...
>
> >                   atomv->handler(atomv, &state, error_buf)) {
> >                       pop_stack_element(&state.stack);
> > -                     return -1;
> > +                     return ret ? ret : -1;
>
> ... the control flow will leave this function.  Therefore, ...
>
> >               }
> >       }
> >       if (*cp) {
> > @@ -2610,7 +2613,7 @@ int format_ref_array_item(struct ref_array_item *info,
> >       }
> >       strbuf_addbuf(final_buf, &state.stack->output);
> >       pop_stack_element(&state.stack);
> > -     return 0;
> > +     return ret;
>
> ... at this point, "ret" can never be anything other than zero.  Am
> I misreading the patch?
>
> If I am not misreading the patch, then "ret" does not have to be
> globally visible in this function---it can have the same scope as
> "pos".
>

You are right, It is correct to only return 0 here at the moment.

Thanks.
--
ZheNing Hu

@@ -144,6 +144,7 @@ enum atom_type {
ATOM_BODY,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: ZheNing Hu <adlternative@gmail.com>
>
> Introduce free_array_item_internal() for freeing ref_array_item value.
> It will be called internally by free_array_item(), and it will help
> `cat-file --batch` free ref_array_item's memory later.

As a file local static function, the horrible name free_array_item()
was tolerable.  But before exposing a name like that to the outside
world, think twice if that is specific enough, and it is not.  There
are 47 different kinds of "array"s we use in the system, but this
new helper function only works with ref_array_item and not on items
in any other kinds of arrays.

> -/*  Free memory allocated for a ref_array_item */
> -static void free_array_item(struct ref_array_item *item)
> +void free_array_item_internal(struct ref_array_item *item)
>  {

And "internal" is a horrible name to have as an external name.  You
probably can come up with a more appropriate name when you imagine
yourself explaining to somebody who is relatively new to this part
of the codebase what the difference between free_array_item() and
this new helper is, where the difference comes from, why the symref
member (and no other member) is so special, etc.

I _think_ what is special is not the .symref but is the .value
field, IOW, you are trying to come up with an interface to free the
value part of ref_array_item without touching other things.  But it
is not helpful at all to readers if you do not explain why you want
to do so.  Why is the .value member so special?  The ability to
clear only the .value member without touching other members is useful
because ...?

In any case, assuming that you'd establish why the .value member is
so special to deserve an externally callable function, when external
callers do not have to be able to free the item as a whole (i.e.
free_array_item() is still file-scope static), in the proposed log
message in an updated patch, I would imagine that 

    free_ref_array_item_value()

would be a more suitable name than the _internal thing.  When it
happens, you might want to rename the static one to
free_ref_array_item() to match, even if it does not have external
callers.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Junio C Hamano <gitster@pobox.com> 于2021年6月16日周三 下午3:49写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > Introduce free_array_item_internal() for freeing ref_array_item value.
> > It will be called internally by free_array_item(), and it will help
> > `cat-file --batch` free ref_array_item's memory later.
>
> As a file local static function, the horrible name free_array_item()
> was tolerable.  But before exposing a name like that to the outside
> world, think twice if that is specific enough, and it is not.  There
> are 47 different kinds of "array"s we use in the system, but this
> new helper function only works with ref_array_item and not on items
> in any other kinds of arrays.
>

Well, this is indeed ignored by me.

So free_array_item() --> free_ref_array_item() and free_array_item()
--> free_ref_array_item().

> > -/*  Free memory allocated for a ref_array_item */
> > -static void free_array_item(struct ref_array_item *item)
> > +void free_array_item_internal(struct ref_array_item *item)
> >  {
>
> And "internal" is a horrible name to have as an external name.  You
> probably can come up with a more appropriate name when you imagine
> yourself explaining to somebody who is relatively new to this part
> of the codebase what the difference between free_array_item() and
> this new helper is, where the difference comes from, why the symref
> member (and no other member) is so special, etc.
>

Yeah, "internal" is an incorrect naming. The main purpose here is
to only free a ref_array_item's value.

> I _think_ what is special is not the .symref but is the .value
> field, IOW, you are trying to come up with an interface to free the
> value part of ref_array_item without touching other things.  But it
> is not helpful at all to readers if you do not explain why you want
> to do so.  Why is the .value member so special?  The ability to
> clear only the .value member without touching other members is useful
> because ...?
>

Because batch_object_write() use a ref_array_item which is allocated on the
stack, and it's member symref is not used at all. ref_array_item's symref and
itself do not need free. The original free_array_item() will free all
dynamically
allocated content in ref_array_item. It cannot meet our requirements
at this time:
only free the ref_array_item's value.

> In any case, assuming that you'd establish why the .value member is
> so special to deserve an externally callable function, when external
> callers do not have to be able to free the item as a whole (i.e.
> free_array_item() is still file-scope static), in the proposed log
> message in an updated patch, I would imagine that
>
>     free_ref_array_item_value()
>
> would be a more suitable name than the _internal thing.  When it
> happens, you might want to rename the static one to
> free_ref_array_item() to match, even if it does not have external
> callers.

Make sence.

--
ZheNing Hu

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 17, 2021

On the Git mailing list, ZheNing Hu wrote (reply to this):

Junio C Hamano <gitster@pobox.com> 于2021年6月16日周三 下午3:29写道:
>
> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This patch series make cat-file reuse ref-filter logic, which based on
> > 5a5b5f78 ([GSOC] ref-filter: add %(rest) atom)
>
> Hmph, does anybody have 5a5b5f78?
>
> The way to deal with this and avoid resending the same patches
> (assuming that this is not a 9-patch series, but only 5 of them are
> new) is to rebase your topic on 723bc66d (ref-filter: add %(rest)
> atom, 2021-06-09), which will allow you to discard the 4 earlier
> patches, and force push, with base set to 723bc66d, I think, but I
> am not a GGG user, so there may need an extra step or two on top of
> that.

Oh, 5a5b5f78 is in gitgitgadget.git and 723bc66d in git.git, their commit
messages are different. Later I will rebase my new patch to zh/xxx.

Thanks.
--
ZheNing Hu

@@ -1356,11 +1356,12 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Sat, Jun 12 2021, ZheNing Hu via GitGitGadget wrote:

> From: ZheNing Hu <adlternative@gmail.com>
>
> Only tag and commit objects use `grab_sub_body_contents()` to grab
> object contents in the current codebase.  We want to teach the
> function to also handle blobs and trees to get their raw data,
> without parsing a blob (whose contents looks like a commit or a tag)
> incorrectly as a commit or a tag.
>
> Skip the block of code that is specific to handling commits and tags
> early when the given object is of a wrong type to help later
> addition to handle other types of objects in this function.
>
> Mentored-by: Christian Couder <christian.couder@gmail.com>
> Mentored-by: Hariom Verma <hariom18599@gmail.com>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: ZheNing Hu <adlternative@gmail.com>
> ---
>  ref-filter.c | 24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/ref-filter.c b/ref-filter.c
> index 4db0e40ff4c6..5cee6512fbaf 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -1356,11 +1356,12 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
>  }
>  
>  /* See grab_values */
> -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
> +static void grab_sub_body_contents(struct atom_value *val, int deref, struct expand_data *data)
>  {
>  	int i;
>  	const char *subpos = NULL, *bodypos = NULL, *sigpos = NULL;
>  	size_t sublen = 0, bodylen = 0, nonsiglen = 0, siglen = 0;
> +	void *buf = data->content;
>  
>  	for (i = 0; i < used_atom_cnt; i++) {
>  		struct used_atom *atom = &used_atom[i];
> @@ -1371,10 +1372,13 @@ static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
>  			continue;
>  		if (deref)
>  			name++;
> -		if (strcmp(name, "body") &&
> -		    !starts_with(name, "subject") &&
> -		    !starts_with(name, "trailers") &&
> -		    !starts_with(name, "contents"))
> +
> +		if ((data->type != OBJ_TAG &&
> +		     data->type != OBJ_COMMIT) ||
> +		    (strcmp(name, "body") &&
> +		     !starts_with(name, "subject") &&
> +		     !starts_with(name, "trailers") &&
> +		     !starts_with(name, "contents")))

We have 4 "real" object types, commit, tree, blob, tag. Do you really
mean "not tag or commit" here, don't you mean "is tree or blob" instead?
I.e. do we really want to pass OBJ_NONE etc. here?

>  			continue;
>  		if (!subpos)
>  			find_subpos(buf,
> @@ -1438,17 +1442,19 @@ static void fill_missing_values(struct atom_value *val)
>   * pointed at by the ref itself; otherwise it is the object the
>   * ref (which is a tag) refers to.
>   */
> -static void grab_values(struct atom_value *val, int deref, struct object *obj, void *buf)
> +static void grab_values(struct atom_value *val, int deref, struct object *obj, struct expand_data *data)
>  {
> +	void *buf = data->content;
> +
>  	switch (obj->type) {
>  	case OBJ_TAG:
>  		grab_tag_values(val, deref, obj);
> -		grab_sub_body_contents(val, deref, buf);
> +		grab_sub_body_contents(val, deref, data);
>  		grab_person("tagger", val, deref, buf);
>  		break;
>  	case OBJ_COMMIT:
>  		grab_commit_values(val, deref, obj);
> -		grab_sub_body_contents(val, deref, buf);
> +		grab_sub_body_contents(val, deref, data);
>  		grab_person("author", val, deref, buf);
>  		grab_person("committer", val, deref, buf);
>  		break;
> @@ -1678,7 +1684,7 @@ static int get_object(struct ref_array_item *ref, int deref, struct object **obj
>  			return strbuf_addf_ret(err, -1, _("parse_object_buffer failed on %s for %s"),
>  					       oid_to_hex(&oi->oid), ref->refname);
>  		}
> -		grab_values(ref->value, deref, *obj, oi->content);
> +		grab_values(ref->value, deref, *obj, oi);
>  	}
>  
>  	grab_common_values(ref->value, deref, oi);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> -		if (strcmp(name, "body") &&
>> -		    !starts_with(name, "subject") &&
>> -		    !starts_with(name, "trailers") &&
>> -		    !starts_with(name, "contents"))
>> +
>> +		if ((data->type != OBJ_TAG &&
>> +		     data->type != OBJ_COMMIT) ||
>> +		    (strcmp(name, "body") &&
>> +		     !starts_with(name, "subject") &&
>> +		     !starts_with(name, "trailers") &&
>> +		     !starts_with(name, "contents")))
>
> We have 4 "real" object types, commit, tree, blob, tag. Do you really
> mean "not tag or commit" here, don't you mean "is tree or blob" instead?
> I.e. do we really want to pass OBJ_NONE etc. here?
>
>>  			continue;

If somebody throws OBJ_NONE at us by mistake, we do not want to
handle such an object and try to extract the subject member from it
anyway, no?

The intent of the code here, before the patch, is that "what we do
after the control flow passes this point is about the body, subject,
trailers, and contents request, so everybody else should go to the
next iteration".  The caller used to give us an object compatible
with these four types of requests, now the caller may throw others,
hence "by the way, we know these four kinds of requests make sense
only for tags and commits, so everybody else should go to the next
iteration, too" would be a natural thing to add.  So in that sense,
I prefer it over "we know these four types of requests do not make
sense for blobs and trees".

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 17, 2021

User Ævar Arnfjörð Bjarmason <avarab@gmail.com> has been added to the cc: list.

@@ -235,6 +235,15 @@ and `date` to extract the named component. For email fields (`authoremail`,
without angle brackets, and `:localpart` to get the part before the `@` symbol
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Sat, Jun 12 2021, ZheNing Hu via GitGitGadget wrote:

> From: ZheNing Hu <adlternative@gmail.com>
>
> Add new formatting option `%(raw)`, which will print the raw
> object data without any changes. It will help further to migrate
> all cat-file formatting logic from cat-file to ref-filter.

Nice goal and feature to have.

> The raw data of blob, tree objects may contain '\0', but most of
> the logic in `ref-filter` depends on the output of the atom being
> text (specifically, no embedded NULs in it).
>
> E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
> add the data to the buffer. The raw data of a tree object is
> `100644 one\0...`, only the `100644 one` will be added to the buffer,
> which is incorrect.
>
> Therefore, add a new member in `struct atom_value`: `s_size`, which
> can record raw object size, it can help us add raw object data to
> the buffer or compare two buffers which contain raw object data.

Most of the functions that deal with this already use a strbuf in some
way, before we had a const char *, now there's a size_t to go along with
it, why not simply use a strbuf in the struct for the data? You'll then
get the size and \0 handling for free, and any functions to deal with
conversion can stick to the strbuf API, there seems to be a lot of back
and forth now.

> Beyond, `--format=%(raw)` cannot be used with `--python`, `--shell`,
> `--tcl`, `--perl` because if our binary raw data is passed to a variable
> in the host language, the host language may not support arbitrary binary
> data in the variables of its string type.

Perl at least deals with that just fine, and to the extent that it
doesn't any new problems here would have nothing to do with \0 being in
the data. Perl doesn't have a notion of "binary has \0 in it", it always
supports \0, it has a notion of "is it utf-8 or not?", so any encoding
problems wouldn't be new. I'd think that the same would be true of
Python, but I'm not sure.


> +test_expect_success 'basic atom: refs/tags/testtag *raw' '
> +	git cat-file commit refs/tags/testtag^{} >expected &&
> +	git for-each-ref --format="%(*raw)" refs/tags/testtag >actual &&
> +	sanitize_pgp <expected >expected.clean &&
> +	sanitize_pgp <actual >actual.clean &&
> +	echo "" >>expected.clean &&

Just "echo" will do, ditto for the rest. Also odd to go back and forth
between populating expected.clean & actual.clean.


> +test_expect_success 'set up refs pointing to binary blob' '
> +	printf "a\0b\0c" >blob1 &&
> +	printf "a\0c\0b" >blob2 &&
> +	printf "\0a\0b\0c" >blob3 &&
> +	printf "abc" >blob4 &&
> +	printf "\0 \0 \0 " >blob5 &&
> +	printf "\0 \0a\0 " >blob6 &&
> +	printf "  " >blob7 &&
> +	>blob8 &&
> +	git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
> +	git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
> +	git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
> +	git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
> +	git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
> +	git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6 &&
> +	git hash-object blob7 -w | xargs git update-ref refs/myblobs/blob7 &&
> +	git hash-object blob8 -w | xargs git update-ref refs/myblobs/blob8

Hrm, xargs just to avoid:

    git update-ref ... $(git hash-object) ?

> +test_expect_success '%(raw) with --python must failed' '
> +	test_must_fail git for-each-ref --format="%(raw)" --python
> +'
> +
> +test_expect_success '%(raw) with --tcl must failed' '
> +	test_must_fail git for-each-ref --format="%(raw)" --tcl
> +'
> +
> +test_expect_success '%(raw) with --perl must failed' '
> +	test_must_fail git for-each-ref --format="%(raw)" --perl
> +'
> +
> +test_expect_success '%(raw) with --shell must failed' '
> +	test_must_fail git for-each-ref --format="%(raw)" --shell
> +'
> +
> +test_expect_success '%(raw) with --shell and --sort=raw must failed' '
> +	test_must_fail git for-each-ref --format="%(raw)" --sort=raw --shell
> +'

s/must failed/must fail/, but see question above about encoding in these
languages...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> Beyond, `--format=%(raw)` cannot be used with `--python`, `--shell`,
>> `--tcl`, `--perl` because if our binary raw data is passed to a variable
>> in the host language, the host language may not support arbitrary binary
>> data in the variables of its string type.
>
> Perl at least deals with that just fine, and to the extent that it
> doesn't any new problems here would have nothing to do with \0 being in
> the data. Perl doesn't have a notion of "binary has \0 in it", it always
> supports \0, it has a notion of "is it utf-8 or not?", so any encoding
> problems wouldn't be new. I'd think that the same would be true of
> Python, but I'm not sure.

During an earlier iteration long time ago, as we knew Perl is
capable of handling sequence of bytes including NUL, it was decided
to start more strict by rejecting binary for all languages, which
can later be loosened, to limit the scope of the initial
implementation.

>> +	git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
>> +	git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
>> +	git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
>> +	git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
>> +	git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
>> +	git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6 &&
>> +	git hash-object blob7 -w | xargs git update-ref refs/myblobs/blob7 &&
>> +	git hash-object blob8 -w | xargs git update-ref refs/myblobs/blob8
>
> Hrm, xargs just to avoid:
>
>     git update-ref ... $(git hash-object) ?

That's horrible.  Thanks for noticing.

We'd want to catch segfaults from both hash-object and update-ref.
One way to do so may be

	O=$(git hash-object -w blob1) &&
	git update-ref refs/myblobs/blob1 "$O"

Thanks for a review.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> 于2021年6月17日周四 下午3:16写道:
>
> > The raw data of blob, tree objects may contain '\0', but most of
> > the logic in `ref-filter` depends on the output of the atom being
> > text (specifically, no embedded NULs in it).
> >
> > E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
> > add the data to the buffer. The raw data of a tree object is
> > `100644 one\0...`, only the `100644 one` will be added to the buffer,
> > which is incorrect.
> >
> > Therefore, add a new member in `struct atom_value`: `s_size`, which
> > can record raw object size, it can help us add raw object data to
> > the buffer or compare two buffers which contain raw object data.
>
> Most of the functions that deal with this already use a strbuf in some
> way, before we had a const char *, now there's a size_t to go along with
> it, why not simply use a strbuf in the struct for the data? You'll then
> get the size and \0 handling for free, and any functions to deal with
> conversion can stick to the strbuf API, there seems to be a lot of back
> and forth now.
>

Yes, strbuf is a suitable choice when using <str,len> pair.
But if replace v->s with strbuf, the possible changes will be larger.

> > Beyond, `--format=%(raw)` cannot be used with `--python`, `--shell`,
> > `--tcl`, `--perl` because if our binary raw data is passed to a variable
> > in the host language, the host language may not support arbitrary binary
> > data in the variables of its string type.
>
> Perl at least deals with that just fine, and to the extent that it
> doesn't any new problems here would have nothing to do with \0 being in
> the data. Perl doesn't have a notion of "binary has \0 in it", it always
> supports \0, it has a notion of "is it utf-8 or not?", so any encoding
> problems wouldn't be new. I'd think that the same would be true of
> Python, but I'm not sure.
>

Not python safe. See [1].
Regarding the perl language, I support Junio's point of view: it can be
re-supported in the future.

>
> > +test_expect_success 'basic atom: refs/tags/testtag *raw' '
> > +     git cat-file commit refs/tags/testtag^{} >expected &&
> > +     git for-each-ref --format="%(*raw)" refs/tags/testtag >actual &&
> > +     sanitize_pgp <expected >expected.clean &&
> > +     sanitize_pgp <actual >actual.clean &&
> > +     echo "" >>expected.clean &&
>
> Just "echo" will do, ditto for the rest. Also odd to go back and forth
> between populating expected.clean & actual.clean.
>

Are you saying that sanitize_pgp is not needed?

>
> > +test_expect_success 'set up refs pointing to binary blob' '
> > +     printf "a\0b\0c" >blob1 &&
> > +     printf "a\0c\0b" >blob2 &&
> > +     printf "\0a\0b\0c" >blob3 &&
> > +     printf "abc" >blob4 &&
> > +     printf "\0 \0 \0 " >blob5 &&
> > +     printf "\0 \0a\0 " >blob6 &&
> > +     printf "  " >blob7 &&
> > +     >blob8 &&
> > +     git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
> > +     git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
> > +     git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
> > +     git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
> > +     git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
> > +     git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6 &&
> > +     git hash-object blob7 -w | xargs git update-ref refs/myblobs/blob7 &&
> > +     git hash-object blob8 -w | xargs git update-ref refs/myblobs/blob8
>
> Hrm, xargs just to avoid:
>
>     git update-ref ... $(git hash-object) ?
>

I didn’t think about it, just for convenience.

> > +test_expect_success '%(raw) with --python must failed' '
> > +     test_must_fail git for-each-ref --format="%(raw)" --python
> > +'
> > +
> > +test_expect_success '%(raw) with --tcl must failed' '
> > +     test_must_fail git for-each-ref --format="%(raw)" --tcl
> > +'
> > +
> > +test_expect_success '%(raw) with --perl must failed' '
> > +     test_must_fail git for-each-ref --format="%(raw)" --perl
> > +'
> > +
> > +test_expect_success '%(raw) with --shell must failed' '
> > +     test_must_fail git for-each-ref --format="%(raw)" --shell
> > +'
> > +
> > +test_expect_success '%(raw) with --shell and --sort=raw must failed' '
> > +     test_must_fail git for-each-ref --format="%(raw)" --sort=raw --shell
> > +'
>
> s/must failed/must fail/, but see question above about encoding in these
> languages...


[1]: https://lore.kernel.org/git/CAOLTT8QR_GRm4TYk0E_eazQ+unVQODc-3L+b4V5JUN5jtZR8uA@mail.gmail.com/

Thanks for a review.
--
ZheNing Hu

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Thu, Jun 17 2021, ZheNing Hu wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> 于2021年6月17日周四 下午3:16写道:
>>
>> > The raw data of blob, tree objects may contain '\0', but most of
>> > the logic in `ref-filter` depends on the output of the atom being
>> > text (specifically, no embedded NULs in it).
>> >
>> > E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
>> > add the data to the buffer. The raw data of a tree object is
>> > `100644 one\0...`, only the `100644 one` will be added to the buffer,
>> > which is incorrect.
>> >
>> > Therefore, add a new member in `struct atom_value`: `s_size`, which
>> > can record raw object size, it can help us add raw object data to
>> > the buffer or compare two buffers which contain raw object data.
>>
>> Most of the functions that deal with this already use a strbuf in some
>> way, before we had a const char *, now there's a size_t to go along with
>> it, why not simply use a strbuf in the struct for the data? You'll then
>> get the size and \0 handling for free, and any functions to deal with
>> conversion can stick to the strbuf API, there seems to be a lot of back
>> and forth now.
>>
>
> Yes, strbuf is a suitable choice when using <str,len> pair.
> But if replace v->s with strbuf, the possible changes will be larger.

I for one would like to see it done that way, those changes are usually
easy to read. Also it seems a large part of 2/8 is extra new code
because we didn't do that, e.g. getting length differently if something
is a strbuf or not, passing char*/size_t pairs to new functions etc.

>> > Beyond, `--format=%(raw)` cannot be used with `--python`, `--shell`,
>> > `--tcl`, `--perl` because if our binary raw data is passed to a variable
>> > in the host language, the host language may not support arbitrary binary
>> > data in the variables of its string type.
>>
>> Perl at least deals with that just fine, and to the extent that it
>> doesn't any new problems here would have nothing to do with \0 being in
>> the data. Perl doesn't have a notion of "binary has \0 in it", it always
>> supports \0, it has a notion of "is it utf-8 or not?", so any encoding
>> problems wouldn't be new. I'd think that the same would be true of
>> Python, but I'm not sure.
>>
>
> Not python safe. See [1].
> Regarding the perl language, I support Junio's point of view: it can be
> re-supported in the future.

Ah, I'd missed that. Anyway, if it's easy it seems you discovered that
Perl deals with it correctly, so we could just have it support this.

>>
>> > +test_expect_success 'basic atom: refs/tags/testtag *raw' '
>> > +     git cat-file commit refs/tags/testtag^{} >expected &&
>> > +     git for-each-ref --format="%(*raw)" refs/tags/testtag >actual &&
>> > +     sanitize_pgp <expected >expected.clean &&
>> > +     sanitize_pgp <actual >actual.clean &&
>> > +     echo "" >>expected.clean &&
>>
>> Just "echo" will do, ditto for the rest. Also odd to go back and forth
>> between populating expected.clean & actual.clean.
>>
>
> Are you saying that sanitize_pgp is not needed?

No that instead of:

    echo "" >x

You can do:

    echo >x

And also that going back and forth between populating different files is
confusing, i.e. this:


    echo a >x
    echo c >y
    echo b >>x

is better as:

    echo a >x
    echo b >>x
    echo c >y


>>
>> > +test_expect_success 'set up refs pointing to binary blob' '
>> > +     printf "a\0b\0c" >blob1 &&
>> > +     printf "a\0c\0b" >blob2 &&
>> > +     printf "\0a\0b\0c" >blob3 &&
>> > +     printf "abc" >blob4 &&
>> > +     printf "\0 \0 \0 " >blob5 &&
>> > +     printf "\0 \0a\0 " >blob6 &&
>> > +     printf "  " >blob7 &&
>> > +     >blob8 &&
>> > +     git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
>> > +     git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
>> > +     git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
>> > +     git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
>> > +     git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
>> > +     git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6 &&
>> > +     git hash-object blob7 -w | xargs git update-ref refs/myblobs/blob7 &&
>> > +     git hash-object blob8 -w | xargs git update-ref refs/myblobs/blob8
>>
>> Hrm, xargs just to avoid:
>>
>>     git update-ref ... $(git hash-object) ?
>>
>
> I didn’t think about it, just for convenience.

*nod*, Junio had a good suggestion.

>> > +test_expect_success '%(raw) with --python must failed' '
>> > +     test_must_fail git for-each-ref --format="%(raw)" --python
>> > +'
>> > +
>> > +test_expect_success '%(raw) with --tcl must failed' '
>> > +     test_must_fail git for-each-ref --format="%(raw)" --tcl
>> > +'
>> > +
>> > +test_expect_success '%(raw) with --perl must failed' '
>> > +     test_must_fail git for-each-ref --format="%(raw)" --perl
>> > +'
>> > +
>> > +test_expect_success '%(raw) with --shell must failed' '
>> > +     test_must_fail git for-each-ref --format="%(raw)" --shell
>> > +'
>> > +
>> > +test_expect_success '%(raw) with --shell and --sort=raw must failed' '
>> > +     test_must_fail git for-each-ref --format="%(raw)" --sort=raw --shell
>> > +'
>>
>> s/must failed/must fail/, but see question above about encoding in these
>> languages...
>
>
> [1]: https://lore.kernel.org/git/CAOLTT8QR_GRm4TYk0E_eazQ+unVQODc-3L+b4V5JUN5jtZR8uA@mail.gmail.com/
>
> Thanks for a review.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> 于2021年6月17日周四 下午10:45写道:
>
>
> On Thu, Jun 17 2021, ZheNing Hu wrote:
>
> >
> > Yes, strbuf is a suitable choice when using <str,len> pair.
> > But if replace v->s with strbuf, the possible changes will be larger.
>
> I for one would like to see it done that way, those changes are usually
> easy to read. Also it seems a large part of 2/8 is extra new code
> because we didn't do that, e.g. getting length differently if something
> is a strbuf or not, passing char*/size_t pairs to new functions etc.
>

After some refactoring, I found that there are two problems:
1. There are a lot of codes like this in ref-filter to fill v->s:

v->s = show_ref(...)
v->s = copy_email(...)

It is very difficult to modify here: We know that show_ref()
or copy_email() will allocate a block of memory to v->s, but
if v->s is a strbuf, what should we do? In copy_email(), we
can pass the v->s to copy_email() and use strbuf_add()/strbuf_addstr()
instead of xstrdup() and xmemdupz(). But show_ref() will call
external functions like shorten_unambiguous_ref(), we don’t know
whether it will return us NULL or a dynamically allocated memory.
If continue to pass v->s to the inner function, it is not a feasible
method. Or we can use strbuf_attach() + strlen(), I'm not sure
this is a good method.

2. See:

-       for (i = 0; i < used_atom_cnt; i++) {
+       for (i = 0; i < used_atom_cnt; i++) {
                struct atom_value *v = &ref->value[i];
-               if (v->s == NULL && used_atom[i].source == SOURCE_NONE)
+               if (v->s.len == 0 && used_atom[i].source == SOURCE_NONE)
                        return strbuf_addf_ret(err, -1, _("missing
object %s for %s"),

oid_to_hex(&ref->objectname), ref->refname);
        }

In the case of using strbuf, I don’t know how to distinguish between an empty
strbuf and NULL. It can be easily distinguished by using c-style "const char*".

> >
> > Not python safe. See [1].
> > Regarding the perl language, I support Junio's point of view: it can be
> > re-supported in the future.
>
> Ah, I'd missed that. Anyway, if it's easy it seems you discovered that
> Perl deals with it correctly, so we could just have it support this.
>

Well, it's ok, support for perl will be put in a separate commit.

> >>
> >> > +test_expect_success 'basic atom: refs/tags/testtag *raw' '
> >> > +     git cat-file commit refs/tags/testtag^{} >expected &&
> >> > +     git for-each-ref --format="%(*raw)" refs/tags/testtag >actual &&
> >> > +     sanitize_pgp <expected >expected.clean &&
> >> > +     sanitize_pgp <actual >actual.clean &&
> >> > +     echo "" >>expected.clean &&
> >>
> >> Just "echo" will do, ditto for the rest. Also odd to go back and forth
> >> between populating expected.clean & actual.clean.
> >>
> >
> > Are you saying that sanitize_pgp is not needed?
>
> No that instead of:
>
>     echo "" >x
>
> You can do:
>
>     echo >x
>
> And also that going back and forth between populating different files is
> confusing, i.e. this:
>
>
>     echo a >x
>     echo c >y
>     echo b >>x
>
> is better as:
>
>     echo a >x
>     echo b >>x
>     echo c >y
>
>

Thanks, I get what you meant now.

> >>
> >> > +test_expect_success 'set up refs pointing to binary blob' '
> >> > +     printf "a\0b\0c" >blob1 &&
> >> > +     printf "a\0c\0b" >blob2 &&
> >> > +     printf "\0a\0b\0c" >blob3 &&
> >> > +     printf "abc" >blob4 &&
> >> > +     printf "\0 \0 \0 " >blob5 &&
> >> > +     printf "\0 \0a\0 " >blob6 &&
> >> > +     printf "  " >blob7 &&
> >> > +     >blob8 &&
> >> > +     git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
> >> > +     git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
> >> > +     git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
> >> > +     git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
> >> > +     git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
> >> > +     git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6 &&
> >> > +     git hash-object blob7 -w | xargs git update-ref refs/myblobs/blob7 &&
> >> > +     git hash-object blob8 -w | xargs git update-ref refs/myblobs/blob8
> >>
> >> Hrm, xargs just to avoid:
> >>
> >>     git update-ref ... $(git hash-object) ?
> >>
> >
> > I didn’t think about it, just for convenience.
>
> *nod*, Junio had a good suggestion.
>

ok.

Thanks.
--
ZheNing Hu

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Fri, Jun 18 2021, ZheNing Hu wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> 于2021年6月17日周四 下午10:45写道:
>>
>>
>> On Thu, Jun 17 2021, ZheNing Hu wrote:
>>
>> >
>> > Yes, strbuf is a suitable choice when using <str,len> pair.
>> > But if replace v->s with strbuf, the possible changes will be larger.
>>
>> I for one would like to see it done that way, those changes are usually
>> easy to read. Also it seems a large part of 2/8 is extra new code
>> because we didn't do that, e.g. getting length differently if something
>> is a strbuf or not, passing char*/size_t pairs to new functions etc.
>>
>
> After some refactoring, I found that there are two problems:
> 1. There are a lot of codes like this in ref-filter to fill v->s:
>
> v->s = show_ref(...)
> v->s = copy_email(...)
>
> It is very difficult to modify here: We know that show_ref()
> or copy_email() will allocate a block of memory to v->s, but
> if v->s is a strbuf, what should we do? In copy_email(), we
> can pass the v->s to copy_email() and use strbuf_add()/strbuf_addstr()
> instead of xstrdup() and xmemdupz(). But show_ref() will call
> external functions like shorten_unambiguous_ref(), we don’t know
> whether it will return us NULL or a dynamically allocated memory.
> If continue to pass v->s to the inner function, it is not a feasible
> method. Or we can use strbuf_attach() + strlen(), I'm not sure
> this is a good method.
>
> 2. See:
>
> -       for (i = 0; i < used_atom_cnt; i++) {
> +       for (i = 0; i < used_atom_cnt; i++) {
>                 struct atom_value *v = &ref->value[i];
> -               if (v->s == NULL && used_atom[i].source == SOURCE_NONE)
> +               if (v->s.len == 0 && used_atom[i].source == SOURCE_NONE)
>                         return strbuf_addf_ret(err, -1, _("missing
> object %s for %s"),
>
> oid_to_hex(&ref->objectname), ref->refname);
>         }
>
> In the case of using strbuf, I don’t know how to distinguish between an empty
> strbuf and NULL. It can be easily distinguished by using c-style "const char*".

Yes, sometimes it's just too much of a hassle, looking at
shorten_unambiguous_ref() which returns a xstrdup()'d value that could
indeed be strbuf_attach'd. I haven't tried the conversion myself,
perhaps it's too much hassle.

Just a suggestion from reading your patch in isolation.


>> >
>> > Not python safe. See [1].
>> > Regarding the perl language, I support Junio's point of view: it can be
>> > re-supported in the future.
>>
>> Ah, I'd missed that. Anyway, if it's easy it seems you discovered that
>> Perl deals with it correctly, so we could just have it support this.
>>
>
> Well, it's ok, support for perl will be put in a separate commit.
>
>> >>
>> >> > +test_expect_success 'basic atom: refs/tags/testtag *raw' '
>> >> > +     git cat-file commit refs/tags/testtag^{} >expected &&
>> >> > +     git for-each-ref --format="%(*raw)" refs/tags/testtag >actual &&
>> >> > +     sanitize_pgp <expected >expected.clean &&
>> >> > +     sanitize_pgp <actual >actual.clean &&
>> >> > +     echo "" >>expected.clean &&
>> >>
>> >> Just "echo" will do, ditto for the rest. Also odd to go back and forth
>> >> between populating expected.clean & actual.clean.
>> >>
>> >
>> > Are you saying that sanitize_pgp is not needed?
>>
>> No that instead of:
>>
>>     echo "" >x
>>
>> You can do:
>>
>>     echo >x
>>
>> And also that going back and forth between populating different files is
>> confusing, i.e. this:
>>
>>
>>     echo a >x
>>     echo c >y
>>     echo b >>x
>>
>> is better as:
>>
>>     echo a >x
>>     echo b >>x
>>     echo c >y
>>
>>
>
> Thanks, I get what you meant now.
>
>> >>
>> >> > +test_expect_success 'set up refs pointing to binary blob' '
>> >> > +     printf "a\0b\0c" >blob1 &&
>> >> > +     printf "a\0c\0b" >blob2 &&
>> >> > +     printf "\0a\0b\0c" >blob3 &&
>> >> > +     printf "abc" >blob4 &&
>> >> > +     printf "\0 \0 \0 " >blob5 &&
>> >> > +     printf "\0 \0a\0 " >blob6 &&
>> >> > +     printf "  " >blob7 &&
>> >> > +     >blob8 &&
>> >> > +     git hash-object blob1 -w | xargs git update-ref refs/myblobs/blob1 &&
>> >> > +     git hash-object blob2 -w | xargs git update-ref refs/myblobs/blob2 &&
>> >> > +     git hash-object blob3 -w | xargs git update-ref refs/myblobs/blob3 &&
>> >> > +     git hash-object blob4 -w | xargs git update-ref refs/myblobs/blob4 &&
>> >> > +     git hash-object blob5 -w | xargs git update-ref refs/myblobs/blob5 &&
>> >> > +     git hash-object blob6 -w | xargs git update-ref refs/myblobs/blob6 &&
>> >> > +     git hash-object blob7 -w | xargs git update-ref refs/myblobs/blob7 &&
>> >> > +     git hash-object blob8 -w | xargs git update-ref refs/myblobs/blob8
>> >>
>> >> Hrm, xargs just to avoid:
>> >>
>> >>     git update-ref ... $(git hash-object) ?
>> >>
>> >
>> > I didn’t think about it, just for convenience.
>>
>> *nod*, Junio had a good suggestion.
>>
>
> ok.
>
> Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Christian Couder wrote (reply to this):

On Fri, Jun 18, 2021 at 12:51 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:

> On Fri, Jun 18 2021, ZheNing Hu wrote:

> > After some refactoring, I found that there are two problems:
> > 1. There are a lot of codes like this in ref-filter to fill v->s:
> >
> > v->s = show_ref(...)
> > v->s = copy_email(...)
> >
> > It is very difficult to modify here: We know that show_ref()
> > or copy_email() will allocate a block of memory to v->s, but
> > if v->s is a strbuf, what should we do? In copy_email(), we
> > can pass the v->s to copy_email() and use strbuf_add()/strbuf_addstr()
> > instead of xstrdup() and xmemdupz(). But show_ref() will call
> > external functions like shorten_unambiguous_ref(), we don’t know
> > whether it will return us NULL or a dynamically allocated memory.
> > If continue to pass v->s to the inner function, it is not a feasible
> > method. Or we can use strbuf_attach() + strlen(), I'm not sure
> > this is a good method.

If you resend this patch, it might be a good idea to add a short
version of the above explanations into the commit message.

[...]

> > In the case of using strbuf, I don’t know how to distinguish between an empty
> > strbuf and NULL. It can be easily distinguished by using c-style "const char*".

Maybe this could also be part of the explanation.

> Yes, sometimes it's just too much of a hassle, looking at
> shorten_unambiguous_ref() which returns a xstrdup()'d value that could
> indeed be strbuf_attach'd. I haven't tried the conversion myself,
> perhaps it's too much hassle.
>
> Just a suggestion from reading your patch in isolation.

Yeah, thanks for the review anyway!

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 17, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Jun 15 2021, ZheNing Hu via GitGitGadget wrote:

> This patch series make cat-file reuse ref-filter logic, which based on
> 5a5b5f78 ([GSOC] ref-filter: add %(rest) atom)
>
> Change from last version:
>
>  1. Use free_array_item_internal() to solve the memory leak problem.
>  2. Change commit message of ([GSOC] ref-filter: teach get_object() return
>     useful value).

I left some comments, but saw after the fact that I'd replied to the v1
E-Mails by accident, but anyway, the comments were all on things that
are also in v2, so it worked out in the end. Sorry about the confusion.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 17, 2021

On the Git mailing list, ZheNing Hu wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> 于2021年6月17日周四 下午3:27写道:
>
>
> On Tue, Jun 15 2021, ZheNing Hu via GitGitGadget wrote:
>
> > This patch series make cat-file reuse ref-filter logic, which based on
> > 5a5b5f78 ([GSOC] ref-filter: add %(rest) atom)
> >
> > Change from last version:
> >
> >  1. Use free_array_item_internal() to solve the memory leak problem.
> >  2. Change commit message of ([GSOC] ref-filter: teach get_object() return
> >     useful value).
>
> I left some comments, but saw after the fact that I'd replied to the v1
> E-Mails by accident, but anyway, the comments were all on things that
> are also in v2, so it worked out in the end. Sorry about the confusion.

It's okay. Your comments are very useful.

Thanks.
--
ZheNing Hu

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2021

User Christian Couder <christian.couder@gmail.com> has been added to the cc: list.

@adlternative adlternative changed the base branch from zh/ref-filter-atom-type to zh/ref-filter-raw-data June 19, 2021 05:55
@adlternative adlternative changed the base branch from zh/ref-filter-raw-data to zh/ref-filter-atom-type June 19, 2021 05:56
@adlternative adlternative changed the base branch from zh/ref-filter-atom-type to zh/ref-filter-raw-data June 19, 2021 06:12
@adlternative adlternative changed the base branch from zh/ref-filter-raw-data to zh/ref-filter-atom-type June 19, 2021 06:13
@adlternative
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 19, 2021

Submitted as pull.980.v3.git.1624086181.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-980/adlternative/cat-file-batch-refactor-v3

To fetch this version to local tag pr-980/adlternative/cat-file-batch-refactor-v3:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-980/adlternative/cat-file-batch-refactor-v3

@@ -226,6 +226,12 @@ newline. The available atoms are:
after that first run of whitespace (i.e., the "rest" of the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Christian Couder wrote (reply to this):

On Sat, Jun 19, 2021 at 9:03 AM ZheNing Hu via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: ZheNing Hu <adlternative@gmail.com>
>
> In order to let cat-file use ref-filter logic, the following
> methods are used:

Maybe: s/the following methods are used/let's do the following/

> 1. Add `cat_file_mode` member in struct `ref_format`, this can
> help us reject atoms in verify_ref_format() which cat-file
> cannot use, e.g. `%(refname)`, `%(push)`, `%(upstream)`...
> 2. Change the type of member `format` in struct `batch_options`
> to `ref_format`, We can add format data in it.

Not sure what "We can add format data in it." means.

> 3. Let `batch_objects()` add atoms to format, and use
> `verify_ref_format()` to check atoms.
> 4. Use `has_object_file()` in `batch_one_object()` to check
> whether the input object exists.
> 5. Let get_object() return 1 and print "<oid> missing" instead
> of returning -1 and printing "missing object <oid> for <refname>",
> this can help `format_ref_array_item()` just report that the
> object is missing without letting Git exit.
> 6. Use `format_ref_array_item()` in `batch_object_write()` to
> get the formatted data corresponding to the object. If the
> return value of `format_ref_array_item()` is equals to zero,
> use `batch_write()` to print object data; else if the return
> value less than zero, use `die()` to print the error message
> and exit; else return value greater than zero, only print the

s/else return value greater/else if the return value is greater/

> error message, but not exit.

s/not exit/don't exit/

> 7. Use free_ref_array_item_value() to free ref_array_item's
> value.

That looks like a lot of changes in a single commit. I wonder if this
commit could be split.

> Most of the atoms in `for-each-ref --format` are now supported,
> such as `%(tree)`, `%(parent)`, `%(author)`, `%(tagger)`, `%(if)`,
> `%(then)`, `%(else)`, `%(end)`. But these atoms will be rejected:
> `%(refname)`, `%(symref)`, `%(upstream)`, `%(push)`, `%(worktreepath)`,
> `%(flag)`, `%(HEAD)`, because our objects don't have refname.

s/refname/a refname/

It might be worth talking a bit about possible performance changes.

[...]

> +       ret = format_ref_array_item(&item, &opt->format, scratch, &err);
> +       if (!ret) {
> +               strbuf_addch(scratch, '\n');
> +               batch_write(opt, scratch->buf, scratch->len);
> +       } else if (ret < 0) {
> +               die("%s\n", err.buf);

This if (ret < 0) could be checked first.

> +       } else {
> +               /* when ret > 0 , don't call die and print the err to stdout*/

I think it would be more helpful to tell what ret > 0 means, rather
than what we do below (which can easily be seen).

> +               printf("%s\n", err.buf);
> +               fflush(stdout);
>         }

For example:

       if (ret < 0) {
               die("%s\n", err.buf);
       if (ret) {
               /* ret > 0 means ... */
               printf("%s\n", err.buf);
               fflush(stdout);
       } else {
               strbuf_addch(scratch, '\n');
               batch_write(opt, scratch->buf, scratch->len);
       }

> +       free_ref_array_item_value(&item);
> +       strbuf_release(&err);
>  }

[...]

> +static int batch_objects(struct batch_options *opt, const struct option *options)

It's unfortunate that one argument is called "opt" and the other one
"options". I wonder if the first one could be called "batch" as it
seems to be called this way somewhere else.

> +       if (!opt->format.format)
> +               strbuf_addstr(&format, "%(objectname) %(objecttype) %(objectsize)");
> +       else
> +               strbuf_addstr(&format, opt->format.format);

If there is no reason for the condition to be (!X) over just (X), I
think the latter is a bit better.

>         if (opt->print_contents)
> -               data.info.typep = &data.type;
> +               strbuf_addstr(&format, "\n%(raw)");
> +       opt->format.format = format.buf;

I wonder if this should be:

      opt->format.format = strbuf_detach(&format, NULL);

> +       if (verify_ref_format(&opt->format))
> +               usage_with_options(cat_file_usage, options);

[...]

> @@ -86,7 +87,7 @@ struct ref_format {
>         int need_color_reset_at_eol;
>  };
>
> -#define REF_FORMAT_INIT { NULL, NULL, 0, 0, -1 }
> +#define REF_FORMAT_INIT { NULL, NULL, 0, 0, 0, -1 }

Maybe this can already be changed to a designated initializer, like
Ævar suggested recently.

[...]

> +test_expect_success 'cat-file refs/heads/main refs/tags/testtag %(rest)' '

If this test is about checking that %(rest) works with both a branch
and a tag, it might be better to say it more explicitly.

> +       cat >expected <<-EOF &&
> +       123 commit 123
> +       456 tag 456
> +       EOF
> +       git cat-file --batch-check="%(rest) %(objecttype) %(rest)" >actual <<-EOF &&
> +       refs/heads/main 123
> +       refs/tags/testtag 456
> +       EOF
> +       test_cmp expected actual
> +'

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Christian Couder <christian.couder@gmail.com> 于2021年6月21日周一 下午1:55写道:
>
> > 1. Add `cat_file_mode` member in struct `ref_format`, this can
> > help us reject atoms in verify_ref_format() which cat-file
> > cannot use, e.g. `%(refname)`, `%(push)`, `%(upstream)`...
> > 2. Change the type of member `format` in struct `batch_options`
> > to `ref_format`, We can add format data in it.
>
> Not sure what "We can add format data in it." means.

Well, there is something wrong with the expression here. What I want
to express is that we can fill its member "format" with the atoms like
"%(objectname) %(refname)", and then pass it to the ref-filter.

> > 7. Use free_ref_array_item_value() to free ref_array_item's
> > value.
>
> That looks like a lot of changes in a single commit. I wonder if this
> commit could be split.
>

Yeah, But I don’t know if I should take it apart step by step, If taken apart,
those intermediate commits are likely to fail the test.

> > Most of the atoms in `for-each-ref --format` are now supported,
> > such as `%(tree)`, `%(parent)`, `%(author)`, `%(tagger)`, `%(if)`,
> > `%(then)`, `%(else)`, `%(end)`. But these atoms will be rejected:
> > `%(refname)`, `%(symref)`, `%(upstream)`, `%(push)`, `%(worktreepath)`,
> > `%(flag)`, `%(HEAD)`, because our objects don't have refname.
>
> s/refname/a refname/
>
> It might be worth talking a bit about possible performance changes.
>

Makes sense. The performance has indeed deteriorated.

> [...]
>
> > +       ret = format_ref_array_item(&item, &opt->format, scratch, &err);
> > +       if (!ret) {
> > +               strbuf_addch(scratch, '\n');
> > +               batch_write(opt, scratch->buf, scratch->len);
> > +       } else if (ret < 0) {
> > +               die("%s\n", err.buf);
>
> This if (ret < 0) could be checked first.

Yes, it is better to put error checking first.

>
> > +       } else {
> > +               /* when ret > 0 , don't call die and print the err to stdout*/
>
> I think it would be more helpful to tell what ret > 0 means, rather
> than what we do below (which can easily be seen).
>

Ah, There is indeed only one situation for ret > 0 for the time being:
Show "<oid> missing" without exiting Git.

> > +               printf("%s\n", err.buf);
> > +               fflush(stdout);
> >         }
>
> For example:
>
>        if (ret < 0) {
>                die("%s\n", err.buf);
>        if (ret) {
>                /* ret > 0 means ... */
>                printf("%s\n", err.buf);
>                fflush(stdout);
>        } else {
>                strbuf_addch(scratch, '\n');
>                batch_write(opt, scratch->buf, scratch->len);
>        }
>

Yes, this might be better.

> > +       free_ref_array_item_value(&item);
> > +       strbuf_release(&err);
> >  }
>
> [...]
>
> > +static int batch_objects(struct batch_options *opt, const struct option *options)
>
> It's unfortunate that one argument is called "opt" and the other one
> "options". I wonder if the first one could be called "batch" as it
> seems to be called this way somewhere else.
>

OK.

> > +       if (!opt->format.format)
> > +               strbuf_addstr(&format, "%(objectname) %(objecttype) %(objectsize)");
> > +       else
> > +               strbuf_addstr(&format, opt->format.format);
>
> If there is no reason for the condition to be (!X) over just (X), I
> think the latter is a bit better.
>

Although I think it doesn’t matter which one to use first,
I will still follow your suggestions.

> >         if (opt->print_contents)
> > -               data.info.typep = &data.type;
> > +               strbuf_addstr(&format, "\n%(raw)");
> > +       opt->format.format = format.buf;
>
> I wonder if this should be:
>
>       opt->format.format = strbuf_detach(&format, NULL);
>

No. here our opt->format.format will not be changed, it would
be better for us to use `strbuf_release(&format)` for resource
recovery. (strbuf_detach() will forced to let us free opt->format.format)

> > +       if (verify_ref_format(&opt->format))
> > +               usage_with_options(cat_file_usage, options);
>
> [...]
>
> > @@ -86,7 +87,7 @@ struct ref_format {
> >         int need_color_reset_at_eol;
> >  };
> >
> > -#define REF_FORMAT_INIT { NULL, NULL, 0, 0, -1 }
> > +#define REF_FORMAT_INIT { NULL, NULL, 0, 0, 0, -1 }
>
> Maybe this can already be changed to a designated initializer, like
> Ævar suggested recently.
>

I agree.

> [...]
>
> > +test_expect_success 'cat-file refs/heads/main refs/tags/testtag %(rest)' '
>
> If this test is about checking that %(rest) works with both a branch
> and a tag, it might be better to say it more explicitly.
>

OK.

Thanks.
--
ZheNing Hu

@adlternative adlternative force-pushed the cat-file-batch-refactor branch 2 times, most recently from 0bb7727 to d1114a2 Compare June 22, 2021 02:41
@adlternative
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 22, 2021

Submitted as pull.980.v4.git.1624332054.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-980/adlternative/cat-file-batch-refactor-v4

To fetch this version to local tag pr-980/adlternative/cat-file-batch-refactor-v4:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-980/adlternative/cat-file-batch-refactor-v4

@@ -144,6 +144,7 @@ enum atom_type {
ATOM_BODY,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Bagas Sanjaya wrote (reply to this):

On 22/06/21 10.20, ZheNing Hu via GitGitGadget wrote:
> From: ZheNing Hu <adlternative@gmail.com>
> 
> Since in the refactor of `git cat-file --batch` later,
> oid_object_info_extended() in get_object() will be used to obtain
> the info of an object with it's oid. When the object cannot be
> obtained in the git repository, `cat-file --batch` expects to output
> "<oid> missing" and continue the next oid query instead of letting
> Git exit. In other error conditions, Git should exit normally. So we
> can achieve this function by passing the return value of get_object().
> 

s/Since/Because/

-- 
An old man doll... just what I always wanted! - Clara

@@ -428,6 +428,13 @@ static void batch_one_object(const char *obj_name,
return;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Bagas Sanjaya wrote (reply to this):

On 22/06/21 10.20, ZheNing Hu via GitGitGadget wrote:
> From: ZheNing Hu <adlternative@gmail.com>
> 
> Since cat-file reuses ref-filter logic later will add the
> formal parameter "const struct option *options" to
> batch_objects(), the two synonymous parameters of "opt"
> and "options" may confuse readers, so change batch_options
> parameter of batch_objects() from "opt" to "batch".
> 

Better say "Because later cat-file reuses ref-filter logic that will add 
parameter ... ".

-- 
An old man doll... just what I always wanted! - Clara

In the original ref-filter design, it will copy the parsed
atom's name and attributes to `used_atom[i].name` in the
atom's parsing step, and use it again for string matching
in the later specific ref attributes filling step. It use
a lot of string matching to determine which atom we need.

Introduce the enum "atom_type", each enum value is named
as `ATOM_*`, which is the index of each corresponding
valid_atom entry. In the first step of the atom parsing,
`used_atom.atom_type` will record corresponding enum value
from valid_atom entry index, and then in specific reference
attribute filling step, only need to compare the value of
the `used_atom[i].atom_type` to check the atom type.

Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The --batch code to print an object assumes we found out the type of
the object from calling oid_object_info_extended(). This is true for
the default format, but even in a custom format, we manually modify
the object_info struct to ask for the type.

This assumption was broken by 845de33 (cat-file: avoid noop calls
to sha1_object_info_extended, 2016-05-18). That commit skips the call
to oid_object_info_extended() entirely when --batch-all-objects is in
use, and the custom format does not include any placeholders that
require calling it.

Or when the custom format only include placeholders like %(objectname) or
%(rest), oid_object_info_extended() will not get the type of the object.

This results in an error when we try to confirm that the type didn't
change:

$ git cat-file --batch=batman --batch-all-objects
batman
fatal: object 0000239 changed type!?

and also has other subtle effects (e.g., we'd fail to stream a blob,
since we don't realize it's a blob in the first place).

We can fix this by flipping the order of the setup. The check for "do
we need to get the object info" must come _after_ we've decided
whether we need to look up the type.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are two "if (opt->all_objects)" blocks next
to each other, merge them into one to provide better
readability.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Acked-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Only tag and commit objects use `grab_sub_body_contents()` to grab
object contents in the current codebase.  We want to teach the
function to also handle blobs and trees to get their raw data,
without parsing a blob (whose contents looks like a commit or a tag)
incorrectly as a commit or a tag.

Skip the block of code that is specific to handling commits and tags
early when the given object is of a wrong type to help later
addition to handle other types of objects in this function.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Add new formatting option `%(raw)`, which will print the raw
object data without any changes. It will help further to migrate
all cat-file formatting logic from cat-file to ref-filter.

The raw data of blob, tree objects may contain '\0', but most of
the logic in `ref-filter` depends on the output of the atom being
text (specifically, no embedded NULs in it).

E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
add the data to the buffer. The raw data of a tree object is
`100644 one\0...`, only the `100644 one` will be added to the buffer,
which is incorrect.

Therefore, we need to find a way to record the length of the
atom_value's member `s`. Although strbuf can already record the
string and its length, if we want to replace the type of atom_value's
member `s` with strbuf, many places in ref-filter that are filled
with dynamically allocated mermory in `v->s` are not easy to replace.
At the same time, we need to check if `v->s == NULL` in
populate_value(), and strbuf cannot easily distinguish NULL and empty
strings, but c-style "const char *" can do it. So add a new member in
`struct atom_value`: `s_size`, which can record raw object size, it
can help us add raw object data to the buffer or compare two buffers
which contain raw object data.

Note that `--format=%(raw)` cannot be used with `--python`, `--shell`,
`--tcl`, and `--perl` because if the binary raw data is passed to a
variable in such languages, these may not support arbitrary binary data
in their string variable type.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Helped-by: Bagas Sanjaya <bagasdotme@gmail.com>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Felipe Contreras <felipe.contreras@gmail.com>
Helped-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Helped-by: Junio C Hamano <gitster@pobox.com>
Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Because the perl language can handle binary data correctly,
add the function perl_quote_buf_with_len(), which can specify
the length of the data and prevent the data from being truncated
at '\0' to help `--format="%(raw)"` re-support `--perl`.

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Use non-const ref_format in *_atom_parser(), which can help us
modify the members of ref_format in *_atom_parser().

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
In order to let "cat-file --batch=%(rest)" use the ref-filter
interface, add %(rest) atom for ref-filter. "git for-each-ref",
"git branch", "git tag" and "git verify-tag" will reject %(rest)
by default.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Because in the refactor of `git cat-file --batch` later,
oid_object_info_extended() in get_object() will be used to obtain
the info of an object with it's oid. When the object cannot be
obtained in the git repository, `cat-file --batch` expects to output
"<oid> missing" and continue the next oid query instead of letting
Git exit. In other error conditions, Git should exit normally. So we
can achieve this function by passing the return value of get_object().

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
When we use ref_array_item which is not dynamically allocated and
want to free the space of its member "value" after the end of use,
free_array_item() does not meet our needs, because it tries to free
ref_array_item itself and its member "symref".

Introduce free_ref_array_item_value() for freeing ref_array_item value.
It will be called internally by free_array_item(), and it will help
`cat-file --batch` free ref_array_item's value memory later.

Helped-by: Junio C Hamano <gitster@pobox.com>
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Add `cat_file_mode` member in struct `ref_format`, when
`cat-file --batch` use ref-filter logic later, it can help
us reject atoms in verify_ref_format() which cat-file cannot
use, e.g. `%(refname)`, `%(push)`, `%(upstream)`...

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Let get_object() return 1 and print "<oid> missing" instead
of returning -1 and printing "missing object <oid> for <refname>"
if oid_object_info_extended() unable to find the data corresponding
to oid. When `cat-file --batch` use ref-filter logic later it can
help `format_ref_array_item()` just report that the object is missing
without letting Git exit.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Use `has_object_file()` in `batch_one_object()` to check
whether the input object exists. This can help us reject
the missing oid when we let `cat-file --batch` use ref-filter
logic later.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Because later cat-file reuses ref-filter logic that will add
parameter "const struct option *options" to batch_objects(),
the two synonymous parameters of "opt" and "options" may
confuse readers, so change batch_options parameter of
batch_objects() from "opt" to "batch".

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
In order to let cat-file use ref-filter logic, let's do the
following:

1. Change the type of member `format` in struct `batch_options`
to `ref_format`, we will pass it to ref-filter later.
2. Let `batch_objects()` add atoms to format, and use
`verify_ref_format()` to check atoms.
3. Use `format_ref_array_item()` in `batch_object_write()` to
get the formatted data corresponding to the object. If the
return value of `format_ref_array_item()` is equals to zero,
use `batch_write()` to print object data; else if the return
value is less than zero, use `die()` to print the error message
and exit; else if return value is greater than zero, only print
the error message, but don't exit.
4. Use free_ref_array_item_value() to free ref_array_item's
value.

Most of the atoms in `for-each-ref --format` are now supported,
such as `%(tree)`, `%(parent)`, `%(author)`, `%(tagger)`, `%(if)`,
`%(then)`, `%(else)`, `%(end)`. But these atoms will be rejected:
`%(refname)`, `%(symref)`, `%(upstream)`, `%(push)`, `%(worktreepath)`,
`%(flag)`, `%(HEAD)`, because these atoms are unique to those objects
that pointed to by a ref, "for-each-ref"'s family can naturally use
these atoms, but not all objects are pointed to be a ref, so "cat-file"
will not be able to use them.

The performance for `git cat-file --batch-all-objects
--batch-check` on the Git repository itself with performance
testing tool `hyperfine` changes from 669.4 ms ±  31.1 ms to
1.134 s ±  0.063 s.

The performance for `git cat-file --batch-all-objects --batch
>/dev/null` on the Git repository itself with performance testing
tool `time` change from "27.37s user 0.29s system 98% cpu 28.089
total" to "33.69s user 1.54s system 87% cpu 40.258 total".

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Reuse the `err` buffer in batch_object_write(), as the
buffer `scratch` does. This will reduce the overhead
of multiple allocations of memory of the err buffer.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
After cat-file reuses the ref-filter logic, we re-implement the
functions of --textconv and --filters options.

Add members `use_textconv` and `use_filters` in struct `ref_format`,
and use global variables `use_filters` and `use_textconv` in
`ref-filter.c`, so that we can filter the content of the object
in get_object(). Use `actual_oi` to record the real expand_data:
it may point to the original `oi` or the `act_oi` processed by
`textconv_object()` or `convert_to_working_tree()`. `grab_values()`
will grab the contents of `actual_oi` and `grab_common_values()`
to grab the contents of origin `oi`, this ensures that `%(objectsize)`
still uses the size of the unfiltered data.

In `get_object()`, we made an optimization: Firstly, get the size and
type of the object instead of directly getting the object data.
If using --textconv, after successfully obtaining the filtered object
data, an extra oid_object_info_extended() will be skipped, which can
reduce the cost of object data copy; If using --filter, the data of
the object first will be getted first, and then convert_to_working_tree()
will be used to get the filtered object data.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Because "atom_type == ATOM_OBJECTNAME" implies the condition
of `starts_with(name, "objectname")`, "atom_type == ATOM_TREE"
implies the condition of `starts_with(name, "tree")`, so the
check for `starts_with(name, field)` in grab_oid() is redundant.

So Remove the grab_oid() from ref-filter, to reduce repeated check.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
@adlternative adlternative changed the base branch from zh/ref-filter-atom-type to zh/cat-file-batch-fix July 1, 2021 13:51
@adlternative adlternative changed the base branch from zh/cat-file-batch-fix to zh/ref-filter-raw-data July 1, 2021 13:58
@adlternative adlternative changed the base branch from zh/ref-filter-raw-data to zh/ref-filter-atom-type July 1, 2021 13:59
@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

ZheNing Hu <adlternative@gmail.com> writes:

> Junio C Hamano <gitster@pobox.com> 于2021年7月1日周四 上午6:04写道:
>>
>> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>> > This patch series make cat-file reuse ref-filter logic.
>>
>> Unfortunately this seems to interact with your own
>> zh/cat-file-batch-fix rather badly.
>>
>
> Well, it's because I didn't base this patch on it.
> That should be easy to achieve.

It is preferrable for contributors try merging their individual
topics with the rest of 'seen' to see if there are potential
conflicts (either textual or semantic) before sending their topics
out.  Not all topics need to build on top of other topics (in fact,
the fewer inter-dependencies they have, the better), but in this
case, I think it makes sense to build one on top of the other.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2021

This branch is now known as zh/ref-filter-raw-data.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2021

This patch series was integrated into seen via git@3c525b5.

@gitgitgadget gitgitgadget bot added the seen label Jul 1, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Jul 2, 2021

This patch series was integrated into seen via git@ce080dd.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 3, 2021

This patch series was integrated into seen via git@c8db213.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 9, 2021

On the Git mailing list, ZheNing Hu wrote (reply to this):

Hi, Junio,

Junio C Hamano <gitster@pobox.com> 于2021年7月1日周四 下午10:17写道:
>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > Junio C Hamano <gitster@pobox.com> 于2021年7月1日周四 上午6:04写道:
> >>
> >> "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >>
> >> > This patch series make cat-file reuse ref-filter logic.
> >>
> >> Unfortunately this seems to interact with your own
> >> zh/cat-file-batch-fix rather badly.
> >>
> >
> > Well, it's because I didn't base this patch on it.
> > That should be easy to achieve.
>
> It is preferrable for contributors try merging their individual
> topics with the rest of 'seen' to see if there are potential
> conflicts (either textual or semantic) before sending their topics
> out.  Not all topics need to build on top of other topics (in fact,
> the fewer inter-dependencies they have, the better), but in this
> case, I think it makes sense to build one on top of the other.
>

I have a "rebase" trouble:

My new feature branch "cat-file-batch-refactor-rebase-version" should base on
zh/cat-file-batch-fix and  zh/ref-filter-atom-type, so last time I choice
(bb9a3a8f77 Merge branch 'zh/cat-file-batch-fix' into jch) as the patch base.

But github only allow me base the patch on a branch, so I choice
"gitgitgadget:seen"
as my github PR base. It causes that some merge commit include in it. [1]

So In order to prevent these "merge" commits from being sent, the GGG mechanism
is modified to reject their merge commits.

Now I can't choice a good branch as my patch base... Have any ideas?

Thanks.

[1]: https://github.com/gitgitgadget/git/pull/989

--
ZheNing Hu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants