Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSOC] cat-file: reuse ref-filter logic #993

Closed

Conversation

adlternative
Copy link

@adlternative adlternative commented Jul 12, 2021

This patch series makes cat-file reuse ref-filter logic.

Change from last version:

  1. Move some atoms' test to the commit: [GSOC] ref-filter:
    add cat_file_mode to ref_format.
  2. Advance the commit of performance tests.
  3. Modified some commit messages related to cat-file performance.

By the way, "[GSOC] ref-filter: add %(rest) atom" and
its previous commits should belong to zh/ref-filter-raw-data
and the rest should belong to zh/cat-file-batch-refactor.

cc: Junio C Hamano gitster@pobox.com
cc: Christian Couder christian.couder@gmail.com
cc: Hariom Verma hariom18599@gmail.com
cc: Bagas Sanjaya bagasdotme@gmail.com
cc: Jeff King peff@peff.net
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Eric Sunshine sunshine@sunshineco.com
cc: Philip Oakley philipoakley@iee.email

@adlternative
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2021

Submitted as pull.993.git.1626090419.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-993/adlternative/cat-file-batch-refactor-2-v1

To fetch this version to local tag pr-993/adlternative/cat-file-batch-refactor-2-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-993/adlternative/cat-file-batch-refactor-2-v1

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2021

On the Git mailing list, Christian Couder wrote (reply to this):

On Mon, Jul 12, 2021 at 1:47 PM ZheNing Hu via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This patch series make cat-file reuse ref-filter logic.

s/make/makes/

By the way if you have already sent some of the patches in this series
(and if they haven't changed much since when you sent them), it's a
good idea to use V2 or V3, V4, etc, so we can easily refer to each of
the versions you sent. (See the `-v` option of `git format-patch`.)

> Change from last version:
>
>  1. Declare buf_size in if (atom_type == ATOM_RAW) block.
>  2. Modify the code style of the test.
>  3. Delete "use_textconv" and "use_filter" flag. Instead, add member

s/flag/flags/

>     cat_file_cmdmode to struct ref_array_item.
>  4. Add function reject_atom() to enhance the readability of the code.
>  5. Create p1006-cat-file.sh for performance regression testing.
>  6. Use a "fast path" to output object data to reduce the performance
>     degradation of cat-file --batch with the suggest of Ævar Arnfjörð
>     Bjarmason.

Maybe:

s/with the suggest of Ævar Arnfjörð Bjarmason/as suggested by Ævar
Arnfjörð Bjarmason/

or:

s/with the suggest of Ævar Arnfjörð Bjarmason/according to Ævar
Arnfjörð Bjarmason's suggestion/


> ZheNing Hu (19):
>   cat-file: handle trivial --batch format with --batch-all-objects
>   cat-file: merge two block into one

It's a bit strange that the above 2 don't have [GSOC] while the others
below have it.

>   [GSOC] ref-filter: add obj-type check in grab contents
>   [GSOC] ref-filter: add %(raw) atom
>   [GSOC] ref-filter: --format=%(raw) re-support --perl
>   [GSOC] ref-filter: use non-const ref_format in *_atom_parser()
>   [GSOC] ref-filter: add %(rest) atom
>   [GSOC] ref-filter: pass get_object() return value to their callers
>   [GSOC] ref-filter: introduce free_ref_array_item_value() function
>   [GSOC] ref-filter: introduce reject_atom()
>   [GSOC] ref-filter: modify the error message and value in get_object
>   [GSOC] cat-file: add has_object_file() check
>   [GSOC] cat-file: change batch_objects parameter name
>   [GSOC] cat-file: reuse ref-filter logic
>   [GSOC] cat-file: reuse err buf in batch_object_write()
>   [GSOC] cat-file: re-implement --textconv, --filters options
>   [GSOC] ref-filter: remove grab_oid() function
>   [GSOC] cat-file: create p1006-cat-file.sh

Maybe you could add the new perf test earlier in the series so that we
could see how performance changes when ref-filter logic is reused in
cat-file earlier in the series.

>   [GSOC] cat-file: use fast path when using default_format

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2021

On the Git mailing list, ZheNing Hu wrote (reply to this):

Christian Couder <christian.couder@gmail.com> 于2021年7月12日周一 下午8:36写道:
>
> On Mon, Jul 12, 2021 at 1:47 PM ZheNing Hu via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > This patch series make cat-file reuse ref-filter logic.
>
> s/make/makes/
>
> By the way if you have already sent some of the patches in this series
> (and if they haven't changed much since when you sent them), it's a
> good idea to use V2 or V3, V4, etc, so we can easily refer to each of
> the versions you sent. (See the `-v` option of `git format-patch`.)
>

Yes, but my patch was generated by gitgitgadget, when I want to send
a new patch which is totally different with before, I will send a new PR.
GGG will not know the connection between these two PRs. So the
version number of the patch will be reset.

> > Change from last version:
> >
> >  1. Declare buf_size in if (atom_type == ATOM_RAW) block.
> >  2. Modify the code style of the test.
> >  3. Delete "use_textconv" and "use_filter" flag. Instead, add member
>
> s/flag/flags/
>
> >     cat_file_cmdmode to struct ref_array_item.
> >  4. Add function reject_atom() to enhance the readability of the code.
> >  5. Create p1006-cat-file.sh for performance regression testing.
> >  6. Use a "fast path" to output object data to reduce the performance
> >     degradation of cat-file --batch with the suggest of Ævar Arnfjörð
> >     Bjarmason.
>
> Maybe:
>
> s/with the suggest of Ævar Arnfjörð Bjarmason/as suggested by Ævar
> Arnfjörð Bjarmason/
>
> or:
>
> s/with the suggest of Ævar Arnfjörð Bjarmason/according to Ævar
> Arnfjörð Bjarmason's suggestion/
>
>
> > ZheNing Hu (19):
> >   cat-file: handle trivial --batch format with --batch-all-objects
> >   cat-file: merge two block into one
>
> It's a bit strange that the above 2 don't have [GSOC] while the others
> below have it.
>

That's because it's belong to the branch zh/cat-file-batch-fix. I
should mention it
in the cover-letter.

> >   [GSOC] ref-filter: add obj-type check in grab contents
> >   [GSOC] ref-filter: add %(raw) atom
> >   [GSOC] ref-filter: --format=%(raw) re-support --perl
> >   [GSOC] ref-filter: use non-const ref_format in *_atom_parser()
> >   [GSOC] ref-filter: add %(rest) atom
> >   [GSOC] ref-filter: pass get_object() return value to their callers
> >   [GSOC] ref-filter: introduce free_ref_array_item_value() function
> >   [GSOC] ref-filter: introduce reject_atom()
> >   [GSOC] ref-filter: modify the error message and value in get_object
> >   [GSOC] cat-file: add has_object_file() check
> >   [GSOC] cat-file: change batch_objects parameter name
> >   [GSOC] cat-file: reuse ref-filter logic
> >   [GSOC] cat-file: reuse err buf in batch_object_write()
> >   [GSOC] cat-file: re-implement --textconv, --filters options
> >   [GSOC] ref-filter: remove grab_oid() function
> >   [GSOC] cat-file: create p1006-cat-file.sh
>
> Maybe you could add the new perf test earlier in the series so that we
> could see how performance changes when ref-filter logic is reused in
> cat-file earlier in the series.

Make sence.

>
> >   [GSOC] cat-file: use fast path when using default_format

Thanks.
--
ZheNing Hu

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2021

On the Git mailing list, Philip Oakley wrote (reply to this):

On 12/07/2021 12:46, ZheNing Hu via GitGitGadget wrote:
> This patch series make cat-file reuse ref-filter logic.
>
> Change from last version:
minor nit..
Not sure if this is a gitgitgadget feature, but would it be possible
that a version indication be included in future versions of the patch,
e.g. [PATCH vN 00/19] [GSOC] ?
--
Philip

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2021

User Philip Oakley <philipoakley@iee.email> has been added to the cc: list.

@@ -226,6 +226,12 @@ newline. The available atoms are:
after that first run of whitespace (i.e., the "rest" of the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Christian Couder wrote (reply to this):

On Mon, Jul 12, 2021 at 1:47 PM ZheNing Hu via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: ZheNing Hu <adlternative@gmail.com>
>
> In order to let cat-file use ref-filter logic, let's do the
> following:
>
> 1. Change the type of member `format` in struct `batch_options`
> to `ref_format`, we will pass it to ref-filter later.
> 2. Let `batch_objects()` add atoms to format, and use
> `verify_ref_format()` to check atoms.
> 3. Use `format_ref_array_item()` in `batch_object_write()` to
> get the formatted data corresponding to the object. If the
> return value of `format_ref_array_item()` is equals to zero,
> use `batch_write()` to print object data; else if the return
> value is less than zero, use `die()` to print the error message
> and exit; else if return value is greater than zero, only print
> the error message, but don't exit.
> 4. Use free_ref_array_item_value() to free ref_array_item's
> value.
>
> Most of the atoms in `for-each-ref --format` are now supported,
> such as `%(tree)`, `%(parent)`, `%(author)`, `%(tagger)`, `%(if)`,
> `%(then)`, `%(else)`, `%(end)`. But these atoms will be rejected:
> `%(refname)`, `%(symref)`, `%(upstream)`, `%(push)`, `%(worktreepath)`,
> `%(flag)`, `%(HEAD)`, because these atoms are unique to those objects
> that pointed to by a ref, "for-each-ref"'s family can naturally use
> these atoms, but not all objects are pointed to be a ref, so "cat-file"
> will not be able to use them.
>
> The performance for `git cat-file --batch-all-objects
> --batch-check` on the Git repository itself with performance
> testing tool `hyperfine` changes from 669.4 ms ±  31.1 ms to
> 1.134 s ±  0.063 s.
>
> The performance for `git cat-file --batch-all-objects --batch
> >/dev/null` on the Git repository itself with performance testing
> tool `time` change from "27.37s user 0.29s system 98% cpu 28.089
> total" to "33.69s user 1.54s system 87% cpu 40.258 total".

Saying that a later patch will add a fast path which will mitigate the
performance regression introduced by this patch might help reassure
reviewers.

By the way it is not clear if adding the fast path fully mitigates
this performance regression or not. You might want to discuss that in
the cover letter, or maybe in the patch adding the fast path.

> Mentored-by: Christian Couder <christian.couder@gmail.com>
> Mentored-by: Hariom Verma <hariom18599@gmail.com>
> Signed-off-by: ZheNing Hu <adlternative@gmail.com>
> ---
>  Documentation/git-cat-file.txt |   6 +
>  builtin/cat-file.c             | 242 ++++++-------------------------
>  t/t1006-cat-file.sh            | 251 +++++++++++++++++++++++++++++++++
>  3 files changed, 304 insertions(+), 195 deletions(-)
>
> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> index 4eb0421b3fd..ef8ab952b2f 100644
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -226,6 +226,12 @@ newline. The available atoms are:
>         after that first run of whitespace (i.e., the "rest" of the
>         line) are output in place of the `%(rest)` atom.
>
> +Note that most of the atoms in `for-each-ref --format` are now supported,
> +such as `%(tree)`, `%(parent)`, `%(author)`, `%(tagger)`, `%(if)`,
> +`%(then)`, `%(else)`, `%(end)`. But these atoms will be rejected:
> +`%(refname)`, `%(symref)`, `%(upstream)`, `%(push)`, `%(worktreepath)`,
> +`%(flag)`, `%(HEAD)`. See linkgit:git-for-each-ref[1].
> +
>  If no format is specified, the default format is `%(objectname)
>  %(objecttype) %(objectsize)`.
>
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 41d407638d5..5b163551fc6 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -16,6 +16,7 @@
>  #include "packfile.h"
>  #include "object-store.h"
>  #include "promisor-remote.h"
> +#include "ref-filter.h"
>
>  struct batch_options {
>         int enabled;
> @@ -25,7 +26,7 @@ struct batch_options {
>         int all_objects;
>         int unordered;
>         int cmdmode; /* may be 'w' or 'c' for --filters or --textconv */
> -       const char *format;
> +       struct ref_format format;
>  };
>
>  static const char *force_path;
> @@ -195,99 +196,10 @@ static int cat_one_file(int opt, const char *exp_type, const char *obj_name,
>
>  struct expand_data {
>         struct object_id oid;
> -       enum object_type type;
> -       unsigned long size;
> -       off_t disk_size;
>         const char *rest;
> -       struct object_id delta_base_oid;
> -
> -       /*
> -        * If mark_query is true, we do not expand anything, but rather
> -        * just mark the object_info with items we wish to query.
> -        */
> -       int mark_query;
> -
> -       /*
> -        * Whether to split the input on whitespace before feeding it to
> -        * get_sha1; this is decided during the mark_query phase based on
> -        * whether we have a %(rest) token in our format.
> -        */
>         int split_on_whitespace;
> -
> -       /*
> -        * After a mark_query run, this object_info is set up to be
> -        * passed to oid_object_info_extended. It will point to the data
> -        * elements above, so you can retrieve the response from there.
> -        */
> -       struct object_info info;
> -
> -       /*
> -        * This flag will be true if the requested batch format and options
> -        * don't require us to call oid_object_info, which can then be
> -        * optimized out.
> -        */
> -       unsigned skip_object_info : 1;
>  };
>
> -static int is_atom(const char *atom, const char *s, int slen)
> -{
> -       int alen = strlen(atom);
> -       return alen == slen && !memcmp(atom, s, alen);
> -}
> -
> -static void expand_atom(struct strbuf *sb, const char *atom, int len,
> -                       void *vdata)
> -{
> -       struct expand_data *data = vdata;
> -
> -       if (is_atom("objectname", atom, len)) {
> -               if (!data->mark_query)
> -                       strbuf_addstr(sb, oid_to_hex(&data->oid));
> -       } else if (is_atom("objecttype", atom, len)) {
> -               if (data->mark_query)
> -                       data->info.typep = &data->type;
> -               else
> -                       strbuf_addstr(sb, type_name(data->type));
> -       } else if (is_atom("objectsize", atom, len)) {
> -               if (data->mark_query)
> -                       data->info.sizep = &data->size;
> -               else
> -                       strbuf_addf(sb, "%"PRIuMAX , (uintmax_t)data->size);
> -       } else if (is_atom("objectsize:disk", atom, len)) {
> -               if (data->mark_query)
> -                       data->info.disk_sizep = &data->disk_size;
> -               else
> -                       strbuf_addf(sb, "%"PRIuMAX, (uintmax_t)data->disk_size);
> -       } else if (is_atom("rest", atom, len)) {
> -               if (data->mark_query)
> -                       data->split_on_whitespace = 1;
> -               else if (data->rest)
> -                       strbuf_addstr(sb, data->rest);
> -       } else if (is_atom("deltabase", atom, len)) {
> -               if (data->mark_query)
> -                       data->info.delta_base_oid = &data->delta_base_oid;
> -               else
> -                       strbuf_addstr(sb,
> -                                     oid_to_hex(&data->delta_base_oid));
> -       } else
> -               die("unknown format element: %.*s", len, atom);
> -}
> -
> -static size_t expand_format(struct strbuf *sb, const char *start, void *data)
> -{
> -       const char *end;
> -
> -       if (*start != '(')
> -               return 0;
> -       end = strchr(start + 1, ')');
> -       if (!end)
> -               die("format element '%s' does not end in ')'", start);
> -
> -       expand_atom(sb, start + 1, end - start - 1, data);
> -
> -       return end - start + 1;
> -}
> -
>  static void batch_write(struct batch_options *opt, const void *data, int len)
>  {
>         if (opt->buffer_output) {
> @@ -297,87 +209,34 @@ static void batch_write(struct batch_options *opt, const void *data, int len)
>                 write_or_die(1, data, len);
>  }
>
> -static void print_object_or_die(struct batch_options *opt, struct expand_data *data)
> -{
> -       const struct object_id *oid = &data->oid;
> -
> -       assert(data->info.typep);
> -
> -       if (data->type == OBJ_BLOB) {
> -               if (opt->buffer_output)
> -                       fflush(stdout);
> -               if (opt->cmdmode) {
> -                       char *contents;
> -                       unsigned long size;
> -
> -                       if (!data->rest)
> -                               die("missing path for '%s'", oid_to_hex(oid));
> -
> -                       if (opt->cmdmode == 'w') {
> -                               if (filter_object(data->rest, 0100644, oid,
> -                                                 &contents, &size))
> -                                       die("could not convert '%s' %s",
> -                                           oid_to_hex(oid), data->rest);
> -                       } else if (opt->cmdmode == 'c') {
> -                               enum object_type type;
> -                               if (!textconv_object(the_repository,
> -                                                    data->rest, 0100644, oid,
> -                                                    1, &contents, &size))
> -                                       contents = read_object_file(oid,
> -                                                                   &type,
> -                                                                   &size);
> -                               if (!contents)
> -                                       die("could not convert '%s' %s",
> -                                           oid_to_hex(oid), data->rest);
> -                       } else
> -                               BUG("invalid cmdmode: %c", opt->cmdmode);
> -                       batch_write(opt, contents, size);
> -                       free(contents);
> -               } else {
> -                       stream_blob(oid);
> -               }
> -       }
> -       else {
> -               enum object_type type;
> -               unsigned long size;
> -               void *contents;
> -
> -               contents = read_object_file(oid, &type, &size);
> -               if (!contents)
> -                       die("object %s disappeared", oid_to_hex(oid));
> -               if (type != data->type)
> -                       die("object %s changed type!?", oid_to_hex(oid));
> -               if (data->info.sizep && size != data->size)
> -                       die("object %s changed size!?", oid_to_hex(oid));
> -
> -               batch_write(opt, contents, size);
> -               free(contents);
> -       }
> -}
>
>  static void batch_object_write(const char *obj_name,
>                                struct strbuf *scratch,
>                                struct batch_options *opt,
>                                struct expand_data *data)
>  {
> -       if (!data->skip_object_info &&
> -           oid_object_info_extended(the_repository, &data->oid, &data->info,
> -                                    OBJECT_INFO_LOOKUP_REPLACE) < 0) {
> -               printf("%s missing\n",
> -                      obj_name ? obj_name : oid_to_hex(&data->oid));
> -               fflush(stdout);
> -               return;
> -       }
> +       int ret;
> +       struct strbuf err = STRBUF_INIT;
> +       struct ref_array_item item = { data->oid, data->rest };
>
>         strbuf_reset(scratch);
> -       strbuf_expand(scratch, opt->format, expand_format, data);
> -       strbuf_addch(scratch, '\n');
> -       batch_write(opt, scratch->buf, scratch->len);
>
> -       if (opt->print_contents) {
> -               print_object_or_die(opt, data);
> -               batch_write(opt, "\n", 1);
> +       ret = format_ref_array_item(&item, &opt->format, scratch, &err);
> +       if (ret < 0)
> +               die("%s\n", err.buf);
> +       if (ret) {
> +               /* ret > 0 means when the object corresponding to oid
> +                * cannot be found in format_ref_array_item(), we only print
> +                * the error message.
> +                */
> +               printf("%s\n", err.buf);
> +               fflush(stdout);
> +       } else {
> +               strbuf_addch(scratch, '\n');
> +               batch_write(opt, scratch->buf, scratch->len);
>         }
> +       free_ref_array_item_value(&item);
> +       strbuf_release(&err);
>  }
>
>  static void batch_one_object(const char *obj_name,
> @@ -495,43 +354,37 @@ static int batch_unordered_packed(const struct object_id *oid,
>         return batch_unordered_object(oid, data);
>  }
>
> -static int batch_objects(struct batch_options *batch)
> +static const char * const cat_file_usage[] = {
> +       N_("git cat-file (-t [--allow-unknown-type] | -s [--allow-unknown-type] | -e | -p | <type> | --textconv | --filters) [--path=<path>] <object>"),
> +       N_("git cat-file (--batch[=<format>] | --batch-check[=<format>]) [--follow-symlinks] [--textconv | --filters]"),
> +       NULL
> +};
> +
> +static int batch_objects(struct batch_options *batch, const struct option *options)
>  {
>         struct strbuf input = STRBUF_INIT;
>         struct strbuf output = STRBUF_INIT;
> +       struct strbuf format = STRBUF_INIT;
>         struct expand_data data;
>         int save_warning;
>         int retval = 0;
>
> -       if (!batch->format)
> -               batch->format = "%(objectname) %(objecttype) %(objectsize)";
> -
> -       /*
> -        * Expand once with our special mark_query flag, which will prime the
> -        * object_info to be handed to oid_object_info_extended for each
> -        * object.
> -        */
>         memset(&data, 0, sizeof(data));
> -       data.mark_query = 1;
> -       strbuf_expand(&output, batch->format, expand_format, &data);
> -       data.mark_query = 0;
> -       strbuf_release(&output);
> -       if (batch->cmdmode)
> -               data.split_on_whitespace = 1;
> -
> -       /*
> -        * If we are printing out the object, then always fill in the type,
> -        * since we will want to decide whether or not to stream.
> -        */
> +       if (batch->format.format)
> +               strbuf_addstr(&format, batch->format.format);
> +       else
> +               strbuf_addstr(&format, "%(objectname) %(objecttype) %(objectsize)");
>         if (batch->print_contents)
> -               data.info.typep = &data.type;
> +               strbuf_addstr(&format, "\n%(raw)");
> +       batch->format.format = format.buf;
> +       if (verify_ref_format(&batch->format))
> +               usage_with_options(cat_file_usage, options);
> +
> +       if (batch->cmdmode || batch->format.use_rest)
> +               data.split_on_whitespace = 1;
>
>         if (batch->all_objects) {
>                 struct object_cb_data cb;
> -               struct object_info empty = OBJECT_INFO_INIT;
> -
> -               if (!memcmp(&data.info, &empty, sizeof(empty)))
> -                       data.skip_object_info = 1;
>
>                 if (has_promisor_remote())
>                         warning("This repository uses promisor remotes. Some objects may not be loaded.");
> @@ -561,6 +414,7 @@ static int batch_objects(struct batch_options *batch)
>                         oid_array_clear(&sa);
>                 }
>
> +               strbuf_release(&format);
>                 strbuf_release(&output);
>                 return 0;
>         }
> @@ -593,18 +447,13 @@ static int batch_objects(struct batch_options *batch)
>                 batch_one_object(input.buf, &output, batch, &data);
>         }
>
> +       strbuf_release(&format);
>         strbuf_release(&input);
>         strbuf_release(&output);
>         warn_on_object_refname_ambiguity = save_warning;
>         return retval;
>  }
>
> -static const char * const cat_file_usage[] = {
> -       N_("git cat-file (-t [--allow-unknown-type] | -s [--allow-unknown-type] | -e | -p | <type> | --textconv | --filters) [--path=<path>] <object>"),
> -       N_("git cat-file (--batch[=<format>] | --batch-check[=<format>]) [--follow-symlinks] [--textconv | --filters]"),
> -       NULL
> -};
> -
>  static int git_cat_file_config(const char *var, const char *value, void *cb)
>  {
>         if (userdiff_config(var, value) < 0)
> @@ -627,7 +476,7 @@ static int batch_option_callback(const struct option *opt,
>
>         bo->enabled = 1;
>         bo->print_contents = !strcmp(opt->long_name, "batch");
> -       bo->format = arg;
> +       bo->format.format = arg;
>
>         return 0;
>  }
> @@ -636,7 +485,9 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>  {
>         int opt = 0;
>         const char *exp_type = NULL, *obj_name = NULL;
> -       struct batch_options batch = {0};
> +       struct batch_options batch = {
> +               .format = REF_FORMAT_INIT
> +       };
>         int unknown_type = 0;
>
>         const struct option options[] = {
> @@ -675,6 +526,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>         git_config(git_cat_file_config, NULL);
>
>         batch.buffer_output = -1;
> +       batch.format.cat_file_mode = 1;
>         argc = parse_options(argc, argv, prefix, options, cat_file_usage, 0);
>
>         if (opt) {
> @@ -718,7 +570,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>                 batch.buffer_output = batch.all_objects;
>
>         if (batch.enabled)
> -               return batch_objects(&batch);
> +               return batch_objects(&batch, options);
>
>         if (unknown_type && opt != 't' && opt != 's')
>                 die("git cat-file --allow-unknown-type: use with -s or -t");
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 18b3779ccb6..7452404f24a 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -607,5 +607,256 @@ test_expect_success 'cat-file --batch="batman" with --batch-all-objects will wor
>         git -C all-two cat-file --batch-all-objects --batch="batman" >actual &&
>         cmp expect actual
>  '
> +. "$TEST_DIRECTORY"/lib-gpg.sh
> +. "$TEST_DIRECTORY"/lib-terminal.sh
> +
> +test_expect_success 'cat-file --batch|--batch-check setup' '
> +       echo 1>blob1 &&
> +       printf "a\0b\0\c" >blob2 &&
> +       git add blob1 blob2 &&
> +       git commit -m "Commit Message" &&
> +       git branch -M main &&
> +       git tag -a -m "v0.0.0" testtag &&
> +       git update-ref refs/myblobs/blob1 HEAD:blob1 &&
> +       git update-ref refs/myblobs/blob2 HEAD:blob2 &&
> +       git update-ref refs/mytrees/tree1 HEAD^{tree}
> +'
> +
> +batch_test_atom() {
> +       if test "$3" = "fail"
> +       then
> +               test_expect_${4:-success} $PREREQ "basic atom: $1 $2 must fail" "
> +                       test_must_fail git cat-file --batch-check='$2' >bad <<-EOF
> +                       $1
> +                       EOF
> +               "
> +       else
> +               test_expect_${4:-success} $PREREQ "basic atom: $1 $2" "
> +                       git for-each-ref --format='$2' $1 >expected &&
> +                       git cat-file --batch-check='$2' >actual <<-EOF &&
> +                       $1
> +                       EOF
> +                       sanitize_pgp <actual >actual.clean &&
> +                       cmp expected actual.clean
> +               "
> +       fi
> +}

I wonder if the above function and some of the tests below could be
introduced in a preparatory patch before this one. It could help check
that reusing ref-filter doesn't change the behavior with some atoms
that were previously supported or rejected. Of course if some atoms
are now failing or are now supported, then it's ok to add new tests
for these atoms in this patch.

> +batch_test_atom refs/heads/main '%(refname)' fail
> +batch_test_atom refs/heads/main '%(refname:)' fail

[...]

> +batch_test_atom refs/heads/main 'VALID'
> +batch_test_atom refs/heads/main '%(INVALID)' fail
> +batch_test_atom refs/heads/main '%(authordate:INVALID)' fail
> +
> +test_expect_success '%(rest) works with both a branch and a tag' '
> +       cat >expected <<-EOF &&
> +       123 commit 123
> +       456 tag 456
> +       EOF
> +       git cat-file --batch-check="%(rest) %(objecttype) %(rest)" >actual <<-EOF &&
> +       refs/heads/main 123
> +       refs/tags/testtag 456
> +       EOF
> +       test_cmp expected actual
> +'

It's a bit strange that this test is added in this patch while the
commit message doesn't talk about %(rest). So I wonder if this new
test could move to another previous commit.

> +batch_test_atom refs/heads/main '%(objectname) %(objecttype) %(objectsize)
> +%(raw)'
> +batch_test_atom refs/tags/testtag '%(objectname) %(objecttype) %(objectsize)
> +%(raw)'
> +batch_test_atom refs/myblobs/blob1 '%(objectname) %(objecttype) %(objectsize)
> +%(raw)'
> +batch_test_atom refs/myblobs/blob2 '%(objectname) %(objecttype) %(objectsize)
> +%(raw)'
> +
> +

It looks like there are two empty lines instead of one above.

> +test_expect_success 'cat-file --batch equals to --batch-check with atoms' '
> +       git cat-file --batch-check="%(objectname) %(objecttype) %(objectsize)
> +%(raw)" >expected <<-EOF &&
> +       refs/heads/main
> +       refs/tags/testtag
> +       EOF
> +       git cat-file --batch >actual <<-EOF &&
> +       refs/heads/main
> +       refs/tags/testtag
> +       EOF
> +       cmp expected actual
> +'

I also wonder if the above new test belong to this commit or if it
could be moved to a previous commit.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Christian Couder wrote (reply to this):

On Mon, Jul 12, 2021 at 3:17 PM Christian Couder
<christian.couder@gmail.com> wrote:
>
> On Mon, Jul 12, 2021 at 1:47 PM ZheNing Hu via GitGitGadget
> <gitgitgadget@gmail.com> wrote:

> > +batch_test_atom() {
> > +       if test "$3" = "fail"
> > +       then
> > +               test_expect_${4:-success} $PREREQ "basic atom: $1 $2 must fail" "
> > +                       test_must_fail git cat-file --batch-check='$2' >bad <<-EOF
> > +                       $1
> > +                       EOF
> > +               "
> > +       else
> > +               test_expect_${4:-success} $PREREQ "basic atom: $1 $2" "
> > +                       git for-each-ref --format='$2' $1 >expected &&
> > +                       git cat-file --batch-check='$2' >actual <<-EOF &&
> > +                       $1
> > +                       EOF
> > +                       sanitize_pgp <actual >actual.clean &&
> > +                       cmp expected actual.clean
> > +               "
> > +       fi
> > +}
>
> I wonder if the above function and some of the tests below could be
> introduced in a preparatory patch before this one. It could help check
> that reusing ref-filter doesn't change the behavior with some atoms
> that were previously supported or rejected. Of course if some atoms
> are now failing or are now supported, then it's ok to add new tests
> for these atoms in this patch.

For example maybe some of the tests could be introduced earlier when
the reject_atom() function is introduced.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Christian Couder <christian.couder@gmail.com> 于2021年7月12日周一 下午9:17写道:
>
> On Mon, Jul 12, 2021 at 1:47 PM ZheNing Hu via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > In order to let cat-file use ref-filter logic, let's do the
> > following:
> >
> > 1. Change the type of member `format` in struct `batch_options`
> > to `ref_format`, we will pass it to ref-filter later.
> > 2. Let `batch_objects()` add atoms to format, and use
> > `verify_ref_format()` to check atoms.
> > 3. Use `format_ref_array_item()` in `batch_object_write()` to
> > get the formatted data corresponding to the object. If the
> > return value of `format_ref_array_item()` is equals to zero,
> > use `batch_write()` to print object data; else if the return
> > value is less than zero, use `die()` to print the error message
> > and exit; else if return value is greater than zero, only print
> > the error message, but don't exit.
> > 4. Use free_ref_array_item_value() to free ref_array_item's
> > value.
> >
> > Most of the atoms in `for-each-ref --format` are now supported,
> > such as `%(tree)`, `%(parent)`, `%(author)`, `%(tagger)`, `%(if)`,
> > `%(then)`, `%(else)`, `%(end)`. But these atoms will be rejected:
> > `%(refname)`, `%(symref)`, `%(upstream)`, `%(push)`, `%(worktreepath)`,
> > `%(flag)`, `%(HEAD)`, because these atoms are unique to those objects
> > that pointed to by a ref, "for-each-ref"'s family can naturally use
> > these atoms, but not all objects are pointed to be a ref, so "cat-file"
> > will not be able to use them.
> >
> > The performance for `git cat-file --batch-all-objects
> > --batch-check` on the Git repository itself with performance
> > testing tool `hyperfine` changes from 669.4 ms ±  31.1 ms to
> > 1.134 s ±  0.063 s.
> >
> > The performance for `git cat-file --batch-all-objects --batch
> > >/dev/null` on the Git repository itself with performance testing
> > tool `time` change from "27.37s user 0.29s system 98% cpu 28.089
> > total" to "33.69s user 1.54s system 87% cpu 40.258 total".
>
> Saying that a later patch will add a fast path which will mitigate the
> performance regression introduced by this patch might help reassure
> reviewers.
>

OK.

> By the way it is not clear if adding the fast path fully mitigates
> this performance regression or not. You might want to discuss that in
> the cover letter, or maybe in the patch adding the fast path.
>

I mentioned it: "By using this fast path, we can reduce some of the
extra overhead
when cat-file --batch using ref-filter. The running time of
git cat-file --batch-check will be similar to before, and the
running time of git cat-file --batch will be 9.1% less than before."
which is using the result of t/perf/p1006-cat-file.sh.

>
> I wonder if the above function and some of the tests below could be
> introduced in a preparatory patch before this one. It could help check
> that reusing ref-filter doesn't change the behavior with some atoms
> that were previously supported or rejected. Of course if some atoms
> are now failing or are now supported, then it's ok to add new tests
> for these atoms in this patch.
>

Yes, it might be worth splitting into two commits.

> > +batch_test_atom refs/heads/main '%(refname)' fail
> > +batch_test_atom refs/heads/main '%(refname:)' fail
>
> [...]
>
> > +batch_test_atom refs/heads/main 'VALID'
> > +batch_test_atom refs/heads/main '%(INVALID)' fail
> > +batch_test_atom refs/heads/main '%(authordate:INVALID)' fail
> > +
> > +test_expect_success '%(rest) works with both a branch and a tag' '
> > +       cat >expected <<-EOF &&
> > +       123 commit 123
> > +       456 tag 456
> > +       EOF
> > +       git cat-file --batch-check="%(rest) %(objecttype) %(rest)" >actual <<-EOF &&
> > +       refs/heads/main 123
> > +       refs/tags/testtag 456
> > +       EOF
> > +       test_cmp expected actual
> > +'
>
> It's a bit strange that this test is added in this patch while the
> commit message doesn't talk about %(rest). So I wonder if this new
> test could move to another previous commit.
>

It's just used for checking the uncommonly atoms "%(rest)".
But as you said, we can move it to a split commit.

> > +batch_test_atom refs/heads/main '%(objectname) %(objecttype) %(objectsize)
> > +%(raw)'
> > +batch_test_atom refs/tags/testtag '%(objectname) %(objecttype) %(objectsize)
> > +%(raw)'
> > +batch_test_atom refs/myblobs/blob1 '%(objectname) %(objecttype) %(objectsize)
> > +%(raw)'
> > +batch_test_atom refs/myblobs/blob2 '%(objectname) %(objecttype) %(objectsize)
> > +%(raw)'
> > +
> > +
>
> It looks like there are two empty lines instead of one above.
>
> > +test_expect_success 'cat-file --batch equals to --batch-check with atoms' '
> > +       git cat-file --batch-check="%(objectname) %(objecttype) %(objectsize)
> > +%(raw)" >expected <<-EOF &&
> > +       refs/heads/main
> > +       refs/tags/testtag
> > +       EOF
> > +       git cat-file --batch >actual <<-EOF &&
> > +       refs/heads/main
> > +       refs/tags/testtag
> > +       EOF
> > +       cmp expected actual
> > +'
>
> I also wonder if the above new test belong to this commit or if it
> could be moved to a previous commit.

It's same.

Thanks.
--
ZheNing Hu

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Christian Couder <christian.couder@gmail.com> 于2021年7月12日周一 下午9:26写道:
>
> >
> > I wonder if the above function and some of the tests below could be
> > introduced in a preparatory patch before this one. It could help check
> > that reusing ref-filter doesn't change the behavior with some atoms
> > that were previously supported or rejected. Of course if some atoms
> > are now failing or are now supported, then it's ok to add new tests
> > for these atoms in this patch.
>
> For example maybe some of the tests could be introduced earlier when
> the reject_atom() function is introduced.

Yeah, move it to "[GSOC] ref-filter: add %(rest) atom"  is better.

Thanks.
--
ZheNing Hu

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Christian Couder <christian.couder@gmail.com> writes:

>> The performance for `git cat-file --batch-all-objects --batch
>> >/dev/null` on the Git repository itself with performance testing
>> tool `time` change from "27.37s user 0.29s system 98% cpu 28.089
>> total" to "33.69s user 1.54s system 87% cpu 40.258 total".
>
> Saying that a later patch will add a fast path which will mitigate the
> performance regression introduced by this patch might help reassure
> reviewers.

More importantly, why is such a fast-path even needed?  Isn't it a
sign that the ref-filter implementation is eating more cycles than
it should for given set of placeholders?  Do we know where the extra
cycles goes?

I find it somewhat alarming if we are talking about "fast-path"
workaround before understanding why we are seeing slowdown in the
first place.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Junio C Hamano <gitster@pobox.com> 于2021年7月13日周二 上午4:38写道:

>
> Christian Couder <christian.couder@gmail.com> writes:
>
> More importantly, why is such a fast-path even needed?  Isn't it a
> sign that the ref-filter implementation is eating more cycles than
> it should for given set of placeholders?  Do we know where the extra
> cycles goes?
>
> I find it somewhat alarming if we are talking about "fast-path"
> workaround before understanding why we are seeing slowdown in the
> first place.

There is no complete conclusion yet, but I try to use time and hyperfine test
for these commits (t/perf/* is not accurate enough):

----------------------------------------------------------------------------------------------------------------------------
|                        subject                                  |
--batch-check (using hyperfine) |   --batch(using time) |
----------------------------------------------------------------------------------------------------------------------------
|[GSOC] cat-file: use fast path when using default_format         |
        700ms                |          25.450s      |
----------------------------------------------------------------------------------------------------------------------------
|[GSOC] cat-file: re-implement --textconv, --filters options      |
        790ms                |          29.933s      |
----------------------------------------------------------------------------------------------------------------------------
|[GSOC] cat-file: reuse err buf in batch_object_write()           |
        770ms                |          29.153s      |
----------------------------------------------------------------------------------------------------------------------------
|[GSOC] cat-file: reuse ref-filter logic                          |
        780ms                |          29.412s      |
----------------------------------------------------------------------------------------------------------------------------
|The third batch (upstream/master)                                |
        640ms                |          26.025s      |
----------------------------------------------------------------------------------------------------------------------------

I think we their cost is indeed from "[GSOC] cat-file: reuse ref-filter logic".
But what causes the loss of performance needs further analysis.

Thanks.
--
ZheNing Hu

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

ZheNing Hu <adlternative@gmail.com> 于2021年7月15日周四 上午12:24写道:
>
> Junio C Hamano <gitster@pobox.com> 于2021年7月13日周二 上午4:38写道:
>
> >
> > Christian Couder <christian.couder@gmail.com> writes:
> >
> > More importantly, why is such a fast-path even needed?  Isn't it a
> > sign that the ref-filter implementation is eating more cycles than
> > it should for given set of placeholders?  Do we know where the extra
> > cycles goes?
> >
> > I find it somewhat alarming if we are talking about "fast-path"
> > workaround before understanding why we are seeing slowdown in the
> > first place.
>
> There is no complete conclusion yet, but I try to use time and hyperfine test
> for these commits (t/perf/* is not accurate enough):
>
> ----------------------------------------------------------------------------------------------------------------------------
> |                        subject                                  |
> --batch-check (using hyperfine) |   --batch(using time) |
> ----------------------------------------------------------------------------------------------------------------------------
> |[GSOC] cat-file: use fast path when using default_format         |
>         700ms                |          25.450s      |
> ----------------------------------------------------------------------------------------------------------------------------
> |[GSOC] cat-file: re-implement --textconv, --filters options      |
>         790ms                |          29.933s      |
> ----------------------------------------------------------------------------------------------------------------------------
> |[GSOC] cat-file: reuse err buf in batch_object_write()           |
>         770ms                |          29.153s      |
> ----------------------------------------------------------------------------------------------------------------------------
> |[GSOC] cat-file: reuse ref-filter logic                          |
>         780ms                |          29.412s      |
> ----------------------------------------------------------------------------------------------------------------------------
> |The third batch (upstream/master)                                |
>         640ms                |          26.025s      |
> ----------------------------------------------------------------------------------------------------------------------------
>
> I think we their cost is indeed from "[GSOC] cat-file: reuse ref-filter logic".
> But what causes the loss of performance needs further analysis.
>

Now I think:
There are three main reasons why the performance of cat-file --batch
deteriorates after refactor.

1. Too many copies are used in ref-filter and we cannot avoid these copies
easily because ref-filter needs these copied data to implement atoms %(if),
%(else), %(end)... and the --sort option. The original cat-file
--batch only needs
to output the data to the final string. Its copy times are relatively small.

2. More complex data structure and parsing process are used in ref-filter.
This is why it can provide more and more useful atoms. Therefore, I think the
performance degradation that occurs here is normal.

3. As Ævar Arnfjörð Bjarmason mentioned, oid_object_info_extend() was used
twice in get_object() before. oid_object_info_extend() is the hot
path, we should
try to avoid calling it, So in last version of  "[GSOC] cat-file:
re-implement --textconv,
--filters options", I make the unified processing of --textconv and
--filter avoid calling
oid_object_info_extend() twice.

Thanks.
--
ZheNing Hu

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Christian Couder wrote (reply to this):

On Thu, Jul 15, 2021 at 3:53 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> ZheNing Hu <adlternative@gmail.com> 于2021年7月15日周四 上午12:24写道:
> >
> > Junio C Hamano <gitster@pobox.com> 于2021年7月13日周二 上午4:38写道:

> > > I find it somewhat alarming if we are talking about "fast-path"
> > > workaround before understanding why we are seeing slowdown in the
> > > first place.
> >
> > There is no complete conclusion yet, but I try to use time and hyperfine test
> > for these commits (t/perf/* is not accurate enough):
> >
> > ----------------------------------------------------------------------------------------------------------------------------
> > |                        subject                                  |
> > --batch-check (using hyperfine) |   --batch(using time) |
> > ----------------------------------------------------------------------------------------------------------------------------
> > |[GSOC] cat-file: use fast path when using default_format         |
> >         700ms                |          25.450s      |
> > ----------------------------------------------------------------------------------------------------------------------------
> > |[GSOC] cat-file: re-implement --textconv, --filters options      |
> >         790ms                |          29.933s      |
> > ----------------------------------------------------------------------------------------------------------------------------
> > |[GSOC] cat-file: reuse err buf in batch_object_write()           |
> >         770ms                |          29.153s      |
> > ----------------------------------------------------------------------------------------------------------------------------
> > |[GSOC] cat-file: reuse ref-filter logic                          |
> >         780ms                |          29.412s      |
> > ----------------------------------------------------------------------------------------------------------------------------
> > |The third batch (upstream/master)                                |
> >         640ms                |          26.025s      |
> > ----------------------------------------------------------------------------------------------------------------------------
> >
> > I think we their cost is indeed from "[GSOC] cat-file: reuse ref-filter logic".
> > But what causes the loss of performance needs further analysis.
>
> Now I think:
> There are three main reasons why the performance of cat-file --batch
> deteriorates after refactor.
>
> 1. Too many copies are used in ref-filter and we cannot avoid these copies
> easily because ref-filter needs these copied data to implement atoms %(if),
> %(else), %(end)... and the --sort option. The original cat-file
> --batch only needs
> to output the data to the final string. Its copy times are relatively small.

Is it possible to check early if any of the atoms that needs these
copied data is specified, and if none of them is specified then to
avoid the copies?

> 2. More complex data structure and parsing process are used in ref-filter.
> This is why it can provide more and more useful atoms. Therefore, I think the
> performance degradation that occurs here is normal.

Are there way the more complex parsing could be avoided if it's not
needed by the atoms that are actually used?

> 3. As Ævar Arnfjörð Bjarmason mentioned, oid_object_info_extend() was used
> twice in get_object() before. oid_object_info_extend() is the hot
> path, we should
> try to avoid calling it, So in last version of  "[GSOC] cat-file:
> re-implement --textconv,
> --filters options", I make the unified processing of --textconv and
> --filter avoid calling
> oid_object_info_extend() twice.

Ok, thanks for the details and your work on this performance issue!

I wonder if your patch series could be split, so that the early parts
that add new atoms to ref-filter could be merged sooner?

Best,
Christian.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Christian Couder <christian.couder@gmail.com> 于2021年7月15日周四 下午5:45写道:
>
> On Thu, Jul 15, 2021 at 3:53 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > ZheNing Hu <adlternative@gmail.com> 于2021年7月15日周四 上午12:24写道:
> > >
> > > Junio C Hamano <gitster@pobox.com> 于2021年7月13日周二 上午4:38写道:
>
> > > > I find it somewhat alarming if we are talking about "fast-path"
> > > > workaround before understanding why we are seeing slowdown in the
> > > > first place.
> > >
> > > There is no complete conclusion yet, but I try to use time and hyperfine test
> > > for these commits (t/perf/* is not accurate enough):
> > >
> > > ----------------------------------------------------------------------------------------------------------------------------
> > > |                        subject                                  |
> > > --batch-check (using hyperfine) |   --batch(using time) |
> > > ----------------------------------------------------------------------------------------------------------------------------
> > > |[GSOC] cat-file: use fast path when using default_format         |
> > >         700ms                |          25.450s      |
> > > ----------------------------------------------------------------------------------------------------------------------------
> > > |[GSOC] cat-file: re-implement --textconv, --filters options      |
> > >         790ms                |          29.933s      |
> > > ----------------------------------------------------------------------------------------------------------------------------
> > > |[GSOC] cat-file: reuse err buf in batch_object_write()           |
> > >         770ms                |          29.153s      |
> > > ----------------------------------------------------------------------------------------------------------------------------
> > > |[GSOC] cat-file: reuse ref-filter logic                          |
> > >         780ms                |          29.412s      |
> > > ----------------------------------------------------------------------------------------------------------------------------
> > > |The third batch (upstream/master)                                |
> > >         640ms                |          26.025s      |
> > > ----------------------------------------------------------------------------------------------------------------------------
> > >
> > > I think we their cost is indeed from "[GSOC] cat-file: reuse ref-filter logic".
> > > But what causes the loss of performance needs further analysis.
> >
> > Now I think:
> > There are three main reasons why the performance of cat-file --batch
> > deteriorates after refactor.
> >
> > 1. Too many copies are used in ref-filter and we cannot avoid these copies
> > easily because ref-filter needs these copied data to implement atoms %(if),
> > %(else), %(end)... and the --sort option. The original cat-file
> > --batch only needs
> > to output the data to the final string. Its copy times are relatively small.
>
> Is it possible to check early if any of the atoms that needs these
> copied data is specified, and if none of them is specified then to
> avoid the copies?
>

Well, The copy I'm talking about here refers to something like "v->s =
xstrdup(xxx)";
but v->s is need by --sort, so it is very difficult to remove. At the
moment I think the
only solution is the fast path mentioned by Ævar Arnfjörð Bjarmason.

> > 2. More complex data structure and parsing process are used in ref-filter.
> > This is why it can provide more and more useful atoms. Therefore, I think the
> > performance degradation that occurs here is normal.
>
> Are there way the more complex parsing could be avoided if it's not
> needed by the atoms that are actually used?

No. For example, we can only support "objectsize" before and now we can
support "objectsize:short", so we need to pay more parsing process here.
(It's necessary)

>
> > 3. As Ævar Arnfjörð Bjarmason mentioned, oid_object_info_extend() was used
> > twice in get_object() before. oid_object_info_extend() is the hot
> > path, we should
> > try to avoid calling it, So in last version of  "[GSOC] cat-file:
> > re-implement --textconv,
> > --filters options", I make the unified processing of --textconv and
> > --filter avoid calling
> > oid_object_info_extend() twice.
>
> Ok, thanks for the details and your work on this performance issue!
>
> I wonder if your patch series could be split, so that the early parts
> that add new atoms to ref-filter could be merged sooner?
>

Should this part of the work be handed over to Junio?
The implementation of %(rest) and %(raw)  may be worth merging,
they are truly "zh/ref-filter-raw-data".
The other part may be called "cat-file-reuse-ref-filter-logic".

> Best,
> Christian.

Thanks.
--
ZheNing Hu

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Christian Couder <christian.couder@gmail.com> 于2021年7月12日周一 下午9:17写道:
>
> > +test_expect_success 'cat-file --batch equals to --batch-check with atoms' '
> > +       git cat-file --batch-check="%(objectname) %(objecttype) %(objectsize)
> > +%(raw)" >expected <<-EOF &&
> > +       refs/heads/main
> > +       refs/tags/testtag
> > +       EOF
> > +       git cat-file --batch >actual <<-EOF &&
> > +       refs/heads/main
> > +       refs/tags/testtag
> > +       EOF
> > +       cmp expected actual
> > +'
>
> I also wonder if the above new test belong to this commit or if it
> could be moved to a previous commit.

It's relate to %(raw), but cat-file don't realize %(raw) before. So the location
of this test should remain unchanged.

--
ZheNing Hu

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2021

On the Git mailing list, ZheNing Hu wrote (reply to this):

Philip Oakley <philipoakley@iee.email> 于2021年7月12日周一 下午9:02写道:
>
> On 12/07/2021 12:46, ZheNing Hu via GitGitGadget wrote:
> > This patch series make cat-file reuse ref-filter logic.
> >
> > Change from last version:
> minor nit..
> Not sure if this is a gitgitgadget feature, but would it be possible
> that a version indication be included in future versions of the patch,
> e.g. [PATCH vN 00/19] [GSOC] ?
> --

I think ggg have this [1] , but it is not for the user to control.

[1] https://github.com/gitgitgadget/gitgitgadget/blob/1df3c008552abd1fb788d4988e6d3b92f5765369/lib/patch-series.ts#L654

> Philip

Thanks.
--
ZheNing Hu

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2021

This branch is now known as zh/ref-filter-raw-data.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2021

This patch series was integrated into seen via git@8c97edd.

@gitgitgadget gitgitgadget bot added the seen label Jul 12, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2021

This patch series was integrated into seen via git@697afb8.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2021

There was a status update in the "Cooking" section about the branch zh/ref-filter-raw-data on the Git mailing list:

Prepare the "ref-filter" machinery that drives the "--format"
option of "git for-each-ref" and its friends to be used in "git
cat-file --batch".

Heavy performance degradation has been observed with this series.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2021

This patch series was integrated into seen via git@e50ae59.

Only tag and commit objects use `grab_sub_body_contents()` to grab
object contents in the current codebase.  We want to teach the
function to also handle blobs and trees to get their raw data,
without parsing a blob (whose contents looks like a commit or a tag)
incorrectly as a commit or a tag.

Skip the block of code that is specific to handling commits and tags
early when the given object is of a wrong type to help later
addition to handle other types of objects in this function.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Add new formatting option `%(raw)`, which will print the raw
object data without any changes. It will help further to migrate
all cat-file formatting logic from cat-file to ref-filter.

The raw data of blob, tree objects may contain '\0', but most of
the logic in `ref-filter` depends on the output of the atom being
text (specifically, no embedded NULs in it).

E.g. `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()`
add the data to the buffer. The raw data of a tree object is
`100644 one\0...`, only the `100644 one` will be added to the buffer,
which is incorrect.

Therefore, we need to find a way to record the length of the
atom_value's member `s`. Although strbuf can already record the
string and its length, if we want to replace the type of atom_value's
member `s` with strbuf, many places in ref-filter that are filled
with dynamically allocated mermory in `v->s` are not easy to replace.
At the same time, we need to check if `v->s == NULL` in
populate_value(), and strbuf cannot easily distinguish NULL and empty
strings, but c-style "const char *" can do it. So add a new member in
`struct atom_value`: `s_size`, which can record raw object size, it
can help us add raw object data to the buffer or compare two buffers
which contain raw object data.

Note that `--format=%(raw)` cannot be used with `--python`, `--shell`,
`--tcl`, and `--perl` because if the binary raw data is passed to a
variable in such languages, these may not support arbitrary binary data
in their string variable type.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Helped-by: Bagas Sanjaya <bagasdotme@gmail.com>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Felipe Contreras <felipe.contreras@gmail.com>
Helped-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Helped-by: Junio C Hamano <gitster@pobox.com>
Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Because the perl language can handle binary data correctly,
add the function perl_quote_buf_with_len(), which can specify
the length of the data and prevent the data from being truncated
at '\0' to help `--format="%(raw)"` re-support `--perl`.

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Use non-const ref_format in *_atom_parser(), which can help us
modify the members of ref_format in *_atom_parser().

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
In order to let "cat-file --batch=%(rest)" use the ref-filter
interface, add %(rest) atom for ref-filter. Introduce the
reject_atom() to reject the atom %(rest) for "git for-each-ref",
"git branch", "git tag" and "git verify-tag".

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Because in the refactor of `git cat-file --batch` later,
oid_object_info_extended() in get_object() will be used to obtain
the info of an object with it's oid. When the object cannot be
obtained in the git repository, `cat-file --batch` expects to output
"<oid> missing" and continue the next oid query instead of letting
Git exit. In other error conditions, Git should exit normally. So we
can achieve this function by passing the return value of get_object().

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
When we use ref_array_item which is not dynamically allocated and
want to free the space of its member "value" after the end of use,
free_array_item() does not meet our needs, because it tries to free
ref_array_item itself and its member "symref".

Introduce free_ref_array_item_value() for freeing ref_array_item value.
It will be called internally by free_array_item(), and it will help
`cat-file --batch` free ref_array_item's value memory later.

Helped-by: Junio C Hamano <gitster@pobox.com>
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Add `cat_file_mode` member to struct `ref_format`. When
`cat-file --batch` use ref-filter logic later, it can help us
reject atoms in verify_ref_format() which cat-file cannot use,
e.g. `%(refname)`, `%(push)`, `%(upstream)"...

Add batch_test_atom() to t/t1006-cat-file.sh and add check
for cat-file --batch, this can help us clearly show which
atoms cat-file accepts and which atoms it rejects.

Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Let get_object() return 1 and print "<oid> missing" instead
of returning -1 and printing "missing object <oid> for <refname>"
if oid_object_info_extended() unable to find the data corresponding
to oid. When `cat-file --batch` use ref-filter logic later it can
help `format_ref_array_item()` just report that the object is missing
without letting Git exit.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Use `has_object_file()` in `batch_one_object()` to check
whether the input object exists. This can help us reject
the missing oid when we let `cat-file --batch` use ref-filter
logic later.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Because later cat-file reuses ref-filter logic that will add
parameter "const struct option *options" to batch_objects(),
the two synonymous parameters of "opt" and "options" may
confuse readers, so change batch_options parameter of
batch_objects() from "opt" to "batch".

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Create p1006-cat-file.sh to provide performance testing for
`git cat-file --batch` and `git cat-file --batch-check`. This
will help us compare the performance changes after we let
cat-file reuse the ref-filter logic.

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
In order to let cat-file use ref-filter logic, let's do the
following:

1. Change the type of member `format` in struct `batch_options`
to `ref_format`, we will pass it to ref-filter later.
2. Let `batch_objects()` add atoms to format, and use
`verify_ref_format()` to check atoms.
3. Use `format_ref_array_item()` in `batch_object_write()` to
get the formatted data corresponding to the object. If the
return value of `format_ref_array_item()` is equals to zero,
use `batch_write()` to print object data; else if the return
value is less than zero, use `die()` to print the error message
and exit; else if return value is greater than zero, only print
the error message, but don't exit.
4. Use free_ref_array_item_value() to free ref_array_item's
value.

Most of the atoms in `for-each-ref --format` are now supported,
such as `%(tree)`, `%(parent)`, `%(author)`, `%(tagger)`, `%(if)`,
`%(then)`, `%(else)`, `%(end)`. But these atoms will be rejected:
`%(refname)`, `%(symref)`, `%(upstream)`, `%(push)`, `%(worktreepath)`,
`%(flag)`, `%(HEAD)`, because these atoms are unique to those objects
that pointed to by a ref, "for-each-ref"'s family can naturally use
these atoms, but not all objects are pointed to be a ref, so "cat-file"
will not be able to use them.

The performance for `git cat-file --batch-all-objects
--batch-check` on the Git repository itself with performance
testing tool `hyperfine` changes from 616.7 ms ± 8.9 ms to
758.7 ms ± 16.4 ms.

The performance for `git cat-file --batch-all-objects --batch
>/dev/null` on the Git repository itself with performance testing
tool `time` change from "25.26s user 0.30s system 98% cpu 25.840 total"
to "28.79s user 0.83s system 99% cpu 29.829 total".

The reasons for the performance degradation are as follows:
1. There are a lot of data copies in the logic of ref-filter.
2, In order to be able to support more useful formats, complex
data structure and parsing process are used in ref-filter.

A later patch will add a fast path which will mitigate the
performance regression introduced by this patch.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Reuse the `err` buffer in batch_object_write(), as the
buffer `scratch` does. This will reduce the overhead
of multiple allocations of memory of the err buffer.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
After cat-file reuses the ref-filter logic, we re-implement the
functions of --textconv and --filters options.

Add members `cat_file_cmdmode` to struct `ref_array_item`,
so that struct `batch_option` member `cmdmode` will be passed
to ref-filter, and then ref-filter will take use of it to filter
the content of the object in get_object().

Use `actual_oi` to record the real expand_data: it may point to the
original `oi` or the `act_oi` processed by `textconv_object()` or
`convert_to_working_tree()`. `grab_values()` will grab the contents
of `actual_oi` and `grab_common_values()` to grab the contents of origin
`oi`, this ensures that `%(objectsize)` still uses the size of the
unfiltered data.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Because "atom_type == ATOM_OBJECTNAME" implies the condition
of `starts_with(name, "objectname")`, "atom_type == ATOM_TREE"
implies the condition of `starts_with(name, "tree")`, so the
check for `starts_with(name, field)` in grab_oid() is redundant.

So Remove the grab_oid() from ref-filter, to reduce repeated check.

Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Add the member `default_format` to struct `batch_options`,
when we are using the default format on `git cat-file --batch`,
or `git cat-file --batch-check`, `default_format` will be set,
if we don't use `--textconv` or `--filter`, then we will not call
verify_ref_format(), has_object_file() and format_ref_array_item().
Instead, we get the object data directly through
oid_object_info_extended() and then output the data directly.

By using this fast path, we can reduce some of the extra overhead
when `cat-file --batch` using ref-filter. The running time of
`git cat-file --batch-check` will be similar to before, and the
running time of `git cat-file --batch` will be 9.1% less than before.

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Hariom Verma <hariom18599@gmail.com>
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
@adlternative
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 15, 2021

Submitted as pull.993.v2.git.1626363626.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-993/adlternative/cat-file-batch-refactor-2-v2

To fetch this version to local tag pr-993/adlternative/cat-file-batch-refactor-2-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-993/adlternative/cat-file-batch-refactor-2-v2

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 16, 2021

This patch series was integrated into seen via git@7b092b0.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 17, 2021

This patch series was integrated into seen via git@85f6f93.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 19, 2021

This patch series was integrated into seen via git@ffa6771.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 19, 2021

There was a status update in the "Cooking" section about the branch zh/ref-filter-raw-data on the Git mailing list:

Prepare the "ref-filter" machinery that drives the "--format"
option of "git for-each-ref" and its friends to be used in "git
cat-file --batch".

Heavy performance degradation has been observed with this series.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 20, 2021

This patch series was integrated into seen via git@d2cdca6.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 20, 2021

This patch series was integrated into seen via git@e2cb7f5.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 20, 2021

This patch series was integrated into seen via git@86f78db.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 21, 2021

This patch series was integrated into seen via git@d4fd9a0.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 22, 2021

There was a status update in the "Cooking" section about the branch zh/ref-filter-raw-data on the Git mailing list:

Prepare the "ref-filter" machinery that drives the "--format"
option of "git for-each-ref" and its friends to be used in "git
cat-file --batch".

Heavy performance degradation has been observed with this series.

@adlternative adlternative deleted the cat-file-batch-refactor-2 branch July 22, 2021 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant