New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSOC][RFC] ref-filter: add contents:raw atom #959
[GSOC][RFC] ref-filter: add contents:raw atom #959
Conversation
With the original functions `*._quote_buf()`, there may be '\0' in our source string, only the content before the first '\0' will be processed and added to the buffer. Add `perl_quote_buf_with_size()`, `python_quote_buf_with_size()`, `tcl_quote_buf_with_size`,`sq_quote_buf_with_size()` to `quote.c`, they will process the source string with specified length characters. With them, the content after '\0' will not be truncated. This will help us add binary data containing '\0' to `quote_formatting()` in `ref-filter.c`. Signed-off-by: ZheNing Hu <adlternative@gmail.com>
In order to let `cat-file --batch` use ref-filter logic, we need to print the original content of an object. We can reuse the existing atom `%(contents)` in `ref-filter`, The original `%(contents)` only supports tag and commit objects. If we want to support both blob and tree objects, we must consider the following issues: The original contents of blob, tree objects may contain '\0', most of the logic in `ref-filter` depends on the output of the atom being a string (end with '\0'): `quote_formatting()` use `strbuf_addstr()` or `*._quote_buf()` add content to the buffer. E.g. The original content of a tree object is `100644 one\0...`, only the `100644 one` will be added to the buffer, which is incorrect. Therefore, add a new member in `struct atom_value`: `s_size`, when we record the original content of the blob and tree objects in `grab_contents()`, use `v->s_size` to record the size of the objects contents, then in `quote_formatting()`, if the length of the contents passed in `quote_formatting()` is not equal to 0, we can use `strbuf_add()` instead of `strbuf_addstr()` or `*._quote_buf_with_size()` instead of `*._quote_buf()` to add contents with a specified length if the length of the contents is not equal to 0. It will not cause truncation problems. Similarly, in `append_atom()`, we use `strbuf_add()` instead of `strbuf_addstr()`; In `then_atom_handler()`, we use `memcmp()` instread of `strcmp()`; In `cmp_ref_sorting()`, we use `memcmp()` and `memcasecmp()` instead of `strcmp()` and `strcasecmp()` when the `v->s_size` of one of the two atoms is not equals to 0. Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com> Signed-off-by: ZheNing Hu <adlternative@gmail.com>
Add new formatting option `%(contents:raw)`, which will print all the object contents without any changes. For blob and tree, it priont all object data, its output is equivalent to `%(contents)`; For commit and tag, it not only print objects’ messages (as `%(contents)` does), but also print objects headers, like the "tree XXX", "parent YYY", etc lines in commits, and the "object OOO", "type TTT", etc lines in tags. It will help further to migrate all cat-file formatting logic from cat-file to ref-filter. E.g. git for-each-ref --format=%(contents:raw) refs/heads/foo It will have similar output to: git rev-parse refs/heads/foo | git cat-file --batch Based-on-patch-by: Olga Telezhnaya <olyatelezhnaya@gmail.com> Signed-off-by: ZheNing Hu <adlternative@gmail.com>
6207155
to
320037a
Compare
/submit |
Submitted as pull.959.git.1621763612.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Junio C Hamano wrote (reply to this):
|
On the Git mailing list, Felipe Contreras wrote (reply to this):
|
User |
On the Git mailing list, Bagas Sanjaya wrote (reply to this):
|
User |
On the Git mailing list, ZheNing Hu wrote (reply to this):
|
On the Git mailing list, Junio C Hamano wrote (reply to this):
|
This branch is now known as |
This patch series was integrated into seen via git@be3911d. |
This patch series was integrated into seen via git@325506c. |
@@ -235,11 +235,12 @@ and `date` to extract the named component. For email fields (`authoremail`, | |||
without angle brackets, and `:localpart` to get the part before the `@` symbol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
> @@ -1292,7 +1326,8 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
> }
>
> /* See grab_values */
> -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)
As I already said, I do not think %(contents) mixes well with this
use for trees and blobs where you give the raw bytes, since
%(contents) for commit and tag objects was never about the full byte
sequence of the object. It was to give the unstructured part meant
for human consumption, stripping the structured "header" part of the
object.
Nevertheless, since queuing this topic breaks the build and gets in
the way to find issues in other proposed regression fixes of higher
importance,...
> +static void grab_contents(struct atom_value *val, int deref, void *buf,
> + unsigned long buf_size, enum object_type object_type)
> {
> ...
> + switch (object_type) {
> + case OBJ_TAG:
> + case OBJ_COMMIT: {
> ...
> + v->s = xmemdupz(bodypos, bodylen);
> + else if (atom->u.contents.option == C_LENGTH)
> + v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
> ...
Note that this part inherits from the original and knows how to feed
a value to PRIuMAX with correct cast ...
> + v->s = xmemdupz(bodypos, nonsiglen);
> + else if (atom->u.contents.option == C_SIG)
> ...
> + }
> + case OBJ_BLOB:
> + case OBJ_TREE: {
> + if (atom->u.contents.option == C_BARE) {
> + v->s_size = buf_size;
> + v->s = xmemdupz(buf, buf_size);
> + } else if (atom->u.contents.option == C_LENGTH)
> + v->s = xstrfmt("%"PRIuMAX, buf_size);
... but this one gets sloppy, and breaks the windows-build of
'seen'. Fix is simple:
+ v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);
I'll squash it in before rebuilding 'seen'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
Junio C Hamano <gitster@pobox.com> writes:
> As I already said, I do not think %(contents) mixes well with this
> use for trees and blobs where you give the raw bytes, since
> %(contents) for commit and tag objects was never about the full byte
> sequence of the object. It was to give the unstructured part meant
> for human consumption, stripping the structured "header" part of the
> object.
To extend on this point a bit (even though this is not all that
urgent during the prerelease freeze), conceptually, the %(content)
field is understood in the larger picture like this:
+--- (the whole thing) ----------------------------------------------+
| |
| +--- "header" ---------------------------------------------------+ |
| | tree 678c03dca0a26afd746e8c8bb9e4aadc8bf479b1 | |
| | parent 378c7c6ad48c4ccddf9b534616a0e86f28440bd3 | |
| | author Junio C Hamano <gitster@pobox.com> 1621675665 +0900 | |
| | committer Junio C Hamano <gitster@pobox.com> 1621675741 +0900 | |
| +----------------------------------------------------------------+ |
| |
| +--- "contents" -------------------------------------------------+ |
| | | |
| | +--- "subject" ----------------------------------------------+ | |
| | | Git 2.32-rc1 | | |
| | +------------------------------------------------------------+ | |
| | | |
| | +--- "body" -------------------------------------------------+ | |
| | | Signed-off-by: Junio C Hamano <gitster@pobox.com> | | |
| | +------------------------------------------------------------+ | |
| | | |
| +----------------------------------------------------------------+ |
| |
+--------------------------------------------------------------------+
Even though %(header), when it is invented, would make perfect sense
for commits and tags, it will never make sense for trees and blobs.
Which means "contents", which is "the whole thing except for the
header part", would not, either.
There is no %(placeholder) to ask for "the whole thing", and that is
what you want to use for cat-file --batch if I am not mistaken, and
adding one would be a good idea. There is no %(header) yet, either,
but if somebody needs it for their scripts, it is clear where it fits
in the picture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, ZheNing Hu wrote (reply to this):
>
> To extend on this point a bit (even though this is not all that
> urgent during the prerelease freeze), conceptually, the %(content)
> field is understood in the larger picture like this:
>
> +--- (the whole thing) ----------------------------------------------+
> | |
> | +--- "header" ---------------------------------------------------+ |
> | | tree 678c03dca0a26afd746e8c8bb9e4aadc8bf479b1 | |
> | | parent 378c7c6ad48c4ccddf9b534616a0e86f28440bd3 | |
> | | author Junio C Hamano <gitster@pobox.com> 1621675665 +0900 | |
> | | committer Junio C Hamano <gitster@pobox.com> 1621675741 +0900 | |
> | +----------------------------------------------------------------+ |
> | |
> | +--- "contents" -------------------------------------------------+ |
> | | | |
> | | +--- "subject" ----------------------------------------------+ | |
> | | | Git 2.32-rc1 | | |
> | | +------------------------------------------------------------+ | |
> | | | |
> | | +--- "body" -------------------------------------------------+ | |
> | | | Signed-off-by: Junio C Hamano <gitster@pobox.com> | | |
> | | +------------------------------------------------------------+ | |
> | | | |
> | +----------------------------------------------------------------+ |
> | |
> +--------------------------------------------------------------------+
>
Thank you for providing such a complete view. This also means
that the "raw" of contents and tag will contain two parts: "header"
and "contents". But for blobs and trees, they don’t have these things.
> Even though %(header), when it is invented, would make perfect sense
> for commits and tags, it will never make sense for trees and blobs.
> Which means "contents", which is "the whole thing except for the
> header part", would not, either.
>
Although we don’t have a %(header), but in fact we already have fragments
of "%(numparent)", "%(parent)" %(tree)" (see grab_commit_values()) and
"%(tag)"," %(type)","%(object)" (see grab_tag_values()), but they are not
obtained through the "header" part of the raw object buffer.
> There is no %(placeholder) to ask for "the whole thing", and that is
> what you want to use for cat-file --batch if I am not mistaken, and
> adding one would be a good idea. There is no %(header) yet, either,
> but if somebody needs it for their scripts, it is clear where it fits
> in the picture.
>
So I don't know if adding %(header) will cause duplication of functions.
Thanks!
--
ZheNing Hu
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
ZheNing Hu <adlternative@gmail.com> writes:
> So I don't know if adding %(header) will cause duplication of functions.
I do not think you can duplicate the feature of %(header) with
other existing placeholders. You may want to dump the headers of
series of commits so that you can find if there is any commit with
unusal header elements, but %(tree), %(parents), etc. would by
definition be the known ones, so any combination of them will not be
a substitute.
But nobody is asking for it right now, and your cat-file --batch
does not need it right away.
What I wanted to say in the message you are responding to was *not*
that you would want to add %(header) for completeness right now.
But thinking beyond your immediate need (i.e. the "whole thing") and
imagining how various pieces, including the ones that do not exist
yet, would fit together, would help you to avoid designing the parts
you immediately need in a way you would regret in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, ZheNing Hu wrote (reply to this):
Junio C Hamano <gitster@pobox.com> 于2021年5月26日周三 上午1:11写道:
>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > So I don't know if adding %(header) will cause duplication of functions.
>
> I do not think you can duplicate the feature of %(header) with
> other existing placeholders. You may want to dump the headers of
> series of commits so that you can find if there is any commit with
> unusal header elements, but %(tree), %(parents), etc. would by
> definition be the known ones, so any combination of them will not be
> a substitute.
>
Well, "find if there is any commit with unusal header elements"
is indeed one thing that may need to be done (not now).
I tried to build "%(header)" yesterday and found some problems:
1. These atoms are too "flat":
"tree", "parent", "numparent", "object"... they are all part of the
header, why not use "%(header:tree)" like "%(contents:subject)"
does? Similarly, why not use "%(author:name)" instead of "%(authorname)"...
2. Why is there no state machine to save the parsing state of an object?
`parse_tag_buffer()`, `parse_commit_buffer()`, they only parse part of the
content of the object. E.g. tag parsed "object", "type", "tag" and "taggerdate"
in `parse_tag_buffer()`, but it did not save the starting position and length of
the parsed fields, which caused `grab_person()`, `grab_date()` to parse the
object again and again. I think this may be an optimization point.
> But nobody is asking for it right now, and your cat-file --batch
> does not need it right away.
>
Or we only provide the simplest "%(header)" and "%(header:size)"
(it can be easily supported with "%(raw)"), leaving the other feature
to the future.
> What I wanted to say in the message you are responding to was *not*
> that you would want to add %(header) for completeness right now.
> But thinking beyond your immediate need (i.e. the "whole thing") and
> imagining how various pieces, including the ones that do not exist
> yet, would fit together, would help you to avoid designing the parts
> you immediately need in a way you would regret in the future.
As I said before, many atomic designs are not very regular.
Our ultimate goal may be we can simultaneously support:
"%(raw)"
"%(raw:header)"
"%(raw:header:tagger)"
"%(raw:header:tagger:eamil)"
"%(raw:header:taggereamil)"
"%(header:taggereamil)"
"%(header:tagger:eamil)"
"%(taggereamil)"
"%(tagger:eamil)"
...
Each layer is a namespace. Of course, it may take a long
distance to support this.
Thank.
--
ZheNing Hu
This patch series was integrated into seen via git@96f42f6. |
This patch series was integrated into seen via git@328b912. |
This patch series was integrated into seen via git@48c4f6b. |
On the Git mailing list, Junio C Hamano wrote (reply to this):
|
This patch series was integrated into seen via git@996a7d7. |
This patch series was integrated into seen via git@95db682. |
On the Git mailing list, ZheNing Hu wrote (reply to this):
|
On the Git mailing list, Junio C Hamano wrote (reply to this):
|
On the Git mailing list, ZheNing Hu wrote (reply to this):
|
/close |
You'll have to close using the button. |
OK. |
In (a2f3241: [GSOC] ref-filter: add contents:raw atom)
I did not notice the breakage that occurred during the test,
In the later remediation, I found a very serious problem:
The output object data of "git cat-file tree foo" or
"git cat-file blob foo" may contain '\0'. However,
most of the logic in
ref-filter
depends on the atomicoutput not containing'\0'.
Therefore, we must carry out a series of repairs to
ref-filter
so that it can support output of data containing '\0'.
In first patch, I add
*.quote_buf_with_size()
functions,this can deal with data with containing'\0'.
In second patch, I add the member
s_size
instruct atom_value
,and protects the output of the atom from being truncated at '\0',
and successfully supported the
%(contents)
of blob and tree.In third patch, I added the
%(contents:raw)
atom, It can printthe original content of an object.
What needs to be reconsidered:
For a binary object blob, tree,
git for-each-ref --format="%(contents)" --python refs/mytrees/first
will output a string processed by
python_quote_buf_with_size()
,which contains'\0'. But the binary files seem to be useless after
quoting. Should we allow these binary files to be output in the default
way with
strbuf_add()
? If so, we can remove the first patch.cc: Junio C Hamano gitster@pobox.com
cc: Christian Couder christian.couder@gmail.com
cc: Hariom Verma hariom18599@gmail.com
cc: Karthik Nayak karthik.188@gmail.com
cc: Felipe Contreras felipe.contreras@gmail.com
cc: Bagas Sanjaya bagasdotme@gmail.com