Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSOC][RFC] ref-filter: add contents:raw atom #959

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
19 changes: 15 additions & 4 deletions Documentation/git-for-each-ref.txt
Expand Up @@ -235,11 +235,12 @@ and `date` to extract the named component. For email fields (`authoremail`,
without angle brackets, and `:localpart` to get the part before the `@` symbol
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:

> @@ -1292,7 +1326,8 @@ static void append_lines(struct strbuf *out, const char *buf, unsigned long size
>  }
>  
>  /* See grab_values */
> -static void grab_sub_body_contents(struct atom_value *val, int deref, void *buf)


As I already said, I do not think %(contents) mixes well with this
use for trees and blobs where you give the raw bytes, since
%(contents) for commit and tag objects was never about the full byte
sequence of the object.  It was to give the unstructured part meant
for human consumption, stripping the structured "header" part of the
object.

Nevertheless, since queuing this topic breaks the build and gets in
the way to find issues in other proposed regression fixes of higher
importance,...

> +static void grab_contents(struct atom_value *val, int deref, void *buf,
> +			  unsigned long buf_size, enum object_type object_type)
>  {
> ...
> +		switch (object_type) {
> +		case OBJ_TAG:
> +		case OBJ_COMMIT: {
> ...
> +				v->s = xmemdupz(bodypos, bodylen);
> +			else if (atom->u.contents.option == C_LENGTH)
> +				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)strlen(subpos));
> ...

Note that this part inherits from the original and knows how to feed
a value to PRIuMAX with correct cast ...

> +				v->s = xmemdupz(bodypos, nonsiglen);
> +			else if (atom->u.contents.option == C_SIG)
> ...
> +		}
> +		case OBJ_BLOB:
> +		case OBJ_TREE: {
> +			if (atom->u.contents.option == C_BARE) {
> +				v->s_size = buf_size;
> +				v->s = xmemdupz(buf, buf_size);
> +			} else if (atom->u.contents.option == C_LENGTH)
> +				v->s = xstrfmt("%"PRIuMAX, buf_size);

... but this one gets sloppy, and breaks the windows-build of
'seen'.  Fix is simple:

+				v->s = xstrfmt("%"PRIuMAX, (uintmax_t)buf_size);

I'll squash it in before rebuilding 'seen'.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

> As I already said, I do not think %(contents) mixes well with this
> use for trees and blobs where you give the raw bytes, since
> %(contents) for commit and tag objects was never about the full byte
> sequence of the object.  It was to give the unstructured part meant
> for human consumption, stripping the structured "header" part of the
> object.

To extend on this point a bit (even though this is not all that
urgent during the prerelease freeze), conceptually, the %(content)
field is understood in the larger picture like this:

+--- (the whole thing) ----------------------------------------------+
|                                                                    |
| +--- "header" ---------------------------------------------------+ |
| | tree 678c03dca0a26afd746e8c8bb9e4aadc8bf479b1                  | |
| | parent 378c7c6ad48c4ccddf9b534616a0e86f28440bd3                | |
| | author Junio C Hamano <gitster@pobox.com> 1621675665 +0900     | |
| | committer Junio C Hamano <gitster@pobox.com> 1621675741 +0900  | |
| +----------------------------------------------------------------+ |
|                                                                    |
| +--- "contents" -------------------------------------------------+ |
| |                                                                | |
| | +--- "subject" ----------------------------------------------+ | |
| | | Git 2.32-rc1                                               | | |
| | +------------------------------------------------------------+ | |
| |                                                                | |
| | +--- "body" -------------------------------------------------+ | |
| | | Signed-off-by: Junio C Hamano <gitster@pobox.com>          | | |
| | +------------------------------------------------------------+ | |
| |                                                                | |
| +----------------------------------------------------------------+ |
|                                                                    |
+--------------------------------------------------------------------+

Even though %(header), when it is invented, would make perfect sense
for commits and tags, it will never make sense for trees and blobs.
Which means "contents", which is "the whole thing except for the
header part", would not, either.

There is no %(placeholder) to ask for "the whole thing", and that is
what you want to use for cat-file --batch if I am not mistaken, and
adding one would be a good idea.  There is no %(header) yet, either,
but if somebody needs it for their scripts, it is clear where it fits
in the picture.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

>
> To extend on this point a bit (even though this is not all that
> urgent during the prerelease freeze), conceptually, the %(content)
> field is understood in the larger picture like this:
>
> +--- (the whole thing) ----------------------------------------------+
> |                                                                    |
> | +--- "header" ---------------------------------------------------+ |
> | | tree 678c03dca0a26afd746e8c8bb9e4aadc8bf479b1                  | |
> | | parent 378c7c6ad48c4ccddf9b534616a0e86f28440bd3                | |
> | | author Junio C Hamano <gitster@pobox.com> 1621675665 +0900     | |
> | | committer Junio C Hamano <gitster@pobox.com> 1621675741 +0900  | |
> | +----------------------------------------------------------------+ |
> |                                                                    |
> | +--- "contents" -------------------------------------------------+ |
> | |                                                                | |
> | | +--- "subject" ----------------------------------------------+ | |
> | | | Git 2.32-rc1                                               | | |
> | | +------------------------------------------------------------+ | |
> | |                                                                | |
> | | +--- "body" -------------------------------------------------+ | |
> | | | Signed-off-by: Junio C Hamano <gitster@pobox.com>          | | |
> | | +------------------------------------------------------------+ | |
> | |                                                                | |
> | +----------------------------------------------------------------+ |
> |                                                                    |
> +--------------------------------------------------------------------+
>

Thank you for providing such a complete view. This also means
that the "raw" of contents and tag will contain two parts: "header"
and "contents". But for blobs and trees, they don’t have these things.

> Even though %(header), when it is invented, would make perfect sense
> for commits and tags, it will never make sense for trees and blobs.
> Which means "contents", which is "the whole thing except for the
> header part", would not, either.
>

Although we don’t have a %(header), but in fact we already have fragments
of "%(numparent)", "%(parent)" %(tree)" (see grab_commit_values()) and
"%(tag)"," %(type)","%(object)" (see grab_tag_values()), but they are not
obtained through the "header" part of the raw object buffer.

> There is no %(placeholder) to ask for "the whole thing", and that is
> what you want to use for cat-file --batch if I am not mistaken, and
> adding one would be a good idea.  There is no %(header) yet, either,
> but if somebody needs it for their scripts, it is clear where it fits
> in the picture.
>

So I don't know if adding %(header) will cause duplication of functions.

Thanks!
--
ZheNing Hu

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

ZheNing Hu <adlternative@gmail.com> writes:

> So I don't know if adding %(header) will cause duplication of functions.

I do not think you can duplicate the feature of %(header) with
other existing placeholders.  You may want to dump the headers of
series of commits so that you can find if there is any commit with
unusal header elements, but %(tree), %(parents), etc. would by
definition be the known ones, so any combination of them will not be
a substitute.

But nobody is asking for it right now, and your cat-file --batch
does not need it right away.

What I wanted to say in the message you are responding to was *not*
that you would want to add %(header) for completeness right now.
But thinking beyond your immediate need (i.e. the "whole thing") and
imagining how various pieces, including the ones that do not exist
yet, would fit together, would help you to avoid designing the parts
you immediately need in a way you would regret in the future.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, ZheNing Hu wrote (reply to this):

Junio C Hamano <gitster@pobox.com> 于2021年5月26日周三 上午1:11写道:
>
> ZheNing Hu <adlternative@gmail.com> writes:
>
> > So I don't know if adding %(header) will cause duplication of functions.
>
> I do not think you can duplicate the feature of %(header) with
> other existing placeholders.  You may want to dump the headers of
> series of commits so that you can find if there is any commit with
> unusal header elements, but %(tree), %(parents), etc. would by
> definition be the known ones, so any combination of them will not be
> a substitute.
>

Well, "find if there is any commit with unusal header elements"
is indeed one thing that may need to be done (not now).

I tried to build "%(header)" yesterday and found some problems:

1. These atoms are too "flat":
"tree", "parent", "numparent", "object"... they are all part of the
header, why not use "%(header:tree)" like "%(contents:subject)"
does? Similarly, why not use "%(author:name)" instead of "%(authorname)"...

2. Why is there no state machine to save the parsing state of an object?
`parse_tag_buffer()`, `parse_commit_buffer()`, they only parse part of the
content of the object. E.g. tag parsed "object", "type", "tag" and "taggerdate"
in `parse_tag_buffer()`, but it did not save the starting position and length of
the parsed fields, which caused `grab_person()`, `grab_date()` to parse the
object again and again. I think this may be an optimization point.

> But nobody is asking for it right now, and your cat-file --batch
> does not need it right away.
>

Or we only provide the simplest "%(header)" and "%(header:size)"
(it can be easily supported with "%(raw)"), leaving the other feature
to the future.

> What I wanted to say in the message you are responding to was *not*
> that you would want to add %(header) for completeness right now.
> But thinking beyond your immediate need (i.e. the "whole thing") and
> imagining how various pieces, including the ones that do not exist
> yet, would fit together, would help you to avoid designing the parts
> you immediately need in a way you would regret in the future.

As I said before, many atomic designs are not very regular.
Our ultimate goal may be we can simultaneously support:

"%(raw)"
"%(raw:header)"
"%(raw:header:tagger)"
"%(raw:header:tagger:eamil)"
"%(raw:header:taggereamil)"
"%(header:taggereamil)"
"%(header:tagger:eamil)"
"%(taggereamil)"
"%(tagger:eamil)"
...

Each layer is a namespace. Of course, it may take a long
distance to support this.

Thank.
--
ZheNing Hu

out of the trimmed email.

The message in a commit or a tag object is `contents`, from which
`contents:<part>` can be used to extract various parts out of:
The data in a object is `contents`, from which `contents:<part>` can be used
to extract various parts out of:

contents:size::
The size in bytes of the commit or tag message.
The size in bytes of the commit or tag message, and the raw object size
of the blob or tree.

contents:subject::
The first paragraph of the message, which typically is a
Expand All @@ -257,7 +258,17 @@ contents:signature::
The optional GPG signature of the tag.

contents:lines=N::
The first `N` lines of the message.
The first `N` lines of the message of the commit or tag message.

contents:raw::
The raw data of the object. For blob and tree, it priont all object data,
its output is equivalent to `%(contents)`; For commit and tag, it not only
print objects’ messages (as `%(contents)` does), but also print objects
headers, like the "tree XXX", "parent YYY", etc lines in commits, and the
"object OOO", "type TTT", etc lines in tags.

Note: blob and tree objects only support `%(contents)`, `%(contents:raw)` and
`%(contents:size)`.

Additionally, the trailers as interpreted by linkgit:git-interpret-trailers[1]
are obtained as `trailers[:options]` (or by using the historical alias
Expand Down
116 changes: 116 additions & 0 deletions quote.c
Expand Up @@ -43,6 +43,39 @@ void sq_quote_buf(struct strbuf *dst, const char *src)
free(to_free);
}

void sq_quote_buf_with_size(struct strbuf *dst, const char *src, size_t size)
{
char *to_free = NULL;
size_t cur_size = 0;

if (dst->buf == src)
to_free = strbuf_detach(dst, NULL);

strbuf_addch(dst, '\'');
while (cur_size < size) {
size_t len = strcspn(src, "'!");
if (!len) {
strbuf_add(dst, src, 1);
src++;
cur_size++;
} else {
strbuf_add(dst, src, len);
src += len;
cur_size += len;
}
if (cur_size >= size)
break;
while (need_bs_quote(*src)) {
strbuf_addstr(dst, "'\\");
strbuf_addch(dst, *src++);
cur_size++;
strbuf_addch(dst, '\'');
}
}
strbuf_addch(dst, '\'');
free(to_free);
}

void sq_quote_buf_pretty(struct strbuf *dst, const char *src)
{
static const char ok_punct[] = "+,-./:=@_^";
Expand Down Expand Up @@ -471,6 +504,25 @@ void perl_quote_buf(struct strbuf *sb, const char *src)
strbuf_addch(sb, sq);
}

void perl_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size)
{
const char sq = '\'';
const char bq = '\\';
char c;
size_t cur_size = 0;

strbuf_addch(sb, sq);
while (cur_size < size) {
c = *src++;
cur_size++;

if (c == sq || c == bq)
strbuf_addch(sb, bq);
strbuf_addch(sb, c);
}
strbuf_addch(sb, sq);
}

void python_quote_buf(struct strbuf *sb, const char *src)
{
const char sq = '\'';
Expand All @@ -492,6 +544,31 @@ void python_quote_buf(struct strbuf *sb, const char *src)
strbuf_addch(sb, sq);
}

void python_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size)
{
const char sq = '\'';
const char bq = '\\';
const char nl = '\n';
char c;
size_t cur_size = 0;

strbuf_addch(sb, sq);
while (cur_size < size) {
c = *src++;
cur_size++;

if (c == nl) {
strbuf_addch(sb, bq);
strbuf_addch(sb, 'n');
continue;
}
if (c == sq || c == bq)
strbuf_addch(sb, bq);
strbuf_addch(sb, c);
}
strbuf_addch(sb, sq);
}

void tcl_quote_buf(struct strbuf *sb, const char *src)
{
char c;
Expand Down Expand Up @@ -527,6 +604,45 @@ void tcl_quote_buf(struct strbuf *sb, const char *src)
strbuf_addch(sb, '"');
}

void tcl_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size)
{
char c;
size_t cur_size = 0;

strbuf_addch(sb, '"');
while (cur_size < size) {
c = *src++;
cur_size++;

switch (c) {
case '[': case ']':
case '{': case '}':
case '$': case '\\': case '"':
strbuf_addch(sb, '\\');
/* fallthrough */
default:
strbuf_addch(sb, c);
break;
case '\f':
strbuf_addstr(sb, "\\f");
break;
case '\r':
strbuf_addstr(sb, "\\r");
break;
case '\n':
strbuf_addstr(sb, "\\n");
break;
case '\t':
strbuf_addstr(sb, "\\t");
break;
case '\v':
strbuf_addstr(sb, "\\v");
break;
}
}
strbuf_addch(sb, '"');
}

void basic_regex_quote_buf(struct strbuf *sb, const char *src)
{
char c;
Expand Down
4 changes: 4 additions & 0 deletions quote.h
Expand Up @@ -30,6 +30,7 @@ struct strbuf;
*/

void sq_quote_buf(struct strbuf *, const char *src);
void sq_quote_buf_with_size(struct strbuf *, const char *src, size_t size);
void sq_quote_argv(struct strbuf *, const char **argv);
void sq_quotef(struct strbuf *, const char *fmt, ...);

Expand Down Expand Up @@ -94,8 +95,11 @@ char *quote_path(const char *in, const char *prefix, struct strbuf *out, unsigne

/* quoting as a string literal for other languages */
void perl_quote_buf(struct strbuf *sb, const char *src);
void perl_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size);
void python_quote_buf(struct strbuf *sb, const char *src);
void python_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size);
void tcl_quote_buf(struct strbuf *sb, const char *src);
void tcl_quote_buf_with_size(struct strbuf *sb, const char *src, size_t size);
void basic_regex_quote_buf(struct strbuf *sb, const char *src);

#endif