Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chainlint: improve annotated output #1375

Closed
wants to merge 4 commits into from

Conversation

sunshineco
Copy link
Contributor

@sunshineco sunshineco commented Nov 8, 2022

When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentation, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.

An earlier patch series[1] improved the output somewhat by colorizing the
"?!FOO?!" annotations and the "# chainlint:" lines, but the output can
still be difficult to digest.

This patch series further improves the output by instead making
chainlint.pl annotate the original test definition rather than the
parsed token stream, thus preserving indentation (and whitespace, in
general), here-doc bodies, etc., which should make it easier for a test
author to relate each problem back to the source.

This series was inspired by usability comments from Peff[2] and Ævar[3]
and a bit of discussion which followed[4][5].

(Note to self: Add Ævar to nerd-snipe blacklist alongside Peff.)

FOOTNOTES

[1] https://lore.kernel.org/git/pull.1324.v2.git.git.1663041707260.gitgitgadget@gmail.com/
[2] https://lore.kernel.org/git/Yx1x5lme2SGBjfia@coredump.intra.peff.net/
[3] https://lore.kernel.org/git/221024.865yg9ecsx.gmgdl@evledraar.gmail.com/
[4] https://lore.kernel.org/git/CAPig+cRJVn-mbA6-jOmNfDJtK_nX4ZTw+OcNShvvz8zcQYbCHQ@mail.gmail.com/
[5] https://lore.kernel.org/git/CAPig+cT=cWYT6kicNWT+6RxfiKKMyVz72H3_9kwkF-f4Vuoe1w@mail.gmail.com/

Cc: Jeff King peff@peff.net
Cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Taylor Blau me@ttaylorr.com

The logic in TestParser::accumulate() for detecting broken &&-chains is
mostly well-commented, but a couple branches which were deemed obvious
and straightforward lack comments. In retrospect, though, these cases
may give future readers pause, so comment them, as well.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
To extract the next token in the input stream, Lexer::scan_token() finds
the start of the token by skipping whitespace, then consumes characters
belonging to the token until it encounters a non-token character, such
as an operator, punctuation, or whitespace. In the case of an operator
or punctuation which ends a token, before returning the just-scanned
token, it pushes that operator or punctuation character back onto the
input stream to ensure that it will be the first character consumed by
the next call to scan_token().

However, scan_token() is intentionally lax when whitespace ends a token;
it doesn't bother pushing the whitespace character back onto the token
stream since it knows that the next call to scan_token() will, as its
first step, skip over whitespace anyhow when looking for the start of
the token.

Although such laxity is harmless for the proper functioning of the
lexical analyzer, it does make it difficult to precisely identify the
token's end position in the input stream. Accurate token position
information may be desirable, for instance, to annotate problems or
highlight other interesting facets of the input found during the parsing
phase. To accommodate such possibilities, tighten scan_token() by making
it push the token-ending whitespace character back onto the input
stream, just as it does for other token-ending characters.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.

To address this shortcoming, an upcoming change will make it print out
an annotated copy of the original test definition rather than the
tokenized representation. In order to do so, it will need to know the
start and end positions of each token in the original test definition.
As preparation, upgrade TestParser::scan_token() to latch the start and
end position of the token being scanned, and return that information
along with the token itself. A subsequent change will take advantage of
this positional information.

In terms of implementation, TestParser::scan_token() is retrofitted to
return a tuple consisting of the token's lexeme and its start and end
positions, rather than returning just the lexeme. However, an
alternative would be to define a class which represents a token:

    package Token;

    sub new {
        my ($class, $lexeme, $start, $end) = @_;
        bless [$lexeme, $start, $end] => $class;
    }

    sub as_string {
        my $self = shift @_;
        return $self->[0];
    }

    sub compare {
        my ($x, $y) = @_;
        if (UNIVERSAL::isa($y, 'Token')) {
            return $x->[0] cmp $y->[0];
        }
        return $x->[0] cmp $y;
    }

    use overload (
        '""' => 'as_string',
        'cmp' => 'compare'
    );

The major benefit of the class-based approach is that it is entirely
non-invasive; it requires no additional changes to the rest of the
script since a Token converts automatically to a string, which is what
scan_token() historically returned.

The big downside to the Token approach, however, is that it is _slow_;
on this developer's (old) machine, it increases user-time by an
unacceptable seven seconds when scanning all test scripts in the
project. Hence, the simple tuple approach is employed instead since it
adds only a fraction of a second user-time.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentations, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.

However, now that each parsed token carries positional information, the
location of a detected problem can be pinpointed precisely in the
original test definition. Therefore, take advantage of this information
to annotate the test definition itself rather than annotating the parsed
token stream, thus making it easier for a test author to relate a
problem back to the source.

Maintaining the positional meta-information associated with each
detected problem requires a slight change in how the problems are
managed internally. In particular, shell syntax such as:

    msg="total: $(cd data; wc -w *.txt) words"

requires the lexical analyzer to recursively invoke the parser in order
to detect problems within the $(...) expression inside the double-quoted
string. In this case, the recursive parse context will detect the broken
&&-chain between the `cd` and `wc` commands, returning the token stream:

    cd data ; ?!AMP?! wc -w *.txt

However, the parent parse context will see everything inside the
double-quotes as a single string token:

    "total: $(cd data ; ?!AMP?! wc -w *.txt) words"

losing whatever positional information was attached to the ";" token
where the problem was detected.

One way to preserve the positional information of a detected problem in
a recursive parse context within a string would be to attach the
positional information to the annotation textually; for instance:

    "total: $(cd data ; ?!AMP:21:22?! wc -w *.txt) words"

and then extract the positional information when annotating the original
test definition.

However, a cleaner and much simpler approach is to maintain the list of
detected problems separately rather than embedding the problems as
annotations directly in the parsed token stream. Not only does this
ensure that positional information within recursive parse contexts is
not lost, but it keeps the token stream free from non-token pollution,
which may simplify implementation of validations added in the future
since they won't have to handle non-token "?!FOO!?" items specially.

Finally, the chainlint self-test "expect" files need a few mechanical
adjustments now that the original test definitions are emitted rather
than the parsed token stream. In particular, the following items missing
from the historic parsed-token output are now preserved verbatim:

    * indentation (and whitespace, in general)

    * comments

    * here-doc bodies

    * here-doc tag quoting (i.e. "\EOF")

    * line-splices (i.e. "\" at the end of a line)

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
@sunshineco
Copy link
Contributor Author

/preview

@gitgitgadget-git
Copy link

Preview email sent as pull.1375.git.git.1667931861.gitgitgadget@gmail.com

@sunshineco
Copy link
Contributor Author

/submit

@gitgitgadget-git
Copy link

Submitted as pull.1375.git.git.1667934510.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1375/sunshineco/chainlintpreserve-v1

To fetch this version to local tag pr-git-1375/sunshineco/chainlintpreserve-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1375/sunshineco/chainlintpreserve-v1

@gitgitgadget-git
Copy link

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Nov 08, 2022 at 07:08:26PM +0000, Eric Sunshine via GitGitGadget wrote:
> This patch series further improves the output by instead making chainlint.pl
> annotate the original test definition rather than the parsed token stream,
> thus preserving indentation (and whitespace, in general), here-doc bodies,
> etc., which should make it easier for a test author to relate each problem
> back to the source.

Very nicely done. The changes all seemed reasonable to me (and, in fact,
the approach is pretty straightforward -- the diffstat is misleading
since many of changes are to chainlint's expected output).

So I'm happy with it, but let's hear from some other folks who are more
familiar with this area before we start merging it down.


Thanks,
Taylor

@gitgitgadget-git
Copy link

User Taylor Blau <me@ttaylorr.com> has been added to the cc: list.

@gitgitgadget-git
Copy link

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Tue, Nov 08 2022, Eric Sunshine via GitGitGadget wrote:

> When chainlint detects problems in a test, such as a broken &&-chain, it
> prints out the test with "?!FOO?!" annotations inserted at each problem
> location. However, rather than annotating the original test definition, it
> instead dumps out a parsed token representation of the test. Since it lacks
> comments, indentation, here-doc bodies, and so forth, this tokenized
> representation can be difficult for the test author to digest and relate
> back to the original test definition.
>
> An earlier patch series[1] improved the output somewhat by colorizing the
> "?!FOO?!" annotations and the "# chainlint:" lines, but the output can still
> be difficult to digest.
>
> This patch series further improves the output by instead making chainlint.pl
> annotate the original test definition rather than the parsed token stream,
> thus preserving indentation (and whitespace, in general), here-doc bodies,
> etc., which should make it easier for a test author to relate each problem
> back to the source.
>
> This series was inspired by usability comments from Peff[2] and Ævar[3] and
> a bit of discussion which followed[4][5].
>
> (Note to self: Add Ævar to nerd-snipe blacklist alongside Peff.)

Heh! It's great to see a follow-up to our discussion the other day, and
having the output verbatim & annotated looks much better, especially for
complex tests.

E.g. (taking one at random, after some grepping/skimming), ruining this one:
	
	diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
	index dcaab7265f5..c27539a773d 100755
	--- a/t/t6300-for-each-ref.sh
	+++ b/t/t6300-for-each-ref.sh
	@@ -1365,8 +1365,7 @@ test_expect_success 'for-each-ref --ignore-case works on multiple sort keys' '
	                do
	                        GIT_COMMITTER_EMAIL="$email@example.com" \
	                        git tag -m "tag $subject" icase-$(printf %02d $nr) &&
	-                       nr=$((nr+1))||
	-                       return 1
	+                       nr=$((nr+1))
	                done
	        done &&
	        git for-each-ref --ignore-case \

Would, before emit (correct, but a bit of a token-soup):

	
	$ ./t6300-for-each-ref.sh 
	# chainlint: ./t6300-for-each-ref.sh
	# chainlint: for-each-ref --ignore-case works on multiple sort keys
	
	nr=0 &&
	for email in a A b B
	do
	for subject in a A b B
	do
	GIT_COMMITTER_EMAIL="$email@example.com" git tag -m "tag $subject" icase-$(printf %02d $nr) &&
	nr=$((nr+1)) ?!LOOP?!
	done ?!LOOP?!
	done &&
	git for-each-ref --ignore-case --format="%(taggeremail) %(subject) %(refname)" --sort=refname --sort=subject --sort=taggeremail refs/tags/icase-* > actual &&
	cat > expect <<-EOF &&
	test_cmp expect actual
	error: bug in the test script: lint error (see '?!...!? annotations above)

But now it'll instead emit:
	
	$ ./t6300-for-each-ref.sh
	# chainlint: ./t6300-for-each-ref.sh
	# chainlint: for-each-ref --ignore-case works on multiple sort keys
	        # name refs numerically to avoid case-insensitive filesystem conflicts
	        nr=0 &&
	        for email in a A b B
	        do
	                for subject in a A b B
	                do
	                        GIT_COMMITTER_EMAIL="$email@example.com" \
	                        git tag -m "tag $subject" icase-$(printf %02d $nr) &&
	                        nr=$((nr+1)) ?!LOOP?!
	                done ?!LOOP?!
	        done &&
	        git for-each-ref --ignore-case \
	                --format="%(taggeremail) %(subject) %(refname)" \
	                --sort=refname \
	                --sort=subject \
	                --sort=taggeremail \
	                refs/tags/icase-* >actual &&
	        cat >expect <<-\EOF &&
	        <a@example.com> tag a refs/tags/icase-00
	        <a@example.com> tag A refs/tags/icase-01
	        <A@example.com> tag a refs/tags/icase-04
	        <A@example.com> tag A refs/tags/icase-05
	        <a@example.com> tag b refs/tags/icase-02
	        <a@example.com> tag B refs/tags/icase-03
	        <A@example.com> tag b refs/tags/icase-06
	        <A@example.com> tag B refs/tags/icase-07
	        <b@example.com> tag a refs/tags/icase-08
	        <b@example.com> tag A refs/tags/icase-09
	        <B@example.com> tag a refs/tags/icase-12
	        <B@example.com> tag A refs/tags/icase-13
	        <b@example.com> tag b refs/tags/icase-10
	        <b@example.com> tag B refs/tags/icase-11
	        <B@example.com> tag b refs/tags/icase-14
	        <B@example.com> tag B refs/tags/icase-15
	        EOF
	        test_cmp expect actual
	error: bug in the test script: lint error (see '?!...!? annotations above)

Which is so much better, i.e. as you're preserving the whitespace &
comments, and the "?!LOOP?!" is of course much easier to see with the
colored output.

I hadn't noticed before that the contents of here-docs was pruned, but
that made sense in the previous parser, but having the content.

Also, and I guess this is an attempt to evade your blacklist. I *did*
notice when playing around with this that if I now expand the "1 while"
loop here:

	my $s = do { local $/; <$fh> };
	close($fh);
	my $parser = ScriptParser->new(\$s);
	1 while $parser->parse_cmd();

To something that "follows along" with the parser it shouldn't be too
hard in the future to add line number annotations now. E.g. for
"#!/bin/sh\n" you'll emit a token like "\n", but the positions will be
0, 10.

But that's all for some hypothetical future, this is already much better
:)

@gitgitgadget-git
Copy link

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Tue, Nov 8, 2022 at 5:29 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Tue, Nov 08 2022, Eric Sunshine via GitGitGadget wrote:
> > (Note to self: Add Ævar to nerd-snipe blacklist alongside Peff.)
>
> Also, and I guess this is an attempt to evade your blacklist. I *did*
> notice when playing around with this that if I now expand the "1 while"
> loop here:
>    [...]
> To something that "follows along" with the parser it shouldn't be too
> hard in the future to add line number annotations now. E.g. for
> "#!/bin/sh\n" you'll emit a token like "\n", but the positions will be
> 0, 10.
>
> But that's all for some hypothetical future, this is already much better

My nerd-snipe blacklist hasn't fully solidified yet, unfortunately (for me).

@gitgitgadget-git
Copy link

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Tue, Nov 8, 2022 at 5:43 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Tue, Nov 8, 2022 at 5:29 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> > On Tue, Nov 08 2022, Eric Sunshine via GitGitGadget wrote:
> > > (Note to self: Add Ævar to nerd-snipe blacklist alongside Peff.)
> >
> > Also, and I guess this is an attempt to evade your blacklist. I *did*
> > notice when playing around with this that if I now expand the "1 while"
> > loop here:
> >    [...]
> > To something that "follows along" with the parser it shouldn't be too
> > hard in the future to add line number annotations now. E.g. for
> > "#!/bin/sh\n" you'll emit a token like "\n", but the positions will be
> > 0, 10.
> >
> > But that's all for some hypothetical future, this is already much better
>
> My nerd-snipe blacklist hasn't fully solidified yet, unfortunately (for me).

I forgot to add that if you do manage to penetrate my nerd-snipe
blacklist, such a feature would be built atop the current series (i.e.
no reason to hold up this series for the "hypothetical future", as you
say).

@gitgitgadget-git
Copy link

This branch is now known as es/chainlint-output.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 9e13522.

@gitgitgadget-git gitgitgadget-git bot added the seen label Nov 8, 2022
@gitgitgadget-git
Copy link

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Nov 08, 2022 at 03:28:34PM -0500, Taylor Blau wrote:

> On Tue, Nov 08, 2022 at 07:08:26PM +0000, Eric Sunshine via GitGitGadget wrote:
> > This patch series further improves the output by instead making chainlint.pl
> > annotate the original test definition rather than the parsed token stream,
> > thus preserving indentation (and whitespace, in general), here-doc bodies,
> > etc., which should make it easier for a test author to relate each problem
> > back to the source.
> 
> Very nicely done. The changes all seemed reasonable to me (and, in fact,
> the approach is pretty straightforward -- the diffstat is misleading
> since many of changes are to chainlint's expected output).
> 
> So I'm happy with it, but let's hear from some other folks who are more
> familiar with this area before we start merging it down.

I don't claim to be _that_ familiar with the code itself, but all of the
patches look reasonable to me. And most importantly, I dug out the state
of my tree from early September (via the reflog) before I fixed all of
the chainlint problems on my local topics. The improvement in the output
with this series is night and day.

I was a little surprised that using a class in patch 3 would cause such
a slowdown. But it's not that hard to believe that the workload is so
heavy on string comparison and manipulation that the overloaded string
and comparison functions introduce significant overhead. It has been a
long time since I've optimized any perl, but I remember the rule of
thumb being to minimize the number of lines of perl (because all of the
builtin stuff is blazingly fast C, and all of the perl is byte-code).

At any rate, the result you came up with doesn't look too bad. The only
risk is that you forgot to s/$token/$token->[0]/ somewhere, and I
suspect we'd have found that in running the tests.

So it all seems like a step forward to me.

-Peff

@gitgitgadget-git
Copy link

On the Git mailing list, Taylor Blau wrote (reply to this):

On Wed, Nov 09, 2022 at 08:11:45AM -0500, Jeff King wrote:
> So it all seems like a step forward to me.

Thanks, all. Let's start merging this topic down :-).


Thanks,
Taylor

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 3a79a80.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 260cefa.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via e4fd0f0.

@gitgitgadget-git
Copy link

This patch series was integrated into next via 9cd7d30.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 85c3e0c.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 91d501c.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 4c6d316.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 68906b9.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 7fde153.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via ad59b48.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 3946f79.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 2ebdb4b.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 6a0d648.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via d779033.

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 144ec25.

@gitgitgadget-git
Copy link

There was a status update in the "Cooking" section about the branch es/chainlint-output on the Git mailing list:

Teach chainlint.pl to annotate the original test definition instead
of the token stream.

Will merge to 'master'.
source: <pull.1375.git.git.1667934510.gitgitgadget@gmail.com>

@gitgitgadget-git
Copy link

This patch series was integrated into seen via 613fb30.

@gitgitgadget-git
Copy link

This patch series was integrated into master via 613fb30.

@gitgitgadget-git
Copy link

This patch series was integrated into next via 613fb30.

@gitgitgadget-git
Copy link

Closed via 613fb30.

@sunshineco sunshineco deleted the chainlintpreserve branch December 8, 2022 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant