chainlint: improve annotated output #1375
Conversation
The logic in TestParser::accumulate() for detecting broken &&-chains is mostly well-commented, but a couple branches which were deemed obvious and straightforward lack comments. In retrospect, though, these cases may give future readers pause, so comment them, as well.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
To extract the next token in the input stream, Lexer::scan_token() finds the start of the token by skipping whitespace, then consumes characters belonging to the token until it encounters a non-token character, such as an operator, punctuation, or whitespace. In the case of an operator or punctuation which ends a token, before returning the just-scanned token, it pushes that operator or punctuation character back onto the input stream to ensure that it will be the first character consumed by the next call to scan_token().

However, scan_token() is intentionally lax when whitespace ends a token; it doesn't bother pushing the whitespace character back onto the token stream since it knows that the next call to scan_token() will, as its first step, skip over whitespace anyhow when looking for the start of the token. Although such laxity is harmless for the proper functioning of the lexical analyzer, it does make it difficult to precisely identify the token's end position in the input stream.

Accurate token position information may be desirable, for instance, to annotate problems or highlight other interesting facets of the input found during the parsing phase. To accommodate such possibilities, tighten scan_token() by making it push the token-ending whitespace character back onto the input stream, just as it does for other token-ending characters.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
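The pushback discipline described above is language-agnostic. As an illustration (a minimal Python sketch, not chainlint.pl's actual Perl code; the class name and operator set here are hypothetical), a lexer that stops *at* each token-ending character, rather than consuming it, automatically latches an exact end position:

```python
class Lexer:
    """Toy lexer sketch; chainlint.pl's real scan_token() is Perl."""

    OPERATORS = ';&|()'  # hypothetical set of token-ending punctuation

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ''

    def scan_token(self):
        # Skip whitespace to find the start of the token.
        while self._peek().isspace():
            self.pos += 1
        start = self.pos
        # An operator or punctuation character is a token by itself.
        if self._peek() and self._peek() in self.OPERATORS:
            self.pos += 1
            return self.text[start:self.pos], start, self.pos
        # Consume token characters, stopping *at* the terminating
        # character (operator, punctuation, or whitespace) rather than
        # consuming it; the terminator is thus "pushed back" for the
        # next call, and `end` marks the token's exact end position.
        while self._peek() and not self._peek().isspace() \
                and self._peek() not in self.OPERATORS:
            self.pos += 1
        return self.text[start:self.pos], start, self.pos

lexer = Lexer("cd data ;")
print(lexer.scan_token())  # ('cd', 0, 2)
print(lexer.scan_token())  # ('data', 3, 7)
print(lexer.scan_token())  # (';', 8, 9)
```

A lax lexer would report the same lexemes, but only the tightened variant can say where in the raw input each token ends, which is exactly what later annotation needs.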
When chainlint detects problems in a test, such as a broken &&-chain, it prints out the test with "?!FOO?!" annotations inserted at each problem location. However, rather than annotating the original test definition, it instead dumps out a parsed token representation of the test. Since it lacks comments, indentation, here-doc bodies, and so forth, this tokenized representation can be difficult for the test author to digest and relate back to the original test definition.

To address this shortcoming, an upcoming change will make it print out an annotated copy of the original test definition rather than the tokenized representation. In order to do so, it will need to know the start and end positions of each token in the original test definition. As preparation, upgrade TestParser::scan_token() to latch the start and end position of the token being scanned, and return that information along with the token itself. A subsequent change will take advantage of this positional information.

In terms of implementation, TestParser::scan_token() is retrofitted to return a tuple consisting of the token's lexeme and its start and end positions, rather than returning just the lexeme. However, an alternative would be to define a class which represents a token:

    package Token;

    sub new {
        my ($class, $lexeme, $start, $end) = @_;
        bless [$lexeme, $start, $end] => $class;
    }

    sub as_string {
        my $self = shift @_;
        return $self->[0];
    }

    sub compare {
        my ($x, $y) = @_;
        if (UNIVERSAL::isa($y, 'Token')) {
            return $x->[0] cmp $y->[0];
        }
        return $x->[0] cmp $y;
    }

    use overload (
        '""' => 'as_string',
        'cmp' => 'compare'
    );

The major benefit of the class-based approach is that it is entirely non-invasive; it requires no additional changes to the rest of the script since a Token converts automatically to a string, which is what scan_token() historically returned.
The big downside to the Token approach, however, is that it is _slow_; on this developer's (old) machine, it increases user-time by an unacceptable seven seconds when scanning all test scripts in the project. Hence, the simple tuple approach is employed instead since it adds only a fraction of a second user-time.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
When chainlint detects problems in a test, such as a broken &&-chain, it prints out the test with "?!FOO?!" annotations inserted at each problem location. However, rather than annotating the original test definition, it instead dumps out a parsed token representation of the test. Since it lacks comments, indentation, here-doc bodies, and so forth, this tokenized representation can be difficult for the test author to digest and relate back to the original test definition.

However, now that each parsed token carries positional information, the location of a detected problem can be pinpointed precisely in the original test definition. Therefore, take advantage of this information to annotate the test definition itself rather than annotating the parsed token stream, thus making it easier for a test author to relate a problem back to the source.

Maintaining the positional meta-information associated with each detected problem requires a slight change in how the problems are managed internally. In particular, shell syntax such as:

    msg="total: $(cd data; wc -w *.txt) words"

requires the lexical analyzer to recursively invoke the parser in order to detect problems within the $(...) expression inside the double-quoted string. In this case, the recursive parse context will detect the broken &&-chain between the `cd` and `wc` commands, returning the token stream:

    cd data ; ?!AMP?! wc -w *.txt

However, the parent parse context will see everything inside the double-quotes as a single string token:

    "total: $(cd data ; ?!AMP?! wc -w *.txt) words"

losing whatever positional information was attached to the ";" token where the problem was detected.

One way to preserve the positional information of a detected problem in a recursive parse context within a string would be to attach the positional information to the annotation textually; for instance:

    "total: $(cd data ; ?!AMP:21:22?! wc -w *.txt) words"

and then extract the positional information when annotating the original test definition. However, a cleaner and much simpler approach is to maintain the list of detected problems separately rather than embedding the problems as annotations directly in the parsed token stream. Not only does this ensure that positional information within recursive parse contexts is not lost, but it keeps the token stream free from non-token pollution, which may simplify implementation of validations added in the future since they won't have to handle non-token "?!FOO!?" items specially.

Finally, the chainlint self-test "expect" files need a few mechanical adjustments now that the original test definitions are emitted rather than the parsed token stream. In particular, the following items missing from the historic parsed-token output are now preserved verbatim:

    * indentation (and whitespace, in general)
    * comments
    * here-doc bodies
    * here-doc tag quoting (i.e. "\EOF")
    * line-splices (i.e. "\" at the end of a line)

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
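The separate-problem-list approach described above can be sketched as follows (an illustrative Python sketch, not chainlint.pl's actual Perl implementation; the `annotate` function is hypothetical). Problems are recorded out-of-band as (offset, label) pairs against the original source, and the "?!FOO?!" markers are spliced in only at display time:

```python
def annotate(source, problems):
    """Insert "?!LABEL?!" markers into `source` at the recorded offsets.

    `problems` is a list of (offset, label) pairs; offsets index into
    `source` itself, so positional information survives recursive parse
    contexts and the token stream stays free of non-token pollution.
    """
    out = []
    prev = 0
    for offset, label in sorted(problems):
        out.append(source[prev:offset])
        out.append("?!%s?! " % label)
        prev = offset
    out.append(source[prev:])
    return "".join(out)

src = 'msg="total: $(cd data; wc -w *.txt) words"'
# Suppose the recursive parse of the $(...) context recorded a broken
# &&-chain at the offset of `wc` in the original source:
print(annotate(src, [(src.index("wc"), "AMP")]))
# msg="total: $(cd data; ?!AMP?! wc -w *.txt) words"
```

Because the offsets refer to the original test definition rather than to any intermediate token stream, the annotated output preserves the author's own text verbatim.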
/preview
Preview email sent as pull.1375.git.git.1667931861.gitgitgadget@gmail.com
/submit
Submitted as pull.1375.git.git.1667934510.gitgitgadget@gmail.com
On the Git mailing list, Taylor Blau wrote (reply to this): On Tue, Nov 08, 2022 at 07:08:26PM +0000, Eric Sunshine via GitGitGadget wrote:
> This patch series further improves the output by instead making chainlint.pl
> annotate the original test definition rather than the parsed token stream,
> thus preserving indentation (and whitespace, in general), here-doc bodies,
> etc., which should make it easier for a test author to relate each problem
> back to the source.
Very nicely done. The changes all seemed reasonable to me (and, in fact,
the approach is pretty straightforward -- the diffstat is misleading
since many of the changes are to chainlint's expected output).
So I'm happy with it, but let's hear from some other folks who are more
familiar with this area before we start merging it down.
Thanks,
Taylor
On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this): On Tue, Nov 08 2022, Eric Sunshine via GitGitGadget wrote:
> When chainlint detects problems in a test, such as a broken &&-chain, it
> prints out the test with "?!FOO?!" annotations inserted at each problem
> location. However, rather than annotating the original test definition, it
> instead dumps out a parsed token representation of the test. Since it lacks
> comments, indentation, here-doc bodies, and so forth, this tokenized
> representation can be difficult for the test author to digest and relate
> back to the original test definition.
>
> An earlier patch series[1] improved the output somewhat by colorizing the
> "?!FOO?!" annotations and the "# chainlint:" lines, but the output can still
> be difficult to digest.
>
> This patch series further improves the output by instead making chainlint.pl
> annotate the original test definition rather than the parsed token stream,
> thus preserving indentation (and whitespace, in general), here-doc bodies,
> etc., which should make it easier for a test author to relate each problem
> back to the source.
>
> This series was inspired by usability comments from Peff[2] and Ævar[3] and
> a bit of discussion which followed[4][5].
>
> (Note to self: Add Ævar to nerd-snipe blacklist alongside Peff.)
Heh! It's great to see a follow-up to our discussion the other day, and
having the output verbatim & annotated looks much better, especially for
complex tests.
E.g. (taking one at random, after some grepping/skimming), ruining this one:
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index dcaab7265f5..c27539a773d 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -1365,8 +1365,7 @@ test_expect_success 'for-each-ref --ignore-case works on multiple sort keys' '
do
GIT_COMMITTER_EMAIL="$email@example.com" \
git tag -m "tag $subject" icase-$(printf %02d $nr) &&
- nr=$((nr+1))||
- return 1
+ nr=$((nr+1))
done
done &&
git for-each-ref --ignore-case \
Would, before emit (correct, but a bit of a token-soup):
$ ./t6300-for-each-ref.sh
# chainlint: ./t6300-for-each-ref.sh
# chainlint: for-each-ref --ignore-case works on multiple sort keys
nr=0 &&
for email in a A b B
do
for subject in a A b B
do
GIT_COMMITTER_EMAIL="$email@example.com" git tag -m "tag $subject" icase-$(printf %02d $nr) &&
nr=$((nr+1)) ?!LOOP?!
done ?!LOOP?!
done &&
git for-each-ref --ignore-case --format="%(taggeremail) %(subject) %(refname)" --sort=refname --sort=subject --sort=taggeremail refs/tags/icase-* > actual &&
cat > expect <<-EOF &&
test_cmp expect actual
error: bug in the test script: lint error (see '?!...!?' annotations above)
But now it'll instead emit:
$ ./t6300-for-each-ref.sh
# chainlint: ./t6300-for-each-ref.sh
# chainlint: for-each-ref --ignore-case works on multiple sort keys
# name refs numerically to avoid case-insensitive filesystem conflicts
nr=0 &&
for email in a A b B
do
for subject in a A b B
do
GIT_COMMITTER_EMAIL="$email@example.com" \
git tag -m "tag $subject" icase-$(printf %02d $nr) &&
nr=$((nr+1)) ?!LOOP?!
done ?!LOOP?!
done &&
git for-each-ref --ignore-case \
--format="%(taggeremail) %(subject) %(refname)" \
--sort=refname \
--sort=subject \
--sort=taggeremail \
refs/tags/icase-* >actual &&
cat >expect <<-\EOF &&
<a@example.com> tag a refs/tags/icase-00
<a@example.com> tag A refs/tags/icase-01
<A@example.com> tag a refs/tags/icase-04
<A@example.com> tag A refs/tags/icase-05
<a@example.com> tag b refs/tags/icase-02
<a@example.com> tag B refs/tags/icase-03
<A@example.com> tag b refs/tags/icase-06
<A@example.com> tag B refs/tags/icase-07
<b@example.com> tag a refs/tags/icase-08
<b@example.com> tag A refs/tags/icase-09
<B@example.com> tag a refs/tags/icase-12
<B@example.com> tag A refs/tags/icase-13
<b@example.com> tag b refs/tags/icase-10
<b@example.com> tag B refs/tags/icase-11
<B@example.com> tag b refs/tags/icase-14
<B@example.com> tag B refs/tags/icase-15
EOF
test_cmp expect actual
error: bug in the test script: lint error (see '?!...!?' annotations above)
Which is so much better, i.e. as you're preserving the whitespace &
comments, and the "?!LOOP?!" is of course much easier to see with the
colored output.
I hadn't noticed before that the contents of here-docs was pruned; that made sense in the previous parser, but having the content now is much better.
Also, and I guess this is an attempt to evade your blacklist. I *did*
notice when playing around with this that if I now expand the "1 while"
loop here:
my $s = do { local $/; <$fh> };
close($fh);
my $parser = ScriptParser->new(\$s);
1 while $parser->parse_cmd();
To something that "follows along" with the parser it shouldn't be too
hard in the future to add line number annotations now. E.g. for
"#!/bin/sh\n" you'll emit a token like "\n", but the positions will be
0, 10.
But that's all for some hypothetical future, this is already much better
:)
On the Git mailing list, Eric Sunshine wrote (reply to this): On Tue, Nov 8, 2022 at 5:29 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Tue, Nov 08 2022, Eric Sunshine via GitGitGadget wrote:
> > (Note to self: Add Ævar to nerd-snipe blacklist alongside Peff.)
>
> Also, and I guess this is an attempt to evade your blacklist. I *did*
> notice when playing around with this that if I now expand the "1 while"
> loop here:
> [...]
> To something that "follows along" with the parser it shouldn't be too
> hard in the future to add line number annotations now. E.g. for
> "#!/bin/sh\n" you'll emit a token like "\n", but the positions will be
> 0, 10.
>
> But that's all for some hypothetical future, this is already much better
My nerd-snipe blacklist hasn't fully solidified yet, unfortunately (for me).
On the Git mailing list, Eric Sunshine wrote (reply to this): On Tue, Nov 8, 2022 at 5:43 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
> On Tue, Nov 8, 2022 at 5:29 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> > On Tue, Nov 08 2022, Eric Sunshine via GitGitGadget wrote:
> > > (Note to self: Add Ævar to nerd-snipe blacklist alongside Peff.)
> >
> > Also, and I guess this is an attempt to evade your blacklist. I *did*
> > notice when playing around with this that if I now expand the "1 while"
> > loop here:
> > [...]
> > To something that "follows along" with the parser it shouldn't be too
> > hard in the future to add line number annotations now. E.g. for
> > "#!/bin/sh\n" you'll emit a token like "\n", but the positions will be
> > 0, 10.
> >
> > But that's all for some hypothetical future, this is already much better
>
> My nerd-snipe blacklist hasn't fully solidified yet, unfortunately (for me).
I forgot to add that if you do manage to penetrate my nerd-snipe
blacklist, such a feature would be built atop the current series (i.e.
no reason to hold up this series for the "hypothetical future", as you
say).
This branch is now known as
This patch series was integrated into seen via 9e13522. |
On the Git mailing list, Jeff King wrote (reply to this): On Tue, Nov 08, 2022 at 03:28:34PM -0500, Taylor Blau wrote:
> On Tue, Nov 08, 2022 at 07:08:26PM +0000, Eric Sunshine via GitGitGadget wrote:
> > This patch series further improves the output by instead making chainlint.pl
> > annotate the original test definition rather than the parsed token stream,
> > thus preserving indentation (and whitespace, in general), here-doc bodies,
> > etc., which should make it easier for a test author to relate each problem
> > back to the source.
>
> Very nicely done. The changes all seemed reasonable to me (and, in fact,
> the approach is pretty straightforward -- the diffstat is misleading
> since many of the changes are to chainlint's expected output).
>
> So I'm happy with it, but let's hear from some other folks who are more
> familiar with this area before we start merging it down.
I don't claim to be _that_ familiar with the code itself, but all of the
patches look reasonable to me. And most importantly, I dug out the state
of my tree from early September (via the reflog) before I fixed all of
the chainlint problems on my local topics. The improvement in the output
with this series is night and day.
I was a little surprised that using a class in patch 3 would cause such
a slowdown. But it's not that hard to believe that the workload is so
heavy on string comparison and manipulation that the overloaded string
and comparison functions introduce significant overhead. It has been a
long time since I've optimized any perl, but I remember the rule of
thumb being to minimize the number of lines of perl (because all of the
builtin stuff is blazingly fast C, and all of the perl is byte-code).
At any rate, the result you came up with doesn't look too bad. The only
risk is that you forgot to s/$token/$token->[0]/ somewhere, and I
suspect we'd have found that in running the tests.
So it all seems like a step forward to me.
-Peff
On the Git mailing list, Taylor Blau wrote (reply to this): On Wed, Nov 09, 2022 at 08:11:45AM -0500, Jeff King wrote:
> So it all seems like a step forward to me.
Thanks, all. Let's start merging this topic down :-).
Thanks,
Taylor
This patch series was integrated into seen via 3a79a80. |
This patch series was integrated into seen via 260cefa. |
This patch series was integrated into seen via e4fd0f0. |
This patch series was integrated into next via 9cd7d30. |
This patch series was integrated into seen via 85c3e0c. |
This patch series was integrated into seen via 91d501c. |
This patch series was integrated into seen via 4c6d316. |
This patch series was integrated into seen via 68906b9. |
This patch series was integrated into seen via 7fde153. |
This patch series was integrated into seen via ad59b48. |
This patch series was integrated into seen via 3946f79. |
This patch series was integrated into seen via 2ebdb4b. |
This patch series was integrated into seen via 6a0d648. |
This patch series was integrated into seen via d779033. |
This patch series was integrated into seen via 144ec25. |
There was a status update in the "Cooking" section about this branch: "Teach chainlint.pl to annotate the original test definition instead of the token stream. Will merge to 'master'." source: <pull.1375.git.git.1667934510.gitgitgadget@gmail.com>
This patch series was integrated into seen via 613fb30. |
This patch series was integrated into master via 613fb30. |
This patch series was integrated into next via 613fb30. |
Closed via 613fb30. |
When chainlint detects problems in a test, such as a broken &&-chain, it
prints out the test with "?!FOO?!" annotations inserted at each problem
location. However, rather than annotating the original test definition,
it instead dumps out a parsed token representation of the test. Since it
lacks comments, indentation, here-doc bodies, and so forth, this
tokenized representation can be difficult for the test author to digest
and relate back to the original test definition.
An earlier patch series[1] improved the output somewhat by colorizing the
"?!FOO?!" annotations and the "# chainlint:" lines, but the output can
still be difficult to digest.
This patch series further improves the output by instead making
chainlint.pl annotate the original test definition rather than the
parsed token stream, thus preserving indentation (and whitespace, in
general), here-doc bodies, etc., which should make it easier for a test
author to relate each problem back to the source.
This series was inspired by usability comments from Peff[2] and Ævar[3]
and a bit of discussion which followed[4][5].
(Note to self: Add Ævar to nerd-snipe blacklist alongside Peff.)
FOOTNOTES
[1] https://lore.kernel.org/git/pull.1324.v2.git.git.1663041707260.gitgitgadget@gmail.com/
[2] https://lore.kernel.org/git/Yx1x5lme2SGBjfia@coredump.intra.peff.net/
[3] https://lore.kernel.org/git/221024.865yg9ecsx.gmgdl@evledraar.gmail.com/
[4] https://lore.kernel.org/git/CAPig+cRJVn-mbA6-jOmNfDJtK_nX4ZTw+OcNShvvz8zcQYbCHQ@mail.gmail.com/
[5] https://lore.kernel.org/git/CAPig+cT=cWYT6kicNWT+6RxfiKKMyVz72H3_9kwkF-f4Vuoe1w@mail.gmail.com/
Cc: Jeff King peff@peff.net
Cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
Cc: Taylor Blau me@ttaylorr.com