Track quoting style in the tokenizer #10256

AlexWaygood · 2024-03-06T19:39:57Z

Summary

This PR changes the tokenizer so that all information about quoting style and prefixes is captured in a bitflag; this bitflag is then stored as a field on String, FStringStart, FStringMiddle and FStringEnd tokens.
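As a rough illustration of the idea (not the code in this PR), a hand-rolled bitflag stored on tokens might look like the sketch below. The names `StringFlags`, `Tok`, `with`, `has` and `quote_char` are hypothetical, and the meaning of bits 0 and 1 is an assumption; only the four prefix bit positions match the snippet quoted later in this thread.

```rust
// Hypothetical sketch: names and layout are illustrative, not Ruff's actual API.

/// A single byte recording quote style, triple-quoting and prefixes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct StringFlags(u8);

impl StringFlags {
    // Assumed meaning of the two low bits (not confirmed by the thread).
    pub const DOUBLE: u8 = 1 << 0;
    pub const TRIPLE_QUOTED: u8 = 1 << 1;
    // Prefix bits, at the positions shown in the review snippet.
    pub const U_PREFIX: u8 = 1 << 2;
    pub const B_PREFIX: u8 = 1 << 3;
    pub const F_PREFIX: u8 = 1 << 4;
    pub const R_PREFIX: u8 = 1 << 5;

    /// Return a copy of the flags with `bit` set.
    pub const fn with(self, bit: u8) -> Self {
        StringFlags(self.0 | bit)
    }

    /// Check whether `bit` is set.
    pub const fn has(self, bit: u8) -> bool {
        self.0 & bit != 0
    }
}

/// A heavily simplified token type that carries the flags as a field.
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    String { value: String, flags: StringFlags },
    FStringStart(StringFlags),
    Name(String),
}

/// Example consumer: recover the quote character from the flags alone,
/// without re-reading the source text.
pub fn quote_char(flags: StringFlags) -> char {
    if flags.has(StringFlags::DOUBLE) {
        '"'
    } else {
        '\''
    }
}
```

A lint rule holding a `Tok::String` could then call `quote_char(flags)` instead of slicing the source to inspect the opening quote.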

By itself, this change does not fix any bugs. However, it's a necessary first step if we want to start tracking this information in the AST, which is necessary if we want to solve #7799 in a principled and universal way.

It should be easiest to review this PR one commit at a time:

The first commit adds the bitflag and starts storing it on String tokens.
The second commit also changes f-string tokens so that they store the same information.
The third commit applies cleanups and simplifications to several linter rules that this refactor makes possible.

Test Plan

cargo test

AlexWaygood · 2024-03-06T19:51:58Z

It looks like some of the benchmarks are showing a 2-3% slowdown for lexer::Lexer::next_token(): https://codspeed.io/astral-sh/ruff/branches/AlexWaygood:quotestyle-tokenizer. I'll see tomorrow if there's anything I can do to ameliorate that.

github-actions · 2024-03-06T19:58:14Z

`ruff-ecosystem` results

Linter (stable)

ℹ️ ecosystem check detected linter changes. (+2 -0 violations, +0 -0 fixes in 1 projects; 42 projects unchanged)

rotki/rotki (+2 -0 violations, +0 -0 fixes)

+ rotkehlchen/chain/ethereum/modules/eigenlayer/constants.py:9:36: Q004 [*] Unnecessary escape on inner quote character
+ rotkehlchen/chain/evm/decoding/cowswap/decoder.py:41:45: Q004 [*] Unnecessary escape on inner quote character

Changes by rule (1 rules affected)

code	total	+ violation	- violation	+ fix	- fix
Q004	2	2	0	0	0

Linter (preview)

ℹ️ ecosystem check detected linter changes. (+2 -0 violations, +0 -0 fixes in 1 projects; 42 projects unchanged)

rotki/rotki (+2 -0 violations, +0 -0 fixes)

ruff check --no-cache --exit-zero --ignore RUF9 --output-format concise --preview

+ rotkehlchen/chain/ethereum/modules/eigenlayer/constants.py:9:36: Q004 [*] Unnecessary escape on inner quote character
+ rotkehlchen/chain/evm/decoding/cowswap/decoder.py:41:45: Q004 [*] Unnecessary escape on inner quote character

Changes by rule (1 rules affected)

code	total	+ violation	- violation	+ fix	- fix
Q004	2	2	0	0	0

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

AlexWaygood · 2024-03-06T20:13:38Z

(I'll also look through the ecosystem results tomorrow.)

charliermarsh · 2024-03-06T20:14:05Z

@AlexWaygood - It's possible that you have a slightly different version of LALRPOP than whatever was used most recently to generate the parser, hence the thousands of lines of changes in the generated python.rs. Can you check if you're using // auto-generated: "lalrpop 0.20.0"?

AlexWaygood · 2024-03-06T20:16:44Z

Can you check if you're using // auto-generated: "lalrpop 0.20.0"?

Ah, I'm using 0.20.2 locally :(

I did wonder...

charliermarsh · 2024-03-06T20:22:12Z

It's one way to buff up your contribution stats.

AlexWaygood · 2024-03-06T21:50:55Z

I fixed up the huge changes from using the wrong lalrpop version, with some help from @BurntSushi on the necessary git-fu to get there 🥳

crates/ruff_linter/src/rules/pycodestyle/rules/invalid_escape_sequence.rs

charliermarsh

I think this looks great!

crates/ruff_python_parser/src/string_token_flags.rs

crates/ruff_python_parser/src/lexer.rs

charliermarsh · 2024-03-07T04:18:55Z

Can you check if the ecosystem changes are intended?

MichaReiser

This looks good. Nice work. I really like how you abstracted away the bit representation and I think it allows us to abstract away even more.

I recommend changing the API of StringFlags by introducing a new StringPrefix enum (similar to the existing StringKind enum) that guarantees that constructing invalid prefixes is impossible. This should reduce the changes necessary in the lexer and remove the unwrap/expect calls (that could be the cause of the perf regression).

I further recommend removing StringFlags from FStringMiddle and FStringEnd except if we have a very specific use case where knowing the flags is required.
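To make the suggestion concrete, here is a minimal sketch of an enum that can only express valid Python prefix combinations. The variant names and the `parse` and `is_raw` helpers are hypothetical and chosen for illustration; they are not taken from the PR.

```rust
/// Hypothetical enum listing every valid prefix combination.
/// Invalid combinations such as `ur` or `bf` simply have no variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StringPrefix {
    /// No prefix at all.
    Empty,
    /// `u` or `U`; cannot be combined with any other prefix.
    Unicode,
    /// `r` or `R`.
    Raw,
    /// `b` or `B`.
    Bytes,
    /// `rb`, `br` and their case variants.
    RawBytes,
    /// `f` or `F`.
    Format,
    /// `rf`, `fr` and their case variants.
    RawFormat,
}

impl StringPrefix {
    /// Parse the prefix characters that precede the opening quote.
    /// Returns `None` for combinations Python does not accept.
    pub fn parse(prefix: &str) -> Option<Self> {
        match prefix.to_ascii_lowercase().as_str() {
            "" => Some(Self::Empty),
            "u" => Some(Self::Unicode),
            "r" => Some(Self::Raw),
            "b" => Some(Self::Bytes),
            "rb" | "br" => Some(Self::RawBytes),
            "f" => Some(Self::Format),
            "rf" | "fr" => Some(Self::RawFormat),
            _ => None,
        }
    }

    /// Whether backslash escapes are left uninterpreted.
    pub fn is_raw(self) -> bool {
        matches!(self, Self::Raw | Self::RawBytes | Self::RawFormat)
    }
}
```

Because callers pass one enum value rather than setting individual bits, there is no invalid state left to check for at the point where the flags are built.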

crates/ruff_linter/src/rules/flake8_quotes/rules/avoidable_escaped_quote.rs

crates/ruff_linter/src/rules/pycodestyle/rules/invalid_escape_sequence.rs

crates/ruff_linter/src/rules/pylint/rules/bad_string_format_character.rs

crates/ruff_linter/src/rules/pyupgrade/rules/printf_string_formatting.rs

crates/ruff_python_parser/src/string_token_flags.rs

crates/ruff_python_parser/src/lexer/fstring.rs

crates/ruff_python_parser/src/token.rs

..._python_parser/src/snapshots/ruff_python_parser__lexer__tests__fstring_with_format_spec.snap

crates/ruff_python_parser/src/string_token_flags.rs

AlexWaygood · 2024-03-07T13:29:35Z

rotkehlchen/chain/ethereum/modules/eigenlayer/constants.py:9:36: Q004 [*] Unnecessary escape on inner quote character

rotkehlchen/chain/evm/decoding/cowswap/decoder.py:41:45: Q004 [*] Unnecessary escape on inner quote character

I wish I could say I knew why these changes cause these two lines to be flagged when previously they weren't. But the good news is... I think these are true positives! The \' escape inside both strings does seem to be unnecessary, since the string uses double quotes:

>>> x = b"\xe7\xeb\x0c\xa1\x1b\x83tN\xce=x\xe9\xbe\x01\xb9\x13B_\xba\xe7\x0c2\xce\'rm\x0e\xcd\xe9.\xf8\xd2"
>>> y = b"\xe7\xeb\x0c\xa1\x1b\x83tN\xce=x\xe9\xbe\x01\xb9\x13B_\xba\xe7\x0c2\xce'rm\x0e\xcd\xe9.\xf8\xd2"
>>> x == y
True
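For readers unfamiliar with what Q004 looks for, the check can be approximated as follows. This is a simplified sketch under stated assumptions, not Ruff's implementation: the function name `unnecessary_quote_escapes` is hypothetical, and it only handles the body of a non-raw, non-triple-quoted string.

```rust
/// Return the byte offsets of backslashes that escape a quote character
/// which differs from the string's own quote character.
///
/// `body` is the text between the quotes and `quote` is the quote
/// character the string was written with.
pub fn unnecessary_quote_escapes(body: &str, quote: char) -> Vec<usize> {
    // The quote character that never needs escaping in this string.
    let other = if quote == '"' { '\'' } else { '"' };
    let mut offsets = Vec::new();
    let mut chars = body.char_indices();
    while let Some((idx, ch)) = chars.next() {
        if ch == '\\' {
            // Consume the escaped character so that an escaped backslash
            // followed by a quote (`\\'`) is not misread as `\'`.
            if let Some((_, next)) = chars.next() {
                if next == other {
                    offsets.push(idx);
                }
            }
        }
    }
    offsets
}
```

Applied to a double-quoted body containing `\'`, the function reports the backslash, which matches the reasoning above that the escape is redundant.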

I'll add some tests to make sure these don't regress in the future.

AlexWaygood · 2024-03-07T13:35:54Z

The codspeed benchmarks are still showing some regressions in the lexer, but are also now showing some speedups in the linter.

AlexWaygood · 2024-03-07T14:19:01Z

Reviews have mostly been addressed -- thanks! The outstanding questions are:

1. What to do about f-string tokens: Track quoting style in the tokenizer #10256 (comment)
2. Whether StringKind et al deserve to be in their own module/file, or whether they should be moved to string.rs: Track quoting style in the tokenizer #10256 (comment)
3. The way StringKind stores its data: Track quoting style in the tokenizer #10256 (comment)
4. Getting a better Debug implementation for StringKind (depends on resolving (3) first): Track quoting style in the tokenizer #10256 (comment)

crates/ruff_python_parser/src/string_token_flags.rs

crates/ruff_linter/src/rules/flake8_quotes/rules/avoidable_escaped_quote.rs

dhruvmanila · 2024-03-07T15:54:49Z

What to do about f-string tokens: Track quoting style in the tokenizer #10256 (comment)

I would recommend going with FStringStart and FStringMiddle. I think the next step, as mentioned in the PR description, is to encode the information in the AST. The parser would use the FStringStart token to get the information. Later, we could potentially move some of the rules from token-based to AST-based, which would then make this information unnecessary on FStringMiddle.

We could also have a subset of information stored in the FStringMiddle but I think that's what we're trying to avoid here.

AlexWaygood · 2024-03-07T17:46:02Z

The performance regressions in the lexer seem to have disappeared from the benchmarks. Some of the benchmarks on codspeed are still showing regressions of around 1%, but I can't really make much sense of them -- I think they're just noise (correct me if I'm wrong!). There are also some nice speedups on some of the lint rules that have been reworked as part of this PR to make use of the information that's now tracked in the tokens:

Performance improvements on some lint rules

Overall, codspeed measures this PR as performance-neutral.

charliermarsh · 2024-03-07T22:15:29Z

(Consider me signed off, though I'll leave it to Micha / Dhruv to give the green light!)

dhruvmanila

Great work here! Small nits but otherwise good to go from my side.

dhruvmanila · 2024-03-08T03:20:57Z

crates/ruff_python_parser/src/string_token_flags.rs

+        /// The string has a `u` or `U` prefix.
+        /// While this prefix is a no-op at runtime,
+        /// strings with this prefix can have no other prefixes set.
+        const U_PREFIX = 1 << 2;
+
+        /// The string has a `b` or `B` prefix.
+        /// This means that the string is a sequence of `int`s at runtime,
+        /// rather than a sequence of `str`s.
+        /// Strings with this flag can also be raw strings,
+        /// but can have no other prefixes.
+        const B_PREFIX = 1 << 3;
+
+        /// The string has a `f` or `F` prefix, meaning it is an f-string.
+        /// F-strings can also be raw strings,
+        /// but can have no other prefixes.
+        const F_PREFIX = 1 << 4;
+
+        /// The string has an `r` or `R` prefix, meaning it is a raw string.
+        /// F-strings and byte-strings can be raw,
+        /// as can strings with no other prefixes.
+        /// U-strings cannot be raw.
+        const R_PREFIX = 1 << 5;


nit: we can have the same names as in StringPrefix for consistency

I have a weak preference for keeping them "inconsistent", as I think it emphasises the fact that just because something has the R_PREFIX flag set doesn't necessarily mean that's the only prefix it has -- it might have other prefixes as well. In that way, it's semantically different to StringPrefix in an important way, since StringPrefix enumerates all the ways in which the prefixes can be validly combined, whereas the bitflag does not.
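The distinction drawn here can be sketched in a few lines. The bit positions are copied from the snippet above; the `Prefix` enum, the combined constants and `prefix_from_bits` are hypothetical names used only to show that the raw bits admit combinations the enum cannot express.

```rust
pub const U_PREFIX: u8 = 1 << 2;
pub const B_PREFIX: u8 = 1 << 3;
pub const F_PREFIX: u8 = 1 << 4;
pub const R_PREFIX: u8 = 1 << 5;

// Combined bit patterns for the two valid two-prefix combinations.
const RAW_BYTES: u8 = R_PREFIX | B_PREFIX;
const RAW_FORMAT: u8 = R_PREFIX | F_PREFIX;
const PREFIX_MASK: u8 = U_PREFIX | B_PREFIX | F_PREFIX | R_PREFIX;

/// Hypothetical enum of valid prefix combinations.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Prefix {
    Empty,
    Unicode,
    Bytes,
    Format,
    Raw,
    RawBytes,
    RawFormat,
}

/// Map raw prefix bits onto the enum. Bit patterns that do not
/// correspond to a valid Python prefix yield `None`.
pub fn prefix_from_bits(bits: u8) -> Option<Prefix> {
    match bits & PREFIX_MASK {
        0 => Some(Prefix::Empty),
        U_PREFIX => Some(Prefix::Unicode),
        B_PREFIX => Some(Prefix::Bytes),
        F_PREFIX => Some(Prefix::Format),
        R_PREFIX => Some(Prefix::Raw),
        RAW_BYTES => Some(Prefix::RawBytes),
        RAW_FORMAT => Some(Prefix::RawFormat),
        _ => None,
    }
}
```

Testing `bits & R_PREFIX != 0` is true for `r`, `rb` and `rf` alike, whereas matching on the enum distinguishes the three cases.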

crates/ruff_python_parser/src/string_token_flags.rs

...python_parser/src/snapshots/ruff_python_parser__lexer__tests__triple_quoted_windows_eol.snap

crates/ruff_python_parser/src/lexer/fstring.rs

AlexWaygood · 2024-03-08T08:34:05Z

Thanks, all! On to the AST 🚀

This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:

- The quoting style used (double or single quotes)
- Whether the string is triple-quoted or not
- Whether the string is raw or not

This PR is a followup to #10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for #7799.

The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in #10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node. Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.

AlexWaygood requested a review from MichaReiser as a code owner March 6, 2024 19:39

AlexWaygood force-pushed the quotestyle-tokenizer branch from 82c1673 to c18e409 Compare March 6, 2024 21:48

charliermarsh reviewed Mar 6, 2024

crates/ruff_linter/src/rules/pycodestyle/rules/invalid_escape_sequence.rs

AlexWaygood force-pushed the quotestyle-tokenizer branch from c18e409 to a2513c3 Compare March 6, 2024 22:14

charliermarsh reviewed Mar 6, 2024

AlexWaygood force-pushed the quotestyle-tokenizer branch 2 times, most recently from 81ac61f to eab0e4f Compare March 6, 2024 23:02

MichaReiser reviewed Mar 7, 2024

MichaReiser added the internal An internal refactor or improvement label Mar 7, 2024

dhruvmanila reviewed Mar 7, 2024

crates/ruff_python_parser/src/lexer/fstring.rs

crates/ruff_python_parser/src/token.rs

MichaReiser reviewed Mar 7, 2024

..._python_parser/src/snapshots/ruff_python_parser__lexer__tests__fstring_with_format_spec.snap

AlexWaygood force-pushed the quotestyle-tokenizer branch from 216f582 to 5990dd9 Compare March 7, 2024 13:06

AlexWaygood commented Mar 7, 2024

crates/ruff_python_parser/src/string_token_flags.rs

AlexWaygood requested review from dhruvmanila and MichaReiser March 7, 2024 14:19

MichaReiser reviewed Mar 7, 2024

crates/ruff_python_parser/src/string_token_flags.rs

dhruvmanila reviewed Mar 7, 2024

crates/ruff_linter/src/rules/flake8_quotes/rules/avoidable_escaped_quote.rs

dhruvmanila reviewed Mar 7, 2024

crates/ruff_linter/src/rules/flake8_quotes/rules/avoidable_escaped_quote.rs

AlexWaygood requested review from MichaReiser and dhruvmanila March 7, 2024 16:59

dhruvmanila approved these changes Mar 8, 2024

AlexWaygood added 10 commits March 8, 2024 08:32

Track quoting style in Tok::String tokens

fb35fbe

Track all quote information in f-string tokens as well

f0b07d0

Assorted cleanups and simplifications enabled by the refactor

0317ae4

Review comments from Charlie

2361f04

Address Micha/Dhruv's review (mostly)

08fab12

Add regression test for accidentally fixed bug

570e0f2

Address Micha/Dhruv (2)

0ba32d4

nits

525a27c

Address Dhruv's review (3)

0cb3a27

improve docs

4695ec3

AlexWaygood force-pushed the quotestyle-tokenizer branch from 07ed0d0 to 4695ec3 Compare March 8, 2024 08:33

AlexWaygood enabled auto-merge (squash) March 8, 2024 08:34

AlexWaygood merged commit c504d7a into astral-sh:main Mar 8, 2024
17 checks passed

AlexWaygood mentioned this pull request Mar 8, 2024

Start tracking quoting style in the AST #10298

Merged

AlexWaygood deleted the quotestyle-tokenizer branch March 8, 2024 23:13

nkxxll pushed a commit to nkxxll/ruff that referenced this pull request Mar 10, 2024

Track quoting style in the tokenizer (astral-sh#10256)

4ac1b15
