Update string nodes for implicit concatenation #7927

dhruvmanila · 2023-10-12T12:50:32Z

Summary

This PR updates the string nodes (ExprStringLiteral, ExprBytesLiteral, and ExprFString) to account for implicit string concatenation.

Motivation

In Python, implicitly concatenated strings are joined while parsing because the interpreter doesn't need the information for each part. While that's feasible for an interpreter, it falls short for a static analysis tool, where having such information is more useful. Currently, various parts of the code use the lexer to get the individual string parts.

One of the main challenges this solves is string formatting. Currently, the formatter relies on the lexer to get the individual string parts and formats them, including the comments, accordingly. But with PEP 701, f-strings can also contain comments. Without this change, it becomes very difficult to add support for f-string formatting.

Implementation

The initial proposal was made in this discussion: #6183 (comment). Various AST designs were explored for this task; they are available in the linked internal document¹.

The selected variant keeps the nodes as they are, except that the implicit_concatenated field is removed and a new value struct is added to each Expr* struct instead. This is a private struct that contains the actual implementation of how the AST is designed for both single and implicitly concatenated strings.

This implementation is achieved through an enum with two variants, Single and Concatenated, to avoid allocating a vector even for single strings. There are various public methods available on the value struct to query information about the node.
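A minimal sketch of the Single/Concatenated idea may help here. It is not the code from this PR: the type is flattened into a public enum for brevity, and the names `StringLiteralValue`, `parts`, and `to_joined` are assumptions chosen for illustration.

```rust
/// One string literal part, e.g. `"foo"` in `"foo" "bar"`.
#[derive(Clone, Debug, PartialEq)]
pub struct StringLiteral {
    pub value: String,
}

/// Hypothetical value type. `Single` stores its part inline, so only
/// implicitly concatenated strings pay for a `Vec` allocation.
#[derive(Clone, Debug, PartialEq)]
pub enum StringLiteralValue {
    Single(StringLiteral),
    Concatenated(Vec<StringLiteral>),
}

impl StringLiteralValue {
    /// Returns `true` if the string is made up of more than one literal.
    pub fn is_implicit_concatenated(&self) -> bool {
        matches!(self, Self::Concatenated(_))
    }

    /// Returns all parts as a slice, without allocating in either case.
    pub fn parts(&self) -> &[StringLiteral] {
        match self {
            Self::Single(part) => std::slice::from_ref(part),
            Self::Concatenated(parts) => parts.as_slice(),
        }
    }

    /// Joins the parts into one owned string (this does allocate).
    pub fn to_joined(&self) -> String {
        self.parts().iter().map(|part| part.value.as_str()).collect()
    }
}
```

Callers that only need to iterate can use `parts()` for both variants, which is what keeps the single-string case free of a vector.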

The nodes are structured in the following way:

ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"

ExprBytesLiteral - b"foo" b"bar"
|- BytesLiteral - b"foo"
|- BytesLiteral - b"bar"

ExprFString - "foo" f"bar {x}"
|- FStringPart::Literal - "foo"
|- FStringPart::FString - f"bar {x}"
  |- StringLiteral - "bar "
  |- FormattedValue - "x"

Visitor

The nodes are structured so that the entire string, including all the parts that are implicitly concatenated, is a single node containing individual nodes for the parts. The previous section has a representation of that tree for all the string nodes. This means that new visitor methods are added to visit the individual parts of strings, bytes, and f-strings for Visitor, PreorderVisitor, and Transformer.
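The traversal order described above can be sketched roughly as follows. This is a simplified stand-in, not the actual `Visitor` trait: the node types are reduced to a minimum, and `visit_expr_string_literal`, `walk_expr_string_literal`, and `PartCollector` are names assumed for illustration.

```rust
pub struct StringLiteral {
    pub value: String,
}

/// Simplified whole-string node that owns its parts.
pub struct ExprStringLiteral {
    pub parts: Vec<StringLiteral>,
}

pub trait Visitor {
    /// Visits the string as a whole, then descends into its parts.
    fn visit_expr_string_literal(&mut self, expr: &ExprStringLiteral) {
        walk_expr_string_literal(self, expr);
    }

    /// Visits a single part; does nothing by default.
    fn visit_string_literal(&mut self, _part: &StringLiteral) {}
}

pub fn walk_expr_string_literal<V: Visitor + ?Sized>(visitor: &mut V, expr: &ExprStringLiteral) {
    for part in &expr.parts {
        visitor.visit_string_literal(part);
    }
}

/// Example visitor that records every part it sees, in order.
#[derive(Default)]
pub struct PartCollector {
    pub seen: Vec<String>,
}

impl Visitor for PartCollector {
    fn visit_string_literal(&mut self, part: &StringLiteral) {
        self.seen.push(part.value.clone());
    }
}
```

A rule that only cares about individual parts overrides the part-level method and leaves the whole-string method at its default.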

Test Plan

cargo insta test --workspace --all-features --unreferenced reject
Verify that the ecosystem results are unchanged

¹ Internal document: https://www.notion.so/astral-sh/Implicit-String-Concatenation-e036345dc48943f89e416c087bf6f6d9?pvs=4

dhruvmanila · 2023-10-12T12:50:43Z

Current dependencies on/for this PR:

main
- PR Remove leftover constant tuple reference #8062
  - PR New Singleton enum for PatternMatchSingleton node #8063
    - PR Split Constant to individual literal nodes #8064
      - PR Update string nodes for implicit concatenation #7927 👈
        
        PR Explicit as_str (no deref), add no allocation methods #8826
        
        PR Remove #[allow(unused_variables)] from visitor methods #8828
        
        PR New AST nodes for f-string elements #8835
        
        PR Introduce StringLike enum #9016

This stack of pull requests is managed by Graphite.

crates/ruff_linter/src/rules/pyupgrade/rules/use_pep604_annotation.rs

BurntSushi

Ah this is much better! Thank you for breaking it up. It made it oodles easier to review. :-)

Overall great work. I especially found the new fstring/string/byte-string types pretty easy to follow and understand. Their structure makes sense to me.

BurntSushi · 2023-11-22T12:41:53Z

crates/ruff_python_ast/src/nodes.rs

+    ///
+    /// This is always going to be `FStringPart::FString` variant which is
+    /// maintained by the `FStringValue::single` constructor.
+    Single(FStringPart),


My first thought here is to wonder why, based on the comment, this isn't Single(FString) in order to better encode the invariant.

Hmmm yes, I see, the FStringValue::parts method makes it rather annoying to do that without extra type machinery that probably isn't worth doing.

Yeah, I tried adding a FStringPartRef enum but then we can't get the mutable reference for it which is required in the Transformer.

enum FStringPartRef<'a> {
    Literal(&'a StringLiteral),
    FString(&'a FString),
}
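To illustrate the trade-off in this thread, here is a rough sketch of why storing `Single(FStringPart)` keeps both shared and mutable access simple. The types are heavily simplified, and the `source` field and the `parts_mut` method are assumptions for the example rather than the PR's definitions.

```rust
#[derive(Debug, PartialEq)]
pub struct StringLiteral {
    pub value: String,
}

#[derive(Debug, PartialEq)]
pub struct FString {
    /// Hypothetical field standing in for the real f-string contents.
    pub source: String,
}

#[derive(Debug, PartialEq)]
pub enum FStringPart {
    Literal(StringLiteral),
    FString(FString),
}

pub enum FStringValue {
    /// Invariant upheld by `FStringValue::single`: always `FStringPart::FString`.
    Single(FStringPart),
    Concatenated(Vec<FStringPart>),
}

impl FStringValue {
    pub fn single(fstring: FString) -> Self {
        Self::Single(FStringPart::FString(fstring))
    }

    /// Because `Single` already holds an `FStringPart`, a one-element slice
    /// can be borrowed directly; `Single(FString)` would need a wrapper type.
    pub fn parts(&self) -> &[FStringPart] {
        match self {
            Self::Single(part) => std::slice::from_ref(part),
            Self::Concatenated(parts) => parts.as_slice(),
        }
    }

    /// The mutable counterpart, which a transformer-style pass would need.
    pub fn parts_mut(&mut self) -> &mut [FStringPart] {
        match self {
            Self::Single(part) => std::slice::from_mut(part),
            Self::Concatenated(parts) => parts.as_mut_slice(),
        }
    }
}
```

The cost is that the invariant lives in the constructor and a doc comment instead of in the type system.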

crates/ruff_python_ast/src/nodes.rs

crates/ruff_python_ast/src/visitor.rs

As highlighted in the review:

> If you have two `ConcatenatedStringLiteral` values where both have equivalent values for `strings` but where one has `value` initialized and the other does not, would you expect them to compare equal? Semantically I think I would, since the alternative is that equality is dependent on whether `as_str()` has been called, which seems incidental.

#7927 (comment)
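The equality semantics from the quoted review can be sketched as below. This is an illustrative version under assumptions: the field names `strings` and `value` follow the quote, but the use of `std::sync::OnceLock` as the lazy cache and the exact method set are choices made for this example.

```rust
use std::sync::OnceLock;

#[derive(Clone, Debug, PartialEq)]
pub struct StringLiteral {
    pub value: String,
}

#[derive(Debug)]
pub struct ConcatenatedStringLiteral {
    /// The individual parts, in source order.
    strings: Vec<StringLiteral>,
    /// Lazily computed joined value; filled on the first `as_str` call.
    value: OnceLock<String>,
}

impl ConcatenatedStringLiteral {
    pub fn new(strings: Vec<StringLiteral>) -> Self {
        Self { strings, value: OnceLock::new() }
    }

    /// Joins the parts once and caches the result.
    pub fn as_str(&self) -> &str {
        self.value
            .get_or_init(|| self.strings.iter().map(|s| s.value.as_str()).collect())
    }
}

/// Manual impl: compare only the parts, so equality does not depend on
/// whether the cache has been initialized.
impl PartialEq for ConcatenatedStringLiteral {
    fn eq(&self, other: &Self) -> bool {
        self.strings == other.strings
    }
}
```

A derived `PartialEq` would also compare the cache field, which is exactly the incidental dependence on `as_str()` that the review comment warns about.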

This commit implements the AST design to account for implicit concatenation in string nodes, specifically the `ExprFString`, `ExprStringLiteral`, and `ExprBytesLiteral` nodes.

This commit adds the new variants for the string parts to the `AnyNode` and `AnyNodeRef` enums. These parts are `StringLiteral` (part of `ExprStringLiteral`), `BytesLiteral` (part of `ExprBytesLiteral`), and `FString` (part of `ExprFString`). The reason for this is to add visitor methods for these parts, which is done in the following commit. So, the visitor would visit the string as a whole first and then visit each part.

```
ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"
```

The above tree helps to understand how the visitor would work.

The visitor implementations are updated to visit each part node for the respective string nodes. The following example better highlights this:

```
ExprStringLiteral - "foo" "bar"
|- StringLiteral - "foo"
|- StringLiteral - "bar"
```

The `visit_expr` method would be used to visit the `ExprStringLiteral` while the `visit_string_literal` method would be used for the `StringLiteral` node. Similar methods are added for bytes and f-strings.

The generator now preserves implicit concatenation. Earlier, for an implicitly concatenated string we would produce the joined form. So, for the following example:

```python
"foo" "bar" "baz"
```

the generator would give us:

```python
"foobarbaz"
```

Now that we have the information for each part, we produce the exact code back.

`Expr` is a general type for all expressions while `LiteralExpressionRef` is a type which includes only the literal expressions. The method is suited more for this type instead. This will also help in the formatter change.

crates/ruff_linter/src/rules/pylint/rules/bad_string_format_type.rs

charliermarsh · 2023-11-22T17:45:16Z

crates/ruff_linter/src/rules/ruff/rules/explicit_f_string_type_conversion.rs

-    inner(&mut expr.left, &mut formatted_expressions);
-    inner(&mut expr.right, &mut formatted_expressions);
-    formatted_expressions
+    .map(|output| Fix::safe_edit(Edit::range_replacement(output, f_string.range())))


Interesting, how did this get so much simpler?

Ah, that's the magic of this refactor 😉

Jokes aside, this basically reverts this commit ac4a4da which fixed the bug where this rule wasn't producing the correct fix for an implicitly concatenated string. But, now that we work on individual parts, that isn't required.

crates/ruff_python_ast/src/nodes.rs

charliermarsh

This is impressive work. There's a huge surface area here. You did a great job of breaking down the individual PRs and walking us through it.

I left a few questions, which I'm happy to continue discussing, but I'm erring on the side of approving.

charliermarsh · 2023-11-22T17:55:44Z

crates/ruff_python_ast/src/nodes.rs

+#[derive(Clone, Debug, PartialEq)]
+pub struct FString {
+    pub range: TextRange,
+    pub values: Vec<Expr>,


I wonder if these should be smallvec... It's probably worth benchmarking in a separate PR. In practice, I'd bet the vast majority of concatenations are <= 5 elements.

charliermarsh · 2023-11-22T17:57:26Z

crates/ruff_linter/src/rules/flake8_pyi/rules/unrecognized_platform.rs

@@ -130,7 +130,7 @@ pub(crate) fn unrecognized_platform(checker: &mut Checker, test: &Expr) {
            if !matches!(value.as_str(), "linux" | "win32" | "cygwin" | "darwin") {


This is an example where... we can probably do this without the concatenation? Imagine if we had a custom matcher that worked on implicit concatenations (i.e., a vector of strings). \cc @BurntSushi in case you have obvious ideas here.

I think the closest "out of the box" experience you can get is stream searching with aho-corasick, but you need to provide a std::io::Read impl. Mildly annoying but doable in this context.

But I would say to write the simple/obvious code for now, and if this ends up popping up on a profile then we can revisit it and do something bespoke here or use aho-corasick.

I think I would prefer to avoid the complexity for such simple cases but I do understand that this could be useful in similar cases where the strings are long.
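For reference, the "custom matcher" idea floated above could look something like the following sketch, which compares a sequence of parts against a target without building the joined string. The function names `concat_eq` and `is_known_platform` are hypothetical, and parts are modelled as plain `&str` slices rather than AST nodes.

```rust
/// Returns `true` if concatenating `parts` yields exactly `target`,
/// without allocating the joined string.
pub fn concat_eq(parts: &[&str], target: &str) -> bool {
    let mut rest = target;
    for part in parts {
        // Each part must match the front of what is left of the target.
        match rest.strip_prefix(*part) {
            Some(tail) => rest = tail,
            None => return false,
        }
    }
    // Every part matched; the target must be fully consumed as well.
    rest.is_empty()
}

/// Allocation-free variant of the `matches!` check in the diff above.
pub fn is_known_platform(parts: &[&str]) -> bool {
    ["linux", "win32", "cygwin", "darwin"]
        .iter()
        .any(|name| concat_eq(parts, name))
}
```

As the thread concludes, this is probably only worth it if the joined form shows up on a profile; for short strings the simple `as_str()` comparison is easier to read.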

@davidszotten

Rebase of #6365 authored by @davidszotten.

## Summary

This PR updates the AST structure for f-string elements.

The main **motivation** behind this change is to have a dedicated node for the string part of an f-string. Previously, the existing `ExprStringLiteral` node was used for this purpose, which isn't exactly correct. The `ExprStringLiteral` node should include the quotes in its range, but the f-string literal element doesn't include the quotes as it's a specific part within an f-string. For example,

```python
f"foo {x}"
# ^^^^
# This is the literal part of an f-string
```

The new `FStringElement` enum represents either the literal part or the expression part of an f-string.

### Rule Updates

This means that there'll be two nodes representing a string depending on the context: one for a normal string literal and the other for a string literal within an f-string. The AST checker is updated to accommodate this change. The rules which work on string literals are updated to check the literal part of an f-string as well.

#### Notes

1. The `Expr::is_literal_expr` method would check for `ExprStringLiteral` and return true if so. But now that we don't represent the literal part of an f-string using that node, this improves the method's behavior and confines it to actual expressions. We do have the `FStringElement::is_literal` method.
2. We avoid checking if we're in an f-string context before adding to `string_type_definitions` because the f-string literal is now a dedicated node and not part of `Expr`.
3. Annotations cannot use f-strings, so we avoid changing any rules which work on annotations and check for `ExprStringLiteral`.

## Test Plan

- All references of `Expr::StringLiteral` were checked to see if any of the rules require updating to account for the f-string literal element node.
- New test cases are added for rules which check against the literal part of an f-string.
- Check the ecosystem results and ensure they remain unchanged.

## Performance

There's a performance penalty in the parser. The reason for this remains unknown as it seems that the generated assembly code is now different for the `__reduce154` function. The reduce function body is just popping the `ParenthesizedExpr` on top of the stack and pushing it with the new location.

- The size of the `FStringElement` enum is the same as `Expr`, which is what it replaces in `FString::format_spec`
- The size of `FStringExpressionElement` is the same as `ExprFormattedValue`, which is what it replaces

I tried reducing the `Expr` enum from 80 bytes to 72 bytes but it hardly resulted in any performance gain. The difference can be seen here:

- Original profile: https://share.firefox.dev/3Taa7ES
- Profile after boxing some node fields: https://share.firefox.dev/3GsNXpD

### Backtracking

I tried backtracking the changes to see if any isolated change produced this regression. The problem here is that the overall change is so small that there's only a single checkpoint where I can backtrack, and that checkpoint results in the same regression. This checkpoint is to revert to using `Expr` for the `FString::format_spec` field. After this point, the change would revert back to the original implementation.

## Review process

The review process is similar to #7927. The first set of commits updates the node structure, parser, and related AST files. Then, further commits update the linter and formatter parts to account for the AST change.

---------

Co-authored-by: David Szotten <davidszotten@gmail.com>

dhruvmanila added the parser (Related to the parser) and internal (An internal refactor or improvement) labels and removed the parser (Related to the parser) label Oct 12, 2023

dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch 2 times, most recently from 7c763c5 to 02ff747 Compare October 19, 2023 13:48

dhruvmanila changed the base branch from main to dhruv/constant-to-literal October 19, 2023 13:48

This was referenced Oct 19, 2023

Remove leftover constant tuple reference #8062

Merged

New Singleton enum for PatternMatchSingleton node #8063

Merged

Split Constant to individual literal nodes #8064

Merged

dhruvmanila changed the title ~~POC: AST node for implicit string concatenation~~ New AST node for implicit string concatenation Oct 19, 2023

dhruvmanila force-pushed the dhruv/constant-to-literal branch from 5b20992 to 66ae3fc Compare October 19, 2023 17:52

dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch from 02ff747 to ada537e Compare October 19, 2023 17:53

dhruvmanila force-pushed the dhruv/constant-to-literal branch from 66ae3fc to fd9f090 Compare October 19, 2023 19:27

dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch from ada537e to 1ae9b42 Compare October 19, 2023 19:27

dhruvmanila force-pushed the dhruv/constant-to-literal branch from fd9f090 to 7111576 Compare October 19, 2023 19:33

dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch 2 times, most recently from fc67e03 to 56e27ac Compare October 20, 2023 06:21

dhruvmanila force-pushed the dhruv/constant-to-literal branch from 7111576 to f544b3f Compare October 20, 2023 10:47

dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch 2 times, most recently from 13534ce to 382e85d Compare October 20, 2023 14:07

konstin reviewed Oct 23, 2023

crates/ruff_linter/src/rules/pyupgrade/rules/use_pep604_annotation.rs (outdated)

dhruvmanila force-pushed the dhruv/constant-to-literal branch 2 times, most recently from cf8ee21 to aaf65a1 Compare October 30, 2023 05:50

Base automatically changed from dhruv/constant-to-literal to main October 30, 2023 06:43

dhruvmanila mentioned this pull request Nov 1, 2023

Consider respecting avoid-escape when enforcing other flake8-quotes rules #7889

Open

dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch 3 times, most recently from 07e49e4 to 8a392a2 Compare November 14, 2023 13:38

dhruvmanila changed the title ~~New AST node for implicit string concatenation~~ Update string nodes for implicit concatenation Nov 14, 2023

BurntSushi approved these changes Nov 22, 2023

dhruvmanila added 13 commits November 22, 2023 09:51

Update existing string nodes for implicit concatenation

86d799d

This commit implements the AST design to account for implicit concatenation in string nodes, specifically the `ExprFString`, `ExprStringLiteral`, and `ExprBytesLiteral` nodes.

Update parser for the new AST design

223380f

Update parser snapshots

21ed36c

Update ComparableExpr to account for each string parts

6be8eca

Update the AST helper methods

1c6dc52

misc: use Self consistently

6cdf149

Move is_implicit_concatenated to a narrow type

95f71da

`Expr` is a general type for all expressions while `LiteralExpressionRef` is a type which includes only the literal expressions. The method is suited more for this type instead. This will also help in the formatter change.

Update the linter code for the new AST design

6d6e1c7

Update the formatter code for the new AST design

bcabcca

charliermarsh reviewed Nov 22, 2023

crates/ruff_linter/src/rules/pylint/rules/bad_string_format_type.rs

charliermarsh reviewed Nov 22, 2023

crates/ruff_python_ast/src/nodes.rs

charliermarsh approved these changes Nov 22, 2023

charliermarsh reviewed Nov 22, 2023

dhruvmanila force-pushed the dhruv/implicit-str-concat-node branch from baa24e7 to 8193ff9 Compare November 23, 2023 14:47

dhruvmanila mentioned this pull request Nov 23, 2023

Explicit as_str (no deref), add no allocation methods #8826

Merged

misc: use correct names for visitor arguments

1e72af2

This was referenced Nov 23, 2023

Remove #[allow(unused_variables)] from visitor methods #8828

Merged

New AST nodes for f-string elements #8835

Merged

dhruvmanila merged commit 017e829 into main Nov 24, 2023
17 checks passed

dhruvmanila deleted the dhruv/implicit-str-concat-node branch November 24, 2023 23:55

dhruvmanila mentioned this pull request Dec 5, 2023

Introduce StringLike enum #9016

Merged
