Add support for parsing f-string as per PEP 701 (#7041)

## Summary

This PR adds support for PEP 701 in the parser: it uses the new tokens emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, f-strings were parsed manually. Now that we have the specification, it is used in the LALRPOP grammar to parse f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining implicitly concatenated strings. Now that f-strings no longer need to be parsed manually, a lot of the related code is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token, which is basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings`, but it now takes the already parsed nodes, so it only needs to concatenate them into a single node

> A short primer on the `FStringMiddle` token: This includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

Discussion in the official implementation: python/cpython#102855 (comment)

This change in the AST applies when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='baz some', kind='u')
```

</p>
</details>

But, post Python 3.12, only the string with the `u` prefix will be assigned the kind:

<details><summary>3.12 AST:</summary>
<p>

```python
Constant(value='foo', kind='u'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='baz some')
```

</p>
</details>

Here are some more variations of the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='bazno', kind='u')
```

</p>
</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='bazno')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foo'),
FormattedValue(
    value=Name(id='bar', ctx=Load()),
    conversion=-1),
Constant(value='bazno')
```

</p>
</details>

3. `u"foo" f"bar {baz} really" u"bar" "no"`

<details><summary>Pre 3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
    value=Name(id='baz', ctx=Load()),
    conversion=-1),
Constant(value=' reallybarno', kind='u')
```

</p>
</details>

<details><summary>3.12</summary>
<p>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
    value=Name(id='baz', ctx=Load()),
    conversion=-1),
Constant(value=' reallybarno')
```

</p>
</details>

### Errors

With the hand-written parser, we were able to provide better error messages for cases such as the following. These messages are all removed, and in those cases an "unexpected token" error will be raised by LALRPOP:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" error was removed; we can create a lint rule for that instead. "The f-string expression cannot include the given character" was also removed because f-strings now support those characters, which are mainly the same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (which no longer exists)
2. Add additional test cases as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot now contains the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same, along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
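The `FStringMiddle` primer above can be explored with CPython's own tokenizer, which exposes the equivalent `FSTRING_MIDDLE` token type from Python 3.12 onwards. The sketch below is only an illustration using the standard-library `tokenize` module; it is not Ruff's lexer, and the helper name `fstring_middles` is made up for this example. On interpreters older than 3.12 the token type does not exist, so the helper returns an empty list.

```python
import io
import token
import tokenize


def fstring_middles(source):
    """Collect the text of every FSTRING_MIDDLE token in `source`.

    Hypothetical helper for illustration; not part of Ruff or CPython.
    """
    # FSTRING_MIDDLE only exists in CPython 3.12+ (PEP 701). Older
    # interpreters tokenize the whole f-string as a single STRING token.
    middle = getattr(token, "FSTRING_MIDDLE", None)
    if middle is None:
        return []
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == middle
    ]


# The example f-string from the primer on `FStringMiddle`.
print(fstring_middles('f"foo {bar:.3f{x}} bar"'))
```

On 3.12 or newer, the printed pieces can be compared against the three `FStringMiddle` contents named in the primer.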
astral-sh · Sep 14, 2023 · 0dada9f
1 parent 9820c04
Showing 31 changed files with 24,099 additions and 16,245 deletions.
diff --git a/crates/ruff/src/linter.rs b/crates/ruff/src/linter.rs
@@ -146,6 +146,7 @@ pub fn check_path(
     if use_ast || use_imports || use_doc_lines {
         match ruff_python_parser::parse_program_tokens(
             tokens,
+            source_kind.source_code(),
             &path.to_string_lossy(),
             source_type.is_ipynb(),
         ) {

diff --git a/crates/ruff_benchmark/benches/formatter.rs b/crates/ruff_benchmark/benches/formatter.rs
@@ -65,7 +65,7 @@ fn benchmark_formatter(criterion: &mut Criterion) {
                 let comment_ranges = comment_ranges.finish();
 
                 // Parse the AST.
-                let python_ast = parse_tokens(tokens, Mode::Module, "<filename>")
+                let python_ast = parse_tokens(tokens, case.code(), Mode::Module, "<filename>")
                     .expect("Input to be a valid python program");
 
                 b.iter(|| {

diff --git a/crates/ruff_python_ast/src/nodes.rs b/crates/ruff_python_ast/src/nodes.rs
@@ -2620,6 +2620,14 @@ impl Constant {
             _ => false,
         }
     }
+
+    /// Returns `true` if the constant is a string constant that is a unicode string (i.e., `u"..."`).
+    pub fn is_unicode_string(&self) -> bool {
+        match self {
+            Constant::Str(value) => value.unicode,
+            _ => false,
+        }
+    }
 }
 
 #[derive(Clone, Debug, PartialEq, Eq)]

diff --git a/crates/ruff_python_ast/tests/preorder.rs b/crates/ruff_python_ast/tests/preorder.rs
@@ -130,7 +130,7 @@ fn function_type_parameters() {
 
 fn trace_preorder_visitation(source: &str) -> String {
     let tokens = lex(source, Mode::Module);
-    let parsed = parse_tokens(tokens, Mode::Module, "test.py").unwrap();
+    let parsed = parse_tokens(tokens, source, Mode::Module, "test.py").unwrap();
 
     let mut visitor = RecordVisitor::default();
     visitor.visit_mod(&parsed);

diff --git a/crates/ruff_python_ast/tests/visitor.rs b/crates/ruff_python_ast/tests/visitor.rs
@@ -131,7 +131,7 @@ fn function_type_parameters() {
 
 fn trace_visitation(source: &str) -> String {
     let tokens = lex(source, Mode::Module);
-    let parsed = parse_tokens(tokens, Mode::Module, "test.py").unwrap();
+    let parsed = parse_tokens(tokens, source, Mode::Module, "test.py").unwrap();
 
     let mut visitor = RecordVisitor::default();
     walk_module(&mut visitor, &parsed);

diff --git a/crates/ruff_python_formatter/src/cli.rs b/crates/ruff_python_formatter/src/cli.rs
@@ -57,7 +57,7 @@ pub fn format_and_debug_print(input: &str, cli: &Cli, source_type: &Path) -> Res
 
     // Parse the AST.
     let python_ast =
-        parse_tokens(tokens, Mode::Module, "<filename>").context("Syntax error in input")?;
+        parse_tokens(tokens, input, Mode::Module, "<filename>").context("Syntax error in input")?;
 
     let options = PyFormatOptions::from_extension(source_type);
     let formatted = format_node(&python_ast, &comment_ranges, input, options)

diff --git a/crates/ruff_python_formatter/src/comments/mod.rs b/crates/ruff_python_formatter/src/comments/mod.rs
@@ -553,7 +553,7 @@ mod tests {
 
             let comment_ranges = comment_ranges.finish();
 
-            let parsed = parse_tokens(tokens, Mode::Module, "test.py")
+            let parsed = parse_tokens(tokens, code, Mode::Module, "test.py")
                 .expect("Expect source to be valid Python");
 
             CommentsTestCase {

diff --git a/crates/ruff_python_formatter/src/lib.rs b/crates/ruff_python_formatter/src/lib.rs
@@ -139,7 +139,7 @@ pub fn format_module(
     let comment_ranges = comment_ranges.finish();
 
     // Parse the AST.
-    let python_ast = parse_tokens(tokens, Mode::Module, "<filename>")?;
+    let python_ast = parse_tokens(tokens, contents, Mode::Module, "<filename>")?;
 
     let formatted = format_node(&python_ast, &comment_ranges, contents, options)?;
 
@@ -237,7 +237,7 @@ if True:
 
         // Parse the AST.
         let source_path = "code_inline.py";
-        let python_ast = parse_tokens(tokens, Mode::Module, source_path).unwrap();
+        let python_ast = parse_tokens(tokens, src, Mode::Module, source_path).unwrap();
         let options = PyFormatOptions::from_extension(Path::new(source_path));
         let formatted = format_node(&python_ast, &comment_ranges, src, options).unwrap();
 

diff --git a/crates/ruff_python_parser/src/lib.rs b/crates/ruff_python_parser/src/lib.rs
@@ -146,6 +146,7 @@ pub fn tokenize(contents: &str, mode: Mode) -> Vec<LexResult> {
 /// Parse a full Python program from its tokens.
 pub fn parse_program_tokens(
     lxr: Vec<LexResult>,
+    source: &str,
     source_path: &str,
     is_jupyter_notebook: bool,
 ) -> anyhow::Result<Suite, ParseError> {
@@ -154,7 +155,7 @@ pub fn parse_program_tokens(
     } else {
         Mode::Module
     };
-    match parse_tokens(lxr, mode, source_path)? {
+    match parse_tokens(lxr, source, mode, source_path)? {
         Mod::Module(m) => Ok(m.body),
         Mod::Expression(_) => unreachable!("Mode::Module doesn't return other variant"),
     }

diff --git a/crates/ruff_python_parser/src/parser.rs b/crates/ruff_python_parser/src/parser.rs
@@ -50,7 +50,7 @@ use ruff_python_ast::{Mod, ModModule, Suite};
 /// ```
 pub fn parse_program(source: &str, source_path: &str) -> Result<ModModule, ParseError> {
     let lexer = lex(source, Mode::Module);
-    match parse_tokens(lexer, Mode::Module, source_path)? {
+    match parse_tokens(lexer, source, Mode::Module, source_path)? {
         Mod::Module(m) => Ok(m),
         Mod::Expression(_) => unreachable!("Mode::Module doesn't return other variant"),
     }
@@ -78,7 +78,7 @@ pub fn parse_suite(source: &str, source_path: &str) -> Result<Suite, ParseError>
 /// ```
 pub fn parse_expression(source: &str, source_path: &str) -> Result<ast::Expr, ParseError> {
     let lexer = lex(source, Mode::Expression);
-    match parse_tokens(lexer, Mode::Expression, source_path)? {
+    match parse_tokens(lexer, source, Mode::Expression, source_path)? {
         Mod::Expression(expression) => Ok(*expression.body),
         Mod::Module(_m) => unreachable!("Mode::Expression doesn't return other variant"),
     }
@@ -107,7 +107,7 @@ pub fn parse_expression_starts_at(
     offset: TextSize,
 ) -> Result<ast::Expr, ParseError> {
     let lexer = lex_starts_at(source, Mode::Module, offset);
-    match parse_tokens(lexer, Mode::Expression, source_path)? {
+    match parse_tokens(lexer, source, Mode::Expression, source_path)? {
         Mod::Expression(expression) => Ok(*expression.body),
         Mod::Module(_m) => unreachable!("Mode::Expression doesn't return other variant"),
     }
@@ -193,7 +193,7 @@ pub fn parse_starts_at(
     offset: TextSize,
 ) -> Result<ast::Mod, ParseError> {
     let lxr = lexer::lex_starts_at(source, mode, offset);
-    parse_tokens(lxr, mode, source_path)
+    parse_tokens(lxr, source, mode, source_path)
 }
 
 /// Parse an iterator of [`LexResult`]s using the specified [`Mode`].
@@ -208,32 +208,37 @@ pub fn parse_starts_at(
 /// ```
 /// use ruff_python_parser::{lexer::lex, Mode, parse_tokens};
 ///
-/// let expr = parse_tokens(lex("1 + 2", Mode::Expression), Mode::Expression, "<embedded>");
+/// let source = "1 + 2";
+/// let expr = parse_tokens(lex(source, Mode::Expression), source, Mode::Expression, "<embedded>");
 /// assert!(expr.is_ok());
 /// ```
 pub fn parse_tokens(
     lxr: impl IntoIterator<Item = LexResult>,
+    source: &str,
     mode: Mode,
     source_path: &str,
 ) -> Result<ast::Mod, ParseError> {
     let lxr = lxr.into_iter();
 
     parse_filtered_tokens(
         lxr.filter_ok(|(tok, _)| !matches!(tok, Tok::Comment { .. } | Tok::NonLogicalNewline)),
+        source,
         mode,
         source_path,
     )
 }
 
 fn parse_filtered_tokens(
     lxr: impl IntoIterator<Item = LexResult>,
+    source: &str,
     mode: Mode,
     source_path: &str,
 ) -> Result<ast::Mod, ParseError> {
     let marker_token = (Tok::start_marker(mode), TextRange::default());
     let lexer = iter::once(Ok(marker_token)).chain(lxr);
     python::TopParser::new()
         .parse(
+            source,
             mode,
             lexer
                 .into_iter()
@@ -1237,11 +1242,55 @@ a = 1
     "#
         .trim();
         let lxr = lexer::lex_starts_at(source, Mode::Ipython, TextSize::default());
-        let parse_err = parse_tokens(lxr, Mode::Module, "<test>").unwrap_err();
+        let parse_err = parse_tokens(lxr, source, Mode::Module, "<test>").unwrap_err();
         assert_eq!(
             parse_err.to_string(),
             "IPython escape commands are only allowed in `Mode::Ipython` at byte offset 6"
                 .to_string()
         );
     }
+
+    #[test]
+    fn test_fstrings() {
+        let parse_ast = parse_suite(
+            r#"
+f"{" f"}"
+f"{foo!s}"
+f"{3,}"
+f"{3!=4:}"
+f'{3:{"}"}>10}'
+f'{3:{"{"}>10}'
+f"{  foo =  }"
+f"{  foo =  :.3f  }"
+f"{  foo =  !s  }"
+f"{  1, 2  =  }"
+f'{f"{3.1415=:.1f}":*^20}'
+
+{"foo " f"bar {x + y} " "baz": 10}
+match foo:
+    case "foo " f"bar {x + y} " "baz":
+        pass
+"#
+            .trim(),
+            "<test>",
+        )
+        .unwrap();
+        insta::assert_debug_snapshot!(parse_ast);
+    }
+
+    #[test]
+    fn test_fstrings_with_unicode() {
+        let parse_ast = parse_suite(
+            r#"
+u"foo" f"{bar}" "baz" " some"
+"foo" f"{bar}" u"baz" " some"
+"foo" f"{bar}" "baz" u" some"
+u"foo" f"bar {baz} really" u"bar" "no"
+"#
+            .trim(),
+            "<test>",
+        )
+        .unwrap();
+        insta::assert_debug_snapshot!(parse_ast);
+    }
 }
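
To cross-check the `test_fstrings_with_unicode` snapshot against CPython, the `kind` field of each string constant can be dumped with the standard-library `ast` module. This is a minimal sketch, assuming Python 3.8 or newer (where `Constant.kind` exists); the helper name `constant_kinds` is hypothetical, and the output for some inputs depends on whether the interpreter is older than 3.12, as described in the commit message.

```python
import ast


def constant_kinds(source):
    """Return (value, kind) pairs for the string constants of an
    implicitly concatenated f-string expression.

    Hypothetical helper for illustration; not part of Ruff.
    """
    expr = ast.parse(source, mode="eval").body
    # Implicit concatenation that includes an f-string yields a JoinedStr.
    if not isinstance(expr, ast.JoinedStr):
        raise TypeError("expected an f-string expression")
    return [
        (part.value, part.kind)
        for part in expr.values
        if isinstance(part, ast.Constant)
    ]


# The same inputs as in `test_fstrings_with_unicode`.
SOURCES = [
    'u"foo" f"{bar}" "baz" " some"',
    '"foo" f"{bar}" u"baz" " some"',
    '"foo" f"{bar}" "baz" u" some"',
    'u"foo" f"bar {baz} really" u"bar" "no"',
]

for source in SOURCES:
    print(source, "->", constant_kinds(source))
```

Running this under both a pre-3.12 and a 3.12+ interpreter should show the difference in `kind` assignment that the new parser is meant to follow.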