
Use TokenSource to find new location for re-lexing #12060

Merged (3 commits, Jun 27, 2024)

Conversation

@dhruvmanila (Member) commented Jun 27, 2024

Summary

This PR splits the re-lexing logic into two parts:

  1. TokenSource: responsible for finding the position the lexer needs to be moved to
  2. Lexer: responsible for reducing the nesting level and moving itself to the new position when recovering from a parenthesized context

This split makes it easy to find the new lexer position without re-implementing the backwards lexing logic, which would otherwise need to handle:

  • Different kinds of newlines
  • Line continuation character(s)
  • Comments
  • Whitespace
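The token-based backwards scan can be sketched like this (a minimal Python sketch with hypothetical names and token kinds; not ruff's actual implementation):

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str   # e.g. "NonLogicalNewline", "Comment", "Name"
    start: int  # offset of the token in the source


def find_relex_position(tokens):
    """Scan the already-emitted tokens backwards for the most recent
    non-logical newline; the lexer is then moved back to its start.

    Working on tokens sidesteps character-level handling of newline
    kinds, line continuations, comments, and whitespace."""
    for token in reversed(tokens):
        if token.kind == "NonLogicalNewline":
            return token.start
    return None  # nothing to recover at


tokens = [Token("Lpar", 0), Token("Name", 1),
          Token("NonLogicalNewline", 2), Token("Name", 4)]
print(find_relex_position(tokens))  # -> 2
```

Because each case above was already classified when the token was first emitted, the scan never has to re-inspect raw characters.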

F-strings

This change did reveal one thing about re-lexing f-strings. Consider the following example:

```py
f'{'
#  ^
f'foo'
```

Here, the quote highlighted by the caret (^) starts a string inside an f-string expression. That string is unterminated, so the token emitted is actually Unknown. The parser tries to recover from it, but there's no newline token in the token vector, so the new logic doesn't recover. The previous logic did recover, because it looked at the raw characters instead.

The parser would be at FStringStart (the one for the second line) when it calls into the re-lexing logic to recover from the unterminated f-string on the first line. So, moving backwards, the first character encountered is a newline character, but the first token encountered is an Unknown token.
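The mismatch between the character view and the token view can be illustrated with a small sketch (the token kinds below are hypothetical stand-ins, not ruff's exact token names):

```python
source = "f'{'\nf'foo'"

# Character-based recovery (the old logic): scanning backwards from
# the start of the second line, the first character seen is '\n',
# so recovery succeeds.
second_line_start = source.index("\n") + 1
assert source[second_line_start - 1] == "\n"

# Token-based recovery (the new logic): a plausible token stream for
# the first line. The unterminated string inside the f-string
# expression is emitted as Unknown, and no newline token follows it,
# so scanning the token vector backwards finds nothing to recover at.
line1_tokens = ["FStringStart", "Lbrace", "Unknown"]
assert "Newline" not in line1_tokens
```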

This is improved in #12067.

fixes: #12046
fixes: #12036

Test Plan

Update the snapshot and validate the changes.

github-actions bot (Contributor) commented Jun 27, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dhruvmanila dhruvmanila force-pushed the dhruv/re-lexing branch 2 times, most recently from f7fd083 to 26791e1 Compare June 27, 2024 09:20
@dhruvmanila dhruvmanila added bug Something isn't working parser Related to the parser labels Jun 27, 2024
@dhruvmanila dhruvmanila marked this pull request as ready for review June 27, 2024 10:20
@dhruvmanila dhruvmanila changed the base branch from main to dhruv/unterminated-string June 27, 2024 10:28
@MichaReiser (Member) left a comment:

Nice!

@@ -1388,84 +1392,35 @@ impl<'src> Lexer<'src> {
return false;
}

let mut current_position = self.current_range().start();
Member:

Oh wow, that's a lot of code that is gone now :)

Comment on lines +1399 to +1413
// Earlier we reduced the nesting level unconditionally. Now that we know the lexer's
// position is going to be moved back, the lexer needs to be put back into a
// parenthesized context if the current token is a closing parenthesis.
//
// ```py
// (a, [b,
// c
// )
// ```
//
// Here, the parser would request to re-lex the token when it's at `)` and can recover
// from an unclosed `[`. This method will move the lexer back to the newline character
// after `c` which means it goes back into parenthesized context.
if matches!(
self.current_kind,
Member:

Does it still make sense to reduce the nesting level above unconditionally or could we invert the condition here and only then reduce the nesting?

@dhruvmanila (Author):

Yeah, good point, we can invert the condition

@dhruvmanila (Author):

Hmm, actually, it might still not be possible. Let me confirm

}
}

if self.lexer.re_lex_logical_token(non_logical_newline_start) {
let current_start = self.current_range().start();
Member:

Not important and I'm fine to keep it this way. I was just wondering if we could store the offset of the non_logical_newline and truncate the self.tokens to that position if re_lex_logical_token returns true.
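The alternative described here could look roughly like this (a sketch with hypothetical names; not the actual code):

```python
tokens = ["Lpar", "Name", "NonLogicalNewline", "Name", "Rpar"]

# Suppose the index of the non-logical newline token was recorded at
# the moment it was emitted, instead of being searched for afterwards:
newline_index = tokens.index("NonLogicalNewline")

# If re_lex_logical_token succeeds, every token after the newline is
# stale and can simply be truncated away:
relex_succeeded = True
if relex_succeeded:
    tokens = tokens[: newline_index + 1]

print(tokens)  # -> ['Lpar', 'Name', 'NonLogicalNewline']
```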

dhruvmanila added a commit that referenced this pull request Jun 27, 2024
## Summary

This PR fixes the lexer logic to **not** consume the newline character
for an unterminated string literal.

Previously, the lexer consumed the newline as part of the string
itself, which was bad for recovery because the lexer would then never
emit a newline token. This PR avoids consuming the newline character
in that case.

This was discovered during #12060.

## Test Plan

Update the snapshots and validate them.
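The behavior described in this commit can be sketched with a minimal hypothetical lexer (simplified single-quoted strings only; not ruff's implementation):

```python
def lex_string_literal(source, start):
    """Scan a single-quoted string whose opening quote is at `start`.

    On an unterminated string, stop *before* the newline instead of
    consuming it, so the lexer can still emit a newline token that
    error recovery relies on."""
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "'":
            return ("String", source[start:i + 1], i + 1)
        if ch == "\n":
            # Do not swallow the newline into the broken string.
            return ("Unknown", source[start:i], i)
        i += 1
    return ("Unknown", source[start:], i)


print(lex_string_literal("'abc\nx = 1", 0))
# -> ('Unknown', "'abc", 4); source[4] is the newline, left unconsumed
```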
Base automatically changed from dhruv/unterminated-string to main June 27, 2024 11:32
@dhruvmanila dhruvmanila merged commit a4688ae into main Jun 27, 2024
20 checks passed
@dhruvmanila dhruvmanila deleted the dhruv/re-lexing branch June 27, 2024 11:42