Python: Even more parser fixes #17873

tausbn · 2024-10-30T13:04:03Z

Each commit contains a separate parser fix (apart from the commit that just regenerates the parser files). Rather than open up 5 separate PRs, I figured it was easier to combine them into one.

Pull Request checklist

All query authors

A change note is added if necessary. See the documentation in this repository.
All new queries have appropriate .qhelp. See the documentation in this repository.
QL tests are added if necessary. See Testing custom queries in the GitHub documentation.
New and changed queries have correct query metadata. See the documentation in this repository.

Internal query authors only

Autofixes generated based on these changes are valid, only needed if this PR makes significant changes to .ql, .qll, or .qhelp files. See the documentation (internal access required).
Changes are validated at scale (internal access required).
Adding a new query? Consider also adding the query to autofix.

Quoting the Python documentation (last paragraph of https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences):

"Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes)."

We did not handle this correctly in the scanner: we consumed the backslash but not the single or double quote that follows it, so that quote was interpreted as the end of the string. To fix this, we do a second lookahead after consuming the backslash, and if the next character is the end character for the string, we advance the lexer across it as well.

Similarly, backslashes in raw strings can escape other backslashes. Thus, for a raw string like r'\\' we must consume the second backslash; otherwise we'll interpret it as escaping the end quote.
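To make the lookahead concrete, here is a minimal Python sketch of the idea. It is not the actual scanner code from this PR; `scan_raw_string_end` is a hypothetical helper that only handles a raw string delimited by a single quote character.

```python
def scan_raw_string_end(source: str, start: int) -> int:
    """Return the index just past the closing quote of a raw string.

    `start` is the index of the opening quote. Hypothetical helper for
    illustration only; triple-quoted strings are not handled.
    """
    quote = source[start]
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "\\":
            i += 1  # consume the backslash itself
            # Second lookahead: if the backslash protects the end quote
            # or another backslash, consume that character as well.
            if i < len(source) and source[i] in (quote, "\\"):
                i += 1
            continue
        if ch == quote:
            return i + 1  # found the real end of the string
        i += 1
    raise SyntaxError("unterminated raw string")


# The two valid literals discussed above, as source text.
ESCAPED_QUOTE = 'r"\\""'       # the five characters  r " \ " "
ESCAPED_BACKSLASH = "r'\\\\'"  # the five characters  r ' \ \ '
# The invalid literal: the backslash swallows the would-be end quote.
UNTERMINATED = 'r"\\"'         # the four characters  r " \ "
```

With this sketch, both valid literals are scanned to their final character, while the invalid one runs off the end of the input and raises `SyntaxError`.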

Found when parsing `Lib/test/test_coroutines.py` using the new parser. For whatever reason, having `await` be an `expression` (with an argument of the same kind) resulted in a bad parse. Consulting the official grammar, we see that `await` should actually be a `primary_expression` instead. This is also more in line with the other unary operators, whose precedence is shared by the `await` syntax.
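For reference, CPython's own parser can be used to check what `await` takes as its operand. The snippet below uses the standard `ast` module, not the tree-sitter grammar changed in this PR, so it only illustrates the intended shape of the parse.

```python
import ast

SOURCE = """
async def f(x):
    return await x.y() ** 2
"""

# Navigate to the expression in the `return` statement.
function = ast.parse(SOURCE).body[0]
returned = function.body[0].value

# The operand of `await` is the primary `x.y()`, and the awaited value is
# then used as the left-hand side of `**`.
is_power = isinstance(returned, ast.BinOp) and isinstance(returned.op, ast.Pow)
left_is_await = isinstance(returned.left, ast.Await)
await_operand_is_call = isinstance(returned.left.value, ast.Call)
```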

Turns out we were not setting the `is_async` field on anything except `async for` statements. This commit makes it so that we also do this for `async def` and `async with`, and adds a test that this produces the same behaviour as the old parser.
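As a rough illustration of the three constructs involved, the standard `ast` module represents each of them with a dedicated node type. This sketch is a stand-in for the comparison against the old parser, not the test added in this PR; `async_constructs` and `SAMPLE` are hypothetical names.

```python
import ast

# The three statement kinds that should be marked as async.
ASYNC_NODES = (ast.AsyncFunctionDef, ast.AsyncFor, ast.AsyncWith)


def async_constructs(source: str) -> list[tuple[str, int]]:
    """Return (node type name, line number) for each async def/for/with."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ASYNC_NODES):
            found.append((type(node).__name__, node.lineno))
    # ast.walk does not guarantee source order, so sort by line number.
    return sorted(found, key=lambda item: item[1])


SAMPLE = """\
async def f(xs, cm):
    async for x in xs:
        pass
    async with cm:
        pass
"""
```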

We were writing the `parenthesised` attribute twice on tuples, once because of the explicit parenthesisation, and once because all non-empty tuples are parenthesised. This made `tree-sitter-graph` unhappy. To fix this, we now explicitly check whether a tuple is already parenthesised, and do nothing if that is the case.
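The shape of the fix can be sketched in Python with a toy write-once attribute store. This is only a model of the constraint described above; `WriteOnceNode` and `mark_tuple` are hypothetical, and the real rules are written in the `tree-sitter-graph` DSL, not Python.

```python
class WriteOnceNode:
    """Toy graph node whose attributes may each be written only once."""

    def __init__(self):
        self._attrs = {}

    def set_attr(self, name, value):
        if name in self._attrs:
            raise ValueError(f"attribute {name!r} already set")
        self._attrs[name] = value

    def has_attr(self, name):
        return name in self._attrs

    def get_attr(self, name):
        return self._attrs[name]


def mark_tuple(node, explicitly_parenthesised, non_empty):
    """Apply both 'parenthesised' rules without writing the attribute twice."""
    if explicitly_parenthesised:
        node.set_attr("parenthesised", True)
    # The guard: skip the second rule if the first one already fired.
    if non_empty and not node.has_attr("parenthesised"):
        node.set_attr("parenthesised", True)
```

Without the `has_attr` guard, a tuple matching both rules would hit the second `set_attr` call and fail, which mirrors the error being fixed.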

Our logic for detecting the first and last item in a generator expression was faulty, sometimes matching comments as well. Because attributes (like `_location_start`) can only be written once, this caused `tree-sitter-graph` to get unhappy. To fix this, we now require the first item to be an `expression`, and the last one to be either a `for_in_clause` or an `if_clause`. Crucially, `comment` is neither of these, and this prevents the unfortunate overlap.
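A small Python model of the stricter matching, assuming a flat list of child nodes with simplified kinds. `Child`, `first_and_last`, and the sample children are hypothetical; the real patterns live in `python.tsg`.

```python
from typing import NamedTuple


class Child(NamedTuple):
    kind: str   # simplified node kind, e.g. "expression" or "comment"
    text: str


def first_and_last(children):
    """Pick the items that delimit a generator expression by kind.

    Taking children[0] and children[-1] would pick up comments; filtering
    by kind means a comment can never be chosen.
    """
    first = next(c for c in children if c.kind == "expression")
    last = next(
        c for c in reversed(children)
        if c.kind in ("for_in_clause", "if_clause")
    )
    return first, last


# Children of a generator expression with comments at both ends.
CHILDREN = [
    Child("comment", "# leading"),
    Child("expression", "x"),
    Child("for_in_clause", "for x in xs"),
    Child("if_clause", "if x"),
    Child("comment", "# trailing"),
]
```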

yoff

Thanks for all these fixes. Some comments, but these are great improvements :-)

yoff · 2024-11-01T09:53:24Z

python/extractor/tests/parser/strings.py

+if 39:
+    r'a\
+    '


Do/should we test both the \n and \r\n cases?

We could, but it's somewhat fiddly as we normalise all line endings when committing. In this case, I think the benefits would be marginal at best.

yoff · 2024-11-01T09:57:18Z

python/extractor/tsg-python/python.tsg

+    attr (@funcdef.node) _location_start = start
+    attr (@funcdef.function) _location_start = start
+    attr (@funcdef.funcexpr) _location_start = start


Were these also never set for async def?

That's right. async defs would get their starting position from the entire function definition (i.e. the start of the async bit), but for legacy reasons we want it to start at the def bit.

(This is something I hope we can get rid of in the future, as I don't really have a good justification for it other than "it's how we've always done it".)
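To illustrate the two candidate start positions, the sketch below compares where CPython's `ast` module places an `async def` node with where the `def` keyword itself sits. It uses only the standard library and says nothing about how the extractor computes locations; `def_keyword_position` is a hypothetical helper.

```python
import ast
import io
import tokenize

SOURCE = "async def f():\n    pass\n"


def def_keyword_position(source: str) -> tuple[int, int]:
    """Return (line, column) of the first `def` keyword in `source`."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string == "def":
            return tok.start
    raise ValueError("no def keyword found")


node = ast.parse(SOURCE).body[0]
# Start of the whole definition, i.e. the `async` keyword.
whole_definition_start = (node.lineno, node.col_offset)
# Start of the `def` keyword, the position wanted for legacy reasons.
def_start = def_keyword_position(SOURCE)
```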

tausbn added 6 commits October 28, 2024 14:40

Python: Regenerate parser files

e710c0a

tausbn added the no-change-note-required This PR does not need a change note label Oct 30, 2024

github-actions bot added the Python label Oct 30, 2024

tausbn marked this pull request as ready for review October 31, 2024 10:54

tausbn requested a review from a team as a code owner October 31, 2024 10:54

yoff approved these changes Nov 1, 2024


tausbn merged commit 2892f0f into main Nov 1, 2024
10 checks passed

tausbn deleted the tausbn/python-fix-generator-expression-locations branch November 1, 2024 11:47
