Skip to content

Python: Even more parser fixes #17873

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 1, 2024

Conversation

tausbn
Copy link
Contributor

@tausbn tausbn commented Oct 30, 2024

Each commit contains a separate parser fix (apart from the commit that just regenerates the parser files). Rather than open up 5 separate PRs, I figured it was easier to combine them into one.

Pull Request checklist

All query authors

Internal query authors only

  • Autofixes generated based on these changes are valid, only needed if this PR makes significant changes to .ql, .qll, or .qhelp files. See the documentation (internal access required).
  • Changes are validated at scale (internal access required).
  • Adding a new query? Consider also adding the query to autofix.

Quoting the Python documentation (last paragraph of
https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences):

"Even in a raw literal, quotes can be escaped with a backslash, but the
backslash remains in the result; for example, r"\"" is a valid string
literal consisting of two characters: a backslash and a double quote;
r"\" is not a valid string literal (even a raw string cannot end in an
odd number of backslashes)."

We did not handle this correctly in the scanner, as we only consumed the
backslash but not the following single or double quote, resulting in
that character getting interpreted as the end of the string.

To fix this, we do a second lookahead after consuming the backslash, and
if the next character is the end character for the string, we advance
the lexer across it as well.

Similarly, backslashes in raw strings can escape other backslashes.
Thus, for a string like '\\' we must consume the second backslash,
otherwise we'll interpret it as escaping the end quote.
Found when parsing `Lib/test/test_coroutines.py` using the new parser.

For whatever reason, having `await` be an `expression` (with an argument
of the same kind) resulted in a bad parse. Consulting the official
grammar, we see that `await` should actually be a `primary_expression`
instead. This is also more in line with the other unary operators, whose
precedence is shared by the `await` syntax.
Turns out we were not setting the `is_async` field on anything except
`async for` statements. This commit makes it so that we also do this for
`async def` and `async with`, and adds a test that this produces the
same behaviour as the old parser.
We were writing the `parenthesised` attribute twice on tuples, once
because of the explicit parenthetisation, and once because all non-empty
tuples are parenthesised. This made `tree-sitter-graph` unhappy.

To fix this, we now explicitly check whether a tuple is already
parenthesised, and do nothing if that is the case.
Our logic for detecting the first and last item in a generator
expression was faulty, sometimes matching comments as well. Because
attributes (like `_location_start`) can only be written once, this
caused `tree-sitter-graph` to get unhappy.

To fix this, we now require the first item to be an `expression`, and
the last one to be either a `for_in_clause` or an `if_clause`.
Crucially, `comment` is neither of these, and this prevents the
unfortunate overlap.
@tausbn tausbn added the no-change-note-required This PR does not need a change note label Oct 30, 2024
@tausbn tausbn marked this pull request as ready for review October 31, 2024 10:54
@tausbn tausbn requested a review from a team as a code owner October 31, 2024 10:54
Copy link
Contributor

@yoff yoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all these fixes. Some comments, but these are great improvements :-)

Comment on lines +86 to +88
if 39:
r'a\
'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do/should we test both the \n \rn cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but it's somewhat fiddly as we normalise all line endings when committing. In this case, I think the benefits would be marginal at best.

Comment on lines +3269 to +3271
attr (@funcdef.node) _location_start = start
attr (@funcdef.function) _location_start = start
attr (@funcdef.funcexpr) _location_start = start
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were these also never set for async def?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. async defs would get their starting position from the entire function definition (i.e. the start of the async bit), but for legacy reasons we want it to start at the def bit.

(This is something I hope we can get rid of in the future, as I don't really have a good justification for it other than "it's how we've always done it.)

@tausbn tausbn merged commit 2892f0f into main Nov 1, 2024
10 checks passed
@tausbn tausbn deleted the tausbn/python-fix-generator-expression-locations branch November 1, 2024 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-change-note-required This PR does not need a change note Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants