-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Python: Even more parser fixes #17873
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Quoting the Python documentation (last paragraph of https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences): "Even in a raw literal, quotes can be escaped with a backslash, but the backslash remains in the result; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes)." We did not handle this correctly in the scanner, as we only consumed the backslash but not the following single or double quote, resulting in that character getting interpreted as the end of the string. To fix this, we do a second lookahead after consuming the backslash, and if the next character is the end character for the string, we advance the lexer across it as well. Similarly, backslashes in raw strings can escape other backslashes. Thus, for a string like '\\' we must consume the second backslash, otherwise we'll interpret it as escaping the end quote.
Found when parsing `Lib/test/test_coroutines.py` using the new parser. For whatever reason, having `await` be an `expression` (with an argument of the same kind) resulted in a bad parse. Consulting the official grammar, we see that `await` should actually be a `primary_expression` instead. This is also more in line with the other unary operators, whose precedence is shared by the `await` syntax.
Turns out we were not setting the `is_async` field on anything except `async for` statements. This commit makes it so that we also do this for `async def` and `async with`, and adds a test that this produces the same behaviour as the old parser.
We were writing the `parenthesised` attribute twice on tuples, once because of the explicit parenthetisation, and once because all non-empty tuples are parenthesised. This made `tree-sitter-graph` unhappy. To fix this, we now explicitly check whether a tuple is already parenthesised, and do nothing if that is the case.
Our logic for detecting the first and last item in a generator expression was faulty, sometimes matching comments as well. Because attributes (like `_location_start`) can only be written once, this caused `tree-sitter-graph` to get unhappy. To fix this, we now require the first item to be an `expression`, and the last one to be either a `for_in_clause` or an `if_clause`. Crucially, `comment` is neither of these, and this prevents the unfortunate overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all these fixes. Some comments, but these are great improvements :-)
if 39: | ||
r'a\ | ||
' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do/should we test both the \n
\rn
cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but it's somewhat fiddly as we normalise all line endings when committing. In this case, I think the benefits would be marginal at best.
attr (@funcdef.node) _location_start = start | ||
attr (@funcdef.function) _location_start = start | ||
attr (@funcdef.funcexpr) _location_start = start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were these also never set for async def
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. async def
s would get their starting position from the entire function definition (i.e. the start of the async
bit), but for legacy reasons we want it to start at the def
bit.
(This is something I hope we can get rid of in the future, as I don't really have a good justification for it other than "it's how we've always done it.)
Each commit contains a separate parser fix (apart from the commit that just regenerates the parser files). Rather than open up 5 separate PRs, I figured it was easier to combine them into one.
Pull Request checklist
All query authors
.qhelp
. See the documentation in this repository.Internal query authors only
.ql
,.qll
, or.qhelp
files. See the documentation (internal access required).