Refactor parser structure to match CPython's grammar more closely #4595
Comments
Maybe the "parenthesised context managers" from Python 3.10 (https://docs.python.org/3/whatsnew/3.10.html#new-features) could also be included in this issue? |
Yes, absolutely. And also #4570 (PEP 614, expressions as decorators), while we're at it. |
Dumb question, but why not use the built-in `ast` module? |
Maybe because Cython adds additional syntax that isn't covered by AST |
But for the pure python stuff I mean |
That's a tough call. It would be nice to just use `ast`. |
The other issue is: Cython is designed to be independent of the version of Python you use to run Cython, so running it on an older interpreter should still let you compile code that uses newer syntax. Suppose we tried to implement a new syntax feature through the interpreter's own AST: the interpreter would then have to be new enough to parse that syntax itself. I'm also not sure what the AST looks like for PyPy (which we also support) and GraalPython (which we don't really support, but may do sometime in the future). It probably isn't the same as for CPython, which would mean you couldn't generate Cython code on those. Looking at (or copying) the structure of CPython's own Python parser is a different matter, though. |
Yes, being able to generate the code directly on the platform that the user has on their side is a valuable advantage. |
I've tried to rewrite it to largely follow the rules from the most recent version of the Python LL parser, thereby avoiding conditional parameters. Fixes one part of cython#4595
This matches what the most recent version of the LL parser does so I don't think it needs changing (https://docs.python.org/3.8/reference/grammar.html):
As I understand the PEG parser (and I don't understand it hugely well), it works by trying each alternative of a rule in order and backtracking to the next alternative whenever one fails.
The idea of #4813 is to make it easier to do the backtracking. However, I think the CPython PEG parser does a reasonable amount of internal caching to make the backtracking perform better. Since we don't have that, I think we might regret switching the whole thing to a PEG-like parser - it's probably best where needed rather than everywhere |
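The internal caching mentioned above is usually called packrat parsing: the result of each (rule, position) pair is memoized, so backtracking never re-parses the same input span twice. A minimal sketch of the idea for a toy grammar (not Cython's parser):

```python
from functools import lru_cache

# Toy PEG with ordered choice and packrat memoization.
# Grammar:  expr <- term "+" expr / term ;   term <- [0-9]+
SRC = "1+2+3"

@lru_cache(maxsize=None)           # memoize by input position
def term(pos):
    end = pos
    while end < len(SRC) and SRC[end].isdigit():
        end += 1
    return end if end > pos else None   # new position, or None on failure

@lru_cache(maxsize=None)
def expr(pos):
    # First alternative: term "+" expr
    t = term(pos)
    if t is not None and t < len(SRC) and SRC[t] == "+":
        rest = expr(t + 1)
        if rest is not None:
            return rest
    # Backtrack and fall through to the second alternative: bare term
    return t

assert expr(0) == len(SRC)   # the whole input parses
```

Without the `lru_cache` memo tables, the same `term`/`expr` calls would be repeated every time an alternative fails and the parser backtracks, which is the performance concern raised above.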
Is the idea of using pegen + pxd worthy of investigation? |
Maybe. It adds an external dependency (which we try to avoid), but it's probably a dependency for Cython rather than for users (i.e. we'd distribute the files it generated). The big benefit would be that we'd have a grammar file that's automatically in sync with Cython's parser, which would help people trying to do syntax highlighting or similar. The downsides would be:
|
Why is it important to link to the current scanner? |
It isn't hugely important. It's just a bit of code that's been working well without many changes for a long time |
If Cython were to change its parser to pegen, should that happen in 3.0 or 3.1? |
It's somewhere between 3.2 and never.
|
Are there no tests specifically for the parser/syntax? I see some syntax error tests here and there... So for the parser we only have integration tests? |
Are there no tests specifically for the parser/syntax?
The parser rarely changes compared to all the rest and is mostly tested through the bulk of feature file tests, which, luckily for us, test most of the features that programmers use in their code. But since we're dealing with a programming language here, specifically one that borrows from three different languages, it's difficult to even get close to testing all syntactic combinations that are relevant for the parser. It's not just syntax constructs, there's often also context involved.
That said, many of the compile tests target mostly the parser.
|
Then what is the lowest Python version that needs to be supported? |
I don't really understand the interest in replacing the parser. The advantages don't seem proportional to the effort involved. I didn't have huge issues making pattern matching work with our existing parser (I know no-one else has looked at it yet, but nothing I wrote for it felt like a huge hack). So I don't think our current parser is actually holding anything back.
I don't want to speak for @scoder but my interpretation is:
|
If I understand correctly, marking a name to be interned (via |
It's a problem if the generated code cannot be run on every Python version, for example if it uses the walrus operator. But that's not a problem anymore. |
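For reference, the walrus operator (PEP 572, assignment expressions) is the kind of syntax in question: it binds a name inside an expression and needs a 3.8+ parser, so any generated output containing it could not run on older interpreters. Plain Python example:

```python
# Assignment expression: bind n while testing it, in one expression.
data = [1, 2, 3, 4]
if (n := len(data)) > 3:
    print(f"long list: {n} items")
```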
I wonder, could we take the path of the cpython and make old parser optional, but defaulting to the new one, so it could be released sooner. Or maybe make a new parser optional, so it'll be easier for people to try, and then deprecate the old one. |
I think it means that we only store a single string for the name rather than lots of different identical strings. I think it's really just a small memory-saving thing. It's independent of the cached Python string objects generated in the Cython code (which happens much later).
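The same mechanism exists at the Python level as `sys.intern`: repeated identical identifier strings collapse into one shared object, which is exactly the small memory saving described above.

```python
import sys

# Interning returns the canonical object for a given string value,
# so two independently built equal strings become one shared object.
a = sys.intern("some_long_identifier_name")
b = sys.intern("some_long_" + "identifier_name")
print(a is b)   # identity, not just equality
```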
That would make it lower risk, although it's only valuable if the new parser actually gets tested. I still don't really understand the benefit though. |
Some of the benefits are: |
Also, the scanners are safe and sound, so there's no need to replace them, and some parts of the parser can be kept alive, as it's easier to reuse them rather than porting them. |
Additional validation can be done by parsing the CPython stdlib plus the top PyPI packages and comparing the result to the result of the old parser. |
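A sketch of that differential check, assuming two parser entry points to compare (here `ast.parse` stands in for both the old and the hypothetical new parser, so the names and the comparison on dumped trees are illustrative, not Cython's actual API):

```python
import ast
import pathlib
import sysconfig

def parse_old(src):   # stand-in for the existing parser
    return ast.dump(ast.parse(src))

def parse_new(src):   # stand-in for the candidate pegen-based parser
    return ast.dump(ast.parse(src))

# Walk a sample of the running interpreter's stdlib and compare results.
stdlib = pathlib.Path(sysconfig.get_paths()["stdlib"])
checked = 0
for path in sorted(stdlib.glob("*.py"))[:20]:
    src = path.read_text(encoding="utf-8", errors="replace")
    try:
        assert parse_old(src) == parse_new(src), f"parsers disagree on {path}"
        checked += 1
    except SyntaxError:
        pass   # skip files the running interpreter itself cannot parse
print(f"compared {checked} files")
```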
I think you wanted to point not to a lack of benefits, but to the fact that these benefits may not be worth the cost. To fully understand whether it is worth it, it is first necessary to build some minimal working version. |
Yes. Essentially I'm
|
I assure you, it will generate them. |
Note that this ticket is about refactoring the parser, not about replacing it.
Definitely helpful, but not a clear reason to replace the parser of Cython. We can provide an external grammar either way, even though it's clearly easier to maintain if we use it ourselves. But it's not a must.
Possibly, although adapting the parser is usually the easiest part when implementing a Python feature. It's really not difficult to maintain.
That's to be seen, but what is clear already is that there is a very high initial cost for replacing the parser. That means it may simply not be worth it.
That has zero benefit by itself.
Uncertainly. |
Note that it wasn't correct before, since it didn't pass the correct flag to `p_lambdef` and was thus equivalent to just using `p_lambdef`. Note also that there's a difference in behaviour between Python 3.9+ and earlier versions: Python <3.9 allowed `[i for i in range(10) if lambda: i]` while Python >=3.9 disallows it. Arguably the construct is pointless because the lambda always evaluates as true. See python/cpython#86014 for the Python issue. With this change, Cython will follow the Python 3.9 behaviour at the cost of potentially breaking code that uses the pattern above. If that isn't desirable, I can produce an alternative change that fixes `p_lambda_nocond` instead. Part of cleanup in cython#4595.
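The behaviour difference can be checked in plain Python. A lambda used as a comprehension filter is always truthy (the filter tests the function object, it never calls it), and from Python 3.9 the bare, unparenthesised form is rejected at parse time:

```python
import sys

# Parenthesised form: valid on all versions; the lambda object is truthy,
# so the filter is a no-op.
result = [i for i in range(5) if (lambda: i)]
print(result)   # every i passes the filter

# Bare form: a SyntaxError from Python 3.9 on.
if sys.version_info >= (3, 9):
    try:
        compile("[i for i in range(5) if lambda: i]", "<test>", "eval")
        raised = False
    except SyntaxError:
        raised = True
    assert raised
```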
The parse function structure of the parser implementation in `Cython/Compiler/Parsing.py` has diverged from the old Grammar in CPython and certainly does not match the new PEG parser. Additionally, several flags were added over time that make it less clear what kind of expression is allowed and supposed to be parsed where. This ticket asks to refactor the parse functions to follow CPython's grammar again, naming them after the grammar rules (apart from the `p_` prefix, kept for readability). Basically, it should be clear from the name of a called parse function into which state the parser now changes and what it is allowed to see next. This state should not depend on additional options ("parse X next, unless I'm telling you not to do what I'm asking you to do").
This can, and probably best should, be done in multiple iterations, both to keep the changes easy to review and to allow us to see where we are going along the way.
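As a hypothetical illustration of the requested style (these function names are made up for the example, not Cython's actual code), the "parse X next, unless told otherwise" pattern turns one flag-controlled function into one function per grammar rule:

```python
# Before (hypothetical): one entry point whose behaviour depends on a flag,
# so the call site doesn't say what may be parsed next.
def p_test_with_flag(s, allow_conditional=True):
    return "test" if allow_conditional else "test_nocond"

# After: one function per grammar rule (CPython's old Grammar has both a
# 'test' and a 'test_nocond' rule), so the name alone states the parser state.
def p_test(s):
    return "test"

def p_test_nocond(s):
    return "test_nocond"

# The renamed functions cover exactly the two flag configurations.
assert p_test(None) == p_test_with_flag(None, True)
assert p_test_nocond(None) == p_test_with_flag(None, False)
```

The strings here are placeholders for real parse results; the point is only the one-rule-one-function naming.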
Known areas that require a cleanup are `p_test()` as entry point for expressions. Along the way, the following missing syntax features can be added:
The Python test suite has tests for them.
CPython's old parser grammar: https://github.com/python/cpython/blob/3.9/Grammar/Grammar
CPython's new PEG parser grammar: https://github.com/python/cpython/blob/main/Grammar/python.gram