Refactor parser structure to match CPython's grammar more closely #4595

scoder · 2022-01-27T00:56:50Z

The parse function structure of the parser implementation in Cython/Compiler/Parsing.py has diverged from the old Grammar in CPython and certainly does not match the new PEG parser. Additionally, several flags were added over time that make it less clear what kind of expression is allowed and supposed to be parsed where. This ticket asks to

adapt the parse function split and their names to what CPython uses in its own parser, as closely as possible (but keeping the p_ prefix for readability)
remove flag options where possible and reasonable
split parse functions that currently take options into separate functions that parse different things, and use them in the appropriate places.

Basically, it should be clear from the name of a called parse function in which state the parser now changes and what it is allowed to see next. This state should not depend on additional options ("parse X next, unless I'm telling you not to do what I'm asking you to do").

This can (and should best) be done in multiple iterations, both to keep the changes easy to review and to allow us to see where we are going along the way.

Known fields that require a cleanup are

p_test() as entry point for expressions
the integration of lambda expressions
star expressions
conditional expressions
named expressions (walrus operator)

Along the way, the following missing syntax features can be added:

parenthesised context managers (https://docs.python.org/3/whatsnew/3.10.html#new-features)
general expressions as decorators (Implement PEP 614: more general expressions for decorators #4570, PEP 614)

The Python test suite has tests for them.

CPython's old parser grammar: https://github.com/python/cpython/blob/3.9/Grammar/Grammar
CPython's new PEG parser grammar: https://github.com/python/cpython/blob/main/Grammar/python.gram

The text was updated successfully, but these errors were encountered:

da-woods · 2022-01-27T07:13:10Z

Maybe also the "parenthasised context managers" from Python 3.10 (https://docs.python.org/3/whatsnew/3.10.html#new-features) could be included in this issue?

scoder · 2022-01-27T09:03:39Z

Yes, absolutely. And also #4570 (PEP 614, expressions as decorators), while we're at it.

AstroFloof · 2022-02-25T00:01:59Z

Dumb question but why not use the built in ast module

da-woods · 2022-02-25T07:05:44Z

Dumb question but why not use the built in ast module

Maybe because Cython adds additional syntax that isn't covered by AST

AstroFloof · 2022-02-25T23:22:27Z

But for the pure python stuff I mean
I know that's not the whole scope of Cython, but isn't it a good starting point?

scoder · 2022-02-26T20:36:22Z

why not use the built in ast module

That's a tough call. It would be nice to just use ast – if we can switch entirely to Python syntax. Whether that's feasible and desirable in the end, I do not know. So, no ast for now. Having to maintain two parsers / syntax tree generators is certainly not desirable.

da-woods · 2022-02-26T20:44:51Z

The other issue is: Cython is designed to be independent of the version of Python you use to run Cython. So run cython somefile.pyx on Python 3.7 should produce the same code as running it on Python 3.11 alpha.

Let's suppose we tried to implement match ...: case... using the ast module. That would mean that we could only compile match ...: case... statements on Python 3.10+. If/when we do implement match ...: case statements we would aim for them to work on all the Python versions that we support.

I'm also not sure what the AST looks like for PyPy (which we also support) and GraalPython (which we don't really support, but may do in sometime in the future). But it probably isn't the same as for CPython, which would mean you couldn't generate Cython code on those.

Looking/copying the structure of the Python parser used by ast is well worthwhile of course.

scoder · 2022-02-27T14:31:11Z

Yes, being able to generate the code directly on the platform that the user has on their side is a valuable advantage.

I've tried to rewrite it to largely follow the rules from the most recent version of the Python LL parser, so avoiding conditional parameters. Fixes one part of cython#4595

da-woods · 2022-06-18T08:29:59Z

p_test() as entry point for expressions

This matches what the most recent version of the LL parser does so I don't think it needs changing (https://docs.python.org/3.8/reference/grammar.html):

test: or_test ['if' or_test 'else' test] | lambdef

certainly does not match the new PEG parser

As I understand the PEG parser (which isn't hugely well) it works on the basis of:

try to parse a bunch of tokens according to one set of rules,
If that fails then work from the same starting point according to the next set of rules,
(And if all the subrules in a rule fail, then it backtracks one level further up again).

The idea of #4813 is to make it easier to do the backtracking. However, I think the CPython PEG parser does a reasonable amount of internal caching to make the backtracking perform better. Since we don't have that, I think we might regret switching the whole thing to a PEG-like parser - it's probably best where needed rather than everywhere

0dminnimda · 2022-06-25T09:03:20Z

Is the idea of using pegen + pxd worthy of investigation?

da-woods · 2022-06-25T11:49:18Z

Is the idea of using pegen + pxd worthy of investigation?

Maybe. It adds an external dependency (which we try to avoid). But it's probably a dependency for Cython rather than the users (i.e. we'd distribute the files it generated).

The big benefit would be we'd have a grammar file that's automatically in-sync with Cython's parser, which would help people trying to do syntax highlighting or similar

The downsides would be:

we'd be replacing a large chunk of code that's tested and (mostly) works with something completely new. That's like to lead to a lot of regressions, especially for things like C++ templates where we probably don't have complete test coverage of everything that works right now,
I dunno how easy or hard it'd be to link to the current scanner (that's a complete "don't know"... might be perfectly simple)
It seems like it'd be hard to do gradually?

0dminnimda · 2022-06-25T22:50:16Z

Why it's important to link to the current scanner?

da-woods · 2022-06-26T07:41:38Z

Why it's important to link to the current scanner?

It isn't hugely important. It's just a bit of code that's been working well without many changes for a long time

0dminnimda · 2022-06-26T09:12:17Z

If cython would change it's parser to pegen, should it happen in 3.0 or 3.1?
I mean, if a lot of old battle-tested code would be swapped, we would need to wait for some time to make sure that it works properly

scoder · 2022-06-26T09:26:04Z

It's somewhere between 3.2 and never.

0dminnimda · 2022-06-27T11:34:19Z

Is there no tests specifially for the parser/syntax? I see some error syntax tests here and there... So for parser we only have integration tests?

scoder · 2022-06-27T17:11:46Z

Is there no tests specifially for the parser/syntax?

The parser rarely changes compared to all the rest and is mostly tested through the bulk of feature file tests, which, luckily for us, test most of the features that programmers use in their code. But since we're dealing with a programming language here, specifically one that borrows from three different languages, it's difficult to even get close to testing all syntactic combinations that are relevant for the parser. It's not just syntax constructs, there's often also context involved. That said, many of the compile tests target mostly the parser.

0dminnimda · 2022-06-27T19:36:31Z

It's somewhere between 3.2 and never.

Then what lowest python version needs to be supported?

da-woods · 2022-06-27T19:56:32Z

I don't really understand the interest in replacing the parser. The advantages don't seem proportional to the effort involved.

I didn't have huge issues making pattern matching work with our existing parser (I know no-one else has looked at it yet, but nothing I wrote for it felt like a huge hack). So I don't think our current parser is actually holding anything back.

It's somewhere between 3.2 and never.

Then what lowest python version needs to be supported?

I don't want to speak for @scoder but my interpretation is:

it'd be too much effort to do for Cython 3.0 - Cython 3.0 is hopefully out sooner than we could make this kind of change
it doesn't need to be in Cython 3.0: if changing the parser was done right then it should be invisible to the user, and the priority for Cython 3.0 is to get breaking changes done in one go.
It isn't in Cython 3.1 because that is intended to be fairly quick after 3.0.
The version of Python probably isn't an issue - it'd be possible to ship the generated files rather than require users to run pegen.
So "3.2 to never" really means "maybe sometime in the future if there's a good reason to do it and someone is motivated to do the work"

0dminnimda · 2022-07-02T11:30:19Z

If I understand correctly marking a name to be interned (via intern_ustring) means that this name is a constant?

0dminnimda · 2022-07-02T11:32:06Z

The version of Python probably isn't an issue - it'd be possible to ship the generated files rather than require users to run pegen.

It's a problem if the generated code cannot be run on every python version, for example if it uses the walrus operator. But it's not a problem anymore

0dminnimda · 2022-07-02T11:37:03Z

I wonder, could we take the path of the cpython and make old parser optional, but defaulting to the new one, so it could be released sooner. Or maybe make a new parser optional, so it'll be easier for people to try, and then deprecate the old one.

da-woods · 2022-07-02T11:54:22Z

If I understand correctly marking a name to be interned (via intern_ustring) means that this name is a constant?

I think it means that we only store a single string for the name rather than lots of different identical strings. I think it's really a just a small memory saving thing. It's independent of the cached Python string objects generated in the Cython code (which happens much later).

I wonder, could we take the path of the cpython and make old parser optional, but defaulting to the new one, so it could be released sooner. Or maybe make a new parser optional, so it'll be easier for people to try, and then deprecate the old one.

That would make it lower risk, although it's only valuable if the new parser actually gets tested. I still don't really understand the benefit though.

0dminnimda · 2022-07-02T12:24:27Z

Some of the benifits are:

Formulated grammar, which will make it easier to make tools for cython
Easy synchronization with cpython grammar
Reducing the maintenance costs
Closes this and some other tickets
Possible performance increase

Also Scanners are safe and sound, no need to replace them + some parts of the parser can be left to live, as it's easier to reuse them, rether than poring them

0dminnimda · 2022-07-02T12:26:11Z

Additional validation can be done via parsing the cpython stdlib + top pypi packages and comparing the result to the result of the old parser

0dminnimda · 2022-07-02T12:47:12Z

I think that you wanted to point not to the lack of benefits, but the fact that these benefits may not be worth it.

First of all, in order to fully understand whether it is worth it, it is necessary to make some minimum working version.
Secondly, a large number, probably most of the work lies on the one who makes the pr, so I think this will not weigh the reviewers, to such an extent that the changes cannot be merged

da-woods · 2022-07-02T12:53:05Z

I think that you wanted to point not to the lack of benefits, but the fact that these benefits may not be worth it.

Yes. Essentially I'm

pretty convinced by the first benefit ("Formulated grammar, which will make it easier to make tools for cython"), although it's hard to know how many people will actually use it,
very sceptical of the last benefit ("better performance"), mainly because the current parser is largely compiled into cdef functions, and I'm not sure pegen is flexible enough to generate pxd files,
neutral on the rest.

0dminnimda · 2022-07-02T13:03:57Z

I'm not sure pegen is flexible enough to generate pxd files,

I assure you, it will generate them.
Remeber that problem with the python version? Default pegen can only generate parger that uses the walrus operator.
Cython will use custom pegen (hopefully fork that'll be owned by the cython organization).
It will generate parsers that could be run by any python version, so cython could be used in the don't compile cython internals mode.
Pegen knows all of the variables and types when generating so pxd will be fully typed, yay

scoder · 2022-07-04T08:17:41Z

Note that this ticket is about refactoring the parser, not about replacing it.

Some of the benifits are:
* Formulated grammar, which will make it easier to make tools for cython

Definitely helpful, but not a clear reason to replace the parser of Cython. We can provide an external grammar either way, even though it's clearly easier to maintain if we use it ourselves. But it's not a must.

* Easy synchronization with cpython grammar

Possibly, although adapting the parser is usually the easiest part when implementing a Python feature. It's really not difficult to maintain.

* Reducing the maintenance costs

That's to be seen, but what it clear already is that there is a very high cost for replacing the parser initially. That means that it may simply not be worth it.

* Closes this and some other tickets

That has zero benefit by itself.

* Possible performance increase

Uncertainly.

… LL parser (GH-4846) I've tried to rewrite it to largely follow the rules from the most recent version of the Python LL parser, so avoiding conditional parameters. See #4595

Note that it wasn't correct before since it didn't pass the correct flag to `p_lambdef` and thus was equivalent to just using `p_lambdef`. Note also that there's a difference in behaviour between Python3.9+ and before. Python <3.9 allowed `[i for i in range(10) if lambda: i]` while Python >=3.9 disallows this. Arguably it's pointless because the lambda always evaluates to True. See python/cpython#86014 for the Python issue. With this change Cython will follow the Python 3.9 behaviour at the cost of potentially breaking some code that does use the pattern above. If that isn't desirable I can produce an alternative change that fixes p_lambda_nocond instead. Part of cleanup in cython#4595.

…-4992) Note that it wasn't correct before since it didn't pass the correct flag to `p_lambdef` and thus was equivalent to just using `p_lambdef`. Note also that there's a difference in behaviour between Python3.9+ and before. Python <3.9 allowed `[i for i in range(10) if lambda: i]` while Python >=3.9 disallows this. Arguably it's pointless because the lambda always evaluates to True. See python/cpython#86014 for the Python issue. With this change Cython will follow the Python 3.9 behaviour at the cost of potentially breaking some code that does use the pattern above. Part of the cleanup in #4595

scoder added Parsing cleanup labels Jan 27, 2022

da-woods mentioned this issue May 29, 2022

Enable parenthesized context managers #4814

Merged

da-woods mentioned this issue Jun 18, 2022

Refactor parsing of named expressions #4846

Merged

da-woods mentioned this issue Aug 21, 2022

PEP614 decorators #4991

Merged

da-woods mentioned this issue Aug 21, 2022

Remove p_lambdef_nocond from the parser. #4992

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor parser structure to match CPython's grammar more closely #4595

Refactor parser structure to match CPython's grammar more closely #4595

scoder commented Jan 27, 2022 •

edited by da-woods

da-woods commented Jan 27, 2022

scoder commented Jan 27, 2022

AstroFloof commented Feb 25, 2022

da-woods commented Feb 25, 2022

AstroFloof commented Feb 25, 2022 •

edited

scoder commented Feb 26, 2022

da-woods commented Feb 26, 2022

scoder commented Feb 27, 2022

da-woods commented Jun 18, 2022

0dminnimda commented Jun 25, 2022

da-woods commented Jun 25, 2022

0dminnimda commented Jun 25, 2022

da-woods commented Jun 26, 2022

0dminnimda commented Jun 26, 2022 •

edited

scoder commented Jun 26, 2022 via email

0dminnimda commented Jun 27, 2022

scoder commented Jun 27, 2022 via email

0dminnimda commented Jun 27, 2022 •

edited

da-woods commented Jun 27, 2022

0dminnimda commented Jul 2, 2022

0dminnimda commented Jul 2, 2022 •

edited

0dminnimda commented Jul 2, 2022

da-woods commented Jul 2, 2022

0dminnimda commented Jul 2, 2022

0dminnimda commented Jul 2, 2022 •

edited

0dminnimda commented Jul 2, 2022 •

edited

da-woods commented Jul 2, 2022

0dminnimda commented Jul 2, 2022

scoder commented Jul 4, 2022

Refactor parser structure to match CPython's grammar more closely #4595

Refactor parser structure to match CPython's grammar more closely #4595

Comments

scoder commented Jan 27, 2022 • edited by da-woods

da-woods commented Jan 27, 2022

scoder commented Jan 27, 2022

AstroFloof commented Feb 25, 2022

da-woods commented Feb 25, 2022

AstroFloof commented Feb 25, 2022 • edited

scoder commented Feb 26, 2022

da-woods commented Feb 26, 2022

scoder commented Feb 27, 2022

da-woods commented Jun 18, 2022

0dminnimda commented Jun 25, 2022

da-woods commented Jun 25, 2022

0dminnimda commented Jun 25, 2022

da-woods commented Jun 26, 2022

0dminnimda commented Jun 26, 2022 • edited

scoder commented Jun 26, 2022 via email

0dminnimda commented Jun 27, 2022

scoder commented Jun 27, 2022 via email

0dminnimda commented Jun 27, 2022 • edited

da-woods commented Jun 27, 2022

0dminnimda commented Jul 2, 2022

0dminnimda commented Jul 2, 2022 • edited

0dminnimda commented Jul 2, 2022

da-woods commented Jul 2, 2022

0dminnimda commented Jul 2, 2022

0dminnimda commented Jul 2, 2022 • edited

0dminnimda commented Jul 2, 2022 • edited

da-woods commented Jul 2, 2022

0dminnimda commented Jul 2, 2022

scoder commented Jul 4, 2022

scoder commented Jan 27, 2022 •

edited by da-woods

AstroFloof commented Feb 25, 2022 •

edited

0dminnimda commented Jun 26, 2022 •

edited

0dminnimda commented Jun 27, 2022 •

edited

0dminnimda commented Jul 2, 2022 •

edited

0dminnimda commented Jul 2, 2022 •

edited

0dminnimda commented Jul 2, 2022 •

edited