Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor parser structure to match CPython's grammar more closely #4595

Open
4 of 7 tasks
scoder opened this issue Jan 27, 2022 · 29 comments
Open
4 of 7 tasks

Refactor parser structure to match CPython's grammar more closely #4595

scoder opened this issue Jan 27, 2022 · 29 comments

Comments

@scoder
Copy link
Contributor

scoder commented Jan 27, 2022

The parse function structure of the parser implementation in Cython/Compiler/Parsing.py has diverged from the old Grammar in CPython and certainly does not match the new PEG parser. Additionally, several flags were added over time that make it less clear what kind of expression is allowed and supposed to be parsed where. This ticket asks to

  • adapt the parse function split and their names to what CPython uses in its own parser, as closely as possible (but keeping the p_ prefix for readability)
  • remove flag options where possible and reasonable
  • split parse functions that currently take options into separate functions that parse different things, and use them in the appropriate places.

Basically, it should be clear from the name of a called parse function in which state the parser now changes and what it is allowed to see next. This state should not depend on additional options ("parse X next, unless I'm telling you not to do what I'm asking you to do").

This can (and should best) be done in multiple iterations, both to keep the changes easy to review and to allow us to see where we are going along the way.

Known fields that require a cleanup are

  • p_test() as entry point for expressions
  • the integration of lambda expressions
  • star expressions
  • conditional expressions
  • named expressions (walrus operator)

Along the way, the following missing syntax features can be added:

The Python test suite has tests for them.

CPython's old parser grammar: https://github.com/python/cpython/blob/3.9/Grammar/Grammar
CPython's new PEG parser grammar: https://github.com/python/cpython/blob/main/Grammar/python.gram

@da-woods
Copy link
Contributor

Maybe also the "parenthasised context managers" from Python 3.10 (https://docs.python.org/3/whatsnew/3.10.html#new-features) could be included in this issue?

@scoder
Copy link
Contributor Author

scoder commented Jan 27, 2022

Yes, absolutely. And also #4570 (PEP 614, expressions as decorators), while we're at it.

@AstroFloof
Copy link

Dumb question but why not use the built in ast module

@da-woods
Copy link
Contributor

Dumb question but why not use the built in ast module

Maybe because Cython adds additional syntax that isn't covered by AST

@AstroFloof
Copy link

AstroFloof commented Feb 25, 2022

But for the pure python stuff I mean
I know that's not the whole scope of Cython, but isn't it a good starting point?

@scoder
Copy link
Contributor Author

scoder commented Feb 26, 2022

why not use the built in ast module

That's a tough call. It would be nice to just use ast – if we can switch entirely to Python syntax. Whether that's feasible and desirable in the end, I do not know. So, no ast for now. Having to maintain two parsers / syntax tree generators is certainly not desirable.

@da-woods
Copy link
Contributor

The other issue is: Cython is designed to be independent of the version of Python you use to run Cython. So run cython somefile.pyx on Python 3.7 should produce the same code as running it on Python 3.11 alpha.

Let's suppose we tried to implement match ...: case... using the ast module. That would mean that we could only compile match ...: case... statements on Python 3.10+. If/when we do implement match ...: case statements we would aim for them to work on all the Python versions that we support.

I'm also not sure what the AST looks like for PyPy (which we also support) and GraalPython (which we don't really support, but may do in sometime in the future). But it probably isn't the same as for CPython, which would mean you couldn't generate Cython code on those.


Looking/copying the structure of the Python parser used by ast is well worthwhile of course.

@scoder
Copy link
Contributor Author

scoder commented Feb 27, 2022

Yes, being able to generate the code directly on the platform that the user has on their side is a valuable advantage.

da-woods added a commit to da-woods/cython that referenced this issue Jun 18, 2022
I've tried to rewrite it to largely follow the rules from the most
recent version of the Python LL parser, so avoiding conditional
parameters.

Fixes one part of cython#4595
@da-woods
Copy link
Contributor

p_test() as entry point for expressions

This matches what the most recent version of the LL parser does so I don't think it needs changing (https://docs.python.org/3.8/reference/grammar.html):

test: or_test ['if' or_test 'else' test] | lambdef

certainly does not match the new PEG parser

As I understand the PEG parser (which isn't hugely well) it works on the basis of:

  • try to parse a bunch of tokens according to one set of rules,
  • If that fails then work from the same starting point according to the next set of rules,
  • (And if all the subrules in a rule fail, then it backtracks one level further up again).

The idea of #4813 is to make it easier to do the backtracking. However, I think the CPython PEG parser does a reasonable amount of internal caching to make the backtracking perform better. Since we don't have that, I think we might regret switching the whole thing to a PEG-like parser - it's probably best where needed rather than everywhere

@0dminnimda
Copy link
Contributor

Is the idea of using pegen + pxd worthy of investigation?

@da-woods
Copy link
Contributor

Is the idea of using pegen + pxd worthy of investigation?

Maybe. It adds an external dependency (which we try to avoid). But it's probably a dependency for Cython rather than the users (i.e. we'd distribute the files it generated).

The big benefit would be we'd have a grammar file that's automatically in-sync with Cython's parser, which would help people trying to do syntax highlighting or similar

The downsides would be:

  • we'd be replacing a large chunk of code that's tested and (mostly) works with something completely new. That's like to lead to a lot of regressions, especially for things like C++ templates where we probably don't have complete test coverage of everything that works right now,
  • I dunno how easy or hard it'd be to link to the current scanner (that's a complete "don't know"... might be perfectly simple)
  • It seems like it'd be hard to do gradually?

@0dminnimda
Copy link
Contributor

Why it's important to link to the current scanner?

@da-woods
Copy link
Contributor

Why it's important to link to the current scanner?

It isn't hugely important. It's just a bit of code that's been working well without many changes for a long time

@0dminnimda
Copy link
Contributor

0dminnimda commented Jun 26, 2022

If cython would change it's parser to pegen, should it happen in 3.0 or 3.1?
I mean, if a lot of old battle-tested code would be swapped, we would need to wait for some time to make sure that it works properly

@scoder
Copy link
Contributor Author

scoder commented Jun 26, 2022 via email

@0dminnimda
Copy link
Contributor

Is there no tests specifially for the parser/syntax? I see some error syntax tests here and there... So for parser we only have integration tests?

@scoder
Copy link
Contributor Author

scoder commented Jun 27, 2022 via email

@0dminnimda
Copy link
Contributor

0dminnimda commented Jun 27, 2022

It's somewhere between 3.2 and never.

Then what lowest python version needs to be supported?

@da-woods
Copy link
Contributor

I don't really understand the interest in replacing the parser. The advantages don't seem proportional to the effort involved.

I didn't have huge issues making pattern matching work with our existing parser (I know no-one else has looked at it yet, but nothing I wrote for it felt like a huge hack). So I don't think our current parser is actually holding anything back.

It's somewhere between 3.2 and never.

Then what lowest python version needs to be supported?

I don't want to speak for @scoder but my interpretation is:

  • it'd be too much effort to do for Cython 3.0 - Cython 3.0 is hopefully out sooner than we could make this kind of change
  • it doesn't need to be in Cython 3.0: if changing the parser was done right then it should be invisible to the user, and the priority for Cython 3.0 is to get breaking changes done in one go.
  • It isn't in Cython 3.1 because that is intended to be fairly quick after 3.0.
  • The version of Python probably isn't an issue - it'd be possible to ship the generated files rather than require users to run pegen.
  • So "3.2 to never" really means "maybe sometime in the future if there's a good reason to do it and someone is motivated to do the work"

@0dminnimda
Copy link
Contributor

If I understand correctly marking a name to be interned (via intern_ustring) means that this name is a constant?

@0dminnimda
Copy link
Contributor

0dminnimda commented Jul 2, 2022

  • The version of Python probably isn't an issue - it'd be possible to ship the generated files rather than require users to run pegen.

It's a problem if the generated code cannot be run on every python version, for example if it uses the walrus operator. But it's not a problem anymore

@0dminnimda
Copy link
Contributor

I wonder, could we take the path of the cpython and make old parser optional, but defaulting to the new one, so it could be released sooner. Or maybe make a new parser optional, so it'll be easier for people to try, and then deprecate the old one.

@da-woods
Copy link
Contributor

da-woods commented Jul 2, 2022

If I understand correctly marking a name to be interned (via intern_ustring) means that this name is a constant?

I think it means that we only store a single string for the name rather than lots of different identical strings. I think it's really a just a small memory saving thing. It's independent of the cached Python string objects generated in the Cython code (which happens much later).

I wonder, could we take the path of the cpython and make old parser optional, but defaulting to the new one, so it could be released sooner. Or maybe make a new parser optional, so it'll be easier for people to try, and then deprecate the old one.

That would make it lower risk, although it's only valuable if the new parser actually gets tested. I still don't really understand the benefit though.

@0dminnimda
Copy link
Contributor

Some of the benifits are:

  • Formulated grammar, which will make it easier to make tools for cython
  • Easy synchronization with cpython grammar
  • Reducing the maintenance costs
  • Closes this and some other tickets
  • Possible performance increase

Also Scanners are safe and sound, no need to replace them + some parts of the parser can be left to live, as it's easier to reuse them, rether than poring them

@0dminnimda
Copy link
Contributor

0dminnimda commented Jul 2, 2022

Additional validation can be done via parsing the cpython stdlib + top pypi packages and comparing the result to the result of the old parser

@0dminnimda
Copy link
Contributor

0dminnimda commented Jul 2, 2022

I think that you wanted to point not to the lack of benefits, but the fact that these benefits may not be worth it.

First of all, in order to fully understand whether it is worth it, it is necessary to make some minimum working version.
Secondly, a large number, probably most of the work lies on the one who makes the pr, so I think this will not weigh the reviewers, to such an extent that the changes cannot be merged

@da-woods
Copy link
Contributor

da-woods commented Jul 2, 2022

I think that you wanted to point not to the lack of benefits, but the fact that these benefits may not be worth it.

Yes. Essentially I'm

  • pretty convinced by the first benefit ("Formulated grammar, which will make it easier to make tools for cython"), although it's hard to know how many people will actually use it,
  • very sceptical of the last benefit ("better performance"), mainly because the current parser is largely compiled into cdef functions, and I'm not sure pegen is flexible enough to generate pxd files,
  • neutral on the rest.

@0dminnimda
Copy link
Contributor

  • I'm not sure pegen is flexible enough to generate pxd files,

I assure you, it will generate them.
Remeber that problem with the python version? Default pegen can only generate parger that uses the walrus operator.
Cython will use custom pegen (hopefully fork that'll be owned by the cython organization).
It will generate parsers that could be run by any python version, so cython could be used in the don't compile cython internals mode.
Pegen knows all of the variables and types when generating so pxd will be fully typed, yay

@scoder
Copy link
Contributor Author

scoder commented Jul 4, 2022

Note that this ticket is about refactoring the parser, not about replacing it.

Some of the benifits are:
* Formulated grammar, which will make it easier to make tools for cython

Definitely helpful, but not a clear reason to replace the parser of Cython. We can provide an external grammar either way, even though it's clearly easier to maintain if we use it ourselves. But it's not a must.

* Easy synchronization with cpython grammar

Possibly, although adapting the parser is usually the easiest part when implementing a Python feature. It's really not difficult to maintain.

* Reducing the maintenance costs

That's to be seen, but what it clear already is that there is a very high cost for replacing the parser initially. That means that it may simply not be worth it.

* Closes this and some other tickets

That has zero benefit by itself.

* Possible performance increase

Uncertainly.

scoder pushed a commit that referenced this issue Jul 16, 2022
… LL parser (GH-4846)

I've tried to rewrite it to largely follow the rules from the most
recent version of the Python LL parser, so avoiding conditional parameters.

See #4595
da-woods added a commit to da-woods/cython that referenced this issue Aug 21, 2022
Note that it wasn't correct before since it didn't pass the
correct flag to `p_lambdef` and thus was equivalent to just
using `p_lambdef`.

Note also that there's a difference in behaviour between Python3.9+
and before. Python <3.9 allowed `[i for i in range(10) if lambda: i]`
while Python >=3.9 disallows this. Arguably it's pointless because
the lambda always evaluates to True. See
python/cpython#86014
for the Python issue.

With this change Cython will follow the Python 3.9 behaviour at
the cost of potentially breaking some code that does use the
pattern above. If that isn't desirable I can produce an
alternative change that fixes p_lambda_nocond instead.

Part of cleanup in cython#4595.
da-woods added a commit to da-woods/cython that referenced this issue Aug 21, 2022
Note that it wasn't correct before since it didn't pass the
correct flag to `p_lambdef` and thus was equivalent to just
using `p_lambdef`.

Note also that there's a difference in behaviour between Python3.9+
and before. Python <3.9 allowed `[i for i in range(10) if lambda: i]`
while Python >=3.9 disallows this. Arguably it's pointless because
the lambda always evaluates to True. See
python/cpython#86014
for the Python issue.

With this change Cython will follow the Python 3.9 behaviour at
the cost of potentially breaking some code that does use the
pattern above. If that isn't desirable I can produce an
alternative change that fixes p_lambda_nocond instead.

Part of cleanup in cython#4595.
scoder pushed a commit that referenced this issue Sep 5, 2022
…-4992)

Note that it wasn't correct before since it didn't pass the
correct flag to `p_lambdef` and thus was equivalent to just
using `p_lambdef`.

Note also that there's a difference in behaviour between Python3.9+
and before. Python <3.9 allowed `[i for i in range(10) if lambda: i]`
while Python >=3.9 disallows this. Arguably it's pointless because
the lambda always evaluates to True. See
python/cpython#86014
for the Python issue.

With this change Cython will follow the Python 3.9 behaviour at
the cost of potentially breaking some code that does use the
pattern above.

Part of the cleanup in #4595
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants