Create a unified tech stack for jedi, rope, RedBaron, yapf, pylint, pep8, and similar tools and libraries #630

Closed
gotgenes opened this issue Oct 2, 2015 · 45 comments

Comments

@gotgenes

gotgenes commented Oct 2, 2015

Currently there exists duplication of effort around tooling for parsing, automated manipulation, and analyzing Python code. Each project in this space faces similar fundamental problems of requiring a data representation of the Python code of interest, most commonly represented as a syntax tree. Here is a list, by no means exhaustive, of tools and libraries that face the problem of needing a representation of Python code capable of inspection and possibly even manipulation:

The overlap between any of these projects is by no means complete, but the overlap between all of them is still significant, especially given they solve the same underlying problem of Python code representation. My hope here is we can examine whether the Python ecosystem really needs this many implementations of the Python syntax tree, and whether we can arrive at an ecosystem that has a unified base, with the diversity of effort occurring on more interesting problem spaces such as advanced refactoring and static analyses.

I am hoping to start the discussion here in jedi's GitHub Issues because of the project's recent announcement of a representation of Python code based upon an (improved?) fork of lib2to3, which, from my understanding should provide a usable syntax tree representation for both Python 2 and Python 3.

One question is whether the jedi library's API is wide enough to support projects beyond jedi, and, if it's not, what additions must be made to make it more universal? We must acknowledge that each tool and library mentioned in the list above may only need some portion of information from the syntax tree, but that a good underlying representation would have the union of such information required by all client tools and libraries. Clearly some requirements gathering must be done, but my hope is there is a strong and reconcilable overlap in requirements between the projects.

Another question is, should this library be broken out into its own Python distribution package? This would allow the fundamental underlying library to move at its own pace, independent of jedi's release schedule. This means jedi would become a client of the library, as I hope other libraries and tools could.

It also seems that the base representation, in being an accurate representation of Python code, may not be the most convenient or appropriate interface through which a client should interact with the code, suggesting room for higher level APIs, provided by separate libraries that wrap the base representation, for example, how pandas is to numpy. I think there exists the opportunity to create a library stack, in which the tools mentioned above become clients of the high-level libraries mentioned above, which in turn are clients of the base representation library. In fact, we can see discussion of these very ideas already within and between multiple projects, e.g., python-rope/rope#57, and PyCQA/baron#61. I think now is the time to get serious about these efforts.

Finally, I'd like to point to two inspirations that this collaboration of effort and unification can happen. The first inspiration is the scientific Python community's unification around the HDF5 Python stack, which I encourage you to read about in @scopatz's excellent blog post about that effort. The second inspiration comes from @royrapoport's description of the way Netflix works in his recent interview on Talk Python To Me. There, teams are permitted to make their own engineering decisions, but this inevitably creates duplication of effort. To counteract that duplication, Netflix periodically evaluates overlapping projects, and if their duplication is found technically unjustifiable, the project's maintainers must sit down with each other and figure out how to merge the code bases. While there's no fiduciary incentive in the case of the projects I've mentioned above here, there is still the incentive of reducing the amount of time, personally and collectively, spent solving very similar problems.

I think a similar effort here to unify would be a great win for Python tooling and the Python community. My hope is this opens up the conversation between these various projects.

cc @davidhalter @aligrudi @mcepl @Psycojoker @gwelymernans @hayd @jcrocholl @florentx @IanLee1521 @PCManticore

I will also try to reach out to maintainers of other projects whom I could not find on GitHub, but if anyone here has contacts mentioned or omitted but interested, please reach out to them as well.

@davidhalter, forgive me for hijacking your project's Issues for this.

Apologies also for anyone who is offended this discussion would happen on GitHub or in a GitHub issue. To me this seems the most open, publicly visible, conducive space to begin the conversation, though, as I suspect few people are members of the Python code-quality mailing list when compared to the number of people impacted and interested.

@mcepl

mcepl commented Oct 3, 2015

I feel the smell of Architecture Astronauts here. You quote python-rope/rope#57 but you have apparently missed my comment, especially part “The problem is that we don't have REAL maintainer.” Rope project so far was run pretty tightly on the rule Find the dependencies -- and eliminate them. and I am afraid we cannot move much from that rule.

So, generally I think it is a good idea (reuse is after all The Right Thing™), but I won’t spend a second on implementing it. If I get a pull request to https://github.com/python-rope/rope/ which will do the job, and it will pass our whole testsuite on my computer (with RHEL-6 and thus python 2.6) then I may start considering the idea.

@davidhalter
Owner

@gotgenes Thank you for bringing this up! I think collaboration is a nice thing! However, collaboration needs at least two parties to have the time to collaborate (not even talking about willingness). That's where I see the big problem.

I have personally talked to the pylint core devs a year ago. They were interested to use the Jedi parser (and some more business logic). It didn't happen. Why? Because they didn't have the time. Using a different parser is a huge task that requires fundamental changes to your core. This again entails changes in other code. Believe me I have done it with Jedi itself. It's a pain.

What I'm saying really is: If someone is interested to port, I think it makes sense. However, you need time and motivation. A lot of it. It's not paid work. So likely it won't happen for a very long time, except if you do it?! :)

@mcepl I really liked the Architecture Astronauts read. Good stuff :)

@gotgenes
Author

gotgenes commented Oct 6, 2015

@mcepl I see that rope is a refactoring library that could use a code representation that works for Python 3 (and Python 2), and jedi is a library with a code representation for Python 3 (and Python 2) that is in search of a refactoring library. So maybe the two projects could help each other. Or maybe not.

Rope project so far was run pretty tightly on the rule Find the dependencies -- and eliminate them. and I am afraid we cannot move much from that rule.

Understood. Like you said, Rope no longer has an active maintainer, and Python 3 support is still up in the air. jedi is not a heavy dependency. If it is too heavy for rope, as I mentioned/suggested above, the syntax representation library could be broken out into its own Python distribution package, making the dependency even lighter. Maybe jedi or the spinoff library could make rope easier to work with and maintain. Or maybe not.

So, generally I think it is a good idea (reuse is after all The Right Thing™), but I won’t spend a second on implementing it.

This is quite fair. Other ways exist to contribute besides implementation, though. Your insight on rope could help provide the requirements necessary for a syntax representation usable by rope.

@davidhalter, thanks for your reply. You said

I have personally talked to the pylint core devs a year ago. They were interested to use the Jedi parser (and some more business logic). It didn't happen.

Sorry it didn't pan out then. Is there any archive of this discussion that could inform of their requirements? Maybe this won't benefit pylint, but it will check the usability of the syntax representation provided by jedi. Was there anything else they found lacking, beyond time and effort?

@davidhalter
Owner

Sorry it didn't pan out then. Is there any archive of this discussion that could inform of their requirements?

No, it happened offline, at EuroPython.

Was there anything else they found lacking, beyond time and effort?

Not really. We haven't even talked about this that much, because it doesn't really matter if you don't even have the time to clear all the details.

@Psycojoker

Hello,

On my side things are quite simple: I would be really happy to see this happening because I don't care about maintaining my own parser. I only wrote it because I didn't know of the existence of lib2to3 at that time, and I'm interested in high level tools (regarding custom refactoring and projectional editing), not parsers.

One of my mid-term goals is to drop my parser and to switch to either the jedi one or the lib2to3 one combined with an adapter, but I have no idea when this will happen since I'm currently stuck in advanced refactoring and debugging (and a million other things in my life).

On my side, my needs for a python syntax tree are:

  • being lossless
  • having a json representation
  • each node should be responsible for its own internal formatting (lib2to3 and probably jedi don't do that, they use the strategy: a node is responsible for the formatting right behind itself)
  • being structured in a way designed for humans, not compiler (for example in baron a atomtrailer ("a.b.c(d)[e]") is a list, not a recursive structure, because a list is more intuitive and easier to handle for a human)
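
To make the last point concrete without relying on baron itself: the standard library `ast` module stores `a.b.c(d)[e]` as nested nodes with the outermost operation first. The sketch below flattens that nesting into a left-to-right list. `flatten_trailers` is a hypothetical helper written for this illustration and is not an API of baron, jedi, or `ast`.

```python
import ast


def flatten_trailers(node):
    """Flatten nested Attribute/Call/Subscript nodes into a left-to-right list.

    Hypothetical helper for illustration only.
    """
    parts = []
    # Walk from the outermost operation inwards, recording each trailer.
    while isinstance(node, (ast.Attribute, ast.Call, ast.Subscript)):
        if isinstance(node, ast.Attribute):
            parts.append(("attribute", node.attr))
            node = node.value
        elif isinstance(node, ast.Call):
            parts.append(("call", len(node.args)))
            node = node.func
        else:
            parts.append(("subscript", None))
            node = node.value
    # Whatever remains is the head of the chain, usually a plain name.
    head = node.id if isinstance(node, ast.Name) else type(node).__name__
    parts.append(("name", head))
    parts.reverse()
    return parts


expression = ast.parse("a.b.c(d)[e]", mode="eval").body
chain = flatten_trailers(expression)
```

Here `expression` is a `Subscript` wrapping a `Call` wrapping two `Attribute` nodes, while `chain` lists the same pieces in reading order.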

Big bonus:

  • static analysis
  • supporting both python2 and 3 syntax
  • being easily modified if needed
  • good performance (because this enables new behaviour: a parser that takes less than half a second can be run constantly, for example)

Jedi is the closest tool I know to those needs and adding the needed code to match them seems doable.

On a broader view, I really believe that python will greatly profit from having a tool of reference for those tasks, especially if it's well documented and accessible but I don't have the set of skills needed to lead this project nor the time like most of you.

Thanks everyone for all your work :)

@bwendling

For the YAPF project, we've run into a lot of issues having to do with the lib2to3 library. The way it handles comments is ... difficult at best and we needed to jump through several hoops to add them as first-class citizens to the AST. Secondly, lib2to3 doesn't yet fully implement the newest Python language additions. Also, it "simplifies" the AST, so that some nodes are removed (or more to the point "collapsed") if they don't offer anything of value to lib2to3.

So, in general, YAPF would love to have a new parser, but as others mention, it's getting the time to help with the project that's a problem. And as mentioned before, we built YAPF around lib2to3, and replacing it would be difficult. Also, I want to rely upon as few external projects as possible.

All of that said, if I can help with the project I will try. I find such a goal very appealing.

@asmeurer
Contributor

asmeurer commented Oct 7, 2015

A problem I found with Jedi's ast (and I admit that I didn't play with it for very long, so there may already be a way around this) is that because it stores whitespace information (in order to be round-trippable), it's more challenging to do refactoring or anything that involves creating ast nodes and injecting them into the tree. I played around with this once and wasn't very satisfied with how things worked. For instance, it was hard to inject a node into a loop body because you'd have to figure out how to indent it. To me, for a nice roundtripping AST library this would "just work".
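
As a rough sketch of what "just work" could mean at the text level, the helper below appends a statement to a block and copies the indentation of the existing body. `append_to_block` and `indentation` are hypothetical names, not part of Jedi or any library mentioned here, and the sketch assumes the body starts on the line after the header and contains no multi-line strings.

```python
def indentation(line):
    """Return the leading whitespace of a line."""
    return line[: len(line) - len(line.lstrip())]


def append_to_block(source, header_index, statement):
    """Append `statement` to the block whose header is at `header_index`.

    Hypothetical helper: the new line reuses the indentation of the
    first existing body line, so the caller never computes it.
    """
    lines = source.splitlines()
    header_indent = indentation(lines[header_index])
    body_indent = indentation(lines[header_index + 1])
    last_body_line = header_index + 1
    for index in range(header_index + 1, len(lines)):
        line = lines[index]
        if not line.strip():
            continue  # blank lines do not end the block
        if len(indentation(line)) <= len(header_indent):
            break  # dedent: the block is over
        last_body_line = index
    lines.insert(last_body_line + 1, body_indent + statement)
    return "\n".join(lines) + "\n"


original = "for x in xs:\n    print(x)\nprint('done')\n"
updated = append_to_block(original, 0, "total += x")
```

A real round-tripping tree would do this on nodes rather than lines, but the indentation inference is the same idea.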

Another nice thing to have would be for the library to be able to handle syntax for any (supported) version of Python, even ones other than the one that the code is running in. A common source of invalid bug reports to pyflakes comes from people complaining about syntax errors because they ran it in Python 2 and the code only works in Python 3 (or vice versa).
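
The version coupling is easy to reproduce with the standard library `ast` module, which only accepts the syntax of the interpreter it runs under. The sketch below assumes it is run on Python 3; `parses_on_this_interpreter` is a hypothetical helper name.

```python
import ast


def parses_on_this_interpreter(source):
    """Return True if the running interpreter's parser accepts `source`."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# A Python 2 print statement versus the Python 3 print function.
py2_style_ok = parses_on_this_interpreter("print 'hello'\n")
py3_style_ok = parses_on_this_interpreter("print('hello')\n")
```

Under Python 3 the first call reports a syntax error even though the code is valid Python 2, which is exactly the kind of report described above.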

@davidhalter
Owner

For instance, it was hard to inject a node into a loop body because you'd have to figure out how to indent it. To me, for a nice roundtripping AST library this would "just work".

I agree. There's a lot of "utility" functions missing right now. But it's not like we couldn't improve that. Jedi has a very distinct set of usages that doesn't include refactoring, yet. So therefore we have not implemented such things.

@Psycojoker

each node should be responsible for its own internal formatting (lib2to3 and probably jedi don't do that, they use the strategy: a node is responsible for the formatting right behind itself)

What does that mean?

being structured in a way designed for humans, not compiler (for example in baron a atomtrailer ("a.b.c(d)[e]") is a list, not a recursive structure, because a list is more intuitive and easier to handle for a human)

I think this is not something Jedi would do, because it's not how the grammar file works. However, at the same time it's debatable.

@gwelymernans

Also, it "simplifies" the AST, so that some nodes are removed (or more to the point "collapsed") if they don't offer anything of value to lib2to3.

Hmm, can you give more examples than comments?

@Psycojoker

For instance, it was hard to inject a node into a loop body because you'd have to figure out how to indent it. To me, for a nice roundtripping AST library this would "just work".

I agree. There's a lot of "utility" functions missing right now. But it's not like we couldn't improve that. Jedi has a very distinct set of usages that doesn't include refactoring, yet. So therefore we have not implemented such things.

@asmeurer @davidhalter This is exactly the kind of situation I have solved/I'm solving in RedBaron (which is a high level api to refactor code while not having to take care about low level stuff). See those slides https://psycojoker.github.io/fosdem-redbaron/presentation.html#slide35 or the related documentation http://redbaron.readthedocs.org/en/latest/modifying.html#code-block-modifications http://redbaron.readthedocs.org/en/latest/proxy_list.html

Disclaimer: making it work in a generic and expected way is fucking hard. I might write some text on my algos one day (they aren't really clever, just super super annoying to write and debug) but they aren't all stable right now.

@davidhalter

each node should be responsible for its own internal formatting (lib2to3 and probably jedi don't do that, they use the strategy: a node is responsible for the formatting right behind itself)

What does that mean?

It's kinda hard to describe this well with text, if I fail again I'll do some schema :/

Let's use an example.

In [1]: from baron.helpers import show
In [2]: show("a = 1 + 2")
[
    {
        "first_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "target": {
            "type": "name", 
            "value": "a"
        }, 
        "value": {
            "first_formatting": [
                {
                    "type": "space", 
                    "value": " "
                }
            ], 
            "value": "+", 
            "second_formatting": [
                {
                    "type": "space", 
                    "value": " "
                }
            ], 
            "second": {
                "section": "number", 
                "type": "int", 
                "value": "2"
            }, 
            "type": "binary_operator", 
            "first": {
                "section": "number", 
                "type": "int", 
                "value": "1"
            }
        }, 
        "second_formatting": [
            {
                "type": "space", 
                "value": " "
            }
        ], 
        "operator": "", 
        "type": "assignment"
    }
]

Here you can see an assignment node combined with a binary_operator node. If you look at "first_formatting" and "second_formatting" of the first level (in the first dictionary), those are the formatting around the "=", those are the formatting of the assignment node because they are inside the assignment node.

On the lib2to3 side (if my understanding of it is correct), the space after the "=" is handled by "1" because it is before "1" (and the logic is the same for the other formatting information).

I have made this choice in baron (while it's more complicated to handle) because this allows me to reason about nodes as independent, self-contained units that can therefore be extracted and moved around without any problems, while in a lib2to3-like situation this would have been way more complicated and full of special cases.

I hope that my explanation makes sense; don't hesitate to tell me if it's not the case :)
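
A minimal sketch of why node-owned formatting makes nodes self-contained, using plain dictionaries shaped like the output above. The `render` function is hypothetical and is not baron's implementation; in particular, treating the assignment's "operator" field as a prefix to "=" is an assumption made for this illustration.

```python
SPACE = {"type": "space", "value": " "}

# Same shape as the dump above, written compactly.
TREE = {
    "type": "assignment",
    "operator": "",
    "target": {"type": "name", "value": "a"},
    "first_formatting": [SPACE],
    "second_formatting": [SPACE],
    "value": {
        "type": "binary_operator",
        "value": "+",
        "first": {"type": "int", "section": "number", "value": "1"},
        "first_formatting": [SPACE],
        "second_formatting": [SPACE],
        "second": {"type": "int", "section": "number", "value": "2"},
    },
}


def formatting(items):
    """Join the text of a list of formatting nodes."""
    return "".join(item["value"] for item in items)


def render(node):
    """Render a node back to source (hypothetical, covers three node kinds)."""
    kind = node["type"]
    if kind in ("name", "int"):
        return node["value"]
    if kind == "binary_operator":
        return (render(node["first"]) + formatting(node["first_formatting"])
                + node["value"] + formatting(node["second_formatting"])
                + render(node["second"]))
    if kind == "assignment":
        # Assumption: "operator" holds an augmented-assignment prefix such as "+".
        return (render(node["target"]) + formatting(node["first_formatting"])
                + node["operator"] + "=" + formatting(node["second_formatting"])
                + render(node["value"]))
    raise ValueError("unsupported node type: " + kind)


whole = render(TREE)
extracted = render(TREE["value"])
```

Because the spaces around "+" live inside the binary_operator node, rendering that node alone still yields well-spaced text, with no need to consult its former neighbours.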

being structured in a way designed for humans, not compiler (for example in baron a atomtrailer ("a.b.c(d)[e]") is a list, not a recursive structure, because a list is more intuitive and easier to handle for a human)

I think this is not something Jedi would do, because it's not how the grammar file works. However, at the same time it's debatable.

I would totally understand that you wouldn't want to do that. On my side I'm thinking that being intuitive and easier to use prevails over this kind of technical limitation, since such limitations can be fixed (my goal is to make the task "writing code that modifies source code" as easy and as realistic as possible).

@asmeurer
Contributor

asmeurer commented Oct 7, 2015

RedBaron looks like a very nice abstraction. I'll have to play around with it. I admit I'd be more excited about it if it were BSD licensed. How is the performance?

@asmeurer
Contributor

asmeurer commented Oct 7, 2015

Also, how does it handle partial AST (like a =)? That's pretty important for libraries like Jedi.

@Psycojoker

@asmeurer

RedBaron looks like a very nice abstraction. I'll have to play around with it. I admit I'd be more excited about it if it were BSD licensed. How is the performance?

Not very good in comparison to other tools, to be honest :( (but totally okay for live refactoring like here; the video is in French, but you should get an idea from the code I'm writing). I haven't spent any time at all on optimisation. PYP helps in big jobs.

Also, how does it handle partial AST (like a =)? That's pretty important for libraries like Jedi.

It doesn't. Moving to the jedi parser could both solve this problem and improve performance (and bring static analysis).

RedBaron is also in alpha; expect bugs, but you can already do real work with it (people have already done so).

Those are the reasons why I haven't made that much advertisement about RedBaron.

@davidhalter
Owner

On my side I'm thinking that being intuitive and easier to use prevails over this kind of technical limitation, since such limitations can be fixed

I can understand that. It's good for a refactoring library. I might even change that as well - haven't thought about it a lot.

I have made this choice in baron (while it's more complicated to handle) because this allows me to reason about nodes as independent, self-contained units that can therefore be extracted and moved around without any problems, while in a lib2to3-like situation this would have been way more complicated and full of special cases.

I have intentionally not done anything like this, because it would use more space. I have spent quite a bit of time optimizing the space used in Jedi (of the parser).

However, something like this would still be possible with helper functions IMO.

@IanLee1521
Contributor

In general, I'm all for consolidating several different ways of doing something together. I've not done much with AST parsing, but I would be willing to help out a bit.

That said, one of the stated goals of pep8 (which I jumped into maintaining about a year ago) is to be a single distributable file that only relies on the Python standard library. Currently the code works by parsing the file line by line to do the linting. Perhaps that requirement could go away... Not willing to make that call right at the moment though. :)

cc @sigmavirus24 as the author / maintainer of flake8.

@sigmavirus24

So, I've just now had the chance to really dig into this issue and read the thread (thanks for pinging me @IanLee1521, although I'm merely the maintainer of flake8). Flake8 has three core dependencies: pep8, pyflakes, and mccabe

In general, pep8 avoids using the AST because it can be slow and cumbersome to parse. PyFlakes, however, (cc @myint and @bitglue) is almost entirely reliant on AST parsing and traversal. Further, mccabe uses AST parsing and traversal to attempt to calculate McCabe complexity values for functions and other blocks of code.
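
For readers unfamiliar with this kind of traversal, here is a simplified decision-point count over the standard library `ast`. It is only an approximation written for this illustration and is not the algorithm used by the mccabe package; `approximate_complexity` and `BRANCH_NODES` are hypothetical names.

```python
import ast

# Node types treated as one extra path each (an assumption of this sketch).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approximate_complexity(source):
    """Return 1 plus the number of decision points found in `source`."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(values):
    for value in values:
        if value < 0 and value % 2:
            return "odd negative"
    return "other"
'''

score = approximate_complexity(SAMPLE)
```

Nothing here needs whitespace or comments, which matches the later remark that these tools do not require a round-trippable tree.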

I'll leave PyFlakes' decision up to them, but speaking for mccabe (as the sole active maintainer) I'm not certain I would immediately have time to switch to a new engine. The other thing is that the beauty of each of Flake8's dependencies, is that none of them in turn have dependencies, so it's rather hard to break Flake8 by installing a new version of some transitive dependency. I quite like this and adding a new dependency for two of the three could mean nightmarish complexity for me as the sole maintainer of Flake8. Speaking from a position of experience with pyflakes and mccabe, neither need the AST to be round-trip-able, that said, it probably wouldn't hurt us if it were.

Speaking as a core reviewer for a different tool, bandit, I think this might be something other cores might be interested in (cc @chair6 @tmcpeak @callidus @ericwb in no particular order). I'm not confident that Bandit doesn't need round-trip-able AST but again it certainly wouldn't hurt. And I know that our core team is slightly averse to adding too many dependencies to the tool.

That said, I think all the projects I'm involved with would be comfortable adding a dependency on an extremely stable version of the software. What does stability mean:

  1. No backwards incompatible changes between N.M.0 and N.M+1.0 (or N.M.*).
  2. A very long-term focused solution in the hopes that we wouldn't need to update too often
  3. A lightweight, yet powerful library. In other words, make working with the AST simple, flexible, and (dare I say it) possibly enjoyable without adding features that would be better served by third party extension libraries
  4. A large (3+) maintenance team that's committed to maintaining the library

All that said, I'd be happy to host this in the PyCQA organization and add as many people as are interested in working on this. All projects are welcome there and I'm happy to make the team much larger.

@bitglue

bitglue commented Oct 10, 2015

Speaking for pyflakes, it's been working for something like a decade without really any major changes except to support new language features as they are released. It does one thing and it does it very well. It's gone very far on the philosophy that it should not try to be more clever than the developer. What would be the benefit to switching the underlying parsing library? Would it be faster? More accurate?

@davidhalter
Owner

@bitglue I agree. It doesn't make any sense to switch for you. My idea was more into the direction of pylint. I like the simplicity of pyflakes and I would not want to complicate it.

That said, somewhere in the distant future it could be interesting to switch, because once it's really mature it could provide you with partial file parsing, which would make repeated pyflakes checks much faster.

@sigmavirus24 I agree with your list of requirements. However, I think that it's going to be hard to find enough people that are dedicated towards a common parser. For example I have very unique requirements in Jedi. Jedi's parser needs to be able to do error recovery, partial file parsing (caching parts of the file and not reparsing it) and round-tripping needs to be possible.
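
To give a feel for the caching part of those requirements, here is a deliberately naive sketch that reparses only the chunks of a file whose text changed. It is not how Jedi's parser works: `ChunkCache` is a hypothetical class, it uses the standard library `ast`, and it assumes top-level statements are separated by two blank lines and that each chunk parses on its own. Error recovery is left out entirely.

```python
import ast
import hashlib


class ChunkCache:
    """Cache parsed chunks keyed by a hash of their text (hypothetical)."""

    def __init__(self):
        self._trees = {}
        self.parse_count = 0  # how many chunks were actually parsed

    def parse(self, source):
        trees = []
        # Assumption: two blank lines separate independent top-level chunks.
        for chunk in source.split("\n\n\n"):
            if not chunk.strip():
                continue
            key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if key not in self._trees:
                self._trees[key] = ast.parse(chunk)
                self.parse_count += 1
            trees.append(self._trees[key])
        return trees


cache = ChunkCache()
version_1 = "def f():\n    return 1\n\n\ndef g():\n    return 2\n"
version_2 = version_1.replace("return 2", "return 3")

cache.parse(version_1)
parses_after_first = cache.parse_count
cache.parse(version_2)
parses_after_edit = cache.parse_count
```

After the edit only the changed function is parsed again. Note that line numbers inside each cached tree are relative to its chunk, one of many details a real implementation has to handle.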

My idea was never to combine all those tools at the moment - it would be nice - but probably a huge amount of work with a lot of annoyed people. For now I would mostly keep things as they are and try to start combining one or two projects that really need a new parser. I think the Jedi parser fits some projects very well, but for others it's just too complicated.

@almarklein

I'd also be interested in this for PyScript (a Python to JS transpiler).

I actually just implemented a module that generates a consistent AST tree for different Python versions by making use of Python's ast module and converting it to a common format. Some info on that here: https://github.com/almarklein/commonast

My needs are mostly consistency and performance, though having something that works in pure Python could open some awesome doors. What I have now serves my needs, and I won't have time to work on something like what is proposed here, though I would be interested in adopting it if it happens.

@Carreau
Contributor

Carreau commented Oct 30, 2015

My guess is that it might interest people from hy, and py.test might be interested.

Just a version of the AST which is stable across Python versions (in my case 3.x is sufficient) would be a helpful first step.

@takluyver
Contributor

python-modernize is another project which may be able to benefit from this. It needs roundtripping and the ability to parse code from other Python versions. It's currently built on top of lib2to3, but there are various annoyances that are harder than they should be to fix because the code responsible is in 2to3.

However, as some other people have mentioned, I'm not sure that we have the time/energy to move it to a new parser (I'm not volunteering; cc @daira). By its nature, modernize doesn't really have many regular users willing to invest time in significant changes to it. I think the most plausible route for python-modernize would be a fork of 2to3 which could be gradually improved.

@edschofield's Python-Future includes the futurize and pasteurize code-rewriting tools, which are also based on 2to3, so he may also be interested in this.

If we can get a concrete proposal together (~what we want to build, what projects can use it), I think this is exactly the kind of infrastructure that the PSF might sponsor someone to work on. That might make life easier for busy maintainers.

@gotgenes
Author

gotgenes commented Apr 6, 2016

I know it's been a while since I raised this issue. I have attempted to compile all the concrete feature requirements for a common syntax tree that have been listed by the various parties, and I have tried to attribute respective parties to each feature requirement. If I have omitted a requirement or notation of the party affiliated with that requirement, my apologies; please note the omission and I will update the list.

My hope is that we can look at this list of requirements and see the common threads, or reason if any of them are mutually exclusive and what would be a compromise. I'm also hoping this will rekindle interest.

  1. The ST should allow for easy traversal, e.g., for calculating McCabe complexity values (@sigmavirus24)

  2. The ST should keep a "lossless" representation of the code; it should preserve spacing information, i.e., should round-trip back to identical contents, including spacing (@Psycojoker, @takluyver, @asmeurer, @davidhalter)

  3. The ST should support use as a pure AST, allowing the user to ignore spacing information. (@asmeurer)

  4. It should be possible to add a new node to the ST and have the ST determine the appropriate indentation for the node. (@asmeurer)

  5. Association of spacing should be appropriate to the context; for example, in the expression

    1 + 2
    

    the spacing between the 1 and the +, is associated with the +. The spacing between the + and the 2 is also associated with the +. (@Psycojoker)

  6. The ST should handle a partial syntax tree, e.g., a = (@asmeurer)

  7. The ST should support error recovery (@davidhalter)

  8. The ST should support partial file parsing (@davidhalter)

  9. The ST should support caching (@davidhalter)

  10. The ST should easily serialize to [and deserialize from?] JSON. (@Psycojoker)

  11. The ST should have an intuitive API built for human interaction; this is not designed for compilers (@Psycojoker)

  12. The ST should support Python 2 and Python 3 syntax (@Psycojoker)

  13. The ST should support syntax for versions of Python different than the one on which the program using the ST is running (@asmeurer, @takluyver)

  14. The ST should easily accommodate new additions to the language (@gwelymernans)

  15. The ST should be easy to extend (@Psycojoker) [Unclear if this is redundant with above requirement.]

  16. The library should focus on a simple, flexible ST, leaving higher level features to third-party extension libraries. (@sigmavirus24)

  17. The ST should be performant (@Psycojoker)

  18. An implementation of the ST should be available in pure Python (@almarklein)
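
As one small data point for requirement 10, the standard library `ast` tree can be turned into JSON-compatible data in a few lines. This sketch is lossy (the stdlib tree carries no whitespace or comments), so it does not meet requirement 2; `to_jsonable` is a hypothetical helper, not an API of any project in this thread.

```python
import ast
import json


def to_jsonable(node):
    """Convert an ast node (or list/primitive) into JSON-compatible data."""
    if isinstance(node, ast.AST):
        result = {"type": type(node).__name__}
        for name, value in ast.iter_fields(node):
            result[name] = to_jsonable(value)
        return result
    if isinstance(node, list):
        return [to_jsonable(item) for item in node]
    if node is None or isinstance(node, (str, int, float, bool)):
        return node
    # Fallback for values JSON cannot hold directly (bytes, complex, ...).
    return repr(node)


data = to_jsonable(ast.parse("a = 1 + 2"))
text = json.dumps(data, indent=2, sort_keys=True)
```

The resulting structure is recursive in the way a compiler's tree is, which is the contrast drawn in requirement 11.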

@gnprice

gnprice commented Jun 14, 2016

For use cases where it's important to parse Python code (a) across different Python versions and (b) quickly, the typed_ast project may be a good choice. This is work that @ddfisher did quite recently -- it's new since this thread started -- for Mypy, which is a typechecker for PEP 484 static types.

The standard library's ast module is pretty great in some important ways:

  • It's very fast, because it's a thin wrapper around CPython's own parser.
  • It produces a nice AST that reflects the structure of Python code well.

The trouble with it from our perspective was twofold:

  • Because it's a thin wrapper around the Python interpreter's own parser, the syntax you parse is the syntax of the Python version your program is itself running under -- so a Python 3 program can't use ast to parse Python 2 code.
  • We need to parse PEP 484 type annotations, some of which appear in comments, which the Python interpreter and ast naturally discard.

So we borrowed the CPython parser -- forked it, effectively -- and fixed those issues:

  • We modified the CPython parser slightly to make an extension module you can import independently of the version of Python your own program using the parser is running on, and did so for both CPython 2.7's and CPython 3.5's parsers so that you can parse either Python 2 or Python 3 code.
  • We added full support for PEP 484 type annotations.
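
For a sense of what keeping type comments looks like in practice: since Python 3.8 the standard library `ast.parse` accepts a `type_comments=True` flag. That flag postdates this thread and is stdlib API rather than typed_ast's own interface, so the sketch below only illustrates the idea and assumes Python 3.8 or newer.

```python
import ast

SOURCE = "scores = []  # type: List[int]\n"

# By default the comment is discarded, as described above.
plain_tree = ast.parse(SOURCE)
plain_comment = plain_tree.body[0].type_comment

# With type_comments=True the parser attaches it to the Assign node.
typed_tree = ast.parse(SOURCE, type_comments=True)
typed_comment = typed_tree.body[0].type_comment
```

Ordinary comments are still dropped in both cases; only comments that start with `# type:` are kept.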

Because Python's syntax is quite stable and so is CPython's parser -- a new version every year or two, generally with modest changes -- we're quite comfortable with maintaining this "fork" as Python 3.6, 3.7, and so on come out with new syntactical features.

At present typed_ast will not be helpful for code-transformation tools that need to be able to render the source code back out losslessly. I think that may not actually be hard to fix, though, and we may even do so ourselves just for the sake of clear error messages in Mypy: if you track the byte offsets in the file at which a given node starts and ends (just like the CPython parser already tracks the line and column numbers at which it starts), you can always read the exact form of any node or of the whitespace between nodes out of a single string containing the original file.
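
The offset idea can be tried out with the standard library today: since Python 3.8, `ast` nodes carry end positions and `ast.get_source_segment` slices the original text for a node. This is stdlib behaviour on Python 3.8 or newer, not something typed_ast provided at the time of this comment.

```python
import ast

# Irregular spacing on purpose, to show that it survives.
SOURCE = "total  =  price *   quantity\n"

tree = ast.parse(SOURCE)
binop = tree.body[0].value

# Start and end positions recorded by the parser (lines are 1-based).
span = (binop.lineno, binop.col_offset, binop.end_lineno, binop.end_col_offset)

# Slice the original file text instead of regenerating code from the tree.
exact_text = ast.get_source_segment(SOURCE, binop)
```

The whitespace between nodes can be read out of the same string by slicing between one node's end position and the next node's start position.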

It's unlikely typed_ast will ever handle proceeding past a syntax error. It'll also never be pure Python, though because the interface is essentially the same as the stdlib's ast, someone else may already have a pure-Python implementation of that interface.

Although typed_ast was originally created for Mypy's needs, we broke it out separately precisely because we think it may be useful for other tools and are happy to share. If it seems like a potential fit for the use cases of anyone here, please try it out -- we'd be glad to hear from you with issues and/or pull requests.

@Carreau
Contributor

Carreau commented Jun 14, 2016

Side note: I saw recently that the PyCQA (Python Code Quality Authority) was created, and a few of the projects cited above have moved there.

@gotgenes
Author

@gnprice Thanks for contributing that information about typed_ast to the discussion; that's very helpful as I was unaware of that project. Sorry for the omission.

@Carreau, Good call! I also recently stumbled upon the PyCQA. I would think this discussion aligns very closely with their goals. @IanLee1521, apparently you are a member of this organization. Have you raised this to PyCQA's attention?

@sigmavirus24

@gotgenes I'm the founder of the PyCQA and on this thread and watching. This is something interesting to me also as the maintainer of Flake8 and a core contributor to pyflakes. typed_ast is something I had heard about happening and am excited to experiment with. Specifically it will allow flake8 to silence warnings when someone is using the typing module on Python 2.7 with the magic # type: comments that mypy-lang supports. We do, however, still need something like RedBaron to support automatic fixing of problems found by Flake8 plugins.

@gnprice

gnprice commented Jun 14, 2016

@gotgenes Nothing to apologize for! typed_ast was developed just this year as we started putting a lot of development effort into mypy, so when you wrote your original comment at the start of this thread (last October) it didn't yet exist and there was nothing to be aware of. :) Happy to share the news.

@Carreau
Contributor

Carreau commented Jun 14, 2016

I'm wondering whether, given the number of people involved, we should try to find some time at a conference that most of us attend and hold a BoF / sprint / something similar.

@sigmavirus24

@Carreau I'd much rather a virtual sprint/BOF instead. I don't go to many conferences.

@IanLee1521
Contributor

@gotgenes -- I am a member (pycodestyle). If you're asking the question of "could such a package live in that organization" I suspect @sigmavirus24 would agree and we could host it there.

@Carreau -- We in the PyCQA held a couple of smaller open space gatherings at PyCon last month that were fairly successful. I agree, though, that it would be nice not to limit this to in-person meetings, as I also probably won't be at any other Python conferences this year. :)

@sigmavirus24

"could such a package live in that organization" I suspect @sigmavirus24 would agree and we could host it there.

I'm not sure which package we're talking about, but the PyCQA aims to be welcoming to new projects and members. I made it so we could work towards reducing the bus-factor on the projects involved because most of them are single-person operations (with the exception of Pylint).

@gotgenes
Author

@sigmavirus24 Sorry for my oversight, I missed that you were also on this thread. It sounds from your comments and @IanLee1521 that we could (or even should) move this to something under PyCQA. Let me know if I can help with that, including consolidating what's been said here into another location.

@sigmavirus24

@gotgenes no worries. I'd still like clarification around what we're moving to the PyCQA, but yeah. I think we can move some kind of unified AST under-library for these efforts there if the authors/contributors/maintainers are cool with it.

@davidhalter
Owner

Just to update you all:

I'm currently fixing a lot of small issues that I've had with the Jedi parser. I might be able to publish it in two months or so.

I definitely think that this parser would provide a lot of value to certain Python projects.

@Nurdok

Nurdok commented Jul 2, 2016

Hi all,

I'm the owner of PyCQA/pydocstyle and I started working on replacing our underlying parsing engine. I explicitly want to use a 3rd party parser so that it won't be a part of pydocstyle. We currently have a crude parsing implementation which is very fragile and buggy.
Today I tried to replace our internal engine with pylint's astroid engine, but I found that it didn't keep enough low-level information about the code (specifically, I'm interested in the exact formatting of docstrings, and not just the resulting string object).
I then tried using redbaron and found it much better in that I was able to easily access the raw docstring (although I had to "fish" for it myself - i.e., look for a string expression, instead of it parsing it automatically). However, I understand that redbaron has some performance issues, so that might be a deal breaker.
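To make the "raw docstring" distinction concrete, here is a small sketch using only the stdlib. It is not how redbaron, astroid, or pydocstyle works; it assumes Python 3.8+ (for `ast.Constant` string nodes and `ast.get_source_segment`), and the sample function is made up.

```python
import ast

source = (
    "def area(r):\n"
    "    r'''Return the area.\n"
    "\n"
    "        Indented detail.   \n"
    "    '''\n"
    "    return 3.14159 * r * r\n"
)

func = ast.parse(source).body[0]

# AST-level view: the string prefix, quotes and indentation are already gone.
cleaned = ast.get_docstring(func)

# "Fishing" for the literal: a docstring is a string expression that is
# the first statement of the body.
raw = None
first = func.body[0]
if (
    isinstance(first, ast.Expr)
    and isinstance(first.value, ast.Constant)
    and isinstance(first.value.value, str)
):
    # Slice the exact literal, as written, out of the original source.
    raw = ast.get_source_segment(source, first.value)

print(repr(cleaned))
print(repr(raw))
```

A style checker that cares about quote style, the `r` prefix, or trailing whitespace needs `raw`; `cleaned` has already lost all of that.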

I'm not sure if I can commit to maintain a parsing library (my wife is expected to be giving birth any day now, so I'll have very little time), but I am definitely going to spend the time and replace pydocstyle's parsing engine.

@Psycojoker

hello @Nurdok,

Yes, (red)baron's performance is not good, mostly because I've been focusing my efforts on providing a nice API that makes it easy to do things that were (very) hard or way too annoying to do before. I haven't done any work on performance yet; that's for later on my todo list (but I won't block contributions in that direction as long as they don't reduce code maintainability too much).

For your situation I would rather look at either the jedi or the lib2to3 parser, since they keep formatting information and are very fast. They don't store it the same way as (red)baron, though: there, a token is responsible for the formatting that precedes it (or follows it, I'm not sure which). Be aware that lib2to3 is known to have some bugs: PyCQA/baron#61 (comment) (and there is a bug report about that on some tool that uses it, but I can't find it anymore). I don't have information regarding the state of jedi.

Cheers,
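For readers wondering what "a token is responsible for its formatting" looks like in practice, here is a hedged sketch. It uses parso (the parser that later came out of jedi, not lib2to3 itself) and assumes parso is installed; there, each leaf carries the whitespace and comments that precede it in a `prefix` attribute.

```python
import parso

code = "x   =  1  # set x\n\n\ny = x+2\n"
# An explicit grammar version keeps the result independent of the
# interpreter running this script.
module = parso.parse(code, version="3.8")

# Walk every leaf (token) in source order.
leaves = []
leaf = module.get_first_leaf()
while leaf is not None:
    leaves.append(leaf)
    leaf = leaf.get_next_leaf()

for leaf in leaves:
    # prefix: whitespace/comments before the token; value: the token text.
    print(repr(leaf.prefix), repr(leaf.value))

# Because every character lives in exactly one prefix or value,
# concatenating them reproduces the original source.
rebuilt = "".join(leaf.prefix + leaf.value for leaf in leaves)
```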

@mcepl

mcepl commented Oct 24, 2016

Just let me mention here https://github.com/Microsoft/language-server-protocol ... it seems that big boys are now uniting under it.

@alexbw
Contributor

alexbw commented Mar 12, 2017

This thread is pretty powerful stuff. Just wanted to check in and see what the current status of de-duping parser/ast backends is. Is Jedi the project with the most active development in making lib2to3 palatable?

@davidhalter
Owner

davidhalter commented Mar 12, 2017

Probably. Jedi also adds some other stuff to it. The problem at the moment is that I need to figure out a good way to decouple that stuff. I'm pretty far along; I won't make any promises again, but it's getting better and better.

I'd like to keep the parser as generalized as possible to allow parsing other languages as well.

@davidhalter
Owner

Also: the API is just not where it will be. Right now it is quite weird and complicated, and it will become much easier to use.

@alexbw
Contributor

alexbw commented Mar 12, 2017 via email

@davidhalter
Owner

I have started a separate discussion about the progress of the parser: #895. I think this is a first working prototype that should still be expanded and better documented. But for most use cases it's already better than what you can get elsewhere.

Let's also talk about your concerns for a roadmap there. I would also need a timeframe, i.e. when you would need which feature.

@Nurdok

Nurdok commented Mar 15, 2017

I am working on using Jedi's parser as the underlying parser for pydocstyle in PyCQA/pydocstyle#240. It seems like the API is currently undocumented and #895 might change this. Should I wait until the parser is released separately?

@davidhalter
Owner

I'm currently creating an API that all of you can use. It's in progress, and the API will not change much anymore. I would actually appreciate it if you played with it and told me what's wrong. Most things are fixed anyway, because Jedi uses it. Just use #895 for feedback.

But it will take some time until it is released. Probably this summer.

@davidhalter
Owner

Hi everyone

It's finally done.

https://github.com/davidhalter/parso
https://parso.readthedocs.io/en/latest/

Features include:

  • Parsing each Python version in each Python version
  • Good Error Recovery
  • Round Trips
  • Finding multiple syntax/indentation errors per file
  • Finds all syntax errors, even the ones that CPython's ast.c, compile.c and others raise
  • Caching

~ Dave
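A few of the features listed above can be tried in a handful of lines. This is a minimal sketch, assuming parso is installed and ships a grammar for the requested version ("3.8" is an arbitrary choice here); the broken snippet is the kind of input used in parso's own documentation.

```python
import parso

# An explicit version: the grammar that gets parsed does not depend on
# the interpreter running this script.
grammar = parso.load_grammar(version="3.8")

# Two separate problems: a dangling operator, and `continue` outside a loop.
code = "foo +\nbar\ncontinue"

# Error recovery: parsing broken code still returns a tree, no exception.
module = grammar.parse(code)

# Round trip: the tree renders back to exactly the text it was given.
round_trips = module.get_code() == code

# Several issues are reported for one file, not just the first one.
errors = list(grammar.iter_errors(module))
for err in errors:
    print(err.start_pos, err.message)
```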

@boxed

boxed commented Aug 5, 2018

I’ve moved my mutation tester mutmut to parso from baron. It was nontrivial but more boring and annoying than hard. In fact the resultant code is actually simpler! And now I have Python 3 support which I am very happy about.

Thanks for your work on this lib!
