Make the _split_prefix public and documented to make comment parsing easy. #53

boxed · 2018-12-31T06:27:05Z

Comments being baked into the prefix member of nodes is unintuitive and awkward to use for parsing I think. It would make much more sense to have a Comment class.

(This has come up while I'm working on mutmut).

boxed · 2019-01-01T08:26:44Z

There are some issues getting the line number for a comment too in the current system. You'd think I can just subtract 1 from the node I have but this didn't work for me. It seems the end line node a the very end of a file is different.

davidhalter · 2019-01-04T00:04:18Z

I absolutely agree with you that a class would be nice. It's just not really possible to write a proper parser with all the comment semantics. Or if you write one you will end up with a very weird syntax file.

My solution for this problem is an innofficial API: PythonLeaf._split_prefix. Please check it out. It's available on all leaves.

It will give you positions of all comments/whitespace/other special symbols before a leaf.

Let me know what you think. Otherwise I'm very happy to discuss ideas how we could make such an API usable for a bigger amount of people.

Maybe a comments API on leaves that just returns prefix parts with type "comment"?

boxed · 2019-01-04T11:14:58Z

I absolutely agree with you that a class would be nice. It's just not really possible to write a proper parser with all the comment semantics. Or if you write one you will end up with a very weird syntax file.

I don't understand what you mean... is the parsing of comments incorrect in parso today?

PythonLeaf._split_prefix

Ah, well that seems like exactly what I need. Thanks!

Maybe a comments API on leaves that just returns prefix parts with type "comment"?

That would probably do yea. It's a bit awkward because you go past the point of the comment and then kind of backtrack so traversing the tree is now not in increasing position. But I don't know if that actually matters, it just feels a bit weird :P

boxed · 2019-01-04T12:03:11Z

Yep, I just now had time to try out _split_prefix and it works well, thanks again!

boxed · 2019-01-04T12:03:40Z

Maybe this API should just be stabilized and documented and then this ticket can be closed that way?

davidhalter · 2019-01-05T08:06:00Z

Maybe this API should just be stabilized and documented and then this ticket can be closed that way?

Yeah probably. I will think about that.

That would probably do yea. It's a bit awkward because you go past the point of the comment and then kind of backtrack so traversing the tree is now not in increasing position. But I don't know if that actually matters, it just feels a bit weird :P

It's definitely awkward. But I don't have any idea how that could be done better. The comment nodes would just be annoying, because they would be everywhere (e.g. between a multiplication). This is definitely not what you want. So there's really no other way than parsing them later.

If you have better ideas about the API (maybe comments on the node that contains them? or the leaf before), I'm happy to hear them.

boxed · 2019-01-05T08:48:09Z

The comment nodes would just be annoying, because they would be everywhere (e.g. between a multiplication). This is definitely not what you want.

Yea ok, that is not what I want indeed!

If you have better ideas about the API (maybe comments on the node that contains them? or the leaf before), I'm happy to hear them.

Having them on a node that is on the same line as the comment is would make it a lot less confusing at least! And if there was a comments function that used _split_prefix with a filter that would be fine I think, even without fixing so the comments data is on the same line as the node they are attached to.

davidhalter · 2019-01-05T20:35:49Z

Having them on a node that is on the same line as the comment is would make it a lot less confusing at least!

I can definitely see this. Maybe we just need to add a property like eol_comment for this.

However if we try to generalize this, it's a bit more complicated. What about comments on lines without nodes? These would then also be after the node. The problem we end up with is where do we put comments at the beginning of the file i.e. before any node.

Thanks for the ideas, though.

boxed · 2019-01-05T20:56:16Z

Yea, a public and documented API is enough for now I think. I managed to get it working with the existing API so it seems fine.

boxed · 2019-01-19T06:29:06Z

I also noticed that FStringStart doesn't have _split_prefix. Maybe everything that has a prefix should have a _split_prefix? Or what is the logic?

davidhalter · 2019-01-23T23:18:21Z

You are absolutely right. Changed in b5d8175. FStringStart actually just had the wrong classes. Sorry for that.

I have currently not the time to make _split_prefix public and check if everything is as it should be. But I feel like if it ever is public, it will look very similar - so you can just use the private function and keep an eye on this tracker. Just pin the parso version and you should be fine :)

mkurnikov · 2019-04-28T14:20:53Z

I'm moving isort to parso now (trying to) and just hit the same thing. I want to add/remove lines, reorganize imports, add/remove comments with specific contents (isort section names) and it just feels really awkward.

Considering that parso has NewLine class, so
import os\n parsed as [ImportName, NewLine]
why couldn't import os # comment \n be parsed as [ImportName, Comment, NewLine] instead?

However if we try to generalize this, it's a bit more complicated. What about comments on lines without nodes? These would then also be after the node. The problem we end up with is where do we put comments at the beginning of the file i.e. before any node.

Could you outline complicated cases where it won't work maybe? It will at least help a lot, when I'd work on edgecases for isort.

UPD.
For example, if jedi doesn't use this feature, it could just be hidden behind a flag, ie. comments_as_classes=True, to preserve backwards compatibility.

davidhalter · 2019-04-28T19:31:37Z

I can understand your concern, but I guarantee you that it will be a nightmare. Try creating just the grammar for that. That's not fun. It's actually possible with parso, just modify the grammar that already exists and modify the tokenizer. Working with a CST (concrete syntax tree) that has comments everwhere is also just really really horrible.

So that actually leaves us with much better tooling than we have now. _split_prefix is obviously pretty annoying for two reasons:

It returns comments before the token, this is not how we humans think, we read from left to right.
It returns all kind of horrible things (unicode bom, backslashes, form feeds, etc). This is annoying because mostly we are not interested in comments and nothing else.

(1) makes kind of sense, because it's the only way how we can access comments in the beginning. (2) is something we could improve, we could just return comments. If you have good ideas for an API for comments I'm happy to discuss.

mkurnikov · 2019-05-01T01:41:25Z

What about instead of modifying grammars, just do

Parse code as it is now.
Use _split_prefix to extract comments and newlines in the prefixes in their own classes.
Comment and Newline objects would be added not to the parent container (PythonNode) of the object that has them in prefix, but to one that consist of objects from previous line.
Hide all this behind a flag for parse, so API would be

parse(code, version='3.7', comments_as_classes=False)

So,

import math # comment
import sys

will be parsed into

[
    PythonNode([ImportName, Comment, Newline]),
    PythonNode([ImportName])
]

If there's no PythonNode for the previous line, just use Comment as a separate object,

# comment
import os

parsed as

[
    PythonNode([Comment, Newline]),
    PythonNode([ImportName])
]

All other stuff, like form-feeds and whitespaces would stay in the prefixes of the corresponding objects.

davidhalter · 2019-05-02T23:08:29Z

While this would be possible, I'm strongly opposed to it because it modifies the syntax tree in ways that are very random in some cases (like comment leaves somewhere in parentheses, etc. I feel like even in this case there's so many edge cases that nobody except you would be happy with.

The reason you would be happy with is probably, because you want to work only with imports. If you only work with imports, it's not nearly as bad where comments can appear (except maybe when you use stuff like:

# first
from x import ( # a
 y, # b
 z, # c
) # d

In this case your comments would just be in some crazy places.

mkurnikov · 2019-05-02T23:36:40Z

Ok, you're right. I think I just don't have enough perspective of how parso is used in the wild. Anyway, _split_prefix() method should be enough for now for my usecase, I'm going to create custom Comment node and parse them manually from PrefixPart objects.

mkurnikov · 2019-05-04T15:41:58Z

Ok, last attempt, won't pursue it further.

So that actually leaves us with much better tooling than we have now. _split_prefix is obviously pretty annoying for two reasons:

It returns comments before the token, this is not how we humans think, we read from left to right.

It returns all kind of horrible things (unicode bom, backslashes, form feeds, etc). This is annoying because mostly we are not interested in comments and nothing else.

(1) makes kind of sense, because it's the only way how we can access comments in the beginning. (2) is something we could improve, we could just return comments. If you have good ideas for an API for comments I'm happy to discuss.

Goals:

Access comment at the beginning
left-to-right thinking
complications with multiline nodes and comments

Currently, there are two unobvious things for me:
1.

# mycomment
# mycomment2
def func(): pass

all comments will be in prefix of first leaf of func scope.

import os # comment
def func(): pass

comment will be in the first leaf of func scope.

In addition, nodes in the .children attribute neatly wrapped in the PythonNode for every logical line of code,

Module [
    PythonNode( [import os] ),
    PythonNode( [def func(): pass] )
]

So, what makes sense to me is to add

suffix attribute, where everything that is followed by the node till the next newline. After that, there will be Newline node (or not, not sure about it).
prefix is a source code between start of the current line and start of the current node (mostly indentation)
comments which cover their line completely (with prefix indentation) will be represented as

PythonNode([Comment(#comment), Newline])

_split_prefix / _split_suffix still there in case user wants to access prefix/suffix parts.

mkurnikov · 2019-05-07T12:57:50Z

@davidhalter Would you approve

comments which cover their line completely (with prefix indentation) will be represented as
PythonNode([Comment(#comment), Newline])

if I submit a PR? Everything else is not critical for me (isort), suffix is just a minor convenience, I can just extract a prefix of the next leaf.

davidhalter · 2019-05-07T22:18:26Z

No I don't approve that. I feel like I want a solution that also works for all cases and is not just a band-aid. Comments are not semantically relevant for tree nodes. For something like that the parser has to significantly change (as well as the tokenizer).

davidhalter · 2020-02-21T17:59:15Z

I'm closing this one, because having done a lot of work on refactoring now in Jedi, I don't think this is needed for now. So people should just implement their own "split_prefix" logic and work with comments.

boxed changed the title ~~Usability issue: comments aren't their own nodes~~ Make the _split_prefix public and documented to make comment parsing easy. Jan 16, 2019

davidhalter closed this as completed Feb 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the _split_prefix public and documented to make comment parsing easy. #53

Make the _split_prefix public and documented to make comment parsing easy. #53

boxed commented Dec 31, 2018

boxed commented Jan 1, 2019

davidhalter commented Jan 4, 2019

boxed commented Jan 4, 2019

boxed commented Jan 4, 2019

boxed commented Jan 4, 2019

davidhalter commented Jan 5, 2019

boxed commented Jan 5, 2019

davidhalter commented Jan 5, 2019

boxed commented Jan 5, 2019 •

edited

boxed commented Jan 19, 2019

davidhalter commented Jan 23, 2019

mkurnikov commented Apr 28, 2019 •

edited

davidhalter commented Apr 28, 2019

mkurnikov commented May 1, 2019

davidhalter commented May 2, 2019

mkurnikov commented May 2, 2019

mkurnikov commented May 4, 2019

mkurnikov commented May 7, 2019

davidhalter commented May 7, 2019

davidhalter commented Feb 21, 2020

Make the _split_prefix public and documented to make comment parsing easy. #53

Make the _split_prefix public and documented to make comment parsing easy. #53

Comments

boxed commented Dec 31, 2018

boxed commented Jan 1, 2019

davidhalter commented Jan 4, 2019

boxed commented Jan 4, 2019

boxed commented Jan 4, 2019

boxed commented Jan 4, 2019

davidhalter commented Jan 5, 2019

boxed commented Jan 5, 2019

davidhalter commented Jan 5, 2019

boxed commented Jan 5, 2019 • edited

boxed commented Jan 19, 2019

davidhalter commented Jan 23, 2019

mkurnikov commented Apr 28, 2019 • edited

davidhalter commented Apr 28, 2019

mkurnikov commented May 1, 2019

davidhalter commented May 2, 2019

mkurnikov commented May 2, 2019

mkurnikov commented May 4, 2019

mkurnikov commented May 7, 2019

davidhalter commented May 7, 2019

davidhalter commented Feb 21, 2020

boxed commented Jan 5, 2019 •

edited

mkurnikov commented Apr 28, 2019 •

edited