Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the _split_prefix public and documented to make comment parsing easy. #53

Closed
boxed opened this issue Dec 31, 2018 · 20 comments
Closed

Comments

@boxed
Copy link
Contributor

boxed commented Dec 31, 2018

Comments being baked into the prefix member of nodes is unintuitive and awkward to use for parsing I think. It would make much more sense to have a Comment class.

(This has come up while I'm working on mutmut).

@boxed
Copy link
Contributor Author

boxed commented Jan 1, 2019

There are some issues getting the line number for a comment too in the current system. You'd think I can just subtract 1 from the node I have but this didn't work for me. It seems the end line node a the very end of a file is different.

@davidhalter
Copy link
Owner

I absolutely agree with you that a class would be nice. It's just not really possible to write a proper parser with all the comment semantics. Or if you write one you will end up with a very weird syntax file.

My solution for this problem is an innofficial API: PythonLeaf._split_prefix. Please check it out. It's available on all leaves.

It will give you positions of all comments/whitespace/other special symbols before a leaf.

Let me know what you think. Otherwise I'm very happy to discuss ideas how we could make such an API usable for a bigger amount of people.

Maybe a comments API on leaves that just returns prefix parts with type "comment"?

@boxed
Copy link
Contributor Author

boxed commented Jan 4, 2019

I absolutely agree with you that a class would be nice. It's just not really possible to write a proper parser with all the comment semantics. Or if you write one you will end up with a very weird syntax file.

I don't understand what you mean... is the parsing of comments incorrect in parso today?

PythonLeaf._split_prefix

Ah, well that seems like exactly what I need. Thanks!

Maybe a comments API on leaves that just returns prefix parts with type "comment"?

That would probably do yea. It's a bit awkward because you go past the point of the comment and then kind of backtrack so traversing the tree is now not in increasing position. But I don't know if that actually matters, it just feels a bit weird :P

@boxed
Copy link
Contributor Author

boxed commented Jan 4, 2019

Yep, I just now had time to try out _split_prefix and it works well, thanks again!

@boxed
Copy link
Contributor Author

boxed commented Jan 4, 2019

Maybe this API should just be stabilized and documented and then this ticket can be closed that way?

@davidhalter
Copy link
Owner

Maybe this API should just be stabilized and documented and then this ticket can be closed that way?

Yeah probably. I will think about that.

That would probably do yea. It's a bit awkward because you go past the point of the comment and then kind of backtrack so traversing the tree is now not in increasing position. But I don't know if that actually matters, it just feels a bit weird :P

It's definitely awkward. But I don't have any idea how that could be done better. The comment nodes would just be annoying, because they would be everywhere (e.g. between a multiplication). This is definitely not what you want. So there's really no other way than parsing them later.

If you have better ideas about the API (maybe comments on the node that contains them? or the leaf before), I'm happy to hear them.

@boxed
Copy link
Contributor Author

boxed commented Jan 5, 2019

The comment nodes would just be annoying, because they would be everywhere (e.g. between a multiplication). This is definitely not what you want.

Yea ok, that is not what I want indeed!

If you have better ideas about the API (maybe comments on the node that contains them? or the leaf before), I'm happy to hear them.

Having them on a node that is on the same line as the comment is would make it a lot less confusing at least! And if there was a comments function that used _split_prefix with a filter that would be fine I think, even without fixing so the comments data is on the same line as the node they are attached to.

@davidhalter
Copy link
Owner

Having them on a node that is on the same line as the comment is would make it a lot less confusing at least!

I can definitely see this. Maybe we just need to add a property like eol_comment for this.

However if we try to generalize this, it's a bit more complicated. What about comments on lines without nodes? These would then also be after the node. The problem we end up with is where do we put comments at the beginning of the file i.e. before any node.

Thanks for the ideas, though.

@boxed
Copy link
Contributor Author

boxed commented Jan 5, 2019

Yea, a public and documented API is enough for now I think. I managed to get it working with the existing API so it seems fine.

@boxed boxed changed the title Usability issue: comments aren't their own nodes Make the _split_prefix public and documented to make comment parsing easy. Jan 16, 2019
@boxed
Copy link
Contributor Author

boxed commented Jan 19, 2019

I also noticed that FStringStart doesn't have _split_prefix. Maybe everything that has a prefix should have a _split_prefix? Or what is the logic?

@davidhalter
Copy link
Owner

You are absolutely right. Changed in b5d8175. FStringStart actually just had the wrong classes. Sorry for that.

I have currently not the time to make _split_prefix public and check if everything is as it should be. But I feel like if it ever is public, it will look very similar - so you can just use the private function and keep an eye on this tracker. Just pin the parso version and you should be fine :)

@mkurnikov
Copy link

mkurnikov commented Apr 28, 2019

I'm moving isort to parso now (trying to) and just hit the same thing. I want to add/remove lines, reorganize imports, add/remove comments with specific contents (isort section names) and it just feels really awkward.

Considering that parso has NewLine class, so
import os\n parsed as [ImportName, NewLine]
why couldn't import os # comment \n be parsed as [ImportName, Comment, NewLine] instead?

However if we try to generalize this, it's a bit more complicated. What about comments on lines without nodes? These would then also be after the node. The problem we end up with is where do we put comments at the beginning of the file i.e. before any node.

Could you outline complicated cases where it won't work maybe? It will at least help a lot, when I'd work on edgecases for isort.

UPD.
For example, if jedi doesn't use this feature, it could just be hidden behind a flag, ie. comments_as_classes=True, to preserve backwards compatibility.

@davidhalter
Copy link
Owner

I can understand your concern, but I guarantee you that it will be a nightmare. Try creating just the grammar for that. That's not fun. It's actually possible with parso, just modify the grammar that already exists and modify the tokenizer. Working with a CST (concrete syntax tree) that has comments everwhere is also just really really horrible.

So that actually leaves us with much better tooling than we have now. _split_prefix is obviously pretty annoying for two reasons:

  1. It returns comments before the token, this is not how we humans think, we read from left to right.
  2. It returns all kind of horrible things (unicode bom, backslashes, form feeds, etc). This is annoying because mostly we are not interested in comments and nothing else.

(1) makes kind of sense, because it's the only way how we can access comments in the beginning. (2) is something we could improve, we could just return comments. If you have good ideas for an API for comments I'm happy to discuss.

@mkurnikov
Copy link

What about instead of modifying grammars, just do

  1. Parse code as it is now.
  2. Use _split_prefix to extract comments and newlines in the prefixes in their own classes.
  3. Comment and Newline objects would be added not to the parent container (PythonNode) of the object that has them in prefix, but to one that consist of objects from previous line.
  4. Hide all this behind a flag for parse, so API would be
parse(code, version='3.7', comments_as_classes=False)

So,

import math # comment
import sys

will be parsed into

[
    PythonNode([ImportName, Comment, Newline]),
    PythonNode([ImportName])
]

If there's no PythonNode for the previous line, just use Comment as a separate object,

# comment
import os

parsed as

[
    PythonNode([Comment, Newline]),
    PythonNode([ImportName])
]

All other stuff, like form-feeds and whitespaces would stay in the prefixes of the corresponding objects.

@davidhalter
Copy link
Owner

While this would be possible, I'm strongly opposed to it because it modifies the syntax tree in ways that are very random in some cases (like comment leaves somewhere in parentheses, etc. I feel like even in this case there's so many edge cases that nobody except you would be happy with.

The reason you would be happy with is probably, because you want to work only with imports. If you only work with imports, it's not nearly as bad where comments can appear (except maybe when you use stuff like:

# first
from x import ( # a
 y, # b
 z, # c
) # d

In this case your comments would just be in some crazy places.

@mkurnikov
Copy link

Ok, you're right. I think I just don't have enough perspective of how parso is used in the wild. Anyway, _split_prefix() method should be enough for now for my usecase, I'm going to create custom Comment node and parse them manually from PrefixPart objects.

@mkurnikov
Copy link

Ok, last attempt, won't pursue it further.

So that actually leaves us with much better tooling than we have now. _split_prefix is obviously pretty annoying for two reasons:

  • It returns comments before the token, this is not how we humans think, we read from left to right.
  • It returns all kind of horrible things (unicode bom, backslashes, form feeds, etc). This is annoying because mostly we are not interested in comments and nothing else.

(1) makes kind of sense, because it's the only way how we can access comments in the beginning. (2) is something we could improve, we could just return comments. If you have good ideas for an API for comments I'm happy to discuss.

Goals:

  1. Access comment at the beginning
  2. left-to-right thinking
  3. complications with multiline nodes and comments

Currently, there are two unobvious things for me:
1.

# mycomment
# mycomment2
def func(): pass

all comments will be in prefix of first leaf of func scope.

import os # comment
def func(): pass

comment will be in the first leaf of func scope.

In addition, nodes in the .children attribute neatly wrapped in the PythonNode for every logical line of code,

Module [
    PythonNode( [import os] ),
    PythonNode( [def func(): pass] )
]

So, what makes sense to me is to add

  1. suffix attribute, where everything that is followed by the node till the next newline. After that, there will be Newline node (or not, not sure about it).
  2. prefix is a source code between start of the current line and start of the current node (mostly indentation)
  3. comments which cover their line completely (with prefix indentation) will be represented as
PythonNode([Comment(#comment), Newline])
  1. _split_prefix / _split_suffix still there in case user wants to access prefix/suffix parts.

@mkurnikov
Copy link

@davidhalter Would you approve

comments which cover their line completely (with prefix indentation) will be represented as
PythonNode([Comment(#comment), Newline])

if I submit a PR? Everything else is not critical for me (isort), suffix is just a minor convenience, I can just extract a prefix of the next leaf.

@davidhalter
Copy link
Owner

No I don't approve that. I feel like I want a solution that also works for all cases and is not just a band-aid. Comments are not semantically relevant for tree nodes. For something like that the parser has to significantly change (as well as the tokenizer).

@davidhalter
Copy link
Owner

I'm closing this one, because having done a lot of work on refactoring now in Jedi, I don't think this is needed for now. So people should just implement their own "split_prefix" logic and work with comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants