New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support additional meta-data (such as whitespace and comments) on a CST instead of the AST #133
Conversation
…line comments, multiline comments) to be tracked on (and thus outputted from) all AST-Node types (identifiers, expressions, statements, etc
Looks great. I'll review this. |
/CC: @ariya @michaelficarra |
Some of these have been discussed in https://code.google.com/p/esprima/issues/detail?id=197. |
@ariya thanks for the pointer. Unfortunately, that discussion (and the code just landed for esprima) doesn't seem to do anything about preserving any whitespace, which in my case is necessary for a sufficient solution. What do you think about the more general solution I've provided here which also allows whitespace, and even allows the (crazy) HTML-style comment markers to be supported/preserved? |
Just a note here from the JSShaper author. escodegen followed the leadingComment and trailingComment model I created (@ariya's link has more info). But note that JSShaper used leadingComment/trailingComment for expressions just as well as statements and I guess perhaps so could escodegen. @getify mentions the problem of capturing comments in some cases, for example |
Basically look nice. But I have a concern about If it is addressed by only meta data (not introducing new node, |
/CC: @dherman |
@getify @olov @Constellation Should we continue the discussion set in https://code.google.com/p/esprima/issues/detail?id=197? Otherwise, we may risk repeating it or having a disconnected thread? |
@olov For "orphaned" comment, @Constellation proposed |
@ariya |
Note: I know this discussion is centering around comments. But I also care about preserving whitespace for my use case. @olov @Constellation isn't @Constellation the problem is that in the case shown, only the I understand not wanting to create non-standard AST stuff. But the standard JS AST just doesn't handle cases of wanting to preserve whitespace and comments, so we have to invent. Inventing a no-op empty (literally completely empty) node seems like the smallest change. @ariya @olov where would |
BTW, I'm still undecided about whitespaces (see also https://code.google.com/p/esprima/issues/detail?id=256 and perhaps also https://code.google.com/p/esprima/issues/detail?id=108). I think storing comments is already a stretch (since comments are never parts of a proper syntax tree). |
With all due respect, I don't think it's appropriate to move discussion to the esprima thread. Here's why: this is NOT necessarily as much a question about what a parser would generate in an AST as it is other sorts of tools which either create ASTs from scratch (think transpilers, etc), or modify existing ASTs (code formatters, etc), and where they are in control of what kind of AST they create. I chose to open this discussion here, on escodegen, on purpose, because it takes an AST and produces code. If I have my own bespoke tool that generates ASTs (I do) and I want to have the AST turned into code (I do), my main desire is to have escodegen help that effort. It's a secondary (but not unimportant) issue how a parser like esprima might also track these items in its AST, and I agree it'd be nice to "agree" here. But my concern was first about being able to take an AST with this data already in it, and include it in the output of escodegen. It was an orthagonal task to later tackle whether something like esprima might care about preserving whitespace and comments as it parses. |
@getify yup it is! but |
@olov ahh, good point. so yeah, don't call it |
Do you think about defining new lower level format, Concrete Syntax Tree and handle it by Escodegen? |
@Constellation I think the CST would be a much harder way of getting what we want, not just for escodegen but for all the other tools in the arena which need to handle various stages of the transformations. If there's a strong enough objection to the I think that's a harder solution, because it means that for a tool generating such an AST, it has to decide whether encountered whitespace/comments can belong to a current/adjacent node (if there is one) or must be attached instead to a parent node's extras (if not). The dummy placeholder node approach has the advantage that it reduces the special casing a bit, where you just detect (for instance) an empty arguments list, and insert a dummy node there, and then if you find whitespace or comments, you already have an adjacent node to attach to. I'm not personally sure why the placeholder node is a problem, since any tool processing such an AST (as escodegen does in my PR) can basically just skip it. But anyway, if any major objection comes down to the empty placeholder node approach, then I can (probably, hopefully) adjust the PR to instead encode such things in What do you think? |
It doesn't handle this. function /*0*/ a /*1*/ ( /*2*/ ) /*3*/ {} I don't see how your Besides, as @Constellation said, this information really does not belong in an AST. @Constellation is right that this information should be stored in a CST. Also, remember that escodegen does not actually generate the resultant JS string output, but generates a CST that it annotates and passes to mozilla's source-map library. I'm sure we could come up with an interface for AST->CST that allows the consumer to modify this IR, adding comments/whitespace wherever they want. |
For example, if dummy node is introduced into parameters of FunctionDeclaration, we cannot consider if (node.xxx) {
// actual node
} So if we introduce dummy node, it should be passed to Escodegen immediately with a special option (such as In the long term, I think defining CST is better approach to this (Yup, this is harder approach) |
@michaelficarra However, blanket statements such as "do not belong in the AST" are not helpful. I understand what the point of an AST normally is. But we're dealing with an auxilliary/piggy-back usage of ASTs, which is to support transformations on code. Which is why I suggested tagging it as extra meta-data, not actual first-class AST nodes. Needing to support roundtrip transformations means every step of the transformation has to preserve data, or the roundtrip fails. If I have a tool that produces an AST (or CST), I can insert whitespace (or comments) wherever I want. However, if I have a tool that parses existing code for the purpose of transforming it, then I need to preserve things in some format. If you say that an AST has to lose that data, then any flow which goes through the AST step is a no-go in terms of what types of tools I'm working on. Can you explain what your objection to ASTs storing meta-data is, other than just a purely academic definition of the typical usage of an AST? Moreover, for the class of tools which need to preserve whitespace and comments through transformation (ie, not JS engines, not minifiers, but most everything else), is it really better to say "no, go write your own parser and code-generator from scratch to complete the roundtrip"? I obviously could. But I certainly would rather prefer to embrace and extend the current tools, if at ALL possible. |
@michaelficarra Thanks for pointing out that "extras" should be able to be attached to identifiers as well. Fixed that easily with this commit: https://github.com/getify/escodegen/commit/62116bf28f52c8e09da06023fff148f448675e10 So, now this AST produces the code you suggested (note particularly it even gives the whitespacing you implied! :) ) using my patch to escodegen: {
"type": "Program",
"body": [
{
"type": "FunctionDeclaration",
"id": {
"type": "Identifier",
"name": "a",
"extras": {
"leading": [
{
"type": "MultilineComment",
"value": "0"
},
{
"type": "Whitespace",
"value": " "
}
],
"trailing": [
{
"type": "Whitespace",
"value": " "
},
{
"type": "MultilineComment",
"value": "1"
},
{
"type": "Whitespace",
"value": " "
}
]
}
},
"params": [
{
"type": "EmptyExpression",
"value": "",
"extras": {
"leading": [
{
"type": "Whitespace",
"value": " "
},
{
"type": "MultilineComment",
"value": "2"
},
{
"type": "Whitespace",
"value": " "
}
]
}
}
],
"defaults": [],
"body": {
"type": "BlockStatement",
"body": [],
"extras": {
"leading": [
{
"type": "Whitespace",
"value": " "
},
{
"type": "MultilineComment",
"value": "3"
},
{
"type": "Whitespace",
"value": " "
}
]
}
},
"rest": null,
"generator": false,
"expression": false
}
]
} which yields function /*0*/ a /*1*/ ( /*2*/ ) /*3*/ {} Now, there's clearly an issue above, which is that I inserted an EXCEPT of course the mentions in above comments about checking for instance So, the alternate approach, using {
"type": "Program",
"body": [
{
"type": "FunctionDeclaration",
"id": {
"type": "Identifier",
"name": "a",
"extras": {
"leading": [
{
"type": "MultilineComment",
"value": "0"
},
{
"type": "Whitespace",
"value": " "
}
],
"trailing": [
{
"type": "Whitespace",
"value": " "
},
{
"type": "MultilineComment",
"value": "1"
},
{
"type": "Whitespace",
"value": " "
}
]
}
},
"params": [],
"defaults": [],
"body": {
"type": "BlockStatement",
"body": [],
"extras": {
"leading": [
{
"type": "Whitespace",
"value": " "
},
{
"type": "MultilineComment",
"value": "3"
},
{
"type": "Whitespace",
"value": " "
}
]
},
"extras": {
"inner": [
{
"type": "Whitespace",
"value": " "
},
{
"type": "MultilineComment",
"value": "2"
},
{
"type": "Whitespace",
"value": " "
}
]
}
},
"rest": null,
"generator": false,
"expression": false
}
]
} Notice I just added The remaining question is, should This solution works. It's "harder", but it addresses the concerns of needing empty nodes and throwing off length counts. If we get a sense that |
@getify: The FunctionDeclaration case I gave was not a perfect example, then. (function /* 0 */ ( /* 1 */ ) /* 2 */ {})
Yes, exactly. The whole point of an AST is that we've discarded that information because it doesn't have a semantic meaning. You're still talking about tools that have ties to the syntactic representation of the program, which means a CST is more appropriate for you. Not only that, an AST is inappropriate for your use cases. |
Being "ambiguous" is not nearly as much a problem as being "insufficient". Tools which produce such *STs can have rules to decide where to put the extras and where not. There's all kinds of rules around producing valid ASTs and not sticking nodes in inappropriate/unexpected places. There's no reason such appropriate rules can't be applied to where you put extras. Moreover, in your case, it's not ambiguous at all. ( /* here's some unambiguous extras! */ function /* 0 */ ( /* 1 */ ) /* 2 */ {}) As you can see, all the extras would appear at the beginning (but inside) of the outer Moreover, the other examples you suggest like The only place not currently handled is I suspect that most other examples we can dream up have similar triage. Yes, we're chasing rabbits. I admitted that much earlier in the thread. So? That's what development sometimes requires.
You seem to be overly pedantically hung up on the purity of an AST. And yet, both esprima AND escodegen already add incomplete Why does it really matter if what we're talking about is purely an AST or a CST? Do these tools that I'm talking about expose (public API wise) the AST separately from the CST? To my knowledge, they do not. If your underlying point here is that if I want to track whitespace/comments in code transformations, that I'm shi** out of luck with any of these tools and that I should go invent my own whole stack for them, that's totally unhelpful, and I reject that stance. It's contrary to the spirit of OSS. If however you're just splitting hairs on what we call it, that's not particularly helpful (bikeshedding) either. It's clear what I mean, and I don't think its label matters at this part of the stack. And if you're suggesting that all these tools need to change to separately expose AST distinguished from CST, great, be explicit about that. That would be helpful as a proposal. But make sure that the CST is before the creation of an AST, not after, or we are in a situation where's there's intolerable information loss in the creation of your pure AST concept. But, in the end, regardless of whether we're inventing how to encode this meta-data into an AST, or whether we're debating how that same meta-data should be encoded into something that looks almost like it but has a different label (CST), we still have to figure out if there's a practical way to encode the meta-data. Can you help with that task instead of trying to derail the effort? |
Actually, my above analysis about how Bottom line, YES these annotations have some ambiguity to them if they're sloppily used willy-nilly across the tree. A single comment could be persisted in some cases by several different nodes where the But that's not that big of a deal. escodegen can just output whatever it gets without worrying, and put the onus on the generator of the tree to make sure to put extras annotations in the right spot. Thus, the key is deciding what those definitive, non-ambiguous rules should be. As a starting point, I'd say the extras should always be annotated as deeply in the tree as applicable (which removes the ambiguity of annotations overlapping on the same extra). The extra should always be attached to the closest node that it can attach to. For instance, prefer attaching to a If a tree is created with these proper rules, then escodegen should be able to easily churn through all the nodes and not worry about outputting any extras it finds. |
@michaelficarra you have not addressed how a parser (like esprima) could produce a tree for |
Quoting myself from (way) above...
"(and optionally ASTs)" means we just run the result through a CST->AST transformation. |
In that case, by creating CST from parser and passing it to If we'd like to control the generated code format, we should use CST. But we'd like to analyze the semantics of the code, we should use AST. |
Let me see if I get this straight:
Does that accurately represent what @Constellation and @michaelficarra think is the best path forward? |
Assuming for a moment that I understand (from my previous comment), let me make a few points:
|
Thanks for the clarification :)
Basically right. But to keep the role ( function generate(ast) {
return generateFromCST(convertASTToCST(ast));
}
This functionality is not necessary. We can just use |
Can we agree that the tree which is produced by That is, that it would likely use something like |
escodegen's primary task is generating valid & semantically equal code from AST. escodegen's core principal is So I think this task belongs to escodegen. Personally I think
That's right!
But it cannot represent some concrete syntax information (such as parentheses and semicolon) as a meaning structure, correct? |
Arriving super late to the party, but yes CST, awesome idea! I haven't read all the debates on the API side of things but here are a few elements I find relevant: the CST contains all the information of the AST, but not the other way around. Ideally, that could mean that the AST be a subset of the CST. The problem is that the AST was defined long ago and a bunch of tooling relies on it, so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here). And the reward of breaking things doesn't seem worth it in that instance. Beyond backward compat, given that CST contains more info, one could wonder what is the point of AST after CST... But let's make CST happen.
I guess it starts with the following principle: |
It depends on what you mean by superset, but if you mean current tools will be able to treat a CST as an AST, I can see at least one way to do it by just adding optional properties to AST nodes:
I think that should be sufficient. @Constellation: what do you think? |
What you've described is indeed a critical function of escodegen, that it needs to produce valid code from the tree. This means that in certain places, it does need to add in whitespace between tokens, such as between However, that's a separate and orthagonal task from escodegen adding in extra pretty-printing whitespace like with indentation and such. Looking through the bits of escodegen that handle any sort of whitespace insertion, about 10% of the code (rough estimate) is for required whitespace for grammar validity, and the other 90% of the code that deals with whitespace insertion is dealing with fancier pretty-printing things like adding base indentation to lines of code, adjusting the indentation level of multi-line comments, and other such things. It's just my opinion, but I see the second of these tasks, the pretty-printing part, as not necessarily part of the core primary task of escodegen, while undeniably the first task (ensuring grammar validity with necessary whitespace) as definitely primary. You could save/remove some of the code/complexity in the current code base if you elected to say that pretty printing stuff was not in escodegen but could be done by some other tool which produces a CST. Such a pretty-printer tool could just take an AST, make a CST out of it, and add the proper whitespace/indentation annotations to the CST, such that when escodegen processes that CST, it will produce the desired pretty-printed code, but escodegen wouldn't have to do the part of actually figuring out the pretty-printing parts. Just my 2 cents. It's a feature I don't care about either way, about because the tool I'm building will supply all of the spacing info via CST to escodegen.
I absolutely think the CST can (and should!) represent those things. ASTs have no need for them, CSTs absolutely do have some usage for them (such as reconstructing original code -- my main use case). |
This is the opposite of what I mean by compatibility between AST and CST. I don't mean, or intend in my suggestions, for a CST to be treated like an AST. I think that's a bad goal that hand-ties us too rigidly in our design of a CST. I mean: an AST can be treated like a CST (aka, backwards compatibility preserved that escodegen will be able to keep accepting the ASTs it always did), though an AST is a particularly "dumb" CST in that it's missing all the actual "concrete" stuff. To put it more plainly, I think escodegen is one of the only tools that would care about consuming a CST. Maybe a few others, such as escope. Most other tools would only care to receive an AST. For them, they don't have to change at all just because we're declaring an optional intermediary CST format. We just need for one of the tools to have a CST-to-AST stripper/converter, and we're fine there. That leaves parsers and other tree generators to consider. The idea would be that they would be able to either output an AST (that is, ignore the extra stuff while they parse/tree-build), or a CST (keep all the extra stuff), and the user of that tool would decide which tree they want, and how to use the tree once they get it. To specifically address your question, @DavidBruant, I believe I've proven that a CST structure can look an awful lot like an AST, that is, as a superset of an AST, and yet still handle all the complexities of concrete syntax representation. The vast majority of the grammar allows for whitespace and comments to be annotated using the suggested "extras" implementation I've shown here in this thread. There are a few exceptions where the "extras" have to be added to the tree in a special (aka, non-standard-AST) way/location, but those few exceptions only create a 1-to-1 structure in the CST which can easily be stripped in the transformation of a CST back to an AST. Things such as |
@getify @michaelficarra @DavidBruant @DavidBruant wrote:
Right. CST should represent the source code as is.
Personally I think only users paying attension to formats should consider CST. And we need to keep AST world simple. So I would not like to assume the inheritance between AST and CST. I think CST and AST should be converted each other by some function explicitly. @getify wrote:
Agreed. CST should not be treated like an AST.
I don't agree with this. When API take CST, this API cannot take AST.
Right. If CST add dummy nodes, basically all AST tools cannot handle CST. CST is different from AST. AST tools should not consider CST since it is very low IR. Only users taking care about formats should consider CST. |
How about parentheses? I don't want to break the current functionality since I think AST tools should not consider about CST and escodegen provides a way to generate something valid code from AST. It keeps AST tools world simple. The current functionality, So if you'd like to split escodegen's tasks into small ones, escodegen will provide 2 functionalities, inserting required spaces (and required parentheses) and pretty printer functionality. But personally I think they cannot be splitted since both tasks need to know the information about the all inserted spaces. |
We all seem to have slightly different ideas of what the ideal CST is. I don't know exactly how to resolve the differing views. However, on a positive note, @michaelficarra @puffnfresh and myself got a chance to chat in-depth over drinks while we've been here at the NationJS conference. I think we have a plan to move forward on making the decisions, which involves a google hangout online meeting with all the interested players. I am going to try to set something like that up over the next couple of weeks. Timezones are going to be our enemy in that, but I'll do my best. Stay tuned. :) |
My idea about it seems pretty close to yours ;-) So the cool idea here would (could) be to rethink and refactor this extra information into something more unified and better structured which would allow tools to rebuild the exact source code from which the CST was produced. And to keep the current model of an AST with a few more properties attached to its nodes. Or, if some people (and tools) feel really attached to the academic purity of ASTs, maybe we could consider to have 2 separate data structures. A pure AST for the semantics, and a Whatever-T for the presentation information. (Kind of) like html+css separately represent semantic and presentation aspects of a page. |
As Kyle says, we got a chance to speak about this in person and really understand each other's perspectives. The discussion came down to one basic point: we've each defined isomorphic JS CSTs, each with their own pros and cons. I will attempt to summarise the two proposals and their pros/cons here. If we find that we want to collaborate on this, we can either move this to the wiki or a shared google doc. Proposal: CST By Extending/Annotating ASTI've previously described this format above as a possibility in response to @DavidBruant. I will copy it here for convenience. Essentially, we are just adding optional properties to AST nodes:
Pros
Cons
Proposal: CST With Structural Syntactic FormsThis proposal also remains close to the AST specification. In this proposal, new syntactic node types are added to the AST spec. as well as properties containing syntactic information.
Pros
Cons
Obviously, the pros/cons lists are incomplete and weighted toward the format I'm more familiar with. We should collaborate to fill them out completely so the community can make an informed decision about the CST format we should use going forward. One point that was brought up is that we could always have both formats. Technically, this has all of the pros of both proposals, with the only additional con being that one has to be more conscious about what format they are using and their tools support. We also discussed and all agreed that esprima should have an interface for producing a CST. Hopefully this is a good starting point for continuing discussion. @Constellation: If you prefer, we can move this to the wiki or some other collaboration tool. Please comment with more pros/cons for each of these formats! |
Great great great! |
@getify @DavidBruant @michaelficarra
Very nice. Moving this to wiki is quite reasonable. This clarification helps us a lot. Does Anyone have comments about this clarification? |
Created wiki page. Feel free to edit it. |
I've been traveling at confs for the past 2 weeks solid, sorry about my delay in responding. Hopefully will pick up the task of getting together an online meetup about this so we can figure out how to move forward. |
I would very much also like to get access to CST information; namely, whitespace. It has been a few months now. @getify, do you know what the next step is? |
I feel badly that I got very busy with other things and never organized the discussion meeting to talk about CST formats. We really need to circle back and do that. I'll try to find some time soon to renew the effort. |
OK, that took way too long. But I'm finally doing something "concrete" about this. Please go see this if you are interested in helping define a standard CST: |
I have need for the ability to attach extra information to the standard AST. The extra information is whitespace and comments that appear before (leading) or after (trailing) any node in the AST (such as Identifier, Expression, etc).
NOTE: I have implemented this (see the attached PR) in such a way that only AST's which have this extra data/nodes in them will activate the new behavior, so all existing usages of existing standard AST's should be unaffected by my changes (either functionally or performance).
The reason for this new feature is I'm building a tool which parses a JS file, custom modifies its styling/formatting by adding such information to the normal AST, then needs to re-generate the code from that AST.
In other words, I plan to turn on escodegen's
compact
config, so that none of its automatic whitespace/identation/etc happens, but instead have my own "control" on exactly how whitespace is outputted by having that data encoded into the AST viaextras
annotations.Moreover, the existing mechanism in escodegen for
leadingComments
andtrailingComments
is unfortunately insufficient, because there are many places where comments can validly appear which are not handled by that mechanism. It only tracks comments attached before or after a statement. For example,foo(/* blah */);
, the comment there isn't leading or trailing to a statement, and is thus not trackable by the current comments mechanism.By contrast, though, the "extras" mechanism I've made allows any node in the AST to have leading or trailing "extras", which can either be whitespace, single-line comments, or multiline comments. This allows comments to be output in any of the valid locations, rather than only before/after statements as currently. Perhaps my suggested mechanism here can obviate the need for the existing comments tracking, but note that my attached PR didn't make that bold assumption, so it leaves that stuff entirely alone (for now).
My attached PR here implements this feature as I suggest here in this discussion. It also adds a number of test cases to the "AST" test to verify that it's working correctly.
Here's an example (which is also in the test suite). This example AST shows several places where an
extras: { .. }
key is added to various different nodes, which can contain either/bothleading
ortrailing
extras, which respectively can have eitherWhitespace
,LineComment
and/orMultilineComment
nodes:That AST now produces this via escodegen (with compact mode turned on, obviously):
Note that the code even automatically inserts new-lines after line-comments, if necessary to delimit, even if the AST provided does not (though the way I plan to generate my AST, it always would have new-line
Whitespace
nodes in there).One final note, not directly illustrated in the above code: there are certain places in code where comments and whitespace can appear (and thus I would want to preserve and track) that there is otherwise no appropriate node to attach the "extras" data to. For instance,
foo(/* bar */)
has an emptyarguments
array in theCallExpression
, so there's no node to attach to there. I believe there are a few other places across the grammar as well. My suggested solution to this is to support anEmptyExpression
node type (similar toEmptyStatement
which is already in the code, butEmptyExpression
outputs nothing except any extras it may have, not the default;
asEmptyStatement
does).Just as with the rest of the stuff I've implemented,
EmptyExpression
would merely be supported if an AST included it, but wouldn't be required if an AST had no desire to track these "extras".