Support additional meta-data (such as whitespace and comments) on a CST instead of the AST #133

Closed
8 participants

getify commented Sep 22, 2013

I have need for the ability to attach extra information to the standard AST. The extra information is whitespace and comments that appear before (leading) or after (trailing) any node in the AST (such as Identifier, Expression, etc).

NOTE: I have implemented this (see the attached PR) in such a way that only ASTs which have this extra data/nodes in them will activate the new behavior, so all existing usages of existing standard ASTs should be unaffected by my changes (either functionally or in performance).


The reason for this new feature is I'm building a tool which parses a JS file, custom modifies its styling/formatting by adding such information to the normal AST, then needs to re-generate the code from that AST.

In other words, I plan to turn on escodegen's compact config, so that none of its automatic whitespace/indentation/etc happens, but instead have my own "control" on exactly how whitespace is outputted by having that data encoded into the AST via extras annotations.

Moreover, the existing mechanism in escodegen for leadingComments and trailingComments is unfortunately insufficient, because there are many places where comments can validly appear which are not handled by that mechanism. It only tracks comments attached before or after a statement. For example, foo(/* blah */);, the comment there isn't leading or trailing to a statement, and is thus not trackable by the current comments mechanism.

By contrast, though, the "extras" mechanism I've made allows any node in the AST to have leading or trailing "extras", which can either be whitespace, single-line comments, or multiline comments. This allows comments to be output in any of the valid locations, rather than only before/after statements as currently. Perhaps my suggested mechanism here can obviate the need for the existing comments tracking, but note that my attached PR didn't make that bold assumption, so it leaves that stuff entirely alone (for now).

My attached PR here implements this feature as I suggest here in this discussion. It also adds a number of test cases to the "AST" test to verify that it's working correctly.

Here's an example (which is also in the test suite). This example AST shows several places where an extras: { .. } key is added to various different nodes, which can contain either/both leading or trailing extras, which respectively can have either Whitespace, LineComment and/or MultilineComment nodes:

{
    type: 'Program',
    body: [{
        type: 'ExpressionStatement',
        expression: {   
            type: 'CallExpression',
            callee: {
                type: 'Identifier',
                name: 'foo',
                extras: {
                    trailing: [
                        {
                            type: 'MultilineComment',
                            value: '2'
                        },
                    ],
                }
            },
            arguments: [
                {
                    type: 'Literal',
                    value: 4,
                    raw: '4',
                    extras: {
                        leading: [
                            {
                                type: 'LineComment',
                                value: '3'
                            }
                        ],
                        trailing: [
                            {
                                type: 'MultilineComment',
                                value: '5'
                            },
                            // NOTE: testing custom line-comment markers,
                            // in support of HTML-style line-comment markers
                            // Ref: http://javascript.spec.whatwg.org/#comment-syntax
                            {
                                type: 'LineComment',
                                value: '6',
                                marker: '<!--'
                            }
                        ],
                    },
                }
            ],
            extras: {
                leading: [
                    {
                        type: 'MultilineComment',
                        value: '1'
                    },
                ],
                trailing: [
                    {
                        type: 'MultilineComment',
                        value: '7'
                    },
                ]
            }
        },
    }],
    extras: {
        trailing: [
            {
                type: 'MultilineComment',
                value: '8'
            },
        ],
    }
}

That AST now produces this via escodegen (with compact mode turned on, obviously):

/*1*/foo/*2*/(//3
4/*5*/<!--6
)/*7*/;/*8*/

Note that the code even automatically inserts new-lines after line-comments, if necessary to delimit, even if the AST provided does not (though the way I plan to generate my AST, it always would have new-line Whitespace nodes in there).
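As a rough illustration of the behavior described above, here is a minimal sketch of how a list of extras could be turned into text. It is not the code from the PR: the helper names `renderExtra` and `renderExtras` are hypothetical, and the only things taken from the discussion are the three extra types, the optional `marker` property, and the rule that a line comment gets a new-line after it when the following text does not already supply one.

```javascript
// Hypothetical helper: turn a single "extras" entry into source text.
function renderExtra(extra) {
    switch (extra.type) {
    case 'Whitespace':
        return extra.value;
    case 'MultilineComment':
        return '/*' + extra.value + '*/';
    case 'LineComment':
        // `marker` is optional; `//` is assumed as the default.
        return (extra.marker || '//') + extra.value;
    default:
        throw new Error('Unknown extra type: ' + extra.type);
    }
}

// Hypothetical helper: render a list of extras followed by `following`
// (the text that comes right after them). A new-line is inserted after a
// line comment only when more text follows and that text does not already
// start with a line terminator.
function renderExtras(extras, following) {
    var pieces = (extras || []).map(renderExtra);
    pieces.push(following || '');
    var out = '';
    for (var i = 0; i < pieces.length - 1; i++) {
        out += pieces[i];
        var rest = pieces.slice(i + 1).join('');
        if (extras[i].type === 'LineComment' &&
                rest.length > 0 && !/^[\n\r\u2028\u2029]/.test(rest)) {
            out += '\n';
        }
    }
    return out + pieces[pieces.length - 1];
}

// The `4` argument from the example AST, with its leading and trailing extras.
var leading = [{ type: 'LineComment', value: '3' }];
var trailing = [
    { type: 'MultilineComment', value: '5' },
    { type: 'LineComment', value: '6', marker: '<!--' }
];
console.log(renderExtras(leading, '4') + renderExtras(trailing, ')'));
```

For the extras shown, this prints `//3`, then `4/*5*/<!--6`, then `)` on three lines, which matches the middle of the generated output above.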

One final note, not directly illustrated in the above code: there are certain places in code where comments and whitespace can appear (and thus I would want to preserve and track) that there is otherwise no appropriate node to attach the "extras" data to. For instance, foo(/* bar */) has an empty arguments array in the CallExpression, so there's no node to attach to there. I believe there are a few other places across the grammar as well. My suggested solution to this is to support an EmptyExpression node type (similar to EmptyStatement which is already in the code, but EmptyExpression outputs nothing except any extras it may have, not the default ; as EmptyStatement does).

Just as with the rest of the stuff I've implemented, EmptyExpression would merely be supported if an AST included it, but wouldn't be required if an AST had no desire to track these "extras".

@getify getify adding support for 'extras' (leading and trailing whitespace, single-line comments, multiline comments) to be tracked on (and thus outputted from) all AST-Node types (identifiers, expressions, statements, etc
037de22
Owner

Constellation commented Sep 23, 2013

Looks great. I'll review this.

Owner

Constellation commented Sep 23, 2013

/CC: @ariya @michaelficarra
If you have any comments, feel free to write here.

Contributor

ariya commented Sep 23, 2013

Some of these have been discussed in https://code.google.com/p/esprima/issues/detail?id=197.

getify commented Sep 23, 2013

@ariya thanks for the pointer. Unfortunately, that discussion (and the code just landed for esprima) doesn't seem to do anything about preserving any whitespace, which in my case is necessary for a sufficient solution. What do you think about the more general solution I've provided here which also allows whitespace, and even allows the (crazy) HTML-style comment markers to be supported/preserved?

olov commented Sep 23, 2013

Just a note here from the JSShaper author. escodegen followed the leadingComment and trailingComment model I created (@ariya's link has more info). But note that JSShaper used leadingComment/trailingComment for expressions just as well as statements and I guess perhaps so could escodegen.

@getify mentions the problem of capturing comments in some cases, for example foo(/* bar */), and I remember having stumbled on the same issue (but I never solved it). I agree that one really wants to store those in a dummy node (in the pre- or postComment property, doesn't matter). As a minor comment, perhaps the dummy node should not be called EmptyExpression though, because it may be located in non-expression positions, for example function foo(/* bar */).

Owner

Constellation commented Sep 23, 2013

Basically looks nice. But I have a concern about EmptyExpression.
EmptyStatement is defined in the ECMA-262 spec and represents a semicolon-only statement (;).
But EmptyExpression represents nothing. I don't think that is suitable for an Abstract Syntax Tree. Is that correct?

If this can be addressed with meta-data only (without introducing a new node such as EmptyExpression), I think that would be better. Do you have any ideas?

Owner

Constellation commented Sep 23, 2013

/CC: @dherman

Contributor

ariya commented Sep 23, 2013

@getify @olov @constellation Should we continue the discussion set in https://code.google.com/p/esprima/issues/detail?id=197? Otherwise, we may risk repeating it or having a disconnected thread?

Contributor

ariya commented Sep 23, 2013

@olov For "orphaned" comment, @constellation proposed containedComment attached to the nearest parent.

Owner

Constellation commented Sep 23, 2013

@ariya
OK, I agree to continue the discussion on that Esprima thread.

getify commented Sep 23, 2013

Note: I know this discussion is centering around comments. But I also care about preserving whitespace for my use case.

@olov @constellation isn't foo(/* bar */) an example where an expression very well could have been present, in between the ( .. )? I chose "EmptyExpression" as a way to describe the dummy expression that is necessary for these otherwise-empty grammar places. In all the empty cases I could conceive, where "extras" couldn't be added anywhere else, without having some dummy node there, those were places where an expression would be valid. Hence the name I chose to suggest.

@constellation the problem is that in the case shown, only the ( or the ) could possibly be the parent of such a comment annotation, and neither of them are referenced (as operators) in the AST, but just implied by the CallExpression. I think in any scenario that handles these cases, a dummy node will be required. What type we give it is certainly open to discussion. I think EmptyExpression is a good descriptor, and carries no side effects. But Dummy or DummyExpression or any other name (bikeshedding notwithstanding) would be fine, I suppose.

I understand not wanting to create non-standard AST stuff. But the standard JS AST just doesn't handle cases of wanting to preserve whitespace and comments, so we have to invent. Inventing a no-op empty (literally completely empty) node seems like the smallest change.

@ariya @olov where would containedComment (or containedExtras :) ) actually reside in this case? On the callExpression? What about the other grammar cases where these "empties" arise? Empty { } blocks, empty [ ] array literals, and other such examples were some I conceived. I think we may chase rabbits trying to "invent" exceptions in all of them.

Contributor

ariya commented Sep 23, 2013

BTW, I'm still undecided about whitespaces (see also https://code.google.com/p/esprima/issues/detail?id=256 and perhaps also https://code.google.com/p/esprima/issues/detail?id=108). I think storing comments is already a stretch (since comments are never parts of a proper syntax tree).

getify commented Sep 23, 2013

@ariya @constellation

With all due respect, I don't think it's appropriate to move discussion to the esprima thread.

Here's why: this is NOT necessarily as much a question about what a parser would generate in an AST as it is other sorts of tools which either create ASTs from scratch (think transpilers, etc), or modify existing ASTs (code formatters, etc), and where they are in control of what kind of AST they create. I chose to open this discussion here, on escodegen, on purpose, because it takes an AST and produces code. If I have my own bespoke tool that generates ASTs (I do) and I want to have the AST turned into code (I do), my main desire is to have escodegen help that effort.

It's a secondary (but not unimportant) issue how a parser like esprima might also track these items in its AST, and I agree it'd be nice to "agree" here. But my concern was first about being able to take an AST with this data already in it, and include it in the output of escodegen. It was an orthogonal task to later tackle whether something like esprima might care about preserving whitespace and comments as it parses.

olov commented Sep 23, 2013

@getify yup it is! but function foo(/* bar */) {} is not (parameter names are not expressions). :)

getify commented Sep 23, 2013

@olov ahh, good point. so yeah, don't call it EmptyExpression, just call it Empty or EmptyNode. :)

Owner

Constellation commented Sep 23, 2013

@getify @olov

What do you think about defining a new lower-level format, a Concrete Syntax Tree, and having Escodegen handle it?
It represents the source code as is, and an AST can be transformed into a CST.
I think an AST should be abstract. For example, an AST should not represent parentheses ((expr) should be represented as expr).

Owner

Constellation commented Sep 23, 2013

@getify @ariya
OK, so let's discuss here :)

getify commented Sep 23, 2013

@constellation I think the CST would be a much harder way of getting what we want, not just for escodegen but for all the other tools in the arena which need to handle various stages of the transformations.

If there's a strong enough objection to the Empty or EmptyNode placeholder solution I suggested, I suppose the extras I suggested could have not only leading and trailing, but also an inner or inside or contained, which is a similar list of extras but which would be implied inside of a certain node's elements. For CallExpression and FunctionDeclaration, that would be the ( .. ) pair. For { .. } and [ .. ] pairs, the same would apply. I am not currently aware of a grammar problem with that approach (though there very well may be with the various ES6 syntax variations coming down the pike).

I think that's a harder solution, because it means that for a tool generating such an AST, it has to decide whether encountered whitespace/comments can belong to a current/adjacent node (if there is one) or must be attached instead to a parent node's extras (if not). The dummy placeholder node approach has the advantage that it reduces the special casing a bit, where you just detect (for instance) an empty arguments list, and insert a dummy node there, and then if you find whitespace or comments, you already have an adjacent node to attach to.

I'm not personally sure why the placeholder node is a problem, since any tool processing such an AST (as escodegen does in my PR) can basically just skip it.

But anyway, if any major objection comes down to the empty placeholder node approach, then I can (probably, hopefully) adjust the PR to instead encode such things in extras.inner or the like. The risk is, we may find another grammar situation later where "inner" doesn't suffice, and we'll be chasing more rabbits to invent solutions there.

What do you think?

Owner

michaelficarra commented Sep 23, 2013

I am not currently aware of a grammar problem with that approach

It doesn't handle this.

function /*0*/ a /*1*/ ( /*2*/ ) /*3*/ {}

I don't see how your EmptyNode solution would either. You would have to add an additional list property to every node for every allowed whitespace position in that node's concrete syntax. It's doable, but not ideal.

Besides, as @constellation said, this information really does not belong in an AST. @constellation is right that this information should be stored in a CST. Also, remember that escodegen does not actually generate the resultant JS string output, but generates a CST that it annotates and passes to mozilla's source-map library. I'm sure we could come up with an interface for AST->CST that allows the consumer to modify this IR, adding comments/whitespace wherever they want.

Owner

Constellation commented Sep 23, 2013

@getify

I'm not personally sure why the placeholder node is a problem, since any tool processing such an AST (as escodegen does in my PR) can basically just skip it.

For example, if a dummy node is introduced into the parameters of a FunctionDeclaration, we can no longer treat params.length as its number of parameters. And every Nothing|Node interface that is tested with code like the following becomes broken:

if (node.xxx) {
   // actual node
}

So if we introduce a dummy node, the tree should be passed to Escodegen immediately with a special option (such as concreteSyntaxAwareTree: true), since this tree is no longer abstract and can only be handled by Escodegen.

In the long term, I think defining a CST is the better approach to this (yup, it is the harder approach).
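To make the params.length and Nothing|Node concerns concrete, here is a small sketch. The `EmptyExpression` node type is the one proposed in this thread. The helper names (`naiveArity`, `placeholderAwareArity`, `hasReturnValue`) are hypothetical and only stand in for the kind of checks an AST consumer might already contain.

```javascript
// `function a(/*2*/) {}` encoded with a placeholder node in `params`.
var fnWithPlaceholder = {
    type: 'FunctionDeclaration',
    id: { type: 'Identifier', name: 'a' },
    params: [{
        type: 'EmptyExpression',
        extras: { leading: [{ type: 'MultilineComment', value: '2' }] }
    }],
    body: { type: 'BlockStatement', body: [] }
};

// What an existing consumer would likely do: count the entries.
function naiveArity(fn) {
    return fn.params.length;
}

// What every consumer would have to do once placeholders exist.
function placeholderAwareArity(fn) {
    return fn.params.filter(function (param) {
        return param.type !== 'EmptyExpression';
    }).length;
}

// A Nothing|Node slot: `argument` is normally null for a bare `return;`.
var returnWithPlaceholder = {
    type: 'ReturnStatement',
    argument: { type: 'EmptyExpression' }
};

// The usual truthiness test now reports a value that is not there.
function hasReturnValue(stmt) {
    return Boolean(stmt.argument);
}

console.log(naiveArity(fnWithPlaceholder));            // 1
console.log(placeholderAwareArity(fnWithPlaceholder)); // 0
console.log(hasReturnValue(returnWithPlaceholder));    // true
```

The function takes no parameters and the return statement has no value, yet the unmodified checks report one parameter and a present value. That is why a tree containing placeholders could not safely be handed to tools that expect a standard AST.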

getify commented Sep 23, 2013

@michaelficarra good point. bummer. I had a sense it might be more complicated than we hoped. It's solvable by chasing rabbits. Sucks. [Update: see below comment]

However, blanket statements such as "do not belong in the AST" are not helpful. I understand what the point of an AST normally is. But we're dealing with an auxiliary/piggy-back usage of ASTs, which is to support transformations on code. Which is why I suggested tagging it as extra meta-data, not actual first-class AST nodes.

Needing to support roundtrip transformations means every step of the transformation has to preserve data, or the roundtrip fails.

If I have a tool that produces an AST (or CST), I can insert whitespace (or comments) wherever I want. However, if I have a tool that parses existing code for the purpose of transforming it, then I need to preserve things in some format. If you say that an AST has to lose that data, then any flow which goes through the AST step is a no-go in terms of what types of tools I'm working on.

Can you explain what your objection to ASTs storing meta-data is, other than just a purely academic definition of the typical usage of an AST?

Moreover, for the class of tools which need to preserve whitespace and comments through transformation (ie, not JS engines, not minifiers, but most everything else), is it really better to say "no, go write your own parser and code-generator from scratch to complete the roundtrip"? I obviously could. But I certainly would rather prefer to embrace and extend the current tools, if at ALL possible.

getify commented Sep 23, 2013

@michaelficarra Thanks for pointing out that "extras" should be able to be attached to identifiers as well. Fixed that easily with this commit: https://github.com/getify/escodegen/commit/62116bf28f52c8e09da06023fff148f448675e10

So, now this AST produces the code you suggested (note particularly it even gives the whitespacing you implied! :) ) using my patch to escodegen:

{
    "type": "Program",
    "body": [
        {
            "type": "FunctionDeclaration",
            "id": {
                "type": "Identifier",
                "name": "a",
                "extras": {
                    "leading": [
                        {
                            "type": "MultilineComment",
                            "value": "0"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ],
                    "trailing": [
                        {
                            "type": "Whitespace",
                            "value": " "
                        },
                        {
                            "type": "MultilineComment",
                            "value": "1"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ]
                }
            },
            "params": [
                {
                    "type": "EmptyExpression",
                    "value": "",
                    "extras": {
                        "leading": [
                            {
                                "type": "Whitespace",
                                "value": " "
                            },
                            {
                                "type": "MultilineComment",
                                "value": "2"
                            },
                            {
                                "type": "Whitespace",
                                "value": " "
                            }
                        ]
                    }
                }
            ],
            "defaults": [],
            "body": {
                "type": "BlockStatement",
                "body": [],
                "extras": {
                    "leading": [
                        {
                            "type": "Whitespace",
                            "value": " "
                        },
                        {
                            "type": "MultilineComment",
                            "value": "3"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ]
                }
            },
            "rest": null,
            "generator": false,
            "expression": false
        }
    ]
}

which yields

function /*0*/ a /*1*/ ( /*2*/ ) /*3*/ {}

Now, there's clearly an issue above, which is that I inserted an EmptyExpression node in the params list, which is probably not a great idea. However, if that were called Empty or EmptyNode or Placeholder, it's not nearly as offensive.

EXCEPT of course the mentions in above comments about checking for instance params.length being thrown off. That's a bummer.

So, the alternate approach, using extras.inner, could easily solve that, ditching then the EmptyExpression name-bikeshedding entirely, and addressing the concerns about length checks. That AST could look like:

{
    "type": "Program",
    "body": [
        {
            "type": "FunctionDeclaration",
            "id": {
                "type": "Identifier",
                "name": "a",
                "extras": {
                    "leading": [
                        {
                            "type": "MultilineComment",
                            "value": "0"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ],
                    "trailing": [
                        {
                            "type": "Whitespace",
                            "value": " "
                        },
                        {
                            "type": "MultilineComment",
                            "value": "1"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ]
                }
            },
            "params": [],
            "defaults": [],
            "body": {
                "type": "BlockStatement",
                "body": [],
                "extras": {
                    "leading": [
                        {
                            "type": "Whitespace",
                            "value": " "
                        },
                        {
                            "type": "MultilineComment",
                            "value": "3"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ]
                }
            },
            "extras": {
                "inner": [
                    {
                        "type": "Whitespace",
                        "value": " "
                    },
                    {
                        "type": "MultilineComment",
                        "value": "2"
                    },
                    {
                        "type": "Whitespace",
                        "value": " "
                    }
                ]
            },
            "rest": null,
            "generator": false,
            "expression": false
        }
    ]
}

Notice I just added extras.inner onto the FunctionDeclaration node, which would be set to imply that it should add those extras as inside the ( ) pair.

The remaining question is, should inner in that case mean "at the beginning of the ( ) pair", or at the end? Because there might also be param identifiers which also could have extras on them. I would say that inner in this case would only reasonably be used in the empty ( ) pair case, so it's totally safe to just say that inner means "at the beginning of ( )". Wouldn't even be that surprising if you also had params, that they came after such extras.

This solution works. It's "harder", but it addresses the concerns of needing empty nodes and throwing off length counts.

If we get a sense that inner is in fact a more desired direction to go, I'm happy to update the code in the PR for that.
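For illustration only, a generator that honors extras.inner on a FunctionDeclaration might look roughly like the sketch below. This is not the code in the PR: `extrasToText`, `wrap`, `generate`, `ws` and `mc` are hypothetical names, only three node types are handled, and the single space after the `function` keyword is an assumption about what the generator itself emits.

```javascript
// Hypothetical helper: render a list of extras as text.
function extrasToText(list) {
    return (list || []).map(function (extra) {
        if (extra.type === 'Whitespace') { return extra.value; }
        if (extra.type === 'MultilineComment') { return '/*' + extra.value + '*/'; }
        return (extra.marker || '//') + extra.value + '\n'; // LineComment
    }).join('');
}

// Surround a node's own text with its leading and trailing extras.
function wrap(node, text) {
    var extras = node.extras || {};
    return extrasToText(extras.leading) + text + extrasToText(extras.trailing);
}

function generate(node) {
    switch (node.type) {
    case 'Identifier':
        return wrap(node, node.name);
    case 'BlockStatement':
        return wrap(node, '{' + node.body.map(generate).join('') + '}');
    case 'FunctionDeclaration': {
        // `inner` is emitted at the beginning of the ( ) pair.
        var inner = extrasToText((node.extras || {}).inner);
        return wrap(node, 'function ' + generate(node.id) +
            '(' + inner + node.params.map(generate).join(',') + ')' +
            generate(node.body));
    }
    default:
        throw new Error('Unsupported node type in this sketch: ' + node.type);
    }
}

// Small builders to keep the example tree readable.
function ws(value) { return { type: 'Whitespace', value: value }; }
function mc(value) { return { type: 'MultilineComment', value: value }; }

var fnDecl = {
    type: 'FunctionDeclaration',
    id: {
        type: 'Identifier',
        name: 'a',
        extras: { leading: [mc('0'), ws(' ')], trailing: [ws(' '), mc('1'), ws(' ')] }
    },
    params: [],
    body: {
        type: 'BlockStatement',
        body: [],
        extras: { leading: [ws(' '), mc('3'), ws(' ')] }
    },
    extras: { inner: [ws(' '), mc('2'), ws(' ')] }
};

console.log(generate(fnDecl));
// function /*0*/ a /*1*/ ( /*2*/ ) /*3*/ {}
```

Note that `params` stays empty here, so length checks on the parameter list keep working.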

Owner

michaelficarra commented Sep 23, 2013

@getify: The FunctionDeclaration case I gave was not a perfect example, then. inner is still ambiguous with other constructs like this FunctionExpression:

(function /* 0 */ ( /* 1 */ ) /* 2 */ {})

inner could be either the position of 0 or 1 (or 2, but you could say that's just before the BlockStatement, so we'll assume that). Similar problems exist with try/catch/finally and I'm sure many others. As I said above, you could get this idea to work by adding one list property for every position in the node's concrete syntax where whitespace/comments are allowed. But I think that's a bad solution. And as @constellation mentioned, we don't want to insert fake nodes for practical purposes, not just academic ones.

then any flow which goes through the AST step is a no-go in terms of what types of tools I'm working on.

Yes, exactly. The whole point of an AST is that we've discarded that information because it doesn't have a semantic meaning. You're still talking about tools that have ties to the syntactic representation of the program, which means a CST is more appropriate for you. Not only that, an AST is inappropriate for your use cases.

getify commented Sep 23, 2013

@michaelficarra

...is still ambiguous with other constructs...

Being "ambiguous" is not nearly as much a problem as being "insufficient". Tools which produce such *STs can have rules to decide where to put the extras and where not. There's all kinds of rules around producing valid ASTs and not sticking nodes in inappropriate/unexpected places. There's no reason such appropriate rules can't be applied to where you put extras.

Moreover, in your case, it's not ambiguous at all. inner on that function declaration is totally well-defined by what I already said, which is that inner means "at the beginning of the child ( ) set" (or the [ ] or { } as appropriate), meaning that such extras attached to the FunctionExpression would appear like this:

( /* here's some unambiguous extras! */ function /* 0 */ ( /* 1 */ ) /* 2 */ {})

As you can see, all the extras would appear at the beginning (but inside) of the outer ( ). Not ambiguous at all.

Moreover, the other examples you suggest like try/catch etc all have BlockStatement children in their trees, which means the leading and trailing can all represent every extras location. The catch clause has an identifier, so we can encode comments even inside the ( ) without needing inner.

The only place not currently handled is .. catch /* not handled */ (err) { .. }. Turns out since we don't need inner to target the ( ) set, since it's always non-empty in a catch, then inner can easily mean "between catch and (".

I suspect that most other examples we can dream up have similar triage.

Yes, we're chasing rabbits. I admitted that much earlier in the thread. So? That's what development sometimes requires.

Not only that, an AST is inappropriate for your use cases.

You seem to be overly pedantically hung up on the purity of an AST. And yet, both esprima AND escodegen already add incomplete leadingComments, etc. meta data nodes into those "pure" ASTs they deal with. I'm not inventing a new concept (meta-data in an AST), but making it better than it currently is (and is already well precedented).

Why does it really matter if what we're talking about is purely an AST or a CST? Do these tools that I'm talking about expose (public API wise) the AST separately from the CST? To my knowledge, they do not.

If your underlying point here is that if I want to track whitespace/comments in code transformations, that I'm shi** out of luck with any of these tools and that I should go invent my own whole stack for them, that's totally unhelpful, and I reject that stance. It's contrary to the spirit of OSS.

If however you're just splitting hairs on what we call it, that's not particularly helpful (bikeshedding) either. It's clear what I mean, and I don't think its label matters at this part of the stack.

And if you're suggesting that all these tools need to change to separately expose AST distinguished from CST, great, be explicit about that. That would be helpful as a proposal. But make sure that the CST is before the creation of an AST, not after, or we are in a situation where there's intolerable information loss in the creation of your pure AST concept.

But, in the end, regardless of whether we're inventing how to encode this meta-data into an AST, or whether we're debating how that same meta-data should be encoded into something that looks almost like it but has a different label (CST), we still have to figure out if there's a practical way to encode the meta-data. Can you help with that task instead of trying to derail the effort?

getify commented Sep 24, 2013

Actually, my above analysis about how ( /* something */ function(){} ) would need to be represented was unnecessarily complicated. After inspecting the ExpressionStatement, which is the outer ( .. ), it doesn't need an inner because ( ) alone with nothing inside it is invalid. So, there's always a child node inside the ( ), and in this case, that child is FunctionExpression. That node can have an extras.leading to persist the /* something */ comment extra.

Bottom line, YES these annotations have some ambiguity to them if they're sloppily used willy-nilly across the tree. A single comment could be persisted in some cases by several different nodes where the extras could be attached.

But that's not that big of a deal. escodegen can just output whatever it gets without worrying, and put the onus on the generator of the tree to make sure to put extras annotations in the right spot. Thus, the key is deciding what those definitive, non-ambiguous rules should be. As a starting point, I'd say the extras should always be annotated as deeply in the tree as applicable (which removes the ambiguity of annotations overlapping on the same extra). The extra should always be attached to the closest node that it can attach to.

For instance, prefer attaching to an Identifier node if applicable, and not the parent Expression, assuming of course that encodes the same extra in the same location.

If a tree is created with these proper rules, then escodegen should be able to easily churn through all the nodes and not worry about outputting any extras it finds.
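One way the "attach as deeply as possible" rule could be applied by a tool that builds the tree is sketched below. It assumes every node carries a `range: [start, end]` pair of source offsets and that the extra ends exactly where the node begins. The helper names (`childNodes`, `deepestNodeStartingAt`, `attachLeading`) are hypothetical and nothing here comes from the PR.

```javascript
// Collect direct child nodes, skipping the `extras` annotations themselves.
function childNodes(node) {
    var kids = [];
    Object.keys(node).forEach(function (key) {
        if (key === 'extras') { return; }
        var value = node[key];
        (Array.isArray(value) ? value : [value]).forEach(function (item) {
            if (item && typeof item.type === 'string') { kids.push(item); }
        });
    });
    return kids;
}

// Find the deepest node whose source range starts at `offset`.
function deepestNodeStartingAt(node, offset) {
    var kids = childNodes(node);
    for (var i = 0; i < kids.length; i++) {
        var deeper = deepestNodeStartingAt(kids[i], offset);
        if (deeper) { return deeper; }
    }
    return node.range && node.range[0] === offset ? node : null;
}

// Attach `extra` as a leading extra of the deepest node that starts at `offset`.
function attachLeading(root, extra, offset) {
    var target = deepestNodeStartingAt(root, offset);
    if (!target) { return null; }
    target.extras = target.extras || {};
    target.extras.leading = (target.extras.leading || []).concat([extra]);
    return target;
}

// Tree for `/*x*/foo(1);` where the comment occupies offsets 0 to 5.
var tree = {
    type: 'Program',
    range: [5, 12],
    body: [{
        type: 'ExpressionStatement',
        range: [5, 12],
        expression: {
            type: 'CallExpression',
            range: [5, 11],
            callee: { type: 'Identifier', name: 'foo', range: [5, 8] },
            arguments: [{ type: 'Literal', value: 1, raw: '1', range: [9, 10] }]
        }
    }]
};

var owner = attachLeading(tree, { type: 'MultilineComment', value: 'x' }, 5);
console.log(owner.type); // Identifier
```

Four nodes start at offset 5 (Program, ExpressionStatement, CallExpression and Identifier), and the rule picks the Identifier, so only one node ends up carrying the comment.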

Owner

michaelficarra commented Sep 27, 2013

Being "ambiguous" is not nearly as much a problem as being "insufficient".

Again, I was not clear enough. It's insufficient. Show me the tree that represents (function /* 0 */ ( /* 1 */ ) /* 2 */ {}) using just leading, trailing, and inner. Yes, you can choose to define inner to be the portion where 0 is or the portion where 1 is, but it can't be both simultaneously. And neither are before or after any node. You would need 2 separate inner-like list properties: nextToTheFunctionKeyword and insideTheParameterListBeforeTheFirstParameter. I will say this for the third time: you could get this idea to work by adding one list property for every position in the node's concrete syntax where whitespace/comments are allowed.
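A sketch of what "one list property per position" could look like for this exact FunctionExpression follows. The two property names are the ones used in the paragraph above. Everything else (`extrasText`, `generateFunctionExpression`, the decision to support only an anonymous function with no parameters and an empty body) is a simplifying assumption for illustration, not a proposed format.

```javascript
// Hypothetical helper: render whitespace and multiline-comment extras.
function extrasText(list) {
    return (list || []).map(function (extra) {
        return extra.type === 'Whitespace' ? extra.value : '/*' + extra.value + '*/';
    }).join('');
}

// Handles only an anonymous function with no parameters and an empty body.
function generateFunctionExpression(node) {
    var slots = node.extras || {};
    var bodyExtras = node.body.extras || {};
    return 'function' +
        extrasText(slots.nextToTheFunctionKeyword) +
        '(' + extrasText(slots.insideTheParameterListBeforeTheFirstParameter) + ')' +
        extrasText(bodyExtras.leading) +
        '{}';
}

function space() { return { type: 'Whitespace', value: ' ' }; }
function comment(value) { return { type: 'MultilineComment', value: value }; }

var fnExpr = {
    type: 'FunctionExpression',
    id: null,
    params: [],
    body: {
        type: 'BlockStatement',
        body: [],
        extras: { leading: [space(), comment(' 2 '), space()] }
    },
    extras: {
        nextToTheFunctionKeyword: [space(), comment(' 0 '), space()],
        insideTheParameterListBeforeTheFirstParameter: [space(), comment(' 1 '), space()]
    }
};

console.log('(' + generateFunctionExpression(fnExpr) + ')');
// (function /* 0 */ ( /* 1 */ ) /* 2 */ {})
```

Comments 0 and 1 each need their own slot on the FunctionExpression, while comment 2 can still be expressed as a leading extra of the BlockStatement.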

You seem to be overly pedantically hung up on the purity of an AST.

I do not feel it is being pedantic. These are very different structures. An AST represents the semantic structure of a program with all of the concrete syntax information stripped away. Tools like a partial evaluator, a minifier, or an interpreter work on these structures. A CST retains some/all information about how programs (possibly that very same program) are syntactically represented. Note that more than one CST can share the semantics of a single AST, and you can always generate an AST for a given CST. Tools that perform program transformations where unchanged code should be represented using the original input's concrete syntax should work on the CST.

I'd argue that esprima/acorn/whatever should generate CSTs (and optionally ASTs). escodegen can operate on either one. We just need to find a good CST format to use. It'd be extra cool if all CSTs were represented in a format that's compatible with the AST format (in this case, the SpiderMonkey AST format), but you would need to use an approach like I showed above for this. There's pros and cons to using that format.

I'm being "overly pedantic" because it's very important to choose this format carefully. escodegen is an influential tool in this space, and I am extremely passionate about these ECMAScript static analysis and static metaprogramming tools (as you'll see next week at NationJS).

And if you're suggesting that all these tools need to change to separately expose AST distinguished from CST, great, be explicit about that.

That's exactly what I was trying to say.

we still have to figure out if there's a practical way to encode the meta-data

... which brings us back to the other point I was trying to make. See above.

getify commented Sep 27, 2013

@michaelficarra

First off, let me just state for the record: my explicit goal is to be as unobtrusive to the existing AST structure as possible. If I were inventing a solution from scratch, and existing AST compatibility weren't an issue, most of the "hard" cases could just be solved by modifying the AST structure to put a node in somewhere to represent a thing we need to address with positional extras.

I believe the end result of this exercise is that we'll find a system which sufficiently covers all the necessary positions but isn't so crazy complex as to have a named property for every single grammar position. The simpler the solution, the more effectively we can reason about it, and the more effectively we can maintain the pattern with future language additions.

So, I'm trying to look for a solution that is sufficient but minimal. That's why I haven't just blindly gone down the path you suggested of "one list property for every position in the node's concrete syntax". I would prefer to add complexity to my suggested solution only where it's proven that it's necessary, and where not proven, be restrained and simpler.

Again, I was not clear enough. It's insufficient. Show me the tree that represents (function /* 0 */ ( /* 1 */ ) /* 2 */ {})

That _WAS_ a trickier one, agreed, but it turned out there was a "solution" I arrived at a few days ago that wasn't too terribly ugly in terms of AST, and didn't require extra-special casing. In this case, here's how I "solved" it thus far, and verified it works fine (doesn't break) in escodegen (ie, it works fully in my cases, and doesn't break any unit tests of escodegen):

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "FunctionExpression",
                "id": {                       // <---- instead of `null` here, it's
                    "extras": {               // an object with no `type` or `name`
                        "leading": [
                            {
                                "type": "MultilineComment",
                                "value": "0"
                            }
                        ]
                    }
                },
                "extras": {
                    "between": [
                        {
                            "type": "MultilineComment",
                            "value": "1"
                        }
                    ]
                },
                "params": [],
                "defaults": [],
                "body": {
                    "type": "BlockStatement",
                    "body": [],
                    "extras": {
                        "leading": [
                            {
                                "type": "MultilineComment",
                                "value": "2"
                            }
                        ]
                    }
                },
                "rest": null,
                "generator": false,
                "expression": false
            }
        }
    ]
}

Note: I have since changed from calling it inner to calling it between, as in various parts of the grammar, it's a little more appropriate in naming semantics. Also, I basically changed the id property from being null to being an object that has no type or name, so the serializations in escodegen don't treat it as an identifier, but it gives me a place to hang the extras.
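To make the tree above concrete, here is a rough sketch of how a generator could turn that FunctionExpression back into text. It is not escodegen's code: `renderExtras` and `renderFunctionExpression` are hypothetical names, only the empty-params, empty-body shape is handled, and any extra type other than "MultilineComment" is an assumption.

```javascript
// Render a list of extras back to source text.
function renderExtras(list) {
  return (list || []).map(function (extra) {
    if (extra.type === "MultilineComment") return "/*" + extra.value + "*/";
    return extra.value; // assumed: any other extra (e.g. whitespace) is emitted verbatim
  }).join("");
}

// Only handles a function with no parameters and an empty body.
function renderFunctionExpression(node) {
  const idExtras = (node.id && node.id.extras) || {};
  const name = (node.id && node.id.name) || "";
  const own = node.extras || {};
  const bodyExtras = node.body.extras || {};
  return "function" +
    renderExtras(idExtras.leading) + name + renderExtras(idExtras.trailing) +
    "(" + renderExtras(own.between) + ")" +
    renderExtras(bodyExtras.leading) + "{}";
}

const fn = {
  type: "FunctionExpression",
  id: { extras: { leading: [{ type: "MultilineComment", value: "0" }] } },
  extras: { between: [{ type: "MultilineComment", value: "1" }] },
  params: [],
  body: {
    type: "BlockStatement",
    body: [],
    extras: { leading: [{ type: "MultilineComment", value: "2" }] }
  }
};

console.log(renderFunctionExpression(fn)); // function/*0*/(/*1*/)/*2*/{}
```

Because the comment values in the tree are "0", "1", and "2" with no padding, the output has no spaces. Reproducing the original spacing would need whitespace extras next to the comments.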

If there are other code bases than escodegen which rely on checks like if (expr.id), those could easily change to be if (expr.id && expr.id.name) and have no other side effects from this change.

The nice symmetry of this is that if this were a named function instead of anonymous, we'd be putting those extras in the exact same place, so it's actually much less of an exception to the AST than it seems. It's really just keeping the extras in the same place regardless of named or anonymous.

If that turned out to be too unacceptable/obtrusive, it could still be solved without another positional property by just creating another dummy property node like anonymous_id or whatever, which would house the extras property and work exactly like the above.

This is one of 3 "hard" cases I've found after having worked through basically (almost) the whole grammar the last couple of days, making changes in escodegen where necessary. It turns out that the _vast majority_ of cases are perfectly well represented with no other changes or weirdness in the AST, and just hanging "leading/trailing/between" extras annotations off the appropriate nodes.

"between" in all cases means "extras that sit between two other tokens" but which tokens otherwise do not have independent nodes in the AST. The most common patterns are things like if /*0*/ ( ... ), where there is no node for if and there is no node (only an array) for the ( ... ) list. But "between" unambiguously means here "between if and (". Since function declarations and call-expressions are the only place I've found that an empty ( ) can appear, "between" only needs to mean "inside the ( )" in those cases.

The other 2 "hard" cases I've found, which I haven't yet settled on the appropriate solution for, are (..) /*0*/ => ... and function /*0*/ * foo.... Unsurprising that ES6 generators and shorthand sugars would create extra complexity for grammar and for my tasks. But it's not fatal.

I believe in both those cases, the "best" solution is, again, not to add another positional extra property, but instead to add a special property node to hold the extras. For instance, in the arrow function case, there's a boolean arrow property in the AST. To represent any extras that appear between the function and the *, we could change arrow to be an object (still would pass any simple boolean tests like if (node.arrow)..), or we could add an optional object like arrow_position or whatever that housed the extras.


Bottom line: since my examination and modifications of the escodegen code have revealed relatively very few places where "leading/trailing/between" is insufficient all by itself, I haven't chosen to go all the way down the rabbit hole with adding lots of extra complexity around the positional property names, but rather seek places where we can unobtrusively add property-nodes into the AST which can house the extras in a reasonable way.

This approach significantly restrains the complexity of a fully-sufficient solution, and contains the few exception "hard" cases to only those node types, so the rest of the non-hard cases stay simpler.

getify commented Sep 27, 2013

^^^^^ /s/AST/CST

Owner

michaelficarra commented Sep 27, 2013

new /* 0 */ ( /* 1 */ a( /* 2 */ ) /* 3 */) /* 4 */ ( /* 5 */ )
  • 1/2/3 can easily be before, between, and after the CallExpression
  • 0, 4, and 5 are all "between" the NewExpression in a way

If you wanted to keep going with this approach, the only way to do it correctly would be a property for each position where whitespace/comments are allowed: betweenForPosition0, betweenForPosition4, and betweenForPosition5.
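For comparison, a sketch of what the per-position approach could look like for this one expression. The property names `betweenForPosition0`, `betweenForPosition4`, and `betweenForPosition5` are the ones floated above. The renderer functions are hypothetical and only handle an Identifier callee with zero arguments.

```javascript
function renderExtras(list) {
  return (list || []).map(function (extra) {
    return extra.type === "MultilineComment" ? "/*" + extra.value + "*/" : extra.value;
  }).join("");
}

// CallExpression with an Identifier callee and no arguments.
function renderCall(node) {
  const e = node.extras || {};
  return renderExtras(e.leading) + node.callee.name +
    "(" + renderExtras(e.between) + ")" + renderExtras(e.trailing);
}

// NewExpression with no arguments; a CallExpression callee needs parentheses.
function renderNew(node) {
  const e = node.extras || {};
  const callee = node.callee.type === "CallExpression"
    ? "(" + renderCall(node.callee) + ")"
    : node.callee.name;
  return "new" + renderExtras(e.betweenForPosition0) + callee +
    renderExtras(e.betweenForPosition4) +
    "(" + renderExtras(e.betweenForPosition5) + ")";
}

const comment = function (value) {
  return { type: "MultilineComment", value: value };
};

const newExpr = {
  type: "NewExpression",
  arguments: [],
  extras: {
    betweenForPosition0: [comment("0")],
    betweenForPosition4: [comment("4")],
    betweenForPosition5: [comment("5")]
  },
  callee: {
    type: "CallExpression",
    callee: { type: "Identifier", name: "a" },
    arguments: [],
    extras: { leading: [comment("1")], between: [comment("2")], trailing: [comment("3")] }
  }
};

console.log(renderNew(newExpr)); // new/*0*/(/*1*/a(/*2*/)/*3*/)/*4*/(/*5*/)
```

The cost of this design shows up in `renderNew`: each node type needs its own set of slot names, and the generator has to know every one of them.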

If there are other code bases than escodegen which rely on checks like if (expr.id)

There are, and they are not wrong to do that check. Look at the specification for FunctionExpression

interface FunctionExpression <: Function, Expression {
    type: "FunctionExpression";
    id: Identifier | null;
    params: [ Pattern ];
    defaults: [ Expression ];
    rest: Identifier | null;
    body: BlockStatement | Expression;
    generator: boolean;
    expression: boolean;
}

If id is not null, you can safely assume that it is an Identifier (and therefore a Node, Expression, and Pattern). There's no way we could put a non-null, non-Identifier there.
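A small illustration of that assumption, as a sketch only. `describeFunction` is a made-up consumer written against the interface above, not code from any particular tool.

```javascript
// Written against the spec: a non-null `id` is assumed to be an Identifier.
function describeFunction(node) {
  return node.id ? "function " + node.id.name : "anonymous function";
}

const body = { type: "BlockStatement", body: [] };

// Spec-compliant anonymous function: id is null.
const anonymous = { type: "FunctionExpression", id: null, params: [], body: body };

// Proposed placeholder: id is a non-null object with no `type` or `name`.
const withPlaceholder = {
  type: "FunctionExpression",
  id: { extras: { leading: [{ type: "MultilineComment", value: "0" }] } },
  params: [],
  body: body
};

console.log(describeFunction(anonymous));       // "anonymous function"
console.log(describeFunction(withPlaceholder)); // "function undefined"
```

The second call does not throw, but it silently reports a name that does not exist, which is the kind of breakage a truthiness check on `id` would run into.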

getify commented Sep 27, 2013

@michaelficarra

If you wanted to keep going with this approach

I'm confused by the somewhat adversarial tone of this thread. I'm hoping I'm just misunderstanding your intent or reading too much into it. I'm trying to collaboratively problem-solve (to invent this new CST comment/whitespace tracking approach). But it feels like you'd rather that attempt just fail. Or, it seems like you may want to prove that it's just too complicated to be practical, so that I'll drop it.

If that's not it, and I hope it's not, is your perspective that you'd rather have whatever the solution is be as complicated as possible, simply because of the relatively very few "hard case" exceptions?

I have the opposite feeling and intent, as I've said several times, that I'm looking for the minimal sufficient, rather than the maximally verbose/robust.

For instance, your example with new is solvable either with more positional extras, as you suggested, OR it's solvable through other means, as I've suggested. I don't understand the assertion "the only way to do it correctly would be...", except that we just differ on what we each mean by "correctly".

I would rather have "leading/trailing/between" be the syntax for extras across the entire grammar, and solve the "hard cases" by putting extra properties/nodes in to hold those extras. I haven't yet found a case where that can't work, including all the cases you keep suggesting.

The opposite approach, which seems to be what you prefer, is that in all parts of the grammar, the positional extras names are verbose and highly specific, which IMO creates unnecessary complexity for the vast majority of cases.

That we've come up with a few exceptions which require special handling doesn't justify in my mind a solution that complects all cases. I'd rather the simple be the norm in the majority, and the exception cases only have the extra complexity.

If you don't share that feeling, can you explain better why?

getify commented Sep 27, 2013

@michaelficarra

There's no way we could put a non-null, non-Identifier there.

A few times now you've shot down suggested "solutions" to the hard cases because you asserted that those approaches would "break" existing tools' assumptions about the purity of the AST structure. I'm not really sure if those assertions are entirely true. For instance, escodegen certainly doesn't break on the "empty id" case, it handles it just fine. Do other tools break? I dunno. You probably know this space better than I do, but it's not _clear and obvious_ that it's as dire an approach as you imply.

Especially if we (JS tools authors) could all agree on a new CST structure that implied only minimal changes to those tools (like one extra && id.name check) in exchange for universal support for the additional "extras" functionality. That doesn't sound like an impossible compromise.

We don't have to go that route (I suggested alternatives). But I'm just saying I don't blankly agree that they're impossible non-starters. Unless you universally speak for all other JS tools.

But more troubling, this assertion brings up another inconsistency in your arguments, from how I see it (unless I'm missing something). You've made two classes of arguments thus far in the thread:

  1. "The 'AST' is a purely abstract syntax, impermeable to meta-data annotations. Any attempts to add meta-data annotations to an AST disqualifies it as an AST, and makes it instead a CST. Stop calling it an AST. It's not an AST. It's a CST. It matters what we call it."

    (never mind the inherent contradictions here that the trees produced and used by esprima, escodegen, traceur, etc all have existing functionality for producing and using meta-data in their trees, including line/column position data, leadingComments, and trailingComments, and thus far precedent pre-dating my current efforts to improve the meta-data in said trees)

  2. "You can't do what you're suggesting (to the tree) because existing tools which ingest ASTs have an immutable set of expectations about AST structure and your violations of those expectations are intolerable."

These arguments seem inherently contradictory to me. You can't have it both ways.

Either we're discussing the invention of a new CST (which looks an awful lot like existing AST) that most/all of the JS tools could eventually adopt as their tree representation, thus preserving and protecting the AST as pure and abstract and free from pollution. And so any deviations we consider are to the CST and not the AST.

Or, we're discussing superset modifications to the currently acceptable AST format to allow for other sorts of tasks besides just syntax generation (such as code re-formatting, etc).

If we're talking about a CST, its resemblance to an AST is only accidental (but fortunate). Importantly, it doesn't have to abide by the same strict rules and assumptions as pure ASTs do. So your arguments about assumptions around AST (such as id assumptions) are, point in fact, not germane, because we could easily say that CSTs require if (id && id.name) type checks, whereas ASTs can get away with if (id) type checks. IOW, CSTs don't have to abide by the same rules as ASTs, by nature of how we've already distinguished those two terms.

Otherwise, all your "pedantic" arguments about AST vs CST were irrelevant, and in the end we're still talking about having to make strictly-compliant superset additions to standard AST, in which case we obviously do have to be careful about violating AST assumptions.

So, which is it? Are we designing a CST syntax, or are we designing a superset addition to AST?

If we're designing a CST syntax, then there's nothing wrong with my id node hijacking, because the tools will already be changing to deal with differences in CST vs AST, and the difference here will be slight and quite reasonable. We don't have to do it that way (I made other alternate suggestions), and we may not want to do it in the end, but it's certainly not a breaking-case as you strongly asserted.

If OTOH we're designing an AST superset addition, we can use my alternate suggestions, like adding anonymous_id: { extras: { ... } }. That works as a transparent addition to the AST because no current AST parsing code cares to look for such extra keys, and the only risk we introduce is avoiding property name choices which could create future compat issues as AST standard evolves. That's a common and mitigatable risk in software design.
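A hedged sketch of why an additive key such as `anonymous_id` can be invisible to existing consumers. It assumes a walker driven by a table of known child properties. `CHILD_KEYS`, `countNodes`, and `makeTree` are illustrative names only.

```javascript
// Hypothetical key table: only the listed child properties are ever visited.
const CHILD_KEYS = {
  Program: ["body"],
  ExpressionStatement: ["expression"],
  FunctionExpression: ["id", "params", "body"],
  BlockStatement: ["body"],
  Identifier: []
};

// Count nodes reachable through the known child properties.
function countNodes(node) {
  if (Array.isArray(node)) {
    return node.reduce(function (sum, item) { return sum + countNodes(item); }, 0);
  }
  if (!node) return 0;
  const keys = CHILD_KEYS[node.type] || [];
  return 1 + keys.reduce(function (sum, key) { return sum + countNodes(node[key]); }, 0);
}

// Build `(function(){})`, optionally with extra properties on the function node.
function makeTree(extraProps) {
  const fn = Object.assign({
    type: "FunctionExpression",
    id: null,
    params: [],
    body: { type: "BlockStatement", body: [] }
  }, extraProps);
  return { type: "Program", body: [{ type: "ExpressionStatement", expression: fn }] };
}

const plainTree = makeTree({});
const annotatedTree = makeTree({
  anonymous_id: { extras: { leading: [{ type: "MultilineComment", value: "0" }] } }
});

console.log(countNodes(plainTree), countNodes(annotatedTree)); // 4 4
console.log(annotatedTree.body[0].expression.id === null);     // true
```

A consumer that enumerates every property of a node, instead of using a key table, would still encounter the extra object, so "transparent" here depends on how the consuming tool walks the tree.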

getify commented Sep 27, 2013

Another observation (in favor of a separate CST which doesn't necessarily have to abide by the same rules as AST). This statement can't be represented in an AST in a way that it can be retrieved directly (which is actually a general re-statement of the new case that @michaelficarra brought up earlier above):

var a = (((b)));

The production of the AST naturally simplifies that code, and doesn't preserve each ( ) set as its own expression. Thus, if that above code had comments in between any of those sets of unnecessary ( ), such an AST would lose that info, and what came back out wouldn't have the ( ) sets, and moreover, it would lose those comments too, probably.

As I said, the problem with the new example above is that it has an unnecessary set of ( ) wrapped around the callee, and that extra level of wrapping isn't preserved in the AST. If that extra level of expression had been preserved, the comments 0, 4, and 5 would easily be expressible as "leading/trailing/between".

A CST on the other hand could have, for instance, a general "Expression" node, or maybe an "ExpressionGroup" or "ParenthesizedExpression", which could of course have as its single child element another such expression, nested as deep as you put those unnecessary ( ) sets. Having those elements to attach extras annotations to trivially solves the new case stated above.
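A rough sketch of that idea, assuming a CST node type called "ParenthesizedExpression" with a single `expression` child. The `render` and `unwrap` functions are hypothetical and cover only identifiers and parentheses.

```javascript
function renderExtras(list) {
  return (list || []).map(function (extra) {
    return extra.type === "MultilineComment" ? "/*" + extra.value + "*/" : extra.value;
  }).join("");
}

// Print a node together with its leading and trailing extras.
function render(node) {
  const extras = node.extras || {};
  let text;
  switch (node.type) {
    case "Identifier":
      text = node.name;
      break;
    case "ParenthesizedExpression":
      text = "(" + render(node.expression) + ")";
      break;
    default:
      throw new Error("unsupported node type: " + node.type);
  }
  return renderExtras(extras.leading) + text + renderExtras(extras.trailing);
}

// Drop the grouping nodes to reach the expression an AST would keep.
function unwrap(node) {
  while (node.type === "ParenthesizedExpression") node = node.expression;
  return node;
}

// CST for `(/* 0 */((b)))`: the comment hangs off the middle set of parens.
const grouped = {
  type: "ParenthesizedExpression",
  expression: {
    type: "ParenthesizedExpression",
    extras: { leading: [{ type: "MultilineComment", value: " 0 " }] },
    expression: {
      type: "ParenthesizedExpression",
      expression: { type: "Identifier", name: "b" }
    }
  }
};

console.log(render(grouped));      // (/* 0 */((b)))
console.log(unwrap(grouped).name); // b
```

Each set of parentheses gets its own node, so plain leading/trailing extras are enough to place a comment between any two of them.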

Bottom line, we have to discuss these changes as proposals for a CST that doesn't lose data (like an AST does). As such, the CST doesn't have to necessarily abide by the same strict AST rules, and thus suggestions I made earlier in the thread, such as the EmptyExpression node, or hijacking the id node for anonymous functions, or whatever... those suggestions shouldn't be invalidated simply on the argument that AST assumptions are violated.

Any tool that deals both with CSTs and ASTs would just have to make sure that such "extra or incompatible data" in a CST was discarded before it created its exportable pure AST representation.
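One possible shape for that discard step, as a sketch under stated assumptions: extras live under an `extras` key, and placeholder objects are recognizable as objects that have `extras` but no `type`. The function name `toAst` is hypothetical.

```javascript
// Copy a CST into a plain AST: drop `extras`, turn placeholders back into null.
function toAst(value) {
  if (Array.isArray(value)) return value.map(toAst);
  if (value === null || typeof value !== "object") return value;
  if (value instanceof RegExp) return value; // regex literal values pass through
  if (typeof value.type !== "string" && "extras" in value) return null; // placeholder
  const out = {};
  for (const key of Object.keys(value)) {
    if (key === "extras") continue;
    out[key] = toAst(value[key]);
  }
  return out;
}

const cst = {
  type: "FunctionExpression",
  id: { extras: { leading: [{ type: "MultilineComment", value: "0" }] } },
  extras: { between: [{ type: "MultilineComment", value: "1" }] },
  params: [],
  body: {
    type: "BlockStatement",
    body: [],
    extras: { leading: [{ type: "MultilineComment", value: "2" }] }
  }
};

const ast = toAst(cst);
console.log(JSON.stringify(ast));
// {"type":"FunctionExpression","id":null,"params":[],"body":{"type":"BlockStatement","body":[]}}
```

This sketch does not handle CST-only node types such as a grouping node for parentheses. Those would need an extra rule that replaces the node with its child.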

getify commented Sep 27, 2013

@constellation

I propose that escodegen formally change its semantics/behavior to dealing with the input of a CST, rather than of an AST. The CST it receives could look identical to an AST, or it might have extra stuff, and so the rules/assumptions of escodegen should be about CSTs (as we invent here) and not only the academic definition of pure AST.

Additionally, escodegen's API might be extended to have on it a method that "converts a CST to an AST", as a helper for distinguishing between the two formats, and to support interop with other tools which won't have yet dealt with CST vs. AST implications.

It will be a separate and orthogonal discussion if parsers (like @ariya's esprima or Acorn) want to first create and expose a CST (like we invent here), before then stripping down to an AST. The ideal case would be that they expose both, and a tooler could then take the CST from esprima and dump it into escodegen if they wanted to preserve real concrete syntax things (like comments), or it could ask esprima instead for the AST if the tooler didn't care about such things.

What do you think?

Owner

Constellation commented Sep 28, 2013

Oh, sorry. I'll catch up on this thread.

Contributor

ariya commented Sep 28, 2013

"..and thus far precedent pre-dating my current efforts to improve the meta-data in said trees)"

True, but to be fair, those extra bits are designed to give more information for the consumers/tools built around it. They are originally not meant to facilitate the faithful reconstruction of the code.

Contributor

ariya commented Sep 28, 2013

For CST, it is probably worth discussing it first at the level of dev-tech-js-engine-internals and/or js-tools.

Also, as I mentioned to @getify during our discussion at Edge Conference, another thing that can be considered for such a rich CST is scope information (which one needs to get/obtain/cross-reference via Escope these days).

Owner

Constellation commented Sep 28, 2013

@getify @michaelficarra

Defining a CST spec may be the right way to handle this problem. But it's a very tough task and it will take a lot of time.
Controlling the format of generated code is an urgent need. So in the mean time, we should provide some solution.

Introducing EmptyNode breaks the AST and tools. null|Node typed members exist widely in Parser AST spec and params.length is also widely used by tools. I think it isn't acceptable for solving this problem.

Personally, adding metadata to controlling the format of generated code looks the best way. inner may be ambiguous. But I think this is solved by attaching more verbose members corresponding to the each position (e.g. extra.comments.afterFunctionKeyword etc), is it right?

getify commented Sep 28, 2013

@ariya

those extra bits are designed to give more information for the consumers/tools built around it

How are leadingComments and trailingComments (and esprima's comments for any other comments it couldn't place) used for "more information" for the other tools? Is there another usage for them I wasn't aware of besides reconstruction of the code?

For CST, it is probably worth discussing it first at...

I'm not opposed to discussing it in those locations, it's probably a very good idea. But given we've got a fair amount of context here, I think the end result of this thread should first be:

  1. A proposal for escodegen to switch formally to CST (once one is "agreed upon") semantics, which in particular means that its logic which currently assumes only AST might be subject to some modest additions/modifications to accommodate generating code from a richer CST instead of just an AST. There was some pushback to such suggestions earlier in the thread, so we need to iron out if escodegen will be open to such changes. From my perspective, if escodegen wouldn't consider such CST additions, that would be a full-stop and the discussions outside this thread would be rather pointless. escodegen is critical in this whole step, because it's clearly the main tool which would benefit from such CSTs.

    The goal should be that if escodegen receives an AST, it should be able to construct valid code from it (as it does now), but if it receives a CST, it can also produce a richer representation of the code. This ideally shouldn't need config to do so, it should be able to auto-detect what it receives in the tree. IOW, a CST should be a superset of AST semantics with as much backwards compatibility as possible.

    One additional/specific idea along those lines: escodegen currently does a lot of automatic spacing/indentation from the AST/CST it's presented. But, if it receives a CST, and that tree has "extras" like whitespace and comments, it should respect those instead of the automatic stuff. Moreover, if the tree has location/position information in it, escodegen could derive the spacing/indentation based on that data, rather than automatically normalizing it.

  2. We should formalize an initial proposal for a CST that can track whitespace/comments faithfully, and then take that proposal to those other lists as a place to start. If we go there with only partial thoughts, it's probably likely to open up a whole new round of bikeshedding, whereas a complete 1st-draft proposal might reduce that temptation a little bit. IOW, this thread can be like a working-group on a CST proposal. :)
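The auto-detection idea in point 1 could be sketched like this: use the extras when a node has them, otherwise fall back to the generator's automatic spacing. `gap` and `renderIfHead` are hypothetical helpers, the "Whitespace" extra type is an assumption, and none of this is escodegen's actual code.

```javascript
function renderExtras(list) {
  return list.map(function (extra) {
    return extra.type === "MultilineComment" ? "/*" + extra.value + "*/" : extra.value;
  }).join("");
}

// Use the extras in `slot` if the node has them, otherwise the automatic default.
function gap(node, slot, fallback) {
  const list = node.extras && node.extras[slot];
  return list ? renderExtras(list) : fallback;
}

// Emit only the `if (...)` head; `testText` is the already generated test expression.
function renderIfHead(node, testText) {
  return "if" + gap(node, "between", " ") + "(" + testText + ")";
}

// A plain AST node: automatic spacing applies.
const plainIf = { type: "IfStatement" };

// A CST node: the tree dictates exactly what sits between `if` and `(`.
const annotatedIf = {
  type: "IfStatement",
  extras: {
    between: [
      { type: "Whitespace", value: "  " },
      { type: "MultilineComment", value: "0" }
    ]
  }
};

// An explicitly empty list means "emit nothing here".
const tightIf = { type: "IfStatement", extras: { between: [] } };

console.log(renderIfHead(plainIf, "x"));     // if (x)
console.log(renderIfHead(annotatedIf, "x")); // if  /*0*/(x)
console.log(renderIfHead(tightIf, "x"));     // if(x)
```

No configuration flag is involved. The presence of an `extras` slot on the node is the only signal, which keeps a standard AST on the existing code path.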

another thing that can be considered for such a rich CST is scope information

This is intriguing information indeed. Do you have any initial thoughts on what that kind of meta-data might look like? To avoid circular tree structures (which are not JSON serializable and thus not portable), we might need to think about something like adding a unique ID to each block that can hold scope, then having each statement include an annotation of its scope ID.
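As a starting point for discussion, the ID idea could look roughly like this. The property names `scopeId` and `inScope` are hypothetical, and the sketch only models ES5 function scope (Program and Function* nodes).

```javascript
// Give each scope-creating node a numeric `scopeId`, and record on every
// statement or declaration the id of the scope it appears in (`inScope`).
function annotateScopes(root) {
  let nextId = 0;
  function visit(node, scopeId) {
    if (Array.isArray(node)) {
      node.forEach(function (item) { visit(item, scopeId); });
      return;
    }
    if (!node || typeof node !== "object") return;
    let current = scopeId;
    if (typeof node.type === "string") {
      if (/Statement$|Declaration$/.test(node.type)) {
        node.inScope = scopeId; // the scope this statement lives in
      }
      if (node.type === "Program" || /^Function/.test(node.type)) {
        node.scopeId = nextId++; // this node opens a new scope
        current = node.scopeId;
      }
    }
    for (const key of Object.keys(node)) visit(node[key], current);
  }
  visit(root, null);
  return root;
}

// Hand-built tree for `function f() { return; }`
const program = annotateScopes({
  type: "Program",
  body: [{
    type: "FunctionDeclaration",
    id: { type: "Identifier", name: "f" },
    params: [],
    body: { type: "BlockStatement", body: [{ type: "ReturnStatement", argument: null }] }
  }]
});

const fnDecl = program.body[0];
console.log(program.scopeId, fnDecl.inScope, fnDecl.scopeId); // 0 0 1
console.log(fnDecl.body.body[0].inScope);                      // 1
console.log(typeof JSON.stringify(program));                   // string
```

Scopes are referenced by number, not by object pointer, so the annotated tree has no cycles and still serializes with JSON.stringify.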

Thoughts?

Owner

michaelficarra commented Sep 28, 2013

@constellation: Agreed. And we should start discussing a CST format that can later be used by tools that operate on a program while preserving its syntax.

edit: I think we should try, as mentioned a few times above, to design the CST format in a way that all CSTs also comply with the AST spec. If it can be done, it would make the CST -> AST transformation the identity function. Unfortunately, a possible downside is that we may make it easier for people to forget which type of data they are working with.

Contributor

ariya commented Sep 28, 2013

@getify Think annotation-related tools, everything from documentation (jsdoc3/jsdoc#93) to coverage hint (gotwarlost/istanbul#15).

Owner

michaelficarra commented Sep 28, 2013

@getify:

As I said, the problem with the new example above is that it has an unnecessary set of ( )

That is not true. I purposely chose that example because it forced the grouping. If the parentheses are removed, the program would have different semantics.

getify commented Sep 28, 2013

@michaelficarra

That is not true. I purposely chose that example because it forced the grouping. If the parentheses are removed, the program would have different semantics.

OK, so I shouldn't have said "unnecessary". My point wasn't the "unnecessary" part, but that the ( ) set didn't end up with its own node to attach extras to, because of the way the AST parsing groups and simplifies. I thought it was clear that my point was about ( ) sets not ending up with their own AST node to attach to, given my discussion of the (((b))) expression example.

It's true that the ( ) set in your example signals a different execution semantic (that is, which expression is the new-expression, and which is the call-expression). It's an interesting example of operator precedence where the disambiguating ( ) swaps the order of binding the two expressions.

However, with and without the enclosing ( ) set, the AST structure is fundamentally the same type/magnitude of structure. It only differs in which expression node is the parent of the other. Compare the outputs of esprima here:

You can see in the original, the NewExpression is the parent and the CallExpression is the child, whereas in the other, the CallExpression is the parent and the NewExpression is the child. Otherwise, the ASTs are pretty much identical.

So, while they do have different execution semantics, they don't really have different AST structures for the purposes of figuring out where to attach "extras" meta-data, and as such, my original point remains: the AST is clearly insufficient, and we need a CST, specifically one that would keep a node representation in the tree for each ( ) set.


Moreover, this frames perfectly the point I keep trying to make: _we don't need to invent special positional properties_ like "beforeOuterOpeningParenInNewExpression" or silly stuff like that.

What we need, what's MUCH simpler to reason about and write code for, is a node in the CST that represents that outer ( ) set. Why? Because developers can insert unnecessary ( ) sets, and any tool which needs to preserve (not rewrite) code needs to know about them. And for the extras tracking issue, the "leading/trailing/between" property names are perfectly sufficient for representing the necessary extras locations.

getify commented Sep 28, 2013

@constellation

Controlling the format of generated code is an urgent need. So in the mean time, we should provide some solution.

I don't agree with this. Doing something "wrong but quick" is, I presume, how we got to partial solutions like leadingComments and trailingComments, which are complication that doesn't even suit the full use-cases.

I'm highly motivated to come to a standard solution for a tree representation (call it a CST I suppose) which both something like esprima/acorn could agree to output and which escodegen would be able to take and reproduce the code exactly as it was input. It's quite clear from the investigations in this thread that going directly to an AST loses information irreparably, and no amount of meta-data is going to fix that.

Also, it's a much bigger task to try and convince consumers of ASTs (like JS engines, for instance) to change what kind of AST they take. What we need, as was suggested earlier in the thread, is to have first a hook to get a CST, and then a hook to go from CST to AST, so that all other tools are free to consume which version of the tree makes sense for their task.

Once we decide that a CST is the proper approach, we're free from the "burden" of "oh no, that would break AST assumptions". We can design a CST appropriate for the task, and all we need to do is keep in mind that we need a transformation that turns a CST into an AST.

The easier we make that transformation, the better. But there should be no need to enforce the assumptions of ASTs on the design of a CST.

Introducing EmptyNode breaks the AST and tools. null|Node typed members exist widely in Parser AST spec and params.length is also widely used by tools...

Not an issue anymore now that we're designing for a CST which is somewhat orthogonal to the rules of ASTs. We should pick the best, most complete, and most simple solution for the CST design. We should ignore arguments which say "But, ASTs expect ____". Tools which expect strictly ASTs just need to make sure a CST is converted to an AST.

If we design the CST format smartly, transformation to AST should be trivial, and pretty much mostly just a removal of the "extra" stuff.

But I think this is solved by attaching more verbose members corresponding to the each position (e.g. extra.comments.afterFunctionKeyword etc), is it right?

No, I strongly disagree with that as well. The code is simpler, the mental model is simpler, and the CST freedom allows us to design by saying something like:

{
   type: "FunctionExpression",
   id: null,
   id_placeholder: {
      extras: ...
   },
   ...
}

OR

{
   type: "FunctionExpression",
   id: {
      extras: ...
   },
   ...
}

In either case, the simplicity of the code, and simplicity of the mental model, where the actual extras property names are always just "leading" "trailing" and "between" is, IMO, the compelling win.

In the vast majority of the grammar, save a FEW exceptions we've found in this thread, we don't need even special stuff like id_placeholder property to hang extras off of. In the few exception cases, it's a small non-breaking addition to store extras in a place that gives semantic meaning to their position.

I strongly prefer that to extras.inPlaceOfAnonymousFunctionIDPosition ...

Contributor

ariya commented Sep 29, 2013

"But given we've got a fair amount of context here, ..."

which IMHO is a disadvantage. Someone who just drops by and tries to digest the discussion will likely get the wrong impression. Also, there is a disconnect to the past history (Esprima's objective of supporting annotation on a best-effort basis, JSShaper prior art on comment attachment). Granted, I posted some links there but I don't think folks will follow it.

I would recommend starting immediately a discussion on js-tools, at the very least.

getify commented Sep 29, 2013

Well, the original intent of my thread was to make changes to the kind of tree that _escodegen_ could accept. So I think this was the right place to have that discussion. But I don't object to the notion that now it's a bigger idea and there are more general places to continue the discussion. I certainly don't think it's a disadvantage or mistake to have discussed it here.

Owner

Constellation commented Sep 29, 2013

@getify @michaelficarra @ariya

And we should start discussing a CST format that can later be used by tools that operate on a program while preserving its syntax.

Agreed. We should start discussing a CST format.

I don't agree with this. Doing something "wrong but quick" is, I presume, how we got to partial solutions like leadingComments and trailingComments, which are complication that doesn't even suit the full use-cases.

A CST is the proper way to solve this in the long term. Since there's no urgent need, I'd like to take that route.

I think the consensus is the following.

  • escodegen.generate(...) should accept only AST. AST should be abstract.
  • Introducing escodegen.generateFromCST. Generating code from CST.
  • Providing CST to AST API by some module.
  • Defining CST spec.
  • Prototyping the parser module that produces CST. Probably forked esprima.

Is it OK? If so, I'd like to start defining the CST format spec.

getify commented Sep 29, 2013

@constellation

escodegen.generate(...) should accept only AST. AST should be abstract. Introducing escodegen.generateFromCST. Generating code from CST.

I personally think this is an ill-advised approach. As far as I've been working through the escodegen code base, I can see that handling extras from a CST weaves itself quite intrusively throughout the entirety of the code generation stuff. So, to have separate end-points in your API for AST vs CST means you'd have to have significantly separate huge chunks of code, probably nearly doubling the size of escodegen, and also creating a ton of duplication of code making future maintenance a near-impossible nightmare.

If we can achieve a CST format that I think is easiest to code against (both in parsers generating the tree, and in consumption of the tree while code-generating) and also easiest to convert from CST to AST (by stripping out extras without radical restructuring), then the CST will IMO be nothing but a safe (backwards compatible) superset that escodegen can accept, and if you instead happen to pass the stricter/purer AST subset to escodegen, it'll work just fine too.

If you will permit, I've been patching escodegen along the way in this thread to do as I've been suggesting. Perhaps sharing the code changes I've made will make it clearer how escodegen can passively look for CST extras and use them if it finds them, and if not, respond accordingly just as if it was passed a completely standard pure AST.

I don't think we should want two entirely separate API end-points and code paths for this, but rather have the exceptions for "extras" be in-lined in the logic and be as unobtrusive as possible.

Providing CST to AST API by some module.

escodegen might provide such a module, but honestly I would think that esprima would be the better place for that. If esprima eventually is able to generate either a richer more complete CST, or a purer stripped down AST, then it will already have in it the requisite logic to "convert" (ie, strip down) a CST to standard AST. I imagine that's a fairly simple transform that traverses the tree, removes extras, and perhaps in a few places lightly simplifies the physical structure. But it shouldn't be anything radical.
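A minimal sketch of the kind of stripping transform described here. It assumes, purely for illustration, that all CST-only data lives under an `extras` property on each node; the function name `stripExtras` and the shape of the extras entries are hypothetical, not part of esprima or escodegen.

```javascript
// Hypothetical CST-to-AST stripper: returns a copy of the tree with every
// `extras` property removed. The input tree is left untouched.
function stripExtras(node) {
    if (Array.isArray(node)) {
        return node.map(stripExtras);
    }
    if (node === null || typeof node !== "object" || node instanceof RegExp) {
        return node; // names, operators and literal values pass through as-is
    }
    const out = {};
    for (const key of Object.keys(node)) {
        if (key !== "extras") {
            out[key] = stripExtras(node[key]);
        }
    }
    return out;
}

// `b + 2`, with the space after `b` recorded as a trailing extra (assumed shape).
const cst = {
    type: "BinaryExpression",
    operator: "+",
    left: {
        type: "Identifier",
        name: "b",
        extras: { trailing: [{ type: "Whitespace", value: " " }] }
    },
    right: { type: "Literal", value: 2, raw: "2" }
};

console.log(JSON.stringify(stripExtras(cst), null, 4));
```

The few places where the CST's physical structure differs from the AST would need extra cases on top of this traversal.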

Owner

Constellation commented Sep 30, 2013

@getify

I personally think this is an ill-advised approach. As far as I've been working through the escodegen code base, I can see that handling extras from a CST weaves itself quite intrusively throughout the entirety of the code generation stuff. So, to have separate end-points in your API for AST vs CST means you'd have to have significantly separate huge chunks of code, probably nearly doubling the size of escodegen, and also creating a ton of duplication of code making future maintenance a near-impossible nightmare.

I don't agree with that.
CST and AST should be layered. If we mix the implementation, it limits CST representation and it makes the generator code more complicated.

you'd have to have significantly separate huge chunks of code

I don't think so. If escodegen.generate creates a CST internally and passes it to escodegen.generateFromCST, there's no duplicated code.

getify commented Sep 30, 2013

@constellation

If escodegen.generate creates a CST internally and passes it to escodegen.generateFromCST, there's no duplicated code.

You can't generate a CST from an AST. You can generate an AST from a CST. This is strictly a one-way sort of transformation.

The CST contains extra information that an AST does not have. So if you have an AST passed into generate(), you have already lost the information that would be needed to make a CST to pass over to generateFromCST().

In other words, you can't take in an AST in one API call, turn it into a CST, and pass it to another API call. Either you have a CST (with all its extra data) or you don't. You can't add that extra data in after the fact, because you don't have that data anymore. It's been lost.

getify commented Sep 30, 2013

@constellation

there's no duplicated code.

What I meant by "duplicated code" is: if you insist on having one set of your internal code logic that operates strictly on the pure assumptions of ASTs, and another set of your internal code logic that operates on the different CST structure, you will have tons of extra code duplication.

Take a look at the patches I've been making to escodegen to get the existing code to optionally handle the CST extra data:

http://pastebin.com/nh8NwUdB

Particularly, look at changes starting around line 212 of the diff.

If we didn't make those changes (look how widespread they are) to make the current existing logic _also_ handle CSTs, but instead needed to keep the existing logic unchanged (to handle strict AST semantics), then we'd have to duplicate all of that code in separate functions to have the CST stuff in it.

I think the only practical option is to have the CST logic inlined with the AST logic, and have each branch test whether it's dealing with a node in the tree that needs CST processing, or AST processing. That's the approach you see in that diff I linked.
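To illustrate the "inlined" approach without reproducing the linked diff, here is a rough sketch for a single node type. The `extras.leading` / `extras.trailing` property names follow this thread; the extra `type` names (`Whitespace`, `LineComment`, `MultilineComment`) and both function names are assumptions made for the example.

```javascript
// Hypothetical helper: turn a list of extras into source text.
function renderExtras(list) {
    return (list || []).map(function (extra) {
        switch (extra.type) {
            case "Whitespace": return extra.value;
            case "LineComment": return "//" + extra.value + "\n";
            case "MultilineComment": return "/*" + extra.value + "*/";
            default: throw new Error("Unknown extra: " + extra.type);
        }
    }).join("");
}

// One shared code path: a plain AST node has no `extras`, so nothing is added.
function generateIdentifier(node) {
    const extras = node.extras || {};
    return renderExtras(extras.leading) + node.name + renderExtras(extras.trailing);
}

// Plain AST node.
console.log(generateIdentifier({ type: "Identifier", name: "foo" }));
// prints: foo

// Same node type carrying CST extras.
console.log(generateIdentifier({
    type: "Identifier",
    name: "foo",
    extras: {
        leading: [
            { type: "MultilineComment", value: " blah " },
            { type: "Whitespace", value: " " }
        ]
    }
}));
// prints: /* blah */ foo
```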

What this means is, escodegen can still accept ASTs as it currently does, of course. BUT, the difference will be, escodegen currently would have made strict AST assumptions about the tree it received, and those strict AST assumptions have to be relaxed, because the _one shared set_ of internal code logic (not the API) has to accept either an AST or a CST.


Now, it's a separate issue what the API looks like. You could have two separate API calls, but both API calls are going to have to call into that single set of internal logic. That existing set of logic, as I just explained/demonstrated, has to handle both CST and AST rules (to avoid massive code duplication).

SO... if there's two API calls that call into the same set of internal code logic, and that internal code logic self-switches between CST and AST processing as it sees the structure of nodes, then what's the point of the two calls? I could pass a CST to the AST API call, or an AST to the CST API call, and I'd still get all the same results.

In your suggestion, there would be no difference in the output from an inputted CST tree whether I called generate() or generateFromCST(). There also would be no difference in the output from an inputted AST tree whether I called generate() or generateFromCST(). This proves the two API calls are duplicates of each other, and thus unnecessary to have separate.

Does that make my point clearer?

Owner

Constellation commented Sep 30, 2013

@getify

You can't generate a CST from an AST. You can generate an AST from a CST. This is strictly a one-way sort of transformation.

No. If we provide good default format and options as escodegen currently does, we can create CST from AST. Actually escodegen already generates the code from AST. escodegen.generate will create the CST that corresponds to the currently generated code.

This should be kept, since users should basically use an AST rather than a CST for code analysis (a CST is much lower-level than an AST). An AST is a more appropriate format for code analysis tools: it is well abstracted and has no nodes that analysis doesn't need (such as dummy nodes). This is why code analysis tools use an AST as their IR, even in other languages.

A CST is quite different from an AST. I think my view differs from yours on this. A CST should carry all of the source code information. For example, a CST should record how a statement was terminated. (An AST doesn't have that information: was the statement terminated by a semicolon or by ASI?)

AST:

{
    type: 'ExpressionStatement',
    expr: {...}
}

CST:

{
    type: 'SemicolonTerminated',
    body: {
        type: 'ExpressionStatement',
        expr: { ... }
    }
}

If we simply handle both with the same logic, the CST's output becomes expr;;.
To support all such cases, we would need to add a lot of edge-case code, which complicates the generator.
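A toy sketch of how the doubled semicolon can arise. The node shapes follow the AST/CST example above (including its `expr` property); the generator itself is hypothetical and deliberately naive.

```javascript
// Naive shared logic: the AST branch always supplies the terminator itself,
// and the CST branch emits the semicolon that the wrapper node records.
function generateStatement(node) {
    switch (node.type) {
        case "ExpressionStatement":
            return node.expr.name + ";";
        case "SemicolonTerminated":
            return generateStatement(node.body) + ";";
        default:
            throw new Error("Unsupported node: " + node.type);
    }
}

const ast = {
    type: "ExpressionStatement",
    expr: { type: "Identifier", name: "expr" }
};
const cst = { type: "SemicolonTerminated", body: ast };

console.log(generateStatement(ast)); // prints: expr;
console.log(generateStatement(cst)); // prints: expr;;
```

Avoiding this means the `ExpressionStatement` branch has to know whether it is being called from inside a CST wrapper, which is the kind of edge case being described.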

What I meant by "duplicated code" is: if you insist on having one set of your internal code logic that operates strictly on the pure assumptions of ASTs, and another set of your internal code logic that operates on the different CST structure, you will have tons of extra code duplication.

That is only true if we don't create the CST from the AST.
If we generate the CST from the AST (just as escodegen currently constructs the generated code from an AST), we don't need two different sets of internal logic, so the diff becomes small.

Owner

michaelficarra commented Sep 30, 2013

Completely agreed with @constellation. I was just going to write pretty much the same comment.

getify commented Sep 30, 2013

@constellation @michaelficarra

Regardless of how similar or different the CST is from the AST, how is it even remotely possible to take an AST (which has been stripped of information) and re-construct the CST by magically adding back in the lost information (which you don't even know about) that represents the original code?

Consider this code:

var a = (b + 2);

That code produces this AST in esprima:

{
    "type": "Program",
    "body": [
        {
            "type": "VariableDeclaration",
            "declarations": [
                {
                    "type": "VariableDeclarator",
                    "id": {
                        "type": "Identifier",
                        "name": "a"
                    },
                    "init": {
                        "type": "BinaryExpression",
                        "operator": "+",
                        "left": {
                            "type": "Identifier",
                            "name": "b"
                        },
                        "right": {
                            "type": "Literal",
                            "value": 2,
                            "raw": "2"
                        }
                    }
                }
            ],
            "kind": "var"
        }
    ]
}

If you then put that AST into escodegen, it outputs this code:

var a = b + 2;

Do you see what happened? There was code (aka, data) in the original program, the enclosing ( ) set, that was _lost_ when the parsing conversion to AST happened. Taking the AST, at the point shown above, and trying to "reconstruct" what the CST would have been, will not get you back to the original program. The data that was permanently lost in this example is the enclosing ( ) set.

Other data which is lost when you go straight from code to an AST (instead of code to CST) are things like original whitespace, comments, etc.

So, if you hand a tree like the one above to escodegen, it is impossible for escodegen to add back in the data about the enclosing ( ) set that was previously lost data. How could escodegen possibly even know about that data? You don't provide original code to escodegen, you only provide it a tree. And if that tree has already had information loss, you can't somehow magically get that data back, because you don't even know what's missing.

Tree transformation is one-way.

By contrast, if esprima outputted a CST instead of an AST, the CST tree could look like this, and have encoded the non-lost data:

{
    "type": "Program",
    "body": [
        {
            "type": "VariableDeclaration",
            "declarations": [
                {
                    "type": "VariableDeclarator",
                    "id": {
                        "type": "Identifier",
                        "name": "a"
                    },
                    "init": {
                        "type": "ExpressionParenGrouping",   // <--- lookey here!
                        "value": {
                            "type": "BinaryExpression",
                            "operator": "+",
                            "left": {
                                "type": "Identifier",
                                "name": "b"
                            },
                            "right": {
                                "type": "Literal",
                                "value": 2,
                                "raw": "2"
                            }
                        }
                    }
                }
            ],
            "kind": "var"
        }
    ]
}

And then, if escodegen was given this CST tree, with info about the enclosing ( ), then of course it could reproduce the exact original program: var a = (b + 2);

You can take that CST and strip out the ExpressionParenGrouping node (nesting) data, and come up with the AST from above. But you cannot take the AST from above and magically come up with this CST. How could escodegen possibly know that the original code either did or did not have the enclosing ( ) in there? It couldn't.
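A small sketch of why the mapping only goes one way. `ExpressionParenGrouping` is the illustrative node type from the tree above; `unwrapParens` is a hypothetical helper name. Two different CSTs collapse to the same AST, so nothing in the AST can say which one it came from.

```javascript
// Hypothetical CST-to-AST step: replace each grouping node by its contents.
function unwrapParens(node) {
    if (Array.isArray(node)) {
        return node.map(unwrapParens);
    }
    if (node === null || typeof node !== "object") {
        return node;
    }
    if (node.type === "ExpressionParenGrouping") {
        return unwrapParens(node.value);
    }
    const out = {};
    for (const key of Object.keys(node)) {
        out[key] = unwrapParens(node[key]);
    }
    return out;
}

const sum = {
    type: "BinaryExpression",
    operator: "+",
    left: { type: "Identifier", name: "b" },
    right: { type: "Literal", value: 2, raw: "2" }
};
const withParens = { type: "ExpressionParenGrouping", value: sum }; // (b + 2)
const withoutParens = sum;                                          // b + 2

const fromParens = JSON.stringify(unwrapParens(withParens));
const fromPlain = JSON.stringify(unwrapParens(withoutParens));
console.log(fromParens === fromPlain); // prints: true
```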

Does that make it clear why you can only go from CST to AST, but not AST to CST?

Owner

michaelficarra commented Sep 30, 2013

@getify: Please read others' comments more carefully. @constellation said

If we provide good default format and options as escodegen currently does, we can create CST from AST

This means that we can use whatever syntactic form we think is best to represent that program. We create a CST that, when transformed to an AST, will generate the AST we used as input. Also, as @constellation says,

Actually escodegen already generates the code from AST

we already do this.

edit: removed some unnecessary detail so as not to confuse.

getify commented Sep 30, 2013

@michaelficarra you have not addressed how a parser (like esprima) could produce a tree for var a = (b + 2); that escodegen could turn back into, exactly var a = (b + 2);. If the tree removes any information about the ( ) set, how could any program add that information back in, when it doesn't even know it was removed?

Owner

michaelficarra commented Sep 30, 2013

Quoting myself from (way) above...

I'd argue that esprima/acorn/whatever should generate CSTs (and optionally ASTs). escodegen can operate on either one.

"(and optionally ASTs)" means we just run the result through a CST->AST transformation.

Owner

Constellation commented Sep 30, 2013

@getify

@michaelficarra you have not addressed how a parser (like esprima) could produce a tree for var a = (b + 2); that escodegen could turn back into, exactly var a = (b + 2);. If the tree removes any information about the ( ) set, how could any program add that information back in, when it doesn't even know it was removed?

In that case, by having the parser create a CST and passing it to escodegen.generateFromCST, we can get var a = (b + 2); :)
The AST is not intended for that case, since the problem above is about code format (not semantics). In that case, a CST is the proper choice.

If we'd like to control the format of the generated code, we should use a CST. But if we'd like to analyze the semantics of the code, we should use an AST.
If we'd like to analyze the code semantics, there's no problem if the result becomes var a = b + 2; since the result is semantically equal to the original code.

getify commented Sep 30, 2013

Let me see if I get this straight:

  1. escodegen.generate() takes an AST as its input. What it does, by default is make some _guesses_ at extra data it can add in, like default spacing/indentation, etc, the way escodegen currently works. The way it will now do this is to produce a faked/guessed CST.
  2. Then, escodegen passes that faked/guessed CST over to escodegen.generateFromCST().
  3. escodegen.generateFromCST() takes a CST as its input, and it produces code. It can accept a faked/guessed CST from generate(), OR it can accept a real full faithful original-program-reproducing CST with all the real original syntax data, OR it can take a plain-ol-stripped-down-AST. Either way, it does its best to just output that tree as-received, not adding in ANY (non-required) extra formatting, itself.

Does that accurately represent what @constellation and @michaelficarra think is the best path forward?

getify commented Sep 30, 2013

Assuming for a moment that I understand (from my previous comment), let me make a few points:

  • For (1) above, I understand that one of the reasons you do all this faked/guessed data addition (formatting) is because some people/tools are in the habit of sending in a pure AST to escodegen and wanting a human-friendly-readable representation of the code outputted.

    IMO, I don't think that should be a primary, or assumed, task of escodegen. There are "pretty printer" tools already. So it seems like duplicative, to me, to have escodegen doing the pretty-printing part. To me, the important part of escodegen is that it actually reconstructs all the JS grammar, _not_ that it can make pretty indentations.

    But that's just my opinion. I get it that you want to keep that. Let's just be clear that it's a separate and orthogonal task from actual grammar-reconstruction code generation, and that moreover it's separate and orthogonal from the stuff that I've been asking for. Can we agree on that?

  • So, for (2) above, this is just one of the ways you can think about the API. You could, instead, do this:

    • autoFormatASTintoCST(..): which takes a pure AST and adds in, according to config options, guessed/fake formatting data like spacing/indentation, etc. It strictly returns another tree, which is a CST inferred from the AST.
    • generate(..): takes a CST. Assumes no automatic formatting to be added, and expects the CST itself to provide any and all data it should use for formatting. The only stuff it will add "automatically" are grammatically required whitespace, if no other such CST formatting meta-data is present for that exact location. Otherwise, it will only use what it finds in the tree.

    The benefit of this approach is that you can pass either an AST or a CST to generate(), and you'll get out the code appropriate for what kind of tree you passed in. But you don't have to learn different API calls for generating, one which auto-formats and one which doesn't.

    If you additionally want some extra auto-formatting added to your AST, you can be explicit about it like: escodegen.generate( escodegen.autoFormatASTintoCST( mytree ) ). I prefer auxiliary behavior (the auto-formatting) not to be implicit but to be explicit. Just my opinion though.

    The only "downside" to this approach is that anyone who's currently doing escodegen.generate(myAST) and expecting the automatic pretty-printing will have to add in the explicit formatting call. Again, just my opinion, but I think explicit vs. implicit API semantics is more than enough justification for that change.
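A toy sketch of the two-function split proposed in the bullets above, limited to identifiers, literals and binary expressions. The names `autoFormatASTintoCST` and `generate` come from the proposal; the extras shape is assumed, and this version ignores operator precedence, parentheses and grammatically required whitespace.

```javascript
// Hypothetical formatter: returns a new tree with guessed spacing attached.
function autoFormatASTintoCST(node) {
    if (node.type !== "BinaryExpression") {
        return Object.assign({}, node);
    }
    const left = autoFormatASTintoCST(node.left);
    const right = autoFormatASTintoCST(node.right);
    // Guessed formatting: one space on each side of the operator.
    left.extras = { trailing: [{ type: "Whitespace", value: " " }] };
    right.extras = { leading: [{ type: "Whitespace", value: " " }] };
    return Object.assign({}, node, { left: left, right: right });
}

function renderExtras(list) {
    return (list || []).map(function (extra) { return extra.value; }).join("");
}

// Generator that adds no formatting of its own: it only emits what the tree holds.
function generate(node) {
    const extras = node.extras || {};
    let body;
    switch (node.type) {
        case "Identifier": body = node.name; break;
        case "Literal": body = node.raw; break;
        case "BinaryExpression":
            body = generate(node.left) + node.operator + generate(node.right);
            break;
        default: throw new Error("Unsupported node: " + node.type);
    }
    return renderExtras(extras.leading) + body + renderExtras(extras.trailing);
}

const ast = {
    type: "BinaryExpression",
    operator: "+",
    left: { type: "Identifier", name: "b" },
    right: { type: "Literal", value: 2, raw: "2" }
};

console.log(generate(ast));                       // prints: b+2
console.log(generate(autoFormatASTintoCST(ast))); // prints: b + 2
```

The same `generate` call handles both trees; the only difference is whether the formatting step was applied first.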

Owner

Constellation commented Sep 30, 2013

@getify @michaelficarra

Thanks for the clarification :)

  1. escodegen.generate() takes an AST as its input. What it does, by default is make some guesses at extra data it can add in, like default spacing/indentation, etc, the way escodegen currently works. The way it will now do this is to produce a faked/guessed CST.

Basically right. But to keep the role (escodegen.generate(AST) -> code), I'm planning to provide a function, escodegen.convertASTToCST. And I think escodegen.generate implementation will become conceptually the following.

function generate(ast) {
    return generateFromCST(convertASTToCST(ast));
}

escodegen.convertASTToCST takes an AST and returns a faked/guessed CST.

OR it can take a plain-ol-stripped-down-AST. Either way, it does its best to just output that tree as-received, not adding in ANY extra formatting, itself.

This functionality is not necessary. We can just use escodegen.generate(ast) and it works correctly :)

getify commented Sep 30, 2013

@constellation

Can we agree that the tree which is produced by convertASTToCST(..) will produce a CST with those extra guessed formattings encoded in the same mechanisms that a real/faithful CST would use to represent the original program contents?

That is, that it would likely use something like extras.leading... as has been discussed in this thread, rather than inventing a whole different way to encode the guessed formatting data into a CST?

Owner

Constellation commented Sep 30, 2013

@getify

IMO, I don't think that should be a primary, or assumed, task of escodegen. There are "pretty printer" tools already. So it seems like duplicative, to me, to have escodegen doing the pretty-printing part. To me, the important part of escodegen is that it actually reconstructs all the JS grammar, not that it can make pretty indentations.

escodegen's primary task is generating valid & semantically equal code from AST.
This is a difficult job. For example, if you don't insert a space between the regexp /reg/ and the in operator, the code breaks.
If you are a CST producer, you should handle this yourself (of course, escodegen will throw an error early to help you). Since you control the whitespace, escodegen.generateFromCST cannot decide which token should be inserted there. (tab? space? 2 spaces? etc.)
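A simplified sketch of the required-whitespace check, to make the /reg/ and in case concrete. This is not escodegen's actual code: the helper names and the ASCII-only identifier test are assumptions for the example.

```javascript
// ASCII-only approximation of "can this character be part of an identifier?"
function isIdentifierPart(ch) {
    return /[A-Za-z0-9_$]/.test(ch);
}

// Join two code fragments, inserting a space only where the tokens on either
// side of the seam would otherwise merge or change meaning.
function join(left, right) {
    const l = left.charAt(left.length - 1);
    const r = right.charAt(0);
    const wouldMerge =
        (isIdentifierPart(l) && isIdentifierPart(r)) || // `typeof` + `x`
        ((l === "+" || l === "-") && l === r) ||        // `a +` + `+b`
        (l === "/" && r === "i");                       // regex literal + `in` / `instanceof`
    return wouldMerge ? left + " " + right : left + right;
}

console.log(join("/reg/", "in obj")); // prints: /reg/ in obj
console.log(join("a", "+b"));         // prints: a+b
```

Without the inserted space, `/reg/in obj` is read as a regular expression with the flags `in`, which is a syntax error.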

escodegen's core principle is that parse(generate(AST)) is structurally equal to the original AST.

So I think this task belongs to escodegen.

Personally I think escodegen.generateFromCST can be extracted from escodegen.

Can we agree that the tree which is produced by convertASTToCST(..) will produce a CST with those extra guessed formattings encoded in the same mechanisms that a real/faithful CST would use to represent the original program contents?

That's right!

That is, that it would likely use something like extras.leading... as has been discussed in this thread, rather than inventing a whole different way to encode the guessed formatting data into a CST?

But it cannot represent some concrete syntax information (such as parentheses and semicolons) as a meaningful structure, correct?

Arriving super late to the party, but yes CST, awesome idea!
Indeed lots of fine-grain styling consistency things aren't present in the AST (spaces here and there, position of brackets, etc.), so tools can't alert on them.

I haven't read all the debates on the API side of things but here are a few elements I find relevant: the CST contains all the information of the AST, but not the other way around. Ideally, that could mean that the AST be a subset of the CST. The problem is that the AST was defined long ago and a bunch of tooling relies on it, so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here). And the reward of breaking things doesn't seem worth it in that instance.

Beyond backward compat, given that CST contains more info, one could wonder what is the point of AST after CST... But let's make CST happen.

Is it OK? If so, I'd like to start defining the CST format spec.

I guess it starts with the following principle:
There exists 2 functions sourceToCST(src: string) : CST and CSTToSource(cst: CST) : string. CST is a tree-shaped object. Object properties can be source fragments containing at most one token and for any syntactically valid JavaScript string src, we have CSTToSource(sourceToCST(src)) === src
(how these functions are exposed in libraries is out of scope for the CST spec ;-) )
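A deliberately trivial sketch of the round-trip property only. The splitter below is not a JavaScript parser and does not honor the "at most one token per fragment" constraint; it just keeps every character of the input in some leaf, which is what makes the identity hold. The `children` and `raw` property names are assumptions.

```javascript
// Toy "parser": split the source into runs of whitespace and non-whitespace,
// keeping every character in some leaf of the tree.
function sourceToCST(src) {
    const fragments = src.match(/\s+|\S+/g) || [];
    return {
        type: "Program",
        children: fragments.map(function (raw) {
            return { type: /^\s/.test(raw) ? "Whitespace" : "Chunk", raw: raw };
        })
    };
}

// Toy "generator": concatenate the leaves in order.
function CSTToSource(cst) {
    return cst.children.map(function (child) { return child.raw; }).join("");
}

const src = "var a = (b + 2);  // keep me\n";
console.log(CSTToSource(sourceToCST(src)) === src); // prints: true
```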

Owner

michaelficarra commented Sep 30, 2013

so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here)

It depends on what you mean by superset, but if you mean current tools will be able to treat a CST as an AST, I can see at least one way to do it by just adding optional properties to AST nodes:

  • Add an additional list property containing 0 or more whitespace/comment fragments; one property for each position whitespace/comments are allowed
  • Add a parens list property to each node that contains 0 or more objects which each contain the syntactic information (whitespace/comment fragments) for a surrounding pair of parentheses. They can be listed outside-in or inside-out, it probably doesn't matter.
  • Add a property to each type of node that supports trailing semicolons, indicating whether one is used in the representation.
  • edit: Also, we need something equivalent to escodegen's current "verbatim" support on Literal nodes to specify how numbers/strings are represented.

I think that should be sufficient. @constellation: what do you think?
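A rough sketch of what the optional-property idea could look like for two node types. The property names `parens` and `semicolon` and both helper functions are hypothetical; the point is only that a node without them still generates the same code as before.

```javascript
// Wrap code in one pair of parentheses per entry in the optional `parens` list.
function wrapParens(node, code) {
    return (node.parens || []).reduce(function (acc) {
        return "(" + acc + ")";
    }, code);
}

function generate(node) {
    switch (node.type) {
        case "Identifier":
            return wrapParens(node, node.name);
        case "ExpressionStatement":
            // Absent property: keep the usual behavior and emit the semicolon.
            return generate(node.expression) + (node.semicolon === false ? "" : ";");
        default:
            throw new Error("Unsupported node: " + node.type);
    }
}

// Plain AST: no optional properties present.
const plain = {
    type: "ExpressionStatement",
    expression: { type: "Identifier", name: "a" }
};

// Same statement with concrete-syntax properties added.
const annotated = {
    type: "ExpressionStatement",
    semicolon: false, // statement relied on ASI
    expression: {
        type: "Identifier",
        name: "a",
        parens: [{}, {}] // two surrounding pairs; entries could hold whitespace/comments
    }
};

console.log(generate(plain));     // prints: a;
console.log(generate(annotated)); // prints: ((a))
```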

getify commented Oct 1, 2013

@constellation

...So I think this task belongs to escodegen.

What you've described is indeed a critical function of escodegen, that it needs to produce valid code from the tree. This means that in certain places, it does need to add in whitespace between tokens, such as between /regex/ and in.

However, that's a separate and orthogonal task from escodegen adding in extra pretty-printing whitespace like with indentation and such. Looking through the bits of escodegen that handle any sort of whitespace insertion, about 10% of the code (rough estimate) is for required whitespace for grammar validity, and the other 90% of the code that deals with whitespace insertion is dealing with fancier pretty-printing things like adding base indentation to lines of code, adjusting the indentation level of multi-line comments, and other such things.

It's just my opinion, but I see the second of these tasks, the pretty-printing part, as not necessarily part of the core primary task of escodegen, while undeniably the first task (ensuring grammar validity with necessary whitespace) as definitely primary.

You could save/remove some of the code/complexity in the current code base if you elected to say that pretty printing stuff was not in escodegen but could be done by some other tool which produces a CST. Such a pretty-printer tool could just take an AST, make a CST out of it, and add the proper whitespace/indentation annotations to the CST, such that when escodegen processes that CST, it will produce the desired pretty-printed code, but escodegen wouldn't have to do the part of actually figuring out the pretty-printing parts.

Just my 2 cents. It's a feature I don't care about either way, because the tool I'm building will supply all of the spacing info via CST to escodegen.

But it cannot represent some concrete syntax information (such as parentheses and semicolons) as a meaningful structure, correct?

I absolutely think the CST can (and should!) represent those things. ASTs have no need for them, CSTs absolutely do have some usage for them (such as reconstructing original code -- my main use case).

getify commented Oct 1, 2013

@DavidBruant @michaelficarra

so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here)
...
...if you mean current tools will be able to treat a CST as an AST...

This is the opposite of what I mean by compatibility between AST and CST. I don't mean, or intend in my suggestions, for a CST to be treated like an AST. I think that's a bad goal that hand-ties us too rigidly in our design of a CST.

I mean: an AST can be treated like a CST (aka, backwards compatibility preserved that escodegen will be able to keep accepting the ASTs it always did), though an AST is a particularly "dumb" CST in that it's missing all the actual "concrete" stuff.

To put it more plainly, I think escodegen is one of the only tools that would care about consuming a CST. Maybe a few others, such as escope. Most other tools would only care to receive an AST. For them, they don't have to change at all just because we're declaring an optional intermediary CST format. We just need for one of the tools to have a CST-to-AST stripper/converter, and we're fine there.

That leaves parsers and other tree generators to consider. The idea would be that they would be able to either output an AST (that is, ignore the extra stuff while they parse/tree-build), or a CST (keep all the extra stuff), and the user of that tool would decide which tree they want, and how to use the tree once they get it.


To specifically address your question, @DavidBruant, I believe I've proven that a CST structure can look an awful lot like an AST, that is, as a superset of an AST, and yet still handle all the complexities of concrete syntax representation.

The vast majority of the grammar allows for whitespace and comments to be annotated using the suggested "extras" implementation I've shown here in this thread. There are a few exceptions where the "extras" have to be added to the tree in a special (aka, non-standard-AST) way/location, but those few exceptions only create a 1-to-1 structure in the CST which can easily be stripped in the transformation of a CST back to an AST. Things such as ( ) sets wrapped around expressions (either as extra/unnecessary or as affecting of precedence rules) are also easily added into the superset CST structure, and again, easily removed to create a valid standard AST.

Owner

Constellation commented Oct 5, 2013

@getify @michaelficarra @DavidBruant

@DavidBruant wrote:

I guess it starts with the following principle:
There exists 2 functions sourceToCST(src: string) : CST and CSTToSource(cst: CST) : string. CST is a tree-shaped object. Object properties can be source fragments containing at most one token and for any syntactically valid JavaScript string src, we have CSTToSource(sourceToCST(src)) === src
(how these functions are exposed in libraries is out of scope for the CST spec ;-) )

Right. CST should represent the source code as is.

I haven't read all the debates on the API side of things but here are a few elements I find relevant: the CST contains all the information of the AST, but not the other way around. Ideally, that could mean that the AST be a subset of the CST. The problem is that the AST was defined long ago and a bunch of tooling relies on it, so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here). And the reward of breaking things doesn't seem worth it in that instance.

Personally, I think only users who pay attention to formatting should consider the CST, and we need to keep the AST world simple. So I would not like to assume an inheritance relationship between AST and CST. I think CST and AST should be converted to each other explicitly by some function.

@getify wrote:

This is the opposite of what I mean by compatibility between AST and CST. I don't mean, or intend in my suggestions, for a CST to be treated like an AST. I think that's a bad goal that hand-ties us too rigidly in our design of a CST.

Agreed. CST should not be treated like an AST.

I mean: an AST can be treated like a CST (aka, backwards compatibility preserved that escodegen will be able to keep accepting the ASTs it always did), though an AST is a particularly "dumb" CST in that it's missing all the actual "concrete" stuff.

I don't agree with this. When an API takes a CST, that API cannot take an AST.
If an API takes CST|AST, problems arise. If an API takes a CST, we should explicitly convert the AST to a CST before calling it.
Maybe the CST format will be very similar to the AST. But I think that we should not assume that an AST can be treated as a CST.

To put it more plainly, I think escodegen is one of the only tools that would care about consuming a CST. Maybe a few others, such as escope. Most other tools would only care to receive an AST. For them, they don't have to change at all just because we're declaring an optional intermediary CST format. We just need for one of the tools to have a CST-to-AST stripper/converter, and we're fine there.

Right. If the CST adds dummy nodes, essentially no AST tool can handle a CST. The CST is different from the AST. AST tools should not have to consider the CST, since it is a very low-level IR. Only users who care about formatting should consider the CST.

Owner

Constellation commented Oct 5, 2013

@getify

You could save/remove some of the code/complexity in the current code base if you elected to say that pretty printing stuff was not in escodegen but could be done by some other tool which produces a CST. Such a pretty-printer tool could just take an AST, make a CST out of it, and add the proper whitespace/indentation annotations to the CST, such that when escodegen processes that CST, it will produce the desired pretty-printed code, but escodegen wouldn't have to do the part of actually figuring out the pretty-printing parts.

How about parentheses?

I don't want to break the current functionality, since I think AST tools should not have to consider the CST, and escodegen provides a way to generate valid code from an AST. That keeps the world of AST tools simple. The current functionality, escodegen.generate(AST, option) -> code, must be kept.

So if you'd like to split escodegen's tasks into smaller ones, escodegen would provide two functionalities: inserting required spaces (and required parentheses), and pretty printing. But personally I think they cannot be split, since both tasks need to know about all of the inserted spaces.
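To illustrate the "required parentheses" part: a plain AST has no node for parentheses, so a generator has to derive them from operator precedence. The following is a minimal sketch of that general technique, not escodegen's actual code; the `PRECEDENCE` table and the `generate` function here are hypothetical and cover only identifiers and four binary operators.

```javascript
// Hypothetical precedence table; only the relative order of the numbers matters.
const PRECEDENCE = { "+": 13, "-": 13, "*": 14, "/": 14 };

// Emit compact code for a tiny subset of AST nodes, adding parentheses
// only where the tree shape would otherwise be lost.
function generate(node, parentPrecedence = 0) {
  if (node.type === "Identifier") {
    return node.name;
  }
  if (node.type === "BinaryExpression") {
    const precedence = PRECEDENCE[node.operator];
    // These operators are left-associative, so the right operand needs
    // parentheses even at equal precedence (hence precedence + 1).
    const code =
      generate(node.left, precedence) +
      node.operator +
      generate(node.right, precedence + 1);
    return precedence < parentPrecedence ? "(" + code + ")" : code;
  }
  throw new Error("unsupported node type: " + node.type);
}

const id = (name) => ({ type: "Identifier", name });
const bin = (operator, left, right) => ({ type: "BinaryExpression", operator, left, right });

const product = generate(bin("*", bin("+", id("a"), id("b")), id("c")));
const nested = generate(bin("-", id("a"), bin("-", id("b"), id("c"))));
const chained = generate(bin("-", bin("-", id("a"), id("b")), id("c")));
```

Here `product` comes out as `(a+b)*c`, `nested` as `a-(b-c)`, and `chained` as `a-b-c`. The parentheses are a function of the tree alone, which is why they can be produced without any formatting information.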

getify commented Oct 5, 2013

@constellation

We all seem to have slightly different ideas of what the ideal CST is. I don't know exactly how to resolve the differing views.

However, on a positive note, @michaelficarra @puffnfresh and myself got a chance to chat in-depth over drinks while we've been here at the NationJS conference. I think we have a plan to move forward on making the decisions, which involves a google hangout online meeting with all the interested players. I am going to try to set something like that up over the next couple of weeks. Timezones are going to be our enemy in that, but I'll do my best. Stay tuned. :)

fpirsch commented Oct 6, 2013

@getify

We all seem to have slightly different ideas of what the ideal CST is. I don't know exactly how to resolve the differing views.

My idea about it seems pretty close to yours ;-)
I like very much the idea of the CST being an "enriched AST". Like you said, currently the trees used in the escode* family are already somewhere between a pure AST and a full-blown CST. They already have optional non-semantic information about locations, comments, ranges, tokens.

So the cool idea here would (could) be to rethink and refactor this extra information into something more unified and better structured which would allow tools to rebuild the exact source code from which the CST was produced. And to keep the current model of an AST with a few more properties attached to its nodes.

Or, if some people (and tools) feel really attached to the academic purity of ASTs, maybe we could consider having two separate data structures: a pure AST for the semantics, and a Whatever-T for the presentation information. (Kind of) like HTML and CSS separately represent the semantic and presentation aspects of a page.
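One way to picture the two-structure idea is a side table that maps AST node objects to their presentation data, leaving the AST itself untouched. This is only a sketch of one possible design, not something proposed in this thread; the `presentation` table, the `presentationOf` helper, and the `insideArguments` slot name are all hypothetical.

```javascript
// Side table: AST node object -> presentation info (hypothetical design).
const presentation = new WeakMap();

// Plain AST for the source text `foo(/* blah */)`.
const callee = { type: "Identifier", name: "foo" };
const call = { type: "CallExpression", callee: callee, arguments: [] };

// The comment lives outside the AST, keyed by the node it belongs to.
// `insideArguments` is a hypothetical slot name.
presentation.set(call, {
  insideArguments: [{ type: "MultilineComment", value: " blah " }]
});

// Look up presentation info; nodes without an entry get an empty record.
function presentationOf(node) {
  return presentation.get(node) || {};
}

const commentText = presentationOf(call).insideArguments[0].value;
```

The trade-off is that the table is keyed by object identity, so a tool that clones or rebuilds nodes would lose the association unless it also copies the entries.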

Owner

michaelficarra commented Oct 8, 2013

As Kyle says, we got a chance to speak about this in person and really understand each other's perspectives. The discussion came down to one basic point: we've each defined isomorphic JS CSTs, each with their own pros and cons. I will attempt to summarise the two proposals and their pros/cons here. If we find that we want to collaborate on this, we can either move this to the wiki or a shared google doc.

Proposal: CST By Extending/Annotating AST

I've previously described this format above as a possibility in response to @DavidBruant. I will copy it here for convenience. Essentially, we are just adding optional properties to AST nodes:

  • Add an additional list property containing 0 or more whitespace/comment fragments; one property for each position whitespace/comments are allowed
  • Add a parens list property to each node that contains 0 or more objects which each contain the syntactic information (whitespace/comment fragments) for a surrounding pair of parentheses. They can be listed outside-in or inside-out, it probably doesn't matter.
  • Add a property to each type of node that supports trailing semicolons, indicating whether one is used in the representation.
  • Something equivalent to escodegen's current "verbatim" support on Literal nodes to specify how numbers/strings are represented.
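As a rough illustration of the four additions above, here is what an annotated node might look like for the source text `/* sum */( a+0x10 );`. Every CST property name in this sketch (`leadingExtras`, `parens`, `afterOpen`, `beforeClose`, `semicolon`, `verbatim`) is a placeholder chosen for the example, since the proposal does not fix names.

```javascript
// Hypothetical CST for the source text `/* sum */( a+0x10 );`.
// Only type/expression/operator/left/right/name/value are standard AST
// properties; every other property name is a placeholder.
const stmt = {
  type: "ExpressionStatement",
  leadingExtras: [{ type: "MultilineComment", value: " sum " }],
  semicolon: true, // a trailing semicolon is present in the source
  expression: {
    type: "BinaryExpression",
    operator: "+",
    left: { type: "Identifier", name: "a" },
    right: { type: "Literal", value: 16, verbatim: "0x10" },
    // One entry per surrounding pair of parentheses, each carrying the
    // whitespace/comment fragments found just inside that pair.
    parens: [
      {
        afterOpen: [{ type: "Whitespace", value: " " }],
        beforeClose: [{ type: "Whitespace", value: " " }]
      }
    ]
  }
};

// An AST-only consumer: it looks for Identifier nodes and never needs to
// know that the extra properties exist.
function identifierNames(node, names = []) {
  if (Array.isArray(node)) {
    node.forEach((child) => identifierNames(child, names));
  } else if (node !== null && typeof node === "object") {
    if (node.type === "Identifier") {
      names.push(node.name);
    }
    for (const key of Object.keys(node)) {
      identifierNames(node[key], names);
    }
  }
  return names;
}

const names = identifierNames(stmt);
```

The standard properties are unchanged, so `names` contains just `"a"` whether or not the annotations are present.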

Pros

  • current tools will be able to accept a CST
  • syntax-agnostic transformations will preserve syntax when passed a CST; don't have to have two code paths in AST tools
  • do not have to traverse the tree to convert between CST/AST
  • escodegen can simply treat any input as a partially-filled-in CST and enrich it with defaults to create a full CST before rendering it
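The last point, treating any input as a partially-filled-in CST, could look roughly like the sketch below. It is not escodegen code: `fillDefaults` is a hypothetical helper, and the property names (`leadingExtras`, `trailingExtras`, `parens`, `semicolon`) and the chosen defaults are assumptions made for the example.

```javascript
// Hypothetical names for the optional CST list properties.
const EXTRAS_KEYS = ["leadingExtras", "trailingExtras", "parens"];

// Fill in missing CST properties in place, so a plain AST (or a partially
// annotated one) becomes a complete CST before rendering.
function fillDefaults(node) {
  if (Array.isArray(node)) {
    node.forEach(fillDefaults);
    return node;
  }
  // Only objects with a string `type` are treated as nodes; primitives,
  // RegExp values and location objects are left alone.
  if (node === null || typeof node !== "object" || typeof node.type !== "string") {
    return node;
  }
  for (const key of Object.keys(node)) {
    // Do not descend into the annotations themselves.
    if (!EXTRAS_KEYS.includes(key)) {
      fillDefaults(node[key]);
    }
  }
  for (const key of EXTRAS_KEYS) {
    if (!Array.isArray(node[key])) {
      node[key] = []; // default: no whitespace, comments or parentheses
    }
  }
  if (node.type === "ExpressionStatement" && node.semicolon === undefined) {
    node.semicolon = true; // default: emit the semicolon
  }
  return node;
}

// A partially annotated input: only the identifier carries any extras.
const input = {
  type: "ExpressionStatement",
  expression: {
    type: "Identifier",
    name: "a",
    leadingExtras: [{ type: "Whitespace", value: " " }]
  }
};
const full = fillDefaults(input);
```

Existing annotations are kept as they are and only the gaps are filled, so a plain AST and a fully annotated CST can go through the same rendering path.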

Cons

  • slightly harder to reason with, as properties require more logic to interpret than separate nodes

Proposal: CST With Structural Syntactic Forms

This proposal also remains close to the AST specification. In this proposal, new syntactic node types are added to the AST spec, as well as properties containing syntactic information.

  • Add a ParenthesisedExpression node, representing a parenthesised expression in expression position. For parentheses in statement position, this node can be wrapped in an ExpressionStatement.
  • Add an optional extras property to each node (@getify: please clarify exactly how this is represented). I believe this includes whitespace/comment/semicolon information all in one property.

Pros

  • new tools that operate on CSTs should have a slightly easier job reasoning about the syntax represented by the tree
  • parsers may have an easier time generating this format, but more evidence is needed to determine this

Cons

  • need a single-pass transform between AST/CST
  • tools that operate only on ASTs will need either a separate code path or two transformations at their interface (which would lose all syntactic information)
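The single-pass transform mentioned in the cons could be sketched as follows for the CST-to-AST direction. This is a sketch under assumptions: the proposal does not say which property of a ParenthesisedExpression holds the wrapped expression (this example assumes `expression`), `extras` is taken to be the only other CST-specific property, and `toAst` is a hypothetical name.

```javascript
// Convert a structural CST into a plain AST in one traversal:
// unwrap ParenthesisedExpression nodes and drop `extras` properties.
// The input is not modified; a new tree is returned.
function toAst(node) {
  if (Array.isArray(node)) {
    return node.map(toAst);
  }
  if (node === null || typeof node !== "object" || node instanceof RegExp) {
    return node; // primitives and RegExp literal values pass through
  }
  if (node.type === "ParenthesisedExpression") {
    // Assumption: the wrapped expression lives in `expression`.
    return toAst(node.expression);
  }
  const out = {};
  for (const key of Object.keys(node)) {
    if (key !== "extras") {
      out[key] = toAst(node[key]);
    }
  }
  return out;
}

// Hypothetical CST for `(a + b) * c`.
const cst = {
  type: "BinaryExpression",
  operator: "*",
  left: {
    type: "ParenthesisedExpression",
    expression: {
      type: "BinaryExpression",
      operator: "+",
      left: { type: "Identifier", name: "a" },
      right: { type: "Identifier", name: "b" }
    }
  },
  right: {
    type: "Identifier",
    name: "c",
    extras: [{ type: "Whitespace", value: " " }]
  }
};
const ast = toAst(cst);
```

After the transform, `ast.left` is the inner `+` expression directly, which is what an AST-only tool expects. The parentheses and whitespace are gone from the result, which is the information loss the second con refers to.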

Obviously, the pros/cons lists are incomplete and weighted toward the format I'm more familiar with. We should collaborate to fill them out completely so the community can make an informed decision about the CST format we should use going forward.

One point that was brought up is that we could always have both formats. Technically, this has all of the pros of both proposals, with the only additional con being that users have to be more conscious of which format they are using and which format their tools support. We also discussed and all agreed that esprima should have an interface for producing a CST.

Hopefully this is a good starting point for continuing discussion. @constellation: If you prefer, we can move this to the wiki or some other collaboration tool. Please comment with more pros/cons for each of these formats!

Owner

Constellation commented Oct 8, 2013

@getify, @michaelficarra

but I'll do my best. Stay tuned. :)

Great great great!
Thanks for your clarification. Great. Currently I'm busy with my research (sorry), but I'll surely read it tomorrow and reply!

Owner

Constellation commented Oct 10, 2013

@getify @DavidBruant @michaelficarra

Hopefully this is a good starting point for continuing discussion. @constellation: If you prefer, we can move this to the wiki or some other collaboration tool. Please comment with more pros/cons for each of these formats!

Very nice. Moving this to wiki is quite reasonable. This clarification helps us a lot.
I've read this and personally I think this covers necessary points.

Does anyone have comments about this clarification?

Owner

Constellation commented Oct 25, 2013

Created wiki page. Feel free to edit it.
https://github.com/Constellation/escodegen/wiki/CST-Proposals

getify commented Oct 25, 2013

I've been traveling at confs for the past 2 weeks solid, sorry about my delay in responding. Hopefully will pick up the task of getting together an online meetup about this so we can figure out how to move forward.

jedmao commented Feb 28, 2014

I would very much also like to get access to CST information; namely, whitespace. It has been a few months now. @getify, do you know what the next step is?

getify commented Feb 28, 2014

I feel badly that I got very busy with other things and never organized the discussion meeting to talk about CST formats. We really need to circle back and do that. I'll try to find some time soon to renew the effort.

getify commented Mar 18, 2014

OK, that took way too long. But I'm finally doing something "concrete" about this.

Please go see this if you are interested in helping define a standard CST:

https://github.com/getify/concrete-syntax-tree

getify closed this May 8, 2014

dashed referenced this pull request in babel/babel Feb 3, 2015

Closed

Implement better comment attachment algorithm #672
