Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support additional meta-data (such as whitespace and comments) on a CST instead of the AST #133

Closed
wants to merge 2 commits into from

Conversation

getify
Copy link

@getify getify commented Sep 22, 2013

I have need for the ability to attach extra information to the standard AST. The extra information is whitespace and comments that appear before (leading) or after (trailing) any node in the AST (such as Identifier, Expression, etc).

NOTE: I have implemented this (see the attached PR) in such a way that only AST's which have this extra data/nodes in them will activate the new behavior, so all existing usages of existing standard AST's should be unaffected by my changes (either functionally or performance).


The reason for this new feature is I'm building a tool which parses a JS file, custom modifies its styling/formatting by adding such information to the normal AST, then needs to re-generate the code from that AST.

In other words, I plan to turn on escodegen's compact config, so that none of its automatic whitespace/identation/etc happens, but instead have my own "control" on exactly how whitespace is outputted by having that data encoded into the AST via extras annotations.

Moreover, the existing mechanism in escodegen for leadingComments and trailingComments is unfortunately insufficient, because there are many places where comments can validly appear which are not handled by that mechanism. It only tracks comments attached before or after a statement. For example, foo(/* blah */);, the comment there isn't leading or trailing to a statement, and is thus not trackable by the current comments mechanism.

By contrast, though, the "extras" mechanism I've made allows any node in the AST to have leading or trailing "extras", which can either be whitespace, single-line comments, or multiline comments. This allows comments to be output in any of the valid locations, rather than only before/after statements as currently. Perhaps my suggested mechanism here can obviate the need for the existing comments tracking, but note that my attached PR didn't make that bold assumption, so it leaves that stuff entirely alone (for now).

My attached PR here implements this feature as I suggest here in this discussion. It also adds a number of test cases to the "AST" test to verify that it's working correctly.

Here's an example (which is also in the test suite). This example AST shows several places where an extras: { .. } key is added to various different nodes, which can contain either/both leading or trailing extras, which respectively can have either Whitespace, LineComment and/or MultilineComment nodes:

{
    type: 'Program',
    body: [{
        type: 'ExpressionStatement',
        expression: {   
            type: 'CallExpression',
            callee: {
                type: 'Identifier',
                name: 'foo',
                extras: {
                    trailing: [
                        {
                            type: 'MultilineComment',
                            value: '2'
                        },
                    ],
                }
            },
            arguments: [
                {
                    type: 'Literal',
                    value: 4,
                    raw: '4',
                    extras: {
                        leading: [
                            {
                                type: 'LineComment',
                                value: '3'
                            }
                        ],
                        trailing: [
                            {
                                type: 'MultilineComment',
                                value: '5'
                            },
                            // NOTE: testing custom line-comment markers,
                            // in support of HTML-style line-comment markers
                            // Ref: http://javascript.spec.whatwg.org/#comment-syntax
                            {
                                type: 'LineComment',
                                value: '6',
                                marker: '<!--'
                            }
                        ],
                    },
                }
            ],
            extras: {
                leading: [
                    {
                        type: 'MultilineComment',
                        value: '1'
                    },
                ],
                trailing: [
                    {
                        type: 'MultilineComment',
                        value: '7'
                    },
                ]
            }
        },
    }],
    extras: {
        trailing: [
            {
                type: 'MultilineComment',
                value: '8'
            },
        ],
    }
}

That AST now produces this via escodegen (with compact mode turned on, obviously):

/*1*/foo/*2*/(//3
4/*5*/<!--6
)/*7*/;/*8*/

Note that the code even automatically inserts new-lines after line-comments, if necessary to delimit, even if the AST provided does not (though the way I plan to generate my AST, it always would have new-line Whitespace nodes in there).

One final note, not directly illustrated in the above code: there are certain places in code where comments and whitespace can appear (and thus I would want to preserve and track) that there is otherwise no appropriate node to attach the "extras" data to. For instance, foo(/* bar */) has an empty arguments array in the CallExpression, so there's no node to attach to there. I believe there are a few other places across the grammar as well. My suggested solution to this is to support an EmptyExpression node type (similar to EmptyStatement which is already in the code, but EmptyExpression outputs nothing except any extras it may have, not the default ; as EmptyStatement does).

Just as with the rest of the stuff I've implemented, EmptyExpression would merely be supported if an AST included it, but wouldn't be required if an AST had no desire to track these "extras".

…line comments, multiline comments) to be tracked on (and thus outputted from) all AST-Node types (identifiers, expressions, statements, etc
@Constellation
Copy link
Member

Looks great. I'll review this.

@Constellation
Copy link
Member

/CC: @ariya @michaelficarra
If you have any comments, feel free to write here.

@ariya
Copy link
Contributor

ariya commented Sep 23, 2013

Some of these have been discussed in https://code.google.com/p/esprima/issues/detail?id=197.

@getify
Copy link
Author

getify commented Sep 23, 2013

@ariya thanks for the pointer. Unfortunately, that discussion (and the code just landed for esprima) doesn't seem to do anything about preserving any whitespace, which in my case is necessary for a sufficient solution. What do you think about the more general solution I've provided here which also allows whitespace, and even allows the (crazy) HTML-style comment markers to be supported/preserved?

@olov
Copy link

olov commented Sep 23, 2013

Just a note here from the JSShaper author. escodegen followed the leadingComment and trailingComment model I created (@ariya's link has more info). But note that JSShaper used leadingComment/trailingComment for expressions just as well as statements and I guess perhaps so could escodegen.

@getify mentions the problem of capturing comments in some cases, for example foo(/* bar */), and I remember having stumbled on the same issue (but I never solved it). I agree that one really want to store those in a dummy node (in the pre- or postComment property, doesn't matter). As a minor comment perhaps the dummy node should not be called EmptyExpression though because it may be located in non-expression positions, for example function foo(/* bar */).

@Constellation
Copy link
Member

Basically look nice. But I have a concern about EmptyExpression.
EmptyStatement is defined in ECMA262 spec and it represents semicolon only statement (;).
But EmptyExpression represents nothing. I think it is not suitable for Abstract Syntax Tree, is it correct?

If it is addressed by only meta data (not introducing new node, EmptyExpression), I think it is better. Do you have any idea?

@Constellation
Copy link
Member

/CC: @dherman

@ariya
Copy link
Contributor

ariya commented Sep 23, 2013

@getify @olov @Constellation Should we continue the discussion set in https://code.google.com/p/esprima/issues/detail?id=197? Otherwise, we may risk repeating it or having a disconnected thread?

@ariya
Copy link
Contributor

ariya commented Sep 23, 2013

@olov For "orphaned" comment, @Constellation proposed containedComment attached to the nearest parent.

@Constellation
Copy link
Member

@ariya
OK, I agreed to discussing on that Esprima's thread.

@getify
Copy link
Author

getify commented Sep 23, 2013

Note: I know this discussion is centering around comments. But I also care about preserving whitespace for my use case.

@olov @Constellation isn't foo(/* bar */) an example where an expression very well could have been present, in between the ( .. )? I chose "EmptyExpression" as a way to describe the dummy expression that is necessary for these otherwise-empty grammar places. In all the empty cases I could conceive, where "extras" couldn't be added anywhere else, without having some dummy node there, those were places where an expression would be valid. Hence the name I chose to suggest.

@Constellation the problem is that in the case shown, only the ( or the ) could possibly be the parent of such a comment annotation, and neither of them are referenced (as operators) in the AST, but just implied by the CallExpression. I think in any scenario that handles these cases, a dummy node will be required. What type we give it is certainly open to discussion. I think EmptyExpression is a good descriptor, and carries no side effects. But Dummy or DummyExpression or any other name (bikeshedding notwithstanding) would be fine, I suppose.

I understand not wanting to create non-standard AST stuff. But the standard JS AST just doesn't handle cases of wanting to preserve whitespace and comments, so we have to invent. Inventing a no-op empty (literally completely empty) node seems like the smallest change.

@ariya @olov where would containedComment (or containedExtras :) ) actually reside in this case? On the callExpression? What about the other grammar cases where these "empties" arise? Empty { } blocks, empty [ ] array literals, and other such examples were some I conceived. I think we may chase rabbits trying to "invent" exceptions in all of them.

@ariya
Copy link
Contributor

ariya commented Sep 23, 2013

BTW, I'm still undecided about whitespaces (see also https://code.google.com/p/esprima/issues/detail?id=256 and perhaps also https://code.google.com/p/esprima/issues/detail?id=108). I think storing comments is already a stretch (since comments are never parts of a proper syntax tree).

@getify
Copy link
Author

getify commented Sep 23, 2013

@ariya @Constellation

With all due respect, I don't think it's appropriate to move discussion to the esprima thread.

Here's why: this is NOT necessarily as much a question about what a parser would generate in an AST as it is other sorts of tools which either create ASTs from scratch (think transpilers, etc), or modify existing ASTs (code formatters, etc), and where they are in control of what kind of AST they create. I chose to open this discussion here, on escodegen, on purpose, because it takes an AST and produces code. If I have my own bespoke tool that generates ASTs (I do) and I want to have the AST turned into code (I do), my main desire is to have escodegen help that effort.

It's a secondary (but not unimportant) issue how a parser like esprima might also track these items in its AST, and I agree it'd be nice to "agree" here. But my concern was first about being able to take an AST with this data already in it, and include it in the output of escodegen. It was an orthagonal task to later tackle whether something like esprima might care about preserving whitespace and comments as it parses.

@olov
Copy link

olov commented Sep 23, 2013

@getify yup it is! but function foo(/* bar */) {} is not (parameter names are not expressions). :)

@getify
Copy link
Author

getify commented Sep 23, 2013

@olov ahh, good point. so yeah, don't call it EmptyExpression, just call it Empty or EmptyNode. :)

@Constellation
Copy link
Member

@getify @olov

Do you think about defining new lower level format, Concrete Syntax Tree and handle it by Escodegen?
It represents the source code as is. And AST can be transformed into CST.
I think AST should be abstract. For example, AST should not represent parentheses ((expr) should be represented as expr).

@Constellation
Copy link
Member

@getify @ariya
OK, so let's discuss here :)

@getify
Copy link
Author

getify commented Sep 23, 2013

@Constellation I think the CST would be a much harder way of getting what we want, not just for escodegen but for all the other tools in the arena which need to handle various stages of the transformations.

If there's a strong enough objection to the Empty or EmptyNode placeholder solution I suggested, I suppose the extras I suggested could have not only leading and trailing, but also a inner or inside or contained, which is a similar list of extras but which would be implied inside of a certain node's elements. For CallExpression and FunctionDeclaration, that would be the ( .. ) pair. For { .. } and [ .. ] pairs, the same would apply. I am not currently aware of a grammar problem with that approach (though there very well may be with the various ES6 syntax variations coming down the pike).

I think that's a harder solution, because it means that for a tool generating such an AST, it has to decide whether encountered whitespace/comments can belong to a current/adjacent node (if there is one) or must be attached instead to a parent node's extras (if not). The dummy placeholder node approach has the advantage that it reduces the special casing a bit, where you just detect (for instance) an empty arguments list, and insert a dummy node there, and then if you find whitespace or comments, you already have an adjacent node to attach to.

I'm not personally sure why the placeholder node is a problem, since any tool processing such an AST (as escodegen does in my PR) can basically just skip it.

But anyway, if any major objection comes down to the empty placeholder node approach, then I can (probably, hopefully) adjust the PR to instead encode such things in extras.inner or the like. The risk is, we may find another grammar situation later where "inner" doesn't suffice, and we'll be chasing more rabbits to invent solutions there.

What do you think?

@michaelficarra
Copy link
Member

I am not currently aware of a grammar problem with that approach

It doesn't handle this.

function /*0*/ a /*1*/ ( /*2*/ ) /*3*/ {}

I don't see how your EmptyNode solution would either. You would have to add an additional list property to every node for every allowed whitespace position in that node's concrete syntax. It's doable, but not ideal.

Besides, as @Constellation said, this information really does not belong in an AST. @Constellation is right that this information should be stored in a CST. Also, remember that escodegen does not actually generate the resultant JS string output, but generates a CST that it annotates and passes to mozilla's source-map library. I'm sure we could come up with an interface for AST->CST that allows the consumer to modify this IR, adding comments/whitespace wherever they want.

@Constellation
Copy link
Member

@getify

I'm not personally sure why the placeholder node is a problem, since any tool processing such an AST (as escodegen does in my PR) can basically just skip it.

For example, if dummy node is introduced into parameters of FunctionDeclaration, we cannot consider params.length as its number of parameters. And all Nothing|Node interfaces become broken tested by the following code.

if (node.xxx) {
   // actual node
}

So if we introduce dummy node, it should be passed to Escodegen immediately with a special option (such as concreteSyntaxAwareTree: true) since this tree is no longer abstract and it is only handled by Escodegen.

In the long term, I think defining CST is better approach to this (Yup, this is harder approach)

@getify
Copy link
Author

getify commented Sep 23, 2013

@michaelficarra good point. bummer. I had a sense it might be more complicated than we hoped. It's solvable by chasing rabbits. Sucks. [Update: see below comment]

However, blanket statements such as "do not belong in the AST" are not helpful. I understand what the point of an AST normally is. But we're dealing with an auxilliary/piggy-back usage of ASTs, which is to support transformations on code. Which is why I suggested tagging it as extra meta-data, not actual first-class AST nodes.

Needing to support roundtrip transformations means every step of the transformation has to preserve data, or the roundtrip fails.

If I have a tool that produces an AST (or CST), I can insert whitespace (or comments) wherever I want. However, if I have a tool that parses existing code for the purpose of transforming it, then I need to preserve things in some format. If you say that an AST has to lose that data, then any flow which goes through the AST step is a no-go in terms of what types of tools I'm working on.

Can you explain what your objection to ASTs storing meta-data is, other than just a purely academic definition of the typical usage of an AST?

Moreover, for the class of tools which need to preserve whitespace and comments through transformation (ie, not JS engines, not minifiers, but most everything else), is it really better to say "no, go write your own parser and code-generator from scratch to complete the roundtrip"? I obviously could. But I certainly would rather prefer to embrace and extend the current tools, if at ALL possible.

@getify
Copy link
Author

getify commented Sep 23, 2013

@michaelficarra Thanks for pointing out that "extras" should be able to be attached to identifiers as well. Fixed that easily with this commit: https://github.com/getify/escodegen/commit/62116bf28f52c8e09da06023fff148f448675e10

So, now this AST produces the code you suggested (note particularly it even gives the whitespacing you implied! :) ) using my patch to escodegen:

{
    "type": "Program",
    "body": [
        {
            "type": "FunctionDeclaration",
            "id": {
                "type": "Identifier",
                "name": "a",
                "extras": {
                    "leading": [
                        {
                            "type": "MultilineComment",
                            "value": "0"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ],
                    "trailing": [
                        {
                            "type": "Whitespace",
                            "value": " "
                        },
                        {
                            "type": "MultilineComment",
                            "value": "1"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ]
                }
            },
            "params": [
                {
                    "type": "EmptyExpression",
                    "value": "",
                    "extras": {
                        "leading": [
                            {
                                "type": "Whitespace",
                                "value": " "
                            },
                            {
                                "type": "MultilineComment",
                                "value": "2"
                            },
                            {
                                "type": "Whitespace",
                                "value": " "
                            }
                        ]
                    }
                }
            ],
            "defaults": [],
            "body": {
                "type": "BlockStatement",
                "body": [],
                "extras": {
                    "leading": [
                        {
                            "type": "Whitespace",
                            "value": " "
                        },
                        {
                            "type": "MultilineComment",
                            "value": "3"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ]
                }
            },
            "rest": null,
            "generator": false,
            "expression": false
        }
    ]
}

which yields

function /*0*/ a /*1*/ ( /*2*/ ) /*3*/ {}

Now, there's clearly an issue above, which is that I inserted an EmptyExpression node in the params list, which is probably not a great idea. However, if that were called Empty or EmptyNode or Placeholder, it's not nearly as offensive.

EXCEPT of course the mentions in above comments about checking for instance params.length being thrown off. That's a bummer.

So, the alternate approach, using extras.inner, could easily solve that, ditching then the EmptyExpression name-bikeshedding entirely, and addressing the concerns about length checks. That AST could look like:

{
    "type": "Program",
    "body": [
        {
            "type": "FunctionDeclaration",
            "id": {
                "type": "Identifier",
                "name": "a",
                "extras": {
                    "leading": [
                        {
                            "type": "MultilineComment",
                            "value": "0"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ],
                    "trailing": [
                        {
                            "type": "Whitespace",
                            "value": " "
                        },
                        {
                            "type": "MultilineComment",
                            "value": "1"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ]
                }
            },
            "params": [],
            "defaults": [],
            "body": {
                "type": "BlockStatement",
                "body": [],
                "extras": {
                    "leading": [
                        {
                            "type": "Whitespace",
                            "value": " "
                        },
                        {
                            "type": "MultilineComment",
                            "value": "3"
                        },
                        {
                            "type": "Whitespace",
                            "value": " "
                        }
                    ]
                },
                "extras": {
                    "inner": [
                            {
                                "type": "Whitespace",
                                "value": " "
                            },
                            {
                                "type": "MultilineComment",
                                "value": "2"
                            },
                            {
                                "type": "Whitespace",
                                "value": " "
                            }
                    ]
                }
            },
            "rest": null,
            "generator": false,
            "expression": false
        }
    ]
}

Notice I just added extras.inner onto the FunctionDeclaration node, which would be set to imply that it should add those extras as inside the ( ) pair.

The remaining question is, should inner in that case mean "at the beginning of the ( ) pair", or at the end? Because there might also be param identifiers which also could have extras on them. I would say that inner in this case would only reasonably be used in the empty ( ) pair case, so it's totally safe to just say that inner means "at the beginning of ( )". Wouldn't even be that surprising if you also had params, that they came after such extras.

This solution works. It's "harder", but it addresses the concerns of needing empty nodes and throwing off length counts.

If we get a sense that inner is in fact a more desired direction to go, I'm happy to update the code in the PR for that.

@michaelficarra
Copy link
Member

@getify: The FunctionDeclaration case I gave was not a perfect example, then. inner is still ambiguous with other constructs like this FunctionExpression:

(function /* 0 */ ( /* 1 */ ) /* 2 */ {})

inner could be either the position of 0 or 1 (or 2, but you could say that's just before the BlockStatement, so we'll assume that). Similar problems exist with try/catch/finally and I'm sure many others. As I said above, you could get this idea to work by adding one list property for every position in the node's concrete syntax where whitespace/comments are allowed. But I think that's a bad solution. And as @Constellation mentioned, we don't want to insert fake nodes for practical purposes, not just academic ones.

then any flow which goes through the AST step is a no-go in terms of what types of tools I'm working on.

Yes, exactly. The whole point of an AST is that we've discarded that information because it doesn't have a semantic meaning. You're still talking about tools that have ties to the syntactic representation of the program, which means a CST is more appropriate for you. Not only that, an AST is inappropriate for your use cases.

@getify
Copy link
Author

getify commented Sep 23, 2013

@michaelficarra

...is still ambiguous with other constructs...

Being "ambiguous" is not nearly as much a problem as being "insufficient". Tools which produce such *STs can have rules to decide where to put the extras and where not. There's all kinds of rules around producing valid ASTs and not sticking nodes in inappropriate/unexpected places. There's no reason such appropriate rules can't be applied to where you put extras.

Moreover, in your case, it's not ambiguous at all. inner on that function declaration is totally well-defined by what I already said, which is that inner means "at the beginning of the child ( ) set (or the [ ] or { } as appropriate), which is that such extras attached to the FunctionExpression would appear like this:

( /* here's some unambiguous extras! */ function /* 0 */ ( /* 1 */ ) /* 2 */ {})

As you can see, all the extras would appear at the beginning (but inside) of the outer ( ). Not ambiguous at all.

Moreover, the other examples you suggest like try/catch etc all have BlockStatement children in their trees, which means the leading and trailing can all represent every extras location. The catch clause's identifier has an identifier so we can encode comments even inside the ( ) without needing inner.

The only place not currently handled is .. catch /* not handled */ (err) { .. }. Turns out since we don't need inner to target the ( ) set, since it's always non-empty in a catch, then inner can easily mean "between catch and (".

I suspect that most other examples we can dream up have similar triage.

Yes, we're chasing rabbits. I admitted that much earlier in the thread. So? That's what development sometimes requires.

Not only that, an AST is inappropriate for your use cases.

You seem to be overly pedantically hung up on the purity of an AST. And yet, both esprima AND escodegen already add incomplete leadingComments, etc. meta data nodes into those "pure" ASTs they deal with. I'm not inventing a new concept (meta-data in an AST), but making it better than it currently is (and is already well precedented).

Why does it really matter if what we're talking about is purely an AST or a CST? Do these tools that I'm talking about expose (public API wise) the AST separately from the CST? To my knowledge, they do not.

If your underlying point here is that if I want to track whitespace/comments in code transformations, that I'm shi** out of luck with any of these tools and that I should go invent my own whole stack for them, that's totally unhelpful, and I reject that stance. It's contrary to the spirit of OSS.

If however you're just splitting hairs on what we call it, that's not particularly helpful (bikeshedding) either. It's clear what I mean, and I don't think its label matters at this part of the stack.

And if you're suggesting that all these tools need to change to separately expose AST distinguished from CST, great, be explicit about that. That would be helpful as a proposal. But make sure that the CST is before the creation of an AST, not after, or we are in a situation where's there's intolerable information loss in the creation of your pure AST concept.

But, in the end, regardless of whether we're inventing how to encode this meta-data into an AST, or whether we're debating how that same meta-data should be encoded into something that looks almost like it but has a different label (CST), we still have to figure out if there's a practical way to encode the meta-data. Can you help with that task instead of trying to derail the effort?

@getify
Copy link
Author

getify commented Sep 24, 2013

Actually, my above analysis about how ( /* something */ function(){} ) would need to be represented was unnecessarily complicated. After inspecting the ExpressionStatement, which is the outer ( .. ), it doesn't need an inner because ( ) alone with nothing inside it is invalid. So, there's always a child node inside the ( ), and in this case, that child is FunctionExpression. That node can have a extras.leading to persist the /* something */ comment extra.

Bottom line, YES these annotations have some ambiguity to them if they're sloppily used willy-nilly across the tree. A single comment could be persisted in some cases by several different nodes where the extras could be attached.

But that's not that big of a deal. escodegen can just output whatever it gets without worrying, and put the onus on the generator of the tree to make sure to put extras annotations in the right spot. Thus, the key is deciding what those definitive, non-ambiguous rules should be. As a starting point, I'd say the extras should always be annotated as deeply in the tree as applicable (which removes the ambiguity of annotations overlapping on the same extra). The extra should always be attached to the closest node that it can attach to.

For instance, prefer attaching to a Identifier node if applicable, and not the parent Expression, assuming of course that encodes the same extra in the same location.

If a tree is created with these proper rules, then escodegen should be able to easily churn through all the nodes and not worry about outputting any extras it finds.

@getify
Copy link
Author

getify commented Sep 30, 2013

@michaelficarra you have not addressed how a parser (like esprima) could produce a tree for var a = (b + 2); that escodegen could turn back into, exactly var a = (b + 2);. If the tree removes any information about the ( ) set, how could any program add that information back in, when it doesn't even know it was removed?

@michaelficarra
Copy link
Member

Quoting myself from (way) above...

I'd argue that esprima/acorn/whatever should generate CSTs (and optionally ASTs). escodegen can operate on either one.

"(and optionally ASTs)" means we just run the result through a CST->AST transformation.

@Constellation
Copy link
Member

@getify

@michaelficarra you have not addressed how a parser (like esprima) could produce a tree for var a = (b + 2); that escodegen could turn back into, exactly var a = (b + 2);. If the tree removes any information about the ( ) set, how could any program add that information back in, when it doesn't even know it was removed?

In that case, by creating CST from parser and passing it to escodegen.generateFromCST, we can get var a = (b + 2); :)
AST is not intended to be used for that case since above problem is related to the code format (not semantics). In that case, CST is the proper choice.

If we'd like to control the generated code format, we should use CST. But we'd like to analyze the semantics of the code, we should use AST.
If we'd like to analyze the code semantics, there's no problem if the result becomes var a = b + 2; since the result semantically equals to the original code.

@getify
Copy link
Author

getify commented Sep 30, 2013

Let me see if I get this straight:

  1. escodegen.generate() takes an AST as its input. What it does, by default is make some _guesses_ at extra data it can add in, like default spacing/indentation, etc, the way escodegen currently works. The way it will now do this is to produce a faked/guessed CST.
  2. Then, escodegen passes that faked/guessed CST over to escodegen.generateFromCST().
  3. escodegen.generateFromCST() takes a CST as its input, and it produces code. It can accept a faked/guessed CST from generate(), OR it can accept a real full faithful original-program-reproducing CST with all the real original syntax data, OR it can take a plain-ol-stripped-down-AST. Either way, it does its best to just output that tree as-received, not adding in ANY (non-required) extra formatting, itself.

Does that accurately represent what @Constellation and @michaelficarra think is the best path forward?

@getify
Copy link
Author

getify commented Sep 30, 2013

Assuming for a moment that I understand (from my previous comment), let me make a few points:

  • For (1) above, I understand that one of the reasons you do all this faked/guessed data addition (formatting) is because some people/tools are in the habit of sending in a pure AST to escodegen and wanting a human-friendly-readable representation of the code outputted.

    IMO, I don't think that should be a primary, or assumed, task of escodegen. There are "pretty printer" tools already. So it seems like duplicative, to me, to have escodegen doing the pretty-printing part. To me, the important part of escodegen is that it actually reconstructs all the JS grammar, _not_ that it can make pretty indentations.

    But that's just my opinion. I get it that you want to keep that. Let's just be clear that it's a separate and orthagonal task from actual grammar-reconstruction code generation, and that moreover it's separate and orthagonal from the stuff that I've been asking for. Can we agree on that?

  • So, for (2) above, this is just one of the ways you can think about the API. You could, instead, do this:

    • autoFormatASTintoCST(..): which takes a pure AST and adds in, according to config options, guessed/fake formatting data like spacing/indentation, etc. It strictly returns another tree, which is a CST inferred from the AST.
    • generate(..): takes a CST. Assumes no automatic formatting to be added, and expects the CST itself to provide any and all data it should use for formatting. The only stuff it will add "automatically" are grammatically required whitespace, if no other such CST formatting meta-data is present for that exact location. Otherwise, it will only use what it finds in the tree.

    The benefit of this approach is that you can pass either an AST or a CST to generate(), and you'll get out the code appropriate for what kind of tree you passed in. But you don't have to learn different API calls for generating, one which auto-formats and one which doesn't.

    If you additionally want some extra auto-formatting added to your AST, you can be explicit about it like: escodegen.generate( escodegen.autoFormatASTintoCST( mytree ) ). I prefer auxilliary behavior (the auto-formatting) not to be implicit but to be explicit. Just my opinion though.

    The only "downside" to this approach is that anyone who's currently doing escodegen.generate(myAST) and expecting the automatic pretty-printing will have to add in the explicit formatting call. Again, just my opinion, but I think explicit vs. implicit API semantics is more than enough justification for that change.

@Constellation
Copy link
Member

@getify @michaelficarra

Thanks for the clarification :)

  1. escodegen.generate() takes an AST as its input. What it does, by default is make some guesses at extra data it can add in, like default spacing/indentation, etc, the way escodegen currently works. The way it will now do this is to produce a faked/guessed CST.

Basically right. But to keep the role (escodegen.generate(AST) -> code), I'm planning to provide a function, escodegen.convertASTToCST. And I think escodegen.generate implementation will become conceptually the following.

function generate(ast) {
    return generateFromCST(convertASTToCST(ast));
}

escodegen.convertASTToCST takes AST and returnes faked/guessed CST.

OR it can take a plain-ol-stripped-down-AST. Either way, it does its best to just output that tree as-received, not adding in ANY extra formatting, itself.

This functionality is not necessary. We can just use escodegen.generate(ast) and it works correctly :)

@getify
Copy link
Author

getify commented Sep 30, 2013

@Constellation

Can we agree that the tree which is produced by convertASTToCST(..) will produce a CST with those extra guessed formattings encoded in the same mechanisms that a real/faithful CST would use to represent the original program contents?

That is, that it would likely use something like extras.leading... as has been discussed in this thread, rather than inventing a whole different way to encode the guessed formatting data into a CST?

@Constellation
Copy link
Member

@getify

IMO, I don't think that should be a primary, or assumed, task of escodegen. There are "pretty printer" tools already. So it seems like duplicative, to me, to have escodegen doing the pretty-printing part. To me, the important part of escodegen is that it actually reconstructs all the JS grammar, not that it can make pretty indentations.

escodegen's primary task is generating valid & semantically equal code from AST.
This is a difficult job. For example, if you don't insert a space between regexp /reg/ and in operator, code becomes broken.
If you are CST producer, you should handle this by yourself (Of cource, escodegen will throw a error early and help you). Since you can control the whitespaces, escodegen.generateFromCST cannot decide which token should be inserted to here. (tab? space? 2 spaces? etc.)

escodegen's core principal is parse(generate(AST)) structurely equals to parse(AST).

So I think this task belongs to escodegen.

Personally I think escodegen.generateFromCST can be extracted from escodegen.

Can we agree that the tree which is produced by convertASTToCST(..) will produce a CST with those extra guessed formattings encoded in the same mechanisms that a real/faithful CST would use to represent the original program contents?

That's right!

That is, that it would likely use something like extras.leading... as has been discussed in this thread, rather than inventing a whole different way to encode the guessed formatting data into a CST?

But it cannot represent some concrete syntax information (such as parentheses and semicolon) as a meaning structure, correct?

@DavidBruant
Copy link

Arriving super late to the party, but yes CST, awesome idea!
Indeed lots of fine-grain styling consistency things aren't present in the AST (spaces here and there, position of brackets, etc.), so tools can't alert on them.

I haven't read all the debates on the API side of things but here are a few elements I find relevant: the CST contains all the information of the AST, but not the other way around. Ideally, that could mean that the AST be a subset of the CST. The problem is that the AST was defined long ago and a bunch of tooling relies on it, so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here). And the reward of breaking things doesn't seem worth it in that instance.

Beyond backward compat, given that CST contains more info, one could wonder what is the point of AST after CST... But let's make CST happen.

Is it OK? If so, I'd like to start definining CST format spec.

I guess it starts with the following principle:
There exists 2 functions sourceToCST(src: string) : CST and CSTToSource(cst: CST) : string. CST is a tree-shaped object. Object properties can be source fragments containing at most one token and for any syntactically valid JavaScript string src, we have CSTToSource(sourceToCST(src)) === src
(how these functions are exposed in libraries is out of scope for the CST spec ;-) )

@michaelficarra
Copy link
Member

so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here)

It depends on what you mean by superset, but if you mean current tools will be able to treat a CST as an AST, I can see at least one way to do it by just adding optional properties to AST nodes:

  • Add an additional list property containing 0 or more whitespace/comment fragments; one property for each position whitespace/comments are allowed
  • Add a parens list property to each node that contains 0 or more objects which each contain the syntactic information (whitespace/comment fragments) for a surrounding pair of parentheses. They can be listed outside-in or inside-out, it probably doesn't matter.
  • Add a property to each type of node that supports trailing semicolons, indicating whether one is used in the representation.
  • edit: Also, we need something equivalent to escodegen's current "verbatim" support on Literal nodes to specify how numbers/strings are represented.

I think that should be sufficient. @Constellation: what do you think?

@getify
Copy link
Author

getify commented Oct 1, 2013

@Constellation

...So I think this task belongs to escodegen.

What you've described is indeed a critical function of escodegen, that it needs to produce valid code from the tree. This means that in certain places, it does need to add in whitespace between tokens, such as between /regex/ and in.

However, that's a separate and orthagonal task from escodegen adding in extra pretty-printing whitespace like with indentation and such. Looking through the bits of escodegen that handle any sort of whitespace insertion, about 10% of the code (rough estimate) is for required whitespace for grammar validity, and the other 90% of the code that deals with whitespace insertion is dealing with fancier pretty-printing things like adding base indentation to lines of code, adjusting the indentation level of multi-line comments, and other such things.

It's just my opinion, but I see the second of these tasks, the pretty-printing part, as not necessarily part of the core primary task of escodegen, while undeniably the first task (ensuring grammar validity with necessary whitespace) as definitely primary.

You could save/remove some of the code/complexity in the current code base if you elected to say that pretty printing stuff was not in escodegen but could be done by some other tool which produces a CST. Such a pretty-printer tool could just take an AST, make a CST out of it, and add the proper whitespace/indentation annotations to the CST, such that when escodegen processes that CST, it will produce the desired pretty-printed code, but escodegen wouldn't have to do the part of actually figuring out the pretty-printing parts.

Just my 2 cents. It's a feature I don't care about either way, about because the tool I'm building will supply all of the spacing info via CST to escodegen.

But it cannot represent some concrete syntax information (such as parentheses and semicolon) as a meaning structure, correct?

I absolutely think the CST can (and should!) represent those things. ASTs have no need for them, CSTs absolutely do have some usage for them (such as reconstructing original code -- my main use case).

@getify
Copy link
Author

getify commented Oct 1, 2013

@DavidBruant @michaelficarra

so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here)
...
...if you mean current tools will be able to treat a CST as an AST...

This is the opposite of what I mean by compatibility between AST and CST. I don't mean, or intend in my suggestions, for a CST to be treated like an AST. I think that's a bad goal that hand-ties us too rigidly in our design of a CST.

I mean: an AST can be treated like a CST (aka, backwards compatibility preserved that escodegen will be able to keep accepting the ASTs it always did), though an AST is a particularly "dumb" CST in that it's missing all the actual "concrete" stuff.

To put it more plainly, I think escodegen is one of the only tools that would care about consuming a CST. Maybe a few others, such as escope. Most other tools would only care to receive an AST. For them, they don't have to change at all just because we're declaring an optional intermediary CST format. We just need for one of the tools to have a CST-to-AST stripper/converter, and we're fine there.

That leaves parsers and other tree generators to consider. The idea would be that they would be able to either output an AST (that is, ignore the extra stuff while they parse/tree-build), or a CST (keep all the extra stuff), and the user of that tool would decide which tree they want, and how to use the tree once they get it.


To specifically address your question, @DavidBruant, I believe I've proven that a CST structure can look an awful lot like an AST, that is, as a superset of an AST, and yet still handle all the complexities of concrete syntax representation.

The vast majority of the grammar allows for whitespace and comments to be annotated using the suggested "extras" implementation I've shown here in this thread. There are a few exceptions where the "extras" have to be added to the tree in a special (aka, non-standard-AST) way/location, but those few exceptions only create a 1-to-1 structure in the CST which can easily be stripped in the transformation of a CST back to an AST. Things such as ( ) sets wrapped around expressions (either as extra/unnecessary or as affecting of precedence rules) are also easily added into the superset CST structure, and again, easily removed to create a valid standard AST.

@Constellation
Copy link
Member

@getify @michaelficarra @DavidBruant

@DavidBruant wrote:

I guess it starts with the following principle:
There exists 2 functions sourceToCST(src: string) : CST and CSTToSource(cst: CST) : string. CST is a tree-shaped object. Object properties can be source fragments containing at most one token and for any syntactically valid JavaScript string src, we have CSTToSource(sourceToCST(src)) === src
(how these functions are exposed in libraries is out of scope for the CST spec ;-) )

Right. CST should represent the source code as is.

I haven't read all the debates on the API side of things but here are a few elements I find relevant: the CST contains all the information of the AST, but not the other way around. Ideally, that could mean that the AST be a subset of the CST. The problem is that the AST was defined long ago and a bunch of tooling relies on it, so making CST a superset of AST might be impossible without breaking things (I'm interested if proven wrong here). And the reward of breaking things doesn't seem worth it in that instance.

Personally I think only users paying attension to formats should consider CST. And we need to keep AST world simple. So I would not like to assume the inheritance between AST and CST. I think CST and AST should be converted each other by some function explicitly.

@getify wrote:

This is the opposite of what I mean by compatibility between AST and CST. I don't mean, or intend in my suggestions, for a CST to be treated like an AST. I think that's a bad goal that hand-ties us too rigidly in our design of a CST.

Agreed. CST should not be treated like an AST.

I mean: an AST can be treated like a CST (aka, backwards compatibility preserved that escodegen will be able to keep accepting the ASTs it always did), though an AST is a particularly "dumb" CST in that it's missing all the actual "concrete" stuff.

I don't agree with this. When API take CST, this API cannot take AST.
If API take CST|AST, the problem happens. If API takes CST, we should explicitly convert CST to AST.
Maybe CST format is very similar to AST. But I think that we should not assume that AST can be treated as CST.

To put it more plainly, I think escodegen is one of the only tools that would care about consuming a CST. Maybe a few others, such as escope. Most other tools would only care to receive an AST. For them, they don't have to change at all just because we're declaring an optional intermediary CST format. We just need for one of the tools to have a CST-to-AST stripper/converter, and we're fine there.

Right. If CST add dummy nodes, basically all AST tools cannot handle CST. CST is different from AST. AST tools should not consider CST since it is very low IR. Only users taking care about formats should consider CST.

@Constellation
Copy link
Member

@getify

You could save/remove some of the code/complexity in the current code base if you elected to say that pretty printing stuff was not in escodegen but could be done by some other tool which produces a CST. Such a pretty-printer tool could just take an AST, make a CST out of it, and add the proper whitespace/indentation annotations to the CST, such that when escodegen processes that CST, it will produce the desired pretty-printed code, but escodegen wouldn't have to do the part of actually figuring out the pretty-printing parts.

How about parentheses?

I don't want to break the current functionality since I think AST tools should not consider about CST and escodegen provides a way to generate something valid code from AST. It keeps AST tools world simple. The current functionality, escodegen.generate(AST, option) -> code must be kept.

So if you'd like to split escodegen's tasks into small ones, escodegen will provide 2 functionalities, inserting required spaces (and required parentheses) and pretty printer functionality. But personally I think they cannot be splitted since both tasks need to know the information about the all inserted spaces.

@getify
Copy link
Author

getify commented Oct 5, 2013

@Constellation

We all seem to have slightly different ideas of what the ideal CST is. I don't know exactly how to resolve the differing views.

However, on a positive note, @michaelficarra @puffnfresh and myself got a chance to chat in-depth over drinks while we've been here at the NationJS conference. I think we have a plan to move forward on making the decisions, which involves a google hangout online meeting with all the interested players. I am going to try to set something like that up over the next couple of weeks. Timezones are going to be our enemy in that, but I'll do my best. Stay tuned. :)

@fpirsch
Copy link

fpirsch commented Oct 6, 2013

@getify

We all seem to have slightly different ideas of what the ideal CST is. I don't know exactly how to resolve the differing views.

My idea about it seems pretty close to yours ;-)
I like very much the idea of the CST being an "enriched AST". Like you said, currently the trees used in the escode* family are already somewhere between a pure AST and a full-blown CST. They already have optional non-semantic information about locations, comments, ranges, tokens.

So the cool idea here would (could) be to rethink and refactor this extra information into something more unified and better structured which would allow tools to rebuild the exact source code from which the CST was produced. And to keep the current model of an AST with a few more properties attached to its nodes.

Or, if some people (and tools) feel really attached to the academic purity of ASTs, maybe we could consider to have 2 separate data structures. A pure AST for the semantics, and a Whatever-T for the presentation information. (Kind of) like html+css separately represent semantic and presentation aspects of a page.

@michaelficarra
Copy link
Member

As Kyle says, we got a chance to speak about this in person and really understand each other's perspectives. The discussion came down to one basic point: we've each defined isomorphic JS CSTs, each with their own pros and cons. I will attempt to summarise the two proposals and their pros/cons here. If we find that we want to collaborate on this, we can either move this to the wiki or a shared google doc.

Proposal: CST By Extending/Annotating AST

I've previously described this format above as a possibility in response to @DavidBruant. I will copy it here for convenience. Essentially, we are just adding optional properties to AST nodes:

  • Add an additional list property containing 0 or more whitespace/comment fragments; one property for each position whitespace/comments are allowed
  • Add a parens list property to each node that contains 0 or more objects which each contain the syntactic information (whitespace/comment fragments) for a surrounding pair of parentheses. They can be listed outside-in or inside-out, it probably doesn't matter.
  • Add a property to each type of node that supports trailing semicolons, indicating whether one is used in the representation.
  • Something equivalent to escodegen's current "verbatim" support on Literal nodes to specify how numbers/strings are represented.

Pros

  • current tools will be able to accept a CST
  • syntax-agnostic transformations will preserve syntax when passed a CST; don't have to have two code paths in AST tools
  • do not have to traverse the tree to convert between CST/AST
  • escodegen can simply treat any input as a partially-filled-in CST and enrich it with defaults to create a full CST before rendering it

Cons

  • slightly harder to reason with, as properties require more logic to interpret than separate nodes

Proposal: CST With Structural Syntactic Forms

This proposal also remains close to the AST specification. In this proposal, new syntactic node types are added to the AST spec. as well as properties containing syntactic information.

  • Add a ParenthesisedExpression node, representing a parenthesised expression in expression position. For parentheses in statement position, this node can be wrapped in a ExpressionStatement.
  • Add an optional extras property to each node (@getify: please clarify exactly how this is represented). I believe this includes whitespace/comment/semicolon information all in one property.

Pros

  • new tools that operate on CSTs should have a slightly easier job reasoning about the syntax represented by the tree
  • parsers may have an easier time generating this format, but more evidence is needed to determine this

Cons

  • need a single-pass transform between AST/CST
  • tools that operate on ASTs only will need either a separate code path or two transformations at its interface (which would lose all syntactic information)

Obviously, the pros/cons lists are incomplete and weighted toward the format I'm more familiar with. We should collaborate to fill them out completely so the community can make an informed decision about the CST format we should use going forward.

One point that was brought up is that we could always have both formats. Technically, this has all of the pros of both proposals, with the only additional con being that one has to be more conscious about what format they are using and their tools support. We also discussed and all agreed that esprima should have an interface for producing a CST.

Hopefully this is a good starting point for continuing discussion. @Constellation: If you prefer, we can move this to the wiki or some other collaboration tool. Please comment with more pros/cons for each of these formats!

@Constellation
Copy link
Member

@getify, @michaelficarra

but I'll do my best. Stay tuned. :)

Great great great!
Thanks for your clarification. Great. Currently I'm busy on my research (sorry) but I'll surely read it tomorrow and reply them!

@Constellation
Copy link
Member

@getify @DavidBruant @michaelficarra

Hopefully this is a good starting point for continuing discussion. @Constellation: If you prefer, we can move this to the wiki or some other collaboration tool. Please comment with more pros/cons for each of these formats!

Very nice. Moving this to wiki is quite reasonable. This clarification helps us a lot.
I've read this and personally I think this covers necessary points.

Does Anyone have comments about this clarification?

@Constellation
Copy link
Member

Created wiki page. Feel free to edit it.
https://github.com/Constellation/escodegen/wiki/CST-Proposals

@getify
Copy link
Author

getify commented Oct 25, 2013

I've been traveling at confs for the past 2 weeks solid, sorry about my delay in responding. Hopefully will pick up the task of getting together an online meetup about this so we can figure out how to move forward.

@jednano
Copy link

jednano commented Feb 28, 2014

I would very much also like to get access to CST information; namely, whitespace. It has been a few months now. @getify, do you know what the next step is?

@getify
Copy link
Author

getify commented Feb 28, 2014

I feel badly that I got very busy with other things and never organized the discussion meeting to talk about CST formats. We really need to circle back and do that. I'll try to find some time soon to renew the effort.

@getify
Copy link
Author

getify commented Mar 18, 2014

OK, that took way too long. But I'm finally doing something "concrete" about this.

Please go see this if you are interested in helping define a standard CST:

https://github.com/getify/concrete-syntax-tree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants