- Repo: babel/babel
- Start Date: 2024-04-13
- RFC PR:
- Related Issues:
- Authors: Danielle Church
- Champion:
- Implementors: Danielle Church
Request for position on (and implementation of) draft TC39 proposal for standardized parser transformations.
- 2024/04/17: Added sections on Runtime PA, Validator
transformations, and Future-proofing ECMAScript.
Slightly altered the semantics of the
syntax
directive and updated the example in Composition of Transformation Paths. - 2024/04/18: Rewrote the example for Validator transformations to clarify the use case. Edited the pipeline example in TD Semantics to clarify that this proposal does NOT change existing ECMAScript parse or import order. Added an example static analysis method to the FAQ section.
- 2024/04/20: Revised the alternative syntax grammar in Specifying Tranformations / Inline, adding a
from
keyword before the opening bracket of the array-like syntax, to avoid collision with member access syntax whensyntax
is used as a variable name. It is now possible to determine if a given instance ofsyntax
is a directive or a primary expression by looking at the next scanner token.
ECMAScript does not currently have any way to describe the syntax transformations required to turn source code (as it appears in source repositories) into ECMAScript code that can be used in modern engines. In the absence of any official mechanism, Babel and other transpilers use documentation, external configuration files, and web fora like StackOverflow to describe and define the transformations that must be applied to any given project before it can be used by an ECMAScript host. These artifacts and ephemera have no consistency between transpilers and it places a heavy burden on developers that want or need to change their build processes. Furthermore, there is no standard way to validate that these changes have been applied to the source files properly, which opens the door to hard-to-debug errors that occur only because of incorrect transpilation; in the worst case scenario, a supply-chain attack like the one that recently affected liblzma could go unnoticed by hiding only in the transpiled/minified distribution of an ECMAScript-targeting package.
I'm developing a proposal for inclusion in the ECMA262 spec that addresses this shortcoming. I have posted an incomplete draft of it on the TC39 discourse, which can be seen here:
https://es.discourse.group/t/proposal-parser-augmentation-mechanism/2008
That draft is largely focused on the impact to the TC39 proposal process, so I'll dig a little deeper into the implementation vision that I have for it, here. Note of course that unless and until TC39 decides to weigh in on the proposal or a concrete implementation is realized, the syntax and semantics discussed herein should be considered extremely tentative.
Parser Augmentation (hereafter PA) is, at its core, a mechanism to describe transformations that can be applied to source material to render it into a valid ECMAScript syntax tree. While the PA framework can be used to define novel transformations in the style of a classic C preprocessor, its primary purpose is to provide a method for (a) an ECMAScript engine or transpiler to report its builtin syntactic capabilities to ECMAScript code, and for (b) ECMAScript code to inform the engine or transpiler what transformations must be applied to any given source before it can be successfully parsed.
This represents an unprecedented level of user control over the ECMAScript import machinery, exceeded only by the proposed VirtualModuleSource mechanism described in the Compartments proposal at Stage 1 of the TC39 process. Unlike VirtualModuleSource, PA can only generate code that could be written in pure ECMAScript, and it cannot override the default import behavior to, for example, provide a different object to every consumer of a given module. However, it can represent modification or replacement of an entire source file without restriction, as long as the result is a valid ECMAScript syntax tree.
Because of this, some parts of the PA design will be defined as optional. In particular, a host is permitted to reject attempts to define new transformations for any reason, including unconditionally. Also, a host is permitted to "play dumb" - to generate a SyntaxError when asked to perform a transformation as though it were unknown - as long as the given transformation is not mandated to be supported by ECMA262. For example, the JSON Modules proposal could be implemented as a PA transformer with the type "json"; once the proposal is accepted and included in the spec, it would not be permissible to refuse transformation of an import with "json" type.
Despite these limitations, it is exceedingly likely that browser vendors will be slow to adopt a polyfillable PA mechanism given the attack risk it repesents, not to mention the common impression that implementing PA requires calling JS code at parse-time. However, neither of these drawbacks affect transpilers like Babel. Transpilation is nearly always performed with controlled inputs, and PA directives in the transpilation context only serve to enable, disable, or configure the transformations that a transpiler is already making.
To lead with an example, the JSON Modules specification could be described the following way, using PA semantics:
async function json(parser) {
// The parser should parse the rest of the input as text and emit it
// as a string constant token, as though it were surrounded in double quotes
// and all appropriate escape characters added.
parser.setTokenizerMode(Parser.TOKENIZER_STRING_CONSTANT);
// Without any arguments, parseToken() will consume as much as possible.
// The return type is an AST node bound to file:line from the source.
const stringConstantToken = await parser.parseToken();
// The syntheticAST function runs the default ECMAScript parser on a template,
// marking all generated AST nodes as synthetic (not in source text)
parser.emit(Parser.syntheticAST`export default JSON.parse(${stringConstantToken})`);
}
// Engines are not required to allow registration of implementations, so the following statement could fail:
Parser.registerImplementation("json", json, { polyfill: true, validate: true });
const { default: { meaning } } = await import("life.json", { type: "json" });
The json
function shown here is a Transformation Description (TD). It defines the steps that a parser could take
to perform the given transformation. The API will be designed in such a way that it can be as performant as possible
without compromising its semi-declarative nature, but no engine or transpiler, even one that supports arbitrary PA,
is required to execute the TD function in any observable way. An implementation can use its own native functionality
for the transformation (such as Babel's existing JSON module support); it can perform static analysis of the
TD function to construct an internal parser representation (engines are allowed to reject any non-analyzable
transformations); or, in fact, it can simply run the TD function at parse-time, passing it a handle to a
PA-compliant parser object.
As long as the resulting AST matches what the TD function would have produced, the implementation is compliant. If executing the TD function would produce observable side-effects, the PA implementation is permitted either to reproduce those side-effects faithfully (for example, by executing the TD function as part of parsing), or to inhibit the side-effects completely (for example, by using a different implementation, or by running it in a sandbox), and it may choose differently for each transformation. It is a programming error for the TD function to produce different AST outputs based on the observability of side-effects or lack thereof, and PA implementations are permitted to reject any transformations that show such discrepancies.
A TD can also be registered to validate a transformation, as it is here. If a validating TD is registered for a transformation and that transformation executes with inhibited side-effects, then afterwards, the TD function must be called with visible side-effects. This second transformation must produce an AST that matches the first, and the PA implementation must reject the Transformation if it does not.
A transformation description must be written in statically analyzable ECMAScript. The PA specification does not and will not define what is meant by "statically analyzable", other than to make the tautological assertion that "a statically analyzable TD function is one that can be statically analyzed by a PA-compliant engine".
A single Transformation actually represents two conversions: a non-syntactic conversion at the level of bytes or text, and a syntactic conversion of an AST from one form to another. Most transformations are either purely syntactic, or purely non-syntactic, but there are exceptions. Every transformation, no matter what category, is defined by a Transformation Description, above. A set of several Transformations in sequence form a Transformation Path, and the PA mechanism includes a number of ways to associate a source text with a Transformation Path; see below. Transformations, and likewise Transformation Descriptions, can be categorized by what type of input they expect and what type of output they produce. All are represented by the same TD interface, but the various categories have slightly different requirements and semantics.
A non-syntactic transformation is one that does not utilize the ECMAScript parsing logic. The json
function
above describes a non-syntactic transformation; it treats the entire source text as a string, without regard for
syntax. Non-syntactic transformations may directly produce an AST node as above, but they can also produce a
a string or an ArrayBuffer (representing textual vs binary output), both of which can be obtained from appropriate
methods called on the parser
object.
One (and only one) transformation in a path may be a baseline transformation. A baseline transformation is one
that is responsible for defining the initial parse from raw text into AST nodes, and it may be non-syntactic or
syntactic. A non-syntactic baseline, like the json
transformation modeled above, utilizes the tokenizer control
methods and methods like parseToken
to pull sequential tokens from the input stream, producing synthetic nodes as
necessary to emit an ECMAScript AST. If a transformation path does not specify any baseline transformation, the default
ECMAScript baseline is conceptually positioned between the last non-syntactic and the first syntactic transformation.
The syntactic transformations are those that use the ECMAScript parser and accept parsed AST nodes as input.
The parser
object passed to the TD function has methods to change certain characteristics of the parser, for
example by defining or redefining operator syntax or precedence. Typically, a syntactic TD function will start
by adjusting any parser characteristics needed, and then it will repeatedly call parser methods to scan for AST
nodes of interest, emitting output nodes according to its internal logic.
Note, however, that these categories are simplifications, and are not absolute. All transformations represent both a syntactic and a non-syntactic conversion, and a "syntactic" transformation is simply one whose non-syntactic portion is identity; the null transformation. Transformations may shift categories during parsing; implementations are permitted to reject any that do with a SyntaxError. (Or, as always, for any other reason.)
A syntactic baseline transformation is one that calls the newBaseline()
or setBaseline()
methods on the
parser instance before anything else. The baseline transformation is responsible for setting all operators,
keywords, etc; if no baseline transformation is specified, the default ECMAScript parser settings are used. The
newBaseline()
method clears the parser state entirely and could be used e.g. to define a transformation from a
strictly limited subset of ECMAScript syntax, like JSON or JSON5, or even from an entirely different language.
The setBaseline()
method resets the parser state to a known default, taking options that can specify which
release year of the ECMA262 specification provides the appropriate operators and keywords.
If a syntactic transformation does not define a baseline, then it is an augmenting transformation. An augmenting transformation can be used to define or redefine syntax features. For example, a TD for the Pipeline operator might read as follows:
function pipeline(parser) {
// Defined operators and keywords can have handlers associated with them, as the third argument.
// The handlers are called upon emit.
parser.defineOperator("|>", {associativity: "left", precedence: "=", context: "block"}, pipeOperator);
parser.defineKeyword("%", {treatAs: "identifier", context: {operator: "|>", operand: 2}}, topicReference);
// having no parse/emit calls in a TD is logically equivalent to a final line reading:
// for await (const node of parser.parseNodes()) parser.emit(node);
}
function pipeOperator(parser, expr, context) {
if (expr.rhs.isKeyword("%")) {
parser.emit(expr.lhs);
} else {
// expr.state is a container to store TD-local metadata. It is not visible to other TDs or to
// the emitted AST.
expr.state.topicVariable = context.newSyntheticLocal();
// this emit will trigger the keyword handler for topic references in the rhs
parser.emit(Parser.syntheticExpression`${topicVariable}=${expr.lhs},${expr.rhs}`);
}
}
function topicReference(parser, expr, context) {
parser.emit(context.state.topicVariable);
}
Parser.registerImplementation("pipeline", pipeline);
If the PA implementation accepts registration of this TD, then the application could then expect to be able to parse and execute a file containing the following:
syntax "pipeline"; // a syntax directive takes effect after the newline following the statement, so just after here →
console.log("foo" |> one(%) |> two(%) |> three(%));
(This could not appear in the same file, which has already been fully parsed, except as part of an eval
or
new Function
or other such dynamic code generation, but it could appear in a file parsed after the point when
the registerImplementation
call executes successfully. This proposal makes no changes to ECMA262 parse or
import timing with the exception of the optional syntax import ... from ...
form of the syntax
directive,
described in Specifying Transformations > Inline)
When a TD is called as part of parsing, it is called once the input stream is open (but, potentially, before any data
is available to it). As a logical description, it operates on a node-and-tree view of the source text, but the API
is designed such that a scanning parser or streaming transpiler can interface with it without backtracking, and with
a minimum of buffering. For example, the define
methods on the parser object take a context
option that specifies
both where the token is valid (the topic reference token is only valid in the rhs of a pipeline operator, for example)
as well as the maximum amount of context required to emit. If the pipeline operator here specified "toplevel"
for
its context, for example, a streaming transpiler would have to buffer output for every top-level statement or
declaration, not emitting until it could be sure there were no pipeline operators anywhere in, say, the class
definition. However, since it specifies "block"
, the transpiler is only required to buffer as far as the nearest
block statement.
This also exemplifies why the primary output of the PA process is an AST, not a raw source text. An augmenting transformation may need to make non-local changes to the source tree, which is much easier when it is already in a non-linear format; and, since one of the goals of this mechanism is that it should be possible to implement it at runtime, in a running engine, with a minimum of overhead, it is sensible to use a format that will be as close as possible to what these engines expect to consume.
(Standing reminder: this API and these semantics are not set in stone and will change as new requirements emerge.)
Possible examples of the various types of transformation:
- "base64": Non-syntactic, producing ArrayBuffer(s)
- "gzip": Non-syntactic, producing ArrayBuffer(s)
- "rot13": Non-syntactic, producing string(s)
- "binaryast": Non-syntactic baseline, producing AST
- "comment": Non-syntactic baseline
- "json": Non-syntactic baseline. Could instead be implemented as a syntactic baseline
- "module": Syntactic baseline, forces parser into ES Module mode
- "script": Syntactic baseline, forces parser into Script mode
- "typescript": Syntactic baseline. Could be implemented as augmenting
- "pipeline": Syntactic augmenting
Note of course that the PA mechanism could be used to describe reverse transformations, e.g. producing a binary AST output from valid ECMAScript code. This is outside the scope of the PA proposal.
The "script" and "module" modes could be used by an engine to autodetect whether a file should be treated as a
Script or as a Module. Engines are not required to support this; in many contexts, parsing a module as a script
or vice versa will cause a load error, and in that case the engine should throw an error. However, in cases where
either a script or a module can be used (like a <script>
tag or a Worker source) this would allow for omitting
the type
attribute/property at the import site, decoupling the implementation from the referencing document.
The "comment" mode is proposed for inclusion in the standard, providing similar developer utility to C's #if 0
construct. It parses all following input as a multiline comment up until a directive changing the parse mode.
TC39 could additionally define modes that act as baseline transformers but remove the support for deprecated syntax
forms from the language. This would allow it to reclaim syntax space in the language, at the cost of gating some
newer syntaxes behind the equivalent of a Python from __future__
statement.
Transformation Descriptions, especially those used by runtime PA implementations, are symbolic. They are designed to be as simple as possible, using as simple syntax as possible, so that they can be verified at the source level as well as in terms of their output. This necessarily means that they are context-dependent; as they are describing behavior that the parser can take, they can use APIs that are specific to operating in the parser context.
For example, a hypothetical open-source project might release an engine that can execute both ECMAScript code and
Python code; it contains both kernels and uses an IPC mechanism to pass values and execution between the domains.
The integration is as smooth as it could possibly be: you simply import
a .py
file as though it were a .js
file, and you can access it just like it were a native ES Module. All this is possible today, and has probably
already been done, many times.
So where does Parser Augmentation fit into this picture? PA gives us a way to describe what is happening, so that
the interface between the "syntax you're importing" (remember, the meaning of import
is "parse this ECMAScript
file and load it as a module") and the rest of the ECMAScript environment is well-defined.
Let's create a Transformation Description for this hypothetical engine. It exposes all its Python plumbing APIs
through a non-standard top-level symbol, PythonVM
. You can access a property on PythonVM
according to the
version of Python you care about if you want, or you can just use PythonVM
itself to use the default version;
in either case, there are only two methods we care about:
PythonVM.compileScript(text)
is an async method which tells the Python runtime to compile a given Python source text. It returns a promise that resolves to an object with abytecode
property containing the bytecode in an ArrayBuffer and anexports
property which is an array of strings like the following:"export let foo = PythonVM.topLevelGlobals['foo'];"
PythonVM.executeBytecode(bytecode)
does exactly what it says on the tin. It executes the bytecode as a Python script, storing the toplevel variable context in the propertytopLevelGlobals
. Synchronous.
We can reproduce the engine's native behavior using these two methods as follows:
async function parsePython(parser, {version}) {
parser.setTokenizerMode(Parser.TOKENIZER_STRING_CONSTANT);
const scriptToken = await parser.parseToken();
const {bytecode, exports} = await PythonVM[version].compileScript(scriptToken.value);
parser.emit(Parser.syntheticAST`PythonVM[${version}].executeBytecode(${bytecode}); ${exports}`);
}
The parsePython
function is a Transformation Description. It doesn't matter that it can only run on this one
specific engine, and the implementation is not provided. If you wanted to run your JS/Python program on an older
version of the engine, one that supported the PythonVM
API but not the automagic import behavior from ES to
Python based on file extension, then you could use the code
Parser.registerImplementation("python", parsePython, {polyfill: true});
to register that TD with the system if there's no definition for "python", and then use some yet-to-be-determined
method to register it as a handler for ".py" files imported by this file. Then you've achieved full parity
with the newer version. The TD could even be used with another engine entirely, as long as there's some way to
implement the PythonVM
API - like, say, a third-party extension, or an NPM package.
Imagine now that a second JS engine decides to add Python support to their implementation. This project does it a different way, though. Rather than welding an entire Python engine to the JS engine, it instead syntax-translates the Python code into equivalent JS code, polyfilling enough Python system functions that standard Python libraries are mostly usable - the WINE solution. It works well enough that you can use most of the simpler pure-Python packages out there. And, since the Python code translates into native JS and doesn't require any inter-engine IPC, the syntax-translated version actually runs faster in most scenarios.
However, it's not a good match for your project. Your use of Python is solely for interfacing with the Python bindings
of a native library that doesn't have a JS binding, and so it won't work if it's syntax-translated. If the engine
can't run a native-Python version of the .py
"module", you need to fall back on your pure-JS implementation.
You can test to make sure the PythonVM
API is available before you await import()
the module, of course, but how
can you be certain that what you import is the PythonVM
implementation?
Well, we can start out by taking the above TD function and registering it as a validator, in addition to a polyfill:
Parser.registerImplementation("python", parsePython, {polyfill: true, validate: true});
- A transformation that is registered with the
validate
option set totrue
is a Validator Transformation. These will always run on any input being transformed by that identifier. It may or may not be selected by the host as the implementation to use when parsing, but if it is not used as the actual parse-time implementation, the host must execute (an equivalent of) the validator TD as well. If the AST produced by the validator TD does not match the AST produced by the original parser (using a definition of "match" to be workshopped later), the host must reject the transformation before any results of the resulting AST are observable - most likely, by throwing aSyntaxError
.
Now, you can be sure that if your await import()
succeeds, it means that this code was exactly the right code,
produced by the same compiler, running on the same version, exporting the right exports. If your project runs
on a syntactic-transforming Python⇒JS engine, it will see that the generated AST is not the same as the
executeBytecode
AST produced by your validator, and it will fail the await import()
so that you can fall
back to the JS implementation.
That's a lot of extra work being done, though. We don't have to compile the entire Python file ourself to know that a syntax-translated AST is wrong. And what's more, this validator will run even on the native-Python engine, which means that you'll be compiling every Python file twice. So, let's simplify the validator function. All you really care about is that some version of Python is executing the file as bytecode, i.e. in an actual Python engine. Copy the parsing TD and change the initial lines about compilation, and you get this:
async function parsePython(parser, {version}) {
parser.setTokenizerMode(Parser.TOKENIZER_STRING_CONSTANT);
const scriptToken = await parser.parseToken();
const {bytecode, exports} = await PythonVM[version].compileScript(scriptToken.value);
parser.emit(Parser.syntheticAST`PythonVM[${version}].executeBytecode(${bytecode}); ${exports}`);
}
function validatePython(parser) {
const version = Parser.ANY_STRING;
const bytecode = Parser.ANY_ARRAYBUFFER;
const exports = Parser.ANY; // ANY is 0 or more nodes, ANY_NODE is exactly 1
parser.emit(Parser.syntheticAST`PythonVM[${version}].executeBytecode(${bytecode}); ${exports}`);
}
Parser.registerImplementation("python", parsePython, {polyfill: true});
Parser.registerImplementation("python", validatePython, {validate: true});
The parsePython
TD function is now registered as a polyfill, to enable functionality on the versions of the hybrid
engine that don't natively support .py
import, and the validatePython
TD function is registered as a validator.
Since validatePython
never makes a parse request and simply runs to completion, it is a passive transformation;
it isn't dependent on the input file, so the PA engine can use it to decide which transformation algorithm to run
in the first place. If the engine has both a native-Python and a syntax-translated-Python mechanism (perhaps one is the
fallback for the other), the validatePython
validator would let it know that it needs to use the native-Python
implementation. If it only has a syntax-translated version, it would know the translation was wrong as soon
as it emitted an AST node that wasn't a PythonVM
call.
The AST format in use will be modeled after Babel's AST. No need to reinvent the wheel on this one; the TC39 Binary AST proposal also plans to use the Babel AST.
The polyfill capability can also be used to give engines future-compatibility for new ECMAScript syntax, even if development on the engine stalls. Let's imagine a standard convention in which engines will report any syntax they have native compatibility for as a null transformer. So, inventing a little bit of API here, we can imagine the following:
console.log(Parser.getAvailableTransformations());
// Prints an object that looks like:
{
"ES2022": function() {},
"ES2021": function() {},
"ES2020": function() {},
...
}
This engine supports up to ES2022 syntax, but not ES2023. The only syntactical change made in the ES2023 specification since ES2022 was the addition of the hashbang grammar, which we can represent as the following TD:
async function downlevelES2023(parser) {
parser.emit(Parser.syntheticAST`syntax "ES2022"`);
if (await parser.peekBytes("#!")) {
parser.setTokenizerMode(Parser.TOKENIZER_COMMENT);
const commentToken = await parser.parseToken({until: "\n"});
parser.emit(commentToken);
}
}
Parser.registerImplementation("ES2023", downlevelES2023, {polyfill: true});
Now, when this engine encounters a file that is imported specifying the "ES2023" transformation (See Specifying
Transformations below), it will execute the downlevelES2023
function.
The first thing this function does is to emit a syntax "ES2022"
directive into the syntactic stream, which
will cause the parser to execute its ES2022 transform. Since it supports ES2022 natively, that transform is a
no-op function. Then it peeks the bytestream to see if it starts with #!
(a non-syntactic request, making this
a non-syntactic transform - we're not defining #!
as an operator, we're just looking for it in this one point
in the bytestream). If it finds it, it will switch its tokenizer to the mode that parses everything as a comment,
and it consumes one line of input, emitting a comment token into the AST as the first top-level item. This happens
before the syntactic parser ever gets the first byte of input, which will start at the second line.
The first engine that supports runtime polyfillable PA is the first one that supports all versions of ECMAScript, past, present, and future.
Defining transformations does very little good if they can't be applied to source files. Parser Augmentation specifies three mechanisms that can be used to define transformations that should be applied to a given source text:
This is intended as the primary means of specifying syntactic transformations in an ECMAScript-equivalent source file.
Because syntactic transformations can change the meaning of source text, like import
statements change the meaning
of identifiers, putting them in the same file offers many advantages:
- Editors and IDEs can use a leading syntax directive to detect file type. Since a TypeScript file with a leading
syntax "typescript";
statement can be considered a compliant ECMAScript source file, it's reasonable to use a ".js" extension for it. - Developers can expect to have a single place to look to determine what syntax is being used in this file, if they see unfamiliar syntax.
- Importing modules do not take an implicit dependency on the implementation of the module.
- The syntax can be used in any location that expects ECMAScript code.
- Depending on implementation support, a single file could contain multiple syntaxes. A developer could write a file primarily in JavaScript but include a top-level block of TypeScript that defines types used by editor autocompletion.
- The file does not become unparseable when removed from its context.
The syntax
directive comes in three forms:
syntax <transformation-path>;
syntax push <transformation-path>;
syntax pop;
A <transformation-path>
is a comma-delimited list of transformations to apply from this point forwards. The semantic
rules listed above must still apply: non-syntactic transformations must occur before syntactic transformations, and
any baseline transformation must occur before any augmenting transformations. Each transformation in the list may
take one of the following forms:
"transformation-id"
- requires that the given transformation be defined elsewhere, like the ECMA262 spec"transformation-id" with {parameters}
- parameters are passed in the second argument to the TD function (if executed), must be string-keyed. Advisory NLTH (see below) beforewith
, otherwise parsers will end too early.import "transformation-id" from <module-specifier>
- forces parsing to stop at the end of the statement line until<module-specifier>
has been imported and executed. The transformation-id must be defined in the module. If the parser already has a specification for "transformation-id", it is permitted to switch to that mode immediately and continue the parse, but it is not permitted to release the parsed AST to the rest of the engine until it has verified that the defining module does not contain a conflicting implementation of the transformation. This can also take awith
, immediately prior to thefrom
; Advisory NLTH before both keywords, and one before awith
in the<module-specifier>
(this is a difference from the usual Import Attributes syntax as described in this issue; in this case it's not ASI hazard but parse hazard)from [<transformation>, <transformation>]
- including multiple transformations in an array-like syntax indicates that any of the listed transformations can be applied; the first one that succeeds will be used. This allows for dynamic selection of transformation implementations, depending on engine support. Note that this is array-like and not a true array syntax; the "members" of this "array" are transformation specifiers, which can include non-Expression syntax. Thefrom
keyword before the opening square-bracket is required to disambiguate this from member access syntax, and has a mnemonic meaning of "from amongst the following:".null
- the null transformer always succeeds, but it has no effect. It can be used as the final alternative in a set of alternate transformations; for example, the directiveis a request to transform this source by the built-insyntax from [ "typescript", import "typescript" with {version: "4.4"} from "https://www.typescriptlang.org/syntax", null];
"typescript"
transformer, if available; by the (obviously currently non-existent) official syntax description for TypeScript 4.4, if unavailable and the runtime supports transformation polyfills; or simply parse the rest of this file untransformed if both of those fail. (It would probably fail, unless the file were syntax-compatible with ECMAScript.)
The "advisory NLTH" before the keywords is actually a specific instance of a general rule, which is that no
leading proper subset of lines may form a legal syntax
directive. The NLTH is not required within the
square-bracket alternate transformation syntax, for example, because the parser is waiting for a close bracket.
(Alternative: requiring a semicolon at the end of a syntax
directive would remove the NLTH requirement, but it
would block ASI.)
The syntax push
form of the directive allows for temporary modification of the parse mode. A subsequent
syntax pop
disables that syntax push
directive, as well as all bare syntax
directives that have occured since.
The "comment" transformer has explicit support for this syntax; it treats all following source text as a multiline
comment up until it encounters the 11-byte sequence "\nsyntax pop"
. It then parses and emits the syntax pop
directive using the default ECMAScript parser, which will return the parser to the state it was in prior to the
switch to "comment" mode. Combined with transformer alternative syntax, it allows conditional inclusion of
source text depending on engine capabilities:
// Standard ECMAScript code
syntax push from ["typescript", "comment"];
// this will either be parsed as TypeScript or ignored
interface InterfaceDeclaration {
foo: string;
bar: number;
}
// the following line will be parsed no matter what
syntax pop;
// Standard ECMAScript code
There are a couple places where newlines are forbidden because of ASI hazard, and the transformation path may not have a trailing comma unless the statement ends in a semicolon; because of these restrictions, the parser is guaranteed that the first newline that can end the statement, is the newline that ends the statement.
Transformation paths can be specified using import attributes. The exact syntax is an open question, but since the JSON Modules proposal uses the following syntax:
import json from "./foo.json" with { type: "json" };
And given that the "JSON Modules" type can be viewed as one specific transformation, it makes sense to use the same attribute. This format allows three different syntaxes:
- a bare string, like above, representing a single transformation
- an object with a "name" property representing the transform and possibly other properties, which will be passed in the second argument to the TD function
- an array of either of the above two syntaxes, representing a Transformation Path.
The "importing" and "alternative" forms have no equivalent and are unsupported in this syntax.
Hosts can define out-of-band mechanisms both for registering transformations and for specifying them. For example,
a transpiler or bundler could register transformer plugins listed in a package.json
file, and its configuration
file could map file extensions or directory trees to transformation paths.
These mechanisms are generally out-of-scope, but this specification suggests using the +
character to separate
subsequent transformations in a path in cases where an entire path must be represented as a single string, for
example in an HTML attribute. Parameters could be represented as ;name=value
, URL-encoded.
When multiple sources specify a Transformation Path for the same file, these paths are composed using the following algorithm:
- Start with an empty working transformation path.
- Order the specified transformation paths as follows:
- Out-of-band, specified for this file in particular (for example, an HTTP header).
- Out-of-band, specified generally (for example, a file-extension rule)
- Import attribute
- Every
syntax
orsyntax push
directive in the source that has not been deactivated by asyntax pop
, in source order
- For each specified path:
- For each transformation in the specified path that also occurs in the working path,
overwrite the working path properties object with the specified properties as though
Object.assign
were called on it. - For each transformation in the specified path that does not occur in the working path, append it to the working path.
- For each transformation in the specified path that also occurs in the working path,
overwrite the working path properties object with the specified properties as though
For example, consider the following scenario: An ECMAScript module imports the file "./foo.ts", specifying the
transformation path "typescript", "pipeline"
in an import attribute. The engine has native support for ".ts"
files, described as a single-element "typescript"
transformation path. The foo.ts file is fetched from a server
that specifies it is to be transformed via the "gzip"
transformation. The file begins with a directive of
syntax "pipeline" with {pipeStyle: "F#"};
, and later in the file, the parser encounters the following directive:
syntax push "comment";
The engine could determine the appropriate transformation path to use after this line as follows:
- Start with an empty working path.
- Order the specified paths and layer them:
"gzip"
(out-of-band, file-specific)"gzip"
does not occur in the working path.- Append
"gzip"
to the working path. - New working path:
"gzip"
"typescript"
(out-of-band, general)"typescript"
does not appear in the working path.- Append
"typescript"
to the path. - New working path:
"gzip", "typescript"
"typescript", "pipeline"
(import attribute)"typescript"
occurs in the working path and has no properties, it is ignored.- New working path:
"gzip", "typescript", "pipeline"
"pipeline" with {pipeStyle: "F#"}
(non-push syntax directive)- The
"pipeline"
transformation appears in the working path, so thepipeStyle
attribute is assigned to the working path properties object (the same one that was passed to the TD function, if it was already called). - New working path:
"gzip", "typescript", "pipeline" with {pipeStyle: "F#"}
- The
"comment"
(syntax push directive)- The
"comment"
transformation does not occur in the working path, so it is appended. - New working path:
"gzip", "typescript", "pipeline" with {pipeStyle: "F#"}, "comment"
.
- The
This algorithm provides a few useful guarantees:
- Transformations will only ever be added onto the end of a path (or popped off the end during a
syntax pop
). - Non-syntactic, non-baseline transformations (byte-to-byte filters, like decompressors) will accumulate at the start
of the path, and they will never be removed or change position. Thus, a developer can't accidentally break the
gzip compression with a
syntax
directive. - Non-syntactic transformations specified out-of-band for a specific file, like in HTTP headers, will always appear at the beginning of the transformation path. Thus, servers can employ reverse transformations on files, like compression, and provide the decompressor on the transformation path, even if the compression method isn't supported by the transport protocol. Engines could support ECMAScript-specific compression technologies like pre-shared dictionaries, and browsers could request and servers could specify them using HTTP headers.
While the actual execution order of code involved in parsing is implementation-specific and not defined by PA, the logical execution order - in other words, the execution order in a host that simply implements PA by calling the TD functions during parsing - is defined as follows:
-
Initialize two lists, each the length of the transformation path, where each element is a queue that starts out empty. The first list is the Non-syntactic request list, the second is the Syntactic request list.
-
The TD functions in the path are called one at a time, starting from the beginning of the path. Each is passed its own instance of the
parser
object; non-syntactic transforms each have their own scanner configuration, but all the syntactic transforms in the path share a single parser configuration which is affected bysetBaseline()
,defineOperator()
, etc, in the order the methods are called. The TD function executes up until its first parse request to the parser, which is an async method returning a Promise, or to completion if the function does not have a parse request, like thepipeline
example above.The type of the first parse request determines whether this transform is syntactic (
parseNode[s]()
call), non-syntactic (parseToken[s]()
call), or passive (runs to completion). The requests are set in the appropriate list, at the position of the transformation in the path. (Thus, both lists are expected to have "holes" in them.) -
While there is input in the input stream:
- If there are no requests in the non-syntactic request list, read a line of input and feed it to the syntactic parser.
- If there are any non-syntactic requests, dequeue a request from the first non-empty queue and fulfill it
from the input stream. This may result in an
emit
from the associated TD function (discussed below) and will probably also result in another parse call (or a continuation of the asyncIterator contract). - If the first non-syntactic request generates an
emit
of astring
orArrayBuffer
, it is used to fulfill a request from the next non-empty queue in the non-syntactic request list; it does NOT dequeue the next request in the current queue. However, if it generates anemit
of an AST node, it goes through the whole synthetic chain (see below). If there are no more non-empty queues in the non-syntactic requests list, it is fed to the syntactic parser. - Assuming the first non-syntactic request made a (possibly implicit) non-syntactic request of its own, the child request is added to the queue at the same position in the appropriate list. A TD can choose to remove itself from the path by returning without making another request.
- Repeat from the top of step 3 ("While there is input...") as long as there is input in the stream.
-
If there are any requests in the non-syntactic request list, dequeue and reject the first one, ignoring any additional parse calls but propagating
emit
calls as usual. Repeat until there are no non-syntactic requests left. -
(This step happens concurrently with step 3, as it is driven by the syntactic parser)
If there are any non-empty queues in the syntactic request list:- For each queue, peek and see what kind of node the request at the head of the queue cares about. (Likely:
"any top-level node", a specific operator, or a specific keyword.) Also note, for each position in the path,
what handlers have been defined for various operators or keywords. Only one such handler can be defined for
each
parser
instance for each operator or keyword, and it is a programming error to request a node usingparseNode
that also has a handler defined; implementations may handle that situation in any way, including adding both requests to the chain, ignoring one, or rejecting the entire transformation (throwing a SyntaxError). - Whenever a node that is being requested is emitted by the parser or by a non-syntactic transform, use it to
fulfill the first such request, dequeueing it from its queue if it isn't a handler. Just like the
non-syntactic chain, any node emitted by a parser will fulfill a matching request from later in the path,
but not from its own spot in the syntactic request list. Any
string
orArrayBuffer
emitted by this request will be emitted at this point in the non-syntactic request chain, exactly as it would if this were a non-syntactic request. - Requests made during the fulfillment of a request are added to the appropriate queue in the appropriate
list. Normally-syntactic transformations may generate a non-syntactic request; for example, a transformer
could define a Bash-style
<<EOT
operator, adding a non-syntactic request to the queue in the handler, continuing to add non-syntactic requests to its queue until it detects a line with EOT; then it emits a string constant node into the syntactic chain and refrains from adding another non-syntactic request. - AST nodes emitted at any point in the tree are emitted to the same tree position as the node that triggered
the request. So, if the AST represents
3 + %
and the%
keyword triggers a handler in the syntactic request chain, and that handler emits an AST node of the identifiertemp7
, then the AST that passes to the next link in the chain will be3 + temp7
. - After dequeueing a request or adding a request to an empty queue, peek to see what kind of node the request at the (new) head of the queue cares about.
- For each queue, peek and see what kind of node the request at the head of the queue cares about. (Likely:
"any top-level node", a specific operator, or a specific keyword.) Also note, for each position in the path,
what handlers have been defined for various operators or keywords. Only one such handler can be defined for
each
-
After the parser finishes emitting nodes, if there are any requests in the syntactic request list, dequeue and reject the first one, ignoring any additional parse calls but propagating
emit
calls as usual. Repeat until there are no syntactic requests left.
There are two relevant APIs, the runtime Parser Augmentation API (shown in examples as Parser.registerImplementation
)
and the statically analyzable Transformation Description API (implemented by the parser
object passed to the
TD function).
Neither is defined at this time. If TC39 decides to take up the proposal, it will be developed in collaboration with TC39 members. Otherwise, it will materialize during implementation according to technical need.
- There is no telling what TC39 will think of this; the proposal does not currently have a TC39 champion, and so it
can't even be presented to them. Unless and until TC39 makes a decision on this front, the
syntax
directive will remain a transpiler extension, like the pipeline operator is currently. This does not mean it can't be used, since it won't be present in the output, but it does represent a forward compatibility risk. - Browsers are unlikely to adopt the
syntax
directive any time soon, and they will likely never support user-defined or polyfilled transformers, no matter how cool it would be to typeThus, transpilers will continue to bear the implementation burden of Parser Augmentation for the foreseeable future, and will likely always be the only implementations that support all the features of PA.import {array} from "./numpy.py" with {type: {name: "python", version: "3.7"}};
- The TD function specification, if designed badly, could shackle implementations to a particular type of parser if they want to support PA. TD functions are intended to be generic and symbolic enough to be divorceable from the actual parser implementation even if the parser chooses to call the TD function directly, but that depends on the design being sufficiently parser-agnostic. There will no doubt be a lot of bikeshedding on what an ideal TD structure would look like, both from the perspective of static analyzability and of performance.
- This mechanism could (and probably, at some point, will) be used as a way to obfuscate source code. I'm not entirely certain why, given that it also provides the means to de-obfuscate it, but the possibility is there.
- This could end up being a way to fragment the ECMAScript language, with individual projects using their own syntactic variant of ECMAScript that someone would have to learn if they wanted to contribute, rather than just using TypeScript like the rest of us.</tongue-in-cheek>
- A
syntax
directive represents a possible one-line attack vector that could change the meaning of an entire source file.
Counterpoint: animport
directive represents a possible one-line attack vector that could change global object prototypes for an entire runtime.
These are the biggest reasons why Parser Augmentation is designed to be opt-in from all participants. Browsers
are not required to implement it at runtime; transpilers are not required to directly execute user-supplied code;
developers are not required to use syntax
directives; TC39 is not required to add it to ECMA262. The PA mechanism
is designed to be able function in any environment, regardless of participation, and still be useful.
The most obvious alternative is the status quo. Transpilers are already capable of performing syntax transformations on user code, and they can do so without shackling themselves to a particular design definition. There are any number of non-standard ways to introduce non-standard syntax into an ECMAScript file; JSDoc allows you to embed TypeScript code into block comments, decorators (which are themselves non-standard syntax, at present) are used as signals to control a code transformer. None of the individual concepts in Parser Augmentation are new, but some of the more relevant prior art includes:
-
Prolog
op/3
The Prolog language has directives like op/3 which allow changing parser behavior mid-stream; this is the particular behavior that inspired theparser.defineOperator()
mechanic. It is common practice in Prolog development to define a novel syntax at the top of a given source file in order to make the rest of the file easier to understand; the Prolog programmable parser is a large part of the inspiration for PA. -
C Preprocessor
While technically just a macro expander, CPP's flexibility allows it to make effectively-semantic changes to the language itself. It is not at all unusual to see notations likeTRY { call_some_function(&arg); } CATCH(e) { perror(e); }
in a C project, despite the fact that the language does not support try/catch functionality. That way, contributors can use a familiar and uncomplicated syntax to interface with whatever underlying representation that project uses to control and report errors. A well-designed set of syntax-like macros can make a given project much easier to understand at the code level.
The C
#pragma
mechanism is likewise a way for source text to interact with the parser during the parse process, and it illustrates the utility of thesyntax push
andsyntax pop
directives (the CPP equivalents, sadly, are non-standard across C preprocessor implementations). -
Python
__future__
Python'sfrom __future__
statement is a statement that looks like an import but is actually a signal to the parser that it should switch its parse mode before continuing with the file. Utilizing it has allowed Python has to make backwards-incompatible syntax changes that the developer community can adopt at its own pace. -
Java JEP Interfaces
For all its quirks, modern enterprise Java has some of the best separation of interface and implementation in the field, and it does it by defining standard patterns for developers to follow, rather than providing APIs for them to access. Do you want to write a REST API server? Okay, write your code to conform to JSR 370, JAX-RS. Do you want to embed a scripting language that end-users can write in? Interface with the script engine using JSR 223, the Java Scripting API. Projects can swap one JAX-RS implementation for another or even one scripting language for another just by changing what dependencies are pulled in by the packager; none, or at least very little, of the project's code needs to change.PA provides a standard pattern for defining exactly what steps a given transformation comprises. No more "I have to keep using Webpack because the syntax I'm using is only implemented by a Webpack plugin". Using a given syntax in a project becomes nothing more difficult than adding a transpiler-agnostic
devDependency
and including asyntax
directive at the top of the file. -
TC39 Binary AST Proposal
This was the inspiration that led to PA's primary format being an AST. Browsers do not want to parse ECMAScript source code. They would prefer to simply import a pre-parsed AST. Given that it's also much easier to perform code transformations at the AST level rather than at the byte level, the decision seemed obvious; with this formulation of PA, a given browser can unilaterally experiment with binary AST formats as long as they use a vendor-specific identifier for their PA implementation, and servers could dynamically choose whether to serve the original source or the equivalent, pre-generated BinAST representation. -
TC39 Pipeline Proposal
The evolution of the ECMAScript pipeline operator has been tumultuous. Between internal disagreement on semantics (F# vs Hack) and open questions about what characters should be used for syntax to which there are no perfect answers, the proposal remains at stage 2 after nine years of development, while the JSON Modules proposal is at stage 3 after five, only awaiting inclusion of the Import Attributes proposal (also stage 3 after five years). PA would provide a mechanism whereby TC39 could accept both semantics of the Pipeline proposal, allowing developers to choose the non-default syntax using a PA directive.
As mentioned above, PA is designed to be an entirely opt-in mechanism at all levels and from all participants. No developer will need to use it, and no engine needs to support it; but the more that do, the more useful it gets. My tentative expectation for how this will go is roughly as follows:
- Submit proposal to Babel. (This document.)
- Get feedback from Babel team; use it to refine the architecture and figure out which things I haven't remembered to put in writing.
- Implement a reference implementation of PA in the Babel parser; finalize first draft of the PA APIs; write demonstration TDs. Gather feedback from other transpiler projects to make sure the architecture doesn't represent a significant development burden; provide implementation support if necessary.
- Revise proposal with implementation details and submit to TC39. Continue raising awareness among TC39 delegates, especially champions of early-stage proposals that would benefit from PA.
- Reach out to authors of popular transpiler plugins/modules; assist with hooking their code into the PA framework.
Start introducing
syntax "foo" from "node-package";
as a transpiler-agnostic way to support non-standard syntaxes in JS projects. - Guide server-side engines (Node, Deno, etc) on how to properly implement PA behind experimental flags.
- Politely ask MDN to document the
syntax
directive as an experimental technology.
It gets a little hazy beyond there, but I think that likely by that point either it'll be popular or it'll be dead.
The only particularly significant change to Babel users is that in some circumstances, they won't need to write as much Babel configuration.
I do think, however, that it's important to teach developers that the syntax
directive, especially in its
syntax ... from
variant, can be dangerous if used incorrectly. Developers should also be warned against writing
transformations that use standard syntax in non-standard ways, with the same severity (and rationale) as adding
to the prototypes of global objects.
I think that "syntax" is probably a good term; it has a very "technical and low-level" feel to it, making it a little scary to less-experienced developers, which is the right feeling to have. (Sort of like how "import" is a friendlier and more approachable term than "require", only in the other direction.)
So, so many. The biggest is that I have no idea how TC39 will react to this. I've gotten a few remarks on the idea from delegates (some reproduced below), but I don't really have a read on their enthusiasm or lack thereof for the idea, only that they don't quite understand the mechanism yet.
Another consideration is that, while I'm quite familiar with low-level parser techniques in general, I have no direct experience with Babel's internal code. I've tried to structure my examples to be fairly obvious to people with parser experience, and while I believe this code architecture will be fairly easy to slot into Babel's existing parser, there could be techniques or considerations at play that make the PA mechanism more difficult to implement in Babel's particular context.
(Note that I'm not asking for development support in specific on this, it'll just take me a little while to get used to the Babel codebase.)
Finally, the question of security. This isn't as hypothetical in the transpiler environment as I'd like it to be, given the risk of supply-side attacks, and I'm not confident I've thought of and mitigated every avenue an attacker might use. I've listed some of my reasonings below, but I'd appreciate backup on this aspect.
The comments that I've received are in quotation marks, and the Q: links to the source. Hypothetical questions are not in quotation marks.
-
Q: "It's highly unlikely that browsers will implement something that changes the syntax on the fly."
A: Browsers don't need to. They can keep throwingSyntaxError
if they see asyntax
directive, exactly as they do now. My expectation is that if browsers decide to support user-defined syntax transforms, they will either (a) only support the Import Attributes method of declaring them, and/or (b) only support thesyntax
directive when it starts at byte 0 of the file. -
Q: "So far, the entirety of the language is at runtime - i doubt this feature would be the one to change that."
A: PA describes a functionality that is equivalent to a process that could occur at runtime, specifically at import/parse time. Browsers being unlikely to support it at runtime for security or performance reasons doesn't change its nature. -
Q: [In reference to the mention of
"use strict"
] "Additionally, "no more modes" is a pretty common mantra on the committee, meaning, it wouldn't be viable to have a pragma that changes how a program evaluates."
A: PA doesn't change the evaluation of a program at all - only its parsing. Once the program makes it into AST form, it evaluates the same no matter how it got there. That's not the case with"use strict"
, which imparts semantic changes to the runtime. -
Q: [In reference to using PA to enable BinAST] "The polyfill adds a transform and doesn't eliminate any existing ones."
A: The polyfill is only used by engines without native support for the "transformation". If a server sends a BinAST source file to a browser that does not support BinAST, it probably means the CDN failed its content delivery check. A BinAST-supporting browser would be expected to see thesyntax
directive (or HTTP header) and immediately switch to its native BinAST parser/importer. -
Q: What safeguards prevent a malicious actor from using Parser Augmentation as an attack vector?
A: The only contexts that can affect how code is parsed are contexts that already have arbitrary control over what code gets executed. In order for PA to be active during the parse of a file, one of three conditions must hold:- A
syntax
directive appears in the source being processed, which means it was put there by someone with access to the source - An Import Attribute was specified at the load site, which had executive control in the first place, and could have chosen to import something else instead. Note that this does not "corrupt" the module for someone else; a difference in Import Attributes means that it's logically a different module.
- An out-of-band method was used to specify a transformation, which is host-defined behavior. The host already has control over all code being executed. It will be important to make sure that implementers understand that applying a transformation is a security risk, and not to apply untrusted transformations.
This is why the
syntax import ... from ...
variant is especially important to be cautious of - it means that any change to the module referenced (the transformation definition) can change the meaning of the current file. (This was already the case, but even more so now.) Very important that the module come from a trusted source. - A
-
Q: What order do transformations "run in"?
A: The exact algorithm is listed above in the Execution Order section, but the approximate answer is "all the non-syntactic transformations, then all the syntactic transformations, in order". -
Q: Won't it take too long to call out to JS during the parse?
A: The intent is not for the parser to call the TD function directly, but for it to statically analyze the function (perhaps by examining the AST, perhaps by running the function once with mock inputs) and then use that information to set parser parameters. I intend to demonstrate all three modes (static AST, static execution, dynamic execution) in the reference implementation. -
Q: Okay, but what really does "statically analyze the TD function" mean?
A: The reason this is left so open-ended is to avoid limiting the options for a PA-compliant engine; static code analysis is an evolving field, and hosts should be able to use novel techniques to analyze and comprehend the purpose of a TD function. However, as a proof-of-concept, here is one possible implementation:- When JS code calls
Parser.registerImplementation
, the PA engine calls the TD function, passing it a mock object that conforms to theparser
instance API. - Record all calls to the mock parser, associating them with well-known PA capabilities. If any unsupported or unknown capabilities are requested, fail the registration.
- Tests and comparisons that can be performed by PA are always implemented as boolean predicates with fixed
comparison inputs, like the
parser.peekBytes()
method shown in thedownlevelES2023
TD shown in the Future-proofing ECMAScript section. For each predicate called, execute the TD function again, returningtrue
from the predicate once andfalse
the other time. Build a decision tree based on the tests performed. Any unsupported or unrecognized test methods should, of course, be grounds for failing the registration. - This style of tracing should be capable of fully analyzing functions that are limited to simple API connections and settings, like the ones shown in this document. For more complex TD functions, set a limit on the runtime of the analysis procedure and/or the number of additional executions queued, and fail the registration (or switch to an alternate analysis mode, or fall back to a dynamic PA impelementation that directly calls the TD during parsing) if the limit is exceeded.
This method represents a minimum implementation of a static analysis, and the PA specification will continue to be developed with the goal of enabling this type of tracing analysis, which can be performed without access to the TD source.
It is expected that engines that do support static runtime PA will advertise their support for various methods of static analysis that they support the same way they advertise support for other core functionality (MDN, caniuse, etc); there could also be an API that allows engines to report what types of TDs they will accept. One engine might allow all TDs by falling back to a dynamic implementation, while another might only support minor textual changes to its own native TDs as returned by a
Parser.getImplementations()
function, like changing the values passed to a configuration method. In this latter case, the engine might syntax-match the passed-in TD's AST against the ASTs of all its native TDs until it finds one that matches, or fail the registration.An engine could also allow dynamic generation of TDs by exposing a builder-like API; a TD generated by such a builder would obviously come pre-analyzed.
- When JS code calls