# [spec] Text format #471

merged 35 commits into from Jun 1, 2017

### rossberg commented May 11, 2017 • edited

This PR specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. Preview at http://webassembly.github.io/spec/

The changes relative to the .wast format currently implemented in the interpreter and other tools are the following.

Removals:

- some of the more baroque forms of sugar for if
- binary module bodies
- anything script related (assertions, invokes, etc)
- infinity as a secondary spelling for float inf
- (we had discussed removing the optional module name as well, but in the light of design PR #1055, we probably want to keep it?)

Additions:

- \u{...} escapes in strings (see below)
- more than just one inline export (that is, you can write (func $f (export "f1") (export "f2") ...), closing a gap in the syntax), and it combines with import
- the toplevel (module ...) is optional

Changes:

I took the liberty to propose one breaking change to make the syntax forward compatible with some of the future extensions that have been discussed:

- non-empty block signatures must now be written (result i32), in order to generalise cleanly to function signatures

Unicode:

- the lexical syntax is defined in terms of Unicode characters (i.e., code points)
- comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
- in strings, a Unicode character denotes its UTF-8 encoding
- in strings, Unicode characters can be given explicitly with \u{...} notation
- .wat files are assumed to be encoded in UTF-8

The PR includes changes to the interpreter implementing the Unicode support (but not yet the other changes).

Misc Remarks:

- formatting characters: currently only the minimum set of formatting characters are allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that

- Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?

- binary module bodies: they would seem pretty unusual for a "text" format, so are not included. Is there a reason to keep them?

- abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax

- inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)

- formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well

- PDF: currently doesn't build; due to various limitations of MathJax (such as the inability to use packages) I had to resort to some hacks for special characters that apparently don't work in proper LaTeX; will fix later

- tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...

### rossberg added some commits Apr 27, 2017

- Start on text format (fa6ee73)
- Core syntax done (cefc934)
- Lexical (fed5963)
- Disallow control characters in strings (748ccac)
- Lexical; index context (006f359)
- Unicode (26b9df9)
- Finish (6cfc342)
- Merge branch 'master' into spec.textual (e2f42d3)
- Locals, params (5cac42d)
- Reject malformed UTF-8 sources (de44c0c)
- Inline types (ac11a6c)
- Free module ordering (818b4cc)
- Free module ordering (5658fab)
- More abbreviations (dfaa467)
- Complete (7907caa)
- Merge branch 'spec.exec.1' into spec.textual (3cae95a)
- Fix various xref and TeX bugs (bb30331)

### rossberg changed the title from "[spec] Specify text format" to "[spec] Text format" May 11, 2017

### jfbastien commented May 11, 2017

On Unicode comments: are U+2028 and U+2029 allowed?

### rossberg commented May 11, 2017

@jfbastien, yes, currently, any code point is allowed, with no special interpretation. The only ones with specific meaning are ; ( ) and \n. (In particular, U+2028/9 would not end a line comment, but nor does ASCII \r, \f, or \v.)

### jfbastien commented May 11, 2017

Gotcha, that makes sense. I'm not sure we want to restrict more than what you have, silly Unicode-isms are silly. Were we to try to fool-proof things there's other semicolons such as U+FF1B ； and friends ؛፤⁏⍮⸵︔﹔ ❨ and more ❩ ❪ parens than ❫ ⦗ one can ⦘ ﹙ keep track ﹚ ﹝ of ﹞ （ because lol Unicode ）. I just would rather ask so we all think about JSONP and shed a Unicode tear 💧 for it.

### sunfishcode commented May 11, 2017

Concerning module names and WebAssembly/design#1055, the module name in the name section has no semantic effect, and so seems like it could differ from the module name used for linking. I'd suggest this is an interesting enough corner case that it's worth testing, which would mean we'd want distinct syntax for the name-section module name.

Are the baroque forms of sugar mentioned above being removed from the .wast format too? Similarly, are the various Additions and Changes being made to the .wast format too?

What is the spec.exec.1 branch? This PR contains some changes that look like changes already in master, so it's not clear what the specific changes are here.

### rossberg commented May 11, 2017

@sunfishcode, yes, I plan to make the changes to the interpreter, since .wast should remain a strict super set of .wat that differs only in terms of the additional script constructs.

Branch spec.exec.1 corresponds to PR #467 -- I used that as a baseline here, since there were too many dependent changes.

### rossberg commented May 11, 2017

One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?

### jfbastien commented May 11, 2017

> One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?

Can the proposed format generate all interesting invalid inputs without this? Additionally, can the proposed format generate equivalent modules (say, non-canonical LEBs)?

### rossberg commented May 11, 2017

@jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.

### jfbastien commented May 11, 2017

> @jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.

Yes, I think that's the right question to ask. I think we can move forward with your proposal, without this addition, and then add it later if we decide we need it. Correct?

### rossberg commented May 11, 2017

@jfbastien, correct.

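
For readers unfamiliar with the "non-canonical LEBs" mentioned in the exchange above, here is a minimal Python sketch of unsigned LEB128. It is only an illustration, not code from this PR or from the interpreter: the function names and the `pad_to` parameter are hypothetical. It shows why such tests need a binary notation: several distinct byte sequences decode to the same integer, while a text-format integer literal carries no trace of which encoding was meant.

```python
def encode_uleb128(value: int, pad_to: int = 0) -> bytes:
    """Encode a non-negative integer as unsigned LEB128.

    pad_to is a hypothetical knob for this sketch: when positive, the
    result has at least that many bytes, which yields an over-long
    (non-canonical) encoding of the same value.
    """
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        low_bits = value & 0x7F          # next 7 payload bits
        value >>= 7
        more = value != 0 or len(out) + 1 < pad_to
        out.append(low_bits | (0x80 if more else 0))  # 0x80 = continuation
        if not more:
            return bytes(out)


def decode_uleb128(data: bytes) -> int:
    """Decode the first unsigned LEB128 value found in data."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if byte & 0x80 == 0:             # no continuation bit: done
            return result
    raise ValueError("truncated LEB128 sequence")


canonical = encode_uleb128(3)            # one byte
padded = encode_uleb128(3, pad_to=2)     # two bytes, same value
assert canonical != padded
assert decode_uleb128(canonical) == decode_uleb128(padded) == 3
```

A binary module body in a .wast test can spell out `padded` byte by byte; a text literal `3` cannot choose between the two.
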
### rossberg added some commits May 11, 2017

- Merge branch 'spec.exec.1' into spec.textual (998078c)
- C&P typo (b184b66)

### rossberg added some commits May 16, 2017

- Merge branch 'master' into spec.textual (a768935)
- Stricter lexical rules for token separation (7fc3ba7)

### rossberg referenced this pull request May 16, 2017

- Closed: [spec+interpreter] Accepts "i32.const0" (missing space) #478

### rossberg added some commits May 16, 2017

- Parens are tokens too (302d843)
- Update interpreter to match text format spec (d1469fa)
- Support .wat files (678dade)
- Forgot added test file (6bfd70e)

### rossberg commented May 17, 2017

I included the changes necessary to both interpreter and test suite to adjust to the listed grammar modifications. Also a few fixes to the spec, notably including #478 and clarification of the ability to combine inline import/export sugar.

- Fix tokenisation (bcec37f)

### This was referenced May 18, 2017

- Closed: Interpreter: Inconsistencies in s-expression format #437
- Closed: S-Expression Syntax #466

### rossberg commented May 22, 2017

Anybody opposed to landing this? Anybody willing to review the PR? :)

### lukewagner reviewed May 25, 2017

> * Terminal symbols are either literal strings of characters enclosed in quotes: :math:\text{module}; or expressed as Unicode _ code points: :math:\unicode{0A}. (All characters written literally are unambguously drawn from the 7-bit ASCII _ subset of Unicode.)

#### lukewagner May 25, 2017

*unambiguously

#### rossberg May 29, 2017

Done.

### lukewagner reviewed May 25, 2017

> \Tkeyword ~|~ \TuN ~|~ \TsN ~|~ \TfN ~|~ \Tstring ~|~ \Tid ~|~ \text{(} ~|~ \text{)} ~|~ \Treserved \\
> \production{keyword} & \Tkeyword &::=& \mbox{(any terminal symbol in the grammar that is non of the above)} \\

#### lukewagner May 25, 2017

*none

#### rossberg May 29, 2017

Done.

### lukewagner reviewed May 25, 2017

> Values
> ------
> The grammar produtions in this section define *lexical syntax*,

#### lukewagner May 25, 2017

*productions

#### rossberg May 29, 2017

Done.

### lukewagner commented May 25, 2017

Nice work! lgtm with a few nits above and here:

- lexical.html#characters : the Note seems a bit confusing to me given the preceding para just said all valid unicode code points are characters. I think what's being said is that, across the entire set of rules which define the text format, it is noted to be the case that, outside of comments and string literals, only a subset of 7-bit ASCII characters are used?
- the abstract syntax of a floating constant is defined by values.html#floating-point to be a sequence of (IEEE754-interpreted) bytes however values.html#floating-point seems to produce reals with only a fuzzy note above that they get rounded; could you instead have an explicit realBytes(...) function applied to the reals to explicitly produce the bytes?
- types.html#table-types : perhaps add a Note that elemtype may be extended with other types in the future

### binji referenced this pull request May 26, 2017

- Closed: Add options to wat-writer #436 (5 of 6 tasks complete)

### rossberg added some commits May 29, 2017

- More Unicode-related fixes in the interpreter (2e52a8a)
- Comments (997a258)

### rossberg commented May 29, 2017

On 25 May 2017 at 05:38, Luke Wagner wrote:

> Nice work! lgtm with a few nits above and here:
>
> - lexical.html#characters : the Note seems a bit confusing to me given the preceding para just said all valid unicode code points are characters. I *think* what's being said is that, across the entire set of rules which define the text format, it is noted to be the case that, outside of comments and string literals, only a subset of 7-bit ASCII characters are used?

Yes, that was the intention. Reworded.

> - the abstract syntax of a floating constant is defined by values.html#floating-point to be a sequence of (IEEE754-interpreted) bytes however values.html#floating-point seems to produce reals with only a fuzzy note above that they get rounded; could you instead have an explicit realBytes(...) function applied to the reals to explicitly produce the bytes?

Right, that's the purpose of the ieee meta functions used in the attribute expressions. I yet have to define those, though (see the todo), which I plan to do when I get to the numeric ops.

> - types.html#table-types : perhaps add a Note that elemtype may be extended with other types in the future

Added.

### lukewagner commented May 29, 2017

@rossberg-chromium Ah, I see now; they all feed in to fN; I missed that before.

### rossberg added some commits May 29, 2017

- Fix Latex (47b7405)
- Rename text to string token (f1f24e4)

### rossberg commented Jun 1, 2017

Landing with above LGTM and no objections.

### rossberg merged commit 0a8fda1 into spec.exec.1 Jun 1, 2017

0 of 2 checks passed (continuous-integration/travis-ci/pr and continuous-integration/travis-ci/push: the Travis CI build is in progress)

### rossberg deleted the spec.textual branch Jun 1, 2017

### rossberg added a commit that referenced this pull request Jun 1, 2017

[spec/interpreter] Specify text format and adapt interpreter (#471)

This change specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. It also adapts the interpreter to the changes listed below.

The changes relative to the .wast format previously implemented in the interpreter and other tools are the following.

Removals:

- some of the more baroque forms of sugar for if
- binary module bodies
- anything script related (assertions, invokes, etc)
- infinity as a secondary spelling for float inf

Additions:

- \u{...} escapes in strings (see below)
- more than just one inline export (that is, you can write (func $f (export "f1") (export "f2") ...), closing a gap in the syntax), and it combines with import
- the toplevel (module ...) is optional

Changes:

One breaking change makes the syntax forward compatible with some of the future extensions that have been discussed:

- non-empty block signatures must now be written (result i32), in order to generalise cleanly to function signatures

Unicode:

- the lexical syntax is defined in terms of Unicode characters (i.e., code points)
- comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
- in strings, a Unicode character denotes its UTF-8 encoding
- in strings, Unicode characters can be given explicitly with \u{...} notation
- .wat files are assumed to be encoded in UTF-8
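
As an illustration of the two string rules above, the following Python sketch maps the body of a string literal to the bytes it denotes. It is a simplified reading, not the interpreter's implementation: the name `string_literal_bytes` is hypothetical, only \u{...} escapes are handled (all other escapes are ignored), and the check that the escaped value is a Unicode scalar value is an assumption of this sketch.

```python
import re

# Matches \u{...} with one or more hexadecimal digits inside the braces.
_UNICODE_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")


def string_literal_bytes(body: str) -> bytes:
    r"""Return the bytes denoted by a string literal body.

    Literal characters contribute their UTF-8 encoding; a \u{...}
    escape contributes the UTF-8 encoding of the named code point.
    Other escape forms are deliberately not handled in this sketch.
    """
    out = bytearray()
    pos = 0
    for match in _UNICODE_ESCAPE.finditer(body):
        # Characters written literally before the escape.
        out += body[pos:match.start()].encode("utf-8")
        code_point = int(match.group(1), 16)
        # Assumption: surrogates and values above U+10FFFF are rejected.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: U+{code_point:X}")
        out += chr(code_point).encode("utf-8")
        pos = match.end()
    out += body[pos:].encode("utf-8")
    return bytes(out)


# A literal character and its explicit escape denote the same bytes.
assert string_literal_bytes("caf\u00e9") == string_literal_bytes(r"caf\u{e9}")
```

Under this reading, writing the character directly in a UTF-8 encoded .wat file and writing its \u{...} escape are interchangeable inside a string.
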

Misc Remarks:

- formatting characters: currently only the minimum set of formatting characters are allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that

- Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?

- binary module bodies: they would seem pretty unusual for a "text" format, so are not included for now.

- abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax

- inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)

- formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well

- tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...
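
One of the Unicode cases worth testing is the line-comment behaviour discussed in the PR thread: only \n ends a line comment, while \r, \f, \v, U+2028 and U+2029 do not. A minimal Python sketch of that rule follows; the helper name `skip_line_comment` is hypothetical, and the choice to treat end of input as a valid end of the comment is an assumption of this sketch.

```python
def skip_line_comment(src: str, start: int) -> int:
    """Return the index just past the ';;' line comment beginning at start.

    Only U+000A (\\n) terminates the comment. Other characters that look
    like line breaks, such as \\r or U+2028, are ordinary comment content.
    """
    if not src.startswith(";;", start):
        raise ValueError("no line comment at this position")
    end = src.find("\n", start)
    # Assumption of this sketch: a comment may also run to end of input.
    return len(src) if end == -1 else end + 1


source = ";; a\u2028b\rc\n(module)"
rest = source[skip_line_comment(source, 0):]
assert rest == "(module)"
```
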
