[spec] Text format #471

Merged
rossberg merged 35 commits into spec.exec.1 from spec.textual on Jun 1, 2017

7 participants

rossberg commented May 11, 2017

This PR specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. Preview at http://webassembly.github.io/spec/

The changes relative to the .wast format currently implemented in the interpreter and other tools are the following.

Removals:

  • some of the more baroque forms of sugar for if
  • binary module bodies
  • anything script related (assertions, invokes, etc)
  • infinity as a secondary spelling for float inf
  • (we had discussed removing the optional module name as well, but in the light of design PR #1055, we probably want to keep it?)

Additions:

  • \u{...} escapes in strings (see below)
  • more than just one inline export (that is, you can write (func $f (export "f1") (export "f2") ...), closing a gap in the syntax), and it combines with import
  • the toplevel (module ...) is optional
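
As a rough illustration of the multiple-inline-export item above, the sugar can be read as a rewriting into separate export fields. The sketch below is Python rather than the interpreter's own code; the helper name `expand_inline_exports` and the nested-list representation of S-expressions are assumptions made for this example, and it ignores the combination with inline imports.

```python
def expand_inline_exports(func):
    # Hypothetical helper: expects a func field as nested lists,
    # e.g. ["func", "$f", ["export", '"f1"'], ...], with an explicit $id.
    assert func[0] == "func" and func[1].startswith("$")
    name = func[1]
    exports = []
    i = 2
    # Peel off every leading (export "...") clause, however many there are.
    while i < len(func) and isinstance(func[i], list) and func[i][:1] == ["export"]:
        # (export "f1") becomes a standalone (export "f1" (func $f)) field.
        exports.append(["export", func[i][1], ["func", name]])
        i += 1
    # The remaining func keeps its identifier and body, without the sugar.
    return exports + [["func", name] + func[i:]]


sugared = ["func", "$f", ["export", '"f1"'], ["export", '"f2"'],
           ["result", "i32"], ["i32.const", "0"]]
fields = expand_inline_exports(sugared)
```

Here `fields` holds two export fields followed by the plain function, which is one way to picture the rewriting for this case.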

Changes:

I took the liberty of proposing one breaking change to make the syntax forward compatible with some of the future extensions that have been discussed:

  • non-empty block signatures must now be written (result i32), in order to generalise cleanly to function signatures
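
To make the breaking change concrete, here is a small Python sketch that rewrites the old bare-type form, such as `(block i32 ...)`, into the new `(result i32)` form. The function name `upgrade_block_signatures` and the regular expression are assumptions for illustration only; a purely textual rewrite like this would also touch matches inside strings and comments, so it is not a substitute for a real parser.

```python
import re

# Old form: block/loop/if, an optional $label, then a bare value type.
# The lookbehind avoids matching inside names such as br_if or $block;
# the lookahead avoids matching the start of instructions such as i32.const.
_OLD_SIG = re.compile(
    r"(?<![\w$.])(block|loop|if)((?:\s+\$[^\s()]+)?)\s+(i32|i64|f32|f64)(?=[\s()]|$)"
)


def upgrade_block_signatures(src):
    # Keep the keyword and optional label, and wrap the type in (result ...).
    return _OLD_SIG.sub(r"\1\2 (result \3)", src)


old = "(block i32 (i32.const 0))"
new = upgrade_block_signatures(old)
```

Text that already uses `(result i32)` is left unchanged, since a parenthesis rather than a bare type follows the keyword.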

Unicode:

  • the lexical syntax is defined in terms of Unicode characters (i.e., code points)
  • comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
  • in strings, a Unicode character denotes its UTF-8 encoding
  • in strings, Unicode characters can be given explicitly with \u{...} notation
  • .wat files are assumed to be encoded in UTF-8

The PR includes changes to the interpreter implementing the Unicode support (but not yet the other changes).
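
The string rules in the Unicode list above can be sketched as follows. This is a Python illustration, not the interpreter's implementation: the name `string_bytes` is hypothetical, only the `\u{...}` escape is handled (other escapes are left out), and the range check assumes that the escape must denote a Unicode scalar value, that is, no surrogates.

```python
import re

# Matches \u{...} with one or more hexadecimal digits inside the braces.
_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]+)\}")


def string_bytes(body):
    # body is the text between the quotes of a string literal.
    def replace(match):
        code_point = int(match.group(1), 16)
        # Assumption: surrogates and values beyond U+10FFFF are rejected.
        if code_point >= 0x110000 or 0xD800 <= code_point < 0xE000:
            raise ValueError("not a Unicode scalar value")
        return chr(code_point)

    # Every character, whether written literally or as an escape,
    # contributes its UTF-8 encoding to the resulting byte sequence.
    return _ESCAPE.sub(replace, body).encode("utf-8")


literal = string_bytes("caf\u00e9")
escaped = string_bytes(r"caf\u{e9}")
```

Both calls yield the same five bytes, because the literal character and the escape denote the same code point.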

Misc Remarks:

  • formatting characters: currently only the minimum set of formatting characters is allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that

  • Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?

  • binary module bodies: they would seem pretty unusual for a "text" format, so are not included. Is there a reason to keep them?

  • abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax

  • inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)

  • formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well

  • PDF: currently doesn't build; due to various limitations of MathJax (such as the inability to use packages) I had to resort to some hacks for special characters that apparently don't work in proper LaTeX; will fix later

  • tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...

@rossberg rossberg changed the title [spec] Specify text format [spec] Text format May 11, 2017

jfbastien commented May 11, 2017

On Unicode comments: are U+2028 and U+2029 allowed?

rossberg commented May 11, 2017

@jfbastien, yes, currently, any code point is allowed, with no special interpretation. The only ones with specific meaning are ; ( ) and \n. (In particular, U+2028/9 would not end a line comment, but neither do ASCII \r, \f, or \v.)
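
The line-comment behaviour described in this reply can be sketched in a few lines of Python. The function name `skip_line_comment` is an assumption for this example; it returns the index of the terminating \n (or the end of input), and deliberately searches for \n alone rather than relying on a general line-splitting routine.

```python
def skip_line_comment(src, start):
    # Hypothetical helper: src[start:] must begin with the ';;' marker.
    assert src.startswith(";;", start)
    # Only U+000A ends the comment; \r, \f, \v, U+2028 and U+2029 do not.
    end = src.find("\n", start)
    return len(src) if end == -1 else end


src = ";; a\r b\u2028 c\n(module)"
end = skip_line_comment(src, 0)
rest = src[end + 1:]
```

Python's `str.splitlines` treats \r and U+2028 as line boundaries, which is why the sketch uses `find("\n")` instead.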

jfbastien commented May 11, 2017

Gotcha, that makes sense. I'm not sure we want to restrict more than what you have, silly Unicode-isms are silly. Were we to try to fool-proof things there's other semicolons such as U+FF1B ; and friends ؛፤⁏⍮⸵︔﹔
❨ and more ❩ ❪ parens than ❫ ⦗ one can ⦘ ﹙ keep track ﹚ ﹝ of ﹞ ( because lol Unicode ).

I just would rather ask so we all think about JSONP and shed a Unicode tear 💧 for it.

sunfishcode commented May 11, 2017

Concerning module names and WebAssembly/design#1055, the module name in the name section has no semantic effect, and so seems like it could differ from the module name used for linking. I'd suggest this is an interesting enough corner case that it's worth testing, which would mean we'd want distinct syntax for the name-section module name.

Are the baroque forms of sugar mentioned above being removed from the .wast format too? Similarly, are the various Additions and Changes being made to the .wast format too?

What is the spec.exec.1 branch? This PR contains some changes that look like changes already in master, so it's not clear what the specific changes are here.

rossberg commented May 11, 2017

@sunfishcode, yes, I plan to make the changes to the interpreter, since .wast should remain a strict superset of .wat that differs only in terms of the additional script constructs.

Branch spec.exec.1 corresponds to PR #467 -- I used that as a baseline here, since there were too many dependent changes.

rossberg commented May 11, 2017

One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?

jfbastien commented May 11, 2017

> One item I forgot initially: this also doesn't include the binary module body syntax that we use for some tests. Should it?

Can the proposed format generate all interesting invalid inputs without this?

Additionally, can the proposed format generate equivalent modules (say, non-canonical LEBs)?

rossberg commented May 11, 2017

@jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.
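
For readers unfamiliar with the "non-canonical LEB" question: the same integer can have several LEB128 byte encodings in the binary format, a distinction that has no counterpart once the value is written as a plain number in text. A minimal Python sketch of an unsigned LEB128 decoder shows this; the name `decode_uleb128` is an assumption for this example, and it performs no length or range checks.

```python
def decode_uleb128(data):
    # Each byte carries 7 payload bits, least significant group first;
    # the high bit signals that another byte follows.
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result
        shift += 7
    raise ValueError("truncated LEB128 sequence")


canonical = decode_uleb128(b"\x03")
padded = decode_uleb128(b"\x83\x00")
```

Both byte sequences decode to the same value, so a test that needs the padded form has to spell out the bytes directly.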

jfbastien commented May 11, 2017

> @jfbastien, no, without the direct binary notation something like LEB isn't even a thing in the text format. I guess the question we need to answer is whether it is the purpose of the official text format (as opposed to .wast) to enable expressing such tests.

Yes, I think that's the right question to ask. I think we can move forward with your proposal, without this addition, and then add it later if we decide we need it. Correct?

rossberg commented May 11, 2017

@jfbastien, correct.

rossberg added some commits May 11, 2017

rossberg commented May 17, 2017

I included the changes necessary to both interpreter and test suite to adjust to the listed grammar modifications. Also a few fixes to the spec, notably including #478 and clarification of the ability to combine inline import/export sugar.

rossberg commented May 22, 2017

Anybody opposed to landing this? Anybody willing to review the PR? :)


* Terminal symbols are either literal strings of characters enclosed in quotes: :math:`\text{module}`;
or expressed as `Unicode <http://www.unicode.org/versions/latest/>`_ code points: :math:`\unicode{0A}`.
(All characters written literally are unambguously drawn from the `7-bit ASCII <http://webstore.ansi.org/RecordDetail.aspx?sku=INCITS+4-1986%5bR2012%5d>`_ subset of Unicode.)

lukewagner May 25, 2017

Member

*unambiguously

\Tkeyword ~|~ \TuN ~|~ \TsN ~|~ \TfN ~|~ \Tstring ~|~ \Tid ~|~
\text{(} ~|~ \text{)} ~|~ \Treserved \\
\production{keyword} & \Tkeyword &::=&
\mbox{(any terminal symbol in the grammar that is non of the above)} \\

Values
------

The grammar produtions in this section define *lexical syntax*,

lukewagner May 25, 2017

Member

*productions

lukewagner commented May 25, 2017

Nice work! lgtm with a few nits above and here:

  • lexical.html#characters : the Note seems a bit confusing to me given the preceding para just said all valid unicode code points are characters. I think what's being said is that, across the entire set of rules which define the text format, it is noted to be the case that, outside of comments and string literals, only a subset of 7-bit ASCII characters are used?
  • the abstract syntax of a floating constant is defined by values.html#floating-point to be a sequence of (IEEE754-interpreted) bytes; however, values.html#floating-point seems to produce reals, with only a fuzzy note above that they get rounded. Could you instead have an explicit realBytes(...) function applied to the reals to explicitly produce the bytes?
  • types.html#table-types : perhaps add a Note that elemtype may be extended with other types in the future
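
The realBytes(...) function suggested in the second point could look roughly like the Python sketch below. The name `real_bytes`, its `width` parameter, and the little-endian byte order are assumptions for this illustration. The rounding is whatever `struct.pack` performs when narrowing a Python float (already a rounded double) to the target format, so for 32-bit values it may round twice and is not guaranteed to match the rounding the spec intends.

```python
import struct


def real_bytes(width, value):
    # Hypothetical stand-in for the suggested realBytes function:
    # map a real number to its IEEE 754 byte sequence, little-endian.
    fmt = {32: "<f", 64: "<d"}[width]
    return struct.pack(fmt, value)


one_f32 = real_bytes(32, 1.0)
one_f64 = real_bytes(64, 1.0)
```

Making the conversion an explicit function, as the comment proposes, keeps the grammar producing reals while pinning down in one place how they become bytes.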

@binji binji referenced this pull request May 26, 2017

Closed

Add options to wat-writer #436

5 of 6 tasks complete

rossberg added some commits May 29, 2017

rossberg commented May 29, 2017

lukewagner commented May 29, 2017

@rossberg-chromium Ah, I see now; they all feed into fN; I missed that before.

rossberg added some commits May 29, 2017

rossberg commented Jun 1, 2017

Landing with above LGTM and no objections.

@rossberg rossberg merged commit 0a8fda1 into spec.exec.1 Jun 1, 2017

0 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/travis-ci/push The Travis CI build is in progress
Details

@rossberg rossberg deleted the spec.textual branch Jun 1, 2017

rossberg added a commit that referenced this pull request Jun 1, 2017

[spec/interpreter] Specify text format and adapt interpreter (#471)
This change specifies the text format, based on earlier discussion between @binji, @lukewagner, @sunfishcode, and myself. It also adapts the interpreter to the changes listed below.

The changes relative to the .wast format previously implemented in the interpreter and other tools are the following.

Removals:

- some of the more baroque forms of sugar for `if`
- binary module bodies
- anything script related (assertions, invokes, etc)
- `infinity` as a secondary spelling for float `inf`

Additions:

- \u{...} escapes in strings (see below)
- more than just one inline export (that is, you can write (func $f (export "f1") (export "f2") ...), closing a gap in the syntax), and it combines with import
- the toplevel (module ...) is optional

Changes:

One breaking change makes the syntax forward compatible with some of the future extensions that have been discussed:

- non-empty block signatures must now be written (result i32), in order to generalise cleanly to function signatures

Unicode:

- the lexical syntax is defined in terms of Unicode characters (i.e., code points)
- comments and strings may contain mostly arbitrary Unicode, the rest stays within ASCII
- in strings, a Unicode character denotes its UTF-8 encoding
- in strings, Unicode characters can be given explicitly with \u{...} notation
- .wat files are assumed to be encoded in UTF-8

Misc Remarks:

- formatting characters: currently only the minimum set of formatting characters is allowed as white space (\t, \n, \r); we could include more, e.g. the whole set of ASCII "format effectors" (\b, \v, \f), but you quickly get into a lot of Unicode complexity if you want to go further than that

- Unicode in comments: similarly, in order to avoid getting into Unicode specifics, any legal code point is currently allowed in comments; should we be more restrictive?

- binary module bodies: they would seem pretty unusual for a "text" format, so are not included for now.

- abbreviations: to avoid combinatorial complexity in defining the AST to map on, most syntactic sugar is specified in the form of "abbreviations", simple rewritings into the core syntax

- inline function signatures: I tried to come up with a decent way to describe their rewriting into type indices (and the potential insertion of new type definitions) in terms of rules, but ultimately gave up; it's too cumbersome to express succinctly; so this is the one part that is left partially informal (though hopefully still unambiguous)

- formatting: many of the rules do not currently fit the page width; I left them as is for now, and plan to clean up layout issues once the spec is complete, probably tweaking some layout parameters as well

- tests: lots of stuff we could write tests for, e.g. regarding the Unicode support...