LES as the text format #697

Closed
qwertie opened this Issue May 24, 2016 · 20 comments

Comments

Projects
None yet
8 participants
@qwertie

qwertie commented May 24, 2016

I would like to propose the next version of LES to be used as the WebAssembly text format. More specifically:

  • I propose that the Wasm text format be compatible with the next version of LES, in the sense of being a subset of it, such that a "full" LES parser would not reject anything that is valid Wasm text.
  • For the MVP, those that want to parse Wasm text will be able to choose whether to use a custom parser dedicated to Wasm or a generic LES parser.
  • My proposed syntax (link at bottom) is not compatible with the current LES specification, but LES is in beta and can still be tweaked to Wasm's needs. Based on the Wasm text format, a third version of LES (LESv3) will be drafted before the end of 2016.

This gives the CG some freedom to make some changes to LES and not others. Specifically, any elements that make sense only in WebAssembly (e.g. keywords for wasm opcodes) would not be permitted, but changes such as tweaks to operator precedence, handling of semicolons, the grammar of LES "superexpressions", or the name used for "infinity", are fine.

Why use LES for WebAssembly?

  • Fewer parsers will be needed in the future: in today's world, every new language needs a new parser to be written, bikeshedded, and specified from scratch. While LES is not sufficient for all languages, it will be enough for some (and in particular it is sufficient for Wasm, which has weaker reasons than most languages to use a custom grammar); being part of Wasm encourages use of LES elsewhere (edit: consider DSLs and advanced search boxes. I wrote some GIS software that did formulas & searches, so used an ANTLR-based parser. But why are people still writing custom parsers for expressions? There should be a standard - so here it is.)

  • More importanty, learning curves decrease: imagine a future world in which a user has already used LES as a programming language or a data language. She wants to write Wasm assembly for the first time, and it's easier to learn, because the syntax is already familiar. She may have many concepts to learn regarding the semantics of Wasm, but at least the syntax is easy. Conversely, she may learn Wasm first and another LES-based language later. Either way, the same benefit accrues.

    By analogy we may compare LES with punctuation and grammar in natural language. When I learn a new language, I have a lot of new words to learn, of course. Luckily, most languages share the same meaning for punctuation: I don't have to re-learn a new version of the comma, dash, parentheses and so on. But grammar is another story: it always varies between languages, sometimes dramatically. Similarly, different programming languages will always vary in concepts and idioms, but it isn't necessary for every language to also have different punctuation and grammar - and even most of the words could stay the same. LES is the only language (AFAIK) that takes the kind of standardization you see in s-expressions and applies it in the Algol-family space.

  • Various things become easier: let's say I want to analyze some C++ code... from within my Rust or Python code. But all the C++ parsers are written in C/C++! See the problem? We don't have easy ways to cross arbitrary language barriers today. Now imagine a world after Wasm MVP: people will want to process Wasm text (and binaries) from many different languages. Because it is hard to cross language barriers, the way this will happen is that many different people will write their own readers and writers for their language of choice. If the text format isn't something generic like LES, the story ends here with folks merrily parsing Wasm. But if the format is generic, the results are more interesting:

    • LES typically represents data more compactly than XML and JSON (and I assert it's
      easier to parse correctly than YAML, and is built atop better primitives), so
      some people will start using it as a data format. It's especially good for data with
      embedded code (e.g. build systems). Network effects are crucial here: no one wants
      to use an obscure data format, so it must be used first in a major standard like
      WebAssembly.
    • LES and Loyc trees were designed to help with a variety of tasks related to
      building compilers, converting code between languages, and bridging language
      barriers (both human and machine). Some compilers today can dump their AST in
      an XML format, which can allow a program in a different language to pick up that
      code and do something with it. But LES is far more compact and just better at
      representing code than XML is. So LES makes it easier to talk about, work with,
      and reason about syntax trees in the same way s-expressions do. However, LES is
      completely unknown today. WebAssembly's role here (should you choose to accept
      it) would be to create awareness of LES and cause people to write LES parsers
      for numerous languages. This may lead more people to get involved in writing
      language-processing tools than in the world we have now, where you're lucky if
      you can even get an XML representation of a given language. This in turn should
      facilitate the creation of interoperability and code-conversion tools,
      ultimately improving interoperability between languages.
    • On a personal note, I want to build a library for converting code between
      programming languages, and I want to use WebAssembly semantics as a "common core"
      for such tasks. I expect this to be easier to do if WebAssembly uses LES.

I'm concerned that some would simply say "this is out of scope". But I don't think this will strain the dev process. So why not try it? Others might argue along the lines of "we shouldn't use LES because I really prefer foo: instead of :foo for labels." I would urge those people to look at the bigger picture. Thanks for your consideration!

What does Wasm+LES look like?

To show you how Wasm could be encoded in LES, I have created a PR to show how Dan Gohman's strawman would need to change to be LES-compatible.

@qwertie qwertie referenced this issue in sunfishcode/design May 24, 2016

Open

Strawman counterproposal #3

@kripken

This comment has been minimized.

Show comment
Hide comment
@kripken

kripken May 24, 2016

Member

I think this is interesting, and I find a lot of the rationale compelling. In particular simple parsing and familiarity.

What is the status of LES? Is it used anywhere?

The familiarity argument also applies to JS, though - what all current web devs are used to. But, that does come with harder parsing in the general case.

Member

kripken commented May 24, 2016

I think this is interesting, and I find a lot of the rationale compelling. In particular simple parsing and familiarity.

What is the status of LES? Is it used anywhere?

The familiarity argument also applies to JS, though - what all current web devs are used to. But, that does come with harder parsing in the general case.

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie May 24, 2016

(edited) @kripken Thank you for saying so. LES is "beta" and in its second major iteration (LESv2), and unfortunately has only two confirmed users that I know of - which, on the plus side, means that changes can still be made to suit tastes of CG members. I'm ready to accept a variety of changes as long as the parser stays language-agnostic and simple (for reference, the current grammar is 239 lines for the parser and 221 lines for the lexer, including syntax tree construction and error handling but excluding token interpretation and helper methods.) Fun fact: my LES parser is written in LES.

We know that the JS parser isn't appropriate for Wasm (there are obvious problems like the lack of types, and less obvious problems like the inability to use arbitrary characters as identifiers, and the lack of separate signed and unsigned operators), and the complexity of its parser is inappropriately high, especially given that Wasm's text format is tangential rather than central to the standard.

qwertie commented May 24, 2016

(edited) @kripken Thank you for saying so. LES is "beta" and in its second major iteration (LESv2), and unfortunately has only two confirmed users that I know of - which, on the plus side, means that changes can still be made to suit tastes of CG members. I'm ready to accept a variety of changes as long as the parser stays language-agnostic and simple (for reference, the current grammar is 239 lines for the parser and 221 lines for the lexer, including syntax tree construction and error handling but excluding token interpretation and helper methods.) Fun fact: my LES parser is written in LES.

We know that the JS parser isn't appropriate for Wasm (there are obvious problems like the lack of types, and less obvious problems like the inability to use arbitrary characters as identifiers, and the lack of separate signed and unsigned operators), and the complexity of its parser is inappropriately high, especially given that Wasm's text format is tangential rather than central to the standard.

@kripken

This comment has been minimized.

Show comment
Hide comment
@kripken

kripken May 24, 2016

Member

Note that JS may well get optional types, similar to TypeScript which is basically a preview of them. It does makes sense I think for wasm notation to be similar to JS where possible, and the TypeScript/JS declarations for types is one possible area for that - x : i32, etc. It's nicer if languages on the web look more consistent, or at least avoid arbitrary differences.

Member

kripken commented May 24, 2016

Note that JS may well get optional types, similar to TypeScript which is basically a preview of them. It does makes sense I think for wasm notation to be similar to JS where possible, and the TypeScript/JS declarations for types is one possible area for that - x : i32, etc. It's nicer if languages on the web look more consistent, or at least avoid arbitrary differences.

@sunfishcode sunfishcode referenced this issue in sunfishcode/design May 25, 2016

Closed

Strawman counterproposal (fixed) #8

@wanderer

This comment has been minimized.

Show comment
Hide comment
@wanderer

wanderer May 26, 2016

Contributor

Note that JS may well get optional types

i'm assuming that wont be anytime soon though

Contributor

wanderer commented May 26, 2016

Note that JS may well get optional types

i'm assuming that wont be anytime soon though

@ghost

This comment has been minimized.

Show comment
Hide comment
@ghost

ghost Jun 14, 2016

Glad someone is exploring this. It still does not seem to have the type inference so how can expressions work as well without a change to the binary encoding?

ghost commented Jun 14, 2016

Glad someone is exploring this. It still does not seem to have the type inference so how can expressions work as well without a change to the binary encoding?

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie Jun 14, 2016

I recently posted in my blog about potential changes to LES, some designed for Wasm and some not.

@JSStats: LES is only a syntax, so type inference is outside its scope. Could you give me an example of the problem you're talking about?

qwertie commented Jun 14, 2016

I recently posted in my blog about potential changes to LES, some designed for Wasm and some not.

@JSStats: LES is only a syntax, so type inference is outside its scope. Could you give me an example of the problem you're talking about?

@ghost

This comment has been minimized.

Show comment
Hide comment
@ghost

ghost Jun 14, 2016

@qwertie The text format being proposed uses 'Infix syntax for arithmetic, with simple overloading.' Without a change to the binary encoding this needs some type inference to present, and would that be a part of the proposal here too?

ghost commented Jun 14, 2016

@qwertie The text format being proposed uses 'Infix syntax for arithmetic, with simple overloading.' Without a change to the binary encoding this needs some type inference to present, and would that be a part of the proposal here too?

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie Jun 14, 2016

@JSStats If I understand your question, then yes - LES would use the same kind of "type inference" as almost every programming language. Note that my design strategy so far is to take community proposals (chiefly @sunfishcode's) and apply those proposals to LES. Thus, general questions like that tend to have the same answer here as they would in #704.

qwertie commented Jun 14, 2016

@JSStats If I understand your question, then yes - LES would use the same kind of "type inference" as almost every programming language. Note that my design strategy so far is to take community proposals (chiefly @sunfishcode's) and apply those proposals to LES. Thus, general questions like that tend to have the same answer here as they would in #704.

@sunfishcode sunfishcode modified the milestone: MVP Jul 8, 2016

@flagxor

This comment has been minimized.

Show comment
Hide comment
@flagxor

flagxor Jul 30, 2016

Member

After some communication between implementers, we've decided to focus on having all browsers be able to display linear opcodes for the MVP timeframe. Moving to Discussion.

Member

flagxor commented Jul 30, 2016

After some communication between implementers, we've decided to focus on having all browsers be able to display linear opcodes for the MVP timeframe. Moving to Discussion.

@flagxor flagxor modified the milestones: Discussion, MVP Jul 30, 2016

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie Aug 2, 2016

@flagxor What is meant by "display linear opcodes"?

qwertie commented Aug 2, 2016

@flagxor What is meant by "display linear opcodes"?

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie Aug 2, 2016

@JSStats although I liked the design as it was before, and although the new design in the slide show isn't a perfect fit to expression notation, the new design (with multiple values) still could allow expression notation in most cases (as you have noted), so it's not necessarily a big loss in terms of presentation - but @flagxor seems to imply they want to drop expression/infix notation in all cases, a proposal to which I would be opposed.

I'm torn. In terms of end-user comprehensibility, an AST is best. Flat assembly code is very tall and hard to follow; the new proposal isn't totally flat, but the same principle applies. On the other hand, the new stack-machine interpretation appears to have (minor) performance and size advantages and it's understood that concision and speed are top priorities. Perhaps the biggest advantage of the new interpretation is thought to be multi-value support, but there are so many ways to skin that cat, it's hard to say the new proposal is the best.

qwertie commented Aug 2, 2016

@JSStats although I liked the design as it was before, and although the new design in the slide show isn't a perfect fit to expression notation, the new design (with multiple values) still could allow expression notation in most cases (as you have noted), so it's not necessarily a big loss in terms of presentation - but @flagxor seems to imply they want to drop expression/infix notation in all cases, a proposal to which I would be opposed.

I'm torn. In terms of end-user comprehensibility, an AST is best. Flat assembly code is very tall and hard to follow; the new proposal isn't totally flat, but the same principle applies. On the other hand, the new stack-machine interpretation appears to have (minor) performance and size advantages and it's understood that concision and speed are top priorities. Perhaps the biggest advantage of the new interpretation is thought to be multi-value support, but there are so many ways to skin that cat, it's hard to say the new proposal is the best.

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie Aug 2, 2016

The slides say that "S-expression and Textual infix notation breaks down for stack machine", but it's not a fatal issue, because we can still use expression notation opportunistically.

Considering an example from the slides:

block
i32.const 4
f64.const 5.5
call $foo
i64.add
end

The text format could allow a linear form. In LESv3 (keeping in mind that it's undecided whether to use semicolons or not):

{
    4;
    5.5;
    $foo(pop 2); // use 2 values that were pushed earlier as arguments
    i64.add;
}

But whether via s-expressions or something else, most code could still be expressed in a mostly nested way. As an s-expression it would be (block (call $foo (i32.const 4) (f64.const 5.5)) i64.add). In LESv3, we might write

{
    $foo(4, 5.5);
    i64.add;
}

It could even be a one-liner and use type inference:

{ +$foo(4, 5.5); }

But I wouldn't recommend this, as the meaning seems non-obvious and one would expect -$foo() to be a negation, not a subtraction.

Let's try another one:

block
i32.const 4
f64.const 5.5
get_local 0
br_if[arity=2] 0
i32.const 6
f64.const 7.5
end

I assume that br_if[arity=2] 0 means "pop one number and if that number is nonzero pop 2 items, and if that number is nonzero, break from block 0 which is the innermost block". Linearly in LESv3 this might be

{
   4; 5.5;
   'br 0 (pop 2) 'if $0; 
   6; 7.5;
}

Or in "maximally tall" notation:

{
   4;
   5.5;
   $0;
   'br_if 0 (pop 2);
   6;
   7.5;
}

But more likely we'd want to use a compact form

{
   // branch from block 0 with values {4; 5.5} if $0, where $0 means `get_local 0`
   'br 0 (4; 5.5) 'if $0; 
   6; 7.5;
}

Note: "pop 2" might be something slightly different from the original arity=2, in that you could also write this:

{
   4;
   'br 0 (pop 1; 5.5) 'if $0; 
   6; 7.5;
}

Here the arity of br_if is still 2, but the first is popped by "pop 1" and the second is specified inline. In this way the text format could support a continuum between a very compact expression language and a very tall "linear" language.

qwertie commented Aug 2, 2016

The slides say that "S-expression and Textual infix notation breaks down for stack machine", but it's not a fatal issue, because we can still use expression notation opportunistically.

Considering an example from the slides:

block
i32.const 4
f64.const 5.5
call $foo
i64.add
end

The text format could allow a linear form. In LESv3 (keeping in mind that it's undecided whether to use semicolons or not):

{
    4;
    5.5;
    $foo(pop 2); // use 2 values that were pushed earlier as arguments
    i64.add;
}

But whether via s-expressions or something else, most code could still be expressed in a mostly nested way. As an s-expression it would be (block (call $foo (i32.const 4) (f64.const 5.5)) i64.add). In LESv3, we might write

{
    $foo(4, 5.5);
    i64.add;
}

It could even be a one-liner and use type inference:

{ +$foo(4, 5.5); }

But I wouldn't recommend this, as the meaning seems non-obvious and one would expect -$foo() to be a negation, not a subtraction.

Let's try another one:

block
i32.const 4
f64.const 5.5
get_local 0
br_if[arity=2] 0
i32.const 6
f64.const 7.5
end

I assume that br_if[arity=2] 0 means "pop one number and if that number is nonzero pop 2 items, and if that number is nonzero, break from block 0 which is the innermost block". Linearly in LESv3 this might be

{
   4; 5.5;
   'br 0 (pop 2) 'if $0; 
   6; 7.5;
}

Or in "maximally tall" notation:

{
   4;
   5.5;
   $0;
   'br_if 0 (pop 2);
   6;
   7.5;
}

But more likely we'd want to use a compact form

{
   // branch from block 0 with values {4; 5.5} if $0, where $0 means `get_local 0`
   'br 0 (4; 5.5) 'if $0; 
   6; 7.5;
}

Note: "pop 2" might be something slightly different from the original arity=2, in that you could also write this:

{
   4;
   'br 0 (pop 1; 5.5) 'if $0; 
   6; 7.5;
}

Here the arity of br_if is still 2, but the first is popped by "pop 1" and the second is specified inline. In this way the text format could support a continuum between a very compact expression language and a very tall "linear" language.

@drom

This comment has been minimized.

Show comment
Hide comment
@drom

drom Aug 2, 2016

@flagxor @JSStats @qwertie Forth language has good 45year history of efficiently expressing "stack machine" code. And here is relevant proposal for "PAF: A portable assembly language" proposal: http://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/ertl-paf.pdf

drom commented Aug 2, 2016

@flagxor @JSStats @qwertie Forth language has good 45year history of efficiently expressing "stack machine" code. And here is relevant proposal for "PAF: A portable assembly language" proposal: http://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/ertl-paf.pdf

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie Aug 2, 2016

@JSStats I don't think this is the appropriate thread to discuss that issue.

qwertie commented Aug 2, 2016

@JSStats I don't think this is the appropriate thread to discuss that issue.

@ghost

This comment has been minimized.

Show comment
Hide comment
@ghost

ghost Aug 2, 2016

@qwertie Ok, sorry, lets move the discussion on the text format to #704 and keep this as LES for whatever format.

ghost commented Aug 2, 2016

@qwertie Ok, sorry, lets move the discussion on the text format to #704 and keep this as LES for whatever format.

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie Aug 2, 2016

@drom I guess PAF resembles the new Wasm design in many ways. Are you aware of any features/benefits of PAF that the new Wasm doesn't have? Exception handling is the only one I notice. The syntax is entirely unfamiliar.

qwertie commented Aug 2, 2016

@drom I guess PAF resembles the new Wasm design in many ways. Are you aware of any features/benefits of PAF that the new Wasm doesn't have? Exception handling is the only one I notice. The syntax is entirely unfamiliar.

@kripken

This comment has been minimized.

Show comment
Hide comment
@kripken

kripken Aug 2, 2016

Member

@qwertie

On the other hand, the new stack-machine interpretation appears to have (minor) performance and size advantages

There are two stages here. First, the move to drop/tee, which has some clear perf benefits. At that point things are equivalent to both an AST (that has explicit drop) and a stack machine (that has some requirements for structure).

The second stage is the move away from an AST to a full stack machine ("relaxation from 0xc" in the slides), which may have minor perf benefits but also minor downsides. There isn't actual data on perf that I am aware of; I would guess there is no significant change. The main benefit appears to be future support for multi-values.

Member

kripken commented Aug 2, 2016

@qwertie

On the other hand, the new stack-machine interpretation appears to have (minor) performance and size advantages

There are two stages here. First, the move to drop/tee, which has some clear perf benefits. At that point things are equivalent to both an AST (that has explicit drop) and a stack machine (that has some requirements for structure).

The second stage is the move away from an AST to a full stack machine ("relaxation from 0xc" in the slides), which may have minor perf benefits but also minor downsides. There isn't actual data on perf that I am aware of; I would guess there is no significant change. The main benefit appears to be future support for multi-values.

@titzer

This comment has been minimized.

Show comment
Hide comment
@titzer

titzer Aug 2, 2016

Contributor

FWIW I was the author of the slides.

I implemented the full stack machine in V8 (patch here: https://codereview.chromium.org/2176653002/), including full multi-value support in both the interpreter and compiler. So I am 100% confident that multi-values work :-)
@rossberg-chromium implemented the stack machine in the stack branch (https://github.com/WebAssembly/spec/tree/stack) of the spec.

I was planning to propose the full stack machine in an issue. It's mostly not a single PR, since the stack machine does not directly affect the binary format, but documentation such as the AstSemantics (to be renamed ExecutionSemantics I presume), and obviously implementations.

Contributor

titzer commented Aug 2, 2016

FWIW I was the author of the slides.

I implemented the full stack machine in V8 (patch here: https://codereview.chromium.org/2176653002/), including full multi-value support in both the interpreter and compiler. So I am 100% confident that multi-values work :-)
@rossberg-chromium implemented the stack machine in the stack branch (https://github.com/WebAssembly/spec/tree/stack) of the spec.

I was planning to propose the full stack machine in an issue. It's mostly not a single PR, since the stack machine does not directly affect the binary format, but documentation such as the AstSemantics (to be renamed ExecutionSemantics I presume), and obviously implementations.

@qwertie

This comment has been minimized.

Show comment
Hide comment
@qwertie

qwertie Aug 21, 2016

Now that Wasm is moving toward a stack machine, I think that LES is now more relevant than ever, because there seems to be a need for not one wasm language but two: one for wasm itself, and another for the pseudo-wasm AST used by producers (i.e. binaryen). Both languages would support expressions, but each would interpret them differently. In the "official" wasm, $a() + $b() could be syntactic sugar for the sequence $a(); $b(); i32.add if both functions return i32, whereas in binaryen, the same text would represent the AST i32.add($a(), $b()).

While only one Wasm variation is expected so far, it's not hard to imagine others:

  • People will need to prototype new Wasm features.
  • I'm interested in making some kind of higher-level language using wasm's AST variant as a semantic foundation.

A general-purpose syntax like LESv3 allows people to do all this without touching the parser, let alone writing a new one. I was disappointed that no one commented on my ideas for LESv3, but I did get around to writing a parser + unit tests. In doing so, my ideas evolved slightly beyond the blog post.

Bikeshedders Wanted

So, I've prepared a new post that explains my design choices for LESv3 and requests your opinions on matters I'm unsure about. I’d much rather have your opinion now than after I’ve written separate parsers for multiple languages!

qwertie commented Aug 21, 2016

Now that Wasm is moving toward a stack machine, I think that LES is now more relevant than ever, because there seems to be a need for not one wasm language but two: one for wasm itself, and another for the pseudo-wasm AST used by producers (i.e. binaryen). Both languages would support expressions, but each would interpret them differently. In the "official" wasm, $a() + $b() could be syntactic sugar for the sequence $a(); $b(); i32.add if both functions return i32, whereas in binaryen, the same text would represent the AST i32.add($a(), $b()).

While only one Wasm variation is expected so far, it's not hard to imagine others:

  • People will need to prototype new Wasm features.
  • I'm interested in making some kind of higher-level language using wasm's AST variant as a semantic foundation.

A general-purpose syntax like LESv3 allows people to do all this without touching the parser, let alone writing a new one. I was disappointed that no one commented on my ideas for LESv3, but I did get around to writing a parser + unit tests. In doing so, my ideas evolved slightly beyond the blog post.

Bikeshedders Wanted

So, I've prepared a new post that explains my design choices for LESv3 and requests your opinions on matters I'm unsure about. I’d much rather have your opinion now than after I’ve written separate parsers for multiple languages!

@jfbastien

This comment has been minimized.

Show comment
Hide comment
@jfbastien

jfbastien May 11, 2017

Member

Text format proposal. Closing this in favor of the proposal.

Member

jfbastien commented May 11, 2017

Text format proposal. Closing this in favor of the proposal.

@jfbastien jfbastien closed this May 11, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment