Metadata in s-expression? atoms #258

Closed
yurydelendik opened this issue Feb 26, 2016 · 17 comments

@yurydelendik

Most languages have some way of expressing code metadata at the source level. This is probably not an issue for the binary format, since it is easy to add unknown sections [1] that refer to encoded opcodes/data by offsets into, e.g., the function sections. However, the wast/s-expression format lacks this capability, and it is currently heavily used for prototyping. While it is possible to provide this information as a comment, e.g. with '!' added (;!DILocation (line 2) (column 9) (scope (!4)));), it would be nice to define it as syntax and probably skip/ignore it for the MVP (as we do for comments). What we are trying to reach here is some verifiable structure for the nested metadata tags.

The main use case is to include source-level debug information in the intermediate WebAssembly language. In the end we would like to see something like what LLVM has (see [2] and [3]). Perhaps something like:

  ...
  (!dbg $1)
  (func (!dbg $2) (param i64) (result i64)
    (!dbg $3) (if_else (i64.eq (get_local 0) (i64.const 0))
      (!dbg $4) (i64.const 1)
      (!dbg $5) (i64.mul (get_local 0) (call 0 (i64.sub (get_local 0) (i64.const 1))))
    )
  )
  (!set $0 (!DIFile (filename "src/factorial.cpp")))
  (!set $1 (!DISubprogram (name "fac_rec") (file $0)))
  (!set $2 (!DILocalVariable (name "n") (scope $1)))
  (!set $3 (!DILocation (line 2) (column 9) (scope $1)))
  ...

The main idea is to associate metadata with s-expression list nodes while keeping it something we can validate and serialize to/deserialize from the binary format.
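One way to picture the "verifiable structure" asked for here is a check that every `$n` reference has a matching `!set` definition. The sketch below is only an illustration in Python: the dictionary layout, the `dangling_refs` helper, and the `used_in_code` list are assumptions made for this example, not part of any proposed syntax. It models the four `!set` entries from the snippet above and the five `(!dbg $n)` uses.

```python
# Hypothetical in-memory model of the (!set $n ...) entries shown above.
metadata = {
    "$0": ("DIFile", {"filename": "src/factorial.cpp"}),
    "$1": ("DISubprogram", {"name": "fac_rec", "file": "$0"}),
    "$2": ("DILocalVariable", {"name": "n", "scope": "$1"}),
    "$3": ("DILocation", {"line": 2, "column": 9, "scope": "$1"}),
}

def dangling_refs(table, used):
    """Return the referenced names that have no !set definition, sorted."""
    refs = set(used)
    for _, fields in table.values():
        # Any string field starting with "$" is treated as a reference.
        refs.update(v for v in fields.values()
                    if isinstance(v, str) and v.startswith("$"))
    return sorted(refs - set(table))

# Names used by the (!dbg $n) annotations in the function body above.
used_in_code = ["$1", "$2", "$3", "$4", "$5"]
missing = dangling_refs(metadata, used_in_code)
print(missing)
```

With the entries as written, the check reports `$4` and `$5`, whose definitions fall inside the elided `...` part of the snippet.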

(Related issues found [4])

[1] https://github.com/WebAssembly/design/blob/master/BinaryEncoding.md#unknown-sections
[2] http://llvm.org/docs/LangRef.html#metadata
[3] http://llvm.org/docs/SourceLevelDebugging.html#object-lifetimes-and-scoping
[4] WebAssembly/design#208

@lukewagner
Member

To check my understanding: are you asking for a way to reliably generate unknown sections (with given names) in the current s-expr text language? That seems useful for testing, in addition to other things.

@sunfishcode
Member

I agree that a way to represent our arbitrary unknown sections in the s-expr format seems useful.

This feature would want a way for these sections to contain references to arbitrary locations in the code. In the binary format one might imagine using byte offsets for this, but that's less practical in a text format. Possible approaches include:

  • number each AST node according to a traversal, and refer to nodes by their number (this could work for the binary format too)
  • associate symbolic names with nodes, which is what the example syntax above shows. These would be specific to the text syntax and presumably translated into node index or byte offset in the binary format.
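The first approach, numbering nodes by traversal order, could look roughly like the Python sketch below. The tiny parser and the choice of pre-order traversal over list nodes are assumptions for illustration; the thread does not fix a traversal order.

```python
import re

def parse_sexpr(text):
    """Parse one s-expression into nested Python lists of atom strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume the closing ")"
            return node
        return tok

    return read()

def number_nodes(tree):
    """Assign pre-order indices to list nodes; return {index: head atom}."""
    numbering = {}

    def walk(node):
        if isinstance(node, list):
            numbering[len(numbering)] = node[0]
            for child in node[1:]:
                walk(child)

    walk(tree)
    return numbering

source = "(i64.mul (get_local 0) (call 0 (i64.sub (get_local 0) (i64.const 1))))"
numbering = number_nodes(parse_sexpr(source))
print(numbering)
```

Here `i64.mul` gets index 0 and the innermost `i64.const` gets index 5; a metadata section could then refer to nodes by these numbers.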

@mbebenita
Contributor

These mark nodes should probably behave kind of like blocks that yield the last value, something like:

(mark $1
  (func (mark $2 (param i64)) (result i64)
    (mark $3 
      (if_else (i64.eq (get_local 0) (i64.const 0))
        (mark $4 (i64.const 1))
        (mark $5 (i64.mul (get_local 0) (call 0 (i64.sub (get_local 0) (i64.const 1)))))
      )
    )
  )
)

This way you could also refer to a region of the AST; the exact meaning of what you're referring to depends on the use.
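A tool that does not care about marks could treat each `(mark $name expr)` as a transparent wrapper around `expr`. The Python sketch below assumes the AST is already parsed into nested lists and that a mark always has exactly one child; the `strip_marks` helper is hypothetical and the multiplication operand is simplified.

```python
def strip_marks(node, marks):
    """Remove (mark $name expr) wrappers, recording name -> wrapped expr."""
    if not isinstance(node, list):
        return node
    if node and node[0] == "mark":
        _, name, inner = node  # assumes exactly one wrapped child
        stripped = strip_marks(inner, marks)
        marks[name] = stripped
        return stripped
    return [strip_marks(child, marks) for child in node]

tree = ["if_else",
        ["i64.eq", ["get_local", "0"], ["i64.const", "0"]],
        ["mark", "$4", ["i64.const", "1"]],
        ["mark", "$5", ["i64.mul", ["get_local", "0"], ["i64.const", "2"]]]]

marks = {}
plain = strip_marks(tree, marks)
print(plain)
print(marks)
```

The stripped tree evaluates the same way as the marked one, while `marks` keeps the region each name referred to.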

@yurydelendik
Author

> To check my understanding: are you asking for a way to reliably generate unknown sections (with given names) in the current s-expr text language?

This would be awesome for unknown sections, e.g. for WebAssembly/design#208. But I want to limit this issue to annotating AST nodes with metadata in textual formats such as s-expressions, as described in @sunfishcode's comment above, and, more specifically, to source-level debug information.

@jfbastien
Member

I'm not sure it's worth designing this specific feature at the moment, since we haven't settled on an actual textual format. It may not be s-expressions. I agree that this issue is important, but too early until we figure out what the textual format is.

@mbebenita
Contributor

We need this particular feature so we can make progress on source-level debugging tooling. A temporary solution might be okay; we could always adapt it to the actual textual format.

Yury's prototype: http://people.mozilla.org/~mbebenita/wasm/wast-debugging.mp4

@titzer
Contributor

titzer commented Feb 26, 2016

Rather than (mark $name) being a separate node, what about adding a @name
annotation that could appear at the beginning (i.e. following the open
paren) anywhere? In expressions, that translates to byte offsets of the
expression in the binary; in function declarations, it translates into the
function index, etc. That would subsume names for functions as well as this
use case.


@mbebenita
Contributor

Sounds even better. Would @name subsume names of label targets as well? If so, then we would have to prevent duplicate labels and shadowing of labels in block scopes.

To refer to a particular AST node, we could just use the syntax @functionName:@labelName, and might as well use the $ instead of @ since it's already how we define names.

@ghost

ghost commented Feb 27, 2016

Personally I think the locations should be based on a property of the source code rather than requiring annotations to the source code. For example, the text source file character position, or the form number obtained by a depth first walk of the sexp, or a list of indexes to walk to the node, etc - all with different tradeoffs.

Can your tool emitting the debug info track such a key while emitting the wasm binary or text?

If people want to represent unknown sections in the wast then it might need to be a binary blob for now, but I hope the community can work together to use a common data layer even if it needs the flexibility to handle both pre and post order data encodings.

@titzer
Contributor

titzer commented Feb 27, 2016

On Fri, Feb 26, 2016 at 4:02 PM, JSStats notifications@github.com wrote:

> Personally I think the locations should be based on a property of the
> source code rather than requiring annotations to the source code. For
> example, the text source file character position, or the form number
> obtained by a depth first walk of the sexp, or a list of indexes to walk to
> the node, etc - all with different tradeoffs.

The problem with numbering or text source lines is that they are not robust
to transform; they need to be transformed along with the original code, and
they cannot be preserved. Imagine a simple tool to strip nops; it would
destroy all the numbering schemes unless it had a mechanism to record the
original offsets/numbers, which is basically an annotation. User-level
annotations are thus more robust.
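The nop-stripping argument can be made concrete with a small Python sketch. The flat instruction list, the label names, and the two helpers are assumptions for illustration only.

```python
# Each entry is (label or None, opcode); labels are attached to the instruction.
body = [
    (None, "nop"),
    ("$cond", "i64.eqz"),
    (None, "nop"),
    ("$result", "i64.mul"),
]

def index_table(instrs):
    """Map each labelled instruction's label to its current position."""
    return {label: i for i, (label, _) in enumerate(instrs) if label is not None}

def strip_nops(instrs):
    """A simple transform that knows nothing about debug info."""
    return [ins for ins in instrs if ins[1] != "nop"]

before = index_table(body)           # positions recorded before the transform
stripped = strip_nops(body)
after = index_table(stripped)        # positions recomputed from the labels

# A side table keyed by the old position now points at the wrong instruction.
stale_lookup = stripped[before["$cond"]]
print(before, after, stale_lookup)
```

The label attached to the instruction survives the transform, while the position recorded beforehand for `$cond` now lands on the `i64.mul` instruction.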


@ghost

ghost commented Feb 27, 2016

Would I be correct that these annotations are not visible in the binary encoding of the wasm sections, and that even a wasm debug section would not use them?

If so then they appear to be purely a tooling issue and I'll stay out of this one :)

@titzer
Contributor

titzer commented Feb 27, 2016

On Fri, Feb 26, 2016 at 4:13 PM, JSStats notifications@github.com wrote:

> Would I be correct that these annotations are not visible in the binary
> encoding of the wasm sections, and that even a wasm debug section would not
> use them?
>
> If so then they appear to be purely a tooling issue and I'll stay out of
> this one :)

I was assuming that the translation from s-expr to binary would store a
table as a separate section, to allow recovering all annotations in-place
or in binary -> s-expr.



@ghost

ghost commented Feb 27, 2016

@titzer I presume then that the annotations table's binary encoding will use some key, probably something short like a pc offset, and so will not be robust to transforms either. I think it was conceded some time ago that tools would be expected to transform functions at function-level granularity, so tools would be expected to decode the AST and the function annotations into an intermediate representation that supports tracking the annotations while transforming the code, and then to re-encode both. This seems like a tooling issue to me for now, perhaps moving into a debug section definition to encode the location of the annotations.

@lukewagner
Member

@yurydelendik Ah hah, I see now. So to check my understanding v.2: the root problem is that you need to get the byte offsets of various nodes (for debug info) and you don't want to have to duplicate all the logic in the .wast-to-.wasm translation just to get these offsets; you want to be able to reuse the existing .wast-to-.wasm tooling and extract this data. I was thinking that this offset info could just as well be a second file output, but I think it would be more convenient to use a section inside the .wasm. For example, it'd make it easier to add to SM's wasmTextToBinary. So agreed!

The "@functionName:@labelName" @mbebenita mentioned makes sense to use as the key since it has the stability property and an obvious correspondence to what you wrote in the .wast. So concretely, this new "label-offsets" section could be a sequence of (function name string, label name string, offset) tuples. But, to save space (potentially hundreds of MB for big .wasts), we could also add one level of nesting and have the section contain a sequence of functions where each function starts with the function name string followed by a sequence of (label name, offset) pairs. This would actually have nice symmetry with the optional "func names"/"local names" sections discussed earlier.
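The two layouts for a "label-offsets" section can be compared with a short Python sketch. The function names, labels, and offsets below are made up for illustration, and counting only the characters spent on function names is a deliberately crude stand-in for encoded size.

```python
from collections import defaultdict

# Flat layout: one (function name, label name, offset) tuple per label.
flat = [
    ("fac_rec", "$3", 12),
    ("fac_rec", "$4", 20),
    ("fac_rec", "$5", 23),
    ("main", "$1", 4),
]

def nest(entries):
    """Group tuples so each function name is stored once."""
    nested = defaultdict(list)
    for func, label, offset in entries:
        nested[func].append((label, offset))
    return dict(nested)

nested = nest(flat)

# Characters spent on function names in each layout.
flat_name_cost = sum(len(func) for func, _, _ in flat)
nested_name_cost = sum(len(func) for func in nested)
print(nested, flat_name_cost, nested_name_cost)
```

In this toy input the function-name cost drops from 25 characters to 11, since `fac_rec` is written once instead of three times.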

@yurydelendik
Author

Let's think of wast as a format that helps us with discovery of WebAssembly, e.g. understanding how source code maps to the wasm AST or inspecting what information is associated with specific operations. I think having this information expressed in a distinct syntax would be useful: source-to-wast visualization utilities could be created, as could wast round-trip tooling (e.g. injecting extra diagnostics code while preserving the original source mapping), and it would help people simply learning the platform basics, while tools that are not interested in this information could easily ignore or strip it.

Currently, custom tooling can use special comments for this: they are easily strippable and do not interfere with the primary spec prototype or existing implementations. Still, having something with the status of metadata or a pre-processor directive would be nice. Simpler analogs of the LLVM syntax I mentioned above would be the #line line-control directive (in C/C++ or C#) or the CLR IL .line directive.

@AndrewScheidecker
Contributor

There are three components to this:

  • a way to add a section to a WebAssembly module that can be used to store debug and other metadata without affecting the runtime semantics of the module. That seems to be possible in the binary format, but not the s-expression text format.
  • a way to reference individual operations and intermediate values, for source location information and source variable storage information.
  • a schema for source-level debug information. IIUC this is beyond WebAssembly's scope, but hopefully at least file/line information can be standardized.

The second component is, I think, the tricky part. I think it is an argument that the text format should be more like the current stack-machine assembly output from LLVM than the S-expression AST. Either way, I don't think the text format should interleave debug information with semantic code. However, I can see it being useful for the text format to include an inline declaration to associate an operation with an identifier that can be referenced by the metadata instead of using a naked index.

Here's a sketch of what I'm imagining:

Source:

fac = n ->
    if n == 0 then 1
    else n * fac (n-1)

WebAssembly text format:

func $2 : i64 -> i64
    get_local 0      !opref0
    i64.eqz
    if
        i64.const 1  !opref1
    else
        get_local 0  !opref2
        get_local 0
        i64.const 1
        i64.sub
        call 0
        i64.mul

meta
    !file0 = (DIFile "src/factorial.cpp")
    !subprogram0 = (DISubprogram fac_rec !file0 (func $2))
    (DILocalVariable n !subprogram0 (local 0))
    (DILocation 2 4  !opref0 !subprogram0)
    (DILocation 2 18 !opref1 !subprogram0)
    (DILocation 3 8  !opref2 !subprogram0)

Even if the text format supports metadata directly, it might be useful to add a way to declare sections containing a "raw" string, similar to static data, so binaries with unknown sections can be converted to text without losing information.
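To show how the `!oprefN` identifiers keep the metadata separate from the code, the Python sketch below resolves the three `DILocation` entries from the example into a table keyed by instruction index. The tuple layouts and the `resolve_locations` helper are assumptions for illustration; a real encoding might use byte offsets instead of indices.

```python
# (instruction text, opref or None), in the order shown in the sketch above.
instructions = [
    ("get_local 0", "!opref0"),
    ("i64.eqz", None),
    ("if", None),
    ("i64.const 1", "!opref1"),
    ("else", None),
    ("get_local 0", "!opref2"),
    ("get_local 0", None),
    ("i64.const 1", None),
    ("i64.sub", None),
    ("call 0", None),
    ("i64.mul", None),
]

# (line, column, opref) taken from the DILocation entries in the meta block.
locations = [
    (2, 4, "!opref0"),
    (2, 18, "!opref1"),
    (3, 8, "!opref2"),
]

def resolve_locations(instrs, locs):
    """Map instruction index -> (line, column) via the opref identifiers."""
    opref_to_index = {ref: i for i, (_, ref) in enumerate(instrs) if ref is not None}
    return {opref_to_index[ref]: (line, col) for line, col, ref in locs}

line_table = resolve_locations(instructions, locations)
print(line_table)
```

Because the metadata only mentions identifiers, the code section stays free of debug information and the index table can be rebuilt after any transform that keeps the identifiers attached.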

@rossberg
Member

This issue is old, but most of what's been discussed here is beyond the scope of the core Wasm spec and its S-expression format. => Closing.

ngzhian added a commit to ngzhian/spec that referenced this issue Nov 4, 2021
* Support i16x8 and implement neg, add, sub, mul

Create a new module I16, which uses Int.Make, and is backed by Int32. It
reuses a bunch of logic in Int. It stores 16-bit integers sign-extended
in Int32. This means that -1 (0xFFFF) is stored as 0xFFFFFFFF, rather
than 0x0000FFFF. All the bytes decode/encode logic is also done using
the signed form.

* Remove debug code, better names, make sign extend check more generic
dhil pushed a commit to dhil/webassembly-spec that referenced this issue Mar 2, 2023
* Update index of instructions.

Fixes Issue WebAssembly#258

-  Python script tweak, TRY has now two validation and two execution rules.
8 participants