Add initial ProducersSection.md #65

Merged
merged 17 commits into from
Nov 16, 2018
130 changes: 130 additions & 0 deletions ProducersSection.md
@@ -0,0 +1,130 @@
# Producers Section

The purpose of the producers section is to provide an optional,
highly-structured record of all the distinct tools that were used to produce
a given WebAssembly module. A primary purpose of this record is to allow
broad analysis of toolchain usage in the wild, which can help inform both wasm
producers and consumers.

The producers section is a
[custom section](https://webassembly.github.io/spec/core/binary/modules.html#custom-section)
and thus has no semantic effects and can be stripped at any time.
Since the producers section is relatively small, tools are encouraged to emit
the section or include themselves in an existing section by default, keeping
the producers section even in release builds.

An additional goal of the producers section is to provide a discrete, but
easily-growable [list of known tools](#known-tools) for each record field. This
avoids the skew that otherwise happens with unstructured strings. Evergreen
WebAssembly consumers (like browsers) are encouraged to emit diagnostics that
prompt producers to register new field values in this document. However, an
unknown tool does not make the producers section invalid and all consumers
should gracefully handle unknown tool names.

Since version information is useful but highly-variable, every field value is
optionally suffixed with a parenthesized version string which is not checked
against any known list.

# Known tools

The following lists contain all the known tool names for the fields listed below.
**If your tool is not on this list and you'd like it to be, please submit a PR.**
Contributor

While I understand why we want strict validation for source lang/tool, I really doubt that any tool will be able to keep up to date with the number of combinations that we can expect in the future.

Member Author

I should probably nuance this a bit in the doc, but my thinking is that having a tool name outside the known list wouldn't be a validation error, just a thing that evergreen consumers like browsers could warn about (to provide gentle pressure) but not reject.

Member Author

PTAL at new wording regarding how tool names are checked.

Contributor

I'm just worried about the maintainability of such a list, even if it's just for information. I don't see why an unrecognized pipeline would be a warning, or at least displayed to the end user.

Member Author

Well I guess those are two separable issues: attempting to maintain a list and having consumers issue warnings. In both cases, I may well be overly idealistic in thinking they could work (would not be the first time...), but since it's quite easy to just relax these requirements later, it seems worth it to at least shoot for the ideal.

Member Author

Agreed that unknown tools shouldn't ever be a real problem, with a harmless diagnostic being the extent of it, if even that. Personally, I'd want to at least try this in FF for a while to help bootstrap the process, but this should be an easy thing to rip out if it's a pain and I can dial down the text in the proposal here. Another incentive is if browsers or npm analyses publish their telemetry for known tools on the list; then if you get on the list, you get free telemetry.

Member

Another thing I was thinking: if we want a known tools list, we may want to store it in a separate flat file, one item per line. That way it's easy for build scripts to grab it and use it in whatever way they want.

Member Author

That's a good idea. I was wondering about that myself too. What do you think the ideal trivial text format is?

Member

I was thinking something like:

tool1
tool2
tool3
...

You could use JSON or something more complex, but I don't think we need anything fancier than that (except maybe comments?)

Contributor

Split by line works best (we don't prevent line breaks in the names?). JSON could be useful to store additional information, and JSON5 allows comments, but it sounds like overkill to me.

Just an idea: the consumers will likely understand wast, we could store it in the data. I don't think that's a idea.


## Source Languages

* `wat`
* `C`
* `C++`
Member

No Rust?

Contributor

We are missing many more languages and tools here, but we can do a follow-up PR to keep this one focused on the RFC.

Member Author

Yeah, I was intentionally leaving out well-known producers so they can choose how to spell/capitalize/categorize their tools.


## Individual Tools

* `wabt`
Member

WABT and LLVM are acronyms; do we want those uppercase or lowercase? (My vote is uppercase)

More generally we probably shouldn't be prescriptive about how people spell their tool names, but I guess we get to decide for our own tools.

Member

I always use wabt personally, since it's really more of a backronym than an acronym. Then again, I didn't come up with the name! :-)

Member Author

I'm happy to do what you both decide. I see 👍 for wabt staying lowercase; so I'll just update LLVM for now.

* `llvm`
* `lld`
* `Binaryen`

## SDKs

* `Emscripten`

# String formats

The binary encoding of record fields uses the standard
[name encoding](https://webassembly.github.io/spec/core/binary/values.html#names)
used elsewhere in wasm modules. However, the producers section imposes additional
validity constraints on the UTF-8-decoded code points of these strings.

## Atom

An "atom" is a sequence of code points containing anything *other* than
parentheses and commas (which are the only relevant separators in producer
section strings).

JS Pattern: `/[^(),]+/`

Example tool name strings:
* wabt
* c++
* ☃
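
As a rough illustration of the atom rule, the check below wraps the JS pattern above in `^...$` so that the whole string must match. The anchors, the `u` flag, and the `isAtom` helper name are assumptions of this sketch, not part of the proposal.

```javascript
// Anchored form of the atom pattern: one or more code points,
// none of which is "(", ")" or ",".
const ATOM = /^[^(),]+$/u;

// Hypothetical helper: true when the whole string is a single atom.
function isAtom(s) {
  return ATOM.test(s);
}

console.log(isAtom("wabt")); // true
console.log(isAtom("c++"));  // true
console.log(isAtom("a,b"));  // false (contains a comma)
console.log(isAtom(""));     // false (an atom is non-empty)
```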

## Tool-version string

A tool-version string is an atom identifying the tool name followed by
an optional parenthesized atom identifying the version.

Pattern:
* Logical: [`Atom`](#atom) ( `(` [`Atom`](#atom) `)` )?
* JS: `/[^(),]+(\([^(),]*\))?/`

Example tool-version strings:
* a
* c++(11)
* ☃(1.0.☃)
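
A minimal sketch of splitting a tool-version string into its two parts, using the JS pattern above with capture groups added. The `^...$` anchors, the `u` flag, and the `parseToolVersion` name are assumptions made for illustration only.

```javascript
// The JS pattern above, anchored, with the name and version captured.
const TOOL_VERSION = /^([^(),]+)(?:\(([^(),]*)\))?$/u;

// Hypothetical helper: returns { name, version } or null when the
// string does not match. `version` is null when no parens are present.
function parseToolVersion(s) {
  const m = TOOL_VERSION.exec(s);
  if (m === null) return null;
  return { name: m[1], version: m[2] === undefined ? null : m[2] };
}

console.log(parseToolVersion("c++(11)")); // { name: 'c++', version: '11' }
console.log(parseToolVersion("a"));       // { name: 'a', version: null }
console.log(parseToolVersion("a(1"));     // null (unbalanced parenthesis)
```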

## Tool-version set string

A tool-version set string is a possibly-empty, comma-delimited list where each
contained tool name string is unique.

Pattern (ignoring uniqueness requirement):
* Logical: ( [`Tool-version string`](#tool-version-string) `,` )* [`Tool-version string`](#tool-version-string)
* JS: `/([^(),]+(\([^(),]*\))?,)*[^(),]+(\([^(),]*\))?/`

Example tool-version set strings:
* a
* a(1.0)
* llvm(20.3-beta),binaryen,lld(1.3),webpack(4)
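
The uniqueness requirement is not expressible in the pattern itself, so a consumer would need a separate check. The sketch below is one way to do that: it splits on commas (safe because atoms cannot contain them), matches each item against the tool-version pattern, and rejects repeated tool names. `parseToolVersionSet` is a hypothetical name, and the sketch follows the pattern rather than the prose, so an empty string is rejected.

```javascript
// Hypothetical helper: returns an array of { name, version } entries,
// or null if any item is malformed or a tool name appears twice.
function parseToolVersionSet(s) {
  const item = /^([^(),]+)(?:\(([^(),]*)\))?$/u;
  const seen = new Set();
  const tools = [];
  for (const part of s.split(",")) {
    const m = item.exec(part);
    if (m === null) return null;        // malformed tool-version string
    if (seen.has(m[1])) return null;    // tool names must be unique
    seen.add(m[1]);
    tools.push({ name: m[1], version: m[2] === undefined ? null : m[2] });
  }
  return tools;
}

const parsed = parseToolVersionSet("llvm(20.3-beta),binaryen,lld(1.3),webpack(4)");
console.log(parsed.map((t) => t.name)); // [ 'llvm', 'binaryen', 'lld', 'webpack' ]
console.log(parseToolVersionSet("a,a(1.0)")); // null (duplicate tool name)
```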

# Custom Section

Custom section `name` field: `producers`

The producers section may appear only once, and only after the
Collaborator

One thing I was thinking recently about this is that from a producer's perspective it'd be nice to relax this to saying that it can appear multiple times. Projects like rustc don't actually (or at least eventually won't) have any WebAssembly encoding/decoding functionality. The Rust compiler, for example, would exclusively rely on LLVM/LLD to produce the wasm file.

To that end it'd be easiest for Rust to simply seek to the end of the file and append a few bytes, possibly adding a duplicate producers section. The binary format is admittedly quite easy to parse, but producers appending values would have to find an existing section, if any, augment the list with another entry (or make a new list if one wasn't present), and then re-encode the section back out.

I think from a consumer perspective it might not be too hard to concatenate as well? In that sense I'm not sure that there's too big of a downside of allowing multiple sections to exist other than "it feels less clean"

Member Author

I can see how it would be easier, but I worry that this will create complexity for consumers (intermediate and otherwise) everywhere throughout the toolchain. It seems like the extra work of decoding and injecting a tool into the producers section could be handled by a single trivial command-line tool that you'd use like `wasm-add-producer-tool key_name value_name`. Alternatively, tools like lld could be liberal in what they accept so that wasm object files could have multiple producers sections but the output had a single merged producers section.

Collaborator

I was thinking though that some of the complexity which multiple instances of this custom section might add is sort of already there? For example, as specified, consumers would have to handle a section with multiple processed-by fields as it's not necessarily guaranteed that they're all concatenated in one field with commas?

If that's the case, then it seems that processing independent entries already implies some degree of merging logic and now it'd just span sections instead of being within one section, which in theory wouldn't be adding all that much more complexity?

Member Author

Well, it's not so much the complexity of a correct implementation as the likelihood that everyone independently does the correct thing.

For example, you make a good point w.r.t the field names; it'd be good I think to stipulate that they are unique like JSON.

Collaborator

Oh my mistake, I assumed that duplication of fields was intentional! Without that allowed it's definitely a different kind of decision to allow multiple.

I still personally feel that this would ideally be relaxed for producers, as there are likely far more producers than consumers. I don't really feel too strongly either way though, I'm happy to implement whatever in rustc and wasm bindgen!

Member

It seems odd to me that we have a nice structure with a set of fields (just like elsewhere in the binary format) but then one of those fields is just a comma-separated text string. And of course that's the field that the most tools will have to modify. I would have expected to have multiple processed-by fields (and why not multiple language fields too?). For that matter, how do we decide which tool is the "sdk"? Is it Emscripten, or the Unity SDK that embeds Emscripten?

In terms of @alexcrichton's concerns about section-munging ease, having field duplication is sort of the worst of both worlds because consumers have to deal with multiples, and intermediate tools have to decode and re-encode the section. But I'd think that any tool that otherwise modifies the binary at all would easily have the primitives for that, and tools that don't modify the binary will probably be using primitives like WABT (we should definitely add a tool like wasm-add-producer-tool or objcopy or whatever to WABT).

Member

I tend to agree with derek regarding the comma separated thing. Would make more sense to have each item be its own string, preceded by a count. Alternatively require duplication.

WRT ease of concatenation, I'm a little sad that we lose the ability to do this in a generic way a la SHF_STRINGS, but lld already does a whole lot of custom combination logic, so I guess that ship has sailed.

Member Author

Good points!

Regarding the unnecessary use of commas in the string to provide structure that could instead be in the containing binary format: agreed. For that matter, the magic parens could also be removed, and then there would be no text analysis at all: just strings without any constraints. I'll update to do that.

Regarding limitation to single language/SDK: yeah, thinking about it again, you could have multiple of each. Will update.

[Name section](https://webassembly.github.io/spec/core/appendix/custom.html#name-section).

lukewagner marked this conversation as resolved.
The producers section contains a sequence of fields with unique names, where the
end of the last field must coincide with the last byte of the producers section:

| Field | Type | Description |
| ----------- | ----------- | ----------- |
| field_count | `varuint32` | number of fields that follow |
| fields | `field*` | sequence of `field` |

where a `field` is encoded as:

| Field | Type | Description |
| ----------- | ---- | ----------- |
| field_name | [name](https://webassembly.github.io/spec/core/binary/values.html#names) | name of this field, chosen from one of the set of valid field names below |
| field_value | [name](https://webassembly.github.io/spec/core/binary/values.html#names) | a string which match the specified pattern according to the table below |
Member

"match" -> "matches" (or "must match")


Each field_name in the list must be unique and found in the following table:

| field_name | field_value pattern | Valid tool names |
| -------------- | -------------------- | --------- |
| `language` | [Tool-version string](#tool-version-string) | [source language list](#source-languages) |
| `processed-by` | [Tool-version set string](#tool-version-set-string) | [individual tool list](#individual-tools) |
| `sdk` | [Tool-version string](#tool-version-string) | [SDK list](#sdks) |
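
A minimal sketch of how the section payload described by the tables above might be encoded and decoded. It assumes `varuint32` is unsigned LEB128 and that a `name` is a `varuint32` byte length followed by UTF-8 bytes, as in the linked name encoding. All function names are hypothetical. The sketch covers only the payload (not the enclosing custom section header), and it does not check field names or values against the table of valid fields.

```javascript
// Encode an unsigned 32-bit integer as LEB128 (assumed varuint32 encoding).
function encodeVaruint32(n) {
  const out = [];
  do {
    let byte = n & 0x7f;
    n >>>= 7;
    if (n !== 0) byte |= 0x80; // more bytes follow
    out.push(byte);
  } while (n !== 0);
  return out;
}

// Encode a name: byte length, then the UTF-8 bytes.
function encodeName(s) {
  const bytes = new TextEncoder().encode(s);
  return [...encodeVaruint32(bytes.length), ...bytes];
}

// fields: array of [field_name, field_value] pairs.
function encodeProducersPayload(fields) {
  const out = encodeVaruint32(fields.length);
  for (const [name, value] of fields) {
    out.push(...encodeName(name), ...encodeName(value));
  }
  return Uint8Array.from(out);
}

// Decode a payload (Uint8Array) into a Map, enforcing unique field names
// and that the last field ends exactly at the last byte of the payload.
function decodeProducersPayload(bytes) {
  let pos = 0;
  function readVaruint32() {
    let result = 0;
    let shift = 0;
    for (;;) {
      if (pos >= bytes.length) throw new Error("unexpected end of section");
      const byte = bytes[pos++];
      result |= (byte & 0x7f) << shift;
      if ((byte & 0x80) === 0) return result >>> 0;
      shift += 7;
      if (shift > 28) throw new Error("varuint32 too long");
    }
  }
  function readName() {
    const len = readVaruint32();
    if (pos + len > bytes.length) throw new Error("name runs past end of section");
    const decoder = new TextDecoder("utf-8", { fatal: true });
    const s = decoder.decode(bytes.subarray(pos, pos + len));
    pos += len;
    return s;
  }
  const count = readVaruint32();
  const fields = new Map();
  for (let i = 0; i < count; i++) {
    const name = readName();
    const value = readName();
    if (fields.has(name)) throw new Error("duplicate field name: " + name);
    fields.set(name, value);
  }
  if (pos !== bytes.length) throw new Error("trailing bytes after last field");
  return fields;
}

const payload = encodeProducersPayload([
  ["language", "C++(11)"],
  ["processed-by", "llvm(20.3-beta),lld(1.3)"],
]);
console.log(decodeProducersPayload(payload).get("language")); // C++(11)
```

Keeping the uniqueness and end-of-section checks in the decoder means a malformed section fails in one place, rather than each consumer deciding separately how to merge or truncate.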

# Text format

TODO
Contributor

Is there any advantage of using a syntax for that? I think that when custom sections are available in wast it will be easy to declare the producer.

Member Author

Well even when custom sections are in wast, you'd still have to write out the encoded binary, which seems unpleasant to read or write. For example, if you look at a wasm module in the browser debugger, it'd be nice if you simply saw the toolchain.