Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roadmap to determinism in WASI #190

Open
kubkon opened this issue Jan 3, 2020 · 18 comments
Open

Roadmap to determinism in WASI #190

kubkon opened this issue Jan 3, 2020 · 18 comments
Labels
discussion A discussion that doesn't yet have a specific conclusion or actionable proposal.

Comments

@kubkon
Copy link
Contributor

kubkon commented Jan 3, 2020

The discussion about enforcing (ensuring?) determinism in WASI has already been started and touched upon in a couple of issues here and there (#185, #118, bytecodealliance/wasmtime#748, if I missed any, please feel free to mention it in this thread). I'd like to gather all the knowledge, ideas, perceived issues, etc. here creating essentially a meta-issue that we could use to track this, and come up with solutions, or at least guidance as to what direction to take.

I'll try and describe all potential sources of nondeterminism below leaving out sockets for now though. Feel free to correct me, add more, etc.

Randomness and entropy

This is an obvious one, and from what I understand, the current consensus is to have it require a capability (see #185 and bytecodealliance/wasmtime#748 for more details). random_get also will get its own module in the upcoming WASI snapshot: wasi_ephemeral_random.witx.

Clocks

Access to system/thread/process clocks will also lead to nondeterminism, and as far as I understand, like in the randomness case, the consensus is to have it require a capability (see #118 and bytecodealliance/wasmtime#748 for more details). Also as in the randomness case, clock_time_get will get its own module in the upcoming WASI snapshost: wasi_ephemeral_clock.witx.

File access/modification/change times

This one concerns four WASI syscalls that may introduce nondeterminism into the picture, namely: fd_filestat_get, fd_filestat_set_times, path_filestat_get, and path_filestat_set_times. The nondeterminism may sneak in if the client app makes use in some way of the access (atim), modification (mtim) or change (ctim) times of a file descriptor which can change between any two runs of the app.

I'm not sure what the best approach to handle this would be, so I'd like to start some brain storming on this. Could we only perhaps populate the filestat time values if a clock capability was requested/provided? Or introduce a different type of capability?

Readdir

This one is potentially of lesser importance/impact since, I assume that on the same host, fd_readdir syscall should return the same ordering of the entries---according to the macOS man of readdir

Note that the order of the directory entries vended by readdir() is not specified. Some filesystems may return entries in lexicographic sort order and others may not.

I assume similar will hold on all *nixes, so as long as the same host with the same filesystem is used, the order should be the same between the app runs. The problem, however, may become more pronounced in distributed settings where we'd like to execute an app on two unknown and potentially different hosts, and expect deterministic, comparable results on both.

There already was some discussion about ordering of results, seeking, and fd_readdir in general in #61.

Poll

I'm adding that one in since I remember having a discussion with @marmistrz about this one, and he was convinced he could generate entropy with a clever use of poll_oneoff, hence bringing in nondeterminism into the picture. @marmistrz perhaps you could shed more light on this?

@programmerjake
Copy link
Contributor

Looks good to me!

Should we also address potential races with other processes and/or threads or just leave that unaddressed as out-of-scope-for-now?

@sunfishcode
Copy link
Member

Here are a few more ideas:

Filesystems:

  • whether filesystems are case-sensitive or what form of case conversions they use (https://github.com/WebAssembly/WASI/issues/72)
  • the character sets used by host filesystems may determine what filenames are valid -- WASI is expected to use UTF-8 for its part, but it may need to be transcoded in the engine when talking to the actual filesystem (see also https://github.com/WebAssembly/WASI/issues/8)
  • the set of valid characters permitted by host filesystems may vary too -- is backslash a valid file name?
  • the dev/ino numbers in a filestat, and the returned cookie value in a readdir, expose bits of the fs implementation
  • whether you can successfully delete a file while it is open
  • running out of free space in a file system or hitting other resource limits
  • the maximum length of a filename, the maximum directory nesting depth, or other fixed limits
  • probably several things with symlinks

Other:

  • Currently there is some ambiguity in read and write about the circumstances under which they're permitted to read or write fewer bytes than requested.
  • If you poll for a combination of I/O and timer events, does the I/O become ready before the timer expires? A particularly tricky case: polling for a timer of 0 nanoseconds and a local file descriptor at the same time.
  • WASI doesn't currently have the guarantee that path_open will return the next available file descriptor number; it could return any available number.

@devsnek
Copy link
Member

devsnek commented Jan 3, 2020

cookies and fds could be fixed by using anyref, right?

@sunfishcode
Copy link
Member

Yeah. To handle cookies with anyref we'd probably need to reorganize the readdir API so that it isn't just writing out dirent records into a raw buffer, but that's something that we may want to do anyway.

@marmistrz
Copy link
Contributor

@kubkon I probably meant a method which implied using multiple threads, but are such methods in scope for now?

If we have multiple threads, then poll could be leveraged to get a safe race condition (i.e. a race condition which doesn't lead to undefined behavior), but this required multiple threads. There are other places where one could get a safe race conditions, e.g. path_open using multiple threads. I can describe the precise method I'm thinking about if it's in scope.

IMO, there are other, more fundamental ways of achieving non-determinism in a single-threaded WASI program: (I'm not sure if the runtime takes any special care about any of these issues)

  • access to uninitialized memory, out-of-bound memory access or pointers to expired objects: UB in C/C++, which may end up reading some data previously used by the program. This may generate entropy if there are pointers stored on the stack and ASLR is enabled - this is a common method of circumventing ASLR when pwning, actually. We probably should always zero-initialize variables unless they haven't been explicitly initialized.
  • reaching the end of a non-void function without a return statement
  • integer division by minus one is architecture-dependent
  • out of memory conditions
  • with the filesystem access not constrained enough, the file size may be a source of non-determinism. In particular, blocking it in fd_filestat_get is not enough, since one may try to find out the file size using lseek.
  • whether it's possible to create a link to a non-existing file (possible on Linux, impossible on Windows)

As for clocks, we could probably expose a logical clock, where every query would increase the time by (say) 1 second. This may be useful for programs which use clocks in some minor way.

As for randomness, disabling it completely may be too heavy-handed an approach. Some applications require pseudo-randomness, in particular some numerical simulations or Monte Carlo methods. Some simple approach would be to allow passing a seed value for runtime and use a seeded, thread-specific deterministic rng with the seed provided.

@devsnek
Copy link
Member

devsnek commented Jan 6, 2020

Some of these items are not issues in wasm:

  • wasm doesn't have uninitialized memory or out-of-bound memory accesses
  • wasm can't reach the end of a non-void function without a return statement (or rather, without pushing the appropriate values to the stack)
  • that -1 division issue is intrinsic to C, not division in general
    • (and note that overflowing in wasm is well defined to trap)

@sunfishcode
Copy link
Member

  • Whether or not opening a file for both O_TRUNC and O_APPEND at the same time succeeds.

@Serentty
Copy link

The issue of filenames is so incredibly complex that I wonder if it is even possible in the first place without exposing some details of the OS to the software, or if the only solution that would give complete determinism is to expose a virtual filesystem inside of an image.

Take case mapping for example. I'm fairly sure that both Apple's and Microsoft's implementations of it would allow certain filename pairs that the other wouldn't. A cross-platform set of rules which doesn't allow anything which would be forbidden on any major platform might work for a while, but in the case of Windows the case mapping isn't stable, and is subject to change with updates. After such an update, it would be possible for OS-specific behaviour to suddenly become visible.

@Evrey
Copy link

Evrey commented Jan 12, 2020

Yes, the only possibility to make this deterministic is by using a virtual FS layer for WASI that does not even have text as its paths but an array of numeric VFS node identifiers.

  • Windows goes boom for paths over 255 UTF-16 code units in length, unless you do \\?\ magic.
  • Windows also effectively goes boom when using \\?\, as a lot of software seemingly can't handle this style of path in a user-friendly fashion, if at all.
  • On many file systems node names can't be longer than 255 UTF-8 code units.
  • Well, FAT16 still exists. Happy 8.3 names.
  • All the file systems and all the OSs have way different ideas on what is and is not allowed in a node name or path.
  • So yes, software could break by just changing the file system on the same OS.
  • Don't forget special magical files like Windows' COM1, LPT7, etc., which also apply for all extensions like LPT4.mp4, or magic like Windows' file streams some_file.txt:stream1.
  • And all that on top of things like case handling of text in paths.

So the safest strategy would be to go numeric. Have a virtual node 0xDEADBEEF, name it deadbeef or whatever in the host OS's file system. If you were to access an OS-provided node like Downloads, give it a not yet assigned virtual node ID, or a pre-defined constant one. (1 could point at the user home, then [1, 4] may be $HOME/Downloads, etc.) And when the user tries accessing the file system, let them fetch the correct node IDs by performing a fuzzy search on a text name. Something like let downloads_id = home_dir.find_by_name("Downloads");, which may return multiple suggestions.

Note that this is just all a maximum caution suggestion post. Whatever people may come up with here may look very differently.

@Serentty
Copy link

@Evrey Even then, normal OS paths would have to be read-only, to prevent experimentation with file naming rules.

My thoughts on this would be that it would make sense to have two tiers of filesystem access. A virtual but completely deterministic filesystem, which could be used for things like reproducible builds, and nondeterministic access to the real filesystem.

@cchudant
Copy link

We could make a capability that allows this two-tiered access - when used, filename validation would be skipped by the runtime so that any filename would be allowed (as long as the underlying os allows it)

@kubkon
Copy link
Contributor Author

kubkon commented Jan 15, 2020

This thread exposed a lot more potential problems than I expected, some of which could potentially also impact portability IMHO, so I'm glad I started this discussion. Kudos to everyone for active participation and ideas, this is highly appreciated! :-)

In terms of the problem of nondeterminism, I was wondering, if we were to constrain the problem space somewhat more to the application of simple compute functions (I guess you could think AWS Lambda or similar) and worry only about deterministic result of the computation, then I figured we potentially wouldn't have to touch so many avenues where everything might go wrong. In essence, since one of the cornerstones of WASI is preopens, I was wondering if we could toy with the idea of "preopening" any sort of a byte stream which then would get assigned an Fd value, and we could for instance only either read or write from it, this would essentially abstract away the problem of dealing with actual raw fds/handles on the host. Then, a WASI program targeting such a compute environment would only ever require (at the very minimum) fd_read for reading from a byte stream, fd_write for writing to a byte stream, fd_fdstat_* for manipulating rights, and fd_close.

Here, such a byte stream could be anything from a preopened and preloaded file from host to a (named) pipe to a socket. Now that I mention all this, I believe someone already mentioned something similar in the past. @sunfishcode was it you or perhaps @pchickey? In any case, IMHO this kind of abstraction would lend itself nicely to the problem of edge/serverless computing, or in fact, any environment where we'd like to run a MapReduce style problems in a distributed fashion (deterministic or not).

In any case, I hope this makes sense, if not, please shout out. I'd love to hear your thoughts on this!

@sunfishcode
Copy link
Member

@kubkon Not entirely by accident, this aligns well with the design principles I'm proposing, specifically the parts about shared filesystem views :-).

@kubkon
Copy link
Contributor Author

kubkon commented Jan 16, 2020

@sunfishcode Oh oops, my apologies! Somehow, I forgot about this completely! Anyhow, this is excellent and indeed aligns itself really well the your proposal in #192. Very exciting! :-)

@kubkon
Copy link
Contributor Author

kubkon commented Jan 28, 2020

Out of curiosity and in preparation for my talk about determinism in WASI in distributed settings at FOSDEM2020, I've started tinkering a little bit with wasmtime, and came up with a working draft of the "compute" functions mentioned above which you can find in this repo: kubkon/wasi-compute. I've even managed to use this way an existing library written in C which is pretty satisfying. Comments are welcomed!

@sunfishcode I'd love your input on this and how this could perhaps be a starting point for discussions towards facilitating (even as a draft) your proposal on the design principles!

@kubkon kubkon added the discussion A discussion that doesn't yet have a specific conclusion or actionable proposal. label Jan 28, 2020
@indolering
Copy link

Some applications require pseudo-randomness, in particular some numerical simulations or Monte Carlo methods. Some simple approach would be to allow passing a seed value for runtime and use a seeded, thread-specific deterministic rng with the seed provided.

Perhaps unrandom? Because the output of PRNGs regularly to leak into things that can cause DoS attacks or are used inappropriately for security sensitive purposes, I would like to propose a PRNG whose output cannot be trivially predicted. I believe PCG is still the only PRNG that would tick all of these boxes. Note that the seed for the function should be salted and hashed using a cryptographic hash function, to prevent similar seeds from producing correlated outputs.

@sunfishcode
Copy link
Member

On the topic of cryptographic uses of randomness versus determinism, I filed WebAssembly/wasi-crypto#30 to investigate whether wasi-crypto is deterministic. It's interesting because it allows applications to do crypto without interacting with private key data directly.

Beyond that, thinking about randomness APIs, I think we can make two top-level categories that we care about:

  • APIs where nondeterminism is part of the API contract, and users may depend on it to generate eg. private keys. A deterministic environment would probably just opt to exclude these APIs (and optionally tell their users to use wasi-crypto instead), but other environments may want them.
  • APIs where nondeterminism may be present but isn't strictly required. These seems like they can be modelled as an input stream that you read from with the same APIs you use to read from a file or socket. These sources may be nondeterministic (just as data in external filesystems may be), but need not be. From a determinism perspective, an environment could provide deterministic versions of them if it wanted to.

@indolering
Copy link

indolering commented Nov 16, 2020

I think this discussion would be helped by incorporating the interplay between determinism and reproducibility. Determinism is not something that is desirable in most deployments, as it can lead to DoS attacks and screws over crypto in ways that may be difficult to fix. If a programmer wants to introduce non-determinism, they could exploit side channels or simply engage in feature detection.

Usually when someone says they want determinism, what they really want is reproducibility. A CPU or bytecode should be fully deterministic, but a runtime should be reproducible. For example, if you need a seed for hash tables that is externally difficult to guess but not cryptographically secure, then an unrandom API could incorporate the hash of the WASM binary to generate a statistically random output that is reproducible.

I would still lockout any non-deterministic APIs. So a real random function call (possibly providing the exact same API) would remain, but it would not be allowed to return deterministic randomness except in a runtime-level debug mode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion A discussion that doesn't yet have a specific conclusion or actionable proposal.
Projects
None yet
Development

No branches or pull requests

9 participants