
WASM support #164

Open
lei-rs opened this issue Sep 29, 2023 · 10 comments

Labels
new ml framework Adding support for a new runner/ML framework

Comments

@lei-rs
Contributor

lei-rs commented Sep 29, 2023

How easy would it be to add support for Rust-based libraries like Candle and Burn? I'd like to implement this if you aren't already working on it. I'd also appreciate your thoughts on whether this integration is even necessary or useful, since both packages allow you to compile everything down. Maybe it would make more sense to instead create runners for the formats those libraries can produce, like binaries, wasm, and executables.

@VivekPanyam
Owner

VivekPanyam commented Sep 30, 2023

Hi, thanks for the interest in contributing!

I've looked at Candle and Burn a little, but not in a ton of depth. From what I've seen, the code is the model and then you can possibly store weights separately (vs something like TorchScript where the code defines a model structure that can be exported and stored with the weights). Is that correct?

Do either of them have a serializable and loadable format that contains both weights and model structure? In some sense, I guess an executable is that, but the downside is that it might not be super portable. I guess my question is do both packages allow you to compile everything down or force you to compile everything down?

WASM Runners

I think a runner that can run WASM code could be really interesting. That's something I've thought about a little bit, but haven't fully explored.

If we build something that supports WebGPU (maybe through wgpu) as well, that could give us a cross-platform runner for these artifacts that supports inference on CPU and GPU. I'm not sure how the details of that would work, but that certainly seems appealing.

On the packaging side, we'd probably define a set of functions that the WASM module has to implement (similar to how we handle arbitrary Python code) and then any program that can compile down to WASM and implement that interface will work with this runner.
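
For illustration only, here's a rough sketch of what the guest side of such an interface could look like if expressed as a Rust trait. Every name and type below is a hypothetical placeholder, not a proposed design:

```rust
// Hypothetical guest-side interface sketch. Nothing here is an actual
// Carton interface; it only illustrates "a set of functions the WASM
// module has to implement".

/// Element type of a tensor (illustrative subset).
pub enum DType {
    F32,
    I64,
}

/// A tensor passed across the host/guest boundary as raw bytes plus metadata.
pub struct Tensor {
    pub dtype: DType,
    pub shape: Vec<u64>,
    pub data: Vec<u8>,
}

/// Functions a packaged WASM model would export.
pub trait Model {
    /// Construct the model (weights may be compiled in or loaded from
    /// packaged artifacts).
    fn load() -> Self
    where
        Self: Sized;

    /// Run inference on named input tensors and return named outputs.
    fn infer(&mut self, inputs: Vec<(String, Tensor)>) -> Vec<(String, Tensor)>;
}
```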

This seems pretty promising to me. Would you be interested in exploring/helping to implement it?

@lei-rs
Contributor Author

lei-rs commented Sep 30, 2023

Great insights. I can only speak for Candle, but you are right: the code is not detachable from the model, and you have to compile everything. I don't think there is an idiomatic way to support models in the form of code + weights; it would involve some sort of JIT. It would definitely make more sense to support their compilation targets instead.

If we build something that supports WebGPU (maybe through wgpu) as well, that could give us a cross-platform runner for these artifacts that supports inference on CPU and GPU. I'm not sure how the details of that would work, but that certainly seems appealing.

Definitely agree. We can define standard APIs for various languages that have wasm targets, which would allow us to create runners for not only wasm but normal binaries as well. We can use the wasmtime crate for the wasm runner. Seems pretty doable, so I'll create a demo of what this might look like.
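
As a rough idea of how the wasmtime side of such a demo might start (the module path, the `infer` export, and its signature are placeholders for illustration, not anything Carton defines):

```rust
// Minimal wasmtime sketch: load a module and call an exported function.
// "model.wasm", the "infer" export, and its (i32, i32) -> i32 signature are
// assumptions for illustration only.
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::from_file(&engine, "model.wasm")?;
    let mut store = Store::new(&engine, ());

    // No imports for this sketch; a real runner would provide host functions.
    let instance = Instance::new(&mut store, &module, &[])?;

    // Call a hypothetical exported `infer(ptr, len) -> out_ptr` after the host
    // has written input tensors into the module's linear memory.
    let infer = instance.get_typed_func::<(i32, i32), i32>(&mut store, "infer")?;
    let out_ptr = infer.call(&mut store, (0, 0))?;
    println!("guest reported output at offset {out_ptr}");
    Ok(())
}
```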

@VivekPanyam
Owner

Seems pretty doable, so I'll create a demo of what this might look like.

Sounds good!

A few more thoughts:

WebGPU

As far as I know, there currently isn't a standard way of exposing WebGPU to Wasm. It looks like many projects that use WebGPU from WASM have JS shims that expose the relevant browser functionality as custom APIs to Wasm that the Wasm code knows how to use.

We could also expose WebGPU as an API to Wasm ourselves, but I don't think that provides much value because users would have to explicitly design and build against that API (vs it being something that just works). So this might be difficult to do until there's a standard way to use GPUs from Wasm.

As a workaround, we could implement the API that wgpu expects when running from Wasm. Not sure if that's the best use of time at the moment though.

Wasm vs Native

I think it's worth exploring Wasm vs Native (on CPU) performance for some popular models. Is Wasm fast enough relative to native alternatives that the portability outweighs the lower performance?

That's to say, would people actually want to ship a model in Wasm if they aren't just targeting browsers?

Native binaries

I think this requires us to be quite thoughtful about design:

We can define standard APIs for various languages that have wasm targets

For native binaries, I think we likely don't want to do the above. I think it could turn into a lot of work for questionable gain. A good balance might be something like "you can package up a .so file and/or .dylib and it must implement this C interface." I spent a bit of time thinking about what that interface would look like and it's not too complex. That should make it relatively easy for people to build stuff in C, C++, and Rust. And if there's demand, people could implement libraries for different languages that make it easy to implement the C interface we require.
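
For concreteness, here's a very rough sketch of what implementing such a C interface from Rust might look like. Every name, signature, and convention below is a hypothetical placeholder rather than an actual interface:

```rust
// Hypothetical sketch of the C interface a packaged .so/.dylib could implement
// (implementer's side, in Rust). All names, signatures, and the dtype encoding
// are illustrative assumptions, not an actual Carton interface.
use std::os::raw::c_void;

#[repr(C)]
pub struct CTensor {
    pub dtype: u32,          // element type tag (e.g. 0 = f32)
    pub rank: u64,           // number of dimensions
    pub shape: *const u64,   // pointer to `rank` dimension sizes
    pub data: *const c_void, // pointer to contiguous element data
}

/// Load the model and return an opaque handle owned by the library.
#[no_mangle]
pub extern "C" fn model_load() -> *mut c_void {
    // A real implementation would construct the model (e.g. a Candle module)
    // and Box::into_raw it; this stub just returns null.
    std::ptr::null_mut()
}

/// Run inference on `num_inputs` tensors, writing results to `outputs`.
/// Returns 0 on success, nonzero on error.
#[no_mangle]
pub extern "C" fn model_infer(
    _handle: *mut c_void,
    _inputs: *const CTensor,
    _num_inputs: u64,
    _outputs: *mut CTensor,
    _num_outputs: u64,
) -> i32 {
    0
}

/// Free the handle returned by `model_load`.
#[no_mangle]
pub extern "C" fn model_destroy(_handle: *mut c_void) {}
```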

Now that I'm thinking more about it, that approach could work for Wasm too.

We'd also have to spend more time thinking about security and portability. You can already limit a model to specific platforms so that isn't a huge deal, but native dependencies could make models frustrating to use (e.g. if people don't ship fully statically linked binaries, require a particular system configuration, etc). Wasm avoids a lot of that.

You could solve a lot of the UX problems I just described by letting people package up a Docker/OCI image, but that introduces some other problems. For example, if Carton is already being run in a container, we'd need to use nested containers or talk to the Docker daemon on the host. Both of these have a lot of issues. This is also a Linux only solution (which might be okay, but it's worth noting).

My TODO list also includes tasks around better model isolation from the system so I think we'd want to ship those before making it easy for people to run arbitrary native code. Technically many ML models are just programs (see the TensorFlow security doc) and we already allow arbitrary Python code (which can include native code and PyPi packages), but I think it's important to have a well-thought-through security model before we let people directly package up an arbitrary .so or .dylib.

Summary

  • WebGPU with Wasm might be tricky
  • Are models in Wasm fast enough that people would be okay with using them outside of browsers?
  • Native binaries are powerful, but we'd need to make them easy to use. This also requires spending some more time on portability and security.

Proposal

Maybe a good approach moving forward is:

  1. Continue prototyping a Wasm runner for Carton (without WebGPU)
  2. Benchmark Wasm vs non-Wasm for popular models using Candle (or find an existing benchmark).
  3. Assuming that looks good, we can stabilize the interface Wasm modules need to implement and release the runner!

We can explore all the rest later (WebGPU, native binaries, etc).

Thoughts?

@lei-rs
Contributor Author

lei-rs commented Oct 2, 2023

Appreciate the insights! My initial suggestion for language-specific APIs was motivated by the difficulty of designing an API that is idiomatic to implement in multiple languages (especially Rust). But yeah, the responsibility for that abstraction could lie in a different module if needed. I almost got a super simple CPU only runner working.

So maybe this is a dumb question (I don't really know how WebGPU works), but is there a reason you need to explicitly implement WebGPU support for runners? If I compile a model with wgpu as its backend into wasm, would it not just work out of the box? I think the user would just compile and package both CPU and GPU versions of their model and handle the fallback themselves. Ok never mind, I get what you are saying now. I found an example of exposing wgpu to wasm as well.

As for performance, obviously native is going to be faster than wasm, since you will be able to leverage hardware specific intrinsics and kernels like mkl, cublas, accelerate, etc. The selling point for wasm/wasi, as you mentioned, is the portability. If your product is a model as a service, you wouldn't really care. But if your product ships with the model embedded, like say an app or a game, you could use wasm as a fallback.

@VivekPanyam
Owner

I almost got a super simple CPU only runner working.

That's great! Looking forward to the PR

Ok never mind, I get what you are saying now. I found an example of exposing wgpu to wasm as well.

Cool :)

Not that we're going to implement WebGPU support in a runner right now, but just a note: it looks like the code/project you linked to doesn't have a license yet so be wary of copying from it.

As for performance, obviously native is going to be faster than wasm, since you will be able to leverage hardware specific intrinsics and kernels like mkl, cublas, accelerate, etc. The selling point for wasm/wasi, as you mentioned, is the portability.

Yes, you're correct. However, if Wasm is 10x slower than native for your model, it may be tempting to ship a separate native model for each platform rather than just one Wasm one. 10x is almost certainly an exaggeration, but it would be good to know what the actual number is for a few popular models.

For example, if it's only 10% slower, the portability and convenience likely outweighs the performance difference for many use cases.

But if your product ships with the model embedded, like say an app or a game, you could use wasm as a fallback.

To make sure I understand correctly: are you saying an app or game that supports platforms X, Y, and Z might ship a "native" Carton model that can run on platform X, but use a Wasm carton on Y and Z?

That primarily applies in the context of models that can compile to both Wasm and native code, right? Which brings us back to Burn and Candle I guess.

@lei-rs
Contributor Author

lei-rs commented Oct 3, 2023

To make sure I understand correctly: are you saying an app or game that supports platforms X, Y, and Z might ship a "native" Carton model that can run on platform X, but use a Wasm carton on Y and Z?

That primarily applies in the context of models that can compile to both Wasm and native code, right? Which brings us back to Burn and Candle I guess.

Yeah, for example you could have CUDA, Vulkan, etc. builds and a wasm one, such that if the end user does not have the matching hardware deps the app would still work. In such a case wasm's speed wouldn't matter as long as it's somewhat usable. You should try some of the Candle browser examples like LLAMA2, which is surprisingly performant even without WebGPU. Putting that in a game is definitely feasible.
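
To illustrate the fallback idea (the types and detection logic here are purely hypothetical, not anything Carton provides):

```rust
// Hedged sketch of the fallback idea: pick the most capable packaged model
// variant available at runtime, falling back to the portable wasm build.
// All names and the detection flags are hypothetical illustrations.
enum ModelVariant {
    Cuda,
    Vulkan,
    Wasm,
}

fn pick_variant(has_cuda: bool, has_vulkan: bool) -> ModelVariant {
    if has_cuda {
        ModelVariant::Cuda
    } else if has_vulkan {
        ModelVariant::Vulkan
    } else {
        // Always available, regardless of the user's hardware.
        ModelVariant::Wasm
    }
}
```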

I definitely plan on doing benchmarks though. Since I'm on a Mac, I'll do wasm vs native vs native + Accelerate. My initial prediction is that wasm and vanilla native will be quite similar.

I think that, thanks to those libraries, you may start to see apps that offload some computation to the user's device to save on inference cost, preserve privacy, enable offline usage, etc. I think Carton could be super helpful in making the development of such apps more idiomatic and modular.

@VivekPanyam added the "new ml framework" label Oct 4, 2023
@lei-rs changed the title from "Candle support" to "WASM support" Oct 10, 2023
VivekPanyam pushed a commit that referenced this issue Oct 12, 2023
This PR adds a WASM runner, which can run WASM models compiled using the interface (subject to change #175) defined in ```../carton-runner-wasm/wit/lib.wit```. The existing implementation is still unoptimized, requiring 2 copies per Tensor moved to/from WASM. An example of compiling a compatible model can be found in ```carton-runner-wasm/tests/test_model```.

## Limitations
- Only the ```wasm32-unknown-unknown``` target has been tested and confirmed to work.
- Only ```infer``` is supported for now.
- Packing only supports a single ```.wasm``` file and no other artifacts.
- No WebGPU, and probably not for a while.

## Test Coverage
All type conversions from Carton to WASM and vice versa are fully covered. Pack, load, and infer are covered in ```pack.rs```.

## TODOs
Track in #164
@lei-rs
Contributor Author

lei-rs commented Oct 12, 2023

Post #173

Is it possible/straightforward for us to implement Lift for Tensor (and use it easily)? It looks like wasmtime implements Lift for several types that can be built from a wit list (i.e. it's not a 1:1 mapping from list to Vec). I haven't looked at this in depth so maybe it doesn't actually give us what we want. It seems like implementing Lift might require messing with wasmtime implementation details so maybe it's not worth it (definitely not in this PR at least). We can explore this more if we find that this actually matters for performance in use cases we see.

I think the interfaces option for wasmtime::component::bindgen! may allow you to implement the types manually. In the .wit we could put tensor in its own interface and have it import tensor-numeric and tensor-string; then bindgen! should only need to implement Tensor, and we can handle the variants ourselves.

TODOs

  • Use for_each_carton_type! when possible.
  • Implement Seal and InferWithHandle.
  • Reduce the number of copies, possibly by defining host-side component types manually and not using Vec, or by just returning a pointer (not sure if you can access memory in components though).

@lei-rs
Contributor Author

lei-rs commented Oct 23, 2023

@VivekPanyam I'm going to start implementing Seal and InferWithHandle sometime in the next week. What are these methods supposed to do? From the torch runner it seems like they can be added without much modification.

@VivekPanyam
Owner

Yeah, these methods aren't super well documented yet. Here's a little snippet from the code for the public interface:

/// "Seal" a set of inputs that will be used for inference.
/// This lets carton start processing tensors (e.g. moving them to the correct devices) before
/// actually running inference and can lead to more efficient pipelines.
pub async fn seal(&self, tensors: HashMap<String, Tensor>) -> Result<SealHandle> {

/// Infer using a handle from `seal`.
/// This approach can make inference pipelines more efficient vs just using `infer`
pub async fn infer_with_handle(&self, handle: SealHandle) -> Result<HashMap<String, Tensor>> {

Basically, these let users write more efficient pipelines by allowing them to send tensors for inference N + 1 (or later) to the runner while inference N is happening.
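
As a usage sketch built only from the two signatures quoted above (the `Carton` and `Tensor` import paths are omitted because they aren't shown in the snippet, error handling is simplified, and `batches` is a hypothetical stand-in for however inputs are produced):

```rust
// Hedged sketch of pipelining with seal / infer_with_handle, based only on
// the signatures quoted above. Exact import paths for Carton and Tensor are
// assumptions and omitted; error handling is simplified.
use std::collections::HashMap;

async fn pipelined_inference(
    carton: &Carton,
    batches: Vec<HashMap<String, Tensor>>,
) -> anyhow::Result<()> {
    let mut batches = batches.into_iter();
    let Some(first) = batches.next() else { return Ok(()) };

    // Seal the first batch so the runner can start preparing it.
    let mut handle = carton.seal(first).await?;

    for next in batches {
        // Send batch N+1 to the runner before running inference for batch N,
        // so the runner can prepare it (e.g. device transfers, conversions)
        // while inference N runs.
        let next_handle = carton.seal(next).await?;
        let _outputs = carton.infer_with_handle(handle).await?;
        handle = next_handle;
    }

    // Inference for the final sealed batch.
    let _outputs = carton.infer_with_handle(handle).await?;
    Ok(())
}
```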

This works well for some runners where it's easy to do things in parallel (e.g. the TorchScript runner, where we could in theory start moving tensors for the next inference to the correct devices while the current inference is running in another thread). Even if we're not moving tensors between devices, we could at the very least parallelize the type conversions (which could matter if they involve copies of somewhat large tensors).

Currently we don't do anything in the TorchScript runner other than storing the tensors in seal and using them in infer_with_handle. This on its own can provide a decent speedup for large tensors because it can hide the latency of copies within the IPC system.

In the Python runner, we allow users to optionally implement seal and infer_with_handle themselves (although this isn't yet documented). Ignoring details with the GIL because they're not relevant for this comment, these methods enable users to also parallelize model-specific preprocessing work.

Ideally, we'd do something similar for Wasm, but it might be tricky to implement in a way that actually parallelizes things because it looks like threading and thread safety aren't super mature features of Wasm and Wasmtime. So for now, I think we can add these methods to the interface that users can implement and then down the road once threading support is more mature, these methods will have more value (without breaking the interface).

Some references that may be useful to explore:

@lei-rs
Contributor Author

lei-rs commented Dec 11, 2023

Have you checked out lunatic? Might be a workaround for the lack of threads.
