Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WASM UDFs #9326

Open
crepererum opened this issue Feb 23, 2024 · 12 comments
Open

WASM UDFs #9326

crepererum opened this issue Feb 23, 2024 · 12 comments
Labels
enhancement New feature or request

Comments

@crepererum
Copy link
Contributor

Is your feature request related to a problem or challenge?

Databases / ETL solutions built on top of DataFusion can use UDFs (in their various forms) to extend the functionality of DataFusion, e.g. to add new scalar or aggregation functions. This extensibility however does NOT automatically extend to their users, since they cannot and (for security reasons) should not add code to the running system. So the U in UDF currently stands for "DataFusion API User", not for "End-User".

WASM provides a way to run user code in a secure sandbox under an "unknown" host (i.e. the user does not need to know about the operating system or CPU architecture). A DataFusion-based solution can use that to implement UDFs. However, since the calling convention from the WASM payload to the UDF are solution-defined, the end user is likely to have a hard time with it, and there is likely only a non-existing/small ecosystem for tooling to develop UDFs.

Defining the UDF WASM interface in DataFusion -- potentially in collaboration with the Arrow (since we need to get Arrow data across the WASM memory boundary) -- would likely facilitate a wider ecosystem and a more streamlined solution. Prior art to this is Arrow Flight, which is now being integrated into more and more server and client implementations.

Describe the solution you'd like

  1. Define/find a way to pass Arrow data in/out a WASM payload.
  2. Define the WASM calling convention for the different types of UDFs (scalar, aggregation, window functions, ...). Make sure to version that interface so we can advance it later (e.g. by using new WASM features).
  3. Implement UDFs using wasmtime in DataFusion.
  4. Offer some easy blueprint / framework to develop UDFs in at least two languages.

Describe alternatives you've considered

  • Doing this as part of the DataFusion-based solution (i.e. downstream). See drawbacks illustrated within the intro.
  • Use other UDF interface types like Arrow IPC & a Python payload. That clearly has security issues and is harder to deploy/manage.

Additional context

Projects that might be helpful:

@crepererum crepererum added the enhancement New feature or request label Feb 23, 2024
@alamb
Copy link
Contributor

alamb commented Feb 28, 2024

BTW #9332 might be related (API to allow functions to be registered externally)

@milenkovicm
Copy link
Contributor

milenkovicm commented Mar 16, 2024

Example exposing JVM as UDF with JNI https://github.com/milenkovicm/adhesive using #9332

From my experience WASM is going to be a bit more complicated as there are many things missing.

@crepererum
Copy link
Contributor Author

From my experience WASM is going to be a bit more complicated as there are many things missing.

Could you elaborate this? What things / building blocks / features / ... are missing? Is it WASM stuff, or Arrow/DataFusion stuff?

@milenkovicm
Copy link
Contributor

Ive done few pocs some time ago, they were focused on rust calling rust generated wasm. I found that documentation and/or examples missing for anything more complex than simple usage.
Also, I found that implementations claiming having complex type passing between rust and wasm have it disabled due to bugs/performance issues.

Due to all this I personally found it very complex to grasp, and not fit for my use case. Also, from this perspective, I believe I failed to understand importance of wasm bindgen as well. I gave up before reaching stage to add arrow.

Now, don't get discouraged, looks like there is progress in wasm community. Personally it looks wasm edge may be the way to go and I'll try to replace JVM implementation with wasm in near future, from JVM experience it should not take more than few days work to get basic example running.

I hope this helps

@crepererum
Copy link
Contributor Author

Due to all this I personally found it very complex to grasp, and not fit for my use case. Also, from this perspective, I believe I failed to understand importance of wasm bindgen as well. I gave up before reaching stage to add arrow.

I totally agree that mapping Arrow through the WASM boundary isn't a trivial task, which is also why I've opened this ticket and I wanna have one impl. and SDKs / blueprints for different UDF "guest" languages, so people don't have to suffer through this pain on their own.

@milenkovicm
Copy link
Contributor

My understanding is that any interaction with wasm would involve data copy across boundary. (schema, data) : (Vec<u8>,Vec<u8>) to and from wasm function, if that's the case interaction with wasm can be done as in https://github.com/second-state/wasmedge-rustsdk-examples/tree/main/wasmedge-bindgen

Do I miss something?

On that note, what would be the simplest way to convert [ArrayRef] and [DataType] to Vec<u8> and vice verse 😀?

@kylebarron
Copy link

I totally agree that mapping Arrow through the WASM boundary isn't a trivial task

It's very tied to JS at the moment, but you can use https://github.com/kylebarron/arrow-js-ffi to view Arrow data across the Wasm boundary in the cases where you want to persist the data in Wasm. I'm not sure how this would work with server-side wasm though. It seems like you could use the Arrow C Data interface across the Wasm boundary directly, though? The main work in Arrow JS FFI is just implementing the C Data Interface in JS, but it already exists in Rust.

This (and a related/successor project I've been working on https://github.com/kylebarron/arrow-wasm) are both quite tied to wasm-bindgen and browser use cases. It's not clear to me how they'd work in WASI.

@milenkovicm
Copy link
Contributor

I just read your article @kylebarron https://observablehq.com/@kylebarron/zero-copy-apache-arrow-with-webassembly 👍🏻 few minutes ago

This (and a related/successor project I've been working on https://github.com/kylebarron/arrow-wasm) are both quite tied to wasm-bindgen and browser use cases. It's not clear to me how they'd work in WASI.

https://lib.rs/crates/wasmedge-bindgen does not look tided up to browser/js

@milenkovicm
Copy link
Contributor

It might be relevant #9834

@milenkovicm
Copy link
Contributor

Fresh off the press, a quick WASM UDF poc built on top of WasmEdge library

https://github.com/milenkovicm/wasaffi

It is working, but far from being production ready.

Data needs to be copied to wasm VM ( arrow ipc-ed to local buffer then buffer copied across using wasmedge bindgen, and same for function result ). Most of the complexity can be hidden with a simple macro.

One note, ipc buffer contains schema definition as well, can this be avoided if both parties know the schema?

@crepererum
Copy link
Contributor Author

Most of the complexity can be hidden with a simple macro.

That was my hope. I think we should make sure though that the FFI WASM API is well defined and that other guest languages (like C++ or Python) could also implement that boilerplate w/o needing to go through a Rust layer within the guest (otherwise the WASM payloads get rather large).

One note, ipc buffer contains schema definition as well, can this be avoided if both parties know the schema?

Yes. This is for example what Arrow Flight (SQL) also uses. The client can query a server w/o knowing the schema a priori.

@milenkovicm
Copy link
Contributor

IMHO, I dont think there is too many problems on datafusion/arrow side, WASM side FFI interface is still left to be desired, and in constant state of flux.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants