Envoy WASM extensions in the present and its future (Proxy-Wasm) #35420

Open
marc-barry opened this issue Jul 24, 2024 · 71 comments
Labels
area/docs, area/wasm, question

Comments

@marc-barry
Contributor

Title: Envoy WASM extensions in the present and its future (Proxy-Wasm)

Description:

Envoy currently supports WASM extensions via the WASM filter. I am aware of the following warning:

The Wasm filter is experimental and is currently under active development. Capabilities will be expanded over time and the configuration structures are likely to change.

The documentation for the feature is fairly terse, so I largely relied on articles like https://tetrate.io/blog/wasm-modules-and-envoy-extensibility-explained-part-1/. There was also a Google Document proposal that I read a while back, but I'm now unable to find it. We have hit a number of issues with the documentation, the extension development process, and getting a clear understanding of the ABI spec Envoy adheres to and the plans for it.

There are many more references across the Internet to WASM extensions for products that use Envoy under the hood or that have decided to adopt the same ABI. But as I mentioned above, the spec isn't really defined in the public space, and all work on it appears abandoned or stalled.

What I'm trying to determine is the following:

  • Does Envoy in fact use Proxy-Wasm and v0.1.0 of the spec?
  • Since the spec doesn't exist in consumable form under https://github.com/proxy-wasm/spec/tree/main/abi-versions how can I determine the ABI that Envoy is using?
  • Does Envoy intend to move to later versions of the spec, assuming they are documented and agreed upon at some point in the future?
  • When products like https://gateway.envoyproxy.io refer to WASM extensibility I assume they are using Envoy under the hood and that in fact Envoy's WASM extensions are used. Is this a correct assumption?
  • Does anyone have a good understanding of the future of WASM extensions in Envoy and what the support and evolution might look like?

Relevant Links:

@marc-barry added the triage label Jul 24, 2024
@phlax
Member

phlax commented Jul 24, 2024

@marc-barry i would be happy to review any improvements to the docs - sounds like we have some gaps in current docs

cc @mpwarres wrt substantive questions

@phlax added the question, area/docs, and area/wasm labels and removed the triage label Jul 24, 2024
@marc-barry
Contributor Author

@phlax thanks for answering. I also don't mind helping improve this in the community and contributing. Perhaps we can start with documenting the ABI interface details that Envoy uses so that developers can reference this when developing extensions. I couldn't find any examples for Envoy and perhaps I could contribute a simple example that we could include.

https://www.envoyproxy.io/docs/envoy/v1.31.0/api-v3/extensions/wasm/v3/wasm.proto#envoy-v3-api-msg-extensions-wasm-v3-pluginconfig is documented quite well, but the gap is, for example, getting from Go code to a WASM plugin that can be loaded. If you navigate down to allowed_capabilities you'll see the sentence "The capability names are given in the [Proxy-Wasm ABI](https://github.com/proxy-wasm/spec/tree/master/abi-versions/vNEXT)." but the page behind that link no longer exists. I think we could focus on cleaning this up, which would make things clearer. We should probably also figure out how to improve the https://github.com/proxy-wasm contribution situation, as this is something the Envoy docs could then just reference.
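
To make the allowed_capabilities idea concrete, here is a minimal Python sketch of a host-side allowlist check. It is not Envoy's implementation: the function name `filter_capabilities` is hypothetical, the capability names are only illustrative examples of Proxy-Wasm ABI function names, and treating an empty allowlist as "no restriction" is an assumption of this sketch that should be checked against the Envoy documentation.

```python
from typing import Iterable, List, Set


def filter_capabilities(requested: Iterable[str], allowed: Set[str]) -> List[str]:
    """Return the requested capability names that the host would expose.

    Assumption (sketch only): an empty allowlist means "no restriction".
    """
    if not allowed:
        return list(requested)
    # Keep the caller's order, drop anything that is not on the allowlist.
    return [name for name in requested if name in allowed]


# Illustrative capability names a plugin might import from the host.
wanted = ["proxy_log", "proxy_http_call", "proxy_get_property"]
exposed = filter_capabilities(wanted, {"proxy_log", "proxy_get_property"})
```

With this allowlist, `exposed` contains the logging and property capabilities but not `proxy_http_call`, which is the kind of restriction the config field is meant to express.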

@phlax
Member

phlax commented Jul 24, 2024

yeah, makes sense

we have this example https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/wasm-cc, i tried previously to add something similar for rust but didn't get too far

@lizan is not as active as before but might have ideas about who to speak to - i know there is quite a bit of commitment on the google side to develop/maintain the wasm filter

@mathetake
Member

mathetake commented Jul 26, 2024

@mpwarres @martijneken are the current owners of the Proxy-Wasm and Envoy codebases, so I am truly hoping they share what direction this goes and how the evolution looks like. As a former maintainer there, I want to apologize for the mess; I failed to bring it to a healthy state. I completely share the frustration that I sense you have. I hope the folks mentioned ^^ clarify their stance and how this will be resolved, especially Google's point of view on this matter. cc @alyssawilk

@mathetake
Member

I think one idea is to eliminate the Proxy-Wasm organization dependency completely and make the Envoy codebase self-contained. After that, host the spec and complete user-level documentation here on envoyproxy.io. Plus, host the SDKs here as well (we are more than happy to donate the Go SDK). Given that there has been no inter-proxy collaboration per se since the release, there's absolutely zero benefit to Envoy in having the spec separated from Envoy at this point. That it is pure technical debt is, from my view, the consensus from conversations with other community members in the last few weeks. That's my take. It's just a random idea, but having one central place to look at is, as a user myself, less confusing as well as better IMHO.

@mathetake
Member

^^ if this sounds good to other maintainers, I am more than happy to help and maintain again - I really feel the obligation to fix this once and for all

@kyessenov
Contributor

I am happy to review any doc improvements since I'm familiar with the extensions as well. A lot of the base code is used in production, for real products, so there's definitely a valid use case, but it is difficult to figure out where the stability ends and the rough edges start without looking at the code. The larger problem is that it is simply hard to start writing Wasm no matter which language one chooses (not everyone will learn Rust for changing headers), so there's a barrier to entry that the improved docs may not fully address.

I think the current Wasm implementation is no longer experimental and reached some stability (for core parts), but it also failed to reach 1.0. It doesn't really matter what the version number says, since in practice, there's just one ABI used by various Proxy-Wasm efforts.

@mathetake
Member

mathetake commented Jul 26, 2024

yeah I completely agree with @kyessenov, and in order to improve the situation (not saying I am sure what the end stability means here), I think, as @marc-barry hinted (and I believe everyone is aware), the unnecessary dependency on the Proxy-Wasm org makes the situation worse or leaves it at a standstill. What I am suggesting is to document and host everything on envoyproxy.io so there is one single source of truth. This is not only about the documentation, but also about how the ecosystem around it works: who's responsible for what, how issues are handled, where to report issues, what the support policy is, etc. Currently all of that is a mess. But I'm open to suggestions, and curious how others think the coupling with Proxy-Wasm, and leaving everything there as-is, benefits Envoy, just by looking at the history

@thenewwazoo

I'm an onlooker, but thought I might weigh in. My employer is interested in leveraging existing logic written in Rust (or in porting Java code to Rust) in order to embed it in multiple places, one of which is Envoy filters. We benchmarked C++, Lua, and Rust WASM (using the v8 runtime) and found the overhead of WASM to be a show-stopper. I have been experimenting for the last ~week with building NullVM filters to compile into v1.30 but ran into this problem after some significant struggles bringing everything up-to-date, and have stopped trying.

I'm not opposed per se to proxy-wasm, but I'm doubtful as to its value given that adoption has been very poor over the last ~4 years since it was introduced. The overhead of a "real" WASM runtime is too high for us, and NullVM has (as far as I can tell) never quite been finished (and its overhead is still to be measured). As such, I'll be exploring the recent work on loadable modules. IMO that's where development effort should be directed.

@kyessenov
Contributor

@thenewwazoo The criticism is shared but it's an ecosystem problem with Wasm, and not something Envoy can fix. IMO, a stable ABI is all that matters, and Wasm in Envoy has kept the stability promise despite many internal changes inside Envoy. This is no worse than other extension points, e.g. Lua, and it's more flexible since you can also use dynload "nullVM" and not be blocked by runtime performance.

@martijneken

martijneken commented Jul 27, 2024

(adding @PiotrSikora for anything I'm missing)

share what direction this goes and how the evolution looks like

Yes, thank you for the opportunity. I will try to provide some context and direction:

First, I acknowledge there's been (at least) a 1 year gap in ProxyWasm maintenance. The team at Google that was doing most of the work more or less disintegrated, and I've been working to rebuild it. Google is bringing new products to market based on [Proxy]Wasm, so rest assured we will be investing in Wasm extensibility going forward, hopefully in partnership with other stakeholders. Here are the major areas we plan to invest in, approximately in order:

  1. Stabilization. This means writing documentation (in progress), fixing CI (in progress), and updating dependencies (to at least match Envoy). We also are working on some tools and code samples, although these do have a Google Cloud Platform bent. I'm open to 'upstreaming' some of these into ProxyWasm itself, if it would be helpful.

  2. Productionization of Envoy "inline wasm". We've identified a number of shortcomings in the (alpha) Envoy-local wasm extensions, such as: lack of reliable wasm delivery, lack of error handling, lack of automatic scaling, lack of isolation (process and/or sandbox). We are working on a design to address these, which we hope to iterate on with the Envoy community.

  3. ABI evolution. We acknowledge ProxyWasm is a one-off wasm ABI, born years ago while wasm was still in its infancy. Per @kyessenov's point above, we are looking to engage with ABI standardization efforts. Specifically, we are interested in evolving to support wasi-http and supplementary ABIs such as wasi-keyvalue.

absolutely zero benefit to Envoy in having the spec separated from Envoy at this point

@mathetake I'm not sure I agree. Per links in the original report, there are integrations with nginx at least. Is that not maintained/used? I will agree that ProxyWasm is essentially an ext_proc based ABI, so maybe you have a point. But I would be wary of breaking existing non-Envoy users.

I do also share the concern that there's version/dependency drift between Envoy and ProxyWasm. ProxyWasm seems to lag Envoy all the time. Anyone have ideas on how to minimize this?

how can I determine the ABI that Envoy is using?

Good question. I see the Envoy dep but that points to a commit and not a release version. Browsing that commit, I would guess 0.2.1. Can someone confirm?
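
One way to see which ABI version a given plugin targets (as opposed to which versions a host accepts) is to look at the module's exports. The Python sketch below assumes the Proxy-Wasm convention of exporting a marker function named `proxy_abi_version_<major>_<minor>_<patch>`. The helper name `detect_abi_version` is hypothetical, and obtaining the export names from a real .wasm binary is left to whatever Wasm tooling is at hand.

```python
import re
from typing import Iterable, Optional

# Marker export that a Proxy-Wasm module uses to announce its ABI version.
_ABI_MARKER = re.compile(r"^proxy_abi_version_(\d+)_(\d+)_(\d+)$")


def detect_abi_version(export_names: Iterable[str]) -> Optional[str]:
    """Return the ABI version announced by the first marker export, or None."""
    for name in export_names:
        match = _ABI_MARKER.match(name)
        if match:
            return ".".join(match.groups())
    return None


# Export list as it might look for a plugin built against ABI 0.2.1 (illustrative).
exports = ["memory", "proxy_abi_version_0_2_1", "proxy_on_vm_start"]
version = detect_abi_version(exports)
```

This only answers the plugin-side half of the question; which versions Envoy itself accepts still has to be read from the host code or its documentation.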

found the overhead of WASM to be a show-stopper

@thenewwazoo Can you elaborate? In our benchmarks wasm performs quite well. We are careful about our choice of wasm engine, and we do precompile plugins ahead of time.

https://github.com/tetratelabs/proxy-wasm-go-sdk

I have serious concerns about the memory management module in TinyGo, operating under wasm. I filed tetratelabs/proxy-wasm-go-sdk#450 just this week, and it seems there's no supported solution. I would only recommend using C++ or Rust at this time.

@mathetake
Member

I will agree that ProxyWasm is essentially an ext_proc based ABI, so maybe you have a point. But I would be wary of breaking existing non-Envoy users.

@martijneken curious, how in the universe would just removing the dependency on one library break the other existing users of that library? Could you tell me how that works? I am not suggesting making changes to Proxy-Wasm in the org at all, and if you read it that way, that's not my intention.

@mathetake
Member

@martijneken what I am saying is just to have a single place for the documentation and implementations, which I believe is the official Envoy documentation. Look at this mess proxy-wasm/spec#42, the original intention was to document properly, and what happened was that it was ignored by one single person who has been constantly saying "i am doing in a few weeks" and disappearing. That's totally and completely disgraceful to users, don't you think? That's the whole blocker of why SO MANY people before have complained about this documentation mess. Just being detached from the throne of one single person there doesn't harm anyone here, but just benefits the Envoy community. What's the damage in doing so?

@kyessenov
Contributor

kyessenov commented Jul 27, 2024

I think we really need both: an upstream ProxyWasm reference documentation that spells out the least common denominator functionality for all implementations - this requires fixing the governance of ProxyWasm, as many have pointed out. I would recommend drawing a stricter boundary for "core" and "experimental" ABI definitions here. A lifecycle and a capabilities model also belong here.

We also need to document the Envoy implementation of ProxyWasm in the "inline wasm filter". There are various extensions and backdoors to access Envoy internals that cannot be captured in ProxyWasm spec, but they are crucial if one wants to use this particular Wasm implementation.

Re: performance - I'd be surprised that a simple task would perform poorly in Wasm unless a poorly chosen language/runtime/library is used. There is a real problem that it's difficult to author "good" Wasm, and there's a misplaced expectation that any code would perform well in Wasm. I'm not sure how to address this - I can't see Rust being a replacement for Lua, for example, for all the intended audience.

@mathetake
Member

this requires fixing the governance of ProxyWasm, as many have pointed out. I would recommend drawing a stricter boundary for "core" and "experimental" ABI definitions here. A lifecycle and a capabilities model also belong here.

yeah, if this really is possible - I meant fixing the governance there. Not sure if there's such a thing in the first place, or if there's anyone interested in doing so after this catastrophic mess

@mathetake
Member

mathetake commented Jul 27, 2024

sorry for saying a lot, folks, but I wanted to say the governance of Proxy-Wasm is 100% of the problem here, and what I think is the best is just nuke the dependency on that and have a complete spec/doc/sdk coupled with Envoy and don't care about other proxies since IMO there's no benefit to anyone in Envoy community (please tell me if anyone in the universe successfully migrated a Wasm binary in production from Envoy to Openresty, that's great if that really happens).

But agree with @kyessenov if the governance is fixable - that should work as well.

All I have expressed here comes from my guilt and apologies about my involvement a few years back. I am really looking forward to a better Wasm situation!

@thenewwazoo

@kyessenov said:

IMO, a stable ABI is all that matters, and Wasm in Envoy has kept the stability promise despite many internal changes inside Envoy.

That's... mostly true, I think. Between 1.25 and 1.30, there were a number of (afaict undocumented) functions added to the WASM VM API that required changes to @mathetake's Rust NullVM playground code. I spent last week trying to bring it up-to-date until I got stuck.

@martijneken said:

Can you elaborate? In our benchmarks wasm performs quite well. We are careful about our choice of wasm engine, and we do precompile plugins ahead of time.

We benchmarked based on the v8 runtime and found per-filter memory overhead on the order of 180 MB and CPU overhead on the order of (iirc) 10%. It was an admittedly naive attempt I wasn't involved in, but the results indicated unsuitability for our use cases.

I don't mean to make this into an either-or choice if it's not one, but I'm coming at this from the perspective of a frustrated would-be user who's excited by an alternative.

@PiotrSikora
Contributor

Look at this mess proxy-wasm/spec#42, [...] That's the whole blocker of why SO MANY people before have complained about this documentation mess.

Yes, I've dropped the ball on the ABI specification (please review that PR when you get a chance, since it should be good to merge once it's approved).

But how exactly is that preventing you and/or others from writing documentation in Envoy and/or SDKs? You could even document or correct the ABI specification, but you've chosen to nuke the whole repository instead of fixing it.

I wanted to say the governance of Proxy-Wasm is 100% of the problem here

Could you elaborate on what you mean here?

The same people that maintain Wasm in Envoy also maintain Proxy-Wasm C++ Host and Proxy-Wasm C++ SDK, so I fail to see how moving the C++ Host and other projects into Envoy codebase would change anything.

what I think is the best is just nuke the dependency on that and have a complete spec/doc/sdk coupled with Envoy

Why does it matter whether it's Envoy or Proxy-Wasm org? What's preventing you from contributing to one but not the other?

Also, Wasm in Envoy was originally developed inside Envoy's codebase, but as far as I recall, Envoy maintainers refused to accept/review such a big change, so it was split into various projects inside the Proxy-Wasm org.

don't care about other proxies since IMO there's no benefit to anyone in Envoy community (please tell me if anyone in the universe successfully migrated a Wasm binary in production from Envoy to Openresty, that's great if that really happens).

IMHO, that's quite rude to people who implemented and use Proxy-Wasm in other proxies.

@martijneken

We benchmarked based on the v8 runtime and found per-filter memory overhead on the order of 180 MB and CPU overhead on the order of (iirc) 10%. It was an admittedly naive attempt I wasn't involved in, but the results indicated unsuitability for our use cases.

Gotcha. The memory use sounds a bit higher than our (non-Envoy) use case, where a subprocess with one v8 isolate takes <100 MiB, and some of that is the wasm memory. I'm not well versed in the Envoy filter (yet -- per above we do plan to focus on it soon), but I wonder how much of the memory is on the 'host' vs 'wasm' side. Have you looked at it with pprof?

If the 10% CPU refers to the plugin execution only, that sounds fair -- wasm can add some CPU overhead compared to native code. I think this highlights the need for robust NullVM implementations so that those who care less about isolation don't have to compromise on performance.

I have been experimenting for the last ~week with building NullVM filters to compile into v1.30 but ran into this problem after some significant struggles bringing everything up-to-date, and have stopped trying.

I hadn't heard about it until today, but I think Rust SDK + NullVM makes a lot of sense! We'd be happy to work on this. Is there an FR in proxy-wasm-cpp-host tracking it, in addition to #12155? It would need a local NullVM implementation (like this?) and tests to make sure we don't lose support / compatibility.

@keithmattix
Contributor

Just dropping my 2 cents in here as an interested vendor (Microsoft) who plans to invest in Envoy's WASM support in the next couple of months. I've talked with several of the folks still involved in Envoy/proxy-wasm, and I'm hopeful that, over time, we'll be able to stabilize and modernize Envoy's WASM support (whether that's proxy-wasm, some form of WASI support, or both). Perhaps it would be useful to focus on getting consensus/agreement on the following points:

  1. Bolstering the governance of the proxy-wasm org - The PR "Document Proxy-Wasm ABI v0.2.1" (proxy-wasm/spec#42) looks pretty active (I plan on adding my own comments in the next couple of days); however, there is indeed only a single person currently shown as a member of the organization. More formal governance should help those like myself who are looking to contribute find the right people to talk to. The C++ host repo has 3 CODEOWNERS; maybe start there? A more defined contributor ladder would help set expectations as well.
  2. Defining a 6 or 12 month roadmap - Sadly, I have to join the chorus of folks who would love to leverage Envoy's WASM support, but cannot due to poor performance. Whether or not these and other barriers to adoption will be addressed in the next ABI version is unclear. Are the perf issues known and just require someone to do the work? Is further investigation needed? All of these are unclear. I would ask the existing maintainers of the proxy-wasm project to produce a roadmap and clearly delineate where help is needed so that interested parties can contribute if desired.
  3. Fostering the proxy-wasm community - This point is a bit of an expansion on point 1. As was pointed out above, multiple proxies (Kong, OpenResty, NGINX, Envoy) depend on proxy-wasm. This thread (now) has at least 3 vendors (Google, Tetrate, Microsoft) who have opinions on the direction of proxy-wasm. Given this, it's interesting to me that I haven't been able to find a Slack channel, community meetings, etc. to get questions answered, design docs approved, etc. It's entirely possible that I just haven't done a great job at looking, but IMO, the success of any OSS project (including proxy-wasm) is incumbent upon streamlined channels of communication. My suggestion to proxy-wasm maintainers would be to create/facilitate these avenues, potentially leaning upon existing projects/foundations like the CNCF or the Bytecode Alliance.

I welcome feedback on any/all of the above points. There's obviously a ton of history here, and I'm hopeful that focusing on concrete action items will aid in a healthy resolution for everyone involved.

@johnlanni
Contributor

johnlanni commented Jul 29, 2024

tetratelabs/proxy-wasm-go-sdk#450

@martijneken Based on our extensive experience applying TinyGo with WebAssembly in large-scale scenarios, the combination of TinyGo and bdwgc is indeed feasible; @anuraaga might be overly pessimistic. Although memory leaks in bdwgc are indeed possible in 32-bit environments, the likelihood remains low. Our practical experiences – which include developing over 30 plugins with intricate logic that are utilized across diverse user environments – have encountered virtually no issues.

We also have quite a few users who have developed their own wasm plugins based on tinygo+bdwgc, including those with ten-billion-level pv, and they have not encountered any issues.

There has been a single exception, though: scenarios involving substantial handling of random binary data, as discussed here. For gateway scenarios, similar problems might only arise when dealing with compressed or encrypted data.

@jcchavezs
Contributor

jcchavezs commented Jul 29, 2024

My 2p: as @mathetake pointed out, the main issue with the proxy-wasm spec was (and still is) governance more than technical (the technical side also has some challenges, but solving them requires better governance). You can find a lot of frustration on proxy-wasm/spec#41, where people said for weeks "it is coming" with no clear goals or direction and, frankly, with no community involvement.

I think keeping a half-baked spec in an isolated repository, disconnected from reality and from implementors, is a really bad idea. We all saw that happen with the tracing standards, which took the ecosystem through a few big bangs, and you can see the status of SDKs in a project like OpenTelemetry (with the highest focus and the biggest community + all CNCF exposure) in 2024 https://opentelemetry.io/docs/languages/#status-and-releases (the project started in 2018). I support @mathetake's idea of moving this into envoy and working from there, as the spec is already envoy centric. Other proxies can still build on this.

If the spec doesn't move to envoy there should be a good, diverse and flat governance committee, and I would really suggest it involve users. I learnt that old contributors that have no stake in this become gatekeepers and/or usually harm the project. Also bear in mind that leading/maintaining the standard means writing implementations, support, promotion and leading changes in the ecosystem, so people must be hands on.

I like @keithmattix's suggestions but I would really not focus ONLY on vendors because, no offense, I have heard the phrase "a vendor interested to invest in the project" many times. If a company is willing to invest, a good way to start is to get involved in the ecosystem, not going directly to the spec. I would discard the Bytecode Alliance alternative as they are also going through big bangs with the component model.

PS. I am maintainer of https://github.com/corazawaf/coraza-proxy-wasm/ and also lead the usage of http-wasm in traefik.

@keithmattix
Contributor

If a company is willing to invest, a good way to start is to get involved in the ecosystem, not going directly to the spec

Oh of course; I did not mean to imply that our investments would begin limited to the spec. I will say, however, that the spec repo seems to have more directional discussion than anywhere else, so that's why I highlighted that PR specifically.

For clarity, part of the reason I'm interested in the proxy-WASM project is because of Microsoft's existing investments in the WASI space, including the component model, WASI-http, and others. If possible, I believe having proxy-wasm be compatible with these developments would be beneficial but that's a pure implementation detail at this point🙂

@marc-barry
Contributor Author

When I initially posted this question, I had assumptions that have now been cleared up. I see these as gaps in the formalization and documentation of the spec. It's clear that multiple parties, myself included, have a vested interest in the direction of what is currently referred to as "proxy-wasm".

As co-founder and CTO of my company (Qpoint), I am heavily invested in Envoy's WASM extensions. We have also developed other technology leveraging "proxy-wasm" outside of Envoy, making the formalization of the documentation highly important to us.

Given the number of interested parties, I'm volunteering my time and effort to ensure this gets the attention it deserves. I'll start by creating a collaborative document that will attempt to articulate the roadmap, gaps in documentation, and future interests of the individuals and organizations already on this thread. I welcome contributions and feedback from all interested parties.

@martijneken

I'll start by creating a collaborative document that will attempt to articulate the roadmap

Great, we (Google) would like to contribute. Aggregating the feedback from this discussion, I think it would help to break this into tracks, such as:

  • Spec / ABI / evolution (core vs experimental, WASI convergence)
  • Base host (ProxyWasmCppHost maintenance, Wasm engine support)
  • Envoy host implementation (getting this out of alpha, performance)
  • SDKs, language support (C++, Rust, Golang, etc)

We plan to set up a community meeting to gather stakeholders and discuss roadmap/governance/community. A Slack channel is also a great idea. @leonm1 from our team volunteered to organize.

the combination of TinyGo and bdwgc is indeed feasible

@johnlanni That's great to know. Are you using https://github.com/wasilibs/nottinygc or a different integration? Would love to get this SDK supported.

@eshepelyuk

eshepelyuk commented Sep 7, 2024

AIUI http-wasm.io was developed by a single ex-Tetrate employee, but it is not maintained and I don't know of any users. It has a fancy website, but quoting someone more familiar than me: "It is safe to assume it’s a dead project".

Traefik uses it starting in their recent version v3. Also in this thread there was a post from @jcchavezs - a maintainer of https://github.com/jcchavezs/coraza-http-wasm-traefik.

@spacewander
Contributor

spacewander commented Sep 9, 2024

The most concrete evidence of this (other than the PRs in ProxyWasm repos) is that we (Google) are starting a project to get the Envoy wasm filter out of alpha.

Thanks @martijneken for sharing the plan!

I wager that it will not provide the resource/fault isolation which we intend to bring to Envoy + ProxyWasm. But for those writing trusted/1P extensions, maybe that's not what you want or need.

Could you name some situations in which people need to write trusted extensions? Usually, we trust the developer but not the plugin itself. For example, most of the plugins we run are developed by our teammates. So the technology doesn't need to be fully sandboxed - ensuring our teammate is sane (and code review) is enough. Unless we are lending the Envoy cluster to run the customer's plugins... (maybe that is Google's use case?)

BTW, the Proxy Wasm can not perfectly provide a trust declaration so far (maybe it will be improved in the future). Let's list some risks here:

  1. the plugin contains unsafe syscall operations, for example, reading the other configuration on the disk: handled well in Wasm
  2. the plugin consumes unlimited memory: currently, it seems that proxy wasm doesn't have per-plugin memory limitation. But technically a Wasm runtime can handle this well.
  3. the plugin triggers an infinite loop sometimes: Wasm doesn't have CPU limitation.
  4. the plugin allows untrusted clients to get control (XSS injection, authn/z bypass, and so on): this is usually caused by the plugin logic, not by the way to implement the plugin.

Wasm plugin can handle 1&2, but to get a trusted extension, we still need to have a careful code review to address all the risks.

@johnlanni
Contributor

johnlanni commented Sep 9, 2024

I believe that being trusted has two levels:

  1. Plugin logic guarantees the security of Envoy's operational logic
  2. Plugin logic guarantees the security of the Envoy's operating environment

The former is difficult to ensure at the mechanism level, but the latter can be guaranteed through the Wasm mechanism because prohibiting system calls can ensure the security of the host environment and prevent the creation of logic with high-risk security vulnerabilities.

In addition, in fact, the memory limits for plugins (no more than 1G per VM) have already been implemented in the proxy-wasm-cpp-host project; as for CPU limits, they need to be implemented at the runtime level, for example, WAMR can already measure the CPU execution time for each VM (although there is additional overhead cost), and subsequently, this can be combined with Envoy's overload mechanism to enforce limits.
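
As a back-of-the-envelope illustration of what a per-VM memory cap means: Wasm linear memory grows in 64 KiB pages, so a 1 GiB cap corresponds to 16384 pages. The Python sketch below is not the proxy-wasm-cpp-host code; `max_pages` and `can_grow` are hypothetical helper names, and the default limit simply mirrors the "no more than 1G per VM" figure quoted above.

```python
WASM_PAGE_BYTES = 64 * 1024  # size of one Wasm linear-memory page


def max_pages(limit_bytes: int) -> int:
    """Number of whole Wasm pages that fit under a byte limit."""
    return limit_bytes // WASM_PAGE_BYTES


def can_grow(current_pages: int, delta_pages: int, limit_bytes: int = 1 << 30) -> bool:
    """Would a memory.grow request of delta_pages stay within the cap?"""
    return current_pages + delta_pages <= max_pages(limit_bytes)


one_gib_in_pages = max_pages(1 << 30)
```

A host enforcing such a cap would refuse the grow request once `can_grow` returns False, so the plugin sees a failed allocation rather than the proxy running out of memory.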

@spacewander
Contributor

spacewander commented Sep 9, 2024

in fact, the memory limits for plugins (no more than 1G per VM) have already been implemented in the proxy-wasm-cpp-host project

I am glad to hear that the memory limitation already exists in the Proxy Wasm, which proves the conclusion that Wasm plugin can handle attack vector 1&2.

WAMR can already measure the CPU execution time for each VM (although there is additional overhead cost), and subsequently, this can be combined with Envoy's overload mechanism to enforce limits

So far, can Envoy's overload mechanism turn an infinite loop into a finite one? Even if we can limit the CPU at the level of Envoy, this doesn't mean the Wasm plugin is trusted, because such a Wasm plugin can take CPU resources away from other features. A per-plugin CPU limit is required, and this is not an easy job, as the runtime needs to be able to do CPU scheduling itself, not just the measurement.

@johnlanni
Contributor

An infinite loop can be detected and the corresponding Wasm VM destroyed, although this would likely require support at the runtime level.
Similarly, for abnormal CPU usage by certain plugins, we can also consider:

  1. For cases exceeding a non-severe threshold, introduce a delay for requests that are to pass through the Wasm plugin logic.
  2. For cases exceeding the threshold by too much, directly destroy the corresponding Wasm VM.
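The two thresholds above could be sketched roughly as follows. This is a minimal illustration, not Envoy or proxy-wasm code: the `Action` names, the `cpu_policy` function, and the limit values are all hypothetical, and the measured CPU time is assumed to come from the runtime.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    DELAY = "delay"            # soft threshold exceeded: slow requests down
    DESTROY_VM = "destroy_vm"  # hard threshold exceeded: tear the VM down


def cpu_policy(cpu_ms: float, soft_limit_ms: float, hard_limit_ms: float) -> Action:
    """Map a VM's measured CPU time in the current window to an action."""
    if soft_limit_ms > hard_limit_ms:
        raise ValueError("soft limit must not exceed hard limit")
    if cpu_ms > hard_limit_ms:
        return Action.DESTROY_VM
    if cpu_ms > soft_limit_ms:
        return Action.DELAY
    return Action.ALLOW
```

For example, with a soft limit of 10 ms and a hard limit of 50 ms, a VM that used 20 ms would have its requests delayed, while one that used 80 ms would be destroyed.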

@marc-barry
Contributor Author

Since this discussion began I noticed that the situation with documenting proxy-wasm has changed. The following pull requests were merged:

With those you can find the documented spec for the respective versions under https://github.com/proxy-wasm/spec/tree/main/abi-versions.

@martijneken

Could you name some situations in which people need to write trusted extensions? Usually, we trust the developer but not the plugin itself.

You're right @spacewander, thanks for putting a finer point on this. There is absolutely a difference between a vendor like Google running customers' code and 1P extensions for an Envoy owner. The topics then are security vs. production stability. Vendors need both, so I think our interests are still well aligned.

Paraphrasing your list of risks:

  1. Security boundary. Wasm does provide a security boundary. For vendors it may or may not be sufficient, depending on the runtime and the risk profile. Likely N/A for 1P.
  2. Logical compromise. In this respect plugin logic is the same as any server code -- operating on user facing input/output. If compromised, the other protections may contain the impact, depending on the attack.
  3. Memory limits. Yep, these exist and they are globally configurable, see: https://github.com/proxy-wasm/proxy-wasm-cpp-host/blob/main/include/proxy-wasm/limits.h
  4. CPU limits. This doesn't exist today, but this is one of the improvements we want to make soon. Some runtimes have better built-in support than others (e.g. wasm instruction counting), but worst case we can fall back on a watchdog thread that checks CPU time spent by the Envoy/wasm thread. The ideas offered by @johnlanni match ours, with the addition that one could rate-limit VM restarts to prevent abuse.
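The idea of rate-limiting VM restarts can be illustrated with a sliding-window counter. This is only a sketch under stated assumptions: `RestartLimiter` and its parameters are hypothetical names, the caller supplies the clock, and nothing here reflects how Envoy or proxy-wasm-cpp-host would actually implement it.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts VM restarts per window_s seconds."""

    def __init__(self, max_restarts: int, window_s: float):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._restarts = deque()  # timestamps of recently allowed restarts

    def allow_restart(self, now: float) -> bool:
        # Forget restarts that have fallen out of the sliding window.
        while self._restarts and now - self._restarts[0] >= self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False  # too many recent restarts: leave the VM down
        self._restarts.append(now)
        return True
```

A plugin that keeps crashing or tripping the CPU watchdog would then be restarted only a bounded number of times per window instead of looping indefinitely.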

@spacewander
Contributor

@martijneken
Thanks for your information!

@PiotrSikora
Contributor

Could you name some situations in which people need to write trusted extensions? Usually, we trust the developer but not the plugin itself. For example, most of the plugins we run are developed by our teammates. So the technology doesn't need to be fully sandboxed - ensuring our teammate is sane (and code review) is enough. Unless we are lending the Envoy cluster to run the customer's plugins... (maybe that is the Google's use case?)

To add to @martijneken's answer, the isolation provided by Wasm is IMHO quite important and useful even when dealing with plugins authored by trusted developers, since it limits the blast radius in case of non-malicious bugs that could otherwise crash the proxy.

Notably, the rate-limited restart logic is not currently implemented in Envoy, so a buggy plugin might still render the proxy unhealthy, but that's not a limitation of Proxy-Wasm or Wasm in general.

@jcchavezs
Contributor

jcchavezs commented Oct 3, 2024

Thanks @martijneken for holding your tongue, it is better to stay quiet when in lack of information.

I've been holding my tongue but this is the opposite of reality. AIUI http-wasm.io was developed by a single ex-Tetrate employee, but it is not maintained and I don't know of any users. It has a fancy website, but quoting someone more familiar than me: "It is safe to assume it’s a dead project".

http-wasm was developed by a bunch of Tetrate employees (of course there was a leader who contributed the most) as a replacement for proxy-wasm because we were tired of the gatekeeping on the project, the poor leadership and the lack of connection with reality and use cases.

http-wasm landed in dapr (see https://docs.dapr.io/reference/components-reference/supported-middleware/middleware-wasm/) and lately traefik went for it as the way to leverage wasm extensions inside the proxy (https://traefik.io/blog/traefik-3-deep-dive-into-wasm-support-with-coraza-waf-plugin/).

The project is now building a community, and it is a fact that we are mainly focused on Go because wazero had http-wasm in mind as a use case. There were some efforts to port http-wasm to Envoy, but I am not sure what their status is.

Is http-wasm the replacement of proxy-wasm? I don't think so, http-wasm was designed from the lessons learnt from proxy-wasm and we specifically tried to keep a narrow API focused on request/response case. Wazero allows you to combine http-wasm with other ABIs to leverage stuff like distributed tracing or socket connections.

I hope there is movement and proxy-wasm gets proper leadership and it becomes maintainable. I don't know the status of the project right now but as someone who worked full time on building wasm plugins I hope it gets to a good shape.

@martijneken

Thanks for the info @jcchavezs. I'm glad to be wrong about http-wasm, more wasm adoption is good for everyone. My reaction was much more about the claim that ProxyWasm is "abandoned" -- it absolutely is not. We are actively working on it, and in response to the commentary on this thread, we have a draft roadmap nearing publication and will be setting up community meetings to solicit input.

@wbpcode
Member

wbpcode commented Oct 8, 2024

There is no doubt that Wasm extensions are still necessary even once dynamic modules are done.

Dynamic modules provide a great way to implement a dynamic extension with native performance. But it's hard to feel comfortable running third-party dynamic modules on Envoy in a public product.

So they can basically be used for different scenarios.

But one of the core goals of proxy-wasm is to be proxy-independent. That means it's hard to support Envoy-specific features or optimizations (I think multiple people here have noticed this point?). And it also means the proxy may have to keep a limited feature set to stay compatible with different proxies? (Constructing a perfect abstraction that adapts to different proxies is much more complex work.)

Rather than the performance problem (in most cases, if users/developers choose a Wasm extension, I think performance is not their first goal), the more painful thing is that it's still hard to develop a complex extension like external authz with a request body. (Note: 5 years have passed since proxy-wasm was created.)

We built our product on Envoy. From my personal perspective, I would like to fork proxy-wasm and develop it in Envoy's way: fix the issues, resolve the problems found in practice, use it more widely, and then we can discuss the standard or spec.

@mpwarres
Contributor

mpwarres commented Oct 8, 2024

WRT @wbpcode's comment:

We built our product on Envoy. From my personal perspective, I would like to fork proxy-wasm and develop it in Envoy's way: fix the issues, resolve the problems found in practice, use it more widely, and then we can discuss the standard or spec.

I think there are two related but separate considerations: (1) ease/speed of updating Envoy WasmFilter with or without having to manage the external dependency on proxy-wasm-cpp-host, and (2) ability to add Envoy-specific functionality.
For (2), there is already precedent and mechanism in source/extensions/common/wasm/ext for adding Envoy-specific hostcalls. That can also be a good place to "try out" a more general hostcall before adding to the standard proxy-wasm ABI.

For (1), I understand the appeal but am worried about divergence from other host implementations, and also (in the reverse direction) missing out on any bugfixes in Envoy that could also benefit other proxy-wasm-cpp-host users. I think that in practice, most Envoy-specific fixes tend to be in the Envoy-side WasmFilter anyways--if there's a need to change proxy-wasm-cpp-host code, chances are that it's a more general issue.

@wbpcode
Member

wbpcode commented Oct 9, 2024

@mpwarres thanks for the response.

I think actually even in the core ABI, we may also expect some Envoy-specific things like the stop iteration. (I think this is a common thing, but seems like the proxy-wasm doesn't think so.)

And, considering the existence of Envoy-specific features, Envoy still needs to fork the language SDKs to provide these Envoy-specific features to end developers.
Only a few people care about the spec; most extension developers only care about the development framework or SDK. They develop their extensions based on the SDK rather than the spec. A well-designed SDK, together with tools and docs for it, is much more important than the spec to end extension developers.

Also, Envoy needs to provide related docs based on the Envoy-specific SDK.

If we want Envoy's Wasm support to be production ready, all of this work is unavoidable.

@PiotrSikora
Contributor

That means it's hard to support Envoy-specific features or optimizations (I think multiple people here have noticed this point?).

Do you have anything specific in mind?

I think actually even in the core ABI, we may also expect some Envoy-specific things like the stop iteration. (I think this is a common thing, but seems like the proxy-wasm doesn't think so.)

There is nothing Envoy-specific about buffering requests, and the support for StopIteration was removed because Envoy was crashing when using it (see: proxy-wasm/proxy-wasm-cpp-host#95 (comment)), not because of other proxies.

I believe I've mentioned this elsewhere, but we're working on adding support for buffering complete requests in the upcoming ABI update.

And, considering the existence of Envoy-specific features, Envoy still needs to fork the language SDKs to provide these Envoy-specific features to end developers.

Proxy-Wasm supports custom hostcalls and callbacks (e.g. #32127) and the existing SDKs support calling those without the need to fork them.

Only a few people care about the spec; most extension developers only care about the development framework or SDK. They develop their extensions based on the SDK rather than the spec. A well-designed SDK, together with tools and docs for it, is much more important than the spec to end extension developers.

I 100% agree with you, but there is a vocal group that kept blaming all the issues with Proxy-Wasm on the missing specification, so unfortunately that took the priority...

@johnlanni
Contributor

johnlanni commented Oct 9, 2024

There is nothing Envoy-specific about buffering requests, and the support for StopIteration was removed because Envoy was crashing when using it (see: proxy-wasm/proxy-wasm-cpp-host#95 (comment)), not because of other proxies.

@PiotrSikora I believe that instead of removing support for StopIteration due to this crash, the issue can be addressed through checks within the SDK or on the host side to prevent developers from writing erroneous code that leads to Envoy crashes.
StopIteration plays a significant role, and its removal would hinder the implementation of many functionalities. Consequently, we had no choice but to fork the repository in Higress, adjusting the ABI to support returning more value types.

@wbpcode
Member

wbpcode commented Oct 9, 2024

There is nothing Envoy-specific about buffering requests, and the support for StopIteration was removed because Envoy was crashing when using it (see: proxy-wasm/proxy-wasm-cpp-host#95 (comment)), not because of other proxies.

@PiotrSikora I believe that instead of removing support for StopIteration due to this crash, the issue can be addressed through checks within the SDK or on the host side to prevent developers from writing erroneous code that leads to Envoy crashes. StopIteration plays a significant role, and its removal would hinder the implementation of many functionalities. Consequently, we had no choice but to fork the repository in Higress, adjusting the ABI to support returning more value types.

This is why I think we could fork it in Envoy directly and iterate more quickly. Proxy-wasm has already been forked in some ways, for example by Higress from Alibaba Cloud, which I think has a big influence on the adoption of Wasm extensions in the Chinese cloud market.

@johnlanni
Contributor

johnlanni commented Oct 9, 2024

Yes, the Higress community has 40+ wasm plugins, most of which are compatible with official Envoy, but over 10 are not due to the use of the StopIteration feature (like the AI Proxy plugin). Having these plugins locked to Higress is not our intention; Higress's focus is on extending Envoy based on Wasm, and we hope for more non-Higress Envoy vendors to join in building these plugins.

@wbpcode
Member

wbpcode commented Oct 9, 2024

There is nothing Envoy-specific about buffering requests, and the support for StopIteration was removed because Envoy was crashing when using it (see: proxy-wasm/proxy-wasm-cpp-host#95 (comment)), not because of other proxies.

I believe I've mentioned this elsewhere, but we're working on adding support for buffering complete requests in the upcoming ABI update.

I think we should treat it as a bug and fix it. There is no way to completely forbid an extension from performing harmful operations.

Do you have anything specific in mind?

filter state, dynamic metadata, same route cache control with Envoy, modification of route, route specific configuration, etc.

I believe I've mentioned this elsewhere, but we're working on adding support for buffering complete requests in the upcoming ABI update.

I personally think supporting stop iteration and creating a beginner-friendly wrapper in the SDK would be the better way to do this.

Proxy-Wasm supports custom hostcalls and callbacks (e.g. #32127) and the existing SDKs support calling those without the need to fork them.

I know we can call it with proxy_call_foreign_function.

But I am not sure it's a good choice to make end developers call CallForeignFunction and handle the parameters' serialization themselves. For example:

CallForeignFunction("set_envoy_filter_state", <serialized_proto>);

This just sacrifices the experience of end extension developers. I think the requirements and experience of end extension developers are the most important thing. They create the actual value of Wasm extensions. We are just a provider of tools. If we cannot provide a good experience or cannot address their requirements, then they will choose other tools, like Lua, dynamic modules, third-party forks, etc.

@wbpcode
Member

wbpcode commented Oct 9, 2024

Yes, the Higress community has 40+ wasm plugins, most of which are compatible with official Envoy, but over 10 are not due to the use of the StopIteration feature (like the AI Proxy plugin). Having these plugins locked to Higress is not our intention; Higress's focus is on extending Envoy based on Wasm, and we hope for more non-Higress Envoy vendors to join in building these plugins.

I think the current route cache control and body modification are also annoyances for you 🤣

@johnlanni
Contributor

@wbpcode Yes, to achieve this, we also made some minor hacks to Envoy, but they are not ABI-incompatible changes. I am willing to create an issue later to outline the capabilities we have implemented that are not currently satisfied by the official repo, so everyone can discuss which ones are worth being officially implemented.

@wbpcode
Member

wbpcode commented Oct 9, 2024

Theoretically, get/set property plus foreign functions could do almost everything without changing the ABI functions' signatures.
But the key isn't only the signatures; it is the feature set that is exposed to end developers.

Once we have forked and hacked the SDK + host (Envoy) to provide some specific features, the intermediate bridge (the proxy-wasm host lib) and the ABI can no longer represent the actual feature set, so compatibility has actually been broken.

@PiotrSikora
Contributor

Yes, the Higress community has 40+ wasm plugins,

That's awesome!

I personally think supporting stop iteration and creating a beginner-friendly wrapper in the SDK would be the better way to do this.

But you need a new ABI version for that, otherwise you'll have new plugins returning StopIteration that are deployed on older versions of Envoy/Istio/X, where they behave differently than expected.
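To illustrate why a version bump helps: if changed return-value semantics ship under a new ABI version, an older host can refuse the plugin at load time instead of silently interpreting StopIteration differently. The following is a minimal sketch of that idea; the version tuples, `can_load`, and `load_plugin` are illustrative assumptions and not how any particular host performs the check.

```python
# ABI versions this toy host understands (illustrative values only).
HOST_SUPPORTED_ABIS = {(0, 1, 0), (0, 2, 0), (0, 2, 1)}


def can_load(plugin_abi: tuple) -> bool:
    """Accept a plugin only if the host implements the ABI it was built for."""
    return plugin_abi in HOST_SUPPORTED_ABIS


def load_plugin(name: str, plugin_abi: tuple) -> str:
    if not can_load(plugin_abi):
        version = ".".join(str(part) for part in plugin_abi)
        raise RuntimeError(f"{name}: ABI {version} is not supported by this host")
    return f"{name} loaded"
```

A plugin built against a hypothetical newer ABI, for example `(0, 3, 0)`, would fail loudly on this host rather than run with unexpected behavior.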

I know we can call it with proxy_call_foreign_function.

But I am not sure it's a good choice to make end developers call CallForeignFunction and handle the parameters' serialization themselves. For example:

CallForeignFunction("set_envoy_filter_state", <serialized_proto>);

This just sacrifices the experience of end extension developers.

Right, but you can easily add first-class wrappers to the SDKs for those, so that end-users won't be able to tell whether it's a standardized ABI call or a foreign function, and once the interface is proven, then it can be included in the next ABI version.

Fast iteration and prototyping is exactly what this interface was designed for, and it doesn't require forking anything.
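A first-class wrapper of this kind might look roughly like the sketch below. It is a host-independent illustration, not code from any Proxy-Wasm SDK: `make_set_filter_state` and `fake_host` are hypothetical names, and JSON stands in for the serialized protobuf that the real `set_envoy_filter_state` hostcall expects.

```python
import json
from typing import Callable

# Signature of a generic foreign-function entry point: (name, payload) -> result.
ForeignCall = Callable[[str, bytes], bytes]


def make_set_filter_state(call_foreign_function: ForeignCall):
    """Build a typed wrapper so plugin authors never serialize arguments by hand."""

    def set_filter_state(path: str, value: str) -> None:
        # Stand-in encoding; the real hostcall takes a serialized protobuf.
        payload = json.dumps({"path": path, "value": value}).encode("utf-8")
        call_foreign_function("set_envoy_filter_state", payload)

    return set_filter_state


# A fake host that records calls, used only to exercise the wrapper.
recorded = []


def fake_host(name: str, payload: bytes) -> bytes:
    recorded.append((name, json.loads(payload)))
    return b""


set_filter_state = make_set_filter_state(fake_host)
set_filter_state("my.plugin.user", "alice")
```

The plugin author only sees `set_filter_state(path, value)`; whether it is backed by a foreign function or a standardized ABI call is hidden behind the wrapper.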

@PiotrSikora
Contributor

filter state

This is already supported as a custom set_envoy_filter_state hostcall.

dynamic metadata

Is this also used in presence of filter state? I thought only one of those was supposed to be used going forward? See: #4929

same route cache control with Envoy

This indeed is Envoy-specific, but we already have a dedicated issue for it: proxy-wasm/proxy-wasm-cpp-host#421

modification of route

Do you mean selection of the upstream server/cluster or something else?

route specific configuration

Do you mean per-route plugin configuration? Isn't this supported in Envoy by composite filter? In any case, Proxy-Wasm already supports running many instances of the same plugin with different configurations inside the WasmVM, so this seems more of a host implementation issue.

@wbpcode
Member

wbpcode commented Oct 9, 2024

This is already supported as a custom set_envoy_filter_state hostcall.

Yeah. But we still need to wrap it to avoid exposing it to end developers in its raw form.

Is this also used in presence of filter state? I thought only one of those was supposed to be used going forward? See: #4929

AFAIK, some auth filters will prefer the dynamic metadata to store json-like data.

Do you mean selection of the upstream server/cluster or something else?

I mean allowing the filter to change the upstream cluster by rewriting the route.

Do you mean per-route plugin configuration?

Yeah.

Isn't this supported in Envoy by composite filter?

The composite filter actually makes things more complex and harder for users to use.

In any case, Proxy-Wasm already supports running many instances of the same plugin with different configurations inside the WasmVM, so this seems more of a host implementation issue.

Considering the overhead of a WasmVM, it's hard to say that running lots of different VM instances is a good solution.

But anyway, yeah, maybe we also need to rethink this problem on the host side. Lots of problems are hard to resolve from a single side.

@wbpcode
Member

wbpcode commented Oct 9, 2024

Right, but you can easily add first-class wrappers to the SDKs for those, so that end-users won't be able to tell whether it's a standardized ABI call or a foreign function, and once the interface is proven, then it can be included in the next ABI version.
Fast iteration and prototyping is exactly what this interface was designed for, and it doesn't require forking anything.

I didn't get it. I think the SDK maintained by proxy-wasm should also be proxy-independent? Otherwise it just confuses end developers. They use the SDKs and develop an extension on their platform, then find that lots of interfaces actually don't work?

@botengyao
Member

botengyao commented Oct 15, 2024

Yes, the Higress community has 40+ wasm plugins, most of which are compatible with official Envoy, but over 10 are not due to the use of the StopIteration feature (like the AI Proxy plugin). Having these plugins locked to Higress is not our intention; Higress's focus is on extending Envoy based on Wasm, and we hope for more non-Higress Envoy vendors to join in building these plugins.

@johnlanni, off the wasm topic, and noticed Alibaba Higress is using Envoy, which is great! Does Alibaba plan to be on the security distributor list to receive and report early CVE notifications under embargo?

@johnlanni
Contributor

@botengyao Absolutely, and it appears we meet the requirements to join the distributor list. I have already sent an email.

@PiotrSikora
Contributor

The composite filter actually makes things more complex and harder for users to use.

Agreed, but when we were adding Wasm to Envoy, we were told not to handle this ourselves, and instead use the (still under development at the time) composite filters, which were supposed to address this problem for all Envoy extensions.

In any case, this should be already solved in Proxy-Wasm (but perhaps it needs to be glued together with the per-route configuration in Envoy), since you can have multiple plugin instances with different configurations, so route configuration A and route configuration B can be instantiated as plugin with configuration A and plugin with configuration B... unless I'm missing something?

Considering the overhead of a WasmVM, it's hard to say that running lots of different VM instances is a good solution.

But anyway, yeah, maybe we also need to rethink this problem on the host side. Lots of problems are hard to resolve from a single side.

Proxy-Wasm already supports running multiple instances of the same plugin with different configurations inside the same WasmVM (i.e. many-to-one), so there is no extra overhead here.

This is how the configuration reloads are handled in Envoy, and how the same plugin is used with different configurations in different filter chains (assuming the same vm_id).

I didn't get it. I think the SDK maintained by proxy-wasm should also be proxy-independent? Otherwise it just confuses end developers. They use the SDKs and develop an extension on their platform, then find that lots of interfaces actually don't work?

Yes and no. I want to be as conservative as possible, but at the same time we should prevent unnecessary fragmentation of the Proxy-Wasm ecosystem and avoid splitting the already limited engineering resources.

Based on the discussion in this thread, if we ignore the generic features that will be added in the upcoming ABI update (e.g. complete request buffering) and things that should be handled in Envoy (route cache control and per-route configuration), then it seems that there are very few Envoy-specific features (filter state & dynamic metadata).

As such, it's probably more productive to add clearly named wrappers to the existing SDKs for those 2 or 4 (for both getters and setters) custom hostcalls, than to fork away, which might prevent plugins written using those alternative SDKs from running on non-Envoy Proxy-Wasm hosts, even when they don't require any Envoy-specific features.
