Add support for multiple wasm QueryPlanner objects #226
Conversation
Fixes apollographql#210. I considered using a HashMap, but it's a little more complicated, and I'm assuming we won't have people creating many hundreds of query planners 🙃 If that's a use case we want to support, we could move to a HashMap.
@jaredly: Thank you for submitting a pull request! Before we can merge it, you'll need to sign the Apollo Contributor License Agreement here: https://contribute.apollographql.com/
query-planner-wasm/src/lib.rs (outdated):

```rust
/// Most applications will have a single query planner that they use
/// for the duration of the app's lifetime, but if you are working
```
This sentence is a bit misleading: one of the ways Apollo Gateway can run is by polling for downstream schema changes, composing a new federated schema (csdl), and updating the runtime schema on the fly.
That calls into question whether this is the right solution, because some people do, by design, create a lot of new schemas during the runtime of the server.
I like that you've added the function to drop a planner; if we go with this approach, we'd have to make sure to drop the old planner whenever a schema updates (i.e. distinguish schema changes from additions).
At the root of the problem, though, is that there is one global wasm module loaded while there can be multiple ApolloGateway instances. We might want to solve this at the root, if possible, rather than adapting the global model to a local one.
Let me stew on this for a bit.
Generally though, the use of an index is indeed better than a pointer!
I'll go ahead and change the wording here. I've now made changes to gateway-js to use this function when needed.
As for the test failure, I'm fixing that wasm-pack installation issue separately, rebase your branch once #228 is merged.
The current suggestion has a critical bug. I've made a suggestion if you want to keep at it.
query-planner-wasm/src/lib.rs (outdated):

```rust
let id = SCHEMA.len();
SCHEMA.push(Some(schema));
// No need to re-parse, we can just clone!
DATA.push(DATA[i].clone());
```
I see why you went with this approach: if two calls used the same index and one of them dropped the planner, the other would be left holding a dangling index.

Why not clone?
We've been intentional about not implementing `Clone` for structs that can potentially be very large. For the `QueryPlanner` case, its data is the parsed schema AST, which can be a very large, deeply nested tree. One of the design tenets for this gateway is to be able to operate over very large schemas (e.g. imagine an entire enterprise's data graph, composed from 100+ different services), and cloning that would create a noticeable performance degradation.
Cloning here also leaves the gateway susceptible to undue memory pressure: if a user misuses the js library, or even in the normal case where a schema composition event produces no change in the resulting composed schema, we'd keep cloning the same planner over and over.
Generally speaking, using an `Rc` would solve that, but there's a bug here.

Bug: `QueryPlanner` references the schema `String`.
This is the definition:

```rust
pub struct QueryPlanner<'s> {
    pub schema: schema::Document<'s>,
}
```

That `'s` is the lifetime referenced by `Document`, which, if you trace where it ends up coming from, is the lifetime of the original `&str` that contains the schema. So when we put the schema in `SCHEMA`, we then pass a reference to it into the parser. By cloning (even if this were an `Rc`), the cloned planner still references the `String` that was used to create the original planner. When that planner and its schema are dropped, any cloned planner is left with dangling references.
How should we fix the root issue?
We've not yet had a chance to discuss this, and we appreciate the contribution and ideas; I think they may inform an eventual solution. Ideally I'd like to figure out whether we can fix the problem outside the surface area of these objects somehow.

Idea?
Just had a thought: maybe we should add a static `GATEWAYS` vector containing a schema and planner per js `ApolloGateway` instance?
Meaning, if you have 2 `ApolloGateway` instances, each of which is constantly replacing its schema, you'd have 2 items in the `GATEWAYS` vector, and for each gateway we'd still have just one schema and planner that gets replaced.
In that situation we wouldn't have to implement any `drop` (which really adds a lot of complexity).
That might be a better design. Doing that would mean introducing some breaking API changes; the names would match what we're returning:

```rust
#[wasm_bindgen(js_name = getGatewayBridge)]
pub fn get_gateway_bridge(schema: JsString) -> usize { .. }

#[wasm_bindgen(js_name = getQueryPlan)]
pub fn get_query_plan(gateway_idx: usize, query: &str, options: &JsValue) -> JsValue { .. }
```
(Noting that this would be a breaking API change, which will be relevant when releasing)
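To make the idea concrete, here's a minimal sketch of a `GATEWAYS`-style store (the `Gateway` type and method names are hypothetical, not the actual bridge code): updating a schema replaces the slot in place, so the vector stays bounded at one entry per `ApolloGateway` and no explicit `drop` API is needed.

```rust
// Hypothetical sketch: each slot owns its schema (and, in the real bridge,
// the planner built from it). Replacing a slot drops the old contents.
struct Gateway {
    schema: String,
}

struct Gateways {
    slots: Vec<Gateway>,
}

impl Gateways {
    fn new() -> Self {
        Gateways { slots: Vec::new() }
    }

    // Returns the index that JS would hold as its gateway handle.
    fn get_gateway_bridge(&mut self, schema: String) -> usize {
        self.slots.push(Gateway { schema });
        self.slots.len() - 1
    }

    // Replacing a schema reuses the slot; the old schema (and planner)
    // are dropped automatically, so memory stays bounded per gateway.
    fn update_schema(&mut self, idx: usize, schema: String) {
        self.slots[idx] = Gateway { schema };
    }
}
```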
@Enrico2 thanks for the great review! I've reworked my solution to fix the bugs, and I think I've got a pretty neat solution that doesn't involve any breaking changes! Memory access is localized, and cloning is done behind an `Rc`.
@jaredly Thanks for the update! We're discussing how to proceed with this. This adds some level of complexity that ideally I'd like to avoid, so I'm researching whether there are other ways to solve the problem. Stay tuned!
Take a look at #261 as a proposed simpler alternative.
hmmm @Enrico2 it's simpler because it doesn't allow you to free memory when you're done with an ApolloGateway.
@jaredly It's simpler because it removes the need for memory management from JS while still allowing multiple Gateways to run in the same process, as well as updating a gateway's schema on the fly without leaking memory. ApolloGateway is not meant to "be done with" — under what circumstances would you be done with it while keeping the node process around?
(@jaredly is a colleague of mine, working on the same project - he's also far more fluent in rust/wasm than I am).
From an API level, it looks like our tests pass now with #261 as well as with this PR, so that's a definite improvement over main (where our tests fail due to the overwriting of the schema). I haven't had a chance to profile the memory consumption of the rust query planner yet, but if it's comparable to the JS query plan code we'll definitely need to worry about freeing unused gateway objects, and so would greatly prefer this PR over #261.
We currently create ApolloGateway objects as part of our deploy testing flow to allow devs to test out and validate schema changes in production before those changes are rolled out to 'real' users. For example, if a dev visits test-schema-version.example.com, we create (and cache) an ApolloGateway object for that schema version.

Each ApolloGateway we create uses up a fair bit of memory, so we have a fairly small LRU cache set up that evicts ApolloGateway objects when they are no longer used. This is fine for our use case since we only need a max of ~5 test versions at any given time, but there might be ~50 such versions created over the course of a day, so we can't really have the memory usage grow unbounded. Happy to go into more detail if any of that is unclear, and very much appreciate your time looking into this issue!
Copying and pasting my comment from #261 (comment):
Thanks very much for opening this PR in the first place; it's a very much appreciated contribution, and hopefully this simplifies the configuration in your deploy testing.
oh neat! Does this mean y'all are moving away from rust/wasm? (I see reducing the number of languages in a project as a positive thing 😅)
@jaredly We're still excited about Rust (and hiring for it!). Overall, this will allow us to iterate, improve, and learn on the Federation model as a whole, bringing new features and bug fixes and adding value for many current users.
One nice thing about passing around an index instead of a pointer is that it's much more robust :) Rust doesn't really make guarantees about raw pointers remaining valid in the presence of allocations.
cc @Enrico2 looks like you did the first impl of this stuff?
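The robustness point above can be sketched directly (an illustrative example, not the bridge code): a `Vec` index remains a valid handle after the vector reallocates, while a raw pointer into the buffer may dangle.

```rust
fn main() {
    let mut planners: Vec<String> = Vec::with_capacity(1);
    planners.push("planner-0".to_string());

    let idx = 0usize;                        // handle by index
    let ptr = &planners[0] as *const String; // handle by raw pointer

    // Pushing past capacity may reallocate and move the buffer.
    for i in 1..100 {
        planners.push(format!("planner-{}", i));
    }

    // The index still resolves to the same element:
    assert_eq!(planners[idx], "planner-0");

    // Dereferencing `ptr` here would be undefined behavior if the buffer
    // moved; Rust makes no guarantee it still points at planners[0].
    let _ = ptr;
}
```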