Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for multiple wasm QueryPlanner objects #226

Closed
wants to merge 8 commits into from

Conversation

jaredly
Copy link

@jaredly jaredly commented Oct 12, 2020

Fixes #210

I considered using a HashMap, but it's a little more complicated, and I'm assuming we won't have people creating many hundreds of query plans 馃檭 if that's a use case we want to support, we could move to HashMaps.

One nice thing about passing around an index instead of a pointer is that it's much more robust :) rust doesn't really make guarantees about raw pointers remaining valid in the presence of allocations.

cc @Enrico2 looks like you did the first impl of this stuff?

Fixes apollographql#210

I considered using a HashMap, but it's a little more complicated, and I'm assuming we won't have people creating many hundreds of query plans 馃檭 if that's a use case we want to support, we could move to HashMaps.
@apollo-cla
Copy link

@jaredly: Thank you for submitting a pull request! Before we can merge it, you'll need to sign the Apollo Contributor License Agreement here: https://contribute.apollographql.com/

@jaredly
Copy link
Author

jaredly commented Oct 12, 2020

image
here's the test failure 馃檭

Comment on lines 37 to 38
/// Most applications will have a single query planner that they use
/// for the duration of the app's lifetime, but if you are working
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is a bit misleading; One of the ways in which Apollo Gateway can run is by polling for downstream schema changes, composing a new federated schema (csdl) and updating the runtime schema on the fly.

That fact brings to question if this is the right solution, because some people do in fact create a lot of new schemas during the runtime of the server by design.
I like that you've added the function to drop a planner, if we want to go with this approach, we'd have to make sure and drop the old planner whenever a schema updates (i.e. identify schema changes vs. additions).

At the root of the problem though, is that there is one global wasm module loaded, but there can be multiple ApolloGateway instances. We might want to solve this at the root of the problem, if it's possible, rather than adapting the global model to a local one.

Let me stew on this for a bit.

Generally though, the use of an index is indeed better than a pointer!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go ahead and change the wording here. I've now made changes to gateway-js to use this function when needed

Copy link
Contributor

@Enrico2 Enrico2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the test failure, I'm fixing that wasm-pack installation issue separately, rebase your branch once #228 is merged.

Copy link
Contributor

@Enrico2 Enrico2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current suggestion has a critical bug. I've made a suggestion if you want to keep at it.

let id = SCHEMA.len();
SCHEMA.push(Some(schema));
// No need to re-parse, we can just clone!
DATA.push(DATA[i].clone());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why you went with this approach, 2 calls using the same index and one of them dropping the planner would cause a dangling reference to an index.

Why not clone?

We've been intentional about not implementing Clone for structs that can potentially be very large. For the QueryPlanner case, its data is the parsed schema ast, which can be a very large deeply nested tree. One of the design tenets for this gateway is to be able to operate over very large schemas (e.g. imagine an entire enterprise's data graph, composed from 100+ different services) and cloning that would create a noticeable performance degradation.

Cloning here creates a state where the gateway will be susceptible to undue memory pressure, if a user mis-uses the js library or even in the normal case, if there is a schema composition event but no change in the resulting composed schema, and we keep cloning the same planner over and over.

Generally speaking, using an Rc would solve that, but there's a bug here....

Bug: QueryPlanner references the schema String.

This is the definition:

pub struct QueryPlanner<'s> {
    pub schema: schema::Document<'s>,
}

that 's is the lifetime referenced by Document, which, if you look at where that ends up coming from, is the lifetime of the original &str that contains the schema. So when we put the schema in SCHEMA, we then use a reference to it to pass to the parser. By cloning (even if this was an Rc), the cloned planner still references the String that was used to create the original planner. When dropping that planner and its schema, any cloned planner will have dangling references.

How should we fix the root issue?

We've not had a chance to discuss this, and we appreciate the contribution and ideas, I think they may inform an eventual solution. Ideally I'd like to figure out if we can fix the problem outside of the surface area of these objects somehow.

Idea?

Just had a thought, maybe we should be adding a static GATEWAYS vector that will contain a schema and planner, which would map to a js ApolloGateway instance?
Meaning, if you have 2 ApolloGateway instances, each of which is constantly replacing its schema, you'd have 2 items in a GATEWAYS vector, and for each gateway we'd still just have one schema and planner that's being replaced?
In that situation we'd not have to implement any drop (which really adds a lot of complexity)

That might be a better design. Doing that would mean introducing some breaking API changes, the names would match what we're returning:

#[wasm_bindgen(js_name = getGatewayBridge)]
pub fn get_gateway_bridge(schema: JsString) -> usize { .. }

#[wasm_bindgen(js_name = getQueryPlan)]
pub fn get_query_plan(gateway_idx: usize, query: &str, options: &JsValue) -> JsValue { .. }

(Noting that this would be a breaking API change, which will be relevant when releasing)

@jaredly
Copy link
Author

jaredly commented Oct 21, 2020

@Enrico2 thanks for the great review! I've reworked my solution to fix the bugs, and I think I've got a pretty neat solution that doesn't involve any breaking changes! Memory access is localized, and cloning is done behind an Rc which makes it very cheap.

@jaredly jaredly requested a review from Enrico2 October 26, 2020 16:38
@Enrico2
Copy link
Contributor

Enrico2 commented Oct 26, 2020

@jaredly Thanks for the update! We're discussing how to proceed with this. This adds some level of complexity that ideally I'd like to avoid, and so I'm researching if there are other ways to solve the problem. Stay tuned!

Enrico2 pushed a commit that referenced this pull request Oct 28, 2020
This PR makes it so that getting a query planner creates a new planner every time and never overrides it.
We've added a new function named `updatePlannerSchema` that overrides a specific planner.

This hopefully fixes #210
This implementation is suggested instead of #226 to keep things simple.
@Enrico2
Copy link
Contributor

Enrico2 commented Oct 29, 2020

Take a look at #261 as a proposed simpler alternative.

@jaredly
Copy link
Author

jaredly commented Oct 30, 2020

hmmm @Enrico2 it's simpler because it doesn't allow you to free memory if you're done with an ApolloGateway object?

@Enrico2
Copy link
Contributor

Enrico2 commented Oct 30, 2020

hmmm @Enrico2 it's simpler because it doesn't allow you to free memory if you're done with an ApolloGateway object?

@jaredly It's simpler because it removes the need for memory management from JS while still allowing multiple Gateways to run in the same process, as well as update a gateway's schema on the fly without leaking memory.

ApolloGateway is not meant to "be done with", under what circumstances would you be done with it while keeping the node process around?

@dkapadia
Copy link

dkapadia commented Nov 2, 2020

(@jaredly is a colleague of mine, working on the same project - he's also far more fluent in rust/wasm than I am).

Take a look at #261 as a proposed simpler alternative.

From an API level, it looks like our tests pass now with #261 as well as this PR, so that's a definite improvement over main (where our tests fail due to the overwriting of the schema). I haven't had a chance to profile the memory consumption of the rust query planner yet, but if it's comparable to the JS query plan code we'll definitely need to worry about freeing unused gateway objects and so would greatly prefer this PR over #261.

ApolloGateway is not meant to "be done with", under what circumstances would you be done with it while keeping the node process around?

We currently create ApolloGateway objects as part of our deploy testing flow to allow devs to test out and validate schema changes in production before those changes are rolled out to 'real' users. For example, if dev visits test-schema-version.example.com, we create (and cache) an ApolloGateway object for test-schema-version, then execute the request using that newly created gateway, instead of the gateway used for requests sent to example.com (which uses the production schema for end users). Once they are happy testing against the test-schema, we make the test-schema the default production schema, by updating the production ApolloGateway's schema.

Each ApolloGateway we create uses up a fair bit of memory, so we have a fairly small LRU cache setup that evicts unused ApolloGateway objects when they are no longer used. This is fine for our use case since we only need a max of ~5 test versions at any given time, but there might be ~50 such versions created over the course of a day, so we can't really have the memory usage grow unbounded.

Happy to go into more detail if any of that is unclear, and very much appreciate your time looking into this issue!

@abernix
Copy link
Member

abernix commented Apr 16, 2021

Copying and pasting my comment from #261 (comment):

This should no longer be necessary with the landing of #622 (specifically, the changing of the query planner back to TypeScript). Put another way, the code this changes is no longer present and the query planners are now scoped within the Gateways that generate and execute them.

Thanks very much for opening this PR in the first place; very much appreciated contribution, and hopefully this simplifies your configuration in your deploy testing.

@abernix abernix closed this Apr 16, 2021
@jaredly
Copy link
Author

jaredly commented Apr 16, 2021

oh neat! Does this mean y'all are moving away from rust/wasm? (I see reducing the number of languages in a project as a positive thing 馃槄)

@abernix
Copy link
Member

abernix commented Apr 16, 2021

@jaredly We're still excited about Rust (and hiring for it!). For @apollo/gateway in particular though, we are focused on iterating on the existing codebase in TypeScript (read: reducing the number of languages in this repository). We didn't necessarily want to be maintaining multiple languages within a project for long, though that was part of what unfolded in this experiment. Happy to get back to a single language here and simplify the packaging though, for sure.

Overall, this will allow us to iterate, improve and learn on the Federation model as a whole, bringing new features and bug fixes + adding value for many current users.

@apollographql apollographql locked and limited conversation to collaborators Apr 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for multiple wasm query planner objects?
5 participants