Add support for multiple wasm QueryPlanner objects #226
Conversation
Fixes apollographql#210. I considered using a HashMap, but it's a little more complicated, and I'm assuming we won't have people creating many hundreds of query planners 🙃 If that's a use case we want to support, we could move to a HashMap.
@jaredly: Thank you for submitting a pull request! Before we can merge it, you'll need to sign the Apollo Contributor License Agreement here: https://contribute.apollographql.com/
query-planner-wasm/src/lib.rs (outdated):

```rust
/// Most applications will have a single query planner that they use
/// for the duration of the app's lifetime, but if you are working
```
This sentence is a bit misleading: one of the ways Apollo Gateway can run is by polling for downstream schema changes, composing a new federated schema (csdl), and updating the runtime schema on the fly.
That calls into question whether this is the right solution, because some people do, by design, create a lot of new schemas during the runtime of the server.
I like that you've added the function to drop a planner; if we go with this approach, we'd have to make sure to drop the old planner whenever a schema updates (i.e. distinguish schema changes from additions).
At the root of the problem, though, is that there is one global wasm module loaded while there can be multiple ApolloGateway instances. We might want to solve this at the root, if possible, rather than adapting the global model to a local one.
Let me stew on this for a bit.
Generally though, the use of an index is indeed better than a pointer!
I'll go ahead and change the wording here. I've now made changes to gateway-js to use this function when needed.
As for the test failure, I'm fixing that wasm-pack installation issue separately, rebase your branch once #228 is merged.
The current suggestion has a critical bug. I've made a suggestion if you want to keep at it.
query-planner-wasm/src/lib.rs (outdated):

```rust
let id = SCHEMA.len();
SCHEMA.push(Some(schema));
// No need to re-parse, we can just clone!
DATA.push(DATA[i].clone());
```
I see why you went with this approach: if two calls used the same index and one of them dropped the planner, the other would be left holding a dangling index.

Why not clone?
We've been intentional about not implementing `Clone` for structs that can potentially be very large. For the `QueryPlanner` case, its data is the parsed schema AST, which can be a very large, deeply nested tree. One of the design tenets for this gateway is to be able to operate over very large schemas (e.g. imagine an entire enterprise's data graph, composed from 100+ different services), and cloning that would create a noticeable performance degradation.
Cloning here also leaves the gateway susceptible to undue memory pressure: if a user misuses the js library, or even in the normal case where a schema composition event produces no change in the resulting composed schema, we'd keep cloning the same planner over and over.
Generally speaking, using an `Rc` would solve that, but there's a bug here.

Bug: `QueryPlanner` references the schema `String`.
This is the definition:

```rust
pub struct QueryPlanner<'s> {
    pub schema: schema::Document<'s>,
}
```

That `'s` is the lifetime referenced by `Document`, which, if you trace where it ends up coming from, is the lifetime of the original `&str` that contains the schema. So when we put the schema in `SCHEMA`, we then pass a reference to it into the parser. By cloning (even if this were an `Rc`), the cloned planner still references the `String` that was used to create the original planner. When that planner and its schema are dropped, any cloned planner is left with dangling references.
How should we fix the root issue?
We've not yet had a chance to discuss this, and we appreciate the contribution and ideas; I think they may inform an eventual solution. Ideally I'd like to figure out whether we can fix the problem outside the surface area of these objects somehow.

Idea?
Just had a thought: maybe we should add a static `GATEWAYS` vector containing a schema and planner per js `ApolloGateway` instance?
Meaning, if you have 2 `ApolloGateway` instances, each of which is constantly replacing its schema, you'd have 2 items in the `GATEWAYS` vector, and for each gateway we'd still have just one schema and planner that gets replaced.
In that situation we wouldn't have to implement any `drop` (which really adds a lot of complexity).
That might be a better design. Doing that would mean introducing some breaking API changes; the names would match what we're returning:

```rust
#[wasm_bindgen(js_name = getGatewayBridge)]
pub fn get_gateway_bridge(schema: JsString) -> usize { .. }

#[wasm_bindgen(js_name = getQueryPlan)]
pub fn get_query_plan(gateway_idx: usize, query: &str, options: &JsValue) -> JsValue { .. }
```
(Noting that this would be a breaking API change, which will be relevant when releasing)
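To make the idea concrete, here's a minimal sketch of a `GATEWAYS`-style store (the `Gateway` type and method names are hypothetical, not the actual bridge code): updating a schema replaces the slot in place, so the vector stays bounded at one entry per `ApolloGateway` and no explicit `drop` API is needed.

```rust
// Hypothetical sketch: each slot owns its schema (and, in the real bridge,
// the planner built from it). Replacing a slot drops the old contents.
struct Gateway {
    schema: String,
}

struct Gateways {
    slots: Vec<Gateway>,
}

impl Gateways {
    fn new() -> Self {
        Gateways { slots: Vec::new() }
    }

    // Returns the index that JS would hold as its gateway handle.
    fn get_gateway_bridge(&mut self, schema: String) -> usize {
        self.slots.push(Gateway { schema });
        self.slots.len() - 1
    }

    // Replacing a schema reuses the slot; the old schema (and planner)
    // are dropped automatically, so memory stays bounded per gateway.
    fn update_schema(&mut self, idx: usize, schema: String) {
        self.slots[idx] = Gateway { schema };
    }
}
```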
@Enrico2 thanks for the great review! I've reworked my solution to fix the bugs, and I think I've got a pretty neat solution that doesn't involve any breaking changes! Memory access is localized, and cloning is done behind an `Rc`.
@jaredly Thanks for the update! We're discussing how to proceed with this. This adds some level of complexity that ideally I'd like to avoid, so I'm researching whether there are other ways to solve the problem. Stay tuned!
Take a look at #261 as a proposed simpler alternative.
hmmm @Enrico2 it's simpler because it doesn't allow you to free memory when you're done with an ApolloGateway.
@jaredly It's simpler because it removes the need for memory management from JS while still allowing multiple Gateways to run in the same process, as well as updating a gateway's schema on the fly without leaking memory. ApolloGateway is not meant to "be done with" — under what circumstances would you be done with it while keeping the node process around?
(@jaredly is a colleague of mine, working on the same project - he's also far more fluent in rust/wasm than I am).
From an API level, it looks like our tests pass now with #261 as well as with this PR, so that's a definite improvement over main (where our tests fail due to the overwriting of the schema). I haven't had a chance to profile the memory consumption of the rust query planner yet, but if it's comparable to the JS query plan code we'll definitely need to worry about freeing unused gateway objects, and so would greatly prefer this PR over #261.
We currently create ApolloGateway objects as part of our deploy testing flow to allow devs to test out and validate schema changes in production before those changes are rolled out to 'real' users. For example, if a dev visits test-schema-version.example.com, we create (and cache) an ApolloGateway object for that schema version.

Each ApolloGateway we create uses up a fair bit of memory, so we have a fairly small LRU cache set up that evicts ApolloGateway objects when they are no longer used. This is fine for our use case since we only need a max of ~5 test versions at any given time, but there might be ~50 such versions created over the course of a day, so we can't really have the memory usage grow unbounded. Happy to go into more detail if any of that is unclear, and very much appreciate your time looking into this issue!
Copying and pasting my comment from #261 (comment):
Thanks very much for opening this PR in the first place; it's a very much appreciated contribution, and hopefully this simplifies the configuration in your deploy testing.
oh neat! Does this mean y'all are moving away from rust/wasm? (I see reducing the number of languages in a project as a positive thing 😅)
@jaredly We're still excited about Rust (and hiring for it!). Overall, this will allow us to iterate, improve, and learn on the Federation model as a whole, bringing new features and bug fixes and adding value for many current users.
One nice thing about passing around an index instead of a pointer is that it's much more robust :) Rust doesn't really make guarantees about raw pointers remaining valid in the presence of allocations.
cc @Enrico2 looks like you did the first impl of this stuff?
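The robustness point above can be sketched directly (an illustrative example, not the bridge code): a `Vec` index remains a valid handle after the vector reallocates, while a raw pointer into the buffer may dangle.

```rust
fn main() {
    let mut planners: Vec<String> = Vec::with_capacity(1);
    planners.push("planner-0".to_string());

    let idx = 0usize;                        // handle by index
    let ptr = &planners[0] as *const String; // handle by raw pointer

    // Pushing past capacity may reallocate and move the buffer.
    for i in 1..100 {
        planners.push(format!("planner-{}", i));
    }

    // The index still resolves to the same element:
    assert_eq!(planners[idx], "planner-0");

    // Dereferencing `ptr` here would be undefined behavior if the buffer
    // moved; Rust makes no guarantee it still points at planners[0].
    let _ = ptr;
}
```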