use spawn_blocking for parsing #5235

Closed

Conversation

xuorig
Contributor

@xuorig xuorig commented May 24, 2024

Note

Superseded by #5582

I'm investigating an issue where it looks like query parsing / validation becomes extremely slow, leading the router to stop serving requests entirely.

Given that we frequently see parsing take more than a second (possibly something to investigate on its own), it seems wise not to block a worker while we do it. This PR uses spawn_blocking to allow the workers to serve other requests. Note that in similar scenarios this could exhaust tokio's pool of blocking threads, but at least the runtime would remain unblocked.
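
A minimal sketch of the idea, with hypothetical `parse_and_validate` / `ParsedQuery` names standing in for the router's actual parsing path:

```rust
use tokio::task::spawn_blocking;

// Hypothetical stand-ins for the router's real types and parsing entry point.
struct ParsedQuery;

fn parse_and_validate(_source: &str) -> Result<ParsedQuery, String> {
    // CPU-heavy parsing/validation would happen here.
    Ok(ParsedQuery)
}

async fn parse_off_the_executor(source: String) -> Result<ParsedQuery, String> {
    // Move the CPU-bound work to tokio's blocking pool so executor
    // threads stay free to serve other requests.
    spawn_blocking(move || parse_and_validate(&source))
        .await
        .expect("parsing task panicked")
}
```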

I've also swapped the tokio mutex for a std::sync::Mutex, as the async lock did not seem to help / wasn't needed.

A few other questions:

  • Is specific back pressure for the spawn blocking needed at this level? I'm thinking this is fine for now, back pressure can happen as a concurrency limiter / rate limiter at ingress.
  • Would a wait map similar to planning make sense eventually here?

@router-perf

router-perf bot commented May 24, 2024

CI performance tests

  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • reload - Reload test over a long period of time at a constant rate of users
  • large-request - Stress test with a 1 MB request payload
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • const - Basic stress test that runs with a constant number of users
  • step - Basic stress test that steps up the number of users over time
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring enabled
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • no-graphos - Basic stress test, no GraphOS.
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled

Marc-Andre Giroux added 2 commits May 24, 2024 10:21
@Geal
Contributor

Geal commented May 24, 2024

Maybe don't change the mutex for now; we are looking at another solution in #5204 because it looks like the mutex is creating some queueing behaviour between queries.
spawn_blocking will by itself apply backpressure to the queries that need to be parsed, because at some point all threads in the blocking pool are used, but the other executor threads will still be able to handle queries which are already present in the cache.
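
Roughly the shape being described — only cache misses touch the blocking pool, so cached queries keep flowing even when that pool is saturated (a hypothetical `parse_with_cache` sketch, not the router's actual code):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::task::spawn_blocking;

async fn parse_with_cache(
    cache: Arc<Mutex<HashMap<String, String>>>,
    source: String,
) -> String {
    // Cache hits are answered directly on the executor thread.
    if let Some(hit) = cache.lock().unwrap().get(&source).cloned() {
        return hit;
    }
    // Only misses go to the blocking pool; if that pool is saturated they
    // queue there, while executor threads keep serving cached queries.
    let parsed = spawn_blocking({
        let source = source.clone();
        move || format!("parsed({source})") // stand-in for real parsing/validation
    })
    .await
    .expect("parsing task panicked");
    cache.lock().unwrap().insert(source, parsed.clone());
    parsed
}
```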

@xuorig
Contributor Author

xuorig commented May 24, 2024

Maybe don't change the mutex for now; we are looking at another solution in #5204

ACK, changed it back

@xuorig
Contributor Author

xuorig commented May 24, 2024

There's parsing in warm-up as well, which could cause similar issues, but I guess it would rarely / never cause all workers to be blocked? Any opinion on whether we want to tackle this in the same PR?

@Geal
Contributor

Geal commented May 27, 2024

The warm-up process would only block one worker at a time, so it's a bit better, but it should probably be done too, because some deployments do not make a lot of CPU threads available to the router.

@Geal Geal requested a review from a team May 27, 2024 08:33
@xuorig
Contributor Author

xuorig commented May 28, 2024

Now handling warm_up as well, by calling spawn_blocking within QueryAnalysisLayer::parse_document instead, which is used by both paths.
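
A hedged sketch of that shape (the `QueryAnalysisLayer` fields and helper names here are illustrative, not the router's exact API); the main wrinkle is that `spawn_blocking` requires `'static` ownership, so shared state such as the schema gets cloned via `Arc`:

```rust
use std::sync::Arc;
use tokio::task::spawn_blocking;

// Illustrative stand-ins, not the router's exact types.
struct Schema;
struct ParsedDocument;
struct SpecError(String);

struct QueryAnalysisLayer {
    schema: Arc<Schema>,
}

impl QueryAnalysisLayer {
    // Called from both the request path and schema warm-up, so both now
    // parse on the blocking pool instead of an executor thread.
    async fn parse_document(&self, query: String) -> Result<ParsedDocument, SpecError> {
        // spawn_blocking needs 'static ownership, so clone the Arc'd schema.
        let schema = Arc::clone(&self.schema);
        spawn_blocking(move || parse(&schema, &query))
            .await
            .expect("parsing task panicked")
    }
}

fn parse(_schema: &Schema, _query: &str) -> Result<ParsedDocument, SpecError> {
    Ok(ParsedDocument) // real parsing/validation would happen here
}
```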

Contributor

@garypen garypen left a comment


I have some concerns about the impact of the change.

  1. max_blocking_threads defaults to 512. That may be hit very easily by a busy router with a lot of complex documents to parse. Maybe we should set a much larger value, > 512? Maybe we need configuration?
  2. I don't like the expect(), but I guess it's no worse than the current situation. Maybe we could have an is_panic and log an error. That might be an improvement on the current situation? I'm not sure about this to be honest. Is it better to panic if the parsing panicked or is it better to log an error?
  3. At line 301 in caching_query_planner.rs won't we end up manufacturing a SpecError if a spawn_blocking fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

I suppose, between 2 and 3, the important thing would be consistent treatment. Either manufacture a spec error in both places or do special case logging if the Err is a JoinError that panicked.

@xuorig
Contributor Author

xuorig commented Jun 3, 2024

max_blocking_threads defaults to 512. That may be hit very easily by a busy router with a lot of complex documents to parse. Maybe we should set a much larger value, > 512? Maybe we need configuration?

After 512, queuing starts happening, which is arguably still a lot better than the current state. In normal scenarios this queue should drain very quickly, but in worst-case scenarios like the one we're seeing here, at least already-parsed queries can still get through rather than blocking the runtime.
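
For reference, the pool size is a knob on tokio's runtime builder if we ever want to raise it or expose configuration; a sketch, not something this PR changes:

```rust
use tokio::runtime::Builder;

fn build_runtime() -> std::io::Result<tokio::runtime::Runtime> {
    Builder::new_multi_thread()
        .enable_all()
        // Default is 512; raising it trades memory for more concurrent
        // blocking parse tasks before queuing starts.
        .max_blocking_threads(1024)
        .build()
}
```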

I don't like the expect(), but I guess it's no worse than the current situation. Maybe we could have an is_panic and log an error. That might be an improvement on the current situation? I'm not sure about this to be honest. Is it better to panic if the parsing panicked or is it better to log an error?

No strong opinion on that one, the PR doesn't really change the failure mode of parsing panicking as far as I know. Not sure if this is something that should be "caught"?

I guess it would panic on cancellation as well, which we could handle separately with is_panic as you mentioned.
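
For reference, distinguishing a panic from a cancellation on the `JoinError` would look roughly like this (hypothetical error mapping, just a sketch):

```rust
use tokio::task::spawn_blocking;

async fn parse_checked(query: String) -> Result<String, String> {
    match spawn_blocking(move || format!("parsed({query})")).await {
        Ok(parsed) => Ok(parsed),
        // The closure panicked: surface it as a logged error instead of
        // propagating the panic with expect().
        Err(join_err) if join_err.is_panic() => {
            tracing::error!("query parsing task panicked");
            Err("query parsing panicked".to_string())
        }
        // Otherwise the task was cancelled (e.g. runtime shutdown).
        Err(_) => Err("query parsing task was cancelled".to_string()),
    }
}
```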

At line 301 in caching_query_planner.rs won't we end up manufacturing a SpecError if a spawn_blocking fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

Might be missing something here but I don't think spawn_blocking returns an error here? It either panics or returns the result from parsing?

Overall it might make more sense to put this work on something like rayon / a dedicated thread pool rather than spawn_blocking, but this seems like a good interim improvement.
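
For comparison, the rayon / dedicated-pool variant would typically bridge back to async with a oneshot channel, roughly like this (a sketch, not part of this PR):

```rust
use tokio::sync::oneshot;

async fn parse_on_rayon(query: String) -> Result<String, String> {
    let (tx, rx) = oneshot::channel();
    // rayon's work-stealing pool does the CPU-bound parsing, keeping it off
    // both tokio's executor threads and its blocking pool.
    rayon::spawn(move || {
        let parsed = format!("parsed({query})"); // stand-in for real parsing
        let _ = tx.send(parsed);
    });
    rx.await.map_err(|_| "parsing task dropped the channel".to_string())
}
```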

@garypen
Contributor

garypen commented Jun 4, 2024

max_blocking_threads defaults to 512. That may be hit very easily by a busy router with a lot of complex documents to parse. Maybe we should set a much larger value, > 512? Maybe we need configuration?

After 512, queuing starts happening, which is arguably still a lot better than the current state. In normal scenarios this queue should drain very quickly, but in worst-case scenarios like the one we're seeing here, at least already-parsed queries can still get through rather than blocking the runtime.

I agree that it's an improvement, but I was wondering if it could be improved even further.

I think it's ok to go with the 512 default and avoid the extra thinking about configuration, but thought it was worth mentioning.

I don't like the expect(), but I guess it's no worse than the current situation. Maybe we could have an is_panic and log an error. That might be an improvement on the current situation? I'm not sure about this to be honest. Is it better to panic if the parsing panicked or is it better to log an error?

No strong opinion on that one, the PR doesn't really change the failure mode of parsing panicking as far as I know. Not sure if this is something that should be "caught"?

I was wondering if we'd added a new failure mode (i.e.: spawn_blocking() itself fails), but I guess, after thinking about the implementation, that we haven't. In which case, ignore this and the next comment.

I guess it would panic on cancellation as well, which we could handle separately with is_panic as you mentioned.

At line 301 in caching_query_planner.rs won't we end up manufacturing a SpecError if a spawn_blocking fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

Might be missing something here but I don't think spawn_blocking returns an error here? It either panics or returns the result from parsing?

Overall it might make more sense to put this work on something like rayon / a dedicated thread pool rather than spawn_blocking, but this seems like a good interim improvement.

Yup.

@garypen garypen self-requested a review June 4, 2024 09:58
@xuorig xuorig requested review from a team as code owners June 4, 2024 12:42
@xuorig xuorig changed the title from "use std mutex for query analysis cache, use spawn_blocking for parsing" to "use spawn_blocking for parsing" Jun 4, 2024
@xuorig
Contributor Author

xuorig commented Jun 4, 2024

Looks like the router is not starting in integration tests:

{"timestamp":"2024-05-30T20:17:25.620548594Z","level":"ERROR","message":"Not connected to GraphOS. In order to enable these features for a self-hosted instance of Apollo Router, the Router must be connected to a graph in GraphOS (using APOLLO_KEY and APOLLO_GRAPH_REF) that provides a license for the following features:\n\nConfiguration yaml:\n* Advanced telemetry\n  .telemetry..instruments\n\n* Advanced telemetry\n  .telemetry..graphql\n\nSee https://go.apollo.dev/o/elp for more information.","target":"apollo_router::state_machine","resource":{}}
{"timestamp":"2024-05-30T20:17:25.620756559Z","level":"INFO","message":"stopped","target":"apollo_router::state_machine","resource":{}}
{"timestamp":"2024-05-30T20:17:25.621150735Z","level":"ERROR","message":"license violation","target":"apollo_router::executable","resource":{}}

Related to GraphOS / license maybe? Same thing locally. Any ideas?

@xuorig
Contributor Author

xuorig commented Jun 4, 2024

Ah, right I don't have a valid TEST_APOLLO_GRAPH_REF and probably don't have the right circleci env either.

@garypen
Contributor

garypen commented Jun 5, 2024

Ah, right I don't have a valid TEST_APOLLO_GRAPH_REF and probably don't have the right circleci env either.

Recent changes to our CI testing strategy have broken tests for forked PRs. I hope this will be fixed soon.

@Geal
Contributor

Geal commented Jun 6, 2024

I think it's ok to go with the 512 default and avoid the extra thinking about configuration, but thought it was worth mentioning.

I'm ok with that too, but we should note that somewhere and think of a follow-up issue. The alternative right now, without this PR, is that validation hogs all of the executor threads, which amounts to the same result.

At line 301 in caching_query_planner.rs won't we end up manufacturing a SpecError if a spawn_blocking fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

In caching_query_planner.rs it is just the warm-up phase. If a query no longer passes validation (maybe the schema changed), then we can ignore it, because we won't make a plan for it. And the result of validation would still be recorded by the query analysis cache.

@Geal
Contributor

Geal commented Jun 10, 2024

Could you merge dev again? Apparently I can't do it from here, and dev now has this commit, which will make the tests pass.

@Geal
Contributor

Geal commented Jun 24, 2024

@xuorig can you merge dev?

@xuorig
Contributor Author

xuorig commented Jun 25, 2024

Done, just merged @Geal

@Geal Geal mentioned this pull request Jul 2, 2024
@BrynCooke
Contributor

Let's merge with dev again. I think there was another redis test that needed to be disabled.

@Geal
Contributor

Geal commented Jul 5, 2024

@BrynCooke continue on that one, you can push on it #5582

@abernix
Member

abernix commented Jul 5, 2024

I'm going to close this in favor of #5582, where we can hopefully help move this along. 😄 🌴 I'll edit the description, too, to point that way. Thanks for opening this!

@abernix abernix closed this Jul 5, 2024
@abernix
Member

abernix commented Jul 5, 2024

Just as a note, there were some outstanding questions in the original PR body above — let's try to answer those on the other PR:

A few other questions:

  • Is specific back pressure for the spawn blocking needed at this level? I'm thinking this is fine for now, back pressure can happen as a concurrency limiter / rate limiter at ingress.
  • Would a wait map similar to planning make sense eventually here?
