use spawn_blocking for parsing #5235
Conversation
CI performance tests
Maybe don't change the mutex for now; we are looking at another solution in #5204, because it looks like the mutex is creating some queueing behaviour between queries.
Force-pushed 42d813c to 70f1f1b.
ACK, changed it back.
There's parsing in warm-up as well, which could cause similar issues, but I guess it would rarely / never cause all workers to be blocked? Any opinion on whether we want to tackle this in the same PR?
The warm-up process would only block one worker at a time, so it's a bit better, but it should probably be done too, because some deployments do not make a lot of CPU threads available to the router.
Now handling `warm_up` via `spawn_blocking` as well.
Force-pushed 7182428 to b4625cc.
I have some concerns about the impact of the change:

1. `max_blocking_threads` defaults to 512. That may be hit very easily by a busy router with a lot of complex documents to parse. Maybe we should set a value that is a lot > 512? Maybe we need configuration?
2. I don't like the `expect()`, but I guess it's no worse than the current situation. Maybe we could have an `is_panic` check and log an error. That might be an improvement on the current situation? I'm not sure about this, to be honest. Is it better to panic if the parsing panicked, or is it better to log an error?
3. At line 301 in caching_query_planner.rs, won't we end up manufacturing a `SpecError` if a `spawn_blocking` fails? If that's legitimate, then maybe we should do the same thing in the query analysis layer?

I suppose, between 2 and 3, the important thing would be consistent treatment: either manufacture a `SpecError` in both places, or do special-case logging if the `Err` is a `JoinError` that panicked.
After 512, queuing starts happening, which arguably is still a lot better than the current state. In normal scenarios this queue should drain very quickly, but in worst-case scenarios like the one we're seeing here, at least already-parsed queries can still get through rather than blocking the runtime.
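For reference, if configuration ever becomes necessary, the limit can be raised when the runtime is built. A minimal sketch, not the router's actual startup code; the 2048 value and the `main` wiring here are purely illustrative:

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // max_blocking_threads defaults to 512; a parse-heavy router could
    // want more headroom before spawn_blocking calls start to queue.
    let runtime = Builder::new_multi_thread()
        .enable_all()
        .max_blocking_threads(2048)
        .build()?;

    runtime.block_on(async {
        // ... start the router here ...
    });
    Ok(())
}
```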
No strong opinion on that one; the PR doesn't really change the failure mode of parsing panicking, as far as I know. Not sure if this is something that should be "caught"? I guess it would panic on cancellation as well, which we could handle separately with `is_cancelled`.
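A sketch of what that split handling could look like, replacing the `expect()` with explicit panic / cancellation branches. The `Document`, `ParseError`, and `parse_document` names are hypothetical stand-ins, not the router's actual API:

```rust
use tokio::task::spawn_blocking;

// Hypothetical stand-ins for the router's real types.
struct Document;
enum ParseError {
    Panicked,
    Cancelled,
}

fn parse_document(_query: &str) -> Result<Document, ParseError> {
    // ... the real CPU-bound parse / validation would run here ...
    Ok(Document)
}

async fn parse_offloaded(query: String) -> Result<Document, ParseError> {
    match spawn_blocking(move || parse_document(&query)).await {
        // The closure ran to completion; pass its own result through.
        Ok(result) => result,
        // The closure panicked: log it instead of propagating the panic
        // into the worker that awaited the handle.
        Err(join_err) if join_err.is_panic() => {
            tracing::error!("query parsing panicked");
            Err(ParseError::Panicked)
        }
        // The only other JoinError state is cancellation.
        Err(_join_err) => Err(ParseError::Cancelled),
    }
}
```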
Might be missing something here, but I don't think so.

Overall it might make more sense to put this work in something like rayon / a dedicated thread pool rather than `spawn_blocking`.
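For illustration, the rayon variant would usually look something like this: run the parse on rayon's CPU-sized pool and hand the result back over a oneshot channel (reusing the hypothetical `parse_document` / `ParseError` stand-ins from the sketch above):

```rust
use tokio::sync::oneshot;

async fn parse_on_rayon(query: String) -> Result<Document, ParseError> {
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        // Runs on a rayon worker thread, sized to the CPU count rather
        // than to tokio's 512-thread blocking pool.
        let _ = tx.send(parse_document(&query));
    });
    // If the result is never sent, the channel closes and the await
    // returns an error; surfaced here as a cancellation for simplicity.
    rx.await.map_err(|_| ParseError::Cancelled)?
}
```

The trade-off is an extra channel hop per query, but the parsing work no longer competes with other `spawn_blocking` users for the blocking pool.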
I agree that it's an improvement, but I was wondering if it could be improved even further. I think it's OK to go with the 512 default and avoid the extra thinking about configuration, but I thought it was worth mentioning.
I was wondering if we'd added a new failure mode (i.e. a `JoinError` from the offloaded task).
Yup.
Looks like the router is not starting in integration tests:
Related to GraphOS / license maybe? Same thing locally. Any ideas?
Ah, right, I don't have a valid license.
Recent changes to our CI testing strategy have broken tests for forked PRs. I hope this will be fixed soon.
I'm OK with that too, but we should note that somewhere and think of a follow-up issue. The alternative right now, without this PR, is that validation hogs all of the executor threads, which amounts to the same result.
Could you merge dev again? Apparently I can't do it from here, and dev now has this commit, which will make the tests pass.
@xuorig can you merge dev?
Done, just merged @Geal
Let's merge with dev again. I think there was another Redis test that needed to be disabled.
@BrynCooke continued on that one; you can push to it: #5582
I'm going to close this in favor of #5582, where we can hopefully help move this along. 😄 🌴 I'll edit the description, too, to point that way. Thanks for opening this!
Just as a note, there were some outstanding questions in the original PR body above — let's try to answer those on the other PR:
Note: Superseded by #5582.
I'm investigating an issue where it looks like query parsing / validation becomes extremely slow, leading the router to stop serving requests entirely.

Given that we frequently see parsing take more than a second (possibly something to investigate on its own), it seems wise not to block a worker while we do it. This PR uses `spawn_blocking` to allow the workers to serve other requests. Note that in similar scenarios this could lead us to exhaust tokio's supply of blocking threads, but at least the runtime would remain unblocked.

~~I've also swapped the tokio mutex for a `std::sync::Mutex`, as the async lock did not seem to help / wasn't needed.~~ (removed)

A few other questions:
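As a minimal sketch of the core idea, with hypothetical names standing in for the router's actual internals:

```rust
use tokio::task::{spawn_blocking, JoinError};

// Hypothetical stand-in for whatever the parser returns.
struct ParsedQuery;

async fn parse_query(query: String) -> Result<ParsedQuery, JoinError> {
    // Offload the CPU-bound parse / validation (observed to sometimes
    // take over a second) so executor threads stay free for requests.
    spawn_blocking(move || {
        // ... parse and validate `query` here ...
        let _ = query;
        ParsedQuery
    })
    .await
}
```

The trade-off described above still applies: under sustained parse pressure the blocking pool (512 threads by default) can itself saturate, at which point new parses queue while already-parsed queries keep flowing.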