-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix QueryContext race condition #13049
Conversation
What was the stack trace that this is now fixing? I'm curious as to the source of the NPE... I say that because races inside of that A perhaps even nicer fix would be a |
QueryContext is definitely mutable even without that: it's got a bunch of methods for adding and removing context parameters. They each call Either way, it feels sketchy to patch over it by synchronizing
For these reasons the change in the patch feels like it's halfway between two good solutions. In this patch, QueryContext takes on some responsibility for concurrency safety, where previously it had none. But it still leaves a lot up to the caller. I'd rather either leave it all up to the caller, which means not doing this change, and instead, doing another change where we prevent callers from using the context in such a way; or, leave none of it to the caller, which means making QueryContext fully concurrency-safe. |
Oh man, you are right, Why did we change it to be mutable in the first place? With every mutation, if we are invalidating the merged map, we are either
In either case, it's actually no different than having an immutable object that effectively copies itself every time a "mutation" happens. For the former, we already end up copying the entire object to rebuild the merged map. For the latter case, if the mutations are all bundled together in one place, it's pretty easy to build that in a Map first and then pass into a brand new I'm wondering if we shouldn't move |
I saw this one when running the test locally in repeat-until-failure mode:
The trace is confusing because it happens in a separate thread and is thrown without a wrapper. I added a wrapper, ran the tests until failure again, and got:
|
All good points. More fundamentally, context variables only make sense to set in the planning thread, then only read in the execution threads. It is inherently unstable to change a value in one thread concurrently with reading the context in another. I’ll do more debug code to track down the issue and enforce proper semantics. |
I think I see the problem. DruidConnection (object that represents a JDBC connection on the server side) uses a single QueryContext object to create all DruidJdbcStatements associated with that connection. The query lifecycle code associated with each statement mutates the contexts, which isn't safe because they are shared. I believe the bug was introduced in #12396, since prior to that patch, the context in DruidConnection was an ImmutableMap, and query planning copies the map rather than mutating it in-place; for example, when initializing.
It looks like the one in BaseQuery to be a |
@gianm, @imply-cheddar, the mess goes deeper. The native query "engine" adjusts context values as it proceeds. But, the places that do the adjusting don't know if they are working with "default", "system" or "user" parameters: they just want to call I hate to say it, but this design is a complete muddle: it won't work as currently coded. The race condition is the least of our worries: the greater worry is that the logic itself is inconsistent. There is somewhat the same muddle in the SQL layer that I've been wrestling with around using It seems we introduced the "default" and "system" parameters to work around some issue in doing a security check on parameters? If so, then perhaps we should just solve that problem directly, which we can do in the SQL layer (if that is the only place we enforce context security.) The new rules are simple:
So, the question for this group: do you know of any place, other than SQL context value security, where we need to separate the kinds of parameters? I'll try the above approach to see if it hits any roadblocks. This approach will, unfortunately, change public APIs, so we may need to devise solutions for extensions not in the Druid code base. |
Eliminated |
0da5ff8
to
105179e
Compare
Revised again to only fix the race condition which @gianm identified. The refactoring moved to a new PR. The result is a two-line change. |
This targeted fix here looks good to me! As to the bigger issue of what to do with query contexts generally — looking through the history of #12396, there seems to have been a bunch of approaches evaluated, such as in this thread: #12396 (comment). We can continue that conversation on #13071. |
Fixes #12955, #12739.
Ensures that JDBC makes a copy of the
QueryContext
stored on a connection when creating statements. Doing so avoids race conditions which produced flaky results inDruidAvaticaHandlerTest.testConcurrentQueries()
.An earlier version of this PR first took a hack approach. @gianm correctly pointed out that the problem is really with a flaw in JDBC. A second version explored the alternative of replacing
QueryContext
. That is a good idea, and expands on work that @FrankChen021 started in #13022, but was overkill for the specific problem in question. The refactoring was moved to a new PR, #13071. In the end, the simplest fix was good enough, and is safe, so this PR presents just that minimal fix.The fix was verified by running
testConcurrentQueries()
many times in a loop until it failed due to an OOM. That OOM is a separate issue to be addressed elsewhere.This PR has: