Fix direct memory OOM on broker #11496
Conversation
Codecov Report
@@             Coverage Diff              @@
##            master    #11496      +/-   ##
=============================================
- Coverage    63.05%    14.47%    -48.59%
+ Complexity    1109       201       -908
=============================================
  Files         2320      2326         +6
  Lines       124667    124845       +178
  Branches     19033     19061        +28
=============================================
- Hits         78614     18072     -60542
- Misses       40458    105246     +64788
+ Partials      5595      1527      -4068

Flags with carried forward coverage won't be shown.
... and 1497 files with indirect coverage changes
Force-pushed fb4e553 to cf0fb4f
(resolved review thread on pinot-core/src/main/java/org/apache/pinot/core/transport/DataTableHandler.java)
Force-pushed 99bfc2a to 9295bb8
(resolved review threads on pinot-core/src/main/java/org/apache/pinot/core/transport/DirectOOMHandler.java)
Looks like we have some updates; I'll re-run my tests on the changes.

One thing that occurred to me about this approach: can we have a situation where a query causes the servers to send very big replies back in a sequence such that the channel restart process gets triggered multiple times in a row?
Force-pushed bbb7997 to 78c99b3
(resolved review threads on DirectOOMHandler.java, ServerChannels.java, and InstanceRequestHandler.java under pinot-core/src/main/java/org/apache/pinot/core/transport/)
        }
        _serverToChannelMap.remove(serverRoutingInstance);
      });
      _queryRouter.markServerDown(_serverRoutingInstance,
In the context of AdaptiveServerSelection:
If this error is hit, it looks like this code path will not decrement numInProgressQueries for all servers?
Can you please validate that? Looking at the code, it looks like we might have to invoke the following inside markServerDown:
_serverRoutingStatsManager.recordStatsUponResponseArrival(_requestId, entry.getKey().getInstanceId(),
    _timeoutMs);
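The bookkeeping concern can be illustrated with a minimal, self-contained sketch (the class and method names below are hypothetical simplifications for illustration, not Pinot's actual API): if a server is marked down without also recording a response arrival, its in-progress counter never drains back to zero, and the server looks permanently loaded to the selector.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simplification of adaptive-server-selection bookkeeping.
class ServerStats {
  private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

  // Increment when a query is routed to a server.
  void recordQuerySubmitted(String server) {
    counters.computeIfAbsent(server, s -> new AtomicInteger()).incrementAndGet();
  }

  // Mirrors the role of recordStatsUponResponseArrival: decrement when the
  // response arrives (or the query is otherwise accounted for, e.g. timeout).
  void recordResponseArrived(String server) {
    counters.get(server).decrementAndGet();
  }

  int inProgress(String server) {
    return counters.getOrDefault(server, new AtomicInteger()).get();
  }
}

public class MarkServerDownSketch {
  public static void main(String[] args) {
    ServerStats stats = new ServerStats();
    stats.recordQuerySubmitted("server-1");
    // If the error path marks the server down but skips the decrement,
    // the counter stays at 1 forever.
    System.out.println(stats.inProgress("server-1"));
    stats.recordResponseArrived("server-1");
    System.out.println(stats.inProgress("server-1"));
  }
}
```

As the later replies note, Pinot already performs this accounting in getFinalResponses, so the decrement must happen in exactly one place; doing it in both spots would drive the counter negative.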
added
I do not see the ADSS stats recording added in the latest commit.
Did we miss pushing the commit?
> I do not see the ADSS stats recording added in the latest commit. Did we miss pushing the commit?
No, I reverted that. I did some code reading, and it turned out we have already handled the case where the response doesn't come back in getFinalResponses, so there is no need to decrement again in markServerDown. The tests will in fact fail if we add it. @vvivekiyer
(resolved review thread on pinot-core/src/main/java/org/apache/pinot/core/transport/DirectOOMHandler.java)
Force-pushed 1b3bfea to 992aa54
Force-pushed 303c287 to 0b6f1c1
Here is what the direct memory graph looks like (the query itself took nearly 3 minutes to OOM), along with the error (screenshots attached). I also took a heap dump, and the direct buffers are all clean after the incident.
f0b7884
to
62c02b5
Compare
Related to this PR, I've added some Netty buffer metrics in #11575.
(resolved review thread on pinot-core/src/main/java/org/apache/pinot/core/transport/DirectOOMHandler.java)
Sorry for starting a new thread here, but while we are trying to "handle" the OOM by resetting the Netty channels, queries may still fail (I think). So System.exit can be called instead, which will trigger the shutdown hook to shut down the broker. This approach is simpler and has fewer unknowns. I am not a Netty expert, hence this proposal is biased toward keeping it simple.
Hey @soumitra-st

Only the queries that overlap with the OOM event fail (inevitably). After the OOM is handled (pretty fast), all subsequent queries succeed. I have tested this repeatedly on our perf cluster.

This approach takes much less time than a broker restart, especially for larger clusters. If a rogue query is retried a few times, we could easily see all brokers taken down and take a harder availability impact.

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.
The concern that @soumitra-st and I had when working on a fix was that neither of us are Netty experts, and we weren't sure how well Netty would behave if it got a direct memory OOM and we kept running: e.g., we weren't sure whether there would be a memory leak or some other type of resource leak, whether there might be unknown side effects for in-flight queries, or what other effects within the broker there might be. So we erred toward treating this as an unrecoverable fault and triggering a shutdown; we felt this was the safest and easiest solution, because shutdowns are a normal operation and would be the least likely to create unexpected issues.
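For comparison, the "treat it as fatal" alternative discussed here can be sketched in a few lines. This is illustrative only, not the actual broker code; the check below keys off the JDK's "Direct buffer memory" OOM message (an assumption -- Netty's own OutOfDirectMemoryError uses different wording, so a real handler would also check the error's type).

```java
// Sketch of the "unrecoverable fault" alternative: on a direct-memory OOM,
// call System.exit so the registered shutdown hooks perform orderly cleanup
// and the process manager restarts the broker.
public class FatalOomSketch {
  // Decide whether a throwable is the direct-memory OOM we treat as fatal.
  static boolean isDirectOom(Throwable t) {
    return t instanceof OutOfMemoryError
        && String.valueOf(t.getMessage()).contains("Direct buffer");
  }

  static void handle(Throwable t) {
    if (isDirectOom(t)) {
      // System.exit triggers shutdown hooks (e.g. broker cleanup) before
      // the JVM terminates; recovery then depends on an external restart.
      System.exit(1);
    }
  }

  public static void main(String[] args) {
    Runtime.getRuntime().addShutdownHook(
        new Thread(() -> System.out.println("broker shutdown hook ran")));
    System.out.println(isDirectOom(new OutOfMemoryError("Direct buffer memory")));
    System.out.println(isDirectOom(new RuntimeException("boom")));
  }
}
```

The trade-off debated in this thread is exactly the one visible in the sketch: the exit path is simple and predictable, but recovery time is bounded below by a full broker restart rather than by channel re-establishment.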
Thanks @jasperjiaguo for your comments!
My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work. We have customers using Pinot, and their workloads may have some surprises. This fix certainly has less recovery time, though. Beyond recovery time, do you have other concerns about shutting down the broker? How many restarts do you see in your environment, and how many occurrences of direct memory OOM? If the number of direct memory OOMs is not significant relative to restarts for other reasons, then the additional restarts won't be significant.
My perspective is that we should not rely on operational toil (restarts, etc.) to recover from issues that can largely be handled in code, and I think that is what this fix is doing. Let me just say that we have had a significant number of OOMs (both heap and direct), and that's why we have built features like runtime query killing to try and improve resiliency via code as opposed to resorting to restarts. A restart could take significant time, having to rebuild the BrokerResource. I don't think it is wise to rely on restarts unless the problem is absolutely unsolvable in code.
Let me elaborate a bit on the nature of the problem we saw in our production. We have a cluster with several thousand tables served by a handful of brokers. A really bad query that fetched around 150MB of data from each of the 160 servers (fan-out was 160) caused a direct memory OOM on the broker. Note that this was a soft OOM (the broker didn't crash, unlike a Java heap space OOM). The problem is not just the OOM itself. It is the cascading impact of this OOM on the overall stability / availability of the system. Concurrent queries around the same time and subsequent ones also failed because:

So this collectively destabilized the system and reduced availability. We also restarted initially when we detected this, to mitigate, but by that time it had already negatively impacted our critical production use case, and it missed the SLA -- because of the cascading impact on the concurrent / subsequent queries. I understand and agree that a restart is simpler, but from the detailed RCA there are definitely opportunities in code that could have prevented this, or at least reduced the impact. @jasperjiaguo's fix is aimed at that, and that's why we also shared numbers on memory overhead reduction testing, plus subsequent queries working fine after the short recovery. I agree that shutting down channels will cause other queries to fail, but that particular impact may not be worse than the potential real-life worst impact I described above -- which, without manual intervention or other tooling, will continue to cause problems on the cluster IMHO. @soumitra-st @ege-st I hope this gives some insight into where we are coming from. We can also chat offline and align if need be.
I agree with the point about operational toil, and thanks for sharing the RCA. Looking forward to deploying the fix for our customers!
Yes, this is the core problem here, IMO: this bug causes the broker to become a bad actor and start sabotaging queries, which is, IMO, one of the worst situations the cluster could be in.
Causing queries to fail is not a huge issue, as any solution will necessarily involve some queries failing while the broker recovers. It is certainly very minor compared to the broker continuing to accept queries even though it cannot execute them.
I think that this is very insightful, thanks.
Great work!
Debugging this issue, I've discovered that on Java 11 Netty does not use Cleaners, so my theory is that it relies on the normal GC-based cleaning mechanism used by JDK ByteBuffers. See https://github.com/netty/netty/blob/f1fa227ddf675f055766d04900cce6804fd7f710/common/src/main/java/io/netty/util/internal/PlatformDependent0.java and https://github.com/netty/netty/blob/f1fa227ddf675f055766d04900cce6804fd7f710/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L281 (it fails exactly there).
Hey @Jackie-Jiang, the situation we had was exactly as Sidd described:
The broker is technically able to process the response from any one server, but all servers (150+) sending back these mid-sized responses together create a situation where each channel allocates some amount of direct buffer; then the direct memory is filled and no server can proceed.
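The starvation pattern described above can be reproduced in miniature (a pure simulation with hypothetical names -- no Netty or real direct memory involved): each channel reserves part of a fixed budget toward a larger response, and once the budget is exhausted no channel can finish its allocation, so every response is stuck.

```java
import java.util.concurrent.atomic.AtomicLong;

// Miniature model of the broker's direct-memory budget: many channels each
// reserve a chunk, exhausting the budget before any single response completes.
class DirectMemoryBudget {
  private final long capacity;
  private final AtomicLong used = new AtomicLong();

  DirectMemoryBudget(long capacity) { this.capacity = capacity; }

  // Returns true if the reservation fits; a false here corresponds to the
  // direct-memory OOM Netty would throw on a failed allocation.
  boolean tryReserve(long bytes) {
    while (true) {
      long u = used.get();
      if (u + bytes > capacity) return false;
      if (used.compareAndSet(u, u + bytes)) return true;
    }
  }

  void release(long bytes) { used.addAndGet(-bytes); }
}

public class StarvationSketch {
  public static void main(String[] args) {
    DirectMemoryBudget budget = new DirectMemoryBudget(100);
    // Ten "channels" each reserve 10 units toward a 15-unit response.
    for (int i = 0; i < 10; i++) {
      budget.tryReserve(10);
    }
    // The budget is now full: no channel can get its remaining 5 units,
    // so all of them wait -- the collective stall seen in production.
    System.out.println(budget.tryReserve(5)); // false
  }
}
```

This also shows why closing only the channels that throw is not enough: the channels that never threw still hold their partial reservations, so freeing a few of them releases only a fraction of the budget.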
@jasperjiaguo This is resource starvation rather than deadlock, then. During your investigation, did you find closing all connections necessary? Ideally we want to close only the connection that throws the exception, for the following reasons:
Yeah, I phrased it as deadlock because every channel actually got a (pretty even) portion of the memory allocation and was waiting.
Yes, that is also what I initially thought would work. But in the worst case we tested (150+ servers each sending a several-hundred-MB response), some of the connections to these 150 servers would start to throw OOM exceptions first; once we closed them, the others recovered very slowly or would still be blocked after 10 minutes. In other words, once the OOM starts to happen, resetting only the channels that throw OOM does not give us a prompt recovery.
I'm not sure what the use case is on your side? One thing we might be able to do to limit the blast radius: for the OOM'ed channels, fetch the current in-flight query IDs and reset only the channels carrying those IDs. However, this would introduce quite some complication to the error-handling code; I'm not sure whether that's what we want to do here, or whether we should invest time in a finer-grained solution as folks have suggested in the discussions above?
Wondering why the other connections take up to 10 minutes to recover? Is that caused by more and more connections getting OOM'ed? Ideally we should only close the connections that OOM'ed; connections not requesting memory, or that already have their memory allocated, shouldn't be affected.
* close channel on direct oom, log large response
* Trigger Test
* close channel on direct oom, log large response
* close channel on direct oom, log large response
* Trigger Test
* close channel on direct oom, log large response
* move metrics send to reflect all direct oom incidents
Upon a large server response, the broker can hit a direct memory OOM, resulting in resource deadlock. This is PoC code to fix it. Tested using OfflineClusterMemBasedBrokerQueryKillingTest with -XX:MaxDirectMemorySize=100M, and also in our production environment.
The issue we saw: when multiple servers concurrently send medium-sized (several hundred MB) responses, the broker hits a direct memory OOM due to multiple direct buffer allocations, and all the Netty channels deadlock waiting for direct memory.