
Fix direct memory OOM on broker #11496

Merged
merged 7 commits into apache:master from fix_direct_oom2 on Sep 12, 2023

Conversation

jasperjiaguo
Contributor

@jasperjiaguo jasperjiaguo commented Sep 2, 2023

When a server sends a large response, the broker can hit a direct memory OOM and end up in a resource deadlock. This PR contains the fix. Tested using OfflineClusterMemBasedBrokerQueryKillingTest with -XX:MaxDirectMemorySize=100M, and also in our production environment.

The issue we saw is essentially this: when multiple servers concurrently send medium-sized (several hundred MB) responses, the broker hits a direct memory OOM because of the many concurrent direct buffer allocations, and all the Netty channels then deadlock waiting for direct memory.
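For readers skimming the diff, the general shape of the fix is a Netty handler that traps the direct-memory OutOfMemoryError and closes the server channels so their buffers can actually be released. The sketch below only illustrates that idea and is not the actual DirectOOMHandler code; the class name, the server-to-channel map, and the single-flight guard are assumptions.

// Minimal sketch (assumptions noted above), not the PR's DirectOOMHandler:
// on a direct-memory OOM, close the server channels so their direct buffers
// can be freed instead of letting every channel keep retrying the allocation.
import io.netty.channel.Channel;
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class DirectOomSketchHandler extends ChannelDuplexHandler {
  // Shared flag so only one handler performs the cleanup at a time.
  private static final AtomicBoolean OOM_CLEANUP_IN_PROGRESS = new AtomicBoolean(false);
  private final ConcurrentHashMap<String, Channel> _serverToChannel;  // hypothetical bookkeeping map

  public DirectOomSketchHandler(ConcurrentHashMap<String, Channel> serverToChannel) {
    _serverToChannel = serverToChannel;
  }

  @Override
  public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
    if (cause instanceof OutOfMemoryError
        && String.valueOf(cause.getMessage()).contains("Direct buffer memory")) {
      if (OOM_CLEANUP_IN_PROGRESS.compareAndSet(false, true)) {
        try {
          // Close every server channel so in-flight direct buffers are released
          // and Netty stops re-attempting the failing allocations.
          _serverToChannel.forEach((server, channel) -> channel.close());
          _serverToChannel.clear();
        } finally {
          OOM_CLEANUP_IN_PROGRESS.set(false);
        }
      }
      return;
    }
    ctx.fireExceptionCaught(cause);
  }
}

The affected queries are then surfaced as QueryCancelledException (as in the error shown later in this thread), while subsequent queries go over re-established channels.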

@codecov-commenter

codecov-commenter commented Sep 2, 2023

Codecov Report

Merging #11496 (57f89e7) into master (6208e7c) will decrease coverage by 48.59%.
Report is 17 commits behind head on master.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master   #11496       +/-   ##
=============================================
- Coverage     63.05%   14.47%   -48.59%     
+ Complexity     1109      201      -908     
=============================================
  Files          2320     2326        +6     
  Lines        124667   124845      +178     
  Branches      19033    19061       +28     
=============================================
- Hits          78614    18072    -60542     
- Misses        40458   105246    +64788     
+ Partials       5595     1527     -4068     
Flag Coverage Δ
integration 0.00% <0.00%> (-0.01%) ⬇️
integration1 ?
integration2 0.00% <0.00%> (ø)
java-11 14.47% <0.00%> (-48.54%) ⬇️
java-17 0.00% <0.00%> (-62.92%) ⬇️
java-20 14.47% <0.00%> (-48.43%) ⬇️
temurin 14.47% <0.00%> (-48.59%) ⬇️
unittests 14.47% <0.00%> (-48.58%) ⬇️
unittests1 ?
unittests2 14.47% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
...a/org/apache/pinot/common/metrics/BrokerMeter.java 0.00% <0.00%> (-93.75%) ⬇️
...a/org/apache/pinot/common/metrics/ServerMeter.java 0.00% <0.00%> (-97.60%) ⬇️
...he/pinot/core/transport/ChannelHandlerFactory.java 0.00% <0.00%> (-50.00%) ⬇️
.../apache/pinot/core/transport/DirectOOMHandler.java 0.00% <0.00%> (ø)
...e/pinot/core/transport/InstanceRequestHandler.java 0.00% <0.00%> (-61.72%) ⬇️
...rg/apache/pinot/core/transport/ServerChannels.java 0.00% <0.00%> (-63.42%) ⬇️

... and 1497 files with indirect coverage changes


@Jackie-Jiang
Contributor

cc @erichgess @soumitra-st

@jasperjiaguo jasperjiaguo force-pushed the fix_direct_oom2 branch 2 times, most recently from fb4e553 to cf0fb4f Compare September 6, 2023 21:52
@jasperjiaguo jasperjiaguo force-pushed the fix_direct_oom2 branch 3 times, most recently from 99bfc2a to 9295bb8 Compare September 8, 2023 17:51
@jasperjiaguo jasperjiaguo marked this pull request as ready for review September 8, 2023 18:02
@jasperjiaguo jasperjiaguo changed the title from "[POC] close channel on direct oom" to "Close channel on direct oom" on Sep 8, 2023
@ege-st
Contributor

ege-st commented Sep 8, 2023

Looks like we have some updates, I'll re-run my tests on the changes.

@ege-st
Contributor

ege-st commented Sep 8, 2023

One thing occurred to me about this approach: can we end up in a situation where a query causes the servers to send very big replies back in a sequence such that the channel restart process gets triggered multiple times in a row?

@siddharthteotia siddharthteotia changed the title from "Close channel on direct oom" to "Fix direct memory OOM on broker" on Sep 11, 2023
}
_serverToChannelMap.remove(serverRoutingInstance);
});
_queryRouter.markServerDown(_serverRoutingInstance,
Contributor

In the context of AdaptiveServerSelection:
If this error is hit, it looks like this codepath will not decrement numInProgressQueries for all servers?
Can you please validate that? Looking at the code, it seems we might have to invoke the following function inside markServerDown:

          _serverRoutingStatsManager.recordStatsUponResponseArrival(_requestId, entry.getKey().getInstanceId(),
              _timeoutMs);

Contributor Author

added

Contributor

I do not see the ADSS stats recording added in the latest commit.
Did we miss pushing the commit?

Contributor Author

I do not see the ADSS stats recording added in the latest commit. Did we miss pushing the commit?

No, I reverted that. I did some code reading, and it turns out we already handle the case where the response doesn't come back in getFinalResponses, so there is no need to decrement again in markServerDown. The tests will in fact fail if we add it. @vvivekiyer

@jasperjiaguo
Contributor Author

jasperjiaguo commented Sep 12, 2023

@jasperjiaguo - you may also want to share numbers from the internal production test we did using this: before vs. after numbers / observations from the heap dump, etc.

* Direct memory usage should come down after initial failure

* No repeated retries from Netty

Yes, this would be very helpful.

The direct memory graph looks like this (the query itself took nearly 3 minutes to OOM):
[screenshot: broker direct memory usage graph]

And the error looks like:

 curl 'https://xxxxx:xxxx/pinot-broker/resources/query' --data-urlencode 'q=query' --data-urlencode 'sql=xxxxx'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  138M    0  138M    0     0   788k      0 --:--:--  0:02:59 --:--:-- 8956k
{
    "metadata": {
        "columns": [
            {
                "columnType": "STRING",
                "columnName": "XXX"
            },
            {
                "columnType": "LONG",
                "columnName": "count(*)"
            }
        ],
        "hasPartialResults": true,
        "numEntriesScannedPostFilter": 215036581,
        "numDocsScanned": 215036581,
        "errors": [
            {
                "type": "UNKNOWN",
                "message": "org.apache.pinot.spi.exception.QueryCancelledException: Query cancelled as broker is out of direct memory\n\tat org.apache.pinot.core.transport.DirectOOMHandler.exceptionCaught(DirectOOMHandler.java:80)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)"

I also took a heap dump, and the direct buffers are all cleaned up after the incident.

@gortiz
Contributor

gortiz commented Sep 12, 2023

Related to this PR, I've added some netty buffer metrics in: #11575

@soumitra-st
Contributor

Sorry for starting a new thread here, but while we are trying to "handle" the OOM by resetting the Netty channels, queries may still fail (I think). So, System.exit can be called instead, which will trigger the shutdown hook to shut down the broker. This approach is simpler and has fewer unknowns. I am not a Netty expert, hence this proposal is biased toward keeping it simple.
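For reference, the fail-fast alternative being proposed would look roughly like the sketch below; it is purely illustrative (the class and method names are made up), and simply relies on System.exit running whatever shutdown hooks the broker registered at startup.

// Illustrative sketch of the "treat it as unrecoverable" alternative:
// on a direct-memory OOM, exit so the broker's registered shutdown hooks
// perform the normal shutdown sequence and the process is restarted externally.
public final class FailFastOnDirectOom {
  private FailFastOnDirectOom() {
  }

  public static void handleIfDirectOom(Throwable cause) {
    if (cause instanceof OutOfMemoryError
        && String.valueOf(cause.getMessage()).contains("Direct buffer memory")) {
      System.exit(1);  // triggers registered shutdown hooks before terminating
    }
  }
}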

@jasperjiaguo
Contributor Author

jasperjiaguo commented Sep 12, 2023

Hey @soumitra-st

OOM by resetting the netty channels, queries may still fail (I think)

Only the queries that overlap with the OOM event fail (inevitably). After the OOM is handled (which is pretty fast), all subsequent queries succeed. I have tested this repeatedly on our perf cluster.

So, System.exit can be called instead, which will trigger the shutdown hook to shut down the broker.

This approach takes much less time than a broker restart, especially for larger clusters. If a rogue query is retried a few times, we could easily see all brokers taken down, with a much harder availability impact.

has fewer unknowns

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.

@ege-st
Contributor

ege-st commented Sep 12, 2023

has fewer unknowns

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.

The concern @soumitra-st and I had when working on a fix was that neither of us is a Netty expert, and we weren't sure how well Netty would behave if it hit a direct memory OOM and we kept running: e.g., whether there would be a memory leak or some other kind of resource leak, whether there might be unknown side effects for in-flight queries, or what other effects there might be within the Broker. So we erred towards treating this as an unrecoverable fault and triggering a shutdown; we felt that was the safest and easiest solution, because shutdowns are a normal operation and would be the least likely to create unexpected issues.

@soumitra-st
Contributor

Thanks @jasperjiaguo for your comments!

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.

My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work. We have customers using Pinot, and their workloads may have some surprises. This fix certainly has a shorter recovery time, though.

Beyond the recovery time, do you have other concerns about shutting down the broker? How many restarts do you see in your environment, and how many occurrences of direct memory OOM are there? If the number of direct memory OOMs is not significant relative to restarts for other reasons, then the additional restarts won't be significant.

@siddharthteotia
Contributor

siddharthteotia commented Sep 12, 2023

Thanks @jasperjiaguo for your comments!

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.

My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work. We have customers using Pinot, and their workloads may have some surprises. This fix certainly has a shorter recovery time, though.

Beyond the recovery time, do you have other concerns about shutting down the broker? How many restarts do you see in your environment, and how many occurrences of direct memory OOM are there? If the number of direct memory OOMs is not significant relative to restarts for other reasons, then the additional restarts won't be significant.

My perspective is that we should not rely on operational toil (restarts, etc.) to recover from issues that can largely be handled in code. I think that is what this fix is doing.

Let me just say that we have had a significant number of OOMs (both heap and direct), and that's why we have built features like runtime query killing to try to improve resiliency via code as opposed to resorting to restarts. A restart could take significant time since it has to build the BrokerResource.

I don't think it is wise to rely on restarts unless the problem is absolutely unsolvable via code.

@siddharthteotia
Contributor

siddharthteotia commented Sep 12, 2023

My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work.

Let me elaborate a bit on the nature of the problem we saw in our production environment.

We have a cluster with several thousand tables served by a handful of brokers.

A really bad query that was fetching around 150 MB of data from each of the 160 servers (fan-out was 160) caused a direct memory OOM on the broker. Note that this was a soft OOM (the broker didn't crash, unlike with a Java heap space OOM).
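(Back-of-the-envelope from those numbers: 160 servers × ~150 MB ≈ 24 GB of response data converging on a single broker, far beyond any typical -XX:MaxDirectMemorySize setting, so allocation failures are essentially guaranteed once enough channels are mid-read.)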

The problem is not just the OOM itself. It is the cascading impact of this OOM on the overall stability / availability of the system.

Concurrent queries around the same time, as well as subsequent ones, also failed because:

  • All the direct buffer references were held up; our heap dump confirmed it.
  • Netty threads were relentlessly trying to allocate memory, multiple times per channel, per server, so the same failure was happening repeatedly. Netty called into NIO to allocate direct memory; NIO first tried a GC, which didn't help because the corresponding direct buffer references were still in scope (a minimal standalone repro of this JDK behavior is sketched below).
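Outside of Pinot, that JDK behavior can be reproduced with a few lines; this is a standalone sketch (arbitrary class name, run with -XX:MaxDirectMemorySize=100M like the test mentioned in the description), not code from this PR:

// Standalone repro of the "Direct buffer memory" OOM: the buffers below stay
// referenced, so the System.gc() that java.nio.Bits attempts before failing
// cannot reclaim anything, mirroring what the broker hit.
// Run with: java -XX:MaxDirectMemorySize=100M DirectMemoryOomRepro
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectMemoryOomRepro {
  public static void main(String[] args) {
    List<ByteBuffer> held = new ArrayList<>();  // references kept alive, like in-flight responses
    try {
      while (true) {
        held.add(ByteBuffer.allocateDirect(16 * 1024 * 1024));  // 16 MB per "response chunk"
      }
    } catch (OutOfMemoryError e) {
      // Typically: java.lang.OutOfMemoryError: Direct buffer memory
      System.out.println("Failed after " + held.size() + " buffers: " + e.getMessage());
    }
  }
}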

So this collectively destabilized the system and reduced availability. We did restart initially when we detected this, to mitigate, but by that time it had already negatively impacted our critical production use case, which missed its SLA because of the cascading impact on the concurrent / subsequent queries.

I understand and agree that a restart is simpler, but from the detailed RCA there are definitely opportunities in code that could have prevented this or at least reduced the impact. @jasperjiaguo's fix is aimed at that, and that's why we also shared numbers from the memory overhead reduction testing, plus subsequent queries working fine after the short recovery.

I agree that shutting down channels will cause the other queries to fail, but that particular impact may not be worse than the potential real-life worst-case impact that I described above, which, without manual intervention or other tooling, will continue to cause problems on the cluster IMHO.

@soumitra-st @ege-st - I hope this gives some insight into where we are coming from. We can also chat offline and align if need be

cc @jasperjiaguo @vvivekiyer

@soumitra-st
Contributor

@soumitra-st @ege-st - I hope this gives some insight into where we are coming from. We can also chat offline and align if need be

I agree with the point about operational toil, and thanks for sharing the RCA. Looking forward to deploying the fix for our customers!

@ege-st
Contributor

ege-st commented Sep 12, 2023

My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work.

Let me elaborate a bit on the nature of the problem we saw in our production environment.

We have a cluster with several thousand tables served by a handful of brokers.

A really bad query that was fetching around 150 MB of data from each of the 160 servers (fan-out was 160) caused a direct memory OOM on the broker. Note that this was a soft OOM (the broker didn't crash, unlike with a Java heap space OOM).

The problem is not just the OOM itself. It is the cascading impact of this OOM on the overall stability / availability of the system.

Yes, this is the core problem here. This bug causes the Broker to become a bad actor and start sabotaging queries, which is, IMO, one of the worst situations the cluster could be in.

I agree that shutting down channels will cause the other queries to fail, but that particular impact may not be worse than the potential real-life worst-case impact that I described above, which, without manual intervention or other tooling, will continue to cause problems on the cluster IMHO.

Causing queries to fail is not a huge issue as any solution will necessarily involve some queries failing while the broker recovers. It is certainly very minor when compared to the Broker continuing to accept queries even though it cannot execute them.

@soumitra-st @ege-st - I hope this gives some insight into where we are coming from. We can also chat offline and align if need be

cc @jasperjiaguo @vvivekiyer

I think that this is very insightful, thanks.

@siddharthteotia siddharthteotia merged commit e62a271 into apache:master Sep 12, 2023
21 checks passed
@Jackie-Jiang
Contributor

Great work!
Out of curiosity, can you elaborate more on the Netty channel deadlock? Or is there a blog post I can read describing the issue?

@gortiz
Contributor

gortiz commented Sep 14, 2023

While debugging this issue I discovered that on Java 11 Netty does not use cleaners, so my theory is that it is relying on the normal GC-based cleaning mechanism used for JDK ByteBuffers.

See https://github.com/netty/netty/blob/f1fa227ddf675f055766d04900cce6804fd7f710/common/src/main/java/io/netty/util/internal/PlatformDependent0.java and https://github.com/netty/netty/blob/f1fa227ddf675f055766d04900cce6804fd7f710/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L281 (it fails exactly there)
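Not part of #11575 (the metrics there may be wired differently), but for anyone debugging this locally: the pooled allocator exposes its direct memory usage through the public ByteBufAllocatorMetric API, assuming the default pooled allocator is in use, e.g.:

// Sketch: reading Netty's pooled direct memory usage, e.g. to export as a gauge.
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

public final class NettyDirectMemoryProbe {
  public static long usedDirectBytes() {
    PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
    return metric.usedDirectMemory();  // bytes of direct memory currently held by the pooled allocator
  }
}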

@jasperjiaguo
Contributor Author

Great work! Out of curiosity, can you elaborate more on the Netty channel deadlock? Or is there a blog post I can read describing the issue?

Hey @Jackie-Jiang, the situation we had was exactly as Sidd described:

A really bad query that was fetching around 150 MB of data from each of the 160 servers (fan-out was 160) caused a direct memory OOM on the broker. Note that this was a soft OOM (the broker didn't crash, unlike with a Java heap space OOM).

So the broker is technically able to process the response from one server, but all servers (150+) sending back these mid-sized responses together create a situation where each channel allocates some amount of direct buffer, then direct memory fills up and no server can proceed.

@Jackie-Jiang
Contributor

@jasperjiaguo This is resource starvation rather than deadlock, then. During your investigation, did you find it necessary to close all connections? Ideally we want to close only the connection that throws the exception, for the following reasons:

  • Closing all connections will fail all on-the-fly queries
  • Some connections might not request lots of memory (data not evenly distributed, expensive query only hits subset of servers)
  • After releasing the connection throwing the exception, other connections might recover

@jasperjiaguo
Contributor Author

@jasperjiaguo This is resource starvation rather than deadlock, then. During your investigation, did you find it necessary to close all connections? Ideally we want to close only the connection that throws the exception, for the following reasons:

  • Closing all connections will fail all on-the-fly queries
  • Some connections might not request lots of memory (data not evenly distributed, expensive query only hits subset of servers)
  • After releasing the connection throwing the exception, other connections might recover

Yeah, I phrased it as deadlock because every channel actually got a (pretty even) portion of the memory allocation and was waiting.

Ideally we want to close only the connection that throws the exception

Yes, that is also what I initially thought would work. But in the worst case we tested (150+ servers each sending a several-hundred-MB response), some of the connections to these 150 servers would start to throw OOM exceptions first; once we closed them, the others recovered very slowly / would still be blocked after 10 minutes. In other words, once the OOM starts to happen, resetting only the channels that throw OOM does not give us a prompt recovery.

Closing all connections will fail all on-the-fly queries

I'm not sure what the use case is on your side? One thing we might be able to do to limit the blast radius: for the OOM'ed channels, fetch the IDs of the currently in-flight queries and reset only the channels serving those IDs. However, this would introduce quite a bit of complication to the error handling code; I'm not sure if that's what we want to do here, or whether we should invest some time in a finer-grained solution as folks have suggested in the discussion above?

@Jackie-Jiang
Contributor

Wondering why the other connections take up to 10 minutes to recover? Is that caused by more and more connections getting OOM'ed? Ideally we should only close the connections that OOM'ed, and connections that are not requesting memory / have already had their memory allocated shouldn't be affected.
If you can reproduce this, can you help check whether the slow recovery is caused by connections not releasing memory properly? If that is the case, it might be a bug in the failure handling logic.

swaminathanmanish pushed a commit to swaminathanmanish/pinot that referenced this pull request Sep 20, 2023
* close channel on direct oom, log large response

* Trigger Test

* close channel on direct oom, log large response

* close channel on direct oom, log large response

* Trigger Test

* close channel on direct oom, log large response

* move metrics send to reflect all direct oom incidents