
Fix direct memory OOM on broker #11496

Merged
merged 7 commits into apache:master from fix_direct_oom2 on Sep 12, 2023

Conversation

jasperjiaguo
Contributor

@jasperjiaguo jasperjiaguo commented Sep 2, 2023

When a server sends a large response, the broker can hit a direct memory OOM and end up in a resource deadlock. This PR contains the fix. Tested using OfflineClusterMemBasedBrokerQueryKillingTest with -XX:MaxDirectMemorySize=100M, and also in our production environment.

The issue we saw is essentially this: when multiple servers concurrently send medium-sized (several hundred MB) responses, the broker hits a direct memory OOM because of the many concurrent direct buffer allocations, and all the Netty channels then deadlock waiting for direct memory.
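For readers skimming the diff, the general shape of the fix is a Netty handler that traps the direct-memory OutOfMemoryError and closes the server channels so their buffers can actually be released. The sketch below only illustrates that idea and is not the actual DirectOOMHandler code; the class name, the server-to-channel map, and the single-flight guard are assumptions.

// Minimal sketch (assumptions noted above), not the PR's DirectOOMHandler:
// on a direct-memory OOM, close the server channels so their direct buffers
// can be freed instead of letting every channel keep retrying the allocation.
import io.netty.channel.Channel;
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class DirectOomSketchHandler extends ChannelDuplexHandler {
  // Shared flag so only one handler performs the cleanup at a time.
  private static final AtomicBoolean OOM_CLEANUP_IN_PROGRESS = new AtomicBoolean(false);
  private final ConcurrentHashMap<String, Channel> _serverToChannel;  // hypothetical bookkeeping map

  public DirectOomSketchHandler(ConcurrentHashMap<String, Channel> serverToChannel) {
    _serverToChannel = serverToChannel;
  }

  @Override
  public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
    if (cause instanceof OutOfMemoryError
        && String.valueOf(cause.getMessage()).contains("Direct buffer memory")) {
      if (OOM_CLEANUP_IN_PROGRESS.compareAndSet(false, true)) {
        try {
          // Close every server channel so in-flight direct buffers are released
          // and Netty stops re-attempting the failing allocations.
          _serverToChannel.forEach((server, channel) -> channel.close());
          _serverToChannel.clear();
        } finally {
          OOM_CLEANUP_IN_PROGRESS.set(false);
        }
      }
      return;
    }
    ctx.fireExceptionCaught(cause);
  }
}

The affected queries are then surfaced as QueryCancelledException (as in the error shown later in this thread), while subsequent queries go over re-established channels.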

@codecov-commenter

codecov-commenter commented Sep 2, 2023

Codecov Report

Merging #11496 (57f89e7) into master (6208e7c) will decrease coverage by 48.59%.
Report is 17 commits behind head on master.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master   #11496       +/-   ##
=============================================
- Coverage     63.05%   14.47%   -48.59%     
+ Complexity     1109      201      -908     
=============================================
  Files          2320     2326        +6     
  Lines        124667   124845      +178     
  Branches      19033    19061       +28     
=============================================
- Hits          78614    18072    -60542     
- Misses        40458   105246    +64788     
+ Partials       5595     1527     -4068     
Flag Coverage Δ
integration 0.00% <0.00%> (-0.01%) ⬇️
integration1 ?
integration2 0.00% <0.00%> (ø)
java-11 14.47% <0.00%> (-48.54%) ⬇️
java-17 0.00% <0.00%> (-62.92%) ⬇️
java-20 14.47% <0.00%> (-48.43%) ⬇️
temurin 14.47% <0.00%> (-48.59%) ⬇️
unittests 14.47% <0.00%> (-48.58%) ⬇️
unittests1 ?
unittests2 14.47% <0.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
...a/org/apache/pinot/common/metrics/BrokerMeter.java 0.00% <0.00%> (-93.75%) ⬇️
...a/org/apache/pinot/common/metrics/ServerMeter.java 0.00% <0.00%> (-97.60%) ⬇️
...he/pinot/core/transport/ChannelHandlerFactory.java 0.00% <0.00%> (-50.00%) ⬇️
.../apache/pinot/core/transport/DirectOOMHandler.java 0.00% <0.00%> (ø)
...e/pinot/core/transport/InstanceRequestHandler.java 0.00% <0.00%> (-61.72%) ⬇️
...rg/apache/pinot/core/transport/ServerChannels.java 0.00% <0.00%> (-63.42%) ⬇️

... and 1497 files with indirect coverage changes


@Jackie-Jiang
Contributor

cc @erichgess @soumitra-st

@jasperjiaguo jasperjiaguo force-pushed the fix_direct_oom2 branch 2 times, most recently from fb4e553 to cf0fb4f Compare September 6, 2023 21:52
@jasperjiaguo jasperjiaguo force-pushed the fix_direct_oom2 branch 3 times, most recently from 99bfc2a to 9295bb8 Compare September 8, 2023 17:51
@jasperjiaguo jasperjiaguo marked this pull request as ready for review September 8, 2023 18:02
@jasperjiaguo jasperjiaguo changed the title from "[POC] close channel on direct oom" to "Close channel on direct oom" on Sep 8, 2023
@ege-st
Contributor

ege-st commented Sep 8, 2023

Looks like we have some updates, I'll re-run my tests on the changes.

@ege-st
Contributor

ege-st commented Sep 8, 2023

One thing occurred to me about this approach: can we end up in a situation where a query causes the servers to send very big replies back in a sequence such that the channel restart process gets triggered multiple times in a row?

@siddharthteotia siddharthteotia changed the title from "Close channel on direct oom" to "Fix direct memory OOM on broker" on Sep 11, 2023
}
_serverToChannelMap.remove(serverRoutingInstance);
});
_queryRouter.markServerDown(_serverRoutingInstance,
Contributor

In the context of AdaptiveServerSelection:
If this error is hit, it looks like this codepath will not decrement numInProgressQueries for all servers?
Can you please validate that? Looking at the code, it seems we might have to invoke the following function inside markServerDown:

          _serverRoutingStatsManager.recordStatsUponResponseArrival(_requestId, entry.getKey().getInstanceId(),
              _timeoutMs);

Contributor Author

added

Contributor

I do not see the ADSS stats recording added in the latest commit.
Did we miss pushing the commit?

Contributor Author

I do not see the ADSS stats recording added in the latest commit. Did we miss pushing the commit?

No, I reverted that. I did some code reading, and it turns out we already handle the case where the response doesn't come back in getFinalResponses, so there is no need to decrement again in markServerDown. The tests will in fact fail if we add it. @vvivekiyer

@jasperjiaguo
Contributor Author

jasperjiaguo commented Sep 12, 2023

@jasperjiaguo - you may also want to share numbers from the internal production test we did using this: before vs. after numbers / observations from the heap dump, etc.

* Direct memory usage should come down after initial failure

* No repeated retries from Netty

Yes, this would be very helpful.

The direct memory graph looks like this (the query itself took nearly 3 minutes to OOM):
[screenshot: broker direct memory usage graph]

And the error looks like:

 curl 'https://xxxxx:xxxx/pinot-broker/resources/query' --data-urlencode 'q=query' --data-urlencode 'sql=xxxxx'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  138M    0  138M    0     0   788k      0 --:--:--  0:02:59 --:--:-- 8956k
{
    "metadata": {
        "columns": [
            {
                "columnType": "STRING",
                "columnName": "XXX"
            },
            {
                "columnType": "LONG",
                "columnName": "count(*)"
            }
        ],
        "hasPartialResults": true,
        "numEntriesScannedPostFilter": 215036581,
        "numDocsScanned": 215036581,
        "errors": [
            {
                "type": "UNKNOWN",
                "message": "org.apache.pinot.spi.exception.QueryCancelledException: Query cancelled as broker is out of direct memory\n\tat org.apache.pinot.core.transport.DirectOOMHandler.exceptionCaught(DirectOOMHandler.java:80)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:346)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:325)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:317)"

I also took a heap dump, and the direct buffers are all cleaned up after the incident.

@gortiz
Contributor

gortiz commented Sep 12, 2023

Related to this PR, I've added some netty buffer metrics in: #11575

@soumitra-st
Contributor

Sorry for starting a new thread here, but while we are trying to "handle" the OOM by resetting the Netty channels, queries may still fail (I think). So, System.exit can be called instead, which will trigger the shutdown hook to shut down the broker. This approach is simpler and has fewer unknowns. I am not a Netty expert, hence this proposal is biased toward keeping it simple.
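For reference, the fail-fast alternative being proposed would look roughly like the sketch below; it is purely illustrative (the class and method names are made up), and simply relies on System.exit running whatever shutdown hooks the broker registered at startup.

// Illustrative sketch of the "treat it as unrecoverable" alternative:
// on a direct-memory OOM, exit so the broker's registered shutdown hooks
// perform the normal shutdown sequence and the process is restarted externally.
public final class FailFastOnDirectOom {
  private FailFastOnDirectOom() {
  }

  public static void handleIfDirectOom(Throwable cause) {
    if (cause instanceof OutOfMemoryError
        && String.valueOf(cause.getMessage()).contains("Direct buffer memory")) {
      System.exit(1);  // triggers registered shutdown hooks before terminating
    }
  }
}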

@jasperjiaguo
Contributor Author

jasperjiaguo commented Sep 12, 2023

Hey @soumitra-st

OOM by resetting the netty channels, queries may still fail (I think)

Only the queries that overlap with the OOM event fail (inevitably). After the OOM is handled (which is pretty fast), all subsequent queries succeed. I have tested this repeatedly on our perf cluster.

So, System.exit can be called instead, which will trigger the shutdown hook to shut down the broker.

This approach takes much less time than a broker restart, especially for larger clusters. If a rogue query is retried a few times, we could easily see all brokers taken down, with a much harder availability impact.

has fewer unknowns

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.

@ege-st
Contributor

ege-st commented Sep 12, 2023

has fewer unknowns

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.

The concern @soumitra-st and I had when working on a fix was that neither of us is a Netty expert, and we weren't sure how well Netty would behave if it hit a direct memory OOM and we kept running: e.g., whether there would be a memory leak or some other kind of resource leak, whether there might be unknown side effects for in-flight queries, or what other effects there might be within the Broker. So we erred towards treating this as an unrecoverable fault and triggering a shutdown; we felt that was the safest and easiest solution, because shutdowns are a normal operation and would be the least likely to create unexpected issues.

@soumitra-st
Contributor

Thanks @jasperjiaguo for your comments!

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.

My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work. We have customers using Pinot, and their workloads may have some surprises. This fix certainly has a shorter recovery time, though.

Beyond the recovery time, do you have other concerns about shutting down the broker? How many restarts do you see in your environment, and how many occurrences of direct memory OOM are there? If the number of direct memory OOMs is not significant relative to restarts for other reasons, then the additional restarts won't be significant.

@siddharthteotia
Contributor

siddharthteotia commented Sep 12, 2023

Thanks @jasperjiaguo for your comments!

Could you elaborate on your concern here? I think the tests / heap dump / graphs show that we recover deterministically and the direct buffers are deallocated.

My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work. We have customers using Pinot, and their workloads may have some surprises. This fix certainly has a shorter recovery time, though.

Beyond the recovery time, do you have other concerns about shutting down the broker? How many restarts do you see in your environment, and how many occurrences of direct memory OOM are there? If the number of direct memory OOMs is not significant relative to restarts for other reasons, then the additional restarts won't be significant.

My perspective is that we should not rely on operational toil (restarts, etc.) to recover from issues that can largely be handled in code. I think that is what this fix is doing.

Let me just say that we have had a significant number of OOMs (both heap and direct), and that's why we have built features like runtime query killing to try to improve resiliency via code as opposed to resorting to restarts. A restart could take significant time since it has to build the BrokerResource.

I don't think it is wise to rely on restarts unless the problem is absolutely unsolvable via code.

@siddharthteotia
Contributor

siddharthteotia commented Sep 12, 2023

My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work.

Let me elaborate a bit on the nature of the problem we saw in our production environment.

We have a cluster with several thousand tables served by a handful of brokers.

A really bad query that was fetching around 150 MB of data from each of the 160 servers (fan-out was 160) caused a direct memory OOM on the broker. Note that this was a soft OOM (the broker didn't crash, unlike with a Java heap space OOM).
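(Back-of-the-envelope from those numbers: 160 servers × ~150 MB ≈ 24 GB of response data converging on a single broker, far beyond any typical -XX:MaxDirectMemorySize setting, so allocation failures are essentially guaranteed once enough channels are mid-read.)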

The problem is not just the OOM itself. It is the cascading impact of this OOM on the overall stability / availability of the system.

Concurrent queries around the same time, as well as subsequent ones, also failed because:

  • All the direct buffer references were held up; our heap dump confirmed it.
  • Netty threads were relentlessly trying to allocate memory, multiple times per channel, per server, so the same failure was happening repeatedly. Netty called into NIO to allocate direct memory; NIO first tried a GC, which didn't help because the corresponding direct buffer references were still in scope (a minimal standalone repro of this JDK behavior is sketched below).
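Outside of Pinot, that JDK behavior can be reproduced with a few lines; this is a standalone sketch (arbitrary class name, run with -XX:MaxDirectMemorySize=100M like the test mentioned in the description), not code from this PR:

// Standalone repro of the "Direct buffer memory" OOM: the buffers below stay
// referenced, so the System.gc() that java.nio.Bits attempts before failing
// cannot reclaim anything, mirroring what the broker hit.
// Run with: java -XX:MaxDirectMemorySize=100M DirectMemoryOomRepro
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectMemoryOomRepro {
  public static void main(String[] args) {
    List<ByteBuffer> held = new ArrayList<>();  // references kept alive, like in-flight responses
    try {
      while (true) {
        held.add(ByteBuffer.allocateDirect(16 * 1024 * 1024));  // 16 MB per "response chunk"
      }
    } catch (OutOfMemoryError e) {
      // Typically: java.lang.OutOfMemoryError: Direct buffer memory
      System.out.println("Failed after " + held.size() + " buffers: " + e.getMessage());
    }
  }
}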

So this collectively destabilized the system and reduced availability. We did restart initially when we detected this, to mitigate, but by that time it had already negatively impacted our critical production use case, which missed its SLA because of the cascading impact on the concurrent / subsequent queries.

I understand and agree that a restart is simpler, but from the detailed RCA there are definitely opportunities in code that could have prevented this or at least reduced the impact. @jasperjiaguo's fix is aimed at that, and that's why we also shared numbers from the memory overhead reduction testing, plus subsequent queries working fine after the short recovery.

I agree that shutting down channels will cause the other queries to fail, but that particular impact may not be worse than the potential real-life worst-case impact that I described above, which, without manual intervention or other tooling, will continue to cause problems on the cluster IMHO.

@soumitra-st @ege-st - I hope this gives some insight into where we are coming from. We can also chat offline and align if need be

cc @jasperjiaguo @vvivekiyer

@soumitra-st
Contributor

@soumitra-st @ege-st - I hope this gives some insight into where we are coming from. We can also chat offline and align if need be

I agree with the point about operational toil, and thanks for sharing the RCA. Looking forward to deploying the fix for our customers!

@ege-st
Contributor

ege-st commented Sep 12, 2023

My concern is that we are trying to prove the fix works using tests, heap dumps, etc., whereas a restart will just work.

Let me elaborate a bit on the nature of the problem we saw in our production environment.

We have a cluster with several thousand tables served by a handful of brokers.

A really bad query that was fetching around 150 MB of data from each of the 160 servers (fan-out was 160) caused a direct memory OOM on the broker. Note that this was a soft OOM (the broker didn't crash, unlike with a Java heap space OOM).

The problem is not just the OOM itself. It is the cascading impact of this OOM on the overall stability / availability of the system.

Yes, this is the core problem here. This bug causes the Broker to become a bad actor and start sabotaging queries, which is, IMO, one of the worst situations the cluster could be in.

I agree that shutting down channels will cause the other queries to fail, but that particular impact may not be worse than the potential real-life worst-case impact that I described above, which, without manual intervention or other tooling, will continue to cause problems on the cluster IMHO.

Causing queries to fail is not a huge issue as any solution will necessarily involve some queries failing while the broker recovers. It is certainly very minor when compared to the Broker continuing to accept queries even though it cannot execute them.

@soumitra-st @ege-st - I hope this gives some insight into where we are coming from. We can also chat offline and align if need be

cc @jasperjiaguo @vvivekiyer

I think that this is very insightful, thanks.

@siddharthteotia siddharthteotia merged commit e62a271 into apache:master Sep 12, 2023
21 checks passed
@Jackie-Jiang
Contributor

Great work!
Out of curiosity, can you elaborate more on the Netty channel deadlock? Or is there a blog post I can read describing the issue?

@gortiz
Contributor

gortiz commented Sep 14, 2023

While debugging this issue I discovered that on Java 11 Netty does not use cleaners, so my theory is that it is relying on the normal GC-based cleaning mechanism used for JDK ByteBuffers.

See https://github.com/netty/netty/blob/f1fa227ddf675f055766d04900cce6804fd7f710/common/src/main/java/io/netty/util/internal/PlatformDependent0.java and https://github.com/netty/netty/blob/f1fa227ddf675f055766d04900cce6804fd7f710/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L281 (it fails exactly there)
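Not part of #11575 (the metrics there may be wired differently), but for anyone debugging this locally: the pooled allocator exposes its direct memory usage through the public ByteBufAllocatorMetric API, assuming the default pooled allocator is in use, e.g.:

// Sketch: reading Netty's pooled direct memory usage, e.g. to export as a gauge.
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

public final class NettyDirectMemoryProbe {
  public static long usedDirectBytes() {
    PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
    return metric.usedDirectMemory();  // bytes of direct memory currently held by the pooled allocator
  }
}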

@jasperjiaguo
Contributor Author

Great work! Out of curiosity, can you elaborate more on the Netty channel deadlock? Or is there a blog post I can read describing the issue?

Hey @Jackie-Jiang, the situation we had was exactly as Sidd described:

A really bad query that was fetching around 150 MB of data from each of the 160 servers (fan-out was 160) caused a direct memory OOM on the broker. Note that this was a soft OOM (the broker didn't crash, unlike with a Java heap space OOM).

So the broker is technically able to process the response from one server, but all servers (150+) sending back these mid-sized responses together create a situation where each channel allocates some amount of direct buffer, then direct memory fills up and no server can proceed.

@Jackie-Jiang
Contributor

@jasperjiaguo This is resource starvation rather than deadlock, then. During your investigation, did you find it necessary to close all connections? Ideally we want to close only the connection that throws the exception, for the following reasons:

  • Closing all connections will fail all on-the-fly queries
  • Some connections might not request lots of memory (data not evenly distributed, expensive query only hits subset of servers)
  • After releasing the connection throwing the exception, other connections might recover

@jasperjiaguo
Contributor Author

@jasperjiaguo This is resource starvation rather than deadlock, then. During your investigation, did you find it necessary to close all connections? Ideally we want to close only the connection that throws the exception, for the following reasons:

  • Closing all connections will fail all on-the-fly queries
  • Some connections might not request lots of memory (data not evenly distributed, expensive query only hits subset of servers)
  • After releasing the connection throwing the exception, other connections might recover

Yeah, I phrased it as deadlock because every channel actually got a (pretty even) portion of the memory allocation and was waiting.

Ideally we want to close only the connection that throws the exception

Yes, that is also what I initially thought would work. But in the worst case we tested (150+ servers each sending a several-hundred-MB response), some of the connections to these 150 servers would start to throw OOM exceptions first; once we closed them, the others recovered very slowly / would still be blocked after 10 minutes. In other words, once the OOM starts to happen, resetting only the channels that throw OOM does not give us a prompt recovery.

Closing all connections will fail all on-the-fly queries

I'm not sure what the use case is on your side? One thing we might be able to do to limit the blast radius: for the OOM'ed channels, fetch the IDs of the currently in-flight queries and reset only the channels serving those IDs. However, this would introduce quite a bit of complication to the error handling code; I'm not sure if that's what we want to do here, or whether we should invest some time in a finer-grained solution as folks have suggested in the discussion above?

@Jackie-Jiang
Contributor

Wondering why the other connections take up to 10 minutes to recover? Is that caused by more and more connections getting OOM'ed? Ideally we should only close the connections that OOM'ed, and connections that are not requesting memory / have already had their memory allocated shouldn't be affected.
If you can reproduce this, can you help check whether the slow recovery is caused by connections not releasing memory properly? If that is the case, it might be a bug in the failure handling logic.

swaminathanmanish pushed a commit to swaminathanmanish/pinot that referenced this pull request Sep 20, 2023
* close channel on direct oom, log large response

* Trigger Test

* close channel on direct oom, log large response

* close channel on direct oom, log large response

* Trigger Test

* close channel on direct oom, log large response

* move metrics send to reflect all direct oom incidents