[Bug] FE frequently fills its 64 GB of memory and crashes; only Broker Load runs on a schedule on the cluster, and memory fills up after a while #27594

DA1OOO · 2023-11-27T02:18:07Z

Search before asking

I had searched in the issues and found no similar issues.

Version

Version 2.0.1.1 release

What's Wrong?

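The figure below shows memory usage. The memory cannot be reclaimed, so a restart is needed every time, and after a while it fills up again:

Version 2.0.1.1 release
JVM -Xmx64g
GC log before the crash:

fe.log before the crash:

The only SQL coming from this machine is of these types: broker load, show load, show partitions, drop partition, add partition.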

What You Expected?

What is causing the FE to crash? Surely 64 GB of memory is not too little?

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

zengxiangqi1031 · 2023-11-27T02:54:39Z

Ran into the same problem, doris 2.0.2 release.

DA1OOO · 2023-11-27T03:50:05Z

JAVA_OPTS="-Djavax.security.auth.useSubjectCredsOnly=false -Xss4m -Xmx64g -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$DORIS_HOME/log/fe.gc.log.$CUR_DATE"

DA1OOO · 2023-11-27T03:57:48Z

I only changed the -Xmx size; everything else is the default JVM configuration.

liugddx · 2023-11-27T08:18:58Z

Using G1GC

DA1OOO · 2023-11-28T02:05:16Z

Using G1GC

Thanks, I will try.

DA1OOO · 2023-11-28T10:58:06Z

Btw, I used broker 2.0.2, not 2.0.1.1.

DA1OOO · 2023-11-29T01:54:56Z

liugddx · 2023-11-29T02:18:31Z

https://doris.apache.org/zh-CN/docs/1.2/admin-manual/query-profile?_highlight=profile#%E5%90%8D%E8%AF%8D%E8%A7%A3%E9%87%8A

Maybe you need to turn off the global profile: SET [GLOBAL] enable_profile=false;

DA1OOO · 2023-11-29T02:55:05Z

After reviewing the source code, the default max_query_profile_num seems to be 100, so it wouldn't keep pushing profiles into memory?
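
To make the reasoning concrete: if profiles really are kept in a store capped at 100 entries, the number retained should stay flat no matter how many loads run. The sketch below is a generic Python illustration of such a cap, not Doris's code; the class name `BoundedProfileStore` and its methods are hypothetical.

```python
from collections import deque


class BoundedProfileStore:
    """Hypothetical store that keeps only the most recent N profiles."""

    def __init__(self, max_profiles=100):
        # A deque with maxlen silently drops the oldest entry once full.
        self._profiles = deque(maxlen=max_profiles)

    def push(self, profile):
        self._profiles.append(profile)

    def oldest(self):
        return self._profiles[0]

    def __len__(self):
        return len(self._profiles)


store = BoundedProfileStore(max_profiles=100)
for i in range(10_000):
    store.push(f"profile-{i}")

print(len(store))      # number of profiles still retained
print(store.oldest())  # oldest profile that survived eviction
```

If memory still grows with a cap like this in place, the growth would have to come from somewhere other than the entry count, for example from the size of each entry or from references held elsewhere.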

liugddx · 2023-11-29T03:13:44Z

After reviewing the source code, the default max_query_profile_num seems to be 100, so it wouldn't keep pushing profiles into memory?

I don’t have a detailed understanding yet. You can continue to follow or provide more detailed log information.

DA1OOO · 2023-11-29T03:52:35Z

After restarting the FE and running SET [GLOBAL] enable_profile=false:

I have a broker load task running from 11:34 to 11:36, which is when the memory is rapidly increasing.

liugddx · 2023-11-29T05:04:57Z

Has this memory problem affected usage? In addition, is the memory reclaimed by GC?

DA1OOO · 2023-11-29T06:24:06Z

I need to observe how memory changes after turning off enable_profile. Before turning it off, GC reclaimed only a little memory, and once memory reached the maximum set by -Xmx, the FE stopped serving.

wj215318 · 2023-11-30T08:19:11Z

I need to observe how memory changes after turning off enable_profile. Before turning it off, GC reclaimed only a little memory, and once memory reached the maximum set by -Xmx, the FE stopped serving.

How is the FE memory after turning off enable_profile? Thanks.

DA1OOO · 2023-12-01T03:50:42Z

It seems to be normal now. Maybe profile removal has a bug. @wj215318

wj215318 · 2023-12-01T05:20:01Z

It seems to be normal now. Maybe profile removal has a bug. @wj215318

We have encountered the same problem, and we have now turned off the profile as well. Yesterday we dumped the JVM data, and our DBA is analyzing it.

DA1OOO · 2023-12-02T03:06:26Z

Due to the impact of dumping on the normal use of the cluster, we did not dump the JVM data. If you discover anything after dumping, please share the specific situation here. @wj215318 Thanks!

DA1OOO · 2023-12-06T12:04:09Z

@wj215318 Btw, the 2.0.2 release doesn't have this problem.

ziyanTOP · 2023-12-15T09:42:41Z

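Same problem here: the minor GC frequency could not keep up with the growth of the old generation, and in the end all three FE nodes hung and went down with queries queued and timing out. I suggest monitoring the FE JVM with Prometheus + Grafana to see where the problem actually lies. Also change your parameters: make the young generation 1/3 of the old generation, and instead of something like -XX:NewRatio=3, set it to a fixed -Xmn16G. Enable CMS parallel remark, otherwise that much memory cannot be marked in the short time a minor GC takes. Then lower the CMS initiating occupancy fraction: 80% is too late, and the service may go down before GC finishes; 60 or 65 works. This has proven effective in practice: since the adjustment, no FE in my cluster has gone down.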

zhbdesign · 2023-12-18T06:42:30Z

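Same problem here: the minor GC frequency could not keep up with the growth of the old generation, and in the end all three FE nodes hung and went down with queries queued and timing out. I suggest monitoring the FE JVM with Prometheus + Grafana to see where the problem actually lies. Also change your parameters: make the young generation 1/3 of the old generation, and instead of something like -XX:NewRatio=3, set it to a fixed -Xmn16G. Enable CMS parallel remark, otherwise that much memory cannot be marked in the short time a minor GC takes. Then lower the CMS initiating occupancy fraction: 80% is too late, and the service may go down before GC finishes; 60 or 65 works. This has proven effective in practice: since the adjustment, no FE in my cluster has gone down.

Could you share the modified startup parameters?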

ziyanTOP · 2023-12-18T06:50:44Z

JAVA_OPTS="-server -Xmx64g -Xmn16g -Xms32g -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$DORIS_HOME/log/fe.gc.log.$DATE"

@zhbdesign Set the actual memory sizes according to what your machine has.
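
For anyone adapting the flags above to a different machine, the sizing can be checked with a little arithmetic. The Python sketch below is illustrative only; the helper name `cms_layout` is hypothetical. It assumes the heap has grown to -Xmx and the usual HotSpot meanings of the flags: -Xmn is the young generation size, SurvivorRatio=N means eden is N times the size of one survivor space, and CMSInitiatingOccupancyFraction is the old-generation occupancy percentage at which a CMS cycle starts.

```python
def cms_layout(xmx_gb, xmn_gb, survivor_ratio, occupancy_pct):
    """Estimate generation sizes in GB for a CMS heap at its maximum size.

    Hypothetical helper for illustration; real sizes depend on the JVM
    version and on how far the heap has actually grown.
    """
    old_gb = xmx_gb - xmn_gb
    # Young gen = eden + 2 survivor spaces, with eden = survivor_ratio * survivor.
    survivor_gb = xmn_gb / (survivor_ratio + 2)
    eden_gb = xmn_gb - 2 * survivor_gb
    # Old-gen occupancy at which a CMS cycle is started.
    trigger_gb = old_gb * occupancy_pct / 100
    return {
        "old_gb": old_gb,
        "eden_gb": eden_gb,
        "survivor_gb": survivor_gb,
        "cms_trigger_gb": trigger_gb,
        "headroom_gb": old_gb - trigger_gb,
    }


if __name__ == "__main__":
    # Compare the default threshold (80) with the lowered one (65).
    for pct in (80, 65):
        layout = cms_layout(xmx_gb=64, xmn_gb=16, survivor_ratio=8, occupancy_pct=pct)
        print(pct, {k: round(v, 1) for k, v in layout.items()})
```

With -Xmx64g and -Xmn16g the old generation is 48 GB, so the young generation is 1/3 of it. At 65 a CMS cycle starts at about 31.2 GB of old-generation occupancy, leaving about 16.8 GB of headroom while the concurrent cycle runs, compared with about 9.6 GB at 80.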

DA1OOO · 2023-12-26T06:21:33Z

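After switching to the G1 collector and increasing the JVM memory, it is normal for now.

I still don't understand why memory grows so fast.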

ihadoop · 2023-12-27T06:23:59Z

You can upload the dumped file here.

DA1OOO closed this as completed Dec 4, 2023

DA1OOO reopened this Dec 4, 2023

DA1OOO closed this as completed Apr 12, 2024
