[Bug] FE frequently fills up its 64 GB of memory and crashes; only Broker Load runs on a schedule on the cluster, and after a while the memory is full #27594

Closed
3 tasks done
DA1OOO opened this issue Nov 27, 2023 · 23 comments

Comments

@DA1OOO

DA1OOO commented Nov 27, 2023

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Version 2.0.1.1 release

What's Wrong?

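The chart below shows memory usage. The memory is never reclaimed, so FE has to be restarted every time, and after a while it fills up again:
image
Version: 2.0.1.1 release
JVM: -Xmx64g
GC log before the crash:
image
fe.log before the crash:
image
The only SQL statements coming from this machine are broker load, show load, show partitions, drop partition, and add partition.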

What You Expected?

What is causing FE to crash? Surely 64 GB of memory should be enough.

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@zengxiangqi1031

I ran into the same problem on Doris 2.0.2 release.

@DA1OOO
Author

DA1OOO commented Nov 27, 2023

JAVA_OPTS="-Djavax.security.auth.useSubjectCredsOnly=false -Xss4m -Xmx64g -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$DORIS_HOME/log/fe.gc.log.$CUR_DATE"
image

@DA1OOO
Author

DA1OOO commented Nov 27, 2023

I only changed the -Xmx size; everything else is the default JVM configuration.

@liugddx
Member

liugddx commented Nov 27, 2023

Using G1GC

@DA1OOO
Author

DA1OOO commented Nov 28, 2023

Using G1GC

Thanks, I will try it.

@DA1OOO
Author

DA1OOO commented Nov 28, 2023

Btw, I used broker 2.0.2, not 2.0.1.1.

@DA1OOO
Author

DA1OOO commented Nov 29, 2023

image

@liugddx
Member

liugddx commented Nov 29, 2023

https://doris.apache.org/zh-CN/docs/1.2/admin-manual/query-profile?_highlight=profile#%E5%90%8D%E8%AF%8D%E8%A7%A3%E9%87%8A

Maybe you need to turn off the global profile: SET [GLOBAL] enable_profile=false;

@DA1OOO
Author

DA1OOO commented Nov 29, 2023

After reviewing the source code, the default max_query_profile_num seems to be 100, so it shouldn't keep pushing profiles into memory, should it?

@liugddx
Member

liugddx commented Nov 29, 2023

After reviewing the source code, the default max_query_profile_num seems to be 100, so it shouldn't keep pushing profiles into memory, should it?

I don’t have a detailed understanding yet. You can continue to follow or provide more detailed log information.

@DA1OOO
Author

DA1OOO commented Nov 29, 2023

After restarting FE and running SET [GLOBAL] enable_profile=false:
image
image
I had a broker load task running from 11:34 to 11:36, which is exactly when the memory increased rapidly.

@liugddx
Member

liugddx commented Nov 29, 2023

Has this memory problem affected usage? Also, does GC reclaim the memory?

@DA1OOO
Author

DA1OOO commented Nov 29, 2023

I need to observe how the memory changes after disabling enable_profile. Before disabling it, GC reclaimed only a little memory, and once the memory reached the maximum set by -Xmx, FE stopped serving.
image

@wj215318

I need to observe how the memory changes after disabling enable_profile. Before disabling it, GC reclaimed only a little memory, and once the memory reached the maximum set by -Xmx, FE stopped serving.

How is the FE memory after disabling enable_profile? Thanks.

@DA1OOO
Author

DA1OOO commented Dec 1, 2023

image
It seems to be normal now. Maybe profile removal has a bug. @wj215318

@wj215318

wj215318 commented Dec 1, 2023

It seems to be normal now. Maybe profile removal has a bug. @wj215318

We have encountered the same problem, and we have now disabled the profile as well. Yesterday we dumped the JVM data, and our DBA is analyzing it.

@DA1OOO
Author

DA1OOO commented Dec 2, 2023

Due to the impact of dumping on the normal use of the cluster, we did not dump the JVM data. If you discover anything after dumping, please share the specific situation here. @wj215318 Thanks!

@DA1OOO DA1OOO closed this as completed Dec 4, 2023
@DA1OOO DA1OOO reopened this Dec 4, 2023
@DA1OOO
Author

DA1OOO commented Dec 6, 2023

@wj215318 Btw, the 2.0.2 release doesn't have this problem.

@ziyanTOP
Contributor

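Same problem here. The minor GC frequency could not keep up with the growth of the old generation, and in the end queries on all three FE nodes queued up and timed out, and the nodes hung and crashed. I suggest monitoring the FE JVM with Prometheus + Grafana to see where the problem really is. While you are at it, change your parameters: make the young generation 1/3 of the old generation, and instead of using something like -XX:NewRatio=3, set it to a fixed -Xmn16G. Turn on CMS parallel remark; otherwise, with this much memory, marking cannot finish in the short time a minor GC takes. Then lower the CMS initiating occupancy fraction: 80% is too late, and the service may go down before the GC completes. You can change it to 60 or 65. This has worked in practice: since I adjusted my cluster, no FE has crashed.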
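
Before a full Prometheus + Grafana setup is in place, old-generation growth can be watched from the command line with the JDK's `jstat` tool. This is a different, lighter-weight technique than the monitoring stack suggested above, and the sketch below makes assumptions: the helper name `old_gen_alerts`, the 65% threshold, and the `FE_PID` placeholder are illustrative only. It reads `jstat -gcutil` output and reports samples whose old-generation occupancy (the `O` column) is at or above the threshold.

```shell
# Reads `jstat -gcutil` output on stdin and prints one line for every sample
# whose old-generation occupancy (column "O", in percent) is at or above a
# threshold. The column is located by its header name, not by position.
# old_gen_alerts is a hypothetical helper name; the default of 65 is arbitrary.
old_gen_alerts() {
  local threshold="${1:-65}"
  awk -v limit="$threshold" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "O") col = i; next }
    col && ($col + 0) >= limit {
      printf "old gen at %s%% (limit %s%%)\n", $col, limit
    }
  '
}

# Usage against a live FE, sampling every 5 seconds
# (FE_PID is a placeholder for the FE Java process id):
#   jstat -gcutil "$FE_PID" 5000 | old_gen_alerts 65
```

If the old generation stays above the threshold across many samples even after CMS cycles, that points to retained objects rather than to GC tuning alone.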

@zhbdesign

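Same problem here. The minor GC frequency could not keep up with the growth of the old generation, and in the end queries on all three FE nodes queued up and timed out, and the nodes hung and crashed. I suggest monitoring the FE JVM with Prometheus + Grafana to see where the problem really is. While you are at it, change your parameters: make the young generation 1/3 of the old generation, and instead of using something like -XX:NewRatio=3, set it to a fixed -Xmn16G. Turn on CMS parallel remark; otherwise, with this much memory, marking cannot finish in the short time a minor GC takes. Then lower the CMS initiating occupancy fraction: 80% is too late, and the service may go down before the GC completes. You can change it to 60 or 65. This has worked in practice: since I adjusted my cluster, no FE has crashed.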

Could you share the modified startup parameters?

@ziyanTOP
Contributor

JAVA_OPTS="-server -Xmx64g -Xmn16g -Xms32g -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$DORIS_HOME/log/fe.gc.log.$DATE" @zhbdesign Set the actual memory sizes according to what your machine really has.
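
To make these flags concrete, the arithmetic behind them can be worked through as a rough sketch. It assumes the heap has grown to its full -Xmx size and ignores the JVM's internal alignment, so the real numbers will differ slightly. The variable and function names are illustrative. It shows that -Xmn16g gives a 16 GB young generation against a 48 GB old generation (the 1/3 ratio recommended above), and where a CMS cycle would start at 65% versus the default 80%.

```shell
# Approximate heap layout implied by -Xmx64g -Xmn16g -XX:SurvivorRatio=8,
# and the old-gen occupancy at which CMS would start a cycle.
# All sizes are in MB; shell integer division rounds down.
heap_mb=$((64 * 1024))     # -Xmx64g
young_mb=$((16 * 1024))    # -Xmn16g
survivor_ratio=8           # -XX:SurvivorRatio=8 (eden : one survivor space)

old_mb=$((heap_mb - young_mb))
# Young gen = eden + 2 survivor spaces, with eden = survivor_ratio * survivor.
survivor_mb=$((young_mb / (survivor_ratio + 2)))
eden_mb=$((young_mb - 2 * survivor_mb))

# $1 = value of -XX:CMSInitiatingOccupancyFraction (percent of old gen)
cms_trigger_mb() {
  echo $((old_mb * $1 / 100))
}

echo "old gen:            ${old_mb} MB"
echo "eden:               ${eden_mb} MB"
echo "each survivor:      ${survivor_mb} MB"
echo "CMS starts at 65%:  $(cms_trigger_mb 65) MB, headroom $((old_mb - $(cms_trigger_mb 65))) MB"
echo "CMS starts at 80%:  $(cms_trigger_mb 80) MB, headroom $((old_mb - $(cms_trigger_mb 80))) MB"
```

Lowering the fraction from 80 to 65 leaves roughly 7 GB more old-generation headroom for the concurrent cycle to finish in. Note that HotSpot treats this fraction as a hint after the first cycle unless -XX:+UseCMSInitiatingOccupancyOnly is also set.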

@DA1OOO
Author

DA1OOO commented Dec 26, 2023

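After switching to the G1 collector and increasing the JVM memory, things are normal for now.
image
I still don't understand why the memory grows so fast.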
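
The exact G1 settings used here were not posted, so the following is only a sketch of what a G1-based JAVA_OPTS might look like for JDK 8, not the configuration that was actually applied. It is derived from the default options quoted earlier in this thread with the CMS and ParNew flags removed. The -Xmx64g value is carried over from the earlier configuration, not the increased size, and the DORIS_HOME fallback path is a hypothetical placeholder.

```shell
# Hypothetical fallback values so the fragment is self-contained;
# a real fe.conf and start script would normally provide these.
DORIS_HOME="${DORIS_HOME:-/opt/doris/fe}"
CUR_DATE="$(date +%Y%m%d-%H%M%S)"

# JDK 8 style flags. -XX:MaxGCPauseMillis=200 and
# -XX:InitiatingHeapOccupancyPercent=45 are the G1 defaults, written out
# here only to make the tuning knobs visible.
JAVA_OPTS="-Djavax.security.auth.useSubjectCredsOnly=false -Xss4m -Xmx64g \
-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 \
-XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-Xloggc:$DORIS_HOME/log/fe.gc.log.$CUR_DATE"

echo "$JAVA_OPTS"
```

Switching collectors changes how garbage is reclaimed, not how much is retained, so it does not by itself explain the fast growth asked about above.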

@ihadoop

ihadoop commented Dec 27, 2023

You can upload the dump file here.

@DA1OOO DA1OOO closed this as completed Apr 12, 2024