[Bug][remote] channel time out #3789
Comments
Hi, can you provide the detailed stack trace for this error? Also, please confirm whether your network environment changed during this period.
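I monitored the network today. The connection between the two servers never dropped, yet the timeout still occurs from time to time. It is strange that nobody else has reported this. Does the Netty communication layer have any heartbeat detection? Could a heartbeat mechanism be added to improve stability?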
Thank you very much for your feedback. We will send an email today to discuss possible solutions, update this issue with the outcome, and resolve it as soon as possible.
I think this is a network jitter problem. That said, in NettyRemotingClient.createChannel, I think the future.sync() call can be changed to channelFuture.awaitUninterruptibly(this.nettyClientConfig.getConnectTimeoutMillis()).
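To illustrate the idea behind that suggestion (wait for the connect attempt only up to a timeout instead of blocking until it completes), here is a minimal sketch. It is not the DolphinScheduler code: it uses a JDK `CompletableFuture` as a stand-in for Netty's `ChannelFuture` so that it runs without dependencies, and the class name `BoundedConnectSketch` and the method `awaitConnect` are hypothetical. Unlike `awaitUninterruptibly`, this stand-in can be interrupted, so it restores the interrupt flag.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration, not part of DolphinScheduler or Netty.
public class BoundedConnectSketch {

    /**
     * Waits at most timeoutMillis for a pending connect attempt.
     * Returns the connection on success, or null if the attempt
     * failed or did not complete within the timeout.
     */
    static <C> C awaitConnect(CompletableFuture<C> pending, long timeoutMillis) {
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Give up on this attempt so the caller can retry or fail fast.
            pending.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The connect attempt itself failed.
            return null;
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller.
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        // A connect attempt that never completes: returns after about 100 ms.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println(awaitConnect(stuck, 100));

        // A connect attempt that has already succeeded.
        System.out.println(awaitConnect(CompletableFuture.completedFuture("channel-1"), 100));
    }
}
```

With Netty itself, `awaitUninterruptibly(long timeoutMillis)` returns a boolean saying whether the future completed in time, so the caller would still need to check that result and whether the future succeeded before using the channel.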
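I added a scheduled task in DS that runs every 20 minutes to keep the master-to-worker communication alive. It has run smoothly overnight with no further timeouts, so I believe adding a heartbeat mechanism would fully solve this. One more thing: worker nodes do not delete the working directory after finishing a task. Is the method that deletes the directory on the master? That is also a big problem!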
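The heartbeat idea discussed in this thread could look roughly like the sketch below: a timer sends a ping at a fixed interval, and the connection is treated as stale when no reply has arrived within a threshold, so it can be closed and recreated. This is an illustration under assumptions, not the project's implementation. The class `HeartbeatSketch`, its methods, and the interval values are all hypothetical, and it uses only the JDK. In a Netty pipeline, idle detection is usually built on `io.netty.handler.timeout.IdleStateHandler` instead.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration, not part of DolphinScheduler or Netty.
public class HeartbeatSketch {
    private final long intervalMillis;
    private final long staleAfterMillis;
    private final Runnable sendPing;
    private final AtomicLong lastPongMillis = new AtomicLong(System.currentTimeMillis());
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "heartbeat");
                t.setDaemon(true); // do not keep the JVM alive for heartbeats
                return t;
            });

    HeartbeatSketch(long intervalMillis, long staleAfterMillis, Runnable sendPing) {
        this.intervalMillis = intervalMillis;
        this.staleAfterMillis = staleAfterMillis;
        this.sendPing = sendPing;
    }

    /** Starts sending pings at a fixed interval. */
    void start() {
        timer.scheduleAtFixedRate(() -> {
            try {
                sendPing.run();
            } catch (RuntimeException e) {
                // Swallow so one failed ping does not cancel the schedule.
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Call when a heartbeat reply (or any data) arrives from the peer. */
    void onPong(long nowMillis) {
        lastPongMillis.set(nowMillis);
    }

    /** True when no reply has been seen for longer than the threshold. */
    boolean isStale(long nowMillis) {
        return nowMillis - lastPongMillis.get() > staleAfterMillis;
    }

    void stop() {
        timer.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatSketch hb = new HeartbeatSketch(50, 200, () -> System.out.println("ping"));
        hb.start();
        Thread.sleep(180);
        hb.stop();
        System.out.println("stale: " + hb.isStale(System.currentTimeMillis()));
    }
}
```

A sender would check `isStale(...)` before reusing a cached connection and recreate it when the check fails, rather than relying only on the connection reporting itself as active.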
Thank you so much for your feedback. I will finish it as soon as possible.
Hi, do you have any ideas on how to solve this problem?
Please refer to: https://lists.apache.org/thread.html/rb3e3c5f09764bae74cdeef16ee12db0e751d463fd2aed2d011ad5c6e%40%3Cdev.dolphinscheduler.apache.org%3E
Describe the bug
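Under certain network conditions, when the master submits a task, Netty communication fails and the task information never reaches the worker. After a long wait, a timeout exception is thrown, and after a while the same thing happens again.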
To Reproduce
Steps to reproduce the behavior, for example:
Expected behavior
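In the send method, the channel's state is checked for active when the channel is obtained. I suspect that a channel reported as active here cannot actually send data to the worker. Once that channel fails, a newly created channel works for a short while, but after some time the problem recurs.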
Screenshots
Screenshots cannot be taken in the company environment.
Which version of Dolphin Scheduler:
-[1.3.1]
-[1.3.2]
Additional context
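Results may differ across network environments: a friend's test cluster shows no errors, while their production environment does. My own production environment has not gone live for testing yet. In my test environment the problem occurs roughly once every half hour.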
Requirement or improvement