[Bug][remote] channel time out #3789

Closed
nightxing opened this issue Sep 22, 2020 · 9 comments · Fixed by #3868
@nightxing

Describe the bug
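Under certain network conditions, when the master submits a task it cannot communicate over Netty, so the task information never reaches the worker. After a long wait a timeout exception is thrown, and some time later the same thing happens again.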

To Reproduce
Steps to reproduce the behavior, for example:

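  1. Manually run a workflow.
  2. The workflow stays in the running state, and every task shows the gray dot of the "submitted" state.
  3. After a long time, the master node throws a timeout exception.
  4. The worker never receives the message from the master.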

Expected behavior
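The send method checks whether the channel is active when it fetches the channel. I suspect that the channel returned here is reported as active but cannot actually send data to the worker. Once that channel fails with an exception, the newly created channel works for a short time, but after a while the problem reproduces.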

Screenshots
I cannot take screenshots in my company environment.

Which version of Dolphin Scheduler:
-[1.3.1]
-[1.3.2]

Additional context
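Results may differ across network environments: a friend's test cluster showed no exception, while their production cluster did. My own production environment has not gone live for testing yet; in my test environment the problem appears roughly once every half hour.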

Requirement or improvement

  • I hope this can be fixed as soon as possible; it seriously affects scheduling.
@nightxing nightxing added the bug Something isn't working label Sep 22, 2020
@CalvinKirs
Member

Hi, can you provide the detailed stack trace for this error? In addition, please confirm whether your network environment changed during this period.

@Slowfever-star

Slowfever-star commented Sep 23, 2020

I have also encountered a similar problem. The error report is as follows:
[image: 微信图片_20200923095731]

@nightxing
Author

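I monitored the network today. The connection between the two servers never dropped, yet the timeout still occurs from time to time. It is strange that nobody else has reported this. Is there any heartbeat detection in the Netty communication? Could a heartbeat mechanism be used to improve stability?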

@CalvinKirs
Member

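> I monitored the network today. The connection between the two servers never dropped, yet the timeout still occurs from time to time. It is strange that nobody else has reported this. Is there any heartbeat detection in the Netty communication? Could a heartbeat mechanism be used to improve stability?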

Thank you very much for your feedback. We will send an email today to discuss related solutions for this issue. We will synchronize the issue at that time and we will resolve this issue as soon as possible.

@qiaozhanwei
Contributor

I think this is a network jitter problem. That said, in NettyRemotingClient.createChannel, I think future.sync() can be changed to channelFuture.awaitUninterruptibly(this.nettyClientConfig.getConnectTimeoutMillis()).
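The idea behind replacing an unbounded wait with a timed one can be sketched with the JDK alone. In the sketch below, `CompletableFuture` stands in for Netty's `ChannelFuture`, and `BoundedConnectSketch` and `awaitConnect` are hypothetical names, not DolphinScheduler code. One difference to keep in mind: this stand-in reacts to thread interruption, whereas `awaitUninterruptibly` by definition does not.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration: a bounded wait on a pending connect attempt.
class BoundedConnectSketch {

    /**
     * Waits for a pending connect for at most timeoutMillis.
     * Returns the connection if it completed in time, or null so the
     * caller can fail fast instead of blocking indefinitely.
     */
    static <T> T awaitConnect(CompletableFuture<T> pending, long timeoutMillis) {
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Give up on the attempt so it does not linger in the background.
            pending.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The connect attempt itself failed.
            return null;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag for the caller.
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> connected = CompletableFuture.completedFuture("channel-to-worker");
        CompletableFuture<String> stuck = new CompletableFuture<>(); // never completes

        System.out.println(awaitConnect(connected, 100));
        System.out.println(awaitConnect(stuck, 100));
    }
}
```

A bounded wait only limits how long the connect step can block. It would not by itself detect an established channel that has silently stopped delivering data, which is the symptom described at the top of this issue.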

@nightxing
Author

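I added a scheduled task in DS that runs every 20 minutes to keep the master-to-worker communication alive. It has run smoothly all night with no more timeouts, so I think adding a heartbeat mechanism would solve this completely. One more thing: worker nodes do not delete the working directory after finishing a task. Is the directory-deletion method on the master? That is also a big problem!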
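The workaround above amounts to an application-level keep-alive: send something whenever the link has been idle for too long. A minimal JDK-only sketch of that idea follows. `HeartbeatSketch`, `markWrite`, `pingIfIdle`, and the `sendPing` callback are hypothetical names, and the clock is passed in explicitly so the logic is easy to test. In a Netty pipeline, idle detection of this kind is commonly built on `io.netty.handler.timeout.IdleStateHandler`; whether the eventual fix for this issue uses it is not stated in this thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration: ping the peer when no traffic has been written recently.
class HeartbeatSketch {
    private final long idleLimitMillis;
    private final Runnable sendPing;          // assumed callback that writes a ping to the worker
    private final AtomicLong lastWriteMillis;

    HeartbeatSketch(long idleLimitMillis, long nowMillis, Runnable sendPing) {
        this.idleLimitMillis = idleLimitMillis;
        this.sendPing = sendPing;
        this.lastWriteMillis = new AtomicLong(nowMillis);
    }

    /** Call whenever real traffic is written to the worker. */
    void markWrite(long nowMillis) {
        lastWriteMillis.set(nowMillis);
    }

    /** Sends a ping if nothing was written for idleLimitMillis; returns whether it did. */
    boolean pingIfIdle(long nowMillis) {
        if (nowMillis - lastWriteMillis.get() < idleLimitMillis) {
            return false;
        }
        sendPing.run();
        lastWriteMillis.set(nowMillis);
        return true;
    }

    /** Runs the idle check periodically on a daemon thread; shut the returned executor down to stop it. */
    ScheduledExecutorService start(long checkEveryMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heartbeat-sketch");
            t.setDaemon(true);
            return t;
        });
        timer.scheduleAtFixedRate(
                () -> pingIfIdle(System.currentTimeMillis()),
                checkEveryMillis, checkEveryMillis, TimeUnit.MILLISECONDS);
        return timer;
    }
}
```

The idle limit would need to be shorter than whatever timeout causes intermediate network devices to drop the connection. That value is environment-specific and is not given in this thread, apart from the observation that a 20-minute interval was enough in one test environment.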

@CalvinKirs
Member

Thank you so much for your feedback. I will finish it as soon as possible.
Email discussion link:
https://lists.apache.org/thread.html/rb3e3c5f09764bae74cdeef16ee12db0e751d463fd2aed2d011ad5c6e%40%3Cdev.dolphinscheduler.apache.org%3E

@lenboo lenboo added this to the 1.3.3-release milestone Sep 25, 2020
@Slowfever-star

Hi, do you have any ideas on how to solve this problem?

@davidzollo
Contributor

> Hi, do you have any ideas on how to solve this problem?

Please refer to: https://lists.apache.org/thread.html/rb3e3c5f09764bae74cdeef16ee12db0e751d463fd2aed2d011ad5c6e%40%3Cdev.dolphinscheduler.apache.org%3E
