elastic-job-lite-spring 2.1.5: some jobs are not scheduled #403

luoyong1989 · 2017-09-08T08:21:13Z

Please answer these questions before submitting your issue. Thanks!

Which version of Elastic-Job are you using?

1. elastic-job-lite-spring 2.1.5

2. A Spring Boot project

@ImportResource(locations = "classpath:es-job.xml")
public class BizApplication {}

Expected behavior

The configured jobs are triggered on schedule.

Actual behavior

Some jobs are never triggered.

Steps to reproduce the behavior

Run the application normally. One job did not produce its expected output, and the logs showed that the job had never been triggered.

Please provide the reproduce example codes (such as github link) if possible.

Configuration snippet

<reg:zookeeper id="regCenter" server-lists="@zookeeper.serverLists@"
                   namespace="@zookeeper.namespace@"
                   session-timeout-milliseconds="600000"
                   base-sleep-time-milliseconds="1000"
                   max-sleep-time-milliseconds="3000"
                   max-retries="3"/>

 <job:simple id="job1"
                class="com.job.job1"
                registry-center-ref="regCenter"
                sharding-total-count="1"
                cron="0 0/20 * * * ?"
                failover="true"
                overwrite="true"
                description="Upload job"/>

Java code

@Service("job1")
public class PostSalesDataByTCP extends AbstractPostDataBaseServiceImpl {
    @Override
    public void execute(ShardingContext shardingContext) {
        log.info("tcp job start Executed" + shardingContext.getJobName()
            + "job IP is:[" + CommonUtils.getServerIp() + "]");
        getMerchantConfig();
    }
}

The operations platform shows the project as running normally.

luoyong1989 · 2017-09-08T08:24:22Z

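The project runs on Alibaba Cloud and there are no error logs. The same thing happened when we were on 1.x, so we switched to 2.x, and recently it has happened again.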

fanfantastic · 2017-09-13T05:19:02Z

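I recently ran into this for the first time as well, on elastic-job-lite 2.0.3.
One job runs on 5 machines with 20 shards and fires every second; on the surface, the job simply did not execute.
Using an ElasticJobListener, I found that on 3 of the machines the xxxjob_Worker was in fact firing according to the cron expression, but its set of sharding items was empty; on the other 2 machines the xxxjob_Worker did not fire at all.
The problem appeared when the servers were restarted for a release. Could something have gone wrong in the election and sharding phase triggered by the restart? We later restarted the service again, and after that it worked.
This was a production environment where we had limited permissions, and the situation was urgent, so there was little time to collect clues.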

luoyong1989 · 2017-09-13T09:12:31Z

@fanfantastic Exactly. A restart fixed it for us too, though not the first one; it only recovered after the second restart.

terrymanu · 2017-09-13T11:00:59Z

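Enable the reconcileIntervalMinutes setting so the job can repair itself.
If that still does not help, this information alone is not enough to reach a conclusion. Please make words like "individual" and "some" concrete; otherwise the bug cannot be confirmed. Thanks for understanding.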

luoyong1989 · 2017-09-13T11:15:39Z

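@terrymanu I use those words because the jobs that fail to be scheduled are random; there is no way to tell in advance which one will be skipped... I'm at a loss too...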

kevinmails · 2018-06-16T09:21:38Z

@terrymanu If reconcile-interval-minutes is not configured in the Spring XML, does that mean the setting is disabled?

lijian0706 · 2020-11-28T10:18:22Z

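I ran into this too. It was an emergency recovery in production, so there was no time to investigate. After restarting the application the job still did not run, so I had to log in to ZooKeeper, delete the nodes, and restart the application, after which it recovered. It keeps happening every now and then while running, and it is hard to put up with.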

lyl2008dsg · 2024-03-20T13:08:17Z

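We recently hit a problem in production that turned out to be caused by a ZooKeeper (ZK) failure. During the failure, every node timed out when trying to connect to ZK, so the scheduled jobs were not triggered on time. ZK recovered by itself after 2 minutes, but the jobs missed in the meantime were not executed afterwards as compensation.

[ERROR][2024-03-14T00:39:56.834+0800][ConnectionState.java:228] …… _msg=Connection timed out for connection string (ip:port) and timeout (1000) / elapsed (1000)

As for why ReconcileService did not reshard, I suspect it was because hasShardingInfoInOfflineServers = true, so nothing was repaired.

Our short-term workaround is to detect fires that should have happened but did not, and schedule a compensating run.
I hope the maintainers can help follow up with a more suitable approach.
respect~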
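
For readers who want to try the "detect and compensate" workaround described above, here is a minimal sketch of the detection half. It is not part of Elastic-Job: the class name MissedFireDetector, the grace period, and the idea that the job records its own last fire time somewhere are all assumptions made for illustration. The 20-minute interval matches the cron expression in the original report, and the timestamps are made up.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper (not an Elastic-Job class): counts fires that are overdue. */
public class MissedFireDetector {
    private final Duration interval; // expected spacing between fires, e.g. 20 minutes
    private final Duration grace;    // tolerated lateness before a fire counts as missed

    public MissedFireDetector(Duration interval, Duration grace) {
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (grace.isNegative()) {
            throw new IllegalArgumentException("grace must not be negative");
        }
        this.interval = interval;
        this.grace = grace;
    }

    /** Number of scheduled fires since lastFire that are late by more than the grace period. */
    public long missedFires(Instant lastFire, Instant now) {
        long overdueMillis = Duration.between(lastFire, now).minus(grace).toMillis();
        return overdueMillis < 0 ? 0 : overdueMillis / interval.toMillis();
    }

    public static void main(String[] args) {
        MissedFireDetector detector =
            new MissedFireDetector(Duration.ofMinutes(20), Duration.ofMinutes(1));
        // Assumed to be recorded by the job itself on every successful execution.
        Instant lastFire = Instant.parse("2024-03-14T00:20:00Z");
        Instant now = Instant.parse("2024-03-14T01:05:00Z");
        long missed = detector.missedFires(lastFire, now);
        if (missed > 0) {
            System.out.println("missed fires: " + missed + ", a compensating run is due");
        }
    }
}
```

How the compensating run is actually triggered is left open here, since the thread does not say which mechanism was used.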

lyl2008dsg · 2024-03-21T13:04:37Z

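After reading through the source code again, I found that the following combination of three parameters can handle a ZK network failure for a job that runs once a day:

private int baseSleepTimeMilliseconds = 1 * 60 * 1000;   // Initial interval to wait between retries.
private int maxSleepTimeMilliseconds = 10 * 60 * 1000;  // Maximum interval to wait between retries.
private int maxRetries = 10;

Explanation of the configuration above: assuming the network failure lasts no more than 10 minutes, this configuration can handle it easily.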
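
To put rough numbers on the retry window these three parameters imply, here is a small sketch. It assumes the backoff rule of Curator's ExponentialBackoffRetry as I understand it (sleep = base * max(1, random integer below 2^(retry+1)), capped at the maximum sleep time); please check that against the Curator version you run. The class RetryWindow and its method names are made up for this example, and only the sleeps between retries are counted, not the connection timeout of each attempt.

```java
/**
 * Hypothetical calculator: bounds on the total time spent sleeping between
 * retries, ASSUMING sleep = base * max(1, random in [0, 2^(retry+1))),
 * capped at the maximum sleep time.
 */
public class RetryWindow {

    /** Shortest total sleep: every random factor comes out as 1. */
    public static long minTotalMillis(long baseMs, long maxMs, int maxRetries) {
        return maxRetries * Math.min(baseMs, maxMs);
    }

    /** Longest total sleep: every random factor comes out at its upper bound. */
    public static long maxTotalMillis(long baseMs, long maxMs, int maxRetries) {
        long total = 0;
        // Intended for small retry counts; very large counts would overflow the shift.
        for (int retry = 0; retry < maxRetries; retry++) {
            long largestFactor = (1L << (retry + 1)) - 1; // upper bound of the random factor is exclusive
            total += Math.min(baseMs * largestFactor, maxMs);
        }
        return total;
    }

    public static void main(String[] args) {
        // Settings from the <reg:zookeeper> element in the original report.
        System.out.printf("original: %d to %d ms%n",
            minTotalMillis(1_000, 3_000, 3), maxTotalMillis(1_000, 3_000, 3));
        // Settings proposed in this comment.
        System.out.printf("proposed: %d to %d ms%n",
            minTotalMillis(60_000, 600_000, 10), maxTotalMillis(60_000, 600_000, 10));
    }
}
```

Under that assumption, the settings in the original report give up after a few seconds of sleeping, far less than the 2-minute outage, while the proposed settings keep retrying for at least 10 minutes, which is consistent with the "no more than 10 minutes" assumption above.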

However, this configuration applies at the service level, not the job level, so I am hoping for a job-level ExponentialBackoffRetry.
respect~

terrymanu closed this as completed Sep 13, 2017

terrymanu added the invalid label Sep 13, 2017
