1.0.8 [bug] Removing a Job programmatically throws NumberFormatException, leaving some of the Job's nodes not fully cleaned up #99

Closed
gaoyaqiu opened this issue May 27, 2016 · 10 comments

@gaoyaqiu

1. The key calls are as follows:

// Shut down the Job first
jobAPIService.getJobOperatorAPI().shutdown(Optional.of(jobServer.getJobName()), Optional.absent());

// Then remove the Job (this clears its nodes in zk)
jobAPIService.getJobOperatorAPI().remove(Optional.of(jobServer.getJobName()), Optional.absent());

2. The exception is as follows:
[JOB] 2016-05-27-19:23:10.251 [DEFAULT.ED8A44CDC0A8C7792D59468EE687552C_Scheduler_Worker-1] ERROR c.d.d.j.p.j.t.s.AbstractSimpleElasticJob - Elastic job: exception occur in job processing...
java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:542) ~[na:1.8.0_45]
at java.lang.Integer.parseInt(Integer.java:615) ~[na:1.8.0_45]
at com.dangdang.ddframe.job.internal.config.ConfigurationService.getShardingTotalCount(ConfigurationService.java:88) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.guarantee.GuaranteeService.isAllCompleted(GuaranteeService.java:87) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.api.listener.AbstractDistributeOnceElasticJobListener.afterJobExecuted(AbstractDistributeOnceElasticJobListener.java:84) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.schedule.JobFacade.afterJobExecuted(JobFacade.java:246) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.job.AbstractElasticJob.execute(AbstractElasticJob.java:68) ~[elastic-job-core-1.0.8.jar:na]
at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-2.2.2.jar:na]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [quartz-2.2.2.jar:na]

@gaoyaqiu
Author

Not sure why, but removal no longer throws the exception, and the nodes in zk have all been deleted...

@terrymanu
Member

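This may be caused by zk latency. We are also considering replacing zk with another registry center, but that will not happen soon.
Keep observing for now; if it cannot be reproduced, please close the issue.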

@gaoyaqiu
Author

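Sorry, Liang, the removal problem is still there. There is no exception now, but remove has to be called twice in a row; only the second call clears the Job's nodes in zk.

Steps to reproduce:

1. Create the Job and let it run once

2. Remove the Job programmatically; the key calls are as follows (note: run them once at this step):

// 2.1 Shut down the Job first
jobAPIService.getJobOperatorAPI().shutdown(Optional.of(jobServer.getJobName()), Optional.absent());

// 2.2 Then remove the Job (this clears its nodes in zk)
jobAPIService.getJobOperatorAPI().remove(Optional.of(jobServer.getJobName()), Optional.absent());

3. Check zk: the Job node still exists

4. Run the code from 2.1 and 2.2 again

5. The Job node in zk is now deleted

Problem: remove has to be called twice before the Job node in zk is actually deleted.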

@gaoyaqiu
Author

When removing a Job, the NumberFormatException still occurs from time to time:

[JOB] 2016-05-31-15:00:30.441 [DEFAULT.059CC194C0A802E4132FD022D84A1AE4_Scheduler_Worker-1] ERROR c.d.d.j.p.j.t.s.AbstractSimpleElasticJob - Elastic job: exception occur in job processing...
java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:542) ~[na:1.8.0_45]
at java.lang.Integer.parseInt(Integer.java:615) ~[na:1.8.0_45]
at com.dangdang.ddframe.job.internal.config.ConfigurationService.getShardingTotalCount(ConfigurationService.java:88) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.guarantee.GuaranteeService.isAllCompleted(GuaranteeService.java:87) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.api.listener.AbstractDistributeOnceElasticJobListener.afterJobExecuted(AbstractDistributeOnceElasticJobListener.java:84) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.schedule.JobFacade.afterJobExecuted(JobFacade.java:246) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.job.AbstractElasticJob.execute(AbstractElasticJob.java:68) ~[elastic-job-core-1.0.8.jar:na]
at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-2.2.2.jar:na]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [quartz-2.2.2.jar:na]

@terrymanu
Member

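Could you sleep for a moment before calling remove and check whether the removal then succeeds? That would show whether latency is the cause.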

@gaoyaqiu
Author

gaoyaqiu commented Jun 2, 2016

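After shutting down the job I waited 500 ms before calling remove, and the NumberFormatException is still thrown.
Below is the log output after the remove method was called: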
[JOB] 2016-06-02-15:08:50.875 [DEFAULT.0FF16314C0A802AA12D41AD69F92AEFD_Scheduler_Worker-1] DEBUG c.d.d.r.e.RegExceptionHandler - Elastic job: ignored exception for: KeeperErrorCode = NoNode for /cy-job/0FF16314C0A802AA12D41AD69F92AEFD/config/monitorExecution
[JOB] 2016-06-02-15:08:50.876 [DEFAULT.0FF16314C0A802AA12D41AD69F92AEFD_Scheduler_Worker-1] DEBUG c.d.d.r.e.RegExceptionHandler - Elastic job: ignored exception for: KeeperErrorCode = NoNode for /cy-job/0FF16314C0A802AA12D41AD69F92AEFD/config/monitorExecution
[JOB] 2016-06-02-15:08:50.878 [DEFAULT.0FF16314C0A802AA12D41AD69F92AEFD_Scheduler_Worker-1] DEBUG c.d.d.r.e.RegExceptionHandler - Elastic job: ignored exception for: KeeperErrorCode = NoNode for /cy-job/0FF16314C0A802AA12D41AD69F92AEFD/config/misfire
[JOB] 2016-06-02-15:08:50.879 [DEFAULT.0FF16314C0A802AA12D41AD69F92AEFD_Scheduler_Worker-1] DEBUG c.d.d.r.e.RegExceptionHandler - Elastic job: ignored exception for: KeeperErrorCode = NoNode for /cy-job/0FF16314C0A802AA12D41AD69F92AEFD/config/monitorExecution
[JOB] 2016-06-02-15:08:50.946 [DEFAULT.0FF16314C0A802AA12D41AD69F92AEFD_Scheduler_Worker-1] DEBUG c.d.d.r.e.RegExceptionHandler - Elastic job: ignored exception for: KeeperErrorCode = NoNode for /cy-job/0FF16314C0A802AA12D41AD69F92AEFD/config/shardingTotalCount
[JOB] 2016-06-02-15:08:50.947 [DEFAULT.0FF16314C0A802AA12D41AD69F92AEFD_Scheduler_Worker-1] ERROR c.d.d.j.p.j.t.s.AbstractSimpleElasticJob - Elastic job: exception occur in job processing...
java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(Integer.java:542) ~[na:1.8.0_45]
at java.lang.Integer.parseInt(Integer.java:615) ~[na:1.8.0_45]
at com.dangdang.ddframe.job.internal.config.ConfigurationService.getShardingTotalCount(ConfigurationService.java:88) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.guarantee.GuaranteeService.isAllCompleted(GuaranteeService.java:87) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.api.listener.AbstractDistributeOnceElasticJobListener.afterJobExecuted(AbstractDistributeOnceElasticJobListener.java:84) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.schedule.JobFacade.afterJobExecuted(JobFacade.java:246) ~[elastic-job-core-1.0.8.jar:na]
at com.dangdang.ddframe.job.internal.job.AbstractElasticJob.execute(AbstractElasticJob.java:68) ~[elastic-job-core-1.0.8.jar:na]
at org.quartz.core.JobRunShell.run(JobRunShell.java:202) [quartz-2.2.2.jar:na]
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573) [quartz-2.2.2.jar:na]

@gaoyaqiu
Author

gaoyaqiu commented Jun 2, 2016

Calling shutdown never throws. Calling remove throws this NumberFormatException almost every time.
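
The top frames of the stack trace can be reproduced in isolation: `Integer.parseInt` throws `NumberFormatException` when it is handed `null`, and the DEBUG lines above show a NoNode error for `.../config/shardingTotalCount` being ignored just before the failure. The following is a minimal sketch only, assuming a plain in-memory map as a stand-in for zk; `MissingNodeDemo` and `readShardingTotalCount` are hypothetical names and not elastic-job code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a registry lookup; not elastic-job code.
class MissingNodeDemo {

    // Mimics reading a config node and parsing it without a null check.
    static int readShardingTotalCount(final Map<String, String> registry, final String jobName) {
        // Map.get returns null when the key (the "node") no longer exists.
        String value = registry.get("/" + jobName + "/config/shardingTotalCount");
        return Integer.parseInt(value);
    }

    public static void main(final String[] args) {
        Map<String, String> registry = new HashMap<String, String>();
        registry.put("/demoJob/config/shardingTotalCount", "3");
        System.out.println(readShardingTotalCount(registry, "demoJob"));

        // Simulate remove() deleting the job's nodes while a run is still in flight.
        registry.remove("/demoJob/config/shardingTotalCount");
        try {
            readShardingTotalCount(registry, "demoJob");
        } catch (final NumberFormatException ex) {
            System.out.println("NumberFormatException after the node was removed");
        }
    }
}
```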

@ZhangShufan15

ZhangShufan15 commented Jun 14, 2016

I think the crux of the problem may be here:

public void shutdown() {
    // Job resources are released here, before the Quartz scheduler is shut down.
    schedulerFacade.releaseJobResource();
    try {
        if (!scheduler.isShutdown()) {
            scheduler.shutdown();
        }
    } catch (final SchedulerException ex) {
        throw new JobException(ex);
    }
}

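**Here the resource cleanup completes before the Quartz scheduler is shut down.** So once the listener has handled the shutdown command, a job run that is still in progress has not finished yet. If remove is called at that point, it sees that the STATUS node is already gone and deletes the job's nodes. When the in-flight run reaches its end and calls GuaranteeService.isAllCompleted to read the sharding information, the node cannot be found, hence the error.

Later, @gaoyaqiu found that remove had to be called twice. That is probably because, at the time of the first call, the shutdown listener had not yet deleted the STATUS node, so the remove method returned immediately.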
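
The ordering concern can be modelled without Quartz or zk. This is a rough sketch under stated assumptions: a `java.util.concurrent.ExecutorService` stands in for the scheduler, and `ShutdownOrderSketch` and its event strings are hypothetical. It waits for the in-flight run to finish before releasing resources, which is the reverse of the order in the `shutdown()` method quoted above. Quartz itself offers `Scheduler.shutdown(boolean waitForJobsToComplete)`; whether the project uses it is not stated in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model: an ExecutorService stands in for the Quartz scheduler.
class ShutdownOrderSketch {

    // Returns the events in the order they happened.
    static List<String> shutdownThenRelease(final long jobMillis) {
        final List<String> events = Collections.synchronizedList(new ArrayList<String>());
        ExecutorService scheduler = Executors.newSingleThreadExecutor();
        // An in-flight job run that takes jobMillis to finish.
        scheduler.execute(() -> {
            try {
                Thread.sleep(jobMillis);
                events.add("job finished");
            } catch (final InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });
        // Stop accepting new work, but let the running job complete.
        scheduler.shutdown();
        try {
            if (!scheduler.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("in-flight job did not finish in time");
            }
        } catch (final InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        // Only now release resources (in the real system: the zk nodes).
        events.add("resources released");
        return events;
    }
}
```

With this ordering, the run's final bookkeeping always happens while its data still exists.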

@Override
public Collection<String> remove(final Optional<String> jobName, final Optional<String> serverIp) {
    return jobOperatorTemplate.operate(jobName, serverIp, new JobOperateCallback() {

        @Override
        public boolean doOperate(final String jobName, final String serverIp) {
            JobNodePath jobNodePath = new JobNodePath(jobName);
            if (registryCenter.isExisted(jobNodePath.getServerNodePath(serverIp, JobNodePath.STATUS_NODE))) {
                return false;
            }
            registryCenter.remove(jobNodePath.getServerNodePath(serverIp));
            if (registryCenter.getChildrenKeys(jobNodePath.getServerNodePath()).isEmpty()) {
                registryCenter.remove("/" + jobName);
            }
            return true;
        }
    });
}

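If this is adjusted, zk latency combined with the time the current run needs to finish could make remove appear to never take effect. If the zk nodes really must be deleted in that case (I do not think it is necessary: shutdown has already stopped the scheduled job, and leaving the zk nodes in place does no harm), you can call remove in a while loop until the remove method returns true.

It should also work to create the JobScheduler without SimpleDistributeOnceElasticJobListener. Then isAllCompleted is never called for the check, and this NumberFormatException is not thrown.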
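
The while-loop idea can be sketched as a bounded retry so that it cannot spin forever. This is an illustration only: `RetryUntilRemoved` and `retry` are hypothetical names, and the `BooleanSupplier` is assumed to wrap whatever check the caller trusts (for example, calling remove and then confirming the job node is gone).

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not part of elastic-job.
class RetryUntilRemoved {

    // Calls attempt until it returns true or maxAttempts is exhausted.
    static boolean retry(final BooleanSupplier attempt, final int maxAttempts, final long pauseMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            try {
                // Give the shutdown listener and the in-flight run time to finish.
                Thread.sleep(pauseMillis);
            } catch (final InterruptedException ex) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

A cap on attempts is a design choice here: if the nodes never disappear, the caller gets `false` back and can log it instead of blocking indefinitely.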

new JobScheduler(regCenter, simpleJobConfig, new SimpleDistributeOnceElasticJobListener()).init();

----->

new JobScheduler(regCenter, simpleJobConfig).init();

@terrymanu
Member

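Two issues are known so far:

  1. Job removal is asynchronous, so after the job is removed, a run that has not finished yet keeps creating zk data. This makes it look as if the job was never removed.
  2. The run that is still executing cannot find the corresponding information in zk, which causes the NumberFormatException.

Both have been fixed.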
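
The thread does not say how the second issue was fixed. One possible guard, shown purely as an assumption and not as the project's actual change, is to treat a missing node as "no value" instead of parsing it. `GuardedConfigRead` is a hypothetical name, and a map again stands in for zk.

```java
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical guard; not the project's actual fix.
class GuardedConfigRead {

    // Returns an empty OptionalInt when the node has already been removed.
    static OptionalInt shardingTotalCount(final Map<String, String> registry, final String jobName) {
        String value = registry.get("/" + jobName + "/config/shardingTotalCount");
        if (value == null) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(value));
    }
}
```

A caller could then skip the completion check when the result is empty, since the job is already being torn down.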

@shaohuanan

Hello, may I ask how the job removal feature is implemented?
