Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] [Worker] concurrent task response occasional cause status error #13409

Closed
2 of 3 tasks
KEN-LJQ opened this issue Jan 16, 2023 · 4 comments
Closed
2 of 3 tasks

[Bug] [Worker] concurrent task response occasional cause status error #13409

KEN-LJQ opened this issue Jan 16, 2023 · 4 comments
Assignees
Labels
bug Something isn't working need more information
Milestone

Comments

@KEN-LJQ
Copy link

KEN-LJQ commented Jan 16, 2023

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When a task finish very soon, like within 1 seconds, may stuck running status forever, although it had finish successfully.

while digging into code and log, we found that, when worker receive task, it will send two ack response first, and send finish response while task finish.
But when task finish very fast, those 3 response may receive by master at the same time.
企业微信截图_16729927376406

So, master may handle finish repsonse before ack response, which lead to wrong task status, and we can see it from db binlog

  1. insert a record while submit
#230105  6:31:41 server id 13  end_log_pos 476308895 CRC32 0xcd211e98   Query   thread_id=441009        exec_time=0     error_code=0
SET TIMESTAMP=1672871501/*!*/;
INSERT INTO t_ds_task_instance  ( dry_run,
flag,
environment_code,
pid,
task_params,
task_type,
task_instance_priority,
task_code,
worker_group,
state,
process_instance_id,
executor_id,
alert_flag,
first_submit_time,
max_retry_times,
retry_times,
submit_time,
name,
task_definition_version,
delay_time,
retry_interval )  VALUES  ( 0,
1,
0,
0,
'{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists `tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` string, \\n`consult_durat
ion_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int, \\n`consul
t_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showType":"","local
Params":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout":null}',
'SQL',
2,
7484249458146,
'default',
0,
19434,
3,
0,
'2023-01-05 06:31:41.514',
3,
0,
'2023-01-05 06:31:41.514',
'test',
2,
0,
1 )
/*!*/;
  1. dispatch to worker
#230105  6:31:41 server id 13  end_log_pos 476309341 CRC32 0x9c9c2283   Query   thread_id=441008        exec_time=0     error_code=0
SET TIMESTAMP=1672871501/*!*/;
update t_ds_task_instance
        set host        = '172.16.1.151:1235',
            submit_time = '2023-01-05 06:31:41.527'
        where id = 69502
/*!*/;
  1. update state to running
# at 476320841
#230105  6:31:42 server id 13  end_log_pos 476322293 CRC32 0x439e70b4   Query   thread_id=441009        exec_time=0     error_code=0
SET TIMESTAMP=1672871502/*!*/;
UPDATE t_ds_task_instance  SET dry_run=0,
flag=1,
environment_code=0,
pid=0,
task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists `tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` string, \\n`c
onsult_duration_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int
, \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
:null}',

task_type='SQL',
task_instance_priority=2,
task_code=7484249458146,
log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
host='172.16.1.151:1235',
start_time='2023-01-05 06:31:41',

worker_group='default',
state=1,
process_instance_id=19434,
executor_id=3,
alert_flag=0,
execute_path='/tmp/dolphinscheduler2/exec/process/7484249457760/7484249458144_6/19434/69502',
first_submit_time='2023-01-05 06:31:42',
max_retry_times=3,
retry_times=0,

submit_time='2023-01-05 06:31:42',
name='test',
task_definition_version=2,
delay_time=0,
retry_interval=1  WHERE id=69502
/*!*/;
  1. update state to success
#230105  6:31:42 server id 13  end_log_pos 476325363 CRC32 0x045eddfc   Query   thread_id=441009        exec_time=0     error_code=0
SET TIMESTAMP=1672871502/*!*/;
UPDATE t_ds_task_instance  SET dry_run=0,
flag=1,
environment_code=0,
pid=0,
task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists `tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` string, \\n`c
onsult_duration_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int
, \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
:null}',

task_type='SQL',
task_instance_priority=2,
task_code=7484249458146,
log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
host='172.16.1.151:1235',
start_time='2023-01-05 06:31:41',

worker_group='default',
state=7,
process_instance_id=19434,
executor_id=3,
alert_flag=0,
execute_path='/tmp/dolphinscheduler2/exec/process/7484249457760/7484249458144_6/19434/69502',
first_submit_time='2023-01-05 06:31:42',
max_retry_times=3,
retry_times=0,
var_pool='[]',
submit_time='2023-01-05 06:31:42',
name='test',
task_definition_version=2,
delay_time=0,
retry_interval=1,
end_time='2023-01-05 06:31:42'  WHERE id=69502
/*!*/;
  1. update state back to running
#230105  6:31:42 server id 13  end_log_pos 476326921 CRC32 0x0207b2c5   Query   thread_id=441008        exec_time=0     error_code=0
SET TIMESTAMP=1672871502/*!*/;
UPDATE t_ds_task_instance  SET dry_run=0,
flag=1,
environment_code=0,
pid=0,
task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists `tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` string, \\n`c
onsult_duration_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int
, \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
:null}',
task_type='SQL',
task_instance_priority=2,
task_code=7484249458146,
log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
host='172.16.1.151:1235',
start_time='2023-01-05 06:31:41',
worker_group='default',
state=1,
process_instance_id=19434,
executor_id=3,
alert_flag=0,
first_submit_time='2023-01-05 06:31:42',
max_retry_times=3,
retry_times=0,
submit_time='2023-01-05 06:31:42',
name='test',
task_definition_version=2,
delay_time=0,
retry_interval=1  WHERE id=69502

What you expected to happen

The final state of task should be success, which state=7

How to reproduce

Kind of hard to reproduce, we can try create a sql task with short run statment like create table , and create a branch merge like belove
企业微信截图_16729861779567

then return this process multiple times, untill task node stuck at running state forever

Anything else

No response

Version

2.0.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@KEN-LJQ KEN-LJQ added bug Something isn't working Waiting for reply Waiting for reply labels Jan 16, 2023
@github-actions
Copy link

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

When a task finish very soon, like within 1 seconds, may stuck running status forever, although it had finish successfully.

while digging into code and log, we found that, when worker receive task, it will send two ack response first, and send finish response while task finish.
But when task finish very fast, those 3 response may receive by master at the same time.
企业微信截图_16729927376406

So, master may handle finish repsonse before ack response, which lead to wrong task status, and we can see it from db binlog

  1. insert a record while submit
#230105  6:31:41 server id 13  end_log_pos 476308895 CRC32 0xcd211e98   Query   thread_id=441009        exec_time=0     error_code=0
SET TIMESTAMP=1672871501/*!*/;
INSERT INTO t_ds_task_instance  ( dry_run,
flag,
environment_code,
pid,
task_params,
task_type,
task_instance_priority,
task_code,
worker_group,
state,
process_instance_id,
executor_id,
alert_flag,
first_submit_time,
max_retry_times,
retry_times,
submit_time,
name,
task_definition_version,
delay_time,
retry_interval )  VALUES  ( 0,
1,
0,
0,
'{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists `tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` string, \\n`consult_durat
ion_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int, \\n`consul
t_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showType":"","local
Params":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout":null}',
'SQL',
2,
7484249458146,
'default',
0,
19434,
3,
0,
'2023-01-05 06:31:41.514',
3,
0,
'2023-01-05 06:31:41.514',
'test',
2,
0,
1 )
/*!*/;
  1. dispatch to worker
#230105  6:31:41 server id 13  end_log_pos 476309341 CRC32 0x9c9c2283   Query   thread_id=441008        exec_time=0     error_code=0
SET TIMESTAMP=1672871501/*!*/;
update t_ds_task_instance
        set host        = '172.16.1.151:1235',
            submit_time = '2023-01-05 06:31:41.527'
        where id = 69502
/*!*/;
  1. update state to running
# at 476320841
#230105  6:31:42 server id 13  end_log_pos 476322293 CRC32 0x439e70b4   Query   thread_id=441009        exec_time=0     error_code=0
SET TIMESTAMP=1672871502/*!*/;
UPDATE t_ds_task_instance  SET dry_run=0,
flag=1,
environment_code=0,
pid=0,
task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists `tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` string, \\n`c
onsult_duration_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int
, \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
:null}',

task_type='SQL',
task_instance_priority=2,
task_code=7484249458146,
log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
host='172.16.1.151:1235',
start_time='2023-01-05 06:31:41',

worker_group='default',
state=1,
process_instance_id=19434,
executor_id=3,
alert_flag=0,
execute_path='/tmp/dolphinscheduler2/exec/process/7484249457760/7484249458144_6/19434/69502',
first_submit_time='2023-01-05 06:31:42',
max_retry_times=3,
retry_times=0,

submit_time='2023-01-05 06:31:42',
name='test',
task_definition_version=2,
delay_time=0,
retry_interval=1  WHERE id=69502
/*!*/;
  1. update state to success
#230105  6:31:42 server id 13  end_log_pos 476325363 CRC32 0x045eddfc   Query   thread_id=441009        exec_time=0     error_code=0
SET TIMESTAMP=1672871502/*!*/;
UPDATE t_ds_task_instance  SET dry_run=0,
flag=1,
environment_code=0,
pid=0,
task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists `tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` string, \\n`c
onsult_duration_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int
, \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
:null}',

task_type='SQL',
task_instance_priority=2,
task_code=7484249458146,
log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
host='172.16.1.151:1235',
start_time='2023-01-05 06:31:41',

worker_group='default',
state=7,
process_instance_id=19434,
executor_id=3,
alert_flag=0,
execute_path='/tmp/dolphinscheduler2/exec/process/7484249457760/7484249458144_6/19434/69502',
first_submit_time='2023-01-05 06:31:42',
max_retry_times=3,
retry_times=0,
var_pool='[]',
submit_time='2023-01-05 06:31:42',
name='test',
task_definition_version=2,
delay_time=0,
retry_interval=1,
end_time='2023-01-05 06:31:42'  WHERE id=69502
/*!*/;
  1. update state back to running
#230105  6:31:42 server id 13  end_log_pos 476326921 CRC32 0x0207b2c5   Query   thread_id=441008        exec_time=0     error_code=0
SET TIMESTAMP=1672871502/*!*/;
UPDATE t_ds_task_instance  SET dry_run=0,
flag=1,
environment_code=0,
pid=0,
task_params='{"type":"HIVE","datasource":5,"sql":"CREATE TABLE if not exists `tmp.tmp_table_bss_consult_detail`(\\n`aid` int, \\n`consult_foreign_id` string, \\n`c
onsult_duration_second` int, \\n`consult_start_time` int, \\n`consult_end_time` int, \\n`consult_from` int, \\n`consult_vip_level` int, \\n`consult_evaluation` int
, \\n`consult_evaluation_type` int)\\nPARTITIONED BY (`dt` string)\\nSTORED AS parquet","udfs":"","sqlType":"1","title":"","receivers":"","receiversCc":"","showTyp
e":"","localParams":[],"connParams":"","preStatements":[],"postStatements":[],"conditionResult":"null","dependence":"null","switchResult":"null","waitStartTimeout"
:null}',
task_type='SQL',
task_instance_priority=2,
task_code=7484249458146,
log_path='/var/log/dolphinscheduler2/7484249458144_6/19434/69502.log',
host='172.16.1.151:1235',
start_time='2023-01-05 06:31:41',
worker_group='default',
state=1,
process_instance_id=19434,
executor_id=3,
alert_flag=0,
first_submit_time='2023-01-05 06:31:42',
max_retry_times=3,
retry_times=0,
submit_time='2023-01-05 06:31:42',
name='test',
task_definition_version=2,
delay_time=0,
retry_interval=1  WHERE id=69502

What you expected to happen

The final state of task should be success, which state=7

How to reproduce

Kind of hard to reproduce, we can try create a sql task with short run statment like create table , and create a branch merge like belove
企业微信截图_16729861779567

then return this process multiple times, untill task node stuck at running state forever

Anything else

No response

Version

2.0.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@github-actions
Copy link

Thank you for your feedback, we have received your issue, Please wait patiently for a reply.

  • In order for us to understand your request as soon as possible, please provide detailed information、version or pictures.
  • If you haven't received a reply for a long time, you can join our slack and send your question to channel #troubleshooting

@SbloodyS
Copy link
Member

Which version are you using? @KEN-LJQ

@SbloodyS SbloodyS added need more information and removed Waiting for reply Waiting for reply labels Jan 18, 2023
@KEN-LJQ
Copy link
Author

KEN-LJQ commented Jan 18, 2023

@SbloodyS 2.0.7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working need more information
Projects
None yet
Development

No branches or pull requests

3 participants