
Duplicate tasks invoked for a single task_id when a task is manually invoked from the task details modal #10026

Closed
andytrigg opened this issue Jul 28, 2020 · 37 comments
Labels
affected_version:2.1 Issues Reported for 2.1 area:Scheduler including HA (high availability) scheduler kind:bug This is a clearly a bug

Comments

@andytrigg

andytrigg commented Jul 28, 2020

Apache Airflow version:
1.10.11

Kubernetes version (if you are using kubernetes) (use kubectl version): NA

Environment:

  • Cloud provider or hardware configuration: AWS (EC2 instances)
  • OS (e.g. from /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a):
    Linux airflow-scheduler-10-229-13-220 4.14.165-131.185.amzn2.x86_64 #1 SMP Wed Jan 15 14:19:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

  • Others:

What happened:

When manually invoking a task from the task details dialog, we see the task run for approximately 22 seconds before the following appears in the log...

[2020-07-28 01:25:14,726] {local_task_job.py:150} WARNING - Recorded pid 26940 does not match the current pid 26751
[2020-07-28 01:25:14,728] {helpers.py:325} INFO - Sending Signals.SIGTERM to GPID 26757

The task is then killed. We notice this is accompanied by a second failure shortly afterwards that correlates to the new pid that has been written to the task_instance table.

It is interesting to note that if the task is scheduled as part of a normal DAG run, or by clearing its state and allowing the scheduler to schedule its execution, then we do not experience any issue.

We have attempted to specify task_concurrency on our operators with no effect.

What you expected to happen:
We expected a single process to be spawned for the manually executed task.

How to reproduce it:
Manually invoke a task via the task details dialog where that task's execution is going to take longer than the heartbeat interval that has been set.

The heartbeat check compares the pid recorded for the task instance with the current pid, sees a mismatch, and so kills the task.
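For context, a loosely paraphrased sketch of the check behind that warning (simplified, names illustrative, not the exact Airflow source):

# Simplified paraphrase of the heartbeat-time pid check that produces the
# "Recorded pid ... does not match the current pid ..." warning.
# Names are illustrative; this is not the exact Airflow source.
def heartbeat_callback(task_instance, task_runner, log):
    recorded_pid = task_instance.pid          # pid stored on the task_instance row in the DB
    current_pid = task_runner.process.pid     # pid of the process this job actually started

    if recorded_pid is not None and recorded_pid != current_pid:
        log.warning(
            "Recorded pid %s does not match the current pid %s", recorded_pid, current_pid
        )
        # On a mismatch the job assumes another process owns the task instance and
        # tears down its own runner, which is the SIGTERM to the process group
        # seen in the log above.
        task_runner.terminate()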

Anything else we need to know:

We can reproduce this reliably whenever the task execution time is greater than the heartbeat interval.
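As a rough illustration, a hypothetical DAG along these lines (1.10.x import path; dag_id/task_id made up) is enough to hit this when the task is started from the task details dialog:

# Hypothetical reproduction sketch: a task whose runtime comfortably exceeds
# the configured heartbeat interval, started via "Run" in the task details dialog.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path


def sleep_past_heartbeat():
    # Sleep well past the heartbeat interval so the pid check fires at least once.
    time.sleep(120)


with DAG(
    dag_id="heartbeat_repro_example",   # hypothetical
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,
) as dag:
    PythonOperator(
        task_id="long_running_task",
        python_callable=sleep_past_heartbeat,
    )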

@andytrigg andytrigg added the kind:bug This is a clearly a bug label Jul 28, 2020
@boring-cyborg

boring-cyborg bot commented Jul 28, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@TheWalletInspector

Having the same issue :(

@turbaszek turbaszek added the area:Scheduler including HA (high availability) scheduler label Jul 31, 2020
@tooptoop4
Contributor

still an issue?

@andytrigg
Author

It is still an issue with airflow 1.10.12.

@quinten-goens

quinten-goens commented Feb 18, 2021

I encountered the exact same issue (Airflow 1.10.9).

It is interesting to note that if the task is scheduled as part of a normal DAG run, or by clearing its state and allowing the scheduler to schedule its execution, then we do not experience any issue.

I am trying out this workaround.

EDIT: When rerunning a task by clearing its state and letting the scheduler reschedule it, I do not hit this bug.

@ashaw27

ashaw27 commented Mar 15, 2021

I'm having this issue in 2.0.1. Is there a workaround? My timesensor failed at a seemingly random time, twice, resulting in 4 failures. I only had 3 retries set, so the DAG failed.

[2021-03-15 02:47:28,089] {time_sensor.py:38} INFO - Checking if the time (03:30:00) has come
[2021-03-15 02:48:28,149] {time_sensor.py:38} INFO - Checking if the time (03:30:00) has come
[2021-03-15 02:49:28,209] {time_sensor.py:38} INFO - Checking if the time (03:30:00) has come
[2021-03-15 02:50:21,225] {local_task_job.py:184} WARNING - Recorded pid 211954 does not match the current pid 189553
[2021-03-15 02:50:21,229] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 189554
[2021-03-15 02:50:21,230] {taskinstance.py:1239} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-03-15 02:50:21,241] {taskinstance.py:1455} ERROR - Task received SIGTERM signal
Traceback (most recent call last):
  File "/home/airflowuser/miniconda3/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflowuser/miniconda3/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflowuser/miniconda3/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1315, in _execute_task
    result = task_copy.execute(context=context)
  File "/home/airflowuser/miniconda3/lib/python3.8/site-packages/airflow/sensors/base.py", line 243, in execute
    time.sleep(self._get_next_poke_interval(started_at, run_duration, try_number))
  File "/home/airflowuser/miniconda3/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1241, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2021-03-15 02:50:21,242] {taskinstance.py:1496} INFO - Marking task as FAILED. dag_id=Daily_Ref_Data, task_id=at_0330_ET, execution_date=20210314T050100, start_date=20210315T065020, end_date=20210315T065021

@BenoitHanotte
Contributor

BenoitHanotte commented May 27, 2021

I have started seeing the same issue since migrating to Airflow 2.1 when clearing task states, never had it with Airflow 2.0.1 🤔

INFO 2021-05-27 15:40:20,582 airflow.task.task_runner.standard_task_runner.StandardTaskRunner Started process 17792 to run task
INFO 2021-05-27 15:40:20,587 airflow.task.task_runner.standard_task_runner.StandardTaskRunner Job 51280: Subtask XXXX
INFO 2021-05-27 15:40:28,885 airflow.task Running <TaskInstance: XXX.XXXX 2020-01-12T00:00:00+00:00 [running]> on host airflow-d88cdb6b6-9lg6t
WARNING 2021-05-27 15:40:29,640 airflow.jobs.local_task_job.LocalTaskJob Recorded pid 4304 does not match the current pid 17792
INFO 2021-05-27 15:40:29,642 airflow.task.task_runner.standard_task_runner.StandardTaskRunner Sending Signals.SIGTERM to GPID 17792
INFO 2021-05-27 15:40:29,649 airflow.task.task_runner.standard_task_runner.StandardTaskRunner Process psutil.Process(pid=17792, status='terminated', exitcode=<Negsignal.SIGTERM: -15>, started='15:40:20') (17792) terminated with exit code Negsignal.SIGTERM

It seems to happen when many tasks are to be scheduled, I suspect maybe a DB concurrency issue? Maybe something similar to #15714 ? But I am not sure why it never happened before

@huozhanfeng
Contributor

huozhanfeng commented Jun 15, 2021

It's still an issue with version 2.1.0. Is there anyone who has a workaround to solve it in the airflow 2.x version?

[2021-06-15 16:48:12,916] {local_task_job.py:193} WARNING - Recorded pid 747 does not match the current pid 745
[2021-06-15 16:48:12,920] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 745 
[2021-06-15 16:48:12,937] {taskinstance.py:1264} ERROR - Received SIGTERM. Terminating subprocesses. 
[2021-06-15 16:48:12,937] {logging_mixin.py:104} INFO - Killing the Hive job 
[2021-06-15 16:49:12,940] {process_utils.py:113} WARNING - process psutil.Process(pid=747, name='airflow', status='sleeping', started='16:48:07') did not respond to SIGTERM. Trying SIGKILL 
[2021-06-15 16:49:12,941] {process_utils.py:113} WARNING - process psutil.Process(pid=771, name='java', status='zombie', started='16:48:09') did not respond to SIGTERM. Trying SIGKILL 
[2021-06-15 16:49:12,942] {process_utils.py:113} WARNING - process psutil.Process(pid=745, name='sudo', status='sleeping', started='16:48:07') did not respond to SIGTERM. Trying SIGKILL 
[2021-06-15 16:49:12,985] {process_utils.py:66} INFO - Process psutil.Process(pid=747, name='airflow', status='terminated', started='16:48:07') (747) terminated with exit code None 
[2021-06-15 16:49:12,987] {process_utils.py:66} INFO - Process psutil.Process(pid=771, name='java', status='terminated', started='16:48:09') (771) terminated with exit code None 
[2021-06-15 16:49:12,987] {process_utils.py:66} INFO - Process psutil.Process(pid=745, name='sudo', status='terminated', exitcode=<Negsignal.SIGKILL: -9>, started='16:48:07') (745) terminated with exit code Negsignal.SIGKILL 

@jvictorchen

+1 encountered this issue on 2.1.0. Rolling it back to 2.0.2 works.

@moranlemusj

moranlemusj commented Jul 16, 2021

+1 encountered the same issue on 2.1.2, but mine's worse... the pids reported as non-matching are actually identical

[2021-07-16 16:56:03,719] {{pod_launcher.py:128}} WARNING - Pod not yet started: scraper.2af2f07714944939b31cebb1473710b2
[2021-07-16 16:56:04,236] {{local_task_job.py:194}} WARNING - Recorded pid 9 does not match the current pid 9
[2021-07-16 16:56:04,237] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 9

@andrelop

andrelop commented Jul 27, 2021

Same issue here, running Airflow 2.1.1 :

[2021-07-27 19:21:52,578] {local_task_job.py:195} WARNING - Recorded pid 656 does not match the current pid 76
[2021-07-27 19:21:52,581] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 76
[2021-07-27 19:21:52,588] {process_utils.py:66} INFO - Process psutil.Process(pid=76, status='terminated', exitcode=<Negsignal.SIGTERM: -15>, started='19:21:41') (76) terminated with exit code `Negsignal.SIGTERM`

Also, I tried clearing the task state and the same issue happens.

@yevsh

yevsh commented Aug 2, 2021

Same here:
I placed a Python DAG file in the airflow_dags folder, and the DAG executed twice!

any solution?

[2021-08-02 14:48:07,900] {scheduler_job.py:1210} INFO - Executor reports execution of TEST.submit execution_date=2021-08-02 13:48:05.155505+00:00 exited with status success for try_number 1
[2021-08-02 14:48:07,900] {scheduler_job.py:1210} INFO - Executor reports execution of TEST.submit execution_date=2021-08-02 12:48:05.155505+00:00 exited with status success for try_number 1

It tries to submit the same task twice at the same time, but the execution_date is different!

@phtechwohletz

phtechwohletz commented Aug 14, 2021

I'm encountering the same issue. Any update on this or workaround? Thank you!

@potiuk
Member

potiuk commented Aug 14, 2021

Hmm, I think it is the same as #17411 (the fix is merged and will be released in the upcoming 2.1.3). It's not the same as the original duplicate task, but I believe it will be fixed by migrating to 2.1.3.

Update: After looking more closely, #17411 is just about showing the correct PID (the issue reported by @moranlemusj). So I am no longer sure whether the recent comments from @andrelop are addressed.

@potiuk
Member

potiuk commented Aug 14, 2021

This issue seems to have started from something quite different. The original report refers to the 1.10.* series, which will not be updated, and a few other issues are mentioned along the way (the original was an entirely different issue).

But the issue mentioned later seems to be quite widespread in the 2.1.* series.

[2021-06-15 16:48:12,916] {local_task_job.py:193} WARNING - Recorded pid 747 does not match the current pid 745
[2021-06-15 16:48:12,920] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 745 
[2021-06-15 16:48:12,937] {taskinstance.py:1264} ERROR - Received SIGTERM. Terminating subprocesses. 
[2021-06-15 16:48:12,937] {logging_mixin.py:104} INFO - Killing the Hive job 

Proposal: I believe there were other fixes related to similar problems in 2.1.3, and this seems to be a widespread issue. So @ephraimbuddy @kaxil @ashb (maybe others) - I know you were looking at those -> could you please take a look and say whether this issue has been fixed/looked at recently? We know that at least #17411 should give a proper warning, and then we can close this issue; in case it's not addressed, one of you @yevsh, @phtechwohletz, @andrelop, @jvictorchen, @huozhanfeng, @ashaw27, @BenoitHanotte could open a new issue describing the problem in a bit more detail.

@csherrell

I am having the same issue.
Airflow(2.1.2) / CeleryWorker / BashOperator / run_as_user, running from an Anaconda Environment.

I upgraded from 2.0.1 to 2.1.2. For me, this was not happening in 2.0.1. I know this is related to the run_as_user setting in the DAG. I spent the day converting the DAGs over to run as the airflow user and things are working again. Finally!

The run_as_user argument was nice because all the settings for the user's application were in that account. I would have liked to keep using run_as_user, but it was more important for me to get the data flowing again.

I'm willing to help. It was a lot of work to merge the user's settings into the airflow account, but toggling back should not be too bad. I use conda and have both a 2.0.1 and a 2.1.2 environment. I can roll back the install easily enough, but I did run airflow db upgrade, so I would need guidance on an airflow db downgrade.
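For reference, a minimal sketch of the run_as_user pattern described above (hypothetical names, Airflow 2.x import path); dropping run_as_user, i.e. running as the airflow user, is the workaround applied here:

# Sketch of a BashOperator impersonating an application account via run_as_user,
# the argument implicated in the 2.1.x reports above. Names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="run_as_user_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="app_job",
        bash_command="/opt/app/run_my_app.sh ",  # trailing space keeps Jinja from treating the .sh path as a template
        run_as_user="appuser",                   # removing this (running as the airflow user) avoided the SIGTERMs
    )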

@potiuk
Member

potiuk commented Aug 15, 2021

Yep. It definitely relates to run_as_user

@ephraimbuddy
Contributor

I was able to reproduce a case where deleting a running DAG in the UI does not delete all of its TaskInstances, because they are still being processed by the scheduler/executor; the DAG itself, however, is deleted. When the DAG reappears and is run again, the TaskInstances are seen to have changed state and a SIGTERM is sent to terminate them.

I'm taking a look at fixing this. Also, we fixed a part of run_as_user that will be released in 2.1.3. I'll also check whether I can still reproduce this with run_as_user.

@ephraimbuddy
Contributor

I have started seeing the same issue since migrating to Airflow 2.1 when clearing task states, never had it with Airflow 2.0.1 🤔

INFO 2021-05-27 15:40:20,582 airflow.task.task_runner.standard_task_runner.StandardTaskRunner Started process 17792 to run task
INFO 2021-05-27 15:40:20,587 airflow.task.task_runner.standard_task_runner.StandardTaskRunner Job 51280: Subtask XXXX
INFO 2021-05-27 15:40:28,885 airflow.task Running <TaskInstance: XXX.XXXX 2020-01-12T00:00:00+00:00 [running]> on host airflow-d88cdb6b6-9lg6t
WARNING 2021-05-27 15:40:29,640 airflow.jobs.local_task_job.LocalTaskJob Recorded pid 4304 does not match the current pid 17792
INFO 2021-05-27 15:40:29,642 airflow.task.task_runner.standard_task_runner.StandardTaskRunner Sending Signals.SIGTERM to GPID 17792
INFO 2021-05-27 15:40:29,649 airflow.task.task_runner.standard_task_runner.StandardTaskRunner Process psutil.Process(pid=17792, status='terminated', exitcode=<Negsignal.SIGTERM: -15>, started='15:40:20') (17792) terminated with exit code Negsignal.SIGTERM

It seems to happen when many tasks are to be scheduled, I suspect maybe a DB concurrency issue? Maybe something similar to #15714 ? But I am not sure why it never happened before

I suspect you cleared these tasks while they were still running. I have added a PR for a case I reproduced when deleting running DAGs from the UI: #17630

@noelmcloughlin
Contributor

noelmcloughlin commented Aug 26, 2021

I see "pid not matching" issue in Airflow 2.1.3/CeleryWorker/RMQ when manually triggering DAGS. Was not happening in airflow 2.1.2 (+cherrypick from #16860). I'm not using run_as_user. I cannot replicate the issue when DAG has single task, only with multiple tasks, my testing listed below. I'm have ntp enabled. Issue happens on branching into parallelization of tasks, not sequential

DAG1 (BashOp)
task1 Success

DAG2 (BashOp)
taskgroup1(task1): Recorded pid 14206 does not match the current pid 14207
taskgroup2(task1): Recorded pid 18247 does not match the current pid 18248

DAG3 (PyOP,ShortCircuitOp)
task1 >> task2 >> task3 success
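A rough sketch of the DAG2 shape above (hypothetical names, Airflow 2.x TaskGroup API): two task groups with no dependency between them, so their tasks run in parallel:

# Sketch of the parallel layout that produced the pid mismatch on 2.1.3 above:
# two independent task groups, each with a single BashOperator. Names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="parallel_taskgroups_example",
    start_date=datetime(2021, 8, 1),
    schedule_interval=None,
) as dag:
    with TaskGroup(group_id="taskgroup1"):
        BashOperator(task_id="task1", bash_command="sleep 30")
    with TaskGroup(group_id="taskgroup2"):
        BashOperator(task_id="task1", bash_command="sleep 30")
    # No dependency is set between the two groups, so their tasks are scheduled in parallel.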

@noelmcloughlin
Contributor

noelmcloughlin commented Sep 2, 2021

I downgraded from 2.1.3 back to the 2.1.2 release and the problem is gone for now. See related #17507.

@anitakrueger

anitakrueger commented Sep 15, 2021

We are seeing the same issue on every DAG run on 2.1.3. It makes 2.1.3 basically unusable.
I tried setting job_heartbeat_sec because I saw the tasks being killed after exactly 5 seconds. I set job_heartbeat_sec to 45 seconds and scheduler_heartbeat_sec to 60, and now the jobs get killed after 45 seconds.

Based on that, can anyone think of any workaround that might unblock us?
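For anyone hunting for these knobs, both heartbeat settings mentioned above should live in the [scheduler] section of airflow.cfg; a sketch with the values from this report (not a recommendation):

# airflow.cfg - the two settings referenced above (example values from this report)
[scheduler]
# How often (in seconds) the task's local job heartbeats and runs the pid check
job_heartbeat_sec = 45
# How often (in seconds) the scheduler itself heartbeats
scheduler_heartbeat_sec = 60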

@ephraimbuddy
Contributor

@anitakrueger I have not been able to reproduce this; if you have a method you think I can use to reproduce it, please share. Also, what is your scheduler_interval?

@anitakrueger

@ephraimbuddy We don't have scheduler_interval set in the airflow.cfg, so it is default, right?

I'm honestly not sure how to reproduce it. On 2.1.3, whichever DAG I manually trigger, or which runs on a schedule, gets killed after 5 seconds. If I increase the job_heartbeat_sec setting, it gets killed after that interval instead.
The log output is always the same:

[2021-09-14 13:19:52,184] {{local_task_job.py:203}} WARNING - Recorded pid 16 does not match the current pid 17748
[2021-09-14 13:19:52,186] {{process_utils.py:100}} INFO - Sending Signals.SIGTERM to GPID 17748

I've reverted to 2.1.2 now and can trigger DAGs just fine.

How can we troubleshoot this on 2.1.3?

@ephraimbuddy
Contributor

@anitakrueger I mean the schedule_interval in your DAG args, sorry for the misspelling - but it's of no use anyway, since it works in 2.1.2.

@noelmcloughlin
Contributor

On 2.1.3 I saw that a DAG containing a sequential task flow works fine. The SIGTERM happens if the DAG branches into parallel tasks. I rolled back to 2.1.2.

@MhmtErsy
Contributor

I installed a fresh Airflow 2.1.3 today. I wasn't getting this error while using the SequentialExecutor. It started appearing only in the DAGs that contain PythonVirtualenvOperator tasks after I set CeleryExecutor as the default executor. I had to downgrade to 2.1.2; the error is not appearing now.

@kondla

kondla commented Oct 1, 2021

Thank you for this thread! I am having the "pid does not match" issue running Airflow 2.1.2 on Google Cloud Composer 1.17.1.

[2021-10-01 20:10:27,295] {local_task_job.py:194} WARNING - Recorded pid 13335 does not match the current pid 834
[2021-10-01 20:10:27,304] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 834
[2021-10-01 20:10:27,316] {process_utils.py:66} INFO - Process psutil.Process(pid=834, status='terminated', exitcode=<Negsignal.SIGTERM: -15>, started='20:10:10') (834) terminated with exit code Negsignal.SIGTERM

@chandu-007

We are also facing the same issue using Composer 1.17.3 and Airflow 2.1.2.

@gcristian

gcristian commented Nov 15, 2021

I'm facing the same error on Airflow 2.1.3, which I arrived at while looking for a solution to the zombie-detection mechanism killing trivial tasks at random times, so I cannot try 2.1.2. I have already tried increasing the heartbeat frequency for both workers and the scheduler. These are trivial DAGs with a linear workflow, no parallel tasks - any DAG, with no particular configuration.
I think this validation check could at least be a warning rather than being so sure of itself that it kills the task, blocking any further progress and making Airflow impossible to use.

@gcristian

gcristian commented Nov 17, 2021

For the record, the error is not showing up anymore 🥳. Here is what I did, in case it may be a solution for others (a sketch of the resulting airflow.cfg entries follows this list):

  • increased the scheduler's heartbeat to 1 minute
  • set the worker's heartbeat to 30 seconds
  • removed the default_impersonation attribute (found by git-blaming the source of the warning in local_task_job.py)
  • started from scratch with a new airflow.cfg file, porting over only well-understood and needed parameters from previous versions.
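For illustration, the corresponding airflow.cfg entries would look roughly like this; the exact values, and which knob the "worker heartbeat" maps to, are assumptions (see the question below):

# airflow.cfg - rough sketch of the changes described above (values are assumptions)
[core]
# Leaving this empty disables default impersonation (the setting found by
# git-blaming the warning in local_task_job.py)
default_impersonation =

[scheduler]
scheduler_heartbeat_sec = 60
# The "worker heartbeat" most likely refers to job_heartbeat_sec, but see the question below
job_heartbeat_sec = 30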

@sunny760408

Hi @gcristian,
What is the name of the worker heartbeat setting? Do you mean job_heartbeat_sec?

@jacoDentsu

A new version 3.2 and still no fix...

@potiuk
Member

potiuk commented May 27, 2022

@jacoDentsu - could you please reproduce it on your instance of 2.3.* (we do not have 3.2 yet) and post the logs with it? You seem to still have the problem - that would allow us to confirm whether the issue has been fixed in the meantime. It seems we had problems diagnosing it, so it could have been fixed along the way without us noticing.

Can I count on your help here @jacoDentsu ?

@jacoDentsu

jacoDentsu commented May 27, 2022

@potiuk which logs do you want? The console just gives the same as #10026 (comment)

We still use apache-airflow[cncf.kubernetes,apache.spark]==2.0.2, so the error has to have been introduced somewhere between 2.0 and 2.1

@potiuk
Member

potiuk commented May 27, 2022

@potiuk which logs do you want? The console just gives the same as #10026 (comment)

All the logs that you can find that you've already looked at (and will look at). I guess you have done some investigation already and tried to understand the issue? As you are probably aware, this is not a helpdesk. People here help in their free time for software that you get for free. And we MIGHT be able to help you as long as you show that you've done your part, and as long as you provide enough information to allow people here to solve Your problem. The problem is yours. If the problem were widespread, we would have more similar issues like this one, so it is likely a problem with the way you use Airflow - and the only way we can find that out is you making an effort to let us know what you do, what kind of problems you have, and all the evidence that you can gather on your side to help us help you.

BTW. I have a talk today at Airflow Summit about being an empathetic user - I highly recommend watching it: https://www.crowdcast.io/e/airflowsummit2022/53

We still use apache-airflow[cncf.kubernetes,apache.spark]==2.0.2, so the error has to have been introduced somewhere between 2.0 and 2.1

This is jumping to conclusions. There might be many reasons - for example, you using it wrongly in the first place. But I have no idea, as I do not see the slightest evidence from you about what you are doing.

@potiuk
Member

potiuk commented May 27, 2022

BTW. A log showing SIGTERM means that something killed the task. This might be anything, and in the vast majority of cases it is a problem with a misconfiguration of your deployment.

Just make sure to spend quality time looking at your deployment logs and finding clues about what could be killing the workers - as long as you provide some logs of your deployment (which only you can do) from around the time this problem happens, we MIGHT be able to help you.

But you have a 100% guarantee that if you don't provide any logs, we will not be able to help you. There is simply no way.

@apache apache locked and limited conversation to collaborators Jun 1, 2022
@potiuk potiuk converted this issue into discussion #24069 Jun 1, 2022
