Duplicate tasks invoked for a single task_id when a task is manually invoked from the task details modal #10026
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
Having the same issue :(
Still an issue?
It is still an issue with Airflow 1.10.12.
I encountered the exact same issue (Airflow 1.10.9) and am trying out this workaround. EDIT: When rerunning a task by clearing its state and letting the scheduler reschedule it, I do not hit this bug.
I'm having this issue in 2.0.1. Is there a workaround? My TimeSensor failed at a seemingly random time, twice, resulting in 4 failures. I only had 3 retries set, so the DAG failed.
I have started seeing the same issue since migrating to Airflow 2.1 when clearing task states; I never had it with Airflow 2.0.1 🤔 It seems to happen when many tasks are to be scheduled, so I suspect a DB concurrency issue, maybe something similar to #15714? But I am not sure why it never happened before.
It's still an issue with version 2.1.0. Does anyone have a workaround for it on the Airflow 2.x series?
+1, encountered this issue on 2.1.0. Rolling back to 2.0.2 works.
+1, encountered the same issue on 2.1.2, but mine's worse... the non-matching PIDs are actually matching.
Same issue here, running Airflow 2.1.1. I also tried clearing the task state, and the same issue happens.
Same here: any solution? [2021-08-02 14:48:07,900] {scheduler_job.py:1210} INFO - Executor reports execution of TEST.submit execution_date=2021-08-02 13:48:05.155505+00:00 exited with status success for try_number 1. It tries to submit the same task twice at the same time, but the execution_date is different!
I'm encountering the same issue. Any update on this or a workaround? Thank you!
Hmm, I think it is the same as #17411 (the fix is merged and will be released in the upcoming 2.1.3). It's not the same as the original duplicate task, but I believe it will be fixed by migrating to 2.1.3. Update: After looking more closely, it's just about showing the correct PID (the issue reported by @moranlemusj). So I am not sure any more whether the recent comments from @andrelop are addressed.
This issue seems to have started from something quite different. The original issue refers to the 1.10.* version, which will not be updated, and there are a few other issues mentioned (and it was an entirely different issue). But the issue mentioned later seems to be quite widespread in the 2.1.* series.
Proposal: I believe there were other fixes related to similar problems in 2.1.3. And this seems to be a widespread issue. So @ephraimbuddy @kaxil @ashb (maybe others) I know you were looking at those -> could you please take a look and tell whether that issue has been fixed/looked at recently? We know that at least #17411 should give a proper warning, and we can close this issue; in case it's not addressed, one of you @yevsh, @phtechwohletz, @andrelop, @jvictorchen, @huozhanfeng, @ashaw27, @BenoitHanotte could open a new issue describing the problem in a bit more detail.
I am having the same issue. I upgraded from 2.0.1 to 2.1.2. For me, this was not happening in 2.0.1. I know this is related to the run_as_user setting in the DAG. I spent the day converting the DAGs over to run as the airflow user, and things are working again. Finally! The run_as_user was nice because all the settings for the user's application were in that account. I would have liked to keep using the run_as_user argument, but it was more important for me to get the data flowing again. I'm willing to help. It was a lot of work to merge the user settings into the airflow account, but toggling back should not be too bad. I use conda and have a 2.0.1 and a 2.1.2 environment. I can roll back the install easily enough, but I did run airflow db upgrade, so I would need guidance on an airflow db downgrade.
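For anyone trying to isolate the run_as_user angle, here is a minimal sketch of the kind of task definition being discussed; the DAG id, task id, and "appuser" account are hypothetical and only illustrate the setting the commenters associate with the problem:

```python
# Hypothetical Airflow 2.x DAG with a task that runs as a non-airflow OS user.
# Removing run_as_user (i.e. running as the airflow user) is the workaround
# described in the comment above.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="run_as_user_repro",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="long_running_as_other_user",
        bash_command="whoami && sleep 60",
        run_as_user="appuser",  # hypothetical OS account owning the application settings
    )
```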
Yep. It definitely relates to run_as_user.
I was able to reproduce a case where deleting a running DAG on the UI will not delete all the TaskInstances, because they are still being processed by the scheduler/executor. However, the DAG is deleted. When the DAG reappears and is run again, the TaskInstances are seen to have changed state and a SIGTERM is sent to terminate them. I'm taking a look at fixing this. Also, we fixed a part of run_as_user that will be released in 2.1.3. I'll also check whether I can still reproduce with run_as_user.
I suspect you cleared these tasks when they were still running. I have added a PR for a case I reproduced when deleting running DAGs from the UI: #17630
I see "pid not matching" issue in Airflow 2.1.3/CeleryWorker/RMQ when manually triggering DAGS. Was not happening in airflow 2.1.2 (+cherrypick from #16860). I'm not using run_as_user. I cannot replicate the issue when DAG has single task, only with multiple tasks, my testing listed below. I'm have ntp enabled. Issue happens on branching into parallelization of tasks, not sequential DAG1 (BashOp) DAG2 (BashOp) DAG3 (PyOP,ShortCircuitOp) |
I downgraded from 2.1.3 back to the 2.1.2 release and the problem is gone for now. See related #17507.
We are seeing the same issue on every DAG run on 2.1.3. It makes 2.1.3 basically unusable. Based on that, can anyone think of any workaround that might unblock us?
@anitakrueger I have not been able to reproduce this; if you have a method you think I can use to reproduce it, please share. Also, what is your scheduler_interval?
@ephraimbuddy We don't have that set. I'm honestly not sure how to reproduce it. On 2.1.3, whichever DAG I manually trigger, or which runs on a schedule, gets killed after 5 seconds. If I increase the job_heartbeat_sec setting, it gets killed after that interval.
I've reverted to 2.1.2 now and can trigger DAGs just fine. How can we troubleshoot this on 2.1.3?
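For reference, a quick way to inspect the effective value of that setting is a short sketch using Airflow's standard configuration API; the section and key names follow the stock airflow.cfg layout:

```python
# Read the scheduler heartbeat setting mentioned above from the running
# configuration; this is the interval that the "killed after N seconds"
# behaviour appeared to track.
from airflow.configuration import conf

print("job_heartbeat_sec =", conf.getint("scheduler", "job_heartbeat_sec"))
```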
@anitakrueger I mean the
On 2.1.3 I saw that a DAG containing a sequential task flow works fine. The signal happens if the DAG branches into parallel tasks. I rolled back to 2.1.2.
I installed a fresh Airflow 2.1.3 today. I wasn't getting this error while I used the SequentialExecutor. It started appearing only in the DAGs that contain PythonVirtualenvOperator tasks, after I set CeleryExecutor as the default executor. I had to downgrade to 2.1.2, and it's not appearing now.
Thank you for this thread! I am having the "PID doesn't match PID" issue running
We are also facing the same issue using
I'm facing the same error on Airflow
For the record, the error is not showing up anymore 🥳. I note what I did in case it may be a solution for others:
Hi @gcristian,
A new version 3.2 and still no fix...
@jacoDentsu - could you please reproduce it in your instance of 2.3.* (we do not have 3.2 yet) and post the logs with it? You seem to still have the problem - that would allow us to confirm whether the issue has been fixed in the meantime. It seems we had problems with diagnosing it - so it could have been fixed in the meantime and we did not notice. Can I count on your help here @jacoDentsu?
@potiuk which logs do you want? The console just gives the same as #10026 (comment). We still use
All the logs that you can find that you've already looked at (and will look at). I guess you did some investigation already and tried to understand the issue? As you are probably aware, this is not a helpdesk. People here help in their free time for software that you get for free. And we MIGHT be able to help you, as long as you show that you've done your part and as long as you provide enough information that will allow people here to solve your problem. The problem is yours. If the problem were widespread, we would have more similar issues like this, so it is likely a problem with the way you use Airflow - and the only way we can find out is you making an effort to let us know what you do, what kind of problems you have, and all the evidence that you can gather on your side to help us help you. BTW, I have a talk today at Airflow Summit about being an empathetic user - highly recommend watching it https://www.crowdcast.io/e/airflowsummit2022/53
This is jumping to conclusions. There might be many reasons - for example, you using it wrongly in the first place. But I have no idea, as I see not the slightest evidence from you of what you are doing.
BTW, a log showing SIGTERM means that something killed the task. This might be anything, and in the vast majority of cases it is a problem with the misconfiguration of your deployment. Just make sure to spend quality time looking at your deployment logs and finding clues about what can be killing the workers. As long as you provide some logs of your deployment (which you are the only person who can do) from around the time this problem happens, we MIGHT be able to help you. But you have a 100% guarantee that if you don't provide any logs, we will not be able to help you. There is simply no way.
This issue was moved to a discussion. You can continue the conversation there.
Apache Airflow version: 1.10.11
Kubernetes version (if you are using kubernetes) (use kubectl version): NA
Environment:
Kernel (e.g. uname -a): Linux airflow-scheduler-10-229-13-220 4.14.165-131.185.amzn2.x86_64 #1 SMP Wed Jan 15 14:19:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:
What happened:
When manually invoking a task from the task details dialog, we see the task running for approximately 22 seconds before the following appears in the log...
The task is then killed. We notice this is accompanied by a second failure shortly afterwards that correlates to the new pid that has been written to the task_instance table. It is interesting to note that if the task is scheduled as part of a normal DAG run, or by clearing state and allowing the scheduler to schedule its execution, then we do not experience any issue.
We have attempted to specify task_concurrency on our operators, with no effect.
What you expected to happen:
We expected a single process to be spawned for the manually executed task.
How to reproduce it:
Manually invoke a task via the task details dialog where that task's execution is going to take longer than the heartbeat interval that has been set.
The heartbeat check compares the pid, sees a mismatch, and so kills the task.
Anything else we need to know:
We can reproduce this reliably if the task execution time is greater than the heartbeat interval.
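A minimal sketch of this reproduction, assuming a hypothetical Airflow 1.10-style DAG whose single task deliberately sleeps well past the configured heartbeat interval; the task would then be triggered by hand from the task details dialog:

```python
# Hypothetical DAG for reproducing the report on Airflow 1.10.x: the task
# sleeps longer than the heartbeat interval, giving the PID check described
# above time to see a mismatch and kill the manually started process.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="heartbeat_pid_mismatch_repro",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
) as dag:
    BashOperator(
        task_id="sleep_past_heartbeat",
        bash_command="sleep 120",  # longer than the heartbeat interval
    )
```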