
State of this instance has been externally set to removed. Taking the poison pill. #8087

Closed
K-7 opened this issue Apr 3, 2020 · 15 comments
Labels
area:Scheduler kind:bug pending-response

Comments


K-7 commented Apr 3, 2020

Apache Airflow version: 1.10.2

Environment: Linux EC2 Machine

Cloud provider or hardware configuration: AWS
What happened:
Airflow tasks are killed by a SIGTERM signal.

What you expected to happen:
A task triggered from Airflow via ECSOperator should work smoothly when the AWS policies are correctly attached to the role.

How to reproduce it:
When I run ECS tasks from an Airflow DAG using ECSOperator, the tasks are first triggered and a 200 response is received, which I can see in the logs. But the very next log message is:

State of this instance has been externally set to removed. Taking the poison pill

This kills the ECS task, and a response of 'desiredStatus': 'STOPPED' is returned. The log messages do not clearly say why the task was killed.

Under airflow.cfg the configurations are as follows:
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = True
max_active_runs_per_dag = 16
non_pooled_task_slot_count = 128

Kindly improve the log messages so that we can understand the root cause of the issue.
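
For context, here is a minimal sketch of the kind of DAG described above, using the 1.10.x contrib import path. The DAG id, task definition, cluster, region, and overrides are placeholders, not the actual values from this report:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator  # 1.10.x contrib path

with DAG(
    dag_id="ecs_poison_pill_repro",      # hypothetical DAG id
    start_date=datetime(2020, 4, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_ecs_task = ECSOperator(
        task_id="run_ecs_task",
        task_definition="my-task-definition",  # placeholder
        cluster="my-ecs-cluster",              # placeholder
        overrides={"containerOverrides": []},  # no container overrides in this sketch
        launch_type="FARGATE",                 # assumption; the launch type was not stated
        region_name="us-east-1",               # placeholder region
        aws_conn_id="aws_default",
    )
```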

K-7 added the kind:bug label Apr 3, 2020

boring-cyborg bot commented Apr 3, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

turbaszek added the area:Scheduler label Apr 3, 2020

heowc commented Nov 19, 2020

I had the same problem. 🤔


albydeca commented Dec 3, 2020

Any update on this?

kaxil (Member) commented Apr 20, 2021

@K-7 Can you test this with Airflow 2.0.2 and see if you still see this error? If so, could you also provide steps to reproduce?

@jinzishuai

We are seeing this from time to time on Astronomer Certified 1.10.10.post4, based on Apache Airflow 1.10.10.

@deepakjindal90

I am using 1.10.12 and facing the same issue, because of which my Glue job is called twice. The logs below highlight where the Glue job is called two times.

Scenario: I have used one DummyOperator, and on success of that task I call the Glue job using the 'on_success_callback' callback function.

Note: This happens intermittently, not every time.

[2021-05-15 03:30:10,863] {{taskinstance.py:901}} INFO - Executing <Task(DummyOperator): insert_run_control> on 2021-05-14T03:30:00+00:00
[2021-05-15 03:30:10,926] {{standard_task_runner.py:54}} INFO - Started process 6044 to run task
[2021-05-15 03:30:10,943] {{standard_task_runner.py:77}} INFO - Running: ['airflow', 'run', 'dev.udh.loyalty.mk_premier_qual_rule', 'insert_run_control', '2021-05-14T03:30:00+00:00', '--job_id', '8489', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/commercial/loyalty/mk_premier_qual_rule.py', '--cfg_path', '/tmp/tmptc3np1i2']
[2021-05-15 03:30:11,063] {{standard_task_runner.py:78}} INFO - Job 8489: Subtask insert_run_control
[2021-05-15 03:30:11,335] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: dev.udh.loyalty.mk_premier_qual_rule.insert_run_control 2021-05-14T03:30:00+00:00 [running]> ip-10-36-70-203.ec2.internal
[2021-05-15 03:30:11,524] {{logging_mixin.py:112}} INFO - Task insert_run_control success :
[2021-05-15 03:30:11,585] {{logging_mixin.py:112}} INFO - Updating audit operator infor table...
[2021-05-15 03:30:11,654] {{glue.py:119}} INFO - Initializing AWS Glue Job: cbs-udh-audit-runner
[2021-05-15 03:30:30,330] {{glue.py:126}} INFO - AWS Glue Job: cbs-udh-audit-runner status: SUCCEEDED. Run Id: jr_61870a034ff896de378199a6148aef24943beb53c42bfbf9f8ecc835f11db975
[2021-05-15 03:30:30,427] {{taskinstance.py:1070}} INFO - Marking task as SUCCESS.dag_id=dev.udh.loyalty.mk_premier_qual_rule, task_id=insert_run_control, execution_date=20210514T033000, start_date=20210515T033010, end_date=20210515T033030
[2021-05-15 03:30:30,512] {{logging_mixin.py:112}} INFO - [2021-05-15 03:30:30,511] {{local_task_job.py:159}} WARNING - State of this instance has been externally set to success. Taking the poison pill.
[2021-05-15 03:30:30,619] {{logging_mixin.py:112}} INFO - Task insert_run_control success :
[2021-05-15 03:30:30,680] {{logging_mixin.py:112}} INFO - Updating audit operator infor table...
[2021-05-15 03:30:30,760] {{glue.py:119}} INFO - Initializing AWS Glue Job: cbs-udh-audit-runner
[2021-05-15 03:30:49,408] {{glue.py:126}} INFO - AWS Glue Job: cbs-udh-audit-runner status: SUCCEEDED. Run Id: jr_97de1ed79f3cee852c33bec0cae589c23d5cdb61f3c7262db525acd8f038e3ac
[2021-05-15 03:30:49,495] {{helpers.py:325}} INFO - Sending Signals.SIGTERM to GPID 6044
[2021-05-15 03:30:49,544] {{helpers.py:291}} INFO - Process psutil.Process(pid=6044, status='terminated', exitcode=0, started='03:30:10') (6044) terminated with exit code 0
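
For reference, a minimal sketch of the callback pattern described above, using boto3 directly rather than the Glue hook that appears in the log. The DAG id, region, and wiring are placeholders; the actual code was not shared:

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def trigger_audit_glue_job(context):
    # Success callback: start the audit Glue job once the task has succeeded.
    glue = boto3.client("glue", region_name="us-east-1")  # placeholder region
    glue.start_job_run(JobName="cbs-udh-audit-runner")


with DAG(
    dag_id="glue_callback_sketch",  # hypothetical DAG id
    start_date=datetime(2021, 5, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    insert_run_control = DummyOperator(
        task_id="insert_run_control",
        on_success_callback=trigger_audit_glue_job,
    )
```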

@psmukherjee009

I am having the same issue. At least there is no poison pill message here:
[2021-06-16 18:42:09,150] {local_task_job.py:196} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2021-06-16 18:42:09,168] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 15930
[2021-06-16 18:42:09,169] {taskinstance.py:1264} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-06-16 18:42:09,169] {subprocess.py:89} INFO - Sending SIGTERM signal to process group


jvaca92Code commented Jun 28, 2021

#16637 seems to be the same issue.

@mikekenneth

Hi all,
I am having the same issue.
Is there any update on this?
Thanks.

potiuk (Member) commented Jul 10, 2021

Airflow 1.10 reached end-of-life on June 17th, 2021 and will not receive any further updates. Please upgrade to Airflow 2 at your earliest convenience. Note that there are already security fixes in the Airflow 2+ series that are not fixed in 1.10, so you are putting yourself at high risk by not upgrading. You can find out more about it by watching the recording of the discussion panel we just had at the Airflow Summit 2021: https://www.crowdcast.io/e/airflowsummit2021/3

potiuk closed this as completed Jul 10, 2021
@deepakjindal90

We have tried AWS + Airflow 2.0, but Airflow 2.0 has multiple issues.

Main issues in Airflow 2.0:

  1. Not able to re-trigger a DAG after a couple of hours. This is a core feature of Airflow and it is not working properly in Airflow 2.0; the Airflow product team is also aware of this issue. If we re-trigger a DAG within 1-2 hours it gets executed, but if we try to re-trigger it after a couple of hours it gets stuck.
  2. This "SIGTERM signal to process group" issue is very frequent in Airflow 2.0.

potiuk (Member) commented Jul 10, 2021

Airflow is now at 2.1.1, and 2.1.2 will be released next week. A number of stability improvements have already been released in the 2.1 line.

If the limitation is that MWAA only supports 2.0.0, then you should raise an issue there. Likewise, if you have an SLA with a managed service to support 1.10 for longer, then you should raise the issue with that managed service; maybe they can diagnose and fix it. The 1.10 line is no longer supported here, sorry - that's the policy we introduced more than a year ago.

K-7 (Author) commented Jul 10, 2021

Sorry, this is a very old ticket that I raised. Since I did not get any response from Airflow back in those days, I had forgotten about it. I don't remember exactly how I fixed this issue, but one thing I am very sure of is that Airflow needs to do a better job of error handling. The error messages don't clearly indicate the root cause of the problem.

Recently, when I was trying the latest version of Airflow, I found that capacityProviderStrategy is still not supported in ECSOperator, but the error message was irrelevant. I had to go through the entire code to understand that capacityProviderStrategy is still not supported, and I had to handle this by overriding ECSOperator and writing my own custom version of it for auto scaling of my instances.
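
For anyone hitting the same gap, here is a rough sketch of that kind of custom operator, calling boto3's run_task directly with a capacityProviderStrategy rather than going through the provider's internals. The class name, cluster, task definition, and capacity provider names are placeholders, not the actual custom operator:

```python
import boto3
from airflow.models import BaseOperator


class ECSCapacityProviderOperator(BaseOperator):
    """Run an ECS task with an explicit capacityProviderStrategy."""

    def __init__(self, cluster, task_definition, capacity_provider,
                 region_name="us-east-1", **kwargs):
        super().__init__(**kwargs)
        self.cluster = cluster
        self.task_definition = task_definition
        self.capacity_provider = capacity_provider
        self.region_name = region_name

    def execute(self, context):
        client = boto3.client("ecs", region_name=self.region_name)
        # launchType must be omitted when capacityProviderStrategy is supplied.
        response = client.run_task(
            cluster=self.cluster,
            taskDefinition=self.task_definition,
            capacityProviderStrategy=[
                {"capacityProvider": self.capacity_provider, "weight": 1}
            ],
        )
        if response.get("failures"):
            raise RuntimeError("ECS run_task failed: %s" % response["failures"])
        return response["tasks"][0]["taskArn"]
```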

Anyway, this ticket can be closed since I am not able to recall the exact steps to reproduce this issue.

potiuk (Member) commented Jul 10, 2021

Hey @K-7 - Airflow is a community-managed product that you get for free. If you are aware of such an issue, absolutely the best thing you can do is raise a pull request fixing it! This is the best way to thank the community for all the work they do so that you can use the software for free.

Many of our users do that, and just yesterday we ran a workshop for first-time contributors to help them learn how to do it.

Fixing the error message seems to be an easy task. Would you be willing to fix it, @K-7? That way you could help others get better diagnostics.

K-7 (Author) commented Jul 11, 2021

Yup, I do that. I remember I raised a PR for supporting some AWS operator, but I'm not sure whether that PR got merged or what happened with it.
