
State of this instance has been externally set to removed. Taking the poison pill. #8087

Closed
K-7 opened this issue Apr 3, 2020 · 15 comments
Labels
area:Scheduler kind:bug pending-response

Comments


K-7 commented Apr 3, 2020

Apache Airflow version: 1.10.2

Environment: Linux EC2 Machine

Cloud provider or hardware configuration: AWS
What happened:
Airflow tasks are killed by a SIGTERM signal.

What you expected to happen:
A task triggered from Airflow via ECSOperator should work smoothly when the AWS policies are correctly attached to the role.

How to reproduce it:
When I run ECS tasks from an Airflow DAG using ECSOperator, the tasks are first triggered and a 200 response is received, which I can see in the logs. But the very next log message is:

State of this instance has been externally set to removed. Taking the poison pill

This kills the ECS task, and a response of 'desiredStatus': 'STOPPED' is returned. The log messages do not clearly say why the task was killed.

Under airflow.cfg the configurations are as follows:
parallelism = 32
dag_concurrency = 16
dags_are_paused_at_creation = True
max_active_runs_per_dag = 16
non_pooled_task_slot_count = 128

Kindly improve the log messages so that we can understand the root cause of the issue.
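
For context, here is a minimal sketch of the kind of DAG described above, using the 1.10.x contrib import path. The DAG id, task definition, cluster, region, and overrides are placeholders, not the actual values from this report:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.ecs_operator import ECSOperator  # 1.10.x contrib path

with DAG(
    dag_id="ecs_poison_pill_repro",      # hypothetical DAG id
    start_date=datetime(2020, 4, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_ecs_task = ECSOperator(
        task_id="run_ecs_task",
        task_definition="my-task-definition",  # placeholder
        cluster="my-ecs-cluster",              # placeholder
        overrides={"containerOverrides": []},  # no container overrides in this sketch
        launch_type="FARGATE",                 # assumption; the launch type was not stated
        region_name="us-east-1",               # placeholder region
        aws_conn_id="aws_default",
    )
```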

K-7 added the kind:bug label Apr 3, 2020

boring-cyborg bot commented Apr 3, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

turbaszek added the area:Scheduler label Apr 3, 2020

heowc commented Nov 19, 2020

I had the same problem. 🤔


albydeca commented Dec 3, 2020

Any update on this?

kaxil (Member) commented Apr 20, 2021

@K-7 Can you test this with Airflow 2.0.2 and see if you still see this error? If so, could you also provide steps to reproduce?

@jinzishuai

We are seeing this from time to time on Astronomer Certified 1.10.10.post4, based on Apache Airflow 1.10.10.

@deepakjindal90

I am using 1.10.12 and facing the same issue, because of which my Glue job is called twice. The logs below highlight where the Glue job is called two times.

Scenario: I have used one DummyOperator, and on success of that task I call the Glue job using the 'on_success_callback' callback function.

Note: This happens intermittently, not every time.

[2021-05-15 03:30:10,863] {{taskinstance.py:901}} INFO - Executing <Task(DummyOperator): insert_run_control> on 2021-05-14T03:30:00+00:00
[2021-05-15 03:30:10,926] {{standard_task_runner.py:54}} INFO - Started process 6044 to run task
[2021-05-15 03:30:10,943] {{standard_task_runner.py:77}} INFO - Running: ['airflow', 'run', 'dev.udh.loyalty.mk_premier_qual_rule', 'insert_run_control', '2021-05-14T03:30:00+00:00', '--job_id', '8489', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/commercial/loyalty/mk_premier_qual_rule.py', '--cfg_path', '/tmp/tmptc3np1i2']
[2021-05-15 03:30:11,063] {{standard_task_runner.py:78}} INFO - Job 8489: Subtask insert_run_control
[2021-05-15 03:30:11,335] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: dev.udh.loyalty.mk_premier_qual_rule.insert_run_control 2021-05-14T03:30:00+00:00 [running]> ip-10-36-70-203.ec2.internal
[2021-05-15 03:30:11,524] {{logging_mixin.py:112}} INFO - Task insert_run_control success :
[2021-05-15 03:30:11,585] {{logging_mixin.py:112}} INFO - Updating audit operator infor table...
[2021-05-15 03:30:11,654] {{glue.py:119}} INFO - Initializing AWS Glue Job: cbs-udh-audit-runner
[2021-05-15 03:30:30,330] {{glue.py:126}} INFO - AWS Glue Job: cbs-udh-audit-runner status: SUCCEEDED. Run Id: jr_61870a034ff896de378199a6148aef24943beb53c42bfbf9f8ecc835f11db975
[2021-05-15 03:30:30,427] {{taskinstance.py:1070}} INFO - Marking task as SUCCESS.dag_id=dev.udh.loyalty.mk_premier_qual_rule, task_id=insert_run_control, execution_date=20210514T033000, start_date=20210515T033010, end_date=20210515T033030
[2021-05-15 03:30:30,512] {{logging_mixin.py:112}} INFO - [2021-05-15 03:30:30,511] {{local_task_job.py:159}} WARNING - State of this instance has been externally set to success. Taking the poison pill.
[2021-05-15 03:30:30,619] {{logging_mixin.py:112}} INFO - Task insert_run_control success :
[2021-05-15 03:30:30,680] {{logging_mixin.py:112}} INFO - Updating audit operator infor table...
[2021-05-15 03:30:30,760] {{glue.py:119}} INFO - Initializing AWS Glue Job: cbs-udh-audit-runner
[2021-05-15 03:30:49,408] {{glue.py:126}} INFO - AWS Glue Job: cbs-udh-audit-runner status: SUCCEEDED. Run Id: jr_97de1ed79f3cee852c33bec0cae589c23d5cdb61f3c7262db525acd8f038e3ac
[2021-05-15 03:30:49,495] {{helpers.py:325}} INFO - Sending Signals.SIGTERM to GPID 6044
[2021-05-15 03:30:49,544] {{helpers.py:291}} INFO - Process psutil.Process(pid=6044, status='terminated', exitcode=0, started='03:30:10') (6044) terminated with exit code 0
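
For reference, a minimal sketch of the callback pattern described above, using boto3 directly rather than the Glue hook that appears in the log. The DAG id, region, and wiring are placeholders; the actual code was not shared:

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def trigger_audit_glue_job(context):
    # Success callback: start the audit Glue job once the task has succeeded.
    glue = boto3.client("glue", region_name="us-east-1")  # placeholder region
    glue.start_job_run(JobName="cbs-udh-audit-runner")


with DAG(
    dag_id="glue_callback_sketch",  # hypothetical DAG id
    start_date=datetime(2021, 5, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    insert_run_control = DummyOperator(
        task_id="insert_run_control",
        on_success_callback=trigger_audit_glue_job,
    )
```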

@psmukherjee009

I am having the same issue. At least there is no poison pill message here:
[2021-06-16 18:42:09,150] {local_task_job.py:196} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2021-06-16 18:42:09,168] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 15930
[2021-06-16 18:42:09,169] {taskinstance.py:1264} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-06-16 18:42:09,169] {subprocess.py:89} INFO - Sending SIGTERM signal to process group


jvaca92Code commented Jun 28, 2021

#16637 seems to be the same issue.

@mikekenneth

Hi all,
I am having the same issue.
Is there any update on this?
Thanks.

potiuk (Member) commented Jul 10, 2021

Airflow 1.10 reached end-of-life on June 17th, 2021 and will not receive any further updates. Please upgrade to Airflow 2 at your earliest convenience. Note that there are already security fixes in the Airflow 2+ series that are not fixed in 1.10, so you are putting yourself at high risk by not upgrading. You can find out more about it by watching the recording of the discussion panel we just had at the Airflow Summit 2021: https://www.crowdcast.io/e/airflowsummit2021/3

potiuk closed this as completed Jul 10, 2021
@deepakjindal90

We have tried AWS + Airflow 2.0, but Airflow 2.0 has multiple issues.

Main issues in Airflow 2.0:

  1. Not able to re-trigger a DAG after a couple of hours. This is a core feature of Airflow and it is not working properly in Airflow 2.0; the Airflow product team is also aware of this issue. If we re-trigger a DAG within 1-2 hours it gets executed, but if we try to re-trigger it after a couple of hours it gets stuck.
  2. This "SIGTERM signal to process group" issue is very frequent in Airflow 2.0.

potiuk (Member) commented Jul 10, 2021

Airflow is now at 2.1.1, and 2.1.2 will be released next week. A number of stability improvements have already been released in the 2.1 line.

If the limitation is that MWAA only supports 2.0.0, then you should raise an issue there. Likewise, if you have an SLA with a managed service to support 1.10 for longer, then you should raise the issue with that managed service; maybe they can diagnose and fix it. The 1.10 line is no longer supported here, sorry - that's the policy we introduced more than a year ago.

K-7 (Author) commented Jul 10, 2021

Sorry, this is a very old ticket that I raised. Since I did not get any response from Airflow back in those days, I had forgotten about it. I don't remember exactly how I fixed this issue, but one thing I am very sure of is that Airflow needs to do a better job of error handling. The error messages don't clearly indicate the root cause of the problem.

Recently, when I was trying the latest version of Airflow, I found that capacityProviderStrategy is still not supported in ECSOperator, but the error message was irrelevant. I had to go through the entire code to understand that capacityProviderStrategy is still not supported, and I had to handle this by overriding ECSOperator and writing my own custom version of it for auto scaling of my instances.
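
For anyone hitting the same gap, here is a rough sketch of that kind of custom operator, calling boto3's run_task directly with a capacityProviderStrategy rather than going through the provider's internals. The class name, cluster, task definition, and capacity provider names are placeholders, not the actual custom operator:

```python
import boto3
from airflow.models import BaseOperator


class ECSCapacityProviderOperator(BaseOperator):
    """Run an ECS task with an explicit capacityProviderStrategy."""

    def __init__(self, cluster, task_definition, capacity_provider,
                 region_name="us-east-1", **kwargs):
        super().__init__(**kwargs)
        self.cluster = cluster
        self.task_definition = task_definition
        self.capacity_provider = capacity_provider
        self.region_name = region_name

    def execute(self, context):
        client = boto3.client("ecs", region_name=self.region_name)
        # launchType must be omitted when capacityProviderStrategy is supplied.
        response = client.run_task(
            cluster=self.cluster,
            taskDefinition=self.task_definition,
            capacityProviderStrategy=[
                {"capacityProvider": self.capacity_provider, "weight": 1}
            ],
        )
        if response.get("failures"):
            raise RuntimeError("ECS run_task failed: %s" % response["failures"])
        return response["tasks"][0]["taskArn"]
```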

Anyway, this ticket can be closed since I am not able to recall the exact steps to reproduce this issue.

potiuk (Member) commented Jul 10, 2021

Hey @K-7 - Airflow is a community-managed product that you get for free. If you are aware of such an issue, absolutely the best thing you can do is raise a pull request fixing it! This is the best way to thank the community for all the work they do so that you can use the software for free.

Many of our users do that, and just yesterday we ran a workshop for first-time contributors to help them learn how to do it.

Fixing the error message seems to be an easy task. Would you be willing to fix it, @K-7? That way you could help others get better diagnostics.

K-7 (Author) commented Jul 11, 2021

Yup, I do that. I remember I raised a PR for supporting some AWS operator, but I'm not sure whether that PR got merged or what happened with it.
