
Scheduler going down for 1-2 minutes every 10 minutes as completed pods increase in EKS #22612

Open
dviru opened this issue Mar 30, 2022 · 5 comments

Labels
affected_version:2.2, affected_version:2.3, area:core, area:performance, area:Scheduler, kind:bug, provider:cncf-kubernetes

Comments


dviru commented Mar 30, 2022

Apache Airflow version

2.2.4 (latest released)

What happened

Hi team, I am using Airflow 2.2.4 deployed on an AWS EKS cluster. I noticed that every 5-10 minutes the Airflow UI shows a "scheduler down" message. When I checked the scheduler log, I saw a lot of statements like the one below.

[2022-03-21 08:21:21,640] {kubernetes_executor.py:729} INFO - Attempting to adopt pod sampletask.05b6f567b4a64bd5beb16e526ba94d7a

The statement above is printed for every completed pod that still exists in EKS; it repeats many times and also invokes the PATCH API for each pod.

As far as I understand, the code below pulls the details of all completed pods from the EKS cluster on every pass and invokes the PATCH API on each completed pod. For 1,000 completed pods this activity finishes in about 1 minute, but for 7,000 completed pods it takes 3-5 minutes, and that is why the scheduler is reported as down.

(screenshot of the scheduler/executor code that adopts completed pods)
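
For context, here is a minimal sketch of what such an adoption pass looks like when written directly against the Kubernetes Python client. This is not the actual Airflow implementation; the namespace, label selector, and job id below are illustrative assumptions.

```python
# Sketch of an "adopt completed pods" pass (illustrative, not Airflow's code).
# It shows why the cost grows linearly with the number of completed pods
# left in the cluster when delete_worker_pods = False.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "airflow"            # assumption: worker pods run in this namespace
SCHEDULER_JOB_ID = "123"         # assumption: id of the currently running scheduler job

# 1) List every completed worker pod still present in the cluster.
completed = v1.list_namespaced_pod(
    namespace=NAMESPACE,
    field_selector="status.phase=Succeeded",
    label_selector="kubernetes_executor=True",  # illustrative selector
)

# 2) PATCH each pod so its labels point at the current scheduler job.
#    With 7,000+ completed pods this means 7,000+ sequential PATCH calls
#    before the scheduler gets back to heartbeating.
for pod in completed.items:
    pod.metadata.labels = {**(pod.metadata.labels or {}), "airflow-worker": SCHEDULER_JOB_ID}
    v1.patch_namespaced_pod(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body=pod,
    )
```

Because this list-and-patch pass touches every completed pod left in the cluster, its runtime grows roughly linearly with the pod count, which matches the 1 minute for 1,000 pods versus 3-5 minutes for 7,000 pods reported above.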

What you think should happen instead

The scheduler stays healthy when we set delete_worker_pods = True, but when delete_worker_pods = False is set and the completed pod count grows to 7,000-10,000, the scheduler goes down.

The scheduler should be healthy irrespective of how many completed pods exist in the EKS cluster.

How to reproduce

Deploy Airflow in a Kubernetes cluster and set delete_worker_pods = False. Once the completed pod count reaches 7,000 to 10,000, you will be able to see this issue.
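
For reference, a quick way to confirm the active value from inside the scheduler container (a sketch; on Airflow 2.2/2.3 the option lives in the [kubernetes] section, which newer releases renamed to [kubernetes_executor]):

```python
# Sketch: check whether completed worker pods are kept around.
from airflow.configuration import conf

print(conf.getboolean("kubernetes", "delete_worker_pods"))
# The option can be set in airflow.cfg, or via the environment variable
# AIRFLOW__KUBERNETES__DELETE_WORKER_PODS=False on the scheduler pod.
```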

Operating System

Debian GNU/Linux 10

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

dviru added the area:core and kind:bug labels on Mar 30, 2022

boring-cyborg bot commented Mar 30, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk (Member) commented Mar 30, 2022

cc: @dstandish -> what we talked about :)

eladkal added the area:Scheduler, provider:cncf-kubernetes, and affected_version:2.2 labels on Jul 4, 2022

github-actions bot commented Jul 5, 2023

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.


github-actions bot commented Aug 4, 2023

This issue has been closed because it has not received a response from the issue author.

github-actions bot closed this as not planned on Aug 4, 2023
dirrao (Collaborator) commented May 2, 2024

We are seeing this issue in Airflow version 2.3.3. Based on the latest code, I strongly believe the issue is present in the latest Airflow version, 2.9.1, as well; I don't see any improvements in watcher performance between 2.3.3 and 2.9.1.
The primary reason for this issue is that the Kubernetes pod watcher is not fast enough to cope with the rate of Kubernetes events. This leads to watcher failure/restart, after which adopt_completed_pods takes over the completed pods. adopt_completed_pods can take a couple of minutes, which delays the scheduler heartbeat, then causes scheduler liveness failures, and finally a scheduler pod restart.
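
A minimal sketch of the watch-loop pattern described above, written against the Kubernetes Python client (this is not Airflow's actual KubernetesJobWatcher; the namespace, label selector, and handler are illustrative assumptions):

```python
# Sketch of a pod watch loop (illustrative, not Airflow's KubernetesJobWatcher).
# If the consumer cannot keep up with the event rate, the stream breaks and the
# executor has to re-list and re-adopt all completed pods after each restart.
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "airflow"                         # assumption
LABEL_SELECTOR = "kubernetes_executor=True"   # illustrative selector


def handle_pod_event(event_type: str, pod) -> None:
    """Hypothetical handler; the real watcher forwards state changes to the executor."""
    print(event_type, pod.metadata.name, pod.status.phase)


w = watch.Watch()
while True:
    try:
        # Resource-version tracking is omitted; losing it (e.g. a 410 Gone from
        # the API server) is exactly what forces a full re-list/adoption pass.
        for event in w.stream(
            v1.list_namespaced_pod,
            namespace=NAMESPACE,
            label_selector=LABEL_SELECTOR,
            timeout_seconds=60,
        ):
            # If handling events is slower than they arrive (thousands of
            # completed pods churning), the watcher falls behind and restarts.
            handle_pod_event(event["type"], event["object"])
    except Exception:
        continue  # watcher restarts; the expensive adoption pass runs again
```

Keeping the number of completed pods bounded (delete_worker_pods = True) is what keeps both the watcher and the adoption pass fast, which matches the behavior described in the original report.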
