High load issues in Airflow DAGs when running on EKS cluster #42784

@sairan-01

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.2

What happened?

Description:

I am encountering significant load spikes when running DAGs in an Airflow setup on an EKS cluster with the following configuration:

EKS Node Type: c5.9xlarge
OS (Architecture): linux (amd64)
OS Image: Amazon Linux 2
Kernel Version: 5.10.225-213.878.amzn2.x86_64
Container Runtime: containerd://1.7.11
Kubelet Version: v1.28.13-eks-a737599
The DAGs are responsible for ingesting data from Kafka to S3 and are triggered both periodically and by orchestration DAGs. However, during their execution, I observe a sharp increase in resource usage, particularly CPU and memory, on the worker nodes.

The configuration for some DAGs includes relatively high max_active_runs, which may contribute to the overload. I have attached a simplified version of the DAGs below.
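For context, the orchestration side looks roughly like the sketch below. The DAG ids, the schedule, and the use of TriggerDagRunOperator are simplified stand-ins for our internal helper classes, not the exact production code:

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="orchestration_dag",          # placeholder id
    start_date=datetime(2024, 9, 25),
    schedule="*/15 * * * *",             # periodic trigger, every 15 minutes
    catchup=False,
    max_active_runs=10,                  # relatively high, as in our setup
) as dag:
    # Each orchestration run kicks off a Kafka -> S3 ingestion run.
    trigger_ingest = TriggerDagRunOperator(
        task_id="trigger_ingest",
        trigger_dag_id="example_dag",    # the ingestion DAG shown under "Deployment details"
    )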

Steps to Reproduce:

1. Deploy an Airflow instance in an EKS cluster with at least two c5.9xlarge worker nodes.
2. Run the provided DAGs that trigger data ingestion from Kafka to S3.
3. Monitor resource usage on the worker nodes.
Expected Behavior:

The load on the worker nodes should remain stable and manageable, even during the execution of multiple DAGs.

Actual Behavior:

The load on the worker nodes increases significantly, causing performance degradation across the cluster.

Environment:

Kubernetes version: v1.28.13-eks-a737599
Airflow version: 2.5.0
Worker nodes: c5.9xlarge
Container runtime: containerd://1.7.11

What you think should happen instead?

No response

How to reproduce

1. Deploy an Airflow instance in an EKS cluster with at least two c5.9xlarge worker nodes running Amazon Linux 2.
2. Configure Airflow with the DAGs that ingest data from Kafka to S3, using a high max_active_runs (e.g., 10) for the DAGs that trigger orchestration or batch tasks.
3. Trigger the orchestration DAG, which starts the data ingestion processes from Kafka to S3.
4. Monitor CPU and memory usage on the worker nodes during DAG execution.
5. Observe the load spike during execution, particularly when multiple DAGs run in parallel.
Expected behavior: Stable resource usage on the worker nodes without sharp increases in load.

Actual behavior: Significant increase in CPU and memory usage, causing performance degradation on the worker nodes.

Operating System

Ubuntu 22

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

import logging
from datetime import datetime

from CommonDag.L1RawCommonDag import DagRawKafkaS3Msg
from CommonUtil import CommonUtilClass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# DAG configuration passed to the internal Kafka -> S3 ingestion wrapper.
dag_config = CommonUtilClass.DagConfig(
    dag_id="example_dag",
    default_args=CommonUtilClass.DagDefaultArgs(
        owner="airflow_user",
        start_date=datetime(2024, 9, 25),
    ),
    max_active_runs=5,
    tags=["L0", "kafka", "cdc"],
)

dag_object = DagRawKafkaS3Msg(dag_config=dag_config, log=logger)
dag = dag_object.prepare_dag()

Anything else?

I would appreciate any guidance on how to manage these load spikes more effectively, or suggestions for optimizing DAGs in this scenario.
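For reference, the kind of throttling I have been considering (per-DAG concurrency caps, a shared pool, and per-task resource limits) looks roughly like the sketch below. The DAG id, pool name, callable, and resource values are placeholders, and the pod_override only applies when tasks run under the KubernetesExecutor:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s


def ingest_batch():
    # Placeholder for the actual Kafka -> S3 ingestion logic.
    ...


with DAG(
    dag_id="kafka_to_s3_ingest_throttled",   # placeholder id
    start_date=datetime(2024, 9, 25),
    schedule="@hourly",
    catchup=False,
    max_active_runs=1,       # only one run of this DAG at a time
    max_active_tasks=4,      # cap parallel tasks within a run
) as dag:
    ingest = PythonOperator(
        task_id="ingest_batch",
        python_callable=ingest_batch,
        pool="kafka_ingest",   # pool created via UI/CLI to cap ingestion tasks cluster-wide
        executor_config={
            # Bound per-task CPU/memory when running on the KubernetesExecutor.
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "500m", "memory": "1Gi"},
                                limits={"cpu": "1", "memory": "2Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )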

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
