High load issues in Airflow DAGs when running on EKS cluster #42784

@sairan-01

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.2

What happened?

Description:

I am encountering significant load spikes when running DAGs in an Airflow setup on an EKS cluster with the following configuration:

EKS Node Type: c5.9xlarge
OS (Architecture): linux (amd64)
OS Image: Amazon Linux 2
Kernel Version: 5.10.225-213.878.amzn2.x86_64
Container Runtime: containerd://1.7.11
Kubelet Version: v1.28.13-eks-a737599
The DAGs are responsible for ingesting data from Kafka to S3 and are triggered both periodically and by orchestration DAGs. However, during their execution, I observe a sharp increase in resource usage, particularly CPU and memory, on the worker nodes.

The configuration for some DAGs includes relatively high max_active_runs, which may contribute to the overload. I have attached a simplified version of the DAGs below.
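For context, the orchestration side looks roughly like the sketch below. The DAG ids, the schedule, and the use of TriggerDagRunOperator are simplified stand-ins for our internal helper classes, not the exact production code:

from datetime import datetime

from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="orchestration_dag",          # placeholder id
    start_date=datetime(2024, 9, 25),
    schedule="*/15 * * * *",             # periodic trigger, every 15 minutes
    catchup=False,
    max_active_runs=10,                  # relatively high, as in our setup
) as dag:
    # Each orchestration run kicks off a Kafka -> S3 ingestion run.
    trigger_ingest = TriggerDagRunOperator(
        task_id="trigger_ingest",
        trigger_dag_id="example_dag",    # the ingestion DAG shown under "Deployment details"
    )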

Steps to Reproduce:

1. Deploy an Airflow instance in an EKS cluster with at least two c5.9xlarge worker nodes.
2. Run the provided DAGs that trigger data ingestion from Kafka to S3.
3. Monitor resource usage on the worker nodes.
Expected Behavior:

The load on the worker nodes should remain stable and manageable, even during the execution of multiple DAGs.

Actual Behavior:

The load on the worker nodes increases significantly, causing performance degradation across the cluster.

Environment:

Kubernetes version: v1.28.13-eks-a737599
Airflow version: 2.5.0
Worker nodes: c5.9xlarge
Container runtime: containerd://1.7.11

What you think should happen instead?

No response

How to reproduce

1. Deploy an Airflow instance in an EKS cluster with at least two c5.9xlarge worker nodes running Amazon Linux 2.
2. Configure Airflow with the DAGs that ingest data from Kafka to S3, using a high max_active_runs (e.g., 10) for the DAGs that trigger orchestration or batch tasks.
3. Trigger the orchestration DAG, which starts the data ingestion processes from Kafka to S3.
4. Monitor CPU and memory usage on the worker nodes during DAG execution.
5. Observe the load spike during execution, particularly when multiple DAGs run in parallel.
Expected behavior: Stable resource usage on the worker nodes without sharp increases in load.

Actual behavior: Significant increase in CPU and memory usage, causing performance degradation on the worker nodes.

Operating System

Ubuntu 22

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

import logging
from datetime import datetime

from CommonDag.L1RawCommonDag import DagRawKafkaS3Msg
from CommonUtil import CommonUtilClass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# DAG configuration passed to the internal Kafka -> S3 ingestion wrapper.
dag_config = CommonUtilClass.DagConfig(
    dag_id="example_dag",
    default_args=CommonUtilClass.DagDefaultArgs(
        owner="airflow_user",
        start_date=datetime(2024, 9, 25),
    ),
    max_active_runs=5,
    tags=["L0", "kafka", "cdc"],
)

dag_object = DagRawKafkaS3Msg(dag_config=dag_config, log=logger)
dag = dag_object.prepare_dag()

Anything else?

I would appreciate any guidance on how to manage these load spikes more effectively, or suggestions for optimizing DAGs in this scenario.
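For reference, the kind of throttling I have been considering (per-DAG concurrency caps, a shared pool, and per-task resource limits) looks roughly like the sketch below. The DAG id, pool name, callable, and resource values are placeholders, and the pod_override only applies when tasks run under the KubernetesExecutor:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s


def ingest_batch():
    # Placeholder for the actual Kafka -> S3 ingestion logic.
    ...


with DAG(
    dag_id="kafka_to_s3_ingest_throttled",   # placeholder id
    start_date=datetime(2024, 9, 25),
    schedule="@hourly",
    catchup=False,
    max_active_runs=1,       # only one run of this DAG at a time
    max_active_tasks=4,      # cap parallel tasks within a run
) as dag:
    ingest = PythonOperator(
        task_id="ingest_batch",
        python_callable=ingest_batch,
        pool="kafka_ingest",   # pool created via UI/CLI to cap ingestion tasks cluster-wide
        executor_config={
            # Bound per-task CPU/memory when running on the KubernetesExecutor.
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            resources=k8s.V1ResourceRequirements(
                                requests={"cpu": "500m", "memory": "1Gi"},
                                limits={"cpu": "1", "memory": "2Gi"},
                            ),
                        )
                    ]
                )
            )
        },
    )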

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
