KubernetesExecutor retries can lose failed-attempt hostname and remote logs, leading to http://:8793 No host supplied #65366

@hkc-8010

Description

Apache Airflow version

3.1.8

If "Other Airflow 3 version" selected, which one?

Also reproduced on 3.1.1.

What happened?

We have seen multiple production incidents (details anonymized below) on Airflow 3.1.x with KubernetesExecutor where a task retries and the metadata and remote logs for the failed attempts are not retained correctly.

The visible UI symptom is:

Could not read served logs: Invalid URL 'http://:8793/log/dag_id=.../run_id=.../task_id=.../attempt=3.log': No host supplied

But the more important observation is that the failed attempts already have broken metadata before log retrieval happens:

  • the surviving task_instance row for the final successful retry is normal
  • the corresponding task_instance_history rows for earlier failed attempts have hostname=''
  • those same failed history rows also have start_date = NULL
  • the remote log object for those failed attempts is missing from S3
  • only the final successful attempt log object exists in remote storage

This was first seen on Airflow 3.1.1 and still reproduces on 3.1.8, so it does not look fixed by later 3.1 patch releases.

What you think should happen instead?

When a task retries under KubernetesExecutor, each failed attempt should preserve normal historical metadata, including hostname and start_date, and its remote log object should remain accessible after later retries succeed.

The UI should not end up constructing http://:8793/... for historical attempts.

How to reproduce

I do not yet have a minimal standalone reproducer, but the pattern repeatedly seen in production is:

  1. Run Airflow 3.1.x with KubernetesExecutor
  2. Enable remote S3 logging
  3. Run a DAG where a task fails, retries, and later succeeds
  4. Inspect historical task attempts for that task/run
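For reference, the fail-then-succeed behavior in step 3 can be forced with a task callable along these lines. This is a standalone sketch: the function name and the attempt threshold are made up for illustration, and in a real DAG the function would be the callable of a task configured with `retries >= 2`, with Airflow supplying the try number.

```python
# Standalone sketch of a task that fails on early attempts and
# succeeds later, mimicking the retry pattern in the repro steps.
# In a real DAG this would be a task callable with retries >= 2;
# here try_number is passed in directly for clarity.

def flaky_task(try_number: int, succeed_from: int = 3) -> str:
    """Raise on attempts before `succeed_from`, succeed afterwards."""
    if try_number < succeed_from:
        raise RuntimeError(f"simulated failure on attempt {try_number}")
    return f"succeeded on attempt {try_number}"

if __name__ == "__main__":
    # Walk through attempts the way the executor would retry them.
    for attempt in range(1, 4):
        try:
            print(flaky_task(attempt))
        except RuntimeError as exc:
            print(f"attempt {attempt} failed: {exc}")
```

With this shape, attempts 1 and 2 produce `failed` history rows and attempt 3 produces the surviving successful `task_instance` row, which is the state to inspect in step 4.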

Observed result in affected runs:

  • final task_instance row is normal and points to the last successful attempt
  • earlier task_instance_history rows for failed attempts have blank hostname and null start_date
  • remote log objects for those failed attempts do not exist
  • only the final successful attempt log object exists

Operating System

Linux / Kubernetes

Versions of Apache Airflow Providers

Not yet isolated to a provider-specific issue. The symptom has appeared on normal task retries for dbt-related DAG tasks with remote S3 logging enabled.

Deployment

Other Kubernetes deployment

Deployment details

Environment characteristics shared by affected runs:

  • Executor: KubernetesExecutor
  • Remote logging: enabled to S3
  • Airflow versions: reproduced on 3.1.1 and 3.1.8
  • Multiple scheduler replicas were present in at least one affected environment

This does not look like the Celery/proxy issue in #64263. In these incidents:

  • there is no Celery worker layer involved
  • successful sibling tasks in the same DAG run do persist logs normally
  • final successful retries for the same task also persist logs normally
  • only the earlier failed attempts are missing both hostname metadata and remote log objects

Anything else?

A concrete anonymized example from Airflow 3.1.8:

For one task that eventually succeeded on try_number=8, the database state looked like this:

final task_instance row:
try_number=8, state='success', start_date=2026-04-16 08:13:48+00:00, hostname='10.x.x.x'

task_instance_history rows:
try_number=3, state='failed', start_date=NULL, hostname=''
try_number=6, state='failed', start_date=NULL, hostname=''

Remote logging for that same task/run contained only:

attempt=8.log

and direct checks for attempt=3.log and attempt=6.log returned 404.
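The broken-history pattern above can be confirmed with a simple query against `task_instance_history`. Below is a sketch using an in-memory SQLite stand-in seeded with the two rows from this example; the real check would run the same SELECT against the Airflow metadata database, whose full schema has more columns than this hypothetical subset.

```python
import sqlite3

# Toy stand-in for the relevant task_instance_history columns, seeded
# with the anonymized state observed on 3.1.8 (a hypothetical schema
# subset, not the real Airflow metadata schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance_history "
    "(try_number INTEGER, state TEXT, start_date TEXT, hostname TEXT)"
)
conn.executemany(
    "INSERT INTO task_instance_history VALUES (?, ?, ?, ?)",
    [
        (3, "failed", None, ""),
        (6, "failed", None, ""),
    ],
)

# The check run against the metadata DB: failed attempts whose
# hostname is blank and whose start_date is NULL.
broken = conn.execute(
    "SELECT try_number FROM task_instance_history "
    "WHERE state = 'failed' AND hostname = '' AND start_date IS NULL "
    "ORDER BY try_number"
).fetchall()
print(broken)  # -> [(3,), (6,)]
```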

We saw the same pattern on other tasks in the same DAG run, not just a single task.

The UI error seems to be a downstream consequence of the blank hostname. FileTaskHandler later tries to build the served-log URL from ti.hostname, which explains the final http://:8793/... message, but by that point the task-attempt metadata and remote log object are already missing.
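The hostless URL can be demonstrated in isolation. The sketch below mirrors the shape of the URL in the error message, not Airflow's actual FileTaskHandler code; the helper name and port default are illustrative (8793 is the served-logs port seen in the error).

```python
from urllib.parse import urlsplit

# Simplified sketch of how a served-log URL ends up hostless when
# ti.hostname is blank. Not Airflow's actual URL-building code.
def served_log_url(hostname: str, log_relative_path: str, port: int = 8793) -> str:
    return f"http://{hostname}:{port}/log/{log_relative_path}"

url = served_log_url("", "dag_id=x/run_id=y/task_id=z/attempt=3.log")
print(url)  # http://:8793/log/dag_id=x/run_id=y/task_id=z/attempt=3.log

# The parsed URL has no hostname at all, which is why requests
# rejects it with "Invalid URL ...: No host supplied".
print(urlsplit(url).hostname)
```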

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

area:core, area:logging, kind:bug, priority:high
