Apache Airflow version
3.1.8
If "Other Airflow 3 version" selected, which one?
Also reproduced on 3.1.1.
What happened?
We have multiple anonymized production incidents on Airflow 3.1.x with `KubernetesExecutor` where a task retries and the failed attempts are not retained correctly.
The visible UI symptom is:
`Could not read served logs: Invalid URL 'http://:8793/log/dag_id=.../run_id=.../task_id=.../attempt=3.log': No host supplied`
But the more important observation is that the failed attempts already have broken metadata before log retrieval happens:
- the surviving `task_instance` row for the final successful retry is normal
- the corresponding `task_instance_history` rows for earlier failed attempts have `hostname=''`
- those same failed history rows also have `start_date = NULL`
- the remote log object for those failed attempts is missing from S3
- only the final successful attempt log object exists in remote storage
This was first seen on Airflow 3.1.1 and still reproduces on 3.1.8, so it does not look fixed by later 3.1 patch releases.
What you think should happen instead?
When a task retries under `KubernetesExecutor`, each failed attempt should preserve normal historical metadata, including `hostname` and `start_date`, and its remote log object should remain accessible after later retries succeed.
The UI should not end up constructing `http://:8793/...` for historical attempts.
How to reproduce
I do not yet have a minimal standalone reproducer, but the repeated field pattern is:
- Run Airflow 3.1.x with `KubernetesExecutor`
- Enable remote S3 logging
- Run a DAG where a task fails, retries, and later succeeds
- Inspect historical task attempts for that task/run
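The fail-then-retry pattern above can be sketched with a plain callable; the Airflow wiring (operator name, retry settings) is indicated only in comments, since this is a hypothetical sketch rather than the exact production DAG:

```python
# Hypothetical sketch of the reproduce pattern: a task callable that
# fails on early attempts and succeeds once enough retries have run.
def flaky(try_number: int) -> str:
    """Raise on attempts 1-2, succeed from attempt 3 onward."""
    if try_number < 3:
        raise RuntimeError(f"simulated failure on attempt {try_number}")
    return "ok"

# In an Airflow DAG this would be wired roughly along these lines
# (not runnable here; task_id and retry settings are illustrative only):
#
#   PythonOperator(
#       task_id="flaky_task",
#       python_callable=...,   # pull try_number from the task context
#       retries=5,
#       retry_delay=timedelta(seconds=30),
#   )
```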
Observed result in affected runs:
- the final `task_instance` row is normal and points to the last successful attempt
- earlier `task_instance_history` rows for failed attempts have blank `hostname` and null `start_date`
- remote log objects for those failed attempts do not exist
- only the final successful attempt log object exists
Operating System
Linux / Kubernetes
Versions of Apache Airflow Providers
Not yet isolated to a provider-specific issue. The symptom has appeared on normal task retries for dbt-related DAG tasks with remote S3 logging enabled.
Deployment
Other Kubernetes deployment
Deployment details
Environment characteristics shared by affected runs:
- Executor: `KubernetesExecutor`
- Remote logging: enabled to S3
- Airflow versions: reproduced on 3.1.1 and 3.1.8
- Multiple scheduler replicas were present in at least one affected environment
This does not look like the Celery/proxy issue in #64263. In these incidents:
- there is no Celery worker layer involved
- successful sibling tasks in the same DAG run do persist logs normally
- final successful retries for the same task also persist logs normally
- only the earlier failed attempts are missing both hostname metadata and remote log objects
Anything else?
A concrete anonymized example from Airflow 3.1.8:
For one task that eventually succeeded on `try_number=8`, the database state looked like this:

| row | try_number | state | start_date | hostname |
|---|---|---|---|---|
| `task_instance` (final) | 8 | `success` | 2026-04-16 08:13:48+00:00 | `10.x.x.x` |
| `task_instance_history` | 3 | `failed` | NULL | `''` |
| `task_instance_history` | 6 | `failed` | NULL | `''` |
Remote logging for that same task/run contained only the final successful attempt's log object; direct checks for `attempt=3.log` and `attempt=6.log` returned 404.
We saw the same pattern on other tasks in the same DAG run, not just a single task.
The UI error seems to be a downstream consequence of the blank hostname. `FileTaskHandler` later tries to build the served-log URL from `ti.hostname`, which explains the final `http://:8793/...` message, but by that point the task-attempt metadata and remote log object are already missing.
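That downstream failure mode can be illustrated with a minimal, stdlib-only sketch. The URL template below is an assumed simplification of what `FileTaskHandler` does, not Airflow's actual code:

```python
from urllib.parse import urlsplit

# Simplified stand-in for how a served-log URL is composed from the
# task instance's hostname (the real template lives in FileTaskHandler;
# this is a labeled assumption, not Airflow's actual implementation).
def served_log_url(hostname: str, log_relative_path: str, port: int = 8793) -> str:
    return f"http://{hostname}:{port}/log/{log_relative_path}"

url = served_log_url("", "dag_id=d/run_id=r/task_id=t/attempt=3.log")
# With hostname='' the URL comes out as 'http://:8793/...': the netloc
# has a port but no host, matching the "No host supplied" UI error.
print(url)
print(urlsplit(url).hostname)  # no usable host component
```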
Are you willing to submit PR?
Code of Conduct