[Airflow 3 / Task SDK] Heartbeat 404 "Task Instance not found" after ~12h run; supervisor SERVER_TERMINATED (SIGKILL); retry sees 409 while TI still running

### Under which category would you file this issue?

Task SDK

### Apache Airflow version

airflow 3.0.6

### What happened and how to reproduce it?

During a long-running task (wall time on the order of ~12 hours in our case), the task process was terminated by the supervisor after the execution API stopped accepting heartbeats:
- PUT /execution/task-instances/<ti_uuid>/heartbeat returned 404 with Task Instance not found / reason: not_found.
- Supervisor logged that the server indicated the task should not keep running, then Task killed! with process SIGKILL (exit_code=-9), final_state=SERVER_TERMINATED.
- Supervisor also logged duration ~ 43214 s (~12 h) for that workload session.

In the same log window we also saw PATCH .../task-instances/<ti_uuid>/run return 409 Conflict with a message along the lines of cannot start task instance — invalid state, previous_state=running (exact try_number in our export was 2; we treat that as illustrative, not the core claim—we care about 404 + forced termination after a long running execute, regardless of attempt index).

DAG/operator architecture: The workload is a PythonOperator subclass whose execute() submits work to Snowflake and then blocks inside the same task on an imperative wait() (implemented on a BaseSensorOperator subclass: time.sleep + polling). There is no separate sensor task, so one TI can stay running for many hours.

What I need help with:

1. Under what conditions does the Task Execution API return 404 on heartbeat for a TI that was recently updating heartbeats successfully?
2. Whether long running tasks on Celery + Airflow 3 are expected to hit this, and how it relates to timeouts, TI lifecycle, or DAG run / span updates (we also see scheduler INFO such as Found span_status 'should_end', while updating state for dag_run ... on the same run).
3. Recommended patterns for multi-hour waits (deferrable operators, separate sensor task, etc.) under the Task SDK.

Representative log excerpts:
`PUT /execution/task-instances/<ti_uuid>/heartbeat HTTP/1.1" 404 Not Found
Task Instance not found        ti_id=<uuid>
Server indicated the task shouldn't be running anymore ... reason='not_found' message='Task Instance not found'
Task killed!
Task finished [supervisor] duration=43214.24... exit_code=-9 final_state=SERVER_TERMINATED`

### What you think should happen instead?

Heartbeat should keep returning 2xx until the TI is terminal or explicitly cleared; worker should not get 404 while still the active runner.

### Operating System

_No response_

### Deployment

MWAA, Celery

### Apache Airflow Provider(s)

_No response_

### Versions of Apache Airflow Providers

_No response_

### Official Helm Chart version

Not Applicable

### Kubernetes Version

_No response_

### Helm Chart configuration

_No response_

### Docker Image customizations

_No response_

### Anything else?

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Airflow 3 / Task SDK] Heartbeat 404 "Task Instance not found" after ~12h run; supervisor SERVER_TERMINATED (SIGKILL); retry sees 409 while TI still running #66713

Under which category would you file this issue?

Apache Airflow version

What happened and how to reproduce it?

What you think should happen instead?

Operating System

Deployment

Apache Airflow Provider(s)

Versions of Apache Airflow Providers

Official Helm Chart version

Kubernetes Version

Helm Chart configuration

Docker Image customizations

Anything else?

Are you willing to submit PR?

Code of Conduct

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Airflow 3 / Task SDK] Heartbeat 404 "Task Instance not found" after ~12h run; supervisor SERVER_TERMINATED (SIGKILL); retry sees 409 while TI still running #66713

Description

Under which category would you file this issue?

Apache Airflow version

What happened and how to reproduce it?

What you think should happen instead?

Operating System

Deployment

Apache Airflow Provider(s)

Versions of Apache Airflow Providers

Official Helm Chart version

Kubernetes Version

Helm Chart configuration

Docker Image customizations

Anything else?

Are you willing to submit PR?

Code of Conduct

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions