Apache Airflow version
2.6.3
What happened
A Spark job that runs for more than 1 hour fails with a 410 Expired error, regardless of the actual outcome of the Spark application.
Logs -
[2023-07-21, 19:41:57 IST] {spark_kubernetes.py:126} INFO - 2023-07-21T14:11:57.424338279Z 23/07/21 19:41:57 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
[2023-07-21, 19:41:57 IST] {spark_kubernetes.py:126} INFO - 2023-07-21T14:11:57.424355219Z 23/07/21 19:41:57 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-1] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-1] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-1] is running
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-1] completed
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-2] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-2] is running
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-3] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-3] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-4] is pending
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-4] is running
[2023-07-21, 19:41:58 IST] {spark_kubernetes.py:117} INFO - Executor [adhoc-d9e5ed897882bd27-exec-3] is running
[2023-07-21, 19:41:58 IST] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py", line 112, in execute
for event in namespace_event_stream:
File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: The resourceVersion for the provided watch is too old.
[2023-07-21, 19:41:58 IST] {taskinstance.py:1345} INFO - Marking task as FAILED. dag_id=adhoc, task_id=submit_job, execution_date=20230721T125127, start_date=20230721T125217, end_date=20230721T141158
[2023-07-21, 19:41:58 IST] {standard_task_runner.py:104} ERROR - Failed to execute job 247326 for task submit_job ((410)
Reason: Expired: The resourceVersion for the provided watch is too old.
; 15)
[2023-07-21, 19:41:58 IST] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2023-07-21, 19:41:58 IST] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check
In this case, I can see the task logs; only the final state is reported incorrectly.
If the task runs even longer (>4 hours), the logs are not visible either.
What you think should happen instead
In the first case, the task should report the correct state of the Spark application. In the second case, the logs should remain visible and the correct state of the Spark application should be reported.
How to reproduce
To reproduce, submit a long-running (>1 hour) task using the SparkKubernetesOperator.
If the task completes in under 4 hours, you should see a 410 Expired error regardless of the actual outcome of the Spark application.
If the task takes longer, you should see it fail around the 4-hour mark with a 410 error, even though the Spark application is still running.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.2.0
apache-airflow-providers-celery==3.2.1
apache-airflow-providers-cncf-kubernetes==7.3.0
apache-airflow-providers-common-sql==1.5.2
apache-airflow-providers-docker==3.7.1
apache-airflow-providers-elasticsearch==4.5.1
apache-airflow-providers-ftp==3.4.2
apache-airflow-providers-google==10.2.0
apache-airflow-providers-grpc==3.2.1
apache-airflow-providers-hashicorp==3.4.1
apache-airflow-providers-http==4.4.2
apache-airflow-providers-imap==3.2.2
apache-airflow-providers-microsoft-azure==6.1.2
apache-airflow-providers-mysql==5.1.1
apache-airflow-providers-odbc==4.0.0
apache-airflow-providers-postgres==5.5.1
apache-airflow-providers-redis==3.2.1
apache-airflow-providers-sendgrid==3.2.1
apache-airflow-providers-sftp==4.3.1
apache-airflow-providers-slack==7.3.1
apache-airflow-providers-snowflake==4.2.0
apache-airflow-providers-sqlite==3.4.2
apache-airflow-providers-ssh==3.7.1
Deployment
Official Apache Airflow Helm Chart
Deployment details
k8s version - 1.23 hosted using EKS
python 3.8
I upgraded apache-airflow-providers-cncf-kubernetes to confirm that the bug has not already been fixed in newer versions.
Anything else
I think this issue is caused by the ApiException raised by the kubernetes 'Watch().stream()' call not being handled. According to its docs:
Note that watching an API resource can expire. The method tries to
resume automatically once from the last result, but if that last result
is too old as well, an `ApiException` exception will be thrown with
``code`` 410. In that case you have to recover yourself, probably
by listing the API resource to obtain the latest state and then
watching from that state on by setting ``resource_version`` to
one returned from listing.
So this error needs to be handled by Airflow rather than by the Kubernetes client.
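The recovery described in the docs quoted above (list the resource again, then watch from the returned `resource_version`) could look roughly like the sketch below. This is not the operator's actual code: `stream_with_resume`, `list_latest`, `open_watch`, and `max_restarts` are hypothetical names used only for illustration. It relies on the kubernetes `ApiException` exposing the HTTP code as `.status`, and it reads that attribute with `getattr` so the snippet does not need the kubernetes package to run.

```python
from typing import Any, Callable, Iterable, Iterator


def stream_with_resume(
    list_latest: Callable[[], str],
    open_watch: Callable[[str], Iterable[Any]],
    max_restarts: int = 5,
) -> Iterator[Any]:
    """Yield watch events, restarting from a fresh resourceVersion on HTTP 410.

    list_latest: hypothetical callable that lists the resource and returns
        the current resourceVersion as a string.
    open_watch: hypothetical callable that opens a watch starting at the
        given resourceVersion and returns an iterable of events.
    """
    restarts = 0
    resource_version = list_latest()
    while True:
        try:
            for event in open_watch(resource_version):
                yield event
            return  # the stream ended normally
        except Exception as exc:
            # Only "410 Expired" is recoverable here; anything else is re-raised,
            # and so is a 410 once the restart budget is used up.
            if getattr(exc, "status", None) != 410 or restarts >= max_restarts:
                raise
            restarts += 1
            # Recover as the docs suggest: list again to get the latest state,
            # then watch from that resourceVersion.
            resource_version = list_latest()
```

With the kubernetes client, `list_latest` could return `core_v1.list_namespaced_event(namespace).metadata.resource_version`, and `open_watch` could wrap `watch.Watch().stream(core_v1.list_namespaced_event, namespace=namespace, resource_version=rv)`, where `core_v1` is a `CoreV1Api` instance. Events emitted between the expiry and the re-list can be missed with this approach, so the final state of the Spark application would probably still need to be read directly from the resource once the stream ends.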
Are you willing to submit PR?
Code of Conduct