I observed that some workers stopped randomly while running.
After some investigation, the issue is in the new KubernetesPodOperator and depends on a known issue in the Kubernetes API.
When a log rotation event occurs in Kubernetes, the stream we consume with `fetch_container_logs(follow=True, ...)` is no longer fed.
Therefore, the KubernetesPodOperator hangs indefinitely in the middle of the log. Only a SIGTERM can terminate it, since log consumption blocks `execute()` from finishing.
Ref to the issue in Kubernetes: kubernetes/kubernetes#59902
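For illustration, here is a minimal sketch (not the Airflow implementation) of how following a pod's log stream with the official Kubernetes Python client can block forever; the pod name and namespace below are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

# Request the container log as a raw, streaming HTTP response.
resp = core_v1.read_namespaced_pod_log(
    name="example-pod",       # placeholder
    namespace="default",      # placeholder
    follow=True,
    _preload_content=False,   # keep it as a stream instead of reading it all
)

# This loop only ends when the API server closes the stream. If a log
# rotation leaves the stream open but no longer fed
# (kubernetes/kubernetes#59902), the loop blocks indefinitely and the
# caller (e.g. the operator's execute()) never returns.
for chunk in resp.stream():
    print(chunk.decode(errors="replace"), end="")
```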
However, I think there are several ways to work around this on the Airflow side (e.g. making log consumption non-blocking and instead blocking only until the pod's `status.phase` reports completion, as is currently done when `get_logs` is not true).
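As a rough sketch of that kind of workaround (assuming the official Kubernetes Python client; the function name, pod name, and poll interval are made up for illustration), polling the pod status until a terminal phase instead of blocking on the log stream could look like this:

```python
import time
from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

def wait_for_pod_completion(name: str, namespace: str, interval: float = 5.0) -> str:
    """Poll the pod until its status.phase is terminal, then return the phase."""
    while True:
        pod = core_v1.read_namespaced_pod(name=name, namespace=namespace)
        phase = pod.status.phase
        if phase in ("Succeeded", "Failed"):
            return phase
        time.sleep(interval)

# Usage (names are placeholders):
# final_phase = wait_for_pod_completion("example-pod", "default")
```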
Linking #12103 for reference, as the result is more or less the same (although the root cause is different).