Missing logs if the pod was killed by Kubernetes #8780

Closed

RyanSiu1995 opened this issue May 8, 2020 · 7 comments
Labels
kind:bug This is a clearly a bug provider:cncf-kubernetes Kubernetes provider related issues

Comments

@RyanSiu1995
Contributor

Apache Airflow version: 1.10.10

Kubernetes version (if you are using kubernetes) (use kubectl version):

> kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.7", GitCommit:"8fca2ec50a6133511b771a11559e24191b1aa2b4", GitTreeState:"clean", BuildDate:"2019-09-18T14:47:22Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • GKE
  • Linux
  • Linux pod-name-597d575d8-xrmkn 4.14.138+ #1 SMP Tue Sep 3 02:58:08 PDT 2019 x86_64 GNU/Linux
  • pip install

What happened:
We used the Kubernetes executor to run our pipeline with GCS remote log storage. For some DAGs, the logs could not be found in the GCS bucket.
They all share the same error:

*** Unable to read remote log from gs://xxxx/zzz_test_on_hold/on_hold/2020-05-08T07:06:18.991443+00:00/1.log
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/xxxx/o/zzz_test_on_hold%2Fon_hold%2F2020-05-08T07%3A06%3A18.991443%2B00%3A00%2F1.log?alt=media: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

We then checked the stdout logs and found an error like this:

Please make sure that airflow[gcp] is installed and the GCS connection exists.
Could not write logs to gs://xxxx/zzz_test_on_hold/on_hold/2020-05-08T07:06:18.991443+00:00/1.log: 'NoneType' object has no attribute 'upload'

What you expected to happen:
I expected the logs to still be uploaded to GCS even if the pod was unexpectedly killed by Kubernetes.

How to reproduce it:
Write a DAG with the following function.

import logging
import time

logger = logging.getLogger(__name__)

def on_hold():
    """Hold the process indefinitely so the task keeps running."""
    while True:
        logger.warning("this is on holding now...")
        time.sleep(1000)

Once the task pod is spawned, kill the pod with kubectl.
The log will then fail to upload.
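
For context, a minimal DAG wrapping this function could look like the sketch below. The dag_id and task_id mirror the GCS log path in the error above, but the start_date and operator wiring are illustrative assumptions for Airflow 1.10.x, not part of the original report.

from datetime import datetime
import logging
import time

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

logger = logging.getLogger(__name__)

def on_hold():
    """Hold the process indefinitely (same function as in the report)."""
    while True:
        logger.warning("this is on holding now...")
        time.sleep(1000)

# dag_id and task_id taken from the GCS log path in the error above;
# start_date is a placeholder.
dag = DAG(
    dag_id="zzz_test_on_hold",
    schedule_interval=None,
    start_date=datetime(2020, 5, 1),
)

on_hold_task = PythonOperator(
    task_id="on_hold",
    python_callable=on_hold,
    dag=dag,
)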

Anything else we need to know:
Nope

@RyanSiu1995 RyanSiu1995 added the kind:bug This is a clearly a bug label May 8, 2020
@RyanSiu1995
Contributor Author

I tried removing all the try/except blocks and got this stacktrace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1919, in shutdown
    h.close()
  File "/data/clover_pipeline/src/airflow/airflow/utils/log/gcs_task_handler.py", line 81, in close
    self.gcs_write(log, remote_loc)
  File "/data/clover_pipeline/src/airflow/airflow/utils/log/gcs_task_handler.py", line 154, in gcs_write
    self.hook.upload(bkt, blob, tmpfile.name)
  File "/usr/local/lib/python3.6/site-packages/cached_property.py", line 35, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/data/clover_pipeline/src/airflow/airflow/utils/log/gcs_task_handler.py", line 48, in hook
    google_cloud_storage_conn_id=remote_conn_id
  File "/data/clover_pipeline/src/airflow/airflow/contrib/hooks/gcs_hook.py", line 43, in __init__
    delegate_to)
  File "/data/clover_pipeline/src/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 131, in __init__
    self.extras = self.get_connection(self.gcp_conn_id).extra_dejson
  File "/data/clover_pipeline/src/airflow/airflow/hooks/base_hook.py", line 84, in get_connection
    conn = random.choice(list(cls.get_connections(conn_id)))
  File "/data/clover_pipeline/src/airflow/airflow/hooks/base_hook.py", line 80, in get_connections
    return secrets.get_connections(conn_id)
  File "/data/clover_pipeline/src/airflow/airflow/secrets/__init__.py", line 52, in get_connections
    conn_list = secrets_backend.get_connections(conn_id=conn_id)
  File "/data/clover_pipeline/src/airflow/airflow/utils/db.py", line 72, in wrapper
    with create_session() as session:
  File "/usr/local/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/data/clover_pipeline/src/airflow/airflow/utils/db.py", line 42, in create_session
    session = settings.Session()
TypeError: 'NoneType' object is not callable

@msumit
Contributor

msumit commented May 8, 2020

AFAIK the task logs are uploaded to remote storage once the task is completed, be it the Kubernetes, Celery, or even Local executor. One simple solution is to store these logs on a persistent volume that supports multiple readers/writers, so the logs won't be lost even if the pod is killed in between.
Not sure if Kubernetes has some way to catch the pod kill request and execute some code before actually killing it. In that case, you could write some code to upload the logs manually before the pod goes away.
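
As a side note, Kubernetes does send SIGTERM to the container's main process when a pod is deleted, and only sends SIGKILL after the termination grace period expires. A minimal, hypothetical Python sketch of catching that signal to flush logs first (upload_task_logs is a placeholder, not an existing Airflow function):

import signal
import sys

def upload_task_logs():
    """Placeholder: push the local task log to remote storage (e.g. GCS)."""
    pass

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first; SIGKILL only follows after
    # terminationGracePeriodSeconds, so there is a window to flush logs.
    upload_task_logs()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)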

@RyanSiu1995
Contributor Author

@msumit Thank you for your reply.
I am thinking of a change like this:
we reinitialize the database session dynamically in this utils file.
https://github.com/apache/airflow/blob/1.10.10/airflow/utils/db.py#L38-L50
The code would look like this.

import contextlib

from airflow import settings

@contextlib.contextmanager
def create_session():
    """
    Contextmanager that will create and teardown a session.
    """
    # Remember whether a Session factory already exists; if not, configure
    # the ORM on demand so late callers (e.g. log handlers at shutdown)
    # can still obtain a session, and dispose of it again afterwards.
    have_session = settings.Session is not None
    if not have_session:
        settings.configure_orm(disable_connection_pool=True)
    session = settings.Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
        if not have_session:
            settings.dispose_orm()

This actually resolves the issue in my case, but I am not sure whether it is proper for create_session() to initialize the session dynamically like this.
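
To illustrate the failing code path: the GCS task handler reaches create_session() indirectly via BaseHook.get_connection() during logging shutdown (see the stacktrace above), at which point settings.Session is already None. A rough sketch of exercising the patched context manager under that condition follows; the direct dispose_orm() call is only there to simulate the torn-down state, not something the handler itself does.

from airflow import settings
from airflow.utils.db import create_session

# Simulate the state seen at logging shutdown: the ORM has been torn
# down and settings.Session is None.
settings.dispose_orm()

with create_session() as session:
    # With the patch, create_session() reconfigures the ORM on demand,
    # so this no longer raises TypeError: 'NoneType' object is not callable.
    session.execute("SELECT 1")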

@RyanSiu1995
Contributor Author

But, to be honest, it is pretty odd that create_session() does not actually perform any creation; it just returns the session object from settings.py.
So this change makes sense to me, since it gives the creation responsibility back to create_session().

@msumit
Contributor

msumit commented May 8, 2020

@RyanSiu1995 so it means that the worker pod is trying to upload the logs to GCS while being killed, but failing due to some issue in getting a DB session?

@RyanSiu1995
Contributor Author

Yes, that's exactly what happened in my case.

@mik-laj mik-laj added the k8s label May 12, 2020
@kaxil kaxil added provider:cncf-kubernetes Kubernetes provider related issues and removed area:k8s labels Nov 18, 2020
@RosterIn
Contributor

I still see from time to time on Airflow 2.2.3 (Kubernetes) that logs are missing with a similar message.
I wonder if this issue is the cause?

@apache apache locked and limited conversation to collaborators Feb 6, 2023
@eladkal eladkal converted this issue into discussion #29388 Feb 6, 2023
