Missing logs if the pod was killed by Kubernetes #8780

Closed

RyanSiu1995 opened this issue May 8, 2020 · 7 comments
Labels
kind:bug This is a clearly a bug provider:cncf-kubernetes Kubernetes provider related issues

Comments

@RyanSiu1995
Contributor

Apache Airflow version: 1.10.10

Kubernetes version (if you are using kubernetes) (use kubectl version):

> kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.7", GitCommit:"8fca2ec50a6133511b771a11559e24191b1aa2b4", GitTreeState:"clean", BuildDate:"2019-09-18T14:47:22Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.27", GitCommit:"145f9e21a4515947d6fb10819e5a336aff1b6959", GitTreeState:"clean", BuildDate:"2020-02-21T18:01:40Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • GKE
  • Linux
  • Linux pod-name-597d575d8-xrmkn 4.14.138+ #1 SMP Tue Sep 3 02:58:08 PDT 2019 x86_64 GNU/Linux
  • pip install

What happened:
We used the Kubernetes executor to run our pipeline with GCS remote log storage. For some DAGs, the logs could not be found in the GCS bucket.
They all share the same error:

*** Unable to read remote log from gs://xxxx/zzz_test_on_hold/on_hold/2020-05-08T07:06:18.991443+00:00/1.log
*** 404 GET https://storage.googleapis.com/download/storage/v1/b/xxxx/o/zzz_test_on_hold%2Fon_hold%2F2020-05-08T07%3A06%3A18.991443%2B00%3A00%2F1.log?alt=media: ('Request failed with status code', 404, 'Expected one of', <HTTPStatus.OK: 200>, <HTTPStatus.PARTIAL_CONTENT: 206>)

We then checked the stdout logs and found an error like this:

Please make sure that airflow[gcp] is installed and the GCS connection exists.
Could not write logs to gs://xxxx/zzz_test_on_hold/on_hold/2020-05-08T07:06:18.991443+00:00/1.log: 'NoneType' object has no attribute 'upload'

What you expected to happen:
I expected the logs to still be uploaded to GCS even if the pod was unexpectedly killed by Kubernetes.

How to reproduce it:
Write a DAG with the following function.

import logging
import time

logger = logging.getLogger(__name__)

def on_hold():
    """Hold the process indefinitely so the task keeps running."""
    while True:
        logger.warning("this is on holding now...")
        time.sleep(1000)

Once the task pod is spawned, kill the pod with kubectl.
The log will then fail to upload.
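
For context, a minimal DAG wrapping this function could look like the sketch below. The dag_id and task_id mirror the GCS log path in the error above, but the start_date and operator wiring are illustrative assumptions for Airflow 1.10.x, not part of the original report.

from datetime import datetime
import logging
import time

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

logger = logging.getLogger(__name__)

def on_hold():
    """Hold the process indefinitely (same function as in the report)."""
    while True:
        logger.warning("this is on holding now...")
        time.sleep(1000)

# dag_id and task_id taken from the GCS log path in the error above;
# start_date is a placeholder.
dag = DAG(
    dag_id="zzz_test_on_hold",
    schedule_interval=None,
    start_date=datetime(2020, 5, 1),
)

on_hold_task = PythonOperator(
    task_id="on_hold",
    python_callable=on_hold,
    dag=dag,
)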

Anything else we need to know:
Nope

@RyanSiu1995 RyanSiu1995 added the kind:bug This is a clearly a bug label May 8, 2020
@RyanSiu1995
Contributor Author

I tried removing all the try/except blocks and got this stacktrace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/logging/__init__.py", line 1919, in shutdown
    h.close()
  File "/data/clover_pipeline/src/airflow/airflow/utils/log/gcs_task_handler.py", line 81, in close
    self.gcs_write(log, remote_loc)
  File "/data/clover_pipeline/src/airflow/airflow/utils/log/gcs_task_handler.py", line 154, in gcs_write
    self.hook.upload(bkt, blob, tmpfile.name)
  File "/usr/local/lib/python3.6/site-packages/cached_property.py", line 35, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/data/clover_pipeline/src/airflow/airflow/utils/log/gcs_task_handler.py", line 48, in hook
    google_cloud_storage_conn_id=remote_conn_id
  File "/data/clover_pipeline/src/airflow/airflow/contrib/hooks/gcs_hook.py", line 43, in __init__
    delegate_to)
  File "/data/clover_pipeline/src/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 131, in __init__
    self.extras = self.get_connection(self.gcp_conn_id).extra_dejson
  File "/data/clover_pipeline/src/airflow/airflow/hooks/base_hook.py", line 84, in get_connection
    conn = random.choice(list(cls.get_connections(conn_id)))
  File "/data/clover_pipeline/src/airflow/airflow/hooks/base_hook.py", line 80, in get_connections
    return secrets.get_connections(conn_id)
  File "/data/clover_pipeline/src/airflow/airflow/secrets/__init__.py", line 52, in get_connections
    conn_list = secrets_backend.get_connections(conn_id=conn_id)
  File "/data/clover_pipeline/src/airflow/airflow/utils/db.py", line 72, in wrapper
    with create_session() as session:
  File "/usr/local/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/data/clover_pipeline/src/airflow/airflow/utils/db.py", line 42, in create_session
    session = settings.Session()
TypeError: 'NoneType' object is not callable

@msumit
Contributor

msumit commented May 8, 2020

AFAIK the task logs are uploaded to remote storage once the task is completed, be it the Kubernetes, Celery, or even Local executor. One simple solution is to store these logs on a persistent volume that supports multiple readers/writers, so the logs won't be lost even if the pod is killed in between.
Not sure if Kubernetes has some way to catch the pod kill request and execute some code before actually killing it. In that case, you could write some code to upload the logs manually before the pod goes away.
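
As a side note, Kubernetes does send SIGTERM to the container's main process when a pod is deleted, and only sends SIGKILL after the termination grace period expires. A minimal, hypothetical Python sketch of catching that signal to flush logs first (upload_task_logs is a placeholder, not an existing Airflow function):

import signal
import sys

def upload_task_logs():
    """Placeholder: push the local task log to remote storage (e.g. GCS)."""
    pass

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM first; SIGKILL only follows after
    # terminationGracePeriodSeconds, so there is a window to flush logs.
    upload_task_logs()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)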

@RyanSiu1995
Contributor Author

@msumit Thank you for your reply.
I am thinking of a change like this:
we reinitialize the database session dynamically in this utils file.
https://github.com/apache/airflow/blob/1.10.10/airflow/utils/db.py#L38-L50
The code would look like this.

import contextlib

from airflow import settings

@contextlib.contextmanager
def create_session():
    """
    Contextmanager that will create and teardown a session.
    """
    # Remember whether a Session factory already exists; if not, configure
    # the ORM on demand so late callers (e.g. log handlers at shutdown)
    # can still obtain a session, and dispose of it again afterwards.
    have_session = settings.Session is not None
    if not have_session:
        settings.configure_orm(disable_connection_pool=True)
    session = settings.Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()
        if not have_session:
            settings.dispose_orm()

This actually resolves the issue in my case, but I am not sure whether it is proper for create_session() to initialize the session dynamically like this.
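
To illustrate the failing code path: the GCS task handler reaches create_session() indirectly via BaseHook.get_connection() during logging shutdown (see the stacktrace above), at which point settings.Session is already None. A rough sketch of exercising the patched context manager under that condition follows; the direct dispose_orm() call is only there to simulate the torn-down state, not something the handler itself does.

from airflow import settings
from airflow.utils.db import create_session

# Simulate the state seen at logging shutdown: the ORM has been torn
# down and settings.Session is None.
settings.dispose_orm()

with create_session() as session:
    # With the patch, create_session() reconfigures the ORM on demand,
    # so this no longer raises TypeError: 'NoneType' object is not callable.
    session.execute("SELECT 1")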

@RyanSiu1995
Contributor Author

But, to be honest, it is pretty odd that create_session() does not actually perform any creation; it just returns the session object from settings.py.
So this change makes sense to me, since it gives the creation responsibility back to create_session().

@msumit
Contributor

msumit commented May 8, 2020

@RyanSiu1995 so it means that the worker pod is trying to upload the logs to GCS while being killed, but failing due to some issue in getting a DB session?

@RyanSiu1995
Contributor Author

Yes, that's exactly what happened in my case.

@mik-laj mik-laj added the k8s label May 12, 2020
@kaxil kaxil added provider:cncf-kubernetes Kubernetes provider related issues and removed area:k8s labels Nov 18, 2020
@RosterIn
Contributor

I still see from time to time on Airflow 2.2.3 (Kubernetes) that logs are missing with a similar message.
I wonder if this issue is the cause?

@apache apache locked and limited conversation to collaborators Feb 6, 2023
@eladkal eladkal converted this issue into discussion #29388 Feb 6, 2023
