
airflow logging to S3 #34

Closed
adib-next opened this issue Dec 12, 2020 · 19 comments

Labels
kind/question kind - user questions

Comments

@adib-next

adib-next commented Dec 12, 2020

We have been encountering some issues with logging to S3. It seems that in some cases where the DAG fails, the log is missing as well, and we get the following error. Has anyone come across this issue? Any suggestions?

[screenshot of the error: log file does not exist]

@adib-next adib-next added the kind/question kind - user questions label Dec 12, 2020
@thesuperzapper
Member

@adib-next are you following the steps under the Option 1 - S3/GCS bucket heading in the README?

@adib-next
Author

Yes, I have added the configuration below and the logs are being written to S3. But it is not stable: we sometimes get errors like the screenshot I shared, where the log file does not exist. Any idea why this is happening?

```yaml
airflow:
  config:
    AIRFLOW__CORE__REMOTE_LOGGING: "True"
    AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER: "s3://<>/airflow/logs"
    AIRFLOW__CORE__REMOTE_LOG_CONN_ID: "aws_default"

scheduler:
  securityContext:
    fsGroup: 65534

web:
  securityContext:
    fsGroup: 65534

workers:
  securityContext:
    fsGroup: 65534
```

@thesuperzapper
Member

@adib-next I think it could be that if the worker pod crashes before uploading the logs to S3, they will be lost (as they are only stored on the temp disk of the pod until they are uploaded)

@adib-next
Author

@thesuperzapper any recommendations on how to prevent or minimize this issue?
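One possible mitigation (see also the "How to persist airflow logs?" FAQ mentioned later in this thread) is to write task logs to a shared PersistentVolume, so logs from a crashed worker survive until they can be read. A minimal sketch, assuming the chart's `logs.persistence` values; the storage class below is a placeholder, and exact keys should be checked against your chart version's values.yaml:

```yaml
logs:
  persistence:
    ## sketch only -- key names assume the chart's `logs.persistence` section
    enabled: true
    ## a ReadWriteMany volume lets worker and webserver pods share the logs
    accessMode: ReadWriteMany
    size: 1Gi
    ## hypothetical storage class; use one that your cluster provides
    storageClass: efs-sc
```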

@johngtam

johngtam commented Feb 12, 2021

@adib-next, I might be encountering the same problem you are these days -- I'm speculating that, if you set this in your airflow.config:

```yaml
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: "False"
```

you'll end up seeing your pod with an Init:Error status. The pod isn't even starting. Even if you followed the instructions here (e.g. modifying the IAM role to trust your cluster's OIDC provider, then adding the annotation with your IAM role's ARN to the values.yaml file), your pod will still fail to start.

From my experience so far, I'm speculating that the task instance pod (if you're running the KubernetesExecutor) is falling into the really unfortunate trap described here. The securityContext applies to the webserver and scheduler pods (which means they initialize and are able to read the web identity token file).

But ultimately, your task instance pods don't run because their security context isn't set so that they can read the web identity token file (more on that here, in AWS's documentation).

Right now, I'm trying to figure out a way to give every pod spawned by the KubernetesExecutor a security context so it can actually read that token and assume the correct IAM role (in your case and mine, for logging). It might be possible via pod_template_file, but I'm finding the documentation confusing (a sketch of that approach follows below).

@adib-next, did you end up having any luck in the time since?
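For reference, a minimal sketch of the pod_template_file idea, assuming Airflow's KubernetesExecutor pod template support; the template name and image tag are placeholders, and fsGroup 65534 matches the value used elsewhere in this thread:

```yaml
## sketch of a pod_template_file for the KubernetesExecutor
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-template   # hypothetical name; the executor overrides it per task
spec:
  securityContext:
    ## fsGroup 65534 ("nobody") makes the projected IRSA web identity
    ## token file readable by the task container
    fsGroup: 65534
  containers:
    - name: base                      # the KubernetesExecutor expects a container named "base"
      image: apache/airflow:1.10.14   # placeholder; match your Airflow version
```

The file would then be referenced via AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE (the env-var form of the `[kubernetes] pod_template_file` option).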

@johngtam

Ended up getting it to work. For Airflow 1.10.14 and lower, setting this in airflow.config:

```yaml
AIRFLOW__KUBERNETES__FS_GROUP: "65534"
```

results in the fs_group being set for the containers in the worker pods. Not sure if that's a viable solution for you, but it unblocked me in actually getting the pods to run.

Documentation here for 1.10.14: https://airflow.apache.org/docs/apache-airflow/1.10.14/configurations-ref.html#fs-group
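For reference, a minimal sketch of where that key would sit in this chart's values, following the airflow.config pattern used earlier in this thread:

```yaml
airflow:
  config:
    ## sets fsGroup on KubernetesExecutor worker pods (Airflow 1.10.x;
    ## Airflow 2 moved pod options like this into pod_template_file)
    AIRFLOW__KUBERNETES__FS_GROUP: "65534"
```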

@thesuperzapper
Member

Can @adib-next or @johngtam clarify if this is still an issue after version 8.0.0 of the chart?

@thesuperzapper
Member

@adib-next or @johngtam any updates?

@thesuperzapper thesuperzapper moved this from Unsorted to Need Clarification in Issue Triage and PR Tracking Apr 9, 2021
@johngtam

johngtam commented Apr 9, 2021

@thesuperzapper, I haven't given this a shot yet, but hope to do so soon! Really excited about version 8 of this Helm chart :)

@stale

stale bot commented Jun 8, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale lifecycle - this is stale label Jun 8, 2021
@johngtam

johngtam commented Jun 15, 2021

@thesuperzapper, I ended up upgrading to both Airflow 2 and version 8 of this chart. Unfortunately, that broke logging :/.

Looking at the logs:

```
*** Failed to verify remote log exists s3://flatironhealth-d-wariotesteks-data-platforms-57-airflow-logs/k8s_pod_operator_test_trigger_v001/artifactory-busybox-test/2021-05-24T15:05:52.979589+00:00/1.log.
The conn_id `aws_default` isn't defined
*** Falling back to local log
*** Trying to get logs (last 100 lines) from worker pod localhost ***

*** Unable to fetch logs from worker pod localhost ***
(404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '17dd3d2e-8cb8-4276-9f0b-661173e2c90b', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 15 Jun 2021 18:11:24 GMT', 'Content-Length': '186'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \\"localhost\\" not found","reason":"NotFound","details":{"name":"localhost","kind":"pods"},"code":404}\n'
```

I see that the conn_id `aws_default` isn't defined. I'm using the KubernetesExecutor. However, in my Helm chart, I have these values (with some templating):

```yaml
airflow:
  config:
    AIRFLOW__LOGGING__LOGGING_LEVEL: INFO
    AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
    AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "s3://${bucket_name}"
    AIRFLOW__LOGGING__ENCRYPT_S3_LOGS: "True"
    AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "aws_default"

  # This allows for the security context to be the same on pods. I don't
  # configure "workers" because I'm not using the CeleryExecutor. I'm using
  # the KubernetesExecutor.
  securityContext:
    fsGroup: 65534

scheduler:
  securityContext:
    fsGroup: 65534

web:
  securityContext:
    fsGroup: 65534

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "${base_role_arn}"
```

Do you know why I might be getting the error `The conn_id aws_default isn't defined`? I'm curious, considering that in config I have `AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "aws_default"`.

@stale stale bot removed the lifecycle/stale lifecycle - this is stale label Jun 15, 2021
@johngtam

D'oh, all I had to do was define aws_default in the connections. I'm able to get the logs.

@thesuperzapper
Member

@johngtam, good to hear! For those finding this, please look at the docs for creating connections with airflow.connections:
https://github.com/airflow-helm/charts/tree/main/charts/airflow#how-to-create-airflow-connections
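A minimal sketch of what that looks like in the chart's values, assuming the airflow.connections format from the linked doc; the credentials and region below are placeholders:

```yaml
airflow:
  connections:
    ## placeholder static credentials -- see the IRSA discussion further down
    ## for the keyless variant
    - id: aws_default
      type: aws
      description: my AWS connection
      extra: |
        {
          "aws_access_key_id": "XXXXXXXX",
          "aws_secret_access_key": "XXXXXXXX",
          "region_name": "eu-central-1"
        }
```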

Issue Triage and PR Tracking automation moved this from Needs Clarification to Done Jun 27, 2021
@Legion2

Legion2 commented Apr 20, 2022

@johngtam how did you configure the connection? When using EKS IAM service accounts, we don't want to configure aws_access_key_id and aws_secret_access_key.

@thesuperzapper
Member

@Legion2 the "How to persist airflow logs?" FAQ explains how to use EKS IAM Roles for Service Accounts (IRSA) for authorization.

@johngtam

Yes, I used IRSA (IAM Roles for Service Accounts)! That worked out nicely for me.

@Legion2

Legion2 commented Apr 21, 2022

@thesuperzapper the FAQ does not explain how to configure the aws_default connection for an EKS IAM role service account. After some research, we created an empty aws_default connection.
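A minimal sketch of that empty-connection approach, assuming the same airflow.connections format as above; with no credentials set, boto3 falls back to its default credential chain, which picks up the IRSA web identity token:

```yaml
airflow:
  connections:
    ## no credentials on purpose: boto3's default chain finds the
    ## web identity token mounted via the pod's service account
    - id: aws_default
      type: aws
```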

@thesuperzapper
Member

@Legion2 thanks for highlighting that the docs could be clearer!

I have updated the following docs to make it easier to understand:

Please tell me if you think there is still room for improvement.

@Legion2

Legion2 commented May 12, 2022

@thesuperzapper thanks, that documentation is much better.
