
Adding more information to kubernetes executor logs #29929

Merged · 19 commits into apache:main · May 25, 2023

Conversation

@amoghrajesh (Contributor) commented Mar 5, 2023

Kubernetes executor logs do not contain much information that can be traced back to narrow down an issue. To make debugging easier, this PR adds the key annotations to some logger lines, which can be useful to debug issues more quickly.

The annotations contain crucial information about a k8s log line; for example, the structure can look like:

    {
        "dag_id": "dag",
        "task_id": "task",
        "run_id": "run_id",
        "try_number": "1",
        "execution_date": None,
    }
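
For illustration, a minimal sketch of rendering such a dict into a string that can be appended to a log line (the annotations_to_str name comes up later in this review; its implementation here is only an assumption, not the merged code):

    def annotations_to_str(annotations: dict) -> str:
        # Illustrative rendering only: flatten the key/value pairs into one
        # compact string, e.g. "dag_id=dag, task_id=task, run_id=run_id, try_number=1".
        return ", ".join(f"{key}={value}" for key, value in annotations.items())

    print(annotations_to_str({"dag_id": "dag", "task_id": "task", "run_id": "run_id", "try_number": "1"}))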

closes: #18329


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg bot added the provider:cncf-kubernetes and area:Scheduler labels on Mar 5, 2023
Resolved (outdated) review threads on:
tests/executors/test_kubernetes_executor.py
airflow/executors/kubernetes_executor.py
airflow/kubernetes/kubernetes_helper_functions.py
@amoghrajesh (Contributor Author)

@jedcunningham @hussein-awala @hterik I want to hear some opinions about this.
Does the following proposal work?

  1. Since most lines already have logs with the required annotations, it is better if we don't force more annotations and flood the logs for all users.
  2. We can have a feature flag that controls whether the annotations are printed for some of these k8s executor logs. This flag can be enabled by users who need this kind of information (a rough sketch follows below).
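
A rough sketch of how such a flag could be read, assuming the option name the PR eventually settles on ([kubernetes_executor] logs_task_metadata); the exact lookup and default are assumptions:

    from airflow.configuration import conf

    def get_logs_task_metadata() -> bool:
        # Assumed lookup for the proposed flag; default off so existing users
        # do not suddenly see extra annotations flooding their logs.
        return conf.getboolean("kubernetes_executor", "logs_task_metadata", fallback=False)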

@potiuk (Member) commented Apr 4, 2023

It seems that the "rebase" from the UI messed with the change (first time I see something like that). Can you please rebase your local change manually again, @amoghrajesh, and re-push it?

@potiuk mentioned this pull request on Apr 4, 2023
@@ -2671,6 +2671,13 @@ kubernetes_executor:
type: string
example: '{ "total": 3, "backoff_factor": 0.5 }'
default: ""
detailed_executor_logs:
description: |
Flag to have more details in kubernetes executor pod logs
Member:

Can this be more specific than “more details”? What details are they? Why/when would people not want those details?

Contributor Author:

@uranusjr thanks for the review. The logs in the Kubernetes executor do not contain a lot of the data needed for the best debugging experience.

Example:
When trying to trace the lifecycle of a task in the kubernetes executor, you currently must search first for the name of the pod created by the task, then search for the pod name in the logs. This means you need to be pretty familiar with the structure of the scheduler logs in order to search effectively for the lifecycle of a task that had a problem.

Some log statements, like "Attempting to finish pod", do have the annotations for the pod, which include dag name, task name, and run_id, but others do not. For instance, "Event: podname-a2f2c1ac706 had an event of type DELETED" has no such annotations.
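
To illustrate, an enriched version of that event line might look roughly like the sketch below (the variable names and exact wording are illustrative, not the merged code):

    import logging

    log = logging.getLogger("airflow.executors.kubernetes_executor.KubernetesJobWatcher")

    # Hypothetical values, for illustration only.
    pod_name = "podname-a2f2c1ac706"
    event_type = "DELETED"
    annotations_string = "dag_id=dag, task_id=task, run_id=run_id, try_number=1"

    # Tagging the event with the task annotations lets a DELETED event be tied
    # back to a specific dag_id/task_id/run_id/try_number without a second search.
    log.info("Event: %s had an event of type %s, annotations: %s", pod_name, event_type, annotations_string)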

Contributor Author:

I will rename the flag to something more meaningful, like traceable-executor-logs. What do you think, @uranusjr? Or do you have any alternative suggestions?

Member:

Sounds good to me

Member:

What about logs_task_metadata? If you look at the added information, it describes the task and not the executor. WDYT?

Contributor Author:

Great point, I agree with you @hussein-awala. I will make the changes for this.

@amoghrajesh amoghrajesh requested a review from uranusjr May 17, 2023 04:02
Comment on lines 240 to 245
if get_logs_task_metadata():
    self.log.info(
        "Event: Failed to start pod %s, annotations: %s", pod_name, annotations_string
    )
else:
    self.log.info("Event: Failed to start pod %s", pod_name)
Member:

I wonder, instead of doing this if-else everywhere, whether it's better to simply do something like

if get_logs_task_metadata():
    annotations_for_logging = annotations_to_str(annotations)
else:
    annotations_for_logging = "<omitted>"

and always add the annotations: %s part.

Member:

Or maybe do this if-else in annotations_to_str directly? (If we give this function a better name.)
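
That is roughly the shape the change ends up taking. A minimal sketch of such a helper, assuming the final name used in the tests further down and a plain (uncached) config lookup; the merged code may differ in details:

    from __future__ import annotations

    from airflow.configuration import conf

    def annotations_for_logging_task_metadata(annotation_set: dict[str, str]) -> dict[str, str] | str:
        """Return the task annotations for logging, or a placeholder when the flag is off."""
        if conf.getboolean("kubernetes_executor", "logs_task_metadata", fallback=False):
            return annotation_set
        return "<omitted>"

Each call site can then log unconditionally, e.g. self.log.info("Event: Failed to start pod %s, annotations: %s", pod_name, annotations_for_logging_task_metadata(annotations)).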

@amoghrajesh (Contributor Author) commented May 19, 2023:

IMO, leaving the "<omitted>" annotations in would not look good in the logs when we do not have any annotations or do not have the logs metadata flag enabled. What do you think, @uranusjr?

Member:

Logs should prioritise simplicity over making things pretty.

Contributor Author:

Sure, let me address this comment in that case.

@amoghrajesh amoghrajesh requested a review from uranusjr May 19, 2023 08:02
@amoghrajesh (Contributor Author)

@uranusjr I have addressed your review comments; could you do another round of review when you have some time?

@amoghrajesh amoghrajesh reopened this May 22, 2023
@amoghrajesh (Contributor Author) commented May 22, 2023

Closed and reopened the pull request to re-launch the entire test suite.

@amoghrajesh amoghrajesh reopened this May 22, 2023
Comment on lines 1169 to 1201
def test_annotations_for_logging_task_metadata(self):
    annotations_test = {
        "dag_id": "dag",
        "run_id": "run_id",
        "task_id": "task",
        "try_number": "1",
    }
    with mock.patch.dict(
        os.environ, {"AIRFLOW__KUBERNETES_EXECUTOR__LOGS_TASK_METADATA": "True"}, clear=True
    ):
        expected_annotations = {
            "dag_id": "dag",
            "run_id": "run_id",
            "task_id": "task",
            "try_number": "1",
        }
        annotations_actual = annotations_for_logging_task_metadata(annotations_test)
        assert annotations_actual == expected_annotations

def test_annotations_for_logging_task_metadata_fallback(self):
    annotations_test = {
        "dag_id": "dag",
        "run_id": "run_id",
        "task_id": "task",
        "try_number": "1",
    }
    with mock.patch.dict(
        os.environ, {"AIRFLOW__KUBERNETES_EXECUTOR__LOGS_TASK_METADATA": "False"}, clear=True
    ):
        expected_annotations = "<omitted>"
        annotations_actual = annotations_for_logging_task_metadata(annotations_test)
        assert annotations_actual == expected_annotations

Contributor Author:

These tests seem to work fine in my environment; I'm not sure how or why they are failing in the CI. Any hints, @hussein-awala?

Member:

Instead of using mock.patch, it's more reliable to use the conf_vars helper. You also want to patch both the true and false cases, because it's not deterministic in the CI which value the config is set to when the test is run.
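
A minimal sketch of what that could look like with conf_vars (the import paths and the module for the helper under test are assumptions based on the files touched in this PR and may differ in other versions):

    from tests.test_utils.config import conf_vars

    from airflow.kubernetes.kubernetes_helper_functions import annotations_for_logging_task_metadata

    ANNOTATIONS = {"dag_id": "dag", "task_id": "task", "run_id": "run_id", "try_number": "1"}

    @conf_vars({("kubernetes_executor", "logs_task_metadata"): "True"})
    def test_annotations_included_when_flag_enabled():
        assert annotations_for_logging_task_metadata(ANNOTATIONS) == ANNOTATIONS

    @conf_vars({("kubernetes_executor", "logs_task_metadata"): "False"})
    def test_annotations_omitted_when_flag_disabled():
        assert annotations_for_logging_task_metadata(ANNOTATIONS) == "<omitted>"

Note that if the config lookup behind the helper is cached, the cache also has to be cleared between tests, as discussed below.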

Contributor Author:

@uranusjr I didn't quite get what you meant by "You also want to patch both the true and false cases because it's not deterministic in the CI which the config is set to when the test is run."

I have patched both cases: True on line 1177 and False on line 1196.

@amoghrajesh (Contributor Author)

I tried using conf_vars instead, and the tests pass in my dev setup, but I'm not sure why they consistently fail here.

@uranusjr @hussein-awala any tips?

@uranusjr (Member) commented May 23, 2023

When a function is decorated with cache, it is only executed once, and later calls automatically reuse the previous result. This means that after the first test (whichever that is) runs, the other test receives the wrong, cached result. You need to clear the cache; search for cache_clear() in the tests directory for examples.
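
For example, something along these lines (a sketch; the module path for get_logs_task_metadata is assumed from the helper file touched in this PR):

    import pytest

    from airflow.kubernetes.kubernetes_helper_functions import get_logs_task_metadata

    @pytest.fixture(autouse=True)
    def clear_logs_task_metadata_cache():
        # Reset the cached config lookup before and after each test so every
        # case re-reads the patched [kubernetes_executor] logs_task_metadata
        # value instead of reusing whatever the first test cached.
        get_logs_task_metadata.cache_clear()
        yield
        get_logs_task_metadata.cache_clear()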

@amoghrajesh (Contributor Author)

Thank you for that pointer, @uranusjr! Great help.
Pushing a commit with that included.

@amoghrajesh (Contributor Author)

I was able to fix the tests here. @hussein-awala / @uranusjr, can you help with merging this PR?

Thanks!

@hussein-awala hussein-awala merged commit 64b0872 into apache:main May 25, 2023
@hussein-awala hussein-awala added this to the Airflow 2.7.0 milestone May 25, 2023
@hussein-awala hussein-awala added the type:new-feature Changelog: New Features label May 25, 2023
Labels: area:Scheduler, provider:cncf-kubernetes, type:new-feature
Development

Successfully merging this pull request may close these issues.

Add task name, DAG name, try_number, and run_id to all Kubernetes executor logs
6 participants