Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

heartbeat logmessage fix #31996

Closed
wants to merge 0 commits into from
Closed

Conversation

Bowrna
Copy link
Contributor

@Bowrna Bowrna commented Jun 19, 2023

closes: #31810
relates: #31810


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:Scheduler Scheduler or dag parsing Issues label Jun 19, 2023
@Bowrna Bowrna changed the title heartbeat log fix heartbeat logmessage fix Jun 19, 2023
@Bowrna
Copy link
Contributor Author

Bowrna commented Jun 19, 2023

Some of the suggestions provided in the issue by @hterik


A: Don't log the whole stack trace, instead just log this as an error message.
B: Include a descriptive message of the consequence of this error in the log.
C: When heartbeat recovers, log it.
D: If heartbeat was recovered after scheduler_health_check_threshold, log another more severe error message including descriptive message of the consequences.

the current changes cover A, B and D. I will figure out how I can find if the heartbeat is recovering and finish C part.

@Bowrna Bowrna force-pushed the heartbeat-log-message-fix branch from 0be788d to 33c0427 Compare July 5, 2023 13:05
@Bowrna Bowrna force-pushed the heartbeat-log-message-fix branch from 33c0427 to 98ae62a Compare July 19, 2023 04:13
@Bowrna Bowrna marked this pull request as ready for review July 19, 2023 04:45
@Bowrna
Copy link
Contributor Author

Bowrna commented Jul 19, 2023

@kaxil @ashb @XD-DENG Please review this PR and let me know about your views.

airflow/jobs/job.py Outdated Show resolved Hide resolved
airflow/jobs/job.py Outdated Show resolved Hide resolved
if self.is_alive():
self.log.error("%s heartbeat failed with error. Scheduler may go into unhealthy state", self.__class__.__name__)
else:
self.log.error("%s heartbeat failed with error. Scheduler is in unhealthy state", self.__class__.__name__)
Copy link
Contributor

@hterik hterik Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user of airflow reading the logs of a dag i would not understand what this means to me. Is this something i have to react to? Do i need to contact my admins? Is the dag results corrupted? Should i restart the scheduler?

This error isn't necessarily a problem with the scheduler. More often it is a problem of the executor not being able to reach the database, due to transient network problems. As long as this error is transient and recovers shortly, the consequence of this is usually none. The log message should reflect this. If this is too much to fit into a log-message, linking to the architecture documentation at airflow.apache.org as suggested by potiuk above sounds like a good proposal.

@Bowrna Bowrna force-pushed the heartbeat-log-message-fix branch from 98ae62a to 53e0f12 Compare July 31, 2023 04:35
@Bowrna
Copy link
Contributor Author

Bowrna commented Jul 31, 2023

@potiuk In this PR, I am specifically working on the heartbeat failure when updating the table about the heartbeat timestamp. So its more related to db problems when executor wants to update the table. But with above comments, I think the scope for this issue is expanding. Is my understanding right @potiuk ?

@potiuk
Copy link
Member

potiuk commented Aug 6, 2023

@potiuk In this PR, I am specifically working on the heartbeat failure when updating the table about the heartbeat timestamp. So its more related to db problems when executor wants to update the table. But with above comments, I think the scope for this issue is expanding. Is my understanding right @potiuk ?

That's the proposal - but it's just suggestion, if you want to stay with small change - up to you.

@Bowrna
Copy link
Contributor Author

Bowrna commented Aug 7, 2023

@potiuk thanks for the update. Let me break the proposal into a few more tasks and work on it.

self.log.error(
"%s heartbeat failed with error. Scheduler is in unhealthy state", self.__class__.__name__
)
# self.log.exception("%s heartbeat got an exception", self.__class__.__name__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# self.log.exception("%s heartbeat got an exception", self.__class__.__name__)

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM: with the small NIT of removing the commented out line

@Bowrna
Copy link
Contributor Author

Bowrna commented Sep 5, 2023

thanks @potiuk let me fix it

@Bowrna Bowrna closed this Sep 7, 2023
@Bowrna
Copy link
Contributor Author

Bowrna commented Sep 7, 2023

I wrongly did a git force push with main branch changes. Please excuse. I will clean up and commit the changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:Scheduler Scheduler or dag parsing Issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve log messages of heartbeat connection errors and recovery
4 participants