
cylc message severity levels #2505

Closed
ColemanTom opened this issue Dec 7, 2017 · 12 comments

Comments

@ColemanTom
Contributor

Hi,

I thought it would be good to follow the standard syslog severity levels. At the moment Cylc appears to allow only NORMAL, WARNING and CRITICAL. The standard syslog levels are: DEBUG, INFO, NOTICE, WARNING, ERR, CRIT, ALERT, EMERG.

See: https://docs.python.org/3/library/syslog.html and https://en.wikipedia.org/wiki/Syslog
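
(For reference, a minimal sketch using Python's stdlib syslog module - Unix-only, and nothing Cylc-specific - listing the eight standard severities and their numeric values:)

```python
# Minimal sketch (assumes a Unix platform; the stdlib syslog module wraps syslog(3)).
# It simply lists the eight standard syslog severities referenced above.
import syslog

SYSLOG_SEVERITIES = [
    ("EMERG", syslog.LOG_EMERG),      # 0 - system is unusable
    ("ALERT", syslog.LOG_ALERT),      # 1 - action must be taken immediately
    ("CRIT", syslog.LOG_CRIT),        # 2 - critical conditions
    ("ERR", syslog.LOG_ERR),          # 3 - error conditions
    ("WARNING", syslog.LOG_WARNING),  # 4 - warning conditions
    ("NOTICE", syslog.LOG_NOTICE),    # 5 - normal but significant
    ("INFO", syslog.LOG_INFO),        # 6 - informational
    ("DEBUG", syslog.LOG_DEBUG),      # 7 - debug-level messages
]

if __name__ == "__main__":
    for name, value in SYSLOG_SEVERITIES:
        print(f"{name:8s} = {value}")
```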

@matthewrmshin
Contributor

Python's logging module doesn't do all these either, the last time I looked. What are we trying to support that requires all these levels?
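
(For reference, a quick stdlib-only sketch of what Python's logging module provides out of the box: five standard levels, plus the ability to register custom level names if really needed. This is plain stdlib behaviour, not Cylc code.)

```python
import logging

logging.basicConfig(level=logging.DEBUG)

# The five standard levels and their numeric values:
for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    print(name, getattr(logging, name))  # 10, 20, 30, 40, 50

# NOTICE, ALERT and EMERG have no built-in equivalents, although custom
# level names can be registered if really needed:
logging.addLevelName(25, "NOTICE")
logging.log(25, "this record is emitted with the level name NOTICE")
```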

@hjoliver
Member

hjoliver commented Dec 7, 2017

Note we have a "CUSTOM" level too, and this functionality overlaps with event-handling somewhat in Cylc.

I'm of two minds about this. I suppose we could allow additional levels that could be used with custom messages in user job scripting, just for logging - user or site-defined meaning. On the other hand, Matt makes a good point.

@matthewrmshin
Contributor

I'm not really against this, but I would be interested to understand the requirements here.

@ColemanTom
Contributor Author

Fair point that I didn't really justify anything. I wrote out all the severity levels, but yes, you are probably correct that it would be overkill. I had looked at the Python stdlib syslog library, rather than the logging library, so I thought all of them were included. I'm not entirely sure how CUSTOM works, so I can't comment on it. I didn't fully explain my reasoning above; basically, I think having a bit more control over the log levels would be useful.

For example, Python's logging module lets you tell a logger to only emit messages at or above a specific level. This would allow people, during development, to put in a bunch of extra log information (e.g. debug-level messages) and have it turned off in operations by configuration, to avoid polluting log files. That allows a smoother transition to operations and, if set up properly, would let people do an edit run after a failure to turn the debug messages back on and help figure the problem out.
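
(A minimal stdlib-only illustration of that behaviour; the logger name "my_task" and the operations/development split are assumptions for the example, not Cylc API:)

```python
import logging

log = logging.getLogger("my_task")  # hypothetical task logger, not a Cylc API
logging.basicConfig(format="%(levelname)s - %(message)s")

# In operations: suppress anything below WARNING.
log.setLevel(logging.WARNING)
log.debug("extra development detail")   # dropped
log.warning("disk nearly full")         # emitted

# During an edit run / debugging session: turn the detail back on.
log.setLevel(logging.DEBUG)
log.debug("extra development detail")   # now emitted
```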

The other aspect of this relates to downstream alerting. I don't know the details exactly, but I do know Cylc is being configured to work with message brokers to deliver messages to alerting and monitoring systems. Granularity in severity levels would provide a direct link to the alerting mechanism to help prioritise resolution (when combined with some priority ranking of the system in its organisational context). Perhaps this is already figured out, though, and I am going down a weird path? But, for example, say a task is running, it is doing some data format conversion on Lustre (NetCDF to/from GRIB2, for example), and you find evidence that the file is corrupted. Perhaps that should be raised at an emergency level requiring immediate escalation to 2nd level support, rather than 1st level trying to triage it, because there is most likely something wrong with one or more of the Lustre OSTs.

tl;dr - the real idea is more along the lines of:

  1. It would be nice to have a couple more levels, such as DEBUG, so that you can have cylc message -p "DEBUG" ... peppered through scripts, while some configuration setting, accessible via the edit run interface, would allow you to turn them off in operations but on when trying to resolve a previously unseen problem.
  2. I imagine that more levels may provide better granularity for triage support and for how 1st level support should act in the case of failures (but this may already be sorted out by however the message broker integration is being done).

Does the above make a bit more sense? Sorry for the brevity / not fully fleshing out my thoughts initially.

@ivorblockley
Contributor

ivorblockley commented Feb 1, 2018

To weigh in on the use-cases:

I think differentiating between error and fatal/critical diagnostic/alerting messages could be useful. For example, it is conceivable for a task to encounter real errors when invoking commands or interfacing with databases etc. Sometimes this might cause the suite's progress to halt (let's call this scenario a fatal error). In other cases the task may have fall-back logic programmed in to work around the errors (e.g. ... if the system-wide open file-handle limit is hit, wait a while and retry on the assumption that the condition is sporadic), and these could be reported as errors (warranting serious and timely investigation) even though they did not cause a critical failure for the task or suite (and hence operations).
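
(To make the distinction concrete, a small hypothetical sketch - plain Python, not Cylc code - where a sporadic failure is worked around and reported as a WARNING, while an unrecoverable one is reported as CRITICAL:)

```python
import logging
import time

log = logging.getLogger("my_task")  # hypothetical task logger
logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")

def open_with_retries(path, attempts=3, delay=5.0):
    """Work around a sporadic condition (e.g. open file-handle limit hit)."""
    for attempt in range(1, attempts + 1):
        try:
            return open(path)
        except OSError as exc:
            if attempt < attempts:
                # Recoverable so far: worth investigating, but not fatal.
                log.warning("open failed (%s), retrying in %ss", exc, delay)
                time.sleep(delay)
            else:
                # Fall-back exhausted: this is now fatal for the task.
                log.critical("open failed after %d attempts: %s", attempts, exc)
                raise
```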

Supporting a debug severity level would also be nice for reasons Tom has mentioned.

There is a discussion about severity levels and what distinguishes them at https://stackoverflow.com/questions/2031163/when-to-use-the-different-log-levels which I found informative. Ultimately I think this issue comes down to determining whether these distinguishing characteristics are useful to downstream applications/customers/operators. I think there is a case for them, although out of the standard syslog severity levels (IETF RFC 5424), I have to admit I don't see a need for ALERT.

A TRACE level to support finer-grained debugging could be useful (this goes beyond the syslog convention).

@matthewrmshin matthewrmshin added this to the later milestone Feb 1, 2018
@matthewrmshin
Contributor

#386 is related.

Annoyingly, Python's logging module maps severity levels from 10 (debug) to 50 (critical) - the number increases with severity level, whereas syslog maps the main severity levels from 7 (debug) to 2 (critical) - the number decreases with severity level.

What we can do... Pick either logging or syslog as a basis. (The former is more likely, given that it is already used in the logic.) Modify cylc message to allow any severity level. If the specified level is recognised, the reporting system will respect the level in the normal way. Otherwise, the level is considered custom - and the reporting system will act according to any custom event handlers (but can probably default to e.g. logging.INFO).
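
(A rough sketch of that idea, assuming logging as the basis - this is an illustration only, not the actual cylc message implementation. The syslog-to-logging mapping is included just to show how the two numbering schemes could be reconciled:)

```python
import logging

# syslog numbers decrease with severity (7=debug ... 2=critical);
# logging numbers increase with severity (10=DEBUG ... 50=CRITICAL).
SYSLOG_TO_LOGGING = {
    7: logging.DEBUG,     # debug
    6: logging.INFO,      # info
    5: logging.INFO,      # notice (no logging equivalent)
    4: logging.WARNING,   # warning
    3: logging.ERROR,     # err
    2: logging.CRITICAL,  # crit
}

def resolve_severity(name: str) -> int:
    """Map a severity string to a logging level, defaulting custom ones to INFO."""
    level = logging.getLevelName(name.upper())
    if isinstance(level, int):
        return level      # recognised level: respect it in the normal way
    return logging.INFO   # custom level: event handlers may still act on the name

# e.g. resolve_severity("WARNING") -> 30, resolve_severity("MY_CUSTOM") -> 20
```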

@matthewrmshin matthewrmshin self-assigned this Feb 21, 2018
@matthewrmshin matthewrmshin modified the milestones: later, next release, soon Feb 21, 2018
@matthewrmshin
Contributor

matthewrmshin commented Feb 22, 2018

#2582 should solve the cylc message part of this issue.

Still need to figure out the following:

  • How to deal with logging levels on the suite side. I think we need to rationalise how we configure logging for the running suite. My normal instinct is to introduce a setting to configure the logging level of the suite (as opposed to having the verbose and debug flags). We should also consider whether we need to duplicate log entries in both log/suite/log and log/suite/err (rationalize use of suite stdout, stderr, and the log #386). Done by Improve logging #2781.
  • A job failure currently has a CRITICAL severity. Should this be an ERROR instead? (And should a job failure be a WARNING for tasks that have retries lined up?) Or perhaps this should be configurable per task? (New runtime config item: "priority"? #2289?)

@hjoliver
Member

hjoliver commented Feb 25, 2018

@matthewrmshin - responding to the previous comments:

  • the original intention for the debug flag was to print Python tracebacks, and otherwise just a simple error message for users who should not be expected to understand Python tracebacks (a simple sketch of that pattern follows below). Not sure that's the best approach though, not least because it may be inconvenient to re-run a failed suite in order to get a traceback. Aside from debug, a multi-level verbosity flag seems sensible to me. Also, I'd be happy to not duplicate suite err messages in the suite log (we don't for job.err after all).
  • this is a tricky one! A job failure is typically critical for the job, but not the suite. Maybe we need two categories of CRITICAL (one for job, one for suite). But as you note, a job failure when there are retries lined up is presumably less critical. I'd prefer not to make it configurable unless we really have to, as I doubt many would resort to that. This might be a good one to discuss in June...
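
(The debug-flag behaviour described in the first bullet above amounts to something like the following hypothetical sketch - an illustration only, not the Cylc code:)

```python
import sys
import traceback

DEBUG = "--debug" in sys.argv  # hypothetical flag, just for the illustration

def main():
    raise ValueError("something went wrong in the suite")

if __name__ == "__main__":
    try:
        main()
    except Exception as exc:
        if DEBUG:
            # Debug mode: print the full Python traceback.
            traceback.print_exc()
        else:
            # Normal mode: just a simple error message for users.
            print(f"ERROR: {exc}", file=sys.stderr)
        sys.exit(1)
```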

@matthewrmshin
Contributor

matthewrmshin commented Dec 17, 2018

With #2582 and #2781, we should now be aligned with Python's logging module.

Things left to do before closing this issue:

  • Agree on the default logging level of a failed job with and without retries lined up.
    • CRITICAL - as now.
    • ERROR, or WARNING if job is expected to fail from time to time (e.g. has follow-on retries, or where failed output is a prerequisite of a downstream task).
  • Fully expose suite logging via configuration. (Requires Python 3 for easy implementation; one possible shape is sketched below.)
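
(One possible shape for "fully expose suite logging via configuration", sketched with the stdlib logging.config.dictConfig; the handler and setting names are illustrative only, not an actual Cylc configuration:)

```python
import logging
import logging.config

# Illustrative only: imagine these values coming from a suite/global config file.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "suite": {"format": "%(asctime)s %(levelname)s - %(message)s"},
    },
    "handlers": {
        "suite_log": {
            "class": "logging.StreamHandler",
            "formatter": "suite",
            "level": "INFO",  # the user-configurable logging level
        },
    },
    "root": {"handlers": ["suite_log"], "level": "INFO"},
}

logging.config.dictConfig(LOGGING_CONFIG)
logging.getLogger(__name__).info("suite logging configured from settings")
```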

@hjoliver hjoliver mentioned this issue Dec 17, 2018
@matthewrmshin matthewrmshin modified the milestones: soon, cylc-8.0.0 Mar 11, 2019
@matthewrmshin
Contributor

Tentatively re-targeting this for Cylc 8. Now that the code is in Python 3, we can implement configurable logging.

@oliver-sanders
Member

#3647

@oliver-sanders
Member

#2582 and #2781 solved the cylc message side of things (which covers the OP).

> Agree on the default logging level of a failed job with and without retries lined up.

This isn't related to cylc message, but is now covered by #3647.

> Fully expose suite logging via configuration.

We have yet to encounter a use case for this; however, with Cylc 8 it is now possible to add your own log handlers via a Cylc configuration plugin. If there's any interest in this, let us know.

@oliver-sanders oliver-sanders removed this from the some-day milestone May 4, 2023
@matthewrmshin matthewrmshin removed their assignment May 4, 2023