flux-jobs: support new "inactive_reason" output field and "endreason" format #5055

Merged

Conversation

chu11
Member

@chu11 chu11 commented Apr 1, 2023

Per discussion in #4946. This adds a new field {inactive_reason} to get why a job is inactive and a new named format in flux-jobs to help the user get the reason why their job is inactive.

After some trial and error, I came up with the following defined format for flux-jobs, which I call "endreason". Open to suggestions for an alternate name. It's the best I could come up with :-)

+                "{id.f58:>12} ?:{queue:<8.8} {username:<8.8} {name:<10.10+} "
+                "{status_abbrev:>2.2} {t_inactive!D:>19h} {inactive_reason}"

example

       JOBID USER     NAME       ST          T_INACTIVE INACTIVE-REASON
    ƒCWZ9cgX chu11    sleep       R                   - 
    ƒBwTFejy chu11    sleep       R                   - 
    ƒJw1KBu1 chu11    nosuchcom+  F 2023-03-31T12:45:43 command not found
    ƒGf7P5mh chu11    true       CD 2023-03-31T12:45:38 Exit 0
    ƒF7xGGpw chu11    false       F 2023-03-31T12:45:35 Exit 1
    ƒ3hsP1TD chu11    sleep      CA 2023-03-31T12:45:25 Canceled: foobar
    ƒ4VrUjRZ chu11    sleep      CA 2023-03-31T12:45:15 Canceled

the ST field is still there just so users can see those are running jobs and thus don't get just an empty INACTIVE-REASON without any context. I initially wanted to include T_SUBMIT but I couldn't really fit it in and make the format pretty enough for my tastes :-) Also, I didn't like two hyphens next to each other for T_INACTIVE and INACTIVE-REASON, so there's just one ... eh?? :-)
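For readers unfamiliar with the trailing + in {name:<10.10+}: judging from the nosuchcom+ row in the example, it appears to flag values that were cut off by the precision. Here is a minimal Python sketch of that behavior. TruncFormatter is a hypothetical class written for illustration, not the formatter flux-jobs actually uses.

```python
from string import Formatter


class TruncFormatter(Formatter):
    """Hypothetical formatter: a trailing "+" in the spec marks truncation."""

    def format_field(self, value, format_spec):
        denote_truncation = format_spec.endswith("+")
        if denote_truncation:
            # Strip the non-standard suffix before handing off to Python.
            format_spec = format_spec[:-1]
        result = super().format_field(value, format_spec)
        # If the formatted text is shorter than the input, it was truncated.
        if denote_truncation and len(result) < len(str(value)):
            result = result[:-1] + "+"
        return result


fmt = TruncFormatter()
print(fmt.format("{name:<10.10+}|", name="nosuchcommand"))
print(fmt.format("{name:<10.10+}|", name="sleep"))
```

Short names are only padded, so the marker shows up solely on values that lost characters.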

One thing I debated was the output for an exec error. I elected to handle it as a special case, outputting bash's "command not found" message (exit code 127) rather than the actual text that might come out of the exception, "task 0 (host foobar): start failed: nosuchcommand: No such file or directory". Could tweak that.
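To make the mapping in the example table concrete, here is a rough Python sketch of how such a reason string could be derived. Everything in it is an assumption for illustration: the function name, the argument names, the use of None for a job that is not yet inactive, and the shell conventions of 127 for "command not found" and 128 + signum for a signaled task. It is not the code in this PR.

```python
import signal

# Shell convention: 127 means the command could not be found.
COMMAND_NOT_FOUND = 127

# Hypothetical, deliberately tiny table of signal descriptions.
SIGNAL_DESCRIPTIONS = {
    int(signal.SIGINT): "Interrupt",
    int(signal.SIGTERM): "Terminated",
}


def inactive_reason(result, returncode=0, note=""):
    """Return a short, human readable reason for why a job ended.

    result is None for a job that is still active (assumption).
    """
    if result is None:
        return ""
    if result == "CANCELED":
        return f"Canceled: {note}" if note else "Canceled"
    if returncode == COMMAND_NOT_FOUND:
        return "command not found"
    if returncode > 128:
        # Assumed encoding: 128 + signal number, as shells report it.
        signum = returncode - 128
        return SIGNAL_DESCRIPTIONS.get(signum, f"Signaled {signum}")
    return f"Exit {returncode}"


print(inactive_reason("COMPLETED", 0))
print(inactive_reason("CANCELED", note="foobar"))
print(inactive_reason("FAILED", 127))
```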

This PR is built on top of #5031

@chu11 chu11 force-pushed the issue4946_flux_jobs_inactive_reason branch from 183bb7a to 4ab2a8a Compare April 1, 2023 00:54
@chu11
Member Author

chu11 commented Apr 1, 2023

hmmmm, my attempt to be clever doesn't like the vermin python version check.

                # Unfortunately strsignal() is only available in
                # Python 3.8 and up.  flux-core's minimum is 3.6.
                # If strsignal() is not available, we hand check for
                # several common signals before outputting something
                # generic.

                if "strsignal" in dir(signal):
                    try:
                        sigdesc = signal.strsignal(signum)
                    except ValueError:
                        sigdesc = f"Signaled {signum}"
                else:
                    if signum == signal.SIGINT:
                        sigdesc = "Interrupt"
                    elif signum == signal.SIGKILL:
                        sigdesc = "Killed"
                    elif signum == signal.SIGUSR1:
                        sigdesc = "User defined signal 1"
                    elif signum == signal.SIGSEGV:
                        sigdesc = "Segmentation Fault"
                    elif signum == signal.SIGUSR2:
                        sigdesc = "User defined signal 2"
                    elif signum == signal.SIGTERM:
                        sigdesc = "Terminated"
                    else:
                        sigdesc = f"Signaled {signum}"
                return sigdesc

dunno if there is a workaround for this? Couldn't figure one out via the project's website. Guess we could just go with the fallthrough one and just add a TODO to come back to this when Python 3.8 is the minimum required version.

@chu11 chu11 force-pushed the issue4946_flux_jobs_inactive_reason branch from 4ab2a8a to 5b2247d Compare April 1, 2023 13:26
@grondo
Contributor

grondo commented Apr 1, 2023

> Guess we could just go with the fallthrough one and just add a TODO to come back to this when Python 3.8 is the minimum required version.

That seems most reasonable. BTW, from what I've gathered, the more Pythonic way to do this (for future reference) is with a try/except block, rather than if "strsignal" in dir(signal). e.g.:

import signal
  
def strsignal(signum):
    if signum == signal.SIGINT:
        return "Interrupt"
    elif signum == signal.SIGKILL:
        return "Killed"
    elif signum == signal.SIGUSR1:
        return "User defined signal 1"
    elif signum == signal.SIGSEGV:
        return "Segmentation Fault"
    elif signum == signal.SIGUSR2:
        return "User defined signal 2"
    elif signum == signal.SIGTERM:
        return "Terminated"
    else:
        return f"Signal {signum}"

try:
    result = signal.strsignal(15)
except AttributeError:
    result = strsignal(15)

print(result)

Which is also better since it encapsulates the compatibility code into a function.

In fact, if we had many of these functions we could place them all into a module flux.compat36, and then do something like:

try:
    from signal import strsignal
except ImportError:
    from flux.compat36 import strsignal

Then once we no longer need to support Python 3.6, the whole module could be removed.

Contributor

@grondo grondo left a comment

Sorry, just got around to reviewing this one.

Main suggested change is to factor a long if block into its own function, which can then be separately tested.

Comment on lines 583 to 587
# Python 3.8
# try:
#     sigdesc = signal.strsignal(signum)
# except ValueError:
#     sigdesc = f"Signaled {signum}"
Contributor

This code might be better moved to its own function. Then that function can check for signal.strsignal and use it if available, otherwise the function uses the code below (raising ValueError for invalid signum). Then the code here can be replaced with the suggested block in this comment.

Contributor

Note: this would also allow some unit testing of this function under t/python to clear all the reports of uncovered lines below.
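As a rough idea of what such a unit test might look like, here is a self-contained Python sketch. The fallback function, its two-entry table, and the test class name are all hypothetical stand-ins, and the real tests under t/python may use a different harness. The point is only that a factored-out function is easy to exercise directly, including the ValueError path.

```python
import signal
import unittest

# Hypothetical fallback table; a real one would cover many more signals.
_DESCRIPTIONS = {
    signal.SIGINT: "Interrupt",
    signal.SIGTERM: "Terminated",
}


def fallback_strsignal(signum):
    """Describe a signal without relying on signal.strsignal()."""
    if not 1 <= signum < signal.NSIG:
        raise ValueError("signal number out of range")
    return _DESCRIPTIONS.get(signum, f"Signal {int(signum)}")


class TestFallbackStrsignal(unittest.TestCase):
    def test_known_signal(self):
        self.assertEqual(fallback_strsignal(signal.SIGTERM), "Terminated")

    def test_unlisted_signal_is_generic(self):
        expected = f"Signal {int(signal.SIGABRT)}"
        self.assertEqual(fallback_strsignal(signal.SIGABRT), expected)

    def test_invalid_signal_raises(self):
        with self.assertRaises(ValueError):
            fallback_strsignal(0)
```

Saved as a module, this can be run with python -m unittest.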

"description": "Show why each job ended",
"format": (
"{id.f58:>12} ?:{queue:<8.8} {username:<8.8} {name:<10.10+} "
"{status_abbrev:>2.2} {t_inactive!D:>19h} {inactive_reason}"
Contributor

Suggest {t_inactive!d:%b%d %R::<12} since it is shorter and more human readable.

Member Author

Hmmm, I think the reason I went with !D may have been because I don't think we support the h hyphen output with !d. So I sort of dislike this output for the running jobs.

       JOBID USER     NAME       ST   T_INACTIVE INACTIVE-REASON
    f7qUkw7d achu     sleep       R  Dec31 16:00 
    f7qRnxYx achu     sleep       R  Dec31 16:00 
    f7qRnxYw achu     sleep       R  Dec31 16:00 
    f7qQJyGb achu     sleep       R  Dec31 16:00 
    f7gRwKMZ achu     sleep       F  May10 15:10 User defined signal 1
    f7XHBkU7 achu     sleep       F  May10 15:10 Terminated
    f7GqYkFH achu     sleep      CA  May10 15:10 Canceled: foobar
    f72GVocj achu     sleep      CA  May10 15:10 Canceled
    f6tkdSmM achu     nosuchcom+  F  May10 15:10 command not found
    f6mBo7MH achu     false       F  May10 15:10 Exit 1
    f6dU4rF9 achu     true       CD  May10 15:10 Exit 0

Maybe we can tweak/handle h in UtilDatetime.

But along the way, I noticed this:

        if conv == "d":
            # convert from float seconds since epoch to a datetime.
            # User can then use datetime specific format fields, e.g.
            # {t_inactive!d:%H:%M:%S}.
            try:
                value = UtilDatetime.fromtimestamp(value)
            except TypeError:
                if orig_value == "":
                    value = UtilDatetime.fromtimestamp(0.0)
                else:
                    raise
        elif conv == "D":
            # As above, but convert to ISO 8601 date time string.
            try:
                value = datetime.fromtimestamp(value).strftime("%FT%T")
            except TypeError:
                if orig_value == "":
                    value = ""
                else:
                    raise

I don't know why !d converts to epoch time with empty string and with !D does not. It is documented that way in the manpage. That seems inconsistent. And I think empty string and 0 mean two different things. For example, t_remaining being empty string might mean the job isn't running vs 0 means there's 0 time remaining. Similarly t_inactive being empty string may mean the job isn't done.

I will note that in JobInfo we do default:

    #  Default values for job properties.
    defaults = {
        "t_depend": 0.0,
        "t_run": 0.0,
        "t_cleanup": 0.0,
        "t_inactive": 0.0,

I think it has been this way forever, but maybe shouldn't be. Wondering if we could default that to "" and that means output "" vs epoch.

I will note that I'm not sure what fallout there could be with the json output option if we adjusted things.

Will ponder.

Contributor

> I don't know why !d converts to epoch time with empty string and with !D does not. It is documented that way in the manpage. That seems inconsistent.

One must be a Python datetime object and the other is a string. You can't return an empty string as a datetime object.
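A quick way to see the constraint in plain Python: a strftime-style spec such as %H:%M:%S is only meaningful to a datetime, so whatever !d returns has to be one. The helper name below is made up for this illustration.

```python
from datetime import datetime


def accepts_strftime_spec(value, spec="%H:%M:%S"):
    """Return True if format(value, spec) works with a strftime-style spec."""
    try:
        format(value, spec)
    except ValueError:
        return False
    return True


epoch = datetime.fromtimestamp(0.0)
print(accepts_strftime_spec(epoch))  # datetime passes the spec to strftime()
print(accepts_strftime_spec(""))     # str treats it as an invalid format spec
```

This is why an empty string cannot simply be passed through for !d the way it is for !D.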

Contributor

> Hmmm, I think the reason I went with !D may have been because I don't think we support the h hyphen output with !d. So I sort of dislike this output for the running jobs.

Well, this is a bit unsatisfying, but the patch below allows the UtilDatetime class to use h. It requires subverting h handling in format_items because the UtilDatetime formatting is done after this function, and then adds h suffix handling specifically to the __format__() method of the class.

We maybe should also have datetime.fromtimestamp(0.) format to an empty string which would help here:

diff --git a/src/bindings/python/flux/util.py b/src/bindings/python/flux/util.py
index befc950b4..87deb55b6 100644
--- a/src/bindings/python/flux/util.py
+++ b/src/bindings/python/flux/util.py
@@ -312,16 +312,21 @@ class UtilDatetime(datetime):
     def __format__(self, fmt):
         # The string "::" is used to split the strftime() format from
         # any Python format spec:
-        vals = fmt.split("::", 1)
+        timefmt, *spec = fmt.split("::", 1)
 
         # Call strftime() to get the formatted datetime as a string
-        result = self.strftime(vals[0])
+        result = self.strftime(timefmt)
+
+        spec = spec[0] if spec else ""
+        if spec.endswith("h"):
+            if self == datetime.fromtimestamp(0.):
+                result = "-"
+            spec = spec[:-1] + "s"
 
         # If there was a format spec, apply it here:
-        try:
-            return f"{{0:{vals[1]}}}".format(result)
-        except IndexError:
-            return result
+        if spec:
+            return f"{{0:{spec}}}".format(result)
+        return result
 
 
 def fsd(secs):
@@ -421,7 +426,7 @@ class UtilFormatter(Formatter):
             denote_truncation = True
             spec = spec[:-1]
 
-        if spec.endswith("h"):
+        if spec.endswith("h") and not isinstance(value, UtilDatetime):
             localepoch = datetime.fromtimestamp(0.0).strftime("%FT%T")
             basecases = ("", "0s", "0.0", "0:00:00", "1970-01-01T00:00:00", localepoch)
             value = "-" if str(value) in basecases else str(value)
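For anyone who wants to try the idea outside of flux, here is a standalone Python sketch that mirrors the __format__ logic in the diff. HyphenDatetime is a hypothetical stand-in for UtilDatetime, and the sketch uses %H:%M rather than %R only to stay portable across platforms.

```python
from datetime import datetime


class HyphenDatetime(datetime):
    """Hypothetical stand-in for UtilDatetime with "h" suffix support."""

    def __format__(self, fmt):
        # "::" separates the strftime() format from a Python format spec.
        timefmt, *spec = fmt.split("::", 1)
        result = self.strftime(timefmt)

        spec = spec[0] if spec else ""
        if spec.endswith("h"):
            # An unset (epoch) timestamp is shown as a hyphen.
            if self == datetime.fromtimestamp(0.0):
                result = "-"
            # "h" is not a valid str presentation type, so swap in "s".
            spec = spec[:-1] + "s"

        if spec:
            return format(result, spec)
        return result


unset = HyphenDatetime.fromtimestamp(0.0)
ended = HyphenDatetime(2023, 5, 11, 7, 21)
print(f"[{unset:%H:%M::>8h}]")
print(f"[{ended:%H:%M::>8h}]")
```

Without the h suffix the epoch value is formatted like any other timestamp, which is the behavior the running-job rows were showing.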

Contributor

@grondo grondo May 11, 2023

If this is acceptable I can propose a small PR. It would be more satisfying to handle the h in one place, but due to the way the formatting percolates through all the layers here (which frankly is warping my hippocampus), I don't think this is possible.

Member Author

Ahhh your post is very timely, I was looking into how to add h and fumbling and just about to message you if you had an idea on how to proceed.

+ if spec.endswith("h") and not isinstance(value, UtilDatetime):

This was the part I was struggling to get around, didn't think to just skip it.

PR sounds good. Thanks.

Contributor

In addition to the changes to the UtilDatetime class to support the h suffix, what do you think about formatting all "zero" UtilDatetime objects as an empty string in the class's __format__() method @chu11?

Seems like this would be consistent with !D as you note, and would avoid the requirement to use h when you don't want to display Dec31 ... for unset timestamps.

Member Author

> Seems like this would be consistent with !D as you note, and would avoid the requirement to use h when you don't want to display Dec31 ... for unset timestamps.

That sounds like a good idea!

@chu11 chu11 force-pushed the issue4946_flux_jobs_inactive_reason branch from b690a14 to 2996d18 Compare May 11, 2023 17:06
@chu11 chu11 changed the title flux-jobs: support new "inactive_reason" output field and "endreason" format WIP: flux-jobs: support new "inactive_reason" output field and "endreason" format May 11, 2023
@chu11 chu11 force-pushed the issue4946_flux_jobs_inactive_reason branch from 2996d18 to bf7bcba Compare May 11, 2023 17:13
@chu11
Member Author

chu11 commented May 11, 2023

re-pushed just to get initial review on fixes from comments from above + one fixup for something I randomly found. Changed to WIP since this depends on PR fix for h converter per comment above.

  • noticed that the checks for "empty things" are different between h and filter_empty(), so added a fix for that.

  • added a new compat36 lib for Python 3.6 strsignal(). I ended up adding signal -> string for all signals 1 through 29. I didn't go above that because I questioned portability of all those real time signals in the 30-40 range. No evidence of lack of portability, just a gut feeling. Also, oddly, one signal was only defined in Python 3.11, go figure.

  • Edit: Oh yeah, I decided to right align instead of left align t_inactive ... I like it better that way :-) But I will submit to majority vote.

    f2t6W9kj achu     sleep       R            - 
    f2t52AUP achu     sleep       R            - 
    f2t3YBC3 achu     sleep       R            - 
    f2szaCdM achu     sleep       R            - 
    f2sy6DM1 achu     sleep       R            - 
    f2iqpdju achu     sleep       F  May11 07:21 User defined signal 1
    f2Zwtwfu achu     sleep       F  May11 07:21 Terminated
    f2KZDv1m achu     sleep      CA  May11 07:21 Canceled: foobar
    f257aunw achu     sleep      CA  May11 07:21 Canceled
     fw7Ynb1 achu     nosuchcom+  F  May11 07:21 command not found
     fodAR1y achu     false       F  May11 07:21 Exit 1
     fg1N73D achu     true       CD  May11 07:21 Exit 0

Edit: Ugh, vermin check failed trying to use signal.strsignal(), dunno if we're going to be able to use it even if it is available. I noticed that signal.SIGSTKFLT is possible to be used. Wondering if vermin has some hard coded checks in there.

@chu11 chu11 force-pushed the issue4946_flux_jobs_inactive_reason branch 2 times, most recently from 3514c7a to a382822 Compare May 12, 2023 00:06
@chu11 chu11 changed the title WIP: flux-jobs: support new "inactive_reason" output field and "endreason" format lux-jobs: support new "inactive_reason" output field and "endreason" format May 12, 2023
@chu11 chu11 changed the title lux-jobs: support new "inactive_reason" output field and "endreason" format flux-jobs: support new "inactive_reason" output field and "endreason" format May 12, 2023
@chu11
Member Author

chu11 commented May 12, 2023

now that #5169 is merged, rebased and re-pushed and removed WIP

Contributor

@grondo grondo left a comment

Just a couple comments up for further discussion.

Comment on lines 429 to 432
def empty_outputs(self):
    localepoch = datetime.fromtimestamp(0.0).strftime("%FT%T")
    empty = ("", "0s", "0.0", "0:00:00", "1970-01-01T00:00:00", localepoch)
    return empty
Contributor

This function doesn't seem to use self, so maybe it can be defined outside of this class. This will allow the unnecessary instantiation of a UtilFormatter() object to be dropped in filter_empty() below.
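A sketch of what the module-level version could look like, in Python. The hyphenate() helper is hypothetical and only there to show a caller that no longer needs a formatter instance. The portable %Y-%m-%dT%H:%M:%S is used here in place of %FT%T, on the assumption that the two produce the same string on platforms that support the latter.

```python
from datetime import datetime


def empty_outputs():
    """Return the strings that are treated as "empty" output values."""
    localepoch = datetime.fromtimestamp(0.0).strftime("%Y-%m-%dT%H:%M:%S")
    return ("", "0s", "0.0", "0:00:00", "1970-01-01T00:00:00", localepoch)


def hyphenate(value):
    """Hypothetical caller: show "-" in place of any empty value."""
    text = str(value)
    return "-" if text in empty_outputs() else text


print(hyphenate("0s"))
print(hyphenate("1.5h"))
```

Both the h handling and filter_empty() could then share the one tuple without either of them constructing an object just to reach it.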

Comment on lines 575 to 577
# strsignal() is only available in Python 3.8 and up.
# flux-core's minimum is 3.6. Use compat library
from flux.compat36 import strsignal
Contributor

I'm surprised we didn't get a lint warning here. I thought it was required or strongly suggested to have all imports at the top of the file.

Anyway, if you move this to the top of the file, then you can first try to import strsignal and only use the compat code if that fails, e.g.

try:
    from signal import strsignal
except ImportError:
    from flux.compat36 import strsignal

Member Author

@chu11 chu11 May 16, 2023

yeah, fails vermin check

 !2, 3.8      /home/runner/work/flux-core/flux-core/src/bindings/python/flux/job/info.py
  'signal.strsignal' member requires !2, 3.8

Member Author

@chu11 chu11 May 16, 2023

going into vermin's code, it seems like this is just a hard require

    "signal.strsignal": (None, (3, 8)),

BUT I cannot generate this error message on a local install of vermin with 1.5.1, wondering if this could be worked around with a newer version of vermin. playing around.

Edit: Ok, I think I must have accidentally installed both python 2.7 and python 3.6 versions of vermin, thus getting weird stuff. Now I see the right errors.

Contributor

well that's a bummer. Vermin makes it so you can't support multiple versions of Python? Do we really need the vermin checks? Our CI already runs all tests with multiple versions of Python.

Though maybe doing the ImportError thing is considered bad practice by Python folks? I'm having trouble forming an opinion now. Sorry for putting you through that.

@chu11 chu11 force-pushed the issue4946_flux_jobs_inactive_reason branch 2 times, most recently from 398f373 to 793f992 Compare May 16, 2023 17:06
@chu11
Member Author

chu11 commented May 16, 2023

rebased and re-pushed. I went ahead and threw in some # novermin comments to skip these issues in vermin. I was also "preemptive", putting # novermin in for something that isn't caught by vermin 1.4.2 (the version in our workflow) but is caught by version 1.5.1. Also threw in an extra commit for a test timeout I see once in a while.

As a complete aside, this was also caught from newer version vermin check. So maybe could use # novermin too. I'll update the vermin version in a different PR.

try:                                                                                                                                        
    import tomllib                                                                                                                          
except ModuleNotFoundError:                                                                                                                 
    from flux.utils import tomli as tomllib             
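For illustration, this is roughly how a # novermin exclusion could sit on the strsignal import discussed earlier. Treat it as a sketch: the comment placement reflects my understanding of vermin's exclusion comments and should be checked against the vermin documentation, and the inline fallback is a hypothetical stand-in for the flux.compat36 version so the snippet runs on its own.

```python
try:
    # Ask vermin to skip this line; strsignal() needs Python 3.8+.
    from signal import strsignal  # novermin
except ImportError:

    def strsignal(signum):
        """Hypothetical stand-in for the flux.compat36 fallback."""
        return f"Signal {signum}"


print(strsignal(15))
```

The same pattern would presumably apply to the tomllib import above.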

chu11 added 5 commits May 16, 2023 12:26
Problem: Several tests in t3203-instance-recovery.t fail on occasion
when doing large parallel runs of the testsuite.

Increase the test timeouts from 15 seconds to 120 seconds.
Problem: The strsignal() function would be useful to use, but is
only available in Python 3.8 and newer.  Flux's minimum version of
Python is only 3.6.

Solution: Add a new compat36.py utility file that supports our own
simple implementation of strsignal() when it is not available.
Problem: There is no coverage of the new python compat36 library.

Add some coverage for compat36 and the strsignal() function.
Problem: The list of "empty things" used by the filter_empty() function
in OutputFormat is missing several "empty things" that are used by
the "h" field converter.

Solution: Add a utility function to return list of commonly known
"empty outputs" and use that list in all appropriate locations.
Problem: There are no tests to ensure the job-list module returns
job cancellation messages correctly.

Solution: Add some tests to t2260-job-list.t.  Add a job cancel message
to job cancelation in t2800-jobs-cmd.t and update tests.
chu11 added 6 commits May 16, 2023 12:26
Problem: There is no coverage for terminated / signaled jobs in
the job-list / flux-jobs tests.

Solution: Add some additional coverage by adding a terminated
job via SIGTERM to the tests in t2260-job-list.t and t2800-jobs-cmd.t.
Problem: There is no coverage for user exceptions in the job-list / flux-jobs
tests.

Solution: Add some additional coverage by adding a job ended with a user
exception in t2260-job-list.t and t2800-jobs-cmd.t.
Problem: There are a number of fields in flux-jobs that are available
to help users get details about how or why their job exited / finished / ended.
However, it is scattered across several fields.  It would be convenient
if there were just one field that collated that information.

Solution: Support a new "inactive_reason" output field.  It will output if a
job was canceled, signaled, timedout, or exited normally.  Other contextual
information will be output when available.

Fixes flux-framework#4946
Problem: There is no coverage for the new flux-jobs inactive_reason
output field.

Solution: Add some coverage in t2800-jobs-cmd.t.
Problem: The new inactive_reason field is not documented in flux-jobs(1).

Add information about the new inactive_reason field.
Problem: Now that the inactive_reason output field is available, it
would be nice if there were a convenient format to output it and
other pertinent information related to it.

Solution: Add a new "endreason" output format.
@chu11 chu11 force-pushed the issue4946_flux_jobs_inactive_reason branch from 793f992 to 8f704e2 Compare May 16, 2023 19:26
@codecov

codecov bot commented May 16, 2023

Codecov Report

Merging #5055 (398f373) into master (3b9b6e3) will increase coverage by 27.70%.
The diff coverage is 89.81%.

❗ Current head 398f373 differs from pull request most recent head 8f704e2. Consider uploading reports for the commit 8f704e2 to get more accurate results

@@             Coverage Diff             @@
##           master    #5055       +/-   ##
===========================================
+ Coverage   55.43%   83.13%   +27.70%     
===========================================
  Files         413      454       +41     
  Lines       71220    77961     +6741     
===========================================
+ Hits        39478    64816    +25338     
+ Misses      31742    13145    -18597     
Impacted Files Coverage Δ
src/cmd/flux-jobs.py 96.50% <ø> (ø)
src/bindings/python/flux/job/info.py 92.85% <81.08%> (+69.56%) ⬆️
src/bindings/python/flux/compat36.py 93.84% <93.84%> (ø)
src/bindings/python/flux/util.py 95.23% <100.00%> (+70.37%) ⬆️

... and 376 files with indirect coverage changes

Contributor

@grondo grondo left a comment

LGTM!

@chu11
Member Author

chu11 commented May 16, 2023

thanks, setting MWP

@mergify mergify bot merged commit c0e977c into flux-framework:master May 16, 2023
30 of 32 checks passed
@chu11 chu11 deleted the issue4946_flux_jobs_inactive_reason branch May 16, 2023 21:02