New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
flux-jobs: support new "inactive_reason" output field and "endreason" format #5055
flux-jobs: support new "inactive_reason" output field and "endreason" format #5055
Conversation
183bb7a
to
4ab2a8a
Compare
hmmmm, my attempt to be clever doesn't like the
dunno if there is a workaround for this? Couldn't figure one out via the project's website. Guess we could just go with the fallthrough one and just add a TODO to come back to this when Python 3.8 is the minimum required version. |
4ab2a8a
to
5b2247d
Compare
That seems most reasonable. BTW, from what I've gathered, the more Pythonic way to do this (for future reference) is with a import signal
def strsignal(signum):
if signum == signal.SIGINT:
return "Interrupt"
elif signum == signal.SIGKILL:
return "Killed"
elif signum == signal.SIGUSR1:
return "User defined signal 1"
elif signum == signal.SIGSEGV:
return "Segmentation Fault"
elif signum == signal.SIGUSR2:
return "User defined signal 2"
elif signum == signal.SIGTERM:
return "Terminated"
else:
return f"Signal {signum}"
try:
result = signal.strsignal(15)
except AttributeError:
result = strsignal(15)
print(result) Which is also better since it encapsulates the compatibility code into a function. In fact, if we had many of these functions we could place them all into a module try:
from signal import strsignal
except ImportError:
from flux.compat36 import strsignal Then once we no longer need to support python3.6, the whole module could be removed. |
5b2247d
to
4d93134
Compare
4d93134
to
b690a14
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, just got around to reviewing this one.
Main suggested change is to factor a long if block into its own function, which can then be separately tested.
src/bindings/python/flux/job/info.py
Outdated
# Python 3.8 | ||
# try: | ||
# sigdesc = signal.strsignal(signum) | ||
# except ValueError: | ||
# sigdesc = f"Signaled {signum}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code might be better moved to its own function. Then that function can check for signal.strsignal
and use it if available, otherwise the function uses the code below (raising ValueError
for invalid signum
). Then the code here can be replaced with the suggested block in this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this would also allow some unit testing of this function under t/python
to clear all the reports of uncovered lines below.
src/cmd/flux-jobs.py
Outdated
"description": "Show why each job ended", | ||
"format": ( | ||
"{id.f58:>12} ?:{queue:<8.8} {username:<8.8} {name:<10.10+} " | ||
"{status_abbrev:>2.2} {t_inactive!D:>19h} {inactive_reason}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest {t_inactive!d:%b%d %R::<12}
since it is shorter and more human readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I think the reason I went with !D
may have been because we (I don't think) support the h
hyphen output with !d
. So I sort of dislike this output for the running jobs.
JOBID USER NAME ST T_INACTIVE INACTIVE-REASON
f7qUkw7d achu sleep R Dec31 16:00
f7qRnxYx achu sleep R Dec31 16:00
f7qRnxYw achu sleep R Dec31 16:00
f7qQJyGb achu sleep R Dec31 16:00
f7gRwKMZ achu sleep F May10 15:10 User defined signal 1
f7XHBkU7 achu sleep F May10 15:10 Terminated
f7GqYkFH achu sleep CA May10 15:10 Canceled: foobar
f72GVocj achu sleep CA May10 15:10 Canceled
f6tkdSmM achu nosuchcom+ F May10 15:10 command not found
f6mBo7MH achu false F May10 15:10 Exit 1
f6dU4rF9 achu true CD May10 15:10 Exit 0
Maybe we can tweak/handle h
in UtilDatetime
.
But along the way, I noticed this:
if conv == "d":
# convert from float seconds since epoch to a datetime.
# User can than use datetime specific format fields, e.g.
# {t_inactive!d:%H:%M:%S}.
try:
value = UtilDatetime.fromtimestamp(value)
except TypeError:
if orig_value == "":
value = UtilDatetime.fromtimestamp(0.0)
else:
raise
elif conv == "D":
# As above, but convert to ISO 8601 date time string.
try:
value = datetime.fromtimestamp(value).strftime("%FT%T")
except TypeError:
if orig_value == "":
value = ""
else:
raise
I don't know why !d
converts to epoch time with empty string and with !D
does not. It is documented that way in the manpage. That seems inconsistent. And I think empty string and 0 mean two different things. For example, t_remaining
being empty string might mean the job isn't running vs 0 means there's 0 time remaining. Similarly t_inactive
being empty string may mean the job isn't done.
I will note that in JobInfo
we do default:
# Default values for job properties.
defaults = {
"t_depend": 0.0,
"t_run": 0.0,
"t_cleanup": 0.0,
"t_inactive": 0.0,
I think it has been this way forever, but maybe shouldn't be. Wondering if we could default that to ""
and that means output "" vs epoch.
I will note that I'm not sure what fallout there could be with the json output option if we adjusted things.
Will ponder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why !d converts to epoch time with empty string and with !D does not. It is documented that way in the manpage. That seems inconsistent.
One must be a Python datetime object and the other is a string. You cant return an empty string as a datetime object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I think the reason I went with !D may have been because we (I don't think) support the h hyphen output with !d. So I sort of dislike this output for the running jobs.
Well, this is a bit unsatisfying, but this allows UtilDatetime
class to use h
. It requires subverting h
handling in format_items
becuase the UtilDatetime
formatting is done after this function, then adds h
suffix handling specifically to the format()
method of the class.
We maybe should also have datetime.fromtimestamp(0.)
format to an empty string which would help here:
diff --git a/src/bindings/python/flux/util.py b/src/bindings/python/flux/util.py
index befc950b4..87deb55b6 100644
--- a/src/bindings/python/flux/util.py
+++ b/src/bindings/python/flux/util.py
@@ -312,16 +312,21 @@ class UtilDatetime(datetime):
def __format__(self, fmt):
# The string "::" is used to split the strftime() format from
# any Python format spec:
- vals = fmt.split("::", 1)
+ timefmt, *spec = fmt.split("::", 1)
# Call strftime() to get the formatted datetime as a string
- result = self.strftime(vals[0])
+ result = self.strftime(timefmt)
+
+ spec = spec[0] if spec else ""
+ if spec.endswith("h"):
+ if self == datetime.fromtimestamp(0.):
+ result = "-"
+ spec = spec[:-1] + "s"
# If there was a format spec, apply it here:
- try:
- return f"{{0:{vals[1]}}}".format(result)
- except IndexError:
- return result
+ if spec:
+ return f"{{0:{spec}}}".format(result)
+ return result
def fsd(secs):
@@ -421,7 +426,7 @@ class UtilFormatter(Formatter):
denote_truncation = True
spec = spec[:-1]
- if spec.endswith("h"):
+ if spec.endswith("h") and not isinstance(value, UtilDatetime):
localepoch = datetime.fromtimestamp(0.0).strftime("%FT%T")
basecases = ("", "0s", "0.0", "0:00:00", "1970-01-01T00:00:00", localepoch)
value = "-" if str(value) in basecases else str(value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is acceptable I can propose a small PR. It would be more satisfying to handle the h
in one place, but due to the way the formatting percolates through all the layers here (which frankly is warping my hippocampus), I don't think this is possible..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh your post is very timely, I was looking into how to add h
and fumbling and just about to message you if you had an idea on how to proceed.
+ if spec.endswith("h") and not isinstance(value, UtilDatetime):
This was the part I was struggling to get around, didn't think to just skip it.
PR sounds good. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the changes to the UtilDatetime
class to support the h
suffix, what do you think about formatting all "zero" UtilDatetime
objects as an empty string in the class' format()
function @chu11?
Seems like this would be consistent with !D
as you note, and would avoid the requirement to use h
when you don't want to display Dec31 ...
for unset timestamps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this would be consistent with !D as you note, and would avoid the requirement to use h when you don't want to display Dec31 ... for unset timestamps.
That sounds like a good idea!
b690a14
to
2996d18
Compare
2996d18
to
bf7bcba
Compare
re-pushed just to get initial review on fixes from comments from above + one fixup for something I randomly found. Changed to WIP since this depends on PR fix for
Edit: Ugh, vermin check failed trying to use |
3514c7a
to
a382822
Compare
now that #5169 is merged, rebased and re-pushed and removed WIP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple comments up for further discussion.
src/bindings/python/flux/util.py
Outdated
def empty_outputs(self): | ||
localepoch = datetime.fromtimestamp(0.0).strftime("%FT%T") | ||
empty = ("", "0s", "0.0", "0:00:00", "1970-01-01T00:00:00", localepoch) | ||
return empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't seem to use self
so maybe it can be defined outside of this class. This will allow the unnecessary instantiation of a UtilFormatter()
object to be dropped in fliter_empty()
below.
src/bindings/python/flux/job/info.py
Outdated
# strsignal() is only available in Python 3.8 and up. | ||
# flux-core's minimum is 3.6. Use compat library | ||
from flux.compat36 import strsignal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised we didn't get a lint warning here. I thought it was required or strongly suggested to have all imports at the top of the file.
Anyway, if you move this to the top of the file, then you can first try to import strsignal
and only use the compat code if that fails, e.g.
try:
from signal import strsignal
except ImportError:
from flux.compat36 import strsignal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, fails vermin check
!2, 3.8 /home/runner/work/flux-core/flux-core/src/bindings/python/flux/job/info.py
'signal.strsignal' member requires !2, 3.8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going into vermin's code, it seems like this is just a hard require
"signal.strsignal": (None, (3, 8)),
BUT I cannot generate this error message on a local install of vermin with 1.5.1, wondering if this could be worked around with a newer version of vermin. playing around.
Edit: Ok, I think I must have accidentally installed both python 2.7 and python 3.6 versions of vermin, thus getting weird stuff. Now I see the right errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well that's a bummer. Vermin makes it so you can't support multiple versions of Python? Do we really need the vermin checks? Our CI already runs all tests with multiple versions of Python.
Though maybe doing the ImportError thing is considered bad practice by Python folks? I'm having trouble forming an opinion now. Sorry for putting you through that.
398f373
to
793f992
Compare
rebased and re-pushed. I went ahead and threw in some As a complete aside, this was also caught from newer version vermin check. So maybe could use
|
Problem: Several tests in t3203-instance-recovery.t fail on occasion when doing large parallel runs of the testsuite. Increase the test timeouts from 15 seconds to 120 seconds.
Problem: The strsignal() function would be useful to use, but is only available in Python 3.8 and newer. Flux's minimum version of Python is only 3.6. Solution: Add a new compat36.py utility file that supports our own simple implementation of strsignal() when it is not available.
Problem: There is no coverage of the new python compat36 library. Add some coverage for compat36 and the strsignal() function.
Problem: The list of "empty things" used by the filter_empty() function in OutputFormat is missing several "empty things" that are used by the "h" field converter. Solution: Add a utility function to return list of commonly known "empty outputs" and use that list in all appropriate locations.
Problem: There are no tests to ensure the job-list module returns job cancellation messages correctly. Solution: Add some tests to t2260-job-list.t. Add a job cancel message to job cancelation in t2800-jobs-cmd.t and update tests.
Problem: There is no coverage for terminated / signaled jobs in the job-list / flux-jobs tests. Solution: Add some additional coverage by adding a terminated job via SIGTERM to the tests in t2260-job-list.t and t2800-jobs-cmd.t.
Problem: There is no coverage for user exceptions in the job-list / flux-jobs tests. Solution: Add some additional coverage by adding a job ended with a user exception in t2260-job-list.t and t2800-jobs-cmd.t.
Problem: There are a number of fields in flux-jobs that are available to help users get details about how or why their job exited / finished / ended. However, it is scattered across several fields. It would be convenient if there were just one field that collated that information. Solution: Support a new "inactive_reason" output field. It will output if a job was canceled, signaled, timedout, or exited normally. Other contextual information will be output when available. Fixes flux-framework#4946
Problem: There is no coverage for the new flux-jobs inactive_reason output field. Solution: Add some coverage in t2800-jobs-cmd.t.
Problem: The new inactive_reason field is not documented in flux-jobs(1). Add information about the new inactive_reason field.
Problem: Now that the inactive_reason output field is available, it would be nice if there were a convenient format to output it and other pertinent information related to it. Solution: Add a new "endreason" output format.
793f992
to
8f704e2
Compare
Codecov Report
@@ Coverage Diff @@
## master #5055 +/- ##
===========================================
+ Coverage 55.43% 83.13% +27.70%
===========================================
Files 413 454 +41
Lines 71220 77961 +6741
===========================================
+ Hits 39478 64816 +25338
+ Misses 31742 13145 -18597
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
thanks, setting MWP |
Per discussion in #4946. This adds a new field
{inactive_reason}
to get why a job is inactive and a new named format influx-jobs
to help the user get the reason why their job is inactive.After some trial and error, came up with the following defined format for
flux-jobs
, which I call "endreason". Open to suggestions for an alternate name. Its the best I could come up with :-)example
the
ST
field is still there just so users can see those are running jobs and thus don't get just an emptyINACTIVE-REASON
without any context. I initially wanted to includeT_SUBMIT
but I couldn't really fit it in and make the format pretty enough for my tastes :-) Also, I didn't like two hyphens next to each other forT_INACTIVE
andINACTIVE-REASON
, so there's just one ... eh?? :-)One debate I had on the output from an exec error, I elect to handle it as a special case, outputting the special bash exit code message of "command not found" (exit 126) vs the actual output that might come out from the exception, "task 0 (host foobar): start failed: nosuchcommand: No such file or directory". Could tweak that.
This PR is built on top of #5031