-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use previous existStatus code instead of default 99108 #11581
Conversation
Alan, I made initial implementation to fix the problem with code assignment but I'm not sure how I can test it. It seems to me that provided unit tests cannot be run through docker unit test approach as it requires condor chirp, here is traceback from docker image used for unit testing:
|
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valentin, with this implementation in, we would be propagating the initial step failure (exit code) to all the subsequent cmsRun steps. While what we actually want is to properly mark each step with their most meaningful exit code (thus 99108 for a step that should not run because the previous one failed), BUT the final job exit code should carry the first non-zero exit code.
Regarding testing it, I fear that we might have to test it with one of our integration templates:
https://github.com/dmwm/WMCore/blob/master/test/data/ReqMgr/requests/DMWM/SC_6Steps_PU.json
meaning, setting up an agent with this patch in and running a test workflow. From my previous execution of this template, it looks like we are supposed to get 1 or 2 job failures.
@amaltaro, then I need to know how to determine if a step is a final step? If this information is not available within a step |
I don't think you need to precisely know that. Looking into the logic of this function: ensuring that what do you think? |
if we implement what you are asking then we will end-up with the following:
instead of current
because cmsRun6,4,5 will not be set and I'm not sure if Anyway, I'll implement the change you propose. |
Jenkins results:
|
@vkuznet yes, that is the correct behavior. Instead of the current exit codes:
we should have:
I don't see where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code seems to be similar to yesterday, thus still propagating the first non-0 exit code to every single cmsRun step.
No, the code is doing what you asked. I removed
|
Oh, I see. Line 302 is supposed to create those remaining cmsRunX exit codes with:
but then won't it overwrite again In addition, that will use the subprocess return status, which might be different than 99108. I think you need to test it to ensure that we have the expected behavior implemented. Specially because this is a potential patch to be applied to the agents as soon as we converge, so we need to be sure that it's doing what it is supposed to. |
you are right about setting |
@amaltaro For the [1] https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L669 |
Jenkins results:
|
@mrceyhun thank you for the confirmation! Honestly, I suspect we do not set @vkuznet Valentin, I suggested in one of my previous comments a setup to test it. But basically, you will need to:
Checking things in MonIT might be trickier. AFAIK only schedds connected to the global pool propagate condor information to MonIT. If that is the case, we need to ensure that the test WMAgent is connected to global pool. @mrceyhun can you please confirm which job information is consumed by the condor job monitoring setup? |
@amaltaro You know the global collectors, for volunteer and ITB, we consume :
We query both queue and history schedds with HTCondor Py API Our history schedd query:
Our queue schedd query:
You can find details in history.py and queue.py in https://github.com/dmwm/cms-htcondor-es/tree/master/src/htcondor_es |
@amaltaro , thanks for details, I think for testing it would be much simpler to use testbed. It is for testing anyway, and WMAgent are connected to it too. Please let me know which agent I can use and I can patch it. Regarding seeing metrics in MONIT, I think it would be much easier if I'll dump the Let me know if I can proceed with this setup, and let me know in this case which agent to use. |
That's a good idea. Feel free to provide an extra commit adding those debug lines. Once you are happy with testing, just remove that commit and force push your branch and it will be all clean. For the agent, feel free to use vocms0193 (team name is: |
Jenkins results:
|
@amaltaro , I removed logging and squashed all commits. Please proceed with review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valentin, even though we tested this behavior, I feel like further changes are required in this PR. For that, I would suggest you to provide an extra commit - in case I/we are mistaken.
For the _setStatus
method, I think the logic should actually be:
if not self.failedPreviousStep:
set Chirp_WMCore_cmsRun_ExitCode
set Chirp_WMCore_%s_ExitCode
else:
set only Chirp_WMCore_%s_ExitCode
with the current code, it will stop providing exit code for further cmsRun steps as soon as it hits a failure. I am inclined to say that every step should have an exit code in monitoring.
Jenkins results:
|
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkuznet please see further review along the code.
Jenkins results:
|
self.setCondorChirpAttrDelayed('Chirp_WMCore_%s_ExitCode' % self.stepName, returnCode) | ||
if returnMessage and returnCode != 0: | ||
self.setCondorChirpAttrDelayed('Chirp_WMCore_%s_Exception_Message' % self.stepName, returnMessage, compress=True) | ||
logging.info("Step %s: Chirp_WMCore_%s_ExitCod %s", self.stepName, self.stepName, returnCode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo is back in both lines. In addition, I think this logging:
logging.info("Step %s: Chirp_WMCore_cmsRun_ExitCode %s", self.stepName, returnCode)
should actually be moved to the next line after if not self.failedPreviousStep:
, between L55-L56.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, I restored the lines w/o fixing the message. Sorry. I refactored code a little bit to avoid its duplication. Please have a look again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkuznet Valentin, thank you for providing these changes. Code looks good to me.
Given that we have made significant changes compared to the code that we tested in WMAgent, would you be able to update the patch on vocms0193 and inject that StepChain template a couple of times once again?
If the test goes successful, we can then squash those commits in a single one (or if you prefer to squash them now, go ahead). |
Jenkins results:
|
ok, here are two new WFs:
Let's wait for their completion. |
@vkuznet Valentin, this seems to be working just as expected now. Here is a grep of the output log:
I can also confirm that WMStats information is correct, in addition to the job Report.pkl file. Please squash all those commits into a single one, remove the relevant GH labels and fire another review request. We will have to backport it to 2.2.0.2_wmagent branch as well. |
Jenkins results:
|
Fixes #11349
Status
not-tested
Description
Extract previously failed step exit status and use it as exit code for
Chirp_WMCore_cmsRunX_ExitCode
metricsIs it backward compatible (if not, which system it affects?)
MAYBE
Related PRs
External dependencies / deployment changes
Testing may require to intentionally run workflow which will fail step chain. It is unclear to me how to test this code though unit tests