Use previous existStatus code instead of default 99108 #11581

Merged
merged 1 commit into from
May 10, 2023

Conversation

vkuznet
Contributor

@vkuznet vkuznet commented May 2, 2023

Fixes #11349

Status

not-tested

Description

Extract the exit status of the previously failed step and use it as the exit code for the Chirp_WMCore_cmsRunX_ExitCode metrics.

Is it backward compatible (if not, which system it affects?)

MAYBE

Related PRs

External dependencies / deployment changes

Testing may require intentionally running a workflow that fails in a step chain. It is unclear to me how to test this code through unit tests.

@vkuznet
Contributor Author

vkuznet commented May 2, 2023

Alan, I made an initial implementation to fix the problem with the exit code assignment, but I'm not sure how to test it. It seems to me that the provided unit tests cannot be run through the docker unit test approach because they require condor chirp. Here is the traceback from the docker image used for unit testing:

[dmwm@a91cfde02d6a WMCore]$ python3 test/python/WMCore_t/WMSpec_t/Steps_t/Executors_t/CMSSW_t.py
/home/dmwm/wmcore_unittest/WMCore/src/python/WMCore/Services/Rucio/Rucio.py:710: DeprecationWarning: invalid escape sequence \c
  def pickRSE(self, rseExpression='rse_type=TAPE\cms_type=test', rseAttribute='ddm_quota', minNeeded=0):
/home/dmwm/wmcore_unittest/WMCore/src/python/WMCore/Storage/TrivialFileCatalog.py:42: DeprecationWarning: invalid escape sequence \?
  _TFCArgSplit = re.compile("\?protocol=")
WARNING:root:condor_chirp was not found in the system.
WARNING:root:condor_chirp was not found in the system.
/home/dmwm/wmcore_unittest/WMCore/src/python/WMCore/Algorithms/ParseXMLFile.py:64: ResourceWarning: unclosed file <_io.BufferedReader name='FrameworkJobReport.xml'>
  expat_parse(open(reportFile, 'rb'),
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.FF
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test/python/WMCore_t/WMSpec_t/Steps_t/Executors_t/CMSSW_t.py", line 204, in testC_ExecuteSegfault
    self.fail("Failure encountered, %s" % str(ex))
AssertionError: Failure encountered, [Errno 2] No such file or directory: '/tmp/tmp7by64cuw/UnitTests/WMSandbox/WMWorkload.pkl'

======================================================================
FAIL: testD_ExecuteNoOutput (__main__.CMSSW_t)
_ExecuteNoOutput_
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/python/WMCore_t/WMSpec_t/Steps_t/Executors_t/CMSSW_t.py", line 223, in testD_ExecuteNoOutput
    executor.initialise(self.step, self.job)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp7by64cuw/UnitTests/WMSandbox/WMWorkload.pkl'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test/python/WMCore_t/WMSpec_t/Steps_t/Executors_t/CMSSW_t.py", line 230, in testD_ExecuteNoOutput
    self.fail("Failure encountered, %s" % str(ex))
AssertionError: Failure encountered, [Errno 2] No such file or directory: '/tmp/tmp7by64cuw/UnitTests/WMSandbox/WMWorkload.pkl'

----------------------------------------------------------------------
Ran 4 tests in 2.904s

FAILED (failures=3)

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14241/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

Valentin, with this implementation in, we would be propagating the initial step failure (exit code) to all the subsequent cmsRun steps. What we actually want is to properly mark each step with its most meaningful exit code (thus 99108 for a step that should not run because the previous one failed), BUT the final job exit code should carry the first non-zero exit code.

Regarding testing it, I fear that we might have to test it with one of our integration templates:
https://github.com/dmwm/WMCore/blob/master/test/data/ReqMgr/requests/DMWM/SC_6Steps_PU.json

meaning, setting up an agent with this patch in and running a test workflow. From my previous execution of this template, it looks like we are supposed to get 1 or 2 job failures.

@vkuznet
Contributor Author

vkuznet commented May 2, 2023

@amaltaro, then I need to know how to determine whether a step is the final step. If this information is not available within the step's CMSSW.py module, then there is no way to do what you're asking for. Otherwise, I need to know which code combines all the steps in order to send them elsewhere.

@amaltaro
Contributor

amaltaro commented May 2, 2023

I don't think you need to precisely know that. Looking into the logic of this function:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py#L52

ensuring that Chirp_WMCore_cmsRun_ExitCode will have the meaningful error code seems simpler than I thought. I guess we can do something like:
"""
if previous step failed; then
    do not set Chirp_WMCore_cmsRun_ExitCode once again (and with the wrong exit code)
else
    keep setting Chirp_WMCore_cmsRun_ExitCode
"""

what do you think?

@vkuznet
Contributor Author

vkuznet commented May 3, 2023

If we implement what you are asking, we will end up with the following:

      "Chirp_WMCore_cmsRun3_ExitCode": 8021,
      "Chirp_WMCore_cmsRun_ExitCode": 8021,
      "CondorExitCode": 85,

instead of the current output:

grep ExitCode 99108.json
      "Chirp_WMCore_cmsRun3_ExitCode": 8021,
      "Chirp_WMCore_cmsRun6_ExitCode": 99108,
      "Chirp_WMCore_cmsRun4_ExitCode": 99108,
      "ExitCode": 99108,
      "Chirp_WMCore_cmsRun5_ExitCode": 99108,
      "Chirp_WMCore_cmsRun_ExitCode": 99108,
      "CondorExitCode": 85,

because cmsRun6, cmsRun4, and cmsRun5 will not be set. I'm also not sure whether ExitCode will show up and, if it does, whether it will have the correct status code value.

Anyway, I'll implement the change you propose.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14242/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro May 3, 2023 12:35
@amaltaro
Contributor

amaltaro commented May 3, 2023

@vkuznet yes, that is the correct behavior. Instead of the current exit codes:

      "Chirp_WMCore_cmsRun3_ExitCode": 8021,
      "Chirp_WMCore_cmsRun4_ExitCode": 99108,
      "Chirp_WMCore_cmsRun5_ExitCode": 99108,
      "Chirp_WMCore_cmsRun6_ExitCode": 99108,
      "ExitCode": 99108,
      "Chirp_WMCore_cmsRun_ExitCode": 99108,
      "CondorExitCode": 85,

we should have:

      "Chirp_WMCore_cmsRun3_ExitCode": 8021,
      "Chirp_WMCore_cmsRun4_ExitCode": 99108,
      "Chirp_WMCore_cmsRun5_ExitCode": 99108,
      "Chirp_WMCore_cmsRun6_ExitCode": 99108,
      "ExitCode": 8021,
      "Chirp_WMCore_cmsRun_ExitCode": 8021,
      "CondorExitCode": 85,

I don't see where ExitCode is defined in WMCore, though. Perhaps it's created in the CMS Job Monitoring script (@mrceyhun)? The CondorExitCode is not clear to me either, but I think we should disregard it as it is out of scope.
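The expected attributes can be illustrated with a small, self-contained sketch. It assumes a job is simply an ordered list of the exit codes each step would return if it ran; the helper `expected_classads` and its arguments are hypothetical and are not part of WMCore. Steps after the first failure are marked with 99108, while the job-wide attribute keeps the first non-zero code.

```python
SKIPPED_STEP_CODE = 99108  # code used in this thread for a step skipped after an earlier failure


def expected_classads(step_codes, first_step=1):
    """Build the chirp attributes expected for a chain of cmsRun steps.

    step_codes: exit codes each step would return if it actually ran, in order.
    (Hypothetical helper for illustration only.)
    """
    ads = {}
    job_code = 0
    for offset, code in enumerate(step_codes):
        attr = "Chirp_WMCore_cmsRun%d_ExitCode" % (first_step + offset)
        if job_code != 0:
            # an earlier step failed, so this step does not run
            ads[attr] = SKIPPED_STEP_CODE
            continue
        ads[attr] = code
        job_code = code
    # the job-wide attribute carries the first non-zero code (or 0 if all passed)
    ads["Chirp_WMCore_cmsRun_ExitCode"] = job_code
    return ads
```

Calling `expected_classads([0, 0, 8021, 0, 0, 0])` gives 8021 for cmsRun3 and for the job-wide attribute, and 99108 for cmsRun4 through cmsRun6.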

Contributor

@amaltaro amaltaro left a comment

Your code seems to be similar to yesterday's, thus still propagating the first non-zero exit code to every single cmsRun step.

@vkuznet
Contributor Author

vkuznet commented May 3, 2023

No, the code is doing what you asked. I removed self._setStatus, and if you look closely at the code flow you'll see the following:

  • lines 108-122 check the previous step, set the proper code, and throw an exception WITHOUT setting the Chirp_WMCore_cmsRunX_ExitCode code
  • further on, lines 289-305 set the proper exit code
  • since lines 108-122 are not used if the previous step did not fail, lines 289-305 will set the proper code

@amaltaro
Contributor

amaltaro commented May 3, 2023

Oh, I see. Line 302 is supposed to create those remaining cmsRunX exit codes with:

                self._setStatus(returnCode, returnMessage)

but then won't it overwrite Chirp_WMCore_cmsRun_ExitCode again with the wrong exit code?

In addition, that will use the subprocess return status, which might be different from 99108.

I think you need to test it to ensure that we have the expected behavior implemented. Especially because this is a potential patch to be applied to the agents as soon as we converge, we need to be sure that it's doing what it is supposed to.

@vkuznet
Contributor Author

vkuznet commented May 3, 2023

You are right about setting Chirp_WMCore_cmsRun_ExitCode in line 302, so I added the necessary protection in the _setStatus function. And, of course, I agree about proper testing. Since unit tests are not an option here, please provide detailed instructions on how to do that in a test agent (or point me to the proper documentation).

@mrceyhun
Contributor

mrceyhun commented May 3, 2023

@amaltaro For the ExitCode, if it is not provided by upstream, yes, we create it [1].
I also don't know about CondorExitCode, which is equal to ExitCode. We probably keep it there for historical reasons.

[1] https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py#L669

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14243/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

amaltaro commented May 3, 2023

@mrceyhun thank you for the confirmation! Honestly, I suspect we do not set ExitCode at all for central production jobs (nor could I find it by grepping the repo). Thank you for providing its definition.

@vkuznet Valentin, I suggested in one of my previous comments a setup to test it. But basically, you will need to:

  1. either use testbed services or deploy central services in your k8s cluster
  2. connect a WMAgent to those services (I can help you with this deployment)
  3. patch WMAgent with this patch
  4. inject this test template: https://github.com/dmwm/WMCore/blob/master/test/data/ReqMgr/requests/DMWM/SC_6Steps_PU.json (which should have a couple of job failures)
  5. and check the classads propagated to MonIT.

Checking things in MonIT might be trickier. AFAIK only schedds connected to the global pool propagate condor information to MonIT. If that is the case, we need to ensure that the test WMAgent is connected to global pool. @mrceyhun can you please confirm which job information is consumed by the condor job monitoring setup?

@mrceyhun
Contributor

mrceyhun commented May 3, 2023

@amaltaro You know the global collectors; for Volunteer and ITB, we consume:

{
    "Volunteer":["vocms0840"], "ITB":["cmsgwms-collector-itb"]
}

We query both the queue and history schedds with the HTCondor Python API: schedd.xquery(requirements=query) and schedd.history(history_query, [], match=-1).

Our history schedd query:

        (JobUniverse == 5) && (CMS_Type != "DONOTMONIT")
        &&
        (
            EnteredCurrentStatus >= %(last_completion)d
            || CRAB_PostJobLastUpdate >= %(last_completion)d
        )

Our queue schedd query:

        (JobUniverse == 5) && (CMS_Type != "DONOTMONIT")
         &&
         (
             JobStatus < 3 || JobStatus > 4
             || EnteredCurrentStatus >= %(completed_since)d
             || CRAB_PostJobLastUpdate >= %(completed_since)d
         )

You can find details in history.py and queue.py in https://github.com/dmwm/cms-htcondor-es/tree/master/src/htcondor_es
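As a rough illustration of how the history query above gets its timestamp filled in, here is a minimal sketch. It only builds the constraint string and deliberately does not import the HTCondor bindings; the names `HISTORY_QUERY` and `build_history_query` are assumptions for this example and may not match what history.py actually uses.

```python
# Constraint quoted in this thread, with the %(last_completion)d placeholder kept as-is.
HISTORY_QUERY = (
    '(JobUniverse == 5) && (CMS_Type != "DONOTMONIT") && '
    '(EnteredCurrentStatus >= %(last_completion)d '
    '|| CRAB_PostJobLastUpdate >= %(last_completion)d)'
)


def build_history_query(last_completion):
    """Fill the timestamp placeholder of the history query (hypothetical helper)."""
    return HISTORY_QUERY % {"last_completion": int(last_completion)}
```

For example, `build_history_query(time.time() - 3600)` (after `import time`) would select jobs that changed state within the last hour; the resulting string is what would be passed as `history_query` to `schedd.history(history_query, [], match=-1)`.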

@vkuznet
Contributor Author

vkuznet commented May 3, 2023

@amaltaro, thanks for the details. I think it would be much simpler to use testbed for testing. It is meant for testing anyway, and WMAgents are connected to it too. Please let me know which agent I can use, and I will patch it.

Regarding seeing metrics in MONIT, I think it would be much easier if I dump self.step into the logger on the agent side, so we can see the settings via the logs.

Let me know if I can proceed with this setup and, in that case, which agent to use.

@amaltaro
Contributor

amaltaro commented May 3, 2023

Regarding seeing metrics in MONIT, I think it would be much easier if I dump self.step into the logger on the agent side, so we can see the settings via the logs.

That's a good idea. Feel free to provide an extra commit adding those debug lines. Once you are happy with testing, just remove that commit and force push your branch and it will be all clean.

For the agent, feel free to use vocms0193 (team name is: testbed-vocms0193). Once you are done with tests, I will redeploy that agent to upgrade to the latest version (it is now 1 version behind, on 2.2.0.2).

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 4 new failures
  • Python3 Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 2 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14244/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested a review from amaltaro May 9, 2023 12:24
@vkuznet
Contributor Author

vkuznet commented May 9, 2023

@amaltaro , I removed logging and squashed all commits. Please proceed with review.

Contributor

@amaltaro amaltaro left a comment

Valentin, even though we tested this behavior, I feel like further changes are required in this PR. For that, I would suggest that you provide an extra commit, in case I/we are mistaken.

For the _setStatus method, I think the logic should actually be:

if not self.failedPreviousStep:
    set Chirp_WMCore_cmsRun_ExitCode
    set Chirp_WMCore_%s_ExitCode
else:
    set only Chirp_WMCore_%s_ExitCode

With the current code, it will stop providing exit codes for further cmsRun steps as soon as it hits a failure. I am inclined to say that every step should have an exit code in monitoring.
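The logic proposed for _setStatus can be sketched as runnable Python. This is only an illustration under stated assumptions: `StepStatusSketch` is a hypothetical stand-in for the CMSSW step executor, and its `setCondorChirpAttrDelayed` merely records values in a dictionary instead of talking to condor_chirp. The attribute names and the `failedPreviousStep` flag follow the discussion above.

```python
import logging


class StepStatusSketch:
    """Hypothetical stand-in for the CMSSW step executor; not the real WMCore class."""

    def __init__(self, stepName, failedPreviousStep=False):
        self.stepName = stepName
        self.failedPreviousStep = failedPreviousStep
        self.chirped = {}  # records attributes instead of calling condor_chirp

    def setCondorChirpAttrDelayed(self, key, value, compress=False):
        # assumption: a simple recorder replaces the real chirp call
        self.chirped[key] = value

    def _setStatus(self, returnCode, returnMessage=None):
        if not self.failedPreviousStep:
            # only a step that actually ran may define the job-wide exit code
            self.setCondorChirpAttrDelayed('Chirp_WMCore_cmsRun_ExitCode', returnCode)
            logging.info("Step %s: Chirp_WMCore_cmsRun_ExitCode %s", self.stepName, returnCode)
        # every step reports its own exit code, whether it ran or was skipped
        self.setCondorChirpAttrDelayed('Chirp_WMCore_%s_ExitCode' % self.stepName, returnCode)
        if returnMessage and returnCode != 0:
            self.setCondorChirpAttrDelayed('Chirp_WMCore_%s_Exception_Message' % self.stepName,
                                           returnMessage, compress=True)
        logging.info("Step %s: Chirp_WMCore_%s_ExitCode %s", self.stepName, self.stepName, returnCode)
```

With this shape, a skipped step such as `StepStatusSketch("cmsRun3", failedPreviousStep=True)` still reports its own 99108 but leaves the job-wide attribute set by the earlier failing step untouched.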

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14245/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14247/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

@vkuznet please see further review along the code.

@vkuznet vkuznet requested a review from amaltaro May 9, 2023 17:33
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 33 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14250/artifact/artifacts/PullRequestReport.html

self.setCondorChirpAttrDelayed('Chirp_WMCore_%s_ExitCode' % self.stepName, returnCode)
if returnMessage and returnCode != 0:
    self.setCondorChirpAttrDelayed('Chirp_WMCore_%s_Exception_Message' % self.stepName, returnMessage, compress=True)
logging.info("Step %s: Chirp_WMCore_%s_ExitCod %s", self.stepName, self.stepName, returnCode)
Contributor

Typo is back in both lines. In addition, I think this logging:

logging.info("Step %s: Chirp_WMCore_cmsRun_ExitCode %s", self.stepName, returnCode)

should actually be moved to the next line after if not self.failedPreviousStep:, between L55-L56.

Contributor Author

ahh, I restored the lines w/o fixing the message. Sorry. I refactored code a little bit to avoid its duplication. Please have a look again.

Contributor

@amaltaro amaltaro left a comment

@vkuznet Valentin, thank you for providing these changes. Code looks good to me.

Given that we have made significant changes compared to the code that we tested in WMAgent, would you be able to update the patch on vocms0193 and inject that StepChain template a couple of times once again?

@amaltaro
Contributor

amaltaro commented May 9, 2023

If the test is successful, we can then squash those commits into a single one (or if you prefer to squash them now, go ahead).

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14252/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented May 9, 2023

ok, here are two new WFs:

- wmagent_SC_6Steps_PU_Test_issue_11349_v6_230509_200845_2794
- wmagent_SC_6Steps_PU_Test_issue_11349_v7_230509_200936_4048

Let's wait for their completion.

@amaltaro
Contributor

@vkuznet Valentin, this seems to be working just as expected now. Here is a grep of the output log:

cmst1@vocms0193:/data/srv/wmagent/current $ grep Chirp_WMCore Job_4278/wmagentJob.log
2023-05-09 21:27:22,059:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun_ExitCode 0
2023-05-09 21:27:22,066:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun1_ExitCode 0
2023-05-09 22:36:07,721:INFO:CMSSW:Step cmsRun2: Chirp_WMCore_cmsRun_ExitCode 8016
2023-05-09 22:36:07,736:INFO:CMSSW:Step cmsRun2: Chirp_WMCore_cmsRun2_ExitCode 8016
2023-05-09 22:36:07,871:INFO:CMSSW:Step cmsRun3: Chirp_WMCore_cmsRun3_ExitCode 99108
2023-05-09 22:36:07,968:INFO:CMSSW:Step cmsRun4: Chirp_WMCore_cmsRun4_ExitCode 99108
2023-05-09 22:36:08,000:INFO:CMSSW:Step cmsRun5: Chirp_WMCore_cmsRun5_ExitCode 99108
2023-05-09 22:36:08,038:INFO:CMSSW:Step cmsRun6: Chirp_WMCore_cmsRun6_ExitCode 99108

I can also confirm that WMStats information is correct, in addition to the job Report.pkl file.
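A check like the grep above could also be done mechanically. The following is a minimal sketch, assuming the log lines keep the `Step <name>: <attribute> <code>` shape shown in this thread; `parse_chirp_log` and `job_code_is_consistent` are hypothetical helpers, not WMCore functions. Since a chirp attribute is overwritten each time it is set, the last logged value for an attribute is taken as its final value.

```python
import re

JOB_ATTR = "Chirp_WMCore_cmsRun_ExitCode"
LINE_RE = re.compile(r"Step \w+: (Chirp_WMCore_\w+_ExitCode) (\d+)")

# A subset of the log lines quoted in this thread.
SAMPLE_LOG = [
    "2023-05-09 21:27:22,059:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun_ExitCode 0",
    "2023-05-09 21:27:22,066:INFO:CMSSW:Step cmsRun1: Chirp_WMCore_cmsRun1_ExitCode 0",
    "2023-05-09 22:36:07,721:INFO:CMSSW:Step cmsRun2: Chirp_WMCore_cmsRun_ExitCode 8016",
    "2023-05-09 22:36:07,736:INFO:CMSSW:Step cmsRun2: Chirp_WMCore_cmsRun2_ExitCode 8016",
    "2023-05-09 22:36:07,871:INFO:CMSSW:Step cmsRun3: Chirp_WMCore_cmsRun3_ExitCode 99108",
]


def parse_chirp_log(lines):
    """Map each chirp attribute to the last exit code logged for it."""
    codes = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            codes[match.group(1)] = int(match.group(2))
    return codes


def job_code_is_consistent(codes):
    """True if the job-wide code equals the first non-zero per-step code (or 0)."""
    # dicts keep insertion order, so per-step codes stay in execution order
    step_codes = [code for attr, code in codes.items() if attr != JOB_ATTR]
    first_failure = next((code for code in step_codes if code != 0), 0)
    return codes.get(JOB_ATTR) == first_failure
```

On the sample lines, the job-wide attribute resolves to 8016, matching the first failing step (cmsRun2), while cmsRun3 carries 99108.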

Please squash all those commits into a single one, remove the relevant GH labels and fire another review request.

We will have to backport it to 2.2.0.2_wmagent branch as well.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14254/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

Here is the backport PR:
#11589

I am cutting a 2.2.0.6 on top of that and also patching all the agents in production with #11589

Development

Successfully merging this pull request may close these issues.

Exit code reporting of stepchain job failures to MONIT requires improvement
4 participants