Removed ParentageResolved check, logged ParentageResolved false #11805

Contributor

@d-ylee d-ylee commented Nov 29, 2023

Fixes #11725

Status

In development

Description

Removed the ParentageResolved checks. A message is now logged when ParentageResolved is false during processing.

Is it backward compatible (if not, which system it affects?)

MAYBE

Related PRs

N/A

External dependencies / deployment changes

N/A

@d-ylee d-ylee requested a review from amaltaro November 29, 2023 22:24
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
  • Python3 Pylint check: succeeded
    • 6 warnings
    • 47 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14677/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

@d-ylee Dennis, these changes look good to me. But I left a comment along the code for your consideration.

In addition, I am inclined to say that we should not archive announced workflows unless their parentage has been resolved. Otherwise we have no way to spot such issues in the future. That means, we should likely add to the archival pipeline an if statement checking for the status and wflow['ParentageResolved'] information (or any other alternative implementation).

@todor-ivanov what do you think?

msg += " Will retry again in the next cycle."
self.logger.info(msg)
self._checkStatusAdvanceExpired(wflow, additionalInfo=msg)
# elif wflow['RequestStatus'] == 'announced' and not wflow['ParentageResolved']:
Contributor

I would suggest completely removing these lines.

Comment on lines 331 to 332
if wflow['RequestStatus'] == 'announced' and not wflow['ParentageResolved']:
    msg = f"Adding workflow: {wflow['RequestName']} - 'ParentageResolved' flag set to false."
Contributor

I do not see the point of those two lines here.

Contributor

Yes, maybe moving it down to the archival method would be good enough. I think it's important to log that a workflow can't go (completely) through the component because its parentage has not been resolved.

Contributor

@todor-ivanov todor-ivanov Dec 1, 2023

@amaltaro, @d-ylee, this msg variable is never actually logged here. If it is left as it is and not sent to the logger, it will be overwritten once the cycle below is entered. Two more comments of mine:

  • We are not "Adding" this workflow here or anywhere else. We are just "not skipping" it in the _dispatch step, and therefore in the cleanup process. So the message is misleading.
  • And yes, if we are going to log the fact that a workflow's cleanup is not skipped but the workflow itself is not going to be archived, the proper place for such a message is in the archive method, in the code snippet I was pointing to in my general comment. This way we avoid duplicated checks, and we also benefit from the fact that once the workflow is kept in the system for longer than the archival threshold, this log message turns into a proper alarm. These few lines below should do the job:
        if not (wflow['ParentageResolved'] or wflow['ForceArchive']):
            msg = "Not properly cleaned workflow: %s - 'ParentageResolved' flag set to false." % wflow['RequestName']
            self.logger.info(msg)
            if self._checkStatusAdvanceExpired(wflow, additionalInfo=msg):
                self.alertStatusAdvanceExpired(wflow)
            raise MSRuleCleanerArchivalSkip(msg)
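
The first point above, a message that is assigned but overwritten before it is ever logged, can be reproduced in isolation. The following is only a minimal sketch: `dispatch`, the pipeline names passed to it, and the `example_wf` workflow are hypothetical stand-ins, not the actual MSRuleCleaner code.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("dispatch_sketch")


def dispatch(wflow, pipeline_names):
    """Hypothetical stand-in for a dispatch step; returns the messages it logged."""
    logged = []
    if wflow["RequestStatus"] == "announced" and not wflow["ParentageResolved"]:
        # Assigned, but never passed to the logger ...
        msg = f"Adding workflow: {wflow['RequestName']} - 'ParentageResolved' flag set to false."
    for name in pipeline_names:
        # ... so the first iteration silently replaces it.
        msg = f"Running pipeline {name} for workflow {wflow['RequestName']}"
        logger.info(msg)
        logged.append(msg)
    return logged


wflow = {"RequestName": "example_wf", "RequestStatus": "announced", "ParentageResolved": False}
logged = dispatch(wflow, ["plineAgentBlock", "plineAgentCont"])
```

None of the logged messages mention ParentageResolved, which is why the check is better placed where the message is both logged and acted upon.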

@todor-ivanov
Contributor

hi @amaltaro
about:

In addition, I am inclined to say that we should not archive announced workflows unless their parentage has been resolved. Otherwise we have no way to spot such issues in the future. That means, we should likely add to the archival pipeline an if statement checking for the status and wflow['ParentageResolved'] information (or any other alternative implementation).

This is quite simple. The check needs to follow the exact same logic as this one:

if not (wflow['IsClean'] or wflow['ForceArchive']):
    msg = "Not properly cleaned workflow: %s" % wflow['RequestName']
    if self._checkStatusAdvanceExpired(wflow, additionalInfo=msg):
        self.alertStatusAdvanceExpired(wflow)
    raise MSRuleCleanerArchivalSkip(msg)
and needs to go right before it.
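
To make the ordering concrete, here is a minimal standalone sketch of the two checks placed back to back. It assumes a plain dict for the workflow and uses a hypothetical `ArchivalSkip` exception in place of `MSRuleCleanerArchivalSkip`; the expiry check and alerting from the snippets above are deliberately left out, so this is not the component's actual archive method.

```python
class ArchivalSkip(Exception):
    """Hypothetical stand-in for MSRuleCleanerArchivalSkip."""


def check_archivable(wflow):
    """Raise ArchivalSkip if the workflow should stay in the system."""
    name = wflow["RequestName"]
    # Parentage check goes first, as suggested in the thread.
    if not (wflow["ParentageResolved"] or wflow["ForceArchive"]):
        raise ArchivalSkip(
            f"Not properly cleaned workflow: {name} - 'ParentageResolved' flag set to false."
        )
    # The cleanup check follows, with the same ForceArchive override.
    if not (wflow["IsClean"] or wflow["ForceArchive"]):
        raise ArchivalSkip(f"Not properly cleaned workflow: {name}")


def skip_reason(wflow):
    """Return the skip message, or None if the workflow may be archived."""
    try:
        check_archivable(wflow)
    except ArchivalSkip as exc:
        return str(exc)
    return None


base = {"RequestName": "example_wf", "IsClean": True,
        "ForceArchive": False, "ParentageResolved": False}
print(skip_reason(base))
print(skip_reason({**base, "ForceArchive": True}))
```

In this sketch `ForceArchive` bypasses both checks, mirroring the `or wflow['ForceArchive']` condition in the two snippets.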

Contributor

@todor-ivanov todor-ivanov left a comment

@d-ylee I left one comment inline, which you may want to take a look at.

@amaltaro
Contributor

amaltaro commented Dec 1, 2023

@d-ylee Dennis, I just remembered that we have some unit tests relying on workflow dumps, for example in the following lines:
https://github.com/dmwm/WMCore/blob/master/test/python/WMCore_t/MicroService_t/MSRuleCleaner_t/MSRuleCleaner_t.py#L73-L75

One option would be for you to adjust one of those dumps (maybe the StepChain one) and see how it behaves in the component with your changes.

@d-ylee d-ylee force-pushed the feature_11725_remove_prod_transferer_rules_once_satisfied branch from 71d64ec to e0a3657 Compare December 1, 2023 22:36
@d-ylee
Contributor Author

d-ylee commented Dec 1, 2023

I made the changes suggested above. @amaltaro It looks like in the StepChain dump, ParentageResolved was already set to false. I also do not see self.stepChainReq being used in any tests of MSRuleCleaner_t.py, unless I am missing something. Would it be better if I use the existing file and write a test that uses it?

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 6 warnings
    • 47 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14683/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

amaltaro commented Dec 1, 2023

@d-ylee we could either adapt the unit tests to use self.stepChainReq or create a new unit test for a new stepchain dump. Or anything along these lines that you see as a meaningful test.

@d-ylee
Contributor Author

d-ylee commented Dec 4, 2023

@amaltaro I ran the test with the existing self.stepChainReq, and the resulting assertion failed:

AssertionError: {'RequestName': 'StepChain_Tasks_HG2011_Val[1862 chars]se."} != {'CleanupStatus': {'plineAgentBlock': True,[1861 chars]se."}
  {'CleanupStatus': {'plineAgentBlock': True, 'plineAgentCont': True},
   'ForceArchive': False,
   'IncludeParents': False,
   'InputDataset': None,
   'IsArchivalDelayExpired': True,
   'IsClean': True,
   'IsLogDBClean': True,
   'OutputDatasets': ['/DYJetsToLL_Pt-50To100_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/Integ_TestStep1-GENSIM_StepChain_Tasks_HG2011_Val_Todor_v1-v20/GEN-SIM',
                      '/DYJetsToLL_Pt-50To100_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/Integ_TestStep1-GENSIM_StepChain_Tasks_HG2011_Val_Todor_v1-v20/LHE',
                      '/DYJetsToLL_Pt-50To100_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/Integ_TestStep2-DIGI_StepChain_Tasks_HG2011_Val_Todor_v1-v20/GEN-SIM-RAW',
                      '/DYJetsToLL_Pt-50To100_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/Integ_TestStep3-RECO_StepChain_Tasks_HG2011_Val_Todor_v1-v20/AODSIM'],
   'ParentDataset': [],
   'ParentageResolved': False,
   'PlineMarkers': ['plineArchive',
                    'plineAgentBlock',
                    'plineAgentCont',
                    'plineArchive'],
   'RequestName': 'StepChain_Tasks_HG2011_Val_201029_112731_6371',
   'RequestStatus': 'aborted-completed',
   'RequestTransition': [{'DN': '', 'Status': 'new', 'UpdateTime': 1603967251},
                         {'DN': '',
                          'Status': 'assignment-approved',
                          'UpdateTime': 1603967253},
                         {'DN': '',
                          'Status': 'assigned',
                          'UpdateTime': 1603967254},
                         {'DN': '',
                          'Status': 'aborted',
                          'UpdateTime': 1604931587},
                         {'DN': '',
                          'Status': 'aborted-completed',
                          'UpdateTime': 1604931737}],
   'RequestType': 'StepChain',
   'RulesToClean': {'plineAgentBlock': [], 'plineAgentCont': []},
   'StatusAdvanceExpiredMsg': 'Not properly cleaned workflow: '
                              'StepChain_Tasks_HG2011_Val_201029_112731_6371 - '
                              "'ParentageResolved' flag set to false.\n"
                              'Not properly cleaned workflow: '
                              'StepChain_Tasks_HG2011_Val_201029_112731_6371\n'
                              'Not properly cleaned workflow: '
                              'StepChain_Tasks_HG2011_Val_201029_112731_6371 - '
                              "'ParentageResolved' flag set to false.",
   'SubRequestType': '',
   'TapeRulesStatus': [],
-  'TargetStatus': 'aborted-archived',
?                   ^^  ^^^

+  'TargetStatus': 'normal-archived',
?                   ^  ^^^

   'TransferDone': False,
   'TransferTape': False}

I did modify the Request*, InputDataset, and OutputDatasets fields to match the StepChainRequestDump.json value. I also modified the StatusAdvanceExpiredMsg to match the output.

Is this what we are expecting with the changes?

@amaltaro
Contributor

amaltaro commented Dec 4, 2023

The json dump has this request status:
https://github.com/dmwm/WMCore/blob/master/test/data/ReqMgr/requests/Static/StepChainRequestDump.json#L102

and if the unit tests are resulting in a target status of normal-archived, something is very wrong with the MSRuleCleaner implementation. It is likely something that has always been in the code, so it's worth investigating.

@d-ylee
Contributor Author

d-ylee commented Dec 4, 2023

@amaltaro What status should we be expecting?

@amaltaro
Contributor

amaltaro commented Dec 4, 2023

The one from your diff, this:

-  'TargetStatus': 'aborted-archived',

given that that workflow dump has RequestStatus=aborted-completed, the next allowed state transition would be aborted-archived, as it was supposed to be in this diagram:
https://github.com/dmwm/WMCore/blob/master/doc/wmcore/RequestStateTransition.png

We need to update the diagram so that it shows: aborted -> aborted-completed -> aborted-archived. To be done at some point.
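
As a rough illustration of the transition described here, the expected archival target can be written as a lookup on the current status. This is only a sketch: the `ARCHIVE_TARGET` table and `target_status` helper are hypothetical, only the aborted-completed entry is taken from this discussion, and the announced entry is an assumption added for contrast.

```python
# Hypothetical lookup table, not the WMCore state machine.
ARCHIVE_TARGET = {
    "aborted-completed": "aborted-archived",  # stated in this thread
    "announced": "normal-archived",           # assumption, added for contrast
}


def target_status(request_status):
    """Return the assumed archival target, or None if the status is not listed."""
    return ARCHIVE_TARGET.get(request_status)


print(target_status("aborted-completed"))
```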

@d-ylee
Contributor Author

d-ylee commented Dec 4, 2023

@amaltaro The test was expecting normal-archived in the expected workflow, while aborted-archived was the result after running the workflow. After changing TargetStatus in the expected workflow dict to aborted-archived, the test passes.
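
The direction of the diff in the assertion output can be checked with the standard library alone. When a dict comparison fails, unittest builds its message with `difflib.ndiff` over the first and then the second argument, so `-` lines come from the first argument and `+` lines from the second. The sketch below assumes the test passed the component's result first and the expected dict second; the two small dicts and the `as_lines` helper are made up for illustration.

```python
import difflib


def as_lines(d):
    """Render a dict one key per line, sorted by key (hypothetical helper)."""
    return [f"{key!r}: {value!r}" for key, value in sorted(d.items())]


# Result produced by the code under test (first argument).
actual = {"RequestStatus": "aborted-completed", "TargetStatus": "aborted-archived"}
# Expectation hard-coded in the test (second argument).
expected = {"RequestStatus": "aborted-completed", "TargetStatus": "normal-archived"}

diff = list(difflib.ndiff(as_lines(actual), as_lines(expected)))
print("\n".join(diff))
```

Read this way, the `-  'TargetStatus': 'aborted-archived'` line is what the component produced, and the `+` line is the stale expectation in the test.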

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 6 warnings
    • 63 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14687/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

amaltaro commented Dec 5, 2023

Cool, thanks Dennis. Is it ready for another review? If so, please request it through the Reviewers option at the top right side.

Contributor

@todor-ivanov todor-ivanov left a comment

looks good to me

Contributor

@amaltaro amaltaro left a comment

Thanks Dennis, it looks good to me!

@amaltaro amaltaro merged commit 63292d6 into dmwm:master Dec 5, 2023
3 of 4 checks passed
@d-ylee d-ylee deleted the feature_11725_remove_prod_transferer_rules_once_satisfied branch December 5, 2023 21:05

Successfully merging this pull request may close these issues.

Remove wma_prod and wmcore_transferor rules once output rules are satisfied
4 participants