Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor StepChainParentage thread to resolve by workflow #11694

Merged
merged 3 commits into from
Oct 23, 2023

Conversation

amaltaro
Copy link
Contributor

@amaltaro amaltaro commented Aug 24, 2023

Fixes #11693

Status

ready

Description

Refactor how data is organized in the StepChainParentageFix cherrypy thread, such that it goes through each workflow at a time; instead of aggregating all of the datasets that require parentage fix, dealing with all of them, and only then making the status transition.

With these changes, we have a smaller memory footprint but a larger number of HTTP calls to the DBS server, as multiple workflows could have the same input and/or output dataset names.

In addition, an standalone script is provided to fix the parentage information for any given single workflow. Script is called fixWorkflowParentage.py and by default it will use production ReqMgr2 and DBS Server.

Note-1: spike in memory can still occur, whenever a large workflow needs to have its parentage fixed (e.g.: cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766).
Note-2: the parentage resolution logic has not been changed at all. Only the order of datasets and reqmgr2 updates.
Note-3: the "hack" commit has been removed. Here is its reference though: 1961d25

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 4 warnings
    • 3 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14441/artifact/artifacts/PullRequestReport.html

while listWflowChildren:
wflowData = listWflowChildren.pop()
msg = f'Resolving parentage for workflow {wflowData["workflowName"]} '
msg += f'with a total of {wflowData["childrenDsets"]} datasets. '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line will fail when wflowData will be empty dict, it is better to check wflowData exist before passing it to message line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wflowData content is built in the method above called getChildDatasetsForStepChainMissingParent, through a list iteration. Hence, we guarantee that wflowData exists and has the 2 expected keys.

I did find an issue though with this log record, as I wanted to print the number of datasets only, not their names. I am updating it in a minute.


# now resolve every dataset parentage for each workflow
while listWflowChildren:
wflowData = listWflowChildren.pop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pop method can throw exception if it is empty, e.g.

>>> d=[]
>>> d.pop()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: pop from empty list

And, code here does not check for it. It is better to put it into try/except block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what the while listWflowChildren: is supposed to do. E.g.:

In [1]: l = [1, 2, 3]

In [2]: while l:
   ...:     a = l.pop()
   ...:     print(a)
   ...: 
3
2
1

# measure time taken to fix a workflow parentage
start = time.time()
parentageResolved = True
for childDS in wflowData["childrenDsets"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again there is no check for wflowData dict here, I rather prefer to change this line to

for childDS in wflowData.get('childredDsets', []):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already accounted and there is no need for double default. Please see getChildDatasetsForStepChainMissingParent where it is already set to a default [], in case no datasets exist.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 4 warnings
    • 3 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14444/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor Author

I tested this with a real workflow and it works fine, so I decided to patch the production reqmgr2-tasks service and let it go through a cycle. After this cycle, we will know what is stuck because the service was misbehaving, and what is stuck because we need to have the relevant changes in in DBS server.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 6 warnings
    • 5 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14445/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 6 warnings
    • 5 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14446/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix @amaltaro
It looks good to me, I made a single comment inline, and I have another one to make regarding the PR description. By the current one the reader is left with the impression the main motivation for this change is the tradeoff between memory footprint and DBS calls, and it is not even mentioned that the current logic mitigates the effect of piling up stuck requests in the system.

childrenDsets = []
for _stepNum, dsetDict in info.items():
childrenDsets.extend(dsetDict['ChildDsets'])
wflowChildrenDsets.append({"workflowName": reqName, "childrenDsets": childrenDsets})
Copy link
Contributor

@todor-ivanov todor-ivanov Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know me myself is mix those names a lot: workflowName vs. ReqName. In the original data structure it is named as ReqName, and this change of the key name here may cause confusion later.

This is just a comment, not a request for change, because I am not sure which is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I happen to use them interchangeably. I guess I will just keep it as is to avoid adding a mistake in the code.
I couldn't find the ReqName reference that you mention though.

@amaltaro
Copy link
Contributor Author

Quick update: WM central services have been redeployed with a stable docker image, which means the changes provided in this PR are no longer applied to the StepChainParentageFix cherrypy thread. I don't see any restarts of the thread today, but if I start seeing those, I will re-apply this patch such that we can skip the very large workflow that brings the application down.

@amaltaro
Copy link
Contributor Author

I have just applied this patch again to the production reqmgr2-tasks pod and restarted the service. Timestamp is around Thu Sep 14 11:05:39 UTC 2023.

@@ -0,0 +1,112 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may or may not be relevant, but since we rely on python3 I think it should be python3 instead of python. The later can be pointed to something else depending on node configuration and software setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not wrong, distutils automatically converts that to python3, if it is using a python3 stack to build the service.

from WMCore.Services.RequestDB.RequestDBWriter import RequestDBWriter


REQMGR_URL = 'https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is never good idea to have hardcoded values in the code. I understand that they may not be changed, but it is much better to read them from environment and if env does not provide it setup its default.

data = reqmgrDB.getRequestByNames(wflowName)
for reqName, info in data.items():
childrenDsets = []
for _stepNum, dsetDict in info.get("ChainParentageMap", {}).items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be simplified to

for dsetDict in info.get("ChainParentageMap", {}).values():

but it is a personal choice of style.

bin/adhoc-scripts/fixWorkflowParentage.py Show resolved Hide resolved
sys.exit(1)

wflowName = sys.argv[1]
reqmgrDB = RequestDBWriter(REQMGR_URL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, when I see usage of REQMGR_URL and DBS_URL it is the place where you can decide to read it from env or pass them as input parameters of this script. The passing arguments will be most flexible in my opinion.

@amaltaro
Copy link
Contributor Author

@vkuznet I should have addressed all of your comments now, besides the python shebang which I am not sure it's meaningful.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 2 tests no longer failing
    • 2 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 15 warnings and errors that must be fixed
    • 6 warnings
    • 4 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14492/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 6 warnings
    • 4 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14573/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor Author

Thank you for the review, Valentin and Todor. The only modification I have done after your review was to remove an already planned commit, which was hacking the code to skip a given very large workflow. Initial description has been updated accordingly.

@amaltaro amaltaro merged commit cb1e503 into dmwm:master Oct 23, 2023
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

StepChainParentageFix ReqMgr2 task getting killed
4 participants