Refactor StepChainParentage thread to resolve by workflow #11694

amaltaro · 2023-08-24T18:24:16Z

Status

ready

Description

Refactor how data is organized in the StepChainParentageFix cherrypy thread so that it processes one workflow at a time, instead of aggregating all of the datasets that require a parentage fix, dealing with all of them, and only then making the status transition.

With these changes, we have a smaller memory footprint but a larger number of HTTP calls to the DBS server, as multiple workflows could have the same input and/or output dataset names.
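
The per-workflow flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual thread code: `fix_parentage` and `mark_resolved` are hypothetical callables standing in for the DBS parentage fix and the ReqMgr2 status update, and the real thread may handle failures differently.

```python
def resolve_by_workflow(workflows, fix_parentage, mark_resolved):
    """Resolve parentage one workflow at a time.

    workflows: list of {"workflowName": str, "childrenDsets": list of str}
    fix_parentage: hypothetical callable(dataset) -> bool (True on success)
    mark_resolved: hypothetical callable(workflow_name) -> None
    """
    resolved = []
    # pop() drops each workflow's data from the list once it is being handled
    while workflows:
        wflow = workflows.pop()
        ok = True
        for dset in wflow["childrenDsets"]:
            # a dataset shared by several workflows is fixed once per workflow,
            # which is where the extra HTTP calls come from
            if not fix_parentage(dset):
                ok = False
        if ok:
            # status transition happens per workflow, not at the very end
            mark_resolved(wflow["workflowName"])
            resolved.append(wflow["workflowName"])
    return resolved
```

With two workflows sharing one dataset, the shared dataset is passed to `fix_parentage` twice, once for each workflow.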

In addition, a standalone script is provided to fix the parentage information for any given single workflow. The script is called fixWorkflowParentage.py and by default it uses the production ReqMgr2 and DBS servers.

Note-1: a spike in memory can still occur whenever a large workflow needs to have its parentage fixed (e.g.: cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766).
Note-2: the parentage resolution logic has not been changed at all. Only the order of datasets and reqmgr2 updates.
Note-3: the "hack" commit has been removed. Here is its reference though: 1961d25

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

cmsdmwmbot · 2023-08-24T18:40:41Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 changes in unstable tests
Python3 Pylint check: failed
- 5 warnings and errors that must be fixed
- 4 warnings
- 3 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14441/artifact/artifacts/PullRequestReport.html

vkuznet · 2023-08-24T20:12:13Z

src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py

+            while listWflowChildren:
+                wflowData = listWflowChildren.pop()
+                msg = f'Resolving parentage for workflow {wflowData["workflowName"]} '
+                msg += f'with a total of {wflowData["childrenDsets"]} datasets. '


This line will fail if wflowData is an empty dict; it is better to check that wflowData exists before building the message.

The wflowData content is built in the method above called getChildDatasetsForStepChainMissingParent, through a list iteration. Hence, we guarantee that wflowData exists and has the 2 expected keys.

I did find an issue though with this log record, as I wanted to print the number of datasets only, not their names. I am updating it in a minute.
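
A minimal sketch of the log message as intended, using `len()` so that only the dataset count is logged. The helper name and the sample data are made up for illustration; the real code builds the message inline.

```python
def parentage_log_message(wflowData):
    """Build the log message with the dataset count rather than the names."""
    msg = f'Resolving parentage for workflow {wflowData["workflowName"]} '
    msg += f'with a total of {len(wflowData["childrenDsets"])} datasets.'
    return msg


# hypothetical sample data, not a real workflow
sample = {"workflowName": "example_workflow",
          "childrenDsets": ["/A/B/AOD", "/A/B/MINIAOD"]}
print(parentage_log_message(sample))
```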

vkuznet · 2023-08-24T20:14:49Z

src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py

+
+            # now resolve every dataset parentage for each workflow
+            while listWflowChildren:
+                wflowData = listWflowChildren.pop()


The pop method raises an exception if the list is empty, e.g.

>>> d = []
>>> d.pop()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: pop from empty list

The code here does not check for that. It is better to put it into a try/except block.

This is what the while listWflowChildren: is supposed to do. E.g.:

In [1]: l = [1, 2, 3]

In [2]: while l:
   ...:     a = l.pop()
   ...:     print(a)
   ...:
3
2
1

vkuznet · 2023-08-24T20:16:03Z

src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py

+                # measure time taken to fix a workflow parentage
+                start = time.time()
+                parentageResolved = True
+                for childDS in wflowData["childrenDsets"]:


Again, there is no check for the wflowData dict here; I would rather change this line to

for childDS in wflowData.get('childrenDsets', []):

This is already accounted for, so there is no need for a second default. Please see getChildDatasetsForStepChainMissingParent, where it is already set to a default [] in case no datasets exist.
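
To illustrate why both keys are always present, here is a sketch modeled on the diff hunk from that method quoted later in this conversation. The function name and the shape of the input mapping are assumptions made for this example.

```python
def build_workflow_children(parentage_by_request):
    """Return one dict per workflow, always with both expected keys.

    parentage_by_request: assumed shape
        {reqName: {stepNum: {"ChildDsets": [dataset, ...]}}}
    """
    wflowChildrenDsets = []
    for reqName, info in parentage_by_request.items():
        childrenDsets = []  # stays an empty list when no datasets exist
        for _stepNum, dsetDict in info.items():
            childrenDsets.extend(dsetDict["ChildDsets"])
        wflowChildrenDsets.append({"workflowName": reqName,
                                   "childrenDsets": childrenDsets})
    return wflowChildrenDsets
```

Because `childrenDsets` is initialized before the inner loop, a workflow without any steps still yields an entry with an empty list, so iterating over `wflowData["childrenDsets"]` is safe.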

cmsdmwmbot · 2023-08-24T22:02:16Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 changes in unstable tests
Python3 Pylint check: failed
- 5 warnings and errors that must be fixed
- 4 warnings
- 3 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14444/artifact/artifacts/PullRequestReport.html

amaltaro · 2023-08-25T01:18:21Z

I tested this with a real workflow and it works fine, so I decided to patch the production reqmgr2-tasks service and let it go through a cycle. After this cycle, we will know what is stuck because the service was misbehaving, and what is stuck because we need to have the relevant changes in the DBS server.

cmsdmwmbot · 2023-08-25T15:57:29Z

Jenkins results:

Python3 Unit tests: succeeded
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 6 warnings
- 5 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14445/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-08-25T16:21:33Z

Jenkins results:

Python3 Unit tests: succeeded
Python3 Pylint check: failed
- 14 warnings and errors that must be fixed
- 6 warnings
- 5 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14446/artifact/artifacts/PullRequestReport.html

todor-ivanov

Thanks for this fix @amaltaro
It looks good to me. I made a single comment inline, and I have another one to make regarding the PR description. With the current one, the reader is left with the impression that the main motivation for this change is the tradeoff between memory footprint and DBS calls, and it is not even mentioned that the current logic mitigates the effect of stuck requests piling up in the system.

todor-ivanov · 2023-08-29T07:19:43Z

src/python/WMCore/ReqMgr/CherryPyThreads/StepChainParentageFixTask.py

+        childrenDsets = []
+        for _stepNum, dsetDict in info.items():
+            childrenDsets.extend(dsetDict['ChildDsets'])
+        wflowChildrenDsets.append({"workflowName": reqName, "childrenDsets": childrenDsets})


I know that I myself mix up those names a lot: workflowName vs. ReqName. In the original data structure it is named ReqName, and changing the key name here may cause confusion later.

This is just a comment, not a request for change, because I am not sure which is better.

Yes, I happen to use them interchangeably. I guess I will just keep it as is to avoid adding a mistake in the code.
I couldn't find the ReqName reference that you mention though.

amaltaro · 2023-09-13T13:14:58Z

Quick update: WM central services have been redeployed with a stable docker image, which means the changes provided in this PR are no longer applied to the StepChainParentageFix cherrypy thread. I don't see any restarts of the thread today, but if I start seeing those, I will re-apply this patch such that we can skip the very large workflow that brings the application down.

amaltaro · 2023-09-14T11:06:26Z

I have just applied this patch again to the production reqmgr2-tasks pod and restarted the service. Timestamp is around Thu Sep 14 11:05:39 UTC 2023.

vkuznet · 2023-09-19T12:24:49Z

bin/adhoc-scripts/fixWorkflowParentage.py

@@ -0,0 +1,112 @@
+#!/usr/bin/env python


This may or may not be relevant, but since we rely on python3 I think it should be python3 instead of python. The latter can point to something else depending on node configuration and software setup.

If I am not wrong, distutils automatically converts that to python3, if it is using a python3 stack to build the service.

vkuznet · 2023-09-19T12:25:46Z

bin/adhoc-scripts/fixWorkflowParentage.py

+from WMCore.Services.RequestDB.RequestDBWriter import RequestDBWriter
+
+
+REQMGR_URL = 'https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache'


It is never a good idea to have hardcoded values in the code. I understand that they may not change, but it is much better to read them from the environment and fall back to a default if the environment does not provide them.
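
A minimal sketch of that suggestion. The environment variable name `REQMGR_URL` and the helper function are assumptions for illustration; only the default URL comes from the diff above.

```python
import os

# default taken from the diff; used when the environment does not override it
DEFAULT_REQMGR_URL = 'https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache'


def get_reqmgr_url(environ=None):
    """Return the ReqMgr URL from the environment, or the default."""
    environ = os.environ if environ is None else environ
    # "REQMGR_URL" is a hypothetical variable name
    return environ.get("REQMGR_URL", DEFAULT_REQMGR_URL)
```

Passing the mapping explicitly, as in `get_reqmgr_url({"REQMGR_URL": "https://example.org/db"})`, makes the lookup easy to exercise without touching the real environment.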

vkuznet · 2023-09-19T12:27:15Z

bin/adhoc-scripts/fixWorkflowParentage.py

+    data = reqmgrDB.getRequestByNames(wflowName)
+    for reqName, info in data.items():
+        childrenDsets = []
+        for _stepNum, dsetDict in info.get("ChainParentageMap", {}).items():


this can be simplified to

for dsetDict in info.get("ChainParentageMap", {}).values():

but it is a personal choice of style.
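
For reference, the two loop forms collect the same datasets. The sample `info` dictionary and the helper names below are made up for this comparison.

```python
# hypothetical request document with a two-step parentage map
info = {"ChainParentageMap": {"Step1": {"ChildDsets": ["/a/b/c"]},
                              "Step2": {"ChildDsets": ["/d/e/f"]}}}


def children_with_items(info):
    """Original form: iterate over items() and ignore the key."""
    childrenDsets = []
    for _stepNum, dsetDict in info.get("ChainParentageMap", {}).items():
        childrenDsets.extend(dsetDict["ChildDsets"])
    return childrenDsets


def children_with_values(info):
    """Suggested form: iterate over values() directly."""
    childrenDsets = []
    for dsetDict in info.get("ChainParentageMap", {}).values():
        childrenDsets.extend(dsetDict["ChildDsets"])
    return childrenDsets
```

Both forms also return an empty list when `ChainParentageMap` is missing, thanks to the `{}` default.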

vkuznet · 2023-09-19T12:29:11Z

bin/adhoc-scripts/fixWorkflowParentage.py

+        sys.exit(1)
+
+    wflowName = sys.argv[1]
+    reqmgrDB = RequestDBWriter(REQMGR_URL)


Now that I see the usage of REQMGR_URL and DBS_URL, this is the place where you can decide to read them from the environment or pass them as input parameters to this script. Passing them as arguments would be the most flexible option, in my opinion.
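
One way this could look with `argparse`, as a sketch only. The option names `--reqmgr-url` and `--dbs-url` are hypothetical, and no DBS default is assumed here because the script's actual value is not shown in this conversation.

```python
import argparse

# default taken from the diff above
REQMGR_URL = 'https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache'


def parse_args(argv=None):
    """Parse the workflow name plus optional service URLs."""
    parser = argparse.ArgumentParser(
        description="Fix parentage information for a single workflow")
    parser.add_argument("workflow", help="workflow (request) name")
    # hypothetical option names
    parser.add_argument("--reqmgr-url", default=REQMGR_URL,
                        help="ReqMgr2 CouchDB URL")
    parser.add_argument("--dbs-url", default=None,
                        help="DBS server URL (no default assumed in this sketch)")
    return parser.parse_args(argv)
```

Calling `parse_args(["some_workflow"])` keeps the production ReqMgr2 default, while `--reqmgr-url` lets the same script target another instance.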

amaltaro · 2023-09-19T20:12:04Z

@vkuznet I should have addressed all of your comments now, besides the python shebang, which I am not sure is meaningful.

cmsdmwmbot · 2023-09-19T20:19:03Z

Jenkins results:

Python3 Unit tests: failed
- 6 new failures
- 2 tests no longer failing
- 2 tests added
- 3 changes in unstable tests
Python3 Pylint check: failed
- 15 warnings and errors that must be fixed
- 6 warnings
- 4 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14492/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2023-10-23T20:57:43Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
Python3 Pylint check: failed
- 14 warnings and errors that must be fixed
- 6 warnings
- 4 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14573/artifact/artifacts/PullRequestReport.html

amaltaro · 2023-10-23T21:08:13Z

Thank you for the review, Valentin and Todor. The only modification I have done after your review was to remove an already planned commit, which was hacking the code to skip a given very large workflow. Initial description has been updated accordingly.

amaltaro requested review from vkuznet and todor-ivanov August 24, 2023 18:56

vkuznet requested changes Aug 24, 2023

amaltaro force-pushed the fix-11693 branch from 610f3f0 to 92be6d6 Compare August 24, 2023 21:11

todor-ivanov approved these changes Aug 29, 2023

amaltaro requested a review from vkuznet September 14, 2023 13:19

vkuznet requested changes Sep 19, 2023

amaltaro requested a review from vkuznet September 19, 2023 20:10

vkuznet approved these changes Sep 20, 2023

amaltaro added 3 commits October 23, 2023 16:49

Refactor StepChainParentage thread to resolve by workflow

084d949

Fix a few logging messages

Bin script to fix a single workflow parentage information

3cc04d6

apply Valentins suggestions

0aae73a

amaltaro force-pushed the fix-11693 branch from 592f15a to 0aae73a Compare October 23, 2023 20:50

amaltaro merged commit cb1e503 into dmwm:master Oct 23, 2023
3 of 4 checks passed

amaltaro mentioned this pull request Oct 25, 2023

Backport fixes for StepChain parentage thread and Unified configuration #11774

Merged

amaltaro added the PR: Need cmsweb branch label Oct 27, 2023
