Refactor StepChainParentage thread to resolve by workflow #11694
Conversation
while listWflowChildren:
    wflowData = listWflowChildren.pop()
    msg = f'Resolving parentage for workflow {wflowData["workflowName"]} '
    msg += f'with a total of {wflowData["childrenDsets"]} datasets. '
This line will fail when wflowData is an empty dict; it is better to check that wflowData exists before passing it to the message line.
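A minimal sketch of the guard suggested here, assuming the two keys shown in the diff above; the helper name is hypothetical, and it reports the dataset count rather than the names (matching the author's follow-up fix):

```python
def buildParentageMsg(wflowData):
    """Build the log message only if the workflow record is complete.

    Returns None for an empty or incomplete dict instead of raising KeyError.
    """
    if not wflowData or "workflowName" not in wflowData or "childrenDsets" not in wflowData:
        return None
    return (f'Resolving parentage for workflow {wflowData["workflowName"]} '
            f'with a total of {len(wflowData["childrenDsets"])} datasets.')
```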
The wflowData content is built in the method above, called getChildDatasetsForStepChainMissingParent, through a list iteration. Hence, we guarantee that wflowData exists and has the 2 expected keys.
I did find an issue with this log record though, as I wanted to print the number of datasets only, not their names. I am updating it in a minute.
# now resolve every dataset parentage for each workflow
while listWflowChildren:
    wflowData = listWflowChildren.pop()
The pop method can throw an exception if the list is empty, e.g.:
>>> d=[]
>>> d.pop()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: pop from empty list
And the code here does not check for that. It would be better to put it into a try/except block.
This is what the while listWflowChildren: loop is supposed to do. E.g.:
In [1]: l = [1, 2, 3]
In [2]: while l:
   ...:     a = l.pop()
   ...:     print(a)
   ...:
3
2
1
# measure time taken to fix a workflow parentage
start = time.time()
parentageResolved = True
for childDS in wflowData["childrenDsets"]:
Again, there is no check for the wflowData dict here. I would rather change this line to:
for childDS in wflowData.get('childrenDsets', []):
This is already accounted for, and there is no need for a double default. Please see getChildDatasetsForStepChainMissingParent, where it is already set to a default [] in case no datasets exist.
I tested this with a real workflow and it works fine, so I decided to patch the production
Thanks for this fix @amaltaro
It looks good to me. I made a single comment inline, and I have another one regarding the PR description. From the current one, the reader is left with the impression that the main motivation for this change is the tradeoff between memory footprint and DBS calls, and it is not even mentioned that the new logic mitigates the effect of stuck requests piling up in the system.
childrenDsets = []
for _stepNum, dsetDict in info.items():
    childrenDsets.extend(dsetDict['ChildDsets'])
wflowChildrenDsets.append({"workflowName": reqName, "childrenDsets": childrenDsets})
I know I myself mix those names a lot: workflowName vs. ReqName. In the original data structure it is named ReqName, and this change of the key name here may cause confusion later.
This is just a comment, not a request for change, because I am not sure which is better.
Yes, I happen to use them interchangeably. I guess I will just keep it as is, to avoid adding a mistake in the code.
I couldn't find the ReqName reference that you mention, though.
Quick update: WM central services have been redeployed with a stable docker image, which means the changes provided in this PR are no longer applied to the StepChainParentageFix cherrypy thread. I don't see any restarts of the thread today, but if I start seeing those, I will re-apply this patch such that we can skip the very large workflow that brings the application down.
I have just applied this patch again to the production reqmgr2-tasks pod and restarted the service. Timestamp is around
#!/usr/bin/env python
This may or may not be relevant, but since we rely on python3, I think it should be python3 instead of python. The latter can point to something else, depending on the node configuration and software setup.
If I am not wrong, distutils automatically converts that to python3, if it is using a python3 stack to build the service.
from WMCore.Services.RequestDB.RequestDBWriter import RequestDBWriter

REQMGR_URL = 'https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache'
It is never a good idea to have hardcoded values in the code. I understand that they may not change, but it is much better to read them from the environment and, if the env does not provide them, set up a default.
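A minimal sketch of that suggestion, using os.environ.get with the hardcoded value demoted to a fallback default; the environment variable names are assumptions for illustration:

```python
import os

# Read the service endpoint from the environment if provided (hypothetical
# variable name REQMGR_URL), otherwise fall back to the production default.
REQMGR_URL = os.environ.get(
    "REQMGR_URL",
    "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache")
```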
data = reqmgrDB.getRequestByNames(wflowName)
for reqName, info in data.items():
    childrenDsets = []
    for _stepNum, dsetDict in info.get("ChainParentageMap", {}).items():
This can be simplified to:
for dsetDict in info.get("ChainParentageMap", {}).values():
but it is a personal choice of style.
    sys.exit(1)

wflowName = sys.argv[1]
reqmgrDB = RequestDBWriter(REQMGR_URL)
Now, when I see the usage of REQMGR_URL and DBS_URL, this is the place where you could decide to read them from the env or pass them as input parameters to this script. Passing them as arguments would be the most flexible, in my opinion.
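A sketch of the command-line approach with argparse, falling back to environment variables and then to a production default; the flag names, the env variable names, and the default DBS endpoint are assumptions for illustration:

```python
import argparse
import os


def parseArgs(argv=None):
    """Parse workflow name and service URLs, with env-var fallbacks."""
    parser = argparse.ArgumentParser(
        description="Fix parentage information for a single workflow")
    parser.add_argument("workflow", help="workflow (request) name to fix")
    parser.add_argument(
        "--reqmgr-url",
        default=os.environ.get(
            "REQMGR_URL",
            "https://cmsweb.cern.ch/couchdb/reqmgr_workload_cache"),
        help="ReqMgr2 CouchDB URL (defaults to env REQMGR_URL, then production)")
    parser.add_argument(
        "--dbs-url",
        default=os.environ.get(
            "DBS_URL",
            "https://cmsweb.cern.ch/dbs/prod/global/DBSWriter"),
        help="DBS server URL (defaults to env DBS_URL, then production)")
    return parser.parse_args(argv)
```

Passing `argv=None` makes parse_args read sys.argv as usual, while tests can pass an explicit list.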
@vkuznet I should have addressed all of your comments now, besides the python shebang, which I am not sure is meaningful.
Thank you for the review, Valentin and Todor. The only modification I have made after your review was to remove an already planned commit, which was hacking the code to skip a given very large workflow. The initial description has been updated accordingly.
Fixes #11693
Status
ready
Description
Refactor how data is organized in the StepChainParentageFix cherrypy thread, such that it goes through one workflow at a time, instead of aggregating all of the datasets that require a parentage fix, dealing with all of them, and only then making the status transition.
With these changes, we have a smaller memory footprint but a larger number of HTTP calls to the DBS server, as multiple workflows could have the same input and/or output dataset names.
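The per-workflow control flow described above can be sketched as follows. This is a minimal illustration, not the WMCore implementation; the resolve/update helpers are hypothetical stand-ins injected as callables:

```python
def fixParentagePerWorkflow(listWflowChildren, resolveDataset, updateStatus):
    """Resolve parentage and transition status one workflow at a time.

    Only a single workflow's data is held at once, instead of aggregating
    every dataset up front and updating all statuses at the very end.
    """
    results = {}
    while listWflowChildren:
        wflowData = listWflowChildren.pop()
        parentageResolved = True
        for childDS in wflowData["childrenDsets"]:
            if not resolveDataset(childDS):
                parentageResolved = False
        if parentageResolved:
            # the status transition happens per workflow, not at the end
            updateStatus(wflowData["workflowName"])
        results[wflowData["workflowName"]] = parentageResolved
    return results
```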
In addition, a standalone script is provided to fix the parentage information for any given single workflow. The script is called
fixWorkflowParentage.py
and, by default, it uses the production ReqMgr2 and DBS server.
Note-1: a spike in memory can still occur whenever a large workflow needs to have its parentage fixed (e.g.: cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766).
Note-2: the parentage resolution logic has not been changed at all; only the ordering of the dataset and reqmgr2 updates.
Note-3: the "hack" commit has been removed. Here is its reference though: 1961d25
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
None