
Fix ChangeState logic for limiting number of docs to commit in bulk #11786

Merged

Merged 1 commit into dmwm:master on Nov 10, 2023

Conversation

amaltaro
Contributor

@amaltaro amaltaro commented Nov 3, 2023

Fixes #11771

Status

ready

Description

The previous code was buggy because countDocs was never incremented. In addition, we should have compared against self.maxBulkCommit, since that is the value used when the database connection instance is created and it defines the limit for commits in bulk.

In addition, we now also wrap jsumdatabase commits in a try/except, as we recently learned that those documents can be larger than the configured limit as well. Unfortunately, we have to deal with those slightly differently, mainly for two reasons: 1) the document contains set() objects, which are not JSON serializable, so we cannot compute the full size of the document; 2) I picked three fields that can potentially be large (even though only errors has been a real-world example so far).

UPDATE: a list of 500 dicts for the inputfiles field adds around 100kB, so I think it's a good compromise.

It fixes a bug that I introduced in #11502.
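The shrink-on-failure flow described above can be sketched as follows. This is a minimal illustration rather than the actual WMCore code: the safeBulkCommit/shrinkLargeDocs names, the _queue attribute, and the size limit are assumptions.

```python
import json
import logging

# Assumed size limit (bytes) and the fields considered potentially large
MAX_DOC_SIZE = 8 * 1024 * 1024
LARGE_FIELDS = ("errors", "inputfiles", "lumis")

def safeBulkCommit(database, limitSize=MAX_DOC_SIZE):
    """Commit queued documents; on failure, shrink large fields and retry once."""
    try:
        database.commit()
    except Exception as exc:
        logging.warning("Bulk commit failed, shrinking large documents. Details: %s", str(exc))
        shrinkLargeDocs(database, limitSize)
        database.commit()

def shrinkLargeDocs(database, limitSize):
    """Truncate/reset known large fields of queued documents above limitSize."""
    for doc in database._queue:  # hypothetical handle on the pending-commit queue
        try:
            docSize = len(json.dumps(doc))
        except TypeError:
            # documents with set() objects are not JSON serializable, so we
            # cannot measure them; fall back to shrinking their large fields
            docSize = limitSize + 1
        if docSize <= limitSize:
            continue
        for field in LARGE_FIELDS:
            value = doc.get(field)
            if isinstance(value, str):
                # keep only the first 5k characters, as done in this PR
                doc[field] = value[:5000] + "... error message truncated ..."
            elif isinstance(value, (list, dict)):
                doc[field] = type(value)()  # reset to an empty list/dict
```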

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14601/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 3, 2023

Things get better with the first commit, but now instead of failing to inject documents in fwjrdatabase, they fail at jsumdatabase. I am almost sure that the lumis attribute of the jobSummary document is the culprit. Investigating...

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14602/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14603/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14604/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 4, 2023

I was hopeful that after applying this patch (it's been applied to vocms0255), workflow job information would become available once again, but it did not happen for: pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14605/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 6, 2023

@todor-ivanov I am almost sure that the reason we do not see job-related documents in CouchDB is that we are constantly failing to inject some documents into the local database, as their size is beyond the configured limits.

My first commit actually fixes logic that was put in place a couple of months ago for injecting fwjr documents, around this line:

self.fwjrdatabase.commit(callback=discardConflictingDocument)

it turns out that jsum documents can also fail, around this line:

self.jsumdatabase.commit()

where I saw the errors key reaching many megabytes in size. So I added a few checks on the Python dictionary documents before even queueing them.

As a short-term fix, I would just handle these too-large documents by truncating/resetting some of their large fields.

Long term, we need to start fetching the outcome of these bulk operations and retry only the specific failed documents, as documented here: https://docs.couchdb.org/en/stable/api/database/bulk-api.html#api-db-bulk-docs-validation
This is a larger development though, IMO, and it can potentially have side effects in places we do not expect.
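As a sketch of what that per-document retry could look like: CouchDB's _bulk_docs returns one result row per submitted document, in submission order, carrying either "ok"/"rev" on success or "error"/"reason" on failure. The function name and document shapes below are illustrative, not part of this PR:

```python
def splitBulkResponse(docs, response):
    """Split the documents sent to _bulk_docs into committed and failed ones,
    based on the per-document result rows returned by CouchDB."""
    committed, failed = [], []
    for doc, row in zip(docs, response):
        if row.get("ok"):
            committed.append(doc)
        else:
            # e.g. error="document_too_large"; only these need special handling
            failed.append((doc, row.get("error"), row.get("reason")))
    return committed, failed
```

Only the failed list would then need to be shrunk and resubmitted, instead of rescanning the whole queue.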

I just wanted to hear your thoughts on this and get a review of this PR. One alternative for this PR could be: instead of checking the size of each jobsummary document beforehand, only check it in case we spot too-large documents (the same as is currently implemented for fwjr). Thanks

@amaltaro
Contributor Author

amaltaro commented Nov 6, 2023

And we already have a ticket for the refactoring of CouchDB bulk operations:
#11576

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14606/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 99 new failures
    • 275 tests deleted
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 8 warnings
    • 30 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14607/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14608/artifact/artifacts/PullRequestReport.html

# keep only the first 5k characters
doc[innerAttr] = doc[innerAttr][:5000]
doc[innerAttr] += "... error message truncated ..."
elif isinstance(doc[innerAttr], list):
Contributor

@todor-ivanov todor-ivanov Nov 8, 2023

Why not (similarly to the str field type) preserve the last few hundred elements of the list?

Contributor Author

I am going to answer it in the PR body itself to make it more readable.

doc[innerAttr] += "... error message truncated ..."
elif isinstance(doc[innerAttr], list):
doc[innerAttr] = []
elif isinstance(doc[innerAttr], dict):
Contributor

same comment here

@@ -304,7 +330,7 @@ def recordInCouch(self, jobs, newstate, oldstate, updatesummary=False):

couchRecordsToUpdate.append({"jobid": job["id"],
"couchid": jobDocument["_id"]})
-        if countDocs >= self.jobsdatabase.getQueueSize():
+        if self.jobsdatabase.getQueueSize() >= self.maxBulkCommit:
Contributor

@todor-ivanov todor-ivanov Nov 8, 2023

What's the difference when you tie the condition here to self.maxBulkCommit vs. countDocs?
And if we do not use this counter here any more (and similarly later in the code), is it even referenced anywhere else?
BTW, I could not even find where this counter was previously updated in the first place.

Contributor Author

Exactly the problem (well, one of them): countDocs was only initialized and never updated.
With this PR, I remove it completely and rely solely on the size of the CouchDB document queue.
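The resulting pattern — triggering a bulk commit purely off the pending-queue size, with no separate counter to keep in sync — can be sketched like this. It is a simplified stand-in, not WMCore's actual CMSCouch API:

```python
class BulkQueue:
    """Minimal sketch: the queue length itself decides when to bulk-commit."""

    def __init__(self, maxBulkCommit):
        self.maxBulkCommit = maxBulkCommit
        self._queue = []
        self.commits = 0

    def getQueueSize(self):
        return len(self._queue)

    def queueDoc(self, doc):
        self._queue.append(doc)
        # no countDocs counter to maintain: the queue is the source of truth
        if self.getQueueSize() >= self.maxBulkCommit:
            self.commit()

    def commit(self):
        # stand-in for the real CouchDB bulk commit
        self.commits += 1
        self._queue = []
```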

msg = "Failed to commit bulk of job summary documents to CouchDB."
msg += f" Details: {str(exc)}"
logging.warning(msg)
shrinkLargeSummary(self.jsumdatabase, self.fwjrLimitSize)
Contributor

@todor-ivanov todor-ivanov Nov 8, 2023

OK, let me see if I read it correctly: if we are in this situation, calling the shrinkLargeSummary function here will trigger a full scan of the local database and shrink only the documents that exceed the configured size, by cutting the fields that go over some threshold. Is my understanding correct?

If so, a second question: could we avoid iterating through the whole database in a single function call and fix only the document at hand?

If I misinterpreted it, please ignore my comment as a whole.

Contributor Author

Yes, you are partially correct. The only correction is that we do not scan the full local database, only the documents queued up for a given database (documents waiting to be committed).

The long-term fix is indeed to deal only with the document that exceeded the size limit, but for that we need to work on the other GH issue that I have already referenced in the code (#11576).

Contributor

@todor-ivanov todor-ivanov left a comment

@amaltaro in general, it looks good to me, but I've left two or three comments inline with some clarification questions. Could you please take a look at them before merging?

@amaltaro
Contributor Author

amaltaro commented Nov 8, 2023

@todor-ivanov answering your question on the shrinkLargeSummary function and why we don't truncate data for list/dict types as well. The short answer is: because it's more complicated and it increases the chances of ending up with a final document that is still above the size limits.

The longer answer is: both the inputfiles and lumis fields expect a list data type (the dict branch is just a safeguard and we could potentially remove it), where inputfiles has the following format:

 'inputfiles': [{'input_type': 'primaryFiles',
                 'lfn': '/store/unmerged/DMWM_Test/RelValProdMinBias/AODSIM/RECOPROD1_TaskChain_ProdMinBias_Nov2023_Val_Alanv2-v11/1930000/7A082F24-A37D-EE11-B99C-3CECEF0DDCE2.root'},
                {'input_type': 'primaryFiles',
                 'lfn': '/store/unmerged/DMWM_Test/RelValProdMinBias/AODSIM/RECOPROD1_TaskChain_ProdMinBias_Nov2023_Val_Alanv2-v11/1930000/6837B7E8-A27D-EE11-84A9-3CECEF5D9F50.root'},

and lumis looks like:

 'lumis': [{'1': {1: None,
                  ...
                  21: None}},
           {'1': {22: None,
                   ....

and I think for inputfiles we could indeed truncate it at 100 input files (or maybe 500 files, to cover merge jobs). But for the lumis field, it becomes tricky because we have dicts of dicts and their sizes can vary.

In addition, in the string case it's simple to append a message saying that the field has been truncated, while for a list/dict I see no robust way to do that without affecting the data structure.

Another challenge with this PR is that the jobsummary document contains one or more set objects, which are not JSON serializable, making it impossible to perform a json.dumps on the whole document. I am trying to figure out which field that is, though, and see whether anything better - and not costly - can be done.

Please let me know if you have any suggestions.

Contributor

@todor-ivanov todor-ivanov left a comment

Thank you @amaltaro, for those clarifications. All looks good to me.

@amaltaro
Contributor Author

amaltaro commented Nov 8, 2023

@todor-ivanov thank you for your review. On top of it, I made a 6th commit (for bookkeeping: ce3a313) which makes a few improvements to the truncating/resetting function.

I was pretty sure I had problems json-dumping the whole document, but I haven't seen those again in my tests from yesterday and today, so I am giving this another try now.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14615/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 9, 2023

As I just noticed, I need to enclose L562:

        self.jsumdatabase.commit()

in a try/except block as well.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14616/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14617/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 9, 2023

I have this patch updated on vocms0255 and now everything is looking good in the logs. This patch became a little more complex because the errors field isn't really a plain string, but a dict of lists - potentially with dictionaries as items.

I see no way to make the code simpler, but I am happy to hear any suggestions.

@todor-ivanov if you review it again, commits 6 through 8 contain new code that you have not looked at yet.
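One generic way to handle such a nested errors structure — a dict of lists whose items may themselves be dicts — is a recursive truncation of long string values. This is only an illustrative sketch under those assumptions, not the code in this PR:

```python
def truncateErrorDetails(errors, maxChars=5000):
    """Recursively truncate long string values in a nested errors structure."""
    if isinstance(errors, str) and len(errors) > maxChars:
        # keep only the first maxChars characters, flagging the truncation
        return errors[:maxChars] + "... error message truncated ..."
    if isinstance(errors, list):
        return [truncateErrorDetails(item, maxChars) for item in errors]
    if isinstance(errors, dict):
        return {key: truncateErrorDetails(val, maxChars) for key, val in errors.items()}
    return errors  # short strings, numbers, None, etc. pass through unchanged
```

Unlike truncating a list in place, this keeps the overall data structure intact and only shortens the leaf strings that can actually blow up the document size.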

# keep only the first 5k characters
doc[innerAttr] = doc[innerAttr][:5000]
doc[innerAttr] += "... error message truncated ..."
# keep only the first 5k characters of the error detail
Contributor

At the moment the initial error dictionary is created, is it assured that the two fields used in the following two for loops are actually iterable (even if with zero length)?

Contributor Author

@amaltaro amaltaro Nov 9, 2023

Yes, the data type defaults to dict, and if it's empty it doesn't even enter the first loop.

Contributor

@todor-ivanov todor-ivanov left a comment

Thanks @amaltaro. I left only one comment inline, which is just a precaution.

Check job summary document size as well

Check job summary doc for large fields

Catch dictionary as well

add extra message whenever errors is truncated

Separate function for dealing with job summary docs

Refactor shrinkLargeSummary to get a dump of the whole document

Enclose in try/except a final jobsummary commit

The errors value is actually a nested dictionary

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14618/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro merged commit 21342f5 into dmwm:master Nov 10, 2023
3 of 4 checks passed