
Fix ChangeState logic for limiting number of docs to commit in bulk #11786

Merged

Merged 1 commit into dmwm:master on Nov 10, 2023

Conversation

amaltaro
Contributor

@amaltaro amaltaro commented Nov 3, 2023

Fixes #11771

Status

ready

Description

The previous code was buggy because countDocs was never incremented. In addition, we should have compared against self.maxBulkCommit, since that is the value used when the database connection instance is created and it defines the limit for commits in bulk.

In addition, we now also wrap jsumdatabase commits in a try/except, as we recently learned that those documents can be larger than the configured limit as well. Unfortunately, we have to deal with those slightly differently, mainly for two reasons: 1) the document contains set() objects, which are not JSON serializable, so we cannot compute the full size of the document; 2) I picked three fields that can potentially be large (even though only errors has been a real-world example so far).

UPDATE: a list of 500 dicts for the inputfiles field adds around 100kB, so I think it's a good compromise.

It fixes a bug that I introduced in #11502.
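The shrink-on-failure flow described above can be sketched as follows. This is a minimal illustration rather than the actual WMCore code: the safeBulkCommit/shrinkLargeDocs names, the _queue attribute, and the size limit are assumptions.

```python
import json
import logging

# Assumed size limit (bytes) and the fields considered potentially large
MAX_DOC_SIZE = 8 * 1024 * 1024
LARGE_FIELDS = ("errors", "inputfiles", "lumis")

def safeBulkCommit(database, limitSize=MAX_DOC_SIZE):
    """Commit queued documents; on failure, shrink large fields and retry once."""
    try:
        database.commit()
    except Exception as exc:
        logging.warning("Bulk commit failed, shrinking large documents. Details: %s", str(exc))
        shrinkLargeDocs(database, limitSize)
        database.commit()

def shrinkLargeDocs(database, limitSize):
    """Truncate/reset known large fields of queued documents above limitSize."""
    for doc in database._queue:  # hypothetical handle on the pending-commit queue
        try:
            docSize = len(json.dumps(doc))
        except TypeError:
            # documents with set() objects are not JSON serializable, so we
            # cannot measure them; fall back to shrinking their large fields
            docSize = limitSize + 1
        if docSize <= limitSize:
            continue
        for field in LARGE_FIELDS:
            value = doc.get(field)
            if isinstance(value, str):
                # keep only the first 5k characters, as done in this PR
                doc[field] = value[:5000] + "... error message truncated ..."
            elif isinstance(value, (list, dict)):
                doc[field] = type(value)()  # reset to an empty list/dict
```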

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14601/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 3, 2023

Things get better with the first commit, but now instead of failing to inject documents in fwjrdatabase, they fail at jsumdatabase. I am almost sure that the lumis attribute of the jobSummary document is the culprit. Investigating...

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14602/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14603/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14604/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 4, 2023

I was hopeful that after applying this patch (it's been applied to vocms0255), workflow job information would become available once again, but it did not happen for: pdmvserv_task_PPS-Run3Summer22EEMiniAODv4-00177__v1_T_231101_141258_2773

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14605/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 6, 2023

@todor-ivanov I am almost sure that the reason we do not see job-related documents in CouchDB is that we are constantly failing to inject some documents into the local database, as their size is beyond the configured limits.

My first commit actually fixes logic that was put in place a couple of months ago for injecting fwjr documents, around this line:

self.fwjrdatabase.commit(callback=discardConflictingDocument)

it turns out that jsum documents can also fail, around this line:

self.jsumdatabase.commit()

where I saw the errors key reaching many megabytes in size. So I added a few checks on the Python dictionary documents before even queueing them.

As a short-term fix, I would just handle these too-large documents by truncating/resetting some of their large fields.

Long term, we need to start fetching the outcome of these bulk operations and retry only the specific failed documents, as documented here: https://docs.couchdb.org/en/stable/api/database/bulk-api.html#api-db-bulk-docs-validation
This is a larger development though, IMO, and it can potentially have side effects in places we do not expect.
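As a sketch of what that per-document retry could look like: CouchDB's _bulk_docs returns one result row per submitted document, in submission order, carrying either "ok"/"rev" on success or "error"/"reason" on failure. The function name and document shapes below are illustrative, not part of this PR:

```python
def splitBulkResponse(docs, response):
    """Split the documents sent to _bulk_docs into committed and failed ones,
    based on the per-document result rows returned by CouchDB."""
    committed, failed = [], []
    for doc, row in zip(docs, response):
        if row.get("ok"):
            committed.append(doc)
        else:
            # e.g. error="document_too_large"; only these need special handling
            failed.append((doc, row.get("error"), row.get("reason")))
    return committed, failed
```

Only the failed list would then need to be shrunk and resubmitted, instead of rescanning the whole queue.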

I just wanted to hear your thoughts on this and get a review of this PR. One alternative for this PR could be: instead of checking the size of each jobsummary document beforehand, only check it in case we spot too-large documents (the same as is currently implemented for fwjr). Thanks

@amaltaro
Contributor Author

amaltaro commented Nov 6, 2023

And we already have a ticket for the refactoring of CouchDB bulk operations:
#11576

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14606/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 99 new failures
    • 275 tests deleted
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 8 warnings
    • 30 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14607/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14608/artifact/artifacts/PullRequestReport.html

# keep only the first 5k characters
doc[innerAttr] = doc[innerAttr][:5000]
doc[innerAttr] += "... error message truncated ..."
elif isinstance(doc[innerAttr], list):
Contributor

@todor-ivanov todor-ivanov Nov 8, 2023

Why not (similarly to the str field type) preserve the last few hundred elements of the list?

Contributor Author

I am going to answer it in the PR body itself to make it more readable.

doc[innerAttr] += "... error message truncated ..."
elif isinstance(doc[innerAttr], list):
doc[innerAttr] = []
elif isinstance(doc[innerAttr], dict):
Contributor

same comment here

@@ -304,7 +330,7 @@ def recordInCouch(self, jobs, newstate, oldstate, updatesummary=False):

couchRecordsToUpdate.append({"jobid": job["id"],
"couchid": jobDocument["_id"]})
-        if countDocs >= self.jobsdatabase.getQueueSize():
+        if self.jobsdatabase.getQueueSize() >= self.maxBulkCommit:
Contributor

@todor-ivanov todor-ivanov Nov 8, 2023

What's the difference when you tie the condition here to self.maxBulkCommit vs. countDocs?
And if we do not use this counter here any more (and similarly later in the code), is it even referenced anywhere else?
BTW, I could not even find where this counter was previously updated in the first place.

Contributor Author

Exactly the problem (well, one of them): countDocs was only initialized and never updated.
With this PR, I remove it completely and rely solely on the size of the CouchDB document queue.
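The resulting pattern — triggering a bulk commit purely off the pending-queue size, with no separate counter to keep in sync — can be sketched like this. It is a simplified stand-in, not WMCore's actual CMSCouch API:

```python
class BulkQueue:
    """Minimal sketch: the queue length itself decides when to bulk-commit."""

    def __init__(self, maxBulkCommit):
        self.maxBulkCommit = maxBulkCommit
        self._queue = []
        self.commits = 0

    def getQueueSize(self):
        return len(self._queue)

    def queueDoc(self, doc):
        self._queue.append(doc)
        # no countDocs counter to maintain: the queue is the source of truth
        if self.getQueueSize() >= self.maxBulkCommit:
            self.commit()

    def commit(self):
        # stand-in for the real CouchDB bulk commit
        self.commits += 1
        self._queue = []
```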

msg = "Failed to commit bulk of job summary documents to CouchDB."
msg += f" Details: {str(exc)}"
logging.warning(msg)
shrinkLargeSummary(self.jsumdatabase, self.fwjrLimitSize)
Contributor

@todor-ivanov todor-ivanov Nov 8, 2023

OK, let me see if I read it correctly: if we are in this situation, calling the shrinkLargeSummary function here will trigger a full scan of the local database and shrink only the documents that exceed the configured size, by cutting the fields that go over some threshold. Is my understanding correct?

If so, a second question: could we avoid iterating through the whole database in a single function call and fix only the document at hand?

If I misinterpreted it, please ignore my comment as a whole.

Contributor Author

Yes, you are partially correct. The only correction is that we do not scan the full local database, only the documents queued up for a given database (documents waiting to be committed).

The long-term fix is indeed to deal only with the document that exceeded the size limit, but for that we need to work on the other GH issue that I have already referenced in the code (#11576).

Contributor

@todor-ivanov todor-ivanov left a comment

@amaltaro in general, it looks good to me, but I've left two or three comments inline with some clarification questions. Could you please take a look at them before merging?

@amaltaro
Contributor Author

amaltaro commented Nov 8, 2023

@todor-ivanov answering your question on the shrinkLargeSummary function and why we don't truncate data for list/dict types as well. The short answer is: because it's more complicated and it increases the chances of ending up with a final document that is still above the size limits.

The longer answer is: both the inputfiles and lumis fields expect a list data type (the dict branch is just a safeguard and we could potentially remove it), where inputfiles has the following format:

 'inputfiles': [{'input_type': 'primaryFiles',
                 'lfn': '/store/unmerged/DMWM_Test/RelValProdMinBias/AODSIM/RECOPROD1_TaskChain_ProdMinBias_Nov2023_Val_Alanv2-v11/1930000/7A082F24-A37D-EE11-B99C-3CECEF0DDCE2.root'},
                {'input_type': 'primaryFiles',
                 'lfn': '/store/unmerged/DMWM_Test/RelValProdMinBias/AODSIM/RECOPROD1_TaskChain_ProdMinBias_Nov2023_Val_Alanv2-v11/1930000/6837B7E8-A27D-EE11-84A9-3CECEF5D9F50.root'},

and lumis looks like:

 'lumis': [{'1': {1: None,
                  ...
                  21: None}},
           {'1': {22: None,
                   ....

and I think for inputfiles we could indeed truncate it at 100 input files (or maybe 500 files, to cover merge jobs). But for the lumis field, it becomes tricky because we have dicts of dicts and their sizes can vary.

In addition, in the string case it's simple to append a message saying that the field has been truncated, while for a list/dict I see no robust way to do that without affecting the data structure.

Another challenge with this PR is that the jobsummary document contains one or more set objects, which are not JSON serializable, making it impossible to perform a json.dumps on the whole document. I am trying to figure out which field that is, though, and see whether anything better - and not costly - can be done.

Please let me know if you have any suggestions.

Contributor

@todor-ivanov todor-ivanov left a comment

Thank you @amaltaro, for those clarifications. All looks good to me.

@amaltaro
Contributor Author

amaltaro commented Nov 8, 2023

@todor-ivanov thank you for your review. On top of it, I made a 6th commit (for bookkeeping: ce3a313) which makes a few improvements to the truncating/resetting function.

I was pretty sure I had problems json-dumping the whole document, but I haven't seen those again in my tests from yesterday and today, so I am giving this another try now.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14615/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 9, 2023

As I just noticed, I need to enclose L562:

        self.jsumdatabase.commit()

in a try/except block as well.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 28 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14616/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14617/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Nov 9, 2023

I have this patch updated on vocms0255 and now everything is looking good in the logs. This patch became a little more complex because the errors field isn't really a plain string, but a dict of lists - potentially with dictionaries as items.

I see no way to make the code simpler, but I am happy to hear any suggestions.

@todor-ivanov if you review it again, commits 6 through 8 contain new code that you have not looked at yet.
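One generic way to handle such a nested errors structure — a dict of lists whose items may themselves be dicts — is a recursive truncation of long string values. This is only an illustrative sketch under those assumptions, not the code in this PR:

```python
def truncateErrorDetails(errors, maxChars=5000):
    """Recursively truncate long string values in a nested errors structure."""
    if isinstance(errors, str) and len(errors) > maxChars:
        # keep only the first maxChars characters, flagging the truncation
        return errors[:maxChars] + "... error message truncated ..."
    if isinstance(errors, list):
        return [truncateErrorDetails(item, maxChars) for item in errors]
    if isinstance(errors, dict):
        return {key: truncateErrorDetails(val, maxChars) for key, val in errors.items()}
    return errors  # short strings, numbers, None, etc. pass through unchanged
```

Unlike truncating a list in place, this keeps the overall data structure intact and only shortens the leaf strings that can actually blow up the document size.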

# keep only the first 5k characters
doc[innerAttr] = doc[innerAttr][:5000]
doc[innerAttr] += "... error message truncated ..."
# keep only the first 5k characters of the error detail
Contributor

At the moment the initial error dictionary is created, is it assured that the two fields used in the following two for loops are actually iterable (even if with zero length)?

Contributor Author

@amaltaro amaltaro Nov 9, 2023

Yes, the data type defaults to dict, and if it's empty it doesn't even enter the first loop.

Contributor

@todor-ivanov todor-ivanov left a comment

Thanks @amaltaro. I left only one comment inline, which is just a precaution.

Check job summary document size as well

Check job summary doc for large fields

Catch dictionary as well

add extra message whenever errors is truncated

Separate function for dealing with job summary docs

Refactor shrinkLargeSummary to get a dump of the whole document

Enclose in try/except a final jobsummary commit

The errors value is actually a nested dictionary

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 8 warnings
    • 29 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14618/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro merged commit 21342f5 into dmwm:master Nov 10, 2023
3 of 4 checks passed