Fix ChangeState logic for limiting number of docs to commit in bulk #11786
Conversation
Things get better with the first commit, but now instead of failing to inject documents in [...]
I was hopeful that after applying this patch (it's been applied to vocms0255), workflow job information would become available once again, but it did not happen for: [...]
@todor-ivanov I am almost sure that the reason we do not see job-related documents in CouchDB is that we are constantly failing to inject some documents into the local database, as their size is beyond the configured limits. My first commit actually fixes logic that was put in place a couple of months ago for injecting fwjr documents, around this line:

It turns out that jsum documents can also fail, around this line:

where I saw the same `document too large` error. As a short-term fix, I would just deal with these `document too large` errors and truncate/reset some large fields. Long term, we need to start fetching the outcome of these bulk operations and retrying only the specific documents that failed, as documented here: https://docs.couchdb.org/en/stable/api/database/bulk-api.html#api-db-bulk-docs-validation

I just wanted to hear your thoughts on this and get a review of this PR. One alternative for this PR could be: instead of checking the size of each [...]
And we already have a ticket for the refactoring of CouchDB bulk operations: #11576
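As a rough sketch of that long-term direction (not WMCore code; the function name and URL handling are illustrative), the `_bulk_docs` response contains one status object per submitted document, in input order, so failures can be isolated and retried individually:

```python
import json
import urllib.request

def bulkCommitCollectFailures(couchUrl, dbName, docs):
    """Sketch: POST documents via _bulk_docs and collect per-document
    failures, following the CouchDB bulk API documentation linked above.
    The response is a JSON list with one entry per submitted document;
    rejected entries carry "error" and "reason" fields."""
    payload = json.dumps({"docs": docs}).encode("utf-8")
    request = urllib.request.Request(f"{couchUrl}/{dbName}/_bulk_docs",
                                     data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        results = json.load(response)

    failed = []
    for doc, result in zip(docs, results):
        if "error" in result:
            # e.g. {"id": "...", "error": "conflict", "reason": "..."}
            failed.append((doc, result["error"], result.get("reason")))
    return failed
```

With that in hand, only the documents collected in `failed` would need to be shrunk and resubmitted, instead of reworking the whole batch.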
```python
    # keep only the first 5k characters
    doc[innerAttr] = doc[innerAttr][:5000]
    doc[innerAttr] += "... error message truncated ..."
elif isinstance(doc[innerAttr], list):
```
Why not (similarly to the `str` field type) preserve the last, maybe hundreds, of elements from the list?
I am going to answer it in the PR body itself to make it more readable.
```python
    doc[innerAttr] += "... error message truncated ..."
elif isinstance(doc[innerAttr], list):
    doc[innerAttr] = []
elif isinstance(doc[innerAttr], dict):
```
Same comment here.
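For context, a minimal sketch of the truncation strategy being discussed (a hypothetical helper, assuming it is only invoked on documents already known to exceed the size limit): strings keep a prefix plus a marker, while lists and dicts are reset wholesale, since there is no structure-preserving way to embed a "truncated" note in them:

```python
def shrinkLargeFields(doc, attrs=("errors",), strLimit=5000):
    """Sketch: truncate or reset potentially large fields in place.
    'attrs' and 'strLimit' are illustrative; the real code picks a few
    specific fields and a 5k-character limit for strings."""
    for innerAttr in attrs:
        if innerAttr not in doc:
            continue
        if isinstance(doc[innerAttr], str):
            # keep only the first strLimit characters, plus a marker
            doc[innerAttr] = doc[innerAttr][:strLimit]
            doc[innerAttr] += "... error message truncated ..."
        elif isinstance(doc[innerAttr], list):
            doc[innerAttr] = []
        elif isinstance(doc[innerAttr], dict):
            doc[innerAttr] = {}
    return doc
```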
```diff
@@ -304,7 +330,7 @@ def recordInCouch(self, jobs, newstate, oldstate, updatesummary=False):
             couchRecordsToUpdate.append({"jobid": job["id"],
                                          "couchid": jobDocument["_id"]})
-            if countDocs >= self.jobsdatabase.getQueueSize():
+            if self.jobsdatabase.getQueueSize() >= self.maxBulkCommit:
```
What's the difference when you tie the condition here to `self.maxBulkCommit` vs `countDocs`? And if we do not use this counter here any more (and similarly later in the code), is it even referenced any more? BTW, I could not find where this counter was previously updated in the first place.
Exactly the problem (well, one of them). `countDocs` was only initialized and never updated. With this PR, I remove it completely and rely solely on the size of the CouchDB document queue.
msg = "Failed to commit bulk of job summary documents to CouchDB." | ||
msg += f" Details: {str(exc)}" | ||
logging.warning(msg) | ||
shrinkLargeSummary(self.jsumdatabase, self.fwjrLimitSize) |
OK, let me see if I read it correctly: if we are in this situation, calling the `shrinkLargeSummary` function here is going to trigger a full scan of the local database and shrink only the documents which exceed the configured size, by cutting the fields that pass over some threshold? Is my understanding correct?

If my understanding is correct, I'd ask a second question: could we avoid iterating through the whole database with a single function call and fix only the document at hand?

If I misinterpreted it, please ignore my comment as a whole.
Yes, you are partially correct. The only correction is that we do not scan the full local database, but only the documents that are queued up for a given database (documents waiting to be committed).

The long-term fix is indeed to deal only with the document that exceeded the size, but for that we need to work on the other GH issue that I have already referenced in the code (11576).
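For context, the short-term flow in the snippet above amounts to something like this sketch (assuming a `shrinkLargeSummary` helper that walks only the in-memory commit queue, as described):

```python
import logging

def commitJobSummaries(jsumdatabase, fwjrLimitSize, shrinkLargeSummary):
    """Sketch: attempt the bulk commit; on failure, shrink only the
    documents still queued for this database and retry once."""
    try:
        jsumdatabase.commit()
    except Exception as exc:
        msg = "Failed to commit bulk of job summary documents to CouchDB."
        msg += f" Details: {str(exc)}"
        logging.warning(msg)
        # only the queued (not yet committed) documents are inspected,
        # not the whole local database
        shrinkLargeSummary(jsumdatabase, fwjrLimitSize)
        jsumdatabase.commit()
```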
@amaltaro in general, it looks good to me, but I've left 2 or 3 comments inline with some clarification questions. Could you please take a look at them before merging?
@todor-ivanov answering your question on the [...] here. The longer answer is: both [...] and [...], and I think for [...]. In addition, in the string data type case, it's simple to add a message saying that it has been truncated, while in the list/dict case I see no robust way to do that without affecting the data structure. Another challenge with this PR is that the jobsummary document contains a/some set objects, which are not JSON-serializable, making it impossible to perform a json.dumps on the whole document. I am trying to figure out which field that is, though, and see if anything better - not too costly - can be done. Please let me know if you have any suggestions.
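For reference, a minimal demonstration of the set() serialization problem mentioned above, plus one possible way to still measure a document's size (the document contents here are made up):

```python
import json

doc = {"task": "/SomeWorkflow/Task1", "sites": {"T1_US_FNAL", "T2_CH_CERN"}}

try:
    json.dumps(doc)
except TypeError as exc:
    print(exc)  # Object of type set is not JSON serializable

# one possible workaround: serialize sets as lists just for the size check
print(len(json.dumps(doc, default=list)))
```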
Thank you @amaltaro for those clarifications. All looks good to me.
@todor-ivanov thank you for your review. On top of it, I made the 6th commit (for bookkeeping: ce3a313), which makes a few improvements to the truncating/resetting function. I was pretty sure I had had problems json-dumping the whole document, but I haven't seen those again in my tests from yesterday and today, so I am giving this another try now.
As just noticed, I need to enclose L562 in a try/except block as well.
I have this patch updated on vocms0255 and now everything is looking good in the logs. This patch became a little more complex because the `errors` value is actually a nested dictionary. I see no way to make the code simpler, but I am happy to hear any suggestions. @todor-ivanov if you review it again, commits 6 through 8 are the new code that you have not yet looked at.
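For context, handling that nesting can look roughly like this sketch (the layout of `errors` is an assumption here: step names mapping to lists of error dicts):

```python
def shrinkNestedErrors(errors, strLimit=5000):
    """Sketch: walk a nested (and assumed) errors structure two levels
    deep and truncate oversized string fields in place."""
    for step, errorList in errors.items():
        for errorDict in errorList:
            for field, value in errorDict.items():
                if isinstance(value, str) and len(value) > strLimit:
                    # keep only the first strLimit characters of the detail
                    errorDict[field] = value[:strLimit] + "... error message truncated ..."
```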
```python
    # keep only the first 5k characters
    doc[innerAttr] = doc[innerAttr][:5000]
    doc[innerAttr] += "... error message truncated ..."
# keep only the first 5k characters of the error detail
```
At the moment the initial error dictionary is created, is it assured that the next two fields used in the following two `for` loops are actually iterable (even though with zero length)?
Yes, the data type defaults to dict, and if it's empty, it doesn't even enter the first loop.
Thanks @amaltaro. I left only one comment inline, which is just a precaution.
- Check job summary document size as well
- Check job summary doc for large fields
- Catch dictionary as well
- Add extra message whenever errors is truncated
- Separate function for dealing with job summary docs
- Refactor shrinkLargeSummary to get a dump of the whole document
- Enclose in try/except a final jobsummary commit
- The errors value is actually a nested dictionary
Fixes #11771
Status
ready
Description
Previous code was buggy, as we never incremented `countDocs`. In addition, we should actually have used `self.maxBulkCommit`, given that this is how the database connection instance is created and how we define the limit for commits in bulk.

Now we also wrap `jsumdatabase` commits in a try/except, as we recently learned that those documents can be larger than the configured limit as well. Unfortunately, we have to deal with those slightly differently, mainly for 2 reasons: 1) the document contains set() objects, which are not serializable, thus we cannot get the full size of the document; 2) I picked 3 fields that can potentially be large (even though only `errors` has been a real example so far).

UPDATE: a list of 500 dicts for the `inputfiles` field adds around 100kB, so I think it's a good compromise.

It fixes a bug that I created in #11502.
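As a back-of-the-envelope check of that `inputfiles` estimate (the per-file fields here are invented), serializing 500 small file dicts lands in the same tens-of-kB ballpark:

```python
import json

# 500 hypothetical input-file records; the real fields differ,
# but the order of magnitude is what matters here
inputfiles = [{"lfn": f"/store/data/Run2023A/file_{i}.root",
               "events": 150000,
               "size": 2_000_000_000} for i in range(500)]

print(len(json.dumps(inputfiles)))  # a few tens of kB
```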
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
None