Global WorkQueue gets stuck #11186
On the agent side, apart from the regular document update conflicts, I see errors like:
I can see only one connection timeout error, related to the LogDB backend:
Connecting to the pod, I see the following error logs:
Then logging into the pod directly proves that the process is indeed missing:
It was pointed out by @muhammadimranfarooqi that the process is actually running, but the pid is wrongly recorded in the pid file. The same behavior shows up in the containers for all the rest of the services. I am not going to debug this pid number shift right now. The container was restarted, and we are going to monitor whether it comes back properly in the monitoring pages.
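For reference, checking a pid file against the live process is cheap to automate; a minimal sketch, assuming the pid file just contains the process id (the path below is illustrative, not the actual deployment layout):

```python
import os

PID_FILE = "/path/to/workqueue/component.pid"  # illustrative path

def pid_file_is_stale(pid_file):
    """Return True if the pid recorded in the file does not match a live process."""
    with open(pid_file) as fobj:
        pid = int(fobj.read().strip())
    try:
        os.kill(pid, 0)  # signal 0 probes for existence without killing anything
    except ProcessLookupError:
        return True      # no such process: the recorded pid is stale
    except PermissionError:
        return False     # process exists but belongs to another user
    return False

print(pid_file_is_stale(PID_FILE))
```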
Now the processes are proven to be present in the containers. I think we are failing to load some CouchDB views. Things get stuck here: WMCore/src/python/WMCore/Database/CMSCouch.py, lines 561 to 562 at commit 38bf266.
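Conceptually, loading a view is just an HTTP GET that can block while CouchDB (re)builds the index. Here is a minimal sketch of such a request with an explicit client-side timeout; the endpoint, database, and view names are illustrative, and this is not the CMSCouch code itself:

```python
import requests

COUCH_URL = "http://localhost:5984"  # illustrative endpoint
VIEW = "/workqueue/_design/WorkQueue/_view/elementsByStatus"  # illustrative names

try:
    # (connect, read) timeouts bound the call so a slow view build
    # cannot hang the caller indefinitely
    resp = requests.get(COUCH_URL + VIEW, params={"limit": 10}, timeout=(5, 30))
    resp.raise_for_status()
    print(len(resp.json().get("rows", [])), "rows returned")
except requests.exceptions.Timeout:
    print("view request timed out, the symptom described above")
```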
The full trace from the logs is here:
Here is the line that gives a clue about what is happening:
Here is the actual interface being called:
Here is the actual API call that is timing out:
In the end, we simply fail to collect and upload monitoring information to both the WMStats service status and MONIT, while the agents themselves seem to be working just fine.
If we take a look at this monitoring page: https://monit-grafana.cern.ch/goto/csqXi9Cnk?orgId=11 we can see that this same thing has happened before. And here: https://monit-grafana.cern.ch/goto/kbOIZ9jnz?orgId=11 we can see a sudden increase.
Thanks for following this up, Todor. Global workqueue shouldn't be timing out with only 62k elements in the database. There must be something else, either on:
Anyhow, these are just possibilities/guesses and we need to start ruling them out.
I also had a look at the CouchDB logs and I see successful requests to that view every now and then:
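A quick way to pull only those requests out of couch.log; the log path and view fragment below are illustrative:

```python
# Filter couch.log for successful requests to a given view (names illustrative)
VIEW_FRAGMENT = "_design/WorkQueue/_view"

with open("couch.log") as logfile:
    for line in logfile:
        if VIEW_FRAGMENT in line and " 200 " in line:  # HTTP 200 responses only
            print(line.rstrip())
```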
Correlating these couch.log records with the moment workqueue actually made this HTTP request gives me:
Besides the difference in timestamps (workqueue logs are in CERN local time while CouchDB logs are in UTC, hence the 2h offset), we can see that CouchDB is actually serving that request in a second or two. Something is wrong in getting these packets back to the workqueue pod!
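To make that comparison mechanical, both timestamps can be made timezone-aware before subtracting; a small sketch with made-up placeholder timestamps:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Placeholder timestamps, not the actual log lines
couch_ts = "2022-06-21 08:15:42"  # couch.log, UTC
wq_ts = "2022-06-21 10:15:44"     # workqueue log, CERN local time (CEST)

fmt = "%Y-%m-%d %H:%M:%S"
utc = datetime.strptime(couch_ts, fmt).replace(tzinfo=ZoneInfo("UTC"))
local = datetime.strptime(wq_ts, fmt).replace(tzinfo=ZoneInfo("Europe/Zurich"))

# Once both are timezone-aware, the apparent 2h gap collapses to seconds
print((local - utc).total_seconds())  # -> 2.0
```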
Hi @amaltaro, I do not know whether you took any actions during the night, but I can see the monitoring data has been back in place since midnight.
It seems CouchDB is misbehaving again, and this time we also have [1]:
This could be correlated with an extra load of workqueue elements in the system. Here is the very same moment on Jun 17th that I was talking about above [1][2][3][4], this time from the agents' perspective: we see a sudden increase of local workqueue elements in all the agents at the very same time, almost doubling actually. It seems to me we either have trouble digesting something, or we have trouble pushing updates to central CouchDB, hence the local workqueue elements at the agents pile up without moving forward. [1] [2]
Looking at the WMAgent Grafana dashboard, it seems to have started on 17/Jun/2022, around 1am. This matches the moment where I deployed and started a RelVal agent (vocms0259), the first agent connected to production using CouchDB 3.x.

A couple of days ago we discovered that the RelVal agent was failing to replicate from the central workqueue (timing out because there were too many deleted documents), possibly because of an issue with CouchDB itself. After many changes to the CouchDB replicator configuration in vocms0259, we could not find a way to get data properly replicating (even with the CouchDB developers' help!). The only solution was to rotate (replicate) the central workqueue database and get rid of all the deleted documents. Imran was super kind and helpful and we got the workqueue rotation done yesterday evening CERN time (with an outage for that backend). Now we have a fresh and clean workqueue database and overall WM is looking stable.

Last but not least, I updated the following wiki with some data for this procedure, such that we can come up with a better ETA for future cases:

I am closing this now, but if timeouts strike again in the coming day or two, please reopen it.
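For future reference, the rotation boils down to replicating the live database into a fresh target and then swapping them. A minimal sketch against CouchDB's standard /_replicate endpoint; the endpoint, credentials, and database names are illustrative, and the exact filtering used to leave the deleted-document tombstones behind is not shown in this thread:

```python
import requests

COUCH = "http://localhost:5984"  # illustrative endpoint
AUTH = ("admin", "secret")       # illustrative credentials

# One-off replication into a brand-new database; for very large databases
# a persistent job via the _replicator database is usually preferred
payload = {
    "source": f"{COUCH}/workqueue",
    "target": f"{COUCH}/workqueue_rotated",
    "create_target": True,
}
resp = requests.post(f"{COUCH}/_replicate", json=payload, auth=AUTH, timeout=600)
resp.raise_for_status()
print(resp.json())
```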
Impact of the bug
Global WorkQueue
Describe the bug
Today we started experiencing some Heartbeat timeouts in the WorkQueue. This leads to a broken status for the component/service in WMStats. The logs point to a pycurl timeout error:
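For context, pycurl reports a timeout as error 28 (E_OPERATION_TIMEDOUT). A minimal standalone sketch, with placeholder URL and timeout values, of how such an error surfaces:

```python
import io
import pycurl

buf = io.BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, "https://example.cern.ch/couchdb/workqueue")  # placeholder URL
curl.setopt(pycurl.CONNECTTIMEOUT, 30)  # seconds allowed to establish the connection
curl.setopt(pycurl.TIMEOUT, 270)        # hard cap on the whole transfer
curl.setopt(pycurl.WRITEDATA, buf)
try:
    curl.perform()
    print("HTTP", curl.getinfo(pycurl.RESPONSE_CODE))
except pycurl.error as exc:
    code, msg = exc.args
    if code == pycurl.E_OPERATION_TIMEDOUT:  # error 28, the generic timeout
        print("timed out:", msg)
    else:
        raise
finally:
    curl.close()
```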
Another place to look at the current state is:
https://monit-grafana.cern.ch/goto/Z5NWYXCnk?orgId=11
It still needs to be checked on the agent side whether the error is the same.
How to reproduce it
Not clear yet.
Expected behavior
The workqueue should not get stuck.
Additional context and error message