Global WorkQueue gets stuck #11186

Closed
todor-ivanov opened this issue Jun 17, 2022 · 14 comments

Comments

@todor-ivanov
Contributor

todor-ivanov commented Jun 17, 2022

Impact of the bug
Global WorkQueue

Describe the bug
Today we started experiencing Heartbeat timeouts in the WorkQueue. This leads to a broken status for the component/service in WMStats. The logs point to a pycurl timeout error:

ERROR:heartbeatMonitor: pycurl.error: (28, 'Operation timed out after 270000 milliseconds with 36705889 bytes received')

Another place to look at the current state is:
https://monit-grafana.cern.ch/goto/Z5NWYXCnk?orgId=11

It needs to be checked on the agent side whether the error is the same.

How to reproduce it
Not clear yet.

Expected behavior
The workqueue should not get stuck.

Additional context and error message

@todor-ivanov
Contributor Author

At the agent, other than the regular document update conflicts like:

2022-06-17 11:22:26,770:140249310566144:ERROR:WorkQueueBackend:Couch error saving element: "73947fb4c21b994873d0c38842abf805", error "c

I can see only one connection timeout error, related to the LogDB backend:

2022-06-17 13:40:22,396:140249310566144:ERROR:LogDB:LogDBBackend post API failed, error=(28, 'Operation timed out after 270001 millisec

@todor-ivanov
Contributor Author

Connecting to the pod, I see the following error logs:

$ kubectl logs workqueue-69bf455676-sj7ks -n dmwm workqueue
INFO Files in /etc/grid-security
...
stopping workqueue
globalworkqueue not running, not killing
starting workqueue
globalworkqueue not running, not killing
sendmail: 530 5.7.1 Client was not authenticated

Logging into the pod directly then proves the process is indeed missing:

[_workqueue@workqueue-69bf455676-sj7ks srv]$ cat  state/workqueue/globalworkqueue.pid 
94
[_workqueue@workqueue-69bf455676-sj7ks srv]$ 
[_workqueue@workqueue-69bf455676-sj7ks srv]$ 
[_workqueue@workqueue-69bf455676-sj7ks srv]$ ps auxf 
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
_workqu+  148419  0.0  0.0 115700  3672 pts/0    Ss   16:22   0:00 bash
_workqu+  149940  0.0  0.0 155464  4024 pts/0    R+   16:24   0:00  \_ ps auxf
_workqu+       1  0.0  0.0  11712  2748 ?        Ss   Jun09   0:00 /bin/bash /opt/setup-certs-and-run/setup-certs-and-run.sh
_workqu+      10  0.0  0.0  11708  2692 ?        S    Jun09   0:00 /bin/bash /data/run.sh
root          95  0.0  0.0 131536  7232 ?        S    Jun09   0:00  \_ sudo /usr/sbin/crond -n
root          98  0.0  0.0  22712  2672 ?        S    Jun09   0:01      \_ /usr/sbin/crond -n
_workqu+      93  0.0  0.0  30836  3568 ?        S    Jun09   0:07 rotatelogs /data/srv/logs/workqueue/workqueue-%Y%m%d-workqueue-69bf4
_workqu+      96  0.0  0.0 195320 24744 ?        S    Jun09   0:00 python3 /data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.p
_workqu+      97 10.2  2.5 2322296 752336 ?      Sl   Jun09 1219:10  \_ python3 /data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.

@todor-ivanov
Contributor Author

It was pointed out by @muhammadimranfarooqi that the process is actually running, but the pid is wrongly recorded in the pid file. This behavior is apparently the same for the containers of all the other services. I am not going to debug this pid number shift right now.

The container was restarted. We are going to monitor whether it comes back properly in the monitoring pages.
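
For the record, a minimal sketch of how the pid recorded in the pid file can be cross-checked against a live process; the absolute pid file path below is an assumption based on the relative path seen above:

import os

# Assumed absolute path of the pid file seen above (relative to the srv directory in the pod).
PIDFILE = "/data/srv/state/workqueue/globalworkqueue.pid"

def pid_is_alive(pidfile=PIDFILE):
    """Return True if the pid recorded in the pid file maps to a running process."""
    with open(pidfile) as fd:
        pid = int(fd.read().strip())
    try:
        os.kill(pid, 0)   # signal 0: existence/permission check only, nothing is sent
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True       # the process exists but belongs to another user

print("globalworkqueue pid alive:", pid_is_alive())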

@todor-ivanov
Contributor Author

Now the processes are confirmed to be present in the containers. I think we are failing to load some CouchDB views. Things get stuck here:

retval = self.post('/%s/_design/%s/_view/%s?%s' % \
(self.name, design, view, data), {'keys': keys})
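
For reference, a rough sketch of the URI that post() call ends up building for this monitoring query; the parameter values below are illustrative (they mirror the couch.log entries shown further down), not taken from the code:

from urllib.parse import urlencode

# Illustrative values mimicking CMSCouch.Database.loadView() for the monitoring call.
dbname, design, view = "workqueue", "WorkQueue", "elementsByStatus"
options = {"stale": "update_after", "include_docs": "true"}

uri = '/%s/_design/%s/_view/%s?%s' % (dbname, design, view, urlencode(options))
print(uri)
# /workqueue/_design/WorkQueue/_view/elementsByStatus?stale=update_after&include_docs=true
# The element statuses to query go into the POST body as {'keys': [...]}.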

@todor-ivanov
Contributor Author

The full trace from the logs is here:

ERROR:heartbeatMonitor:Periodic Thread ERROR pycurl.error (28, 'Operation timed out after 270000 milliseconds with 38463367 bytes received')
ERROR:heartbeatMonitor: Traceback (most recent call last):
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/REST/CherryPyPeriodicTask.py", line 95, in run
ERROR:heartbeatMonitor:     self.taskFunc(self.config)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/REST/HeartbeatMonitorBase.py", line 36, in reportToWMStats
ERROR:heartbeatMonitor:     monitorInfo = self.addAdditionalMonitorReport(config)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/GlobalWorkQueue/CherryPyThreads/HeartbeatMonitor.py", line 27, in addAdditionalMonitorReport
ERROR:heartbeatMonitor:     results = self.globalQ.monitorWorkQueue(self.initialStatus)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1210, in monitorWorkQueue
ERROR:heartbeatMonitor:     elements = self.workqueueDS.getElementsByStatus(status)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/WorkQueue/WorkQueue.py", line 397, in getElementsByStatus
ERROR:heartbeatMonitor:     data = db.loadView('WorkQueue', 'elementsByStatus', options, status)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 561, in loadView
ERROR:heartbeatMonitor:     retval = self.post('/%s/_design/%s/_view/%s?%s' % \
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 140, in post
ERROR:heartbeatMonitor:     return self.makeRequest(uri, data, 'POST', incoming_headers,
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 131, in makeRequest
ERROR:heartbeatMonitor:     result, status, reason, cached = JSONRequests.makeRequest(
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 171, in makeRequest
ERROR:heartbeatMonitor:     result, response = self.makeRequest_pycurl(uri, data, verb, headers)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 188, in makeRequest_pycurl
ERROR:heartbeatMonitor:     response, result = self.reqmgr.request(uri, data, headers, verb=verb,
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/Utils/PortForward.py", line 67, in portMangle
ERROR:heartbeatMonitor:     return callFunc(callObj, newUrl, *args, **kwargs)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 327, in request
ERROR:heartbeatMonitor:     curl.perform()
ERROR:heartbeatMonitor: pycurl.error: (28, 'Operation timed out after 270000 milliseconds with 38463367 bytes received')
ERROR:reqmgrInteraction:LogDBBackend post API failed, error=(28, 'Operation timed out after 270000 milliseconds with 0 bytes received')

@todor-ivanov
Contributor Author

Here is the line that gives the clue about what is happening:

ERROR:heartbeatMonitor:     monitorInfo = self.addAdditionalMonitorReport(config)

Here is the actual interface being called:

ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/WorkQueue/WorkQueue.py", line 397, in getElementsByStatus
ERROR:heartbeatMonitor:     data = db.loadView('WorkQueue', 'elementsByStatus', options, status)

Here is the actual API call that is timing out:
https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/_view/elementsByStatus?include_docs=true
I can see the same behavior myself if I try to load that URL in a browser. It seems all of this is just a matter of overloading the service while trying to fetch monitoring information. The API above is quite heavy by itself.
@amaltaro please confirm my observation!
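
A minimal sketch of how that call can be timed outside of the service, with roughly the same 270 s timeout that pycurl reports in the traceback (the certificate paths are placeholders; cmsweb requires a valid grid certificate or proxy):

import io
import pycurl

URL = ("https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/"
       "_view/elementsByStatus?include_docs=true")

buf = io.BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, URL)
curl.setopt(pycurl.WRITEDATA, buf)
curl.setopt(pycurl.TIMEOUT, 270)                      # same order as the service timeout
curl.setopt(pycurl.SSLCERT, "/path/to/usercert.pem")  # placeholder credentials
curl.setopt(pycurl.SSLKEY, "/path/to/userkey.pem")
try:
    curl.perform()
    print("HTTP %s, %.1fs, %d bytes" % (curl.getinfo(pycurl.RESPONSE_CODE),
                                        curl.getinfo(pycurl.TOTAL_TIME),
                                        len(buf.getvalue())))
except pycurl.error as exc:
    print("pycurl error:", exc)   # error 28 is the same timeout seen in the heartbeat logs
finally:
    curl.close()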

@todor-ivanov
Contributor Author

todor-ivanov commented Jun 17, 2022

And in the end we just fail to collect and upload monitoring information to both the WMStats service status and MONIT, while the agents seem to be working just fine.

@todor-ivanov
Contributor Author

If we take a look at this monitoring page: https://monit-grafana.cern.ch/goto/csqXi9Cnk?orgId=11 the same thing happened with ReqMgr, with just a spike at around 08:00 CERN time this morning, followed by a clearly distinguishable increase of the load a little later. Was there a huge injection happening at that time? It seems something changed in the incoming load which caused the misbehavior.

And here https://monit-grafana.cern.ch/goto/kbOIZ9jnz?orgId=11 we can see a sudden increase of rejected workflows which overlaps with the short monitoring data interruption for ReqMgr. This glitch marks the moment when the events causing the current situation were triggered. @haozturk, are you aware of any activities around that same time?

@amaltaro
Contributor

Thanks for following this up, Todor.

Global workqueue shouldn't be timing out with only 62k elements in the database. There must be something else going on, either with:

  • CERN resources/network and/or workqueue VM
  • kubernetes
  • CouchDB load
  • external users running something heavier
  • any other activity from the CMSWEB team (I heard this morning there is a new testbed cluster. If there is a new production cluster, it could be that we have multiple services querying the same databases).

Anyhow, these are just possibilities/guesses and we need to start ruling them out.

@amaltaro
Contributor

The workqueue database is being compacted at the moment. That task has been running for 1h40min now and progress is at 41%, so it will likely take another 2h before the compaction is over.
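
For completeness, a small stdlib-only sketch of how the compaction progress can be followed via CouchDB's /_active_tasks endpoint; the host and credentials below are placeholders (admin rights are needed on CouchDB 3.x):

import base64
import json
from urllib.request import Request, urlopen

# Placeholder host and credentials; run against the actual CouchDB backend.
COUCH_URL = "http://localhost:5984/_active_tasks"
AUTH = "Basic " + base64.b64encode(b"admin:password").decode("ascii")

req = Request(COUCH_URL, headers={"Authorization": AUTH})
with urlopen(req) as resp:
    tasks = json.load(resp)

for task in tasks:
    if task.get("type") == "database_compaction":
        print("%s: %s%% done" % (task.get("database"), task.get("progress")))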

I also had a look at the CouchDB logs and I see successful requests to that view every now and then:

[Fri, 17 Jun 2022 19:33:57 GMT] [info] [<0.7708.381>] 188.185.89.194 - - POST /workqueue/_design/WorkQueue/_view/elementsByStatus?stale=update_after&include_docs=true 200
...
[Fri, 17 Jun 2022 19:52:25 GMT] [info] [<0.28034.381>] 188.185.89.202 - - POST /workqueue/_design/WorkQueue/_view/elementsByStatus?stale=update_after&include_docs=true 200

Correlating these couch.log records with the moments workqueue actually made this HTTP request gives me:

2022-06-17 21:33:56,724:INFO:HeartbeatMonitorBase:Checking Thread status...
2022-06-17 21:33:56,886:INFO:HeartbeatMonitor:Collecting GlobalWorkqueue statistics...
2022-06-17 21:38:27,674:ERROR:CherryPyPeriodicTask:Periodic Thread ERROR pycurl.error (28, 'Operation timed out after 270000 milliseconds with 38243989 bytes received')
2022-06-17 21:38:27,676:ERROR:CherryPyPeriodicTask: Traceback (most recent call last):
...
2022-06-17 21:52:23,972:INFO:HeartbeatMonitorBase:Checking Thread status...
2022-06-17 21:52:24,533:INFO:HeartbeatMonitor:Collecting GlobalWorkqueue statistics...

Besides the difference in timestamps (workqueue logs are in CERN local time while couch.log is in UTC, thus the 2h offset), we can see that CouchDB is actually serving that request in a second or two. Something is wrong in getting these packets back to the workqueue pod!
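
For clarity, the same correlation as a tiny worked example (the 2h offset between CERN local time and UTC is hard-coded here as an assumption for CEST in June):

from datetime import datetime, timedelta

# workqueue logs are in CERN local time (CEST = UTC+2 in June), couch.log is in UTC.
wq_local = datetime.strptime("2022-06-17 21:33:56", "%Y-%m-%d %H:%M:%S")
couch_utc = datetime.strptime("2022-06-17 19:33:57", "%Y-%m-%d %H:%M:%S")

wq_utc = wq_local - timedelta(hours=2)
print("request issued (UTC):", wq_utc)
print("served by CouchDB in:", couch_utc - wq_utc)   # about one second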

@todor-ivanov
Contributor Author

Hi @amaltaro, I do not know whether you took any actions during the night, but I can see the monitoring data has been back in place since midnight.

@todor-ivanov
Contributor Author

It seems CouchDB is misbehaving again, and this time we also have the JobCreator timing out [1]. All the agents are currently in Error in WMStats. This also leads to a loss of pressure on some sites (reported by P&R).

[1]

2022-06-18 18:55:20,396:140249312208640:ERROR:ChangeState:Traceback (most recent call last):
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/JobStateMachine/ChangeState.py", line 178, in propagate
    self.recordInCouch(jobs, newstate, oldstate, updatesummary)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/JobStateMachine/ChangeState.py", line 278, in recordInCouch
    self.jobsdatabase.queue(jobDocument, callback=discardConflictingDocument)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 237, in queue
    self.commit(viewlist=viewlist, callback=callback)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 303, in commit
    retval = self.post(uri, data)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 140, in post
    return self.makeRequest(uri, data, 'POST', incoming_headers,
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 131, in makeRequest
    result, status, reason, cached = JSONRequests.makeRequest(
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 171, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 188, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/Utils/PortForward.py", line 69, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 316, in request
    curl.perform()
pycurl.error: (28, 'Operation timed out after 270001 milliseconds with 0 bytes received')

@todor-ivanov
Contributor Author

This could be correlated with an extra load of workqueue elements in the system. Here is that very same moment on Jun 17th I was talking about above [1][2][3][4], this time from the agents' perspective: we see a sudden increase of local workqueue elements in all the agents at the very same time, almost doubling actually. It seems to me we either have trouble digesting something, or we have trouble pushing updates to the central Couch, hence the local workqueue elements at the agents keep piling up without moving forward.

[1]
https://monit-grafana.cern.ch/goto/qfFrmZq7k?orgId=11

[2]
https://monit-grafana.cern.ch/goto/_98rmZqnk?orgId=11
[3]
https://monit-grafana.cern.ch/goto/IQGjiZ3nz?orgId=11
[4]
https://monit-grafana.cern.ch/goto/CEJjiZqnz?orgId=11

@amaltaro amaltaro self-assigned this Jun 23, 2022
@amaltaro
Contributor

Looking at the WMAgent Grafana dashboard, it seems to have started on 17/Jun/2022, around 1am. This matches the moment when I deployed and started a RelVal agent (vocms0259), the first agent connected to production using CouchDB 3.x.

A couple of days ago we discovered that the RelVal agent was failing to replicate from the central workqueue (timing out because there were too many deleted documents), and possibly because of an issue with CouchDB itself:
apache/couchdb#4078

After many changes to the CouchDB replicator configuration in vocms0259, we could not find a way to get data replicating properly (even with the CouchDB developers' help!). The only solution was to rotate (replicate) the central workqueue database and get rid of all the deleted documents.

Imran was super kind and helpful and we got the workqueue rotation done yesterday evening CERN time (with an outage for that backend). Now we have a fresh and clean workqueue database and overall WM is looking stable.
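
For future reference, one known way to do such a rotation is a filtered replication that skips the deleted-document tombstones; below is a hedged sketch with placeholder hosts and credentials, not the exact commands we ran (the actual procedure is documented in the wiki linked below):

import base64
import json
from urllib.request import Request, urlopen

# Placeholders only; the real hosts, credentials and steps are in the wiki linked below.
COUCH = "http://localhost:5984"
AUTH = "Basic " + base64.b64encode(b"admin:password").decode("ascii")

def couch(method, path, body=None):
    """Tiny helper around CouchDB's HTTP API (stdlib only)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = Request(COUCH + path, data=data, method=method,
                  headers={"Content-Type": "application/json", "Authorization": AUTH})
    with urlopen(req) as resp:
        return json.load(resp)

# 1) A replication filter that drops deleted documents (tombstones).
couch("PUT", "/workqueue/_design/rotation",
      {"filters": {"skipdeleted": "function(doc, req) { return !doc._deleted; }"}})

# 2) Replicate only the live documents into a fresh database.
print(couch("POST", "/_replicate", {
    "source": "http://admin:password@localhost:5984/workqueue",
    "target": "http://admin:password@localhost:5984/workqueue_new",
    "create_target": True,
    "filter": "rotation/skipdeleted",
}))
# Afterwards the services are switched over to the new database and the old
# one (with all its tombstones) can be deleted.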

Last but not least, I updated the following wiki with some data for this procedure, such that we can come up with a better ETA for future cases:
https://github.com/dmwm/WMCore/wiki/workqueue-couch-db-replication-procedure

I am closing it now, but if timeouts strike again in the coming day or two, please reopen it.
