Global WorkQueue gets stuck #11186

Closed
todor-ivanov opened this issue Jun 17, 2022 · 14 comments

Comments

@todor-ivanov
Contributor

todor-ivanov commented Jun 17, 2022

Impact of the bug
Global WorkQueue

Describe the bug
Today we started experiencing Heartbeat timeouts in the WorkQueue. This leads to a broken status for the component/service in WMStats. The logs point to a pycurl timeout error:

ERROR:heartbeatMonitor: pycurl.error: (28, 'Operation timed out after 270000 milliseconds with 36705889 bytes received')

Another place to look at the current state is:
https://monit-grafana.cern.ch/goto/Z5NWYXCnk?orgId=11

It needs to be checked on the agent side whether the error is the same.

How to reproduce it
Not clear yet.

Expected behavior
The workqueue should not get stuck.

Additional context and error message

@todor-ivanov
Contributor Author

At the agent, other than the regular document update conflicts like:

2022-06-17 11:22:26,770:140249310566144:ERROR:WorkQueueBackend:Couch error saving element: "73947fb4c21b994873d0c38842abf805", error "c

I can see only one connection timeout error, related to the LogDB backend:

2022-06-17 13:40:22,396:140249310566144:ERROR:LogDB:LogDBBackend post API failed, error=(28, 'Operation timed out after 270001 millisec

@todor-ivanov
Contributor Author

Connecting to the pod, I see the following error logs:

$ kubectl logs workqueue-69bf455676-sj7ks -n dmwm workqueue
INFO Files in /etc/grid-security
...
stopping workqueue
globalworkqueue not running, not killing
starting workqueue
globalworkqueue not running, not killing
sendmail: 530 5.7.1 Client was not authenticated

Logging into the pod directly then proves the process is indeed missing:

[_workqueue@workqueue-69bf455676-sj7ks srv]$ cat  state/workqueue/globalworkqueue.pid 
94
[_workqueue@workqueue-69bf455676-sj7ks srv]$ 
[_workqueue@workqueue-69bf455676-sj7ks srv]$ 
[_workqueue@workqueue-69bf455676-sj7ks srv]$ ps auxf 
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
_workqu+  148419  0.0  0.0 115700  3672 pts/0    Ss   16:22   0:00 bash
_workqu+  149940  0.0  0.0 155464  4024 pts/0    R+   16:24   0:00  \_ ps auxf
_workqu+       1  0.0  0.0  11712  2748 ?        Ss   Jun09   0:00 /bin/bash /opt/setup-certs-and-run/setup-certs-and-run.sh
_workqu+      10  0.0  0.0  11708  2692 ?        S    Jun09   0:00 /bin/bash /data/run.sh
root          95  0.0  0.0 131536  7232 ?        S    Jun09   0:00  \_ sudo /usr/sbin/crond -n
root          98  0.0  0.0  22712  2672 ?        S    Jun09   0:01      \_ /usr/sbin/crond -n
_workqu+      93  0.0  0.0  30836  3568 ?        S    Jun09   0:07 rotatelogs /data/srv/logs/workqueue/workqueue-%Y%m%d-workqueue-69bf4
_workqu+      96  0.0  0.0 195320 24744 ?        S    Jun09   0:00 python3 /data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.p
_workqu+      97 10.2  2.5 2322296 752336 ?      Sl   Jun09 1219:10  \_ python3 /data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.

@todor-ivanov
Contributor Author

It was pointed out by @muhammadimranfarooqi that the process is actually running, but the pid is wrongly recorded in the pid file. This behavior is apparently the same for the containers of all the other services. I am not going to debug this pid number shift right now.

The container was restarted. We are going to monitor whether it comes back properly in the monitoring pages.
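
For the record, a minimal sketch of how the pid recorded in the pid file can be cross-checked against a live process; the absolute pid file path below is an assumption based on the relative path seen above:

import os

# Assumed absolute path of the pid file seen above (relative to the srv directory in the pod).
PIDFILE = "/data/srv/state/workqueue/globalworkqueue.pid"

def pid_is_alive(pidfile=PIDFILE):
    """Return True if the pid recorded in the pid file maps to a running process."""
    with open(pidfile) as fd:
        pid = int(fd.read().strip())
    try:
        os.kill(pid, 0)   # signal 0: existence/permission check only, nothing is sent
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True       # the process exists but belongs to another user

print("globalworkqueue pid alive:", pid_is_alive())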

@todor-ivanov
Contributor Author

Now the processes are confirmed to be present in the containers. I think we are failing to load some CouchDB views. Things get stuck here:

retval = self.post('/%s/_design/%s/_view/%s?%s' % \
(self.name, design, view, data), {'keys': keys})
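
For reference, a rough sketch of the URI that post() call ends up building for this monitoring query; the parameter values below are illustrative (they mirror the couch.log entries shown further down), not taken from the code:

from urllib.parse import urlencode

# Illustrative values mimicking CMSCouch.Database.loadView() for the monitoring call.
dbname, design, view = "workqueue", "WorkQueue", "elementsByStatus"
options = {"stale": "update_after", "include_docs": "true"}

uri = '/%s/_design/%s/_view/%s?%s' % (dbname, design, view, urlencode(options))
print(uri)
# /workqueue/_design/WorkQueue/_view/elementsByStatus?stale=update_after&include_docs=true
# The element statuses to query go into the POST body as {'keys': [...]}.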

@todor-ivanov
Contributor Author

The full trace from the logs is here:

ERROR:heartbeatMonitor:Periodic Thread ERROR pycurl.error (28, 'Operation timed out after 270000 milliseconds with 38463367 bytes received')
ERROR:heartbeatMonitor: Traceback (most recent call last):
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/REST/CherryPyPeriodicTask.py", line 95, in run
ERROR:heartbeatMonitor:     self.taskFunc(self.config)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/REST/HeartbeatMonitorBase.py", line 36, in reportToWMStats
ERROR:heartbeatMonitor:     monitorInfo = self.addAdditionalMonitorReport(config)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/GlobalWorkQueue/CherryPyThreads/HeartbeatMonitor.py", line 27, in addAdditionalMonitorReport
ERROR:heartbeatMonitor:     results = self.globalQ.monitorWorkQueue(self.initialStatus)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1210, in monitorWorkQueue
ERROR:heartbeatMonitor:     elements = self.workqueueDS.getElementsByStatus(status)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/WorkQueue/WorkQueue.py", line 397, in getElementsByStatus
ERROR:heartbeatMonitor:     data = db.loadView('WorkQueue', 'elementsByStatus', options, status)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 561, in loadView
ERROR:heartbeatMonitor:     retval = self.post('/%s/_design/%s/_view/%s?%s' % \
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 140, in post
ERROR:heartbeatMonitor:     return self.makeRequest(uri, data, 'POST', incoming_headers,
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 131, in makeRequest
ERROR:heartbeatMonitor:     result, status, reason, cached = JSONRequests.makeRequest(
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 171, in makeRequest
ERROR:heartbeatMonitor:     result, response = self.makeRequest_pycurl(uri, data, verb, headers)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 188, in makeRequest_pycurl
ERROR:heartbeatMonitor:     response, result = self.reqmgr.request(uri, data, headers, verb=verb,
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/Utils/PortForward.py", line 67, in portMangle
ERROR:heartbeatMonitor:     return callFunc(callObj, newUrl, *args, **kwargs)
ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 327, in request
ERROR:heartbeatMonitor:     curl.perform()
ERROR:heartbeatMonitor: pycurl.error: (28, 'Operation timed out after 270000 milliseconds with 38463367 bytes received')
ERROR:reqmgrInteraction:LogDBBackend post API failed, error=(28, 'Operation timed out after 270000 milliseconds with 0 bytes received')

@todor-ivanov
Contributor Author

Here is the line that gives the clue about what is happening:

ERROR:heartbeatMonitor:     monitorInfo = self.addAdditionalMonitorReport(config)

Here is the actual interface being called:

ERROR:heartbeatMonitor:   File "/data/srv/HG2206e/sw/slc7_amd64_gcc630/cms/workqueue/2.0.4.pre4/lib/python3.8/site-packages/WMCore/Services/WorkQueue/WorkQueue.py", line 397, in getElementsByStatus
ERROR:heartbeatMonitor:     data = db.loadView('WorkQueue', 'elementsByStatus', options, status)

Here is the actual API call that is timing out:
https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/_view/elementsByStatus?include_docs=true
I can see the same behavior myself if I try to load that URL in a browser. It seems all of this is just a matter of overloading the service while trying to fetch monitoring information. The API above is quite heavy by itself.
@amaltaro please confirm my observation!
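
A minimal sketch of how that call can be timed outside of the service, with roughly the same 270 s timeout that pycurl reports in the traceback (the certificate paths are placeholders; cmsweb requires a valid grid certificate or proxy):

import io
import pycurl

URL = ("https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/"
       "_view/elementsByStatus?include_docs=true")

buf = io.BytesIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, URL)
curl.setopt(pycurl.WRITEDATA, buf)
curl.setopt(pycurl.TIMEOUT, 270)                      # same order as the service timeout
curl.setopt(pycurl.SSLCERT, "/path/to/usercert.pem")  # placeholder credentials
curl.setopt(pycurl.SSLKEY, "/path/to/userkey.pem")
try:
    curl.perform()
    print("HTTP %s, %.1fs, %d bytes" % (curl.getinfo(pycurl.RESPONSE_CODE),
                                        curl.getinfo(pycurl.TOTAL_TIME),
                                        len(buf.getvalue())))
except pycurl.error as exc:
    print("pycurl error:", exc)   # error 28 is the same timeout seen in the heartbeat logs
finally:
    curl.close()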

@todor-ivanov
Contributor Author

todor-ivanov commented Jun 17, 2022

And in the end we just fail to collect and upload monitoring information to both the WMStats service status and MONIT, while the agents seem to be working just fine.

@todor-ivanov
Contributor Author

If we take a look at this monitoring page: https://monit-grafana.cern.ch/goto/csqXi9Cnk?orgId=11 the same thing happened with ReqMgr, with just a spike at around 08:00 CERN time this morning, followed by a clearly distinguishable increase of the load a little later. Was there a huge injection happening at that time? It seems something changed in the incoming load which caused the misbehavior.

And here https://monit-grafana.cern.ch/goto/kbOIZ9jnz?orgId=11 we can see a sudden increase of rejected workflows which overlaps with the short monitoring data interruption for ReqMgr. This glitch marks the moment when the events causing the current situation were triggered. @haozturk, are you aware of any activities around that same time?

@amaltaro
Contributor

Thanks for following this up, Todor.

Global workqueue shouldn't be timing out with only 62k elements in the database. There must be something else going on, either with:

  • CERN resources/network and/or workqueue VM
  • kubernetes
  • CouchDB load
  • external users running something heavier
  • any other activity from the CMSWEB team (I heard this morning there is a new testbed cluster. If there is a new production cluster, it could be that we have multiple services querying the same databases).

Anyhow, these are just possibilities/guesses and we need to start ruling them out.

@amaltaro
Contributor

The workqueue database is being compacted at the moment. That task has been running for 1h40min now and progress is at 41%, so it will likely take another 2h before the compaction is over.
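
For completeness, a small stdlib-only sketch of how the compaction progress can be followed via CouchDB's /_active_tasks endpoint; the host and credentials below are placeholders (admin rights are needed on CouchDB 3.x):

import base64
import json
from urllib.request import Request, urlopen

# Placeholder host and credentials; run against the actual CouchDB backend.
COUCH_URL = "http://localhost:5984/_active_tasks"
AUTH = "Basic " + base64.b64encode(b"admin:password").decode("ascii")

req = Request(COUCH_URL, headers={"Authorization": AUTH})
with urlopen(req) as resp:
    tasks = json.load(resp)

for task in tasks:
    if task.get("type") == "database_compaction":
        print("%s: %s%% done" % (task.get("database"), task.get("progress")))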

I also had a look at the CouchDB logs and I see successful requests to that view every now and then:

[Fri, 17 Jun 2022 19:33:57 GMT] [info] [<0.7708.381>] 188.185.89.194 - - POST /workqueue/_design/WorkQueue/_view/elementsByStatus?stale=update_after&include_docs=true 200
...
[Fri, 17 Jun 2022 19:52:25 GMT] [info] [<0.28034.381>] 188.185.89.202 - - POST /workqueue/_design/WorkQueue/_view/elementsByStatus?stale=update_after&include_docs=true 200

Correlating these couch.log records with the moments workqueue actually made this HTTP request gives me:

2022-06-17 21:33:56,724:INFO:HeartbeatMonitorBase:Checking Thread status...
2022-06-17 21:33:56,886:INFO:HeartbeatMonitor:Collecting GlobalWorkqueue statistics...
2022-06-17 21:38:27,674:ERROR:CherryPyPeriodicTask:Periodic Thread ERROR pycurl.error (28, 'Operation timed out after 270000 milliseconds with 38243989 bytes received')
2022-06-17 21:38:27,676:ERROR:CherryPyPeriodicTask: Traceback (most recent call last):
...
2022-06-17 21:52:23,972:INFO:HeartbeatMonitorBase:Checking Thread status...
2022-06-17 21:52:24,533:INFO:HeartbeatMonitor:Collecting GlobalWorkqueue statistics...

Besides the difference in timestamps (workqueue logs are in CERN local time while couch.log is in UTC, thus the 2h offset), we can see that CouchDB is actually serving that request in a second or two. Something is wrong in getting these packets back to the workqueue pod!
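
For clarity, the same correlation as a tiny worked example (the 2h offset between CERN local time and UTC is hard-coded here as an assumption for CEST in June):

from datetime import datetime, timedelta

# workqueue logs are in CERN local time (CEST = UTC+2 in June), couch.log is in UTC.
wq_local = datetime.strptime("2022-06-17 21:33:56", "%Y-%m-%d %H:%M:%S")
couch_utc = datetime.strptime("2022-06-17 19:33:57", "%Y-%m-%d %H:%M:%S")

wq_utc = wq_local - timedelta(hours=2)
print("request issued (UTC):", wq_utc)
print("served by CouchDB in:", couch_utc - wq_utc)   # about one second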

@todor-ivanov
Contributor Author

Hi @amaltaro, I do not know whether you took any actions during the night, but I can see the monitoring data has been back in place since midnight.

@todor-ivanov
Contributor Author

It seems CouchDB is misbehaving again, and this time we also have the JobCreator timing out [1]. All the agents are currently in Error in WMStats. This also leads to a loss of pressure on some sites (reported by P&R).

[1]

2022-06-18 18:55:20,396:140249312208640:ERROR:ChangeState:Traceback (most recent call last):
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/JobStateMachine/ChangeState.py", line 178, in propagate
    self.recordInCouch(jobs, newstate, oldstate, updatesummary)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/JobStateMachine/ChangeState.py", line 278, in recordInCouch
    self.jobsdatabase.queue(jobDocument, callback=discardConflictingDocument)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 237, in queue
    self.commit(viewlist=viewlist, callback=callback)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 303, in commit
    retval = self.post(uri, data)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 140, in post
    return self.makeRequest(uri, data, 'POST', incoming_headers,
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Database/CMSCouch.py", line 131, in makeRequest
    result, status, reason, cached = JSONRequests.makeRequest(
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 171, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 188, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/Utils/PortForward.py", line 69, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
  File "/data/srv/wmagent/v2.0.2.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.2.patch1/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 316, in request
    curl.perform()
pycurl.error: (28, 'Operation timed out after 270001 milliseconds with 0 bytes received')

@todor-ivanov
Contributor Author

This could be correlated with an extra load of workqueue elements in the system. Here is that very same moment on Jun 17th I was talking about above [1][2][3][4], this time from the agents' perspective: we see a sudden increase of local workqueue elements in all the agents at the very same time, almost doubling actually. It seems to me we either have trouble digesting something, or we have trouble pushing updates to the central Couch, hence the local workqueue elements at the agents keep piling up without moving forward.

[1]
https://monit-grafana.cern.ch/goto/qfFrmZq7k?orgId=11

[2]
https://monit-grafana.cern.ch/goto/_98rmZqnk?orgId=11
[3]
https://monit-grafana.cern.ch/goto/IQGjiZ3nz?orgId=11
[4]
https://monit-grafana.cern.ch/goto/CEJjiZqnz?orgId=11

@amaltaro amaltaro self-assigned this Jun 23, 2022
@amaltaro
Contributor

Looking at the WMAgent Grafana dashboard, it seems to have started on 17/Jun/2022, around 1am. This matches the moment when I deployed and started a RelVal agent (vocms0259), the first agent connected to production using CouchDB 3.x.

A couple of days ago we discovered that the RelVal agent was failing to replicate from the central workqueue (timing out because there were too many deleted documents), and possibly because of an issue with CouchDB itself:
apache/couchdb#4078

After many changes to the CouchDB replicator configuration in vocms0259, we could not find a way to get data replicating properly (even with the CouchDB developers' help!). The only solution was to rotate (replicate) the central workqueue database and get rid of all the deleted documents.

Imran was super kind and helpful and we got the workqueue rotation done yesterday evening CERN time (with an outage for that backend). Now we have a fresh and clean workqueue database and overall WM is looking stable.
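
For future reference, one known way to do such a rotation is a filtered replication that skips the deleted-document tombstones; below is a hedged sketch with placeholder hosts and credentials, not the exact commands we ran (the actual procedure is documented in the wiki linked below):

import base64
import json
from urllib.request import Request, urlopen

# Placeholders only; the real hosts, credentials and steps are in the wiki linked below.
COUCH = "http://localhost:5984"
AUTH = "Basic " + base64.b64encode(b"admin:password").decode("ascii")

def couch(method, path, body=None):
    """Tiny helper around CouchDB's HTTP API (stdlib only)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = Request(COUCH + path, data=data, method=method,
                  headers={"Content-Type": "application/json", "Authorization": AUTH})
    with urlopen(req) as resp:
        return json.load(resp)

# 1) A replication filter that drops deleted documents (tombstones).
couch("PUT", "/workqueue/_design/rotation",
      {"filters": {"skipdeleted": "function(doc, req) { return !doc._deleted; }"}})

# 2) Replicate only the live documents into a fresh database.
print(couch("POST", "/_replicate", {
    "source": "http://admin:password@localhost:5984/workqueue",
    "target": "http://admin:password@localhost:5984/workqueue_new",
    "create_target": True,
    "filter": "rotation/skipdeleted",
}))
# Afterwards the services are switched over to the new database and the old
# one (with all its tombstones) can be deleted.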

Last but not least, I updated the following wiki with some data for this procedure, such that we can come up with a better ETA for future cases:
https://github.com/dmwm/WMCore/wiki/workqueue-couch-db-replication-procedure

I am closing it now, but if timeouts strike again in the coming day or two, please reopen it.
