[3.x]: Frequent "Could not acquire a mutex lock for the queue (queue)" errors. #11605

Closed
jamesmacwhite opened this issue Jul 15, 2022 · 12 comments

@jamesmacwhite (Contributor) commented Jul 15, 2022

What happened?

Description

Our Craft CMS site runs on fortrabbit with Professional scaling, and Sentry regularly logs "Could not acquire a mutex lock for the queue (queue)" errors in our staging and production applications when the queue is running via their worker component.

We have had discussions with Fortrabbit and they are not seeing the same issues with other Craft CMS 3 clients, but equally, I cannot reproduce the mutex lock errors in my development environment (running nitro queue), so it is hard to understand what is actually causing the locks, or whether this is environmental or something else is triggering it.

The runQueueAutomatically config setting is set to false, as we have a dedicated worker running ./craft queue/listen -v. Locally I run nitro queue to start a worker process similar to how Fortrabbit's worker runs.
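For reference, this is roughly what that setting looks like in config/general.php (a simplified sketch; the environment keys are illustrative):

```php
// config/general.php (simplified sketch; environment keys are illustrative)
return [
    '*' => [
        // Queue jobs should never run as part of web requests;
        // the dedicated worker process handles them instead.
        'runQueueAutomatically' => false,
    ],
];
```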

Here are a couple of stack traces from these mutex lock errors.

> yii\base\Exception: Could not acquire a mutex lock for the queue (queue).
#12 /vendor/craftcms/cms/src/queue/Queue.php(780): craft\queue\Queue::_lock
#11 /vendor/craftcms/cms/src/queue/Queue.php(684): craft\queue\Queue::_moveExpired
#10 /vendor/craftcms/cms/src/queue/Queue.php(437): craft\queue\Queue::getJobInfo
#9 /vendor/craftcms/cms/src/controllers/QueueController.php(184): craft\controllers\QueueController::actionGetJobInfo
#8 [internal](0): call_user_func_array
#7 /vendor/yiisoft/yii2/base/InlineAction.php(57): yii\base\InlineAction::runWithParams
#6 /vendor/yiisoft/yii2/base/Controller.php(178): yii\base\Controller::runAction
#5 /vendor/yiisoft/yii2/base/Module.php(552): yii\base\Module::runAction
#4 /vendor/craftcms/cms/src/web/Application.php(293): craft\web\Application::runAction
#3 /vendor/craftcms/cms/src/web/Application.php(602): craft\web\Application::_processActionRequest
#2 /vendor/craftcms/cms/src/web/Application.php(272): craft\web\Application::handleRequest
#1 /vendor/yiisoft/yii2/base/Application.php(384): yii\base\Application::run
#0 /index.php(12): null
yii\base\Exception: Could not acquire a mutex lock for the queue (queue).
#17 srv/app/nttmcoll-prod/htdocs/vendor/craftcms/cms/src/queue/Queue.php(780): craft\queue\Queue::_lock
#16 srv/app/nttmcoll-prod/htdocs/vendor/craftcms/cms/src/queue/Queue.php(618): craft\queue\Queue::reserve
#15 srv/app/nttmcoll-prod/htdocs/vendor/craftcms/cms/src/queue/Queue.php(130): craft\queue\Queue::craft\queue\{closure}
#14 [internal](0): call_user_func
#13 srv/app/nttmcoll-prod/htdocs/vendor/yiisoft/yii2-queue/src/cli/Queue.php(117): yii\queue\cli\Queue::runWorker
#12 srv/app/nttmcoll-prod/htdocs/vendor/craftcms/cms/src/queue/Queue.php(140): craft\queue\Queue::run
#11 srv/app/nttmcoll-prod/htdocs/vendor/craftcms/cms/src/queue/Command.php(88): craft\queue\Command::actionListen
#10 [internal](0): call_user_func_array
#9 srv/app/nttmcoll-prod/htdocs/vendor/yiisoft/yii2/base/InlineAction.php(57): yii\base\InlineAction::runWithParams
#8 srv/app/nttmcoll-prod/htdocs/vendor/yiisoft/yii2/base/Controller.php(178): yii\base\Controller::runAction
#7 srv/app/nttmcoll-prod/htdocs/vendor/yiisoft/yii2/console/Controller.php(182): yii\console\Controller::runAction
#6 srv/app/nttmcoll-prod/htdocs/vendor/yiisoft/yii2/base/Module.php(552): yii\base\Module::runAction
#5 srv/app/nttmcoll-prod/htdocs/vendor/yiisoft/yii2/console/Application.php(180): yii\console\Application::runAction
#4 srv/app/nttmcoll-prod/htdocs/vendor/craftcms/cms/src/console/Application.php(89): craft\console\Application::runAction
#3 srv/app/nttmcoll-prod/htdocs/vendor/yiisoft/yii2/console/Application.php(147): yii\console\Application::handleRequest
#2 srv/app/nttmcoll-prod/htdocs/vendor/craftcms/cms/src/console/Application.php(118): craft\console\Application::handleRequest
#1 srv/app/nttmcoll-prod/htdocs/vendor/yiisoft/yii2/base/Application.php(384): yii\base\Application::run
#0 srv/app/nttmcoll-prod/htdocs/craft(13): null

We are running just a single dedicated worker for the queue. Our staging app is also on Fortrabbit and it has logged mutex errors too, although it does not run queue jobs as constantly as our production app, so they appear less frequently.

The recurring theme of the mutex locks is the queue, around _moveExpired(), release() and reserve(). Generally it happens when there is activity in the Control Panel while the queue is running. The stack trace often shows that manage/actions/queue/get-job-info returned a 500 error because of the mutex error, triggered by any user who happened to be logged in or doing Control Panel actions unrelated to the queue. This appears to only occur in the Control Panel and has no impact on the front end.

We see random "A server error occurred" or occasionally "Internal Server Error" flash messages on Control Panel requests, and they all lead back to a mutex lock.

Generally, it happens when there is simultaneous activity in the Control Panel while the queue is running. Looking at Sentry, no mutex locking errors occurred at quieter times when basically no one was logged in to the Control Panel and the queue ran fine through the worker; with additional Control Panel activity alongside the queue, locks seem more likely to occur.

Our environment is load balanced, but Fortrabbit have told me the worker component is a single instance, so it shouldn't be affected by the use of the now-default FileMutex. However, I'm running out of ideas about what I can try or what could be causing these locks.

I can see that Redis is recommended, but Fortrabbit do not support Redis directly. I could look at a cloud Redis provider, but I feel I'd potentially just be masking a problem, as I cannot fully understand what is actually triggering the locks in the first place, other than that they seem to happen under certain conditions with no obvious cause or a way to reproduce them consistently.

We have the CRAFT_EPHEMERAL constant set to true on Fortrabbit because the storage is ephemeral due to load balancing. I don't know if this could be having an impact.
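For context, the constant is just defined in the bootstrap before Craft loads, roughly like this (a sketch; Fortrabbit's actual setup may define it differently):

```php
// web/index.php (sketch only; defined before Craft is bootstrapped)
define('CRAFT_EPHEMERAL', true); // tells Craft that local file storage is not persistent
```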

Any ideas about what we can look at to try to understand why the queue is generating lock errors in the above situations?

I can share the Sentry error traces privately, but I can't post the full details publicly because some of them capture sessions from active Control Panel users, which reveal names and email addresses.

Craft CMS version

3.7.48

PHP version

7.4

Operating system and version

Linux

Database type and version

MySQL 8

Image driver and version

N/A

Installed plugins and versions

Amazon S3 1.3.0
Asset Rev 6.0.2
Asset Usage 2.3.1
Command Palette 3.1.4
Control Panel Nav 3.0.17
CRM Enquiries 1.0.4 (Private plugin)
Element API 2.8.6.1
Environment Label 3.2.0
Feed Me 4.5.3
iCalendar 1.1.3
Imager X v3.6.2
Kint 1.0.0
Linkit 1.1.12.1
Maps 3.9.4
Minify 1.2.11
PathTools 1.0.7
Redactor 2.10.10
Redirects 3.1.73
Reverse Relations 1.3.2
Scout 2.7.2
Sentry SDK 1.1.2
SEO 3.4.36
Sprout Email 4.4.10
Sprout Encode Email 2.0.7
Sprout Fields 3.8.5
Sprout Forms 3.13.8
Sprout Reports 1.5.13
Super Table 2.7.2
Table Maker 2.0.5
Twitter 2.1.6
Typogrify 1.1.19
Videos 2.1.0

@angrybrad (Member)

What plugins do you have installed? Do you have any custom modules running?

@jamesmacwhite (Contributor, Author)

@angrybrad I've added our full plugin list to the original issue for reference.

We do have a custom module running for project-specific functionality. I have audited it to try to find anything that could be causing the locks, but so far I haven't found anything.

On Friday and over the weekend I swapped out Memcached for Redis on our staging environment to see if replacing the FileMutex has any impact. So far, using Redis for cache, session and mutex seems to have helped. There was still one mutex lock error raised, however it came from the worker component, not the main app. I haven't seen a mutex lock raised by the application since testing Redis on our staging environment.
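Roughly what the swapped-in component config looks like (a trimmed sketch based on the yii2-redis components and the Craft docs examples; connection details are placeholders and the nested mutex syntax assumes a recent 3.7.x release):

```php
// config/app.php (trimmed sketch; hostnames and credentials are placeholders)
return [
    'components' => [
        'redis' => [
            'class' => yii\redis\Connection::class,
            'hostname' => getenv('REDIS_HOST'),
            'port' => 6379,
            'password' => getenv('REDIS_PASSWORD') ?: null,
        ],
        'cache' => [
            'class' => yii\redis\Cache::class,
            'redis' => 'redis', // reuse the connection component above
        ],
        'session' => [
            'class' => yii\redis\Session::class,
            'as session' => craft\behaviors\SessionBehavior::class, // see the Craft docs for the full session override
        ],
        'mutex' => [
            'mutex' => [
                'class' => yii\redis\Mutex::class,
                'redis' => 'redis',
            ],
        ],
    ],
];
```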

I have been sending various queue jobs and feed imports to simulate the queue activity on our live site, performed various actions in the Control Panel, and left the Queue Manager tab open so it polls the action controller that had been returning 500 errors, and I haven't been able to reproduce the issue. I don't have a reliable way to trigger the mutex errors, other than that they generally occur when the queue is running and there is other CMS activity.

So while it's not a "fix", switching to Redis seems to have helped. It could be down to the FileMutex, but I still think there's probably an underlying issue somewhere that I'm just not seeing.

@angrybrad (Member)

@jamesmacwhite hrm... I'd let this run for a week or so and see if it pops up again.

If not, you could leave it as-is, or swap it to a database-backed mutex. If that works fine after a week, then the evidence points to the underlying issue being a file-based mutex in this environment.
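A database-backed mutex override would look roughly like the sketch below (assuming the nested mutex syntax of recent 3.7.x releases; check the docs for the exact form on your version):

```php
// config/app.php (sketch of a MySQL-backed mutex override, recent 3.7.x nested syntax)
return [
    'components' => [
        'mutex' => [
            'mutex' => [
                'class' => yii\mutex\MysqlMutex::class,
                // uses the application's 'db' connection by default
            ],
        ],
    ],
];
```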

@jamesmacwhite (Contributor, Author)

@angrybrad Redis is now live on our production environment as well, and so far the results are good, with no more queue locking errors. This week will be the true test with normal load and activity, given the errors were a bit random.

The FileMutex does seem to be something that is/was influencing it. I doubt moving the cache and session to Redis has really done much for this error, but I figured we might as well try it as a full replacement for Memcached, which from what I've read seems to be the less preferred option these days compared to Redis anyway; more importantly, Redis is the recommended mutex for load balanced environments.

I believe I did try MysqlMutex as a test, but it's quite possible I didn't configure it correctly, as I know the syntax changed for 3.7.20+. Redis seems to be the way to go anyway, although I'd still suspect we might run into MySQL's locking behaviour if we were having mutex lock issues.

My theory of what could be happening, after comparing things with Redis in use vs. not: the worker on Fortrabbit sometimes seemed to spawn multiple instances of itself, if the logs are to be believed. After a code deployment or app change, which reloads the app and the worker, I could see multiple "stopping worker" / "starting worker" log entries, even though we only have one worker. I'm wondering whether the FileMutex is unreliable in this situation and that's where the locking issues were coming from: if the worker spawned multiple processes that all ran the queue, that would cause concurrency and contention issues. That's just a theory though. The behaviour of the worker and queue under Redis does seem different and a lot smoother.

Time will tell, although even now I can see the performance difference, so it's a good sign.

@jamesmacwhite (Contributor, Author) commented Jul 18, 2022

I might have found the issue; it potentially only becomes apparent once the FileMutex has been replaced.

So I think the issue is the use of Craft::$app->setCache() in certain places in a module in this project. This uses the FileMutex by default, I believe? I think this is where the problem might have been. In certain places I had been using getCache or setCache calls, but on a load balanced environment that is potentially problematic, because it's not reliable. However, after switching the cache and mutex over to Redis, these cache calls now write to Redis.

We were using Memcache, but it's my understanding that Memcache can only store small values; the cached data is sometimes a Guzzle HTTP response significantly larger than 1 MB, so I assume it would then fall back to the FileMutex and this problem starts happening.

Maybe if I had set the mutex to the database this might not have been triggered either, however sticking with Redis seems like the best move anyway, given the performance gains and an overall better caching mechanism that works in a load balanced environment. That's what I believe to be the case.
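To illustrate the kind of cache call involved (a hypothetical sketch, not the module's actual code), the module caches an external HTTP response through Craft's cache component, so the data goes wherever that component points, whether that is Memcache, the file system, or now Redis:

```php
// Hypothetical sketch of the kind of call the module makes; not the actual project code.
$body = Craft::$app->getCache()->getOrSet('external-api-response', function() {
    // A full Guzzle response body can easily exceed Memcache's ~1 MB value limit.
    $client = Craft::createGuzzleClient();
    return (string)$client->get('https://api.example.com/data')->getBody(); // placeholder URL
}, 3600);
```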

@howells commented Apr 17, 2023

@jamesmacwhite I assume you never returned to this after you moved to Redis, but I have the exact same issue with Fortrabbit, where a call to get-job-info causes a mutex lock error, which then subsequently fails the job being run. When I have the browser closed, I can run the job manually and it succeeds.

Fortrabbit upped the Memcache size to 10 MB, and it still happens, so we've eliminated that as a cause.

Any ideas? No worries if you've moved on from this!

@jamesmacwhite (Contributor, Author) commented Apr 17, 2023

@howells I didn't, as Redis seemed to resolve our issues. We have still seen occasional mutex lock errors logged by Sentry despite using Redis, but they don't seem to impact major operations. Moving to Redis over Memcache certainly helped with the issues we had, although some of this could have been of our own making: wanting to cache larger amounts of data in certain ways, which then hit Memcache's limitations and affected queue jobs.

I also got the impression that Memcache is a somewhat older caching method with size limitations, whereas Redis is recommended for Craft CMS usage anyway and can handle much larger data, which seemed more scalable overall. The only downside is that Fortrabbit do not provide any native Redis caching, so we had to go with Redis Cloud, which sits outside of Fortrabbit. Quite by chance, Redis Cloud use AWS and happen to have instances provisioned in the same region as Fortrabbit, so latency is about as low as it can be.

Fortrabbit support never fully got to the bottom of our issue, so it's interesting that you are hitting similar problems.

I can only advise experimenting with the mutex configuration for testing. Have you tried switching the mutex to the MySQL database driver to see if it has any impact? That's one thing I didn't do: I thought I had, but I realised the mutex config in our app.php was wrong and not targeting the right component, so it was never actually tested.

@howells commented Apr 18, 2023

@jamesmacwhite really appreciate your reply. It's a total mystery and Fortrabbit are flummoxed. They'll take one more look into it, then I'll try the MySQL driver, then Redis, and then just move away from Fortrabbit if we can't find a solution.

@jamesmacwhite (Contributor, Author)

@howells Sounds similar to my issue. I also considered moving provider because it was getting that bad, but Redis over Memcache seems to have solved the issue for us and provided performance benefits as a mostly drop-in replacement, so we stayed. Others recommend Servd, which has Redis scaling built in, if you are considering moving. I did look at Servd, but ultimately stayed, given a lot of our infrastructure is tied to Fortrabbit and AWS.

If I was to go with another provider in the future though, it would likely be Servd.

@frank-laemmer

Frank from fortrabbit here. We are still investigating the issue and have already spent quite some time on it. It's complex. Redis is on our roadmap as well.

@mcclaskiem

I am having the same issue with a job that I run to sync Commerce products with POS software that one of my clients uses.

The first error I was getting was more on the server side, shown below:

[pool www] server reached pm.max_children setting (5), consider raising it

I increased the pm.max_children setting to 12 in my pool conf and then started seeing the mutex lock errors that this issue originally referenced.

I tried switching to craft\cache\DbCache but am now getting memory exhaustion errors when trying to set the cache for our import script. Is there a way to control the amount of data Craft can post to MySQL in the cache table?

@brandonkelly (Member)

@mcclaskiem Switching the cache component to DbCache won’t have any effect on your mutex locks.

How have you configured your mutex component? It should ideally be set to use Redis, consistent with the example here.
