Mod_pagespeed deadlock? #1662
Comments
Would you be willing to try to reproduce this with a debug build? |
I will check internally whether we are allowed to install the Apache 2.4 debug build, and will update here. Does the same pagespeed RPM work, or do we need to rebuild pagespeed from source? |
Hi @lanki567 |
Hi @Lofesa We don't have C++ or Python expertise, and it is taking forever to fix the build issues. All branches are currently broken due to missing git submodules hosted on Apache infra. Is there any debug build RPM available for CentOS 7.x? |
@lanki567 you could try this to work around git.apache.org being down:
@Lofesa |
Oh wow, I did not see that coming; but yes, we should just replace git.apache.org with github.com/apache/ then.
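A minimal sketch of that substitution using git's URL rewriting, so the checked-in submodule URLs don't need editing by hand (the exact prefix mapping is an assumption and may need adjusting to match the real submodule URLs):

# Transparently substitute GitHub whenever a submodule URL points at
# the unavailable apache.org git host, then retry the submodule fetch.
git config --global url."https://github.com/apache/".insteadOf "https://git.apache.org/"
git submodule update --init --recursive
|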
We executed the below command on the latest stable branch, but the build fails with BUILDTYPE=Debug while it succeeds with BUILDTYPE=Release. Any help is much appreciated; thanks in advance. The Release build type generated a binary for the following commands on RHEL 7
Debug build fails for the following command on RHEL 7. Please find the log below, and let me know if you need the complete log. Partial Log |
Below is the stack trace of 2 threads with the Release module compiled from the latest stable build branch. We see hundreds of threads with the stack trace below. Thread 25 (Thread 0x7f3c42fd7700 (LWP 12477)): |
Thanks so much for hanging in there and persisting. These backtraces look very helpful. One last question: while you're at it, could you post backtraces from all the threads somewhere, for reference later on? |
Unfortunately, policies here don't let me upload all the logs. Are you looking for anything specific? I can paste those logs here. Is there any workaround/property to disable this specific functionality? |
Well, I was hoping to inspect what all the threads are doing when the hang occurs. The two threads you posted both seem to be waiting to acquire a lock, but there's probably (hopefully!) also another one out there that holds it.
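If it helps, a standard way to capture every thread's backtrace from a running worker with gdb (the PID is a placeholder):

# Attach non-interactively, dump a backtrace for every thread, then detach.
gdb -p <httpd-worker-pid> -batch -ex "set pagination off" -ex "thread apply all bt" > all-threads.txt
|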
Thank you for the backtraces. They reveal this is happening in the shared-memory cache, where multiple processes use a cross-process mutex to access shared memory. I have always been concerned that a process crash, or a SIGKILL delivered while a process holds the shared-memory mutex, could leave the lock permanently unreleasable. However, the shared-memory cache has been running for probably 7 years now and I've never heard of this actually occurring.

If this is just a once-in-a-decade fluke, it might be a reasonable workaround to simply reboot the machine to clear the shm-locks and hope you are not unlucky twice. If, after the reboot, the problem reappears, a workaround is to disable the shared-memory cache and instead use per-process in-memory LRU caches as your L1s. See https://www.modpagespeed.com/doc/system for details. Hope this helps!
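As a rough sketch, the directives involved look something like this (names per the system docs linked above; the sizes are illustrative assumptions, not tuned recommendations):

# Disable the default shared-memory metadata cache, which is where the
# cross-process mutex implicated in this hang lives.
ModPagespeedDefaultSharedMemoryCacheKB 0
# Give each Apache child its own in-memory LRU cache as the L1 instead.
ModPagespeedLRUCacheKbPerProcess 1024
ModPagespeedLRUCacheByteLimit 16384
|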
I just wanted to provide an update: we are into the 6th day and haven't seen the issue since disabling the shared-memory cache. Thanks for the workaround. We are now a little concerned about performance. Should we consider memcached/Redis as an alternative solution? |
Yes, actually memcache/redis would be good to use as an L2 independent of whether you are using the shared-memory cache or the per-child-process cache as an L1.
Before resorting to the per-child-process L1, did you try to reboot the machine to clear out the shm-locks? Did that not work?
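A minimal sketch of the L2 wiring, assuming a memcached instance on the default port (the address is a placeholder; newer releases also have a Redis option, ModPagespeedRedisServer):

# Use memcached as the shared L2 cache across all Apache children.
ModPagespeedMemcachedServers localhost:11211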
|
Our Apache 2.4 is configured to use the latest stable version of mod_pagespeed, but after some time Apache stops serving responses for all requests configured to use pagespeed. Per the mod_forensic log, a response was never sent for these requests. During this period, all URLs configured with "ModPagespeedDisallow" work as expected. Even the URLs configured to use pagespeed work as expected when the header/request parameter "ModPagespeed=off" is set. A thread dump shows several threads stuck at the address "0x00007f1417d94253". We can consistently reproduce this on all our RHEL 7.6 servers (kernel version 3.10.0-957.21.3.el7.x86_64).
addr2line -e mod_pagespeed_ap24.so 0x00007f1417d94253
??:0
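The ??:0 is likely because addr2line expects a module-relative offset rather than the runtime virtual address (and the release .so may be stripped anyway). A sketch of rebasing the address against the module's load base, where the PID and base address are hypothetical:

# Find where the module is mapped inside a hung worker process.
grep mod_pagespeed_ap24.so /proc/<worker-pid>/maps | head -1
# Suppose the first mapping starts at 0x7f1417a00000 (hypothetical);
# then the offset is 0x7f1417d94253 - 0x7f1417a00000 = 0x394253.
addr2line -f -e /apache/modules/mod_pagespeed_ap24.so 0x394253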
#0 0x00007f1427f42d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f1417d94253 in ?? () from /apache/modules/mod_pagespeed_ap24.so
#2 0x00007f1417bfccfc in ?? () from /apache/modules/mod_pagespeed_ap24.so
#3 0x00007f1417bfcd8e in ?? () from /apache/modules/mod_pagespeed_ap24.so
#4 0x00007f1417acda2e in ?? () from /apache/modules/mod_pagespeed_ap24.so
#5 0x00007f1417aa44e8 in ?? () from /apache/modules/mod_pagespeed_ap24.so
#6 0x00007f1417aa626f in ?? () from /apache/modules/mod_pagespeed_ap24.so
#7 0x00007f1426f24194 in filter_harness (f=0x7f12580c6e48, bb=0x7f12580617e0) at mod_filter.c:323
#8 0x00007f1418621d4d in ajp_process_callback (msg=<optimized out>, pmsg=0x292bfc8, ae=0x292bf50, r=0x7f12d47f2a60, l=0x2727470) at jk_ajp_common.c:2146
#9 0x00007f14186243ca in ajp_get_reply (e=<optimized out>, s=0x7f12d47f2a60, l=0x2727470, p=0x292bf50, op=0x7f12d47f1830) at jk_ajp_common.c:2310
#10 0x00007f1418628cb7 in ajp_service (e=<optimized out>, s=0x7f12d47f2a60, l=<optimized out>, is_error=<optimized out>) at jk_ajp_common.c:2678
#11 0x00007f1418611b4f in service (e=<optimized out>, s=<optimized out>, l=<optimized out>, is_error=0x7f12d47f2cbc) at jk_lb_worker.c:1418
#12 0x00007f14185fe121 in jk_handler (r=<optimized out>) at mod_jk.c:2896
#13 0x0000000000453960 in ap_run_handler (r=r@entry=0x7f1258069ce0) at config.c:170
#14 0x0000000000453ea9 in ap_invoke_handler (r=r@entry=0x7f1258069ce0) at config.c:444
#15 0x00000000004697ba in ap_process_async_request (r=0x7f1258069ce0) at http_request.c:453
#16 0x0000000000469a7e in ap_process_request (r=r@entry=0x7f1258069ce0) at http_request.c:488
#17 0x0000000000465c45 in ap_process_http_sync_connection (c=0x7f12ac021098) at http_core.c:210
#18 ap_process_http_connection (c=0x7f12ac021098) at http_core.c:251
#19 0x000000000045d480 in ap_run_process_connection (c=c@entry=0x7f12ac021098) at connection.c:42
#20 0x000000000045d9a8 in ap_process_connection (c=c@entry=0x7f12ac021098, csd=csd@entry=0x7f12ac020e80) at connection.c:219
#21 0x00000000004706f2 in process_socket (bucket_alloc=0x7f125801ef08, my_thread_num=13, my_child_num=7, sock=0x7f12ac020e80, p=0x7f12ac020df8, thd=0x2f481e8) at worker.c:479
#22 worker_thread (thd=0x2f481e8, dummy=<optimized out>) at worker.c:808
#23 0x00007f1427f3edd5 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f1427a6402d in clone () from /lib64/libc.so.6