Apache stuck indefinitely waiting for PSOL #1048
Comments
Thank you @crowell and @jeffkaufman for your perseverance. I'm writing from the company account now, but I already wrote to you as @capn3m0, and I'm sorry I couldn't run the tests you requested: those servers are still in production, so once the problem had cleared I didn't feel safe trying the test again and risking making them unstable. I am confident the problem will be solved as soon as possible; meanwhile, keep up the good work, and thank you from the WpSEO Staff and our Clients.
I just went through the four recent backtraces we have and classified the states of the threads:
We need to learn which thread or threads are spinning and burning 100% cpu, since the backtraces don't look like they should be spinning. @crowell When you next manage to reproduce this, could you run:
On the apache process taking 100% CPU? Then in gdb we can see which thread has that LWP id. For example, on my system (not having the problem) I currently see:
And when I look in gdb:
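As a generic illustration (placeholders like `<apache-pid>` are mine, and this may not be the exact command intended above), a standard way on Linux to find the hot thread and match it to a gdb thread is:

```sh
# Per-thread CPU for one process; in -H mode the PID column is the
# thread (LWP) id that gdb reports as "LWP nnnn":
top -H -p <apache-pid>
# or as a one-shot listing:
ps -L -p <apache-pid> -o lwp,pcpu,comm

# Then attach gdb and find the thread whose LWP matches the hot one:
gdb -p <apache-pid>
(gdb) info threads
(gdb) thread <n>    # select the matching thread
(gdb) bt            # backtrace of just that thread
```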
The only checking that seems appropriate for those would be a CHECK.
@morlovich agreed
A backtrace on a debug build, along with thread process stats.
@crowell That's very strange; that doesn't look like 100% CPU at all. Only 25951 and 25901 are at all busy, and they're doing actual work, not waiting around like in our other backtraces.
@crowell I count 11 of 36 threads doing real CPU-involving work as opposed to waiting, while in the other backtraces it was at most 1.
@jeffkaufman When I had the problem, only 1 or 2 httpd.worker processes were stuck with high CPU, but all the rest of the httpd processes were working normally.
There are some stack-frames that look concerning: Thread 8 (Thread 0x7f7e85feb700 (LWP 25875)):
@jmarantz yeah, this is also the case in other threads/processes. The code for the function in frame 2:

```cpp
virtual void TimedWait(int64 timeout_ms) {
  mutex_->DropLockControl();
  condvar_->TimedWait(timeout_ms);
  mutex_->TakeLockControl();
}
```

so there really should be no reason for timeout_ms being reset to 0...
Never mind; that was a red herring; PthreadCondvar::TimedWait overwrites timeout_ms.
Yeah, seems to be a valid 5-second wait.
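One way to see why the argument can look wrong in a frame: a minimal sketch, assuming a typical pthread implementation (this is not the actual PthreadCondvar source), of how a relative timeout_ms gets consumed immediately to build the absolute deadline pthread_cond_timedwait() needs, so the frame's view of the argument can be stale:

```cpp
#include <pthread.h>
#include <sys/time.h>

// Sketch only: names and structure are assumed, not taken from PSOL.
void TimedWaitSketch(pthread_cond_t* cond, pthread_mutex_t* mutex,
                     long timeout_ms) {
  struct timeval now;
  gettimeofday(&now, NULL);
  struct timespec deadline;
  deadline.tv_sec = now.tv_sec + timeout_ms / 1000;
  deadline.tv_nsec = (now.tv_usec + (timeout_ms % 1000) * 1000) * 1000;
  if (deadline.tv_nsec >= 1000000000L) {  // carry microsecond overflow
    deadline.tv_sec += 1;
    deadline.tv_nsec -= 1000000000L;
  }
  pthread_cond_timedwait(cond, mutex, &deadline);
}
```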
3 stack traces before it got out of the "Waiting for completion" state: https://gist.github.com/903957a54f2cebcbe1b8. Looking now.
Some stack traces ~30 seconds apart, with 1 rewrite thread in the config: https://gist.github.com/63d0b4e6c4c406d2b2d3. They're all identical. After stopping the load on the server, the issue cleared up.
Before this change we would call ap_rwrite() etc from the rewrite thread, and if that blocked for a while we might not have any rewrite threads available to serve other requests. With this change, all potentially blocking Apache calls always happen on the request thread. Writes are buffered in ApacheFetch until the resource is complete, and then sent out in one go.

Most uses of this (ex: IPRO) were already not streaming, so we don't lose that much by buffering. We are doing more copying than we were, and to evaluate this impact I ran "siege http://localhost:8080/$testimage -c200 -t1h" both with and without the change, to really stress test this piece of IPRO. This test pulls a single 1.1MB IPRO-optimized image with 200 concurrent readers for 1 hour, and should give us a worst-case indication of the slowdown buffering causes:

before qps: 393.11
after qps: 392.20

A 0.2% worst-case slowdown is not bad at all; buffering seems not to be a problem.

This fixes some instances of #1048, but we're not sure yet whether it fixes all of them. It definitely doesn't fix ones due to slow filesystems, but we haven't seen that version in the wild. For uses of ApacheFetch that we know will always be synchronous we disable this buffering, which is slightly more efficient.
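As a minimal sketch of the buffering pattern described above (class and function names here are illustrative, not the actual ApacheFetch API):

```cpp
#include <cstddef>
#include <string>

// Sketch of the strategy: the rewrite thread only appends to memory, and
// the one potentially blocking Apache write happens on the request thread.
class BufferedFetchSketch {
 public:
  // Called from the rewrite thread; never calls into Apache, never blocks.
  void Write(const char* data, size_t len) { buffer_.append(data, len); }

  // Called on the request thread once the resource is complete.
  // send_to_client stands in for the ap_rwrite()-style call.
  void Done(void (*send_to_client)(const std::string&)) {
    send_to_client(buffer_);  // one blocking write instead of many
  }

 private:
  std::string buffer_;  // whole response accumulated here
};
```

The trade-off is exactly the one measured above: one extra in-memory copy per response in exchange for never parking a rewrite thread on a blocking Apache write.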
https://github.com/pagespeed/mod_pagespeed/releases/tag/1.9.32.8 should fix the issue. Anyone who has been affected by this, please give this pre-release a try and report back!
Installed.
While we're confident now that the "Waiting for Completion" state has been fixed, there seems to be a second, related issue of hangs within apr_memcache2_multgetp. It may be possible for apr_pollset_poll to return a value that isn't handled, and it may also be possible for get_server_line to be called with a response line that matches neither of the expected cases. We're working on testing these, and will have a test binary soon for interested users.
Installed 1.9.32.8 on production this morning after having run it successfully in test for 2 weeks. Tonight the same message was logged 25,000+ times in 15 minutes and then all seemed to be fine again. It only happened on 1 out of 3 webservers, but it seems to be the same behavior as before 1.9.32.8.
@sv72: Did the CPU usage go to 100% when this happened? |
Actually no, CPU was not higher than normal (30-35%), but memory usage did go to 95% (normally around 50%) |
(issue #1048) In the case where `if (strncmp(MS_VALUE, conn->buffer, MS_VALUE_LEN) == 0)` and `else if (strncmp(MS_END, conn->buffer, MS_END_LEN) == 0)` both fail, it was possible for queries_sent to never decrement. This patch sets rv to APR_EGENERAL in this case, decrements queries_sent, and closes the connection. According to the trace from betabrand this is where the hang is. Patch applied from the apr dev mailing list (http://www.mail-archive.com/dev%40apr.apache.org/msg26265.html).
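To illustrate the patched logic, here is a small self-contained C-style sketch (MS_VALUE/MS_END and handle_line are simplified stand-ins, not the real apr_memcache2 code); the key point is that the fall-through case must also decrement queries_sent, or the multiget wait loop never terminates:

```c
#include <stdio.h>
#include <string.h>

#define MS_VALUE "VALUE"
#define MS_VALUE_LEN (sizeof(MS_VALUE) - 1)
#define MS_END "END"
#define MS_END_LEN (sizeof(MS_END) - 1)

/* Classify one memcached response line. Returns 0 if the line was
 * expected, -1 if the connection should be failed (the caller would set
 * rv = APR_EGENERAL and close the connection, as in the patch). */
static int handle_line(const char *buffer, int *queries_sent) {
  if (strncmp(MS_VALUE, buffer, MS_VALUE_LEN) == 0) {
    return 0;               /* a value for one requested key */
  } else if (strncmp(MS_END, buffer, MS_END_LEN) == 0) {
    (*queries_sent)--;      /* this server finished its batch */
    return 0;
  }
  /* The branch the patch adds: before it, nothing decremented
   * queries_sent here, so the outer loop waited forever. */
  (*queries_sent)--;
  return -1;
}

int main(void) {
  int queries_sent = 1;
  /* A reply matching neither VALUE nor END, e.g. an error line: */
  if (handle_line("SERVER_ERROR out of memory", &queries_sent) != 0) {
    printf("failing connection; queries_sent now %d\n", queries_sent);
  }
  return 0;
}
```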
Hello, I have pushed mod_pagespeed 1.9.32.10-7423 to a production server (only one domain) and re-enabled rewriting of full-size images to test it. That seems to be OK. I have these errors in the Apache error.log, a few of them, for less than 1% of rewritten images: Is it a new mod_pagespeed parameter or should we use an existing one? My conf:
@eldk That is just a diagnostic message, letting you know that the time to read a file from disk exceeded an arbitrary threshold we determined to be "slow". It doesn't impact performance; the message is just logged, and nothing is cancelled or unscheduled. If you want to silence this, you can configure the
Thanks. |
Original issue reported on code.google.com by unsalkor...@gmail.com on 10 Feb 2015 at 9:35