Skip to content
This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Apache stuck indefinitely waiting for PSOL #1048

Closed
GoogleCodeExporter opened this issue Apr 6, 2015 · 187 comments
Closed

Apache stuck indefinitely waiting for PSOL #1048

GoogleCodeExporter opened this issue Apr 6, 2015 · 187 comments

Comments

@GoogleCodeExporter
Copy link

I just installed mod_pagespeed on my centos 7 and got tons of httpd errors log 
in 1 minute. An example line of error:
<code>
[Tue Feb 10 11:05:14.311755 2015] [pagespeed:warn] [pid 21132:tid 
139634310850304] [mod_pagespeed 1.9.32.3-4448 @21132] Waiting for completion of 
URL http://exampledomain.com/example-slug/ for 45.001 sec
</code>

ALL requests got error, include image requests too.


My server hardware specs:
* Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz, 8 cores
* 32 GB DDR3 RAM
* 2 x 2 TB SATA 6 Gb/s 7200 rpm HDD (Software-RAID 1) Class Enterprise

Software specs:
Operating system: CentOS Linux 7.0.1406
Kernel: Linux 3.10.0-123.20.1.el7.x86_64 on x86_64

Server Version: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips mod_fcgid/2.3.9 
PHP/5.6.5 mod_perl/2.0.9dev Perl/v5.16.3
Server MPM: event


What version of the product are you using (please check X-Mod-Pagespeed
header)?
mod-pagespeed-stable-1.9.32.3-4448.x86_64

URL of broken page:
I removed module after 1 minute of terror. If a google developer want to learn 
more of server information, mail me an ip address that i can give permission to 
look mod_info print.



Original issue reported on code.google.com by unsalkor...@gmail.com on 10 Feb 2015 at 9:35

@GoogleCodeExporter
Copy link
Author

I have the same problem with the same version of pagespeed.

I'm on Debian Wheezy.


Original comment by hug...@betabrand.com on 19 Feb 2015 at 6:40

@GoogleCodeExporter
Copy link
Author

This looks like a bug to me in:

void ApacheFetch::Wait() {
  MessageHandler* handler = server_context_->message_handler();
  Timer* timer = server_context_->timer();
  int64 start_ms = timer->NowMs();
  {
    ScopedMutex lock(mutex_.get());
    while (!done_) {
      condvar_->TimedWait(blocking_fetch_timeout_ms_);
      if (!done_) {
        int64 elapsed_ms = timer->NowMs() - start_ms;
        handler->Message(
            kWarning, "Waiting for completion of URL %s for %g sec",
            mapped_url_.c_str(), elapsed_ms / 1000.0);
      }
    }
  }
}

I think what we're supposed to be doing is blocking for a maximum of 
blocking_fetch_timeout_ms_, but instead we block until Done() is called and 
just log an error every blocking_fetch_timeout_ms_.

Original comment by jefftk@google.com on 19 Feb 2015 at 8:36

  • Changed state: Accepted

@GoogleCodeExporter
Copy link
Author

In the mean time, could you try turning off IPRO?

    ModPagespeedInPlaceResourceOptimization off

Original comment by jefftk@google.com on 19 Feb 2015 at 8:42

@GoogleCodeExporter
Copy link
Author

I'll try that and report back.

Thanks.

Original comment by hug...@betabrand.com on 19 Feb 2015 at 8:43

@GoogleCodeExporter
Copy link
Author

Turns out the logging behavior is intentional:

  // Blocks indefinitely waiting for the proxy fetch to complete.
  // Every 'blocking_fetch_timeout_ms', log a message so that if
  // we get stuck there's noise in the logs, but we don't expect this
  // to happen because underlying fetch/cache timeouts should fire.
  //
  // Note that enforcing a timeout in this function makes debugging
  // difficult.
  void Wait();

Still not clear why the fetch/cache timeouts aren't firing.

Original comment by jefftk@google.com on 19 Feb 2015 at 8:44

@GoogleCodeExporter
Copy link
Author

Have you set any of the pagespeed timeouts to strange values, like -1?  Are 
there any other ways your PageSpeed configuration or server setup is unusual?  
Or is this pretty much a stock install?

Original comment by jefftk@google.com on 19 Feb 2015 at 8:45

@GoogleCodeExporter
Copy link
Author

The only place referring to timeouts in my config are:

ModPagespeedMemcachedTimeoutUs           50000
ModPagespeedRewriteDeadlinePerFlushMs   100

## Not a timeout, but maybe relevant
ModPagespeedImageMaxRewritesAtOnce      0

I also don't run into this problem for files that are declared with 
ModPagespeedLoadFromFileMatch


Original comment by hug...@betabrand.com on 19 Feb 2015 at 8:49

@GoogleCodeExporter
Copy link
Author

Reporting back: disabling IPRO did not fix the issue.

Original comment by hug...@betabrand.com on 19 Feb 2015 at 10:57

@GoogleCodeExporter
Copy link
Author

I also noticed that:

1- contrary to what I said earlier, the bug also occurs for resources declared 
with ModPagespeedLoadFromFileMatch

2- It seems to happen with .html files (which I feel like is weird; but since I 
don't know the internals of MPS, could seem logical to you)

[Thu Feb 19 14:57:18 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @29090] Waiting 
for completion of URL 
http://www.betabrand.com/mens-gray-chambray-drawstring-pants.html for 6730.16 
sec
[Thu Feb 19 14:57:18 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @29090] Waiting 
for completion of URL 
http://www.betabrand.com/womens/more/navy-blue-cornucopia-water-resistant-daypac
k.html for 6730.14 sec
[Thu Feb 19 14:57:18 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @29090] Waiting 
for completion of URL 
http://www.betabrand.com/mens-removable-hood-gekko-vest.html for 6640.23 sec

Original comment by hug...@betabrand.com on 19 Feb 2015 at 11:01

@GoogleCodeExporter
Copy link
Author

Hi there,

I'd like to mention that this bug is fairly important as it is triggered all 
the time and its effect is apache2 hogging the CPU. The only fix is killing the 
apache2 process.

Also, is it possible that using ModPagespeedBlockingRewriteKey could be a cause 
of this bug?

I'm setting the X-PSA-Blocking-Rewrite header on all my requests.


Original comment by hug...@betabrand.com on 20 Feb 2015 at 8:31

@GoogleCodeExporter
Copy link
Author

I have the same problem with the same version of pagespeed.

I'm on CENTOS 6.6 x86_64 standard, Apache 2.4.12.

[Sat Feb 28 15:21:10.939879 2015] [pagespeed:warn] [pid 49502:tid 
140315103012608] [mod_pagespeed 1.9.32.3-4448 @49502] Waiting for completion of 
URL 
http://corourbano.com/w/wp-content/uploads/2015/01/Mozart-La-Para-Ft-Anthony-San
tos-Pa-Gozar-Remix.jpg for 10 sec
[Sat Feb 28 15:21:11.526066 2015] [pagespeed:warn] [pid 49687:tid 
140315144972032] [mod_pagespeed 1.9.32.3-4448 @49687] Waiting for completion of 
URL http://corourbano.com/w/wp-content/uploads/2014/04/Paramba-ps-4.jpg for 
40.002 sec
[Sat Feb 28 15:21:15.940041 2015] [pagespeed:warn] [pid 49502:tid 
140315103012608] [mod_pagespeed 1.9.32.3-4448 @49502] Waiting for completion of 
URL 
http://corourbano.com/w/wp-content/uploads/2015/01/Mozart-La-Para-Ft-Anthony-San
tos-Pa-Gozar-Remix.jpg for 15.001 sec
[Sat Feb 28 15:21:16.526242 2015] [pagespeed:warn] [pid 49687:tid 
140315144972032] [mod_pagespeed 1.9.32.3-4448 @49687] Waiting for completion of 
URL http://corourbano.com/w/wp-content/uploads/2014/04/Paramba-ps-4.jpg for 
45.002 sec
[Sat Feb 28 15:21:21.526388 2015] [pagespeed:warn] [pid 49687:tid 
140315144972032] [mod_pagespeed 1.9.32.3-4448 @49687] Waiting for completion of 
URL http://corourbano.com/w/wp-content/uploads/2014/04/Paramba-ps-4.jpg for 
50.002 sec
[Sat Feb 28 15:21:21.744521 2015] [pagespeed:warn] [pid 49819:tid 
140315228890880] [mod_pagespeed 1.9.32.3-4448 @49819] Waiting for completion of 
URL 
http://www.corourbano.com/w/wp-content/uploads/2013/09/680px_dedd825780164515bdb
1b16841c61be9.png for 5 sec
[Sat Feb 28 15:21:26.526512 2015] [pagespeed:warn] [pid 49687:tid 
140315144972032] [mod_pagespeed 1.9.32.3-4448 @49687] Waiting for completion of 
URL http://corourbano.com/w/wp-content/uploads/2014/04/Paramba-ps-4.jpg for 
55.002 sec
[Sat Feb 28 15:21:31.526660 2015] [pagespeed:warn] [pid 49687:tid 
140315144972032] [mod_pagespeed 1.9.32.3-4448 @49687] Waiting for completion of 
URL http://corourbano.com/w/wp-content/uploads/2014/04/Paramba-ps-4.jpg for 
60.002 sec
[Sat Feb 28 15:21:36.526792 2015] [pagespeed:warn] [pid 49687:tid 
140315144972032] [mod_pagespeed 1.9.32.3-4448 @49687] Waiting for completion of 
URL http://corourbano.com/w/wp-content/uploads/2014/04/Paramba-ps-4.jpg for 
65.002 sec
[Sat Feb 28 15:21:40.631322 2015] [core:error] [pid 49895:tid 140315355662080] 
[client 198.7.62.69:57760] Script timed out before returning headers: 
wp-cron.php
[Sat Feb 28 15:22:26.233396 2015] [pagespeed:warn] [pid 49592:tid 
140315228890880] [mod_pagespeed 1.9.32.3-4448 @49592] Waiting for completion of 
URL 
http://corourbano.com/w/wp-content/uploads/2014/11/j-balvin-latin-grammy-slide-n
ewsjpg.jpg for 5 sec
[Sat Feb 28 15:22:44.002184 2015] [pagespeed:warn] [pid 49819:tid 
140315176441600] [mod_pagespeed 1.9.32.3-4448 @49819] Waiting for completion of 
URL 
http://corourbano.com/w/wp-content/uploads/2014/07/tu-quiere-chapiame-chimbala-f
t-la-material.jpg for 5 sec
[Sat Feb 28 15:22:49.002314 2015] [pagespeed:warn] [pid 49819:tid 
140315176441600] [mod_pagespeed 1.9.32.3-4448 @49819] Waiting for completion of 
URL 
http://corourbano.com/w/wp-content/uploads/2014/07/tu-quiere-chapiame-chimbala-f
t-la-material.jpg for 10 sec
[Sat Feb 28 15:22:54.002439 2015] [pagespeed:warn] [pid 49819:tid 
140315176441600] [mod_pagespeed 1.9.32.3-4448 @49819] Waiting for completion of 
URL 
http://corourbano.com/w/wp-content/uploads/2014/07/tu-quiere-chapiame-chimbala-f
t-la-material.jpg for 15 sec

Any suggestions on this?

Original comment by CoroUrba...@gmail.com on 28 Feb 2015 at 7:34

@GoogleCodeExporter
Copy link
Author

I also have this problem, anyone can solve ?

Original comment by huynhvut...@gmail.com on 7 Mar 2015 at 4:51

@GoogleCodeExporter
Copy link
Author

I suspect (but don't know for sure) that use of ModPagespeedBlockingRewriteKey 
might not interact well with in-place fetches.  It would be better to set a 
longer rewrite deadline.

Nevertheless I think this is a real bug and we should figure out who should 
tackle it.

Original comment by jmara...@google.com on 8 Mar 2015 at 5:14

@GoogleCodeExporter
Copy link
Author

Is everybody on this thread with the bug using ModPagespeedBlockingRewriteKey 
though?

Original comment by hug...@betabrand.com on 8 Mar 2015 at 8:36

@GoogleCodeExporter
Copy link
Author

Looking at the code, the class in question (ApacheFetch) is used in three 
different scenarios:

1. admin-page handling (which I think is not at issue here based on the log 
messages)
2. ModPagespeedInPlaceResourceOptimization (on by default starting in 1.9)
3. ModPagespeedMapProxyDomain

It would be helpful if each person experiencing this bug let us know if they 
are using any of the above features, specifying with or without a 
BlockingRewriteKey.

Original comment by jmara...@google.com on 9 Mar 2015 at 1:37

@GoogleCodeExporter
Copy link
Author

Someone found the problem? I'm having this and I'm trying to find out.

Original comment by lambert....@gmail.com on 11 Mar 2015 at 11:34

@GoogleCodeExporter
Copy link
Author

In order to help finding the source of this bug it would help if you posted the 
information stated in comment #15: 
https://code.google.com/p/modpagespeed/issues/detail?id=1048#c15

Original comment by hug...@betabrand.com on 11 Mar 2015 at 11:37

@GoogleCodeExporter
Copy link
Author

I have the same problem with an Ubuntu-server 13.10

[Tue Mar 31 14:41:20.129694 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 345.023 
sec
[Tue Mar 31 14:41:25.129988 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 350.023 
sec
[Tue Mar 31 14:41:30.130250 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 355.024 
sec
[Tue Mar 31 14:41:35.130507 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 360.024 
sec
[Tue Mar 31 14:41:40.130767 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 365.024 
sec
[Tue Mar 31 14:41:45.131062 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 370.025 
sec
[Tue Mar 31 14:41:50.131376 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 375.025 
sec
[Tue Mar 31 14:41:55.131638 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 380.025 
sec
[Tue Mar 31 14:42:00.131922 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 385.025 
sec
[Tue Mar 31 14:42:05.132191 2015] [pagespeed:warn] [pid 1166] [mod_pagespeed 
1.9.32.3-4448 @1166] Waiting for completion of URL 
http://cometeelmundo.net/sites/default/files/styles/originalwatermark/public/fie
ld/image/Interior%20catedral%20de%20San%20Basilio.jpg?itok=TKiOM8ek for 390.026 
sec

Original comment by foreve...@gmail.com on 31 Mar 2015 at 1:22

@GoogleCodeExporter
Copy link
Author

It would be helpful if each person experiencing this bug let us know if they 
are using either of the these features:

 -  ModPagespeedInPlaceResourceOptimization (on by default starting in 1.9)
 -  ModPagespeedMapProxyDomain

Please also let us know if you also use a BlockingRewriteKey.

Original comment by jmara...@google.com on 31 Mar 2015 at 1:30

@GoogleCodeExporter
Copy link
Author

If I understand correctly ModPagespeedInPlaceResourceOptimization is enabled by 
default on 1.9 without specifying it on the config file, so in this case I have 
it enabled. 

I'm not using ModPagespeedMapProxyDomain and netiher a BlockingRewriteKey.

If you need more information feel free to ask for it.

Original comment by foreve...@gmail.com on 31 Mar 2015 at 1:34

@GoogleCodeExporter
Copy link
Author

I have this problem as well. I regularly need to restart the Apache2 service 
because I use 100% CPU. It happens almost twice a day. It's a problem because 
the rest of the time mod_pagespeed benefits are real and great....

Original comment by vedio...@gmail.com on 31 Mar 2015 at 7:20

@GoogleCodeExporter
Copy link
Author

We're still trying to figure out what's causing this, but a workaround for now 
is to disable in-place resource optimization:

    ModPagespeedInPlaceResourceOptimization off


Original comment by jefftk@google.com on 31 Mar 2015 at 7:49

@GoogleCodeExporter
Copy link
Author

    ModPagespeedInPlaceResourceOptimization off didn't stop my CPU to go high today. I had to restart the service again.

A few examples of what I have in my apache2 error logs (no idea if it is 
related) : 
=> tail -F /var/log/apache2/error.log 



[Wed Apr 01 13:04:40.539741 2015] [pagespeed:error] [pid 6440] [mod_pagespeed 
1.9.32.3-4448 @6440] http://www.xxxxxx.ch/image/ico_delete_small.gif 
(connecting to:%22):0: Error status=670002 (Name or service not known) 
serf_connection_create2
[Wed Apr 01 13:04:40.539799 2015] [pagespeed:warn] [pid 6440] [mod_pagespeed 
1.9.32.3-4448 @6440] Fetch failed to start: 
http://www.xxxxxx.ch/image/ico_delete_small.gif (connecting to:%22)
[Wed Apr 01 13:27:42.192734 2015] [pagespeed:warn] [pid 8212] [mod_pagespeed  
@8212] [0401/132742:WARNING:queued_worker_pool.cc(432)] Canceling 68 functions 
on sequence Shutdown

Original comment by vedio...@gmail.com on 1 Apr 2015 at 4:36

@GoogleCodeExporter
Copy link
Author

vediovis: do you have MapProxyDomain in your config?  Or BlockingRewriteKey?  
Did you flush your cache after turning off InPlaceResourceOptimization?

foreveryo: did you try flushing your cache and turning off 
InPlaceResourceOptimization?

Original comment by jmara...@google.com on 1 Apr 2015 at 5:21

@GoogleCodeExporter
Copy link
Author

jmara: I do not use MapProxyDomain or BlockingRewriteKey
How can I flush the cache? 

The CPU goes high all the time now. It seems the longer the installation, the 
higher frequence of high CPU usage.

Original comment by vedio...@gmail.com on 2 Apr 2015 at 7:45

@GoogleCodeExporter
Copy link
Author

jmara: I do not use MapProxyDomain or BlockingRewriteKey
How can I flush the cache? 

The CPU goes high all the time now. It seems the longer the installation, the 
higher frequence of high CPU usage.

Original comment by vedio...@gmail.com on 2 Apr 2015 at 7:48

@GoogleCodeExporter
Copy link
Author

I have set ModPagespeedInPlaceResourceOptimization off and cleared the cache (I 
am using memcached) I will come back in a few hours to let us know how is 
working.

Original comment by foreve...@gmail.com on 2 Apr 2015 at 8:20

@WpSEOit
Copy link

WpSEOit commented Aug 18, 2015

Thank you @crowell and @jeffkaufman for your perseverance.

Now I'm writing from the company account, but I've already wrote you as @capn3m0 and I'm sorry that I've couldn't done the tests that you have requested: those servers are still in production so, once that the problem has been solved I didn't feel safe to try the test another time threatening to make it unstable.

I am confident that the problem will be solved as soon as possible, meanwhile I hope you have a good job and I would like to say thank you from the WpSEO Staff and our Clients.

@jeffkaufman
Copy link
Contributor

I just went through the four recent backtraces we have and classifies the states of the threads:

1 32  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
1 31  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
1 30  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
1 29  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
1 28  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
1 27  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
1 26  ap_process_request > ap_core_output_filter > poll
1 25  ap_queue_pop > pthread_cond_wait
1 24  ap_process_request > ap_core_output_filter > poll
1 23  ap_queue_pop > pthread_cond_wait
1 22  ap_queue_pop > pthread_cond_wait
1 21  ap_process_request > ap_core_output_filter > poll
1 20  ap_process_request > HandleAsPagespeedResource
                         > ap_core_output_filter > poll
1 19  ap_process_request > HandleAsPagespeedResource > BoundedWaitFor
                         > pthread_cond_timedwait
1 18  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
1 17  ap_queue_pop > pthread_cond_wait
1 16  ap_queue_pop > pthread_cond_wait
1 15  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
1 14  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
1 13  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
1 12  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
1 11  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
1 10  ap_queue_pop > pthread_cond_wait
1 09  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
1 08  ap_process_request > HandleAsPagespeedResource > BoundedWaitFor
                         > pthread_cond_timedwait
1 07  ap_unixd_accept > accept4
1 06  WorkThread::Run > QueuedWorkerPool::Run > ApacheWriter::Write > deflate
1 05  WorkThread::Run > pthread_cond_wait
1 04  WorkThread::Run > pthread_cond_wait
1 03  WorkThread::Run > pthread_cond_wait
1 02  SerfThreadFn > TransferFetchesAndCheckDone > pthread_cond_timedwait
1 01  main > ap_run_mpm > ap_mpm_podx_check > read

2 32  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 31  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 30  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 29  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 28  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 27  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 26  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 25  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 24  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 23  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 22  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 21  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 20  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 19  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 18  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 17  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 16  ap_process_request > ap_core_output_filter > poll
2 15  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 14  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 13  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 12  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 11  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 10  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 09  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
2 08  ap_process_request > RewriteDriver::FinishParse > pthread_cond_timedwait
2 07  ap_queue_info_wait_for_idler > pthread_cond_wait
2 06  WorkThread::Run > QueuedWorkerPool::Run > ApacheWriter::Write > poll
2 05  WorkThread::Run > pthread_cond_wait
2 04  WorkThread::Run > pthread_cond_wait
2 03  WorkThread::Run > pthread_cond_wait
2 02  SerfThreadFn > TransferFetchesAndCheckDone > pthread_cond_timedwait
2 01  main > ap_run_mpm > ap_mpm_podx_check > read

3 26  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 25  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 24  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 23  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 22  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 21  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 20  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 19  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 18  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 17  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 16  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 15  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 14  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 13  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 12  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
3 11  WorkThread::Run > pthread_cond_wait
3 10  WorkThread::Run > AprMemCache::MultiGet > epoll_wait
3 09  WorkThread::Run > pthread_cond_wait
3 08  WorkThread::Run > pthread_cond_wait
3 07  WorkThread::Run > pthread_cond_wait
3 06  WorkThread::Run > pthread_cond_wait
3 05  SerfThreadFn > TransferFetchesAndCheckDone > pthread_cond_timedwait
3 04  WorkThread::Run > pthread_cond_wait
3 03  WorkThread::Run > pthread_cond_wait
3 02  WorkThread::Run > pthread_cond_wait
3 01  main > ap_run_mpm > apr_thread_join > pthread_join

4 35  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 34  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 33  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 32  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 31  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 30  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 29  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 28  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 27  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 26  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 25  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 24  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 23  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 22  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 21  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 20  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 19  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 18  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 17  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 16  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 15  ap_queue_pop > pthread_cond_wait
4 14  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 13  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 12  ap_process_request > ApacheFetch::Wait > pthread_cond_timedwait
4 11  ap_process_request > RewriteDriver::Flush > pthread_cond_timedwait
4 10  unixd_accept > accept4
4 09  WorkThread::Run > pthread_cond_wait
4 08  WorkThread::Run > AprMemCache::MultiGet > epoll_wait
4 07  WorkThread::Run > pthread_cond_wait
4 06  SerfThreadFn > TransferFetchesAndCheckDone > pthread_cond_timedwait
4 05  WorkThread::Run > pthread_cond_wait
4 04  WorkThread::Run > pthread_cond_wait
4 03  WorkThread::Run > pthread_cond_wait
4 02  WorkThread::Run > pthread_cond_wait
4 01  main > ap_run_mpm > ap_mpm_pod_check > read

@jeffkaufman
Copy link
Contributor

We need to learn which thread or threads are spinning and burning 100% cpu, since the backtraces don't look like they should be spinning.

@crowell When you next manage to reproduce this, could you run:

$ ps -p [PID] -w -w -L -o %cpu,lwp,pid,user,args

On the apache process taking 100% CPU? Then in gdb we can see which thread has that LWP id.

For example, on my system (not having the problem) I currently see:

$ ps -p 3753 -w -w -L -o %cpu,lwp,pid,user,args
%CPU   LWP   PID USER     COMMAND
 0.0  3753  3753 jefftk   /usr/local/apache2/bin/httpd -k start
 0.0  3754  3753 jefftk   /usr/local/apache2/bin/httpd -k start
 0.0  3755  3753 jefftk   /usr/local/apache2/bin/httpd -k start
 0.0  3756  3753 jefftk   /usr/local/apache2/bin/httpd -k start

And when I look in gdb:

> thread apply all bt
...
Thread 3 (Thread 0x2b8e8bf27700 (LWP 3755)):
...

@jeffkaufman
Copy link
Contributor

The way PthreadCondvar::TimedWait in pagespeed/kernel/thread/pthread_condvar.cc is not checking the return code on pthread_cond_timedwait doesn't look right. Now, pthread_cond_timedwait should only return a few errors but if we gave one invalid input it would spin forever.

@morlovich
Copy link
Contributor

The only checking that seems appropriate for those would be a CHECK ----
ETIMEOUT is the only one that can happen barring major bugs in our code,
and ignoring ETIMEOUT is the right thing to do.

On Thu, Aug 20, 2015 at 7:50 AM, Jeff Kaufman notifications@github.com
wrote:

The way PthreadCondvar::TimedWait in
pagespeed/kernel/thread/pthread_condvar.cc is not checking the return
code on pthread_cond_timedwait doesn't look right. Now,
pthread_cond_timedwait should only return a few errors
http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_cond_timedwait.html
but if we gave one invalid input it would spin forever.


Reply to this email directly or view it on GitHub
#1048 (comment)
.

@jeffkaufman
Copy link
Contributor

@morlovich agreed

@crowell
Copy link
Contributor

crowell commented Aug 20, 2015

backtrace with on a debug build, along with thread process stats.

https://gist.github.com/e5f323af7e879c226e91

@jeffkaufman
Copy link
Contributor

@crowell That's very strange; doesn't look like 100% cpu at all? Only 25951 and 25901 are at all busy, and they're doing actual work, not waiting around like in our other backtraces.

@jeffkaufman
Copy link
Contributor

@crowell I count 11 of 36 threads as doing real cpu-involving work as opposed to waiting, while on the other backtraces it was at most 1.

@WpSEOit
Copy link

WpSEOit commented Aug 20, 2015

@jeffkaufman when i has the problem only 1 or 2 httpd.worker stuck with high cpu but all the rest httpd process working normally

@jmarantz
Copy link
Contributor

There are some stack-frames that look concerning:

Thread 8 (Thread 0x7f7e85feb700 (LWP 25875)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007f7e9bb24e47 in net_instaweb::PthreadCondvar::TimedWait
(this=0x7f7e7431be38, timeout_ms=0)
at
pagespeed/kernel/thread/pthread_condvar.cc:66
#2 0x00007f7e9b906bac in
net_instaweb::CheckingThreadSystem::CheckingCondvar::TimedWait
(this=0x7f7e744faf68, timeout_ms=5000) at
pagespeed/kernel/base/checking_thread_system.cc:55#3 0x00007f7e9b62e9bc in
net_instaweb::ApacheFetch::Wait (this=0x7f7e741220a8,
rewrite_driver=0x7f7e80135298) at pagespeed/apache/apache_fetch.cc:199

On Thu, Aug 20, 2015 at 2:40 PM, WpSEO.it Hosting WordPress e Consulenza
SEO notifications@github.com wrote:

@jeffkaufman https://github.com/jeffkaufman when i has the problem only
1 or 2 httpd.worker stuck with high cpu but all the rest httpd process
working normally


Reply to this email directly or view it on GitHub
#1048 (comment)
.

@crowell
Copy link
Contributor

crowell commented Aug 20, 2015

@jmarantz yeah, this is also the case in other threads/processes

(gdb) where
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f7e9bb24e47 in net_instaweb::PthreadCondvar::TimedWait (this=0x7f7e9033d178, timeout_ms=0)
    at pagespeed/kernel/thread/pthread_condvar.cc:66
#2  0x00007f7e9b906bac in net_instaweb::CheckingThreadSystem::CheckingCondvar::TimedWait (this=0x7f7e902136c8, 
    timeout_ms=5000) at pagespeed/kernel/base/checking_thread_system.cc:55

the code for the function in frame 2

  virtual void TimedWait(int64 timeout_ms) {
    mutex_->DropLockControl();
    condvar_->TimedWait(timeout_ms);
    mutex_->TakeLockControl();
  }

so there really should be no reason for timeout_ms being reset to 0...

@jmarantz
Copy link
Contributor

Never mind; that was a red herring; PthreadCondvar::TimedWait overwrites
that formal and it winds up holding the number of miliseconds. But the
data is not lost.

int64 timeout_sec_incr = timeout_ms / Timer::kSecondMs;
  *timeout_ms %= Timer::kSecondMs;*
  // Figure out current time, compute absolute time for timeout
  // Carrying ns to s as appropriate.  As morlovich notes, we
  // get *really close* to overflowing a 32-bit tv_nsec here,
  // so this code should be modified with caution.
  if (gettimeofday(&current_time, NULL) != 0) {
    LOG(FATAL) << "Could not determine time of day";
  }
  timeout.tv_nsec = current_time.tv_usec * 1000 + timeout_ms * kMsNs;
  timeout_sec_incr += timeout.tv_nsec / Timer::kSecondNs;
  timeout.tv_nsec %= Timer::kSecondNs;
  timeout.tv_sec = current_time.tv_sec +
static_cast<time_t>(timeout_sec_incr);
  // Finally we actually get to wait.
  pthread_cond_timedwait(&condvar_, &mutex_->mutex_, &timeout);

-Josh

On Thu, Aug 20, 2015 at 2:58 PM, Joshua Marantz jmarantz@google.com wrote:

There are some stack-frames that look concerning:

Thread 8 (Thread 0x7f7e85feb700 (LWP 25875)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1 0x00007f7e9bb24e47 in net_instaweb::PthreadCondvar::TimedWait
(this=0x7f7e7431be38, timeout_ms=0)
at
pagespeed/kernel/thread/pthread_condvar.cc:66
#2 0x00007f7e9b906bac in
net_instaweb::CheckingThreadSystem::CheckingCondvar::TimedWait
(this=0x7f7e744faf68, timeout_ms=5000) at
pagespeed/kernel/base/checking_thread_system.cc:55#3 0x00007f7e9b62e9bc in
net_instaweb::ApacheFetch::Wait (this=0x7f7e741220a8,
rewrite_driver=0x7f7e80135298) at pagespeed/apache/apache_fetch.cc:199

On Thu, Aug 20, 2015 at 2:40 PM, WpSEO.it Hosting WordPress e Consulenza
SEO notifications@github.com wrote:

@jeffkaufman https://github.com/jeffkaufman when i has the problem
only 1 or 2 httpd.worker stuck with high cpu but all the rest httpd process
working normally


Reply to this email directly or view it on GitHub
#1048 (comment)
.

@crowell
Copy link
Contributor

crowell commented Aug 20, 2015

(gdb) p timeout_sec_incr
$6 = 5
(gdb) p timeout
$7 = {tv_sec = 1440096674, tv_nsec = 332683000}

yeah seems to be a valid 5 second wait.

@crowell
Copy link
Contributor

crowell commented Aug 25, 2015

https://gist.github.com/903957a54f2cebcbe1b8

3 stack traces before it got out of the "Waiting for completion" state looking now.

@crowell
Copy link
Contributor

crowell commented Aug 25, 2015

some stack traces ~30 seconds apart, with 1 rewrite thread in config.

https://gist.github.com/63d0b4e6c4c406d2b2d3

they're all identical. after stopping the load on the server, the issue cleared up.

jeffkaufman added a commit that referenced this issue Sep 2, 2015
Before this change we would call ap_rwrite() etc from the rewrite thread, and if
that blocked for a while we might not have any rewrite threads available to
serve other requests.  With this change, all potentially blocking Apache calls
always happen on the request thread.  Writes are buffered in ApacheFetch until
the resource is complete, and then sent out in one go.

Most uses of this (ex: IPRO) were already not streaming, so we don't lose that
much by buffering.  We are doing more copying than we were, and to evaluate this
impact I ran "siege http://localhost:8080/$testimage -c200 -t1h" both with and
without the change, to really stress test this piece of ipro.  This test pulls a
single 1.1MB ipro optimized image with 200 concurrent readers for 1hr, and
should give us a worst-case indication of the slowdown buffering causes.

    before qps: 393.11
    after qps: 392.20

A 0.2% worstcase slowdown is not bad at all; buffering seems to not be a
problem.

This fixes some instances of #1048, but we're not sure yet whether it fixes all
of them.  It definitely doesn't fix ones due to slow filesystems, but we haven't
seen that version in the wild.

For uses of ApacheFetch that we know will always be synchronous we disable this
buffering, which is slightly more efficient.
@crowell
Copy link
Contributor

crowell commented Sep 4, 2015

https://github.com/pagespeed/mod_pagespeed/releases/tag/1.9.32.8 should fix the issue.

anyone who has been affected by this, please give this pre-release a try and report back!

@WpSEOit
Copy link

WpSEOit commented Sep 4, 2015

installed.
I will give you a feedback in the next days.
Thanks for all

@crowell
Copy link
Contributor

crowell commented Sep 22, 2015

While we're confident now that the "Waiting for Completion" state has been fixed, there seems to be a second, related, issue of hangs within apr_memcache2_multgetp

It may be possible for apr_pollset_poll to return a value that isn't APR_SUCCESS due to a failure that isn't a timeout.

It may also be possible for get_server_line to be called with conn->bb as empty, causing infinite reads of size zero.

We're working on testing these, and will have a test binary soon for interested users.

@sv72
Copy link

sv72 commented Sep 22, 2015

Installed 1.9.32.8 on production this morning after having run it succesfully in test for 2 weeks. Tonight
unfortunately waiting for completion starts occurring again:

[Tue Sep 22 22:05:35 2015] [warn] [mod_pagespeed 1.9.32.8-7388 @13009] Waiting for completion of URL http://m.xxxxx.nl/m20/?gclid=xxx for 110.002 sec.
[Tue Sep 22 22:05:35 2015] [warn] [mod_pagespeed 1.9.32.8-7388 @16737] Waiting for completion of URL http://m.xxxxx.com/wcsstore/dojo18/dijit/layout/TabContainer.js for 100.002 sec.
[Tue Sep 22 22:05:35 2015] [warn] [mod_pagespeed 1.9.32.8-7388 @1573] Waiting for completion of URL http://m.xxxx.nl/m20/?gclid=xxxx 75.002 sec.
[Tue Sep 22 22:05:35 2015] [warn] [mod_pagespeed 1.9.32.8-7388 @13522] Waiting for completion of URL http://m.xxxx.com/wcsstore/dojo18/dijit/layout/_TabContainerBase.js for 70.001 sec.
[Tue Sep 22 22:05:35 2015] [warn] [mod_pagespeed 1.9.32.8-7388 @17553] Waiting for completion of URL http://m.xxxx.com/wcsstore/dojo18/dijit/form/DropDownButton.js for 80.002 sec.

25.000+ times in 15 minutes and then all seems to be fine again. Only happened on 1 out of 3 webservers, but seems to be same behavior as before 1.9.32.8

@jeffkaufman
Copy link
Contributor

@sv72: Did the CPU usage go to 100% when this happened?

@sv72
Copy link

sv72 commented Sep 22, 2015

Actually no, CPU was not higher than normal (30-35%), but memory usage did go to 95% (normally around 50%)

crowell added a commit that referenced this issue Sep 30, 2015
 (issue #1048)

in the case where
           if (strncmp(MS_VALUE, conn->buffer, MS_VALUE_LEN) == 0) {
and
           else if (strncmp(MS_END, conn->buffer, MS_END_LEN) == 0) {

both fail, it was possible for queries_sent to never decrement.
This patch sets rv to APR_EGENERAL in this case, decrements the queries_sent,
and closes the connection.

According to the trace from betabrand this is where the hang is.

Patch applied from apr dev mailing list (http://www.mail-archive.com/dev%40apr.apache.org/msg26265.html)
crowell added a commit that referenced this issue Sep 30, 2015
(issue #1048)

in the case where
	if (strncmp(MS_VALUE, conn->buffer, MS_VALUE_LEN) == 0) {
and
	else if (strncmp(MS_END, conn->buffer, MS_END_LEN) == 0) {

both fail, it was possible for queries_sent to
never decrement.
This patch sets rv to APR_EGENERAL in this case,
decrements the queries_sent,
and closes the connection.

According to the trace from betabrand this is
where the hang is.

Patch applied from apr dev mailing list
(http://www.mail-archive.com/dev%40apr.apache.org/msg26265.html)
@crowell crowell closed this as completed in 19b09fe Oct 1, 2015
@eldk
Copy link
Contributor

eldk commented Oct 9, 2015

Hello,

I have pushed mod_pagespeed 1.9.32.10-7423 to production server (only one domain) and re-enabled rewrite of images in fullsize to test it.

That seems to be ok.

I have those error in apache error.log, a few ones, less than 1 % of rewrited images :
Slow read operation on file (...) configure SlowFileLatencyUs to change threshold (files are all on local disk and loadFromFile).

Is it a new mod_pagespeed parameter or should we use an existing one ?

My conf :

--
Version: 14: on

Filters
cw  Collapse Whitespace
jc  Combine Javascript
gp  Convert Gif to Png
jp  Convert Jpeg to Progressive
jw  Convert Jpeg To Webp
pj  Convert Png to Jpeg
dj  Defer Javascript
hw  Flushes html
io  In-place optimize for browser
idp Insert DNS Prefetch
js  Jpeg Subsampling
pr  Prioritize Critical Css
rj  Recompress Jpeg
rp  Recompress Png
rw  Recompress Webp
rc  Remove Comments
cf  Rewrite Css
jm  Rewrite External Javascript
jj  Rewrite Inline Javascript
cp  Strip Image Color Profiles
md  Strip Image Meta Data

Options
  CssInlineMaxBytes (ci)            10240
  EnableCachePurge (euci)           True
  EnableRewriting (e)               1
  FileCacheCleanIntervalMs (afcci)  3600000
  FileCachePath (afcp)              /var/cache/mod_pagespeed/
  FileCacheSizeKb (afc)             4096000
  ImplicitCacheTtlMs (ict)          15549367000
  LoadFromFileCacheTtlMs (lfct)     15549367000
  LogDir (ald)                      /var/log/pagespeed
  LRUCacheByteLimit (alcb)          16384
  LRUCacheKbPerProcess (alcp)       1024
  MemcachedServers (ams)            localhost:11211
  RequestOptionOverride (roo)      
  RewriteDeadlinePerFlushMs (rdm)   20
  RewriteLevel (l)                  Optimize For Bandwidth
  SslCertDirectory (assld)          /etc/ssl/certs
  StatisticsLogging (asle)          True
  SupportNoScriptEnabled (snse)     False
  UrlSigningKey (usk)               

Domain Lawyer

@crowell
Copy link
Contributor

crowell commented Oct 9, 2015

@eldk that is just a diagnostic message, letting you know that the time to read a file from disk passed an arbitrary amount of time that we determined to be "slow", it doesn't impact performance, just the message is logged, nothing is cancelled or unscheduled. If you want to silence this, you can configure the SlowFileLatencyUs to a higher number.

@eldk
Copy link
Contributor

eldk commented Oct 13, 2015

Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests