Serf spin causes 100% CPU usage #674

Closed
melong007 opened this issue Apr 22, 2014 · 13 comments

melong007 commented Apr 22, 2014

The process stack:

#0  0x00000039a580e4ed in read () from /lib64/libpthread.so.0
#1  0x000000000088859f in apr_socket_recv ()
#2  0x000000000087b63d in socket_reader ()
#3  0x000000000087ab28 in common_databuf_prep ()
#4  0x000000000087ab9c in serf_databuf_read ()
#5  0x000000000087cbfc in serf_limit_read ()
#6  0x000000000087bca3 in serf_response_read ()
#7  0x00000000004d2f3e in net_instaweb::SerfFetch::ReadBody (this=0x7f127401de60, response=0x7f1270008a18)
    at /data/cache9/pagespeed/mod_pagespeed/tmp/tmp1/src/net/instaweb/system/serf_url_async_fetcher.cc:577
#8  0x00000000004d2a6d in net_instaweb::SerfFetch::HandleResponse (this=0x7f127401de60, response=0x7f1270008a18)
    at /data/cache9/pagespeed/mod_pagespeed/tmp/tmp1/src/net/instaweb/system/serf_url_async_fetcher.cc:464
#9  0x00000000004d289c in net_instaweb::SerfFetch::HandleResponse (request=0x7f127004f888, response=0x7f1270008a18, 
    handler_baton=0x7f127401de60, pool=0x7f1270015758)
    at /data/cache9/pagespeed/mod_pagespeed/tmp/tmp1/src/net/instaweb/system/serf_url_async_fetcher.cc:354
#10 0x000000000087e864 in serf__process_connection ()
#11 0x000000000087a6b2 in serf_event_trigger ()
#12 0x000000000087a81f in serf_context_run ()
#13 0x00000000004d0e12 in net_instaweb::SerfUrlAsyncFetcher::Poll (this=0xfb60e0, max_wait_ms=50)
    at /data/cache9/pagespeed/mod_pagespeed/tmp/tmp1/src/net/instaweb/system/serf_url_async_fetcher.cc:1211
#14 0x00000000004d4345 in net_instaweb::SerfThreadedFetcher::SerfThread (this=0xfb60e0)
    at /data/cache9/pagespeed/mod_pagespeed/tmp/tmp1/src/net/instaweb/system/serf_url_async_fetcher.cc:931
#15 0x00000000004d40b9 in net_instaweb::SerfThreadedFetcher::SerfThreadFn (thread_id=0xf78a00, context=0xfb60e0)
    at /data/cache9/pagespeed/mod_pagespeed/tmp/tmp1/src/net/instaweb/system/serf_url_async_fetcher.cc:833
#16 0x00000039a58077f1 in start_thread () from /lib64/libpthread.so.0
#17 0x00000039a54e570d in clone () from /lib64/libc.so.6

melong007 (Author) commented Apr 22, 2014

  1. EAGAIN causes the process to spin at the read() call in an infinite loop.
  2. The strace -ff -p output:
[pid 18471] read(232, 0x7f127004d8a4, 8000) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18471] read(232, 0x7f127004d8a4, 8000) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18471] read(232, 0x7f127004d8a4, 8000) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18471] read(232, 0x7f127004d8a4, 8000) = -1 EAGAIN (Resource temporarily unavailable)
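
For context, what the strace shows is read() on a non-blocking socket with no data pending: it fails immediately with EAGAIN, so a caller that retries without waiting for readability spins at 100% CPU. A minimal standalone illustration (my own sketch, not pagespeed or serf code):

```cpp
// Standalone sketch (not pagespeed/serf code): read() on a non-blocking
// socket with nothing pending returns -1/EAGAIN right away; retrying in a
// tight loop instead of going back to poll() reproduces the CPU spin.
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 1;
  fcntl(fds[0], F_SETFL, O_NONBLOCK);  // serf sockets are non-blocking too

  char buf[8000];
  ssize_t n = read(fds[0], buf, sizeof(buf));  // nothing has been written yet
  if (n == -1 && errno == EAGAIN) {
    // Correct handling: return to poll()/epoll() and wait for POLLIN.
    // Calling read() again immediately is what the strace above shows.
    std::printf("read: EAGAIN (%s)\n", std::strerror(errno));
  }
  close(fds[0]);
  close(fds[1]);
  return 0;
}
```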

jeffkaufman (Contributor) commented Apr 22, 2014

Looking at apr/network_io/unix/sendrecv.c, apr_socket_recv is supposed to respond to EAGAIN by continuing to try until sock->timeout is zero.

The socket comes from the databuf as databuf->read_baton->skt; the databuf is the first argument to serf_databuf_read and is passed through from there to common_databuf_prep.
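
To make that concrete, the retry logic in sendrecv.c boils down to roughly the following (a paraphrase using a plain fd and poll() in place of APR's internal types, not the verbatim APR source):

```cpp
// Rough paraphrase of apr_socket_recv()'s EAGAIN handling (an approximation,
// not the verbatim APR source): on EAGAIN it waits for readability and
// retries, but only while the socket has a positive timeout configured.
#include <cerrno>
#include <poll.h>
#include <unistd.h>

ssize_t recv_like_apr(int fd, char* buf, size_t len, int timeout_ms) {
  ssize_t rv;
  do {
    rv = read(fd, buf, len);
  } while (rv == -1 && errno == EINTR);

  while (rv == -1 && (errno == EAGAIN || errno == EWOULDBLOCK) &&
         timeout_ms > 0) {
    // apr_socket_recv delegates this wait to apr_wait_for_io_or_timeout();
    // a plain poll() stands in for it here.
    struct pollfd pfd = {fd, POLLIN, 0};
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0) {
      return -1;  // timeout (APR_TIMEUP) or error: give up
    }
    do {
      rv = read(fd, buf, len);  // the wait said the fd is readable...
    } while (rv == -1 && errno == EINTR);
    // ...but if it did not actually become readable, this is EAGAIN again
    // and we go around the loop once more.
  }
  return rv;
}
```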

jeffkaufman (Contributor) commented Apr 22, 2014

Looking at the apr code, I think we may be hitting a bug in apr_wait_for_io_or_timeout. apr_socket_recv expects that after a call to apr_wait_for_io_or_timeout there will either be data available to read() or an error (APR_TIMEUP or something else) will be returned. This is true when poll() is available, but the implementation for when poll() is unavailable has /* TODO - timeout should be less each time through this loop */.

The non-poll() apr_wait_for_io_or_timeout code is excessively complicated because it's using pollsets with, as far as I can tell, only ever one item in the set.
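
To spell out the contract being relied on: a successful return from the wait is supposed to mean the next read() will not see EAGAIN. Under the assumption that only one descriptor is ever in the set, the poll()-based path reduces to something like this sketch (mine, not the APR source):

```cpp
// Sketch of the contract apr_socket_recv depends on: report "readable" only
// when the fd is actually readable, otherwise report a timeout or error.
// The suspicion in this thread is that the non-poll() fallback can report
// success without the fd being readable, and never reduces the timeout as
// it goes around its loop.
#include <cerrno>
#include <poll.h>

enum WaitResult { kReadable, kTimeout, kError };

WaitResult WaitForReadable(int fd, int timeout_ms) {
  struct pollfd pfd = {fd, POLLIN, 0};
  int rc;
  do {
    rc = poll(&pfd, 1, timeout_ms);
  } while (rc == -1 && errno == EINTR);
  if (rc == 0) return kTimeout;   // APR_TIMEUP equivalent
  if (rc < 0) return kError;
  return kReadable;               // a following read() should not EAGAIN
}
```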

jeffkaufman (Contributor) commented Apr 22, 2014

Current guess: apr_wait_for_io_or_timeout is returning success when there's not actually anything to read, read() returns EAGAIN, we loop forever.

jeffkaufman (Contributor) commented Apr 22, 2014

What OS are you on?

wlxiong commented Jul 1, 2014

We encountered the same problem. Here is the stack trace of the nginx process,

#0  0x0000003b4ac0e54d in read () from /lib64/libpthread.so.0
#1  0x000000000082e3f5 in pagespeed_ol_apr_socket_recv ()
#2  0x000000000074667d in socket_reader ()
#3  0x0000000000740848 in common_databuf_prep ()
#4  0x00000000007408ac in pagespeed_ol_serf_databuf_read ()
#5  0x0000000000745d58 in serf_limit_read ()
#6  0x0000000000743d63 in serf_response_read ()
#7  0x00000000005a519b in net_instaweb::SerfFetch::HandleResponse(serf_request_t*, serf_bucket_t*, void*, apr_pool_t*) ()
#8  0x0000000000743013 in pagespeed_ol_serf__process_connection ()
#9  0x00000000007411f5 in pagespeed_ol_serf_event_trigger ()
#10 0x00000000007413ac in pagespeed_ol_serf_context_run ()
#11 0x00000000005a5ded in net_instaweb::SerfUrlAsyncFetcher::Poll(long) ()
#12 0x00000000005a8aac in net_instaweb::SerfThreadedFetcher::SerfThreadFn(apr_thread_t*, void*) ()
#13 0x0000003b4ac07851 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003b4a4e890d in clone () from /lib64/libc.so.6

and the system call traces

[pid  8465] read(55, 0x7f2e24040114, 8000) = -1 EAGAIN (Resource temporarily unavailable)
[pid  8465] read(55, 0x7f2e24040114, 8000) = -1 EAGAIN (Resource temporarily unavailable)
[pid  8465] read(55, 0x7f2e24040114, 8000) = -1 EAGAIN (Resource temporarily unavailable)
[pid  8465] read(55, 0x7f2e24040114, 8000) = -1 EAGAIN (Resource temporarily unavailable)
[pid  8465] read(55, 0x7f2e24040114, 8000) = -1 EAGAIN (Resource temporarily unavailable)

We are running nginx 1.6.0 as a forward proxy with ngx_pagespeed-1.8.31.2-beta on CentOS 6.4. Anything I can do to help fix the problem?

melong007 (Author) commented Jul 15, 2014

Hi jeffkaufman,
This problem still appears in our production pagespeed deployment. I haven't been able to locate the bug. Is there anything I can do to help fix the problem?

  • ngx_pagespeed version: 1.8.31.4-4009
  • nginx version: 1.6.0
  • OS: CentOS 6.4
  • Nginx stack:
#0  0x000000398a40e54d in read () from /lib64/libpthread.so.0
#1  0x000000000086a0b5 in pagespeed_ol_apr_socket_recv ()
#2  0x00000000007822fd in socket_reader ()
#3  0x000000000077c4c8 in common_databuf_prep ()
#4  0x000000000077c52c in pagespeed_ol_serf_databuf_read ()
#5  0x00000000007819d8 in serf_limit_read ()
#6  0x000000000077f9e3 in serf_response_read ()
#7  0x00000000005e0e1b in net_instaweb::SerfFetch::HandleResponse(serf_request_t*, serf_bucket_t*, void*, apr_pool_t*) ()
#8  0x000000000077ec93 in pagespeed_ol_serf__process_connection ()
#9  0x000000000077ce75 in pagespeed_ol_serf_event_trigger ()
#10 0x000000000077d02c in pagespeed_ol_serf_context_run ()
#11 0x00000000005dffb6 in net_instaweb::SerfFetch::Start(net_instaweb::SerfUrlAsyncFetcher*) ()
#12 0x00000000005e3477 in net_instaweb::SerfUrlAsyncFetcher::StartFetch(net_instaweb::SerfFetch*) ()
#13 0x00000000005e47d1 in net_instaweb::SerfThreadedFetcher::SerfThreadFn(apr_thread_t*, void*) ()
#14 0x000000398a407851 in start_thread () from /lib64/libpthread.so.0
#15 0x000000398a0e890d in clone () from /lib64/libc.so.6

crispygoth commented Jun 9, 2016

Hi,

I've just upgraded to nginx 1.10.1 with ngx_pagespeed 1.11.33.2 (using the packages from dotdeb.org) and it looks like I'm being affected by a bug very similar to this one. strace output is the same (repeated reads returning EAGAIN), and the gdb backtraces look like:

(gdb) bt
#0  0x00007f533c400e53 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000450260 in ngx_epoll_process_events (cycle=0x23ba540, timer=500, flags=1) at src/event/modules/ngx_epoll_module.c:717
#2  0x0000000000446377 in ngx_process_events_and_timers (cycle=cycle@entry=0x23ba540) at src/event/ngx_event.c:242
#3  0x000000000044dee5 in ngx_worker_process_cycle (cycle=cycle@entry=0x23ba540, data=data@entry=0x6) at src/os/unix/ngx_process_cycle.c:753
#4  0x000000000044c90a in ngx_spawn_process (cycle=cycle@entry=0x23ba540, proc=proc@entry=0x44dea0 <ngx_worker_process_cycle>, data=data@entry=0x6, name=name@entry=0xd7ca82 "worker process", respawn=respawn@entry=-3) at src/os/unix/ngx_process.c:198
#5  0x000000000044e154 in ngx_start_worker_processes (cycle=0x23ba540, n=20, type=-3) at src/os/unix/ngx_process_cycle.c:358
#6  0x000000000044ef24 in ngx_master_process_cycle (cycle=0x23ba540) at src/os/unix/ngx_process_cycle.c:130
#7  0x00000000004273ed in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:367
(gdb) info threads
  Id   Target Id         Frame 
  7    Thread 0x7f5336b6b700 (LWP 6921) "nginx" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  6    Thread 0x7f533616a700 (LWP 7257) "nginx" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  5    Thread 0x7f5335769700 (LWP 7261) "nginx" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  4    Thread 0x7f5334d68700 (LWP 7262) "nginx" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  3    Thread 0x7f5327ffe700 (LWP 7573) "nginx" 0x00007f533e4a2add in read () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7f53275fd700 (LWP 21591) "nginx" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 1    Thread 0x7f533eccc740 (LWP 6889) "nginx" 0x00007f533c400e53 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f5327ffe700 (LWP 7573))]
#0  0x00007f533e4a2add in read () at ../sysdeps/unix/syscall-template.S:81
81  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f533e4a2add in read () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000bd453b in apr_socket_recv (sock=0x7f53180fa440, buf=0x7f53180795a4 "* </p>\n        * @method render\n        * @param {String} appendToNode The element id to which the Module \n        * should be appended to prior to rendering <em>OR</em>\n        * @param {HTMLElement}"..., len=0x7f5327ffd780)
    at third_party/apr/src/network_io/unix/sendrecv.c:81
#2  0x0000000000b7292d in socket_reader (baton=0x7f5318079578, bufsize=8000, buf=0x7f53180795a4 "* </p>\n        * @method render\n        * @param {String} appendToNode The element id to which the Module \n        * should be appended to prior to rendering <em>OR</em>\n        * @param {HTMLElement}"..., len=0x7f5327ffd780)
    at third_party/serf/src/buckets/socket_buckets.c:42
#3  0x0000000000b6ab0f in common_databuf_prep (databuf=0x7f5318079580, len=0x7f5327ffd980) at third_party/serf/instaweb_buckets.c:338
#4  0x0000000000b6ab8e in serf_databuf_read (databuf=0x7f5318079580, requested=263564, data=0x7f5327ffd988, len=0x7f5327ffd980) at third_party/serf/instaweb_buckets.c:358
#5  0x0000000000b72aff in serf_socket_read (bucket=0x7f53180c7268, requested=263564, data=0x7f5327ffd988, len=0x7f5327ffd980) at third_party/serf/src/buckets/socket_buckets.c:95
#6  0x0000000000b68bdb in serf_barrier_read (bucket=0x7f53180d2828, requested=263564, data=0x7f5327ffd988, len=0x7f5327ffd980) at third_party/serf/src/buckets/barrier_buckets.c:45
#7  0x0000000000b68bdb in serf_barrier_read (bucket=0x7f53180d3528, requested=263564, data=0x7f5327ffd988, len=0x7f5327ffd980) at third_party/serf/src/buckets/barrier_buckets.c:45
#8  0x0000000000b723a8 in serf_response_body_read (bucket=0x7f53180d3628, requested=263564, data=0x7f5327ffd988, len=0x7f5327ffd980) at third_party/serf/src/buckets/response_body_buckets.c:67
#9  0x0000000000b6e826 in serf_response_read (bucket=0x7f53180d2728, requested=18446744073709551615, data=0x7f5327ffd988, len=0x7f5327ffd980) at third_party/serf/instaweb_response_buckets.c:423
#10 0x0000000000abd30d in net_instaweb::SerfFetch::ReadBody (this=0x7f53180b7070, response=0x7f53180d2728) at pagespeed/system/serf_url_async_fetcher.cc:593
#11 0x0000000000abfe56 in net_instaweb::SerfFetch::HandleResponse (this=0x7f53180b7070, response=0x7f53180d2728) at pagespeed/system/serf_url_async_fetcher.cc:498
#12 0x0000000000abff6f in net_instaweb::SerfFetch::HandleResponse (request=0x7f5318182738, response=0x7f53180d2728, handler_baton=0x7f53180b7070, pool=0x7f53180d42a8) at pagespeed/system/serf_url_async_fetcher.cc:367
#13 0x0000000000b6cac2 in handle_response (request=0x7f5318182738, pool=0x7f53180d42a8) at third_party/serf/instaweb_outgoing.c:939
#14 0x0000000000b6ce6d in read_from_connection (conn=0x7f5318002938) at third_party/serf/instaweb_outgoing.c:1134
#15 0x0000000000b6d06b in serf__process_connection (conn=0x7f5318002938, events=1) at third_party/serf/instaweb_outgoing.c:1255
#16 0x0000000000b6917a in serf_event_trigger (s=0x24a11f0, serf_baton=0x7f5318002948, desc=0x24a13e8) at third_party/serf/src/context.c:226
#17 0x0000000000b69301 in serf_context_run (ctx=0x24a11f0, duration=50000, pool=0x24a1128) at third_party/serf/src/context.c:300
#18 0x0000000000ac1ecb in net_instaweb::SerfUrlAsyncFetcher::Poll (this=0x23e2240, max_wait_ms=50) at pagespeed/system/serf_url_async_fetcher.cc:1251
#19 0x0000000000ac509e in net_instaweb::SerfThreadedFetcher::SerfThread (this=0x23e2240) at pagespeed/system/serf_url_async_fetcher.cc:954
#20 0x0000000000ac5148 in net_instaweb::SerfThreadedFetcher::SerfThreadFn (thread_id=0x24a1700, context=0x23e2240) at pagespeed/system/serf_url_async_fetcher.cc:856
#21 0x0000000000af452d in dummy_worker (opaque=0x24a1700) at third_party/apr/src/threadproc/unix/thread.c:142
#22 0x00007f533e49c0a4 in start_thread (arg=0x7f5327ffe700) at pthread_create.c:309
#23 0x00007f533c40087d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

libapr1 is at version 1.5.1 (standard Debian Jessie package).

Is there anything I can do to avoid this problem? Sorry if this is actually a different issue; if so, let me know and I'll open a new one instead.

crowell (Contributor) commented Jun 9, 2016

that block in apr hasn't been updated since 2003 (although it's possible that apr is still to blame).

This does look like the same issue.
The apr code should be built into the psol binary, so you'd be using that instead of the system copy by default.

How often does this happen, and can you reproduce this with any regularity?

crispygoth commented Jun 10, 2016

We only went live with this new version a couple of days ago, but I've seen it happen a couple of times so far. Our traffic peaks at around 300 reqs/sec currently, so it's not exactly that frequent relatively speaking.

I have a theory (as yet untested) that it happens when our failover system (using keepalived) triggers -- I actually noticed this happening because we'd just failed over and there were still nginx processes using CPU on the server that had just gone out of service. I suspect that the connection pagespeed had open became invalid because the IP it had been using as its source address was no longer on the interface.

jmarantz (Contributor) commented Jun 10, 2016

Your theory seems plausible. One thing I can't tell for sure from your stack trace is whether it's bouncing out to serf_url_async_fetcher.cc and looping there, or whether it's looping inside the serf or apr code.

If it's in serf_url_async_fetcher.cc we might be able to work around it by failing the fetch after a threshold of iterations through the loop without making progress. Or enforcing the timeout in that loop.
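
A rough sketch of the no-progress guard being suggested (hypothetical names, not the actual serf_url_async_fetcher.cc code):

```cpp
// Hypothetical sketch of the suggested workaround: abandon a fetch whose
// read loop keeps iterating without delivering any new bytes, instead of
// letting it spin indefinitely.
#include <cstddef>

class SpinGuard {
 public:
  explicit SpinGuard(int max_idle_iterations)
      : max_idle_(max_idle_iterations), idle_(0) {}

  // Call once per pass through the read loop with the bytes read that pass.
  // Returns false after max_idle_ consecutive passes with no progress, at
  // which point the caller should fail the fetch rather than keep looping.
  bool StillMakingProgress(std::size_t bytes_read_this_pass) {
    if (bytes_read_this_pass > 0) {
      idle_ = 0;
      return true;
    }
    return ++idle_ < max_idle_;
  }

 private:
  const int max_idle_;
  int idle_;
};
```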


hillsp (Contributor) commented Jun 13, 2016

Thanks for the stack trace @crispygoth! It turns out that SerfFetch::ReadBody doesn't handle EAGAIN correctly and immediately tries again to read from a non-blocking socket. If there's no data available, like when a socket is hung after a fail-over, it will spin.

I'm working on a fix.
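
For readers following along, the general shape of EAGAIN-aware handling in that kind of read loop looks roughly like the sketch below (hypothetical code using the public serf bucket API, not the actual SerfFetch::ReadBody fix):

```cpp
// Hypothetical sketch (not the actual SerfFetch::ReadBody fix): when
// draining a serf response bucket, APR_EAGAIN means "no data right now",
// so stop and let serf_context_run() call us again once the socket is
// readable, instead of immediately reading again.
#include "apr_errno.h"
#include "serf.h"

// Returns true when the response body is complete; sets *error on failure.
bool DrainResponse(serf_bucket_t* response, bool* error) {
  *error = false;
  while (true) {
    const char* data = NULL;
    apr_size_t len = 0;
    apr_status_t status =
        serf_bucket_read(response, SERF_READ_ALL_AVAIL, &data, &len);
    // ... hand (data, len) to the fetch's output here ...
    if (APR_STATUS_IS_EOF(status)) {
      return true;                  // body fully read
    }
    if (APR_STATUS_IS_EAGAIN(status)) {
      return false;                 // no data yet: do NOT retry immediately
    }
    if (status != APR_SUCCESS) {
      *error = true;                // real error: fail the fetch
      return false;
    }
    // APR_SUCCESS with possibly more data pending: loop and read again.
  }
}
```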

hillsp self-assigned this Jun 13, 2016

hillsp (Contributor) commented Jun 27, 2016

The fix for this was in the serf fetcher, which is in the core mod_pagespeed repo. I found a number of problems with the error handling and the fix will definitely be in our next release. If you want to patch it locally, here are links to the sequence of commits you'll need:

apache/incubator-pagespeed-mod@9d76d90
apache/incubator-pagespeed-mod@66d8196
apache/incubator-pagespeed-mod@dfdd05a
apache/incubator-pagespeed-mod@08e7b80

hillsp closed this Jun 27, 2016
hillsp added a commit to apache/incubator-pagespeed-mod that referenced this issue Jul 21, 2016
crowell added a commit to apache/incubator-pagespeed-mod that referenced this issue Jul 21, 2016
crowell changed the title from "ngx_pagespeed cpu 100% Problem" to "Serf spin causes 100% CPU usage" Aug 2, 2016
pono unassigned hillsp Jan 8, 2018