Agent stalls with 100% cpu utilization #14

kingcu · 2009-11-17T17:52:36Z

I have been struggling for several days with a problem in my agents. Randomly, they will stall and use 100% of the CPU. strace reveals the agents are just context switching and doing nothing:

--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 40001616
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 101
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0

I have tried everything: I modified the agents to use epoll rather than select, and tried Ruby Enterprise Edition and Ruby 1.9 (they remove the syscalls from the strace output, but the agents still lock). I cannot discern a pattern in when the agents lock: the job they lock on isn't consistent, except that it always happens during a job that uses net/http to pull down some images and stitch them together.

I thought it might be an issue with calling sleep() inside the agents, but that didn't solve anything. I really have no idea where to go from here.

Pastie to my agent code: http://pastie.org/702881
Pastie to image fetch/stitch code: http://pastie.org/702895

On the plus side, I'll be able to give you a quick modification to nanite that causes it to use epoll, which dropped my CPU utilization a hair while performing a large number of jobs! Any ideas on where to even start looking from here would be appreciated, otherwise I am going to just start commenting out code until something changes (the worst way to debug!).

roidrage · 2009-11-18T13:07:43Z

I wish I had any idea where to start. How many messages are you seeing? Please also run the agents in debug log mode so you can at least see what their last activity was. I'd like to get Nanite a lot more bullet-proof in that regard.

Also, what EventMachine and AMQP version are you using?

kingcu · 2009-11-18T18:38:10Z

Whoops, forgot to mention the particulars:

AMQP 0.6.5
EM 0.12.10 (same happens with 0.12.8)

It happens when I push through a group of 200 or so jobs, with prefetch set at 1, so only one job is on the agent at a time. It doesn't seem to stall on any particular piece of code that I can discern. Additionally, strace shows absolutely no activity other than the SIGVTALRM signals, even though the CPU is pegged, which is beyond me. From my understanding, this means all threads have finished. This is why I think there is something odd going on with Nanite/AMQP/EM: the stall is inconsistent, and they seem to think the work is done before it actually is.

I have been reading up on ltrace and more detailed strace use, so I should have a bit more to go on soon.

roidrage · 2009-11-21T12:13:54Z

I'll try to set up something to fire similar jobs. Maybe I can reproduce it.

kingcu · 2009-11-24T15:55:57Z

No rush, I'm super busy for the next couple of weeks and it's working great in production. I've been running a couple thousand jobs a day through it with no hiccups, so it may just be an issue of pushing too many of the same job at once (I do batches of the same job when reprocessing failed jobs). I am thinking it may be an issue with net/http at this point (no solid reason why), so I will try em/http since EM is already loaded. I'll let you know the results and go from there.
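For anyone following along with the net/http suspicion, here is a minimal sketch of fetching an image with explicit connect and read timeouts, so a hung socket cannot block a job indefinitely. The helper names `build_http_client` and `fetch_image` and the timeout values are made up for illustration and are not taken from the pastied agent code. It also assumes a modern Ruby (keyword arguments), not the 1.8/1.9 interpreters discussed in this thread, and it is not a confirmed fix for the stall.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds a Net::HTTP client with explicit timeouts.
# The default values are arbitrary and should be tuned per workload.
def build_http_client(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # seconds allowed for the TCP connect
  http.read_timeout = read_timeout # seconds allowed for each socket read
  http
end

# Hypothetical helper: returns the response body on a 2xx response,
# or nil on a non-2xx response, a timeout, or a network error.
def fetch_image(url, **timeouts)
  uri = URI.parse(url)
  http = build_http_client(url, **timeouts)
  response = http.request_get(uri.request_uri)
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
rescue Timeout::Error, SystemCallError, SocketError => e
  # Net::OpenTimeout and Net::ReadTimeout are subclasses of Timeout::Error.
  warn "fetch failed for #{url}: #{e.class}: #{e.message}"
  nil
end
```

Even with timeouts, net/http still blocks the calling thread (and so the EventMachine reactor, if called from it) for up to the timeout duration, which is the usual argument for moving to an EM-based HTTP client.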

On a possibly unrelated note: when stracing ruby processes, I noticed a TON of ENOENT errors for required library files as Ruby searches the load path for the rb file. For example, it will look for parser.rb in the nanite gems directory, not find it, then check in the AMQP gems directory and on down until it finds optparser in the ruby core lib directories. These libraries are checked for during execution of every job and probably add a great deal to the overhead of running a pile of jobs. No idea if this is a ruby or a nanite issue at this point (ruby is my guess), but it's also on my list of things to investigate.
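To illustrate the lookup described above, here is a simplified sketch of how a `require` walks the load path directory by directory. Every miss before the first hit corresponds to a failed filesystem probe, which is roughly what shows up as ENOENT in strace. The helpers `require_candidates` and `probe` are hypothetical names for illustration. The real `require` also probes for native extensions, consults `$LOADED_FEATURES`, and is further affected by RubyGems, so treat this as an approximation only.

```ruby
# Hypothetical helper: lists the .rb paths a `require name` would
# probe, in load-path order.
def require_candidates(name, load_path = $LOAD_PATH)
  load_path.map { |dir| File.join(dir, "#{name}.rb") }
end

# Hypothetical helper: walks the candidates in order and reports the
# first path that exists, plus every path that was tried and missed.
def probe(name, load_path = $LOAD_PATH)
  misses = []
  require_candidates(name, load_path).each do |path|
    return { found: path, misses: misses } if File.file?(path)
    misses << path
  end
  { found: nil, misses: misses }
end
```

Calling `probe("optparse")` in an environment with many gem directories on the load path gives a rough count of how many directories are searched before the standard library copy is found.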

roidrage · 2009-11-24T18:42:41Z

We're using a lot of net/http as well, and we'll replace it with em-http based code eventually, since there's always a risk of blockage.

Keep me posted.

About the requiring stuff, I'd say that's normal.
