Agent stalls with 100% cpu utilization #14

Open
kingcu opened this issue Nov 17, 2009 · 5 comments

@kingcu

kingcu commented Nov 17, 2009

I have been struggling for several days with a problem in my agents. Randomly, they will stall and use 100% of the CPU. strace reveals the agents are just context switching and doing nothing:

--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 40001616
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 1
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 101
--- SIGVTALRM (Virtual timer expired) @ 0 (0) ---
rt_sigreturn(0) = 0

I have tried everything I can think of: I modified the agents to use epoll rather than select, and tried Ruby Enterprise Edition and Ruby 1.9 (they remove the syscalls from the strace output, but the agents still lock). I cannot discern a pattern in when the agents lock. The job they lock on isn't consistent, aside from always being a job that uses net/http to pull down some images and stitch them together.

I thought it might be an issue with calling sleep() inside the agents, but that didn't solve anything. I really have no idea where to go from here.

Pastie to my agent code: http://pastie.org/702881
Pastie to image fetch/stitch code: http://pastie.org/702895

On the plus side, I'll be able to give you a quick modification to nanite that causes it to use epoll, which dropped my CPU utilization a hair while performing a large number of jobs! Any ideas on where to even start looking from here would be appreciated, otherwise I am going to just start commenting out code until something changes (the worst way to debug!).

@roidrage
Collaborator

I wish I had any idea where to start. What's the number of messages you're seeing? Please also run the agents in debug log mode so you can at least see what their last activity was. I'd like to get Nanite a lot more bullet-proof in that regard.

Also, which EventMachine and AMQP versions are you using?
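
One way to see what a stalled agent was last doing, beyond debug logging, is to have it dump its thread backtraces on demand. The following is only a minimal sketch, assuming a Ruby recent enough to provide Thread#backtrace (so not Ruby 1.8); `thread_dump` is a hypothetical helper and not part of Nanite, and the choice of SIGUSR1 is arbitrary.

```ruby
# Hypothetical helper, not part of Nanite: formats a backtrace for every
# live thread in the process.
def thread_dump
  Thread.list.map do |thread|
    frames = thread.backtrace || ["(no backtrace available)"]
    "Thread #{thread.object_id} status=#{thread.status.inspect}\n  " +
      frames.join("\n  ")
  end.join("\n\n")
end

# On SIGUSR1, write the dump to stderr so a running agent can be inspected
# with `kill -USR1 <pid>` without stopping it.
Signal.trap("USR1") { $stderr.puts thread_dump }
```

If the process is spinning somewhere the interpreter cannot run signal handlers, the dump may never appear, which would itself be a hint about where the time is going.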

@kingcu
Author

kingcu commented Nov 18, 2009

Whoops, forgot to mention the particulars:

AMQP 0.6.5
EM 0.12.10 (same happens with 0.12.8)

It happens when I push through a group of 200 or so jobs, with prefetch set at 1, so only one job is on the agent at a time. It doesn't seem to stall on any particular piece of code that I can discern. Additionally, strace shows absolutely 0 activity outside of the SIGVTALRM signals, even though the CPU is pegged, which is beyond me. From my understanding, this means all threads have finished. This is why I think there is something odd going on with Nanite/AMQP/EM, because it's inconsistent and they think work is done before it is done.

I have been reading up on ltrace and more detailed strace use, so I should have a bit more to go on soon.

@roidrage
Collaborator

I'll try to set up something to fire similar jobs. Maybe I can reproduce it.
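
A reproduction harness for this could be as simple as pushing a batch of identical jobs through one at a time and recording which one, if any, never returns. This is a rough sketch using only the standard library; `run_batch` and the `jobs` list of callables are hypothetical stand-ins for real agent work, and the time limit is an arbitrary assumption.

```ruby
require "timeout"

# Runs each job in order, one at a time (mirroring prefetch = 1), and records
# whether it finished, raised, or exceeded the time limit.
# `jobs` is a hypothetical list of callables standing in for real agent work.
def run_batch(jobs, limit_seconds)
  jobs.each_with_index.map do |job, index|
    started = Time.now
    status =
      begin
        Timeout.timeout(limit_seconds) { job.call }
        :ok
      rescue Timeout::Error
        :timeout
      rescue StandardError
        :error
      end
    { index: index, status: status, seconds: Time.now - started }
  end
end
```

Filtering the result for `:timeout` entries would show which position in the batch stalled. Timeout depends on the interpreter being able to interrupt the job, so a loop spinning inside native code might not be caught this way.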

@kingcu
Author

kingcu commented Nov 24, 2009

No rush. I'm super busy for the next couple of weeks, and it's working great in production. I've been running a couple thousand jobs a day through it with no hiccups, so it may just be an issue of pushing too many of the same job (I do batches of the same job when reprocessing failed jobs). I am thinking it may be an issue with net/http at this point (no solid reason why), so I will try em/http since EM is already loaded. I'll let you know the results and go from there.

On a possibly unrelated note: when stracing ruby processes, I noticed a TON of ENOENT errors for required library files as it checks through the load paths for the rb file. For example, it will look for parser.rb in the nanite gems directory, not find it, check in the AMQP gems directory, and so on down until it finds optparser in the ruby core lib directories. These libraries are checked for during execution of every job and probably add a great deal to the overhead of running a pile of jobs. No idea if this is a ruby or a nanite issue at this point (ruby is my guess), but it's also on my list of things to investigate.
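
To make the lookup order described above concrete, here is a rough sketch that imitates how a feature name is probed across the load path, with each miss corresponding to one ENOENT in strace. `probe_load_path` is a hypothetical helper that only mimics the search order; it is not how `require` is implemented internally, and it ignores non-.rb extensions.

```ruby
# Returns every candidate path checked for `feature`, in load-path order,
# stopping at the first one that exists on disk.
def probe_load_path(feature, load_path = $LOAD_PATH)
  tried = []
  load_path.each do |dir|
    candidate = File.join(dir.to_s, "#{feature}.rb")
    tried << candidate
    return { found: candidate, tried: tried } if File.file?(candidate)
  end
  { found: nil, tried: tried }
end

result = probe_load_path("optparse")
puts "checked #{result[:tried].size} path(s); found: #{result[:found].inspect}"
```

Ruby records loaded files in `$LOADED_FEATURES`, so a repeated `require` of the same feature is normally skipped. Whether the lookups really repeat for every job would depend on how the job code loads its dependencies, which this thread does not show.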

@roidrage
Collaborator

We're using a lot of net/http as well, and we'll replace it with em-http based code eventually, since there's always a risk of blockage.
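
Until a switch to em-http happens, one stopgap for the blockage risk is to give every net/http call explicit timeouts so a hung download fails instead of waiting indefinitely. A minimal sketch, assuming a Ruby with keyword arguments; `build_http` and `fetch_image` are hypothetical names, and the 5 and 10 second values are illustrative rather than taken from this thread.

```ruby
require "net/http"
require "uri"

# Builds a Net::HTTP client with explicit timeouts. The default values are
# illustrative assumptions, not recommendations from this thread.
def build_http(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout
  http.read_timeout = read_timeout
  http
end

# Fetches the response body, returning nil on timeout or connection failure
# so the calling job can decide whether to retry.
def fetch_image(url)
  uri = URI.parse(url)
  response = build_http(url).request_get(uri.request_uri)
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
rescue Timeout::Error, SystemCallError, SocketError, IOError => e
  warn "fetch failed for #{url}: #{e.class}"
  nil
end
```

This still blocks the calling thread while the request is in flight, so inside an EventMachine reactor it only bounds the stall; it does not remove it.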

Keep me posted.

About the requiring stuff, I'd say that's normal.
