Reuse of workers with low values of size #29

Closed
davidw opened this issue Jun 11, 2013 · 11 comments

davidw (Contributor) commented Jun 11, 2013

I'm using Chicago Boss, and got curious about how I could make its initial memory usage a little bit lower. One obvious thing would be to start the database connection pool off with one worker, rather than five, which is probably fine, since this is a semi-embedded system where we won't normally have concurrent users. max_overflow was set to 20. Out of curiosity, I ran some simple benchmarks with "ab -c 10 -n 1000" (1000 requests, running 10 at a time), and noticed that dropping the size below 10 slows things down.

After some digging, I realized that poolboy is constantly tearing down and creating new connections. Something about this does not seem quite right: if I'm getting a lot of requests right now, I can expect to keep getting them for the immediate future, so I don't want to shut them down immediately.

I'm not exactly sure how things ought to work, but I'd expect that workers would go away slowly, rather than being reaped quickly.

devinus (Owner) commented Jun 11, 2013

@davidw Workers should only be "churning" like that if their processes are crashing.

davidw (Contributor, Author) commented Jun 11, 2013

Hi,

Here's what I did to dig into what seems to be going on. Perhaps there is a mistake, or a problem at some other level of the system.

From boss_db_sup.erl, which starts up poolboy:

init(StartArgs) ->
    Args = [{name, {local, boss_db_pool}},
        {worker_module, boss_db_controller},
        {size, 1}, {max_overflow, 20}|StartArgs],
    PoolSpec = {db_controller, {poolboy, start_link, [Args]}, permanent, 2000, worker, [poolboy]},
    {ok, {{one_for_one, 10, 10}, [PoolSpec]}}.

A little patch to show when the workers churn:

 new_worker(Sup) ->
+    io:format("new_worker~n"),
     {ok, Pid} = supervisor:start_child(Sup, []),
     true = link(Pid),
     Pid.

 new_worker(Sup, FromPid) ->
+    io:format("new_worker~n"),
     Pid = new_worker(Sup),
     Ref = erlang:monitor(process, FromPid),
     {Pid, Ref}.

 dismiss_worker(Sup, Pid) ->
+    io:format("dismiss_worker~n"),
     true = unlink(Pid),
     supervisor:terminate_child(Sup, Pid).

With that in place, I did:

ab -c 10 -n 1000 http://localhost:8001/

And I get a lot of:

new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
dismiss_worker
new_worker
new_worker
dismiss_worker

Now, if I change the size parameter to 10 and rerun 'ab', I don't get any churn, and the requests are naturally faster too.

So, that's what I can see... For what it's worth, this is using epgsql under the hood, managed by boss_db.

I don't see any evidence of crashes, and I would think that if there were something crashing, it would not depend on the size parameter.

devinus (Owner) commented Jun 11, 2013

@davidw Ah, this is because the size of your pool is 1, with a max_overflow of 20. This means that your pool is only keeping 1 worker on hand. A max_overflow of 20 means that instead of queueing requests and waiting for your one available worker to be checked back in, the pool will create up to 20 new workers to handle increased workload, but then dismiss them when they're checked back in. That's where the churn is.
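
To make the churn concrete, here is a rough illustration using poolboy:checkout/1 and poolboy:checkin/2, assuming the boss_db_pool from the init/1 above (size 1, max_overflow 20). It is a sketch of the behaviour described in this thread, not code from poolboy or boss_db:

%% With size = 1, a second concurrent checkout has to spawn an overflow
%% worker; checking it back in (with nobody waiting) dismisses it again.
W1 = poolboy:checkout(boss_db_pool),   %% the single permanent worker
W2 = poolboy:checkout(boss_db_pool),   %% overflow: a new worker is started
poolboy:checkin(boss_db_pool, W2),     %% the overflow worker is dismissed
poolboy:checkin(boss_db_pool, W1).     %% the permanent worker stays in the pool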

davidw (Contributor, Author) commented Jun 11, 2013

Aha... OK, so that explains things. Now, is there a way to get something along the lines of how I thought it worked without radically changing things? For instance, the overflow workers could go away a little bit at a time, or there could be params like start_size / max_workers. Here's my use case: this is all going to go on a semi-embedded system where it's going to be pretty normal to have only one user at a time, plus maybe a system process or two accessing the DB. It'd be nice to start with just one worker and let things ramp up if more are needed. It's not that big a deal either way; it just struck me as something that would help us contain memory usage. Obviously the problem isn't the extra worker itself, but the DB child process it engages.

Vagabond (Collaborator) commented:

Actually, on checkin, if there is work pending (as in you're using nonblocking checkouts), poolboy will reuse the worker instead of killing it. Other than that, yes, they will get reaped pretty quickly.

devinus closed this as completed Jun 13, 2013

davidw (Contributor, Author) commented Jun 13, 2013

Hi,

Can you add some explanation rather than simply closing?

I opened it because it's a genuine problem: poolboy, as it stands, scales poorly because you have to guess at the "correct" value for size, and if you guess wrong, either 1) the system over-allocates resources up front, or 2) it churns once demand exceeds size.

Adaptive scaling is pretty common - the Apache web server, for instance, has configuration parameters like: MaxSpareServers, MinSpareServers, StartServers, and MaxClients.

The computer knows how many resources it needs at a given point in time - as long as it's under a certain maximum limit, it should be free to allocate those resources and then gradually eliminate them as the need lessens.

devinus (Owner) commented Jun 13, 2013

@davidw The issue is closed because it's not an issue. However, GitHub still allows discussion here, as you can see. Poolboy, as it stands, scales just fine. I've had it scaling to hundreds of workers at thousands of requests a second (see: https://github.com/devinus/poolboy-benchmark). It doesn't scale any further than what you configure it to scale to.

You have two important options:

  • size: The initial and maximum size of the pool. Checking a worker out never grows the pool beyond this. As long as your max_overflow is 0 and your workers aren't dying before they're checked back into the pool, there will be no "churn".
  • max_overflow: The number of temporary workers the pool is allowed to create when demand exceeds size; these are dismissed again when they're checked back in.

If you don't want "churn", then don't allow overflow. If you don't want a lot of workers because of memory constraints, then don't allow a large pool size -- your requests will be queued to the pool with a configurable timeout.
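
For illustration, here is a sketch of the two setups being contrasted, reusing the boss_db_pool arguments from the init/1 earlier in the thread. The numbers are placeholders, not recommendations, and the three-argument poolboy:checkout/3 (pool, block flag, timeout in milliseconds) is from poolboy's current API, so it may differ in the version in use here:

%% (a) No overflow: a fixed pool of 10 workers. Checkouts beyond 10 queue
%%     until a worker is checked back in, or give up after the timeout.
NoOverflow = [{name, {local, boss_db_pool}},
              {worker_module, boss_db_controller},
              {size, 10}, {max_overflow, 0}],

%% (b) Small pool plus overflow: 1 permanent worker and up to 20 temporary
%%     ones, which are dismissed again on checkin (the churn above).
WithOverflow = [{name, {local, boss_db_pool}},
                {worker_module, boss_db_controller},
                {size, 1}, {max_overflow, 20}],

%% A blocking checkout with an explicit timeout (milliseconds):
Worker = poolboy:checkout(boss_db_pool, true, 5000).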

As with everything involving pools and caching, it's up to you -- the user -- to profile and figure out the best configuration yourself.

davidw (Contributor, Author) commented Jun 14, 2013

I don't have a crystal ball.

I think this is the crux of the problem: nobody does, so it's best to let the software adapt to its environment, within certain constraints.

On a more practical and less hand-wavy level, what if the system were modified to start a timer instead of reaping workers immediately? When the timer goes off, if usage levels are still high, don't kill anything; if they've receded, we can start knocking off the spawned workers. This would let us both start with a small size and lessen churn in case of a heavy usage spike.
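
Purely as a hypothetical sketch of that idea (none of these module or function names exist in poolboy), the pool could keep an idle overflow worker around for a grace period after checkin and only dismiss it if it hasn't been checked out again by the time the timer fires:

%% Hypothetical sketch, not poolboy code: a grace period before an idle
%% overflow worker is reaped.
-module(lazy_reap).
-export([checkin/2, checkout/1]).

-define(GRACE_MS, 5000).

%% On checkin, don't dismiss the overflow worker; schedule a possible reap
%% and remember the timer alongside the idle worker.
checkin(Idle, WorkerPid) ->
    TRef = erlang:send_after(?GRACE_MS, self(), {reap, WorkerPid}),
    [{WorkerPid, TRef} | Idle].

%% On checkout, reuse the most recently idled worker and cancel its pending
%% reap; with no idle workers, the caller would spawn a new one as today.
checkout([{WorkerPid, TRef} | Rest]) ->
    erlang:cancel_timer(TRef),
    {WorkerPid, Rest};
checkout([]) ->
    none.

%% The owning process would handle {reap, WorkerPid} by dismissing the
%% worker only if it is still sitting in the idle list at that point.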

devinus (Owner) commented Jun 14, 2013

@davidw Here's the thing: if you give your pool a modest size and no max_overflow, there will never be any worker churn. The workers won't be killed, and requests to check out from your pool will queue until a worker is available.

evanmiller commented:

Would it be possible to configure the timeout before unused workers are killed? That way we could minimize churn while still having overflow connections. For our application, workers are expensive to create, and it makes sense to hold on to them for a bit in case they're needed again in the next few seconds. But they also consume resources, so it makes sense to reap them after a certain amount of time; max_overflow = 0 is therefore not a good solution for us.

devinus (Owner) commented Jun 16, 2013

@davidw @evanmiller Hrm, that's an interesting idea. Let's open a new ticket? I'd accept a patch with tests. Maybe we can have a configurable timeout for reaping overflow workers, and instead of dismissing them on checkin, use the timer module to dismiss them after that period of time.
