Reuse of workers with low values of size #29

Closed
davidw opened this issue Jun 11, 2013 · 11 comments

davidw (Contributor) commented Jun 11, 2013

I'm using Chicago Boss, and got curious about how I could make its initial memory usage a little bit lower. One obvious thing would be to start the database connection pool off with one worker, rather than five, which is probably fine, since this is a semi-embedded system where we won't normally have concurrent users. max_overflow was set to 20. Out of curiosity, I ran some simple benchmarks with "ab -c 10 -n 1000" (1000 requests, running 10 at a time), and noticed that dropping the size below 10 slows things down.

After some digging, I realized that poolboy is constantly tearing down and creating new connections. Something about this does not seem quite right: if I'm getting a lot of requests right now, I can expect to keep getting them for the immediate future, so I don't want to shut them down immediately.

I'm not exactly sure how things ought to work, but I'd expect that workers would go away slowly, rather than being reaped quickly.

devinus (Owner) commented Jun 11, 2013

@davidw Workers should only be "churning" like that if their processes are crashing.

davidw (Contributor, Author) commented Jun 11, 2013

Hi,

Here's what I did to dig into what seems to be going on. Perhaps there is a mistake, or a problem at some other level of the system.

From boss_db_sup.erl, which starts up poolboy:

init(StartArgs) ->
    Args = [{name, {local, boss_db_pool}},
        {worker_module, boss_db_controller},
        {size, 1}, {max_overflow, 20}|StartArgs],
    PoolSpec = {db_controller, {poolboy, start_link, [Args]}, permanent, 2000, worker, [poolboy]},
    {ok, {{one_for_one, 10, 10}, [PoolSpec]}}.

A little patch to show when the workers churn:

 new_worker(Sup) ->
+    io:format("new_worker~n"),
     {ok, Pid} = supervisor:start_child(Sup, []),
     true = link(Pid),
     Pid.

 new_worker(Sup, FromPid) ->
+    io:format("new_worker~n"),
     Pid = new_worker(Sup),
     Ref = erlang:monitor(process, FromPid),
     {Pid, Ref}.

 dismiss_worker(Sup, Pid) ->
+    io:format("dismiss_worker~n"),
     true = unlink(Pid),
     supervisor:terminate_child(Sup, Pid).

With that in place, I did:

ab -c 10 -n 1000 http://localhost:8001/

And I get a lot of:

new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
new_worker
new_worker
dismiss_worker
dismiss_worker
new_worker
new_worker
dismiss_worker

Now, if I change the size parameter to 10 and rerun 'ab', I don't get any churn, and the requests are naturally faster too.

So, that's what I can see... For what it's worth, this is using epgsql under the hood, managed by boss_db.

I don't see any evidence of crashes, and I would think that if there were something crashing, it would not depend on the size parameter.

devinus (Owner) commented Jun 11, 2013

@davidw Ah, this is because the size of your pool is 1, with a max_overflow of 20. This means that your pool is only keeping 1 worker on hand. A max_overflow of 20 means that instead of queueing requests and waiting for your one available worker to be checked back in, the pool will create up to 20 new workers to handle increased workload, but then dismiss them when they're checked back in. That's where the churn is.
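
To make the churn concrete, here is a rough illustration using poolboy:checkout/1 and poolboy:checkin/2, assuming the boss_db_pool from the init/1 above (size 1, max_overflow 20). It is a sketch of the behaviour described in this thread, not code from poolboy or boss_db:

%% With size = 1, a second concurrent checkout has to spawn an overflow
%% worker; checking it back in (with nobody waiting) dismisses it again.
W1 = poolboy:checkout(boss_db_pool),   %% the single permanent worker
W2 = poolboy:checkout(boss_db_pool),   %% overflow: a new worker is started
poolboy:checkin(boss_db_pool, W2),     %% the overflow worker is dismissed
poolboy:checkin(boss_db_pool, W1).     %% the permanent worker stays in the pool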

davidw (Contributor, Author) commented Jun 11, 2013

Aha... OK, so that explains things. Now, is there a way to get something along the lines of how I thought it worked without radically changing things? For instance, the overflow workers could go away a little bit at a time, or there could be params like start_size / max_workers. Here's my use case: this is all going to go on a semi-embedded system where it's going to be pretty normal to have only one user at a time, plus maybe a system process or two accessing the DB. It'd be nice to start with just one worker and let things ramp up if more are needed. It's not that big a deal either way; it just struck me as something that would help us contain memory usage. Obviously the problem isn't the extra worker itself, but the DB child process it engages.

Vagabond (Collaborator) commented:

Actually, on checkin, if there is work pending (as in you're using nonblocking checkouts), poolboy will reuse the worker instead of killing it. Other than that, yes, they will get reaped pretty quickly.

devinus closed this as completed Jun 13, 2013

davidw (Contributor, Author) commented Jun 13, 2013

Hi,

Can you add some explanation rather than simply closing?

I opened it because it's a genuine problem: poolboy, as it stands, scales poorly because you have to guess at the "correct" value for size, and if you guess wrong, either 1) the system over-allocates resources up front, or 2) it churns once demand exceeds size.

Adaptive scaling is pretty common - the Apache web server, for instance, has configuration parameters like: MaxSpareServers, MinSpareServers, StartServers, and MaxClients.

The computer knows how many resources it needs at a given point in time - as long as it's under a certain maximum limit, it should be free to allocate those resources and then gradually eliminate them as the need lessens.

devinus (Owner) commented Jun 13, 2013

@davidw The issue is closed because it's not an issue. However, GitHub still allows discussion here, as you can see. Poolboy, as it stands, scales just fine. I've had it scaling to hundreds of workers at thousands of requests a second (see: https://github.com/devinus/poolboy-benchmark). It doesn't scale any further than what you configure it to scale to.

You have two important options:

  • size: The initial and maximum size of the pool. Checking a worker out never grows the pool beyond this. As long as your max_overflow is 0 and your workers aren't dying before they're checked back into the pool, there will be no "churn".
  • max_overflow: The number of temporary workers the pool is allowed to create when demand exceeds size; these are dismissed again when they're checked back in.

If you don't want "churn", then don't allow overflow. If you don't want a lot of workers because of memory constraints, then don't allow a large pool size -- your requests will be queued to the pool with a configurable timeout.
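
For illustration, here is a sketch of the two setups being contrasted, reusing the boss_db_pool arguments from the init/1 earlier in the thread. The numbers are placeholders, not recommendations, and the three-argument poolboy:checkout/3 (pool, block flag, timeout in milliseconds) is from poolboy's current API, so it may differ in the version in use here:

%% (a) No overflow: a fixed pool of 10 workers. Checkouts beyond 10 queue
%%     until a worker is checked back in, or give up after the timeout.
NoOverflow = [{name, {local, boss_db_pool}},
              {worker_module, boss_db_controller},
              {size, 10}, {max_overflow, 0}],

%% (b) Small pool plus overflow: 1 permanent worker and up to 20 temporary
%%     ones, which are dismissed again on checkin (the churn above).
WithOverflow = [{name, {local, boss_db_pool}},
                {worker_module, boss_db_controller},
                {size, 1}, {max_overflow, 20}],

%% A blocking checkout with an explicit timeout (milliseconds):
Worker = poolboy:checkout(boss_db_pool, true, 5000).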

As with everything involving pools and caching, it's up to you -- the user -- to profile and figure out the best configuration yourself.

davidw (Contributor, Author) commented Jun 14, 2013

I don't have a crystal ball.

I think this is the crux of the problem: nobody does, so it's best to let the software adapt to its environment, within certain constraints.

On a more practical and less hand-wavy level, what if the system were modified to start a timer instead of reaping workers immediately? When the timer goes off, if usage levels are still high, don't kill anything; if they've receded, we can start knocking off the spawned workers. This would let us both start with a small size and lessen churn in case of a heavy usage spike.
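
Purely as a hypothetical sketch of that idea (none of these module or function names exist in poolboy), the pool could keep an idle overflow worker around for a grace period after checkin and only dismiss it if it hasn't been checked out again by the time the timer fires:

%% Hypothetical sketch, not poolboy code: a grace period before an idle
%% overflow worker is reaped.
-module(lazy_reap).
-export([checkin/2, checkout/1]).

-define(GRACE_MS, 5000).

%% On checkin, don't dismiss the overflow worker; schedule a possible reap
%% and remember the timer alongside the idle worker.
checkin(Idle, WorkerPid) ->
    TRef = erlang:send_after(?GRACE_MS, self(), {reap, WorkerPid}),
    [{WorkerPid, TRef} | Idle].

%% On checkout, reuse the most recently idled worker and cancel its pending
%% reap; with no idle workers, the caller would spawn a new one as today.
checkout([{WorkerPid, TRef} | Rest]) ->
    erlang:cancel_timer(TRef),
    {WorkerPid, Rest};
checkout([]) ->
    none.

%% The owning process would handle {reap, WorkerPid} by dismissing the
%% worker only if it is still sitting in the idle list at that point.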

devinus (Owner) commented Jun 14, 2013

@davidw Here's the thing: if you give your pool a modest size and no max_overflow, there will never be any worker churn. The workers won't be killed, and requests to check out from your pool will queue until a worker is available.

evanmiller commented:

Would it be possible to configure the timeout before unused workers are killed? That way we could minimize churn while still having overflow connections. For our application, workers are expensive to create, and it makes sense to hold on to them for a bit in case they're needed again in the next few seconds. But they also consume resources, so it makes sense to reap them after a certain amount of time; max_overflow = 0 is therefore not a good solution for us.

devinus (Owner) commented Jun 16, 2013

@davidw @evanmiller Hrm, that's an interesting idea. Let's open a new ticket? I'd accept a patch with tests. Maybe we can have a configurable timeout for reaping overflow workers, and instead of dismissing them on checkin, use the timer module to dismiss them after that period of time.
