Only log '[num] workers' message when it changes. #1078

Merged
merged 3 commits into benoitc:master on Aug 19, 2015


@erydo
Contributor
erydo commented Jul 12, 2015

Otherwise when debug logging is on, the message prints every second even with no system activity, drowning out more useful log messages:

[2015-07-12 17:39:25 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:26 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:27 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:28 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:29 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:30 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:31 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:32 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:33 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:34 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:35 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:36 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:37 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:38 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:39 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:40 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:41 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:42 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:43 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:45 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:46 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:47 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:48 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:49 +0000] [19970] [DEBUG] 3 workers
[2015-07-12 17:39:50 +0000] [19970] [DEBUG] 3 workers
@benoitc benoitc and 2 others commented on an outdated diff Jul 12, 2015
gunicorn/arbiter.py
@@ -473,6 +473,8 @@ def manage_workers(self):
Maintain the number of workers by spawning or killing
as required.
"""
+ orig_num_workers = self.num_workers
+
if len(self.WORKERS.keys()) < self.num_workers:
@benoitc
benoitc Jul 12, 2015 Owner

i would rather tag here if a change has been done and test against it instead of comparing numbers. Thoughts?

@erydo
erydo Jul 12, 2015 Contributor

In my experience, a mutable has_changed-type flag tends to open up a class of errors where the flag does not get set appropriately after a modification, particularly when the modification happens in a separate method call (can kill_worker/spawn_workers modify num_workers? Pretty sure. Do they always? I don't know). That kind of bug crops up more during future code changes and maintenance.

It also makes the if-statement harder to reason about: does this flag represent that the value being printed out changed? That it could have changed? Should have? If I see multiple "3 workers" messages in sequence, is that expected or a bug?

This is a short method, but those are the reasons I tend to prefer the state-comparison approach when it's not a computationally difficult one.

@erydo
erydo Jul 12, 2015 Contributor

More concretely, as someone unfamiliar with the codebase, I didn't know if spawn_workers would necessarily change that number, or could be capped by memory or some other configuration value.

It could be expected to try to spawn each time but not necessarily succeed at increasing that number; I'd have to look deeper to see if that was a (long-term) guarantee.

(I hope that makes it clear where my head was at here)

@tilgovi
tilgovi Jul 13, 2015 Collaborator

I think your reasoning is sound but if the flag is localized to this function then it's pretty fail proof.

At the top of the function, just set count_changed = False.
In each of the loops you can set count_changed = True.
At the bottom of the function you can check count_changed.
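
For illustration, a minimal runnable sketch of that localized-flag approach (a toy class with hypothetical names, not gunicorn's actual arbiter; WORKERS and logging are simplified to plain attributes):

```python
class ArbiterSketch:
    """Toy stand-in for an arbiter, just enough to show the count_changed flag."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.WORKERS = {}       # pid -> worker placeholder
        self._next_pid = 1
        self.log_lines = []     # captured debug output

    def spawn_worker(self):
        self.WORKERS[self._next_pid] = object()
        self._next_pid += 1

    def kill_worker(self, pid):
        self.WORKERS.pop(pid)

    def manage_workers(self):
        # Flag is local to this call: set whenever we attempt a change.
        count_changed = False
        while len(self.WORKERS) < self.num_workers:
            self.spawn_worker()
            count_changed = True
        while len(self.WORKERS) > self.num_workers:
            self.kill_worker(min(self.WORKERS))  # oldest pid first
            count_changed = True
        # Only log when this call actually did something.
        if count_changed:
            self.log_lines.append("{0} workers".format(len(self.WORKERS)))


a = ArbiterSketch(num_workers=3)
a.manage_workers()   # spawns 3, logs once
a.manage_workers()   # nothing to do, stays quiet
print(a.log_lines)   # ['3 workers']
```

Because the flag lives entirely inside one call, nothing outside the function can forget to set it, which is the "pretty fail proof" property tilgovi describes.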

@tilgovi
tilgovi Jul 13, 2015 Collaborator

I actually think I slightly prefer what I just described for exactly the reasons you say. While it should be clear from the code that self.num_workers must change or the loops would not terminate, it's perhaps clearest to just set a flag when we're attempting to change the number.

It's conceivable that self.num_workers could change between the loops so that it is first increased and then decreased, so if the edge case of logging the same number twice bothers us then your way is indeed better.

@tilgovi
tilgovi Jul 13, 2015 Collaborator

Oops, sorry. len(workers) must change :). I wrote too fast.

@tilgovi
tilgovi Jul 13, 2015 Collaborator

Thinking more on this, I think the approach you took is right, but I would not use self.num_workers. That is unlikely to change during this function (unless there is a concurrent signal). What is likely to change is len(self.WORKERS).

@tilgovi
tilgovi Jul 13, 2015 Collaborator

So I think the right version of this patch that cannot ever say the same number twice is this:

original_length = len(self.WORKERS)
# ...
if len(self.WORKERS) != original_length:
    # ...
@benoitc
benoitc Jul 13, 2015 Owner

We are indeed already testing if len(self.WORKERS.keys()) < self.num_workers. The goal of this function is to make sure that this number is always at least equal to self.num_workers. Why not move the debug log to the end of the if block, and add it in the different places where we detect that a worker exits?

@erydo
erydo Jul 13, 2015 Contributor

Oops, you're right, len(self.WORKERS) is the right initial value.

On the guard comparison, the value being printed out is len(workers) instead of len(self.WORKERS). It looks like the reason for this is that kill_worker will only synchronously remove a worker from self.WORKERS if the worker is already dead (raising errno.ESRCH); otherwise it gets popped asynchronously in reap_workers. That behavior could be modified (maybe it should be?), but otherwise I think this should be:

orig_worker_count = len(self.WORKERS)
# ...
if len(workers) != orig_worker_count:
    self.log.debug("{0} workers".format(len(workers)),
    # ...
@tilgovi
tilgovi Jul 13, 2015 Collaborator

You're correct, although it makes the code more confusing to check len(self.WORKERS) at the top and then len(workers) at the bottom.

I suspect we can't change the behavior of kill_worker because we need to reap them. We could have a separate dead worker list to reap from, though.

@tilgovi
tilgovi Jul 13, 2015 Collaborator

Oh, wait, I'm wrong. The waitpid gives us the worker pid and we pop it with a None argument in line 463 so I think it might be harmless to remove it synchronously.

@erydo
erydo Jul 13, 2015 Contributor

Yeah—I believe it can be removed synchronously as long as worker.tmp.close() is called. I wasn't sure if there was another reason why they might be left around for later reaping. (For example, reaping does not call self.cfg.worker_exit(self, worker) but trying to kill a dead worker does; that might also be a bug?)

@tilgovi
tilgovi Jul 13, 2015 Collaborator

What if kill_worker moved the worker from self.WORKERS to a dead worker list? I don't think the ESRCH block is necessary because reap_workers should still get the SIGCHLD.

Then the manage_workers becomes three steps:

@tilgovi
tilgovi Jul 13, 2015 Collaborator

Scratch that last part, not three steps. manage_workers can stay the same, I think.

@tilgovi
tilgovi Jul 13, 2015 Collaborator

@erydo maybe let's make a PR to clean this dance up a bit and then let's rebase this. If you're up for it.

@erydo
erydo Jul 13, 2015 Contributor

Sure thing, I'll give it a stab.

@erydo
erydo Jul 13, 2015 Contributor

See #1084.

@erydo
Contributor
erydo commented Jul 13, 2015

This PR is on hold until #1084 is completed.

@erydo erydo Only log '[num] workers' message when it changes.
Otherwise when debug logging is on, the message prints every second even
with no system activity.
09357ed
@erydo
Contributor
erydo commented Jul 19, 2015

(Rebasing against current master)

@erydo erydo Store last logged worker count.
This is easier and safer than only logging when we detect that
self.WORKERS has changed or that `spawn_worker` or `kill_worker` has
been called.
6f6ec92
@erydo
Contributor
erydo commented Jul 19, 2015

@tilgovi @benoitc — I've updated this to no longer rely on synchronicity/asynchronicity of spawn_worker, kill_worker, or self.WORKERS. It just stores the last number that was logged and checks against that.

This was possibly the right approach to begin with (no error conditions!) and decouples it from #1084, which might take some more time to merge in.
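
For illustration, the idea behind this final approach can be sketched as follows (a toy class with hypothetical names, not the actual gunicorn code):

```python
class LastLoggedSketch:
    """Toy arbiter showing the store-the-last-logged-count approach."""

    def __init__(self):
        self.WORKERS = {}   # pid -> worker placeholder
        self._last_logged_active_worker_count = None
        self.log_lines = []  # captured debug output

    def maybe_log_worker_count(self):
        # Log only when the count differs from the last value we logged,
        # regardless of how or where WORKERS was modified in the meantime.
        active = len(self.WORKERS)
        if active != self._last_logged_active_worker_count:
            self.log_lines.append("{0} workers".format(active))
            self._last_logged_active_worker_count = active


arb = LastLoggedSketch()
arb.maybe_log_worker_count()   # first call always logs
arb.maybe_log_worker_count()   # no change, no log
arb.WORKERS[1] = object()
arb.WORKERS[2] = object()
arb.maybe_log_worker_count()   # count changed, logs again
print(arb.log_lines)           # ['0 workers', '2 workers']
```

The comparison is against the last logged value rather than any notion of "a change happened", so it stays correct even if spawn/kill paths are refactored later.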

@benoitc
Owner
benoitc commented Jul 23, 2015

Quite like the idea :) While we are here, maybe we could also log only where the number of workers really changes: in spawn_workers and when we reap them. Something like the patch below. I didn't test it, though. Thoughts?

diff --git a/gunicorn/arbiter.py b/gunicorn/arbiter.py
index b7ee05d..85fa5ea 100644
--- a/gunicorn/arbiter.py
+++ b/gunicorn/arbiter.py
@@ -55,6 +55,7 @@ class Arbiter(object):
         os.environ["SERVER_SOFTWARE"] = SERVER_SOFTWARE

         self._num_workers = None
+        self.worker_count = None
         self.setup(app)

         self.pidfile = None
@@ -464,6 +465,7 @@ class Arbiter(object):
                     if not worker:
                         continue
                     worker.tmp.close()
+                    self._log_numworkers()
         except OSError as e:
             if e.errno != errno.ECHILD:
                 raise
@@ -475,6 +477,7 @@ class Arbiter(object):
         """
         if len(self.WORKERS.keys()) < self.num_workers:
             self.spawn_workers()
+            self._log_numworkers()

         workers = self.WORKERS.items()
         workers = sorted(workers, key=lambda w: w[1].age)
@@ -482,10 +485,6 @@ class Arbiter(object):
             (pid, _) = workers.pop(0)
             self.kill_worker(pid, signal.SIGTERM)

-        self.log.debug("{0} workers".format(len(workers)),
-                       extra={"metric": "gunicorn.workers",
-                              "value": len(workers),
-                              "mtype": "gauge"})

     def spawn_worker(self):
         self.worker_age += 1
@@ -563,8 +562,20 @@ class Arbiter(object):
                 try:
                     worker = self.WORKERS.pop(pid)
                     worker.tmp.close()
+                    self._log_numworkers()
                     self.cfg.worker_exit(self, worker)
                     return
                 except (KeyError, OSError):
                     return
             raise
+
+    def _log_numworkers(self):
+        worker_count = len(self.WORKERS)
+        if self.worker_count != worker_count:
+            self.log.debug("{0} workers".format(worker_count),
+                           extra={"metric": "gunicorn.workers",
+                                  "value": worker_count,
+                                  "mtype": "gauge"})
+            self.worker_count = worker_count
+
@benoitc benoitc and 1 other commented on an outdated diff Jul 23, 2015
gunicorn/arbiter.py
@@ -51,6 +51,8 @@ class Arbiter(object):
if name[:3] == "SIG" and name[3] != "_"
)
+ last_logged_worker_count = None
@benoitc
benoitc Jul 23, 2015 Owner

shouldn't it probably be an instance variable?

@erydo
erydo Jul 23, 2015 Contributor

The instance variable is created once it's assigned to for the first time. I'd be fine moving it into __init__.

@erydo
Contributor
erydo commented Jul 23, 2015

@benoitc — I would make the same argument as I made at the beginning of this PR. The patch I've proposed here has no edge cases and is resilient to refactorings instead of trying to find and maintain every place that modifies self.WORKERS.

Additionally, the patch in your comment would log every change in manage_workers which could be quite noisy if it's spawning/killing multiple workers (e.g. on startup you'd get a log message with an incrementing number for each worker). My motivation for this PR in the first place was to reduce the unnecessary noise.

@benoitc
Owner
benoitc commented Jul 23, 2015

hrm, the patch in manage_workers only logs if new workers need to be spawned, and only after they are. Did I miss something?

The point of logging there and in reap_workers is to only log when the number really did change, not before it changes on kill. I think it would be more accurate, although we can optimise the reap_workers case. That is the intention, at least :)

@erydo
Contributor
erydo commented Jul 23, 2015

Ah, you're right, I misread your patch! For some reason I misremembered spawning as being in a loop like killing is. I'd still stand by the maintenance argument, though: manage_workers gets called regularly. Logging the worker count there is the least complex, least error-prone implementation of logging the current active worker count.

If it's desirable to also log once workers finish dying, I'd suggest that that change should be made separately, and should likely be based off #1084, which consolidates that worker cleanup work.

@benoitc
Owner
benoitc commented Jul 24, 2015

All of this depends on what this metric is expected to measure: the number of active workers, or the number of workers actually alive (running). Maybe we should indeed have two metrics. I guess you're right: your patch is enough to report the number of active workers, logged as such.

I commented on the patch with one last change; then let's commit it :)

@benoitc benoitc commented on the diff Jul 24, 2015
gunicorn/arbiter.py
@@ -55,6 +55,8 @@ def __init__(self, app):
os.environ["SERVER_SOFTWARE"] = SERVER_SOFTWARE
self._num_workers = None
+ self._last_logged_active_worker_count = None
@benoitc
benoitc Jul 24, 2015 Owner

last_logged is not needed; let's just call it self._active_worker_count.

@benoitc
Owner
benoitc commented Jul 29, 2015

OK, if no one objects I will merge this PR and do the renaming right after. OK?

@tilgovi
Collaborator
tilgovi commented Jul 29, 2015

👍


@benoitc benoitc merged commit a132ca9 into benoitc:master Aug 19, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed