High CPU usage due to millisecond polling #85

Closed
psobot opened this Issue · 9 comments

5 participants

@psobot

Not sure if this is a bug or a feature, but I've noticed that @satterly's change to make gmond's TCP accept channel listen in a separate thread (74cee73) appears to poll the listener socket every millisecond. This value is hardcoded to 1000 (µs, not ms):

apr_interval_time_t wait = 1000;

debug_msg("Starting TCP listener thread...");
for(;!done;)
  {
    if(!deaf)
      {
        now = apr_time_now();
        /* Pull in incoming data */
        poll_tcp_listen_channels(wait, now);
      }
    else
      {
        apr_sleep( wait );
      }

  }

(https://github.com/ganglia/monitor-core/blob/master/gmond/gmond.c#L3184)
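For reference, and assuming I'm reading apr_time.h correctly, apr_interval_time_t is always expressed in microseconds, so the two values in question work out as:

    apr_interval_time_t one_ms  = 1000;              /* current hardcoded value: 1 ms */
    apr_interval_time_t one_sec = APR_USEC_PER_SEC;  /* 1 s = 1000000 µs (apr_time.h) */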

Now, this could be the intended behaviour; however, it makes each gmond instance use between 2% and 5% CPU at all times on my test machines. Running multiple instances of gmond on one machine (for monitoring multiple clusters, for example) becomes very expensive after this change. Is this intentional?

@jbuchbinder
Owner

It looks as though it only hits the apr_sleep(wait) call if it's "deaf". If so, it would probably be better done as:

debug_msg("Starting TCP listener thread...");
if(!deaf)
  {
    for(;!done;)
      {
        now = apr_time_now();
        /* Pull in incoming data */
        poll_tcp_listen_channels(wait, now);
        apr_sleep( wait );
      }
  }
else
  {
    /* Do nothing or something else here */
  }
@psobot

poll_tcp_listen_channels(wait, now) actually times out after wait microseconds, does it not? Adding an apr_sleep(wait) call would just double the interval, which would cut the CPU usage in half. I'd have expected the timeout to just be set to something more sensible, like 1 second.
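Something along these lines (untested sketch, just changing the hardcoded constant) is what I had in mind:

    /* Untested sketch: poll_tcp_listen_channels() already blocks for up to
       'wait' microseconds, so raising the constant alone cuts the wakeups
       from roughly one per millisecond to one per second. */
    apr_interval_time_t wait = APR_USEC_PER_SEC;  /* 1 s instead of 1 ms */

    debug_msg("Starting TCP listener thread...");
    for(;!done;)
      {
        if(!deaf)
          {
            now = apr_time_now();
            /* Pull in incoming data; returns after at most 'wait' µs when idle */
            poll_tcp_listen_channels(wait, now);
          }
        else
          {
            apr_sleep( wait );
          }
      }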

@satterly
Collaborator

I think this was meant to be a 1 second timeout, but I may have misread the API docs for apr_interval_time_t, as @psobot suggests. Would 100ms be a more sensible value, or should we go for the full 1 second?

Alternatively, the TCP listener thread could be skipped entirely if the gmond is configured to be deaf.

  /* Create TCP listener thread */
  if(!deaf)
    {
      apr_thread_t *thread;
      if (apr_thread_create(&thread, NULL, tcp_listener, NULL, global_context) != APR_SUCCESS)
        {
          err_msg("Failed to create TCP listener thread. Exiting.\n");
          exit(1);
        }
    }
@psobot

Forgive my ignorance, but why is this function polling in the first place? It's waiting on a TCP connection and re-checking every n µs to see if one has arrived... but if it's in a separate thread, couldn't that thread just block until a connection arrives? (The APR docs suggest that a negative timeout will block indefinitely, which would work in this case.)

@satterly
Collaborator

If the TCP thread blocks until a connection arrives then the TCP thread will prevent the gmond from reloading its config due to a SIGHUP or exiting due to any other signal interrupt. I think we should increase the timeout to 500ms or so and not even start the TCP thread if the gmond is in deaf mode.
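Roughly what I have in mind (sketch only, assuming the signal handler sets done, as the existing for(;!done;) loop implies):

    /* Sketch: a finite timeout means !done is re-checked on every pass, so a
       SIGHUP or shutdown is noticed within at most one timeout interval. */
    apr_interval_time_t wait = 500 * 1000;   /* 500 ms, in microseconds */

    for(;!done;)
      {
        now = apr_time_now();
        poll_tcp_listen_channels(wait, now);  /* returns after <= 500 ms even when idle */
      }

With the thread only created when the gmond is not deaf, the deaf branch and its apr_sleep() drop out of the loop entirely.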

@vvuksan
Owner

I agree on not starting the TCP thread if gmond is in deaf mode. 500ms may be too large; I'd go for, say, 100ms. @psobot, can you test with a value of 100ms and see if CPU utilization is greatly diminished? Thank you.

@psobot

@satterly - Good point.

@vvuksan - Quick testing results on my laptop, as reported by ps:

Timeout: 1000    (1ms)    => 1.2% CPU
Timeout: 100000  (100ms)  => 0.1% CPU
Timeout: 500000  (500ms)  => 0.0% CPU
@cburroughs

Is there consensus here?

@vvuksan
Owner

I'd go for 100ms.

@satterly closed this issue from a commit in satterly/monitor-core
@satterly: Reduce CPU utilisation of gmond running in deaf mode
TCP thread should only poll every 100ms and should not start
at all if the gmond is configured to be deaf. Fixes #85.
c65f933
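For reference, a rough sketch of the agreed approach (not the literal contents of c65f933): only create the listener thread when the gmond is not deaf, and poll with a 100 ms timeout.

    /* Sketch of the agreed approach; see commit c65f933 for the actual change. */

    /* Only create the TCP listener thread when not deaf: */
    if(!deaf)
      {
        apr_thread_t *thread;
        if (apr_thread_create(&thread, NULL, tcp_listener, NULL, global_context) != APR_SUCCESS)
          {
            err_msg("Failed to create TCP listener thread. Exiting.\n");
            exit(1);
          }
      }

    /* ...and inside tcp_listener(), poll every 100 ms instead of every 1 ms: */
    apr_interval_time_t wait = 100 * 1000;   /* 100 ms, in microseconds */

    debug_msg("Starting TCP listener thread...");
    for(;!done;)
      {
        now = apr_time_now();
        poll_tcp_listen_channels(wait, now);
      }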