Add Cdo::UnicornListener middleware #18961

Merged
merged 7 commits into staging from unicorn-listener on Nov 9, 2017

Conversation


@wjordan wjordan commented Nov 7, 2017

Cdo::UnicornListener instruments a Rack application to collect the number of currently-executing requests across all Unicorn worker-processes, and reports high-resolution (1-second interval) metrics to CloudWatch for alerting/monitoring.

The following metrics are collected and reported to the 'Unicorn' namespace:

  • active - the number of active TCP/socket connections (obtained via raindrops by monitoring /proc/net/unix for Unix sockets or inet_diag for TCP sockets; see the polling sketch after this list)
  • queued - the number of queued TCP/socket requests
  • calling - the maximum number of currently-executing requests at any point during the interval (obtained via raindrops by request-tracking middleware using an atomic counter shared across processes/threads). This metric is similar to active except it continuously tracks the maximum rather than polling, so it should be more accurate for tracking brief activity spikes.
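
As a rough illustration of how the active and queued numbers above are obtained, here's how raindrops' listener statistics can be polled directly (the listener address is an assumption, not the actual Unicorn configuration):

```ruby
require 'raindrops'
require 'raindrops/linux' # /proc/net/unix and inet_diag based listener stats

# Hypothetical listener path; substitute whatever socket/port Unicorn binds to.
UNICORN_SOCKET = '/tmp/unicorn.sock'.freeze

stats = Raindrops::Linux.unix_listener_stats([UNICORN_SOCKET])[UNICORN_SOCKET]
puts "active: #{stats.active}, queued: #{stats.queued}"

# TCP listeners are polled the same way via inet_diag:
# Raindrops::Linux.tcp_listener_stats(['0.0.0.0:8080'])
```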

These metrics can be interpreted as follows:

  • if queued is non-zero, there aren't enough open worker-processes to immediately serve all incoming requests.
  • calling / CDO.dashboard_workers gives the 'worker-process utilization' for a frontend instance. If calling reaches the number of workers, requests will either queue or time out with 5xx errors. Track the average and/or maximum of calling in CloudWatch to measure the distribution of requests across frontend instances (a worked example follows this list).
  • active should be similar to (if slightly lower than) calling; if they are not similar, something else might be going on (a bug in the metric collection, perhaps).
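
A hypothetical worked example of the utilization interpretation above (the worker count is illustrative, not the real CDO.dashboard_workers value):

```ruby
workers     = 8                           # e.g. CDO.dashboard_workers on one frontend
calling_max = 8                           # maximum of 'calling' over the interval
utilization = calling_max.to_f / workers  # => 1.0

# At 1.0 (100%) every worker-process is busy: additional requests either show up
# in 'queued' or eventually fail with 5xx timeouts, so alerting on sustained high
# utilization together with non-zero 'queued' catches saturation early.
```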

The listener leverages code from raindrops:

raindrops is a real-time stats toolkit to show statistics for Rack HTTP servers. It is designed for preforking servers such as unicorn, but should support any Rack HTTP server on platforms supporting POSIX shared memory. It may also be used as a generic scoreboard for sharing atomic counters across multiple processes.
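
As a small illustration of the "generic scoreboard" usage described above (not code from this PR): a Raindrops instance is a fixed-size set of counters in POSIX shared memory, so processes forked after it is created all increment the same slots atomically.

```ruby
require 'raindrops'

counter = Raindrops.new(1) # one shared slot, visible to forked children

3.times do
  fork do
    counter.incr(0) # atomic increment shared across processes/threads
    exit!
  end
end
Process.waitall

puts counter[0] # => 3
```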

For reporting to CloudWatch, the listener uses a Cdo::Metrics class that encapsulates batching/asynchronous-sending logic for interacting with an ::Aws::CloudWatch::Client object, with a simple Cdo::Metrics.push(namespace, metrics) interface. (This duplicates/extracts logic originally written for mysql-metrics (#12140); it should be possible to reuse this class in that script in a future PR.)
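
A rough sketch of what that push(namespace, metrics) shape could look like (the module name and structure are illustrative, not the actual Cdo::Metrics class; only the Aws::CloudWatch::Client calls are real API):

```ruby
require 'aws-sdk-cloudwatch' # or 'aws-sdk', depending on SDK version

# Illustrative stand-in for Cdo::Metrics; asynchronous sending omitted here.
module ExampleMetrics
  CLIENT = Aws::CloudWatch::Client.new # region/credentials from the environment
  BATCH_SIZE = 20 # current per-request limit for put_metric_data

  # Batch MetricDatum hashes and send them to the given CloudWatch namespace.
  def self.push(namespace, metrics)
    metrics.each_slice(BATCH_SIZE) do |batch|
      CLIENT.put_metric_data(namespace: namespace, metric_data: batch)
    end
  end
end

ExampleMetrics.push('Unicorn', [{
  metric_name: 'calling',
  value: 3,
  unit: 'Count',
  timestamp: Time.now,
  storage_resolution: 1 # 1-second (high-resolution) datapoint; needs the newer aws-sdk
}])
```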

Also upgraded some rubygems in Gemfile.lock:

  • aws-sdk to support the storage_resolution high-resolution metrics parameter in the CloudWatch client
  • raindrops to support Rack 2.x

UnicornListener instruments a Rack application to collect the number of
currently-executing requests across all Unicorn worker-processes,
and reports metrics to CloudWatch.
Unicorn is not usually used in :development, and we also don't care to
report CloudWatch metrics from developer workstations.
```ruby
class StatsWithMax < Raindrops::Middleware::Stats
  def initialize
    super
    @raindrops.size = 3
```

Make the 3 and the 2s constants and not magic numbers?
Also, I don't understand what's going on here... is @raindrops a base-class thing where @raindrops[0] and [1] are already in use, and we are expanding one more slot for our use? Some more comments would be helpful.

@wjordan wjordan Nov 8, 2017


Yeah, this is specific to the class this one extends (Raindrops::Middleware::Stats), which initializes :calling and :writing as Raindrops counters (backed by the @raindrops instance variable). Here I'm increasing the size by 1 to introduce a third counter. I can add more comments / make less use of magic integers in a future commit.

@wjordan (Contributor Author)

I managed to simplify the implementation in a way that doesn't require interacting with this internal instance-variable or using any index constants.
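
Not necessarily what this commit does, but one way to keep an extra shared counter without touching the parent's @raindrops array or index constants is to give the subclass its own single-slot Raindrops (a sketch, assuming the parent exposes incr_calling/calling accessors for its counters):

```ruby
require 'raindrops'
require 'raindrops/middleware'

# Hypothetical sketch only; the actual simplification in this PR may differ.
class StatsWithMax < Raindrops::Middleware::Stats
  def initialize
    super                   # parent sets up the :calling/:writing counters
    @max = Raindrops.new(1) # our own shared slot, no magic indexes
  end

  def incr_calling
    super
    current = calling
    # Approximate (non-atomic) max update, shared across worker processes.
    @max[0] = current if current > @max[0]
  end

  # Read the per-interval maximum and reset it for the next interval.
  def max_calling!
    value = @max[0]
    @max[0] = 0
    value
  end
end
```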

@jeremydstone

LGTM modulo the comment. Things to think about (where I don't know if I can come up with the answers just from the code here) are some paranoid questions: is there anything that could be intrusive from turning this on? Is there any condition like running out of memory from tracking metrics, or spending too much time sending metrics? What happens if the CloudWatch API goes down or starts failing - do we just go on about our business or tip over in cascading failure? Do we want to do something like enable this on one FE first in prod to try it out? As long as you've thought through these and any other paranoid questions, looks good to me.


wjordan commented Nov 7, 2017

LGTM modulo the comment. Things to think about (where I don't know if I can come up with the answers just from the code here) are some paranoid questions:

is there anything that could be intrusive from turning this on?

This PR will send 3 per-second CloudWatch metrics from every (non-development) host running a Dashboard server. There's a cost associated with custom metrics (ranging from $0.02-$0.30 per metric/month depending on volume). This shouldn't be too intrusive (the only concern would be a large number of adhoc instances; we can disable sending these metrics from that environment if it becomes an issue).

Is there any condition like running out of memory from tracking metrics [...]

  • The metrics-collection array is emptied after every report is queued, so it should never contain more than 60 metrics with current settings.

  • Even if the collection wasn't flushed correctly due to some error, the amount of memory required by a single MetricDatum is not very large - at the rate of a few metrics per second, it would take a long time for any memory impact to be meaningful/noticeable.

[...], or spending too much time sending metrics?

  • The AWS-client timeouts are set relatively low (5 seconds/3 retries, so max 15 seconds total before timing out), with any errors caught/notified via Honeybadger.

  • The put_metric_data AWS call is asynchronously queued on a separate thread, so even the timeout won't block any calling thread from continuing execution.

  • Metrics are batched together in sets of 20 (the current AWS limit per request) to reduce the number of API calls by up to 20x. (A rough sketch of this reporting path follows.)
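
A rough sketch of the reporting path these bullets describe (the helper is illustrative, not the actual listener code; the timeout/retry numbers mirror the ones above):

```ruby
require 'aws-sdk-cloudwatch' # or 'aws-sdk', depending on SDK version
require 'honeybadger'

# Low client timeouts so a CloudWatch outage gives up quickly instead of piling up.
client = Aws::CloudWatch::Client.new(
  http_open_timeout: 5,
  http_read_timeout: 5,
  retry_limit: 3
)

# Queue the put_metric_data call on a detached thread: the calling (request)
# thread never waits on CloudWatch, and failures are reported, never raised.
def report_async(client, namespace, batch)
  Thread.new do
    begin
      client.put_metric_data(namespace: namespace, metric_data: batch)
    rescue StandardError => e
      Honeybadger.notify(e)
    end
  end
end
```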

What happens if the CloudWatch API goes down or starts failing - do we just go on about our business or tip over in cascading failure?

The metrics-reporting thread will begin generating exceptions which are caught/reported as Honeybadger errors. Since the asynchronous reporting thread is detached from any main request-processing threads, there's no risk of cascading failure there.

Do we want to do something like enable this on one FE first in prod to try it out?

I plan to do this before merging. I've also run it on an adhoc instance for manual testing.

remove magic-index numbers and direct interaction with `@raindrops`.

wjordan commented Nov 8, 2017

Added unit-test coverage and cleaned up the implementation a bit; now I'm going to manually test on a single frontend before merging.


wjordan commented Nov 8, 2017

All finished manually testing and incoming metrics look good; will merge after CI tests pass.

@wjordan wjordan merged commit 9c7ff98 into staging Nov 9, 2017
@wjordan wjordan deleted the unicorn-listener branch November 9, 2017 05:34