Add Cdo::UnicornListener middleware #18961

Merged
merged 7 commits into staging from unicorn-listener on Nov 9, 2017

Conversation


@wjordan wjordan commented Nov 7, 2017

Cdo::UnicornListener instruments a Rack application to collect the number of currently-executing requests across all Unicorn worker-processes, and reports high-resolution (1-second interval) metrics to CloudWatch for alerting/monitoring.

The following metrics are collected and reported to the 'Unicorn' namespace:

  • active - the number of active TCP/socket connections (obtained via raindrops by monitoring /proc/net/unix for Unix sockets or inet_diag for TCP sockets; see the polling sketch after this list)
  • queued - the number of queued TCP/socket requests
  • calling - the maximum number of currently-executing requests at any point during the interval (obtained via raindrops by request-tracking middleware using an atomic counter shared across processes/threads). This metric is similar to active except it continuously tracks the maximum rather than polling, so it should be more accurate for tracking brief activity spikes.
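
As a rough illustration of how the active and queued numbers above are obtained, here's how raindrops' listener statistics can be polled directly (the listener address is an assumption, not the actual Unicorn configuration):

```ruby
require 'raindrops'
require 'raindrops/linux' # /proc/net/unix and inet_diag based listener stats

# Hypothetical listener path; substitute whatever socket/port Unicorn binds to.
UNICORN_SOCKET = '/tmp/unicorn.sock'.freeze

stats = Raindrops::Linux.unix_listener_stats([UNICORN_SOCKET])[UNICORN_SOCKET]
puts "active: #{stats.active}, queued: #{stats.queued}"

# TCP listeners are polled the same way via inet_diag:
# Raindrops::Linux.tcp_listener_stats(['0.0.0.0:8080'])
```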

These metrics can be interpreted as follows:

  • if queued is non-zero, there aren't enough open worker-processes to immediately serve all incoming requests.
  • calling / CDO.dashboard_workers gives the 'worker-process utilization' for a frontend instance. If calling reaches the number of workers, requests will either queue or time out with 5xx errors. Track the average and/or maximum of calling in CloudWatch to measure the distribution of requests across frontend instances (a worked example follows this list).
  • active should be similar to (if slightly lower than) calling; if they are not similar, something else might be going on (a bug in the metric collection, perhaps).
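
A hypothetical worked example of the utilization interpretation above (the worker count is illustrative, not the real CDO.dashboard_workers value):

```ruby
workers     = 8                           # e.g. CDO.dashboard_workers on one frontend
calling_max = 8                           # maximum of 'calling' over the interval
utilization = calling_max.to_f / workers  # => 1.0

# At 1.0 (100%) every worker-process is busy: additional requests either show up
# in 'queued' or eventually fail with 5xx timeouts, so alerting on sustained high
# utilization together with non-zero 'queued' catches saturation early.
```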

The listener leverages code from raindrops:

raindrops is a real-time stats toolkit to show statistics for Rack HTTP servers. It is designed for preforking servers such as unicorn, but should support any Rack HTTP server on platforms supporting POSIX shared memory. It may also be used as a generic scoreboard for sharing atomic counters across multiple processes.
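
As a small illustration of the "generic scoreboard" usage described above (not code from this PR): a Raindrops instance is a fixed-size set of counters in POSIX shared memory, so processes forked after it is created all increment the same slots atomically.

```ruby
require 'raindrops'

counter = Raindrops.new(1) # one shared slot, visible to forked children

3.times do
  fork do
    counter.incr(0) # atomic increment shared across processes/threads
    exit!
  end
end
Process.waitall

puts counter[0] # => 3
```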

For reporting to CloudWatch, the listener uses a Cdo::Metrics class that encapsulates batching/asynchronous-sending logic for interacting with an ::Aws::CloudWatch::Client object, with a simple Cdo::Metrics.push(namespace, metrics) interface. (This duplicates/extracts logic originally written for mysql-metrics (#12140); it should be possible to reuse this class in that script in a future PR.)
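
A rough sketch of what that push(namespace, metrics) shape could look like (the module name and structure are illustrative, not the actual Cdo::Metrics class; only the Aws::CloudWatch::Client calls are real API):

```ruby
require 'aws-sdk-cloudwatch' # or 'aws-sdk', depending on SDK version

# Illustrative stand-in for Cdo::Metrics; asynchronous sending omitted here.
module ExampleMetrics
  CLIENT = Aws::CloudWatch::Client.new # region/credentials from the environment
  BATCH_SIZE = 20 # current per-request limit for put_metric_data

  # Batch MetricDatum hashes and send them to the given CloudWatch namespace.
  def self.push(namespace, metrics)
    metrics.each_slice(BATCH_SIZE) do |batch|
      CLIENT.put_metric_data(namespace: namespace, metric_data: batch)
    end
  end
end

ExampleMetrics.push('Unicorn', [{
  metric_name: 'calling',
  value: 3,
  unit: 'Count',
  timestamp: Time.now,
  storage_resolution: 1 # 1-second (high-resolution) datapoint; needs the newer aws-sdk
}])
```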

Also upgraded some rubygems in Gemfile.lock:

  • aws-sdk to support the storage_resolution high-resolution metrics parameter in the CloudWatch client
  • raindrops to support Rack 2.x

UnicornListener instruments a Rack application to collect the number of
currently-executing requests across all Unicorn worker-processes,
and reports metrics to CloudWatch.
Unicorn is not usually used in :development, and we also don't care to
report CloudWatch metrics from developer workstations.
```ruby
class StatsWithMax < Raindrops::Middleware::Stats
  def initialize
    super
    @raindrops.size = 3
```

Make the 3 and the 2s constants and not magic numbers?
Also, I don't understand what's going on here... is @raindrops a base-class thing where @raindrops[0] and [1] are already in use, and we are expanding one more slot for our use? Some more comments would be helpful.

@wjordan wjordan Nov 8, 2017


Yeah, this is specific to the class this one extends (Raindrops::Middleware::Stats), which initializes :calling and :writing as Raindrops counters (backed by the @raindrops instance variable). Here I'm increasing the size by 1 to introduce a third counter. I can add more comments / make less use of magic integers in a future commit.

@wjordan (Contributor Author)

I managed to simplify the implementation in a way that doesn't require interacting with this internal instance-variable or using any index constants.
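
Not necessarily what this commit does, but one way to keep an extra shared counter without touching the parent's @raindrops array or index constants is to give the subclass its own single-slot Raindrops (a sketch, assuming the parent exposes incr_calling/calling accessors for its counters):

```ruby
require 'raindrops'
require 'raindrops/middleware'

# Hypothetical sketch only; the actual simplification in this PR may differ.
class StatsWithMax < Raindrops::Middleware::Stats
  def initialize
    super                   # parent sets up the :calling/:writing counters
    @max = Raindrops.new(1) # our own shared slot, no magic indexes
  end

  def incr_calling
    super
    current = calling
    # Approximate (non-atomic) max update, shared across worker processes.
    @max[0] = current if current > @max[0]
  end

  # Read the per-interval maximum and reset it for the next interval.
  def max_calling!
    value = @max[0]
    @max[0] = 0
    value
  end
end
```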

@jeremydstone

LGTM modulo the comment. Things to think about (where I don't know if I can come up with the answers just from the code here) are some paranoid questions: is there anything that could be intrusive from turning this on? Is there any condition like running out of memory from tracking metrics, or spending too much time sending metrics? What happens if the CloudWatch API goes down or starts failing - do we just go on about our business or tip over in cascading failure? Do we want to do something like enable this on one FE first in prod to try it out? As long as you've thought through these and any other paranoid questions, looks good to me.


wjordan commented Nov 7, 2017

LGTM modulo the comment. Things to think about (where I don't know if I can come up with the answers just from the code here) are some paranoid questions:

is there anything that could be intrusive from turning this on?

This PR will send 3 per-second CloudWatch metrics from every (non-development) host running a Dashboard server. There's a cost associated with custom metrics (ranging from $0.02-$0.30 per metric/month depending on volume). This shouldn't be too intrusive (the only concern would be a large number of adhoc instances; we can disable sending these metrics from that environment if it becomes an issue).

Is there any condition like running out of memory from tracking metrics [...]

  • The metrics-collection array is emptied after every report is queued, so it should never contain more than 60 metrics with current settings.

  • Even if the collection wasn't flushed correctly due to some error, the amount of memory required by a single MetricDatum is not very large - at the rate of a few metrics per second, it would take a long time for any memory impact to be meaningful/noticeable.

[...], or spending too much time sending metrics?

  • The AWS-client timeouts are set relatively low (5 seconds/3 retries, so max 15 seconds total before timing out), with any errors caught/notified via Honeybadger.

  • The put_metric_data AWS call is asynchronously queued on a separate thread, so even the timeout won't block any calling thread from continuing execution.

  • Metrics are batched together in sets of 20 (the current AWS limit per request) to reduce the number of API calls by up to 20x. (A rough sketch of this reporting path follows.)
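
A rough sketch of the reporting path these bullets describe (the helper is illustrative, not the actual listener code; the timeout/retry numbers mirror the ones above):

```ruby
require 'aws-sdk-cloudwatch' # or 'aws-sdk', depending on SDK version
require 'honeybadger'

# Low client timeouts so a CloudWatch outage gives up quickly instead of piling up.
client = Aws::CloudWatch::Client.new(
  http_open_timeout: 5,
  http_read_timeout: 5,
  retry_limit: 3
)

# Queue the put_metric_data call on a detached thread: the calling (request)
# thread never waits on CloudWatch, and failures are reported, never raised.
def report_async(client, namespace, batch)
  Thread.new do
    begin
      client.put_metric_data(namespace: namespace, metric_data: batch)
    rescue StandardError => e
      Honeybadger.notify(e)
    end
  end
end
```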

What happens if the CloudWatch API goes down or starts failing - do we just go on about our business or tip over in cascading failure?

The metrics-reporting thread will begin generating exceptions which are caught/reported as Honeybadger errors. Since the asynchronous reporting thread is detached from any main request-processing threads, there's no risk of cascading failure there.

Do we want to do something like enable this on one FE first in prod to try it out?

I plan to do this before merging. I've also run it on an adhoc instance for manual testing.

remove magic-index numbers and direct interaction with `@raindrops`.

wjordan commented Nov 8, 2017

Added unit-test coverage and cleaned up the implementation a bit; now I'm going to manually test on a single frontend before merging.


wjordan commented Nov 8, 2017

All finished manually testing and incoming metrics look good; will merge after CI tests pass.

@wjordan wjordan merged commit 9c7ff98 into staging Nov 9, 2017
@wjordan wjordan deleted the unicorn-listener branch November 9, 2017 05:34