Add Cdo::UnicornListener middleware #18961
UnicornListener instruments a Rack application to collect the number of currently-executing requests across all Unicorn worker-processes, and reports metrics to CloudWatch.
Unicorn is not usually used in :development, and we also don't care to report CloudWatch metrics from developer workstations.
lib/cdo/unicorn_listener.rb
Outdated
    class StatsWithMax < Raindrops::Middleware::Stats
      def initialize
        super
        @raindrops.size = 3
Make the 3 and 2's be a constant and not a magic #?
Also, I don't understand what's going on here... is `@raindrops` a base class thing where `@raindrops[0]` and `[1]` are already in use, and we are expanding one more slot for our use? Some more comments would be helpful.
Yeah, this is specific to the class this class is extending (`Raindrops::Middleware::Stats`), which initializes `:calling` and `:writing` as `Raindrops` counters (backed by the `@raindrops` instance variable). Here I'm increasing the size by 1 to introduce a third variable. I can add more comments / make less use of magic integers in a future commit.
I managed to simplify the implementation in a way that doesn't require interacting with this internal instance-variable or using any index constants.
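For readers following the thread, here is a rough sketch of the "track the maximum over an interval" idea discussed above. It is not the PR's implementation: the class name `MaxTrackingCounter` and its methods are hypothetical, and it uses a plain `Mutex`, so it only works within a single process, whereas the PR relies on `raindrops` counters shared across worker processes.

```ruby
# Hypothetical sketch, not the code from this PR.
# Tracks in-flight requests and the peak seen since the last read.
class MaxTrackingCounter
  def initialize
    @lock = Mutex.new
    @current = 0 # requests executing right now
    @max = 0     # highest value of @current since the last take_max
  end

  # Call when a request starts executing.
  def incr
    @lock.synchronize do
      @current += 1
      @max = @current if @current > @max
    end
  end

  # Call when a request finishes.
  def decr
    @lock.synchronize { @current -= 1 }
  end

  # Returns the peak for the interval, then resets the peak to the current
  # in-flight count so the next interval starts from a true baseline.
  def take_max
    @lock.synchronize do
      peak = @max
      @max = @current
      peak
    end
  end
end
```

A reporter would call `take_max` once per interval, which is why a brief spike between two polls still shows up in the reported value.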
LGTM modulo the comment. Some paranoid questions to think about (I don't know if I can come up with the answers just from the code here): Is there anything that could be intrusive about turning this on? Is there any condition like running out of memory from tracking metrics, or spending too much time sending metrics? What happens if the CloudWatch API goes down or starts failing: do we just go on about our business, or tip over in cascading failure? Do we want to do something like enable this on one FE first in prod to try it out? As long as you've thought through these and any other paranoid questions, looks good to me.
This PR will send 3 per-second CloudWatch metrics from every (non-development) host running a Dashboard server. There's a cost associated with custom metrics (ranging from $0.02-$0.30 per metric/month depending on volume). This shouldn't be too intrusive (the only possible issue is a large number of adhoc instances; we can disable sending these metrics from that environment if it becomes a problem).
If the CloudWatch API goes down or starts failing, the metrics-reporting thread will begin raising exceptions, which are caught and reported as Honeybadger errors. Since the asynchronous reporting thread is detached from the main request-processing threads, there's no risk of cascading failure there.
I plan to try this on a single frontend in production before merging. I've also run it on an adhoc instance for manual testing.
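To illustrate the failure-isolation point above, here is a minimal sketch of a detached reporting thread that rescues send failures instead of letting them propagate. The class `BackgroundReporter`, the `sender` block, and the `on_error` callback are all hypothetical stand-ins (for the CloudWatch client call and the Honeybadger notification respectively); the actual `Cdo::Metrics` code may be structured differently.

```ruby
# Hypothetical sketch, not the code from this PR.
# `sender` stands in for the CloudWatch call; `on_error` for error reporting.
class BackgroundReporter
  def initialize(interval:, on_error:, &sender)
    @interval = interval
    @on_error = on_error
    @sender = sender
    @queue = Queue.new # thread-safe; request threads only ever enqueue
  end

  # Called from request-processing code; never blocks on the network.
  def push(datum)
    @queue << datum
  end

  # Starts the detached reporting loop.
  def start
    @thread = Thread.new do
      loop do
        sleep @interval
        flush
      end
    end
  end

  def stop
    @thread&.kill
    @thread&.join
  end

  # Drains pending data and sends it as one batch. A failed send is reported
  # and the batch is dropped; the error never reaches the caller.
  def flush
    batch = []
    batch << @queue.pop(true) until @queue.empty?
    @sender.call(batch) unless batch.empty?
  rescue StandardError => e
    @on_error.call(e)
  end
end
```

In this sketch a failing sender costs at most one dropped batch per interval, and request threads only ever touch the in-memory queue.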
remove magic-index numbers and direct interaction with `@raindrops`.
Added unit-test coverage and cleaned up the implementation a bit. Now I'm going to manually test on a single frontend before merging.
add unit test to ensure correctness.
All finished manually testing and incoming metrics look good, will merge after CI tests pass.
`Cdo::UnicornListener` instruments a Rack application to collect the number of currently-executing requests across all Unicorn worker-processes, and reports high-resolution (1-second interval) metrics to CloudWatch for alerting/monitoring.

The following metrics are collected and reported to the 'Unicorn' namespace:

- `active` - the number of active TCP/socket connections (obtained via `raindrops` by monitoring `/proc/net/unix` for Unix socket or inet_diag for TCP socket)
- `queued` - the number of queued TCP/socket requests
- `calling` - the maximum number of currently-executing requests at any point during the interval (obtained via `raindrops` by request-tracking middleware using an atomic counter shared across processes/threads). This metric is similar to `active` except it continuously tracks the maximum rather than polling, so it should be more accurate for tracking brief activity spikes.

These metrics can be interpreted as follows:

- If `queued` is non-zero, there aren't enough open worker-processes to immediately serve all incoming requests.
- `calling` / `CDO.dashboard_workers` gives the 'worker-process utilization' for a frontend instance. If `calling` reaches the number of workers, then requests will either queue or time out with 5xx errors. Track the `average` and/or `maximum` of `calling` in CloudWatch to measure the distribution of requests across frontend instances.
- `active` should be similar to (if slightly lower than) `calling`; if they are not similar, then something else might be going on (a bug in the metric collection, perhaps).

The listener leverages code from `raindrops`.

For reporting to CloudWatch, the listener uses a `Cdo::Metrics` class that encapsulates batching/asynchronous-sending logic for interacting with a `::Aws::CloudWatch::Client` object, with a simple `Cdo::Metrics.push(namespace, metrics)` interface. (This duplicates/extracts logic originally written for `mysql-metrics` (#12140); it should be possible to reuse this class in that script in a future PR.)

Also upgraded some rubygems in `Gemfile.lock`:

- `aws-sdk` to support the `storage_resolution` high-resolution metrics parameter in the CloudWatch client
- `raindrops` to support Rack 2.x
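As a rough illustration of the metrics described above, the sketch below builds the three per-second data points in the shape accepted by `Aws::CloudWatch::Client#put_metric_data` and computes the worker-process utilization. The helper names `unicorn_metric_data` and `worker_utilization` are hypothetical, the `'Count'` unit and the absence of dimensions are assumptions, and the exact format that `Cdo::Metrics.push` expects is not shown in this PR description.

```ruby
# Hypothetical helpers, not the code from this PR.
# Builds one high-resolution datum per metric for a single interval.
def unicorn_metric_data(active:, queued:, calling:, timestamp: Time.now)
  { active: active, queued: queued, calling: calling }.map do |name, value|
    {
      metric_name: name.to_s,
      value: value,
      unit: 'Count',          # assumed unit
      timestamp: timestamp,
      storage_resolution: 1   # 1-second (high-resolution) metric
    }
  end
end

# Fraction of worker processes busy at the interval's peak.
# `workers` would be CDO.dashboard_workers on a frontend instance.
def worker_utilization(calling, workers)
  calling.to_f / workers
end
```

With the AWS SDK loaded, the resulting array could be sent with something like `client.put_metric_data(namespace: 'Unicorn', metric_data: data)`. A utilization of 1.0 corresponds to the point where further requests would start to queue.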