Show Date: Tuesday, 17 July 2012 17:00 Rome, 11:00 EDT
- Jason Dixon github, twitter, irc: jdixon, blog
- Tim Dysinger github, twitter
- Bryan Berry github, twitter, irc: bryanwb, blog: devopsanywhere
- John Vincent, aka Lusis twitter, github
Jason and Dysinger, what are your backgrounds? Where do you live?
Outline / Questions
What do we mean by "monitoring"? Let's talk vocabulary first.
- Monitoring - collection, storage, poll vs push
- Fault Detection - identifying a failure
- Trending & Capacity Planning
What kinds of monitoring are there?
- Heartbeat / liveness monitoring - is it up?
- Quality of Service - is it operating at a sufficient level / doing the right thing?
- Host-level metrics - cpu, disk, memory, network
- Generally we don't care about these on their own
- Rather, we correlate this with service data to narrow our diagnosis of an outage
- Service / Application-level metrics
- Business / Transactional metrics - is the workflow operational? are we making money?
What do we do with this collected data?
- Watch for transients (state changes)
- Send notifications - email, sms, pagers
- Visual feedback (real-time vs historical dashboards, reports)
- How long should we retain data?
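As a rough illustration of "watching for transients", the sketch below walks a series of check results and reports only the points where the state changes, which is typically when a notification would fire. The function name and the "OK" / "CRITICAL" labels are made up for this example and are not taken from any particular tool.

```python
def state_changes(samples):
    """Yield (index, old_state, new_state) whenever a sample differs from the previous one."""
    previous = None
    for index, state in enumerate(samples):
        if previous is not None and state != previous:
            yield index, previous, state
        previous = state


# Hypothetical results from five consecutive runs of one check.
checks = ["OK", "OK", "CRITICAL", "CRITICAL", "OK"]
changes = list(state_changes(checks))

for index, old, new in changes:
    print(f"sample {index}: {old} -> {new}")
```

Alerting on the transition rather than on every bad sample is one simple way to avoid paging repeatedly for the same outage.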
What tools are available and where should people start?
- Should I start w/ Nagios or go straight to Sensu?
- Poll vs Push - should metrics storage be the source of truth?
- Visualization - Graphite, Ganglia
- Monolithic vs Modular - wtf is Voltron?
Collecting metrics - logstash, collectd, statsd, logster
- What should I use to collect system metrics? collectd or diamond?
- How should engineers collect application metrics?
- When do I need to use AMQP and a message bus to ship around my metrics?
- WTF happened to SNMP?
- Why are standard deviation and 90th percentile important?
- How should you represent your metrics so that they are meaningful?
- What #'s do you care about most?
- Should I be more interested in histogram peaks or tails?
- Depends on your business and who's asking
- Percentiles are valuable for finding trends
- Tails are valuable for finding anomalies
- Both are useful, but either can be misleading / time-consuming
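To make the peaks-versus-tails point concrete, here is a minimal sketch with invented latency samples. It uses a simple nearest-rank percentile (one of several common definitions) and the standard library's population standard deviation. The numbers are fabricated purely for illustration.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 14, 15, 15, 16, 17, 18, 20, 22, 950]

mean = statistics.mean(latencies_ms)
stdev = statistics.pstdev(latencies_ms)
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
worst = max(latencies_ms)

print(f"mean={mean:.1f} stdev={stdev:.1f} p50={p50} p90={p90} max={worst}")
```

In this sample the single outlier drags the mean well above the 90th percentile, while the percentiles describe what most requests saw and the maximum exposes the anomaly. Which number matters depends on who is asking.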
Jason, you worked at Heroku up until recently. How do they handle metrics collection internally?
- Self-service monitoring pipeline (event stream model)
- Robust logging infrastructure (logplex) to lean on
- Application logs key=value pairs
- Exprd ("expression-d"), a Logster-style app, extracts metrics via a logplex drain, applies gauges/counters/etc., and sends them to Backstop
- Backstop proxies from JSON/HTTP to Carbon (Graphite listener)
- Umpire queries Graphite API and provides an HTTP status response
- This model makes it easy for engineers to set up their own monitoring and trending flows
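The flow above can be sketched in a few lines. This is not Heroku's code: the function names, the `measure` / `value` log keys, and the status codes chosen for the check are assumptions made for illustration. The two formats that are real are Carbon's plaintext protocol (`<path> <value> <timestamp>` per line) and the `[value, timestamp]` datapoint pairs returned by Graphite's JSON render output.

```python
def parse_kv(line):
    """Split a 'key=value key=value' log line into a dict of strings."""
    pairs = (token.split("=", 1) for token in line.split() if "=" in token)
    return {key: value for key, value in pairs}


def to_carbon_line(path, value, timestamp):
    """Format one datapoint in Carbon's plaintext protocol."""
    return f"{path} {value} {int(timestamp)}\n"


def check_status(datapoints, max_value):
    """Umpire-style check (assumed semantics): map a series to an HTTP status code.

    datapoints is a list of [value, timestamp] pairs; value may be None.
    """
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        return 404  # no data for the metric
    average = sum(values) / len(values)
    return 200 if average <= max_value else 500


# Hypothetical application log line and an arbitrary fixed timestamp.
log_line = "measure=api.latency value=123 unit=ms"
fields = parse_kv(log_line)
carbon_line = to_carbon_line(f"app.{fields['measure']}", float(fields["value"]), 1342537200)
print(carbon_line, end="")

# Hypothetical series as it might come back from a render query.
series = [[120.0, 1342537200], [130.0, 1342537260], [None, 1342537320]]
print(check_status(series, max_value=200))
```

Because the check returns a plain HTTP status, any uptime checker that understands HTTP can alert on a metric without knowing anything about Graphite.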
- "Out of the box" - Ganglia, Nagios + RRD / PNP4Nagios
- Cacti - if you're using SNMP
- Graphite - ZOMG all the choices: gdash, Graphiti, Graphene, Tasseo, Descartes
- Third-party services - Ducksboard (only supports push)
- Why are there no "pretty" dashboard services that support pull (e.g. from Graphite API)?
What about self-service metrics, like Coda Hale's Metrics library?
Jason, how do you feel about the overall health of the Graphite project? Are there any design features or aspects of the project that might limit it long-term?
- I think there are a few rough spots, but nothing that's currently inhibiting its adoption.
- Lack of authentication and authorization on ingress. Needs some way to filter access on metrics submission.
- Probably should happen on a relay, e.g. carbon-relay or Backstop.
- When scaling out to multiple storage (carbon) nodes, you must run a webserver on each node.
- Because carbon only reads from memory, not from disk. Webserver reads from disk.
- Would need to extend carbon to serve up metrics from both in order to remove webserver dependency.
- I talked with Michael Leinartas (@mleinart) at Velocity about these issues, he agrees.
- Lack of metadata in metrics (whisper). This is both a pro and a con.
- pro - Graphite doesn't care what you throw at it. Easy ramp-up.
- con - Hard(er) to do stuff like annotations.
- You end up pushing a lot of this "extra" information out to the edge, e.g. dashboard database.
- Weak dashboard application. In its place we're seeing a lot of open-source alternatives using Graphite's JSON output.
Jason, since you work w/ Graphite, you must have to use Launchpad. What do you think of Launchpad?
- Well, now I'm biased. But even if you'd asked me this a month ago I'd have the same answer. I hate LP.
- Their "answers" product is nice for assembling an organic FAQ and searchability.
- Tracking branches and submitting changes is a hassle.
- Finding anything in their UI is a hassle.
- Oh, and it's painfully slow.
- Fortunately I don't have to use LP anymore for code. The project is hosted on GitHub now.
Let's summarize for n00bs, I start w/ A then proceed to B, then C . . .
- The Sharpe books by Bernard Cornwell
- The Economist
- collectd > 1.4.15
- Logstash 1.1.1 released
- Someone should hire Jordan Sissel to work on Logstash full-time
- John Rauser's Velocity keynotes
- Impressed by Kyle's work and his dedication to performance. Really smart dude.
- It solves a real problem with [high performance] Complex Event Processing in the event stream model.
Hosted Graphite service.
- Support statsd and cleartext submission.
- I haven't used it yet, but it looks promising.
- Same guys who wrote MetricFire.
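For anyone unfamiliar with the two submission formats, the sketch below shows the same kind of measurement in each. The helper names and metric names are invented, and nothing here is specific to the hosted service, which may require additional details such as an account-specific prefix. The statsd line format is `<name>:<value>|<type>`, and the cleartext (Carbon plaintext) format is `<path> <value> <timestamp>`.

```python
def statsd_line(name, value, metric_type):
    """statsd wire format; common types are 'c' (counter), 'ms' (timer) and 'g' (gauge)."""
    return f"{name}:{value}|{metric_type}"


def cleartext_line(path, value, timestamp):
    """Carbon plaintext format: one datapoint per newline-terminated line."""
    return f"{path} {value} {timestamp}\n"


# Invented metric names; the timestamp is an arbitrary fixed Unix time.
counter = statsd_line("api.requests", 1, "c")
timer = statsd_line("api.latency", 320, "ms")
datapoint = cleartext_line("api.latency", 320, 1342537200)

print(counter)
print(timer)
print(datapoint, end="")
```

The practical difference is that statsd lines carry a type and no timestamp, leaving aggregation to the statsd daemon, while a cleartext line is a finished datapoint that is stored as sent.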
- No formal details yet, but I'm working on a new Monitoring conference / hackathon.
- Probably Boston in October (~3rd week).
- Single track mornings.
- Dual track afternoons (workshops / hackathon).
- Low cost / high accessibility.
- Hope to announce details by next week.
Please take the time to rate us on iTunes and to send your cookbook news to email@example.com
Follow @foodfightshow on Twitter.
Also, you can submit show ideas to our GitHub repo.