Concepts

  1. Overview
  2. Architecture
  3. Lifetime of a Data Point
  4. Feed
  5. Fitness for Purpose

1.1. Overview

The overarching goal is to make all cluster metrics accessible as a real-time stream API.

The API provides a raw data feed that is well suited as an underpinning of higher-level monitoring solutions: alert generators, real-time dashboards, time series databases, and so on. The intent is to spark off creation of new kinds of metric-based tools. Their scope doesn't have to be limited to monitoring. For example, one could use the API to automatically reconfigure load balancers or add service capacity by starting more containers.

HIGH_LEVEL_GRAPH_GOES_HERE

1.2. Architecture

The core of TSP is broken down into a few services (white boxes):

Dotted lines represent control traffic (HTTP). Solid lines represent time series feed traffic (TCP/4242). Both are described later.

The Riemann server is assumed to be running a translation gateway that converts TSP's feed to Riemann's own format. The OpenTSDB server does not require such a translation step because TSP is wire-compatible with OpenTSDB's line-based protocol.

In general, for an external system to consume the feed, it must either support TSP's feed traffic natively, or be fronted by a translation gateway. TSP deliberately omits support for any protocol beyond its own. Resolving protocol incompatibility globally at the TSP level is not worth the added overhead in code, configuration, deployment, and general cognitive load. Therefore, TSP employs a common protocol for transporting metrics, and expects subscribers to do their own local translation.

Forwarder

Aka the tsp-forwarder(8) agent. Runs on every server. Collects metrics for the local host using plugins, which are standalone programs that write metrics to standard output. Merges plugin output into a single stream, transforms it by setting host-identifying tags (host and cluster), and publishes it to remote subscribers over the TCP/4242 feed protocol. A subscriber like OpenTSDB receives TCP streams directly from the many forwarders (designated as "...") because it benefits from by-server sharding. A subscriber like Riemann prefers the unsharded version (one fat TCP stream), which it obtains via aggregator.
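
For illustration only, a minimal collection plugin might look like the Go sketch below. It emits one data point per line to standard output and leaves the host-identifying tags to the forwarder; the metric name, the procfs source, and the 10-second interval are assumptions made for this example, not part of TSP's documented contract.

// loadavg_plugin.go: hypothetical collection plugin. A plugin is a standalone
// program that writes one data point per line to standard output; the
// forwarder merges this stream with others and adds the host/cluster tags.
// The metric name (os.load.Average1Min) and the interval are illustrative.
package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	for {
		// Read the 1-minute load average from procfs.
		raw, err := os.ReadFile("/proc/loadavg")
		if err != nil {
			fmt.Fprintln(os.Stderr, "read /proc/loadavg:", err)
			os.Exit(1)
		}
		load1 := strings.Fields(string(raw))[0]
		// Emit: metric, Unix timestamp, value. Tags are left to the forwarder.
		fmt.Printf("os.load.Average1Min %d %s\n", time.Now().Unix(), load1)
		time.Sleep(10 * time.Second)
	}
}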

Poller

Aka the tsp-poller(8) service. Runs on one server. Pulls metrics from remote hosts, e.g. SNMP-only devices or Internet APIs. Publishes to the same subscribers as forwarder.

Aggregator

Aka the tsp-aggregator(8) service. Runs on one server. Receives forwarder and poller feeds. Merges them into a single stream, and publishes it to remote subscribers. This step centralizes scalable merging of TCP streams on behalf of job-like subscribers (e.g. Riemann). Does not transform data in any way.

Controller

Aka the tsp-controller(8) service. Runs on a few servers. Responds to configuration requests (HTTP) made by all of the above services. Helps reduce operational overhead by giving the operator a central point of control that can dynamically reconfigure the pipeline (e.g. add a subscriber), analogously to how dhcpd(8) helps reconfigure an entire computer network.
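
To give a feel for the control traffic, the hedged sketch below polls the controller for configuration over HTTP. The URL path, port, refresh interval, and the treatment of the response as opaque bytes are all hypothetical; the real request/response format is defined by the controller service itself.

// Hypothetical illustration of control traffic: periodically fetch
// configuration from the controller over HTTP. The /config path and the
// one-minute refresh interval are assumptions made for this sketch.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func fetchConfig(controller string) ([]byte, error) {
	resp, err := http.Get("http://" + controller + "/config") // hypothetical path
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("controller returned %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	for {
		if cfg, err := fetchConfig("controller.example.com:8080"); err == nil {
			fmt.Printf("refreshed configuration (%d bytes)\n", len(cfg))
		}
		time.Sleep(time.Minute)
	}
}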

1.3. Lifetime of a Data Point

The processing stages of a data point created by a hypothetical procfs collector:

Forwarder starts the procfs.sh child on startup, and restarts it in case of crashes, hangs, or new releases (detected by periodically checking the mtime of procfs.sh).
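
A rough sketch of that mtime check, assuming a 30-second polling interval; the actual child-restart mechanics are omitted.

// Detect a new release of a plugin by watching its mtime, as described above.
// The 30-second polling interval is an assumption for this sketch; the restart
// logic (killing and re-spawning the child process) is left out.
package main

import (
	"log"
	"os"
	"time"
)

func watchForNewRelease(path string, restart func()) {
	info, err := os.Stat(path)
	if err != nil {
		log.Fatal(err)
	}
	lastMtime := info.ModTime()
	for range time.Tick(30 * time.Second) {
		info, err := os.Stat(path)
		if err != nil {
			continue // plugin temporarily missing; keep the old child running
		}
		if info.ModTime().After(lastMtime) {
			lastMtime = info.ModTime()
			restart() // new release detected
		}
	}
}

func main() {
	watchForNewRelease("procfs.sh", func() { log.Println("restarting procfs.sh") })
}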

Forwarder rewrites every incoming point using a filter ruleset downloaded from controller. The main role of the filter is to decorate points with host-identifying tags, and to drop undesirable points.
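
The outcome of the host-identifying rewrite can be pictured with the small sketch below. The representation of the real filter ruleset is not shown; only the host and cluster tag names come from the description above, and the cluster value is invented.

// Sketch of the host-identifying rewrite: append host and cluster tags to a
// data point in the feed's line format. This only illustrates the outcome,
// not the rule syntax used by the downloaded filter ruleset.
package main

import "fmt"

func decorate(point, host, cluster string) string {
	return fmt.Sprintf("%s host=%s cluster=%s", point, host, cluster)
}

func main() {
	in := "put os.cpu.Percent 1423944435 0.2"
	fmt.Println(decorate(in, "www001.example.com", "web"))
	// Output: put os.cpu.Percent 1423944435 0.2 host=www001.example.com cluster=web
}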

Finally, forwarder fans the data point out to a set of relay hosts, which is also discovered using controller. In this case, forwarder has learned about two relays: OpenTSDB and aggregator.

Every point is transmitted twice: each relay gets a copy of the point over a long-lived connection. If a relay cannot keep up with the rate of data point arrivals, its forwarder-side queue eventually fills, causing any points that arrive during this resource shortage to be discarded. Other healthy relays are unaffected.
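
This per-relay queueing behaviour maps naturally onto a bounded queue per relay with a non-blocking send. The following Go sketch illustrates the idea; the queue size, relay addresses, and sample points are assumptions, not the actual implementation.

// Sketch of the fan-out behaviour described above: each relay gets its own
// bounded queue over a long-lived connection, and a full queue causes points
// to be dropped for that relay only, leaving healthy relays unaffected.
package main

import (
	"bufio"
	"log"
	"net"
	"time"
)

func relayWriter(addr string, queue <-chan string) {
	conn, err := net.Dial("tcp", addr) // long-lived feed connection (TCP/4242)
	if err != nil {
		log.Printf("dial %s: %v", addr, err)
		return
	}
	defer conn.Close()
	w := bufio.NewWriter(conn)
	for line := range queue {
		w.WriteString(line + "\n")
		w.Flush()
	}
}

func main() {
	relays := []string{"opentsdb.example.com:4242", "aggregator.example.com:4242"}
	queues := make([]chan string, len(relays))
	for i, addr := range relays {
		queues[i] = make(chan string, 10000) // bounded per-relay queue
		go relayWriter(addr, queues[i])
	}
	points := []string{
		"put os.cpu.Percent 1423944435 0.2 host=www001.example.com",
		"put os.memory.Size 1423944435 129078112 host=www001.example.com type=Used",
	}
	for _, p := range points {
		for _, q := range queues {
			select {
			case q <- p: // relay is keeping up
			default: // queue full: drop the point for this relay only
			}
		}
	}
	time.Sleep(time.Second) // give the writers a moment to drain (sketch only)
}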

The OpenTSDB relay is a direct subscriber. It receives points directly from forwarders and persists them in HBase.

The Riemann relay is an indirect subscriber. It receives points via aggregator and continuously analyses them.

The aggregator relay is not a subscriber. It's an internal helper service that fans in many forwarder feeds, and fans them out as a single feed to subscribers like Riemann. This extra step saves such subscribers from having to implement support for efficient handling of hundreds of concurrent forwarder connections. It also saves the operator from having to reconfigure any firewalls on the network path between forwarders and Riemann. As usual, aggregator's relay list is downloaded from controller.

1.4. Feed

The metric feed format is a plain-text sequence of lines, one data point per line.

A sample wire-level view of the data feed:

put os.cpu.Percent 1423944435 0.2 host=www001.example.com
put os.memory.Size 1423944435 129078112 host=www001.example.com type=Used
put os.memory.Size 1423944435 5139616 host=www001.example.com type=Free
put os.cpu.Percent 1423944445 100.0 host=www001.example.com
put os.memory.Size 1423944445 129078112 host=www001.example.com type=Used
put os.memory.Size 1423944445 5139616 host=www001.example.com type=Free

The example contains 3 time series named:

os.cpu.Percent  host=www001.example.com
os.memory.Size  host=www001.example.com type=Used
os.memory.Size  host=www001.example.com type=Free

The name has two sections: a metric section and a tags section. These sections help users navigate the vast space of available time series.

The metric section is the dot-delimited one. It organizes the time series namespace into broad categories.

The tags section is the remaining list of key=value pairs. It serves to subdivide each metric along independent dimensions.
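
As a rough sketch of how a consumer (for example, a translation gateway) might split a feed line into these sections, consider the following Go fragment; the Point type and the minimal error handling are inventions for illustration.

// Sketch of parsing one feed line, following the sample above:
// "put <metric> <timestamp> <value> <tag>=<value> ...".
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type Point struct {
	Metric    string
	Timestamp int64
	Value     float64
	Tags      map[string]string
}

func parseLine(line string) (Point, error) {
	f := strings.Fields(line)
	if len(f) < 4 || f[0] != "put" {
		return Point{}, fmt.Errorf("malformed line: %q", line)
	}
	ts, err := strconv.ParseInt(f[2], 10, 64)
	if err != nil {
		return Point{}, err
	}
	val, err := strconv.ParseFloat(f[3], 64)
	if err != nil {
		return Point{}, err
	}
	p := Point{Metric: f[1], Timestamp: ts, Value: val, Tags: map[string]string{}}
	for _, kv := range f[4:] {
		if k, v, ok := strings.Cut(kv, "="); ok {
			p.Tags[k] = v
		}
	}
	return p, nil
}

func main() {
	p, _ := parseLine("put os.memory.Size 1423944435 129078112 host=www001.example.com type=Used")
	fmt.Printf("%+v\n", p)
}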

The example contains 2 samples of each time series:

  • 1423944435 is the time of the first collection sample
  • 1423944445 is the time of the second sample, 10 seconds later

The values suggest that host www001.example.com has had a CPU spike, although its memory usage remained constant.

Subscribers receive the feed over a single, long-lived TCP connection initiated by aggregator.

Alternatively, a subscriber may be configured to receive the feed over many TCP connections directly from forwarders and the poller. This is suitable for a subscriber like OpenTSDB, which does not require a consolidated feed, and in fact scales better with many partial feeds because of its thread-per-connection design.

TABLE_OF_DIRECT_VS_INDIRECT_PROS_AND_CONS

1.5. Fitness for Purpose

The information contained in this section is not necessary for understanding TSP at a high level, but it may help decide if TSP will fit well into your existing architecture.

All services are written in Go. The programs integrate most naturally into a Unix environment, although the Windows version also compiles and has been tested to work well.

Forwarding throughput has been a major goal. Aggregator is the forwarding hot spot. Betfair runs an aggregator instance that ingests 100K points/s and duplicates the stream for 3 subscribers, for an aggregate output of 300K points/s. This workload fits in 2 cores and 1 GB of RAM.

TSP enables rapid orientation thanks to the low latency at which data points traverse the pipeline. Forwarder, aggregator, and poller write to sockets in a line-buffered way, which lets subscribers observe new data points within milliseconds of creation. Nagle's algorithm in TCP/IP is responsible for batching these data points into good-sized packets (at a modest latency cost).
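
A sketch of that write path: flush after every line, and let Nagle coalesce small writes into full packets. Note that Go disables Nagle's algorithm by default, so the sketch re-enables it explicitly; whether TSP itself does so is an assumption, and the subscriber address is invented.

// Sketch of line-buffered socket writes: one flush per data point, with
// packet coalescing left to Nagle's algorithm. Go's net package disables
// Nagle by default, so SetNoDelay(false) turns it back on for this sketch.
package main

import (
	"bufio"
	"log"
	"net"
)

func main() {
	conn, err := net.Dial("tcp", "subscriber.example.com:4242") // hypothetical subscriber
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	if tcp, ok := conn.(*net.TCPConn); ok {
		tcp.SetNoDelay(false) // let Nagle batch small writes into full packets
	}
	w := bufio.NewWriter(conn)
	w.WriteString("put os.cpu.Percent 1423944445 100.0 host=www001.example.com\n")
	w.Flush() // line-buffered: the point is visible to the subscriber right away
}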

TSP follows a hybrid push/pull model. Forwarders and the poller pull metrics over procfs/snmp/jmx/etc., but push them to subscribers. Pulling offers two advantages: (1) control over the sampling interval, (2) monitored programs don't require any changes. The main advantage of pushing is that each subscriber gets a dedicated sender-side queue that helps recover from a temporary network partition or receiver overload. Once the receiver becomes available again, data points held in the sender-side queue are flushed.

The collection plugin subsystem is copied wholesale from tcollector because it offers so many benefits:

  • Plugins can be written in any language.
  • Plugins are hot-upgradable.
  • Troubleshooting is a matter of re-running a script, maybe after an edit.

There are two departures from tcollector's plugin model:

  • There is no lib/ directory for storing local configuration.
  • Support for one-shot plugins has been omitted.

TSP eschews local configuration. Developers are encouraged to design plugins so that the configuration step is unnecessary. Plugins are expected to choose good defaults.

For the rare cases where plugin configuration is unavoidable, TSP favors reading configuration over the network by downloading it from the controller service. In large-scale deployments, local config files can be impossible to update in a timely fashion, which impacts the reliability of the metric feed: an inconsistently configured pipeline cannot be trusted.

Forwarder, aggregator, and poller do not depend on any external message queue. Each of them builds in a bounded message queue, which breaks the circular dependency that would arise if an external queue such as RabbitMQ were used: RabbitMQ would be unmonitorable using TSP, introducing an unacceptable blind spot in the monitoring solution.

Controller does not depend on any external configuration store. It reads configuration from a local file, which holds static content similar in purpose to dhcpd.conf. That file can be maintained by hand, or generated offline from some external store. If an external store such as zookeeper were in the critical path, a monitoring blind spot would be introduced: zookeeper would be unmonitorable using TSP.

Nothing in the system depends on the availability of disk space. Metrics are delivered even during a space shortage.

TSP allows the monitoring system to degrade gracefully. For example, OpenTSDB becoming unavailable is not a catastrophic event because monitoring services that depend on the aggregator feed (e.g. Riemann) continue to operate unaffected.

Thanks to regular sampling, the feed does not suffer from load-induced variability in data rate. The data rate remains constant even when the monitored systems are running hot. This is in contrast to any event-based monitoring feed (e.g. HTTP access_log).

Dependability, low latency, and good throughput make for a fun user experience. There is nothing more demotivating than waiting for a monitoring system to unblock/unbreak and supply fresh data, especially during an outage.

Continue to Installation.

Return to Documentation.