Zipkin is a distributed tracing system that helps us gather timing data for all the disparate services at Twitter. It manages both the collection and lookup of this data through a Collector and a Query service. We closely modelled Zipkin after the Google Dapper paper. Follow @zipkinproject for updates.
Collecting traces helps developers gain deeper knowledge about how certain requests perform in a distributed system. Let's say we're having problems with user requests timing out. We can look up traced requests that timed out and display it in the web ui. We'll be able to quickly find the service responsible for adding the unexpected response time. If the service has been annotated adequately we can also find out where in that service the issue is happening.
These are the components that make up a fully fledged tracing system.
Tracing information is collected on each host using the instrumented libraries and sent to Zipkin. When the host makes a request to another service, it passes a few tracing identifers along with the request so we can later tie the data together.
We have instrumented the libraries below to trace requests and to pass the required identifiers to the other services called in the request.
Finagle is an asynchronous network stack for the JVM that you can use to build asynchronous Remote Procedure Call (RPC) clients and servers in Java, Scala, or any JVM-hosted language.
Finagle is used heavily inside of Twitter and it was a natural point to include tracing support. So far we have client/server support for Thrift and HTTP as well as client only support for Memcache and Redis.
To set up a Finagle server in Scala, just do the following.
Adding tracing is as simple as adding finagle-zipkin as a dependency and a
tracerFactory to the ServerBuilder.
ServerBuilder() .codec(ThriftServerFramedCodec()) .bindTo(serverAddr) .name("servicename") .tracerFactory(ZipkinTracer()) .build(new SomeService.FinagledService(queryService, new TBinaryProtocol.Factory()))
The tracing setup for clients is similar. When you've specified the Zipkin tracer as above a small sample of your requests will be traced automatically. We'll record when the request started and ended, services and hosts involved.
In case you want to record additional information you can add a custom annotation in your code.
Trace.record("starting that extremely expensive computation")
The line above will add an annotation with the string attached to the point in time when it happened. You can also add a key value annotation. It could look like this:
There's a gem we use to trace requests. In order to push the tracer and generate a trace id on a request you can use that gem in a RackHandler. See zipkin-web for an example of where we trace the tracers.
For tracing client calls from Ruby we rely on the Twitter Ruby Thrift client. See below for an example on how to wrap the client.
client = ThriftClient.new(SomeService::Client, '127.0.0.1:1234') client_id = FinagleThrift::ClientId.new(:name => "service_example.sample_environment") FinagleThrift.enable_tracing!(client, client_id), "service_name")
Querulous is a Scala library for interfacing with SQL databases. The tracing includes the timings of the request and the SQL query performed.
Cassie is a Finagle based Cassandra client library. You set the tracer in Cassie pretty much like you would in Finagle, but in Cassie you set it on the KeyspaceBuilder.
We use Scribe to transport all the traces from the different services to Zipkin and Hadoop. Scribe was developed by Facebook and it's made up of a daemon that can run on each server in your system. It listens for log messages and routes them to the correct receiver depending on the category.
Once the trace data arrives at the Zipkin collector daemon we check that it's valid, store it and the index it for lookups.
We settled on Cassandra for storage. It's scalable, has a flexible schema and is heavily used within Twitter. We did try to make this component pluggable though, so should not be hard to put in something else here.
Once the data is stored and indexed we need a way to extract it. This is where the query daemon comes in, providing the users with a simple Thrift api for finding and retrieving traces. See the Thrift file.
Most of our users access the data via our UI. It's a Rails app that uses D3 to visualize the trace data. Note that there is no built in authentication in the UI.
Zipkin relies on Cassandra for storage. So you will need to bring up a Cassandra cluster.
- See Cassandra's site for instructions on how to start a cluster.
- Use the Zipkin Cassandra schema attached to this project. You can create the schema with the following command.
bin/cassandra-cli -host localhost -port 9160 -f zipkin-server/src/schema/cassandra-schema.txt
Zipkin uses ZooKeeper for coordination. That's where we store the server side sample rate and register the servers.
- See ZooKeeper's site for instructions on how to install it.
Scribe is the logging framework we use to transport the trace data. You need to set up a network store that points to the Zipkin collector daemon.
A Scribe store for Zipkin might look something like this.
<store> category=zipkin type=network remote_host=zk://zookeeper-hostname:2181/scribe/zipkin remote_port=9410 use_conn_pool=yes default_max_msg_before_reconnect=50000 allowable_delta_before_reconnect=12500 must_succeed=no </store>
Note that the above uses the Twitter version of Scribe with support for using ZooKeeper to find the hosts to send the category to. You can also use a DNS entry for the collectors or something similar.
git clone https://github.com/twitter/zipkin.git
cp config/collector-dev.scala config/collector-prod.scala
cp config/query-dev.scala config/query-prod.scala
- Modify the configs above as needed. Pay particular attention to ZooKeeper and Cassandra server entries.
bin/sbt update package-dist(This downloads SBT 0.11.2 if it doesn't already exist)
scp dist/zipkin*.zip [server]
mkdir -p /var/log/zipkin
- Start the collector and query daemon.
The UI is a standard Rails 3 app.
- Update config with your ZooKeeper server. This is used to find the query daemons.
- Deploy to a suitable Rails 3 app server. For testing you can simply do
bundle install && bundle exec rails server.
It's possible to setup Scribe to log into Hadoop. If you do this you can generate various reports from the data that is not easy to do on the fly in Zipkin itself.
We use a library called Scalding to write Hadoop jobs in Scala.
- To run a Hadoop job first make the fat jar.
sbt 'project zipkin-hadoop' compile assembly
- Change scald.rb to point to the hostname you want to copy the jar to and run the job from.
- You can then run the job using our scald.rb script.
./scald.rb --hdfs com.twitter.zipkin.hadoop.[classname] --date yyyy-mm-ddThh:mm yyyy-mm-ddThh:mm
There are two mailing lists you can use to get in touch with other users and developers.
Noticed a bug? Please add an issue here. https://github.com/twitter/zipkin/issues
Contributions are very welcome! Please create a pull request on github and we'll look at it as soon as possible.
Try to make the code in the pull request as focused and clean as possible, stick as close to our code style as you can.
If the pull request is becoming too big we ask that you split it into smaller ones.
Areas where we'd love to see contributions include: adding tracing to more libraries and protocols, interesting reports generated with Hadoop from the trace data, extending collector to support more transports and storage systems and other ways of visualizing the data in the web ui.
We intend to use the semver style versioning.
Thanks to everyone below for making Zipkin happen!
Copyright 2012 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0