
[WIP] Add initial OpenTracing instrumentation #5425

Closed · wants to merge 2 commits

Conversation

bg451 commented May 20, 2016

This PR adds some basic OpenTracing instrumentation to etcd.

Background

The idea behind OpenTracing is to provide a standardized interface across languages to various distributed tracing systems (such as Zipkin). Once you instrument your system, you can swap the underlying tracer with an O(1) switch. In case you aren't familiar with distributed tracing, the basic idea is that you can follow a request as it branches out across multiple machines and services, which makes it much easier to diagnose problems such as latency, limping hardware, and the events leading up to an error deep in the stack.
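As a rough sketch (not code from this PR), the O(1) switch with opentracing-go looks something like this; the NoopTracer here is just a stand-in for whatever backend you plug in:

```go
package main

import (
	opentracing "github.com/opentracing/opentracing-go"
)

func main() {
	// Any OpenTracing-compatible tracer (Zipkin, LightStep, Appdash, ...)
	// can be installed here; the NoopTracer stands in for a real backend.
	var tracer opentracing.Tracer = opentracing.NoopTracer{}
	opentracing.SetGlobalTracer(tracer)

	// Instrumented code only ever talks to the opentracing API,
	// so switching backends never touches it.
	span := opentracing.StartSpan("example-operation")
	defer span.Finish()
}
```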

Goal

The goal of this PR is to a) show what a request trace looks like in an etcd cluster and b) start a discussion about which components of etcd need to be traced. I'm not extremely familiar with how etcd works, so there's a lot left to be desired when it comes to which endpoints get traced and how much noise is produced; I'm probably creating spans for parts that don't really matter and missing key things I don't know about.

What's really neat is that gRPC allows for metadata, which makes sending traces across process boundaries over gRPC a breeze. This can be seen in the v3 KV client and server. Here's a gist of what a simple client using OpenTracing and the etcd client looks like.
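As a sketch of what that propagation can look like with opentracing-go and gRPC metadata (illustrative only, not the exact code in this PR; the helper names and the metadata.NewOutgoingContext/FromIncomingContext calls are my assumptions):

```go
package trace

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
	"google.golang.org/grpc/metadata"
)

// injectSpan copies the client span's context into outgoing gRPC metadata.
func injectSpan(ctx context.Context, span opentracing.Span) context.Context {
	carrier := opentracing.TextMapCarrier{}
	// Injection failures are non-fatal; the RPC simply goes out untraced.
	_ = opentracing.GlobalTracer().Inject(span.Context(), opentracing.TextMap, carrier)
	return metadata.NewOutgoingContext(ctx, metadata.New(carrier))
}

// extractSpan starts a server-side child span from incoming gRPC metadata.
func extractSpan(ctx context.Context, op string) opentracing.Span {
	carrier := opentracing.TextMapCarrier{}
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		for k, vals := range md {
			if len(vals) > 0 {
				carrier[k] = vals[0]
			}
		}
	}
	parent, err := opentracing.GlobalTracer().Extract(opentracing.TextMap, carrier)
	if err != nil {
		// No usable trace context: start a fresh root span instead.
		return opentracing.StartSpan(op)
	}
	return opentracing.StartSpan(op, opentracing.ChildOf(parent))
}
```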

Screenshot

Appdash

There is currently an open issue because Appdash can't render spans shorter than 1 ms. However, if you look at the table, you can see all of the spans from a specific trace.
[Screenshot: Appdash trace table, 2016-05-20]

LightStep

Since a trace in a 3-node cluster is pretty big, I couldn't fit it all in a screenshot, so I saved the page as a PDF, which you can view here: https://drive.google.com/file/d/0BxWuGMMFc4K0RDZVcGYzUWhkNkk/view?usp=sharing
The colors aren't printing correctly, but any time there is an encode->decode span pair, the message is being decoded on a different node.
cc @philips

Review thread on the diff in startEtcdOrProxyV2:

```diff
@@ -108,6 +113,27 @@ func startEtcdOrProxyV2() {
 		plog.Warningf("no data-dir provided, using default data-dir ./%s", cfg.Dir)
 	}

+	// The global tracer is noop by default.
+	if cfg.TracingLightstepToken != "" {
```
bg451 (Author) commented on this diff:

This is, in my opinion, one of the most important parts of OpenTracing: you don't have to modify the entire codebase to use a different tracing backend. It's a quick change in the configuration and everything else stays the same.
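For illustration, a minimal sketch of what that config-driven switch might look like (setupTracer is a hypothetical helper; the LightStep constructor follows the lightstep-tracer-go API, and the token field name comes from the diff above):

```go
package main

import (
	lightstep "github.com/lightstep/lightstep-tracer-go"
	opentracing "github.com/opentracing/opentracing-go"
)

// setupTracer installs a LightStep tracer only when a token is configured;
// otherwise the global tracer stays the default no-op implementation.
func setupTracer(lightstepToken string) {
	if lightstepToken == "" {
		return
	}
	opentracing.SetGlobalTracer(lightstep.NewTracer(lightstep.Options{
		AccessToken: lightstepToken,
	}))
}
```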

xiang90 commented May 20, 2016

@bg451 I have difficulty understanding the trace from Google Drive. Is there a node identifier? It seems the events happened on different nodes?

heyitsanthony (Contributor) commented:
If this is going in, I think all the clientv3 and api/v3rpc additions should be interface wrappers similar to etcdserver/api/v3rpc/quota.go so the instrumentation calls aren't baked into the core implementation.
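To make that concrete, here is a rough sketch of the wrapper style being suggested (modeled on how quota.go decorates pb.KVServer; the type name, operation name, and import path are illustrative assumptions, not code from this PR):

```go
package v3rpc

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
)

// tracingKVServer decorates an existing KVServer so that tracing calls
// live in the wrapper rather than in the core implementation.
type tracingKVServer struct {
	pb.KVServer // embedded: untraced methods are promoted unchanged
}

// NewTracingKVServer wraps a KVServer with tracing instrumentation.
func NewTracingKVServer(kv pb.KVServer) pb.KVServer {
	return &tracingKVServer{kv}
}

func (s *tracingKVServer) Put(ctx context.Context, r *pb.PutRequest) (*pb.PutResponse, error) {
	// Start a child span from any trace context already in ctx,
	// then delegate to the real implementation.
	span, ctx := opentracing.StartSpanFromContext(ctx, "etcdserver.kv.put")
	defer span.Finish()
	return s.KVServer.Put(ctx, r)
}
```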

bg451 commented May 20, 2016

@xiang90 Hopefully this is better: the different nodes are in different colors; light blue is the leader in this case and dark blue is a client program (example.pdf). Node identifiers would be added as a span tag; in general, span names should be of low cardinality.
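For example, something like the following (the operation name, tag key, and member ID are illustrative, not from the PR):

```go
package main

import opentracing "github.com/opentracing/opentracing-go"

func main() {
	// Keep the operation name low-cardinality; put per-node details,
	// such as the member ID, in span tags instead.
	span := opentracing.StartSpan("etcdserver.apply")
	span.SetTag("etcd.member.id", "8e9e05c52164694d") // hypothetical node identifier
	defer span.Finish()
}
```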

xiang90 commented May 21, 2016

@bg451 I like the idea of distributed tracing. But I am also wondering if we really need this fine-grained tracing inside the etcd boundary. We already have a pretty good idea of what is happening inside etcd from a bunch of metrics. How about just tracing when a request comes into etcd and when a response goes out of etcd?

bhs commented May 21, 2016

@xiang90 the value of tracing like this is probably as much for etcd users as it is for etcd core contributors; those who aren't familiar with the codebase will be able to gain an intuitive understanding of why operations are slow (or fast) by examining the path that requests take through the etcd cluster. We saw this pattern over and over again at Google (using Dapper) with users of Bigtable, Chubby, etc.

bhs commented May 21, 2016

One other thing: "metrics" in the sense of timeseries counters/gauges do not tell the story of how individual requests propagate through a distributed system. They are valuable but complementary to tracing (which is probably more comparable to logging than it is to timeseries monitoring).

bhs commented Jun 16, 2016

@xiang90 I've looked at some of these traces live and they are quite useful... timing diagrams, of course, but also wire payloads showing how specific transactions proceed through the etcd/raft workflows.

IMO it would be great to unblock this PR – it adds a lot of value at very little cost.

xiang90 commented Jun 16, 2016

@bensigelman Sorry for the delay. We are trying to release etcd 3.0 at the end of this month. Then we will have time to look into this in more depth. I would LOVE to play with distributed tracing for sure. Just need to figure out the best way for integration.

bhs commented Jun 16, 2016

Got it, thanks – good luck with the release!

bg451 closed this Nov 11, 2016
aybabtme commented Jan 26, 2018

@xiang90 @bg451 I know that this is pretty old, but it's the only reference to OpenTracing that I've seen in the project. What would the take be today on adding tracing to etcd?

gyuho commented Jan 26, 2018

Tracing will certainly be useful, but we don't have bandwidth to look into this right now. If we do, other tracing libraries should be evaluated as well.

aybabtme (Contributor) commented:
@gyuho do you have other libraries in mind besides OpenTracing? Or did you mean the LightStep integration?

gyuho commented Jan 26, 2018

@aybabtme Haven't heard of LightStep :)

https://www.cncf.io/projects/

There are OpenTracing, Jaeger, and OpenCensus (https://github.com/census-instrumentation/opencensus-go).

We may evaluate these in the next release cycle (3.5).

bhs commented Jan 27, 2018

@gyuho Jaeger's instrumentation library is OpenTracing. (OpenTracing is a project specifically designed to provide a single instrumentation API that couples to a variety of tracing systems: Zipkin, Jaeger, various commercial vendors, etc.)

gyuho commented Jan 27, 2018

@bhs Thanks! Just created a separate issue to track this #9242.
