
Generate a local UUID for requests with a parent UUID for debug purposes #60

Closed

pmvilaca opened this issue Jan 7, 2014 · 35 comments

@pmvilaca

pmvilaca commented Jan 7, 2014

At the moment, if service A needs to make 2 independent calls to service B, then according to the best practices service A should pass the same UUID to service B.

That is useful for identifying a global transaction, but isn't so useful if we need to identify a single request. We can track individual requests by looking at the logs, timestamps, etc., but that becomes a problem if we have alerts based on that field.

@eswdd
Contributor

eswdd commented Jan 7, 2014

There was a suggestion a year ago that we derive new request uuids from passed-in ones, such that they become unique but it is still possible to search on the parent. The intention was to append to the string of the parent uuid, but that could conceivably be considered an implementation choice for this request.

@pmvilaca
Author

pmvilaca commented Jan 7, 2014

I think it's better to generate a new uuid and log it as a new field instead of appending something to the original uuid. Appending would force us to apply a regex to extract the original uuid.


@eswdd
Contributor

eswdd commented Jan 8, 2014

Won't you have to use regex or some other configuration to extract the uuid anyway?

If we introduce a separate uuid, then bear in mind that it will only be logged in one place alongside the parent (likely the request log, since it's common to all transports). From the PoV of something like Zipkin, which is automagically stitching things together and grabbing all available data, this is simple; a human, however, will always need to join across this single log to trace something, making scripts more complex.

The appending mechanism is actually quite an elegant solution since it helps in a number of styles of usage:

  • Want to trace what requests a service/client made because it was slow: just grep for the parent uuid in all logs
  • Want to find the source of a request from the access log: just decompose the uuid and grep in the logs to find the parent request.

It's also not dissimilar to a mechanism we have for tracing events through the system - perhaps we should consider joining them. If we change the appending style from what I'd originally considered (just appending -1, -2, -3 to the current uuid when making a request) to including the hostname like we do for new uuids, then we know the exact location of the source (at least where it's something sharing the uuid scheme - and if it's not, it will at least look different).

@pmvilaca
Author

pmvilaca commented Jan 8, 2014

Fair enough. But I don't know how we could append the -1, -2, ... on the client side. Most of the time, there are requests to the same endpoints in parallel.

@eswdd
Contributor

eswdd commented Jan 8, 2014

And yet they all get passed the same RequestUUID - I see 2 choices here:

  1. Add a method to the request uuid to get a child one which uses an internal atomic integer or long
  2. Make the code create the sub-Requests prior to passing in to the cougar client - more code complexity but more control maybe?

@pmvilaca
Author

pmvilaca commented Jan 8, 2014

  1. We can do that in the code that sends the requests to other services (the cougar client), assuming the original context is being used for those requests, but that isn't valid for requests from the sites (I'm talking about Mantis) - those requests create ExecutionContext instances using the already-generated UUID. Just to clarify: with this option, you are suggesting that we invoke a method before sending the requests just to increment the number of the request (-1, -2, ...), right?
  2. I'm not sure what you are trying to say. Could you elaborate?


@eswdd
Contributor

eswdd commented Jan 9, 2014

I think you got the gist of it in 1. The 2 choices here weren't really choices on re-reading, just different parts of the same thing.

I do think we should follow my suggestion to add the host as well as the -1, -2, but only if the uuid that we're creating from was not generated by this host (because there wasn't one on the request) - see the sketch after the examples below.

e.g. I'm making a new request on host abc001:

  • if i received a request with no uuid on it then the uuid will be of the form abc001-xxxxxxxx-yyyy in which case new uuids should be of the form abc001-xxxxxxxx-yyyy-1, abc001-xxxxxxxx-yyyy-2 etc
  • if i received a request with a uuid (e.g. xyz001-xxxxxxx-yyyy), then I should also append my host: xyz001-xxxxxxx-yyyy-abc001-1, xyz001-xxxxxxx-yyyy-abc001-2
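A minimal sketch of that derivation, assuming a class along these lines (ChainedRequestUuid and childUuid are illustrative names, not Cougar's actual RequestUUID API):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only - not Cougar's actual RequestUUID implementation.
public final class ChainedRequestUuid {
    private final String uuid;              // e.g. "xyz001-xxxxxxxx-yyyy"
    private final String localHost;         // e.g. "abc001"
    private final boolean locallyGenerated; // true if no uuid arrived on the request
    private final AtomicLong childCounter = new AtomicLong();

    public ChainedRequestUuid(String uuid, String localHost, boolean locallyGenerated) {
        this.uuid = uuid;
        this.localHost = localHost;
        this.locallyGenerated = locallyGenerated;
    }

    /** Derives a unique child uuid, inserting our host when the parent came from elsewhere. */
    public String childUuid() {
        long n = childCounter.incrementAndGet();
        return locallyGenerated
                ? uuid + "-" + n                    // abc001-xxxxxxxx-yyyy-1, -2, ...
                : uuid + "-" + localHost + "-" + n; // xyz001-xxxxxxxx-yyyy-abc001-1, ...
    }
}
```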

@pmvilaca
Author

pmvilaca commented Jan 9, 2014

Sounds like a plan.. :)
+1

@eswdd
Contributor

eswdd commented Jun 25, 2014

So I started looking at this this morning and there are some problems:

  1. Any change in what we send as the uuid to legacy servers is going to hit the regexp limiting uuids to 20-50 characters, which is irritating.
  2. If we keep chaining info together, we're going to hit massive uuids within a few hops, so at some point we need to set a sensible limit.

Suggested solutions, respectively:

  1. We can add some settings to our various clients which allow us to disable sending the root and parent uuid components, set until we know that servers have been upgraded.
  2. We can meet the various requirements by having a triple-component uuid, namely "root:parent:mine", which then allows us to set a sensible limit of (say) 150 characters.
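A rough sketch of how the triple might look (names and details are assumptions, not the eventual implementation):

```java
// Illustrative sketch of the "root:parent:mine" triple - not the actual implementation.
public final class CompoundUuid {
    private static final int MAX_LENGTH = 150; // the suggested sensible limit

    private final String root;
    private final String parent;
    private final String local;

    public CompoundUuid(String root, String parent, String local) {
        this.root = root;
        this.parent = parent;
        this.local = local;
        if (toString().length() > MAX_LENGTH) {
            throw new IllegalArgumentException("compound uuid exceeds " + MAX_LENGTH + " chars");
        }
    }

    /** For an outgoing sub-request: my local id becomes the parent; the root is kept. */
    public CompoundUuid child(String newLocalUuid) {
        return new CompoundUuid(root, local, newLocalUuid);
    }

    /** Wire/log form; the root and parent components can be suppressed for legacy servers. */
    @Override
    public String toString() {
        return root + ":" + parent + ":" + local;
    }
}
```

Because only the root and the immediate parent are retained, the length stays bounded no matter how many hops the transaction makes.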

@eswdd eswdd added this to the Release 3.2 Candidates milestone Jun 27, 2014
@eswdd
Contributor

eswdd commented Jul 2, 2014

I wonder if the extended uuid should be sent in a new header alongside the legacy one, so that we don't have to have a config option - I will have to consider the impact on the binary client.

@pmvilaca
Author

pmvilaca commented Jul 2, 2014

Thinking about the 2nd suggested solution: in fact, it's what we need for the zipkin implementation. If we decide to follow that approach, half of the things needed for the zipkin integration will be implemented.

Just to clarify the origin of this issue: I think it was created after an internal discussion at Betfair to define whether "Service A" should pass the same UUID to "Service B" when it needs to invoke service B multiple times in the same transaction. The team responsible for "Service B" doesn't want the same UUID to be used, because there is a reporting tool based on the logs that assumes one UUID represents only one transaction.

So maybe we can implement this issue, solve the debug problem, and at the same time make some progress towards zipkin adoption in cougar. What do you think?


@eswdd
Contributor

eswdd commented Jul 2, 2014

That is precisely why I'm working on this now ;) Also hence the comments regarding the need for ExecutionContext changes for Zipkin integration. The only thing that seems a little rough around the edges is how to ensure the client doesn't send out the same uuid for multiple calls - at the moment it's up to the developer to pass a sub-uuid to the client when constructing the execution context.

@pmvilaca
Author

pmvilaca commented Jul 2, 2014

I think we can delegate the generation of the sub-uuid to the cougar client. That way, the developer passes an ExecutionContext with the root uuid and the current uuid, and every time it makes a call to another cougar service using the cougar client, the client will pick up the 2 original values and add a new one.

Then, when a cougar service receives a call with the 3 values (root:parent:mine), it knows it's receiving a call that belongs to an ongoing span.

Using this approach, the UUID that should be logged to represent the current implementation is the root uuid, and we can use the other fields to correlate the calls between the services and also for zipkin.

This is a little bit difficult to explain by email, but I hope you can understand my idea :)
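In code, the delegation might look something like this (building on the hypothetical CompoundUuid sketch above; the accessor and generator names are assumptions):

```java
// Hypothetical client-side derivation - all names here are illustrative.
CompoundUuid incoming = ctx.getCompoundUuid();               // root:parent:mine from the inbound request
CompoundUuid outgoing = incoming.child(generateLocalUuid()); // the cougar client derives a fresh local id
ExecutionContext childCtx = ctx.withCompoundUuid(outgoing);  // each outbound call carries its own "mine"
```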


@eswdd
Contributor

eswdd commented Jul 2, 2014

All understood - it's exactly what I was thinking, except I think we want to log the full compound uuid in all cases; the zipkin log parsing can extract the bits it needs from that field, and humans still get to see the full uuid (it also means no breaking change to the log format).

With regard to delegating to the client, it seems the preferred approach; I only wonder if this could cause problems where existing apps already pass in a newly generated request uuid. Perhaps we can detect and deal with this - I will think some more on the situations which could be problematic.

@pmvilaca
Author

pmvilaca commented Jul 2, 2014

When I was investigating how we could integrate zipkin in cougar, I wasn't planning to use log files because I don't think we really need them. At least at Betfair, they're using 0MQ to emit metrics to their own metrics aggregator system (something like statsd). So I was planning to emit zipkin events to 0MQ and then use that system to pick up those events and send them to the zipkin collector.

Regarding your concern - are you talking about the cases where developers create a new ExecutionContext object and set its UUID themselves? Sorry if that doesn't make sense, but my memory isn't fresh enough to remember those "strange" situations. Could you post some code snippets (if you have some)?

@eswdd
Contributor

eswdd commented Jul 2, 2014

The metrics aggregator system is called statse, and whilst a client has been released and integrated into Cougar via Tornjak, it's now 8 months down the line and the server has not been released - hence the feature raised to integrate statsd into Tornjak.

My concern with using 0MQ to integrate Cougar with Zipkin would be whether the other end of 0MQ (the collection daemon) would also be released in any sensible timescale. So it may well be that #59 is the shared abstraction and the Zipkin integration for Cougar uses something off the shelf and already available (for example Brave's Zipkin span collector). It rather depends on Betfair's capacity/appetite for open sourcing more components.

I would still emit the full uuid in the log files to support the other usecases surrounding uuids.

Regarding my concern - it's rubbish, I realised that now, so ignore me on that.

@atropineal
Contributor

If by chance you're talking about Cougar applications that receive a request with one UUID and then contact other services using a totally new RequestUUID, then I've seen that happening many times. In my experience of reviewing Cougar applications it's actually the norm. It makes troubleshooting across apps a real pain. :-)


@eswdd
Contributor

eswdd commented Jul 2, 2014

Ah, I'd rather assumed it wasn't. I think with the changes we're suggesting that would be no worse, and if tracing across distributed calls is desirable then perhaps service owners could be persuaded to fix their apps - especially if all they need to do is pass the RequestUUID from the EC passed into the service code on into the client calls. I suspect in most cases people simply don't realise the benefits they can achieve by just passing this item on.

@richardqd
Contributor

I'd have thought people were self-motivated to fix this, given the additional grief it adds to their support burden!


@atropineal
Contributor

I think, in the applications I've seen, it was often the case that by the time one service impl was calling out to another, the initial RequestContext had been forgotten - lost in that popular layer which translates between Cougar and an internal domain. Folks, perhaps forgivably (especially if they have no experience of troubleshooting across services), forget about passing on the stuff in the RC, like the UUID, unless they simply cannot fulfil their functionality without it (e.g. a user or cert name for audit or whatever).

Another case I seem to remember was where a popular internal service distributed their own library to client services, but that client library did not expose any way to set the outgoing request's UUID. The library just exposed a plain interface that knew nothing about Cougar at all, and the necessary Cougar stuff was done 'behind the scenes'.


@eswdd
Contributor

eswdd commented Jul 2, 2014

The distributed client library of course comes with its own issues, like generated-code incompatibility across different Cougar versions, but the first is the situation I had originally imagined.

@eswdd
Contributor

eswdd commented Jul 2, 2014

Can anyone think of a better mechanism for ensuring backwards compatibility (new clients talking to old servers) when introducing this? So far I have 2 ideas, I think I prefer the second:

  1. Add a new header X-UUID2 and send the old format in X-UUID and the new in X-UUID2
  2. Break out the root:parent component and put in X-UUID-Parents (or similar) and send the local component in X-UUID
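For illustration, the two options might look like this on the wire (the header names other than X-UUID are the suggestions above, not an agreed format):

```
Option 1 - parallel headers:
X-UUID:  abc001-xxxxxxxx-yyyy                (legacy format)
X-UUID2: root:parent:abc001-xxxxxxxx-yyyy    (new compound format)

Option 2 - split components:
X-UUID:         abc001-xxxxxxxx-yyyy         (local component only)
X-UUID-Parents: rootuuid:parentuuid          (compound prefix)
```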

@atropineal
Contributor

I've just realized that Pedro earlier said:

"this issue.. was created after an internal discussion in betfair to define if the "Service A" should pass the same UUID to the "Service B" if it needs to invoke the service B multiple times in the same transaction. And the team that is responsible for the "Service B" doesn't want that the same UUID is used because there is a reporting tool based on the logs that assumes that one UUID represents only one transaction".

That was me representing service B. I definitely did not want multiple requests coming in with the same UUID. It sort of violated (for me at least!) the concept of a 'request identifier', since I then wouldn't be able to identify a specific actual request with it any more, only a logical one that could be made up of any number of actual requests. From my perspective this negatively impacted several non-functionals: troubleshooting, tooling, log mining... I encouraged them to adopt a UUID-suffix style approach ('-1', '-2' or whatever), especially since in this case the multiple requests were serial in nature.

I'd even be in favor of an inbuilt mechanism that actively tries to prevent duplicate request UUIDs, since somebody can 'break me' (non-functionally, in the ways above) quite easily without really knowing the impact. For example, a generated Cougar client could remember UUIDs going to the target service within a time- and/or size-limited window and reject duplicates (see the sketch below)? I'm just brainstorming here.

Another thing I'm not keen on regarding UUIDs is that people can use whatever they like. I massively appreciate UUIDs having the abbreviated client node name as a prefix; it really is a boon for investigation speed. But unfortunately some clients decide for whatever reason that this isn't for them, and they'll transmit some custom gibberish. Or they'll transmit something with a prefix that looks like a node name but isn't. I'm in favor of making sure all outgoing request UUIDs are standardized, with this prefix being part of the standard.

Simon, idea 2 sounds better to me because it doesn't create an 'old way vs new way' type situation, it just augments what is there. No 'OK, now we do everything 2 ways until we migrate totally' type affair.
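A back-of-the-envelope sketch of that size-limited window (purely illustrative - nothing like this exists in Cougar today):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a size-bounded LRU window that flags duplicate request uuids.
public final class DuplicateUuidGuard {
    private static final int WINDOW_SIZE = 10000;

    private final Map<String, Boolean> recentUuids =
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > WINDOW_SIZE; // evict the oldest once the window is full
                }
            };

    /** Returns true if this uuid has already been seen within the window. */
    public synchronized boolean isDuplicate(String uuid) {
        return recentUuids.put(uuid, Boolean.TRUE) != null;
    }
}
```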


@eswdd
Contributor

eswdd commented Jul 2, 2014

I'd wondered about client 'memory' to try to resolve that issue, although if we change the clients so they always generate a sub-uuid this wouldn't have much effect. It also doesn't deal with clients which aren't using cougar-generated clients.

The UUID naming I totally agree on, although it is hampered by the ability to plug in a custom generator, which seems valid if you want some form of service identifier but your hostname is not sufficient. I wonder if it's valid for a server to reject a client uuid, or regenerate part of it, if it disagrees with the formatting? Probably not - this sounds like a governance issue to me, although I definitely agree with the sentiment.

@andredasilvapinto
Contributor

From a Zipkin point of view, considering it follows Google's Dapper paper, we will need to support not 2 but 3 data fields (headers in the case of HTTP):

  • TraceId
  • SpanId
  • ParentSpanId (when applicable)

These fields should all be probabilistically unique 64-bit integers (i.e. longs), which would conflict with the current X-UUID format. This would be especially problematic if we use something like Brave (as mentioned above), as its API expects longs; and depending on whether the Zipkin collector/query/web components enforce this format constraint, it might even be impossible to use any other format. I think using hash functions to convert the UUID string format to a long would imply unnecessary work per request/span and defeat the purpose of using the entire 64-bit range.

I would suggest we add those 3 new fields to the transports while keeping the current X-UUID field format for backward compatibility.
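For reference, Zipkin's own HTTP propagation (the 'B3' headers) carries exactly these three ids as hex-encoded values, along the lines of:

```
X-B3-TraceId:      463ac35c9f6413ad
X-B3-SpanId:       a2fb4a1d1a96d312
X-B3-ParentSpanId: 0020000000000001    (absent for a root span)
```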

@eswdd
Contributor

eswdd commented Jul 3, 2014

Is there a suggested mechanism that makes them probabilistically unique enough?

It seems odd to restrict the ids to longs when it's relatively easy to generate GUIDs if you allow strings. I'd almost argue that's rather a large failing, but I'm sure they have their reasons.

Frustratingly, I'd rather assumed strings were OK - I should have read the docs better. Given the amount of work we do everywhere else, hashing might not be unreasonable, and it would allow dual use of the X-UUID data.


@andredasilvapinto
Contributor

On Dapper's paper? No.

I'm using ThreadLocalRandom.nextLong(Long.MIN_VALUE, Long.MAX_VALUE) (Java 7+) in Mantis in order to avoid contention. Pedro Vilaça also suggested using UUID.getMostSignificantBits(), although I think that would probably be a little more expensive (I haven't tested it, though).

@eswdd
Contributor

eswdd commented Jul 3, 2014

Can it handle clashes if they do occur? I can't find any reference in the docs.

@andredasilvapinto
Contributor

Are you referring to clashes of Zipkin ids? I don't think they do any magic here. If you have duplicated trace ids - very unlikely with a uniform probabilistic function over a 64-bit range: a 1/2^64 ≈ 5.4210109e-20 probability - then it will probably just overwrite the old one:

We generate a random trace ID to use (a 64-bit int in hex) which we can do because it's very unlikely that we'll have trace ID collisions (and it's not a big deal if we do -- some other trace will get overwritten).

From a browser extension plugin but still on the twitter/zipkin repo:
https://github.com/twitter/zipkin/blob/master/zipkin-browser-extension/firefox/CONTRIBUTING.md

@eswdd
Contributor

eswdd commented Jul 3, 2014

OK, cool, thanks. I'm going to explore hashing a little to see if there are any legs there in terms of low duplicate rates and full use of the 64-bit spectrum; otherwise I'll consider fleshing out the isTracingEnabled flag on ExecutionContext into a TracingInfo interface which contains isTracingEnabled and enables tracing plugins to hook in the data they need.

@eswdd
Contributor

eswdd commented Jul 4, 2014

So I was thinking on this some more overnight...

This feature request stands on its own if you ignore the Zipkin/tracing requirements - it's useful purely for debugging - so we should continue to make the changes on the basis I was already working to.

It would be nice to be able to leverage this for Zipkin/tracing, but it's not mandatory. If Zipkin needs the same info in a different format, and is unable to derive it from this info (i.e. hashing doesn't work), then I think it's reasonable to state that the Zipkin ids are also request uuids, just in a different format, and thus it's reasonable for the Zipkin integration to provide a RequestUUID impl which provides the standard Cougar uuids as well as those required for Zipkin. Then we do not need changes to the EC interface (which, as @richardqd says, is painful) and we only need to provide the tracing hooks in the client and transports.

This also suggests that the header name I proposed is sensible, and if we need extra info for tracing we can use an appropriate separate header such as X-Trace-Info.
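A minimal sketch of such an impl (all names here are illustrative; Cougar's real RequestUUID interface has its own methods, which this omits):

```java
// Illustrative only: a request uuid impl that carries both the Cougar string
// form and the numeric ids a Zipkin tracer needs.
public final class ZipkinAwareUuid /* implements RequestUUID */ {
    private final String cougarUuid; // standard cougar uuid string, logged as usual
    private final long traceId;      // Zipkin trace id
    private final long spanId;       // Zipkin span id
    private final Long parentSpanId; // null for a root span

    public ZipkinAwareUuid(String cougarUuid, long traceId, long spanId, Long parentSpanId) {
        this.cougarUuid = cougarUuid;
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    public String getCougarUuid()  { return cougarUuid; }
    public long getTraceId()       { return traceId; }
    public long getSpanId()        { return spanId; }
    public Long getParentSpanId()  { return parentSpanId; }
}
```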

@andredasilvapinto
Contributor

Hashing would also imply that every non-Cougar node in the ecosystem would have to be aware of how Cougar's hash function works, as they need to pass and receive the fields to/from Cougar services.

@eswdd
Contributor

eswdd commented Jul 4, 2014

Yes it would. It would have to be a well-published algorithm. I appreciate this would not be desirable in a mixed ecosystem.


@andredasilvapinto
Contributor

ThreadLocalRandom.nextLong(Long.MIN_VALUE, Long.MAX_VALUE) is not an option.

Looking at the method definition, and given that the documentation mentions no restriction, I thought it would allow the entire long range, but after testing and looking at the implementation you can see that it only supports a 32-bit range: the kind of operations it performs internally introduce overflow problems for larger ranges.

I'm now using Vilaça's suggestion:

UUID.randomUUID().get[Least|Most]SignificantBits()
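For illustration, that approach in full (UUID.randomUUID() is backed by SecureRandom; a few bits are fixed by the UUID version/variant, which is usually acceptable for this purpose):

```java
import java.util.UUID;

// Probabilistically unique 64-bit ids drawn from random (version 4) UUIDs.
long traceId = UUID.randomUUID().getMostSignificantBits();
long spanId  = UUID.randomUUID().getLeastSignificantBits();
```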

eswdd added a commit that referenced this issue Jul 10, 2014
…new socket protocol version (5) to ensure backwards compatibility. Includes unit and integration tests following the new style started in #73
@eswdd eswdd modified the milestones: Release 3.2.0, Release 3.2 Candidates Jul 10, 2014
@eswdd
Contributor

eswdd commented Jul 10, 2014

This issue is now complete. Any further discussion regarding zipkin or tracing should occur on the appropriate issues.

@eswdd eswdd closed this as completed Jul 10, 2014