
Analyze Dump Segments Containing Sketches #9806

Closed
leerho opened this issue May 2, 2020 · 8 comments
Comments

leerho (Contributor) commented May 2, 2020

Description

Debugging Segments with sketches

Motivation

Debugging Segments with sketches can be challenging as currently there is no simple way to extract the sketch binaries from the segment.

This idea was motivated by Issue #9736 where the segment with the sketches that were causing problems was provided to the DataSketches team, but without familiarity with the segment structure schema, it was all but impossible to analyze. @gianm supported the idea of creating this issue for discussion.

The dump-segment tool was partially helpful, but it presents the sketches as a column of Base64 encodings in JSON format. In order to understand the internal state of each sketch, however, it needs to be in binary form. Once the sketches are in binary form, a number of tools in the DataSketches library can be used to analyze their exact state.

Sketches, by their nature, are complex state machines. Providing a tool like this will significantly enhance our ability to debug sketching problems in the future.

leerho (Contributor, Author) commented May 2, 2020

I have already developed a simple prototype that works. What I am figuring out now is exactly what form it should take: for example, a separate tool, or an extended class of the current DST.

This class will have dependencies on the DataSketches library. I also need to figure out the user options it would present. I will report back to this issue as I make progress and have questions for those with more experience with segments in Druid.

Please feel free to add your suggestions and comments.

gianm (Contributor) commented May 2, 2020

Is the binary form you need just the base64 decoding of what the dump-segment tool dumps? Or is it something different that requires DataSketches code in order to generate?

If it's the latter, perhaps it would make sense to add an option to the DumpSegment tool that dumps data in a format that could be specified by the extension that enables that particular column type. It could involve the ComplexMetricSerde impl somehow.

leerho (Contributor, Author) commented May 2, 2020

@gianm

Yes and Yes. I can envision a couple of ways to do this.

The first and simplest option would be an added feature of the DST that, given a column, would dump the data contents of each aggregation item as a separate binary file in some user-specified output directory. This has the advantage that the DST does not need to know any specifics of the aggregation item. This was the first thing I tried by just decoding the base64 rows from the DST output and spitting out a bin file for each one. In the segment I was given, it produced 47 binary files. It has the disadvantage that it doesn't communicate anything about the internal state of those binary files, which in our case represent sketches.
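The decode-and-split step described above can be sketched in plain Java. The class and method names here are hypothetical (this is not part of the DST); only `java.util.Base64` and `java.nio.file` are used, and the input is assumed to be one Base64 value per line, extracted from the DST's JSON output beforehand:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.List;

public class SketchDumpSplitter {

    // Decode one Base64 value (as emitted in the dump-segment JSON output)
    // back into the sketch's raw binary image.
    public static byte[] decodeSketch(String base64) {
        return Base64.getDecoder().decode(base64.trim());
    }

    // Write each decoded sketch as sketch_<N>.bin in the output directory,
    // returning the number of files written.
    public static int writeBinFiles(List<String> base64Lines, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int n = 0;
        for (String line : base64Lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            Files.write(outDir.resolve("sketch_" + n + ".bin"), decodeSketch(line));
            n++;
        }
        return n;
    }
}
```

For a segment like the one described, this would produce one `.bin` file per aggregation item (47 in that case), ready for the DataSketches analysis tools.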

The second option would add functionality, somewhere, that utilizes the DataSketches library to read those binary files and produce human-readable output for each one. The following is the output for the 6th of the 47 sketches in the segment:

HllSketch 6

### HLL SKETCH PREAMBLE:
Byte 0: Preamble Ints         : 10
Byte 1: SerVer                : 1
Byte 2: Family                : HLL
Byte 3: lgK                   : 14
Byte 4: LgArr or Aux LgArr    : 0
Byte 5: Flags:                : 00001000, 8
  BIG_ENDIAN_STORAGE          : false
  (Native Byte Order)         : LITTLE_ENDIAN
  READ_ONLY                   : false
  EMPTY                       : false
  COMPACT                     : true
  OUT_OF_ORDER                : false
  REBUILD_KXQ                 : false
Byte 6: Cur Min               : 0
Byte 7: Mode                  : HLL, HLL_4
HIP Accum                     : 6562.656349859294
KxQ0                          : 12685.76611328125
KxQ1                          : 0.0
Num At Cur Min                : 10988
Aux Count                     : 0
### END HLL SKETCH PREAMBLE

### HLL SKETCH SUMMARY: 
  Log Config K   : 14
  Hll Target     : HLL_4
  Current Mode   : HLL
  Memory         : false
  LB             : 6520.24649606101
  Estimate       : 6562.656349859294
  UB             : 6605.621511177209
  OutOfOrder Flag: false
  CurMin         : 0
  NumAtCurMin    : 10988
  HipAccum       : 6562.656349859294
  KxQ0           : 12685.76611328125
  KxQ1           : 0.0
  Rebuild KxQ Flg: false

However, for the Druid use case where we could be dealing with potentially thousands of sketches in a large segment, reformatting the above information into a summary matrix, where each sketch is a row, would provide a more compact visual presentation and also allow easier visual detection of anomalous states of the sketches.
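The summary matrix envisioned above could be formatted along these lines. The column set (lgK, mode, estimate, numAtCurMin) echoes the preamble fields shown earlier, but the class name, method names, and layout are illustrative, not part of any existing tool:

```java
import java.util.Locale;

public class SketchSummaryMatrix {

    // Fixed-width header matching the row layout below.
    public static String header() {
        return String.format(Locale.ROOT, "%5s %4s %-6s %16s %12s",
            "#", "lgK", "mode", "estimate", "numAtCurMin");
    }

    // Format one sketch's key fields as a single fixed-width row, so that
    // thousands of sketches can be scanned visually for anomalous states.
    public static String row(int idx, int lgK, String mode, double estimate, int numAtCurMin) {
        return String.format(Locale.ROOT, "%5d %4d %-6s %16.4f %12d",
            idx, lgK, mode, estimate, numAtCurMin);
    }
}
```

For the sample sketch above, `row(6, 14, "HLL_4", 6562.656349859294, 10988)` yields one aligned line, making outliers in estimate or numAtCurMin easy to spot across rows.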

I could easily see this as a separate tool that just processes all the files produced by the DST in the 1st option. A separate independent tool has the advantage of not polluting the dependencies of the DST with dependencies on the DataSketches Library.

A tighter integration of these two options would allow the ability to output just the human-readable summary matrix without having to output all the individual files. This would be a nice feature, but not a critical one.

My vote would be to just change the DST to allow outputting individual files for each aggregation item of a column. And then I could provide the tool that allows the analysis of those files for the sketches case.

leerho (Contributor, Author) commented May 2, 2020

There is a third option, which is what I did in my prototype.

My prototype takes the base64 column output from the DST (no changes to the DST required) and has the option of either producing the individual files and/or producing the human-readable summaries above.

Lee.

gianm (Contributor) commented May 8, 2020

The first and simplest option would be an added feature of the DST that, given a column, would dump the data contents of each aggregation item as a separate binary file in some user-specified output directory. This has the advantage that the DST does not need to know any specifics of the aggregation item.

For this I would caution that the dump-segment tool actually does not have access to the sketch images from the column! It only seems like it does. What it actually has access to is the object that the relevant ComplexMetricSerde deserializes the column into; it then takes that object and serializes it using Jackson.

In the case of DataSketches HLL, the HllSketchMergeComplexMetricSerde returns an HllSketch object. And that is serialized to Jackson using the HllSketchJsonSerializer that is part of the Druid DataSketches extension, which does the following.

    jgen.writeBinary(sketch.toCompactByteArray());

And what writeBinary does is write a Base64 string. So, that's why you get the behavior you get.
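This round trip can be mirrored without Jackson. The sketch below assumes Jackson's default Base64 variant (which uses the standard alphabet, matching `java.util.Base64`); the class and method names are hypothetical and for illustration only:

```java
import java.util.Arrays;
import java.util.Base64;

public class WriteBinaryDemo {

    // Jackson's JsonGenerator.writeBinary(byte[]) emits the bytes as a
    // Base64 string in the JSON output; this mirrors that encoding step
    // with plain java.util.Base64 (no Jackson dependency).
    public static String encodeAsJacksonWould(byte[] sketchImage) {
        return Base64.getEncoder().encodeToString(sketchImage);
    }

    // Decoding the dumped string recovers the exact sketch image bytes,
    // which is why splitting the dump back into binary files is lossless.
    public static byte[] decodeBack(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) {
        byte[] image = {10, 1, 7, 14};
        String dumped = encodeAsJacksonWould(image);
        System.out.println(dumped + " round-trips: "
            + Arrays.equals(decodeBack(dumped), image));
    }
}
```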

With regard to taking this output and splitting it up into multiple files, my first thought is we don't need the dump-segment tool to do that, since you could do it by composing two tools in the Unix style. (dump-segment to generate a dump of the entire segment, jq to extract line-by-line output corresponding to each sketch, and split to split into multiple files). I suppose it wouldn't hurt to add an option to dump-segment that does this, but the Unix way is more flexible.

With regard to dumping interesting "human readable" stuff, that actually seems very useful as a new option to dump-segment. Perhaps we need a way for ComplexMetricSerde implementations to return human-readable debugging output for a particular object that they understand, and the dump-segment option would use this functionality. Or perhaps we adopt a convention that calling toString() should do this, and then we add an option to dump-segment to dump using toString() instead of the Jackson serializer.

Any thoughts on the above @leerho?

leerho (Contributor, Author) commented May 8, 2020

@gianm Thanks for the tutorial and insights. I have been trying to grok the Dump Segment tool and it is a challenge, as it has scores of dependencies on Druid internals that I don't understand and really don't want to mess with.

The query/aggregation/datasketches code, on the other hand, has several advantages. Each sketch is already separated into its own aggregator, and the aggregator code, by definition, has intimate knowledge of a particular sketch. Adding a debug-output method to the ComplexMetricSerde (CMS) makes a whole lot of sense. Then, as you say, the Dump Segment tool could rely on that.

It turns out that all of our sketches already have "toString(...)" methods that return human-readable summaries that we use for debugging. Depending on the sketch, some of these toString() methods require additional parameters for different output options, and some do not. But the CMS impl would know how to call these diagnostic methods for each sketch. The output would appear like the "HllSketch 6" sample above. It would not be in a neat matrix form like I was envisioning, but, hey, it is more than we have now, and it involves only minor changes to the CMS impls and a minor change to the Dump Segment tool, and that's it! It would be useful for the dump tool to spit out a row number before it calls the CMS toString() method.

If someone on the Druid team could make the change in CMS and in the dump tool, then either Alex or I could add the toString() methods to the CMS impls in query/aggregation/datasketches.

This, of course, ripples through all of your CMS implementations and not just DataSketches, so someone would have to work on that; but the default could be to print "no debug info available" for complex metrics that don't have diagnostic methods like we do.

Lee.
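A minimal sketch of the toString()-based convention discussed in the two comments above, with the "no debug info available" default suggested for complex metrics that lack diagnostic methods. The interface and class names here are hypothetical, not actual Druid APIs:

```java
public class DebugStringConvention {

    // Hypothetical serde-side debug hook: the default covers complex
    // metric types that have no diagnostic output.
    public interface ComplexMetricDebugger {
        default String debugString(Object complexValue) {
            return "no debug info available";
        }
    }

    // A sketch-aware impl would override the default and delegate to the
    // object's diagnostic toString() (possibly with sketch-specific options).
    public static class ToStringDebugger implements ComplexMetricDebugger {
        @Override
        public String debugString(Object complexValue) {
            return complexValue == null ? "null" : complexValue.toString();
        }
    }
}
```

The dump tool would then only need to prefix each call with a row number; everything sketch-specific stays behind the interface.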

github-actions bot commented Aug 7, 2023

This issue has been marked as stale due to 280 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If this issue is still
relevant, please simply write any comment. Even if closed, you can still revive the
issue at any time or discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 7, 2023
github-actions bot commented Sep 5, 2023

This issue has been closed due to lack of activity. If you think that
is incorrect, or the issue requires additional review, you can revive the issue at
any time.

@github-actions github-actions bot closed this as not planned Sep 5, 2023