Analyze Dump Segments Containing Sketches #9806
Comments
I have already developed a simple prototype that works. What I am doing now is figuring out exactly what form it could take: for example, as a separate tool or as an extended class of the current DST. This class will have dependencies on the DataSketches library. I also need to figure out the user options it would present. I will report back to this issue as I make progress and have questions for those with more experience with segments in Druid. Please feel free to add your suggestions and comments.
Is the binary form you need just the base64 decoding of what the dump-segment tool dumps? Or is it something different that requires DataSketches code in order to generate? If it's the latter, perhaps it would make sense to add an option to the DumpSegment tool that dumps stuff in a format that could be specified by the extension that enables that particular column type. It could involve the ComplexMetricSerde impl somehow.
Yes and yes. I can envision a couple of ways to do this.

The first and simplest option would be an added feature of the DST that, given a column, would dump the data contents of each aggregation item as a separate binary file in some user-specified output directory. This has the advantage that the DST does not need to know any specifics of the aggregation item. This was the first thing I tried, by just decoding the base64 rows from the DST output and spitting out a bin file for each one. In the segment I was given, it produced 47 binary files. It has the disadvantage that it doesn't communicate anything about the internal state of those binary files, which in our case represent sketches.

The second option would add functionality, somewhere, that utilizes the DataSketches library to read those binary files and produce human-readable output like the following for each one. This is the output from the 6th of the 47 sketches in the segment.
However, for the Druid use case, where we could be dealing with potentially thousands of sketches in a large segment, reformatting the above information into a summary matrix, where each sketch is a row, would provide a more compact visual presentation and also allow easier visual detection of anomalous sketch states. I could easily see this as a separate tool that just processes all the files produced by the DST in the first option. A separate, independent tool has the advantage of not polluting the dependencies of the DST with dependencies on the DataSketches library.

A tighter integration of these two options would make it possible to output just the human-readable summary matrix without having to output all the individual files. This would be a nice feature, but not a critical one. My vote would be to just change the DST to allow outputting individual files for each aggregation item of a column, and then I could provide the tool that allows the analysis of those files for the sketches case.
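To make the summary-matrix idea a little more concrete, here is a rough, hedged sketch of such a separate tool: it reads a directory of per-sketch binary files (as produced by the first option above), heapifies each one with the DataSketches library, and prints one row per sketch. It assumes the files hold HLL sketch images and the org.apache.datasketches packages; other sketch families would need their own heapify calls and columns.

```java
import org.apache.datasketches.hll.HllSketch;

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SketchSummaryMatrix
{
  public static void main(String[] args) throws IOException
  {
    // One header row, then one row per sketch file: a compact view that makes
    // anomalous sketch states easier to spot than full per-sketch dumps.
    System.out.println("file\tlgConfigK\ttgtHllType\testimate\tcompactBytes");
    try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(args[0]), "*.bin")) {
      for (Path file : dir) {
        HllSketch sketch = HllSketch.heapify(Files.readAllBytes(file));
        System.out.printf(
            "%s\t%d\t%s\t%.1f\t%d%n",
            file.getFileName(),
            sketch.getLgConfigK(),
            sketch.getTgtHllType(),
            sketch.getEstimate(),
            sketch.getCompactSerializationBytes()
        );
      }
    }
  }
}
```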
There is a third option, which is what I did in my prototype. I took the base64 column output from the DST (no changes required), and have the option of either producing individual files and/or producing the human-readable summaries above. Lee.
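The file-producing half of such a prototype could look roughly like the following. This is a minimal sketch that assumes the newline-delimited JSON row dump from dump-segment and a hypothetical complex column named sketch_col; the real column name and dump format would have to be checked against the segment at hand.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.List;

public class SplitSketchColumn
{
  public static void main(String[] args) throws IOException
  {
    // args[0]: row dump produced by dump-segment; args[1] (optional): output directory.
    ObjectMapper mapper = new ObjectMapper();
    Path outDir = Paths.get(args.length > 1 ? args[1] : "sketch-bins");
    Files.createDirectories(outDir);

    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    int row = 0;
    for (String line : lines) {
      // Each line is assumed to be one JSON object per row; the complex column value
      // is the Base64 string that Jackson wrote for the sketch.
      JsonNode value = mapper.readTree(line).get("sketch_col"); // hypothetical column name
      if (value != null && !value.isNull()) {
        byte[] sketchBytes = Base64.getDecoder().decode(value.asText());
        Files.write(outDir.resolve(String.format("sketch-%05d.bin", row)), sketchBytes);
      }
      row++;
    }
  }
}
```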
For this I would caution that the dump-segment tool does not actually have access to the sketch images from the column! It only seems like it does. What it actually has access to is the thing that the relevant ComplexMetricSerde deserializes the column into, and what it then does is take that object and serialize it using Jackson. In the case of DataSketches HLL, the HllSketchMergeComplexMetricSerde returns an HllSketch object, and that is serialized by Jackson using the HllSketchJsonSerializer that is part of the Druid DataSketches extension, which does the following:

jgen.writeBinary(sketch.toCompactByteArray());

And what writeBinary does is write a Base64 string. So that's why you get the behavior you get.

With regard to taking this output and splitting it up into multiple files, my first thought is that we don't need the dump-segment tool to do that, since you could do it by composing two tools in the Unix style (dump-segment to generate a dump of the entire segment, and a second tool that splits the base64 column out into individual files).

With regard to dumping interesting "human readable" stuff, that actually seems very useful as a new option to dump-segment. Perhaps we need a way for ComplexMetricSerde implementations to return human-readable debugging output for a particular object that they understand, and the dump-segment option would use this functionality. Or perhaps we adopt a convention that calling toString() on the deserialized object returns that human-readable output.

Any thoughts on the above @leerho?
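For reference, the serialization path described above amounts to something like the following simplified reconstruction (not the exact Druid class): Jackson's writeBinary encodes the compact sketch image as a Base64 string, which is why the dump shows Base64 rather than raw bytes.

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonSerializer;
import com.fasterxml.jackson.databind.SerializerProvider;
import org.apache.datasketches.hll.HllSketch;

import java.io.IOException;

public class HllSketchJsonSerializer extends JsonSerializer<HllSketch>
{
  @Override
  public void serialize(HllSketch sketch, JsonGenerator jgen, SerializerProvider provider)
      throws IOException
  {
    // writeBinary emits the byte array as a Base64-encoded JSON string,
    // so the row dump shows the compact sketch image in Base64.
    jgen.writeBinary(sketch.toCompactByteArray());
  }
}
```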
@gianm Thanks for the tutorial and insights. I have been trying to grok the Dump Segment tool and it is a challenge, as it has scores of dependencies on Druid internals that I don't understand and really don't want to mess with. The query/aggregation/datasketches code, on the other hand, has several advantages: each sketch is already separated into its own aggregator, and the aggregator code, by definition, has intimate knowledge of a particular sketch. Adding a diagnostic method there would be simple: it turns out that all of our sketches already have toString(...) methods that return human-readable summaries that we use for debugging. Depending on the sketch, some of these toString() methods require additional parameters for different output options and some do not. But the CMS impl would know how to call these diagnostic methods for each sketch.

The output would appear like the "HllSketch 6" sample above. It would not be in a neat matrix form like I was envisioning, but, hey, it is more than we have now, and it involves only minor changes to the CMS impls and a minor change to the Dump Segment tool, and that's it! It would be useful for the dump tool to spit out a row number before it calls the CMS toString() method.

If someone on the Druid team could make the change in CMS and in the dump tool, then either Alex or I could add the toString() methods to the CMS impls in query/aggregation/datasketches. This, of course, ripples through all of your CMS implementations and not just DataSketches, so someone would have to work on that; but the default could be to print "no debug info available" for complex metrics that don't have diagnostic methods like we do. Lee.
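As a purely hypothetical illustration of that convention (none of these method names exist in Druid; they are placeholders for the proposal), the CMS-side hook and the dump-tool side could look roughly like this:

```java
import org.apache.datasketches.hll.HllSketch;

public class ComplexMetricDebugPrinter
{
  /** Placeholder for a per-object diagnostic hook a ComplexMetricSerde impl could provide. */
  public static String toDebugString(Object deserializedObject)
  {
    if (deserializedObject instanceof HllSketch) {
      // The DataSketches classes already carry human-readable toString() summaries;
      // some sketch families also offer parameterized variants for more detail.
      return ((HllSketch) deserializedObject).toString();
    }
    // Default for complex metrics that have no diagnostic method.
    return "no debug info available";
  }

  /** Roughly what the dump-tool side might do: print the row number, then the summary. */
  public static void printRow(int rowNumber, Object deserializedObject)
  {
    System.out.println("row " + rowNumber);
    System.out.println(toDebugString(deserializedObject));
  }
}
```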
This issue has been marked as stale due to 280 days of inactivity.
This issue has been closed due to lack of activity.
Description
Debugging Segments with sketches
Motivation
Debugging segments with sketches can be challenging, as there is currently no simple way to extract the sketch binaries from a segment.
This idea was motivated by Issue #9736 where the segment with the sketches that were causing problems was provided to the DataSketches team, but without familiarity with the segment structure schema, it was all but impossible to analyze. @gianm supported the idea of creating this issue for discussion.
The dump-segment tool was partially helpful, but it presents the sketches as a column of Base64 encodings in JSON format. In order to understand the internal state of each sketch, however, the sketch needs to be in binary form. Once the sketches are in binary form, there are a number of tools in the DataSketches library that can be used to analyze their exact state.
Sketches, by their nature, are complex state machines. Providing a tool like this will significantly enhance our ability to debug sketching problems in the future.