Tutorial on ingesting and querying Theta sketches #12723
vtlim merged 15 commits into apache:master
techdocsmith
left a comment
@vtlim this is an awesome start. It needs some stylistic work for the docs. We also need to make sure the steps line up: I'm not seeing the same results as in the screenshots with this data set. We should also add a little bit of commentary about the expected results.
@hellmarbecker, thank you so much for working with @vtlim on this contribution.
@petermarshallio PTAL
* Typical production datasources have tens to hundreds of columns.
* [Dimension columns](./data-model.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
- * [Metric columns](./data-model.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
+ * [Metric columns](./data-model.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approximate-aggregations). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
There was a problem hiding this comment.
nit: pre-aggregated sounds a little jargon-ish. Druid aggregates metric columns during ingestion and stores the aggregated value (?)
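To make the "aggregates during ingestion and stores the aggregated value" wording concrete, here is a minimal sketch in plain Python. It is not Druid's implementation; the row layout, the column names, and the `rollup` helper are all hypothetical and exist only to illustrate the idea that rows sharing the same dimension values collapse into one stored row whose metric is already summed.

```python
from collections import defaultdict

# Hypothetical raw events: (hour, show, bytes). Column names are assumptions.
raw_rows = [
    ("2022-05-19T10", "Bridgerton", 100),
    ("2022-05-19T10", "Bridgerton", 250),
    ("2022-05-19T10", "Game of Thrones", 300),
    ("2022-05-19T11", "Bridgerton", 50),
]


def rollup(rows):
    """Group rows by their dimension values and sum the metric.

    This only mimics the idea of ingestion-time aggregation; it says
    nothing about how Druid actually stores segments.
    """
    totals = defaultdict(int)
    for hour, show, nbytes in rows:
        # (hour, show) plays the role of the dimension columns.
        totals[(hour, show)] += nbytes
    return dict(totals)


rolled_up = rollup(raw_rows)
for key, total in sorted(rolled_up.items()):
    print(key, total)
```

After this step only the summed value per dimension combination remains, which is why such a column can be aggregated further at query time but the individual input values are no longer available.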
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
…to tutorial-sketches
techdocsmith
left a comment
Fantastic revision. I have some suggestions for adding headings for readability but otherwise LGTM.
- How many visitors are there that watched _at least one_ of the episodes?
- How many visitors watched episode 1 _but not_ episode 2?

There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take long, meaning that an interactive data exploration is not possible.
There was a problem hiding this comment.
- There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take long, meaning that an interactive data exploration is not possible.
+ There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take a very long time, meaning that an interactive data exploration is not possible.
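The claim that aggregated numbers cannot answer these questions can be illustrated with a small Python sketch. It uses exact Python sets as a stand-in for Theta sketches (which are approximate and far more compact), and the visitor names, the data, and the `viewers_of` helper are hypothetical. Two made-up data sets have identical per-episode visitor counts but different answers to the "at least one" and "1 but not 2" questions.

```python
# Hypothetical detail rows: (visitor, episode). Exact sets stand in for
# Theta sketches here; real sketches give approximate answers.
views_a = [("alice", 1), ("alice", 2), ("bob", 1), ("carol", 2), ("bob", 1)]
views_b = [("alice", 1), ("bob", 1), ("alice", 2), ("bob", 2)]


def viewers_of(rows, episode):
    """Return the set of distinct visitors who watched the given episode."""
    return {visitor for visitor, ep in rows if ep == episode}


def summarize(rows):
    ep1 = viewers_of(rows, 1)
    ep2 = viewers_of(rows, 2)
    return {
        "per_episode": (len(ep1), len(ep2)),  # the aggregated numbers
        "at_least_one": len(ep1 | ep2),       # set union
        "ep1_not_ep2": len(ep1 - ep2),        # set difference
        "both": len(ep1 & ep2),               # set intersection
    }


summary_a = summarize(views_a)
summary_b = summarize(views_b)
print(summary_a)
print(summary_b)
```

Both data sets report two distinct visitors per episode, yet the union and difference differ, so the per-episode counts alone do not determine the answers. Keeping a set-like structure per episode is what makes the union, intersection, and difference computable without rescanning the detail rows.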
## Acknowledgments

This tutorial is adapted from a blog post by Hellmar Becker. Visit the [original blog post](https://blog.hellmar-becker.de/2022/06/05/druid-data-cookbook-counting-unique-visitors-for-overlapping-segments/) on Hellmar Becker's blog.
There was a problem hiding this comment.
I have mixed opinions on this. On the one hand, it definitely seems a bit strange to me to have this section in the docs that links out to another website; these docs are supposed to be the authority on Druid stuff, in my mind. On the other hand, this is pretty great content and it deserves credit.
Do we do stuff like this in other docs pages?
There was a problem hiding this comment.
We haven't done this on other docs pages. However, @hellmarbecker is a member of the Druid community and the originator of this content. We have to have some way to acknowledge his authorship. Perhaps this would be cleaner:
This tutorial is adapted from a [blog post](https://blog.hellmar-becker.de/2022/06/05/druid-data-cookbook-counting-unique-visitors-for-overlapping-segments/) by community member Hellmar Becker.
It also cuts down on some of the repetition.
There was a problem hiding this comment.
I think the part that seemed strange is that it vaguely seems like it is someone else's content, which isn't technically true if it is being (legally) contributed here.
Docs are contributed and licensed the same way as source files, so this contribution grants the ASF ownership of this content and licenses it to be distributed under the Apache license. The standard way we acknowledge external source contributions when necessary is through the https://github.com/apache/druid/blob/master/LICENSE and https://github.com/apache/druid/blob/master/NOTICE files, though I'm unsure if that would be appropriate in this case, since the source packages are not the typical consumption method of the docs.
I just wanted to make sure that this is clear to @hellmarbecker to ensure that the license of this content is clean.
If all of this sounds OK, I do think it is fine to acknowledge and link to this external content, since it was developed externally and added later rather than written as an original Druid contribution (for which we do not typically credit the author, other than in the form of GitHub commits and occasionally release announcements).
There was a problem hiding this comment.
Thanks, I am happy to grant permission for Apache to use the content!
A collaboration with @hellmarbecker to polish and get his Theta sketches tutorial into the docs.
This PR has: