
Tutorial on ingesting and querying Theta sketches#12723

Merged
vtlim merged 15 commits into apache:master from vtlim:tutorial-sketches
Aug 24, 2022
Conversation

@vtlim
Member

@vtlim vtlim commented Jun 30, 2022

A collaboration with @hellmarbecker to polish and get his Theta sketches tutorial into the docs.

This PR has:

  • been self-reviewed.

Contributor

@techdocsmith techdocsmith left a comment

@vtlim this is an awesome start. Needs some stylistic work for the docs. Also need to make sure the steps line up. I'm not seeing the same results as from the screen shots with this data set. We should also add a little bit of commentary about the expected results.

@hellmarbecker , thank you so much for working with @vtlim on this contribution

@petermarshallio PTAL

* Typical production datasources have tens to hundreds of columns.
* [Dimension columns](./data-model.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
Suggested change
* [Metric columns](./data-model.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
* [Metric columns](./data-model.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approximate-aggregations). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
Contributor

nit: pre-aggregated sounds a little jargon-ish. Druid aggregates metric columns during ingestion and stores the aggregated value (?)
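For context, the metric-column behavior quoted above is what this tutorial's ingestion relies on: the sketch metric is declared in the ingestion spec's `metricsSpec`. A minimal sketch of such a fragment, where the input field name `uid` is a hypothetical stand-in for the visitor ID column (the `thetaSketch` aggregator is provided by the druid-datasketches extension):

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "thetaSketch", "name": "theta_uid", "fieldName": "uid" }
]
```

At query time the resulting `theta_uid` column can only be aggregated (for example with `THETA_SKETCH_ESTIMATE`), not filtered or grouped by, exactly as the excerpt describes.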

vtlim and others added 2 commits July 5, 2022 13:58
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
@vtlim vtlim requested a review from techdocsmith August 18, 2022 00:17
Contributor

@techdocsmith techdocsmith left a comment

Fantastic revision. I have some suggestions for adding headings for readability but otherwise LGTM.

Member

@clintropolis clintropolis left a comment

very nice 🚀

- How many visitors are there that watched _at least one_ of the episodes?
- How many visitors watched episode 1 _but not_ episode 2?

There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take long, meaning that an interactive data exploration is not possible.
Member

Suggested change
There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take long, meaning that an interactive data exploration is not possible.
There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take a very long time, meaning that an interactive data exploration is not possible.
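Set operations on Theta sketches are what make these questions answerable without rescanning the detail rows. A minimal Druid SQL sketch using the datasketches extension's functions — the datasource and column names here (`ts_tutorial`, `theta_uid`, `episode`) are assumptions for illustration, not necessarily the tutorial's exact names:

```sql
SELECT
  -- union: visitors who watched at least one of the two episodes
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_UNION(
      DS_THETA(theta_uid) FILTER (WHERE episode = 'S1E1'),
      DS_THETA(theta_uid) FILTER (WHERE episode = 'S1E2')
    )
  ) AS watched_any,
  -- difference (A NOT B): visitors who watched episode 1 but not episode 2
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_NOT(
      DS_THETA(theta_uid) FILTER (WHERE episode = 'S1E1'),
      DS_THETA(theta_uid) FILTER (WHERE episode = 'S1E2')
    )
  ) AS watched_1_not_2
FROM ts_tutorial
```

Both results are approximate estimates; accuracy is governed by the sketch size chosen at ingestion time.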

Comment on lines +324 to +326
## Acknowledgments

This tutorial is adapted from a blog post by Hellmar Becker. Visit the [original blog post](https://blog.hellmar-becker.de/2022/06/05/druid-data-cookbook-counting-unique-visitors-for-overlapping-segments/) on Hellmar Becker's blog.
Member

@clintropolis clintropolis Aug 18, 2022

I've mixed opinions on this. On the one hand, it definitely seems a bit strange to me to have this section in the docs that links out to another website; these docs are supposed to be the authority on Druid stuff in my mind. But on the other hand, this is pretty great content and it deserves credit.

Do we do stuff like this in other docs pages?

Contributor

@techdocsmith techdocsmith Aug 19, 2022

We haven't done this on other docs pages. However, @hellmarbecker is a member of the Druid community and the originator of this content. We have to have some way to acknowledge his authorship. Perhaps this would be cleaner:

This tutorial is adapted from a [blog post](https://blog.hellmar-becker.de/2022/06/05/druid-data-cookbook-counting-unique-visitors-for-overlapping-segments/) by community member Hellmar Becker.

It also cuts down some of the repetition.

Member

I think the part that seemed strange is that it sort of seems vaguely like it is someone else's content, which isn't technically true if it is being (legally) contributed here.

Docs are contributed and licensed the same way as source files, and so this contribution is granting the ASF ownership of this content and licensing it to be distributed under the Apache license. The standard way we acknowledge external source contributions when necessary is the https://github.com/apache/druid/blob/master/LICENSE and https://github.com/apache/druid/blob/master/NOTICE files, though I'm unsure if it would be appropriate in this case, since the source packages are not the typical consumption method of the docs.

I just wanted to make sure that this is clear to @hellmarbecker to ensure that the license of this content is clean.

If all of this sounds ok, I do think it is ok to acknowledge and link to this external content since it was developed externally and later added instead of as an original Druid contribution (which we do not typically credit the author with, other than in the form of github commits and occasionally release announcements).

Contributor

Thanks, I am happy to grant permission for Apache to use the content!

@vtlim vtlim closed this Aug 23, 2022
@vtlim vtlim reopened this Aug 23, 2022
@vtlim vtlim merged commit 02914c1 into apache:master Aug 24, 2022
@vtlim vtlim deleted the tutorial-sketches branch August 24, 2022 16:23
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022