
Tutorial on ingesting and querying Theta sketches#12723

Merged
vtlim merged 15 commits into apache:master from vtlim:tutorial-sketches
Aug 24, 2022
Conversation

@vtlim
Member

@vtlim vtlim commented Jun 30, 2022

A collaboration with @hellmarbecker to polish and get his Theta sketches tutorial into the docs.

This PR has:

  • been self-reviewed.

Contributor

@techdocsmith techdocsmith left a comment

@vtlim this is an awesome start. Needs some stylistic work for the docs. Also need to make sure the steps line up. I'm not seeing the same results as from the screen shots with this data set. We should also add a little bit of commentary about the expected results.

@hellmarbecker , thank you so much for working with @vtlim on this contribution

@petermarshallio PTAL

* Typical production datasources have tens to hundreds of columns.
* [Dimension columns](./data-model.md#dimensions) are stored as-is, so they can be filtered on, grouped by, or aggregated at query time. They are always single Strings, [arrays of Strings](../querying/multi-value-dimensions.md), single Longs, single Doubles or single Floats.
Suggested change
* [Metric columns](./data-model.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approx). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
* [Metric columns](./data-model.md#metrics) are stored [pre-aggregated](../querying/aggregations.md), so they can only be aggregated at query time (not filtered or grouped by). They are often stored as numbers (integers or floats) but can also be stored as complex objects like [HyperLogLog sketches or approximate quantile sketches](../querying/aggregations.md#approximate-aggregations). Metrics can be configured at ingestion time even when rollup is disabled, but are most useful when rollup is enabled.
Contributor

nit: pre-aggregated sounds a little jargon-ish. Druid aggregates metric columns during ingestion and stores the aggregated value (?)
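For context, the metric-column behavior quoted above is what this tutorial's ingestion relies on: the sketch metric is declared in the ingestion spec's `metricsSpec`. A minimal sketch of such a fragment, where the input field name `uid` is a hypothetical stand-in for the visitor ID column (the `thetaSketch` aggregator is provided by the druid-datasketches extension):

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "thetaSketch", "name": "theta_uid", "fieldName": "uid" }
]
```

At query time the resulting `theta_uid` column can only be aggregated (for example with `THETA_SKETCH_ESTIMATE`), not filtered or grouped by, exactly as the excerpt describes.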

vtlim and others added 2 commits July 5, 2022 13:58
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
@vtlim vtlim requested a review from techdocsmith August 18, 2022 00:17
Contributor

@techdocsmith techdocsmith left a comment

Fantastic revision. I have some suggestions for adding headings for readability but otherwise LGTM.

Member

@clintropolis clintropolis left a comment

very nice 🚀

- How many visitors are there that watched _at least one_ of the episodes?
- How many visitors watched episode 1 _but not_ episode 2?

There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take long, meaning that an interactive data exploration is not possible.
Member

Suggested change
There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take long, meaning that an interactive data exploration is not possible.
There is no way to answer these questions by just looking at the aggregated numbers. You would have to go back to the detail data and scan every single row. If the data volume is high enough, this may take a very long time, meaning that an interactive data exploration is not possible.
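Set operations on Theta sketches are what make these questions answerable without rescanning the detail rows. A minimal Druid SQL sketch using the datasketches extension's functions — the datasource and column names here (`ts_tutorial`, `theta_uid`, `episode`) are assumptions for illustration, not necessarily the tutorial's exact names:

```sql
SELECT
  -- union: visitors who watched at least one of the two episodes
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_UNION(
      DS_THETA(theta_uid) FILTER (WHERE episode = 'S1E1'),
      DS_THETA(theta_uid) FILTER (WHERE episode = 'S1E2')
    )
  ) AS watched_any,
  -- difference (A NOT B): visitors who watched episode 1 but not episode 2
  THETA_SKETCH_ESTIMATE(
    THETA_SKETCH_NOT(
      DS_THETA(theta_uid) FILTER (WHERE episode = 'S1E1'),
      DS_THETA(theta_uid) FILTER (WHERE episode = 'S1E2')
    )
  ) AS watched_1_not_2
FROM ts_tutorial
```

Both results are approximate estimates; accuracy is governed by the sketch size chosen at ingestion time.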

Comment on lines +324 to +326
## Acknowledgments

This tutorial is adapted from a blog post by Hellmar Becker. Visit the [original blog post](https://blog.hellmar-becker.de/2022/06/05/druid-data-cookbook-counting-unique-visitors-for-overlapping-segments/) on Hellmar Becker's blog.
Member

@clintropolis clintropolis Aug 18, 2022

I've mixed opinions on this. On the one hand, it definitely seems a bit strange to me to have this section in the docs that links out to another website; these docs are supposed to be the authority on Druid stuff in my mind. But on the other hand, this is pretty great content and it deserves credit.

Do we do stuff like this in other docs pages?

Contributor

@techdocsmith techdocsmith Aug 19, 2022

We haven't done this on other docs pages. However, @hellmarbecker is a member of the Druid community and the originator of this content. We have to have some way to acknowledge his authorship. Perhaps this would be cleaner:

This tutorial is adapted from a [blog post](https://blog.hellmar-becker.de/2022/06/05/druid-data-cookbook-counting-unique-visitors-for-overlapping-segments/) by community member Hellmar Becker.

It also cuts down some of the repetition.

Member

I think the part that seemed strange is that it sort of seems vaguely like it is someone else's content, which isn't technically true if it is being (legally) contributed here.

Docs are contributed and licensed the same way as source files, and so this contribution is granting the ASF ownership of this content and licensing it to be distributed under the Apache license. The standard way we acknowledge external source contributions when necessary is the https://github.com/apache/druid/blob/master/LICENSE and https://github.com/apache/druid/blob/master/NOTICE files, though I'm unsure if it would be appropriate in this case, since the source packages are not the typical consumption method of the docs.

I just wanted to make sure that this is clear to @hellmarbecker to ensure that the license of this content is clean.

If all of this sounds ok, I do think it is ok to acknowledge and link to this external content since it was developed externally and later added instead of as an original Druid contribution (which we do not typically credit the author with, other than in the form of github commits and occasionally release announcements).

Contributor

Thanks, I am happy to grant permission for Apache to use the content!

@vtlim vtlim closed this Aug 23, 2022
@vtlim vtlim reopened this Aug 23, 2022
@vtlim vtlim merged commit 02914c1 into apache:master Aug 24, 2022
@vtlim vtlim deleted the tutorial-sketches branch August 24, 2022 16:23
@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022