Docusaurus build framework + ingestion doc refresh. #8311

gianm · 2019-08-15T07:38:30Z

TL;DR

A refresh of the documentation done in collaboration with @vogievetsky.

Check out a render at: https://staging-druid.imply.io/docs/design/

And compare to the current doc pages: https://druid.apache.org/docs/latest/design/

Description

This refresh has two main goals.

First, setting up Docusaurus:

- Use Docusaurus, a really nice documentation generator that creates much more beautiful pages than we have now. They are also more functional; each one has a left-hand collapsible table of contents showing the outline of the overall docs, and a right-hand table of contents showing the outline of that particular doc page. Compare to the old doc pages, where there's only one non-collapsible ToC, and it's grown so long that it's quite difficult to follow.
- Make various titles and headers more consistent. The inconsistencies have been around for a while, but became much easier to notice with the new Docusaurus ToCs.

Second, an ingestion doc refresh:

- Includes a new ingestion/index.md doc that introduces all the key ingestion spec concepts, and describes the most popular ingestion methods. It is meant to be an introduction to the world of Druid ingestion.
- Consolidate lots of disparate ingestion docs, which have grown organically over time and have become difficult to follow, into a simpler set of fewer, larger, more cross-referenced pages. Check out the new pages at: https://staging-druid.imply.io/docs/ingestion/index.html. Some prime examples of this consolidation are the new pages ingestion/data-management.md and ingestion-tasks.md, which represent multiple pages from the current set of docs.
- Have a bit more 'opinionatedness' in the ingestion pages, pushing new people towards Kafka, Kinesis, Hadoop, and native batch ingestion. They discuss Tranquility but don't present it as something highly recommended (in keeping with its status as something that doesn't get a lot of love these days). And the few remaining references to Realtime Nodes have been redirected to a page that says they're gone now.

Other notes

I think we need to restore some of the _bin scripts that are still useful (but which ones?).

ccaominh

CI is failing because of missing _bin scripts

ccaominh · 2019-08-15T18:01:37Z

docs/dependencies/metadata-storage.md

@@ -113,7 +110,7 @@ Note that the format of this blob can and will change from time-to-time.
 ### Rule Table


The title case style here is not consistent with the changes you made above

Fixed, thanks.

ccaominh · 2019-08-15T18:07:57Z

docs/ingestion/index.md

+
+When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
+(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based). The following
+table compares and contrasts the three batch ingestion options.


This last sentence is slightly out of place. Perhaps merge it with the sentence on line 81?

I deleted it and replaced the L81 sentence with:

This table compares the three available options:

ccaominh · 2019-08-15T18:11:58Z

docs/ingestion/index.md

+
+### Example of rollup
+
+For an example of how to configure rollup, and of what how the feature will modify your data, check out the


Typo: of what how -> of how

Fixed, thanks.

ccaominh · 2019-08-15T18:13:46Z

docs/ingestion/index.md

+### Best-effort rollup
+
+Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
+time. Others offer _best-effort rollup_, meaming that input data might not be perfectly aggregated and thus there could


Typo: meaming -> meaning

Fixed, thanks.

ccaominh · 2019-08-15T18:15:58Z

docs/ingestion/index.md

+In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
+without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
+segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
+cases, records may end up in different segments that are received by different, non-shuffling tasks cannot be rolled


This sentence needs to be reworded. Perhaps something like: non-shuffling tasks cannot be -> non-shuffling tasks and cannot be

I replaced this sentence:

In both of these cases, records may end up in different segments that are received by different, non-shuffling tasks cannot be rolled up together.

With this one:

In both of these cases, records that could theoretically be rolled up may end up in different segments.
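To illustrate the reworded sentence, here is a minimal Python sketch. It is not Druid code: the row shape, the `rollup` helper, and the way the input is split between two tasks are all assumptions made for the example. It shows how two non-shuffling tasks can leave rows with the same key in different segments, so the total row count ends up higher than with perfect rollup.

```python
from collections import Counter


def rollup(rows):
    """Aggregate rows sharing the same (timestamp, dimension) key by summing counts."""
    totals = Counter()
    for ts, dim, count in rows:
        totals[(ts, dim)] += count
    return dict(totals)


# Hypothetical input: three events in one time chunk, two of them sharing a key.
events = [
    ("2019-08-15T00", "page_a", 1),
    ("2019-08-15T00", "page_b", 1),
    ("2019-08-15T00", "page_a", 1),
]

# Perfect rollup: every row is aggregated together, as if shuffled by key first.
perfect = rollup(events)

# Best-effort rollup: two non-shuffling tasks each roll up only the slice they
# receive, and each publishes its own segment.
segment_1 = rollup(events[:2])
segment_2 = rollup(events[2:])

perfect_rows = len(perfect)
best_effort_rows = len(segment_1) + len(segment_2)
```

Here `page_a` appears once in each segment under best-effort rollup, whereas perfect rollup stores it as a single aggregated row.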

ccaominh · 2019-08-15T18:20:24Z

docs/ingestion/index.md

+quickly.
+
+You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
+dimension that you often filter by, if one exists. This will often improve compression — users have reported threefold


Are the dashes rendered as em-dashes?

Yes, this is rendered as an em dash.

vogievetsky · 2019-08-15T18:35:16Z

Hi all, as @gianm mentioned, this work is a collaboration. I am going to be pushing to this branch (https://github.com/implydata/druid/tree/ingest-doc) to address feedback and fix up some remaining things about the build script.

fjy · 2019-08-16T03:48:49Z

+2

clintropolis · 2019-08-17T04:59:07Z

I think we need to restore some of the _bin scripts that are still useful (but which ones?).

#8306 moves all the scripts related to performing an Apache release into distribution/bin, which makes more sense anyway since they are not doc-related scripts.

This is all that remains in docs/_bin:

which, if none are necessary to the new docs, can all safely be deleted, I believe.

gianm · 2019-08-19T17:49:50Z

Just re-pushed.

yurmix · 2019-08-19T20:54:54Z

website/sidebars.json

+    ],
+    "Configuration": [
+      "configuration/index",
+      "development/extensions",


Very nice outline! Also, linking straight to .md is great.

One thing though: I see that the list of extensions (extensions.md) has been moved under Configuration. I think it's better to mention this change, and any other structural changes, in the release notes so that existing users know and can adjust.

Ah yes, I did move it there, since I thought it made more sense there. (The Extensions page is written in a user-facing style, and so I don't think it belonged in the Development section.)

yurmix · 2019-08-19T21:01:38Z

Do you think you can find a way to control style/indentation for subcategories on the sidebar? I think the distinction between 2nd and 3rd levels (and so on) isn't clear enough.
The only thing I found is subcategory collapse, which should be available on v2 (currently early alpha): facebook/docusaurus#1352

vogievetsky · 2019-08-19T21:07:37Z

@yurmix I looked into Docusaurus v2 and it looks like it is not quite ready for prime-time use, which is why we did not use it here. Totally down to switch to it once it comes out (or at least becomes more stable).

As for the indentation, we can play with the CSS a bit to make it more obvious in the meantime, but in a subsequent PR; we are trying to get this merged so we do not have to live in conflict hell.

ccaominh

LGTM 👍

jon-wei · 2019-08-19T22:46:50Z

docs/comparisons/druid-vs-elasticsearch.md

-Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents 
-and provides access to raw event level data. Elasticsearch is increasingly adding more support for analytics and aggregations. 
-[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out  
+Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents


search systems -> search system

Fixed, thanks.

jon-wei · 2019-08-19T23:15:40Z

docs/design/processes.md

+
+### Indexer process (optional)
+
+[**MiddleManager**](../design/indexer.md) processes are an alternative to MiddleManagers and Peons. Instead of


MiddleManager -> Indexer

Fixed, thanks.

jon-wei · 2019-08-19T23:32:34Z

docs/ingestion/index.md

+
+- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
+you will achieve.
+- Use [sketches](#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.


The #sketches link doesn't resolve

Fixed, thanks.

jon-wei · 2019-08-19T23:34:53Z

docs/ingestion/index.md

+a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
+`timestampSpec`.
+
+Druid currently includes one kind of builtin transform, the expression transform. It has the following syntax:


builtin -> built-in

Fixed, thanks.

jon-wei · 2019-08-19T23:50:50Z

docs/ingestion/native-batch.md

@@ -487,10 +493,13 @@ If you have some tasks of a higher priority than others, you may set their
 This may help the higher priority tasks to finish earlier than lower priority tasks
 by assigning more task slots to them.

-Local Index Task
----------------
+## Simple task


👍 on the new name

jihoonson

The new doc looks great! Thanks @gianm and @vogievetsky. Still reviewing.

jihoonson · 2019-08-19T21:53:49Z

docs/design/index.md

+is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
+updates.
+6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
+stored safely in [deep storage](#deep-storage) (typically cloud storage, HDFS, or a shared filesystem). Your data can be


Broken link.

Fixed, thanks.

jihoonson · 2019-08-19T22:21:43Z

docs/ingestion/index.md

+| **External dependencies** | None. | None. | Hadoop cluster (Druid submits Map/Reduce jobs). |
+| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any [firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid datasource. |
+| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Any Hadoop InputFormat. |
+| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).| Only best-effort. Support for perfect rollup is coming in a future release. | Always perfect. |


This table looks like it has gone stale. Would you please update it to match what is in master?

I updated the "rollup modes" and "partitioning options" sections.

jihoonson · 2019-08-19T22:24:58Z

docs/ingestion/index.md

+
+|Method|How it works|
+|------|------------|
+|[Native batch](native-batch.html)|`index` (non-parallel) tasks partition input files based on the `partitionDimensions` and `forceGuaranteedRollup` tuning configs. `index_parallel` tasks do not currently support user-defined partitioning.|


Please update this table as well. index_parallel now supports user-defined partitioning.

I updated it to say:

Configured using [`partitionsSpec`](native-batch.html#partitionsspec) inside the `tuningConfig`.

jihoonson · 2019-08-19T22:32:39Z

docs/ingestion/data-management.md

+the same datasource, interval, and version, but have linearly increasing partition numbers.
+
+```
+foo_2015-01-01/2015-01-02_v1_0


It would be better to use a more realistic version than v1.

Maybe, I'd rather change that later though (if we do at all) since this section was just copied and relocated from an existing document.

clintropolis

overall lgtm

clintropolis · 2019-08-19T23:58:30Z

docs/ingestion/index.md

+
+|Method|How it works|
+|------|------------|
+|[Native batch](native-batch.html)|`index_parallel` type is best-effort. `index` type may be either perfect or best-effort, based on configuration.|


this is no longer true, both can use perfect rollup

Fixed, thanks.

clintropolis · 2019-08-19T23:59:19Z

docs/ingestion/native-batch.md

@@ -73,16 +78,19 @@ if one of them fails.

 You may want to consider the below things:

+- This task does not support [perfect rollup](index.md#best-effort-rollup) because it does not shuffle


this statement can be removed, obsolete

Fixed, thanks.

gianm · 2019-08-20T20:53:11Z

I've pushed an update reflecting the comments above, with all broken links and anchors fixed, and re-uploaded a render to https://staging-druid.imply.io/docs/design/index.html. I've also restored some subheaders that were accidentally deleted.

gianm · 2019-08-20T21:05:12Z

I've pushed an update reflecting the comments above, with all broken links and anchors fixed, and re-uploaded a render to https://staging-druid.imply.io/docs/design/index.html. I've also restored some subheaders that were accidentally deleted.

I spoke too soon; the broken link checker built into Docusaurus wasn't catching all the broken links (it only checked .md links, not .html links). I will need to fix this and re-push.
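For illustration only, here is a rough Python sketch of a link check that treats .md and .html targets alike. This is not the lint script from this PR (that one runs through npm); the function name, the in-memory `docs` mapping, and the assumption that each rendered .html page comes from an .md source of the same name are all hypothetical. It checks only that the target page exists, not that the anchor does.

```python
import posixpath
import re

# Matches the target of an inline Markdown link: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(docs):
    """Return sorted (source, target) pairs whose .md/.html target has no matching doc.

    `docs` maps a doc path such as "ingestion/index.md" to its Markdown text.
    """
    known = set(docs)
    broken = []
    for source, text in docs.items():
        for target in LINK_RE.findall(text):
            if "://" in target or target.startswith("#"):
                continue  # external links and same-page anchors are out of scope
            path = target.split("#", 1)[0]
            if path.endswith(".html"):
                # Assumption: a rendered .html page is built from the same-named .md file.
                path = path[: -len(".html")] + ".md"
            elif not path.endswith(".md"):
                continue
            resolved = posixpath.normpath(
                posixpath.join(posixpath.dirname(source), path)
            )
            if resolved not in known:
                broken.append((source, target))
    return sorted(broken)


# Hypothetical docs: supervisors.md does not exist, so that link is reported.
example_docs = {
    "ingestion/index.md": "See [tasks](tasks.html) and [supervisors](supervisors.html).",
    "ingestion/tasks.md": "Back to [ingestion](index.md#rollup).",
}
example_broken = find_broken_links(example_docs)
```

Normalizing .html to .md before the lookup is what lets one pass cover both link styles.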

lgtm-com · 2019-08-20T21:08:28Z

This pull request introduces 1 alert when merging 8554133 into e2a25fb - view on LGTM.com

new alerts:

1 for Unused variable, import, function or class

lgtm-com · 2019-08-20T23:18:58Z

This pull request introduces 1 alert when merging 8f98e00 into 6fa22f6 - view on LGTM.com

new alerts:

1 for Unused variable, import, function or class

ccaominh · 2019-08-20T23:29:19Z

.travis.yml

@@ -164,6 +164,10 @@ matrix:
      script:
        - $MVN test -pl 'web-console'

+    - name: "docs"
+      install: cd website && npm install
+      script: cd website && npm run lint


From the Travis log, it looks like the cd website done in the install step stays in effect during the script step. Some alternatives are to use pushd/popd or a subshell.

lgtm-com · 2019-08-21T00:21:47Z

This pull request introduces 1 alert when merging 39af0ef into d5a1967 - view on LGTM.com

new alerts:

1 for Unused variable, import, function or class

gianm · 2019-08-21T00:23:20Z

Pushed again to resolve some conflicts, fixed more broken links and anchors found by the new script npm run lint, and uploaded another render (including redirects!) here: https://staging-druid.imply.io/docs/

lgtm-com · 2019-08-21T01:22:39Z

This pull request introduces 1 alert when merging 1002b3e into d5a1967 - view on LGTM.com

new alerts:

1 for Unused variable, import, function or class

jihoonson

Looks like there are still some broken links.

jihoonson · 2019-08-21T01:15:43Z

docs/ingestion/index.md

+## Ingestion specs
+
+No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.html) or
+ongoing [supervisors](supervisors.html). In any case, part of the task or supervisor definition is an


Broken link.

Fixed, thanks.

jihoonson · 2019-08-21T01:19:12Z

docs/ingestion/data-formats.md

 ```

 Note that the CSV and TSV data do not contain column heads. This becomes important when you specify the data for ingesting.

 ## Custom Formats

-Druid supports custom data formats and can use the `Regex` parser or the `JavaScript` parsers to parse these formats. Please note that using any of these parsers for 
+Druid supports custom data formats and can use the `Regex` parser or the `JavaScript` parsers to parse these formats. Please note that using any of these parsers for
 parsing data will not be as efficient as writing a native Java parser or using an external stream processor. We welcome contributions of new Parsers.

 ## Configuration


This was not introduced in this PR, but would you please fix the broken link below as well? The link "Druid can automatically flatten it for you" is broken.

Thanks, fixed this too (and now it's being detected by the broken link checker).

vogievetsky · 2019-08-21T04:21:26Z

OMG! it looks good. It even passed my super strict broken link linter! Someone merge before there are more doc conflicts.

leventov · 2019-08-23T15:20:37Z

Did you use some regular expression to replace .html -> .md links? Is that on purpose that links with id (like ../configuration/index.html#dynamic-configuration) remain with .html?

gianm · 2019-08-23T22:01:27Z

Did you use some regular expression to replace .html -> .md links? Is that on purpose that links with id (like ../configuration/index.html#dynamic-configuration) remain with .html?

The .md links are a little nicer since they will make links work when the source is viewed in GitHub, and will generate to the exact same link on the live site. I think it'd be fine to replace the ones with anchor ids too. I think I replaced a few by hand but didn't do all of them.
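As a sketch of how the remaining links with anchor ids could be converted in bulk, something like the following Python snippet might work. It is an assumption about one possible approach, not the method used in this PR; the pattern and function name are hypothetical. It rewrites relative .html targets to .md, keeps any #anchor, and leaves external URLs alone.

```python
import re

# Relative Markdown link targets ending in .html, optionally followed by an #anchor.
# The negative lookahead skips targets that start with a URL scheme (external links).
HTML_LINK_RE = re.compile(
    r"\]\((?![a-z][a-z0-9+.-]*://)([^)#\s]+)\.html(#[^)\s]*)?\)"
)


def html_links_to_md(text):
    """Rewrite relative .html link targets to .md, preserving any anchor."""
    return HTML_LINK_RE.sub(
        lambda m: "](" + m.group(1) + ".md" + (m.group(2) or "") + ")",
        text,
    )


converted = html_links_to_md(
    "[config](../configuration/index.html#dynamic-configuration)"
)
```

Running the result through the broken link check afterwards would catch any target whose .md source does not exist.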

Docusaurus build framework + ingestion doc refresh.

0760d7b

gianm added Area - Documentation WIP and removed WIP labels Aug 15, 2019

ccaominh reviewed Aug 15, 2019

vogievetsky added 3 commits August 15, 2019 11:54

stick to npm instead of yarn

0db673c

fix typos

1efa140

restore some _bin

295ce67

gianm added 2 commits August 19, 2019 10:15

Merge branch 'master' into ingest-doc

05820e6

Adjustments.

9f295fd

detect and fix redirect anchors

ff21f92

yurmix reviewed Aug 19, 2019

Merge branch 'master' into ingest-doc

8825b0c

gianm added the Release Notes label Aug 19, 2019

ccaominh approved these changes Aug 19, 2019

jon-wei reviewed Aug 19, 2019

jihoonson reviewed Aug 20, 2019

clintropolis reviewed Aug 20, 2019

vogievetsky and others added 5 commits August 19, 2019 20:08

update anchor lint

dc63220

Web-console: remove specific column filters (apache#8343)

450ce29

* add clear filter * update tool kit * remove usless check * auto run * add %

Fix resource leak (apache#8337)

bd1f4d8

* Fix resource leak * Patch comments

Enable Spotbugs NP_NONNULL_RETURN_VIOLATION (apache#8234)

01d316e

Fixes from PR review.

1d07fcc

vogievetsky and others added 2 commits August 20, 2019 13:49

clean up placeholder page

4873141

Merge branch 'master' into ingest-doc

8554133

vogievetsky added 2 commits August 20, 2019 15:59

add to website lint to travis config

743dd56

better broken link checking

8f98e00

ccaominh reviewed Aug 20, 2019

vogievetsky and others added 4 commits August 20, 2019 16:35

travis fix

eb0a9c2

Fixed more broken links

0efbf50

Merge branch 'master' into ingest-doc

6a6b79f

better redirects

39af0ef

clintropolis approved these changes Aug 21, 2019

unfancy catch

1002b3e

fix LGTM error

bbba990

jihoonson reviewed Aug 21, 2019

vogievetsky and others added 3 commits August 20, 2019 18:38

link fixes

faf98f1

fix md issues

f8862d2

Addl fixes

98c6623

gianm merged commit d007477 into apache:master Aug 21, 2019

gianm deleted the ingest-doc branch August 21, 2019 04:49

clintropolis added this to the 0.16.0 milestone Aug 23, 2019

clintropolis mentioned this pull request Sep 6, 2019

0.16.0-incubating release notes #8369

Closed
