
Docusaurus build framework + ingestion doc refresh. #8311

Merged
merged 30 commits into apache:master from implydata:ingest-doc on Aug 21, 2019

Conversation

@gianm
Contributor

commented Aug 15, 2019

TL;DR

A refresh of the documentation done in collaboration with @vogievetsky.

Check out a render at: https://staging-druid.imply.io/docs/design/

And compare to the current doc pages: https://druid.apache.org/docs/latest/design/

Description

This refresh has two main goals.

First, setting up Docusaurus:

  1. Use Docusaurus, a really nice documentation generator that creates much more beautiful pages than we have now. They are also more functional; each one has a left-hand collapsible table of contents showing the outline of the overall docs, and a right-hand table of contents showing the outline of that particular doc page. Compare to the old doc pages, where there's only one non-collapsible ToC, and it's grown so long that it's quite difficult to follow.
  2. Make various titles and headers more consistent. The inconsistencies have been around for a while, but became much easier to notice with the new Docusaurus ToCs.

Second, an ingestion doc refresh:

  1. Add a new ingestion/index.md doc that introduces all the key ingestion spec concepts and describes the most popular ingestion methods. It is meant to be an introduction to the world of Druid ingestion.
  2. Consolidate lots of disparate ingestion docs, which have grown organically over time and have become difficult to follow, into a simpler set of fewer, larger, more cross-referenced pages. Check out the new pages at: https://staging-druid.imply.io/docs/ingestion/index.html. Some prime examples of this consolidation are the new pages ingestion/data-management.md and ingestion/tasks.md, each of which represents multiple pages from the current set of docs.
  3. Make the ingestion pages a bit more opinionated, pushing new people towards Kafka, Kinesis, Hadoop, and native batch ingestion. They discuss Tranquility but don't present it as highly recommended (in keeping with its status as something that doesn't get a lot of love these days). And the few remaining references to Realtime Nodes have been redirected to a page that says they're gone now.
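To give a flavor of the spec concepts the new ingestion/index.md introduces, a native batch (`index_parallel`) ingestion spec is shaped roughly like the sketch below. This is an illustrative sketch, not text from the PR; the datasource name, input paths, and columns are made up:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "iso" },
          "dimensionsSpec": { "dimensions": ["channel", "page"] }
        }
      },
      "metricsSpec": [ { "type": "count", "name": "count" } ],
      "granularitySpec": { "segmentGranularity": "day", "queryGranularity": "hour" }
    },
    "ioConfig": {
      "type": "index_parallel",
      "firehose": { "type": "local", "baseDir": "/data/wikipedia", "filter": "*.json" }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

The three top-level sections (`dataSchema`, `ioConfig`, `tuningConfig`) are the recurring structure that the refreshed ingestion docs walk through.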

Other notes

I think we need to restore some of the _bin scripts that are still useful (but which ones?).

@gianm gianm added Area - Documentation WIP and removed WIP labels Aug 15, 2019
Contributor

left a comment

CI is failing because of missing _bin scripts

@@ -113,7 +110,7 @@ Note that the format of this blob can and will change from time-to-time.
### Rule Table

@ccaominh

ccaominh Aug 15, 2019

Contributor

The title case style here is not consistent with the changes you made above

@gianm

gianm Aug 19, 2019

Author Contributor

Fixed, thanks.


When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based). The following
table compares and contrasts the three batch ingestion options.

@ccaominh

ccaominh Aug 15, 2019

Contributor

This last sentence is slightly out of place. Perhaps merge it with the sentence on line 81?

@gianm

gianm Aug 19, 2019

Author Contributor

I deleted it and replaced the L81 sentence with:

This table compares the three available options:


### Example of rollup

For an example of how to configure rollup, and of what how the feature will modify your data, check out the

@ccaominh

ccaominh Aug 15, 2019

Contributor

Typo: of what how -> of how

@gianm

gianm Aug 19, 2019

Author Contributor

Fixed, thanks.

### Best-effort rollup

Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
time. Others offer _best-effort rollup_, meaming that input data might not be perfectly aggregated and thus there could

@ccaominh

ccaominh Aug 15, 2019

Contributor

Typo: meaming -> meaning

@gianm

gianm Aug 19, 2019

Author Contributor

Fixed, thanks.

In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
cases, records may end up in different segments that are received by different, non-shuffling tasks cannot be rolled

@ccaominh

ccaominh Aug 15, 2019

Contributor

This sentence needs to be reworded. Perhaps something like: non-shuffling tasks cannot be -> non-shuffling tasks and cannot be

@gianm

gianm Aug 19, 2019

Author Contributor

I replaced this sentence:

In both of these cases, records may end up in different segments that are received by different, non-shuffling tasks cannot be rolled up together.

With this one:

In both of these cases, records that could theoretically be rolled up may end up in different segments.


You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
dimension that you often filter by, if one exists. This will often improve compression — users have reported threefold

@ccaominh

ccaominh Aug 15, 2019

Contributor

Are the dashes rendered as em-dashes?

@gianm

gianm Aug 19, 2019

Author Contributor

Yes, this is rendered as an em dash.

@vogievetsky

Contributor

commented Aug 15, 2019

Hi all, as @gianm mentioned this work is a collaboration. I am going to be pushing to this branch ( https://github.com/implydata/druid/tree/ingest-doc ) to address feedback and fix up some remaining things about the build script.

vogievetsky added 3 commits Aug 15, 2019
@fjy

Contributor

commented Aug 16, 2019

+2

@clintropolis

Member

commented Aug 17, 2019

> I think we need to restore some of the _bin scripts that are still useful (but which ones?).

#8306 moves all the scripts related to performing an apache release into distribution/bin, which makes more sense anyway since they are not doc related scripts.

This is all that remains in docs/_bin:
[Screenshot: remaining files in docs/_bin]

If none of these are needed by the new docs, I believe they can all safely be deleted.

@gianm

Contributor Author

commented Aug 19, 2019

Just re-pushed.

],
"Configuration": [
"configuration/index",
"development/extensions",

@yurmix

yurmix Aug 19, 2019

Contributor

Very nice outline! Also, linking straight to .md is great.

One thing though: I see that the list of extensions (extensions.md) has been moved under Configuration. I think it's better to mention this change, and any other structural changes, in the release notes, to let existing users know and adjust.

@gianm

gianm Aug 19, 2019

Author Contributor

Ah yes, I did move it there, since I thought it made more sense there. (The Extensions page is written in a user-facing style, and so I don't think it belonged in the Development section.)

@yurmix

Contributor

commented Aug 19, 2019

Do you think you can find a way to control style/indentation for subcategories on the sidebar? I think the distinction between 2nd and 3rd levels (and so on) isn't clear enough.
The only thing I found is subcategory collapse, which should be available on v2 (currently early alpha): facebook/docusaurus#1352

@vogievetsky

Contributor

commented Aug 19, 2019

@yurmix I looked into Docusaurus v2 and it looks like it is not quite ready for prime time use which is why we did not use it here. Totally down to switch to it once it comes out (or at least becomes more stable).

As for the indentation, we can play with the CSS a bit to make it more obvious in the meantime, in a subsequent PR; right now we're trying to get this merged so we don't have to live in conflict hell.

@gianm gianm added the Release Notes label Aug 19, 2019
Contributor

left a comment

LGTM 👍

Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents
and provides access to raw event level data. Elasticsearch is increasingly adding more support for analytics and aggregations.
[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out

@jon-wei

jon-wei Aug 19, 2019

Contributor

search systems -> search system

@gianm

gianm Aug 20, 2019

Author Contributor

Fixed, thanks.


### Indexer process (optional)

[**MiddleManager**](../design/indexer.md) processes are an alternative to MiddleManagers and Peons. Instead of

@jon-wei

jon-wei Aug 19, 2019

Contributor

MiddleManager -> Indexer

@gianm

gianm Aug 20, 2019

Author Contributor

Fixed, thanks.


- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
you will achieve.
- Use [sketches](#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
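As a concrete illustration of that last bullet (this snippet is not from the PR, and it assumes the druid-datasketches extension is loaded; metric and column names are illustrative), replacing a raw high-cardinality column such as a user ID with a Theta sketch metric looks roughly like:

```json
"metricsSpec": [
  { "type": "count", "name": "count" },
  { "type": "thetaSketch", "name": "user_id_sketch", "fieldName": "user_id" }
]
```

The sketch supports approximate distinct counts at query time while keeping `user_id` out of the dimension list, which preserves rollup.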

@jon-wei

jon-wei Aug 19, 2019

Contributor

The #sketches link doesn't resolve

@gianm

gianm Aug 20, 2019

Author Contributor

Fixed, thanks.

a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
`timestampSpec`.

Druid currently includes one kind of builtin transform, the expression transform. It has the following syntax:
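The syntax block itself is not quoted in this hunk; for context, an expression transform inside a `transformSpec` generally takes this shape (the field names here are illustrative):

```json
"transformSpec": {
  "transforms": [
    { "type": "expression", "name": "fullPage", "expression": "concat(channel, '/', page)" }
  ]
}
```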

@jon-wei

jon-wei Aug 19, 2019

Contributor

builtin -> built-in

@gianm

gianm Aug 20, 2019

Author Contributor

Fixed, thanks.

@@ -487,10 +493,13 @@ If you have some tasks of a higher priority than others, you may set their
This may help the higher priority tasks to finish earlier than lower priority tasks
by assigning more task slots to them.

Local Index Task
----------------
## Simple task

@jon-wei

jon-wei Aug 19, 2019

Contributor

👍 on the new name

Contributor

left a comment

The new doc looks great! Thanks @gianm and @vogievetsky. Still reviewing.

is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
updates.
6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
stored safely in [deep storage](#deep-storage) (typically cloud storage, HDFS, or a shared filesystem). Your data can be

@jihoonson

jihoonson Aug 20, 2019

Contributor

Broken link.

@gianm

gianm Aug 20, 2019

Author Contributor

Fixed, thanks.

| **External dependencies** | None. | None. | Hadoop cluster (Druid submits Map/Reduce jobs). |
| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any [firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid datasource. |
| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Any Hadoop InputFormat. |
| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).| Only best-effort. Support for perfect rollup is coming in a future release. | Always perfect. |

@jihoonson

jihoonson Aug 20, 2019

Contributor

This table looks stale. Would you please update it to match master?

@gianm

gianm Aug 20, 2019

Author Contributor

I updated the "rollup modes" and "partitioning options" sections.


|Method|How it works|
|------|------------|
|[Native batch](native-batch.html)|`index` (non-parallel) tasks partition input files based on the `partitionDimensions` and `forceGuaranteedRollup` tuning configs. `index_parallel` tasks do not currently support user-defined partitioning.|

@jihoonson

jihoonson Aug 20, 2019

Contributor

Please update this table as well. index_parallel now supports user-defined partitioning.

@gianm

gianm Aug 20, 2019

Author Contributor

I updated it to say:

Configured using [`partitionsSpec`](native-batch.html#partitionsspec) inside the `tuningConfig`.

the same datasource, interval, and version, but have linearly increasing partition numbers.

```
foo_2015-01-01/2015-01-02_v1_0
```

@jihoonson

jihoonson Aug 20, 2019

Contributor

It could be better if we use a more realistic version than v1.

@gianm

gianm Aug 20, 2019

Author Contributor

Maybe, I'd rather change that later though (if we do at all) since this section was just copied and relocated from an existing document.

Member

left a comment

overall lgtm


|Method|How it works|
|------|------------|
|[Native batch](native-batch.html)|`index_parallel` type is best-effort. `index` type may be either perfect or best-effort, based on configuration.|

@clintropolis

clintropolis Aug 20, 2019

Member

this is no longer true, both can use perfect rollup

@gianm

gianm Aug 20, 2019

Author Contributor

Fixed, thanks.

@@ -73,16 +78,19 @@ if one of them fails.

You may want to consider the below things:

- This task does not support [perfect rollup](index.md#best-effort-rollup) because it does not shuffle

@clintropolis

clintropolis Aug 20, 2019

Member

this statement can be removed, obsolete

@gianm

gianm Aug 20, 2019

Author Contributor

Fixed, thanks.

vogievetsky and others added 5 commits Aug 20, 2019
* add clear filter

* update tool kit

* remove usless check

* auto run

* add %
* Fix resource leak

* Patch comments
@gianm

Contributor Author

commented Aug 20, 2019

I've pushed an update reflecting the comments above, with all broken links and anchors fixed, and re-uploaded a render to https://staging-druid.imply.io/docs/design/index.html. I've also restored some subheaders that were accidentally deleted.

@gianm

Contributor Author

commented Aug 20, 2019

> I've pushed an update reflecting the comments above, with all broken links and anchors fixed, and re-uploaded a render to https://staging-druid.imply.io/docs/design/index.html. I've also restored some subheaders that were accidentally deleted.

I spoke too soon; the broken link checker built into docusaurus wasn't catching all the broken links (it only checked .md links, not .html links). Will need to fix this and re-push.

@lgtm-com


commented Aug 20, 2019

This pull request introduces 1 alert when merging 8554133 into e2a25fb - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class
@lgtm-com


commented Aug 20, 2019

This pull request introduces 1 alert when merging 8f98e00 into 6fa22f6 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class
.travis.yml Outdated
@@ -164,6 +164,10 @@ matrix:
script:
- $MVN test -pl 'web-console'

- name: "docs"
install: cd website && npm install
script: cd website && npm run lint

@ccaominh

ccaominh Aug 20, 2019

Contributor

From the travis log, it looks like the `cd website` done in the install step stays in effect during the script step. Some alternatives are to use pushd/popd or a subshell.
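The subshell alternative could look roughly like this in .travis.yml (a sketch of the suggestion, not necessarily the change that was merged):

```yaml
- name: "docs"
  # Parentheses run each step in a subshell, so the `cd` does not leak
  # into later steps.
  install: (cd website && npm install)
  script: (cd website && npm run lint)
```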

vogievetsky and others added 4 commits Aug 20, 2019
@lgtm-com


commented Aug 21, 2019

This pull request introduces 1 alert when merging 39af0ef into d5a1967 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class
@gianm

Contributor Author

commented Aug 21, 2019

Pushed again to resolve some conflicts, fixed more broken links and anchors found by the new script npm run lint, and uploaded another render (including redirects!) here: https://staging-druid.imply.io/docs/

@lgtm-com


commented Aug 21, 2019

This pull request introduces 1 alert when merging 1002b3e into d5a1967 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class
Contributor

left a comment

Looks like there are still some broken links.

## Ingestion specs

No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.html) or
ongoing [supervisors](supervisors.html). In any case, part of the task or supervisor definition is an

@jihoonson

jihoonson Aug 21, 2019

Contributor

Broken link.

@gianm

gianm Aug 21, 2019

Author Contributor

Fixed, thanks.

```

Note that the CSV and TSV data do not contain column heads. This becomes important when you specify the data for ingesting.

## Custom Formats

Druid supports custom data formats and can use the `Regex` parser or the `JavaScript` parsers to parse these formats. Please note that using any of these parsers for
parsing data will not be as efficient as writing a native Java parser or using an external stream processor. We welcome contributions of new Parsers.
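The actual parseSpec is not shown in this hunk; for context, a regex-based parseSpec is generally shaped like the sketch below (the pattern and column names are illustrative):

```json
"parser": {
  "type": "string",
  "parseSpec": {
    "format": "regex",
    "pattern": "^(\\d+)\\t(\\w+)$",
    "columns": ["timestamp", "channel"],
    "timestampSpec": { "column": "timestamp", "format": "millis" },
    "dimensionsSpec": { "dimensions": ["channel"] }
  }
}
```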

## Configuration

@jihoonson

jihoonson Aug 21, 2019

Contributor

This is not introduced in this pr, but would you please fix the broken link below as well? Druid can automatically flatten it for you is broken.

@gianm

gianm Aug 21, 2019

Author Contributor

Thanks, fixed this too (and now it's being detected by the broken link checker).

vogievetsky and others added 3 commits Aug 21, 2019
@vogievetsky

Contributor

commented Aug 21, 2019

OMG! It looks good. It even passed my super strict broken link linter! Someone merge this before there are more doc conflicts.

@gianm gianm merged commit d007477 into apache:master Aug 21, 2019
5 of 6 checks passed

- LGTM analysis: Java - No code changes detected
- Inspections: pull requests (Druid) - TeamCity build finished
- LGTM analysis: JavaScript - No new or fixed alerts
- LGTM analysis: Python - No new or fixed alerts
- continuous-integration/travis-ci/pr - The Travis CI build passed
- coverage/coveralls - Coverage increased (+6.5%) to 71.93%
@gianm gianm deleted the implydata:ingest-doc branch Aug 21, 2019
@leventov

Member

commented Aug 23, 2019

Did you use some regular expression to replace .html -> .md links? Is that on purpose that links with id (like ../configuration/index.html#dynamic-configuration) remain with .html?

@clintropolis clintropolis added this to the 0.16.0 milestone Aug 23, 2019
@gianm

Contributor Author

commented Aug 23, 2019

> Did you use some regular expression to replace .html -> .md links? Is that on purpose that links with id (like ../configuration/index.html#dynamic-configuration) remain with .html?

The .md links are a little nicer since they will make links work when the source is viewed in GitHub, and will generate to the exact same link on the live site. I think it'd be fine to replace the ones with anchor ids too. I think I replaced a few by hand but didn't do all of them.
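A hypothetical one-liner in the spirit of this discussion (not the actual script used in the PR) that rewrites intra-doc .html links, including ones carrying #anchors, to .md:

```python
import re

def md_ify(markdown: str) -> str:
    # Rewrite links like (tasks.html) or (index.html#anchor) to their
    # .md equivalents, preserving any anchor fragment.
    return re.sub(
        r"\.html(#[A-Za-z0-9_-]*)?\)",
        lambda m: ".md" + (m.group(1) or "") + ")",
        markdown,
    )
```

For example, `md_ify("[t](tasks.html)")` yields `[t](tasks.md)`, and an anchored link like `../configuration/index.html#dynamic-configuration` keeps its fragment.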
