Docusaurus build framework + ingestion doc refresh. #8311

Merged
merged 30 commits into from
Aug 21, 2019

Conversation

gianm
Contributor

@gianm gianm commented Aug 15, 2019

TL;DR

A refresh of the documentation done in collaboration with @vogievetsky.

Check out a render at: https://staging-druid.imply.io/docs/design/

And compare to the current doc pages: https://druid.apache.org/docs/latest/design/

Description

This refresh has two main goals.

First, setting up Docusaurus:

  1. Use Docusaurus, a really nice documentation generator that creates much more beautiful pages than we have now. They are also more functional; each one has a left-hand collapsible table of contents showing the outline of the overall docs, and a right-hand table of contents showing the outline of that particular doc page. Compare to the old doc pages, where there's only one non-collapsible ToC, and it's grown so long that it's quite difficult to follow.
  2. Make various titles and headers more consistent. The inconsistencies have been around for a while, but became much easier to notice with the new Docusaurus ToCs.

Second, an ingestion doc refresh:

  1. Include a new ingestion/index.md doc that introduces all the key ingestion spec concepts and describes the most popular ingestion methods. It is meant to be an introduction to the world of Druid ingestion.
  2. Consolidate lots of disparate ingestion docs, which have grown organically over time and have become difficult to follow, into a simpler set of fewer, larger, more cross-referenced pages. Check out the new pages at: https://staging-druid.imply.io/docs/ingestion/index.html. Some prime examples of this consolidation are the new pages ingestion/data-management.md and ingestion-tasks.md, which represent multiple pages from the current set of docs.
  3. Have a bit more 'opinionatedness' in the ingestion pages, pushing new people towards Kafka, Kinesis, Hadoop, and native batch ingestion. They discuss Tranquility but don't present it as something highly recommended (in keeping with its status as something that doesn't get a lot of love these days). And the few remaining references to Realtime Nodes have been redirected to a page that says they're gone now.

Other notes

I think we need to restore some of the _bin scripts that are still useful (but which ones?).

Contributor

@ccaominh ccaominh left a comment

CI is failing because of missing _bin scripts

@@ -113,7 +110,7 @@ Note that the format of this blob can and will change from time-to-time.
### Rule Table
Contributor

The title case style here is not consistent with the changes you made above

Contributor Author

Fixed, thanks.


When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based). The following
table compares and contrasts the three batch ingestion options.
Contributor

This last sentence is slightly out of place. Perhaps merge it with the sentence on line 81?

Contributor Author

I deleted it and replaced the L81 sentence with:

This table compares the three available options:


### Example of rollup

For an example of how to configure rollup, and of what how the feature will modify your data, check out the
Contributor

Typo: of what how -> of how

Contributor Author

Fixed, thanks.

### Best-effort rollup

Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
time. Others offer _best-effort rollup_, meaming that input data might not be perfectly aggregated and thus there could
Contributor

Typo: meaming -> meaning

Contributor Author

Fixed, thanks.

In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
cases, records may end up in different segments that are received by different, non-shuffling tasks cannot be rolled
Contributor

This sentence needs to be reworded. Perhaps something like: non-shuffling tasks cannot be -> non-shuffling tasks and cannot be

Contributor Author

I replaced this sentence:

In both of these cases, records may end up in different segments that are received by different, non-shuffling tasks cannot be rolled up together.

With this one:

In both of these cases, records that could theoretically be rolled up may end up in different segments.
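To make the distinction concrete, here is a minimal sketch in shell and awk (not Druid code). It assumes a made-up three-column CSV of hour-truncated timestamp, one dimension, and a count. The `rollup` function and the two "tasks" are hypothetical stand-ins for ingestion tasks that do not shuffle data between them.

```sh
#!/bin/sh
# Hypothetical rollup: sum the count column per (timestamp, dimension) key.
rollup() {
  awk -F, '{ sum[$1 "," $2] += $3 } END { for (k in sum) print k "," sum[k] }' | sort
}

# Rows received by two separate, non-shuffling tasks (made-up data).
task_a='2019-08-15T00,us,1
2019-08-15T00,us,2'
task_b='2019-08-15T00,us,4'

# Best-effort rollup: each task only aggregates the rows it received itself.
best_effort=$(
  printf '%s\n' "$task_a" | rollup
  printf '%s\n' "$task_b" | rollup
)

# Perfect rollup: all rows for the time chunk are brought together first.
perfect=$(printf '%s\n%s\n' "$task_a" "$task_b" | rollup)

echo "best-effort:"; echo "$best_effort"
echo "perfect:";     echo "$perfect"
```

In this sketch `best_effort` keeps two rows for the same key (counts 3 and 4), while `perfect` holds a single row (count 7). Summing at query time gives the same total either way; the best-effort output simply stores more rows.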

quickly.

You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
dimension that you often filter by, if one exists. This will often improve compression — users have reported threefold
Contributor

Are the dashes rendered as em-dashes?

Contributor Author

Yes, this is rendered as an em dash.

@vogievetsky
Contributor

Hi all, as @gianm mentioned this work is a collaboration. I am going to be pushing to this branch ( https://github.com/implydata/druid/tree/ingest-doc ) to address feedback and fix up some remaining things about the build script.

@fjy
Contributor

fjy commented Aug 16, 2019

+2

@clintropolis
Member

> I think we need to restore some of the _bin scripts that are still useful (but which ones?).

#8306 moves all the scripts related to performing an apache release into distribution/bin, which makes more sense anyway since they are not doc related scripts.

This is all that remains in docs/_bin:
[Screenshot: remaining contents of docs/_bin]

which if none are necessary to the new docs can all safely be deleted I believe.

@gianm
Contributor Author

gianm commented Aug 19, 2019

Just re-pushed.

],
"Configuration": [
"configuration/index",
"development/extensions",
Contributor

Very nice outline! Also, linking straight to .md is great.

One thing though: I see that the list of extensions (extensions.md) has been moved under Configuration. I think it would be better to mention this change, and any other structural changes, in the release notes so that existing users know and can adjust.

Contributor Author

Ah yes, I did move it there, since I thought it made more sense there. (The Extensions page is written in a user-facing style, and so I don't think it belonged in the Development section.)

@yurmix
Contributor

yurmix commented Aug 19, 2019

Do you think you can find a way to control style/indentation for subcategories on the sidebar? I think the distinction between 2nd and 3rd levels (and so on) isn't clear enough.
The only thing I found is subcategory collapse, which should be available on v2 (currently early alpha): facebook/docusaurus#1352

@vogievetsky
Contributor

@yurmix I looked into Docusaurus v2 and it looks like it is not quite ready for prime time use which is why we did not use it here. Totally down to switch to it once it comes out (or at least becomes more stable).

As for the indentation we can play with the CSS a bit to make it more obvious in the meantime - in a subsequent PR, trying to get this merged so we do not have to live in conflict hell.

Contributor

@ccaominh ccaominh left a comment

LGTM 👍

Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents
and provides access to raw event level data. Elasticsearch is increasingly adding more support for analytics and aggregations.
[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out
Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents
Contributor

search systems -> search system

Contributor Author

Fixed, thanks.


### Indexer process (optional)

[**MiddleManager**](../design/indexer.md) processes are an alternative to MiddleManagers and Peons. Instead of
Contributor

MiddleManager -> Indexer

Contributor Author

Fixed, thanks.


- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
you will achieve.
- Use [sketches](#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
Contributor

The #sketches link doesn't resolve

Contributor Author

Fixed, thanks.

a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
`timestampSpec`.

Druid currently includes one kind of builtin transform, the expression transform. It has the following syntax:
Contributor

builtin -> built-in

Contributor Author

Fixed, thanks.

@@ -487,10 +493,13 @@ If you have some tasks of a higher priority than others, you may set their
This may help the higher priority tasks to finish earlier than lower priority tasks
by assigning more task slots to them.

Local Index Task
----------------
## Simple task
Contributor

👍 on the new name

Contributor

@jihoonson jihoonson left a comment

The new doc looks great! Thanks @gianm and @vogievetsky. Still reviewing.

is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
updates.
6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
stored safely in [deep storage](#deep-storage) (typically cloud storage, HDFS, or a shared filesystem). Your data can be
Contributor

Broken link.

Contributor Author

Fixed, thanks.

| **External dependencies** | None. | None. | Hadoop cluster (Druid submits Map/Reduce jobs). |
| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any [firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid datasource. |
| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Any Hadoop InputFormat. |
| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).| Only best-effort. Support for perfect rollup is coming in a future release. | Always perfect. |
Contributor

This table looks like it has gone stale. Would you please update it to match what is in master?

Contributor Author

I updated the "rollup modes" and "partitioning options" sections.


|Method|How it works|
|------|------------|
|[Native batch](native-batch.html)|`index` (non-parallel) tasks partition input files based on the `partitionDimensions` and `forceGuaranteedRollup` tuning configs. `index_parallel` tasks do not currently support user-defined partitioning.|
Contributor

Please update this table as well. index_parallel now supports user-defined partitioning.

Contributor Author

I updated it to say:

Configured using [`partitionsSpec`](native-batch.html#partitionsspec) inside the `tuningConfig`.

the same datasource, interval, and version, but have linearly increasing partition numbers.

```
foo_2015-01-01/2015-01-02_v1_0
Contributor

It might be better to use a more realistic version than v1.

Contributor Author

Maybe, I'd rather change that later though (if we do at all) since this section was just copied and relocated from an existing document.

Member

@clintropolis clintropolis left a comment

overall lgtm


|Method|How it works|
|------|------------|
|[Native batch](native-batch.html)|`index_parallel` type is best-effort. `index` type may be either perfect or best-effort, based on configuration.|
Member

this is no longer true, both can use perfect rollup

Contributor Author

Fixed, thanks.

@@ -73,16 +78,19 @@ if one of them fails.

You may want to consider the below things:

- This task does not support [perfect rollup](index.md#best-effort-rollup) because it does not shuffle
Member

this statement can be removed, obsolete

Contributor Author

Fixed, thanks.

@gianm
Contributor Author

gianm commented Aug 20, 2019

I've pushed an update reflecting the comments above, with all broken links and anchors fixed, and re-uploaded a render to https://staging-druid.imply.io/docs/design/index.html. I've also restored some subheaders that were accidentally deleted.

@gianm
Contributor Author

gianm commented Aug 20, 2019

> I've pushed an update reflecting the comments above, with all broken links and anchors fixed, and re-uploaded a render to https://staging-druid.imply.io/docs/design/index.html. I've also restored some subheaders that were accidentally deleted.

I spoke too soon; the broken link checker built into docusaurus wasn't catching all the broken links (it only checked .md links, not .html links). Will need to fix this and re-push.
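A check that also covers .html links could start from something like the sketch below. The function name `find_html_links` and the regular expression are assumptions made for illustration; this is not the checker used in this PR. It lists relative Markdown links that end in .html, with or without an anchor, and skips absolute URLs by refusing any target that contains a colon.

```sh
#!/bin/sh
# Hypothetical helper: print file:line:match for every relative Markdown link
# whose target ends in .html, optionally followed by a #anchor.
find_html_links() {
  grep -rnoE '\]\([^):]*\.html(#[^)]*)?\)' "$1"
}

# Example run against a scratch directory with made-up content.
docs=$(mktemp -d)
printf '%s\n' \
  'See [tasks](tasks.html) and [conf](../configuration/index.html#dynamic-configuration).' \
  'Also [ok](tasks.md) and [ext](https://example.com/a.html).' > "$docs/a.md"

find_html_links "$docs"
```

Only the two relative .html links on the first line are reported; the .md link and the absolute URL are left alone.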

@lgtm-com

lgtm-com bot commented Aug 20, 2019

This pull request introduces 1 alert when merging 8554133 into e2a25fb - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class

@lgtm-com

lgtm-com bot commented Aug 20, 2019

This pull request introduces 1 alert when merging 8f98e00 into 6fa22f6 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class

.travis.yml Outdated
@@ -164,6 +164,10 @@ matrix:
script:
- $MVN test -pl 'web-console'

- name: "docs"
install: cd website && npm install
script: cd website && npm run lint
Contributor

From the Travis log, it looks like the cd website done in the install step stays in effect when the script step runs. Some alternatives are to use pushd/popd or a subshell.
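The difference between a plain `cd` and a subshell can be seen in a small standalone sketch. It assumes a scratch directory with a hypothetical `website` folder and does not run npm; the variable names are made up for illustration.

```sh
#!/bin/sh
# Hypothetical layout: a scratch directory containing a "website" folder.
work=$(mktemp -d)
mkdir "$work/website"
cd "$work"
start_dir=$(pwd)

# A plain `cd` changes the directory of the current shell, so it is still in
# effect for every command that runs afterwards in the same session.
cd website
after_plain_cd=$(pwd)
cd "$start_dir"

# A subshell (the parentheses) confines the `cd` to the commands inside it.
( cd website && ls >/dev/null )
after_subshell=$(pwd)

echo "after plain cd: $after_plain_cd"
echo "after subshell: $after_subshell"
```

Applied to the snippet under review, that would presumably mean writing the steps as `(cd website && npm install)` and `(cd website && npm run lint)`, or keeping the `cd` in the install step only and dropping it from the script step.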

@lgtm-com

lgtm-com bot commented Aug 21, 2019

This pull request introduces 1 alert when merging 39af0ef into d5a1967 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class

@gianm
Contributor Author

gianm commented Aug 21, 2019

Pushed again to resolve some conflicts, fixed more broken links and anchors found by the new script npm run lint, and uploaded another render (including redirects!) here: https://staging-druid.imply.io/docs/

@lgtm-com

lgtm-com bot commented Aug 21, 2019

This pull request introduces 1 alert when merging 1002b3e into d5a1967 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class

Contributor

@jihoonson jihoonson left a comment

Looks like there are still some broken links.

## Ingestion specs

No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.html) or
ongoing [supervisors](supervisors.html). In any case, part of the task or supervisor definition is an
Contributor

Broken link.

Contributor Author

Fixed, thanks.

```

Note that the CSV and TSV data do not contain column heads. This becomes important when you specify the data for ingesting.

## Custom Formats

Druid supports custom data formats and can use the `Regex` parser or the `JavaScript` parsers to parse these formats. Please note that using any of these parsers for
parsing data will not be as efficient as writing a native Java parser or using an external stream processor. We welcome contributions of new Parsers.

## Configuration
Contributor

This was not introduced in this PR, but would you please fix the broken link below as well? The link "Druid can automatically flatten it for you" is broken.

Contributor Author

Thanks, fixed this too (and now it's being detected by the broken link checker).

@vogievetsky
Contributor

OMG! it looks good. It even passed my super strict broken link linter! Someone merge before there are more doc conflicts.

@gianm gianm merged commit d007477 into apache:master Aug 21, 2019
@gianm gianm deleted the ingest-doc branch August 21, 2019 04:49
@leventov
Member

Did you use some regular expression to replace .html -> .md links? Is that on purpose that links with id (like ../configuration/index.html#dynamic-configuration) remain with .html?

@clintropolis clintropolis added this to the 0.16.0 milestone Aug 23, 2019
@gianm
Contributor Author

gianm commented Aug 23, 2019

> Did you use some regular expression to replace .html -> .md links? Is that on purpose that links with id (like ../configuration/index.html#dynamic-configuration) remain with .html?

The .md links are a little nicer since they will make links work when the source is viewed in GitHub, and will generate to the exact same link on the live site. I think it'd be fine to replace the ones with anchor ids too. I think I replaced a few by hand but didn't do all of them.
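For the remaining links with anchor ids, a replacement along these lines might work. This is only a sketch: the function name `html_to_md` and the sed expression are assumptions, not what was actually run for this PR. It rewrites relative Markdown link targets from .html to .md, keeps any #anchor, and leaves absolute URLs untouched by refusing targets that contain a colon.

```sh
#!/bin/sh
# Hypothetical filter: rewrite relative "(path.html#anchor)" link targets to
# "(path.md#anchor)"; targets containing ":" (absolute URLs) are not matched.
html_to_md() {
  sed -E 's/\]\(([^):#]*)\.html(#[^)]*)?\)/](\1.md\2)/g'
}

sample='[a](../configuration/index.html#dynamic-configuration) [b](tasks.html) [c](https://druid.apache.org/docs/latest/design/index.html)'
converted=$(printf '%s\n' "$sample" | html_to_md)
echo "$converted"
```

Running it as a filter first and reviewing the diff before overwriting any files would keep a mistaken match from slipping through.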
