Docusaurus build framework + ingestion doc refresh. #8311
CI is failing because of missing `_bin` scripts.
@@ -113,7 +110,7 @@ Note that the format of this blob can and will change from time-to-time.
### Rule Table
The title case style here is not consistent with the changes you made above
Fixed, thanks.
docs/ingestion/index.md
Outdated
When doing batch loads from files, you should use one-time [tasks](tasks.md), and you have three options: `index`
(native batch; single-task), `index_parallel` (native batch; parallel), or `index_hadoop` (Hadoop-based). The following
table compares and contrasts the three batch ingestion options.
This last sentence is slightly out of place. Perhaps merge it with the sentence on line 81?
I deleted it and replaced the L81 sentence with:
This table compares the three available options:
docs/ingestion/index.md
Outdated
### Example of rollup

For an example of how to configure rollup, and of what how the feature will modify your data, check out the
Typo: of what how -> of how
Fixed, thanks.
docs/ingestion/index.md
Outdated
### Best-effort rollup

Some Druid ingestion methods guarantee _perfect rollup_, meaning that input data are perfectly aggregated at ingestion
time. Others offer _best-effort rollup_, meaming that input data might not be perfectly aggregated and thus there could
Typo: meaming -> meaning
Fixed, thanks.
docs/ingestion/index.md
Outdated
In general, ingestion methods that offer best-effort rollup do this because they are either parallelizing ingestion
without a shuffling step (which would be required for perfect rollup), or because they are finalizing and publishing
segments before all data for a time chunk has been received, which we call _incremental publishing_. In both of these
cases, records may end up in different segments that are received by different, non-shuffling tasks cannot be rolled
This sentence needs to be reworded. Perhaps something like: non-shuffling tasks cannot be -> non-shuffling tasks and cannot be
I replaced this sentence:
In both of these cases, records may end up in different segments that are received by different, non-shuffling tasks cannot be rolled up together.
With this one:
In both of these cases, records that could theoretically be rolled up may end up in different segments.
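To make the distinction in this thread concrete, here is a minimal Python sketch of why non-shuffling parallel tasks give only best-effort rollup. It is an illustration, not Druid code: the `rollup` helper, the sample rows, and the two-task split are all assumptions made up for the example.

```python
from collections import Counter


def rollup(rows):
    """Aggregate rows into counts, one output row per distinct (time, dimension) key."""
    return Counter(rows)


# Hypothetical input: (truncated timestamp, dimension value) pairs.
rows = [
    ("2019-01-01T00", "a"),
    ("2019-01-01T00", "a"),
    ("2019-01-01T00", "b"),
    ("2019-01-01T00", "a"),
]

# Perfect rollup: every row for the time chunk is aggregated together.
perfect = rollup(rows)

# Best-effort rollup: two tasks each see part of the input and do not shuffle,
# so each one rolls up only the rows it received.
task_1 = rollup(rows[:2])
task_2 = rollup(rows[2:])
best_effort_row_count = len(task_1) + len(task_2)

print(len(perfect), best_effort_row_count)
```

The totals still agree once the per-task results are combined at query time; the only difference is that the key `("2019-01-01T00", "a")` is stored in two places instead of one.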
docs/ingestion/index.md
Outdated
quickly.

You will usually get the best performance and smallest overall footprint by partitioning your data on some "natural"
dimension that you often filter by, if one exists. This will often improve compression — users have reported threefold
Are the dashes rendered as em-dashes?
Yes, this is rendered as an em dash.
Hi all, as @gianm mentioned this work is a collaboration. I am going to be pushing to this branch ( https://github.com/implydata/druid/tree/ingest-doc ) to address feedback and fix up some remaining things about the build script.
+2
#8306 moves all the scripts related to performing an apache release into This is all that remains in which if none are necessary to the new docs can all safely be deleted I believe.
Just re-pushed.
],
"Configuration": [
"configuration/index",
"development/extensions",
Very nice outline! Also, linking straight to .md is great.
One thing though: I see that the list of extensions (extensions.md) has been moved under Configuration. I think it's better to mention in the release note this change and any other structural change, to let existing users know and adjust.
Ah yes, I did move it there, since I thought it made more sense there. (The Extensions page is written in a user-facing style, and so I don't think it belonged in the Development section.)
Do you think you can find a way to control style/indentation for subcategories on the sidebar? I think the distinction between 2nd and 3rd levels (and so on) isn't clear enough.
@yurmix I looked into Docusaurus v2 and it looks like it is not quite ready for prime time use which is why we did not use it here. Totally down to switch to it once it comes out (or at least becomes more stable). As for the indentation we can play with the CSS a bit to make it more obvious in the meantime - in a subsequent PR, trying to get this merged so we do not have to live in conflict hell.
LGTM 👍
Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents
and provides access to raw event level data. Elasticsearch is increasingly adding more support for analytics and aggregations.
[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out
Elasticsearch is a search systems based on Apache Lucene. It provides full text search for schema-free documents
search systems -> search system
Fixed, thanks.
docs/design/processes.md
Outdated
### Indexer process (optional)

[**MiddleManager**](../design/indexer.md) processes are an alternative to MiddleManagers and Peons. Instead of
MiddleManager -> Indexer
Fixed, thanks.
docs/ingestion/index.md
Outdated
- Generally, the fewer dimensions you have, and the lower the cardinality of your dimensions, the better rollup ratios
you will achieve.
- Use [sketches](#sketches) to avoid storing high cardinality dimensions, which harm rollup ratios.
The `#sketches` link doesn't resolve.
Fixed, thanks.
docs/ingestion/index.md
Outdated
a millisecond timestamp (number of milliseconds since Jan 1, 1970 at midnight UTC). Transforms are applied _after_ the
`timestampSpec`.

Druid currently includes one kind of builtin transform, the expression transform. It has the following syntax:
builtin -> built-in
Fixed, thanks.
@@ -487,10 +493,13 @@ If you have some tasks of a higher priority than others, you may set their
This may help the higher priority tasks to finish earlier than lower priority tasks
by assigning more task slots to them.

Local Index Task
----------------
## Simple task
👍 on the new name
The new doc looks great! Thanks @gianm and @vogievetsky. Still reviewing.
docs/design/index.md
Outdated
is designed to run 24/7 with no need for planned downtimes for any reason, including configuration changes and software
updates.
6. **Cloud-native, fault-tolerant architecture that won't lose data.** Once Druid has ingested your data, a copy is
stored safely in [deep storage](#deep-storage) (typically cloud storage, HDFS, or a shared filesystem). Your data can be
Broken link.
Fixed, thanks.
docs/ingestion/index.md
Outdated
| **External dependencies** | None. | None. | Hadoop cluster (Druid submits Map/Reduce jobs). |
| **Input locations** | Any [firehose](native-batch.md#firehoses). | Any [firehose](native-batch.md#firehoses). | Any Hadoop FileSystem or Druid datasource. |
| **File formats** | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Text file formats (CSV, TSV, JSON). Support for binary formats is coming in a future release. | Any Hadoop InputFormat. |
| **[Rollup modes](#rollup)** | Perfect if `forceGuaranteedRollup` = true in the [`tuningConfig`](native-batch.md#tuningconfig).| Only best-effort. Support for perfect rollup is coming in a future release. | Always perfect. |
This table looks like it has gone stale. Would you please update it to match master?
I updated the "rollup modes" and "partitioning options" sections.
docs/ingestion/index.md
Outdated
|Method|How it works|
|------|------------|
|[Native batch](native-batch.html)|`index` (non-parallel) tasks partition input files based on the `partitionDimensions` and `forceGuaranteedRollup` tuning configs. `index_parallel` tasks do not currently support user-defined partitioning.|
Please update this table as well. `index_parallel` now supports user-defined partitioning.
I updated it to say:
Configured using [`partitionsSpec`](native-batch.html#partitionsspec) inside the `tuningConfig`.
the same datasource, interval, and version, but have linearly increasing partition numbers.

```
foo_2015-01-01/2015-01-02_v1_0
It could be better if we use a more realistic version than `v1`.
Maybe, I'd rather change that later though (if we do at all) since this section was just copied and relocated from an existing document.
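For readers following the naming scheme in the snippet under discussion, a small Python sketch can show how identifiers that share a datasource, interval, and version differ only by partition number. The `segment_ids` helper is hypothetical, and the underscore-joined format simply mirrors the `foo_2015-01-01/2015-01-02_v1_0` example quoted above; it is not taken from Druid's code.

```python
def segment_ids(datasource, interval, version, num_partitions):
    """Build illustrative segment identifiers with linearly increasing partition numbers."""
    return [
        f"{datasource}_{interval}_{version}_{partition}"
        for partition in range(num_partitions)
    ]


# Three segments in the same time chunk, same version, partitions 0..2.
ids = segment_ids("foo", "2015-01-01/2015-01-02", "v1", 3)
for segment_id in ids:
    print(segment_id)
```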
overall lgtm
|Method|How it works|
|------|------------|
|[Native batch](native-batch.html)|`index_parallel` type is best-effort. `index` type may be either perfect or best-effort, based on configuration.|
This is no longer true; both can use perfect rollup.
Fixed, thanks.
docs/ingestion/native-batch.md
Outdated
@@ -73,16 +78,19 @@ if one of them fails.

You may want to consider the below things:

- This task does not support [perfect rollup](index.md#best-effort-rollup) because it does not shuffle
This statement is obsolete and can be removed.
Fixed, thanks.
* add clear filter * update tool kit * remove usless check * auto run * add %
* Fix resource leak * Patch comments
I've pushed an update reflecting the comments above, with all broken links and anchors fixed, and re-uploaded a render to https://staging-druid.imply.io/docs/design/index.html. I've also restored some subheaders that were accidentally deleted.
I spoke too soon; the broken link checker built into docusaurus wasn't catching all the broken links (it only checked .md links, not .html links). Will need to fix this and re-push.
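As a rough illustration of the gap described here, the following Python sketch checks relative links with both `.md` and `.html` suffixes against a set of known doc paths. It is an assumption-laden sketch, not the Docusaurus checker and not the script added in this PR: the `broken_links` function, the link regex, and the idea of normalizing `.html` targets to `.md` are all invented for the example.

```python
import posixpath
import re

# Matches inline Markdown links and captures the target, e.g. [tasks](tasks.html).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(doc_path, text, known_docs):
    """Return relative link targets in `text` that do not resolve to a known doc."""
    broken = []
    base_dir = posixpath.dirname(doc_path)
    for target in LINK_RE.findall(text):
        if "://" in target or target.startswith("#"):
            continue  # skip external links and same-page anchors
        path = target.split("#", 1)[0]
        # Treat .html and .md links alike by normalizing to the .md source file.
        if path.endswith(".html"):
            path = path[: -len(".html")] + ".md"
        resolved = posixpath.normpath(posixpath.join(base_dir, path))
        if resolved not in known_docs:
            broken.append(target)
    return broken


known = {"ingestion/index.md", "ingestion/tasks.md", "design/indexer.md"}
sample = (
    "[tasks](tasks.html) [supervisors](supervisors.html) "
    "[indexer](../design/indexer.md) [site](https://example.com) [rollup](#rollup)"
)
result = broken_links("ingestion/index.md", sample, known)
print(result)
```

A checker that skipped `.html` targets would report nothing for this sample, which is the failure mode the comment describes.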
This pull request introduces 1 alert when merging 8554133 into e2a25fb - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 8f98e00 into 6fa22f6 - view on LGTM.com new alerts:
.travis.yml
Outdated
@@ -164,6 +164,10 @@ matrix:
script:
- $MVN test -pl 'web-console'

- name: "docs"
install: cd website && npm install
script: cd website && npm run lint
From the Travis log, it looks like the `cd website` done in the `install` step stays in effect when the `script` step runs. Some alternatives are to use `pushd`/`popd` or a subshell.
This pull request introduces 1 alert when merging 39af0ef into d5a1967 - view on LGTM.com new alerts:
Pushed again to resolve some conflicts, and fixed more broken links and anchors found by the new script.
This pull request introduces 1 alert when merging 1002b3e into d5a1967 - view on LGTM.com new alerts:
Looks like there are still some broken links.
docs/ingestion/index.md
Outdated
## Ingestion specs

No matter what ingestion method you use, data is loaded into Druid using either one-time [tasks](tasks.html) or
ongoing [supervisors](supervisors.html). In any case, part of the task or supervisor definition is an
Broken link.
Fixed, thanks.
```

Note that the CSV and TSV data do not contain column heads. This becomes important when you specify the data for ingesting.

## Custom Formats

Druid supports custom data formats and can use the `Regex` parser or the `JavaScript` parsers to parse these formats. Please note that using any of these parsers for
Druid supports custom data formats and can use the `Regex` parser or the `JavaScript` parsers to parse these formats. Please note that using any of these parsers for
parsing data will not be as efficient as writing a native Java parser or using an external stream processor. We welcome contributions of new Parsers.

## Configuration
This is not introduced in this PR, but would you please fix the broken link below as well? The "Druid can automatically flatten it for you" link is broken.
Thanks, fixed this too (and now it's being detected by the broken link checker).
OMG! It looks good. It even passed my super strict broken link linter! Someone merge before there are more doc conflicts.
Did you use some regular expression to replace
The
TL;DR
A refresh of the documentation done in collaboration with @vogievetsky.
Check out a render at: https://staging-druid.imply.io/docs/design/
And compare to the current doc pages: https://druid.apache.org/docs/latest/design/
Description
This refresh has two main goals.
First, setting up Docusaurus:
Second, an ingestion doc refresh:
- `ingestion/index.md` doc that introduces all the key ingestion spec concepts, and describes the most popular ingestion methods. It is meant to be an introduction to the world of Druid ingestion.
- `ingestion/data-management.md` and `ingestion-tasks.md`, which represent multiple pages from the current set of docs.
Other notes
I think we need to restore some of the `_bin` scripts that are still useful (but which ones?).