Merged
15 changes: 14 additions & 1 deletion .travis.yml
@@ -247,7 +247,20 @@ jobs:

- name: "docs"
install: (cd website && npm install)
script: (cd website && npm run lint)
script: (cd website && npm run lint && npm run spellcheck)
after_failure: |-
echo "FAILURE EXPLANATION:

If there are spell check errors:

1) Suppressing False Positives: Edit website/.spelling to add suppressions. Instructions
are at the top of the file and explain how to suppress false positives either globally or
within a particular file.

2) Running Spell Check Locally: cd website && npm install && npm run spellcheck

For more information, refer to: https://www.npmjs.com/package/markdown-spellcheck
"

- &integration_batch_index
name: "batch index integration test"
2 changes: 1 addition & 1 deletion docs/comparisons/druid-vs-kudu.md
@@ -35,5 +35,5 @@ Druid's segment architecture is heavily geared towards fast aggregates and filte
fast in Druid, whereas updates of older data are higher latency. This is by design as the data Druid is good for is typically event data,
and does not need to be updated too frequently. Kudu supports arbitrary primary keys with uniqueness constraints, and
efficient lookup by ranges of those keys. Kudu chooses not to include the execution engine, but supports sufficient
operations so as to allow node-local processing from the execution engines. This means that Kudu can support multiple frameworks on the same data (eg MR, Spark, and SQL).
operations so as to allow node-local processing from the execution engines. This means that Kudu can support multiple frameworks on the same data (e.g., MR, Spark, and SQL).
Druid includes its own query layer that allows it to push down aggregations and computations directly to data processes for faster query processing.
6 changes: 3 additions & 3 deletions docs/comparisons/druid-vs-sql-on-hadoop.md
@@ -37,7 +37,7 @@ Druid was designed to
1. handle slice-n-dice style ad-hoc queries

SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
Some of these engines (including Impala and Presto) can be co-located with HDFS data nodes and coordinate with them to achieve data locality for queries.
What does this mean? We can talk about it in terms of three general areas:

1. Queries
@@ -53,7 +53,7 @@ are queries and results, and all computation is done internally as part of the D
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
Some (Impala/Presto) SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
some latency overhead (e.g. serialization/deserialization time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
how much of a performance impact this makes.

### Data Ingestion
@@ -79,4 +79,4 @@ Parquet is a column storage format that is designed to work with SQL-on-Hadoop e
relies on external sources to pull data out of it.

Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
more hierachical, and is more designed for binary chunking. In theory, this should lead to faster scans in Druid.
more hierarchical, and is more designed for binary chunking. In theory, this should lead to faster scans in Druid.
80 changes: 40 additions & 40 deletions docs/configuration/index.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/dependencies/metadata-storage.md
@@ -54,7 +54,7 @@ See [postgresql-metadata-storage](../development/extensions-core/postgresql.md).

## Adding custom dbcp properties

NOTE: These properties are not settable through the druid.metadata.storage.connector.dbcp properties : username, password, connectURI, validationQuery, testOnBorrow. These must be set through druid.metadata.storage.connector properties.
NOTE: These properties are not settable through the `druid.metadata.storage.connector.dbcp` properties: `username`, `password`, `connectURI`, `validationQuery`, `testOnBorrow`. These must be set through `druid.metadata.storage.connector` properties.

Example supported properties:
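As an illustration (property names taken from Apache Commons DBCP; treat the specific keys and values as a hypothetical sketch, not a tested configuration):

```properties
# Hypothetical pass-through pool settings under the dbcp prefix
druid.metadata.storage.connector.dbcp.maxConnLifetimeMillis=1200000
druid.metadata.storage.connector.dbcp.defaultQueryTimeout=30000
```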

Expand All @@ -78,7 +78,7 @@ system. The table has two main functional columns, the other columns are for
indexing purposes.

The `used` column is a boolean "tombstone". A 1 means that the segment should
be "used" by the cluster (i.e. it should be loaded and available for requests).
be "used" by the cluster (i.e., it should be loaded and available for requests).
A 0 means that the segment should not be actively loaded into the cluster. We
do this as a means of removing segments from the cluster without actually
removing their metadata (which allows for simpler rolling back if that is ever
@@ -138,4 +138,4 @@ The Metadata Storage is accessed only by:
2. Realtime Processes (if any)
3. Coordinator Processes

Thus you need to give permissions (eg in AWS Security Groups) only for these machines to access the Metadata storage.
Thus you need to give permissions (e.g., in AWS Security Groups) only for these machines to access the Metadata storage.
4 changes: 2 additions & 2 deletions docs/design/architecture.md
@@ -44,7 +44,7 @@ Druid processes can be deployed any way you like, but for ease of deployment we
* **Query**: Runs Broker and optional Router processes, handles queries from external clients.
* **Data**: Runs Historical and MiddleManager processes, executes ingestion workloads and stores all queryable data.

For more details on process and server organization, please see [Druid Processses and Servers](../design/processes.md).
For more details on process and server organization, please see [Druid Processes and Servers](../design/processes.md).

## External dependencies

@@ -58,7 +58,7 @@ this is typically going to be local disk. Druid uses deep storage to store any d
system.

Druid uses deep storage only as a backup of your data and as a way to transfer data in the background between
Druid processes. To respond to queries, Historical processes do not read from deep storage, but instead read pre-fetched
Druid processes. To respond to queries, Historical processes do not read from deep storage, but instead read prefetched
segments from their local disks before any queries are served. This means that Druid never needs to access deep storage
during a query, helping it offer the best query latencies possible. It also means that you must have enough disk space
both in deep storage and across your Historical processes for the data you plan to load.
6 changes: 3 additions & 3 deletions docs/design/auth.md
@@ -31,14 +31,14 @@ This document describes non-extension specific Apache Druid (incubating) authent
|`druid.escalator.type`|String|Type of the Escalator that should be used for internal Druid communications. This Escalator must use an authentication scheme that is supported by an Authenticator in `druid.auth.authenticationChain`.|"noop"|no|
|`druid.auth.authorizers`|JSON List of Strings|List of Authorizer type names |["allowAll"]|no|
|`druid.auth.unsecuredPaths`| List of Strings|List of paths for which security checks will not be performed. All requests to these paths will be allowed.|[]|no|
|`druid.auth.allowUnauthenticatedHttpOptions`|Boolean|If true, skip authentication checks for HTTP OPTIONS requests. This is needed for certain use cases, such as supporting CORS pre-flight requests. Note that disabling authentication checks for OPTIONS requests will allow unauthenticated users to determine what Druid endpoints are valid (by checking if the OPTIONS request returns a 200 instead of 404), so enabling this option may reveal information about server configuration, including information about what extensions are loaded (if those extensions add endpoints).|false|no|
|`druid.auth.allowUnauthenticatedHttpOptions`|Boolean|If true, skip authentication checks for HTTP OPTIONS requests. This is needed for certain use cases, such as supporting CORS preflight requests. Note that disabling authentication checks for OPTIONS requests will allow unauthenticated users to determine what Druid endpoints are valid (by checking if the OPTIONS request returns a 200 instead of 404), so enabling this option may reveal information about server configuration, including information about what extensions are loaded (if those extensions add endpoints).|false|no|

## Enabling Authentication/Authorization

## Authenticator chain
Authentication decisions are handled by a chain of Authenticator instances. A request will be checked by Authenticators in the sequence defined by the `druid.auth.authenticatorChain`.

Authenticator implementions are provided by extensions.
Authenticator implementations are provided by extensions.

For example, the following authentication chain definition enables the Kerberos and HTTP Basic authenticators, from the `druid-kerberos` and `druid-basic-security` core extensions, respectively:

@@ -83,7 +83,7 @@ druid.auth.authenticator.anonymous.authorizerName=myBasicAuthorizer
## Escalator
The `druid.escalator.type` property determines what authentication scheme should be used for internal Druid cluster communications (such as when a Broker process communicates with Historical processes for query processing).

The Escalator chosen for this property must use an authentication scheme that is supported by an Authenticator in `druid.auth.authenticationChain`. Authenticator extension implementors must also provide a corresponding Escalator implementation if they intend to use a particular authentication scheme for internal Druid communications.
The Escalator chosen for this property must use an authentication scheme that is supported by an Authenticator in `druid.auth.authenticationChain`. Authenticator extension implementers must also provide a corresponding Escalator implementation if they intend to use a particular authentication scheme for internal Druid communications.

### Noop escalator

2 changes: 1 addition & 1 deletion docs/design/broker.md
@@ -50,5 +50,5 @@ To determine which processes to forward queries to, the Broker process first bui

### Caching

Broker processes employ a cache with a LRU cache invalidation strategy. The Broker cache stores per-segment results. The cache can be local to each Broker process or shared across multiple processes using an external distributed cache such as [memcached](http://memcached.org/). Each time a broker process receives a query, it first maps the query to a set of segments. A subset of these segment results may already exist in the cache and the results can be directly pulled from the cache. For any segment results that do not exist in the cache, the broker process will forward the query to the
Broker processes employ a cache with an LRU cache invalidation strategy. The Broker cache stores per-segment results. The cache can be local to each Broker process or shared across multiple processes using an external distributed cache such as [memcached](http://memcached.org/). Each time a broker process receives a query, it first maps the query to a set of segments. A subset of these segment results may already exist in the cache and the results can be directly pulled from the cache. For any segment results that do not exist in the cache, the broker process will forward the query to the
Historical processes. Once the Historical processes return their results, the Broker will store those results in the cache. Real-time segments are never cached and hence requests for real-time data will always be forwarded to real-time processes. Real-time data is perpetually changing and caching the results would be unreliable.
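The per-segment flow described above can be sketched as follows (a minimal illustration with hypothetical names; Druid's actual cache is a Java component with a pluggable local/memcached backend):

```python
from collections import OrderedDict

class SegmentResultCache:
    """Per-segment LRU result cache (illustrative sketch, not Druid's code)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get(self, segment_id):
        if segment_id not in self._cache:
            return None
        self._cache.move_to_end(segment_id)  # mark as most recently used
        return self._cache[segment_id]

    def put(self, segment_id, result):
        self._cache[segment_id] = result
        self._cache.move_to_end(segment_id)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry

def answer_query(cache, segment_ids, fetch_from_historical):
    """Serve cached per-segment results and fetch only the misses."""
    results = {}
    for sid in segment_ids:
        hit = cache.get(sid)
        if hit is None:
            hit = fetch_from_historical(sid)  # stand-in for forwarding to Historicals
            cache.put(sid, hit)
        results[sid] = hit
    return results
```

In this sketch, real-time segments would simply bypass `put`, matching the rule above that perpetually changing results are never cached.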
6 changes: 3 additions & 3 deletions docs/design/coordinator.md
@@ -35,7 +35,7 @@ For a list of API endpoints supported by the Coordinator, see [Coordinator API](

The Druid Coordinator process is primarily responsible for segment management and distribution. More specifically, the Druid Coordinator process communicates to Historical processes to load or drop segments based on configurations. The Druid Coordinator is responsible for loading new segments, dropping outdated segments, managing segment replication, and balancing segment load.

The Druid Coordinator runs periodically and the time between each run is a configurable parameter. Each time the Druid Coordinator runs, it assesses the current state of the cluster before deciding on the appropriate actions to take. Similar to the Broker and Historical processses, the Druid Coordinator maintains a connection to a Zookeeper cluster for current cluster information. The Coordinator also maintains a connection to a database containing information about available segments and rules. Available segments are stored in a segment table and list all segments that should be loaded in the cluster. Rules are stored in a rule table and indicate how segments should be handled.
The Druid Coordinator runs periodically and the time between each run is a configurable parameter. Each time the Druid Coordinator runs, it assesses the current state of the cluster before deciding on the appropriate actions to take. Similar to the Broker and Historical processes, the Druid Coordinator maintains a connection to a Zookeeper cluster for current cluster information. The Coordinator also maintains a connection to a database containing information about available segments and rules. Available segments are stored in a segment table and list all segments that should be loaded in the cluster. Rules are stored in a rule table and indicate how segments should be handled.

Before any unassigned segments are serviced by Historical processes, the available Historical processes for each tier are first sorted in terms of capacity, with least capacity servers having the highest priority. Unassigned segments are always assigned to the processes with least capacity to maintain a level of balance between processes. The Coordinator does not directly communicate with a historical process when assigning it a new segment; instead the Coordinator creates some temporary information about the new segment under load queue path of the historical process. Once this request is seen, the historical process will load the segment and begin servicing it.

@@ -85,8 +85,8 @@ Once a compaction task fails, the Coordinator simply finds the segments for the
#### Newest segment first policy

At every coordinator run, this policy searches for segments to compact by iterating segments from the latest to the oldest.
Once it finds the latest segment among all dataSources, it checks if the segment is _compactible_ with other segments of the same dataSource which have the same or abutting intervals.
Note that segments are compactible if their total size is smaller than or equal to the configured `inputSegmentSizeBytes`.
Once it finds the latest segment among all dataSources, it checks if the segment is _compactable_ with other segments of the same dataSource which have the same or abutting intervals.
Note that segments are compactable if their total size is smaller than or equal to the configured `inputSegmentSizeBytes`.

Here are some details with an example. Let us assume we have two dataSources (`foo`, `bar`)
and 5 segments (`foo_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION`, `foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION`, `bar_2017-08-01T00:00:00.000Z_2017-09-01T00:00:00.000Z_VERSION`, `bar_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION`, `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION`).
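The size rule above can be sketched as a greedy newest-first search (a hypothetical helper, simplified to check only dataSource and total size, not the same/abutting-interval condition):

```python
def choose_compaction_group(segments, input_segment_size_bytes):
    """Pick the newest segment overall, then add older segments of the same
    dataSource while the group's total size stays within the limit."""
    newest_first = sorted(segments, key=lambda s: s["interval_start"], reverse=True)
    if not newest_first:
        return []
    head = newest_first[0]
    group, total = [head], head["size"]
    for seg in newest_first[1:]:
        if seg["datasource"] != head["datasource"]:
            continue
        if total + seg["size"] > input_segment_size_bytes:
            break  # the group is only compactable within inputSegmentSizeBytes
        group.append(seg)
        total += seg["size"]
    return group
```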
2 changes: 1 addition & 1 deletion docs/design/middlemanager.md
@@ -25,7 +25,7 @@ title: "MiddleManager Process"

### Configuration

For Apache Druid (incubating) Middlemanager Process Configuration, see [Indexing Service Configuration](../configuration/index.html#middlemanager-and-peons).
For Apache Druid (incubating) MiddleManager Process Configuration, see [Indexing Service Configuration](../configuration/index.html#middlemanager-and-peons).

### HTTP endpoints

2 changes: 1 addition & 1 deletion docs/design/overlord.md
@@ -46,7 +46,7 @@ The Overlord provides a UI for managing tasks and workers. For more details, ple

If a MiddleManager has task failures above a threshold, the Overlord will blacklist these MiddleManagers. No more than 20% of the MiddleManagers can be blacklisted. Blacklisted MiddleManagers will be periodically whitelisted.

The following vairables can be used to set the threshold and blacklist timeouts.
The following variables can be used to set the threshold and blacklist timeouts.

```
druid.indexer.runner.maxRetriesBeforeBlacklist
```
2 changes: 1 addition & 1 deletion docs/design/router.md
@@ -155,7 +155,7 @@ To use this balancer, specify the following property:
druid.router.avatica.balancer.type=consistentHash
```

This is a non-default implementation that is provided for experimentation purposes. The consistent hasher has longer setup times on initialization and when the set of Brokers changes, but has a faster Broker assignment time than the rendezous hasher when tested with 5 Brokers. Benchmarks for both implementations have been provided in `ConsistentHasherBenchmark` and `RendezvousHasherBenchmark`. The consistent hasher also requires locking, while the rendezvous hasher does not.
This is a non-default implementation that is provided for experimentation purposes. The consistent hasher has longer setup times on initialization and when the set of Brokers changes, but has a faster Broker assignment time than the rendezvous hasher when tested with 5 Brokers. Benchmarks for both implementations have been provided in `ConsistentHasherBenchmark` and `RendezvousHasherBenchmark`. The consistent hasher also requires locking, while the rendezvous hasher does not.
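For intuition, rendezvous (highest-random-weight) hashing can be sketched in a few lines (illustrative only; Druid's hashers are Java classes and use different hash functions):

```python
import hashlib

def rendezvous_pick(brokers, connection_id):
    """Score every broker for this connection ID; the highest score wins.
    Removing any non-winning broker leaves the assignment unchanged."""
    def score(broker):
        digest = hashlib.sha256(f"{broker}|{connection_id}".encode()).digest()
        return int.from_bytes(digest, "big")
    return max(brokers, key=score)
```

No locking or precomputed ring is needed here, which reflects the trade-off described above: per-pick work is higher than a prebuilt consistent-hash ring, but setup is trivial and lock-free.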


### Example production configuration
2 changes: 1 addition & 1 deletion docs/design/segments.md
@@ -180,7 +180,7 @@ Each column is stored as two parts:
1. A Jackson-serialized ColumnDescriptor
2. The rest of the binary for the column

A ColumnDescriptor is essentially an object that allows us to use jackson’s polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-value, etc.) and then a list of serde logic that can deserialize the rest of the binary.
A ColumnDescriptor is essentially an object that allows us to use Jackson's polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-value, etc.) and then a list of serialization/deserialization logic that can deserialize the rest of the binary.
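The idea can be illustrated with a type-tag registry (a Python analogy to Jackson's polymorphic deserialization; the class and tag names here are hypothetical, not Druid's actual serde types):

```python
SERDE_REGISTRY = {}

def register_serde(type_name):
    """Map a type tag to the class that knows how to deserialize that part."""
    def wrap(cls):
        SERDE_REGISTRY[type_name] = cls
        return cls
    return wrap

@register_serde("stringDictionary")
class StringDictionarySerde:
    def __init__(self, has_multiple_values=False):
        self.has_multiple_values = has_multiple_values

def deserialize_part(part):
    # Dispatch on the embedded "type" tag; remaining fields become constructor args.
    cls = SERDE_REGISTRY[part["type"]]
    kwargs = {k: v for k, v in part.items() if k != "type"}
    return cls(**kwargs)
```

Under this scheme a new serialization method only requires registering a new class, which is the "minimal impact to the code" property the paragraph describes.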

## Sharding Data to Create Segments

@@ -87,7 +87,7 @@ Same as for the `all` converter user has control of `<namespacePrefix>.[<druid s
The white-list based converter comes with the following default white list map, located under resources in `./src/main/resources/defaultWhiteListMap.json`.

However, the user can override the default white list map by supplying a property called `mapPath`.
This property is a String containing the path for the file containing **white list map Json object**.
This property is a String containing the path for the file containing **white list map JSON object**.
For example, the following converter will read the map from the file `/pathPrefix/fileName.json`.

```json