Merged
15 changes: 14 additions & 1 deletion .travis.yml
@@ -247,7 +247,20 @@ jobs:

- name: "docs"
install: (cd website && npm install)
script: (cd website && npm run lint)
script: (cd website && npm run lint && npm run spellcheck)
after_failure: |-
echo "FAILURE EXPLANATION:

If there are spell check errors:

1) Suppressing False Positives: Edit website/.spelling to add suppressions. Instructions
are at the top of the file and explain how to suppress false positives either globally or
within a particular file.

2) Running Spell Check Locally: cd website && npm install && npm run spellcheck

For more information, refer to: https://www.npmjs.com/package/markdown-spellcheck
"

- &integration_batch_index
name: "batch index integration test"
2 changes: 1 addition & 1 deletion docs/comparisons/druid-vs-kudu.md
@@ -35,5 +35,5 @@ Druid's segment architecture is heavily geared towards fast aggregates and filte
fast in Druid, whereas updates of older data are higher latency. This is by design as the data Druid is good for is typically event data,
and does not need to be updated too frequently. Kudu supports arbitrary primary keys with uniqueness constraints, and
efficient lookup by ranges of those keys. Kudu chooses not to include the execution engine, but supports sufficient
operations so as to allow node-local processing from the execution engines. This means that Kudu can support multiple frameworks on the same data (eg MR, Spark, and SQL).
operations so as to allow node-local processing from the execution engines. This means that Kudu can support multiple frameworks on the same data (e.g., MR, Spark, and SQL).
Druid includes its own query layer that allows it to push down aggregations and computations directly to data processes for faster query processing.
6 changes: 3 additions & 3 deletions docs/comparisons/druid-vs-sql-on-hadoop.md
@@ -37,7 +37,7 @@ Druid was designed to
1. handle slice-n-dice style ad-hoc queries

SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
Some of these engines (including Impala and Presto) can be co-located with HDFS data nodes and coordinate with them to achieve data locality for queries.
What does this mean? We can talk about it in terms of three general areas:

1. Queries
@@ -53,7 +53,7 @@ are queries and results, and all computation is done internally as part of the D
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
Some (Impala/Presto) SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
some latency overhead (e.g. serialization/deserialization time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
how much of a performance impact this makes.

### Data Ingestion
@@ -79,4 +79,4 @@ Parquet is a column storage format that is designed to work with SQL-on-Hadoop e
relies on external sources to pull data out of it.

Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
more hierachical, and is more designed for binary chunking. In theory, this should lead to faster scans in Druid.
more hierarchical, and is more designed for binary chunking. In theory, this should lead to faster scans in Druid.
80 changes: 40 additions & 40 deletions docs/configuration/index.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/dependencies/metadata-storage.md
@@ -54,7 +54,7 @@ See [postgresql-metadata-storage](../development/extensions-core/postgresql.md).

## Adding custom dbcp properties

NOTE: These properties are not settable through the druid.metadata.storage.connector.dbcp properties : username, password, connectURI, validationQuery, testOnBorrow. These must be set through druid.metadata.storage.connector properties.
NOTE: These properties are not settable through the `druid.metadata.storage.connector.dbcp` properties: `username`, `password`, `connectURI`, `validationQuery`, `testOnBorrow`. These must be set through `druid.metadata.storage.connector` properties.

Example supported properties:
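As an illustration (property names taken from Apache Commons DBCP; treat the specific keys and values as a hypothetical sketch, not a tested configuration):

```properties
# Hypothetical pass-through pool settings under the dbcp prefix
druid.metadata.storage.connector.dbcp.maxConnLifetimeMillis=1200000
druid.metadata.storage.connector.dbcp.defaultQueryTimeout=30000
```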

Expand All @@ -78,7 +78,7 @@ system. The table has two main functional columns, the other columns are for
indexing purposes.

The `used` column is a boolean "tombstone". A 1 means that the segment should
be "used" by the cluster (i.e. it should be loaded and available for requests).
be "used" by the cluster (i.e., it should be loaded and available for requests).
A 0 means that the segment should not be actively loaded into the cluster. We
do this as a means of removing segments from the cluster without actually
removing their metadata (which allows for simpler rolling back if that is ever
@@ -138,4 +138,4 @@ The Metadata Storage is accessed only by:
2. Realtime Processes (if any)
3. Coordinator Processes

Thus you need to give permissions (eg in AWS Security Groups) only for these machines to access the Metadata storage.
Thus you need to give permissions (e.g., in AWS Security Groups) only for these machines to access the Metadata storage.
4 changes: 2 additions & 2 deletions docs/design/architecture.md
@@ -44,7 +44,7 @@ Druid processes can be deployed any way you like, but for ease of deployment we
* **Query**: Runs Broker and optional Router processes, handles queries from external clients.
* **Data**: Runs Historical and MiddleManager processes, executes ingestion workloads and stores all queryable data.

For more details on process and server organization, please see [Druid Processses and Servers](../design/processes.md).
For more details on process and server organization, please see [Druid Processes and Servers](../design/processes.md).

## External dependencies

@@ -58,7 +58,7 @@ this is typically going to be local disk. Druid uses deep storage to store any d
system.

Druid uses deep storage only as a backup of your data and as a way to transfer data in the background between
Druid processes. To respond to queries, Historical processes do not read from deep storage, but instead read pre-fetched
Druid processes. To respond to queries, Historical processes do not read from deep storage, but instead read prefetched
segments from their local disks before any queries are served. This means that Druid never needs to access deep storage
during a query, helping it offer the best query latencies possible. It also means that you must have enough disk space
both in deep storage and across your Historical processes for the data you plan to load.
6 changes: 3 additions & 3 deletions docs/design/auth.md
@@ -31,14 +31,14 @@ This document describes non-extension specific Apache Druid (incubating) authent
|`druid.escalator.type`|String|Type of the Escalator that should be used for internal Druid communications. This Escalator must use an authentication scheme that is supported by an Authenticator in `druid.auth.authenticationChain`.|"noop"|no|
|`druid.auth.authorizers`|JSON List of Strings|List of Authorizer type names |["allowAll"]|no|
|`druid.auth.unsecuredPaths`| List of Strings|List of paths for which security checks will not be performed. All requests to these paths will be allowed.|[]|no|
|`druid.auth.allowUnauthenticatedHttpOptions`|Boolean|If true, skip authentication checks for HTTP OPTIONS requests. This is needed for certain use cases, such as supporting CORS pre-flight requests. Note that disabling authentication checks for OPTIONS requests will allow unauthenticated users to determine what Druid endpoints are valid (by checking if the OPTIONS request returns a 200 instead of 404), so enabling this option may reveal information about server configuration, including information about what extensions are loaded (if those extensions add endpoints).|false|no|
|`druid.auth.allowUnauthenticatedHttpOptions`|Boolean|If true, skip authentication checks for HTTP OPTIONS requests. This is needed for certain use cases, such as supporting CORS preflight requests. Note that disabling authentication checks for OPTIONS requests will allow unauthenticated users to determine what Druid endpoints are valid (by checking if the OPTIONS request returns a 200 instead of 404), so enabling this option may reveal information about server configuration, including information about what extensions are loaded (if those extensions add endpoints).|false|no|

## Enabling Authentication/Authorization

## Authenticator chain
Authentication decisions are handled by a chain of Authenticator instances. A request will be checked by Authenticators in the sequence defined by the `druid.auth.authenticatorChain`.

Authenticator implementions are provided by extensions.
Authenticator implementations are provided by extensions.

For example, the following authentication chain definition enables the Kerberos and HTTP Basic authenticators, from the `druid-kerberos` and `druid-basic-security` core extensions, respectively:

@@ -83,7 +83,7 @@ druid.auth.authenticator.anonymous.authorizerName=myBasicAuthorizer
## Escalator
The `druid.escalator.type` property determines what authentication scheme should be used for internal Druid cluster communications (such as when a Broker process communicates with Historical processes for query processing).

The Escalator chosen for this property must use an authentication scheme that is supported by an Authenticator in `druid.auth.authenticationChain`. Authenticator extension implementors must also provide a corresponding Escalator implementation if they intend to use a particular authentication scheme for internal Druid communications.
The Escalator chosen for this property must use an authentication scheme that is supported by an Authenticator in `druid.auth.authenticationChain`. Authenticator extension implementers must also provide a corresponding Escalator implementation if they intend to use a particular authentication scheme for internal Druid communications.

### Noop escalator

2 changes: 1 addition & 1 deletion docs/design/broker.md
@@ -50,5 +50,5 @@ To determine which processes to forward queries to, the Broker process first bui

### Caching

Broker processes employ a cache with a LRU cache invalidation strategy. The Broker cache stores per-segment results. The cache can be local to each Broker process or shared across multiple processes using an external distributed cache such as [memcached](http://memcached.org/). Each time a broker process receives a query, it first maps the query to a set of segments. A subset of these segment results may already exist in the cache and the results can be directly pulled from the cache. For any segment results that do not exist in the cache, the broker process will forward the query to the
Broker processes employ a cache with an LRU cache invalidation strategy. The Broker cache stores per-segment results. The cache can be local to each Broker process or shared across multiple processes using an external distributed cache such as [memcached](http://memcached.org/). Each time a broker process receives a query, it first maps the query to a set of segments. A subset of these segment results may already exist in the cache and the results can be directly pulled from the cache. For any segment results that do not exist in the cache, the broker process will forward the query to the
Historical processes. Once the Historical processes return their results, the Broker will store those results in the cache. Real-time segments are never cached and hence requests for real-time data will always be forwarded to real-time processes. Real-time data is perpetually changing and caching the results would be unreliable.
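The per-segment flow described above can be sketched as follows (a minimal illustration with hypothetical names; Druid's actual cache is a Java component with a pluggable local/memcached backend):

```python
from collections import OrderedDict

class SegmentResultCache:
    """Per-segment LRU result cache (illustrative sketch, not Druid's code)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get(self, segment_id):
        if segment_id not in self._cache:
            return None
        self._cache.move_to_end(segment_id)  # mark as most recently used
        return self._cache[segment_id]

    def put(self, segment_id, result):
        self._cache[segment_id] = result
        self._cache.move_to_end(segment_id)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry

def answer_query(cache, segment_ids, fetch_from_historical):
    """Serve cached per-segment results and fetch only the misses."""
    results = {}
    for sid in segment_ids:
        hit = cache.get(sid)
        if hit is None:
            hit = fetch_from_historical(sid)  # stand-in for forwarding to Historicals
            cache.put(sid, hit)
        results[sid] = hit
    return results
```

In this sketch, real-time segments would simply bypass `put`, matching the rule above that perpetually changing results are never cached.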
6 changes: 3 additions & 3 deletions docs/design/coordinator.md
@@ -35,7 +35,7 @@ For a list of API endpoints supported by the Coordinator, see [Coordinator API](

The Druid Coordinator process is primarily responsible for segment management and distribution. More specifically, the Druid Coordinator process communicates to Historical processes to load or drop segments based on configurations. The Druid Coordinator is responsible for loading new segments, dropping outdated segments, managing segment replication, and balancing segment load.

The Druid Coordinator runs periodically and the time between each run is a configurable parameter. Each time the Druid Coordinator runs, it assesses the current state of the cluster before deciding on the appropriate actions to take. Similar to the Broker and Historical processses, the Druid Coordinator maintains a connection to a Zookeeper cluster for current cluster information. The Coordinator also maintains a connection to a database containing information about available segments and rules. Available segments are stored in a segment table and list all segments that should be loaded in the cluster. Rules are stored in a rule table and indicate how segments should be handled.
The Druid Coordinator runs periodically and the time between each run is a configurable parameter. Each time the Druid Coordinator runs, it assesses the current state of the cluster before deciding on the appropriate actions to take. Similar to the Broker and Historical processes, the Druid Coordinator maintains a connection to a Zookeeper cluster for current cluster information. The Coordinator also maintains a connection to a database containing information about available segments and rules. Available segments are stored in a segment table and list all segments that should be loaded in the cluster. Rules are stored in a rule table and indicate how segments should be handled.

Before any unassigned segments are serviced by Historical processes, the available Historical processes for each tier are first sorted in terms of capacity, with least capacity servers having the highest priority. Unassigned segments are always assigned to the processes with least capacity to maintain a level of balance between processes. The Coordinator does not directly communicate with a historical process when assigning it a new segment; instead the Coordinator creates some temporary information about the new segment under load queue path of the historical process. Once this request is seen, the historical process will load the segment and begin servicing it.

@@ -85,8 +85,8 @@ Once a compaction task fails, the Coordinator simply finds the segments for the
#### Newest segment first policy

At every coordinator run, this policy searches for segments to compact by iterating segments from the latest to the oldest.
Once it finds the latest segment among all dataSources, it checks if the segment is _compactible_ with other segments of the same dataSource which have the same or abutting intervals.
Note that segments are compactible if their total size is smaller than or equal to the configured `inputSegmentSizeBytes`.
Once it finds the latest segment among all dataSources, it checks if the segment is _compactable_ with other segments of the same dataSource which have the same or abutting intervals.
Note that segments are compactable if their total size is smaller than or equal to the configured `inputSegmentSizeBytes`.

Here are some details with an example. Let us assume we have two dataSources (`foo`, `bar`)
and 5 segments (`foo_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION`, `foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION`, `bar_2017-08-01T00:00:00.000Z_2017-09-01T00:00:00.000Z_VERSION`, `bar_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION`, `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION`).
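The size rule above can be sketched as a greedy newest-first search (a hypothetical helper, simplified to check only dataSource and total size, not the same/abutting-interval condition):

```python
def choose_compaction_group(segments, input_segment_size_bytes):
    """Pick the newest segment overall, then add older segments of the same
    dataSource while the group's total size stays within the limit."""
    newest_first = sorted(segments, key=lambda s: s["interval_start"], reverse=True)
    if not newest_first:
        return []
    head = newest_first[0]
    group, total = [head], head["size"]
    for seg in newest_first[1:]:
        if seg["datasource"] != head["datasource"]:
            continue
        if total + seg["size"] > input_segment_size_bytes:
            break  # the group is only compactable within inputSegmentSizeBytes
        group.append(seg)
        total += seg["size"]
    return group
```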
2 changes: 1 addition & 1 deletion docs/design/middlemanager.md
@@ -25,7 +25,7 @@ title: "MiddleManager Process"

### Configuration

For Apache Druid (incubating) Middlemanager Process Configuration, see [Indexing Service Configuration](../configuration/index.html#middlemanager-and-peons).
For Apache Druid (incubating) MiddleManager Process Configuration, see [Indexing Service Configuration](../configuration/index.html#middlemanager-and-peons).

### HTTP endpoints

2 changes: 1 addition & 1 deletion docs/design/overlord.md
@@ -46,7 +46,7 @@ The Overlord provides a UI for managing tasks and workers. For more details, ple

If a MiddleManager has task failures above a threshold, the Overlord will blacklist these MiddleManagers. No more than 20% of the MiddleManagers can be blacklisted. Blacklisted MiddleManagers will be periodically whitelisted.

The following vairables can be used to set the threshold and blacklist timeouts.
The following variables can be used to set the threshold and blacklist timeouts.

```
druid.indexer.runner.maxRetriesBeforeBlacklist
```
2 changes: 1 addition & 1 deletion docs/design/router.md
@@ -155,7 +155,7 @@ To use this balancer, specify the following property:
druid.router.avatica.balancer.type=consistentHash
```

This is a non-default implementation that is provided for experimentation purposes. The consistent hasher has longer setup times on initialization and when the set of Brokers changes, but has a faster Broker assignment time than the rendezous hasher when tested with 5 Brokers. Benchmarks for both implementations have been provided in `ConsistentHasherBenchmark` and `RendezvousHasherBenchmark`. The consistent hasher also requires locking, while the rendezvous hasher does not.
This is a non-default implementation that is provided for experimentation purposes. The consistent hasher has longer setup times on initialization and when the set of Brokers changes, but has a faster Broker assignment time than the rendezvous hasher when tested with 5 Brokers. Benchmarks for both implementations have been provided in `ConsistentHasherBenchmark` and `RendezvousHasherBenchmark`. The consistent hasher also requires locking, while the rendezvous hasher does not.
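For intuition, rendezvous (highest-random-weight) hashing can be sketched in a few lines (illustrative only; Druid's hashers are Java classes and use different hash functions):

```python
import hashlib

def rendezvous_pick(brokers, connection_id):
    """Score every broker for this connection ID; the highest score wins.
    Removing any non-winning broker leaves the assignment unchanged."""
    def score(broker):
        digest = hashlib.sha256(f"{broker}|{connection_id}".encode()).digest()
        return int.from_bytes(digest, "big")
    return max(brokers, key=score)
```

No locking or precomputed ring is needed here, which reflects the trade-off described above: per-pick work is higher than a prebuilt consistent-hash ring, but setup is trivial and lock-free.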


### Example production configuration
2 changes: 1 addition & 1 deletion docs/design/segments.md
@@ -180,7 +180,7 @@ Each column is stored as two parts:
1. A Jackson-serialized ColumnDescriptor
2. The rest of the binary for the column

A ColumnDescriptor is essentially an object that allows us to use jackson’s polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-value, etc.) and then a list of serde logic that can deserialize the rest of the binary.
A ColumnDescriptor is essentially an object that allows us to use Jackson's polymorphic deserialization to add new and interesting methods of serialization with minimal impact to the code. It consists of some metadata about the column (what type is it, is it multi-value, etc.) and then a list of serialization/deserialization logic that can deserialize the rest of the binary.
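The idea can be illustrated with a type-tag registry (a Python analogy to Jackson's polymorphic deserialization; the class and tag names here are hypothetical, not Druid's actual serde types):

```python
SERDE_REGISTRY = {}

def register_serde(type_name):
    """Map a type tag to the class that knows how to deserialize that part."""
    def wrap(cls):
        SERDE_REGISTRY[type_name] = cls
        return cls
    return wrap

@register_serde("stringDictionary")
class StringDictionarySerde:
    def __init__(self, has_multiple_values=False):
        self.has_multiple_values = has_multiple_values

def deserialize_part(part):
    # Dispatch on the embedded "type" tag; remaining fields become constructor args.
    cls = SERDE_REGISTRY[part["type"]]
    kwargs = {k: v for k, v in part.items() if k != "type"}
    return cls(**kwargs)
```

Under this scheme a new serialization method only requires registering a new class, which is the "minimal impact to the code" property the paragraph describes.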

## Sharding Data to Create Segments

@@ -87,7 +87,7 @@ Same as for the `all` converter user has control of `<namespacePrefix>.[<druid s
The white-list based converter comes with the following default white list map, located under resources in `./src/main/resources/defaultWhiteListMap.json`.

However, the user can override the default white list map by supplying a property called `mapPath`.
This property is a String containing the path for the file containing **white list map Json object**.
This property is a String containing the path for the file containing **white list map JSON object**.
For example, the following converter will read the map from the file `/pathPrefix/fileName.json`.

```json