Skip to content
Permalink
Browse files
Update automatic compaction docs with consistent terminology (#12416)
* specify automatic compaction where applicable

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* update for style and consistency

* implement suggested feedback

* remove duplicate example

* Apply suggestions from code review

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/ingestion/compaction.md

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>

* Update docs/operations/api-reference.md

* update .spelling

* Adopt review suggestions

Co-authored-by: Katya Macedo  <38017980+ektravel@users.noreply.github.com>
  • Loading branch information
vtlim and ektravel committed May 3, 2022
1 parent 35a7d86 commit 0206a2da5c6211caf451751876d847f40c2e7755
Showing 7 changed files with 84 additions and 101 deletions.
@@ -951,31 +951,31 @@ These configuration options control the behavior of the Lookup dynamic configura
|`druid.manager.lookups.threadPoolSize`|How many processes can be managed concurrently (concurrent POST and DELETE requests). Requests this limit will wait in a queue until a slot becomes available.|10|
|`druid.manager.lookups.period`|How many milliseconds between checks for configuration changes|30_000|

##### Compaction Dynamic Configuration
##### Automatic compaction dynamic configuration

Compaction configurations can also be set or updated dynamically using
[Coordinator's API](../operations/api-reference.md#compaction-configuration) without restarting Coordinators.
You can set or update automatic compaction properties dynamically using the
[Coordinator API](../operations/api-reference.md#automatic-compaction-configuration) without restarting Coordinators.

For details about segment compaction, please check [Segment Size Optimization](../operations/segment-optimization.md).
For details about segment compaction, see [Segment size optimization](../operations/segment-optimization.md).

A description of the compaction config is:
You can configure automatic compaction through the following properties:

|Property|Description|Required|
|--------|-----------|--------|
|`dataSource`|dataSource name to be compacted.|yes|
|`taskPriority`|[Priority](../ingestion/tasks.md#priority) of compaction task.|no (default = 25)|
|`inputSegmentSizeBytes`|Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 1–2GB will result in compaction tasks taking an excessive amount of time.|no (default = Long.MAX_VALUE)|
|`maxRowsPerSegment`|Max number of rows per segment after compaction.|no|
|`skipOffsetFromLatest`|The offset for searching segments to be compacted in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Strongly recommended to set for realtime dataSources. See [Data handling with compaction](../ingestion/compaction.md#data-handling-with-compaction)|no (default = "P1D")|
|`tuningConfig`|Tuning config for compaction tasks. See below [Compaction Task TuningConfig](#automatic-compaction-tuningconfig).|no|
|`skipOffsetFromLatest`|The offset for searching segments to be compacted in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) duration format. Strongly recommended to set for realtime dataSources. See [Data handling with compaction](../ingestion/compaction.md#data-handling-with-compaction).|no (default = "P1D")|
|`tuningConfig`|Tuning config for compaction tasks. See below [Automatic compaction tuningConfig](#automatic-compaction-tuningconfig).|no|
|`taskContext`|[Task context](../ingestion/tasks.md#context) for compaction tasks.|no|
|`granularitySpec`|Custom `granularitySpec`. See [Automatic compaction granularitySpec](#automatic-compaction-granularityspec)|No|
|`dimensionsSpec`|Custom `dimensionsSpec`. See [Automatic compaction dimensionsSpec](#automatic-compaction-dimensions-spec)|No|
|`transformSpec`|Custom `transformSpec`. See [Automatic compaction transformSpec](#automatic-compaction-transform-spec)|No|
|`granularitySpec`|Custom `granularitySpec`. See [Automatic compaction granularitySpec](#automatic-compaction-granularityspec).|No|
|`dimensionsSpec`|Custom `dimensionsSpec`. See [Automatic compaction dimensionsSpec](#automatic-compaction-dimensionsspec).|No|
|`transformSpec`|Custom `transformSpec`. See [Automatic compaction transformSpec](#automatic-compaction-transformspec).|No|
|`metricsSpec`|Custom [`metricsSpec`](../ingestion/ingestion-spec.md#metricsspec). The compaction task preserves any existing metrics regardless of whether `metricsSpec` is specified. If `metricsSpec` is specified, Druid does not reapply any aggregators matching the metric names specified in `metricsSpec` to rows that already have the associated metrics. For rows that do not already have the metric specified in `metricsSpec`, Druid applies the metric aggregator on the source column, then proceeds to combine the metrics across segments as usual. If `metricsSpec` is not specified, Druid automatically discovers the metrics in the existing segments and combines existing metrics with the same metric name across segments. Aggregators for metrics with the same name are assumed to be compatible for combining across segments, otherwise the compaction task may fail.|No|
|`ioConfig`|IO config for compaction tasks. See below [Compaction Task IOConfig](#automatic-compaction-ioconfig).|no|
|`ioConfig`|IO config for compaction tasks. See [Automatic compaction ioConfig](#automatic-compaction-ioconfig).|no|

An example of compaction config is:
Automatic compaction config example:

```json
{
@@ -989,10 +989,10 @@ An example of compaction config is:
Compaction tasks fail when higher priority tasks cause Druid to revoke their locks. By default, realtime tasks like ingestion have a higher priority than compaction tasks. Therefore frequent conflicts between compaction tasks and realtime tasks can cause the coordinator's automatic compaction to get stuck.
You may see this issue with streaming ingestion from Kafka and Kinesis, which ingest late-arriving data. To mitigate this problem, set `skipOffsetFromLatest` to a value large enough so that arriving data tends to fall outside the offset value from the current time. This way you can avoid conflicts between compaction tasks and realtime ingestion tasks.

###### Automatic compaction TuningConfig
###### Automatic compaction tuningConfig

Auto compaction supports a subset of the [tuningConfig for Parallel task](../ingestion/native-batch.md#tuningconfig).
The below is a list of the supported configurations for auto compaction.
Auto-compaction supports a subset of the [tuningConfig for Parallel task](../ingestion/native-batch.md#tuningconfig).
The below is a list of the supported configurations for auto-compaction.

|Property|Description|Required|
|--------|-----------|--------|
@@ -1022,22 +1022,22 @@ The below is a list of the supported configurations for auto compaction.
|`queryGranularity`|The resolution of timestamp storage within each segment. Defaults to 'null', which preserves the original query granularity. Accepts all [Query granularity](../querying/granularities.md) values.|No|
|`rollup`|Whether to enable ingestion-time rollup or not. Defaults to 'null', which preserves the original setting. Note that once data is rollup, individual records can no longer be recovered. |No|

###### Automatic compaction dimensions spec
###### Automatic compaction dimensionsSpec

|Field|Description|Required|
|-----|-----------|--------|
|`dimensions`| A list of dimension names or objects. Defaults to 'null', which preserves the original dimensions. Note that setting this will cause segments manually compacted with `dimensionExclusions` to be compacted again.|No|

###### Automatic compaction transform spec
###### Automatic compaction transformSpec

|Field|Description|Required|
|-----|-----------|--------|
|`filter`| The `filter` conditionally filters input rows during compaction. Only rows that pass the filter will be included in the compacted segments. Any of Druid's standard [query filters](../querying/filters.md) can be used. Defaults to 'null', which will not filter any row. |No|

###### Automatic compaction IOConfig
###### Automatic compaction ioConfig

Auto compaction supports a subset of the [IOConfig for Parallel task](../ingestion/native-batch.md).
The below is a list of the supported configurations for auto compaction.
Auto-compaction supports a subset of the [ioConfig for Parallel task](../ingestion/native-batch.md).
The below is a list of the supported configurations for auto-compaction.

|Property|Description|Default|Required|
|--------|-----------|-------|--------|
@@ -79,39 +79,38 @@ If a Historical process restarts or becomes unavailable for any reason, the Drui

To ensure an even distribution of segments across Historical processes in the cluster, the Coordinator process will find the total size of all segments being served by every Historical process each time the Coordinator runs. For every Historical process tier in the cluster, the Coordinator process will determine the Historical process with the highest utilization and the Historical process with the lowest utilization. The percent difference in utilization between the two processes is computed, and if the result exceeds a certain threshold, a number of segments will be moved from the highest utilized process to the lowest utilized process. There is a configurable limit on the number of segments that can be moved from one process to another each time the Coordinator runs. Segments to be moved are selected at random and only moved if the resulting utilization calculation indicates the percentage difference between the highest and lowest servers has decreased.

### Compacting Segments
### Automatic compaction

Each run, the Druid Coordinator compacts segments by merging small segments or splitting a large one. This is useful when your segments are not optimized
in terms of segment size which may degrade query performance. See [Segment Size Optimization](../operations/segment-optimization.md) for details.
The Druid Coordinator manages the automatic compaction system.
Each run, the Coordinator compacts segments by merging small segments or splitting a large one. This is useful when the size of your segments is not optimized which may degrade query performance.
See [Segment size optimization](../operations/segment-optimization.md) for details.

The Coordinator first finds the segments to compact based on the [segment search policy](#segment-search-policy).
The Coordinator first finds the segments to compact based on the [segment search policy](#segment-search-policy-in-automatic-compaction).
Once some segments are found, it issues a [compaction task](../ingestion/tasks.md#compact) to compact those segments.
The maximum number of running compaction tasks is `min(sum of worker capacity * slotRatio, maxSlots)`.
Note that even though `min(sum of worker capacity * slotRatio, maxSlots)` = 0, at least one compaction task is always submitted
Note that even if `min(sum of worker capacity * slotRatio, maxSlots) = 0`, at least one compaction task is always submitted
if the compaction is enabled for a dataSource.
See [Compaction Configuration API](../operations/api-reference.md#compaction-configuration) and [Compaction Configuration](../configuration/index.md#compaction-dynamic-configuration) to enable the compaction.
See [Automatic compaction configuration API](../operations/api-reference.md#automatic-compaction-configuration) and [Automatic compaction configuration](../configuration/index.md#automatic-compaction-dynamic-configuration) to enable and configure automatic compaction.

Compaction tasks might fail due to the following reasons.
Compaction tasks might fail due to the following reasons:

- If the input segments of a compaction task are removed or overshadowed before it starts, that compaction task fails immediately.
- If a task of a higher priority acquires a [time chunk lock](../ingestion/tasks.md#locking) for an interval overlapping with the interval of a compaction task, the compaction task fails.

Once a compaction task fails, the Coordinator simply checks the segments in the interval of the failed task again, and issues another compaction task in the next run.

Note that Compacting Segments Coordinator Duty is automatically enabled and run as part of the Indexing Service Duties group. However, Compacting Segments Coordinator Duty can be configured to run in isolation as a separate coordinator duty group. This allows changing the period of Compacting Segments Coordinator Duty without impacting the period of other Indexing Service Duties. This can be done by setting the following properties (for more details see [custom pluggable Coordinator Duty](../development/modules.md#adding-your-own-custom-pluggable-coordinator-duty)):
Note that Compacting Segments Coordinator Duty is automatically enabled and run as part of the Indexing Service Duties group. However, Compacting Segments Coordinator Duty can be configured to run in isolation as a separate Coordinator duty group. This allows changing the period of Compacting Segments Coordinator Duty without impacting the period of other Indexing Service Duties. This can be done by setting the following properties. For more details, see [custom pluggable Coordinator Duty](../development/modules.md#adding-your-own-custom-pluggable-coordinator-duty).
```
druid.coordinator.dutyGroups=[<SOME_GROUP_NAME>]
druid.coordinator.<SOME_GROUP_NAME>.duties=["compactSegments"]
druid.coordinator.<SOME_GROUP_NAME>.period=<PERIOD_TO_RUN_COMPACTING_SEGMENTS_DUTY>
```

### Segment search policy
### Segment search policy in automatic compaction

#### Recent segment first policy

At every coordinator run, this policy looks up time chunks in order of newest-to-oldest and checks whether the segments in those time chunks
need compaction or not.
A set of segments need compaction if all conditions below are satisfied.
At every Coordinator run, this policy looks up time chunks from newest to oldest and checks whether the segments in those time chunks
need compaction.
A set of segments needs compaction if all conditions below are satisfied:

1) Total size of segments in the time chunk is smaller than or equal to the configured `inputSegmentSizeBytes`.
2) Segments have never been compacted yet or compaction spec has been updated since the last compaction, especially `maxRowsPerSegment`, `maxTotalRows`, and `indexSpec`.
@@ -130,22 +129,22 @@ Assuming that each segment is 10 MB and haven't been compacted yet, this policy
`foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION` and `foo_2017-11-01T00:00:00.000Z_2017-12-01T00:00:00.000Z_VERSION_1` to compact together because
`2017-11-01T00:00:00.000Z/2017-12-01T00:00:00.000Z` is the most recent time chunk.

If the coordinator has enough task slots for compaction, this policy will continue searching for the next segments and return
If the Coordinator has enough task slots for compaction, this policy will continue searching for the next segments and return
`bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION` and `bar_2017-10-01T00:00:00.000Z_2017-11-01T00:00:00.000Z_VERSION_1`.
Finally, `foo_2017-09-01T00:00:00.000Z_2017-10-01T00:00:00.000Z_VERSION` will be picked up even though there is only one segment in the time chunk of `2017-09-01T00:00:00.000Z/2017-10-01T00:00:00.000Z`.

The search start point can be changed by setting [skipOffsetFromLatest](../configuration/index.md#compaction-dynamic-configuration).
The search start point can be changed by setting [`skipOffsetFromLatest`](../configuration/index.md#automatic-compaction-dynamic-configuration).
If this is set, this policy will ignore the segments falling into the time chunk of (the end time of the most recent segment - `skipOffsetFromLatest`).
This is to avoid conflicts between compaction tasks and realtime tasks.
Note that realtime tasks have a higher priority than compaction tasks by default. Realtime tasks will revoke the locks of compaction tasks if their intervals overlap, resulting in the termination of the compaction task.

> This policy currently cannot handle the situation when there are a lot of small segments which have the same interval,
> and their total size exceeds [inputSegmentSizeBytes](../configuration/index.md#compaction-dynamic-configuration).
> and their total size exceeds [`inputSegmentSizeBytes`](../configuration/index.md#automatic-compaction-dynamic-configuration).
> If it finds such segments, it simply skips them.
### The Coordinator console

The Druid Coordinator exposes a web GUI for displaying cluster information and rule configuration. For more details, please see [coordinator console](../operations/management-uis.md#coordinator-consoles).
The Druid Coordinator exposes a web GUI for displaying cluster information and rule configuration. For more details, see [Coordinator console](../operations/management-uis.md#coordinator-consoles).

### FAQ

0 comments on commit 0206a2d

Please sign in to comment.