
Add compaction task #4985

Merged: 21 commits merged into apache:master on Nov 4, 2017

Conversation

@jihoonson (Contributor) commented Oct 20, 2017

Part of #4479.

This patch introduces a new task type, CompactionTask. The reason for introducing a new task type instead of reusing the existing IndexTask + IngestSegmentFirehose is to keep the task spec as simple as possible, because it will be submitted by humans as well as by coordinators. As a result, I removed most of the unnecessary parameters from the task spec.

This task type is essentially a factory that generates an IndexTask spec to do the compaction work. An example of this task spec from the integration test is:

```json
{
  "type" : "compact",
  "dataSource" : "wikipedia_index_test",
  "interval" : "2013-08-31/2013-09-02"
}
```

This compaction task compacts all wikipedia_index_test segments of the 2013-08-31/2013-09-02 interval.

When CompactionTask.run() is called, it internally generates an IndexTask spec for the given dataSource and interval. The generated index task spec includes all dimensions and metrics of the segments in the given interval. The segments of the given interval should have the same queryGranularity and rollup flag.

The generated IndexTask spec for the above compaction task is:

```json
{
  "type" : "index",
  "id" : "compaction_wikipedia_index_test_2017-10-20T01:33:54.420Z",
  "resource" : {
    "availabilityGroup" : "compaction_wikipedia_index_test_2017-10-20T01:33:54.420Z",
    "requiredCapacity" : 1
  },
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia_index_test",
      "parser" : {
        "type" : "noop"
      },
      "metricsSpec" : [ {
        "type" : "doubleSum",
        "name" : "added",
        "fieldName" : "added",
        "expression" : null
      }, {
        "type" : "longSum",
        "name" : "count",
        "fieldName" : "count",
        "expression" : null
      }, {
        "type" : "doubleSum",
        "name" : "deleted",
        "fieldName" : "deleted",
        "expression" : null
      }, {
        "type" : "doubleSum",
        "name" : "delta",
        "fieldName" : "delta",
        "expression" : null
      } ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : {
          "type" : "period",
          "period" : "P2D",
          "timeZone" : "UTC",
          "origin" : null
        },
        "queryGranularity" : "SECOND",
        "rollup" : true,
        "intervals" : [ "2013-08-31T00:00:00.000Z/2013-09-02T00:00:00.000Z" ]
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "ingestSegment",
        "dataSource" : "wikipedia_index_test",
        "interval" : "2013-08-31T00:00:00.000Z/2013-09-02T00:00:00.000Z",
        "filter" : null,
        "dimensions" : [ "robot", "continent", "country", "city", "newPage", "unpatrolled", "namespace", "anonymous", "language", "page", "region", "user" ],
        "metrics" : [ "deleted", "added", "count", "delta" ]
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 75000,
      "maxTotalRows" : 20000000,
      "numShards" : null,
      "indexSpec" : {
        "bitmap" : {
          "type" : "concise"
        },
        "dimensionCompression" : "lz4",
        "metricCompression" : "lz4",
        "longEncoding" : "longs"
      },
      "maxPendingPersists" : 0,
      "buildV9Directly" : true,
      "forceExtendableShardSpecs" : false,
      "forceGuaranteedRollup" : false,
      "reportParseExceptions" : false,
      "publishTimeout" : 0
    }
  },
  "context" : null,
  "groupId" : "compaction_wikipedia_index_test_2017-10-20T01:33:54.420Z",
  "dataSource" : "wikipedia_index_test"
}
```


@jihoonson (Contributor Author):

I'll add a doc soon.

@jihoonson (Contributor Author):

Added doc.

@@ -79,7 +79,7 @@ A sample ingest firehose spec is shown below -
|interval|A String representing ISO-8601 Interval. This defines the time range to fetch the data over.|yes|
|dimensions|The list of dimensions to select. If left empty, no dimensions are returned. If left null or not defined, all dimensions are returned. |no|
|metrics|The list of metrics to select. If left empty, no metrics are returned. If left null or not defined, all metrics are selected.|no|
|filter| See [Filters](../querying/filters.html)|yes|
|filter| See [Filters](../querying/filters.html)|no|
Contributor:

Good catch.

@@ -104,7 +104,7 @@ Tasks can have different default priorities depening on their types. Here are a
|---------|----------------|
|Realtime index task|75|
|Batch index task|50|
|Merge/Append task|25|
|Merge/Append/Compation task|25|
Contributor:

Compaction (spelling)

Contributor Author:

Fixed.

}
```

### Compaction Task

Compaction tasks merge all segments of the given interval. The syntax is:
Contributor:

I think this should include a segmentGranularity too. Unless your idea is that the interval specified should just be one segment's worth of interval, in which case, that should be said in the docs.

Contributor Author:

Yeah, all the segments of the interval are always merged into a single segment. I added the below statement.

This compaction task merges all segments of the interval 2017-01-01/2018-01-01 into a single segment.

Contributor:

I suggest adding two more sentences:

To merge each day's worth of data into a separate segment, you can submit multiple "compact" tasks, one for each day. They will run in parallel.

Contributor Author:

Added.

For example, its `firehose` is always the [ingestSegmentSpec](./firehose.html) and `dimensionsSpec` and `metricsSpec`
always include all dimensions and metrics of the input segments.

Note that all input segments should have the same `queryGranularity` and `rollup`. See [Segment Metadata Queries](../querying/segmentmetadataquery.html#analysistypes) for more details.
@gianm (Contributor), Oct 25, 2017:

What happens if they don't have consistent queryGranularity and rollup? (Docs should say and it should hopefully be reasonable, since this situation may happen in real life.)

Contributor Author:

Good point. It threw an exception before, but now it automatically checks the input segments and sets rollup only if it is set for all of them.
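
A minimal sketch of that behavior (illustrative only, not the exact PR code): the generated spec enables rollup only when every input segment reports rollup in its metadata.

```java
import java.util.List;

// Illustrative sketch: decide the rollup flag of the generated index task from the
// rollup flags reported by the input segments' metadata.
static boolean computeRollup(List<Boolean> rollupFlagsOfInputs)
{
  boolean rollup = true;
  for (Boolean flag : rollupFlagsOfInputs) {
    // A missing (null) flag is treated as "not rolled up", so the output stays correct.
    rollup &= flag != null && flag;
  }
  return rollup;
}
```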


This compaction task merges _all segments_ of the interval `2017-01-01/2018-01-01`.

A compaction task internally generates an indexTask spec for performing compaction work with some fixed parameters.
Contributor:

Probably more clear:

generates an "index" task spec

Contributor Author:

Done.

// convert to combining aggregators
final AggregatorFactory[] combiningAggregators;
if (mergedAggregators == null) {
combiningAggregators = null;
Contributor:

Probably need to throw an exception here. I think if we actually go through with a dataSchema that has null aggregators, it will just drop all the metrics while compacting, which seems like a bad idea.

Contributor Author:

Good point. I added a check.
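
A sketch of that guard, under the assumption that the merged aggregators come back null when the input segments have no common metric definition (the method and message here are illustrative):

```java
import com.google.common.base.Preconditions;
import io.druid.query.aggregation.AggregatorFactory;

// Illustrative sketch: fail fast instead of building a dataSchema with null aggregators,
// which would silently drop every metric during compaction.
static AggregatorFactory[] toCombiningAggregators(AggregatorFactory[] mergedAggregators)
{
  Preconditions.checkState(
      mergedAggregators != null,
      "Failed to merge the aggregators of the input segments"
  );
  final AggregatorFactory[] combining = new AggregatorFactory[mergedAggregators.length];
  for (int i = 0; i < mergedAggregators.length; i++) {
    combining[i] = mergedAggregators[i].getCombiningFactory();
  }
  return combining;
}
```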

);
final Map<DataSegment, File> segmentFileMap = pair.lhs;
final List<TimelineObjectHolder<String, DataSegment>> timelineSegments = pair.rhs;
final List<String> dimensions = IngestSegmentFirehoseFactory.getUniqueDimensions(
Contributor:

I don't think this is necessary. IngestSegmentFirehoseFactory will include all dimensions if the dimensions parameter passed to the constructor is null.

Contributor Author:

Removed.

Contributor Author:

CompactionTask finds the unique set of dimensions to generate the DimensionsSpec. I added this line again to make sure the dimensions in the generated DimensionsSpec are the ones used in IngestSegmentFirehose.

timelineSegments,
new NoopInputRowParser(null)
);
final List<String> metrics = IngestSegmentFirehoseFactory.getUniqueMetrics(timelineSegments);
Contributor:

I don't think this is necessary either -- same reason. IngestSegmentFirehoseFactory should include all metrics if null is passed in.

Contributor Author:

Removed.

Contributor Author:

Similar here. CompactionTask finds the unique set of aggregators. I added this line again to make sure the aggregators in the generated DataSchema are the ones used in IngestSegmentFirehose.

}

// find granularity spec
final GranularitySpec granularitySpec = new UniformGranularitySpec(
Contributor:

I think using ArbitraryGranularitySpec would be simpler. It's designed to index specific intervals.
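
For reference, a sketch of what that could look like; the constructor shape and package names are recalled from the codebase of that era and should be treated as approximate, and the argument values mirror the example spec above:

```java
import io.druid.java.util.common.granularity.Granularities;
import io.druid.segment.indexing.granularity.ArbitraryGranularitySpec;
import io.druid.segment.indexing.granularity.GranularitySpec;
import org.joda.time.Interval;

import java.util.Collections;

// Illustrative: ArbitraryGranularitySpec only needs the query granularity, the rollup flag,
// and the exact interval being compacted -- no segmentGranularity bucketing is involved.
final GranularitySpec granularitySpec = new ArbitraryGranularitySpec(
    Granularities.SECOND,  // queryGranularity taken from the input segments
    true,                  // rollup, if all input segments were rolled up
    Collections.singletonList(Interval.parse("2013-08-31T00:00:00Z/2013-09-02T00:00:00Z"))
);
```

Compared with UniformGranularitySpec, there is no segmentGranularity to choose, which fits a task whose output interval is exactly the compaction interval.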

Contributor Author:

Changed.


return new DataSchema(
dataSource,
ImmutableMap.of("type", "noop"),
Contributor:

I think this will not work right with regard to numeric dimensions.

This data schema will essentially tell the index task to use metrics from combiningAggregators (which is good, assuming it's computed properly) and to auto-detect dimensions. But dimension auto-detection basically just treats everything that is not an input to an aggregator as a string. It won't retain the types they had in the original segment if it was a long or float dimension for example.

Contributor Author:

Thank you for the good point. I thought about this, and the only feasible solution looks like allowing a user-defined dimensionsSpec in the compactionTask spec until we store dimension data types somewhere. Does that make sense?

Contributor Author:

Changed to accept an optional dimensionsSpec.

@gianm (Contributor), Oct 26, 2017:

We can do better than that, by examining what types the existing dimension columns are. storageAdapter.getColumnCapabilities(column).getType() is the way to do that.
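
A sketch of that approach; the class and constructor names are recalled from the codebase around this time and may not match the final patch exactly:

```java
import io.druid.data.input.impl.DimensionSchema;
import io.druid.data.input.impl.DimensionsSpec;
import io.druid.data.input.impl.FloatDimensionSchema;
import io.druid.data.input.impl.LongDimensionSchema;
import io.druid.data.input.impl.StringDimensionSchema;
import io.druid.segment.StorageAdapter;
import io.druid.segment.column.ValueType;

import java.util.ArrayList;
import java.util.List;

// Illustrative: derive a typed DimensionsSpec from an existing segment's storage adapter
// instead of asking the user, so long/float dimensions keep their original types.
static DimensionsSpec dimensionsSpecOf(StorageAdapter adapter)
{
  final List<DimensionSchema> schemas = new ArrayList<>();
  for (String dim : adapter.getAvailableDimensions()) {
    final ValueType type = adapter.getColumnCapabilities(dim).getType();
    if (type == ValueType.LONG) {
      schemas.add(new LongDimensionSchema(dim));
    } else if (type == ValueType.FLOAT) {
      schemas.add(new FloatDimensionSchema(dim));
    } else {
      schemas.add(new StringDimensionSchema(dim));
    }
  }
  return new DimensionsSpec(schemas, null, null);
}
```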

@jihoonson (Contributor Author):

@gianm thank you for the quick review!



return new DataSchema(
dataSource,
dimensionsSpec == null ? ImmutableMap.of("type", "noop")
Contributor:

It would be more type-safe to do something like jsonMapper.convertValue(parser, JacksonUtils.TYPE_REFERENCE_MAP_STRING_OBJECT), on a parser you create using a normal constructor.
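
A sketch of that suggestion; NoopInputRowParser appears in the patch and JacksonUtils.TYPE_REFERENCE_MAP_STRING_OBJECT is the constant named in the comment, but the exact package paths here are recalled and may differ:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import io.druid.data.input.impl.NoopInputRowParser;
import io.druid.java.util.common.jackson.JacksonUtils;

import java.util.Map;

// Illustrative: construct the parser with its normal constructor, then convert it to the
// Map form that DataSchema expects, rather than hand-writing ImmutableMap.of("type", "noop").
static Map<String, Object> noopParserMap(ObjectMapper jsonMapper)
{
  return jsonMapper.convertValue(
      new NoopInputRowParser(null),
      JacksonUtils.TYPE_REFERENCE_MAP_STRING_OBJECT
  );
}
```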

Contributor Author:

Thanks. Changed.

@JsonProperty("resource") final TaskResource taskResource,
@JsonProperty("dataSource") final String dataSource,
@JsonProperty("interval") final Interval interval,
@JsonProperty("dimensionsSpec") final DimensionsSpec dimensionsSpec,
Contributor:

Rather than asking the user to include this, it could be determined by looking at the dimensions and column types from each segment's StorageAdapter. The methods are getAvailableDimensions and getColumnCapabilities.

Contributor Author:

Thanks. I changed.

{
final BiMap<String, Integer> uniqueMetrics = HashBiMap.create();

// Here, we try to retain the order of metrics as they were specified since the order of metrics may be
Contributor:

Unlike dimensions, order of metrics doesn't matter for performance. Dimension order matters because it affects sorting of the rows and can be used to improve locality (rows are sorted by time first, but rows within the same time bucket are sorted by dimensions, in order). But metric order doesn't affect sorting.

Contributor Author:

Right. Metric order currently doesn't affect performance. However, we're going to handle dimensions and metrics in the same way, so I guess this will be needed in the future. Do you think it's better to add it later?

BTW, I updated the comments here.

always include all dimensions and metrics of the input segments.

Note that the output segment is rolled up only when `rollup` is set for all input segments.
See [Segment Metadata Queries](../querying/segmentmetadataquery.html#analysistypes) for more details about `rollup`.
Contributor:

Can you add a reference to http://druid.io/docs/latest/design/index.html#roll-up here, and add a note next to the SegmentMetadataQuery link explaining that the query can be used to determine whether a segment was created with rollup or not?

The "design" link has a more substantial explanation of what rollup is, and the extra note about SegmentMetadataQuery would make it clearer what it's used for.


// Here, we try to retain the order of dimensions as they were specified since the order of dimensions may be
// optimized for performance.
// Dimensions are extracted from the most recent segments to the older ones because recent segments are likely to be queried more
Contributor:

I think this would also have the effect of giving recent segments precedence in terms of what type each dimension has (for example, if an older segment stored a dimension as String but newer ones store it as Long). Can you mention the ordering and type precedence in the docs somewhere?
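
A sketch of the newest-first collection the excerpt above refers to (illustrative; the actual logic lives in the getUniqueDimensions/getUniqueMetrics helpers): because a name keeps the position it had the first time it is seen, more recent segments win both ordering and, by extension, type precedence.

```java
import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;

import java.util.ArrayList;
import java.util.List;

// Illustrative: walk per-segment dimension lists from newest to oldest; the first occurrence
// of a name fixes its position, so recent segments take precedence over older ones.
static List<String> uniqueDimensions(List<List<String>> dimensionListsNewestFirst)
{
  final BiMap<String, Integer> uniqueDims = HashBiMap.create();
  int index = 0;
  for (List<String> segmentDims : dimensionListsNewestFirst) {
    for (String dim : segmentDims) {
      if (!uniqueDims.containsKey(dim)) {
        uniqueDims.put(dim, index++);
      }
    }
  }
  // Rebuild the ordered list through the inverse view of the BiMap.
  final List<String> ordered = new ArrayList<>(uniqueDims.size());
  for (int i = 0; i < uniqueDims.size(); i++) {
    ordered.add(uniqueDims.inverse().get(i));
  }
  return ordered;
}
```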

Contributor:

One question that leads to: would it be useful to allow users some control over ordering and types?

I'm thinking that maybe that overcomplicates the Compaction Task which is meant to be simple to express, and users can manually write a full batch task if they have column ordering/type requirements for the final compacted segments, but would like to see what your thoughts are on that.

Contributor Author:

I think this would also have the effect giving recent segments precedence in terms of what type each dimension has

Good point. Added doc.

One question that leads to, would it be useful to allow users some control over ordering and types?

I think it's a good idea. Users can write a full batch spec for full control, but someone might want a little more control with the compaction task because it's simple. I added this feature and a unit test for it.

@jon-wei (Contributor) commented Oct 28, 2017:

@jihoonson Had a few comments related to adding things to docs, rest LGTM. Can you also add a test that includes compacting segments with different dimension orders/types?

@jihoonson (Contributor Author):

@jon-wei thank you for the review. I changed CompactionTaskTest to test different dimension orders and types.

@jihoonson (Contributor Author) commented Oct 31, 2017:

I added segments as a new parameter to CompactionTask. This is not documented because it's intended to be used only by coordinators.

@jihoonson (Contributor Author):

@jon-wei @gianm do you have more comments?

@gianm (Contributor) commented Nov 4, 2017:

The latest changes look good to me.

@gianm merged commit 5f3c863 into apache:master on Nov 4, 2017
@jihoonson (Contributor Author):

@gianm thank you!

@jon-wei added this to the 0.12.0 milestone on Jan 5, 2018
@Gauravshah:

@jihoonson since removing dimensions is supported, is there any reason we didn't include the metricsSpec? It would be useful to be able to go to a different granularity after compaction, for example from minute to hour after some duration has passed.

@jihoonson (Contributor Author):

@Gauravshah, yes we can also add metricSpec if needed. Are you interested in making a PR for it?

@Gauravshah:

@jihoonson sure, I can try taking a stab at it. I don't know Druid internals well, though. I will start working on it after 4 weeks.

@jihoonson (Contributor Author):

@Gauravshah sounds great. Thanks!
