
Create splits of multiple files for parallel indexing #9360

Merged: 14 commits merged into apache:master on Feb 25, 2020

Conversation

@jihoonson (Contributor) commented on Feb 13, 2020

Description

Currently, the Parallel task creates a sub task per input file. This can be inefficient when you have lots of small files, because each task carries overhead for scheduling, JVM startup, etc.

This PR adds a new MaxSizeSplitHintSpec and allows the Parallel task to create splits of multiple files. If a split contains only one file, that file can be larger than the configured maxSize; otherwise, the total size of the files in the same split cannot exceed maxSize. This means that if you have a very large file, there will still be only one task processing it. This could be addressed in the future by creating multiple splits for the same file, each referencing a disjoint part of it.

This PR changes the default splitHintSpec from none to MaxSizeSplitHintSpec.
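
To make the grouping behavior concrete, here is a minimal, self-contained Java sketch of the idea (illustrative only, not the actual Druid implementation; files are represented just by their sizes in bytes):

```java
import java.util.ArrayList;
import java.util.List;

// Groups "files" (represented by their sizes) into splits whose total size stays
// within maxSplitSize. A single file larger than maxSplitSize still forms its own
// one-file split, matching the behavior described above.
public class SplitGroupingSketch
{
  public static List<List<Long>> group(List<Long> fileSizes, long maxSplitSize)
  {
    List<List<Long>> splits = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentSize + size > maxSplitSize) {
        splits.add(current);        // current split is full; start a new one
        current = new ArrayList<>();
        currentSize = 0;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      splits.add(current);
    }
    return splits;
  }

  public static void main(String[] args)
  {
    long mb = 1024L * 1024L;
    // With maxSplitSize = 500MB: the two 200MB files share a split, while 700MB and 100MB each get their own.
    System.out.println(group(List.of(200 * mb, 200 * mb, 700 * mb, 100 * mb), 500 * mb));
  }
}
```

Note that the 700MB file above is not broken up; splitting a single large file across tasks is the future work mentioned in the description.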


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

@sthetland left a comment

Added doc review.

docs/ingestion/native-batch.md (Outdated)
doesn't depend on other external systems like Hadoop. The `index_parallel` task is a supervisor task which orchestrates
the whole indexing process. It splits the input data and and issues worker tasks
to the Overlord which actually process the assigned input split and create segments.
Once a worker task successfully processes all assigned input split, it reports the generated segment list to the supervisor task.


It’s a little unclear to me who's doing what in this. Is the following accurate/clearer?

“The index_parallel task is a supervisor task that orchestrates the indexing process. The task splits input data for processing by Overlord worker tasks, which process the input splits assigned to them and create segments from the input. Once a worker task successfully processes all assigned input splits, it reports the generated segment list to the supervisor task.”

If not, for a lighter edit, maybe just clarify that it's the worker tasks more specifically, rather than the overlord, that is processing input splits (if that's the case).

jihoonson (Contributor, Author):

Thanks for taking a look!

If not, for a lighter edit, maybe just clarify that it's the worker tasks more specifically, rather than the overlord, that is processing input splits (if that's the case).

This is correct. I tried to make it more clear.

The Parallel task (type `index_parallel`) is a task for parallel batch indexing. This task only uses Druid’s resources and
doesn’t depend on other external systems like Hadoop. The `index_parallel` task is a supervisor task that orchestrates
the whole indexing process. The supervisor task splits the input data and creates worker tasks to process those splits.
The created worker tasks are issued to the Overlord so that they can be scheduled and run on MiddleManagers or Indexers.
Once a worker task successfully processes the assigned input split, it reports the generated segment list to the supervisor task.
The supervisor task periodically checks the status of worker tasks. If one of them fails, it retries the failed task
until the number of retries reaches the configured limit. If all worker tasks succeed, it publishes the reported segments at once and finalizes ingestion.

docs/ingestion/native-batch.md (Outdated)
|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should always be `maxSize`.|none|yes|
|maxSplitSize|Maximum number of bytes of input files to process in a single task. If a single file is larger than this number, it will be processed by itself in a single task (splitting a large file is not supported yet).|500MB|no|


Could this match the wording used below, so:
"....in a single task. (Files are never split across tasks.)"

jihoonson (Contributor, Author):

👍 I added "yet" at the end of the sentence since we may want to split files across tasks in the future.

import java.util.Objects;
import java.util.stream.Stream;

public class SpecificFilesLocalInputSource extends AbstractInputSource implements SplittableInputSource<List<File>>
Member:

If it isn't too much trouble, it seems like this would be better to just be a part of LocalInputSource to be more consistent with the cloud file input sources, rather than introducing a new type. Though if it is needlessly complicated then is probably fine as is.

jihoonson and others added 3 commits February 19, 2020 22:34
Co-Authored-By: sthetland <steve.hetland@imply.io>
Co-Authored-By: sthetland <steve.hetland@imply.io>
this.files = files;

if (baseDir == null && CollectionUtils.isNullOrEmpty(files)) {
throw new IAE("Either one of baseDir or files should be specified");
Member:

Is this better to accept both baseDir + filter and explicit files list, or should you specify one or the other exclusively?

If you think accepting both is better then this exception message should probably say 'At least one of ...' instead of 'Either one of'.

jihoonson (Contributor, Author):

Oops, thanks. I'm not sure why we can't have both at the same time, as long as we don't process the same file more than once. It would be more aligned with the cloud input sources though. (Also, why do we do this?)

Member:

Yeah, I actually think it probably would be better to allow both uris and prefixes in the cloud file input sources and any others that match this pattern. Not sure why we only allow one or the other currently.
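
As a side note on the thread above, a minimal sketch of what a combined constructor could look like, accepting both a baseDir/filter pair and an explicit file list, with the reworded validation message (a simplified stand-in, not the exact merged code):

```java
import java.io.File;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for the local input source constructor discussed above.
// Validation only requires that at least one of baseDir or files is present,
// so baseDir + filter and an explicit file list can be used together.
public class LocalInputSourceSketch
{
  private final File baseDir;   // may be null
  private final String filter;  // may be null
  private final Set<File> files;

  public LocalInputSourceSketch(File baseDir, String filter, Set<File> files)
  {
    if (baseDir == null && (files == null || files.isEmpty())) {
      // "At least one of", since the two options are no longer mutually exclusive
      throw new IllegalArgumentException("At least one of baseDir or files should be specified");
    }
    this.baseDir = baseDir;
    this.filter = filter;
    this.files = files == null
                 ? Collections.emptySet()
                 : Collections.unmodifiableSet(new HashSet<>(files));
  }
}
```

The merged LocalInputSource constructor shown later in this conversation takes the same three properties (baseDir, filter, files).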

current.add(peeking);
splitSize += size;
peeking = null;
} else if (splitSize + size < maxSplitSize) {
Contributor:

Looks like the `splitSize + size < maxSplitSize` and `current.isEmpty()` branches can be combined

jihoonson (Contributor, Author):

Ah good catch. Fixed.


@JsonCreator
public LocalInputSource(
@JsonProperty("baseDir") File baseDir,
@JsonProperty("filter") String filter
@JsonProperty("filter") String filter,
@JsonProperty("files") Set<File> files
Contributor:

Can you add this new property to the LocalInputSource property docs?

jihoonson (Contributor, Author):

Oops, added.

@@ -48,23 +48,23 @@
public InputEntityIteratingReader(
InputRowSchema inputRowSchema,
InputFormat inputFormat,
Stream<InputEntity> sourceStream,
Iterator<? extends InputEntity> sourceStream,
Contributor:

nit: could call this sourceIterator

jihoonson (Contributor, Author):

Fixed.

}

public InputEntityIteratingReader(
InputRowSchema inputRowSchema,
InputFormat inputFormat,
CloseableIterator<InputEntity> sourceIterator,
CloseableIterator<? extends InputEntity> sourceIterator,
Contributor:

nit: could call this sourceCloseableIterator

jihoonson (Contributor, Author):

Fixed.


@JsonCreator
public LocalInputSource(
@JsonProperty("baseDir") File baseDir,
@JsonProperty("filter") String filter
@JsonProperty("filter") String filter,
@JsonProperty("files") Set<File> files
Contributor:

Can add a @Nullable here

jihoonson (Contributor, Author):

Added.

sizeInLong = sizeInBigInteger.longValueExact();
}
catch (ArithmeticException e) {
sizeInLong = Long.MAX_VALUE;
Contributor:

Should this propagate the exception instead? If we get an object with a byte size that can't be stored in a long, something seems very wrong

jihoonson (Contributor, Author):

The length of a Google Storage object is an unsigned long (https://cloud.google.com/storage/docs/json_api/v1/objects#resource-representations), so it may not fit in a signed long. I think it's better to keep working instead of failing. Added a warning log about the exception.
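
For illustration, a small self-contained sketch of the clamping behavior described here (System.err stands in for the actual warning log, and the method name is made up):

```java
import java.math.BigInteger;

public class ObjectSizeSketch
{
  // Cloud storage may report object sizes as unsigned 64-bit integers, so the value
  // can exceed Long.MAX_VALUE. Rather than failing, clamp to Long.MAX_VALUE and warn.
  static long toLongSize(BigInteger sizeInBigInteger)
  {
    try {
      return sizeInBigInteger.longValueExact();
    }
    catch (ArithmeticException e) {
      System.err.println("Size [" + sizeInBigInteger + "] is out of long range; using Long.MAX_VALUE instead");
      return Long.MAX_VALUE;
    }
  }

  public static void main(String[] args)
  {
    System.out.println(toLongSize(BigInteger.valueOf(123L)));                              // 123
    System.out.println(toLongSize(BigInteger.valueOf(2).pow(64).subtract(BigInteger.ONE))); // clamped to Long.MAX_VALUE
  }
}
```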

retryPolicyFactory,
dataSource,
interval,
splitHintSpec == null ? new SegmentsSplitHintSpec(null) : splitHintSpec
Contributor:

Since it would get converted into a MaxSizeSplitHintSpec in createSplit, could this create a MaxSizeSplitHintSpec directly? (Does this also mean SegmentsSplitHintSpec is deprecated?)

jihoonson (Contributor, Author):

Changed to create MaxSizeSplitHintSpec directly.

Does this also mean SegmentsSplitHintSpec is deprecated?

Good question. MaxSizeSplitHintSpec and SegmentsSplitHintSpec work exactly the same for now, but I think SegmentsSplitHintSpec can be further optimized in the future. Added a comment about the future improvement.

* If there is only one file in the split, its size can be larger than {@link #maxSplitSize}.
* If there are two or more files in the split, their total size cannot be larger than {@link #maxSplitSize}.
*/
public class MaxSizeSplitHintSpec implements SplitHintSpec
Contributor:

I think you should make spec classes pure data objects (or beans). Adding methods like split to them makes them complicated and adds logic that makes them hard to version in the future. We should think of data objects as literals, not as objects with business logic.

jihoonson (Contributor, Author):

Good point. I agree it is a better structure, but the problem is that too many classes do this kind of thing, especially on the ingestion side. I don't think it's possible to apply the suggested design to all of them anytime soon. Also, I think it's better to promote SQL for ingestion as well, so that Druid users don't have to worry about API changes.

{
return new Iterator<List<T>>()
{
private T peeking;
Contributor:

I think you can simplify the logic of the next method below if you initialize peeking to inputIterator.next(), and only set peeking to null when inputIterator.hasNext() is false. In your next() below, you would just keep shifting values from inputIterator into current after each iteration as long as there are more inputs.

jihoonson (Contributor, Author):

I don't understand how that would work. `peeking` keeps the last input fetched from the underlying iterator, because whether it gets added depends on the total size of the inputs already in the current list. If the last fetched input was not added, it must be returned in the following next() call.
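
To illustrate the role of `peeking`, here is a minimal, self-contained version of the splitting iterator (simplified: elements stand in for file sizes, and this is not the exact code under review):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SplittingIteratorSketch
{
  // Groups longs (acting as file sizes) into lists whose total stays within maxSplitSize.
  // 'peeking' holds an element fetched from the underlying iterator that did not fit into
  // the current split; it must be emitted at the start of the next split.
  static Iterator<List<Long>> split(Iterator<Long> inputIterator, long maxSplitSize)
  {
    return new Iterator<List<Long>>()
    {
      private Long peeking;

      @Override
      public boolean hasNext()
      {
        return peeking != null || inputIterator.hasNext();
      }

      @Override
      public List<Long> next()
      {
        final List<Long> current = new ArrayList<>();
        long splitSize = 0;
        while (peeking != null || inputIterator.hasNext()) {
          if (peeking == null) {
            peeking = inputIterator.next();
          }
          final long size = peeking;
          if (current.isEmpty() || splitSize + size <= maxSplitSize) {
            current.add(peeking);  // fits, or is the first element of a new split
            splitSize += size;
            peeking = null;        // consumed
          } else {
            break;                 // does not fit; keep it in 'peeking' for the next call
          }
        }
        return current;
      }
    };
  }

  public static void main(String[] args)
  {
    Iterator<List<Long>> splits = split(List.of(200L, 200L, 700L, 100L).iterator(), 500L);
    while (splits.hasNext()) {
      System.out.println(splits.next()); // [200, 200], then [700], then [100]
    }
  }
}
```

This sketch also folds the `current.isEmpty()` case into the size check, along the lines of the earlier comment about combining the two branches.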

{
return Objects.hash(maxSplitSize);
}
}
Contributor:

equals and hashCode need unit tests

jihoonson (Contributor, Author):

Added.
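
For the equals/hashCode tests requested here (and for the other classes below), a minimal JUnit 4 sketch; the constructor usage is an assumption based on `new MaxSizeSplitHintSpec(null)` elsewhere in this PR, and the same pattern applies to LocalInputSource and friends:

```java
import org.junit.Assert;
import org.junit.Test;

public class MaxSizeSplitHintSpecTest
{
  @Test
  public void testEqualsAndHashCode()
  {
    // Hypothetical constructor usage: a single nullable maxSplitSize argument.
    MaxSizeSplitHintSpec spec1 = new MaxSizeSplitHintSpec(1000L);
    MaxSizeSplitHintSpec spec2 = new MaxSizeSplitHintSpec(1000L);
    MaxSizeSplitHintSpec spec3 = new MaxSizeSplitHintSpec(2000L);

    // Equal field values imply equal objects and equal hash codes.
    Assert.assertEquals(spec1, spec2);
    Assert.assertEquals(spec1.hashCode(), spec2.hashCode());
    // Different field values imply inequality.
    Assert.assertNotEquals(spec1, spec3);
  }
}
```

A library such as EqualsVerifier could also be used to check the full equals/hashCode contract in a single call.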

@@ -56,6 +57,12 @@ public long getMaxInputSegmentBytesPerTask()
return maxInputSegmentBytesPerTask;
}

@Override
public <T> Iterator<List<T>> split(Iterator<T> inputIterator, Function<T, InputFileAttribute> inputAttributeExtractor)
Contributor:

Seems like this method really doesn't belong here if not all subclasses or implementations need it? Or should this class be abstract instead?

jihoonson (Contributor, Author):

Added a comment about it.

}

@Override
public int hashCode()
{
return Objects.hash(baseDir, filter);
return Objects.hash(baseDir, filter, files);
}
Contributor:

equals and hashCode need unit tests for maintainability.

jihoonson (Contributor, Author):

Added.

public int hashCode()
{
return Objects.hash(segmentId, intervals);
}
Contributor:

Tests for equals and hashCode, please.

jihoonson (Contributor, Author):

Added.

@jon-wei merged commit 3bc7ae7 into apache:master on Feb 25, 2020
@jihoonson added this to the 0.18.0 milestone on Mar 26, 2020