More error reporting and stats for ingestion tasks #5418

Merged
merged 12 commits into apache:master on Apr 6, 2018

Conversation

@jon-wei (Contributor) commented Feb 23, 2018

This PR adds more ingestion status and error reporting for IndexTask, KafkaIndexTask, and HadoopIndexTask.

Things changed:

  • ParseException handling:
    • reportParseExceptions and ignoreInvalidRows are now deprecated
    • Three new configurations have been added to the various tuningConfig classes:
      • logParseExceptions - log an error (via log.error()) when a parse exception occurs
      • maxParseExceptions - The maximum number of parse exceptions allowed before a task fails
      • maxSavedParseExceptions - Save the last N parse exceptions to a circular buffer, retrievable via a new unparseableEvents endpoint on the tasks, and also exposed in the completed task status
      • When reportParseExceptions is true, this is treated as maxParseExceptions set to 0. (See the example tuningConfig after this list.)
  • New row counter type: processedWithErrors -> counts rows that were parseable but contained a column value that could not be converted to the expected type (e.g., a row with "helloworld" in a long column). These rows count against the maxParseExceptions limit.
  • TaskStatus changes
    • The task status now has an additional error message:
      • errorMsg -> Message showing cause for task failure, truncated to 100 characters. The full message is available in the TaskReport described below.
  • rowStats peon endpoint (http://localhost:<task port>/druid/worker/v1/chat/<task id>/rowStats)
    • Shows the current total row stats for an ingestion task. By default this shows stats only for the current ingestion phase. Adding ?full will return stats for all phases.
  • unparseableEvents peon endpoint (http://localhost:<task port>/druid/worker/v1/chat/<task id>/unparseableEvents)
    • Returns a list of the last N parse exceptions encountered by a task
  • Ingestion stats and errors TaskReport:
    • Available at http://<overlord>/druid/indexer/v1/task/<task_id>/reports after a task completes. Contains the following:
    • ingestionState -> shows which ingestion phase the task is in; possible values are NOT_STARTED, DETERMINE_PARTITIONS, BUILD_SEGMENTS, and COMPLETED
    • rowStats
    • unparseableEvents
    • errorMsg -> shows the failure cause of a task
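
For reference, a minimal tuningConfig snippet using the new parse exception options might look like the following (shown for a native index task; the values are illustrative, not defaults):

"tuningConfig": {
  "type": "index",
  "logParseExceptions": true,
  "maxParseExceptions": 100,
  "maxSavedParseExceptions": 10
}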

Example task report:

{
  "ingestionStatsAndErrors": {
    "taskId": "index_wikiticker_2018-04-03T20:33:12.782Z",
    "payload": {
      "ingestionState": "COMPLETED",
      "unparseableEvents": {
        "determinePartitions": [
          "Encountered row with timestamp that cannot be represented as a long: [MapBasedInputRow{timestamp=246140482-04-24T15:36:27.903Z, event={isRobot=true, channel=#ca.wikipedia, timestamp=246140482-04-24T15:36:27.903Z, flags=B, isUnpatrolled=false, page=Exèrcit dels guàrdies de la revolució islàmica, diffUrl=https://ca.wikipedia.org/w/index.php?diff=17129537&oldid=17127917&rcid=29640873, added=18, comment=Robot posa data a plantilles de manteniment, commentLength=43, isNew=false, isMinor=false, delta=18, isAnonymous=false, user=ArnauBot, deltaBucket=0.0, deleted=0, namespace=Main}, dimensions=[channel, cityName, comment, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, page, regionIsoCode, regionName, user, commentLength, deltaBucket, flags, diffUrl]}]",
          "Unable to parse row [and so is this lol]",
          "Unable to parse row [this is total garbage]",
          "Unable to parse row [this is total garbag34]",
          "Unable to parse row [this is total garbage3]",
          "Unable to parse row [this is total garbage2]"
        ],
        "buildSegments": [
          "Encountered row with timestamp that cannot be represented as a long: [MapBasedInputRow{timestamp=246140482-04-24T15:36:27.903Z, event={isRobot=true, channel=#ca.wikipedia, timestamp=246140482-04-24T15:36:27.903Z, flags=B, isUnpatrolled=false, page=Exèrcit dels guàrdies de la revolució islàmica, diffUrl=https://ca.wikipedia.org/w/index.php?diff=17129537&oldid=17127917&rcid=29640873, added=18, comment=Robot posa data a plantilles de manteniment, commentLength=43, isNew=false, isMinor=false, delta=18, isAnonymous=false, user=ArnauBot, deltaBucket=0.0, deleted=0, namespace=Main}, dimensions=[channel, cityName, comment, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, page, regionIsoCode, regionName, user, commentLength, deltaBucket, flags, diffUrl]}]",
          "Unable to parse row [and so is this lol]",
          "Found unparseable columns in row: [MapBasedInputRow{timestamp=2016-06-27T00:02:20.008Z, event={isRobot=false, channel=#id.wikipedia, timestamp=2016-06-27T00:02:20.008Z, flags=!, isUnpatrolled=true, page=Ibnu Sina, diffUrl=https://id.wikipedia.org/w/index.php?diff=11687177&oldid=11444059&rcid=20812462, added=106, comment=gambar, commentLength=6, isNew=false, isMinor=false, delta=106, isAnonymous=false, user=Ftihikam, deltaBucket=bad-delta-bucket, deleted=0, namespace=Main}, dimensions=[channel, cityName, comment, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, page, regionIsoCode, regionName, user, commentLength, deltaBucket, flags, diffUrl]}], exceptions: [could not convert value [bad-delta-bucket] to long,]",
          "Unable to parse row [this is total garbage]",
          "Found unparseable columns in row: [MapBasedInputRow{timestamp=2016-06-27T00:01:07.347Z, event={isRobot=true, channel=#ceb.wikipedia, timestamp=2016-06-27T00:01:07.347Z, flags=NB, isUnpatrolled=false, page=Neqerssuaq, diffUrl=https://ceb.wikipedia.org/w/index.php?oldid=9563239&rcid=36193146, added=4150, comment=Paghimo ni bot Greenland, commentLength=24, isNew=true, isMinor=false, delta=4150, isAnonymous=false, user=Lsjbot, deltaBucket=helloworld, deleted=0, namespace=Main}, dimensions=[channel, cityName, comment, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, page, regionIsoCode, regionName, user, commentLength, deltaBucket, flags, diffUrl]}], exceptions: [could not convert value [helloworld] to long,]",
          "Unable to parse row [this is total garbag34]"
        ]
      },
      "rowStats": {
        "determinePartitions": {
          "rowsProcessed": 29,
          "rowsProcessedWithErrors": 0,
          "rowsThrownAway": 0,
          "rowsUnparseable": 7
        },
        "buildSegments": {
          "rowsProcessed": 27,
          "rowsProcessedWithErrors": 2,
          "rowsThrownAway": 0,
          "rowsUnparseable": 7
        }
      },
      "errorMsg": null
    },
    "type": "ingestionStatsAndErrors"
  }
}

Example rowStats output:

{
  "totals": {
    "buildSegments": {
      "rowsProcessedWithErrors": 0,
      "rowsProcessed": 22479,
      "rowsUnparseable": 0,
      "rowsThrownAway": 0
    }
  }
}

Example unparseableEvents output:

{
  "determinePartitions": [
    "Encountered row with timestamp that cannot be represented as a long: [MapBasedInputRow{timestamp=246140482-04-24T15:36:27.903Z, event={isRobot=true, channel=#ca.wikipedia, timestamp=246140482-04-24T15:36:27.903Z, flags=B, isUnpatrolled=false, page=Exèrcit dels guàrdies de la revolució islàmica, diffUrl=https://ca.wikipedia.org/w/index.php?diff=17129537&oldid=17127917&rcid=29640873, added=18, comment=Robot posa data a plantilles de manteniment, commentLength=43, isNew=false, isMinor=false, delta=18, isAnonymous=false, user=ArnauBot, deltaBucket=0.0, deleted=0, namespace=Main}, dimensions=[channel, cityName, comment, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, page, regionIsoCode, regionName, user, commentLength, deltaBucket, flags, diffUrl]}]",
    "Unable to parse row [and so is this lol]",
    "Unable to parse row [this is total garbage]",
    "Unable to parse row [this is total garbag34]",
    "Unable to parse row [this is total garbage3]",
    "Unable to parse row [this is total garbage2]"
  ],
  "buildSegments": [
    "Encountered row with timestamp that cannot be represented as a long: [MapBasedInputRow{timestamp=246140482-04-24T15:36:27.903Z, event={isRobot=true, channel=#ca.wikipedia, timestamp=246140482-04-24T15:36:27.903Z, flags=B, isUnpatrolled=false, page=Exèrcit dels guàrdies de la revolució islàmica, diffUrl=https://ca.wikipedia.org/w/index.php?diff=17129537&oldid=17127917&rcid=29640873, added=18, comment=Robot posa data a plantilles de manteniment, commentLength=43, isNew=false, isMinor=false, delta=18, isAnonymous=false, user=ArnauBot, deltaBucket=0.0, deleted=0, namespace=Main}, dimensions=[channel, cityName, comment, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, page, regionIsoCode, regionName, user, commentLength, deltaBucket, flags, diffUrl]}]",
    "Unable to parse row [and so is this lol]",
    "Found unparseable columns in row: [MapBasedInputRow{timestamp=2016-06-27T00:02:20.008Z, event={isRobot=false, channel=#id.wikipedia, timestamp=2016-06-27T00:02:20.008Z, flags=!, isUnpatrolled=true, page=Ibnu Sina, diffUrl=https://id.wikipedia.org/w/index.php?diff=11687177&oldid=11444059&rcid=20812462, added=106, comment=gambar, commentLength=6, isNew=false, isMinor=false, delta=106, isAnonymous=false, user=Ftihikam, deltaBucket=bad-delta-bucket, deleted=0, namespace=Main}, dimensions=[channel, cityName, comment, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, page, regionIsoCode, regionName, user, commentLength, deltaBucket, flags, diffUrl]}], exceptions: [could not convert value [bad-delta-bucket] to long,]",
    "Unable to parse row [this is total garbage]",
    "Found unparseable columns in row: [MapBasedInputRow{timestamp=2016-06-27T00:01:07.347Z, event={isRobot=true, channel=#ceb.wikipedia, timestamp=2016-06-27T00:01:07.347Z, flags=NB, isUnpatrolled=false, page=Neqerssuaq, diffUrl=https://ceb.wikipedia.org/w/index.php?oldid=9563239&rcid=36193146, added=4150, comment=Paghimo ni bot Greenland, commentLength=24, isNew=true, isMinor=false, delta=4150, isAnonymous=false, user=Lsjbot, deltaBucket=helloworld, deleted=0, namespace=Main}, dimensions=[channel, cityName, comment, countryIsoCode, countryName, isAnonymous, isMinor, isNew, isRobot, isUnpatrolled, metroCode, namespace, page, regionIsoCode, regionName, user, commentLength, deltaBucket, flags, diffUrl]}], exceptions: [could not convert value [helloworld] to long,]",
    "Unable to parse row [this is total garbag34]"
  ]
}

@jon-wei changed the title from "[WIP] More error reporting and stats for ingestion tasks" to "More error reporting and stats for ingestion tasks" on Mar 14, 2018
@jon-wei removed the WIP label Mar 14, 2018
@clintropolis (Member) left a comment

reviewed through KafkaSupervisorTest.java

return false;
}

if (!Objects.equals(location, that.location)) {
Member:

dupe

Contributor Author:

Fixed


public CircularBuffer(int capacity)
{
buffer = (E[]) new Object[capacity];
Member:

Maybe fail fast here with a precondition check that capacity is larger than 0, instead of blowing up with an out-of-bounds error later

Contributor Author:

Added a preconditions check


import com.google.common.base.Preconditions;

public class CircularBuffer<E>
Member:

Any reason not to use https://google.github.io/guava/releases/23.0/api/docs/com/google/common/collect/EvictingQueue.html? However, it would require a different strategy for getMessagesFromSavedParseExceptions where getLatest is used.

Contributor Author:

Decided to keep CircularBuffer for now, since it was already in the codebase and I do want getLatest

Contributor:

CircularBuffer is used in this as well as ChangeRequestHistory and ChangeRequestHistory requires a randomly-accessible circular array. I think it's fine to keep this.

However, would you add some javadocs to this class? I also think we need some unit tests for this class, but it's not mandatory for this PR.
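
For reference, a minimal sketch of a CircularBuffer along the lines discussed here, with the capacity precondition and the getLatest accessor (an illustration only, not the exact class in the codebase):

import com.google.common.base.Preconditions;

public class CircularBuffer<E>
{
  private final E[] buffer; // backing array; once full, the oldest entry is overwritten
  private int start = 0;    // index of the oldest element
  private int size = 0;     // number of elements currently stored

  @SuppressWarnings("unchecked")
  public CircularBuffer(int capacity)
  {
    Preconditions.checkArgument(capacity > 0, "Capacity must be greater than 0.");
    buffer = (E[]) new Object[capacity];
  }

  public void add(E item)
  {
    if (size < buffer.length) {
      buffer[(start + size++) % buffer.length] = item;
    } else {
      buffer[start] = item;               // overwrite the oldest entry
      start = (start + 1) % buffer.length;
    }
  }

  // getLatest(0) is the most recently added element, getLatest(1) the one before it, etc.
  public E getLatest(int index)
  {
    Preconditions.checkArgument(index >= 0 && index < size, "invalid index");
    return buffer[(start + size - 1 - index) % buffer.length];
  }

  public int size()
  {
    return size;
  }
}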

private final Map<String, Object> metrics;

@Nullable
private final String errorMsg;
Contributor:

I really wanted this!


package io.druid.indexer;

public enum IngestionState
@jihoonson (Contributor), Mar 15, 2018:

Is this the same for all types of tasks? If so, I think it's better to expand TaskState to include these new states, because every task is an ingestion task and we wouldn't have to keep two states for them.

Contributor Author:

I decided to keep them separate, since I mean for IngestionState to be an additional qualifier on the existing states (RUNNING, FAILED, SUCCESS). For example, a task could be RUNNING and in DETERMINE_PARTITIONS, or RUNNING and in BUILD_SEGMENTS, or similarly with FAILED.

Contributor:

Sounds good.

@@ -138,13 +175,37 @@ public boolean equals(Object o)
if (!Objects.equals(duration, that.duration)) {
return false;
}
return location.equals(that.location);

if (!Objects.equals(location, that.location)) {
Contributor:

location can't be null.

return TaskStatus.failure(
getId(),
getTaskCompletionMetrics(),
e.getMessage(),
Contributor:

How about passing the full stack trace rather than a simple message? It would be much more helpful for understanding what's going on. If some applications need only error messages, they can parse the stack trace on their own.

@jon-wei (Contributor Author), Apr 3, 2018:

I changed this to the following:

  • The stack trace is passed in now, but TaskStatus will truncate the error message to 100 characters. This is to avoid large ZK/metadata objects.
  • The full error + stack trace is provided in the TaskReport that's uploaded when the task completes (a rough sketch of the truncation follows below)
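
A rough sketch of that approach, assuming a hypothetical helper (the 100-character limit and the use of the full stack trace are as described above; names are made up for illustration):

import com.google.common.base.Throwables;

public class TaskStatusErrorMsgSketch
{
  private static final int MAX_ERROR_MSG_LENGTH = 100;

  // Capture the full stack trace as the error message; the untruncated version
  // goes into the task report, while TaskStatus keeps only a truncated copy so
  // the object stored in ZK/metadata stays small.
  static String errorMsgFrom(Throwable t)
  {
    return Throwables.getStackTraceAsString(t);
  }

  static String truncateForTaskStatus(String errorMsg)
  {
    if (errorMsg != null && errorMsg.length() > MAX_ERROR_MSG_LENGTH) {
      return errorMsg.substring(0, MAX_ERROR_MSG_LENGTH) + "...";
    }
    return errorMsg;
  }
}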

} else {
fireDepartmentMetrics.incrementThrownAway();
}
catch (ParseException e) {
Contributor:

Looks like this catch clause isn't necessary and handleParseException() can be called directly like

if (addResult.getParseException() != null) {
  handleParseException(e, record);
}

Contributor Author:

Changed this and similar places to use handleParseException

Objects.equals(intermediateHandoffPeriod, that.intermediateHandoffPeriod) &&
Objects.equals(logParseExceptions, that.logParseExceptions) &&
Objects.equals(maxParseExceptions, that.maxParseExceptions) &&
Objects.equals(maxSavedParseExceptions, that.maxSavedParseExceptions);
Contributor:

New variables can't be null.

Contributor Author:

Changed these, though I don't think it matters much since these comparisons aren't in any hot path

@@ -102,6 +103,9 @@
@JsonIgnore
private IndexTask indexTaskSpec;

@JsonIgnore
private final AuthorizerMapper authorizerMapper;
Contributor:

Thanks for adding this.

}
}
catch (TimeoutException | ExecutionException e) {
throw Throwables.propagate(e);
}
}

@Nullable
public static List<String> getMessagesFromSavedParseExceptions(CircularBuffer<Throwable> savedParseExceptions)
Contributor:

The IndexTask class is already long enough. I suggest extracting this as a method of a separate util class.

Contributor Author:

Moved this to IndexTaskUtils
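
For illustration, the extracted helper could look roughly like this (a sketch that reuses the CircularBuffer's getLatest accessor; not the exact implementation):

import java.util.ArrayList;
import java.util.List;
import javax.annotation.Nullable;

public class IndexTaskUtilsSketch
{
  // Returns the messages of the saved parse exceptions, newest first,
  // or null if nothing has been saved yet.
  @Nullable
  public static List<String> getMessagesFromSavedParseExceptions(CircularBuffer<Throwable> savedParseExceptions)
  {
    if (savedParseExceptions == null) {
      return null;
    }
    List<String> messages = new ArrayList<>();
    for (int i = 0; i < savedParseExceptions.size(); i++) {
      messages.add(savedParseExceptions.getLatest(i).getMessage());
    }
    return messages;
  }
}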

@@ -451,7 +453,7 @@ public boolean isRollup()
);

// Note: This method needs to be thread safe.
protected abstract Integer addToFacts(
protected abstract Pair<Integer, List<String>> addToFacts(
Contributor:

I think Pair makes the code difficult to understand because I have to figure out what makes up the pair. It would be much better to add a new class.

Contributor Author:

Changed this to use a new class instead of Pair
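
For illustration, the kind of small result class that replaces the Pair could look like this (names are assumed and may differ from the actual code):

import java.util.List;
import javax.annotation.Nullable;

// Holds the row count produced by addToFacts together with any parse exception
// messages encountered while processing the row.
public class AddToFactsResult
{
  private final int rowCount;
  private final List<String> parseExceptionMessages;

  public AddToFactsResult(int rowCount, @Nullable List<String> parseExceptionMessages)
  {
    this.rowCount = rowCount;
    this.parseExceptionMessages = parseExceptionMessages;
  }

  public int getRowCount()
  {
    return rowCount;
  }

  @Nullable
  public List<String> getParseExceptionMessages()
  {
    return parseExceptionMessages;
  }
}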

}

@VisibleForTesting
TimeAndDims toTimeAndDims(InputRow row)
Pair<TimeAndDims, List<String>> toTimeAndDims(InputRow row)
Contributor:

Same here.

Contributor Author:

Changed this to use a new class instead of Pair

@@ -72,7 +72,7 @@ public MapInputRowParser(
}
}
catch (Exception e) {
throw new ParseException(e, "Unparseable timestamp found!");
throw new ParseException(e, "Unparseable timestamp found! Event: " + theMap);
Contributor:

nit: ParseException supports formatted string.

Contributor Author:

Used formatted string
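
For reference, the formatted-string form mentioned above would look roughly like this (a fragment; the exact message text may differ):

// before: string concatenation
throw new ParseException(e, "Unparseable timestamp found! Event: " + theMap);
// after: ParseException's formatted-string constructor
throw new ParseException(e, "Unparseable timestamp found! Event: %s", theMap);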



@Nullable
default Map<String, Object> getStats()
{
throw new UnsupportedOperationException("This Jobby does not implement getJobStats().");
Contributor:

Please add the class name to the exception message.

Contributor Author:

Added class name

@Nullable
default String getErrorMessage()
{
throw new UnsupportedOperationException("This Jobby does not implement getErrorMessage().");
Contributor:

Same here.

Contributor Author:

Added class name
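
A sketch of what the resolved Jobby defaults might look like with the class name included (illustrative; the javadoc wording and message format are assumptions):

import java.util.Map;
import javax.annotation.Nullable;

public interface Jobby
{
  boolean run();

  /**
   * @return job stats, or null if the job has no stats to report (assumed semantics).
   *         Implementations that do not support stats throw UnsupportedOperationException.
   */
  @Nullable
  default Map<String, Object> getStats()
  {
    throw new UnsupportedOperationException(
        String.format("getJobStats not implemented in [%s]", getClass().getName())
    );
  }

  /**
   * @return the error message of a failed job, or null if the job did not fail (assumed semantics).
   *         Implementations that do not support error reporting throw UnsupportedOperationException.
   */
  @Nullable
  default String getErrorMessage()
  {
    throw new UnsupportedOperationException(
        String.format("getErrorMessage not implemented in [%s]", getClass().getName())
    );
  }
}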

return jsonMapper.writeValueAsString(taskDiagsMap);
}
catch (IOException | InterruptedException ie) {
log.error("couldn't get failure cause for job.");
Contributor:

Please add the exception and job name to the log message.

Contributor Author:

Added exception and job name

/**
*/
public interface Jobby
{
boolean run();

@Nullable
Contributor:

Would you please add a javadoc describing when the return value can be null?

Contributor Author:

Added javadoc

throw new UnsupportedOperationException("This Jobby does not implement getJobStats().");
}

@Nullable
Contributor:

Same here for nullable.

Contributor Author:

Added javadoc

@@ -233,7 +290,10 @@ public Double deserialize(ByteArrayDataInput in)
typeHelper = STRING_HELPER;
}
writeString(dim, out);
typeHelper.serialize(out, row.getRaw(dim), reportParseExceptions);
String parseExceptionMessage = typeHelper.serialize(out, row.getRaw(dim), true);
Contributor:

Probably reportParseExceptions should be passed instead of true?

Contributor Author:

Removed reportParseExceptions here; the exception should always be thrown, and how it's handled depends on the new config

@@ -65,7 +67,8 @@
{
ValueType getType();

void serialize(ByteArrayDataOutput out, Object value, boolean reportParseExceptions);
@Nullable
String serialize(ByteArrayDataOutput out, Object value, boolean reportParseExceptions);
Contributor:

I'm not sure about returning an exception message. This method could probably always throw a ParseException on parsing errors, and reportParseExceptions could be handled by the caller.

Contributor Author:

Changed this to throw a ParseException instead of returning the message
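
A rough sketch of the agreed direction, using a long column as an example (simplified; class and method names are assumptions):

import io.druid.java.util.common.parsers.ParseException; // package path may differ by version

public class LongValueSerdeSketch
{
  // Coerce the raw value to a long; on failure, throw a ParseException so the
  // caller can count it against maxParseExceptions (and log it when
  // logParseExceptions is set) instead of passing a message string around.
  static long coerceToLong(Object value)
  {
    if (value instanceof Number) {
      return ((Number) value).longValue();
    }
    try {
      return Long.parseLong(String.valueOf(value));
    }
    catch (NumberFormatException e) {
      throw new ParseException(e, "could not convert value [%s] to long", value);
    }
  }
}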

@jihoonson (Contributor) left a comment

@jon-wei thanks. Looks good to me.

@jon-wei jon-wei merged commit 969342c into apache:master Apr 6, 2018
surekhasaharan pushed a commit to surekhasaharan/druid that referenced this pull request Apr 6, 2018
* Add more indexing task status and error reporting

* PR comments, add support in AppenderatorDriverRealtimeIndexTask

* Use TaskReport instead of metrics/context

* Fix tests

* Use TaskReport uploads

* Refactor fire department metrics retrieval

* Refactor input row serde in hadoop task

* Refactor hadoop task loader names

* Truncate error message in TaskStatus, add errorMsg to task report

* PR comments
gianm pushed a commit that referenced this pull request May 3, 2018
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks

Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing. If this value is not optimal for your JVM heap size, it could lead
to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might
be bad for query performance and a higher value will limit number of persists but require
more jvm heap space and could lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes
kept in memory before persisting.

 * The default value is 1/3(Runtime.maxMemory())
 * To maintain the current behaviour set 'maxBytesInMemory' to -1
 * If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
   will be respected i.e. the first one to go above threshold will trigger persist

* Fix check style and remove a comment

* Add overlord unsecured paths to coordinator when using combined service (#5579)

* Add overlord unsecured paths to coordinator when using combined service

* PR comment

* More error reporting and stats for ingestion tasks (#5418)

* Add more indexing task status and error reporting

* PR comments, add support in AppenderatorDriverRealtimeIndexTask

* Use TaskReport instead of metrics/context

* Fix tests

* Use TaskReport uploads

* Refactor fire department metrics retrieval

* Refactor input row serde in hadoop task

* Refactor hadoop task loader names

* Truncate error message in TaskStatus, add errorMsg to task report

* PR comments

* Allow getDomain to return disjointed intervals (#5570)

* Allow getDomain to return disjointed intervals

* Indentation issues

* Adding feature thetaSketchConstant to do some set operation in PostAgg (#5551)

* Adding feature thetaSketchConstant to do some set operation in PostAggregator

* Updated review comments for PR #5551 - Adding thetaSketchConstant

* Fixed CI build issue

* Updated review comments 2 for PR #5551 - Adding thetaSketchConstant

* Fix taskDuration docs for KafkaIndexingService (#5572)

* With incremental handoff the changed line is no longer true.

* Add doc for automatic pendingSegments (#5565)

* Add missing doc for automatic pendingSegments

* address comments

* Fix indexTask to respect forceExtendableShardSpecs (#5509)

* Fix indexTask to respect forceExtendableShardSpecs

* add comments

* Deprecate spark2 profile in pom.xml (#5581)

Deprecated due to #5382

* CompressionUtils: Add support for decompressing xz, bz2, zip. (#5586)

Also switch various firehoses to the new method.

Fixes #5585.

* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks

Currently a config called 'maxRowsInMemory' is present which affects how much memory gets
used for indexing. If this value is not optimal for your JVM heap size, it could lead
to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might
be bad for query performance and a higher value will limit number of persists but require
more jvm heap space and could lead to OOM.
'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes
kept in memory before persisting.

 * The default value is 1/3(Runtime.maxMemory())
 * To maintain the current behaviour set 'maxBytesInMemory' to -1
 * If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them
   will be respected i.e. the first one to go above threshold will trigger persist

* Address code review comments

* Fix the coding style according to druid conventions
* Add more javadocs
* Rename some variables/methods
* Other minor issues

* Address more code review comments

* Some refactoring to put defaults in IndexTaskUtils
* Added check for maxBytesInMemory in AppenderatorImpl
* Decrement bytes in abandonSegment
* Test unit test for multiple sinks in single appenderator
* Fix some merge conflicts after rebase

* Fix some style checks

* Merge conflicts

* Fix failing tests

Add back check for 0 maxBytesInMemory in OnHeapIncrementalIndex

* Address PR comments

* Put defaults for maxRows and maxBytes in TuningConfig
* Change/add javadocs
* Refactoring and renaming some variables/methods

* Fix TeamCity inspection warnings

* Added maxBytesInMemory config to HadoopTuningConfig

* Updated the docs and examples

* Added maxBytesInMemory config in docs
* Removed references to maxRowsInMemory under tuningConfig in examples

* Set maxBytesInMemory to 0 until used

Set the maxBytesInMemory to 0 if user does not set it as part of tuningConfing
and set to part of max jvm memory when ingestion task starts

* Update toString in KafkaSupervisorTuningConfig

* Use correct maxBytesInMemory value in AppenderatorImpl

* Update DEFAULT_MAX_BYTES_IN_MEMORY to 1/6 max jvm memory

Experimenting with various defaults, 1/3 jvm memory causes OOM

* Update docs to correct maxBytesInMemory default value

* Minor to rename and add comment

* Add more details in docs

* Address new PR comments

* Address PR comments

* Fix spelling typo
sathishsri88 pushed a commit to sathishs/druid that referenced this pull request May 8, 2018
surekhasaharan pushed a commit to implydata/druid-public that referenced this pull request Jul 3, 2018
More error reporting and stats for ingestion tasks (apache#5418)
These changes are required for the tasks api improvement commit
jon-wei added a commit to implydata/druid-public that referenced this pull request Aug 17, 2018
@dclim added this to the 0.13.0 milestone Oct 8, 2018
if (ingestionSchema.getTuningConfig().isReportParseExceptions()) {
throw e;
} else {
unparseable++;
Member:

This change makes the if (parseable > 0) below dead. See #7737.
