CompressionUtils: Add support for decompressing xz, bz2, zip. #5586

gianm · 2018-04-06T05:07:35Z

Also switch various firehoses to the new method. Fixes apache#5585.
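For readers unfamiliar with the approach, here is a minimal sketch of suffix-based decompression, not the code from this PR. The class name `SuffixDecompressSketch` and the `decompress` helper are hypothetical, and reading only the first zip entry is an assumption made for illustration. Only the JDK codecs (gzip, zip) are wired in; bz2 and xz are not in the JDK and would need a third-party library such as Apache Commons Compress, so the sketch rejects them explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class SuffixDecompressSketch
{
  // Hypothetical helper: choose a decompressor from the file name suffix.
  static InputStream decompress(final InputStream in, final String fileName) throws IOException
  {
    if (fileName.endsWith(".gz")) {
      return new GZIPInputStream(in);
    } else if (fileName.endsWith(".zip")) {
      // Assumption for this sketch: expose only the first entry of the archive.
      final ZipInputStream zipIn = new ZipInputStream(in);
      if (zipIn.getNextEntry() == null) {
        throw new IOException("Empty zip archive: " + fileName);
      }
      return zipIn;
    } else if (fileName.endsWith(".bz2") || fileName.endsWith(".xz")) {
      // Not available in the JDK; a real implementation would delegate to a codec library.
      throw new UnsupportedOperationException("No JDK codec for: " + fileName);
    } else {
      // Unknown suffix: treat the stream as uncompressed.
      return in;
    }
  }

  // Test data helpers: build gzip and zip payloads in memory.
  static byte[] gzip(final byte[] data) throws IOException
  {
    final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
      out.write(data);
    }
    return bytes.toByteArray();
  }

  static byte[] zip(final String entryName, final byte[] data) throws IOException
  {
    final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream out = new ZipOutputStream(bytes)) {
      out.putNextEntry(new ZipEntry(entryName));
      out.write(data);
      out.closeEntry();
    }
    return bytes.toByteArray();
  }

  // Drain a stream into a UTF-8 string.
  static String readAll(final InputStream in) throws IOException
  {
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    final byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException
  {
    final byte[] payload = "hello druid".getBytes(StandardCharsets.UTF_8);
    System.out.println(readAll(decompress(new ByteArrayInputStream(gzip(payload)), "data.json.gz")));
    System.out.println(readAll(decompress(new ByteArrayInputStream(zip("data.json", payload)), "data.zip")));
    System.out.println(readAll(decompress(new ByteArrayInputStream(payload), "data.json")));
  }
}
```

Dispatching on the file name keeps callers such as firehoses free of per-format logic: they pass the stream and the name, and get back a stream of plain bytes.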

fjy · 2018-04-06T05:32:09Z

👍

drcrallen · 2018-04-06T12:38:35Z

java-util/src/main/java/io/druid/java/util/common/CompressionUtils.java

@@ -48,7 +52,9 @@
 {
  private static final Logger log = new Logger(CompressionUtils.class);
  private static final int DEFAULT_RETRY_COUNT = 3;
+  private static final String BZ2_SUFFIX = ".bz2";


These constants wouldn't happen to be in the Apache libs anywhere, would they?

drcrallen · 2018-04-06T12:38:56Z

java-util/src/main/java/io/druid/java/util/common/CompressionUtils.java

@@ -313,7 +319,7 @@ public InputStream openStream() throws IOException
   *
   * @return A GZIPInputStream that can handle concatenated gzip streams in the input
   */
-  public static GZIPInputStream gzipInputStream(final InputStream in) throws IOException
+  private static GZIPInputStream gzipInputStream(final InputStream in) throws IOException


This can break extensions that depend on it. Can it be deprecated for a version or two instead?
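A minimal sketch of the deprecation route suggested here, assuming the public signature is kept and simply delegates to a private implementation. The class name `GzipCompatSketch` and the private `openGzip` helper are hypothetical, and a plain JDK `GZIPInputStream` stands in for whatever the real method returns.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipCompatSketch
{
  /**
   * Kept public so that extensions compiled against the old signature keep working.
   *
   * @deprecated retained for a few versions; new callers should use the newer entry point
   */
  @Deprecated
  public static GZIPInputStream gzipInputStream(final InputStream in) throws IOException
  {
    return openGzip(in);
  }

  // Hypothetical private implementation that internal code would call directly.
  private static GZIPInputStream openGzip(final InputStream in) throws IOException
  {
    return new GZIPInputStream(in);
  }

  // Compress a string in memory, then read it back through the deprecated public method.
  static String roundTrip(final String text) throws IOException
  {
    final ByteArrayOutputStream compressed = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(compressed)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    }
    final ByteArrayOutputStream plain = new ByteArrayOutputStream();
    try (InputStream in = gzipInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
      final byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        plain.write(buf, 0, n);
      }
    }
    return new String(plain.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException
  {
    System.out.println(roundTrip("still callable from extensions"));
  }
}
```

With this shape, existing callers get a compile-time deprecation warning instead of a build break, and the method can be removed in a later release.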

drcrallen

Requesting that the public method signature be reinstated for a few versions.

fjy · 2018-04-06T15:07:49Z

@drcrallen my browser didn't show your comments until after I clicked merge, so please feel free to revert to address the rest of the comments.

gianm · 2018-04-06T15:17:53Z

I'll do another patch for it.

* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks
  Currently a config called 'maxRowsInMemory' is present which affects how much memory gets used for indexing. If this value is not optimal for your JVM heap size, it could lead to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might be bad for query performance and a higher value will limit number of persists but require more jvm heap space and could lead to OOM. 'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes kept in memory before persisting.
* The default value is 1/3(Runtime.maxMemory())
* To maintain the current behaviour set 'maxBytesInMemory' to -1
* If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them will be respected i.e. the first one to go above threshold will trigger persist
* Fix check style and remove a comment
* Add overlord unsecured paths to coordinator when using combined service (#5579)
* Add overlord unsecured paths to coordinator when using combined service
* PR comment
* More error reporting and stats for ingestion tasks (#5418)
* Add more indexing task status and error reporting
* PR comments, add support in AppenderatorDriverRealtimeIndexTask
* Use TaskReport instead of metrics/context
* Fix tests
* Use TaskReport uploads
* Refactor fire department metrics retrieval
* Refactor input row serde in hadoop task
* Refactor hadoop task loader names
* Truncate error message in TaskStatus, add errorMsg to task report
* PR comments
* Allow getDomain to return disjointed intervals (#5570)
* Allow getDomain to return disjointed intervals
* Indentation issues
* Adding feature thetaSketchConstant to do some set operation in PostAgg (#5551)
* Adding feature thetaSketchConstant to do some set operation in PostAggregator
* Updated review comments for PR #5551 - Adding thetaSketchConstant
* Fixed CI build issue
* Updated review comments 2 for PR #5551 - Adding thetaSketchConstant
* Fix taskDuration docs for KafkaIndexingService (#5572)
* With incremental handoff the changed line is no longer true.
* Add doc for automatic pendingSegments (#5565)
* Add missing doc for automatic pendingSegments
* address comments
* Fix indexTask to respect forceExtendableShardSpecs (#5509)
* Fix indexTask to respect forceExtendableShardSpecs
* add comments
* Deprecate spark2 profile in pom.xml (#5581)
  Deprecated due to #5382
* CompressionUtils: Add support for decompressing xz, bz2, zip. (#5586)
  Also switch various firehoses to the new method. Fixes #5585.
* This commit introduces a new tuning config called 'maxBytesInMemory' for ingestion tasks
  Currently a config called 'maxRowsInMemory' is present which affects how much memory gets used for indexing. If this value is not optimal for your JVM heap size, it could lead to OutOfMemoryError sometimes. A lower value will lead to frequent persists which might be bad for query performance and a higher value will limit number of persists but require more jvm heap space and could lead to OOM. 'maxBytesInMemory' is an attempt to solve this problem. It limits the total number of bytes kept in memory before persisting.
* The default value is 1/3(Runtime.maxMemory())
* To maintain the current behaviour set 'maxBytesInMemory' to -1
* If both 'maxRowsInMemory' and 'maxBytesInMemory' are present, both of them will be respected i.e. the first one to go above threshold will trigger persist
* Address code review comments
* Fix the coding style according to druid conventions
* Add more javadocs
* Rename some variables/methods
* Other minor issues
* Address more code review comments
* Some refactoring to put defaults in IndexTaskUtils
* Added check for maxBytesInMemory in AppenderatorImpl
* Decrement bytes in abandonSegment
* Test unit test for multiple sinks in single appenderator
* Fix some merge conflicts after rebase
* Fix some style checks
* Merge conflicts
* Fix failing tests
  Add back check for 0 maxBytesInMemory in OnHeapIncrementalIndex
* Address PR comments
* Put defaults for maxRows and maxBytes in TuningConfig
* Change/add javadocs
* Refactoring and renaming some variables/methods
* Fix TeamCity inspection warnings
* Added maxBytesInMemory config to HadoopTuningConfig
* Updated the docs and examples
* Added maxBytesInMemory config in docs
* Removed references to maxRowsInMemory under tuningConfig in examples
* Set maxBytesInMemory to 0 until used
  Set the maxBytesInMemory to 0 if user does not set it as part of tuningConfing and set to part of max jvm memory when ingestion task starts
* Update toString in KafkaSupervisorTuningConfig
* Use correct maxBytesInMemory value in AppenderatorImpl
* Update DEFAULT_MAX_BYTES_IN_MEMORY to 1/6 max jvm memory
  Experimenting with various defaults, 1/3 jvm memory causes OOM
* Update docs to correct maxBytesInMemory default value
* Minor to rename and add comment
* Add more details in docs
* Address new PR comments
* Address PR comments
* Fix spelling typo
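The interaction between 'maxRowsInMemory' and 'maxBytesInMemory' described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Druid code: the class and method names are hypothetical, the `>=` comparison is a guess at what "go above threshold" means, and the handling of 0 and -1 follows the later bullets (0 meaning "not set, use 1/6 of max JVM memory", -1 meaning "no byte limit").

```java
class PersistThresholdSketch
{
  // Hypothetical helper: turn the configured value into an effective byte limit.
  static long resolveMaxBytesInMemory(final long configured)
  {
    if (configured == 0) {
      // Not set by the user: fall back to a fraction of the JVM's max memory (1/6 per the message).
      return Runtime.getRuntime().maxMemory() / 6;
    }
    if (configured < 0) {
      // -1 keeps the old behaviour: only the row count limits in-memory data.
      return Long.MAX_VALUE;
    }
    return configured;
  }

  // Persist as soon as either limit is reached, whichever comes first.
  static boolean shouldPersist(
      final int rowsInMemory,
      final long bytesInMemory,
      final int maxRowsInMemory,
      final long effectiveMaxBytes
  )
  {
    return rowsInMemory >= maxRowsInMemory || bytesInMemory >= effectiveMaxBytes;
  }

  public static void main(String[] args)
  {
    final long byteLimit = resolveMaxBytesInMemory(1_000L);
    System.out.println(shouldPersist(10, 100L, 100, byteLimit));   // neither limit reached
    System.out.println(shouldPersist(100, 100L, 100, byteLimit));  // row limit reached
    System.out.println(shouldPersist(10, 1_000L, 100, byteLimit)); // byte limit reached
  }
}
```

Resolving the byte limit once, up front, keeps the per-row check to two comparisons.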

CompressionUtils: Add support for decompressing xz, bz2, zip.

b4d7387

Also switch various firehoses to the new method. Fixes apache#5585.

drcrallen reviewed Apr 6, 2018

drcrallen requested changes Apr 6, 2018

fjy merged commit 5ab1766 into apache:master Apr 6, 2018

fjy added a commit that referenced this pull request Apr 6, 2018

Revert "CompressionUtils: Add support for decompressing xz, bz2, zip. (…

fc8f65e

…#5586)" This reverts commit 5ab1766.

gianm deleted the bz2 branch April 6, 2018 15:17

gianm mentioned this pull request Apr 6, 2018

CompressionUtils: Make gzipInputStream public once again. #5590

Merged

surekhasaharan pushed a commit to surekhasaharan/druid that referenced this pull request Apr 6, 2018

CompressionUtils: Add support for decompressing xz, bz2, zip. (apache…

99315da

…#5586) Also switch various firehoses to the new method. Fixes apache#5585.

gianm added a commit to implydata/druid-public that referenced this pull request Apr 9, 2018

CompressionUtils: Add support for decompressing xz, bz2, zip. (apache…

0305238

…#5586) Also switch various firehoses to the new method. Fixes apache#5585.

sathishsri88 pushed a commit to sathishs/druid that referenced this pull request May 8, 2018

CompressionUtils: Add support for decompressing xz, bz2, zip. (apache…

bae2190

…#5586) Also switch various firehoses to the new method. Fixes apache#5585.

dclim added this to the 0.13.0 milestone Oct 8, 2018

gianm mentioned this pull request Nov 14, 2018

Adds possibility to read '.gz' files when using local firehose #2394

Closed
