
[CARBONDATA-3005]Support Gzip as column compressor #2847

Closed · wants to merge 2 commits

Conversation

@shardul-cr7 (Contributor) commented Oct 23, 2018

This PR adds a new compressor, gzip, extending the compression options CarbonData offers. Users can now choose gzip as the compressor for data loading, configured either globally at the system-properties level or per table.
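For illustration, a sketch of both configuration scopes (the property key carbon.column.compressor mirrors the existing snappy/zstd support; the DDL form is a hedged sketch, not taken from this diff):

```java
import org.apache.carbondata.core.util.CarbonProperties;

// System-wide: subsequent loads use gzip unless a table overrides it.
CarbonProperties.getInstance().addProperty("carbon.column.compressor", "gzip");

// Per table (hypothetical DDL sketch):
//   CREATE TABLE t (...) STORED BY 'carbondata'
//   TBLPROPERTIES ('carbon.column.compressor' = 'gzip')
```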

Gzip produces smaller files than snappy but takes longer to compress.

Test data generated by tpch-dbgen (lineitem table).

Load Performance Comparison (Compression)

Test Case 1: file size 3.9 GB, ~30M records

| Codec Used | Load Time | File Size After Load |
|------------|-----------|----------------------|
| Snappy     | 387 s     | 1.1 GB               |
| Zstd       | 459 s     | 848 MB               |
| Gzip       | 464 s     | 830 MB               |

Test Case 2: file size 7.8 GB, ~60M records

| Codec Used | Load Time | File Size After Load |
|------------|-----------|----------------------|
| Snappy     | 771 s     | 2.2 GB               |
| Zstd       | 764 s     | 1.8 GB               |
| Gzip       | 986 s     | 1.7 GB               |

Query Performance (Decompression)

Test Case 1

| Codec Used | Full Scan Time |
|------------|----------------|
| Snappy     | 36.99 s        |
| Zstd       | 39.275 s       |
| Gzip       | 45.601 s       |

Test Case 2

| Codec Used | Full Scan Time |
|------------|----------------|
| Snappy     | 73.114 s       |
| Zstd       | 77.360 s       |
| Gzip       | 88.510 s       |

Be sure to complete the following checklist to help us incorporate your contribution quickly and easily:

  • Any interfaces changed?

  • Any backward compatibility impacted?

  • Document update required?

  • Testing done
    Added some test cases.

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@CarbonDataQA: Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9235/

@CarbonDataQA: Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/974/

@CarbonDataQA: Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1187/

}

@Override public long rawUncompress(byte[] input, byte[] output) throws IOException {
  // gzip api doesn't have rawUncompress yet.
Contributor

If it is so, just throw an exception; otherwise the JVM may crash if you pass an illegal address/length.

Contributor Author

Done.
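A minimal sketch of the fix being agreed on here (the exact exception type and message are assumptions):

```java
@Override public long rawUncompress(byte[] input, byte[] output) throws IOException {
  // The gzip API has no raw (off-heap) uncompress yet; fail fast rather
  // than risk a JVM crash on an illegal address/length.
  throw new UnsupportedOperationException("rawUncompress is not supported by gzip");
}
```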

e.printStackTrace();
}

return bt.toByteArray();
Contributor

Why is bt still open?

Contributor Author

ByteArrayOutputStream.close() does nothing; its implementation in Java is:

public void close() throws IOException {
}

I could close it, but then I would have to copy the stream into a new byte array and return that, which can be a costly operation.

try {
  gzos.write(data);
} catch (IOException e) {
  e.printStackTrace();
Contributor

Please optimize the logging!

Contributor Author

ok will do that!

gzos.close();
}
} catch (IOException e) {
e.printStackTrace();
Contributor

Please optimize the logging!

Contributor Author

Done.

}

} catch (IOException e) {
e.printStackTrace();
Contributor

Please optimize the logging!

Contributor Author

Done.

e.printStackTrace();
}

return bot.toByteArray();
Contributor

bot is not closed.

Contributor Author

Same reason as for ByteArrayOutputStream.close() mentioned above.

@CarbonDataQA: Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/994/

@CarbonDataQA: Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1207/

@CarbonDataQA: Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9260/

@CarbonDataQA: Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1633/

@CarbonDataQA: Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9893/

@CarbonDataQA: Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1844/


ByteArrayInputStream byteArrayOutputStream = new ByteArrayInputStream(data);
ByteArrayOutputStream byteOutputStream = new ByteArrayOutputStream();

Contributor

Remove the empty line.

Contributor Author

done!

}

/*
* Method called for compressing the data and
Contributor

Change the comment to a standard Javadoc.

Contributor Author

done!
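A sketch of the comment rewritten as standard Javadoc (the signature follows the compressData snippet quoted later in this review):

```java
/**
 * Compresses the given data using the gzip codec and
 * returns the compressed bytes.
 *
 * @param data uncompressed input bytes
 * @return compressed byte array
 */
private byte[] compressData(byte[] data) {
  // ...
}
```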

@CarbonDataQA: Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1634/

@CarbonDataQA: Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1845/

@CarbonDataQA: Build Failed with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9894/

@CarbonDataQA: Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1641/

@CarbonDataQA: Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9901/

@CarbonDataQA: Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1852/

@CarbonDataQA: Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1645/

@CarbonDataQA: Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9905/

@CarbonDataQA: Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1856/

@shardul-cr7 changed the title from "[WIP]Support Gzip as column compressor" to "[CARBONDATA-3005]Support Gzip as column compressor" on Dec 7, 2018
*/
public class GzipCompressor extends AbstractCompressor {

public GzipCompressor() {
Contributor

Why is this empty constructor required?

Contributor Author

Removed.

@@ -35,8 +35,8 @@
   private final Map<String, Compressor> allSupportedCompressors = new HashMap<>();

   public enum NativeSupportedCompressor {
-    SNAPPY("snappy", SnappyCompressor.class),
-    ZSTD("zstd", ZstdCompressor.class);
+    SNAPPY("snappy", SnappyCompressor.class), ZSTD("zstd", ZstdCompressor.class), GZIP("gzip",
Contributor

Move each compressor to a new line.

Contributor Author

Done.
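The requested layout, one constant per line (a sketch; GzipCompressor.class completes the entry truncated in the diff above):

```java
public enum NativeSupportedCompressor {
  SNAPPY("snappy", SnappyCompressor.class),
  ZSTD("zstd", ZstdCompressor.class),
  GZIP("gzip", GzipCompressor.class);
  // ...
}
```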

}
}

@Override public boolean supportUnsafe() {
Contributor

Please move this default implementation to AbstractCompressor and override it only in the SnappyCompressor class; remove the implementation from the other classes.

Contributor Author

Done.
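A sketch of the suggested refactor (the return values reflect snappy being the only codec here with an off-heap/unsafe native API; the exact class shapes are assumptions):

```java
public abstract class AbstractCompressor implements Compressor {
  // Default for codecs such as gzip and zstd that only operate on
  // on-heap byte arrays.
  @Override
  public boolean supportUnsafe() {
    return false;
  }
}

public class SnappyCompressor implements Compressor {
  // Snappy exposes native methods that accept raw addresses, so it
  // keeps its own override.
  @Override
  public boolean supportUnsafe() {
    return true;
  }
}
```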

@@ -168,6 +168,7 @@ class TestLoadDataWithCompression extends QueryTest with BeforeAndAfterEach with
   private val tableName = "load_test_with_compressor"
   private var executorService: ExecutorService = _
   private val csvDataDir = s"$integrationPath/spark2/target/csv_load_compression"
+  private val compressors = Array("snappy", "zstd", "gzip")
Contributor

Please don't remove any test case; add a new test case for Zstd.

Contributor Author

No test cases were removed; only the test case "test with snappy and offheap" was renamed to "test different compressors and offheap".

}

@Override public long maxCompressedLength(long inputSize) {
if (inputSize < Integer.MAX_VALUE) {
Contributor

Please add some comments for this piece of code.

Contributor Author

Done.
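A sketch of the kind of comment being requested (the else branch is an assumption about the part of the method not quoted in this diff):

```java
@Override public long maxCompressedLength(long inputSize) {
  // Unlike snappy, the gzip Java API offers no utility to bound the
  // compressed size, so the input size itself is used as the estimate;
  // anything at or above Integer.MAX_VALUE cannot fit in a single
  // byte[] and is rejected.
  if (inputSize < Integer.MAX_VALUE) {
    return inputSize;
  } else {
    throw new RuntimeException("compress input oversize for gzip");
  }
}
```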

@CarbonDataQA: Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1674/

*/
gzipCompressorOutputStream.write(data);
} catch (IOException e) {
throw new RuntimeException("Error during Compression step " + e.getMessage());
Contributor

Don't swallow the actual exception; add the original exception as the cause of the RuntimeException.

Contributor Author

OK, added the actual exception.
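The resulting shape (a sketch; the message text is as in the quoted line, with the IOException now attached as the cause so the full stack trace survives):

```java
try {
  gzipCompressorOutputStream.write(data);
} catch (IOException e) {
  // Keep e as the cause so callers can see where compression actually failed.
  throw new RuntimeException("Error during Compression step " + e.getMessage(), e);
}
```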

@CarbonDataQA: Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1887/

@CarbonDataQA: Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1678/

@CarbonDataQA: Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1890/

@CarbonDataQA: Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9938/

try {
GzipCompressorInputStream gzipCompressorInputStream =
new GzipCompressorInputStream(byteArrayOutputStream);
byte[] buffer = new byte[1024];
Contributor

Instead of a fixed 1024, can you check what block size (in bytes) gzip operates on and use that value?

Contributor Author

Yeah, I fixed it by using double the data length, which minimizes the number of reads. I have added that.
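A sketch of the revised decompression loop (GzipCompressorInputStream is the Apache Commons Compress class from the quoted snippet; the input-stream variable is renamed byteArrayInputStream here for clarity, and the 2x sizing is the heuristic described in the reply):

```java
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(data);
// Compressed data commonly expands by roughly 2x, so a buffer of twice
// the compressed length keeps the number of read() calls low.
ByteArrayOutputStream byteOutputStream = new ByteArrayOutputStream(data.length * 2);
try (GzipCompressorInputStream gzipCompressorInputStream =
         new GzipCompressorInputStream(byteArrayInputStream)) {
  byte[] buffer = new byte[data.length * 2];
  int len;
  while ((len = gzipCompressorInputStream.read(buffer)) != -1) {
    byteOutputStream.write(buffer, 0, len);
  }
} catch (IOException e) {
  throw new RuntimeException("Error during Decompression step " + e.getMessage(), e);
}
return byteOutputStream.toByteArray();
```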

* @return Compressed Byte Array
*/
private byte[] compressData(byte[] data) {
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
Contributor

ByteArrayOutputStream initializes its buffer at 32 bytes and copies the data to a new byte[] on every expansion. Can you use a better initial size to limit the number of copies during expansion? Snappy has a utility (maxCompressedLength) to calculate this; check whether any gzip library has a similar method. If not, we can use a value based on a test of the maximum achievable compression ratio.

Contributor Author

Based on my observations I have initialized the ByteArrayOutputStream to half the size of the input byte buffer, which reduces the number of times the stream has to be resized.
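A sketch of compressData after the change (GzipCompressorOutputStream is the Apache Commons Compress class used elsewhere in this diff; the try-with-resources shape is an assumption):

```java
private byte[] compressData(byte[] data) {
  // Start at half the input length: gzip normally shrinks the data, so
  // this avoids most grow-and-copy cycles compared with the default
  // 32-byte initial capacity.
  ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(data.length / 2);
  try (GzipCompressorOutputStream gzipCompressorOutputStream =
           new GzipCompressorOutputStream(byteArrayOutputStream)) {
    gzipCompressorOutputStream.write(data);
  } catch (IOException e) {
    throw new RuntimeException("Error during Compression step " + e.getMessage(), e);
  }
  return byteArrayOutputStream.toByteArray();
}
```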

test("test data loading with snappy compressor and offheap") {
test("test data loading with different compressors and offheap") {
for(comp <- compressors){
CarbonProperties.getInstance().addProperty(CarbonCommonConstants.ENABLE_OFFHEAP_SORT, "true")
Contributor

Should we have a UT for enable.unsafe.in.query.processing set to true and false?

Contributor Author

By default it's false for gzip/zstd, so a UT for this scenario is not required.

@CarbonDataQA: Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1686/

@CarbonDataQA: Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1687/

@CarbonDataQA: Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9947/

@CarbonDataQA: Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1898/

@KanakaKumar (Contributor)

LGTM

@kumarvishal09 (Contributor)

LGTM

@asfgit asfgit closed this in fd0885b on Dec 11, 2018

asfgit pushed a commit that referenced this pull request on Dec 17, 2018:

This PR is to add a new compressor "Gzip" and enhance the compressing capabilities offered by CarbonData.
User can now use gzip as the compressor for loading the data.
Gzip can be set at System Properties level or also for particular table.

This closes #2847

qiuchenjian pushed a commit with the same message to qiuchenjian/carbondata that referenced this pull request on Jun 14, 2019 (closes apache#2847).