AVRO-2162 Adds Zstandard compression to the Avro File Format (Java) #303
Conversation
scottcarey commented on Mar 22, 2018
- Adds TestCodecs to cover all file compression Codecs.
- Consolidates common code in Codecs into OutputStreamCodec and OutputInputStreamCodec abstractions.
- Fixes DataFileStream so that Codecs can return direct ByteBuffers or heap ByteBuffers with a non-zero offset.
While writing this and adding test coverage, I ended up making a few other clean-ups. Do we need to add anything in the spec about this? I also targeted the 1.8 branch because I assume that many people on older branches might be interested. I'm still stuck on a 1.7.x branch in (Hadoop) production myself, though that should change in a few months.
```xml
<dependency>
  <groupId>com.github.luben</groupId>
  <artifactId>zstd-jni</artifactId>
  <optional>true</optional>
```
I think that most of the compression dependencies should be `<optional>`. It's extra baggage that is not useful in any case where we aren't writing or reading files. It would be more consistent with the other codecs to remove this.
```java
@Override
public ByteBuffer decompress(ByteBuffer compressedData) throws IOException {
  ByteArrayInputStream bais = new ByteArrayInputStream(compressedData.array());
  BZip2CompressorInputStream inputStream = new BZip2CompressorInputStream(bais);
```
Most of the Codecs are internally based on InputStreams and OutputStreams. I refactored the commonalities out into two abstract classes (these would be better as 'mix-in' interfaces in Java 8+).
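As a rough sketch of what such a stream-based abstraction might look like (only the name OutputStreamCodec comes from the PR description; the method bodies here are my own guess, using the JDK's deflate streams as a stand-in codec rather than the actual Avro code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical shape of the shared abstraction: subclasses only say how to
// wrap a raw sink/source in a compressing/decompressing stream.
abstract class StreamCodecSketch {
    protected abstract OutputStream wrapOutput(OutputStream raw) throws IOException;
    protected abstract InputStream wrapInput(InputStream raw) throws IOException;

    ByteBuffer compress(ByteBuffer in) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (OutputStream os = wrapOutput(baos)) {
            // Honor position/limit/arrayOffset instead of assuming a full backing array.
            os.write(in.array(), in.arrayOffset() + in.position(), in.remaining());
        }
        return ByteBuffer.wrap(baos.toByteArray());
    }

    ByteBuffer decompress(ByteBuffer in) throws IOException {
        // Assumes a heap buffer for brevity; a direct buffer would need a copy first.
        ByteArrayInputStream bais = new ByteArrayInputStream(
                in.array(), in.arrayOffset() + in.position(), in.remaining());
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (InputStream is = wrapInput(bais)) {
            byte[] buf = new byte[4096];
            for (int n; (n = is.read(buf)) != -1; ) baos.write(buf, 0, n);
        }
        return ByteBuffer.wrap(baos.toByteArray());
    }
}

// Example subclass, standing in for the BZip2/XZ/Zstandard stream wrappers.
class DeflateCodecSketch extends StreamCodecSketch {
    protected OutputStream wrapOutput(OutputStream raw) { return new DeflaterOutputStream(raw); }
    protected InputStream wrapInput(InputStream raw) { return new InflaterInputStream(raw); }
}
```

With this shape, each concrete codec shrinks to the two one-line wrap methods.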
```java
/** Null codec, for no compression. */
public static CodecFactory nullCodec() {
  return NullCodec.OPTION;
  // we can not reference NullCodec.OPTION because the static
```
The unit test uncovered the fact that accessing this field here results in 'null' since we have a circular dependency in static initialization.
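This is the classic Java circular-static-initialization trap. A minimal standalone reproduction (my construction, not the actual NullCodec/CodecFactory classes):

```java
// Two classes whose static initializers reference each other. The initializer
// entered second reads the other class's field before it has been assigned,
// and observes null instead of the eventual value.
class Second {
    static final String TRIGGER = First.NAME;        // runs First's <clinit> mid-way through our own
    static final String LABEL = new String("ready"); // non-constant, so not inlined at compile time
}

class First {
    static final String NAME = Second.LABEL; // if Second is mid-initialization, this reads null
}
```

If Second is touched first, First's initializer runs while Second's is still in progress (the JVM permits same-thread re-entry), so both NAME and TRIGGER end up null even though LABEL is eventually assigned.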
```java
return ByteBuffer.wrap(data, offset, blockSize);
}

void setBytes(ByteBuffer block) {
```
This now supports Codecs that return direct ByteBuffers. Earlier versions of the ZstandardCodec used APIs that returned direct buffers, which exploded here.
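A sketch of what offset- and direct-buffer-safe extraction can look like (hypothetical helper, not the actual DataFileStream code):

```java
import java.nio.ByteBuffer;

class BufferBytes {
    // Extract the readable bytes of a ByteBuffer whether it is a direct
    // buffer (no backing array) or a heap buffer with a non-zero arrayOffset.
    static byte[] toArray(ByteBuffer buf) {
        if (buf.hasArray() && buf.arrayOffset() == 0 && buf.position() == 0
                && buf.limit() == buf.array().length) {
            return buf.array(); // zero-copy fast path: buffer spans the whole array
        }
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() so the caller's position is untouched
        return out;
    }
}
```

Calling `buf.array()` unconditionally is the bug: it throws UnsupportedOperationException on direct buffers, and silently reads the wrong bytes when arrayOffset is non-zero.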
```java
@Override public String getName() { return DataFileConstants.SNAPPY_CODEC; }

@Override
public ByteBuffer compress(ByteBuffer in) throws IOException {
```
Tests were failing only on Snappy when I made some overly strict assumptions about the returned buffer.
In the process of debugging I fixed at least one bug (not setting ByteOrder.LITTLE_ENDIAN, which let the file format depend on the CPU of the writer).
The code also was not properly accounting for arrayOffset in many cases.
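The byte-order half of this can be illustrated with a small sketch (my own example, not the Avro Snappy codec): any code that relies on ByteOrder.nativeOrder() writes bytes that vary with the writer's CPU, whereas pinning the order makes the on-disk bytes deterministic.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LengthPrefix {
    // Hypothetical illustration: encode a 4-byte integer field with an
    // explicitly pinned byte order, so the serialized bytes never depend
    // on the writer's CPU (as ByteOrder.nativeOrder() would).
    static byte[] encodeLength(int len) {
        ByteBuffer b = ByteBuffer.allocate(4);
        b.order(ByteOrder.LITTLE_ENDIAN); // pin explicitly; never rely on native order
        b.putInt(len);
        return b.array();
    }
}
```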
```java
return true;
if (getClass() != obj.getClass())
  return false;
XZCodec other = (XZCodec)obj;
```
The equals method for the XZ codec was wrong: according to the specification of Codec, two codecs should be equal if they are mutually decompressible. The compression level is used by the compressor but does not affect decompressibility.
I made the implementations of hashCode and equals consistent across the codecs where appropriate.
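A sketch of that equals/hashCode shape (hypothetical class following the "mutually decompressible" contract described above; the real XZCodec differs in its other members):

```java
// Equality reflects mutual decompressibility, so tuning parameters such as
// the compression level are deliberately excluded from equals/hashCode:
// data written at any level can be read back by any instance of the codec.
class XzLikeCodec {
    private final int compressionLevel; // affects compression only, not decompression

    XzLikeCodec(int level) { this.compressionLevel = level; }

    @Override public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        return true; // same codec class implies mutually decompressible
    }

    @Override public int hashCode() { return getClass().hashCode(); }
}
```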
```java
}

private final Codec codec;
private final byte[] zeroes = new byte[1024*1024];
```
All zeroes tends to compress massively, which can uncover bugs in buffer sizing when decompressing.
Pure random data tends to be larger compressed than uncompressed, and may find bugs in buffer sizing when compressing.
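A sketch of those two test inputs (the zeroes field appears in the diff above; the seeded random generator is my addition):

```java
import java.util.Random;

class TestData {
    // Two extremes for exercising codec buffer sizing: all-zero data compresses
    // massively (stresses decompression buffers), while random data is
    // incompressible and usually grows (stresses compression buffers).
    static byte[] zeroes(int n) {
        return new byte[n]; // Java arrays are zero-initialized
    }

    static byte[] random(int n) {
        byte[] data = new byte[n];
        new Random(42).nextBytes(data); // fixed seed keeps test failures reproducible
        return data;
    }
}
```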
Thinking about this a bit more... I think I'll make three pull requests: one to the 1.7.x branch, one to the 1.8.x branch, and one to master. The two for the older branches will be as minimal as possible, only adding the new compression type. The one for master will include refactoring to reduce code duplication.
Is there anything substantial still missing that keeps this from being merged? I think it would be great to have it for the 1.9.0 release. Backporting this looks like a lot of extra work, and can be done if someone really needs it; otherwise we should encourage moving upwards.
I believe a Zstandard codec was already merged: cf2f303. I think this is a duplicate.
Zstandard codec already merged. Duplicate.
@dkulp @nandorKollar Not quite a duplicate: the version that was merged did not include Zstandard's compression level. Zstandard's compression level runs from snappy- or lz4-ish (extremely fast but not high compression) to nearly xz levels of compression ratio (but 20x faster than xz at decompression). Not having the compression level configurable is a fairly big issue for me.
```java
private final int compressionLevel;

public ZstandardCodec(int compressionLevel) {
  this.compressionLevel = Math.max(Math.min(compressionLevel, 22), 1);
```
Note: Zstandard now has negative compression levels, for use cases that require even less CPU: https://github.com/facebook/zstd/releases/tag/v1.3.4
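For illustration, here is the clamp from the diff generalized to take the supported range as parameters instead of hard-coding 1..22 (my sketch; a real implementation would likely query the library's minimum and maximum level rather than hard-code any bounds, since hard-coding a floor of 1 silently discards the fast negative levels):

```java
class ZstdLevel {
    // Clamp a requested compression level into a supplied [min, max] range.
    // Same shape as Math.max(Math.min(level, 22), 1) in the diff above, but
    // with the bounds passed in so negative levels can be permitted.
    static int clamp(int requested, int min, int max) {
        return Math.max(Math.min(requested, max), min);
    }
}
```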