[BEAM-422] AvroSource: use a 64K buffer size for Snappy codec #583
dhalperi wants to merge 1 commit into apache:master
commons-compress defaults to a 32K buffer size for Snappy. However, Avro writes with xerial.snappy, which uses a 64K buffer size. When the buffer is too small, decoding Snappy data can throw an EOF exception instead of reading the data to the end. This fixes BEAM-422.
R: @lukecwik
   switch (codec) {
     case DataFileConstants.SNAPPY_CODEC:
-      return new SnappyCompressorInputStream(byteStream);
+      return new SnappyCompressorInputStream(byteStream, 1 << 16 /* Avro uses 64KB blocks */);
Instead of re-implementing the codec factory with a fixed number of codecs, why don't we just use the CodecFactory to create an instance of the codec which we can use to decode the bytes using Avro's code?
I assume this was done because Avro's Codec does not support the interface we need: https://avro.apache.org/docs/1.7.7/api/java/org/apache/avro/file/Codec.html
It seems easier to adapt to the interface they expose than to re-implement it.
It seems as though internally in Avro they just access the byte buffer array directly. It should be easy enough to wrap a byte[] into a ByteBuffer and unwrap it again to get a byte[], and/or to use a ByteBufferInputStream-like equivalent.
Also, it seems as though the Codec for Snappy includes a CRC32 checksum, which we are currently ignoring and could benefit from if we used the Codec directly (https://avro.apache.org/docs/1.7.7/spec.html#snappy).
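A rough sketch of that wrapping, using only java.nio and java.io. The class and method names here are hypothetical and not part of Avro or Beam; this only illustrates the byte[] / ByteBuffer round trip under the assumption that the buffers are small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

/** Hypothetical helper; the names are illustrative and not part of Avro or Beam. */
final class ByteBufferAdapters {
  private ByteBufferAdapters() {}

  /** Wraps a byte[] without copying; the buffer shares the array. */
  static ByteBuffer wrap(byte[] bytes) {
    return ByteBuffer.wrap(bytes);
  }

  /** Copies the remaining bytes into a new array, leaving the buffer's position untouched. */
  static byte[] unwrap(ByteBuffer buffer) {
    byte[] out = new byte[buffer.remaining()];
    // duplicate() shares the content but has its own position and limit.
    buffer.duplicate().get(out);
    return out;
  }

  /** Exposes the remaining bytes as an InputStream; no copy is made when the buffer is array-backed. */
  static InputStream asInputStream(ByteBuffer buffer) {
    if (!buffer.hasArray()) {
      // Direct or read-only buffers do not expose an array, so fall back to a copy.
      return new ByteArrayInputStream(unwrap(buffer));
    }
    return new ByteArrayInputStream(
        buffer.array(), buffer.arrayOffset() + buffer.position(), buffer.remaining());
  }

  public static void main(String[] args) throws IOException {
    ByteBuffer buffer = wrap(new byte[] {1, 2, 3, 4, 5});
    buffer.position(1);
    buffer.limit(4);
    System.out.println(unwrap(buffer).length);
    System.out.println(asInputStream(buffer).read());
  }
}
```

Honoring arrayOffset(), position(), and remaining() matters because a buffer handed back by a codec is not guaranteed to span its whole backing array.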
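For reference, the linked spec says each Snappy-compressed block is followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data in the block. A minimal sketch of checking that trailer ourselves, assuming the block has already been decompressed by other means; the class and method names are hypothetical and this is not how Avro's own Codec is implemented:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical helper for the CRC32 trailer of an Avro Snappy block; not Avro or Beam API. */
final class SnappyBlockChecksum {
  static final int CRC_LENGTH = 4;

  private SnappyBlockChecksum() {}

  /** Reads the 4-byte big-endian checksum stored at the end of the block. */
  static int storedChecksum(byte[] block) {
    if (block.length < CRC_LENGTH) {
      throw new IllegalArgumentException("Block is too short to contain a CRC32 trailer");
    }
    // ByteBuffer reads in big-endian order by default, matching the spec.
    return ByteBuffer.wrap(block, block.length - CRC_LENGTH, CRC_LENGTH).getInt();
  }

  /** Computes the CRC32 of the uncompressed bytes, truncated to 32 bits. */
  static int computeChecksum(byte[] uncompressed) {
    CRC32 crc = new CRC32();
    crc.update(uncompressed, 0, uncompressed.length);
    return (int) crc.getValue();
  }

  /** Returns true if the trailer of the block matches the checksum of the uncompressed data. */
  static boolean matches(byte[] block, byte[] uncompressed) {
    return storedChecksum(block) == computeChecksum(uncompressed);
  }

  public static void main(String[] args) {
    byte[] uncompressed = {1, 2, 3};
    // Stand-in block: a fake compressed payload followed by the real checksum.
    byte[] block = ByteBuffer.allocate(2 + CRC_LENGTH)
        .put(new byte[] {9, 9})
        .putInt(computeChecksum(uncompressed))
        .array();
    System.out.println(matches(block, uncompressed));
  }
}
```

The compressed payload is everything in the block before the last 4 bytes, so a reader would decompress that prefix and then call matches on the result.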
That's all fair. Looking deeper, it looks like the change is fairly well-scoped and also allows a cleaner set of Maven dependencies. (As I'm sure you already realized).
So looking even deeper, Avro does not provide any way to get access to Codec instances even through CodecFactory.
Yes, looks like you're correct. Avro doesn't want to expose the codecs in any way.
LGTM