Skip to content

Comments

[BEAM-422] AvroSource: use a 64K buffer size for Snappy codec#583

Closed
dhalperi wants to merge 1 commit intoapache:masterfrom
dhalperi:avro-source-fix
Closed

[BEAM-422] AvroSource: use a 64K buffer size for Snappy codec#583
dhalperi wants to merge 1 commit intoapache:masterfrom
dhalperi:avro-source-fix

Conversation

@dhalperi
Copy link
Contributor

@dhalperi dhalperi commented Jul 2, 2016

commons-compress defaults to a 32K buffer size for Snappy.

However, Avro uses xerial.snappy to write, which has a 64K buffer size.
When the buffer size is too small, decoding data from Snappy can cause
an EOF exception rather than finishing data.

This fixes BEAM-422.

commons-compress defaults to a 32K buffer size for Snappy.

However, Avro uses xerial.snappy to write, which has a 64K buffer size.
When the buffer size is too small, decoding data from Snappy can cause
an EOF exception rather than finishing data.

This fixes BEAM-422.
@lukecwik
Copy link
Member

lukecwik commented Jul 6, 2016

R: @lukecwik

switch (codec) {
case DataFileConstants.SNAPPY_CODEC:
return new SnappyCompressorInputStream(byteStream);
return new SnappyCompressorInputStream(byteStream, 1 << 16 /* Avro uses 64KB blocks */);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of re-implementing the codec factory with a fixed number of codecs, why don't we just use the CodecFactory to create an instance of the codec which we can use to decode the bytes using Avro's code?

Copy link
Contributor Author

@dhalperi dhalperi Jul 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was done because Avro's Codec does not support the interface we need: https://avro.apache.org/docs/1.7.7/api/java/org/apache/avro/file/Codec.html

Copy link
Member

@lukecwik lukecwik Jul 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems easier to adapt to the interface they expose then to re-implement.

It seems as though internally in Avro they just access the byte buffer array directly. Should be easy enough to wrap byte[] into a ByteBuffer and unwrap to create a byte[] and or use a ByteBufferInputStream like equivalent.

Also, it seems as though the Codec for Snappy includes a CRC32 checksum which we are ignoring that we could benefit from if we used the Codec directly (https://avro.apache.org/docs/1.7.7/spec.html#snappy)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's all fair. Looking deeper, it looks like the change is fairly well-scoped and also allows a cleaner set of Maven dependencies. (As I'm sure you already realized).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking even deeper, Avro does not provide any way to get access to Codec instances even through CodecFactory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, looks like your correct. Avro doesn't want to expose the codecs in anyway.

@dhalperi dhalperi closed this Jul 6, 2016
@dhalperi dhalperi reopened this Jul 6, 2016
@lukecwik
Copy link
Member

lukecwik commented Jul 6, 2016

LGTM

@asfgit asfgit closed this in a7e8151 Jul 6, 2016
@dhalperi dhalperi deleted the avro-source-fix branch July 7, 2016 03:09
Amar3tto added a commit to akvelon/beam that referenced this pull request Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants