[BEAM-422] AvroSource: use a 64K buffer size for Snappy codec by dhalperi · Pull Request #583 · apache/beam

dhalperi · 2016-07-02T09:22:58Z

commons-compress defaults to a 32K buffer size for Snappy.

However, Avro uses xerial.snappy to write, which has a 64K buffer size.
When the buffer size is too small, decoding Snappy data can throw
an EOF exception instead of reading the remaining data.

This fixes BEAM-422.

lukecwik · 2016-07-06T15:06:42Z

R: @lukecwik

lukecwik · 2016-07-06T15:24:31Z

sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroSource.java

      switch (codec) {
        case DataFileConstants.SNAPPY_CODEC:
-          return new SnappyCompressorInputStream(byteStream);
+          return new SnappyCompressorInputStream(byteStream, 1 << 16 /* Avro uses 64KB blocks */);


Instead of re-implementing the codec factory with a fixed number of codecs, why don't we just use the CodecFactory to create an instance of the codec which we can use to decode the bytes using Avro's code?

I assume this was done because Avro's Codec does not support the interface we need: https://avro.apache.org/docs/1.7.7/api/java/org/apache/avro/file/Codec.html

It seems easier to adapt to the interface they expose than to re-implement.

It seems as though internally in Avro they just access the byte buffer array directly. It should be easy enough to wrap a byte[] into a ByteBuffer and unwrap it back into a byte[], and/or use a ByteBufferInputStream-like equivalent.
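A minimal sketch of the wrap/unwrap adaptation described above, using only `java.nio`. The class name `ByteBufferAdapter` and its methods are hypothetical helpers for illustration; they are not part of Beam or Avro, and this is not the code that was merged.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration only; not part of Beam or Avro. */
final class ByteBufferAdapter {
  private ByteBufferAdapter() {}

  /** Wraps a range of a byte[] in a ByteBuffer without copying the bytes. */
  static ByteBuffer wrap(byte[] data, int offset, int length) {
    return ByteBuffer.wrap(data, offset, length);
  }

  /**
   * Copies the readable bytes (position to limit) into a fresh byte[].
   * Reading through duplicate() leaves the caller's position untouched and
   * also works for buffers that have no accessible backing array.
   */
  static byte[] unwrap(ByteBuffer buffer) {
    byte[] out = new byte[buffer.remaining()];
    buffer.duplicate().get(out);
    return out;
  }
}
```

Copying in `unwrap` avoids depending on `array()` and `arrayOffset()`, which only work for array-backed buffers. Code that knows the buffer is array-backed could skip the copy.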

Also, it seems as though the Codec for Snappy includes a CRC32 checksum, which we are currently ignoring and could benefit from if we used the Codec directly (https://avro.apache.org/docs/1.7.7/spec.html#snappy).
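To make the checksum point concrete: the linked spec says each Snappy-compressed block is followed by the 4-byte, big-endian CRC32 checksum of the uncompressed data. The sketch below shows how that trailer could be checked with `java.util.zip.CRC32`. The class `SnappyBlockChecksum` and its method names are hypothetical, and the Snappy decompression step is assumed to have happened elsewhere.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration only; not part of Beam or Avro. */
final class SnappyBlockChecksum {
  private SnappyBlockChecksum() {}

  /** Computes the CRC32 of the uncompressed bytes as a 32-bit value. */
  static int crcOf(byte[] uncompressed) {
    CRC32 crc = new CRC32();
    crc.update(uncompressed, 0, uncompressed.length);
    return (int) crc.getValue();
  }

  /**
   * Compares the last 4 bytes of the stored block (big-endian CRC32)
   * against the checksum of the already-decompressed data.
   */
  static boolean matches(byte[] block, byte[] uncompressed) {
    if (block.length < 4) {
      throw new IllegalArgumentException("block is too short to hold a CRC32 trailer");
    }
    // ByteBuffer reads in big-endian order by default.
    int stored = ByteBuffer.wrap(block, block.length - 4, 4).getInt();
    return stored == crcOf(uncompressed);
  }
}
```

A reader that only decompresses the Snappy bytes, as the stream-based approach in this diff does, would need a check like this to detect corrupted blocks.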

That's all fair. Looking deeper, it looks like the change is fairly well-scoped and also allows a cleaner set of Maven dependencies. (As I'm sure you already realized).

So looking even deeper, Avro does not provide any way to get access to Codec instances even through CodecFactory.

Yes, looks like you're correct. Avro doesn't want to expose the codecs in any way.

lukecwik · 2016-07-06T17:56:35Z

LGTM

dhalperi mentioned this pull request Jul 2, 2016

AvroSource: use a 64K buffer size for Snappy codec GoogleCloudPlatform/DataflowJavaSDK#327

Merged

dhalperi force-pushed the avro-source-fix branch from 7bbec51 to deedad0 Compare July 2, 2016 18:40

dhalperi force-pushed the avro-source-fix branch from deedad0 to 243de5d Compare July 2, 2016 18:48

lukecwik reviewed Jul 6, 2016

dhalperi closed this Jul 6, 2016

dhalperi reopened this Jul 6, 2016

asfgit closed this in a7e8151 Jul 6, 2016

dhalperi deleted the avro-source-fix branch July 7, 2016 03:09

Amar3tto added a commit to akvelon/beam that referenced this pull request Apr 14, 2025

Merge pull request apache#583 from akvelon/local-march-25

3b63649

dhalperi wants to merge 1 commit into apache:master from dhalperi:avro-source-fix
