
[FLINK-5944] Support reading of Snappy files #4683

Open

wants to merge 2 commits into base: master
Conversation

mlipkovich
Contributor

What is the purpose of the change

Support reading of Snappy-compressed text files (both Xerial and Hadoop Snappy)

Brief change log

  • Added InputStreamFactories for Xerial and Hadoop Snappy
  • Added a config parameter to control whether Xerial or Hadoop Snappy should be used
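The codec switch described above could, in principle, look like the sketch below. The config key `io.snappy.codec` and the class names are illustrative stand-ins, not Flink's actual configuration API; the point is only that the Hadoop framing is chosen unless the Xerial variant is explicitly requested (the default agreed on later in this discussion).

```java
import java.util.Collections;
import java.util.Map;

// Illustrative sketch of a codec-selection switch. The key name
// "io.snappy.codec" and these classes are hypothetical -- this is
// not Flink's actual configuration API.
public class SnappyCodecChoice {

    enum SnappyVariant { XERIAL, HADOOP }

    // Pick the Snappy variant from a config map, defaulting to the
    // Hadoop block framing.
    static SnappyVariant forConfig(Map<String, String> conf) {
        String codec = conf.getOrDefault("io.snappy.codec", "hadoop");
        return "xerial".equals(codec) ? SnappyVariant.XERIAL : SnappyVariant.HADOOP;
    }

    public static void main(String[] args) {
        System.out.println(forConfig(Collections.emptyMap()));
        System.out.println(forConfig(Map.of("io.snappy.codec", "xerial")));
    }
}
```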

Verifying this change

  • Manually verified the change by running word count on text files compressed with each Snappy format

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-shaded-hadoop2</artifactId>
<version>${project.version}</version>
</dependency>

Including hadoop as a dependency in flink-core can be problematic for a number of downstream projects.

I wonder what the exact difference between the Hadoop and vanilla Snappy codecs is. Is it just that the Hadoop codec adds extra framing on top of Snappy?

Contributor Author

What do you think about making this dependency compile-time only (in the 'provided' scope)?
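For concreteness, a 'provided'-scoped declaration would look like this (coordinates taken from the snippet above; whether this is the right scope is exactly the question under discussion):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-shaded-hadoop2</artifactId>
  <version>${project.version}</version>
  <!-- available at compile time, not bundled into the artifact -->
  <scope>provided</scope>
</dependency>
```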

Regarding the difference between the codecs: as I understand it, Snappy-compressed files are not splittable on their own. Hadoop therefore splits raw files into blocks and compresses each block separately using regular Snappy. If you download a whole Hadoop-Snappy-compressed file, regular Snappy will not be able to decompress it, since it is not aware of the block boundaries.
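The block layout described here can be sketched as follows. The 4-byte big-endian length framing matches the Hadoop block-compression stream format as I understand it (an uncompressed block length followed by one or more length-prefixed compressed chunks); the chunk "decompressor" below is an identity stand-in so the sketch runs without the actual Snappy library, and the class name is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative parser for the Hadoop-style block framing:
// [4-byte BE uncompressed block length]
//   [4-byte BE compressed chunk length][chunk bytes] ... (per block)
// A real reader would pass each chunk to the raw Snappy decompressor;
// here an identity "decompressor" stands in.
public class HadoopBlockFraming {

    // Stand-in for raw Snappy decompression of one chunk.
    static byte[] decompressChunk(byte[] compressed) {
        return compressed; // identity: chunks are stored uncompressed here
    }

    static byte[] readBlocks(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (true) {
            int uncompressedLen;
            try {
                uncompressedLen = din.readInt(); // 4-byte BE block length
            } catch (EOFException eof) {
                break; // clean end of stream: no more blocks
            }
            int produced = 0;
            while (produced < uncompressedLen) {
                int chunkLen = din.readInt(); // 4-byte BE chunk length
                byte[] chunk = new byte[chunkLen];
                din.readFully(chunk);
                byte[] plain = decompressChunk(chunk);
                out.write(plain);
                produced += plain.length;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a synthetic two-block stream and read it back.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream dout = new DataOutputStream(buf);
        for (String block : new String[] {"hello ", "world"}) {
            byte[] data = block.getBytes(StandardCharsets.UTF_8);
            dout.writeInt(data.length); // uncompressed block length
            dout.writeInt(data.length); // single chunk, "compressed" length
            dout.write(data);
        }
        byte[] result = readBlocks(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

This also illustrates why a plain (Xerial) Snappy decoder fails on such a file: without interpreting the length headers, the byte stream is not one contiguous Snappy payload.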


Internally we have several users who try to use Flink to read files generated by Hadoop (e.g. lz4 / gz / snappy). I think supporting the Hadoop format is quite important.

I'm not sure supporting the Xerial snappy format is a good idea. The two file formats are actually incompatible -- it would be quite confusing for users to find out that they can't access the files using Spark / MR / Hive due to a missed configuration.

I suggest we should at least make the Hadoop file format the default -- or just get rid of the Xerial version of the file format.

Putting the dependency in provided scope sounds fine to me -- if we need even tighter control over the dependency, we can start thinking about having a separate module for it.

What do you think?

Contributor Author

Yes, it is a good point to make Hadoop Snappy the default codec. I think we could still support Xerial Snappy since it comes for free. I will make these changes once we agree on the dependencies.

Regarding the separate module: what would be the content of this module?
As I understand it, a user who wants to read HDFS files will need the flink-java module anyway, since it contains Hadoop wrappers like HadoopInputSplit and so on. Do you think it makes sense to put this Hadoop codec there?

Contributor

As of recently, everything except the Hadoop compatibility package is free of Hadoop dependencies. It should stay like this so that we can provide a Flink distribution without Hadoop dependencies, because these were causing problems for some users.

Contributor Author

Thanks for your comment, Aljoscha.
So there are at least three ways to achieve this: mark the dependency as 'provided', move the Hadoop Snappy codec classes to the flink-java module, or move them to a separate module as @haohui suggested -- though I'm not sure what else would go into that module.

@mlipkovich
Contributor Author

@aljoscha @haohui,
thank you for your comments. I have marked the Hadoop dependency as provided and set the Hadoop codec as the default.
