
[FLINK-1396][FLINK-1303] Hadoop Input/Output directly in API #363

Merged
1 commit merged into apache:master on Feb 9, 2015

Conversation

@aljoscha (Contributor) commented Feb 4, 2015

This adds methods on ExecutionEnvironment for reading with Hadoop
Input/OutputFormat.

This also adds support in the Scala API for Hadoop Input/OutputFormats.

I also added tests and updated the documentation.
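
For illustration, a minimal sketch of how the new input methods can be used. The method name readHadoopFile and its signature follow the documentation this PR updates, but treat this as a sketch rather than the PR's verbatim API:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;

public class HadoopInputSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read a text file through Hadoop's mapred TextInputFormat:
        // the key is the byte offset of each line, the value is the line itself.
        DataSet<Tuple2<LongWritable, Text>> input = env.readHadoopFile(
                new TextInputFormat(), LongWritable.class, Text.class,
                "hdfs:///path/to/input");

        input.print(); // in the Flink API of this era, print() registers a sink
        env.execute("Hadoop input sketch");
    }
}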


A Contributor commented on this part of the documentation diff:

> Add the following dependency to your `pom.xml` to use the Hadoop Compatibility Layer.
> Support for Hadoop Mappers and Reducers is contained in the `flink-addons`

flink-staging is the new flink-addons ;-)
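
For reference, the dependency the documentation goes on to show is presumably the Hadoop Compatibility artifact. A sketch of the pom.xml entry, assuming the artifact id flink-hadoop-compatibility, with the version left as a placeholder:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-compatibility</artifactId>
  <!-- placeholder: the Flink version your project builds against -->
  <version>${flink.version}</version>
</dependency>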

@fhueske (Contributor) commented Feb 4, 2015

Looks good.
Besides fixing the typos and addressing the inline comments, you could also move the HadoopInputFormatTest and the HadoopIOFormatsITCase to flink-java and flink-tests.

We could integrate Hadoop InputFormats even more tightly if we overloaded the "regular" Flink input methods ExecutionEnvironment.readFile() and ExecutionEnvironment.createInput() instead of adding "special" Hadoop-specific methods. What do you think?
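
To make the two call styles concrete, here is a sketch reusing the environment and imports from the example above. The overloaded variant is hypothetical (fhueske's suggestion, not part of this PR) and is therefore commented out:

// Explicit style (what this PR implements): the method name signals Hadoop.
DataSet<Tuple2<LongWritable, Text>> explicit = env.readHadoopFile(
        new TextInputFormat(), LongWritable.class, Text.class, "hdfs:///path/to/input");

// Overloaded style (hypothetical): the generic entry points would accept
// Hadoop formats directly, with no "Hadoop" in the method name.
// DataSet<Tuple2<LongWritable, Text>> overloaded = env.readFile(
//         new TextInputFormat(), LongWritable.class, Text.class, "hdfs:///path/to/input");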

@StephanEwen (Contributor) commented:

Looks good.

I have one suggestion concerning the Hadoop dependencies: the flink-java project depends on the Hadoop API for the Writable interface, the NullValue, and the InputFormat classes.

We should be able to exclude all transitive dependencies from the Hadoop dependency in flink-java, making the project more lightweight, so that someone who writes only against the Flink API does not pull in the long tail of transitive Hadoop dependencies.

Whenever we actually execute Hadoop code, the Flink runtime is involved, and it has the necessary dependencies.

To exclude all transitive dependencies, use

<exclusions>
  <exclusion>
    <!-- wildcards exclude every transitive dependency of this artifact -->
    <groupId>*</groupId>
    <artifactId>*</artifactId>
  </exclusion>
</exclusions>
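
For context, a sketch of where these exclusions would sit in flink-java's pom.xml. The Hadoop artifact id below is a placeholder, not read from the actual pom (wildcard exclusions also require a reasonably recent Maven, 3.2.1 or later if I remember correctly):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <!-- placeholder artifact id: whichever Hadoop artifact flink-java actually depends on -->
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <exclusion>
      <groupId>*</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>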

@aljoscha (Contributor, Author) commented Feb 5, 2015

I addressed the comments. What do the others think about overloading readFile()? I made it like this on purpose, so that users see in the API that they are using Hadoop input formats, and so that the API itself advertises that they can be used.

@fhueske (Contributor) commented Feb 5, 2015

Hmm, yes, that's also a valid point.
But on the other hand, new users might not even be aware of the different types of InputFormats. It would all look "natural" ;-)

I am more leaning towards overloading, but would be fine with having separate functions as well.

@rmetzger (Contributor) commented Feb 5, 2015

I vote for keeping @aljoscha's original approach.
Users might not notice the different interfaces there, so the "Hadoop" in the method name makes it more explicit.
Also, overloading could lead to confusion because Flink's and Hadoop's InputFormats have very similar names (they actually differ only in their package names).
Lastly, changing it now would mean additional work for Aljoscha to update the code and the documentation.

@aljoscha (Contributor, Author) commented Feb 5, 2015

@StephanEwen If I add the exclusions, then users who add only flink-java as a dependency will get weird errors when using Hadoop InputFormats.

@StephanEwen (Contributor) commented:

Does this occur during local execution or during collection execution? Aren't the dependencies covered by the runtime dependencies?

@aljoscha (Contributor, Author) commented Feb 5, 2015

I think that when executing in an IDE, the dependencies are not there, since flink-java does not depend on flink-runtime, which has the Hadoop dependencies.
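
A hedged sketch of the workaround this implies: a user project that runs Hadoop formats from the IDE would have to declare the Hadoop dependency itself (artifact id and version are placeholders here):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <!-- placeholder: the Hadoop artifact providing the classes your formats need -->
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
</dependency>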

@aljoscha force-pushed the hadoop-in-api branch 3 times, most recently from d6bf958 to 0f632e3 on February 6, 2015 at 16:45
@StephanEwen (Contributor) commented:

Looks good. We are getting into very long package names here ;-)
`org.apache.flink.api.java.hadoop.mapred.wrapper.*`

@asfgit merged commit 8b3805b into apache:master on Feb 9, 2015
@aljoscha deleted the hadoop-in-api branch on February 9, 2015 at 14:39