
[FLINK-1396][FLINK-1303] Hadoop Input/Output directly in API #363

Merged
1 commit merged into apache:master on Feb 9, 2015

Conversation

@aljoscha (Contributor) commented Feb 4, 2015

This adds methods on ExecutionEnvironment for reading with Hadoop
Input/OutputFormat.

This also adds support in the Scala API for Hadoop Input/OutputFormats.

I also added tests and updated the documentation.
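
For illustration, a minimal sketch of how the new input methods can be used. The method name readHadoopFile and its signature follow the documentation this PR updates, but treat this as a sketch rather than the PR's verbatim API:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;

public class HadoopInputSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read a text file through Hadoop's mapred TextInputFormat:
        // the key is the byte offset of each line, the value is the line itself.
        DataSet<Tuple2<LongWritable, Text>> input = env.readHadoopFile(
                new TextInputFormat(), LongWritable.class, Text.class,
                "hdfs:///path/to/input");

        input.print(); // in the Flink API of this era, print() registers a sink
        env.execute("Hadoop input sketch");
    }
}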


A Contributor commented on this part of the documentation diff:

> Add the following dependency to your `pom.xml` to use the Hadoop Compatibility Layer.
> Support for Hadoop Mappers and Reducers is contained in the `flink-addons`

flink-staging is the new flink-addons ;-)
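
For reference, the dependency the documentation goes on to show is presumably the Hadoop Compatibility artifact. A sketch of the pom.xml entry, assuming the artifact id flink-hadoop-compatibility, with the version left as a placeholder:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-compatibility</artifactId>
  <!-- placeholder: the Flink version your project builds against -->
  <version>${flink.version}</version>
</dependency>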

@fhueske (Contributor) commented Feb 4, 2015

Looks good.
Besides fixing the typos and addressing the inline comments, you could also move the HadoopInputFormatTest and the HadoopIOFormatsITCase to flink-java and flink-tests.

We could integrate Hadoop InputFormats even more tightly if we overloaded the "regular" Flink input methods ExecutionEnvironment.readFile() and ExecutionEnvironment.createInput() instead of adding "special" Hadoop-specific methods. What do you think?
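
To make the two call styles concrete, here is a sketch reusing the environment and imports from the example above. The overloaded variant is hypothetical (fhueske's suggestion, not part of this PR) and is therefore commented out:

// Explicit style (what this PR implements): the method name signals Hadoop.
DataSet<Tuple2<LongWritable, Text>> explicit = env.readHadoopFile(
        new TextInputFormat(), LongWritable.class, Text.class, "hdfs:///path/to/input");

// Overloaded style (hypothetical): the generic entry points would accept
// Hadoop formats directly, with no "Hadoop" in the method name.
// DataSet<Tuple2<LongWritable, Text>> overloaded = env.readFile(
//         new TextInputFormat(), LongWritable.class, Text.class, "hdfs:///path/to/input");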

@StephanEwen (Contributor) commented:

Looks good.

I have one suggestion concerning the Hadoop dependencies: the flink-java project depends on the Hadoop API for the Writable interface, the NullValue, and the InputFormat classes.

We should be able to exclude all transitive dependencies from the Hadoop dependency in flink-java, making the project more lightweight, so that someone who writes only against the Flink API does not pull in the long tail of transitive Hadoop dependencies.

Whenever we actually execute Hadoop code, the Flink runtime is involved, and it has the necessary dependencies.

To exclude all transitive dependencies, use

<exclusions>
  <exclusion>
    <!-- wildcards exclude every transitive dependency of this artifact -->
    <groupId>*</groupId>
    <artifactId>*</artifactId>
  </exclusion>
</exclusions>
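
For context, a sketch of where these exclusions would sit in flink-java's pom.xml. The Hadoop artifact id below is a placeholder, not read from the actual pom (wildcard exclusions also require a reasonably recent Maven, 3.2.1 or later if I remember correctly):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <!-- placeholder artifact id: whichever Hadoop artifact flink-java actually depends on -->
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
  <exclusions>
    <exclusion>
      <groupId>*</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>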

@aljoscha (Contributor, Author) commented Feb 5, 2015

I addressed the comments. What do the others think about overloading readFile()? I made it like this on purpose, so that users see in the API that they are using Hadoop input formats, and so that the API itself advertises that they can be used.

@fhueske (Contributor) commented Feb 5, 2015

Hmm, yes, that's also a valid point.
But on the other hand, new users might not even be aware of the different types of InputFormats. It would all look "natural" ;-)

I am more leaning towards overloading, but would be fine with having separate functions as well.

@rmetzger (Contributor) commented Feb 5, 2015

I vote for keeping @aljoscha's original approach.
Users might not notice the different interfaces there, so the "Hadoop" in the method name makes it more explicit.
Also, overloading could lead to confusion because Flink's and Hadoop's InputFormats have very similar names (they actually differ only in their package names).
Lastly, changing it now would mean additional work for Aljoscha to update the code and the documentation.

@aljoscha (Contributor, Author) commented Feb 5, 2015

@StephanEwen If I add the exclusions, then users who add only flink-java as a dependency will get weird errors when using Hadoop InputFormats.

@StephanEwen (Contributor) commented:

Does this occur during local execution or during collection execution? Aren't the dependencies covered by the runtime dependencies?

@aljoscha (Contributor, Author) commented Feb 5, 2015

I think that when executing in an IDE, the dependencies are not there, since flink-java does not depend on flink-runtime, which has the Hadoop dependencies.
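
A hedged sketch of the workaround this implies: a user project that runs Hadoop formats from the IDE would have to declare the Hadoop dependency itself (artifact id and version are placeholders here):

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <!-- placeholder: the Hadoop artifact providing the classes your formats need -->
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
</dependency>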

@aljoscha force-pushed the hadoop-in-api branch 3 times, most recently from d6bf958 to 0f632e3 on February 6, 2015 at 16:45
@StephanEwen (Contributor) commented:

Looks good. We are getting into very long package names here ;-)
`org.apache.flink.api.java.hadoop.mapred.wrapper.*`

@asfgit merged commit 8b3805b into apache:master on Feb 9, 2015
@aljoscha deleted the hadoop-in-api branch on February 9, 2015 at 14:39