hadoopcompatibility: Implementations of basic programming interfaces and... #37
Conversation
Thank you for the pull request. I know it was a lot of work.
Hey @rmetzger, I think I made a fix for the Hadoop 2.2.0 dependencies, but I'm not sure I understand why it is still failing. Can you take a look please? :)
Hey, everything is okay now, according to this page: https://travis-ci.org/apache/incubator-flink/builds/28399204 (You can ignore the one failed build, it's not your fault).
I don't think that I can review and test it this week. But next week would be possible.
I'm testing the PR now.
Okay, I've taken this Job: https://github.com/apache/hadoop-common/blob/branch-1.2/src/examples/org/apache/hadoop/examples/Join.java and replaced the
I think you should either throw an UnsupportedOperationException or implement support for this. (The same applies to all the other methods that you've inherited from
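The guard being suggested can be sketched in plain Java. This is a hypothetical illustration, not the PR's actual code: `BaseJobClient` and `WrappingJobClient` are stand-ins for Hadoop's `JobClient` and the wrapper under review.

```java
// Hypothetical sketch: when a wrapper inherits methods it does not yet support,
// override them to throw UnsupportedOperationException so callers fail fast
// instead of silently getting a misleading default from the base class.
class BaseJobClient {
    // Stand-in for an inherited method the wrapper cannot support yet.
    public float getProgress() {
        return 0f;
    }
}

class WrappingJobClient extends BaseJobClient {
    @Override
    public float getProgress() {
        // Not implemented yet: fail loudly rather than return a misleading 0.
        throw new UnsupportedOperationException("getProgress() is not supported yet");
    }
}
```

The same pattern would apply to every other inherited-but-unimplemented method, so users get a clear error rather than wrong behavior.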
*/
@Override
@SuppressWarnings("unchecked")
public RunningJob submitJob(JobConf hadoopJobConf) throws IOException {
Why do you expect the JobConf again here? I think the user already passed it when creating the StratosphereHadoopJobClient. (JobConf inherits from Configuration.)
Actually, here I am just following the JobClient interface. It has this static method runJob, which is just a submitJob and waitForCompletion in one line. It is perhaps redundant in that case to pass the JobConf both to the constructor and the method, but it will become useful once we have more methods for the JobClient. What is really missing is a constructor with no arguments, which I should add.
Sorry if I completely misunderstood this :P
No, you got the question right. I'm still a bit unhappy with the configuration-object situation, because the configuration that the user passes into the constructor is never used. But let's keep it this way.
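The runJob-as-composition pattern discussed above can be sketched in plain Java. The names here (`SketchJobClient`, `SketchRunningJob`, and `java.util.Properties` standing in for `JobConf`) are illustrative assumptions, not the PR's actual classes:

```java
import java.util.Properties;

// Stand-in for Hadoop's RunningJob handle.
class SketchRunningJob {
    private boolean complete = false;

    // Stand-in for the blocking wait on job completion.
    void waitForCompletion() {
        complete = true;
    }

    boolean isComplete() {
        return complete;
    }
}

// Stand-in for the JobClient being wrapped. runJob composes submitJob and
// waitForCompletion, mirroring the static-method contract described above.
class SketchJobClient {
    SketchRunningJob submitJob(Properties jobConf) {
        // In the real client, this would translate and launch the Hadoop job.
        return new SketchRunningJob();
    }

    static SketchRunningJob runJob(Properties jobConf) {
        SketchRunningJob job = new SketchJobClient().submitJob(jobConf);
        job.waitForCompletion();
        return job;
    }
}
```

Because runJob is static and constructs its own client, the JobConf has to travel through the method parameter, which is why passing it again to submitJob is not fully redundant.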
I experienced another error. I'm not sure what I'm doing wrong. You can find the code that I'm using for testing here: https://github.com/rmetzger/testjob/tree/artem
Thank you for testing it. I believe it is not a bug. I actually tried to simulate this behavior with my last commit: if you run this on Hadoop, the same message is shown. Can you mention how you run the driver, please? I get the same when I set … So it really depends on the InputFormat and OutputFormat you choose. This happens because we have mismatches between the types built in the input/output formats and the input/output format keys specified in the driver. If you describe your case exactly (if it is a sequence file, how is it generated?), maybe we can check (or it really is a bug).
I was just executing the main method of this class: https://github.com/rmetzger/testjob/blob/artem/src/main/java/eu/stratosphere/test/hadoop/Join.java, with these arguments … I thought this was independent of the InputFormat, since the error happens when the output collector checks the type of the elements it receives?
I was following these instructions: http://www.mail-archive.com/core-user@hadoop.apache.org/msg04066.html and it worked! So the problem was indeed my input format! I think the pull request is ready to merge once the
One more question: Why do you force the application to use a DOP of one?
Definitely, this should not stay 1. I just left it here for simplicity. In my new branch it is the maximum of (… The problem here is that in Flink the DOP applies to all operators in an environment (if I'm not mistaken), so if, for example, DOP = 3, there will be 3 mappers and 3 reducers. What happens if I want 3 mappers in parallel and 1 reducer? Is there actually a way to set parallelism per operator with the current APIs?
It's great that you think it is OK. I will check which operations have not been implemented on the branch and guard them as you said. By the way, I wanted to ask you: do you think that
Yes, move them to a … Regarding the DOP: The
Ok, I'll do that. Thank you :)
In fact, setting the DOP of individual operators is supported with
Oh yes, @fhueske, I completely missed it for some reason. So then it gets simpler. I will add correct parallelism for this PR and see if the tests pass.
Well, it's an undocumented feature and easy to miss... ;-)
Let me know when I can have a look at the PR again.
Hi @rmetzger, I believe I have implemented your observations along with some other minor stuff. Do you think I should rebase and change the name now, or in a separate PR? Moreover, the sorting and the distributed cache support are ready; do you think I should include them? Then we can be sure we have all the programming interfaces in this PR. I can do that tomorrow. What do you think?
Moreover, if I rebase I can handle the … Of course, I can do separate smaller PRs in the coming week.
You might also be interested in my weekly report: |
It would be much easier for us if you could … It would also be cool if you could improve the InputFormats.
This PR is now ASF compliant. The parallelism of the tasks is not fully mapped though. Hopefully I can solve this properly soon... (@fhueske has given me some input in the tracker issue https://issues.apache.org/jira/browse/FLINK-838)
//setting up the inputFormat for the job
final DataSet input = environment.createInput(getFlinkInputFormat(hadoopJobConf));

final Mapper mapper = InstantiationUtil.instantiate(hadoopJobConf.getMapperClass());
The code contains many rawtype warnings. Please try to get rid of them (e.g., DataSet<?> instead of DataSet) and add a @SuppressWarnings("rawtypes") to the method if that's not possible.
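As a general illustration of the requested fix (not the PR's actual code), a raw collection type can usually be replaced by a wildcard without changing behavior; `RawTypeExample` here is a hypothetical name:

```java
import java.util.List;

class RawTypeExample {
    // Raw version (triggers a rawtypes warning at compile time):
    //   static int count(List l) { return l.size(); }

    // Wildcard version: accepts a list of any element type, warning-free,
    // because no element-typed operations are performed.
    static int count(List<?> l) {
        return l.size();
    }
}
```

Only when the element type genuinely cannot be expressed (e.g., it is decided reflectively at runtime, as with Hadoop's mapper classes) does a method-level @SuppressWarnings("rawtypes") become the fallback.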
input.setParallelism(mapParallelism);

final FlatMapOperator mapped = input.flatMap(new HadoopMapFunction(hadoopJobConf));
mapped.setParallelism(getMapParallelism(hadoopJobConf));
Set it to mapParallelism to avoid the recomputation of the input splits.
Hi @atsikiridis, please add new commits to this PR and do not squash and force-push the branch. Otherwise, all previous code comments are gone and no longer available. The … There are a few issues that I commented inline.
Ahh @fhueske, squashing was a complete brain stop, sorry :(
…tests, cleaning and javadocs.
*/
public final class FlinkHadoopJobClient extends JobClient {

	private final static int TASK_SLOTS = GlobalConfiguration.getInteger(ConfigConstants.TASK_MANAGER_NUM_TASK_SLOTS, -1);
Yeah, let's go for now with your suggestion, i.e., use the max DOP as the number of slots if the parameter has not been initialized before. Please also add the warning.
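The agreed fallback can be sketched as a small helper. This is a hypothetical standalone version; the PR would read the configured value via GlobalConfiguration rather than take it as a parameter:

```java
// Sketch of the fallback: the configured slot count uses -1 as the
// "uninitialized" sentinel (matching the default in the snippet above).
// If it is unset, fall back to the maximum degree of parallelism and warn.
class SlotFallback {
    static int resolveParallelism(int configuredTaskSlots, int maxDop) {
        if (configuredTaskSlots == -1) {
            System.err.println("WARN: task slots not configured; "
                    + "falling back to max DOP " + maxDop);
            return maxDop;
        }
        return configuredTaskSlots;
    }
}
```

This keeps the behavior explicit: an operator only ever runs with an unconfigured slot count after the warning has been emitted.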
…set and issues warning.
* @see http://docs.oracle.com/javase/7/docs/api/java/io/Serializable.html
*/
private void writeObject(final ObjectOutputStream out) throws IOException {
	HadoopConfiguration.writeHadoopJobConf(jobConf, out);
This ensures that each task gets a separate JobConf object. This is related to https://mail-archives.apache.org/mod_mbox/incubator-flink-dev/201407.mbox/browser
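The per-task-copy effect of a custom writeObject/readObject pair can be demonstrated in plain Java. Here java.util.Properties stands in for Hadoop's JobConf, an assumption for illustration only; the PR itself serializes via HadoopConfiguration.writeHadoopJobConf:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Properties;

class ConfHolder implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: skipped by default serialization, handled manually below.
    transient Properties jobConf;

    ConfHolder(Properties jobConf) {
        this.jobConf = jobConf;
    }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        // Writing the configuration explicitly means every deserialized
        // holder (i.e., every task) reconstructs its own copy.
        out.writeObject(jobConf);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        jobConf = (Properties) in.readObject();
    }
}
```

Because each deserialization builds a fresh object, tasks cannot accidentally share and mutate one configuration instance, which is the concurrency issue the linked mailing-list thread discusses.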
Looks good to me. I'll add a comment to JIRA as well. I have a working prototype for full support of custom partitioners, sorters, and groupers (combine is not supported yet, though). However, the solution goes through the full program compilation stack and I'd like to get some feedback before I continue...
Hello Artem and mentors, first of all, nice greetings from INRIA, France. At INRIA, we are starting to adopt Stratosphere/Flink.
At present, we are typically in the profiles of Alice-Bob-Sam, as described in Artem's GSoC proposal. Briefly, … Hence, I would like to know: what is the current status in this direction? You may also like to see my discussion with Robert on this page. Thanks in advance, Anirvan
Hi Anirvan, I would suggest continuing the discussion on the mailing list. Best, Fabian
@atsikiridis I think we can close this PR for now. The support to run complete Hadoop jobs requires a bit more work; at least the combiner should work. Parts of this PR (wrappers for iterators and collectors, dummy reporters, etc.) can be added in a new PR which addresses FLINK-1076.
@fhueske Ok, I will try to have the PR for FLINK-1076 soon.
@fhueske What's the state of this PR? Is it superseded by other changes that happened in the meantime?
I backed up the code.
OK. +1 to close this. I've pinged @atsikiridis.
That would be really cool! Let me know if you need any help! :-)
…uration This closes apache#37.
... a basic driver.
You can find more about the Google Summer of Code project here: https://issues.apache.org/jira/browse/FLINK-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031511