New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-11] Integrate Spark runner with Beam #42

Closed
wants to merge 10 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@amitsela
Member

amitsela commented Mar 12, 2016

No description provided.

@davorbonaci

This comment has been minimized.

Member

davorbonaci commented Mar 13, 2016

I'll take a peek at this one shortly.

R: @davorbonaci

</dependency>
<dependency>
<groupId>com.google.cloud.dataflow</groupId>
<artifactId>google-cloud-dataflow-java-examples-all</artifactId>

This comment has been minimized.

@davorbonaci

davorbonaci Mar 13, 2016

Member

Is this really needed?

This comment has been minimized.

@amitsela

amitsela Mar 13, 2016

Member

It sounds fair to me to prefer the runners logger.. Flink runner does the same.

This comment has been minimized.

@tomwhite

tomwhite Mar 14, 2016

Member

The dependency is needed because some of the examples are used to test the Spark runner.

<dependency>
<groupId>com.google.cloud.dataflow</groupId>
<artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
<version>${beam.version}</version>

This comment has been minimized.

@davorbonaci

davorbonaci Mar 13, 2016

Member

I think this will end up being ${project.version}, or something like that.

This comment has been minimized.

@amitsela

amitsela Mar 13, 2016

Member

It could be ${project.version} if we keep all runners in the same version as the SDK.
For now, I could use ${parent.version}

<version>1.5.0-SNAPSHOT</version>
</parent>
<artifactId>spark-runner</artifactId>
<version>0.4.3-SNAPSHOT</version>

This comment has been minimized.

@davorbonaci

davorbonaci Mar 13, 2016

Member

I think many elements in this file can be inherited from the parent instead of being specified here, including:

  • version
  • licenses
  • properties (most if not all)
  • repositories
  • developers
  • scm
  • prerequisites
@davorbonaci

This comment has been minimized.

Member

davorbonaci commented Mar 13, 2016

LGTM

Nice!

I think we'll have to go over all pom.xml files in the project and fix them up globally -- but, that's unrelated to this pull request.

@davorbonaci

This comment has been minimized.

Member

davorbonaci commented Mar 13, 2016

(We should get to the bottom of the Jenkins failure before merging.)

@amitsela

This comment has been minimized.

Member

amitsela commented Mar 13, 2016

R: @tomwhite as well

<!--<transformers>-->
<!--<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />-->
<!--</transformers>-->
<!--</configuration>-->

This comment has been minimized.

@tomwhite

tomwhite Mar 14, 2016

Member

Guava will still need to be relocated to run properly on a cluster, won't it?

This comment has been minimized.

@amitsela

amitsela Mar 14, 2016

Member

The SDK upgraded to Guava 19 but I guess shading is still necessary for cluster. I'll reinstate the shade configuration.

@tomwhite

This comment has been minimized.

For the moment this is fine, but I just wanted to point out that it duplicates HadoopIO in contrib/hadoop. HadoopIO can be used by any runner, so this one should be removed at some point, in favour of the one in contrib.

This comment has been minimized.

Owner

amitsela replied Mar 14, 2016

@tomwhite

This comment has been minimized.

I don't think there's anything Hadoop-specific in this class, so maybe doesn't belong in this package. Ideally this would go in the SDK.

This comment has been minimized.

Owner

amitsela replied Mar 14, 2016

I placed it in hadoop package because only Hadoop related code used it.. I'll take a better look there and see if this could actually go in the SDK.

This comment has been minimized.

tomwhite replied Mar 14, 2016

That's fine. Could be done later of course.

@tomwhite

This comment has been minimized.

Member

tomwhite commented Mar 14, 2016

Looks good to me. Thanks for working on it @amitsela. A few comments inline and here:

  • The license headers should be changed to ASF ones.
  • Since you are reorganising packages, how about keeping only the ones that clients use (SparkPipelineRunner, SparkPipelineOptions, EvaluationResult) in the top-level org.apache.beam.runners.spark package, and moving all the others to subpackages?
  • Remove .gitignore and .travis.yml.
@tomwhite

This comment has been minimized.

Member

tomwhite commented Mar 14, 2016

Also, the note about the Spark runner on https://github.com/apache/incubator-beam#runners should be updated to say that it's now a part of Beam.

@amitsela

This comment has been minimized.

Member

amitsela commented Mar 14, 2016

Thanks @tomwhite and @davorbonaci !
I'll do a second iteration and hope that by the time I'm done, the Jenkins issue will be solved as well :)

I plan to address the following:

  • Shade configuration
  • ASF licenses
  • Remove .gitignore and .travis.iml
  • Package organization
  • Updated README

@davorbonaci like you said, I think that we need a cross-project pom.xml work to get all components inline, but let's get this runner running first :)

@amitsela

This comment has been minimized.

Member

amitsela commented Mar 14, 2016

This pull request is till pending additional work so please DON'T MERGE.
An early push was executed to trigger a Jenkins job to test a new configuration.

Thanks!

@amitsela

This comment has been minimized.

Member

amitsela commented Mar 15, 2016

@tomwhite please review second iteration. Thanks.

@tomwhite

This comment has been minimized.

Member

tomwhite commented Mar 15, 2016

+1 from me

@asfgit asfgit closed this in a91e115 Mar 15, 2016

aljoscha pushed a commit to aljoscha/beam that referenced this pull request Mar 29, 2018

Merge pull request apache#42 from bsidhom/portable-pipeline-result
Wire job service API into portable runner PipelineResults

tvalentyn pushed a commit to tvalentyn/beam that referenced this pull request May 15, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment