Add a (local mode) Scalding Interpreter to Zeppelin #561

Closed
wants to merge 31 commits into from

sriramkrishnan
Contributor

What is this PR for?

Scalding (https://github.com/twitter/scalding) is a Scala library for writing MapReduce jobs.
This issue tracks the addition of a Scalding interpreter for Zeppelin. To keep this work incremental, this PR will focus on just a local mode implementation. The Hadoop mode can be a subsequent addition.

What type of PR is it?

Feature

Todos

  • Addition of Hadoop mode for Scalding

Is there a relevant Jira issue?

https://issues.apache.org/jira/browse/ZEPPELIN-526

How should this be tested?

Run the tests in: scalding/src/test/java/org/apache/zeppelin/scalding/ScaldingInterpreterTest.java

Screenshots

scalding-example

scalding-screenshot

Questions:

  • This could use documentation, which could just be the example in the screenshot. Where can I contribute that?

/**
* Scalding interpreter for Zeppelin. Based off the Spark interpreter code.
*
* @author sriramkrishnan
Member

Would you be willing to remove that information from the comments, please?

It is not strictly documented yet, but Zeppelin, like many other ASF projects (Hadoop, ZooKeeper, Avro, etc.), does not encourage the use of @author tags.

We definitely want to keep contributor credits, but we rely on git, JIRA, and mailing list history for that, so nothing will be lost.

Contributor Author

Sure, not a problem.

@bzz
Member

bzz commented Dec 22, 2015

@sriramkrishnan Great contribution, thank you! I think there are just a couple of things that need to be taken care of:

  1. Could you please also add some docs for the new interpreter here: http://zeppelin.incubator.apache.org/docs/0.6.0-incubating-SNAPSHOT/interpreter/ ?

    Here are some details on how to add it.

    Explicit documentation of what is and is not supported is very welcome, e.g. that only local mode is available and that progress/cancel for a paragraph is not supported yet.

  2. We also need to document the licenses of the dependencies this contribution adds (and their transitive dependencies) in zeppelin-distribution/src/bin_license/LICENSE.

    Would you mind doing that in this PR as well? Here is what the list of dependencies looks like now:

    [INFO] +- com.twitter:scalding-core_2.10:jar:0.15.1-RC13:compile
    [INFO] |  +- com.twitter:scalding-args_2.10:jar:0.15.1-RC13:compile
    [INFO] |  +- com.twitter:scalding-date_2.10:jar:0.15.1-RC13:compile
    [INFO] |  +- com.twitter:scalding-serialization_2.10:jar:0.15.1-RC13:compile
    [INFO] |  +- com.twitter:maple:jar:0.15.1-RC13:compile
    [INFO] |  +- cascading:cascading-core:jar:2.6.1:compile
    [INFO] |  |  +- riffle:riffle:jar:0.1-dev:compile
    [INFO] |  |  +- thirdparty:jgrapht-jdk1.6:jar:0.8.1:compile
    [INFO] |  |  \- org.codehaus.janino:janino:jar:2.7.5:compile
    [INFO] |  |     \- org.codehaus.janino:commons-compiler:jar:2.7.5:compile
    [INFO] |  +- cascading:cascading-hadoop:jar:2.6.1:compile
    [INFO] |  +- cascading:cascading-local:jar:2.6.1:compile
    [INFO] |  |  \- com.google.guava:guava:jar:15.0:compile
    [INFO] |  +- com.twitter:chill-hadoop:jar:0.7.0:compile
    [INFO] |  |  \- com.esotericsoftware.kryo:kryo:jar:2.21:compile
    [INFO] |  |     +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:compile
    [INFO] |  |     |  \- org.ow2.asm:asm:jar:4.0:compile
    [INFO] |  |     +- com.esotericsoftware.minlog:minlog:jar:1.2:compile
    [INFO] |  |     \- org.objenesis:objenesis:jar:1.2:compile
    [INFO] |  +- com.twitter:chill-java:jar:0.7.0:compile
    [INFO] |  +- com.twitter:algebird-core_2.10:jar:0.11.0:compile
    [INFO] |  |  \- com.googlecode.javaewah:JavaEWAH:jar:0.6.6:compile
    [INFO] |  +- com.twitter:bijection-core_2.10:jar:0.8.1:compile
    [INFO] |  +- com.twitter:bijection-macros_2.10:jar:0.8.1:compile
    [INFO] |  |  \- org.scalatest:scalatest_2.10:jar:2.2.2:compile
    [INFO] |  +- com.twitter:chill_2.10:jar:0.7.0:compile
    [INFO] |  +- com.twitter:chill-algebird_2.10:jar:0.7.0:compile
    [INFO] |  \- org.scalamacros:quasiquotes_2.10:jar:2.0.1:compile
    [INFO] +- com.twitter:scalding-repl_2.10:jar:0.15.1-RC13:compile
    [INFO] |  \- jline:jline:jar:2.10:compile
    [INFO] +- org.scala-lang:scala-library:jar:2.10.5:compile
    [INFO] +- org.scala-lang:scala-compiler:jar:2.10.5:compile
    [INFO] +- org.scala-lang:scala-reflect:jar:2.10.5:compile
    [INFO] \- org.apache.hadoop:hadoop-client:jar:2.5.0:compile
    [INFO]    +- org.apache.hadoop:hadoop-common:jar:2.5.0:compile
    [INFO]    |  +- commons-cli:commons-cli:jar:1.2:compile
    [INFO]    |  +- org.apache.commons:commons-math3:jar:3.1.1:compile
    [INFO]    |  +- xmlenc:xmlenc:jar:0.52:compile
    [INFO]    |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
    [INFO]    |  +- commons-codec:commons-codec:jar:1.5:compile
    [INFO]    |  +- commons-io:commons-io:jar:2.4:compile
    [INFO]    |  +- commons-net:commons-net:jar:3.1:compile
    [INFO]    |  +- commons-collections:commons-collections:jar:3.2.1:compile
    [INFO]    |  +- commons-logging:commons-logging:jar:1.1.1:compile
    [INFO]    |  +- commons-configuration:commons-configuration:jar:1.9:compile
    [INFO]    |  +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
    [INFO]    |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
    [INFO]    |  +- org.apache.avro:avro:jar:1.7.4:compile
    [INFO]    |  |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
    [INFO]    |  |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
    [INFO]    |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
    [INFO]    |  +- org.apache.hadoop:hadoop-auth:jar:2.5.0:compile
    [INFO]    |  |  \- org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:compile
    [INFO]    |  |     +- org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:compile
    [INFO]    |  |     +- org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:compile
    [INFO]    |  |     \- org.apache.directory.api:api-util:jar:1.0.0-M20:compile
    [INFO]    |  +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
    [INFO]    |  +- org.apache.zookeeper:zookeeper:jar:3.4.6:compile
    [INFO]    |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
    [INFO]    |     \- org.tukaani:xz:jar:1.0:compile
    [INFO]    +- org.apache.hadoop:hadoop-hdfs:jar:2.5.0:compile
    [INFO]    |  +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
    [INFO]    |  \- io.netty:netty:jar:3.6.2.Final:compile
    [INFO]    +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.5.0:compile
    [INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.5.0:compile
    [INFO]    |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.5.0:compile
    [INFO]    |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
    [INFO]    |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.5.0:compile
    [INFO]    |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.5.0:compile
    [INFO]    |     \- org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:compile
    [INFO]    +- org.apache.hadoop:hadoop-yarn-api:jar:2.5.0:compile
    [INFO]    +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.5.0:compile
    [INFO]    |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.5.0:compile
    [INFO]    |     +- javax.xml.bind:jaxb-api:jar:2.2.2:compile
    [INFO]    |     |  +- javax.xml.stream:stax-api:jar:1.0-2:compile
    [INFO]    |     |  \- javax.activation:activation:jar:1.1:compile
    [INFO]    |     +- javax.servlet:servlet-api:jar:2.5:compile
    [INFO]    |     +- com.sun.jersey:jersey-core:jar:1.9:compile
    [INFO]    |     +- org.codehaus.jackson:jackson-jaxrs:jar:1.9.13:compile
    [INFO]    |     \- org.codehaus.jackson:jackson-xc:jar:1.9.13:compile
    [INFO]    +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.5.0:compile
    [INFO]    \- org.apache.hadoop:hadoop-annotations:jar:2.5.0:compile
    

    I am not sure you really need hadoop-client here. So far only the Spark interpreter has required it, and there it uses the provided scope (so, as I understand it, it does not get bundled as part of the release and there is no need to specify its licenses, but maybe @Leemoonsoo can correct me on that).

  3. The CI failure here does not seem to be related to this change, provided it cannot be reproduced in subsequent runs; otherwise we need to create a JIRA issue for it.

@prasadwagle

Should we add ScaldingInterpreter to ZEPPELIN_INTERPRETERS in ZeppelinConfiguration.java?

tw-172-25-131-152 incubator-zeppelin (presto) $ git diff master -- zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java
diff --git a/zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java b/zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java
index 909345a..0c6e9f0 100755
--- a/zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java
+++ b/zeppelin-zengine/src/main/java/org/apache/zeppelin/conf/ZeppelinConfiguration.java
@@ -415,6 +415,7 @@ public class ZeppelinConfiguration extends XMLConfiguration {
         + "org.apache.zeppelin.cassandra.CassandraInterpreter,"
         + "org.apache.zeppelin.geode.GeodeOqlInterpreter,"
         + "org.apache.zeppelin.postgresql.PostgreSqlInterpreter,"
+        + "org.apache.zeppelin.scalding.ScaldingInterpreter,"
         + "org.apache.zeppelin.kylin.KylinInterpreter"),

     ZEPPELIN_INTERPRETER_DIR("zeppelin.interpreter.dir", "interpreter"),
     ZEPPELIN_INTERPRETER_CONNECT_TIMEOUT("zeppelin.interpreter.connect.timeout", 30000),
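
To illustrate why the placement of the new entry matters, here is a minimal sketch of how a comma-separated interpreter list like the one in this diff could be turned into class names. It assumes the default value is simply split on commas; the class `InterpreterListSketch` and the method `classNames` are hypothetical and are not Zeppelin code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Zeppelin code: parsing a comma-separated interpreter list. */
final class InterpreterListSketch {

    // Same shape as the default value in the diff: adjacent string literals are
    // concatenated, so every entry except the last must end with a comma.
    static final String DEFAULT_INTERPRETERS =
        "org.apache.zeppelin.cassandra.CassandraInterpreter,"
        + "org.apache.zeppelin.geode.GeodeOqlInterpreter,"
        + "org.apache.zeppelin.postgresql.PostgreSqlInterpreter,"
        + "org.apache.zeppelin.scalding.ScaldingInterpreter,"
        + "org.apache.zeppelin.kylin.KylinInterpreter";

    /** Splits the configured value into trimmed, non-empty class names. */
    static List<String> classNames(String configured) {
        List<String> names = new ArrayList<String>();
        for (String part : configured.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one interpreter class name per line.
        for (String name : classNames(DEFAULT_INTERPRETERS)) {
            System.out.println(name);
        }
    }
}
```

Under this assumption, forgetting the trailing comma on the new line would fuse the Scalding and Kylin class names into a single entry, which is why the added line sits before the last entry and keeps its comma.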

@sriramkrishnan
Contributor Author

@bzz thanks for the review. I will address your other comments and update the PR in a day or so.

@prasadwagle good point. I will add that too.

@sriramkrishnan
Contributor Author

Addressed all PR comments except the docs and license updates that @bzz suggested, which I will do next. The build is also green now.

ps: I do need the hadoop-client jar as the Scalding REPL uses it even for local mode (and provided doesn't work there AFAICT).

@sriramkrishnan
Contributor Author

Added docs.

@bzz I believe I have addressed all comments, except adding LICENSE info to zeppelin-distribution/src/bin_license/LICENSE. Do you happen to have a script for doing that already? I am dreading doing this by hand.
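
No such script appears in this thread. As a rough sketch of how the first step could be automated, the following reduces `mvn dependency:tree` output lines, in the format pasted above, to unique `groupId:artifactId:version` coordinates. The class `DependencyTreeCoordinates` and its methods are hypothetical, and the license of each coordinate would still have to be looked up by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of Zeppelin: lists coordinates from dependency:tree output. */
final class DependencyTreeCoordinates {

    /** Returns "groupId:artifactId:version", or null if the line is not a dependency entry. */
    static String coordinate(String line) {
        // Drop the "[INFO]" prefix, then the tree-drawing characters (|, +, \, -, spaces).
        String s = line.replaceFirst("^\\s*\\[INFO\\]", "")
                       .replaceFirst("^[\\s|+\\\\-]+", "")
                       .trim();
        String[] parts = s.split(":");
        if (parts.length == 5) { // groupId:artifactId:type:version:scope
            return parts[0] + ":" + parts[1] + ":" + parts[3];
        }
        if (parts.length == 6) { // groupId:artifactId:type:classifier:version:scope
            return parts[0] + ":" + parts[1] + ":" + parts[4];
        }
        return null;
    }

    /** Collects coordinates in first-seen order, skipping duplicates and non-dependency lines. */
    static List<String> uniqueCoordinates(List<String> lines) {
        Set<String> seen = new LinkedHashSet<String>();
        for (String line : lines) {
            String c = coordinate(line);
            if (c != null) {
                seen.add(c);
            }
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        // Sample lines copied from the dependency list earlier in this thread.
        List<String> sample = Arrays.asList(
            "[INFO] +- com.twitter:scalding-core_2.10:jar:0.15.1-RC13:compile",
            "[INFO] |  |  \\- org.codehaus.janino:janino:jar:2.7.5:compile",
            "[INFO] |  |     +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:compile");
        for (String c : uniqueCoordinates(sample)) {
            System.out.println(c);
        }
    }
}
```

The six-part case covers entries that carry a classifier, such as the shaded reflectasm jar in the list above.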

@jongyoul
Member

I've tested it on my local machine and it worked well. Thanks for this contribution.

@sriramkrishnan
Contributor Author

Thanks @Leemoonsoo I believe I have addressed all your comments, except for the jgrapht.

Note that the jgrapht here is thirdparty:jgrapht, which seems to be a version that Cascading has forked. I am inclined to leave it as is, but I am open to suggestions.

@sriramkrishnan
Contributor Author

bump

What else do we need to do to get this merged? Thanks!

@Leemoonsoo
Member

I have asked the Cascading team about the license of thirdparty:jgrapht and am waiting for an answer.

One of the fastest ways to get this merged is to make it optional, like we did for Geode: https://github.com/apache/incubator-zeppelin/pull/379/files.

That would avoid the license problem of the binary dependency and the 3rd-party Maven repo. Then I think it can be merged.

Eventually, I hope https://issues.apache.org/jira/browse/ZEPPELIN-546 solves the problem, i.e. the binary package of Zeppelin does not include the Scalding dependency, but the Scalding interpreter is released as a Maven artifact that Zeppelin can load at runtime.

@sriramkrishnan
Contributor Author

I am OK making the Scalding interpreter optional with a -Pscalding flag. Does that also mean I have to revert all my changes to the LICENSE file?

@Leemoonsoo
Member

Yes, the changes to the LICENSE file and the files that have been added to the licenses directory need to be reverted. You probably also want to describe the -Pscalding flag in https://github.com/apache/incubator-zeppelin/blob/master/README.md

@Leemoonsoo
Member

FYI, I got a reply about the license:

moon soo Lee
Subject: Questions about license of jgrapht-jdk1.6:0.8.1 binary

DEC 30, 2015 | 07:13AM PST
Ryan Desmond replied:
Hi Moon,

Chris has informed me that you can choose either license. That said, he also mentioned that you will need to speak with a lawyer for proper consultation on this issue.

Best Regards,
Ryan

@sriramkrishnan
Contributor Author

Thanks @Leemoonsoo. So how should we proceed? Should I add it as LGPL?

Also, are you OK with using the conjars repo now, or do you recommend that I add a -Pscalding flag? It would be great to get this merged either way!

@Leemoonsoo
Member

Apart from the license for jgrapht, using a 3rd-party repo (conjars) needs to be avoided. How about reverting the commits for the LICENSE file and the files added under the licenses dir, and adding a -Pscalding flag, for this pull request? After it is merged, you can create another issue for removing -Pscalding and continue the work there. Would that work for you?

@sriramkrishnan
Contributor Author

@Leemoonsoo as you suggested, I have reverted commits for the LICENSE / files added under licenses dir, added -Pscalding flag, and updated the docs.

Are you OK with shipping/merging this PR now?

@sriramkrishnan
Contributor Author

And we also have a green build now.

@Leemoonsoo
Member

LGTM. @sriramkrishnan Thanks for the contribution!

@sriramkrishnan
Contributor Author

Thanks @Leemoonsoo for the review. Could one of the committers please merge if there are no other objections?

@Leemoonsoo
Member

I'll merge it if there is no further discussion.

@sriramkrishnan
Contributor Author

Thanks!

@asfgit asfgit closed this in 8fdaaba Jan 2, 2016
@sriramkrishnan sriramkrishnan deleted the scalding branch January 3, 2016 21:40
@sriramkrishnan
Contributor Author

Filed https://issues.apache.org/jira/browse/ZEPPELIN-555 to track addition of Hadoop mode to the Scalding interpreter. I may pick it up in my "free time", but I have added a description there if anyone wants to tackle it.

CC @prasadwagle

asfgit pushed a commit that referenced this pull request Jan 5, 2016
### What is this PR for?
Since we avoid publishing an official binary package built with a 3rd-party Maven repository, we currently cannot include the [Scalding](#561) dependency in the binary package of Zeppelin. But the tests for the Scalding interpreter should still run on Travis, so I just added the `-Pscalding` flag to the `.travis.yml` file.

### What type of PR is it?
Improvement

### Todos
* [x] - Add -Pscalding flag to .travis.yml.

### Is there a relevant Jira issue?
No, but you may check out the discussion [here](https://github.com/apache/incubator-zeppelin/pull/561/files#r48471634).

### How should this be tested?

### Screenshots (if appropriate)

### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

Author: Ryu Ah young <fbdkdud93@hanmail.net>

Closes #594 from AhyoungRyu/MODIFY-TRAVIS-FILE and squashes the following commits:

c7edbf9 [Ryu Ah young] Add -Pscalding build option to .travis.yml
dabaitu pushed a commit to dabaitu/zeppelin that referenced this pull request Jul 10, 2017