Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Scala Pipeline API] Add scala logisticRegression api for spark pipeline #70

Closed

Conversation

Wenpei
Copy link
Contributor

@Wenpei Wenpei commented Feb 16, 2016

I wrote a scala ml pipeline wrapper for LogisticRegression Model as a example for scala user.
I thought those api (java, python, scala) should be put to separated jar, but just follow what it is this time.

Regards.
Wenpei.

@akchinSTC
Copy link
Contributor

Can an Admin verify this patch?

@mboehm7
Copy link
Contributor

mboehm7 commented Feb 16, 2016

thanks @Wenpei; @niketanpansare could you please take a look.

@niketanpansare
Copy link
Contributor

Hi Wenpei,

Thanks for your contribution. But, I believe your code does exactly the same thing as https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/ml/LogisticRegression.java. This class is fully compatible with Scala API and won't require separate jar. May be I am missing something. Is there any particular reason why you think we should create wrappers in Scala instead ?

Thanks,

Niketan.

@Wenpei
Copy link
Contributor Author

Wenpei commented Feb 17, 2016

@niketanpansare
Personally, I thought this wrapper LogisticRegression.java, more like a example but not a real api expose to end user.

So I propose a scala version example since some weakness for java version.

  • It's not naturally to extend scala class in java code. We need know function style after compile, like
    @OverRide
    public void org$apache$spark$ml$param$shared$HasElasticNetParam$setter$elasticNetParam_$eq(DoubleParam arg0) {}

    I assume it's set function, but do nothing here

  • Hard to follow ml parameter style, but define parameter like below

    private IntParam icpt = new IntParam(this, "icpt", "Value of intercept");
    private DoubleParam reg = new DoubleParam(this, "reg", "Value of regularization parameter");

Wenpei.

@niketanpansare
Copy link
Contributor

@Wenpei
Fair point, Java API is little cumbersome for developers of the API as one needs to understand how Java developers can use certain Scala features. However, it does not affect usability with respect to the end user. The main advantage of having Java API is "easy integration with current dev setup".

Btw, I am OK with having a Scala API into incubator-systemml if following conditions are met:

  1. "mvn verify" creates a jar with the Scala APIs in it and also runs the unit tests that verifies the Scala API. This will ensure that the future commits do not break the Scala API.
  2. Common dev setup (preferably eclipse or Scala IDE) for both the Scala API and the runtime. That is, scala files are compilation units inside the eclipse editor. This will encourage other committers to work contribute to the Scala API.

Thanks,

Niketan.

@dusenberrymw
Copy link
Contributor

I'm in favor of replacing the existing Java APIs (LogisticRegression, MLContext, etc.) with Scala versions in order to alleviate the types of issues raised here that are inherent with user-facing APIs. We could simply place these Scala APIs in src/main/scala and keep the same package structure as before.

So, currently our LogisticRegression API is is part of the org.apache.sysml.api.ml package, and therefore the actual LogisticRegression.java file is located at src/main/java/org/apache/sysml/api/ml/LogisticRegression.java. We could instead place the Scala file at src/main/scala/org/apache/sysml/api/ml/LogisticRegression.scala, and use the same org.apache.sysml.api.ml package identifier.

To alleviate build issues, we can just integrate Scala compilation into Maven. A quick search shows that the scala-maven-plugin can do just this by simply adding a few lines to the pom.xml, as shown in 1. Eclipse integration appears to also be easy by simply adding a few more lines to the pom.xml, as show at the bottom of 2. Then, a simply mvn package, or Eclipse (or IntelliJ) build would work as usual. No additional JARs would be necessary in the final packaged SystemML.jar file either.

This route would also be great in preparation for SYSTEMML-451, which aims add DML as an embedded Scala DSL.

@niketanpansare
Copy link
Contributor

@dusenberrymw +1 on creating Scala API for LogisticRegression, but am not sure whether we need to convert MLContext to Scala.

scala-maven-plugin solution looks good to resolve condition 1. The flag "cTestGoals" should ideally allow us to run junit tests 1.

Wrt condition 2, I had bad experience with Scala IDE during initial prototyping of Spark backend, especially for mixed scala/java projects (eg: code completion/refactoring/references). This was about a year and half ago and may be it has got better. Before we make the switch, let's double check the IDE support on both windows/mac. Then @Wenpei can push this commit to src/main/scala.

@dusenberrymw
Copy link
Contributor

@niketanpansare Great! The goal for replacing MLContext as well would relate to the issue of default parameters. There are currently 18 variations of registerInput(...), and 23 variations of execute(...)/executeScript(...) and it just keeps getting messier with each new parameter or parameter type.

Also, I can't speak for Eclipse, but IntelliJ works great for Spark (which is now a mixed Scala/Java project) as well as SystemML.

@Wenpei
Copy link
Contributor Author

Wenpei commented Feb 18, 2016

@niketanpansare @dusenberrymw
I work out a init pom.xml for scala support, which works for mvn clean package to produce a SystemML.jar.
But I face two issues here:

  1. It's very large for SystemML.jar , about 30M in my compute. Not very sure but my be reasonable.
  2. I don't know how to change compile order to make compile java at first, which cause class not found issues in eclipse.

I am not very good at maven config / pom, so is anyone can help to review this.

beside this, systemml has folder like "/systemml/src/test_suites/java", we may need change it to "/systemml/src/test/java", and add "systemml/src/test/scala" to follow plugin rule.

Regards.
Yu Wenpei.

@niketanpansare
Copy link
Contributor

@Wenpei Will look into this today.

@dusenberrymw
Copy link
Contributor

@Wenpei Thanks for updating the POM file.

  1. I think the issue with regards to size is due to the inclusion of the Scala library dependencies into to final JAR. Since this specific log reg API will only ever be used from Spark, we can assume that those Scala libraries will be available at runtime, and thus we don't need to include them in the JAR. Can you set the "scope" on all the Scala dependencies you added to "provided", indicating that they should not be included in the JAR?

@niketanpansare
Copy link
Contributor

@Wenpei Please do following things before making the switch:

  1. Change all the scala dependency to provided. This should bring the jar down to 6M. @dusenberrymw mentioned this already :)
  2. Please make sure the jar works on a cluster without Spark/Scala (as well as with Spark/Scala) on all the execution modes (i.e. standalone, hybrid, ....). I have done sanity test on our internal cluster which has both Spark/Scala (on the above 6M jar), but I would highly recommend using our performance test suite to ensure that there are no unintentional side-effects. Also please make sure you run the integration test suite (i.e. mvn verify).
  3. Please try the new setup on eclipse (and the IDE of your choice (IntelliJ)) on Mac/Windows and see if you are getting compilation errors.
  4. Please provide instructions to switch to new setup (for example: download certain version of Scala, certain version of IDE, etc ...) to other committers.

@mboehm7
Copy link
Contributor

mboehm7 commented Feb 18, 2016

Well, I don't have a strong opinion on the wrapper code. The only hard requirement is that it has to work nicely with eclipse - no matter what.

@niketanpansare
Copy link
Contributor

@Wenpei I was not able to reproduce the second issue.
Scala IDE

@Wenpei
Copy link
Contributor Author

Wenpei commented Feb 19, 2016

@niketanpansare I change flag in pom.xml to provider, and can get 6M SystemML.jar now.
And second issues gone after recompiled.

I will

  1. work on instruction those days and put it to dev folder.
  2. do some test to make sure SystemML.jar works fine.
  3. add scala test for this pipeline api.

@niketanpansare
Copy link
Contributor

Thanks @Wenpei, I really appreciate the effort :)

@Wenpei
Copy link
Contributor Author

Wenpei commented Feb 25, 2016

@niketanpansare I finish instruction, test suite.
But my compute report less memory error after run a hour for mvn verify, so it would be great if some one can launch a jenkins test for this. Or @bertholdreinwald can assign a account for me? :-)

For Instruction, I just leave it there. But may @deroneriksson want add it to http://apache.github.io/incubator-systemml/ as a part of instruction for contributor, can address this in another pr.

Regards.
Wenpei.

@niketanpansare
Copy link
Contributor

Jenkins test this.

@akchinSTC
Copy link
Contributor

test this please.

@niketanpansare
Copy link
Contributor

@Wenpei Please check https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/185/console and fix following errors:

  1. "Multiple versions of scala libraries detected" warning can be eliminated by modifying the pom file.
  2. "... error: not found" is most likely because the code might be behind git. For example: src/main/java/org/apache/sysml/parser errors could be related to changes introduced in [SYSTEMML-456] parser code deduplication #61

@Wenpei
Copy link
Contributor Author

Wenpei commented Feb 27, 2016

I will rebase this pr will new.
For multi scala version, it caused by jar dependence , will take a look but may common issues?

@niketanpansare
Copy link
Contributor

Thanks :)

@Wenpei Wenpei force-pushed the pr_scala_logisticregression_pipeline branch from e036255 to 7a74d01 Compare February 29, 2016 08:53
@Wenpei
Copy link
Contributor Author

Wenpei commented Feb 29, 2016

@niketanpansare I rebase pr with new. Please launch a test again.
For warning message, we didn't directly dependent to com.twitter. It should be caused by other jar(spark?).

@niketanpansare
Copy link
Contributor

@akchinSTC test this please.

@deroneriksson
Copy link
Member

@niketanpansare @mboehm7 @gweidner
Before merging, can someone verify that Scala is currently integrating smoothly in Eclipse with a Maven project? I experienced some issues with this (changes to classes not appearing in target/classes, standard debugging) when I tried last Friday, but it sounded like @gweidner was making some progress.

@niketanpansare
Copy link
Contributor

Thanks @gweidner for looking into it :)

@gweidner
Copy link
Member

gweidner commented Mar 3, 2016

I've been able to build the PR successfully within Eclipse when using Juno-based development environment.
build_juno

Build failures within Luna-based environment under investigation.

@Wenpei
Copy link
Contributor Author

Wenpei commented Mar 4, 2016

I have add eclipse version to dev/project import.md to specify Juno version.
I didn't test eclipse luna, and spark community prefer Juno also.
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

@gweidner Can you paste a failed picture here? class not found or other error?

@gweidner
Copy link
Member

gweidner commented Mar 4, 2016

Thank you @Wenpei for the information. When using the latest scala ide sdk (4.3.0 downloaded from http://scala-ide.org/download/current.html), I've observed two sets of errors:
lr_scala_errors

test_suite_errors

@Wenpei
Copy link
Contributor Author

Wenpei commented Mar 7, 2016

@gweidner I do some test today, for scala IDE and eclipse luna
scala IDE:

  1. delete project from eclipse
  2. run command mvn eclipse:clean
  3. import mvn project from wizard
  4. If face incompatation issues , change IDE scala version
    project->propertiest->scala compiler -> scala installation
    to Fixed scala Installation: 2.10.5
  5. If face Not found type * issues.
    Run command mvn package
    And do project -> refresh

For eclipse luna:
Except scala IDE pulgin install, please make sure get update from "http://alchim31.free.fr/m2e-scala/update-site" if get maketplace not found issues. to update maven connector for scala.

If this works for you, I can add as tips to instruction doc

@gweidner
Copy link
Member

gweidner commented Mar 8, 2016

Thank you WenPei for your time and tips. That did work except for Project-Clean... build.

The configuration that worked best for me was Luna + Scala IDE 4.0 plugin + maven scala connector. That combination had no Eclipse build type issues shown in previous captures and full Project-Clean build was successful. Note my testing was on Windows only.

@Wenpei
Copy link
Contributor Author

Wenpei commented Mar 8, 2016

@gweidner @deroneriksson @niketanpansare
I update my project import.md file. Please let me know if any concern for this pr.

@niketanpansare
Copy link
Contributor

@gweidner @deroneriksson Do you have any more concerns for this PR ?

@Wenpei Can you please send an email with the tips on dev mailing list with a link to this PR ?

@deroneriksson
Copy link
Member

@niketanpansare @gweidner @mboehm7
If @gweidner has verified that the Scala does indeed integrate well with the Eclipse IDE (Scala can be debugged, updates to .scala files are automatically compiled to target/classes, Java developers who are not doing Scala can use the latest versions of Eclipse without issues), then I am very happy with this.

@gweidner
Copy link
Member

Yes this works with Luna (my preferred IDE). I'm preparing a documentation PR with specific Eclipse version information and tips for cases where developers prefer not to include Scala
in their IDE.

@Wenpei
Copy link
Contributor Author

Wenpei commented Mar 15, 2016

@niketanpansare I send a mail to dev mailing list with PR and tips link.

* under the License.
*/

package org.apache.sysml.api.ml.scala
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, this should just be in the org.apache.sysml.api.ml package, as it shouldn't matter to the end-user if the source was written in Scala.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cause a build error since we have both java and scala Logisiticregression in same package.
So i add scala version to seperated folder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, it might be better to move the Java version to a org.apache.sysml.api.ml.java folder (or possibly remove it), and move this file to org.apache.sysml.api.ml in order to encourage other Scala contributions to the API.

cc @niketanpansare Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to go ahead and merge this and then work out any additional updates in separate JIRAs/PRs since the conversation count is already 50+.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that would be fine as well. @Wenpei Can you go ahead and make a JIRA for this as well?

@dusenberrymw
Copy link
Contributor

@Wenpei Awesome work here! I left a couple of comments, the later of which would be particularly interesting to explore (and may even be an item for a subsequent JIRA instead). Aside from that, this should definitely be able to be merged.

@dusenberrymw
Copy link
Contributor

Also, just to track, I just pulled this into IntelliJ, and there are no issues encountered thus far.

@Wenpei
Copy link
Contributor Author

Wenpei commented Mar 16, 2016

Jenkins, test this please.

@Wenpei
Copy link
Contributor Author

Wenpei commented Mar 16, 2016

@dusenberrymw
Copy link
Contributor

Great, LGTM. I'll merge this today.

@dusenberrymw
Copy link
Contributor

I've opened SYSTEMML-580 to reference this PR, and I've set the Reporter and Assignee to you, @Wenpei.

dusenberrymw pushed a commit to dusenberrymw/systemml that referenced this pull request Mar 16, 2016
This adds a Scala version of the LogisiticRegression Spark ML pipeline API, as well as Scala build support for the project, effectively turning the project into a mixed Scala/Java project.

Closes apache#70.
@asfgit asfgit closed this in 7ce19c8 Mar 16, 2016
@dusenberrymw
Copy link
Contributor

PR merged. Thanks for the hard work, @Wenpei!

@dusenberrymw
Copy link
Contributor

Also, to anyone interested, there are several follow-up JIRA issues attached to SYSTEMML-580.

@Wenpei Wenpei deleted the pr_scala_logisticregression_pipeline branch April 22, 2016 07:59
asfgit pushed a commit that referenced this pull request Mar 27, 2020
1. Main DML script (mice_linearReg.dml)
2. Java test file (BuiltinMiceLinearRegTest.java)
3. DML test script (mice_linearRegression.dml)

Closes #70.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants