[SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files #16312

yanboliang · 2016-12-16T16:30:13Z

What changes were proposed in this pull request?

SparkR mllib.R is getting bigger as we add more ML wrappers, I'd like to split it into multiple files to make us easy to maintain:

mllib_classification.R
mllib_clustering.R
mllib_recommendation.R
mllib_regression.R
mllib_stat.R
mllib_tree.R
mllib_utils.R

Note: Only reorg, no actual code change.

How was this patch tested?

Existing tests.

SparkQA · 2016-12-16T17:11:12Z

Test build #70256 has finished for PR 16312 at commit 319d1ed.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
#' threshold p is equivalent to setting thresholds c(1-p, p). In multiclass (or binary) classification to adjust the probability of

wangmiao1981 · 2016-12-16T18:23:04Z

This is quite necessary change! The file becomes very lengthy. Thanks!

wangmiao1981 · 2016-12-16T18:25:30Z

retest this please.

SparkQA · 2016-12-16T19:04:12Z

Test build #70261 has finished for PR 16312 at commit 319d1ed.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
#' threshold p is equivalent to setting thresholds c(1-p, p). In multiclass (or binary) classification to adjust the probability of

felixcheung · 2016-12-17T04:57:10Z

I like how they are grouped. not sure why tests are failing though

SparkQA · 2016-12-17T09:13:57Z

Test build #70302 has finished for PR 16312 at commit 2eccc8c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-17T15:52:09Z

Test build #70310 has finished for PR 16312 at commit 84d2804.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2016-12-17T22:42:57Z

Hmm, from what I can see pretty sure it used to work with sparkR.session.stop() calls for enableHiveSupport = F sessions. Do we know why this is suddenly causing issues?

yanboliang · 2016-12-18T14:20:58Z

I'm trying to figure out the cause of this issue but still no progress until now, could you, @shivaram or @mengxr give some hints or suggestion? Thanks.

felixcheung · 2016-12-18T20:10:06Z

sure, I'll take a deeper look in a day or 2.

shivaram · 2016-12-23T05:48:25Z

I looked at this more closely and I think I found the problem - Not sure its easy to fix though.
What I traced here is:

When we call sparkR.session.stop and sparkR.session the same JVM backend is reused and only the SparkContext is stopped / recreated
Now the problem happens when we call read.ml to read a model after creating a new SparkSession. This in turn calls into RWrappers[1] which has an sc member variable
My understanding is that the sc member variable is bound the first time we create a SparkSession and when we stop, restart it has a handle to the stale SparkContext
Thus we see errors where it says Cannot call methods on a stopped SparkContext

I think the right fix here is to pass along a SparkContext into RWrappers and not rely on a prior initialization. However I'm not sure why that design decision was made before, so maybe I'm missing something.

[1]

spark/mllib/src/main/scala/org/apache/spark/ml/r/RWrappers.scala

Line 36 in f252cb5

val rMetadataStr = sc.textFile(rMetadataPath, 1).first()

felixcheung · 2016-12-23T06:04:26Z

ah, thank you @shivaram. sorry I couldn't get around to investigate earlier.

@yanboliang It looks like that is the design in the trait BaseReadWrite (here), where it holds references to sc, sqlContext and spark session. Although, I see other MLReader/MLWriter calls sc directly whereas the design should allow for the sc/spark session to be updated? Specifically we could change to pass the spark session to the RWrapper for these calls but generally reusing sc is the design of BaseReadWrite/MLReader/MLWriter, and is not specific to R.

yanboliang · 2016-12-23T16:10:57Z

@shivaram @felixcheung You comments are really very helpful, I understand the problem and will try to figure out some ways to fix it. Thanks very much.

SparkQA · 2017-01-06T16:45:09Z

Test build #70984 has finished for PR 16312 at commit d1f480d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2017-01-06T16:55:39Z

@felixcheung @shivaram I changed read.ml to pass in sparkSession if necessary, then it will not use stale spark context which will cause errors. This is consistent with MLReader/MLWriter in Scala/Python which also supports user-specified spark session. Any more comments, please feel free to let me know. Thanks.

felixcheung · 2017-01-06T22:44:09Z

R/pkg/R/mllib_utils.R

@@ -80,9 +81,9 @@ predict_internal <- function(object, newData) {
 #' model <- read.ml(path)
 #' }
 #' @note read.ml since 2.0.0
-read.ml <- function(path) {
+read.ml <- function(path, sparkSession = NULL) {


we should get this from the current default session instead of making it a parameter, like here
https://github.com/apache/spark/blob/master/R/pkg/R/SQLContext.R#L136

Sounds good, updated.

SparkQA · 2017-01-07T14:22:45Z

Test build #71013 has finished for PR 16312 at commit c81db86.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2017-01-08T00:31:38Z

R/pkg/R/mllib_utils.R

  path <- suppressWarnings(normalizePath(path))
-  jobj <- callJStatic("org.apache.spark.ml.r.RWrappers", "load", path, sparkSession)
+  sparkSession <- getSparkSession()
+  callJStatic("org.apache.spark.ml.r.RWrappers", "session", sparkSession)


~~I'm a bit confused by this - what does this do?~~
ok got it. this behavior is a bit confusing...

do we need to set the current SparkSession to session like this in other places calling RWrappers?

Here we call session method to set spark session, the function name is really confusing, may be named as setSession should be better, but we use the current name from the very beginning.
There is no other places need to be updated, since here is the only place calling RWrappers. Thanks.

felixcheung · 2017-01-08T00:34:48Z

LGTM. thanks!

yanboliang · 2017-01-08T08:59:58Z

Merged into master, thanks for reviewing.

## What changes were proposed in this pull request? SparkR ```mllib.R``` is getting bigger as we add more ML wrappers, I'd like to split it into multiple files to make us easy to maintain: * mllib_classification.R * mllib_clustering.R * mllib_recommendation.R * mllib_regression.R * mllib_stat.R * mllib_tree.R * mllib_utils.R Note: Only reorg, no actual code change. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#16312 from yanboliang/spark-18862.

yanboliang added 4 commits January 6, 2017 06:20

Split SparkR mllib.R into multiple files

e6c16a8

Remove sparkR.session.stop() for each mllib test file.

35a7bd4

Reorg write.ml

3aaa853

Support to pass in sparkSession for read.ml.

d1f480d

yanboliang force-pushed the spark-18862 branch from 84d2804 to d1f480d Compare January 6, 2017 15:39

felixcheung reviewed Jan 6, 2017

View reviewed changes

Use getSparkSession to get default spark session.

c81db86

felixcheung reviewed Jan 8, 2017

View reviewed changes

asfgit closed this in 6b6b555 Jan 8, 2017

yanboliang deleted the spark-18862 branch January 8, 2017 09:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files #16312

[SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files #16312

yanboliang commented Dec 16, 2016

SparkQA commented Dec 16, 2016

wangmiao1981 commented Dec 16, 2016

wangmiao1981 commented Dec 16, 2016

SparkQA commented Dec 16, 2016

felixcheung commented Dec 17, 2016

SparkQA commented Dec 17, 2016

SparkQA commented Dec 17, 2016

felixcheung commented Dec 17, 2016

yanboliang commented Dec 18, 2016 •

edited

Loading

felixcheung commented Dec 18, 2016

shivaram commented Dec 23, 2016

felixcheung commented Dec 23, 2016

yanboliang commented Dec 23, 2016

SparkQA commented Jan 6, 2017

yanboliang commented Jan 6, 2017 •

edited

Loading

felixcheung Jan 6, 2017

yanboliang Jan 7, 2017

SparkQA commented Jan 7, 2017

felixcheung Jan 8, 2017 •

edited

Loading

yanboliang Jan 8, 2017

felixcheung commented Jan 8, 2017

yanboliang commented Jan 8, 2017

[SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files #16312

[SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files #16312

Conversation

yanboliang commented Dec 16, 2016

What changes were proposed in this pull request?

How was this patch tested?

SparkQA commented Dec 16, 2016

wangmiao1981 commented Dec 16, 2016

wangmiao1981 commented Dec 16, 2016

SparkQA commented Dec 16, 2016

felixcheung commented Dec 17, 2016

SparkQA commented Dec 17, 2016

SparkQA commented Dec 17, 2016

felixcheung commented Dec 17, 2016

yanboliang commented Dec 18, 2016 • edited Loading

felixcheung commented Dec 18, 2016

shivaram commented Dec 23, 2016

felixcheung commented Dec 23, 2016

yanboliang commented Dec 23, 2016

SparkQA commented Jan 6, 2017

yanboliang commented Jan 6, 2017 • edited Loading

felixcheung Jan 6, 2017

Choose a reason for hiding this comment

yanboliang Jan 7, 2017

Choose a reason for hiding this comment

SparkQA commented Jan 7, 2017

felixcheung Jan 8, 2017 • edited Loading

Choose a reason for hiding this comment

yanboliang Jan 8, 2017

Choose a reason for hiding this comment

felixcheung commented Jan 8, 2017

yanboliang commented Jan 8, 2017

yanboliang commented Dec 18, 2016 •

edited

Loading

yanboliang commented Jan 6, 2017 •

edited

Loading

felixcheung Jan 8, 2017 •

edited

Loading