
[SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files #16312

Closed
wants to merge 5 commits into apache:master from yanboliang:spark-18862

Conversation

yanboliang
Contributor

What changes were proposed in this pull request?

SparkR's mllib.R keeps growing as we add more ML wrappers, so I'd like to split it into multiple files to make it easier to maintain:

  • mllib_classification.R
  • mllib_clustering.R
  • mllib_recommendation.R
  • mllib_regression.R
  • mllib_stat.R
  • mllib_tree.R
  • mllib_utils.R

Note: Only reorg, no actual code change.

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Dec 16, 2016

Test build #70256 has finished for PR 16312 at commit 319d1ed.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • #' threshold p is equivalent to setting thresholds c(1-p, p). In multiclass (or binary) classification to adjust the probability of

@wangmiao1981
Contributor

This is a much-needed change! The file has become very lengthy. Thanks!

@wangmiao1981
Contributor

retest this please.

@SparkQA

SparkQA commented Dec 16, 2016

Test build #70261 has finished for PR 16312 at commit 319d1ed.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • #' threshold p is equivalent to setting thresholds c(1-p, p). In multiclass (or binary) classification to adjust the probability of

@felixcheung
Member

I like how they are grouped. Not sure why the tests are failing, though.

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70302 has finished for PR 16312 at commit 2eccc8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2016

Test build #70310 has finished for PR 16312 at commit 84d2804.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

Hmm, from what I can see, I'm pretty sure this used to work with sparkR.session.stop() calls for enableHiveSupport = F sessions. Do we know why this is suddenly causing issues?

@yanboliang
Contributor Author

yanboliang commented Dec 18, 2016

I'm trying to figure out the cause of this issue but have made no progress so far. Could you, @shivaram or @mengxr, give some hints or suggestions? Thanks.

@felixcheung
Member

Sure, I'll take a deeper look in a day or two.

@shivaram
Contributor

I looked at this more closely and I think I found the problem, though I'm not sure it's easy to fix.
What I traced here is (a repro sketch follows below):

  • When we call sparkR.session.stop and sparkR.session, the same JVM backend is reused and only the SparkContext is stopped / recreated.
  • The problem happens when we call read.ml to read a model after creating a new SparkSession. This in turn calls into RWrappers [1], which has an sc member variable.
  • My understanding is that the sc member variable is bound the first time we create a SparkSession, so after we stop and restart, it holds a handle to the stale SparkContext.
  • Thus we see errors saying "Cannot call methods on a stopped SparkContext".

I think the right fix here is to pass a SparkContext into RWrappers rather than relying on a prior initialization. However, I'm not sure why that design decision was made before, so maybe I'm missing something.

[1]

val rMetadataStr = sc.textFile(rMetadataPath, 1).first()
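
To make the failure sequence concrete, a minimal SparkR repro might look like this (the model type and path are arbitrary examples, not taken from the PR):

```r
library(SparkR)

sparkR.session(enableHiveSupport = FALSE)

# Train and save any SparkR model; saving binds RWrappers' sc to this context
df <- suppressWarnings(createDataFrame(iris))
model <- spark.glm(df, Sepal_Width ~ Sepal_Length)
write.ml(model, "/tmp/glm-model")

# Stop and restart: the same JVM backend is reused, but the SparkContext
# is stopped and recreated
sparkR.session.stop()
sparkR.session(enableHiveSupport = FALSE)

# RWrappers still holds the old sc, so loading fails with
# "Cannot call methods on a stopped SparkContext"
model2 <- read.ml("/tmp/glm-model")
```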

@felixcheung
Member

Ah, thank you @shivaram. Sorry I couldn't get around to investigating this earlier.

@yanboliang It looks like that is the design of the trait BaseReadWrite (here), which holds references to the sc, sqlContext, and spark session. That said, I see other MLReader/MLWriter implementations call sc directly, whereas the design should allow the sc/spark session to be updated? Specifically, we could change these calls to pass the spark session to the RWrapper, but in general, reusing sc is the design of BaseReadWrite/MLReader/MLWriter and is not specific to R.

@yanboliang
Contributor Author

@shivaram @felixcheung Your comments are really helpful. I understand the problem now and will try to figure out a way to fix it. Thanks very much.

@SparkQA

SparkQA commented Jan 6, 2017

Test build #70984 has finished for PR 16312 at commit d1f480d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

yanboliang commented Jan 6, 2017

@felixcheung @shivaram I changed read.ml to pass in the sparkSession if necessary, so it will no longer use a stale spark context that would cause errors. This is consistent with MLReader/MLWriter in Scala/Python, which also support a user-specified spark session. If there are any more comments, please feel free to let me know. Thanks.

@@ -80,9 +81,9 @@ predict_internal <- function(object, newData) {
 #' model <- read.ml(path)
 #' }
 #' @note read.ml since 2.0.0
-read.ml <- function(path) {
+read.ml <- function(path, sparkSession = NULL) {
Member

we should get this from the current default session instead of making it a parameter, like here
https://github.com/apache/spark/blob/master/R/pkg/R/SQLContext.R#L136
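
For reference, the helper behind that linked line looks roughly like this (a paraphrased sketch of SparkR's internal default-session lookup; .sparkREnv is the package-internal environment, and the exact source may differ):

```r
# Paraphrased sketch of SparkR's default-session lookup (not the exact source)
getSparkSession <- function() {
  if (exists(".sparkRsession", envir = .sparkREnv)) {
    get(".sparkRsession", envir = .sparkREnv)
  } else {
    stop("SparkSession not initialized")
  }
}
```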

Contributor Author

Sounds good, updated.

@SparkQA

SparkQA commented Jan 7, 2017

Test build #71013 has finished for PR 16312 at commit c81db86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 path <- suppressWarnings(normalizePath(path))
-jobj <- callJStatic("org.apache.spark.ml.r.RWrappers", "load", path, sparkSession)
+sparkSession <- getSparkSession()
+callJStatic("org.apache.spark.ml.r.RWrappers", "session", sparkSession)
@felixcheung
Member

felixcheung commented Jan 8, 2017

I'm a bit confused by this. What does it do?
OK, got it. This behavior is a bit confusing...

Do we need to set the current SparkSession via session like this in the other places that call RWrappers?

Contributor Author

Here we call the session method to set the spark session. The function name is really confusing; naming it setSession would be better, but we have used the current name from the very beginning.
There are no other places that need updating, since this is the only place that calls RWrappers. Thanks.
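
Putting the pieces together, the updated read.ml presumably ends up shaped roughly like this (a simplified sketch using SparkR internals getSparkSession, callJStatic, and isInstanceOf; the dispatch branch shown is illustrative, not the full list of wrappers):

```r
read.ml <- function(path) {
  path <- suppressWarnings(normalizePath(path))
  # Rebind RWrappers to the current default session before loading, so a
  # stopped-and-restarted session never leaves it holding a stale SparkContext
  sparkSession <- getSparkSession()
  callJStatic("org.apache.spark.ml.r.RWrappers", "session", sparkSession)
  jobj <- callJStatic("org.apache.spark.ml.r.RWrappers", "load", path)
  # Dispatch on the JVM wrapper class; only one branch shown for brevity
  if (isInstanceOf(jobj, "org.apache.spark.ml.r.NaiveBayesWrapper")) {
    new("NaiveBayesModel", jobj = jobj)
  } else {
    stop("Unsupported model type.")
  }
}
```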

@felixcheung
Member

LGTM. thanks!

@yanboliang
Contributor Author

Merged into master, thanks for reviewing.

asfgit closed this in 6b6b555 on Jan 8, 2017
yanboliang deleted the spark-18862 branch on January 8, 2017 09:14
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Jan 9, 2017

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017