[SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files #16312
Test build #70256 has finished for PR 16312 at commit
|
This is quite a necessary change! The file has become very lengthy. Thanks!
retest this please. |
Test build #70261 has finished for PR 16312 at commit
|
I like how they are grouped. not sure why tests are failing though |
Test build #70302 has finished for PR 16312 at commit
|
Test build #70310 has finished for PR 16312 at commit
|
Hmm, from what I can see, I'm pretty sure it used to work with sparkR.session.stop() calls for enableHiveSupport = F sessions. Do we know why this is suddenly causing issues?
sure, I'll take a deeper look in a day or 2. |
I looked at this more closely and I think I found the problem. I'm not sure it's easy to fix, though.
I think the right fix here is to pass along a SparkContext into RWrappers and not rely on a prior initialization. However I'm not sure why that design decision was made before, so maybe I'm missing something. [1]
|
ah, thank you @shivaram. sorry I couldn't get around to investigating earlier. @yanboliang It looks like that is the design in the trait BaseReadWrite (here), where it holds references to |
@shivaram @felixcheung Your comments are really very helpful. I understand the problem and will try to figure out a way to fix it. Thanks very much.
84d2804 to d1f480d
Test build #70984 has finished for PR 16312 at commit
|
@felixcheung @shivaram I changed |
@@ -80,9 +81,9 @@ predict_internal <- function(object, newData) {
 #' model <- read.ml(path)
 #' }
 #' @note read.ml since 2.0.0
-read.ml <- function(path) {
+read.ml <- function(path, sparkSession = NULL) {
we should get this from the current default session instead of making it a parameter, like here
https://github.com/apache/spark/blob/master/R/pkg/R/SQLContext.R#L136
Sounds good, updated.
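The suggestion above (look the session up from the current default instead of adding a parameter) can be sketched in plain R without Spark. This is only an illustration of the pattern: `.mockEnv`, `getDefaultSession`, and `readModel` are hypothetical names standing in for SparkR's internals, not the actual implementation.

```r
# Hypothetical stand-in for a package-level environment that stores the
# default session. This is NOT the real SparkR environment.
.mockEnv <- new.env()

# Hypothetical helper: return the default session, failing clearly if
# none has been started.
getDefaultSession <- function() {
  if (!exists("session", envir = .mockEnv, inherits = FALSE)) {
    stop("No active session; start one first.")
  }
  get("session", envir = .mockEnv, inherits = FALSE)
}

# The loader keeps its one-argument signature and resolves the session
# itself, so callers never have to pass it in.
readModel <- function(path) {
  session <- getDefaultSession()
  list(path = path, sessionId = session$id)
}

# Calling the loader before any session exists yields a clear error message.
noSession <- tryCatch(readModel("/tmp/model"),
                      error = function(e) conditionMessage(e))

# Once a default session is registered, the same call succeeds.
assign("session", list(id = "session-1"), envir = .mockEnv)
model <- readModel("/tmp/model")
```

Keeping the public signature unchanged means existing callers of the loader are unaffected, which is presumably why the reviewer preferred it over a new `sparkSession` argument.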
Test build #71013 has finished for PR 16312 at commit
|
path <- suppressWarnings(normalizePath(path))
jobj <- callJStatic("org.apache.spark.ml.r.RWrappers", "load", path, sparkSession)
sparkSession <- getSparkSession()
callJStatic("org.apache.spark.ml.r.RWrappers", "session", sparkSession)
I'm a bit confused by this - what does this do?
ok got it. this behavior is a bit confusing...
do we need to set the current SparkSession to `session` like this in other places calling `RWrappers`?
Here we call the `session` method to set the Spark session. The function name is really confusing; `setSession` would be a better name, but we have used the current name from the very beginning.
No other places need to be updated, since this is the only place calling `RWrappers`. Thanks.
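To make the ordering above concrete, here is a toy model in plain R of a wrapper that caches a session reference and must be refreshed before each load. It assumes the failure mode discussed earlier in the thread, namely a wrapper relying on an earlier initialization after the session was stopped. All names (`wrappers`, `setSession`, `loadModel`, `readModel`, `newSession`) are hypothetical and do not mirror the real `RWrappers` code.

```r
# Hypothetical mock of a wrapper object that caches a session reference.
wrappers <- new.env()
assign("session", NULL, envir = wrappers)

# Sessions are environments so that stopping one is visible through any
# cached reference to it.
newSession <- function(id) {
  s <- new.env()
  s$id <- id
  s$stopped <- FALSE
  s
}

# Point the wrapper at the given session.
setSession <- function(session) {
  assign("session", session, envir = wrappers)
  invisible(NULL)
}

# Loading only works when the cached session exists and is still running.
loadModel <- function(path) {
  cached <- get("session", envir = wrappers)
  if (is.null(cached) || cached$stopped) {
    stop("cached session is not usable")
  }
  list(path = path, sessionId = cached$id)
}

# Refresh the wrapper's session first, then load.
readModel <- function(path, currentSession) {
  setSession(currentSession)
  loadModel(path)
}

# The first session is cached and later stopped, so a bare load fails.
first <- newSession("first")
setSession(first)
first$stopped <- TRUE
staleResult <- tryCatch(loadModel("/tmp/model"),
                        error = function(e) conditionMessage(e))

# Refreshing the session before loading makes the same load succeed.
second <- newSession("second")
model <- readModel("/tmp/model", second)
```

The design choice illustrated here is that the reader refreshes the session on every call, so it never depends on whatever session happened to be cached earlier.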
LGTM. thanks! |
Merged into master, thanks for reviewing. |
## What changes were proposed in this pull request?

SparkR ```mllib.R``` is getting bigger as we add more ML wrappers. I'd like to split it into multiple files to make it easier to maintain:

* mllib_classification.R
* mllib_clustering.R
* mllib_recommendation.R
* mllib_regression.R
* mllib_stat.R
* mllib_tree.R
* mllib_utils.R

Note: Only reorg, no actual code change.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#16312 from yanboliang/spark-18862.
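What changes were proposed in this pull request?

SparkR mllib.R is getting bigger as we add more ML wrappers. I'd like to split it into multiple files to make it easier to maintain:

* mllib_classification.R
* mllib_clustering.R
* mllib_recommendation.R
* mllib_regression.R
* mllib_stat.R
* mllib_tree.R
* mllib_utils.R

Note: Only reorg, no actual code change.

How was this patch tested?

Existing tests.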