[SPARK-10017] [MLlib]: ML model broadcasts should be stored in private vars: mllib NaiveBayes #8241
Conversation
```diff
 testData.mapPartitions { iter =>
-  val model = bcModel.value
+  val model = bcModel.get.value
```
I believe we will still want to make a local reference to bcModel here so we don't end up shipping the entire object across the wire.
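A minimal sketch of the local-reference pattern described here, reusing the names from the diff above (`testData`, `bcModel`, and `predict` are assumed from the PR's context; this is illustrative, not the PR's actual code):

```scala
// Without the local val, the closure would capture `this` (the whole
// enclosing model object) just to read the bcModel field, and the entire
// object would be serialized and shipped with every task. Pulling the
// Broadcast handle into a local val means only the small handle is captured.
val localBcModel = bcModel.get           // local reference to the Broadcast handle
testData.mapPartitions { iter =>
  val model = localBcModel.value         // fetch the broadcast value once per partition
  iter.map(model.predict)
}
```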
Thanks for pointing that out @holdenk ! I have pushed a change to the PR!
Awesome, it also seemed to be in many of your related PRs; you might want to update those as well.
@holdenk yep :) I am updating those as well!
Great :)
@holdenk Not sure if you are reviewing the other PRs, but the fix should now be in all of them. Thx!
I can take a look at some of the others too. I'm not a committer though, so they will still need a follow-up review, but I can take a first pass :)
```diff
@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

 import java.lang.{Iterable => JIterable}

+import org.apache.spark.broadcast.Broadcast
```
On import ordering: this should probably be moved down to be with the other org.apache.spark imports.
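For context, the Spark style guide groups imports into blocks separated by blank lines: java/javax first, then scala, then third-party libraries, then org.apache.spark. Applied to the imports in this diff, that would look roughly like:

```scala
// Import grouping per the Spark style guide (illustrative ordering only):
// java/javax, then scala, then third-party, then org.apache.spark.
import java.lang.{Iterable => JIterable}

import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
```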
Hi @holdenk I have moved the common logic to a trait and am mixing it in with the model. Let me know if you see something that needs to be corrected. I am still getting familiar with Spark and the coding standards, so I may have missed something!
Awesome, great work :) Looking at the trait, it seems like you could maybe even move the var inside of the trait. Also, passing the RDD around to get the context out of it seems a little funky; I'd probably just pass the context instead.
I know my own PRs get triggered by Jenkins; let's see if I can tell Jenkins this is good to test, or if we will need to get someone else.
Thanks @holdenk ! That makes sense. I moved the private var to the trait. Let me know if you see something else that is out of place. I believe the generic types I am using are OK; however, let me know if there is a problem. Thanks again for these suggestions :)
I forgot to mention, I am also only passing in the SparkContext now. I should have done that to begin with :(
Glad to help. Maybe @jkbradley can say if this is OK to test on Jenkins, as he was the author of the related JIRA :)
@sabhyankar looks good overall:
I'll make some comments, but can you please resolve conflicts here? |
ok to test |
```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.SparkContext

import scala.reflect.ClassTag
```
Organize imports (please follow the style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)
I like this approach. I'll check again after the conflicts have been resolved. Thanks!
```scala
import scala.reflect.ClassTag

private[mllib] trait Broadcastable[T] {
```
Since this is for broadcasting models, it shouldn't be mixed in with anything else. Also, the inner type of `bcModel` should always be the same as the class mixing in `Broadcastable`. Thus, how about we rename `Broadcastable` -> `BroadcastableModel`, upper bound the type of `T`, and constrain what types can mix this trait in with a self type annotation (i.e. altogether `trait BroadcastableModel[T <: Model] { self: T => }`)
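A hedged sketch of the self-type pattern proposed here. Note that `Model` below is a hypothetical stand-in (as noted later in the thread, spark.mllib has no common model supertype; `Model` only exists in spark.ml), and the cached field is simplified to a plain value rather than a `Broadcast`:

```scala
// Stand-in base type for illustration only -- spark.mllib has no
// common model trait, so `Model` here is hypothetical.
trait Model

// The self type `self: T =>` means this trait can only be mixed into
// the class named by T, so the cached value's inner type always matches
// the model class. In the real PR the field would be an Option[Broadcast[T]].
trait BroadcastableModel[T <: Model] { self: T =>
  @transient private var cached: Option[T] = None

  def selfRef: T = {
    if (cached.isEmpty) cached = Some(self)
    cached.get
  }
}

case class DemoModel(weight: Double) extends Model with BroadcastableModel[DemoModel]
```

The self-type annotation makes `DemoModel extends BroadcastableModel[SomeOtherModel]` a compile error, which is exactly the constraint being asked for.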
@feynmanliang @jkbradley thank you for the review, guys. Traits will not allow a context bound (to ClassTag) for type parameters, so I ended up using implicits in the method. I am going to try doing an upper bound with a self type annotation; however, I feel we are still going to need an implicit in the method (I'll see if the compiler complains). As for the other changes, I'll fix the imports/newline issues and fix the merge conflicts. I will also close #8248 #8247 #8243 #8241 and #8249 and merge all those into a single PR. However, since I am new to Spark, can you tell me what JIRA I should open that PR against? Perhaps the umbrella SPARK-10014?
@sabhyankar why is the ClassTag necessary? What's the error you're getting? You can tag multiple JIRA issues in a single PR, see #7507 for example.
@feynmanliang without the ClassTag, the compiler complains that no ClassTag is available for our type T: `No ClassTag available for T`. Another unrelated question: there is no common trait for the models under mllib, so what would I upper bound the type T with in the trait?
@sabhyankar Actually, I'd recommend just keeping 1 PR per JIRA. I think that's easier for avoiding conflicts, even if the changes are all pretty similar.
I think the ClassTag is needed because Broadcast needs the ClassTag. @feynmanliang This trait is for spark.mllib too, not just spark.ml. (Model is only a concept in spark.ml)
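A minimal illustration (plain Scala, not Spark code) of the ClassTag point being discussed: `SparkContext.broadcast[T]` takes an implicit `ClassTag[T]`, so any generic method that forwards to it must also demand one, either as a context bound (`def f[T: ClassTag]`) or as an explicit implicit parameter. Since a trait's type parameter itself cannot carry a context bound, the implicit ends up on the method:

```scala
import scala.reflect.ClassTag

// Any generic method that needs a ClassTag[T] downstream (as
// SparkContext.broadcast does) must take one implicitly; without it,
// the compiler reports "No ClassTag available for T".
def runtimeName[T](value: T)(implicit ct: ClassTag[T]): String =
  ct.runtimeClass.getSimpleName
```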
Force-pushed `…r creating local reference for broadcast` from 43dec14 to ab175b4
OK, I understand why the ClassTag is necessary, thanks. @jkbradley @sabhyankar I have some minor concerns:
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.json4s.JsonDSL._
```
Why do we need these imports now?
@feynmanliang These json* imports were always there. They were simply reordered by the import order plugin I used for IntelliJ. The only additional import is for the Broadcastable trait.
Oh, those need to go back to where they were before, see style guide
Cool, I'll take care of that.
…ndom and mllib.stat The same as #8241 but for `mllib.stat` and `mllib.random`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8439 from mengxr/SPARK-10242. (cherry picked from commit c3a5484) Signed-off-by: Xiangrui Meng <meng@databricks.com>
@jkbradley @feynmanliang I can certainly update the PR and change the filename and trait name to be the same (Broadcastable). I understand the concern regarding the Broadcastable trait not being restricted. However, since we don't have a common trait for the ml and mllib models, I am not sure there is any other option except duplicating the code in each of the models. Let me know what you advise. For now, the PR is updated with the earlier recommendations and the merge conflicts have been resolved.
@jkbradley and I talked offline; summary: let's make it
@feynmanliang - That sounds good - I can document the intent of the Broadcastable trait in the scaladoc. The reason I made this trait private[spark] is because we want to use it in both mllib and ml models (e.g. #8243). If I make this private[mllib], we won't be able to reuse it in the ml models unless we duplicate it... Any suggestions on how to work around this?
I'm OK with it being private to spark (not mllib) as long as the docs warn about usage.
Sounds good @jkbradley . I don't think there is anything pending on my side for this PR. Let me know if you see something otherwise. I will update the other PRs once this is in.
LGTM pending tests. @jkbradley @mengxr can you trigger tests?
ok to test |
Test build #41697 has finished for PR 8241 at commit
Couple issues. Some were already mentioned by @feynmanliang :
Hi @mengxr The current approach of rebroadcasting on every predict broadcasts the entire model, so I suppose the first issue you identified also applies to the current approach. Currently the trait doesn't have a limitation on where it can be mixed in (except being private to spark) or on the type of object it can broadcast, so I suppose we can broadcast multiple objects as a tuple if needed. I agree that, given that models are mutable, preventing a rebroadcast via this PR will prevent RDD changes from being seen. However, given the JIRA requirements I assumed that this was OK. Let me know if that is not the case.
@sabhyankar Yes, the first issue also applies to the current approach. But I was expecting to solve it with this effort (broadcast less and only once). It is awkward to declare the class with

Issue 3 is actually the most important one because it changes the behavior. A safer approach would be making a new class for
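A hedged sketch (not the PR's final code; the class and method names here are hypothetical) of the "broadcast less and only once" idea under discussion, including the behavioral caveat raised above: once the model is broadcast on the first call, executors keep seeing that snapshot even if the driver-side model is mutated later.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Illustrative model class: broadcast once, lazily, on the first predict().
class NaiveBayesLikeModel(val weights: Array[Double]) extends Serializable {
  // @transient so the cached handle itself is not serialized with the model.
  @transient private var bcModel: Option[Broadcast[NaiveBayesLikeModel]] = None

  def predict(data: RDD[Vector]): RDD[Double] = {
    if (bcModel.isEmpty) {
      bcModel = Some(data.context.broadcast(this))  // first call only
    }
    val localBc = bcModel.get  // local val so the closure captures only the handle
    data.mapPartitions { iter =>
      val model = localBc.value
      iter.map(model.predictPoint)
    }
  }

  // Placeholder scoring function for illustration.
  private def predictPoint(v: Vector): Double = weights.sum
}
```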
@sabhyankar Thanks for discussing these issues. Given the various arguments, I'd vote for:
Closing per discussion on the parent JIRA.