
[SPARK-10017] [MLlib]: ML model broadcasts should be stored in private vars: mllib NaiveBayes #8241

Closed
wants to merge 7 commits

Conversation

sabhyankar

ML model broadcasts should be stored in private vars: mllib NaiveBayes

```diff
 testData.mapPartitions { iter =>
-  val model = bcModel.value
+  val model = bcModel.get.value
```
Contributor

I believe we will still want to make a local reference to bcModel here so we don't end up shipping the entire object across the wire.
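A minimal sketch of the pattern being suggested here. The names (`ExampleModel`, `localBc`) are illustrative, not the actual Spark source; the point is that reading a field such as `bcModel` inside a closure captures `this`, while a local `val` captures only the small broadcast handle:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical model class, standing in for NaiveBayesModel.
class ExampleModel(val weights: Array[Double]) extends Serializable {
  @transient private var bcModel: Option[Broadcast[ExampleModel]] = None

  def predict(testData: RDD[Array[Double]]): RDD[Double] = {
    if (bcModel.isEmpty) {
      bcModel = Some(testData.sparkContext.broadcast(this))
    }
    // Copy the broadcast handle into a local val first. Referring to the
    // field `bcModel` inside the closure below would capture `this`,
    // shipping the entire model with every task instead of the handle.
    val localBc = bcModel.get
    testData.mapPartitions { iter =>
      val model = localBc.value
      iter.map { features =>
        model.weights.zip(features).map { case (w, x) => w * x }.sum
      }
    }
  }
}
```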

@sabhyankar
Author

Thanks for pointing that out @holdenk ! I have pushed a change to the PR!

@holdenk
Contributor

holdenk commented Aug 18, 2015

Awesome, it also seemed to be in many of your related PRs, you might want to update those as well.

@sabhyankar
Author

@holdenk yep :) I am updating those as well!

@holdenk
Contributor

holdenk commented Aug 18, 2015

Great :)

@sabhyankar
Author

@holdenk Not sure if you are reviewing the other PRs, but the fix should now be in all of them. Thx!

@holdenk
Contributor

holdenk commented Aug 18, 2015

I can take a look at some of the others too. I'm not a committer, though, so this will still need a follow-up review, but I can take a first pass :)

```diff
@@ -19,6 +19,8 @@ package org.apache.spark.mllib.classification

 import java.lang.{Iterable => JIterable}

+import org.apache.spark.broadcast.Broadcast
```
Contributor

Import ordering: this should probably be moved down to sit with the other org.apache.spark imports.

@sabhyankar
Author

Hi @holdenk I have moved the common logic to a trait and am mixing it in with the model. Let me know if you see something that needs to be corrected. I am still getting familiar with Spark and its coding standards, so I may have missed something!

@holdenk
Contributor

holdenk commented Aug 19, 2015

Awesome, great work :) Looking at the trait, it seems like you could maybe even move the var inside the trait. Also, passing the RDD around just to get the context out of it seems a little funky; I'd probably pass the context instead.

@holdenk
Contributor

holdenk commented Aug 19, 2015

I know my own PRs get triggered by Jenkins; let's see if I can tell Jenkins this is good to test, or if we will need to get someone else.

@sabhyankar
Author

Thanks @holdenk ! That makes sense. I moved the private var to the trait. Let me know if you see something else that is out of place. I believe the generic types I am using are ok. However, let me know if there is a problem. Thanks again for these suggestions :)

@sabhyankar
Author

I forgot to mention, I am also only passing in the SparkContext now. I should have done that to begin with :(

@holdenk
Contributor

holdenk commented Aug 20, 2015

Glad to help. Maybe @jkbradley can say whether this is OK to test on Jenkins, as he was the author of the related JIRA :)

@jkbradley
Member

I'll make some comments, but can you please resolve conflicts here?

@jkbradley
Member

ok to test

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.SparkContext

import scala.reflect.ClassTag
```
Member

Organize imports (please follow the style guide: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)

@jkbradley
Member

I like this approach. I'll check again after the conflicts have been resolved. Thanks!


```scala
import scala.reflect.ClassTag

private[mllib] trait Broadcastable[T] {
```
Contributor

Since this is for broadcasting models, it shouldn't be mixed in with anything else. Also, the inner type of bcModel should always be the same as the class mixing in Broadcastable.

Thus, how about we rename Broadcastable -> BroadcastableModel, upper-bound the type T, and constrain which types can mix this trait in with a self-type annotation (i.e. altogether `trait BroadcastableModel[T <: Model] { self: T => }`)?
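A self-contained sketch of the shape being proposed. `Model` here is a local stand-in class, not Spark's, and `NB` is a made-up subclass; the point is only how the self-type restricts which classes may mix the trait in:

```scala
// Stand-in for an abstract model base class.
abstract class Model

trait BroadcastableModel[T <: Model] { self: T =>
  // The self-type guarantees the mixing class *is* a T, so a cached
  // Broadcast[T] would always hold the model's own concrete type.
}

// Compiles: NB is a Model and is its own T.
class NB extends Model with BroadcastableModel[NB]

// Would not compile: Wrong is not an NB, so it cannot claim
// BroadcastableModel[NB].
// class Wrong extends Model with BroadcastableModel[NB]
```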

@sabhyankar
Author

@feynmanliang @jkbradley thank you for the review, guys. Traits do not allow a context bound (to ClassTag) on their type parameters, so I ended up using implicits in the method. I am going to try an upper bound with a self-type annotation; however, I suspect we will still need an implicit in the method (I'll see if the compiler complains).

As for the other changes, I'll fix the imports/newline issues and resolve the merge conflicts. I will also close #8248 #8247 #8243 #8241 and #8249 and merge all of those into a single PR. However, since I am new to Spark, can you tell me which JIRA I should open that PR against? Perhaps the umbrella SPARK-10014?

@feynmanliang
Contributor

@sabhyankar why is the ClassTag necessary? What's the error you're getting?

You can tag multiple JIRA issues in a single PR, see #7507 for example

@sabhyankar
Author

@feynmanliang without the ClassTag, the compiler complains that no ClassTag is available for our type T:

```
No ClassTag available for T
[error] case None => bcModel = Some(sc.broadcast(modelToBc))
```

Another, unrelated question: there is no common trait for the models under mllib, so what would I upper-bound the type T with in the trait?
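A self-contained sketch of the workaround discussed above. Scala does not allow a context bound on a trait's type parameter (`trait Broadcastable[T: ClassTag]` is rejected), but a method inside the trait can demand the ClassTag implicitly at the call site. Here an `Array[T]` stands in for `Broadcast[T]` so the example runs without Spark, since `Array` construction also needs a ClassTag:

```scala
import scala.reflect.ClassTag

trait Broadcastable[T] {
  // Stand-in for Option[Broadcast[T]] in the real trait.
  protected var cached: Option[Array[T]] = None

  // The implicit ClassTag is requested per-method, since the trait's
  // type parameter cannot carry a context bound.
  protected def getOrCache(value: T)(implicit ct: ClassTag[T]): Array[T] = {
    cached match {
      case Some(arr) => arr
      case None =>
        val arr = Array(value) // Array construction requires ClassTag[T]
        cached = Some(arr)
        arr
    }
  }
}
```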

@jkbradley
Member

@sabhyankar Actually, I'd recommend just keeping 1 PR per JIRA. I think that's easier for avoiding conflicts, even if the changes are all pretty similar.

@jkbradley
Member

I think the ClassTag is needed because Broadcast needs the ClassTag.

@feynmanliang This trait is for spark.mllib too, not just spark.ml. (Model is only a concept in spark.ml)

@sabhyankar
Author

@jkbradley I have resolved the conflicts and updated the PR with the changes identified. Let me know if you see any issues. Since I am not combining PRs, this will have to go in before I update #8248 #8247 #8243 #8241 and #8249

@feynmanliang
Contributor

OK, I understand why the ClassTag is necessary, thanks.

@jkbradley @sabhyankar I have some minor concerns:

  1. I'm a little uncomfortable that we don't restrict the types this trait can be mixed in to; this might lead to people using it in places we don't expect. This isn't a problem for ML models, but since there's no upper bound for mllib models in general, we have to decide what's better:

    a. duplicate the broadcast code in each mllib model to prevent unintended use of this trait
    b. leave it to each developer's best judgement

    Personally I'm indifferent, leaning slightly towards (a).

  2. We should make the filename consistent with the attribute name

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.json4s.JsonDSL._
```
Contributor

Why do we need these imports now?

Author

@feynmanliang These json* imports were always there. They were simply reordered by the import order plugin I used for IntelliJ. The only additional import is for the Broadcastable trait.

Contributor

Oh, those need to go back to where they were before, see style guide

Author

Cool, I'll take care of that.

asfgit pushed a commit that referenced this pull request Aug 26, 2015
…ndom and mllib.stat

The same as #8241 but for `mllib.stat` and `mllib.random`.

cc feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8439 from mengxr/SPARK-10242.

(cherry picked from commit c3a5484)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@sabhyankar
Author

@jkbradley @feynmanliang I can certainly update the PR and change the filename and trait name to be the same (Broadcastable).

I understand the concern about the Broadcastable trait not being restricted. However, since we don't have a common trait for the ml and mllib models, I am not sure there is any option other than duplicating the code in each of the models. Let me know what you advise. For now, the PR is updated with the earlier recommendations and the merge conflicts have been resolved.

@feynmanliang
Contributor

@jkbradley and I talked offline. Summary: let's make it private[mllib] and clearly document in the trait's scaladoc that it exists to prevent rebroadcasting during prediction.
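One hypothetical shape for what is being described, not the exact code from the PR (the visibility and method name `getOrBroadcast` are assumptions): the narrowed visibility plus a scaladoc stating the trait's single purpose.

```scala
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

/**
 * Caches a broadcast copy of a model so that repeated RDD `predict`
 * calls do not rebroadcast the model each time. This trait exists only
 * to prevent rebroadcasting during prediction; it is not intended as a
 * general-purpose mix-in.
 */
private[mllib] trait Broadcastable[T] {
  @transient protected var bcModel: Option[Broadcast[T]] = None

  protected def getOrBroadcast(sc: SparkContext, model: T)
                              (implicit ct: ClassTag[T]): Broadcast[T] = {
    // Broadcast exactly once; reuse the cached handle afterwards.
    if (bcModel.isEmpty) bcModel = Some(sc.broadcast(model))
    bcModel.get
  }
}
```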

@sabhyankar
Author

@feynmanliang - That sounds good - I can document the intent of the Broadcastable trait in the scaladoc. The reason I made this trait private[spark] is that we want to use it in both mllib and ml models (e.g. #8243). If I make it private[mllib], we won't be able to reuse it in the ml models without duplicating it... Any suggestions on how to work around this?

@jkbradley
Member

I'm OK with it being private to spark (not mllib) as long as the docs warn about usage.

@sabhyankar
Author

Sounds good @jkbradley . I don't think there is anything pending on my side for this PR. Let me know if you see something otherwise. I will update the other PRs once this is in.

@feynmanliang
Contributor

LGTM pending tests

@jkbradley @mengxr can you trigger tests?

@mengxr
Contributor

mengxr commented Aug 27, 2015

ok to test

@SparkQA

SparkQA commented Aug 27, 2015

Test build #41697 has finished for PR 8241 at commit 965aaec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LogisticRegressionModel @Since("1.3.0") (
    • class SVMModel @Since("1.1.0") (
    • class BinaryClassificationMetrics @Since("1.3.0") (
    • class MulticlassMetrics @Since("1.1.0") (predictionAndLabels: RDD[(Double, Double)])
    • class MultilabelMetrics @Since("1.2.0") (predictionAndLabels: RDD[(Array[Double], Array[Double])])
    • class RegressionMetrics @Since("1.2.0") (
    • class FPGrowthModel[Item: ClassTag] @Since("1.3.0") (
    • class FreqItemset[Item] @Since("1.3.0") (
    • class FreqSequence[Item] @Since("1.5.0") (
    • class PrefixSpanModel[Item] @Since("1.5.0") (
    • class DenseMatrix @Since("1.3.0") (
    • class SparseMatrix @Since("1.3.0") (
    • class DenseVector @Since("1.0.0") (
    • class SparseVector @Since("1.0.0") (
    • class BlockMatrix @Since("1.3.0") (
    • class CoordinateMatrix @Since("1.0.0") (
    • class IndexedRowMatrix @Since("1.0.0") (
    • class RowMatrix @Since("1.0.0") (
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode
    • case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)
    • case class Except(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)

@mengxr
Contributor

mengxr commented Aug 28, 2015

Couple issues. Some were already mentioned by @feynmanliang :

  1. Broadcasting the entire model. We might need only some pieces of it; for linear models, we need only the weights, not the summary statistics.
  2. Using a trait. What if we have two objects to broadcast? For example, if we have a local matrix factorization model, are we going to make them a tuple and then broadcast it?
  3. Models are mutable, and we didn't put a contract in place saying users shouldn't modify them. After this PR, if users modify the model after an RDD prediction, local prediction will see the changes but RDD prediction will not. Have we discussed this yet?
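Issue 3 can be illustrated with a pure-Scala sketch (no Spark; the class names are made up, and a plain cached copy stands in for a `Broadcast`): once the one-time snapshot is taken, mutating the model makes local and "distributed" prediction disagree.

```scala
// A mutable model whose "broadcast" is a snapshot taken on first use.
class MutableModel(var threshold: Double) {
  // Stand-in for Option[Broadcast[MutableModel]].
  private var cachedCopy: Option[MutableModel] = None

  def predictLocal(x: Double): Boolean = x > threshold

  def predictDistributed(xs: Seq[Double]): Seq[Boolean] = {
    // Snapshot once; later calls reuse the stale copy, as executors
    // holding a broadcast would.
    if (cachedCopy.isEmpty) cachedCopy = Some(new MutableModel(threshold))
    val m = cachedCopy.get
    xs.map(x => x > m.threshold)
  }
}

object Demo extends App {
  val model = new MutableModel(0.5)
  model.predictDistributed(Seq(0.6)) // snapshots threshold = 0.5
  model.threshold = 0.9              // user mutates the model afterwards
  println(model.predictLocal(0.6))            // false: local sees 0.9
  println(model.predictDistributed(Seq(0.6))) // List(true): stale 0.5
}
```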

@sabhyankar
Author

Hi @mengxr The current approach of rebroadcasting on every predict also broadcasts the entire model, so I suppose the first issue you identified applies to the current approach as well. The trait currently doesn't restrict where it can be mixed in (beyond being private to spark) or the type of object it can broadcast, so we could broadcast multiple objects as a tuple if needed. I agree that, since models are mutable, avoiding a rebroadcast in this PR means changes to the model won't be seen by RDD prediction. However, given the JIRA requirement, I assumed that this was OK; let me know if that is not the case.

@mengxr
Contributor

mengxr commented Aug 28, 2015

@sabhyankar Yes, the first issue also applies to the current approach, but I was expecting to solve it with this effort (broadcast less, and only once). It is awkward to declare a class with `Broadcastable[(Double, Array[Double], Vector)]`; it doesn't provide much information but makes the code a little harder to read.

Issue 3 is actually the most important one because it changes the behavior. A safer approach would be making a new class for NaiveBayesModel with immutable (or by-contract immutable) data, which only needs broadcast once. This definitely requires more discussion before we make the change.

@jkbradley
Member

@sabhyankar Thanks for discussing these issues. Given the various arguments, I'd vote for:

  • eliminating the trait --- it's a nice idea, but it seems to save too little code to be worthwhile, since there is no easy one-trait-fits-all solution.
  • discussing mutability of models on the JIRA before proceeding [https://issues.apache.org/jira/browse/SPARK-10014]

@sabhyankar
Author

Closing per the discussion on the parent JIRA.

@sabhyankar sabhyankar closed this Sep 11, 2015